# Prototyping the new json schema


As part of the consolidation of the evidence objects in the backend, we are re-modeling the [json schema](https://github.com/opentargets/json_schema) to reflect the new simplified/flattened design.

**Link to ticket: [#1249](https://github.com/opentargets/platform/issues/1249)**

Based on the meeting we had on 2020.11.11 the following conclusions were reached:

* We need to maintain a json schema that guides our data providers and can be used as a template to generate evidence strings.
* The schema will reflect the concepts of the new platform design, so the units of the schema will be data source centric rather than data type centric.
* Each of the valuable columns will be defined in a common section.
* For each data source there will only be a list of required fields.
* We haven't reached a consensus on how the unique association fields are defined, or at which point of the evidence generation they are applied. For the first iteration of the json schema, `unique_association_fields` will therefore be omitted.
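
The layout agreed above can be illustrated with a hand-written miniature schema. This is only a sketch: the source name `chembl` and the field names appear elsewhere in this notebook, but the helper `source_branch` is hypothetical and the constraints on `clinicalPhase` are assumptions made for illustration, not values taken from the review document.

```python
import json

# Common section: every field is defined exactly once.
definitions = {
    "targetFromSourceId": {"type": "string"},
    "diseaseId": {"type": "string"},
    # The 0-4 range is an illustrative assumption.
    "clinicalPhase": {"type": "integer", "minimum": 0, "maximum": 4},
}


def source_branch(source, fields, required):
    """Build one data source branch that only references the common definitions."""
    properties = {"datasourceId": {"const": source}}
    for field in fields:
        properties[field] = {"$ref": f"#/definitions/{field}"}
    return {
        "properties": properties,
        "required": ["datasourceId"] + required,
        "additionalProperties": False,
    }


mini_schema = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "type": "object",
    "oneOf": [
        source_branch(
            "chembl",
            fields=["targetFromSourceId", "diseaseId", "clinicalPhase"],
            required=["targetFromSourceId", "diseaseId"],
        )
    ],
    "definitions": definitions,
}

print(json.dumps(mini_schema, indent=2))
```

Each data source gets its own branch under `oneOf`, and a branch carries no field details of its own beyond the `datasourceId` constant and the list of required fields.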

The schema is written based on the most recent iteration of the [evidence schema review](https://docs.google.com/spreadsheets/d/11jdPCo_vxY3jaP54xKTsXBshR5HMrpUf5oXJNgtbKm8/edit#gid=1735847104) document.

The technical approach:

* To avoid manual work with the json document, I'm collating the information in an Excel file and will use that as the source for the definitions.
* The same Excel file will also provide, for each field, the names of the data sources from which we expect that field.

## The first run completed:

- [x] processing the review document to get the rough list of fields
- [x] get fields2datasource mapping
- [x] generate json schema based on the meeting


## The first run didn't cover:

- [ ] some fields are missing, e.g. uniprot id
- [ ] some fields should not be here: `score` and `id`
- [ ] the precise requirements of the fields are still sparse -> add more data to `field_description.xlsx`
- [ ] no structure whatsoever.

## 1. Getting the list of data sources for every field

This information is extracted from the evidence schema review file. The result is a comma separated list of data sources for every field. This column is used later to generate the list of fields for every data source.
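
As a rough illustration of how that column is used later, the sketch below inverts a field to sources mapping into a source to fields mapping. The four rows are copied from the `fields_sources.tsv` preview in this notebook; the variable names are hypothetical and pandas is left out to keep the example self-contained.

```python
from collections import defaultdict

# Field -> comma separated data sources, as stored in the `sources` column.
field_to_sources = {
    "biologicalModelAllelicComposition": "phenodigm",
    "biologicalModelGeneticBackground": "phenodigm",
    "clinicalPhase": "chembl",
    "clinicalSignificances": "eva,eva_somatic",
}

# Invert the mapping: data source -> list of fields it is expected to provide.
source_to_fields = defaultdict(list)
for field, sources in field_to_sources.items():
    for source in sources.split(","):
        source_to_fields[source.strip()].append(field)

for source in sorted(source_to_fields):
    print(source, source_to_fields[source])
```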

In [57]:
# 1. Get the source names for every field:
import pandas as pd
import json
from collections import OrderedDict, defaultdict
import numpy as np
import requests

field_mapping_df = pd.read_csv('fields_sources.tsv', sep='\t')
field_mapping_df.head()

Unnamed: 0,column,sources
0,allelicRequirements,"genomics_england,clingen,eva,gene2phenotype,ev..."
1,biologicalModelAllelicComposition,phenodigm
2,biologicalModelGeneticBackground,phenodigm
3,clinicalPhase,chembl
4,clinicalSignificances,"eva,eva_somatic"


## Reading field description

1. Read excel file with the field descriptions: from [google sheets](https://docs.google.com/spreadsheets/d/1vHoyIsQDBNmUfq2IUdZDoz3G457V5cqW/edit#gid=613969206)
2. Parse values. 
3. Start building json object.


In [89]:
def get_dataframe():
    '''
    This function fetches the field definitions from Google Sheets
    '''
    field_description = 'field_descriptions.tsv'

    # The file from now on is stored on google sheets:
    url = 'https://docs.google.com/spreadsheets/d/1vHoyIsQDBNmUfq2IUdZDoz3G457V5cqW/export?format=tsv&id=1vHoyIsQDBNmUfq2IUdZDoz3G457V5cqW&gid=613969206'
    r = requests.get(url)
    r.raise_for_status()  # Fail early if the sheet could not be downloaded

    with open(field_description, 'w') as f:
        f.write(r.text)

    fields_df = pd.read_csv(field_description, sep='\t')
    return fields_df


def parse_data_sources(df):
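    '''
    Parse the dataframe to get the list of fields for each data source.

    input: dataframe with `field_name` and comma separated `data_source` columns
    output: list of per-source schemas for the `oneOf` section
    '''

    parsed_sources = defaultdict(list)
    for i, row in df.iterrows():
        # Fields without a source mapping are NaN after the left merge:
        if not isinstance(row['data_source'], str):
            continue
        for source in row['data_source'].split(','):
            parsed_sources[source.strip()].append(row['field_name'])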

    # Each data source then exploded into schemas:
    source_schemas = []
    sorted_items = sorted(parsed_sources.items())
    for source, fields in sorted_items:
        source_schema = OrderedDict()

        # Adding property definitions:
        source_schema['properties'] = OrderedDict({'datasourceId': {"const": source}})
        
        for field in fields:
            if field == 'datasourceId':
                continue
                
            source_schema['properties'][field] = {"$ref": f"#/definitions/{field}"}
        
        source_schema['required'] = ['datasourceId', 'targetFromSourceId', 'diseaseId']
        source_schema["additionalProperties"] = False

        # Adding source schema:
        source_schemas.append(source_schema)
        
    return source_schemas


def add_definition(row, fields_df):
    '''
    This is the main function to add definitions to the schema
    '''
    
    # If the feature is simple:
    if row['type'] in ['string', 'integer', 'number']:
        return add_simple_definition(row)
    
    # If the feature is complex:
    elif row['type'] == 'array':
        # Report which fields are handled as arrays:
        print(row["field_name"])
        return add_array(row, fields_df)
    
    
def add_array(row, df):
    '''
    If the definition is an array, things have to be treated separately
    '''
    field_annotation = OrderedDict({'type':'array'})
    field_name = row['field_name']
    
    # If there's a description:
    if isinstance(row['description'], str):
        field_annotation['description'] = row['description']

    # Items are objects: 
    if len(df.loc[df.location == field_name]) > 0:
        field_annotation['items'] = OrderedDict({
            'type': "object",
            'properties': OrderedDict()
        })
        
        sub_df = df.loc[df.location == field_name]
        for index, sub_row in sub_df.iterrows():
            field_annotation['items']['properties'][sub_row['field_name']] = add_definition(sub_row, sub_df)

        
    # Items are string:
    else:
        row['type'] = 'string'
        row['description'] = None

        field_annotation['items'] = add_simple_definition(row)
        
    field_annotation["uniqueItems"] = True
    
    return field_annotation
    
    
def add_simple_definition(row):
    '''
    Build the definition of a simple field (string, integer or number).
    '''

    field_annotation = OrderedDict()
    
    # Setting type - maybe nullable:
    field_annotation['type'] = row['type']
    
    # Adding description:
    if isinstance(row['description'], str):
        field_annotation['description'] = row['description']

    # Adding minimum:
    if not np.isnan(row['minimum']):     
        if row['type'] == 'integer':
            field_annotation['minimum'] = int(row['minimum'])
        else:
            field_annotation['minimum'] = float(row['minimum'])
            
                
    # Adding maximum:
    if not np.isnan(row['maximum']):
        if row['type'] == 'integer':
            field_annotation['maximum'] = int(row['maximum'])
        else:
            field_annotation['maximum'] = float(row['maximum'])
            
        
    # Is it exclusive minimum:
    if not np.isnan(row['exclusiveMinimum']):
        if row['type'] == 'integer':
            field_annotation['exclusiveMinimum'] = int(row['exclusiveMinimum'])
        else:
            field_annotation['exclusiveMinimum'] = float(row['exclusiveMinimum'])
            
        
    # Adding pattern:
    if isinstance(row['pattern'], str):
        field_annotation['pattern'] = row['pattern']

    # Adding examples:
    if isinstance(row['example'], str):
        field_annotation['examples'] = row['example'].split('|')
        
    # Is there a list of accepted values (might be a list of floats!!):
    if isinstance(row['accepted_values'], str):
        enum_values = row['accepted_values'].split('|')
        try:
            field_annotation['enum'] = [float(x) for x in enum_values]
        except ValueError:
            field_annotation['enum'] = enum_values
        
    return field_annotation
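
The nesting convention used by `add_array` can be shown without pandas. In this sketch the rows are plain dictionaries and `describe_array` is a hypothetical, stripped-down stand-in for the function above. Only `textMiningSentences` and `allelicRequirements` are field names from this notebook; the sub-fields `text` and `section`, and all the types, are assumptions chosen for illustration. A row whose `location` is `root` is a top-level field, and a row whose `location` names another field becomes a property of that array's items.

```python
# Hypothetical rows standing in for the field description sheet.
rows = [
    {"field_name": "textMiningSentences", "type": "array", "location": "root"},
    {"field_name": "text", "type": "string", "location": "textMiningSentences"},
    {"field_name": "section", "type": "string", "location": "textMiningSentences"},
    {"field_name": "allelicRequirements", "type": "array", "location": "root"},
]


def describe_array(row, all_rows):
    """Return an array definition; items are objects if any row points at this field."""
    children = [r for r in all_rows if r["location"] == row["field_name"]]
    if children:
        items = {
            "type": "object",
            "properties": {c["field_name"]: {"type": c["type"]} for c in children},
        }
    else:
        # No sub-fields described: fall back to an array of strings.
        items = {"type": "string"}
    return {"type": "array", "items": items, "uniqueItems": True}


# Only root-level rows become entries in the definitions section.
definitions = {
    r["field_name"]: describe_array(r, rows)
    for r in rows
    if r["location"] == "root"
}

print(definitions["textMiningSentences"])
print(definitions["allelicRequirements"])
```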



In [90]:
# Reloading possible modifications from the xlsx file:
fields_df = get_dataframe()
field_mapping_df = pd.read_csv('fields_sources.tsv', sep='\t')

##
## initialize json schema:
##
schema_obj = OrderedDict({
    "$schema": "http://json-schema.org/draft-07/schema#",
    "title": "OpenTargets",
    "description": "OpenTargets evidence objects",
    "type": "object",
    "oneOf": [],  # Filled below with one schema per data source
    "definitions": OrderedDict()
})

##
## Generating field definitions:
##
    
# Joining the field descriptions with the field -> data source mapping:
merged = fields_df.merge(field_mapping_df, left_on='field_name', right_on='column', how='left')
merged.rename(columns = {'sources': 'data_source'}, inplace=True)

# Adding one schema for every data source:
schema_obj["oneOf"] = parse_data_sources(merged.loc[merged.location == 'root'])


# Looping through all fields that are in the root of the document:
for i, row in merged.loc[merged.location == 'root'].iterrows():
    schema_obj['definitions'][row['field_name']] = add_definition(row, merged)

# Saving object into json file:
with open('/Users/dsuveges/repositories/json_schema/opentargets.json', 'w') as f:
    json.dump(schema_obj, f, indent=2)
    
    

allelicRequirements
clinicalSignificances
clinicalUrls
cohortPhenotypes
diseaseCellLines
diseaseModelAssociatedHumanPhenotypes
diseaseModelAssociatedModelPhenotypes
literature
mutatedSamples
significantDriverMethods
textMiningSentences
variantAminoacidDescriptions
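
Once the schema is written, a candidate evidence string can be sanity-checked against its `oneOf` branches. The sketch below is not a JSON Schema validator: it only mimics three keywords that the generator above emits (`const`, `required`, `additionalProperties`), and the function `check_evidence`, the branch contents, and the identifier values are made up for illustration. In practice a full validator, for example the `jsonschema` package, would be the safer choice.

```python
def check_evidence(evidence, branches):
    """Return a list of problems; an empty list means the evidence fits its branch."""
    source = evidence.get("datasourceId")
    matching = [
        b for b in branches
        if b["properties"]["datasourceId"]["const"] == source
    ]
    if not matching:
        return [f"unknown datasourceId: {source!r}"]

    branch = matching[0]
    problems = [
        f"missing required field: {field}"
        for field in branch["required"]
        if field not in evidence
    ]
    if branch.get("additionalProperties") is False:
        problems += [
            f"unexpected field: {field}"
            for field in evidence
            if field not in branch["properties"]
        ]
    return problems


# A hand-written branch in the shape produced by parse_data_sources.
branches = [{
    "properties": {
        "datasourceId": {"const": "chembl"},
        "targetFromSourceId": {"$ref": "#/definitions/targetFromSourceId"},
        "diseaseId": {"$ref": "#/definitions/diseaseId"},
        "clinicalPhase": {"$ref": "#/definitions/clinicalPhase"},
    },
    "required": ["datasourceId", "targetFromSourceId", "diseaseId"],
    "additionalProperties": False,
}]

# Placeholder identifiers, for illustration only.
good = {"datasourceId": "chembl", "targetFromSourceId": "T1", "diseaseId": "D1", "clinicalPhase": 4}
bad = {"datasourceId": "chembl", "diseaseId": "D1", "score": 0.5}

print(check_evidence(good, branches))
print(check_evidence(bad, branches))
```

Because every branch sets `additionalProperties` to `False`, a stray field such as `score` is reported even though it is not required anywhere.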


In [87]:
from scraper_api import ScraperAPIClient
client = ScraperAPIClient('53403c7ec1dce6a87740bae27d78e048')
result = client.get(url = 'http://httpbin.org/ip').text
print(result);

You've hit the request limit for your current plan. Please upgrade to continue using Scraper API, or contact support@scraperapi.com.


In [78]:
53403c7ec1dce6a87740bae27d78e048

In [79]:
fake.user_agent()

'Mozilla/5.0 (iPhone; CPU iPhone OS 7_1_2 like Mac OS X) AppleWebKit/536.0 (KHTML, like Gecko) FxiOS/18.9t2558.0 Mobile/57Y416 Safari/536.0'