# Prototyping the new json schema


As part of the consolidation of the evidence objects in the backend, we are re-modeling the [json schema](https://github.com/opentargets/json_schema) to reflect the new simplified/flattened design.

Based on the meeting we had on 2020.11.11, the following conclusions were reached:

* We need to maintain a JSON schema that guides our data providers and can be used as a template to generate evidence strings.
* The schema will reflect the concepts of the new platform design, so its units will be organised by data source instead of by data type.
* Each of the valuable columns will be defined in a common section.
* For each data source there will only be a list of required fields.
* We haven't reached a consensus on how, and at which point of the evidence generation, the unique association fields are defined. So for the first iteration of the JSON schema, `unique_association_fields` will be omitted.
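
The layout agreed above can be pictured with a small hand-written example. This is a minimal sketch, not the generated schema: the two field definitions, their constraints, the placeholder identifier, and the helper `missing_required_fields` are all assumptions made for illustration. The helper only inspects the `required` list of the matching branch; it is not a full JSON Schema validator.

```python
import json

# Toy schema: shared field definitions plus one branch per data source.
# The constraints below are illustrative assumptions, not the agreed ones.
toy_schema = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "type": "object",
    "definitions": {
        "diseaseId": {"type": "string"},
        "clinicalPhase": {"type": "integer", "minimum": 0},
    },
    "oneOf": [
        {
            "properties": {
                "datasourceId": {"const": "chembl"},
                "diseaseId": {"$ref": "#/definitions/diseaseId"},
                "clinicalPhase": {"$ref": "#/definitions/clinicalPhase"},
            },
            "required": ["datasourceId", "diseaseId", "clinicalPhase"],
        },
    ],
}


def missing_required_fields(evidence, schema):
    """Return the required fields absent from `evidence` (hypothetical helper).

    Only the branch whose `datasourceId` constant matches is inspected.
    """
    source = evidence.get("datasourceId")
    for branch in schema["oneOf"]:
        if branch["properties"]["datasourceId"]["const"] == source:
            return [field for field in branch["required"] if field not in evidence]
    raise ValueError(f"Unknown data source: {source!r}")


# Placeholder evidence string; the disease identifier is made up.
evidence = {"datasourceId": "chembl", "diseaseId": "EFO_0000000"}
print(missing_required_fields(evidence, toy_schema))  # ['clinicalPhase']
print(json.dumps(toy_schema["oneOf"][0]["required"]))
```

Keeping the constraints in `definitions` and only a `required` list per source means a change to a field is made in one place and picked up by every source that references it.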

The schema is written based on the most recent iteration of the [evidence schema review](https://docs.google.com/spreadsheets/d/11jdPCo_vxY3jaP54xKTsXBshR5HMrpUf5oXJNgtbKm8/edit#gid=1735847104) document.

The technical approach:

* To avoid manual work with the JSON document, I'm collating information in an Excel file and will use that as the source for the definitions.
* The same Excel file will be used to look up the data sources in which each field is expected.

## The first run completed:

- [x] processing the review document to get the rough list of fields
- [x] get fields2datasource mapping
- [x] generate json schema based on the meeting


## The first run didn't cover:

- [ ] some fields are missing, e.g. UniProt ID
- [ ] some fields should not be here: `score` and `id`
- [ ] the precise requirements of the fields are still sparse -> add more data to `field_descriptions.xlsx`
- [ ] no structure whatsoever.

## 1. Getting the list of data sources for every field

This information is extracted from the evidence schema review file. The end result of the process is a comma-separated list of data sources for every field. This column is used later to generate the list of mandatory fields for every data source.

In [85]:
# 1. Get the source names for every field:
import pandas as pd
import json
from collections import OrderedDict, defaultdict
import numpy as np

notnull_table = pd.read_csv('iter9_notnull_table.tsv', sep='\t')
notnull_table.index = notnull_table.key.tolist()
notnull_table.drop('key', axis=1, inplace=True)
notnull_table.head()

Unnamed: 0,cancer_gene_census,chembl,clingen,crispr,europepmc,eva,eva_somatic,expression_atlas,gene2phenotype,genomics_england,intogen,ot_genetics_portal,phenodigm,phewas_catalog,progeny,reactome,slapenrich,sysbio,uniprot_literature,uniprot_somatic
allelicRequirement,0,0,1075,0,0,0,0,0,2451,0,0,0,0,0,0,0,0,0,0,0
biologicalModelAllelicComposition,0,0,0,0,0,0,0,0,0,0,0,0,564310,0,0,0,0,0,0,0
biologicalModelGeneticBackground,0,0,0,0,0,0,0,0,0,0,0,0,564310,0,0,0,0,0,0,0
clinicalPhase,0,427943,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
clinicalSignificance,0,0,0,0,0,107532,8173,0,0,0,0,0,0,0,0,0,0,0,0,0


In [25]:
# Looping through all fields and extracting all sources where the data is present:
def lookup(field):
    # Keep only the sources (columns) with a non-zero count for this field
    counts = notnull_table.loc[field]
    return ','.join(counts[counts != 0].index.to_list())

pd.Series(notnull_table.index).apply(lookup)


0         clingen,gene2phenotype
1                      phenodigm
2                      phenodigm
3                         chembl
4                eva,eva_somatic
                 ...            
57    cancer_gene_census,intogen
58                      reactome
59    cancer_gene_census,intogen
60    cancer_gene_census,intogen
61            cancer_gene_census
Length: 62, dtype: object
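
The same lookup can be tried on a toy table without the input file. This is a sketch under assumptions: the counts below are made up, and `sources_per_field` is a hypothetical name. It applies the check row by row rather than going through the field names one at a time.

```python
import pandas as pd

# Made-up not-null counts: rows are fields, columns are data sources.
toy_counts = pd.DataFrame(
    {
        "chembl": [0, 10, 0],
        "eva": [5, 0, 0],
        "eva_somatic": [2, 0, 0],
    },
    index=["clinicalSignificance", "clinicalPhase", "unusedField"],
)


def sources_per_field(counts):
    """Map every field to a comma-separated list of sources with data."""
    return counts.apply(lambda row: ",".join(row[row != 0].index), axis=1)


data_sources = sources_per_field(toy_counts)
print(data_sources.to_dict())
```

A field that no source provides ends up with an empty string, which is worth keeping in mind when the column is later split on commas.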

## Reading the field descriptions

1. Read the Excel file with the field descriptions.
2. Parse the values.
3. Start building the JSON object.


In [74]:
field_description = 'field_descriptions.xlsx'
fields_df = pd.read_excel(field_description)
fields_df.head()

Unnamed: 0,backend_name,type,description,minimum,exclusiveMinimum,maximum,accepted_values,nullable,pattern,data_source
0,allelicRequirement,string,,,,,,,,"clingen,gene2phenotype"
1,biologicalModelAllelicComposition,string,,,,,,,,phenodigm
2,biologicalModelGeneticBackground,string,,,,,,,,phenodigm
3,clinicalPhase,integer,,,,,,,,chembl
4,clinicalSignificance,string,,,,,,,,"eva,eva_somatic"


In [64]:
def parse_data_sources(df):
    # Parsing dataframe to get list of fields for each data source:
    parsed_sources = defaultdict(list)
    for i, row in df.iterrows():
        for source in row['data_source'].split(','):
            
            # Only using ot_genetics for now:
            if source != 'ot_genetics_portal':
                continue
                
            parsed_sources[source].append(row['backend_name'])

    # Each data source then exploded into schemas:
    source_schemas = []
    for source, fields in parsed_sources.items():
        source_schema = OrderedDict()

        # Adding property definitions:
        source_schema['properties'] = OrderedDict({'datasourceId': {"const": source}})
        
        for field in fields:
            if field == 'datasourceId':
                continue
                
            source_schema['properties'][field] = {"$ref": f"#/definitions/{field}"}
        
        source_schema['required'] = fields


        # Adding source schema:
        source_schemas.append(source_schema)
        
    return source_schemas
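
The hard-coded `ot_genetics_portal` filter could be lifted into an argument once more sources are ready. The variant below is a sketch, not the notebook's function: `build_source_schemas` and its `only_sources` parameter are hypothetical names, and the input frame is made up. It also skips rows whose `data_source` cell is empty, on the assumption that such cells are read as NaN.

```python
from collections import defaultdict

import pandas as pd


def build_source_schemas(df, only_sources=None):
    """Build one schema branch per data source (hypothetical variant).

    `only_sources` optionally restricts the output to the given source names.
    """
    fields_by_source = defaultdict(list)
    for _, row in df.iterrows():
        # Skip rows without any data source (an empty cell is not a string).
        if not isinstance(row["data_source"], str):
            continue
        for source in row["data_source"].split(","):
            source = source.strip()
            if not source:
                continue
            if only_sources is None or source in only_sources:
                fields_by_source[source].append(row["backend_name"])

    schemas = []
    for source, fields in fields_by_source.items():
        properties = {"datasourceId": {"const": source}}
        for field in fields:
            if field != "datasourceId":
                properties[field] = {"$ref": f"#/definitions/{field}"}
        schemas.append({"properties": properties, "required": list(fields)})
    return schemas


# Made-up field descriptions, using the same two columns as the spreadsheet.
toy_fields = pd.DataFrame(
    {
        "backend_name": ["clinicalPhase", "clinicalSignificance", "diseaseId"],
        "data_source": ["chembl", "eva,eva_somatic", "chembl,eva"],
    }
)
schemas = build_source_schemas(toy_fields, only_sources={"chembl", "eva"})
print([schema["properties"]["datasourceId"]["const"] for schema in schemas])
```

Calling it with `only_sources={"ot_genetics_portal"}` would mirror the current single-source behaviour, and omitting the argument would emit a branch for every source.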


In [106]:
# Reloading possible modifications from the xlsx file:
fields_df = pd.read_excel(field_description)

# constants for the json schema:
schema_obj = OrderedDict({
    "$schema": "http://json-schema.org/draft-07/schema#",
    "title": "OpenTargets",
    "description": "OpenTargets evidence objects",
    "type": "object",
    "oneOf": [],  # filled below with one branch per data source
    "definitions": OrderedDict()
})

# Adding all the required fields for all data sources:
schema_obj["oneOf"] = parse_data_sources(fields_df)

# Building the definition of every field from the rows of the dataframe:

for i, row in fields_df.iterrows():
    field = row['backend_name']
    
    # Only using ot_genetics for now:
    if 'ot_genetics_portal' not in row['data_source']:
        continue

    field_annotation = OrderedDict()
    
    # Setting type - nullable only when explicitly flagged (an empty cell is NaN, which is truthy):
    if pd.notna(row['nullable']) and bool(row['nullable']):
        field_annotation['type'] = [row['type'], "null"]
    else:
        field_annotation['type'] = row['type']
    
    # Adding description:
    if isinstance(row['description'], str):
        field_annotation['description'] = row['description']

    # Adding minimum:
    if not np.isnan(row['minimum']):
        field_annotation['minimum'] = row['minimum']
        
    # Adding maximum:
    if not np.isnan(row['maximum']):
        field_annotation['maximum'] = row['maximum']

    # Adding exclusive minimum:
    if not np.isnan(row['exclusiveMinimum']):
        field_annotation['exclusiveMinimum'] = row['exclusiveMinimum']
        
    # Adding pattern:
    if isinstance(row['pattern'], str):
        field_annotation['pattern'] = row['pattern']
        
    schema_obj['definitions'][field] = field_annotation
    
    
    
print(json.dumps(schema_obj, indent=2))

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "OpenTargets",
  "description": "OpenTargets evidence objects",
  "type": "object",
  "oneOf": [
    {
      "properties": {
        "datasourceId": {
          "const": "ot_genetics_portal"
        },
        "confidenceIntervalLower": {
          "$ref": "#/definitions/confidenceIntervalLower"
        },
        "confidenceIntervalUpper": {
          "$ref": "#/definitions/confidenceIntervalUpper"
        },
        "diseaseFromSource": {
          "$ref": "#/definitions/diseaseFromSource"
        },
        "diseaseId": {
          "$ref": "#/definitions/diseaseId"
        },
        "id": {
          "$ref": "#/definitions/id"
        },
        "literature": {
          "$ref": "#/definitions/literature"
        },
        "locus2GeneScore": {
          "$ref": "#/definitions/locus2GeneScore"
        },
        "oddsRatio": {
          "$ref": "#/definitions/oddsRatio"
        },
        "publicationFirstAuthor":