# Prototyping the new json schema


As part of the consolidation of the evidence objects in the backend, we are re-modeling the [json schema](https://github.com/opentargets/json_schema) to reflect the new simplified/flattened design.

Based on the meeting we had on 2020.11.11 the following conclusions were reached:

* We need to maintain a json schema that guides our data providers and can be used as a template to generate evidence strings.
* The schema will reflect the concepts of the new platform design, so the units of the schema will be data source centric instead of data type centric.
* Each of the valuable columns will be defined in a common section.
* For each data source there will only be a list of required fields.
* We haven't reached a consensus on how, and at which point of the evidence generation, the unique association fields are defined. So for the first iteration of the json schema, the `unique_association_fields` will be omitted.
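
The layout described in these points can be sketched roughly as follows: a shared `definitions` section holds each field once, and each data source contributes only a conditional block with its list of required fields. This is a hand-written illustration, not the generated schema; the type given for `datasourceId` is an assumption.

```python
import json

# Hypothetical, hand-written fragment: the field types here are placeholders.
schema_sketch = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "type": "object",
    # Common section: every field is defined exactly once.
    "definitions": {
        "datasourceId": {"type": "string"},
        "clinicalPhase": {"type": "integer"},
    },
    # Source-centric part: one conditional block per data source.
    "allOf": [
        {
            "if": {"properties": {"datasourceId": {"const": "chembl"}}},
            "then": {
                "properties": {
                    "clinicalPhase": {"$ref": "#/definitions/clinicalPhase"}
                },
                "required": ["clinicalPhase"],
            },
        }
    ],
}

print(json.dumps(schema_sketch, indent=2))
```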

The schema is written based on the most recent iteration of the [evidence schema review](https://docs.google.com/spreadsheets/d/11jdPCo_vxY3jaP54xKTsXBshR5HMrpUf5oXJNgtbKm8/edit#gid=1735847104) document.

The technical approach:

* To avoid manual work with the json document, I'm collating the information in an Excel file and will use that as the source for the definitions.
* The same Excel file will be used to get the names of the sources from which we expect each field.

## 1. Getting the list of data sources for every field

This information is extracted from the evidence schema review file. The end result is a comma-separated list of data sources for every field. This column is later used to generate the list of mandatory fields for every data source.

In [33]:
# 1. Get the source names for every field:
import pandas as pd
import json
from collections import OrderedDict, defaultdict

notnull_table = pd.read_csv('iter9_notnull_table.tsv', sep='\t')
notnull_table.index = notnull_table.key.tolist()
notnull_table.drop('key', axis=1, inplace=True)
notnull_table.head()

Unnamed: 0,cancer_gene_census,chembl,clingen,crispr,europepmc,eva,eva_somatic,expression_atlas,gene2phenotype,genomics_england,intogen,ot_genetics_portal,phenodigm,phewas_catalog,progeny,reactome,slapenrich,sysbio,uniprot_literature,uniprot_somatic
allelicRequirement,0,0,1075,0,0,0,0,0,2451,0,0,0,0,0,0,0,0,0,0,0
biologicalModelAllelicComposition,0,0,0,0,0,0,0,0,0,0,0,0,564310,0,0,0,0,0,0,0
biologicalModelGeneticBackground,0,0,0,0,0,0,0,0,0,0,0,0,564310,0,0,0,0,0,0,0
clinicalPhase,0,427943,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
clinicalSignificance,0,0,0,0,0,107532,8173,0,0,0,0,0,0,0,0,0,0,0,0,0


In [25]:
# Looping through all fields and extracting all sources where the data is present:
def lookup(field):
    # Sources whose not-null count for this field is non-zero, joined with commas
    counts = notnull_table.loc[field]
    return ','.join(counts[counts > 0].index.to_list())
pd.Series(notnull_table.index).apply(lookup)


0         clingen,gene2phenotype
1                      phenodigm
2                      phenodigm
3                         chembl
4                eva,eva_somatic
                 ...            
57    cancer_gene_census,intogen
58                      reactome
59    cancer_gene_census,intogen
60    cancer_gene_census,intogen
61            cancer_gene_census
Length: 62, dtype: object
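
The series above is indexed by position, so the field names are no longer visible next to the source lists. A minimal sketch of the same lookup that keeps the field name as the key is shown below. It uses a three-row excerpt of the table typed in by hand (a subset of the columns) and the standard `csv` module instead of pandas, purely for illustration.

```python
import csv
import io

# Hand-copied excerpt of the not-null table (subset of columns), for illustration only.
excerpt = """key\tchembl\tclingen\teva\teva_somatic\tgene2phenotype
allelicRequirement\t0\t1075\t0\t0\t2451
clinicalPhase\t427943\t0\t0\t0\t0
clinicalSignificance\t0\t0\t107532\t8173\t0
"""

sources_by_field = {}
for row in csv.DictReader(io.StringIO(excerpt), delimiter="\t"):
    field = row.pop("key")
    # Keep the sources with a non-zero count, in column order.
    sources_by_field[field] = ",".join(s for s, n in row.items() if int(n) != 0)

print(sources_by_field)
```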

## 2. Reading the field descriptions

1. Read the Excel file with the field descriptions.
2. Parse the values.
3. Start building the json object.


In [27]:
field_description = 'field_descriptions.xlsx'
fields_df = pd.read_excel(field_description)
fields_df.head()

Unnamed: 0,backend_name,type,description,minimum,exclusiveMinimum,maximum,accepted_values,nullable,pattern,data_source
0,allelicRequirement,string,,,,,,,,"clingen,gene2phenotype"
1,biologicalModelAllelicComposition,string,,,,,,,,phenodigm
2,biologicalModelGeneticBackground,string,,,,,,,,phenodigm
3,clinicalPhase,integer,,,,,,,,chembl
4,clinicalSignificance,string,,,,,,,,"eva,eva_somatic"
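
The `$ref` entries generated by `parse_data_sources` point at `#/definitions/<field>`, so each row of this table eventually has to become one entry in a `definitions` section. A rough sketch of that step follows. The helper name `row_to_definition` and its mapping rules are assumptions rather than the agreed design: empty cells are skipped, `accepted_values` is treated as a comma-separated string mapped to `enum`, and a truthy `nullable` widens the type to include `null`. The `minimum`/`maximum` values in the sample rows are made up.

```python
def row_to_definition(row):
    """Turn one field-description row (a dict) into a json schema definition.

    Hypothetical helper: the mapping rules below are assumptions.
    """
    definition = {"type": row["type"]}

    # Copy simple keywords over only when the cell is filled in.
    for keyword in ("description", "minimum", "exclusiveMinimum", "maximum", "pattern"):
        value = row.get(keyword)
        if value is not None and value != "":
            definition[keyword] = value

    # Assumed convention: accepted values are stored as a comma-separated string.
    if row.get("accepted_values"):
        definition["enum"] = [v.strip() for v in row["accepted_values"].split(",")]

    # Assumed convention: a truthy 'nullable' cell allows null as well.
    if row.get("nullable"):
        definition["type"] = [row["type"], "null"]
        if "enum" in definition:
            definition["enum"].append(None)

    return definition


# Sample rows with made-up constraint values, for illustration only.
rows = [
    {"backend_name": "clinicalPhase", "type": "integer", "minimum": 0, "maximum": 4},
    {"backend_name": "allelicRequirement", "type": "string", "nullable": True},
]
definitions = {row["backend_name"]: row_to_definition(row) for row in rows}
print(definitions)
```

When the rows come from a pandas dataframe, empty cells arrive as `NaN` rather than `None`, so they would need converting before being passed to a helper like this.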


In [44]:
# constants for the json schema:
schema_obj = OrderedDict({
    "$schema": "http://json-schema.org/draft-07/schema#",
    "title": "OpenTargets",
    "description": "OpenTargets evidence objects",
    "type": "object",
})

# Adding all the required fields for all data sources.
# Note: parse_data_sources() is defined in cell In [43] below, which was run before this cell.
schema_obj["allOf"] = parse_data_sources(fields_df)

clingen
gene2phenotype
phenodigm
chembl
eva
eva_somatic
intogen
ot_genetics_portal
expression_atlas
cancer_gene_census
crispr
europepmc
genomics_england
phewas_catalog
progeny
reactome
slapenrich
sysbio
uniprot_literature
uniprot_somatic


In [43]:
def parse_data_sources(df):
    """Build one if/then sub-schema per data source from the field description table."""
    # Parsing dataframe to get list of fields for each data source:
    parsed_sources = defaultdict(list)
    for _, row in df.iterrows():
        for source in row['data_source'].split(','):
            parsed_sources[source].append(row['backend_name'])

    # Each data source is then exploded into its own schema:
    source_schemas = []
    for source, fields in parsed_sources.items():
        print(source)  # progress output: one line per data source
        source_schema = OrderedDict()
        source_schema['if'] = OrderedDict({'properties': {'datasourceId': {"const": source}}})
        source_schema['then'] = OrderedDict({"properties": {}})
        source_schema['then']['required'] = fields

        # Adding property definitions:
        for field in fields:
            source_schema['then']['properties'][field] = {"$ref": f"#/definitions/{field}"}

        # Adding source schema:
        source_schemas.append(source_schema)

    return source_schemas
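
To make the effect of one generated `if`/`then` block concrete, the sketch below hand-evaluates it for a single evidence dict. It covers only the `const` and `required` keywords and is not a substitute for a real json schema validator; `missing_required` is a made-up helper name. In draft-07 a `properties` check inside `if` also passes when the property is absent, so the sketch treats a record without `datasourceId` as matching. That is one reason some schemas add `"required": ["datasourceId"]` to the `if` part.

```python
def missing_required(evidence, source_schema):
    """Return the required fields that `evidence` lacks under one if/then block."""
    conditions = source_schema["if"]["properties"]
    for name, rule in conditions.items():
        # Draft-07 semantics: an absent property does not fail a `properties` check.
        if name in evidence and evidence[name] != rule["const"]:
            return []  # the `if` does not match, so `then` is not applied
    return [f for f in source_schema["then"]["required"] if f not in evidence]


# Simplified block in the shape produced by parse_data_sources (properties omitted).
chembl_block = {
    "if": {"properties": {"datasourceId": {"const": "chembl"}}},
    "then": {"required": ["datasourceId", "clinicalPhase"]},
}

print(missing_required({"datasourceId": "chembl"}, chembl_block))
print(missing_required({"datasourceId": "eva"}, chembl_block))
print(missing_required({}, chembl_block))
```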

In [45]:
import json
print(json.dumps(schema_obj, indent=2))

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "OpenTargets",
  "description": "OpenTargets evidence objects",
  "type": "object",
  "allOf": [
    {
      "if": {
        "properties": {
          "datasourceId": {
            "const": "clingen"
          }
        }
      },
      "then": {
        "properties": {
          "allelicRequirement": {
            "$ref": "#/definitions/allelicRequirement"
          },
          "confidence": {
            "$ref": "#/definitions/confidence"
          },
          "datasourceId": {
            "$ref": "#/definitions/datasourceId"
          },
          "diseaseFromSource": {
            "$ref": "#/definitions/diseaseFromSource"
          },
          "diseaseId": {
            "$ref": "#/definitions/diseaseId"
          },
          "id": {
            "$ref": "#/definitions/id"
          },
          "recordId": {
            "$ref": "#/definitions/recordId"
          },
          "resourceScore": {
            "$ref