## Specification aggregator for bioschemas

The DDE intentionally does not allow existing namespaces to be used by others. This feature ensures that user-generated or user-customized schemas are not confused with existing, registered schemas. Unfortunately, it also means that users looking to update an existing bioschemas schema cannot use the bioschemas namespace when creating a schema within the DDE. The aggregator includes a function that replaces the temporary namespace with the bioschemas namespace for the merge.
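
As a minimal sketch of that replacement, the snippet below rewrites `"@id"` prefixes in the raw JSON text, in the same spirit as `rename_namespace` further down. The helper name `swap_namespace` and the temporary namespace `bioschemastemp` are hypothetical and used only for illustration.

```python
import json


def swap_namespace(raw_text, temp_namespace, target_namespace="bioschemas"):
    """Rewrite '"@id": "<temp>:' prefixes so they use the target namespace."""
    if temp_namespace == target_namespace:
        return raw_text
    return raw_text.replace(
        f'"@id": "{temp_namespace}:', f'"@id": "{target_namespace}:'
    )


# Hypothetical one-class document created under a temporary namespace
raw = json.dumps({"@graph": [{"@id": "bioschemastemp:Gene", "@type": "rdfs:Class"}]})
print(swap_namespace(raw, "bioschemastemp"))
```

Because the replacement works on text, it only matches the exact `"@id": "` spacing; a document serialized without the space after the colon would need to be re-serialized first.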

### What this script does

1. Loads the list of jsonschema specification files to ingest
2. Replaces the temporary namespace in the merged json document
3. Includes a check for duplicate classes:
  * A profile might reference another profile. In order for the profile to work, a dummy class of the referenced profile may need to be created. This dummy class should be included in the merged file ONLY IF the actual class (which should have a validation section of its own) does not exist
  * Use the existence of a validation section to determine which class to keep when the same class comes from different profiles (real or dummy)
4. Includes a check for the `rdfs:subClassOf` value and updates it to match the specification list
  * This is in anticipation of the use of the DDE to update an existing profile
5. Includes a check for duplicate properties:
  * A single property might be used across different bioschemas classes
  * This means that the `"schema:domainIncludes"` property should be updated to reflect ALL of the classes that use it, rather than just the first class that uses it.
  * e.g., "bioschemas:output" might be a property in "ComputationalTool" and "ComputationalWorkflow". Since these profiles are developed separately, the one in "ComputationalTool" will include `"schema:domainIncludes": {"@id": "bioschemas:ComputationalTool"}` while the one from "ComputationalWorkflow" will include `"schema:domainIncludes": {"@id": "bioschemas:ComputationalWorkflow"}`. These will need to be merged into a single property with `"schema:domainIncludes": [{"@id": "bioschemas:ComputationalTool"},{"@id": "bioschemas:ComputationalWorkflow"}]`
6. Includes a check for the proper URL for bioschemas in `@context`
7. Automatically adds `dct:conformsTo` to all classes defined by the schema
8. Automatically adds `schema:schemaVersion` to all classes defined by the schema
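
The rule in step 3 (prefer the class definition that carries a validation section over a dummy one) could be sketched as follows. This is an illustration under stated assumptions, not the notebook's `clean_duplicate_classes`; the helper name `pick_classes` and the sample classes are hypothetical.

```python
def pick_classes(class_defs):
    """Keep one definition per class @id, preferring one with '$validation'."""
    chosen = {}
    for class_def in class_defs:
        class_id = class_def["@id"]
        current = chosen.get(class_id)
        # Keep the first definition seen, unless a later one has a validation
        # section and the one kept so far does not (i.e. it was a dummy class).
        if current is None or (
            "$validation" in class_def and "$validation" not in current
        ):
            chosen[class_id] = class_def
    return list(chosen.values())


classes = [
    {"@id": "bioschemas:Gene", "@type": "rdfs:Class"},  # dummy, no validation
    {"@id": "bioschemas:Gene", "@type": "rdfs:Class", "$validation": {"properties": {}}},
    {"@id": "bioschemas:Taxon", "@type": "rdfs:Class"},  # dummy with no real class
]
kept = pick_classes(classes)
print([(c["@id"], "$validation" in c) for c in kept])
```

A dummy class survives only when no definition with `$validation` exists for the same `@id`, which matches the "ONLY IF" condition described above.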

  

### To do
1. Include a check for properties which already have a list for the domainIncludes
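
One possible way to handle this to-do item is to normalize every `schema:domainIncludes` value to a list before merging, so that single dicts and existing lists are treated alike. The sketch below is an assumption about how that could look; `as_list` and `merge_domain_includes` are hypothetical helpers, not functions defined in this notebook.

```python
def as_list(value):
    """Wrap a single domainIncludes entry in a list; pass lists through."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def merge_domain_includes(property_defs):
    """Collect the unique domainIncludes entries across duplicate properties."""
    merged = []
    for prop in property_defs:
        for domain in as_list(prop.get("schema:domainIncludes")):
            if domain not in merged:  # preserve first-seen order
                merged.append(domain)
    return merged


props = [
    {"@id": "bioschemas:output",
     "schema:domainIncludes": {"@id": "bioschemas:ComputationalTool"}},
    {"@id": "bioschemas:output",
     "schema:domainIncludes": [{"@id": "bioschemas:ComputationalTool"},
                               {"@id": "bioschemas:ComputationalWorkflow"}]},
]
print(merge_domain_includes(props))
```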




In [1]:
import json
import requests
import pandas as pd
from pandas import read_csv
import os
import pathlib

In [2]:
def get_raw_url(url):
    ## Default to the input so rawurl is always defined, even for other branch names
    rawurl = url
    if 'raw' not in url:
        rawurl = url.replace('github.com','raw.githubusercontent.com')
        if '/blob/master/' in rawurl:
            rawurl = rawurl.replace('/blob/master/','/master/')
        elif '/blob/main/' in rawurl:
            rawurl = rawurl.replace('/blob/main/','/main/')
    return(rawurl)

In [3]:
def rename_namespace(spec_list,eachurl,rawtext):
    tmpinfo = spec_list.loc[spec_list['url']==eachurl]
    tmpnamespace = tmpinfo.iloc[0]['namespace']
    if tmpnamespace!='bioschemas':
        tmptext = '"@id": "'+tmpnamespace+':'
        cleantext = rawtext.replace(tmptext,'"@id": "bioschemas:')
    else:
        cleantext = rawtext
    return(cleantext)

def check_context_url(spec_json):
    contextInfo = spec_json['@context']
    bioschemasUrl = "https://discovery.biothings.io/view/bioschemas/"
    contextInfo["bioschemas"]=bioschemasUrl
    contextInfo["dct"] = "http://purl.org/dc/terms/"
    contextInfo["owl"] = "http://www.w3.org/2002/07/owl#"
    return(contextInfo)

def update_subclass(spec_list,eachurl,cleantext):
    spec_json = json.loads(cleantext)
    tmpinfo = spec_list.loc[spec_list['url']==eachurl]
    tmpsubclass = tmpinfo.iloc[0]['subclassOf']
    classname = tmpinfo.iloc[0]['name']
    truesubclass = {"@id": tmpsubclass}
    for x in spec_json['@graph']:
        if x['@id']=="bioschemas:"+classname:
            x['rdfs:subClassOf']=truesubclass
    return(spec_json)

In [4]:
def add_conformsTo(spec_list,x):
    spec_info = spec_list.loc[spec_list['name']==x['@id'].replace("bioschemas:","")]
    spec_url = spec_info.iloc[0]['url']
    conformsTodict = {
            "description": "This is used to state the Bioschemas profile that the markup relates to. The identifier can be the url for the version of this bioschemas class on github: "+spec_url,
            "$ref": "#/definitions/conformsDefinition"
          }
    conformdef={
                "@type": "CreativeWork",
                "type": "object",
                "properties": {
                  "identifier":{
                    "description": "The url of the version bioschemas profile that was used. For jsonschema, set @id to the identifier",
                    "oneOf": [
                      {
                        "enum": [spec_url] 
                      },
                      {
                        "type": "string",
                        "format": "uri"
                      }
                    ]
                  }
                },
                "required": [
                  "identifier"
                ]              
        }
    x['$validation']['properties']['conformsTo'] = conformsTodict
    requirementlist = x['$validation']['required']
    requirementlist.append('conformsTo')
    x['$validation']['required'] = requirementlist
    ## Reuse existing definitions if present, otherwise start a new dict
    definitiondict = x['$validation'].get('definitions', {})
    definitiondict["conformsDefinition"]=conformdef
    x['$validation']['definitions']=definitiondict
    return(x)

In [5]:
def add_schemaVersion(spec_list,x):
    spec_info = spec_list.loc[spec_list['name']==x['@id'].replace("bioschemas:","")]
    spec_url = spec_info.iloc[0]['url']
    baseurl = "https://bioschemas.org"
    versionurl = baseurl+'/'+spec_info.iloc[0]['type'].lower()+'s/'+spec_info.iloc[0]['name']+'/'+spec_info.iloc[0]['version']
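    try:
        existingversions = x["schema:schemaVersion"]
        ## Check the existing value (not the list being built) before parsing it
        if not isinstance(existingversions, list):
            schemaversions = existingversions.strip("[").strip("]").split(",")
        else:
            schemaversions = existingversions
    except (KeyError, AttributeError):
        ## No usable existing versions, so start a fresh list
        schemaversions = []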
    schemaversions.append(versionurl)
    schemaversions.append(spec_url)
    ## Ensure uniqueness of elements
    x["schema:schemaVersion"] = list(set(schemaversions))
    return(x)

In [6]:
def add_specification_type(spec_list,x):
    spec_info = spec_list.loc[spec_list['name']==x['@id'].replace("bioschemas:","")]
    if spec_info.iloc[0]['type']=='Type':
        baseurl = 'https://bioschemas.org/types#nav-'
    elif spec_info.iloc[0]['type']=='Profile':
        baseurl = 'https://bioschemas.org/profiles#nav-'
    if 'deprecated' in spec_info.iloc[0]['version'].lower():
        typeurl = baseurl+'deprecated'
    elif 'release' in spec_info.iloc[0]['version'].lower():
        typeurl = baseurl+'release'
    elif 'draft' in spec_info.iloc[0]['version'].lower():
        typeurl = baseurl+'draft'
    x['additional_type'] = typeurl
    return(x)

In [7]:
def merge_specs(spec_list):
    bioschemas_json = {}
    graphlist = []
    classlist = []
    propertylist = []
    for eachurl in spec_list['url']:
        rawurl = get_raw_url(eachurl)
        r = requests.get(rawurl)
        if r.status_code == 200:
            cleantext = rename_namespace(spec_list,eachurl,r.text)
            spec_json = json.loads(cleantext)
            bioschemas_json['@context'] = check_context_url(spec_json)
            for x in spec_json['@graph']:
                graphlist.append(x)
                if x["@type"]=="rdfs:Class":
                    classlist.append(x["@id"])
                if x["@type"]=="rdf:Property":
                    propertylist.append(x["@id"])
    cleanclassgraph = clean_duplicate_classes(spec_list,graphlist,classlist)
    cleanpropsgraph = clean_duplicate_properties(graphlist, propertylist)
    cleangraph = []
    for z in cleanclassgraph:
        cleangraph.append(z)
    for a in cleanpropsgraph:
        cleangraph.append(a)
    conformsTo = define_conformsTo(classlist)
    cleangraph.append(conformsTo)
    bioschemas_json['@graph']=cleangraph
    return(bioschemas_json)

In [8]:
def clean_duplicate_classes(spec_list,graphlist,classlist):
    duplicates = [i for i in set(classlist) if classlist.count(i) > 1]
    nondupes = [x for x in classlist if x not in duplicates]
    cleanclassgraph = []
    if len(duplicates)>0:  ## There are duplicate classes to clean up
        for x in graphlist:
            if x["@id"] in nondupes:
                x = add_specification_type(spec_list,x)
                x = add_schemaVersion(spec_list,x)
                if "$validation" in x.keys():
                    x = add_conformsTo(spec_list,x)
                cleanclassgraph.append(x)
            for eachclass in duplicates:
                if x["@id"]==eachclass:
                    x = add_specification_type(spec_list,x)
                    x = add_schemaVersion(spec_list,x)
                    if "$validation" in x.keys():
                        x = add_conformsTo(spec_list,x)
                    cleanclassgraph.append(x)
    else:  ## There are not duplicate classes to clean up
        for x in graphlist:
            if x["@id"] in nondupes:
                x = add_specification_type(spec_list,x)
                x = add_schemaVersion(spec_list,x)
                if "$validation" in x.keys():
                    x = add_conformsTo(spec_list,x)
                cleanclassgraph.append(x)        
    return(cleanclassgraph)

def clean_duplicate_properties(graphlist, propertylist):            
    duplicates = [i for i in set(propertylist) if propertylist.count(i) > 1]
    nondupes = [x for x in propertylist if x not in duplicates]
    cleanpropsgraph = []
    dupepropsgraph = []
    if len(duplicates)>0:  ## There are duplicate properties to clean up
        for x in graphlist:
            if x["@id"] in nondupes:
                x = remove_NaN_fields(x)
                cleanpropsgraph.append(x)
            elif x["@id"] in duplicates:
                x = remove_NaN_fields(x)
                dupepropsgraph.append(x)
        #dupepropsgraph[0]["dummyProp"]={"@id":"dummyValue"} #### creates dummy property for testing only
        dupepropsdf = pd.DataFrame(dupepropsgraph)
        for eachprop in duplicates:
            tmpdf = dupepropsdf.loc[dupepropsdf['@id']==eachprop].copy()
            domainlist = []
            #### Collect each domain once; a comprehension cannot see the list it is building
            for y in tmpdf["schema:domainIncludes"]:
                if y not in domainlist:
                    domainlist.append(y)
            #### Get the row with the least number of NaNs (ie- the row with the most properties) to serve as the base property
            tmpdf["nullcount"]=tmpdf.isnull().sum(axis=1)
            tmpdf.sort_values("nullcount",ascending=True,inplace=True)
            tmpdict = tmpdf.iloc[0].to_dict()
            del tmpdict["nullcount"]
            tmpdict["schema:domainIncludes"]=domainlist #### Set the domainIncludes list
            cleanpropsgraph.append(tmpdict)       
    else:
        for x in graphlist:
            if x["@id"] in nondupes:
                x = remove_NaN_fields(x)
                cleanpropsgraph.append(x)
    return(cleanpropsgraph)   

In [9]:
def define_conformsTo(classlist):
    ## Deduplicate while preserving order so each class is listed once
    uniqueclasses = list(dict.fromkeys(classlist))
    classidlist = [{"@id":x} for x in uniqueclasses]
    conformsTo = {
      "@id": "dct:conformsTo",
      "@type": "rdf:Property",
      "rdfs:comment": "Used to state the Bioschemas profile that the markup relates to. The versioned URL of the profile must be used. Note that we use a CURIE in the table here but the full URL for Dublin Core terms must be used in the markup (http://purl.org/dc/terms/conformsTo), see example.",
      "rdfs:label": "conformsTo",
      "schema:domainIncludes": classidlist,
      "schema:rangeIncludes": [
        {"@id": "schema:CreativeWork"},{"@id": "schema:Text"},{"@id": "schema:Thing"}
      ]
    }
    return(conformsTo)

In [10]:
def remove_NaN_fields(propdef):
    if isinstance(propdef,dict):
        cleandict = {}
        for k, v in propdef.items():
            if k != "schema:sameAs":
                cleandict[k]=v
            elif k == "schema:sameAs": 
                if isinstance(v,type(None))==False:
                    cleandict[k]=v
    if isinstance(propdef,str):
        cleandict = propdef.replace(', "schema:sameAs": NaN','')
        cleandict = cleandict.replace('"schema:sameAs": NaN, ','')
    return(cleandict)

In [11]:
def update_specs(script_path):
    spec_list = read_csv('specifications_list.txt',delimiter='\t',header=0)
    bioschemas_json = remove_NaN_fields(merge_specs(spec_list))
    bioschemasfile = os.path.join(script_path,'bioschemas.json')
    jsonstring = json.dumps(bioschemas_json, indent=2)
    cleanstring = remove_NaN_fields(jsonstring)
    with open(bioschemasfile,'w') as outfile:
        outfile.write(cleanstring)

In [17]:
spec_list = read_csv('specifications_list.txt',delimiter='\t',header=0)
bioschemas_json = merge_specs(spec_list)
bioschemasfile = os.path.join(script_path,'bioschemas.json')
jsonstring = json.dumps(bioschemas_json)
cleanstring = remove_NaN_fields(jsonstring)
if ', "schema:sameAs": NaN' in cleanstring:
    print("dang")
print(cleanstring)
#with open(bioschemasfile,'w') as outfile:
#    outfile.write(cleanstring)

KeyboardInterrupt: 

In [22]:
## main
script_path = ""
#script_path = pathlib.Path(__file__).parent.absolute()
update_specs(script_path)


In [14]:
## test
spec_list = read_csv('specifications_list.txt',delimiter='\t',header=0)
bioschemas_json = {}
graphlist = []
classlist = []
propertylist = []
for eachurl in spec_list['url']:
    rawurl = get_raw_url(eachurl)
    r = requests.get(rawurl)
    if r.status_code == 200:
        cleantext = rename_namespace(spec_list,eachurl,r.text)
        spec_json = update_subclass(spec_list,eachurl,cleantext)
        bioschemas_json['@context'] = check_context_url(spec_json)
        for x in spec_json['@graph']:
            graphlist.append(x)
            if x["@type"]=="rdfs:Class":
                classlist.append(x["@id"])
            if x["@type"]=="rdf:Property":
                propertylist.append(x["@id"])

cleanclassgraph = clean_duplicate_classes(spec_list,graphlist,classlist)
cleanpropsgraph = clean_duplicate_properties(graphlist, propertylist)
cleangraph = []
for z in cleanclassgraph:
    cleangraph.append(z)
for a in cleanpropsgraph:
    cleangraph.append(a)
conformsTo = define_conformsTo(classlist)
cleangraph.append(conformsTo)
print(len(cleangraph))
print(len(cleanclassgraph)+len(cleanpropsgraph))
print(len(cleanclassgraph),len(cleanpropsgraph))
print(cleangraph[-1])

JSONDecodeError: Invalid control character at: line 12 column 306 (char 636)

In [None]:
classidlist = [{"@id":x} for x in classlist]
print(classidlist)

In [None]:
spec_list = read_csv('specifications_list.txt',delimiter='\t',header=0)
spec_info = spec_list.loc[spec_list['name']=="Gene"]
print(spec_info.iloc[0]['url'])

## Test a schema

In [21]:
def check_json_formatting(spec_list):
    json_issues = []
    bioschemas_json = {}
    graphlist = []
    classlist = []
    propertylist = []
    for eachurl in spec_list['url']:
        print(eachurl)
        rawurl = get_raw_url(eachurl)
        r = requests.get(rawurl)
        if r.status_code == 200:
            cleantext = rename_namespace(spec_list,eachurl,r.text)
            try:
                spec_json = json.loads(cleantext)
                print("successfully loaded")
            except json.JSONDecodeError:
                json_issues.append(eachurl)
    
    if len(json_issues)==0:
        return("No json formatting errors found in any listed specs")
    else:
        return("Json formatting errors found in the following listed specs: ",json_issues)

#### Test a schema's compatibility with the DDE

To do this, you will need to install the biothings schema tools:
`pip install git+https://github.com/biothings/biothings_schema.py#egg=biothings_schema`

In [20]:
from biothings_schema import Schema

url = "https://raw.githubusercontent.com/gtsueng/bioschemas-dde/main/bioschemas.json"

sc = Schema(url, base_schema=["schema.org"])

ValueError: field "biogicalRole" in "$validation" is not defined in this class or any of its parent classes
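
The error above points at a `$validation` field that does not match any defined property. A rough pre-check for this kind of mismatch, run on the merged graph before loading it with `Schema`, might look like the sketch below. It assumes that validation field names correspond to the local part of a property `@id`, and it only compares against properties defined in the same graph, so inherited schema.org properties are ignored rather than flagged. The helper name `likely_typos`, the sample class, and the property `biologicalRole` are hypothetical.

```python
import difflib


def likely_typos(graph, cutoff=0.8):
    """Map class @id -> {validation field: closest defined property name}."""
    # Local names of the properties defined in this graph
    defined = sorted(
        {item["@id"].split(":", 1)[-1]
         for item in graph if item.get("@type") == "rdf:Property"}
    )
    suspects = {}
    for item in graph:
        fields = item.get("$validation", {}).get("properties", {})
        for field in fields:
            if field in defined:
                continue
            # Only report fields that nearly match a defined property name
            close = difflib.get_close_matches(field, defined, n=1, cutoff=cutoff)
            if close:
                suspects.setdefault(item["@id"], {})[field] = close[0]
    return suspects


graph = [
    {"@id": "bioschemas:biologicalRole", "@type": "rdf:Property"},
    {"@id": "bioschemas:ExampleClass", "@type": "rdfs:Class",
     "$validation": {"properties": {"biogicalRole": {}, "name": {}}}},
]
print(likely_typos(graph))
```

This only surfaces candidates for manual review; the DDE's own check also walks parent classes, so loading the merged file with `Schema` remains the authoritative test.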