# Auto-generate x-bte annotations for BioThings SEMMEDDB

This notebook walks a developer through taking the [SEMMEDDB database](https://lhncbc.nlm.nih.gov/ii/tools/SemRep_SemMedDB_SKR/SemMedDB_download.html) and using its data to auto-generate the x-bte operations for [BTE](https://github.com/biothings/BioThings_Explorer_TRAPI). BTE needs these operations to query the [BioThings SEMMEDDB API](https://biothings.ncats.io/semmeddb) and process its responses.

---

When you see this:  
**PAUSE**

read the accompanying text. It explains what you need to do before running the code chunks that follow that text block.

---

The [yaml](https://github.com/NCATS-Tangerine/translator-api-registry/blob/master/semmeddb/generated_operations.yaml) [segments](https://github.com/NCATS-Tangerine/translator-api-registry/blob/master/semmeddb/generated_list.yaml) generated by this notebook are added to [this file](https://github.com/NCATS-Tangerine/translator-api-registry/blob/master/semmeddb/version_without_operations.yaml) to make [yaml used for the smartapi registration](https://github.com/NCATS-Tangerine/translator-api-registry/blob/master/semmeddb/smartapi.yaml) for BioThings SEMMEDDB API.

## Setup

Requirements:
* Install [Biolink-model Toolkit (bmt)](https://github.com/biolink/biolink-model-toolkit/). I [installed it as a user](https://biolink.github.io/biolink-model-toolkit/intro/intro.html#for-users) with pip (`pip install bmt`). Using release 1.1.1 at the moment.
* Install ruamel.yaml with [pip](https://pypi.org/project/ruamel.yaml/). Using 0.17.32 at the moment, which is more recent than the version from [conda-forge](https://anaconda.org/conda-forge/ruamel_yaml). The documentation is [here](https://yaml.readthedocs.io/en/latest/).
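
A quick way to compare the installed packages against the versions listed above is sketched here. It only uses the standard library; `installed_version` and the `expected` dict are names made up for this example, and a different version is not necessarily a problem.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


## versions this notebook was last run with (taken from the list above)
expected = {"bmt": "1.1.1", "ruamel.yaml": "0.17.32"}

for name, wanted in expected.items():
    found = installed_version(name)
    status = "ok" if found == wanted else "differs"
    print(f"{name}: expected {wanted}, found {found} ({status})")
```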

Files:
* Get SEMMEDDB PREDICATION CSV [here](https://lhncbc.nlm.nih.gov/ii/tools/SemRep_SemMedDB_SKR/SemMedDB_download.html). This notebook was originally made using the version `semmedVER43_2021_R_PREDICATION`
* Get SEMMEDDB SRDEF file needed for interpreting and mapping SEMMED semantic types: 
  * download the compressed file [here](https://lhncbc.nlm.nih.gov/semanticnetwork/download.html) or pick the latest version [here](https://lhncbc.nlm.nih.gov/semanticnetwork/SemanticNetworkArchive.html)
  * information on the SRDEF file [here](https://www.ncbi.nlm.nih.gov/books/NBK9679/#ch05.sec5.2)

In [1]:
## CX: allows multiple lines of code to print from one code block
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

import pathlib
import pandas as pd

## see above for install instructions
from bmt import Toolkit
import ruamel.yaml as ryml    ## use ruamel_yaml if using the conda-forge version

import json
import pprint

## used in trying things out
# import re

**PAUSE**

Review the code chunk below before running it:
* Check and correct the path for `raw_data_location`
* Check that the columns specified in `usecols` and `names` match the columns of the PREDICATION [file](https://lhncbc.nlm.nih.gov/ii/tools/SemRep_SemMedDB_SKR/dbinfo.html). 
* Check that the `na_values` and `sep` arguments are correct. One way is to look at the first few lines of the file with a Terminal command like `head`.
* If there are encoding issues, try different encodings. `latin1` was used here and worked ([ref](https://stackoverflow.com/questions/61163367/how-to-resolve-unicodedecodeerror-in-pandas-read-csv-while-loading-dataset)).
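
If `head` isn't handy, the same check can be done from Python. This is a minimal sketch: `peek_rows` is a hypothetical helper, and the two sample lines are made up for illustration (they are not real SEMMEDDB rows). It parses only the first few lines, so it is cheap even on the full file.

```python
import csv
import itertools


def peek_rows(lines, sep=",", n=5):
    """Parse the first n lines and return each one as a list of fields."""
    reader = csv.reader(lines, delimiter=sep)
    return list(itertools.islice(reader, n))


## made-up sample lines, only here so the sketch runs on its own
sample = [
    "1,2,a,TREATS,C1,x,phsu,1,C2,y,dsyn,1\n",
    "2,3,b,ISA,C3,x,virs,1,C4,y,virs,\\N\n",
]
rows = peek_rows(sample)

print(len(rows[0]))                              ## how many columns the first row has
print([rows[0][i] for i in (3, 6, 7, 10, 11)])   ## the fields that `usecols` would pick
print(rows[1][11])                               ## a missing value shows up as \N
```

To run it on the real file, pass an open file handle instead of `sample`, for example `open(raw_data_location, encoding="latin1", newline="")`.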

In [2]:
raw_data_location = pathlib.Path.home().joinpath(
            "Desktop", "RawDataFiles", "SEMMEDDB", "semmedVER43_2021_R_PREDICATION.csv")

raw_data = pd.read_csv(raw_data_location, header=None, sep=",", encoding="latin1",
                          usecols=[3, 6, 7, 10, 11],
                          names=["PREDICATE","SUBJECT_SEMTYPE","SUBJECT_NOVELTY",
                                 "OBJECT_SEMTYPE", "OBJECT_NOVELTY"],
                          na_values=r"\N")

In [3]:
raw_data.shape

(113863366, 5)

In [4]:
raw_data.head()

|   | PREDICATE | SUBJECT_SEMTYPE | SUBJECT_NOVELTY | OBJECT_SEMTYPE | OBJECT_NOVELTY |
|---|---|---|---|---|---|
| 0 | PROCESS_OF | virs | 1 | mamm | 1.0 |
| 1 | ISA | virs | 1 | virs | 1.0 |
| 2 | ISA | virs | 1 | virs | 1.0 |
| 3 | ISA | virs | 1 | virs | 1.0 |
| 4 | PROCESS_OF | dsyn | 0 | humn | 0.0 |


## Basic Filtering

### keep only novelty = 1

Now filter the data to keep only rows with novelty == 1 for both subject and object, since those with novelty == 0 probably aren't very helpful / interesting to Translator. The entities with novelty == 0 are [listed](https://lhncbc.nlm.nih.gov/ii/tools/SemRep_SemMedDB_SKR/dbinfo.html) in the SEMMEDDB GENERIC_CONCEPT table files, which can be downloaded [here](https://lhncbc.nlm.nih.gov/ii/tools/SemRep_SemMedDB_SKR/SemMedDB_download.html).
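
One thing to keep in mind with an `== 1` filter: a missing novelty value (read in as NaN) never compares equal to 1, so those rows are dropped along with the novelty == 0 rows. The toy frame below is made up to show that behavior; it is not real SEMMEDDB data.

```python
import pandas as pd

## toy rows, made up for illustration
toy = pd.DataFrame({
    "PREDICATE": ["TREATS", "ISA", "PROCESS_OF"],
    "SUBJECT_NOVELTY": [1, 1, 0],
    "OBJECT_NOVELTY": [1.0, float("nan"), 0.0],
})

## same shape of filter as the notebook uses on the full data
keep = (toy["SUBJECT_NOVELTY"] == 1) & (toy["OBJECT_NOVELTY"] == 1)
kept = toy[keep]

print(kept["PREDICATE"].tolist())   ## only the row where both novelty values are 1
```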

In [5]:
filtered_data = raw_data[(raw_data["SUBJECT_NOVELTY"] == 1) &
                         (raw_data["OBJECT_NOVELTY"] == 1)].copy()

In [6]:
filtered_data.shape

(78688677, 5)

In [7]:
## remove the novelty stuff since that will make computations faster and it's always 1 now
filtered_data = filtered_data[['SUBJECT_SEMTYPE', 'PREDICATE', 'OBJECT_SEMTYPE']]

What semantic types are actually in the data? We have to prune down to what we actually want BTE operations on...

This is interesting, since the stats from the [official website](https://lhncbc.nlm.nih.gov/semanticnetwork/SemanticNetworkArchive.html) say 127 semantic types and 54 predicates.

In [8]:
subject_semtypes = set(filtered_data["SUBJECT_SEMTYPE"].unique())
object_semtypes = set(filtered_data["OBJECT_SEMTYPE"].unique())
predicates = set(filtered_data["PREDICATE"].unique())

len(subject_semtypes)  ## got 133
len(object_semtypes)   ## got 135
len(predicates)        ## got 68

133

135

68

**PAUSE**

* Review the 3 sets above to see if there are things I want to remove. The normal entity semantic types have 4-letter codes...
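
Since the normal codes are 4 lowercase letters, the odd ones can be flagged automatically before reading through the whole set. A small sketch, assuming all values are strings; `irregular_codes` is a name made up here, and the example values are a few of the object semantic types shown in this notebook.

```python
import re

FOUR_LETTER_CODE = re.compile(r"[a-z]{4}")


def irregular_codes(semtypes):
    """Return (sorted) the codes that are not exactly four lowercase letters."""
    return sorted(s for s in semtypes if not FOUR_LETTER_CODE.fullmatch(s))


example = {"aapp", "dsyn", "C0030193", "C0030705", "gngm,aapp"}
print(irregular_codes(example))
```

The same call could be made on `subject_semtypes` and `object_semtypes`. It only flags candidates; deciding what to remove is still a manual step.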

### Remove semantic types that we don't want to make operations from

So there are some object semantic types that I'm going to remove...

In [9]:
object_semtypes

{'C0030193',
 'C0030705',
 'aapp',
 'acab',
 'acty',
 'aggp',
 'alga',
 'amas',
 'amph',
 'anab',
 'anim',
 'anst',
 'antb',
 'arch',
 'bacs',
 'bact',
 'bdsu',
 'bdsy',
 'bhvr',
 'biof',
 'bird',
 'blor',
 'bmod',
 'bodm',
 'bpoc',
 'bsoj',
 'carb',
 'celc',
 'celf',
 'cell',
 'cgab',
 'chem',
 'chvf',
 'chvs',
 'clas',
 'clna',
 'clnd',
 'comd',
 'diap',
 'dora',
 'drdd',
 'dsyn',
 'edac',
 'eehu',
 'eico',
 'elii',
 'emod',
 'emst',
 'enzy',
 'euka',
 'famg',
 'ffas',
 'fish',
 'fndg',
 'fngs',
 'food',
 'ftcn',
 'genf',
 'geoa',
 'gngm',
 'gngm,aapp',
 'gora',
 'grup',
 'hcpp',
 'hcro',
 'hlca',
 'hops',
 'horm',
 'humn',
 'idcn',
 'imft',
 'inbe',
 'inch',
 'inpo',
 'inpr',
 'invt',
 'irda',
 'lang',
 'lbpr',
 'lbtr',
 'lipd',
 'mamm',
 'mbrt',
 'mcha',
 'medd',
 'menp',
 'mnob',
 'mobd',
 'moft',
 'mosq',
 'neop',
 'nnon',
 'npop',
 'nsba',
 'nusq',
 'ocac',
 'ocdi',
 'opco',
 'orch',
 'orga',
 'orgf',
 'orgm',
 'orgt',
 'ortf',
 'patf',
 'phob',
 'phpr',
 'phsf',
 'phsu',
 'plnt

In [10]:
## make the set of stuff we want to remove
removal1 = set(["C0030193", "C0030705", "gngm,aapp", "podg,humn"])

## remove it from the data 
filtered_data = filtered_data[ ~ filtered_data['OBJECT_SEMTYPE'].isin(removal1)]

In [11]:
## look at the semantic types again after this removal
subject_semtypes = set(filtered_data["SUBJECT_SEMTYPE"].unique())
object_semtypes = set(filtered_data["OBJECT_SEMTYPE"].unique())
predicates = set(filtered_data["PREDICATE"].unique())

len(subject_semtypes) ## was 133, now 133: didn't change
len(object_semtypes)  ## was 135, now 131: decreased by 4 (expected)
len(predicates)       ## was 68, now 65: decreased by 3 (cool)

133

131

65

### Remove Predicates we don't want to make operations from

In [12]:
predicates

{'ADMINISTERED_TO',
 'AFFECTS',
 'ASSOCIATED_WITH',
 'AUGMENTS',
 'CAUSES',
 'COEXISTS_WITH',
 'COMPLICATES',
 'CONVERTS_TO',
 'DIAGNOSES',
 'DISRUPTS',
 'INHIBITS',
 'INTERACTS_WITH',
 'ISA',
 'LOCATION_OF',
 'MANIFESTATION_OF',
 'MEASUREMENT_OF',
 'MEASURES',
 'METHOD_OF',
 'NEG_ADMINISTERED_TO',
 'NEG_AFFECTS',
 'NEG_ASSOCIATED_WITH',
 'NEG_AUGMENTS',
 'NEG_CAUSES',
 'NEG_COEXISTS_WITH',
 'NEG_COMPLICATES',
 'NEG_CONVERTS_TO',
 'NEG_DIAGNOSES',
 'NEG_DISRUPTS',
 'NEG_INHIBITS',
 'NEG_INTERACTS_WITH',
 'NEG_ISA',
 'NEG_LOCATION_OF',
 'NEG_MANIFESTATION_OF',
 'NEG_MEASUREMENT_OF',
 'NEG_MEASURES',
 'NEG_METHOD_OF',
 'NEG_OCCURS_IN',
 'NEG_PART_OF',
 'NEG_PRECEDES',
 'NEG_PREDISPOSES',
 'NEG_PREVENTS',
 'NEG_PROCESS_OF',
 'NEG_PRODUCES',
 'NEG_STIMULATES',
 'NEG_TREATS',
 'NEG_USES',
 'NEG_higher_than',
 'NEG_lower_than',
 'NEG_same_as',
 'NOM',
 'OCCURS_IN',
 'PART_OF',
 'PRECEDES',
 'PREDISPOSES',
 'PREP',
 'PREVENTS',
 'PROCESS_OF',
 'PRODUCES',
 'STIMULATES',
 'TREATS',
 'USES',
 '

**PAUSE**

* Decide what predicates you want to remove

**Current logic**

I'm using biolink-model 3.5.3 right now. 

With biolink-model >= 3.5.1, Translator has reviewed some predicates for domain-predicate and predicate-range combo exclusions - ref: [Translator google group, google sheet link](https://docs.google.com/spreadsheets/d/1c1gx0Jgm9rJUOXcQhBtZgvx50Cvz1-jh0DdGtg1zcd8/edit#gid=1801185264).

We'll want to keep the predicates that have been reviewed and only do the custom removals below for predicates that:
* haven't been reviewed yet
* are problematic to use right now: 
    * Translator currently isn't great with negation, so removing those
    * Also some predicates seem to actually be practice phrases? see the [article](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-12-486#Sec26)
    * And I didn't find the following relationships useful: ISA (subclass relationship), lower_than / higher_than (these two entities were compared and 1 was better (higher) or worse (lower) than the other), and compared_with...
    
And we'll want to adjust the lists below if Translator reviews more predicates in the future. 

Old note on same_as (previously excluded): its meaning seems to be "equivalent / it did just as well as some other thing"
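
The negated predicates all start with `NEG_`, so that part of the removal list could be derived from the data rather than typed by hand. This is only a sketch of that alternative (`split_predicates` is a made-up name); it would pick up any new negated predicate in a future SEMMEDDB release automatically.

```python
def split_predicates(predicates):
    """Split predicates into negated ones and the rest, both sorted."""
    negated = sorted(p for p in predicates if p.startswith("NEG_"))
    others = sorted(p for p in predicates if not p.startswith("NEG_"))
    return negated, others


## a few predicates from the set printed above
example = {"TREATS", "NEG_TREATS", "ISA", "NEG_same_as", "PREP"}
negated, others = split_predicates(example)

print(negated)
print(others)
```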

In [13]:
negative_preds = ["NEG_ADMINISTERED_TO", "NEG_AFFECTS", "NEG_ASSOCIATED_WITH",
                  "NEG_AUGMENTS", "NEG_CAUSES", "NEG_COEXISTS_WITH", "NEG_COMPLICATES",
                  "NEG_CONVERTS_TO", "NEG_DIAGNOSES", "NEG_DISRUPTS", "NEG_INHIBITS",
                  "NEG_INTERACTS_WITH", "NEG_ISA", "NEG_LOCATION_OF", "NEG_MANIFESTATION_OF",
                  "NEG_MEASUREMENT_OF", "NEG_MEASURES", "NEG_METHOD_OF", "NEG_OCCURS_IN",
                  "NEG_PART_OF", "NEG_PRECEDES", "NEG_PREDISPOSES", "NEG_PREVENTS",
                  "NEG_PROCESS_OF", "NEG_PRODUCES", "NEG_STIMULATES", "NEG_TREATS", "NEG_USES",
                  "NEG_higher_than", "NEG_lower_than", "NEG_same_as"
                 ]

practice_phrase = ["VERB", "NOM", "PREP"]

dont_like = ["ISA", "lower_than", "higher_than", "compared_with"]

In [14]:
filtered_data = filtered_data[ ~ filtered_data['PREDICATE'].isin(negative_preds)]
filtered_data = filtered_data[ ~ filtered_data['PREDICATE'].isin(practice_phrase)]
filtered_data = filtered_data[ ~ filtered_data['PREDICATE'].isin(dont_like)]

now look at the stats. 

(interesting since the stats from the [official website](https://lhncbc.nlm.nih.gov/semanticnetwork/SemanticNetworkArchive.html) say 127 semantic types and 54 predicates. 54 / 2 = 27 predicates without negation + 1 for same_as inclusion)

In [15]:
## look at the semantic types again after this removal
subject_semtypes = set(filtered_data["SUBJECT_SEMTYPE"].unique())
object_semtypes = set(filtered_data["OBJECT_SEMTYPE"].unique())
predicates = set(filtered_data["PREDICATE"].unique())

len(subject_semtypes) ## was 133, now 127: decreased by 6
len(object_semtypes)  ## was 131, now 127: decreased by 4
len(predicates)       ## was 65, now 28: decreased by 37 (was expected)

127

127

28

So that's how many kinds of subjects, objects, and predicates we have going forward...

In [16]:
combos = filtered_data.value_counts().reset_index()

In [17]:
combos.shape
## so that's still a lot....
## without same_as, it was 14796

(15469, 4)

In [18]:
combos.head()

|   | SUBJECT_SEMTYPE | PREDICATE | OBJECT_SEMTYPE | count |
|---|---|---|---|---|
| 0 | dsyn | PROCESS_OF | humn | 2143816 |
| 1 | bpoc | PART_OF | mamm | 1159577 |
| 2 | bpoc | LOCATION_OF | neop | 1023005 |
| 3 | fndg | PROCESS_OF | humn | 1000549 |
| 4 | bpoc | LOCATION_OF | aapp | 953010 |


## Mapping and Specific Filtering

First, I'm going to map SEMMED semantic info to biolink-model. This is a legacy from when I used the semantic info + mapping to decide which semantic types + predicates to write x-bte annotations with. Now, I'll only make minimal adjustments.  

Then I'll use the Translator-curated exclusions to remove some semantic types, domain-predicate combos, and predicate-range combos. ref: [Translator google group, google sheet link](https://docs.google.com/spreadsheets/d/1c1gx0Jgm9rJUOXcQhBtZgvx50Cvz1-jh0DdGtg1zcd8/edit#gid=1801185264)
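
To make the idea of domain-predicate and predicate-range exclusions concrete, here is a rough sketch. Everything in it is an assumption for illustration: the exclusion pairs are made up, and the real curated sheet may be laid out differently and contain different entries.

```python
## hypothetical exclusions, NOT taken from the Translator sheet
excluded_domain_predicate = {("humn", "TREATS")}     ## (subject type, predicate)
excluded_predicate_range = {("PART_OF", "dsyn")}     ## (predicate, object type)


def is_allowed(subject, predicate, obj):
    """True when the triple hits neither exclusion set."""
    if (subject, predicate) in excluded_domain_predicate:
        return False
    if (predicate, obj) in excluded_predicate_range:
        return False
    return True


triples = [
    ("phsu", "TREATS", "dsyn"),
    ("humn", "TREATS", "dsyn"),
    ("bpoc", "PART_OF", "dsyn"),
]
allowed = [t for t in triples if is_allowed(*t)]
print(allowed)
```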

As a reminder:

* Get SEMMEDDB SRDEF file needed for interpreting and mapping SEMMED semantic types: 
  * download the compressed file [here](https://lhncbc.nlm.nih.gov/semanticnetwork/download.html) or pick the latest version [here](https://lhncbc.nlm.nih.gov/semanticnetwork/SemanticNetworkArchive.html)
  * information on the SRDEF file [here](https://www.ncbi.nlm.nih.gov/books/NBK9679/#ch05.sec5.2)

### Ingest SEMMED semantic info

In [19]:
srdef_location = pathlib.Path.home().joinpath(
            "Desktop", "RawDataFiles", "SEMMEDDB", "2020AA", "SRDEF")

In [20]:
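## NOTE: SRDEF has no header row, so header=0 appears to consume the file's first
##   record (likely T001, Organism, abbreviation orgm) as a header and drop it.
##   header=None would keep that record; the outputs below were produced with header=0.
srdef = pd.read_csv(srdef_location, sep="|", header=0, index_col=False,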
                    names=["Record Type (RT)",
                           "Unique Identifier (UI)",
                           "Full Name (STY/RL)", 
                           "Tree Number (STN/RTN)", 
                           "Definition (DEF)",
                           "Examples (EX)",
                           "Usage Note (UN)",
                           "Non-Human Flag (NH)",
                           "Abbreviation (ABR)",
                           "Inverse Relation (RIN)"])

In [21]:
srdef[srdef['Abbreviation (ABR)'] == 'aapp']

|   | Record Type (RT) | Unique Identifier (UI) | Full Name (STY/RL) | Tree Number (STN/RTN) | Definition (DEF) | Examples (EX) | Usage Note (UN) | Non-Human Flag (NH) | Abbreviation (ABR) | Inverse Relation (RIN) |
|---|---|---|---|---|---|---|---|---|---|---|
| 97 | STY | T116 | Amino Acid, Peptide, or Protein | A1.4.1.2.1.7 | Amino acids and chains of amino acids connecte... |  | When the concept is both an enzyme and a prote... |  | aapp |  |


### Semantic type processing

#### Add semantic types that are in the data but not in SRDEF

Yao [noticed](https://github.com/biothings/pending.api/issues/30#issuecomment-903609946) that the data file uses "old SEMMED semantic types" from 2013AA, and these semantic types didn't exist in the latest SRDEF file (2020AA).

In response, Andrew said to [keep](https://github.com/biothings/pending.api/issues/30#issuecomment-903879782) this data

In [22]:
## simplify the df
SEMMED_entity_types = srdef[srdef['Record Type (RT)'] == 'STY'].copy()
SEMMED_entity_types = SEMMED_entity_types[['Unique Identifier (UI)',
                                           'Full Name (STY/RL)', 
                                           'Abbreviation (ABR)']].copy()
SEMMED_entity_types.columns = ['UI', 'FullName', "Abbrev"]
SEMMED_entity_types.sort_values(by='Abbrev', inplace=True)

## quick view
SEMMED_entity_types.shape
SEMMED_entity_types[0:10]

(126, 3)

|   | UI | FullName | Abbrev |
|---|---|---|---|
| 97 | T116 | Amino Acid, Peptide, or Protein | aapp |
| 15 | T020 | Acquired Abnormality | acab |
| 44 | T052 | Activity | acty |
| 90 | T100 | Age Group | aggp |
| 77 | T087 | Amino Acid Sequence | amas |
| 6 | T011 | Amphibian | amph |
| 115 | T190 | Anatomical Abnormality | anab |
| 4 | T008 | Animal | anim |
| 12 | T017 | Anatomical Structure | anst |
| 119 | T195 | Antibiotic | antb |


In [23]:
## get all semantic types in the data
data_types = subject_semtypes.union(object_semtypes)
only_in_data = data_types - set(SEMMED_entity_types['Abbrev'])

print("these semantic types are in the data, but not in SRDEF:")
only_in_data

len(only_in_data)

these semantic types are in the data, but not in SRDEF:


{'alga',
 'carb',
 'eico',
 'invt',
 'lipd',
 'nsba',
 'opco',
 'orgm',
 'rich',
 'strd'}

10

I found info on these semantic types in [previous Semantic Network versions](https://lhncbc.nlm.nih.gov/semanticnetwork/SemanticNetworkArchive.html):

was in 2009AB file (sets with SG_2?)
* alga
* invt
* rich

was in 2014AB file (sets with SG_3?)
* carb 
* eico 
* lipd 
* nsba 
* opco
* orgm
* strd

In [24]:
## build the missing info and add to the table
missing_semantic_types = [{"UI": "T003", "FullName": "Alga", "Abbrev": "alga"},
                          {"UI": "T118", "FullName": "Carbohydrate", "Abbrev": "carb"},
                          {"UI": "T111", "FullName": "Eicosanoid", "Abbrev": "eico"},
                          {"UI": "T009", "FullName": "Invertebrate", "Abbrev": "invt"},
                          {"UI": "T119", "FullName": "Lipid", "Abbrev": "lipd"},
                          {"UI": "T124", "FullName": "Neuroreactive Substance or Biogenic Amine", "Abbrev": "nsba"},
                          {"UI": "T115", "FullName": "Organophosphorus Compound", "Abbrev": "opco"},
                          {"UI": "T001", "FullName": "Organism", "Abbrev": "orgm"},
                          {"UI": "T006", "FullName": "Rickettsia or Chlamydia", "Abbrev": "rich"},
                          {"UI": "T110", "FullName": "Steroid", "Abbrev": "strd"}
                         ]

missing_semantic_types = pd.DataFrame.from_records(missing_semantic_types)

## add it to the table
## SEMMED_entity_types = SEMMED_entity_types.append(missing_semantic_types)

SEMMED_entity_types = pd.concat([SEMMED_entity_types, missing_semantic_types])

In [25]:
SEMMED_entity_types.sort_values(by='Abbrev', inplace=True)

## quick view
SEMMED_entity_types.shape
SEMMED_entity_types[0:10]

(136, 3)

|   | UI | FullName | Abbrev |
|---|---|---|---|
| 97 | T116 | Amino Acid, Peptide, or Protein | aapp |
| 15 | T020 | Acquired Abnormality | acab |
| 44 | T052 | Activity | acty |
| 90 | T100 | Age Group | aggp |
| 0 | T003 | Alga | alga |
| 77 | T087 | Amino Acid Sequence | amas |
| 6 | T011 | Amphibian | amph |
| 115 | T190 | Anatomical Abnormality | anab |
| 4 | T008 | Animal | anim |
| 12 | T017 | Anatomical Structure | anst |


#### Get the biolink-mappings

In [26]:
## as of 2023-08-2: using default which is 3.5.3 
bmt_tool = Toolkit()

In [27]:
bmt_tool.get_element_by_mapping('STY:T123').title()

'Small Molecule'

In [28]:
## getting biolink-mapping, in the format needed to create operations
SEMMED_entity_types['BiolinkMapping'] = [bmt_tool.get_element_by_mapping('STY:'+i)
                                         for i in SEMMED_entity_types['UI']]
## put these node categories/semantic-types in the correct format: PascalCase
SEMMED_entity_types['BiolinkMapping'] = [i.title().replace(" ", "") if isinstance(i, str)
                                         else i for i in SEMMED_entity_types['BiolinkMapping']]
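
The conversion above can be tried on plain strings to see what it does. This sketch wraps the same `.title().replace(" ", "")` logic in a made-up helper, `to_pascal_case`. One caveat worth knowing: `str.title()` also lowercases the rest of each word, so a name containing an acronym would not keep its capitals.

```python
def to_pascal_case(name):
    """Convert a space-separated name to PascalCase; pass non-strings through."""
    if isinstance(name, str):
        return name.title().replace(" ", "")
    return name


print(to_pascal_case("small molecule"))
print(to_pascal_case("population of individual organisms"))
print(to_pascal_case(None))            ## missing mappings stay missing
print(to_pascal_case("RNA product"))   ## acronym capitals are lost by title()
```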

In [29]:
SEMMED_entity_types

|   | UI | FullName | Abbrev | BiolinkMapping |
|---|---|---|---|---|
| 97 | T116 | Amino Acid, Peptide, or Protein | aapp | Polypeptide |
| 15 | T020 | Acquired Abnormality | acab | Disease |
| 44 | T052 | Activity | acty | Activity |
| 90 | T100 | Age Group | aggp | Cohort |
| 0 | T003 | Alga | alga |  |
| ... | ... | ... | ... | ... |
| 70 | T079 | Temporal Concept | tmco | InformationContentEntity |
| 53 | T061 | Therapeutic or Preventive Procedure | topp | Procedure |
| 2 | T005 | Virus | virs | Virus |
| 104 | T127 | Vitamin | vita | SmallMolecule |


#### Clean up to only have terms in the data

In [30]:
print("note that these semantic types are in SRDEF but not in the data")
set(SEMMED_entity_types['Abbrev']) - data_types

note that these semantic types are in SRDEF but not in the data


{'cnce', 'crbs', 'enty', 'grpa', 'lang', 'phob'}

In [31]:
## get the subset of semantic network terms that are actually in the data
SEMMED_entity_types = SEMMED_entity_types[SEMMED_entity_types['Abbrev'].isin(data_types)].copy()

In [32]:
SEMMED_entity_types

|   | UI | FullName | Abbrev | BiolinkMapping |
|---|---|---|---|---|
| 97 | T116 | Amino Acid, Peptide, or Protein | aapp | Polypeptide |
| 15 | T020 | Acquired Abnormality | acab | Disease |
| 44 | T052 | Activity | acty | Activity |
| 90 | T100 | Age Group | aggp | Cohort |
| 0 | T003 | Alga | alga |  |
| ... | ... | ... | ... | ... |
| 70 | T079 | Temporal Concept | tmco | InformationContentEntity |
| 53 | T061 | Therapeutic or Preventive Procedure | topp | Procedure |
| 2 | T005 | Virus | virs | Virus |
| 104 | T127 | Vitamin | vita | SmallMolecule |


#### review mappings

**PAUSE**

We are still reviewing / changing mappings, since BTE's responses would probably differ a lot if this behavior was changed. 

However, we are now removing ONLY semantic-types that lack biolink-model mappings. Other curated semantic-type removals will be done in a later section using the Translator exclusions. ref: [Translator google group, google sheet link](https://docs.google.com/spreadsheets/d/1c1gx0Jgm9rJUOXcQhBtZgvx50Cvz1-jh0DdGtg1zcd8/edit#gid=1801185264).

I'm going to keep the old notes + code (commented out), just in case. 

---

Old notes:

This is a place to **STOP** and review all the SEMMED semantic types and their mapping to biolink semantic types...to decide what we are interested in keeping. This involves some knowledge of what biolink semantic types are prioritized in Translator. 

One can use the definitions of the SEMMED semantic types (from the SRDEF file or [browsing the UMLS vocab online](https://uts.nlm.nih.gov/uts/umls/semantic-network/root)) and the definitions of biolink semantic types (look for a comment with 'THINGS' in the biolink-model yaml file)

See the last section (section 6) for notes on decisions that were made...

Sections 3.5 + 3.6 below involve this review, changing mappings, and removing some SEMMED semantic types

In [33]:
## code used to review 
SEMMED_entity_types['BiolinkMapping'].unique()

array(['Polypeptide', 'Disease', 'Activity', 'Cohort', None,
       'Invertebrate', 'OrganismalEntity', 'AnatomicalEntity', 'Drug',
       'SmallMolecule', 'Bacterium', 'Behavior', 'Phenomenon', 'Device',
       'GrossAnatomicalStructure', 'CellularComponent',
       'PhysiologicalProcess', 'Cell', 'ChemicalEntity',
       'InformationContentEntity', 'ClinicalAttribute', 'Procedure',
       'BiologicalEntity', 'Protein', 'Event',
       'DiseaseOrPhenotypicFeature', 'Fungus', 'Food',
       'GeographicLocation', 'GenomicEntity', 'Agent', 'Human',
       'PathologicalProcess', 'Publication', 'DiagnosticAid', 'Mammal',
       'PhysicalEntity', 'MolecularActivity', 'MolecularEntity',
       'NucleicAcidEntity', 'OrganismAttribute', 'Plant',
       'PopulationOfIndividualOrganisms', 'PhenotypicFeature', 'Virus',
       'Vertebrate'], dtype=object)
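
The unique values show which biolink categories are in use, but not which SEMMED semantic types sit under each one. For the review it can help to group the abbreviations by mapping. A minimal sketch with a few pairs taken from the tables in this notebook; `by_mapping` is a made-up name.

```python
from collections import defaultdict

## (abbreviation, biolink mapping) pairs copied from the tables in this notebook
pairs = [
    ("aapp", "Polypeptide"),
    ("amas", "Polypeptide"),
    ("acab", "Disease"),
    ("alga", None),
]

by_mapping = defaultdict(list)
for abbrev, mapping in pairs:
    by_mapping[mapping].append(abbrev)

for mapping, abbrevs in by_mapping.items():
    print(mapping, abbrevs)
```

On the real table, `pairs` could be built with `zip(SEMMED_entity_types['Abbrev'], SEMMED_entity_types['BiolinkMapping'])`.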

In [34]:
# SEMMED_entity_types[SEMMED_entity_types['BiolinkMapping'] == 'Vertebrate']
SEMMED_entity_types[SEMMED_entity_types['BiolinkMapping'].isna()]

|   | UI | FullName | Abbrev | BiolinkMapping |
|---|---|---|---|---|
| 0 | T003 | Alga | alga |  |
| 118 | T194 | Archaeon | arch |  |
| 17 | T022 | Body System | bdsy |  |
| 7 | T012 | Bird | bird |  |
| 81 | T091 | Biomedical Occupation or Discipline | bmod |  |
| 98 | T120 | Chemical Viewed Functionally | chvf |  |
| 94 | T104 | Chemical Viewed Structurally | chvs |  |
| 125 | T204 | Eukaryote | euka |  |
| 16 | T021 | Fully Formed Anatomical Structure | ffas |  |
| 8 | T013 | Fish | fish |  |


Still re-mapping (not following a strict biolink-model mapping). May revisit later, but concerned about affecting responses to creative-mode queries / requiring template adjustments...

In [35]:
## re-mapping based on putting IDs into normalization service / our operation system...
##   we have UMLS for Disease (mydisease, mychem), SmallMolecule (idisk)

## leaving aapp as Polypeptide, enzy mapped to Protein
## leaving Disease mappings as-is, but some seem like they could be PathologicalAnatomicalStructure 
##   (acab, anab, cgab) or PathologicalProcess (comd) instead
## left clnd (Clinical Drug) as Drug since it really seemed like a drug (dosage)

## currently mapped to Drug, but re-mapping here
SEMMED_entity_types.loc[
    (SEMMED_entity_types['Abbrev'] == 'antb'),'BiolinkMapping'] = 'SmallMolecule'  ## Antibiotic
SEMMED_entity_types.loc[
    (SEMMED_entity_types['Abbrev'] == 'phsu'),'BiolinkMapping'] = 'SmallMolecule'  ## Pharmacologic Substance

## currently mapped to GenomicEntity, but re-mapping here
SEMMED_entity_types.loc[
    (SEMMED_entity_types['Abbrev'] == 'gngm'),'BiolinkMapping'] = 'Gene'

## currently mapped to NucleicAcidEntity, but re-mapping here
SEMMED_entity_types.loc[
    (SEMMED_entity_types['Abbrev'] == 'nnon'),'BiolinkMapping'] = 'SmallMolecule'  ## Nucleic Acid, Nucleoside, or Nucleotide

We are removing semantic-types that lack biolink-model mappings.

In [39]:
currently_unmapped_types = set(SEMMED_entity_types[SEMMED_entity_types['BiolinkMapping'].isna()].Abbrev)

In [40]:
currently_unmapped_types

{'alga',
 'arch',
 'bdsy',
 'bird',
 'bmod',
 'chvf',
 'chvs',
 'euka',
 'ffas',
 'fish',
 'genf',
 'invt',
 'ocdi',
 'orgm',
 'rept',
 'rich',
 'sbst'}

Commenting out: we aren't removing semantic-types here anymore, because we will do so in a later section using the Translator exclusions. ref: [Translator google group, google sheet link](https://docs.google.com/spreadsheets/d/1c1gx0Jgm9rJUOXcQhBtZgvx50Cvz1-jh0DdGtg1zcd8/edit#gid=1801185264).

In [38]:
# currently_unused_entities = [
#     ## missing mappings
#     "arch", "bdsy", "bird", "bmod", "chvf", "chvs", "euka", "ffas", 
#     "fish", "genf", "ocdi", "rept", "sbst", 
#     ## old terms here
#     "alga", "invt", "orgm", "rich",
#     ## mapped to Polypeptide
#     "amas",     
#     ## all that are mapped to Activity
#     "acty", "dora", "edac", "gora", "hlca", "mcha", "ocac", "resa",  
#     ## all that are mapped to Cohort
#     "aggp", "famg", "podg", "prog",  
#     ## all that are mapped to Invertebrate
#     "amph",
#     ## all that are mapped to OrganismalEntity
#     "anim", 
#     ## all that are mapped to Anatomical Entity
#     "anst", "bdsu", "blor", "bsoj", 
#     ## mapped to SmallMolecule
#     "elii",
#     ## mapped to Bacterium
#     "bact",
#     ## all that are mapped to Behavior
#     "bhvr", "inbe", "menp", "socb",
#     ## all that are mapped to Phenomenon
#     "biof", "eehu", "hcpp", "lbtr", "npop", "phpr",
#     ## all that are mapped to Device
#     "bodm", "drdd", "medd", "resd",
#     ## mapped to GrossAnatomicalStructure
#     "emst", "tisu", 
#     ## mapped to ChemicalEntity
#     "chem",
#     ## all that are mapped to InformationContentEntity
#     "clas", "ftcn", "idcn", "qlco", "qnco", "rnlw", "spco", "tmco",
#     ## all that are mapped to ClinicalAttribute
#     "clna",
#     ## mapped to Procedure
#     "lbpr", "mbrt", 
#     ## mapped to BiologicalEntity
#     "emod",
#     ## mapped to Protein
#     "rcpt",
#     ## all that are mapped to Event
#     "evnt",
#     ## all that are mapped to DiseaseOrPhenotypicFeature
#     "fndg",
#     ## all that are mapped to Fungus
#     "fngs",
#     ## all that are mapped to GeographicLocation
#     "geoa",
#     ## all that are mapped to Agent
#     "grup", "hcro", "orgt", "pros", "shro",
#     ## all that are mapped to Human
#     "humn", 
#     ## all that are mapped to Publication
#     "inpr",
#     ## all that are mapped to DiagnosticAid
#     "irda", 
#     ## all that are mapped to Mammal
#     "mamm",  
#     ## all that are mapped to PhysicalEntity
#     "mnob",
#     ## all that are mapped to MolecularEntity
#     "mosq",
#     ## mapped to NucleicAcidEntity
#     "nusq",
#     ## all that are mapped to OrganismAttribute
#     "orga",
#     ## all that are mapped to Plant
#     "plnt", 
#     ## all that are mapped to PopulationOfIndividualOrganisms
#     "popg",
#     ## all that are mapped to Virus
#     "virs",
#     ## all that are mapped to Vertebrate
#     "vtbt"
# ]

#### prune data based on what node entities to remove

We are removing semantic-types that lack biolink-model mappings.

In [44]:
currently_unmapped_types
len(currently_unmapped_types)

{'alga',
 'arch',
 'bdsy',
 'bird',
 'bmod',
 'chvf',
 'chvs',
 'euka',
 'ffas',
 'fish',
 'genf',
 'invt',
 'ocdi',
 'orgm',
 'rept',
 'rich',
 'sbst'}

17

In [42]:
## prune this doc
SEMMED_entity_types = SEMMED_entity_types[ ~ SEMMED_entity_types['Abbrev'].isin(currently_unmapped_types)]

## prune the data doc
filtered_data = filtered_data[ ~ filtered_data['SUBJECT_SEMTYPE'].isin(currently_unmapped_types)]
filtered_data = filtered_data[ ~ filtered_data['OBJECT_SEMTYPE'].isin(currently_unmapped_types)]

In [43]:
## look at the semantic types again after this removal
subject_semtypes = set(filtered_data["SUBJECT_SEMTYPE"].unique())
object_semtypes = set(filtered_data["OBJECT_SEMTYPE"].unique())
predicates = set(filtered_data["PREDICATE"].unique())

len(subject_semtypes) ## was 127, now 112: decreased by 15
len(object_semtypes)  ## was 127, now 110: decreased by 17 (number of unmapped entities)
len(predicates)       ## was 28, still 28 (expected)

112

110

28

In [45]:
SEMMED_entity_types

|   | UI | FullName | Abbrev | BiolinkMapping |
|---|---|---|---|---|
| 97 | T116 | Amino Acid, Peptide, or Protein | aapp | Polypeptide |
| 15 | T020 | Acquired Abnormality | acab | Disease |
| 44 | T052 | Activity | acty | Activity |
| 90 | T100 | Age Group | aggp | Cohort |
| 77 | T087 | Amino Acid Sequence | amas | Polypeptide |
| ... | ... | ... | ... | ... |
| 70 | T079 | Temporal Concept | tmco | InformationContentEntity |
| 53 | T061 | Therapeutic or Preventive Procedure | topp | Procedure |
| 2 | T005 | Virus | virs | Virus |
| 104 | T127 | Vitamin | vita | SmallMolecule |


### Predicate processing

#### getting mappings for predicates and qualifiers

In [46]:
SEMMED_predicates = srdef[srdef['Record Type (RT)'] == 'RL'].copy()

SEMMED_predicates = SEMMED_predicates[['Full Name (STY/RL)'
                                      ]].copy()
SEMMED_predicates.columns = ['SemmedPred']
SEMMED_predicates.sort_values(by='SemmedPred', inplace=True)

**PAUSE**

The SRDEF file might be missing predicates that are in the data, so check that using the code block below and see if you would like to add those back in...

In [47]:
## predicates currently in the filtered data

missing_preds = list()

for i in predicates:
    if i.lower() not in set(SEMMED_predicates['SemmedPred']):
        missing_preds.append(i.lower())

missing_preds.sort()
missing_preds

# [i for i in predicates if i.lower() in SEMMED_predicates['SemmedName']]

['administered_to',
 'augments',
 'coexists_with',
 'converts_to',
 'inhibits',
 'predisposes',
 'same_as',
 'stimulates']

Currently we add them back in 

In [48]:
## manually add some predicates that are in the data, but not in SRDEF...

SEMMED_predicates.loc[len(SEMMED_predicates.index)] = ['administered_to']
SEMMED_predicates.loc[len(SEMMED_predicates.index)] = ['augments']
## notice that the data doesn't have a hyphen for coexists...even though SRDEF has hyphen for co-occurs_with
SEMMED_predicates.loc[len(SEMMED_predicates.index)] = ['coexists_with']
SEMMED_predicates.loc[len(SEMMED_predicates.index)] = ['converts_to']
SEMMED_predicates.loc[len(SEMMED_predicates.index)] = ['inhibits']
SEMMED_predicates.loc[len(SEMMED_predicates.index)] = ['predisposes']
SEMMED_predicates.loc[len(SEMMED_predicates.index)] = ['same_as']
SEMMED_predicates.loc[len(SEMMED_predicates.index)] = ['stimulates']

SEMMED_predicates.sort_values(by='SemmedPred', inplace=True)
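
A note on the append pattern used above: `.loc[len(df.index)] = ...` writes to the integer label equal to the current row count. It adds a new row only while that label is not already in the index; if it were, the existing row would be overwritten instead. A `pd.concat` with `ignore_index=True` avoids depending on the labels. The sketch below uses a toy stand-in table, not the real `SEMMED_predicates`, and it renumbers the index, so the index values would differ from the outputs shown in this notebook.

```python
import pandas as pd

## toy stand-in for SEMMED_predicates; the index labels are arbitrary on purpose
toy = pd.DataFrame({"SemmedPred": ["affects", "treats"]}, index=[145, 148])

## predicates to add, as their own small table
extra = pd.DataFrame({"SemmedPred": ["augments", "inhibits"]})

combined = (
    pd.concat([toy, extra], ignore_index=True)
    .sort_values(by="SemmedPred")
    .reset_index(drop=True)
)
print(combined["SemmedPred"].tolist())
```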

In [49]:
## remove hyphens, since the biolink-model mappings don't use them
SEMMED_predicates['Semmed_in_BM']= ['SEMMEDDB:'+i 
                                   for i in SEMMED_predicates['SemmedPred'].str.replace("-", "")]
SEMMED_predicates['Semmed_in_BM'] = SEMMED_predicates['Semmed_in_BM'].str.upper()

## Specifically change same_as to be lower-case for the mapping process
SEMMED_predicates.loc[
    (SEMMED_predicates['SemmedPred'] == 'same_as'),'Semmed_in_BM'] = 'SEMMEDDB:same_as'

SEMMED_predicates

|   | SemmedPred | Semmed_in_BM |
|---|---|---|
| 162 | adjacent_to | SEMMEDDB:ADJACENT_TO |
| 54 | administered_to | SEMMEDDB:ADMINISTERED_TO |
| 145 | affects | SEMMEDDB:AFFECTS |
| 176 | analyzes | SEMMEDDB:ANALYZES |
| 158 | assesses_effect_of | SEMMEDDB:ASSESSES_EFFECT_OF |
| ... | ... | ... |
| 130 | temporally_related_to | SEMMEDDB:TEMPORALLY_RELATED_TO |
| 166 | traverses | SEMMEDDB:TRAVERSES |
| 148 | treats | SEMMEDDB:TREATS |
| 178 | tributary_of | SEMMEDDB:TRIBUTARY_OF |
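
The rules used to build the `Semmed_in_BM` column can be summarized in one small function. This is just a restatement of the cell above as a sketch (`to_semmed_curie` is a made-up name): drop hyphens, upper-case, add the `SEMMEDDB:` prefix, and keep `same_as` lower-case as the one exception.

```python
def to_semmed_curie(predicate):
    """Build the SEMMEDDB identifier used to look up a biolink-model mapping."""
    if predicate == "same_as":
        ## the one predicate kept lower-case for the mapping lookup
        return "SEMMEDDB:same_as"
    return "SEMMEDDB:" + predicate.replace("-", "").upper()


print(to_semmed_curie("co-occurs_with"))
print(to_semmed_curie("treats"))
print(to_semmed_curie("same_as"))
```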


In [50]:
## getting the mapping
SEMMED_predicates['BMPred'] = [bmt_tool.get_element_by_mapping(i)
                                       for i in SEMMED_predicates['Semmed_in_BM']]

In [51]:
## remember there are 28 predicates in the data
len(predicates)
pred_compare = [i.lower() for i in predicates]

SEMMED_predicates[SEMMED_predicates['SemmedPred'].isin(pred_compare)]

28

Unnamed: 0,SemmedPred,Semmed_in_BM,BMPred
54,administered_to,SEMMEDDB:ADMINISTERED_TO,related to
145,affects,SEMMEDDB:AFFECTS,affects
160,associated_with,SEMMEDDB:ASSOCIATED_WITH,related to
55,augments,SEMMEDDB:AUGMENTS,
141,causes,SEMMEDDB:CAUSES,causes
56,coexists_with,SEMMEDDB:COEXISTS_WITH,coexists with
143,complicates,SEMMEDDB:COMPLICATES,exacerbates
57,converts_to,SEMMEDDB:CONVERTS_TO,derives into
157,diagnoses,SEMMEDDB:DIAGNOSES,diagnoses
140,disrupts,SEMMEDDB:DISRUPTS,disrupts


**PAUSE**

Review the mappings in the table printed by the previous code chunk. 

The mappings for several SEMMEDDB predicates are missing. 

I'll add the mappings for some predicates below:
* augments
* inhibits
* stimulates

I'll keep some as missing (mapping doesn't exist in biolink-model):
* measurement_of 
* method_of

Notes:
* semmeddb `administered_to` is "trying something out with a goal", not that it actually does something...see [definition](https://uts.nlm.nih.gov/uts/umls/semantic-network/T154)
* semmeddb `diagnoses` is more of a "this thing distinguishes / identifies this other thing"...see [definition](https://uts.nlm.nih.gov/uts/umls/semantic-network/T163)
* semmeddb `process_of` does seem to map well to biolink-model's `occurs in` so...okay. As an example, see the use of `process_of` in the [API's data](https://biothings.ncats.io/semmeddb/query?q=predicate:PROCESS_OF%20AND%20object.semantic_type_abbreviation:cell)
* semmeddb `treats` is more general than biolink-model's `treats`, but I'm keeping the mapping. It's more of "trying this out as a treatment". See [definition](https://uts.nlm.nih.gov/uts/umls/semantic-network/T154)

Starting with the biolink 3.1.1 update, we:
* map semmeddb `augments` and `stimulates` to biolink's `affects` with qualifiers: 
  * qualified_predicate: causes
  * object_aspect_qualifier: activity_or_abundance
  * object_direction_qualifier: increased
* map semmeddb `inhibits` to biolink's `affects` with qualifiers: 
  * qualified_predicate: causes
  * object_aspect_qualifier: activity_or_abundance
  * object_direction_qualifier: decreased

And for the reverse of these operations with qualifiers, we'll want to get the inverted qualified_predicates and we'll use "subject_" rather than "object_" qualifiers (but the values will remain the same). 
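
The forward and reverse qualifier sets described above can be sketched as a small lookup. The names `FORWARD_QUALIFIERS`, `INVERSE_QUALIFIED_PRED`, and `reverse_qualifiers` are hypothetical and exist only for this illustration; the notebook itself stores these values as table columns.

```python
# Hypothetical lookup of the forward (subject -> object) qualifier sets listed above.
FORWARD_QUALIFIERS = {
    "augments": {
        "qualified_predicate": "causes",
        "object_aspect_qualifier": "activity_or_abundance",
        "object_direction_qualifier": "increased",
    },
    "stimulates": {
        "qualified_predicate": "causes",
        "object_aspect_qualifier": "activity_or_abundance",
        "object_direction_qualifier": "increased",
    },
    "inhibits": {
        "qualified_predicate": "causes",
        "object_aspect_qualifier": "activity_or_abundance",
        "object_direction_qualifier": "decreased",
    },
}

# Assumption: "causes" is the only qualified predicate in use, so its inverse is hard-coded.
INVERSE_QUALIFIED_PRED = {"causes": "caused_by"}


def reverse_qualifiers(forward):
    """Build reverse-operation qualifiers: invert the qualified predicate,
    switch "object_" keys to "subject_" keys, and keep the values unchanged."""
    reverse = {}
    for key, value in forward.items():
        if key == "qualified_predicate":
            reverse[key] = INVERSE_QUALIFIED_PRED[value]
        else:
            reverse[key.replace("object_", "subject_", 1)] = value
    return reverse


print(reverse_qualifiers(FORWARD_QUALIFIERS["inhibits"]))
```

Keeping the inverse of the qualified predicate in one place would make it easier to review if more qualified predicates are added later.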

In [52]:
## complex change: modify table to use qualifiers for semmeddb augments, stimulates, and inhibits

## At the moment, only need the following qualifiers added, to map these predicates
##   QualPred = qualified_predicate
##   ObjAsp = object_aspect_qualifier
##   ObjDirect = object_direction_qualifier
##   Inv_QualPred = for reverse operations, inverse of qualified_predicate
##   Inv_SubjAsp = for reverse operations, subject_aspect_qualifier
##   Inv_SubjDirect = for reverse operations, subject_direction_qualifier


## First, add the columns. Most values will stay blank (None)
SEMMED_predicates['BMQualPred'] = None  
SEMMED_predicates['BMObjAsp'] = None 
SEMMED_predicates['BMObjDirect'] = None
SEMMED_predicates['BM_Inv_QualPred'] = None  
SEMMED_predicates['BM_Inv_SubjAsp'] = None 
SEMMED_predicates['BM_Inv_SubjDirect'] = None

## Next, mutate the rows for biolink predicates being remapped to predicates + qualifiers
for index in SEMMED_predicates.index:
    ## semmeddb augments and stimulates should change
    if (SEMMED_predicates.loc[index,'SemmedPred']=='augments') or \
    (SEMMED_predicates.loc[index,'SemmedPred']=='stimulates'):
        SEMMED_predicates.loc[index,'BMPred'] = 'affects'
        SEMMED_predicates.loc[index,'BMQualPred'] = 'causes'
        SEMMED_predicates.loc[index,'BMObjAsp'] = 'activity_or_abundance'
        SEMMED_predicates.loc[index,'BMObjDirect'] = 'increased'
        SEMMED_predicates.loc[index,'BM_Inv_QualPred'] = 'caused_by'
        SEMMED_predicates.loc[index,'BM_Inv_SubjAsp'] = 'activity_or_abundance'
        SEMMED_predicates.loc[index,'BM_Inv_SubjDirect'] = 'increased'
    ## semmeddb inhibits should change
    elif SEMMED_predicates.loc[index,'SemmedPred']=='inhibits':
        SEMMED_predicates.loc[index,'BMPred'] = 'affects'
        SEMMED_predicates.loc[index,'BMQualPred'] = 'causes'
        SEMMED_predicates.loc[index,'BMObjAsp'] = 'activity_or_abundance'
        SEMMED_predicates.loc[index,'BMObjDirect'] = 'decreased'
        SEMMED_predicates.loc[index,'BM_Inv_QualPred'] = 'caused_by'
        SEMMED_predicates.loc[index,'BM_Inv_SubjAsp'] = 'activity_or_abundance'
        SEMMED_predicates.loc[index,'BM_Inv_SubjDirect'] = 'decreased'

Clean up table to only include what we have biolink-model mappings for: this removes measurement_of and method_of predicates. 

In [53]:
## only keep the stuff we found biolink mappings for
SEMMED_predicates = SEMMED_predicates[SEMMED_predicates['BMPred'].notna()].copy()

Now to get the inverse predicates. With biolink-model-toolkit > 0.9.0 and biolink-model > 3.1.1, I've encountered issues retrieving inverse predicates. 
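
One way to guard against these retrieval failures is a small wrapper that treats both a `None` result and an `AttributeError` as "no inverse". This is a sketch only: `safe_inverse`, `StubToolkit`, and `StubElement` are hypothetical names, and the stub merely imitates the behaviour recorded in this notebook's comments rather than the real toolkit. The wrapper calls only `get_element(...)`, `.symmetric`, and `get_inverse(...)`, which the notebook already uses.

```python
class StubElement:
    """Hypothetical stand-in for a biolink-model element (illustration only)."""

    def __init__(self, symmetric=False):
        self.symmetric = symmetric


class StubToolkit:
    """Hypothetical stand-in that imitates the failure modes noted in this notebook."""

    _inverses = {"causes": "caused by"}
    _symmetric = {"related to"}

    def get_element(self, name):
        return StubElement(symmetric=name in self._symmetric)

    def get_inverse(self, name):
        if name == "subclass_of":
            # imitates the AttributeError seen in the notebook's tests
            raise AttributeError("'NoneType' object has no attribute 'name'")
        return self._inverses.get(name)


def safe_inverse(toolkit, predicate, missing="MISSING"):
    """Return the predicate itself if symmetric, its inverse if one is found,
    and the `missing` marker otherwise (including when the lookup raises)."""
    element = toolkit.get_element(predicate)
    if element is not None and getattr(element, "symmetric", False):
        return predicate
    try:
        inverse = toolkit.get_inverse(predicate)
    except AttributeError:
        inverse = None
    return inverse if inverse else missing


toolkit = StubToolkit()
for name in ["related to", "causes", "subclass of", "subclass_of"]:
    print(name, "->", safe_inverse(toolkit, name))
```

Compared with a nested conditional expression, the wrapper looks up each inverse once and keeps the `MISSING` marker in a single place.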

In [54]:
## Testing methods of getting inverse
# print(bmt_tool.get_element('causes').inverse)  ## get None
# print(bmt_tool.get_inverse('subclass of'))   ## get None
# print(bmt_tool.get_inverse('subclass_of'))   ## get AttributeError: 'NoneType' object has no attribute 'name'
# print(bmt_tool.get_inverse('quantifier_qualifier'))    ## also get AttributeError

# for i in SEMMED_predicates['BMPred']:
#     print(i)
#     if bmt_tool.get_element(i).symmetric:
#         print('is symmetrical')
#     elif bmt_tool.get_inverse(i):
#         print('inverse: ' + bmt_tool.get_inverse(i))
#     else:
#         print('missing inverse')
#     print('\n')

In [55]:
SEMMED_predicates['BM_Inv_Pred'] = [i if bmt_tool.get_element(i).symmetric
                                      else bmt_tool.get_inverse(i) if bmt_tool.get_inverse(i)
                                      else 'MISSING' 
                                    for i in SEMMED_predicates['BMPred']]

SEMMED_predicates = SEMMED_predicates.reindex(columns=[
    'SemmedPred', 'Semmed_in_BM', 'BMPred', 'BMQualPred', 'BMObjAsp', 'BMObjDirect', 
    'BM_Inv_Pred', 'BM_Inv_QualPred', 'BM_Inv_SubjAsp', 'BM_Inv_SubjDirect'])

## old code that doesn't work anymore, was previously used to get inverses for predicates and qualified predicates
## get inverses for basic predicates + qualified_predicates to generate reverse operations
# SEMMED_predicates['BM_Inv_Pred'] = [bmt_tool.get_element(i).inverse 
#                                        if isinstance(bmt_tool.get_element(i).inverse, str)
#                                        else i 
#                                        for i in SEMMED_predicates['BMPred']]

# SEMMED_predicates['BM_Inv_QualPred'] = [bmt_tool.get_element(i).inverse 
#                                        if i and isinstance(bmt_tool.get_element(i).inverse, str)
#                                        else i 
#                                        for i in SEMMED_predicates['BMQualPred']]

In [56]:
SEMMED_predicates

Unnamed: 0,SemmedPred,Semmed_in_BM,BMPred,BMQualPred,BMObjAsp,BMObjDirect,BM_Inv_Pred,BM_Inv_QualPred,BM_Inv_SubjAsp,BM_Inv_SubjDirect
54,administered_to,SEMMEDDB:ADMINISTERED_TO,related to,,,,related to,,,
145,affects,SEMMEDDB:AFFECTS,affects,,,,affected by,,,
160,associated_with,SEMMEDDB:ASSOCIATED_WITH,related to,,,,related to,,,
55,augments,SEMMEDDB:AUGMENTS,affects,causes,activity_or_abundance,increased,affected by,caused_by,activity_or_abundance,increased
141,causes,SEMMEDDB:CAUSES,causes,,,,caused by,,,
56,coexists_with,SEMMEDDB:COEXISTS_WITH,coexists with,,,,coexists with,,,
143,complicates,SEMMEDDB:COMPLICATES,exacerbates,,,,is exacerbated by,,,
57,converts_to,SEMMEDDB:CONVERTS_TO,derives into,,,,derives from,,,
157,diagnoses,SEMMEDDB:DIAGNOSES,diagnoses,,,,is diagnosed by,,,
140,disrupts,SEMMEDDB:DISRUPTS,disrupts,,,,disrupted by,,,


**PAUSE**

Check if the predicate inverses look correct. 

For bmt >=1.1.0 and biolink-model >=3.5.2, we want to:
* leave the inverse of 'subclass of' as 'MISSING' because no inverse exists, so we'll want to remove it in the next step
* leave the inverse of 'quantifier qualifier' as 'MISSING' since it's actually not a predicate, so we'll want to remove it in the next step

Old notes: 
* Maybe some aren't correct because they don't have an inverse but they are directional...If so, I suggest going back and changing the predicates to ones with inverses. 
* Some predicates aren't directional (their entry in the biolink-model yaml will have the property `symmetric == true`) so those are identical in either direction...

In [57]:
## finding predicates where inverse is missing (issues retrieving from biolink-model-toolkit)
SEMMED_predicates[SEMMED_predicates['BM_Inv_Pred'] == 'MISSING']

Unnamed: 0,SemmedPred,Semmed_in_BM,BMPred,BMQualPred,BMObjAsp,BMObjDirect,BM_Inv_Pred,BM_Inv_QualPred,BM_Inv_SubjAsp,BM_Inv_SubjDirect
172,isa,SEMMEDDB:ISA,subclass of,,,,MISSING,,,
156,measures,SEMMEDDB:MEASURES,quantifier qualifier,,,,MISSING,,,


Now we'll make sure all biolink-model-mapped columns are in snake_case, and clean up the table

In [58]:
## FOR RESOURCES, these predicates must all be in snake_case
for col in ['BMPred', 'BMQualPred', 'BM_Inv_Pred', 'BM_Inv_QualPred']:
    SEMMED_predicates[col] = SEMMED_predicates[col].str.replace(" ", "_")

In [59]:
## can mutate and rename the Semmed_in_BM column (not needed anymore) with the format needed
SEMMED_predicates['Semmed_in_BM'] = SEMMED_predicates['SemmedPred'].str.upper()
SEMMED_predicates['Semmed_in_BM'] = SEMMED_predicates['Semmed_in_BM'].str.replace("-", "")
SEMMED_predicates.rename(columns = {'Semmed_in_BM':'Semmed_in_Data'}, inplace = True)

## special handling: keep same_as in lowercase
SEMMED_predicates['Semmed_in_Data'] = [i.lower() if i == 'SAME_AS' else i
                                       for i in SEMMED_predicates['Semmed_in_Data']]

## Can remove the SemmedPred column: not needed anymore
SEMMED_predicates.drop(columns='SemmedPred', inplace=True)

In [60]:
SEMMED_predicates

Unnamed: 0,Semmed_in_Data,BMPred,BMQualPred,BMObjAsp,BMObjDirect,BM_Inv_Pred,BM_Inv_QualPred,BM_Inv_SubjAsp,BM_Inv_SubjDirect
54,ADMINISTERED_TO,related_to,,,,related_to,,,
145,AFFECTS,affects,,,,affected_by,,,
160,ASSOCIATED_WITH,related_to,,,,related_to,,,
55,AUGMENTS,affects,causes,activity_or_abundance,increased,affected_by,caused_by,activity_or_abundance,increased
141,CAUSES,causes,,,,caused_by,,,
56,COEXISTS_WITH,coexists_with,,,,coexists_with,,,
143,COMPLICATES,exacerbates,,,,is_exacerbated_by,,,
57,CONVERTS_TO,derives_into,,,,derives_from,,,
157,DIAGNOSES,diagnoses,,,,is_diagnosed_by,,,
140,DISRUPTS,disrupts,,,,disrupted_by,,,


#### custom-prune the data of some predicates

We are still doing this because some semmeddb predicates or their mappings are problematic. 

As of 2023-08-07: the Translator exclusions weren't on a predicate level. ref: [Translator google group, google sheet link](https://docs.google.com/spreadsheets/d/1c1gx0Jgm9rJUOXcQhBtZgvx50Cvz1-jh0DdGtg1zcd8/edit#gid=1801185264).

In [61]:
print("in data but not in mapping table")
predicates - set(SEMMED_predicates['Semmed_in_Data'])

print("in mapping file but not in data")
set(SEMMED_predicates['Semmed_in_Data']) - predicates

in data but not in mapping table


{'MEASUREMENT_OF', 'METHOD_OF'}

in mapping file but not in data


{'ISA'}

In [62]:
## finding predicates where inverse is missing (issues retrieving from biolink-model-toolkit)
SEMMED_predicates[SEMMED_predicates['BM_Inv_Pred'] == 'MISSING']

Unnamed: 0,Semmed_in_Data,BMPred,BMQualPred,BMObjAsp,BMObjDirect,BM_Inv_Pred,BM_Inv_QualPred,BM_Inv_SubjAsp,BM_Inv_SubjDirect
172,ISA,subclass_of,,,,MISSING,,,
156,MEASURES,quantifier_qualifier,,,,MISSING,,,


**PAUSE**

Remove predicates that are in the mapping table or the data, but that we don't want to use for associations:
- no biolink-model mapping: METHOD_OF, MEASUREMENT_OF
- no biolink-model inverse (`MISSING`) and other issues:
  - ISA: already removed from data during Basic Filtering section 2.3 (predicates)
  - MEASURES: mapping isn't to a predicate (quantifier_qualifier)

Past: used to exclude USES (not helpful?), but now keeping because the predicate has been reviewed in the Translator-curated exclusions effort. ref: [Translator google group, google sheet link](https://docs.google.com/spreadsheets/d/1c1gx0Jgm9rJUOXcQhBtZgvx50Cvz1-jh0DdGtg1zcd8/edit#gid=1801185264).

In [63]:
more_removals = {'MEASUREMENT_OF', 'METHOD_OF', 'ISA', 'MEASURES'}
more_removals

{'ISA', 'MEASUREMENT_OF', 'MEASURES', 'METHOD_OF'}

In [64]:
## remove this set from the SEMMED_predicates table
SEMMED_predicates = SEMMED_predicates[~ SEMMED_predicates['Semmed_in_Data'].isin(more_removals)]
SEMMED_predicates
SEMMED_predicates.shape

Unnamed: 0,Semmed_in_Data,BMPred,BMQualPred,BMObjAsp,BMObjDirect,BM_Inv_Pred,BM_Inv_QualPred,BM_Inv_SubjAsp,BM_Inv_SubjDirect
54,ADMINISTERED_TO,related_to,,,,related_to,,,
145,AFFECTS,affects,,,,affected_by,,,
160,ASSOCIATED_WITH,related_to,,,,related_to,,,
55,AUGMENTS,affects,causes,activity_or_abundance,increased,affected_by,caused_by,activity_or_abundance,increased
141,CAUSES,causes,,,,caused_by,,,
56,COEXISTS_WITH,coexists_with,,,,coexists_with,,,
143,COMPLICATES,exacerbates,,,,is_exacerbated_by,,,
57,CONVERTS_TO,derives_into,,,,derives_from,,,
157,DIAGNOSES,diagnoses,,,,is_diagnosed_by,,,
140,DISRUPTS,disrupts,,,,disrupted_by,,,


(25, 9)

In [65]:
## remove this set from the data record
filtered_data = filtered_data[ ~ filtered_data['PREDICATE'].isin(more_removals)]

In [66]:
## look at the semantic types again after this removal
subject_semtypes = set(filtered_data["SUBJECT_SEMTYPE"].unique())
object_semtypes = set(filtered_data["OBJECT_SEMTYPE"].unique())
predicates = set(filtered_data["PREDICATE"].unique())

len(subject_semtypes) ## was 112, still 112 (expected)
len(object_semtypes)  ## was 110, now 109 (cool, not intended)
len(predicates)       ## was 28, now 25: decreased by 3 (expected: ISA was already removed earlier)

112

109

25

In [67]:
## look at number of combos after this removal
combos = filtered_data.value_counts().reset_index()
combos.columns = ['SUBJECT_SEMTYPE', 'PREDICATE', 'OBJECT_SEMTYPE', 'COUNT']
combos.shape
combos.head(10)

(13150, 4)

Unnamed: 0,SUBJECT_SEMTYPE,PREDICATE,OBJECT_SEMTYPE,COUNT
0,dsyn,PROCESS_OF,humn,2143816
1,bpoc,PART_OF,mamm,1159577
2,bpoc,LOCATION_OF,neop,1023005
3,fndg,PROCESS_OF,humn,1000549
4,bpoc,LOCATION_OF,aapp,953010
5,bdsu,LOCATION_OF,aapp,884425
6,topp,TREATS,dsyn,849825
7,cell,LOCATION_OF,aapp,815374
8,bpoc,LOCATION_OF,patf,799234
9,topp,TREATS,neop,707094


### Use translator-curated exclusions to prune combos 

I'll use Translator-curated exclusions to remove combos that match semantic-type, domain-predicate, or predicate-range exclusions. ref: [Translator google group, google sheet link](https://docs.google.com/spreadsheets/d/1c1gx0Jgm9rJUOXcQhBtZgvx50Cvz1-jh0DdGtg1zcd8/edit#gid=1801185264)

Translator-curated exclusions work on the combo-level. 

Note: the `_t_code` columns sometimes have the value of `not_found`. 
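
To make the three exclusion kinds concrete, here is a toy sketch. The tables are tiny made-up stand-ins (the `TREATS`/`resa` range exclusion in particular is invented for illustration), and `apply_exclusions` is a hypothetical helper, not part of the notebook. It builds a single keep-mask instead of re-filtering the table for every exclusion row.

```python
import pandas as pd

# Toy combos (made-up rows, same column names as the notebook's combos table)
combos = pd.DataFrame({
    "SUBJECT_SEMTYPE": ["chem", "topp", "hlca", "phsu"],
    "PREDICATE": ["TREATS", "TREATS", "ADMINISTERED_TO", "TREATS"],
    "OBJECT_SEMTYPE": ["dsyn", "dsyn", "humn", "resa"],
})

# Toy exclusions: one row per exclusion kind
exclusions = pd.DataFrame({
    "semmed_subject_code": ["chem", "hlca", None],
    "semmed_predicate": [None, "ADMINISTERED_TO", "TREATS"],
    "semmed_object_code": [None, None, "resa"],
    "exclusion_type": ["semantic type exclusion", "Domain exclusion", "Range exclusion"],
})


def apply_exclusions(combos, exclusions):
    """Return the combos that match none of the exclusion rows."""
    keep = pd.Series(True, index=combos.index)
    for row in exclusions.itertuples(index=False):
        if row.exclusion_type == "semantic type exclusion":
            # ignores the predicate: matches on subject OR object type
            if isinstance(row.semmed_subject_code, str):
                hit = combos["SUBJECT_SEMTYPE"] == row.semmed_subject_code
            else:
                hit = combos["OBJECT_SEMTYPE"] == row.semmed_object_code
        elif row.exclusion_type == "Domain exclusion":
            # subject type + predicate
            hit = ((combos["SUBJECT_SEMTYPE"] == row.semmed_subject_code)
                   & (combos["PREDICATE"] == row.semmed_predicate))
        elif row.exclusion_type == "Range exclusion":
            # predicate + object type
            hit = ((combos["OBJECT_SEMTYPE"] == row.semmed_object_code)
                   & (combos["PREDICATE"] == row.semmed_predicate))
        else:
            continue  # unknown exclusion kinds are skipped in this sketch
        keep &= ~hit
    return combos[keep]


remaining = apply_exclusions(combos, exclusions)
print(remaining)
```

In this toy case only the `topp`, `TREATS`, `dsyn` combo survives: the other three rows each match one exclusion kind.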

In [68]:
imported_exclusions = pd.read_csv(
    'https://raw.githubusercontent.com/biolink/biolink-model/master/SEMMEDDB_exclude_list.tsv',
    sep="\t"
)

In [69]:
imported_exclusions.head()
imported_exclusions.shape
imported_exclusions.exclusion_type.unique()  ## number of unique kinds of exclusions

Unnamed: 0,semmed_subject_code,semmed_subject_t_code,semmed_predicate,semmed_object_code,semmed_object_t_code,exclusion_type
0,chem,T103,,,,semantic type exclusion
1,,,,chem,T103,semantic type exclusion
2,chvs,T104,,,,semantic type exclusion
3,,,,chvs,T104,semantic type exclusion
4,chvf,T120,,,,semantic type exclusion


(1443, 6)

array(['semantic type exclusion', 'Domain exclusion', 'Range exclusion'],
      dtype=object)

First, I'm checking that the data fits the "rules" I expect:

In [77]:
for row in imported_exclusions.itertuples(index=False):
    ## all 'semantic type exclusions' don't consider the predicates, 
    ##   just specific entities/instances of the types
    if row.exclusion_type == 'semantic type exclusion':
        if isinstance(row.semmed_predicate, str):
            print('semantic-type exclusion problem: predicate there?')
            print(row)
    ## 'Domain exclusions' must have a domain/subject + predicate specified
    ##   and no range/object specified
    elif row.exclusion_type == 'Domain exclusion':
        if (not isinstance(row.semmed_subject_code, str)) or (not isinstance(row.semmed_predicate, str)):
            print('domain exclusion problem: missing subject or predicate')
            print(row)
        elif isinstance(row.semmed_object_code, str):
            print('domain exclusion problem: object present')
            print(row)
    ## 'Range exclusions' must have a range/object + predicate specified
    ##   and no domain/subject specified
    elif row.exclusion_type == 'Range exclusion':
        if (not isinstance(row.semmed_object_code, str)) or (not isinstance(row.semmed_predicate, str)):
            print('range exclusion problem: missing object or predicate')
            print(row)
        elif isinstance(row.semmed_subject_code, str):
            print('range exclusion problem: subject present')
            print(row)
    else: 
        print('some other kind of exclusion?')
        print(row)

Then, I noticed the same_as predicate was not in the expected case, so adjusting this...
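
As a sketch of the adjustment, the same case change can be written with `Series.replace` on a toy column. The values below are made up; only `SAME_AS` is rewritten and every other predicate is left as-is.

```python
import pandas as pd

# Toy predicate column (made-up values)
toy_predicates = pd.Series(["SAME_AS", "TREATS", "USES"])

# Map only the one predicate whose case differs from the data
normalized = toy_predicates.replace({"SAME_AS": "same_as"})
print(normalized.tolist())
```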

In [86]:
imported_exclusions[imported_exclusions['semmed_predicate'] == 'same_as']
imported_exclusions[imported_exclusions['semmed_predicate'] == 'SAME_AS']

Unnamed: 0,semmed_subject_code,semmed_subject_t_code,semmed_predicate,semmed_object_code,semmed_object_t_code,exclusion_type


Unnamed: 0,semmed_subject_code,semmed_subject_t_code,semmed_predicate,semmed_object_code,semmed_object_t_code,exclusion_type
715,chvf,T120,SAME_AS,,,Domain exclusion
716,chvs,T104,SAME_AS,,,Domain exclusion
717,resa,T062,SAME_AS,,,Domain exclusion
718,chem,T103,SAME_AS,,,Domain exclusion
719,qnco,T081,SAME_AS,,,Domain exclusion
720,ftcn,T169,SAME_AS,,,Domain exclusion
721,inpr,T170,SAME_AS,,,Domain exclusion
722,tmco,T079,SAME_AS,,,Domain exclusion
1387,,,SAME_AS,chvf,T120,Range exclusion
1388,,,SAME_AS,resa,T062,Range exclusion


In [88]:
imported_exclusions['semmed_predicate'] = ['same_as' if i == 'SAME_AS' else i \
                                           for i in imported_exclusions['semmed_predicate']]

In [89]:
imported_exclusions[imported_exclusions['semmed_predicate'] == 'same_as']
imported_exclusions[imported_exclusions['semmed_predicate'] == 'SAME_AS']

Unnamed: 0,semmed_subject_code,semmed_subject_t_code,semmed_predicate,semmed_object_code,semmed_object_t_code,exclusion_type
715,chvf,T120,same_as,,,Domain exclusion
716,chvs,T104,same_as,,,Domain exclusion
717,resa,T062,same_as,,,Domain exclusion
718,chem,T103,same_as,,,Domain exclusion
719,qnco,T081,same_as,,,Domain exclusion
720,ftcn,T169,same_as,,,Domain exclusion
721,inpr,T170,same_as,,,Domain exclusion
722,tmco,T079,same_as,,,Domain exclusion
1387,,,same_as,chvf,T120,Range exclusion
1388,,,same_as,resa,T062,Range exclusion


Unnamed: 0,semmed_subject_code,semmed_subject_t_code,semmed_predicate,semmed_object_code,semmed_object_t_code,exclusion_type


Now I can iterate through the `imported_exclusions` table and remove the rows from `combos` that match

In [78]:
imported_exclusions[imported_exclusions['exclusion_type'] == 'Domain exclusion']

Unnamed: 0,semmed_subject_code,semmed_subject_t_code,semmed_predicate,semmed_object_code,semmed_object_t_code,exclusion_type
118,hlca,T058,ADMINISTERED_TO,,,Domain exclusion
119,inpr,T170,ADMINISTERED_TO,,,Domain exclusion
120,lbpr,T059,ADMINISTERED_TO,,,Domain exclusion
121,resa,T062,ADMINISTERED_TO,,,Domain exclusion
122,qnco,T081,ADMINISTERED_TO,,,Domain exclusion
...,...,...,...,...,...,...
796,idcn,T078,USES,,,Domain exclusion
797,qlco,T080,USES,,,Domain exclusion
798,rcpt,T192,USES,,,Domain exclusion
799,strd,not_found,USES,,,Domain exclusion


In [102]:
for row in imported_exclusions.itertuples(index=False):
    if row.exclusion_type == 'semantic type exclusion':
        if isinstance(row.semmed_subject_code, str):  ## exclusion is in the subject
            combos = combos[ ~ (combos['SUBJECT_SEMTYPE'] == row.semmed_subject_code)]
        else:  ## exclusion should be in object
            combos = combos[ ~ (combos['OBJECT_SEMTYPE'] == row.semmed_object_code)]
    elif row.exclusion_type == 'Domain exclusion':
        combos = combos[~ ( (combos['SUBJECT_SEMTYPE'] == row.semmed_subject_code ) & 
                            (combos['PREDICATE'] == row.semmed_predicate ))]
    elif row.exclusion_type == 'Range exclusion':
        combos = combos[~ ( (combos['OBJECT_SEMTYPE'] == row.semmed_object_code ) & 
                            (combos['PREDICATE'] == row.semmed_predicate ))]

In [103]:
combos.shape
combos.head(10)

(9667, 4)

Unnamed: 0,SUBJECT_SEMTYPE,PREDICATE,OBJECT_SEMTYPE,COUNT
2,bpoc,LOCATION_OF,neop,1023005
4,bpoc,LOCATION_OF,aapp,953010
5,bdsu,LOCATION_OF,aapp,884425
6,topp,TREATS,dsyn,849825
7,cell,LOCATION_OF,aapp,815374
8,bpoc,LOCATION_OF,patf,799234
9,topp,TREATS,neop,707094
10,bpoc,LOCATION_OF,dsyn,696732
11,bpoc,LOCATION_OF,fndg,670960
13,phsu,TREATS,dsyn,660654


## Final Filter: row counts per combo

**PAUSE**

Use the code block below to decide how many combos to keep, based on how many predications/records there are per combo...

---

Current:

Counts right now are still based on predication. For now, [we want > 3 publications for each triple](https://github.com/NCATSTranslator/Feedback/issues/100#issuecomment-1632806388), so the count must be > 3 at least...
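
A toy sketch of this threshold, assuming made-up predication rows and a hypothetical `MIN_COUNT` name: combos are counted with `groupby(...).size()`, and a combo with exactly 3 rows is dropped because the comparison is strictly greater than 3.

```python
import pandas as pd

# Toy predication rows (made up): 5 rows for one combo, 3 rows for another
toy_rows = pd.DataFrame({
    "SUBJECT_SEMTYPE": ["topp"] * 5 + ["medd"] * 3,
    "PREDICATE": ["TREATS"] * 8,
    "OBJECT_SEMTYPE": ["dsyn"] * 5 + ["virs"] * 3,
})

# Count rows per (subject, predicate, object) combo
toy_combos = (
    toy_rows.groupby(["SUBJECT_SEMTYPE", "PREDICATE", "OBJECT_SEMTYPE"])
    .size()
    .reset_index(name="COUNT")
)

MIN_COUNT = 3  # hypothetical name: keep combos with COUNT strictly greater than this
kept = toy_combos[toy_combos["COUNT"] > MIN_COUNT]
print(kept)
```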

In [98]:
combos[(combos['COUNT'] > 3)]

Unnamed: 0,SUBJECT_SEMTYPE,PREDICATE,OBJECT_SEMTYPE,COUNT
2,bpoc,LOCATION_OF,neop,1023005
4,bpoc,LOCATION_OF,aapp,953010
5,bdsu,LOCATION_OF,aapp,884425
6,topp,TREATS,dsyn,849825
7,cell,LOCATION_OF,aapp,815374
...,...,...,...,...
10034,medd,TREATS,virs,4
10035,anab,AFFECTS,neop,4
10036,opco,ADMINISTERED_TO,aggp,4
10038,moft,PROCESS_OF,gngm,4


Hmmm...I'm not happy with some odd same_as Metatriples that are still around, but keeping them in for now...

In [104]:
combos[(combos['COUNT'] > 3) &
       (combos['PREDICATE'] == 'same_as') & 
       (combos['SUBJECT_SEMTYPE'] == 'orch')
      ]

Unnamed: 0,SUBJECT_SEMTYPE,PREDICATE,OBJECT_SEMTYPE,COUNT
1718,orch,same_as,orch,2849
3832,orch,same_as,phsu,343
4151,orch,same_as,aapp,257
4196,orch,same_as,medd,248
4402,orch,same_as,topp,209
4575,orch,same_as,hops,184
4608,orch,same_as,antb,177
4627,orch,same_as,bacs,175
5519,orch,same_as,horm,82
5643,orch,same_as,inch,74


In [105]:
filtered_combos = combos[(combos['COUNT'] > 3)].copy()
filtered_combos.drop(columns='COUNT', inplace=True)
filtered_combos.shape
filtered_combos[0:3]

(7685, 3)

Unnamed: 0,SUBJECT_SEMTYPE,PREDICATE,OBJECT_SEMTYPE
2,bpoc,LOCATION_OF,neop
4,bpoc,LOCATION_OF,aapp
5,bdsu,LOCATION_OF,aapp


In [106]:
## now have to map all subject to biolink

filtered_combos = filtered_combos.merge(SEMMED_entity_types[['Abbrev', 'BiolinkMapping']], 
                      how='left', left_on='SUBJECT_SEMTYPE', right_on='Abbrev')

filtered_combos.drop(columns = 'Abbrev', inplace=True)

filtered_combos.columns = ['OriginalSubject', 'OriginalPredicate', 'OriginalObject',
                  'BiolinkSubject']

In [107]:
## now have to map all object to biolink

filtered_combos = filtered_combos.merge(SEMMED_entity_types[['Abbrev', 'BiolinkMapping']],
                      how='left', left_on='OriginalObject', right_on='Abbrev')

filtered_combos.drop(columns = 'Abbrev', inplace=True)

filtered_combos.columns = ['OriginalSubject', 'OriginalPredicate', 'OriginalObject',
                  'BiolinkSubject', 'BiolinkObject']

In [108]:
filtered_combos

Unnamed: 0,OriginalSubject,OriginalPredicate,OriginalObject,BiolinkSubject,BiolinkObject
0,bpoc,LOCATION_OF,neop,GrossAnatomicalStructure,Disease
1,bpoc,LOCATION_OF,aapp,GrossAnatomicalStructure,Polypeptide
2,bdsu,LOCATION_OF,aapp,AnatomicalEntity,Polypeptide
3,topp,TREATS,dsyn,Procedure,Disease
4,cell,LOCATION_OF,aapp,Cell,Polypeptide
...,...,...,...,...,...
7680,medd,TREATS,virs,Device,Virus
7681,anab,AFFECTS,neop,Disease,Disease
7682,opco,ADMINISTERED_TO,aggp,SmallMolecule,Cohort
7683,moft,PROCESS_OF,gngm,MolecularActivity,Gene


In [109]:
filtered_combos = filtered_combos.merge(
                      SEMMED_predicates,
                      how='left', left_on='OriginalPredicate', right_on='Semmed_in_Data')

filtered_combos.drop(columns = 'Semmed_in_Data', inplace=True)

In [110]:
filtered_combos[0:3]

Unnamed: 0,OriginalSubject,OriginalPredicate,OriginalObject,BiolinkSubject,BiolinkObject,BMPred,BMQualPred,BMObjAsp,BMObjDirect,BM_Inv_Pred,BM_Inv_QualPred,BM_Inv_SubjAsp,BM_Inv_SubjDirect
0,bpoc,LOCATION_OF,neop,GrossAnatomicalStructure,Disease,location_of,,,,located_in,,,
1,bpoc,LOCATION_OF,aapp,GrossAnatomicalStructure,Polypeptide,location_of,,,,located_in,,,
2,bdsu,LOCATION_OF,aapp,AnatomicalEntity,Polypeptide,location_of,,,,located_in,,,


In [111]:
## check to make sure everything is mapped to Biolink successfully
filtered_combos[filtered_combos['BiolinkSubject'].isna()]
filtered_combos[filtered_combos['BiolinkObject'].isna()]

filtered_combos[filtered_combos['BMPred'].isna()]
filtered_combos[filtered_combos['BM_Inv_Pred'].isna()]

Unnamed: 0,OriginalSubject,OriginalPredicate,OriginalObject,BiolinkSubject,BiolinkObject,BMPred,BMQualPred,BMObjAsp,BMObjDirect,BM_Inv_Pred,BM_Inv_QualPred,BM_Inv_SubjAsp,BM_Inv_SubjDirect


Unnamed: 0,OriginalSubject,OriginalPredicate,OriginalObject,BiolinkSubject,BiolinkObject,BMPred,BMQualPred,BMObjAsp,BMObjDirect,BM_Inv_Pred,BM_Inv_QualPred,BM_Inv_SubjAsp,BM_Inv_SubjDirect


Unnamed: 0,OriginalSubject,OriginalPredicate,OriginalObject,BiolinkSubject,BiolinkObject,BMPred,BMQualPred,BMObjAsp,BMObjDirect,BM_Inv_Pred,BM_Inv_QualPred,BM_Inv_SubjAsp,BM_Inv_SubjDirect


Unnamed: 0,OriginalSubject,OriginalPredicate,OriginalObject,BiolinkSubject,BiolinkObject,BMPred,BMQualPred,BMObjAsp,BMObjDirect,BM_Inv_Pred,BM_Inv_QualPred,BM_Inv_SubjAsp,BM_Inv_SubjDirect
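
The four checks above can also be condensed into one filter. This is a sketch on a toy table: the second row uses a made-up semantic type `zzzz` with no Biolink mapping, to show what an unmapped row would look like.

```python
import pandas as pd

# Toy stand-in for filtered_combos; the second row is deliberately unmapped
toy_combos = pd.DataFrame({
    "OriginalSubject": ["bpoc", "zzzz"],
    "OriginalPredicate": ["LOCATION_OF", "LOCATION_OF"],
    "OriginalObject": ["neop", "aapp"],
    "BiolinkSubject": ["GrossAnatomicalStructure", None],
    "BiolinkObject": ["Disease", "Polypeptide"],
    "BMPred": ["location_of", "location_of"],
    "BM_Inv_Pred": ["located_in", "located_in"],
})

# Columns that must be filled for every combo
required = ["BiolinkSubject", "BiolinkObject", "BMPred", "BM_Inv_Pred"]

# Rows where at least one required mapping is missing
unmapped = toy_combos[toy_combos[required].isna().any(axis=1)]
print(unmapped[["OriginalSubject", "OriginalPredicate", "OriginalObject"]])
```

An empty `unmapped` table corresponds to the four empty outputs shown above.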


### Optional Analysis: looking at row counts, organized by biolink combos

**PAUSE**

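Deprecated code: to run it, the COUNT column needs to be kept when creating the `filtered_combos` object, and the `BiolinkPredicate` column name needs updating (the predicate column is now `BMPred`)...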
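
As a sketch of what an updated version could look like, the toy table below stands in for a `filtered_combos` that still has its COUNT column. The rows and counts are made up, and the predicate column is assumed to be `BMPred`, as in the tables above.

```python
import pandas as pd

# Toy stand-in for filtered_combos with COUNT retained (made-up values)
toy_filtered = pd.DataFrame({
    "BiolinkSubject": ["Procedure", "Drug", "Procedure"],
    "BMPred": ["treats", "treats", "treats"],
    "BiolinkObject": ["Disease", "Disease", "Disease"],
    "COUNT": [10, 7, 5],
})

# Sum the counts per biolink-level combo, largest first
biolink_counting = (
    toy_filtered
    .groupby(["BiolinkSubject", "BMPred", "BiolinkObject"], as_index=False)["COUNT"]
    .sum()
    .sort_values(by="COUNT", ascending=False)
)
print(biolink_counting)
```

The two `Procedure` rows collapse into a single biolink-level combo with their counts added together.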

In [None]:
### deep-copy combos so it's not messed up for the next step
biolink_counting = filtered_combos.copy()
biolink_counting = biolink_counting.groupby(
                       ['BiolinkSubject', 'BiolinkPredicate', 'BiolinkObject']
                   ).agg(
                       { "COUNT": "sum"}
)
## other stuff that can go into agg
#                          "OriginalSubject": lambda x: set(x),
#                          "OriginalPredicate": lambda x: set(x),
#                          "OriginalObject": lambda x: set(x)

In [None]:
biolink_counting.reset_index(inplace = True)

biolink_counting.sort_values(by='COUNT', ascending=False, inplace = True)

biolink_counting[0:50]

## Generate operation yaml!

**PAUSE**

* If needed, change the code within the functions below to change the x-bte annotations that are made...
  * review the qualifier-generating code! Currently it's very simple because all operations with qualifiers have the same set: qualified_predicate, an aspect_qualifier, and a direction_qualifier...
* the yaml created below refers to umls-subj and umls-obj...those are specified [here close to the bottom](https://github.com/NCATS-Tangerine/translator-api-registry/blob/master/semmeddb/version_without_operations.yaml)
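
If the qualifier-generating code ever needs to change, one option is to build the operation once and attach `qualifiers` only when they are present. The sketch below is a hypothetical helper (`build_operation` and its parameter names are not part of this notebook); it uses plain strings and leaves out the ruamel.yaml `folded` / `doublequote` wrappers that the functions below apply to the body and source.

```python
def build_operation(input_semantic, output_semantic, predicate, body,
                    response_mapping_ref, qualifiers=None, post_size=1000):
    """Hypothetical helper: build one x-bte operation dict,
    adding the 'qualifiers' key only when qualifiers are given."""
    operation = {
        'supportBatch': True,
        'useTemplating': True,
        'inputs': [{'id': 'UMLS', 'semantic': input_semantic}],
        'requestBodyType': 'object',
        'requestBody': {'body': body},
        'parameters': {
            'fields': 'object.umls,predication.pmid,predication.sentence,'
                      'subject.umls,subject.name,object.name',
            'size': post_size,
        },
        'outputs': [{'id': 'UMLS', 'semantic': output_semantic}],
        'predicate': predicate,
    }
    if qualifiers:
        # copy so the caller's dict is not shared between operations
        operation['qualifiers'] = dict(qualifiers)
    operation['source'] = 'infores:semmeddb'
    operation['response_mapping'] = {'$ref': response_mapping_ref}
    return operation


plain_op = build_operation(
    'Procedure', 'Disease', 'treats', '{...}',
    '#/components/x-bte-response-mapping/umls-obj')
qualified_op = build_operation(
    'SmallMolecule', 'Gene', 'affects', '{...}',
    '#/components/x-bte-response-mapping/umls-obj',
    qualifiers={
        'qualified_predicate': 'causes',
        'object_aspect_qualifier': 'activity_or_abundance',
        'object_direction_qualifier': 'increased',
    })
print('qualifiers' in plain_op, 'qualifiers' in qualified_op)
```

The key order matches the dictionaries built below (`qualifiers` sits between `predicate` and `source`), so the dumped yaml would keep the same layout.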

In [116]:
yaml=ryml.YAML()
folded = ryml.scalarstring.FoldedScalarString
doublequote = ryml.scalarstring.DoubleQuotedScalarString

In [117]:
def generate_forward_op(original_subj, original_pred, original_obj,
                        biolink_subj, 
                        biolink_pred, biolink_qualified_pred, biolink_obj_asp, biolink_obj_direct,
                        biolink_obj):
    ## set size parameter for biothings POST query. This will change depending on what is set for the API
    POST_size = 1000  
    
    ## create the keys for the operation names
    normal_op_name = f"{original_subj}-{original_pred}-{original_obj}"
    
    ## USE FOLDED in order to have the quotes handled properly (no escape \) in the dumped document
    ## original direction: subject -> object
    normal_op_body = folded(
    '{' 
        '"q": {{ queryInputs ' 
        '| replPrefix(\'predicate:' + f'{original_pred} AND object.semantic_type_abbreviation:{original_obj}' + ' AND pmid_count:>3 AND subject.umls\')' 
        '| dump ' 
        '}}, ' 
        '"scopes": []' 
    '}')
    
    if biolink_qualified_pred:
    ## if there is a qualified_predicate
    ##   there's a set of qualifiers that also has obj aspect and obj direction
        temp = {
            ## original direction: subject -> object
            normal_op_name: [
                {
                    'supportBatch': True,
                    'useTemplating': True,
                    'inputs': [
                        {
                            'id': 'UMLS',
                            'semantic': biolink_subj  ## input is subject!
                        }
                    ],
                    'requestBodyType': 'object',
                    'requestBody': {'body': normal_op_body},
                    'parameters': {
                        'fields': 'object.umls,predication.pmid,predication.sentence,subject.umls,subject.name,object.name',
                        'size': POST_size
                    },
                    'outputs': [
                        {
                            'id': 'UMLS',
                            'semantic': biolink_obj  ## output is object
                        }
                    ],
                    'predicate': biolink_pred,
                    'qualifiers': {
                        'qualified_predicate': biolink_qualified_pred,
                        'object_aspect_qualifier': biolink_obj_asp,
                        'object_direction_qualifier': biolink_obj_direct

                    },
                    'source': doublequote('infores:semmeddb'),
                    'response_mapping': {
                        "$ref": '#/components/x-bte-response-mapping/umls-obj'  ## matches output as object
                    }
                }
            ]
        }
    else:
    ## create operation without qualifiers..
        temp = {
            ## original direction: subject -> object
            normal_op_name: [
                {
                    'supportBatch': True,
                    'useTemplating': True,
                    'inputs': [
                        {
                            'id': 'UMLS',
                            'semantic': biolink_subj  ## input is subject!
                        }
                    ],
                    'requestBodyType': 'object',
                    'requestBody': {'body': normal_op_body},
                    'parameters': {
                        'fields': 'object.umls,predication.pmid,predication.sentence,subject.umls,subject.name,object.name',
                        'size': POST_size
                    },
                    'outputs': [
                        {
                            'id': 'UMLS',
                            'semantic': biolink_obj  ## output is object
                        }
                    ],
                    'predicate': biolink_pred,
                    'source': doublequote('infores:semmeddb'),
                    'response_mapping': {
                        "$ref": '#/components/x-bte-response-mapping/umls-obj'  ## matches output as object
                    }
                }
            ]
        }
        
    return temp

In [118]:
def generate_reverse_op(original_subj, original_pred, original_obj,
                        biolink_subj, 
                        ## NOTICE THE INVERSES USED HERE
                        biolink_inverse_pred, biolink_inverse_qualified_pred, 
                        biolink_inverse_subj_asp, biolink_inverse_subj_direct,
                        biolink_obj):
    ## set size parameter for biothings POST query. This will change depending on what is set for the API
    POST_size = 1000  
    
    ## create the keys for the operation names
    rev_op_name = f"{original_subj}-{original_pred}-{original_obj}-rev"
    
    ## USE FOLDED in order to have the quotes handled properly (no escape \) in the dumped document
    ## reverse direction: object -> subject
    rev_op_body = folded(
    '{' 
        '"q": {{ queryInputs ' 
        '| replPrefix(\'predicate:' + f'{original_pred} AND subject.semantic_type_abbreviation:{original_subj}' + ' AND pmid_count:>3 AND object.umls\')' 
        '| dump ' 
        '}}, ' 
        '"scopes": []' 
    '}')
    
    if biolink_inverse_qualified_pred:
        ## a qualified_predicate is present, so the operation also gets
        ##   the subject-aspect and subject-direction qualifiers
        temp = {
            ## reverse direction: object -> subject
            rev_op_name: [
                {
                    'supportBatch': True,
                    'useTemplating': True,
                    'inputs': [
                        {
                            'id': 'UMLS',
                            'semantic': biolink_obj  ## input is object!
                        }
                    ],
                    'requestBodyType': 'object',
                    'requestBody': {'body': rev_op_body},
                    'parameters': {
                        'fields': 'object.umls,predication.pmid,predication.sentence,subject.umls,subject.name,object.name',
                        'size': POST_size
                    },
                    'outputs': [
                        {
                            'id': 'UMLS',
                            'semantic': biolink_subj  ## output is subject
                        }
                    ],
                    'predicate': biolink_inverse_pred,  ## use inverse pred!
                    'qualifiers': {
                        'qualified_predicate': biolink_inverse_qualified_pred,  ## use inverse pred!
                        ## use subject qualifiers
                        'subject_aspect_qualifier': biolink_inverse_subj_asp,         
                        'subject_direction_qualifier': biolink_inverse_subj_direct                        
                    },                    
                    'source': doublequote('infores:semmeddb'),
                    'response_mapping': {
                        "$ref": '#/components/x-bte-response-mapping/umls-subj'  ## matches output as subj
                    }
                }
            ]
        }
    else:
        ## create the operation without qualifiers
        temp = {
            ## reverse direction: object -> subject
            rev_op_name: [
                {
                    'supportBatch': True,
                    'useTemplating': True,
                    'inputs': [
                        {
                            'id': 'UMLS',
                            'semantic': biolink_obj  ## input is object!
                        }
                    ],
                    'requestBodyType': 'object',
                    'requestBody': {'body': rev_op_body},
                    'parameters': {
                        'fields': 'object.umls,predication.pmid,predication.sentence,subject.umls,subject.name,object.name',
                        'size': POST_size
                    },
                    'outputs': [
                        {
                            'id': 'UMLS',
                            'semantic': biolink_subj  ## output is subject
                        }
                    ],
                    'predicate': biolink_inverse_pred,  ## use inverse pred!
                    'source': doublequote('infores:semmeddb'),
                    'response_mapping': {
                        "$ref": '#/components/x-bte-response-mapping/umls-subj'  ## matches output as subj
                    }
                }
            ]
        }
    
    return temp
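
The request body above is built from adjacent string literals, which makes it hard to see the string that ends up in the yaml. Below is a minimal sketch of the same assembly without the `folded` wrapper. `build_rev_body` is a hypothetical helper name used only for illustration, and the example values (`INHIBITS`, `aapp`) are the ones that appear in the printed operation further down in this notebook.

```python
def build_rev_body(original_pred, original_subj):
    """Return the templated request body for a reverse (object -> subject) operation."""
    # Only this piece needs interpolation, so it is the only f-string.
    query_prefix = (
        f"predicate:{original_pred}"
        f" AND subject.semantic_type_abbreviation:{original_subj}"
        " AND pmid_count:>3 AND object.umls"
    )
    # The doubled braces are literal template delimiters, so they stay in
    # plain (non-f) strings where Python leaves them untouched.
    return (
        '{'
        '"q": {{ queryInputs '
        "| replPrefix('" + query_prefix + "')"
        '| dump '
        '}}, '
        '"scopes": []'
        '}'
    )


print(build_rev_body("INHIBITS", "aapp"))
```

The printed string should match the `requestBody` of the `INHIBITS` reverse operation shown in the double-check cell, once the pieces that `pprint` splits across lines are joined.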

In [119]:
def generate_all_operations(combo_df):
    op_tracking = set()
    
    saved = dict()
    ## iterate through rows of combos dataframe
    for row in combo_df.itertuples(index = False):
        
        ## forward: only make the operation if it's not going to be a dupe
        ##          dupes happen when the query ends up being the same
        ##          (key = Biolink subject, original predicate, original object)
        forward_op_record = f"{row.BiolinkSubject}-{row.OriginalPredicate}-{row.OriginalObject}"
        if forward_op_record not in op_tracking:
            op_tracking.add(forward_op_record)
            saved.update(generate_forward_op(original_subj = row.OriginalSubject,
                                             original_pred = row.OriginalPredicate,
                                             original_obj = row.OriginalObject,
                                             biolink_subj = row.BiolinkSubject,
                                             biolink_pred = row.BMPred,
                                             biolink_qualified_pred = row.BMQualPred,
                                             biolink_obj_asp = row.BMObjAsp,
                                             biolink_obj_direct = row.BMObjDirect,
                                             biolink_obj = row.BiolinkObject
                                            ))        
        
        ## reverse: only make the operation if it's not going to be a dupe
        ##          dupes happen when the query ends up being the same
        ##          (key = Biolink object, original predicate, original subject)
        reverse_op_record = f"rev-{row.BiolinkObject}-{row.OriginalPredicate}-{row.OriginalSubject}"
        if reverse_op_record not in op_tracking:
            op_tracking.add(reverse_op_record)
            saved.update(generate_reverse_op(original_subj = row.OriginalSubject,
                                             original_pred = row.OriginalPredicate,
                                             original_obj = row.OriginalObject,
                                             biolink_subj = row.BiolinkSubject,
                                             biolink_inverse_pred = row.BM_Inv_Pred,
                                             biolink_inverse_qualified_pred = row.BM_Inv_QualPred,
                                             biolink_inverse_subj_asp = row.BM_Inv_SubjAsp,
                                             biolink_inverse_subj_direct = row.BM_Inv_SubjDirect,
                                             biolink_obj = row.BiolinkObject
                                             ))           
            
    final = {"x-bte-kgs-operations": saved}
    return final
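
To see how the `op_tracking` set condenses the output, here is a small sketch that replays only the key-building and dedup logic from `generate_all_operations` on two toy rows. `Row` and `plan_operation_names` are hypothetical names for this illustration. The mappings in the toy rows (`dsyn` and `mobd` both mapping to `Disease`, `sosy` mapping to `PhenotypicFeature`) are assumptions made for the example and are not read from this notebook's mapping table.

```python
from collections import namedtuple

# Stand-in for the rows yielded by combo_df.itertuples(); only the columns
# used by the dedup keys and the operation names are included.
Row = namedtuple(
    "Row",
    ["OriginalSubject", "OriginalPredicate", "OriginalObject",
     "BiolinkSubject", "BiolinkObject"],
)


def plan_operation_names(rows):
    """Return the operation names that would be generated, in order."""
    op_tracking = set()
    names = []
    for row in rows:
        base_name = f"{row.OriginalSubject}-{row.OriginalPredicate}-{row.OriginalObject}"

        forward_key = f"{row.BiolinkSubject}-{row.OriginalPredicate}-{row.OriginalObject}"
        if forward_key not in op_tracking:
            op_tracking.add(forward_key)
            names.append(base_name)

        reverse_key = f"rev-{row.BiolinkObject}-{row.OriginalPredicate}-{row.OriginalSubject}"
        if reverse_key not in op_tracking:
            op_tracking.add(reverse_key)
            names.append(f"{base_name}-rev")
    return names


toy_rows = [
    Row("dsyn", "CAUSES", "sosy", "Disease", "PhenotypicFeature"),
    Row("mobd", "CAUSES", "sosy", "Disease", "PhenotypicFeature"),
]
print(plan_operation_names(toy_rows))
# ['dsyn-CAUSES-sosy', 'dsyn-CAUSES-sosy-rev', 'mobd-CAUSES-sosy-rev']
```

Both rows share the forward key `Disease-CAUSES-sosy`, so only one forward operation is kept, while the two reverse keys differ in the original subject and both survive: three operations instead of four.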

Get the file made!

In [120]:
kgs_operations = generate_all_operations(filtered_combos)

In [121]:
len(kgs_operations['x-bte-kgs-operations'])

6851

In [122]:
## double-check qualifier + rev operation is written properly?
for i in kgs_operations['x-bte-kgs-operations'].keys():
    if ('INHIBITS' in i) and ('rev' in i):
        pprint.pprint(kgs_operations['x-bte-kgs-operations'][i], sort_dicts = False)
        break

[{'supportBatch': True,
  'useTemplating': True,
  'inputs': [{'id': 'UMLS', 'semantic': 'Polypeptide'}],
  'requestBodyType': 'object',
  'requestBody': {'body': '{"q": {{ queryInputs | '
                          "replPrefix('predicate:INHIBITS AND "
                          'subject.semantic_type_abbreviation:aapp AND '
                          "pmid_count:>3 AND object.umls')| dump }}, "
                          '"scopes": []}'},
  'parameters': {'fields': 'object.umls,predication.pmid,predication.sentence,subject.umls,subject.name,object.name',
                 'size': 1000},
  'outputs': [{'id': 'UMLS', 'semantic': 'Polypeptide'}],
  'predicate': 'affected_by',
  'qualifiers': {'qualified_predicate': 'caused_by',
                 'subject_aspect_qualifier': 'activity_or_abundance',
                 'subject_direction_qualifier': 'decreased'},
  'source': 'infores:semmeddb',
  'response_mapping': {'$ref': '#/components/x-bte-response-mapping/umls-subj'}}]
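
Eyeballing one operation is a good spot check. The sketch below extends the same idea to every operation, under the assumption (taken from the generator functions above) that forward operations use the `umls-obj` mapping with `object_*` qualifiers, and that operations whose name ends in `-rev` use the `umls-subj` mapping with `subject_*` qualifiers. `find_direction_problems` is a hypothetical helper, not part of the notebook.

```python
SUBJ_REF = "#/components/x-bte-response-mapping/umls-subj"
OBJ_REF = "#/components/x-bte-response-mapping/umls-obj"


def find_direction_problems(operations):
    """Return (operation name, offending key) pairs for direction mismatches."""
    problems = []
    for name, op_list in operations.items():
        is_reverse = name.endswith("-rev")
        expected_ref = SUBJ_REF if is_reverse else OBJ_REF
        expected_prefix = "subject_" if is_reverse else "object_"
        for op in op_list:
            if op["response_mapping"]["$ref"] != expected_ref:
                problems.append((name, "response_mapping"))
            # Operations without qualifiers simply have no 'qualifiers' key.
            for key in op.get("qualifiers", {}):
                if key != "qualified_predicate" and not key.startswith(expected_prefix):
                    problems.append((name, key))
    return problems


# Toy input: one well-formed reverse operation and one with the wrong mapping.
toy_operations = {
    "aapp-INHIBITS-aapp-rev": [{
        "qualifiers": {
            "qualified_predicate": "caused_by",
            "subject_aspect_qualifier": "activity_or_abundance",
            "subject_direction_qualifier": "decreased",
        },
        "response_mapping": {"$ref": SUBJ_REF},
    }],
    "toy-BROKEN-toy-rev": [{"response_mapping": {"$ref": OBJ_REF}}],
}
print(find_direction_problems(toy_operations))
# [('toy-BROKEN-toy-rev', 'response_mapping')]
```

In the notebook this could be called as `find_direction_problems(kgs_operations['x-bte-kgs-operations'])`, where an empty list means no mismatches were found.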


**PAUSE**

* it's cool how condensed the operations become because of the way querying is done: the `op_tracking` set in `generate_all_operations` keeps duplicate operations from being created
* set where the yaml files will be written in the code chunks below
  * operations_path
  * list_path

In [123]:
yaml.boolean_representation = ['False', 'True']

operations_path = pathlib.Path.home().joinpath(
            "Desktop", "translator-api-registry", "semmeddb", "generated_operations.yaml")

yaml.dump(kgs_operations, operations_path)

Wait a sec! Need the operations list too!

In [124]:
def generate_kgs_operations_list(operations_dict):
    kgs_op_list = []
    for key in operations_dict.keys():
        kgs_op_list.append( {"$ref": f"#/components/x-bte-kgs-operations/{key}"} )
    final2 = {"x-bte-kgs-operations": kgs_op_list}
    return final2

In [125]:
operations_list = generate_kgs_operations_list(kgs_operations['x-bte-kgs-operations'])

In [126]:
list_path = pathlib.Path.home().joinpath(
            "Desktop", "translator-api-registry", "semmeddb", "generated_list.yaml")

yaml.dump(operations_list, list_path)
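
Before pasting the two segments into the smartapi yaml, it can help to confirm that every `$ref` in the list points at an operation that was actually generated. The following is a minimal sketch of such a check. `find_unresolved_refs` and the toy dictionaries are hypothetical, and only the `#/components/x-bte-kgs-operations/` prefix is taken from the code above.

```python
REF_PREFIX = "#/components/x-bte-kgs-operations/"


def find_unresolved_refs(operations_doc, list_doc):
    """Return names referenced in list_doc that are not defined in operations_doc."""
    defined = set(operations_doc["x-bte-kgs-operations"])
    referenced = [
        entry["$ref"][len(REF_PREFIX):]
        for entry in list_doc["x-bte-kgs-operations"]
    ]
    return [name for name in referenced if name not in defined]


# Toy documents with the same shape as kgs_operations and operations_list.
toy_ops = {"x-bte-kgs-operations": {"aapp-INHIBITS-aapp": [], "aapp-INHIBITS-aapp-rev": []}}
toy_list = {"x-bte-kgs-operations": [
    {"$ref": REF_PREFIX + "aapp-INHIBITS-aapp"},
    {"$ref": REF_PREFIX + "aapp-INHIBITS-aapp-rev"},
]}
print(find_unresolved_refs(toy_ops, toy_list))
# []
```

With the notebook's own variables the call would be `find_unresolved_refs(kgs_operations, operations_list)`. Because the list is derived from the same dictionary, an empty result is expected; the check matters more if either file is edited by hand later.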

**PAUSE**

* now the yaml segments that were written have to be indented manually and inserted into the correct sections of the smartapi yaml...
  * It's easier to do with an editor like Visual Studio Code, where one can select and indent large sections of text
  * the amount of indent and where to put things are specified in the [yaml that acts as a template](https://github.com/NCATS-Tangerine/translator-api-registry/blob/master/semmeddb/version_without_operations.yaml)
  * the finished file is meant to be [here](https://github.com/NCATS-Tangerine/translator-api-registry/blob/master/semmeddb/smartapi.yaml), so one could paste the sections directly there
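
If indenting by hand gets tedious, the shift can also be done with the standard library. This is only a sketch: `indent_segment` is a hypothetical helper, and the indent width passed in is an assumption that has to be read off the template yaml for each section.

```python
import textwrap


def indent_segment(yaml_text, spaces):
    """Shift every non-blank line of a yaml segment right by `spaces` spaces."""
    return textwrap.indent(yaml_text, " " * spaces)


segment = "x-bte-kgs-operations:\n  aapp-INHIBITS-aapp:\n  - supportBatch: True\n"
print(indent_segment(segment, 2))
```

`textwrap.indent` leaves blank lines untouched by default, so the relative nesting inside the segment is preserved. To apply it to a generated file, one could read it with `operations_path.read_text()` and paste the returned string into the template.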

## defunct: Notes on previously removed semantic types

* amas (Amino Acid Sequence) looks like protein "domains". Examples: Nuclear Export Signals, DNA Binding Domain
* acty (Activity) Examples: War, Retirement, Euthanasia, Lifting
* dora (Daily or Recreational Activity) Examples: Physical activity, Light Exercise, Relaxation
* edac (Educational Activity) Examples: Training, Medical Residencies
* gora (Governmental or Regulatory Activity) Examples: Health Care Reform, Advisory Committees
* hlca (Health Care Activity) Examples: follow-up, Diagnosis
* mcha (Machine Activity) Examples: Refrigeration, Neural Network Simulation
* ocac (Occupational Activity) Examples: Promotion, Work, Mining
* resa (Research Activity) Examples: Clinical Trials, research study
* aggp (Age Group) Examples: Infant, Child, Adult, Elderly
* famg (Family Group) Examples: spouse, Sister, Foster Parent
* podg (Patient or Disabled Group) Examples: Patients
* prog (Professional or Occupational Group) Examples: Administrators, Employee, Author
* algae (Algae)
* invt (Invertebrate)
* orgm (Organism)
* rich (Rickettsia or Chlamydia)
* amph (Amphibian) Examples: Toad, Bufo boreas, Anura
* anim (Animal) Examples: Animals, Laboratory / Control Animal
* arch (Archaeon) Examples: Archaea, halophilic bacteria, Thermoplasma acidophilum
* bact (Bacterium) Examples: Escherichia coli, Salmonella, Borrelia burgdorferi
* bdsu (Body Substance) too general. Examples: Urine, Milk, Lymph, Urine specimen
* bdsy (Body System) too general. Examples: hypothalamic-pituitary-adrenal axis, Neurosecretory Systems
* bird (Bird) Examples: Geese, Passeriformes, Raptors
* blor (Body Location or Region) too general. Examples: Hepatic, Lysosomal, Cytoplasmic
* bmod (Biomedical Occupation or Discipline) Examples: Medicine, Dentistry, Midwifery
* bsoj (Body Space or Junction) too general. Examples: Compartments, Synapses, Cistern
* chvf (Chemical Viewed Functionally) too general. Examples: inhibitors, antagonists, Agent
* chvs (Chemical Viewed Structurally) too general. Examples: particle, solid state, vapor
* euka (Eukaryote) Examples: Wasps, Protozoan parasite
* ffas (Fully Formed Anatomical Structure) Examples: Carcass
* fish (Fish) Examples: Eels, Fishes, Electric Fish
* fngs (Fungus) Examples: Saccharomyces cerevisiae, Alternaria brassicicola, fungus
* humn (Human) Examples: Family, Patients, Males
* irda (Indicator, Reagent, or Diagnostic Aid) Examples: Fluorescent Probes, Chelating Agents
* mamm (Mammal) Examples: Rattus norvegicus, Felis catus, Mus
* ocdi (Occupation or Discipline) Examples: Science, Politics
* plnt (Plant) Examples: Chrysanthemum x morifolium, Pollen, Oryza sativa
* rept (Reptile) Examples: Snakes, Turtles, Reptiles
* sbst (Substance) too general. Examples: Materials, Plastics, Photons, Substance
* virs (Virus) Examples: Herpesvirus 4, Human / GB virus C / Herpesviridae
* vtbt (Vertebrate) Examples: Vertebrates / Poikilotherm, NOS
* anst (Anatomical Structure) Examples: Entire fetus, Whole body, Cadaver
* bhvr (Behavior) too general. Examples: Sexuality, Nest Building, Behavioral phenotype
* inbe (Individual Behavior) too general. Examples: impulsivity, Habits, Performance
* menp (Mental Process) too general. Examples: mind control, Learning, experience
* socb (Social Behavior) too general. Examples: Communication, Gestures, Marriage
* biof (Biologic Function) too general. Examples: dose-response relationship, Pharmacodynamics, Anabolism
* eehu (Environmental Effect of Humans) too general. Examples: Sewage, Pollution, Smoke
* hcpp (Human-caused Phenomenon or Process) too general. Now not in API. Examples: particle beam, Conferences, Victimization
* lbtr (Laboratory or Test Result) too general. Examples: False Positive Reactions, Bone Density, Serum Calcium Level
* npop (Natural Phenomenon or Process) too general. Examples: Floods, Fluorescence, Freezing
* phpr (Phenomenon or Process) too general. Examples: Disasters, Acceleration, Feedback
* bodm (Biomedical or Dental Material) too general. Examples: Pill, Gel, Talc, calcium phosphate
* drdd (Drug Delivery Device) too general. Examples: Epipen, Skin Patch, Lilly cyanide antidote kit
* medd (Medical Device) too general. Examples: Implants / Denture, Overlay / Silicone gel implant / Swab
* resd (Research Device) too general. Examples: Study models, Slide
* emst (Embryonic Structure) Examples: Chick Embryo, Blastocyst structure, Placenta
* tisu (Tissue) Examples: Tissue specimen, Blood, Human tissue, Mucous Membrane
* genf (Genetic Function) too general. Examples: Transcription, Genetic / Transcriptional Activation / Recombination, Genetic
* chem (Chemical) too general. Examples: Chemicals, Acids, Ligands, Ozone
* clas (Classification) too general. Examples: Research Diagnostic Criteria, Group C
* ftcn (Functional Concept) too general. Examples: Techniques, Intravenous Route of Drug Administration
* idcn (Idea or Concept) too general. Examples: Significant, subject, Data, Owner
* qlco (Qualitative Concept) too general. Examples: Effect, Associated with, Advanced phase
* qnco (Quantitative Concept) too general. Examples: Calibration, occurrence, degrees Celsius
* rnlw (Regulation or Law) too general. Examples: Medicare, Medicaid, regulatory
* spco (Spatial Concept) too general. Examples: Structure, Longitudinal, Asymmetry
* tmco (Temporal Concept) too general. Examples: New, /period, 24 Hours
* clna (Clinical Attribute) too general. Examples: response, Renin secretion, BAND PATTERN
* lbpr (Laboratory Procedure) too general. Examples: Western Blot, Radioimmunoassay, Staining method
* mbrt (Molecular Biology Research Technique) too general. Examples: Polymerase Chain Reaction / Blotting, Northern
* elii (Element, Ion, or Isotope) too general. Examples: Atom, Aluminum, Superoxides
* emod (Experimental Model of Disease) too general. Examples: Experimental Autoimmune Encephalomyelitis, Rodent Model
* rcpt (Receptor) too general. Examples: Binding Sites / Receptors, Metabotropic Glutamate
* evnt (Event) too general. Now not in API. Examples: Stressful Events
* fndg (Finding) too general. Examples: spinal cord; lesion, Normal birth weight, Sedentary job
* geoa (Geographic Area) too general. Examples: Country, Canada
* grup (Group) too general. Examples: Human, Individual
* hcro (Health Care Related Organization) too general. Examples: Hospitals, Health System
* orgt (Organization) too general. Examples: United Nations, Organization administrative structures
* pros (Professional Society) too general. Examples: Professional Organizations, American Nurses' Association
* shro (Self-help or Relief Organization) too general. Examples: Social Welfare, Support Groups
* inpr (Intellectual Product) too general. Examples: Methodology, Study models
* mnob (Manufactured Object) too general. Examples: Glass, Manuals
* mosq (Molecular Sequence) too general. Now not in API. Examples: Genetic Code
* nusq (Nucleotide Sequence) too general. Examples: Base Sequence, DNA Sequence, 22q11
* orga (Organism Attribute) too general. Examples: Ability, Body Composition
* popg (Population Group) too general. Examples: Male population group, Woman