# Auto-generate BTE annotations for BioThings SEMMEDDB

This notebook walks a developer through the process of [taking the SEMMEDDB database](https://lhncbc.nlm.nih.gov/ii/tools/SemRep_SemMedDB_SKR/SemMedDB_download.html) and using this data to auto-generate the x-bte operations for [BTE](https://github.com/biothings/BioThings_Explorer_TRAPI). This is needed for BTE to query + process the responses from the [BioThings SEMMEDDB API](https://biothings.ncats.io/semmeddb). 

---

When you see this:  
**PAUSE**

read the accompanying text. It explains what the developer needs to check or change before running the code chunks below that text block.

---

The [yaml](https://github.com/NCATS-Tangerine/translator-api-registry/blob/master/semmeddb/generated_operations.yaml) [segments](https://github.com/NCATS-Tangerine/translator-api-registry/blob/master/semmeddb/generated_list.yaml) generated by this notebook are added to [this file](https://github.com/NCATS-Tangerine/translator-api-registry/blob/master/semmeddb/version_without_operations.yaml) to make [yaml used for the smartapi registration](https://github.com/NCATS-Tangerine/translator-api-registry/blob/master/semmeddb/smartapi.yaml) for BioThings SEMMEDDB API.

## Setup

Requirements:
* get [Biolink-model Toolkit (bmt)](https://github.com/biolink/biolink-model-toolkit/) from github (using pip to [install](https://biolink.github.io/biolink-model-toolkit/intro/intro.html)). Using release 0.9.0 at the moment.
* get ruamel_yaml from [conda-forge](https://anaconda.org/conda-forge/ruamel_yaml). The [version from pip](https://pypi.org/project/ruamel.yaml/) should also work with this code, but you'll have to change the import statement below to `import ruamel.yaml as ryml`. The documentation is [here](https://yaml.readthedocs.io/en/latest/).
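
Because the two distributions install under different module names, a fallback import can be sketched as follows. `import_first_available` is a hypothetical helper (not part of either package) that returns the first module name that resolves, or `None` if neither is installed.

```python
import importlib
import importlib.util


def import_first_available(*module_names):
    """Return the first importable module from module_names, or None."""
    for name in module_names:
        try:
            if importlib.util.find_spec(name) is not None:
                return importlib.import_module(name)
        except ModuleNotFoundError:
            # raised when a parent package (e.g. "ruamel") is not installed
            continue
    return None


# conda-forge installs `ruamel_yaml`; pip installs `ruamel.yaml`
ryml = import_first_available("ruamel_yaml", "ruamel.yaml")
```

If `ryml` ends up as `None`, neither package is installed in the current environment.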

Files:
* Get SEMMEDDB PREDICATION CSV [here](https://lhncbc.nlm.nih.gov/ii/tools/SemRep_SemMedDB_SKR/SemMedDB_download.html). This notebook was originally made using the version `semmedVER43_2021_R_PREDICATION`
* Get SEMMEDDB SRDEF file needed for interpreting and mapping SEMMED semantic types: 
  * download the compressed file [here](https://lhncbc.nlm.nih.gov/semanticnetwork/download.html) or pick the latest version [here](https://lhncbc.nlm.nih.gov/semanticnetwork/SemanticNetworkArchive.html)
  * information on the SRDEF file [here](https://www.ncbi.nlm.nih.gov/books/NBK9679/#ch05.sec5.2)

In [1]:
## CX: allows multiple lines of code to print from one code block
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

import pathlib
import pandas as pd

## see above for install instructions
from bmt import Toolkit
import ruamel_yaml as ryml

import json
import pprint

## used in trying things out
# import re

**PAUSE**

Review the code chunk below before running it:
* Check and correct the path for `raw_data_location`
* Check that the columns specified in `usecols` and `names` match the columns of the PREDICATION [file](https://lhncbc.nlm.nih.gov/ii/tools/SemRep_SemMedDB_SKR/dbinfo.html). 
* Check that the `na_values`, `sep` are correct. One can use a command in Terminal like `head`
* If there are encoding issues, try different encodings. latin1 was used and worked, [ref](https://stackoverflow.com/questions/61163367/how-to-resolve-unicodedecodeerror-in-pandas-read-csv-while-loading-dataset)
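
One way to run these checks from Python rather than Terminal is to read only a few rows with no `usecols` and look at every column. The sketch below uses two made-up rows in an assumed 12-field layout (consistent with the column indices used in the next chunk); with the real file, pass `raw_data_location` and the `encoding` in place of `sample`.

```python
import io
import pandas as pd

# two made-up rows in the assumed PREDICATION layout (12 comma-separated fields)
sample = io.StringIO(
    "1,10,1000,TREATS,C01,drug a,phsu,1,C02,disease b,dsyn,1\n"
    "2,11,1001,ISA,C03,virus c,virs,1,C04,virus d,virs,\\N\n"
)

# nrows keeps the preview cheap even for a file with >100 million rows
preview = pd.read_csv(sample, header=None, sep=",", nrows=5, na_values=r"\N")

print(preview.shape)                  # column count should match the schema
print(preview[[3, 6, 7, 10, 11]])     # the columns selected by `usecols`
```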

In [2]:
raw_data_location = pathlib.Path.home().joinpath(
            "Desktop", "RawDataFiles", "SEMMEDDB", "semmedVER43_2021_R_PREDICATION.csv")

raw_data = pd.read_csv(raw_data_location, header=None, sep=",", encoding="latin1",
                          usecols=[3, 6, 7, 10, 11],
                          names=["PREDICATE","SUBJECT_SEMTYPE","SUBJECT_NOVELTY",
                                 "OBJECT_SEMTYPE", "OBJECT_NOVELTY"],
                          na_values=r"\N")

In [3]:
raw_data.shape

(113863366, 5)

In [4]:
raw_data.head()

Unnamed: 0,PREDICATE,SUBJECT_SEMTYPE,SUBJECT_NOVELTY,OBJECT_SEMTYPE,OBJECT_NOVELTY
0,PROCESS_OF,virs,1,mamm,1.0
1,ISA,virs,1,virs,1.0
2,ISA,virs,1,virs,1.0
3,ISA,virs,1,virs,1.0
4,PROCESS_OF,dsyn,0,humn,0.0


## Basic Filtering

### keep only novelty = 1

Now filter the data to keep only rows with novelty == 1 for both subject and object, since those with novelty == 0 probably aren't very helpful / interesting to Translator. The entities with novelty == 0 are [listed](https://lhncbc.nlm.nih.gov/ii/tools/SemRep_SemMedDB_SKR/dbinfo.html) in the SEMMEDDB GENERIC_CONCEPT table files, which can be downloaded [here](https://lhncbc.nlm.nih.gov/ii/tools/SemRep_SemMedDB_SKR/SemMedDB_download.html).

In [5]:
filtered_data = raw_data[(raw_data["SUBJECT_NOVELTY"] == 1) &
                         (raw_data["OBJECT_NOVELTY"] == 1)].copy()

In [6]:
filtered_data.shape

(78688677, 5)

In [7]:
## remove the novelty stuff since that will make computations faster and it's always 1 now
filtered_data = filtered_data[['SUBJECT_SEMTYPE', 'PREDICATE', 'OBJECT_SEMTYPE']]
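
The chunks above load every row before filtering. If memory is tight, a possible alternative (not what this notebook does) is to read the file in chunks and filter each chunk as it arrives. This is a minimal sketch on made-up rows; `toy_csv` and `chunksize=2` are placeholders for the real path and a much larger chunk size.

```python
import io
import pandas as pd

cols = ["PREDICATE", "SUBJECT_SEMTYPE", "SUBJECT_NOVELTY",
        "OBJECT_SEMTYPE", "OBJECT_NOVELTY"]

# made-up rows that already contain only the five columns of interest
toy_csv = io.StringIO(
    "TREATS,phsu,1,dsyn,1\n"
    "PROCESS_OF,dsyn,0,humn,0\n"
    "ISA,virs,1,virs,1\n"
    "CAUSES,bact,1,dsyn,\\N\n"
)

kept = []
for chunk in pd.read_csv(toy_csv, header=None, names=cols,
                         na_values=r"\N", chunksize=2):
    # same novelty rule as above, applied per chunk
    mask = (chunk["SUBJECT_NOVELTY"] == 1) & (chunk["OBJECT_NOVELTY"] == 1)
    kept.append(chunk.loc[mask, ["SUBJECT_SEMTYPE", "PREDICATE", "OBJECT_SEMTYPE"]])

filtered_toy = pd.concat(kept, ignore_index=True)
print(filtered_toy)
```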

What semantic types are actually in the data? We have to prune down to the ones we actually want BTE operations on...

Interestingly, the stats from the [official website](https://lhncbc.nlm.nih.gov/semanticnetwork/SemanticNetworkArchive.html) say 127 semantic types and 54 predicates, which is fewer than the counts found below.

In [8]:
subject_semtypes = set(filtered_data["SUBJECT_SEMTYPE"].unique())
object_semtypes = set(filtered_data["OBJECT_SEMTYPE"].unique())
predicates = set(filtered_data["PREDICATE"].unique())

len(subject_semtypes)  ## got 133
len(object_semtypes)   ## got 135
len(predicates)        ## got 68

133

135

68

**PAUSE**

* Review the 3 sets above to see if there are things I want to remove. The normal entity semantic types have 4-letter codes...
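
Since the normal codes are 4 lowercase letters, the review can be narrowed to the values that break that pattern. A minimal sketch, where `unusual_semtypes` is a hypothetical helper and the toy set stands in for `subject_semtypes` or `object_semtypes`:

```python
import re

FOUR_LETTER_CODE = re.compile(r"[a-z]{4}")


def unusual_semtypes(semtypes):
    """Return the values that are not plain 4-letter lowercase codes."""
    return sorted(s for s in semtypes if not FOUR_LETTER_CODE.fullmatch(str(s)))


toy_types = {"aapp", "dsyn", "C0030193", "gngm,aapp"}
print(unusual_semtypes(toy_types))
```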

### Remove semantic types that we don't want to make operations from

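Some object semantic types are not normal 4-letter codes: UMLS CUIs (such as `C0030193`) and comma-joined pairs (such as `gngm,aapp`). I'm going to remove those...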

In [9]:
object_semtypes

{'C0030193',
 'C0030705',
 'aapp',
 'acab',
 'acty',
 'aggp',
 'alga',
 'amas',
 'amph',
 'anab',
 'anim',
 'anst',
 'antb',
 'arch',
 'bacs',
 'bact',
 'bdsu',
 'bdsy',
 'bhvr',
 'biof',
 'bird',
 'blor',
 'bmod',
 'bodm',
 'bpoc',
 'bsoj',
 'carb',
 'celc',
 'celf',
 'cell',
 'cgab',
 'chem',
 'chvf',
 'chvs',
 'clas',
 'clna',
 'clnd',
 'comd',
 'diap',
 'dora',
 'drdd',
 'dsyn',
 'edac',
 'eehu',
 'eico',
 'elii',
 'emod',
 'emst',
 'enzy',
 'euka',
 'famg',
 'ffas',
 'fish',
 'fndg',
 'fngs',
 'food',
 'ftcn',
 'genf',
 'geoa',
 'gngm',
 'gngm,aapp',
 'gora',
 'grup',
 'hcpp',
 'hcro',
 'hlca',
 'hops',
 'horm',
 'humn',
 'idcn',
 'imft',
 'inbe',
 'inch',
 'inpo',
 'inpr',
 'invt',
 'irda',
 'lang',
 'lbpr',
 'lbtr',
 'lipd',
 'mamm',
 'mbrt',
 'mcha',
 'medd',
 'menp',
 'mnob',
 'mobd',
 'moft',
 'mosq',
 'neop',
 'nnon',
 'npop',
 'nsba',
 'nusq',
 'ocac',
 'ocdi',
 'opco',
 'orch',
 'orga',
 'orgf',
 'orgm',
 'orgt',
 'ortf',
 'patf',
 'phob',
 'phpr',
 'phsf',
 'phsu',
 'plnt

In [10]:
## make the set of stuff we want to remove
removal1 = set(["C0030193", "C0030705", "gngm,aapp", "podg,humn"])

## remove it from the data 
filtered_data = filtered_data[ ~ filtered_data['OBJECT_SEMTYPE'].isin(removal1)]

In [11]:
## look at the semantic types again after this removal
subject_semtypes = set(filtered_data["SUBJECT_SEMTYPE"].unique())
object_semtypes = set(filtered_data["OBJECT_SEMTYPE"].unique())
predicates = set(filtered_data["PREDICATE"].unique())

len(subject_semtypes) ## was 133, now 133: didn't change
len(object_semtypes)  ## was 135, now 131: decreased by 4 (expected)
len(predicates)       ## was 68, now 65: decreased by 3 (cool)

133

131

65

### Remove Predicates we don't want to make operations from

In [12]:
predicates

{'ADMINISTERED_TO',
 'AFFECTS',
 'ASSOCIATED_WITH',
 'AUGMENTS',
 'CAUSES',
 'COEXISTS_WITH',
 'COMPLICATES',
 'CONVERTS_TO',
 'DIAGNOSES',
 'DISRUPTS',
 'INHIBITS',
 'INTERACTS_WITH',
 'ISA',
 'LOCATION_OF',
 'MANIFESTATION_OF',
 'MEASUREMENT_OF',
 'MEASURES',
 'METHOD_OF',
 'NEG_ADMINISTERED_TO',
 'NEG_AFFECTS',
 'NEG_ASSOCIATED_WITH',
 'NEG_AUGMENTS',
 'NEG_CAUSES',
 'NEG_COEXISTS_WITH',
 'NEG_COMPLICATES',
 'NEG_CONVERTS_TO',
 'NEG_DIAGNOSES',
 'NEG_DISRUPTS',
 'NEG_INHIBITS',
 'NEG_INTERACTS_WITH',
 'NEG_ISA',
 'NEG_LOCATION_OF',
 'NEG_MANIFESTATION_OF',
 'NEG_MEASUREMENT_OF',
 'NEG_MEASURES',
 'NEG_METHOD_OF',
 'NEG_OCCURS_IN',
 'NEG_PART_OF',
 'NEG_PRECEDES',
 'NEG_PREDISPOSES',
 'NEG_PREVENTS',
 'NEG_PROCESS_OF',
 'NEG_PRODUCES',
 'NEG_STIMULATES',
 'NEG_TREATS',
 'NEG_USES',
 'NEG_higher_than',
 'NEG_lower_than',
 'NEG_same_as',
 'NOM',
 'OCCURS_IN',
 'PART_OF',
 'PRECEDES',
 'PREDISPOSES',
 'PREP',
 'PREVENTS',
 'PROCESS_OF',
 'PRODUCES',
 'STIMULATES',
 'TREATS',
 'USES',
 '

**PAUSE**

* Decide what predicates you want to remove

**Current logic**

Translator currently isn't great with negation, so removing those

Also some predicates seem to actually be practice phrases? see the [article](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-12-486#Sec26)

And I didn't find the following relationships useful: ISA (subclass relationship), same_as (equivalent / it did just as well as some other thing), lower_than / higher_than (these two entities were compared and 1 was better (higher) or worse (lower) than the other), and compared_with...
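
The negated predicates all share the `NEG_` prefix, so the removal set could also be derived from the data instead of typed out by hand. A minimal sketch on a toy set of predicates (the hand-written lists in the next chunk are what this notebook actually uses):

```python
toy_predicates = {"TREATS", "NEG_TREATS", "NEG_same_as", "same_as",
                  "ISA", "NOM", "CAUSES"}

# derive negations from the prefix; the other two groups stay hand-picked
negative = {p for p in toy_predicates if p.startswith("NEG_")}
practice_phrase = {"VERB", "NOM", "PREP"}
dont_like = {"ISA", "same_as", "lower_than", "higher_than", "compared_with"}

to_remove = negative | practice_phrase | dont_like
kept = sorted(toy_predicates - to_remove)
print(kept)
```

Deriving the list this way also picks up any new `NEG_` predicate that appears in a later SEMMEDDB release.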

In [13]:
negative_preds = ["NEG_ADMINISTERED_TO", "NEG_AFFECTS", "NEG_ASSOCIATED_WITH",
                  "NEG_AUGMENTS", "NEG_CAUSES", "NEG_COEXISTS_WITH", "NEG_COMPLICATES",
                  "NEG_CONVERTS_TO", "NEG_DIAGNOSES", "NEG_DISRUPTS", "NEG_INHIBITS",
                  "NEG_INTERACTS_WITH", "NEG_ISA", "NEG_LOCATION_OF", "NEG_MANIFESTATION_OF",
                  "NEG_MEASUREMENT_OF", "NEG_MEASURES", "NEG_METHOD_OF", "NEG_OCCURS_IN",
                  "NEG_PART_OF", "NEG_PRECEDES", "NEG_PREDISPOSES", "NEG_PREVENTS",
                  "NEG_PROCESS_OF", "NEG_PRODUCES", "NEG_STIMULATES", "NEG_TREATS", "NEG_USES",
                  "NEG_higher_than", "NEG_lower_than", "NEG_same_as"
                 ]

practice_phrase = ["VERB", "NOM", "PREP"]

dont_like = ["ISA", "same_as", "lower_than", "higher_than", "compared_with"]

In [14]:
filtered_data = filtered_data[ ~ filtered_data['PREDICATE'].isin(negative_preds)]
filtered_data = filtered_data[ ~ filtered_data['PREDICATE'].isin(practice_phrase)]
filtered_data = filtered_data[ ~ filtered_data['PREDICATE'].isin(dont_like)]

Now look at the stats. The [official website](https://lhncbc.nlm.nih.gov/semanticnetwork/SemanticNetworkArchive.html) says 127 semantic types and 54 predicates. If we removed half of the 54 because we removed negated predicates, then we get 27...so we match the official stats now.

In [15]:
## look at the semantic types again after this removal
subject_semtypes = set(filtered_data["SUBJECT_SEMTYPE"].unique())
object_semtypes = set(filtered_data["OBJECT_SEMTYPE"].unique())
predicates = set(filtered_data["PREDICATE"].unique())

len(subject_semtypes) ## was 133, now 127: decreased by 6
len(object_semtypes)  ## was 131, now 127: decreased by 4
len(predicates)       ## was 65, now 27: decreased by 38 (was expected)

127

127

27

so....that's how many kinds of subjects, objects, and predicates we have to go forward with...

In [16]:
combos = filtered_data.value_counts().reset_index()

In [17]:
combos.shape
## so that's still a lot....

(14796, 4)
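
To make the shape above concrete: `value_counts()` on the three remaining columns counts each distinct subject / predicate / object combination, and `reset_index()` turns that into a 4-column table. A toy sketch follows; passing `name="count"` is an addition here so that the count column has the same name regardless of pandas version.

```python
import pandas as pd

toy = pd.DataFrame({
    "SUBJECT_SEMTYPE": ["phsu", "phsu", "gngm"],
    "PREDICATE": ["TREATS", "TREATS", "AFFECTS"],
    "OBJECT_SEMTYPE": ["dsyn", "dsyn", "dsyn"],
})

# one row per distinct combination, most frequent first
toy_combos = toy.value_counts().reset_index(name="count")
print(toy_combos)
```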

## Specific Filtering

so we actually only need to use certain semantic types...it depends on how they map to the biolink model

As a reminder:

* Get SEMMEDDB SRDEF file needed for interpreting and mapping SEMMED semantic types: 
  * download the compressed file [here](https://lhncbc.nlm.nih.gov/semanticnetwork/download.html) or pick the latest version [here](https://lhncbc.nlm.nih.gov/semanticnetwork/SemanticNetworkArchive.html)
  * information on the SRDEF file [here](https://www.ncbi.nlm.nih.gov/books/NBK9679/#ch05.sec5.2)

### Ingest SEMMED semantic info and biolink model

In [18]:
srdef_location = pathlib.Path.home().joinpath(
            "Desktop", "RawDataFiles", "SEMMEDDB", "2020AA", "SRDEF")

In [19]:
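## NOTE: when `names=` is given, `header=0` discards the file's first line.
##   If SRDEF has no header row (check with `head`), `header=None` keeps the first record.
srdef = pd.read_csv(srdef_location, sep="|", header=0, index_col=False,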
                    names=["Record Type (RT)",
                           "Unique Identifier (UI)",
                           "Full Name (STY/RL)", 
                           "Tree Number (STN/RTN)", 
                           "Definition (DEF)",
                           "Examples (EX)",
                           "Usage Note (UN)",
                           "Non-Human Flag (NH)",
                           "Abbreviation (ABR)",
                           "Inverse Relation (RIN)"])

In [20]:
srdef[srdef['Abbreviation (ABR)'] == 'aapp']

Unnamed: 0,Record Type (RT),Unique Identifier (UI),Full Name (STY/RL),Tree Number (STN/RTN),Definition (DEF),Examples (EX),Usage Note (UN),Non-Human Flag (NH),Abbreviation (ABR),Inverse Relation (RIN)
97,STY,T116,"Amino Acid, Peptide, or Protein",A1.4.1.2.1.7,Amino acids and chains of amino acids connecte...,,When the concept is both an enzyme and a prote...,,aapp,
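
One thing worth checking in the `read_csv` call above is the `header=0` argument. When `names=` is supplied, pandas treats the first line as a header to be replaced, so if the file has no header row its first record is dropped. The toy sketch below shows the difference; the two lines are made-up stand-ins for SRDEF records, and whether the real file starts with a data record should be confirmed with `head`.

```python
import io
import pandas as pd

toy_srdef = "STY|T001|Organism\nSTY|T002|Plant\n"
names = ["RT", "UI", "FullName"]

# header=0 + names: first line is consumed as a (replaced) header
with_header0 = pd.read_csv(io.StringIO(toy_srdef), sep="|", header=0, names=names)
# header=None + names: every line is kept as data
with_no_header = pd.read_csv(io.StringIO(toy_srdef), sep="|", header=None, names=names)

print(len(with_header0), len(with_no_header))
```

If this applies to the real file, it could explain a semantic-type count that is one lower than the official stats.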


### Add semantic types that are in the data but not in SRDEF

Yao [noticed](https://github.com/biothings/pending.api/issues/30#issuecomment-903609946) that the data file uses "old SEMMED semantic types" from 2013AA, and these semantic types didn't exist in the latest SRDEF file (2020AA).

In response, Andrew said to [keep](https://github.com/biothings/pending.api/issues/30#issuecomment-903879782) this data

In [21]:
## simplify the df
SEMMED_entity_types = srdef[srdef['Record Type (RT)'] == 'STY'].copy()
SEMMED_entity_types = SEMMED_entity_types[['Unique Identifier (UI)',
                                           'Full Name (STY/RL)', 
                                           'Abbreviation (ABR)']].copy()
SEMMED_entity_types.columns = ['UI', 'FullName', "Abbrev"]
SEMMED_entity_types.sort_values(by='Abbrev', inplace=True)

## quick view
SEMMED_entity_types.shape
SEMMED_entity_types[0:10]

(126, 3)

Unnamed: 0,UI,FullName,Abbrev
97,T116,"Amino Acid, Peptide, or Protein",aapp
15,T020,Acquired Abnormality,acab
44,T052,Activity,acty
90,T100,Age Group,aggp
77,T087,Amino Acid Sequence,amas
6,T011,Amphibian,amph
115,T190,Anatomical Abnormality,anab
4,T008,Animal,anim
12,T017,Anatomical Structure,anst
119,T195,Antibiotic,antb


In [22]:
## get all semantic types in the data
data_types = subject_semtypes.union(object_semtypes)
only_in_data = data_types - set(SEMMED_entity_types['Abbrev'])

print("these semantic types are in the data, but not in SRDEF:")
only_in_data

len(only_in_data)

these semantic types are in the data, but not in SRDEF:


{'alga',
 'carb',
 'eico',
 'invt',
 'lipd',
 'nsba',
 'opco',
 'orgm',
 'rich',
 'strd'}

10

How I found info on these semantic types from [previous semantic versions](https://lhncbc.nlm.nih.gov/semanticnetwork/SemanticNetworkArchive.html):

was in 2009AB file (sets with SG_2?)
* alga
* invt
* rich

was in 2014AB file (sets with SG_3?)
* carb 
* eico 
* lipd 
* nsba 
* opco
* orgm
* strd

In [23]:
## build the missing info and add to the table
missing_semantic_types = [{"UI": "T003", "FullName": "Alga", "Abbrev": "alga"},
                          {"UI": "T118", "FullName": "Carbohydrate", "Abbrev": "carb"},
                          {"UI": "T111", "FullName": "Eicosanoid", "Abbrev": "eico"},
                          {"UI": "T009", "FullName": "Invertebrate", "Abbrev": "invt"},
                          {"UI": "T119", "FullName": "Lipid", "Abbrev": "lipd"},
                          {"UI": "T124", "FullName": "Neuroreactive Substance or Biogenic Amine", "Abbrev": "nsba"},
                          {"UI": "T115", "FullName": "Organophosphorus Compound", "Abbrev": "opco"},
                          {"UI": "T001", "FullName": "Organism", "Abbrev": "orgm"},
                          {"UI": "T006", "FullName": "Rickettsia or Chlamydia", "Abbrev": "rich"},
                          {"UI": "T110", "FullName": "Steroid", "Abbrev": "strd"}
                         ]

missing_semantic_types = pd.DataFrame.from_records(missing_semantic_types)

## add it to the table
## SEMMED_entity_types = SEMMED_entity_types.append(missing_semantic_types)

SEMMED_entity_types = pd.concat([SEMMED_entity_types, missing_semantic_types])

In [24]:
SEMMED_entity_types.sort_values(by='Abbrev', inplace=True)

## quick view
SEMMED_entity_types.shape
SEMMED_entity_types[0:10]

(136, 3)

Unnamed: 0,UI,FullName,Abbrev
97,T116,"Amino Acid, Peptide, or Protein",aapp
15,T020,Acquired Abnormality,acab
44,T052,Activity,acty
90,T100,Age Group,aggp
0,T003,Alga,alga
77,T087,Amino Acid Sequence,amas
6,T011,Amphibian,amph
115,T190,Anatomical Abnormality,anab
4,T008,Animal,anim
12,T017,Anatomical Structure,anst


### Get the biolink-mappings

In [25]:
## using biolink 3.1.1
bmt_tool = Toolkit('https://raw.githubusercontent.com/biolink/biolink-model/v3.1.1/biolink-model.yaml')

In [26]:
bmt_tool.get_element_by_mapping('STY:T123').title()

'Small Molecule'

In [27]:
## getting biolink-mapping, in the format needed to create operations
SEMMED_entity_types['BiolinkMapping'] = [bmt_tool.get_element_by_mapping('STY:'+i)
                                         for i in SEMMED_entity_types['UI']]
## put these node categories/semantic-types in the correct format: PascalCase
SEMMED_entity_types['BiolinkMapping'] = [i.title().replace(" ", "") if isinstance(i, str)
                                         else i for i in SEMMED_entity_types['BiolinkMapping']]
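
The PascalCase step above can be tried on plain strings without loading the biolink model. In this sketch `to_pascal_case` is a hypothetical helper that wraps the same `.title().replace(" ", "")` logic; the last call illustrates one limitation of `str.title()`, which lowercases the tail of an all-caps word.

```python
def to_pascal_case(name):
    """Mirror the conversion above; non-strings (e.g. None) pass through."""
    if not isinstance(name, str):
        return name
    return name.title().replace(" ", "")


print(to_pascal_case("small molecule"))
print(to_pascal_case(None))
print(to_pascal_case("RNA product"))   # acronym casing is lost by .title()
```

If a mapped name ever contains an acronym, the result may need a manual correction before it is used as a category.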

In [28]:
SEMMED_entity_types

Unnamed: 0,UI,FullName,Abbrev,BiolinkMapping
97,T116,"Amino Acid, Peptide, or Protein",aapp,Polypeptide
15,T020,Acquired Abnormality,acab,Disease
44,T052,Activity,acty,Activity
90,T100,Age Group,aggp,Cohort
0,T003,Alga,alga,
...,...,...,...,...
70,T079,Temporal Concept,tmco,InformationContentEntity
53,T061,Therapeutic or Preventive Procedure,topp,Procedure
2,T005,Virus,virs,
104,T127,Vitamin,vita,SmallMolecule


### Clean up the semantic type to only have terms in the data

In [29]:
print("note that these semantic types are in SRDEF but not in the data")
set(SEMMED_entity_types['Abbrev']) - data_types

note that these semantic types are in SRDEF but not in the data


{'cnce', 'crbs', 'enty', 'grpa', 'lang', 'phob'}

In [30]:
## get the subset of semantic network terms that are actually in the data
SEMMED_entity_types = SEMMED_entity_types[SEMMED_entity_types['Abbrev'].isin(data_types)].copy()

In [31]:
SEMMED_entity_types

Unnamed: 0,UI,FullName,Abbrev,BiolinkMapping
97,T116,"Amino Acid, Peptide, or Protein",aapp,Polypeptide
15,T020,Acquired Abnormality,acab,Disease
44,T052,Activity,acty,Activity
90,T100,Age Group,aggp,Cohort
0,T003,Alga,alga,
...,...,...,...,...
70,T079,Temporal Concept,tmco,InformationContentEntity
53,T061,Therapeutic or Preventive Procedure,topp,Procedure
2,T005,Virus,virs,
104,T127,Vitamin,vita,SmallMolecule


### Review mappings and decide what to change or remove

**PAUSE**

This is a place to **STOP** and review all the SEMMED semantic types and their mapping to biolink semantic types...to decide what we are interested in keeping. This involves some knowledge of what biolink semantic types are prioritized in Translator. 

One can use the definitions of the SEMMED semantic types (from the SRDEF file or [browsing the UMLS vocab online](https://uts.nlm.nih.gov/uts/umls/semantic-network/root)) and the definitions of biolink semantic types (look for a comment with 'THINGS' in the biolink-model yaml file)

See the last section (section 6) for notes on decisions that were made...

---

Sections 3.5 + 3.6 below involve this review, changing mappings, and removing some SEMMED semantic types
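
For the review itself, it can help to see all SEMMED abbreviations grouped under each biolink mapping, including the ones with no mapping. A minimal sketch on a toy table with the same two columns as `SEMMED_entity_types`; the `"(no mapping)"` label is just a placeholder chosen here.

```python
import pandas as pd

toy_types = pd.DataFrame({
    "Abbrev": ["aapp", "acab", "anab", "alga"],
    "BiolinkMapping": ["Polypeptide", "Disease", "Disease", None],
})

# label missing mappings so they show up as their own group
by_mapping = (
    toy_types.fillna({"BiolinkMapping": "(no mapping)"})
    .groupby("BiolinkMapping")["Abbrev"]
    .apply(list)
)
print(by_mapping)
```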

In [32]:
## code used to review 
SEMMED_entity_types['BiolinkMapping'].unique()

array(['Polypeptide', 'Disease', 'Activity', 'Cohort', None,
       'AnatomicalEntity', 'Drug', 'SmallMolecule', 'Behavior',
       'Phenomenon', 'Device', 'GrossAnatomicalStructure',
       'CellularComponent', 'PhysiologicalProcess', 'Cell',
       'ChemicalEntity', 'InformationContentEntity', 'ClinicalAttribute',
       'Procedure', 'BiologicalEntity', 'Protein', 'Event',
       'DiseaseOrPhenotypicFeature', 'Food', 'GeographicLocation',
       'GenomicEntity', 'Agent', 'PathologicalProcess', 'Publication',
       'PhysicalEntity', 'MolecularActivity', 'MolecularEntity',
       'NucleicAcidEntity', 'OrganismAttribute',
       'PopulationOfIndividualOrganisms', 'PhenotypicFeature'],
      dtype=object)

In [33]:
SEMMED_entity_types[SEMMED_entity_types['BiolinkMapping'] == 'NucleicAcidEntity']
# SEMMED_entity_types[SEMMED_entity_types['BiolinkMapping'].isna()]

Unnamed: 0,UI,FullName,Abbrev,BiolinkMapping
96,T114,"Nucleic Acid, Nucleoside, or Nucleotide",nnon,NucleicAcidEntity
76,T086,Nucleotide Sequence,nusq,NucleicAcidEntity


In [34]:
## re-mapping based on putting IDs into normalization service / our operation system...
##   we have UMLS for Disease (mydisease, mychem), SmallMolecule (idisk)

## leaving aapp as Polypeptide, enzy mapped to Protein
## leaving Disease mappings as-is, but some seem like they could be PathologicalAnatomicalStructure 
##   (acab, anab, cgab) or PathologicalProcess (comd) instead
## left clnd (Clinical Drug) as Drug since it really seemed like a drug (dosage)

## currently mapped to Drug, but re-mapping here
SEMMED_entity_types.loc[
    (SEMMED_entity_types['Abbrev'] == 'antb'),'BiolinkMapping'] = 'SmallMolecule'  ## Antibiotic
SEMMED_entity_types.loc[
    (SEMMED_entity_types['Abbrev'] == 'phsu'),'BiolinkMapping'] = 'SmallMolecule'  ## Pharmacologic Substance

## currently mapped to GenomicEntity, but re-mapping here
SEMMED_entity_types.loc[
    (SEMMED_entity_types['Abbrev'] == 'gngm'),'BiolinkMapping'] = 'Gene'

## currently mapped to NucleicAcidEntity, but re-mapping here
SEMMED_entity_types.loc[
    (SEMMED_entity_types['Abbrev'] == 'nnon'),'BiolinkMapping'] = 'SmallMolecule'  ## Nucleic Acid, Nucleoside, or Nucleotide

## make sure vita is mapped (the table above already shows SmallMolecule for it)
SEMMED_entity_types.loc[
    (SEMMED_entity_types['Abbrev'] == 'vita'),'BiolinkMapping'] = 'SmallMolecule'  ## Vitamin
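
The same re-mapping could also be written as one dictionary of overrides, which keeps all the decisions in a single place. This is an alternative sketch on a toy table, not the notebook's own method; `overrides` is a name introduced here.

```python
import pandas as pd

toy_types = pd.DataFrame({
    "Abbrev": ["antb", "gngm", "dsyn"],
    "BiolinkMapping": ["Drug", "GenomicEntity", "Disease"],
})

overrides = {"antb": "SmallMolecule", "gngm": "Gene"}

# map() yields NaN for abbreviations without an override;
# fillna() then restores the original mapping for those rows
toy_types["BiolinkMapping"] = (
    toy_types["Abbrev"].map(overrides).fillna(toy_types["BiolinkMapping"])
)
print(toy_types)
```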

In [35]:
currently_unused_entities = [
    ## mapped to Polypeptide
    "amas",     
    ## all that are mapped to Activity
    "acty", "dora", "edac", "gora", "hlca", "mcha", "ocac", "resa",  
    ## all that are mapped to Cohort
    "aggp", "famg", "podg", "prog",  
    ## missing mappings
    "amph", "anim", "arch", "bact", "bdsu", "bdsy", "bird", "blor", 
    "bmod", "bsoj", "chvf", "chvs", "euka", "ffas", "fish", "fngs", 
    "humn", "irda", "mamm", "ocdi", "plnt", "rept", "sbst", "virs", "vtbt",
    ## old terms here
    "alga", "invt", "orgm", "rich",
    ## all that are mapped to AnatomicalEntity
    "anst",
    ## all that are mapped to Behavior
    "bhvr", "inbe", "menp", "socb",
    ## all that are mapped to Phenomenon
    "biof", "eehu", "hcpp", "lbtr", "npop", "phpr",
    ## all that are mapped to Device
    "bodm", "drdd", "medd", "resd",
    ## mapped to GrossAnatomicalStructure
    "emst", "tisu", 
    ## mapped to PhysiologicalProcess
    "genf",
    ## mapped to ChemicalEntity
    "chem",
    ## all that are mapped to InformationContentEntity
    "clas", "ftcn", "idcn", "qlco", "qnco", "rnlw", "spco", "tmco",
    ## all that are mapped to ClinicalAttribute
    "clna",
    ## mapped to Procedure
    "lbpr", "mbrt", 
    ## mapped to SmallMolecule
    "elii",
    ## all that are mapped to BiologicalEntity
    "emod",
    ## mapped to Protein
    "rcpt",
    ## all that are mapped to Event
    "evnt",
    ## all that are mapped to DiseaseOrPhenotypicFeature
    "fndg",
    ## all that are mapped to GeographicLocation
    "geoa",
    ## all that are mapped to Agent
    "grup", "hcro", "orgt", "pros", "shro",
    ## all that are mapped to Publication
    "inpr",
    ## all that are mapped to PhysicalEntity
    "mnob",
    ## all that are mapped to MolecularEntity
    "mosq",
    ## mapped to NucleicAcidEntity
    "nusq",
    ## all that are mapped to OrganismAttribute
    "orga",
    ## all that are mapped to PopulationOfIndividualOrganisms
    "popg"
]

### prune data based on what node entities to remove

In [36]:
## prune the semantic-type table
SEMMED_entity_types = SEMMED_entity_types[ ~ SEMMED_entity_types['Abbrev'].isin(currently_unused_entities)]

## prune the predication data (both subject and object sides)
filtered_data = filtered_data[ ~ filtered_data['SUBJECT_SEMTYPE'].isin(currently_unused_entities)]
filtered_data = filtered_data[ ~ filtered_data['OBJECT_SEMTYPE'].isin(currently_unused_entities)]
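
After pruning, every semantic type left in the data should have a row with a biolink mapping in the table. A quick consistency check could look like the sketch below, shown on toy stand-ins for `SEMMED_entity_types` and `filtered_data`; an empty result means nothing was missed.

```python
import pandas as pd

toy_table = pd.DataFrame({
    "Abbrev": ["phsu", "dsyn"],
    "BiolinkMapping": ["SmallMolecule", "Disease"],
})
toy_data = pd.DataFrame({
    "SUBJECT_SEMTYPE": ["phsu", "phsu"],
    "PREDICATE": ["TREATS", "CAUSES"],
    "OBJECT_SEMTYPE": ["dsyn", "dsyn"],
})

types_in_data = set(toy_data["SUBJECT_SEMTYPE"]) | set(toy_data["OBJECT_SEMTYPE"])
mapped_types = set(toy_table.dropna(subset=["BiolinkMapping"])["Abbrev"])

# semantic types still in the data that lack a mapped row in the table
unmapped = types_in_data - mapped_types
print(unmapped)
```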

In [37]:
## look at the semantic types again after this removal
subject_semtypes = set(filtered_data["SUBJECT_SEMTYPE"].unique())
object_semtypes = set(filtered_data["OBJECT_SEMTYPE"].unique())
predicates = set(filtered_data["PREDICATE"].unique())

len(subject_semtypes) ## was 127, now 41: decreased by 86
len(object_semtypes)  ## was 127, now 41: decreased by 86
len(predicates)       ## was 27, now 26: decreased by 1 (cool)

41

41

26

In [38]:
SEMMED_entity_types

Unnamed: 0,UI,FullName,Abbrev,BiolinkMapping
97,T116,"Amino Acid, Peptide, or Protein",aapp,Polypeptide
15,T020,Acquired Abnormality,acab,Disease
115,T190,Anatomical Abnormality,anab,Disease
119,T195,Antibiotic,antb,SmallMolecule
101,T123,Biologically Active Substance,bacs,SmallMolecule
18,T023,"Body Part, Organ, or Organ Component",bpoc,GrossAnatomicalStructure
1,T118,Carbohydrate,carb,SmallMolecule
21,T026,Cell Component,celc,CellularComponent
35,T043,Cell Function,celf,PhysiologicalProcess
20,T025,Cell,cell,Cell


### getting mappings for predicates and qualifiers

In [39]:
SEMMED_predicates = srdef[srdef['Record Type (RT)'] == 'RL'].copy()

SEMMED_predicates = SEMMED_predicates[['Full Name (STY/RL)'
                                      ]].copy()
SEMMED_predicates.columns = ['SemmedPred']
SEMMED_predicates.sort_values(by='SemmedPred', inplace=True)

**PAUSE**

The SRDEF file might be missing predicates that are in the data, so check that using the code block below and see if you would like to add those back in...

In [40]:
## predicates currently in the filtered data

## build the lookup set once, then collect the lowercase predicates that SRDEF lacks
known_preds = set(SEMMED_predicates['SemmedPred'])
missing_preds = sorted(i.lower() for i in predicates if i.lower() not in known_preds)
missing_preds

['administered_to',
 'augments',
 'coexists_with',
 'converts_to',
 'inhibits',
 'predisposes',
 'stimulates']

Currently, we add them back in.

In [41]:
## manually add some predicates that are in the data, but not in SRDEF...

SEMMED_predicates.loc[len(SEMMED_predicates.index)] = ['administered_to']
SEMMED_predicates.loc[len(SEMMED_predicates.index)] = ['augments']
## notice that the data doesn't have a hyphen for coexists...even though SRDEF has hyphen for co-occurs_with
SEMMED_predicates.loc[len(SEMMED_predicates.index)] = ['coexists_with']
SEMMED_predicates.loc[len(SEMMED_predicates.index)] = ['converts_to']
SEMMED_predicates.loc[len(SEMMED_predicates.index)] = ['inhibits']
SEMMED_predicates.loc[len(SEMMED_predicates.index)] = ['predisposes']
SEMMED_predicates.loc[len(SEMMED_predicates.index)] = ['stimulates']

SEMMED_predicates.sort_values(by='SemmedPred', inplace=True)

In [42]:
## since biolink model doesn't use hyphens
SEMMED_predicates['Semmed_in_BM']= ['SEMMEDDB:'+i 
                                   for i in SEMMED_predicates['SemmedPred'].str.replace("-", "")]
SEMMED_predicates['Semmed_in_BM'] = SEMMED_predicates['Semmed_in_BM'].str.upper()
SEMMED_predicates

Unnamed: 0,SemmedPred,Semmed_in_BM
162,adjacent_to,SEMMEDDB:ADJACENT_TO
54,administered_to,SEMMEDDB:ADMINISTERED_TO
145,affects,SEMMEDDB:AFFECTS
176,analyzes,SEMMEDDB:ANALYZES
158,assesses_effect_of,SEMMEDDB:ASSESSES_EFFECT_OF
...,...,...
130,temporally_related_to,SEMMEDDB:TEMPORALLY_RELATED_TO
166,traverses,SEMMEDDB:TRAVERSES
148,treats,SEMMEDDB:TREATS
178,tributary_of,SEMMEDDB:TRIBUTARY_OF
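
The effect of the hyphen removal is easiest to see on a hyphenated name such as `co-occurs_with`. The sketch below reproduces the transformation on a toy Series; `regex=False` is added here so the hyphen is always treated as a literal character.

```python
import pandas as pd

toy_preds = pd.Series(["co-occurs_with", "treats"])

# drop hyphens, upper-case, and add the prefix used in biolink mappings
semmed_in_bm = "SEMMEDDB:" + toy_preds.str.replace("-", "", regex=False).str.upper()
print(semmed_in_bm.tolist())
```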


In [43]:
## getting the mapping
SEMMED_predicates['BMPred'] = [bmt_tool.get_element_by_mapping(i)
                                       for i in SEMMED_predicates['Semmed_in_BM']]

## since the biolink-mapping uses spaces but resources need to use snake_case
SEMMED_predicates['BMPred'] = SEMMED_predicates['BMPred'].str.replace(" ", "_")

In [44]:
## remember there are 26 predicates in the data
len(predicates)
pred_compare = [i.lower() for i in predicates]

SEMMED_predicates[SEMMED_predicates['SemmedPred'].isin(pred_compare)]

26

Unnamed: 0,SemmedPred,Semmed_in_BM,BMPred
54,administered_to,SEMMEDDB:ADMINISTERED_TO,affects
145,affects,SEMMEDDB:AFFECTS,affects
160,associated_with,SEMMEDDB:ASSOCIATED_WITH,related_to
55,augments,SEMMEDDB:AUGMENTS,
141,causes,SEMMEDDB:CAUSES,causes
56,coexists_with,SEMMEDDB:COEXISTS_WITH,coexists_with
143,complicates,SEMMEDDB:COMPLICATES,exacerbates
57,converts_to,SEMMEDDB:CONVERTS_TO,derives_into
157,diagnoses,SEMMEDDB:DIAGNOSES,diagnoses
140,disrupts,SEMMEDDB:DISRUPTS,disrupts


**PAUSE**

Review the mappings in the table printed by the previous code chunk. 

For the biolink 3.1.1 update, we:
* map semmeddb `augments` and `stimulates` to biolink's `affects` with qualifiers: 
  * qualified_predicate: causes
  * object_aspect_qualifier: activity_or_abundance
  * object_direction_qualifier: increased
* map semmeddb `inhibits` to biolink's `affects` with qualifiers: 
  * qualified_predicate: causes
  * object_aspect_qualifier: activity_or_abundance
  * object_direction_qualifier: decreased

Notes:
* keeping the mapping for semmeddb `administered_to` to biolink-model's `affects` but feel like it's not quite right since it's "trying something out with a goal", not that it actually does something...see [definition](https://uts.nlm.nih.gov/uts/umls/semantic-network/T154)
* semmeddb `diagnoses` is more of a "this thing distinguishes / identifies this other thing"...see [definition](https://uts.nlm.nih.gov/uts/umls/semantic-network/T163)
* semmeddb `process_of` does seem to map well to biolink-model's `occurs_in` so...okay. As an example, see the use of `process_of` in the [API's data](https://biothings.ncats.io/semmeddb/query?q=predicate:PROCESS_OF%20AND%20object.semantic_type_abbreviation:cell)
* semmeddb `treats` is more general than biolink-model's `treats` (it's more of "trying this out as a treatment"), but we're keeping the mapping. See [definition](https://uts.nlm.nih.gov/uts/umls/semantic-network/T154)
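
The qualifier decisions listed above can be expressed as a small lookup table and applied to the predicate table. This is a sketch on a toy frame that reuses the column names from the next chunk; `QUALIFIED_MAPPINGS` is a name introduced here, and the notebook itself uses an explicit loop with `if` / `elif` branches instead.

```python
import pandas as pd

INCREASES = {"BMPred": "affects", "BMQualPred": "causes",
             "BMObjAsp": "activity_or_abundance", "BMObjDirect": "increased"}
DECREASES = {**INCREASES, "BMObjDirect": "decreased"}

QUALIFIED_MAPPINGS = {
    "augments": INCREASES,
    "stimulates": INCREASES,
    "inhibits": DECREASES,
}

toy_preds = pd.DataFrame({
    "SemmedPred": ["augments", "inhibits", "treats"],
    "BMPred": [None, None, "treats"],
})
# qualifier columns start empty; most rows keep them empty
for col in ["BMQualPred", "BMObjAsp", "BMObjDirect"]:
    toy_preds[col] = None

for semmed_pred, values in QUALIFIED_MAPPINGS.items():
    mask = toy_preds["SemmedPred"] == semmed_pred
    for col, value in values.items():
        toy_preds.loc[mask, col] = value

print(toy_preds)
```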

In [45]:
## complex change: modify table to use qualifiers for semmeddb augments, stimulates, and inhibits

## At the moment, only need the following qualifiers added, to map these predicates
##   QualPred = qualified_predicate
##   ObjAsp = object_aspect_qualifier
##   ObjDirect = object_direction_qualifier

## First, add the columns. Most values will stay blank (None)
SEMMED_predicates['BMQualPred'] = None  
SEMMED_predicates['BMObjAsp'] = None 
SEMMED_predicates['BMObjDirect'] = None

## Next, mutate the rows for biolink predicates being remapped to predicates + qualifiers
for index in SEMMED_predicates.index:
    ## semmeddb augments and stimulates should change
    if (SEMMED_predicates.loc[index,'SemmedPred']=='augments') or \
       (SEMMED_predicates.loc[index,'SemmedPred']=='stimulates'):
        SEMMED_predicates.loc[index,'BMPred'] = 'affects'
        SEMMED_predicates.loc[index,'BMQualPred'] = 'causes'
        SEMMED_predicates.loc[index,'BMObjAsp'] = 'activity_or_abundance'
        SEMMED_predicates.loc[index,'BMObjDirect'] = 'increased'
    ## semmeddb inhibits should change
    elif SEMMED_predicates.loc[index,'SemmedPred']=='inhibits':
        SEMMED_predicates.loc[index,'BMPred'] = 'affects'
        SEMMED_predicates.loc[index,'BMQualPred'] = 'causes'
        SEMMED_predicates.loc[index,'BMObjAsp'] = 'activity_or_abundance'
        SEMMED_predicates.loc[index,'BMObjDirect'] = 'decreased'

In [46]:
## only keep the stuff we found biolink mappings for
SEMMED_predicates = SEMMED_predicates[SEMMED_predicates['BMPred'].notna()].copy()

In [47]:
## get inverses for basic predicates + qualified_predicates to generate reverse operations

SEMMED_predicates['BMPred_Inv'] = [bmt_tool.get_element(i).inverse 
                                       if isinstance(bmt_tool.get_element(i).inverse, str)
                                       else i 
                                       for i in SEMMED_predicates['BMPred']]

SEMMED_predicates['BMQualPred_Inv'] = [bmt_tool.get_element(i).inverse 
                                       if i and isinstance(bmt_tool.get_element(i).inverse, str)
                                       else i 
                                       for i in SEMMED_predicates['BMQualPred']]



SEMMED_predicates['BMPred_Inv'] = SEMMED_predicates['BMPred_Inv'].str.replace(" ", "_")
SEMMED_predicates['BMQualPred_Inv'] = SEMMED_predicates['BMQualPred_Inv'].str.replace(" ", "_")

In [48]:
SEMMED_predicates

Unnamed: 0,SemmedPred,Semmed_in_BM,BMPred,BMQualPred,BMObjAsp,BMObjDirect,BMPred_Inv,BMQualPred_Inv
54,administered_to,SEMMEDDB:ADMINISTERED_TO,affects,,,,affected_by,
145,affects,SEMMEDDB:AFFECTS,affects,,,,affected_by,
160,associated_with,SEMMEDDB:ASSOCIATED_WITH,related_to,,,,related_to,
55,augments,SEMMEDDB:AUGMENTS,affects,causes,activity_or_abundance,increased,affected_by,caused_by
141,causes,SEMMEDDB:CAUSES,causes,,,,caused_by,
56,coexists_with,SEMMEDDB:COEXISTS_WITH,coexists_with,,,,coexists_with,
143,complicates,SEMMEDDB:COMPLICATES,exacerbates,,,,is_exacerbated_by,
57,converts_to,SEMMEDDB:CONVERTS_TO,derives_into,,,,derives_from,
157,diagnoses,SEMMEDDB:DIAGNOSES,diagnoses,,,,is_diagnosed_by,
140,disrupts,SEMMEDDB:DISRUPTS,disrupts,,,,disrupted_by,


**PAUSE**

Check if the predicate inverses look correct. For biolink 3.0.3, we don't need to correct any inverses.

* Some inverses may be wrong because the predicate is directional but has no inverse defined (the code above then falls back to the predicate itself). If so, I suggest going back and changing the mapping to a predicate that has an inverse.
* Some predicates aren't directional (their entry in the biolink-model yaml will have the property `symmetric == true`), so they are identical in either direction.

<br>

However, for the biolink 3.1.1 update, we need more than the inverted predicates...we need the qualifiers for the reverse operations:
* "object_" qualifiers will invert to "subject_" qualifiers, but will keep their exact value 
* Already did above: when retrieving the inverted predicates, we also retrieved the inverted qualified_predicates
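
The two rules above can be sketched as a small function. This is only an illustration: `invert_qualifiers` and `QUALIFIED_PRED_INVERSES` are hypothetical names, and the hard-coded dictionary stands in for the inverse lookup that the notebook does through bmt.

```python
# Stand-in for the inverse lookup; in the notebook this comes from bmt.
# Only the pair that appears in this notebook's tables is listed.
QUALIFIED_PRED_INVERSES = {"causes": "caused_by"}


def invert_qualifiers(forward):
    """Turn a forward qualifier set into the one for the reverse operation."""
    reverse = {}
    for key, value in forward.items():
        if key == "qualified_predicate":
            # Fall back to the same value when no inverse is known
            reverse[key] = QUALIFIED_PRED_INVERSES.get(value, value)
        elif key.startswith("object_"):
            # object_* becomes subject_*, the value is kept as-is
            reverse["subject_" + key[len("object_"):]] = value
        else:
            reverse[key] = value
    return reverse


forward = {
    "qualified_predicate": "causes",
    "object_aspect_qualifier": "activity_or_abundance",
    "object_direction_qualifier": "increased",
}
reverse = invert_qualifiers(forward)
print(reverse)
```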


In [49]:
SEMMED_predicates['BMInv_SubjAsp'] = SEMMED_predicates['BMObjAsp'] 
SEMMED_predicates['BMInv_SubjDirect'] = SEMMED_predicates['BMObjDirect'] 

In [50]:
SEMMED_predicates

Unnamed: 0,SemmedPred,Semmed_in_BM,BMPred,BMQualPred,BMObjAsp,BMObjDirect,BMPred_Inv,BMQualPred_Inv,BMInv_SubjAsp,BMInv_SubjDirect
54,administered_to,SEMMEDDB:ADMINISTERED_TO,affects,,,,affected_by,,,
145,affects,SEMMEDDB:AFFECTS,affects,,,,affected_by,,,
160,associated_with,SEMMEDDB:ASSOCIATED_WITH,related_to,,,,related_to,,,
55,augments,SEMMEDDB:AUGMENTS,affects,causes,activity_or_abundance,increased,affected_by,caused_by,activity_or_abundance,increased
141,causes,SEMMEDDB:CAUSES,causes,,,,caused_by,,,
56,coexists_with,SEMMEDDB:COEXISTS_WITH,coexists_with,,,,coexists_with,,,
143,complicates,SEMMEDDB:COMPLICATES,exacerbates,,,,is_exacerbated_by,,,
57,converts_to,SEMMEDDB:CONVERTS_TO,derives_into,,,,derives_from,,,
157,diagnoses,SEMMEDDB:DIAGNOSES,diagnoses,,,,is_diagnosed_by,,,
140,disrupts,SEMMEDDB:DISRUPTS,disrupts,,,,disrupted_by,,,


### prune the data based on what predicates to keep

In [51]:
## reuse the Semmed_in_BM column (not needed anymore): overwrite it with the predicate format used in the data, then rename it

SEMMED_predicates['Semmed_in_BM'] = SEMMED_predicates['SemmedPred'].str.upper()
SEMMED_predicates['Semmed_in_BM'] = SEMMED_predicates['Semmed_in_BM'].str.replace("-", "")
SEMMED_predicates.rename(columns = {'Semmed_in_BM':'Semmed_in_Data'}, inplace = True)

## Can remove the SemmedPred column: not needed anymore
SEMMED_predicates.drop(columns='SemmedPred', inplace=True)

In [52]:
SEMMED_predicates

Unnamed: 0,Semmed_in_Data,BMPred,BMQualPred,BMObjAsp,BMObjDirect,BMPred_Inv,BMQualPred_Inv,BMInv_SubjAsp,BMInv_SubjDirect
54,ADMINISTERED_TO,affects,,,,affected_by,,,
145,AFFECTS,affects,,,,affected_by,,,
160,ASSOCIATED_WITH,related_to,,,,related_to,,,
55,AUGMENTS,affects,causes,activity_or_abundance,increased,affected_by,caused_by,activity_or_abundance,increased
141,CAUSES,causes,,,,caused_by,,,
56,COEXISTS_WITH,coexists_with,,,,coexists_with,,,
143,COMPLICATES,exacerbates,,,,is_exacerbated_by,,,
57,CONVERTS_TO,derives_into,,,,derives_from,,,
157,DIAGNOSES,diagnoses,,,,is_diagnosed_by,,,
140,DISRUPTS,disrupts,,,,disrupted_by,,,


**PAUSE**

Remove predicates that we don't want to use for associations, even if they are in the mapping table:
- no mapping: METHOD_OF
- already removed from data: ISA
- don't want to use it as an association predicate after seeing the mapping: 
  - MEASURES
  - USES

In [53]:
print("in data but not in mapping table")
predicates - set(SEMMED_predicates['Semmed_in_Data'])

print("in mapping file but not in data")
set(SEMMED_predicates['Semmed_in_Data']) - predicates

in data but not in mapping table


{'METHOD_OF'}

in mapping file but not in data


{'ISA'}

In [54]:
more_removals = {'METHOD_OF', 'ISA', 'MEASURES', 'USES'}
more_removals

{'ISA', 'MEASURES', 'METHOD_OF', 'USES'}

In [55]:
## remove this set from the SEMMED_predicates table
SEMMED_predicates = SEMMED_predicates[~ SEMMED_predicates['Semmed_in_Data'].isin(more_removals)]
SEMMED_predicates
SEMMED_predicates.shape

Unnamed: 0,Semmed_in_Data,BMPred,BMQualPred,BMObjAsp,BMObjDirect,BMPred_Inv,BMQualPred_Inv,BMInv_SubjAsp,BMInv_SubjDirect
54,ADMINISTERED_TO,affects,,,,affected_by,,,
145,AFFECTS,affects,,,,affected_by,,,
160,ASSOCIATED_WITH,related_to,,,,related_to,,,
55,AUGMENTS,affects,causes,activity_or_abundance,increased,affected_by,caused_by,activity_or_abundance,increased
141,CAUSES,causes,,,,caused_by,,,
56,COEXISTS_WITH,coexists_with,,,,coexists_with,,,
143,COMPLICATES,exacerbates,,,,is_exacerbated_by,,,
57,CONVERTS_TO,derives_into,,,,derives_from,,,
157,DIAGNOSES,diagnoses,,,,is_diagnosed_by,,,
140,DISRUPTS,disrupts,,,,disrupted_by,,,


(23, 9)

In [56]:
## remove this set from the data record
filtered_data = filtered_data[ ~ filtered_data['PREDICATE'].isin(more_removals)]

In [57]:
## look at the semantic types again after this removal
subject_semtypes = set(filtered_data["SUBJECT_SEMTYPE"].unique())
object_semtypes = set(filtered_data["OBJECT_SEMTYPE"].unique())
predicates = set(filtered_data["PREDICATE"].unique())

len(subject_semtypes) ## was 41, still 41 (expected)
len(object_semtypes)  ## was 41, still 41 (expected)
len(predicates)       ## was 26, now 23: decreased by 3 (expected: ISA was already removed earlier)

41

41

23

## Final Filter: row counts per combo

In [58]:
## look at number of combos after this removal
combos = filtered_data.value_counts().reset_index()
combos.columns = ['SUBJECT_SEMTYPE', 'PREDICATE', 'OBJECT_SEMTYPE', 'COUNT']
combos.shape
combos.head(10)

(5936, 4)

Unnamed: 0,SUBJECT_SEMTYPE,PREDICATE,OBJECT_SEMTYPE,COUNT
0,bpoc,LOCATION_OF,neop,1023005
1,bpoc,LOCATION_OF,aapp,953010
2,topp,TREATS,dsyn,849825
3,cell,LOCATION_OF,aapp,815374
4,bpoc,LOCATION_OF,patf,799234
5,topp,TREATS,neop,707094
6,bpoc,LOCATION_OF,dsyn,696732
7,phsu,TREATS,dsyn,660654
8,cell,PART_OF,bpoc,598884
9,bpoc,PART_OF,bpoc,542301


**PAUSE**

use the code block below to decide how many combos to keep based on how many predications/records there are per combo...

---

Current:

let's try a limit: only build operations if there are > 100 records (predications) for the triple...

In [59]:
combos[(combos['COUNT'] > 100)]

Unnamed: 0,SUBJECT_SEMTYPE,PREDICATE,OBJECT_SEMTYPE,COUNT
0,bpoc,LOCATION_OF,neop,1023005
1,bpoc,LOCATION_OF,aapp,953010
2,topp,TREATS,dsyn,849825
3,cell,LOCATION_OF,aapp,815374
4,bpoc,LOCATION_OF,patf,799234
...,...,...,...,...
3027,lipd,TREATS,neop,102
3028,strd,AUGMENTS,neop,101
3029,clnd,TREATS,anab,101
3030,imft,PREDISPOSES,inpo,101
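
Before settling on a cutoff, it can help to see how many combos survive at a few candidate values. A minimal sketch with made-up counts (the list and the candidate thresholds are illustrative, not taken from the data, and `combos_kept` is a hypothetical helper); in the notebook the same comparison would run against `combos['COUNT']`.

```python
def combos_kept(counts, threshold):
    """Number of combos whose record count is strictly above `threshold`."""
    return sum(1 for c in counts if c > threshold)


# Made-up record counts per combo, for illustration only
toy_counts = [1023005, 17938, 438, 289, 102, 101, 100, 57, 12, 3]

survivors = {t: combos_kept(toy_counts, t) for t in (10, 50, 100, 1000)}
for threshold, kept in survivors.items():
    print(f"COUNT > {threshold}: {kept} combos kept")
```

Note that the comparison is strict, matching `combos['COUNT'] > 100`: a combo with exactly 100 records is dropped.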


In [60]:
## checking that the aapp - PRODUCES - carb triple appears w/ this limit
## https://biothings.ncats.io/semmeddb/query?q=object.umls:C0043369%20AND%20subject.umls:C0002003
## from this issue: https://github.com/biothings/BioThings_Explorer_TRAPI/issues/317
combos[(combos['COUNT'] > 100) &
       (combos['OBJECT_SEMTYPE'] == 'carb') &
       (combos['SUBJECT_SEMTYPE'] == 'aapp')]
## okay we have the desired triple with this limit

Unnamed: 0,SUBJECT_SEMTYPE,PREDICATE,OBJECT_SEMTYPE,COUNT
320,aapp,INTERACTS_WITH,carb,17938
610,aapp,COEXISTS_WITH,carb,7547
794,aapp,STIMULATES,carb,4555
961,aapp,INHIBITS,carb,3296
2127,aapp,CONVERTS_TO,carb,438
2383,aapp,PRODUCES,carb,289


In [61]:
filtered_combos = combos[(combos['COUNT'] > 100)].copy()
filtered_combos.drop(columns='COUNT', inplace=True)
filtered_combos.shape
filtered_combos[0:3]

(3032, 3)

Unnamed: 0,SUBJECT_SEMTYPE,PREDICATE,OBJECT_SEMTYPE
0,bpoc,LOCATION_OF,neop
1,bpoc,LOCATION_OF,aapp
2,topp,TREATS,dsyn


In [62]:
## now map every subject semantic type to biolink

filtered_combos = filtered_combos.merge(SEMMED_entity_types[['Abbrev', 'BiolinkMapping']], 
                      how='left', left_on='SUBJECT_SEMTYPE', right_on='Abbrev')

filtered_combos.drop(columns = 'Abbrev', inplace=True)

filtered_combos.columns = ['OriginalSubject', 'OriginalPredicate', 'OriginalObject',
                  'BiolinkSubject']

In [63]:
## now map every object semantic type to biolink

filtered_combos = filtered_combos.merge(SEMMED_entity_types[['Abbrev', 'BiolinkMapping']],
                      how='left', left_on='OriginalObject', right_on='Abbrev')

filtered_combos.drop(columns = 'Abbrev', inplace=True)

filtered_combos.columns = ['OriginalSubject', 'OriginalPredicate', 'OriginalObject',
                  'BiolinkSubject', 'BiolinkObject']

In [64]:
filtered_combos

Unnamed: 0,OriginalSubject,OriginalPredicate,OriginalObject,BiolinkSubject,BiolinkObject
0,bpoc,LOCATION_OF,neop,GrossAnatomicalStructure,Disease
1,bpoc,LOCATION_OF,aapp,GrossAnatomicalStructure,Polypeptide
2,topp,TREATS,dsyn,Procedure,Disease
3,cell,LOCATION_OF,aapp,Cell,Polypeptide
4,bpoc,LOCATION_OF,patf,GrossAnatomicalStructure,PathologicalProcess
...,...,...,...,...,...
3027,lipd,TREATS,neop,SmallMolecule,Disease
3028,strd,AUGMENTS,neop,SmallMolecule,Disease
3029,clnd,TREATS,anab,Drug,Disease
3030,imft,PREDISPOSES,inpo,ChemicalEntity,PathologicalProcess


In [65]:
filtered_combos = filtered_combos.merge(
                      SEMMED_predicates,
                      how='left', left_on='OriginalPredicate', right_on='Semmed_in_Data')

filtered_combos.drop(columns = 'Semmed_in_Data', inplace=True)

In [66]:
filtered_combos[0:3]

Unnamed: 0,OriginalSubject,OriginalPredicate,OriginalObject,BiolinkSubject,BiolinkObject,BMPred,BMQualPred,BMObjAsp,BMObjDirect,BMPred_Inv,BMQualPred_Inv,BMInv_SubjAsp,BMInv_SubjDirect
0,bpoc,LOCATION_OF,neop,GrossAnatomicalStructure,Disease,location_of,,,,located_in,,,
1,bpoc,LOCATION_OF,aapp,GrossAnatomicalStructure,Polypeptide,location_of,,,,located_in,,,
2,topp,TREATS,dsyn,Procedure,Disease,treats,,,,treated_by,,,


In [67]:
## check to make sure everything is mapped to Biolink successfully
filtered_combos[filtered_combos['BiolinkSubject'].isna()]
filtered_combos[filtered_combos['BiolinkObject'].isna()]

filtered_combos[filtered_combos['BMPred'].isna()]
filtered_combos[filtered_combos['BMPred_Inv'].isna()]

Unnamed: 0,OriginalSubject,OriginalPredicate,OriginalObject,BiolinkSubject,BiolinkObject,BMPred,BMQualPred,BMObjAsp,BMObjDirect,BMPred_Inv,BMQualPred_Inv,BMInv_SubjAsp,BMInv_SubjDirect


Unnamed: 0,OriginalSubject,OriginalPredicate,OriginalObject,BiolinkSubject,BiolinkObject,BMPred,BMQualPred,BMObjAsp,BMObjDirect,BMPred_Inv,BMQualPred_Inv,BMInv_SubjAsp,BMInv_SubjDirect


Unnamed: 0,OriginalSubject,OriginalPredicate,OriginalObject,BiolinkSubject,BiolinkObject,BMPred,BMQualPred,BMObjAsp,BMObjDirect,BMPred_Inv,BMQualPred_Inv,BMInv_SubjAsp,BMInv_SubjDirect


Unnamed: 0,OriginalSubject,OriginalPredicate,OriginalObject,BiolinkSubject,BiolinkObject,BMPred,BMQualPred,BMObjAsp,BMObjDirect,BMPred_Inv,BMQualPred_Inv,BMInv_SubjAsp,BMInv_SubjDirect
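
The four empty tables above confirm by eye that nothing is unmapped. The check can also be made to fail loudly. A minimal sketch, assuming pandas is available as `pd`; `assert_fully_mapped` is a hypothetical helper and the toy frame stands in for `filtered_combos`.

```python
import pandas as pd


def assert_fully_mapped(df, columns):
    """Raise ValueError if any of `columns` contains a missing value."""
    missing = {col: int(df[col].isna().sum()) for col in columns}
    missing = {col: n for col, n in missing.items() if n > 0}
    if missing:
        raise ValueError(f"unmapped values found: {missing}")


# Toy stand-in for filtered_combos; the second row is deliberately unmapped
toy = pd.DataFrame({
    "BiolinkSubject": ["Procedure", None],
    "BiolinkObject": ["Disease", "Disease"],
    "BMPred": ["treats", "treats"],
    "BMPred_Inv": ["treated_by", "treated_by"],
})

required = ["BiolinkSubject", "BiolinkObject", "BMPred", "BMPred_Inv"]

# The fully mapped first row passes silently
assert_fully_mapped(toy.iloc[[0]], required)

# The full toy frame has one unmapped subject, so the check raises
try:
    assert_fully_mapped(toy, required)
except ValueError as err:
    message = str(err)
    print(message)
```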


### Optional Analysis: looking at row counts, organized by biolink combos

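**PAUSE**

Deprecated code: it no longer runs as written. The `COUNT` column would have to be kept when creating the `filtered_combos` object, and the `BiolinkPredicate` column it groups on no longer exists (the predicate column is now `BMPred`).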

In [None]:
### deep-copy combos so it's not messed up for the next step
biolink_counting = filtered_combos.copy()
biolink_counting = biolink_counting.groupby(
                       ['BiolinkSubject', 'BiolinkPredicate', 'BiolinkObject']
                   ).agg(
                       { "COUNT": "sum"}
)
## other stuff that can go into agg
#                          "OriginalSubject": lambda x: set(x),
#                          "OriginalPredicate": lambda x: set(x),
#                          "OriginalObject": lambda x: set(x)

In [None]:
biolink_counting.reset_index(inplace = True)

biolink_counting.sort_values(by='COUNT', ascending=False, inplace = True)

biolink_counting[0:50]
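
One way this analysis could be revived is to keep `COUNT` in `filtered_combos` (by skipping the earlier `drop`) and group on the current column names. The sketch below assumes that, and assumes `BMPred` is the intended predicate column; the toy rows and their counts are made up.

```python
import pandas as pd

# Toy stand-in for filtered_combos with COUNT retained (values are made up)
toy_combos = pd.DataFrame({
    "BiolinkSubject": ["SmallMolecule", "SmallMolecule", "Drug"],
    "BMPred": ["treats", "treats", "treats"],
    "BiolinkObject": ["Disease", "Disease", "Disease"],
    "COUNT": [102, 250, 101],
})

# Sum the record counts per biolink triple, largest first
biolink_counting = (
    toy_combos
    .groupby(["BiolinkSubject", "BMPred", "BiolinkObject"], as_index=False)["COUNT"]
    .sum()
    .sort_values(by="COUNT", ascending=False)
    .reset_index(drop=True)
)
print(biolink_counting)
```

The two `SmallMolecule` rows collapse into one biolink triple whose count is the sum of both.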

## Generate operation yaml!

**PAUSE**

* If needed, change the code within the functions below to change the x-bte annotations that are made...
  * review the qualifier-generating code! Currently it's very simple because all operations with qualifiers have the same set: qualified_predicate, an aspect_qualifier, and a direction_qualifier...
* the yaml created below refers to umls-subj and umls-obj...those are specified [here close to the bottom](https://github.com/NCATS-Tangerine/translator-api-registry/blob/master/semmeddb/version_without_operations.yaml)

In [98]:
yaml=ryml.YAML()
folded = ryml.scalarstring.FoldedScalarString
doublequote = ryml.scalarstring.DoubleQuotedScalarString

In [112]:
def generate_forward_op(original_subj, original_pred, original_obj,
                        biolink_subj, 
                        biolink_pred, biolink_qualified_pred, biolink_obj_asp, biolink_obj_direct,
                        biolink_obj):
    ## set size parameter for biothings POST query. This will change depending on what is set for the API
    POST_size = 1000  
    
    ## create the keys for the operation names
    normal_op_name = f"{original_subj}-{original_pred}-{original_obj}"
    
    ## USE FOLDED in order to have the quotes handled properly (no escape \) in the dumped document
    ## original direction: subject -> object
    normal_op_body = folded(
    '{' 
        '"q": {{ queryInputs ' 
        '| replPrefix(\'predicate:' + f'{original_pred} AND object.semantic_type_abbreviation:{original_obj}' + ' AND pmid_count:>3 AND subject.umls\')' 
        '| dump ' 
        '}}, ' 
        '"scopes": []' 
    '}')
    
    if biolink_qualified_pred:
    ## if there is a qualified_predicate
    ##   there's a set of qualifiers that also has obj aspect and obj direction
        temp = {
            ## original direction: subject -> object
            normal_op_name: [
                {
                    'supportBatch': True,
                    'useTemplating': True,
                    'inputs': [
                        {
                            'id': 'UMLS',
                            'semantic': biolink_subj  ## input is subject!
                        }
                    ],
                    'requestBodyType': 'object',
                    'requestBody': {'body': normal_op_body},
                    'parameters': {
                        'fields': 'object.umls,predication.pmid,predication.sentence,subject.umls,subject.name,object.name',
                        'size': POST_size
                    },
                    'outputs': [
                        {
                            'id': 'UMLS',
                            'semantic': biolink_obj  ## output is object
                        }
                    ],
                    'predicate': biolink_pred,
                    'qualifiers': {
                        'qualified_predicate': biolink_qualified_pred,
                        'object_aspect_qualifier': biolink_obj_asp,
                        'object_direction_qualifier': biolink_obj_direct

                    },
                    'source': 'infores:semmeddb',
#                     'source': doublequote('infores:semmeddb'),
                    'response_mapping': {
                        "$ref": '#/components/x-bte-response-mapping/umls-obj'  ## matches output as object
                    }
                }
            ]
        }
    else:
    ## create operation without qualifiers..
        temp = {
            ## original direction: subject -> object
            normal_op_name: [
                {
                    'supportBatch': True,
                    'useTemplating': True,
                    'inputs': [
                        {
                            'id': 'UMLS',
                            'semantic': biolink_subj  ## input is subject!
                        }
                    ],
                    'requestBodyType': 'object',
                    'requestBody': {'body': normal_op_body},
                    'parameters': {
                        'fields': 'object.umls,predication.pmid,predication.sentence,subject.umls,subject.name,object.name',
                        'size': POST_size
                    },
                    'outputs': [
                        {
                            'id': 'UMLS',
                            'semantic': biolink_obj  ## output is object
                        }
                    ],
                    'predicate': biolink_pred,
                    'source': 'infores:semmeddb',
#                     'source': doublequote('infores:semmeddb'),
                    'response_mapping': {
                        "$ref": '#/components/x-bte-response-mapping/umls-obj'  ## matches output as object
                    }
                }
            ]
        }
        
    return temp

In [113]:
def generate_reverse_op(original_subj, original_pred, original_obj,
                        biolink_subj, 
                        ## NOTICE THE INVERSES USED HERE
                        biolink_inverse_pred, biolink_inverse_qualified_pred, 
                        biolink_inverse_subj_asp, biolink_inverse_subj_direct,
                        biolink_obj):
    ## set size parameter for biothings POST query. This will change depending on what is set for the API
    POST_size = 1000  
    
    ## create the keys for the operation names
    rev_op_name = f"{original_subj}-{original_pred}-{original_obj}-rev"
    
    ## USE FOLDED in order to have the quotes handled properly (no escape \) in the dumped document
    ## reverse direction: object -> subject
    rev_op_body = folded(
    '{' 
        '"q": {{ queryInputs ' 
        '| replPrefix(\'predicate:' + f'{original_pred} AND subject.semantic_type_abbreviation:{original_subj}' + ' AND pmid_count:>3 AND object.umls\')' 
        '| dump ' 
        '}}, ' 
        '"scopes": []' 
    '}')
    
    if biolink_inverse_qualified_pred:
    ## if there is a qualified_predicate
    ##   there's a set of qualifiers that also has subj aspect and subj direction
        temp = {
            ## reverse direction: object -> subject
            rev_op_name: [
                {
                    'supportBatch': True,
                    'useTemplating': True,
                    'inputs': [
                        {
                            'id': 'UMLS',
                            'semantic': biolink_obj  ## input is object!
                        }
                    ],
                    'requestBodyType': 'object',
                    'requestBody': {'body': rev_op_body},
                    'parameters': {
                        'fields': 'object.umls,predication.pmid,predication.sentence,subject.umls,subject.name,object.name',
                        'size': POST_size
                    },
                    'outputs': [
                        {
                            'id': 'UMLS',
                            'semantic': biolink_subj  ## output is subject
                        }
                    ],
                    'predicate': biolink_inverse_pred,  ## use inverse pred!
                    'qualifiers': {
                        'qualified_predicate': biolink_inverse_qualified_pred,  ## use inverse pred!
                        ## use subject qualifiers
                        'subject_aspect_qualifier': biolink_inverse_subj_asp,         
                        'subject_direction_qualifier': biolink_inverse_subj_direct                        
                    },                    
                    'source': 'infores:semmeddb',
#                     'source': doublequote('infores:semmeddb'),
                    'response_mapping': {
                        "$ref": '#/components/x-bte-response-mapping/umls-subj'  ## matches output as subj
                    }
                }
            ]
        }
    else:
    ## create operation without qualifiers..
        temp = {
            ## reverse direction: object -> subject
            rev_op_name: [
                {
                    'supportBatch': True,
                    'useTemplating': True,
                    'inputs': [
                        {
                            'id': 'UMLS',
                            'semantic': biolink_obj  ## input is object!
                        }
                    ],
                    'requestBodyType': 'object',
                    'requestBody': {'body': rev_op_body},
                    'parameters': {
                        'fields': 'object.umls,predication.pmid,predication.sentence,subject.umls,subject.name,object.name',
                        'size': POST_size
                    },
                    'outputs': [
                        {
                            'id': 'UMLS',
                            'semantic': biolink_subj  ## output is subject
                        }
                    ],
                    'predicate': biolink_inverse_pred,  ## use inverse pred!
                    'source': 'infores:semmeddb',
#                     'source': doublequote('infores:semmeddb'),
                    'response_mapping': {
                        "$ref": '#/components/x-bte-response-mapping/umls-subj'  ## matches output as subj
                    }
                }
            ]
        }
    
    return temp
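
The forward and reverse request bodies differ only in which side is filtered by semantic type and which side receives the input IDs. A possible refactor is sketched below: `build_query_body` is a hypothetical helper, not part of the notebook, and it returns a plain string (the notebook additionally wraps the body in `folded(...)` before dumping).

```python
def build_query_body(predicate, filter_side, semtype, input_side, min_pmid_count=3):
    """Build the templated POST body shared by forward and reverse operations.

    filter_side: "object" for forward operations, "subject" for reverse ones
    input_side:  the opposite side, whose UMLS IDs are supplied as queryInputs
    """
    prefix = (
        f"predicate:{predicate} "
        f"AND {filter_side}.semantic_type_abbreviation:{semtype} "
        f"AND pmid_count:>{min_pmid_count} "
        f"AND {input_side}.umls"
    )
    # Plain (non f-string) pieces keep the literal {{ }} of the template syntax
    return (
        '{"q": {{ queryInputs | replPrefix(\'' + prefix + '\')| dump }}, "scopes": []}'
    )


forward_body = build_query_body("INHIBITS", "object", "aapp", "subject")
reverse_body = build_query_body("INHIBITS", "subject", "aapp", "object")
print(reverse_body)
```

Keeping the `pmid_count` threshold as a parameter also means it only has to be changed in one place.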

In [101]:
def generate_all_operations(combo_df):
    op_tracking = set()
    
    saved = dict()
    ## iterate through rows of combos dataframe
    for row in combo_df.itertuples(index = False):
        
        ## forward: only make operation if it's not going to be a dupe
        ##          dupes happen when query ends up being the same (predicate,object used here)
        forward_op_record = f"{row.BiolinkSubject}-{row.OriginalPredicate}-{row.OriginalObject}"
        if forward_op_record not in op_tracking:
            op_tracking.add(forward_op_record)
            saved.update(generate_forward_op(original_subj = row.OriginalSubject,
                                             original_pred = row.OriginalPredicate,
                                             original_obj = row.OriginalObject,
                                             biolink_subj = row.BiolinkSubject,
                                             biolink_pred = row.BMPred,
                                             biolink_qualified_pred = row.BMQualPred,
                                             biolink_obj_asp = row.BMObjAsp,
                                             biolink_obj_direct = row.BMObjDirect,
                                             biolink_obj = row.BiolinkObject
                                            ))        
        
        ## reverse: make operation if it's not going to be a dupe
        ##          dupes happen when query ends up being the same (predicate,subject used here)
        reverse_op_record = f"rev-{row.BiolinkObject}-{row.OriginalPredicate}-{row.OriginalSubject}"
        if reverse_op_record not in op_tracking:
            op_tracking.add(reverse_op_record)
            saved.update(generate_reverse_op(original_subj = row.OriginalSubject,
                                             original_pred = row.OriginalPredicate,
                                             original_obj = row.OriginalObject,
                                             biolink_subj = row.BiolinkSubject,
                                             biolink_inverse_pred = row.BMPred_Inv,
                                             biolink_inverse_qualified_pred = row.BMQualPred_Inv,
                                             biolink_inverse_subj_asp = row.BMInv_SubjAsp,
                                             biolink_inverse_subj_direct = row.BMInv_SubjDirect,
                                             biolink_obj = row.BiolinkObject
                                             ))           
            
    final = {"x-bte-kgs-operations": saved}
    return final
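
To see why the tracking set reduces the number of operations, consider two semantic types that map to the same Biolink category. The sketch below replays only the dedup logic on two hypothetical rows (the rows are made up for illustration, and the operation bodies are left out).

```python
from collections import namedtuple

Row = namedtuple("Row", "OriginalSubject OriginalPredicate OriginalObject "
                        "BiolinkSubject BiolinkObject")

# Hypothetical rows: two semantic types that share one Biolink subject category
rows = [
    Row("lipd", "TREATS", "neop", "SmallMolecule", "Disease"),
    Row("strd", "TREATS", "neop", "SmallMolecule", "Disease"),
]

seen = set()
kept = []
for row in rows:
    # Forward query filters on predicate + object semantic type only,
    # so rows sharing the Biolink subject produce the same operation
    forward_key = f"{row.BiolinkSubject}-{row.OriginalPredicate}-{row.OriginalObject}"
    if forward_key not in seen:
        seen.add(forward_key)
        kept.append(f"{row.OriginalSubject}-{row.OriginalPredicate}-{row.OriginalObject}")

    # Reverse query filters on predicate + subject semantic type
    reverse_key = f"rev-{row.BiolinkObject}-{row.OriginalPredicate}-{row.OriginalSubject}"
    if reverse_key not in seen:
        seen.add(reverse_key)
        kept.append(f"{row.OriginalSubject}-{row.OriginalPredicate}-{row.OriginalObject}-rev")

print(kept)
```

Two rows would naively give four operations; here the second forward operation is skipped because its query would be identical to the first, leaving three.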

Get the file made!

In [114]:
kgs_operations = generate_all_operations(filtered_combos)

In [115]:
len(kgs_operations['x-bte-kgs-operations'])

2497

In [116]:
for i in kgs_operations['x-bte-kgs-operations'].keys():
    if ('INHIBITS' in i) and ('rev' in i):
        pprint.pprint(kgs_operations['x-bte-kgs-operations'][i], sort_dicts = False)
        break

[{'supportBatch': True,
  'useTemplating': True,
  'inputs': [{'id': 'UMLS', 'semantic': 'Polypeptide'}],
  'requestBodyType': 'object',
  'requestBody': {'body': '{"q": {{ queryInputs | '
                          "replPrefix('predicate:INHIBITS AND "
                          'subject.semantic_type_abbreviation:aapp AND '
                          "pmid_count:>5 AND object.umls')| dump }}, "
                          '"scopes": []}'},
  'parameters': {'fields': 'object.umls,predication.pmid,pmid_count,predication_count,subject.umls,subject.name,predicate,object.name',
                 'size': 1000},
  'outputs': [{'id': 'UMLS', 'semantic': 'Polypeptide'}],
  'predicate': 'affected_by',
  'qualifiers': {'qualified_predicate': 'caused_by',
                 'subject_aspect_qualifier': 'activity_or_abundance',
                 'subject_direction_qualifier': 'decreased'},
  'source': 'infores:semmeddb',
  'response_mapping': {'$ref': '#/components/x-bte-response-mapping/umls-subj'}}]


**PAUSE**

* the operations end up quite condensed because of the way querying is done: the tracking set in the `generate_all_operations` function records which operations already exist, so duplicates aren't created
* set where to save the yaml files in the code chunks below
  * operations_path
  * list_path

In [117]:
yaml.boolean_representation = ['False', 'True']

operations_path = pathlib.Path.home().joinpath(
            "Desktop", "translator-api-registry", "semmeddb", "semmeddb2_specific", "generated_operations2.yaml")

yaml.dump(kgs_operations, operations_path)

Wait a sec! Need the operations list too!

In [None]:
def generate_kgs_operations_list(operations_dict):
    kgs_op_list = []
    for key in operations_dict.keys():
        kgs_op_list.append( {"$ref": f"#/components/x-bte-kgs-operations/{key}"} )
    final2 = {"x-bte-kgs-operations": kgs_op_list}
    return final2
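
To make the shape of the resulting list concrete, here is a toy run. `build_operations_list` is a compact, hypothetical equivalent of the function above, written out so the example is self-contained; the operation names are made up.

```python
def build_operations_list(operations_dict):
    """Compact equivalent of generate_kgs_operations_list, for illustration."""
    return {
        "x-bte-kgs-operations": [
            {"$ref": f"#/components/x-bte-kgs-operations/{key}"}
            for key in operations_dict
        ]
    }


# Hypothetical operation names; the bodies are irrelevant for the list
toy_operations = {"aapp-INHIBITS-carb": [], "aapp-INHIBITS-carb-rev": []}
toy_list = build_operations_list(toy_operations)
print(toy_list)
```

Each entry is a `$ref` pointing at one operation key, in the same order as the keys of the operations dictionary.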

In [None]:
operations_list = generate_kgs_operations_list(kgs_operations['x-bte-kgs-operations'])

In [None]:
list_path = pathlib.Path.home().joinpath(
            "Desktop", "translator-api-registry", "semmeddb", "generated_list.yaml")

yaml.dump(operations_list, list_path)

**PAUSE**

* the saved yaml segments now have to be indented manually and inserted into the correct sections of the smartapi yaml...
  * It's easier to do with an editor like Visual Studio Code, where one can select large sections of text
  * the amount of indent to do and where to put things is specified in the [yaml that acts as a template](https://github.com/NCATS-Tangerine/translator-api-registry/blob/master/semmeddb/version_without_operations.yaml)
  * the finished file is meant to be [here](https://github.com/NCATS-Tangerine/translator-api-registry/blob/master/semmeddb/smartapi.yaml) so one could select-paste the sections directly there
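
If manual indenting becomes tedious, the shift could also be scripted. This is a sketch under assumptions: `indent_yaml_segment` is a hypothetical helper, the segment text is made up, and the number of levels used here is a placeholder (the real amount is whatever the template yaml requires).

```python
import textwrap


def indent_yaml_segment(text, levels, spaces_per_level=2):
    """Indent every non-blank line of `text` by `levels` nesting levels."""
    return textwrap.indent(text, " " * (levels * spaces_per_level))


# Made-up segment resembling the start of a generated operations file
segment = (
    "x-bte-kgs-operations:\n"
    "  aapp-INHIBITS-carb:\n"
    "  - supportBatch: True\n"
)
indented = indent_yaml_segment(segment, levels=1)
print(indented)
```

The indented text would still have to be pasted into the right section of the smartapi yaml by hand.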
  
---

The code below is optional, in case one wants to convert the yaml to json (for BTE's test/query endpoint testing)

In [None]:
## extra code in case we want to convert to json...

yaml_path = pathlib.Path.home().joinpath(
            "Desktop", "translator-api-registry", "semmeddb", "smartapi.yaml")

here = yaml.load(yaml_path)

json_path = pathlib.Path.home().joinpath(
            "Desktop", "translator-api-registry", "semmeddb", "smartapi.json")

with open(json_path, 'w') as file:
    json.dump(here, file)

## Notes on choices here

### Removing

* was mapped to Polypeptide:
    * amas: (Amino Acid Sequence) looks like protein "domains". Examples: Nuclear Export Signals, DNA Binding Domain
* everything mapped to Activity (8)
    * acty (Activity) Examples: War, Retirement, Euthanasia, Lifting
    * dora (Daily or Recreational Activity) Examples: Physical activity, Light Exercise, Relaxation
    * edac (Educational Activity) Examples: Training, Medical Residencies
    * gora (Governmental or Regulatory Activity) Examples: Health Care Reform, Advisory Committees
    * hlca (Health Care Activity) Examples: follow-up, Diagnosis
    * mcha (Machine Activity) Examples: Refrigeration, Neural Network Simulation
    * ocac (Occupational Activity) Examples: Promotion, Work, Mining
    * resa (Research Activity) Examples: Clinical Trials, research study
* everything mapped to Cohort (4)
    * aggp (Age Group) Examples: Infant, Child, Adult, Elderly
    * famg (Family Group) Examples: spouse, Sister, Foster Parent
    * podg (Patient or Disabled Group) Examples: Patients
    * prog (Professional or Occupational Group) Examples: Administrators, Employee, Author
* older semantic types (4)
    * algae (Algae)
    * invt (Invertebrate)
    * orgm (Organism)
    * rich (Rickettsia or Chlamydia)
* missing mappings to biolink-model (25):
    * amph (Amphibian) Examples: Toad, Bufo boreas, Anura
    * anim (Animal) Examples: Animals, Laboratory /  Control Animal
    * arch (Archaeon) Examples: Archaea, halophilic bacteria, Thermoplasma acidophilum
    * bact (Bacterium) Examples: Escherichia coli, Salmonella, Borrelia burgdorferi
    * bdsu (Body Substance) too general. Examples: Urine, Milk, Lymph, Urine specimen
    * bdsy (Body System) too general. Examples: hypothalamic-pituitary-adrenal axis, Neurosecretory Systems
    * bird (Bird) Examples: Geese, Passeriformes, Raptors
    * blor (Body Location or Region) too general. Examples: Hepatic, Lysosomal, Cytoplasmic
    * bmod (Biomedical Occupation or Discipline) Examples: Medicine, Dentistry, Midwifery
    * bsoj (Body Space or Junction) too general. Examples: Compartments, Synapses, Cistern
    * chvf (Chemical Viewed Functionally) too general. Examples: inhibitors, antagonists, Agent
    * chvs (Chemical Viewed Structurally) too general. Examples: particle, solid state, vapor
    * euka (Eukaryote) Examples: Wasps, Protozoan parasite
    * ffas (Fully Formed Anatomical Structure) Examples: Carcass
    * fish (Fish) Examples: Eels, Fishes, Electric Fish
    * fngs (Fungus) Examples: Saccharomyces cerevisiae, Alternaria brassicicola, fungus
    * humn (Human) Examples: Family, Patients, Males
    * irda (Indicator, Reagent, or Diagnostic Aid) Examples: Fluorescent Probes, Chelating Agents
    * mamm (Mammal) Examples: Rattus norvegicus, Felis catus, Mus
    * ocdi (Occupation or Discipline) Examples: Science, Politics
    * plnt (Plant) Examples: Chrysanthemum x morifolium, Pollen, Oryza sativa
    * rept (Reptile) Examples: Snakes, Turtles, Reptiles
    * sbst (Substance) too general. Examples: Materials, Plastics, Photons, Substance
    * virs (Virus) Examples: Herpesvirus 4, Human / GB virus C / Herpesviridae
    * vtbt (Vertebrate) Examples: Vertebrates / Poikilotherm, NOS
* was mapped to AnatomicalEntity:
    * anst (Anatomical Structure) Examples: Entire fetus, Whole body, Cadaver
* was mapped to Behavior (4)
    * bhvr (Behavior) too general. Examples: Sexuality, Nest Building, Behavioral phenotype
    * inbe (Individual Behavior) too general. Examples: impulsivity, Habits, Performance
    * menp (Mental Process) too general. Examples: mind control, Learning, experience
    * socb (Social Behavior) too general. Examples: Communication, Gestures, Marriage
* was mapped to Phenomenon (6)
    * biof (Biologic Function) too general. Examples: dose-response relationship, Pharmacodynamics, Anabolism
    * eehu (Environmental Effect of Humans) too general. Examples: Sewage, Pollution, Smoke
    * hcpp (Human-caused Phenomenon or Process) too general. No longer in the API. Examples: particle beam, Conferences, Victimization
    * lbtr (Laboratory or Test Result) too general. Examples: False Positive Reactions, Bone Density, Serum Calcium Level
    * npop (Natural Phenomenon or Process) too general. Examples: Floods, Fluorescence, Freezing
    * phpr (Phenomenon or Process) too general. Examples: Disasters, Acceleration, Feedback
* was mapped to Device (4)
    * bodm (Biomedical or Dental Material) too general. Examples: Pill, Gel, Talc, calcium phosphate
    * drdd (Drug Delivery Device) too general. Examples: Epipen, Skin Patch, Lilly cyanide antidote kit
    * medd (Medical Device) too general. Examples: Implants / Denture, Overlay / Silicone gel implant / Swab
    * resd (Research Device) too general. Examples: Study models, Slide
* was mapped to GrossAnatomicalStructure
    * emst (Embryonic Structure) Examples: Chick Embryo, Blastocyst structure, Placenta
    * tisu (Tissue) Examples: Tissue specimen, Blood, Human tissue, Mucous Membrane
* was mapped to PhysiologicalProcess
    * genf (Genetic Function) too general. Examples: Transcription, Genetic / Transcriptional Activation / Recombination, Genetic
* was mapped to ChemicalEntity
    * chem (Chemical) too general. Examples: Chemicals, Acids, Ligands, Ozone
* everything mapped to InformationContentEntity (8)
    * clas (Classification) too general. Examples: Research Diagnostic Criteria, Group C
    * ftcn (Functional Concept) too general. Examples: Techniques, Intravenous Route of Drug Administration
    * idcn (Idea or Concept) too general. Examples: Significant, subject, Data, Owner
    * qlco (Qualitative Concept) too general. Examples: Effect, Associated with, Advanced phase
    * qnco (Quantitative Concept) too general. Examples: Calibration, occurrence, degrees Celsius
    * rnlw (Regulation or Law) too general. Examples: Medicare, Medicaid, regulatory
    * spco (Spatial Concept) too general. Examples: Structure, Longitudinal, Asymmetry
    * tmco (Temporal Concept) too general. Examples: New, /period, 24 Hours
* everything mapped to ClinicalAttribute
    * clna (Clinical Attribute) too general. Examples: response, Renin secretion, BAND PATTERN
* was mapped to Procedure
    * lbpr (Laboratory Procedure) too general. Examples: Western Blot, Radioimmunoassay, Staining method
    * mbrt (Molecular Biology Research Technique) too general. Examples: Polymerase Chain Reaction / Blotting, Northern
* everything mapped to SmallMolecule
    * elii (Element, Ion, or Isotope) too general. Examples: Atom, Aluminum, Superoxides
* everything mapped to BiologicalEntity
    * emod (Experimental Model of Disease) too general. Examples: Experimental Autoimmune Encephalomyelitis, Rodent Model
* was mapped to Protein
    * rcpt (Receptor) too general. Examples: Binding Sites / Receptors, Metabotropic Glutamate
* was mapped to Event
    * evnt (Event) too general. No longer in the API. Examples: Stressful Events
* was mapped to DiseaseOrPhenotypicFeature
    * fndg (Finding) too general. Examples: spinal cord; lesion, Normal birth weight, Sedentary job
* was mapped to GeographicLocation
    * geoa (Geographic Area) too general. Examples: Country, Canada
* everything mapped to Agent
    * grup (Group) too general. Examples: Human, Individual
    * hcro (Health Care Related Organization) too general. Examples: Hospitals, Health System
    * orgt (Organization) too general. Examples: United Nations, Organization administrative structures
    * pros (Professional Society) too general. Examples: Professional Organizations, American Nurses' Association
    * shro (Self-help or Relief Organization) too general. Examples: Social Welfare, Support Groups
* everything mapped to Publication
    * inpr (Intellectual Product) too general. Examples: Methodology, Study models
* everything mapped to PhysicalEntity
    * mnob (Manufactured Object) too general. Examples: Glass, Manuals
* everything mapped to MolecularEntity
    * mosq (Molecular Sequence) too general. No longer in the API. Examples: Genetic Code
* was mapped to NucleicAcidEntity
    * nusq (Nucleotide Sequence) too general. Examples: Base Sequence, DNA Sequence, 22q11
* was mapped to OrganismAttribute
    * orga (Organism Attribute) too general. Examples: Ability, Body Composition
* was mapped to PopulationOfIndividualOrganisms
    * popg (Population Group) too general. Examples: Male population group, Woman
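
A list like the one above could be kept as a small lookup table and used to drop any subject/object combination that involves an excluded semantic type. The sketch below is only an illustration: the names `EXCLUDED_BY_CATEGORY`, `EXCLUDED_SEMTYPES`, and `is_kept` are hypothetical, only four of the groups above are copied in, and the sample combinations are made up. It also assumes a combination is dropped when either its subject or its object semantic type is excluded.

```python
# Hypothetical lookup table: Biolink category -> excluded UMLS semantic type
# abbreviations. Only four groups from the list above are copied here.
EXCLUDED_BY_CATEGORY = {
    "Behavior": ["bhvr", "inbe", "menp", "socb"],
    "Phenomenon": ["biof", "eehu", "hcpp", "lbtr", "npop", "phpr"],
    "Device": ["bodm", "drdd", "medd", "resd"],
    "InformationContentEntity": [
        "clas", "ftcn", "idcn", "qlco", "qnco", "rnlw", "spco", "tmco",
    ],
}

# Flatten into one set for fast membership checks.
EXCLUDED_SEMTYPES = frozenset(
    abbr for abbrs in EXCLUDED_BY_CATEGORY.values() for abbr in abbrs
)


def is_kept(subject_semtype: str, object_semtype: str) -> bool:
    """Return True when neither semantic type is on the exclusion list.

    Assumption: a combination is dropped if either side is excluded.
    """
    return not ({subject_semtype, object_semtype} & EXCLUDED_SEMTYPES)


# Made-up (subject semantic type, predicate, object semantic type) combinations.
sample_combos = [
    ("gngm", "ASSOCIATED_WITH", "dsyn"),
    ("bhvr", "AFFECTS", "dsyn"),
    ("gngm", "AFFECTS", "qlco"),
]

kept_combos = [combo for combo in sample_combos if is_kept(combo[0], combo[2])]
print(kept_combos)
```

Keeping the category as the dictionary key makes it easy to check each group's size against the counts given in the list, for example `len(EXCLUDED_BY_CATEGORY["Behavior"])` against the `(4)` next to Behavior.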