# Auto-generate BTE annotations for BioThings SEMMEDDB

This notebook walks a developer through taking the [SEMMEDDB database](https://lhncbc.nlm.nih.gov/ii/tools/SemRep_SemMedDB_SKR/SemMedDB_download.html) and using its data to auto-generate the x-bte operations for [BTE](https://github.com/biothings/BioThings_Explorer_TRAPI). BTE needs these operations to query the [BioThings SEMMEDDB API](https://pending.biothings.io/semmeddb) and process its responses.

---

When you see this:  
**PAUSE**

read the accompanying text, which explains what you need to do before running the code chunks below it.

---

The [yaml](https://github.com/NCATS-Tangerine/translator-api-registry/blob/master/semmeddb/generated_operations.yaml) [segments](https://github.com/NCATS-Tangerine/translator-api-registry/blob/master/semmeddb/generated_list.yaml) generated by this notebook are added to [this file](https://github.com/NCATS-Tangerine/translator-api-registry/blob/master/semmeddb/version_without_operations.yaml) to make the [yaml used for the smartapi registration](https://github.com/NCATS-Tangerine/translator-api-registry/blob/master/semmeddb/smartapi.yaml) of the BioThings SEMMEDDB API.

## Setup

Requirements:
* Install `bmt` with pip (instructions: https://github.com/biolink/biolink-model-toolkit/)
* Install `ruamel.yaml` with pip (instructions: https://yaml.readthedocs.io/en/latest/install.html)

Files:
* Get SEMMEDDB PREDICATION CSV here: https://lhncbc.nlm.nih.gov/ii/tools/SemRep_SemMedDB_SKR/SemMedDB_download.html
* Get SEMMEDDB SRDEF file needed for interpreting and mapping SEMMED semantic types: 
  * download the compressed file [here](https://lhncbc.nlm.nih.gov/semanticnetwork/download.html) or a direct link [here](https://lhncbc.nlm.nih.gov/ii/tools/SemRep_SemMedDB_SKR/SRDEF.txt)
  * information on the SRDEF file [here](https://www.ncbi.nlm.nih.gov/books/NBK9679/#ch05.sec5.2)

In [1]:
## CX: allows multiple lines of code to print from one code block
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

import pathlib
import pandas as pd

## get using pip https://github.com/biolink/biolink-model-toolkit/
from bmt import Toolkit

## get from pip (using instead of pyyaml's import yaml)
import ruamel.yaml as ryml
import json
import pprint

## used in trying things out
# import re

**PAUSE**

* Check and correct the path for `raw_data_location`.
* Check that the columns specified in `usecols` and `names` match the columns of the PREDICATION [file](https://lhncbc.nlm.nih.gov/ii/tools/SemRep_SemMedDB_SKR/dbinfo.html).
* Check that `na_values` and `sep` are correct. A Terminal command like `head` shows the first lines of the file.
* If there are encoding issues, try a different encoding. `latin1` worked here, ref: https://stackoverflow.com/questions/61163367/how-to-resolve-unicodedecodeerror-in-pandas-read-csv-while-loading-dataset
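The checks above can also be done from Python instead of `head`. This is a minimal sketch: `peek_rows` is a hypothetical helper (not part of the notebook), and the sample file is made up and much narrower than the real PREDICATION file. It prints each value with its position so the indices can be compared with `usecols`.

```python
import csv
import itertools
import tempfile
from pathlib import Path

def peek_rows(path, n=3, sep=",", encoding="latin1"):
    """Return the first n parsed rows of a delimited file without loading it all."""
    with open(path, newline="", encoding=encoding) as handle:
        return list(itertools.islice(csv.reader(handle, delimiter=sep), n))

# Tiny stand-in file (made-up rows; the real PREDICATION file has more columns).
sample = Path(tempfile.mkdtemp()) / "sample_predication.csv"
sample.write_text("1,2,3,TREATS,C1,name,phsu,1\n4,5,6,ISA,C2,name,virs,\\N\n",
                  encoding="latin1")

rows = peek_rows(sample, n=2)
for row in rows:
    # Print (position, value) pairs so positions can be compared with `usecols`.
    print(list(enumerate(row)))
```

Seeing the literal `\N` in a row is a quick confirmation that `na_values=r"\N"` is the right setting.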

In [2]:
raw_data_location = pathlib.Path.home().joinpath(
            "Desktop", "RawDataFiles", "SEMMEDDB", "semmedVER43_2021_R_PREDICATION.csv")

raw_data = pd.read_csv(raw_data_location, header=None, sep=",", encoding="latin1",
                          usecols=[3, 6, 7, 10, 11],
                          names=["PREDICATE","SUBJECT_SEMTYPE","SUBJECT_NOVELTY",
                                 "OBJECT_SEMTYPE", "OBJECT_NOVELTY"],
                          na_values=r"\N")

In [3]:
raw_data.shape

(113863366, 5)
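With this many rows, loading everything at once may not fit in memory on every machine. One option (an assumption, not what this notebook does) is to read the file in chunks with the `chunksize` parameter of `pd.read_csv` and apply the novelty filter from the next section chunk by chunk. The sketch below uses a made-up four-row stand-in for the file.

```python
import io
import pandas as pd

# Made-up stand-in for the PREDICATION file: only the five kept columns.
fake_csv = io.StringIO(
    "TREATS,phsu,1,dsyn,1\n"
    "ISA,virs,1,virs,1\n"
    "PROCESS_OF,dsyn,0,humn,0\n"
    "CAUSES,bact,1,dsyn,\\N\n"
)
names = ["PREDICATE", "SUBJECT_SEMTYPE", "SUBJECT_NOVELTY",
         "OBJECT_SEMTYPE", "OBJECT_NOVELTY"]

kept = []
# chunksize makes read_csv return an iterator of DataFrames.
for chunk in pd.read_csv(fake_csv, header=None, names=names,
                         na_values=r"\N", chunksize=2):
    mask = (chunk["SUBJECT_NOVELTY"] == 1) & (chunk["OBJECT_NOVELTY"] == 1)
    kept.append(chunk.loc[mask, ["SUBJECT_SEMTYPE", "PREDICATE", "OBJECT_SEMTYPE"]])

novel_rows = pd.concat(kept, ignore_index=True)
print(novel_rows)
```

Rows with a missing novelty value are dropped too, because a comparison with `NaN` is never true.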

In [4]:
raw_data.head()

Unnamed: 0,PREDICATE,SUBJECT_SEMTYPE,SUBJECT_NOVELTY,OBJECT_SEMTYPE,OBJECT_NOVELTY
0,PROCESS_OF,virs,1,mamm,1.0
1,ISA,virs,1,virs,1.0
2,ISA,virs,1,virs,1.0
3,ISA,virs,1,virs,1.0
4,PROCESS_OF,dsyn,0,humn,0.0


## Basic Filtering

### keep only novelty = 1

Now filter the data to keep only rows where novelty == 1 for both subject and object, since entities with novelty == 0 probably aren't very helpful or interesting to Translator. The entities with novelty == 0 are [listed](https://lhncbc.nlm.nih.gov/ii/tools/SemRep_SemMedDB_SKR/dbinfo.html) in the SEMMEDDB GENERIC_CONCEPT table files, which can be downloaded [here](https://lhncbc.nlm.nih.gov/ii/tools/SemRep_SemMedDB_SKR/SemMedDB_download.html).

In [5]:
filtered_data = raw_data[(raw_data["SUBJECT_NOVELTY"] == 1) &
                         (raw_data["OBJECT_NOVELTY"] == 1)].copy()

In [6]:
filtered_data.shape

(78688677, 5)

In [7]:
## remove the novelty stuff since that will make computations faster and it's always 1 now
filtered_data = filtered_data[['SUBJECT_SEMTYPE', 'PREDICATE', 'OBJECT_SEMTYPE']]

What semantic types are actually present? We have to prune down to the ones we want BTE operations on.

Interestingly, the counts in the data differ from the stats on the official website, which list 127 semantic types and 54 predicates:
https://lhncbc.nlm.nih.gov/semanticnetwork/SemanticNetworkArchive.html

In [8]:
subject_semtypes = set(filtered_data["SUBJECT_SEMTYPE"].unique())
object_semtypes = set(filtered_data["OBJECT_SEMTYPE"].unique())
predicates = set(filtered_data["PREDICATE"].unique())

len(subject_semtypes)  ## got 133
len(object_semtypes)   ## got 135
len(predicates)        ## got 68

133

135

68

**PAUSE**

* Review the 3 sets above for values to remove. The normal entity semantic types have 4-letter codes, so anything else deserves a closer look.
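Scanning long sets by eye is error-prone, so the review can be narrowed with a pattern check. This is a minimal sketch, assuming that a regular semantic type abbreviation is exactly four lowercase letters; the values are a small sample copied from this notebook.

```python
import re

# A handful of values copied from the sets in this notebook.
observed_semtypes = {"aapp", "dsyn", "C0030193", "C0030705", "gngm,aapp", "podg,humn"}

# Assumption: a regular semantic type abbreviation is exactly four lowercase letters.
FOUR_LETTER_CODE = re.compile(r"[a-z]{4}")

suspicious = {s for s in observed_semtypes if not FOUR_LETTER_CODE.fullmatch(s)}
print(sorted(suspicious))
```

Anything flagged still needs a human decision; the pattern only points at candidates.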

### Remove semantic types that we don't want to make operations from

Some of the object values are not regular semantic types, so they are removed below.

In [9]:
object_semtypes

{'C0030193',
 'C0030705',
 'aapp',
 'acab',
 'acty',
 'aggp',
 'alga',
 'amas',
 'amph',
 'anab',
 'anim',
 'anst',
 'antb',
 'arch',
 'bacs',
 'bact',
 'bdsu',
 'bdsy',
 'bhvr',
 'biof',
 'bird',
 'blor',
 'bmod',
 'bodm',
 'bpoc',
 'bsoj',
 'carb',
 'celc',
 'celf',
 'cell',
 'cgab',
 'chem',
 'chvf',
 'chvs',
 'clas',
 'clna',
 'clnd',
 'comd',
 'diap',
 'dora',
 'drdd',
 'dsyn',
 'edac',
 'eehu',
 'eico',
 'elii',
 'emod',
 'emst',
 'enzy',
 'euka',
 'famg',
 'ffas',
 'fish',
 'fndg',
 'fngs',
 'food',
 'ftcn',
 'genf',
 'geoa',
 'gngm',
 'gngm,aapp',
 'gora',
 'grup',
 'hcpp',
 'hcro',
 'hlca',
 'hops',
 'horm',
 'humn',
 'idcn',
 'imft',
 'inbe',
 'inch',
 'inpo',
 'inpr',
 'invt',
 'irda',
 'lang',
 'lbpr',
 'lbtr',
 'lipd',
 'mamm',
 'mbrt',
 'mcha',
 'medd',
 'menp',
 'mnob',
 'mobd',
 'moft',
 'mosq',
 'neop',
 'nnon',
 'npop',
 'nsba',
 'nusq',
 'ocac',
 'ocdi',
 'opco',
 'orch',
 'orga',
 'orgf',
 'orgm',
 'orgt',
 'ortf',
 'patf',
 'phob',
 'phpr',
 'phsf',
 'phsu',
 'plnt

In [10]:
## make the set of stuff we want to remove
removal1 = set(["C0030193", "C0030705", "gngm,aapp", "podg,humn"])

## remove it from the data 
filtered_data = filtered_data[ ~ filtered_data['OBJECT_SEMTYPE'].isin(removal1)]

In [11]:
## look at the semantic types again after this removal
subject_semtypes = set(filtered_data["SUBJECT_SEMTYPE"].unique())
object_semtypes = set(filtered_data["OBJECT_SEMTYPE"].unique())
predicates = set(filtered_data["PREDICATE"].unique())

len(subject_semtypes) ## was 133, now 133: didn't change
len(object_semtypes)  ## was 135, now 131: decreased by 4 (expected)
len(predicates)       ## was 68, now 65: decreased by 3 (cool)

133

131

65

### Remove Predicates we don't want to make operations from

In [12]:
predicates

{'ADMINISTERED_TO',
 'AFFECTS',
 'ASSOCIATED_WITH',
 'AUGMENTS',
 'CAUSES',
 'COEXISTS_WITH',
 'COMPLICATES',
 'CONVERTS_TO',
 'DIAGNOSES',
 'DISRUPTS',
 'INHIBITS',
 'INTERACTS_WITH',
 'ISA',
 'LOCATION_OF',
 'MANIFESTATION_OF',
 'MEASUREMENT_OF',
 'MEASURES',
 'METHOD_OF',
 'NEG_ADMINISTERED_TO',
 'NEG_AFFECTS',
 'NEG_ASSOCIATED_WITH',
 'NEG_AUGMENTS',
 'NEG_CAUSES',
 'NEG_COEXISTS_WITH',
 'NEG_COMPLICATES',
 'NEG_CONVERTS_TO',
 'NEG_DIAGNOSES',
 'NEG_DISRUPTS',
 'NEG_INHIBITS',
 'NEG_INTERACTS_WITH',
 'NEG_ISA',
 'NEG_LOCATION_OF',
 'NEG_MANIFESTATION_OF',
 'NEG_MEASUREMENT_OF',
 'NEG_MEASURES',
 'NEG_METHOD_OF',
 'NEG_OCCURS_IN',
 'NEG_PART_OF',
 'NEG_PRECEDES',
 'NEG_PREDISPOSES',
 'NEG_PREVENTS',
 'NEG_PROCESS_OF',
 'NEG_PRODUCES',
 'NEG_STIMULATES',
 'NEG_TREATS',
 'NEG_USES',
 'NEG_higher_than',
 'NEG_lower_than',
 'NEG_same_as',
 'NOM',
 'OCCURS_IN',
 'PART_OF',
 'PRECEDES',
 'PREDISPOSES',
 'PREP',
 'PREVENTS',
 'PROCESS_OF',
 'PRODUCES',
 'STIMULATES',
 'TREATS',
 'USES',
 '

**PAUSE**

* Decide what predicates you want to remove

**Current logic**

Translator currently doesn't handle negation well, so the negated (`NEG_`) predicates are removed.

Some predicates also seem to be practice phrases rather than real relationships; see https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-12-486#Sec26

And I didn't find the following relationships useful: ISA (subclass relationship), same_as (equivalent / done just as well), lower_than / higher_than (two entities were compared and one was better (higher) or worse (lower) than the other), and compared_with.
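Since every negated predicate in the listing starts with `NEG_`, the negation list could be derived from the data rather than typed out. This is a sketch of that alternative on a small sample of predicate names; the notebook itself uses hard-coded lists.

```python
# A few predicate names copied from the set printed above.
predicates_seen = {"TREATS", "NEG_TREATS", "ISA", "NEG_ISA", "NOM", "PREP",
                   "NEG_same_as", "CAUSES"}

practice_phrase = {"VERB", "NOM", "PREP"}
dont_like = {"ISA", "same_as", "lower_than", "higher_than", "compared_with"}

# Collect negated predicates by prefix instead of listing each one.
negative_preds = {p for p in predicates_seen if p.startswith("NEG_")}

to_remove = negative_preds | practice_phrase | dont_like
kept = predicates_seen - to_remove
print(sorted(kept))
```

Deriving the list by prefix keeps working if a later SEMMEDDB release adds a negated predicate that the hard-coded list does not know about.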

In [13]:
negative_preds = ["NEG_ADMINISTERED_TO", "NEG_AFFECTS", "NEG_ASSOCIATED_WITH",
                  "NEG_AUGMENTS", "NEG_CAUSES", "NEG_COEXISTS_WITH", "NEG_COMPLICATES",
                  "NEG_CONVERTS_TO", "NEG_DIAGNOSES", "NEG_DISRUPTS", "NEG_INHIBITS",
                  "NEG_INTERACTS_WITH", "NEG_ISA", "NEG_LOCATION_OF", "NEG_MANIFESTATION_OF",
                  "NEG_MEASUREMENT_OF", "NEG_MEASURES", "NEG_METHOD_OF", "NEG_OCCURS_IN",
                  "NEG_PART_OF", "NEG_PRECEDES", "NEG_PREDISPOSES", "NEG_PREVENTS",
                  "NEG_PROCESS_OF", "NEG_PRODUCES", "NEG_STIMULATES", "NEG_TREATS", "NEG_USES",
                  "NEG_higher_than", "NEG_lower_than", "NEG_same_as"
                 ]

practice_phrase = ["VERB", "NOM", "PREP"]

dont_like = ["ISA", "same_as", "lower_than", "higher_than", "compared_with"]

In [14]:
filtered_data = filtered_data[ ~ filtered_data['PREDICATE'].isin(negative_preds)]
filtered_data = filtered_data[ ~ filtered_data['PREDICATE'].isin(practice_phrase)]
filtered_data = filtered_data[ ~ filtered_data['PREDICATE'].isin(dont_like)]

now look at the stats

In [15]:
## look at the semantic types again after this removal
subject_semtypes = set(filtered_data["SUBJECT_SEMTYPE"].unique())
object_semtypes = set(filtered_data["OBJECT_SEMTYPE"].unique())
predicates = set(filtered_data["PREDICATE"].unique())

len(subject_semtypes) ## was 133, now 127: decreased by 6
len(object_semtypes)  ## was 131, now 127: decreased by 4
len(predicates)       ## was 65, now 27: decreased by 38 (was expected)

127

127

27

So that's how many kinds of subjects, objects, and predicates we have to go forward with...

In [16]:
combos = filtered_data.value_counts().reset_index()

In [17]:
combos.shape
## so that's still a lot....

(14796, 4)
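Each row of `combos` is one unique (subject, predicate, object) triple plus the number of records that have it. A toy illustration follows; the data is made up. The default name of the count column can differ between pandas versions, so the sketch names it explicitly with `reset_index(name=...)`.

```python
import pandas as pd

toy = pd.DataFrame({
    "SUBJECT_SEMTYPE": ["phsu", "phsu", "bpoc", "phsu"],
    "PREDICATE": ["TREATS", "TREATS", "LOCATION_OF", "CAUSES"],
    "OBJECT_SEMTYPE": ["dsyn", "dsyn", "neop", "dsyn"],
})

# value_counts() on a DataFrame returns a Series indexed by the unique rows;
# reset_index(name=...) turns it back into a DataFrame with a named count column.
toy_combos = toy.value_counts().reset_index(name="COUNT")
print(toy_combos)
```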

## Specific Filtering

We actually only need certain semantic types; which ones depends on how they map to the biolink model.

As a reminder:
* Get SEMMEDDB SRDEF file needed for interpreting and mapping SEMMED semantic types: 
  * download the compressed file [here](https://lhncbc.nlm.nih.gov/semanticnetwork/download.html) or a direct link [here](https://lhncbc.nlm.nih.gov/ii/tools/SemRep_SemMedDB_SKR/SRDEF.txt)
  * information on the SRDEF file [here](https://www.ncbi.nlm.nih.gov/books/NBK9679/#ch05.sec5.2)

### Ingest SEMMED semantic info and biolink model

In [18]:
srdef_location = pathlib.Path.home().joinpath(
            "Desktop", "RawDataFiles", "SEMMEDDB", "2020AA", "SRDEF")

In [19]:
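## NOTE: header=0 makes pandas treat the first line of the file as a header row and replace it with `names`.
##   If the SRDEF file has no header row, use header=None instead so the first record is kept.
srdef = pd.read_csv(srdef_location, sep="|", header=0, index_col=False,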
                    names=["Record Type (RT)",
                           "Unique Identifier (UI)",
                           "Full Name (STY/RL)", 
                           "Tree Number (STN/RTN)", 
                           "Definition (DEF)",
                           "Examples (EX)",
                           "Usage Note (UN)",
                           "Non-Human Flag (NH)",
                           "Abbreviation (ABR)",
                           "Inverse Relation (RIN)"])

In [20]:
srdef[srdef['Abbreviation (ABR)'] == 'aapp']

Unnamed: 0,Record Type (RT),Unique Identifier (UI),Full Name (STY/RL),Tree Number (STN/RTN),Definition (DEF),Examples (EX),Usage Note (UN),Non-Human Flag (NH),Abbreviation (ABR),Inverse Relation (RIN)
97,STY,T116,"Amino Acid, Peptide, or Protein",A1.4.1.2.1.7,Amino acids and chains of amino acids connecte...,,When the concept is both an enzyme and a prote...,,aapp,
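One thing worth checking in the `read_csv` call above is the `header=0` argument. When `names` is also given, pandas treats the first line as an existing header and discards it. If the SRDEF file has no header line (worth confirming with `head`), that first line is a real record. The sketch below shows the difference on two made-up records, not real SRDEF content.

```python
import io
import pandas as pd

# Two made-up pipe-delimited records with no header line (not real SRDEF content).
text = "STY|T900|First Type|ftyp\nSTY|T901|Second Type|styp\n"
cols = ["RT", "UI", "Name", "ABR"]

with_header_0 = pd.read_csv(io.StringIO(text), sep="|", header=0, names=cols)
with_header_none = pd.read_csv(io.StringIO(text), sep="|", header=None, names=cols)

print(len(with_header_0))     # the first line was consumed as a header
print(len(with_header_none))  # both lines are kept as data
```

A possible symptom of a lost first record would be a valid abbreviation showing up later as "in data, but not in SRDEF".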


### Get the biolink-mappings

In [21]:
bmt_tool = Toolkit()



In [22]:
SEMMED_entity_types = srdef[srdef['Record Type (RT)'] == 'STY'].copy()

In [23]:
## getting biolink-mapping, in the format needed to create operations
SEMMED_entity_types['BiolinkMapping'] = [bmt_tool.get_element_by_mapping('STY:'+i)
                                         for i in SEMMED_entity_types['Unique Identifier (UI)']]
SEMMED_entity_types['BiolinkMapping'] = [i.title().replace(" ", "") if isinstance(i, str)
                                         else i for i in SEMMED_entity_types['BiolinkMapping']]

## clean up the df
SEMMED_entity_types = SEMMED_entity_types[['Full Name (STY/RL)', 
                                           'Abbreviation (ABR)', 
                                           'BiolinkMapping']].copy()
SEMMED_entity_types.columns = ['FullName', "Abbrev", "BiolinkMapping"]
SEMMED_entity_types.sort_values(by='Abbrev', inplace=True)
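The formatting step above turns a space-separated element name into the CamelCase form used in operations. The sketch below isolates it in a hypothetical helper, `to_class_name`, to show its behavior, including one caveat: `str.title()` lowercases everything after the first letter of each word, so a name containing an acronym would come out differently than expected.

```python
def to_class_name(element_name):
    """Mirror the notebook's formatting: 'small molecule' -> 'SmallMolecule'."""
    if isinstance(element_name, str):
        return element_name.title().replace(" ", "")
    return element_name  # leave None (no mapping found) untouched

print(to_class_name("small molecule"))
print(to_class_name("disease or phenotypic feature"))
print(to_class_name("RNA product"))  # title() lowercases the rest of the acronym
print(to_class_name(None))
```

If any mapped name contains an acronym, it is worth comparing the result against the class name in the biolink model before using it.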

### Remove semantic types in data but not in SRDEF

**PAUSE**

This removal step is optional; currently we remove them.

In [24]:
## get all semantic types in the data
data_types = subject_semtypes.union(object_semtypes)
only_in_data = data_types - set(SEMMED_entity_types['Abbrev'])
print("removing these that are in data, but not in SRDEF:")
only_in_data

filtered_data = filtered_data[ ~ filtered_data['SUBJECT_SEMTYPE'].isin(only_in_data)]
filtered_data = filtered_data[ ~ filtered_data['OBJECT_SEMTYPE'].isin(only_in_data)]

removing these that are in data, but not in SRDEF:


{'alga',
 'carb',
 'eico',
 'invt',
 'lipd',
 'nsba',
 'opco',
 'orgm',
 'rich',
 'strd'}

In [25]:
## look at the semantic types again after this removal
subject_semtypes = set(filtered_data["SUBJECT_SEMTYPE"].unique())
object_semtypes = set(filtered_data["OBJECT_SEMTYPE"].unique())
predicates = set(filtered_data["PREDICATE"].unique())

len(subject_semtypes) ## was 127, now 117: decreased by 10 (expected)
len(object_semtypes)  ## was 127, now 117: decreased by 10 (expected)
len(predicates)       ## was 27, still 27 (expected)

117

117

27

### Clean up the mapping file to only have terms in the data

In [26]:
## refresh set of all semantic types in the data
data_types = subject_semtypes.union(object_semtypes)

## get the subset of semantic network terms that are actually in the data
SEMMED_entity_types = SEMMED_entity_types[SEMMED_entity_types['Abbrev'].isin(data_types)].copy()

### Review mappings and decide what to change or remove

**PAUSE**

This is a place to **STOP** and review all the SEMMED semantic types and their mapping to biolink semantic types...to decide what we are interested in keeping. This involves some knowledge of what biolink semantic types are prioritized in Translator. 

One can use the definitions of the SEMMED semantic types (from the SRDEF file or [browsing the UMLS vocab online](https://uts.nlm.nih.gov/uts/umls/semantic-network/root)) and the definitions of biolink semantic types (look under [things](https://github.com/biolink/biolink-model/blob/b831767e02f25c7869f760e80567ae05ceefe06c/biolink-model.yaml#L6653))

See the last section (section 6) for notes on decisions that were made...

---

Sections 3.5 + 3.6 below involve this review: changing mappings and removing some SEMMED semantic types.

In [27]:
## code used to review 
SEMMED_entity_types['BiolinkMapping'].unique()

SEMMED_entity_types[SEMMED_entity_types['BiolinkMapping'] == 'Vitamin']

array(['Polypeptide', 'Disease', 'Activity', 'Cohort', None,
       'AnatomicalEntity', 'Drug', 'Behavior', 'Phenomenon', 'Device',
       'GrossAnatomicalStructure', 'CellularComponent',
       'PhysiologicalProcess', 'Cell', 'ChemicalEntity',
       'InformationContentEntity', 'ClinicalAttribute', 'Procedure',
       'SmallMolecule', 'BiologicalEntity', 'Protein', 'Event',
       'DiseaseOrPhenotypicFeature', 'Food', 'GeographicLocation',
       'GenomicEntity', 'Agent', 'Publication', 'PhysicalEntity',
       'MolecularActivity', 'MolecularEntity', 'NucleicAcidEntity',
       'OrganismAttribute', 'PopulationOfIndividualOrganisms',
       'PhenotypicFeature', 'Vitamin'], dtype=object)

Unnamed: 0,FullName,Abbrev,BiolinkMapping
104,Vitamin,vita,Vitamin


In [28]:
## re-mapping based on putting IDs into normalization service / our operation system...
##   we have UMLS for Disease (mydisease, mychem), SmallMolecule (idisk)

## Previously were mapped to ChemicalEntity, currently not mapped, so doing the mapping here
SEMMED_entity_types.loc[
    (SEMMED_entity_types['Abbrev'] == 'bacs'),'BiolinkMapping'] = 'SmallMolecule'  ## Biologically Active Substance
SEMMED_entity_types.loc[
    (SEMMED_entity_types['Abbrev'] == 'hops'),'BiolinkMapping'] = 'SmallMolecule'  ## Hazardous or Poisonous Substance
SEMMED_entity_types.loc[
    (SEMMED_entity_types['Abbrev'] == 'horm'),'BiolinkMapping'] = 'SmallMolecule'  ## Hormone
SEMMED_entity_types.loc[
    (SEMMED_entity_types['Abbrev'] == 'imft'),'BiolinkMapping'] = 'ChemicalEntity'  ## Immunologic Factor: I didn't know where to put antigens and vaccines...
SEMMED_entity_types.loc[
    (SEMMED_entity_types['Abbrev'] == 'inch'),'BiolinkMapping'] = 'SmallMolecule'  ## Inorganic Chemical
SEMMED_entity_types.loc[
    (SEMMED_entity_types['Abbrev'] == 'orch'),'BiolinkMapping'] = 'SmallMolecule'  ## Organic Chemical

## currently mapped to Drug, but re-mapping here
SEMMED_entity_types.loc[
    (SEMMED_entity_types['Abbrev'] == 'antb'),'BiolinkMapping'] = 'SmallMolecule'  ## Antibiotic
SEMMED_entity_types.loc[
    (SEMMED_entity_types['Abbrev'] == 'phsu'),'BiolinkMapping'] = 'SmallMolecule'  ## Pharmacologic Substance

## currently mapped to GenomicEntity, but re-mapping here
SEMMED_entity_types.loc[
    (SEMMED_entity_types['Abbrev'] == 'gngm'),'BiolinkMapping'] = 'Gene'

## currently mapped to NucleicAcidEntity, but re-mapping here
SEMMED_entity_types.loc[
    (SEMMED_entity_types['Abbrev'] == 'nnon'),'BiolinkMapping'] = 'SmallMolecule'  ## Nucleic Acid, Nucleoside, or Nucleotide

## currently mapped to Vitamin, but re-mapping here
SEMMED_entity_types.loc[
    (SEMMED_entity_types['Abbrev'] == 'vita'),'BiolinkMapping'] = 'SmallMolecule'  ## Vitamin
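The repeated `.loc` assignments above could also be written as one lookup table, which makes the set of overrides easier to review in one place. This is a sketch of that alternative on a toy table; `overrides` holds only two of the re-mappings from the cell above.

```python
import pandas as pd

toy_types = pd.DataFrame({
    "Abbrev": ["antb", "gngm", "dsyn"],
    "BiolinkMapping": ["Drug", "GenomicEntity", "Disease"],
})

# Overrides taken from the cell above; extend the dict to cover the rest.
overrides = {"antb": "SmallMolecule", "gngm": "Gene"}

# Only rows whose abbreviation has an override are touched.
mask = toy_types["Abbrev"].isin(list(overrides))
toy_types.loc[mask, "BiolinkMapping"] = toy_types.loc[mask, "Abbrev"].map(overrides)
print(toy_types)
```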

In [29]:
currently_unused_entities = [
    ## mapped to Polypeptide
    "amas",     
    ## all that are mapped to Activity
    "acty", "dora", "edac", "gora", "hlca", "mcha", "ocac", "resa",  
    ## all that are mapped to Cohort
    "aggp", "famg", "podg", "prog",  
    ## missing mappings
    "amph", "anim", "arch", "bact", "bdsu", "bdsy", "bird", "blor",
    "bmod", "bsoj", "chvf", "chvs", "euka", "ffas", "fish", "fngs", 
    "humn", "irda", "mamm", "ocdi", "plnt", "rept", "sbst", "virs", "vtbt",
    ## mapped to Anatomical Entity
    "anst",
    ## mapped to Behavior
    "bhvr", "inbe", "menp", "socb",
    ## mapped to Phenomenon
    "biof", "eehu", "hcpp", "lbtr", "npop", "phpr",
    ## mapped to Device
    "bodm", "drdd", "medd", "resd",
    ## mapped to GrossAnatomicalStructure
    "emst", "tisu", 
    ## mapped to PhysiologicalProcess
    "genf",
    ## mapped to ChemicalEntity
    "chem",
    ## all that are mapped to InformationContentEntity
    "clas", "cnce", "ftcn", "grpa", "idcn", "lang", "qlco", "qnco", 
    "rnlw", "spco", "tmco",
    ## all that are mapped to ClinicalAttribute
    "clna",
    ## mapped to Procedure
    "lbpr", "mbrt", 
    ## all that are mapped to SmallMolecule
    "elii",
    ## all that are mapped to BiologicalEntity
    "emod",
    ## mapped to Protein
    "rcpt",
    ## all that are mapped to Event
    "evnt",
    ## all that are mapped to DiseaseOrPhenotypicFeature
    "fndg",
    ## all that are mapped to GeographicLocation
    "geoa",
    ## all that are mapped to Agent
    "grup", "hcro", "orgt", "pros", "shro",
    ## all that are mapped to Publication
    "inpr",
    ## all that are mapped to PhysicalEntity
    "mnob",
    ## all that are mapped to MolecularEntity
    "mosq",
    ## mapped to NucleicAcidEntity
    "nusq",
    ## all that are mapped to OrganismAttribute
    "orga",
    ## all that are mapped to PopulationOfIndividualOrganisms
    "popg"
]

### prune data based on what node entities to remove

In [30]:
## prune this doc
SEMMED_entity_types = SEMMED_entity_types[ ~ SEMMED_entity_types['Abbrev'].isin(currently_unused_entities)]

## prune the data doc
filtered_data = filtered_data[ ~ filtered_data['SUBJECT_SEMTYPE'].isin(currently_unused_entities)]
filtered_data = filtered_data[ ~ filtered_data['OBJECT_SEMTYPE'].isin(currently_unused_entities)]

In [31]:
## look at the semantic types again after this removal
subject_semtypes = set(filtered_data["SUBJECT_SEMTYPE"].unique())
object_semtypes = set(filtered_data["OBJECT_SEMTYPE"].unique())
predicates = set(filtered_data["PREDICATE"].unique())

len(subject_semtypes) ## was 117, now 35: decreased by 82
len(object_semtypes)  ## was 117, now 35: decreased by 82
len(predicates)       ## was 27, now 26: decreased by 1 (cool)

35

35

26

In [32]:
SEMMED_entity_types

Unnamed: 0,FullName,Abbrev,BiolinkMapping
97,"Amino Acid, Peptide, or Protein",aapp,Polypeptide
15,Acquired Abnormality,acab,Disease
115,Anatomical Abnormality,anab,Disease
119,Antibiotic,antb,SmallMolecule
101,Biologically Active Substance,bacs,SmallMolecule
18,"Body Part, Organ, or Organ Component",bpoc,GrossAnatomicalStructure
21,Cell Component,celc,CellularComponent
35,Cell Function,celf,PhysiologicalProcess
20,Cell,cell,Cell
14,Congenital Abnormality,cgab,Disease


### getting mappings for predicates

In [33]:
SEMMED_predicates = srdef[srdef['Record Type (RT)'] == 'RL'].copy()

SEMMED_predicates = SEMMED_predicates[['Full Name (STY/RL)', 'Inverse Relation (RIN)'
                                      ]].copy()
SEMMED_predicates.columns = ['SemmedName', 'SemmedInverse']
SEMMED_predicates.sort_values(by='SemmedName', inplace=True)

**PAUSE**

The SRDEF file might be missing predicates that are in the data, so check that and see if you would like to add those back in...

In [34]:
## manually add some predicates that are in the data, but not in SRDEF...
SEMMED_predicates.loc[len(SEMMED_predicates.index)] = ['administered_to', '']
SEMMED_predicates.loc[len(SEMMED_predicates.index)] = ['augments', '']
SEMMED_predicates.loc[len(SEMMED_predicates.index)] = ['co-exists_with', '']
SEMMED_predicates.loc[len(SEMMED_predicates.index)] = ['converts_to', '']
SEMMED_predicates.loc[len(SEMMED_predicates.index)] = ['inhibits', '']
SEMMED_predicates.loc[len(SEMMED_predicates.index)] = ['predisposes', '']
SEMMED_predicates.loc[len(SEMMED_predicates.index)] = ['stimulates', '']

SEMMED_predicates.sort_values(by='SemmedName', inplace=True)

In [35]:
SEMMED_predicates['SemmedCurie']= ['SEMMEDDB:'+i 
                                   for i in SEMMED_predicates['SemmedName'].str.replace("-", "")]

SEMMED_predicates['BiolinkMapping'] = [bmt_tool.get_element_by_mapping(i)
                                       for i in SEMMED_predicates['SemmedCurie']]

SEMMED_predicates['BiolinkMapping'] = SEMMED_predicates['BiolinkMapping'].str.replace(" ", "_")

In [36]:
SEMMED_predicates[SEMMED_predicates['BiolinkMapping'].notna()]

Unnamed: 0,SemmedName,SemmedInverse,SemmedCurie,BiolinkMapping
54,administered_to,,SEMMEDDB:administered_to,affects
145,affects,affected_by,SEMMEDDB:affects,affects
160,associated_with,associated_with,SEMMEDDB:associated_with,related_to
55,augments,,SEMMEDDB:augments,positively_regulates
141,causes,caused_by,SEMMEDDB:causes,causes
56,co-exists_with,,SEMMEDDB:coexists_with,coexists_with
143,complicates,complicated_by,SEMMEDDB:complicates,interacts_with
57,converts_to,,SEMMEDDB:converts_to,derives_into
157,diagnoses,diagnosed_by,SEMMEDDB:diagnoses,biomarker_for
140,disrupts,disrupted_by,SEMMEDDB:disrupts,disrupts


**PAUSE**

Below some mappings for predicates are changed. Check and decide whether you want to do this and how...

---
Going to re-map for these entries...

In [37]:
## since it's currently mapped to a mixin....move it to a real predicate
SEMMED_predicates.loc[
    (SEMMED_predicates['SemmedName'] == 'augments'),'BiolinkMapping'] = 'entity_positively_regulates_entity' 

SEMMED_predicates.loc[
    (SEMMED_predicates['SemmedName'] == 'complicates'),'BiolinkMapping'] = 'exacerbates' 

## the "diagnoses" predicate is more of a "this thing distinguishes / identifies this other thing"...
##   so I didn't want to leave it mapped to "biomarker_for"
SEMMED_predicates.loc[
    (SEMMED_predicates['SemmedName'] == 'diagnoses'),'BiolinkMapping'] = 'related_to' 

## the mappings for treats and administered_to (mapped to affects) are
##   also a bit strong (the thing was used, but it's unclear whether it actually helped or had an effect)

**PAUSE**

Check and decide whether you want to do this

---

Missing mappings for:
- measures 
- method of
- uses

Going to remove those combos from the data in section 3.8 later...

In [38]:
## only keep the stuff we found biolink mappings for
SEMMED_predicates = SEMMED_predicates[SEMMED_predicates['BiolinkMapping'].notna()].copy()

In [39]:
## get inverse predicates so generating reverse operations is easier

SEMMED_predicates['BiolinkInverse'] = [bmt_tool.get_element(i).inverse 
                                       if isinstance(bmt_tool.get_element(i).inverse, str)
                                       else i 
                                       for i in SEMMED_predicates['BiolinkMapping']]

SEMMED_predicates['BiolinkInverse'] = SEMMED_predicates['BiolinkInverse'].str.replace(" ", "_")

In [40]:
SEMMED_predicates

Unnamed: 0,SemmedName,SemmedInverse,SemmedCurie,BiolinkMapping,BiolinkInverse
54,administered_to,,SEMMEDDB:administered_to,affects,affected_by
145,affects,affected_by,SEMMEDDB:affects,affects,affected_by
160,associated_with,associated_with,SEMMEDDB:associated_with,related_to,related_to
55,augments,,SEMMEDDB:augments,entity_positively_regulates_entity,entity_positively_regulated_by_entity
141,causes,caused_by,SEMMEDDB:causes,causes,caused_by
56,co-exists_with,,SEMMEDDB:coexists_with,coexists_with,coexists_with
143,complicates,complicated_by,SEMMEDDB:complicates,exacerbates,exacerbates
57,converts_to,,SEMMEDDB:converts_to,derives_into,derives_from
157,diagnoses,diagnosed_by,SEMMEDDB:diagnoses,related_to,related_to
140,disrupts,disrupted_by,SEMMEDDB:disrupts,disrupts,disrupted_by


**PAUSE**

Check whether the predicate inverses look correct. Some currently aren't: the predicate is directional but has no inverse, so the code above fell back to the predicate itself.

This is why they need to be manually changed.

Other predicates aren't directional (`symmetric == true`), so those are identical in either direction.
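The review can be made more systematic by flagging which rows fell back to the predicate itself without being symmetric. The sketch below uses a hypothetical lookup table as a stand-in for the biolink model; it does not call bmt, and the entries are illustrative only.

```python
# Hypothetical lookup table standing in for the biolink model; in the notebook
# this information comes from bmt. None marks "no inverse declared".
declared_inverse = {
    "affects": "affected_by",
    "causes": "caused_by",
    "related_to": None,      # symmetric in this toy table
    "exacerbates": None,     # directional, but no inverse in this toy table
}
symmetric = {"related_to"}

def inverse_or_flag(predicate):
    """Return (inverse, needs_review) for a predicate in the toy table."""
    inverse = declared_inverse.get(predicate)
    if inverse is not None:
        return inverse, False
    if predicate in symmetric:
        return predicate, False   # the same predicate reads correctly both ways
    return predicate, True        # the fallback is a placeholder: fix by hand

for name in declared_inverse:
    print(name, inverse_or_flag(name))
```

Only the predicates with `needs_review` set to `True` require a manual decision.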

In [41]:
## currently no inverse but it needs one

SEMMED_predicates.loc[
    (SEMMED_predicates['SemmedName'] == 'complicates'),'BiolinkInverse'] = 'related_to' 


SEMMED_predicates.loc[
    (SEMMED_predicates['SemmedName'] == 'predisposes'),'BiolinkInverse'] = 'risk_affected_by' 
## using inverse of the parent term (affects risk for)

In [42]:
SEMMED_predicates

Unnamed: 0,SemmedName,SemmedInverse,SemmedCurie,BiolinkMapping,BiolinkInverse
54,administered_to,,SEMMEDDB:administered_to,affects,affected_by
145,affects,affected_by,SEMMEDDB:affects,affects,affected_by
160,associated_with,associated_with,SEMMEDDB:associated_with,related_to,related_to
55,augments,,SEMMEDDB:augments,entity_positively_regulates_entity,entity_positively_regulated_by_entity
141,causes,caused_by,SEMMEDDB:causes,causes,caused_by
56,co-exists_with,,SEMMEDDB:coexists_with,coexists_with,coexists_with
143,complicates,complicated_by,SEMMEDDB:complicates,exacerbates,related_to
57,converts_to,,SEMMEDDB:converts_to,derives_into,derives_from
157,diagnoses,diagnosed_by,SEMMEDDB:diagnoses,related_to,related_to
140,disrupts,disrupted_by,SEMMEDDB:disrupts,disrupts,disrupted_by


### prune the data based on what predicates to keep

**PAUSE**

This step removes from the data the predicates that were dropped earlier (couldn't map / didn't find useful):
- measures
- method of
- uses

In [43]:
SEMMED_predicates['NameInDataFormat'] = SEMMED_predicates['SemmedName'].str.upper()
SEMMED_predicates['NameInDataFormat'] = SEMMED_predicates['NameInDataFormat'].str.replace("-", "")
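The same predicate appears in three spellings in this notebook: the SRDEF-style name, the curie used for the biolink lookup, and the upper-case form in the data. The two hypothetical helpers below spell out the string conversions used in the cells above.

```python
def to_semmed_curie(semmed_name):
    """'co-exists_with' -> 'SEMMEDDB:coexists_with' (hyphen removed)."""
    return "SEMMEDDB:" + semmed_name.replace("-", "")

def to_data_format(semmed_name):
    """'co-exists_with' -> 'COEXISTS_WITH', as the predicate appears in the data."""
    return semmed_name.upper().replace("-", "")

for name in ["co-exists_with", "treats", "location_of"]:
    print(name, to_semmed_curie(name), to_data_format(name))
```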

In [44]:
more_removals = predicates - set(SEMMED_predicates['NameInDataFormat'])
more_removals

{'MEASURES', 'METHOD_OF', 'USES'}

In [45]:
filtered_data = filtered_data[ ~ filtered_data['PREDICATE'].isin(more_removals)]

In [46]:
## look at the semantic types again after this removal
subject_semtypes = set(filtered_data["SUBJECT_SEMTYPE"].unique())
object_semtypes = set(filtered_data["OBJECT_SEMTYPE"].unique())
predicates = set(filtered_data["PREDICATE"].unique())

len(subject_semtypes) ## was 35, still 35 (expected)
len(object_semtypes)  ## was 35, still 35 (expected)
len(predicates)       ## was 26, now 23: decreased by 3 (expected)

35

35

23

## Looking at combos

In [47]:
## look at number of combos after this removal
combos = filtered_data.value_counts().reset_index()
combos.columns = ['SUBJECT_SEMTYPE', 'PREDICATE', 'OBJECT_SEMTYPE', 'COUNT']
combos.shape
combos.head(10)

(4211, 4)

Unnamed: 0,SUBJECT_SEMTYPE,PREDICATE,OBJECT_SEMTYPE,COUNT
0,bpoc,LOCATION_OF,neop,1023005
1,bpoc,LOCATION_OF,aapp,953010
2,topp,TREATS,dsyn,849825
3,cell,LOCATION_OF,aapp,815374
4,bpoc,LOCATION_OF,patf,799234
5,topp,TREATS,neop,707094
6,bpoc,LOCATION_OF,dsyn,696732
7,phsu,TREATS,dsyn,660654
8,cell,PART_OF,bpoc,598884
9,bpoc,PART_OF,bpoc,542301


**PAUSE**

Use the code block below to decide how many combos to keep, based on how many predications (records) there are per combo.

---

Current:

Let's try a limit: only build an operation if there are > 100 records for the triple.
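Before settling on a cutoff, it can help to see how many combos survive at a few candidate values. This is a minimal sketch with made-up counts (not the notebook's real numbers); `combos_kept` is a hypothetical helper.

```python
# Made-up counts per combo, not the notebook's real numbers.
combo_counts = [1023005, 849825, 5400, 320, 101, 100, 42, 3]

def combos_kept(counts, minimum):
    """Number of combos whose record count is strictly greater than `minimum`."""
    return sum(1 for c in counts if c > minimum)

for minimum in (10, 100, 1000):
    print(minimum, combos_kept(combo_counts, minimum))
```

Note that the comparison is strict, so a combo with exactly 100 records is dropped at a cutoff of 100.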

In [50]:
combos[(combos['COUNT'] > 100)]

Unnamed: 0,SUBJECT_SEMTYPE,PREDICATE,OBJECT_SEMTYPE,COUNT
0,bpoc,LOCATION_OF,neop,1023005
1,bpoc,LOCATION_OF,aapp,953010
2,topp,TREATS,dsyn,849825
3,cell,LOCATION_OF,aapp,815374
4,bpoc,LOCATION_OF,patf,799234
...,...,...,...,...
2180,antb,AUGMENTS,celf,102
2181,food,CAUSES,anab,102
2182,clnd,TREATS,anab,101
2183,antb,INHIBITS,imft,101


In [51]:
filtered_combos = combos[(combos['COUNT'] > 100)].copy()
filtered_combos.drop(columns='COUNT', inplace=True)
filtered_combos.shape

(2185, 3)

In [52]:
SEMMED_entity_types.columns

Index(['FullName', 'Abbrev', 'BiolinkMapping'], dtype='object')

In [53]:
## now have to map all subject to biolink

filtered_combos = filtered_combos.merge(SEMMED_entity_types[['Abbrev', 'BiolinkMapping']], 
                                        how='left', left_on='SUBJECT_SEMTYPE', right_on='Abbrev')

filtered_combos.drop(columns = 'Abbrev', inplace=True)

filtered_combos.columns = ['OriginalSubject', 'OriginalPredicate', 'OriginalObject', 
                           'BiolinkSubject']

In [54]:
## now have to map all predicate to biolink

filtered_combos = filtered_combos.merge(SEMMED_predicates[['NameInDataFormat', 'BiolinkMapping', 'BiolinkInverse']],
                                        how='left', left_on='OriginalPredicate', right_on='NameInDataFormat')

filtered_combos.drop(columns = 'NameInDataFormat', inplace=True)

filtered_combos.columns = ['OriginalSubject', 'OriginalPredicate', 'OriginalObject', 
                           'BiolinkSubject', 'BiolinkPredicate', 'BiolinkInversePred']

In [55]:
## now have to map all object to biolink

filtered_combos = filtered_combos.merge(SEMMED_entity_types[['Abbrev', 'BiolinkMapping']],
                                        how='left', left_on='OriginalObject', right_on='Abbrev')

filtered_combos.drop(columns = 'Abbrev', inplace=True)

filtered_combos.columns = ['OriginalSubject', 'OriginalPredicate', 'OriginalObject', 
                           'BiolinkSubject', 'BiolinkPredicate', 'BiolinkInversePred', 'BiolinkObject']
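The subject and object merges above follow the same pattern, so they could share one helper. This is a sketch of that alternative on toy data, with mappings copied from this notebook. `add_biolink_type` is a hypothetical name, and `validate="many_to_one"` is an extra safety check that is not in the original cells.

```python
import pandas as pd

toy_combos = pd.DataFrame({
    "SUBJECT_SEMTYPE": ["topp", "bpoc"],
    "PREDICATE": ["TREATS", "LOCATION_OF"],
    "OBJECT_SEMTYPE": ["dsyn", "aapp"],
})
toy_types = pd.DataFrame({
    "Abbrev": ["topp", "bpoc", "dsyn", "aapp"],
    "BiolinkMapping": ["Procedure", "GrossAnatomicalStructure", "Disease", "Polypeptide"],
})

def add_biolink_type(combos, types, semtype_col, new_col):
    """Left-join the biolink class for `semtype_col` into a new column `new_col`."""
    lookup = types.rename(columns={"Abbrev": semtype_col, "BiolinkMapping": new_col})
    # validate raises MergeError if an abbreviation appears twice in the lookup table.
    return combos.merge(lookup, how="left", on=semtype_col, validate="many_to_one")

mapped = add_biolink_type(toy_combos, toy_types, "SUBJECT_SEMTYPE", "BiolinkSubject")
mapped = add_biolink_type(mapped, toy_types, "OBJECT_SEMTYPE", "BiolinkObject")
print(mapped)
```

Renaming the lookup columns before the merge avoids the drop-and-rename steps after each merge.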

In [56]:
filtered_combos[0:3]

Unnamed: 0,OriginalSubject,OriginalPredicate,OriginalObject,BiolinkSubject,BiolinkPredicate,BiolinkInversePred,BiolinkObject
0,bpoc,LOCATION_OF,neop,GrossAnatomicalStructure,location_of,located_in,Disease
1,bpoc,LOCATION_OF,aapp,GrossAnatomicalStructure,location_of,located_in,Polypeptide
2,topp,TREATS,dsyn,Procedure,treats,treated_by,Disease
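
Because the three merges above use `how='left'`, a semantic type or predicate with no entry in the mapping table does not drop out; it silently ends up with `NaN` in the Biolink column. A minimal sketch of a check for that, on toy data (the frames, the `PREDICATE`/`OBJECT_SEMTYPE` column names, and the `zzzz` type are made up for illustration and are not taken from SEMMEDDB):

```python
import pandas as pd

# Toy stand-ins for the combos and entity-type tables (illustrative values only).
combos = pd.DataFrame({
    "SUBJECT_SEMTYPE": ["bpoc", "topp", "zzzz"],  # "zzzz" is a made-up type with no mapping
    "PREDICATE": ["LOCATION_OF", "TREATS", "TREATS"],
    "OBJECT_SEMTYPE": ["neop", "dsyn", "dsyn"],
})
entity_types = pd.DataFrame({
    "Abbrev": ["bpoc", "topp", "neop", "dsyn"],
    "BiolinkMapping": ["GrossAnatomicalStructure", "Procedure", "Disease", "Disease"],
})

# Same kind of left merge as in the cells above.
merged = combos.merge(entity_types, how="left",
                      left_on="SUBJECT_SEMTYPE", right_on="Abbrev")

# Rows whose subject type had no match carry NaN in BiolinkMapping.
unmapped = merged[merged["BiolinkMapping"].isna()]
print(unmapped["SUBJECT_SEMTYPE"].tolist())
```

Running the same `isna()` filter on the Biolink columns of `filtered_combos` before generating operations would flag any rows that slipped through without a mapping.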


## Generate operation yaml!

**PAUSE**

* If needed, edit the functions below to change the x-bte annotations that get generated.
* The yaml created below refers to the umls-subj and umls-obj response mappings. Those are defined [close to the bottom of this file](https://github.com/NCATS-Tangerine/translator-api-registry/blob/master/semmeddb/version_without_operations.yaml).

In [57]:
yaml=ryml.YAML()
folded = ryml.scalarstring.FoldedScalarString

In [58]:
def generate_forward_op(original_subj, original_pred, original_obj,
                        biolink_subj, biolink_pred, biolink_obj):
    ## set size parameter for biothings POST query. This will change depending on what is set for the API
    POST_size = 1000  
    
    ## create the keys for the operation names
    normal_op_name = f"{original_subj}-{original_pred}-{original_obj}"
    
    ## USE FOLDED in order to have the quotes handled properly (no escape \) in the dumped document
    ## original direction: subject -> object
    normal_op_body = folded(
    '{' \
        '"q": [ {{ queryInputs | wrap( \'["\', \'",' + f'"{original_pred}","{original_obj}",1,1]\' )' + ' }} ], ' \
        '"scopes": ["subject.umls", "predicate", "object.semantic_type_abbreviation", ' \
                   '"subject.novelty", "object.novelty"]' \
    '}')
    
    ## create operation...
    temp = {
        ## original direction: subject -> object
        normal_op_name: [
            {
                'supportBatch': True,
                'useTemplating': True,
                'inputs': [
                    {
                        'id': 'UMLS',
                        'semantic': biolink_subj  ## input is subject!
                    }
                ],
                'requestBodyType': 'object',
                'requestBody': {'body': normal_op_body},
                'parameters': {
                    'fields': 'object.umls,pmid,subject.umls,subject.name,predicate,object.name',
                    'size': POST_size
                },
                'outputs': [
                    {
                        'id': 'UMLS',
                        'semantic': biolink_obj  ## output is object
                    }
                ],
                'predicate': biolink_pred,
                'source': 'infores:semmeddb',
                'response_mapping': {
                    "$ref": '#/components/x-bte-response-mapping/umls-obj'  ## matches output as object
                }
            }
        ]
    }
    return temp
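
To make the templated request body easier to read, here is a rough sketch of what it might expand to once input IDs are filled in. It assumes the `wrap` filter simply adds the given prefix and suffix to each input and that the wrapped inputs are comma-joined; `render_forward_body` is a hypothetical helper for illustration, not part of BTE, and the UMLS IDs are placeholders.

```python
import json

def render_forward_body(input_ids, original_pred, original_obj):
    """Mimic the templated forward request body for a list of input UMLS IDs.

    Assumption: `wrap` adds a prefix and a suffix to each input, and the
    wrapped inputs are joined with commas.
    """
    prefix = '["'
    suffix = f'","{original_pred}","{original_obj}",1,1]'
    wrapped = ",".join(prefix + umls_id + suffix for umls_id in input_ids)
    body = (
        '{"q": [ ' + wrapped + ' ], '
        '"scopes": ["subject.umls", "predicate", "object.semantic_type_abbreviation", '
        '"subject.novelty", "object.novelty"]}'
    )
    # Parsing the result confirms the rendered body is valid JSON.
    return json.loads(body)

rendered = render_forward_body(["C0000001", "C0000002"], "TREATS", "dsyn")  # placeholder IDs
print(rendered["q"])
print(rendered["scopes"])
```

Under these assumptions, each entry of `q` has five terms that line up positionally with the five `scopes`: the input ID against `subject.umls`, then the predicate, the object semantic type, and the two novelty values.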

In [59]:
def generate_reverse_op(original_subj, original_pred, original_obj,
                            biolink_subj, biolink_inverse_pred,  ## NOTICE THE INVERSE PRED USED HERE
                            biolink_obj):
    ## set size parameter for biothings POST query. This will change depending on what is set for the API
    POST_size = 1000  
    
    ## create the keys for the operation names
    rev_op_name = f"{original_subj}-{original_pred}-{original_obj}-rev"
    
    ## USE FOLDED in order to have the quotes handled properly (no escape \) in the dumped document
    ## reverse direction: object -> subject
    rev_op_body = folded(
    '{' \
        '"q": [ {{ queryInputs | wrap( \'["\', \'",' + f'"{original_pred}","{original_subj}",1,1]\' )' + '}}], ' \
        '"scopes": ["object.umls", "predicate", "subject.semantic_type_abbreviation", ' \
                   '"subject.novelty", "object.novelty"]' \
    '}')
    
    ## create the operation...
    temp = {
        ## reverse direction: object -> subject
        rev_op_name: [
            {
                'supportBatch': True,
                'useTemplating': True,
                'inputs': [
                    {
                        'id': 'UMLS',
                        'semantic': biolink_obj  ## input is object!
                    }
                ],
                'requestBodyType': 'object',
                'requestBody': {'body': rev_op_body},
                'parameters': {
                    'fields': 'subject.umls,pmid,subject.name,predicate,object.umls,object.name',
                    'size': POST_size
                },
                'outputs': [
                    {
                        'id': 'UMLS',
                        'semantic': biolink_subj  ## output is subject
                    }
                ],
                'predicate': biolink_inverse_pred,  ## use inverse pred!
                'source': 'infores:semmeddb',
                'response_mapping': {
                    "$ref": '#/components/x-bte-response-mapping/umls-subj'  ## matches output as subj
                }
            }
        ]
    }
    return temp

In [60]:
def generate_all_operations(combo_df):
    op_tracking = set()
    
    saved = dict()
    ## iterate through rows of combos dataframe
    for row in combo_df.itertuples(index = False):
        
        ## forward: only make the operation if it's not going to be a dupe.
        ##          The query only uses the predicate + object semantic type, so two subject
        ##          semantic types mapped to the same biolink class would give the same operation
        forward_op_record = f"{row.BiolinkSubject}-{row.OriginalPredicate}-{row.OriginalObject}"
        if forward_op_record not in op_tracking:
            op_tracking.add(forward_op_record)
            saved.update(generate_forward_op(original_subj = row.OriginalSubject,
                                             original_pred = row.OriginalPredicate,
                                             original_obj = row.OriginalObject,
                                             biolink_subj = row.BiolinkSubject,
                                             biolink_pred = row.BiolinkPredicate,
                                             biolink_obj = row.BiolinkObject
                                            ))
            
        ## reverse: only make the operation if it's not going to be a dupe.
        ##          The query only uses the predicate + subject semantic type, so two object
        ##          semantic types mapped to the same biolink class would give the same operation
        reverse_op_record = f"rev-{row.BiolinkObject}-{row.OriginalPredicate}-{row.OriginalSubject}"
        if reverse_op_record not in op_tracking:
            op_tracking.add(reverse_op_record)
            saved.update(generate_reverse_op(original_subj = row.OriginalSubject,
                                             original_pred = row.OriginalPredicate,
                                             original_obj = row.OriginalObject,
                                             biolink_subj = row.BiolinkSubject,
                                             biolink_inverse_pred = row.BiolinkInversePred,
                                             biolink_obj = row.BiolinkObject
                                             ))
            
    final = {"x-bte-kgs-operations": saved}
    return final
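
The effect of the `op_tracking` set can be seen on a toy input. This is a simplified sketch that only tracks operation names, using the same keys as `generate_all_operations`; `list_operation_names` is a hypothetical helper, and the second row is an illustrative combo chosen because `dsyn` and `neop` both map to Disease.

```python
def list_operation_names(rows):
    """rows: tuples of (orig_subj, pred, orig_obj, biolink_subj, biolink_obj)."""
    seen = set()
    names = []
    for orig_subj, pred, orig_obj, bl_subj, bl_obj in rows:
        # Forward key: biolink subject + predicate + original object type.
        forward_key = f"{bl_subj}-{pred}-{orig_obj}"
        if forward_key not in seen:
            seen.add(forward_key)
            names.append(f"{orig_subj}-{pred}-{orig_obj}")
        # Reverse key: biolink object + predicate + original subject type.
        reverse_key = f"rev-{bl_obj}-{pred}-{orig_subj}"
        if reverse_key not in seen:
            seen.add(reverse_key)
            names.append(f"{orig_subj}-{pred}-{orig_obj}-rev")
    return names

rows = [
    ("topp", "TREATS", "dsyn", "Procedure", "Disease"),
    ("topp", "TREATS", "neop", "Procedure", "Disease"),  # illustrative second combo
]
names = list_operation_names(rows)
print(names)
```

Two rows could give four operations, but only three are kept: both reverse operations would take a Disease input and query on `TREATS` + `topp`, so the second one is skipped.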

Get the file made!

In [62]:
kgs_operations = generate_all_operations(filtered_combos)

In [63]:
len(kgs_operations['x-bte-kgs-operations'])

1883

**PAUSE**

* The operations end up condensed: the 2185 combos could give up to 4370 operations (one forward and one reverse each), but only 1883 are created. This is because generate_all_operations uses a set to skip any operation whose query would duplicate one already made.
* Set where to write the yaml files in the code chunks below:
  * operations_path
  * list_path

In [64]:
yaml.boolean_representation = ['False', 'True']

operations_path = pathlib.Path.home().joinpath(
            "Desktop", "translator-api-registry", "semmeddb", "generated_operations.yaml")

yaml.dump(kgs_operations, operations_path)

Wait a sec! Need the operations list too!

In [65]:
def generate_kgs_operations_list(operations_dict):
    kgs_op_list = []
    for key in operations_dict.keys():
        kgs_op_list.append( {"$ref": f"#/components/x-bte-kgs-operations/{key}"} )
    final2 = {"x-bte-kgs-operations": kgs_op_list}
    return final2

In [66]:
operations_list = generate_kgs_operations_list(kgs_operations['x-bte-kgs-operations'])

In [67]:
list_path = pathlib.Path.home().joinpath(
            "Desktop", "translator-api-registry", "semmeddb", "generated_list.yaml")

yaml.dump(operations_list, list_path)

**PAUSE**

* The generated yaml segments now have to be indented manually and inserted into the correct sections of the smartapi yaml.
  * This is easier in an editor like Visual Studio Code, where one can select and indent large sections of text at once
  * The amount of indentation and where each segment goes are specified in the [yaml that acts as a template](https://github.com/NCATS-Tangerine/translator-api-registry/blob/master/semmeddb/version_without_operations.yaml)
  * The finished file is meant to be [here](https://github.com/NCATS-Tangerine/translator-api-registry/blob/master/semmeddb/smartapi.yaml), so one could paste the sections directly there
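
If doing the indentation by hand is tedious, the generated text could also be indented with the standard library before pasting. This is only a sketch: `indent_yaml_text` is a hypothetical helper, the toy segment is not real output, and the indent of 2 spaces is a placeholder (the template yaml specifies the real amount for each segment).

```python
import textwrap

def indent_yaml_text(text, n_spaces):
    """Prefix every non-blank line of a yaml segment with n_spaces spaces."""
    return textwrap.indent(text, " " * n_spaces)

# Toy segment standing in for the contents of a generated yaml file.
segment = "x-bte-kgs-operations:\n  topp-TREATS-dsyn:\n  - supportBatch: True\n"

indented = indent_yaml_text(segment, 2)  # 2 is a placeholder indent
print(indented)
```

For the real files, `segment` would be the text read from `operations_path` or `list_path`.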
  
---

The code below is optional, in case one wants to convert the yaml to json (for BTE's test/query endpoint testing)

In [68]:
## extra code in case we want to convert to json...

yaml_path = pathlib.Path.home().joinpath(
            "Desktop", "translator-api-registry", "semmeddb", "smartapi.yaml")

here = yaml.load(yaml_path)

json_path = pathlib.Path.home().joinpath(
            "Desktop", "translator-api-registry", "semmeddb", "smartapi.json")

with open(json_path, 'w') as file:
    json.dump(here, file)
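
Since the segments are pasted in by hand, it can be worth checking that every `$ref` in the operations list resolves to an operation that actually exists in the finished document. A minimal sketch, assuming refs have the `#/components/x-bte-kgs-operations/<name>` form generated above; `resolve_ref` and `find_dangling_refs` are hypothetical helpers, and the toy document is made up for illustration.

```python
def resolve_ref(doc, ref):
    """Follow a '#/a/b/c' style reference through nested dicts."""
    if not ref.startswith("#/"):
        raise ValueError(f"unsupported reference: {ref}")
    node = doc
    for part in ref[2:].split("/"):
        node = node[part]  # raises KeyError if the path does not exist
    return node

def find_dangling_refs(doc, refs):
    """Return the refs that do not resolve inside doc."""
    dangling = []
    for ref in refs:
        try:
            resolve_ref(doc, ref)
        except (KeyError, TypeError):
            dangling.append(ref)
    return dangling

# Toy document: only the forward operation was pasted in.
toy_doc = {"components": {"x-bte-kgs-operations": {"topp-TREATS-dsyn": [{}]}}}
toy_refs = [
    "#/components/x-bte-kgs-operations/topp-TREATS-dsyn",
    "#/components/x-bte-kgs-operations/topp-TREATS-dsyn-rev",
]
dangling = find_dangling_refs(toy_doc, toy_refs)
print(dangling)
```

With the real file, `doc` would be the loaded smartapi yaml (`here` above) and `refs` the `$ref` strings collected from wherever the operations list was pasted.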

## Notes on choices here

### Leaving

* leaving aapp as Polypeptide (it will be a chemical unless Protein is also in its SRI ID resolution, in which case it'll be a Gene too)
* leaving Disease mappings as-is, but some seem like they could be PathologicalAnatomicalStructure (acab, anab, cgab), PathologicalProcess (comd, patf), or PhenotypicFeature (inpo) instead
* left clnd (Clinical Drug) as Drug since it really seemed like a drug (dosage)

### Removing

* was mapped to Polypeptide:
    * amas: (Amino Acid Sequence) looks like protein "domains". Examples: Nuclear Export Signals, DNA Binding Domain
* everything mapped to Activity (8)
    * acty (Activity) Examples: War, Retirement, Euthanasia, Lifting
    * dora (Daily or Recreational Activity) Examples: Physical activity, Light Exercise, Relaxation
    * edac (Educational Activity) Examples: Training, Medical Residencies
    * gora (Governmental or Regulatory Activity) Examples: Health Care Reform, Advisory Committees
    * hlca (Health Care Activity) Examples: follow-up, Diagnosis
    * mcha (Machine Activity) Examples: Refrigeration, Neural Network Simulation
    * ocac (Occupational Activity) Examples: Promotion, Work, Mining
    * resa (Research Activity) Examples: Clinical Trials, research study
* everything mapped to Cohort (4)
    * aggp (Age Group) Examples: Infant, Child, Adult, Elderly
    * famg (Family Group) Examples: spouse, Sister, Foster Parent
    * podg (Patient or Disabled Group) Examples: Patients
    * prog (Professional or Occupational Group) Examples: Administrators, Employee, Author
* missing mappings to biolink-model (25):
    * amph (Amphibian) Examples: Toad, Bufo boreas, Anura
    * anim (Animal) Examples: Animals, Laboratory / Control Animal
    * arch (Archaeon) Examples: Archaea, halophilic bacteria, Thermoplasma acidophilum
    * bact (Bacterium) Examples: Escherichia coli, Salmonella, Borrelia burgdorferi
    * bdsu (Body Substance) too general. Examples: Urine, Milk, Lymph, Urine specimen
    * bdsy (Body System) too general. Examples: hypothalamic-pituitary-adrenal axis, Neurosecretory Systems
    * bird (Bird) Examples: Geese, Passeriformes, Raptors
    * blor (Body Location or Region) too general. Examples: Hepatic, Lysosomal, Cytoplasmic
    * bmod (Biomedical Occupation or Discipline) Examples: Medicine, Dentistry, Midwifery
    * bsoj (Body Space or Junction) too general. Examples: Compartments, Synapses, Cistern
    * chvf (Chemical Viewed Functionally) too general. Examples: inhibitors, antagonists, Agent
    * chvs (Chemical Viewed Structurally) too general. Examples: particle, solid state, vapor
    * euka (Eukaryote) Examples: Wasps, Protozoan parasite
    * ffas (Fully Formed Anatomical Structure) Examples: Carcass
    * fish (Fish) Examples: Eels, Fishes, Electric Fish
    * fngs (Fungus) Examples: Saccharomyces cerevisiae, Alternaria brassicicola, fungus
    * humn (Human) Examples: Family, Patients, Males
    * irda (Indicator, Reagent, or Diagnostic Aid) Examples: Fluorescent Probes, Chelating Agents
    * mamm (Mammal) Examples: Rattus norvegicus, Felis catus, Mus
    * ocdi (Occupation or Discipline) Examples: Science, Politics
    * plnt (Plant) Examples: Chrysanthemum x morifolium, Pollen, Oryza sativa
    * rept (Reptile) Examples: Snakes, Turtles, Reptiles
    * sbst (Substance) too general. Examples: Materials, Plastics, Photons, Substance
    * virs (Virus) Examples: Herpesvirus 4, Human / GB virus C / Herpesviridae
    * vtbt (Vertebrate) Examples: Vertebrates / Poikilotherm, NOS
* was mapped to AnatomicalEntity:
    * anst (Anatomical Structure) Examples: Entire fetus, Whole body, Cadaver
* was mapped to Behavior (4)
    * bhvr (Behavior) too general. Examples: Sexuality, Nest Building, Behavioral phenotype
    * inbe (Individual Behavior) too general. Examples: impulsivity, Habits, Performance
    * menp (Mental Process) too general. Examples: mind control, Learning, experience
    * socb (Social Behavior) too general. Examples: Communication, Gestures, Marriage
* was mapped to Phenomenon (6)
    * biof (Biologic Function) too general. Examples: dose-response relationship, Pharmacodynamics, Anabolism
    * eehu (Environmental Effect of Humans) too general. Examples: Sewage, Pollution, Smoke
    * hcpp (Human-caused Phenomenon or Process) too general. Now not in API. Examples: particle beam, Conferences, Victimization
    * lbtr (Laboratory or Test Result) too general. Examples: False Positive Reactions, Bone Density, Serum Calcium Level
    * npop (Natural Phenomenon or Process) too general. Examples: Floods, Fluorescence, Freezing
    * phpr (Phenomenon or Process) too general. Examples: Disasters, Acceleration, Feedback
* was mapped to Device (4)
    * bodm (Biomedical or Dental Material) too general. Examples: Pill, Gel, Talc, calcium phosphate
    * drdd (Drug Delivery Device) too general. Examples: Epipen, Skin Patch, Lilly cyanide antidote kit
    * medd (Medical Device) too general. Examples: Implants / Denture, Overlay / Silicone gel implant / Swab
    * resd (Research Device) too general. Examples: Study models, Slide
* was mapped to GrossAnatomicalStructure
    * emst (Embryonic Structure) Examples: Chick Embryo, Blastocyst structure, Placenta
    * tisu (Tissue) Examples: Tissue specimen, Blood, Human tissue, Mucous Membrane
* was mapped to PhysiologicalProcess
    * genf (Genetic Function) too general. Examples: Transcription, Genetic / Transcriptional Activation / Recombination, Genetic
* was mapped to ChemicalEntity
    * chem (Chemical) too general. Examples: Chemicals, Acids, Ligands, Ozone
* everything mapped to InformationContentEntity (11)
    * clas (Classification) too general. Examples: Research Diagnostic Criteria, Group C
    * cnce (Conceptual Entity) Now not in data. 1 matching subject. Examples: LNA
    * ftcn (Functional Concept) too general. Examples: Techniques, Intravenous Route of Drug Administration
    * grpa (Group Attribute) didn't see in API...
    * idcn (Idea or Concept) too general. Examples: Significant, subject, Data, Owner
    * lang (Language) too general. Now not in API. Examples: Nuosu Language, Chinook Jargon language
    * qlco (Qualitative Concept) too general. Examples: Effect, Associated with, Advanced phase
    * qnco (Quantitative Concept) too general. Examples: Calibration, occurrence, degrees Celsius
    * rnlw (Regulation or Law) too general. Examples: Medicare, Medicaid, regulatory
    * spco (Spatial Concept) too general. Examples: Structure, Longitudinal, Asymmetry
    * tmco (Temporal Concept) too general. Examples: New, /period, 24 Hours
* everything mapped to ClinicalAttribute
    * clna (Clinical Attribute) too general. Examples: response, Renin secretion, BAND PATTERN
* was mapped to Procedure
    * lbpr (Laboratory Procedure) too general. Examples: Western Blot, Radioimmunoassay, Staining method
    * mbrt (Molecular Biology Research Technique) too general. Examples: Polymerase Chain Reaction / Blotting, Northern
* everything mapped to SmallMolecule
    * elii (Element, Ion, or Isotope) too general. Examples: Atom, Aluminum, Superoxides
* everything mapped to BiologicalEntity
    * emod (Experimental Model of Disease) too general. Examples: Experimental Autoimmune Encephalomyelitis, Rodent Model
* was mapped to Protein
    * rcpt (Receptor) too general. Examples: Binding Sites / Receptors, Metabotropic Glutamate
* was mapped to Event
    * evnt (Event) too general. Now not in API. Examples: Stressful Events
* was mapped to DiseaseOrPhenotypicFeature
    * fndg (Finding) too general. Examples: spinal cord; lesion, Normal birth weight, Sedentary job
* was mapped to GeographicLocation
    * geoa (Geographic Area) too general. Examples: Country, Canada
* everything mapped to Agent
    * grup (Group) too general. Examples: Human, Individual
    * hcro (Health Care Related Organization) too general. Examples: Hospitals, Health System
    * orgt (Organization) too general. Examples: United Nations, Organization administrative structures
    * pros (Professional Society) too general. Examples: Professional Organizations, American Nurses' Association
    * shro (Self-help or Relief Organization) too general. Examples: Social Welfare, Support Groups
* everything mapped to Publication
    * inpr (Intellectual Product) too general. Examples: Methodology, Study models
* everything mapped to PhysicalEntity
    * mnob (Manufactured Object) too general. Examples: Glass, Manuals
* everything mapped to MolecularEntity
    * mosq (Molecular Sequence) too general. Now not in API. Examples: Genetic Code
* was mapped to NucleicAcidEntity
    * nusq (Nucleotide Sequence) too general. Examples: Base Sequence, DNA Sequence, 22q11
* was mapped to OrganismAttribute
    * orga (Organism Attribute) too general. Examples: Ability, Body Composition
* was mapped to PopulationOfIndividualOrganisms
    * popg (Population Group) too general. Examples: Male population group, Woman