# UMLS Import
This notebook loads the English-language UMLS concepts into a graph as nodes, connects them with the relationships defined by the UMLS sources, and outlines how to update the graph when a new UMLS version is released.
  
1. Load all UMLS concepts
2. Load all UMLS relationships
3. Connect UMLS concepts to MIMIC-III entities
4. Update concepts and relationships from a new UMLS version

## 1. Load all UMLS concepts

In [3]:
import pandas as pd
import dask.dataframe as dd
from progressbar import ProgressBar
import subprocess

In [19]:
import getpass
password = getpass.getpass("\nPlease enter the Neo4j database password to continue \n")


Please enter the Neo4j database password to continue 
 ···············


In [20]:
from neo4j import GraphDatabase
driver=GraphDatabase.driver(uri="bolt://localhost:7687", auth=('neo4j',password))
session=driver.session()

In [2]:
# Load MRCONSO.RRF into a dataframe. Reference for MRCONSO column headers and descriptions: https://www.ncbi.nlm.nih.gov/books/NBK9685/table/ch03.T.concept_names_and_sources_file_mr/?report=objectonly
path = '/media/sata_1TB_internal/umls-2020AB-full/2020AB/META/'
mrconso = pd.read_csv(path + 'MRCONSO.RRF', sep='|', usecols=[0,1,2,4,6,7,11,12,14,15], header=None, encoding='utf-8')
mrconso.columns = ['CUI', 'LAT', 'TS', 'STT', 'ISPREF', 'AUI', 'SAB', 'TTY', 'STR', 'SRL']
mrconso.head()

Unnamed: 0,CUI,LAT,TS,STT,ISPREF,AUI,SAB,TTY,STR,SRL
0,C0000005,ENG,P,PF,Y,A26634265,MSH,PEP,(131)I-Macroaggregated Albumin,0
1,C0000005,ENG,S,PF,Y,A26634266,MSH,ET,(131)I-MAA,0
2,C0000005,FRE,P,PF,Y,A13433185,MSHFRE,PEP,Macroagrégats d'albumine marquée à l'iode 131,3
3,C0000005,FRE,S,PF,Y,A27488794,MSHFRE,ET,MAA-I 131,3
4,C0000005,FRE,S,PF,Y,A27614225,MSHFRE,ET,Macroagrégats d'albumine humaine marquée à l'i...,3


In [3]:
mrconso = mrconso[mrconso['LAT'] == 'ENG']

In [4]:
mrsty = pd.read_csv(path + 'MRSTY.RRF', usecols=[0,3], sep='|', header=None)
mrsty.columns = ['CUI', 'STY']
mrsty.head()

Unnamed: 0,CUI,STY
0,C0000005,"Amino Acid, Peptide, or Protein"
1,C0000005,Pharmacologic Substance
2,C0000005,"Indicator, Reagent, or Diagnostic Aid"
3,C0000039,Organic Chemical
4,C0000039,Pharmacologic Substance


In [5]:
# Merge semantic type ("STY") into the mrconso dataframe
mrconso = pd.merge(mrconso, mrsty, on=['CUI'], how='outer')
mrconso.head()

Unnamed: 0,CUI,LAT,TS,STT,ISPREF,AUI,SAB,TTY,STR,SRL,STY
0,C0000005,ENG,P,PF,Y,A26634265,MSH,PEP,(131)I-Macroaggregated Albumin,0.0,"Amino Acid, Peptide, or Protein"
1,C0000005,ENG,P,PF,Y,A26634265,MSH,PEP,(131)I-Macroaggregated Albumin,0.0,Pharmacologic Substance
2,C0000005,ENG,P,PF,Y,A26634265,MSH,PEP,(131)I-Macroaggregated Albumin,0.0,"Indicator, Reagent, or Diagnostic Aid"
3,C0000005,ENG,S,PF,Y,A26634266,MSH,ET,(131)I-MAA,0.0,"Amino Acid, Peptide, or Protein"
4,C0000005,ENG,S,PF,Y,A26634266,MSH,ET,(131)I-MAA,0.0,Pharmacologic Substance
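
The `SRL` column prints as `0.0` after the merge because `how='outer'` also keeps `MRSTY` rows whose CUI has no match in the filtered `mrconso`. Those rows are filled with `NaN`, which upcasts integer columns to float. A minimal sketch on made-up toy frames (the values are illustrative, not taken from UMLS):

```python
import pandas as pd

# Toy stand-ins for mrconso and mrsty (hypothetical values)
conso = pd.DataFrame({"CUI": ["C1"], "STR": ["term"], "SRL": [0]})
sty = pd.DataFrame({"CUI": ["C1", "C2"], "STY": ["Enzyme", "Organic Chemical"]})

# Outer merge keeps C2 even though it has no row in conso
outer = pd.merge(conso, sty, on=["CUI"], how="outer")

# Left merge keeps only the CUIs already present in conso
left = pd.merge(conso, sty, on=["CUI"], how="left")

print(outer)
print(outer["SRL"].dtype, left["SRL"].dtype)
```

The unmatched rows carry a missing `STR`, so the `dropna` on `STR` in the next cell removes them again.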


In [6]:
# Delete rows that cause problems with Neo4j csv import
mrconso.dropna(inplace=True, subset=['STR'])

bad_list = [
    'GELATIN,ABSORB PROSTATECTOMY CONE,SZ 18', 
    'Oryctolagus cuniculus f. domestica ""CC',
    'Oryctolagus cuniculus f. domestica',
    '""ff',
    'Morning after pill',
    '\\\\'
]

for term in bad_list:
    mrconso.drop(mrconso[mrconso['STR'].str.contains(term)].index, inplace=True)
    print(term)

GELATIN,ABSORB PROSTATECTOMY CONE,SZ 18
Oryctolagus cuniculus f. domestica ""CC
Oryctolagus cuniculus f. domestica
""ff
Morning after pill
\\
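
Note that `Series.str.contains` interprets its pattern as a regular expression by default, so the `.` in the terms above matches any character and `'\\\\'` matches a single backslash. A small sketch on a made-up Series showing how the default differs from a literal match with `regex=False`:

```python
import pandas as pd

# Toy strings (hypothetical, not taken from MRCONSO)
terms = pd.Series(["f. domestica", "fx domestica", "a\\b"])

# Default behaviour: the pattern is a regular expression, '.' matches any character
regex_hits = terms.str.contains("f. domestica")

# regex=False: the pattern is matched character for character
literal_hits = terms.str.contains("f. domestica", regex=False)

# As a regular expression, two backslashes match one literal backslash
backslash_hits = terms.str.contains("\\\\")

print(regex_hits.tolist())
print(literal_hits.tolist())
print(backslash_hits.tolist())
```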


In [7]:
mrconso.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 11679703 entries, 0 to 11679763
Data columns (total 11 columns):
 #   Column  Dtype  
---  ------  -----  
 0   CUI     object 
 1   LAT     object 
 2   TS      object 
 3   STT     object 
 4   ISPREF  object 
 5   AUI     object 
 6   SAB     object 
 7   TTY     object 
 8   STR     object 
 9   SRL     float64
 10  STY     object 
dtypes: float64(1), object(10)
memory usage: 1.0+ GB


In [8]:
# Collect semantic types into a list for each AUI using groupby
# ['CUI','LAT','TS','STT','ISPREF','AUI','SAB','TTY','STR','SRL']
sty_groupby = mrconso.groupby(['CUI','LAT','TS','STT','ISPREF','AUI','SAB','TTY','STR','SRL'])['STY']
# sty_groupby = mrconso.groupby(['CUI','LAT','AUI','STR'])['STY']
mrconso = sty_groupby.apply(list).reset_index()
mrconso.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10594371 entries, 0 to 10594370
Data columns (total 11 columns):
 #   Column  Dtype  
---  ------  -----  
 0   CUI     object 
 1   LAT     object 
 2   TS      object 
 3   STT     object 
 4   ISPREF  object 
 5   AUI     object 
 6   SAB     object 
 7   TTY     object 
 8   STR     object 
 9   SRL     float64
 10  STY     object 
dtypes: float64(1), object(10)
memory usage: 889.1+ MB
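
The row count drops because every atom that was repeated once per semantic type is collapsed into a single row holding a list. A minimal sketch of the same `groupby(...).apply(list)` pattern on a toy frame (column values are made up for illustration):

```python
import pandas as pd

# One atom (A1) with two semantic types, one atom (A2) with a single type
toy = pd.DataFrame({
    "CUI": ["C1", "C1", "C2"],
    "AUI": ["A1", "A1", "A2"],
    "STY": ["Organic Chemical", "Pharmacologic Substance", "Enzyme"],
})

# Group on the identifying columns and gather the STY values into a list
collapsed = toy.groupby(["CUI", "AUI"])["STY"].apply(list).reset_index()
print(collapsed)
```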


In [9]:
mrconso

Unnamed: 0,CUI,LAT,TS,STT,ISPREF,AUI,SAB,TTY,STR,SRL,STY
0,C0000005,ENG,P,PF,Y,A26634265,MSH,PEP,(131)I-Macroaggregated Albumin,0.0,"[Amino Acid, Peptide, or Protein, Pharmacologi..."
1,C0000005,ENG,S,PF,Y,A26634266,MSH,ET,(131)I-MAA,0.0,"[Amino Acid, Peptide, or Protein, Pharmacologi..."
2,C0000039,ENG,P,PF,N,A28315139,RXNORM,IN,"1,2-dipalmitoylphosphatidylcholine",0.0,"[Organic Chemical, Pharmacologic Substance]"
3,C0000039,ENG,P,PF,Y,A28572604,MTH,PN,"1,2-dipalmitoylphosphatidylcholine",0.0,"[Organic Chemical, Pharmacologic Substance]"
4,C0000039,ENG,P,VC,Y,A0016515,MSH,MH,"1,2-Dipalmitoylphosphatidylcholine",0.0,"[Organic Chemical, Pharmacologic Substance]"
...,...,...,...,...,...,...,...,...,...,...,...
10594366,C5399740,ENG,P,PF,Y,A32339932,MVX,PT,Bavarian Nordic A/S,0.0,[Health Care Related Organization]
10594367,C5399741,ENG,P,PF,Y,A32340032,SRC,VPT,"Medication Reference Terminology, 2020_09_08",0.0,[Intellectual Product]
10594368,C5399741,ENG,S,PF,Y,A32340033,SRC,VAB,MED-RT_2020_09_08,0.0,[Intellectual Product]
10594369,C5399742,ENG,P,PF,N,A32340042,MED-RT,FN,Inactive Preparations by FDA Established Pharm...,0.0,[Pharmacologic Substance]


In [10]:
# Select which AUI to mark as preferred for each CUI. The metathesaurus ranks term preference based on source vocabulary and term type in the MRRANK.RRF file

# Load MRRANK
mrrank = pd.read_csv(path + 'MRRANK.RRF', usecols=[0,1,2], sep='|', header=None)
mrrank.columns = ['Rank', 'SAB', 'TTY']

In [11]:
# Merge the "Rank" column from mrrank into mrconso
mrconso = pd.merge(mrconso, mrrank, on=['SAB', 'TTY'])
mrconso.head()

Unnamed: 0,CUI,LAT,TS,STT,ISPREF,AUI,SAB,TTY,STR,SRL,STY,Rank
0,C0000005,ENG,P,PF,Y,A26634265,MSH,PEP,(131)I-Macroaggregated Albumin,0.0,"[Amino Acid, Peptide, or Protein, Pharmacologi...",712
1,C0000074,ENG,P,PF,Y,A26606894,MSH,PEP,1-Alkyl-2-Acylphosphatidates,0.0,[Organic Chemical],712
2,C0000132,ENG,P,PF,Y,A26665454,MSH,PEP,15-Ketosteryl Oleate Hydrolase,0.0,"[Amino Acid, Peptide, or Protein, Enzyme]",712
3,C0000137,ENG,P,PF,Y,A26650280,MSH,PEP,15S RNA,0.0,"[Nucleic Acid, Nucleoside, or Nucleotide, Biol...",712
4,C0000151,ENG,P,PF,Y,A26647507,MSH,PEP,17 beta-Hydroxy-5 beta-Androstan-3-One,0.0,"[Organic Chemical, Pharmacologic Substance]",712


In [12]:
# Create a boolean mask to select all rows where the Rank is the maximum for each CUI
idx = mrconso.groupby('CUI')['Rank'].transform('max') == mrconso['Rank']

In [13]:
# Start by selecting the rows that hold the maximum Rank for each CUI; ties are narrowed down in the next cells
cui_max_rank = mrconso[idx]

In [14]:
# Next, keep only the rows where Term Status (TS) is preferred (P)
cui_max_rank_p = cui_max_rank[cui_max_rank['TS'] == 'P']

In [15]:
# Then keep only the rows whose atom is marked as preferred (ISPREF is Y)
cui_maxrank_TSp_ISPREFy = cui_max_rank_p[cui_max_rank_p['ISPREF'] == 'Y']

In [16]:
# Finally, keep only the rows whose String Type (STT) is the preferred form (PF)
cui_maxrank_TSp_ISPREFy_STTpf = cui_maxrank_TSp_ISPREFy[cui_maxrank_TSp_ISPREFy['STT'] == 'PF']

In [17]:
cui_maxrank_TSp_ISPREFy_STTpf['CUI'].value_counts()

C3631225    1
C5318096    1
C1541869    1
C1995266    1
C5295163    1
           ..
C1857556    1
C5186974    1
C2698480    1
C3490763    1
C0948573    1
Name: CUI, Length: 4362834, dtype: int64
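
The selection above can be condensed into one boolean mask. A minimal sketch on a toy frame with made-up ranks, assuming the same column names as `mrconso`:

```python
import pandas as pd

# C1 has two atoms tied at the top rank; only A1 also has TS == 'P'
toy = pd.DataFrame({
    "CUI":    ["C1", "C1", "C1", "C2"],
    "AUI":    ["A1", "A2", "A3", "A4"],
    "TS":     ["P", "S", "P", "P"],
    "STT":    ["PF", "PF", "PF", "PF"],
    "ISPREF": ["Y", "Y", "Y", "Y"],
    "Rank":   [712, 712, 400, 300],
})

# True where a row holds the highest Rank within its CUI
is_max_rank = toy.groupby("CUI")["Rank"].transform("max") == toy["Rank"]

# Break ties with the TS, ISPREF and STT filters
is_preferred = (
    is_max_rank
    & (toy["TS"] == "P")
    & (toy["ISPREF"] == "Y")
    & (toy["STT"] == "PF")
)

preferred = toy[is_preferred]
print(preferred["AUI"].tolist())
```

A CUI whose top-ranked rows all fail one of the filters ends up with no preferred term under this scheme. In the `value_counts()` output above, every listed CUI appears exactly once.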

In [18]:
# Collect the row labels of the preferred terms
preferred_term_index = cui_maxrank_TSp_ISPREFy_STTpf.index.tolist()

In [19]:
mrconso.loc[preferred_term_index,['CUI_PREF_TERM']] = 'true'
mrconso

Unnamed: 0,CUI,LAT,TS,STT,ISPREF,AUI,SAB,TTY,STR,SRL,STY,Rank,CUI_PREF_TERM
0,C0000005,ENG,P,PF,Y,A26634265,MSH,PEP,(131)I-Macroaggregated Albumin,0.0,"[Amino Acid, Peptide, or Protein, Pharmacologi...",712,true
1,C0000074,ENG,P,PF,Y,A26606894,MSH,PEP,1-Alkyl-2-Acylphosphatidates,0.0,[Organic Chemical],712,true
2,C0000132,ENG,P,PF,Y,A26665454,MSH,PEP,15-Ketosteryl Oleate Hydrolase,0.0,"[Amino Acid, Peptide, or Protein, Enzyme]",712,true
3,C0000137,ENG,P,PF,Y,A26650280,MSH,PEP,15S RNA,0.0,"[Nucleic Acid, Nucleoside, or Nucleotide, Biol...",712,true
4,C0000151,ENG,P,PF,Y,A26647507,MSH,PEP,17 beta-Hydroxy-5 beta-Androstan-3-One,0.0,"[Organic Chemical, Pharmacologic Substance]",712,true
...,...,...,...,...,...,...,...,...,...,...,...,...,...
10594366,C5234703,ENG,S,PF,Y,A31564837,LNC_SPECIAL_USE,OSN,SARS-CoV-2 E gene XXX Ql NAA+probe,0.0,[Clinical Attribute],415,
10594367,C5234704,ENG,S,PF,Y,A31564826,LNC_SPECIAL_USE,OSN,SARS-CoV-2 N gene XXX Ql NAA+probe,0.0,[Clinical Attribute],415,
10594368,C5389880,ENG,P,PF,Y,A32270781,MEDCIN,XM,MEDCIN3_2020_07_16 to SNOMEDCT_US_2020_03_01 M...,3.0,[Intellectual Product],574,true
10594369,C5399694,ENG,P,PF,Y,A32339509,SNOMEDCT_US,XM,SNOMEDCT_US_2020_09_01 to ICD10CM_2021 Mappings,9.0,[Intellectual Product],696,true


In [20]:
mrconso[mrconso.CUI_PREF_TERM == 'true']

Unnamed: 0,CUI,LAT,TS,STT,ISPREF,AUI,SAB,TTY,STR,SRL,STY,Rank,CUI_PREF_TERM
0,C0000005,ENG,P,PF,Y,A26634265,MSH,PEP,(131)I-Macroaggregated Albumin,0.0,"[Amino Acid, Peptide, or Protein, Pharmacologi...",712,true
1,C0000074,ENG,P,PF,Y,A26606894,MSH,PEP,1-Alkyl-2-Acylphosphatidates,0.0,[Organic Chemical],712,true
2,C0000132,ENG,P,PF,Y,A26665454,MSH,PEP,15-Ketosteryl Oleate Hydrolase,0.0,"[Amino Acid, Peptide, or Protein, Enzyme]",712,true
3,C0000137,ENG,P,PF,Y,A26650280,MSH,PEP,15S RNA,0.0,"[Nucleic Acid, Nucleoside, or Nucleotide, Biol...",712,true
4,C0000151,ENG,P,PF,Y,A26647507,MSH,PEP,17 beta-Hydroxy-5 beta-Androstan-3-One,0.0,"[Organic Chemical, Pharmacologic Substance]",712,true
...,...,...,...,...,...,...,...,...,...,...,...,...,...
10594310,C5234703,ENG,P,PF,Y,A31564787,LNC_SPECIAL_USE,LN,SARS coronavirus 2 E gene:PrThr:Pt:XXX:Ord:Pro...,0.0,[Clinical Attribute],418,true
10594311,C5234704,ENG,P,PF,Y,A31564823,LNC_SPECIAL_USE,LN,SARS coronavirus 2 N gene:PrThr:Pt:XXX:Ord:Pro...,0.0,[Clinical Attribute],418,true
10594368,C5389880,ENG,P,PF,Y,A32270781,MEDCIN,XM,MEDCIN3_2020_07_16 to SNOMEDCT_US_2020_03_01 M...,3.0,[Intellectual Product],574,true
10594369,C5399694,ENG,P,PF,Y,A32339509,SNOMEDCT_US,XM,SNOMEDCT_US_2020_09_01 to ICD10CM_2021 Mappings,9.0,[Intellectual Product],696,true


In [None]:
# Set Source Restriction Level (SRL) as an integer
mrconso.SRL = mrconso.SRL.astype(dtype=int)

In [23]:
# Leave behind the columns we don't need to import into the graph
mrconso = mrconso[['CUI', 'AUI', 'STR', 'STY', 'CUI_PREF_TERM']]
mrconso

Unnamed: 0,CUI,AUI,STR,STY,CUI_PREF_TERM
0,C0000005,A26634265,(131)I-Macroaggregated Albumin,"[Amino Acid, Peptide, or Protein, Pharmacologi...",true
1,C0000074,A26606894,1-Alkyl-2-Acylphosphatidates,[Organic Chemical],true
2,C0000132,A26665454,15-Ketosteryl Oleate Hydrolase,"[Amino Acid, Peptide, or Protein, Enzyme]",true
3,C0000137,A26650280,15S RNA,"[Nucleic Acid, Nucleoside, or Nucleotide, Biol...",true
4,C0000151,A26647507,17 beta-Hydroxy-5 beta-Androstan-3-One,"[Organic Chemical, Pharmacologic Substance]",true
...,...,...,...,...,...
10594366,C5234703,A31564837,SARS-CoV-2 E gene XXX Ql NAA+probe,[Clinical Attribute],
10594367,C5234704,A31564826,SARS-CoV-2 N gene XXX Ql NAA+probe,[Clinical Attribute],
10594368,C5389880,A32270781,MEDCIN3_2020_07_16 to SNOMEDCT_US_2020_03_01 M...,[Intellectual Product],true
10594369,C5399694,A32339509,SNOMEDCT_US_2020_09_01 to ICD10CM_2021 Mappings,[Intellectual Product],true


In [24]:
# Write the dataframe out to CSV
# mrconso.to_csv('MRCONSO_for_import.csv', sep='|', index=False, encoding='utf-8')
mrconso.to_csv('MRCONSO_for_import.csv', index=False, encoding='utf-8')

Move `MRCONSO_for_import.csv` into the database's import folder before running the next cell.
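
The move can also be scripted from the notebook. A minimal sketch, assuming the import folder is reachable from the machine running the notebook; `copy_to_import_dir` is a hypothetical helper, and temporary directories stand in for the real locations, which depend on the Neo4j installation:

```python
import shutil
import tempfile
from pathlib import Path


def copy_to_import_dir(csv_path, import_dir):
    """Copy csv_path into import_dir and return the destination path."""
    import_dir = Path(import_dir)
    if not import_dir.is_dir():
        raise FileNotFoundError(f"Import folder not found: {import_dir}")
    return Path(shutil.copy2(csv_path, import_dir))


# Demonstration with temporary directories standing in for the real folders
with tempfile.TemporaryDirectory() as work, tempfile.TemporaryDirectory() as import_dir:
    src = Path(work) / "MRCONSO_for_import.csv"
    src.write_text("CUI,AUI,STR\n", encoding="utf-8")
    dest = copy_to_import_dir(src, import_dir)
    copied_ok = dest.exists() and dest.name == "MRCONSO_for_import.csv"

print(copied_ok)
```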

In [4]:
# Create the Concept nodes from UMLS
command = '''
USING PERIODIC COMMIT 100000 LOAD CSV WITH HEADERS FROM "file:///MRCONSO_for_import.csv" AS COLUMN
CREATE (:Concept {term:COLUMN.STR, semantic_type:COLUMN.STY, cui_pref_term:COLUMN.CUI_PREF_TERM, source: 'UMLS', version: '2020AB', cui:COLUMN.CUI, aui:COLUMN.AUI})
'''
session.run(command)

<neo4j.work.result.Result at 0x7f4ef8cc8250>

In [9]:
# Set an index for aui
command = '''
CREATE BTREE INDEX aui_index FOR (a:Concept) ON (a.aui)
'''
session.run(command)

<neo4j.work.result.Result at 0x7fd2c95e3310>

In [10]:
# Set an index for cui
command = '''
CREATE BTREE INDEX cui_index FOR (c:Concept) ON (c.cui)
'''
session.run(command)

<neo4j.work.result.Result at 0x7fd2c95f21f0>

## 2. Load all UMLS relationships

In [5]:
# Load MRREL.RRF into a dataframe 
path = '/media/sata_1TB_internal/umls-2020AB-full/2020AB/META/'
mrrel = pd.read_csv(path + 'MRREL.RRF', sep='|', usecols=[1,5,7], header=None, encoding='utf-8') 
mrrel.columns = ['AUI1', 'AUI2', 'RELA']
# mrrel.columns = ['CUI1', 'AUI1', 'STYPE1', 'REL', 'CUI2', 'AUI2', 'STYPE2', 'RELA', 'RUI', 'SRUI', 'SAB', 'SL', 'RG', 'DIR', 'SUPPRESS', 'CVF', '']
mrrel.dropna(inplace = True)
mrrel.drop_duplicates(inplace=True)
mrrel.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 37266704 entries, 2 to 87766309
Data columns (total 3 columns):
 #   Column  Dtype 
---  ------  ----- 
 0   AUI1    object
 1   AUI2    object
 2   RELA    object
dtypes: object(3)
memory usage: 1.1+ GB


In [6]:
# Identify redundant relationship types which exist as an artifact of the former relational database schema, which 
# required separate relationships in order to point backward and forward between concepts
mr_rela = pd.read_csv('MRREL_2020AB_RELA.csv', encoding='utf-8')
mr_rela.head()

Unnamed: 0,RELA,Description
0,abnormal_cell_affected_by_chemical_or_drug,abnormal cell affected by chemical or drug
1,abnormality_associated_with_allele,abnormality associated with allele
2,access_device_used_by,Access device used by
3,access_instrument_of,Access instrument of
4,access_of,Access of


In [7]:
mr_rela[['blank_has','has_rel']] = mr_rela['RELA'].str.split('^has_', expand=True)

In [8]:
mr_rela[mr_rela.has_rel.notnull()]

Unnamed: 0,RELA,Description,blank_has,has_rel
296,has_access_instrument,Has access instrument,,access_instrument
297,has_access,Has access,,access
298,has_action_guidance,Has action guidance,,action_guidance
299,has_active_ingredient,Has active ingredient,,active_ingredient
300,has_active_metabolites,Has active metabolites,,active_metabolites
...,...,...,...,...
579,has_unit_of_presentation,Has unit of presentation,,unit_of_presentation
580,has_units,Has units,,units
581,has_venous_drainage,Has venous drainage,,venous_drainage
582,has_version,Has version,,version


In [9]:
mr_rela[['rel_of','blank_of']] = mr_rela['RELA'].str.split('_of$', expand=True)

In [10]:
has_things = mr_rela[mr_rela.has_rel.notnull()].has_rel.to_list()

In [11]:
# Find all the <relationship>_of types that mirror a has_<relationship> type
mr_rela['duplicate_term'] = mr_rela.rel_of.isin(has_things)
redundant_terms = mr_rela[mr_rela.duplicate_term].RELA.to_list()
print(redundant_terms[-10:])

['time_aspect_of', 'time_modifier_of', 'tradename_of', 'translation_of', 'transliterated_form_of', 'tributary_of', 'unit_of_presentation_of', 'units_of', 'venous_drainage_of', 'version_of']
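
The idea behind the cells above is that `has_<stem>` and `<stem>_of` describe the same edge in opposite directions, so one of the two can be dropped. A minimal pure-Python sketch of that matching on a short hand-picked list (not the full RELA table):

```python
# A few relationship names for illustration
rela = ["has_version", "version_of", "has_units", "units_of", "part_of", "mapped_to"]

# Stems of every 'has_<stem>' relationship
has_stems = {r[len("has_"):] for r in rela if r.startswith("has_")}

# '<stem>_of' relationships whose stem also appears as 'has_<stem>'
mirrored = [
    r for r in rela
    if r.endswith("_of") and r[: -len("_of")] in has_stems
]

print(mirrored)
```

Here `part_of` is kept because the list contains no `has_part` counterpart.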


In [12]:
# Collect every relationship type whose name contains 'inverse' or 'by' so it can be removed
inverse_terms = mr_rela[mr_rela.RELA.str.contains('inverse')].RELA.to_list()
by_terms = mr_rela[mr_rela.RELA.str.contains('by')].RELA.to_list()

# Manually inspected a list of relationships that contain '_of_'. Most, but not all, of these are redundant. 
contains_of_ = ['activity_of_allele', 'allele_plays_role_in_metabolism_of_chemical_or_drug', 'approach_of_excluded', 'approach_of_possibly_included', 'associated_finding_of_excluded', 'associated_finding_of_possibly_included', 'associated_procedure_of_excluded', 'associated_procedure_of_possibly_included', 'associated_with_malfunction_of_gene_product', 'basis_of_strength_substance_of', 'biological_process_is_part_of_process', 'chemical_or_drug_is_product_of_biological_process', 'chromosomal_location_of_allele', 'chromosomal_location_of_wild-type_gene', 'component_of_excluded', 'component_of_possibly_included', 'contraindicated_mechanism_of_action_of', 'count_of_active_ingredient_of', 'count_of_base_of_active_ingredient_of', 'gene_involved_in_pathogenesis_of_disease', 'gene_product_variant_of_gene_product', 'is_abnormal_cell_of_disease', 'is_abnormality_of_gene_product', 'is_abnormality_of_gene', 'is_associated_anatomy_of_gene_product', 'is_biochemical_function_of_gene_product', 'is_chemical_classification_of_gene_product', 'is_chromosomal_location_of_gene', 'is_component_of_chemotherapy_regimen', 'is_cytogenetic_abnormality_of_disease', 'is_finding_of_disease', 'is_grade_of_disease', 'is_location_of_anatomic_structure', 'is_location_of_biological_process', 'is_mechanism_of_action_of_chemical_or_drug', 'is_metastatic_anatomic_site_of_disease', 'is_molecular_abnormality_of_disease', 'is_normal_cell_origin_of_disease', 'is_normal_tissue_origin_of_disease', 'is_not_abnormal_cell_of_disease', 'is_not_cytogenetic_abnormality_of_disease', 'is_not_finding_of_disease', 'is_not_metastatic_anatomic_site_of_disease', 'is_not_molecular_abnormality_of_disease', 'is_not_normal_cell_origin_of_disease', 'is_not_normal_tissue_origin_of_disease', 'is_not_primary_anatomic_site_of_disease', 'is_organism_source_of_gene_product', 'is_physical_location_of_gene', 'is_physiologic_effect_of_chemical_or_drug', 'is_presence_of_lateral_location', 'is_primary_anatomic_site_of_disease', 
'is_property_or_attribute_of_eo_disease', 'is_stage_of_disease', 'is_structural_domain_or_motif_of_gene_product', 'locale_of_excluded', 'may_be_abnormal_cell_of_disease', 'may_be_associated_disease_of_disease', 'may_be_cytogenetic_abnormality_of_disease', 'may_be_finding_of_disease', 'may_be_molecular_abnormality_of_disease', 'may_be_normal_cell_origin_of_disease', 'mechanism_of_action_of', 'method_of_excluded', 'method_of_possibly_included', 'panel_element_of_possibly_included', 'pathology_of_excluded', 'pathology_of_possibly_included', 'patient_type_of_excluded', 'patient_type_of_possibly_included', 'pharmaceutical_state_of_matter_of', 'procedure_device_of_excluded', 'procedure_device_of_possibly_included', 'procedure_site_of_excluded', 'procedure_site_of_possibly_included', 'route_of_administration_of_excluded', 'route_of_administration_of_possibly_included', 'route_of_administration_of', 'specimen_of_excluded', 'state_of_matter_of', 'subject_of_information_of', 'surgical_extent_of_excluded', 'surgical_extent_of_possibly_included', 'tissue_is_expression_site_of_gene_product', 'unit_of_presentation_of']

# Manually inspected a list of relationships that contain 'plays'. Most, but not all, of these are redundant.
plays_terms = ['allele_plays_altered_role_in_process', 'allele_plays_role_in_metabolism_of_chemical_or_drug', 'chemical_or_drug_plays_role_in_biological_process', 'gene_plays_role_in_process', 'gene_product_plays_role_in_biological_process']

# Manually inspected a list of relationships that start with 'may_be_'. All of these are redundant.
may_be_terms = ['may_be_abnormal_cell_of_disease', 'may_be_associated_disease_of_disease', 'may_be_cytogenetic_abnormality_of_disease', 'may_be_diagnosed_by', 'may_be_finding_of_disease', 'may_be_molecular_abnormality_of_disease', 'may_be_normal_cell_origin_of_disease', 'may_be_prevented_by', 'may_be_qualified_by', 'may_be_treated_by']

# Manually inspected a list of relationships that contain 'involved_in'. All of these are redundant.
involved_in_terms = ['chromosome_involved_in_cytogenetic_abnormality', 'gene_involved_in_molecular_abnormality', 'gene_involved_in_pathogenesis_of_disease']

# Additional redundant terms discovered while exploring the graph
discovered = ['anatomic_structure_is_physical_part_of', 'nichd_parent_of', 'occurs_after', 'mth_xml_form_of', 'mth_plain_text_form_of', 'mth_expanded_form_of', 'mth_british_form_of', 'homonym_of', 'regimen_has_accepted_use_for_disease', 'biological_process_has_initiator_chemical_or_drug','biological_process_results_from_biological_process', 'biological_process_has_initiator_process', 'cause_of', 'has_adherent', 'cdrh_parent_of', 'mapped_from', 'contraindicated_with_disease', 'is_associated_anatomic_site_of', 'subset_includes_concept', 'classified_as', 'has_permuted_term', 'same_as']

redundant_terms = redundant_terms + inverse_terms + by_terms + contains_of_ + plays_terms + may_be_terms + involved_in_terms + discovered
print(redundant_terms[-10:])

['cause_of', 'has_adherent', 'cdrh_parent_of', 'mapped_from', 'contraindicated_with_disease', 'is_associated_anatomic_site_of', 'subset_includes_concept', 'classified_as', 'has_permuted_term', 'same_as']


In [13]:
# Also remove terms which cause extraneous connections between concepts, such as terminologies which connect
# to many concepts in a way that is not meaningful
extraneous = ['concept_in_subset', 'has_cdrh_parent', 'has_answer', 'answer_to']
redundant_terms = redundant_terms + extraneous # For convenience, we'll just add them to the redundant_terms list

In [14]:
print(len(redundant_terms))
redundant_terms = list(set(redundant_terms))
print(len(redundant_terms))

404
382


In [15]:
# Find and drop all the rows in MRREL whose RELA is in redundant_terms
mrrel['redundant'] = mrrel.RELA.isin(redundant_terms)
redundant_relationships = mrrel[mrrel.redundant == True].index
mrrel.drop(redundant_relationships, inplace=True)
mrrel.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 18898092 entries, 44 to 87766309
Data columns (total 4 columns):
 #   Column     Dtype 
---  ------     ----- 
 0   AUI1       object
 1   AUI2       object
 2   RELA       object
 3   redundant  bool  
dtypes: bool(1), object(3)
memory usage: 594.7+ MB


In [16]:
mrrel.RELA = mrrel.RELA.str.upper()
mrrel[['AUI1', 'AUI2', 'RELA']].head()

Unnamed: 0,AUI1,AUI2,RELA
44,A0016515,A0137399,MAPPED_TO
45,A0016515,A0376033,MAPPED_TO
46,A0016515,A0683149,MAPPED_TO
47,A0016515,A1316792,MAPPED_TO
48,A0016515,A1321548,MAPPED_TO


In [17]:
mrrel[['AUI1', 'AUI2', 'RELA']].to_csv('MRRELA_for_import.csv', index=False, encoding='utf-8')

Move `MRRELA_for_import.csv` into the database's import folder.

In [21]:
# Import relationships from MRRELA_for_import.csv into the graph. Note that the csv file must be placed in the
# database's import folder first
command = '''
USING PERIODIC COMMIT 100000 LOAD CSV WITH HEADERS FROM "file:///MRRELA_for_import.csv" AS COLUMN
MATCH (c1:Concept {aui:COLUMN.AUI1})
MATCH (c2:Concept {aui:COLUMN.AUI2})
CALL apoc.create.relationship(c2, COLUMN.RELA, {source:'UMLS', version:'2020AB'}, c1) YIELD rel
REMOVE rel.noOp;
'''
session.run(command)

<neo4j.work.result.Result at 0x7f4ef8cdd850>
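
Because the import uses `MATCH` for both endpoints, any row whose `AUI1` or `AUI2` is not in the graph is skipped without an error. A quick check before importing can show how many rows would be affected. A minimal sketch on toy frames; the values and the `known` and `importable` names are made up for illustration:

```python
import pandas as pd

# Toy stand-ins for the exported concept and relationship files
concepts = pd.DataFrame({"AUI": ["A1", "A2", "A3"]})
rels = pd.DataFrame({
    "AUI1": ["A1", "A2", "A9"],
    "AUI2": ["A2", "A3", "A1"],
    "RELA": ["MAPPED_TO", "ISA", "MAPPED_TO"],
})

# A relationship is importable only if both endpoints exist as Concept nodes
known = set(concepts["AUI"])
importable = rels["AUI1"].isin(known) & rels["AUI2"].isin(known)

print(f"{importable.sum()} of {len(rels)} rows have both endpoints")
print(rels[~importable])
```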

I considered adding the broader and narrower relationships from MRREL, but they degraded pathfinding by creating many non-meaningful connections. The code for this task is commented out below and kept in case it is needed later.

In [8]:
# # Add broader and narrower relationships from MRREL

# # Load MRREL.RRF into a dataframe 
# path = '/media/sata_1TB_internal/umls-2020AB-full/2020AB/META/'
# mrrel = pd.read_csv(path + 'MRREL.RRF', sep='|', usecols=[0,3,4], header=None, encoding='utf-8') 
# mrrel.columns = ['CUI1', 'REL', 'CUI2']
# # mrrel.columns = ['CUI1', 'AUI1', 'STYPE1', 'REL', 'CUI2', 'AUI2', 'STYPE2', 'RELA', 'RUI', 'SRUI', 'SAB', 'SL', 'RG', 'DIR', 'SUPPRESS', 'CVF', '']
# mrrel.dropna(inplace = True)
# mrrel.drop_duplicates(inplace=True)
# mrrel.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 35854365 entries, 0 to 87766413
Data columns (total 3 columns):
 #   Column  Dtype 
---  ------  ----- 
 0   CUI1    object
 1   REL     object
 2   CUI2    object
dtypes: object(3)
memory usage: 1.1+ GB


In [9]:
mrrel.REL.value_counts()

RO     11161643
SIB    10077578
CHD     4410653
PAR     4410653
RB      1586507
RN      1586507
SY       701075
RQ       693905
AQ       612922
QB       612922
Name: REL, dtype: int64

In [None]:
# mrrel = mrrel.loc[(mrrel.REL == 'RB') | (mrrel.REL == 'PAR')]
# mrrel

In [None]:
# mrrel.REL = 'IS_A'
# mrrel

In [16]:
# # Write out to CSV and move the CSV into the database's import folder
# mrrel[['CUI1', 'CUI2', 'REL']].to_csv('MRREL_for_import.csv', index=False, encoding='utf-8')

In [18]:
# # Import relationships from MRREL_for_import.csv into the graph. Note that the csv file must be placed in the
# # database's import folder first
# command = '''
# USING PERIODIC COMMIT 100000 LOAD CSV WITH HEADERS FROM "file:///MRREL_for_import.csv" AS COLUMN
# MATCH (c1:Concept {cui:COLUMN.CUI1})
# WHERE c1.cui_pref_term IS NOT NULL
# MATCH (c2:Concept {cui:COLUMN.CUI2})
# WHERE c2.cui_pref_term IS NOT NULL
# CALL apoc.create.relationship(c2, COLUMN.REL, {source:'UMLS', version:'2020AB'}, c1) YIELD rel
# REMOVE rel.noOp;
# '''
# session.run(command)

<neo4j.work.result.Result at 0x7f1402916e80>

In [22]:
# Create a 'SYNONYM' relationship from each concept node to the preferred term of its CUI
command = '''
CALL apoc.periodic.iterate(
"MATCH (c1: Concept) WHERE c1.cui_pref_term IS NOT NULL MATCH (c2: Concept {cui: c1.cui}) WHERE NOT c1.aui = c2.aui RETURN c1,c2",
"MERGE (c1)<-[:SYNONYM {source:'UMLS', version:'2020AB'}]-(c2)",
{batchSize:10000, parallel:true})'''
session.run(command)

<neo4j.work.result.Result at 0x7f4ef8cdd3d0>

## 3. Connect UMLS concepts to MIMIC-III entities
See [MIMIC-III_v1.4_MI1_import.ipynb](MIMIC-III_v1.4_MI1_import.ipynb)

## 4. Update concepts and relationships from a new UMLS version
- If a concept or relationship is unchanged between the current version and the updated version, simply set its version property to the latest UMLS version.
- If the concept or relationship was added in the updated version, merge it into the graph.
- If the concept or relationship was removed in the updated version, set a "deprecated in UMLS_version" flag on the current version.
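
The three cases above amount to a set comparison between the identifiers in the graph and those in the new release. A minimal sketch, assuming concepts are compared by AUI; `classify_update` is a hypothetical helper and the identifiers are made up:

```python
def classify_update(current, updated):
    """Split identifiers into unchanged, added and removed sets."""
    current, updated = set(current), set(updated)
    return {
        "unchanged": current & updated,  # bump the version property
        "added": updated - current,      # merge into the graph
        "removed": current - updated,    # flag as deprecated
    }


# AUIs currently in the graph versus AUIs in a newer release (toy values)
changes = classify_update(["A1", "A2", "A3"], ["A2", "A3", "A4"])

for case in ("unchanged", "added", "removed"):
    print(case, sorted(changes[case]))
```

Relationships could be classified the same way by comparing `(AUI1, AUI2, RELA)` tuples instead of single identifiers.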