# UMLS Import
This notebook loads all UMLS concepts into a graph as nodes, connects them with the relationships defined by the UMLS source vocabularies, and provides the means to update the graph when a new UMLS version is released.
  
1. Load all UMLS concepts
2. Load all UMLS relationships
3. Connect UMLS concepts to MIMIC-III entities
4. Update concepts and relationships from a new UMLS version

## 1. Load all UMLS concepts

In [12]:
import pandas as pd
import dask.dataframe as dd
from progressbar import ProgressBar
import subprocess

In [13]:
import getpass
password = getpass.getpass("\nPlease enter the Neo4j database password to continue \n")


Please enter the Neo4j database password to continue 
 ···············


In [14]:
from neo4j import GraphDatabase
driver=GraphDatabase.driver(uri="bolt://localhost:7687", auth=('neo4j',password))
session=driver.session()

In [3]:
# Load MRCONSO.RRF into a dataframe
path = '/media/sata_1TB_internal/umls-2020AB-full/2020AB/META/'
mrconso = pd.read_csv(path + 'MRCONSO.RRF', sep='|', usecols=[0,1,7,14], header=None, encoding='utf-8')
mrconso.columns = ['CUI', 'LAT', 'AUI', 'STR']
mrconso.head()

Unnamed: 0,CUI,LAT,AUI,STR
0,C0000005,ENG,A26634265,(131)I-Macroaggregated Albumin
1,C0000005,ENG,A26634266,(131)I-MAA
2,C0000005,FRE,A13433185,Macroagrégats d'albumine marquée à l'iode 131
3,C0000005,FRE,A27488794,MAA-I 131
4,C0000005,FRE,A27614225,Macroagrégats d'albumine humaine marquée à l'i...
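
The positional `usecols=[0,1,7,14]` above selects the CUI, LAT, AUI, and STR fields of MRCONSO.RRF. As a minimal sketch, the same selection can be written with named columns. The helper `read_mrconso` and the `_TRAILING` name (for the empty field produced by the trailing `|` on each RRF line) are assumptions made for illustration, and the sample row uses placeholder `x` values for every field the notebook does not use.

```python
import io
import pandas as pd

# MRCONSO.RRF field order, plus a dummy name for the empty field created
# by the trailing '|' at the end of every RRF line.
MRCONSO_COLUMNS = [
    'CUI', 'LAT', 'TS', 'LUI', 'STT', 'SUI', 'ISPREF', 'AUI', 'SAUI',
    'SCUI', 'SDUI', 'SAB', 'TTY', 'CODE', 'STR', 'SRL', 'SUPPRESS', 'CVF',
    '_TRAILING',
]

def read_mrconso(source, wanted=('CUI', 'LAT', 'AUI', 'STR')):
    """Read MRCONSO.RRF (path or file-like) and keep only the named columns."""
    return pd.read_csv(source, sep='|', header=None, names=MRCONSO_COLUMNS,
                       usecols=list(wanted), dtype=str, encoding='utf-8')

# Synthetic one-line sample: only CUI, LAT, AUI and STR carry real-looking values.
fields = ['x'] * 18
fields[0], fields[1], fields[7], fields[14] = 'C0000005', 'ENG', 'A26634266', '(131)I-MAA'
sample = io.StringIO('|'.join(fields) + '|\n')

mrconso_sample = read_mrconso(sample)
print(mrconso_sample)
```

Naming the columns makes it harder to pick the wrong index by accident when another field (for example `SAB`) is needed later.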


In [4]:
mrsty = pd.read_csv(path + 'MRSTY.RRF', usecols=[0,3], sep='|', header=None)
mrsty.columns = ['CUI', 'STY']
mrsty.head()

Unnamed: 0,CUI,STY
0,C0000005,"Amino Acid, Peptide, or Protein"
1,C0000005,Pharmacologic Substance
2,C0000005,"Indicator, Reagent, or Diagnostic Aid"
3,C0000039,Organic Chemical
4,C0000039,Pharmacologic Substance


In [5]:
# Merge semantic type ("STY") into the mrconso dataframe
mrconso = pd.merge(mrconso, mrsty, on=['CUI'], how='outer')
mrconso.head()

Unnamed: 0,CUI,LAT,AUI,STR,STY
0,C0000005,ENG,A26634265,(131)I-Macroaggregated Albumin,"Amino Acid, Peptide, or Protein"
1,C0000005,ENG,A26634265,(131)I-Macroaggregated Albumin,Pharmacologic Substance
2,C0000005,ENG,A26634265,(131)I-Macroaggregated Albumin,"Indicator, Reagent, or Diagnostic Aid"
3,C0000005,ENG,A26634266,(131)I-MAA,"Amino Acid, Peptide, or Protein"
4,C0000005,ENG,A26634266,(131)I-MAA,Pharmacologic Substance
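
As the output shows, the merge repeats every atom (AUI) once per semantic type, so a concept with three semantic types yields three rows per atom. If one row per atom is preferred, a possible alternative (not what this notebook does) is to collapse the semantic types of each CUI into a single delimited string before merging. The helper name `merge_semantic_types`, the `;` separator, and the use of a left join instead of the outer join above are assumptions of this sketch.

```python
import pandas as pd

def merge_semantic_types(mrconso, mrsty, sep=';'):
    """Attach one delimited STY string per CUI, keeping one row per atom."""
    sty_per_cui = (
        mrsty.groupby('CUI')['STY']
        .agg(lambda s: sep.join(sorted(set(s))))  # deduplicate and sort for stable output
        .reset_index()
    )
    # Left join: keep every MRCONSO row, even if a CUI has no semantic type.
    return mrconso.merge(sty_per_cui, on='CUI', how='left')

# Tiny illustrative frames (values taken from the outputs above).
conso = pd.DataFrame({
    'CUI': ['C0000005', 'C0000005'],
    'AUI': ['A26634265', 'A26634266'],
})
sty = pd.DataFrame({
    'CUI': ['C0000005', 'C0000005'],
    'STY': ['Pharmacologic Substance', 'Amino Acid, Peptide, or Protein'],
})
merged = merge_semantic_types(conso, sty)
print(merged)
```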


In [12]:
mrconso.tail()

Unnamed: 0,CUI,LAT,AUI,STR,STY
16724383,C5399740,ENG,A32339932,Bavarian Nordic A/S,Health Care Related Organization
16724384,C5399741,ENG,A32340032,"Medication Reference Terminology, 2020_09_08",Intellectual Product
16724385,C5399741,ENG,A32340033,MED-RT_2020_09_08,Intellectual Product
16724386,C5399742,ENG,A32340042,Inactive Preparations by FDA Established Pharm...,Pharmacologic Substance
16724387,C5399742,ENG,A32340102,Inactive Preparations by FDA Established Pharm...,Pharmacologic Substance


In [6]:
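# Keep only English terms; .copy() avoids a SettingWithCopyWarning on the in-place edits below
mrconso = mrconso[mrconso['LAT'] == 'ENG'].copy()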

In [7]:
# Delete rows that cause problems with the Neo4j CSV import.
# Note: str.contains() treats each term as a regular expression by default.
mrconso.dropna(inplace=True, subset=['STR'])

bad_list = [
    'GELATIN,ABSORB PROSTATECTOMY CONE,SZ 18', 
    'Oryctolagus cuniculus f. domestica ""CC',
    'Oryctolagus cuniculus f. domestica',
    '""ff',
    'Morning after pill',
    '\\\\'
]

for term in bad_list:
    mrconso.drop(mrconso[mrconso['STR'].str.contains(term)].index, inplace=True)
    print(term)

GELATIN,ABSORB PROSTATECTOMY CONE,SZ 18
Oryctolagus cuniculus f. domestica ""CC
Oryctolagus cuniculus f. domestica
""ff
Morning after pill
\\
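
Because `str.contains` interprets its argument as a regular expression by default, a term such as `'f. domestica'` also matches strings where the `.` is any other character. A minimal sketch of a literal-match variant follows; the helper name `drop_rows_containing` is hypothetical, and switching to literal matching changes which rows are dropped compared with the cell above (notably for the backslash term), so treat it as an option to evaluate rather than a drop-in replacement.

```python
import pandas as pd

def drop_rows_containing(df, terms, column='STR'):
    """Return a copy of df without rows whose column contains any term literally."""
    mask = pd.Series(False, index=df.index)
    for term in terms:
        # regex=False matches the term character for character; na=False skips missing values.
        mask |= df[column].str.contains(term, regex=False, na=False)
    return df[~mask].copy()

demo = pd.DataFrame({'STR': ['abc', 'a.c', 'axc']})
kept = drop_rows_containing(demo, ['a.c'])
print(kept['STR'].tolist())
```

Building one boolean mask and filtering once also avoids repeated in-place drops on a large dataframe.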


In [10]:
mrconso.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 11679703 entries, 0 to 16724387
Data columns (total 5 columns):
 #   Column  Dtype 
---  ------  ----- 
 0   CUI     object
 1   LAT     object
 2   AUI     object
 3   STR     object
 4   STY     object
dtypes: object(5)
memory usage: 534.7+ MB


In [8]:
# Write the dataframe out to CSV
# mrconso.to_csv('MRCONSO_for_import.csv', sep='|', index=False, encoding='utf-8')
mrconso.to_csv('MRCONSO_for_import.csv', index=False, encoding='utf-8')

Move `MRCONSO_for_import.csv` into the Neo4j database's import folder before running the next cell.
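
This step can also be done from the notebook. The sketch below copies the file with the standard library; the helper name `copy_to_import_dir` is hypothetical, and the import folder location depends on the installation, so the path in the usage comment is only an assumed example.

```python
import shutil
from pathlib import Path

def copy_to_import_dir(csv_path, import_dir):
    """Copy a CSV file into the database import directory and return the new path."""
    import_dir = Path(import_dir)
    if not import_dir.is_dir():
        raise FileNotFoundError(f"Import directory not found: {import_dir}")
    # copy2 preserves file metadata and returns the destination path.
    return Path(shutil.copy2(csv_path, import_dir))

# Example (assumed path; adjust to your installation):
# copy_to_import_dir('MRCONSO_for_import.csv', '/var/lib/neo4j/import')
```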

In [15]:
# Create the Concept nodes from UMLS
command = '''
USING PERIODIC COMMIT 100000
LOAD CSV WITH HEADERS FROM "file:///MRCONSO_for_import.csv" AS COLUMN
CREATE (:Concept {term: COLUMN.STR, semantic_type: COLUMN.STY, language: COLUMN.LAT,
                  source: 'UMLS', version: '2020AB', cui: COLUMN.CUI, aui: COLUMN.AUI})
'''
session.run(command)

<neo4j.work.result.Result at 0x7f32e75461c0>

In [16]:
# Set an index for aui
command = '''
CREATE BTREE INDEX aui_index FOR (n:Concept) ON (n.aui)
'''
session.run(command)

<neo4j.work.result.Result at 0x7f32e75464c0>

## 2. Load all UMLS relationships

In [18]:
# Load MRREL.RRF into a dataframe 
path = '/media/sata_1TB_internal/umls-2020AB-full/2020AB/META/'
mrrel = pd.read_csv(path + 'MRREL.RRF', sep='|', usecols=[1,5,7], header=None, encoding='utf-8') 
mrrel.columns = ['AUI1', 'AUI2', 'RELA']
# mrrel.columns = ['CUI1', 'AUI1', 'STYPE1', 'REL', 'CUI2', 'AUI2', 'STYPE2', 'RELA', 'RUI', 'SRUI', 'SAB', 'SL', 'RG', 'DIR', 'SUPPRESS', 'CVF', '']
mrrel.dropna(inplace = True)
mrrel.drop_duplicates(inplace=True)
mrrel.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 37266704 entries, 2 to 87766309
Data columns (total 3 columns):
 #   Column  Dtype 
---  ------  ----- 
 0   AUI1    object
 1   AUI2    object
 2   RELA    object
dtypes: object(3)
memory usage: 1.1+ GB


In [19]:
# Identify redundant relationship types. These are an artifact of the former relational database schema,
# which required a separate relationship for each direction between two concepts
mr_rela = pd.read_csv('MRREL_2020AB_RELA.csv', encoding='utf-8')
mr_rela.head()

Unnamed: 0,RELA,Description
0,abnormal_cell_affected_by_chemical_or_drug,abnormal cell affected by chemical or drug
1,abnormality_associated_with_allele,abnormality associated with allele
2,access_device_used_by,Access device used by
3,access_instrument_of,Access instrument of
4,access_of,Access of


In [20]:
mr_rela[['blank_has','has_rel']] = mr_rela['RELA'].str.split('^has_', expand=True)

In [21]:
mr_rela[mr_rela.has_rel.notnull()]

Unnamed: 0,RELA,Description,blank_has,has_rel
296,has_access_instrument,Has access instrument,,access_instrument
297,has_access,Has access,,access
298,has_action_guidance,Has action guidance,,action_guidance
299,has_active_ingredient,Has active ingredient,,active_ingredient
300,has_active_metabolites,Has active metabolites,,active_metabolites
...,...,...,...,...
579,has_unit_of_presentation,Has unit of presentation,,unit_of_presentation
580,has_units,Has units,,units
581,has_venous_drainage,Has venous drainage,,venous_drainage
582,has_version,Has version,,version


In [22]:
mr_rela[['rel_of','blank_of']] = mr_rela['RELA'].str.split('_of$', expand=True)

In [23]:
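# Collect the stems of all has_<relationship> types. Filter on has_rel: blank_has is never null,
# so filtering on it would also add None entries for every non-has_ row
has_things = mr_rela[mr_rela.has_rel.notnull()].has_rel.to_list()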

In [24]:
# Find all the <relationship>_of types that mirror a has_<relationship> type
mr_rela['duplicate_term'] = mr_rela.rel_of.isin(has_things)
redundant_terms = mr_rela[mr_rela.duplicate_term].RELA.to_list()
print(redundant_terms[-10:])

['time_aspect_of', 'time_modifier_of', 'tradename_of', 'translation_of', 'transliterated_form_of', 'tributary_of', 'unit_of_presentation_of', 'units_of', 'venous_drainage_of', 'version_of']
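
The pairing logic in the cells above can be checked on a small list. This is a minimal sketch that uses plain prefix and suffix tests instead of the regular-expression splits; the function name `find_mirrored_of_terms` and the sample list are illustrative only.

```python
import pandas as pd

def find_mirrored_of_terms(rela_names):
    """Return the '<stem>_of' names for which a matching 'has_<stem>' name exists."""
    rela = pd.Series(list(rela_names), dtype='object')
    # Stems of every has_<stem> relationship, e.g. 'has_version' -> 'version'.
    has_stems = set(rela[rela.str.startswith('has_')].str[len('has_'):])
    # Candidates ending in '_of', compared by their stem.
    of_terms = rela[rela.str.endswith('_of')]
    mirrored = of_terms[of_terms.str[:-len('_of')].isin(has_stems)]
    return mirrored.tolist()

sample_rela = ['has_version', 'version_of', 'access_of',
               'has_units', 'units_of', 'mapped_to']
print(find_mirrored_of_terms(sample_rela))
```

Here `access_of` is not reported because the sample list contains no `has_access`.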


In [25]:
# Collect all the inverse_relationship and by_relationship types for removal
# (str.contains matches the substring anywhere in the relationship name)
inverse_terms = mr_rela[mr_rela.RELA.str.contains('inverse')].RELA.to_list()
by_terms = mr_rela[mr_rela.RELA.str.contains('by')].RELA.to_list()

# Manually inspected a list of relationships that contain '_of_'. Most, but not all, of these are redundant. 
contains_of_ = ['activity_of_allele', 'allele_plays_role_in_metabolism_of_chemical_or_drug', 'approach_of_excluded', 'approach_of_possibly_included', 'associated_finding_of_excluded', 'associated_finding_of_possibly_included', 'associated_procedure_of_excluded', 'associated_procedure_of_possibly_included', 'associated_with_malfunction_of_gene_product', 'basis_of_strength_substance_of', 'biological_process_is_part_of_process', 'chemical_or_drug_is_product_of_biological_process', 'chromosomal_location_of_allele', 'chromosomal_location_of_wild-type_gene', 'component_of_excluded', 'component_of_possibly_included', 'contraindicated_mechanism_of_action_of', 'count_of_active_ingredient_of', 'count_of_base_of_active_ingredient_of', 'gene_involved_in_pathogenesis_of_disease', 'gene_product_variant_of_gene_product', 'is_abnormal_cell_of_disease', 'is_abnormality_of_gene_product', 'is_abnormality_of_gene', 'is_associated_anatomy_of_gene_product', 'is_biochemical_function_of_gene_product', 'is_chemical_classification_of_gene_product', 'is_chromosomal_location_of_gene', 'is_component_of_chemotherapy_regimen', 'is_cytogenetic_abnormality_of_disease', 'is_finding_of_disease', 'is_grade_of_disease', 'is_location_of_anatomic_structure', 'is_location_of_biological_process', 'is_mechanism_of_action_of_chemical_or_drug', 'is_metastatic_anatomic_site_of_disease', 'is_molecular_abnormality_of_disease', 'is_normal_cell_origin_of_disease', 'is_normal_tissue_origin_of_disease', 'is_not_abnormal_cell_of_disease', 'is_not_cytogenetic_abnormality_of_disease', 'is_not_finding_of_disease', 'is_not_metastatic_anatomic_site_of_disease', 'is_not_molecular_abnormality_of_disease', 'is_not_normal_cell_origin_of_disease', 'is_not_normal_tissue_origin_of_disease', 'is_not_primary_anatomic_site_of_disease', 'is_organism_source_of_gene_product', 'is_physical_location_of_gene', 'is_physiologic_effect_of_chemical_or_drug', 'is_presence_of_lateral_location', 'is_primary_anatomic_site_of_disease', 
'is_property_or_attribute_of_eo_disease', 'is_stage_of_disease', 'is_structural_domain_or_motif_of_gene_product', 'locale_of_excluded', 'may_be_abnormal_cell_of_disease', 'may_be_associated_disease_of_disease', 'may_be_cytogenetic_abnormality_of_disease', 'may_be_finding_of_disease', 'may_be_molecular_abnormality_of_disease', 'may_be_normal_cell_origin_of_disease', 'mechanism_of_action_of', 'method_of_excluded', 'method_of_possibly_included', 'panel_element_of_possibly_included', 'pathology_of_excluded', 'pathology_of_possibly_included', 'patient_type_of_excluded', 'patient_type_of_possibly_included', 'pharmaceutical_state_of_matter_of', 'procedure_device_of_excluded', 'procedure_device_of_possibly_included', 'procedure_site_of_excluded', 'procedure_site_of_possibly_included', 'route_of_administration_of_excluded', 'route_of_administration_of_possibly_included', 'route_of_administration_of', 'specimen_of_excluded', 'state_of_matter_of', 'subject_of_information_of', 'surgical_extent_of_excluded', 'surgical_extent_of_possibly_included', 'tissue_is_expression_site_of_gene_product', 'unit_of_presentation_of']

# Manually inspected a list of relationships that contain 'plays'. Most, but not all, of these are redundant.
plays_terms = ['allele_plays_altered_role_in_process', 'allele_plays_role_in_metabolism_of_chemical_or_drug', 'chemical_or_drug_plays_role_in_biological_process', 'gene_plays_role_in_process', 'gene_product_plays_role_in_biological_process']

# Manually inspected a list of relationships that start with 'may_be_'. All of these are redundant.
may_be_terms = ['may_be_abnormal_cell_of_disease', 'may_be_associated_disease_of_disease', 'may_be_cytogenetic_abnormality_of_disease', 'may_be_diagnosed_by', 'may_be_finding_of_disease', 'may_be_molecular_abnormality_of_disease', 'may_be_normal_cell_origin_of_disease', 'may_be_prevented_by', 'may_be_qualified_by', 'may_be_treated_by']

# Manually inspected a list of relationships that contain 'involved_in'. All of these are redundant.
involved_in_terms = ['chromosome_involved_in_cytogenetic_abnormality', 'gene_involved_in_molecular_abnormality', 'gene_involved_in_pathogenesis_of_disease']

# Additional redundant terms discovered while exploring the graph
discovered = ['has_adherent', 'cdrh_parent_of', 'mapped_from', 'contraindicated_with_disease', 'is_associated_anatomic_site_of', 'subset_includes_concept', 'classified_as', 'has_permuted_term', 'same_as']

redundant_terms = redundant_terms + inverse_terms + by_terms + contains_of_ + plays_terms + may_be_terms + involved_in_terms + discovered
print(redundant_terms[-10:])

['gene_involved_in_molecular_abnormality', 'gene_involved_in_pathogenesis_of_disease', 'cdrh_parent_of', 'mapped_from', 'contraindicated_with_disease', 'is_associated_anatomic_site_of', 'subset_includes_concept', 'classified_as', 'has_permuted_term', 'same_as']


In [26]:
# Also remove terms which cause extraneous connections between concepts, such as terminologies which connect
# to many concepts in a way that is not meaningful
extraneous = ['concept_in_subset', 'has_cdrh_parent']
redundant_terms = redundant_terms + extraneous # For convenience, we'll just add them to the redundant_terms list

In [27]:
print(len(redundant_terms))
redundant_terms = list(set(redundant_terms))
print(len(redundant_terms))

388
366


In [28]:
# Find and drop all the rows in MRREL which have redundant relationship_of types
mrrel['redundant'] = mrrel.RELA.isin(redundant_terms)
redundant_relationships = mrrel[mrrel.redundant == True].index
mrrel.drop(redundant_relationships, inplace=True)
mrrel.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 19261442 entries, 44 to 87766309
Data columns (total 4 columns):
 #   Column     Dtype 
---  ------     ----- 
 0   AUI1       object
 1   AUI2       object
 2   RELA       object
 3   redundant  bool  
dtypes: bool(1), object(3)
memory usage: 606.2+ MB


In [29]:
mrrel.RELA = mrrel.RELA.str.upper()
mrrel[['AUI1', 'AUI2', 'RELA']].head()

Unnamed: 0,AUI1,AUI2,RELA
44,A0016515,A0137399,MAPPED_TO
45,A0016515,A0376033,MAPPED_TO
46,A0016515,A0683149,MAPPED_TO
47,A0016515,A1316792,MAPPED_TO
48,A0016515,A1321548,MAPPED_TO


In [30]:
mrrel[['AUI1', 'AUI2', 'RELA']].to_csv('MRREL_for_import.csv', index=False, encoding='utf-8')

In [17]:
# Import relationships from MRREL_for_import.csv into the graph. Note that the csv file must be placed in the
# database's import folder first
command = '''
USING PERIODIC COMMIT 100000 LOAD CSV WITH HEADERS FROM "file:///MRREL_for_import.csv" AS COLUMN
MATCH (c1:Concept {aui:COLUMN.AUI1})
MATCH (c2:Concept {aui:COLUMN.AUI2})
CALL apoc.create.relationship(c2, COLUMN.RELA, {source:'UMLS', version:'2020AB'}, c1) YIELD rel
REMOVE rel.noOp;
'''
session.run(command)

<neo4j.work.result.Result at 0x7f32e75467c0>

In [18]:
# Create an index on cui
command = 'CREATE INDEX cui FOR (n:Concept) ON (n.cui)'
session.run(command)

<neo4j.work.result.Result at 0x7f32e75469a0>

In [19]:
# Create undirected 'SYNONYM' relationships among all terms that share the same CUI
command = '''
CALL apoc.periodic.iterate(
"MATCH (c1: Concept) MATCH (c2: Concept {cui: c1.cui}) WHERE NOT c1.aui = c2.aui RETURN c1,c2",
"MERGE (c1)-[:SYNONYM {source:'UMLS', version:'2020AB'}]-(c2)",
{batchSize:10000, parallel:true})'''
session.run(command)

<neo4j.work.result.Result at 0x7f32e7546c10>
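
A rough sanity check for this step: a CUI with n distinct atoms should end up with n(n-1)/2 undirected `SYNONYM` relationships. The sketch below computes that expected total from a dataframe shaped like `mrconso`. It assumes exactly one node per AUI, which may not hold if the import created several nodes for the same atom, and the function name `expected_synonym_edges` is hypothetical.

```python
import pandas as pd

def expected_synonym_edges(concepts):
    """Expected number of undirected synonym pairs, assuming one node per AUI."""
    atoms_per_cui = concepts.groupby('CUI')['AUI'].nunique()
    # n atoms under one CUI give n * (n - 1) / 2 unordered pairs.
    return int((atoms_per_cui * (atoms_per_cui - 1) // 2).sum())

demo_concepts = pd.DataFrame({
    'CUI': ['C1', 'C1', 'C1', 'C2', 'C3', 'C3'],
    'AUI': ['A1', 'A2', 'A3', 'A4', 'A5', 'A6'],
})
print(expected_synonym_edges(demo_concepts))
```

The result can be compared with a relationship count taken from the graph after the cell above has finished.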

## 3. Connect UMLS concepts to MIMIC-III entities
See [MIMIC-III_v1.4_MI1_import.ipynb](MIMIC-III_v1.4_MI1_import.ipynb)

## 4. Update concepts and relationships from a new UMLS version
- If a concept or relationship is unchanged between the current version and the updated version, simply set its version property to the latest UMLS version.
- If a concept or relationship was added in the updated version, merge it into the graph.
- If a concept or relationship was removed in the updated version, set a "deprecated in UMLS_version" flag on the current version.
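
The three cases above can be sketched in pandas before anything is written to the graph. This is a minimal illustration under stated assumptions: both versions have already been prepared as dataframes with the same key columns (for relationships, `AUI1`, `AUI2`, `RELA`), and the function name `classify_changes` and the status labels are hypothetical.

```python
import pandas as pd

def classify_changes(current, updated, keys):
    """Label each key combination as 'unchanged', 'added' or 'deprecated'."""
    merged = current[keys].drop_duplicates().merge(
        updated[keys].drop_duplicates(), on=keys, how='outer', indicator=True)
    labels = {'both': 'unchanged', 'right_only': 'added', 'left_only': 'deprecated'}
    # indicator=True adds a '_merge' column telling which side each row came from.
    merged['status'] = merged['_merge'].astype(str).map(labels)
    return merged.drop(columns='_merge')

keys = ['AUI1', 'AUI2', 'RELA']
current_rel = pd.DataFrame(
    [('A1', 'A2', 'MAPPED_TO'), ('A1', 'A3', 'MAPPED_TO')], columns=keys)
updated_rel = pd.DataFrame(
    [('A1', 'A2', 'MAPPED_TO'), ('A4', 'A5', 'MAPPED_TO')], columns=keys)

changes = classify_changes(current_rel, updated_rel, keys)
print(changes)
```

Each status group could then be exported to its own CSV file and applied with a separate Cypher statement: a version update for `unchanged`, a merge for `added`, and a deprecation flag for `deprecated`.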