# UMLS Data Import

## Data Source
<strong>About the UMLS</strong>  
The NLM's Unified Medical Language System® (UMLS) is a collection of databases and tools designed "to facilitate the development of computer systems that behave as if they "understand" the meaning of the language of biomedicine and health... There are three UMLS Knowledge Sources: the Metathesaurus®, the Semantic Network, and the SPECIALIST Lexicon. They are distributed with flexible lexical tools and the MetamorphoSys installation and customization program."

<strong>About the Metathesaurus</strong>  
"The Metathesaurus is organized by concept or meaning. In essence, its purpose is to link alternative names and views of the same concept together and to identify useful relationships between different concepts... It is built from the electronic versions of many different thesauri, classifications, code sets, and lists of controlled terms used in patient care, health services billing, public health statistics, indexing and cataloging biomedical literature, and/or basic, clinical, and health services research."

Source: https://www.nlm.nih.gov/research/umls/about_umls.html

## Download Data
The NLM updates the UMLS data about one to three times per year. We downloaded the 2020AB Full UMLS Release Files, but you can find the latest version at the [UMLS website](https://www.nlm.nih.gov/research/umls/licensedcontent/umlsknowledgesources.html).

## Select Segment
There are many standardized vocabularies included in UMLS. Some come with licensing restrictions and others may not be relevant to your use case. To select a subset that fits your use case, follow the [MetamorphoSys Help documentation](https://www.nlm.nih.gov/research/umls/implementation_resources/metamorphosys/help.html).   

## Explore the data

<img style="float: right;" src="images/UMLS_CUI_map.jpg">

### Unique Identifiers in the Metathesaurus

- Concept Unique Identifiers (CUI)  
A concept is a meaning. A meaning can have many different names. A key goal of Metathesaurus construction is to understand the intended meaning of each name in each source vocabulary and to link all the names from all of the source vocabularies that mean the same thing (the synonyms). CUI contain the letter C followed by seven numbers. In the example on the right the CUI is C0018681.

- Lexical (term) Unique Identifiers (LUI)  
LUI link strings that are lexical variants. Lexical variants are detected using the Lexical Variant Generator (LVG) program, one of the UMLS lexical tools. LUI contain the letter L followed by seven numbers. In the example on the right there are three lexical variants, each given a separate LUI.

- String Unique Identifiers (SUI)  
Each unique concept name or string in each language in the Metathesaurus has a unique and permanent string identifier (SUI). Any variation in character set, upper-lower case, or punctuation difference is a separate string, with a separate SUI. SUI contain the letter S followed by seven numbers. In the example on the right there are four strings with four different SUI.

- Atom Unique Identifiers (AUI)  
The basic building blocks or "atoms" from which the Metathesaurus is constructed are the concept names or strings from each of the source vocabularies. Every occurrence of a string in each source vocabulary is assigned a unique atom identifier (AUI). If exactly the same string appears multiple times in the same vocabulary, for example as an alternate name for different concepts, a unique AUI is assigned for each occurrence. AUI contain the letter A followed by seven or eight digits. In the example on the right there are five strings from five sources with five different AUI. The abbreviation for the source that contributed each string is noted in parentheses after the string.

Source: https://www.nlm.nih.gov/research/umls/new_users/online_learning/Meta_005.html
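The identifier formats described above lend themselves to a quick sanity check. The sketch below covers only the CUI pattern (the letter C followed by seven digits). The helper name `is_valid_cui` is hypothetical and is not part of any UMLS tool.

```python
import re

# Pattern taken from the description above: "C" followed by exactly seven digits.
CUI_PATTERN = re.compile(r"C\d{7}")


def is_valid_cui(value: str) -> bool:
    """Return True if value looks like a CUI (hypothetical helper)."""
    return CUI_PATTERN.fullmatch(value) is not None


print(is_valid_cui("C0018681"))  # the CUI from the example figure
print(is_valid_cui("L0000005"))  # an LUI, so it should not match
```

A check like this can be useful for catching shifted columns after a file has been split on the wrong delimiter.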

### CUI Synonym Graph (what we're going to build)
![UMLS graph structure](images/UMLS_Structure.png)
Source: https://drive.google.com/file/d/1NT0TR1BX3N-DV75Hxxy4B6brY5ZwGDPv/view?usp=sharing

The files are exported in Rich Release Format (RRF), described [here](https://www.ncbi.nlm.nih.gov/books/NBK9685/).  

The unique identifiers in the Metathesaurus are stored in the RRF file MRCONSO.RRF.

First few lines from MRCONSO.RRF (column header added for convenience):

|CUI|LAT|TS|LUI|STT|SUI|ISPREF|AUI|SAUI|SCUI|SDUI|SAB|TTY|CODE|STR|SRL|SUPPRESS|CVF|
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|C0000005|ENG|P|L0000005|PF|S0007492|Y|A26634265||M0019694|D012711|MSH|PEP|D012711|(131)I-Macroaggregated Albumin|0|N|256|  
|C0000005|ENG|S|L0270109|PF|S0007491|Y|A26634266||M0019694|D012711|MSH|ET|D012711|(131)I-MAA|0|N|256|  
|C0000005|FRE|P|L6220710|PF|S7133957|Y|A13433185||M0019694|D012711|MSHFRE|PEP|D012711|Macroagrégats d'albumine marquée à l'iode 131|3|N||  
|C0000005|FRE|S|L6215648|PF|S7133916|Y|A27488794||M0019694|D012711|MSHFRE|ET|D012711|MAA-I 131|3|N||  
|C0000005|FRE|S|L6215656|PF|S7133956|Y|A27614225||M0019694|D012711|MSHFRE|ET|D012711|Macroagrégats d'albumine humaine marquée à l'iode 131|3|N||  
|C0000039|CZE|P|L6742182|PF|S7862052|Y|A13042554||M0023172|D015060|MSHCZE|MH|D015060|1,2-dipalmitoylfosfatidylcholin|3|N||  
|C0000039|ENG|P|L0000039|PF|S17175117|N|A28315139|9194921|1926948||RXNORM|IN|1926948|1,2-dipalmitoylphosphatidylcholine|0|N|256|  
|C0000039|ENG|P|L0000039|PF|S17175117|Y|A28572604||||MTH|PN|NOCODE|1,2-dipalmitoylphosphatidylcholine|0|N|256|

Descriptions of all columns can be found [here](https://www.ncbi.nlm.nih.gov/books/NBK9685/table/ch03.T.concept_names_and_sources_file_mr/?report=objectonly), but the most pertinent for our purposes are:

|Column Index|Column|Description|  
|---|---|---|  
|0|CUI|Unique identifier for concept|  
|1|LAT|Language of term|  
|2|TS|Term status - Preferred LUI of the CUI (P) or non-preferred LUI of the CUI (S)|
|3|LUI|Unique identifier for term|  
|5|SUI|Unique identifier for string|  
|6|ISPREF|Atom status - preferred (Y) or not (N) for this string within this concept|  
|7|AUI|Unique identifier for atom - variable length field, 8 or 9 characters|  
|11|SAB|Abbreviated source name (SAB)|  
|12|TTY|Abbreviation for term type in source vocabulary, for example PN (Metathesaurus Preferred Name) or CD (Clinical Drug). Possible values are listed on the [Abbreviations Used in Data Elements page](https://www.nlm.nih.gov/research/umls/knowledge_sources/metathesaurus/release/abbreviations.html).|  
|14|STR|String|  
|15|SRL|Source restriction level. See License categories 0-4 listed [here](https://www.nlm.nih.gov/research/umls/knowledge_sources/metathesaurus/release/license_agreement_appendix.html)|  

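The RRF files are essentially CSV files that use "|" as a delimiter instead of commas, so we can convert them into CSVs and use the CSV import function of Neo4j to build a graph database from them. Each row also ends with a trailing "|", which is why pandas reads one extra, empty column at the end.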
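To make the pipe-delimited layout concrete, here is a minimal sketch that splits one MRCONSO row by hand and maps the fields to the column names from the table above. The helper `parse_rrf_line` is a hypothetical name used only for illustration. The notebook itself relies on `pd.read_csv(..., sep='|')`.

```python
MRCONSO_COLUMNS = [
    "CUI", "LAT", "TS", "LUI", "STT", "SUI", "ISPREF", "AUI", "SAUI",
    "SCUI", "SDUI", "SAB", "TTY", "CODE", "STR", "SRL", "SUPPRESS", "CVF",
]

# First sample row from MRCONSO.RRF, including its trailing "|".
sample_line = (
    "C0000005|ENG|P|L0000005|PF|S0007492|Y|A26634265||M0019694|D012711|"
    "MSH|PEP|D012711|(131)I-Macroaggregated Albumin|0|N|256|\n"
)


def parse_rrf_line(line, columns):
    """Split one RRF row into a dict keyed by column name (hypothetical helper)."""
    fields = line.rstrip("\n").split("|")
    # The trailing "|" produces one empty field at the end; drop it.
    if fields and fields[-1] == "":
        fields = fields[:-1]
    return dict(zip(columns, fields))


row = parse_rrf_line(sample_line, MRCONSO_COLUMNS)
print(row["CUI"], row["SAB"], row["STR"])
```

Empty fields such as SAUI come back as empty strings here, whereas pandas reports them as NaN.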

### Initialize a connection to the neo4j database.

In [2]:
import pandas as pd

In [5]:
import getpass
password = getpass.getpass("\nPlease enter the Neo4j database password to continue \n")


Please enter the Neo4j database password to continue 
 ·······


In [6]:
from neo4j import GraphDatabase
driver=GraphDatabase.driver(uri="bolt://localhost:7687", auth=('neo4j',password))
session=driver.session()

## Create a node for each Concept Unique Identifier (CUI)

### Create a CSV file that includes only one row for each CUI using the English preferred term for the CUI

In [7]:
# Load MRCONSO.RRF into a dataframe
mrconso = pd.read_csv('/home/tim/Documents/GrApH_AI/Data/umls-2020AB-full/2020AB/META/MRCONSO.RRF', sep='|', header=None, encoding='utf-8')
mrconso[:5]


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18
0,C0000005,ENG,P,L0000005,PF,S0007492,Y,A26634265,,M0019694,D012711,MSH,PEP,D012711,(131)I-Macroaggregated Albumin,0,N,256.0,
1,C0000005,ENG,S,L0270109,PF,S0007491,Y,A26634266,,M0019694,D012711,MSH,ET,D012711,(131)I-MAA,0,N,256.0,
2,C0000005,FRE,P,L6220710,PF,S7133957,Y,A13433185,,M0019694,D012711,MSHFRE,PEP,D012711,Macroagrégats d'albumine marquée à l'iode 131,3,N,,
3,C0000005,FRE,S,L6215648,PF,S7133916,Y,A27488794,,M0019694,D012711,MSHFRE,ET,D012711,MAA-I 131,3,N,,
4,C0000005,FRE,S,L6215656,PF,S7133956,Y,A27614225,,M0019694,D012711,MSHFRE,ET,D012711,Macroagrégats d'albumine humaine marquée à l'i...,3,N,,


In [8]:
mrconso.columns = ['CUI', 'LAT', 'TS', 'LUI', 'STT', 'SUI', 'ISPREF', 'AUI', 'SAUI', 'SCUI', 'SDUI', 'SAB', 'TTY', 'CODE', 'STR', 'SRL', 'SUPPRESS', 'CVF', '']

In [9]:
mrconso.drop(labels=['SUPPRESS', 'CODE', 'CVF','SAUI', 'SCUI', 'SDUI', ''], axis=1, inplace=True)

In [10]:
mrconso.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15229290 entries, 0 to 15229289
Data columns (total 12 columns):
 #   Column  Dtype 
---  ------  ----- 
 0   CUI     object
 1   LAT     object
 2   TS      object
 3   LUI     object
 4   STT     object
 5   SUI     object
 6   ISPREF  object
 7   AUI     object
 8   SAB     object
 9   TTY     object
 10  STR     object
 11  SRL     int64 
dtypes: int64(1), object(11)
memory usage: 1.4+ GB


In [11]:
mrconso['CUI'].value_counts()

C1612374    2757
C1613478    1723
C2946453    1539
C3486084    1219
C0979217     820
            ... 
C5183127       1
C3713838       1
C4699067       1
C4648462       1
C3749451       1
Name: CUI, Length: 4363491, dtype: int64

In [12]:
mrconso.iloc[:,0].value_counts().sum()

15229290

There are 169 source vocabularies, 15.2 million AUIs, 13.3 million SUIs, 12.2 million LUIs, and 4.36 million unique CUIs in the data.
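Counts like these can be obtained with `DataFrame.nunique`. The sketch below runs it on the identifier columns of the eight sample rows shown earlier rather than on the full file, so the numbers it prints describe the sample only.

```python
import pandas as pd

# Identifier columns copied from the eight sample MRCONSO rows shown earlier.
sample = pd.DataFrame(
    {
        "CUI": ["C0000005"] * 5 + ["C0000039"] * 3,
        "LUI": ["L0000005", "L0270109", "L6220710", "L6215648",
                "L6215656", "L6742182", "L0000039", "L0000039"],
        "SUI": ["S0007492", "S0007491", "S7133957", "S7133916",
                "S7133956", "S7862052", "S17175117", "S17175117"],
        "AUI": ["A26634265", "A26634266", "A13433185", "A27488794",
                "A27614225", "A13042554", "A28315139", "A28572604"],
        "SAB": ["MSH", "MSH", "MSHFRE", "MSHFRE",
                "MSHFRE", "MSHCZE", "RXNORM", "MTH"],
    }
)

# Number of distinct values in each column.
unique_counts = sample.nunique()
print(unique_counts)
```

On the full dataframe the same call would be `mrconso[['CUI', 'LUI', 'SUI', 'AUI', 'SAB']].nunique()`.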

In [4]:
# Select which string form to use for each CUI. The metathesaurus ranks term preference based on source vocabulary and term type in the MRRANK.RRF file

# Load MRRANK
mrrank = pd.read_csv('/home/tim/Documents/GrApH_AI/Data/umls-2020AB-full/2020AB/META/MRRANK.RRF', sep='|', header=None)
mrrank.columns = ['Rank', 'SAB', 'TTY', 'SUPPRESS','']
mrrank.drop(labels = ['SUPPRESS', ''], axis = 1, inplace=True)
mrrank.head(10)

Unnamed: 0,Rank,SAB,TTY
0,735,MTH,PN
1,734,MTHCMSFRF,PT
2,733,RXNORM,SCD
3,732,RXNORM,SBD
4,731,RXNORM,SCDG
5,730,RXNORM,SBDG
6,729,RXNORM,IN
7,728,RXNORM,PSN
8,727,RXNORM,MIN
9,726,RXNORM,SCDF


In [14]:
# Merge the "Rank" column from mrrank into mrconso
mrconso = pd.merge(mrconso, mrrank, on=['SAB', 'TTY'])
mrconso.head()

Unnamed: 0,CUI,LAT,TS,LUI,STT,SUI,ISPREF,AUI,SAB,TTY,STR,SRL,Rank
0,C0000005,ENG,P,L0000005,PF,S0007492,Y,A26634265,MSH,PEP,(131)I-Macroaggregated Albumin,0,712
1,C0000074,ENG,P,L0000074,PF,S0007615,Y,A26606894,MSH,PEP,1-Alkyl-2-Acylphosphatidates,0,712
2,C0000132,ENG,P,L0000132,PF,S0007739,Y,A26665454,MSH,PEP,15-Ketosteryl Oleate Hydrolase,0,712
3,C0000137,ENG,P,L0000137,PF,S0007753,Y,A26650280,MSH,PEP,15S RNA,0,712
4,C0000151,ENG,P,L0000151,PF,S0007787,Y,A26647507,MSH,PEP,17 beta-Hydroxy-5 beta-Androstan-3-One,0,712


In [15]:
mrconso.sort_values(by=['CUI','Rank'], ascending=False, inplace=True)

In [14]:
mrconso.head(20)

Unnamed: 0,CUI,LAT,TS,LUI,STT,SUI,ISPREF,AUI,SAB,TTY,STR,SRL,Rank
12015007,C5399742,ENG,P,L16661200,PF,S20184617,Y,A32340102,MED-RT,PT,Inactive Preparations by FDA Established Pharm...,0,629
12018397,C5399742,ENG,P,L16661200,PF,S20184617,N,A32340042,MED-RT,FN,Inactive Preparations by FDA Established Pharm...,0,628
15173344,C5399741,ENG,P,L16661227,PF,S20184623,Y,A32340032,SRC,VPT,"Medication Reference Terminology, 2020_09_08",0,4
15173511,C5399741,ENG,S,L16661226,PF,S20184622,Y,A32340033,SRC,VAB,MED-RT_2020_09_08,0,3
13977666,C5399740,ENG,P,L16661187,PF,S20184552,Y,A32339932,MVX,PT,Bavarian Nordic A/S,0,604
15173343,C5399739,ENG,P,L16661189,PF,S20184554,Y,A32339931,SRC,VPT,"Manufacturers of Vaccines, 2020_09_04",0,4
15173510,C5399739,ENG,S,L16661188,PF,S20184553,Y,A32339930,SRC,VAB,MVX2020_09_04,0,3
466975,C5399738,ENG,P,L16661239,PF,S20184634,Y,A32340412,MTH,PN,Receptor Interaction [APC],0,735
12015006,C5399738,ENG,S,L4866612,PF,S20184635,Y,A32340126,MED-RT,PT,Receptor Interaction,0,629
12018396,C5399738,ENG,P,L16661239,PF,S20184634,N,A32340063,MED-RT,FN,Receptor Interaction [APC],0,628


In [16]:
# Create a boolean mask to select all rows where the Rank is the maximum for each CUI
idx = mrconso.groupby('CUI')['Rank'].transform('max') == mrconso['Rank']
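To see what this mask does, and why it can leave several rows per CUI, here is a small sketch on made-up rows. The CUIs, strings, and ranks are toy values and do not come from the UMLS.

```python
import pandas as pd

# Toy rows (made up for illustration): C1 has a tie at its top rank.
toy = pd.DataFrame(
    {
        "CUI": ["C1", "C1", "C1", "C2", "C2"],
        "STR": ["a", "b", "c", "d", "e"],
        "Rank": [700, 700, 650, 600, 500],
    }
)

# transform("max") broadcasts each group's maximum back onto every row of
# that group, so the comparison yields one boolean per original row.
is_top_ranked = toy.groupby("CUI")["Rank"].transform("max") == toy["Rank"]
top_rows = toy[is_top_ranked]
print(top_rows)
```

Both top-ranked rows of C1 survive the mask, which is the situation the duplicate counts in the following cells measure.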

In [16]:
# Check how many duplicates of each CUI remain
counts = mrconso.loc[idx, 'CUI'].value_counts()
counts

C4015968    166
C3832820    143
C3266948     87
C3497836     68
C4745418     67
           ... 
C1361699      1
C5281613      1
C1491448      1
C1319923      1
C1468068      1
Name: CUI, Length: 4363491, dtype: int64

In [17]:
sum(counts > 1)

48580

There are still 48,580 CUIs with more than one row. Let's take a look at a few of them to see what's going on.

In [18]:
mrconso[mrconso['CUI'] == 'C3266948'].head(5)

Unnamed: 0,CUI,LAT,TS,LUI,STT,SUI,ISPREF,AUI,SAB,TTY,STR,SRL,Rank
13746905,C3266948,ENG,P,L10319408,PF,S12879493,Y,A20052434,MTHSPL,DP,Aluminum Zirconium Tetrachlorohydrex GLY 15.2 ...,0,686
13746906,C3266948,ENG,S,L10319405,PF,S13431743,Y,A20838971,MTHSPL,DP,Aluminum Zirconium Tetrachlorohydrex GLY 15.2 ...,0,686
13746907,C3266948,ENG,S,L10319406,PF,S12879485,Y,A20052433,MTHSPL,DP,Aluminum Zirconium Tetrachlorohydrex GLY 15.2 ...,0,686
13746908,C3266948,ENG,S,L10319406,VC,S14154309,Y,A23431585,MTHSPL,DP,Aluminum Zirconium Tetrachlorohydrex GLY 15.2 ...,0,686
13746909,C3266948,ENG,S,L10319407,PF,S12879489,Y,A20047450,MTHSPL,DP,Aluminum Zirconium Tetrachlorohydrex GLY 15.2 ...,0,686


In [19]:
mrconso[mrconso['CUI'] == 'C4015968'].head(5)

Unnamed: 0,CUI,LAT,TS,LUI,STT,SUI,ISPREF,AUI,SAB,TTY,STR,SRL,Rank
11277376,C4015968,ENG,P,L12050646,PF,S14945011,N,A24569127,OMIM,PHENO,RECLASSIFIED - VARIANT OF UNKNOWN SIGNIFICANCE,0,584
11277377,C4015968,ENG,P,L12050646,PF,S14945011,N,A24573143,OMIM,PHENO,RECLASSIFIED - VARIANT OF UNKNOWN SIGNIFICANCE,0,584
11277378,C4015968,ENG,P,L12050646,PF,S14945011,N,A24573248,OMIM,PHENO,RECLASSIFIED - VARIANT OF UNKNOWN SIGNIFICANCE,0,584
11277379,C4015968,ENG,P,L12050646,PF,S14945011,N,A24573290,OMIM,PHENO,RECLASSIFIED - VARIANT OF UNKNOWN SIGNIFICANCE,0,584
11277380,C4015968,ENG,P,L12050646,PF,S14945011,N,A24573294,OMIM,PHENO,RECLASSIFIED - VARIANT OF UNKNOWN SIGNIFICANCE,0,584


In [20]:
mrconso[mrconso['CUI'] == 'C3832820']['TS'].value_counts()

S    135
P      8
Name: TS, dtype: int64

They seem to differ by String Unique Identifier (SUI), string (STR), and Atom Unique Identifier (AUI), but not by CUI or Rank. 

In [17]:
# Start by keeping only the rows whose Rank is the maximum for their CUI, then deal with ties in Rank
cui_max_rank = mrconso[idx]

In [22]:
# Check how many rows remain per CUI; the next cell keeps only rows where Term Status (TS) is preferred (P)
cui_max_rank['CUI'].value_counts()

C4015968    166
C3832820    143
C3266948     87
C3497836     68
C4745418     67
           ... 
C1361699      1
C5281613      1
C1491448      1
C1319923      1
C1468068      1
Name: CUI, Length: 4363491, dtype: int64

In [18]:
cui_max_rank_p = cui_max_rank[cui_max_rank['TS'] == 'P']

In [24]:
cui_max_rank_p['CUI'].value_counts()

C4015968    166
C5241928     36
C3818228     34
C1821614     32
C4716361     32
           ... 
C5057613      1
C2710958      1
C4158104      1
C4694681      1
C1468068      1
Name: CUI, Length: 4363491, dtype: int64

In [25]:
# Check number of remaining duplicates
sum(cui_max_rank_p['CUI'].value_counts() > 1)

33937

In [26]:
cui_max_rank_p['ISPREF'].value_counts()

Y    4388633
N      16032
Name: ISPREF, dtype: int64

In [19]:
# Keep only rows whose atom is marked preferred (ISPREF == 'Y')
cui_maxrank_TSp_ISPREFy = cui_max_rank_p[cui_max_rank_p['ISPREF'] == 'Y']

In [28]:
cui_maxrank_TSp_ISPREFy['CUI'].value_counts()

C1390416    10
C1401690     7
C1399086     7
C1400842     6
C1398491     6
            ..
C3941307     1
C5357236     1
C5162679     1
C4126911     1
C1468068     1
Name: CUI, Length: 4363491, dtype: int64

In [29]:
# Check number of remaining duplicates
sum(cui_maxrank_TSp_ISPREFy['CUI'].value_counts() > 1)

23385

In [30]:
cui_maxrank_TSp_ISPREFy[cui_maxrank_TSp_ISPREFy['CUI'] == 'C1390416'].head(10)

Unnamed: 0,CUI,LAT,TS,LUI,STT,SUI,ISPREF,AUI,SAB,TTY,STR,SRL,Rank
7526958,C1390416,ENG,P,L3522844,PF,S4116243,Y,A4448865,ICPC2ICD10ENG,PT,"pseudomucinous; cystadenoma, papillary, border...",3,452
7526959,C1390416,ENG,P,L3522844,VW,S4058520,Y,A4391142,ICPC2ICD10ENG,PT,borderline malignancy; papillary pseudomucinou...,3,452
7526960,C1390416,ENG,P,L3522844,VW,S4058527,Y,A4391149,ICPC2ICD10ENG,PT,borderline malignancy; pseudomucinous papillar...,3,452
7526961,C1390416,ENG,P,L3522844,VW,S4069350,Y,A4401972,ICPC2ICD10ENG,PT,"cystadenoma; papillary, pseudomucinous, border...",3,452
7526962,C1390416,ENG,P,L3522844,VW,S4069361,Y,A4401983,ICPC2ICD10ENG,PT,"cystadenoma; pseudomucinous, papillary, border...",3,452
7526963,C1390416,ENG,P,L3522844,VW,S4108795,Y,A4441417,ICPC2ICD10ENG,PT,"ovary; cystadenoma, papillary pseudomucinous, ...",3,452
7526964,C1390416,ENG,P,L3522844,VW,S4108798,Y,A4441420,ICPC2ICD10ENG,PT,"ovary; cystadenoma, pseudomucinous papillary, ...",3,452
7526965,C1390416,ENG,P,L3522844,VW,S4108849,Y,A4441471,ICPC2ICD10ENG,PT,"ovary; papillary pseudomucinous cystadenoma, b...",3,452
7526966,C1390416,ENG,P,L3522844,VW,S4108856,Y,A4441478,ICPC2ICD10ENG,PT,"ovary; pseudomucinous papillary cystadenoma, b...",3,452
7526967,C1390416,ENG,P,L3522844,VW,S4109548,Y,A4442170,ICPC2ICD10ENG,PT,"papillary; cystadenoma, pseudomucinous, border...",3,452


In [20]:
# Keep only rows whose string is the preferred form of its term (STT == 'PF')
cui_maxrank_TSp_ISPREFy_STTpf = cui_maxrank_TSp_ISPREFy[cui_maxrank_TSp_ISPREFy['STT'] == 'PF']

In [55]:
cui_maxrank_TSp_ISPREFy_STTpf['CUI'].value_counts()

C3824315    1
C2213782    1
C1855810    1
C3395491    1
C5225871    1
           ..
C3532323    1
C5310801    1
C2505120    1
C0761953    1
C0387529    1
Name: CUI, Length: 4363491, dtype: int64

In [56]:
cui_maxrank_TSp_ISPREFy_STTpf.head()

Unnamed: 0,CUI,LAT,TS,LUI,STT,SUI,ISPREF,AUI,SAB,TTY,STR,SRL,Rank
12015007,C5399742,ENG,P,L16661200,PF,S20184617,Y,A32340102,MED-RT,PT,Inactive Preparations by FDA Established Pharm...,0,629
15173344,C5399741,ENG,P,L16661227,PF,S20184623,Y,A32340032,SRC,VPT,"Medication Reference Terminology, 2020_09_08",0,4
13977666,C5399740,ENG,P,L16661187,PF,S20184552,Y,A32339932,MVX,PT,Bavarian Nordic A/S,0,604
15173343,C5399739,ENG,P,L16661189,PF,S20184554,Y,A32339931,SRC,VPT,"Manufacturers of Vaccines, 2020_09_04",0,4
466975,C5399738,ENG,P,L16661239,PF,S20184634,Y,A32340412,MTH,PN,Receptor Interaction [APC],0,735


In [21]:
CUIs_preferred_terms = cui_maxrank_TSp_ISPREFy_STTpf.drop(labels = ['LAT', 'TS', 'LUI', 'STT', 'SUI', 'ISPREF', 'AUI', 'SAB', 'TTY', 'SRL', 'Rank'], axis = 1)
CUIs_preferred_terms.head()

Unnamed: 0,CUI,STR
12015007,C5399742,Inactive Preparations by FDA Established Pharm...
15173344,C5399741,"Medication Reference Terminology, 2020_09_08"
13977666,C5399740,Bavarian Nordic A/S
15173343,C5399739,"Manufacturers of Vaccines, 2020_09_04"
466975,C5399738,Receptor Interaction [APC]


Yay! We finally have only one line per CUI, with the preferred string associated with each CUI. We still have 4363491 unique CUIs, as we had when we started.
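The four filters applied above can be collected into one function. This is a sketch on made-up rows, and `select_preferred` is a hypothetical helper name. The filters happened to leave exactly one row per CUI on the 2020AB data, but nothing guarantees that for other releases or subsets, so the sketch ends with an explicit uniqueness check.

```python
import pandas as pd


def select_preferred(rows: pd.DataFrame) -> pd.DataFrame:
    """Apply the four filters used above, in the same order (hypothetical helper)."""
    top = rows[rows.groupby("CUI")["Rank"].transform("max") == rows["Rank"]]
    top = top[top["TS"] == "P"]        # preferred term (LUI) of the concept
    top = top[top["ISPREF"] == "Y"]    # preferred atom for the string
    top = top[top["STT"] == "PF"]      # preferred form of the term
    return top[["CUI", "STR"]]


# Toy rows (made up for illustration), one row removed by each filter.
toy = pd.DataFrame(
    {
        "CUI": ["C1", "C1", "C1", "C1", "C2", "C2"],
        "TS": ["P", "S", "P", "P", "P", "P"],
        "ISPREF": ["Y", "Y", "N", "Y", "Y", "Y"],
        "STT": ["PF", "PF", "PF", "VW", "PF", "PF"],
        "Rank": [700, 700, 700, 700, 600, 500],
        "STR": ["alpha", "alpha variant", "alpha duplicate",
                "alpha, reordered", "beta", "beta low rank"],
    }
)

preferred = select_preferred(toy)
# Fail loudly if any CUI still has more than one row.
assert preferred["CUI"].is_unique
print(preferred)
```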

### Add Semantic Type for each CUI

In [17]:
# Add Semantic Type for each CUI. The MRSTY.RRF file lists the Semantic Type for each CUI. 

# Load MRSTY.RRF
mrsty = pd.read_csv('/home/tim/Documents/GrApH_AI/Data/umls-2020AB-full/2020AB/META/MRSTY.RRF', sep='|', header=None)
mrsty.columns = ['CUI', 'TUI', 'STN', 'STY', 'ATUI', 'CVF','']
mrsty.drop(labels = ['TUI', 'STN', 'ATUI', 'CVF', ''], axis = 1, inplace=True)
mrsty.head(10)

Unnamed: 0,CUI,STY
0,C0000005,"Amino Acid, Peptide, or Protein"
1,C0000005,Pharmacologic Substance
2,C0000005,"Indicator, Reagent, or Diagnostic Aid"
3,C0000039,Organic Chemical
4,C0000039,Pharmacologic Substance
5,C0000052,"Amino Acid, Peptide, or Protein"
6,C0000052,Enzyme
7,C0000074,Organic Chemical
8,C0000084,"Amino Acid, Peptide, or Protein"
9,C0000084,Biologically Active Substance


In [23]:
# Merge semantic type into the dataframe holding the CUIs with their preferred terms
CUIs_preferred_terms = pd.merge(CUIs_preferred_terms, mrsty, on=['CUI'])
CUIs_preferred_terms.head()

Unnamed: 0,CUI,STR,STY
0,C5399742,Inactive Preparations by FDA Established Pharm...,Pharmacologic Substance
1,C5399741,"Medication Reference Terminology, 2020_09_08",Intellectual Product
2,C5399740,Bavarian Nordic A/S,Health Care Related Organization
3,C5399739,"Manufacturers of Vaccines, 2020_09_04",Intellectual Product
4,C5399738,Receptor Interaction [APC],Pharmacologic Substance


In [28]:
sum(CUIs_preferred_terms['CUI'].isnull())

0
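Note that a CUI can have several semantic types (C0000005 has three in the MRSTY sample above), so an inner merge on CUI returns one row per CUI and semantic type pair rather than one row per CUI. One possible way to keep a single row per CUI is to aggregate the types before merging. The sketch below assumes that joining the types with a `|` separator is acceptable for your use. The separator and the variable names are illustrative choices, not part of the original notebook.

```python
import pandas as pd

# Two concepts with strings taken from the sample rows shown earlier.
terms = pd.DataFrame(
    {
        "CUI": ["C0000005", "C0000039"],
        "STR": ["(131)I-Macroaggregated Albumin",
                "1,2-dipalmitoylphosphatidylcholine"],
    }
)

# Semantic types for the same concepts, copied from the MRSTY sample above.
sty = pd.DataFrame(
    {
        "CUI": ["C0000005", "C0000005", "C0000005", "C0000039", "C0000039"],
        "STY": ["Amino Acid, Peptide, or Protein", "Pharmacologic Substance",
                "Indicator, Reagent, or Diagnostic Aid",
                "Organic Chemical", "Pharmacologic Substance"],
    }
)

# A plain merge repeats each concept once per semantic type.
plain = terms.merge(sty, on="CUI")

# Collapse the types to one "|"-separated string per CUI, then merge.
sty_per_cui = sty.groupby("CUI")["STY"].agg("|".join).reset_index()
merged = terms.merge(sty_per_cui, on="CUI", how="left")

print(len(plain), len(merged))
```

If the combined column is later loaded into Neo4j, it can be split back into a list on the same separator.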

In [17]:
# Reformat semantic types into the standard cypher format for node labels
# mrsty['STY'] = mrsty['STY'].str.replace(' ', '_')
# mrsty['STY'] = mrsty['STY'].str.replace(',', '_')
# mrsty['STY'] = mrsty['STY'].str.replace('-', '_')
# mrsty.head()

Unnamed: 0,CUI,STY
0,C0000005,Amino_Acid__Peptide__or_Protein
1,C0000005,Pharmacologic_Substance
2,C0000005,Indicator__Reagent__or_Diagnostic_Aid
3,C0000039,Organic_Chemical
4,C0000039,Pharmacologic_Substance


In [20]:
# Match the CUI in each node in the database and set the Semantic Type as a label
# for index, row in mrsty.iterrows():
#     CUI = row[0]
#     STY = row[1]
#     command = '''MATCH (n:Concept_UMLS) WHERE n.cui = '{CUI}' SET n.semantic_type = '{STY}' '''.format(CUI=CUI, STY=STY)
#     session.run(command)
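A loop like the commented-out one above sends one query per row and builds each query by string formatting. An alternative, sketched here and not taken from the original notebook, is to pass the values as query parameters and send them in batches with `UNWIND`. The helper names and the batch size are assumptions. `RecordingSession` is a stand-in used only so the sketch runs without a database; with a real connection you would pass the `session` object created earlier.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

# One parameterized query handles a whole batch of (cui, sty) rows.
SET_STY_QUERY = (
    "UNWIND $rows AS row "
    "MATCH (n:Concept_UMLS {cui: row.cui}) "
    "SET n.semantic_type = row.sty"
)


def batched(pairs: Iterable[Tuple[str, str]], size: int) -> Iterator[List[Dict[str, str]]]:
    """Yield lists of at most `size` parameter dicts."""
    batch: List[Dict[str, str]] = []
    for cui, sty in pairs:
        batch.append({"cui": cui, "sty": sty})
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def set_semantic_types(session, pairs, batch_size: int = 10000) -> int:
    """Send the pairs in batches; `session` needs a run(query, **params) method."""
    sent = 0
    for batch in batched(pairs, batch_size):
        session.run(SET_STY_QUERY, rows=batch)
        sent += len(batch)
    return sent


class RecordingSession:
    """Stand-in for a database session that only records the calls it receives."""

    def __init__(self):
        self.calls = []

    def run(self, query, **params):
        self.calls.append((query, params))


stub = RecordingSession()
sent = set_semantic_types(
    stub,
    [("C0000005", "Pharmacologic Substance"),
     ("C0000039", "Organic Chemical"),
     ("C0000052", "Enzyme")],
    batch_size=2,
)
print(sent, len(stub.calls))
```

As in the original loop, a CUI with several semantic types ends up with only the last value written, because `SET` overwrites the property.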

In [29]:
CUIs_preferred_terms.to_csv('CUIs_preferred_terms.csv', index=False, header=False, encoding='utf-8')

### Import CUIs as nodes into the graph

Move the "CUIs_preferred_terms.csv" file into your graph database's Import folder. 

In [6]:
# Create a node for each concept in the UMLS
query = '''USING PERIODIC COMMIT 100000 LOAD CSV FROM "file:///CUIs_preferred_terms.csv" AS COLUMN CREATE (:Concept_UMLS {preferred_term:COLUMN[1], cui:COLUMN[0], semantic_type:COLUMN[2], UMLS_edition:'2020AB'})'''

session.run(query)

<neo4j.work.result.Result at 0x7fb227993b20>

In [30]:
# Create a uniqueness constraint for the CUI property of each Concept_UMLS node
query = 'CREATE CONSTRAINT UniqueCUIforConceptConstraint ON (c:Concept_UMLS) ASSERT c.cui IS UNIQUE'
session.run(query)

<neo4j.work.result.Result at 0x7ff9b933f340>

Runtime for the query "MATCH (n:Concept_UMLS {cui:'C1145670'}) RETURN (n)" before and after defining uniqueness constraint on CUI:  
Before: 1343 ms  
After: 33 ms

Do not run the cells below; they are kept for reference only. They create semantic relationships among all nodes based on their semantic type, but relationships defined between semantic types do not necessarily hold between individual CUIs.

In [21]:
# Load Semantic Relationships from the SRSTR file
# srstr = pd.read_csv('/home/tim/Documents/GrApH_AI/Data/umls-2020AB-full/2020AB/NET/SRSTR', sep='|', header=None)
# srstr.columns = ['STY1', 'REL', 'STY2', 'LS', '']
# srstr.drop(labels = ['LS', ''], axis = 1, inplace=True)
# srstr.head()

Unnamed: 0,STY1,REL,STY2
0,Acquired Abnormality,co-occurs_with,Injury or Poisoning
1,Acquired Abnormality,isa,Anatomical Abnormality
2,Acquired Abnormality,result_of,Behavior
3,Activity,isa,Event
4,Age Group,isa,Group


In [24]:
# # Reformat semantic types into the standard cypher format for node labels
# srstr['STY1'] = (srstr['STY1']
#                 .str.replace(' ', '_')
#                 .str.replace(',', '_')
#                 .str.replace('-', '_')
#                 )
# srstr['STY2'] = (srstr['STY2']
#                 .str.replace(' ', '_')
#                 .str.replace(',', '_')
#                 .str.replace('-', '_')
#                 )
# srstr['REL'] = (srstr['REL']
#                 .str.replace(' ', '_')
#                 .str.replace(',', '_')
#                 .str.replace('-', '_')
#                 .str.upper()
#                )
# srstr.head()

Unnamed: 0,STY1,REL,STY2
0,Acquired_Abnormality,CO_OCCURS_WITH,Injury_or_Poisoning
1,Acquired_Abnormality,ISA,Anatomical_Abnormality
2,Acquired_Abnormality,RESULT_OF,Behavior
3,Activity,ISA,Event
4,Age_Group,ISA,Group


In [25]:
# for index, row in srstr.iterrows():
#     STY1 = row[0]
#     REL = row[1]
#     STY2 = row[2]
#     query = '''"MATCH (n:{STY1}), (m:{STY2}) RETURN n, m", "MERGE (n)-[:{REL}]->(m)"'''.format(STY1=STY1, STY2=STY2, REL=REL)
#     command = 'CALL apoc.periodic.iterate('+query+', {batchSize:1000, parallel: true, iterateList:true})'
#     session.run(command)

## Create a node for each Lexical Unique Identifier

### Create a CSV file that includes only one row for each LUI using the preferred term for the LUI

### Import LUIs as nodes into the graph

In [None]:
# Create a node for each LUI in the UMLS


In [None]:
# Create a uniqueness constraint for the LUI property of each LUI_UMLS node

In [11]:
# Merge LUIs into CUIs
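# NOTE: the label and property names below are assumptions for illustration; adjust them to your schema
child_node, child_fk = 'LUI_UMLS', 'cui'        # hypothetical child label and foreign-key property
parent_node, parent_fk = 'Concept_UMLS', 'cui'  # label and key property of the concept nodes created above
command = 'CALL apoc.periodic.iterate(\"MATCH (cn:{child_node}) MATCH (pn:{parent_node} {{{parent_fk}:cn.{child_fk}}}) RETURN cn, pn\", \"CREATE (cn)-[:CHILD_OF]->(pn)\", {{batchSize:10000, parallel: true, iterateList:true}})'.format(child_node=child_node, parent_node=parent_node, child_fk=child_fk, parent_fk=parent_fk)
session.run(command)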

In [None]:
# TS - identifies preferred LUI of the CUI in a given language
# ISPREF - identifies preferred AUI of the SUI

In [None]:
# for row in 

# properties = {cui:, preferred_term:}

# '''USING PERIODIC COMMIT 100000 LOAD CSV FROM "file:///{csv_file}" AS COLUMN CREATE (:Concept:UMLS {properties})'''.format(csv_file=csv_file, properties=properties)

# '''USING PERIODIC COMMIT 100000 LOAD CSV FROM "file:///{csv_file}" AS COLUMN CREATE (n:{label} {properties})'''.format(csv_file=csv_file, label=label, properties=properties)

# Use ISPREF to select only those rows marked as preferred string (Y) for the concept

In [None]:
# MERGE nodes for LUI, SUI, and AUI in that order with each concept node

# '''MATCH cui
# RETURN cui node,
# MERGE 
# '''



In [None]:
# Create constraints for CUI, LUI, CUI, and source vocabulary

## Create a CSV with only CUIs and all strings

In [13]:
str_to_CUI = mrconso[['STR', 'CUI']].copy()
str_to_CUI['STR'].value_counts()

procedimiento retirado - RETIRADO - (concepto no activo)        4035
procedimiento retirado - RETIRADO -                             4035
ALCOHOL 80 mL in 100 mL TOPICAL LIQUID [Hand Sanitizer]          732
Deprecated                                                       676
OXYGEN 99 L in 100 L RESPIRATORY (INHALATION) GAS                499
                                                                ... 
ulceración mamaria                                                 1
tímpano - hallazgo                                                 1
Неспецифические отклонения от нормы показателей кардиограммы       1
gewrichtskraakbeenaandoening, schouderstreek                       1
Debilidad de las extremidades inferiores                           1
Name: STR, Length: 13058674, dtype: int64

In [14]:
str_to_CUI.drop_duplicates(subset='STR', inplace=True)
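Note that `drop_duplicates(subset='STR')` keeps only the first CUI seen for each string, so a string that names more than one concept loses its other mappings. The sketch below, on made-up rows, shows one way to list such strings before dropping them. The strings and CUIs are toy values.

```python
import pandas as pd

# Toy rows (made up): "shared name" points at two different concepts.
toy = pd.DataFrame(
    {
        "STR": ["shared name", "shared name", "unique name", "unique name"],
        "CUI": ["C0000001", "C0000002", "C0000003", "C0000003"],
    }
)

# Count distinct CUIs per string and keep the strings with more than one.
cuis_per_str = toy.groupby("STR")["CUI"].nunique()
ambiguous = cuis_per_str[cuis_per_str > 1]

# What drop_duplicates keeps: the first row for each string.
first_only = toy.drop_duplicates(subset="STR")

print(ambiguous)
print(first_only)
```

Whether keeping only the first mapping is acceptable depends on how the string lookup will be used downstream.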

In [15]:
str_to_CUI['STR'].value_counts()

Luxation interne fermée de l'extrémité proximale du tibia             1
leukemia; myeloid                                                     1
Parmelia amplissima                                                   1
Alpha-B crystallin-related late-onset distal myopathy                 1
ZHIRNYKH KISLOT SINTETAZ KOMPLEKS TIPA II                             1
                                                                     ..
positive regulation of smooth muscle cell apoptosis                   1
Removal of all drains (procedure)                                     1
Persistent headache due to and following injury of head (disorder)    1
fractura expuesta de rótula, vertical (trastorno)                     1
Debilidad de las extremidades inferiores                              1
Name: STR, Length: 13058674, dtype: int64

In [18]:
# Merge semantic type ("STY") into the str_to_CUI dataframe
str_to_CUI = pd.merge(str_to_CUI, mrsty, on=['CUI'])

In [19]:
str_to_CUI.sort_values(by=['STR'], inplace=True)

In [20]:
str_to_CUI.head()

Unnamed: 0,STR,CUI,STY
7431708,""""" w/o Surgery Capability",C1548830,Health Care Activity
3545113,Debulking (résection) de tumeur,C0439805,Therapeutic or Preventive Procedure
5181040,Wet prep positif,C0861028,Laboratory or Test Result
14119101,!Orthotrichum mandonii,C5257799,Plant
14119099,"!Orthotrichum mandonii Schimp. ex Hampe, 1865",C5257799,Plant


In [21]:
str_to_CUI.to_csv('str_to_CUI.csv', index=False, encoding='utf-8')

## UMLS Database Query Diagram

<img style="float: left;" src="images/UMLS_find_all_data_for_cui.gif">
This diagram shows how to find all information associated with a particular UMLS concept (CUI value). The information returned could be compiled into a "concept report" showing an overview of what the UMLS concept contains.

Certain files contain both AUI (Atom Unique Identifier) and CUI (Concept Unique Identifier) fields. In these cases, a CUI search will yield results for all atoms that exist within that concept. For example, a CUI search in MRDEF.RRF will yield all definitions associated with all atoms (AUIs) within the concept having that CUI.

When performing CUI searches in MRREL.RRF, the resulting rows will yield a CUI2 value that is searchable in MRCONSO.RRF. That search will identify the concept on the other side of the relationship.

Corresponding Oracle Queries:

1. Find all atoms of a UMLS concept.

SELECT * FROM mrconso
WHERE cui = 'C0032344';

2. Find all source definitions associated with a UMLS concept.

SELECT * FROM mrdef
WHERE cui = 'C0032344';

3. Find all source contexts associated with a UMLS concept.

SELECT * FROM mrhier
WHERE cui = 'C0032344';

4. Find all attributes for a UMLS concept.

SELECT * FROM mrsat
WHERE cui = 'C0032344'
     AND stype = 'CUI';

5. Find all semantic types for a UMLS concept.

SELECT * FROM mrsty
WHERE cui = 'C0032344';

6.a. Find all relationships for a UMLS concept.
Note: In MRREL, the REL/RELA always expresses the nature of the relationship from CUI2 to the "current concept", CUI1. Because we're querying CUI1 below, this represents the "natural" direction of the relationship.

SELECT * FROM mrrel
WHERE cui1 = 'C0032344'
     AND stype1 = 'CUI';

6.b. Find all inverse relationships for a UMLS concept.
Note: In MRREL, the REL/RELA always expresses the nature of the relationship from CUI2 to the "current concept", CUI1. Because we're querying CUI2 below, this represents the opposite of the "natural" direction of the relationship.

SELECT * FROM mrrel
WHERE cui2 = 'C0032344'
     AND stype2 = 'CUI';

7. Find all relationships for a concept and the preferred (English) name of the CUI2.

SELECT a.cui1, a.cui2, b.str FROM mrrel a, mrconso b
WHERE a.cui1 = 'C0032344'
     AND a.stype1 = 'CUI'
     AND a.cui2 = b.cui
     AND b.ts = 'P'
     AND b.stt = 'PF'
     AND b.ispref = 'Y'
     AND b.lat = 'ENG';

[Source](https://www.nlm.nih.gov/research/umls/implementation_resources/query_diagrams/er1.html) 
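For readers without an Oracle instance, query 7 can be tried against an in-memory SQLite database from Python. This is a sketch only: the tables contain just the columns the query touches, and every row is a made-up toy value, not real UMLS content.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Reduced toy schemas: only the columns used by query 7.
conn.executescript(
    """
    CREATE TABLE mrrel (cui1 TEXT, stype1 TEXT, rel TEXT, cui2 TEXT, stype2 TEXT);
    CREATE TABLE mrconso (cui TEXT, lat TEXT, ts TEXT, stt TEXT, ispref TEXT, str TEXT);
    """
)
conn.executemany(
    "INSERT INTO mrrel VALUES (?, ?, ?, ?, ?)",
    [
        ("C0000001", "CUI", "RO", "C0000002", "CUI"),
        ("C0000001", "AUI", "RO", "C0000003", "AUI"),  # excluded: stype1 is not CUI
    ],
)
conn.executemany(
    "INSERT INTO mrconso VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("C0000002", "ENG", "P", "PF", "Y", "toy concept B"),
        ("C0000002", "FRE", "P", "PF", "Y", "concept jouet B"),  # excluded: not English
        ("C0000002", "ENG", "S", "PF", "Y", "toy synonym B"),    # excluded: TS is not P
        ("C0000003", "ENG", "P", "PF", "Y", "toy concept C"),
    ],
)

# Query 7, with the CUI passed as a bound parameter.
rows = conn.execute(
    """
    SELECT a.cui1, a.cui2, b.str FROM mrrel a, mrconso b
    WHERE a.cui1 = ?
         AND a.stype1 = 'CUI'
         AND a.cui2 = b.cui
         AND b.ts = 'P'
         AND b.stt = 'PF'
         AND b.ispref = 'Y'
         AND b.lat = 'ENG'
    """,
    ("C0000001",),
).fetchall()
print(rows)
conn.close()
```

The filters on `ts`, `stt`, `ispref`, and `lat` are the same ones the pandas cells earlier use to pick a single preferred English string per concept.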

## Building the Semantic Network

The Semantic Network of the UMLS:  
![The semantic network](https://www.ncbi.nlm.nih.gov/books/NBK9679/bin/ch05-Image003.jpg)

The UMLS documentation notes that "Semantic relationships may or may not hold at the concept level. For example, the relationship Clinical Drug causes Disease or Syndrome does not hold at the concept level for Aspirin and Cancer. Aspirin does not cause cancer." 
Thus, the semantic network may be useful for labeling concept nodes, but not for creating relationships among specific concept nodes.

[Source](https://uts.nlm.nih.gov/semanticnetwork.html)

# Data Reference  
Bodenreider O. The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Res. 2004 Jan 1;32(Database issue):D267-70. doi: 10.1093/nar/gkh061. PubMed PMID: 14681409; PubMed Central PMCID: PMC308795.