# Multi-source data import 

Sources:
- MIMIC-III. Covers the years 2001-2012. Has free-text notes.  
- MIMIC-IV. Covers the years 2008-2019. Has physician order entry data, reference ranges for lab values, and some other changes. Doesn't have free-text notes as of this writing.
- UMLS. Provides a common set of concepts that form a central connection point for many other sources such as RxNorm and MeSH.
- RxNorm. Has drug-drug and drug-disease interactions, indications, contraindications, etc.  
- MeSH. Has broader-narrower relationships among hierarchically-related terms.
- PubMed. Has the majority of the world's medical literature in free text, with abstracts freely available and accessible through an API.

## Information about each source

### MIMIC-III
Schema of MIMIC-III: https://mit-lcp.github.io/mimic-schema-spy/index.html

### MIMIC-IV
Documentation for MIMIC-IV (no schema on schema spy as of this writing): 

### RxNorm 
Connect various forms/dosages/routes of a clinical drug to the underlying pharmacologic substance  
![](images/RxNorm_relationships_among_RXCUIs.png)  
Note that the "TTY" field in the graph above corresponds to the heading of each box below.  
![](images/RxNorm_CUIs_related_to_coumadin.png)

Relate each pharmacologic substance to other drugs with interaction info  
![](images/RxNorm_drug_interactions_warfarin.png)  

Connect clinically relevant properties of drugs   
![](images/RxNorm_clinical_properties_relationships.png)  

RxNorm main landing page: https://www.nlm.nih.gov/research/umls/rxnorm/index.html  
AMIA article describing RxNorm: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3128404/  
Data downloads: https://www.nlm.nih.gov/research/umls/rxnorm/docs/rxnormfiles.html  
Web-based browser: https://mor.nlm.nih.gov/RxNav/search?searchBy=String&searchTerm=coumadin  
Technical docs: https://www.nlm.nih.gov/research/umls/rxnorm/docs/index.html  


The full download of RxNorm files contains a directory called "rrf" with the following contents:

|File|Size (bytes)|
|---|---|
|RXNCONSO.RRF|121,180,353|
|RXNDOC.RRF|218,467|
|RXNREL.RRF|503,188,245|
|RXNSAB.RRF|10,698|
|RXNSAT.RRF|502,793,103|
|RXNSTY.RRF|17,996,450|

Archival files for tracking RxNorm historical content:

|File|Size (bytes)|
|---|---|
|RXNATOMARCHIVE.RRF|74,069,962|
|RXNCUICHANGES.RRF|39,589|
|RXNCUI.RRF|1,716,694|

In [2]:
import pandas as pd

In [3]:
# Load RXNREL.RRF into a dataframe
rxnrel = pd.read_csv('/home/tim/Documents/GrApH_AI/Data/RxNorm_full_06072021/rrf/RXNREL.RRF', sep='|', header=None, encoding='utf-8')
rxnrel[:5]


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16
0,,5.0,AUI,SY,,6.0,AUI,permuted_term_of,155592245.0,,MSH,,,,,,
1,,5.0,SDUI,SIB,,104746.0,SDUI,,154524204.0,,MSH,,,,,,
2,,5.0,SDUI,RN,,609702.0,SDUI,mapped_to,154691227.0,,MSH,,1.0,,,,
3,,5.0,AUI,SY,,2666961.0,AUI,sort_version_of,155371534.0,,MSH,,,,,,
4,,5.0,AUI,SY,,2681015.0,AUI,entry_version_of,155054914.0,,MSH,,,,,,
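
The dataframe above has unnamed integer columns and an empty final column, because RRF files have no header row and every line ends with a `|`. The following is a minimal sketch of a loader that names the columns. `read_rrf` is a hypothetical helper, and the column names are an assumption based on the RxNorm technical documentation for RXNREL.RRF; check them against the docs linked above before relying on them. The two sample lines are modeled on the rows shown in the output.

```python
import io
import pandas as pd

# Assumed column names for RXNREL.RRF (verify against the RxNorm technical docs).
RXNREL_COLUMNS = [
    'RXCUI1', 'RXAUI1', 'STYPE1', 'REL', 'RXCUI2', 'RXAUI2', 'STYPE2', 'RELA',
    'RUI', 'SRUI', 'SAB', 'SL', 'DIR', 'RG', 'SUPPRESS', 'CVF',
]

def read_rrf(path_or_buffer, columns):
    """Hypothetical helper: read a pipe-delimited RRF file and name its columns."""
    df = pd.read_csv(path_or_buffer, sep='|', header=None, dtype=str,
                     encoding='utf-8')
    # Each RRF line ends with '|', so the last parsed column is empty; drop it.
    df = df.iloc[:, :len(columns)]
    df.columns = columns
    return df

# Two illustrative lines modeled on the rows shown above.
sample = io.StringIO(
    '|5|AUI|SY||6|AUI|permuted_term_of|155592245||MSH||||||\n'
    '|5|SDUI|RN||609702|SDUI|mapped_to|154691227||MSH||1||||\n'
)
rxnrel_named = read_rrf(sample, RXNREL_COLUMNS)
print(rxnrel_named[['RXAUI1', 'REL', 'RXAUI2', 'RELA', 'SAB']])
```

Reading everything as strings (`dtype=str`) keeps identifiers from being converted to floats such as `5.0`, at the cost of converting numeric columns explicitly later.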


In [4]:
rxnrel.iloc[:,7].value_counts()

inactive_ingredient_of             1532663
has_inactive_ingredient            1532663
active_ingredient_of                357487
has_active_ingredient               357487
has_active_moiety                   337459
active_moiety_of                    337459
has_ingredient                      323817
ingredient_of                       323817
inverse_isa                         242390
isa                                 242390
dose_form_of                        124076
has_dose_form                       124076
constitutes                         107708
consists_of                         107708
tradename_of                         98860
has_tradename                        98860
doseformgroup_of                     34806
has_doseformgroup                    34806
has_print_name                       27671
print_name_of                        27671
ingredients_of                       11308
has_ingredients                      11308
has_precise_ingredient               10992
precise_ing

In [19]:
# Load RXNSAT.RRF (Simple Concept and Atom Attributes) into a dataframe
# Column headers and descriptions: https://www.nlm.nih.gov/research/umls/rxnorm/docs/techdoc.html#sat
columns = ['RXCUI', 'LUI', 'SUI', 'RXAUI', 'STYPE', 'CODE', 'ATUI', 'SATUI', 'ATN', 'SAB', 'ATV', 'SUPPRESS', 'CVF']
rxnsat = pd.read_csv('/home/tim/Documents/GrApH_AI/Data/RxNorm_full_06072021/rrf/RXNSAT.RRF', sep='|', header=None, encoding='utf-8')
rxnsat = rxnsat.iloc[:,:13] # Keep the 13 named columns; drop the empty trailing column at index 13
rxnsat.columns = columns
rxnsat[:5]

Unnamed: 0,RXCUI,LUI,SUI,RXAUI,STYPE,CODE,ATUI,SATUI,ATN,SAB,ATV,SUPPRESS,CVF
0,38,,,829,AUI,38,,,RXN_BN_CARDINALITY,RXNORM,single,N,4096.0
1,38,,,8056626,AUI,D001971,AT212333259,,TERMUI,MSH,T005606,N,
2,38,,,8056626,AUI,D001971,AT212365433,,LT,MSH,TRD,N,
3,38,,,8056626,AUI,D001971,AT212543507,,TH,MSH,UNK (19XX),N,
4,38,,,8056626,SCUI,D001971,AT60770509,,RN,MSH,0,N,


RXNSAT.RRF table info

|Column|Description|
|---|---|
|RXCUI|Unique identifier for concept (concept id)|  
|LUI|Unique identifier for term (no value provided)|  
|SUI|Unique identifier for string (no value provided)|  
|RXAUI|RxNorm atom identifier (RXAUI) or RxNorm relationship identifier (RUI).|  
|STYPE|The name of the column in RXNCONSO.RRF or RXNREL.RRF that contains the identifier to which the attribute is attached, e.g., CUI, AUI.|  
|CODE|"Most useful" source asserted identifier (if the source vocabulary has more than one identifier), or a RxNorm-generated source entry identifier (if the source vocabulary has none.)|  
|ATUI|Unique identifier for attribute|  
|SATUI|Source asserted attribute identifier (optional - present if it exists)|  
|ATN|Attribute name (e.g. NDC). Possible values appear in RXNDOC.RRF and are described on the UMLS Attribute Names page|  
|SAB|Abbreviation of the source of the attribute. Possible values appear in RXNSAB.RRF and are listed on the UMLS Source Vocabularies page|  
|ATV|Attribute value described under specific attribute name on the UMLS Attribute Names page (e.g. 000023082503 where ATN = 'NDC'). A few attribute values exceed 1,000 characters. Many of the abbreviations used in attribute values are explained in RXNDOC.RRF and are included on the UMLS Abbreviations Used in Data Elements page|  
|SUPPRESS|Suppressible flag. Values = O, Y, or N. Reflects the suppressible status of the attribute. N - Attribute is not suppressed. O - Attribute is suppressed at source level. Y - Attribute is suppressed by RxNorm editors.|  
|CVF|Content view flag. RxNorm includes one value, '4096', to denote inclusion in the Current Prescribable Content subset. All rows with CVF='4096' can be found in the subset.| 

In [27]:
pd.set_option("display.max_rows", 120)
rxnsat['ATN'].value_counts() #Table listing attribute names and descriptions: https://www.nlm.nih.gov/research/umls/knowledge_sources/metathesaurus/release/attribute_names.html

NDC                                                               1820557
SPL_SET_ID                                                        1684527
LABELER                                                            360647
DM_SPL_ID                                                          193547
LABEL_TYPE                                                         186919
MARKETING_EFFECTIVE_TIME_LOW                                       184220
MARKETING_CATEGORY                                                 183116
MARKETING_STATUS                                                   183051
DDF                                                                148523
DCSA                                                               141358
DRT                                                                113332
DST                                                                103673
COLORTEXT                                                           78851
COLOR                                 

### MED-RT
Connect medications with other concept types such as diseases, phenotypes, etc.

How MED-RT connects multiple source vocabularies:  
![image.png](images/MED_RT_content_model.png)  
Figure source: https://evs.nci.nih.gov/ftp1/MED-RT/MED-RT%20Documentation.pdf  

Sample of some relationships specified in MED-RT:  
![image.png](images/MED_RT_relationships.png)  
Screenshot source: https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/MED-RT/metarepresentation.html#relationships 

MEDRT_MoA_NUIs file is an index of mechanisms of action.  
Sample line from the file:  
Acetylcholine Release Inhibitors [MoA]	N0000175770	MED-RT  
Possible ways to store the data:
- Each line becomes a node with the label "Mechanism_of_Action"
- Each line becomes a property of a drug node

MEDRT_PE_NUIs file is an index of physiologic effects.  
Sample line from the file:  
Acetylcholine Activity Alteration [PE]	N0000008290	MED-RT  
Possible ways to store the data:
- Each line becomes a node with the label "Physiologic_Effect"
- Each line becomes a property of an existing UMLS concept node
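
As a rough illustration of the first storage option (one node per line), the sketch below parses lines shaped like the samples above into dictionaries that could later be written out as graph nodes. `parse_nui_index` is a hypothetical helper, and it assumes every line has exactly three tab-separated fields (name, NUI, namespace), as in the sample lines.

```python
import csv
import io

def parse_nui_index(lines, label):
    """Hypothetical helper: turn 'name<TAB>NUI<TAB>namespace' lines into node dicts."""
    nodes = []
    # QUOTE_NONE so that any quote characters in names are kept literally.
    reader = csv.reader(lines, delimiter='\t', quoting=csv.QUOTE_NONE)
    for name, nui, namespace in reader:
        nodes.append({'label': label, 'name': name, 'nui': nui,
                      'namespace': namespace})
    return nodes

# The sample line from the MEDRT_MoA_NUIs file shown above.
moa_lines = io.StringIO(
    'Acetylcholine Release Inhibitors [MoA]\tN0000175770\tMED-RT\n'
)
moa_nodes = parse_nui_index(moa_lines, 'Mechanism_of_Action')
print(moa_nodes[0])
```

The same helper would apply to the MEDRT_PE_NUIs file with the label "Physiologic_Effect".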

### Excerpt from MED-RT_Schema_v1.xsd

AssociationDef - definition of Association
	<xs:complexType name="AssociationDef">
		<xs:annotation>
			<xs:documentation> This element includes all types of Associations: Synonyms, Term Associations and Concept Associations.
			</xs:documentation>
		</xs:annotation>
		<xs:sequence>
			<xs:element name="namespace" type="xs:token"/>
			<xs:element name="name" type="xs:token"/>
			<!-- name of AssociationType -->
			<xs:group ref="FromElement"/>
			<xs:group ref="ToElement"/>
			<xs:element name="qualifier" type="QualifierDef" minOccurs="0" maxOccurs="unbounded"/>
		</xs:sequence>
	</xs:complexType>
	<xs:group name="ToElement">
		<xs:annotation>
			<xs:documentation> A reference from the local Concept/Term to another Concept/Term (in any Namespace).
			</xs:documentation>
		</xs:annotation>
		<xs:sequence>
			<xs:element name="to_namespace" type="xs:token"/>
			<xs:element name="to_name" type="xs:token">
				<xs:annotation>
					<xs:documentation>MED-RT: Concept Name
MeSH: Preferred Term
RxNorm: Preferred Term
SNOMED CT: FSN Synonym</xs:documentation>
				</xs:annotation>
			</xs:element>
			<!-- name of target Concept/Term -->
			<xs:element name="to_code" type="xs:token" minOccurs="0">
				<xs:annotation>
					<xs:documentation>MED-RT: NUI
MeSH: Code in Source
RxNorm: Code in Source
SNOMED CT: Code in Source</xs:documentation>
				</xs:annotation>
			</xs:element>
			<!-- code of target Term -->
		</xs:sequence>
	</xs:group>
	<xs:group name="FromElement">
		<xs:annotation>
			<xs:documentation> A reference to the local Concept/Term from another Concept/Term (in a different Namespace).
			</xs:documentation>
		</xs:annotation>
		<xs:sequence>
			<xs:element name="from_namespace" type="xs:token"/>
			<xs:element name="from_name" type="xs:token">
				<xs:annotation>
					<xs:documentation>MED-RT: Concept Name
MeSH: Preferred Term
RxNorm: Preferred Term
SNOMED CT: FSN Synonym</xs:documentation>
				</xs:annotation>
			</xs:element>
			<!-- name of source Concept/Term -->
			<xs:element name="from_code" type="xs:token">
				<xs:annotation>
					<xs:documentation>MED-RT: NUI
MeSH: Code in Source
RxNorm: Code in Source
SNOMED CT: Code in Source</xs:documentation>
				</xs:annotation>
			</xs:element>
			<!-- code of source Term -->
		</xs:sequence>
	</xs:group>

### FDA's Structured Product Labels
"The Structured Product Labeling (SPL) is a document markup standard approved by Health Level Seven (HL7) and adopted by FDA as a mechanism for exchanging product and facility information." - U.S. FDA  
SPL Resources: https://www.fda.gov/industry/fda-resources-data-standards/structured-product-labeling-resources  
Download data: https://dailymed.nlm.nih.gov/dailymed/spl-resources-all-drug-labels.cfm

### MeSH
Connect hierarchically related terms with broader-narrower relationships  
![Broader-narrower relationships among MeSH concepts](images/MeSH_relationships.png)  
MeSH contributes broader-narrower connections as displayed in the UMLS browser:  
![](images/MeSH_broader_narrower_in_UMLSbrowser.png)

RDF format for MeSH: https://id.nlm.nih.gov/mesh/, https://hhs.github.io/meshrdf/  
Concept structure of MeSH: https://www.nlm.nih.gov/mesh/concept_structure.html


### SemMedDB


SemMedDB can be downloaded as MySQL files or CSV files [here](https://ii.nlm.nih.gov/SemRep_SemMedDB_SKR/SemMedDB/SemMedDB_download.shtml). These are the CSV files:  

|TABLE NAME|Size compressed|Size uncompressed|# Rows|  
|---|---|---|---|  
|CITATIONS|152M|1.7G|32,470,549|  
|ENTITY|39G|160.5G|1,555,897,812|  
|GENERIC_CONCEPT|3.9K|9.3K|259|  
|PREDICATION|2.7G|15G|111,846,030|  
|PREDICATION_AUX|3.6G|16.4G|111,846,028|  
|SENTENCE|14G|43.8G|219,049,752|  


Schema of SemMedDB version 4.2 and later  
![image](images/Schema_SemMedDB_4.2.png)

Let's have a look at each table. Column names are obtained from the [schema documentation](https://ii.nlm.nih.gov/SemRep_SemMedDB_SKR/dbinfo.shtml). 

In [11]:
columns = ['PMID', 'ISSN', 'DP', 'EDAT', 'PYEAR']
citations = pd.read_csv('/home/tim/Documents/GrApH_AI/Data/SemMedDB/semmedVER43_2021_R_CITATIONS.23871.csv', header=None, names = columns, nrows=100)
citations.head()

Unnamed: 0,PMID,ISSN,DP,EDAT,PYEAR
0,1,0006-2944,1975 Jun,1975-6-1,1975
1,10,1873-2968,1975 Sep 01,1975-9-1,1975
2,100,0547-6844,1975,1975-1-1,1975
3,1000,0264-6021,1975 Sep,1975-9-1,1975
4,10000,0006-3002,1976 Sep 28,1976-9-28,1976


In [21]:
# Note that the order of columns in the actual data differs slightly from the documentation
columns = ['ENTITY_ID', 'SENTENCE_ID', 'CUI', 'NAME', 'TYPE', 'GENE_ID', 'GENE_NAME', 'TEXT', 'SCORE', 'START_INDEX', 'END_INDEX']
entity = pd.read_csv('/home/tim/Documents/GrApH_AI/Data/SemMedDB/semmedVER43_2021_R_ENTITY.23871.csv', header=None, names = columns, nrows=100)
entity.head()

Unnamed: 0,ENTITY_ID,SENTENCE_ID,CUI,NAME,TYPE,GENE_ID,GENE_NAME,TEXT,SCORE,START_INDEX,END_INDEX
454796,774955,2784046,C0162783,Prefrontal Cortex,bpoc,,,prefrontal cortex,888,33,50
454797,774852,2724035,C0597134,oral bacteria,bact,,,oral bacteria,1000,34,47
454798,52631916,1793713,C0242485,Measurement,ftcn,,,measurements,1000,24,36
454799,774991,2573972,C0003250,Monoclonal Antibodies,"aapp,imft",,,Monoclonal antibodies,1000,21,42
454800,776913,2363841,C0205245,Functional,ftcn,,,functional,694,52,62
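
The row labels in the output above start at 454796 instead of 0. That can happen when a file has more fields per line than the `names` list, in which case pandas may use the leading field as the index and shift the remaining labels. A quick check, sketched below, is to count the fields in the first line and compare with the number of names. `first_row_field_count` is a hypothetical helper, and the sample line is copied from the first output row above (index value included).

```python
import csv
import io

entity_columns = ['ENTITY_ID', 'SENTENCE_ID', 'CUI', 'NAME', 'TYPE', 'GENE_ID',
                  'GENE_NAME', 'TEXT', 'SCORE', 'START_INDEX', 'END_INDEX']

def first_row_field_count(handle):
    """Hypothetical helper: number of comma-separated fields in the first line."""
    return len(next(csv.reader(handle)))

# First output row above, including the value pandas displayed as the row label.
sample = io.StringIO(
    '454796,774955,2784046,C0162783,Prefrontal Cortex,bpoc,,,'
    'prefrontal cortex,888,33,50\n'
)
n_fields = first_row_field_count(sample)
print(f'{n_fields} fields in the data, {len(entity_columns)} names supplied')
```

For a file on disk, the handle would come from `open(path, newline='')`. If the counts differ, the column names need to be revisited before the labels can be trusted.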


In [20]:
columns = ['CONCEPT_ID', 'CUI', 'PREFERRED_NAME']
generic_concept = pd.read_csv('/home/tim/Documents/GrApH_AI/Data/SemMedDB/semmedVER43_2021_R_GENERIC_CONCEPT.csv', header=None, names = columns, nrows=100)
generic_concept.head()

Unnamed: 0,CONCEPT_ID,CUI,PREFERRED_NAME
0,1983,C0001687,Adverse effects NEC
1,1984,C0002526,"Amino Acids, Peptides, and Proteins"
2,1985,C0003043,Animalia
3,1986,C0003062,Animals
4,1987,C0005515,Biological Factors


In [41]:
# Note that three spurious columns exist after the last column named below, each containing '/n'
columns = ['PREDICATION_ID', 'SENTENCE_ID', 'PMID', 'PREDICATE', 'SUBJECT_CUI', 'SUBJECT_NAME', 'SUBJECT_SEMTYPE', 'SUBJECT_NOVELTY', 'OBJECT_CUI', 'OBJECT_NAME', 'OBJECT_SEMTYPE', 'OBJECT_NOVELTY']
predication = pd.read_csv('/home/tim/Documents/GrApH_AI/Data/SemMedDB/semmedVER43_2021_R_PREDICATION.23871.csv', usecols = [0,1,2,3,4,5,6,7,8,9,10,11], header=None, names = columns, nrows=10000000)
predication.head()

Unnamed: 0,PREDICATION_ID,SENTENCE_ID,PMID,PREDICATE,SUBJECT_CUI,SUBJECT_NAME,SUBJECT_SEMTYPE,SUBJECT_NOVELTY,OBJECT_CUI,OBJECT_NAME,OBJECT_SEMTYPE,OBJECT_NOVELTY
0,10592604,16,16530475,PROCESS_OF,C0003725,Arboviruses,virs,1,C0999630,Lepus capensis,mamm,1
1,10592697,17,16530475,ISA,C0039258,Tahyna virus,virs,1,C0446169,California Group Viruses,virs,1
2,10592728,17,16530475,ISA,C0318627,Eyach virus,virs,1,C0206590,Coltivirus,virs,1
3,10592759,17,16530475,ISA,C0446169,California Group Viruses,virs,1,C0003725,Arboviruses,virs,1
4,10592832,18,16530475,PROCESS_OF,C0012634,Disease,dsyn,0,C0020114,Human,humn,0


In [2]:
# Select only the columns of interest from the predication table
columns = ['SENTENCE_ID', 'PMID', 'PREDICATE', 'SUBJECT_CUI','OBJECT_CUI']
predication_select = pd.read_csv('/home/tim/Documents/GrApH_AI/Data/SemMedDB/semmedVER43_2021_R_PREDICATION.23871.csv', usecols = [1,2,3,4,8], header=None, names = columns)
predication_select.head()

Unnamed: 0,SENTENCE_ID,PMID,PREDICATE,SUBJECT_CUI,OBJECT_CUI
0,16,16530475,PROCESS_OF,C0003725,C0999630
1,17,16530475,ISA,C0039258,C0446169
2,17,16530475,ISA,C0318627,C0206590
3,17,16530475,ISA,C0446169,C0003725
4,18,16530475,PROCESS_OF,C0012634,C0020114


In [72]:
# Check what types of predicates are most common
predicates = predication_select['PREDICATE'].value_counts()
predicates.head(20)

PROCESS_OF         23376639
LOCATION_OF        18730787
PART_OF            10876791
TREATS             10333343
ISA                 6207073
AFFECTS             5053621
USES                5050811
COEXISTS_WITH       4249601
INTERACTS_WITH      3952863
CAUSES              3074874
ASSOCIATED_WITH     2636789
STIMULATES          2156497
ADMINISTERED_TO     1750345
INHIBITS            1568002
AUGMENTS            1274477
compared_with       1214187
DIAGNOSES           1106466
DISRUPTS            1081220
PRODUCES            1015853
PREDISPOSES          804967
Name: PREDICATE, dtype: int64

In [3]:
causal = predication_select[predication_select['PREDICATE'] == 'CAUSES'].copy()
causal

Unnamed: 0,SENTENCE_ID,PMID,PREDICATE,SUBJECT_CUI,OBJECT_CUI
5,18,16530475,CAUSES,C0042776,C0012634
46,103,16530483,CAUSES,C1504598,C0004368
53,112,16530485,CAUSES,C0812258|1869|7332,C0162638
55,116,16530485,CAUSES,C0812258|1869|7332,C0162638
161,384,16530601,CAUSES,C0403425,C1261469
...,...,...,...,...,...
111845480,367045376,33909265,CAUSES,C5203670,C0011065
111845666,367045836,33909316,CAUSES,C3166216,C1262477
111845766,367046022,33909339,CAUSES,C0007222,C0011065
111845834,367046134,33909350,CAUSES,C0042210,C0040034


In [98]:
causal.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3074874 entries, 5 to 111845852
Data columns (total 5 columns):
 #   Column       Dtype 
---  ------       ----- 
 0   SENTENCE_ID  int64 
 1   PMID         int64 
 2   PREDICATE    object
 3   SUBJECT_CUI  object
 4   OBJECT_CUI   object
dtypes: int64(2), object(3)
memory usage: 140.8+ MB


In [4]:
# Some items in the SUBJECT_CUI column have extra pipe-delimited numbers after the CUI
# (e.g. 'C0812258|1869|7332'). Keep only the part before the first '|'.
causal['SUBJECT_CUI'] = causal['SUBJECT_CUI'].str.split('|').str[0]
causal

Unnamed: 0,SENTENCE_ID,PMID,PREDICATE,SUBJECT_CUI,OBJECT_CUI
5,18,16530475,CAUSES,C0042776,C0012634
46,103,16530483,CAUSES,C1504598,C0004368
53,112,16530485,CAUSES,C0812258,C0162638
55,116,16530485,CAUSES,C0812258,C0162638
161,384,16530601,CAUSES,C0403425,C1261469
...,...,...,...,...,...
111845480,367045376,33909265,CAUSES,C5203670,C0011065
111845666,367045836,33909316,CAUSES,C3166216,C1262477
111845766,367046022,33909339,CAUSES,C0007222,C0011065
111845834,367046134,33909350,CAUSES,C0042210,C0040034


In [100]:
causal['SUBJECT_CUI'].str.len().value_counts()

8    2990694
4      45515
5      20558
3      12252
6       4707
2        585
9        527
0         30
1          6
Name: SUBJECT_CUI, dtype: int64

In [5]:
# Remove any rows where the length of the CUI is incorrect for the SUBJECT_CUI or OBJECT_CUI
mask = (causal['SUBJECT_CUI'].str.len() == 8) & (causal['OBJECT_CUI'].str.len() == 8)
causal = causal.loc[mask]
causal

Unnamed: 0,SENTENCE_ID,PMID,PREDICATE,SUBJECT_CUI,OBJECT_CUI
5,18,16530475,CAUSES,C0042776,C0012634
46,103,16530483,CAUSES,C1504598,C0004368
53,112,16530485,CAUSES,C0812258,C0162638
55,116,16530485,CAUSES,C0812258,C0162638
161,384,16530601,CAUSES,C0403425,C1261469
...,...,...,...,...,...
111845480,367045376,33909265,CAUSES,C5203670,C0011065
111845666,367045836,33909316,CAUSES,C3166216,C1262477
111845766,367046022,33909339,CAUSES,C0007222,C0011065
111845834,367046134,33909350,CAUSES,C0042210,C0040034
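
The length check above keeps any 8-character string. A stricter alternative, sketched below on toy rows, matches the whole value against a pattern. It assumes a valid CUI is the letter `C` followed by seven digits, which holds for every CUI shown in this notebook; the toy rows are made up from values seen in the outputs above.

```python
import pandas as pd

# Toy rows: one pipe-delimited value, one clean CUI, one bare number, one empty.
toy = pd.DataFrame({
    'SUBJECT_CUI': ['C0812258|1869|7332', 'C0042776', '1869', ''],
    'OBJECT_CUI': ['C0162638', 'C0012634', 'C0162638', 'C0011065'],
})

# Keep only the part before the first '|'.
toy['SUBJECT_CUI'] = toy['SUBJECT_CUI'].str.split('|').str[0]

# Assumed CUI shape: 'C' followed by exactly seven digits.
cui_pattern = r'C\d{7}'
mask = (toy['SUBJECT_CUI'].str.fullmatch(cui_pattern, na=False)
        & toy['OBJECT_CUI'].str.fullmatch(cui_pattern, na=False))
valid = toy.loc[mask]
print(valid)
```

`Series.str.fullmatch` requires pandas 1.1 or later; `na=False` treats missing values as non-matching.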


In [6]:
causal.to_csv('/home/tim/Documents/GrApH_AI/Data/SemMedDB/causal_predicates.csv', index=False)

Due to memory limits it was necessary to restart the notebook's kernel before attempting to load the SENTENCE table.
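
One possible way to lower peak memory, not used in this notebook, is to read a large table in chunks and keep only the rows of interest from each chunk. The sketch below applies that idea to the predicate filter. `filter_predicate_in_chunks` is a hypothetical helper, and the three toy lines use placeholder IDs and names.

```python
import io
import pandas as pd

columns = ['SENTENCE_ID', 'PMID', 'PREDICATE', 'SUBJECT_CUI', 'OBJECT_CUI']

def filter_predicate_in_chunks(path_or_buffer, predicate, chunksize=1_000_000):
    """Hypothetical helper: keep rows with one predicate, reading chunk by chunk."""
    kept = []
    reader = pd.read_csv(path_or_buffer, usecols=[1, 2, 3, 4, 8], header=None,
                         names=columns, chunksize=chunksize)
    for chunk in reader:
        # Only the matching rows of each chunk stay in memory.
        kept.append(chunk[chunk['PREDICATE'] == predicate])
    return pd.concat(kept, ignore_index=True)

# Toy PREDICATION-style lines (12 fields each); IDs and names are placeholders.
toy_csv = io.StringIO(
    '1,16,16530475,PROCESS_OF,C0003725,subj,virs,1,C0999630,obj,mamm,1\n'
    '2,17,16530475,ISA,C0039258,subj,virs,1,C0446169,obj,virs,1\n'
    '3,18,16530475,CAUSES,C0042776,subj,virs,1,C0012634,obj,dsyn,1\n'
)
causal_toy = filter_predicate_in_chunks(toy_csv, 'CAUSES', chunksize=2)
print(causal_toy)
```

Unlike the cells above, this version resets the row index, so the original row positions are not preserved.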

In [2]:
# Obtain all information necessary to retrieve the sentence stating the causal relationship
# from the article where it was found

# Note that the order of columns in the actual data differs slightly from the documentation
columns = ['SENTENCE_ID', 'TYPE', 'SENT_START_INDEX', 'SENT_END_INDEX']   
sentence = pd.read_csv('/home/tim/Documents/GrApH_AI/Data/SemMedDB/semmedVER43_2021_R_SENTENCE.23871.csv', usecols = [0,2,4,6], header=None, names = columns)
sentence.head()

Unnamed: 0,SENTENCE_ID,TYPE,SENT_START_INDEX,SENT_END_INDEX
0,6,ti,21,119
1,7,ab,125,302
2,8,ab,302,385
3,9,ab,385,578
4,10,ab,578,757


In [3]:
sentence.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 219049752 entries, 0 to 219049751
Data columns (total 4 columns):
 #   Column            Dtype 
---  ------            ----- 
 0   SENTENCE_ID       int64 
 1   TYPE              object
 2   SENT_START_INDEX  int64 
 3   SENT_END_INDEX    object
dtypes: int64(2), object(2)
memory usage: 6.5+ GB
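
In the summary above, `SENT_END_INDEX` has dtype `object` while `SENT_START_INDEX` is `int64`, which suggests that at least some end-index values were not parsed as integers. A minimal sketch for finding and handling such values follows; the toy frame and its bad value are made up for illustration.

```python
import pandas as pd

toy_sentence = pd.DataFrame({
    'SENTENCE_ID': [6, 7, 8],
    'SENT_END_INDEX': ['119', '302', 'not a number'],  # made-up bad value
})

# Values that cannot be parsed become NaN instead of raising an error.
end_index = pd.to_numeric(toy_sentence['SENT_END_INDEX'], errors='coerce')

# Inspect the offending rows before deciding whether to drop or repair them.
bad_rows = toy_sentence[end_index.isna()]
print(bad_rows)

# Nullable integer dtype keeps the good values as integers and the rest as <NA>.
toy_sentence['SENT_END_INDEX'] = end_index.astype('Int64')
print(toy_sentence.dtypes)
```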


In [3]:
sentence.to_csv('/home/tim/Documents/GrApH_AI/Data/SemMedDB/sentence_locations.csv', index=False)

In [3]:
# Load the causal predicates
causal_predicates = pd.read_csv('/home/tim/Documents/GrApH_AI/Data/SemMedDB/causal_predicates.csv')
causal_predicates.head()

Unnamed: 0,SENTENCE_ID,PMID,PREDICATE,SUBJECT_CUI,OBJECT_CUI
0,18,16530475,CAUSES,C0042776,C0012634
1,103,16530483,CAUSES,C1504598,C0004368
2,112,16530485,CAUSES,C0812258,C0162638
3,116,16530485,CAUSES,C0812258,C0162638
4,384,16530601,CAUSES,C0403425,C1261469


In [4]:
# Load the sentence locations
sentence_locations = pd.read_csv('/home/tim/Documents/GrApH_AI/Data/SemMedDB/sentence_locations.csv')
sentence_locations.head()

Unnamed: 0,SENTENCE_ID,TYPE,SENT_START_INDEX,SENT_END_INDEX
0,6,ti,21,119
1,7,ab,125,302
2,8,ab,302,385
3,9,ab,385,578
4,10,ab,578,757


In [None]:
# Merge the sentence data with the causal subject-predicate-object data
causal_predicates = pd.merge(causal_predicates, sentence_locations, on=['SENTENCE_ID'])
causal_predicates.head()
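
The merge cell above has no recorded output. As a small illustration of what such a merge does, the sketch below runs it on toy frames with two optional safeguards: `validate='many_to_one'` raises an error if a `SENTENCE_ID` appears more than once in the sentence table, and `indicator=True` marks predications that found no sentence. It uses a left join rather than the default inner join so that unmatched rows stay visible. All values are made up.

```python
import pandas as pd

# Toy rows; SENTENCE_ID 999 deliberately has no matching sentence.
toy_causal = pd.DataFrame({
    'SENTENCE_ID': [18, 103, 999],
    'PREDICATE': ['CAUSES', 'CAUSES', 'CAUSES'],
})
toy_sentences = pd.DataFrame({
    'SENTENCE_ID': [18, 103],
    'TYPE': ['ab', 'ti'],
    'SENT_START_INDEX': [0, 10],
    'SENT_END_INDEX': [9, 25],
})

merged = pd.merge(toy_causal, toy_sentences, on='SENTENCE_ID', how='left',
                  validate='many_to_one', indicator=True)

# Rows present only in the predication table.
unmatched = merged[merged['_merge'] == 'left_only']
print(merged)
print(unmatched)
```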

In [29]:
columns = ['PREDICATION_AUX_ID', 'PREDICATION_ID', 'SUBJECT_TEXT', 'SUBJECT_DIST', 'SUBJECT_MAXDIST', 'SUBJECT_START_INDEX', 'SUBJECT_END_INDEX', 'SUBJECT_SCORE', 'INDICATOR_TYPE', 'PREDICATE_START_INDEX', 'PREDICATE_END_INDEX', 'OBJECT', 'CURR_TIMESTAMP']
predication_aux = pd.read_csv('/home/tim/Documents/GrApH_AI/Data/SemMedDB/semmedVER43_2021_R_PREDICATION_AUX.23871.csv', usecols = [0,1,2,3,4,5,6,7,8,9,10,11,12], header=None, names = columns, nrows=100)
predication_aux.head()

Unnamed: 0,PREDICATION_AUX_ID,PREDICATION_ID,SUBJECT_TEXT,SUBJECT_DIST,SUBJECT_MAXDIST,SUBJECT_START_INDEX,SUBJECT_END_INDEX,SUBJECT_SCORE,INDICATOR_TYPE,PREDICATE_START_INDEX,PREDICATE_END_INDEX,OBJECT,CURR_TIMESTAMP
0,10592600,10592604,arboviruses,1,3,69,80,840,PREP,81,83,brown hares,1
1,10592679,10592697,Tahyna virus,0,0,232,244,1000,SPEC,232,279,California encephalitis serogroup,0
2,10592713,10592728,Eyach virus,0,0,196,207,1000,SPEC,196,225,genus Coltivirus,0
3,10592749,10592759,California encephalitis serogroup,0,0,246,279,901,SPEC,246,326,arthropod-borne viruses,0
4,10592816,10592832,disease,0,0,402,409,888,MOD/HEAD,396,409,human,0


In [64]:
# Note that the order of columns in the actual data differs slightly from the documentation
columns = ['SENTENCE_ID', 'PMID', 'TYPE', 'NUMBER', 'SENT_START_INDEX', 'SENTENCE', 'SENT_END_INDEX', 'SECTION_HEADER', 'NORMALIZED_SECTION_HEADER']   
sentence = pd.read_csv('/home/tim/Documents/GrApH_AI/Data/SemMedDB/semmedVER43_2021_R_SENTENCE.23871.csv', header=None, names = columns, nrows=100)
sentence.head()

Unnamed: 0,SENTENCE_ID,PMID,TYPE,NUMBER,SENT_START_INDEX,SENTENCE,SENT_END_INDEX,SECTION_HEADER,NORMALIZED_SECTION_HEADER
0,6,16530473,ti,1,21,Fluoride-selective colorimetric sensor based o...,119,,
1,7,16530473,ab,1,125,"A structurally simple colorimetric sensor, N-4...",302,,
2,8,16530473,ab,2,302,"In acetonitrile, the addition of F(-) changed ...",385,,
3,9,16530473,ab,3,385,In the presence of other anions such as CH(3)C...,578,,
4,10,16530473,ab,4,578,The association constants of anionic complexes...,757,,


In [66]:
for string_of_interest in sentence.SENTENCE[0:5]:
    print(string_of_interest, '\n')

Fluoride-selective colorimetric sensor based on thiourea binding site and anthraquinone reporter. 

A structurally simple colorimetric sensor, N-4-nitrobenzene-N'-1'-anthraquinone-thiourea (1), for anions was synthesized and characterized by (1)H NMR, ESI mass and IR methods. 

In acetonitrile, the addition of F(-) changed 1 solution from colorless to yellow. 

In the presence of other anions such as CH(3)CO(2)(-), H(2)PO(4)(-), HSO(4)(-) and Cl(-), however, the absorption spectrum of 1 was slightly red shifted with no obvious color changes observed. 

The association constants of anionic complexes followed the order of F(-)>>CH(3)CO(2)(-)>H(2)PO(4)(-)>HSO(4)(-)>Cl(-)>Br(-), which was different from the order of anion basicity. 



In [68]:
sentence.TYPE.value_counts()

ab    88
ti    12
Name: TYPE, dtype: int64

## Data model to connect the various data sources

The `loinc_code` column of the MIMIC-IV `d_labitems` table connects to UMLS through LOINC codes.
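
A minimal sketch of this link, under stated assumptions: the UMLS concept table (MRCONSO) is loaded with at least the columns `CUI`, `SAB`, and `CODE`, and LOINC atoms carry the source abbreviation `LNC`. Every value in the toy frames below is a placeholder, not a real item ID, LOINC code, or CUI.

```python
import pandas as pd

# Placeholder rows standing in for MIMIC-IV d_labitems.
d_labitems = pd.DataFrame({
    'itemid': [1, 2],
    'label': ['toy lab A', 'toy lab B'],
    'loinc_code': ['0000-0', None],  # second item has no LOINC code
})

# Placeholder rows standing in for UMLS MRCONSO (assumed columns).
mrconso = pd.DataFrame({
    'CUI': ['C0000001', 'C0000002'],
    'SAB': ['LNC', 'MSH'],
    'CODE': ['0000-0', 'D000000'],
})

# Keep one (CUI, CODE) pair per LOINC atom, then attach CUIs to lab items.
loinc_atoms = mrconso.loc[mrconso['SAB'] == 'LNC', ['CUI', 'CODE']].drop_duplicates()
lab_to_cui = d_labitems.merge(loinc_atoms, left_on='loinc_code',
                              right_on='CODE', how='left')
print(lab_to_cui[['itemid', 'label', 'loinc_code', 'CUI']])
```

The left join keeps lab items without a LOINC code, so they can be reviewed rather than silently dropped.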