# Importing MIMIC-III v1.4 data into a graph database

### About the data  
The data were obtained from multiple repositories on PhysioNet. The main body of data comes from the MIMIC-III v1.4 Clinical Database. 

#### MIMIC-III Clinical Database  
Description:
MIMIC-III is a large, freely-available database comprising deidentified health-related data associated with over forty thousand patients who stayed in critical care units of the Beth Israel Deaconess Medical Center between 2001 and 2012. The database includes information such as demographics, vital sign measurements made at the bedside (~1 data point per hour), laboratory test results, procedures, medications, caregiver notes, imaging reports, and mortality (including post-hospital discharge).   
Published: Sept. 4, 2016. Version: 1.4  
Data downloaded 13 February 2021 at https://physionet.org/content/mimiciii/1.4/

#### Phenotype Annotations for Patient Notes in the MIMIC-III Database  
Description:  
"A dataset of patient notes, all in the English language, with a focus on frequently readmitted patients, labeled with 15 clinical patient phenotypes believed to be associated with risk of recurrent Intensive Care Unit (ICU) readmission per our domain experts (co-authors LAC, PAT, DAG) as well as the literature [5-7]. Each entry in this database consists of a MIMIC-III derived Subject Identifier ("SUBJECT_ID", integer), a Hospital Admission Identifier ("HADM_ID", integer), the index from MIMIC-III v1.4 NOTEEVENTS table ("ROW_ID", integer), 15 Phenotypes (binary) including "None'' and "Unsure'', and Operator (string)."  
Published: March 5, 2020. Version: 1.20.03  
Data downloaded 13 February 2021 at https://physionet.org/content/phenotype-annotations-mimic/1.20.03/

## Import Data into a graph database
---

The downloaded data consisted of the following 27 CSV files, which collectively occupied 46.6 GB of disk space:
- ADMISSIONS.csv
- CALLOUT.csv
- CAREGIVERS.csv
- CHARTEVENTS.csv
- CPTEVENTS.csv
- DATETIMEEVENTS.csv
- D_CPT.csv
- DIAGNOSES_ICD.csv
- D_ICD_DIAGNOSES.csv
- D_ICD_PROCEDURES.csv
- D_ITEMS.csv
- D_LABITEMS.csv
- DRGCODES.csv
- ICUSTAYS.csv
- INPUTEVENTS_CV.csv
- INPUTEVENTS_MV.csv
- LABEVENTS.csv
- MICROBIOLOGYEVENTS.csv
- NOTEEVENTS_ANNOTATION.csv
- NOTEEVENTS.csv
- OUTPUTEVENTS.csv
- PATIENTS.csv
- PRESCRIPTIONS.csv
- PROCEDUREEVENTS_MV.csv
- PROCEDURES_ICD.csv
- SERVICES.csv
- TRANSFERS.csv

These files were placed in the Import folder of the MIMIC-III Neo4j database to make them readily available for import into the graph. The name of the file containing phenotype annotations was changed from ACTdb102003.csv to NOTEEVENTS_ANNOTATION.csv to improve readability.
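
Before importing, it can help to confirm that every expected file is actually in the import folder. The following is a minimal sketch, not part of the original notebook: the helper name `missing_csv_files` is hypothetical, and a temporary directory with two placeholder files stands in for the real Neo4j import folder.

```python
from pathlib import Path
import tempfile

def missing_csv_files(import_dir, expected_names):
    """Return the expected file names that are not present in import_dir."""
    import_dir = Path(import_dir)
    return [name for name in expected_names if not (import_dir / name).is_file()]

# Demonstration in a throwaway directory standing in for the Neo4j import folder
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    # Placeholder files (contents are illustrative only, not the real headers)
    (tmp / 'ACTdb102003.csv').write_text('HADM_ID,SUBJECT_ID,ROW_ID\n')
    (tmp / 'PATIENTS.csv').write_text('ROW_ID,SUBJECT_ID\n')

    expected = ['PATIENTS.csv', 'NOTEEVENTS_ANNOTATION.csv']
    before = missing_csv_files(tmp, expected)

    # Rename the annotation file as described above
    (tmp / 'ACTdb102003.csv').rename(tmp / 'NOTEEVENTS_ANNOTATION.csv')
    after = missing_csv_files(tmp, expected)

print(before)
print(after)
```

In practice, `import_dir` would be the import folder path and `expected_names` the full list of 27 file names.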

In [3]:
import pandas as pd

In [5]:
# Specify the location of the CSV files to import. This is the path to the /import folder in your Neo4j database. 
# Make sure your path ends with a forward slash
path = '/home/tim/.config/Neo4j Desktop/Application/relate-data/dbmss/dbms-b23913ca-cd46-4f05-bb15-7b1b8e6ebb5a/import/'

# Create a list of all CSV files to import
csv_files = ['ADMISSIONS.csv', 'CALLOUT.csv', 'CAREGIVERS.csv', 'CHARTEVENTS.csv', 'CPTEVENTS.csv', 'DATETIMEEVENTS.csv', 'D_CPT.csv', 'DIAGNOSES_ICD.csv', 'D_ICD_DIAGNOSES.csv', 'D_ICD_PROCEDURES.csv', 'D_ITEMS.csv', 'D_LABITEMS.csv', 'DRGCODES.csv', 'ICUSTAYS.csv', 'INPUTEVENTS_CV.csv', 'INPUTEVENTS_MV.csv', 'LABEVENTS.csv', 'MICROBIOLOGYEVENTS.csv', 'NOTEEVENTS_ANNOTATION.csv', 'NOTEEVENTS.csv', 'OUTPUTEVENTS.csv', 'PATIENTS.csv', 'PRESCRIPTIONS.csv', 'PROCEDUREEVENTS_MV.csv', 'PROCEDURES_ICD.csv', 'SERVICES.csv', 'TRANSFERS.csv']

# Create a dictionary with file names as keys and the list of headers for each file as values 
headers_dict = {}
for file in csv_files:
    # nrows=0 reads only the header row, which is all that is needed here
    headers_dict[file] = pd.read_csv(path + file, nrows=0).columns.tolist()
        
# Inspect an example item in the dictionary
print(headers_dict['ADMISSIONS.csv'])

['ROW_ID', 'SUBJECT_ID', 'HADM_ID', 'ADMITTIME', 'DISCHTIME', 'DEATHTIME', 'ADMISSION_TYPE', 'ADMISSION_LOCATION', 'DISCHARGE_LOCATION', 'INSURANCE', 'LANGUAGE', 'RELIGION', 'MARITAL_STATUS', 'ETHNICITY', 'EDREGTIME', 'EDOUTTIME', 'DIAGNOSIS', 'HOSPITAL_EXPIRE_FLAG', 'HAS_CHARTEVENTS_DATA']


In [8]:
# Create a function that writes the string for a Cypher command
# to create nodes from each CSV file

def csv_to_node(csv_file):

    # Create the node label based on the CSV file name. Place it in title case and remove the '.csv' suffix
    label = csv_file[:-4].title()

    # Convert the CSV's headers into node properties: lowercase each header,
    # replace '.' with '_', and map it to its column position in the row
    properties = ', '.join(
        '{key}:column[{i}]'.format(key=header.replace('.', '_').lower(), i=i)
        for i, header in enumerate(headers_dict[csv_file])
    )

    # Compile the complete Cypher command. The row variable is lowercase throughout
    # because Cypher variable names are case-sensitive.
    # Note: without WITH HEADERS, LOAD CSV also reads the header line as a data row.
    cypher = 'USING PERIODIC COMMIT 100000 LOAD CSV FROM "file:///{csv_file}" AS column CREATE (n:{label} {{{properties}}})'.format(csv_file=csv_file, label=label, properties=properties)
    return cypher

# Generate the Cypher code for a single CSV file to test in the Neo4j browser
csv_to_node('NOTEEVENTS_ANNOTATION.csv')

'USING PERIODIC COMMIT 100000 LOAD CSV FROM "file:///NOTEEVENTS_ANNOTATION.csv" AS column CREATE (n:Noteevents_Annotation {hadm_id:column[0], subject_id:column[1], row_id:column[2], advanced_cancer:column[3], advanced_heart_disease:column[4], advanced_lung_disease:column[5], alcohol_abuse:column[6], batch_id:column[7], chronic_neurological_dystrophies:column[8], chronic_pain_fibromyalgia:column[9], dementia:column[10], depression:column[11], developmental_delay_retardation:column[12], non_adherence:column[13], none:column[14], obesity:column[15], operator:column[16], other_substance_abuse:column[17], schizophrenia_and_other_psychiatric_disorders:column[18], unsure:column[19]})'

### Initialize a connection to the Neo4j database

In [None]:
import getpass
password = getpass.getpass("\nPlease enter the Neo4j database password to continue \n")

In [5]:
from neo4j import GraphDatabase
driver = GraphDatabase.driver(uri="bolt://localhost:7687", auth=('neo4j', password))
session = driver.session()

In [20]:
# Create all nodes
for csv_name in csv_files:
    query = csv_to_node(csv_name)
    session.run(query)

This operation created 397,848,371 nodes.

### Create relationships

Prepare a CSV of the tables and foreign keys from the original MIMIC-III schema:
- Obtain foreign key constraints from https://mit-lcp.github.io/mimic-schema-spy/constraints.html
- Copy the table into a CSV file. 
- In a spreadsheet editor, keep only the columns "Child Column" and "Parent Column." 
- Split each of these columns on "." into node and foreign key columns. 
- Save as "mimic3_relational_schema.csv"
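
The spreadsheet steps above could also be scripted. The following is a minimal sketch using only the standard library; it assumes the copied table stores each entry as `table.column` (as the split on "." implies), and the two sample rows and the helper name `split_constraints` are illustrative rather than taken from the original workflow.

```python
import csv
import io

FIELDS = ['Child Node', 'Child Foreign Key', 'Parent Node', 'Parent Foreign Key']

def split_constraints(rows):
    """Split 'table.column' strings into separate node and foreign key fields."""
    result = []
    for row in rows:
        child_node, child_fk = row['Child Column'].split('.', 1)
        parent_node, parent_fk = row['Parent Column'].split('.', 1)
        result.append(dict(zip(FIELDS, [child_node, child_fk, parent_node, parent_fk])))
    return result

# Two sample constraint rows standing in for the table copied from the schema page
sample = io.StringIO(
    'Child Column,Parent Column\n'
    'admissions.subject_id,patients.subject_id\n'
    'callout.hadm_id,admissions.hadm_id\n'
)
rows = split_constraints(csv.DictReader(sample))

# Write the result in the four-column layout used by mimic3_relational_schema.csv
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDS)
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

To produce a real file, the `io.StringIO` objects would be replaced with files opened via `open(..., newline='')`.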

In [3]:
# Read the CSV into a dataframe
sql_schema = pd.read_csv('mimic3_relational_schema.csv')

# Examine the first five rows
sql_schema.iloc[:5,:]

Unnamed: 0,Child Node,Child Foreign Key,Parent Node,Parent Foreign Key
0,admissions,subject_id,patients,subject_id
1,callout,hadm_id,admissions,hadm_id
2,callout,subject_id,patients,subject_id
3,chartevents,cgid,caregivers,cgid
4,chartevents,hadm_id,admissions,hadm_id


A plain Cypher command to create these relationships would try to load too much into RAM at once, so the computer cannot run it unless the work is split into batches with periodic execution.  
See the Neo4j documentation for periodic execution at https://neo4j.com/labs/apoc/4.1/graph-updates/periodic-execution/ to understand the Cypher command in the following cell.  

Note also that we avoid creating a cartesian product with our MATCH query, which would be very computationally expensive. See Stefan Armbruster's description of how to avoid creating a cartesian product in this scenario at https://community.neo4j.com/t/reliably-create-relationships-on-12million-nodes/22223.
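
To make the shape of the batched command easier to see, the sketch below renders it for a single relationship without connecting to the database. The helper name `build_relationship_query` is hypothetical, and the lowercase property keys are an assumption based on the node-creation output shown earlier. The first statement passed to `apoc.periodic.iterate` streams (child, parent) pairs, with the parent looked up by the current child's key value; the second statement is applied to those pairs in batches.

```python
def build_relationship_query(child_node, child_fk, parent_node, parent_fk, batch_size=10000):
    """Return an apoc.periodic.iterate call linking child nodes to their parents."""
    # First statement: for each child node, find the parent whose key equals
    # the child's foreign key value
    match = 'MATCH (cn:{child_node}) MATCH (pn:{parent_node} {{{parent_fk}:cn.{child_fk}}}) RETURN cn, pn'.format(
        child_node=child_node, parent_node=parent_node,
        parent_fk=parent_fk, child_fk=child_fk)
    # Second statement: executed once per returned pair, committed in batches
    create = 'CREATE (cn)-[:CHILD_OF]->(pn)'
    return 'CALL apoc.periodic.iterate("{match}", "{create}", {{batchSize:{batch_size}, parallel:true, iterateList:true}})'.format(
        match=match, create=create, batch_size=batch_size)

# Render the command for the first row of the schema table
query = build_relationship_query('Admissions', 'subject_id', 'Patients', 'subject_id')
print(query)
```

Printing the command this way also gives a string that can be pasted into the Neo4j browser to test one relationship type before running the full loop.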

In [7]:
# Write a Cypher command for each relationship specified in the original
# MIMIC-III schema
total = len(sql_schema)
for count, (_, row) in enumerate(sql_schema.iterrows(), start=1):
    child_node = row['Child Node'].title()
    # Property keys were lowercased when the nodes were created,
    # and Cypher property keys are case-sensitive
    child_fk = row['Child Foreign Key'].lower()
    parent_node = row['Parent Node'].title()
    parent_fk = row['Parent Foreign Key'].lower()

    command = 'CALL apoc.periodic.iterate(\"MATCH (cn:{child_node}) MATCH (pn:{parent_node} {{{parent_fk}:cn.{child_fk}}}) RETURN cn, pn\", \"CREATE (cn)-[:CHILD_OF]->(pn)\", {{batchSize:10000, parallel: true, iterateList:true}})'.format(child_node=child_node, parent_node=parent_node, child_fk=child_fk, parent_fk=parent_fk)
    session.run(command)
    print('{} of {}: {}'.format(count, total, child_node))

1 of 63: Admissions
2 of 63: Callout
3 of 63: Callout
4 of 63: Chartevents
5 of 63: Chartevents
6 of 63: Chartevents
7 of 63: Chartevents
8 of 63: Chartevents
9 of 63: Cptevents
10 of 63: Cptevents
11 of 63: Datetimeevents
12 of 63: Datetimeevents
13 of 63: Datetimeevents
14 of 63: Datetimeevents
15 of 63: Datetimeevents
16 of 63: Diagnoses_Icd
17 of 63: Diagnoses_Icd
18 of 63: Diagnoses_Icd
19 of 63: Drgcodes
20 of 63: Drgcodes
21 of 63: Icustays
22 of 63: Icustays
23 of 63: Inputevents_Cv
24 of 63: Inputevents_Cv
25 of 63: Inputevents_Cv
26 of 63: Inputevents_Cv
27 of 63: Inputevents_Mv
28 of 63: Inputevents_Mv
29 of 63: Inputevents_Mv
30 of 63: Inputevents_Mv
31 of 63: Inputevents_Mv
32 of 63: Labevents
33 of 63: Labevents
34 of 63: Labevents
35 of 63: Microbiologyevents
36 of 63: Microbiologyevents
37 of 63: Microbiologyevents
38 of 63: Microbiologyevents
39 of 63: Microbiologyevents
40 of 63: Noteevents
41 of 63: Noteevents
42 of 63: Noteevents
43 of 63: Outputevents
44 of 63: Out

In [17]:
# Create relationships for the Noteevents_Annotation table
nea_schema = [
    ['Child Node', 'Child Foreign Key', 'Parent Node', 'Parent Foreign Key'],
    ['Noteevents_Annotation', 'SUBJECT_ID', 'Patients', 'SUBJECT_ID'],
    ['Noteevents_Annotation', 'HADM_ID', 'Admissions', 'HADM_ID'],
    ['Noteevents_Annotation', 'ROW_ID', 'Noteevents', 'ROW_ID']
]

# Skip the header row and number the remaining rows from 1
for count, row in enumerate(nea_schema[1:], start=1):
    child_node = row[0].title()
    # Property keys were lowercased when the nodes were created,
    # and Cypher property keys are case-sensitive
    child_fk = row[1].lower()
    parent_node = row[2].title()
    parent_fk = row[3].lower()

    command = 'CALL apoc.periodic.iterate(\"MATCH (cn:{child_node}) MATCH (pn:{parent_node} {{{parent_fk}:cn.{child_fk}}}) RETURN cn, pn\", \"CREATE (cn)-[:CHILD_OF]->(pn)\", {{batchSize:10000, parallel: true, iterateList:true}})'.format(child_node=child_node, parent_node=parent_node, child_fk=child_fk, parent_fk=parent_fk)
    session.run(command)
    print('{} of {}: {}'.format(count, len(nea_schema) - 1, child_node))

1 of 3: Noteevents_Annotation
2 of 3: Noteevents_Annotation
3 of 3: Noteevents_Annotation


The relationship-building commands above operated on 397,848,371 nodes, creating 587,769,497 relationships. These relationships used a total of 149 GB of disk space.

### Create Constraints to prevent data duplication
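
A minimal sketch of how uniqueness constraints could be generated for this graph. Everything here is an assumption rather than the notebook's own code: the helper name `unique_constraint`, the constraint names, and the choice of labels and keys (the primary identifiers of three MIMIC-III tables, lowercased as in the node properties). The statement uses the `ON ... ASSERT` form from Neo4j 4.x; newer Neo4j versions use `FOR ... REQUIRE` instead.

```python
def unique_constraint(label, key):
    """Return a Neo4j 4.x Cypher statement that makes `key` unique for nodes with `label`."""
    name = '{}_{}_unique'.format(label.lower(), key.lower())
    return 'CREATE CONSTRAINT {name} ON (n:{label}) ASSERT n.{key} IS UNIQUE'.format(
        name=name, label=label, key=key)

# Hypothetical selection of labels and identifying properties
candidate_keys = [
    ('Patients', 'subject_id'),
    ('Admissions', 'hadm_id'),
    ('Icustays', 'icustay_id'),
]

statements = [unique_constraint(label, key) for label, key in candidate_keys]
for statement in statements:
    print(statement)
```

Each statement could then be passed to `session.run()`. Creating a uniqueness constraint fails if the existing data already violates it, so duplicates would need to be removed first.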

### Close the connection to the Neo4j database

In [21]:
session.close()
driver.close()  # Also release the driver's connection pool

### Data references:
The dataset:
Johnson, A., Pollard, T., & Mark, R. (2016). MIMIC-III Clinical Database (version 1.4). PhysioNet. https://doi.org/10.13026/C2XW26.

The original publication:
Johnson, A. E. W., Pollard, T. J., Shen, L., Lehman, L. H., Feng, M., Ghassemi, M., Moody, B., Szolovits, P., Celi, L. A., & Mark, R. G. (2016). MIMIC-III, a freely accessible critical care database. Scientific Data, 3, 160035.

The data hosting service PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.