# Importing Data into Graph

### About the data  
The data were obtained from PhysioNet, where the following abstract describes the data: "MIMIC-III is a large, freely-available database comprising deidentified health-related data associated with over 40,000 patients who stayed in critical care units of the Beth Israel Deaconess Medical Center between 2001 and 2012 [1]. The MIMIC-III Clinical Database is available on PhysioNet (doi: 10.13026/C2XW26). Though deidentified, MIMIC-III contains detailed information regarding the care of real patients, and as such requires credentialing before access. To allow researchers to ascertain whether the database is suitable for their work, we have manually curated a demo subset, which contains information for 100 patients also present in the MIMIC-III Clinical Database. <strong>Notably, the demo dataset does not include free-text notes.</strong>"

### Data reference:
Johnson, A. E. W., Pollard, T. J., Shen, L., Lehman, L. H., Feng, M., Ghassemi, M., Moody, B., Szolovits, P., Celi, L. A., & Mark, R. G. (2016). MIMIC-III, a freely accessible critical care database. Scientific data, 3, 160035.  

## Import MIMIC III Demo 1.4 Data into a graph database
---

The data were downloaded from https://physionet.org/content/mimiciii-demo/1.4/ on 11 February 2021.
The download consisted of the following 28 files:
- ADMISSIONS.csv
- CALLOUT.csv
- CAREGIVERS.csv
- CHARTEVENTS.csv
- CPTEVENTS.csv
- DATETIMEEVENTS.csv
- D_CPT.csv
- DIAGNOSES_ICD.csv
- D_ICD_DIAGNOSES.csv
- D_ICD_PROCEDURES.csv
- D_ITEMS.csv
- D_LABITEMS.csv
- DRGCODES.csv
- ICUSTAYS.csv
- INPUTEVENTS_CV.csv
- INPUTEVENTS_MV.csv
- LABEVENTS.csv
- LICENSE.txt
- MICROBIOLOGYEVENTS.csv
- NOTEEVENTS.csv
- OUTPUTEVENTS.csv
- PATIENTS.csv
- PRESCRIPTIONS.csv
- PROCEDUREEVENTS_MV.csv
- PROCEDURES_ICD.csv
- SERVICES.csv
- SHA256SUMS.txt
- TRANSFERS.csv

The CSV files were placed in the Import folder of the GraphEHR_Proof-of-concept Neo4j database to make them readily available for import into the graph.
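
Since the download ships with a `SHA256SUMS.txt` file, the copied files can be checked before import. The following is a minimal sketch, not part of the original workflow. It assumes that `SHA256SUMS.txt` was copied into the same folder as the CSV files and that it uses the usual `sha256sum` layout (a digest, whitespace, then a file name). The helper names `sha256_of` and `verify_checksums` are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(file_path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(file_path, 'rb') as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksums(folder, sums_file='SHA256SUMS.txt'):
    """Compare each listed file against its expected digest.

    Assumes each line looks like '<digest>  <file name>'.
    Returns a dict mapping file name to 'ok', 'mismatch', or 'missing'.
    """
    folder = Path(folder)
    results = {}
    for line in (folder / sums_file).read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        name = name.strip().lstrip('*')  # sha256sum marks binary mode with '*'
        target = folder / name
        if not target.exists():
            results[name] = 'missing'
        elif sha256_of(target) == expected.lower():
            results[name] = 'ok'
        else:
            results[name] = 'mismatch'
    return results
```

Calling `verify_checksums(path)` with the import folder used later in this notebook would return one status per listed file; anything other than `'ok'` suggests re-copying or re-downloading that file.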

### Write Cypher commands for nodes

In [2]:
import pandas as pd

In [3]:
# Specify the location of the CSV files to import. This is the path to the /import folder in your Neo4j database.
# Make sure your path ends with a forward slash, because file names are appended to it directly
path = '/home/tim/.config/Neo4j Desktop/Application/relate-data/dbmss/dbms-0d5f0a02-7d73-4462-a47e-2a074de0b766/import/'

# Create a list of all CSV files to import
csv_files = ['ADMISSIONS.csv', 'CALLOUT.csv', 'CAREGIVERS.csv', 'CHARTEVENTS.csv', 'CPTEVENTS.csv', 'DATETIMEEVENTS.csv', 'D_CPT.csv', 'DIAGNOSES_ICD.csv', 'D_ICD_DIAGNOSES.csv', 'D_ICD_PROCEDURES.csv', 'D_ITEMS.csv', 'D_LABITEMS.csv', 'DRGCODES.csv', 'ICUSTAYS.csv', 'INPUTEVENTS_CV.csv', 'INPUTEVENTS_MV.csv', 'LABEVENTS.csv', 'MICROBIOLOGYEVENTS.csv', 'NOTEEVENTS.csv', 'OUTPUTEVENTS.csv', 'PATIENTS.csv', 'PRESCRIPTIONS.csv', 'PROCEDUREEVENTS_MV.csv', 'PROCEDURES_ICD.csv', 'SERVICES.csv', 'TRANSFERS.csv']

# Create a dictionary with file names as keys and the list of headers for each file as values 
headers_dict = {}
for file in csv_files:
    # Reading a single row is enough to obtain the column names
    headers = pd.read_csv(path + file, nrows=1)
    headers = headers.columns.tolist()
    if file not in headers_dict:
        headers_dict[file] = headers
        
# Inspect an example item in the dictionary
print(headers_dict['ADMISSIONS.csv'])

['row_id', 'subject_id', 'hadm_id', 'admittime', 'dischtime', 'deathtime', 'admission_type', 'admission_location', 'discharge_location', 'insurance', 'language', 'religion', 'marital_status', 'ethnicity', 'edregtime', 'edouttime', 'diagnosis', 'hospital_expire_flag', 'has_chartevents_data']


In [8]:
# Create a function that writes the string for a cypher command
# to create nodes from each CSV file

def csv_to_node(csv_file):
    
    # Create the node label based on the CSV file name. Place it in title case and remove the '.csv' suffix
    label= csv_file[:-4].title() 
    
    # Convert the CSV's headers into node properties, pairing each
    # lower-cased header (dots replaced by underscores) with its column index
    properties = '{' + ', '.join(
        header.lower().replace('.', '_') + ':COLUMN[' + str(col_index) + ']'
        for col_index, header in enumerate(headers_dict[csv_file])
    ) + '}'
    
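    # Compile the complete cypher command.
    # Note: LOAD CSV without WITH HEADERS treats the header line as data, so each
    # label also receives one node whose property values are the column names.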
    cypher = '''USING PERIODIC COMMIT 100000 LOAD CSV FROM "file:///{csv_file}" AS COLUMN CREATE (n:{label} {properties})'''.format(csv_file=csv_file, label=label, properties=properties)
    return cypher

# Generate the cypher code for a single csv file to test in the Neo4j browser
csv_to_node('ADMISSIONS.csv')

'USING PERIODIC COMMIT 100000 LOAD CSV FROM "file:///ADMISSIONS.csv" AS COLUMN CREATE (n:Admissions {row_id:COLUMN[0], subject_id:COLUMN[1], hadm_id:COLUMN[2], admittime:COLUMN[3], dischtime:COLUMN[4], deathtime:COLUMN[5], admission_type:COLUMN[6], admission_location:COLUMN[7], discharge_location:COLUMN[8], insurance:COLUMN[9], language:COLUMN[10], religion:COLUMN[11], marital_status:COLUMN[12], ethnicity:COLUMN[13], edregtime:COLUMN[14], edouttime:COLUMN[15], diagnosis:COLUMN[16], hospital_expire_flag:COLUMN[17], has_chartevents_data:COLUMN[18]})'
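
Because the command above reads the file without `WITH HEADERS`, the header line is loaded like any other row. One possible alternative, sketched here under stated assumptions and not taken from the original notebook, is to let Cypher read the headers and address each field by name. The helper name `csv_to_node_with_headers` is hypothetical, the short header list is only illustrative, and the sketch assumes the header names contain no single quotes.

```python
def csv_to_node_with_headers(csv_file, headers):
    """Build a LOAD CSV WITH HEADERS command for one file (hypothetical helper)."""
    # Same label rule as csv_to_node: strip '.csv' and use title case
    label = csv_file[:-4].title()

    # With WITH HEADERS each row is a map, so fields are looked up by header name
    properties = ', '.join(
        "{prop}:row['{header}']".format(
            prop=header.lower().replace('.', '_'), header=header)
        for header in headers
    )

    return (
        'USING PERIODIC COMMIT 100000 '
        'LOAD CSV WITH HEADERS FROM "file:///{csv_file}" AS row '
        'CREATE (n:{label} {{{properties}}})'
    ).format(csv_file=csv_file, label=label, properties=properties)


# Illustrative call with a shortened header list
print(csv_to_node_with_headers('PATIENTS.csv', ['row_id', 'subject_id', 'gender']))
```

With this variant the header line is consumed as metadata instead of becoming a node. Whether it suits a given Neo4j version should be checked in the Neo4j browser first, as with the original command.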

### Initialize a connection to the Neo4j database

In [15]:
from neo4j import GraphDatabase
driver = GraphDatabase.driver(uri="bolt://localhost:7687", auth=('neo4j', 'Gr@ph3HR'))
session = driver.session()
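
If the notebook is shared, the password can be kept out of the code by reading the connection settings from the environment. This is a minimal sketch under assumptions: the environment variable names `NEO4J_URI`, `NEO4J_USER`, and `NEO4J_PASSWORD` and the helper `connection_settings` are hypothetical, and the defaults simply mirror the URI and user name used above.

```python
import os


def connection_settings(env=None):
    """Read Neo4j connection settings from environment variables.

    The variable names are hypothetical; the defaults match this notebook.
    Returns (uri, (user, password)).
    """
    env = os.environ if env is None else env
    uri = env.get('NEO4J_URI', 'bolt://localhost:7687')
    user = env.get('NEO4J_USER', 'neo4j')
    password = env.get('NEO4J_PASSWORD')
    if password is None:
        raise RuntimeError('Set NEO4J_PASSWORD before connecting')
    return uri, (user, password)
```

The result can be passed straight to the driver, for example `uri, auth = connection_settings()` followed by `GraphDatabase.driver(uri=uri, auth=auth)`.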

### Import nodes

In [10]:
# Create all nodes
for csv_name in csv_files:
    query = csv_to_node(csv_name)
    session.run(query)
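
A quick sanity check after the import is to compare the number of nodes per label with the number of records in each file. Because the generated command uses `LOAD CSV` without `WITH HEADERS`, the header line is loaded as well, so the node count should roughly equal the record count with the header included. The helpers below are a hypothetical sketch, not part of the original notebook.

```python
import csv


def count_csv_records(file_path):
    """Count the non-empty records in a CSV file, header line included."""
    # csv.reader handles quoted fields that span several physical lines
    with open(file_path, newline='') as handle:
        return sum(1 for record in csv.reader(handle) if record)


def count_nodes_query(csv_file):
    """Build a Cypher query that counts the nodes created from one file."""
    # Same label rule as csv_to_node: strip '.csv' and use title case
    label = csv_file[:-4].title()
    return 'MATCH (n:{label}) RETURN count(n) AS node_count'.format(label=label)
```

With the open session, `session.run(count_nodes_query(csv_name)).single()['node_count']` can be compared against `count_csv_records(path + csv_name)` for each entry in `csv_files`.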

### Create relationships

Prepare a CSV of the tables and foreign keys from the original MIMIC III schema:
- Obtain foreign key constraints from https://mit-lcp.github.io/mimic-schema-spy/constraints.html
- Copy the table into a CSV file. 
- In a spreadsheet editor, keep only the columns "Child Column" and "Parent Column." 
- Split each of these columns on "." into node and foreign key columns. 
- Save as "mimic3_relational_schema.csv"
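
The spreadsheet steps above can also be sketched in code. This is an assumption-laden alternative rather than the method used here: it presumes the copied constraints table was saved as a CSV (the caller supplies its path) with the headers "Child Column" and "Parent Column", each cell holding `table.column`. The helper name `split_constraints` is hypothetical.

```python
import csv


def split_constraints(source_csv, target_csv='mimic3_relational_schema.csv'):
    """Split 'table.column' cells into separate node and key columns.

    Returns the number of relationships written.
    """
    fieldnames = ['Child Node', 'Child Foreign Key',
                  'Parent Node', 'Parent Foreign Key']
    written = 0
    with open(source_csv, newline='') as src, \
            open(target_csv, 'w', newline='') as dst:
        writer = csv.DictWriter(dst, fieldnames=fieldnames)
        writer.writeheader()
        for row in csv.DictReader(src):
            # Split only on the first '.', giving the table and column names
            child_node, child_fk = row['Child Column'].strip().split('.', 1)
            parent_node, parent_fk = row['Parent Column'].strip().split('.', 1)
            writer.writerow({
                'Child Node': child_node,
                'Child Foreign Key': child_fk,
                'Parent Node': parent_node,
                'Parent Foreign Key': parent_fk,
            })
            written += 1
    return written
```

The output has no index column, unlike the file shown in the next cell, but the later code addresses columns by name, so it should read either form.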

In [11]:
# Read the CSV into a dataframe
sql_schema = pd.read_csv('mimic3_relational_schema.csv')

# Examine the first five rows
sql_schema.iloc[:5,:]

| Unnamed: 0 | Child Node | Child Foreign Key | Parent Node | Parent Foreign Key |
|---|---|---|---|---|
| 0 | admissions | subject_id | patients | subject_id |
| 1 | callout | hadm_id | admissions | hadm_id |
| 2 | callout | subject_id | patients | subject_id |
| 3 | chartevents | cgid | caregivers | cgid |
| 4 | chartevents | hadm_id | admissions | hadm_id |


Note that a plain Cypher command that creates these relationships in a single transaction would try to hold too much in RAM at once, so it cannot complete unless the work is split into batches with periodic execution.  
See the Neo4j APOC documentation for periodic execution at https://neo4j.com/labs/apoc/4.1/graph-updates/periodic-execution/ to understand the Cypher command in the following cell.  

Note also that we avoid creating a cartesian product with our MATCH query, which would be very computationally expensive. See Stefan Armbruster's description of how to avoid creating a cartesian product in this scenario at https://community.neo4j.com/t/reliably-create-relationships-on-12million-nodes/22223.
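
To make the structure of the batched command easier to read, here is a small sketch that builds it for a single foreign key and prints the resulting Cypher. The helper name `relationship_command` and its `batch_size` parameter are hypothetical; the generated text is meant to match what the loop in the following cell produces.

```python
def relationship_command(child_node, child_fk, parent_node, parent_fk,
                         batch_size=10000):
    """Build the apoc.periodic.iterate call for one foreign-key relationship."""
    # First statement: streams (child, parent) pairs. The second MATCH is
    # constrained by the child's key value instead of matching every parent.
    match = (
        f'MATCH (cn:{child_node}) '
        f'MATCH (pn:{parent_node} {{{parent_fk}:cn.{child_fk}}}) '
        'RETURN cn, pn'
    )
    # Second statement: runs for each pair and is committed in batches
    create = 'CREATE (cn)-[:CHILD_OF]->(pn)'
    options = f'{{batchSize:{batch_size}, parallel: true, iterateList:true}}'
    return f'CALL apoc.periodic.iterate("{match}", "{create}", {options})'


# First row of the schema table: admissions.subject_id -> patients.subject_id
print(relationship_command('Admissions', 'subject_id', 'Patients', 'subject_id'))
```

Printing one command like this and pasting it into the Neo4j browser is a cheap way to test it before running all relationships.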

In [16]:
# Write a cypher command for each relationship specified in the original
# MIMIC III schema
count = 0
for index, row in sql_schema.iterrows():
    child_node = row['Child Node'].title()
    child_fk = row['Child Foreign Key']
    parent_node = row['Parent Node'].title()
    parent_fk = row['Parent Foreign Key']

    command = 'CALL apoc.periodic.iterate(\"MATCH (cn:{child_node}) MATCH (pn:{parent_node} {{{parent_fk}:cn.{child_fk}}}) RETURN cn, pn\", \"CREATE (cn)-[:CHILD_OF]->(pn)\", {{batchSize:10000, parallel: true, iterateList:true}})'.format(child_node=child_node, parent_node=parent_node, child_fk=child_fk, parent_fk=parent_fk)
    session.run(command)
    count += 1
    print(str(count)+' of 63: '+child_node)

1 of 63: Admissions
2 of 63: Callout
3 of 63: Callout
4 of 63: Chartevents
5 of 63: Chartevents
6 of 63: Chartevents
7 of 63: Chartevents
8 of 63: Chartevents
9 of 63: Cptevents
10 of 63: Cptevents
11 of 63: Datetimeevents
12 of 63: Datetimeevents
13 of 63: Datetimeevents
14 of 63: Datetimeevents
15 of 63: Datetimeevents
16 of 63: Diagnoses_Icd
17 of 63: Diagnoses_Icd
18 of 63: Diagnoses_Icd
19 of 63: Drgcodes
20 of 63: Drgcodes
21 of 63: Icustays
22 of 63: Icustays
23 of 63: Inputevents_Cv
24 of 63: Inputevents_Cv
25 of 63: Inputevents_Cv
26 of 63: Inputevents_Cv
27 of 63: Inputevents_Mv
28 of 63: Inputevents_Mv
29 of 63: Inputevents_Mv
30 of 63: Inputevents_Mv
31 of 63: Inputevents_Mv
32 of 63: Labevents
33 of 63: Labevents
34 of 63: Labevents
35 of 63: Microbiologyevents
36 of 63: Microbiologyevents
37 of 63: Microbiologyevents
38 of 63: Microbiologyevents
39 of 63: Microbiologyevents
40 of 63: Noteevents
41 of 63: Noteevents
42 of 63: Noteevents
43 of 63: Outputevents
44 of 63: Out

### Close the connection to the Neo4j database

In [17]:
session.close()
# Also close the driver so its connection pool is released
driver.close()