# HDAT Capstone Project

## Research Question - Mortality prediction in the ICU

#### Task - Build a predictive algorithm using the techniques learned in this course
#### Objective - To assess the role of machine learning algorithms in predicting mortality using the MIMIC-II dataset
#### Question - Is it possible to accurately predict mortality based on data from the first 24 hours in ICU?
#### Study population - MIMIC-II dataset

Notes about the dataset:

1. Incorrect values - MIMIC-II was not collected for research and combines two different electronic medical record systems (CareVue and MetaVision). This increases the likelihood of inaccuracies in data entry and extraction.

2. Missing data/sparseness - The information recorded varies between patients because the EMR was used differently over time (e.g. a separate system for recording lab results or medications) and because the data were collected for clinical relevance rather than research.

Each patient has a unique identifier (subject_id), each hospital stay has a hospital stay ID (hadm_id), and each ICU stay has an ICU stay ID (icustay_id). These IDs can be used to identify readmissions to hospital and ICU.
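As a minimal sketch of how these IDs could be used to flag readmissions, the snippet below counts distinct hospital and ICU stays per patient. The `stays` table and its values are made up for illustration; only the three column names come from the dataset description above.

```python
import pandas as pd

# Toy stand-in for an ICU stay table; the values are invented for illustration.
stays = pd.DataFrame({
    "subject_id": [1, 1, 1, 2, 3, 3],
    "hadm_id":    [10, 10, 11, 20, 30, 31],
    "icustay_id": [100, 101, 102, 200, 300, 301],
})

# Number of distinct hospital admissions and ICU stays per patient.
counts = stays.groupby("subject_id").agg(
    n_hadm=("hadm_id", "nunique"),
    n_icustay=("icustay_id", "nunique"),
)

# A patient with more than one hadm_id was readmitted to hospital.
hospital_readmitted = counts.index[counts["n_hadm"] > 1].tolist()

# A hospital admission with more than one icustay_id includes an ICU readmission.
icu_per_admission = stays.groupby("hadm_id")["icustay_id"].nunique()
icu_readmitted_admissions = icu_per_admission.index[icu_per_admission > 1].tolist()

print(hospital_readmitted)
print(icu_readmitted_admissions)
```

Whether readmitted patients should be kept, or restricted to their first ICU stay, is a modelling choice this sketch does not make.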

## Imports

In [44]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Show every column when displaying wide DataFrames
pd.set_option('display.max_columns', None)
sns.set_style("darkgrid")


## Load in datasets and merge

In [45]:
patients = pd.read_csv('mimic_data/patients.csv') # https://mimic.physionet.org/mimictables/patients/
# Table purpose: Defines each SUBJECT_ID in the database, i.e. defines a single patient
# Links to: ADMISSIONS on SUBJECT_ID, ICUSTAYS on SUBJECT_ID
patients.head()

Unnamed: 0,row_id,subject_id,gender,dob,dod,dod_hosp,dod_ssn,expire_flag
0,234,249,F,2075-03-13 00:00:00,,,,0
1,235,250,F,2164-12-27 00:00:00,2188-11-22 00:00:00,2188-11-22 00:00:00,,1
2,236,251,M,2090-03-15 00:00:00,,,,0
3,237,252,M,2078-03-06 00:00:00,,,,0
4,238,253,F,2089-11-26 00:00:00,,,,0


In [46]:
# Select only columns of interest from patients
patients = patients[['subject_id','gender','dob']]
patients.head()

Unnamed: 0,subject_id,gender,dob
0,249,F,2075-03-13 00:00:00
1,250,F,2164-12-27 00:00:00
2,251,M,2090-03-15 00:00:00
3,252,M,2078-03-06 00:00:00
4,253,F,2089-11-26 00:00:00


In [47]:
# Load pt_stay_hr as the base of the master table
pt_stay_hr = pd.read_csv('mimic_data/pt_stay_hr.csv')
pt_stay_hr.head()

Unnamed: 0,icustay_id,hadm_id,subject_id,intime,outtime,starttime,endtime,hr,dy
0,200001,152234,55973,2181-11-25 19:06:12,2181-11-28 20:59:25,2181-11-24 19:06:12,2181-11-24 20:06:12,-24.0,0.0
1,200001,152234,55973,2181-11-25 19:06:12,2181-11-28 20:59:25,2181-11-24 20:06:12,2181-11-24 21:06:12,-23.0,0.0
2,200001,152234,55973,2181-11-25 19:06:12,2181-11-28 20:59:25,2181-11-24 21:06:12,2181-11-24 22:06:12,-22.0,0.0
3,200001,152234,55973,2181-11-25 19:06:12,2181-11-28 20:59:25,2181-11-24 22:06:12,2181-11-24 23:06:12,-21.0,0.0
4,200001,152234,55973,2181-11-25 19:06:12,2181-11-28 20:59:25,2181-11-24 23:06:12,2181-11-25 00:06:12,-20.0,0.0
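In the rows above, `hr` is -24 when `starttime` is exactly 24 hours before `intime`, so `hr` appears to count hours relative to ICU admission. Under that assumption, the first 24 hours in ICU would be the intervals with `hr` from 0 to 23. A minimal sketch on a made-up table (the `toy` frame is hypothetical, not the real data):

```python
import pandas as pd

# Toy hourly table; values are invented for illustration.
toy = pd.DataFrame({
    "icustay_id": [1, 1, 1, 1, 1],
    "hr": [-2.0, -1.0, 0.0, 23.0, 24.0],
})

# Assumption: hr == 0 is the first hourly interval starting at intime,
# so hr 0..23 (inclusive) covers the first 24 hours in ICU.
first_24h = toy[toy["hr"].between(0, 23)]

print(first_24h)
```

`Series.between` is inclusive at both ends by default, so the interval starting at hour 24 is excluded.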


In [48]:
# Merge pt_stay_hr and patients into the master table on subject_id
# (pd.merge defaults to an inner join)
# Column notes:
#   intime, outtime    = ICU in and out times
#   starttime, endtime = start and end of each hourly interval
#   hr                 = hour relative to intime; starts from -24 (24 hrs before ICU admission)
#   dy                 = days in ICU
master = pd.merge(pt_stay_hr, patients, on='subject_id')
master.head()

Unnamed: 0,icustay_id,hadm_id,subject_id,intime,outtime,starttime,endtime,hr,dy,gender,dob
0,200001,152234,55973,2181-11-25 19:06:12,2181-11-28 20:59:25,2181-11-24 19:06:12,2181-11-24 20:06:12,-24.0,0.0,F,2120-10-31 00:00:00
1,200001,152234,55973,2181-11-25 19:06:12,2181-11-28 20:59:25,2181-11-24 20:06:12,2181-11-24 21:06:12,-23.0,0.0,F,2120-10-31 00:00:00
2,200001,152234,55973,2181-11-25 19:06:12,2181-11-28 20:59:25,2181-11-24 21:06:12,2181-11-24 22:06:12,-22.0,0.0,F,2120-10-31 00:00:00
3,200001,152234,55973,2181-11-25 19:06:12,2181-11-28 20:59:25,2181-11-24 22:06:12,2181-11-24 23:06:12,-21.0,0.0,F,2120-10-31 00:00:00
4,200001,152234,55973,2181-11-25 19:06:12,2181-11-28 20:59:25,2181-11-24 23:06:12,2181-11-25 00:06:12,-20.0,0.0,F,2120-10-31 00:00:00
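Because the hourly table has many rows per patient, this merge is expected to be many-to-one. One way to guard against accidental row duplication is the `validate` argument of `pd.merge`, sketched here on made-up frames (`hourly`, `demographics` and the second `subject_id` are hypothetical):

```python
import pandas as pd

# Toy hourly table (many rows per patient) and toy patient table (one row each).
hourly = pd.DataFrame({
    "icustay_id": [200001, 200001, 200003],
    "subject_id": [55973, 55973, 27513],
    "hr": [-24.0, -23.0, -24.0],
})
demographics = pd.DataFrame({
    "subject_id": [55973, 27513],
    "gender": ["F", "M"],
})

# validate="many_to_one" raises MergeError if subject_id repeats in the right table.
merged = pd.merge(hourly, demographics, on="subject_id", validate="many_to_one")
rows_preserved = len(merged) == len(hourly)

# Demonstrate the failure mode with a deliberately duplicated patient row.
duplicated = pd.concat([demographics, demographics.iloc[[0]]], ignore_index=True)
try:
    pd.merge(hourly, duplicated, on="subject_id", validate="many_to_one")
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True

print(rows_preserved, duplicate_detected)
```

With the default inner join, hourly rows whose `subject_id` is missing from the patient table would be dropped silently, so comparing row counts before and after the merge is a cheap additional check.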


In [49]:
# Load pt_icu_outcome dataset
pt_icu_outcome = pd.read_csv('mimic_data/pt_icu_outcome.csv')
pt_icu_outcome.head()

Unnamed: 0,row_id,subject_id,dob,hadm_id,admittime,dischtime,icustay_id,age_years,intime,outtime,los,hosp_deathtime,icu_expire_flag,hospital_expire_flag,dod,expire_flag,ttd_days
0,1,2,2138-07-17 00:00:00,163353,2138-07-17 19:04:00,2138-07-21 15:48:00,243653,0.0,2138-07-17 21:20:07,2138-07-17 23:32:21,0.0918,,0,0.0,,0,
1,2,3,2025-04-11 00:00:00,145834,2101-10-20 19:08:00,2101-10-31 13:58:00,211552,76.0,2101-10-20 19:10:11,2101-10-26 20:43:09,6.0646,,0,0.0,2102-06-14 00:00:00,1,236.0
2,3,4,2143-05-12 00:00:00,185777,2191-03-16 00:28:00,2191-03-23 18:41:00,294638,47.0,2191-03-16 00:29:31,2191-03-17 16:46:31,1.6785,,0,0.0,,0,
3,4,5,2103-02-02 00:00:00,178980,2103-02-02 04:31:00,2103-02-04 12:15:00,214757,0.0,2103-02-02 06:04:24,2103-02-02 08:06:00,0.0844,,0,0.0,,0,
4,5,6,2109-06-21 00:00:00,107064,2175-05-30 07:15:00,2175-06-15 16:00:00,228232,65.0,2175-05-30 21:30:54,2175-06-03 13:39:54,3.6729,,0,0.0,,0,


In [50]:
# Select only columns of interest from pt_icu_outcome
pt_icu_outcome = pt_icu_outcome[['icustay_id','age_years','los','icu_expire_flag', 'ttd_days']]
pt_icu_outcome.head()

Unnamed: 0,icustay_id,age_years,los,icu_expire_flag,ttd_days
0,243653,0.0,0.0918,0,
1,211552,76.0,6.0646,0,236.0
2,294638,47.0,1.6785,0,
3,214757,0.0,0.0844,0,
4,228232,65.0,3.6729,0,


In [51]:
# Left join the master table with our selected variables from pt_icu_outcome on the icustay_id
master = pd.merge(master, pt_icu_outcome, on='icustay_id', how='left')
master.head()

Unnamed: 0,icustay_id,hadm_id,subject_id,intime,outtime,starttime,endtime,hr,dy,gender,dob,age_years,los,icu_expire_flag,ttd_days
0,200001,152234,55973,2181-11-25 19:06:12,2181-11-28 20:59:25,2181-11-24 19:06:12,2181-11-24 20:06:12,-24.0,0.0,F,2120-10-31 00:00:00,61.0,3.0786,0,365.0
1,200001,152234,55973,2181-11-25 19:06:12,2181-11-28 20:59:25,2181-11-24 20:06:12,2181-11-24 21:06:12,-23.0,0.0,F,2120-10-31 00:00:00,61.0,3.0786,0,365.0
2,200001,152234,55973,2181-11-25 19:06:12,2181-11-28 20:59:25,2181-11-24 21:06:12,2181-11-24 22:06:12,-22.0,0.0,F,2120-10-31 00:00:00,61.0,3.0786,0,365.0
3,200001,152234,55973,2181-11-25 19:06:12,2181-11-28 20:59:25,2181-11-24 22:06:12,2181-11-24 23:06:12,-21.0,0.0,F,2120-10-31 00:00:00,61.0,3.0786,0,365.0
4,200001,152234,55973,2181-11-25 19:06:12,2181-11-28 20:59:25,2181-11-24 23:06:12,2181-11-25 00:06:12,-20.0,0.0,F,2120-10-31 00:00:00,61.0,3.0786,0,365.0
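A left join keeps every row of the master table, but ICU stays without a match in the outcome table end up with missing outcome values. A minimal sketch of how that could be checked with the `indicator` argument of `pd.merge`, using made-up frames (`master_toy` and `outcome_toy` are hypothetical):

```python
import pandas as pd

# Toy master and outcome tables; icustay_id 3 has no outcome row.
master_toy = pd.DataFrame({
    "icustay_id": [1, 1, 2, 3],
    "hr": [0.0, 1.0, 0.0, 0.0],
})
outcome_toy = pd.DataFrame({
    "icustay_id": [1, 2],
    "icu_expire_flag": [0, 1],
})

# indicator=True adds a "_merge" column marking where each row came from.
checked = pd.merge(master_toy, outcome_toy, on="icustay_id",
                   how="left", indicator=True)

# Rows present only in the left table have no outcome information.
n_unmatched = int((checked["_merge"] == "left_only").sum())
unmatched_ids = checked.loc[checked["_merge"] == "left_only", "icustay_id"].unique().tolist()

print(n_unmatched, unmatched_ids)
```

Unmatched stays would carry a missing `icu_expire_flag`, so they would need to be excluded or handled before the flag is used as a prediction target.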
