
# eICU Collaborative Research Database

# Notebook 6: An example project

This notebook introduces a project focused on acute kidney injury, quantifying differences between patients with and without the condition.

## Load libraries and connect to the database

In [0]:
# Import libraries
import numpy as np
import os
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.patches as patches
import matplotlib.path as path

# Make pandas dataframes prettier
from IPython.display import display, HTML

# Access data using Google BigQuery.
from google.colab import auth
from google.cloud import bigquery

In [0]:
# authenticate
auth.authenticate_user()

In [0]:
# Set up environment variables
project_id='bidmc-datathon'
os.environ["GOOGLE_CLOUD_PROJECT"]=project_id

## Define the cohort

Our first step is to define the patient population we are interested in. For this project, we'd like to identify those patients with any past history of renal failure and compare them with the remaining patients.

First, we extract all patient unit stays from the patient table.


In [0]:
%%bigquery patient
-- All patient unit stays from the patient table

SELECT *
FROM `physionet-data.eicu_crd_demo.patient`

Next we look at the pasthistory table and count every past-history entry whose path contains the phrase 'Renal  (R)'. The % characters on either side of the phrase are SQL wildcards that match any sequence of characters, so the phrase may appear anywhere in the path.



In [None]:
%%bigquery ph

SELECT pasthistorypath, count(*) as n
FROM `physionet-data.eicu_crd_demo.pasthistory`
WHERE pasthistorypath LIKE '%Renal  (R)%'
GROUP BY pasthistorypath
ORDER BY n DESC;
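
To see how the `%` wildcards behave without querying BigQuery, here is a minimal sketch that runs the same kind of `LIKE` filter against an in-memory SQLite table. The table contents are made-up strings for illustration, not rows from the database. SQLite's `LIKE` is also case-insensitive by default, unlike BigQuery's, so treat this as a demonstration of the wildcard only.

```python
import sqlite3

# In-memory SQLite database holding hypothetical past-history paths
con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE pasthistory (pasthistorypath TEXT)')
con.executemany(
    'INSERT INTO pasthistory VALUES (?)',
    [('Past History/Renal  (R)/Renal Failure/renal failure',),
     ('Past History/Cardiovascular (R)/Hypertension/hypertension',),
     ('Past History/Renal  (R)/Renal Insufficiency/renal insufficiency',)],
)

# % matches any run of characters (including none) on either side of the phrase
rows = con.execute(
    "SELECT pasthistorypath, COUNT(*) AS n FROM pasthistory "
    "WHERE pasthistorypath LIKE '%Renal  (R)%' "
    "GROUP BY pasthistorypath ORDER BY n DESC"
).fetchall()

matched = [r[0] for r in rows]
for path, n in rows:
    print(n, path)
```

Only the two paths containing 'Renal  (R)' are returned; the cardiovascular path is filtered out even though it also ends in '(R)'.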

In [0]:
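# print each count next to its path, skipping the first 48 characters
# (the leading part of the path shared by these entries) for readability
for _, r in ph.iterrows():
    print('{:3g} - {:20s}'.format(r['n'], r['pasthistorypath'][48:]))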

These all seem like reasonable surrogates for renal insufficiency (note: for a real clinical study, you'd want to be a lot more thorough!).



In [0]:
%%bigquery df_have_crf
-- identify patients with renal insufficiency

SELECT DISTINCT patientunitstayid
FROM `physionet-data.eicu_crd_demo.pasthistory`
WHERE pasthistorypath LIKE '%Renal  (R)%'

In [None]:
df_have_crf['crf'] = 1

In [0]:
# merge the data above into our original dataframe
df = patient.merge(df_have_crf,
                   how='left',
                   on='patientunitstayid')

df.head()

In [0]:
# impute 0s for the missing CRF values only, so that missing values
# in the other patient columns are not overwritten
df['crf'] = df['crf'].fillna(0)
df.head()

In [0]:
# set patientunitstayid as the index - convenient for indexing later
df.set_index('patientunitstayid',inplace=True)
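
The flag-and-merge pattern used above can be checked on a toy example. This is a minimal sketch with made-up IDs and a made-up `age` column; the names `toy_patient`, `toy_crf` and `toy` are hypothetical and only serve the illustration.

```python
import pandas as pd

# Toy stand-ins for the patient table and the CRF flag table (made-up IDs)
toy_patient = pd.DataFrame({'patientunitstayid': [1, 2, 3],
                            'age': ['65', None, '70']})
toy_crf = pd.DataFrame({'patientunitstayid': [2], 'crf': [1]})

# A left merge keeps every patient; stays without a match get NaN in 'crf'
toy = toy_patient.merge(toy_crf, how='left', on='patientunitstayid')

# Fill only the flag column so that missing values elsewhere (here, age) survive
toy['crf'] = toy['crf'].fillna(0).astype(int)
toy = toy.set_index('patientunitstayid')
print(toy)
```

Filling only the `crf` column matters: a blanket fill on the whole dataframe would also turn the missing age into 0.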

## Load creatinine from lab table


In [0]:
%%bigquery lab

SELECT patientunitstayid, labresult
FROM `physionet-data.eicu_crd_demo.lab`
WHERE labname = 'creatinine'
ORDER BY patientunitstayid, labresultoffset

In [None]:
# set patientunitstayid as the index
lab.set_index('patientunitstayid', inplace=True)

In [0]:
# get first creatinine by grouping by the index (level=0)
cr_first = lab.groupby(level=0).first()

# similarly get maximum creatinine
cr_max = lab.groupby(level=0).max()
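
One thing to keep in mind is that `first()` returns the first row in each group as the rows are currently ordered, so "first creatinine" is only the earliest one if the rows are sorted by time. The sketch below illustrates this on made-up values, assuming the lab table's `labresultoffset` column (minutes from unit admission) is available alongside each result.

```python
import pandas as pd

# Made-up creatinine results; offsets are minutes from unit admission
toy_lab = pd.DataFrame({
    'patientunitstayid': [1, 1, 1, 2, 2],
    'labresultoffset':   [300, 60, 900, 120, 30],
    'labresult':         [1.4, 0.9, 2.1, 3.0, 3.5],
})

# Sort by time so that "first" means earliest, not first row returned
toy_lab = toy_lab.sort_values(['patientunitstayid', 'labresultoffset'])
grouped = toy_lab.groupby('patientunitstayid')['labresult']

toy_first = grouped.first()  # earliest measurement per stay
toy_max = grouped.max()      # highest measurement per stay
print(pd.DataFrame({'first': toy_first, 'max': toy_max}))
```

The maximum does not depend on row order, but the first value does, which is why the sort comes before the groupby.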

## Plot distributions of creatinine in both groups


In [0]:
plt.figure(figsize=[10,6])

xi = np.arange(0,10,0.1)

# get patients who had CRF and plot a histogram
# (reindex tolerates stays with no creatinine: they become NaN and are dropped)
idx = df.loc[df['crf']==1,:].index
plt.hist( cr_first['labresult'].reindex(idx).dropna(), bins=xi, label='With CRF' )

# get patients who did not have CRF
idx = df.loc[df['crf']==0,:].index
plt.hist( cr_first['labresult'].reindex(idx).dropna(), alpha=0.5, bins=xi, label='No CRF' )

plt.legend()

plt.show()

While it appears that patients with CRF have higher creatinine values, there are far more patients in the 'No CRF' group than in the 'With CRF' group, so the raw counts are hard to compare. To allow a fairer comparison, we can normalize each histogram so that it shows a density rather than a count.



In [0]:
plt.figure(figsize=[10,6])

xi = np.arange(0,10,0.1)

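# get patients who had CRF and plot a normalized histogram
# (density=True replaces the normed argument removed from recent matplotlib;
# reindex tolerates stays with no creatinine: they become NaN and are dropped)
idx = df.loc[df['crf']==1,:].index
plt.hist( cr_first['labresult'].reindex(idx).dropna(), bins=xi, density=True,
         label='With CRF' )

# get patients who did not have CRF
idx = df.loc[df['crf']==0,:].index
plt.hist( cr_first['labresult'].reindex(idx).dropna(), alpha=0.5, bins=xi, density=True,
         label='No CRF' )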

plt.legend()

plt.show()
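
To see what the normalization does numerically, here is a minimal sketch using `np.histogram` on synthetic values (randomly generated, not patient data). With `density=True`, each group's bar heights are scaled so that the total area under its histogram is 1, regardless of how many values the group contains.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic values standing in for two groups of very different size
small_group = rng.normal(3.0, 0.5, size=50)
large_group = rng.normal(1.0, 0.3, size=5000)

bins = np.arange(0, 10, 0.1)
dens_small, _ = np.histogram(small_group, bins=bins, density=True)
dens_large, _ = np.histogram(large_group, bins=bins, density=True)

# Bar height times bin width, summed over bins, gives the area of each histogram
widths = np.diff(bins)
area_small = (dens_small * widths).sum()
area_large = (dens_large * widths).sum()
print(area_small, area_large)
```

Both areas come out as 1 (up to floating-point error), so the shapes of the two distributions can be compared directly even though one group is 100 times larger.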

Here we can see much more clearly that the first creatinine measured tends to be higher for patients with some baseline kidney dysfunction than for those without. Let's try the same comparison with the highest value.



In [0]:
plt.figure(figsize=[10,6])

xi = np.arange(0,10,0.1)

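# get patients who had CRF and plot a normalized histogram
# (density=True replaces the normed argument removed from recent matplotlib;
# reindex tolerates stays with no creatinine: they become NaN and are dropped)
idx = df.loc[df['crf']==1,:].index
plt.hist( cr_max['labresult'].reindex(idx).dropna(), bins=xi, density=True,
         label='With CRF' )

# get patients who did not have CRF
idx = df.loc[df['crf']==0,:].index
plt.hist( cr_max['labresult'].reindex(idx).dropna(), alpha=0.5, bins=xi, density=True,
         label='No CRF' )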

plt.legend()

plt.show()

Unsurprisingly, a very similar story!