<a href="https://colab.research.google.com/github/MIT-LCP/2019_tokyo_datathon/blob/master/04_prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# eICU Collaborative Research Database

# Notebook 4: Prediction

This notebook explores how decision trees can be trained to predict the in-hospital mortality of patients.


## Load libraries and connect to the database

In [0]:
# Import libraries
import numpy as np
import os
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.patches as patches
import matplotlib.path as path

# model building
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn import preprocessing
from sklearn import metrics
from sklearn import impute
from sklearn import tree
from sklearn import ensemble

# Make pandas dataframes prettier
from IPython.display import display, HTML, Image
plt.rcParams.update({'font.size': 20})
%matplotlib inline
plt.style.use('ggplot')

# Access data using Google BigQuery.
from google.colab import auth
from google.cloud import bigquery

In [0]:
# authenticate
auth.authenticate_user()

In [0]:
# Set up environment variables
project_id='datathonjapan2019'
os.environ["GOOGLE_CLOUD_PROJECT"]=project_id

To make our lives easier, we'll also install and import a set of helper functions from the `datathon2` package. We will be using the following functions from the package:
- `plot_model_pred_2d()`: to visualize our data, showing the class regions predicted by a tree against the true class of each point.
- `run_query()`: to run an SQL query against our BigQuery database and assign the results to a dataframe.
- `create_graph()`: to build a graph of a fitted tree that we can display as an image.


In [0]:
!pip install datathon2

In [0]:
import datathon2 as dtn

In this notebook we'll be looking at tree models, so we'll now install and import packages for visualizing these models.

In [0]:
!apt-get install graphviz -y
!pip install pydotplus

In [0]:
import pydotplus

## Load the patient cohort

In this example, we will load data from the patient table and link it to physiology score data to provide richer summary information. Note that the physiology score measurements are the "worst" values recorded during the first day.

In [0]:
# Link the patient, apachepatientresult, and apacheapsvar tables on patientunitstayid
# using an inner join.
query = """
SELECT p.unitadmitsource, p.gender, p.age, p.unittype, p.unitstaytype, 
    a.actualhospitalmortality,
    v.heartrate, v.meanbp
FROM `physionet-data.eicu_crd_demo.patient` p
INNER JOIN `physionet-data.eicu_crd_demo.apachepatientresult` a
ON p.patientunitstayid = a.patientunitstayid
INNER JOIN `physionet-data.eicu_crd_demo.apacheapsvar` v
ON p.patientunitstayid = v.patientunitstayid
WHERE a.apacheversion LIKE 'IVa'
"""

cohort = dtn.run_query(query,project_id)

In [0]:
cohort.head()

## Prepare the data for analysis

Before continuing, we want to review our data, paying attention to factors such as:
- data types (for example, are values recorded as characters or numerical values?) 
- missing data
- distribution of values
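The three checks above can be sketched with a few pandas calls. This is a minimal illustration on a made-up dataframe (the `toy` frame and its values are assumptions, not rows from the cohort); the same calls should work on `cohort`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the cohort; the values are made up for illustration.
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Female", None],
    "age": ["54", "> 89", "71", "63"],
    "heartrate": [88.0, 120.0, np.nan, 64.0],
})

# Data types: 'age' is stored as text, not as a number.
dtypes = toy.dtypes

# Missing data: number of missing values in each column.
n_missing = toy.isna().sum()

# Distribution of values: summary statistics for the numeric columns
# and counts per category for a categorical column.
summary = toy.describe()
gender_counts = toy["gender"].value_counts(dropna=False)

print(dtypes, n_missing, summary, gender_counts, sep="\n\n")
```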

In [0]:
# review the dataset
print(cohort.info())

In [0]:
# Encode the categorical data
encoder = preprocessing.LabelEncoder()
cohort['gender_code'] = encoder.fit_transform(cohort['gender'])
cohort['unittype_code'] = encoder.fit_transform(cohort['unittype'])
cohort['actualhospitalmortality_code'] = encoder.fit_transform(cohort['actualhospitalmortality'])
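To see which label ends up with which code, we can inspect the encoder after fitting. The sketch below uses made-up labels (an assumption for illustration). `LabelEncoder` sorts the labels and numbers them from 0. Because the cell above reuses a single encoder for several columns, its `classes_` attribute only reflects the last column it was fitted on.

```python
import pandas as pd
from sklearn import preprocessing

# Made-up outcome labels for illustration.
labels = pd.Series(["ALIVE", "EXPIRED", "ALIVE", "ALIVE"])

encoder = preprocessing.LabelEncoder()
codes = encoder.fit_transform(labels)

# classes_ holds the sorted labels; the position of each label is its code.
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform recovers the original labels from the codes.
print(encoder.inverse_transform(codes))
```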


In [0]:
# Handle the deidentified ages
cohort['agenum'] = pd.to_numeric(cohort['age'], downcast='integer', errors='coerce')
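A small sketch of what `errors='coerce'` does here, assuming the deidentified ages appear as the string `> 89` (the example values are made up). Anything that cannot be parsed as a number becomes `NaN`, so these patients end up with a missing age rather than a numeric one.

```python
import pandas as pd

# Made-up ages, including one deidentified value.
ages = pd.Series(["54", "> 89", "71"])

# Values that cannot be parsed as numbers are replaced by NaN.
agenum = pd.to_numeric(ages, downcast="integer", errors="coerce")

print(agenum)
print("Missing after conversion:", agenum.isna().sum())
```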

In [0]:
# Preview the encoded data
cohort[['gender','gender_code']].head()

In [0]:
# Check the outcome variable
cohort['actualhospitalmortality_code'].unique()

## Create our train and test sets

Note that we only use two columns of data because initially we'd like to visualize our tree models in two dimensions. 

In [0]:
features = ['heartrate','meanbp']
outcome = 'actualhospitalmortality_code'

X = cohort[features]
y = cohort[outcome]

In [0]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=10)

In [0]:
# Review the number of cases in each set
print("Train data: {}".format(len(X_train)))
print("Test data: {}".format(len(X_test)))
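As a quick illustration of the two arguments used above, the sketch below splits ten made-up rows (the arrays are assumptions for illustration). `test_size=0.2` holds out 20% of the rows, and fixing `random_state` makes the split reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten made-up rows with two features each, and a binary outcome.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])

# test_size=0.2 holds out 2 of the 10 rows for testing.
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=10)

# Repeating the call with the same random_state gives the same split.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X_toy, y_toy, test_size=0.2, random_state=10)

print(len(X_tr), len(X_te))
print(np.array_equal(X_te, X_te2))
```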


## Decision trees

Let's build the simplest tree model we can think of: a classification tree with only one split. Decision trees of this form are commonly referred to under the umbrella term Classification and Regression Trees (CART) [1].

While we will only be looking at classification here, regression isn't too different. After grouping the data (which is essentially what a decision tree does), classification involves assigning all members of the group to the majority class of that group during training. Regression is the same, except you would assign the average value, not the majority.

In the case of a decision tree with one split, often called a "stump", the model will partition the data into two groups, and assign classes for those two groups based on majority vote. There are many parameters available for the `DecisionTreeClassifier` class; by specifying `max_depth=1` we will build a decision tree with only one split, i.e. of depth 1.

[1] L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth, Belmont, CA, 1984.
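To make the majority-vote idea concrete, here is a hand-written stump on made-up data. The threshold is fixed by hand rather than learned, and the helper names are hypothetical; this is a sketch of the idea, not how scikit-learn implements it.

```python
import numpy as np

# Made-up data: one feature and a binary outcome.
x = np.array([40, 45, 50, 60, 70, 80])
y = np.array([1, 1, 0, 0, 0, 0])

# Hypothetical split point, chosen by hand rather than learned.
threshold = 46.5

def majority(labels):
    """Return the most frequent class in a group."""
    return int(np.bincount(labels).argmax())

# Each side of the split is assigned the majority class of its training points.
left_class = majority(y[x <= threshold])
right_class = majority(y[x > threshold])

def predict(values):
    """Assign each value the class of the group it falls into."""
    return np.where(np.asarray(values) <= threshold, left_class, right_class)

print(predict([42, 90]))
```

For regression, `majority` would be replaced by the mean of the group's outcome values.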

In [0]:
# specify max_depth=1 so we train a stump, i.e. a tree with only 1 split
mdl = tree.DecisionTreeClassifier(max_depth=1)

# fit the model to the data - trying to predict y from X
mdl = mdl.fit(X_train,y_train)

Since our model is so simple, we can actually look at the full decision tree.

In [0]:
graph = dtn.create_graph(mdl,feature_names=features)
Image(graph.create_png())

Here we see three nodes: a node at the top, a node in the lower left, and a node in the lower right.

The top node is the root of the tree: it contains all the data. Let's read this node bottom to top:
- `value = [809, 78]`: Current class balance. There are 809 observations of the first class (code 0) and 78 observations of the second class (code 1).
- `samples = 887`: Number of samples assessed at this node.
- `gini = 0.16`: Gini impurity, a measure of how mixed the classes are. The higher the value, the bigger the mix of classes. A 50/50 split of two classes would result in a value of 0.5, and a node containing a single class has a value of 0.
- `meanbp <= 46.5`: Decision rule learned by the node. Patients with a mean BP of 46.5 or lower are moved into the left node, and those above 46.5 into the right node.
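The Gini value reported at the root can be reproduced from the class counts. For class proportions $p_k$, the Gini impurity is $1 - \sum_k p_k^2$. The helper name below is our own, used only for illustration.

```python
import numpy as np

def gini_impurity(counts):
    """Gini impurity, 1 - sum(p_k^2), computed from per-class counts."""
    counts = np.asarray(counts, dtype=float)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

# Class counts at the root node of the tree above.
root = gini_impurity([809, 78])
print(round(root, 2))  # 0.16

# A 50/50 mix of two classes gives 0.5; a single class gives 0.
print(gini_impurity([50, 50]), gini_impurity([10, 0]))
```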

The Gini impurity is what the algorithm uses to choose a split. The model evaluates every feature (in our case, heart rate and mean blood pressure) at every candidate threshold (46, 47, ...) and picks the split that gives the lowest Gini impurity across the two resulting nodes, weighted by the number of samples in each node.
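The search for a split can be sketched as a brute-force loop over features and thresholds. This is a simplified illustration on made-up data, not scikit-learn's implementation. It assumes that candidate thresholds are the midpoints between consecutive distinct feature values, and that each candidate is scored by the sample-weighted Gini impurity of the two child nodes. The function names are our own.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a group of integer class labels."""
    if len(labels) == 0:
        return 0.0
    p = np.bincount(labels) / len(labels)
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Return (feature index, threshold, weighted Gini) of the best single split."""
    best = (None, None, np.inf)
    n = len(y)
    for j in range(X.shape[1]):
        values = np.unique(X[:, j])
        # Candidate thresholds: midpoints between consecutive distinct values.
        for t in (values[:-1] + values[1:]) / 2:
            left = y[X[:, j] <= t]
            right = y[X[:, j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best[2]:
                best = (j, t, score)
    return best

# Made-up data: columns are heart rate and mean blood pressure.
X_toy = np.array([[80, 40], [90, 45], [100, 60], [70, 70], [110, 80], [85, 90]])
y_toy = np.array([1, 1, 0, 0, 0, 0])

feature, threshold, score = best_split(X_toy, y_toy)
print(feature, threshold, score)
```

In this toy data the second column separates the classes perfectly, so the search selects it with a weighted impurity of 0.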

The approach is referred to as "greedy" because at each node we choose the split that is best given our current state, without considering how it affects later splits. Let's take a closer look at our decision boundary.

In [0]:
# look at the regions in a 2d plot
# based on scikit-learn tutorial plot_iris.html
plt.figure(figsize=[10,8])
dtn.plot_model_pred_2d(mdl, X_train, y_train)

In this plot we can see the decision boundary as a horizontal line at a fixed value on the y-axis, separating the predicted classes. The true classes are indicated at each point. Where the background and point colours are mismatched, there has been misclassification. Of course, we are using a very simple model. Let's see what happens when we increase the depth to 5.

In [0]:
mdl = tree.DecisionTreeClassifier(max_depth=5)
mdl = mdl.fit(X_train,y_train)

In [0]:
plt.figure(figsize=[10,8])
dtn.plot_model_pred_2d(mdl, X_train, y_train)

Now our tree is more complicated! We can see a few vertical boundaries as well as the horizontal one from before. Some of these we may like, but some appear unnatural. Let's look at the tree itself.

In [0]:
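# export the fitted tree in Graphviz dot format, labelling splits with our feature names
tree_graph = tree.export_graphviz(mdl, out_file=None,
                                  feature_names=features,
                                  filled=True, rounded=True)
# render the dot data as a PNG image
graph = pydotplus.graphviz.graph_from_dot_data(tree_graph)
Image(graph.create_png())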
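If Graphviz is not available, scikit-learn can also print a fitted tree as plain text with `tree.export_text`. The sketch below fits a small tree on synthetic data (the data and the `toy_mdl` name are assumptions for illustration). In the notebook, the same call should work with `mdl` and `features`.

```python
import numpy as np
from sklearn import tree

# Synthetic data: the outcome depends only on the second column.
rng = np.random.RandomState(0)
X_toy = rng.uniform(40, 120, size=(200, 2))
y_toy = (X_toy[:, 1] < 60).astype(int)

toy_mdl = tree.DecisionTreeClassifier(max_depth=2, random_state=0)
toy_mdl = toy_mdl.fit(X_toy, y_toy)

# Print the decision rules as indented text, one line per node.
text = tree.export_text(toy_mdl, feature_names=["heartrate", "meanbp"])
print(text)
```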