In this notebook, we'll see the basics of building a machine learning model using _scikit-learn_.

We'll be working with a data set consisting of a sample of 200 subjects who were part of a study on survival of patients following admission to an adult intensive care unit. The goal of the study was to develop a logistic regression model to predict the probability of survival to hospital discharge of these patients.

In [None]:
import pandas as pd

In [None]:
icu = pd.read_csv('../data/icu.csv')

In [None]:
icu.head(2)

The variables are as follows:

|Variable | Description | Codes/Values|
|---|---|---|
| ID | Identification Code | ID Number|
| STA | Vital Status | 0 = Lived<br /> 1 = Died |
| AGE | Age | Years |
| SEX | Sex | 0 = Male<br /> 1 = Female | 
| RACE | Race | 1 = White<br />2 = Black<br />3 = Other |
| SER | Service at ICU Admission | 0 = Medical<br />1 = Surgical |
| CAN | Cancer Part of Present Problem | 0 = No<br />1 = Yes |
| CRN | History of Chronic Renal Failure | 0 = No<br />1 = Yes |
| INF | Infection Probable at ICU Admission | 0 = No<br />1 = Yes |
| CPR | CPR Prior to ICU Admission | 0 = No<br />1 = Yes |
| SYS | Systolic Blood Pressure at ICU Admission | mm Hg |
| HRA | Heart Rate at ICU Admission | Beats/min |
| PRE | Previous Admission to an ICU Within 6 Months | 0 = No<br />1 = Yes |
| TYP | Type of Admission | 0 = Elective<br />1 = Emergency |
| FRA | Long Bone, Multiple, Neck, Single Area, or Hip Fracture | 0 = No<br />1 = Yes |
| PO2 | PO2 from Initial Blood Gases | 0: $>$60<br />1: $\leq$ 60 |
| PH | PH from Initial Blood Gases | 0: $\geq$ 7.25<br />1: $<$7.25 |
| PCO | PCO2 from Initial Blood Gases | 0: $\leq$ 45<br />1: $>$45 |
| BIC | Bicarbonate from Initial Blood Gases | 0: $\geq$ 18<br />1: $<$ 18 |
| CRE | Creatinine from Initial Blood Gases | 0: $\leq$2.0<br />1: $>$2.0 |
| LOC | Level of Consciousness at ICU Admission | 0 = No Coma or Deep Stupor<br />1 = Deep Stupor<br />2 = Coma |

# Machine Learning Approach - _scikit-learn_

The _scikit-learn_ library includes a large number of supervised and unsupervised learning models along with other useful utilities for machine learning. See the [supervised learning section of the user guide](https://scikit-learn.org/stable/supervised_learning.html).

For this notebook, we'll use a **random forest**, an ensemble model built from decision trees.

Our goal is to build a model that is useful for making predictions on future data. Hence, we will include all of the variables and let the algorithm determine which have predictive power.

A random forest is a very flexible model and consequently will perform well on the data that it is trained on. To get a fair assessment of how well our model makes predictions, we'll set aside some of our data as a **test set**.

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

In [None]:
X = icu.drop(columns = ['ID', 'STA'])             # Use all variables as predictors except for the ID and the target
y = icu['STA']                                    # Target variable

We need to encode the categorical variables, and we'll do so using the `get_dummies` function from `pandas`.

In [None]:
categorical_variables = ['RACE', 'LOC']
X = pd.get_dummies(X, columns = categorical_variables)

In [None]:
X.head(2)
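
To make the encoding step concrete, here is a minimal sketch on a made-up three-row table that uses the same coding scheme as `RACE` and `LOC`. The `toy` DataFrame and its values are invented for illustration and are not taken from the ICU data.

```python
import pandas as pd

# Hypothetical toy data using the same codes as the RACE and LOC columns
toy = pd.DataFrame({'AGE': [27, 59, 77],
                    'RACE': [1, 2, 3],
                    'LOC': [0, 0, 2]})

# Each listed column is replaced by one indicator column per level that
# actually appears in the data; the other columns are left untouched
encoded = pd.get_dummies(toy, columns=['RACE', 'LOC'])

print(encoded.columns.tolist())
# ['AGE', 'RACE_1', 'RACE_2', 'RACE_3', 'LOC_0', 'LOC_2']
```

Note that no `LOC_1` column is created here because level 1 never occurs in the toy data. Indicator columns are only generated for levels that are present.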

Split the data into a training set and a test set.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    stratify = y,     # Keep the same proportions of the target in the training and test data
                                                    test_size = 0.25,
                                                    random_state = 321)
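
The effect of `stratify` can be checked on a synthetic target. The sketch below assumes a made-up outcome with 160 zeros and 40 ones (the names `X_demo` and `y_demo` are hypothetical) and confirms that both pieces of the split keep the same share of ones.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic target: 160 zeros and 40 ones (an assumed 20% positive rate)
y_demo = pd.Series([0] * 160 + [1] * 40)
X_demo = pd.DataFrame({'feature': np.arange(200)})

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          stratify=y_demo,
                                          test_size=0.25,
                                          random_state=321)

# With stratification, both sets preserve the 20% share of ones
print(len(y_tr), y_tr.mean())
print(len(y_te), y_te.mean())
```

Without `stratify`, the share of ones in a small test set can drift noticeably from the share in the full data, which makes test metrics harder to interpret.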

In [None]:
clf = RandomForestClassifier(random_state = 321)
clf.fit(X_train, y_train)

Now, we can see how well it performs on our test set.

In [None]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_auc_score

First, generate predictions.

In [None]:
y_pred = clf.predict(X_test)

In [None]:
accuracy_score(y_test, y_pred)

In [None]:
confusion_matrix(y_test, y_pred)

In [None]:
print(classification_report(y_test, y_pred))
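
To help read these outputs, here is a small sketch with hand-made labels (not model output) showing how the confusion matrix is laid out and how accuracy, precision, and recall follow from it. In _scikit-learn_, rows correspond to the true class and columns to the predicted class.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Hypothetical labels: 1 = died, 0 = lived
y_true_demo = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred_demo = [0, 0, 0, 1, 1, 0, 0, 0]

# For binary labels, ravel() returns the counts in this order
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()
print(tn, fp, fn, tp)  # 4 1 2 1

accuracy = accuracy_score(y_true_demo, y_pred_demo)    # (tp + tn) / total
precision = precision_score(y_true_demo, y_pred_demo)  # tp / (tp + fp)
recall = recall_score(y_true_demo, y_pred_demo)        # tp / (tp + fn)
print(accuracy, precision, recall)
```

In this toy case the accuracy is 0.625 even though only one of the three positive cases is caught, which is why it is worth looking at the per-class precision and recall in the classification report rather than accuracy alone.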

We can also look at the predicted probabilities.

In [None]:
y_proba = clf.predict_proba(X_test)[:,1]

In [None]:
y_proba

In [None]:
roc_auc_score(y_test, y_proba)
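
One way to interpret the ROC AUC is as the fraction of (positive, negative) pairs in which the positive case receives the higher predicted probability. The sketch below checks this on four made-up scores; the variable names are hypothetical, and ties are ignored because there are none in this example.

```python
from itertools import product
from sklearn.metrics import roc_auc_score

# Hypothetical true labels and predicted probabilities of class 1
y_true_demo = [0, 0, 1, 1]
proba_demo = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc_score(y_true_demo, proba_demo)

# Manual check: count the pairs where the positive case is ranked higher
pos = [p for p, t in zip(proba_demo, y_true_demo) if t == 1]
neg = [p for p, t in zip(proba_demo, y_true_demo) if t == 0]
pairs = list(product(pos, neg))
manual_auc = sum(p > n for p, n in pairs) / len(pairs)

print(auc, manual_auc)  # 0.75 0.75
```

A value of 0.5 corresponds to ranking cases no better than chance, while 1.0 means that every positive case is ranked above every negative case.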

One nice feature of a random forest model is that it reports which variables it relies on most to make predictions.

**Warning:** A high importance value indicates that the model is relying on a particular variable to make predictions but doesn't reveal _how_ it is using that variable. 

In [None]:
importances = pd.DataFrame({'variable': X.columns, 'importance': clf.feature_importances_})
importances.sort_values('importance', ascending = False)
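
As a sanity check on how to read these values, the sketch below fits a forest to synthetic data in which the target depends only on one column. The data and the names `signal` and `noise` are invented for illustration. The impurity-based importances are normalized to sum to 1, and the informative column should rank first.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic data: the target is determined entirely by 'signal'
X_demo = pd.DataFrame({'signal': rng.normal(size=300),
                       'noise': rng.normal(size=300)})
y_demo = (X_demo['signal'] > 0).astype(int)

forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X_demo, y_demo)

imp = pd.Series(forest.feature_importances_, index=X_demo.columns)
imp = imp.sort_values(ascending=False)

print(imp)
print(imp.sum())  # importances are normalized to sum to 1
```

Because the values are relative shares, an importance is only meaningful in comparison with the other variables in the same model.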

# Appendix - Statistical Analysis using _statsmodels_

A useful library for conducting statistical tests is the _statsmodels_ library.

Let's say we want to test the null hypothesis that there is no difference in average age between patients who died and patients who survived, against the alternative hypothesis that there is a difference.

For this, we can use a t-test.

In [None]:
from statsmodels.stats.weightstats import ttest_ind

In [None]:
ttest_ind(x1 = icu[icu['STA'] == 0]['AGE'],               # ages of patients who lived
          x2 = icu[icu['STA'] == 1]['AGE'],               # ages of patients who died
          alternative = 'two-sided',                      # can perform a one-sided test by using 'larger' or 'smaller'
          usevar = 'unequal')                             # Use Welch's t-test, which does not assume equal variances

This function returns the test statistic, the p-value, and the degrees of freedom.
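
Since the result is a plain tuple, it can be unpacked into named variables. The following is a minimal sketch on synthetic samples; the group means, spreads, and sizes are assumptions chosen for illustration and are not taken from the ICU data.

```python
import numpy as np
from statsmodels.stats.weightstats import ttest_ind

rng = np.random.default_rng(0)

# Two synthetic samples with different means, variances, and sizes
group_a = rng.normal(loc=55, scale=20, size=160)
group_b = rng.normal(loc=65, scale=15, size=40)

# Unpack the three return values in order
t_stat, p_value, dof = ttest_ind(group_a, group_b,
                                 alternative='two-sided',
                                 usevar='unequal')

print(f"t = {t_stat:.3f}, p = {p_value:.4f}, df = {dof:.1f}")
```

With `usevar = 'unequal'`, the degrees of freedom are estimated from the data and are generally not a whole number.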

In this case, at the 0.05 significance level, we can reject the null hypothesis.

# Statistical Modeling Approach - _statsmodels_

If we want to build a logistic regression model, we can make use of _statsmodels_ along with the _patsy_ library to build a design matrix.

In [None]:
from patsy import dmatrices
import statsmodels.api as sm

In [None]:
y, X = dmatrices('STA ~ AGE',                       # Target variable ~ Predictor variable(s)
                 icu,                               # Dataset
                 return_type = 'dataframe')

In [None]:
X.head(2)

Now, we'll use the `Logit` class from _statsmodels_ to build our model.

In [None]:
logit = sm.Logit(y, X)

Fit the model and save the result.

In [None]:
res = logit.fit()

We can see the parameters using the `params` attribute.

In [None]:
res.params

And we can get a statistical summary using the `summary()` method.

In [None]:
res.summary()

If we want to include other variables, we can do so by separating them with a `+`.

To include categorical variables with more than 2 levels, we can encode them using `C()`.

In [None]:
y, X = dmatrices('STA ~ AGE + SEX + C(RACE)', icu, return_type = 'dataframe')

In [None]:
X.head(2)
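
To see what `C()` does in isolation, here is a minimal sketch on a hypothetical four-row table (the values are invented). Because the formula includes an intercept, _patsy_ drops the first level of the categorical variable as the reference level and creates indicator columns for the remaining levels.

```python
import pandas as pd
from patsy import dmatrices

# Hypothetical toy data using the same coding as the ICU columns
toy = pd.DataFrame({'STA': [0, 1, 0, 1],
                    'AGE': [27, 59, 77, 54],
                    'RACE': [1, 2, 3, 1]})

y_demo, X_demo = dmatrices('STA ~ AGE + C(RACE)', toy,
                           return_type='dataframe')

# Expect an intercept, AGE, and indicators for RACE levels 2 and 3 only
print(X_demo.columns.tolist())
```

This differs from `pd.get_dummies`, which by default keeps an indicator for every level. Dropping a reference level avoids perfect collinearity with the intercept, and each remaining coefficient is then interpreted relative to `RACE` = 1.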

In [None]:
logit = sm.Logit(y, X)
res = logit.fit()
res.summary()