<img src = "../../Data/bgsedsc_0.jpg">

# Project: Support Vector Machines (SVM)

## Programming project: probability of death

In this project, you have to predict the probability of death of a patient entering an ICU (Intensive Care Unit).

The dataset comes from the MIMIC project (https://mimic.physionet.org/). MIMIC-III (Medical Information Mart for Intensive Care III) is a large, freely available database comprising de-identified health-related data associated with over forty thousand patients who stayed in critical care units of the Beth Israel Deaconess Medical Center between 2001 and 2012.

Each row of *mimic_train.csv* corresponds to one ICU stay (*hadm_id*+*icustay_id*) of one patient (*subject_id*). Column HOSPITAL_EXPIRE_FLAG is the indicator of death (=1) as a result of the current hospital stay; this is the outcome to predict in our modelling exercise.
The remaining columns correspond to the vitals of each patient (when entering the ICU), plus some general characteristics (age, gender, etc.); their explanation can be found in *mimic_patient_metadata.csv*.

Note that the main cause/disease of the patient's condition is embedded as a code in the *ICD9_diagnosis* column. The meaning of this code can be found in *MIMIC_metadata_diagnose.csv*. **But** this is only the main one; a patient can have co-occurring diseases (comorbidities). These secondary codes can be found in *extra_data/MIMIC_diagnoses.csv*.

Don't use features that are not yet known on the first day a patient enters the ICU, such as LOS (length of stay).

As a performance metric, you can use *AUC* for the binary classification case, but feel free to report any other metric as well if you can justify that it is particularly suitable for this case.
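As a minimal illustration of the metric, the sketch below computes the ROC AUC, plus average precision as one possible complementary metric for imbalanced outcomes, on four made-up labels and scores. The numbers are purely illustrative and do not come from the MIMIC data.

```python
from sklearn.metrics import average_precision_score, roc_auc_score

# Toy labels and predicted probabilities of death (illustrative values only).
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# AUC: share of (positive, negative) pairs where the positive is ranked higher.
auc = roc_auc_score(y_true, y_score)

# Average precision summarises the precision-recall curve instead.
ap = average_precision_score(y_true, y_score)

print(auc)  # 0.75 (3 of the 4 positive/negative pairs are ordered correctly)
print(ap)
```

Both functions take a continuous score rather than a hard 0/1 prediction, which is why a predicted probability (or any monotone score) is what should be passed in.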

The main tasks are:
+ Using the *mimic_train.csv* file, build a predictive model for *HOSPITAL_EXPIRE_FLAG*.
+ For this analysis there is an extra test dataset, *mimic_test.csv*. Apply your final model to this extra dataset and submit to the Kaggle competition to obtain the accuracy of your predictions (follow the requested format).

Try to optimize hyperparameters of your SVM model.

You can follow these **steps** in your first implementation:
1. *Explore* and understand the dataset.
2. Manage missing data.
3. Manage categorical features, e.g. create *dummy variables* for relevant categorical features, or build an ad hoc distance function.
4. Build a prediction model. Try to improve it using methods to tackle class imbalance.
5. Assess the expected accuracy of the previous models using *cross-validation*.
6. Test the performance on the test file by submitting to Kaggle, following the same preparation steps (missing data, dummies, etc.). Remember that you should be able to yield a prediction for every row of the test dataset.
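The steps above can be sketched as a single scikit-learn pipeline. This is only one possible arrangement, shown on a small random stand-in for the training file: the frame `X_toy`, the outcome `y_toy`, and the choice of two columns are assumptions made for illustration, not part of the project data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVC

# Hypothetical toy stand-in for mimic_train.csv (random values, two columns only).
rng = np.random.default_rng(0)
n = 120
X_toy = pd.DataFrame({
    "HeartRate_Mean": rng.normal(85, 15, n),
    "GENDER": rng.choice(["F", "M"], n),
})
X_toy.loc[::10, "HeartRate_Mean"] = np.nan  # inject some missing vitals
y_toy = (rng.random(n) < 0.3).astype(int)   # imbalanced 0/1 outcome

# Step 2: impute; step 3: dummy variables; scaling matters for SVMs.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("dummies", OneHotEncoder(handle_unknown="ignore")),
])
prep = ColumnTransformer([
    ("num", numeric, ["HeartRate_Mean"]),
    ("cat", categorical, ["GENDER"]),
])

# Step 4: SVM with class weights to counter imbalance.
model = Pipeline([
    ("prep", prep),
    ("svc", SVC(kernel="rbf", class_weight="balanced")),
])

# Step 5: cross-validated AUC; preprocessing is refit inside every fold.
cv_auc = cross_val_score(model, X_toy, y_toy, cv=3, scoring="roc_auc")
print(cv_auc.mean())
```

Wrapping the preparation steps in the pipeline means the same fitted imputers, encoder and scaler are applied to the test file when calling `model.fit(...)` followed by a prediction on the test rows, and no information from the held-out fold leaks into the preprocessing during cross-validation.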

For the in-class version, feel free to reduce the training dataset if you experience computational constraints.

## Main criteria for IN_CLASS grading
The weighting of these components will vary between the in-class and extended projects:
+ Code runs - 15%
+ Data preparation - 20%
+ SVMs method(s) have been used - 20%
+ Probability of death for each test patient is computed - 15%
+ Accuracy itself - 15%
+ Hyperparameter optimization - 10%
+ Class imbalance management - 5%
+ Neat and understandable code, with some titles and comments - 0%
+ Improved methods from what we discussed in class (properly explained/justified) - 0%


In [1]:
import os
from google.colab import drive
drive.mount('/content/drive')
os.chdir('/content/drive/MyDrive/CML_2_Projects/Project 2/')

Mounted at /content/drive


In [73]:
from utils import helper_functions
import pandas as pd
import seaborn as sns
import numpy as np
import sklearn
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score,recall_score,precision_score,f1_score, confusion_matrix, roc_auc_score
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_predict as cvp
from sklearn.impute import SimpleImputer
from sklearn.metrics import make_scorer

In [89]:
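def reweight_binary(pi,q1=0.5,r1=0.5):
    #pi: predicted probability of class 1 under the training class prior r1
    #q1: class-1 prior to correct towards (e.g. the observed death rate)
    r0 = 1-r1
    q0 = 1-q1
    #rescale each class by its prior ratio, then renormalise so the two classes sum to 1
    tot = pi*(q1/r1)+(1-pi)*(q0/r0)
    w = pi*(q1/r1)
    w /= tot
    return w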
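A quick numeric check of the function above, under the assumption that the model was trained as if both classes were equally likely (`r1 = 0.5`) while the real positive rate is 10% (`q1 = 0.1`). Both rates are illustrative values, not figures from the dataset. The function is repeated so the snippet runs on its own.

```python
import numpy as np

def reweight_binary(pi, q1=0.5, r1=0.5):
    # Same prior-correction formula as in the cell above.
    r0 = 1 - r1
    q0 = 1 - q1
    tot = pi * (q1 / r1) + (1 - pi) * (q0 / r0)
    w = pi * (q1 / r1)
    w /= tot
    return w

# A "coin flip" prediction under balanced priors maps back to the base rate.
p_corrected = reweight_binary(0.5, q1=0.1, r1=0.5)
print(round(p_corrected, 3))  # 0.1

# The function also works element-wise on arrays and preserves the ranking.
p_array = reweight_binary(np.array([0.2, 0.5, 0.9]), q1=0.1, r1=0.5)
print(p_array)
```

Because the correction is monotone in `pi`, it changes the calibration of the probabilities but not their order, so the AUC is unaffected by it.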

In [35]:
train_data = pd.read_csv('mimic_train.csv')
test_data = pd.read_csv('mimic_test_death.csv')

In [36]:
#assign features
features_to_drop = ['LOS', 'Diff']
identifiers = ['subject_id', 'hadm_id', 'icustay_id']
numerical_features = ['HeartRate_Min', 'HeartRate_Max', 'HeartRate_Mean', 'SysBP_Min',
       'SysBP_Max', 'SysBP_Mean', 'DiasBP_Min', 'DiasBP_Max', 'DiasBP_Mean',
       'MeanBP_Min', 'MeanBP_Max', 'MeanBP_Mean', 'RespRate_Min',
       'RespRate_Max', 'RespRate_Mean', 'TempC_Min', 'TempC_Max', 'TempC_Mean',
       'SpO2_Min', 'SpO2_Max', 'SpO2_Mean', 'Glucose_Min', 'Glucose_Max',
       'Glucose_Mean']
categorical_features = ['GENDER', 'DOB', 'ADMITTIME', 'ADMISSION_TYPE', 'INSURANCE', 'RELIGION',
       'MARITAL_STATUS', 'ETHNICITY', 'DIAGNOSIS', 'ICD9_diagnosis',
       'FIRST_CAREUNIT']
target = ['HOSPITAL_EXPIRE_FLAG']

In [37]:
#drop irrelevant columns
train_data = train_data.drop(features_to_drop, axis=1)
test_data = test_data.drop('Diff', axis=1)
#note that LOS is not in the test set, so only Diff needs to be dropped there

In [38]:
train_data.isnull().sum()

HOSPITAL_EXPIRE_FLAG       0
subject_id                 0
hadm_id                    0
icustay_id                 0
HeartRate_Min           2187
HeartRate_Max           2187
HeartRate_Mean          2187
SysBP_Min               2208
SysBP_Max               2208
SysBP_Mean              2208
DiasBP_Min              2209
DiasBP_Max              2209
DiasBP_Mean             2209
MeanBP_Min              2186
MeanBP_Max              2186
MeanBP_Mean             2186
RespRate_Min            2189
RespRate_Max            2189
RespRate_Mean           2189
TempC_Min               2497
TempC_Max               2497
TempC_Mean              2497
SpO2_Min                2203
SpO2_Max                2203
SpO2_Mean               2203
Glucose_Min              253
Glucose_Max              253
Glucose_Mean             253
GENDER                     0
DOB                        0
ADMITTIME                  0
ADMISSION_TYPE             0
INSURANCE                  0
RELIGION                   0
MARITAL_STATUS

In [39]:
test_data.isnull().sum()

subject_id          0
hadm_id             0
icustay_id          0
HeartRate_Min     545
HeartRate_Max     545
HeartRate_Mean    545
SysBP_Min         551
SysBP_Max         551
SysBP_Mean        551
DiasBP_Min        552
DiasBP_Max        552
DiasBP_Mean       552
MeanBP_Min        547
MeanBP_Max        547
MeanBP_Mean       547
RespRate_Min      546
RespRate_Max      546
RespRate_Mean     546
TempC_Min         638
TempC_Max         638
TempC_Mean        638
SpO2_Min          551
SpO2_Max          551
SpO2_Mean         551
Glucose_Min        58
Glucose_Max        58
Glucose_Mean       58
GENDER              0
DOB                 0
ADMITTIME           0
ADMISSION_TYPE      0
INSURANCE           0
RELIGION            0
MARITAL_STATUS    180
ETHNICITY           0
DIAGNOSIS           0
ICD9_diagnosis      0
FIRST_CAREUNIT      0
dtype: int64

In [40]:
#basic imputation for numerical features
imp_num = SimpleImputer(strategy="mean")
imp_num.fit(train_data[numerical_features])
train_data[numerical_features] = imp_num.transform(train_data[numerical_features])
test_data[numerical_features] = imp_num.transform(test_data[numerical_features])

In [41]:
#impute categorical features with the most frequent value (note it's only marital status that has missing values)
imp_cat = SimpleImputer(strategy="most_frequent")
imp_cat.fit(train_data[categorical_features])
train_data[categorical_features] = imp_cat.transform(train_data[categorical_features])
test_data[categorical_features] = imp_cat.transform(test_data[categorical_features])

In [44]:
train_data[categorical_features].nunique()

GENDER                2
DOB               14007
ADMITTIME         19714
ADMISSION_TYPE        3
INSURANCE             5
RELIGION             17
MARITAL_STATUS        7
ETHNICITY            41
DIAGNOSIS          6193
ICD9_diagnosis     1853
FIRST_CAREUNIT        5
dtype: int64

In [None]:
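#short on time to make an age variable so drop these
#(categorical_features is a plain list, so filter it rather than calling .drop)
categorical_features = [c for c in categorical_features if c not in ('DOB', 'ADMITTIME')]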
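For reference, a rough sketch of how an age variable could be derived from `DOB` and `ADMITTIME` instead of dropping them. The two rows are invented, the `AGE` column name is hypothetical, and the cap at 90 is an assumption: as far as I understand the MIMIC-III de-identification, patients older than 89 have their date of birth shifted so that their age appears to be around 300.

```python
import pandas as pd

# Two invented rows in the same timestamp format as the dataset.
toy = pd.DataFrame({
    "DOB": ["2108-07-16 00:00:00", "1850-01-01 00:00:00"],
    "ADMITTIME": ["2178-02-06 10:35:00", "2150-03-01 08:00:00"],
})

dob = pd.to_datetime(toy["DOB"])
admit = pd.to_datetime(toy["ADMITTIME"])

# Work with calendar years instead of subtracting timestamps: a ~300-year
# difference does not fit into a nanosecond-resolution timedelta.
age = admit.dt.year - dob.dt.year

# Subtract one year if the birthday has not yet been reached at admission.
before_birthday = (admit.dt.month < dob.dt.month) | (
    (admit.dt.month == dob.dt.month) & (admit.dt.day < dob.dt.day)
)
age = age - before_birthday.astype(int)

# Assumed cap for the shifted dates of very old patients.
toy["AGE"] = age.clip(upper=90)
print(toy["AGE"].tolist())  # [69, 90]
```

Since age at admission is known on the first day, it would be a legitimate numerical feature and could be imputed and scaled together with the vitals.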

In [45]:
#scale features (only numerical features)
scaler = StandardScaler()
scaler.fit(train_data[numerical_features])
train_data[numerical_features] = scaler.transform(train_data[numerical_features])
test_data[numerical_features] = scaler.transform(test_data[numerical_features])

First implementation using linear SVC and only numerical features

Note that below I comment out a reasonable grid search due to time constraints
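The grid search mentioned above could look roughly like the following. It is run here on a small synthetic dataset from `make_classification` so that it finishes quickly; the data, the grid values and the use of `class_weight='balanced'` are assumptions for the sketch rather than the settings I would necessarily use on the full training set.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Small imbalanced synthetic problem (about 15% positives) as a stand-in.
X, y = make_classification(
    n_samples=200, n_features=8, weights=[0.85, 0.15], random_state=0
)

# The 'roc_auc' scorer ranks with decision_function, so probability=True
# is not needed during the search, which keeps each fit cheaper.
grid = GridSearchCV(
    SVC(kernel="linear", class_weight="balanced"),
    param_grid={"C": [0.1, 1, 10]},
    scoring="roc_auc",
    cv=5,
)
grid.fit(X, y)

print(grid.best_params_, grid.best_score_)
```

After the search, `grid.best_estimator_` is refit on all of the data passed to `fit` and can be used directly for prediction.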

In [48]:
from sklearn.metrics import make_scorer
linear_svc = SVC(kernel='linear', probability = True, C = 1) #class_weight imbalance not addressed 
#grid_values = {'C':[0.1, 1, 10, 100]}  
#grid_linear_svc = GridSearchCV(linear_svc, param_grid = grid_values,scoring = 'roc_auc', cv=5)
linear_svc.fit(train_data[numerical_features], train_data[target].values.ravel()) #pass y as a 1d array to avoid the column-vector warning


In [51]:
y_hat_prob = linear_svc.predict_proba(train_data[numerical_features]) 
#note that I am aware that the sklearn documentation warns that probabilities and point predictions
#may not completely align, but for the purposes of the in-class project I continue with this approach
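Since AUC only depends on how the observations are ranked, one alternative to `predict_proba` for evaluation is the raw `decision_function` score, which skips the extra probability calibration step. A minimal sketch on synthetic data (the dataset is an assumption for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.svm import SVC

# Synthetic binary problem standing in for the scaled vitals.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

# No probability=True here, so no internal calibration is fitted.
clf = SVC(kernel="linear", C=1).fit(X, y)

# Signed distance to the separating hyperplane; larger means "more class 1".
scores = clf.decision_function(X)
auc = roc_auc_score(y, scores)

# The sign of the score is what predict() uses for the class label.
labels_from_scores = (scores > 0).astype(int)
print(auc, (labels_from_scores == clf.predict(X)).all())
```

Probabilities are still needed for the Kaggle submission, so this only replaces `predict_proba` where a ranking is enough, for example when comparing models by AUC.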

In [79]:
#get predictions for kaggle
y_hat_test = linear_svc.predict_proba(test_data[numerical_features])

In [77]:
roc_auc_score(train_data[target], y_hat_prob[:, 1])

0.6863399599321929

In [None]:
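#out of sample prediction (predicted probabilities rather than class labels, so the AUC is computed on a ranking)
from sklearn.model_selection import cross_val_predict
y_hat_cv = cross_val_predict(linear_svc, train_data[numerical_features], train_data[target].values.ravel(), cv = 5, method = 'predict_proba')[:, 1]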


In [None]:
roc_auc_score(train_data[target], y_hat_cv)

In [88]:
#try for balanced class_weight
linear_svc_balanced = SVC(kernel='linear', probability = True, C = 1, class_weight = 'balanced') #would usually grid search for C
linear_svc_balanced.fit(train_data[numerical_features], train_data[target].values.ravel())
y_hat_prob_balanced = linear_svc_balanced.predict_proba(train_data[numerical_features]) 
y_hat_test_balanced = linear_svc_balanced.predict_proba(test_data[numerical_features])


NameError: ignored

In [90]:
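#reweight test probabilities from the balanced prior (r1) back to the observed death rate in the training data (q1)

q1 = float(train_data[target].values.mean())
r1 = 0.5
y_hat_test_balanced[:, 1] = reweight_binary(y_hat_test_balanced[:, 1], q1, r1)
y_hat_test_balanced[:, 0] = 1 - y_hat_test_balanced[:, 1]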

In [91]:
#in-sample AUC needs the training-set predictions (the test set has no labels and a different length)
roc_auc_score(train_data[target], y_hat_prob_balanced[:, 1])

Second simple implementation using non-linear SVC and only numerical features

Note that below I comment out a reasonable grid search due to time constraints. 

Fit didn't run in time, but code below

In [84]:
rbf_svc = SVC(kernel='rbf', probability = True, C=1, gamma=0.5) #on reflection should have probably left as the default gamma 
#grid_values = {'C':[0.1, 1, 10], 'gamma':[0.01,0.1,0.2, 0.5, 0.8]} #will try more values for gamma in extended project
#grid_rbf_svc = GridSearchCV(rbf_svc, param_grid=grid_values, scoring = 'roc_auc', cv = 5)
rbf_svc.fit(train_data[numerical_features], train_data[target].values.ravel())
y_hat_prob = rbf_svc.predict_proba(train_data[numerical_features])


In [85]:
roc_auc_score(train_data[target], y_hat_prob[:, 1])
#bit of a crazy roc_auc_score. I assume some overfitting is going on (would check with cross-val), but purpose was to demonstrate I understood
#how to implement the non-linear SVC. 

0.9525999797590872
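The overfitting suspicion can be checked by comparing the in-sample AUC with a cross-validated one. The sketch below does this on synthetic data with the same `C=1, gamma=0.5` settings; the dataset is an assumption, and `decision_function` is used instead of `predict_proba` purely to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC

# Synthetic stand-in with 20 features, 5 of them informative.
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=0
)

rbf = SVC(kernel="rbf", C=1, gamma=0.5)

# In-sample AUC: scored on the same rows the model was fitted on.
train_auc = roc_auc_score(y, rbf.fit(X, y).decision_function(X))

# Out-of-sample AUC: every row is scored by a model that never saw it.
cv_scores = cross_val_predict(rbf, X, y, cv=5, method="decision_function")
cv_auc = roc_auc_score(y, cv_scores)

print(train_auc, cv_auc)
```

A large gap between the two numbers points to a kernel that is too narrow for the number of features; `gamma='scale'`, the scikit-learn default, adapts the width to the feature count and variance and is usually a safer starting point before a grid search.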

In [86]:
#get predictions for kaggle
y_hat_test = rbf_svc.predict_proba(test_data[numerical_features])

In [None]:
#out of sample prediction (predicted probabilities rather than class labels)
y_hat_cv = cross_val_predict(rbf_svc, train_data[numerical_features], train_data[target].values.ravel(), cv = 5, method = 'predict_proba')[:, 1]


In [None]:
roc_auc_score(train_data[target], y_hat_cv)

In [None]:
#your code here

### Kaggle Predictions Submissions

Once you have produced test set predictions, you can submit them to <i>Kaggle</i> to see how your model performs.

The following code provides an example of generating a <i>.csv</i> file to submit to Kaggle:

1) Create a pandas dataframe with two columns: one with the test set "icustay_id" values and the other with your predicted "HOSPITAL_EXPIRE_FLAG" for that observation.

2) Use the <i>.to_csv</i> pandas method to create a csv file. Setting <i>index = False</i> is important to ensure the <i>.csv</i> is in the format Kaggle expects.

In [87]:
# Produce .csv for kaggle testing 
test_predictions_submit = pd.DataFrame({"icustay_id": test_data["icustay_id"], "HOSPITAL_EXPIRE_FLAG": y_hat_test_balanced[:, 1]})
test_predictions_submit.to_csv("test_predictions_submit.csv", index = False)