<img src = "../../Data/bgsedsc_0.jpg">

# Project: (K-) Nearest Neighbors


## Programming project: probability of death

In this project, you have to predict the probability of death of a patient that is entering an ICU (Intensive Care Unit).

The dataset comes from MIMIC project (https://mimic.physionet.org/). MIMIC-III (Medical Information Mart for Intensive Care III) is a large, freely-available database comprising deidentified health-related data associated with over forty thousand patients who stayed in critical care units of the Beth Israel Deaconess Medical Center between 2001 and 2012.

Each row of *mimic_train.csv* correponds to one ICU stay (*hadm_id*+*icustay_id*) of one patient (*subject_id*). Column HOSPITAL_EXPIRE_FLAG is the indicator of death (=1) as a result of the current hospital stay; this is the outcome to predict in our modelling exercise.
The remaining columns correspond to vitals of each patient (when entering the ICU), plus some general characteristics (age, gender, etc.), and their explanation can be found at *mimic_patient_metadata.csv*. 

Please don't use any feature that you infer you don't know the first day of a patient in an ICU.

Note that the main cause/disease of patient condition is embedded as a code at *ICD9_diagnosis* column. The meaning of this code can be found at *MIMIC_metadata_diagnose.csv*. **But** this is only the main one; a patient can have co-occurrent diseases (comorbidities). These secondary codes can be found at *extra_data/MIMIC_diagnoses.csv*.

As performance metric, you can use *AUC* for the binary classification case, but feel free to report as well any other metric if you can justify that is particularly suitable for this case.

Main tasks are:
+ Using *mimic_train.csv* file build a predictive model for *HOSPITAL_EXPIRE_FLAG* .
+ For this analysis there is an extra test dataset, *mimic_test_death.csv*. Apply your final model to this extra dataset and generate predictions following the same format as *mimic_kaggle_death_sample_submission.csv*. Once ready, you can submit to our Kaggle competition and iterate to improve the accuracy.

As a *bonus*, try different algorithms for neighbor search and for distance, and justify final selection. Try also different weights to cope with class imbalance and also to balance neighbor proximity. Try to assess somehow confidence interval of predictions.

You can follow those **steps** in your first implementation:
1. *Explore* and understand the dataset. 
2. Manage missing data.
2. Manage categorial features. E.g. create *dummy variables* for relevant categorical features, or build an ad hoc distance function.
3. Build a prediction model. Try to improve it using methods to tackle class imbalance.
5. Assess expected accuracy  of previous models using *cross-validation*. 
6. Test the performance on the test file and report accuracy, following same preparation steps (missing data, dummies, etc). Remember that you should be able to yield a prediction for all the rows of the test dataset.

Feel free to reduce the training dataset if you experience computational constraints.

## Main criteria for IN_CLASS grading
The weighting of these components will vary between the in-class and extended projects:
+ Code runs - 20%
+ Data preparation - 35%
+ Nearest neighbor method(s) have been used - 15%
+ Probability of death for each test patient is computed - 10%
+ Accuracy of predictions for test patients is calculated (kaggle) - 10%
+ Hyperparameter optimization - 5%
+ Class imbalance management - 5%
+ Neat and understandable code, with some titles and comments - 0%
+ Improved methods from what we discussed in class (properly explained/justified) - 0%

In [166]:
import os
from google.colab import drive
drive.mount('/content/drive')
os.chdir('/content/drive/MyDrive/CML_2_Projects/Project 1/')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [167]:
import pandas as pd
import numpy as np
import os 
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier, NearestNeighbors 
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score, roc_auc_score, roc_curve, auc, confusion_matrix 
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import RandomOverSampler

In [168]:
# Training dataset
data=pd.read_csv('mimic_train.csv')
data.head()

Unnamed: 0,HOSPITAL_EXPIRE_FLAG,subject_id,hadm_id,icustay_id,HeartRate_Min,HeartRate_Max,HeartRate_Mean,SysBP_Min,SysBP_Max,SysBP_Mean,DiasBP_Min,DiasBP_Max,DiasBP_Mean,MeanBP_Min,MeanBP_Max,MeanBP_Mean,RespRate_Min,RespRate_Max,RespRate_Mean,TempC_Min,TempC_Max,TempC_Mean,SpO2_Min,SpO2_Max,SpO2_Mean,Glucose_Min,Glucose_Max,Glucose_Mean,GENDER,DOB,DOD,ADMITTIME,DISCHTIME,DEATHTIME,Diff,ADMISSION_TYPE,INSURANCE,RELIGION,MARITAL_STATUS,ETHNICITY,DIAGNOSIS,ICD9_diagnosis,FIRST_CAREUNIT,LOS
0,0,55440,195768,228357,89.0,145.0,121.043478,74.0,127.0,106.586957,42.0,90.0,61.173913,59.0,94.0,74.543478,15.0,30.0,22.347826,35.111111,36.944444,36.080247,90.0,99.0,95.73913,111.0,230.0,160.777778,F,2108-07-16 00:00:00,2180-03-09 00:00:00,2178-02-06 10:35:00,2178-02-13 18:30:00,,-61961.7847,EMERGENCY,Medicare,PROTESTANT QUAKER,SINGLE,WHITE,GASTROINTESTINAL BLEED,5789,MICU,4.5761
1,0,76908,126136,221004,63.0,110.0,79.117647,89.0,121.0,106.733333,49.0,74.0,64.733333,58.0,84.0,74.8,13.0,21.0,16.058824,36.333333,36.611111,36.472222,98.0,100.0,99.058824,103.0,103.0,103.0,F,2087-01-16 00:00:00,,2129-02-12 22:34:00,2129-02-13 16:20:00,,-43146.18378,EMERGENCY,Private,UNOBTAINABLE,MARRIED,WHITE,ESOPHAGEAL FOOD IMPACTION,53013,MICU,0.7582
2,0,95798,136645,296315,81.0,98.0,91.689655,88.0,138.0,112.785714,45.0,67.0,56.821429,64.0,88.0,72.888889,13.0,21.0,15.9,36.444444,36.888889,36.666667,100.0,100.0,100.0,132.0,346.0,217.636364,F,2057-09-17 00:00:00,,2125-11-17 23:04:00,2125-12-05 17:55:00,,-42009.96157,EMERGENCY,Medicare,PROTESTANT QUAKER,SEPARATED,BLACK/AFRICAN AMERICAN,UPPER GI BLEED,56983,MICU,3.7626
3,0,40708,102505,245557,76.0,128.0,98.857143,84.0,135.0,106.972973,30.0,89.0,41.864865,48.0,94.0,62.783784,12.0,35.0,26.771429,36.333333,39.5,37.833333,78.0,100.0,95.085714,108.0,139.0,125.0,F,2056-02-27 00:00:00,2132-03-01 00:00:00,2131-01-26 08:00:00,2131-02-05 16:23:00,,-43585.37922,ELECTIVE,Medicare,NOT SPECIFIED,WIDOWED,WHITE,HIATAL HERNIA/SDA,5533,SICU,3.8734
4,0,28424,127337,225281,,,,,,,,,,,,,,,,,,,,,,97.0,137.0,113.0,F,2066-12-19 00:00:00,2147-01-18 00:00:00,2146-05-04 02:02:00,2146-05-20 18:40:00,,-50271.76602,EMERGENCY,Medicare,JEWISH,WIDOWED,WHITE,ABDOMINAL PAIN,56211,TSICU,5.8654


In [169]:
data.shape

(20885, 44)

In [170]:
data.isnull().sum()

HOSPITAL_EXPIRE_FLAG        0
subject_id                  0
hadm_id                     0
icustay_id                  0
HeartRate_Min            2187
HeartRate_Max            2187
HeartRate_Mean           2187
SysBP_Min                2208
SysBP_Max                2208
SysBP_Mean               2208
DiasBP_Min               2209
DiasBP_Max               2209
DiasBP_Mean              2209
MeanBP_Min               2186
MeanBP_Max               2186
MeanBP_Mean              2186
RespRate_Min             2189
RespRate_Max             2189
RespRate_Mean            2189
TempC_Min                2497
TempC_Max                2497
TempC_Mean               2497
SpO2_Min                 2203
SpO2_Max                 2203
SpO2_Mean                2203
Glucose_Min               253
Glucose_Max               253
Glucose_Mean              253
GENDER                      0
DOB                         0
DOD                     13511
ADMITTIME                   0
DISCHTIME                   0
DEATHTIME 

In [171]:
data.columns

Index(['HOSPITAL_EXPIRE_FLAG', 'subject_id', 'hadm_id', 'icustay_id',
       'HeartRate_Min', 'HeartRate_Max', 'HeartRate_Mean', 'SysBP_Min',
       'SysBP_Max', 'SysBP_Mean', 'DiasBP_Min', 'DiasBP_Max', 'DiasBP_Mean',
       'MeanBP_Min', 'MeanBP_Max', 'MeanBP_Mean', 'RespRate_Min',
       'RespRate_Max', 'RespRate_Mean', 'TempC_Min', 'TempC_Max', 'TempC_Mean',
       'SpO2_Min', 'SpO2_Max', 'SpO2_Mean', 'Glucose_Min', 'Glucose_Max',
       'Glucose_Mean', 'GENDER', 'DOB', 'DOD', 'ADMITTIME', 'DISCHTIME',
       'DEATHTIME', 'Diff', 'ADMISSION_TYPE', 'INSURANCE', 'RELIGION',
       'MARITAL_STATUS', 'ETHNICITY', 'DIAGNOSIS', 'ICD9_diagnosis',
       'FIRST_CAREUNIT', 'LOS'],
      dtype='object')

In [172]:
irrelevant_features = ['subject_id', 'hadm_id', 'icustay_id']
numerical_features = ['subject_id', 'hadm_id', 'icustay_id',
       'HeartRate_Min', 'HeartRate_Max', 'HeartRate_Mean', 'SysBP_Min',
       'SysBP_Max', 'SysBP_Mean', 'DiasBP_Min', 'DiasBP_Max', 'DiasBP_Mean',
       'MeanBP_Min', 'MeanBP_Max', 'MeanBP_Mean', 'RespRate_Min',
       'RespRate_Max', 'RespRate_Mean', 'TempC_Min', 'TempC_Max', 'TempC_Mean',
       'SpO2_Min', 'SpO2_Max', 'SpO2_Mean', 'Glucose_Min', 'Glucose_Max',
       'Glucose_Mean', 'LOS']
categorical_features = ['GENDER', 'DOB', 'DOD', 'ADMITTIME', 'DISCHTIME',
       'DEATHTIME', 'Diff', 'ADMISSION_TYPE', 'INSURANCE', 'RELIGION',
       'MARITAL_STATUS', 'ETHNICITY', 'DIAGNOSIS', 'ICD9_diagnosis',
       'FIRST_CAREUNIT']
target = ['HOSPITAL_EXPIRE_FLAG']

In [173]:
#drop id columns
data = data.drop(irrelevant_features, axis=1)

#So for a number of numerical features there are roughly 2000 NA observations. As
#a first attempt, let's see what happens when we drop observations that have 
#nas in all the below columns. 
data = data.dropna(subset=['HeartRate_Min', 'HeartRate_Max', 'HeartRate_Mean', 'SysBP_Min',
       'SysBP_Max', 'SysBP_Mean', 'DiasBP_Min', 'DiasBP_Max', 'DiasBP_Mean',
       'MeanBP_Min', 'MeanBP_Max', 'MeanBP_Mean', 'RespRate_Min',
       'RespRate_Max', 'RespRate_Mean', 'TempC_Min', 'TempC_Max', 'TempC_Mean',
       'SpO2_Min', 'SpO2_Max', 'SpO2_Mean'], how = 'all')

In [174]:
print(data.shape)
print(data.isnull().sum())
#So we have lost about 2000 observations, but now have far fewer NAs. 

(18702, 41)
HOSPITAL_EXPIRE_FLAG        0
HeartRate_Min               4
HeartRate_Max               4
HeartRate_Mean              4
SysBP_Min                  25
SysBP_Max                  25
SysBP_Mean                 25
DiasBP_Min                 26
DiasBP_Max                 26
DiasBP_Mean                26
MeanBP_Min                  3
MeanBP_Max                  3
MeanBP_Mean                 3
RespRate_Min                6
RespRate_Max                6
RespRate_Mean               6
TempC_Min                 314
TempC_Max                 314
TempC_Mean                314
SpO2_Min                   20
SpO2_Max                   20
SpO2_Mean                  20
Glucose_Min               253
Glucose_Max               253
Glucose_Mean              253
GENDER                      0
DOB                         0
DOD                     12298
ADMITTIME                   0
DISCHTIME                   0
DEATHTIME               16604
Diff                        0
ADMISSION_TYPE              

In [175]:
#To create a clean dataset with respect to numerical features, simply drop any observation that contains a missing value. 
data = data.dropna(subset=['HeartRate_Min', 'HeartRate_Max', 'HeartRate_Mean', 'SysBP_Min',
       'SysBP_Max', 'SysBP_Mean', 'DiasBP_Min', 'DiasBP_Max', 'DiasBP_Mean',
       'MeanBP_Min', 'MeanBP_Max', 'MeanBP_Mean', 'RespRate_Min',
       'RespRate_Max', 'RespRate_Mean', 'TempC_Min', 'TempC_Max', 'TempC_Mean',
       'SpO2_Min', 'SpO2_Max', 'SpO2_Mean', 'Glucose_Min', 'Glucose_Max',
       'Glucose_Mean'], how = 'any')
#drop any observations with missing data for marital_status
data = data.dropna(subset=['MARITAL_STATUS'], how = 'any')

In [176]:
print(data.shape)
print(data.isnull().sum())
#So we have lost a futher 1300ish observations, but now have no NAs for numerical features or marital status

(17476, 41)
HOSPITAL_EXPIRE_FLAG        0
HeartRate_Min               0
HeartRate_Max               0
HeartRate_Mean              0
SysBP_Min                   0
SysBP_Max                   0
SysBP_Mean                  0
DiasBP_Min                  0
DiasBP_Max                  0
DiasBP_Mean                 0
MeanBP_Min                  0
MeanBP_Max                  0
MeanBP_Mean                 0
RespRate_Min                0
RespRate_Max                0
RespRate_Mean               0
TempC_Min                   0
TempC_Max                   0
TempC_Mean                  0
SpO2_Min                    0
SpO2_Max                    0
SpO2_Mean                   0
Glucose_Min                 0
Glucose_Max                 0
Glucose_Mean                0
GENDER                      0
DOB                         0
DOD                     11580
ADMITTIME                   0
DISCHTIME                   0
DEATHTIME               15702
Diff                        0
ADMISSION_TYPE              

In [177]:
data['HOSPITAL_EXPIRE_FLAG'].value_counts()
#So we see that for those people who survived, they do not have a DEATHTIME.

0    15702
1     1774
Name: HOSPITAL_EXPIRE_FLAG, dtype: int64

In [178]:
#This is a bit weird, but suggests we know the day somebody died, but not necessarily the time. 
data[(data['DEATHTIME'].isna() == True) & (data['DOD'].isna() == False)]

Unnamed: 0,HOSPITAL_EXPIRE_FLAG,HeartRate_Min,HeartRate_Max,HeartRate_Mean,SysBP_Min,SysBP_Max,SysBP_Mean,DiasBP_Min,DiasBP_Max,DiasBP_Mean,MeanBP_Min,MeanBP_Max,MeanBP_Mean,RespRate_Min,RespRate_Max,RespRate_Mean,TempC_Min,TempC_Max,TempC_Mean,SpO2_Min,SpO2_Max,SpO2_Mean,Glucose_Min,Glucose_Max,Glucose_Mean,GENDER,DOB,DOD,ADMITTIME,DISCHTIME,DEATHTIME,Diff,ADMISSION_TYPE,INSURANCE,RELIGION,MARITAL_STATUS,ETHNICITY,DIAGNOSIS,ICD9_diagnosis,FIRST_CAREUNIT,LOS
0,0,89.0,145.0,121.043478,74.0,127.0,106.586957,42.0,90.0,61.173913,59.0,94.0,74.543478,15.0,30.0,22.347826,35.111111,36.944444,36.080247,90.0,99.0,95.739130,111.0,230.0,160.777778,F,2108-07-16 00:00:00,2180-03-09 00:00:00,2178-02-06 10:35:00,2178-02-13 18:30:00,,-61961.78470,EMERGENCY,Medicare,PROTESTANT QUAKER,SINGLE,WHITE,GASTROINTESTINAL BLEED,5789,MICU,4.5761
3,0,76.0,128.0,98.857143,84.0,135.0,106.972973,30.0,89.0,41.864865,48.0,94.0,62.783784,12.0,35.0,26.771429,36.333333,39.500000,37.833333,78.0,100.0,95.085714,108.0,139.0,125.000000,F,2056-02-27 00:00:00,2132-03-01 00:00:00,2131-01-26 08:00:00,2131-02-05 16:23:00,,-43585.37922,ELECTIVE,Medicare,NOT SPECIFIED,WIDOWED,WHITE,HIATAL HERNIA/SDA,5533,SICU,3.8734
9,0,74.0,98.0,81.142857,84.0,140.0,113.875000,35.0,72.0,54.343750,31.0,81.0,66.806452,17.0,28.0,23.264706,35.888889,37.111111,36.652778,88.0,99.0,94.600000,85.0,161.0,112.000000,M,2032-05-26 00:00:00,2118-03-09 00:00:00,2117-12-15 18:12:00,2117-12-23 15:25:00,,-39375.73369,EMERGENCY,Medicare,BUDDHIST,WIDOWED,WHITE,PULMONARY EMBOLISM,41511,TSICU,1.3265
10,0,61.0,102.0,76.772727,84.0,126.0,102.595238,32.0,63.0,46.142857,45.0,78.0,59.547619,16.0,34.0,22.068182,35.722222,36.611111,36.015873,92.0,100.0,97.386364,90.0,161.0,118.600000,F,1805-09-07 00:00:00,2107-07-28 00:00:00,2105-09-07 10:37:00,2105-09-13 20:08:00,,-34323.25475,EMERGENCY,Medicare,PROTESTANT QUAKER,WIDOWED,BLACK/AFRICAN AMERICAN,STROKE;TELEMETRY;TRANSIENT ISCHEMIC ATTACK,34830,SICU,2.1554
14,0,52.0,127.0,95.157895,92.0,144.0,118.714286,41.0,67.0,55.250000,59.0,97.0,74.423077,9.0,31.0,19.076923,35.500000,38.500000,36.898810,95.0,100.0,97.473684,78.0,185.0,135.636364,F,2052-02-23 00:00:00,2128-09-27 00:00:00,2127-10-30 14:00:00,2127-11-11 15:00:00,,-43430.35550,ELECTIVE,Medicare,CATHOLIC,MARRIED,WHITE,MITRAL VALVE DISORDER\MITRAL VALVE REPLACEMENT,3941,CSRU,1.2737
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20860,0,64.0,76.0,70.208333,84.0,120.0,101.047619,52.0,89.0,58.904762,61.0,96.0,70.238095,10.0,23.0,14.285714,36.444444,36.444444,36.444444,97.0,100.0,99.740741,103.0,126.0,114.333333,F,2050-02-01 00:00:00,2105-02-26 00:00:00,2102-09-16 19:51:00,2102-09-19 14:55:00,,-34334.75811,EMERGENCY,Medicaid,CATHOLIC,SINGLE,WHITE,ALTERED MENTAL STATUS,5722,MICU,0.8968
20861,0,64.0,117.0,96.517241,91.0,167.0,109.750000,54.0,90.0,68.593750,68.0,115.0,81.843750,14.0,30.0,18.677419,36.611111,37.111111,36.916667,93.0,100.0,96.896552,118.0,129.0,124.666667,M,2082-04-12 00:00:00,2143-11-12 00:00:00,2143-02-21 18:01:00,2143-02-23 18:15:00,,-48847.30437,EMERGENCY,Private,CATHOLIC,MARRIED,WHITE,SUBGLOTTIC STENOSIS,5184,SICU,2.0185
20871,0,88.0,119.0,107.500000,75.0,145.0,122.000000,30.0,94.0,50.000000,50.0,98.0,66.931034,18.0,31.0,24.966667,35.944444,36.166667,36.037037,89.0,99.0,94.433333,81.0,86.0,83.500000,F,2098-08-14 00:00:00,2164-10-31 00:00:00,2164-06-08 18:06:00,2164-06-25 14:16:00,,-55762.62141,EMERGENCY,Private,CATHOLIC,MARRIED,WHITE,ATRIAL FIBRILLATION;TELEMETRY;PLEURAL EFFUSION,0389,MICU,16.8399
20879,0,80.0,85.0,81.947368,81.0,107.0,94.176471,52.0,69.0,58.823529,60.0,78.0,66.882353,11.0,20.0,14.944444,36.333333,37.000000,36.666667,97.0,100.0,98.631579,102.0,102.0,102.000000,M,2097-09-24 00:00:00,2159-10-29 00:00:00,2158-12-15 23:01:00,2158-12-22 15:45:00,,-54809.57459,EMERGENCY,Private,NOT SPECIFIED,MARRIED,WHITE,S/P ARREST,4271,MICU,0.8011


In [179]:
data[categorical_features].nunique()

GENDER                2
DOB               12045
DOD                3951
ADMITTIME         16519
DISCHTIME         16519
DEATHTIME          1566
Diff              13680
ADMISSION_TYPE        3
INSURANCE             5
RELIGION             17
MARITAL_STATUS        7
ETHNICITY            41
DIAGNOSIS          5343
ICD9_diagnosis     1749
FIRST_CAREUNIT        5
dtype: int64

In [180]:
#Create dummies for categorical features with few unique values. Drop others for first implementation (to avoid blowing up the feature space). Note that DEATHTIME and DOD are also dropped which temporarily avoids the missing data problem. 
data = data.drop(['DOB', 'DOD', 'ADMITTIME', 'DISCHTIME', 'DEATHTIME', 'Diff', 'RELIGION', 'MARITAL_STATUS', 'ETHNICITY', 'DIAGNOSIS', 'ICD9_diagnosis', 'LOS'], axis=1)
data = pd.get_dummies(data, columns=['GENDER', 'ADMISSION_TYPE', 'INSURANCE', 'FIRST_CAREUNIT'], drop_first=False)

In [184]:
data_test = data_test.drop(['DOB', 'ADMITTIME', 'Diff', 'RELIGION', 'MARITAL_STATUS', 'ETHNICITY', 'DIAGNOSIS', 'ICD9_diagnosis'], axis=1)
data_test = pd.get_dummies(data_test, columns=['GENDER', 'ADMISSION_TYPE', 'INSURANCE', 'FIRST_CAREUNIT'], drop_first=False)

KeyError: ignored

In [159]:
all_features = ['HeartRate_Min', 'HeartRate_Max',
       'HeartRate_Mean', 'SysBP_Min', 'SysBP_Max', 'SysBP_Mean', 'DiasBP_Min',
       'DiasBP_Max', 'DiasBP_Mean', 'MeanBP_Min', 'MeanBP_Max', 'MeanBP_Mean',
       'RespRate_Min', 'RespRate_Max', 'RespRate_Mean', 'TempC_Min',
       'TempC_Max', 'TempC_Mean', 'SpO2_Min', 'SpO2_Max', 'SpO2_Mean',
       'Glucose_Min', 'Glucose_Max', 'Glucose_Mean', 'LOS', 'GENDER_F',
       'GENDER_M', 'ADMISSION_TYPE_ELECTIVE', 'ADMISSION_TYPE_EMERGENCY',
       'ADMISSION_TYPE_URGENT', 'INSURANCE_Government', 'INSURANCE_Medicaid',
       'INSURANCE_Medicare', 'INSURANCE_Private', 'INSURANCE_Self Pay',
       'FIRST_CAREUNIT_CCU', 'FIRST_CAREUNIT_CSRU', 'FIRST_CAREUNIT_MICU',
       'FIRST_CAREUNIT_SICU', 'FIRST_CAREUNIT_TSICU']
numerical_features = ['HeartRate_Min', 'HeartRate_Max',
       'HeartRate_Mean', 'SysBP_Min', 'SysBP_Max', 'SysBP_Mean', 'DiasBP_Min',
       'DiasBP_Max', 'DiasBP_Mean', 'MeanBP_Min', 'MeanBP_Max', 'MeanBP_Mean',
       'RespRate_Min', 'RespRate_Max', 'RespRate_Mean', 'TempC_Min',
       'TempC_Max', 'TempC_Mean', 'SpO2_Min', 'SpO2_Max', 'SpO2_Mean',
       'Glucose_Min', 'Glucose_Max', 'Glucose_Mean']
categorical_features = ['GENDER_F',
       'GENDER_M', 'ADMISSION_TYPE_ELECTIVE', 'ADMISSION_TYPE_EMERGENCY',
       'ADMISSION_TYPE_URGENT', 'INSURANCE_Government', 'INSURANCE_Medicaid',
       'INSURANCE_Medicare', 'INSURANCE_Private', 'INSURANCE_Self Pay',
       'FIRST_CAREUNIT_CCU', 'FIRST_CAREUNIT_CSRU', 'FIRST_CAREUNIT_MICU',
       'FIRST_CAREUNIT_SICU', 'FIRST_CAREUNIT_TSICU']

In [160]:
# Test dataset (to produce predictions)
data_test=pd.read_csv('mimic_test_death.csv')
data_test.sort_values('icustay_id').head()

Unnamed: 0,subject_id,hadm_id,icustay_id,HeartRate_Min,HeartRate_Max,HeartRate_Mean,SysBP_Min,SysBP_Max,SysBP_Mean,DiasBP_Min,DiasBP_Max,DiasBP_Mean,MeanBP_Min,MeanBP_Max,MeanBP_Mean,RespRate_Min,RespRate_Max,RespRate_Mean,TempC_Min,TempC_Max,TempC_Mean,SpO2_Min,SpO2_Max,SpO2_Mean,Glucose_Min,Glucose_Max,Glucose_Mean,GENDER,DOB,ADMITTIME,Diff,ADMISSION_TYPE,INSURANCE,RELIGION,MARITAL_STATUS,ETHNICITY,DIAGNOSIS,ICD9_diagnosis,FIRST_CAREUNIT
4930,93535,121562,200011,56.0,82.0,71.205128,123.0,185.0,156.411765,37.0,84.0,59.911765,60.0,109.0,83.861111,18.0,34.0,26.97619,36.444444,37.666667,37.005556,89.0,100.0,95.605263,94.0,108.0,101.0,F,2104-05-13 00:00:00,2188-08-05 20:27:00,-64881.43517,EMERGENCY,Medicare,JEWISH,SINGLE,WHITE,ASTHMA;COPD EXACERBATION,49322,MICU
1052,30375,177945,200044,,,,,,,,,,,,,,,,,,,,,,125.0,125.0,125.0,F,2053-08-08 00:00:00,2135-07-07 16:13:00,-46540.62661,EMERGENCY,Medicare,CATHOLIC,WIDOWED,WHITE,HEAD BLEED,85220,SICU
3412,73241,149216,200049,54.0,76.0,64.833333,95.0,167.0,114.545455,33.0,80.0,43.363636,49.0,88.0,59.848485,11.0,27.0,18.861111,36.444444,37.388889,36.816667,88.0,100.0,96.027778,125.0,297.0,199.875,M,2054-03-21 00:00:00,2118-08-14 22:27:00,-38956.8589,EMERGENCY,Private,JEWISH,MARRIED,WHITE,HEPATIC ENCEPHALOPATHY,5722,MICU
1725,99052,129142,200063,85.0,102.0,92.560976,91.0,131.0,108.365854,42.0,64.0,53.707317,14.0,83.0,68.826667,9.0,22.0,17.291667,,,,87.0,100.0,96.837209,95.0,196.0,127.368421,M,2104-02-12 00:00:00,2141-03-09 23:19:00,-47014.25437,EMERGENCY,Medicaid,NOT SPECIFIED,SINGLE,UNKNOWN/NOT SPECIFIED,TYPE A DISSECTION,44101,CSRU
981,51698,190004,200081,82.0,133.0,94.323529,86.0,143.0,111.09375,47.0,91.0,64.28125,57.0,97.0,75.34375,9.0,28.0,21.029412,35.555556,38.888889,37.489899,92.0,100.0,95.617647,81.0,134.0,117.833333,M,2072-04-09 00:00:00,2142-02-23 06:56:00,-47377.26087,EMERGENCY,Medicare,OTHER,MARRIED,PORTUGUESE,PULMONARY EMBOLISM,41519,CCU


In [162]:
#scale numerical features
scaler = StandardScaler()
features_to_scale = []
scaler.fit(data[numerical_features])
data[numerical_features] = scaler.transform(data[numerical_features])
data_test[numerical_features] = scaler.transform(data_test[numerical_features])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  isetter(loc, value[:, i].tolist())


In [165]:
MyKNN = KNeighborsClassifier(algorithm='brute') #, metric=lambda X, Y: mydist(X, Y))

k_vals = [1, 2, 5, 10, 20] 
weights = ['uniform', 'distance']

grid_values = {'n_neighbors':k_vals, 'weights':weights}
grid_knn = GridSearchCV(MyKNN, param_grid=grid_values, scoring='accuracy', cv=5) 
grid_knn.fit(data, data[target])

print('best parameters:', grid_knn.best_params_)
print('best score:', grid_knn.best_score_)

y_pred = grid_knn.predict(data_test)
y_prob_pred = grid_knn.predict_proba(data_test)

  return self._fit(X, y)
  return self._fit(X, y)
  return self._fit(X, y)
  return self._fit(X, y)
  return self._fit(X, y)
  return self._fit(X, y)
  return self._fit(X, y)
  return self._fit(X, y)
  return self._fit(X, y)
  return self._fit(X, y)
  return self._fit(X, y)
  return self._fit(X, y)
  return self._fit(X, y)
  return self._fit(X, y)
  return self._fit(X, y)
  return self._fit(X, y)
  return self._fit(X, y)
  return self._fit(X, y)
  return self._fit(X, y)
  return self._fit(X, y)
  return self._fit(X, y)
  return self._fit(X, y)
  return self._fit(X, y)
  return self._fit(X, y)
  return self._fit(X, y)
  return self._fit(X, y)
  return self._fit(X, y)
  return self._fit(X, y)
  return self._fit(X, y)
  return self._fit(X, y)
  return self._fit(X, y)
  return self._fit(X, y)
  return self._fit(X, y)
  return self._fit(X, y)
  return self._fit(X, y)
  return self._fit(X, y)
  return self._fit(X, y)
  return self._fit(X, y)
  return self._fit(X, y)
  return self._fit(X, y)


best parameters: {'n_neighbors': 5, 'weights': 'distance'}
best score: 0.9139390695436109


  return self._fit(X, y)
Feature names unseen at fit time:
- ADMISSION_TYPE
- ADMITTIME
- DIAGNOSIS
- DOB
- Diff
- ...
Feature names seen at fit time, yet now missing:
- ADMISSION_TYPE_ELECTIVE
- ADMISSION_TYPE_EMERGENCY
- ADMISSION_TYPE_URGENT
- FIRST_CAREUNIT_CCU
- FIRST_CAREUNIT_CSRU
- ...



ValueError: ignored

### Kaggle Predictions Submissions

Once you have produced testset predictions you can submit these to <i> kaggle </i> in order to see how your model performs. 

The following code provides an example of generating a <i> .csv </i> file to submit to kaggle
1) create a pandas dataframe with two columns, one with the test set "icustay_id"'s and the other with your predicted "HOSPITAL_EXPIRE_FLAG" for that observation

2) use the <i> .to_csv </i> pandas method to create a csv file. The <i> index = False </i> is important to ensure the <i> .csv </i> is in the format kaggle expects 

In [163]:
# Sample output prediction file
pred_sample=pd.read_csv('../../Data/mimic_kaggle_death_sample_submission.csv')
pred_sample.sort_values('icustay_id').head()

FileNotFoundError: ignored

In [None]:
# Produce .csv for kaggle testing 
test_predictions_submit = pd.DataFrame({"icustay_id": data_test["icustay_id"], "HOSPITAL_EXPIRE_FLAG": y_hat_test})
test_predictions_submit.to_csv("test_predictions_submit.csv", index = False)