# K-Nearest Neighbours and Support Vector Classifiers 
## Gian-Piero Lovicu
This notebook contains commentary and code to fit K-Nearest Neighbours (KNN) and Support Vector Classifier (SVC) models to the MIMIC dataset. The notebook is structured as follows:

* Data cleaning and feature creation, which applies to both models
* Fit a KNN classifier to the data
* Fit an SVC to the data

# Data Cleaning and Feature Creation
## Section 0 - Import packages

Specific functions are imported when they are called.

In [1]:
#Import packages and functions
#Packages
import pandas as pd
import numpy as np
import datetime as dt

#sklearn functions
from sklearn.compose import make_column_transformer
from sklearn.pipeline import Pipeline
from imblearn.pipeline import Pipeline as imbPipeline
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn import preprocessing
from imblearn.over_sampling import SMOTENC, SMOTE, RandomOverSampler
import sklearn.feature_selection as fs
from sklearn.neighbors import KNeighborsClassifier 
from sklearn.svm import SVC
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

## Section 1 - Import data
Import training and test sets, detailed diagnoses data and custom ethnicity mapping (I manually mapped ethnicity to a reduced set of categories for ease of interpretation). The ethnicity mapping file was submitted with this notebook.

Note - you may need to change path to data.

In [2]:
path = '/home/gigilovicu/Documents/masters/semester_1/machine learning/CM1_materials/'

# Training data set
data_train = pd.read_csv(path + "mimic_dataset/mimic_train.csv")         
# Testing data set
data_test = pd.read_csv(path + "mimic_dataset/mimic_test_death.csv")
# Diagnoses data
data_diagnoses = pd.read_csv(path + "mimic_dataset/extra_data/MIMIC_diagnoses.csv")
# Ethnicity mapping - map ethnicity to simpler categories (submitted with this notebook)
ethnicity_mapping = pd.read_csv(path + 'mimic_dataset/ethnicity.csv')


## Section 2 - Clean data and generate new features

First, deal with the date columns `DOB` (the patient's date of birth) and `ADMITTIME` (the date and time of admission) to extract the patient's age.
* Convert to date format using the `datetime` package
* Add the `Diff` column (an offset in days) to undo the date scrambling and recover sensible values
* Calculate `AGE` at time of admission as the difference between `ADMITTIME` and `DOB`. I hypothesise that older patients are more likely to die.

In [3]:
for df in [data_train, data_test]:
    # Convert admittime to date, making adjustment for date scrambling
    df['ADMITTIME'] = (df["ADMITTIME"].apply(lambda x: dt.datetime.strptime(x, "%Y-%m-%d %H:%M:%S"))
                      + df["Diff"].apply(lambda x: dt.timedelta(x))).apply(lambda x: x.date())
    # Convert dob to date, making adjustment for date scrambling
    df['DOB'] = (df["DOB"].apply(lambda x: dt.datetime.strptime(x, "%Y-%m-%d %H:%M:%S"))
                + df["Diff"].apply(lambda x: dt.timedelta(x))).apply(lambda x: x.date())
    # Convert to age in years (365-day years, leap days ignored)
    df['AGE'] = (df['ADMITTIME'] - df['DOB']).apply(lambda x: x.days//365)
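
A minimal sketch of the age calculation on a single made-up record. The dates, the `Diff` value and the helper name `age_at_admission` are assumptions for illustration, not part of the notebook. It also shows that when the same whole-day offset is added to both dates it cancels in the subtraction, so the age does not depend on it.

```python
import datetime as dt

FMT = "%Y-%m-%d %H:%M:%S"

def age_at_admission(dob, admittime, diff_days):
    """Age in whole years at admission, after shifting both dates by diff_days."""
    shift = dt.timedelta(days=diff_days)
    dob_date = (dt.datetime.strptime(dob, FMT) + shift).date()
    admit_date = (dt.datetime.strptime(admittime, FMT) + shift).date()
    # Same approximation as the notebook: 365-day years, leap days ignored
    return (admit_date - dob_date).days // 365

# Hypothetical record, evaluated with and without the offset
age_shifted = age_at_admission("2100-01-01 00:00:00", "2150-06-01 12:00:00", -40000)
age_unshifted = age_at_admission("2100-01-01 00:00:00", "2150-06-01 12:00:00", 0)
print(age_shifted, age_unshifted)
```

The shift still matters for the stored `ADMITTIME` and `DOB` values themselves, which is why the notebook applies it before dropping `Diff`.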

Next, extract some additional health data for use in the model.

**Number of diagnoses per visit (`SEQ_NUM`).** From the (linked) comorbidities dataset, we can extract a count of diagnoses per visit. The idea is that a patient with more diagnoses is more likely to die.        

In [4]:

#DIAGNOSES - count diagnoses by patient per hospital visit
data_diagnoses.columns = ['subject_id', 'hadm_id', 'SEQ_NUM', 'ICD9_diagnosis']
data_diagnoses_per_visit = data_diagnoses[['hadm_id', 'SEQ_NUM']].groupby(['hadm_id'], sort = False).max()

# Merge into main data
data_train = data_train.merge(data_diagnoses_per_visit, on = ['hadm_id'], how = 'left')
data_test = data_test.merge(data_diagnoses_per_visit, on = ['hadm_id'], how = 'left')


**Repeat visits (`repeat_visits`)**. Some patients in the data present to ICU multiple times. Repeat visits may indicate a higher probability of dying. The trick is to calculate cumulative visits *at the time of a particular admission* (i.e. a patient may visit 3 times, but we don't know that on the first or second visit). Calculating this involves ordering the data by patient and admission time, grouping by patient and doing a cumulative count within groups.


In [5]:
#Sort values by patient and admission date
data_train = data_train.sort_values(['subject_id', 'ADMITTIME']).reset_index(drop = True)
data_test = data_test.sort_values(['subject_id', 'ADMITTIME']).reset_index(drop = True)

#REPEAT VISITS - cumulative total of visits to ICU (known on admission date)
data_train['repeat_visits'] = data_train.groupby(['subject_id']).cumcount()+1
data_test['repeat_visits'] = data_test.groupby(['subject_id']).cumcount()+1
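
As a small illustration of the cumulative count, here is the same sort-then-`cumcount` pattern on a toy table. The patient IDs and dates are made up; only the column names follow the notebook.

```python
import pandas as pd

# Toy admissions for two hypothetical patients, deliberately out of order
visits = pd.DataFrame({
    "subject_id": [2, 1, 1, 2, 1],
    "ADMITTIME": ["2101-03-01", "2100-01-05", "2100-07-20", "2101-01-10", "2102-02-02"],
})

# Order by patient and admission date, then number each patient's visits 1, 2, 3, ...
visits = visits.sort_values(["subject_id", "ADMITTIME"]).reset_index(drop=True)
visits["repeat_visits"] = visits.groupby("subject_id").cumcount() + 1
print(visits)
```

Each row only counts the visits up to and including that admission, so no information from later visits enters the feature.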

**Intensive care diagnosis code (`ICD9_diagnosis`)**. There are thousands of these codes (including one for West Nile Fever with Encephalitis, who knew). Each diagnosis a patient is given is mapped to a code. I combine the comorbidities data with information on whether a patient died or not to calculate two kinds of variables. I reasoned that this was allowed because *historical* data on deaths within a diagnosis code are known when a patient enters ICU (i.e. the data is not patient specific, it is diagnosis specific).

**Death rates (`max_deathrate`, `mean_deathrate`, `median_deathrate`)**. I calculate death rates for each `ICD9_diagnosis` as the count of deaths in a code divided by the total count of diagnoses in that code. Then I assign each patient a maximum, average and median death rate based on all of their comorbidities.

In [6]:
#BIGGEST KILLERS - looks at the diagnoses with the most deaths (by number and by share)
# Join in death indicator
data_diagnoses_all = data_diagnoses.merge(data_train[['hadm_id', 'HOSPITAL_EXPIRE_FLAG']], on = ['hadm_id'], how = 'left')
# Count deaths and survivals by ICD9_diagnosis
biggest_killers = data_diagnoses_all[['HOSPITAL_EXPIRE_FLAG', 'ICD9_diagnosis']].groupby(['HOSPITAL_EXPIRE_FLAG', 'ICD9_diagnosis']).size().reset_index(name='counts').pivot(index = 'ICD9_diagnosis', columns = 'HOSPITAL_EXPIRE_FLAG', values = 'counts')
biggest_killers.columns = ['0', '1']
# Fill na ICD9 codes with 0's (i.e. no one died from something)
biggest_killers = biggest_killers.fillna(0)
# Calculate death rates
biggest_killers['morbidity_share'] = biggest_killers['1']/(biggest_killers['0'] + biggest_killers['1'])
biggest_killers = biggest_killers.reset_index(drop = False)
# Merge death rates in with comorbidities data
data_diagnoses_all = data_diagnoses_all.merge(biggest_killers[['ICD9_diagnosis', 'morbidity_share']], on = 'ICD9_diagnosis', how = 'left')

# Max, mean and median death rates per patient 
data_death_rates = data_diagnoses_all.groupby(['hadm_id']).agg(max_deathrate = pd.NamedAgg('morbidity_share', 'max'),
                                                               mean_deathrate = pd.NamedAgg('morbidity_share', 'mean'),
                                                               median_deathrate = pd.NamedAgg('morbidity_share', 'median'))

# Merge into main data
data_train = data_train.merge(data_death_rates, on = ['hadm_id'], how = 'left')
data_test = data_test.merge(data_death_rates, on = ['hadm_id'], how = 'left')
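
The death-rate features can be sketched on toy data as follows. The admissions, codes and outcomes are made up, and the sketch takes the mean of the 0/1 flag per code, which gives the same number as deaths divided by (deaths + survivals).

```python
import pandas as pd

# Hypothetical diagnoses (one row per admission and code) and outcomes
diag = pd.DataFrame({
    "hadm_id": [10, 10, 11, 12, 12, 13],
    "ICD9_diagnosis": ["A", "B", "A", "B", "C", "C"],
})
outcome = pd.DataFrame({"hadm_id": [10, 11, 12, 13],
                        "HOSPITAL_EXPIRE_FLAG": [1, 0, 1, 0]})
diag = diag.merge(outcome, on="hadm_id", how="left")

# Death rate per code: share of admissions with that code that ended in death
rate = (diag.groupby("ICD9_diagnosis")["HOSPITAL_EXPIRE_FLAG"]
            .mean().rename("death_rate").reset_index())
diag = diag.merge(rate, on="ICD9_diagnosis", how="left")

# Summarise each admission by the max, mean and median rate of its codes
per_admission = diag.groupby("hadm_id")["death_rate"].agg(["max", "mean", "median"])
print(per_admission)
```

Admission 10 carries codes A (rate 0.5) and B (rate 1.0), so its maximum is 1.0 and its mean and median are both 0.75.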


**Most common diagnoses (dummies)**. I also calculated the most common `ICD9_diagnosis` codes among patients that died. I rank each `ICD9_diagnosis` code based on its frequency in each group of patients (died vs survived). I take 'commonality' as the absolute difference between ranks in each group and scale it by frequency in the (larger) class of patients who survived. The scaling penalises the inclusion of codes where a diagnosis was very common in the survived group, but very rare among patients who died. To further guard against the inclusion of this type of diagnosis, I only keep diagnoses where more than 200 patients in the training data died (this number is arbitrary and can be tuned).

The intuition behind this approach (rather than the simpler one of taking the highest frequency codes for patients who died) is that we want to extract codes that most differentiate (maximise the distance) between patients who died and who survived. This process results in 37 codes, which I include as dummies. The benefit of these features above and beyond **death rates** is that they allow the model to account for the *interaction* between comorbidities that are more likely to result in death. This is a feature of the data that scalar death rates cannot capture.
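
To make the commonality score concrete, here is a toy version of the calculation. The three codes and their counts are made up, and `min_deaths` is a hypothetical name for the 200-death threshold.

```python
import pandas as pd

# Hypothetical code frequencies, already sorted from most to least common
died = pd.DataFrame({"ICD9_diagnosis": ["X", "Y", "Z"], "counts": [300, 250, 40]})
survived = pd.DataFrame({"ICD9_diagnosis": ["Z", "Y", "X"], "counts": [5000, 900, 400]})

# Rank codes by frequency within each group (1 = most common)
for df in (died, survived):
    df["rank"] = range(1, len(df) + 1)

both = died.merge(survived, on="ICD9_diagnosis", how="left",
                  suffixes=("_died", "_survived"))
# Commonality: absolute rank gap, scaled down by frequency among survivors
both["score"] = (both["rank_died"] - both["rank_survived"]).abs() / both["counts_survived"]

min_deaths = 200  # assumption: same threshold as in the text
selected = both[both["counts_died"] > min_deaths].sort_values("score", ascending=False)
print(selected[["ICD9_diagnosis", "score"]])
```

Code Z has a large rank gap but is dropped by the threshold because few of the patients who died had it, while X ranks first because it is common among deaths and comparatively rare among survivors.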

In [7]:
# Most common diagnoses for patients that died versus survived
most_common_diag_1 = data_diagnoses_all[data_diagnoses_all['HOSPITAL_EXPIRE_FLAG'] == 1].groupby(['ICD9_diagnosis']).size().reset_index(name='counts').sort_values(by = 'counts', ascending = False)                  
most_common_diag_0 = data_diagnoses_all[data_diagnoses_all['HOSPITAL_EXPIRE_FLAG'] == 0].groupby(['ICD9_diagnosis']).size().reset_index(name='counts').sort_values(by = 'counts', ascending = False)                  

# Add a rank variable 
for df in [most_common_diag_1, most_common_diag_0]:
    df['rank'] = range(1, len(df)+1)

# Combine data from each group and calculate commonality indicator abs(rank died - rank survived)/frequency survived 
most_common_diag = most_common_diag_1.merge(most_common_diag_0, on = 'ICD9_diagnosis', how = 'left')
most_common_diag['range'] = np.absolute((most_common_diag['rank_x'] - most_common_diag['rank_y'])/most_common_diag['counts_y'])
# Keep only diagnoses where more than 200 patients died (can tune), ordered by commonality
most_common_diag = most_common_diag[most_common_diag['counts_x'] > 200].sort_values(by = 'range', ascending = False)

# Create dummies for most common diagnosis codes in comorbidities data
# Filter for relevant diagnosis codes (copy so the new column is not set on a slice)
most_common_diag_dummy = data_diagnoses_all[data_diagnoses_all['ICD9_diagnosis'].isin(list(most_common_diag['ICD9_diagnosis']))].copy()
# Add indicator for the diagnosis
most_common_diag_dummy['flag'] = 1
#There were some duplicate rows in co-morbidities data - drop them
dupes = most_common_diag_dummy.drop(['SEQ_NUM', 'HOSPITAL_EXPIRE_FLAG'], axis = 1).duplicated()
# Pivot codes to dummies - fill na with zero (didn't have that diagnosis)
most_common_diag_dummy = most_common_diag_dummy[~dupes].pivot(index = ['hadm_id'], columns = 'ICD9_diagnosis', values = 'flag').fillna(value = 0)

# Merge into main data set
data_train = data_train.merge(most_common_diag_dummy, on = ['hadm_id'], how = 'left')
data_test = data_test.merge(most_common_diag_dummy, on = ['hadm_id'], how = 'left')




Next, edit the `ETHNICITY` variable. To do this, I map ethnicity from ~40 detailed categories to 6 simpler ones (following the standard race classification in the US): white, asian, black, hispanic, other, not reported. This mapping is saved in an external file, which I have turned in with this notebook. The new variable is called `ETHNICITY_MAP`.



In [8]:
data_train = data_train.merge(ethnicity_mapping, on = 'ETHNICITY')
data_test = data_test.merge(ethnicity_mapping, on = 'ETHNICITY')

## Section 3 - Data exploration
Given the features calculated above and the high prevalence of categorical variables in the data, I wanted to do some exploratory data analysis. It turns out that this was also useful for thinking about missingness in some of the data. Below I look at the shares of patients who died/survived across many of the features in the feature space.

**Summary**
* None of `GENDER`, `MARITAL_STATUS` and `repeat_visits` showed much variation in the death rate between values
* `INSURANCE` showed a higher death rate for patients with Medicare coverage, but this probably reflects that it is a program for the elderly
* `FIRST_CAREUNIT` exhibited some variation, driven by the CSRU and MICU categories
* For `ADMISSION_TYPE`, EMERGENCY and URGENT admissions had a much higher rate of death than elective visits (EMERGENCY especially so)
* Patients whose `RELIGION`, `ETHNICITY` or `MARITAL_STATUS` was flagged as 'unknown' or 'not recorded' had a much higher incidence of death than other categories. I think this is the missing data telling us something - I hypothesise that a disproportionate share of these patients died before the hospital had an opportunity to record this (peripheral) information. To make sure this is captured, I construct an indicator variable based on whether `RELIGION` was recorded or not (I picked this feature because it was a feature I would drop otherwise).
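
The share of deaths within each category can be sketched with `pd.crosstab` on toy data. The six rows below are made up, and this is an alternative to the groupby-and-pivot approach, not the notebook's own code.

```python
import pandas as pd

# Hypothetical admissions with a categorical feature and the outcome flag
toy = pd.DataFrame({
    "ADMISSION_TYPE": ["EMERGENCY", "EMERGENCY", "ELECTIVE",
                       "ELECTIVE", "EMERGENCY", "ELECTIVE"],
    "HOSPITAL_EXPIRE_FLAG": [1, 0, 0, 0, 1, 0],
})

# Row-normalised cross-tabulation: each row sums to 1, column 1 is the death share
share = pd.crosstab(toy["ADMISSION_TYPE"], toy["HOSPITAL_EXPIRE_FLAG"],
                    normalize="index")
print(share)
```

One practical difference is that `crosstab` fills empty category/outcome combinations with 0, whereas a pivot of counts leaves them as NaN, which then propagates into the share.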

You can view any of the diagnostics by changing the last line of the next code block. I left it on `RELIGION` as a default.

In [9]:
#EYEBALL DIAGNOSTICS - COLUMNS TO CHECK
diagnostics_cols = ['GENDER', 'ETHNICITY_MAP', 'FIRST_CAREUNIT', 'ADMISSION_TYPE', 'INSURANCE',
                       'MARITAL_STATUS', 'RELIGION', 'SEQ_NUM', 'repeat_visits']

# For loop to calculate share of patients that died in each category
diagnostics_eyeball = {}
for i in diagnostics_cols:
    diagnostic = data_train[['HOSPITAL_EXPIRE_FLAG', i]].groupby(['HOSPITAL_EXPIRE_FLAG', i]).size().reset_index(name='counts').pivot(index = i, columns = 'HOSPITAL_EXPIRE_FLAG', values = 'counts')
    diagnostic.columns = ['0', '1']
    diagnostic['share'] = diagnostic['1']/(diagnostic['0'] + diagnostic['1'])
    diagnostics_eyeball[i] = diagnostic

# Add new column for RELIGION
for df in [data_train, data_test]:
    # RELIGION - keep only the UNOBTAINABLE flag (computed from each data set's own column)
    df['RELIGION_ADJ'] = (df['RELIGION'] == 'UNOBTAINABLE').astype(int)

diagnostics_eyeball['RELIGION']

| RELIGION | 0 | 1 | share |
|---|---|---|---|
| 7TH DAY ADVENTIST | 29.0 | 1.0 | 0.033333 |
| BUDDHIST | 97.0 | 12.0 | 0.110092 |
| CATHOLIC | 6826.0 | 829.0 | 0.108295 |
| CHRISTIAN SCIENTIST | 149.0 | 15.0 | 0.091463 |
| EPISCOPALIAN | 264.0 | 24.0 | 0.083333 |
| GREEK ORTHODOX | 158.0 | 20.0 | 0.11236 |
| HEBREW | NaN | 1.0 | NaN |
| HINDU | 33.0 | 5.0 | 0.131579 |
| JEHOVAH'S WITNESS | 40.0 | 5.0 | 0.111111 |
| JEWISH | 1606.0 | 234.0 | 0.127174 |


Finally, drop columns that are either not useful or were not available on the first day the patient entered ICU. Note that some of these columns are also absent from the test data.
* Identifier for test: `icustay_id`
* Not useful/too granular: `subject_id`, `hadm_id`, `DIAGNOSIS`
* Not available: `DOD`, `DISCHTIME`, `DEATHTIME`, `LOS`
* Transformed: `Diff`, `DOB`, `ADMITTIME`, `ETHNICITY`, `RELIGION`

NOTE THIS IS BASED ON THE ORIGINAL DATA YOU GAVE US - IF YOU HAVE SUBSEQUENTLY REMOVED THESE COLUMNS THE CODE WON'T RUN


In [10]:
# ID to join with predictions
test_kaggle_id = pd.DataFrame(data_test[['icustay_id']])

#Drop columns from train and test
data_train = data_train.drop(['icustay_id', 'subject_id', 'hadm_id','RELIGION', 'DIAGNOSIS',
                              'ICD9_diagnosis', 'DOD', 'DISCHTIME', 'DEATHTIME', 'LOS', 'ADMITTIME',
                              'DOB', 'ETHNICITY', 'Diff'], axis=1)

#Drop columns from test (the outcome-related columns are not present there)
data_test = data_test.drop(['icustay_id', 'subject_id', 'hadm_id', 'RELIGION', 'DIAGNOSIS',
                            'ICD9_diagnosis', 'ADMITTIME', 'DOB', 'Diff',
                            'ETHNICITY'], axis = 1)

#Check number of columns equal in data sets (train has one extra column: the target)
len(data_train.columns) - 1 == len(data_test.columns)

# Split into training and test data
target_col = 'HOSPITAL_EXPIRE_FLAG'

# Training data
X_train = data_train.drop([target_col], axis=1) 
y_train = data_train[target_col] 

# Test data
X_test = data_test

## Section 4 - diagnostics on missingness and class imbalance

It is important to check for nulls in our data. If there are too many nulls for a variable, there is no point imputing because there is not enough variation in the observed data to give a useful signal. On the other hand, the presence of a null may provide useful information in certain contexts (as seen with unknown observations in `RELIGION`, `ETHNICITY` and `MARITAL_STATUS`). The features with null data (those measuring a patient's vitals) are manageable. I found nothing to suggest they were not 'missing at random', so I impute their values using a KNN algorithm (chosen to preserve the non-linearity of the KNN classifier, although it is also fine for the SVM).

I also check for class imbalance. There could be an issue here: the share of patients who died is considerably lower than the share who survived.

For KNN, I correct for this class imbalance in the pipeline using the SMOTE algorithm (I also tried SMOTE-NC, which apparently works better with categorical data, but the difference was negligible and using it was more awkward within the Pipeline). Below I set up the function to reweight the predicted probabilities from the KNN classifier to make them unbiased. For SVC, class imbalance can be handled from within the classifier by assigning class-specific weights to the penalty component of the objective function (which penalises misclassification of the minority class more heavily).
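
For reference, scikit-learn documents the `class_weight = 'balanced'` option as weighting each class by `n_samples / (n_classes * np.bincount(y))`. The sketch below reproduces that formula with NumPy on a made-up 90/10 label vector; the labels are an assumption and do not reflect the actual class shares in the data.

```python
import numpy as np

# Hypothetical imbalanced labels: 90 survivors (0) and 10 deaths (1)
y = np.array([0] * 90 + [1] * 10)

counts = np.bincount(y)                    # observations per class
weights = len(y) / (len(counts) * counts)  # 'balanced' heuristic
print(dict(enumerate(weights)))
```

With these labels the minority class gets a weight of 5.0 against roughly 0.56 for the majority class, so a misclassified death costs about nine times as much in the penalty term.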

In [11]:
# Count nulls
nulls_train = data_train.isnull().sum()
nulls_test = data_test.isnull().sum()

nulls_train.head()

#Check for class imbalance
unique, counts = np.unique(y_train, return_counts=True)
class_counts = dict(zip(unique, counts))

# Calculate class weights
class0 = class_counts[0]/(class_counts[0] + class_counts[1])
class1 = class_counts[1]/(class_counts[0] + class_counts[1])

#Function for re-weighting probabilities to correct for class imbalance. 
def reweight(pi,q1=class1,r1=0.5):
    r0 = 1-r1
    q0 = 1-q1
    tot = pi*(q1/r1)+(1-pi)*(q0/r0)
    w = pi*(q1/r1)
    w /= tot
    return w
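
A worked example of the reweighting, assuming a hypothetical true death share of `q1 = 0.1` and a model trained on a balanced sample (`r1 = 0.5`). The function is restated with `q1` as an explicit argument so the block runs on its own.

```python
def reweight(pi, q1, r1=0.5):
    """Map a probability from a model trained with class-1 share r1 back to prior q1."""
    r0, q0 = 1 - r1, 1 - q1
    w = pi * (q1 / r1)
    return w / (w + (1 - pi) * (q0 / r0))

q1 = 0.1  # assumption: true share of deaths, for illustration only
adjusted = {p: reweight(p, q1) for p in (0.0, 0.5, 0.9, 1.0)}
print(adjusted)
```

A balanced-sample probability of 0.5 maps back to the prior of 0.1, and it takes a score of 0.9 to reach an adjusted probability of 0.5. The mapping is strictly increasing whenever `0 < q1 < 1`, so it changes the calibration of the probabilities but not their ranking, and therefore not the ROC AUC.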

# K-Nearest Neighbours

Implement the standard sklearn pipeline for the KNN (and later SVC) models. This includes:
* Dummifying any categorical variables
* Imputing missing data
* Scaling data into standardised values
* Correcting for an imbalanced sample (KNN only)
* Feature selection
* Running the classifier model (tuning parameters using `GridSearchCV()`)  

Set up the pipeline and specify a standard grid of parameters to search over. An advantage of using the pipeline is that it allows tuning for all of the parameters in the process, not just those specific to the final classifier.
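
The `step__parameter` naming that makes this possible can be sketched on synthetic data. This is a minimal illustration using only scikit-learn (no SMOTE step); the data, step names and grid values are assumptions and are not the settings used in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import KNNImputer
from sklearn.model_selection import GridSearchCV, ParameterGrid
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with a few missing values so the imputer has work to do
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
X[::15, 0] = np.nan

pipe = Pipeline([
    ("imputer", KNNImputer()),
    ("scaler", StandardScaler()),
    ("classifier", KNeighborsClassifier()),
])

# One dict: every combination is tried (2 x 2 = 4 candidates)
param_grid = {"imputer__n_neighbors": [3, 10], "classifier__n_neighbors": [5, 15]}
n_joint = len(ParameterGrid(param_grid))

# A list of dicts: each dict is its own grid (1 + 1 = 2 candidates)
n_separate = len(ParameterGrid([{"imputer__n_neighbors": [3]},
                                {"classifier__n_neighbors": [5]}]))

search = GridSearchCV(pipe, param_grid=param_grid, scoring="roc_auc", cv=3)
search.fit(X, y)
print(n_joint, n_separate, search.best_params_)
```

When the grid is a list of dicts, parameters that a dict does not mention stay at whatever values the pipeline was constructed with, which is worth keeping in mind when reading the number of candidates reported by the search.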

The metric to evaluate the model on is set to roc_auc, the same as Kaggle.

For **feature selection**, I use recursive feature elimination with cross-validation (`RFECV`) and a Decision Tree Classifier. I chose this approach because the Decision Tree Classifier is a non-linear model that gives a measure of feature importance. I was reluctant to fit a linear model to the data for feature selection (for KNN), because this may have excluded features with a meaningful non-linear relationship with the target variable. Recursive feature elimination with a Decision Tree is a greedy algorithm and takes quite a while to run, but gives good results.
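
A minimal sketch of this kind of feature selection on synthetic data, run outside the pipeline. The data set, `cv = 3`, `step = 1` and the `random_state` values are assumptions chosen to keep the example quick and repeatable.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: 10 features, of which 3 are informative
X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           n_redundant=0, random_state=0)

# Repeatedly drop the least important feature and score each subset by CV
selector = RFECV(
    estimator=DecisionTreeClassifier(class_weight="balanced", random_state=0),
    step=1, cv=3, scoring="roc_auc",
)
selector.fit(X, y)

print(selector.n_features_)  # number of features kept
print(selector.support_)     # boolean mask of kept features
print(selector.ranking_)     # 1 = kept, larger = eliminated earlier
```

`step` controls how many features are removed per iteration; the commented-out pipeline step in this notebook uses `step = 10`, which removes features in larger batches.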

**Other things I tried in addition to pipeline below:**
* Imputation using `SimpleImputer()` and mean and median strategies (also applies to SVM)
* Oversampling using `RandomOverSampler()`
* Feature selection using `SelectFromModel()` and `LogisticRegressionCV()`. I ditched this because it was linear.
* Different weighting and distance options in `KNeighborsClassifier()`: uniform weights (no distance weighting) versus distance weights, and Manhattan distance (selected), Euclidean distance and higher-order Minkowski distances (p = 3 and p = 5). All gave similar results. I also looked at the DistanceMetric documentation page, which lists other metrics available for integer- and boolean-valued vector spaces.
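
The distances mentioned in the last bullet are all members of the Minkowski family, indexed by `p`. The sketch below computes them by hand with NumPy for two made-up points; `minkowski` is an illustrative helper, not a scikit-learn function.

```python
import numpy as np

def minkowski(a, b, p):
    """Minkowski distance of order p between two vectors."""
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

d1 = minkowski(a, b, 1)  # Manhattan distance
d2 = minkowski(a, b, 2)  # Euclidean distance
d3 = minkowski(a, b, 3)  # higher-order Minkowski distance
print(d1, d2, d3)
```

In `KNeighborsClassifier()` the order is set with `p` (with the default Minkowski metric), separately from the `weights` option, so distance weighting and the choice of metric can be tuned independently.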

NOTE: Below I have just included a minimal example of the pipeline (excluding feature selection), so you can see that the code works. As a result, before the pipeline I drop additional variables that feature selection did not include in the model. In all, I end up with about 70 features.

In [12]:
#Make column transformer for one hot encoding
column_dummy = make_column_transformer((preprocessing.OneHotEncoder(drop = 'if_binary'),
                                        ['ADMISSION_TYPE']),
                                        #'FIRST_CAREUNIT', 'INSURANCE', 'GENDER',
                                        # 'MARITAL_STATUS', 'ETHNICITY_MAP']),
                                        remainder = 'passthrough') 

# Drop variables not included in feature selection:
X_train = X_train.drop(['ETHNICITY_MAP', 'INSURANCE', 'MARITAL_STATUS','FIRST_CAREUNIT', 'GENDER'], axis=1)
X_test = X_test.drop(['ETHNICITY_MAP', 'INSURANCE', 'MARITAL_STATUS','FIRST_CAREUNIT', 'GENDER'], axis=1)

#Check number of columns equal
X_train.columns == X_test.columns

#Pipeline for KNN
pipe_knn = imbPipeline([('dummy', column_dummy), 
                ('imputer', KNNImputer(missing_values=np.nan, n_neighbors = 100, weights = 'distance', add_indicator = False)),
                ('preprocessing', preprocessing.StandardScaler()),
                ('sampling', SMOTE()),
                #('features', fs.RFECV(estimator = DecisionTreeClassifier(class_weight = 'balanced'),
                #                      step = 10, cv = 5, scoring = 'roc_auc', verbose = 0)),
                ('classifier', KNeighborsClassifier(n_neighbors = 20,
                                                    weights = 'distance',
                                                    algorithm = 'auto'))])

# Grid
grid_values = [{'imputer__n_neighbors':[100]},
               {'classifier__n_neighbors':[300],
               'classifier__weights':['distance'], 'classifier__p':[1]}]
               
#Run grid search
grid_knn = GridSearchCV(pipe_knn, param_grid = grid_values, scoring = 'roc_auc', cv = 5, verbose = 3)


#Fit model
grid_knn.fit(X_train, y_train)

Fitting 5 folds for each of 2 candidates, totalling 10 fits
[CV 1/5] END ..........imputer__n_neighbors=100;, score=0.871 total time=  50.3s
[CV 2/5] END ..........imputer__n_neighbors=100;, score=0.909 total time=  47.0s
[CV 3/5] END ..........imputer__n_neighbors=100;, score=0.899 total time=  57.2s
[CV 4/5] END ..........imputer__n_neighbors=100;, score=0.899 total time=  52.8s
[CV 5/5] END ..........imputer__n_neighbors=100;, score=0.926 total time=  53.5s
[CV 1/5] END classifier__n_neighbors=300, classifier__p=1, classifier__weights=distance;, score=0.912 total time=  54.3s
[CV 2/5] END classifier__n_neighbors=300, classifier__p=1, classifier__weights=distance;, score=0.928 total time=  55.5s
[CV 3/5] END classifier__n_neighbors=300, classifier__p=1, classifier__weights=distance;, score=0.927 total time=  50.7s
[CV 4/5] END classifier__n_neighbors=300, classifier__p=1, classifier__weights=distance;, score=0.932 total time=  51.4s
[CV 5/5] END classifier__n_neighbors=300, classifie

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('dummy',
                                        ColumnTransformer(remainder='passthrough',
                                                          transformers=[('onehotencoder',
                                                                         OneHotEncoder(drop='if_binary'),
                                                                         ['ADMISSION_TYPE'])])),
                                       ('imputer',
                                        KNNImputer(n_neighbors=100,
                                                   weights='distance')),
                                       ('preprocessing', StandardScaler()),
                                       ('sampling', SMOTE()),
                                       ('classifier',
                                        KNeighborsClassifier(n_neighbors=20,
                                                             weights='distance'))]),
          

### Interpret Output - KNN

Here I assessed the model chosen using grid search. When I was conducting feature selection, I would inspect the number and ranking of the features. Most runs of feature selection retained most of the health features plus one indicator of missingness from `MARITAL_STATUS` or `ETHNICITY` (I ended up using a similar feature with `RELIGION`).

Health features include:
* Patient vitals (heart rate, blood pressure, etc.). I thought about reducing dimensionality here by turning max and min into a range (retaining mean as a levels measure), but you can only do so much.
* Number of diagnoses
* Cumulative count of visits
* Most common diagnosis indicators (sometimes a couple were dropped)
* Death rates (most times all three measures - max, mean, median were retained)

In all, I include 70 features in the model. The optimal number of neighbours for imputation was 100 and for the classifier it was 300. I used distance weights with the Manhattan distance. The table below shows the mean CV score for this combination, which is quite high.


In [13]:
#DIAGNOSTICS

#Features
# Feature ranking
#feature_ranking = pd.DataFrame(list(grid_knn.best_estimator_.named_steps['dummy'].get_feature_names_out()),
#                               list(grid_knn.best_estimator_.named_steps['features'].ranking_)).reset_index(drop = False).sort_values(by = 'index')
# Features chosen
#feature_support = list(grid_knn.best_estimator_.named_steps['features'].support_)
#pd.DataFrame(grid_knn.best_estimator_.named_steps['dummy'].get_feature_names_out())[feature_support]

[grid_knn.best_estimator_.named_steps['classifier'].n_features_in_, grid_knn.best_estimator_.named_steps['dummy'].get_feature_names_out()]

[70,
 array(['onehotencoder__ADMISSION_TYPE_ELECTIVE',
        'onehotencoder__ADMISSION_TYPE_EMERGENCY',
        'onehotencoder__ADMISSION_TYPE_URGENT', 'remainder__HeartRate_Min',
        'remainder__HeartRate_Max', 'remainder__HeartRate_Mean',
        'remainder__SysBP_Min', 'remainder__SysBP_Max',
        'remainder__SysBP_Mean', 'remainder__DiasBP_Min',
        'remainder__DiasBP_Max', 'remainder__DiasBP_Mean',
        'remainder__MeanBP_Min', 'remainder__MeanBP_Max',
        'remainder__MeanBP_Mean', 'remainder__RespRate_Min',
        'remainder__RespRate_Max', 'remainder__RespRate_Mean',
        'remainder__TempC_Min', 'remainder__TempC_Max',
        'remainder__TempC_Mean', 'remainder__SpO2_Min',
        'remainder__SpO2_Max', 'remainder__SpO2_Mean',
        'remainder__Glucose_Min', 'remainder__Glucose_Max',
        'remainder__Glucose_Mean', 'remainder__AGE', 'remainder__SEQ_NUM',
        'remainder__repeat_visits', 'remainder__max_deathrate',
        'remainder__mean_deathra

In [14]:
# Model parameters
# KNN imputer
grid_knn.best_estimator_.named_steps['imputer'].get_params()
# KNN classifier
results_knn = pd.DataFrame(grid_knn.cv_results_)
results_knn = results_knn.sort_values(by=["rank_test_score"])
results_knn["param_values"] = results_knn["params"].apply(lambda x: "_".join(str(val) for val in x.values()))
results_knn['param_name'] = [list(i.keys()) for i in results_knn["params"]]
results_knn[["param_name", "param_values", "rank_test_score", "mean_test_score", "std_test_score"]]

Unnamed: 0,param_name,param_values,rank_test_score,mean_test_score,std_test_score
1,"[classifier__n_neighbors, classifier__p, class...",300_1_distance,1,0.929217,0.011455
0,[imputer__n_neighbors],100,2,0.900637,0.017689


### Generate predictions and write to a CSV for upload to Kaggle

In [15]:
#5. PREDICT USING TEST DATA
y_pred_prob = grid_knn.predict_proba(X_test)

#If reweighted for class imbalance
y_pred_prob = reweight(y_pred_prob)

#Write TO CSV FOR KAGGLE
test_kaggle = pd.concat([test_kaggle_id, pd.Series(y_pred_prob[:,1]).rename('HOSPITAL_EXPIRE_FLAG')], axis = 1) ## The unique ID
test_kaggle.head()
test_kaggle.to_csv(path + "gigi_kaggle_knn.csv", index = False)

# Support Vector Machines

Set up the pipeline, which is largely similar to the KNN pipeline. I have left a more comprehensive parameter search in the code below (commented out): I searched across different penalisation parameters for misclassified data (C), kernels (linear and radial basis function) and class weights (balanced and None). Gamma is the tuning parameter for the radial basis function kernel. I did not search over more parameters because the SVM took a long time to run on my computer.

I set probability = True in the classifier to allow for the retrieval of predicted probabilities. This fits a logistic regression to the model output to calculate probability estimates. 

In [16]:
#Pipeline for SVC model
pipe = Pipeline([('dummy', column_dummy), 
                ('imputer', KNNImputer(missing_values=np.nan, n_neighbors = 100, weights = 'distance', add_indicator = False)),
                ('preprocessing', preprocessing.StandardScaler()),             
                ('classifier', SVC(probability=True))])

#grid_values = [{'imputer__n_neighbors':[100]},
#               {'classifier__C':[0.01, 1, 10],
#               'classifier__class_weight':["balanced", None],
#               'classifier__kernel':['linear', 'rbf'],
#              'classifier__gamma':[0.25, 0.5, 0.75]}]

grid_values = [{'imputer__n_neighbors':[100]},
               {'classifier__C':[1],
               'classifier__class_weight':["balanced"],
               'classifier__kernel':['linear']}]

#Run grid search
grid_svm = GridSearchCV(pipe, param_grid = grid_values, scoring = 'roc_auc', cv=5, verbose = 3)

#4. FIT MODEL TO DATA
grid_svm.fit(X_train, y_train)


Fitting 5 folds for each of 2 candidates, totalling 10 fits
[CV 1/5] END ..........imputer__n_neighbors=100;, score=0.920 total time= 1.4min
[CV 2/5] END ..........imputer__n_neighbors=100;, score=0.948 total time= 1.4min
[CV 3/5] END ..........imputer__n_neighbors=100;, score=0.940 total time= 1.4min
[CV 4/5] END ..........imputer__n_neighbors=100;, score=0.947 total time= 1.5min
[CV 5/5] END ..........imputer__n_neighbors=100;, score=0.966 total time= 1.6min
[CV 1/5] END classifier__C=1, classifier__class_weight=balanced, classifier__kernel=linear;, score=0.934 total time= 3.4min
[CV 2/5] END classifier__C=1, classifier__class_weight=balanced, classifier__kernel=linear;, score=0.952 total time= 3.6min
[CV 3/5] END classifier__C=1, classifier__class_weight=balanced, classifier__kernel=linear;, score=0.946 total time= 3.5min
[CV 4/5] END classifier__C=1, classifier__class_weight=balanced, classifier__kernel=linear;, score=0.953 total time= 3.6min
[CV 5/5] END classifier__C=1, classifie

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('dummy',
                                        ColumnTransformer(remainder='passthrough',
                                                          transformers=[('onehotencoder',
                                                                         OneHotEncoder(drop='if_binary'),
                                                                         ['ADMISSION_TYPE'])])),
                                       ('imputer',
                                        KNNImputer(n_neighbors=100,
                                                   weights='distance')),
                                       ('preprocessing', StandardScaler()),
                                       ('classifier', SVC(probability=True))]),
             param_grid=[{'imputer__n_neighbors': [100]},
                         {'classifier__C': [1],
                          'classifier__class_weight': ['balanced'],
                          'classif

### Interpret output - SVM

Similar to the KNN model, I included around 70 features. GridSearch preferred a linear kernel and balanced class weights. The optimal value for C over the values I tried was 1.

In [17]:
# Model features

[grid_svm.best_estimator_.named_steps['classifier'].n_features_in_, grid_svm.best_estimator_.named_steps['dummy'].get_feature_names_out()]



[70,
 array(['onehotencoder__ADMISSION_TYPE_ELECTIVE',
        'onehotencoder__ADMISSION_TYPE_EMERGENCY',
        'onehotencoder__ADMISSION_TYPE_URGENT', 'remainder__HeartRate_Min',
        'remainder__HeartRate_Max', 'remainder__HeartRate_Mean',
        'remainder__SysBP_Min', 'remainder__SysBP_Max',
        'remainder__SysBP_Mean', 'remainder__DiasBP_Min',
        'remainder__DiasBP_Max', 'remainder__DiasBP_Mean',
        'remainder__MeanBP_Min', 'remainder__MeanBP_Max',
        'remainder__MeanBP_Mean', 'remainder__RespRate_Min',
        'remainder__RespRate_Max', 'remainder__RespRate_Mean',
        'remainder__TempC_Min', 'remainder__TempC_Max',
        'remainder__TempC_Mean', 'remainder__SpO2_Min',
        'remainder__SpO2_Max', 'remainder__SpO2_Mean',
        'remainder__Glucose_Min', 'remainder__Glucose_Max',
        'remainder__Glucose_Mean', 'remainder__AGE', 'remainder__SEQ_NUM',
        'remainder__repeat_visits', 'remainder__max_deathrate',
        'remainder__mean_deathra

In [18]:
# Model parameters
# KNN imputer
grid_svm.best_estimator_.named_steps['imputer'].get_params()
# SVM classifier
results_svm = pd.DataFrame(grid_svm.cv_results_)
results_svm = results_svm.sort_values(by=["rank_test_score"])
results_svm["param_values"] = results_svm["params"].apply(lambda x: "_".join(str(val) for val in x.values()))
results_svm['param_name'] = [list(i.keys()) for i in results_svm["params"]]
results_svm[["param_name", "param_values", "rank_test_score", "mean_test_score", "std_test_score"]]

Unnamed: 0,param_name,param_values,rank_test_score,mean_test_score,std_test_score
1,"[classifier__C, classifier__class_weight, clas...",1_balanced_linear,1,0.95093,0.011333
0,[imputer__n_neighbors],100,2,0.944266,0.01468


In [19]:
#5. PREDICT USING TEST DATA
y_pred_prob_svm = grid_svm.predict_proba(X_test)

#Write TO CSV FOR KAGGLE (use the SVM frame, not the KNN one)
test_kaggle_svm = pd.concat([test_kaggle_id, pd.Series(y_pred_prob_svm[:,1]).rename('HOSPITAL_EXPIRE_FLAG')], axis = 1) ## The unique ID
test_kaggle_svm.head()
test_kaggle_svm.to_csv(path + "gigi_kaggle_svm.csv", index = False)