# Task 2

Group name: Cbbayes

Team members: mcolomer (mcolomer@student.ethz.ch), pratsink (pratsink@student.ethz.ch) and scastro (scastro@student.ethz.ch)

Spring 2021


## Import libraries

In [1]:
import pandas as pd
import numpy as np
from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.linear_model import LassoCV
from sklearn.model_selection import KFold
import sklearn.metrics as metrics

## Read data

In [2]:
## We read the already normalized and imputed data. For specifics about the imputation and normalization 
## see preprocessing.R file
test_feat_path = "../data/test_features_imp.csv" 
train_feat_path = "../data/train_features_imp.csv" 
train_lab_path = "../data/train_labels.csv"
test_feat = pd.read_csv(test_feat_path)
train_feat = pd.read_csv(train_feat_path)
train_lab = pd.read_csv(train_lab_path)

## Order data to make sure that rows in X and Y match
test_feat.sort_values(by=['pid'], inplace = True, ignore_index = True)
train_feat.sort_values(by=['pid'], inplace = True,ignore_index = True)
train_lab.sort_values(by=['pid'], inplace = True, ignore_index = True)

## Exclude the pid column and convert the features to arrays
X_test = test_feat.iloc[:, 1:272].values
X_train = train_feat.iloc[:, 1:272].values
Y_train = train_lab

# Create output file with the pid
output = pd.DataFrame({'pid': test_feat.iloc[:, 0].values})
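
Because the feature and label tables are sorted independently, a quick check that their `pid` columns line up row by row can catch a mismatch early. The sketch below is a minimal illustration on made-up data; the helper name `pids_aligned`, the column `mean_Heartrate` and the toy values are hypothetical and not part of the original pipeline.

```python
import pandas as pd


def pids_aligned(features: pd.DataFrame, labels: pd.DataFrame) -> bool:
    """Return True if both frames list the same pids in the same row order."""
    return features['pid'].reset_index(drop=True).equals(
        labels['pid'].reset_index(drop=True))


# Toy stand-ins for train_feat and train_lab (hypothetical values)
toy_feat = pd.DataFrame({'pid': [5, 1, 3], 'mean_Heartrate': [80.0, 72.0, 95.0]})
toy_lab = pd.DataFrame({'pid': [3, 5, 1], 'LABEL_Sepsis': [0, 1, 0]})

aligned_before = pids_aligned(toy_feat, toy_lab)

# Same sorting step as in the cell above
toy_feat = toy_feat.sort_values(by=['pid'], ignore_index=True)
toy_lab = toy_lab.sort_values(by=['pid'], ignore_index=True)

aligned_after = pids_aligned(toy_feat, toy_lab)
print(aligned_before, aligned_after)
```

A check like this could sit right after the sorting step, before the feature arrays are extracted.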

## Subtask 1
### Histogram-based Gradient Boosting Classification Tree

In [3]:
## Fit one HGBC per label of subtask 1 and predict the probabilities
def prob_classsifier(X_train, Y_train, X_test, output):
    """Classifier that uses the HGBC to give the probability predictions
    for the labels of subtask1.
    Input:
        - X_train: numpy array with the training features
        - Y_train: pandas dataframe with the training labels
        - X_test: numpy array with the test features
        - Output: pandas dataframe with the pid we want to assess
    Output:
        - Output: pandas dataframe with the predicted values for the labels
        """
    labels_subtask_1 = ['LABEL_BaseExcess', 'LABEL_Fibrinogen', 'LABEL_AST',
                    'LABEL_Alkalinephos', 'LABEL_Bilirubin_total', 
                    'LABEL_Lactate', 'LABEL_TroponinI', 'LABEL_SaO2',
                    'LABEL_Bilirubin_direct', 'LABEL_EtCO2']

    ## Write to an array the labels of interest
    Y_train = Y_train[labels_subtask_1].to_numpy()

    ## For every label in Y_train fit a HGBC and use it to predict the probabilities of X_test
    print("ROC AUC validation and training score (training score on probability estimates), for each label:")
    for i, label in enumerate(labels_subtask_1):
        ## Fit model
        clf = HistGradientBoostingClassifier(scoring = 'roc_auc', 
                                             random_state = 123).fit(X_train, Y_train[:, i])

        ## Print the validation and training scores. The training score is computed on the probability estimates, not the labels.
        print(clf.validation_score_[-1], " ",
              metrics.roc_auc_score(Y_train[:, i],
                                    clf.predict_proba(X_train)[:, 1],
                                    average='micro'))

        ## Write to results df
        output[label] = clf.predict_proba(X_test)[:, 1]
    return output

output = prob_classsifier(X_train, Y_train, X_test, output)

ROC AUC validation and training score (training score on probability estimates), for each label:
0.9368288898293131   0.971955844716475
0.7739326298701299   0.9370982421954289
0.7395438474996354   0.8227781635493072
0.7661201321874631   0.863604620989755
0.737128308244282   0.8743783775584116
0.8243524930747923   0.8820985417681808
0.8664142813173283   0.9300488203798855
0.8386329323829322   0.9063001574084065
0.7579146241830066   0.9210975183372138
0.9442974647887324   0.9891843985498588


## Subtask 2
### Histogram-based Gradient Boosting Classification Tree

In [4]:
def classifier(X_train, Y_train, X_test, output):
    """Classifier that uses the HGBC to give the probability predictions
    for the LABEL_Sepsis label of subtask 2.
    Input:
        - X_train: numpy array with the training features
        - Y_train: pandas dataframe with the training labels
        - X_test: numpy array with the test features
        - Output: pandas dataframe with the pid we want to assess
    Output:
        - Output: pandas dataframe with the predicted LABEL_Sepsis values
        """
    ## Write to an array the labels of interest
    Y_train = Y_train['LABEL_Sepsis'].to_numpy()

    ## Fit a HGBC and use it to predict the probabilities of X_test
    print("ROC AUC validation and training score (training score on probability estimates), for each label:")

    ## Fit model
    clf = HistGradientBoostingClassifier(scoring = 'roc_auc',
                                         random_state = 123).fit(X_train, Y_train)

    ## Print the validation and training scores. The training score is computed on the probability estimates, not the labels.
    print(clf.validation_score_[-1],
          " ",
          metrics.roc_auc_score(Y_train,
                                clf.predict_proba(X_train)[:, 1],
                                average='micro'))

    ## Write to results df
    output['LABEL_Sepsis'] = clf.predict_proba(X_test)[:, 1]
    return output

output = classifier(X_train, Y_train, X_test, output)

ROC AUC validation and training score (training score on probability estimates), for each label:
0.7437313990953749   0.8853415748524238


## Subtask 3
### Lasso Regression

In [5]:
def regressor(train_feat, Y_train, test_feat, output):
    """Regressor that uses Lasso-regression to estimate the values
    Input:
        - train_feat: pandas dataframe with the training features
        - Y_train: pandas dataframe with the training labels
        - test_feat: pandas dataframe with the test features
        - output: pandas dataframe with the pid we want to assess
    Output:
        - output: pandas dataframe with the regressed values
    """
    ## Define the labels to predict for this task
    labels_subtask_3 = ['LABEL_RRate', 'LABEL_ABPm', 'LABEL_SpO2', 'LABEL_Heartrate']

    ## Write to an array the labels of interest
    Y_train = Y_train[labels_subtask_3].to_numpy()

    ## Fit Lasso regression to the data and predict
    print("Training scores for each label:")
    for i, label in enumerate(labels_subtask_3):
        ## Get the suffix of the label to predict
        suffix = label.split("_", maxsplit = 2)[1] + "$"

        ## Keep only the columns that end with the suffix
        X_in_loop_train = train_feat.filter(regex = suffix, axis = 1).to_numpy()
        X_in_loop_test = test_feat.filter(regex = suffix, axis = 1).to_numpy()

        ## Fit model
        reg = LassoCV(random_state = 123, 
                      verbose = False,
                      max_iter = 10000).fit(X_in_loop_train, Y_train[:, i])

        ## Print the training R^2 score (these are poor)
        print(reg.score(X_in_loop_train, Y_train[:, i]))

        ## Write to output
        output[label] = reg.predict(X_in_loop_test)
    return output

output = regressor(train_feat, Y_train, test_feat, output)

Training scores for each label:
0.37770345083252754
0.5859785441608802
0.3838607478091973
0.614472438587367



## Visualize output

In [6]:
## Write results to .zip
output.to_csv('../output/submission.zip', index=False, float_format='%.3f', compression='zip')
output.head()

|   | pid | LABEL_BaseExcess | LABEL_Fibrinogen | LABEL_AST | LABEL_Alkalinephos | LABEL_Bilirubin_total | LABEL_Lactate | LABEL_TroponinI | LABEL_SaO2 | LABEL_Bilirubin_direct | LABEL_EtCO2 | LABEL_Sepsis | LABEL_RRate | LABEL_ABPm | LABEL_SpO2 | LABEL_Heartrate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.841947 | 0.500762 | 0.767665 | 0.81449 | 0.820963 | 0.564167 | 0.075179 | 0.304444 | 0.032662 | 0.003509 | 0.084461 | 15.092701 | 81.921415 | 98.225746 | 84.676908 |
| 1 | 3 | 0.014213 | 0.053325 | 0.189638 | 0.22785 | 0.176369 | 0.045784 | 0.042676 | 0.069303 | 0.015386 | 0.016995 | 0.038164 | 17.708045 | 85.29036 | 96.73791 | 95.10756 |
| 2 | 5 | 0.030908 | 0.040707 | 0.160747 | 0.174051 | 0.190061 | 0.069336 | 0.067896 | 0.09551 | 0.01869 | 0.016465 | 0.025506 | 19.01677 | 72.983975 | 95.905009 | 71.068876 |
| 3 | 7 | 0.867916 | 0.898349 | 0.847249 | 0.96237 | 0.943367 | 0.419422 | 0.061377 | 0.557548 | 0.34583 | 0.037326 | 0.201741 | 17.50136 | 86.498119 | 98.069292 | 94.273838 |
| 4 | 9 | 0.086848 | 0.070932 | 0.394263 | 0.19874 | 0.333319 | 0.091845 | 0.073415 | 0.087626 | 0.020094 | 0.001391 | 0.036281 | 20.007447 | 87.643147 | 95.843105 | 91.415635 |


## Compute the score of our submission

In [7]:
VITALS = ['LABEL_RRate', 'LABEL_ABPm', 'LABEL_SpO2', 'LABEL_Heartrate']
TESTS = ['LABEL_BaseExcess', 'LABEL_Fibrinogen', 'LABEL_AST', 'LABEL_Alkalinephos', 'LABEL_Bilirubin_total',
         'LABEL_Lactate', 'LABEL_TroponinI', 'LABEL_SaO2',
         'LABEL_Bilirubin_direct', 'LABEL_EtCO2']


def get_score(df_true, df_submission):
    """Function that determines the score of a predicted submission"""
    df_submission = df_submission.sort_values('pid')
    df_true = df_true.sort_values('pid')
    task1 = np.mean([metrics.roc_auc_score(df_true[entry], df_submission[entry]) for entry in TESTS])
    task2 = metrics.roc_auc_score(df_true['LABEL_Sepsis'], df_submission['LABEL_Sepsis'])
    task3 = np.mean([0.5 + 0.5 * np.maximum(0, metrics.r2_score(df_true[entry], df_submission[entry])) for entry in VITALS])
    score = np.mean([task1, task2, task3])
    print("Score task 1: ", task1)
    print("Score task 2: ", task2)
    print("Score task 3: ", task3)
    scores = [task1, task2, task3, score]
    return scores


def crossvalidation_analysis(X_cross, y_cross, train_feat, folds=5):
    """Cross-validation analysis of our classifiers and regressors
    Input:
        - X_cross: numpy array with the training features
        - y_cross: pandas dataframe with the training labels
        - train_feat: pandas dataframe with the training features
    Output:
        - scores: pandas dataframe with the scores for each of the cross-validation folds"""
    kf = KFold(n_splits=folds)
    scores = []
    for train_index, test_index in kf.split(X_cross):
        X_train, X_test = X_cross[train_index], X_cross[test_index]
        ## drop=True keeps the old index from being added as an extra first column
        Y_train, Y_test = y_cross.loc[train_index].reset_index(drop=True), y_cross.loc[test_index].reset_index(drop=True)
        X_train_labels, X_test_labels = train_feat.loc[train_index].reset_index(drop=True), train_feat.loc[test_index].reset_index(drop=True)
        output = pd.DataFrame({'pid': Y_test['pid'].values})
        output = prob_classsifier(X_train, Y_train, X_test, output)
        output = classifier(X_train, Y_train, X_test, output)
        output = regressor(X_train_labels, Y_train, X_test_labels, output)
        print("Fold score", get_score(Y_test, output))
        scores.append(get_score(Y_test, output))
    
    scores = pd.DataFrame(scores, columns=['Task1', "Task2", "Task3", "Average"])
    ## Column-wise mean over the folds
    print("FINAL SCORE: ", scores.mean())
    
    return scores


scores = crossvalidation_analysis(X_train, Y_train, train_feat)

ROC AUC validation and training score (training score on probability estimates), for each label:
0.9394005620420716   0.9590585149615444
0.826178216314935   0.9335226318120724
0.765157420434237   0.8575271967725269
0.743072800078695   0.8766099141066033
0.7415546779555077   0.8641397035598006
0.8060797409431287   0.9005914863376695
0.8931119221411192   0.9598435552959501
0.8356163718500501   0.922006993221125
0.7049557220708447   0.9476332236180761
0.9345515946943531   0.9907608277656279
ROC AUC validation and training score (training score on probability estimates), for each label:
0.6756182271739217   0.8776616627395168
Training scores for each label:
0.37916184374386375
0.5876642668166384
0.38004478408501774
0.6135093222785881
Score task 1:  0.8175721006246818
Score task 2:  0.7184949906561462
Score task 3:  0.7463243116873541
Fold score [0.8175721006246818, 0.7184949906561462, 0.7463243116873541, 0.7607971343227273]
Score task 1:  0.8175721006246818
Score task 2:  0.718494990656146
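
The subtask 3 term in `get_score` maps each R² value onto the range [0.5, 1], so a regressor that does no better than predicting the mean still earns 0.5. Below is a small sketch of that mapping in plain Python, taking the R² values as given inputs; the helper name `vital_score` is hypothetical.

```python
def vital_score(r2_values):
    """Mean of 0.5 + 0.5 * max(0, R^2) over the given R^2 values."""
    return sum(0.5 + 0.5 * max(0.0, r2) for r2 in r2_values) / len(r2_values)


# A negative R^2 is clipped to 0, so it contributes exactly 0.5
clipped = vital_score([-0.3])
# A perfect fit contributes 1.0
perfect = vital_score([1.0])
# Two labels with R^2 of 0.4 and 0.6 map to 0.7 and 0.8 before averaging
mixed = vital_score([0.4, 0.6])
print(clipped, perfect, mixed)
```

For instance, R² values between roughly 0.38 and 0.61, like the training scores printed for subtask 3, map to about 0.69 and 0.81.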

## Overall scores of our classifiers and regressor

The following table summarises the scores for each subtask, and their average, across the folds of a 5-fold cross-validation.

In [8]:
scores

|   | Task1 | Task2 | Task3 | Average |
|---|---|---|---|---|
| 0 | 0.817572 | 0.718495 | 0.746324 | 0.760797 |
| 1 | 0.819049 | 0.686426 | 0.7405 | 0.748658 |
| 2 | 0.828821 | 0.713874 | 0.742998 | 0.761898 |
| 3 | 0.825023 | 0.692353 | 0.735145 | 0.75084 |
| 4 | 0.815887 | 0.733771 | 0.751819 | 0.767159 |


In [9]:
scores.describe()

|   | Task1 | Task2 | Task3 | Average |
|---|---|---|---|---|
| count | 5.0 | 5.0 | 5.0 | 5.0 |
| mean | 0.821271 | 0.708984 | 0.743357 | 0.75787 |
| std | 0.005447 | 0.019456 | 0.006248 | 0.007832 |
| min | 0.815887 | 0.686426 | 0.735145 | 0.748658 |
| 25% | 0.817572 | 0.692353 | 0.7405 | 0.75084 |
| 50% | 0.819049 | 0.713874 | 0.742998 | 0.760797 |
| 75% | 0.825023 | 0.718495 | 0.746324 | 0.761898 |
| max | 0.828821 | 0.733771 | 0.751819 | 0.767159 |


## Results Log

### Subtask 1. Binary Relevance and HGBC

|   | C | kernel | gamma | weight | features | n_features | F1 score | AUC | runtime (min) |
|---|---|---|---|---|---|---|---|---|---|
| run_1 |  1 |  rbf | scale  |  balanced |  median for NA's and mean  | 35 | 0.598165656150447 | ? | 33 |
| run_2 |  1 |  rbf | scale  |  balanced |  median for NA's and mean, max, min, median, sd  | 170 | 0.628216870267411 |?| 102 |
| run_3 |  1 |  rbf | scale  |  balanced |  median for NA's and mean, max, min, median, sd, range, skw, kurt  | 272 | 0.649372121402984 | 0.8236937992110356 | 141 |
| run_4 |  HGBC |  HGBC | HGBC |  HGBC |  median for NA's and mean, max, min, median, sd, range, skw, kurt  | 272 | 0.871097657278231* | 0.8222653647930391 | 0.5 |

*I think this score is high because f1_micro is more severe when all labels are taken into account jointly than when each label is scored one by one and the results are averaged. Hence I don't believe the HGBC is superior in terms of performance; otherwise we would also have observed a big increase in the AUC.
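
One way such a gap can arise, offered here as an assumption about how the two numbers were computed rather than a confirmed account: micro-averaged F1 pooled over all labels counts only the positive class, whereas micro-averaging within a single binary label pools both classes and reduces to plain accuracy, which is high when positives are rare. The plain-Python sketch below illustrates the difference on made-up labels; the helper names `counts` and `micro_f1` are hypothetical.

```python
def micro_f1(tp, fp, fn):
    """F1 from pooled true-positive, false-positive and false-negative counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def counts(y_true, y_pred, positive):
    """TP, FP, FN when `positive` is treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return tp, fp, fn


# Made-up, imbalanced toy labels (not the project data)
y_true = {'A': [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
          'B': [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]}
y_pred = {'A': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
          'B': [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]}

# (a) All labels jointly: pool the positive-class counts over every label
pooled = [counts(y_true[k], y_pred[k], 1) for k in y_true]
joint_micro = micro_f1(*map(sum, zip(*pooled)))

# (b) One label at a time: pool both classes (0 and 1) within a label, then average
per_label = []
for k in y_true:
    both = [counts(y_true[k], y_pred[k], c) for c in (0, 1)]
    per_label.append(micro_f1(*map(sum, zip(*both))))
averaged_micro = sum(per_label) / len(per_label)

print(joint_micro, averaged_micro)
```

On these toy labels the joint score is far lower than the per-label average, even though the predictions are identical, which is consistent with the AUC staying roughly unchanged between run_3 and run_4.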

### Subtask 3. Lasso

Training scores for normalized and unnormalized imputed data, restricted to the labels:

|normalized|unnormalized|
|---|---|
|0.37770345083252754 | 0.37759566055685045|
|0.5859785441608802  | 0.5856645886174903 |
|0.38386074780919743 | 0.3842306307116389 |
|0.6144724385873669  | 0.6142282361433877 | 

The submission scores differed only slightly between normalized and unnormalized data: 0.754641671097 and 0.754664968318, respectively. We therefore decided to use the normalized data, because this way we do not need two imputation scripts.