# Introduction to Machine Learning - Task 2

Group name: Cbbayes

Team members: mcolomer (mcolomer@student.ethz.ch), pratsink (pratsink@student.ethz.ch) and scastro (scastro@student.ethz.ch)

Spring 2021

## Import libraries

In [49]:
import pandas as pd
import numpy as np

## Read data 

In [50]:
# Indicate the path to the data file
path = "../data/"


train_features = pd.read_csv(path+"train_features.csv")
train_labels = pd.read_csv(path+"train_labels.csv")
test_features = pd.read_csv(path+"test_features.csv")


## Description of the train features:

Not all medical measurements are available at each timestep, so the data contains many missing values, indicated with 'nan' in our case. To simplify the problem, the data has already been re-sampled hourly: measurements are aggregated by one-hour periods, so there are only 12 rows for a given patient in the corresponding .csv file.
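
A quick way to check both points (a fixed number of hourly rows per patient, and many missing values) is sketched below on a small made-up frame. The helper name `summarize_missingness` and the toy values are assumptions for illustration; the same function could be applied to `train_features`.

```python
import numpy as np
import pandas as pd


def summarize_missingness(df, id_col="pid"):
    """Return the number of rows per patient and the fraction of NaN per column."""
    rows_per_patient = df.groupby(id_col).size()
    nan_fraction = df.drop(columns=[id_col]).isna().mean().sort_values(ascending=False)
    return rows_per_patient, nan_fraction


# Toy frame with the same layout as train_features (values are made up).
toy = pd.DataFrame({
    "pid": [1, 1, 1, 2, 2, 2],
    "Time": [1, 2, 3, 1, 2, 3],
    "Heartrate": [94.0, 99.0, np.nan, 80.0, 83.0, 80.0],
    "Lactate": [np.nan, np.nan, np.nan, 1.2, np.nan, np.nan],
})

rows_per_patient, nan_fraction = summarize_missingness(toy)
print(rows_per_patient)
print(nan_fraction)
```

On the real data, `rows_per_patient` should be 12 for every `pid` if the description above holds.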

The features are either vital signs, such as the heart rate, or lab tests, such as the calcium level in the patient's blood. Finally, the age of the patient is provided; it stays the same during the entire stay.


In [51]:
# Looking at the data
train_features



Unnamed: 0,pid,Time,Age,EtCO2,PTT,BUN,Lactate,Temp,Hgb,HCO3,...,Alkalinephos,SpO2,Bilirubin_direct,Chloride,Hct,Heartrate,Bilirubin_total,TroponinI,ABPs,pH
0,1,3,34.0,,,12.0,,36.0,8.7,24.0,...,,100.0,,114.0,24.6,94.0,,,142.0,7.33
1,1,4,34.0,,,,,36.0,,,...,,100.0,,,,99.0,,,125.0,7.33
2,1,5,34.0,,,,,36.0,,,...,,100.0,,,,92.0,,,110.0,7.37
3,1,6,34.0,,,,,37.0,,,...,,100.0,,,,88.0,,,104.0,7.37
4,1,7,34.0,,,,,,,,...,,100.0,,,22.4,81.0,,,100.0,7.41
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
227935,9999,8,85.0,,,,,,,,...,,,,,,80.0,,,110.0,
227936,9999,9,85.0,,,,,,,,...,,,,,,83.0,,,123.0,
227937,9999,10,85.0,,,,,36.0,,,...,,98.0,,,,80.0,,,138.0,
227938,9999,11,85.0,,,,,,10.2,,...,,98.0,,,31.0,75.0,,,125.0,


In [None]:
## Feature engineering

feature_labels = ['EtCO2', 'PTT', 'BUN','Lactate',
 'Temp', 'Hgb', 'HCO3', 'BaseExcess','RRate','Fibrinogen', 'Phosphate',
'WBC', 'Creatinine', 'PaCO2', 'AST','FiO2','Platelets','SaO2',
 'Glucose', 'ABPm', 'Magnesium',
 'Potassium', 'ABPd','Calcium', 'Alkalinephos',
 'SpO2', 'Bilirubin_direct', 'Chloride', 'Hct',
 'Heartrate', 'Bilirubin_total', 'TroponinI', 'ABPs','pH']

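def create_lag(dataset, feature):
    # Despite the name, this stores the hour-to-hour change (first difference)
    # of the feature within each patient; the first row per patient is NaN.
    column_lag = feature+"_lag"
    dataset[column_lag] = dataset.groupby(['pid'])[feature].diff()
    return dataset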

X_train = train_features.copy()
X_test = test_features.copy()


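for feature in feature_labels:
    print(feature)
    # Impute missing values with the per-patient mean. fillna returns a new
    # Series, so the result must be assigned back to the column.
    X_train[feature] = X_train[feature].fillna(X_train.groupby("pid")[feature].transform('mean'))
    X_test[feature] = X_test[feature].fillna(X_test.groupby("pid")[feature].transform('mean'))
    X_train = create_lag(X_train, feature)
    X_test = create_lag(X_test, feature)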

EtCO2
PTT
BUN
Lactate
Temp
Hgb
HCO3
BaseExcess
RRate
Fibrinogen
Phosphate
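
To make the two feature-engineering steps concrete, the sketch below applies per-patient mean imputation and the per-patient first difference to a made-up two-patient frame. Note that `Series.fillna` returns a new Series, so its result has to be assigned back. The toy values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "pid": [1, 1, 1, 2, 2, 2],
    "Heartrate": [90.0, np.nan, 100.0, 70.0, 74.0, np.nan],
})

# Per-patient mean, broadcast back to the original row order.
patient_mean = toy.groupby("pid")["Heartrate"].transform("mean")

# fillna returns a new Series, so assign it back to the column.
toy["Heartrate"] = toy["Heartrate"].fillna(patient_mean)

# First difference within each patient; the first row of each patient is NaN.
toy["Heartrate_lag"] = toy.groupby("pid")["Heartrate"].diff()

print(toy)
```

A patient with no measurement at all for a feature still has NaN after this step, because the per-patient mean is itself undefined; such cases would need a second, global imputation pass.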


## Description of the train labels:

### SubTask 1) Produce (probabilistic) real-valued predictions in the interval [0, 1].
The corresponding columns containing the binary ground truth in train_labels.csv are: LABEL_BaseExcess, LABEL_Fibrinogen, LABEL_AST, LABEL_Alkalinephos, LABEL_Bilirubin_total, LABEL_Lactate, LABEL_TroponinI, LABEL_SaO2, LABEL_Bilirubin_direct, LABEL_EtCO2.

### SubTask 2) Binary: 0 or 1
The corresponding column containing the binary ground-truth in train_labels.csv is LABEL_Sepsis.

Note: for subtasks 1 and 2, you will need to produce predictions in the interval [0, 1]. How can you achieve this with an SVM? In the lecture, you have seen that the SVM prediction for binary classification is $\mathrm{sign}(w^\top x)$. In order to produce real-valued predictions in the interval [0, 1] with an SVM, you can replace the sign function by the sigmoid function $\sigma(z) = 1/(1 + e^{-z})$.
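
As a minimal sketch of this idea, the code below fits a linear SVM on synthetic two-class data and passes its decision scores through a sigmoid. The data, the `sigmoid` helper, and the choice of `SVC(kernel="linear")` are assumptions for illustration; the resulting values lie in [0, 1] but are not calibrated probabilities.

```python
import numpy as np
from sklearn.svm import SVC


def sigmoid(z):
    """Map real-valued scores to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


# Synthetic, well-separated two-class data (made up for this example).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(-2.0, 1.0, size=(50, 2)),
    rng.normal(2.0, 1.0, size=(50, 2)),
])
y = np.array([0] * 50 + [1] * 50)

clf = SVC(kernel="linear")
clf.fit(X, y)

# decision_function returns the signed score whose sign gives the class.
scores = clf.decision_function(X)
probs = sigmoid(scores)

print(probs[:5])
```

Thresholding `probs` at 0.5 reproduces the usual sign-based prediction, since the sigmoid is 0.5 exactly where the score is 0.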

### SubTask 3) Regression 
The corresponding columns containing the real-valued ground truth in train_labels.csv are: LABEL_RRate, LABEL_ABPm, LABEL_SpO2, LABEL_Heartrate.



**Additional notes**: Both train_features.csv and test_features.csv contain missing values ('nan' entries). An important part of this project is how to deal with such missing data (known as '**data imputation**' in the ML literature) and **feature engineering** (what features can you extract from measurements taken in consecutive hours, etc.)

Ideas:

For SubTask 1a and 1b -> Engineer new features, perform SVM

For SubTask 1c -> Regression, take into account we need to output the values of the features at hour 13


## Dealing with NaN

Missing values can be filled in. This is called data imputation, and there are many strategies for it. Three methods that may perform well:

- Persist the last observed value forward (forward fill).
- Use the median value for the hour of day within the chunk.
- Use the median value for the hour of day across chunks.
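
The three strategies can be sketched with pandas as follows. This is one possible reading for this dataset, under the assumption that a "chunk" corresponds to one patient (`pid`) and the "hour" to the `Time` column; the toy values are made up.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "pid": [1, 1, 1, 2, 2, 2],
    "Time": [1, 2, 3, 1, 2, 3],
    "Temp": [36.0, np.nan, 37.0, np.nan, 38.0, np.nan],
})

# 1) Persist the last observed value forward within each patient.
#    Leading gaps stay NaN because there is nothing to carry forward.
ffill = toy.groupby("pid")["Temp"].ffill()

# 2) Median of the patient's own observed values.
within = toy["Temp"].fillna(toy.groupby("pid")["Temp"].transform("median"))

# 3) Median of the values observed at the same hour across all patients.
across = toy["Temp"].fillna(toy.groupby("Time")["Temp"].transform("median"))

print(pd.DataFrame({"ffill": ffill, "within": within, "across": across}))
```

Forward fill leaves the first row of patient 2 empty, which is why it is often combined with one of the median-based fallbacks.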


In [4]:
# Looking at the class labels
train_labels

Unnamed: 0,pid,LABEL_BaseExcess,LABEL_Fibrinogen,LABEL_AST,LABEL_Alkalinephos,LABEL_Bilirubin_total,LABEL_Lactate,LABEL_TroponinI,LABEL_SaO2,LABEL_Bilirubin_direct,LABEL_EtCO2,LABEL_Sepsis,LABEL_RRate,LABEL_ABPm,LABEL_SpO2,LABEL_Heartrate
0,1,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,12.1,85.4,100.0,59.9
1,10,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,17.8,100.6,95.5,85.5
2,100,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,16.5,88.3,96.5,108.1
3,1000,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,19.4,77.2,98.3,80.9
4,10000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,12.6,76.8,97.7,95.3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18990,9993,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,17.1,69.8,100.0,110.7
18991,9995,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,17.6,97.3,97.8,59.2
18992,9996,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,17.3,66.3,96.9,100.3
18993,9998,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,18.8,81.5,96.9,99.4


In [14]:
test_features

Unnamed: 0,pid,Time,Age,EtCO2,PTT,BUN,Lactate,Temp,Hgb,HCO3,...,Alkalinephos,SpO2,Bilirubin_direct,Chloride,Hct,Heartrate,Bilirubin_total,TroponinI,ABPs,pH
0,0,1,39.0,,,,,,,,...,,,,,,,,,,
1,0,2,39.0,,44.2,17.0,,36.0,10.2,13.0,...,119.0,100.0,,98.0,31.0,82.0,21.8,,119.0,
2,0,3,39.0,,,,,,,,...,,100.0,,,,78.0,,,125.0,7.34
3,0,4,39.0,,,,,,,,...,,100.0,,,,80.0,,,136.0,
4,0,5,39.0,,,,,,,,...,,100.0,,,,83.0,,,135.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
151963,9997,8,57.0,,,,,,,,...,,100.0,,,,84.0,,,103.0,
151964,9997,9,57.0,,,,,,,,...,,100.0,,,,83.0,,,110.0,
151965,9997,10,57.0,,,,,,,,...,,100.0,,,,88.0,,,111.0,
151966,9997,11,57.0,,,,,37.0,,,...,,100.0,,,,89.0,,,118.0,


## Subtask 1a

Produce (probabilistic) real-valued predictions in the interval [0, 1].
The corresponding columns containing the binary ground truth in train_labels.csv are: LABEL_BaseExcess, LABEL_Fibrinogen, LABEL_AST, LABEL_Alkalinephos, LABEL_Bilirubin_total, LABEL_Lactate, LABEL_TroponinI, LABEL_SaO2, LABEL_Bilirubin_direct, LABEL_EtCO2.



In [42]:
from sklearn.svm import SVC

labels_a = ["LABEL_BaseExcess", "LABEL_Fibrinogen", 
          "LABEL_AST", "LABEL_Alkalinephos", "LABEL_Bilirubin_total", 
          "LABEL_Lactate", "LABEL_TroponinI", "LABEL_SaO2", "LABEL_Bilirubin_direct", "LABEL_EtCO2"]


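def fit_SVM(X_train, y_train, X_test):
    clf = SVC()
    clf.fit(X_train, y_train)
    # Map the signed decision scores to [0, 1] with a sigmoid, since the task
    # asks for real-valued predictions rather than hard 0/1 labels.
    scores = clf.decision_function(X_test)
    predicted = 1.0 / (1.0 + np.exp(-scores))
    return predicted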




output_dataframe = test_features.copy()
for feature in ["LABEL_BaseExcess"]:
    y_train = train_labels[feature]
    output_dataframe[feature] = fit_SVM(X_train, y_train, X_test)
    
    

ValueError: Found array with 0 sample(s) (shape=(0, 71)) while a minimum of 1 is required.
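
The error reports a feature matrix with zero rows. One plausible cause (an assumption, since the cell that produced the empty matrix is not shown) is that rows containing any NaN were dropped, which can remove every row in data this sparse. Separately, the feature table has several rows per patient while `train_labels` has one row per `pid`, so the two need to be aligned before fitting. A minimal sketch of one way to do both, using per-patient summary statistics and median imputation; the helper name `aggregate_per_patient` and the toy data are hypothetical:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def aggregate_per_patient(features, columns):
    """Collapse the hourly rows of each patient into one row of summary statistics."""
    agg = features.groupby("pid")[columns].agg(["mean", "min", "max"])
    agg.columns = ["_".join(col) for col in agg.columns]
    return agg  # one row per patient, indexed by pid


# Toy data: 40 patients, 12 hourly rows each, with injected missing values.
rng = np.random.default_rng(0)
n_patients, n_hours = 40, 12
pids = np.repeat(np.arange(n_patients), n_hours)
labels = pd.Series(np.arange(n_patients) % 2, index=np.arange(n_patients),
                   name="LABEL_BaseExcess")
heart = rng.normal(80, 5, size=pids.size) + 20 * labels.to_numpy()[pids]
heart[rng.random(pids.size) < 0.3] = np.nan
toy_features = pd.DataFrame({"pid": pids, "Heartrate": heart})

X = aggregate_per_patient(toy_features, ["Heartrate"])
y = labels.loc[X.index]  # align labels with the aggregated rows by pid

# Impute remaining gaps, scale, then fit the SVM.
model = make_pipeline(SimpleImputer(strategy="median"), StandardScaler(), SVC())
model.fit(X, y)

# Sigmoid of the decision scores gives values in [0, 1].
probs = 1.0 / (1.0 + np.exp(-model.decision_function(X)))
print(X.shape, probs[:5])
```

With this layout the predictions also have one value per patient, which matches the shape of the sample submission.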

## Subtask 1b

The corresponding column containing the binary ground-truth in train_labels.csv is LABEL_Sepsis.

Note: for subtasks 1 and 2, you will need to produce predictions in the interval [0, 1]. How can you achieve this with an SVM? In the lecture, you have seen that the SVM prediction for binary classification is $\mathrm{sign}(w^\top x)$. In order to produce real-valued predictions in the interval [0, 1] with an SVM, you can replace the sign function by the sigmoid function $\sigma(z) = 1/(1 + e^{-z})$.


In [10]:
labels_b = ["LABEL_Sepsis"]

## Subtask 1c
Regression 
The corresponding columns containing the real-valued ground truth in train_labels.csv are: LABEL_RRate, LABEL_ABPm, LABEL_SpO2, LABEL_Heartrate.

In [11]:
labels_c = ["LABEL_RRate", "LABEL_ABPm", "LABEL_SpO2", "LABEL_Heartrate"]
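
No model is chosen for the regression subtask in this notebook. As a hedged sketch of what fitting several real-valued targets at once could look like, the code below uses ridge regression on synthetic data; the choice of `Ridge`, the `alpha` value, and the stand-in arrays are all assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Stand-in for per-patient features (100 patients, 3 features).
X = rng.normal(size=(100, 3))

# Two stand-in real-valued targets, linear in the features plus small noise.
W = np.array([[1.0, 0.0], [0.0, 2.0], [0.5, -1.0]])
Y = X @ W + 0.01 * rng.normal(size=(100, 2))

# Ridge accepts a 2D target and fits one linear model per column.
reg = Ridge(alpha=1.0)
reg.fit(X, Y)
pred = reg.predict(X)

print(pred.shape, reg.score(X, Y))
```

In the real task, `Y` would be the columns listed in `labels_c` taken from `train_labels`, aligned with the feature rows by `pid`.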

## Save to output file

We need to export the output file as a zip archive. The sample submission below shows the expected format, followed by the export command we are asked to use:

In [17]:
sample = pd.read_csv(path+"sample.csv")
sample

Unnamed: 0,pid,LABEL_BaseExcess,LABEL_Fibrinogen,LABEL_AST,LABEL_Alkalinephos,LABEL_Bilirubin_total,LABEL_Lactate,LABEL_TroponinI,LABEL_SaO2,LABEL_Bilirubin_direct,LABEL_EtCO2,LABEL_Sepsis,LABEL_RRate,LABEL_ABPm,LABEL_SpO2,LABEL_Heartrate
0,0,0.940,0.341,0.597,0.651,0.557,0.745,0.224,0.363,0.506,0.643,0.162,18.796,82.511,96.947,84.12
1,10001,0.773,0.320,0.451,0.152,0.001,0.525,0.276,0.327,0.316,0.656,0.486,18.796,82.511,96.947,84.12
2,10003,0.741,0.211,0.348,0.153,0.859,0.446,0.406,0.607,0.757,0.290,0.451,18.796,82.511,96.947,84.12
3,10004,0.147,0.312,0.733,0.129,0.356,0.367,0.931,0.715,0.434,0.005,0.785,18.796,82.511,96.947,84.12
4,10005,0.255,0.746,0.587,0.743,0.248,0.330,0.071,0.291,0.399,0.217,0.040,18.796,82.511,96.947,84.12
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12659,9989,0.943,0.541,0.373,0.944,0.562,0.594,0.838,0.938,0.401,0.195,0.647,18.796,82.511,96.947,84.12
12660,9991,0.561,0.040,0.095,0.667,0.918,0.323,0.784,0.343,0.552,0.047,0.916,18.796,82.511,96.947,84.12
12661,9992,0.112,0.962,0.967,0.564,0.064,0.545,0.210,0.853,0.429,0.829,0.093,18.796,82.511,96.947,84.12
12662,9994,0.892,0.540,0.868,0.201,0.259,0.632,0.282,0.810,0.724,0.074,0.936,18.796,82.511,96.947,84.12


In [7]:
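# NOTE: this writes train_labels, which only exercises the export command;
# the actual submission should contain the predictions instead.
train_labels.to_csv("../output/"+'prediction_1.zip', index=False, float_format='%.3f', compression='zip')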
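
A small round-trip check of the export command is sketched below: it writes a made-up predictions frame with the same `to_csv` arguments and reads the archive back. The `predictions` frame and the temporary output path are assumptions for illustration.

```python
import os
import tempfile

import pandas as pd

# Made-up predictions with the same kind of columns as the sample file.
predictions = pd.DataFrame({
    "pid": [0, 10001],
    "LABEL_Sepsis": [0.16234, 0.48611],
    "LABEL_Heartrate": [84.1204, 84.1204],
})

# Write to a temporary directory using the same arguments as above.
out_path = os.path.join(tempfile.mkdtemp(), "prediction.zip")
predictions.to_csv(out_path, index=False, float_format="%.3f", compression="zip")

# read_csv infers zip compression from the file extension.
check = pd.read_csv(out_path)
print(check)
```

Reading the file back confirms that the columns survive the export and that floats are rounded to three decimals.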

In [None]:
sampl