# Introduction to Machine Learning - Task 2

Group name: Cbbayes

Team members: mcolomer (mcolomer@student.ethz.ch), pratsink (pratsink@student.ethz.ch) and scastro (scastro@student.ethz.ch)

Spring 2021

## Import libraries

In [1]:
import pandas as pd
import numpy as np

## Read data 

In [9]:
# Indicate the path to the data file
path = "../data/"

train_features = pd.read_csv(path+"train_features.csv")
train_labels = pd.read_csv(path+"train_labels.csv")
test_features = pd.read_csv(path+"test_features.csv")


## Description of the train features:

All medical measurements are not available at each timestep, meaning the data contains a lot of missing values, indicated with ‘nan’ in our case. To simplify the problem, the data is already re-sampled hourly. This means that we aggregate measurements by one-hour period, thus there are only 12 rows for a given patient in the corresponding .csv file.

Either vital signs such as the Heart rate, or lab tests such as Calcium level in the patient blood. Finally, you are provided with the Age of the patient which is the same during the entire stay.


In [3]:
## Feature engineering
feature_labels = list(train_features.columns)
feature_labels = ['EtCO2', 'PTT', 'BUN','Lactate',
 'Temp', 'Hgb', 'HCO3', 'BaseExcess','RRate','Fibrinogen', 'Phosphate',
'WBC', 'Creatinine', 'PaCO2', 'AST','FiO2','Platelets','SaO2',
 'Glucose', 'ABPm', 'Magnesium',
 'Potassium', 'ABPd','Calcium', 'Alkalinephos',
 'SpO2', 'Bilirubin_direct', 'Chloride', 'Hct',
 'Heartrate', 'Bilirubin_total', 'TroponinI', 'ABPs','pH']

def create_lag(dataset, feature):
    #Create a lag and a median feature
    column_lag = feature+"_lag"
    column_median = feature+"_median"
    dataset[column_lag] = dataset.groupby(['pid'])[feature].diff()
    dataset[column_median] = dataset.groupby(['pid'])[feature].median()

    return dataset

X_train = train_features.copy()
X_test = test_features.copy()


for feature in feature_labels:
    print(feature)
    X_train[feature].fillna(X_train.groupby("pid")[feature].transform('median'))
    X_test[feature].fillna(X_test.groupby("pid")[feature].transform('median'))
    X_train = create_lag(X_train, feature)
    X_test = create_lag(X_test, feature)
    
X_train = X_train.fillna(X_train.median())
X_test = X_test.fillna(X_test.median())

X_train = X_train.groupby(by=["pid"]).mean()
X_test = X_test.groupby(by=["pid"]).mean()

EtCO2
PTT
BUN
Lactate
Temp
Hgb
HCO3
BaseExcess
RRate
Fibrinogen
Phosphate
WBC
Creatinine
PaCO2
AST
FiO2
Platelets
SaO2
Glucose
ABPm
Magnesium
Potassium
ABPd
Calcium
Alkalinephos
SpO2
Bilirubin_direct
Chloride
Hct
Heartrate
Bilirubin_total
TroponinI
ABPs
pH


## Description of the train labels:

### SubTask 1) Produce (probabilistic) real-valued predictions in the interval [0, 1].
The corresponding columns containing the binary ground truth in train_labels.csv are: LABEL_BaseExcess, LABEL_Fibrinogen, LABEL_AST, LABEL_Alkalinephos, LABEL_Bilirubin_total, LABEL_Lactate, LABEL_TroponinI, LABEL_SaO2, LABEL_Bilirubin_direct, LABEL_EtCO2.

### SubTask 2) Binary: 0 or 1
The corresponding column containing the binary ground-truth in train_labels.csv is LABEL_Sepsis.

Note: for subtasks 1 and 2, you will need to produce predictions in the interval [0, 1]. How can you achieve this with an SVM? In the lecture, you have seen that the SVM prediction for binary classification is 
sign. In order to produce real-valued predictions in the interval [0, 1] with SVM, you can replace the 
sign function by the sigmoid function.

### SubTask 3) Regression 
The corresponding columns containing the real-valued ground truth in train_labels.csv are: LABEL_RRate, LABEL_ABPm, LABEL_SpO2, LABEL_Heartrate.



**Additional notes**: Both train_features.csv and test_features.csv contain missing values ('nan' entries). An important part of this project is how to deal with such missing data (known as '**data imputation**' in the ML literature) and **feature engineering** (what features can you extract from measurements taken in consecutive hours, etc.)

Ideas:

For SubTask 1a and 1b -> Engineer new features, perform SVM

For SubTask 1c -> Regression, take into account we need to output the values of the features at hour 13


## Dealing with NaN

Finally, we could fill the gaps. This is called data imputation and there are many strategies that could be used to fill the gaps. Three methods that may perform well include:

Persisting the last observed value forward (linear).
Use the median value for the hour of day within the chunk.
Use the median value for the hour of day across chunks.


## Subtask 1a

Produce (probabilistic) real-valued predictions in the interval [0, 1].
The corresponding columns containing the binary ground truth in train_labels.csv are: LABEL_BaseExcess, LABEL_Fibrinogen, LABEL_AST, LABEL_Alkalinephos, LABEL_Bilirubin_total, LABEL_Lactate, LABEL_TroponinI, LABEL_SaO2, LABEL_Bilirubin_direct, LABEL_EtCO2.



In [10]:
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn import linear_model



def fit_RFC_prob(X_train, y_train, X_test):
    """Probabilistic values from
    random forest for predicting
    the regression features"""
    clf = RandomForestClassifier()
    clf.fit(X_train, y_train)
    predicted = clf.predict_proba(X_test)
    return predicted[:,1] #Choose probabilistic values for class 1

def fit_RFC(X_train, y_train, X_test):
    """Random forest for predicting
    the regression features"""
    clf = RandomForestClassifier()
    clf.fit(X_train, y_train)
    predicted = clf.predict(X_test)
    return predicted


def fit_Lasso(X_train, y_train, X_test):
    """Lasso regressor for predicting
    the regression features"""
    clf = linear_model.Lasso(alpha=0.1)
    clf.fit(X_train, y_train)
    predicted = clf.predict(X_test)
    return predicted


#Create the output dataframe
output_dataframe = test_features.groupby(by=["pid"], as_index=False).mean()



labels_a = ["LABEL_BaseExcess", "LABEL_Fibrinogen", 
          "LABEL_AST", "LABEL_Alkalinephos", "LABEL_Bilirubin_total", 
          "LABEL_Lactate", "LABEL_TroponinI", "LABEL_SaO2", "LABEL_Bilirubin_direct", "LABEL_EtCO2"]


for feature in labels_a:
    print(feature)
    y_train = train_labels[feature]
    output_dataframe[feature] = fit_RFC_prob(X_train, y_train, X_test)
    
labels_b = ["LABEL_Sepsis"]

for feature in labels_b:
    print(feature)
    y_train = train_labels[feature]
    output_dataframe[feature] = fit_RFC(X_train, y_train, X_test)
    
labels_c = ["LABEL_RRate", "LABEL_ABPm", "LABEL_SpO2", "LABEL_Heartrate"]


for feature in labels_c:
    print(feature)
    y_train = train_labels[feature]
    output_dataframe[feature] = fit_Lasso(X_train, y_train, X_test)
    

LABEL_BaseExcess




[0. 1.]
LABEL_Fibrinogen




[0. 1.]
LABEL_AST




[0. 1.]
LABEL_Alkalinephos




[0. 1.]
LABEL_Bilirubin_total




[0. 1.]
LABEL_Lactate




[0. 1.]
LABEL_TroponinI




[0. 1.]
LABEL_SaO2




[0. 1.]
LABEL_Bilirubin_direct




[0. 1.]
LABEL_EtCO2




[0. 1.]
LABEL_Sepsis




LABEL_RRate
LABEL_ABPm
LABEL_SpO2
LABEL_Heartrate


In [12]:
#Create output file

output_columns = ["pid"]+labels_a+labels_b+labels_c
sample = pd.read_csv(path+"sample.csv")

output_columns = list(sample.columns)
output = output_dataframe[output_columns]
output.to_csv("../output/"+'prediction_5.zip', index=False, float_format='%.3f', compression='zip')

In [13]:
output

Unnamed: 0,pid,LABEL_BaseExcess,LABEL_Fibrinogen,LABEL_AST,LABEL_Alkalinephos,LABEL_Bilirubin_total,LABEL_Lactate,LABEL_TroponinI,LABEL_SaO2,LABEL_Bilirubin_direct,LABEL_EtCO2,LABEL_Sepsis,LABEL_RRate,LABEL_ABPm,LABEL_SpO2,LABEL_Heartrate
0,0,0.4,0.2,0.5,0.4,0.5,0.4,0.2,0.5,0.2,0.1,0.0,18.749403,81.782172,96.942572,84.230554
1,3,0.3,0.0,0.2,0.2,0.3,0.3,0.1,0.2,0.0,0.4,0.0,18.827689,82.542756,96.907013,84.264837
2,5,0.4,0.1,0.0,0.1,0.2,0.3,0.0,0.2,0.0,0.1,0.0,18.793512,82.858547,96.946206,84.128266
3,7,0.6,0.0,0.2,0.2,0.3,0.0,0.0,0.3,0.2,0.1,0.0,19.184185,83.735821,97.029362,84.625243
4,9,0.5,0.0,0.2,0.2,0.3,0.3,0.0,0.2,0.0,0.1,0.0,18.733537,82.500228,96.963035,83.606771
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12659,31647,0.5,0.1,0.1,0.3,0.4,0.3,0.2,0.1,0.0,0.1,0.0,18.696079,82.806950,97.007095,83.682735
12660,31649,0.3,0.2,0.2,0.4,0.4,0.0,0.1,0.3,0.1,0.0,0.0,18.856438,82.376880,96.947983,84.143699
12661,31651,0.3,0.0,0.5,0.4,0.5,0.4,0.4,0.4,0.0,0.1,0.0,18.811595,82.280806,96.903950,84.273450
12662,31652,0.1,0.2,0.5,0.2,0.4,0.4,0.1,0.6,0.1,0.1,0.0,18.663080,82.059539,96.984852,83.282791
