# Introduction to Machine Learning - Task 2

Group name: Cbbayes

Team members: mcolomer (mcolomer@student.ethz.ch), pratsink (pratsink@student.ethz.ch) and scastro (scastro@student.ethz.ch)

Spring 2021

## Import libraries

In [1]:
import pandas as pd
import numpy as np

## Read data 

In [2]:
# Path to the directory that contains the data files
path = "../data/"


train_features = pd.read_csv(path+"train_features.csv")
train_labels = pd.read_csv(path+"train_labels.csv")
test_features = pd.read_csv(path+"test_features.csv")


## Description of the train features:

Not all medical measurements are available at each timestep, so the data contains many missing values, indicated with 'nan' in our case. To simplify the problem, the data has already been re-sampled hourly: measurements are aggregated over one-hour periods, so there are only 12 rows per patient in the corresponding .csv file.

Each feature is either a vital sign, such as the heart rate, or a lab test, such as the calcium level in the patient's blood. In addition, the Age of the patient is provided, which stays the same during the entire stay.
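A quick way to check both properties described above (12 rows per patient, many missing values) is sketched below. It runs on a small toy frame rather than on `train_features`; the toy values and the names `toy`, `rows_per_patient` and `missing_fraction` are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train_features: two hypothetical patients, 12 hourly rows each.
toy = pd.DataFrame({
    "pid": np.repeat([1, 2], 12),
    "Time": np.tile(np.arange(1, 13), 2),
    "Age": np.repeat([34.0, 85.0], 12),
    "Heartrate": np.r_[np.full(12, 90.0), np.full(12, 80.0)],
    "Lactate": np.nan,  # a lab test that was never measured
})
toy.loc[[3, 15], "Heartrate"] = np.nan  # two missing vital-sign readings

# Number of rows per patient; we expect 12 for every pid.
rows_per_patient = toy.groupby("pid").size()

# Fraction of missing entries per column, largest first.
missing_fraction = toy.isna().mean().sort_values(ascending=False)

print(rows_per_patient)
print(missing_fraction)
```

Replacing `toy` with `train_features` should give the same two summaries for the real data.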


In [5]:
# Looking at the data
train_features

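| Unnamed: 0 | pid | Time | Age | EtCO2 | PTT | BUN | Lactate | Temp | Hgb | HCO3 | ... | Alkalinephos | SpO2 | Bilirubin_direct | Chloride | Hct | Heartrate | Bilirubin_total | TroponinI | ABPs | pH |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 3 | 34.0 | NaN | NaN | 12.0 | NaN | 36.0 | 8.7 | 24.0 | ... | NaN | 100.0 | NaN | 114.0 | 24.6 | 94.0 | NaN | NaN | 142.0 | 7.33 |
| 1 | 1 | 4 | 34.0 | NaN | NaN | NaN | NaN | 36.0 | NaN | NaN | ... | NaN | 100.0 | NaN | NaN | NaN | 99.0 | NaN | NaN | 125.0 | 7.33 |
| 2 | 1 | 5 | 34.0 | NaN | NaN | NaN | NaN | 36.0 | NaN | NaN | ... | NaN | 100.0 | NaN | NaN | NaN | 92.0 | NaN | NaN | 110.0 | 7.37 |
| 3 | 1 | 6 | 34.0 | NaN | NaN | NaN | NaN | 37.0 | NaN | NaN | ... | NaN | 100.0 | NaN | NaN | NaN | 88.0 | NaN | NaN | 104.0 | 7.37 |
| 4 | 1 | 7 | 34.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | 100.0 | NaN | NaN | 22.4 | 81.0 | NaN | NaN | 100.0 | 7.41 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 227935 | 9999 | 8 | 85.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | 80.0 | NaN | NaN | 110.0 | NaN |
| 227936 | 9999 | 9 | 85.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | 83.0 | NaN | NaN | 123.0 | NaN |
| 227937 | 9999 | 10 | 85.0 | NaN | NaN | NaN | NaN | 36.0 | NaN | NaN | ... | NaN | 98.0 | NaN | NaN | NaN | 80.0 | NaN | NaN | 138.0 | NaN |
| 227938 | 9999 | 11 | 85.0 | NaN | NaN | NaN | NaN | NaN | 10.2 | NaN | ... | NaN | 98.0 | NaN | NaN | 31.0 | 75.0 | NaN | NaN | 125.0 | NaN |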


## Description of the train labels:

### SubTask 1) Produce (probabilistic) real-valued predictions in the interval [0, 1].
The corresponding columns containing the binary ground truth in train_labels.csv are: LABEL_BaseExcess, LABEL_Fibrinogen, LABEL_AST, LABEL_Alkalinephos, LABEL_Bilirubin_total, LABEL_Lactate, LABEL_TroponinI, LABEL_SaO2, LABEL_Bilirubin_direct, LABEL_EtCO2.

### SubTask 2) Binary: 0 or 1
The corresponding column containing the binary ground-truth in train_labels.csv is LABEL_Sepsis.

Note: for subtasks 1 and 2, you will need to produce predictions in the interval [0, 1]. How can you achieve this with an SVM? In the lecture, you have seen that the SVM prediction for binary classification is the sign of the decision function, e.g. $\operatorname{sign}(w^\top x)$ for a linear SVM. In order to produce real-valued predictions in the interval [0, 1] with an SVM, you can replace the sign function by the sigmoid function $\sigma(z) = 1 / (1 + e^{-z})$.
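The following is a minimal sketch of that idea, assuming the real-valued SVM scores are already available (for example from the `decision_function` method of a scikit-learn SVM, which is not used here to keep the example self-contained). The `scores` array is made up for illustration.

```python
import numpy as np

def sigmoid(z):
    """Map real-valued scores into [0, 1]; written to avoid overflow for large |z|."""
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z)
    pos = z >= 0
    # For non-negative scores, exp(-z) cannot overflow.
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    # For negative scores, use the equivalent form exp(z) / (1 + exp(z)).
    exp_z = np.exp(z[~pos])
    out[~pos] = exp_z / (1.0 + exp_z)
    return out

# Hypothetical SVM decision scores for five patients (illustrative values only).
scores = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])

hard_labels = (scores > 0).astype(int)  # what taking the sign would give, as 0/1
soft_predictions = sigmoid(scores)      # real-valued predictions in [0, 1]

print(hard_labels)
print(soft_predictions)
```

Because the sigmoid is monotonically increasing, the ranking of the patients by score is unchanged; only the scale of the output differs.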

### SubTask 3) Regression 
The corresponding columns containing the real-valued ground truth in train_labels.csv are: LABEL_RRate, LABEL_ABPm, LABEL_SpO2, LABEL_Heartrate.



**Additional notes**: Both train_features.csv and test_features.csv contain missing values ('nan' entries). An important part of this project is how to deal with such missing data (known as '**data imputation**' in the ML literature) and **feature engineering** (what features can you extract from measurements taken in consecutive hours, etc.)
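One possible way to combine feature engineering and imputation is sketched below: each patient's hourly rows are collapsed into per-patient summary statistics, and the remaining gaps are filled with medians computed on the training set. This is only one option among many, not necessarily the approach used in this project; the helper names `summarize_patients` and `impute_with_median` and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd

def summarize_patients(features, value_cols):
    """Collapse the hourly rows of each patient into one row of summary statistics."""
    grouped = features.groupby("pid")[value_cols]
    stats = grouped.agg(["mean", "min", "max"])
    # Flatten the (column, statistic) MultiIndex into names like "Heartrate_mean".
    stats.columns = [f"{col}_{stat}" for col, stat in stats.columns]
    # Number of hours in which each measurement was actually taken.
    counts = grouped.count().add_suffix("_n_measured")
    return stats.join(counts)

def impute_with_median(train_summary, other_summary):
    """Fill remaining NaNs with training-set medians (0.0 if a column is entirely NaN)."""
    medians = train_summary.median()
    filled_train = train_summary.fillna(medians).fillna(0.0)
    filled_other = other_summary.fillna(medians).fillna(0.0)
    return filled_train, filled_other

# Toy data: two hypothetical patients with three hourly rows each.
toy = pd.DataFrame({
    "pid": [1, 1, 1, 2, 2, 2],
    "Heartrate": [94.0, np.nan, 92.0, 80.0, 83.0, np.nan],
    "Lactate": [np.nan, 2.0, np.nan, np.nan, np.nan, np.nan],
})

summary = summarize_patients(toy, ["Heartrate", "Lactate"])
imputed, _ = impute_with_median(summary, summary)
print(imputed)
```

The `_n_measured` columns keep the information about how often a test was ordered, which would otherwise be lost once the missing values are filled in.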

Ideas:

For SubTasks 1 and 2: engineer new features, then fit an SVM.

For SubTask 3: regression; keep in mind that we need to output the values of the features at hour 13.


In [6]:
# Looking at the class labels
train_labels

| Unnamed: 0 | pid | LABEL_BaseExcess | LABEL_Fibrinogen | LABEL_AST | LABEL_Alkalinephos | LABEL_Bilirubin_total | LABEL_Lactate | LABEL_TroponinI | LABEL_SaO2 | LABEL_Bilirubin_direct | LABEL_EtCO2 | LABEL_Sepsis | LABEL_RRate | LABEL_ABPm | LABEL_SpO2 | LABEL_Heartrate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 12.1 | 85.4 | 100.0 | 59.9 |
| 1 | 10 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 17.8 | 100.6 | 95.5 | 85.5 |
| 2 | 100 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 16.5 | 88.3 | 96.5 | 108.1 |
| 3 | 1000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 19.4 | 77.2 | 98.3 | 80.9 |
| 4 | 10000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 12.6 | 76.8 | 97.7 | 95.3 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 18990 | 9993 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 17.1 | 69.8 | 100.0 | 110.7 |
| 18991 | 9995 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 17.6 | 97.3 | 97.8 | 59.2 |
| 18992 | 9996 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 17.3 | 66.3 | 96.9 | 100.3 |
| 18993 | 9998 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 18.8 | 81.5 | 96.9 | 99.4 |


## Save to output file

We need to export the output file as a zip archive. Here is the command we are asked to use:

In [7]:
train_labels.to_csv("../output/"+'prediction_1.zip', index=False, float_format='%.3f', compression='zip')
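The command above writes `train_labels` itself, which only checks the export format. A sketch of the same export applied to an actual prediction frame could look as follows; the test `pid` values, the placeholder predictions, the reduced set of label columns and the temporary output path are all assumptions for illustration.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical test patient ids and placeholder predictions (not real results).
test_pids = [0, 3, 5]
predictions = pd.DataFrame({
    "pid": test_pids,
    "LABEL_Sepsis": [0.1, 0.5, 0.9],            # values in [0, 1]
    "LABEL_Heartrate": [80.1234, 95.5, 61.25],  # regression output
})

# Write to a temporary directory here; in the notebook this would be "../output/".
out_path = os.path.join(tempfile.mkdtemp(), "prediction.zip")
predictions.to_csv(out_path, index=False, float_format="%.3f", compression="zip")

# Read the archive back to check that columns and values survived the round trip.
restored = pd.read_csv(out_path)
print(restored)
```

A real submission would contain every `LABEL_*` column of `train_labels`, in the same order, with one row per test patient.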