# Overview

PhysioNet is a widely-used repository of biomedical data and software. It enables researchers around the world to share and reuse resources that underpin clinical studies, promoting reproducible research and lowering barriers to data access. Each year, PhysioNet hosts a challenge in which researchers and hobbyists can compete to see who can create the most accurate models for a given problem, with previous challenges including sleep apnea detection and heart sound classification (https://physionetchallenges.org/).

For this project, you will train a basic machine learning model that is capable of solving the task from the 2019 PhysioNet Challenge on sepsis prediction. Early detection and treatment of sepsis are critical for improving sepsis outcomes: each hour of delayed treatment has been associated with roughly a 4–8% increase in mortality [1, 2]. The challenge gave people access to a dataset containing demographic information, vital signs, and lab reports. Your program will load this data, split it into separate groups for different purposes, and then train and evaluate your model.

[1] Kumar, A., Roberts, D., Wood, K.E., Light, B., Parrillo, J.E., Sharma, S., Suppes, R., Feinstein, D., Zanotti, S., Taiberg, L. and Gurka, D., 2006. Duration of hypotension before initiation of effective antimicrobial therapy is the critical determinant of survival in human septic shock. Critical care medicine, 34(6), pp.1589-1596.

[2] Seymour, C.W., Gesten, F., Prescott, H.C., Friedrich, M.E., Iwashyna, T.J., Phillips, G.S., Lemeshow, S., Osborn, T., Terry, K.M. and Levy, M.M., 2017. Time to treatment and mortality during mandated emergency care for sepsis. New England Journal of Medicine, 376(23), pp.2235-2244.


## What to Submit

Please go through the notebook and complete any of the code blocks marked `# TODO`. To get full credit for this assignment, we should be able to run your entire notebook without any errors. To check this, go to "Runtime" > "Run all" in the Google Colab menu. We realize this will take a while, but we want to make sure everything can be run.

# Part 1: Preparing for the Challenge

## Required Reading Materials

You should read the details of the challenge at https://physionet.org/challenge/2019/.
Pay particular attention to what kind of data you will be working with and what the objective of the challenge is.


## Optional Reading Materials

Machine learning is a vast and deep topic, and this assignment will only scratch the surface.
Although you should be able to complete this assignment strictly by following our instructions, it may help to read through some materials to familiarize yourself with important concepts.
There are hundreds of blogs, videos, and online courses that people use to learn about machine learning.
Here are a couple of our favorites:
* [Jason Mayes' Machine Learning 101](https://docs.google.com/presentation/d/1kSuQyW5DTnkVaZEjGYCkfOxvzCqGEFzWBy4e9Uedd9k)
* [Machine Learning Mastery](https://machinelearningmastery.com/start-here/)

If you would prefer to look at academic materials, we recommend going through Roger Grosse's course notes for CSC321: http://www.cs.toronto.edu/~rgrosse/courses/csc321_2018/.
These particular materials may be of interest:
* [CSC411 Intro](http://www.cs.toronto.edu/~rgrosse/courses/csc411_f18/slides/lec01-slides.pdf): What is machine learning, history of machine learning
* [CSC321 Lecture 2 Slides](http://www.cs.toronto.edu/~rgrosse/courses/csc321_2018/readings/L02\%20Linear\%20Regression.pdf): Fitting polynomials (useful visualization), generalization
* [CSC321 Lecture 2 Notes](http://www.cs.toronto.edu/~rgrosse/courses/csc321_2018/readings/L02\%20Linear\%20Regression.pdf): Generalization
* [CSC321 Lecture 3 Notes](http://www.cs.toronto.edu/~rgrosse/courses/csc321_2018/readings/L03\%20Linear\%20Classifiers.pdf): Section 1, intro paragraph of 2.

## Package Installation

We will need to take advantage of numerous software packages that will give us access to functions that will make our lives easier. You have already seen one of these packages (`numpy`) in a previous assignment, but the full list is below:
* [`numpy`](https://numpy.org/) provides efficient implementations of many math operations, particularly ones that involve arrays and matrices.
* [`pandas`](https://pandas.pydata.org) is an extremely useful library for working with data, especially when your data can be organized into tables (e.g., a `.csv` or `.xlsx` file).
* [`scikit-learn`](https://scikit-learn.org/stable/index.html) is a tool for data mining, data analysis and machine learning.
* [`imbalanced-learn`](https://github.com/scikit-learn-contrib/imbalanced-learn) contains functions to help you work with imbalanced data.
* [`cache-em-all`](https://pypi.org/project/cache-em-all/) saves the result of a function so that only the first call is slow. For example, a function that takes 5 minutes the first time it is called will take only a couple of seconds on later calls, because the saved result is loaded instead of being recomputed. This package was created by a TA from a previous iteration of the course.

`numpy`, `pandas`, and `scikit-learn` are some of the most popular packages for doing data science with Python, so getting familiar with them would be a good idea for both this assignment and any future projects you might do on your own.

If you are running your program on a local computer, you will need to install these packages yourself.
Google Colab will already have some of these packages pre-installed, but we will confirm this and install the missing packages by running the following command:

In [None]:
!pip install scikit-learn numpy pandas cache-em-all imbalanced-learn

This is not Python code, but rather a `bash` command that you would normally run in a local terminal.

# Part 2: Loading the Data

## Downloading the Data

One way you could get the data into your program would be to download the `.zip` file(s) provided by the challenge website, extract their contents, and then put those files in the working directory of your code. However, this is time-consuming and would require you to repeat the process if the data were to change. What we can do instead is run `bash` commands to directly download the data programmatically:

In [None]:
!wget -nc https://archive.physionet.org/users/shared/challenge-2019/training_setA.zip
!unzip -n training_setA.zip

Run the code block below to import all of the software packages we will want to use throughout our program:

In [None]:
import os
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GroupKFold, StratifiedKFold, train_test_split
from sklearn.metrics import precision_score, recall_score, accuracy_score
from imblearn.under_sampling import RandomUnderSampler
from cache_em_all import Cachable

DATA_DIR = "training"

## `DataFrames` in `Pandas`

The `pandas` library has a special data structure called a `DataFrame`. It is very similar to a table in that it has rows and columns, and each column can have a name assigned to it. We can create a `DataFrame` in many different ways, but for the sake of this example, we will manually create one with random values:

In [None]:
# This is an example
np.random.seed(0)
df = pd.DataFrame(np.random.rand(6,3), columns=['hemoglobin', 'bilirubin', 'uric acid'])

Similar to a dictionary, you can access specific columns within the `DataFrame` by putting a column name or a list of column names within square brackets:

In [None]:
# This is an example
df["bilirubin"] # Retrieves one column
df[["hemoglobin", "bilirubin"]] # Retrieves multiple columns

You can take this a step further to assign values to all the rows of a given column. For example, the following code converts the hemoglobin values from g/dL to g/L by multiplying them by a factor of 10:

In [None]:
# This is an example
df["hemoglobin"] = df["hemoglobin"] * 10

You can also create new columns with a default value for all of the rows.
In this example, we have created a new column for people's mood with the default value `"happy"`:

In [None]:
# This is an example
df["mood"] = "happy"

Each row is assigned a numerical **index** value. By default, each row's index is the same as its row number (i.e., the first row has an index of 0, the second row has an index of 1, etc.). However, this may not always be true since there may be situations when you need to either index your rows differently or shuffle your rows while keeping track of their original position. You can access specific rows of a `DataFrame` using either the label assigned to the row in the index (with `.loc[]`) or the row's position (with `.iloc[]`):

In [None]:
# This is an example
df.loc[0] # The row in df whose index label is 0
df.iloc[0] # The first row in df, regardless of its index label

For the purposes of this assignment, you should just be aware that each row is associated with a numerical index.
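
The difference between labels and positions only shows up once the two stop matching. The sketch below uses a hypothetical toy table (the `toy` and `flipped` names and the heart-rate values are made up for illustration) and reverses the row order so that each row keeps its original label but changes position:

```python
import pandas as pd

# Toy table (hypothetical values); rows start with index labels 0, 1, 2
toy = pd.DataFrame({"HR": [80, 95, 110]})

# Reverse the row order; every row keeps its original index label
flipped = toy.iloc[::-1]

by_label = flipped.loc[0, "HR"]    # the row whose index label is 0
by_position = flipped.iloc[0, 0]   # the row that is physically first

print(by_label, by_position)
```

After the reversal, the label `0` points at what is now the last row, while position `0` points at the row that used to be last.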

## Loading a Single File

We will first create a helper function called `load_single_file()` that will load a single file from the dataset folder that you just created.  These files are in the `.psv` format. Just like how a `.csv` file contains values separated by commas (`,`), a `.psv` file contains values separated by pipes (`|`). Your function should return a `DataFrame` after doing the following:

* Read in a file using the `pandas` function `read_csv()`. As its name suggests, this function typically expects values to be separated by commas; however, the optional `sep` argument enables us to specify the character that the file uses to separate values. Here is an example of this function being used for a `.psv` file: `df = pd.read_csv("training/p00001.psv", sep="|")`

* Create a new column called `patient` that contains the name of the file from which we retrieved this data. We will use the file name as a sort of patient ID to keep track of which rows belong to which patient.

* Create a column called `hour` that represents when each row was collected. You may recall from the challenge description that each row in the file represents an hour. This means that we can use the row number (or index) as the hour value for each row: `df["hour"] = df.index`
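
If you want to see how `sep="|"` behaves before touching the real files, here is a minimal sketch that parses a made-up record from a string. The column subset and values are hypothetical, and `io.StringIO` is used only so that the example does not depend on any file being present:

```python
import io
import pandas as pd

# Hypothetical two-hour record in pipe-separated format (not real patient data)
psv_text = "HR|Temp|SepsisLabel\n80|36.6|0\n85|NaN|0\n"

# io.StringIO lets read_csv treat the string like an open file
record = pd.read_csv(io.StringIO(psv_text), sep="|")

# With the default index, the row number doubles as the hour
record["hour"] = record.index

print(record)
```

Note that the literal text `NaN` is parsed as a missing value, which is how gaps appear in the challenge files.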

In [None]:
def load_single_file(file_path):
    # TODO: Write your code here

In [None]:
# Test your function here

## Loading All of the Data

Now that we can read one file, we will write a helper function called `load_data()` that will read all of the data files and put them into a single `DataFrame`. One way we could do this is by continuously appending rows to a single `DataFrame`, but that is very slow. Instead, what we will do is read each file in our dataset, append the resulting `DataFrame` to a list, and then use the `pandas` function `concat()` to concatenate the list of `DataFrames` into a single one. Here are the steps your function should follow:

* Get a list of all the filenames in the dataset using the `get_data_files()` function we have provided.
* Call the `load_single_file()` helper function you wrote earlier on each filename in that list and append the result to a list.
* Use the function `pd.concat()` to combine the `DataFrames`. Note that this is only possible because each `DataFrame` in the list has the same columns.
* Pre-process the data using the `clean_data()` function we have provided.
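
The list-then-concatenate pattern can be tried on toy data first. In this sketch the two small tables and their values are hypothetical stand-ins for per-patient files:

```python
import pandas as pd

# Two hypothetical per-patient tables with identical columns
frames = [
    pd.DataFrame({"HR": [80, 82], "patient": "a"}),
    pd.DataFrame({"HR": [95], "patient": "b"}),
]

# A single concat call at the end combines everything in one pass
combined = pd.concat(frames)
print(combined.index.tolist())

# Index labels repeat across patients until they are reset
renumbered = combined.reset_index(drop=True)
print(renumbered.index.tolist())
```

The repeated index labels in `combined` are the reason `clean_data()` begins by resetting the index.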

Once you are sure this function is working correctly, you can uncomment the `@Cachable("data.csv")` line above the function. The next time you run the function, its result will be saved and subsequent calls to the function will load the saved data. If you realize there was a mistake in your implementation and you have to fix it, you will need to delete the `cache` folder; otherwise, your fixed code will not be called and the old saved data will still be returned.

In [None]:
# Names of all columns in the data that contain physiological data
physiological_cols = ['HR', 'O2Sat', 'Temp', 'SBP', 'MAP', 'DBP', 'Resp', 'EtCO2',
       'BaseExcess', 'HCO3', 'FiO2', 'pH', 'PaCO2', 'SaO2', 'AST', 'BUN',
       'Alkalinephos', 'Calcium', 'Chloride', 'Creatinine', 'Bilirubin_direct',
       'Glucose', 'Lactate', 'Magnesium', 'Phosphate', 'Potassium',
       'Bilirubin_total', 'TroponinI', 'Hct', 'Hgb', 'PTT', 'WBC',
       'Fibrinogen', 'Platelets']

# Names of all columns in the data that contain demographic data
demographic_cols = ['Age', 'Gender', 'Unit1', 'Unit2', 'HospAdmTime', 'ICULOS']

# The combination of physiological and demographic data is what we will use as features in our model
feature_cols = physiological_cols + demographic_cols

# The name of the column that contains the value we are trying to predict
label_col = "SepsisLabel"

# Pre-calculated means and standard deviation of all physiological and demographic columns. We will use this to normalize
# data using their z-score. This isn't as important for simpler models such as random forests and decision trees,
# but can result in significant improvements when using neural networks
physiological_mean = np.array([
        83.8996, 97.0520,  36.8055,  126.2240, 86.2907,
        66.2070, 18.7280,  33.7373,  -3.1923,  22.5352,
        0.4597,  7.3889,   39.5049,  96.8883,  103.4265,
        22.4952, 87.5214,  7.7210,   106.1982, 1.5961,
        0.6943,  131.5327, 2.0262,   2.0509,   3.5130,
        4.0541,  1.3423,   5.2734,   32.1134,  10.5383,
        38.9974, 10.5585,  286.5404, 198.6777])
physiological_std = np.array([
        17.6494, 3.0163,  0.6895,   24.2988, 16.6459,
        14.0771, 4.7035,  11.0158,  3.7845,  3.1567,
        6.2684,  0.0710,  9.1087,   3.3971,  430.3638,
        19.0690, 81.7152, 2.3992,   4.9761,  2.0648,
        1.9926,  45.4816, 1.6008,   0.3793,  1.3092,
        0.5844,  2.5511,  20.4142,  6.4362,  2.2302,
        29.8928, 7.0606,  137.3886, 96.8997])
demographic_mean = np.array([60.8711, 0.5435, 0.0615, 0.0727, -59.6769, 28.4551])
demographic_std = np.array([16.1887, 0.4981, 0.7968, 0.8029, 160.8846, 29.5367])

def get_data_files():
    return [os.path.join(DATA_DIR, x) for x in sorted(os.listdir(DATA_DIR)) if int(x[1:-4]) % 5 > 0]

def clean_data(data):
    data.reset_index(inplace=True, drop=True)

    # Normalizes physiological and demographic data using z-score.
    data[physiological_cols] = (data[physiological_cols] - physiological_mean) / physiological_std
    data[demographic_cols] = (data[demographic_cols] - demographic_mean) / demographic_std

    # Maps invalid numbers (NaN, inf, -inf) to numbers (0, really large number, really small number)
    data[feature_cols] = np.nan_to_num(data[feature_cols])

    return data
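
To see what the two steps in `clean_data()` do to a missing value, consider this small sketch. The readings and the mean and standard deviation are illustrative numbers, not the pre-calculated statistics above:

```python
import numpy as np

# Hypothetical heart-rate readings with one missing value
hr = np.array([70.0, np.nan, 100.0])
hr_mean, hr_std = 85.0, 15.0  # illustrative statistics, not the real ones

z = (hr - hr_mean) / hr_std   # the missing value stays NaN
filled = np.nan_to_num(z)     # NaN becomes 0.0

print(filled)
```

Because a z-score of 0 corresponds to the mean, replacing `NaN` with 0 after normalizing has the same effect as filling the gap with the mean used for normalization.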

In [None]:
# @Cachable("data.csv")
def load_data():
    # TODO: Write your code here

In [None]:
# Test your function here

## Data Exploration

Once you have loaded the data, run the following commands to understand some high-level characteristics of your dataset:

In [None]:
# This is an example
df = load_data()
print("Column names:", df.columns)
print("Number of samples:", len(df))
print("Distribution of sepsis vs. non-sepsis samples:", df["SepsisLabel"].value_counts())

Moving forward, we will treat the columns listed in `feature_cols` as our **features** (the inputs to our model) and `SepsisLabel` as our **label** (the desired output of our model). Because we are working with data that has labels, we will be using a type of learning called **supervised learning**. More specifically, we are trying to predict an output $y$ given an input $X$ and we have examples of $(X, y)$ pairs that we can use to train our system. This is in contrast to **unsupervised learning** where we would only have $X$ at our disposal.

There are two types of tasks within supervised learning: **regression** and **classification**. Regression involves continuous labels like white blood cell count or the price of a house. Classification involves discrete labels like healthy/sick or cat/dog/bird; the different possible values that a discrete label can take for a given problem are known as **classes**. We will be doing binary (2-class) classification in this assignment.

# Part 3: Machine Learning Basics

## Splitting the Data

We want the model we train to be able to produce accurate predictions on data it has not seen before. One way we can do this is by splitting our data into two sets: (1) a **training dataset** that we use to train the model, and (2) a **test set** that will only be used to evaluate the model. The `scikit-learn` library provides a function called `train_test_split()` that splits data into train and test splits for us:

In [None]:
# This is an example
train_df, test_df = train_test_split(df, test_size=0.2)

The parameter `test_size` is the proportion of data that should be used for the test set. In this example, 20% of the data goes to the test set `test_df`, which means the remaining 80% goes to the training set `train_df`.
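
A quick way to convince yourself of the proportions is to split a tiny table and count the rows. The table below is hypothetical, and `random_state` is fixed here only so that the sketch gives the same split every time:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical table with 10 rows
toy = pd.DataFrame({"x": range(10), "y": [0, 1] * 5})

train_part, test_part = train_test_split(toy, test_size=0.2, random_state=0)

print(len(train_part), len(test_part))
# No row ends up in both parts
print(set(train_part.index) & set(test_part.index))
```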

## Training a Classifier

Now that we have training and test sets, we are finally ready to create a classifier. There are many different types of classifiers available in `scikit-learn`, but we will start with a decision tree classifier. Look through the [online documentation](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) for `DecisionTreeClassifier` and complete the following steps for `train_simple()`:

* Separate the features and the labels for both `train_df` and `test_df`. For your convenience, the code you ran earlier defines variables called `feature_cols` and `label_col`.

* Create a new instance of the `DecisionTreeClassifier` with `class_weight="balanced"`.

* Pass the training data through the classifier so it can identify the best structure and model weights that will maximize its accuracy; this process is often known as **fitting**.

* Use your trained model to **predict** labels for the training features.

* Use the `evaluate()` function we have provided to print out accuracy metrics that explain how often the predicted labels matched the expected labels.

* Repeat the previous two steps using the test dataset.

In [None]:
def evaluate(actual, predicted, prefix=""):
  """
  Compares the predicted labels to the actual labels and prints out multiple metrics
  of classification performance: precision, recall, and overall accuracy

  actual corresponds to the ground truth labels that were in the collected dataset
  predicted corresponds to the labels that were predicted by the model
  prefix is a string you can use to specify which data or model corresponds to the given analysis

  """
  precision = precision_score(actual, predicted)
  recall = recall_score(actual, predicted)
  accuracy = accuracy_score(actual, predicted)

  print("%s Precision: %.3f%%, Recall: %.3f%%, Accuracy: %.3f%%" % (prefix, precision * 100, recall * 100, accuracy * 100))
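
It can help to compute the three metrics by hand once. The sketch below uses made-up labels and plain Python, with 1 treated as the positive class as in `SepsisLabel`:

```python
# Hypothetical ground truth and predictions (1 = sepsis, 0 = no sepsis)
actual    = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0, 0, 0]

pairs = list(zip(actual, predicted))
tp = sum(a == 1 and p == 1 for a, p in pairs)  # true positives
fp = sum(a == 0 and p == 1 for a, p in pairs)  # false positives
fn = sum(a == 1 and p == 0 for a, p in pairs)  # false negatives
tn = sum(a == 0 and p == 0 for a, p in pairs)  # true negatives

precision = tp / (tp + fp)         # of the predicted positives, how many were right
recall = tp / (tp + fn)            # of the actual positives, how many were found
accuracy = (tp + tn) / len(pairs)  # overall fraction of correct predictions

print(precision, recall, accuracy)
```

On these labels, a model that always predicts 0 would still reach 62.5% accuracy while having 0% recall, which is why accuracy alone can be misleading when one class is rare.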

In [None]:
def train_simple(data, feature_cols, label_col):
    # TODO: Write your code here

In [None]:
# Test your function here

## Making Your Results Repeatable

What happens if you run `train_simple()` multiple times? The results are not consistent because there are multiple parts of our code that rely on randomness: how we split the data into training and test sets, how the model fits itself to training data, etc. Inconsistent results make it hard to replicate our work, so we should have a way to be able to produce the same result every time.

Many random number generators are not truly random; they are actually pseudorandom in that they generate numbers based on a **seed**. Therefore, if we set the value of the seed, we can control the sequence of random numbers the generator produces. We can do this by using the following lines of code:

In [None]:
# This is an example
seed = 9001
np.random.seed(seed)

You can set the value of your random seed to be whatever you want; as long as you keep the seed the same, your program should produce the same results. These lines of code should be run before any part of your program that involves randomness. We recommend putting them next to the `import` statements from the first part of the assignment. We also need to pass this seed to the `train_test_split()` function:

In [None]:
# This is an example
train_df, test_df = train_test_split(df, test_size=0.2, random_state=seed)

Re-run your `train_simple()` function multiple times to make sure it produces the same results each time.
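
To see the effect of the seed in isolation, here is a small sketch that draws random numbers before and after resetting it. The variable names are only for illustration:

```python
import numpy as np

np.random.seed(9001)
first = np.random.rand(3)

np.random.seed(9001)        # resetting the seed restarts the sequence
second = np.random.rand(3)

third = np.random.rand(3)   # continues the sequence, so the values differ

print(np.array_equal(first, second))
print(np.array_equal(second, third))
```

This is also why the seed should be set once, before the random steps: every call that uses the generator advances the sequence.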

# Part 4: Improving Your Pipeline

## Dealing with Overfitting

How well does the model work on the training dataset? How about on the test dataset? The model likely worked better on the training data for a variety of reasons. The test dataset may have some rows that are completely different from what was used to train the model, which would lead to a higher chance of misclassification. The model might also be memorizing the training data so well that it lacks the flexibility to generalize to unseen data that is roughly similar; we call this phenomenon **overfitting**.

One way of addressing overfitting is by tuning the complexity of the model. A model that is too simple may not have enough flexibility to capture the complex nature of the dataset, while a model that is too complicated may be so flexible that it can memorize the dataset; the best model will often lie between these two extremes. For the decision tree classifier, the model complexity is dictated by the width and depth of the decision tree, which we can control as follows:

In [None]:
# This is an example
clf = DecisionTreeClassifier(class_weight="balanced", max_depth=20, max_leaf_nodes=20)

Try different values for the model complexity parameters in your `train_simple()` implementation. We suggest varying those parameters in increments of 10 and looking for general trends; otherwise, you may overoptimize the complexity of your model for your particular data split.

## Dealing with Class Imbalance

Recall that there is a significant difference between the number of positive (`SepsisLabel = 1`) and negative examples (`SepsisLabel = 0`) in our dataset. When a dataset is significantly imbalanced, classifiers may become biased because there are too few examples of a particular class.

Within the `DecisionTreeClassifier`, we can set `class_weight="balanced"` to tell the classifier to weigh classes according to how many examples there are in the training data. For example, if there are twice as many negative examples as positive examples, the classifier will consider an incorrect prediction on a positive example twice as bad as an incorrect prediction on a negative example while training.
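
The `scikit-learn` documentation describes the balanced weights as `n_samples / (n_classes * np.bincount(y))`. The sketch below applies that formula by hand to a made-up label vector with twice as many negatives as positives:

```python
import numpy as np

# Hypothetical labels: 8 negative examples and 4 positive examples
toy_labels = np.array([0] * 8 + [1] * 4)

counts = np.bincount(toy_labels)                  # examples per class
n_samples, n_classes = len(toy_labels), len(counts)

weights = n_samples / (n_classes * counts)        # one weight per class

print(weights)
print(weights[1] / weights[0])
```

The positive class ends up with twice the weight of the negative class, matching the two-to-one imbalance.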

Another way to deal with class imbalance is by adjusting how the data is sampled. In this case, we are going to **undersample** the data, which means that we are going to keep all of the data in the minority class and decrease the size of the majority class. The alternative would be **oversampling**, which means that we would keep the size of the majority class and repeat examples in the minority class. In your `train_simple()` implementation, use the `RandomUnderSampler` from `imbalanced-learn` to undersample the training data before you fit your model:

In [None]:
# This is an example
rus = RandomUnderSampler(random_state=seed)
X_train, y_train = rus.fit_resample(X_train, y_train)
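
To make the idea of undersampling concrete, here is a hand-rolled sketch using only `numpy`. It illustrates the concept and is not how `RandomUnderSampler` is implemented; the toy data, the variable names, and the assumption that label 1 is the minority class are all for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 8 negative rows and 2 positive rows
toy_y = np.array([0] * 8 + [1] * 2)
toy_X = np.arange(len(toy_y)).reshape(-1, 1)

minority_idx = np.flatnonzero(toy_y == 1)
majority_idx = np.flatnonzero(toy_y == 0)

# Keep every minority row and an equally sized random subset of majority rows
kept_majority = rng.choice(majority_idx, size=len(minority_idx), replace=False)
keep = np.sort(np.concatenate([minority_idx, kept_majority]))

X_small, y_small = toy_X[keep], toy_y[keep]
print(np.bincount(y_small))
```

Only the training data should be resampled; the test data should keep its original class balance so that the evaluation reflects what the model would actually encounter.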

## Better Cross Validation

In the previous section, we split the data into training and test sets to see how well our model would generalize to unseen data. However, those splits were not as independent as they should be. The `train_test_split()` function randomly splits the rows in the dataset, but multiple rows belong to the same patient. This means that data from patient $P_1$ can appear in both the training and test datasets.
If that happens, the model is essentially "peeking at the answers" ahead of time, so even if a model has high accuracy on the test dataset, we cannot argue that it will work for other unseen patients.

To fix this issue, we will split our data using a variant of **$k$-fold cross-validation**. A typical $k$-fold cross-validation procedure goes as follows:

* Rather than splitting the data into a single pair of training and testing sets, we split the data into $k$ sets (called **folds**) of equal size.
* For each fold, we do the following:

  * We select one fold to hold out as the test set and use the remaining $k-1$ folds collectively as the training set.
  * We fit a model on the training set and evaluate on the test set as we would normally do.
  * We save the prediction results for the training and test datasets, but we discard the model.

* At this point, every fold will have been the test set exactly once, which means we will have a test prediction for each sample. We can therefore calculate the performance of our modeling approach the same way as we did before.
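
The bookkeeping behind this procedure can be sketched with indices alone. The example below assumes 10 samples and 5 folds, splits the sample indices in order without shuffling, and counts how often each sample lands in a test fold; no model is trained:

```python
import numpy as np

n_samples, k = 10, 5
indices = np.arange(n_samples)
folds = np.array_split(indices, k)   # k folds of (nearly) equal size

times_tested = np.zeros(n_samples, dtype=int)
for i, test_idx in enumerate(folds):
    # Every fold except fold i forms the training set
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    # A model would be fit on train_idx and evaluated on test_idx here
    times_tested[test_idx] += 1

print(times_tested)
```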

This still does not solve our problem of maintaining independence, which is why we need to use a variant of $k$-fold cross-validation. We will group our data according to the `patient` column before we split it into folds so that all rows from the same patient land in the same fold. `scikit-learn` has an object called `GroupKFold` that will do this for us:

In [None]:
# This is an example
X = data[feature_cols]
y = data[label_col]
group = data[stratify_col]
kf = GroupKFold(n_splits=5)
kf.split(X, y, group)
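
To check that grouping behaves as described, the following sketch builds a hypothetical dataset of 5 patients with 4 rows each and counts how many patients appear on both sides of each split. The patient IDs, features, and labels are placeholders:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical data: 5 patients with 4 hourly rows each
toy_patients = np.repeat(["p1", "p2", "p3", "p4", "p5"], 4)
toy_X = np.arange(len(toy_patients)).reshape(-1, 1)
toy_y = np.zeros(len(toy_patients), dtype=int)

shared = []
for train_idx, test_idx in GroupKFold(n_splits=5).split(toy_X, toy_y, toy_patients):
    # Patients that appear in both the training rows and the test rows
    overlap = set(toy_patients[train_idx]) & set(toy_patients[test_idx])
    shared.append(len(overlap))

print(shared)
```

A count of zero for every split means that no patient contributes rows to both the training and test sets of the same split.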

We have implemented the skeleton for a stratified 5-fold cross-validation procedure in `train_stratified()`. Your job is to complete this function by writing code that trains a classifier within the loop. You may re-use your `train_simple()` function as you see fit.

When this function is complete and you run it, you should expect to see a drop in accuracy since you are no longer "cheating" during model training; nevertheless, this is a more accurate representation of how well your approach will be able to generalize for unseen patients.

In [None]:
def train_stratified(data, feature_cols, label_col, stratify_col):
    X = data[feature_cols]
    y = data[label_col]
    group = data[stratify_col]

    train_pred = []
    train_actual = []

    test_pred = []
    test_actual = []

    kf = GroupKFold(n_splits=5)
    for train_idx, test_idx in kf.split(X, y, group):
        X_train, y_train = X.iloc[train_idx], y.iloc[train_idx]
        X_test, y_test = X.iloc[test_idx], y.iloc[test_idx]

        # TODO: Implement your classifier here

        train_pred.extend(clf.predict(X_train))
        train_actual.extend(y_train)

        test_pred.extend(clf.predict(X_test))
        test_actual.extend(y_test)

    evaluate(train_actual, train_pred, "Train")
    evaluate(test_actual, test_pred, "Test")

In [None]:
# Test your function here

# Part 5: Putting It All Together

Now that you have written and iterated upon a pipeline for model training, it is time for you to find a configuration that maximizes cross-validated test accuracy. There are numerous aspects of the pipeline that you can tweak, but here are the easiest ones to start with:

* **Model architecture:** We have been using the `DecisionTreeClassifier` in `scikit-learn`, but there are many others at your disposal. We have imported a few others that you can try — `KNeighborsClassifier`, `RandomForestClassifier`, and `MLPClassifier` — but you can also explore other ones on `scikit-learn`'s website.

* **Model parameters:** We examined how adjusting the model complexity can help us avoid overfitting. As you try different models, be sure to also explore different parameters.

* **Number of folds:** The skeleton we wrote for `train_stratified()` performs 5-fold cross-validation because of the parameter we passed to `GroupKFold`. What happens when you increase the number of folds to 10? What happens when you decrease the number of folds to 2 or 3?

You could also improve your pipeline by looking into feature pre-processing, feature selection, and automated hyperparameter tuning. However, these topics are outside of the scope of this course. If you are interested in learning more, feel free to reach out to the instructors!

The final deliverable for this assignment is a report that explains the different configurations you tried, the accuracy those configurations achieved, and written explanations of why you believe those results happened. Since there are so many configurations to choose from and each person may be splitting the data differently, we are not expecting everyone to achieve a definitive correct answer. What we are looking for is careful experimentation and viable explanations for the results of those experiments. You may find it helpful to use tables or graphs to systematically present your accuracy numbers. The report should be single-column, single-spaced, and no longer than 3 pages including tables and graphs. Save the report as either `report.pdf` or `report.docx`.

In [None]:
# Feel free to run your experiments here