# Predicting COVID-19 ICU Admission Using a Neural Network

## License

The original dataset is licensed under **Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)**.

The dataset is free to **share** and **adapt** under the following terms:

* credit to the original article is given and any changes are indicated (nothing has been changed as of 30/11/2021)
* material is not used for commercial purposes 

Credit:

* original material is published on [Kaggle](https://www.kaggle.com/) and accessible [here](https://www.kaggle.com/S%C3%ADrio-Libanes/covid19).

## Intro

This repository contains the source code and report for a seminar paper written for the *Machine Learning* course in the winter semester 2021/2022 at the Faculty of Computer and Information Science, University of Ljubljana.

The dataset contains anonymized data from Hospital Sírio-Libanês, São Paulo and Brasilia. 

### Context (*copied from the above-mentioned Kaggle article*)
COVID-19 pandemic impacted the whole world, overwhelming healthcare systems - unprepared for such intense and lengthy request for ICU beds, professionals, personal protection equipment and healthcare resources.
Brazil recorded first COVID-19 case on February 26 and reached community transmission on March 20.

## Task

Predict admission to the ICU of confirmed COVID-19 cases.
Based on the data available, is it feasible to predict which patients will need intensive care unit support?
The aim is to provide tertiary and quaternary hospitals with the most accurate answer, so ICU resources can be arranged or patient transfer can be scheduled (*copied from Kaggle article*).

## Dataset

The data has been cleaned and scaled column by column with a Min-Max scaler. In total, there are 54 features; where pertinent, a feature is expanded into its mean, median, max, min, diff and relative diff.
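Min-max scaling maps each column linearly onto a fixed interval. The sketch below only illustrates the idea: the target interval `[0, 1]` is an assumption (the interval used for this dataset is not stated here), and `min_max_scale` is a hypothetical helper, not part of this repository.

```python
import numpy as np


def min_max_scale(column, low=0.0, high=1.0):
    """Linearly rescale a 1-D array so its minimum maps to `low` and its maximum to `high`."""
    column = np.asarray(column, dtype=float)
    span = column.max() - column.min()
    if span == 0:
        # A constant column carries no information; map it to the lower bound.
        return np.full_like(column, low)
    return low + (column - column.min()) * (high - low) / span


# Hypothetical raw measurements for a single feature
scaled = min_max_scale([36.5, 37.0, 39.0])
```

Because the scaling is applied per column, every feature ends up on the same scale regardless of its original unit.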

### Available Data

The features in the dataset fall into four groups.



| Group | Number of features |
| ----- | :------------------: |
| Demographics | 3 |
| Grouped diseases | 9 |
| Blood results | 36 |
| Vital signs | 6 |
| **Total**| **54** |



### Window Concept

Data for each patient has been grouped in five windows, each containing diagnostic results from the respective time window.



| Window      | Description |
| ----------- | ----------- |
| 0-2         | 0 to 2 hours after admission |
| 2-4         | 2 to 4 hours after admission |
| 4-6         | 4 to 6 hours after admission |
| 6-12        | 6 to 12 hours after admission |
| Above 12    | More than 12 hours after admission |



The Kaggle article warns against using data from any window in which the patient is already in the ICU (the **ICU** flag is 1), so we need to manipulate the data a little. Consider the following two timelines:



| Window      | Patient admitted to ICU | Data can be used for modelling | Target variable |
| ----------- | :-----------: | :-----------: | :-----------: |
| 0-2         | False | True | 1 |
| 2-4         | False | True | 1 |
| 4-6         | False | True | 1 |
| 6-12        | True | False |  |
| Above 12    | True | False |  |



The patient is admitted to the ICU in the fourth time window (6-12 hours after the initial non-ICU admission). This means we can use data from the first three time windows, with the target variable set to 1 (the patient is eventually admitted to the ICU ward).



| Window      | Patient admitted to ICU | Data can be used for modelling | Target variable |
| ----------- | :-----------: | :-----------: | :-----------: |
| 0-2         | False | True | 0 |
| 2-4         | False | True | 0 |
| 4-6         | False | True | 0 |
| 6-12        | False | True | 0 |
| Above 12    | False | True | 0 |



The patient is never admitted to the ICU, so we can use all time windows with target variable 0.

### Null Values

The original Kaggle article gives the following guidance on missing values:

```
It is reasonable to assume that a patient who does not have a measurement recorded in a time window is clinically stable, potentially presenting vital signs and blood labs similar to neighboring windows. Therefore, one may fill the missing values using the next or previous entry. Attention to multicollinearity and zero variance issues in this data when choosing your algorithm.
```

Following this guidance, we fill missing values from neighbouring windows.
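The difference between the two fill directions is easiest to see on a toy column. The sketch below uses made-up values and a hypothetical column name `VALUE`; it is not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy measurements for five consecutive windows; NaN marks a missing value
toy = pd.DataFrame({"VALUE": [np.nan, 0.2, np.nan, 0.4, np.nan]})

forward = toy.ffill()           # copy the previous entry downwards
backward = toy.bfill()          # copy the next entry upwards
combined = toy.ffill().bfill()  # forward fill first, then close the leading gap

# Count the values that are still missing after each strategy
remaining = {
    "forward": int(forward["VALUE"].isna().sum()),
    "backward": int(backward["VALUE"].isna().sum()),
    "combined": int(combined["VALUE"].isna().sum()),
}
```

A forward fill cannot fill a gap at the very start of a column and a backward fill cannot fill one at the very end, so it is worth verifying that no null values remain after filling.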

## Import

Before running the code, load all required modules by executing the code block below. The block also enables Jupyter's auto-reloading feature, so you don't need to re-import modules whenever you change them.

In [9]:
# In order to import from the python file without hassle, we add the current
# directory to the python path
import sys; sys.path.append(".")

# Auto-reload
%load_ext autoreload
%autoreload 2

# Utilities module
import src.utilities as util

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## Data Preparation

Instructions on how to set up the environment are specified in the [README](https://github.com/JakobSkornik/covid19-admission/blob/main/README.md) file.

The original dataset is provided in a single *xlsx* file. Let us first import the dataset and store it in a single pandas DataFrame.

In [None]:
# Load Excel into DataFrame
dataset = util.load_xlsx("data/Kaggle_Sirio_Libanes_ICU_Prediction.xlsx")

# Print first 10 elements of DataFrame
dataset.head(10)

Let's take a look at the datatypes present in this dataset.

In [None]:
dataset.dtypes.unique()

Next, we add the target variable to every row. If there is at least one positive value in the **ICU** column for a patient, the target variable for that patient is 1.

First we obtain the target variable for every patient.

In [None]:
# Create a df with PATIENT_ID/TARGET columns
patient_target_df = util.get_target_variables(dataset)
patient_target_df.head(10)

Now we append the target variable to each row of the **dataset** DataFrame.

In [None]:
dataset = util.append_target_variable(dataset, patient_target_df)
dataset.head()

Now we can remove the rows where the **ICU** column contains the value 1.

In [None]:
dataset = dataset[dataset.ICU != 1]
dataset.head()
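The labelling and filtering steps above can be reproduced on a toy frame with plain pandas. This is only a sketch of the idea under stated assumptions: the helpers in `src.utilities` may be implemented differently, and the toy values are made up.

```python
import pandas as pd

# Two hypothetical patients: patient 0 enters the ICU in the third window,
# patient 1 is never admitted to the ICU.
toy = pd.DataFrame({
    "PATIENT_VISIT_IDENTIFIER": [0, 0, 0, 1, 1, 1],
    "WINDOW": ["0-2", "2-4", "4-6", "0-2", "2-4", "4-6"],
    "ICU": [0, 0, 1, 0, 0, 0],
})

# A patient's target is 1 if ICU is 1 in at least one of that patient's rows
toy["TARGET"] = toy.groupby("PATIENT_VISIT_IDENTIFIER")["ICU"].transform("max")

# Keep only the windows recorded before the ICU admission
usable = toy[toy["ICU"] != 1]
```

Patient 0 keeps two rows with target 1, and patient 1 keeps all three rows with target 0, which matches the two timelines in the Window Concept section.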

Next, we remove the metadata column **PATIENT_VISIT_IDENTIFIER**, which contains the patient ID, and the **ICU** column, since every remaining row has the ICU value 0.

In [None]:
dataset = dataset.drop(["PATIENT_VISIT_IDENTIFIER", "ICU"], axis=1)
dataset.head()

We still have to deal with null values. As specified above, we will fill null values with neighbouring values. We can take advantage of the **pd.DataFrame.fillna** method.

In [None]:
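# bfill()/ffill() are equivalent to fillna(method=...), which newer pandas versions deprecate
dataset_backward_fill = dataset.bfill()
dataset_forward_fill = dataset.ffill()

# True only if no null value is left anywhere in the DataFrame
backward_filled = not dataset_backward_fill.isna().any().any()
forward_filled = not dataset_forward_fill.isna().any().any()

backward_filled, forward_filled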

Both methods successfully filled the dataset. We can select either one of those, or save them to compare results between them later on.

In [None]:
dataset = dataset_backward_fill
dataset.head()

There are still whitespace characters in column names, so we replace them with underscores.

In [None]:
dataset.columns = dataset.columns.str.replace(" ", "_")
dataset.head()

We also need to encode the **AGE_PERCENTIL** column. We can see that there are 10 distinct values.

In [None]:
dataset.AGE_PERCENTIL.unique()

We can use the pandas.get_dummies method, which maps a single column with n possible values into n binary columns. In each row, the column that corresponds to the original value contains 1 and all other columns contain 0.

For example the following column:

| Value |
| :----: |
| a |
| b |
| c |
| a |
| a |

will get mapped into:

| Value_a | Value_b | Value_c |
| :---: | :---: | :---: |
| 1 | 0 | 0 |
| 0 | 1 | 0 |
| 0 | 0 | 1 |
| 1 | 0 | 0 |
| 1 | 0 | 0 |

In [None]:
dataset = util.get_dummies(dataset, cols=["AGE_PERCENTIL"])
dataset.head()
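The toy mapping from the tables above can be reproduced directly with `pandas.get_dummies`. The sketch assumes that `util.get_dummies` is a thin wrapper around this call, which may not be the case.

```python
import pandas as pd

# The example column from the table above
toy = pd.DataFrame({"Value": ["a", "b", "c", "a", "a"]})

# One binary column per distinct value; dtype=int gives 0/1 instead of booleans
encoded = pd.get_dummies(toy, columns=["Value"], dtype=int)
```

The new column names are built from the original column name and the value, so `Value` becomes `Value_a`, `Value_b` and `Value_c`.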

The final non-numeric column is **WINDOW**. Using this column, we can create 6 different datasets: there are five windows, so each distinct value yields a separate dataset, plus an additional dataset with all time windows. The example below uses window 2-4. Keep in mind that a dataset for a given window must also include all previous windows.

After we extract a dataset for a desired window, **WINDOW** column can be dropped.

A helper method then creates all datasets and stores them, together with the required metadata, in a new object.

In [None]:
window_24_dataset = dataset[(dataset.WINDOW == "0-2") | (dataset.WINDOW == "2-4")]
window_24_dataset = window_24_dataset.drop("WINDOW", axis=1)
window_24_dataset.head()

Let's make sure that the datatypes of the curated dataset are all numeric.

In [None]:
window_24_dataset.dtypes.unique()

A script repeats this data preparation for every time window and for each fill method separately. The resulting dictionary is structured as follows:

* **datasets**: *dict*
  * **ffill_datasets**: *dict*
    * **window_0_2**: *pd.DataFrame*
    * **window_2_4**: *pd.DataFrame*
    * **window_4_6**: *pd.DataFrame*
    * **window_6_12**: *pd.DataFrame*
    * **window_all**: *pd.DataFrame*
  * **bfill_datasets**: *dict*
    * . . .
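One branch of such a dictionary could be assembled along the following lines. This is a hedged sketch, not the implementation of `util.get_datasets`: the helper names `cumulative_window` and `build_window_datasets` are hypothetical, and the spelling of the last window label (`ABOVE_12`) is an assumption.

```python
import pandas as pd

# Window labels in chronological order (the spelling of the last label is assumed)
WINDOWS = ["0-2", "2-4", "4-6", "6-12", "ABOVE_12"]


def cumulative_window(df, last_window):
    """Keep rows from `last_window` and every earlier window, then drop WINDOW."""
    keep = WINDOWS[: WINDOWS.index(last_window) + 1]
    return df[df["WINDOW"].isin(keep)].drop(columns="WINDOW")


def build_window_datasets(df):
    """Return one cumulative dataset per window plus a dataset with all windows."""
    result = {}
    for window in WINDOWS[:-1]:
        key = "window_" + window.replace("-", "_")
        result[key] = cumulative_window(df, window)
    result["window_all"] = df.drop(columns="WINDOW")
    return result


# Toy frame with one row per window and a hypothetical feature column X
toy = pd.DataFrame({
    "WINDOW": WINDOWS,
    "X": [0.1, 0.2, 0.3, 0.4, 0.5],
    "TARGET": [0, 0, 0, 0, 0],
})
toy_datasets = build_window_datasets(toy)
```

Calling the builder once on the forward-filled frame and once on the backward-filled frame would give the two top-level branches.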

In [10]:
datasets = util.get_datasets()

## Neural Network

Why design a neural network by hand? The answer is simple: to understand how such algorithms work at a deep level, and because it's fun.

The whole neural network is designed as a package that we can use in this notebook. We start with a simple backpropagation algorithm trained with stochastic gradient descent.

The first step is to design a simple neural network that is capable of approximating some basic boolean functions.
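To make the idea concrete, here is a minimal NumPy sketch of a one-hidden-layer network trained with backpropagation on the AND function. It is not the code of `BasicNeuralNetwork`: the function `train_tiny_network`, the tanh activation, the softmax output, and the full-batch updates (instead of per-sample stochastic updates) are all assumptions made for illustration.

```python
import numpy as np


def train_tiny_network(X, y, hidden=8, epochs=2000, lr=0.5, seed=0):
    """Train a one-hidden-layer classifier with plain gradient descent."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=int)
    n_classes = int(y.max()) + 1

    # Weight matrices have shape (inputs, outputs) so that X @ W is well defined
    W1 = rng.normal(0.0, 1.0, size=(X.shape[1], hidden))
    b1 = np.zeros(hidden)
    W2 = rng.normal(0.0, 1.0, size=(hidden, n_classes))
    b2 = np.zeros(n_classes)
    onehot = np.eye(n_classes)[y]
    losses = []

    for _ in range(epochs):
        # Forward pass: tanh hidden layer followed by a softmax output
        h = np.tanh(X @ W1 + b1)
        logits = h @ W2 + b2
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        losses.append(-np.mean(np.log(probs[np.arange(len(y)), y] + 1e-12)))

        # Backward pass: gradients of the mean cross-entropy loss
        d_logits = (probs - onehot) / len(y)
        dW2 = h.T @ d_logits
        db2 = d_logits.sum(axis=0)
        d_h = (d_logits @ W2.T) * (1.0 - h ** 2)
        dW1 = X.T @ d_h
        db1 = d_h.sum(axis=0)

        # Gradient descent update
        W1 -= lr * dW1
        b1 -= lr * db1
        W2 -= lr * dW2
        b2 -= lr * db2

    def predict(X_new):
        h_new = np.tanh(np.asarray(X_new, dtype=float) @ W1 + b1)
        return np.argmax(h_new @ W2 + b2, axis=1)

    return predict, losses


# AND truth table
predict, losses = train_tiny_network([[0, 0], [0, 1], [1, 0], [1, 1]], [0, 0, 0, 1])
and_predictions = predict([[0, 0], [0, 1], [1, 0], [1, 1]])
```

The returned `losses` list can be inspected to check that training converges before the predictions are trusted.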

In [1]:
# In order to import from the python file without hassle, we add the current
# directory to the python path
import sys; sys.path.append(".")

# Auto-reload
%load_ext autoreload
%autoreload 2

In [2]:
from src.neural_network.basic import BasicNeuralNetwork

In [None]:

# Create a neural network
nn = BasicNeuralNetwork(
    input_size=2,
    output_size=2,
    iterations=1000,
    log_frequency=100,
    alpha=1,
    alpha_decay=0.05,
    layer_size=64,
    hidden_layers=1,
)

# Create dataset
X = [
    [0, 0],
    [0, 1],
    [1, 0],
    [1, 1]
]

# AND target values
y = [0, 0, 0, 1]

# Train on the dataset
nn.learn(X, y)

# Optional encoding
encoding = {
    0: "False",
    1: "True"
}

# Predictions
X00 = [[0,0]]
X01 = [[0,1]]
X10 = [[1,0]]
X11 = [[1,1]]

# Results
print("\nRESULTS: ")
print(f"\tFalse and False -> {nn.predict(X00, encoding)}")
print(f"\tFalse and True  -> {nn.predict(X01, encoding)}")
print(f"\tTrue and False  -> {nn.predict(X10, encoding)}")
print(f"\tTrue and True   -> {nn.predict(X11, encoding)}")

The neural network seems to work for the AND function. Let's try XOR.

In [6]:
# Create a neural network
nn = BasicNeuralNetwork(
    input_size=2,
    output_size=2,
    iterations=10,
    logs=True,
    log_frequency=1,
    alpha=1,
    alpha_decay=0.001,
    layer_size=64,
    hidden_layers=1,
)

# Create dataset
X = [
    [0, 0],
    [0, 1],
    [1, 0],
    [1, 1]
]

# XOR target values
y = [0, 1, 1, 0]

# Train on the dataset
nn.learn(X, y)

# Optional encoding
encoding = {
    0: "False",
    1: "True"
}

# Predictions
X00 = [[0,0]]
X01 = [[0,1]]
X10 = [[1,0]]
X11 = [[1,1]]

# Results
print("\nRESULTS: ")
print(f"\tFalse xor False -> {nn.predict(X00, encoding)}")
print(f"\tFalse xor True  -> {nn.predict(X01, encoding)}")
print(f"\tTrue xor False  -> {nn.predict(X10, encoding)}")
print(f"\tTrue xor True   -> {nn.predict(X11, encoding)}")

0: loss: 3.532179491345712, accuracy: 0.5, learning_rate: 1.0
1: loss: 8.059047875479163, accuracy: 0.5, learning_rate: 0.9990009990009991
2: loss: 0.745304719745516, accuracy: 0.75, learning_rate: 0.998003992015968
3: loss: 0.6814844140854559, accuracy: 0.5, learning_rate: 0.9970089730807579
4: loss: 1.0256145821789489, accuracy: 0.5, learning_rate: 0.9960159362549801
5: loss: 0.6206621311196778, accuracy: 0.5, learning_rate: 0.9950248756218907
6: loss: 0.23299459053884802, accuracy: 1.0, learning_rate: 0.9940357852882704
7: loss: 0.13498562804541667, accuracy: 1.0, learning_rate: 0.99304865938431
8: loss: 0.11846899228654595, accuracy: 1.0, learning_rate: 0.9920634920634921
9: loss: 0.10588533391221254, accuracy: 1.0, learning_rate: 0.9910802775024778
Finished learning. Accuracy: 1.0.

RESULTS: 
	False xor False -> False
	False xor True  -> True
	True xor False  -> True
	True xor True   -> False
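The learning rates printed in the log are consistent with inverse-time decay, where the initial rate `alpha` is divided by `1 + alpha_decay * iteration`. Whether `BasicNeuralNetwork` computes the rate exactly this way is an assumption; `decayed_learning_rate` is a hypothetical helper that only reproduces the logged numbers.

```python
def decayed_learning_rate(alpha, alpha_decay, iteration):
    """Inverse-time decay: the rate shrinks as the iteration count grows."""
    return alpha / (1.0 + alpha_decay * iteration)


# Same settings as the XOR run: alpha=1, alpha_decay=0.001, 10 iterations
rates = [decayed_learning_rate(1.0, 0.001, i) for i in range(10)]
```

With `alpha_decay=0.001` the rate barely changes over 10 iterations, so the decay has little influence on a run this short.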


In [19]:
dataset = datasets["ffill_datasets"]["window_all"].copy()

X = dataset.drop("TARGET", axis=1).to_numpy()
y = dataset[["TARGET"]].to_numpy()

# Create a neural network
nn = BasicNeuralNetwork(
    input_size=X.shape[1],
    output_size=2,
    iterations=1000,
    logs=True,
    log_frequency=100,
    alpha=1,
    alpha_decay=0.001,
    layer_size=64,
    hidden_layers=1,
)

nn.learn(X, y)

ValueError: shapes (1410,229) and (1410,64) not aligned: 229 (dim 1) != 1410 (dim 0)
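NumPy raises this error when the inner dimensions of a matrix product disagree. The sketch below reproduces the same message with the shapes from the traceback, assuming a dense layer computed as `np.dot(inputs, weights)`. Where exactly the mismatch arises inside `BasicNeuralNetwork` cannot be determined from this notebook, so the weight shapes here are illustrative only.

```python
import numpy as np

n_samples, n_features, layer_size = 1410, 229, 64

inputs = np.zeros((n_samples, n_features))

# For inputs @ weights the weight matrix needs one row per input feature
weights_ok = np.zeros((n_features, layer_size))
hidden = np.dot(inputs, weights_ok)

# A weight matrix with one row per sample reproduces the reported error
weights_bad = np.zeros((n_samples, layer_size))
message = ""
try:
    np.dot(inputs, weights_bad)
except ValueError as err:
    message = str(err)
```

Printing the shapes of the inputs and of each weight matrix right before the product is a quick way to locate such a mismatch.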