# AIQDSC27 - Machine Learning Algorithms

**Student**: Quentin Le Roux

## Instructions

With the available part of the MIMIC-III dataset, **propose the best model** (among Linear Regression, KNN, Naive Bayes, RandomForest) to predict:

> **re-hospitalization** (evaluation metrics, accuracy)

To build the features (X), all or part of the following columns can be used (any type of pre-processing is allowed):

- **DOB**: Date of Birth
- **GENDER**
- **MARITAL_STATUS**
- **ETHNICITY**
- **INSURANCE**
- **DEATHTIME**: Date of Death (if the patient has died)
- **ADMITTIME**: Date of the admission
- **ADMISSION_TYPE**
    - blood, circulatory, congenital, digestive, endocrine, genitourinary, infectious, injury, mental, misc, muscular, neoplasms, nervous, pregnancy, prenatal, respiratory, skin
    - Bag of Words representation of diagnosis
- **DISCHTIME**: date of the discharge
- **DISCHARGE_LOCATION**: patient's destination after discharge from hospital
- **TEXT**: discharge medical report
- **DAYS_NEXT_ADMIT**: number of days between discharge and readmission
- **NEXT_ADMITTIME**: date of readmission
- **OUTPUT_LABEL**

**Data leakage** (see https://www.kaggle.com/alexisbcook/data-leakage) has to be accounted for and dealt with.

The deliverable is a **Jupyter notebook written like a report**, with a clearly announced plan, distinct sections, and a conclusion.

*Part of the grade is based on the quality of the report (8 points), part on the quality of the work and adherence to the methodology (6 points), and part on the quality of the prediction (6 points).*

## Notes

### Data Leakage (excerpts from [here](https://www.kaggle.com/alexisbcook/data-leakage))

"Data leakage (or leakage) happens when **your training data contains information about the target**, but similar data will not be available when the model is used for prediction. This leads to high performance on the training set (and possibly even the validation data), but the model will perform poorly in production.

[...]

**Target leakage** occurs when your predictors include data that will not be available at the time you make predictions. It is important to think about target leakage in terms of the timing or chronological order that data becomes available, not merely whether a feature helps make good predictions.

[...] 

Validation is meant to be a measure of how the model does on data that it hasn't considered before. You can corrupt this process in subtle ways if the validation data affects the preprocessing behavior. This is sometimes called **train-test contamination**."
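
To make train-test contamination concrete, here is a minimal sketch of fitting a pre-processing step on the training split only, using a scikit-learn `Pipeline`. The data is synthetic and every variable name (`X_demo`, `leak_free_model`, ...) is an assumption made for illustration; it is not taken from the assignment.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the real features (assumption).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# The pipeline fits the scaler on the training split only. The test split
# is transformed with the training statistics and never refitted.
leak_free_model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
leak_free_model.fit(X_tr, y_tr)

test_accuracy = leak_free_model.score(X_te, y_te)
print(f"Test accuracy: {test_accuracy:.3f}")
```

Calling `StandardScaler().fit(...)` on the full array before splitting would instead let test-set statistics influence the training features, which is the contamination described above.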

## 0. Table of Contents

1. **Introduction**

    a. Overview of project steps
    
    b. Library imports


2. **Data pre-processing**

    a. Overview of used methods
    
    b. Pre-processing
   
   
3. **Modeling**

    a. Linear Regression
    
    b. KNN
    
    c. Naive Bayes
    
    d. Random Forest
 
 
4. **Exploring hyperparameters of the best model**


5. **Conclusion**

## 1. Introduction

### 1.1 Overview of project steps
    
The goal is the following:

1. **Pre-process the dataset** into a ready-to-train-on array 


2. **Train and test our four selected model types**: Linear Regression, KNN, Naive Bayes, RandomForest


3. **Select the most promising** of the four and **perform further hyperparameter tuning** to increase its performance


4. **Conclude** and propose further areas of exploration
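
The comparison in steps 2 and 3 could look roughly like the sketch below. It runs on synthetic arrays, not on the MIMIC-III data, and the names `candidates`, `scores` and `best_name` are hypothetical. Because Linear Regression outputs real values, the sketch assumes a 0.5 threshold to turn its predictions into classes before computing accuracy.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the pre-processed arrays (assumption).
rng = np.random.default_rng(42)
X_all = rng.normal(size=(300, 4))
y_all = (X_all[:, 0] - X_all[:, 2] > 0).astype(int)
X_fit, X_eval = X_all[:200], X_all[200:]
y_fit, y_eval = y_all[:200], y_all[200:]

candidates = {
    "Linear Regression": LinearRegression(),
    "KNN": KNeighborsClassifier(),
    "Naive Bayes": GaussianNB(),
    "Random Forest": RandomForestClassifier(random_state=42),
}

scores = {}
for name, model in candidates.items():
    model.fit(X_fit, y_fit)
    raw = model.predict(X_eval)
    # Thresholding turns the regression output into a class label;
    # it leaves the 0/1 predictions of the classifiers unchanged.
    predicted = (raw >= 0.5).astype(int)
    scores[name] = accuracy_score(y_eval, predicted)

best_name = max(scores, key=scores.get)
print(scores, "->", best_name)
```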

### 1.2 Library imports

In [2]:
import pandas as pd

## 2. Data pre-processing

### 2.1 Overview of used methods
    
<span style="color:red">TBD</span>

### 2.2 Pre-processing

#### 1 - <u>Loading the *train* and *test* datasets:</u>

In [9]:
online_path = "http://www.i3s.unice.fr/~riveill/dataset/MIMIC-III-readmission/"
train_set_path = online_path + "train.csv.zip"
test_set_path = online_path + "test.csv.zip"

local_train_set_path = "./datasets/train.csv.zip"
local_test_set_path = "./datasets/test.csv.zip"

In [65]:
# Alternative: load directly from the online source
# df_train = pd.read_csv(train_set_path)
# df_test = pd.read_csv(test_set_path)

df_train = pd.read_csv(local_train_set_path)
df_test = pd.read_csv(local_test_set_path)

We create placeholders for the features and targets so that we neither overwrite the original data nor modify it in place.

In [84]:
X_train = None
X_test = None

In [85]:
y_train = None
y_test = None

#### 2 - <u>Quick overview of the two sets:</u>

- The train dataset holds **2000** entries and the test dataset holds **901** entries, i.e., a **69-31 train-test ratio**. *Although a rule of thumb of roughly 80-20 is often recommended, the dataset is small, so we keep the provided sets as-is*.


- There seem to be **several features with NaN values**, which will have to be dealt with.


- The available features are of type **int64**, **float64** or **object**. We will have to transform these columns accordingly.


- As seen in the [MIMIC-III Clinical Database Demo 1.4](https://physionet.org/content/mimiciii-demo/1.4/ADMISSIONS.csv), the diagnosis starts as a string containing a list of diagnoses separated by characters such as '/', ';', ',' or '-'. The available dataset also provides a Bag of Words representation of that diagnosis column. Since we see values other than 0 or 1 (i.e. false or true), **the Bag of Words values may encode some kind of importance** (e.g. the number of times a term appears).
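
The checks behind the points above can be sketched as follows. The frame `df_demo` is a tiny hypothetical stand-in for the real dataset (the `DOB` values in particular are made up), and the `AGE` column is an assumed derived feature, not something the dataset provides.

```python
import pandas as pd

# Tiny hypothetical frame mimicking a few columns of the dataset.
df_demo = pd.DataFrame({
    "ADMITTIME": ["2135-05-09 14:11:00", "2134-12-27 07:15:00"],
    "DOB": ["2087-07-14 00:00:00", "2087-07-14 00:00:00"],
    "DEATHTIME": [None, None],
    "MARITAL_STATUS": ["MARRIED", None],
})

# Missing values per column, most affected first.
nan_counts = df_demo.isna().sum().sort_values(ascending=False)
print(nan_counts)

# Parse the date strings (object dtype) into datetime columns.
for col in ["ADMITTIME", "DOB", "DEATHTIME"]:
    df_demo[col] = pd.to_datetime(df_demo[col])

# Approximate age at admission, as a difference of calendar years.
df_demo["AGE"] = df_demo["ADMITTIME"].dt.year - df_demo["DOB"].dt.year
print(df_demo[["ADMITTIME", "DOB", "AGE"]])
```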

In [21]:
df_train.describe()

Unnamed: 0,SUBJECT_ID,HADM_ID,DAYS_NEXT_ADMIT,blood,circulatory,congenital,digestive,endocrine,genitourinary,infectious,...,mental,misc,muscular,neoplasms,nervous,pregnancy,prenatal,respiratory,skin,OUTPUT_LABEL
count,2000.0,2000.0,1210.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,...,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0
mean,18155.6905,150103.483,119.883433,0.4825,2.858,0.036,0.7475,1.389,0.6605,0.4385,...,0.4475,0.4305,0.216,0.2555,0.421,0.008,0.119,0.9725,0.189,0.505
std,26240.378348,29205.036893,404.753993,0.735503,2.253969,0.196783,1.179593,1.329121,0.895902,0.809658,...,0.847114,0.739894,0.544511,0.704605,0.801299,0.151484,0.376709,1.199359,0.551753,0.5001
min,11.0,100095.0,-0.602083,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,1490.5,124979.5,5.383333,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,3103.5,150743.5,13.219792,0.0,3.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0
75%,25072.75,174570.75,25.327951,1.0,4.0,0.0,1.0,2.0,1.0,1.0,...,1.0,1.0,0.0,0.0,1.0,0.0,0.0,2.0,0.0,1.0
max,99562.0,199955.0,3867.977778,5.0,13.0,2.0,9.0,10.0,4.0,7.0,...,9.0,5.0,5.0,8.0,7.0,4.0,5.0,6.0,6.0,1.0


In [23]:
df_test.describe()

Unnamed: 0,SUBJECT_ID,HADM_ID,DAYS_NEXT_ADMIT,blood,circulatory,congenital,digestive,endocrine,genitourinary,infectious,...,mental,misc,muscular,neoplasms,nervous,pregnancy,prenatal,respiratory,skin,OUTPUT_LABEL
count,901.0,901.0,526.0,901.0,901.0,901.0,901.0,901.0,901.0,901.0,...,901.0,901.0,901.0,901.0,901.0,901.0,901.0,901.0,901.0,901.0
mean,18306.197558,149172.830189,84.578517,0.466149,2.81798,0.044395,0.72808,1.372919,0.700333,0.468368,...,0.468368,0.436182,0.201998,0.243063,0.440622,0.015538,0.119867,0.931188,0.241953,0.503885
std,26349.689656,29115.501914,304.437951,0.69139,2.256878,0.231479,1.165418,1.406611,0.944628,0.804397,...,0.919147,0.752463,0.53876,0.682942,0.784625,0.253383,0.354423,1.18403,0.624726,0.500263
min,6.0,100039.0,-0.454167,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,1521.0,123423.0,5.100868,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,3176.0,147718.0,11.302431,0.0,2.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0
75%,25256.0,174749.0,22.211632,1.0,4.0,0.0,1.0,2.0,1.0,1.0,...,1.0,1.0,0.0,0.0,1.0,0.0,0.0,2.0,0.0,1.0
max,99982.0,199807.0,3543.101389,4.0,12.0,2.0,7.0,7.0,5.0,7.0,...,6.0,5.0,5.0,5.0,4.0,5.0,3.0,7.0,6.0,1.0


In [22]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 34 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   SUBJECT_ID          2000 non-null   int64  
 1   HADM_ID             2000 non-null   int64  
 2   ADMITTIME           2000 non-null   object 
 3   DISCHTIME           2000 non-null   object 
 4   DAYS_NEXT_ADMIT     1210 non-null   float64
 5   NEXT_ADMITTIME      1210 non-null   object 
 6   ADMISSION_TYPE      2000 non-null   object 
 7   DEATHTIME           158 non-null    object 
 8   DISCHARGE_LOCATION  2000 non-null   object 
 9   INSURANCE           2000 non-null   object 
 10  MARITAL_STATUS      1924 non-null   object 
 11  ETHNICITY           2000 non-null   object 
 12  DIAGNOSIS           1998 non-null   object 
 13  TEXT                1925 non-null   object 
 14  GENDER              2000 non-null   object 
 15  DOB                 2000 non-null   object 
 16  blood 

In [30]:
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 901 entries, 0 to 900
Data columns (total 34 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   SUBJECT_ID          901 non-null    int64  
 1   HADM_ID             901 non-null    int64  
 2   ADMITTIME           901 non-null    object 
 3   DISCHTIME           901 non-null    object 
 4   DAYS_NEXT_ADMIT     526 non-null    float64
 5   NEXT_ADMITTIME      526 non-null    object 
 6   ADMISSION_TYPE      901 non-null    object 
 7   DEATHTIME           58 non-null     object 
 8   DISCHARGE_LOCATION  901 non-null    object 
 9   INSURANCE           901 non-null    object 
 10  MARITAL_STATUS      861 non-null    object 
 11  ETHNICITY           901 non-null    object 
 12  DIAGNOSIS           901 non-null    object 
 13  TEXT                871 non-null    object 
 14  GENDER              901 non-null    object 
 15  DOB                 901 non-null    object 
 16  blood   

#### 3 - <u>Dealing with data leakage:</u>

It appears that **a single person**, identified by a single subject ID, **can have several entries in the dataset**. Based on the [MIMIC-III information](https://mimic.physionet.org/mimictables/admissions/), **HADM_ID** identifies a single admission to the hospital, while **SUBJECT_ID** identifies a patient, who can have multiple admissions. <span style="color:red">To avoid data leakage, we must identify rows related to the same individual and merge them if possible</span>.

Our goal is to **individualize each row** so as to reduce the dependencies between rows. This is key to avoiding data leakage.
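
As an alternative to merging rows (and not the approach settled on in this report), dependencies between rows of the same patient can also be handled by grouping on `SUBJECT_ID`. The sketch below, on hypothetical miniature frames, drops training rows of patients who also appear in the test set and uses scikit-learn's `GroupKFold` so that all admissions of a patient stay in the same cross-validation fold.

```python
import pandas as pd
from sklearn.model_selection import GroupKFold

# Hypothetical miniature versions of the two sets.
train_demo = pd.DataFrame({
    "SUBJECT_ID": [11, 17, 17, 19, 19, 21, 22],
    "HADM_ID": range(7),
})
test_demo = pd.DataFrame({"SUBJECT_ID": [17, 30], "HADM_ID": [100, 101]})

# Patients present in both sets could be "recognized" at test time.
shared_ids = set(train_demo["SUBJECT_ID"]) & set(test_demo["SUBJECT_ID"])
train_clean = train_demo[~train_demo["SUBJECT_ID"].isin(shared_ids)].reset_index(drop=True)

# During cross-validation, keep all admissions of a patient in the same fold.
folds = GroupKFold(n_splits=2)
for fit_idx, val_idx in folds.split(train_clean, groups=train_clean["SUBJECT_ID"]):
    fit_ids = set(train_clean.loc[fit_idx, "SUBJECT_ID"])
    val_ids = set(train_clean.loc[val_idx, "SUBJECT_ID"])
    # Sanity check: no patient is split across the two sides of a fold.
    assert fit_ids.isdisjoint(val_ids)

print("Shared patients removed:", shared_ids)
```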

**Example with subject_id 17**:

In [89]:
# we look for the number of times a single patient has been admitted to a hospital. 
# We find that a single patient may have been admitted up to 15 times in the training set

df_train.pivot_table(index = ['SUBJECT_ID'], aggfunc ='size').unique()

array([ 1,  2, 15,  3,  5,  4,  6,  8])

In [64]:
# We identify that subject 17 has been admitted twice

df_train.pivot_table(index = ['SUBJECT_ID'], aggfunc ='size')

SUBJECT_ID
11       1
17       2
19       1
21       1
22       1
        ..
99312    1
99384    1
99464    1
99538    1
99562    1
Length: 1758, dtype: int64

In [86]:
df_train[df_train["SUBJECT_ID"]==17]

Unnamed: 0,SUBJECT_ID,HADM_ID,ADMITTIME,DISCHTIME,DAYS_NEXT_ADMIT,NEXT_ADMITTIME,ADMISSION_TYPE,DEATHTIME,DISCHARGE_LOCATION,INSURANCE,...,mental,misc,muscular,neoplasms,nervous,pregnancy,prenatal,respiratory,skin,OUTPUT_LABEL
1182,17,161087,2135-05-09 14:11:00,2135-05-13 14:40:00,,,EMERGENCY,,HOME HEALTH CARE,Private,...,1,1,2,0,0,0,0,1,0,0
1710,17,194023,2134-12-27 07:15:00,2134-12-31 16:05:00,128.920833,2135-05-09 14:11:00,ELECTIVE,,HOME HEALTH CARE,Private,...,0,0,0,0,0,0,0,0,0,0


We also want to check whether patients who have been readmitted always have one row per admission. Based on simple data wrangling, it appears that **some patients have a readmission time but do not have multiple rows associated with their case**.
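
One possible form of that data wrangling is sketched below on a hypothetical four-row frame built from the values shown in this section; the names `demo`, `rows_per_patient` and `single_row_readmitted` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical excerpt: subject 17 has two rows, subjects 937 and 19 have one.
demo = pd.DataFrame({
    "SUBJECT_ID": [17, 17, 937, 19],
    "NEXT_ADMITTIME": [None, "2135-05-09 14:11:00", "2163-01-24 09:29:00", None],
})

# Number of rows per patient, aligned back onto each row.
rows_per_patient = demo["SUBJECT_ID"].map(demo["SUBJECT_ID"].value_counts())

# Readmitted according to NEXT_ADMITTIME, yet only one admission row is present.
single_row_readmitted = demo[demo["NEXT_ADMITTIME"].notna() & (rows_per_patient == 1)]
print(single_row_readmitted)
```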

**Example with subject_id 937**:

In [88]:
# We find that subject_id 937 has been admitted twice but has only one single record
# in the training dataset

df_train[df_train["SUBJECT_ID"]==937]

Unnamed: 0,SUBJECT_ID,HADM_ID,ADMITTIME,DISCHTIME,DAYS_NEXT_ADMIT,NEXT_ADMITTIME,ADMISSION_TYPE,DEATHTIME,DISCHARGE_LOCATION,INSURANCE,...,mental,misc,muscular,neoplasms,nervous,pregnancy,prenatal,respiratory,skin,OUTPUT_LABEL
0,937,148592,2163-01-20 18:39:00,2163-01-24 08:00:00,0.061806,2163-01-24 09:29:00,EMERGENCY,2163-01-26 08:00:00,DEAD/EXPIRED,Medicare,...,0,0,0,0,1,0,0,0,0,1


#### 4 - <u>Building our target variable:</u>
    
We want to predict the re-hospitalization of a patient. The question is then **how to represent re-hospitalization**.

Two approaches are possible:

- **regression**: Predicting the number of days between discharge and readmission for a patient
    - We can predict the number of days between discharge and readmission using the DAYS_NEXT_ADMIT feature that is available to us
    - The main issue of DAYS_NEXT_ADMIT is how to represent the absence of readmission (NaN in the dataset)

- **classification**: Predicting whether a patient will **i)** be readmitted at some point, **ii)** die, or **iii)** be discharged without readmission
    - We can assign a tag to each of the scenarios above which will be used to perform classification
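
A minimal sketch of the classification tagging, on a hypothetical three-row frame using values displayed in this notebook. The `OUTCOME` and `AMBIGUOUS` columns and the precedence rule (readmission wins over death when both are recorded) are assumptions chosen for illustration, not a decision made in this report.

```python
import numpy as np
import pandas as pd

# Hypothetical excerpt with the two rows of subject 17 and the row of subject 937.
demo = pd.DataFrame({
    "SUBJECT_ID": [17, 17, 937],
    "DAYS_NEXT_ADMIT": [np.nan, 128.920833, 0.061806],
    "DEATHTIME": [None, None, "2163-01-26 08:00:00"],
})

readmitted = demo["DAYS_NEXT_ADMIT"].notna()
died = demo["DEATHTIME"].notna()

# Assumed precedence: the first matching condition wins, so a row with both
# a readmission and a death time is tagged "readmitted".
demo["OUTCOME"] = np.select(
    [readmitted, died],
    ["readmitted", "died"],
    default="discharged",
)

# Rows carrying both signals deserve a manual look before modeling.
demo["AMBIGUOUS"] = readmitted & died
print(demo[["SUBJECT_ID", "OUTCOME", "AMBIGUOUS"]])
```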
    
With regard to constructing our y values, <span style="color:red">we find that some rows are problematic</span>. For instance, the row for subject_id 937 contains both a next admission time and a death time.

In [83]:
df_train[df_train["SUBJECT_ID"]==937]

Unnamed: 0,SUBJECT_ID,HADM_ID,ADMITTIME,DISCHTIME,DAYS_NEXT_ADMIT,NEXT_ADMITTIME,ADMISSION_TYPE,DEATHTIME,DISCHARGE_LOCATION,INSURANCE,...,mental,misc,muscular,neoplasms,nervous,pregnancy,prenatal,respiratory,skin,OUTPUT_LABEL
0,937,148592,2163-01-20 18:39:00,2163-01-24 08:00:00,0.061806,2163-01-24 09:29:00,EMERGENCY,2163-01-26 08:00:00,DEAD/EXPIRED,Medicare,...,0,0,0,0,1,0,0,0,0,1


In [80]:
df_test[["SUBJECT_ID", "HADM_ID", "DEATHTIME", "NEXT_ADMITTIME"]]

Unnamed: 0,SUBJECT_ID,HADM_ID,DEATHTIME,NEXT_ADMITTIME
0,25697,104760,,2122-04-12 22:12:00
1,2668,121020,,
2,71,111944,,
3,14131,136336,,2118-04-17 19:22:00
4,85870,123324,,2144-01-24 22:07:00
...,...,...,...,...
896,81545,108398,,2130-11-09 17:47:00
897,28073,196299,,2178-03-24 20:02:00
898,21233,139588,,2172-03-14 23:36:00
899,59085,128590,,2183-02-01 01:59:00


## 3. Modeling

### 3.1 Linear Regression

### 3.2 KNN

### 3.3 Naive Bayes

### 3.4 Random Forest

## 4. Exploring the hyperparameters of the best model

## 5. Conclusion