# AIQDSC27 - Machine Learning Algorithms

**Student**: Quentin Le Roux

## Instructions

With the available part of the MIMIC dataset, **propose the best model** (among Linear Regression, KNN, Naive Bayes, RandomForest) to predict:

> **re-hospitalization** (evaluation metrics, accuracy)

To build the features (X), all or part of the following columns can be used (any type of pre-processing is allowed):

- **DOB**: Date of Birth
- **GENDER**
- **MARITAL_STATUS**
- **ETHNICITY**
- **INSURANCE**
- **DEATHTIME**: Date of Death (if the patient has died)
- **ADMITTIME**: Date of the admission
- **ADMISSION_TYPE**
    - blood, circulatory, congenital, digestive, endocrine, genitourinary, infectious, injury, mental, misc, muscular, neoplasms, nervous, pregnancy, prenatal, respiratory, skin
    - Bag of Words representation of diagnosis
- **DISCHTIME**: date of the discharge
- **DISCHARGE_LOCATION**: patient's destination after discharge from hospital
- **TEXT**: discharge medical report
- **DAYS_NEXT_ADMIT**: number of days between discharge and readmission
- **NEXT_ADMITTIME**: date of readmission
- **OUTPUT_LABEL**

**Data leakage** (see https://www.kaggle.com/alexisbcook/data-leakage) has to be accounted for and dealt with.

The rendering will be in the form of a **jupyter notebook written like a report**: with a clearly announced plan, different sections and a conclusion.

*A part of the grade will be given on the quality of the report (8 points), a part on the quality of the work done, and the respect of the methodology (6 points), a part on the quality of the prediction (6 points)*. 

## Notes

### Data Leakage (excerpts from [Kaggle](https://www.kaggle.com/alexisbcook/data-leakage))

"Data leakage (or leakage) happens when **your training data contains information about the target**, but similar data will not be available when the model is used for prediction. This leads to high performance on the training set (and possibly even the validation data), but the model will perform poorly in production.

[...]

**Target leakage** occurs when your predictors include data that will not be available at the time you make predictions. It is important to think about target leakage in terms of the timing or chronological order that data becomes available, not merely whether a feature helps make good predictions.

[...] 

Validation is meant to be a measure of how the model does on data that it hasn't considered before. You can corrupt this process in subtle ways if the validation data affects the preprocessing behavior. This is sometimes called **train-test contamination**."
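
To make train-test contamination concrete, here is a minimal sketch on toy numbers (not the MIMIC data). It standardizes a feature using statistics computed on the training rows only, and contrasts this with statistics computed on train and test together. All values and variable names are illustrative assumptions.

```python
import numpy as np

# Toy feature values (hypothetical, for illustration only)
train = np.array([1.0, 2.0, 3.0, 4.0])
test = np.array([10.0, 20.0])

# Correct: the statistics come from the training data only...
mu, sigma = train.mean(), train.std()
train_scaled = (train - mu) / sigma
# ...and are reused unchanged on the test data
test_scaled = (test - mu) / sigma

# Contaminated: the statistics are computed on train + test together,
# so information about the test rows leaks into the pre-processing
all_values = np.concatenate([train, test])
mu_leaky = all_values.mean()

print(f"train-only mean: {mu:.2f}, contaminated mean: {mu_leaky:.2f}")
```

The same principle applies to any fitted pre-processing step (imputation values, vocabularies, encoders): fit on the training set, then apply to the test set.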

## 0. Table of Contents

1. **Introduction**

    a. Overview of project steps
    
    b. Library imports and built functions


2. **Data pre-processing**

    a. Overview of used methods
    
    b. Pre-processing
   
   
3. **Modeling**

    a. Linear Regression
    
    b. KNN
    
    c. Naive Bayes
    
    d. Random Forest
 
 
4. **Exploring hyperparameters of the best model**


5. **Conclusion**

## 1. Introduction

### 1.1 Overview of project steps
    
The project will proceed using the following steps:

1. **Pre-processing of the dataset** into a ready-to-train-on array 


2. **Training and testing our selected model types**: Linear Regression, KNN, Naive Bayes, RandomForest


3. **Selecting the most promising** of the four and **performing further hyperparameter tuning** to increase the model's performance


4. **Concluding** and proposing further areas of exploration

<u><span style="color:red">Note on **Linear Regression**</span>:</u>

The mentioned models are *Linear Regression*, KNN, Naive Bayes, RandomForest. Linear Regression is a **regression** model, while the three others are used here as **classification** models. Though we will implement linear regression in this project, we will also add **logistic regression** as a classification model so that classifiers can be compared like for like.
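
One possible way to score a regression model with accuracy is to threshold its continuous output. The sketch below is an assumption about how this could be done, not a method prescribed by the assignment. It fits ordinary least squares on toy data with NumPy and applies a 0.5 cut-off; the data and the cut-off are hypothetical.

```python
import numpy as np

# Toy data (hypothetical): one feature, binary target
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

# Ordinary least squares with an intercept column
X_design = np.hstack([np.ones((len(X), 1)), X])
coef, *_ = np.linalg.lstsq(X_design, y, rcond=None)

# Continuous predictions, then a 0.5 cut-off to obtain class labels
y_score = X_design @ coef
y_pred = (y_score >= 0.5).astype(int)

# Accuracy of the thresholded predictions
accuracy = (y_pred == y).mean()
print(y_pred, accuracy)
```

Logistic regression avoids the arbitrary cut-off on an unbounded score by modeling a probability directly, which is why it is added to the comparison.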

### 1.2 Library imports and built functions

In [1]:
# Library imports

import datetime as dt
import nltk
import numpy as np
import pandas as pd

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer

import warnings
warnings.filterwarnings('ignore')

In [2]:
# Function Declarations

def remove_stop_words(tokenized_sentence):
    """
    Removes the stop words from a tokenized sentence
    """
    punctuation = [".", ",", "[", "]", "`", "(", ")", "?", "'", "'s", ":", "!"]
    stop_words = stopwords.words('english')
    stop_words += punctuation
    return [w for w in tokenized_sentence if w not in stop_words]
    
def lemmatize(tokenized_sentence):
    """
    Creates a lemmatizer object and lemmatizes tokenized items (e.g. sentences)
    Might require running the following:
        nltk.download('wordnet')
    """
    lemmatizer = nltk.WordNetLemmatizer()
    return [lemmatizer.lemmatize(w) for w in tokenized_sentence]

sentence_processing = lambda sentence: " ".join(
    lemmatize(
        remove_stop_words(
            word_tokenize(str.lower(str(sentence)))
        )
    )
)

## 2. Data pre-processing

### 2.1 Overview of Process
    
- **Loading** the training and testing sets and remarks on the set


- Identifying and removing columns with a risk of **data leakage**


- **Final selection of feature variables** (X)


- **Building/Wrangling** the features variables
    - Dealing with NaN values
    - Building a LENGTH_OF_STAY feature
    - Building an AGE feature
    - One-hot encoding the kept discrete features
    - Creating a Bag of Words representation of the DIAGNOSIS column
    - Building the final training and testing datasets


- **Building the target variable(s)** (Y)



### 2.2 Pre-processing

#### 1 - <u>Loading the *training* and *testing* datasets:</u>

We also create two placeholder variables to hold the training and testing sets so that we do not erase the original data/perform inplace modifications.

In [3]:
online_path = "http://www.i3s.unice.fr/~riveill/dataset/MIMIC-III-readmission/"
train_set_path = online_path + "train.csv.zip"
test_set_path = online_path + "test.csv.zip"

local_train_set_path = "./datasets/train.csv.zip"
local_test_set_path = "./datasets/test.csv.zip"

In [4]:
# df_train = pd.read_csv(train_set_path)
# df_test = pd.read_csv(test_set_path)

df_train = pd.read_csv(local_train_set_path)
df_test = pd.read_csv(local_test_set_path)

In [5]:
X_train = None
X_test = None

In [6]:
y_train = None
y_test = None

#### 2 - <u>Notes on the training and testing sets:</u>

- We find that the training dataset holds **2000 entries**, while the testing dataset holds **901 entries**, i.e., a **69 to 31 train-test ratio**. *A common rule of thumb for small datasets is a split of around 80-20. The provided split keeps a larger share for testing but remains reasonable, so we will keep the sets as they are*.


- There seem to be **several features with NaN values**, which will have to be dealt with.


- The available features are of types **int64**, **float64** (DAYS_NEXT_ADMIT) or **object**. We will have to transform the object columns accordingly.


- As seen in the [MIMIC-III Clinical Database Demo 1.4](https://physionet.org/content/mimiciii-demo/1.4/ADMISSIONS.csv), the DIAGNOSIS variable (a column in the data table) corresponds to a string value containing a list of diagnoses separated by either '/', ';', ',', '-', etc. 

    - The dataset provides a Bag of Words representation of this DIAGNOSIS column. Given that we see values other than 0 and 1 (i.e. not just true or false), **the provided Bag of Words may encode some kind of importance** (e.g. the number of times a term appears)
    
    - Consequently, since the raw DIAGNOSIS column is available, **we might want to create our own Bag of Words representation**
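
To illustrate the difference between a count-based and a binary Bag of Words mentioned above, here is a small standard-library sketch. The diagnosis string is hypothetical (written in the style of the DIAGNOSIS column), and the separator list is an assumption based on the separators noted above.

```python
import re
from collections import Counter

# Hypothetical diagnosis string, in the style of the DIAGNOSIS column
diagnosis = "PNEUMONIA;SEPSIS/PNEUMONIA - RENAL FAILURE"

# Split on '/', ';', ',', '-' and whitespace, dropping empty tokens
tokens = [t for t in re.split(r"[/;,\-\s]+", diagnosis.lower()) if t]

# Count-based Bag of Words: a term appearing twice gets the value 2
counts = Counter(tokens)

# Binary (presence/absence) variant: every present term gets the value 1
binary = {term: 1 for term in counts}

print(counts)
print(binary)
```

Values above 1 can only appear in the count-based variant, which is consistent with the observation made on the provided columns.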
    
<u>Preliminary information on the sets:</u>

In [7]:
df_train.describe()

Unnamed: 0,SUBJECT_ID,HADM_ID,DAYS_NEXT_ADMIT,blood,circulatory,congenital,digestive,endocrine,genitourinary,infectious,...,mental,misc,muscular,neoplasms,nervous,pregnancy,prenatal,respiratory,skin,OUTPUT_LABEL
count,2000.0,2000.0,1210.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,...,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0
mean,18155.6905,150103.483,119.883433,0.4825,2.858,0.036,0.7475,1.389,0.6605,0.4385,...,0.4475,0.4305,0.216,0.2555,0.421,0.008,0.119,0.9725,0.189,0.505
std,26240.378348,29205.036893,404.753993,0.735503,2.253969,0.196783,1.179593,1.329121,0.895902,0.809658,...,0.847114,0.739894,0.544511,0.704605,0.801299,0.151484,0.376709,1.199359,0.551753,0.5001
min,11.0,100095.0,-0.602083,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,1490.5,124979.5,5.383333,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,3103.5,150743.5,13.219792,0.0,3.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0
75%,25072.75,174570.75,25.327951,1.0,4.0,0.0,1.0,2.0,1.0,1.0,...,1.0,1.0,0.0,0.0,1.0,0.0,0.0,2.0,0.0,1.0
max,99562.0,199955.0,3867.977778,5.0,13.0,2.0,9.0,10.0,4.0,7.0,...,9.0,5.0,5.0,8.0,7.0,4.0,5.0,6.0,6.0,1.0


In [8]:
df_test.describe()

Unnamed: 0,SUBJECT_ID,HADM_ID,DAYS_NEXT_ADMIT,blood,circulatory,congenital,digestive,endocrine,genitourinary,infectious,...,mental,misc,muscular,neoplasms,nervous,pregnancy,prenatal,respiratory,skin,OUTPUT_LABEL
count,901.0,901.0,526.0,901.0,901.0,901.0,901.0,901.0,901.0,901.0,...,901.0,901.0,901.0,901.0,901.0,901.0,901.0,901.0,901.0,901.0
mean,18306.197558,149172.830189,84.578517,0.466149,2.81798,0.044395,0.72808,1.372919,0.700333,0.468368,...,0.468368,0.436182,0.201998,0.243063,0.440622,0.015538,0.119867,0.931188,0.241953,0.503885
std,26349.689656,29115.501914,304.437951,0.69139,2.256878,0.231479,1.165418,1.406611,0.944628,0.804397,...,0.919147,0.752463,0.53876,0.682942,0.784625,0.253383,0.354423,1.18403,0.624726,0.500263
min,6.0,100039.0,-0.454167,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,1521.0,123423.0,5.100868,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,3176.0,147718.0,11.302431,0.0,2.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0
75%,25256.0,174749.0,22.211632,1.0,4.0,0.0,1.0,2.0,1.0,1.0,...,1.0,1.0,0.0,0.0,1.0,0.0,0.0,2.0,0.0,1.0
max,99982.0,199807.0,3543.101389,4.0,12.0,2.0,7.0,7.0,5.0,7.0,...,6.0,5.0,5.0,5.0,4.0,5.0,3.0,7.0,6.0,1.0


In [9]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 34 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   SUBJECT_ID          2000 non-null   int64  
 1   HADM_ID             2000 non-null   int64  
 2   ADMITTIME           2000 non-null   object 
 3   DISCHTIME           2000 non-null   object 
 4   DAYS_NEXT_ADMIT     1210 non-null   float64
 5   NEXT_ADMITTIME      1210 non-null   object 
 6   ADMISSION_TYPE      2000 non-null   object 
 7   DEATHTIME           158 non-null    object 
 8   DISCHARGE_LOCATION  2000 non-null   object 
 9   INSURANCE           2000 non-null   object 
 10  MARITAL_STATUS      1924 non-null   object 
 11  ETHNICITY           2000 non-null   object 
 12  DIAGNOSIS           1998 non-null   object 
 13  TEXT                1925 non-null   object 
 14  GENDER              2000 non-null   object 
 15  DOB                 2000 non-null   object 
 16  blood 

In [10]:
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 901 entries, 0 to 900
Data columns (total 34 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   SUBJECT_ID          901 non-null    int64  
 1   HADM_ID             901 non-null    int64  
 2   ADMITTIME           901 non-null    object 
 3   DISCHTIME           901 non-null    object 
 4   DAYS_NEXT_ADMIT     526 non-null    float64
 5   NEXT_ADMITTIME      526 non-null    object 
 6   ADMISSION_TYPE      901 non-null    object 
 7   DEATHTIME           58 non-null     object 
 8   DISCHARGE_LOCATION  901 non-null    object 
 9   INSURANCE           901 non-null    object 
 10  MARITAL_STATUS      861 non-null    object 
 11  ETHNICITY           901 non-null    object 
 12  DIAGNOSIS           901 non-null    object 
 13  TEXT                871 non-null    object 
 14  GENDER              901 non-null    object 
 15  DOB                 901 non-null    object 
 16  blood   

#### 3 - <u>Dealing with data leakage:</u>

> **<span style="color:red">The following examples are interesting as they outline data leakage risk we want to contain or remove</span>**.
>  
> Our goal is also to **treat each row individually** so that we reduce the dependencies between rows.

1. It is possible that **a single patient** (i.e. a single **SUBJECT_ID**) **has multiple entries in the dataset**
    - Based on the [information provided by the repository for the MIMIC dataset](https://mimic.physionet.org/mimictables/admissions/), **HADM_ID** identifies a single admission to the hospital and **SUBJECT_ID** identifies a single patient
    - To avoid data leakage, we must **identify features which we will need to exclude** due to leakage
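
A minimal sketch of how repeated patients can be spotted, assuming only the `SUBJECT_ID` and `HADM_ID` columns described above. The toy frames and their values are hypothetical stand-ins for `df_train` and `df_test`. The check for patients shared between the two sets is an additional illustration, not something the original analysis performs.

```python
import pandas as pd

# Toy frames standing in for df_train / df_test (values are hypothetical)
toy_train = pd.DataFrame({"SUBJECT_ID": [11, 17, 17, 808, 808, 808],
                          "HADM_ID": [1, 2, 3, 4, 5, 6]})
toy_test = pd.DataFrame({"SUBJECT_ID": [17, 42],
                         "HADM_ID": [7, 8]})

# Number of admissions (rows) per patient
admissions_per_patient = toy_train.groupby("SUBJECT_ID").size()

# Patients with more than one row in the training set
repeat_patients = admissions_per_patient[admissions_per_patient > 1]

# Patients present in both the training and the testing set
shared_ids = set(toy_train["SUBJECT_ID"]) & set(toy_test["SUBJECT_ID"])

print(repeat_patients)
print(shared_ids)
```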
    
    
<u>Example with subject_id 17:</u>

In [11]:
# we look for the number of times a single patient has been admitted to a hospital. 
# We find that a single patient may have been admitted up to 15 times in the training set

df_train.pivot_table(index = ['SUBJECT_ID'], aggfunc ='size').unique()

array([ 1,  2, 15,  3,  5,  4,  6,  8])

In [12]:
# We identify that patient 17 has been admitted twice

print(df_train.pivot_table(index = ['SUBJECT_ID'], aggfunc ='size').head(2))
df_train[df_train["SUBJECT_ID"]==17]

SUBJECT_ID
11    1
17    2
dtype: int64


Unnamed: 0,SUBJECT_ID,HADM_ID,ADMITTIME,DISCHTIME,DAYS_NEXT_ADMIT,NEXT_ADMITTIME,ADMISSION_TYPE,DEATHTIME,DISCHARGE_LOCATION,INSURANCE,...,mental,misc,muscular,neoplasms,nervous,pregnancy,prenatal,respiratory,skin,OUTPUT_LABEL
1182,17,161087,2135-05-09 14:11:00,2135-05-13 14:40:00,,,EMERGENCY,,HOME HEALTH CARE,Private,...,1,1,2,0,0,0,0,1,0,0
1710,17,194023,2134-12-27 07:15:00,2134-12-31 16:05:00,128.920833,2135-05-09 14:11:00,ELECTIVE,,HOME HEALTH CARE,Private,...,0,0,0,0,0,0,0,0,0,0


2. We need to check if patients who have been readmitted have an equal number of rows (i.e. 1 row = 1 admission). Based on the following data wrangling, it appears that:

     - **Some patients have a mentioned readmission time but do not have multiple lines associated to their case**.
     - Some patients have **missing admissions along with mismatching dates**. 

**Interpretation**: <span style="color:red">The dataset cannot be understood as a time series</span>. As such, **each row** (and its potential readmission) **should be construed as independent from other rows**. 

**Implication**: In terms of data leakage, we **should prepare the data in such a way that no two rows can be linked to each other**
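
The check described in point 2 can be sketched as follows: for each row that announces a readmission, look for a row of the same patient whose `ADMITTIME` equals that `NEXT_ADMITTIME`. The three rows below reuse the values shown in this section for subject_id 17 and 937; the variable names are illustrative assumptions.

```python
import pandas as pd

# Three rows copied from the examples in this section (subject_id 17 and 937)
toy = pd.DataFrame({
    "SUBJECT_ID": [17, 17, 937],
    "ADMITTIME": ["2134-12-27 07:15:00", "2135-05-09 14:11:00", "2163-01-20 18:39:00"],
    "NEXT_ADMITTIME": ["2135-05-09 14:11:00", None, "2163-01-24 09:29:00"],
})

# All (patient, admission time) pairs that exist as rows
known = set(zip(toy["SUBJECT_ID"], toy["ADMITTIME"]))

# Rows that announce a readmission
announced = toy.dropna(subset=["NEXT_ADMITTIME"])

# A readmission is "missing" when no row matches that patient and admission time
is_missing = pd.Series(
    [(sid, nxt) not in known
     for sid, nxt in zip(announced["SUBJECT_ID"], announced["NEXT_ADMITTIME"])],
    index=announced.index,
)
missing = announced[is_missing]

print(missing)
```

On these rows, only the readmission announced for subject_id 937 has no matching admission.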

<u>Example with subject_id 937:</u>

In [13]:
# We find that subject_id 937 has been admitted twice but has only one single record in the training dataset

df_train[df_train["SUBJECT_ID"]==937]

Unnamed: 0,SUBJECT_ID,HADM_ID,ADMITTIME,DISCHTIME,DAYS_NEXT_ADMIT,NEXT_ADMITTIME,ADMISSION_TYPE,DEATHTIME,DISCHARGE_LOCATION,INSURANCE,...,mental,misc,muscular,neoplasms,nervous,pregnancy,prenatal,respiratory,skin,OUTPUT_LABEL
0,937,148592,2163-01-20 18:39:00,2163-01-24 08:00:00,0.061806,2163-01-24 09:29:00,EMERGENCY,2163-01-26 08:00:00,DEAD/EXPIRED,Medicare,...,0,0,0,0,1,0,0,0,0,1


<u>Example with subject_id 808:</u>

In [14]:
# We see that subject_id 808 has three referenced admissions, but the most recent one mentions a next 
# admission that is not referenced

# The admit times are also mismatched: none of the three NEXT_ADMITTIME values matches an ADMITTIME 
# in the set, implying at least one admission is missing from the dataset

df_train[df_train["SUBJECT_ID"]==808]

Unnamed: 0,SUBJECT_ID,HADM_ID,ADMITTIME,DISCHTIME,DAYS_NEXT_ADMIT,NEXT_ADMITTIME,ADMISSION_TYPE,DEATHTIME,DISCHARGE_LOCATION,INSURANCE,...,mental,misc,muscular,neoplasms,nervous,pregnancy,prenatal,respiratory,skin,OUTPUT_LABEL
553,808,197130,2181-11-16 08:18:00,2181-11-23 09:04:00,8.701389,2181-12-02 01:54:00,EMERGENCY,,HOME HEALTH CARE,Private,...,0,0,0,2,0,0,0,3,0,1
1189,808,100677,2181-07-12 20:11:00,2181-07-17 13:14:00,13.395833,2181-07-30 22:44:00,EMERGENCY,,HOME HEALTH CARE,Private,...,0,1,0,3,0,0,0,0,0,1
1995,808,139077,2181-05-11 16:57:00,2181-05-16 11:58:00,13.701389,2181-05-30 04:48:00,EMERGENCY,,HOME,Private,...,0,0,0,0,0,0,0,2,0,1


3. **Some features will leak information about a potential target variable**

<u>Example with subject_id 937:</u>

> We see that DISCHARGE_LOCATION and TEXT hold important data with regards to the outcome of the patient's stay. In the case of subject_id 937 (75 year old man with a history of hypertension), we see that the person died during their care at the hospital and that the mention of their death (term used: dead/expired) is referenced in both columns.

In [15]:
# We find that the cell DISCHARGE_LOCATION holds important data on the fate of the
# patient

df_train[df_train["SUBJECT_ID"]==937]["DISCHARGE_LOCATION"]

0    DEAD/EXPIRED
Name: DISCHARGE_LOCATION, dtype: object

In [16]:
# We find that the cell TEXT holds important data on the fate of the patient:
#     Discharge Disposition:\nExpired\n\nDischarge Diagnosis:\n1.  
#     intraparenchymal hemmorrhage\n\nDischarge Condition:\nexpired

print(df_train[df_train["SUBJECT_ID"]==937]["TEXT"].values)

['Admission Date:  [**2163-1-20**]              Discharge Date:   [**2163-1-24**]\n\nDate of Birth:  [**2087-9-24**]             Sex:   M\n\nService: NEUROLOGY\n\nAllergies:\nPatient recorded as having No Known Allergies to Drugs\n\nAttending:[**First Name3 (LF) 5868**]\nChief Complaint:\ntransfer from ICH with intra-parenchymal bleed\n\nMajor Surgical or Invasive Procedure:\nnone\n\nHistory of Present Illness:\nThe patient is a 75 year old man with a history of hypertension\nand high cholesterol, now presenting on transfer from an OSH\nwith\na large right intraparenchymal cerebral bleed.  As per his\nchart, he originally presented to the OSH with the complaint of\ninability to feel his right leg.  An angiogram of the leg\nuncovered a right femoral artery occlusion and he was given t-\nPA (iv).  The next morning, the patient developed a left\nhemiparesis with left facial droop and a right gaze preference.\nAn emergent CT scan of his brain showed multiple hemorrhages\nprimarily in the r

#### 4 - <u>Selecting the feature variables (X):</u>

Due to data leakage concerns, we decide not to use the following variables as features:

- **DEATHTIME**, **ADMITTIME**, **DISCHTIME**: these raw timestamps carry a data leakage risk (DEATHTIME in particular reveals the outcome of the stay). ADMITTIME and DISCHTIME are only used to derive the Age and Length of stay features listed below
- **TEXT**, **DISCHARGE_LOCATION**: data leakage risk (see note)
- Bag of Words (of **DIAGNOSIS**): we discard the provided Bag of Words as we will be building our own representation

As such, we focus on the following features (X):

- Age (which we will have to construct out of **DOB** and **ADMITTIME**)
- **GENDER**
- **MARITAL_STATUS**
- **ETHNICITY** (<span style="color:red">see note</span>)
- **INSURANCE**
- **ADMISSION_TYPE**
- Length of stay (which we will have to construct out of **DISCHTIME** and **ADMITTIME**)
- **DIAGNOSIS**

<u>Note on **ETHNICITY**:</u>

> It is important to note that **ethnic/racial data is a controversial topic in AI**. The goal is to **avoid racial profiling as well as racial discrimination**, especially in health care.
>
> It happens that **systemic racism and poverty greatly affect minorities in the United States**. We recall that the [MIMIC dataset is a relational database containing tables of data relating to patients who stayed within the intensive care units](https://mimic.physionet.org/gettingstarted/overview/) at [Beth Israel Deaconess Medical Center in Boston, MA, USA](https://en.wikipedia.org/wiki/Beth_Israel_Deaconess_Medical_Center). The hospital is a *private* teaching center attached to the Harvard Medical School. In Massachusetts, [poverty afflicts minorities about twice as much as white people](https://www.welfareinfo.org/poverty-rate/massachusetts/).
>
> As such, ethnicity may have a **strong impact on the quality of a patient's care, their access to insurance, and ultimately their likelihood of readmission**.
>
> Consequently, <span style="color:red">**we will need to see if ethnicity has a strong effect on our end result, and, if possible, whether we can do without it**</span>.

<u>Note on **TEXT** and **DISCHARGE_LOCATION**:</u>

> As we saw in the cell above, TEXT and DISCHARGE_LOCATION may hold important information on the end fate of the patient, meaning we cannot include those features as **it would leak information with regards to the outcome we want to predict**.
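
One way to inspect how strongly a candidate feature is tied to the label is a simple cross-tabulation. The sketch below uses hypothetical toy rows (the pattern they show is not a finding about the MIMIC data) and the column names `DISCHARGE_LOCATION` and `OUTPUT_LABEL` from the dataset.

```python
import pandas as pd

# Hypothetical toy rows; the label pattern is illustrative only
toy = pd.DataFrame({
    "DISCHARGE_LOCATION": ["DEAD/EXPIRED", "HOME", "SNF", "HOME", "DEAD/EXPIRED"],
    "OUTPUT_LABEL": [1, 1, 0, 0, 0],
})

# Group rows by whether the discharge location reports a death
group = toy["DISCHARGE_LOCATION"].eq("DEAD/EXPIRED").map(
    {True: "expired", False: "other"}
)

# Counts of each label per group, and share of positive labels per group
table = pd.crosstab(group, toy["OUTPUT_LABEL"])
rate_by_group = toy.groupby(group)["OUTPUT_LABEL"].mean()

print(table)
print(rate_by_group)
```

A feature whose groups have very different label rates, and whose value is only known at or after discharge, is a candidate for target leakage.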

In [17]:
# We store our target features (or the columns used to build our feature, e.g., "Age" and "Length of Stay") in
# our placeholders

kept_columns = ["DOB", "GENDER", "MARITAL_STATUS", "ETHNICITY", "INSURANCE", 
                "ADMISSION_TYPE", "DIAGNOSIS", "ADMITTIME", "DISCHTIME"]

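# We take copies so that later modifications do not touch (or warn about) the original frames
X_train = df_train[kept_columns].copy()
X_test = df_test[kept_columns].copy()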

In [18]:
X_train.head(1)

Unnamed: 0,DOB,GENDER,MARITAL_STATUS,ETHNICITY,INSURANCE,ADMISSION_TYPE,DIAGNOSIS,ADMITTIME,DISCHTIME
0,2087-09-24 00:00:00,M,,OTHER/UNKNOWN,Medicare,EMERGENCY,INTRACRANIAL HEMORRHAGE,2163-01-20 18:39:00,2163-01-24 08:00:00


#### 5 - <u>Building the feature variables (X):</u>

1. **What about NaN values?**

> As we see below, NaN values only appear in the MARITAL_STATUS and DIAGNOSIS columns. Since we will one-hot encode the former and build a Bag of Words for the latter, we can afford to keep those rows.

2. **LENGTH_OF_STAY** (in days)

> We build our length of stay variable by taking the difference between DISCHTIME and ADMITTIME in days

3. **AGE** (in years)

> We build our age variable by taking the difference between ADMITTIME and DOB in years
>
> Some computed ages are impossible (well above the oldest recorded age for a human), which suggests that the corresponding dates of birth do not reflect real birth dates. We replace those values with the average age of the train dataset, computed without them.

4. **GENDER, MARITAL_STATUS, ETHNICITY, INSURANCE, ADMISSION_TYPE**

> We build one-hot encoding for those variables

5. **DIAGNOSIS**

> We want to build our own Bag of Words representation using the sklearn CountVectorizer object
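
As a complement to the AGE step above, here is a sketch of an age computation that also accounts for whether the birthday has already occurred in the admission year. It works on date components rather than subtracting timestamps, since a direct subtraction may overflow pandas' nanosecond timedelta range (about 292 years) for the implausible dates. The first row reuses the dates shown for the first training row; the second row is hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    # Row 0: dates shown above for the first training row; row 1: hypothetical
    "DOB": ["2087-09-24 00:00:00", "1850-03-01 00:00:00"],
    "ADMITTIME": ["2163-01-20 18:39:00", "2150-02-15 10:00:00"],
})
dob = pd.to_datetime(toy["DOB"])
admit = pd.to_datetime(toy["ADMITTIME"])

# True when the admission happens before the birthday of that year
before_birthday = (admit.dt.month < dob.dt.month) | (
    (admit.dt.month == dob.dt.month) & (admit.dt.day < dob.dt.day)
)

# Year difference, minus one when the birthday has not yet occurred
age = admit.dt.year - dob.dt.year - before_birthday.astype(int)

print(age.tolist())
```

For the first row this gives 75, which matches the "75 year old man" mentioned in the discharge report, whereas a plain year difference gives 76.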

In [19]:
# We find that only the columns MARITAL_STATUS and DIAGNOSIS have NaN values in both
# the training and testing dataset.

print(X_train.isnull().sum(),
      X_test.isnull().sum(),
      sep="\n\n")

DOB                0
GENDER             0
MARITAL_STATUS    76
ETHNICITY          0
INSURANCE          0
ADMISSION_TYPE     0
DIAGNOSIS          2
ADMITTIME          0
DISCHTIME          0
dtype: int64

DOB                0
GENDER             0
MARITAL_STATUS    40
ETHNICITY          0
INSURANCE          0
ADMISSION_TYPE     0
DIAGNOSIS          0
ADMITTIME          0
DISCHTIME          0
dtype: int64


In [20]:
# LENGTH_OF_STAY

# 1. convert dates to datetime
# 2. calculate the float value timedelta (in days)
X_train["ADMITTIME"] = pd.to_datetime(X_train["ADMITTIME"])
X_train["DISCHTIME"] = pd.to_datetime(X_train["DISCHTIME"])
X_train["LENGTH_OF_STAY"] = X_train["DISCHTIME"] - X_train["ADMITTIME"]
X_train["LENGTH_OF_STAY"] = X_train["LENGTH_OF_STAY"].dt.total_seconds() / (24 * 60 * 60)

X_test["ADMITTIME"] = pd.to_datetime(X_test["ADMITTIME"])
X_test["DISCHTIME"] = pd.to_datetime(X_test["DISCHTIME"])
X_test["LENGTH_OF_STAY"] = X_test["DISCHTIME"] - X_test["ADMITTIME"]
X_test["LENGTH_OF_STAY"] = X_test["LENGTH_OF_STAY"].dt.total_seconds() / (24 * 60 * 60)

# We drop DISCHTIME as it is no longer needed (ADMITTIME is still needed for AGE)
X_train.drop(["DISCHTIME"], axis = 1, inplace = True)
X_test.drop(["DISCHTIME"], axis = 1, inplace = True)

In [21]:
# AGE

# 1. extract the year from each date
# 2. compute the age as the difference between the admission year and the birth year
X_train["DOB"] = pd.to_datetime(X_train["DOB"]).dt.year
X_train["ADMITTIME"] = X_train["ADMITTIME"].dt.year
X_train["AGE"] = X_train["ADMITTIME"] - X_train["DOB"]

X_test["DOB"] = pd.to_datetime(X_test["DOB"]).dt.year
X_test["ADMITTIME"] = X_test["ADMITTIME"].dt.year
X_test["AGE"] = X_test["ADMITTIME"] - X_test["DOB"]

# We drop ADMITTIME and DOB as they are no longer needed
X_train.drop(["ADMITTIME", "DOB"], axis = 1, inplace = True)
X_test.drop(["ADMITTIME", "DOB"], axis = 1, inplace = True)

We see that some ages are impossible (300 years and above). **In total, 119 rows of the training set are affected**, i.e. more than 5% of it, which is too much to simply throw away. This is most likely a de-identification artifact rather than a recording error: MIMIC-III shifts the date of birth of patients older than 89 so that they appear to be about 300 years old.

We deal with those rows by **replacing the impossible values with the average age of the rest of the training set** (i.e. the mean of all ages that are not impossible).

In [22]:
# Some calculated ages are well above possible values

print(sorted(X_train["AGE"].unique()),
      sorted(X_test["AGE"].unique()),
      sep="\n\n")

[0, 1, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 300, 301, 302, 303, 305, 306, 307, 308, 310]

[0, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 300, 301, 302, 303, 305, 308]


In [23]:
# There are 119 age values above 122 (the oldest recorded human age)

len(X_train[X_train["AGE"]>122])

119

In [24]:
# We calculate the average age of the rest of the train dataset
# We replace the wrong age value with the calculated average

average_age = X_train[X_train["AGE"]<=89]["AGE"].mean()
print(average_age)

X_train.loc[(X_train.AGE > 89), 'AGE'] = average_age
X_test.loc[(X_test.AGE > 89), 'AGE'] = average_age

62.22328548644338


In [26]:
# One-Hot encoding of the following columns:
# GENDER, MARITAL_STATUS, ETHNICITY, INSURANCE, ADMISSION_TYPE

dummy_list = ["GENDER", "MARITAL_STATUS", "ETHNICITY", "INSURANCE", "ADMISSION_TYPE"]

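X_train = pd.get_dummies(X_train, columns = dummy_list)
X_test = pd.get_dummies(X_test, columns = dummy_list)

# A category may appear in only one of the two sets: we align the test columns on the
# training columns (missing indicators are filled with 0, unseen ones are dropped)
X_test = X_test.reindex(columns = X_train.columns, fill_value = 0)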

In [27]:
# DIAGNOSIS

# Pre-processing the content of the DIAGNOSIS column
X_train["DIAGNOSIS"] = X_train["DIAGNOSIS"].apply(sentence_processing)
X_test["DIAGNOSIS"] = X_test["DIAGNOSIS"].apply(sentence_processing)

In [28]:
# Applying CountVectorizer to the pre-processed DIAGNOSIS column

#      We declare and fit the CountVectorizer object (on the training set only)
cv = CountVectorizer(analyzer="word", ngram_range=(1,1), stop_words="english")
cv.fit(X_train["DIAGNOSIS"].tolist())

#      We retrieve the vocabulary (get_feature_names_out requires scikit-learn >= 1.0;
#      older versions expose get_feature_names instead)
diagnosis_features = list(cv.get_feature_names_out())

#      We transform the diagnosis column using the count vectorizer
tf = lambda s: cv.transform([s]).todense().tolist()[0]
X_train["DIAGNOSIS"] = X_train["DIAGNOSIS"].apply(tf)
X_test["DIAGNOSIS"] = X_test["DIAGNOSIS"].apply(tf)

#      We expand the resulting feature matrix into individual columns
X_train[diagnosis_features] = pd.DataFrame(X_train["DIAGNOSIS"].tolist(), 
                                           index= X_train.index)
X_test[diagnosis_features] = pd.DataFrame(X_test["DIAGNOSIS"].tolist(), 
                                          index= X_test.index)

#      We drop the DIAGNOSIS column
X_train.drop(["DIAGNOSIS"], axis=1, inplace=True)
X_test.drop(["DIAGNOSIS"], axis=1, inplace=True)

In [29]:
# We look at the first 100 elements of the feature_names list of the Count Vectorizer
# object

print(f"Number of diagnosis features: {len(cv.get_feature_names_out())}.",
      list(cv.get_feature_names_out()[:100]),
      sep="\n\n")

Number of diagnosis features: 819.

['1st', '21', '22', 'abcess', 'abd', 'abdcess', 'abdomal', 'abdomen', 'abdominal', 'ablation', 'abscess', 'abuse', 'accending', 'access', 'accident', 'account', 'achalasia', 'acidosis', 'acitic', 'acsites', 'acute', 'advancement', 'afib', 'aicd', 'air', 'airway', 'alcohol', 'als', 'altered', 'aml', 'anasarca', 'anemia', 'aneursym', 'aneurysm', 'angina', 'angio', 'angiogram', 'angioplasty', 'ankle', 'anomaly', 'anterior', 'antibiotic', 'anticholinergic', 'aorta', 'aortic', 'appendicitis', 'approach', 'ar', 'arachnoid', 'arch', 'arf', 'arrest', 'arterial', 'artery', 'ascending', 'ascites', 'aspiration', 'assault', 'asthma', 'asthmaticus', 'ataxia', 'atriacure', 'atrial', 'atrioventricular', 'attach', 'attack', 'aureus', 'av', 'avascular', 'avr', 'axillo', 'bacteremia', 'bacterial', 'benign', 'bental', 'bentall', 'benzodiazepine', 'bi', 'bifemoral', 'bilateral', 'bile', 'bili', 'biliary', 'biventricular', 'bladder', 'bled', 'bleed', 'bleeding', 'block',

In [35]:
# print(X_train.head(1), X_test.head(1), X_train.dtypes, X_test.dtypes, sep="\n\n")

#### 6 - <u>Building the target variable (Y):</u>
    
We want to predict whether a patient will be re-hospitalized. The question is then **how to represent re-hospitalization**?

Two approaches are possible:

- **regression**: Predicting the number of days between discharge and readmission for a patient

    - We can predict the number of days between discharge and readmission using the DAYS_NEXT_ADMIT feature that is available to us
    - The main issue of DAYS_NEXT_ADMIT is how to represent the absence of readmission (NaN in the dataset)


- **classification**: Predicting if a patient will **i)** be readmitted at some point, **ii)** die, **iii)** be discharged without readmission

    - We can assign a tag to each of the scenarios above which will be used to perform classification
    
With regard to constructing our y values, <span style="color:red">we find that some elements are problematic</span>. For instance, the row for subject_id 937 shows both a next admission time and a death time. A quick check shows that there are several cases like this, each involving a SUBJECT_ID that appears only once in the list.

In [31]:
df_train[df_train["SUBJECT_ID"]==937]

Unnamed: 0,SUBJECT_ID,HADM_ID,ADMITTIME,DISCHTIME,DAYS_NEXT_ADMIT,NEXT_ADMITTIME,ADMISSION_TYPE,DEATHTIME,DISCHARGE_LOCATION,INSURANCE,...,mental,misc,muscular,neoplasms,nervous,pregnancy,prenatal,respiratory,skin,OUTPUT_LABEL
0,937,148592,2163-01-20 18:39:00,2163-01-24 08:00:00,0.061806,2163-01-24 09:29:00,EMERGENCY,2163-01-26 08:00:00,DEAD/EXPIRED,Medicare,...,0,0,0,0,1,0,0,0,0,1


In [32]:
df_train[["SUBJECT_ID", "HADM_ID", "DEATHTIME", "NEXT_ADMITTIME"]].dropna()

Unnamed: 0,SUBJECT_ID,HADM_ID,DEATHTIME,NEXT_ADMITTIME
0,937,148592,2163-01-26 08:00:00,2163-01-24 09:29:00
507,6912,143307,2196-09-09 08:00:00,2196-09-08 11:37:00
552,9998,144947,2173-06-15 22:00:00,2173-06-14 12:00:00
579,8818,156627,2135-08-19 12:00:00,2135-08-19 14:08:00
722,4791,166578,2157-02-27 05:18:00,2157-02-27 10:59:00
846,23843,177112,2144-01-25 23:07:00,2144-01-25 08:40:00
1173,7880,172698,2165-12-01 08:00:00,2165-12-01 15:09:00
1190,11740,137487,2154-02-27 08:00:00,2154-02-27 08:43:00
1217,19617,127959,2127-04-04 12:00:00,2127-04-04 15:07:00
1313,11519,134459,2195-11-28 17:17:00,2195-11-28 15:54:00


In [33]:
df_train[df_train.isnull().any(axis=1)]

Unnamed: 0,SUBJECT_ID,HADM_ID,ADMITTIME,DISCHTIME,DAYS_NEXT_ADMIT,NEXT_ADMITTIME,ADMISSION_TYPE,DEATHTIME,DISCHARGE_LOCATION,INSURANCE,...,mental,misc,muscular,neoplasms,nervous,pregnancy,prenatal,respiratory,skin,OUTPUT_LABEL
0,937,148592,2163-01-20 18:39:00,2163-01-24 08:00:00,0.061806,2163-01-24 09:29:00,EMERGENCY,2163-01-26 08:00:00,DEAD/EXPIRED,Medicare,...,0,0,0,0,1,0,0,0,0,1
1,3016,159142,2107-01-23 02:45:00,2107-01-26 14:00:00,,,EMERGENCY,,HOME HEALTH CARE,Medicare,...,2,0,0,0,0,0,0,1,0,0
2,2187,186282,2134-06-24 23:30:00,2134-07-02 17:45:00,,,EMERGENCY,,REHAB/DISTINCT PART HOSP,Medicaid,...,1,2,1,0,3,0,0,4,0,0
3,19213,140312,2202-11-02 12:32:00,2202-11-05 14:20:00,12.968056,2202-11-18 13:34:00,EMERGENCY,,HOME,Medicare,...,0,0,0,0,0,0,0,1,1,1
4,425,118058,2149-05-13 12:23:00,2149-05-26 20:00:00,,,EMERGENCY,,HOME HEALTH CARE,Medicare,...,0,0,0,0,0,0,0,2,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1995,808,139077,2181-05-11 16:57:00,2181-05-16 11:58:00,13.701389,2181-05-30 04:48:00,EMERGENCY,,HOME,Private,...,0,0,0,0,0,0,0,2,0,1
1996,698,171990,2167-12-23 03:24:00,2167-12-31 14:08:00,,,EMERGENCY,,SNF,Medicare,...,2,0,1,0,0,0,0,1,0,0
1997,58821,179166,2176-02-06 21:05:00,2176-02-15 13:39:00,7.473611,2176-02-23 01:01:00,EMERGENCY,,SNF,Medicare,...,0,0,2,0,0,0,0,0,0,1
1998,1308,127034,2134-02-21 15:52:00,2134-02-27 14:09:00,,,EMERGENCY,,SNF,Medicare,...,0,0,0,0,1,0,0,2,0,0
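
The two target options discussed in this section can be sketched as follows. The four `DAYS_NEXT_ADMIT` / `OUTPUT_LABEL` pairs are copied from rows displayed earlier in this notebook. The 30-day window is an assumption: the source does not state how `OUTPUT_LABEL` was derived, and the sketch only checks whether that assumption is consistent with these rows.

```python
import pandas as pd

# Values copied from rows displayed earlier (subject_id 937, 3016, 19213, 17)
toy = pd.DataFrame({
    "DAYS_NEXT_ADMIT": [0.061806, None, 12.968056, 128.920833],
    "OUTPUT_LABEL": [1, 0, 1, 0],
})

# Classification target: the provided label, used as-is
y_class = toy["OUTPUT_LABEL"]

# Assumption (not stated in the source): the label flags a readmission within 30 days
WINDOW_DAYS = 30
derived = (toy["DAYS_NEXT_ADMIT"] < WINDOW_DAYS).astype(int)  # NaN compares as False

# Share of rows where the derived label matches the provided one
agreement = (derived == y_class).mean()

# Regression target: only rows with an observed readmission have a value
y_reg = toy["DAYS_NEXT_ADMIT"].dropna()

print(agreement, len(y_reg))
```

Whichever target is chosen, `DAYS_NEXT_ADMIT` and `NEXT_ADMITTIME` describe the outcome itself and must stay out of the feature matrix.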


## 3. Modeling

### 3.1 Logistic Regression

### 3.2 KNN

### 3.3 Naive Bayes

### 3.4 Random Forest

## 4. Exploring the hyperparameters of the best model

## 5. Conclusion