# TP 1
---
---
**Alan Balendran** <br> **Celine Beji** <br> **Etienne Peyrot** <br>**François Grolleau**<br>**Raphaël Porcher**<br> 


[Centre de Recherche en Epidémiologie et Statistiques (CRESS) - Equipe METHODS](https://cress-umr1153.fr/fr/teams/methods/)<br> 


The goals of this practical session are to:
- Explain the basics of data exploration in Python using a dataset of ICU patients
- Explain how to train machine learning algorithms and predict on new data

The idea is to show the main ideas and not necessarily to go into detail.

If you have any questions: 
- Option 1 (9am-5pm): alan.balendran@u-paris.fr 
- Option 2 (24/7): https://chat.openai.com/ 
- Option 3 (24/7): https://chat.mistral.ai/

### What is Python?

- Versatile programming language
- Easy to read and understand
- Comes with a vast collection of libraries like *NumPy*, *Pandas*, and *Scikit-learn*. These libraries simplify complex tasks such as data analysis, scientific computation, data visualization, machine learning, etc.

### What is a Notebook?

Interactive, open-source web application that allows you to create and share documents containing code, equations, visualizations, and narrative text.

- **Code and Markdown Cells**: Jupyter Notebooks consist of cells that can contain either code or formatted text using Markdown. This makes it easy to combine code and explanations in one document.

- **Real-time Execution**: Code cells can be executed in real-time, allowing for an interactive and iterative coding experience. Results and visualizations are displayed right beneath the code cell.

- **Rich Visualizations**: Jupyter supports inline plotting with libraries like Matplotlib, allowing you to create and visualize graphs and charts directly within the notebook.

- **Shareability**: Notebooks can be easily shared and can be exported to various formats, including HTML, PDF, and slides (like this one!).

## A really quick introduction to Python

In every programming language, data is stored in variables. One of the advantages of Python is that you don't need to declare the type of a variable: it is inferred from the value you assign.

In [None]:
# Variables
patient_name = "Phil Good" ## String
age = 25 ## Integer
temperature = 37.5 ## Float
first_hospitalization = True ## Boolean
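The types noted in the comments above can be checked with the built-in `type()` function. The snippet below is a minimal sketch that redefines the same variables so it runs on its own; the variable `x` is only there for illustration.

```python
# Same variables as above, redefined so the snippet is self-contained
patient_name = "Phil Good"
age = 25
temperature = 37.5
first_hospitalization = True

# type() returns the type Python inferred from the assigned value
for value in (patient_name, age, temperature, first_hospitalization):
    print(value, "->", type(value).__name__)

# A variable name is not tied to a type: it can be rebound to another kind of value
x = 25
print(type(x).__name__)
x = "twenty-five"
print(type(x).__name__)
```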

How to run a cell in Google Colab (and in a Jupyter notebook):

![](https://github.com/AL0UNE/courses/blob/main/figures/run_a_cell.png?raw=true)

Alternatively, you can select the cell that you want to run and then use the shortcut ``Ctrl + Enter``.

You can also print the value stored in each variable.

In [None]:
# Printing one by one
print("Patient Name:", patient_name)
print("Age:", age)
print("Temperature:", temperature)
print("First hospitalization:", first_hospitalization)

In [None]:
# Printing all together
print(patient_name, 'is', age, 'and has a body temperature of', temperature, '°C and first hospitalization status is', first_hospitalization)

Some values, such as decimal numbers, can be rounded before printing:

In [None]:
print(1/3)
print(round(1/3, 2))

You can perform basic operations with numerical values:

In [None]:
height = 170
weight = 60
bmi = weight/((height/100)**2)
print('BMI:', bmi)


<b>Exercise:</b>

- Calculate the BMI for a patient with the following characteristics:
    - Weight: 132.277 lb 
    - Height: 66.9291 inches
    
The formula for BMI is given by: 
$$\text{BMI} = \frac{\text{weight (kg)}}{(\text{height (m)})^2}= \frac{\text{weight (lb)}}{(\text{height (in)})^2} \times 703$$

- Print the results with 2 decimals using only the `round` function.

**Lists** are mutable, heterogeneous data structures.<br>
- A heterogeneous data structure can store multiple types of data (e.g., ``bool``,``int``,``float``, ...)<br>
- A data structure is mutable if it can be modified (adding, deleting or modifying an element)

In [None]:
# Lists

symptoms = ["fever", "cough", "daeth", "fatigue"]
heights = [160, 165, 170, 175]
patient_info = [patient_name, age, temperature]

print(symptoms)
print(heights)
print(patient_info)

In [None]:
# Accessing List Elements
print("First height:", heights[0])
print("Last Symptom:", symptoms[-1])

In [None]:
# Adding Elements to a List
symptoms.append("headache")
print("Updated Symptoms:", symptoms)

In [None]:
# Modify Elements of a List
symptoms[2] = "death"
print("Updated Symptoms:", symptoms)

In [None]:
# Delete Elements of a List
del symptoms[2]
print("Updated Symptoms:", symptoms)

A **dictionary** in Python is a mutable data structure that stores **key-value pairs**. Compared to lists, dictionaries are designed for efficient retrieval of values based on their associated keys.
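To make the contrast with lists concrete, here is a small sketch using made-up patient identifiers (the names `temperatures_list` and `temperatures_dict` and all values are purely illustrative). With a list of pairs we have to scan until we find the patient, whereas a dictionary returns the value directly from its key.

```python
# Hypothetical data: (patient id, temperature) pairs
temperatures_list = [("p01", 37.5), ("p02", 38.2), ("p03", 36.9)]

# With a list, we have to look at the elements one by one
found = None
for patient_id, temp in temperatures_list:
    if patient_id == "p02":
        found = temp
        break
print("From the list:", found)

# With a dictionary, the key gives direct access to the value
temperatures_dict = dict(temperatures_list)
print("From the dictionary:", temperatures_dict["p02"])

# .get() returns a default value instead of raising a KeyError for a missing key
print(temperatures_dict.get("p99", "unknown patient"))
```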

In [None]:
patient_info = {
    'name': patient_name,
    'age': age,
    'temperature': temperature
}

patient_info['height'] = 175

print(patient_info)

Dictionaries cannot be accessed by position. Only a key returns the corresponding value, so the following cell raises a `KeyError` (there is no key `0`):

In [None]:
patient_info[0]

In [None]:
print(patient_info['name'])
print(patient_info['age'])
print(patient_info['temperature'])

A **for loop** can be used to iterate over a collection of items (e.g., lists, dictionaries).

In [None]:
for symptom in symptoms: ## list
    print("Patient has", symptom)

In [None]:
for key, value in patient_info.items(): ## dictionary
    print(key, ":", value)

Operations using lists are limited:

In [None]:
heights = [160, 165, 170, 175]
weights = [70, 75, 80, 85]
print(heights+weights)

In [None]:
print(heights/weights)
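The `+` operator concatenates two lists rather than adding their elements, and `/` is not defined for lists at all. Staying in plain Python, an element-wise operation needs an explicit loop, for instance a list comprehension over `zip`. A minimal sketch (the name `ratios` is only illustrative):

```python
heights = [160, 165, 170, 175]
weights = [70, 75, 80, 85]

# zip pairs the i-th height with the i-th weight
ratios = [h / w for h, w in zip(heights, weights)]
print(ratios)

# Dividing the two lists directly raises a TypeError
try:
    heights / weights
except TypeError as error:
    print("TypeError:", error)
```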

NumPy arrays are useful objects for performing numerical operations on the values contained in lists!<br>
The `numpy` library provides a homogeneous data structure called an **array**, which is especially useful for mathematical operations. 

In [None]:
import numpy as np

# Creating NumPy Arrays
heights = np.array([160, 165, 170, 175])
weights = np.array([70, 75, 80, 85])

bmi = weights / ((heights / 100) ** 2)
print("BMI:", bmi)

### Functions

In Python, a **function** is a block of reusable code that performs a specific task. It encapsulates a set of instructions that can be called and executed multiple times. A function can take arguments to be used inside the function. <br>Functions help in organizing code, promoting reusability, and improving readability. <br> <br>
Let's see an example:

Let's say we want to print a patient's information except their name:

In [None]:
patient_1 = {
    'name': "patient one",
    'age': 26,
    'temperature': 37.5
}

In [None]:
for key, value in patient_1.items():
    # Check if the key is 'name'; if not, print the key-value pair
    if key != "name":
        print(key, ":", value)

Then, for a new patient we would have to simply copy-paste the same code and change the patient dictionary:

In [None]:
patient_2 = {
    'name': "patient two",
    'age': 42,
    'temperature': 37.2
}

In [None]:
for key, value in patient_2.items():
    # Check if the key is 'name'; if not, print the key-value pair
    if key != "name":
        print(key, ":", value)

Instead, we could write a function that takes a dictionary and prints all the information contained in it except the name.

In [None]:
def print_anonymised_info(patient_dict, anonymised=True):

    for key, value in patient_dict.items():
        # Skip the 'name' key when anonymised is True; print everything else
        if key != "name" or not anonymised:
            print(key, ":", value)
    
    return None

In [None]:
print_anonymised_info(patient_1)

In [None]:
print_anonymised_info(patient_2)

In [None]:
print_anonymised_info(patient_2, anonymised=False)

### Libraries

Python comes with a vast collection of libraries like NumPy, Pandas, and Scikit-learn. These libraries simplify complex tasks such as data analysis, scientific computation, data visualization, machine learning, etc. To use a function implemented in a library we simply need to import it.<br>

Let's import the `numpy` library and use its `np.random.randn` function, which generates random values from a standard normal distribution.

In [None]:
import numpy as np 

np.random.randn(10)

Let's now move on to our real dataset!

Let's first import common python libraries.

In [None]:
import numpy as np ## for numerical operations
import pandas as pd ## for data analysis and manipulation
import matplotlib.pyplot as plt ## for visualization (basic)
import seaborn as sns ## for visualization (advanced)

In [None]:
pd.set_option('future.no_silent_downcasting', True) ## hide warning message

Let's load our dataset. 

In [None]:
url_data = 'https://raw.githubusercontent.com/AL0UNE/courses/refs/heads/main/icu_training.csv'

data = pd.read_csv(url_data, index_col=0)

![](https://github.com/AL0UNE/courses/blob/main/figures/initial_data.png?raw=true)

- This dataset is a synthetic subset of the [MIMIC-III](https://mimic.mit.edu/) dataset. <br>
(Johnson, A., Pollard, T., Shen, L. *et al.* MIMIC-III, a freely accessible critical care database. *Sci Data* **3**, 160035 (2016)) 
- It contains clinical data of patients admitted to the ICU.
- Our goal is to predict for each patient the risk of hospital mortality.

A good reflex to have is to print the first lines of the dataframe.

In [None]:
data.head()

We can see that the data has been read correctly.

We can print its dimensions using the `.shape` attribute of a pandas dataframe.

In [None]:
print(data.shape)

Our dataset consists of 5000 rows and 29 columns.

Let's have a look at the different columns of our dataset.

In [None]:
print(data.columns.values)

- **Outcome variable $Y$** (what we want to predict)<br>
    - **hospital_mortality**: Binary value where 1 indicates that the patient died during their hospital stay and 0 means that the patient survived<br>

In [None]:
data['hospital_mortality'].head()

- **Covariates $X$** (features that can be used to predict the outcome variable)<br>

#### Demographic and Laboratory Tests

| Variable | Description |
| :--- | :--- |
| **subject_id** | Patient ID. |
| **age** | Age of the patient. |
| **gender** | Sex of the patient (F/M). |
| **weight** | Weight in kilograms (kg). |
| **height** | Height in centimeters (cm). |
| **first_icu_stay**| Indicates if it is the patient's first stay in the ICU. |
| **CURR_CAREUNIT_transfers**| Current care unit: MICU (Medical ICU) or SICU (Surgical ICU). |
| **bun_min** | Minimum urea (mM) in the first 24 hours. |
| **hemoglobin_max**| Maximum hemoglobin (g/dL) in the first 24 hours. |
| **lactate_max** | Maximum lactate (mM) in the first 24 hours. |
| **creatinine_max**| Maximum creatinine (mg/dL) in the first 24 hours. |
| **ptt_max** | Maximum partial thromboplastin time in the first 24 hours. |

#### SAPS II Score Variables
*Variables used in the [SAPS II score](https://jamanetwork.com/journals/jama/article-abstract/409979). The score is assessed by clinicians on ICU admission. High scores correspond to greater severity for the patient. [Tools](https://www.mdcalc.com/calc/4044/simplified-acute-physiology-score-saps-ii#use-cases) have been developed to easily compute the score.*

| Variable | Description |
| :--- | :--- |
| **age_score** | Score derived from age. |
| **hr_score** | Score derived from heart rate. |
| **sysbp_score** | Score derived from systolic blood pressure. |
| **temp_score** | Score derived from temperature. |
| **pao2fio2_score**| Score derived from the PaO2/FiO2 ratio (arterial oxygen partial pressure over fraction of inspired oxygen). |
| **uo_score** | Score derived from urine output. |
| **bun_score** | Score derived from urea nitrogen present in the blood. |
| **wbc_score** | Score derived from leucocytes (white blood cells) values. |
| **potassium_score**| Score derived from potassium values. |
| **sodium_score** | Score derived from sodium values. |
| **bicarbonate_score**| Score derived from bicarbonate values. |
| **bilirubin_score**| Score derived from bilirubin values. |
| **gcs_score** | Score derived from the Glasgow Coma Scale. |
| **comorbidity_score**| Score derived from chronic disease. |
| **admissiontype_score**| Score derived from admission type. |

You can access a specific column in a DataFrame by using square brackets and specifying the column name, like this: `data['column name']`.

In [None]:
data['lactate_max']

You can also access multiple columns in a DataFrame by using a list of column names.

In [None]:
data[['lactate_max', 'hospital_mortality']]

<b>Exercise:</b>

- Find and print the corresponding columns for each of the following variables in the DataFrame:
    - *Heart rate score*
    - *Age*
    - *Gender*
    - *First stay in the ICU*

- Describe the characteristics of each variable based on their column values.

Basic statistics, such as the *mean, minimum*, and *maximum*, can be calculated in pandas using the appropriate methods.

For instance: 

`data['age'].min()` will return the minimum age of the dataset whereas `data['age'].max()` will return the maximum.

In [None]:
print("The youngest patient in our dataset is ", data['age'].min(), " years old.")

In [None]:
print("The oldest patient in our dataset is ", data['age'].max(), " years old.")

<b>Exercise:</b>
    
- Find the highest and lowest values of the variable *hemoglobin_max* in the dataset.
- Calculate the average height in the dataset.
- Calculate the median lactate value in the dataset.

You can also create a subset of your dataset by applying filters based on specific conditions.
- Use the standard comparison operators `<`, `<=`, `>`, `>=`, `==`, `!=`.
- Combine two conditions with **AND** using the `&` operator.
- Combine two conditions with **OR** using the `|` operator.

A filter is a sequence of booleans (`True`, `False`) whose length equals the number of rows of the dataset it is applied to. If the `i`th element of the filter is `True`, the `i`th row of the dataset is kept; if it is `False`, the row is discarded.
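The mechanism can be seen on a toy dataframe. The sketch below uses made-up values (not the ICU dataset), and the names `toy`, `age_filter` and `both` are only illustrative.

```python
import pandas as pd

# Toy dataframe with made-up values
toy = pd.DataFrame({
    "age": [18, 35, 52, 27],
    "first_icu_stay": [True, False, True, True],
})

# A comparison returns one boolean per row
age_filter = toy["age"] >= 20
print(age_filter.tolist())

# Indexing with the filter keeps only the rows where it is True
print(toy[age_filter])

# Each condition is wrapped in parentheses before being combined with & or |
both = (toy["age"] >= 20) & (toy["first_icu_stay"] == True)
print(toy[both])
```

The parentheses matter: `&` and `|` bind more tightly than comparison operators such as `>=` and `==`, so an unparenthesized condition is not evaluated the way it reads.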

In [None]:
## Selecting only patients' first ICU stay:
data[data["first_icu_stay"]==True]

In [None]:
## Selecting patients between 20 and 40:
data[(data['age']>=20) & (data['age']<=40)]

Which subset of the data is returned using the following conditions?

In [None]:
data[((data['age']>=20) & (data['age']<=40)) | (data['first_icu_stay']==True)]['height'].max()

<b>Exercise:</b>
    
- How many patients have a lactate above 1.9 mM or a creatinine value above 1.2 mg/dL? How many of those patients died in the hospital?
- Calculate the average height for both men and women.
- Identify the patient with the highest number of visits. <br>(*Hint: **(1)** Try to find which variable corresponds to the patient identifier and **(2)** Use the `value_counts()` method*)    

<b>Exercise:</b>

- Which of the following methods will compute the basic statistics on the whole dataset?
    - [data.plot()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html)
    - [data.sample()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sample.html)
    - [data.describe()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html)
    - [data.count()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.count.html)

## Data visualization


While descriptive statistics offer insights into the data, they must be interpreted cautiously. Data visualization, however, can reveal valuable insights that descriptive statistics may overlook.

**Anscombe's quartet** serves as a perfect example to illustrate the significance of data visualization.

Anscombe's quartet is composed of 4 different datasets, each containing 11 data points with two coordinates. Let's have a look:

|    |   X1 |    Y1 |    |    |   X2 |    Y2 |    |    |   X3 |    Y3 |    |    |   X4 |    Y4 |
|----|-----:|------:|----|----|-----:|------:|----|----|-----:|------:|----|----|-----:|------:|
||   10 |  8.04 |    |    |   10 |  9.14 |    |    |   10 |  7.46 |    |    |    8 |  6.58 |
||    8 |  6.95 |    |    |    8 |  8.14 |    |    |    8 |  6.77 |    |    |    8 |  5.76 |
||   13 |  7.58 |    |    |   13 |  8.74 |    |    |   13 | 12.74 |    |    |    8 |  7.71 |
||    9 |  8.81 |    |    |    9 |  8.77 |    |    |    9 |  7.11 |    |    |    8 |  8.84 |
||   11 |  8.33 |    |    |   11 |  9.26 |    |    |   11 |  7.81 |    |    |    8 |  8.47 |
||   14 |  9.96 |    |    |   14 |  8.10 |    |    |   14 |  8.84 |    |    |    8 |  7.04 |
||    6 |  7.24 |    |    |    6 |  6.13 |    |    |    6 |  6.08 |    |    |    8 |  5.25 |
||    4 |  4.26 |    |    |    4 |  3.10 |    |    |    4 |  5.39 |    |    |   19 | 12.50 |
||   12 | 10.84 |    |    |   12 |  9.13 |    |    |   12 |  8.15 |    |    |    8 |  5.56 |
||    7 |  4.82 |    |    |    7 |  7.26 |    |    |    7 |  6.42 |    |    |    8 |  7.91 |
||    5 |  5.68 |    |    |    5 |  4.74 |    |    |    5 |  5.73 |    |    |    8 |  6.89 |


At first glance, everything appears normal. Let's proceed by printing some descriptive statistics:

|      |   X1 |   X2 |   X3 |   X4 |    |    |   Y1 |   Y2 |   Y3 |   Y4 |
|------|-----:|-----:|-----:|-----:|----|----|-----:|-----:|-----:|-----:|
| mean |  9.0 |  9.0 |  9.0 |  9.0 |    |    |  7.5 |  7.5 |  7.5 |  7.5 |
| std  |  3.32|  3.32|  3.32|  3.32|    |    |  2.03|  2.03|  2.03|  2.03|
| min  |  4.0 |  4.0 |  4.0 |  8.0 |    |    |  4.26|  3.10|  5.39|  5.25|
| 25%  |  6.5 |  6.5 |  6.5 |  8.0 |    |    |  6.31|  6.70|  6.25|  6.17|
| 50%  |  9.0 |  9.0 |  9.0 |  8.0 |    |    |  7.58|  8.14|  7.11|  7.04|
| 75%  | 11.5 | 11.5 | 11.5 |  8.0 |    |    |  8.57|  8.95|  7.98|  8.19|
| max  | 14.0 | 14.0 | 14.0 | 19.0 |    |    | 10.84|  9.26| 12.74| 12.50|


Once more, everything seems normal. The datasets exhibit similar statistics, such as mean and standard deviation, and their values are within the same order of magnitude. At this point, we might consider fitting a regression model to predict $Y$ using $X$. However, to ensure that everything is normal, let's visualize the data points:

![](https://github.com/AL0UNE/courses/blob/main/figures/Quartet_anscombe.png?raw=true)

What is even more surprising is that fitting a **linear regression** to any of the four datasets yields nearly identical regression lines (approximately $y = 3 + 0.5x$)!

![](https://github.com/AL0UNE/courses/blob/main/figures/Quartet_anscombe_with_linear_reg.png?raw=true)
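The near-identical fits can be checked numerically. The sketch below hard-codes the values published by Anscombe (1973) and fits a straight line to each dataset with `np.polyfit`; the names `quartet` and `fits` are illustrative.

```python
import numpy as np

# Anscombe's quartet: datasets I to III share the same x values
x123 = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5])
x4 = np.array([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8])

quartet = {
    "I": (x123, np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])),
    "II": (x123, np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])),
    "III": (x123, np.array([7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73])),
    "IV": (x4, np.array([6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89])),
}

fits = {}
for name, (x, y) in quartet.items():
    # Least-squares straight line: y = intercept + slope * x
    slope, intercept = np.polyfit(x, y, deg=1)
    fits[name] = (slope, intercept)
    print(f"Dataset {name}: mean x = {x.mean():.2f}, mean y = {y.mean():.2f}, "
          f"fit: y = {intercept:.2f} + {slope:.3f} x")
```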

This illustrative example highlights the importance of conducting exploratory data analysis before developing any machine learning models. <br>Let's now explore some of the fundamental visualization plots.

**Categorical data**

In [None]:
sns.countplot(data=data, x='hospital_mortality');

**Continuous data**

In [None]:
sns.histplot(data=data, x='hemoglobin_max');

**Categorical/Categorical**

In [None]:
sns.countplot(data=data, x='hospital_mortality', hue='gender');

**Continuous/Continuous**

In [None]:
sns.scatterplot(data=data, x='creatinine_max', y='bun_min', alpha=.05);

**Categorical/Continuous**

In [None]:
sns.boxplot(data=data, x='hospital_mortality', y='age');

**Adding a third information**

In [None]:
sns.scatterplot(data=data, x='creatinine_max', y='bun_min', hue='hospital_mortality', alpha=0.3);

<b>Exercise:</b> 

- Which method would you choose to plot both *gender* and *age*? Make the figure.
- Modify it to also include the hospital mortality. Try to interpret the figure.
- Visualize the *temp_score* feature. What observations can you make? (Hint: Use the `value_counts()` method)
- Create a plot of *age_score* against *age* and try to interpret the scoring rule.

There are many other ways to represent data, depending on the type of the data and the information you want to convey. <br>The [Data to viz](https://www.data-to-viz.com) website lists many visualization methods along with the corresponding Python (and R) code for replication.

## Data preprocessing


Data pre-processing plays a crucial role prior to model development. It encompasses several essential steps, including handling missing values through imputation or removal, addressing outliers, data encoding, feature engineering, and data standardization.

To start, we will keep only a subset of variables.

In [None]:
keep = [
    "age", "gender", "height", "weight", "CURR_CAREUNIT_transfers" , 
    "hr_score", "sysbp_score", "pao2fio2_score",
    "bun_min", "hemoglobin_max", "lactate_max", "creatinine_max",
    "ptt_max", "first_icu_stay", "hospital_mortality"
]

In [None]:
df = data[keep].copy()

### Missing data

Missing data is ubiquitous. There are many reasons for missing data, especially in healthcare: 
- Values omitted when entering the data in the electronic healthcare system 
- Data not collected (or not shared) by a different healthcare institution
- Missingness being a value in itself (e.g. a patient not answering specific questions in a survey)
- Physicians not ordering specific lab exams because of the patient's health condition

Let's check if there are missing values in our data:

In [None]:
df.isna().sum().sort_values(ascending=False)

We have 100 observations where `height` is missing and 6 observations where `first_icu_stay` is missing. Let's have a look at these features.

In [None]:
sns.histplot(data=df, x='height');

In [None]:
sns.countplot(data=df, x='first_icu_stay');

`height` is a continuous feature with a distribution that looks Gaussian (bell-shaped). <br>`first_icu_stay` is a binary variable.

Missing values can be addressed in various ways, for example:
- Imputing continuous features with mean or median.
- Imputing categorical features with the most frequent value.
- Considering the missing value as a distinct value (impute it with a new value).
- Imputing the missing features using other variables (e.g. predicting a patient's missing height from their weight and/or age).
- Removing the features with missing values (columns).
- Removing the observations with missing values (rows).

**There is no unique solution when dealing with missing data. It depends on the use-case (i.e., why is the data missing?)** 

In this practical, we will do a naïve imputation by using the mean for the `height` and the most frequent value in `first_icu_stay`.<br>

In [None]:
## we store in two variables the value that will be used to impute the missing values
## the mean observed height
mean_height = df['height'].mean()
## the most frequent value for patients first visit to the ICU
most_frequent_first_icu_stay = df['first_icu_stay'].mode()[0] 

In [None]:
## we fill the missing values using the fillna method by giving a dictionary with the column names and the values to use for imputation
df_processed = df.copy()
df_processed = df_processed.fillna({"height": mean_height, "first_icu_stay": most_frequent_first_icu_stay})

**Important**: Potential missing values in the testing dataset need to be imputed using information from the training data. In other words, the values used to impute the testing data are the same as those used to impute the training data.
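As an illustration of this rule, the sketch below uses two tiny made-up dataframes (`toy_train` and `toy_test` are hypothetical names, not part of the ICU dataset). The imputation value is computed once on the training data and reused for the testing data.

```python
import numpy as np
import pandas as pd

# Made-up heights (cm); NaN marks a missing value
toy_train = pd.DataFrame({"height": [160.0, 170.0, np.nan, 180.0]})
toy_test = pd.DataFrame({"height": [np.nan, 175.0]})

# The imputation value is computed on the training data only
train_mean_height = toy_train["height"].mean()

# The same value fills the gaps in both datasets
toy_train_imputed = toy_train.fillna({"height": train_mean_height})
toy_test_imputed = toy_test.fillna({"height": train_mean_height})

print("Training mean:", train_mean_height)
print(toy_test_imputed)
```

Computing the mean on the testing data instead would let information from the test set leak into the preprocessing step.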

<b>Exercise:</b>

Explore different methods for imputing or handling missing values in the dataset. You could try the following approaches:
- Impute using the median height value.

- Remove rows with missing values (i.e., drop patients with incomplete data). Hint: Check the **axis** parameter in the df.dropna() method.

- Remove columns with missing values (i.e., drop features with incomplete data). Hint: Again, the **axis** parameter in df.dropna() is key here.

### Feature encoding

Among these features, some, such as *gender* and *first_icu_stay*, are non-numeric values. Most machine learning algorithms can only process data in numerical form.

Therefore, we need a method to encode these variables. This can be done in two ways:

- **Manual Encoding**: This involves manually defining the appropriate mapping or transformation for each categorical variable.


In [None]:
print(df_processed['gender'].unique()) ## print unique values of that feature
df_processed['gender'] = df_processed['gender'].replace({'M': 1, 'F': 0}).astype(int) ## encode the values with predefined mapping and store them as integers
print(df_processed['gender'].unique()) ## print unique values of encoded features

In [None]:
print(df_processed['first_icu_stay'].unique())
df_processed['first_icu_stay'] = df_processed['first_icu_stay'].replace({True: 1, False: 0}).astype(int) ## same idea for the boolean feature
print(df_processed['first_icu_stay'].unique())

- **Automatic encoding with the scikit-learn package**: The scikit-learn package provides a convenient solution for categorical variable encoding through its `LabelEncoder` class. It automatically converts categorical values into integer labels, simplifying the encoding process. (`LabelEncoder` is primarily intended for encoding the outcome variable; for features with more than two categories, scikit-learn also offers `OrdinalEncoder` and `OneHotEncoder`.)

**scikit-learn** is a Python library that provides a set of modules for machine learning and data analysis.

In [None]:
from sklearn.preprocessing import LabelEncoder

print(df_processed['CURR_CAREUNIT_transfers'].unique())

## instantiate the LabelEncoder object
encoder = LabelEncoder() 

## fit the encoder (learn the mapping) and transform the column in one step
df_processed['CURR_CAREUNIT_transfers'] = encoder.fit_transform(df_processed['CURR_CAREUNIT_transfers']) 

print(df_processed['CURR_CAREUNIT_transfers'].unique()) ## print uniques values of encoded features
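To see which integer was assigned to which category, the fitted encoder exposes a `classes_` attribute, and `inverse_transform` maps the integers back to the original labels. A minimal sketch on a made-up list of care units (`units` and `toy_encoder` are illustrative names):

```python
from sklearn.preprocessing import LabelEncoder

# Made-up values mimicking the care unit column
units = ["MICU", "SICU", "MICU", "SICU", "SICU"]

toy_encoder = LabelEncoder()
codes = toy_encoder.fit_transform(units)

# The position of each category in classes_ is its integer code
print(codes)
print(toy_encoder.classes_)

# inverse_transform recovers the original labels from the codes
print(toy_encoder.inverse_transform(codes))
```

Categories are sorted before being numbered, so here `MICU` is encoded as 0 and `SICU` as 1.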

### Feature correlation 

Pairwise correlation provides information about which features are correlated. The `.corr()` method computes the correlation matrix of our dataset. This can help to identify redundant features (correlation close to ±1).

In [None]:
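corr_matrix = df_processed.corr(method='spearman')
## boolean mask hiding the upper triangle and the diagonal, which carry no additional information
mask = np.zeros_like(corr_matrix, dtype=bool)
mask[np.triu_indices_from(mask)] = True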

In [None]:
plt.figure(figsize=(18,6))
sns.heatmap(corr_matrix, mask=mask,annot=True, cmap="YlGnBu");

<b>Exercise:</b> 

- Using the provided heatmap, determine the correlation between *creatinine_max* and *bun_min*. Plot a scatterplot with both features. Does it make sense with the correlation value?
- Calculate and create the *BMI (Body Mass Index)* feature. <br>To create and add a new feature to a dataframe, use the same syntax as for accessing an existing feature: <br>`df['name of the new feature'] = ...`

### Standardization

Standardization is a common practice in data preprocessing. It involves scaling the data to have a **mean of 0** and a **standard deviation of 1**. This process is performed to prevent features with different scales from disproportionately influencing the training process, especially when using regularization. Additionally, standardization can improve the performance of certain models during training. *Standardization* is not the only way to scale the data, for example *normalization* consists in rescaling the values between $[0,1]$ or $[-1,1]$.
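The two scaling approaches can be compared on a handful of made-up values. This is only a sketch: `ages`, `standardized` and `normalized` are illustrative names, and `ddof=1` is used so that the standard deviation matches the default of pandas' `.std()`.

```python
import numpy as np

# Made-up ages, for illustration only
ages = np.array([20.0, 40.0, 60.0, 80.0])

# Standardization: subtract the mean, divide by the standard deviation
standardized = (ages - ages.mean()) / ages.std(ddof=1)

# Min-max normalization: rescale the values to the [0, 1] range
normalized = (ages - ages.min()) / (ages.max() - ages.min())

print("Standardized:", standardized)
print("Mean:", standardized.mean(), "Std:", standardized.std(ddof=1))
print("Normalized:", normalized)
```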

In [None]:
X = df_processed.drop('hospital_mortality', axis=1)
y = df_processed['hospital_mortality']

In [None]:
train_mean = X.mean()
train_std = X.std()
X = (X-train_mean)/train_std 

We can check if the data has been correctly standardized by checking the new mean and standard deviation.

In [None]:
X.mean() 

In [None]:
X.std()

To summarize, we have performed the following steps for data preprocessing:
- Selected a subset of features
- Handled missing values using imputation
- Encoded categorical and boolean features
- Created a new feature (BMI, if you completed the exercise above)
- Standardized the features

This results in the following variables:

$X$ containing the processed features of each patient.

In [None]:
X.head()

$y$ containing the label (hospital mortality) of each patient.

In [None]:
y.head()

# Modeling 

## Regression or Classification

In machine learning, the primary goal is to train a model that can make accurate predictions on new, unseen data. Most prediction tasks fall into one of two categories: **regression** or **classification**, depending on the nature of the outcome variable.


In a **regression** task, the outcome is a continuous numerical value. The model learns to predict quantities such as:

- Length of stay in the ICU
- Tumor size


In a **classification** task, the goal is to predict a discrete category or label. The model estimates the most likely class for a given observation. Examples include:

- Presence or absence of a disease
- Risk of hospital-mortality

In this course, since our objective is to predict the risk of hospital mortality, we will specifically cover **classification** methods.

**1. Logistic regression**<br>
*Logistic Regression* is a statistical model that predicts a binary outcome (e.g., alive/dead) by fitting a logistic curve to the data. It estimates the probability of an event occurring based on one or more predictor variables.

**2. K-nearest neighbors (KNN)**<br>
*KNN* is an intuitive algorithm that classifies a data point based on the majority class of its $k$ nearest neighbors in the feature space.

**3. Decision Tree**<br>
*Decision Tree*  makes predictions by following a series of feature-based splits. Each node represents a decision on a feature, leading to a final prediction at the leaf nodes. It is highly interpretable.

**4. Random Forest**<br>
*Random Forest* is an ensemble algorithm that combines multiple *Decision Trees*. It aggregates predictions from individual trees to improve prediction accuracy.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

In [None]:
from sklearn.metrics import accuracy_score
from sklearn.tree import plot_tree
from sklearn.datasets import make_circles, make_moons
from matplotlib.colors import LinearSegmentedColormap
from sklearn.inspection import DecisionBoundaryDisplay

### Training

Training a model is extremely simple!<br>
We simply need to instantiate a model and use the corresponding `.fit()` method to train the model by giving the features and the labels.

In [None]:
clf = LogisticRegression() ## instantiate the model
clf.fit(X,y); ## train the model

### Prediction

Predictions on new data can be obtained in two ways for a classifier:

In [None]:
## predicted probabilities for the first patient
print('Predicted probabilities', clf.predict_proba(X.head(1)))

- Obtain the predicted probability (or score) of each label using the `predict_proba()` method, which returns the estimated probabilities for each class.<br>
    - The first element of the output corresponds to $\hat P(Y=0|X) = 0.975$, the estimated probability that the patient survives. <br>
    - The second element corresponds to $\hat P(Y=1|X) = 1-\hat P(Y=0|X) = 0.025$, the estimated probability that the patient dies in hospital.

In [None]:
## predicted label for the first patient
print('Predicted target', clf.predict(X.head(1)))

- Predict the label directly using the `predict()` method, which returns the class with the highest predicted probability (or score). For a binary outcome, this amounts to applying a threshold of 0.5, which is what scikit-learn models do by default. 

The predicted target is obtained by thresholding the predicted probability.
$$\hat Y = 
\begin{cases}
    0,& \text{if } \hat P(Y=0|X)\geq 0.5\\
    1,              & \text{otherwise}
\end{cases}
$$
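The rule above can be checked by thresholding the output of `predict_proba()` by hand and comparing it with `predict()`. The sketch below uses a synthetic dataset generated with scikit-learn's `make_classification` rather than the ICU data; `X_toy`, `y_toy`, `toy_clf` and the threshold of 0.3 are illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (not the ICU dataset)
X_toy, y_toy = make_classification(n_samples=200, n_features=4, random_state=0)
toy_clf = LogisticRegression().fit(X_toy, y_toy)

# Column 0 holds the estimated P(Y=0|X), column 1 the estimated P(Y=1|X)
proba = toy_clf.predict_proba(X_toy)

# Apply the thresholding rule by hand
manual_labels = np.where(proba[:, 0] >= 0.5, 0, 1)
print("Same as predict():", np.array_equal(manual_labels, toy_clf.predict(X_toy)))

# A lower threshold on P(Y=1|X) labels more observations as class 1
custom_threshold = 0.3  # arbitrary value, for illustration
custom_labels = (proba[:, 1] >= custom_threshold).astype(int)
print("Predicted class 1 at 0.5:", manual_labels.sum(), "at 0.3:", custom_labels.sum())
```

Working from the probabilities rather than the labels makes it possible to pick a threshold suited to the clinical question, instead of always using 0.5.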

In [None]:
clf.predict_proba(X)

In [None]:
clf.predict(X)

We will use synthetic data to demonstrate the specificities of the different algorithms. <br>
The synthetic dataset, created using one of the functions provided by scikit-learn, consists of **two classes**, represented by the *blue* and *orange* points. <br>Our objective is to identify the most effective classifier capable of accurately distinguishing between the two classes.

In [None]:
def process_data(data):
    # scikit-learn generators return a (features, labels) tuple
    X,y = data[0] if len(data)==1 else data
    df = pd.DataFrame(X, columns=["X1", "X2"])
    df['Label'] = y
    # Sort the rows by label (class 1 first)
    df = df.sort_values('Label', ascending=False)
    return df[['X1', 'X2']], df['Label']


# Function to plot all components (data points and decision boundaries)
def plot_all(X, y, model=None, figsize=(6,6), ax=None, response_method='predict', title=None):
  
    if ax is None:
        fig, ax = plt.subplots(1,1, figsize=figsize)

    colors = ['#4C72B0', '#DD8452']
    colors_gradient = ['#7895c3', 'white', '#e5a27c']
    cm = LinearSegmentedColormap.from_list(
                "Custom", colors_gradient[::-1], N=300)
    if model is not None:
        DecisionBoundaryDisplay.from_estimator(
            model, X, cmap=cm, alpha=0.8, ax=ax, eps=0.2, response_method=response_method, grid_resolution=500, levels=300
            )
        # Compute accuracy score
        y_pred = model.predict(X)
        acc = accuracy_score(y, y_pred)
        
        # Set title with accuracy
        if title:
            ax.set_title(f"{title}\nAccuracy: {acc:.2%}", fontsize=14)
        else:
            ax.set_title(f"Accuracy: {acc:.2%}", fontsize=14)

    sns.scatterplot(x=X['X1'],y=X['X2'], hue=y, ax=ax, s=125, palette=colors[::-1], linewidths=0.7, edgecolor="k")

    ax.set_xlabel('X1', fontdict=dict(size=12))
    ax.set_ylabel('X2', fontdict=dict(size=12))
    
    ax.legend(shadow=True, framealpha=0.9);

In [None]:
synthetic_dataset = make_circles(n_samples=300, noise=0.13, factor=0.3, random_state=50)
X_train, y_train = process_data(synthetic_dataset)

In [None]:
plot_all(X_train, y_train);

### Logistic regression

**Logistic Regression**: Predicts binary outcomes using a logistic (sigmoid) function.
$$P(Y=1|X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1X_1 + \ldots + \beta_pX_p)}}$$

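- Here $\beta_j$ is the coefficient associated with feature $j$, and $\beta_0$ is the intercept.
- The log-odds are linear with respect to the predictors, so the decision boundary is linear
- Coefficients are interpretable
- Flexible (non-linear relationships can also be included, e.g. by adding transformed features)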
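
To connect the formula with scikit-learn, the sketch below recomputes $P(Y=1|X)$ by hand from the fitted intercept and coefficients and compares it with `predict_proba()`. The toy data and the names `X_demo`, `y_demo` and `clf_demo` are assumptions made for this illustration; the model uses scikit-learn's default settings (which include a penalty), unlike the `penalty=None` model fitted in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data with p = 3 features (assumption, not the notebook's data)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=3, n_informative=2, n_redundant=0, random_state=0
)
clf_demo = LogisticRegression().fit(X_demo, y_demo)

beta_0 = clf_demo.intercept_[0]  # intercept
beta = clf_demo.coef_[0]         # one coefficient per feature

# Linear predictor beta_0 + beta_1 * X_1 + ... + beta_p * X_p
linear_term = beta_0 + X_demo @ beta

# Sigmoid of the linear predictor gives P(Y=1|X)
p_manual = 1.0 / (1.0 + np.exp(-linear_term))

p_sklearn = clf_demo.predict_proba(X_demo)[:, 1]
formula_matches = np.allclose(p_manual, p_sklearn)
print("Manual formula matches predict_proba():", formula_matches)
```

Since the log-odds are linear in the features, `np.exp(beta)` can be read as odds ratios, which is one reason the coefficients are considered interpretable.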

In [None]:
clf_logistic = LogisticRegression(penalty=None)
clf_logistic.fit(X_train,y_train);

In [None]:
plot_all(X_train,y_train, model=clf_logistic, title="Logistic Regression");
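
A straight decision boundary cannot separate two concentric circles. As a rough sketch of how non-linear relationships can be included in a logistic regression, the example below adds squared and interaction terms with `PolynomialFeatures`. The pipeline and the names `X_c`, `y_c`, `linear_clf` and `poly_clf` are assumptions for this illustration and are not part of the original notebook.

```python
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Same generator settings as the synthetic dataset above, kept here as plain arrays
X_c, y_c = make_circles(n_samples=300, noise=0.13, factor=0.3, random_state=50)

# Plain logistic regression: linear in X1 and X2
linear_clf = LogisticRegression().fit(X_c, y_c)

# Logistic regression on X1, X2, X1^2, X1*X2, X2^2: still linear in the new features
poly_clf = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LogisticRegression(),
).fit(X_c, y_c)

acc_linear = linear_clf.score(X_c, y_c)  # accuracy on the training data
acc_poly = poly_clf.score(X_c, y_c)
print(f"Linear features: {acc_linear:.2%} | Degree-2 features: {acc_poly:.2%}")
```

The model remains a logistic regression: only the inputs change, so the boundary is linear in the expanded feature space and curved in the original one.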

### K-nearest neighbors (KNN)

The KNN algorithm can be summarized by the famous saying: 
*<p style="text-align:center;">Tell me who your friends are, and I will tell you who you are </p>*


In practice, it makes predictions based on similarity to known data points. To make a prediction, the algorithm will:
- Identify the *k* nearest neighbors in the training data.
- Perform a majority vote among these neighbors to determine the predicted class.

Let's see this with an example.

<img src="https://github.com/AL0UNE/courses/blob/main/figures/knn_2.png?raw=true" alt="drawing" width="300"/>

<img src="https://github.com/AL0UNE/courses/blob/main/figures/knn_3.png?raw=true" alt="drawing" width="300"/>

<img src="https://github.com/AL0UNE/courses/blob/main/figures/knn_4.png?raw=true" alt="drawing" width="300"/>

<img src="https://github.com/AL0UNE/courses/blob/main/figures/knn_5.png?raw=true" alt="drawing" width="300"/>

<img src="https://github.com/AL0UNE/courses/blob/main/figures/knn_6.png?raw=true" alt="drawing" width="300"/>
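
The two steps (find the *k* nearest neighbors, then take a majority vote) can be sketched in a few lines of NumPy. This is a minimal illustration, assuming Euclidean distance and integer class labels; the helper `knn_predict_one` and the point `x_new` are hypothetical names introduced here, and the result is compared with scikit-learn's `KNeighborsClassifier`.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.neighbors import KNeighborsClassifier

# Same generator settings as the synthetic dataset above, kept here as plain arrays
X_demo, y_demo = make_circles(n_samples=300, noise=0.13, factor=0.3, random_state=50)

def knn_predict_one(x_new, X_ref, y_ref, k):
    """Predict the label of one point by majority vote among its k nearest neighbors."""
    # Step 1: Euclidean distance from x_new to every training point
    distances = np.sqrt(((X_ref - x_new) ** 2).sum(axis=1))
    nearest = np.argsort(distances)[:k]  # indices of the k closest points
    # Step 2: majority vote among the labels of these neighbors
    votes = np.bincount(y_ref[nearest])
    return int(np.argmax(votes))

x_new = np.array([0.1, -0.2])  # hypothetical new point
manual_label = knn_predict_one(x_new, X_demo, y_demo, k=5)

sk_model = KNeighborsClassifier(n_neighbors=5).fit(X_demo, y_demo)
sk_label = int(sk_model.predict(x_new.reshape(1, -1))[0])

print("Manual KNN:", manual_label, "| scikit-learn KNN:", sk_label)
```

An odd value of *k* is used here so that the vote between two classes cannot end in a tie.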

#### KNN with 1 neighbor

In [None]:
clf_knn_1 = KNeighborsClassifier(n_neighbors=1) ## 1 neighbor
clf_knn_1.fit(X_train,y_train);

In [None]:
plot_all(X_train, y_train, model=clf_knn_1, title="KNN - Neighbor = 1");

#### KNN with 10 neighbors

In [None]:
clf_knn_10 = KNeighborsClassifier(n_neighbors=10) ## 10 neighbors
clf_knn_10.fit(X_train,y_train);

In [None]:
plot_all(X_train, y_train, model=clf_knn_10, title="KNN - Neighbor = 10");

### Decision Tree for classification (Classification Tree)

- A classification tree is a hierarchical structure used for classification tasks.
- It recursively partitions the feature space into regions, where each region corresponds to a specific class label.
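- It splits the feature space based on the features' values, aiming to optimize some criterion. A common criterion for classification tasks is the Gini impurity, which measures how mixed the class labels are within a region (the lower the impurity, the more homogeneous the region).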
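
As a small worked example of the splitting criterion, the sketch below computes the Gini impurity $1 - \sum_k p_k^2$ of a set of labels and the weighted impurity of a candidate split. The helpers `gini_impurity` and `split_impurity` and the six data points are hypothetical and only meant for illustration; scikit-learn's own implementation is more elaborate.

```python
import numpy as np

def gini_impurity(labels):
    """Gini impurity 1 - sum_k p_k^2 of a set of labels (0 means the region is pure)."""
    labels = np.asarray(labels)
    if labels.size == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()  # class proportions in the region
    return float(1.0 - np.sum(p ** 2))

def split_impurity(x, y, threshold):
    """Weighted Gini impurity of the two regions created by the rule x <= threshold."""
    x, y = np.asarray(x), np.asarray(y)
    left, right = y[x <= threshold], y[x > threshold]
    n = len(y)
    return len(left) / n * gini_impurity(left) + len(right) / n * gini_impurity(right)

# Hypothetical one-feature example: three points of each class
x_demo = np.array([0.1, 0.4, 0.5, 0.9, 1.2, 1.5])
y_demo = np.array([0, 0, 0, 1, 1, 1])

print(f"Before splitting: {gini_impurity(y_demo):.3f}")
print(f"Split at 0.70:    {split_impurity(x_demo, y_demo, 0.70):.3f}")
print(f"Split at 0.45:    {split_impurity(x_demo, y_demo, 0.45):.3f}")
```

Before splitting, the impurity is 0.5 (two balanced classes). The threshold 0.70 separates the classes perfectly (impurity 0), whereas 0.45 leaves one region mixed (impurity 0.25), so a tree using this criterion would prefer 0.70.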

#### Decision tree with a maximum depth of 1

In [None]:
clf_dt = DecisionTreeClassifier(max_depth=1) ## maximum depth of 1
clf_dt.fit(X_train, y_train);

Let's see this on our example:

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(18,8))
tree = DecisionTreeClassifier(max_depth=1).fit(X_train,y_train)
plot_all(X_train, y_train, model=tree, ax= ax[0], title="Decision Tree - Max Depth = 1")
plot_tree(tree, ax=ax[1], feature_names=["X1", "X2"], label='all',  filled=True, rounded=True, impurity=False);

#### Decision tree with a maximum depth of 2

In [None]:
clf_dt = DecisionTreeClassifier(max_depth=2) ## maximum depth of 2
clf_dt.fit(X_train, y_train);

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(18,8))
tree = DecisionTreeClassifier(max_depth=2).fit(X_train,y_train)
plot_all(X_train, y_train, model=tree, ax= ax[0], title="Decision Tree - Max Depth = 2")
plot_tree(tree, ax=ax[1], feature_names=["X1", "X2"], label='all', precision=2, filled=True, rounded=True, impurity=False);

#### Decision tree with a maximum depth of 4

In [None]:
clf_dt = DecisionTreeClassifier(max_depth=4) ## maximum depth of 4
clf_dt.fit(X_train, y_train);

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(18,8))
tree = DecisionTreeClassifier(max_depth=4).fit(X_train, y_train)
plot_all(X_train, y_train, model=tree, ax= ax[0], title="Decision Tree - Max Depth = 4")
plot_tree(tree, ax=ax[1], feature_names=["X1", "X2"], label='all', precision=2, filled=True, rounded=True, impurity=False);

### Random Forest

- **Random forests** are an **ensemble** learning method based on **decision trees**.
- They build multiple decision trees and combine their predictions (by voting or averaging) to improve accuracy.

**Question**: How do we generate multiple decision trees using the same dataset?

To achieve diversity of trees within a random forest, two primary strategies are utilized:

1. **Sample Sampling**: This involves resampling the dataset with replacement, known as **Bootstrap**, to create multiple variations of the dataset. Each decision tree is then built on one of these resampled datasets. The predictions from each tree are combined, often through averaging, in a process called **Bagging**.

![](https://github.com/AL0UNE/courses/blob/main/figures/Bootstrap.png?raw=true)

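2. **Feature Sampling**: Diversity is increased further by randomly selecting a subset of features (without replacement) when building each decision tree. In the standard random forest algorithm, as implemented in scikit-learn, a new subset is drawn at each split, so every split considers only a portion of the available features, adding further variety to the ensemble.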

![](https://github.com/AL0UNE/courses/blob/main/figures/Random_forest.png?raw=true)

The random forest algorithm leverages both of these strategies to create an ensemble of diverse decision trees, which collectively provide more robust and accurate predictions.
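
The two strategies can be sketched by hand with plain decision trees. This is a simplified illustration and not scikit-learn's actual implementation: the number of trees (25), the seeds, and the names `X_demo`, `y_demo`, `trees` and `ensemble_pred` are assumptions. Feature sampling is done here through the `max_features` parameter of `DecisionTreeClassifier`, which draws the candidate features at each split.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.tree import DecisionTreeClassifier

# Same generator settings as the synthetic dataset above, kept here as plain arrays
X_demo, y_demo = make_circles(n_samples=300, noise=0.13, factor=0.3, random_state=50)

rng = np.random.default_rng(0)
n_samples = X_demo.shape[0]

trees = []
for i in range(25):  # odd number of trees, so a two-class vote cannot tie
    # Bootstrap: draw n_samples row indices with replacement
    idx = rng.integers(0, n_samples, size=n_samples)
    # max_features=1: each split considers one randomly chosen feature out of the two
    tree = DecisionTreeClassifier(max_features=1, random_state=i)
    tree.fit(X_demo[idx], y_demo[idx])
    trees.append(tree)

# Bagging: combine the trees with a majority vote
all_votes = np.stack([t.predict(X_demo) for t in trees])  # shape (n_trees, n_samples)
ensemble_pred = (all_votes.mean(axis=0) > 0.5).astype(int)
ensemble_acc = (ensemble_pred == y_demo).mean()
print(f"Accuracy of the hand-made ensemble on the training data: {ensemble_acc:.2%}")
```

`RandomForestClassifier` follows the same idea but averages the predicted class probabilities of the trees instead of counting hard votes.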

#### Random forest with 2 trees

In [None]:
clf_rf = RandomForestClassifier(n_estimators=2) ## 2 trees in the forest
clf_rf.fit(X_train,y_train);

In [None]:
plot_all(X_train, y_train, model=clf_rf,title="Random Forest - 2 Trees");

#### Random forest with 100 trees

In [None]:
clf_rf = RandomForestClassifier(n_estimators=100) ## 100 trees in the forest
clf_rf.fit(X_train,y_train);

In [None]:
plot_all(X_train, y_train, model=clf_rf,title="Random Forest - 100 Trees");

**Exercise**

- Using the dataset provided below, tune each model by identifying the best values for its hyperparameters (number of neighbors for KNN, maximum depth for the decision tree, number of trees for the random forest).

- For each model, comment on how each hyperparameter affects the model's decision function.

- Do the same on the ICU dataset (i.e., using $X$ and $y$). In this case, visualization will not be possible since $X$ contains more than 2 features, so evaluate the performance of the different models using the accuracy (use the `compute_accuracy` function).

In [None]:
## new synthetic dataset
data = make_moons(n_samples=300, noise=0.12, random_state=48)
X_train, y_train = process_data(data)

In [None]:
plot_all(X_train, y_train)

In [None]:
clf_logistic = LogisticRegression(penalty=None) ## instantiate the model
clf_logistic.fit(X_train,y_train); ## train the model

In [None]:
plot_all(X_train, y_train, model=clf_logistic, title="Logistic Regression") ## plot decision boundary

In [None]:
clf_knn = KNeighborsClassifier(n_neighbors=...)
clf_knn.fit(X_train, y_train);

In [None]:
plot_all(X_train, y_train, model=clf_knn, title="KNN - Neighbor = ...")

In [None]:
clf_dt = DecisionTreeClassifier(max_depth=...)
clf_dt.fit(X_train, y_train);

In [None]:
plot_all(X_train, y_train, model=clf_dt, title="Decision Tree - Max Depth = ...")

In [None]:
clf_rf = RandomForestClassifier(n_estimators=...) 
clf_rf.fit(X_train, y_train);

In [None]:
plot_all(X_train, y_train, model=clf_rf, title="Random Forest - ... Trees");

#### ICU dataset

In [None]:
def compute_accuracy(model, X_train, y_train):
    y_pred = model.predict(X_train)
    accuracy = accuracy_score(y_train, y_pred)
    print('Model accuracy is {:.2f}'.format(accuracy))

In [None]:
model = ... ## try different models and hyperparameter values
model.fit(X, y) ## train on X and y
compute_accuracy(model, X, y) ## compute accuracy