# Decision Trees - Class Exercise 1

## Introduction

Cardiovascular diseases (CVDs) are the number one cause of death globally, taking an estimated 17.9 million lives each year, which accounts for 31% of all deaths worldwide.

We will use the Heart Failure dataset for this exercise, as heart failure is a common event caused by CVDs. This dataset contains 12 features that can be used to predict mortality by heart failure.

Most cardiovascular diseases can be prevented through population-wide strategies that address behavioural risk factors such as tobacco use, unhealthy diet and obesity, physical inactivity, and harmful use of alcohol.

People with cardiovascular disease, or at high cardiovascular risk due to one or more risk factors such as hypertension, diabetes, hyperlipidaemia, or already established disease, need early detection and management. A machine learning model can be of great help here.

Our goal is to build a decision tree model to predict mortality caused by heart failure.

## Metadata (Data Dictionary)

| Variables               | Description                                                |
|-------------------------|------------------------------------------------------------|
| age                     | Age of patient (years)                                     |
| anaemia                 | Low count of red blood cells or haemoglobin (1 = yes, 0 = no) |
| creatinine_phosphokinase| Level of the CPK enzyme in the blood (mcg/L)               |
| diabetes                | Whether the patient has diabetes (1 = yes, 0 = no)         |
| ejection_fraction       | Percent of blood leaving the heart at each contraction (%) |
| high_blood_pressure     | Whether the patient has hypertension (1 = yes, 0 = no)     |
| platelets               | Platelets in the blood (kiloplatelets/mL)                  |
| serum_creatinine        | Level of serum creatinine in the blood (mg/dL)             |
| serum_sodium            | Level of serum sodium in the blood (mEq/L)                 |
| sex                     | Gender of patient (1 = male, 0 = female)                   |
| smoking                 | Whether the patient smokes or not (1 = yes, 0 = no)        |
| DEATH_EVENT             | Whether the patient died during the follow-up period (1 = dead, 0 = alive) |


## Import necessary libraries

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, GridSearchCV, KFold
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay

## Import data

In [None]:
df = pd.read_csv(_)
df

## Check for missing values

In [None]:
missing_values = df._
missing_values

We observe that there are no missing values in the dataset.

## Train-test split

In [None]:
label = _
excluded_columns = [label]
features = [feature for feature in list(df) if feature not in excluded_columns]

In [None]:
X = df[features]
y = df[label]

In [None]:
# Specify split parameters
random_seed = 9002
test_size = 0.2

# Perform train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=random_seed)

In [None]:
print('Size of train set: ', len(X_train))
print('Size of test set: ', len(X_test))

## Train model

In [None]:
# Specify model parameters
criterion = 'gini'
min_samples_leaf = 40

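# Build model (random_state fixed so that tie-breaking between equally good splits is reproducible)
model = DecisionTreeClassifier(criterion=criterion, min_samples_leaf=min_samples_leaf, random_state=random_seed)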

# Fit model on training data
model.fit(_, _)

# Visualize the decision tree
feature_names = X_train.columns.tolist()
plt.figure(figsize=(8, 8))
plot_tree(model, filled=True, feature_names=feature_names)
plt.show()

## Evaluate model

In [None]:
# Predict test data
y_pred = model.predict(_)

In [None]:
# Generate confusion matrix
cm = confusion_matrix(_, _)
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot(cmap=plt.cm.Blues)
plt.show()

In [None]:
# Check test accuracy
accuracy_test = accuracy_score(_, _)
print(f"Test accuracy: {accuracy_test:.2f}")
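
The test accuracy relates directly to the confusion matrix: it is the sum of the diagonal (correct predictions) divided by the total number of test samples. The sketch below works this out by hand. The counts in `cm_demo` are made up for illustration and are not results from the Heart Failure dataset; the names `cm_demo` and `accuracy_demo` are hypothetical.

```python
import numpy as np

# Hypothetical confusion matrix (made-up counts, not results from this dataset).
# scikit-learn's layout: rows are true classes, columns are predicted classes.
cm_demo = np.array([
    [35, 5],   # true class 0: 35 predicted as 0, 5 predicted as 1
    [10, 10],  # true class 1: 10 predicted as 0, 10 predicted as 1
])

correct = np.trace(cm_demo)  # diagonal entries are the correct predictions
total = cm_demo.sum()        # every test sample appears in exactly one cell
accuracy_demo = correct / total

print(f"Accuracy: {accuracy_demo:.2f}")
```

In this made-up example, half of the true class 1 samples are misclassified even though most predictions are correct, which is why it is worth reading the confusion matrix alongside the accuracy.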

## Model improvement

To improve the model, we will experiment with two hyperparameters:
* ```criterion```
* ```min_samples_leaf```
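
As a rough illustration of these two hyperparameters: ```criterion``` selects the impurity measure used to score candidate splits, and ```min_samples_leaf``` is the minimum number of training samples that each leaf must contain. The sketch below computes the Gini impurity and the entropy of a single node by hand. The helper functions `gini` and `entropy` and the node counts are hypothetical, written only for illustration.

```python
import numpy as np

def gini(counts):
    """Gini impurity of a node, given its per-class sample counts."""
    p = np.asarray(counts, dtype=float) / np.sum(counts)
    return 1.0 - np.sum(p ** 2)

def entropy(counts):
    """Shannon entropy (in bits) of a node, given its per-class sample counts."""
    p = np.asarray(counts, dtype=float) / np.sum(counts)
    p = p[p > 0]  # skip empty classes, since 0 * log2(0) is taken as 0
    return -np.sum(p * np.log2(p))

# Hypothetical node holding 30 samples of class 0 and 10 of class 1 (made-up counts)
node_counts = [30, 10]
print(f"Gini impurity: {gini(node_counts):.3f}")
print(f"Entropy: {entropy(node_counts):.3f}")
```

Both measures are zero for a pure node and largest when the classes are evenly mixed, so a good split is one that lowers the weighted impurity of the child nodes.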

We will first define a hyperparameter grid listing the candidate values of ```criterion``` and ```min_samples_leaf``` to try. Then, we will use ```GridSearchCV``` to evaluate every combination in the grid with cross-validation and select the combination with the best average validation accuracy.
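
To make the procedure concrete, here is a minimal sketch of what a grid search with cross-validation does, written as an explicit loop. It runs on synthetic data from `make_classification` (an assumption, used as a stand-in for the Heart Failure dataset) and uses a smaller grid and 5 folds to keep it quick. All names ending in `_demo`, as well as `results` and `best_combo`, are hypothetical.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the Heart Failure dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=11, random_state=0)

# Smaller illustrative grid
grid_demo = {
    'criterion': ['gini', 'entropy'],
    'min_samples_leaf': [10, 40, 80],
}
cv_demo = KFold(n_splits=5, shuffle=True, random_state=0)

# Score every combination by its mean accuracy across the validation folds
results = {}
for criterion, leaf in product(grid_demo['criterion'], grid_demo['min_samples_leaf']):
    candidate = DecisionTreeClassifier(criterion=criterion, min_samples_leaf=leaf, random_state=0)
    scores = cross_val_score(candidate, X_demo, y_demo, cv=cv_demo, scoring='accuracy')
    results[(criterion, leaf)] = scores.mean()

# Keep the combination with the highest mean validation accuracy
best_combo = max(results, key=results.get)
print("Best combination:", best_combo)
print(f"Best mean CV accuracy: {results[best_combo]:.2f}")
```

```GridSearchCV``` wraps this loop and, with its default `refit=True`, also refits the best combination on the full training set.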

In [None]:
# Define hyperparameter grid
param_grid = {
    'criterion': ['gini', 'entropy'],
    'min_samples_leaf': [10, 20, 30, 40, 50, 60, 70, 80]
}

In [None]:
# Perform grid search with 10-fold cross-validation
model = DecisionTreeClassifier(random_state=9002)  # fixed seed so the search is reproducible
cv = KFold(n_splits=10, shuffle=True, random_state=9002)
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=cv, scoring='accuracy')
grid_search.fit(X_train, y_train);

In [None]:
# Display best params and best validation score
print("Best parameters:", grid_search.best_params_)
print(f"Best average cross-validation score: {grid_search.best_score_:.2f}")

In [None]:
# Retrieve the optimal model; GridSearchCV has already refit it on the full training set using the best params
optimal_model = grid_search.best_estimator_

# Visualize the optimal decision tree
plt.figure(figsize=(20, 10))
plot_tree(optimal_model, filled=True, feature_names=feature_names)
plt.show()

In [None]:
# Apply the optimal model on the test data
y_test_pred = optimal_model.predict(_)
test_accuracy = accuracy_score(_, _)
print(f"Test accuracy: {test_accuracy:.2f}")

In [None]:
# Generate confusion matrix for optimal model
cm = confusion_matrix(_, _)
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot(cmap=plt.cm.Blues)
plt.show()