# Decision Trees - Class Exercise 3

## Introduction

The Cleveland Heart Disease Dataset, hosted by the UCI Machine Learning Repository, is a cornerstone in the field of medical informatics for predicting the presence of heart disease in patients. This dataset comprises 303 individual records, each described by 14 variables, including age, sex, chest pain type, resting blood pressure, serum cholesterol levels, fasting blood sugar, resting electrocardiographic results, maximum heart rate achieved, exercise-induced angina, and others. The target variable indicates the presence or absence of heart disease.

Originally contributed by the Cleveland Clinic Foundation, this dataset has been widely used for benchmarking machine learning models in binary classification tasks, where the objective is to accurately predict whether or not a patient has heart disease based on their medical measurements. It serves not only as a practical dataset for predictive modeling but also as a valuable resource for exploring machine learning techniques in healthcare applications.

Our goal is to build a decision tree model that predicts whether a patient has heart disease.

## Metadata

| Variables     | Description                                                 |
|---------------|-------------------------------------------------------------|
| age           | Age of patient (in years)                                   |
| sex           | Sex of patient (0 = female; 1 = male)                       |
| cp            | Chest pain type (1: typical angina; 2: atypical angina; 3: non-anginal pain; 4: asymptomatic) |
| restbps       | Resting blood pressure on admission to hospital (in mmHg)   |
| chol          | Serum cholesterol (in mg/dl)                                |
| fbs           | Fasting blood sugar > 120 mg/dl (1 = true; 0 = false)       |
| restecg       | Resting electrocardiographic results (values 0, 1, 2)       |
| thalach       | Maximum heart rate achieved                                 |
| exang         | Exercise induced angina (1 = yes; 0 = no)                   |
| oldpeak       | ST depression induced by exercise relative to rest          |
| slope         | Slope of the peak exercise ST segment (values 1, 2, 3)      |
| ca            | Number of major vessels colored by fluoroscopy (0 to 4)     |
| thal          | 3 = normal; 6 = fixed defect; 7 = reversible defect         |
| target        | Presence (Yes) or absence (No) of heart disease             |



## Import Necessary Libraries

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split, GridSearchCV, KFold
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay

## Import Data

In [None]:
df = pd.read_csv(_)
df

## Check for Missing Values

In [None]:
missing_values = df._
missing_values

We observe that there are no missing values in the dataset.
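
As a minimal sketch of this kind of check, the example below counts missing cells per column on a small toy DataFrame. The values and the name ```toy``` are hypothetical and are not taken from the heart disease file. Note that this approach only counts cells that pandas has parsed as missing (```NaN```); placeholder strings such as ```?``` would not be detected.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with one missing cholesterol reading
toy = pd.DataFrame({
    "age": [63, 41, 57],
    "chol": [233.0, np.nan, 354.0],
})

# isna() flags each missing cell; sum() then counts the flags per column
missing_per_column = toy.isna().sum()
print(missing_per_column)
```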

## One-hot Encoding for Multiclass Variables

To avoid imposing a *spurious ordering* on multiclass variables with ```K>2``` classes, we apply one-hot encoding, which replaces each such variable with ```K-1``` binary indicator columns. The first class is dropped and serves as the baseline.

In [None]:
# Initialize encoder
encoder = OneHotEncoder(drop='first', sparse_output=False)

# Declare all multiclass variables with K>2 classes
multiclass_columns = [_]

# Encode multiclass variables with K>2 classes
for column in multiclass_columns:
    encoded_result = encoder.fit_transform(df[[column]])
    encoded_df = pd.DataFrame(encoded_result, columns=encoder.get_feature_names_out([column]))
    # Drop original column and concatenate the new one-hot encoded DataFrame
    df.drop(column, axis=1, inplace=True)
    df = pd.concat([df, encoded_df], axis=1)

df.head()

## Train-Test Split

In [None]:
label = _
excluded_columns = [label]
features = [feature for feature in list(df) if feature not in excluded_columns]

In [None]:
X = df[features]
y = df[label]

In [None]:
# Specify split parameters
random_seed = 9002
test_size = 0.2

# Perform train-test split
X_train, X_test, y_train, y_test = train_test_split(_)

In [None]:
print('Size of train set: ', len(X_train))
print('Size of test set: ', len(X_test))
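
The sketch below illustrates how a test fraction of 0.2 translates into row counts, using synthetic data with the same number of rows as the dataset (303). The arrays ```X_toy``` and ```y_toy``` are random stand-ins, not the real features or labels.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 303 rows, 3 features, binary labels
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(303, 3))
y_toy = rng.integers(0, 2, size=303)

# Hold out 20% of the rows; fixing random_state makes the split reproducible
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=9002
)
print(len(X_tr), len(X_te))
```

Because 0.2 × 303 is not a whole number, scikit-learn rounds the test set size up, giving 61 test rows and 242 training rows.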

## Train Model

In [None]:
# Specify model parameters
criterion = 'gini'
min_samples_leaf = 40

# Build model
model = DecisionTreeClassifier(_)

# Fit model on training data
model.fit(_)

# Visualize the decision tree
feature_names = X_train.columns.tolist()
plt.figure(figsize=(8, 5))
plot_tree(model, filled=True, feature_names=feature_names)
plt.show()
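
The ```criterion``` parameter selects the impurity measure the tree uses to score candidate splits. As a rough illustration, the helper functions below (```gini``` and ```entropy```, written here for this sketch and not part of scikit-learn's public API) compute both measures for the labels in a single node.

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    """Shannon entropy in bits: minus the sum of p * log2(p) over classes."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

balanced = [0] * 5 + [1] * 5   # maximally mixed binary node
pure = [1] * 4                 # node containing a single class

print("balanced node:", gini(balanced), entropy(balanced))
print("pure node gini:", gini(pure))
```

Both measures are zero for a pure node and largest for an evenly mixed one, so a split is preferred when it lowers the weighted impurity of the child nodes.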

## Evaluate Model

In [None]:
# Predict test data
y_pred = model.predict(_)

In [None]:
# Generate confusion matrix
cm = confusion_matrix(_)
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot(cmap=plt.cm.Blues)
plt.show()

In [None]:
# Check test accuracy
accuracy_test = accuracy_score(_)
print(f"Test accuracy: {accuracy_test:.2f}")
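
Accuracy and the confusion matrix are closely related: accuracy is the sum of the diagonal entries (correct predictions) divided by the total count. The sketch below checks this on hand-made labels; ```y_true``` and ```y_hat``` are hypothetical and unrelated to the model above.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true labels and predictions for 8 patients
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_hat = [0, 1, 0, 1, 1, 0, 1, 0]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_hat)

# Accuracy computed two ways: directly, and from the matrix diagonal
acc_direct = accuracy_score(y_true, y_hat)
acc_from_cm = np.trace(cm) / cm.sum()
print(cm)
print(acc_direct, acc_from_cm)
```

The off-diagonal cells separate the two kinds of error (false positives and false negatives), which a single accuracy figure does not distinguish.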

## Model Improvement

To improve the model, we will experiment with two hyperparameters:
* ```criterion```
* ```min_samples_leaf```

We first define a hyperparameter grid that lists the candidate values for ```criterion``` and ```min_samples_leaf```. We then use ```GridSearchCV``` to evaluate every combination in the grid with cross-validation and select the combination with the best average validation score.

In [None]:
# Define hyperparameter grid
param_grid = {
    'criterion': ['gini', 'entropy'],
    'min_samples_leaf': [10, 20, 30, 40, 50, 60, 70, 80]
}

In [None]:
# Perform grid search with 10-fold cross-validation
model = DecisionTreeClassifier()
cv = KFold(n_splits=10, shuffle=True, random_state=9002)
grid_search = GridSearchCV(_)
grid_search.fit(_)

In [None]:
# Display best params and best validation score
print("Best parameters:", grid_search.best_params_)
print(f"Best average cross-validation score: {grid_search.best_score_:.2f}")
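
For reference, the following is a self-contained sketch of the same grid search workflow on synthetic data generated with ```make_classification```. The dataset, the reduced grid, and the choice of ```scoring="accuracy"``` are assumptions made for illustration; they are not the settings intended for the exercise cells above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (stand-in for the real training set)
X_toy, y_toy = make_classification(n_samples=300, n_features=6, random_state=0)

# Reduced grid: 2 criteria x 3 leaf sizes = 6 candidate models
toy_grid = {
    "criterion": ["gini", "entropy"],
    "min_samples_leaf": [10, 40, 80],
}

# 10-fold cross-validation with shuffling, seeded for reproducibility
toy_cv = KFold(n_splits=10, shuffle=True, random_state=9002)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    toy_grid,
    cv=toy_cv,
    scoring="accuracy",
)
search.fit(X_toy, y_toy)

print("Best parameters:", search.best_params_)
print(f"Best average cross-validation score: {search.best_score_:.2f}")
```

Each candidate is trained and scored once per fold, so this sketch fits 60 trees during the search plus one final refit of the best candidate on all of the synthetic data.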

In [None]:
# Retrieve the best model (by default, GridSearchCV refits it on the full training data)
optimal_model = grid_search.best_estimator_

# Visualize the optimal decision tree
plt.figure(figsize=(8, 5))
plot_tree(optimal_model, filled=True, feature_names=feature_names)
plt.show()

In [None]:
# Apply the optimal model on the test data
y_test_pred = optimal_model.predict(_)
test_accuracy = accuracy_score(_)
print(f"Test accuracy: {test_accuracy:.2f}")

In [None]:
# Generate confusion matrix for optimal model
cm = confusion_matrix(_)
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot(cmap=plt.cm.Blues)
plt.show()