# Decision Tree - Heart Failure Prediction

This notebook builds and evaluates a **Decision Tree** model for predicting the likelihood of heart failure based on patient health attributes.

---

## Objective

Use Decision Tree Classifier as a **baseline classifier** to:
- Identify which health factors are most strongly associated with heart failure.
- Establish a performance benchmark for comparison with more complex models such as Random Forests and Gradient Boosted Trees.

---

## Dataset Summary

The dataset contains **918 patient records** and **11 health-related features**, along with a binary target variable:

- **Target:** `HeartDisease` (1 = patient has or is at risk of heart failure, 0 = otherwise)

**Features include:**
- *Age, Sex, ChestPainType, RestingBP, Cholesterol, FastingBS, RestingECG, MaxHR, ExerciseAngina, Oldpeak, ST_Slope*

---

## Overview

1. **Load the dataset** (`dataset/heart.csv`)  
2. **Preprocess features** (encode categorical variables; feature scaling is not needed for a Decision Tree)  
3. **Train and evaluate** a Decision Tree model using both a **standard 80/20 train–test split** and **10-fold cross-validation**  
4. **Compare performance** across the two evaluation methods to assess model stability and generalization  
5. **Analyze feature importances** to identify which clinical and demographic factors most influence heart disease predictions

---

## Notes

- All preprocessing (encoding and splitting) is performed **within this notebook** for simplicity and reproducibility.  
- A fixed random seed (`random_state=42`) is used to maintain consistent splits and fair comparisons between methods.  
- This notebook establishes a **baseline** for model performance and interpretability, providing a foundation for future experiments with more complex algorithms such as Random Forests and Gradient Boosted Trees.

---


## Loading the Dataset

We load the Kaggle dataset from its **.csv** file into a DataFrame (`df`).

Afterwards, we can verify `df` has been populated and observe the structure of the data.

In [3]:
import pandas as pd

df = pd.read_csv("dataset/heart.csv")

df.head()

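|   | Age | Sex | ChestPainType | RestingBP | Cholesterol | FastingBS | RestingECG | MaxHR | ExerciseAngina | Oldpeak | ST_Slope | HeartDisease |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 40 | M | ATA | 140 | 289 | 0 | Normal | 172 | N | 0.0 | Up | 0 |
| 1 | 49 | F | NAP | 160 | 180 | 0 | Normal | 156 | N | 1.0 | Flat | 1 |
| 2 | 37 | M | ATA | 130 | 283 | 0 | ST | 98 | N | 0.0 | Up | 0 |
| 3 | 48 | F | ASY | 138 | 214 | 0 | Normal | 108 | Y | 1.5 | Flat | 1 |
| 4 | 54 | M | NAP | 150 | 195 | 0 | Normal | 122 | N | 0.0 | Up | 0 |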


In [4]:
df.shape

(918, 12)

In [5]:
# Call the method: a bare `df.info` (without parentheses) only echoes the
# bound-method repr instead of the column dtypes and non-null counts.
df.info()

## Data Preparation and Feature Construction

Now that we’ve verified the dataset’s integrity, we can construct the **feature matrix (`X`)** and **label vector (`y`)** for model training.

We begin by selecting all relevant feature columns from the dataset.  

The target variable, `HeartDisease`, will serve as our label vector (`y`), while the remaining columns form the feature matrix (`X`).


In [13]:
# label vector
y = df['HeartDisease']
y[::100]

0      0
100    1
200    0
300    1
400    1
500    1
600    0
700    0
800    0
900    1
Name: HeartDisease, dtype: int64

In [15]:
features_cols = df.columns[:-1]

# feature matrix
X = df[features_cols]
X[::100]

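|   | Age | Sex | ChestPainType | RestingBP | Cholesterol | FastingBS | RestingECG | MaxHR | ExerciseAngina | Oldpeak | ST_Slope |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 40 | M | ATA | 140 | 289 | 0 | Normal | 172 | N | 0.0 | Up |
| 100 | 65 | M | ASY | 130 | 275 | 0 | ST | 115 | Y | 1.0 | Flat |
| 200 | 47 | M | TA | 110 | 249 | 0 | Normal | 150 | N | 0.0 | Up |
| 300 | 60 | M | ASY | 160 | 0 | 1 | Normal | 149 | N | 0.4 | Flat |
| 400 | 50 | F | ASY | 160 | 0 | 1 | Normal | 110 | N | 0.0 | Flat |
| 500 | 65 | M | ASY | 136 | 248 | 0 | Normal | 140 | Y | 4.0 | Down |
| 600 | 57 | M | ASY | 130 | 207 | 0 | ST | 96 | Y | 1.0 | Flat |
| 700 | 42 | M | TA | 148 | 244 | 0 | LVH | 178 | N | 0.8 | Up |
| 800 | 43 | M | NAP | 130 | 315 | 0 | Normal | 162 | N | 1.9 | Up |
| 900 | 58 | M | ASY | 114 | 318 | 0 | ST | 140 | N | 4.4 | Down |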


With the data prepared, we now split it into **training** and **testing** subsets to evaluate how well the model generalizes to unseen data.  

## Splitting the Data for Model Evaluation

To properly assess our Decision Tree model, we need to separate the dataset into distinct training and evaluation sets.  
Evaluating on data the model never saw during training gives an honest estimate of how well it generalizes and lets us detect overfitting (it does not, by itself, prevent it).

We will explore **two common evaluation strategies**:

1. **Standard Train/Test Split** – a single 80/20 split providing a quick baseline of model performance.  
2. **k-Fold Cross-Validation** – a more robust method that repeatedly trains and tests the model across multiple data partitions.

By comparing results from both approaches, we can evaluate the **stability and consistency** of our model’s predictions and choose the most reliable validation strategy for further tuning.
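
As a rough illustration of the second strategy, the sketch below runs 10-fold cross-validation on a Decision Tree. It is a minimal example under stated assumptions: the data is synthetic (`make_classification` stands in for the encoded heart-failure features), the names `X_demo`, `y_demo`, and `cv` are hypothetical, and stratified, shuffled folds are one reasonable choice rather than necessarily the setup used later in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded features (assumption, not the real data).
X_demo, y_demo = make_classification(
    n_samples=918, n_features=11, n_informative=5, random_state=42
)

# 10 stratified folds; shuffling with a fixed seed keeps the folds reproducible.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(
    DecisionTreeClassifier(random_state=42),
    X_demo,
    y_demo,
    cv=cv,
    scoring="accuracy",
)

# One accuracy per fold; the spread indicates how stable the model is.
print(f"fold accuracies: {scores.round(3)}")
print(f"mean = {scores.mean():.3f}, std = {scores.std():.3f}")
```

A small standard deviation across folds suggests that a single 80/20 split is a fair summary of performance; a large one suggests the single split may be optimistic or pessimistic by chance.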


## Standard Train/Test Split

We begin by establishing a **baseline performance** for our Decision Tree model using a simple 80/20 train–test split.  
This approach provides an initial benchmark for accuracy and other key metrics before applying more rigorous validation methods such as k-fold cross-validation.

The data is split into **training** and **testing** subsets using `train_test_split`.

We will use the following parameters: `test_size`=**0.2**, `random_state`=**42**.

Our *test size* indicates that our training dataset will take up 80% of the total dataset while the testing set takes up 20%.

Our *random state* is a seed that allows us to have replicable results when splitting the data.

We then train the Decision Tree model on the training data and evaluate it on the unseen test data.

In [17]:
from sklearn.model_selection import train_test_split

# Split data into testing and training sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size = 0.2, random_state = 42
)
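
The split above can be sanity-checked with a small standalone sketch. It uses placeholder arrays of the same length as the dataset (the names `X_demo`, `y_demo`, `split_a`, and `split_b` are hypothetical) to show the resulting subset sizes and that reusing the seed reproduces the same partition.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_samples = 918  # number of patient records reported above
X_demo = np.arange(n_samples).reshape(-1, 1)  # placeholder feature column
y_demo = np.arange(n_samples) % 2             # placeholder labels

split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# The test-set size is rounded up, so 0.2 * 918 = 183.6 becomes 184 rows.
n_train = len(split_a[0])
n_test = len(split_a[1])
print(n_train, n_test)  # 734 184

# Same seed, same partition: all four returned arrays match element-wise.
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same_split)  # True
```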

## Feature Scaling

The Decision Tree model does not require feature scaling, despite the differing scales of the features in this dataset. Each split compares a single feature against a threshold, so only the relative ordering of values within that feature matters. Thresholds are chosen by an impurity criterion (Gini impurity by default in scikit-learn's `DecisionTreeClassifier`) rather than by feature magnitude, so rescaling a feature leaves the tree structure unchanged.
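
A quick way to check this claim is to fit the same tree on raw and standardized copies of a dataset and compare predictions. The sketch below does so on synthetic data; `make_classification`, `StandardScaler`, and all `*_demo`, `tree_raw`, and `tree_scaled` names are assumptions introduced for illustration only. Agreement is expected to be complete, barring rare floating-point ties.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (assumption): continuous features, binary target.
X_demo, y_demo = make_classification(n_samples=400, n_features=6, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Fit the scaler on the training split only, then reuse it for the test split.
scaler = StandardScaler().fit(X_tr)

tree_raw = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)
tree_scaled = DecisionTreeClassifier(random_state=42).fit(scaler.transform(X_tr), y_tr)

pred_raw = tree_raw.predict(X_te)
pred_scaled = tree_scaled.predict(scaler.transform(X_te))

# Fraction of test rows on which the two trees give the same prediction.
agreement = np.mean(pred_raw == pred_scaled)
print(f"prediction agreement: {agreement:.3f}")
```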


## Encoding Categorical Features
Several features in this dataset are **categorical** (e.g., `Sex`, `ChestPainType`, `RestingECG`, `ExerciseAngina`, `ST_Slope`), and must be encoded. During encoding, the categorical features are converted into numeric values, since sklearn's Decision Tree model only processes numerical data.

We use scikit-learn’s `OneHotEncoder` to transform these categorical columns into binary indicator variables.  
This creates a new set of columns representing each category, allowing the model to learn from categorical distinctions without assuming any ordinal relationship.

> **Note:** We set `sparse_output=False` so that the encoder returns a dense NumPy array, which can easily be converted into a Pandas DataFrame.


In [32]:

from sklearn.preprocessing import OneHotEncoder
# drop='first' removes one category per feature to avoid redundant columns
encoder = OneHotEncoder(drop='first', sparse_output=False)

cat_feature_cols = ['Sex', 'ChestPainType', 'RestingECG', 'ExerciseAngina', 'ST_Slope']
num_features_cols = ['Age', 'RestingBP', 'Cholesterol', 'FastingBS', 'MaxHR', 'Oldpeak']

# encode the categorical features
X_train_cat = encoder.fit_transform(X_train[cat_feature_cols])
X_test_cat = encoder.transform(X_test[cat_feature_cols])

# Get the encoded column names
encoded_cols = encoder.get_feature_names_out(cat_feature_cols)

# convert categorical features to Dataframes
X_train_encoded = pd.DataFrame(X_train_cat, columns=encoded_cols).reset_index(drop=True)
X_test_encoded = pd.DataFrame(X_test_cat, columns=encoded_cols).reset_index(drop=True)

# Combine the numerical and categorical features

# Obtain the numerical Dataframe
X_train_num = X_train[num_features_cols].reset_index(drop=True)
X_test_num = X_test[num_features_cols].reset_index(drop=True)

# combine
X_train_final = pd.concat([X_train_encoded,X_train_num], axis=1).reset_index(drop=True)
X_test_final = pd.concat([X_test_encoded,X_test_num], axis=1).reset_index(drop=True)

# check data
print(X_train_final)
print(X_test_final)





     Sex_M  ChestPainType_ATA  ChestPainType_NAP  ChestPainType_TA  \
0      1.0                0.0                1.0               0.0   
1      1.0                0.0                1.0               0.0   
2      1.0                0.0                0.0               0.0   
3      0.0                0.0                1.0               0.0   
4      1.0                0.0                0.0               0.0   
..     ...                ...                ...               ...   
729    0.0                0.0                0.0               0.0   
730    1.0                0.0                0.0               0.0   
731    1.0                0.0                0.0               0.0   
732    1.0                0.0                0.0               0.0   
733    0.0                0.0                0.0               0.0   

     RestingECG_Normal  RestingECG_ST  ExerciseAngina_Y  ST_Slope_Flat  \
0                  1.0            0.0               0.0            0.0   
1          

## Training the Decision Tree Model

Next, we initialize and fit the **Decision Tree** model using the training data, then generate predictions for the test set and compute the test accuracy.

In [33]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

decision_tree = DecisionTreeClassifier(random_state=42)

decision_tree.fit(X_train_final, y_train)

y_pred = decision_tree.predict(X_test_final)

dt_acc = accuracy_score(y_test, y_pred)

print(dt_acc)



0.8369565217391305
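
Once a tree is fitted, its impurity-based feature importances can be inspected to see which inputs drive the splits. The sketch below is a minimal illustration on synthetic data with hypothetical column names (`feature_0` to `feature_4`); it is not output from the heart dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with hypothetical column names (not the heart dataset).
X_arr, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=42
)
X_demo = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(5)])

tree_demo = DecisionTreeClassifier(random_state=42).fit(X_demo, y_demo)

# Impurity-based importances: non-negative and normalized to sum to 1.
importances = pd.Series(
    tree_demo.feature_importances_, index=X_demo.columns
).sort_values(ascending=False)
print(importances)
```

In this notebook, the analogous call would presumably be `pd.Series(decision_tree.feature_importances_, index=X_train_final.columns)`. Impurity-based importances describe how the tree uses each feature, not a causal effect on heart disease.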
