# Heart Disease Prediction: An Educational Guide

### Using PyCaret, Plotly, and Scikit-Learn

# Heart Disease Prediction: The "Clinic First" Approach
### 🩺 The Scenario
Imagine you are a Lead Data Scientist at a local hospital. A patient walks into the clinic for a routine check-up. Before the doctor orders expensive, invasive, or time-consuming tests (like a cardiac fluoroscopy or a thallium stress test), they want to use the patient's basic health profile to answer one critical question:
> "Based on this patient's current metrics, what is the probability they have underlying heart disease?"

### 🎯 Our Mission
We want to build a machine learning model that acts as a pre-screening tool.
* The Goal: Catch potential heart disease early.
* The Constraint: Use features that are easily accessible during a standard physical exam.
* The Metric that Matters: In medicine, we care deeply about Recall (minimizing "False Negatives"). We don't want to tell a sick patient they are healthy!

**Learning Objectives:**

* Exploratory Data Analysis (EDA): Visualizing the "Patient Profile" using Plotly.
* Feature Selection: Identifying which data points are "pre-diagnostic" vs. "invasive."
* Automated ML (PyCaret): Finding the best clinical model in seconds.
* Evaluation: Moving beyond accuracy to understand the "Medical Cost" of errors.
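
To make the "metric that matters" concrete, here is a minimal sketch of how recall is computed from a confusion matrix. The labels below are hypothetical toy values chosen for illustration, not results from this notebook.

```python
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical screening results: 1 = disease, 0 = healthy
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

# For binary labels, scikit-learn lays the matrix out as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

recall = tp / (tp + fn)              # share of sick patients the model caught
accuracy = (tp + tn) / len(y_true)   # share of all patients classified correctly

print(f"False negatives (missed patients): {fn}")
print(f"Recall: {recall:.2f}, Accuracy: {accuracy:.2f}")

# The hand computation matches scikit-learn's built-in metric
assert abs(recall - recall_score(y_true, y_pred)) < 1e-12
```

Each false negative is a sick patient sent home, which is why this guide tracks recall alongside accuracy.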



In [1]:
# !pip install pycaret plotly 

import numpy as np
import pandas as pd
import os
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Manual Modeling
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score, classification_report

# PyCaret
from pycaret.classification import *


# 1. Data Preparation

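We use the Heart Disease dataset from Kaggle (`johnsmith88/heart-disease-dataset`) to predict the presence of heart disease from clinical measurements. It contains 1,025 patient records with 13 features and a binary `target` column.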

In [2]:
# Install dependencies as needed:
# pip install kagglehub[pandas-datasets]
import kagglehub
from kagglehub import KaggleDatasetAdapter

# Set the path to the file you'd like to load
file_path = "heart.csv"

# Load the latest version
df = kagglehub.load_dataset(
  KaggleDatasetAdapter.PANDAS,
  "johnsmith88/heart-disease-dataset",
  file_path,
  # Provide any additional arguments like 
  # sql_query or pandas_kwargs. See the 
  # documentation for more information:
  # https://github.com/Kaggle/kagglehub/blob/main/README.md#kaggledatasetadapterpandas
)

df.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,52,1,0,125,212,0,1,168,0,1.0,2,2,3,0
1,53,1,0,140,203,1,0,155,1,3.1,0,0,3,0
2,70,1,0,145,174,0,1,125,1,2.6,0,0,3,0
3,61,1,0,148,203,0,1,161,0,0.0,2,1,3,0
4,62,0,0,138,294,1,1,106,0,1.9,1,3,2,0


In [3]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
age,1025.0,54.434146,9.07229,29.0,48.0,56.0,61.0,77.0
sex,1025.0,0.69561,0.460373,0.0,0.0,1.0,1.0,1.0
cp,1025.0,0.942439,1.029641,0.0,0.0,1.0,2.0,3.0
trestbps,1025.0,131.611707,17.516718,94.0,120.0,130.0,140.0,200.0
chol,1025.0,246.0,51.59251,126.0,211.0,240.0,275.0,564.0
fbs,1025.0,0.149268,0.356527,0.0,0.0,0.0,0.0,1.0
restecg,1025.0,0.529756,0.527878,0.0,0.0,1.0,1.0,2.0
thalach,1025.0,149.114146,23.005724,71.0,132.0,152.0,166.0,202.0
exang,1025.0,0.336585,0.472772,0.0,0.0,0.0,1.0,1.0
oldpeak,1025.0,1.071512,1.175053,0.0,0.0,0.8,1.8,6.2


# 2. Understanding our Medical Features
To connect the data to the clinic, let's categorize our features by how hard they are to obtain. In a real clinic, some measurements are easy to get (age, blood pressure), while others require specialist equipment (number of vessels via fluoroscopy).

| **Name** | **Values** | **Type** | **Action (ML Processing)** | **Feature Meaning** |
| --- | --- | --- | --- | --- |
| **age** | 29–77 (years) | Numerical – Continuous (Ratio) | Scale (optional) | Age of the patient in years |
| **sex** | 0 = Female, 1 = Male | Categorical – Binary (Nominal) | Keep as-is | Biological sex of the patient |
| **cp** | 0 = Typical angina<br>1 = Atypical angina<br>2 = Non-anginal pain<br>3 = Asymptomatic | Categorical – Nominal | One-Hot Encode (non-tree models) | Type of chest pain experienced |
| **trestbps** | 94–200 mm Hg | Numerical – Continuous (Ratio) | Scale (optional) | Resting blood pressure |
| **chol** | 126–564 mg/dl | Numerical – Continuous (Ratio) | Scale / optional log | Serum cholesterol level |
| **fbs** | 0 = ≤120 mg/dl<br>1 = >120 mg/dl | Categorical – Binary (Nominal) | Keep as-is | Fasting blood sugar indicator |
| **restecg** | 0 = Normal<br>1 = ST-T abnormality<br>2 = LV hypertrophy | Categorical – Nominal | One-Hot Encode | Resting electrocardiographic result |
| **thalach** | 71–202 bpm | Numerical – Continuous (Ratio) | Scale | Maximum heart rate achieved |
| **exang** | 0 = No<br>1 = Yes | Categorical – Binary (Nominal) | Keep or drop (diagnostic risk) | Exercise-induced angina |
| **oldpeak** | 0.0–6.2 | Numerical – Continuous (Ratio) | Scale | ST depression induced by exercise |
| **slope** | 0 = Upsloping<br>1 = Flat<br>2 = Downsloping | Categorical – Ordinal | Keep ordinal or One-Hot | Slope of peak exercise ST segment |
| **ca** | 0–3 vessels | Numerical – Discrete (Ordinal) | ⚠ Drop for baseline / Keep for diagnostic | Number of major vessels colored by fluoroscopy |
| **thal** | 1 = Normal<br>2 = Fixed defect<br>3 = Reversible defect | Categorical – Nominal | ⚠ One-Hot or Drop (leakage risk) | Thallium stress test result (the code below names this column `Thalassemia_Type`) |
| **target** | 0 = No disease<br>1 = Disease | Categorical – Binary (Nominal) | Target variable | Presence of heart disease |


| Feature Group | Attributes | Clinical Context |
| :--- | :--- | :--- |
| Demographics | Age, Sex | Basic patient identity. |
| Vitals (Easy) | Resting BP, Cholesterol, Fasting Blood Sugar | Taken during a standard 15-minute nurse check. |
| Symptomatic | Chest Pain Type, Exercise Angina | Self-reported by the patient or observed during movement. |
| Diagnostic (Harder) | ECG, Max Heart Rate, ST Depression | Requires specialized machines and stress tests. |
| Advanced (Invasive) | No. of Vessels, Thallium Scan Result (`thal`) | Expensive, high-resource procedures. |


In [5]:
# Rename the abbreviated columns to descriptive names
column_rename = {
    'age': 'Age',
    'sex': 'Gender',
    'cp': 'Chest_Pain_Type',
    'trestbps': 'Resting_Blood_Pressure',
    'chol': 'Cholesterol',
    'fbs': 'Fasting_Blood_Sugar',
    'restecg': 'Resting_ECG',
    'thalach': 'Max_Heart_Rate',
    'exang': 'Exercise_Induced_Angina',
    'oldpeak': 'ST_Depression',
    'slope': 'ST_Slope',
    'ca': 'Number_of_Vessels',
    'thal': 'Thalassemia_Type',
    'target': 'target',
}
df.rename(columns=column_rename, inplace=True)


In [32]:
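# Preview of the final training features; X_train is created in the train/test split cell further down
X_train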

Unnamed: 0,Gender,Chest_Pain_Type,Fasting_Blood_Sugar,Age,Resting_Blood_Pressure,Cholesterol
4,0,0,1,62,138,294
688,0,0,1,56,200,288
477,1,2,0,57,128,229
336,1,2,1,57,150,126
960,0,2,0,52,136,196
...,...,...,...,...,...,...
882,1,0,0,57,130,131
367,1,1,0,48,110,229
393,0,0,0,62,160,164
777,1,0,0,53,123,282


In [6]:
# 1. Define our feature groups based on clinical availability
diagnostic_cols = ['Resting_ECG', 'Max_Heart_Rate', 'ST_Depression', 'ST_Slope']
advanced_cols   = ['Number_of_Vessels', 'Thalassemia_Type']
cols_to_drop    = diagnostic_cols + advanced_cols

# 2. Create the screening dataset
df_screening = df.drop(columns=cols_to_drop)

# 3. Dynamically identify remaining features
# We define categorical features from what's left in the screening dataframe
all_screening_cats = [
    'Gender', 'Chest_Pain_Type', 'Fasting_Blood_Sugar', 'Exercise_Induced_Angina'
]

# Ensure we only keep categories that weren't dropped
cat_features_screening = [f for f in all_screening_cats if f in df_screening.columns]

# Automatically identify numerical features from the remaining columns
num_features_screening = df_screening.select_dtypes(include="number").columns.drop(['target'] + cat_features_screening).tolist()

print(f"✅ Dropped high-resource features: {cols_to_drop}")
print(f"📊 Remaining Numerical: {num_features_screening}")
print(f"📊 Remaining Categorical: {cat_features_screening}")

✅ Dropped high-resource features: ['Resting_ECG', 'Max_Heart_Rate', 'ST_Depression', 'ST_Slope', 'Number_of_Vessels', 'Thalassemia_Type']
📊 Remaining Numerical: ['Age', 'Resting_Blood_Pressure', 'Cholesterol']
📊 Remaining Categorical: ['Gender', 'Chest_Pain_Type', 'Fasting_Blood_Sugar', 'Exercise_Induced_Angina']


Next, we convert the numerically encoded values into descriptive category labels so that the plots are easier to read.

In [8]:
df_viz = df_screening.copy()

# Define mappings based on your dataset table
mappings = {
    'Gender': {0: 'Female', 1: 'Male'},
    'Chest_Pain_Type': {
        0: 'Typical Angina', 
        1: 'Atypical Angina', 
        2: 'Non-Anginal Pain', 
        3: 'Asymptomatic'
    },
    'Fasting_Blood_Sugar': {0: '<= 120 mg/dl', 1: '> 120 mg/dl'},
    'Exercise_Induced_Angina': {0: 'No', 1: 'Yes'},
    'target': {0: 'No Disease', 1: 'Disease'}
}



# Apply mappings and convert to 'category' type
for col, mapping in mappings.items():
    if col in df_viz.columns:
        df_viz[col] = df_viz[col].map(mapping).astype('category')

df_viz.head()

Unnamed: 0,Age,Gender,Chest_Pain_Type,Resting_Blood_Pressure,Cholesterol,Fasting_Blood_Sugar,Exercise_Induced_Angina,target
0,52,Male,Typical Angina,125,212,<= 120 mg/dl,No,No Disease
1,53,Male,Typical Angina,140,203,> 120 mg/dl,Yes,No Disease
2,70,Male,Typical Angina,145,174,<= 120 mg/dl,Yes,No Disease
3,61,Male,Typical Angina,148,203,<= 120 mg/dl,No,No Disease
4,62,Female,Typical Angina,138,294,> 120 mg/dl,No,No Disease



# 3. Interactive EDA (Plotly)

Unlike static plots, Plotly charts are interactive: you can hover over data points and zoom into specific distributions.

In [9]:
df_viz['target'].value_counts(normalize=True)

target
Disease       0.513171
No Disease    0.486829
Name: proportion, dtype: float64


### Distribution of Features

In [10]:
# Overlaid histograms with marginal box plots for numerical features
for col in num_features_screening:
    fig = px.histogram(df_viz, x=col, color="target", 
                 title=f"{col} vs Heart Disease",
                 marginal="box",
                 template="plotly_white",
                 barmode="overlay", color_discrete_sequence=['#636EFA', '#EF553B'])
    fig.show()



In [11]:
# Grouped bar charts for categorical features
for col in cat_features_screening:
    # Calculate counts for plotting
    counts = df_viz.groupby([col, 'target']).size().reset_index(name='count')
    fig = px.bar(counts, x=col, y='count', color='target', 
                 barmode='group',
                 title=f"{col} Distribution by Target",
                 labels={"target": "Heart Disease"},
                 template="plotly_white")
    fig.show()

`Exercise_Induced_Angina` is a potential target leak. If the angina (chest pain) was recorded during a stress test ordered specifically to confirm heart disease, the feature already embeds knowledge of the outcome we are trying to predict, so we remove it before modeling.

In [13]:
# Define leak features to remove
target_leak = [
    'Exercise_Induced_Angina', 
]

# Update feature lists
cat_features_clean = [f for f in cat_features_screening if f not in target_leak]
num_features_clean = [f for f in num_features_screening if f not in target_leak]
features_name = cat_features_clean + num_features_clean

df_clean = df[features_name + ["target"]]

print(f"Features kept: {features_name}")

Features kept: ['Gender', 'Chest_Pain_Type', 'Fasting_Blood_Sugar', 'Age', 'Resting_Blood_Pressure', 'Cholesterol']


In [17]:
cat_features_clean, num_features_clean

(['Gender', 'Chest_Pain_Type', 'Fasting_Blood_Sugar'],
 ['Age', 'Resting_Blood_Pressure', 'Cholesterol'])

In [None]:
X = df_clean[features_name]
y = df_clean['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# Preprocessing Pipeline
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), num_features_clean),
        ('cat', OneHotEncoder(drop='first'), cat_features_clean)
    ]
)

preprocessor


In [22]:
# 1. Logistic Regression (Baseline)
lr_model = Pipeline([('prep', preprocessor), ('model', LogisticRegression())])
lr_acc = cross_val_score(lr_model, X_train, y_train, cv=5, scoring='accuracy').mean()

# 2. Random Forest
rf_model = Pipeline([('prep', preprocessor), ('model', RandomForestClassifier(n_estimators=300, max_depth=4, random_state=42))])
rf_acc = cross_val_score(rf_model, X_train, y_train, cv=5, scoring='accuracy').mean()

# Report the mean 5-fold cross-validated accuracy of each model
print("Manual Model Performance (CV Accuracy):")
print(f"Logistic Regression: {lr_acc:.4f}")
print(f"Random Forest: {rf_acc:.4f}")

Manual Model Performance (CV Accuracy):
Logistic Regression: 0.7976
Random Forest: 0.8451



# 4. AutoML with PyCaret

PyCaret simplifies the workflow by handling preprocessing and model comparison in a few lines of code.

In [24]:

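# Initialize the PyCaret experiment; session_id fixes the random seed for reproducibility
s = setup(X_train, target=y_train, session_id=123, verbose=False)

# Train and cross-validate the available models, then return the top-ranked one
# (classification models are ranked by Accuracy unless `sort` is changed)
best_model = compare_models()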

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
et,Extra Trees Classifier,0.9618,0.9938,0.963,0.9635,0.9627,0.9235,0.9246,0.05
rf,Random Forest Classifier,0.9547,0.9816,0.9525,0.9606,0.9555,0.9094,0.9115,0.063
lightgbm,Light Gradient Boosting Machine,0.9512,0.9726,0.9525,0.9533,0.9523,0.9024,0.9038,0.07
dt,Decision Tree Classifier,0.9251,0.9251,0.9254,0.9342,0.9273,0.85,0.8547,0.008
gbc,Gradient Boosting Classifier,0.9113,0.9451,0.9087,0.9195,0.9132,0.8224,0.8241,0.036
ada,Ada Boost Classifier,0.7895,0.8798,0.8009,0.796,0.7967,0.5785,0.5811,0.035
ridge,Ridge Classifier,0.7735,0.8312,0.7529,0.7983,0.7723,0.5477,0.5517,0.008
lda,Linear Discriminant Analysis,0.7717,0.8314,0.7495,0.7976,0.7699,0.5443,0.5487,0.008
lr,Logistic Regression,0.7647,0.8329,0.7494,0.7872,0.7652,0.53,0.5338,0.497
qda,Quadratic Discriminant Analysis,0.7613,0.8352,0.7566,0.7782,0.7633,0.5231,0.5282,0.008
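
A gap this wide between the tree ensembles (around 0.95 accuracy) and the linear models (around 0.77) is worth a sanity check. One possible cause is repeated patient rows: if the same record lands in both a training fold and a validation fold, a flexible model can simply memorize it. This is an assumption to rule out, not something verified in this notebook. The sketch below shows the check on a small hypothetical frame; on the real data the same two methods would be called on `df`.

```python
import pandas as pd

# Tiny hypothetical frame standing in for the patient table (not real data)
demo = pd.DataFrame({
    'Age': [52, 53, 52, 61, 53],
    'Cholesterol': [212, 203, 212, 203, 203],
    'target': [0, 0, 0, 1, 0],
})

# duplicated() flags every row that is identical to an earlier row
n_dupes = int(demo.duplicated().sum())
deduped = demo.drop_duplicates()

print(f"Duplicate rows: {n_dupes} of {len(demo)}")
print(f"Rows after de-duplication: {len(deduped)}")
```

If the real table does contain many duplicates, de-duplicating before the train/test split should give a more honest performance estimate.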


In [25]:
best_model


# 5. Understanding the Pipeline: Cross-Validation & Grid Search


### What is Cross-Validation (CV)?

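Instead of a single train/test split, we split the data into $k$ folds. The model trains on $k-1$ folds and validates on the remaining fold. This process repeats $k$ times, so that every fold serves as the validation set exactly once, and the $k$ scores are averaged.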
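
The fold-by-fold procedure can be written out by hand. The sketch below is illustrative only: it uses synthetic data from `make_classification` as a stand-in for the patient table, and `StratifiedKFold` (an assumption, chosen so each fold keeps the class balance) with 5 folds.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in data (an assumption; not the heart dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

k = 5
skf = StratifiedKFold(n_splits=k, shuffle=True, random_state=0)

fold_scores = []
for fold, (train_idx, val_idx) in enumerate(skf.split(X_demo, y_demo), start=1):
    model = LogisticRegression(max_iter=1000)
    model.fit(X_demo[train_idx], y_demo[train_idx])        # train on the other k-1 folds
    score = model.score(X_demo[val_idx], y_demo[val_idx])  # accuracy on the held-out fold
    fold_scores.append(score)
    print(f"Fold {fold}: accuracy = {score:.3f}")

print(f"Mean CV accuracy: {np.mean(fold_scores):.3f}")
```

`cross_val_score` wraps this same loop in a single call.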


### What is Grid Search?

Grid Search is an exhaustive search over specified parameter values for an estimator. We combine it with CV to find the "sweet spot" where the model generalizes best without overfitting.
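
To see what "exhaustive" means, the sketch below spells the search out as a plain loop: every combination in the grid is cross-validated and the best mean score wins. The data is synthetic and the grid values are hypothetical, picked only to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import ParameterGrid, cross_val_score

# Synthetic stand-in data (an assumption; not the heart dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

# Hypothetical grid: 2 x 2 = 4 combinations
demo_grid = {'n_estimators': [25, 50], 'max_depth': [3, None]}

results = []
for params in ParameterGrid(demo_grid):
    model = ExtraTreesClassifier(random_state=0, **params)
    mean_score = cross_val_score(model, X_demo, y_demo, cv=5, scoring='accuracy').mean()
    results.append((mean_score, params))

# Keep the combination with the highest mean CV score
best_score, best_params = max(results, key=lambda r: r[0])
print(f"Tried {len(results)} combinations")
print(f"Best params: {best_params} (CV accuracy {best_score:.3f})")
```

`GridSearchCV` performs the same loop and additionally refits the winning configuration on the full training data.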

In [30]:
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import GridSearchCV, KFold

# Define the parameter grid
param_grid = {
    'n_estimators': [48, 50, 52],
    # 'max_depth': [None, 5, 10],
    # 'min_samples_split': [2, 5],
}

# Initialize K-Fold
cv_strategy = KFold(n_splits=5, shuffle=True, random_state=123)

# Initialize Grid Search
grid_search = GridSearchCV(
    estimator=ExtraTreesClassifier(random_state=123),
    param_grid=param_grid,
    cv=cv_strategy,
    scoring='accuracy',
    verbose=1,
)

# Fit Grid Search
grid_search.fit(X_train, y_train)

print(f"Best Parameters: {grid_search.best_params_}")
print(f"Best CV Accuracy: {grid_search.best_score_:.4f}")

Fitting 5 folds for each of 3 candidates, totalling 15 fits
Best Parameters: {'n_estimators': 48}
Best CV Accuracy: 0.9793



# 6. Model Evaluation

Let's evaluate the tuned model on the held-out test set and visualize its confusion matrix with Plotly.

In [31]:

from sklearn.metrics import confusion_matrix

# Predict on the held-out test set
y_pred = grid_search.best_estimator_.predict(X_test)
cm = confusion_matrix(y_test, y_pred)

# Create an interactive confusion matrix
fig = px.imshow(
    cm,
    labels=dict(x="Predicted Label", y="True Label"),
    x=['Healthy', 'Heart Disease'],
    y=['Healthy', 'Heart Disease'],
    text_auto=True,
    color_continuous_scale='Blues',
    title="Confusion Matrix: Tuned Extra Trees Model",
)
fig.show()

In [33]:
grid_search.best_estimator_.feature_importances_

array([0.07252027, 0.27924414, 0.02998069, 0.23765893, 0.19183803,
       0.18875794])
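
The raw array is easier to read once each value is paired with its feature name. The sketch below copies the values printed above and assumes they follow the column order of `X_train` (the model was fitted on that DataFrame directly). In the notebook itself, the equivalent would be `pd.Series(grid_search.best_estimator_.feature_importances_, index=X_train.columns)`.

```python
import pandas as pd

# Column order of X_train, as shown earlier in the notebook
feature_names = ['Gender', 'Chest_Pain_Type', 'Fasting_Blood_Sugar',
                 'Age', 'Resting_Blood_Pressure', 'Cholesterol']

# Importances copied from the printed output above
importances = [0.07252027, 0.27924414, 0.02998069,
               0.23765893, 0.19183803, 0.18875794]

# Pair each importance with its feature and rank from most to least important
importance_table = (
    pd.Series(importances, index=feature_names, name='importance')
      .sort_values(ascending=False)
)
print(importance_table.round(3))
```

Under that assumption, chest pain type and age carry the most weight, while fasting blood sugar contributes the least.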


# 7. Conclusion & Discussion

**Key Takeaways for Students:**

1. **PyCaret vs. Manual:** PyCaret is excellent for rapid prototyping, while Scikit-learn's `GridSearchCV` provides granular control over the tuning process.

2. **Cross-Validation:** It provides a more robust estimate of model performance than a single split, especially on smaller medical datasets.

3. **The "Cost" of Misclassification:** In heart disease prediction, a **False Negative** (predicting someone is healthy when they are sick) is much more dangerous than a **False Positive**.
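
One common way to act on the third takeaway is to lower the decision threshold, trading extra false positives for fewer false negatives. The sketch below is illustrative only: it uses synthetic data and a logistic regression as stand-ins, and the 0.3 threshold is a hypothetical value, not a clinical recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the heart dataset)
X_demo, y_demo = make_classification(n_samples=400, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba_disease = clf.predict_proba(X_te)[:, 1]  # predicted probability of class 1

fn_by_threshold = {}
for threshold in (0.5, 0.3):
    # Flag a patient as "disease" whenever the probability reaches the threshold
    y_hat = (proba_disease >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_te, y_hat).ravel()
    fn_by_threshold[threshold] = int(fn)
    print(f"threshold={threshold}: FN={fn}, FP={fp}, recall={tp / (tp + fn):.2f}")
```

Lowering the threshold can never increase the number of false negatives, because every patient flagged at 0.5 is still flagged at 0.3.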

**Next Steps:**

* Try adjusting the `scoring` parameter in Grid Search to `recall` to minimize False Negatives.

* Implement feature engineering to see if we can improve the F1-score.
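
As a starting point for the first next step, here is a minimal sketch of a grid search that ranks candidates by recall instead of accuracy. It runs on synthetic data, and the grid values and the use of `StratifiedKFold` are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in data (an assumption; not the heart dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

recall_search = GridSearchCV(
    estimator=ExtraTreesClassifier(random_state=123),
    param_grid={'n_estimators': [25, 50], 'max_depth': [3, None]},  # hypothetical grid
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=123),
    scoring='recall',  # rank candidates by recall on the positive class (label 1)
)
recall_search.fit(X_demo, y_demo)

print(f"Best parameters: {recall_search.best_params_}")
print(f"Best CV recall: {recall_search.best_score_:.3f}")
```

Recall alone can be maximized by flagging every patient as sick, so it is worth checking precision or F1 for the selected model as well.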