## Heart Failure Prediction

Team Members: Joshua Hanscom, Andrew Rivera, and Abigail Diaz

Course: CS4661 – Data Science / Machine Learning

Instructor: Professor Mohammad Pourhomayoun

Date: December 01, 2025

## 1. Introduction

Heart failure is a serious medical condition influenced by a combination of demographic, clinical, and physiological factors. Early prediction of potential heart failure can support proactive care and improve patient outcomes.

In this project, we analyze the **Heart Failure Prediction** dataset to identify factors associated with heart failure and build models capable of predicting patient survival or risk of heart failure.

## 2. Objectives

1. **Classification:** Predict whether a patient is likely to experience heart failure (or survive) based on health attributes.
2. **Clustering:** Use unsupervised learning to identify subgroups of patients with similar risk profiles.
3. **Feature Analysis:** Investigate which features are most strongly associated with heart failure to uncover key risk indicators.

By comparing multiple modeling strategies, we aim to determine which methods provide the most accurate and meaningful insights into heart failure risk.


## 3. Dataset Description

The dataset, sourced from [Kaggle - Heart Failure Prediction Dataset](https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction), contains **918 unique patient records**, each with **11 features** and a binary target variable `HeartDisease` (1 = yes, 0 = no).

**Features include:**

- Age
- Sex
- ChestPainType
- RestingBP
- Cholesterol
- FastingBS
- RestingECG
- MaxHR
- ExerciseAngina
- Oldpeak
- ST_Slope

The target variable indicates whether the patient has experienced or is at risk of heart failure.


## 4. Data Preprocessing

Before model training, the dataset will be cleaned and prepared as follows:

- **Initial observation:** check the dataset's overall shape and data types
- **Handle missing values** (if any)
- **Handle duplicate records** (if any)
- **Split data** into training and testing subsets for evaluation consistency
- **Scale numerical variables** as required by model type

### 4.1 Data Loading and Initial Exploration
We loaded the Heart Failure Prediction dataset from the Kaggle CSV file (`heart.csv`) into a pandas DataFrame named `df`. To verify successful loading, we examined the DataFrame's shape, first few rows, and data types, and checked for missing or duplicate records.

In [6]:
import pandas as pd

df = pd.read_csv("dataset/heart.csv")

df.head()

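| | Age | Sex | ChestPainType | RestingBP | Cholesterol | FastingBS | RestingECG | MaxHR | ExerciseAngina | Oldpeak | ST_Slope | HeartDisease |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 40 | M | ATA | 140 | 289 | 0 | Normal | 172 | N | 0.0 | Up | 0 |
| 1 | 49 | F | NAP | 160 | 180 | 0 | Normal | 156 | N | 1.0 | Flat | 1 |
| 2 | 37 | M | ATA | 130 | 283 | 0 | ST | 98 | N | 0.0 | Up | 0 |
| 3 | 48 | F | ASY | 138 | 214 | 0 | Normal | 108 | Y | 1.5 | Flat | 1 |
| 4 | 54 | M | NAP | 150 | 195 | 0 | Normal | 122 | N | 0.0 | Up | 0 |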


### Initial Observations
The dataset contains mixed data types requiring preprocessing: numerical features (Age, RestingBP, Cholesterol, MaxHR, Oldpeak) will need scaling for distance-based models, while categorical features (Sex, ChestPainType, RestingECG, ExerciseAngina, ST_Slope) require one-hot encoding. A comprehensive data exploration and preprocessing pipeline is described in Section 4.2.

In [26]:
print(f"Dataset shape: {df.shape}")
print(f"Patient records: {df.shape[0]}")
print(f"Total columns: {df.shape[1]}")

Dataset shape: (918, 12)
Patient records: 918
Total columns: 12


The shape `(918, 12)` establishes our dataset baseline: 918 patient records and 12 columns (11 features plus the `HeartDisease` target). This baseline matters for the quality checks that follow: if the data is complete, every column should contain 918 non-null values.


### Handling Missing Data
Missing data can significantly impact model performance by introducing bias, 
reducing statistical power, or causing errors during training. Before proceeding 
with analysis, we must verify data completeness.

We examined the dataset structure to identify any missing values:

In [20]:
# Shows column types, non-null counts, and memory usage.
# df.info() prints its report directly and returns None, so it is not wrapped in print().
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 918 entries, 0 to 917
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Age             918 non-null    int64  
 1   Sex             918 non-null    object 
 2   ChestPainType   918 non-null    object 
 3   RestingBP       918 non-null    int64  
 4   Cholesterol     918 non-null    int64  
 5   FastingBS       918 non-null    int64  
 6   RestingECG      918 non-null    object 
 7   MaxHR           918 non-null    int64  
 8   ExerciseAngina  918 non-null    object 
 9   Oldpeak         918 non-null    float64
 10  ST_Slope        918 non-null    object 
 11  HeartDisease    918 non-null    int64  
dtypes: float64(1), int64(6), object(5)
memory usage: 86.2+ KB


The `df.info()` output above confirms data integrity. For clarity, we present 
a formatted summary below:

In [25]:
# Check structure and data types
print("Dataset Structure and Completeness Check:")
# Info table
info_df = pd.DataFrame({
    'Column': df.columns,
    'Non-Null Count': df.count().values,
    'Data Type': df.dtypes.values
})

print(info_df)

Dataset Structure and Completeness Check:
            Column  Non-Null Count Data Type
0              Age             918     int64
1              Sex             918    object
2    ChestPainType             918    object
3        RestingBP             918     int64
4      Cholesterol             918     int64
5        FastingBS             918     int64
6       RestingECG             918    object
7            MaxHR             918     int64
8   ExerciseAngina             918    object
9          Oldpeak             918   float64
10        ST_Slope             918    object
11    HeartDisease             918     int64


**Observation**

The **Non-Null Count** column confirms data completeness: all features show 
918 non-null values, matching our total record count from `shape`. This 
indicates no missing data, allowing us to proceed without imputation and 
preserve the full sample for model training.
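
One caveat worth checking: a non-null count only detects `NaN` entries. Some clinical datasets record an unavailable measurement as 0, and the sample rows shown in Section 4.2 include `Cholesterol` values of 0. The following is a minimal sketch, on a small hypothetical DataFrame rather than the real data, of how such placeholder zeros could be counted. Which columns to treat this way (`Cholesterol` and `RestingBP` here) is an assumption.

```python
import pandas as pd

# Hypothetical toy frame standing in for df; values are illustrative only.
toy = pd.DataFrame({
    "Cholesterol": [289, 0, 283, 0, 195],
    "RestingBP": [140, 160, 0, 138, 150],
    "MaxHR": [172, 156, 98, 108, 122],
})

# Assumption: a value of 0 in these columns is a placeholder, not a measurement.
suspect_cols = ["Cholesterol", "RestingBP"]

nan_counts = toy.isna().sum()                  # what the non-null check sees
zero_counts = (toy[suspect_cols] == 0).sum()   # what the non-null check misses

print(nan_counts.to_dict())
print(zero_counts.to_dict())
```

If the real data shows many such zeros, they could be treated as missing and imputed using statistics computed on the training set only.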

### Handling Duplicate Records

In [30]:
# Check for duplicate records
print("Checking for duplicate patient records")
duplicate_count = df.duplicated().sum()
print(f"Number of duplicate rows: {duplicate_count}")

Checking for duplicate patient records
Number of duplicate rows: 0


**Observation**

No duplicate records were found in the dataset. Each of the 918 entries 
represents a unique patient record, ensuring data integrity for model training.

### Data Cleaning
Initial inspection of the dataset using `df.info()` revealed no missing values across all 918 records and 12 columns. Additionally, a check for duplicate entries using `df.duplicated().sum()` confirmed that each patient record is unique. Given the completeness and integrity of the data, no cleaning steps are required, and we proceed directly to data preparation.

### 4.2 Data Preparation

Now that we’ve verified the dataset’s integrity, we can construct the **feature matrix (`X`)** and **label vector (`y`)** for model training.

We begin by selecting all relevant feature columns from the dataset.  

The target variable, `HeartDisease`, will serve as our label vector (`y`), while the remaining columns form the feature matrix (`X`).

In [34]:
# label vector
y = df['HeartDisease']

counts = y.value_counts().to_dict()

print("HeartDisease Distribution:")
print(f"No heart disease = {counts[0]}")
print(f"Heart disease = {counts[1]}")

HeartDisease Distribution:
No heart disease = 410
Heart disease = 508


In [32]:
features_cols = df.columns[:-1]

# feature matrix
X = df[features_cols]
X[::100]

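| | Age | Sex | ChestPainType | RestingBP | Cholesterol | FastingBS | RestingECG | MaxHR | ExerciseAngina | Oldpeak | ST_Slope |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 40 | M | ATA | 140 | 289 | 0 | Normal | 172 | N | 0.0 | Up |
| 100 | 65 | M | ASY | 130 | 275 | 0 | ST | 115 | Y | 1.0 | Flat |
| 200 | 47 | M | TA | 110 | 249 | 0 | Normal | 150 | N | 0.0 | Up |
| 300 | 60 | M | ASY | 160 | 0 | 1 | Normal | 149 | N | 0.4 | Flat |
| 400 | 50 | F | ASY | 160 | 0 | 1 | Normal | 110 | N | 0.0 | Flat |
| 500 | 65 | M | ASY | 136 | 248 | 0 | Normal | 140 | Y | 4.0 | Down |
| 600 | 57 | M | ASY | 130 | 207 | 0 | ST | 96 | Y | 1.0 | Flat |
| 700 | 42 | M | TA | 148 | 244 | 0 | LVH | 178 | N | 0.8 | Up |
| 800 | 43 | M | NAP | 130 | 315 | 0 | Normal | 162 | N | 1.9 | Up |
| 900 | 58 | M | ASY | 114 | 318 | 0 | ST | 140 | N | 4.4 | Down |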


With the data prepared, we now split it into **training** and **testing** subsets to evaluate how well the model generalizes to unseen data.  

### Splitting the Data for Model Evaluation

We explore two common evaluation strategies to assess model performance:

1. **k-Fold Cross-Validation** – The training set is further evaluated using k-fold cross-validation, which repeatedly trains and tests the model across multiple folds of the data. This provides a stable estimate of model performance, helps guide hyperparameter selection, and reduces sensitivity to any particular partition of the training data.

2. **Standard Train/Test Split** – The dataset is split into an 80% training set and a 20% hold-out test set. This provides a quick baseline evaluation of each model’s performance on unseen data. The test set remains untouched during training, ensuring an unbiased estimate of generalization.

By combining these two strategies, we obtain both robust performance estimates via cross-validation and final, unbiased evaluation using the hold-out test set.


### Standard Train/Test Split

We begin by establishing a **baseline performance** for our models using a simple 80/20 train–test split.  
This approach provides an initial benchmark for accuracy and other key metrics before applying more rigorous validation methods such as k-fold cross-validation.

The data is split into **training** and **testing** subsets using `train_test_split`.

We will use the following parameters: `test_size`=**0.2**, `random_state`=**42**.

Our *test size* indicates that our training dataset will take up 80% of the total dataset while the testing set takes up 20%.

Our *random state* is a seed that allows us to have replicable results when splitting the data.

We then train the Decision Tree model on the training data and evaluate it on the unseen test data.

In [35]:
from sklearn.model_selection import train_test_split

# Split data into testing and training sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size = 0.2, random_state = 42
)
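
The class counts reported above (410 without and 508 with heart disease) are not perfectly balanced. As an optional variation, not the split used in this report, `train_test_split` accepts a `stratify` argument that keeps the class proportions roughly equal in both subsets. The sketch below uses synthetic data with the same class counts; `X_demo` and `y_demo` are hypothetical stand-ins.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins: 918 rows with the same 410/508 class counts as the dataset.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(918, 3))
y_demo = np.array([0] * 410 + [1] * 508)

# stratify=y_demo preserves the class ratio in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(f"Train size: {len(y_tr)}, test size: {len(y_te)}")
print(f"Positive rate (train): {y_tr.mean():.3f}")
print(f"Positive rate (test):  {y_te.mean():.3f}")
```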

### 4.3 Standardization

### Scaling Numerical Features

Before training our models, we standardize all numerical features to ensure they are on a comparable scale. This is important because we use both distance-based models (KNN and K-Means) and gradient-based models (Logistic Regression), which can be heavily influenced by differences in feature magnitude. Without scaling, variables with larger numeric ranges (e.g., Cholesterol, MaxHR) could dominate the learning process and distort model performance.

We use `StandardScaler` to transform numerical features by removing the mean and scaling to unit variance. This centers the data around zero and ensures all numeric features contribute proportionally to the model.
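
In symbols, for a numeric feature value $x$, standardization computes:

$$
z = \frac{x - \mu_{\text{train}}}{\sigma_{\text{train}}}
$$

Here $\mu_{\text{train}}$ and $\sigma_{\text{train}}$ are the mean and standard deviation of that feature, estimated from the training rows only. `StandardScaler` uses the population standard deviation (dividing by $n$, not $n - 1$). As a hypothetical example, a training column with mean 200 and standard deviation 50 maps a value of 250 to $z = (250 - 200) / 50 = 1$.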

A crucial detail is that scaling must be fit only on the training data—not the entire dataset.
If we scale using all data before splitting or before cross-validation, the scaler “sees” information from the test set or validation folds. This results in data leakage, giving the model access to information it should not have during training. This contaminates the evaluation and leads to overly optimistic results.

To avoid this, the scaler is fit on the training set only, then applied (transformed) to both the training and test sets.
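
One way to get this behavior automatically during cross-validation is to wrap the scaler and the model in a scikit-learn `Pipeline`, so the scaler is refit on the training portion of every fold. The sketch below is illustrative only: it uses synthetic data from `make_classification` and a `KNeighborsClassifier` with `n_neighbors=5`, neither of which is taken from this report's results.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the numeric training features and labels.
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=42)

# The scaler lives inside the pipeline, so each CV fold fits it
# on that fold's training rows only and never sees the validation rows.
knn_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])

scores = cross_val_score(knn_pipeline, X_demo, y_demo, cv=10, scoring="accuracy")
print(f"Mean 10-fold accuracy: {scores.mean():.3f}")
```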

Decision Trees, Random Forests, and Gradient Boosted Trees do not require feature scaling despite the scale differences in the dataset. These models split on thresholds that depend only on the relative ordering of each feature's values, and candidate splits are ranked by impurity reduction (information gain), which rescaling does not change. The tree structure is therefore unaffected by differences in feature magnitude.
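
This invariance can be checked empirically. The sketch below, on synthetic integer-valued data (the feature ranges and the labeling rule are made up for illustration), fits the same `DecisionTreeClassifier` on raw and standardized features and compares the predictions.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic features on very different scales (illustrative, not the real data).
rng = np.random.default_rng(0)
X_raw = np.column_stack([
    rng.integers(30, 80, size=300),     # age-like range
    rng.integers(100, 400, size=300),   # cholesterol-like range
])
# Made-up labeling rule with roughly 10% label noise.
y_demo = ((X_raw[:, 0] > 55) ^ (rng.random(300) < 0.1)).astype(int)

X_std = StandardScaler().fit_transform(X_raw)

# Same hyperparameters and seed; only the feature scale differs.
tree_raw = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_raw, y_demo)
tree_std = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_std, y_demo)

same = np.array_equal(tree_raw.predict(X_raw), tree_std.predict(X_std))
print(f"Identical predictions on raw vs. standardized features: {same}")
```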

> **Note:** Categorical variables will be handled separately through one-hot encoding and are *NOT* affected by scaling.

In [37]:
from sklearn.preprocessing import StandardScaler

num_feature_cols = ['Age', 'RestingBP', 'Cholesterol', 'FastingBS', 'MaxHR', 'Oldpeak']

# Fit the scaler ONLY on the training data
scaler = StandardScaler()
scaler.fit(X_train[num_feature_cols])

# Transform the training and test numeric columns
X_train_scaled_array = scaler.transform(X_train[num_feature_cols])
X_test_scaled_array = scaler.transform(X_test[num_feature_cols])

# Rebuild DataFrames so everything stays consistent
X_train_scaled = X_train.copy()
X_test_scaled = X_test.copy()

X_train_scaled[num_feature_cols] = pd.DataFrame(
    X_train_scaled_array, 
    columns=num_feature_cols, 
    index=X_train.index
)

X_test_scaled[num_feature_cols] = pd.DataFrame(
    X_test_scaled_array, 
    columns=num_feature_cols, 
    index=X_test.index
)

print(X_train_scaled)



          Age Sex ChestPainType  RestingBP  Cholesterol  FastingBS RestingECG  \
795 -1.245067   M           NAP  -0.708985     0.372803   1.842609     Normal   
25  -1.886236   M           NAP  -0.166285     0.086146  -0.542709     Normal   
84   0.250993   M           ASY   0.919115     0.123134   1.842609     Normal   
10  -1.779375   F           NAP  -0.166285     0.104640  -0.542709     Normal   
344 -0.283314   M           ASY  -0.708985    -1.846478   1.842609     Normal   
..        ...  ..           ...        ...          ...        ...        ...   
106 -0.603898   F           ASY  -0.708985     0.502261  -0.542709         ST   
270 -0.924483   M           ASY  -0.708985     0.234098  -0.542709     Normal   
860  0.678439   M           ASY  -0.166285     0.493014  -0.542709     Normal   
435  0.678439   M           ASY   1.027656    -1.846478  -0.542709         ST   
102 -1.458790   F           ASY   0.919115     1.778348  -0.542709     Normal   

        MaxHR ExerciseAngin

### Encoding Categorical Features

Several features in this dataset are categorical (e.g., `Sex`, `ChestPainType`, `RestingECG`, `ExerciseAngina`, `ST_Slope`) and must be encoded before model training. Since the scikit-learn models used here (Logistic Regression, KNN, Random Forests, Gradient Boosting, and Decision Trees) require numeric inputs, these values cannot be used in their raw text form.

We use scikit-learn's `OneHotEncoder` to convert each categorical column into a set of binary indicator variables (0/1). This avoids incorrectly introducing an ordinal relationship between categories and allows the model to treat each category independently.

**Note:** We set `sparse_output=False` so the encoder returns a dense NumPy array, which can be easily converted into a Pandas DataFrame.

#### Why the Encoder Is Fit on the Training Data Only

Just like scaling, `OneHotEncoder` must be fit using only the training data. This prevents **data leakage**, which happens when information from the test set unintentionally influences the training process.

If the encoder is fit on the entire dataset:
- It "sees" categories from the test set ahead of time
- The test set is no longer truly unseen
- Evaluation metrics become overly optimistic and invalid

To avoid this, the correct workflow is:

1. **Fit** the encoder on the training set (learns the categories)
2. **Transform** both training and test sets using this fitted encoder
3. During cross-validation, the encoder is fit inside each fold using only the fold's training data

This ensures that at every stage, the model only has access to information available during training, maintaining fair and valid evaluation results.

   
### 4.4 Training Models
   [Split once here - use same split for all models for fair comparison]
   [This ensures all models are evaluated on the same test set]

   
### Model-Specific Preprocessing
   [Explain that different models require different preprocessing]

## Methods / Models Used

We explore both **supervised** and **unsupervised** learning methods:

### Supervised Models

- **Logistic Regression:** Baseline linear model to identify features most strongly influencing heart failure risk.
- **Decision Tree:** A baseline tree-based model providing a visual, hierarchical representation of how features split and contribute to heart failure prediction.
- **Random Forest:** Captures non-linear relationships and provides robust feature importance metrics.
- **Gradient Boosted Trees (GB Trees):** Improves accuracy through sequential learning and weighted updates.

### Unsupervised Learning

- **Clustering (e.g., K-Means):** Used to explore underlying patterns or patient subgroups within the dataset.
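
A minimal sketch of the clustering step is shown below. It runs on synthetic blobs rather than the patient data, and the choice of `n_clusters=2` is purely illustrative. In practice, the number of clusters would be chosen with a criterion such as the silhouette score, and the features would be scaled and encoded first.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the preprocessed feature matrix.
X_demo, _ = make_blobs(n_samples=200, centers=2, n_features=4, random_state=42)
X_demo = StandardScaler().fit_transform(X_demo)

# n_init is set explicitly so the behavior does not depend on library defaults.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
cluster_labels = kmeans.fit_predict(X_demo)

print(f"Cluster sizes: {np.bincount(cluster_labels).tolist()}")
print(f"Silhouette score: {silhouette_score(X_demo, cluster_labels):.3f}")
```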



## Results and Evaluation

In this project, we do not use the full dataset directly for training and evaluation, as this would risk overestimating the model’s performance on unseen data. Instead, we split the dataset into a training set and a hold-out test set, reserving the test set for an unbiased evaluation of the final models.

On the training set, we apply 10-fold cross-validation, which repeatedly trains and validates the models on different subsets of the training data. This provides a stable estimate of model performance, guides hyperparameter selection, and helps prevent overfitting. Importantly, cross-validation scores are not directly comparable to test set accuracy, because cross-validation only uses parts of the training data and does not reflect performance on truly unseen data. For consistency and fairness, the same cross-validation procedure is applied to all models.

After selecting the best hyperparameters, each model is fitted on the full training set and evaluated once on the hold-out test set, providing an unbiased measure of generalization. Performance metrics such as accuracy, ROC curves, and AUC are reported based on both cross-validation and test set evaluation, allowing a thorough comparison of the algorithms.
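
The workflow described above can be sketched as follows. This is an illustration on synthetic data, not the configuration used for the reported results: the `LogisticRegression` model, the grid over `C`, and the variable names are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the prepared feature matrix and labels.
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Hypothetical grid; 10-fold CV runs on the training set only.
param_grid = {"clf__C": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(pipeline, param_grid, cv=10, scoring="accuracy")
search.fit(X_tr, y_tr)  # refit=True (default) retrains the best model on all of X_tr

# Single evaluation on the hold-out test set.
test_pred = search.predict(X_te)
test_proba = search.predict_proba(X_te)[:, 1]

print(f"Best C: {search.best_params_['clf__C']}")
print(f"Mean CV accuracy: {search.best_score_:.3f}")
print(f"Test accuracy: {accuracy_score(y_te, test_pred):.3f}")
print(f"Test ROC-AUC: {roc_auc_score(y_te, test_proba):.3f}")
```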

### Cross-Validation and Hyperparameter Selection

Model performance will be compared using metrics such as:

- Accuracy
- Precision / Recall
- F1-Score
- ROC-AUC

Visualizations (confusion matrices, ROC curves, feature importance plots) will accompany the results.
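
For reference, each of these metrics is available in `sklearn.metrics`. The sketch below computes them on a tiny hand-made example (the labels and scores are invented) that contains 2 true positives, 1 false positive, 1 false negative, and 2 true negatives.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Invented labels, hard predictions, and predicted probabilities for class 1.
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]
y_score = [0.1, 0.6, 0.8, 0.9, 0.4, 0.2]

acc = accuracy_score(y_true, y_pred)      # (TP + TN) / total
prec = precision_score(y_true, y_pred)    # TP / (TP + FP)
rec = recall_score(y_true, y_pred)        # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)             # harmonic mean of precision and recall
auc = roc_auc_score(y_true, y_score)      # uses scores, not hard predictions
cm = confusion_matrix(y_true, y_pred)     # rows: true class, columns: predicted

print(f"Accuracy={acc:.3f} Precision={prec:.3f} Recall={rec:.3f} F1={f1:.3f} AUC={auc:.3f}")
print(cm)
```

Note that ROC-AUC is computed from predicted probabilities (or decision scores), while the other metrics use the thresholded class predictions.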

### Model Performance Comparison
[Table/chart comparing all models' accuracy, precision, recall, F1]

### Feature Importance Analysis Across Models
[Compare what each model considers important]

### Model-Specific Insights
   - Logistic Regression: [interpretation of coefficients]
   - Decision Tree: [tree structure insights]
   - Random Forest: [ensemble patterns]

## Discussion

This section will include:

- Comparison of model performance and interpretability
- Key insights from logistic regression coefficients and tree-based feature importances
- Observations from clustering and subgroup analysis
- Limitations due to dataset size, imbalance, or potential bias
  

Why do models agree/disagree?

"All models agreed on ST_slope_Up, suggesting this is a robust finding"
"Random Forest showed more distributed importance, likely because it captures feature interactions that single trees miss"


What does this mean for heart disease prediction?

Clinical implications
Which features healthcare providers should prioritize


Limitations:

"Decision Tree may overfit to ST_slope_Up"
"Feature importance doesn't reveal interactions between features"

## Conclusion

We will summarize findings, highlight effective prediction methods, and discuss potential improvements such as:

- Collecting larger or more diverse datasets
- Incorporating additional health metrics
- Applying deep learning or ensemble techniques

## Appendix
### Team Contributions
The authors contributed to this work as follows:

A. Joshua Hanscom led data preprocessing, model training, exploratory data analysis, performance evaluation, and model development for Logistic Regression and Gradient Boosted Trees (GB Trees).

B. Andrew Rivera contributed to model training and performance evaluation.

C. Abigail Diaz handled feature engineering, visualization, and report writing.

All authors reviewed and approved the final manuscript.