# Predicting Heart Disease: End-to-End ML Pipeline

This notebook presents a complete machine learning workflow for the Kaggle Playground Series S6E2 competition.

## Workflow

1. Data inspection and quality checks
2. Robust preprocessing using `ColumnTransformer`
3. Stratified cross-validation using ROC AUC
4. Baseline model: Logistic Regression
5. Improved model: `HistGradientBoostingClassifier`
6. Model ensembling (probability averaging)
7. Kaggle submission generation

## Results

- Logistic Regression CV AUC: ~0.950
- HistGradientBoosting CV AUC: ~0.955
- Public Leaderboard AUC: 0.95284

The public leaderboard score lies within about 0.0025 of both cross-validation estimates, which suggests that the validation setup tracks leaderboard performance closely.

This notebook demonstrates practical tabular machine learning skills, proper validation strategy, and competition workflow.

# Predicting Heart Disease (Kaggle Playground Series S6E2)

## Project Overview
This notebook builds a complete, reproducible machine learning pipeline for predicting
the probability of heart disease using structured tabular data. The workflow covers:
data inspection, cleaning, exploratory analysis, robust preprocessing, model training
with cross-validation, and Kaggle submission generation.

## Evaluation
Submissions are evaluated using ROC AUC (higher is better).
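
ROC AUC depends only on how the predicted scores rank positives against negatives: it equals the fraction of (positive, negative) pairs in which the positive receives the higher score, with ties counted as one half. The sketch below illustrates this on a four-row toy example. The helper `pairwise_auc` and the toy arrays are illustrative and not part of the competition code.

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def pairwise_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    # Compare every positive score against every negative score.
    diff = pos[:, None] - neg[None, :]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))


# Toy labels and scores (made up for illustration).
y_toy = [0, 0, 1, 1]
s_toy = [0.1, 0.4, 0.35, 0.8]

auc_pairwise = pairwise_auc(y_toy, s_toy)
auc_sklearn = roc_auc_score(y_toy, s_toy)
print(auc_pairwise, auc_sklearn)  # both 0.75: 3 of the 4 pairs are ordered correctly
```

Because only the ranking matters, any strictly increasing transformation of the predicted probabilities leaves the score unchanged.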

## Tools
Python, pandas, numpy, matplotlib/seaborn, scikit-learn
(Optionally: CatBoost/LightGBM)

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/playground-series-s6e2/sample_submission.csv
/kaggle/input/playground-series-s6e2/train.csv
/kaggle/input/playground-series-s6e2/test.csv


In [2]:
# Import the packages and libraries used throughout the notebook
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import StratifiedKFold
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.metrics import roc_auc_score
from sklearn.linear_model import LogisticRegression

plt.style.use("seaborn-v0_8")
RANDOM_STATE = 42

In [3]:
# We load the data into our notebook
train = pd.read_csv("/kaggle/input/playground-series-s6e2/train.csv")
test  = pd.read_csv("/kaggle/input/playground-series-s6e2/test.csv")
sub   = pd.read_csv("/kaggle/input/playground-series-s6e2/sample_submission.csv")

# Inspect the number of rows (records) and columns (fields) in each dataset
train.shape, test.shape, sub.shape


((630000, 15), (270000, 14), (270000, 2))

In [4]:
# Identify the target column: the only column in train that is absent from test
TARGET = list(set(train.columns) - set(test.columns))[0]
TARGET

'Heart Disease'

The target column is `Heart Disease`.
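
The set difference used above silently picks an arbitrary column if train and test ever differ by more than one column. A slightly more defensive variant, sketched here with a hypothetical helper `find_target` and made-up toy frames, fails loudly in that case:

```python
import pandas as pd


def find_target(train_df, test_df):
    """Return the single column present in train_df but not in test_df."""
    extra = [c for c in train_df.columns if c not in test_df.columns]
    if len(extra) != 1:
        raise ValueError(f"Expected exactly one target column, found: {extra}")
    return extra[0]


# Toy frames for illustration only; values are invented.
toy_train = pd.DataFrame({"id": [0, 1], "Age": [58, 52], "Heart Disease": [1, 0]})
toy_test = pd.DataFrame({"id": [2], "Age": [56]})

target_name = find_target(toy_train, toy_test)
print(target_name)
```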

In [5]:
X = train.drop(columns=[TARGET])
y = train[TARGET]

X.head(), y.value_counts(normalize=True)

(   id  Age  Sex  Chest pain type   BP  Cholesterol  FBS over 120  EKG results  \
 0   0   58    1                4  152          239             0            0   
 1   1   52    1                1  125          325             0            2   
 2   2   56    0                2  160          188             0            2   
 3   3   44    0                3  134          229             0            2   
 4   4   58    1                4  140          234             0            2   
 
    Max HR  Exercise angina  ST depression  Slope of ST  \
 0     158                1            3.6            2   
 1     171                0            0.0            1   
 2     151                0            0.0            1   
 3     150                0            1.0            2   
 4     125                1            3.8            2   
 
    Number of vessels fluro  Thallium  
 0                        2         7  
 1                        0         3  
 2                        0   

In [6]:
# Feature engineering and encoding
# Separate the numerical features from the categorical features
num_features = X.select_dtypes(include=["int64", "float64"]).columns.tolist()
cat_features = X.select_dtypes(include=["object"]).columns.tolist()

numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore", sparse_output=False))
])

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, num_features),
        ("cat", categorical_transformer, cat_features)
    ]
)

In [7]:
# Baseline model: logistic regression
# The pipeline is fitted and evaluated with stratified 5-fold cross-validation in this cell
model = LogisticRegression(max_iter=5000, solver="lbfgs")

pipeline = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("model", model)
])

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)
auc_scores = []

for tr_idx, va_idx in skf.split(X, y):
    X_tr, X_va = X.iloc[tr_idx], X.iloc[va_idx]
    y_tr, y_va = y.iloc[tr_idx], y.iloc[va_idx]

    pipeline.fit(X_tr, y_tr)
    proba = pipeline.predict_proba(X_va)[:, 1]
    auc_scores.append(roc_auc_score(y_va, proba))

print("Baseline CV AUC mean:", np.mean(auc_scores))
print("Baseline CV AUC std :", np.std(auc_scores))

Baseline CV AUC mean: 0.950489503531894
Baseline CV AUC std : 0.0003377262532574926


Note: Logistic Regression was fitted with standardized numeric features to improve
optimization stability and avoid convergence warnings when combining scaled numeric
features with one-hot encoded categorical features.
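
The manual fold loop above can also be written with `cross_val_score`. The sketch below runs both forms on synthetic data from `make_classification`, which stands in for the competition data, and checks that they agree. The names `X_demo`, `y_demo`, and `demo_pipe` are illustrative, and the scores will not match the notebook's results.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the Kaggle dataset).
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=42)

demo_pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=5000, solver="lbfgs")),
])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Compact form: one call returns the per-fold ROC AUC.
scores = cross_val_score(demo_pipe, X_demo, y_demo, cv=cv, scoring="roc_auc")

# Manual form, mirroring the loop used in the notebook.
manual = []
for tr_idx, va_idx in cv.split(X_demo, y_demo):
    demo_pipe.fit(X_demo[tr_idx], y_demo[tr_idx])
    proba = demo_pipe.predict_proba(X_demo[va_idx])[:, 1]
    manual.append(roc_auc_score(y_demo[va_idx], proba))

print("cross_val_score:", np.round(scores, 4))
print("manual loop    :", np.round(manual, 4))
```

The manual loop remains useful when the out-of-fold predictions themselves are needed later, for example to evaluate an ensemble.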

## Improved Model: Gradient Boosting

To capture non-linear relationships between features, a gradient boosting
classifier was trained and evaluated using the same stratified cross-validation
strategy.


In [8]:
from sklearn.ensemble import HistGradientBoostingClassifier

hgb_model = HistGradientBoostingClassifier(
    learning_rate=0.05,
    max_depth=8,
    max_iter=500,
    random_state=RANDOM_STATE
)

hgb_pipeline = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("model", hgb_model)
])

In [9]:
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)
auc_scores_hgb = []

for tr_idx, va_idx in skf.split(X, y):
    X_tr, X_va = X.iloc[tr_idx], X.iloc[va_idx]
    y_tr, y_va = y.iloc[tr_idx], y.iloc[va_idx]

    hgb_pipeline.fit(X_tr, y_tr)
    proba = hgb_pipeline.predict_proba(X_va)[:, 1]
    auc_scores_hgb.append(roc_auc_score(y_va, proba))

print("HGB CV AUC mean:", np.mean(auc_scores_hgb))
print("HGB CV AUC std :", np.std(auc_scores_hgb))


HGB CV AUC mean: 0.9549893109907034
HGB CV AUC std : 0.0004228408493996135


### Model Comparison

| Model | CV AUC |
|------|-------|
| Logistic Regression (Baseline) | ~0.9505 |
| HistGradientBoostingClassifier | ~0.9550 |

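The gradient boosting model raises mean CV AUC by about 0.0045 (0.9550 vs. 0.9505),
roughly ten times the fold-to-fold standard deviation of either model (about 0.0003
to 0.0004). The gain is plausibly due to the model's ability to capture non-linear
effects and feature interactions that a linear model cannot represent.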

In [10]:
hgb_pipeline.fit(X, y)

test_proba = hgb_pipeline.predict_proba(test)[:, 1]

submission = sub.copy()
submission[TARGET] = test_proba
submission.head()

|   | id | Heart Disease |
|---|----|---------------|
| 0 | 630000 | 0.937859 |
| 1 | 630001 | 0.008232 |
| 2 | 630002 | 0.983614 |
| 3 | 630003 | 0.005502 |
| 4 | 630004 | 0.203191 |


In [11]:
submission.to_csv("submission.csv", index=False)
print("submission.csv saved with shape:", submission.shape)

submission.csv saved with shape: (270000, 2)


In [12]:
assert submission.shape[0] == test.shape[0]
assert list(submission.columns) == list(sub.columns)
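
The two assertions above cover row count and column names. A few further checks are cheap to add: matching ids, no missing predictions, and probabilities inside [0, 1]. The helper `check_submission` below is a hypothetical sketch, demonstrated on made-up two-row frames. It assumes the id column is named `id`, as in the output shown earlier.

```python
import numpy as np
import pandas as pd


def check_submission(submission_df, sample_df, target_col):
    """Return a list of problems found; an empty list means all checks passed."""
    if list(submission_df.columns) != list(sample_df.columns):
        return ["columns differ from the sample submission"]
    problems = []
    if not np.array_equal(submission_df["id"].to_numpy(), sample_df["id"].to_numpy()):
        problems.append("ids differ from the sample submission")
    preds = submission_df[target_col].to_numpy(dtype=float)
    if np.isnan(preds).any():
        problems.append("predictions contain missing values")
    if ((preds < 0) | (preds > 1)).any():
        problems.append("predictions fall outside [0, 1]")
    return problems


# Toy frames for illustration only.
toy_sample = pd.DataFrame({"id": [630000, 630001], "Heart Disease": [0.5, 0.5]})
good = toy_sample.assign(**{"Heart Disease": [0.9, 0.1]})
bad = toy_sample.assign(**{"Heart Disease": [1.2, np.nan]})

good_problems = check_submission(good, toy_sample, "Heart Disease")
bad_problems = check_submission(bad, toy_sample, "Heart Disease")
print(good_problems)
print(bad_problems)
```

Running such checks before `to_csv` avoids spending a submission slot on a malformed file.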

In [13]:
# Fit baseline
pipeline.fit(X, y)
lr_test_proba = pipeline.predict_proba(test)[:, 1]

# Fit HGB
hgb_pipeline.fit(X, y)
hgb_test_proba = hgb_pipeline.predict_proba(test)[:, 1]


In [14]:
ensemble_proba = (lr_test_proba + hgb_test_proba) / 2

In [15]:
submission_ens = sub.copy()
submission_ens[TARGET] = ensemble_proba

submission_ens.to_csv("submission_ensemble.csv", index=False)
print("submission_ensemble.csv saved")

submission_ensemble.csv saved


### Model Ensembling

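To improve prediction stability, the predicted probabilities from the Logistic
Regression and HistGradientBoosting models were averaged with equal weights.
Ensembling combines the strengths of linear and non-linear models and often
improves generalization, although the averaged predictions were not scored with
cross-validation in this notebook.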