# Heart Disease Prediction  
### Kaggle Playground Series – Season 6, Episode 2

This notebook builds a baseline machine learning model to predict the likelihood of heart disease using a synthetic tabular dataset provided by Kaggle.


## Overview
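The model is a Logistic Regression fitted on standardized features; it outputs a probability of heart disease for each row.  
The evaluation metric is ROC AUC, which measures how well the model ranks positive cases above negative ones: it equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case.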
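
To make the ranking interpretation concrete, the sketch below computes ROC AUC by hand as the share of positive/negative pairs that are ranked correctly and compares it with `roc_auc_score`. The labels and scores are made-up toy values, not competition data.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy labels and scores (made up for illustration, not from the competition data)
y_true = np.array([0, 0, 1, 1, 0, 1])
scores = np.array([0.10, 0.40, 0.35, 0.80, 0.20, 0.70])

pos = scores[y_true == 1]
neg = scores[y_true == 0]

# Compare every positive score with every negative score;
# a tie counts as half a correctly ranked pair.
diff = pos[:, None] - neg[None, :]
pairwise_auc = ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

sklearn_auc = roc_auc_score(y_true, scores)
print(pairwise_auc, sklearn_auc)
```

Here 8 of the 9 positive/negative pairs are ordered correctly, so both values come out at 8/9. Because only the ordering matters, the predicted probabilities do not need to be well calibrated to score well on this metric.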

## Dataset
- Train set: 630,000 rows with features plus the target (`Heart Disease`, labelled `Presence` or `Absence`)
- Test set: 270,000 rows with features only
- The dataset is synthetically generated for learning purposes.

## Approach
1. Load and inspect the data  
2. Separate features and target  
3. Scale numerical features  
4. Train a Logistic Regression model  
5. Generate probability predictions  
6. Create a Kaggle submission file

## 1. Import Libraries
We import the required Python libraries for data manipulation, modeling, and evaluation.


In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score


## 2. Load Dataset
The training dataset contains features and the target variable (`Heart Disease`).  
The test dataset contains features only and is used for final predictions.


In [2]:
train = pd.read_csv('/kaggle/input/playground-series-s6e2/train.csv')
test = pd.read_csv('/kaggle/input/playground-series-s6e2/test.csv')
sample_sub = pd.read_csv('/kaggle/input/playground-series-s6e2/sample_submission.csv')

train.head()


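|   | id | Age | Sex | Chest pain type | BP | Cholesterol | FBS over 120 | EKG results | Max HR | Exercise angina | ST depression | Slope of ST | Number of vessels fluro | Thallium | Heart Disease |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 58 | 1 | 4 | 152 | 239 | 0 | 0 | 158 | 1 | 3.6 | 2 | 2 | 7 | Presence |
| 1 | 1 | 52 | 1 | 1 | 125 | 325 | 0 | 2 | 171 | 0 | 0.0 | 1 | 0 | 3 | Absence |
| 2 | 2 | 56 | 0 | 2 | 160 | 188 | 0 | 2 | 151 | 0 | 0.0 | 1 | 0 | 3 | Absence |
| 3 | 3 | 44 | 0 | 3 | 134 | 229 | 0 | 2 | 150 | 0 | 1.0 | 2 | 0 | 3 | Absence |
| 4 | 4 | 58 | 1 | 4 | 140 | 234 | 0 | 2 | 125 | 1 | 3.8 | 2 | 3 | 3 | Presence |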


## 3. Data Overview
We check the number of rows and columns in the training and test sets.


In [3]:
train.shape, test.shape

((630000, 15), (270000, 14))

## 4. Feature and Target Separation
We separate the input features from the target variable. The `id` column is left in `X`, so it is passed to the model as a feature.


In [4]:
X = train.drop(columns=['Heart Disease'])
y = train['Heart Disease']

X.head(), y.head()


(   id  Age  Sex  Chest pain type   BP  Cholesterol  FBS over 120  EKG results  \
 0   0   58    1                4  152          239             0            0   
 1   1   52    1                1  125          325             0            2   
 2   2   56    0                2  160          188             0            2   
 3   3   44    0                3  134          229             0            2   
 4   4   58    1                4  140          234             0            2   
 
    Max HR  Exercise angina  ST depression  Slope of ST  \
 0     158                1            3.6            2   
 1     171                0            0.0            1   
 2     151                0            0.0            1   
 3     150                0            1.0            2   
 4     125                1            3.8            2   
 
    Number of vessels fluro  Thallium  
 0                        2         7  
 1                        0         3  
 2                        0   
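
Since `id` looks like a row identifier rather than a clinical measurement, one possible variant drops it from the features and encodes the target as 0/1. This is only a sketch of that alternative, not what the notebook does: the scores reported in this notebook were obtained with `id` included. The names `demo`, `X_alt` and `y_alt` are hypothetical, and the two rows are copied from the `train.head()` output with most columns omitted.

```python
import pandas as pd

# Two rows copied from the train.head() output (most columns omitted)
demo = pd.DataFrame({
    "id": [0, 1],
    "Age": [58, 52],
    "Max HR": [158, 171],
    "Heart Disease": ["Presence", "Absence"],
})

# Variant: exclude the row identifier as well as the target
X_alt = demo.drop(columns=["id", "Heart Disease"])

# Optional: encode the target as 1 (Presence) / 0 (Absence)
y_alt = (demo["Heart Disease"] == "Presence").astype(int)

print(X_alt.columns.tolist())
print(y_alt.tolist())
```

If this variant were used, `id` would also have to be dropped from the test features before scaling and prediction, so that the training and test columns still match.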

## 5. Train–Validation Split
The training data is split 80/20 into training and validation sets, stratified on the target so that both sets keep the same class balance.


In [5]:
X_train, X_val, y_train, y_val = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42,
    stratify=y
)
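
The effect of `stratify` can be checked on a small example. The sketch below uses hypothetical toy labels with 30% positives (not the competition data) and confirms that both parts of the split keep that proportion.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 30% positive
y_toy = np.array([1] * 30 + [0] * 70)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Share of positives in each part of the split
print(y_tr.mean(), y_va.mean())
```

Without `stratify`, the positive share in a small validation set can drift away from the overall rate purely by chance, which makes the validation score noisier.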


## 6. Feature Scaling
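Features are standardized to zero mean and unit variance, which helps the Logistic Regression solver converge and puts all features on the same scale for its L2 regularization. The scaler is fitted on the training split only and then applied to the validation and test data, so no validation or test statistics leak into training.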


In [7]:
scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)
X_test_scaled = scaler.transform(test)
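
As a small illustration of the fit-on-train, transform-elsewhere pattern, the sketch below uses randomly generated stand-in values (hypothetical, loosely resembling a blood pressure column) and shows that the validation data is scaled with the training mean and standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Randomly generated stand-in data, not the competition files
rng = np.random.default_rng(0)
train_part = rng.normal(loc=130.0, scale=15.0, size=(1000, 1))
val_part = rng.normal(loc=130.0, scale=15.0, size=(200, 1))

demo_scaler = StandardScaler()
train_scaled = demo_scaler.fit_transform(train_part)  # learns mean/std from train
val_scaled = demo_scaler.transform(val_part)          # reuses the train mean/std

# Train is exactly standardized; validation is only approximately so
print(train_scaled.mean(), train_scaled.std())
print(val_scaled.mean(), val_scaled.std())
```

The validation mean and standard deviation are close to, but not exactly, 0 and 1. That is expected, because the validation data never influenced the fitted statistics.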


## 7. Model Training
A Logistic Regression model is trained as a baseline classifier.


In [8]:
model = LogisticRegression(max_iter=1000, random_state=42)
model.fit(X_train_scaled, y_train)
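
Because the target here is a string column, it can help to check how scikit-learn orders the classes and how the fitted coefficients read. The sketch below uses synthetic data with hypothetical names (`X_lab`, `y_lab`, `clf`), where the first feature drives the label.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: the label depends only on the sign of the first feature
rng = np.random.default_rng(1)
X_lab = rng.normal(size=(300, 2))
y_lab = np.where(X_lab[:, 0] > 0, "Presence", "Absence")

clf = LogisticRegression(max_iter=1000, random_state=42)
clf.fit(X_lab, y_lab)

# String labels are sorted, so "Presence" is the second (positive) class
print(clf.classes_)
# One row of coefficients; positive values raise the odds of "Presence"
print(clf.coef_)
```

On standardized inputs, each coefficient is the change in log-odds per one standard deviation of the feature, which is what makes this kind of baseline easy to interpret.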


## 8. Model Evaluation
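Model performance is evaluated on the validation set using the ROC AUC metric. `predict_proba` orders its columns by `model.classes_`, which sorts the string labels alphabetically (`Absence`, `Presence`), so column 1 holds the predicted probability of `Presence`.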


In [9]:
val_preds = model.predict_proba(X_val_scaled)[:, 1]
roc_auc_score(y_val, val_preds)


np.float64(0.9515466668454322)
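
A single hold-out split gives one estimate of ROC AUC. A possible extension, not part of this notebook, is stratified k-fold cross-validation, which also shows how much the score varies between folds. The sketch below runs on synthetic stand-in data; wrapping the scaler and model in a pipeline means the scaler is refitted inside each fold.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with a noisy linear signal (not the competition files)
rng = np.random.default_rng(7)
X_cv = rng.normal(size=(500, 4))
noise = rng.normal(scale=0.5, size=500)
y_cv = (X_cv[:, 0] - X_cv[:, 1] + noise > 0).astype(int)

pipe_cv = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# One ROC AUC value per fold
fold_aucs = cross_val_score(pipe_cv, X_cv, y_cv, cv=cv, scoring="roc_auc")
print(fold_aucs.mean(), fold_aucs.std())
```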

## 9. Create Submission File
Predicted probabilities are generated for the test dataset and saved in the required submission format.


In [10]:
test_preds = model.predict_proba(X_test_scaled)[:, 1]

submission = sample_sub.copy()
submission['Heart Disease'] = test_preds

submission.to_csv('submission.csv', index=False)
submission.head()


|   | id | Heart Disease |
|---|---|---|
| 0 | 630000 | 0.966179 |
| 1 | 630001 | 0.003077 |
| 2 | 630002 | 0.993959 |
| 3 | 630003 | 0.009147 |
| 4 | 630004 | 0.113335 |
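
Before uploading, a few sanity checks on the submission frame can catch format problems early. The sketch below uses small stand-in frames (`sample_demo`, `sub_demo`) that reuse the first three ids and predictions from the output; the zero placeholder values in the sample frame are an assumption, not taken from the real `sample_submission.csv`.

```python
import pandas as pd

# Stand-in for the sample submission (placeholder target values are assumed)
sample_demo = pd.DataFrame(
    {"id": [630000, 630001, 630002], "Heart Disease": [0, 0, 0]}
)

# Stand-in for the submission, built the same way as in the notebook
sub_demo = sample_demo.copy()
sub_demo["Heart Disease"] = [0.966179, 0.003077, 0.993959]

# Same columns and row count as the sample file
assert list(sub_demo.columns) == list(sample_demo.columns)
assert len(sub_demo) == len(sample_demo)

# Predictions are valid probabilities with no missing values
assert sub_demo["Heart Disease"].between(0, 1).all()
assert not sub_demo["Heart Disease"].isna().any()

checks_passed = True
print("submission checks passed")
```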


## Results
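- Validation ROC AUC (20% hold-out): 0.95155
- Public leaderboard ROC AUC: 0.94807

This baseline Logistic Regression model performs strongly on the synthetic dataset, with validation and leaderboard scores within about 0.004 of each other, and provides a simple, interpretable solution for predicting heart disease risk.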
