# ❤️ Heart Disease Prediction Using Machine Learning

This notebook demonstrates how to build a machine learning model that predicts the likelihood of heart disease from patient data.  
The workflow covers exploratory data analysis (EDA), data preprocessing and feature encoding, and training three models: Logistic Regression, Random Forest, and XGBoost.

**Dataset:** Kaggle Heart Disease Dataset (https://www.kaggle.com/datasets/rishidamarla/heart-disease-prediction)  
**Goal:** Predict whether a person is likely to have heart disease (`Heart Disease`: Presence = 1, Absence = 0).

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, roc_auc_score, roc_curve, ConfusionMatrixDisplay, RocCurveDisplay
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb
import pickle

import warnings
warnings.filterwarnings('ignore')

In [None]:
df = pd.read_csv("C:/Users/PC/Downloads/archive/Heart_Disease_Prediction.csv")
df.head()

### Data Overview

In [None]:
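df.info()                 # column dtypes and non-null counts
print(df.isnull().sum())  # missing values per column (printed, since only the last expression is displayed)
df.describe()             # summary statistics for numeric columns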

### Data Summary
- No missing values.
- 14 columns in total, covering demographic, clinical, and lab measurements.
- `Heart Disease` is the target variable; the remaining columns are the input features.
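
The points in this summary can be checked programmatically rather than read off the printed output. The snippet below is a minimal sketch on a tiny hand-made frame: `demo` and its values are illustrative stand-ins, not the real CSV. In the notebook, the same three expressions would be applied to `df`.

```python
import pandas as pd

# Toy stand-in for the real dataset (assumption: illustrative values only)
demo = pd.DataFrame({
    "Age": [70, 67, 57, 64],
    "Cholesterol": [322, 564, 261, 263],
    "Heart Disease": ["Presence", "Absence", "Presence", "Absence"],
})

# Total number of missing cells across the whole frame
missing_total = int(demo.isnull().sum().sum())

# Row and column counts (the column count includes the target)
n_rows, n_cols = demo.shape

# Class balance of the target variable
class_counts = demo["Heart Disease"].value_counts().to_dict()

print(f"missing={missing_total}, rows={n_rows}, cols={n_cols}")
print(class_counts)
```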

### Exploratory Data Analysis

In [None]:
sns.countplot(x='Heart Disease', data=df)
plt.title("Target Variable Distribution")
plt.show()

# Age vs Target
sns.boxplot(x='Heart Disease', y='Age', data=df)
plt.title("Age distribution by Heart Disease status")
plt.show()

# Correlation Heatmap
numeric_df = df.select_dtypes(include=['number']) # Filter numeric columns only for correlation

plt.figure(figsize=(12,8))
sns.heatmap(numeric_df.corr(), annot=True, cmap='coolwarm')
plt.title("Feature Correlation (Numeric Variables Only)")
plt.show()

### Data Preprocessing

In [None]:
# Encode target
df['Heart Disease'] = df['Heart Disease'].map({'Presence': 1, 'Absence': 0})

# Dummy encode categorical variables
df_encoded = pd.get_dummies(df, drop_first=True)

In [None]:
# Split
X = df_encoded.drop('Heart Disease', axis=1)
y = df_encoded['Heart Disease']

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Scale
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)

### Model Training

In [None]:
# Logistic Regression
lr_model = LogisticRegression(max_iter=1000, random_state=42)
lr_model.fit(X_train, y_train)

# Random Forest
rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train, y_train)

# XGBoost
# `use_label_encoder` is deprecated in recent XGBoost releases, so it is omitted here
xgb_model = xgb.XGBClassifier(eval_metric='logloss', random_state=42)
xgb_model.fit(X_train, y_train)

### Evaluation

In [None]:
# Logistic Regression
print("🔹 Logistic Regression Report:")
print(classification_report(y_val, lr_model.predict(X_val)))

# Random Forest
print("\n🔸 Random Forest Report:")
print(classification_report(y_val, rf_model.predict(X_val)))

# XGBoost
print("\n🔺 XGBoost Report:")
print(classification_report(y_val, xgb_model.predict(X_val)))

### ROC Curve and Confusion Matrix
The ROC curves and confusion matrices below visualize each model's performance trade-offs on the validation set:

In [None]:
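# ROC curves: pass one shared Axes so all three models land on the same plot
# (without `ax`, each call opens its own figure)
fig, ax = plt.subplots(figsize=(8, 6))
RocCurveDisplay.from_estimator(lr_model, X_val, y_val, name="Logistic Regression", ax=ax)
RocCurveDisplay.from_estimator(rf_model, X_val, y_val, name="Random Forest", ax=ax)
RocCurveDisplay.from_estimator(xgb_model, X_val, y_val, name="XGBoost", ax=ax)
ax.plot([0, 1], [0, 1], 'k--', label="Chance")  # diagonal = random classifier
ax.set_title("ROC Curve Comparison")
ax.legend()
ax.grid(True)
plt.show()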

# Confusion Matrices
fig, axs = plt.subplots(1, 3, figsize=(18, 4))
ConfusionMatrixDisplay.from_estimator(lr_model, X_val, y_val, ax=axs[0]).ax_.set_title("Logistic Regression")
ConfusionMatrixDisplay.from_estimator(rf_model, X_val, y_val, ax=axs[1]).ax_.set_title("Random Forest")
ConfusionMatrixDisplay.from_estimator(xgb_model, X_val, y_val, ax=axs[2]).ax_.set_title("XGBoost")
plt.tight_layout()
plt.show()

#### 📌 Insights:
- Logistic Regression slightly outperformed the other models, with the highest ROC AUC (0.90) on the validation set.
- Random Forest and XGBoost achieved the same AUC as each other, though they may differ in interpretability and training efficiency.
- All three models show strong potential for detecting heart disease, with Logistic Regression offering the best overall balance.
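
The AUC values quoted above can be computed explicitly with `roc_auc_score`, which the notebook already imports. The following is a minimal, self-contained sketch: it uses synthetic data from `make_classification` as a stand-in for the heart disease dataset (an assumption, so the printed numbers will not match the 0.90 reported here), and it leaves out XGBoost to keep the dependencies small. In the notebook itself, the dictionary would hold the fitted `lr_model`, `rf_model`, and `xgb_model`, scored against `X_val` and `y_val`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): not the real heart disease dataset
X_demo, y_demo = make_classification(
    n_samples=270, n_features=13, n_informative=6, random_state=42
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Fit the scaler on the training split only, then apply it to both splits
scaler_demo = StandardScaler()
X_tr = scaler_demo.fit_transform(X_tr)
X_va = scaler_demo.transform(X_va)

demo_models = {
    "Logistic Regression": LogisticRegression(max_iter=1000, random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
}

auc_scores = {}
for name, model in demo_models.items():
    model.fit(X_tr, y_tr)
    # ROC AUC needs scores for the positive class, not hard 0/1 predictions
    positive_proba = model.predict_proba(X_va)[:, 1]
    auc_scores[name] = roc_auc_score(y_va, positive_proba)
    print(f"{name}: ROC AUC = {auc_scores[name]:.3f}")
```

Scoring on `predict_proba` rather than `predict` matters here: AUC measures how well the model ranks positives above negatives, and hard labels discard that ranking.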

### Save Best Model

In [None]:
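# Note: the model was trained on standardized features, so new inputs must be
# transformed with the same fitted `scaler` before calling predict()
with open("logistic_regression_model.pkl", "wb") as f:
    pickle.dump(lr_model, f)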
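
Because the saved model expects standardized inputs, anything that loads it also needs the fitted scaler. One possible way to keep the two together is to wrap them in a scikit-learn `Pipeline` and pickle that single object. The sketch below is an illustration under assumptions, not what the notebook does: it uses synthetic data, the names `bundle` and `restored` are hypothetical, and it serializes to bytes in memory with `pickle.dumps` instead of writing a file.

```python
import pickle

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): not the real heart disease dataset
X_demo, y_demo = make_classification(n_samples=200, n_features=13, random_state=42)

# Scaler and classifier travel together as one object
bundle = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000, random_state=42)),
])
bundle.fit(X_demo, y_demo)

# Round trip through pickle (in memory here; pickle.dump to a file works the same way)
payload = pickle.dumps(bundle)
restored = pickle.loads(payload)

# The restored pipeline accepts raw, unscaled rows and scales them internally
preds_before = bundle.predict(X_demo[:5])
preds_after = restored.predict(X_demo[:5])
print(np.array_equal(preds_before, preds_after))
```

With this layout a downstream app only has to load one file and call `predict` on raw feature values.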

## ✅ Conclusion

- **Best Model**: Logistic Regression achieved the highest ROC AUC (0.90).
- We built and evaluated three models, then saved the best one for deployment in a Streamlit app.
- This pipeline could help clinicians quickly assess heart disease risk from a patient's input metrics.
