<a href="https://colab.research.google.com/github/Shaurya200911/Heart-Disease-Predictor/blob/master/Heart_Disease_Predictor.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

⭐ Heart Disease Prediction Project

This is a heart disease risk predictor built with scikit-learn on a Kaggle dataset. It trains several classical ML classifiers on patient health metrics and combines them into a soft-voting ensemble for health-risk classification.

About the Project

The goal of this project is to analyze patient health metrics and predict the likelihood of heart disease using classical machine learning algorithms.
This system helps demonstrate how ML can provide early risk assessment based on medical data.

The project includes a complete ML pipeline, from data preprocessing to deployment-ready prediction logic.

Technologies & Tools Used

Python

Pandas / NumPy

scikit-learn

Matplotlib / Seaborn

Kaggle Heart Disease dataset

Train–Test Split / Cross-Validation

sklearn model evaluation tools

joblib (for saving the trained model)

(Optional) FastAPI or Flask for creating a prediction API

⭐ Concepts Involved
1. Data Cleaning & Preprocessing

Handling missing values

Removing duplicates

Encoding categorical features

Scaling numerical features

Detecting outliers

Balancing classes (if needed)

Exploratory Data Analysis (EDA)
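
The cleaning checks listed above can be sketched on a tiny frame. This is only an illustration: the `toy` DataFrame and its values are made up, and only the column names mirror the dataset.

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame; column names mirror the dataset, values are invented.
toy = pd.DataFrame({
    "Age": [56.0, 69.0, np.nan, 69.0],
    "Smoking": ["Yes", "No", "No", "No"],
    "Heart Disease Status": ["No", "No", "Yes", "No"],
})

missing_per_column = toy.isna().sum()        # NaN count per column
n_duplicates = int(toy.duplicated().sum())   # rows identical to an earlier row
deduplicated = toy.drop_duplicates()

# Class balance after removing duplicates (share of each label)
class_share = deduplicated["Heart Disease Status"].value_counts(normalize=True)

print(missing_per_column)
print("duplicate rows:", n_duplicates)
print(class_share)
```

The same three calls (`isna().sum()`, `duplicated().sum()`, `value_counts(normalize=True)`) can be run on the real `df` once it is loaded.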

2. Feature Engineering

Correlation analysis

Feature importance ranking

Creating new features

Selecting optimal predictors

Removing irrelevant or noisy features
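
One simple way to approach correlation analysis is to rank numeric features by their absolute correlation with the target. The sketch below uses synthetic data from `make_classification`; the feature names `f0` to `f4` are placeholders, not columns from the heart disease dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification

# Synthetic stand-in data: 5 numeric features, 2 of them informative.
X_arr, y_arr = make_classification(
    n_samples=500, n_features=5, n_informative=2, n_redundant=0, random_state=0
)
frame = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(5)])
frame["target"] = y_arr

# Absolute Pearson correlation of each feature with the 0/1 target, strongest first
corr_with_target = (
    frame.corr()["target"].drop("target").abs().sort_values(ascending=False)
)
print(corr_with_target)
```

Correlation only captures linear, one-feature-at-a-time relationships, so it is a first pass rather than a complete feature-selection method.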

3. Building Multiple ML Models (Model Zoo)

This project includes training and comparing several ML algorithms:

Logistic Regression

K-Nearest Neighbors (KNN)

Decision Tree

Random Forest

Gradient Boosting / XGBoost

Support Vector Machine (optional)

Each model is evaluated, compared, and ranked using metrics such as:

Accuracy

Precision

Recall

F1-Score

ROC-AUC
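
As a quick illustration of how these metrics relate, the following sketch computes all five on a small hand-made example. The labels and scores are invented for the demonstration.

```python
from sklearn.metrics import (
    accuracy_score, f1_score, precision_score, recall_score, roc_auc_score
)

# Invented labels: 1 = heart disease, 0 = no heart disease
y_true  = [0, 0, 0, 0, 1, 1, 1, 0, 0, 1]
y_pred  = [0, 0, 1, 0, 1, 0, 1, 0, 0, 1]                      # hard predictions
y_score = [0.1, 0.2, 0.6, 0.3, 0.9, 0.4, 0.8, 0.2, 0.1, 0.7]  # predicted probabilities

scores = {
    "accuracy": accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),  # TP / (TP + FP)
    "recall": recall_score(y_true, y_pred),        # TP / (TP + FN)
    "f1": f1_score(y_true, y_pred),                # harmonic mean of precision and recall
    "roc_auc": roc_auc_score(y_true, y_score),     # uses scores, not hard predictions
}
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

Note that ROC-AUC is computed from the probability scores, while the other four depend on the chosen decision threshold.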

4. Cross-Validation & Model Tuning

K-Fold Cross-Validation

Hyperparameter tuning

GridSearchCV / RandomizedSearchCV

Avoiding overfitting & improving generalization
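
A minimal sketch of K-fold cross-validation combined with a grid search is shown below. It assumes synthetic data and a small, arbitrary grid over `C`; neither the grid nor the scoring choice comes from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in data (about 20% positives)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.8, 0.2], random_state=0
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# Parameters of a pipeline step are addressed as "<step name>__<parameter>"
param_grid = {"model__C": [0.01, 0.1, 1.0, 10.0]}

# Stratified folds keep the class ratio similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(pipe, param_grid, scoring="roc_auc", cv=cv)
search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 3))
```

Tuning the whole pipeline, rather than the model alone, means the scaler is refit inside each fold, which avoids leaking validation data into preprocessing.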

5. Model Evaluation & Visualization

Confusion matrix

ROC curve

Precision-Recall curves

Feature importance plots

These graphs help interpret model performance and trustworthiness.
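
The numbers behind a confusion matrix and a ROC curve can be computed as in the sketch below, which uses invented labels and scores. Plotting is left out. scikit-learn's `ConfusionMatrixDisplay.from_predictions` and `RocCurveDisplay.from_predictions` can draw the same quantities.

```python
import numpy as np
from sklearn.metrics import auc, confusion_matrix, roc_curve

# Invented labels and predicted probabilities for the positive class
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.3])
y_pred = (y_score >= 0.5).astype(int)

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]]
cm = confusion_matrix(y_true, y_pred)

# One (false positive rate, true positive rate) point per candidate threshold
fpr, tpr, thresholds = roc_curve(y_true, y_score)
roc_auc = auc(fpr, tpr)

print(cm)
print(f"ROC-AUC: {roc_auc:.3f}")
```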

In [1]:
!pip install --upgrade pip
!pip install --upgrade scikit-learn==1.7.2
!pip install --upgrade pandas joblib cloudpickle imbalanced-learn xgboost lightgbm


Collecting pandas
  Using cached pandas-2.3.3-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.metadata (91 kB)
Collecting xgboost
  Using cached xgboost-3.1.2-py3-none-manylinux_2_28_x86_64.whl.metadata (2.1 kB)
Using cached pandas-2.3.3-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (12.4 MB)
Using cached xgboost-3.1.2-py3-none-manylinux_2_28_x86_64.whl (115.9 MB)
Installing collected packages: xgboost, pandas
  Attempting uninstall: xgboost
    Found existing installation: xgboost 3.1.1
    Uninstalling xgboost-3.1.1:
      Successfully uninstalled xgboost-3.1.1
  Attempting uninstall: pandas
    Found existing installation: pandas 2.2.2
    Uninstalling pandas-2.2.2:
      Successfully uninstalled pandas-2.2.2
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OrdinalEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import joblib

In [4]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("oktayrdeki/heart-disease")

print("Path to dataset files:", path)

Using Colab cache for faster access to the 'heart-disease' dataset.
Path to dataset files: /kaggle/input/heart-disease


**Dataset to be used**

In [5]:
df = pd.read_csv(f"{path}/heart_disease.csv")  # reuse the kagglehub download path instead of hard-coding it
df.shape

(10000, 21)

In [78]:
df.head()

Unnamed: 0,Age,Gender,Blood Pressure,Cholesterol Level,Exercise Habits,Smoking,Family Heart Disease,Diabetes,BMI,High Blood Pressure,...,High LDL Cholesterol,Alcohol Consumption,Stress Level,Sleep Hours,Sugar Consumption,Triglyceride Level,Fasting Blood Sugar,CRP Level,Homocysteine Level,Heart Disease Status
0,56.0,Male,153.0,155.0,High,Yes,Yes,No,24.991591,Yes,...,No,High,Medium,7.633228,Medium,342.0,,12.969246,12.38725,No
1,69.0,Female,146.0,286.0,High,No,Yes,Yes,25.221799,No,...,No,Medium,High,8.744034,Medium,133.0,157.0,9.355389,19.298875,No
2,46.0,Male,126.0,216.0,Low,No,No,No,29.855447,No,...,Yes,Low,Low,4.44044,Low,393.0,92.0,12.709873,11.230926,No
3,32.0,Female,122.0,293.0,High,Yes,Yes,No,24.130477,Yes,...,Yes,Low,High,5.249405,High,293.0,94.0,12.509046,5.961958,No
4,60.0,Male,166.0,242.0,Low,Yes,Yes,Yes,20.486289,Yes,...,No,Low,High,7.030971,High,263.0,154.0,10.381259,8.153887,No


In [6]:
numerical_features = [
    "Age","Blood Pressure","Cholesterol Level","BMI","Sleep Hours",
    "Triglyceride Level","Fasting Blood Sugar","CRP Level","Homocysteine Level"
]

binary_features = [
    "Smoking","Family Heart Disease","Diabetes","High Blood Pressure","High LDL Cholesterol",
]

multi_category_features = [
    "Gender","Exercise Habits","Alcohol Consumption","Stress Level"
]

numerical_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])

binary_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OrdinalEncoder(categories=[["No", "Yes"]] * len(binary_features)))
])

multi_cat_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent", fill_value="missing")),  # fill_value only takes effect with strategy="constant"
    ("onehot", OneHotEncoder(handle_unknown="ignore", sparse_output=False))
])

preprocesser = ColumnTransformer(
    transformers=[
        ("numerical", numerical_pipeline, numerical_features),
        ("binary", binary_pipeline, binary_features),
        ("multi_category", multi_cat_pipeline, multi_category_features)
    ],
    remainder="drop"
)

X = df[numerical_features + binary_features + multi_category_features]
y = df["Heart Disease Status"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    stratify=y,
    random_state=42
)

**Prediction Using Logistic Regression**

In [62]:
pipe1 = Pipeline([
    ("preprocesser", preprocesser),
    ("model", LogisticRegression())
])

In [63]:
pipe1.fit(X_train, y_train)

In [64]:
pred1=pipe1.predict(X_test)
print(pred1)

['No' 'No' 'No' ... 'No' 'No' 'No']


**Prediction Using KNeighborsClassifier**

In [65]:
pipe2=Pipeline([
    ("preprocesser",preprocesser),
    ("model", KNeighborsClassifier())
])
pipe2.fit(X_train,y_train)

In [66]:
pred2=pipe2.predict(X_test)
print(pred2)

['No' 'No' 'No' ... 'No' 'Yes' 'No']


**Prediction Using Decision Tree Classifier**

In [67]:
pipe3 = Pipeline([
    ("preprocesser",preprocesser),
    ("model", DecisionTreeClassifier())
])
pipe3.fit(X_train, y_train)

In [68]:
pred3=pipe3.predict(X_test)
print(pred3)

['Yes' 'No' 'No' ... 'Yes' 'No' 'No']


**Prediction Using Random Forest**

In [70]:
pipe4 = Pipeline([
    ("preprocesser",preprocesser),
    ("model", RandomForestClassifier())
])
pipe4.fit(X_train, y_train)

In [71]:
pred4=pipe4.predict(X_test)
print(pred4)

['No' 'No' 'No' ... 'No' 'No' 'No']
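
Printing raw predictions does not show how the four models compare. A possible way to put them side by side is to loop over the candidates and collect a few metrics, as sketched here on synthetic data. The `candidates` dictionary and the data are illustrative assumptions, not the notebook's fitted pipelines.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in data (about 20% positives)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, weights=[0.8, 0.2], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "knn": KNeighborsClassifier(),
    "decision_tree": DecisionTreeClassifier(random_state=42),
    "random_forest": RandomForestClassifier(random_state=42),
}

results = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    proba = model.predict_proba(X_te)[:, 1]  # probability of the positive class
    results[name] = {
        "recall": recall_score(y_te, pred),
        "f1": f1_score(y_te, pred, zero_division=0),
        "roc_auc": roc_auc_score(y_te, proba),
    }

for name, metrics in results.items():
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```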


**Using All Models (Soft-Voting Ensemble)**

In [7]:

voting_model = VotingClassifier(
    estimators=[
        ('lr', LogisticRegression(max_iter=1000, class_weight='balanced')),
        ('knn', KNeighborsClassifier()),
        ('dt', DecisionTreeClassifier()),
        ('rf', RandomForestClassifier(class_weight='balanced'))
    ],
    voting='soft'
)

pipe5 = Pipeline([
    ("preprocesser",preprocesser),
    ("model", voting_model)
])
pipe5.fit(X_train, y_train)

Pipeline(steps=[('preprocesser', ...), ('model', ...)])
  preprocesser: ColumnTransformer(remainder='drop')
    numerical: SimpleImputer(strategy='median'), StandardScaler()
    binary: SimpleImputer(strategy='most_frequent'), OrdinalEncoder(categories=[['No', 'Yes'], ['No', 'Yes'], ...])
    multi_category: SimpleImputer(strategy='most_frequent', fill_value='missing'), OneHotEncoder(handle_unknown='ignore', sparse_output=False)
  model: VotingClassifier(voting='soft')
    lr: LogisticRegression(class_weight='balanced', max_iter=1000)
    knn: KNeighborsClassifier(n_neighbors=5)
    dt: DecisionTreeClassifier()
    rf: RandomForestClassifier(class_weight='balanced', n_estimators=100)

In [8]:
p1={
  "Age": 41,
  "Gender": "Female",
  "Blood Pressure": 118,
  "Cholesterol Level": 200,
  "BMI": 24.5,
  "Sleep Hours": 8,
  "Triglyceride Level": 39,
  "Fasting Blood Sugar": 85.29,
  "CRP Level": 0.7,
  "Homocysteine Level": None,
  "Smoking": "No",
  "Family Heart Disease": "No",
  "Diabetes": "No",
  "High Blood Pressure": "No",
  "High LDL Cholesterol": "No",
  "Exercise Habits": "Low",
  "Alcohol Consumption": "Low",
  "Stress Level": "High"
}
p1_df = pd.DataFrame([p1])

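# Hard class predictions at the default 0.5 decision threshold
pred5 = pipe5.predict(X_test)
print(pred5)
# Probability of the positive class; classes_ is sorted, so column 1 is 'Yes'
y_proba = pipe5.predict_proba(X_test)[:, 1]
# A lower threshold flags more patients, trading precision for recall
threshold = 0.20
y_pred_new = (y_proba >= threshold).astype(int)
print(y_pred_new)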

['No' 'No' 'No' ... 'No' 'No' 'No']
[1 1 1 ... 1 1 0]
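
The cell above builds `p1_df` for a single patient but only scores `X_test`. Scoring one record could look like the sketch below. It uses a small stand-in pipeline (`demo_pipe`) trained on invented data, and it looks up the `'Yes'` column through `classes_` instead of assuming it sits at index 1.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Invented training data with string labels, as in the notebook
rng = np.random.default_rng(0)
train = pd.DataFrame({
    "Age": rng.integers(30, 80, 200),
    "BMI": rng.normal(27, 4, 200),
})
labels = np.where(train["Age"] + rng.normal(0, 10, 200) > 60, "Yes", "No")

demo_pipe = Pipeline([("scaler", StandardScaler()), ("model", LogisticRegression())])
demo_pipe.fit(train, labels)

# A single patient is passed as a one-row DataFrame with the training column names
patient = pd.DataFrame([{"Age": 41, "BMI": 24.5}])

# Find the probability column for 'Yes' explicitly
yes_index = list(demo_pipe.classes_).index("Yes")
risk = demo_pipe.predict_proba(patient)[0, yes_index]

threshold = 0.20  # same custom threshold as in the cell above
flagged = risk >= threshold
print(f"risk={risk:.3f}, flagged={flagged}")
```

With the notebook's objects, the equivalent call would be `pipe5.predict_proba(p1_df)`, assuming `p1_df` contains every column the preprocessor expects.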


In [98]:
# Encode the string labels as 0/1 so they match the thresholded predictions
y_test_numerical = (y_test == "Yes").astype(int)

print("Threshold | Precision | Recall | F1")
print("---------------------------------------")

for t in [0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40]:
    y_pred_t = (y_proba >= t).astype(int)
    p = precision_score(y_test_numerical, y_pred_t)
    r = recall_score(y_test_numerical, y_pred_t)
    f1 = f1_score(y_test_numerical, y_pred_t)
    print(f"{t:.2f}      | {p:.3f}     | {r:.3f}  | {f1:.3f}")

Threshold | Precision | Recall | F1
---------------------------------------
0.10      | 0.200     | 1.000  | 0.333
0.15      | 0.200     | 0.963  | 0.331
0.20      | 0.202     | 0.725  | 0.315
0.25      | 0.198     | 0.450  | 0.275
0.30      | 0.201     | 0.297  | 0.240
0.35      | 0.194     | 0.235  | 0.212
0.40      | 0.178     | 0.205  | 0.190
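
In the table above, precision stays close to 0.20 at every threshold, which is the value it takes when every patient is flagged (threshold 0.10, recall 1.000). That pattern suggests the predicted probabilities may rank patients little better than chance on this split. A threshold-independent metric such as ROC-AUC is one way to check this. The sketch below uses simulated scores (not the notebook's `y_proba`) to show what uninformative and informative scores look like on such metrics.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)
n = 5000
y_true = (rng.random(n) < 0.20).astype(int)  # simulated labels, about 20% positives
base_rate = y_true.mean()

# Scores unrelated to the label: ROC-AUC near 0.5, average precision near the base rate
random_scores = rng.random(n)

# Scores shifted upward for positives: both metrics rise clearly above those baselines
informative_scores = 0.3 * y_true + 0.7 * rng.random(n)

auc_random = roc_auc_score(y_true, random_scores)
auc_informative = roc_auc_score(y_true, informative_scores)
ap_random = average_precision_score(y_true, random_scores)

print(f"base rate:            {base_rate:.3f}")
print(f"ROC-AUC, random:      {auc_random:.3f}")
print(f"ROC-AUC, informative: {auc_informative:.3f}")
print(f"avg precision, random: {ap_random:.3f}")
```

Running `roc_auc_score(y_test_numerical, y_proba)` on the notebook's own predictions would show where the ensemble falls between these two cases.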


In [10]:
final_pipe = pipe5
joblib.dump(final_pipe, 'heart_disease_predictor.joblib')

['heart_disease_predictor.joblib']
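
A saved pipeline can be restored with `joblib.load` and used for prediction without retraining. The sketch below demonstrates the round trip with a small stand-in pipeline and a temporary file. The file name `demo_model.joblib` is hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in pipeline trained on synthetic data
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
demo_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
]).fit(X_demo, y_demo)

# Save to a temporary directory, then load the pipeline back
with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "demo_model.joblib")
    joblib.dump(demo_pipe, model_path)
    restored = joblib.load(model_path)

# The restored pipeline should reproduce the original predictions exactly
same_predictions = np.array_equal(demo_pipe.predict(X_demo), restored.predict(X_demo))
print("round trip OK:", same_predictions)
```

Because the file stores pickled scikit-learn objects, it is safest to load it with the same scikit-learn version used for training (pinned to 1.7.2 in the install cell) and to load only files from a trusted source.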