# Heart Disease Prediction using Machine Learning

## Problem Statement
The objective of this project is to build and evaluate machine learning models
to predict the presence of heart disease based on patient clinical data.

Early detection of heart disease can help in timely medical intervention
and improve patient outcomes.


In [83]:
import pandas as pd
import numpy as np

# import matplotlib.pyplot as plt
# import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import StratifiedKFold, cross_val_score

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

from sklearn.metrics import accuracy_score, confusion_matrix, classification_report


In [84]:
df = pd.read_csv('heart.csv')

## Dataset Overview
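- The dataset contains 1,025 records with 13 clinical attributes, including age (`age`), serum cholesterol (`chol`), resting blood pressure (`trestbps`), and maximum heart rate achieved (`thalach`).
- The target variable (`target`) indicates the presence (1) or absence (0) of heart disease.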


In [85]:
df.head()

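|   | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|-----|-----|----|----------|------|-----|---------|---------|-------|---------|-------|----|------|--------|
| 0 | 52 | 1 | 0 | 125 | 212 | 0 | 1 | 168 | 0 | 1.0 | 2 | 2 | 3 | 0 |
| 1 | 53 | 1 | 0 | 140 | 203 | 1 | 0 | 155 | 1 | 3.1 | 0 | 0 | 3 | 0 |
| 2 | 70 | 1 | 0 | 145 | 174 | 0 | 1 | 125 | 1 | 2.6 | 0 | 0 | 3 | 0 |
| 3 | 61 | 1 | 0 | 148 | 203 | 0 | 1 | 161 | 0 | 0.0 | 2 | 1 | 3 | 0 |
| 4 | 62 | 0 | 0 | 138 | 294 | 1 | 1 | 106 | 0 | 1.9 | 1 | 3 | 2 | 0 |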


In [86]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1025 entries, 0 to 1024
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1025 non-null   int64  
 1   sex       1025 non-null   int64  
 2   cp        1025 non-null   int64  
 3   trestbps  1025 non-null   int64  
 4   chol      1025 non-null   int64  
 5   fbs       1025 non-null   int64  
 6   restecg   1025 non-null   int64  
 7   thalach   1025 non-null   int64  
 8   exang     1025 non-null   int64  
 9   oldpeak   1025 non-null   float64
 10  slope     1025 non-null   int64  
 11  ca        1025 non-null   int64  
 12  thal      1025 non-null   int64  
 13  target    1025 non-null   int64  
dtypes: float64(1), int64(13)
memory usage: 112.2 KB


## Exploratory Data Analysis (EDA)

Basic exploration covers missing values, repeated values, correlation with
the target, summary statistics, and target balance.


In [87]:
df.isnull().sum()

age         0
sex         0
cp          0
trestbps    0
chol        0
fbs         0
restecg     0
thalach     0
exang       0
oldpeak     0
slope       0
ca          0
thal        0
target      0
dtype: int64

In [88]:
df['age'].duplicated().sum()

np.int64(984)

In [89]:
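for col in df.columns:
    print(col, df[col].duplicated().sum())
    # Repeated values within a single column are expected (e.g. shared ages),
    # so nothing can be dropped on this basis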

age 984
sex 1023
cp 1021
trestbps 976
chol 873
fbs 1023
restecg 1022
thalach 934
exang 1023
oldpeak 985
slope 1022
ca 1020
thal 1021
target 1023
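
The per-column counts above mostly reflect repeated values (many patients share the same age, for instance), not repeated patients. Checking for fully duplicated rows is a separate question, and it matters because identical rows landing in both the training and test split would inflate test scores. A minimal sketch on a small made-up frame (the values are invented for illustration and are not taken from `heart.csv`):

```python
import pandas as pd

# Toy frame standing in for the real data; values are invented for illustration.
toy = pd.DataFrame({
    "age":    [52, 53, 52, 61],
    "chol":   [212, 203, 212, 203],
    "target": [0, 0, 0, 1],
})

# Column-level check: counts values that already appeared earlier in one column.
age_repeats = toy["age"].duplicated().sum()

# Row-level check: counts rows identical to an earlier row in every column.
row_repeats = toy.duplicated().sum()

# Keep only the first occurrence of each distinct row.
deduped = toy.drop_duplicates().reset_index(drop=True)

print(age_repeats, row_repeats, len(deduped))
```

On the real data the equivalent row-level check would be `df.duplicated().sum()`.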


In [90]:
df.corr()['target']
# Correlation is not everything, but it tells us a lot

age        -0.229324
sex        -0.279501
cp          0.434854
trestbps   -0.138772
chol       -0.099966
fbs        -0.041164
restecg     0.134468
thalach     0.422895
exang      -0.438029
oldpeak    -0.438441
slope       0.345512
ca         -0.382085
thal       -0.337838
target      1.000000
Name: target, dtype: float64
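
Sorting by absolute value can make a correlation listing like this easier to read, since strong negative correlations (such as `exang` or `oldpeak` above) are as informative as strong positive ones. A minimal sketch on synthetic data; the column names `signal_feature`, `inverse_feature`, and `unrelated` are invented for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
signal = rng.normal(size=n)

# Synthetic stand-in for the real frame: one positively related feature,
# one negatively related feature, and one pure-noise feature.
toy = pd.DataFrame({
    "signal_feature": signal,
    "inverse_feature": -signal + 0.5 * rng.normal(size=n),
    "unrelated": rng.normal(size=n),
    "target": (signal > 0).astype(int),
})

# Pearson correlation of every feature with the target.
corr = toy.corr()["target"].drop("target")

# Reorder by magnitude while keeping the original signs.
ranked = corr.reindex(corr.abs().sort_values(ascending=False).index)
print(ranked)
```

Pearson correlation only captures linear association, and several columns in this dataset (for example `cp` and `thal`) are integer-coded categories, so such a ranking is a rough guide rather than a feature-selection rule.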

In [91]:
df.describe()

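|       | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|-------|-----|-----|----|----------|------|-----|---------|---------|-------|---------|-------|----|------|--------|
| count | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 |
| mean  | 54.434146 | 0.69561 | 0.942439 | 131.611707 | 246.0 | 0.149268 | 0.529756 | 149.114146 | 0.336585 | 1.071512 | 1.385366 | 0.754146 | 2.323902 | 0.513171 |
| std   | 9.07229 | 0.460373 | 1.029641 | 17.516718 | 51.59251 | 0.356527 | 0.527878 | 23.005724 | 0.472772 | 1.175053 | 0.617755 | 1.030798 | 0.62066 | 0.50007 |
| min   | 29.0 | 0.0 | 0.0 | 94.0 | 126.0 | 0.0 | 0.0 | 71.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25%   | 48.0 | 0.0 | 0.0 | 120.0 | 211.0 | 0.0 | 0.0 | 132.0 | 0.0 | 0.0 | 1.0 | 0.0 | 2.0 | 0.0 |
| 50%   | 56.0 | 1.0 | 1.0 | 130.0 | 240.0 | 0.0 | 1.0 | 152.0 | 0.0 | 0.8 | 1.0 | 0.0 | 2.0 | 1.0 |
| 75%   | 61.0 | 1.0 | 2.0 | 140.0 | 275.0 | 0.0 | 1.0 | 166.0 | 1.0 | 1.8 | 2.0 | 1.0 | 3.0 | 1.0 |
| max   | 77.0 | 1.0 | 3.0 | 200.0 | 564.0 | 1.0 | 2.0 | 202.0 | 1.0 | 6.2 | 2.0 | 4.0 | 3.0 | 1.0 |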


In [92]:
df['target'].value_counts()

target
1    526
0    499
Name: count, dtype: int64

The target variable is reasonably balanced (526 positive vs. 499 negative cases),
so no resampling technique such as SMOTE was required.


## Data Preprocessing

- Selected numerical features
- Split data into training and testing sets
- Applied feature scaling using StandardScaler


In [93]:
# All 14 columns are numeric (see df.info()), so this keeps every column
df_num = df.select_dtypes(include='number')

In [94]:
X = df_num.drop('target', axis=1)
y = df_num['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=27, test_size=0.2)

Using a pipeline ensures that scaling is applied correctly
and prevents data leakage during model evaluation.
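
To illustrate the leakage point, the sketch below contrasts scaling the whole dataset before cross-validation with scaling inside a `Pipeline`. It uses synthetic data from `make_classification` rather than `heart.csv`, and the names `X_demo`, `y_demo`, `cv_demo`, and `pipeline_demo` are hypothetical:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data used only for this illustration.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Leaky variant: the scaler is fit on every row, including rows that
# later serve as validation data inside cross-validation.
X_scaled_all = StandardScaler().fit_transform(X_demo)
leaky_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_scaled_all, y_demo, cv=cv_demo, scoring="f1"
)

# Leak-free variant: the pipeline refits the scaler on the training part
# of each fold and only transforms the validation part.
pipeline_demo = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])
clean_scores = cross_val_score(pipeline_demo, X_demo, y_demo, cv=cv_demo, scoring="f1")

print("Scaled before CV:", leaky_scores.mean())
print("Scaled inside CV:", clean_scores.mean())
```

With plain standardization the numeric gap between the two variants is often small; the pipeline matters more as a habit, because the same structure also protects steps such as imputation or feature selection, where leakage can be substantial.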


## Model Training

The following models were trained and evaluated:
- Logistic Regression
- K-Nearest Neighbors (KNN)
- Support Vector Machine (SVM)

Cross-validation was used to ensure robust performance estimation.
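
One way to keep the comparison consistent is to evaluate all three models in a single loop with the same folds and metrics. This is a sketch under assumptions, not the notebook's actual procedure: it runs on synthetic data, and the names `candidates`, `summary`, `X_demo`, and `y_demo` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the training split.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=5000),
    "knn": KNeighborsClassifier(n_neighbors=5),
    "svm_rbf": SVC(kernel="rbf", C=5.0, gamma="scale"),
}

summary = {}
for name, estimator in candidates.items():
    # Each model gets its own scaler so preprocessing is refit per fold.
    model = Pipeline([("scaler", StandardScaler()), ("model", estimator)])
    result = cross_validate(model, X_demo, y_demo, cv=cv_demo, scoring=["accuracy", "f1"])
    summary[name] = {
        "accuracy_mean": result["test_accuracy"].mean(),
        "f1_mean": result["test_f1"].mean(),
        "f1_std": result["test_f1"].std(),
    }

for name, stats in summary.items():
    print(f"{name}: F1 {stats['f1_mean']:.3f} +/- {stats['f1_std']:.3f}, "
          f"accuracy {stats['accuracy_mean']:.3f}")
```

Reporting the standard deviation next to the mean shows how much each model varies between folds, which a single mean hides.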


In [95]:

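# Scale features, then fit logistic regression (max_iter raised to help convergence)
log_reg_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(max_iter=5000))
])

# 5-fold stratified CV keeps the class ratio similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Cross-validated F1 on the training split only
scores = cross_val_score(log_reg_pipeline, X_train, y_train, cv=cv, scoring='f1')

# Refit on the full training split and evaluate once on the held-out test set
log_reg_pipeline.fit(X_train, y_train)
y_pred = log_reg_pipeline.predict(X_test)
print(classification_report(y_test, y_pred))
print("F1 scores:", scores)
print("Mean F1:", scores.mean())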


              precision    recall  f1-score   support

           0       0.94      0.79      0.86       102
           1       0.82      0.95      0.88       103

    accuracy                           0.87       205
   macro avg       0.88      0.87      0.87       205
weighted avg       0.88      0.87      0.87       205

F1 scores: [0.84615385 0.85714286 0.79069767 0.83428571 0.90502793]
Mean F1: 0.8466616049923832


In [96]:
knn_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', KNeighborsClassifier(n_neighbors=5))
])

# Cross-validated accuracy on the training split
knn_scores = cross_val_score(
    knn_pipeline, X_train, y_train, cv=cv, scoring='accuracy'
)

# Evaluate the KNN pipeline itself. The original cell called `pipe`, which is
# the SVM pipeline defined in the next cell, so the output recorded under it
# duplicated the SVM results and is not repeated here.
scores = cross_val_score(knn_pipeline, X_train, y_train, cv=cv, scoring='f1')
knn_pipeline.fit(X_train, y_train)
y_pred = knn_pipeline.predict(X_test)
print(classification_report(y_test, y_pred))
print("Accuracy scores:", knn_scores)
print("F1 scores:", scores)
print("Mean F1:", scores.mean())


In [97]:
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', SVC(
        kernel="rbf",
        C=5.0,
        gamma="scale",
        probability=True,
        class_weight="balanced"
    ))
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

scores = cross_val_score(pipe, X_train, y_train, cv=cv, scoring='f1')
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)
print("F1 scores:", scores)
print("Mean F1:", scores.mean())
print('Classification report',classification_report(y_test, y_pred))


F1 scores: [0.98224852 0.96470588 0.98224852 0.97109827 0.98203593]
Mean F1: 0.9764674235625451
Classification report               precision    recall  f1-score   support

           0       0.98      0.94      0.96       102
           1       0.94      0.98      0.96       103

    accuracy                           0.96       205
   macro avg       0.96      0.96      0.96       205
weighted avg       0.96      0.96      0.96       205



## Model Evaluation

Models were compared using 5-fold cross-validated F1 on the training split,
plus accuracy, the confusion matrix, and per-class precision and recall
on the held-out test set (205 samples).


In [98]:
log_reg_pipeline.fit(X_train, y_train)

y_pred = log_reg_pipeline.predict(X_test)

accuracy_score(y_test, y_pred)


0.8731707317073171

In [99]:
confusion_matrix(y_test, y_pred)

array([[81, 21],
       [ 5, 98]])
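
The accuracy and the per-class figures in the classification report follow directly from this confusion matrix. A short worked example in plain Python, assuming scikit-learn's convention that rows are true classes and columns are predicted classes:

```python
# Confusion matrix for Logistic Regression on the test set:
# rows are true classes (0, 1), columns are predicted classes (0, 1).
tn, fp = 81, 21   # true class 0: correctly rejected / wrongly flagged
fn, tp = 5, 98    # true class 1: missed / correctly flagged

total = tn + fp + fn + tp

accuracy = (tn + tp) / total          # share of all predictions that are correct
precision_pos = tp / (tp + fp)        # of predicted disease, how many truly have it
recall_pos = tp / (tp + fn)           # of true disease cases, how many were found
recall_neg = tn / (tn + fp)           # of healthy cases, how many were cleared

print(f"accuracy      = {accuracy:.4f}")
print(f"precision (1) = {precision_pos:.4f}")
print(f"recall (1)    = {recall_pos:.4f}")
print(f"recall (0)    = {recall_neg:.4f}")
```

The 21 false positives against 5 false negatives explain why recall is higher for class 1 than for class 0: the model errs toward flagging disease.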

In [100]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.94      0.79      0.86       102
           1       0.82      0.95      0.88       103

    accuracy                           0.87       205
   macro avg       0.88      0.87      0.87       205
weighted avg       0.88      0.87      0.87       205



Logistic Regression reached a mean cross-validated F1 of 0.85 (fold scores
from 0.79 to 0.91) and 0.87 accuracy on the test set, making it a reasonable
baseline model for this healthcare-related problem.
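
One way to make stability across folds concrete is to summarize the spread of the fold scores printed in the training cells. A minimal sketch using only the standard library, with the F1 values copied from the Logistic Regression and SVM outputs above (the helper `summarize` is a hypothetical name):

```python
from statistics import mean, stdev

# Fold-level F1 scores copied from the cell outputs above.
log_reg_f1 = [0.84615385, 0.85714286, 0.79069767, 0.83428571, 0.90502793]
svm_f1 = [0.98224852, 0.96470588, 0.98224852, 0.97109827, 0.98203593]

def summarize(scores):
    """Return the mean, sample standard deviation, and range of fold scores."""
    return {
        "mean": mean(scores),
        "std": stdev(scores),
        "range": max(scores) - min(scores),
    }

log_reg_summary = summarize(log_reg_f1)
svm_summary = summarize(svm_f1)

for name, stats in [("Logistic Regression", log_reg_summary), ("SVM (RBF)", svm_summary)]:
    print(f"{name}: mean={stats['mean']:.4f} std={stats['std']:.4f} range={stats['range']:.4f}")
```

A smaller standard deviation and range indicate less variation between folds.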


## Conclusion

- An end-to-end machine learning pipeline was successfully built.
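- Logistic Regression served as a baseline (mean cross-validated F1 0.85, test accuracy 0.87), while the RBF-kernel SVM scored higher on both (mean cross-validated F1 0.98, test accuracy 0.96).
- Feature scaling was applied inside each pipeline, so the scaler is fit only on the training portion of each fold.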
- This project demonstrates a complete ML workflow from data
  exploration to model evaluation.
