# ❤️ Heart Disease Prediction System using Machine Learning




---

# 📄 Project Metadata
### **Title:** **Heart Disease Prediction using Machine Learning** ❤️‍🩹
### **Author:** **Asad Ali** ✍️
### **Institute:** **University of Okara** 🎓
### **Email:** 📧 [asadalyy834@gmail.com](mailto:asadalyy834@gmail.com)
### **Course:** **Data Science / Machine Learning** 📊🧠
### **Date:** **July 2025** 📅
### **Version:** **1.0** 🔢
### **Language:** **Python** 🐍
### **Libraries Overview:**
- **Pandas** 📚: Data manipulation and analysis
- **NumPy** 🔢: Numerical computing
- **Scikit-learn** ⚙️: Machine learning tools and algorithms
- **Matplotlib** 📈: Data visualization
- **Seaborn** 🌊: Statistical data visualization
- **Plotly** 📊: Interactive data visualization
### **Dataset:** **UCI Heart Disease Dataset from Kaggle** 📋

---


# 📊 About the Dataset

### 🧬 Context

This is a **multivariate dataset**: each record holds several clinical variables, both numeric and categorical. Although the full database includes **76 attributes**, most published research uses a subset of **14 key features**.

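- 📍 **Primary Source Used:** *Cleveland database*, the subset most commonly used by machine learning researchers. The CSV loaded in this notebook also includes records from other sites (e.g., Hungary and Switzerland), identified by its `dataset` column.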
- 🧠 **Main Objective:** Predict whether a patient has heart disease or not based on medical parameters.
- 🔎 **Secondary Objective:** Gain diagnostic insights through statistical and machine learning exploration.

---

### 📌 Selected Attribute Descriptions (14 Core Features)

| 🔢 No. | 🧬 Column Name | 📖 Description                                                                 |
|-------:|---------------|--------------------------------------------------------------------------------|
| 1️⃣    | `age`         | Age of the patient (in years)                                                  |
| 2️⃣    | `sex`         | Gender of patient (`0` = Female, `1` = Male)                                   |
| 3️⃣    | `cp`          | Chest pain type: `typical angina`, `atypical angina`, `non-anginal`, `asymptomatic` |
| 4️⃣    | `trestbps`    | Resting blood pressure (in mm Hg at admission)                                 |
| 5️⃣    | `chol`        | Serum cholesterol level (in mg/dl)                                             |
| 6️⃣    | `fbs`         | Fasting blood sugar > 120 mg/dl (`1` = True; `0` = False)                       |
| 7️⃣    | `restecg`     | ECG results: `normal`, `ST-T abnormality`, `left ventricular hypertrophy`      |
| 8️⃣    | `thalach`     | Maximum heart rate achieved (the column is named `thalch` in this CSV)         |
| 9️⃣    | `exang`       | Exercise-induced angina (`1` = Yes; `0` = No)                                   |
| 🔟     | `oldpeak`     | ST depression induced by exercise relative to rest                             |
| 1️⃣1️⃣ | `slope`       | Slope of the peak exercise ST segment                                          |
| 1️⃣2️⃣ | `ca`          | Number of major vessels (0–3) colored by fluoroscopy                           |
| 1️⃣3️⃣ | `thal`        | Thalassemia condition: `normal`, `fixed defect`, `reversible defect`           |
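| 1️⃣4️⃣ | `target/num`  | Target attribute. In this CSV, `num` is `0` for no disease and values above `0` (e.g., `1`, `2`, `3`) indicate heart disease; a binary target uses `0` = No Disease, `1` = Heart Disease |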
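
The `num` column in the CSV loaded later holds values above `1` (the first rows show `2`, and `3` appears further down), while the stated objective is a yes/no prediction. A minimal sketch of one common way to reconcile the two, assuming any value above `0` counts as disease; the `target` name and the toy values are illustrative, not taken from the notebook:

```python
import pandas as pd

# Illustrative values only; in the notebook this would be df["num"]
num = pd.Series([0, 2, 1, 0, 3], name="num")

# Assumption: any value above 0 is treated as "heart disease present"
target = (num > 0).astype(int).rename("target")

# Class balance of the binary target
print(target.value_counts().sort_index())
```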

---

### 🧾 Additional Columns (May Appear in Extended Datasets)

| 🔹 Column        | 🔍 Description                              |
|------------------|--------------------------------------------|
| `id`             | Unique ID for each patient                 |
| `origin`         | Source location of data (e.g., Hungary); named `dataset` in this CSV |

---

### 👨‍⚕️ Acknowledgements

**Contributors & Medical Institutions:**

- 🏥 *Hungarian Institute of Cardiology, Budapest*: **Dr. Andras Janosi**  
- 🏥 *University Hospital, Zurich, Switzerland*: **Dr. William Steinbrunn**  
- 🏥 *University Hospital, Basel, Switzerland*: **Dr. Matthias Pfisterer**  
- 🏥 *V.A. Medical Center, Long Beach & Cleveland Clinic*: **Dr. Robert Detrano**

---

### 📚 Relevant Research Papers

- 📄 *International application of a new probability algorithm for the diagnosis of coronary artery disease*  
  ➤ *Detrano, R. et al., American Journal of Cardiology, 1989*

- 📄 *Instance-based prediction of heart-disease presence with the Cleveland database*  
  ➤ *David W. Aha & Dennis Kibler*

- 📄 *Models of incremental concept formation*  
  ➤ *Gennari, J.H., Langley, P., & Fisher, D., Artificial Intelligence, 1989*

---

### 🙏 Citation Request

> The authors request that any publication using this dataset must credit the principal investigators:
> 
> - **Dr. Andras Janosi** – Hungarian Institute of Cardiology  
> - **Dr. William Steinbrunn** – University Hospital, Zurich  
> - **Dr. Matthias Pfisterer** – University Hospital, Basel  
> - **Dr. Robert Detrano** – Cleveland Clinic Foundation & Long Beach VA Medical Center

---



# 1. 📚 Importing Libraries

In [99]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.impute import KNNImputer
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [100]:
# Show all rows and columns when printing DataFrames
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

In [101]:
# Let's Load the Dataset that is in our local directory
df = pd.read_csv("heart_disease_uci.csv")
# Let's have a look at the first few rows of the dataset
print(df.head())

   id  age     sex    dataset               cp  trestbps   chol    fbs         restecg  thalch  exang  oldpeak        slope   ca               thal  num
0   1   63    Male  Cleveland   typical angina     145.0  233.0   True  lv hypertrophy   150.0  False      2.3  downsloping  0.0       fixed defect    0
1   2   67    Male  Cleveland     asymptomatic     160.0  286.0  False  lv hypertrophy   108.0   True      1.5         flat  3.0             normal    2
2   3   67    Male  Cleveland     asymptomatic     120.0  229.0  False  lv hypertrophy   129.0   True      2.6         flat  2.0  reversable defect    1
3   4   37    Male  Cleveland      non-anginal     130.0  250.0  False          normal   187.0  False      3.5  downsloping  0.0             normal    0
4   5   41  Female  Cleveland  atypical angina     130.0  204.0  False  lv hypertrophy   172.0  False      1.4    upsloping  0.0             normal    0


In [102]:
# Getting the info of our Dataset
print("Information about the Dataset")
print(df.info())

Information about the Dataset
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 920 entries, 0 to 919
Data columns (total 16 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   id        920 non-null    int64  
 1   age       920 non-null    int64  
 2   sex       920 non-null    object 
 3   dataset   920 non-null    object 
 4   cp        920 non-null    object 
 5   trestbps  861 non-null    float64
 6   chol      890 non-null    float64
 7   fbs       830 non-null    object 
 8   restecg   918 non-null    object 
 9   thalch    865 non-null    float64
 10  exang     865 non-null    object 
 11  oldpeak   858 non-null    float64
 12  slope     611 non-null    object 
 13  ca        309 non-null    float64
 14  thal      434 non-null    object 
 15  num       920 non-null    int64  
dtypes: float64(5), int64(3), object(8)
memory usage: 115.1+ KB
None


In [103]:
# Check the Shape of the Dataset
print("Shape of the Dataset")
print("The Dataset has", df.shape[0], "rows and", df.shape[1], "columns.")

Shape of the Dataset
The Dataset has 920 rows and 16 columns.


-----

# 2. Data Preprocessing 🔍
## 1. **Handling Missing Values:** Fill or drop missing data.




In [104]:
# Let's First check for missing values in our dataset
missing_values = df.isnull().sum().sort_values(ascending=False)
print("Missing Values in the Dataset:")
print(missing_values[missing_values > 0])

Missing Values in the Dataset:
ca          611
thal        486
slope       309
fbs          90
oldpeak      62
trestbps     59
exang        55
thalch       55
chol         30
restecg       2
dtype: int64


In [105]:
# Check the percentage of missing values in our dataset
print((df.isnull().sum() / len(df) * 100).sort_values(ascending=False))

ca          66.413043
thal        52.826087
slope       33.586957
fbs          9.782609
oldpeak      6.739130
trestbps     6.413043
exang        5.978261
thalch       5.978261
chol         3.260870
restecg      0.217391
cp           0.000000
dataset      0.000000
id           0.000000
age          0.000000
sex          0.000000
num          0.000000
dtype: float64
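
The counts, percentages, and dtypes printed in separate cells can also be collected into one table. A small sketch on made-up data; `missing_summary` is a hypothetical helper, not part of the notebook:

```python
import numpy as np
import pandas as pd

def missing_summary(frame):
    """Return count, percentage, and dtype for every column that has missing values."""
    counts = frame.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(2),
        "dtype": frame.dtypes.astype(str),
    })
    # Keep only columns with gaps, worst first
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

# Toy frame (values are illustrative, not from heart_disease_uci.csv)
toy = pd.DataFrame({
    "age": [63, 67, 37, 41],
    "ca": [0.0, np.nan, np.nan, np.nan],
    "thal": ["normal", None, "normal", None],
})
report = missing_summary(toy)
print(report)
```

In the notebook, the same helper could be called as `missing_summary(df)`.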


In [106]:
# Let's check the type of each column that has missing values in our dataset.
print("Data Types of Columns with Missing Values:")
print(df.dtypes[missing_values[missing_values > 0].index])

Data Types of Columns with Missing Values:
ca          float64
thal         object
slope        object
fbs          object
oldpeak     float64
trestbps    float64
exang        object
thalch      float64
chol        float64
restecg      object
dtype: object


In [107]:
# Impute missing values in the numeric (int/float) columns using KNN Imputer.
# Note: this selection also includes `id` and the target `num`. They have no missing
# values, but they take part in the distance computation and come back as floats.
numeric_cols = df.select_dtypes(include=[np.number]).columns
imputer = KNNImputer(n_neighbors=5)
df_imputed = pd.DataFrame(imputer.fit_transform(df[numeric_cols]), columns=numeric_cols, index=df.index)
# Replace the original numeric columns with the imputed ones
df[numeric_cols] = df_imputed
# Let's again have a look at the missing values after imputation of numeric columns
missing_values_after_imputation = df.isnull().sum().sort_values(ascending=False)
print("Missing Values in the Dataset after Imputation:")
print(missing_values_after_imputation[missing_values_after_imputation > 0])


Missing Values in the Dataset after Imputation:
thal       486
slope      309
fbs         90
exang       55
restecg      2
dtype: int64
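
`KNNImputer` fills each gap with the mean of that feature over the nearest rows, where distance is computed on the features both rows have. Because the cell above passes every numeric column, `id` and the target `num` also influence which rows count as neighbors. A hedged sketch of one alternative that restricts imputation to feature columns; the toy values and the `feature_cols` name are assumptions for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy frame with the same column roles as the notebook's data (values are made up)
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "age": [63, 67, 37, 41],
    "chol": [233.0, np.nan, 250.0, 204.0],
    "num": [0, 2, 0, 0],
})

# Assumption: the identifier and the target should not drive the neighbor search
numeric_cols = toy.select_dtypes(include=[np.number]).columns
feature_cols = [c for c in numeric_cols if c not in ("id", "num")]

imputer = KNNImputer(n_neighbors=2)
toy[feature_cols] = imputer.fit_transform(toy[feature_cols])

# Row 1's chol becomes 218.5: the mean of its two nearest rows by age (233.0 and 204.0)
print(toy)
```

Since KNN relies on distances, features on large scales (such as `chol`) can dominate the neighbor search; scaling before imputation is a common option, though the notebook does not do this.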


In [108]:
# Check the data type of each column that still contains missing values after imputing the numeric columns
print("Data Types of Columns with Missing Values after Imputation:")
print(df.dtypes[missing_values_after_imputation[missing_values_after_imputation > 0].index])


Data Types of Columns with Missing Values after Imputation:
thal       object
slope      object
fbs        object
exang      object
restecg    object
dtype: object


In [None]:
# Encode, impute, and decode the categorical columns using an ML model
def encode_and_rf_impute(df):
    """Fill missing values in object columns with RandomForestClassifier predictions."""
    df_copy = df.copy()
    # Label-encode object columns. NaN is cast to the string 'nan' and gets its own
    # code, which is overwritten below for each column that is imputed.
    label_encoders = {}
    object_cols = df_copy.select_dtypes(include=['object']).columns
    for col in object_cols:
        le = LabelEncoder()
        df_copy[col] = le.fit_transform(df_copy[col].astype(str))
        label_encoders[col] = le

    # Impute each object column: train on the rows where it is observed and
    # predict the rows where it is missing. All other columns (including `id`
    # and `num`) are used as predictors.
    for col in object_cols:
        missing = df[col].isnull()
        if missing.any():
            not_missing = ~missing
            X_train = df_copy.loc[not_missing].drop(columns=[col])
            y_train = df_copy.loc[not_missing, col]
            X_pred = df_copy.loc[missing].drop(columns=[col])
            clf = RandomForestClassifier(n_estimators=100, random_state=0)
            clf.fit(X_train, y_train)
            df_copy.loc[missing, col] = clf.predict(X_pred)

    # Decode object columns back to their original labels. Because the encoders
    # were fitted on strings, booleans such as `fbs` come back as 'True'/'False'.
    for col, le in label_encoders.items():
        df_copy[col] = le.inverse_transform(df_copy[col].round().astype(int))

    return df_copy

# Apply the function using RandomForestClassifier for categorical columns
df  = encode_and_rf_impute(df)
print("Missing Values in the Dataset after Encoding and RF Imputation:")
print(df.isnull().sum())


Missing Values in the Dataset after Encoding and RF Imputation:
id          0
age         0
sex         0
dataset     0
cp          0
trestbps    0
chol        0
fbs         0
restecg     0
thalch      0
exang       0
oldpeak     0
slope       0
ca          0
thal        0
num         0
dtype: int64
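
The function above label-encodes every object column and uses all remaining columns, including `id` and `num`, as predictors. A simplified sketch of the same idea for a single column, trained only on rows where it is observed and on explicitly chosen, fully observed numeric predictors; `rf_impute_column`, its parameters, and the toy data are hypothetical:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def rf_impute_column(frame, target_col, predictor_cols, random_state=0):
    """Fill gaps in one categorical column using a random forest on the given predictors."""
    out = frame.copy()
    missing = out[target_col].isna()
    if not missing.any():
        return out
    clf = RandomForestClassifier(n_estimators=50, random_state=random_state)
    # Train on observed rows only; string labels are accepted directly
    clf.fit(out.loc[~missing, predictor_cols], out.loc[~missing, target_col])
    out.loc[missing, target_col] = clf.predict(out.loc[missing, predictor_cols])
    return out

# Toy data (illustrative values); predictors must have no missing values
toy = pd.DataFrame({
    "age": [63, 67, 37, 41, 56, 62],
    "oldpeak": [2.3, 1.5, 3.5, 1.4, 0.8, 3.6],
    "slope": ["downsloping", "flat", "downsloping", "upsloping", np.nan, "downsloping"],
})
filled = rf_impute_column(toy, "slope", ["age", "oldpeak"])
print(filled)
```

Choosing predictors explicitly keeps the target `num` out of the imputation model, which is one way to avoid leaking label information into the features.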


In [110]:
# Look at a few random rows of the dataset after encoding and imputation
print("Random Rows of the Dataset after Encoding and Imputation:")
print(df.sample(5))

Random Rows of the Dataset after Encoding and Imputation:
        id   age     sex      dataset               cp  trestbps   chol    fbs           restecg  thalch  exang  oldpeak        slope   ca               thal  num
705  706.0  65.0    Male  Switzerland     asymptomatic     145.0    0.0  False  st-t abnormality    67.0  False      1.3         flat  1.4       fixed defect  3.0
75    76.0  65.0  Female    Cleveland      non-anginal     160.0  360.0  False    lv hypertrophy   151.0  False      0.8    upsloping  0.0             normal  0.0
9     10.0  53.0    Male    Cleveland     asymptomatic     140.0  203.0   True    lv hypertrophy   155.0   True      3.1  downsloping  0.0  reversable defect  1.0
373  374.0  44.0    Male      Hungary     asymptomatic     150.0  412.0  False            normal   170.0  False      0.0    upsloping  1.2             normal  0.0
208  209.0  55.0    Male    Cleveland  atypical angina     130.0  262.0  False            normal   155.0  False      0.0    ups

In [111]:
# Getting the info of our dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 920 entries, 0 to 919
Data columns (total 16 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   id        920 non-null    float64
 1   age       920 non-null    float64
 2   sex       920 non-null    object 
 3   dataset   920 non-null    object 
 4   cp        920 non-null    object 
 5   trestbps  920 non-null    float64
 6   chol      920 non-null    float64
 7   fbs       920 non-null    object 
 8   restecg   920 non-null    object 
 9   thalch    920 non-null    float64
 10  exang     920 non-null    object 
 11  oldpeak   920 non-null    float64
 12  slope     920 non-null    object 
 13  ca        920 non-null    float64
 14  thal      920 non-null    object 
 15  num       920 non-null    float64
dtypes: float64(8), object(8)
memory usage: 115.1+ KB


- ### The missing-values problem is now fully taken care of 😎

## 2. **Cleaning:** Remove duplicates and handle outliers.
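
A minimal sketch of what this step could look like, assuming exact-duplicate removal and the 1.5 × IQR rule for flagging outliers; the notebook's own approach is not shown in this section, and `iqr_outlier_mask` and the toy values are illustrative:

```python
import pandas as pd

# Toy data (made-up values); row 2 duplicates row 1 and the last chol is implausible
toy = pd.DataFrame({
    "age": [63, 67, 67, 37, 41],
    "chol": [233.0, 286.0, 286.0, 250.0, 0.0],
})

# Drop rows that are exact duplicates of an earlier row
deduped = toy.drop_duplicates().reset_index(drop=True)

def iqr_outlier_mask(series, k=1.5):
    """Return True where a value lies more than k * IQR outside the quartiles."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

mask = iqr_outlier_mask(deduped["chol"])
print(deduped[mask])
```

Flagged values are worth inspecting before dropping them: a cholesterol of `0.0`, as in the Switzerland row sampled above, is more plausibly a placeholder for a missing measurement than a real reading.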