# ❤️ Heart Disease Prediction System using Machine Learning




---

# 📄 Project Metadata
### **Title:** **Heart Disease Prediction using Machine Learning** ❤️‍🩹
### **Author:** **Asad Ali** ✍️
### **Institute:** **University of Okara** 🎓
### **Email:** 📧 [asadalyy834@gmail.com](mailto:asadalyy834@gmail.com)
### **Course:** **Data Science / Machine Learning** 📊🧠
### **Date:** **July 2025** 📅
### **Version:** **1.0** 🔢
### **Language:** **Python** 🐍
### **Libraries Overview:**
- **Pandas** 📚: Data manipulation and analysis
- **NumPy** 🔢: Numerical computing
- **Scikit-learn** ⚙️: Machine learning tools and algorithms
- **Matplotlib** 📈: Data visualization
- **Seaborn** 🌊: Statistical data visualization
- **Plotly** 📊: Interactive data visualization
### **Dataset:** **UCI Heart Disease Dataset from Kaggle** 📋

---


## 📊 About the Dataset

### 🧬 Context

This is a **multivariate dataset**: each patient record carries several numerical and categorical variables. Although the full database includes **76 attributes**, most published research uses a subset of **14 key features**.

- 📍 **Primary Source Used:** *Cleveland database*, the subset most commonly used by machine learning researchers. The Kaggle CSV loaded in this notebook has 920 rows and a `dataset` column recording each row's source.
- 🧠 **Main Objective:** Predict whether a patient has heart disease or not based on medical parameters.
- 🔎 **Secondary Objective:** Gain diagnostic insights through statistical and machine learning exploration.

---

### 📌 Selected Attribute Descriptions (14 Core Features)

| 🔢 No. | 🧬 Column Name | 📖 Description                                                                 |
|-------:|---------------|--------------------------------------------------------------------------------|
| 1️⃣    | `age`         | Age of the patient (in years)                                                  |
| 2️⃣    | `sex`         | Sex of the patient (`Male` / `Female` in this CSV; `1` / `0` in the original UCI coding) |
| 3️⃣    | `cp`          | Chest pain type: `typical angina`, `atypical angina`, `non-anginal`, `asymptomatic` |
| 4️⃣    | `trestbps`    | Resting blood pressure (in mm Hg at admission)                                 |
| 5️⃣    | `chol`        | Serum cholesterol level (in mg/dl)                                             |
| 6️⃣    | `fbs`         | Fasting blood sugar > 120 mg/dl (`True` / `False` in this CSV)                  |
| 7️⃣    | `restecg`     | ECG results: `normal`, `ST-T abnormality`, `left ventricular hypertrophy`      |
| 8️⃣    | `thalch`      | Maximum heart rate achieved (named `thalach` in the original UCI files)        |
| 9️⃣    | `exang`       | Exercise-induced angina (`True` / `False` in this CSV)                          |
| 🔟     | `oldpeak`     | ST depression induced by exercise relative to rest                             |
| 1️⃣1️⃣ | `slope`       | Slope of the peak exercise ST segment: `upsloping`, `flat`, `downsloping`      |
| 1️⃣2️⃣ | `ca`          | Number of major vessels (0–3) colored by fluoroscopy                           |
| 1️⃣3️⃣ | `thal`        | Thallium stress test result: `normal`, `fixed defect`, `reversable defect` (spelled this way in the CSV) |
| 1️⃣4️⃣ | `num`         | Diagnosis, the prediction target (`0` = no disease, `1` to `4` = heart disease present) |
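
The main objective is a yes/no prediction, but the `num` column in the Kaggle CSV is not strictly binary (the data preview later in this notebook shows a value of `2`). One common way to reconcile the two is to collapse every nonzero value into a single "disease present" class. The sketch below illustrates that idea on a few hand-typed rows; the column name `target` and the sample frame are assumptions made for illustration, not steps taken from this notebook.

```python
import pandas as pd

# Illustrative rows only; in the notebook this would be the loaded `df`.
sample = pd.DataFrame({"age": [63, 67, 67, 37], "num": [0, 2, 1, 0]})

# Assumption: any nonzero `num` is treated as "heart disease present".
sample["target"] = (sample["num"] > 0).astype(int)

print(sample)
```

Keeping the original `num` column alongside the binary `target` leaves the door open for a multiclass model later.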

---

### 🧾 Additional Columns (May Appear in Extended Datasets)

| 🔹 Column        | 🔍 Description                                        |
|------------------|------------------------------------------------------|
| `id`             | Unique ID for each patient                           |
| `dataset`        | Source location of data (e.g., Cleveland, Hungary)   |

---

### 👨‍⚕️ Acknowledgements

**Contributors & Medical Institutions:**

- 🏥 *Hungarian Institute of Cardiology, Budapest*: **Dr. Andras Janosi**  
- 🏥 *University Hospital, Zurich, Switzerland*: **Dr. William Steinbrunn**  
- 🏥 *University Hospital, Basel, Switzerland*: **Dr. Matthias Pfisterer**  
- 🏥 *V.A. Medical Center, Long Beach & Cleveland Clinic*: **Dr. Robert Detrano**

---

### 📚 Relevant Research Papers

- 📄 *International application of a new probability algorithm for the diagnosis of coronary artery disease*  
  ➤ *Detrano, R. et al., American Journal of Cardiology, 1989*

- 📄 *Instance-based prediction of heart-disease presence with the Cleveland database*  
  ➤ *David W. Aha & Dennis Kibler*

- 📄 *Models of incremental concept formation*  
  ➤ *Gennari, J.H., Langley, P., & Fisher, D., Artificial Intelligence, 1989*

---

### 🙏 Citation Request

> The dataset's creators request that any publication using this data credit the principal investigators:
> 
> - **Dr. Andras Janosi** – Hungarian Institute of Cardiology  
> - **Dr. William Steinbrunn** – University Hospital, Zurich  
> - **Dr. Matthias Pfisterer** – University Hospital, Basel  
> - **Dr. Robert Detrano** – Cleveland Clinic Foundation & Long Beach VA Medical Center

---



## 1. 📚 Importing Libraries

In [6]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.impute import KNNImputer
# IterativeImputer is experimental: this enabling import must come before it
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [16]:
# Show every row and column when a DataFrame is printed
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

In [17]:
# Load the dataset from the local working directory
df = pd.read_csv("heart_disease_uci.csv")
# Preview the first few rows
print(df.head())

   id  age     sex    dataset               cp  trestbps   chol    fbs  \
0   1   63    Male  Cleveland   typical angina     145.0  233.0   True   
1   2   67    Male  Cleveland     asymptomatic     160.0  286.0  False   
2   3   67    Male  Cleveland     asymptomatic     120.0  229.0  False   
3   4   37    Male  Cleveland      non-anginal     130.0  250.0  False   
4   5   41  Female  Cleveland  atypical angina     130.0  204.0  False   

          restecg  thalch  exang  oldpeak        slope   ca  \
0  lv hypertrophy   150.0  False      2.3  downsloping  0.0   
1  lv hypertrophy   108.0   True      1.5         flat  3.0   
2  lv hypertrophy   129.0   True      2.6         flat  2.0   
3          normal   187.0  False      3.5  downsloping  0.0   
4  lv hypertrophy   172.0  False      1.4    upsloping  0.0   

                thal  num  
0       fixed defect    0  
1             normal    2  
2  reversable defect    1  
3             normal    0  
4             normal    0  


In [18]:
# Getting the info of our Dataset
print("Information about the Dataset")
# df.info() prints its report directly and returns None, so it is not wrapped in print()
df.info()

Information about the Dataset
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 920 entries, 0 to 919
Data columns (total 16 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   id        920 non-null    int64  
 1   age       920 non-null    int64  
 2   sex       920 non-null    object 
 3   dataset   920 non-null    object 
 4   cp        920 non-null    object 
 5   trestbps  861 non-null    float64
 6   chol      890 non-null    float64
 7   fbs       830 non-null    object 
 8   restecg   918 non-null    object 
 9   thalch    865 non-null    float64
 10  exang     865 non-null    object 
 11  oldpeak   858 non-null    float64
 12  slope     611 non-null    object 
 13  ca        309 non-null    float64
 14  thal      434 non-null    object 
 15  num       920 non-null    int64  
dtypes: float64(5), int64(3), object(8)
memory usage: 115.1+ KB
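
The non-null counts above imply substantial gaps: `ca` is missing in 611 of 920 rows (about 66%), `thal` in 486 (about 53%), and `slope` in 309 (about 34%). A per-column summary makes this easier to read than the raw `info()` report. The sketch below runs on a small hand-made frame so it is self-contained; `demo` and `summary` are illustrative names, and in the notebook the same expressions would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Small stand-in frame with gaps; replace `demo` with `df` in the notebook.
demo = pd.DataFrame({
    "trestbps": [145.0, np.nan, 120.0, 130.0],
    "ca": [0.0, np.nan, np.nan, np.nan],
    "thal": ["normal", None, None, "fixed defect"],
})

# Count missing values per column and express them as a percentage of rows.
missing = demo.isnull().sum()
missing_pct = (missing / len(demo) * 100).round(1)

# Keep only columns that actually have gaps, worst first.
summary = (
    pd.DataFrame({"missing": missing, "percent": missing_pct})
    .query("missing > 0")
    .sort_values("percent", ascending=False)
)
print(summary)
```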


In [19]:
# Check the Shape of the Dataset
print("Shape of the Dataset")
print("The dataset has", df.shape[0], "rows and", df.shape[1], "columns.")

Shape of the Dataset
The dataset has 920 rows and 16 columns.


-----

## 2. Data Preprocessing 🔍
- **Cleaning:** Remove duplicates and handle outliers.
- **Handling Missing Values:** Fill or drop missing data.
- **Encoding:** Convert categorical variables into numerical formats.
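
The three steps above can be sketched on a toy frame as follows. This is a minimal illustration under stated assumptions, not the notebook's actual pipeline: the frame `toy` is hand-typed, `KNNImputer` with `n_neighbors=2` is just one of the imputers imported earlier, categorical gaps are filled with the most frequent value, and outlier handling is left out.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import LabelEncoder

# Hand-typed rows for illustration; the last two rows are exact duplicates.
toy = pd.DataFrame({
    "age": [63, 67, 67, 37, 41, 41],
    "chol": [233.0, 286.0, np.nan, 250.0, 204.0, 204.0],
    "trestbps": [145.0, 160.0, 120.0, np.nan, 130.0, 130.0],
    "cp": ["typical angina", "asymptomatic", None,
           "non-anginal", "atypical angina", "atypical angina"],
})

# 1. Cleaning: drop exact duplicate rows.
clean = toy.drop_duplicates().reset_index(drop=True)

# 2. Missing values: KNN imputation for numeric columns,
#    most frequent value for categorical columns (an assumed strategy).
num_cols = clean.select_dtypes(include="number").columns
cat_cols = clean.select_dtypes(exclude="number").columns
clean[num_cols] = KNNImputer(n_neighbors=2).fit_transform(clean[num_cols])
for col in cat_cols:
    clean[col] = clean[col].fillna(clean[col].mode()[0])

# 3. Encoding: map each category to an integer label.
for col in cat_cols:
    clean[col] = LabelEncoder().fit_transform(clean[col])

print(clean)
```

`LabelEncoder` assigns arbitrary integer codes, which tree-based models tolerate well; for linear models or KNN, one-hot encoding (for example `pd.get_dummies`) may be the safer choice.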
