# Heart Disease Prediction (Classification)

## Problem Statement
Cardiovascular diseases are one of the leading causes of death worldwide.
Early detection of heart disease can help improve patient outcomes and guide preventive care.

The goal of this project is to build a **machine learning classification model** that predicts whether a patient has **heart disease** based on clinical and demographic features.

---

## Dataset
- **Source:** UCI Machine Learning Repository (Heart Disease dataset – Cleveland subset)
- **Observations:** 303 patients
- **Features:** 13 clinical attributes (age, cholesterol, blood pressure, etc.)
- **Target variable:**
  - `0` → No heart disease
  - `1` → Presence of heart disease

The original target variable encodes disease severity on a 0–4 scale.
For this project, it was **converted into a binary label**: `0` stays `0` (absence of heart disease), and any value from `1` to `4` becomes `1` (presence of heart disease).

---

## Objective
Train and evaluate a machine learning model that can accurately classify patients as having heart disease or not, using structured medical data.

---

## Machine Learning Task
- **Type:** Supervised learning
- **Problem:** Binary classification
- **Models considered:** Logistic Regression, Tree-based models (the notebook below trains a Random Forest)
- **Evaluation metrics:**
  - Accuracy
  - Precision / Recall
  - ROC-AUC score
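
Of the metrics listed above, ROC-AUC is the least intuitive. One way to read it is as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below computes it that way in plain Python. The function name `roc_auc_pairwise` and the four toy labels and scores are illustrative assumptions, not part of the project data.

```python
from itertools import product


def roc_auc_pairwise(y_true, y_score):
    """ROC-AUC as the share of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a win.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p, n in product(pos, neg)
    )
    return wins / (len(pos) * len(neg))


# Toy example (hypothetical values): two negatives, two positives.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc_pairwise(y_true, y_score)
print(auc)  # 3 of the 4 positive/negative pairs are ordered correctly -> 0.75
```

The pairwise loop is quadratic in the number of samples, which is fine for a dataset of 303 patients. In practice `sklearn.metrics.roc_auc_score` is the usual choice.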

### 1. Loading the CSV (no cleaning was needed for this file):

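In [1]:
import pandas as pd

# The raw file has no header row, so the column names are supplied explicitly.
column_names = [
    "age", "sex", "cp", "trestbps", "chol",
    "fbs", "restecg", "thalach", "exang",
    "oldpeak", "slope", "ca", "thal", "target"
]

df = pd.read_csv("Dataset.data", header=None, names=column_names)

# Preview the first 10 rows.
df.head(10)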

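|   | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|-----|-----|----|----------|------|-----|---------|---------|-------|---------|-------|----|------|--------|
| 0 | 63.0 | 1.0 | 1.0 | 145.0 | 233.0 | 1.0 | 2.0 | 150.0 | 0.0 | 2.3 | 3.0 | 0.0 | 6.0 | 0 |
| 1 | 67.0 | 1.0 | 4.0 | 160.0 | 286.0 | 0.0 | 2.0 | 108.0 | 1.0 | 1.5 | 2.0 | 3.0 | 3.0 | 2 |
| 2 | 67.0 | 1.0 | 4.0 | 120.0 | 229.0 | 0.0 | 2.0 | 129.0 | 1.0 | 2.6 | 2.0 | 2.0 | 7.0 | 1 |
| 3 | 37.0 | 1.0 | 3.0 | 130.0 | 250.0 | 0.0 | 0.0 | 187.0 | 0.0 | 3.5 | 3.0 | 0.0 | 3.0 | 0 |
| 4 | 41.0 | 0.0 | 2.0 | 130.0 | 204.0 | 0.0 | 2.0 | 172.0 | 0.0 | 1.4 | 1.0 | 0.0 | 3.0 | 0 |
| 5 | 56.0 | 1.0 | 2.0 | 120.0 | 236.0 | 0.0 | 0.0 | 178.0 | 0.0 | 0.8 | 1.0 | 0.0 | 3.0 | 0 |
| 6 | 62.0 | 0.0 | 4.0 | 140.0 | 268.0 | 0.0 | 2.0 | 160.0 | 0.0 | 3.6 | 3.0 | 2.0 | 3.0 | 3 |
| 7 | 57.0 | 0.0 | 4.0 | 120.0 | 354.0 | 0.0 | 0.0 | 163.0 | 1.0 | 0.6 | 1.0 | 0.0 | 3.0 | 0 |
| 8 | 63.0 | 1.0 | 4.0 | 130.0 | 254.0 | 0.0 | 2.0 | 147.0 | 0.0 | 1.4 | 2.0 | 1.0 | 7.0 | 2 |
| 9 | 53.0 | 1.0 | 4.0 | 140.0 | 203.0 | 1.0 | 2.0 | 155.0 | 1.0 | 3.1 | 3.0 | 0.0 | 7.0 | 1 |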
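
A 10-row preview cannot confirm that a file is clean. Some distributions of the Cleveland data mark missing `ca` or `thal` values with `?`, and whether `Dataset.data` contains such markers is not shown here. The sketch below illustrates one way to check, using the `na_values` parameter of `pd.read_csv`. The first two rows are copied from the preview above. The third row is hypothetical and was added only to demonstrate the detection.

```python
import io

import pandas as pd

column_names = [
    "age", "sex", "cp", "trestbps", "chol",
    "fbs", "restecg", "thalach", "exang",
    "oldpeak", "slope", "ca", "thal", "target",
]

# Rows 1-2 come from the preview; row 3 is a made-up record with a "?" in `ca`.
raw = io.StringIO(
    "63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0,0\n"
    "67.0,1.0,4.0,160.0,286.0,0.0,2.0,108.0,1.0,1.5,2.0,3.0,3.0,2\n"
    "50.0,0.0,2.0,120.0,200.0,0.0,0.0,160.0,0.0,0.0,1.0,?,3.0,0\n"
)

# Treat "?" as missing so the affected columns stay numeric instead of turning into strings.
sample = pd.read_csv(raw, header=None, names=column_names, na_values="?")

# Count missing values per column and show only the columns that have any.
missing_per_column = sample.isna().sum()
print(missing_per_column[missing_per_column > 0])
```

If the same check on the full file reports no missing values, the "no cleaning needed" assumption holds. Otherwise the affected rows need to be dropped or imputed before modelling.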


### 2. Defining the features (X) and target (y), and splitting the data:

In [2]:
from sklearn.model_selection import train_test_split

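# Collapse the 0-4 severity scale into a binary label: 0 = no disease, 1 = disease.
df["target"] = (df["target"] > 0).astype(int)

# The features are every column except the label.
X = df.drop(columns=["target"])
y = df["target"]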

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
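
The `stratify=y` argument keeps the class ratio roughly the same in the training and test sets, which matters with only 303 rows. The sketch below checks this on synthetic labels. The 164/139 class counts and the placeholder feature column are illustrative assumptions rather than values read from `Dataset.data`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic binary labels for 303 "patients" (the class counts are an assumption).
y_demo = np.array([0] * 164 + [1] * 139)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature column

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Split sizes, then the share of positives overall, in train, and in test.
print(len(y_tr), len(y_te))
print(round(y_demo.mean(), 3), round(y_tr.mean(), 3), round(y_te.mean(), 3))
```

With 303 rows and `test_size=0.2`, the test set holds 61 rows, which matches the total support in the classification report in the Results section.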


In [3]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

num_cols = [
    "age",
    "trestbps",
    "chol",
    "thalach",
    "oldpeak"
]

cat_cols = [
    "sex",
    "cp",
    "fbs",
    "restecg",
    "exang",
    "slope",
    "ca",
    "thal"
]

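preprocess = ColumnTransformer(
    transformers=[
        # Numeric columns pass through unscaled; tree-based models do not need feature scaling.
        ("num", "passthrough", num_cols),
        # Integer-coded categories are one-hot encoded; unseen categories become all zeros.
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols)
    ]
)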

### 3. Setting up the model:

In [35]:
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

pipe = Pipeline([
    ("preprocess", preprocess),
    ("model", RandomForestClassifier(
        n_estimators=500,
        random_state=42,
        n_jobs=-1,
        max_depth=None,
        min_samples_split=2,
    ))
])

In [36]:
pipe.fit(X_train, y_train)

y_pred = pipe.predict(X_test)
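
Logistic Regression is listed among the models considered but is not trained in this notebook. The sketch below shows how such a variant could be set up, under the assumption that numeric columns are standardized first, since logistic regression is sensitive to feature scale. The helper `build_logreg_pipeline` and the synthetic data are hypothetical stand-ins. With the real data, the pipeline would be fit on `X_train` and `y_train` instead.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def build_logreg_pipeline(num_cols, cat_cols):
    """Hypothetical helper: scale numeric columns, one-hot encode categorical ones."""
    preprocess = ColumnTransformer(
        transformers=[
            ("num", StandardScaler(), num_cols),
            ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
        ]
    )
    return Pipeline([
        ("preprocess", preprocess),
        ("model", LogisticRegression(max_iter=1000)),
    ])


# Synthetic stand-in data (NOT the heart disease dataset).
rng = np.random.default_rng(42)
n = 300
demo = pd.DataFrame({
    "age": rng.normal(55, 9, n),
    "chol": rng.normal(245, 50, n),
    "cp": rng.integers(1, 5, n).astype(float),
})
# Labels loosely tied to age and cp so the model has some signal to learn.
logit = 0.08 * (demo["age"] - 55) + 1.0 * (demo["cp"] == 4.0) - 0.5
demo_y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    demo, demo_y, test_size=0.2, random_state=42, stratify=demo_y
)

logreg = build_logreg_pipeline(num_cols=["age", "chol"], cat_cols=["cp"])
logreg.fit(X_tr, y_tr)

# ROC-AUC needs scores, not hard labels: use the predicted probability of class 1.
proba = logreg.predict_proba(X_te)[:, 1]
auc = roc_auc_score(y_te, proba)
print(round(auc, 3))
```

The same `predict_proba(...)[:, 1]` pattern applies to the Random Forest pipeline when computing ROC-AUC on the real test set.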

### 4. Results

In [37]:
from sklearn.metrics import classification_report, roc_auc_score, confusion_matrix

print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))


              precision    recall  f1-score   support

           0       0.97      0.85      0.90        33
           1       0.84      0.96      0.90        28

    accuracy                           0.90        61
   macro avg       0.90      0.91      0.90        61
weighted avg       0.91      0.90      0.90        61

[[28  5]
 [ 1 27]]
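
The classification report can be reproduced by hand from the confusion matrix, which also makes the error types explicit. Rows are true classes and columns are predicted classes, so the model missed 1 patient with heart disease (false negative) and flagged 5 healthy patients (false positives). The sketch below redoes that arithmetic. The variable names are illustrative.

```python
# Confusion matrix copied from the output above.
# Rows = true class (0, 1); columns = predicted class (0, 1).
cm = [[28, 5],
      [1, 27]]

tn, fp = cm[0]  # true class 0: correctly rejected, wrongly flagged
fn, tp = cm[1]  # true class 1: missed, correctly detected

accuracy = (tp + tn) / (tp + tn + fp + fn)

# Class 1 (heart disease present)
precision_1 = tp / (tp + fp)
recall_1 = tp / (tp + fn)
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

# Class 0 (no heart disease)
precision_0 = tn / (tn + fn)
recall_0 = tn / (tn + fp)
f1_0 = 2 * precision_0 * recall_0 / (precision_0 + recall_0)

print(f"accuracy:  {accuracy:.2f}")
print(f"class 0:   precision={precision_0:.2f} recall={recall_0:.2f} f1={f1_0:.2f}")
print(f"class 1:   precision={precision_1:.2f} recall={recall_1:.2f} f1={f1_1:.2f}")
```

For a screening task, recall on class 1 is usually the number to watch, because a false negative means a patient with heart disease goes undetected.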
