# Heart Disease ML Pipeline (UCI) — Notebooks

These notebooks implement a full machine-learning pipeline on the **UCI Heart Disease** dataset (UCI repository id 45), loaded with the `ucimlrepo` package as requested:

```python
from ucimlrepo import fetch_ucirepo
heart_disease = fetch_ucirepo(id=45)
X = heart_disease.data.features
y = heart_disease.data.targets
```

> Bonus items (Streamlit/Ngrok) are intentionally **omitted** per the request.

## 01 — Data Preprocessing & EDA

Steps:
- Load dataset using `ucimlrepo`
- Basic EDA (shape, head, describe, class balance)
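- Handle missing values with `SimpleImputer` (median for numeric columns, most frequent value for categorical columns)
- Identify categorical vs. numerical columns by dtype and one-hot encode the categoricals (`OneHotEncoder`)
- Scale the numerical columns (`StandardScaler`)
- Split into stratified train and test sets (80/20)
- Save the processed arrays and a combined processed CSV to `../data/`, and the fitted preprocessor to `../models/`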

In [11]:
# Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from ucimlrepo import fetch_ucirepo
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Load dataset
heart_disease = fetch_ucirepo(id=45)  # UCI Heart Disease
X = heart_disease.data.features.copy()
y = heart_disease.data.targets.copy()

print("Raw shapes:", X.shape, y.shape)
print("\nTargets value counts:\n", y.iloc[:,0].value_counts())

# Basic EDA
display(X.head())
display(X.describe())
X.info()  # prints its summary directly and returns None, so no display() wrapper is needed

# Identify categorical vs numeric by dtype
# (integer-coded categoricals such as cp or thal count as numeric under this check)
cat_cols = X.select_dtypes(include=["object","category"]).columns.tolist()
num_cols = X.select_dtypes(include=["int64","float64","int32","float32"]).columns.tolist()
print("Categorical columns:", cat_cols)
print("Numeric columns:", num_cols)

# Preprocessing pipelines
numeric_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler())
])

categorical_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore", sparse_output=False))
])

pre = ColumnTransformer([
    ("num", numeric_pipe, num_cols),
    ("cat", categorical_pipe, cat_cols)
])

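# Fit-transform on the full dataset
# (note: this runs before the split, so the imputation and scaling statistics also see the test rows)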
X_proc = pre.fit_transform(X)

# Collect processed feature names
num_features = num_cols
cat_features = list(pre.named_transformers_["cat"].named_steps["onehot"].get_feature_names_out(cat_cols)) if cat_cols else []
feature_names = num_features + cat_features

# Train-test split (80/20, stratified on the multi-class target)
y_series = y.iloc[:,0] if isinstance(y, pd.DataFrame) else y
X_train, X_test, y_train, y_test = train_test_split(X_proc, y_series, test_size=0.2, random_state=42, stratify=y_series)

print("Processed shapes:", X_proc.shape, "Train:", X_train.shape, "Test:", X_test.shape)

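# Save processed data (create the output folders first if they are missing)
import os
os.makedirs("../data", exist_ok=True)
os.makedirs("../models", exist_ok=True)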
proc_df = pd.DataFrame(X_proc, columns=feature_names)
proc_df["target"] = y_series.values
proc_df.to_csv("../data/processed_full.csv", index=False)

np.save("../data/X_train.npy", X_train)
np.save("../data/X_test.npy", X_test)
np.save("../data/y_train.npy", y_train.values)
np.save("../data/y_test.npy", y_test.values)

# Save the fitted preprocessor for later reuse in models
import joblib
joblib.dump(pre, "../models/preprocessor.pkl")

print("Saved processed data to ../data and preprocessor to ../models.")

Raw shapes: (303, 13) (303, 1)

Targets value counts:
 num
0    164
1     55
2     36
3     35
4     13
Name: count, dtype: int64
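
The target `num` has five levels (0 to 4), and the higher levels are rare. A common simplification for this dataset, which this notebook does not apply, is to collapse levels 1 to 4 into a single "disease present" class. The sketch below rebuilds a target column from the counts printed above (the row order is artificial) and shows the resulting binary balance.

```python
import numpy as np
import pandas as pd

# Rebuild a target column from the class counts printed above.
# This is a toy reconstruction: the row order is not the dataset's real order.
counts = {0: 164, 1: 55, 2: 36, 3: 35, 4: 13}
num = pd.Series(np.repeat(list(counts.keys()), list(counts.values())), name="num")

# Assumption: 0 means no disease, and levels 1-4 are merged into one positive class.
y_binary = (num > 0).astype(int)

binary_counts = y_binary.value_counts().sort_index()
print(binary_counts)
```

The binary version is far more balanced than the five-level target, which matters when choosing metrics and stratification later in the pipeline.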


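|   | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal |
|---|-----|-----|----|----------|------|-----|---------|---------|-------|---------|-------|-----|------|
| 0 | 63 | 1 | 1 | 145 | 233 | 1 | 2 | 150 | 0 | 2.3 | 3 | 0.0 | 6.0 |
| 1 | 67 | 1 | 4 | 160 | 286 | 0 | 2 | 108 | 1 | 1.5 | 2 | 3.0 | 3.0 |
| 2 | 67 | 1 | 4 | 120 | 229 | 0 | 2 | 129 | 1 | 2.6 | 2 | 2.0 | 7.0 |
| 3 | 37 | 1 | 3 | 130 | 250 | 0 | 0 | 187 | 0 | 3.5 | 3 | 0.0 | 3.0 |
| 4 | 41 | 0 | 2 | 130 | 204 | 0 | 2 | 172 | 0 | 1.4 | 1 | 0.0 | 3.0 |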


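|       | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal |
|-------|-----|-----|----|----------|------|-----|---------|---------|-------|---------|-------|-----|------|
| count | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 299.0 | 301.0 |
| mean | 54.438944 | 0.679868 | 3.158416 | 131.689769 | 246.693069 | 0.148515 | 0.990099 | 149.607261 | 0.326733 | 1.039604 | 1.60066 | 0.672241 | 4.734219 |
| std | 9.038662 | 0.467299 | 0.960126 | 17.599748 | 51.776918 | 0.356198 | 0.994971 | 22.875003 | 0.469794 | 1.161075 | 0.616226 | 0.937438 | 1.939706 |
| min | 29.0 | 0.0 | 1.0 | 94.0 | 126.0 | 0.0 | 0.0 | 71.0 | 0.0 | 0.0 | 1.0 | 0.0 | 3.0 |
| 25% | 48.0 | 0.0 | 3.0 | 120.0 | 211.0 | 0.0 | 0.0 | 133.5 | 0.0 | 0.0 | 1.0 | 0.0 | 3.0 |
| 50% | 56.0 | 1.0 | 3.0 | 130.0 | 241.0 | 0.0 | 1.0 | 153.0 | 0.0 | 0.8 | 2.0 | 0.0 | 3.0 |
| 75% | 61.0 | 1.0 | 4.0 | 140.0 | 275.0 | 0.0 | 2.0 | 166.0 | 1.0 | 1.6 | 2.0 | 1.0 | 7.0 |
| max | 77.0 | 1.0 | 4.0 | 200.0 | 564.0 | 1.0 | 2.0 | 202.0 | 1.0 | 6.2 | 3.0 | 3.0 | 7.0 |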


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 13 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       303 non-null    int64  
 1   sex       303 non-null    int64  
 2   cp        303 non-null    int64  
 3   trestbps  303 non-null    int64  
 4   chol      303 non-null    int64  
 5   fbs       303 non-null    int64  
 6   restecg   303 non-null    int64  
 7   thalach   303 non-null    int64  
 8   exang     303 non-null    int64  
 9   oldpeak   303 non-null    float64
 10  slope     303 non-null    int64  
 11  ca        299 non-null    float64
 12  thal      301 non-null    float64
dtypes: float64(3), int64(10)
memory usage: 30.9 KB
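
The summary shows that only `ca` (299 non-null) and `thal` (301 non-null) have gaps. As a minimal sketch of what the median imputation step does to such columns, the example below uses a made-up five-row frame with the same two column names; the values are assumptions chosen only to make the effect visible.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with gaps in the same two columns as the dataset (values are made up).
toy = pd.DataFrame({
    "ca":   [0.0, 3.0, np.nan, 1.0, 0.0],
    "thal": [6.0, 3.0, 7.0, np.nan, 3.0],
})

# Missing values per column before imputation.
print(toy.isna().sum())

imputer = SimpleImputer(strategy="median")
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

# Per-column medians learned during fit, then the filled frame.
print(imputer.statistics_)
print(filled)
```

In this toy frame the median of `thal` is 4.5, which is not one of its codes (3, 6, 7). In the real data the medians of `ca` and `thal` are 0.0 and 3.0 (see the 50% row of the summary statistics), so the issue does not arise here, but `strategy="most_frequent"` is the safer choice for coded columns in general.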



Categorical columns: []
Numeric columns: ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal']
Processed shapes: (303, 13) Train: (242, 13) Test: (61, 13)
Saved processed data to ../data and preprocessor to ../models.
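
The cell above calls `pre.fit_transform(X)` on all 303 rows before `train_test_split`, so the imputation medians and scaling statistics are computed with the test rows included. A minimal sketch of the alternative order (split first, fit on the training rows only) is shown below on synthetic data; the array names and values are hypothetical and only illustrate the ordering.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the heart disease features).
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
y_demo = np.repeat([0, 1], 50)

# Split first, so the test rows stay unseen while fitting.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # statistics come from the training rows only
X_te_scaled = scaler.transform(X_te)      # test rows reuse those statistics

print(X_tr_scaled.mean(axis=0).round(6))  # approximately 0 by construction
print(X_te_scaled.mean(axis=0).round(3))  # generally not exactly 0
```

The same ordering should carry over to the full `ColumnTransformer`: fit it on the raw training split and call only `transform` on the raw test split, so the saved preprocessor never sees test data.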
