# 01 – Data Preprocessing & EDA

Objectives:
- Load raw UCI Heart Disease dataset
- Inspect schema, types, target distribution
- Handle missing values
- Separate numeric vs categorical
- Encode categoricals (OneHot / Ordinal where sensible)
- Scale numeric features (StandardScaler)
- Exploratory plots: histograms, boxplots, correlation heatmap
- Build initial preprocessing Pipeline ready for later modeling.

Best Practices:
- NEVER fit transformers on test data (use Pipeline & train split)
- Preserve the raw data; write the cleaned frame to a separate file if needed.
- Track class imbalance early (affects metrics & resampling).
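
As a minimal sketch of the first rule, the snippet below fits a scaler on the training split only and reuses its statistics for the test split. The toy frame, the `age` column values, and all `toy_*` names are hypothetical stand-ins, not the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: one numeric feature and a balanced binary target.
rng = np.random.default_rng(0)
toy = pd.DataFrame({"age": rng.normal(55, 9, size=100)})
toy_y = pd.Series(np.tile([0, 1], 50))

toy_train, toy_test, _, _ = train_test_split(
    toy, toy_y, test_size=0.2, stratify=toy_y, random_state=42
)

# Mean and standard deviation are estimated from the training rows only.
scaler = StandardScaler().fit(toy_train)
train_scaled = scaler.transform(toy_train)
# The test rows are transformed with the training statistics, never refit.
test_scaled = scaler.transform(toy_test)

print(train_scaled.mean(), test_scaled.mean())
```

The training mean comes out at zero up to floating-point error, while the test mean is only close to zero. That gap is expected and is what an honest evaluation should see.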

In [None]:
import pandas as pd
import numpy as np
from pathlib import Path
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

DATA_PATH = Path('../data/heart_disease.csv')
assert DATA_PATH.exists(), 'Place heart_disease.csv in data/'
df = pd.read_csv(DATA_PATH)
df.head()

## 1. Basic Inspection

In [None]:
df.shape, df.dtypes.head()

In [None]:
df.isna().sum().sort_values(ascending=False).head(15)
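
Some raw UCI files mark missing entries with `?` rather than leaving them empty, in which case `isna()` reports nothing and the affected columns load as strings. Whether `heart_disease.csv` does this is an assumption to verify. The sketch below uses a hypothetical three-row excerpt to show the difference that `na_values` makes.

```python
import io
import pandas as pd

# Hypothetical excerpt; the values are illustrative only.
raw = io.StringIO("age,ca,thal\n63,0,6\n67,?,3\n41,1,?\n")

# Default parsing keeps '?' as a string, so no NaN is detected.
naive = pd.read_csv(raw)
raw.seek(0)
# Declaring '?' as a missing-value marker restores numeric dtypes and NaN.
parsed = pd.read_csv(raw, na_values="?")

print(naive.dtypes)
print(parsed.isna().sum())
```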

In [None]:
df.describe().T

## 2. Target Distribution
The target column name depends on the dataset variant (often 'target' or 'num'). The cell below checks for 'target' first and falls back to 'num'; confirm that the result matches your file.

In [None]:
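# Pick whichever known target name is present; fail loudly if neither exists
target_col = next((c for c in ('target', 'num') if c in df.columns), None)
assert target_col is not None, "No 'target' or 'num' column found"
df[target_col].value_counts(normalize=True) * 100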
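
In the original UCI coding, `num` takes integer values from 0 (no disease) to 4, so a binary task needs it collapsed first. A minimal sketch, assuming that coding applies to your file; the label values below are hypothetical.

```python
import pandas as pd

# Hypothetical labels in the 0-4 coding; 0 is assumed to mean "no disease".
num = pd.Series([0, 2, 1, 0, 4, 3, 0], name="num")

# Any value above 0 is treated as presence of disease.
target_binary = (num > 0).astype(int)

print(target_binary.value_counts(normalize=True) * 100)
```

If the file already has a 0/1 `target` column, this step is unnecessary.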

## 3. Split Features / Target & Identify Types

In [None]:
y = df[target_col]
X = df.drop(columns=[target_col])
# Heuristic: object/category columns are categorical, as are integer columns
# with at most 10 distinct values.
# Caveat: integer codes stored as float (e.g. because of NaNs) are not caught.
categorical = [c for c in X.columns if X[c].dtype in ['object','category']]
for c in X.columns:
    if pd.api.types.is_integer_dtype(X[c]) and X[c].nunique() <= 10:
        categorical.append(c)
categorical = sorted(set(categorical))
numeric = [c for c in X.columns if c not in categorical]
categorical, numeric[:5]
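
The integer-dtype check misses coded columns that pandas loads as float because they contain NaN. The sketch below is one possible variant that also accepts whole-number float columns; `split_feature_types`, `max_categories`, and the demo frame are hypothetical names and values introduced for illustration.

```python
import numpy as np
import pandas as pd

def split_feature_types(frame, max_categories=10):
    """Return (categorical, numeric) column lists using a cardinality heuristic."""
    categorical, numeric = [], []
    for col in frame.columns:
        s = frame[col]
        # Strings, categories, and booleans are always treated as categorical.
        if not pd.api.types.is_numeric_dtype(s) or pd.api.types.is_bool_dtype(s):
            categorical.append(col)
            continue
        values = s.dropna()
        # Whole-number check covers float columns that only hold integer codes.
        is_whole = bool(np.all(values == np.floor(values)))
        if is_whole and values.nunique() <= max_categories:
            categorical.append(col)
        else:
            numeric.append(col)
    return categorical, numeric

types_demo = pd.DataFrame({
    "age": [63, 67, 41, 56, 62, 57, 44, 52, 48, 54, 49, 64],
    "ca": [0.0, np.nan, 1.0, 0.0, 2.0, 0.0, 3.0, 1.0, 0.0, 0.0, 1.0, 2.0],
    "oldpeak": [2.3, 1.5, 1.4, 0.8, 3.6, 0.6, 0.0, 1.2, 0.2, 1.6, 0.4, 1.8],
    "sex": ["M", "F", "M", "M", "F", "M", "F", "M", "M", "F", "M", "F"],
})
categorical_demo, numeric_demo = split_feature_types(types_demo)
print(categorical_demo, numeric_demo)
```

Here `ca` is float only because of the missing entry, yet it lands in the categorical list, while `age` has too many distinct values to qualify. Any such heuristic should still be checked against the data dictionary.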

## 4. EDA – Histograms (Numeric)

In [None]:
X[numeric].hist(figsize=(14,10), bins=20)
plt.tight_layout()
plt.show()

## 5. EDA – Boxplots (Outlier Scan)

In [None]:
# Keep a minimum height so the figure stays readable with few numeric columns
plt.figure(figsize=(14, max(4, len(numeric)*0.4)))
sns.boxplot(data=X[numeric], orient='h')
plt.title('Numeric Feature Distributions – Boxplots')
plt.show()
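
Boxplot whiskers conventionally sit at 1.5 times the interquartile range, so the same rule can be turned into a per-column count to complement the visual scan. A minimal sketch; `iqr_outlier_counts` and the demo values are hypothetical.

```python
import pandas as pd

def iqr_outlier_counts(frame, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each column."""
    q1 = frame.quantile(0.25)
    q3 = frame.quantile(0.75)
    iqr = q3 - q1
    # Comparisons align the per-column fences with the frame's columns;
    # NaN compares as False, so missing values are never counted.
    outside = (frame < q1 - k * iqr) | (frame > q3 + k * iqr)
    return outside.sum()

outlier_demo = pd.DataFrame({
    "chol": [200, 210, 220, 230, 240, 250, 260, 564],
    "thalach": [150, 152, 155, 158, 160, 162, 165, 168],
})
outlier_counts = iqr_outlier_counts(outlier_demo)
print(outlier_counts)
```

A flagged value is a prompt to inspect the row, not a reason to drop it automatically.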

## 6. Correlation Heatmap (Numeric)

In [None]:
plt.figure(figsize=(12,10))
corr = df[numeric + [target_col]].corr()
sns.heatmap(corr, cmap='coolwarm', center=0, annot=False)
plt.title('Correlation Heatmap')
plt.show()
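
With many features the heatmap gets hard to read, and a ranked list of correlations with the target is often easier to scan. The sketch below assumes a numeric target; the frame and its values are hypothetical.

```python
import pandas as pd

corr_demo = pd.DataFrame({
    "age":     [40, 45, 50, 55, 60, 65, 70, 75],
    "thalach": [180, 175, 150, 172, 140, 135, 150, 120],
    "target":  [0, 0, 0, 0, 1, 1, 1, 1],
})

# Pearson correlation of each feature with the target, strongest first.
corr_with_target = (
    corr_demo.corr()["target"]
    .drop("target")
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(corr_with_target)
```

Pearson correlation only captures linear association, so a low value here does not rule out a useful feature.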

## 7. Preprocessing Pipeline
Impute: median for numeric, most_frequent for categorical. Scale numeric. One-hot encode categorical.

In [None]:
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('ohe', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numeric),
    ('cat', categorical_transformer, categorical)
])

preprocessor
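
The objectives mention ordinal encoding where sensible, but the pipeline above one-hot encodes every categorical column. The sketch below shows how an ordinal branch could sit alongside the one-hot branch. The column names, string labels, and the `down < flat < up` ordering are assumptions for illustration, not properties of the real file.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical columns: 'slope' is treated as ordered, 'cp' as unordered.
ordinal_demo = pd.DataFrame({
    "slope": ["up", "flat", "down", "flat", "up"],
    "cp": ["typical", "atypical", "non-anginal", "asymptomatic", "typical"],
})

ordinal_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    # Explicit category order; unseen values map to -1 instead of raising.
    ("ordinal", OrdinalEncoder(
        categories=[["down", "flat", "up"]],
        handle_unknown="use_encoded_value",
        unknown_value=-1,
    )),
])

demo_preprocessor = ColumnTransformer(
    transformers=[
        ("ord", ordinal_transformer, ["slope"]),
        ("ohe", OneHotEncoder(handle_unknown="ignore"), ["cp"]),
    ],
    sparse_threshold=0,  # always return a dense array for easy inspection
)

encoded = demo_preprocessor.fit_transform(ordinal_demo)
print(encoded.shape)
print(encoded[:, 0])  # ordinal codes for 'slope'
```

Ordinal codes impose a distance between levels, so this branch only makes sense for columns whose order is meaningful.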

## 8. Train/Test Split (Stratified)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
X_train.shape, X_test.shape

## 9. Fit & Transform (Preview Encoded Feature Matrix Shape)

In [None]:
Xt = preprocessor.fit_transform(X_train)
Xt.shape

## 10. Save Preprocessor (Optional for Reuse)

In [None]:
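import joblib
# Path is already imported in the first cell; create the directory if missing
Path('../models').mkdir(parents=True, exist_ok=True)
# The saved object is the preprocessor fitted on X_train in section 9
joblib.dump(preprocessor, '../models/preprocessor.pkl')
print('Saved preprocessor.')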
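
A later notebook would reload the file with `joblib.load` and call `transform` on new rows. The sketch below round-trips a fitted `StandardScaler` through a temporary directory as a stand-in for the real preprocessor; the path and data are hypothetical.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

# A fitted StandardScaler stands in for the fitted preprocessor.
fitted = StandardScaler().fit(np.array([[1.0], [2.0], [3.0]]))

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical location; the notebook writes to ../models/preprocessor.pkl.
    path = Path(tmp) / "preprocessor.pkl"
    joblib.dump(fitted, path)
    restored = joblib.load(path)

# The restored object carries the fitted statistics and needs no refit.
new_rows = np.array([[2.0], [4.0]])
print(restored.transform(new_rows))
```

Pickle-based files should only be loaded from trusted sources, and it is safest to load them with the same scikit-learn version that wrote them.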

## Notes & Pitfalls
- Verify correct target column name early.
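- Treating low-cardinality integer columns as categorical changes what each model family can learn: one-hot encoding drops any ordering, which often helps linear models capture non-monotonic effects but adds little for tree models, which can split on the raw codes. Revisit the choice after a baseline.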
- Outliers: if a feature has heavy tails, consider RobustScaler (median and IQR) in place of StandardScaler.
- Class imbalance: if strong, consider ROC AUC, PR AUC, and maybe class_weight or resampling later.
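
As a rough sketch of the class-imbalance note, the snippet below trains a class-weighted logistic regression on synthetic imbalanced data and reports ROC AUC next to PR AUC (average precision). The data comes from `make_classification`, not the heart disease file, and every name in it is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data with roughly 15% positives to mimic an imbalanced target.
Xs, ys = make_classification(
    n_samples=600, n_features=8, weights=[0.85, 0.15], random_state=0
)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    Xs, ys, test_size=0.25, stratify=ys, random_state=0
)

# class_weight='balanced' reweights samples inversely to class frequency.
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(Xs_train, ys_train)
scores = clf.predict_proba(Xs_test)[:, 1]

roc_auc = roc_auc_score(ys_test, scores)
pr_auc = average_precision_score(ys_test, scores)
# A random ranker scores about the positive rate on PR AUC, not 0.5.
positive_rate = ys_test.mean()

print(f"ROC AUC={roc_auc:.3f}  PR AUC={pr_auc:.3f}  positive rate={positive_rate:.3f}")
```

Comparing PR AUC against the positive rate, rather than against 0.5, gives a fairer sense of how much the model adds on the minority class.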