# Preprocessing & Splits

## 🎯 Concept Primer

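Preprocessing transforms raw features into a numeric, model-ready format. **Critical rule:** Fit scalers/encoders ONLY on the training data, then use the fitted objects to transform val/test. Fitting on the full dataset leaks validation and test statistics into training.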

### Preprocessing Steps
1. **Encode categoricals:** One-Hot or Ordinal encoding
2. **Split data:** Train (70%) / Val (15%) / Test (15%), stratified by target
3. **Scale continuous:** StandardScaler (mean=0, std=1) or MinMaxScaler (range 0-1), fit on train only
4. **Handle imbalance:** Class weights, oversampling (train only), or threshold tuning

**Expected outputs:** X_train, y_train, X_val, y_val, X_test, y_test
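
As a rough sketch of the 70/15/15 stratified split listed above, the snippet below runs on synthetic data. The `make_toy_data` helper, the single `bmi` feature, and the 14% positive rate are illustrative assumptions, not values taken from the dataset. It holds out 30% first, then halves that remainder into val and test.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def make_toy_data(n=1000, pos_rate=0.14, seed=0):
    """Hypothetical stand-in for the real dataset: one feature, imbalanced target."""
    rng = np.random.default_rng(seed)
    n_pos = int(round(n * pos_rate))
    y = pd.Series(np.r_[np.ones(n_pos), np.zeros(n - n_pos)], name="diabetes_binary")
    X = pd.DataFrame({"bmi": rng.normal(28, 6, size=n)})
    return X, y


X, y = make_toy_data()

# Step 1: hold out 30% of the rows, stratified by the target
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
# Step 2: split the held-out 30% in half, giving 15% val and 15% test
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42, stratify=y_temp
)

split_sizes = {"train": len(X_train), "val": len(X_val), "test": len(X_test)}
pos_rates = {"train": y_train.mean(), "val": y_val.mean(), "test": y_test.mean()}
print(split_sizes)
print(pos_rates)
```

Because both calls pass `stratify`, the positive rate should be nearly identical across the three splits.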

## 📋 Objectives

By the end of this notebook, you will:
1. Encode categorical features (One-Hot or Ordinal)
2. Scale continuous features using StandardScaler
3. Split into train/val/test (70/15/15 stratified)
4. Choose an imbalance handling strategy
5. Verify shapes and dtypes

## ✅ Acceptance Criteria

You'll know you're done when:
- [ ] All categoricals encoded
- [ ] Continuous features scaled
- [ ] Data split into train/val/test
- [ ] Imbalance strategy chosen and documented
- [ ] Shapes printed: X_train.shape, y_train.shape, etc.
- [ ] No leakage (transformers fit only on train)

## 🔧 Setup

In [None]:
# TODO 1: Import libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
import torch

df = pd.read_csv("../../../datasets/diabetes_BRFSS2015.csv")
df.columns = df.columns.str.lower().str.replace(' ', '_')
numeric_cols = ['bmi', 'genhlth', 'menthlth', 'physhlth']

df.head()

## 🏷️ Separate Features and Target

### TODO 2: Split data into X and y

**Expected:**
- X: All columns except `diabetes_binary`
- y: Only `diabetes_binary`

**Shapes:** X will be (N, D) where D = number of features

In [None]:
# TODO 2: Separate features and target
# X = df.drop('diabetes_binary', axis=1)
# y = df['diabetes_binary']
# print(f"X shape: {X.shape}, y shape: {y.shape}")

## 📊 Handle Imbalance

### TODO 3: Choose imbalance strategy

**Options:**
1. **Class weights** — Weight loss function by class frequency
2. **Oversampling** — SMOTE or RandomOverSampler (on train only!)
3. **Threshold tuning** — Adjust decision threshold at inference

**Decision:** Choose one and document why in reflection.

**Check imbalance:**
```python
y.value_counts(normalize=True)
```

In [None]:
# TODO 3: Check imbalance
# print(y.value_counts(normalize=True))

# If using class weights:
# from sklearn.utils.class_weight import compute_class_weight
# class_weights = compute_class_weight('balanced', classes=np.unique(y), y=y)
# class_weight_dict = {0: class_weights[0], 1: class_weights[1]}
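
To make the class-weight option more concrete, here is a minimal sketch of what `'balanced'` computes: each class gets the weight `n_samples / (n_classes * class_count)`. The labels in `y_toy` are synthetic and chosen only for illustration.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Synthetic labels for illustration only: 860 negatives, 140 positives
y_toy = np.array([0] * 860 + [1] * 140)

classes = np.unique(y_toy)
counts = np.bincount(y_toy)

# 'balanced' weight for class c = n_samples / (n_classes * count_c)
manual_weights = len(y_toy) / (len(classes) * counts)

sk_weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_toy)
class_weight_dict = dict(zip(classes.tolist(), sk_weights.tolist()))

print(manual_weights)
print(class_weight_dict)
```

A dictionary in this form can be passed as `class_weight` to scikit-learn estimators that accept it, such as `LogisticRegression` or `RandomForestClassifier`.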

## 🔄 Encode Categoricals

### TODO 4: Apply One-Hot encoding

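**Columns to consider:** Binary features are already 0/1 and need no encoding. Ordinal features (e.g. `genhlth`) can stay as integers or be one-hot encoded.

**Options:**
- OneHotEncoder: Creates a separate column for each category
- Keep it simple: Most columns are already numeric, so encoding may not be needed at all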

**Expected:** After encoding, all features should be numeric

In [None]:
# TODO 4: Encode categoricals (if needed)
# If you have string categoricals:
# Note: for strict leakage control, fit the encoder on the training split only.
# ohe = OneHotEncoder(drop='first', sparse_output=False)  # use sparse=False on scikit-learn < 1.2
# X_encoded = ohe.fit_transform(X[['categorical_col']])
# X_encoded_df = pd.DataFrame(X_encoded, columns=ohe.get_feature_names_out(), index=X.index)
# X = pd.concat([X.drop('categorical_col', axis=1), X_encoded_df], axis=1)

## ⚖️ Scale Continuous Features

### TODO 5: Apply StandardScaler

**Fit on train only!** Then transform val/test.

**Features to scale:** `bmi`, `menthlth`, `physhlth`, and `age` if you treat it as numeric (column names were lowercased in Setup)

**Process:**
1. Split data first
2. Fit scaler on X_train
3. Transform X_train, X_val, X_test

In [None]:
# TODO 5: Scale features
# Split first: 70% train, then halve the remaining 30% into val and test
# X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
# X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42, stratify=y_temp)

# Then scale: fit on train only, reuse the fitted scaler for val/test
# (this scales every column; select X_train[numeric_cols] to scale only continuous ones)
# scaler = StandardScaler()
# X_train_scaled = scaler.fit_transform(X_train)
# X_val_scaled = scaler.transform(X_val)
# X_test_scaled = scaler.transform(X_test)

# Convert back to DataFrames, keeping the original index so X and y stay aligned
# X_train = pd.DataFrame(X_train_scaled, columns=X_train.columns, index=X_train.index)
# X_val = pd.DataFrame(X_val_scaled, columns=X_val.columns, index=X_val.index)
# X_test = pd.DataFrame(X_test_scaled, columns=X_test.columns, index=X_test.index)
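
One possible way to confirm the scaler was fit on train only is to compare its stored statistics with the training statistics, as sketched below. The data is synthetic (`train_toy` and `val_toy` are invented stand-ins). Note that `StandardScaler` uses the population standard deviation, hence `ddof=0`.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-ins; the values are illustrative, not from the dataset
train_toy = pd.DataFrame({"bmi": rng.normal(28, 6, 700), "menthlth": rng.uniform(0, 30, 700)})
val_toy = pd.DataFrame({"bmi": rng.normal(28, 6, 150), "menthlth": rng.uniform(0, 30, 150)})

scaler = StandardScaler().fit(train_toy)  # statistics come from train only
train_scaled = scaler.transform(train_toy)
val_scaled = scaler.transform(val_toy)

# The fitted statistics should equal the training statistics (population std, ddof=0)
stats_match_train = bool(
    np.allclose(scaler.mean_, train_toy.mean().to_numpy())
    and np.allclose(scaler.scale_, train_toy.std(ddof=0).to_numpy())
)

print("scaler fit on train:", stats_match_train)
print("train mean/std:", train_scaled.mean(axis=0).round(3), train_scaled.std(axis=0).round(3))
print("val mean/std:  ", val_scaled.mean(axis=0).round(3), val_scaled.std(axis=0).round(3))
```

The validation mean and standard deviation should land near 0 and 1 but not exactly on them; exact values there would suggest the scaler had seen the validation rows.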

## ✅ Verify Splits

### TODO 6: Print shapes and check dtypes

**Expected outputs:**
- X_train: (N_train, D)
- y_train: (N_train,)
- Similar for val and test
- All should be numeric (float or int)

In [None]:
# TODO 6: Verify splits
# print(f"Train: X={X_train.shape}, y={y_train.shape}")
# print(f"Val: X={X_val.shape}, y={y_val.shape}")
# print(f"Test: X={X_test.shape}, y={y_test.shape}")
# print(f"\nDtypes:\n{X_train.dtypes}")
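
Beyond printing shapes, the checks could be automated with a small helper like the one below. `check_splits` is a hypothetical function written for this sketch, and the toy frame (including the `highbp` column) is invented. It assumes each split keeps its original row index from `df`, so overlapping index labels indicate rows shared between splits.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def check_splits(splits):
    """Sanity-check a dict like {'train': (X, y), 'val': (X, y), 'test': (X, y)}.

    Hypothetical helper for illustration; raises AssertionError on the first problem.
    """
    seen = set()
    n_features = None
    for name, (X, y) in splits.items():
        assert len(X) == len(y), f"{name}: X and y have different lengths"
        assert X.index.equals(y.index), f"{name}: X and y indices are misaligned"
        assert all(is_numeric_dtype(t) for t in X.dtypes), f"{name}: non-numeric column"
        if n_features is None:
            n_features = X.shape[1]
        assert X.shape[1] == n_features, f"{name}: feature count differs"
        overlap = seen & set(X.index)
        assert not overlap, f"{name}: rows shared with an earlier split"
        seen |= set(X.index)
    return True


# Toy example with invented values
X_all = pd.DataFrame({"bmi": [22.0, 30.1, 27.5, 35.2], "highbp": [0, 1, 0, 1]})
y_all = pd.Series([0, 1, 0, 1], name="diabetes_binary")
toy_splits = {
    "train": (X_all.iloc[:2], y_all.iloc[:2]),
    "val": (X_all.iloc[2:3], y_all.iloc[2:3]),
    "test": (X_all.iloc[3:], y_all.iloc[3:]),
}
print(check_splits(toy_splits))
```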

## 🤔 Reflection

1. **Imbalance strategy:** Which did you choose? Why?
2. **Scaler choice:** StandardScaler vs MinMaxScaler? Why?
3. **Split strategy:** Why 70/15/15? Is the test set large enough?
4. **Leakage check:** Are you sure no val/test info leaked into training?

**Your reflection:**

*Write your answers here*

## 📌 Summary

✅ **Encoded:** Categoricals converted to numeric  
✅ **Scaled:** Continuous features standardized  
✅ **Split:** Train/val/test created  
✅ **Balanced:** Imbalance strategy chosen and documented  
✅ **Ready for next step:** Train baseline models

**Next notebook:** `06_baselines_logreg_rf.ipynb`