# Module 5 – Integrated Modeling, Evaluation & Capstone Mini‑Project
Build a full end‑to‑end regression workflow: preprocessing pipeline, model comparison, nested cross‑validation, and reporting.

## 1 | Learning Objectives
By the end of this module you will be able to:

1. **Assemble** a leakage‑free ML pipeline (preprocessing → model → evaluation).
2. **Select** and justify regression metrics (RMSE, MAE, R²) for different goals.
3. **Apply** robust validation (nested CV) to compare Ridge, Lasso & LightGBM fairly.
4. **Tune** hyper‑parameters efficiently (grid search, random search, or, optionally, Bayesian optimization).
5. **Communicate** findings in a concise, reproducible report.

## 2 | Key Concepts & Analogies
| Concept | Plain Explanation | Analogy |
|---------|------------------|---------|
| **Pipeline** | Chains preprocessing + model so each CV fold re‑fits the preprocessing on its training split only, never on held‑out data. | Assembly line: every car (fold) goes through the same stations. |
| **Nested CV** | Inner loop tunes hyper‑params; outer loop estimates generalization on data the tuning never saw. | Taste‑testing (inner) with blindfolded judges (outer). |
| **Metric Choice** | RMSE penalizes large errors quadratically; MAE weights all errors linearly, so it is more robust to outliers; R² measures fit relative to a predict‑the‑mean baseline. | Different grading rubrics: RMSE = harsh, MAE = lenient, R² = percent score. |
| **Baseline Model** | Simple predictor to check value of complex models. | Control group in a clinical trial. |
| **Reproducibility** | Fix seeds, store configs, save artifacts. | Baking recipe with exact grams & oven temp. |
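The metric row above can be made concrete with a small hand-built example. This is a minimal sketch on made-up numbers (the arrays `y_true`, `y_even` and `y_outlier` are illustrative, not from the dataset): two sets of predictions share the same MAE, yet RMSE and R² react strongly once the error is concentrated in a single point.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Illustrative targets (not taken from the California Housing data).
y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Case 1: every prediction is off by 0.5.
y_even = y_true + 0.5

# Case 2: four perfect predictions and one that is off by 2.5.
y_outlier = y_true.copy()
y_outlier[-1] += 2.5


def report(y_pred):
    """Return (RMSE, MAE, R2) for predictions against y_true."""
    rmse = float(np.sqrt(mean_squared_error(y_true, y_pred)))
    mae = float(mean_absolute_error(y_true, y_pred))
    r2 = float(r2_score(y_true, y_pred))
    return rmse, mae, r2


rmse_even, mae_even, r2_even = report(y_even)
rmse_out, mae_out, r2_out = report(y_outlier)

print(f"even errors : RMSE={rmse_even:.3f}  MAE={mae_even:.3f}  R2={r2_even:.3f}")
print(f"one outlier : RMSE={rmse_out:.3f}  MAE={mae_out:.3f}  R2={r2_out:.3f}")
```

Both cases have a total absolute error of 2.5, so MAE cannot tell them apart, while RMSE more than doubles for the outlier case because the single large error is squared before averaging.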

In [None]:
# Cell 1 – Imports & Settings
import numpy as np, pandas as pd, matplotlib.pyplot as plt, json, joblib
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split, GridSearchCV, KFold, cross_val_score
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.linear_model import Ridge, Lasso
from lightgbm import LGBMRegressor, plot_importance
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
plt.rcParams['figure.dpi'] = 110
np.random.seed(0)

In [None]:
# Cell 2 – Load Dataset & Split
X, y = fetch_california_housing(return_X_y=True, as_frame=True)
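# Hold out 20% as the final test set; it is touched only once, in Section 8.
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Split the remainder 75/25, giving roughly 60/20/20 train/val/test overall.
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.25, random_state=0)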
print(X_train.shape, X_val.shape, X_test.shape)

## 3 | Exploratory Data Analysis

In [None]:
print(X_train.assign(target=y_train).head())  # first rows, with the target attached
X_train.describe().T  # per-feature summary statistics

## 4 | Preprocessing Pipeline

In [None]:
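# Column lists by dtype. California Housing is all-numeric, so categorical_cols is
# empty here; the 'cat' branch is kept so the same pipeline also handles mixed data.
numeric_cols = X.select_dtypes('number').columns
categorical_cols = X.select_dtypes(include=['object', 'category']).columns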

preprocessor = ColumnTransformer([
    ('num', StandardScaler(), numeric_cols),
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols)
])

## 5 | Model Definitions

In [None]:
ridge_pipe = Pipeline([('pre', preprocessor), ('model', Ridge())])
lasso_pipe = Pipeline([('pre', preprocessor), ('model', Lasso(max_iter=5000))])
lgbm_pipe  = Pipeline([('pre', preprocessor), ('model', LGBMRegressor(
    objective='regression', n_estimators=600, random_state=0, n_jobs=-1))])

## 6 | Hyper‑parameter Grids

In [None]:
ridge_grid = {'model__alpha': 10.0 ** np.arange(-3, 4)}
lasso_grid = {'model__alpha': 10.0 ** np.arange(-3, 2)}
lgbm_grid  = {
    'model__num_leaves':[31, 63],
    'model__learning_rate':[0.1, 0.05],
    'model__min_child_samples':[10, 20]
}
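Grid size drives the cost of nested CV, so it can help to count model fits before running anything. The sketch below re-declares the three grids under a new name (`demo_grids`) and assumes 3 inner folds, 5 outer folds, and the `GridSearchCV` default of refitting the best candidate once per search.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Same grids as above, re-declared so this snippet runs on its own.
demo_grids = {
    "Ridge": {"model__alpha": 10.0 ** np.arange(-3, 4)},
    "Lasso": {"model__alpha": 10.0 ** np.arange(-3, 2)},
    "LightGBM": {
        "model__num_leaves": [31, 63],
        "model__learning_rate": [0.1, 0.05],
        "model__min_child_samples": [10, 20],
    },
}

inner_folds, outer_folds = 3, 5  # assumed to match the CV settings used in this module

fit_counts = {}
for name, grid in demo_grids.items():
    n_candidates = len(ParameterGrid(grid))              # combinations in the grid
    fits_per_search = n_candidates * inner_folds + 1     # +1 for the refit on the full fold
    fit_counts[name] = fits_per_search * outer_folds     # one search per outer fold
    print(f"{name:9s} candidates={n_candidates:2d}  total fits={fit_counts[name]}")
```

With these settings the LightGBM comparison needs 125 fits of a 600-tree model, which is the main runtime cost of the next section.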

## 7 | Nested Cross‑Validation Comparison

In [None]:
outer_cv = KFold(n_splits=5, shuffle=True, random_state=0)
models = {
    'Ridge': (ridge_pipe, ridge_grid),
    'Lasso': (lasso_pipe, lasso_grid),
    'LightGBM': (lgbm_pipe, lgbm_grid)
}

results = {}
for name, (pipe, grid) in models.items():
    gs = GridSearchCV(pipe, grid, cv=3, scoring='neg_root_mean_squared_error')
    cv_scores = cross_val_score(gs, X_trainval, y_trainval,
                                cv=outer_cv, scoring='neg_root_mean_squared_error')
    results[name] = -cv_scores  # convert to positive RMSE

pd.DataFrame(results).describe().T
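Passing a `GridSearchCV` object to `cross_val_score` hides the two loops inside one call. As a rough sketch of what happens, the snippet below writes the outer loop by hand on synthetic data (a `make_regression` sample and a small Ridge grid, both chosen here only for speed) and checks that it gives the same per-fold RMSE as the one-line version.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data so the example runs in a few seconds.
X_demo, y_demo = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

demo_pipe = Pipeline([("scale", StandardScaler()), ("model", Ridge())])
demo_grid = {"model__alpha": [0.01, 0.1, 1.0, 10.0]}
demo_outer = KFold(n_splits=5, shuffle=True, random_state=0)
scoring = "neg_root_mean_squared_error"

# Outer loop written out explicitly.
manual_rmse = []
for train_idx, test_idx in demo_outer.split(X_demo):
    # Inner loop: tune alpha using only the outer-training portion.
    search = GridSearchCV(clone(demo_pipe), demo_grid, cv=3, scoring=scoring)
    search.fit(X_demo[train_idx], y_demo[train_idx])
    # Score the refit best model on the outer-test portion it has never seen.
    fold_preds = search.predict(X_demo[test_idx])
    manual_rmse.append(np.sqrt(mean_squared_error(y_demo[test_idx], fold_preds)))
manual_rmse = np.array(manual_rmse)

# The compact form used in this module.
auto_rmse = -cross_val_score(
    GridSearchCV(demo_pipe, demo_grid, cv=3, scoring=scoring),
    X_demo, y_demo, cv=demo_outer, scoring=scoring,
)

print(manual_rmse.round(3))
print(auto_rmse.round(3))
```

Each outer fold may select a different `alpha`; the outer scores estimate the performance of the whole tuning procedure rather than of one fixed hyper-parameter setting.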

## 8 | Select Final Model & Evaluate

In [None]:
best_name = min(results, key=lambda name: results[name].mean())  # lowest mean outer-fold RMSE
best_pipe, best_grid = models[best_name]
search_final = GridSearchCV(best_pipe, best_grid, cv=3,
                           scoring='neg_root_mean_squared_error')
search_final.fit(X_trainval, y_trainval)
final_model = search_final.best_estimator_

preds = final_model.predict(X_test)
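# np.sqrt(MSE) works on every scikit-learn version; `squared=False` was deprecated in 1.4 and later removed.
print('Test RMSE:', np.sqrt(mean_squared_error(y_test, preds)))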
print('Test MAE :', mean_absolute_error(y_test, preds))
print('Test R²  :', r2_score(y_test, preds))

## 9 | Feature Importance

In [None]:
if hasattr(final_model.named_steps['model'], 'feature_importances_'):
    feats = final_model.named_steps['pre'].get_feature_names_out()
    fi = pd.Series(final_model.named_steps['model'].feature_importances_, index=feats)
    fi.sort_values(ascending=False)[:10].plot(kind='bar', title='Top 10 Importances')
    plt.show()

## 10 | Save Artifacts

In [None]:
joblib.dump(final_model, 'best_model.joblib')
meta = {'metric':'RMSE','value': float(np.sqrt(mean_squared_error(y_test, preds)))}
with open('model_meta.json','w') as f:
    json.dump(meta, f, indent=2)
meta
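A saved artifact is only useful if it reloads and predicts identically. The following sketch checks that round trip on a small stand-in pipeline written to a temporary directory; the file names `demo_model.joblib` and `demo_meta.json` are made up for this example, and the `sklearn_version` field is one possible addition to the metadata, since pickled estimators are not guaranteed to load across scikit-learn versions.

```python
import json
import tempfile
from pathlib import Path

import joblib
import numpy as np
import sklearn
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small stand-in model so the example runs quickly.
X_s, y_s = make_regression(n_samples=100, n_features=4, noise=1.0, random_state=0)
demo_model = Pipeline([("scale", StandardScaler()), ("model", Ridge())]).fit(X_s, y_s)
demo_rmse = float(np.sqrt(mean_squared_error(y_s, demo_model.predict(X_s))))

with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "demo_model.joblib"   # hypothetical file name
    meta_path = Path(tmp) / "demo_meta.json"       # hypothetical file name

    joblib.dump(demo_model, model_path)
    meta_path.write_text(json.dumps({
        "metric": "RMSE",
        "value": demo_rmse,
        "sklearn_version": sklearn.__version__,    # helps diagnose load failures later
    }, indent=2))

    # Reload both artifacts as a downstream consumer would.
    reloaded = joblib.load(model_path)
    meta_back = json.loads(meta_path.read_text())

same_predictions = bool(np.allclose(demo_model.predict(X_s), reloaded.predict(X_s)))
print("round trip OK:", same_predictions)
print(meta_back)
```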

## 11 | Interactive Checkpoints
### 11.1 Quick Quiz ✅
*Q:* Which metric is most sensitive to extreme errors and why?  

### 11.2 Coding Exercise 💻  
Implement `RandomizedSearchCV` for LightGBM; compare runtime and RMSE with grid‑search above.
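As a starting point, here is one way the `RandomizedSearchCV` call can be shaped. It is a sketch that deliberately uses Ridge on synthetic data rather than LightGBM, so the exercise itself is left open; the `loguniform` range and `n_iter=10` are arbitrary choices for illustration.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; swap in X_trainval / y_trainval for the exercise.
X_rs, y_rs = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

rs_pipe = Pipeline([("scale", StandardScaler()), ("model", Ridge())])

rs_search = RandomizedSearchCV(
    rs_pipe,
    # Distributions instead of lists: alpha is sampled on a log scale.
    param_distributions={"model__alpha": loguniform(1e-3, 1e3)},
    n_iter=10,                      # number of sampled candidates (the budget)
    cv=3,
    scoring="neg_root_mean_squared_error",
    random_state=0,                 # makes the sampled candidates reproducible
)
rs_search.fit(X_rs, y_rs)

best_alpha = rs_search.best_params_["model__alpha"]
best_rmse = -rs_search.best_score_
print(f"best alpha={best_alpha:.4g}  CV RMSE={best_rmse:.3f}")
```

Unlike a grid, the cost here is fixed by `n_iter` regardless of how many hyper-parameters are searched, which is the property to look at when comparing runtimes.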

### 11.3 Reflection ✍️  
*When might a linear model beat LightGBM in production? Consider data size, feature sparsity, interpretability, and latency.*


## 12 | Readings & Resources
* **scikit‑learn docs** – Evaluation, Cross‑Validation, Pipelines
* Varoquaux (2018) – “Cross‑validation pitfalls and how to avoid them”
* Optuna – Hyperparameter Optimization intro
* Blog – “Nested CV vs Train/Test/Val” (Sebastian Raschka)

## 13 | Optional Advanced Challenge 🌟
1. **Bayesian Optimization with Optuna**: optimize LightGBM RMSE & training time.
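2. **Prediction Intervals**: train an ensemble of models with different seeds and compute mean ± 1.96×std per prediction. This band reflects seed‑to‑seed model variance only, not noise in the target, so expect it to be narrower than a true prediction interval.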
3. **Deployment Sketch**: serialize pipeline & create FastAPI endpoint.
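For the seed-ensemble idea in challenge 2, a minimal sketch might look like the following. It assumes a `RandomForestRegressor` on synthetic data as a stand-in for LightGBM (any model whose fit depends on `random_state` would do), and the five seeds are an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in data.
X_pi, y_pi = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_pi_tr, X_pi_te, y_pi_tr, y_pi_te = train_test_split(
    X_pi, y_pi, test_size=0.25, random_state=0
)

# One model per seed; each row holds that model's test predictions.
seed_preds = np.stack([
    RandomForestRegressor(n_estimators=20, random_state=seed)
    .fit(X_pi_tr, y_pi_tr)
    .predict(X_pi_te)
    for seed in range(5)
])  # shape: (n_seeds, n_test)

pred_mean = seed_preds.mean(axis=0)
pred_std = seed_preds.std(axis=0, ddof=1)   # sample std across seeds
lower = pred_mean - 1.96 * pred_std
upper = pred_mean + 1.96 * pred_std

# Fraction of test targets that fall inside the band.
coverage = float(np.mean((y_pi_te >= lower) & (y_pi_te <= upper)))
print(f"empirical coverage of the band: {coverage:.2f}")
```

Checking the empirical coverage against the nominal 95% is a quick way to see how much of the total uncertainty the seed spread actually captures.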

## 14 | Completion Checklist ✅
* Leakage‑free pipeline built.
* Nested CV comparison complete.
* Metrics chosen & justified.
* Best model saved (`best_model.joblib`) with metadata.
* Interpretation plot produced.