# Module 4 – Gradient Boosting & LightGBM Regressor
Master gradient‑boosted tree ensembles: intuition, hyper‑parameter tuning, diagnostics, and interpretation.

## 1 | Learning Objectives
By the end of this module you will be able to:

1. **Describe** gradient boosting intuition and key LightGBM innovations.
2. **Configure** core hyper‑parameters (`num_leaves`, `learning_rate`, `n_estimators`, `max_depth`, early‑stopping).
3. **Train & validate** an `LGBMRegressor` with early stopping and cross‑validation.
4. **Interpret** models using feature importance and SHAP value sketches.
5. **Compare** LightGBM with Ridge/Lasso and justify model choice.

## 2 | Key Concepts & Analogies
| Concept | Plain Explanation | Analogy |
|---------|------------------|---------|
| **Boosting** | Builds an ensemble *sequentially*: each new tree fits the residual errors (more generally, the negative gradient of the loss) of the ensemble so far. | Relay race: every runner starts where the last finished, closing the gap. |
| **Learning Rate (η)** | Scales how much each tree corrects the ensemble; small η ⇒ more trees but smoother learning. | Sipping hot coffee slowly vs gulping: safer but takes longer. |
| **num_leaves** | Maximum leaves per tree; higher values capture complex patterns but risk overfitting. | Camera resolution: high res shows more detail *and* noise. |
| **Leaf‑Wise Growth** | LightGBM splits the leaf with maximum gain first, growing unevenly. | Feeding the strongest plant branch first. |
| **Histogram Binning** | Buckets continuous features for faster training & lower memory. | Rolling coins into sleeves instead of counting each coin. |
| **Early Stopping** | Stops adding trees when validation loss stops improving. | Leaving a buffet when comfortably full, not stuffed. |
| **Feature Importance** | Gain or split counts indicate influential variables. | Voting tally: loudest voices matter more. |
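
The boosting and learning-rate rows above can be made concrete with a toy implementation. The sketch below is not LightGBM's algorithm (it has no histogram binning and no leaf-wise growth); it assumes squared-error loss, a single feature, and depth-1 trees. The helper names `fit_stump`, `mse`, and `boost` are hypothetical and exist only for this illustration.

```python
import math

def fit_stump(x, residuals):
    """Least-squares depth-1 tree (a single split) on one feature."""
    best = None
    values = sorted(set(x))
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean

def mse(y, pred):
    return sum((yi - pi) ** 2 for yi, pi in zip(y, pred)) / len(y)

def boost(x, y, n_trees, eta):
    """Return the training MSE after 0, 1, ..., n_trees boosting rounds."""
    pred = [sum(y) / len(y)] * len(y)      # start from the mean prediction
    history = [mse(y, pred)]
    for _ in range(n_trees):
        # For squared error, the residual is the negative gradient of the loss.
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, residuals)
        # eta scales how much of the new tree's correction is applied.
        pred = [pi + eta * stump(xi) for xi, pi in zip(x, pred)]
        history.append(mse(y, pred))
    return history

# Toy data (an assumption for illustration): a noiseless sine curve.
x = [i / 10 for i in range(40)]
y = [math.sin(xi) for xi in x]

slow = boost(x, y, n_trees=50, eta=0.1)
fast = boost(x, y, n_trees=50, eta=0.5)
print(f"eta=0.1: MSE {slow[0]:.4f} -> {slow[-1]:.4f}")
print(f"eta=0.5: MSE {fast[0]:.4f} -> {fast[-1]:.4f}")
```

Comparing the two printed lines shows how η changes the amount of correction applied per tree on the training set. Training error alone says nothing about overfitting, which is why the later cells monitor a validation set.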

In [None]:
# Cell 1 – Imports & Settings
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split, GridSearchCV
from lightgbm import LGBMRegressor, plot_importance

plt.rcParams['figure.dpi'] = 110

In [None]:
# Cell 2 – Load Data & Split
X, y = fetch_california_housing(return_X_y=True, as_frame=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=42
)
X.head()

## 3 | Baseline LightGBM with Early‑Stopping
Set a generous `n_estimators`; early stopping cuts training once validation RMSE plateaus.

In [None]:
lgb_params = dict(
    objective='regression',
    metric='rmse',
    n_estimators=1000,          # upper bound
    learning_rate=0.05,
    num_leaves=31,
    subsample=0.8,
    subsample_freq=1,           # row subsampling only takes effect when subsample_freq > 0
    colsample_bytree=0.8,
    random_state=0,
    n_jobs=-1
)

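from lightgbm import early_stopping, log_evaluation

model = LGBMRegressor(**lgb_params)
model.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    eval_metric='rmse',
    # LightGBM 4.0 removed the early_stopping_rounds/verbose arguments of fit();
    # the callbacks below also work in LightGBM 3.3+.
    callbacks=[
        early_stopping(stopping_rounds=50, verbose=False),
        log_evaluation(period=0),   # period=0 silences per-iteration logging
    ],
)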

print('Best iteration:', model.best_iteration_)
print('Validation RMSE:', model.best_score_['valid_0']['rmse'])
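
The stopping rule used above can be sketched in a few lines of plain Python. This is a simplified model of patience-based early stopping, not LightGBM's source code; the function name `early_stopping_point` and the loss values are made up for illustration.

```python
def early_stopping_point(val_losses, patience):
    """Return (best_iteration, stopped_at), both 1-based.

    Training stops once `patience` consecutive rounds have passed
    without the validation loss improving on its best value.
    """
    best_loss = float("inf")
    best_iter = 0
    for i, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_iter = loss, i
        elif i - best_iter >= patience:
            return best_iter, i
    return best_iter, len(val_losses)   # never triggered: used every round

# Hypothetical validation RMSE per boosting round.
losses = [0.90, 0.70, 0.60, 0.55, 0.56, 0.57, 0.54, 0.58, 0.59, 0.60, 0.61]
best_iter, stopped_at = early_stopping_point(losses, patience=3)
print(best_iter, stopped_at)   # 7 10
```

The brief dip at round 7 resets the patience counter, so a larger patience value makes the rule more tolerant of noisy validation curves at the cost of extra training rounds.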

## 4 | Hyper‑parameter Grid‑Search (lightweight)

In [None]:
grid = {
    'num_leaves': [15, 31, 63],
    'learning_rate': [0.1, 0.05],
    'min_child_samples': [10, 20],
    'subsample': [0.8, 1.0]
}

base_lgb = LGBMRegressor(
    n_estimators=600,
    objective='regression',
    colsample_bytree=0.8,
    subsample_freq=1,   # without this, the 'subsample' values in the grid have no effect
    random_state=0,
    n_jobs=-1
)

search = GridSearchCV(base_lgb, grid, cv=3,
                      scoring='neg_root_mean_squared_error',
                      verbose=0)
search.fit(X_train, y_train)
print('Best params:', search.best_params_)
print('CV RMSE:', -search.best_score_)
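
Before launching a grid search it helps to estimate its cost. The sketch below reproduces the grid from this cell and counts the model fits, assuming `GridSearchCV`'s default `refit=True` (one extra fit on the full training set). The variable names are illustrative only.

```python
from itertools import product

grid = {
    'num_leaves': [15, 31, 63],
    'learning_rate': [0.1, 0.05],
    'min_child_samples': [10, 20],
    'subsample': [0.8, 1.0],
}
cv_folds = 3
n_estimators = 600

# Every combination of grid values is one candidate configuration.
combos = [dict(zip(grid, values)) for values in product(*grid.values())]
n_cv_fits = len(combos) * cv_folds
n_total_fits = n_cv_fits + 1            # +1 for the final refit on all training data
n_trees_built = n_total_fits * n_estimators

print(len(combos), n_cv_fits, n_total_fits, n_trees_built)   # 24 72 73 43800
```

Adding a fifth hyper-parameter with three values would triple these numbers, which is the usual reason to switch to a randomized search once the grid grows.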

## 5 | Feature Importance Plot

In [None]:
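# The default importance type for a scikit-learn style model is split count;
# request gain explicitly so the plot matches its title.
ax = plot_importance(model, importance_type='gain', max_num_features=10)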
ax.set_title('Top 10 Feature Importances (gain)')
plt.show()

## 6 | SHAP Value Sketch *(optional – requires `shap`)*

In [None]:
# Uncomment to install SHAP in a fresh environment
# !pip install shap -q
import shap
explainer = shap.Explainer(model)  # TreeExplainer for LightGBM
shap_values = explainer(X_val.iloc[:200])  # subsample for speed
shap.summary_plot(shap_values, X_val.iloc[:200])
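
To see what a SHAP value represents, the sketch below computes exact Shapley values by brute force for a tiny hand-written function. It is not how TreeSHAP works internally (TreeSHAP exploits tree structure to avoid enumerating subsets), and it assumes that "missing" features are replaced by a single baseline value. The names `shapley_values` and `toy_model` are hypothetical.

```python
from itertools import combinations
from math import factorial

def shapley_values(f, x, baseline):
    """Exact Shapley values of f at x, relative to a single baseline point."""
    n = len(x)

    def value(subset):
        # Features in `subset` take their real value; the rest take the baseline.
        z = [x[j] if j in subset else baseline[j] for j in range(n)]
        return f(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for k in range(n):                       # subset sizes 0 .. n-1
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi

def toy_model(z):
    # One purely additive term and one interaction term.
    return 2.0 * z[0] + z[1] * z[2]

x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print(phi)   # approximately [2.0, 3.0, 3.0]
```

The values sum to `toy_model(x) - toy_model(baseline)`, and the interaction term's contribution of 6 is split evenly between the two features involved. The same additivity is what lets a SHAP summary plot decompose each prediction into per-feature contributions.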

## 7 | Interactive Checkpoints
### 7.1 Quick Quiz ✅
1. Boosting predominantly reduces **bias**.
2. **False** – a lower `learning_rate` alone does not prevent overfitting; given enough trees, the ensemble can still overfit.
3. Besides `num_leaves`, `max_depth` limits tree complexity.
4. Early stopping triggers after a user‑set number (e.g., 50) rounds without improvement.

### 7.2 Coding Exercise 💻
**Task:**
1. Train **Model A** (`learning_rate=0.1`, `num_leaves=31`, `n_estimators=400`).
2. Train **Model B** (`learning_rate=0.03`, `num_leaves=63`, `n_estimators=1200`, `early_stopping_rounds=80`).
3. Record validation RMSE & training time (`%%time`).
4. Explain which model you’d deploy and why.

### 7.3 Reflection ✍️
*When might a linear Ridge/Lasso beat LightGBM? Discuss data size, dimensionality & interpretability.*

## 8 | Readings & Resources
* **LightGBM Docs** – Parameters, Python API, FAQ
* Microsoft: “LightGBM Cheatsheet”
* Kaggle Blog: “How to Tune LightGBM”
* Video: StatQuest – Gradient Boosting & XGBoost
* **Interpretability**: SHAP documentation – TreeSHAP

## 9 | Optional Advanced Challenge 🌟
**Custom Objective & Cross‑Validation Script**
1. Implement a Huber-loss custom objective.
2. Use `lightgbm.cv` with 5‑fold CV (`nfold=5`), early stopping, `num_boost_round=10_000` (the native API's counterpart of `n_estimators`), `learning_rate=0.01`.
3. Plot CV RMSE vs iteration; find the optimal iteration.
4. Compare with squared‑error objective on noisy targets.
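
As a starting point for step 1, a custom objective has to supply the first and second derivatives of the loss with respect to the prediction. Assuming the standard Huber definition with threshold $\delta$ and residual $r = \hat{y} - y$ (both symbols are introduced here and are not used elsewhere in this module):

$$
L_\delta(r) =
\begin{cases}
\tfrac{1}{2} r^2 & \text{if } |r| \le \delta \\
\delta \left( |r| - \tfrac{1}{2}\delta \right) & \text{otherwise}
\end{cases}
$$

$$
\frac{\partial L_\delta}{\partial \hat{y}} =
\begin{cases}
r & \text{if } |r| \le \delta \\
\delta \,\operatorname{sign}(r) & \text{otherwise}
\end{cases}
\qquad
\frac{\partial^2 L_\delta}{\partial \hat{y}^2} =
\begin{cases}
1 & \text{if } |r| \le \delta \\
0 & \text{otherwise}
\end{cases}
$$

The second derivative vanishes in the linear region, so an implementation may need to handle leaves whose samples all fall there; check how your chosen settings behave rather than treating any particular workaround as a fixed rule.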

*Stretch:* Export best model to ONNX and reload for prediction.

## 10 | Completion Checklist ✅
Advance to **Module 5** when you can:
* Tune LightGBM hyper‑parameters & justify with validation curves.
* Employ early stopping effectively.
* Explain feature importance & demonstrate a SHAP summary.
* Compare LightGBM with linear models and articulate trade‑offs.