# Ensemble Methods & Boosting — Student Lab

Week 4 introduces sklearn models, but you must still explain *why* they work in terms of the bias/variance trade-off.

In [None]:
import numpy as np
from sklearn.datasets import make_classification, load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

def check(name: str, cond: bool):
    if not cond:
        raise AssertionError(f'Failed: {name}')
    print(f'OK: {name}')

rng = np.random.default_rng(0)

## Section 0 — Dataset (synthetic default, real optional)

### Task 0.1: Choose dataset
Use the synthetic dataset by default. Optionally switch to the breast cancer dataset (`load_breast_cancer`).

# TODO: set `use_real` to `False` (synthetic) or `True` (breast cancer).

In [None]:
use_real = False  # TODO

if use_real:
    data = load_breast_cancer()
    X = data.data
    y = data.target
else:
    X, y = make_classification(
        n_samples=2000,
        n_features=20,
        n_informative=8,
        n_redundant=4,
        class_sep=1.0,
        flip_y=0.03,
        random_state=0,
    )

Xtr, Xva, ytr, yva = train_test_split(X, y, test_size=0.3, random_state=0, stratify=y)
check('shapes', Xtr.shape[0] == ytr.shape[0] and Xva.shape[0] == yva.shape[0])
Xtr.shape

## Section 1 — Baseline vs Trees vs Random Forest

### Task 1.1: Train baseline decision tree vs random forest

# TODO: Train:
- DecisionTreeClassifier(max_depth=?)
- RandomForestClassifier(n_estimators=?, max_depth=?, oob_score=True, bootstrap=True)

Compute accuracy and ROC-AUC on the validation split (`Xva`, `yva`).

**Checkpoint:** Why does bagging reduce variance?
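
One way to build intuition for this checkpoint: if `B` predictors each have variance `sigma**2` and pairwise correlation `rho`, their average has variance `rho * sigma**2 + (1 - rho) * sigma**2 / B`. The sketch below checks that formula numerically with plain NumPy. The Gaussian "predictions" and the values of `B`, `rho`, and `sigma` are illustrative assumptions, not properties of the lab's models.

```python
import numpy as np

def averaged_variance(B, rho, sigma=1.0, n_trials=50_000, seed=0):
    """Empirical variance of the mean of B equicorrelated Gaussian 'predictions'."""
    rng = np.random.default_rng(seed)
    # A shared component induces pairwise correlation rho; private noise is independent.
    shared = rng.normal(size=(n_trials, 1))
    private = rng.normal(size=(n_trials, B))
    preds = sigma * (np.sqrt(rho) * shared + np.sqrt(1.0 - rho) * private)
    return preds.mean(axis=1).var()

def theoretical_variance(B, rho, sigma=1.0):
    """Variance of the average of B predictors with pairwise correlation rho."""
    return rho * sigma**2 + (1.0 - rho) * sigma**2 / B

for B in (1, 10, 100):
    print(B, averaged_variance(B, rho=0.3), theoretical_variance(B, rho=0.3))
```

The second term shrinks as `B` grows, while the first term does not. That floor is one reason random forests also subsample features at each split: it lowers the correlation between trees.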

In [None]:
# TODO
tree = ...
rf = ...

tree.fit(Xtr, ytr)
rf.fit(Xtr, ytr)

def eval_model(clf, X, y):
    pred = clf.predict(X)
    acc = accuracy_score(y, pred)
    # Most sklearn classifiers expose predict_proba; report NaN for AUC when it is missing.
    if hasattr(clf, 'predict_proba'):
        proba = clf.predict_proba(X)[:, 1]
        auc = roc_auc_score(y, proba)
    else:
        auc = float('nan')
    return acc, auc

print('tree', eval_model(tree, Xva, yva))
print('rf  ', eval_model(rf, Xva, yva))

if hasattr(rf, 'oob_score_'):
    print('rf oob_score', rf.oob_score_)
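
To make the OOB score concrete: each tree is fit on a bootstrap sample, so a fraction of about `(1 - 1/n)**n` (roughly `1/e`) of the training rows is left out of any given tree and can act as held-out data for it. A minimal NumPy sketch of that fraction follows. The helper `oob_fraction` and its arguments are hypothetical names used only for illustration.

```python
import numpy as np

def oob_fraction(n, n_trees=200, seed=0):
    """Average fraction of rows NOT drawn in a bootstrap sample of size n."""
    rng = np.random.default_rng(seed)
    fractions = []
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)       # draw n row indices with replacement
        in_bag = np.zeros(n, dtype=bool)
        in_bag[idx] = True
        fractions.append(1.0 - in_bag.mean())  # share of rows never drawn
    return float(np.mean(fractions))

n = 1400  # size of the 70% training split when the synthetic data has 2000 rows
print(oob_fraction(n), (1 - 1 / n) ** n)
```

Because every row is out-of-bag for some trees, `oob_score_` can be compared with the validation accuracy as a sanity check without touching `Xva`.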

### Task 1.2: Feature importance gotcha

Inspect `feature_importances_` and explain why correlated features can distort importances.

# TODO: print top 10 features by importance.
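
A small experiment can make the gotcha visible. The sketch below fits a forest on stand-in data, then appends an exact copy of one column (the extreme case of a correlated feature) and refits. The data, the `_demo` names, and the forest settings are assumptions for illustration and are separate from the lab's `rf`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Illustrative data: 5 informative features and no redundant ones.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=5, n_informative=5, n_redundant=0,
    random_state=0,
)

def importances(X, y):
    """Fit a forest and return its impurity-based importances."""
    rf_demo = RandomForestClassifier(n_estimators=200, random_state=0)
    rf_demo.fit(X, y)
    return rf_demo.feature_importances_

base = importances(X_demo, y_demo)

# Append an exact copy of column 0 as a new column with index 5.
X_dup = np.hstack([X_demo, X_demo[:, [0]]])
dup = importances(X_dup, y_demo)

print('feature 0 alone      ', base[0])
print('feature 0 with a copy', dup[0], 'copy:', dup[5])
```

Importances are normalized to sum to 1, so the credit for one underlying signal gets divided between the original column and its copy, and each can look less important than the signal really is.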

In [None]:
# TODO
imp = rf.feature_importances_
top = np.argsort(-imp)[:10]
print('top idx', top)
print('top importances', imp[top])

## Section 2 — Gradient Boosting

### Task 2.1: Train GradientBoostingClassifier

# TODO: Train GB with several `n_estimators` and `learning_rate` settings and compare validation accuracy and ROC-AUC.

**Checkpoint:** Why can boosting overfit with too many estimators?
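
One way to investigate this checkpoint is to track the loss after every boosting stage rather than only at the end. The sketch below uses `staged_predict_proba`, which yields predictions after 1, 2, ..., `n_estimators` trees. The stand-in data and the `_d` / `_demo` names are assumptions so the block runs on its own. In the lab you would use `Xtr`, `Xva`, `ytr`, `yva` instead.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Stand-in data resembling the lab's synthetic settings, with fewer rows.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=20, n_informative=8, n_redundant=4,
    flip_y=0.03, random_state=0,
)
Xtr_d, Xva_d, ytr_d, yva_d = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

gb_demo = GradientBoostingClassifier(
    n_estimators=300, learning_rate=0.1, max_depth=2, random_state=0
)
gb_demo.fit(Xtr_d, ytr_d)

# One log-loss value per boosting stage, for each split.
train_loss = [log_loss(ytr_d, p) for p in gb_demo.staged_predict_proba(Xtr_d)]
val_loss = [log_loss(yva_d, p) for p in gb_demo.staged_predict_proba(Xva_d)]

best = int(np.argmin(val_loss)) + 1
print('stage with lowest validation log-loss:', best)
print('final train loss:', train_loss[-1], 'final val loss:', val_loss[-1])
```

Each new tree is fit to reduce the remaining training error, so training loss keeps falling. Validation loss may flatten or rise once later trees start fitting noise such as the flipped labels.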

In [None]:
settings = [
    {'n_estimators': 50, 'learning_rate': 0.1, 'max_depth': 2},
    {'n_estimators': 200, 'learning_rate': 0.1, 'max_depth': 2},
    {'n_estimators': 200, 'learning_rate': 0.05, 'max_depth': 2},
]

for s in settings:
    gb = GradientBoostingClassifier(random_state=0, **s)
    gb.fit(Xtr, ytr)
    print('gb', s, eval_model(gb, Xva, yva))

## Section 3 — XGBoost-style knobs (conceptual)

### Task 3.1: Explain what each knob does
Write 2-3 bullets for each knob:
- subsample
- colsample
- learning rate
- max_depth

- **subsample:**
- **colsample:**
- **learning_rate:**
- **max_depth:**
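
To experiment with these knobs without installing XGBoost, sklearn's `GradientBoostingClassifier` exposes close analogues. The mapping below is a sketch with hypothetical values. The correspondence for column sampling is only approximate: XGBoost's `colsample_bytree` samples columns once per tree, whereas sklearn's `max_features` resamples them at every split (closer to `colsample_bynode`).

```python
from sklearn.ensemble import GradientBoostingClassifier

# Hypothetical values, chosen only to show where each knob lives.
gb_knobs = GradientBoostingClassifier(
    subsample=0.8,       # fraction of training rows used to fit each tree
    max_features=0.5,    # fraction of columns considered at each split
    learning_rate=0.05,  # shrinkage applied to each tree's contribution
    max_depth=3,         # depth limit of each individual tree
    n_estimators=200,
    random_state=0,
)
print(gb_knobs.get_params()['subsample'], gb_knobs.get_params()['max_features'])
```

Changing one value at a time and re-running `eval_model` on the validation split is a simple way to gather evidence for your bullets.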

---
## Submission Checklist
- All TODOs completed
- Baseline vs RF vs GB compared
- OOB score discussed (if available)
- Feature importance gotcha explained