# Ensemble Methods & Boosting — Student Lab

Week 4 introduces sklearn models, but you must still explain *why* they work (bias/variance).

In [7]:
import numpy as np
from sklearn.datasets import make_classification, load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

def check(name: str, cond: bool):
    if not cond:
        raise AssertionError(f'Failed: {name}')
    print(f'OK: {name}')

rng = np.random.default_rng(0)

## Section 0 — Dataset (synthetic default, real optional)

### Task 0.1: Choose dataset
Use the synthetic dataset by default. Optionally switch to the breast cancer dataset.

# TODO: set `use_real = False` or True

In [8]:
use_real = False  # TODO

if use_real:
    data = load_breast_cancer()
    X = data.data
    y = data.target
else:
    X, y = make_classification(
        n_samples=2000,
        n_features=20,
        n_informative=8,
        n_redundant=4,
        class_sep=1.0,
        flip_y=0.03,
        random_state=0,
    )

Xtr, Xva, ytr, yva = train_test_split(X, y, test_size=0.3, random_state=0, stratify=y)
check('shapes', Xtr.shape[0]==ytr.shape[0] and Xva.shape[0]==yva.shape[0])
Xtr.shape

OK: shapes


(1400, 20)

## Section 1 — Baseline vs Trees vs Random Forest

### Task 1.1: Train baseline decision tree vs random forest

# TODO: Train:
- DecisionTreeClassifier(max_depth=?)
- RandomForestClassifier(n_estimators=?, max_depth=?, oob_score=True, bootstrap=True)

Compute accuracy + ROC-AUC on validation.

**Checkpoint:** Why does bagging reduce variance?

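Because each tree is trained on a different bootstrap sample, the trees' errors are only partly correlated, so averaging their predictions cancels much of the error that is specific to any single tree. The bias stays roughly the same, while the variance of the average falls as more trees are added, down to a floor set by how correlated the trees are.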
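The variance argument can be checked numerically. The sketch below is a toy simulation, not a random forest: it assumes each "model error" is a Gaussian with variance `sigma**2` and a common pairwise correlation `rho`, and compares the empirical variance of the average with the standard formula `rho * sigma**2 + (1 - rho) * sigma**2 / B`. The function names and the value `rho=0.3` are illustrative choices.

```python
import numpy as np

def variance_of_average(n_models, rho, sigma=1.0, n_trials=20000, seed=0):
    """Empirical variance of the mean of n_models equally correlated errors."""
    rng_sim = np.random.default_rng(seed)
    # A shared component gives every pair of errors correlation rho.
    shared = rng_sim.normal(size=(n_trials, 1))
    private = rng_sim.normal(size=(n_trials, n_models))
    errors = sigma * (np.sqrt(rho) * shared + np.sqrt(1.0 - rho) * private)
    return float(errors.mean(axis=1).var())

def theoretical_variance(n_models, rho, sigma=1.0):
    """Variance of the mean of n_models errors with pairwise correlation rho."""
    return rho * sigma**2 + (1.0 - rho) * sigma**2 / n_models

for b in (1, 10, 100):
    print(b, variance_of_average(b, rho=0.3), theoretical_variance(b, rho=0.3))
```

The variance drops quickly at first and then flattens near `rho * sigma**2`, which is why random forests also subsample features at each split: it lowers the correlation between trees.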

In [9]:
# TODO
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
rf = RandomForestClassifier(n_estimators=300, max_depth=None, min_samples_leaf=2, oob_score=True)

tree.fit(Xtr, ytr)
rf.fit(Xtr, ytr)

def eval_model(clf, X, y):
    pred = clf.predict(X)
    acc = accuracy_score(y, pred)
    # many sklearn classifiers have predict_proba; handle if not
    if hasattr(clf, 'predict_proba'):
        proba = clf.predict_proba(X)[:, 1]
        auc = roc_auc_score(y, proba)
    else:
        auc = float('nan')
    return acc, auc

print('tree', eval_model(tree, Xva, yva))
print('rf  ', eval_model(rf, Xva, yva))

if hasattr(rf, 'oob_score_'):
    print('rf oob_score', rf.oob_score_)

tree (0.805, 0.8664611111111112)
rf   (0.8916666666666667, 0.9509777777777777)
rf oob_score 0.9135714285714286
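To help with the OOB discussion, here is a small sketch of where out-of-bag rows come from. It assumes plain bootstrap sampling (draw `n` rows with replacement) and only counts which rows are left out; it does not reproduce how `RandomForestClassifier` computes `oob_score_`. The names `n_rows`, `n_draws`, and `demo_rng` are illustrative.

```python
import numpy as np

demo_rng = np.random.default_rng(0)
n_rows = 1400   # same size as the training split above
n_draws = 200   # number of simulated bootstrap samples

oob_fractions = []
for _ in range(n_draws):
    # Draw a bootstrap sample of row indices with replacement.
    idx = demo_rng.integers(0, n_rows, size=n_rows)
    in_bag = np.zeros(n_rows, dtype=bool)
    in_bag[idx] = True
    # Rows never drawn are out-of-bag for this tree.
    oob_fractions.append(1.0 - in_bag.mean())

mean_oob = float(np.mean(oob_fractions))
expected_oob = (1 - 1 / n_rows) ** n_rows  # close to exp(-1)
print(mean_oob, expected_oob)
```

Roughly a third of the rows are unseen by each tree, so every training row can be scored by the trees that never saw it. That makes the OOB score a validation-like estimate obtained without a separate hold-out set, though it need not match the validation score exactly.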


### Task 1.2: Feature importance gotcha

Inspect `feature_importances_` and explain why correlated features can distort importances.

# TODO: print top 10 features by importance.

In [10]:
# TODO
imp = rf.feature_importances_
top = np.argsort(-imp)[:10]
print('top idx', top)
print('top importances', imp[top])

top idx [12 15 17  3  7 11  4 18  1 13]
top importances [0.12932485 0.12600767 0.12339085 0.12060583 0.10097197 0.09630965
 0.05309302 0.03569581 0.03334735 0.03130096]
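The correlated-feature gotcha can be illustrated on a small, separate dataset. The sketch below is an assumption-laden toy, not part of the lab solution: it fits a forest, appends a near-duplicate of the most important column, refits, and prints how the impurity-based importance is distributed afterwards. All `*_demo`, `*_base`, and `*_dup` names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small toy problem: 3 informative columns followed by 3 noise columns.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=6, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)

rf_base = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_demo, y_demo)
imp_base = rf_base.feature_importances_
top_feature = int(np.argmax(imp_base))

# Append a slightly noisy copy of the strongest feature and refit.
noise = np.random.default_rng(0).normal(scale=0.01, size=X_demo.shape[0])
X_dup = np.column_stack([X_demo, X_demo[:, top_feature] + noise])
rf_dup = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_dup, y_demo)
imp_dup = rf_dup.feature_importances_

print('importance before copy :', imp_base[top_feature])
print('original after copy    :', imp_dup[top_feature])
print('copy after copy        :', imp_dup[-1])
```

Because importances sum to 1 and either column can serve the same splits, the credit is typically shared between the original and its copy, so each can look less important than the underlying signal really is. Permutation importance on held-out data is a common cross-check, although it is also affected by correlated features.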


## Section 2 — Gradient Boosting

### Task 2.1: Train GradientBoostingClassifier

# TODO: Train GB with different n_estimators and learning_rate and compare.

**Checkpoint:** Why can boosting overfit with too many estimators?

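Each new tree is fit to the errors the current ensemble still makes on the training set. Once the real signal has been captured, the remaining errors are mostly noise (in the synthetic data, for example, the labels flipped by `flip_y=0.03`), so additional trees start fitting that noise. Training loss keeps falling while validation performance plateaus or degrades, which is overfitting. A smaller learning rate or early stopping limits this.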

In [11]:
settings = [
    {'n_estimators': 50, 'learning_rate': 0.1, 'max_depth': 2},
    {'n_estimators': 200, 'learning_rate': 0.1, 'max_depth': 2},
    {'n_estimators': 200, 'learning_rate': 0.05, 'max_depth': 2},
]

for s in settings:
    gb = GradientBoostingClassifier(random_state=0, **s)
    gb.fit(Xtr, ytr)
    print('gb', s, eval_model(gb, Xva, yva))

gb {'n_estimators': 50, 'learning_rate': 0.1, 'max_depth': 2} (0.8616666666666667, 0.9393722222222223)
gb {'n_estimators': 200, 'learning_rate': 0.1, 'max_depth': 2} (0.8883333333333333, 0.9497888888888889)
gb {'n_estimators': 200, 'learning_rate': 0.05, 'max_depth': 2} (0.8833333333333333, 0.9481888888888889)
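One way to look at the overfitting checkpoint is to track loss after every boosting stage instead of refitting for each `n_estimators`. The sketch below uses `staged_predict_proba`, which yields predictions after each stage of a fitted `GradientBoostingClassifier`. It regenerates a smaller synthetic dataset so it runs on its own; the `*_s` names and the choice of 300 stages are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

X_s, y_s = make_classification(
    n_samples=1000, n_features=20, n_informative=8, n_redundant=4,
    flip_y=0.03, random_state=0,
)
Xtr_s, Xva_s, ytr_s, yva_s = train_test_split(
    X_s, y_s, test_size=0.3, random_state=0, stratify=y_s
)

gb_s = GradientBoostingClassifier(
    n_estimators=300, learning_rate=0.1, max_depth=2, random_state=0
)
gb_s.fit(Xtr_s, ytr_s)

# One log-loss value per boosting stage, for training and validation data.
train_loss = [log_loss(ytr_s, p) for p in gb_s.staged_predict_proba(Xtr_s)]
val_loss = [log_loss(yva_s, p) for p in gb_s.staged_predict_proba(Xva_s)]

best_n = int(np.argmin(val_loss)) + 1
print('best n_estimators by validation log-loss:', best_n)
print('train loss at final stage:', train_loss[-1])
print('val loss at best / final stage:', val_loss[best_n - 1], val_loss[-1])
```

If the validation loss bottoms out before the final stage while the training loss keeps decreasing, the later trees are fitting noise. Plotting both curves against the stage index makes the gap easy to see.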


## Section 3 — XGBoost-style knobs (conceptual)

### Task 3.1: Explain what each knob does
Write 2-3 bullets each:
- subsample
- colsample
- learning rate
- max_depth

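- **subsample:**
    - Each tree is trained on a random fraction of the rows, which adds randomness.
    - The trees become less correlated, which reduces overfitting.
    - It acts as a regularizer: values below 1.0 trade a little bias for lower variance.

- **colsample:**
    - Each tree is trained on only a random subset of the features.
    - Prevents a single dominant feature from dominating every tree.
    - Helps generalization by decorrelating the trees.

- **learning_rate:**
    - Scales the contribution of each new tree to the ensemble.
    - Dictates how aggressively the model is fit: smaller values usually need more trees but tend to generalize better.

- **max_depth:**
    - Controls how deep each tree can grow.
    - The deeper the trees, the more complex the model becomes and the higher the risk of overfitting.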
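The knobs above have rough counterparts in scikit-learn's `GradientBoostingClassifier`, which the sketch below uses so that no extra package is needed. The mapping is an approximation: `max_features` samples features at each split, which is closer in spirit to XGBoost's `colsample_bynode` than to per-tree column sampling. The specific values and the `*_k` names are illustrative assumptions, not tuned settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X_k, y_k = make_classification(
    n_samples=1000, n_features=20, n_informative=8, n_redundant=4,
    flip_y=0.03, random_state=0,
)
Xtr_k, Xva_k, ytr_k, yva_k = train_test_split(
    X_k, y_k, test_size=0.3, random_state=0, stratify=y_k
)

gb_k = GradientBoostingClassifier(
    n_estimators=200,
    subsample=0.8,       # fraction of rows sampled for each tree
    max_features=0.8,    # fraction of features considered at each split
    learning_rate=0.05,  # shrinkage applied to each tree's contribution
    max_depth=3,         # maximum depth of each individual tree
    random_state=0,
)
gb_k.fit(Xtr_k, ytr_k)

val_acc = gb_k.score(Xva_k, yva_k)
print('validation accuracy:', val_acc)
```

Changing one knob at a time and rerunning is a simple way to see its effect on validation accuracy before trying a full grid search.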

---
## Submission Checklist
- All TODOs completed
- Baseline vs RF vs GB compared
- OOB score discussed (if available)
- Feature importance gotcha explained