# 16. Ensemble Learning: Bagging

**Purpose:** Learn and revise **Bagging** (Bootstrap Aggregating) in Scikit-learn.

---

## What is Bagging?

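**Bagging** trains **many copies** of the same **base estimator** (e.g. a decision tree), each on a **different bootstrap sample** of the training set (a random sample drawn with replacement). Predictions are combined by **averaging** (regression) or **majority voting** (classification); scikit-learn's BaggingClassifier averages predicted class probabilities instead when the base estimator provides `predict_proba`.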

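- **Bootstrap:** Each model is trained on n rows drawn with replacement from the n training rows, so one sample contains on average about 63% of the distinct rows (1 - 1/e ≈ 0.632 for large n); the rows it misses are that model's "out-of-bag" rows.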
- **Goal:** Reduce **variance** by averaging over many slightly different models; base estimator should be **unstable** (e.g. trees) to benefit.
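
The bootstrap-then-vote procedure described above can be sketched by hand. This is a minimal illustration, not how scikit-learn implements it internally: the synthetic data, the variable names (`models`, `votes`, `unique_fractions`) and the choice of 25 trees are all assumptions made for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic binary classification data (illustrative only)
X = rng.normal(size=(200, 4))
y = (X[:, 0] ** 2 + X[:, 1] > 0).astype(int)

n_samples = X.shape[0]
n_models = 25  # odd, so a binary majority vote cannot tie
models = []
unique_fractions = []

for _ in range(n_models):
    # Bootstrap: draw n row indices with replacement
    idx = rng.integers(0, n_samples, size=n_samples)
    # Fraction of distinct training rows that ended up in this sample
    unique_fractions.append(np.unique(idx).size / n_samples)
    models.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))

# Majority vote over the 0/1 predictions of all models
votes = np.stack([m.predict(X) for m in models])  # shape (n_models, n_samples)
y_vote = (votes.mean(axis=0) >= 0.5).astype(int)

mean_unique = float(np.mean(unique_fractions))
print("Mean fraction of distinct rows per bootstrap:", round(mean_unique, 3))
print("Training accuracy of the vote:", (y_vote == y).mean())
```

The mean fraction of distinct rows should land close to 0.63, and the rows missing from each sample are the ones an out-of-bag estimate would use for that model.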

**Key idea:** Random Forest is bagging of decision trees with extra feature randomness: each split considers only a random subset of the features. **BaggingClassifier** / **BaggingRegressor** let you bag **any** base estimator.

## Concepts to Remember

| Concept | Description |
|--------|-------------|
| **estimator** | The model to bag (e.g. DecisionTreeClassifier); named `base_estimator` before scikit-learn 1.2. Default is a decision tree. |
| **n_estimators** | Number of bootstrap models. |
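| **max_samples** | Rows drawn per bootstrap sample: a float is a fraction of the training set, an int is an absolute count (default 1.0). |
| **oob_score** | If True, compute an out-of-bag score: each training row is scored only by the models that did not see it. Requires bootstrap=True (the default). |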

In [None]:
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

In [None]:
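# Synthetic data: 300 rows, 4 features; only the first two determine the label
np.random.seed(42)
X = np.random.randn(300, 4)
y = (X[:, 0]**2 + X[:, 1] > 0).astype(int)

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)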

In [None]:
# Base model: a shallow tree; bagging fits 50 clones of it on bootstrap samples
base = DecisionTreeClassifier(max_depth=4, random_state=42)
model = BaggingClassifier(estimator=base, n_estimators=50, random_state=42, oob_score=True)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Compare held-out accuracy with the out-of-bag estimate from the training rows
print("Accuracy (test):", accuracy_score(y_test, y_pred))
print("OOB score:", model.oob_score_)
print(classification_report(y_test, y_pred))

## Key Takeaways

- **BaggingClassifier** / **BaggingRegressor**; **estimator** = base model, **n_estimators** = number of bags.
- Use **oob_score=True** for a validation estimate without a separate holdout set.
- Bagging works best with **high-variance** base estimators (e.g. deep trees).