# 10. Random Forests

**Purpose:** Learn and revise **Random Forests** in scikit-learn.

---

## What is a Random Forest?

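A **random forest** is an **ensemble** of many **decision trees**. Each tree is trained on a **bootstrap sample** of the data and, at each split, considers only a **random subset of features**. Predictions are made by **averaging** the trees' outputs (regression) or by **majority voting** (classification). In scikit-learn the vote is soft: the forest averages the trees' predicted class probabilities and returns the most probable class.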
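
To make the aggregation step concrete, here is a minimal sketch that rebuilds a forest's predictions by hand from its fitted trees (`estimators_`). The synthetic data and the names `X_demo`, `y_reg`, `y_clf` are illustrative assumptions, not part of the notebook below.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Illustrative synthetic data (assumption, chosen only for this sketch).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_reg = X_demo[:, 0] ** 2 + X_demo[:, 1] + rng.normal(scale=0.1, size=200)
y_clf = (y_reg > 0.5).astype(int)

# Regression: the forest prediction is the mean of the per-tree predictions.
reg = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_demo, y_reg)
tree_preds = np.stack([tree.predict(X_demo) for tree in reg.estimators_])
manual_reg = tree_preds.mean(axis=0)
reg_matches = np.allclose(manual_reg, reg.predict(X_demo))

# Classification: average the per-tree class probabilities,
# then pick the class with the highest mean probability.
clf = RandomForestClassifier(n_estimators=50, max_depth=5, random_state=0)
clf.fit(X_demo, y_clf)
tree_probas = np.stack([tree.predict_proba(X_demo) for tree in clf.estimators_])
manual_proba = tree_probas.mean(axis=0)
manual_labels = clf.classes_[manual_proba.argmax(axis=1)]
label_agreement = float((manual_labels == clf.predict(X_demo)).mean())

print("Regressor matches manual mean:", reg_matches)
print("Classifier agreement with manual soft vote:", label_agreement)
```

With fully grown trees every leaf is pure, so each tree's probabilities are 0 or 1 and the soft vote coincides with a plain majority vote; with `max_depth` set, the two can differ on borderline rows.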

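- **Bootstrap:** Each tree sees a random sample of rows drawn with replacement; about 37% of rows (roughly 1/3, approaching \( 1/e \) for large samples) are left out of each tree's sample and are called "out-of-bag" for that tree.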
- **Feature randomness:** At each split, only \( m \) features are tried (e.g. \( \sqrt{p} \) for classification); reduces correlation between trees.
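
The out-of-bag fraction in the bootstrap bullet can be checked with a short simulation. This is a sketch using NumPy only; the row count and number of repetitions are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows, n_trees = 1000, 200  # arbitrary sizes for the simulation

oob_fractions = []
for _ in range(n_trees):
    # Draw a bootstrap sample: n_rows row indices, with replacement.
    sample_idx = rng.integers(0, n_rows, size=n_rows)
    in_bag = np.zeros(n_rows, dtype=bool)
    in_bag[sample_idx] = True
    # Rows never drawn are out-of-bag for this "tree".
    oob_fractions.append(1.0 - in_bag.mean())

mean_oob = float(np.mean(oob_fractions))
print(f"Mean OOB fraction over {n_trees} bootstrap samples: {mean_oob:.3f}")
print(f"Large-sample limit 1/e: {np.exp(-1):.3f}")
```

The chance that a given row is missed by all \( n \) draws is \( (1 - 1/n)^n \), which tends to \( 1/e \approx 0.368 \) as \( n \) grows.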

**Key idea:** Averaging many noisy but diverse trees reduces variance and overfitting while keeping the flexibility of trees.
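
One rough way to see the effect of averaging is to cross-validate a single unpruned tree against a forest on the same data. The sketch below assumes synthetic data similar to the example later in this notebook; the scores depend on the seed and should not be read as a benchmark.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Illustrative synthetic data (assumption): only features 0 and 1 matter.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
y_demo = (X_demo[:, 0] ** 2 + X_demo[:, 1] > 0.5).astype(int)

single_tree = DecisionTreeClassifier(random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0)

# 5-fold cross-validated accuracy for each model.
tree_scores = cross_val_score(single_tree, X_demo, y_demo, cv=5)
forest_scores = cross_val_score(forest, X_demo, y_demo, cv=5)

print(f"Single tree: mean={tree_scores.mean():.3f}, std={tree_scores.std():.3f}")
print(f"Forest:      mean={forest_scores.mean():.3f}, std={forest_scores.std():.3f}")
```

On data like this the forest often scores at least as well as the single tree and varies less across folds, but that is a tendency rather than a guarantee.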

## Concepts to Remember

| Concept | Description |
|--------|-------------|
| **n_estimators** | Number of trees; more = often better (with diminishing returns), slower. |
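| **max_features** | Features considered per split; smaller = more diverse (less correlated) trees and lower variance, but each individual tree gets weaker if it is set too low. |
| **Out-of-bag (OOB)** | Set **oob_score=True** (requires the default **bootstrap=True**) to score each row using only the trees that did not train on it; **oob_score_** then estimates generalization performance without a separate validation set. |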
| **When to use** | Strong default for tabular data; robust, handles non-linearity, gives feature importances. |
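
The parameters in the table can be compared quickly through the OOB score. The following is a sketch on assumed synthetic data; the grid values are arbitrary, and `rf` and `oob_results` are names introduced here for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Illustrative synthetic data (assumption): only features 0 and 1 matter.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
y_demo = (X_demo[:, 0] ** 2 + X_demo[:, 1] > 0.5).astype(int)

oob_results = {}
for n_trees in (50, 100, 200):
    # "sqrt" tries sqrt(p) features per split; None tries all p features.
    for max_feat in ("sqrt", None):
        rf = RandomForestClassifier(
            n_estimators=n_trees,
            max_features=max_feat,
            oob_score=True,  # score each row with trees that did not see it
            random_state=0,
        )
        rf.fit(X_demo, y_demo)
        oob_results[(n_trees, max_feat)] = rf.oob_score_

for (n_trees, max_feat), score in oob_results.items():
    print(f"n_estimators={n_trees:3d}, max_features={max_feat!s:>4}: OOB={score:.3f}")
```

Very small forests are left out of the grid on purpose: with only a handful of trees, some rows may be in-bag for every tree and have no OOB prediction at all.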

In [None]:
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

In [None]:
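# Synthetic data: 300 rows, 4 features; only features 0 and 1 affect the label.
np.random.seed(42)
X = np.random.randn(300, 4)
y = (X[:, 0]**2 + X[:, 1] > 0.5).astype(int)

# Hold out 20% of the rows as a test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)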

In [None]:
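# 100 trees, each limited to depth 5; oob_score=True also computes the OOB accuracy.
model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42, oob_score=True)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Compare held-out accuracy with the OOB estimate computed from the training rows.
print("Accuracy (test):", accuracy_score(y_test, y_pred))
print("OOB score:", model.oob_score_)
# Impurity-based importances, one value per feature, summing to 1.
print("Feature importances:", model.feature_importances_)
print(classification_report(y_test, y_pred))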

## Key Takeaways

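- Use **RandomForestClassifier** for classification and **RandomForestRegressor** for regression; tune **n_estimators**, **max_depth**, **max_features**.
- **feature_importances_** averages each feature's impurity decrease across trees and normalizes the result to sum to 1. It is computed on the training data and can overstate high-cardinality features.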
- Use **oob_score=True** for a quick validation estimate without cross-validation.
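
As a check on how **feature_importances_** is aggregated, the sketch below recomputes it from the individual trees. It assumes synthetic data like the example above and that every tree has at least one split, since scikit-learn skips single-node trees in this average.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Illustrative synthetic data (assumption): only features 0 and 1 matter.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
y_demo = (X_demo[:, 0] ** 2 + X_demo[:, 1] > 0.5).astype(int)

forest = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=0)
forest.fit(X_demo, y_demo)

# Each fitted tree exposes its own normalized impurity-based importances.
per_tree = np.array([tree.feature_importances_ for tree in forest.estimators_])

# Average across trees, then renormalize so the values sum to 1.
manual_importances = per_tree.mean(axis=0)
manual_importances = manual_importances / manual_importances.sum()

print("Forest importances:", np.round(forest.feature_importances_, 3))
print("Manual average:    ", np.round(manual_importances, 3))
print("Std across trees:  ", np.round(per_tree.std(axis=0), 3))
```

The standard deviation across trees gives a rough sense of how stable each importance is.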