# Bagging Classifier

**Bagging (Bootstrap Aggregating)** is an ensemble learning technique that improves model stability and accuracy by combining multiple base estimators trained on random subsets of the data.

**Core Concept:** Train multiple instances of the same algorithm on different data samples, then combine their predictions.



### 1. How it Works

#### A. Bootstrap Sampling
* Creates multiple datasets by sampling **with replacement** from the original training data.
* When each bootstrap sample is the same size as the training set, it contains roughly **63.2% unique instances** on average ($1 - 1/e$); the rest are repeats.
* The remaining **~36.8%** are "out-of-bag" (OOB) samples that can be used for internal validation.
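
The 63.2% / 36.8% split can be checked empirically. The following is a minimal NumPy sketch, not part of scikit-learn; the helper name `bootstrap_indices` is hypothetical and the sample count is arbitrary.

```python
import numpy as np


def bootstrap_indices(n_samples, rng):
    """Draw one bootstrap sample (with replacement) and its out-of-bag indices."""
    in_bag = rng.integers(0, n_samples, size=n_samples)
    # Out-of-bag rows are the ones that were never drawn.
    oob = np.setdiff1d(np.arange(n_samples), in_bag)
    return in_bag, oob


rng = np.random.default_rng(0)
n_samples = 10_000
unique_fractions = []
for _ in range(20):
    in_bag, oob = bootstrap_indices(n_samples, rng)
    unique_fractions.append(np.unique(in_bag).size / n_samples)

mean_unique = float(np.mean(unique_fractions))
print(f"unique in-bag fraction: {mean_unique:.3f}")
print(f"out-of-bag fraction:    {1 - mean_unique:.3f}")
print(f"theory (1 - 1/e):       {1 - np.exp(-1):.3f}")
```

The fractions shift if the bootstrap sample is smaller than the training set, for example when `max_samples` is set below `1.0`.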

#### B. Prediction Aggregation
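* **Classification:** Majority voting ("hard voting") or averaging predicted class probabilities ("soft voting"). Scikit-learn's `BaggingClassifier` averages `predict_proba` outputs when the base estimator provides them and falls back to voting otherwise.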
* **Regression:** Average of all predictions.
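
Hard and soft voting can disagree when a minority of estimators is very confident. The sketch below uses made-up probabilities for a single sample; the array values and the helper names `hard_vote` and `soft_vote` are illustrative assumptions, not library functions.

```python
import numpy as np

# Hypothetical class probabilities from 3 estimators for ONE sample.
# Rows are estimators, columns are classes [0, 1].
probas = np.array([
    [0.45, 0.55],
    [0.45, 0.55],
    [0.90, 0.10],
])


def hard_vote(probas):
    """Each estimator votes for its most likely class; the mode wins.

    Ties go to the lowest class index because of how argmax works.
    """
    votes = probas.argmax(axis=1)
    return int(np.bincount(votes, minlength=probas.shape[1]).argmax())


def soft_vote(probas):
    """Average the probabilities across estimators, then pick the top class."""
    return int(probas.mean(axis=0).argmax())


print("hard vote:", hard_vote(probas))  # two of three estimators prefer class 1
print("soft vote:", soft_vote(probas))  # mean probabilities are [0.6, 0.4]
```

Here hard voting returns class 1 while soft voting returns class 0, because the third estimator's confidence outweighs the two narrow votes.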

#### C. Mathematical Formulation
For classification with $n$ estimators $h_1, \dots, h_n$ (hard voting):

$$\hat{y} = \text{mode}\{h_1(x), h_2(x), \dots, h_n(x)\}$$

For regression:

$$\hat{y} = \frac{1}{n} \sum_{i=1}^{n} h_i(x)$$

---

### 2. Types of Bagging Classifiers in Scikit-Learn

#### 1. Standard `BaggingClassifier`
Allows you to bag *any* base estimator (e.g., Logistic Regression, SVM, Decision Trees).

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Base estimator
base_estimator = DecisionTreeClassifier(max_depth=5)

# Bagging classifier
bagging = BaggingClassifier(
    estimator=base_estimator,
    n_estimators=100,
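    max_samples=0.8,        # Draw 80% of the training-set size per estimator
    max_features=0.8,       # Each estimator sees a random 80% of the features
    bootstrap=True,         # Sample rows with replacement
    bootstrap_features=False,  # Sample features without replacement
    oob_score=True,         # Score on out-of-bag samples (requires bootstrap=True)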
    random_state=42
)
```

#### 2. `RandomForestClassifier`

```python
from sklearn.ensemble import RandomForestClassifier

# Random Forest is essentially bagging with decision trees
# PLUS feature randomization at each split
rf = RandomForestClassifier(
    n_estimators=100,
    max_depth=None,
    min_samples_split=2,
    min_samples_leaf=1,
    max_features='sqrt',    # Features considered at each split
    bootstrap=True,
    oob_score=True,
    random_state=42
)
```

#### 3. `ExtraTreesClassifier`

```python
from sklearn.ensemble import ExtraTreesClassifier

# Similar to Random Forest, but split thresholds are drawn at random
et = ExtraTreesClassifier(
    n_estimators=100,
    max_features='sqrt',
    bootstrap=True,         # Default is False (each tree sees the full training set)
    random_state=42
)
```


### How Does Bagging Reduce Overfitting?

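Bagging reduces overfitting by averaging many models, each trained on a different bootstrap sample. Averaging lowers the variance of the prediction while leaving bias largely unchanged. Because each model sees a different sample, their errors are only partly correlated, so aggregation cancels much of the error that is specific to any one model. The less correlated the base estimators are, the larger the reduction.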
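
The effect of correlation can be made concrete. For $n$ estimators whose errors each have variance $\sigma^2$ and pairwise correlation $\rho$, the variance of their average is $\rho\sigma^2 + \frac{1-\rho}{n}\sigma^2$. The simulation below is a sketch that checks this with synthetic Gaussian errors; the function names and the shared-plus-private noise construction are illustrative assumptions, not a model of any particular learner.

```python
import numpy as np


def variance_of_average(n_estimators, rho, sigma=1.0, n_trials=50_000, seed=0):
    """Empirical variance of the mean of correlated estimator errors."""
    rng = np.random.default_rng(seed)
    # A shared noise term induces pairwise correlation rho between estimators.
    shared = rng.normal(size=(n_trials, 1))
    private = rng.normal(size=(n_trials, n_estimators))
    errors = sigma * (np.sqrt(rho) * shared + np.sqrt(1 - rho) * private)
    return float(errors.mean(axis=1).var())


def theoretical_variance(n_estimators, rho, sigma=1.0):
    """rho * sigma^2 + (1 - rho) * sigma^2 / n."""
    return rho * sigma**2 + (1 - rho) * sigma**2 / n_estimators


for rho in (0.0, 0.3, 0.9):
    simulated = variance_of_average(25, rho)
    expected = theoretical_variance(25, rho)
    print(f"rho={rho:.1f}  simulated={simulated:.3f}  theory={expected:.3f}")
```

With uncorrelated errors the variance shrinks roughly as $1/n$, while highly correlated estimators gain little from averaging. This is the motivation for the extra feature randomization in Random Forests.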



### Bagging / Ensemble Model Parameters

Below are the key parameters of `BaggingClassifier` and `BaggingRegressor`. Most also exist in **Random Forest**, sometimes with a different meaning or default:

| Parameter           | Description                                                                                     |
|--------------------|-------------------------------------------------------------------------------------------------|
| `n_estimators`     | Number of base estimators (more estimators usually reduce variance, but improvements diminish after a point). |
| `max_samples`      | Fraction or number of samples to draw for each estimator. The bagging default is `1.0`: draw as many samples as the training set contains (with replacement when `bootstrap=True`). |
| `max_features`     | Fraction or number of features to draw for each estimator (controls diversity among estimators). In Random Forests, the parameter of the same name instead limits the features considered at each split. |
| `bootstrap`        | Whether to sample **with replacement** (`True` for standard bagging, `False` for no replacement). |
| `bootstrap_features` | Whether to sample features with replacement. Specific to `BaggingClassifier` and `BaggingRegressor`; Random Forests decorrelate trees through `max_features` at each split instead. |
| `oob_score`        | Whether to use **out-of-bag samples** for validation. Requires `bootstrap=True`. Helps estimate model performance without a separate validation set. |

---

### Example: Using Bagging with Scikit-Learn

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Load dataset. load_boston was removed in scikit-learn 1.2, so the bundled
# diabetes regression dataset is used here as a stand-in.
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize Bagging Regressor
bag_model = BaggingRegressor(
    estimator=DecisionTreeRegressor(),  # named base_estimator before scikit-learn 1.2
    n_estimators=50,
    max_samples=0.8,
    max_features=0.8,
    bootstrap=True,
    bootstrap_features=False,
    oob_score=True,
    random_state=42
)

# Train
bag_model.fit(X_train, y_train)

# Evaluate (both scores are R^2 for a regressor)
print("OOB R^2:", bag_model.oob_score_)
print("Test R^2:", bag_model.score(X_test, y_test))
```


### Strengths & Weaknesses of Bagging

**Bagging (Bootstrap Aggregating)** improves the stability and accuracy of classification and regression models, mainly by reducing variance. Its main trade-offs are listed below.



#### 1. Advantages
* **Reduces Variance:** Especially effective for high-variance estimators (like deep decision trees). By averaging multiple models, the variance of the final prediction is reduced.
* **Improves Stability:** The ensemble is less sensitive to noise and specific outliers in the training data.
* **Parallelizable:** Since each base estimator is independent of the others, they can be trained simultaneously across multiple cores or machines.
* **OOB (Out-Of-Bag) Validation:** Provides built-in validation. The data left out during the bootstrap sampling (approx. 37%) can be used to evaluate the model without needing a separate validation set.
* **Mitigates Overfitting:** It acts as a regularizer for complex base estimators that would otherwise overfit the training data.

#### 2. Disadvantages
* **Computationally Expensive:** Requires training multiple models (often hundreds), which increases training time compared to a single model.
* **Less Interpretable:** It is much harder to explain the logic of an ensemble of 100 trees than a single Decision Tree.
* **Memory Intensive:** Storing multiple models requires significantly more RAM.
* **May Not Reduce Bias:** If the base estimator has high bias (underfitting), bagging will generally not fix it. It is designed to fix variance, not bias.
* **Can Underfit:** If the base estimator is too simple (e.g., a shallow tree), the aggregate model may also underfit.

---

### Implementation Guide

| When to **Use** Bagging | When to **Avoid** Bagging |
| :--- | :--- |
| **Base estimator has high variance**<br>(e.g., deep Decision Trees, k-NN with low k) | **Base estimator is low-variance**<br>(e.g., Linear Regression, Logistic Regression) |
| **Sufficient computational resources** are available (RAM/CPU) | **Computational resources are limited** |
| **Model stability** is more important than interpretability | **Interpretability** is crucial (need to explain "why") |
| **Parallel computing** is available to speed up training | **Training data is very small** (bootstrapping might reduce unique data per model too much) |