# 🎯 Bagging Classifier

---

## 🎯 **What is it?**
A meta-classifier that fits base classifiers on random subsets of the original dataset and aggregates their predictions, either by majority vote or by averaging predicted class probabilities. It is designed for classification tasks with discrete class labels.

**Key Innovation**: Combines bootstrap sampling with voting to reduce variance and improve generalization.
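
A rough way to see why voting helps is to pretend the base classifiers err independently. This is an idealising assumption: bagged models are trained on overlapping bootstrap samples and are correlated, so real gains are smaller. Under that assumption, the accuracy of a majority vote on a binary problem is a binomial tail probability. The helper name `majority_accuracy` is hypothetical and used only for this sketch.

```python
from math import comb

def majority_accuracy(p, n_voters):
    """P(majority is correct) for n_voters independent voters, each correct with prob p.

    Assumes a binary problem and an odd number of voters (no ties).
    """
    k_min = n_voters // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_voters, k) * p**k * (1 - p) ** (n_voters - k)
        for k in range(k_min, n_voters + 1)
    )

print(round(majority_accuracy(0.7, 3), 3))   # 0.784
print(majority_accuracy(0.7, 101) > 0.99)    # True
```

Three voters at 70% accuracy already reach about 78% under this idealisation; the more correlated the models are, the less of this gain survives.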

---

## 🗳️ **Voting Mechanism**

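### 🔢 **Hard Voting**
> Predict class with most votes. scikit-learn's `BaggingClassifier` uses this only as a fallback, when the base estimator has no `predict_proba`.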

$$\hat{y} = \text{mode}\left(\hat{y}_1, \hat{y}_2, ..., \hat{y}_B\right)$$

**Process:**
1. Each base classifier predicts a class label
2. Count votes for each class
3. Final prediction = class with most votes

**Example:** For 3 classifiers predicting [A, B, A] → Final: A
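
The vote count can be sketched in a few lines of plain Python. `hard_vote` is a hypothetical helper for illustration, not part of scikit-learn.

```python
from collections import Counter

def hard_vote(predictions):
    """Return the label with the most votes; ties go to the label seen first."""
    return Counter(predictions).most_common(1)[0][0]

print(hard_vote(["A", "B", "A"]))  # A
```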

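### 🎲 **Soft Voting**
> Predict class with highest average probability. This is what scikit-learn's `BaggingClassifier` does whenever the base estimator implements `predict_proba`.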

$$\hat{y} = \arg\max_c \frac{1}{B} \sum_{i=1}^{B} P_i(y = c)$$

**Process:**
1. Each base classifier outputs class probabilities
2. Average probabilities across all classifiers
3. Final prediction = class with highest average probability

**Example:** 
- Classifier 1: [0.7, 0.3] for classes [A, B]
- Classifier 2: [0.4, 0.6] for classes [A, B]
- Average: [0.55, 0.45] → Final: A
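
The averaging step can be reproduced with a small sketch. `soft_vote` is a hypothetical helper; each inner list holds one classifier's probabilities in the order given by `classes`.

```python
def soft_vote(probas, classes):
    """Average per-class probabilities across classifiers and pick the largest."""
    n_models = len(probas)
    avg = [sum(p[k] for p in probas) / n_models for k in range(len(classes))]
    best = max(range(len(classes)), key=lambda k: avg[k])
    return classes[best], avg

label, avg = soft_vote([[0.7, 0.3], [0.4, 0.6]], ["A", "B"])
print(label, [round(a, 2) for a in avg])  # A [0.55, 0.45]

# Soft and hard voting can disagree: one confident vote for A
# outweighs two lukewarm votes for B.
label2, _ = soft_vote([[0.9, 0.1], [0.4, 0.6], [0.4, 0.6]], ["A", "B"])
print(label2)  # A, although a hard vote over [A, B, B] would give B
```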

---

## ⚙️ **Algorithm Details**

### 🔄 **Training Process**

| Step | Process | Mathematical Representation |
|------|---------|----------------------------|
| 1️⃣ | **Bootstrap Sampling** | $\mathcal{B}_i = \text{Bootstrap}(\mathcal{D}, n)$ |
| 2️⃣ | **Train Base Classifiers** | $C_i = \text{Train}(\mathcal{B}_i)$ for $i = 1, ..., B$ |
| 3️⃣ | **Store Models** | $\mathcal{M} = \{C_1, C_2, ..., C_B\}$ |

### 🎯 **Prediction Process**

| Step | Process | Formula |
|------|---------|---------|
| 1️⃣ | **Individual Predictions** | $\hat{y}_i = C_i(x)$ for all $i$ |
| 2️⃣ | **Vote Aggregation** | $\text{votes}[c] = \sum_{i=1}^{B} \mathbb{I}(\hat{y}_i = c)$ |
| 3️⃣ | **Final Prediction** | $\hat{y} = \arg\max_c \text{votes}[c]$ |
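
The two tables above can be turned into a toy, dependency-free sketch. The base learner is an assumption made for brevity: a 1-nearest-neighbour rule on one-dimensional data stands in for a decision tree, and every function name here is hypothetical.

```python
import random
from collections import Counter

def bootstrap(data, n, rng):
    """Training step 1: draw n rows from data with replacement."""
    return [rng.choice(data) for _ in range(n)]

def train_1nn(bag):
    """Training step 2: 'train' a 1-nearest-neighbour rule that memorises its bag."""
    def predict(x):
        nearest = min(bag, key=lambda row: abs(row[0] - x))
        return nearest[1]
    return predict

def fit_bagging(data, n_estimators, seed=0):
    """Training step 3: store one model per bootstrap sample."""
    rng = random.Random(seed)
    return [train_1nn(bootstrap(data, len(data), rng)) for _ in range(n_estimators)]

def predict_bagging(models, x):
    """Prediction steps 1-3: individual predictions, vote counts, arg max."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

# Rows are (feature, label) pairs
data = [(0.0, "A"), (0.2, "A"), (0.4, "A"), (0.6, "B"), (0.8, "B"), (1.0, "B")]
models = fit_bagging(data, n_estimators=25)
print(predict_bagging(models, 0.1), predict_bagging(models, 0.9))
```

An odd number of estimators avoids tied votes in the two-class case.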

---

## 🔧 **Sklearn BaggingClassifier Parameters**

### 🎯 **Core Parameters**

| Parameter | Type | Default | Description | Typical Values |
|-----------|------|---------|-------------|----------------|
| `estimator` | estimator | `None` (uses `DecisionTreeClassifier()`) | Base classifier; named `base_estimator` before scikit-learn 1.2 | Any sklearn classifier |
| `n_estimators` | int | 10 | Number of base classifiers | 50-500 |
| `max_samples` | int/float | 1.0 | Samples per bootstrap | 0.5-1.0 |
| `max_features` | int/float | 1.0 | Features per classifier | 0.5-1.0 |
| `bootstrap` | bool | True | Bootstrap sampling | True/False |
| `bootstrap_features` | bool | False | Bootstrap features | True/False |

### 📊 **Advanced Parameters**

| Parameter | Type | Default | Description | Use Case |
|-----------|------|---------|-------------|----------|
| `oob_score` | bool | False | Compute out-of-bag accuracy (requires `bootstrap=True`) | Model validation |
| `warm_start` | bool | False | Reuse previous fit | Incremental training |
| `n_jobs` | int | None | Parallel jobs | -1 for all cores |
| `random_state` | int | None | Random seed | Reproducibility |
| `verbose` | int | 0 | Verbosity level | Debugging |

---

## 💻 **Implementation Examples**

### 🌳 **Basic Bagging with Decision Trees**
```python
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Generate sample data
X, y = make_classification(n_samples=1000, n_features=20, 
                          n_informative=15, n_redundant=5, 
                          n_classes=3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42  # fixed seed for a reproducible split
)

# Create bagging classifier
bagging_clf = BaggingClassifier(
    estimator=DecisionTreeClassifier(max_depth=10),  # `base_estimator` before scikit-learn 1.2
    n_estimators=100,
    max_samples=0.8,        # 80% of training data per bag
    max_features=0.8,       # 80% of features per classifier
    bootstrap=True,
    oob_score=True,
    random_state=42,
    n_jobs=-1
)

# Train and evaluate
bagging_clf.fit(X_train, y_train)
train_score = bagging_clf.score(X_train, y_train)
test_score = bagging_clf.score(X_test, y_test)
oob_score = bagging_clf.oob_score_

print(f"Training Accuracy: {train_score:.4f}")
print(f"Test Accuracy: {test_score:.4f}")
print(f"OOB Score: {oob_score:.4f}")
```

### 🎯 **Bagging with Different Base Classifiers**
```python
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Bagging with SVM
bagging_svm = BaggingClassifier(
    estimator=SVC(probability=True),  # Enable probability for soft voting
    n_estimators=50,
    max_samples=0.7,
    random_state=42
)

# Bagging with Naive Bayes
bagging_nb = BaggingClassifier(
    estimator=GaussianNB(),
    n_estimators=100,
    max_samples=0.9,
    random_state=42
)

# Bagging with k-NN
bagging_knn = BaggingClassifier(
    estimator=KNeighborsClassifier(n_neighbors=5),
    n_estimators=50,
    max_samples=0.8,
    random_state=42
)
```

---

## 📊 **Probability Estimation**

### 🎲 **Class Probability Calculation**
```python
import numpy as np

# Get class probabilities
probabilities = bagging_clf.predict_proba(X_test)

# Manual calculation of ensemble probabilities
def manual_ensemble_proba(bagging_clf, X):
    """Calculate ensemble probabilities manually"""
    n_classes = len(bagging_clf.classes_)
    n_samples = X.shape[0]
    ensemble_proba = np.zeros((n_samples, n_classes))

    # Pair each estimator with the feature indices it was trained on
    for estimator, features in zip(bagging_clf.estimators_,
                                   bagging_clf.estimators_features_):
        # Get probabilities for selected features
        proba = estimator.predict_proba(X[:, features])
        # Base estimators are fit on label indices, so estimator.classes_
        # gives the columns to fill (a bag may have missed a class)
        ensemble_proba[:, estimator.classes_] += proba

    # Average probabilities
    return ensemble_proba / len(bagging_clf.estimators_)
```

### 🎯 **Decision Function**
```python
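# decision_function exists only when the base estimator implements it
# (e.g. SVC); the tree-based bagging_clf above does not provide it.
bagging_svm.fit(X_train, y_train)
decision_scores = bagging_svm.decision_function(X_test)

# The scores are averaged base-estimator decision values, not probabilities;
# call predict_proba when class probabilities are needed.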
```

---

## 🔍 **Out-of-Bag (OOB) Analysis**

### 📊 **OOB Score Calculation**
```python
# Enable OOB scoring
bagging_clf = BaggingClassifier(
    n_estimators=100,
    oob_score=True,
    random_state=42
)
bagging_clf.fit(X_train, y_train)

# Access OOB score
print(f"OOB Accuracy: {bagging_clf.oob_score_:.4f}")

# Manual OOB calculation (majority vote; sklearn's oob_score_ averages
# probabilities instead, so the two numbers can differ slightly)
import numpy as np

def calculate_oob_score(bagging_clf, X, y):
    """Calculate OOB accuracy manually"""
    n_samples = X.shape[0]
    classes = bagging_clf.classes_
    votes = np.zeros((n_samples, len(classes)))

    for estimator, samples, features in zip(
        bagging_clf.estimators_,
        bagging_clf.estimators_samples_,
        bagging_clf.estimators_features_,
    ):
        # estimators_samples_ holds in-bag indices; invert them into an OOB mask
        oob_mask = np.ones(n_samples, dtype=bool)
        oob_mask[samples] = False

        if np.any(oob_mask):
            # Predict on OOB samples using only this estimator's features
            oob_pred = estimator.predict(X[oob_mask][:, features])
            # Base estimators predict label indices into classes_; add one vote each
            votes[np.flatnonzero(oob_mask), oob_pred.astype(int)] += 1

    # Calculate accuracy for samples that were OOB at least once
    valid_oob = votes.sum(axis=1) > 0
    oob_labels = classes[np.argmax(votes[valid_oob], axis=1)]
    return np.mean(y[valid_oob] == oob_labels)
```

---

## 📈 **Performance Evaluation**

### 🎯 **Classification Metrics**
```python
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

# Predictions
y_pred = bagging_clf.predict(X_test)
y_proba = bagging_clf.predict_proba(X_test)

# Classification report
print(classification_report(y_test, y_pred))

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)

# ROC-AUC for multiclass
roc_auc = roc_auc_score(y_test, y_proba, multi_class='ovr')
print(f"ROC-AUC Score: {roc_auc:.4f}")
```

### 📊 **Learning Curves**
```python
def plot_bagging_learning_curves(n_estimators_range):
    """Collect train, test and OOB accuracy for each number of estimators.

    Despite the name, this only returns the scores; plotting is left to the caller.
    """
    train_scores = []
    test_scores = []
    oob_scores = []
    
    for n_est in n_estimators_range:
        bagging = BaggingClassifier(
            n_estimators=n_est,
            oob_score=True,
            random_state=42
        )
        bagging.fit(X_train, y_train)
        
        train_scores.append(bagging.score(X_train, y_train))
        test_scores.append(bagging.score(X_test, y_test))
        oob_scores.append(bagging.oob_score_)
    
    return train_scores, test_scores, oob_scores

# Usage
n_estimators_range = range(10, 200, 20)
train_scores, test_scores, oob_scores = plot_bagging_learning_curves(n_estimators_range)
```

---

## 🎪 **Real-World Use Cases**

### 🏥 **Medical Diagnosis**
```python
# Multi-class disease classification
medical_bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(
        max_depth=8,
        min_samples_split=10,
        min_samples_leaf=5
    ),
    n_estimators=200,
    max_samples=0.8,
    max_features=0.9,  # Keep most medical features
    oob_score=True,
    random_state=42
)
```

### 📧 **Email Spam Detection**
```python
# Binary classification for spam detection
from sklearn.naive_bayes import MultinomialNB

spam_bagging = BaggingClassifier(
    estimator=MultinomialNB(),  # Good for count-based text features
    n_estimators=100,
    max_samples=0.7,
    bootstrap=True,
    oob_score=True,
    random_state=42
)
```

### 🖼️ **Image Classification**
```python
# Multi-class image classification
image_bagging = BaggingClassifier(
    estimator=SVC(
        kernel='rbf',
        probability=True,
        gamma='scale'
    ),
    n_estimators=50,
    max_samples=0.8,
    max_features=0.8,  # Random feature subsets
    n_jobs=-1,
    random_state=42
)
```

---

## 🔧 **Hyperparameter Tuning**

### 🎯 **Grid Search Example**
```python
from sklearn.model_selection import GridSearchCV

# Define parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_samples': [0.5, 0.7, 0.9, 1.0],
    'max_features': [0.5, 0.7, 0.9, 1.0],
    'estimator__max_depth': [5, 10, 15, None],
    'estimator__min_samples_split': [2, 5, 10]
}

# Grid search
grid_search = GridSearchCV(
    BaggingClassifier(
        estimator=DecisionTreeClassifier(),
        random_state=42
    ),
    param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)

grid_search.fit(X_train, y_train)
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_:.4f}")
```

### 🎲 **Randomized Search**
```python
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform

# Define parameter distributions
param_dist = {
    'n_estimators': randint(50, 300),
    'max_samples': uniform(0.5, 0.5),  # 0.5 to 1.0
    'max_features': uniform(0.5, 0.5), # 0.5 to 1.0
    'estimator__max_depth': [5, 10, 15, 20, None],
    'estimator__min_samples_split': randint(2, 20)
}

# Randomized search
random_search = RandomizedSearchCV(
    BaggingClassifier(
        estimator=DecisionTreeClassifier(),
        random_state=42
    ),
    param_dist,
    n_iter=100,
    cv=5,
    scoring='accuracy',
    random_state=42,
    n_jobs=-1
)
```

---

## 📊 **Comparison with Other Classifiers**

| Classifier | Accuracy | Speed | Interpretability | Overfitting Risk |
|------------|----------|-------|------------------|------------------|
| **Single Decision Tree** | Medium | Fast | High | High |
| **Bagging Classifier** | High | Medium | Medium | Low |
| **Random Forest** | High | Medium | Medium | Low |
| **Gradient Boosting** | Very High | Slow | Low | Medium |
| **SVM** | High | Medium | Low | Medium |

---

## 📋 **Best Practices**

| Practice | Recommendation | Reason |
|----------|----------------|--------|
| **Base Estimator Choice** | High-variance models (deep trees) | Bagging reduces variance effectively |
| **Number of Estimators** | 100-300 for most cases | Balance between performance and speed |
| **Bootstrap Sample Size** | 0.7-1.0 | Smaller for very large datasets |
| **Feature Sampling** | 0.5-0.8 for high-dimensional data | Reduces correlation between models |
| **OOB Validation** | Always enable when possible | Free model validation |
| **Parallel Processing** | Use `n_jobs=-1` | Significant speedup |

---

## ⚠️ **Common Pitfalls**

```
❌ Using stable, low-variance base estimators
❌ Too few estimators (< 50)
❌ Not using OOB score for validation
❌ Ignoring class imbalance
❌ Not tuning base estimator parameters
❌ Bagging base estimators that are already ensembles
```

---

> **💡 Pro Tip**: Bagging works best with unstable classifiers like decision trees. For stable classifiers, consider using different ensemble methods or focus on feature engineering!

In [1]:
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

In [2]:
X,y = make_classification(n_samples=10000, n_features=10,n_informative=3)

In [3]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=42)

In [4]:
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train,y_train)
y_pred = dt.predict(X_test)

print("Decision Tree accuracy",accuracy_score(y_test,y_pred))

Decision Tree accuracy 0.946


## Bagging

In [10]:
bag = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=500,
    max_samples=0.5,
    bootstrap=True,
    random_state=42,
    n_jobs=-1
)

In [11]:
bag.fit(X_train,y_train)

In [12]:
y_pred = bag.predict(X_test)

In [13]:
accuracy_score(y_test,y_pred)

0.9645

# Bagging using SVM

In [15]:
bag = BaggingClassifier(
    estimator=SVC(),
    n_estimators=500,
    max_samples=0.25,
    bootstrap=True,
    random_state=42,
    n_jobs=-1
)

In [16]:
bag.fit(X_train,y_train)
y_pred = bag.predict(X_test)
print("Bagging using SVM",accuracy_score(y_test,y_pred))

Bagging using SVM 0.961


# Pasting

In [18]:
bag = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=500,
    max_samples=0.25,
    bootstrap=False,
    random_state=42,
    verbose = 1,
    n_jobs=-1
)

In [19]:
bag.fit(X_train,y_train)
y_pred = bag.predict(X_test)
print("Pasting classifier",accuracy_score(y_test,y_pred))

[Parallel(n_jobs=16)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=16)]: Done   2 out of  16 | elapsed:    0.9s remaining:    6.7s


Pasting classifier 0.961


[Parallel(n_jobs=16)]: Done  16 out of  16 | elapsed:    1.2s finished
[Parallel(n_jobs=16)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=16)]: Done   2 out of  16 | elapsed:    0.0s remaining:    0.2s
[Parallel(n_jobs=16)]: Done  16 out of  16 | elapsed:    0.0s finished


# Random Subspaces

In [21]:
bag = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=500,
    max_samples=1.0,
    bootstrap=False,
    max_features=0.5,
    bootstrap_features=True,
    random_state=42,
    n_jobs=-1
)

In [22]:
bag.fit(X_train,y_train)
y_pred = bag.predict(X_test)
print("Random Subspaces classifier",accuracy_score(y_test,y_pred))

Random Subspaces classifier 0.959


# Random Patches

In [23]:
bag = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=500,
    max_samples=0.25,
    bootstrap=True,
    max_features=0.5,
    bootstrap_features=True,
    random_state=42,
    n_jobs=-1
)

In [24]:
bag.fit(X_train,y_train)
y_pred = bag.predict(X_test)
print("Random Patches classifier",accuracy_score(y_test,y_pred))

Random Patches classifier 0.9555
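
The four variants run above (bagging, pasting, random subspaces, random patches) differ only in how rows and columns are drawn for each base estimator. The sketch below imitates that draw with the standard library. The function names are hypothetical, and the rounding of fractional sizes is an assumption that may not match scikit-learn exactly.

```python
import random

def draw_indices(n_items, fraction, replace, rng):
    """Draw a fraction of the indices, with or without replacement."""
    k = max(1, int(fraction * n_items))
    if replace:
        return [rng.randrange(n_items) for _ in range(k)]
    return rng.sample(range(n_items), k)

def draw_patch(n_rows, n_cols, max_samples, bootstrap,
               max_features, bootstrap_features, seed=0):
    """Return the (row indices, column indices) one base estimator would see."""
    rng = random.Random(seed)
    rows = draw_indices(n_rows, max_samples, bootstrap, rng)
    cols = draw_indices(n_cols, max_features, bootstrap_features, rng)
    return rows, cols

# Pasting: 25% of the rows without replacement, all columns
rows, cols = draw_patch(100, 10, 0.25, False, 1.0, False)
print(len(rows), len(set(rows)), len(cols))  # 25 25 10

# Random patches: bootstrap rows and columns, so repeats are possible
rows, cols = draw_patch(100, 10, 0.25, True, 0.5, True)
print(len(rows), len(cols))  # 25 5
```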


# 🏷️ Out-of-Bag (OOB) Score

## 📌 What is OOB Score?

The **Out-of-Bag (OOB) score** is a method of **internal validation** used primarily with **bagging algorithms** like **Random Forest**.

It provides a way to estimate the model's performance **without needing a separate validation set or cross-validation**.

---

## 🎯 Core Idea

In **bagging**, each model is trained on a **bootstrap sample** (sampling with replacement), meaning:
- Some data points are used multiple times
- **~37% of the data is left out** from each bootstrap sample on average

These **left-out points** are called **Out-of-Bag samples**.

🧠 **Key Insight**:  
> We can use OOB samples as a **test set** for that specific model!

---

## 🧮 Why ~37% OOB?

For a dataset with $n$ samples:

Each data point has a probability of **not being selected** in one draw:
$$
P(\text{not selected}) = 1 - \frac{1}{n}
$$

Over $n$ draws:
$$
P(\text{never selected}) = \left(1 - \frac{1}{n}\right)^n \approx e^{-1} \approx 0.368
$$

So, **~36.8%** of data is OOB for each tree.
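
The limit can be checked numerically with a short simulation. This is a standard-library sketch; `oob_fraction` is a hypothetical helper that draws one bootstrap sample of size $n$ and reports the share of points never drawn.

```python
import math
import random

def oob_fraction(n, seed=0):
    """Fraction of n points left out of one bootstrap sample of size n."""
    rng = random.Random(seed)
    in_bag = {rng.randrange(n) for _ in range(n)}  # distinct indices drawn
    return 1 - len(in_bag) / n

print(round((1 - 1 / 1000) ** 1000, 4))  # exact value for n = 1000
print(round(math.exp(-1), 4))            # limiting value e^-1
print(round(oob_fraction(10_000), 3))    # one simulated bootstrap sample
```

The derivation assumes the bootstrap sample has the same size as the dataset; with a smaller `max_samples`, a larger share of the data is out-of-bag for each model.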

---

## 🧪 How OOB Score is Computed

1. For each sample $x_i$, collect predictions from all trees **where $x_i$ was OOB**
2. Average those predictions (for regression) or take a majority vote (for classification)
3. Compare with true label $y_i$
4. Compute accuracy or error over the entire dataset
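
The four steps can be sketched end to end without any library. As an assumption for brevity, a 1-nearest-neighbour rule on one-dimensional toy data stands in for the trees, and `oob_accuracy` is a hypothetical name.

```python
import random
from collections import Counter

def oob_accuracy(data, n_estimators=50, seed=0):
    """OOB accuracy of bagged 1-NN rules on (x, label) rows."""
    rng = random.Random(seed)
    n = len(data)
    votes = [Counter() for _ in range(n)]  # OOB votes collected per sample

    for _ in range(n_estimators):
        in_bag = [rng.randrange(n) for _ in range(n)]  # bootstrap indices
        bag = [data[i] for i in in_bag]
        oob = set(range(n)) - set(in_bag)              # rows this model never saw
        for i in oob:
            x = data[i][0]
            nearest = min(bag, key=lambda row: abs(row[0] - x))
            votes[i][nearest[1]] += 1                  # step 1: collect OOB predictions

    scored = [i for i in range(n) if votes[i]]         # rows that were OOB at least once
    correct = sum(
        votes[i].most_common(1)[0][0] == data[i][1]    # steps 2-3: vote, then compare
        for i in scored
    )
    return correct / len(scored)                       # step 4: accuracy

data = [(x / 20, "A" if x < 10 else "B") for x in range(20)]
score = oob_accuracy(data)
print(f"OOB accuracy: {score:.2f}")
```

Each row is scored only by models whose bootstrap sample did not contain it, which is what makes the estimate behave like held-out accuracy.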

---

## 🛠 Example (Classification)

Let’s say we train a Random Forest with 100 trees on 1000 samples.

- Each tree uses a bootstrap sample of 1000 draws, containing ~632 distinct samples
- Each sample is OOB in ~37 of the 100 trees

Prediction for sample $x_i$ is the **majority vote** from the ~37 trees where it was OOB.

OOB score is the overall **accuracy** on all such predictions.

---

## ✅ Advantages of OOB Score

| Benefit                             | Explanation                            |
|-------------------------------------|----------------------------------------|
| No separate validation set needed   | Saves data for training                |
| Less computation than cross-validation | More efficient                        |
| Useful for hyperparameter tuning    | e.g., `n_estimators`, `max_depth`     |
| Built-in feature in Random Forest   | `oob_score=True` in scikit-learn      |

---

## ❌ Limitations

- Only available for **bagging-based** ensembles that sample rows with replacement (`bootstrap=True`), so not for pasting
- Less reliable for **very small datasets**
- May be **biased** if trees are too shallow or each sample is OOB for only a few models

---

## 🧪 Scikit-Learn Example

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, oob_score=True, bootstrap=True)
rf.fit(X_train, y_train)

print("OOB Score:", rf.oob_score_)
```


# OOB Score

In [26]:
bag = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=500,
    max_samples=0.25,
    bootstrap=True,
    oob_score=True,
    random_state=42,
    n_jobs=-1
)

In [27]:
bag.fit(X_train,y_train)

In [28]:
bag.oob_score_

0.96625

In [29]:
y_pred = bag.predict(X_test)
print("Accuracy",accuracy_score(y_test,y_pred))

Accuracy 0.962


# Bagging Tips

- Bagging generally gives better results than pasting
- Good results often come with row sampling around the 25% to 50% mark
- Random patches and random subspaces are worth trying on high-dimensional data
- To find good hyperparameter values, use `GridSearchCV` or `RandomizedSearchCV`

# Applying GridSearchCV

In [30]:
from sklearn.model_selection import GridSearchCV

In [31]:
parameters = {
    'n_estimators': [50,100,500], 
    'max_samples': [0.1,0.4,0.7,1.0],
    'bootstrap' : [True,False],
    'max_features' : [0.1,0.4,0.7,1.0]
    }

In [36]:
search = GridSearchCV(BaggingClassifier(), parameters, cv=5, n_jobs=-1, verbose=2)

In [None]:
search.fit(X_train,y_train)

Fitting 5 folds for each of 96 candidates, totalling 480 fits
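
The candidate count in that log line follows directly from the grid: it is the product of the list lengths, multiplied by the number of folds. A standard-library check, using the same `parameters` dictionary as the cell above:

```python
from itertools import product

parameters = {
    'n_estimators': [50, 100, 500],
    'max_samples': [0.1, 0.4, 0.7, 1.0],
    'bootstrap': [True, False],
    'max_features': [0.1, 0.4, 0.7, 1.0],
}
cv = 5

# One candidate per combination of values: 3 * 4 * 2 * 4
n_candidates = len(list(product(*parameters.values())))
print(n_candidates, n_candidates * cv)  # 96 480
```

Adding another parameter list multiplies the count again, which is why randomized search is often preferred for larger grids.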


In [None]:
search.best_score_