# Theory: The "Wisdom of Crowds"

**Imagine we are a panel of doctors diagnosing a patient.**

* **Doctor A:** Expert in X-rays but misses details in blood work.
* **Doctor B:** Expert in blood work but isn't great at X-rays.
* **Doctor C:** A generalist who is decent at both but not an expert.

If we rely on just one doctor, we might misdiagnose. But if we take a vote among all three, the experts cover each other's blind spots.

---

### In Machine Learning: Ensemble Learning
A **Voting Model** is a meta-model that combines the predictions of several base models to improve generalizability and robustness compared to a single model.

## Mechanism: How it Votes

There are two main strategies for voting:

### A. Hard Voting (Majority Rule)
Each classifier predicts a class label, and we count the labels. The class with the most votes wins.

> **Analogy:** "2 doctors say 'Sick', 1 says 'Healthy'. The diagnosis is 'Sick'."

* **Best for:** Classifiers that output distinct labels (or when we don't trust the probabilities).
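
As a minimal sketch of majority rule, the snippet below counts labels with the standard library. The helper name `hard_vote` and the sample votes are illustrative, not part of scikit-learn. Ties here go to the label seen first, which may differ from how a library breaks ties.

```python
from collections import Counter

def hard_vote(labels):
    """Return the label predicted by the most voters."""
    # most_common(1) returns [(label, count)] for the top label
    return Counter(labels).most_common(1)[0][0]

votes = ["Sick", "Sick", "Healthy"]  # one label per classifier
diagnosis = hard_vote(votes)
print(diagnosis)
```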

### B. Soft Voting (Weighted Average Probabilities)
We take the predicted probabilities (confidence scores) for each class from every classifier and average them, optionally weighting each classifier. The class with the highest average probability wins.

> **Analogy:** Doctor A says "90% Sick", Doctor B says "60% Sick", Doctor C says "40% Healthy" (which is 60% Sick). The average confidence for "Sick" is (90 + 60 + 60) / 3 = 70%, so the diagnosis is "Sick".

* **Best for:** Calibrated classifiers (models that output reliable probabilities, like Logistic Regression). This often performs better than hard voting because it gives more weight to highly confident votes.

$$
\hat{y} = \arg\max_i \sum_{j=1}^{m} w_j P_j(y=i | \mathbf{x})
$$

Where:
* $w_j$ is the weight of model $j$ (all weights are equal when none are given)
* $P_j$ is the probability predicted by model $j$ for class $i$
* $m$ is the number of models
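
The formula can be checked by hand with the three doctors from the analogy. This is a sketch using NumPy only. The `soft_vote` helper and the weights `[2, 1, 1]` are assumptions made for illustration. `np.average` divides by the sum of the weights, which does not change which class wins the argmax.

```python
import numpy as np

# Rows: doctors A, B, C. Columns: P(Healthy), P(Sick).
probas = np.array([
    [0.10, 0.90],
    [0.40, 0.60],
    [0.40, 0.60],
])
classes = np.array(["Healthy", "Sick"])

def soft_vote(probas, weights=None):
    """Weighted average of class probabilities, then argmax over classes."""
    avg = np.average(probas, axis=0, weights=weights)
    return avg, classes[np.argmax(avg)]

avg_equal, label_equal = soft_vote(probas)                            # equal trust
avg_weighted, label_weighted = soft_vote(probas, weights=[2, 1, 1])   # trust Doctor A twice as much
print(avg_equal, label_equal)
print(avg_weighted, label_weighted)
```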

# Applying the Model: Basic Implementation

Let's try this on a real dataset. We will use the Scikit-Learn Breast Cancer dataset, a binary classification problem that is convenient for comparing a single model against an ensemble.

We will combine three different "experts":

1. Logistic Regression (Linear expert)

2. Decision Tree (Non-linear expert)

3. Support Vector Machine (Distance-based expert)

In [1]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# 1. Load Data
data = load_breast_cancer()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 2. Define our "Doctors" (Base Estimators)
# Note: probability=True is required for SVC if we want to use Soft Voting later
clf1 = LogisticRegression(solver='liblinear', random_state=42)
clf2 = DecisionTreeClassifier(random_state=42)
clf3 = SVC(probability=True, random_state=42)

# 3. Define the Voting Model (Hard Voting)
voting_clf = VotingClassifier(
    estimators=[('lr', clf1), ('dt', clf2), ('svc', clf3)],
    voting='hard'
)

# 4. Train
voting_clf.fit(X_train, y_train)

# 5. Predict
y_pred = voting_clf.predict(X_test)
print(f"Voting Classifier Accuracy: {accuracy_score(y_test, y_pred):.4f}")

Voting Classifier Accuracy: 0.9649


### Main Parameters Explained

* **`estimators`**:
    A list of tuples, e.g., `[('name1', model1), ('name2', model2)]`. This stores the diverse models we want to vote on.

* **`voting`**:
    * `'hard'`: Majority rule (default).
    * `'soft'`: Average probabilities.
        > **Crucial:** All base models must support `predict_proba()` for this to work.

* **`weights`**:
    *(Optional)* A list of floats, e.g., `[1, 2, 1]`. Use this if we trust one model (e.g., the Decision Tree) twice as much as the others.

## Advanced Pipeline & Hyperparameter Tuning
Now, let's build a production-grade pipeline.

### Why Scaling?
**Yes, data scaling is critical here.**

* **The Reason:** Our ensemble includes **Logistic Regression** and **SVM**. These models compute distances or regularized weights that depend on feature magnitude.
* **The Example:** If one feature is "Income ($100,000$)" and another is "Age ($50$)", the model will be biased toward Income.
* **The Exception:** A **Decision Tree** doesn't care about scale, but the SVM and Logistic Regression inside the **Voting Classifier** can perform noticeably worse without it, which drags down the whole vote.
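
To make the Income vs. Age point concrete, here is a small sketch with three made-up people. On raw values the income column decides who is "closest"; after z-scoring each column (what `StandardScaler` does), age counts too. All numbers and helper names are hypothetical.

```python
from math import dist
from statistics import mean, pstdev

# Hypothetical people: (income in dollars, age in years)
people = {"A": (100_000, 25), "B": (100_800, 65), "C": (101_000, 26)}

def standardize(rows):
    """Z-score each column: subtract the column mean, divide by the column std."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sigmas = [pstdev(c) for c in cols]
    return [tuple((v - m) / s for v, m, s in zip(row, mus, sigmas)) for row in rows]

def nearest_to_a(points):
    """Name of the point (B or C) with the smallest Euclidean distance to A."""
    return min(("B", "C"), key=lambda name: dist(points["A"], points[name]))

scaled = dict(zip(people, standardize(list(people.values()))))

raw_nearest = nearest_to_a(people)      # income dominates: a 40-year age gap barely registers
scaled_nearest = nearest_to_a(scaled)   # both features now contribute on the same scale
print(raw_nearest, scaled_nearest)
```

On raw values, B (40 years older) looks closer to A than C does; after scaling, C is the nearer neighbor.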



---

### Implementation Details
We will use a different dataset: **The Wine Dataset** (Multiclass classification). We will tune the hyperparameters of the internal models using `GridSearchCV`.

In [2]:
from sklearn.datasets import load_wine
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV

# 1. Load Data
wine = load_wine()
X_wine, y_wine = wine.data, wine.target

# 2. Build Pipeline
# We need scaling first, then the voting classifier
clf1 = LogisticRegression(solver='liblinear', random_state=42)
clf2 = DecisionTreeClassifier(random_state=42)
clf3 = SVC(probability=True, random_state=42)

voting_clf = VotingClassifier(
    estimators=[('lr', clf1), ('dt', clf2), ('svc', clf3)],
    voting='soft'
)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('voter', voting_clf)
])

# 3. Hyperparameter Tuning
# NOTE: To tune parameters of models INSIDE the voter, we use double underscores.
# Format: <step_name>__<estimator_name>__<parameter>
param_grid = {
    'voter__lr__C': [0.1, 1.0, 10],       # Tuning C for Logistic Regression
    'voter__dt__max_depth': [3, 5, None], # Tuning depth for Decision Tree
    'voter__weights': [[1, 1, 1], [2, 1, 1], [1, 2, 1]] # Tuning influence of models
}

# Grid Search with Cross Validation
grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_wine, y_wine)

print(f"Best Params: {grid_search.best_params_}")
print(f"Best CV Score: {grid_search.best_score_:.4f}")

Best Params: {'voter__dt__max_depth': 3, 'voter__lr__C': 1.0, 'voter__weights': [1, 1, 1]}
Best CV Score: 0.9833
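
If a double-underscore name is misspelled, the grid search raises an error, so it can help to list the valid names first. The sketch below builds a small pipeline (the names `pipe` and `voter` are only for illustration) and inspects `get_params()`, which returns nested parameter names as well.

```python
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

voter = VotingClassifier(
    estimators=[("lr", LogisticRegression()), ("dt", DecisionTreeClassifier())],
    voting="soft",
)
pipe = Pipeline([("scaler", StandardScaler()), ("voter", voter)])

# get_params() returns every settable name, including nested ones
param_names = sorted(pipe.get_params())
lr_params = [name for name in param_names if name.startswith("voter__lr__")]
print(lr_params[:5])
```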


## Critical Considerations

### 1. Bias-Variance Tradeoff
* **Variance Reduction:** This is the primary superpower of Voting Models. If a Decision Tree has high variance (overfits) and a Logistic Regression has high bias (underfits), averaging them tends to smooth out the noise. The ensemble usually has **lower variance** than single models.
* **Bias:** The ensemble bias is usually similar to the average bias of the base models. It won't magically fix a problem if all our models are underfitting (high bias).



---

### 2. When does this fail?
* **Correlated Errors:** If all our models make the same mistakes (e.g., they all struggle with the same specific edge case), voting changes nothing. Three wrong votes are still a wrong decision.
* **Overwhelmingly Bad Model:** If we have two terrible models and one great model, the two bad ones outvote the expert whenever they agree. Under hard voting with equal weights, the expert's confidence does not count. (*Note: Always check individual performance before ensemble inclusion.*)
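
Both failure modes can be quantified with a short binomial calculation. This sketch assumes a binary problem, an odd number of voters, and fully independent errors, which real models rarely achieve; the helper name is illustrative.

```python
from math import comb

def majority_accuracy(p, n_voters):
    """P(majority is right) for n_voters independent voters, each right with prob p."""
    need = n_voters // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_voters, k) * p**k * (1 - p) ** (n_voters - k)
        for k in range(need, n_voters + 1)
    )

independent = majority_accuracy(0.7, 3)    # independent errors: the vote beats each voter
correlated = 0.7                           # identical errors: the vote equals any single model
weak_majority = majority_accuracy(0.4, 3)  # voters worse than chance: the vote gets worse
print(f"{independent:.3f} {correlated:.3f} {weak_majority:.3f}")
```

Three independent 70% models reach 78.4% as a committee, three identical ones stay at 70%, and three 40% models drop to 35.2%.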

---

### 3. When to use?
* **Competition/High Stakes:** When every 0.1% accuracy counts (Kaggle competitions often use this).
* **Ambiguous Data:** When the decision boundary is fuzzy and different models "see" the data differently.

---

### 4. Metric & Interpretability
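* **Metric:** Choose the metric to match the problem: **Accuracy** when classes are balanced, **F1-Score** when they are imbalanced or when false positives and false negatives have different costs.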
* **Tradeoff:** Voting Classifiers are harder to interpret ("Black Box") than a single Decision Tree.

In [3]:
# ==========================================
# 1. IMPORTS & SETUP
# ==========================================
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# Base Models
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

# Ensemble
from sklearn.ensemble import VotingClassifier

# Metrics
from sklearn.metrics import classification_report, accuracy_score

# ==========================================
# 2. DATA PREPARATION
# ==========================================
# We use Breast Cancer dataset: 30 features, binary target (Malignant/Benign)
data = load_breast_cancer()
X = data.data
y = data.target

# Split data (Essential to avoid data leakage during testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Dataset Shape: {X.shape}")
print(f"Training Samples: {X_train.shape[0]}, Test Samples: {X_test.shape[0]}")

# ==========================================
# 3. DEFINING BASE LEARNERS
# ==========================================
# We choose diverse models to maximize ensemble benefit.
# 1. Logistic Regression (Linear)
# 2. KNN (Distance based - requires scaling)
# 3. SVC (Complex boundaries - requires scaling)

# Note: We do NOT scale X_train here manually.
# We will put scaling inside the Pipeline to prevent leakage during CV.

clf_log = LogisticRegression(solver='liblinear', random_state=42)
clf_knn = KNeighborsClassifier()
clf_svc = SVC(probability=True, random_state=42) # probability=True needed for soft voting

# ==========================================
# 4. BUILDING THE VOTING PIPELINE
# ==========================================
# We create the voting classifier
voting_clf = VotingClassifier(
    estimators=[
        ('lr', clf_log),
        ('knn', clf_knn),
        ('svc', clf_svc)
    ],
    voting='soft' # Soft voting averages predicted probabilities instead of counting labels
)

# We wrap it in a pipeline to handle scaling automatically
pipeline = Pipeline([
    ('scaler', StandardScaler()),  # Scale data first!
    ('voting', voting_clf)         # Then vote
])

# ==========================================
# 5. HYPERPARAMETER TUNING (GridSearch)
# ==========================================
# We want to tune the parameters of the base models *through* the voting classifier.
# Syntax: voting__<estimator_name>__<param>

params = {
    'voting__lr__C': [0.1, 1.0, 10],      # Regularization for LogReg
    'voting__knn__n_neighbors': [3, 5, 7], # Neighbors for KNN
    'voting__weights': [[1, 1, 1], [2, 1, 1], [1, 2, 1]] # Weighting strategy
}

print("\nStarting Hyperparameter Tuning (this may take a moment)...")
grid = GridSearchCV(pipeline, params, cv=5, scoring='accuracy', n_jobs=-1)
grid.fit(X_train, y_train)

print(f"\nBest Parameters Found:\n{grid.best_params_}")
print(f"Best CV Accuracy: {grid.best_score_:.4f}")

# ==========================================
# 6. FINAL EVALUATION
# ==========================================
# Predict on the held-out test set
best_model = grid.best_estimator_
y_pred = best_model.predict(X_test)

print("\n--- Final Test Set Results ---")
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=data.target_names))

# ==========================================
# 7. HOW IT LEARNS (INTERPRETATION)
# ==========================================
# To see how the voting actually happened for a specific sample:
sample_id = 0
sample_data = X_test[sample_id].reshape(1, -1)
true_label = y_test[sample_id]

# Access the voter inside the pipeline
voter = best_model.named_steps['voting']
scaler = best_model.named_steps['scaler']
scaled_sample = scaler.transform(sample_data)

# Get probabilities from each internal estimator
print(f"\n--- Anatomy of a Vote (Sample {sample_id}) ---")
for name, method in voter.named_estimators_.items():
    prob = method.predict_proba(scaled_sample)[0]
    print(f"Model [{name}] says: Benign: {prob[1]:.2f}, Malignant: {prob[0]:.2f}")

final_prob = best_model.predict_proba(sample_data)[0]
print(f"--> FINAL Weighted Vote: Benign: {final_prob[1]:.2f}, Malignant: {final_prob[0]:.2f}")
print(f"True Label: {'Benign' if true_label==1 else 'Malignant'}")

Dataset Shape: (569, 30)
Training Samples: 455, Test Samples: 114

Starting Hyperparameter Tuning (this may take a moment)...

Best Parameters Found:
{'voting__knn__n_neighbors': 5, 'voting__lr__C': 10, 'voting__weights': [1, 1, 1]}
Best CV Accuracy: 0.9802

--- Final Test Set Results ---
Accuracy: 0.9649

Classification Report:
              precision    recall  f1-score   support

   malignant       0.95      0.95      0.95        43
      benign       0.97      0.97      0.97        71

    accuracy                           0.96       114
   macro avg       0.96      0.96      0.96       114
weighted avg       0.96      0.96      0.96       114


--- Anatomy of a Vote (Sample 0) ---
Model [lr] says: Benign: 0.89, Malignant: 0.11
Model [knn] says: Benign: 1.00, Malignant: 0.00
Model [svc] says: Benign: 0.98, Malignant: 0.02
--> FINAL Weighted Vote: Benign: 0.96, Malignant: 0.04
True Label: Benign


# Theory: Averaging "Opinions"

In classification, we voted on a label. In regression, we cannot "vote" because the outputs are continuous numbers (e.g., House Price). Instead, we take the **Average**.

**Imagine we are appraising a house:**
* **Agent A (Linear Model):** says $500k
* **Agent B (Decision Tree):** says $550k
* **Agent C (SVR):** says $480k

The Voting Regressor calculates the average prediction:
$$\frac{500 + 550 + 480}{3} = \$510k$$

By averaging these diverse predictions, we smooth out the error. If Agent B overestimates and Agent C underestimates, they cancel each other out, leaving us closer to the truth.

### The Mathematical Formula
$$\hat{y} = \frac{1}{\sum_{j=1}^{m} w_j} \sum_{j=1}^{m} w_j f_j(\mathbf{x})$$

**Where:**
* $w_j$ is the weight of the $j$-th model.
* $f_j(\mathbf{x})$ is the prediction of that model.
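
Applied to the three appraisals above, the formula is a few lines of plain Python. The `weighted_average` helper and the weights `[1, 1, 2]` are hypothetical, chosen only to show how trusting Agent C more pulls the estimate toward $480k.

```python
def weighted_average(predictions, weights=None):
    """Weighted mean of model predictions; equal weights if none are given."""
    if weights is None:
        weights = [1] * len(predictions)
    return sum(w * p for w, p in zip(weights, predictions)) / sum(weights)

appraisals = [500, 550, 480]  # Agents A, B, C, in thousands of dollars
equal = weighted_average(appraisals)
trust_c = weighted_average(appraisals, weights=[1, 1, 2])
print(equal, trust_c)
```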

---

# Applying the Model: The Regression Pipeline

We will use the **California Housing Dataset**. This is a classic regression problem (predicting house values) that benefits from scaling and ensemble methods.

### Our Strategy:
* **Base Models:**
    * `LinearRegression`: Captures global linear trends.
    * `DecisionTreeRegressor`: Captures non-linear local patterns.
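    * `SVR` (Support Vector Regressor): Captures complex non-linear relationships (needs scaling).
* **Pipeline:** We must scale the data because SVR is sensitive to feature magnitude. Plain `LinearRegression` and the Decision Tree give the same predictions with or without scaling, so the scaler does them no harm.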
* **Tuning:** We will tune the weights to see which "Agent" we should trust the most.

In [4]:
# ==========================================
# 1. IMPORTS & SETUP
# ==========================================
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# Base Regressors
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR

# The Ensemble
from sklearn.ensemble import VotingRegressor

# Metrics
from sklearn.metrics import r2_score, mean_squared_error

# ==========================================
# 2. DATA PREPARATION
# ==========================================
# California Housing: Predict median house value
data = fetch_california_housing()
X, y = data.data, data.target

# Split Data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training samples: {X_train.shape[0]}")
print(f"Feature scale example (feature 0): Mean={X_train[:,0].mean():.2f}, Std={X_train[:,0].std():.2f}")
# Note: Since features have different scales, scaling is mandatory for SVR.

# ==========================================
# 3. DEFINING BASE LEARNERS
# ==========================================
reg1 = LinearRegression()
reg2 = DecisionTreeRegressor(max_depth=5, random_state=42) # Constrain depth to reduce overfitting
reg3 = SVR(kernel='rbf')

# ==========================================
# 4. BUILDING THE PIPELINE
# ==========================================
# Define Voting Regressor
voting_reg = VotingRegressor(
    estimators=[
        ('lr', reg1),
        ('dt', reg2),
        ('svr', reg3)
    ]
)

# Pipeline: Scaler -> Voter
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('voter', voting_reg)
])

# ==========================================
# 5. HYPERPARAMETER TUNING
# ==========================================
# We will tune the "weights" (how much we trust each model)
# and the SVR's regularization parameter 'C'.

params = {
    'voter__weights': [[1, 1, 1], [1, 2, 1], [1, 1, 2]], # Equal trust vs trusting DT/SVR more
    'voter__svr__C': [1.0, 10.0]  # Tuning the SVR inside the voter
}

print("\nRunning Grid Search (tuning ensemble weights)...")
grid = GridSearchCV(pipeline, params, cv=3, scoring='neg_mean_squared_error', n_jobs=-1)
grid.fit(X_train, y_train)

print(f"Best Params: {grid.best_params_}")
# Note: Scoring is negative MSE in sklearn (higher is better), so we flip sign
print(f"Best RMSE (CV): {np.sqrt(-grid.best_score_):.4f}")

# ==========================================
# 6. EVALUATION
# ==========================================
best_model = grid.best_estimator_
y_pred = best_model.predict(X_test)

r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

print("\n--- Final Test Set Results ---")
print(f"R2 Score: {r2:.4f} (1.0 is perfect)")
print(f"RMSE: {rmse:.4f}")

# ==========================================
# 7. ANATOMY OF A PREDICTION
# ==========================================
# Let's peek under the hood for the first test sample
sample = X_test[0].reshape(1, -1)
true_val = y_test[0]

# Access steps
scaler = best_model.named_steps['scaler']
voter = best_model.named_steps['voter']
scaled_sample = scaler.transform(sample)

print(f"\n--- How the prediction was made (Sample 0) ---")
predictions = []
for name, estimator in voter.named_estimators_.items():
    pred = estimator.predict(scaled_sample)[0]
    predictions.append(pred)
    print(f"Model [{name}] predicts: {pred:.3f}")

weights = grid.best_params_['voter__weights']
# Calculate weighted average manually to prove concept
manual_pred = np.average(predictions, weights=weights)

print(f"--> Ensemble Weighted Average: {manual_pred:.3f}")
print(f"--> Actual Model Output: {best_model.predict(sample)[0]:.3f}")
print(f"--> True Value: {true_val:.3f}")

Training samples: 16512
Feature scale example (feature 0): Mean=3.88, Std=1.90

Running Grid Search (tuning ensemble weights)...
Best Params: {'voter__svr__C': 10.0, 'voter__weights': [1, 1, 2]}
Best RMSE (CV): 0.5870

--- Final Test Set Results ---
R2 Score: 0.7274 (1.0 is perfect)
RMSE: 0.5977

--- How the prediction was made (Sample 0) ---
Model [lr] predicts: 0.719
Model [dt] predicts: 1.169
Model [svr] predicts: 0.490
--> Ensemble Weighted Average: 0.717
--> Actual Model Output: 0.717
--> True Value: 0.477


## Bias-Variance Tradeoff

* **Variance:** Like the classifier, the main goal here is **Variance Reduction**.
* **Intuition:** If the Linear Regression is too rigid (**High Bias**) and the Decision Tree is too chaotic (**High Variance**), the SVR might find a middle ground. Averaging them dampens the outliers produced by the Decision Tree.



---

## When to use Voting Regressor?

1.  **Uncertainty in Model Selection:** When we don't know if the data is linear or non-linear, we mix both.
2.  **Improving Stability:** Single Decision Trees change drastically with small data changes. An ensemble of a Tree + Linear Regression is much more stable.
3.  **High Stakes Prediction:** Used in financial forecasting where reliability is more important than interpretability.

---

## When does it fail?

> [!CAUTION]
> **Outliers:** If one of our models produces an extreme outlier (e.g., predicts $10M instead of $1M), it can skew the average significantly. Unlike classification, where a wrong vote is just one vote, in regression, the **magnitude** of the error matters.

* **Solution:** Check the individual $R^2$ scores of base models. If one model is performing significantly worse than the others, remove it from the ensemble.
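
A quick numeric sketch of the outlier problem, using made-up predictions in millions of dollars: one model that is far off drags the plain average well away from the other two, and dropping it restores a sensible estimate.

```python
from statistics import mean

# Hypothetical outputs in $M; the third model is far off
predictions_musd = [1.0, 1.1, 10.0]

with_outlier = mean(predictions_musd)          # all three models averaged
without_outlier = mean(predictions_musd[:2])   # outlier model removed from the ensemble
print(f"{with_outlier:.2f} {without_outlier:.2f}")
```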

---

## Does it need scaling?

**Yes, whenever a base model is scale-sensitive.** We are using `StandardScaler` in the pipeline.

* **The Reason:** The Voting Regressor itself is just a weighted average, but some of the **Base Models** inside it (such as SVR or KNN) depend on feature magnitude and need scaled data to work well. Ordinary Linear Regression and Decision Trees are not affected, so scaling them is harmless.

# Theory: The "Parallel Universe" Strategy

In the **Voting Model**, we used different types of "doctors" (Linear, SVM, Tree).  
In **Bagging** (short for Bootstrap Aggregating), we use the **same type of doctor** (usually a Decision Tree), but we train them on different variations of the training data.

---

## Two Key Steps:

### 1. Bootstrapping (The "Bootstrap")
We create multiple random subsets of our original training data. Crucially, we sample **with replacement**.

* **Imagine:** We have a deck of flashcards. We draw a card, write it down, put it back in the deck, and shuffle. We do this until the new deck is as large as the original.
* **Result:** Some cards appear 2-3 times, others (about 37%) never appear. These missing cards are called **Out-of-Bag (OOB)** instances.
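
The 37% figure comes from $(1 - 1/n)^n \to 1/e \approx 0.368$: the chance that a given card is missed on every one of $n$ draws. A small simulation with the standard library (seed and deck size chosen arbitrarily) lands close to it.

```python
import random
from math import exp

random.seed(0)
n = 10_000
# Draw n indices with replacement: one bootstrap sample of the training set
sample = random.choices(range(n), k=n)
# Indices that were never drawn are the out-of-bag instances
oob_fraction = 1 - len(set(sample)) / n
print(f"OOB fraction: {oob_fraction:.3f}, 1/e = {exp(-1):.3f}")
```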

### 2. Aggregating (The "Aggregating")
We train a separate model on each "bootstrapped" dataset. Because the datasets are slightly different, the models will be slightly different.

* **Classification:** We take the majority vote.
* **Regression:** We take the average.
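
The two steps can be written out by hand before reaching for a library class. The following is a rough sketch, not how scikit-learn implements bagging internally: 25 trees (an arbitrary choice) are each trained on a bootstrap sample of the Digits data, and their labels are combined by majority vote.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

rng = np.random.default_rng(42)
n_trees, n_train = 25, len(X_train)
all_preds = []
for i in range(n_trees):
    # Step 1: bootstrap, i.e. draw row indices with replacement
    idx = rng.integers(0, n_train, size=n_train)
    tree = DecisionTreeClassifier(random_state=i).fit(X_train[idx], y_train[idx])
    all_preds.append(tree.predict(X_test))

# Step 2: aggregate with a per-sample majority vote over the trees
all_preds = np.array(all_preds)  # shape: (n_trees, n_test)
bagged_pred = np.array([np.bincount(col, minlength=10).argmax() for col in all_preds.T])

single_acc = DecisionTreeClassifier(random_state=42).fit(X_train, y_train).score(X_test, y_test)
bagged_acc = (bagged_pred == y_test).mean()
print(f"single tree: {single_acc:.4f}, bagged: {bagged_acc:.4f}")
```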

---

## Why do we do this?
**To kill Variance.**



A single Decision Tree is very sensitive to noise (high variance). By averaging 100 trees trained on slightly different data, the noise cancels out, and the true pattern remains.
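
The cancellation effect is easy to see in a toy simulation. Here each "model" is just the true value plus independent Gaussian noise, which is an idealization: real bagged trees share training data, so their errors are correlated and the reduction is smaller than the roughly 100x seen here.

```python
import random
from statistics import pvariance

random.seed(0)

def noisy_estimate(truth=5.0, noise=1.0):
    """One high-variance 'model': the true value plus independent Gaussian noise."""
    return random.gauss(truth, noise)

trials, n_models = 2_000, 100
single = [noisy_estimate() for _ in range(trials)]
averaged = [sum(noisy_estimate() for _ in range(n_models)) / n_models for _ in range(trials)]

var_single = pvariance(single)      # spread of one model's output
var_averaged = pvariance(averaged)  # spread of the 100-model average
print(f"single: {var_single:.3f}, average of {n_models}: {var_averaged:.4f}")
```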

---

## Applying the Model: Single Tree vs. Bagging

Let's prove the theory. We will compare a single **Decision Tree** against a **Bagging Classifier** on the Digits dataset.

### Does it need scaling?
**Usually No.** Bagging is typically used with Decision Trees, and Trees do not require scaling.

* **Exception:** If we bag models that *do* need scaling (like Bagged SVMs or Bagged KNN), then yes, we must scale.
* **Best Practice:** For this example (Trees), we can skip it, but we will include it in the pipeline just to be safe for future model swaps.

In [5]:
# ==========================================
# 1. IMPORTS
# ==========================================
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, classification_report

# ==========================================
# 2. DATA PREPARATION (Digits Dataset)
# ==========================================
# The Digits dataset: 8x8 pixel images of numbers 0-9.
digits = load_digits()
X, y = digits.data, digits.target

# Split Data (80% Train, 20% Test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Data Shape: {X.shape} (1797 samples, 64 pixels/features)")

# ==========================================
# 3. ESTABLISH A BASELINE
# ==========================================
# Let's see how a single Decision Tree performs.
# Trees tend to overfit (High training score, lower test score).
single_tree = DecisionTreeClassifier(random_state=42)
single_tree.fit(X_train, y_train)
base_acc = single_tree.score(X_test, y_test)
print(f"Single Decision Tree Accuracy: {base_acc:.4f}")

# ==========================================
# 4. BUILD THE BAGGING PIPELINE
# ==========================================
# We use a Pipeline even though Trees don't need scaling,
# so the base learner can later be swapped for a scale-sensitive model.


# The Base Learner: A Decision Tree
base_estimator = DecisionTreeClassifier(random_state=42)

# The Bagging Ensemble
bagging_clf = BaggingClassifier(
    estimator=base_estimator,
    n_estimators=100,      # 100 Trees
    max_samples=0.8,       # Each tree draws 80% of the training set size (with replacement)
    oob_score=True,        # Use the left-out data for validation automatically
    random_state=42,
    n_jobs=-1              # Use all CPU cores (Bagging is parallel!)
)

pipeline = Pipeline([
    ('scaler', StandardScaler()), # Optional for trees, critical for Bagged SVM/KNN
    ('bagging', bagging_clf)
])

# ==========================================
# 5. HYPERPARAMETER TUNING
# ==========================================
# We tune:
# 1. n_estimators: How many trees?
# 2. max_samples: How much data does each tree get?
# 3. max_features: Can we limit features too? (This moves us closer to Random Forest)

param_grid = {
    'bagging__n_estimators': [50, 100, 200],
    'bagging__max_samples': [0.5, 0.7, 1.0],
    'bagging__max_features': [0.5, 1.0] # train on 50% or 100% of features
}

print("\nRunning Grid Search (this handles the CV)...")
grid = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid.fit(X_train, y_train)

print(f"Best Params: {grid.best_params_}")
print(f"Best CV Score: {grid.best_score_:.4f}")

# ==========================================
# 6. FINAL EVALUATION & OOB SCORE
# ==========================================
best_model = grid.best_estimator_

# Retrieve OOB Score (Out-of-Bag Score)
# This is an estimate of test accuracy calculated during training without looking at X_test
oob_score = best_model.named_steps['bagging'].oob_score_
print(f"\nOOB Score (Validation estimate): {oob_score:.4f}")

# Final Test Prediction
y_pred = best_model.predict(X_test)
final_acc = accuracy_score(y_test, y_pred)

print(f"Final Test Accuracy: {final_acc:.4f}")
print(f"Improvement over Single Tree: +{(final_acc - base_acc)*100:.2f}%")

# ==========================================
# 7. HOW IT LEARNS (INTERPRETATION)
# ==========================================
print("\n--- Insight ---")
print("We trained multiple trees on random subsets.")
print(f"Because we set max_features={grid.best_params_['bagging__max_features']},")
print("each tree also only saw a portion of the pixels!")
print("This forces the trees to be diverse, making the vote robust.")

Data Shape: (1797, 64) (1797 samples, 64 pixels/features)
Single Decision Tree Accuracy: 0.8417

Running Grid Search (this handles the CV)...
Best Params: {'bagging__max_features': 0.5, 'bagging__max_samples': 0.7, 'bagging__n_estimators': 200}
Best CV Score: 0.9694

OOB Score (Validation estimate): 0.9701
Final Test Accuracy: 0.9722
Improvement over Single Tree: +13.06%

--- Insight ---
We trained multiple trees on random subsets.
Because we set max_features=0.5,
each tree also only saw a portion of the pixels!
This forces the trees to be diverse, making the vote robust.


# Advanced Topics & Considerations

## The "Out-of-Bag" (OOB) Concept
This is a unique advantage of Bagging.



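* **How it works:** Since we sample with replacement, about **37%** of the training data is never seen by a specific tree (when each bootstrap sample is as large as the training set; smaller `max_samples` values leave out more).
* **The Benefit:** We can use this "leftover" data to test that specific tree.
* **OOB Score:** Each training point is predicted using only the trees that never saw it. Scoring those predictions gives the OOB Score, which acts like a **Validation Set** without us needing to split the data manually.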
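
How much data ends up out-of-bag depends on how many draws each tree makes. As a back-of-the-envelope sketch (assuming sampling with replacement and a training set of 1,000 rows, both chosen for illustration), the chance that a row is never drawn is $(1 - 1/n)^{\text{draws}}$, which is close to $e^{-\text{max\_samples}}$.

```python
from math import exp

def expected_oob_fraction(max_samples, n=1_000):
    """Chance a given row is never drawn in max_samples * n draws with replacement."""
    draws = round(max_samples * n)
    return (1 - 1 / n) ** draws

for frac in (0.5, 0.7, 1.0):
    approx = exp(-frac)  # large-n approximation
    print(f"max_samples={frac}: {expected_oob_fraction(frac):.1%} out-of-bag (e^-f = {approx:.1%})")
```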

---

## Bias-Variance Tradeoff

* **Variance:** Bagging is the "King of Variance Reduction." It smooths out the "jitters" of complex models.
* **Bias:** Bagging **cannot** fix high bias. If we bag a simple Linear Regression (which underfits), the average of 100 underfitting models is still an underfitting model.
* **Rule of Thumb:** Bagging works best on **"Strong, Complex"** models (like deep Decision Trees).

---

## Sampling Variations

### Bagging vs. Pasting
* **Bagging:** Sampling **with** replacement (the same data point can be chosen twice for one tree). This is the default and usually reduces variance better.
* **Pasting:** Sampling **without** replacement. Used sometimes for massive datasets.

### Random Patches vs. Random Subspaces
[Image diagram comparing Random Patches and Random Subspaces sampling]

* **Random Subspaces:** Sampling features (columns) but using all samples (rows). Good when we have very high feature counts (e.g., DNA data).
* **Random Patches:** Sampling **both** features (columns) and samples (rows).
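
The four schemes differ only in which indices each model receives. The sketch below draws row and column indices with NumPy for a toy 10 x 6 dataset; the fractions and the `draw` helper are arbitrary illustrations. In scikit-learn, the corresponding `BaggingClassifier` knobs are `bootstrap`, `max_samples`, `bootstrap_features`, and `max_features`.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows, n_cols = 10, 6  # toy dataset shape

def draw(n_total, fraction, replace):
    """Pick round(fraction * n_total) indices, with or without replacement."""
    return rng.choice(n_total, size=round(fraction * n_total), replace=replace)

bagging_rows = draw(n_rows, 1.0, replace=True)    # Bagging: rows with replacement
pasting_rows = draw(n_rows, 0.6, replace=False)   # Pasting: rows without replacement
subspace_cols = draw(n_cols, 0.5, replace=False)  # Random Subspaces: all rows, some columns
patch_rows = draw(n_rows, 0.6, replace=False)     # Random Patches: some rows...
patch_cols = draw(n_cols, 0.5, replace=False)     # ...and some columns

print("bagging rows:  ", np.sort(bagging_rows))
print("pasting rows:  ", np.sort(pasting_rows))
print("subspace cols: ", np.sort(subspace_cols))
print("patch rows/cols:", np.sort(patch_rows), np.sort(patch_cols))
```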

---

## When does this fail?

* **Computationally Expensive:** Training 100 models takes 100x the time (though it can be parallelized using `n_jobs=-1`).
* **Loss of Interpretability:** A single Decision Tree is easy to visualize (Yes/No path). A bag of 100 trees is a **"Black Box"**: we know it works, but it's hard to explain exactly *why* a specific decision was made.