## Scope of This Note

This note demonstrates **Stacking Ensembles in two settings**:

1. **Classification Stacking**
   - Dataset: Breast Cancer
   - Goal: Accuracy & interpretability
   - Meta-learner: Logistic Regression

2. **Regression Stacking**
   - Dataset: California Housing
   - Goal: RMSE reduction
   - Meta-learner: Ridge Regression

The core stacking principles are identical; only the loss functions and evaluation metrics differ.


# Theory: The "Manager" Architecture

In **Voting**, we assigned fixed weights (e.g., "Trust Model A twice as much").  
In **Stacking**, we train a model to **learn those weights dynamically**.
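
The contrast can be sketched in a few lines. This is a minimal illustration, not part of the notebook below: the two base models, the hand-picked weights `[2, 1]`, and the variable names are assumptions chosen for the example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import StackingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Illustrative base models (hypothetical choices for this sketch)
base_models = [
    ("tree", DecisionTreeClassifier(max_depth=3, random_state=0)),
    ("knn", make_pipeline(StandardScaler(), KNeighborsClassifier())),
]

# Voting: the weights are fixed by hand ("trust the tree twice as much")
voter = VotingClassifier(estimators=base_models, voting="soft", weights=[2, 1])
voter.fit(X, y)

# Stacking: the weights are learned by the final estimator
stacker = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(),
    cv=5,
)
stacker.fit(X, y)

print("Hand-set voting weights:", voter.weights)
print("Learned stacking weights:", stacker.final_estimator_.coef_.ravel())
```

The voting weights stay exactly as typed, while the stacking weights are whatever the Logistic Regression fitted from the base models' predictions.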



---

## The Architecture (Two Layers)

### Level 0 (Base Learners):
We train diverse models (SVM, Tree, KNN) on the data. They make predictions.
* **Analogy:** **The Engineers.** One is good at math, one is good at creative design.

### Level 1 (Meta-Learner / Blender):
This is a new model (usually a simple Linear Regression or Logistic Regression). It takes the predictions from Level 0 as its **inputs**.
* **Analogy:** **The Manager.** She sees that for "Project Type X", the Math Engineer is usually right, so she listens to him. For "Project Type Y", she listens to the Creative Engineer.

---

## How it learns (The Trick)
To avoid "cheating" (**overfitting**), we cannot train the Manager on predictions the Workers (the Level 0 base learners) made for data they were trained on.

1. We split the training data.
2. The Workers train on **Part A**.
3. They predict on **Part B**.
4. The Manager trains on the **Predictions** made on Part B.

> **Note:** Scikit-Learn handles this automatically using internal **Cross-Validation**.
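
The four steps above can be written out by hand. The sketch below assumes a single 50/50 split and two illustrative Workers (a shallow tree and a scaled KNN); these choices and all variable names are assumptions for the example, not what `StackingClassifier` does internally.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Step 1: split the training data into Part A and Part B
X_a, X_b, y_a, y_b = train_test_split(
    X, y, test_size=0.5, random_state=0, stratify=y
)

# Step 2: the Workers train on Part A only
workers = [
    DecisionTreeClassifier(max_depth=3, random_state=0),
    make_pipeline(StandardScaler(), KNeighborsClassifier()),
]
for worker in workers:
    worker.fit(X_a, y_a)

# Step 3: the Workers predict on Part B, which they have never seen.
# Each column holds one Worker's positive-class probability.
meta_features = np.column_stack(
    [worker.predict_proba(X_b)[:, 1] for worker in workers]
)

# Step 4: the Manager trains on those predictions
manager = LogisticRegression()
manager.fit(meta_features, y_b)

print("Meta-feature matrix shape:", meta_features.shape)
print("Manager weights:", manager.coef_.ravel())
```

A single split like this spends half the data on each layer. The cross-validated version rotates the held-out part so that every training row receives a prediction from a model that did not see it.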

---

## Applying the Model: The Setup

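We will use the **Breast Cancer Dataset** (569 samples, 30 features, binary target). It is a convenient test bed for Stacking because different model families (linear vs. non-linear) tend to pick up different signal types and make somewhat different mistakes. Stacking can combine them to squeeze out a final 1-2% of accuracy, though a gain is not guaranteed on a dataset where single models already do well.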

### Does it need scaling?
**Yes.**

* **For Base Models:** If our base models include SVM or KNN, the input data must be scaled.
* **For Meta Model:** The meta-model sees "predictions" (probabilities) as input, which are already roughly scaled (0 to 1), so it's less critical there, but the input pipeline must include scaling for the workers.
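
An alternative to scaling everything up front is to attach the scaler only to the base models that need it. The sketch below assumes a Random Forest and a KNN as Workers; the estimator names and settings are illustrative, not the configuration used later in this note.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

estimators = [
    # Tree ensembles are scale-invariant, so the forest gets raw features
    ("rf", RandomForestClassifier(n_estimators=25, random_state=0)),
    # KNN is distance-based, so it gets its own scaler
    ("knn", make_pipeline(StandardScaler(), KNeighborsClassifier())),
]

stack = StackingClassifier(
    estimators=estimators,
    final_estimator=LogisticRegression(),  # sees probabilities, no scaler needed
    cv=5,
)

# The scaler is refit inside every fold, so no test-fold statistics leak in
scores = cross_val_score(stack, X, y, cv=3, scoring="accuracy")
print(f"Mean CV accuracy: {scores.mean():.4f}")
```

Either layout works; per-model pipelines simply make it explicit which Worker depends on scaling.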

---

## The Stacking Notebook
We will build a pipeline where:

* **Level 0:** Random Forest (Robust), SVM (Distance-based), KNN (Local patterns).
* **Level 1:** Logistic Regression (The Manager).

In [1]:
# ==========================================
# 1. IMPORTS
# ==========================================
import pandas as pd
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# Level 0 Models (The Workers)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

# Level 1 Model (The Manager)
from sklearn.linear_model import LogisticRegression

# The Stacking Ensemble
from sklearn.ensemble import StackingClassifier

# Metrics
from sklearn.metrics import accuracy_score, classification_report

# ==========================================
# 2. DATA PREPARATION
# ==========================================
data = load_breast_cancer()
X, y = data.data, data.target

# Split Data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Data Loaded: {X.shape}")

# ==========================================
# 3. DEFINE BASE LEARNERS (LEVEL 0)
# ==========================================
# We want diverse models. If they all make the same errors, the Manager learns nothing.

# Note: We must ensure SVM outputs probabilities if we want the Manager
# to see confidence scores rather than just hard labels.
estimators = [
    ('rf', RandomForestClassifier(n_estimators=10, random_state=42)),
    ('svc', SVC(probability=True, random_state=42)), # probability=True is key
    ('knn', KNeighborsClassifier(n_neighbors=5))
]

# ==========================================
# 4. DEFINE STACKING CLASSIFIER
# ==========================================
# final_estimator is our Manager.
# LogisticRegression is best here to keep the "Management" logic simple and interpretable.

clf_stack = StackingClassifier(
    estimators=estimators,
    final_estimator=LogisticRegression(),
    cv=5,            # Internal CV to generate training data for the Manager
    stack_method='predict_proba', # Manager sees probabilities (0.9, 0.1) instead of labels (1, 0)
    n_jobs=-1
)

# ==========================================
# 5. PIPELINE & TUNING
# ==========================================
# We wrap everything in a pipeline to handle scaling for the SVC/KNN inside the Stack.
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('stack', clf_stack)
])


# ⚠️ Note on Scaling in Stacking Pipelines:
# Although Random Forest does not require scaling, we apply scaling globally because:
# - SVC and KNN require scaled features
# - Tree-based models are scale-invariant, so scaling does not harm them
# This is a pragmatic engineering trade-off.




# HYPERPARAMETER TUNING
# We can tune the Base Models AND the Manager simultaneously.
# Syntax: stack__<estimator_name>__<parameter>
# Syntax for final estimator: stack__final_estimator__<parameter>

params = {
    # Tuning a Base Model (Random Forest)
    'stack__rf__n_estimators': [10, 50],

    # Tuning the Manager (Logistic Regression)
    # C is the inverse regularization strength: a smaller C shrinks the
    # Manager's weights harder, a larger C lets it fit the inputs more closely
    'stack__final_estimator__C': [0.1, 1.0, 10.0]
}

print("Running Grid Search on the Stack...")
grid = GridSearchCV(pipeline, params, cv=5, scoring='accuracy', n_jobs=-1)
grid.fit(X_train, y_train)

print(f"Best Params: {grid.best_params_}")
print(f"Best CV Score: {grid.best_score_:.4f}")

# ==========================================
# 6. EVALUATION
# ==========================================
best_model = grid.best_estimator_
y_pred = best_model.predict(X_test)

print("\n--- Final Results ---")
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(classification_report(y_test, y_pred))

# ==========================================
# 7. HOW IT LEARNS (INTERPRETATION)
# ==========================================
# Let's look at the "Manager's" learned weights.
# This tells us which Base Model the Manager trusts the most.

final_layer = best_model.named_steps['stack'].final_estimator_

print("\n--- The Manager's Logic ---")
print("Coefficients assigned to each Base Model's predictions:")
# With a binary target and stack_method='predict_proba', scikit-learn keeps only the
# positive-class probability from each base model, so coef_ has shape (1, n_estimators).
print(f"Coefficients: {final_layer.coef_}")

# Roughly, a higher coefficient means the Manager leans on that model more.
# The columns follow the order of the `estimators` list.
names = ['RandomForest', 'SVC', 'KNN']
print(f"Order: {names}")

Data Loaded: (569, 30)
Running Grid Search on the Stack...
Best Params: {'stack__final_estimator__C': 1.0, 'stack__rf__n_estimators': 10}
Best CV Score: 0.9692

--- Final Results ---
Accuracy: 0.9561
              precision    recall  f1-score   support

           0       0.93      0.95      0.94        43
           1       0.97      0.96      0.96        71

    accuracy                           0.96       114
   macro avg       0.95      0.96      0.95       114
weighted avg       0.96      0.96      0.96       114


--- The Manager's Logic ---
Coefficients assigned to each Base Model's predictions:
Coefficients: [[1.93455919 3.9602202  2.43502367]]
Order: ['RandomForest', 'SVC', 'KNN']


## Main Parameters Explained

* **`estimators`**: A list of `(name, model)` tuples. These are the **workers**.
* **`final_estimator`**: The model that aggregates the workers' outputs.
    > **Recommendation:** Always start with `LogisticRegression` (for classification) or `LinearRegression` (for regression). If we use a complex model here (like a Decision Tree), the "Manager" might overfit to the noise in the predictions.
* **`cv` (Cross-Validation Strategy)**:
    This is critical. To train the Manager, the stack splits the training data (e.g., 5 folds). It trains the workers on 4 folds and predicts on the 5th. This creates "clean" predictions for the Manager to learn from. The meta-learner is trained only on **out-of-fold predictions**, never on predictions from models that saw the same samples during training. This prevents target leakage.

    * If `cv=None`, it defaults to 5.
* **`stack_method`**:
    * `'auto'`: Tries to use `predict_proba` (probabilities), falls back to `decision_function`, then `predict`.
    * **Advice:** For classification, explicit probabilities (`predict_proba`) usually give the Manager more information to work with than hard labels.
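
To see what `stack_method='auto'` resolves to, a fitted stack exposes `stack_method_`, and `transform` returns the matrix the Manager actually receives. In this sketch the base models are illustrative assumptions: `RidgeClassifier` is included only because it has no `predict_proba`, which forces the fallback.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

stack = StackingClassifier(
    estimators=[
        # Has predict_proba, so 'auto' picks it
        ("tree", DecisionTreeClassifier(max_depth=3, random_state=0)),
        # No predict_proba, so 'auto' falls back to decision_function
        ("ridge_clf", make_pipeline(StandardScaler(), RidgeClassifier())),
    ],
    final_estimator=LogisticRegression(),
    stack_method="auto",
    cv=5,
)
stack.fit(X, y)

# One resolved method per base model, in the order of `estimators`
print("Resolved methods:", stack.stack_method_)

# The Manager's inputs: one column per base model for a binary target
meta_features = stack.transform(X)
print("Meta-feature shape:", meta_features.shape)
```

Mixing probabilities and decision scores is allowed, but the columns then live on different scales, which is one reason to regularize the Manager.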

---

## Considerations & Trade-offs

### Bias-Variance Tradeoff
Stacking primarily reduces **bias** by combining models with different inductive assumptions.
It can also stabilize variance, but this depends on:
- Diversity of base learners
- Strength and regularization of the meta-learner




---

### When does this fail?
* **Redundant Base Models:** If we stack a Random Forest and a Bagged Decision Tree, they are doing the same thing. The Manager won't find unique signals.
* **Overfitting the Meta-Learner:** If the dataset is small, the Manager might just memorize "When Model A is wrong, Model B is right."
* **Latency:** Stacking is slow. To predict one new sample, we have to run it through all base models, then the meta-model. Not ideal for real-time, low-latency apps.
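
One rough way to check for redundant base models before stacking them is to compare their out-of-fold predictions. The sketch below is an assumption-laden diagnostic, not a standard scikit-learn feature: the three candidate models and the use of plain Pearson correlation are choices made for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Hypothetical candidates: two tree ensembles and one distance-based model
candidates = {
    "rf": RandomForestClassifier(n_estimators=25, random_state=0),
    "bagged_trees": BaggingClassifier(n_estimators=25, random_state=0),
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier()),
}

# Out-of-fold positive-class probabilities, one column per candidate
oof = np.column_stack([
    cross_val_predict(model, X, y, cv=5, method="predict_proba")[:, 1]
    for model in candidates.values()
])

# Pairwise correlation between the candidates' predictions
corr = np.corrcoef(oof, rowvar=False)

names = list(candidates)
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        print(f"{names[i]} vs {names[j]}: {corr[i, j]:.3f}")
```

Pairs whose predictions correlate very strongly give the Manager little new information, so one of them can usually be dropped, which also trims prediction latency.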

---

### When to use?
* **Competitions (Kaggle):** Many winning solutions are stacked ensembles.
* **Plateaued Performance:** When we have tuned our single models extensively and can't get past a performance wall (e.g., 95% accuracy). Stacking is the "sledgehammer" used to break that wall.

In [2]:
# ==========================================
# 1. IMPORTS & CONFIG
# ==========================================
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Scikit-Learn
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.preprocessing import StandardScaler, RobustScaler
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer

# Regressors
from sklearn.linear_model import Ridge, LinearRegression
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, StackingRegressor

# Metrics
from sklearn.metrics import mean_squared_error

# Global Config
SEED = 42
np.random.seed(SEED)  # seeds NumPy's global RNG; the call returns None, so nothing is assigned

# ==========================================
# 2. LOAD DATA
# ==========================================
raw_data = fetch_california_housing(as_frame=True)
df = raw_data.frame

# In a real Kaggle comp, we would have train.csv and test.csv
# Here, we simulate that split manually.
X = df.drop(columns=['MedHouseVal'])
y = df['MedHouseVal']

# 80% Train (Our Playground), 20% Test (The "Private Leaderboard")
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=SEED)

print(f"Train Shape: {X_train.shape}")
print(f"Test Shape:  {X_test.shape}")
df.head(3)

Train Shape: (16512, 8)
Test Shape:  (4128, 8)


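|   | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | MedHouseVal |
|---|--------|----------|----------|-----------|------------|----------|----------|-----------|-------------|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 | 4.526 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 | 3.585 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 | 3.521 |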


In [3]:
# ==========================================
# 3. DEFINE PREPROCESSING
# ==========================================
# We use RobustScaler because housing data often has outliers (mansions vs shacks)
scaler = RobustScaler()

# Note: We won't apply this globally yet. We will attach it to specific models.

# ==========================================
# 4. BASE MODELS
# ==========================================

# 1. Ridge Regression (Linear Model)
# Good for capturing general trends. Needs Scaling.
ridge = make_pipeline(RobustScaler(), Ridge(alpha=1.0, random_state=SEED))

# 2. SVR (Support Vector Regressor)
# Good for complex non-linear boundaries. Needs Scaling.
# Note: SVR is slow on large data, but effective.
svr = make_pipeline(RobustScaler(), SVR(C=1.0, epsilon=0.2))

# 3. Random Forest (Bagging)
# Good for high variance reduction. No scaling needed.
rf = RandomForestRegressor(n_estimators=100, max_depth=10, random_state=SEED, n_jobs=-1)

# 4. Gradient Boosting (Boosting)
# Good for bias reduction. No scaling needed.
gbr = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=SEED)

# List of tuples for the Stacker
estimators = [
    ('ridge', ridge),
    ('svr', svr),
    ('rf', rf),
    ('gbr', gbr)
]


# ==========================================
# 5. STACKING CONFIGURATION
# ==========================================

# The Manager
meta_learner = Ridge(alpha=1.0, random_state=SEED)

# The Ensemble
stacking_reg = StackingRegressor(
    estimators=estimators,
    final_estimator=meta_learner,
    cv=5,       # 5-fold cross-validation for training the meta-learner
    n_jobs=-1,  # Parallelize base model training
    passthrough=False # If True, feeds original features + predictions to meta-learner. Usually False is safer.
)


# Why passthrough=False?
# Feeding original features to the meta-learner can:
# - Increase dimensionality
# - Reintroduce multicollinearity
# - Cause the meta-learner to ignore base learners entirely

# For small-to-medium datasets, prediction-only stacking is safer.


# ==========================================
# 6. TRAINING & EVALUATION
# ==========================================
print("Training Stacking Regressor... (This might take a minute)")
stacking_reg.fit(X_train, y_train)

# Predict on the "Private Leaderboard" (Test Set)
y_pred = stacking_reg.predict(X_test)

# Calculate RMSE
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"\n✅ Final Stacking RMSE: {rmse:.4f}")

# Compare with a single best model (e.g., Random Forest) just to see the lift
rf.fit(X_train, y_train)
rf_pred = rf.predict(X_test)
rf_rmse = np.sqrt(mean_squared_error(y_test, rf_pred))
print(f"📊 Single Random Forest RMSE: {rf_rmse:.4f}")
print(f"🚀 Improvement: {rf_rmse - rmse:.4f}")

Training Stacking Regressor... (This might take a minute)

✅ Final Stacking RMSE: 0.5289
📊 Single Random Forest RMSE: 0.5445
🚀 Improvement: 0.0156


In [4]:
# ==========================================
# 7. MODEL INTERPRETATION
# ==========================================
# Access the meta-learner (Ridge) from the stack
manager = stacking_reg.final_estimator_

print("\n--- The Manager's Weights ---")
print(f"Intercept: {manager.intercept_:.2f}")
for name, weight in zip([n[0] for n in estimators], manager.coef_):
    print(f"Model [{name}]: {weight:.2f}")

# Interpretation:
# If 'rf' has a high weight (e.g., 0.6) and 'ridge' has low (0.1),
# the manager relies heavily on the Forest but uses Ridge to correct slight offsets.


--- The Manager's Weights ---
Intercept: -0.06
Model [ridge]: 0.01
Model [svr]: -0.15
Model [rf]: 0.49
Model [gbr]: 0.67


The coefficients we see are the weights assigned by the 'manager' model (our Ridge Regression final_estimator) to the predictions of each base model. In essence, they tell us how much the manager model 'trusts' or relies on each base learner when making its final prediction.

Here is how to read the output `[0.01, -0.15, 0.49, 0.67]` for the order `['ridge', 'svr', 'rf', 'gbr']`:

* **ridge (0.01):** A near-zero positive weight. The manager gets almost nothing from the first-layer Ridge model that the other base models do not already provide.

* **svr (-0.15):** A negative weight. All four base models predict the same target, so their outputs are strongly correlated, and a linear meta-learner can give one of them a negative weight to cancel error it shares with the others or to correct a systematic over- or under-prediction. A negative weight does not by itself mean the SVR's predictions run opposite to the true values.

* **rf (0.49):** A moderate positive weight. The manager treats the Random Forest's predictions as a significant contributor to the final output.

* **gbr (0.67):** The highest positive weight. The manager relies most heavily on the Gradient Boosting model. A large weight reflects how useful a model is in combination with the others, which is related to, but not the same as, being the most accurate model on its own.



In summary, the manager model learns how to optimally combine the predictions from its base models. In this case, the Gradient Boosting model's predictions are given the most weight, followed by Random Forest, while the SVR's contribution is negative, and the base Ridge model's contribution is minimal.
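
A negative weight like the SVR's can arise purely from correlated errors. The toy sketch below uses synthetic data and two made-up "models" (`pred_a`, `pred_b`) that share one error term; it illustrates the mechanism and is not a diagnosis of the SVR in this notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

y = rng.normal(size=500)             # true target
shared_error = rng.normal(size=500)  # error that both models share

# Two hypothetical base models, both positively related to the target
pred_a = y + shared_error            # off by the shared error
pred_b = y + 2 * shared_error        # off by twice the shared error

# The meta-learner is free to subtract one model to cancel the shared error
meta = LinearRegression()
meta.fit(np.column_stack([pred_a, pred_b]), y)

# Since y = 2 * pred_a - pred_b exactly, the weights come out near [2, -1]
print("Meta-learner weights:", meta.coef_.round(3))
print("Intercept:", round(float(meta.intercept_), 3))
```

Both predictors track the target, yet the second receives a negative weight because subtracting it removes error it shares with the first.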

### Why use Stacking?

It combines the strengths of different "inductive biases." Trees capture non-linear steps; Linear models capture trends; SVMs capture geometric boundaries.

### Why CV in Stacking?

If we trained base models on $X_{train}$ and then trained the meta-learner on their predictions for $X_{train}$, the meta-learner would see "perfect" predictions (overfitting). We use CV to ensure the meta-learner sees "out-of-sample" predictions (realistic errors).

**Bias-Variance:** Stacking is primarily a bias reduction technique (improving the fit), though it stabilizes variance too.
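
The gap between in-sample and out-of-fold predictions, which is the reason for using CV here, is easy to measure. This sketch assumes synthetic regression data and a Random Forest as the only base model, purely for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_predict

# Synthetic data, assumed for illustration only
X, y = make_regression(n_samples=300, n_features=10, noise=20.0, random_state=0)

rf = RandomForestRegressor(n_estimators=50, random_state=0)

# "Cheating": predict the same rows the model was trained on
in_sample = rf.fit(X, y).predict(X)

# Honest: every row is predicted by a model that never saw it
out_of_fold = cross_val_predict(rf, X, y, cv=5)

rmse_in = np.sqrt(mean_squared_error(y, in_sample))
rmse_oof = np.sqrt(mean_squared_error(y, out_of_fold))

print(f"In-sample RMSE:   {rmse_in:.2f}")
print(f"Out-of-fold RMSE: {rmse_oof:.2f}")
```

A meta-learner trained on the in-sample column would overrate the forest, because those predictions look far better than anything it will produce on new data.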


### Data Size Rule of Thumb
Stacking works best when:
- Dataset size is moderate to large
- Base learners have uncorrelated errors

For very small datasets, stacking often overfits due to limited out-of-fold samples.



### Competition Tip (Kaggle)
In high-ranking solutions:
- First layer: many diverse models
- Second layer: simple linear model
- Third layer (optional): blending multiple stacks

The power comes from **error diversity**, not model complexity.
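
In scikit-learn, one way to get more than two layers is to pass another stacking estimator as `final_estimator`. The sketch below assumes synthetic data and small, arbitrary model choices; it shows the wiring only and is not a reconstruction of any particular competition solution.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import (
    GradientBoostingRegressor,
    RandomForestRegressor,
    StackingRegressor,
)
from sklearn.linear_model import Ridge
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data, assumed for illustration only
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

# Upper layers: a small stack that works on the first layer's predictions
# and ends in a simple linear blender
upper_layers = StackingRegressor(
    estimators=[("ridge", Ridge()), ("knn", KNeighborsRegressor())],
    final_estimator=Ridge(),
    cv=3,
)

# First layer: diverse models whose out-of-fold predictions feed the layers above
multi_layer = StackingRegressor(
    estimators=[
        ("rf", RandomForestRegressor(n_estimators=20, random_state=0)),
        ("gbr", GradientBoostingRegressor(n_estimators=20, random_state=0)),
        ("ridge", Ridge()),
    ],
    final_estimator=upper_layers,
    cv=3,
)
multi_layer.fit(X, y)

preds = multi_layer.predict(X[:5])
print("Final estimator type:", type(multi_layer.final_estimator_).__name__)
print("Predictions:", preds.round(2))
```

Every extra layer adds fitting cost and another chance to overfit, so the simple linear blender at the top remains the safer default.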



### One-Sentence Summary
Stacking is a supervised way of learning how much to trust each model, using cross-validated predictions to avoid overfitting.
