# Module 08: Stacking and Blending

**Difficulty**: ⭐⭐⭐ Advanced
**Estimated Time**: 80 minutes
**Prerequisites**: 
- Module 05: XGBoost
- Module 06: LightGBM
- Module 07: CatBoost

## Learning Objectives

By the end of this notebook, you will be able to:
1. Understand meta-learning and stacked ensembles
2. Implement multi-level stacking (2-level, 3-level)
3. Prevent overfitting in stacking using proper cross-validation
4. Use sklearn's StackingClassifier and StackingRegressor
5. Understand blending and its differences from stacking
6. Choose diverse base models for optimal ensemble
7. Select appropriate meta-model (linear vs non-linear)
8. Implement stacking from scratch for complete understanding

## Setup and Configuration

In [None]:
# Core libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import time
import warnings
warnings.filterwarnings('ignore')

# Machine learning
from sklearn.datasets import load_breast_cancer, make_classification
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.metrics import (
    accuracy_score, classification_report, confusion_matrix,
    roc_auc_score, roc_curve, log_loss
)

# Base models
from sklearn.ensemble import (
    RandomForestClassifier, 
    GradientBoostingClassifier,
    StackingClassifier
)
from sklearn.linear_model import LogisticRegression, Ridge
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Gradient boosting libraries
try:
    import xgboost as xgb
    XGB_AVAILABLE = True
except ImportError:
    XGB_AVAILABLE = False
    print("XGBoost not available")

try:
    import lightgbm as lgb
    LGB_AVAILABLE = True
except ImportError:
    LGB_AVAILABLE = False
    print("LightGBM not available")

try:
    from catboost import CatBoostClassifier
    CB_AVAILABLE = True
except ImportError:
    CB_AVAILABLE = False
    print("CatBoost not available")

# Configuration
%matplotlib inline
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')

# Reproducibility
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.precision', 4)

print("\nSetup complete. Any missing optional libraries are reported above.")

## 1. What is Stacking?

### Stacking = Stacked Generalization

Stacking is a meta-learning technique that combines multiple models and often, though not always, performs better than any individual model.

### Key Idea

Instead of using simple averaging or voting:
1. Train multiple diverse "base models" (Level 0)
2. Use their predictions as features
3. Train a "meta-model" (Level 1) to combine base predictions
4. Meta-model learns optimal way to combine base models

### Architecture

```
Training Data
     |
     v
+----+----+----+----+
|    |    |    |    |  Level 0: Base Models
| RF | XGB| SVM| KNN|
+----+----+----+----+
  |    |    |    |
  v    v    v    v
  Predictions (Meta-features)
         |
         v
  +-------------+      Level 1: Meta-Model
  | Logistic Reg|
  +-------------+
         |
         v
  Final Prediction
```

### Why Stacking Works

1. **Diversity**: Different models make different errors
2. **Complementary strengths**: Each model captures different patterns
3. **Optimal combination**: Meta-model learns best weighting
4. **Non-linear combinations**: Can learn complex interactions
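
To make the diversity argument concrete, here is a small simulation. It is a sketch under simplifying assumptions, not part of the stacking pipeline: each model's error is drawn from a Gaussian with the same variance, every pair of models shares the same correlation `rho`, and the models are combined by plain averaging rather than by a learned meta-model. Under those assumptions the variance of the averaged error is `sigma^2 * (rho + (1 - rho) / M)` for `M` models, so less correlated errors cancel more. The names `averaged_error_variance`, `n_models`, and `sigma` are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(42)
n_models, n_samples, sigma = 4, 200_000, 1.0

def averaged_error_variance(rho):
    """Empirical variance of the mean of n_models errors with pairwise correlation rho."""
    # Covariance matrix: sigma^2 on the diagonal, rho * sigma^2 everywhere else
    cov = sigma**2 * (rho * np.ones((n_models, n_models)) + (1 - rho) * np.eye(n_models))
    errors = rng.multivariate_normal(np.zeros(n_models), cov, size=n_samples)
    # Average the simulated errors across models, then measure the spread
    return errors.mean(axis=1).var()

for rho in (0.0, 0.5, 0.9):
    theory = sigma**2 * (rho + (1 - rho) / n_models)
    print(f"rho={rho:.1f}  simulated={averaged_error_variance(rho):.3f}  theory={theory:.3f}")
```

With highly correlated errors the average is barely better than a single model, which is why the later sections check the correlation between base model predictions.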

### The Critical Challenge: Overfitting

**Naive approach (WRONG)**:
```python
# Train base models on full training data
base_predictions = base_model.predict(X_train)  # ← Leakage!
# Train meta-model on same data predictions
meta_model.fit(base_predictions, y_train)  # ← Overfitting!
```

Problem: the base models have already seen this data, so their predictions on it are overly optimistic.

**Correct approach: Out-of-fold predictions**:
```python
# Use cross-validation to get predictions on unseen data
for fold in cv_folds:
    # Train on other folds, predict on this fold
    base_model.fit(X_train_other_folds, y_train_other_folds)
    predictions[fold] = base_model.predict(X_train_this_fold)
# Now meta-model trains on truly unseen predictions
meta_model.fit(predictions, y_train)
```
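
The gap between the two approaches can be measured directly. The sketch below is an illustration, not the notebook's main pipeline: it uses sklearn's `cross_val_predict` to produce out-of-fold predictions for a single random forest and compares their accuracy with the in-sample accuracy. The `_demo` variable names are chosen here only to avoid clashing with the variables used later in the notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_predict, train_test_split

X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr_demo, X_te_demo, y_tr_demo, y_te_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

base_model = RandomForestClassifier(n_estimators=100, random_state=42)

# In-sample: the forest predicts the same rows it was trained on
base_model.fit(X_tr_demo, y_tr_demo)
in_sample_acc = accuracy_score(y_tr_demo, base_model.predict(X_tr_demo))

# Out-of-fold: each row is predicted by a clone that never saw that row
oof_pred = cross_val_predict(base_model, X_tr_demo, y_tr_demo, cv=5)
oof_acc = accuracy_score(y_tr_demo, oof_pred)

# Held-out test accuracy, as a reference for what "unseen data" looks like
test_acc = accuracy_score(y_te_demo, base_model.predict(X_te_demo))

print(f"In-sample accuracy:   {in_sample_acc:.4f}")
print(f"Out-of-fold accuracy: {oof_acc:.4f}")
print(f"Test accuracy:        {test_acc:.4f}")
```

The out-of-fold score is usually the one closer to test accuracy, which is why meta-features built from it give the meta-model a realistic picture of how reliable each base model is.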

### Stacking vs Other Ensembles

| Method | Combination | Complexity | Overfitting Risk |
|--------|-------------|------------|------------------|
| **Voting** | Simple average/vote | Low | Low |
| **Weighted Voting** | Weighted average | Low | Moderate |
| **Stacking** | Learned combination | High | Higher (needs CV) |
| **Boosting** | Sequential | Moderate | Moderate |

### When to Use Stacking

**Best for**:
- Kaggle competitions (extra 0.5-2% accuracy)
- High-stakes predictions (medical, finance)
- When you have diverse strong models
- Sufficient training data

**Avoid when**:
- Small datasets (<1000 samples)
- Need interpretability
- Production complexity is a concern
- Single model already excellent

In [None]:
# Load dataset
cancer_data = load_breast_cancer()
X, y = cancer_data.data, cancer_data.target
feature_names = cancer_data.feature_names

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=RANDOM_STATE
)

print(f"Dataset: {len(X_train)} train, {len(X_test)} test, {X.shape[1]} features")
print(f"Classes: {np.unique(y)}, Distribution: {np.bincount(y)}")

## 2. Simple Stacking with sklearn

sklearn provides `StackingClassifier` that handles cross-validation automatically.

In [None]:
# Define diverse base models (Level 0)
base_models = [
    ('rf', RandomForestClassifier(n_estimators=100, random_state=RANDOM_STATE)),
    ('gb', GradientBoostingClassifier(n_estimators=100, random_state=RANDOM_STATE)),
    ('svm', SVC(probability=True, random_state=RANDOM_STATE)),
    ('knn', KNeighborsClassifier(n_neighbors=5))
]

# Define meta-model (Level 1)
meta_model = LogisticRegression(random_state=RANDOM_STATE)

# Create stacking classifier
stacking_clf = StackingClassifier(
    estimators=base_models,
    final_estimator=meta_model,
    cv=5,  # Use 5-fold CV for out-of-fold predictions
    stack_method='predict_proba',  # Use probabilities (better than hard predictions)
    n_jobs=-1
)

print("Stacking Classifier created with:")
print(f"  Base models: {len(base_models)}")
print(f"  Meta-model: Logistic Regression")
print(f"  CV folds: 5")

In [None]:
# Train stacking classifier
print("Training stacking classifier...\n")
start = time.time()
stacking_clf.fit(X_train, y_train)
stacking_time = time.time() - start

# Predict
stacking_pred = stacking_clf.predict(X_test)
stacking_proba = stacking_clf.predict_proba(X_test)
stacking_acc = accuracy_score(y_test, stacking_pred)
stacking_auc = roc_auc_score(y_test, stacking_proba[:, 1])

print(f"Training time: {stacking_time:.2f} seconds")
print(f"Test accuracy: {stacking_acc:.4f}")
print(f"Test AUC-ROC: {stacking_auc:.4f}")

In [None]:
# Compare with individual base models
print("\nComparing Stacking vs Individual Models:")
print("=" * 70)

results = []

# Train and evaluate each base model individually
for name, model in base_models:
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    acc = accuracy_score(y_test, pred)
    
    if hasattr(model, 'predict_proba'):
        proba = model.predict_proba(X_test)[:, 1]
        auc = roc_auc_score(y_test, proba)
    else:
        auc = np.nan
    
    results.append({'Model': name.upper(), 'Accuracy': acc, 'AUC': auc})
    print(f"{name.upper():10s} - Accuracy: {acc:.4f}, AUC: {auc:.4f}")

# Add stacking results
results.append({'Model': 'STACKING', 'Accuracy': stacking_acc, 'AUC': stacking_auc})
print(f"\n{'STACKING':10s} - Accuracy: {stacking_acc:.4f}, AUC: {stacking_auc:.4f}")

df_results = pd.DataFrame(results)
best_base_acc = df_results[df_results['Model'] != 'STACKING']['Accuracy'].max()
improvement = (stacking_acc - best_base_acc) * 100

print(f"\nStacking vs best base model: {improvement:+.2f} percentage points of accuracy")

In [None]:
# Visualize comparison
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Accuracy comparison
colors = ['steelblue'] * (len(results) - 1) + ['#e74c3c']
axes[0].bar(range(len(results)), df_results['Accuracy'], color=colors, edgecolor='black')
axes[0].set_xticks(range(len(results)))
axes[0].set_xticklabels(df_results['Model'], rotation=45, ha='right')
axes[0].set_ylabel('Accuracy', fontsize=12)
axes[0].set_title('Model Accuracy Comparison', fontsize=13, fontweight='bold')
axes[0].grid(axis='y', alpha=0.3)
axes[0].set_ylim([0.9, 1.0])

# AUC comparison
axes[1].bar(range(len(results)), df_results['AUC'], color=colors, edgecolor='black')
axes[1].set_xticks(range(len(results)))
axes[1].set_xticklabels(df_results['Model'], rotation=45, ha='right')
axes[1].set_ylabel('AUC-ROC', fontsize=12)
axes[1].set_title('Model AUC Comparison', fontsize=13, fontweight='bold')
axes[1].grid(axis='y', alpha=0.3)
axes[1].set_ylim([0.9, 1.0])

plt.tight_layout()
plt.show()

print("\nStacking (red bar) combines strengths of all base models!")
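
The learning objectives also mention `StackingRegressor`, which follows the same pattern for regression targets. Below is a minimal sketch on synthetic data. The dataset, the choice of base models, and the `_reg` and `r_` variable names are assumptions made for illustration, not part of the original notebook.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.model_selection import train_test_split

# Synthetic regression problem (illustrative only)
X_reg, y_reg = make_regression(n_samples=500, n_features=10, noise=15.0, random_state=42)
Xr_train, Xr_test, yr_train, yr_test = train_test_split(
    X_reg, y_reg, test_size=0.2, random_state=42
)

stacking_reg = StackingRegressor(
    estimators=[
        ('ridge', Ridge(alpha=1.0)),
        ('rf', RandomForestRegressor(n_estimators=100, random_state=42)),
    ],
    final_estimator=RidgeCV(),  # linear meta-model; RidgeCV is also sklearn's default here
    cv=5,                       # out-of-fold predictions, as in the classifier
)
stacking_reg.fit(Xr_train, yr_train)

r2_stack = stacking_reg.score(Xr_test, yr_test)
print(f"Stacking test R^2: {r2_stack:.4f}")

# Each base model contributes one meta-feature column
print(f"Meta-feature shape: {stacking_reg.transform(Xr_test).shape}")
```

There is no `stack_method='predict_proba'` choice for regressors: the meta-features are simply each base model's predicted values.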

## 3. Manual Stacking Implementation

Let's implement stacking from scratch to understand exactly how it works.

In [None]:
from sklearn.base import clone

def get_out_of_fold_predictions(models, X_train, y_train, X_test, n_folds=5):
    """
    Generate out-of-fold predictions for stacking.
    
    Returns:
        train_meta: Out-of-fold predictions on training data (for meta-model training)
        test_meta: Test-set predictions averaged over the fold models (for meta-model testing)
    """
    n_train = X_train.shape[0]
    n_test = X_test.shape[0]
    n_models = len(models)
    
    # Initialize arrays for meta-features
    train_meta = np.zeros((n_train, n_models))
    test_meta = np.zeros((n_test, n_models))
    
    # Create stratified folds
    skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=RANDOM_STATE)
    
    # For each model
    for model_idx, (name, model) in enumerate(models):
        print(f"\nProcessing {name}...")
        test_preds_per_fold = np.zeros((n_test, n_folds))
        
        # For each fold
        for fold_idx, (train_idx, val_idx) in enumerate(skf.split(X_train, y_train)):
            # Split data
            X_tr, X_val = X_train[train_idx], X_train[val_idx]
            y_tr, y_val = y_train[train_idx], y_train[val_idx]
            
            # Train a fresh, unfitted copy of the model on this fold's training data
            # (clone also handles pipelines and nested estimators correctly)
            model_clone = clone(model)
            model_clone.fit(X_tr, y_tr)
            
            # Predict on validation fold (out-of-fold)
            if hasattr(model_clone, 'predict_proba'):
                val_pred = model_clone.predict_proba(X_val)[:, 1]
            else:
                val_pred = model_clone.predict(X_val)
            
            # Store out-of-fold predictions
            train_meta[val_idx, model_idx] = val_pred
            
            # Predict on test data
            if hasattr(model_clone, 'predict_proba'):
                test_pred = model_clone.predict_proba(X_test)[:, 1]
            else:
                test_pred = model_clone.predict(X_test)
            
            test_preds_per_fold[:, fold_idx] = test_pred
        
        # Average test predictions across folds
        test_meta[:, model_idx] = test_preds_per_fold.mean(axis=1)
        
        print(f"  Train meta-features shape: {train_meta[:, model_idx].shape}")
        print(f"  Test meta-features shape: {test_meta[:, model_idx].shape}")
    
    return train_meta, test_meta

print("Manual stacking function defined.")

In [None]:
# Define base models for manual stacking
manual_base_models = [
    ('Random Forest', RandomForestClassifier(n_estimators=100, random_state=RANDOM_STATE)),
    ('Gradient Boosting', GradientBoostingClassifier(n_estimators=100, random_state=RANDOM_STATE)),
    ('SVM', SVC(probability=True, random_state=RANDOM_STATE))
]

print("Generating out-of-fold predictions...")
print("=" * 70)

# Get out-of-fold predictions
train_meta_features, test_meta_features = get_out_of_fold_predictions(
    manual_base_models, X_train, y_train, X_test, n_folds=5
)

print(f"\n✅ Meta-features generated!")
print(f"Training meta-features: {train_meta_features.shape}")
print(f"Test meta-features: {test_meta_features.shape}")

In [None]:
# Visualize meta-features
meta_df = pd.DataFrame(
    train_meta_features,
    columns=[name for name, _ in manual_base_models]
)
meta_df['True Label'] = y_train

print("\nSample meta-features:")
print(meta_df.head(10))

# Correlation between base model predictions
corr = meta_df.drop('True Label', axis=1).corr()

plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=True, fmt='.3f', cmap='coolwarm', 
            square=True, linewidths=1, cbar_kws={'label': 'Correlation'})
plt.title('Base Model Prediction Correlations', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

print("\nLower correlation suggests more diversity, which usually helps stacking as long as each model is still accurate.")

In [None]:
# Train meta-model
print("Training meta-model...")

meta_model_manual = LogisticRegression(random_state=RANDOM_STATE)
meta_model_manual.fit(train_meta_features, y_train)

# Predict on test set
manual_stacking_pred = meta_model_manual.predict(test_meta_features)
manual_stacking_proba = meta_model_manual.predict_proba(test_meta_features)[:, 1]
manual_stacking_acc = accuracy_score(y_test, manual_stacking_pred)
manual_stacking_auc = roc_auc_score(y_test, manual_stacking_proba)

print(f"\nManual Stacking Results:")
print(f"Test accuracy: {manual_stacking_acc:.4f}")
print(f"Test AUC-ROC: {manual_stacking_auc:.4f}")

# Show meta-model coefficients
print(f"\nMeta-model learned weights:")
for (name, _), coef in zip(manual_base_models, meta_model_manual.coef_[0]):
    print(f"  {name:20s}: {coef:+.4f}")
print(f"  {'Intercept':20s}: {meta_model_manual.intercept_[0]:+.4f}")

## 4. Blending vs Stacking

### Blending

Simpler alternative to stacking:
1. Split data: Train (60%), Blend (20%), Test (20%)
2. Train base models on Train set
3. Predict on Blend set
4. Train meta-model on Blend predictions
5. Predict on Test set

### Differences

| Aspect | Stacking | Blending |
|--------|----------|----------|
| **Data usage** | Cross-validation | Holdout set |
| **Training data** | All data used | Some data held out |
| **Complexity** | Higher | Lower |
| **Overfitting risk** | Lower (CV) | Higher (single split) |
| **Computation** | Slower (multiple folds) | Faster |
| **Stability** | More stable | Less stable |

**Recommendation**: Use stacking unless computational cost is prohibitive.

In [None]:
# Implement blending
print("Implementing Blending...\n")

# Split data: Train (60%), Blend (20%), Test (20%)
X_train_blend, X_blend, y_train_blend, y_blend = train_test_split(
    X_train, y_train, test_size=0.25, random_state=RANDOM_STATE
)  # 0.25 of 80% = 20% of total

print(f"Train set: {len(X_train_blend)} samples")
print(f"Blend set: {len(X_blend)} samples")
print(f"Test set: {len(X_test)} samples")

# Train base models on train set
blend_models = [
    RandomForestClassifier(n_estimators=100, random_state=RANDOM_STATE),
    GradientBoostingClassifier(n_estimators=100, random_state=RANDOM_STATE),
    SVC(probability=True, random_state=RANDOM_STATE)
]

blend_meta_train = np.zeros((len(X_blend), len(blend_models)))
blend_meta_test = np.zeros((len(X_test), len(blend_models)))

for i, model in enumerate(blend_models):
    print(f"\nTraining model {i+1}/{len(blend_models)}...")
    
    # Train on train set
    model.fit(X_train_blend, y_train_blend)
    
    # Predict on blend set (for meta-model training)
    blend_meta_train[:, i] = model.predict_proba(X_blend)[:, 1]
    
    # Predict on test set
    blend_meta_test[:, i] = model.predict_proba(X_test)[:, 1]

# Train meta-model on blend set
meta_model_blend = LogisticRegression(random_state=RANDOM_STATE)
meta_model_blend.fit(blend_meta_train, y_blend)

# Predict on test set
blending_pred = meta_model_blend.predict(blend_meta_test)
blending_acc = accuracy_score(y_test, blending_pred)
blending_auc = roc_auc_score(y_test, meta_model_blend.predict_proba(blend_meta_test)[:, 1])

print(f"\n✅ Blending Results:")
print(f"Test accuracy: {blending_acc:.4f}")
print(f"Test AUC-ROC: {blending_auc:.4f}")

In [None]:
# Compare Stacking vs Blending
comparison = pd.DataFrame([
    {'Method': 'Stacking (CV)', 'Accuracy': manual_stacking_acc, 'AUC': manual_stacking_auc},
    {'Method': 'Blending', 'Accuracy': blending_acc, 'AUC': blending_auc}
])

print("\n" + "=" * 70)
print("Stacking vs Blending Comparison")
print("=" * 70)
print(comparison.to_string(index=False))

print("\n💡 Stacking typically outperforms blending due to better data usage.")
print("   Blending is faster but uses less training data.")

## 5. Multi-Level Stacking

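Stacked ensembles can themselves serve as base models for another stacking layer. Extra levels do not always help, as the warning printed after the code notes.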

In [None]:
# 3-Level Stacking
print("Building 3-Level Stacking Ensemble...\n")

# Level 0: Diverse base models
level_0 = [
    ('rf', RandomForestClassifier(n_estimators=100, random_state=RANDOM_STATE)),
    ('gb', GradientBoostingClassifier(n_estimators=100, random_state=RANDOM_STATE)),
    ('svm', SVC(probability=True, random_state=RANDOM_STATE)),
    ('knn', KNeighborsClassifier(n_neighbors=7))
]

# Level 1: Intermediate models that combine Level 0
level_1_model_1 = StackingClassifier(
    estimators=level_0[:2],  # RF + GB
    final_estimator=LogisticRegression(random_state=RANDOM_STATE),
    cv=5
)

level_1_model_2 = StackingClassifier(
    estimators=level_0[2:],  # SVM + KNN
    final_estimator=LogisticRegression(random_state=RANDOM_STATE),
    cv=5
)

level_1 = [
    ('stack_1', level_1_model_1),
    ('stack_2', level_1_model_2)
]

# Level 2: Final meta-model
level_2 = LogisticRegression(random_state=RANDOM_STATE)

# Create 3-level stacking
multilevel_stack = StackingClassifier(
    estimators=level_1,
    final_estimator=level_2,
    cv=5
)

print("Training 3-level stacking ensemble...")
start = time.time()
multilevel_stack.fit(X_train, y_train)
multilevel_time = time.time() - start

multilevel_pred = multilevel_stack.predict(X_test)
multilevel_acc = accuracy_score(y_test, multilevel_pred)
multilevel_auc = roc_auc_score(y_test, multilevel_stack.predict_proba(X_test)[:, 1])

print(f"\nTraining time: {multilevel_time:.2f} seconds")
print(f"Test accuracy: {multilevel_acc:.4f}")
print(f"Test AUC-ROC: {multilevel_auc:.4f}")

print("\n⚠️ Warning: More levels ≠ always better!")
print("   - Increases complexity and training time")
print("   - Risk of overfitting")
print("   - Diminishing returns")

## 6. Choosing Meta-Model: Linear vs Non-Linear

In [None]:
# Test different meta-models
meta_models_to_test = [
    ('Logistic Regression', LogisticRegression(random_state=RANDOM_STATE)),
    ('Ridge', Ridge(random_state=RANDOM_STATE)),
    ('Random Forest', RandomForestClassifier(n_estimators=50, max_depth=3, random_state=RANDOM_STATE)),
    ('Gradient Boosting', GradientBoostingClassifier(n_estimators=50, max_depth=2, random_state=RANDOM_STATE))
]

print("Testing different meta-models...\n")
meta_results = []

for meta_name, meta_clf in meta_models_to_test:
    # Create stacking classifier with this meta-model
    if meta_name == 'Ridge':
        # Ridge is a regressor, so fit it directly on the Section 3 meta-features
        # (three base models) and threshold its continuous output at 0.5
        meta_clf.fit(train_meta_features, y_train)
        scores = meta_clf.predict(test_meta_features)
        pred = (scores > 0.5).astype(int)
        acc = accuracy_score(y_test, pred)
        auc = roc_auc_score(y_test, scores)  # AUC only needs a ranking, not probabilities
    else:
        stack = StackingClassifier(
            estimators=base_models,
            final_estimator=meta_clf,
            cv=5
        )
        stack.fit(X_train, y_train)
        pred = stack.predict(X_test)
        acc = accuracy_score(y_test, pred)
        
        if hasattr(stack, 'predict_proba'):
            proba = stack.predict_proba(X_test)[:, 1]
            auc = roc_auc_score(y_test, proba)
        else:
            auc = np.nan
    
    meta_results.append({
        'Meta-Model': meta_name,
        'Accuracy': acc,
        'AUC': auc
    })
    
    print(f"{meta_name:20s} - Accuracy: {acc:.4f}, AUC: {auc:.4f}")

df_meta = pd.DataFrame(meta_results)

print("\n💡 Insights:")
print("   - Linear meta-models (Logistic, Ridge) often work best")
print("   - Base models already capture complexity")
print("   - Non-linear meta-models can overfit")
print("   - Keep meta-model simple!")

## 7. Feature Passthrough

Pass original features to meta-model along with base predictions.

In [None]:
# Stacking with feature passthrough
stack_passthrough = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(random_state=RANDOM_STATE, max_iter=1000),
    cv=5,
    passthrough=True  # Pass original features to meta-model
)

print("Training stacking with feature passthrough...")
stack_passthrough.fit(X_train, y_train)

passthrough_pred = stack_passthrough.predict(X_test)
passthrough_acc = accuracy_score(y_test, passthrough_pred)
passthrough_auc = roc_auc_score(y_test, stack_passthrough.predict_proba(X_test)[:, 1])

print(f"\nWith passthrough:")
print(f"  Accuracy: {passthrough_acc:.4f}")
print(f"  AUC-ROC: {passthrough_auc:.4f}")

print(f"\nWithout passthrough:")
print(f"  Accuracy: {stacking_acc:.4f}")
print(f"  AUC-ROC: {stacking_auc:.4f}")

print("\n💡 Feature passthrough can help if:")
print("   - Base models might miss some patterns")
print("   - Original features have direct predictive power")
print("   - You have enough data to prevent overfitting")

## Exercises

### Exercise 1: Gradient Boosting Ensemble

Create stacking ensemble using only gradient boosting libraries:

1. Use as base models:
   - XGBoost (if available)
   - LightGBM (if available)
   - CatBoost (if available)
   - Sklearn GradientBoostingClassifier
2. Tune each base model individually for best performance
3. Use different hyperparameters for diversity
4. Test different meta-models:
   - Logistic Regression
   - Linear SVM
   - Another gradient boosting model
5. Compare:
   - Stacking vs best individual model
   - Correlation between base predictions
   - Improvement from stacking

In [None]:
# Your code here


### Exercise 2: Optimal Base Model Selection

Determine which combination of base models produces best stacking:

1. Create pool of candidate models (8-10 different models)
2. For each subset of 3-5 models:
   - Create stacking ensemble
   - Evaluate with cross-validation
   - Measure prediction correlation
3. Analyze:
   - Which combinations work best?
   - Importance of model diversity
   - Individual model strength vs diversity trade-off
4. Find optimal number of base models
5. Test hypothesis: "More diverse = better stacking"

In [None]:
# Your code here


### Exercise 3: Stacking on Imbalanced Data

Test stacking on severely imbalanced classification:

1. Create imbalanced dataset (95:5 ratio)
2. Compare strategies:
   - Stacking with standard models
   - Stacking with balanced base models (class_weight)
   - Stacking with resampling
   - Weighted meta-model
3. Evaluate using appropriate metrics:
   - Precision-Recall AUC
   - F1 score
   - Balanced accuracy
4. Determine best approach for imbalanced stacking

In [None]:
# Your code here


### Exercise 4: Custom Stacking with Feature Engineering

Enhance stacking with custom meta-features:

1. Generate standard out-of-fold predictions
2. Create additional meta-features:
   - Prediction variance across models
   - Agreement score (how many models agree)
   - Prediction confidence (max probability)
   - Distance from decision boundary
   - Model-specific confidence metrics
3. Train meta-model on expanded features
4. Compare with standard stacking
5. Analyze which meta-features are most useful

In [None]:
# Your code here


## Summary

### Key Concepts

1. **Stacking = Meta-Learning**:
   - Combine multiple models optimally
   - Meta-model learns how to weight base predictions
   - Captures complementary strengths
   - Often achieves better performance than any individual model

2. **Critical: Prevent Overfitting**:
   - **NEVER** train the meta-model on predictions the base models made for rows they were trained on
   - Use out-of-fold predictions (cross-validation)
   - Each sample predicted by models that didn't see it
   - Ensures realistic meta-features

3. **Stacking vs Blending**:
   - **Stacking**: Cross-validation, uses all data, more stable
   - **Blending**: Holdout set, simpler, faster, less data efficient
   - Prefer stacking for better performance
   - Use blending if computation is constrained

4. **Base Model Selection**:
   - **Diversity is key**: Different algorithm families
   - Include linear and non-linear models
   - Vary hyperparameters for more diversity
   - 3-7 base models typically optimal
   - Check correlation - lower is better

5. **Meta-Model Selection**:
   - **Simple is often best**: Logistic Regression, Ridge
   - Base models already capture complexity
   - Non-linear meta-models risk overfitting
   - Linear models provide interpretability

### Best Practices

1. **Data Requirements**:
   - Need sufficient data (1000+ samples minimum)
   - More data → can handle more complex stacking
   - Small datasets → risk of overfitting

2. **Base Model Strategy**:
   - Start with 3-5 diverse models
   - Include at least one linear model
   - Tune base models individually first
   - Ensure models are sufficiently different

3. **Cross-Validation**:
   - Use 5-10 folds for out-of-fold predictions
   - Stratified folds for classification
   - More folds give each fold model more training data, but training takes longer
   - Ensure reproducibility (set random_state)

4. **Meta-Model Training**:
   - Start simple (Logistic Regression)
   - Use probabilities, not hard predictions
   - Try feature passthrough if appropriate
   - Regularization helps prevent overfitting

5. **Multi-Level Stacking**:
   - 2 levels usually sufficient
   - 3+ levels rarely worth complexity
   - Each level increases overfitting risk
   - Diminishing returns
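
The cross-validation advice above can be applied to `StackingClassifier` by passing a splitter object instead of an integer. With an integer `cv`, sklearn uses stratified folds for classifiers but does not shuffle them, so an explicit `StratifiedKFold` is one way to get both shuffling and a fixed seed. The sketch below uses a synthetic imbalanced dataset and two small base models chosen purely for illustration; the `_cv` names are not from the notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with roughly an 80:20 class split (illustrative only)
X_cv, y_cv = make_classification(
    n_samples=600, n_features=10, weights=[0.8, 0.2], random_state=42
)

# Explicit splitter: stratified, shuffled, and reproducible
cv_splitter = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Stratification keeps the positive rate of each validation fold close to the overall rate
fold_rates = [y_cv[val_idx].mean() for _, val_idx in cv_splitter.split(X_cv, y_cv)]
print(f"Overall positive rate: {y_cv.mean():.3f}")
print(f"Per-fold positive rates: {np.round(fold_rates, 3)}")

stack_cv = StackingClassifier(
    estimators=[
        ('tree', DecisionTreeClassifier(max_depth=3, random_state=42)),
        ('knn', KNeighborsClassifier(n_neighbors=5)),
    ],
    final_estimator=LogisticRegression(),
    cv=cv_splitter,                # used to build the out-of-fold meta-features
    stack_method='predict_proba',  # probabilities rather than hard labels
)
stack_cv.fit(X_cv, y_cv)
print(f"Fitted base models: {len(stack_cv.estimators_)}")
```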

### Common Mistakes

❌ **Training meta-model on in-sample predictions**
  → Use out-of-fold predictions!

❌ **Using highly correlated base models**
  → Ensure diversity in model types

❌ **Over-complex meta-model**
  → Keep it simple (linear often best)

❌ **Too many base models**
  → 3-7 models, quality over quantity

❌ **Ignoring computational cost**
  → Stacking is slower than single models

❌ **Not validating properly**
  → Always use separate test set

### When to Use Stacking

**Best for**:
- ✅ Kaggle competitions (squeeze last 0.5-2%)
- ✅ High-stakes predictions (medical, finance)
- ✅ Have diverse strong base models
- ✅ Sufficient training data (1000+ samples)
- ✅ Computational resources available

**Avoid when**:
- ❌ Small datasets (<1000 samples)
- ❌ Need interpretability
- ❌ Production complexity is a concern
- ❌ Limited computation
- ❌ Single model already excellent (>98% accuracy)

### Performance Expectations

Typical improvements from stacking:
- **Accuracy**: +0.5% to +2%
- **AUC**: +0.01 to +0.03
- **F1 Score**: +1% to +3%

Not dramatic, but can be decisive in:
- Competitions
- Medical diagnosis
- Financial predictions
- High-value decisions

### Production Considerations

1. **Complexity**: Multiple models to deploy
2. **Latency**: Slower predictions (all models + meta)
3. **Maintenance**: Update all models, not just one
4. **Dependencies**: Multiple libraries potentially
5. **Debugging**: Harder to diagnose issues

**Tip**: Carefully weigh accuracy gain vs operational complexity!

### What's Next?

In **Module 09: Voting Classifiers and Regressors**, we'll explore:
- Simpler alternative to stacking
- Hard voting vs soft voting
- Weighted voting strategies
- When voting beats stacking
- Optimal weight finding

### Additional Resources

- **Paper**: "Stacked Generalization" (Wolpert, 1992)
- **Paper**: "Issues in Stacked Generalization" (Ting & Witten, 1999)
- **sklearn**: [Stacking Documentation](https://scikit-learn.org/stable/modules/ensemble.html#stacking)
- **Tutorial**: [Kaggle Ensembling Guide](https://mlwave.com/kaggle-ensembling-guide/)
- **Book**: "Ensemble Methods" by Zhi-Hua Zhou