# Stacking Ensemble Model


Stacking (also called stacked generalization) is an ensemble learning technique that combines multiple classification or regression models via a **meta-learner**.

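- The base models are trained on the training data, and a meta-model is then trained on their predictions to make the final prediction. Ideally those predictions are out-of-fold, so the meta-model never sees a base model's output for a sample that model was fitted on.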

**Here's how stacking works:**

1. Train several base models on the original training data

2. Generate predictions from these base models on held-out data (a validation set or out-of-fold splits)

3. Use these predictions as features to train a higher-level meta-model

4. Use this meta-model to make final predictions on test data
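The four steps above can be sketched compactly with scikit-learn's `cross_val_predict`, which returns out-of-fold predictions and so plays the role of the validation set in step 2. This is only an illustrative sketch: the synthetic data from `make_classification`, the choice of base models, and every `demo_` name are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data (assumption: stands in for a real dataset)
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

demo_base_models = [
    RandomForestClassifier(n_estimators=50, random_state=0),
    KNeighborsClassifier(n_neighbors=5),
]

# Steps 1-2: out-of-fold class-1 probabilities, one column per base model
oof = np.column_stack([
    cross_val_predict(m, X_tr, y_tr, cv=5, method="predict_proba")[:, 1]
    for m in demo_base_models
])

# Step 3: train the meta-model on the out-of-fold predictions
demo_meta = LogisticRegression().fit(oof, y_tr)

# Step 4: refit each base model on all training data, then stack its test predictions
test_feats = np.column_stack([
    m.fit(X_tr, y_tr).predict_proba(X_te)[:, 1] for m in demo_base_models
])
demo_preds = demo_meta.predict(test_feats)

print(oof.shape, test_feats.shape, demo_preds.shape)
```

Each column of `oof` holds one base model's predictions, so the meta-model has as many input features as there are base models.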



# Key Benefits of Stacking



- Improved prediction accuracy: By combining multiple models, stacking can often achieve better performance than any single model.
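- Reduction in overfitting: The meta-learner learns how much weight to give each base model and can compensate for their individual biases, and training it on held-out predictions keeps it from simply memorizing the base models' training fit.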
- Utilization of diverse models: Stacking works best when using models with different underlying assumptions and learning algorithms, as they capture different aspects of the data.



# Tips for Effective Stacking



- Use **diverse base models that have different strengths and weaknesses**

- Ensure **base models are not highly correlated** in their predictions

- **Cross-validation** is important **to prevent leakage of information**: build the meta-features from predictions on folds the base model was not trained on

- The meta-model doesn't need to be complex; a simple model like logistic regression often works well
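One rough way to check the correlation tip is to compare the out-of-fold predicted probabilities of the candidate base models. The sketch below uses synthetic data and an arbitrary trio of models, all of which are assumptions for illustration. A pairwise Pearson correlation close to 1 suggests two models contribute largely the same information to the meta-model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import GaussianNB

# Synthetic data (assumption: placeholder for the real training set)
X_corr, y_corr = make_classification(n_samples=300, n_features=8, random_state=1)

# Hypothetical candidate base models
candidates = {
    "rf": RandomForestClassifier(n_estimators=50, random_state=1),
    "nb": GaussianNB(),
    "lr": LogisticRegression(max_iter=1000),
}

# Out-of-fold class-1 probabilities, one column per candidate
oof_probs = np.column_stack([
    cross_val_predict(model, X_corr, y_corr, cv=5, method="predict_proba")[:, 1]
    for model in candidates.values()
])

# Pairwise Pearson correlation between the models' predictions
corr_matrix = np.corrcoef(oof_probs, rowvar=False)
print(list(candidates))
print(np.round(corr_matrix, 2))
```

Correlation of probabilities is only one possible diagnostic. Comparing which samples each model misclassifies is another option.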

In [7]:
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

In [2]:

# Load a sample dataset
data = load_breast_cancer()
X, y = data.data, data.target

In [8]:
pd.DataFrame(X).head()

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | 0.07871 | ... | 25.38 | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 |
| 1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 24.99 | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 |
| 2 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | 0.05999 | ... | 23.57 | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | 0.09744 | ... | 14.91 | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 |
| 4 | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | 0.05883 | ... | 22.54 | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 |


In [9]:
y

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0,
       1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0,
       1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1,
       1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0,
       0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1,
       1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0,
       0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0,
       1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1,
       1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0,

In [10]:

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define the base models
base_models = [
    RandomForestClassifier(n_estimators=100, random_state=42),
    SVC(probability=True, random_state=42),
    LogisticRegression(random_state=42)
]


In [11]:

# Define the meta-model
meta_model = LogisticRegression(random_state=42)

In [12]:

# Implement stacking from scratch
def stacking_ensemble(base_models, meta_model, X_train, y_train, X_test, n_folds=5):
    # Create out-of-fold predictions for training meta-model
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=42)
    meta_features_train = np.zeros((X_train.shape[0], len(base_models)))

    # For each base model, create out-of-fold predictions for training data
    for i, model in enumerate(base_models):
        for train_idx, val_idx in kf.split(X_train):
            # Split data
            X_train_fold, X_val_fold = X_train[train_idx], X_train[val_idx]
            y_train_fold = y_train[train_idx]

            # Train model
            model.fit(X_train_fold, y_train_fold)

            # Create predictions for validation fold
            preds = model.predict_proba(X_val_fold)[:, 1]
            meta_features_train[val_idx, i] = preds

    # Now train all base models on full training data
    meta_features_test = np.zeros((X_test.shape[0], len(base_models)))
    for i, model in enumerate(base_models):
        model.fit(X_train, y_train)
        meta_features_test[:, i] = model.predict_proba(X_test)[:, 1]

    # Train meta-model on the meta-features
    meta_model.fit(meta_features_train, y_train)

    # Make predictions with the meta-model
    final_predictions = meta_model.predict(meta_features_test)

    return final_predictions, meta_model, base_models

In [13]:
# Fit the stacking ensemble and get predictions
stacked_predictions, trained_meta_model, trained_base_models = stacking_ensemble(
    base_models, meta_model, X_train, y_train, X_test
)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

In [14]:

# Evaluate the stacking ensemble
stacking_accuracy = accuracy_score(y_test, stacked_predictions)
print(f"Stacking Ensemble Accuracy: {stacking_accuracy:.4f}")

Stacking Ensemble Accuracy: 0.9766


In [15]:

# Compare with individual base models
for i, model in enumerate(trained_base_models):
    model_name = model.__class__.__name__
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print(f"{model_name} Accuracy: {accuracy:.4f}")

RandomForestClassifier Accuracy: 0.9708
SVC Accuracy: 0.9357
LogisticRegression Accuracy: 0.9708
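The convergence warnings printed while fitting recommend either raising `max_iter` or scaling the data. A minimal sketch of the scaling route wraps the logistic regression in a pipeline with `StandardScaler`. The names `X_bc`, `scaled_lr`, and `n_convergence` are illustrative and not part of the notebook. The accuracies reported in this notebook were produced without scaling, so the numbers from this sketch may differ.

```python
import warnings

from sklearn.datasets import load_breast_cancer
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Same dataset and split parameters as the notebook, under separate names
X_bc, y_bc = load_breast_cancer(return_X_y=True)
X_bc_train, X_bc_test, y_bc_train, y_bc_test = train_test_split(
    X_bc, y_bc, test_size=0.3, random_state=42
)

# The scaler is fitted only on the data passed to fit(), so no test data leaks in
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(random_state=42))

# Record any warnings raised during fitting instead of printing them
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    scaled_lr.fit(X_bc_train, y_bc_train)

n_convergence = sum(issubclass(w.category, ConvergenceWarning) for w in caught)
print(f"Convergence warnings with scaling: {n_convergence}")
print(f"Scaled LogisticRegression accuracy: {scaled_lr.score(X_bc_test, y_bc_test):.4f}")
```

Because the pipeline is itself an estimator with `fit` and `predict_proba`, it could be used wherever a plain base model is expected.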


In [16]:

# Using scikit-learn's StackingClassifier (a more convenient approach)
from sklearn.ensemble import StackingClassifier

In [17]:

# Define the same models for comparison
estimators = [
    ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
    ('svc', SVC(probability=True, random_state=42)),
    ('lr', LogisticRegression(random_state=42))
]

In [18]:


# Create and train the stacking classifier
sklearn_stacking = StackingClassifier(
    estimators=estimators,
    final_estimator=LogisticRegression(random_state=42),
    cv=5
)

In [19]:

sklearn_stacking.fit(X_train, y_train)
sklearn_predictions = sklearn_stacking.predict(X_test)
sklearn_accuracy = accuracy_score(y_test, sklearn_predictions)
print(f"Scikit-learn StackingClassifier Accuracy: {sklearn_accuracy:.4f}")

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

Scikit-learn StackingClassifier Accuracy: 0.9766
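A fitted `StackingClassifier` also exposes what the meta-model sees. The sketch below rebuilds a similar stack and inspects it with `transform`, `stack_method_`, and `final_estimator_`. Wrapping the SVC and logistic regression in `StandardScaler` pipelines is an assumption added here to sidestep the convergence warnings, and the variable names are illustrative, so results may not match the figure reported in this notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_all, y_all = load_breast_cancer(return_X_y=True)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_all, y_all, test_size=0.3, random_state=42
)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
        # Assumption: scale inputs for the models that are sensitive to feature scale
        ("svc", make_pipeline(StandardScaler(), SVC(probability=True, random_state=42))),
        ("lr", make_pipeline(StandardScaler(), LogisticRegression(random_state=42))),
    ],
    final_estimator=LogisticRegression(random_state=42),
    cv=5,
)
stack.fit(Xs_train, ys_train)

# transform() returns the base-model outputs that are fed to the meta-model
meta_features = stack.transform(Xs_test)
print(meta_features.shape)

# Which prediction method was used for each base estimator
print(stack.stack_method_)

# One meta-model coefficient per base model
print(stack.final_estimator_.coef_)
```

For a binary problem stacked on `predict_proba`, scikit-learn keeps a single probability column per base model, which is why the meta-feature matrix has one column per estimator.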


