## Model Ensembling: Stacking and Blending - Part 04

Ensembling is a machine learning technique in which multiple models (often called "base models" or "base learners") are combined into a single model that typically predicts better than any of its members alone.

**Two popular ensembling techniques are Stacking and Blending.**

**1. Stacking (Stacked Generalization)**

+ Stacking involves training several different models and then training a "meta-model" (also called a "stacker") on their outputs.
+ The idea is that the meta-model learns how best to combine the base models' predictions, relying more on whichever models prove most reliable.

**How Stacking Works**
+ Base Models: Train several different models on the same dataset. These can be of different types, such as Logistic Regression, Decision Tree, or Random Forest.
+ Meta-Model: A meta-model (e.g., another Logistic Regression or a Gradient Boosting model) is trained on the base models' predictions, ideally out-of-fold predictions for rows each base model did not see during fitting. The meta-model learns how to weight those predictions to minimize the overall error.
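
The two steps above can be sketched by hand with `cross_val_predict`. This is only an illustrative sketch: the synthetic data from `make_classification`, the two base learners, and all variable names are assumptions chosen for brevity, not the churn dataset or the models used in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the churn dataset)
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

base_learners = [
    DecisionTreeClassifier(max_depth=3, random_state=0),
    RandomForestClassifier(n_estimators=50, random_state=0),
]

# Step 1: out-of-fold probabilities become the meta-model's training features.
# cross_val_predict clones each model, so every row is predicted by a model
# that did not see that row during fitting.
oof_features = np.column_stack([
    cross_val_predict(m, X_tr, y_tr, cv=5, method="predict_proba")[:, 1]
    for m in base_learners
])

# Step 2: fit the meta-model on those out-of-fold features
manual_meta = LogisticRegression().fit(oof_features, y_tr)

# Step 3: refit each base model on all training rows, then stack test predictions
test_features = np.column_stack([
    m.fit(X_tr, y_tr).predict_proba(X_te)[:, 1] for m in base_learners
])
manual_stack_acc = manual_meta.score(test_features, y_te)
print("Manual stacking accuracy:", manual_stack_acc)
```

`StackingClassifier` packages the same idea behind a single `fit` call, which is usually the more convenient choice.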

**Explanation:**
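+ Base Models: Random Forest, Gradient Boosting, and SVM are used as base models in the example below.
+ Meta-Model: A Logistic Regression model serves as the meta-model that combines the predictions of the base models.
+ Cross-Validation (cv=5): `StackingClassifier` builds the meta-model's training features from 5-fold out-of-fold predictions, so the meta-model never sees a prediction made on a row the base model was fitted on. This limits overfitting of the meta-model.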

```python
## import required libraries here
import pandas as pd
import numpy as np

from sklearn.ensemble import StackingClassifier
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report
```

```python
## stacking algorithms
# Load your data
data = pd.read_csv('processed_customer_data.csv')
features = data.drop(['Churn', 'customerID'], axis=1)
target = data['Churn']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)

# Define the base models
base_models = [
    ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
    ('gb', GradientBoostingClassifier(n_estimators=100, random_state=42)),
    ('svc', SVC(probability=True, random_state=42))
]

# Define the meta-model
meta_model = LogisticRegression()

# Create the stacking classifier
stacking_clf = StackingClassifier(estimators=base_models, final_estimator=meta_model, cv=5)

# Train the stacking classifier
stacking_clf.fit(X_train, y_train)

# Make predictions on the test set
y_pred = stacking_clf.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)
print("Stacking Model Accuracy:", accuracy)
print("Classification Report for Stacking Model:\n", report)
```

```text
Stacking Model Accuracy: 0.8055358410220014
Classification Report for Stacking Model:
               precision    recall  f1-score   support

           0       0.84      0.91      0.87      1036
           1       0.68      0.51      0.58       373

    accuracy                           0.81      1409
   macro avg       0.76      0.71      0.73      1409
weighted avg       0.79      0.81      0.80      1409
```
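
To see how the meta-model combines the base models, a fitted `StackingClassifier` can be inspected. The sketch below is an illustration on synthetic data with two small base models (all names and data here are assumptions, not this notebook's churn setup). It relies on `transform`, which returns the base-model outputs fed to the meta-model, and on `final_estimator_`, the fitted meta-model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the churn dataset)
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=1)

demo_stack = StackingClassifier(
    estimators=[
        ("tree", DecisionTreeClassifier(max_depth=3, random_state=1)),
        ("rf", RandomForestClassifier(n_estimators=30, random_state=1)),
    ],
    final_estimator=LogisticRegression(),
    cv=5,
)
demo_stack.fit(X_demo, y_demo)

# For a binary problem, each base model contributes one probability column
meta_features = demo_stack.transform(X_demo)

# Pair each base model's name with the coefficient the meta-model assigned to it
names = [name for name, _ in demo_stack.estimators]
weights = dict(zip(names, demo_stack.final_estimator_.coef_[0]))
print(meta_features.shape)
print(weights)
```

A larger coefficient suggests the meta-model leans more on that base model, though correlated base models can make individual coefficients hard to interpret.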



**2. Blending**

+ Blending is similar to stacking but slightly simpler.
+ It typically involves creating a holdout set (a small subset of the training data) to train the meta-model, rather than using cross-validation.

**How Blending Works:**
+ Split Data: Split the training data into two parts (e.g., 80% and 20%).
+ Train Base Models: Train the base models on the larger part (e.g., 80%).
+ Generate Predictions: Make predictions on the smaller holdout set (e.g., 20%).
+ Meta-Model Training: Train a meta-model on the predictions from the base models on the holdout set.
+ Final Prediction: Run the base models on the test set and pass their predictions to the meta-model to obtain the final predictions.
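
The five steps above can be wrapped in one small function. This is a minimal sketch under stated assumptions: `blend_fit_predict` is a hypothetical helper (not a scikit-learn API), the data is synthetic, and the test set is split off before the holdout so that the two never overlap.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def blend_fit_predict(models, meta, X_fit, y_fit, X_hold, y_hold, X_new):
    """Hypothetical helper: fit base models, fit the meta-model on their
    holdout predictions, then predict labels for X_new."""
    for m in models:
        m.fit(X_fit, y_fit)
    hold_feats = np.column_stack([m.predict_proba(X_hold)[:, 1] for m in models])
    meta.fit(hold_feats, y_hold)
    new_feats = np.column_stack([m.predict_proba(X_new)[:, 1] for m in models])
    return meta.predict(new_feats)


# Synthetic stand-in data (an assumption; not the churn dataset)
X_all, y_all = make_classification(n_samples=1000, n_features=10, random_state=0)

# Split off the test set first, then carve the holdout out of what remains
X_rest, X_new, y_rest, y_new = train_test_split(X_all, y_all, test_size=0.2, random_state=0)
X_fit, X_hold, y_fit, y_hold = train_test_split(X_rest, y_rest, test_size=0.2, random_state=0)

blend_pred = blend_fit_predict(
    [DecisionTreeClassifier(max_depth=3, random_state=0),
     RandomForestClassifier(n_estimators=50, random_state=0)],
    LogisticRegression(),
    X_fit, y_fit, X_hold, y_hold, X_new,
)
print("Blend accuracy:", (blend_pred == y_new).mean())
```

Keeping the three row sets disjoint is the important design choice here: the meta-model must be fitted on rows the base models did not train on, and evaluated on rows that neither stage has seen.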

**Explanation**
+ Base Models: Random Forest, Gradient Boosting, and SVM.
+ Holdout Set: Used to create predictions for the meta-model.
+ Meta-Model: A Logistic Regression model that combines the predictions.

```python
# Load your data
data = pd.read_csv('processed_customer_data.csv')
features = data.drop(['Churn', 'customerID'], axis=1)
target = data['Churn']

# Hold out the test set first (same split as the stacking cell), then carve the
# blending holdout out of the remaining training data only. Splitting the full
# dataset again with the same test_size and random_state would make the holdout
# identical to the test set and leak test labels into the meta-model.
X_train_full, X_test, y_train_full, y_test = train_test_split(features, target, test_size=0.2, random_state=42)
X_train, X_holdout, y_train, y_holdout = train_test_split(X_train_full, y_train_full, test_size=0.2, random_state=42)

# Train base models on the main training set
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
gb = GradientBoostingClassifier(n_estimators=100, random_state=42)
gb.fit(X_train, y_train)
svc = SVC(probability=True, random_state=42)
svc.fit(X_train, y_train)

# Generate predictions for the holdout set
rf_pred = rf.predict_proba(X_holdout)[:, 1]
gb_pred = gb.predict_proba(X_holdout)[:, 1]
svc_pred = svc.predict_proba(X_holdout)[:, 1]

# Create new features based on the predictions
X_meta = np.column_stack((rf_pred, gb_pred, svc_pred))

# Train the meta-model on these new features
meta_model = LogisticRegression()
meta_model.fit(X_meta, y_holdout)

# Make final predictions on the test set
rf_pred_test = rf.predict_proba(X_test)[:, 1]
gb_pred_test = gb.predict_proba(X_test)[:, 1]
svc_pred_test = svc.predict_proba(X_test)[:, 1]

X_meta_test = np.column_stack((rf_pred_test, gb_pred_test, svc_pred_test))
y_pred_final = meta_model.predict(X_meta_test)

# Evaluate the blended model
accuracy = accuracy_score(y_test, y_pred_final)
report = classification_report(y_test, y_pred_final)
print("Blended Model Accuracy:", accuracy)
print("Classification Report for Blended Model:\n", report)
```

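Note: this output was recorded with a holdout set identical to the test set (both came from `train_test_split` with `test_size=0.2, random_state=42` on the full dataset), so the meta-model was fitted on the test labels and the scores below are optimistic.

```text
Blended Model Accuracy: 0.8105039034776437
Classification Report for Blended Model:
               precision    recall  f1-score   support

           0       0.84      0.92      0.88      1036
           1       0.69      0.51      0.59       373

    accuracy                           0.81      1409
   macro avg       0.77      0.72      0.73      1409
weighted avg       0.80      0.81      0.80      1409
```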



**Key Differences Between Stacking and Blending:**
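+ Stacking trains the meta-model on cross-validated (out-of-fold) predictions, while Blending trains it on predictions for a single holdout set.
+ Blending is simpler to implement, but its base models never train on the holdout rows and its meta-model sees only the holdout rows, so it uses the training data less fully than Stacking.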
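
A quick back-of-the-envelope count makes the difference concrete. The numbers below are hypothetical (5,000 training rows, a 20% blending holdout, 5 folds) and the fit count assumes the behaviour of scikit-learn's `StackingClassifier`, which fits each base model once per fold plus once on all training rows.

```python
n_train = 5000    # hypothetical number of training rows
holdout_pct = 20  # blending holdout share, as in the 80/20 example
n_folds = 5       # matches cv=5 in the stacking example

# Blending: base models see 80% of the rows, the meta-model sees the other 20%
blend_meta_rows = n_train * holdout_pct // 100
blend_base_rows = n_train - blend_meta_rows

# Stacking: every row yields one out-of-fold prediction for the meta-model,
# and the final base models are refit on all rows
stack_meta_rows = n_train
stack_base_rows = n_train

# Cost: number of times each base model is fitted
blend_fits_per_model = 1
stack_fits_per_model = n_folds + 1

print("Blending rows (base, meta):", blend_base_rows, blend_meta_rows)
print("Stacking rows (base, meta):", stack_base_rows, stack_meta_rows)
print("Fits per base model (blend, stack):", blend_fits_per_model, stack_fits_per_model)
```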


**Which One to Choose?**
+ Stacking is usually more robust but is computationally more intensive, because each base model is fitted once per fold in addition to the final fit.
+ Blending is simpler and quicker but may perform worse if the holdout set is small or not representative of the data.
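
One common way to reduce the risk of an unrepresentative holdout, offered here as a suggestion rather than something this notebook does, is to stratify the split on the target so the holdout keeps the same class balance as the full training set. The labels below are hypothetical and chosen only to make the proportions easy to check.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 75% class 0, 25% class 1
y_imb = np.array([0] * 750 + [1] * 250)
X_imb = np.arange(len(y_imb)).reshape(-1, 1)

# stratify=y_imb preserves the class proportions in both parts of the split
X_main, X_hold_s, y_main, y_hold_s = train_test_split(
    X_imb, y_imb, test_size=0.2, stratify=y_imb, random_state=0
)

holdout_positive_rate = y_hold_s.mean()
print("Positive rate in full data:", y_imb.mean())
print("Positive rate in holdout:  ", holdout_positive_rate)
```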