# Ensemble Methods


In [2]:
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import statistics as st
import warnings
warnings.filterwarnings('ignore')

In [3]:
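# Using the ever so common Titanic dataset (Kaggle CSV files)
x_text = pd.read_csv('test.csv')
# Note: gender_submission.csv is Kaggle's sample submission (a gender-based baseline),
# so accuracies computed against it measure agreement with that baseline, not true outcomes
y_test = pd.read_csv('gender_submission.csv')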

train = pd.read_csv('train.csv')

x_train, y_train  = train.drop('Survived', axis=1), train['Survived']

Ensemble methods work on the principle that a group of weak learners can come together to form a strong learner. The idea is to combine the predictions of multiple models to improve overall performance. Ensemble methods are commonly divided into three main categories:
1. Bagging
2. Stacking
3. Boosting

## Some Terminology
- Base Learner / Base Model / Base Estimator: the individual models used in ensemble methods
- Weak Learner: a model that performs only slightly better than random chance (e.g. around 55% accuracy on a balanced binary problem, where random guessing gives 50%)
- Strong Learner: a model that performs well above random chance (e.g. 80% accuracy on the same problem)
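
To see why combining weak learners can help, the following sketch computes the accuracy of a majority vote over `n_learners` classifiers that are each correct with probability `p`. It assumes the learners' errors are independent, which models trained on the same data rarely satisfy, so treat the numbers as an optimistic illustration. The function name `majority_vote_accuracy` is hypothetical.

```python
from math import comb


def majority_vote_accuracy(n_learners: int, p: float) -> float:
    """Probability that a strict majority of independent learners is correct.

    Assumes a binary problem and that each learner is right with probability p,
    independently of the others (an idealisation).
    """
    k_min = n_learners // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_learners, k) * p**k * (1 - p) ** (n_learners - k)
        for k in range(k_min, n_learners + 1)
    )


# The vote gets more accurate as more (independent) weak learners are added
for n in (1, 11, 101):
    acc = majority_vote_accuracy(n, 0.55)
    print(f"{n:>3} learners at 55% each: vote accuracy {acc:.3f}")
```

With correlated learners the gain is smaller, which is one reason ensembles try to make their base models diverse.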

### Bias Variance Tradeoff

The principle behind many regularisation techniques. The expected prediction error is split into three parts:
- **Bias**: The average difference between the predicted values and the actual values. High-bias models are oversimplified and do not capture the underlying patterns in the data (underfitting). High bias = high error in training: the model predicts poorly even on data it has seen.
- **Variance**: How much the model's prediction for a given data point changes when the model is trained on a different sample of the data. High-variance models are too complex and capture noise in the training data (overfitting). High variance = high error in testing: the model predicts well on seen data but poorly on unseen data.
- **Irreducible Error**: Error that cannot be reduced by model tuning. It is the error introduced by the noise in the data.

We can define Error as:
$\text{Total Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error}$
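
Written out for squared-error loss, the decomposition takes the form below. This is a sketch under assumptions not stated above: the data follow $y = f(x) + \varepsilon$ with $\mathbb{E}[\varepsilon] = 0$ and $\mathrm{Var}(\varepsilon) = \sigma^2$, $\hat{f}$ is the model fitted on a random training set, and the expectation is taken over training sets and noise. The symbols $f$, $\hat{f}$, $\varepsilon$ and $\sigma^2$ are introduced here for illustration.

$$
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
= \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{Bias}^2}
+ \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{Variance}}
+ \underbrace{\sigma^2}_{\text{Irreducible Error}}
$$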

### Reason for using many models
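Each model's error comes from several sources (bias, variance, and noise), and different models tend to make different mistakes on the same data.

Even the same algorithm can yield models with different error rates, for example when trained on different samples of the data. By combining multiple models, we can reduce the total error by averaging out the individual errors, provided those errors are not strongly correlated.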
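
The averaging effect can be checked with a small simulation. The sketch below assumes each model's error is independent Gaussian noise with the same spread, which is an idealisation; under that assumption the variance of the averaged error shrinks roughly as `1 / n_models`. The function name and parameters are hypothetical.

```python
import random
import statistics


def variance_of_average(n_models, n_trials=20000, noise_sd=1.0, seed=0):
    """Estimate the variance of the mean error of n_models independent models."""
    rng = random.Random(seed)
    averages = []
    for _ in range(n_trials):
        # One simulated error per model (independent, zero-mean noise)
        errors = [rng.gauss(0.0, noise_sd) for _ in range(n_models)]
        averages.append(sum(errors) / n_models)
    return statistics.pvariance(averages)


for n in (1, 5, 10):
    print(f"{n:>2} models: variance of averaged error ~ {variance_of_average(n):.3f}")
```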

## Types of Ensemble Methods

Two types in general:
- **Sequential Ensemble Methods**: Models are generated sequentially and each model tries to correct the errors of the previous model. Examples include AdaBoost, Gradient Boosting, and XGBoost.
- **Parallel Ensemble Methods**: Models are generated in parallel and then combined. Examples include Random Forest and Bagging.

Parallel methods can be further divided into:
- **Homogeneous**: All models are of the same type. Example: Random Forest.
- **Heterogeneous**: Models are of different types. Example: Stacking.

## Techniques used in Ensemble Methods

## 1. Bagging (Bootstrap Aggregating)
Bagging is a parallel ensemble method that uses bootstrapping to create multiple datasets from the original dataset. Each dataset is then used to train a model. The final prediction aggregates the predictions of the individual models: a majority vote for classification, or the average for regression.
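
The aggregation step can be sketched on its own. The helper names below are hypothetical, and the snippet assumes the per-model predictions have already been collected into a list. Tie-breaking is not handled specially.

```python
from collections import Counter
from statistics import mean


def aggregate_classification(votes):
    """Return the most common class label among the models' votes."""
    return Counter(votes).most_common(1)[0][0]


def aggregate_regression(predictions):
    """Return the average of the models' numeric predictions."""
    return mean(predictions)


print(aggregate_classification([1, 0, 1]))    # three classifiers voting on one sample
print(aggregate_regression([2.0, 4.0, 3.0]))  # three regressors predicting one value
```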

Bootstrap sampling is a technique in which we create multiple datasets by sampling with replacement from the original dataset. This means that some data points may be repeated in a new dataset, while others may not be included at all.
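
A short sketch of bootstrap sampling on row indices, using only the standard library. When the bootstrap sample has the same size as the original dataset, about 63% of the original rows are expected to appear at least once (the limit is $1 - 1/e$). The helper name `bootstrap_indices` is hypothetical.

```python
import random


def bootstrap_indices(n_samples, rng):
    """Draw n_samples row indices uniformly at random, with replacement."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


rng = random.Random(0)
n = 10_000
idx = bootstrap_indices(n, rng)

# Fraction of original rows that made it into the bootstrap sample
unique_fraction = len(set(idx)) / n
print(f"unique rows in sample: {unique_fraction:.3f}")
print(f"theoretical value:     {1 - (1 - 1 / n) ** n:.3f}")
```

The rows left out of a given sample are often called out-of-bag rows and can serve as a built-in validation set.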

Bagging is often used with decision trees, but it can be used with any model.

It is possible to mix different model types, but typically the same model type is trained on every bootstrap sample.

![BaggingExample](BaggingExample.png)

In [5]:
class SimpleBaggingClassifier:
    """
    A simple implementation of a Bagging Classifier
    
    Parameters:
    -----------
    base_estimator : object
        The base estimator to fit on random subsets of the dataset
    n_estimators : int, default=10
        The number of base estimators in the ensemble
    random_state : int, default=None
        Controls the random resampling of the original dataset
    """
    
    def __init__(self, base_estimator, n_estimators=10, random_state=None):
        self.base_estimator = base_estimator
        self.n_estimators = n_estimators
        self.random_state = random_state
        self.estimators = []
        
    def fit(self, X, y):
        """
        Build a Bagging ensemble of estimators from the training data
        
        Parameters:
        -----------
        X : array-like of shape (n_samples, n_features)
            The training input samples
        y : array-like of shape (n_samples,)
            The target values
        
        Returns:
        --------
        self : object
        """
        # Convert X and y to numpy arrays if they're pandas dataframes
        if hasattr(X, 'values'):
            X = X.values
        if hasattr(y, 'values'):
            y = y.values
            
        n_samples = X.shape[0]
        
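        # Use random_state for reproducibility
        np.random.seed(self.random_state)
        
        # Local import so this cell does not depend on the imports cell
        from sklearn.base import clone
        
        # Start from an empty ensemble so repeated fit() calls do not accumulate estimators
        self.estimators = []
        
        # Create and train each estimator
        for i in range(self.n_estimators):
            # Bootstrap sampling - randomly select samples with replacement
            # This creates a dataset of same size but with duplicates
            indices = np.random.choice(n_samples, size=n_samples, replace=True)
            X_bootstrap = X[indices]
            y_bootstrap = y[indices]
            
            # Clone the base estimator so its hyperparameters (e.g. max_depth) are preserved;
            # type(self.base_estimator)() would silently fall back to the defaults
            estimator = clone(self.base_estimator)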
            
            # Train the estimator on the bootstrap sample
            estimator.fit(X_bootstrap, y_bootstrap)
            
            # Add the trained estimator to our collection
            self.estimators.append(estimator)
            
        return self
    
    def predict(self, X):
        """
        Predict class for X using majority voting
        
        Parameters:
        -----------
        X : array-like of shape (n_samples, n_features)
            The input samples
        
        Returns:
        --------
        y : array-like of shape (n_samples,)
            The predicted classes
        """
        if hasattr(X, 'values'):
            X = X.values
            
        # Get predictions from each estimator
        predictions = np.array([estimator.predict(X) for estimator in self.estimators])
        
        # Transpose to get a samples x estimators matrix
        predictions = predictions.T
        
        # Apply majority voting - take the most common prediction for each sample
        final_predictions = np.array([st.mode(pred) for pred in predictions])
        
        return final_predictions

# Example usage
# Let's prepare the data first
# For simplicity, we'll use only numerical features and handle missing values
# Let's preprocess the data
def preprocess_data(data):
    # Select only numerical features for simplicity
    numerical_features = ['Age', 'Fare', 'Pclass']
    X = data[numerical_features].copy()
    
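    # Fill missing values with mean (assign back instead of chained inplace=True,
    # which newer pandas versions warn about and may not apply to X)
    X['Age'] = X['Age'].fillna(X['Age'].mean())
    X['Fare'] = X['Fare'].fillna(X['Fare'].mean())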
    
    return X

# Preprocess training and test data
X_train_processed = preprocess_data(x_train)
X_test_processed = preprocess_data(x_text)

# Create and train our bagging classifier
# Using Decision Tree as base estimator
base_tree = DecisionTreeClassifier(max_depth=3)
bagging_clf = SimpleBaggingClassifier(base_estimator=base_tree, n_estimators=10, random_state=42)
bagging_clf.fit(X_train_processed, y_train)

# Make predictions
predictions = bagging_clf.predict(X_test_processed)

# Compare with actual values
accuracy = np.mean(predictions == y_test['Survived'])
print(f"Bagging Classifier Accuracy: {accuracy:.4f}")

# Compare with scikit-learn's implementation
from sklearn.ensemble import BaggingClassifier
sklearn_bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(max_depth=3),
    n_estimators=10,
    random_state=42
)
sklearn_bagging.fit(X_train_processed, y_train)
sklearn_predictions = sklearn_bagging.predict(X_test_processed)
sklearn_accuracy = np.mean(sklearn_predictions == y_test['Survived'])
print(f"Scikit-learn's Bagging Classifier Accuracy: {sklearn_accuracy:.4f}")



## 2. Stacking (Stacked Generalization)

Stacking is a parallel ensemble method that combines multiple models using a meta-learner. The meta-learner takes the predictions of the individual models as input and makes the final prediction.

The idea is to use the predictions of the base models as input features for the meta-learner. The meta-learner then learns to combine the predictions of the base models to make the final prediction.
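
One detail matters when building these input features: a base model should not predict rows it was trained on, otherwise the meta-learner sees overly optimistic predictions. A common remedy is to use out-of-fold predictions from k-fold cross-validation. The sketch below illustrates the bookkeeping with a toy "model" that predicts the mean of its training targets. The helper names and the toy model are hypothetical, and the folds are contiguous and unshuffled for readability.

```python
from statistics import mean


def kfold_indices(n_samples, n_folds):
    """Yield (train_idx, val_idx) pairs for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        val_idx = list(range(start, start + size))
        train_idx = list(range(start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


def out_of_fold_predictions(y, n_folds):
    """Predict every row with a toy model fitted without that row's fold."""
    oof = [None] * len(y)
    for train_idx, val_idx in kfold_indices(len(y), n_folds):
        fold_mean = mean(y[i] for i in train_idx)  # "fit" on the training part
        for i in val_idx:
            oof[i] = fold_mean  # "predict" on the held-out part
    return oof


y = [0, 1, 1, 0, 1, 1]
meta_feature = out_of_fold_predictions(y, n_folds=3)
print(meta_feature)  # one meta-feature column, one value per training row
```

Each base model contributes one such column, and the meta-learner is trained on the resulting matrix.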

The base models are often diverse in nature, meaning that they are different types of models or models trained on different subsets of the data. This helps to reduce the correlation between the base models and improve the performance of the ensemble.

The meta-learner can be any model, but it is often a simple linear model or a tree-based model. This keeps the complexity of the ensemble down and leaves most of the inference to the base models.


![StackingExample](StackingExample.png)

In [7]:
class SimpleStackingClassifier:
    """
    A simple implementation of a Stacking Classifier
    
    Parameters:
    -----------
    base_estimators : list
        List of base estimator objects
    meta_estimator : object
        The meta estimator to combine the predictions of base estimators
    n_folds : int, default=5
        Number of folds for cross-validation
    random_state : int, default=None
        Controls the randomness in CV split
    """
    
    def __init__(self, base_estimators, meta_estimator, n_folds=5, random_state=None):
        self.base_estimators = base_estimators
        self.meta_estimator = meta_estimator
        self.n_folds = n_folds
        self.random_state = random_state
        self.trained_base_estimators = []
    
    def fit(self, X, y):
        """
        Fit the stacking ensemble
        
        Parameters:
        -----------
        X : array-like of shape (n_samples, n_features)
            The training input samples
        y : array-like of shape (n_samples,)
            The target values
        
        Returns:
        --------
        self : object
        """
        # Convert X and y to numpy arrays if they're pandas dataframes
        if hasattr(X, 'values'):
            X = X.values
        if hasattr(y, 'values'):
            y = y.values
        
        n_samples = X.shape[0]
        n_estimators = len(self.base_estimators)
        
        # Create array to hold meta-features (predictions from base models)
        meta_features = np.zeros((n_samples, n_estimators))
        
        # Split data for cross-validation
        from sklearn.base import clone
        from sklearn.model_selection import KFold
        kf = KFold(n_splits=self.n_folds, shuffle=True, random_state=self.random_state)
        
        # Start from an empty list so repeated fit() calls do not accumulate estimators
        self.trained_base_estimators = []
        
        # For each base estimator, generate out-of-fold predictions
        for i, estimator in enumerate(self.base_estimators):
            print(f"Training base estimator {i+1}/{n_estimators}: {type(estimator).__name__}")
            
            # We'll collect out-of-fold predictions here
            temp_meta_features = np.zeros(n_samples)
            
            # Create a fresh instance for each fold
            for train_idx, val_idx in kf.split(X):
                # Get training and validation sets for this fold
                X_train_fold, X_val_fold = X[train_idx], X[val_idx]
                y_train_fold, y_val_fold = y[train_idx], y[val_idx]
                
                # Clone the estimator to start fresh while keeping its hyperparameters;
                # type(estimator)() would reset them to the defaults
                fold_estimator = clone(estimator)
                
                # Train on the training portion
                fold_estimator.fit(X_train_fold, y_train_fold)
                
                # Make predictions on validation portion
                temp_meta_features[val_idx] = fold_estimator.predict(X_val_fold)
            
            # Store the meta-features (out-of-fold predictions)
            meta_features[:, i] = temp_meta_features
            
            # Train a final model on all data for future predictions
            final_estimator = clone(estimator)
            final_estimator.fit(X, y)
            self.trained_base_estimators.append(final_estimator)
        
        # Train the meta-estimator using the meta-features
        print(f"Training meta estimator: {type(self.meta_estimator).__name__}")
        self.meta_estimator.fit(meta_features, y)
        
        return self
    
    def predict(self, X):
        """
        Predict class for X using the stacked ensemble
        
        Parameters:
        -----------
        X : array-like of shape (n_samples, n_features)
            The input samples
        
        Returns:
        --------
        y : array-like of shape (n_samples,)
            The predicted classes
        """
        if hasattr(X, 'values'):
            X = X.values
        
        n_samples = X.shape[0]
        n_estimators = len(self.base_estimators)
        
        # Create array to hold meta-features for prediction
        meta_features = np.zeros((n_samples, n_estimators))
        
        # Generate meta-features using trained base estimators
        for i, estimator in enumerate(self.trained_base_estimators):
            meta_features[:, i] = estimator.predict(X)
        
        # Make final predictions using the meta-estimator
        return self.meta_estimator.predict(meta_features)

# Example usage
# Let's use the same preprocessing as in the bagging example
def preprocess_data(data):
    # Select only numerical features for simplicity
    numerical_features = ['Age', 'Fare', 'Pclass']
    X = data[numerical_features].copy()
    
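    # Fill missing values with mean (assign back instead of chained inplace=True,
    # which newer pandas versions warn about and may not apply to X)
    X['Age'] = X['Age'].fillna(X['Age'].mean())
    X['Fare'] = X['Fare'].fillna(X['Fare'].mean())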
    
    return X

# Preprocess training and test data
X_train_processed = preprocess_data(x_train)
X_test_processed = preprocess_data(x_text)

# Create base estimators (weak learners) of different types
base_estimators = [
    DecisionTreeClassifier(max_depth=3),
    KNeighborsClassifier(n_neighbors=5),
    LogisticRegression(max_iter=1000)
]

# Create meta estimator
meta_estimator = LogisticRegression()

# Create and train our stacking classifier
stacking_clf = SimpleStackingClassifier(
    base_estimators=base_estimators,
    meta_estimator=meta_estimator,
    n_folds=5,
    random_state=42
)
stacking_clf.fit(X_train_processed, y_train)

# Make predictions
predictions = stacking_clf.predict(X_test_processed)

# Calculate accuracy
accuracy = np.mean(predictions == y_test['Survived'])
print(f"\nStacking Classifier Accuracy: {accuracy:.4f}")

# Compare with individual models
from sklearn.base import clone
print("\nIndividual Model Performances:")
for i, estimator in enumerate(base_estimators):
    # Train the individual model (cloned so hyperparameters such as max_depth=3 are kept)
    model = clone(estimator)
    model.fit(X_train_processed, y_train)
    # Make predictions
    ind_predictions = model.predict(X_test_processed)
    # Calculate accuracy
    ind_accuracy = np.mean(ind_predictions == y_test['Survived'])
    print(f"{type(model).__name__} Accuracy: {ind_accuracy:.4f}")

# Compare with scikit-learn's StackingClassifier
from sklearn.base import clone
from sklearn.ensemble import StackingClassifier
sklearn_stacking = StackingClassifier(
    # clone() keeps each base estimator's hyperparameters; type(est)() would drop them
    estimators=[(f"est{i}", clone(est)) for i, est in enumerate(base_estimators)],
    final_estimator=LogisticRegression(),
    cv=5
)
sklearn_stacking.fit(X_train_processed, y_train)
sklearn_predictions = sklearn_stacking.predict(X_test_processed)
sklearn_accuracy = np.mean(sklearn_predictions == y_test['Survived'])
print(f"\nScikit-learn's StackingClassifier Accuracy: {sklearn_accuracy:.4f}")



## 3. Boosting

Boosting is a sequential ensemble method that combines multiple weak learners to create a strong learner. The idea is to train the models sequentially, with each model trying to correct the errors of the previous model.

How "correcting the errors" is done depends on the algorithm. Weight-based methods such as AdaBoost assign a weight to every data point and update the weights at each iteration, giving more importance to the data points that were misclassified by the previous models. Gradient Boosting and XGBoost instead fit each new model to the residual errors (the gradient of the loss) of the current ensemble.

The final prediction is made by combining the predictions of all the models. In AdaBoost, more weight is given to the models that perform better on the training data.

Boosting is often used with decision trees, but it can be used with any model.

There are many different boosting algorithms, including AdaBoost, Gradient Boosting, and XGBoost.

![BoostingExample](BoostingExample.png)

### AdaBoost (Adaptive Boosting)

AdaBoost starts with equal weights on all data points. After each round it increases the weights of the data points that the current weak learner misclassified, so that the next learner concentrates on them.

Each learner also receives its own weight, computed from its weighted error rate. The final prediction is a weighted vote of all the learners, with more weight given to the learners that perform better on the training data.
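
A single round of this update can be worked through by hand. The sketch below uses the same formulas as the `SimpleAdaBoostClassifier` code in the next cell (estimator weight `learning_rate * log((1 - error) / error)`, misclassified samples multiplied by `exp(estimator_weight)`, then renormalised). The function name `adaboost_round` and the five-sample toy data are hypothetical.

```python
import math


def adaboost_round(sample_weights, incorrect, learning_rate=1.0):
    """One AdaBoost-style weight update.

    sample_weights : list of floats summing to 1
    incorrect      : list of bools, True where the weak learner was wrong
    Returns (estimator_weight, new_sample_weights).
    """
    # Weighted error of the weak learner
    error = sum(w for w, wrong in zip(sample_weights, incorrect) if wrong)
    if not 0 < error < 0.5:
        raise ValueError("this sketch assumes 0 < error < 0.5")
    # Influence of this learner in the final vote: larger when the error is smaller
    estimator_weight = learning_rate * math.log((1 - error) / error)
    # Up-weight the misclassified samples, then renormalise to sum to 1
    raised = [
        w * math.exp(estimator_weight) if wrong else w
        for w, wrong in zip(sample_weights, incorrect)
    ]
    total = sum(raised)
    return estimator_weight, [w / total for w in raised]


weights = [0.2] * 5                               # uniform start
incorrect = [False, False, True, False, False]    # one sample misclassified
alpha, new_weights = adaboost_round(weights, incorrect)
print(f"estimator weight: {alpha:.3f}")
print("new sample weights:", [round(w, 3) for w in new_weights])
```

With `learning_rate=1.0`, the misclassified samples end up carrying half of the total weight after the update, which forces the next weak learner to pay attention to them.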

In [9]:
class SimpleAdaBoostClassifier:
    """
    A simple implementation of AdaBoost classifier
    
    Parameters:
    -----------
    base_estimator : object
        The base estimator to use for boosting (weak learner)
    n_estimators : int, default=50
        The maximum number of estimators to use
    learning_rate : float, default=1.0
        Weight applied to each classifier at each boosting iteration
    random_state : int, default=None
        Controls the random seed for base estimator's randomness
    """
    
    def __init__(self, base_estimator, n_estimators=50, learning_rate=1.0, random_state=None):
        self.base_estimator = base_estimator
        self.n_estimators = n_estimators
        self.learning_rate = learning_rate
        self.random_state = random_state
        
        # Will store the collection of estimators
        self.estimators = []
        # Will store the weight of each estimator in the final prediction
        self.estimator_weights = []
        
    def fit(self, X, y):
        """
        Build the AdaBoost classifier from the training data
        
        Parameters:
        -----------
        X : array-like of shape (n_samples, n_features)
            The training input samples
        y : array-like of shape (n_samples,)
            The target values (binary; labels other than -1 and 1 are mapped to -1 and 1 internally)
        
        Returns:
        --------
        self : object
        """
        # Convert X and y to numpy arrays if they're pandas dataframes
        if hasattr(X, 'values'):
            X = X.values
        if hasattr(y, 'values'):
            y = y.values
        
        n_samples = X.shape[0]
        
        # Make sure y contains -1 and 1 (required for AdaBoost)
        # For binary classification with 0 and 1, convert 0 to -1
        unique_classes = np.unique(y)
        if len(unique_classes) != 2:
            raise ValueError("AdaBoost requires binary classification")
        
        # Map labels to -1 and 1 if needed
        y_transformed = y.copy()
        if not np.all(np.isin(unique_classes, [-1, 1])):
            # Assume binary classification with labels like 0 and 1
            y_transformed = np.where(y_transformed == unique_classes[0], -1, 1)
            print(f"Mapped classes {unique_classes} to [-1, 1]")
        
        # Initialize sample weights (uniform distribution)
        sample_weights = np.ones(n_samples) / n_samples
        
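        # Local import so this cell does not depend on the imports cell
        from sklearn.base import clone
        
        # Seeded generator for the resampling fallback below
        rng = np.random.RandomState(self.random_state)
        
        # Start from an empty ensemble so repeated fit() calls do not accumulate estimators
        self.estimators = []
        self.estimator_weights = []
        
        # Training loop
        for i in range(self.n_estimators):
            print(f"Training estimator {i+1}/{self.n_estimators}")
            
            # Clone the base estimator so its hyperparameters (e.g. max_depth=1) are preserved;
            # type(self.base_estimator)() would build a full-depth tree instead of a stump
            estimator = clone(self.base_estimator)
            
            # Some estimators in scikit-learn support sample_weight directly
            try:
                estimator.fit(X, y_transformed, sample_weight=sample_weights)
            except TypeError:
                # If sample_weight is not supported, use weighted sampling
                # Sample indices according to sample weights
                indices = rng.choice(
                    n_samples, 
                    size=n_samples, 
                    replace=True, 
                    p=sample_weights
                )
                estimator.fit(X[indices], y_transformed[indices])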
            
            # Get predictions
            predictions = estimator.predict(X)
            
            # Ensure predictions are -1 and 1
            if not np.all(np.isin(np.unique(predictions), [-1, 1])):
                predictions = np.where(predictions == unique_classes[0], -1, 1)
            
            # Calculate the weighted error
            incorrect = predictions != y_transformed
            error = np.sum(sample_weights * incorrect) / np.sum(sample_weights)
            
            # If error is too large (≥0.5), this estimator is no better than random guessing
            if error >= 0.5:
                print(f"  Error rate {error:.4f} ≥ 0.5, stopping early")
                break
            
            # Calculate estimator weight
            estimator_weight = self.learning_rate * np.log((1 - error) / max(error, 1e-10))
            
            # Update sample weights
            sample_weights = sample_weights * np.exp(estimator_weight * incorrect)
            
            # Normalize sample weights to sum to 1
            sample_weights = sample_weights / np.sum(sample_weights)
            
            # Store the estimator and its weight
            self.estimators.append(estimator)
            self.estimator_weights.append(estimator_weight)
            
            print(f"  Error rate: {error:.4f}, Estimator weight: {estimator_weight:.4f}")
        
        return self
    
    def predict(self, X):
        """
        Predict class for X using the weighted estimators
        
        Parameters:
        -----------
        X : array-like of shape (n_samples, n_features)
            The input samples
            
        Returns:
        --------
        y : array-like of shape (n_samples,)
            The predicted classes
        """
        if hasattr(X, 'values'):
            X = X.values
            
        n_samples = X.shape[0]
        
        # Get weighted sum of predictions from all estimators
        weighted_predictions = np.zeros(n_samples)
        
        for estimator, weight in zip(self.estimators, self.estimator_weights):
            predictions = estimator.predict(X)
            
            # Ensure predictions are -1 and 1
            if not np.all(np.isin(np.unique(predictions), [-1, 1])):
                # Convert to -1 and 1 if needed
                unique_values = np.unique(predictions)
                if len(unique_values) == 2:
                    predictions = np.where(predictions == unique_values[0], -1, 1)
            
            weighted_predictions += weight * predictions
        
        # Convert back to original class labels (0 and 1) if needed
        # Positive score -> class 1, Negative score -> class 0
        return np.where(weighted_predictions >= 0, 1, 0)

# Example usage
# Preprocess training and test data (reusing the preprocessing function from earlier)
def preprocess_data(data):
    # Select only numerical features for simplicity
    numerical_features = ['Age', 'Fare', 'Pclass']
    X = data[numerical_features].copy()
    
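    # Fill missing values with mean (assign back instead of chained inplace=True,
    # which newer pandas versions warn about and may not apply to X)
    X['Age'] = X['Age'].fillna(X['Age'].mean())
    X['Fare'] = X['Fare'].fillna(X['Fare'].mean())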
    
    return X

# Preprocess training and test data
X_train_processed = preprocess_data(x_train)
X_test_processed = preprocess_data(x_text)

# Create and train our AdaBoost classifier
# Using Decision Tree with limited depth as base estimator (weak learner)
base_tree = DecisionTreeClassifier(max_depth=1)  # Decision stump (typical weak learner for AdaBoost)
adaboost_clf = SimpleAdaBoostClassifier(
    base_estimator=base_tree,
    n_estimators=50,
    learning_rate=1.0,
    random_state=42
)
adaboost_clf.fit(X_train_processed, y_train)

# Make predictions
predictions = adaboost_clf.predict(X_test_processed)

# Calculate accuracy
accuracy = np.mean(predictions == y_test['Survived'])
print(f"\nAdaBoost Classifier Accuracy: {accuracy:.4f}")

# Compare with base estimator (single decision stump)
base_model = DecisionTreeClassifier(max_depth=1)
base_model.fit(X_train_processed, y_train)
base_predictions = base_model.predict(X_test_processed)
base_accuracy = np.mean(base_predictions == y_test['Survived'])
print(f"Base Estimator (Decision Stump) Accuracy: {base_accuracy:.4f}")

# Compare with scikit-learn's AdaBoostClassifier
from sklearn.ensemble import AdaBoostClassifier
sklearn_adaboost = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),
    n_estimators=50,
    learning_rate=1.0,
    random_state=42
)
sklearn_adaboost.fit(X_train_processed, y_train)
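sklearn_predictions = sklearn_adaboost.predict(X_test_processed)
sklearn_accuracy = np.mean(sklearn_predictions == y_test['Survived'])
print(f"Scikit-learn's AdaBoostClassifier Accuracy: {sklearn_accuracy:.4f}")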

