
## Ensemble Learning

### Implementation for House Price Prediction

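Let's implement blending, bagging, and stacking on the Kaggle house prices dataset and check whether they improve prediction accuracy. The models use two features, `GrLivArea` and `YearBuilt`, to predict the log-transformed `SalePrice`, so every MSE reported below is on the log scale.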

In [2]:
# Data preparation
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error

# Load data
data = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/train.csv')[['GrLivArea', 'YearBuilt', 'SalePrice']]

# Handle missing values
data = data.dropna()

# Log transform target for better normality
data['SalePrice'] = np.log1p(data['SalePrice'])

# Split data
X = data[['GrLivArea', 'YearBuilt']]
y = data['SalePrice']
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)
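
Because the target is transformed with `np.log1p`, the models predict log prices rather than dollar amounts. The following is a minimal sketch of the round trip, using made-up prices (not rows from the dataset) to show that `np.expm1` maps values back to the original scale:

```python
import numpy as np

# Hypothetical sale prices in dollars (illustrative values, not from train.csv)
prices = np.array([120000.0, 185000.0, 310000.0])

# Forward transform, the same call applied to SalePrice above
log_prices = np.log1p(prices)

# Inverse transform: expm1 undoes log1p and returns dollar-scale values
recovered = np.expm1(log_prices)

print(np.allclose(recovered, prices))
```

The same call, `np.expm1(model.predict(X_val_scaled))`, would convert any of the predictions in this notebook back to dollars.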

In [3]:
# Blending implementation
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Initialize base models
linear = LinearRegression()
svm = SVR(kernel='rbf', C=100, gamma=0.1)
tree = DecisionTreeRegressor(max_depth=5)

# Train models
linear.fit(X_train_scaled, y_train)
svm.fit(X_train_scaled, y_train)
tree.fit(X_train_scaled, y_train)

# Get predictions
pred_linear = linear.predict(X_val_scaled)
pred_svm = svm.predict(X_val_scaled)
pred_tree = tree.predict(X_val_scaled)

# Blend with equal weights
blended_pred = (pred_linear + pred_svm + pred_tree) / 3

# Evaluate
mse_blended = mean_squared_error(y_val, blended_pred)
print(f"Blended MSE: {mse_blended:.5f}")

# Compare with individual models
for name, pred in [('Linear', pred_linear), ('SVM', pred_svm), ('Tree', pred_tree)]:
    mse = mean_squared_error(y_val, pred)
    print(f"{name} MSE: {mse:.5f}")

Blended MSE: 0.04546
Linear MSE: 0.05186
SVM MSE: 0.04603
Tree MSE: 0.04778
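
Equal weighting is one special case of a weighted average. As a sketch only, the helper below (`blend` is a hypothetical name, not part of the notebook) generalises the equal-weight blend to arbitrary non-negative weights, normalising them so they sum to 1:

```python
import numpy as np

def blend(predictions, weights=None):
    """Weighted average of per-model predictions.

    predictions: array-like of shape (n_models, n_samples)
    weights: optional array-like of length n_models; equal weights if None
    """
    preds = np.asarray(predictions, dtype=float)
    if weights is None:
        weights = np.ones(preds.shape[0])
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()  # normalise so the weights sum to 1
    return weights @ preds

# Toy predictions from three hypothetical models on two samples
toy_preds = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
equal_blend = blend(toy_preds)
weighted_blend = blend(toy_preds, weights=[1, 1, 2])
```

With the notebook's variables, `blend([pred_linear, pred_svm, pred_tree])` should reproduce `blended_pred`. If unequal weights are tuned, they should be chosen on data separate from the set used to report the final MSE, otherwise the reported score is optimistic.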


In [4]:
# Bagging implementation
from sklearn.utils import resample

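class BaggingRegressor:
    """Minimal bagging: average the predictions of models fit on bootstrap samples.

    Note: this name shadows sklearn.ensemble.BaggingRegressor if both are imported.
    """
    def __init__(self, base_model, n_estimators=10):
        # base_model is a zero-argument factory that returns a fresh, unfitted estimator
        self.models = [base_model() for _ in range(n_estimators)]

    def fit(self, X, y):
        for model in self.models:
            # Draw a bootstrap sample (same size as X, sampled with replacement)
            X_sample, y_sample = resample(X, y)
            model.fit(X_sample, y_sample)
        return self

    def predict(self, X):
        # Average the per-model predictions across the ensemble
        preds = np.array([model.predict(X) for model in self.models])
        return np.mean(preds, axis=0)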

# Create and fit bagging model
bagged_tree = BaggingRegressor(lambda: DecisionTreeRegressor(max_depth=5), n_estimators=50)
bagged_tree.fit(X_train_scaled, y_train)

# Evaluate
pred_bagged = bagged_tree.predict(X_val_scaled)
mse_bagged = mean_squared_error(y_val, pred_bagged)
print(f"\nBagged Tree MSE: {mse_bagged:.5f}")
print(f"Single Tree MSE: {mean_squared_error(y_val, pred_tree):.5f}")


Bagged Tree MSE: 0.04630
Single Tree MSE: 0.04778
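
Each tree in the bagging cell sees a different bootstrap sample because `resample` draws rows with replacement by default. The sketch below imitates that draw with plain NumPy indices (the sample size of 1000 is an arbitrary choice for illustration) to show that a bootstrap sample typically contains only about 63% of the distinct original rows:

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows = 1000  # arbitrary illustrative size, not the dataset's row count

# Bootstrap: draw n_rows indices with replacement from range(n_rows)
bootstrap_idx = rng.integers(0, n_rows, size=n_rows)

# Fraction of distinct original rows that appear in the sample
unique_fraction = np.unique(bootstrap_idx).size / n_rows
print(f"Distinct rows in bootstrap sample: {unique_fraction:.1%}")
```

The rows left out of each sample differ from tree to tree, which is what makes the averaged prediction less variable than a single tree's.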


In [5]:
# Stacking implementation
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import RidgeCV

# Define base models
base_models = [
    ('linear', LinearRegression()),
    ('svm', SVR(kernel='rbf')),  # default C and gamma, unlike the SVR(C=100, gamma=0.1) used for blending
    ('tree', DecisionTreeRegressor(max_depth=5))
]

# Define meta-model
meta_model = RidgeCV()

# Create stacking model
stacked_model = StackingRegressor(
    estimators=base_models,
    final_estimator=meta_model,
    cv=5
)

# Fit and predict
stacked_model.fit(X_train_scaled, y_train)
pred_stacked = stacked_model.predict(X_val_scaled)

# Evaluate
mse_stacked = mean_squared_error(y_val, pred_stacked)
print(f"\nStacked Model MSE: {mse_stacked:.5f}")

# Show weights learned by meta-model
print("\nMeta-model coefficients:", stacked_model.final_estimator_.coef_)


Stacked Model MSE: 0.04541

Meta-model coefficients: [0.30570458 0.42234448 0.27543512]
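
To make the role of these coefficients concrete, here is a small sketch on synthetic data (the `make_regression` dataset and the two-model stack are assumptions chosen for brevity, not the notebook's setup). It rebuilds the stacked prediction by hand as a linear combination of the base models' predictions plus the meta-model's intercept:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import LinearRegression, RidgeCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption: not the house prices dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=2, noise=5.0, random_state=0)

stack = StackingRegressor(
    estimators=[
        ("linear", LinearRegression()),
        ("tree", DecisionTreeRegressor(max_depth=5, random_state=0)),
    ],
    final_estimator=RidgeCV(),
    cv=5,
)
stack.fit(X_demo, y_demo)

# transform() returns one column of predictions per fitted base model
base_preds = stack.transform(X_demo)

# The meta-model is linear: coefficients times base predictions, plus an intercept
meta = stack.final_estimator_
manual_pred = base_preds @ meta.coef_ + meta.intercept_

print(np.allclose(manual_pred, stack.predict(X_demo)))
```

Because the meta-model also fits an intercept and is not constrained, its coefficients need not sum to exactly 1, so they are best read as relative contributions rather than strict blending weights.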


# Ensemble Learning Results Comparison and Recommendations

## Performance Comparison

| Method          | MSE (log scale) | Change vs. Best Base Model (SVM) | Model Weights            |
|-----------------|-----------------|----------------------------------|--------------------------|
| Base - Linear   | 0.05186         | -                                | -                        |
| Base - SVM      | 0.04603         | -                                | -                        |
| Base - Tree     | 0.04778         | -                                | -                        |
| **Blending**    | 0.04546         | 1.24% better than SVM            | Equal weights            |
| **Bagging**     | 0.04630         | 0.59% worse than SVM             | -                        |
| **Stacking**    | **0.04541**     | **1.35% better than SVM**        | Linear: 0.31, SVM: 0.42, Tree: 0.28 |
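
The percentages in the table are relative reductions in MSE. As a quick check, this sketch recomputes them from the MSE values printed above (`pct_improvement` is a hypothetical helper name):

```python
# MSE values copied from the notebook outputs above
mse = {
    "Linear": 0.05186,
    "SVM": 0.04603,
    "Tree": 0.04778,
    "Blending": 0.04546,
    "Bagging": 0.04630,
    "Stacking": 0.04541,
}

def pct_improvement(baseline, value):
    """Percent reduction in MSE relative to baseline (positive means better)."""
    return (baseline - value) / baseline * 100

# Best (lowest-MSE) base model, which is the SVM here
best_base = min(mse[name] for name in ("Linear", "SVM", "Tree"))

for method in ("Blending", "Bagging", "Stacking"):
    vs_best = pct_improvement(best_base, mse[method])
    vs_linear = pct_improvement(mse["Linear"], mse[method])
    print(f"{method}: {vs_best:+.2f}% vs best base, {vs_linear:+.2f}% vs Linear")
```

A negative value means the ensemble has a higher MSE than the baseline, as is the case for bagging against the SVM.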

## Key Insights

1. **Model Effectiveness**:
   - All ensemble methods outperformed the base Linear model (10.7-12.4% lower MSE)
   - Stacking provided the best overall performance (1.35% better than the best base model, SVM)
   - Blending came within about 0.1% of stacking's MSE (0.04546 vs 0.04541) with a simpler implementation
   - Bagging reduced the Tree model's MSE by about 3.1% (0.04778 to 0.04630) but didn't surpass SVM's performance

2. **Weight Analysis**:
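   - The stacking meta-model assigned the highest coefficient to SVM (0.42), consistent with SVM being the strongest individual predictor; note that the stacked SVR uses default `C` and `gamma` rather than the values set in the blending cell
   - Linear regression received a slightly larger coefficient than the tree (0.31 vs 0.28) despite its higher standalone MSE, suggesting it contributes information the other models miss
   - The fairly balanced coefficients suggest that each model contributes some distinct predictive value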

3. **Performance Patterns**:
   - The relatively small margins between methods suggest the base models were already reasonably strong
   - SVM's strong individual performance limited potential ensemble gains
   - Blending and stacking each beat every base model, while bagging beat only its own base learner (the single tree)

## Practical Recommendations

1. **For Production Deployment**:
   *"Implement the stacking model as it provides the best accuracy (MSE 0.04541) with reasonable complexity. The 1.35% improvement over the best base model is modest, so weigh it against the extra training and maintenance cost."*

2. **For Rapid Implementation**:
   *"Use blending with equal weights if development resources are limited - its MSE (0.04546) is only about 0.1% higher than stacking's (0.04541), with much simpler implementation and maintenance."*

3. **For Model Improvement**:
   *"Experiment with adding more diverse base models to the stacking ensemble, particularly models that might capture different patterns than SVM and linear regression."*

4. **For Resource Optimization**:
   *"Consider using just the SVM model if computational resources are extremely constrained, as the ensemble gains are modest in this case."*

5. **Next Steps**:
   *"Investigate why bagging underperformed - try increasing the number of estimators or using different base models. Also explore feature engineering to create more differentiation between models' strengths."*
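
One way to follow up on the bagging suggestion is to sweep the number of estimators under cross-validation. The sketch below is an assumption-laden illustration: it uses scikit-learn's own `BaggingRegressor` (imported under the alias `SkBaggingRegressor` to avoid clashing with the class defined in this notebook) and synthetic data in place of the house prices features:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor as SkBaggingRegressor
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption: swap in X_train_scaled, y_train in the notebook)
X_demo, y_demo = make_regression(n_samples=300, n_features=2, noise=10.0, random_state=0)

cv_mse = {}
for n_estimators in (10, 50, 100):
    model = SkBaggingRegressor(
        DecisionTreeRegressor(max_depth=5),  # same base learner as the notebook
        n_estimators=n_estimators,
        random_state=0,
    )
    # scikit-learn reports negated MSE so that higher is better; flip the sign back
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="neg_mean_squared_error")
    cv_mse[n_estimators] = -scores.mean()

for n_estimators, value in cv_mse.items():
    print(f"n_estimators={n_estimators}: CV MSE = {value:.3f}")
```

Averaging over five folds gives a steadier estimate than the single train/validation split used above, which matters when the differences between methods are around 1%.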

### Conclusion
The results show that blending and stacking both improved on the best base model, with stacking delivering the lowest MSE by combining the base models through learned weights. Bagging improved on its single-tree base learner but did not beat the SVM. The choice between methods should balance performance needs against implementation complexity in the target environment.