## Ensemble Methods

To improve the accuracy and robustness of our predictive models for California housing prices, we now turn to ensemble methods. Ensemble methods combine several models into a single predictor that is often more accurate and more stable than any one of them alone.

#### Loading and preparing the data

In [2]:
from sklearn.datasets import fetch_california_housing
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor, AdaBoostRegressor, GradientBoostingRegressor

from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

In [3]:
california = fetch_california_housing()
print(california["DESCR"])

.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

:Number of Instances: 20640

:Number of Attributes: 8 numeric, predictive attributes and the target

:Attribute Information:
    - MedInc        median income in block group
    - HouseAge      median house age in block group
    - AveRooms      average number of rooms per household
    - AveBedrms     average number of bedrooms per household
    - Population    block group population
    - AveOccup      average number of household members
    - Latitude      block group latitude
    - Longitude     block group longitude

:Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html

The target variable is the median house value for California districts,
expressed in hundreds of thousands of dollars ($100,000).

This dataset was derived from the 1990 U.S. census, using one row per ce

In [4]:
df_cali = pd.DataFrame(california["data"], columns = california["feature_names"])
df_cali["median_house_value"] = california["target"]

df_cali.head()

| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | median_house_value |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 | 4.526 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 | 3.585 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 | 3.521 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 | 3.413 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 | 3.422 |


#### Normalization & Feature Selection

As in the Feature Engineering lesson, we are going to normalize our data and select a subset of columns as our features.

#### Train Test Split

In [5]:
features = df_cali.drop(columns = ["median_house_value","AveOccup", "Population", "AveBedrms"])
target = df_cali["median_house_value"]

In [6]:
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size = 0.20, random_state=0)

Create an instance of the normalizer and fit it on the training set only

In [7]:
normalizer = MinMaxScaler()

normalizer.fit(X_train)

In [8]:
X_train_norm = normalizer.transform(X_train)

X_test_norm = normalizer.transform(X_test)
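
Min-max scaling maps each feature to the range seen during `fit`. The short sketch below uses small made-up arrays (`toy_train` and `toy_test` are hypothetical, not the housing data) to check that the scaler computes `(x - min) / (max - min)` with statistics taken from the training set only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy arrays standing in for X_train / X_test (hypothetical values).
toy_train = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 20.0]])
toy_test = np.array([[2.0, 40.0]])

# fit() learns the per-column min and max from the training data only.
scaler = MinMaxScaler().fit(toy_train)

# The same transformation written out by hand.
col_min = toy_train.min(axis=0)
col_max = toy_train.max(axis=0)
manual_test = (toy_test - col_min) / (col_max - col_min)

scaled_test = scaler.transform(toy_test)
print(scaled_test)
print(np.allclose(scaled_test, manual_test))
```

The second test value (40) is larger than anything in the toy training set, so it scales to 1.5. Test data can land outside [0, 1] because the scaler never sees it during `fit`.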

In [9]:
X_train_norm = pd.DataFrame(X_train_norm, columns = X_train.columns)
X_train_norm.head()

| | MedInc | HouseAge | AveRooms | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | 0.257838 | 0.098039 | 0.048751 | 0.137088 | 0.677291 |
| 1 | 0.268265 | 1.0 | 0.031762 | 0.551541 | 0.190239 |
| 2 | 0.236783 | 0.490196 | 0.027097 | 0.137088 | 0.63247 |
| 3 | 0.066578 | 0.72549 | 0.017987 | 0.156217 | 0.606574 |
| 4 | 0.184591 | 1.0 | 0.023207 | 0.163656 | 0.596614 |


In [10]:
X_test_norm = pd.DataFrame(X_test_norm, columns = X_test.columns)
X_test_norm.head()

| | MedInc | HouseAge | AveRooms | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | 0.251852 | 0.411765 | 0.034147 | 0.004251 | 0.727092 |
| 1 | 0.364112 | 0.607843 | 0.037296 | 0.146652 | 0.635458 |
| 2 | 0.265431 | 0.54902 | 0.036045 | 0.649309 | 0.25 |
| 3 | 0.134564 | 0.705882 | 0.029397 | 0.070138 | 0.871514 |
| 4 | 0.310685 | 0.470588 | 0.024621 | 0.557917 | 0.191235 |


## Bagging and Pasting

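Bagging involves training multiple instances of the same base model on different random subsets of the training data, drawn with replacement. Pasting is the same idea with subsets drawn without replacement. The final prediction is obtained by averaging (regression) or voting (classification) over the predictions of these models.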
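
To make the mechanics concrete, here is a minimal hand-rolled sketch of the idea, assuming synthetic data from `make_regression` and illustrative names (`X_toy`, `trees`, `subset_size`). It is not how `BaggingRegressor` is implemented internally, only the same principle written out.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, used only to keep the sketch self-contained.
X_toy, y_toy = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
rng = np.random.default_rng(0)

n_estimators = 10
subset_size = 100
trees = []
for _ in range(n_estimators):
    # replace=True gives bagging (bootstrap); replace=False would give pasting.
    idx = rng.choice(len(X_toy), size=subset_size, replace=True)
    trees.append(DecisionTreeRegressor(random_state=0).fit(X_toy[idx], y_toy[idx]))

# For regression, the ensemble prediction is the mean of the individual predictions.
all_preds = np.stack([tree.predict(X_toy) for tree in trees])  # (n_estimators, n_samples)
ensemble_pred = all_preds.mean(axis=0)
print(all_preds.shape, ensemble_pred.shape)
```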

As a baseline, our best model so far is a Decision Tree with an R-Squared of 0.70. Let's see how ensembles do.

In [28]:
# bootstrap=False samples without replacement (pasting);
# the default, bootstrap=True, samples with replacement (bagging).
bagging_reg = BaggingRegressor(DecisionTreeRegressor(),
                               n_estimators=100,
                               max_samples=1000,
                               bootstrap=False)

Train the Bagging model on our normalized data

In [29]:
bagging_reg.fit(X_train_norm, y_train)

Evaluate the model's performance

In [30]:
pred = bagging_reg.predict(X_test_norm)

print("MAE", mean_absolute_error(y_test, pred))
# np.sqrt is used because the `squared` argument is deprecated in newer scikit-learn versions
print("RMSE", np.sqrt(mean_squared_error(y_test, pred)))
print("R2 score", r2_score(y_test, pred))

MAE 0.4132152585513567
RMSE 0.5920574832654673
R2 score 0.7311777708965429




Combining multiple trees, in this case 100, indeed yields a stronger model: we are now at 0.73 R-Squared!

Let's explore more!

`BaggingRegressor` wraps many base estimators and does not implement a feature importance attribute of its own.
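
Each fitted tree inside the ensemble still has its own `feature_importances_`, so one common workaround is to average them over `estimators_`. The sketch below uses synthetic data and a hypothetical model name `bag`, and it assumes the default `max_features=1.0`, so that every tree sees all columns.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: 5 features, of which only 2 carry signal.
X_toy, y_toy = make_regression(n_samples=300, n_features=5, n_informative=2, random_state=0)

bag = BaggingRegressor(DecisionTreeRegressor(), n_estimators=20, random_state=0)
bag.fit(X_toy, y_toy)

# Each fitted tree has impurity-based importances; average them across the ensemble.
importances = np.mean([tree.feature_importances_ for tree in bag.estimators_], axis=0)
print(importances)
```

If `max_features` were lowered, each tree would see a different column subset (listed in `estimators_features_`), and the per-tree importances would have to be mapped back to the original columns before averaging.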

## Random Patches

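In Bagging and Pasting, we randomize the training rows that each predictor (estimator) learns from. The Random Patches method goes a step further by also **randomizing the features** that each predictor trains on. A Random Forest applies a related idea: every tree is trained on a bootstrap sample, and a random subset of features can be considered at each split. Note that scikit-learn's `RandomForestRegressor` considers all features by default (`max_features=1.0`), so feature randomization only takes effect when `max_features` is lowered.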
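
One way to sketch random patches directly in scikit-learn is a `BaggingRegressor` with both `max_samples` and `max_features` set below 1.0. The data here is synthetic and the name `patches` is illustrative. After fitting, `estimators_features_` shows which columns each tree received.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

X_toy, y_toy = make_regression(n_samples=300, n_features=6, noise=5.0, random_state=0)

patches = BaggingRegressor(
    DecisionTreeRegressor(),
    n_estimators=10,
    max_samples=0.5,    # each tree is trained on a random half of the rows
    max_features=0.5,   # and on a random half of the columns
    bootstrap=True,
    random_state=0,
)
patches.fit(X_toy, y_toy)

# Column indices seen by the first three trees.
for cols in patches.estimators_features_[:3]:
    print(sorted(cols))
```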

- Initialize a Random Forest

In [31]:
forest = RandomForestRegressor(n_estimators=100)

- Training the model

In [36]:
forest.fit(X_train_norm, y_train)

- Evaluate the model

In [37]:
pred = forest.predict(X_test_norm)

print("MAE", mean_absolute_error(y_test, pred))
# np.sqrt is used because the `squared` argument is deprecated in newer scikit-learn versions
print("RMSE", np.sqrt(mean_squared_error(y_test, pred)))
print("R2 score", r2_score(y_test, pred))

MAE 0.3227793185562017
RMSE 0.49124139972205405
R2 score 0.8149336971911408




In [38]:
df_compare = pd.DataFrame(y_test.values, columns = ['y_true'])
df_compare['pred'] = pred
df_compare

| | y_true | pred |
|---|---|---|
| 0 | 1.369 | 1.442880 |
| 1 | 2.413 | 2.396540 |
| 2 | 2.007 | 1.451190 |
| 3 | 0.725 | 0.724150 |
| 4 | 4.600 | 3.540951 |
| ... | ... | ... |
| 4123 | 1.695 | 1.716640 |
| 4124 | 2.046 | 1.939610 |
| 4125 | 1.286 | 1.455230 |
| 4126 | 2.595 | 2.488790 |


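The Random Forest gives us an even better model! Note that with the default `max_features=1.0`, `RandomForestRegressor` considers every feature at each split, so the gain here most likely comes from each tree training on a full-size bootstrap sample instead of only 1,000 rows. Lowering `max_features` would add the feature randomization described above.

We are now at 0.81 R-Squared.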

## AdaBoost

Now, instead of training our estimators independently (which allows training them in parallel), we train them sequentially: each estimator learns from its predecessor's errors and focuses on the data points where it failed.
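
A quick way to watch this sequential behaviour is `staged_predict`, which yields the ensemble's prediction after each additional estimator. The sketch below assumes synthetic data and illustrative names (`ada`, `stage_scores`). The shallow `max_depth=3` trees are chosen only to keep the example fast.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X_toy, y_toy = make_regression(n_samples=400, n_features=5, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.25, random_state=0)

ada = AdaBoostRegressor(DecisionTreeRegressor(max_depth=3), n_estimators=30, random_state=0)
ada.fit(X_tr, y_tr)

# staged_predict yields the ensemble prediction after 1, 2, ..., k fitted estimators.
stage_scores = [r2_score(y_te, stage_pred) for stage_pred in ada.staged_predict(X_te)]
print(len(stage_scores), round(stage_scores[0], 3), round(stage_scores[-1], 3))

# Estimators with a lower weighted error receive a larger weight in the final prediction.
print(ada.estimator_weights_[:5])
```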

- Initialize an AdaBoost model

In [39]:
ada_reg = AdaBoostRegressor(DecisionTreeRegressor(),
                            n_estimators=100)


- Training the model

In [44]:
ada_reg.fit(X_train_norm, y_train)

- Evaluate the model

In [45]:
pred = ada_reg.predict(X_test_norm)

print("MAE", mean_absolute_error(y_test, pred))
# np.sqrt is used because the `squared` argument is deprecated in newer scikit-learn versions
print("RMSE", np.sqrt(mean_squared_error(y_test, pred)))
print("R2 score", r2_score(y_test, pred))

df_compare = pd.DataFrame(y_test.values, columns = ['y_true'])
df_compare['pred'] = pred
df_compare

MAE 0.2913815867248062
RMSE 0.4679238002569168
R2 score 0.8320856937167146




| | y_true | pred |
|---|---|---|
| 0 | 1.369 | 1.398 |
| 1 | 2.413 | 2.349 |
| 2 | 2.007 | 1.518 |
| 3 | 0.725 | 0.707 |
| 4 | 4.600 | 4.097 |
| ... | ... | ... |
| 4123 | 1.695 | 1.767 |
| 4124 | 2.046 | 1.958 |
| 4125 | 1.286 | 1.461 |
| 4126 | 2.595 | 2.611 |


Even better! By training the estimators sequentially, with each one focusing on the samples its predecessors predicted poorly, we reach an R-Squared of 0.83.

## Gradient Boosting

Now, each new estimator is trained to predict the residual errors left by the ensemble built so far, and its prediction is added to the running total.
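
The idea can be sketched by hand for squared-error loss: start from a constant prediction, then repeatedly fit a small tree to the current residuals and add a damped version of its output. The data is synthetic, and `learning_rate`, `n_stages`, and the other names are illustrative choices. `GradientBoostingRegressor` follows the same principle but supports other losses and more options.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X_toy, y_toy = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

learning_rate = 0.1
n_stages = 50

# Stage 0: a constant prediction (the mean minimizes squared error).
current_pred = np.full_like(y_toy, y_toy.mean())
mse_history = [np.mean((y_toy - current_pred) ** 2)]

stages = []
for _ in range(n_stages):
    residuals = y_toy - current_pred  # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_toy, residuals)
    current_pred = current_pred + learning_rate * tree.predict(X_toy)
    stages.append(tree)
    mse_history.append(np.mean((y_toy - current_pred) ** 2))

# Training error before any tree and after the last stage.
print(round(mse_history[0], 1), round(mse_history[-1], 1))
```

The training error shrinks with every stage. Whether that carries over to unseen data depends on the learning rate, tree depth, and number of stages.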

- Initialize a Gradient Boosting model

In [46]:
gb_reg = GradientBoostingRegressor(n_estimators=100)

- Training the model

In [47]:
gb_reg.fit(X_train_norm, y_train)

- Evaluate the model

In [48]:
pred = gb_reg.predict(X_test_norm)

print("MAE", mean_absolute_error(y_test, pred))
# np.sqrt is used because the `squared` argument is deprecated in newer scikit-learn versions
print("RMSE", np.sqrt(mean_squared_error(y_test, pred)))
print("R2 score", r2_score(y_test, pred))

df_compare = pd.DataFrame(y_test.values, columns = ['y_true'])
df_compare['pred'] = pred
df_compare

MAE 0.4008225565974252
RMSE 0.5710839659492609
R2 score 0.7498863683204233




| | y_true | pred |
|---|---|---|
| 0 | 1.369 | 1.813786 |
| 1 | 2.413 | 2.722065 |
| 2 | 2.007 | 1.640388 |
| 3 | 0.725 | 0.822561 |
| 4 | 4.600 | 3.250905 |
| ... | ... | ... |
| 4123 | 1.695 | 1.850772 |
| 4124 | 2.046 | 2.254445 |
| 4125 | 1.286 | 1.277222 |
| 4126 | 2.595 | 2.538263 |


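Compared with AdaBoost, Gradient Boosting does not do as well here (R-Squared of 0.75 versus 0.83). One likely reason is that `GradientBoostingRegressor` uses shallow trees by default (`max_depth=3`), whereas our AdaBoost model used fully grown decision trees.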

**However, note that none of the hyperparameters of the models we have tried were fine-tuned.**
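
As a pointer for that next step, here is a minimal sketch of a cross-validated grid search over a few Gradient Boosting hyperparameters. It runs on synthetic data so it stays fast, and the grid values are illustrative assumptions, not recommendations for the housing data.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

X_toy, y_toy = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Illustrative grid: 2 x 2 x 2 = 8 candidate settings.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [2, 3],
    "learning_rate": [0.05, 0.1],
}

search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid,
    scoring="r2",
    cv=3,  # 3-fold cross-validation on the training data
)
search.fit(X_toy, y_toy)

print(search.best_params_)
print(round(search.best_score_, 3))
```

On the real data, the search would be fit on `X_train_norm` and `y_train`, keeping the test set aside for the final evaluation.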

