# <font color='#eb3483'>Ensemble Methods</font>

Ensemble methods combine several machine learning models (a.k.a. base learners) in order to produce one optimal predictive model. In this lesson we study different types of ensemble methods. Before we begin, let's load the necessary libraries and dataset.

In [None]:
# Importing the libraries
import pandas as pd
import numpy as np
from sklearn import metrics
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import LabelEncoder
%matplotlib inline

### <font color='#eb3483'>Load data</font>

The California housing dataset contains information on various socio-economic features of block groups in California. Each row in the dataset represents a single block group, and there are 20,640 observations, each with 10 attributes.

Features are as follows:
1. Longitude: The longitude of the center of each block group in California.
2. Latitude: The latitude of the center of each block group in California.
3. Housing Median Age: The median age of the housing units in each block group.
4. Total Rooms: The total number of rooms in the housing units in each block group.
5. Total Bedrooms: The total number of bedrooms in the housing units in each block group.
6. Population: The total population of the block group.
7. Households: The total number of households in the block group.
8. Median Income: The median income of the block group.
9. Median House Value: The median value of the housing units in the block group.
10. Ocean Proximity: The proximity of the block group to the ocean or other bodies of water.

In [None]:
#Import from Data Folder in Google Drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
path = "/content/drive/MyDrive/Data Science | Abroad | S1 | Claire/Class Materials/Week 3-6 Special Topics/W3D1 Tree Ensembles/Classwork/housing.csv"
data = pd.read_csv(path)
# Dataset is now stored in a Pandas Dataframe

In [None]:
# See head of the dataset
data.head()

## <font color='#eb3483'>EDA</font>

In [None]:
#Check the shape of dataframe
data.shape

In [None]:
data.columns

In [None]:
data.dtypes

In [None]:
# Count the number of unique values in each column
data.nunique()

In [None]:
# Check for missing values
data.isnull().sum()

In [None]:
# See rows with missing values
data[data.isnull().any(axis=1)]

In [None]:
data["ocean_proximity"].value_counts(ascending=True)

In [None]:
# Distribution of Categorical Variable ('ocean_proximity')
plt.figure(figsize=(8, 3))
sns.countplot(x=data['ocean_proximity'])
plt.title('Distribution of Ocean Proximity')
plt.xlabel('Ocean Proximity')
plt.ylabel('Frequency')
plt.show()

In [None]:
# Finding out the correlation between the features
# Exclude non-numeric columns for correlation heatmap
numeric_data = data.drop(columns=['ocean_proximity'])

# Correlation heatmap
plt.figure(figsize=(10, 8))
corr = numeric_data.corr()
sns.heatmap(corr, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Heatmap')
plt.show()

In [None]:
# Correlation of each numeric feature with the target (house price)
price_correlation = corr['median_house_value'].drop('median_house_value')

for feature, correlation in price_correlation.items():
    if correlation > 0:
        print(f"There is a positive correlation between house price and {feature}")
    elif correlation < 0:
        print(f"There is a negative correlation between house price and {feature}")
    else:
        print(f"There is no correlation between house price and {feature}")

In [None]:
# Scatter plot of latitude and longitude to visualize geographical data
plt.figure(figsize=(5, 3))
sns.scatterplot(x='longitude', y='latitude', data=data, hue='median_house_value', palette='coolwarm', alpha=0.6)
plt.title('Geographical Distribution of Housing Prices')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.show()

In [None]:
#Distribution of Target Variable ('median_house_value')

plt.figure(figsize=(7, 5))
sns.histplot(data['median_house_value'], bins=30, kde=True, color='skyblue')
plt.title('Distribution of House Prices')
plt.xlabel('Median House Value')
plt.ylabel('Frequency')
plt.show()

In [None]:
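# Scatter plot of median income against median house value
plt.scatter(data["median_income"], data["median_house_value"], alpha=0.3)
plt.title("median_income vs median_house_value")  # call plt.title; assigning to it would overwrite the function
plt.xlabel("median_income")
plt.ylabel("median_house_value")
plt.show()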

In [None]:
# Boxplot of median house value
plt.figure(figsize=(10, 6))
sns.boxplot(x='ocean_proximity', y='median_house_value',data=data)
plt.title('Median House Value by Ocean Proximity')
plt.xlabel('Ocean Proximity')
plt.ylabel('Median House Value')
plt.show()

In [None]:
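# Outlier detection
# Let's check if there are any outliers in our dataset
print(data.median_house_value.value_counts().head())
# Drop block groups at the capped maximum house value
# (assumed to be the most frequent value printed above)
california = data[data.median_house_value != data.median_house_value.max()]
california.plot(kind='box', rot=90, logy=True, figsize=(20, 10));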

We see that the features are on different scales, so our linear models (the regularized ones) would need the data to be scaled first. Most of the features also contain many outliers. We will first see how each predictor models the response.
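
As a minimal sketch of what scaling before a regularized linear model could look like, the example below chains `StandardScaler` and `Ridge` in a scikit-learn pipeline. The data is synthetic and made up for illustration (it is not the housing data), and `alpha=1.0` is an arbitrary assumption.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: two features on very different scales
rng = np.random.default_rng(0)
small = rng.normal(0, 1, 500)      # spread of about 1
large = rng.normal(0, 1000, 500)   # spread of about 1000
X_demo = np.column_stack([small, large])
y_demo = 3.0 * small + 0.003 * large + rng.normal(0, 0.1, 500)

# The scaler is fitted inside the pipeline, so it only sees the data passed to fit()
scaled_ridge = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
scaled_ridge.fit(X_demo, y_demo)

# After scaling, the two coefficients are on a comparable footing
print(scaled_ridge.named_steps["ridge"].coef_)
```

Putting the scaler in the pipeline means the same transformation is applied automatically whenever the model predicts on new data.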

## <font color='#eb3483'>Feature Engineering</font>

In [None]:
# Create a new feature 'rooms_per_household', which represents the average number of rooms per household
data['rooms_per_household'] = data['total_rooms'] / data['households']
data.head(2)

In [None]:
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
data['ocean_proximity_encoded'] = label_encoder.fit_transform(data['ocean_proximity'])

In [None]:
# Splitting to training and testing data
X=data[['longitude','latitude','housing_median_age','rooms_per_household','population','households','median_income','ocean_proximity_encoded']]
y = data['median_house_value']
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.3, random_state = 4)

We will train some models to get a good baseline of what performance should be.

## 1. Linear regression

### Training the model

In [None]:
# Import library for Linear Regression
from sklearn.linear_model import LinearRegression

# Create a Linear regressor
lm = LinearRegression()

# Train the model using the training sets
lm.fit(X_train, y_train)

In [None]:
# Value of y intercept
lm.intercept_

In [None]:
# Converting the coefficient values to a dataframe
coefficients = pd.DataFrame([X_train.columns, lm.coef_]).T
coefficients = coefficients.rename(columns={0: 'Attribute', 1: 'Coefficients'})
coefficients

**Remember:**

* The magnitude of the coefficient indicates the strength of the relationship. A larger magnitude suggests a stronger impact on the target variable, provided the features are on comparable scales.

* The sign of the coefficient (+ or -) indicates the direction of the relationship. A positive coefficient means that as the feature increases, the target variable is expected to increase as well. A negative coefficient means that as the feature increases, the target variable is expected to decrease.
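
One caveat when reading coefficient magnitudes is that they depend on the units of each feature. The sketch below, on made-up synthetic data (the variable names are hypothetical), fits the same relationship twice with the feature expressed in different units.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic example: the same quantity expressed in two different units
rng = np.random.default_rng(1)
income_dollars = rng.uniform(20_000, 100_000, 200)
income_thousands = income_dollars / 1000
price = 2.5 * income_dollars + rng.normal(0, 1000, 200)

coef_dollars = LinearRegression().fit(income_dollars.reshape(-1, 1), price).coef_[0]
coef_thousands = LinearRegression().fit(income_thousands.reshape(-1, 1), price).coef_[0]

# Same relationship, but the coefficient is 1000 times larger for the rescaled feature
print(coef_dollars, coef_thousands)
```

The sign is unaffected by the change of units, but the magnitude is not, which is why comparing magnitudes is most meaningful after standardizing the features.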

In [None]:
# Get feature importance from the coefficients
feature_importance = pd.Series(lm.coef_, index=X.columns)
feature_importance = feature_importance.abs().sort_values(ascending=False)

# Plotting the feature importances
plt.figure(figsize=(12, 6))
feature_importance.plot(kind='barh')
plt.gca().invert_yaxis()  # Invert the y-axis to flip the bars
plt.title('Feature Importance in Linear Regression')
plt.xlabel('Coefficient Magnitude')
plt.ylabel('Feature')
plt.show()

### Model Evaluation

In [None]:
# Model prediction on train data
y_pred = lm.predict(X_train)

In [None]:
# Model Evaluation
print('R^2:',metrics.r2_score(y_train, y_pred))
print('Adjusted R^2:',1 - (1-metrics.r2_score(y_train, y_pred))*(len(y_train)-1)/(len(y_train)-X_train.shape[1]-1))
print('MAE:',metrics.mean_absolute_error(y_train, y_pred))
print('MSE:',metrics.mean_squared_error(y_train, y_pred))
print('RMSE:',np.sqrt(metrics.mean_squared_error(y_train, y_pred)))

𝑅^2: It is a measure of the linear relationship between X and Y. It is interpreted as the proportion of the variance in the dependent variable that is predictable from the independent variables.

Adjusted 𝑅^2: The adjusted R-squared compares the explanatory power of regression models that contain different numbers of predictors, penalizing predictors that do not improve the fit.

MAE: It is the mean of the absolute value of the errors. It measures the difference between two continuous variables, here the actual and predicted values of y.

MSE: The mean squared error (MSE) is like the MAE, but it squares each error before averaging instead of taking the absolute value, so large errors are penalized more heavily.

RMSE: The root mean squared error (RMSE) is the square root of the MSE. It is expressed in the same units as the target variable (here, dollars), which makes it easier to interpret.
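
To make the definitions concrete, here is a small sketch that computes each metric by hand with NumPy on made-up toy values and checks 𝑅^2 against scikit-learn. The arrays and `n_features` are assumptions for illustration only.

```python
import numpy as np
from sklearn import metrics

# Toy values, made up for illustration
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 11.0])
n_features = 1  # hypothetical number of predictors

errors = y_true - y_hat                      # [1, 0, -1, -2]
mae = np.mean(np.abs(errors))                # mean absolute error
mse = np.mean(errors ** 2)                   # mean squared error
rmse = np.sqrt(mse)                          # back in the units of y

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
r2 = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

# Adjusted R^2 penalizes the number of predictors
n = len(y_true)
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)

print(mae, mse, rmse, r2, adj_r2)
print(np.isclose(r2, metrics.r2_score(y_true, y_hat)))
```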

In [None]:
# Visualizing the differences between actual prices and predicted values
plt.scatter(y_train, y_pred,alpha=0.3)
plt.xlabel("Prices")
plt.ylabel("Predicted prices")
plt.title("Prices vs Predicted prices")
plt.show()

### For test data

In [None]:
# Predicting Test data with the model
y_test_pred = lm.predict(X_test)

In [None]:
# Model Evaluation
acc_linreg = metrics.r2_score(y_test, y_test_pred)
print('R^2:', acc_linreg)
print('Adjusted R^2:',1 - (1-metrics.r2_score(y_test, y_test_pred))*(len(y_test)-1)/(len(y_test)-X_test.shape[1]-1))
print('MAE:',metrics.mean_absolute_error(y_test, y_test_pred))
print('MSE:',metrics.mean_squared_error(y_test, y_test_pred))
print('RMSE:',np.sqrt(metrics.mean_squared_error(y_test, y_test_pred)))

Here the evaluation scores on the test data almost match those on the training data, so the model is not overfitting.

## 2. Random Forest Regressor

### Train the model: X_train

In [None]:
# Import Random Forest Regressor
from sklearn.ensemble import RandomForestRegressor

# Create a Random Forest Regressor
reg = RandomForestRegressor()

# Train the model using the training sets
reg.fit(X_train, y_train)

In [None]:
RandomForestRegressor?

sklearn's RandomForest implementation trains each base tree with a dataset the same size as the training dataset (sampling with replacement if `bootstrap=True`).
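
To get a feel for what sampling with replacement does, the sketch below draws one bootstrap sample of row indices with NumPy. It only mimics the idea and is not scikit-learn's internal code; the sample size of 10,000 is an arbitrary assumption.

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples = 10_000  # hypothetical training set size

# Draw n_samples row indices with replacement: same size as the original data
indices = rng.integers(0, n_samples, size=n_samples)

# Some rows appear several times, others not at all
unique_fraction = np.unique(indices).size / n_samples
print(len(indices), round(unique_fraction, 3))
```

For large datasets the fraction of distinct rows in a bootstrap sample approaches 1 - 1/e, roughly 63%, so each tree sees a noticeably different view of the data.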

In [None]:
# Default parameters, written with the names and values used by recent scikit-learn releases
RandomForestRegressor(bootstrap=True, criterion='squared_error', max_depth=None,
           max_features=1.0, max_leaf_nodes=None,
           min_impurity_decrease=0.0,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None,
           oob_score=False, random_state=None, verbose=0, warm_start=False)

In [None]:
# Setting n_estimators higher than 100 makes training take longer (interrupt the kernel if it runs too long).
# Try a few options and compare the model's performance, e.g. n_estimators=10 versus n_estimators=100.
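
One way to run that comparison is sketched below. It uses synthetic data from `make_regression` as a stand-in so that it runs on its own; the dataset size, noise level, and the two values of `n_estimators` are assumptions for illustration.

```python
import time

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the housing data)
X_demo, y_demo = make_regression(n_samples=1000, n_features=8, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=4)

results = {}
for n_trees in (10, 100):
    start = time.perf_counter()
    forest = RandomForestRegressor(n_estimators=n_trees, random_state=0)
    forest.fit(X_tr, y_tr)
    elapsed = time.perf_counter() - start
    # Store the test R^2 and the training time for each forest size
    results[n_trees] = (forest.score(X_te, y_te), elapsed)
    print(f"n_estimators={n_trees}: test R^2={results[n_trees][0]:.3f}, fit time={elapsed:.2f}s")
```

Fixing `random_state` makes the two runs reproducible, so differences in the score come from the number of trees rather than from random variation.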

### Model Evaluation

In [None]:
# Model prediction on train data
y_pred = reg.predict(X_train)

In [None]:
# Model Evaluation
acc_rf_train = metrics.r2_score(y_train, y_pred)
print('R^2:',metrics.r2_score(y_train, y_pred))
print('Adjusted R^2:',1 - (1-metrics.r2_score(y_train, y_pred))*(len(y_train)-1)/(len(y_train)-X_train.shape[1]-1))
print('MAE:',metrics.mean_absolute_error(y_train, y_pred))
print('MSE:',metrics.mean_squared_error(y_train, y_pred))
print('RMSE:',np.sqrt(metrics.mean_squared_error(y_train, y_pred)))

The Root Mean Squared Error (RMSE) of approximately 18688 indicates that the model's predictions typically deviate by around $18,688 from the actual house prices in the training set (the exact value changes between runs because no `random_state` is set). This value gives us an understanding of how well the model predicts house prices, with lower RMSE values indicating better performance.

In [None]:
# Visualizing the differences between actual prices and predicted values
plt.scatter(y_train, y_pred,alpha=0.3)
plt.xlabel("Prices")
plt.ylabel("Predicted prices")
plt.title("Prices vs Predicted prices")
plt.show()

### For test data: X_test

In [None]:
# Predicting Test data with the model
y_test_pred = reg.predict(X_test)

In [None]:
# Model Evaluation
acc_rf = metrics.r2_score(y_test, y_test_pred)
print('R^2:', acc_rf)
print('Adjusted R^2:',1 - (1-metrics.r2_score(y_test, y_test_pred))*(len(y_test)-1)/(len(y_test)-X_test.shape[1]-1))
print('MAE:',metrics.mean_absolute_error(y_test, y_test_pred))
print('MSE:',metrics.mean_squared_error(y_test, y_test_pred))
print('RMSE:',np.sqrt(metrics.mean_squared_error(y_test, y_test_pred)))

Another example: "Extremely Randomized Trees"

Extremely Randomized Trees are a different kind of tree ensemble in which each tree chooses its split points at random instead of searching for the optimal split (i.e. not by maximizing information gain over all possible thresholds).

These trees are weak estimators by themselves (not surprisingly).

However, they are better than a 100% random estimator, and each random tree is different. This makes them well suited as base estimators, because aggregating a group of them reduces the overall error. Since each tree is built from different random splits, their errors will differ.
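
The effect of aggregation can be sketched with scikit-learn's `ExtraTreeRegressor` (a single randomized tree) and `ExtraTreesRegressor` (an ensemble of them). The data below is synthetic and the settings are assumptions chosen for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import ExtraTreeRegressor

# Synthetic stand-in data (not the housing data)
X_demo, y_demo = make_regression(n_samples=1000, n_features=8, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=4)

# One randomized tree on its own
single_tree = ExtraTreeRegressor(random_state=0).fit(X_tr, y_tr)

# An ensemble that averages 100 randomized trees
ensemble = ExtraTreesRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)

single_r2 = single_tree.score(X_te, y_te)
ensemble_r2 = ensemble.score(X_te, y_te)
print(f"single tree R^2: {single_r2:.3f}, ensemble R^2: {ensemble_r2:.3f}")
```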

## 3. XGBoost

XGBoost (eXtreme Gradient Boosting) implements gradient boosted trees, with a focus on large datasets.

Because it is a relatively new algorithm (the research started in 2014, and the original paper was published in 2016: [link to the paper](https://arxiv.org/abs/1603.02754)), it is not implemented in scikit-learn. However, it is available in the [xgboost](http://xgboost.readthedocs.io/en/latest/python/python_intro.html) package, which provides an implementation of XGBoost that follows scikit-learn's API.
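
To illustrate the general idea behind gradient boosted trees, here is a minimal hand-written boosting loop for squared error. It is not XGBoost itself (which adds regularization and many optimizations), just a sketch built from scikit-learn decision trees; the synthetic data, `learning_rate`, `n_rounds`, and `max_depth` are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic one-dimensional data, made up for illustration
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(300, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(0, 0.1, 300)

learning_rate = 0.1
n_rounds = 100

# Start from a constant prediction: the mean of the target
prediction = np.full_like(y_demo, y_demo.mean())
initial_mse = np.mean((y_demo - prediction) ** 2)

trees = []
for _ in range(n_rounds):
    # For squared error, the residuals point in the direction of the negative gradient
    residuals = y_demo - prediction
    tree = DecisionTreeRegressor(max_depth=3).fit(X_demo, residuals)
    # Each new tree corrects a fraction of the remaining error
    prediction += learning_rate * tree.predict(X_demo)
    trees.append(tree)

final_mse = np.mean((y_demo - prediction) ** 2)
print(f"training MSE before boosting: {initial_mse:.4f}, after: {final_mse:.4f}")
```

Unlike a random forest, where trees are trained independently and averaged, each tree here depends on the errors left by the trees before it.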

In [None]:
# Import XGBoost Regressor
from xgboost import XGBRegressor

#Create a XGBoost Regressor
reg = XGBRegressor()

# Train the model using the training sets
reg.fit(X_train, y_train)


In [None]:
# Parameters written with current xgboost names
# ('reg:squarederror' replaces the deprecated 'reg:linear')
XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, n_estimators=100,
       n_jobs=1, objective='reg:squarederror', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1)

In [None]:
# Here we can change a number of parameters, e.g. set n_estimators=100
#XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
#       max_depth=3, min_child_weight=1, n_estimators=100,
#       n_jobs=1, objective='reg:squarederror', random_state=0,
#       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1)

### Model Evaluation

In [None]:
# Model prediction on train data
y_pred = reg.predict(X_train)

In [None]:
# Model Evaluation
print('R^2:',metrics.r2_score(y_train, y_pred))
print('Adjusted R^2:',1 - (1-metrics.r2_score(y_train, y_pred))*(len(y_train)-1)/(len(y_train)-X_train.shape[1]-1))
print('MAE:',metrics.mean_absolute_error(y_train, y_pred))
print('MSE:',metrics.mean_squared_error(y_train, y_pred))
print('RMSE:',np.sqrt(metrics.mean_squared_error(y_train, y_pred)))

In [None]:
# Visualizing the differences between actual prices and predicted values
plt.scatter(y_train, y_pred,alpha=0.3)
plt.xlabel("Prices")
plt.ylabel("Predicted prices")
plt.title("Prices vs Predicted prices")
plt.show()

### Test Data

In [None]:
#Predicting Test data with the model
y_test_pred = reg.predict(X_test)

In [None]:
# Model Evaluation
acc_xgb = metrics.r2_score(y_test, y_test_pred)
print('R^2:', acc_xgb)
print('Adjusted R^2:',1 - (1-metrics.r2_score(y_test, y_test_pred))*(len(y_test)-1)/(len(y_test)-X_test.shape[1]-1))
print('MAE:',metrics.mean_absolute_error(y_test, y_test_pred))
print('MSE:',metrics.mean_squared_error(y_test, y_test_pred))
print('RMSE:',np.sqrt(metrics.mean_squared_error(y_test, y_test_pred)))

# Evaluation and comparison of all the models

In [None]:
models = pd.DataFrame({
    'Model': ['Linear Regression', 'Random Forest', 'XGBoost'],
    'R-squared Score': [acc_linreg*100, acc_rf*100, acc_xgb*100]})
models.sort_values(by='R-squared Score', ascending=False)

In [None]:
#add test and train outputs to the table as well
#acc_rf_train*100
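
A possible way to put train and test scores side by side is sketched below. It uses synthetic data and freshly fitted models so that it runs on its own; the column names and model settings are assumptions, not the notebook's results.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the housing data)
X_demo, y_demo = make_regression(n_samples=500, n_features=8, noise=20.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=4)

candidates = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

rows = []
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    rows.append({
        "Model": name,
        "Train R^2": model.score(X_tr, y_tr) * 100,
        "Test R^2": model.score(X_te, y_te) * 100,
    })

comparison = pd.DataFrame(rows).sort_values(by="Test R^2", ascending=False)
print(comparison)
```

A large gap between the train and test columns for a model is a sign that it is overfitting.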

Which one works the best for this dataset?

BONUS: Try out and compare an SVM Regressor or any other algorithm of interest.

Here is a useful article if you want to read up a bit more on boosting: https://www.analyticsvidhya.com/blog/2020/02/4-boosting-algorithms-machine-learning/