### A Hybrid Ensemble Algorithm for ESG Score Prediction

- Description of data
- Research questions
- Assumptions and scope
- Data pre-processing
- Applied Machine Learning solution
- Modelling
- Evaluation
- Optimization
- Interpretation of results (visualizations) and discussion


#### Data Description

The dataset used in this study originates from the Bloomberg Terminal and was obtained as CSV files from a GitHub repository. It covers a selection of about 2,000 stocks out of the nearly 3,700 stocks in the Nasdaq Composite Index. The dataset contains ESG (Environmental, Social, and Governance) information, with a focus on the 'E' component because it has the most fields (more than 500) compared with the 'S' and 'G' components.

The data was originally structured in three main columns: Stock Ticker, Field, and Value. Because this layout is highly redundant, it was restructured into a table with one row per stock and one column per field. Due to the lack of data in most fields, the reorganized dataset contains only 21 field columns (plus the ticker) and approximately 2,000 rows. A second dataset contains the already calculated ESG scores.
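
The restructuring step itself is not shown in this notebook. As a rough sketch of how a long (ticker, field, value) table could be pivoted into one row per stock, assuming hypothetical column names `ticker`, `field` and `value` and an illustrative coverage threshold of 0.5:

```python
import pandas as pd

# Hypothetical long-format sample: one row per (ticker, field) pair.
long_df = pd.DataFrame({
    "ticker": ["AAL UW Equity", "AAL UW Equity", "AAPL UW Equity"],
    "field": ["GREEN_BUILDING", "WATER_POLICY", "GREEN_BUILDING"],
    "value": [1, 1, 0],
})

# One row per ticker, one column per field; missing combinations become NaN.
wide_df = long_df.pivot(index="ticker", columns="field", values="value").reset_index()
wide_df.columns.name = None

# Keep only fields populated for a minimum share of tickers
# (the 0.5 threshold is an illustrative assumption, not the study's rule).
coverage = wide_df.drop(columns="ticker").notna().mean()
kept = ["ticker"] + coverage[coverage >= 0.5].index.tolist()
wide_df = wide_df[kept]
```

Dropping sparsely populated fields in this way would be one explanation for how more than 500 raw fields reduce to a few dozen columns.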

The data, which focuses on the 'E' pillar, consists of three groups of columns: greenhouse gas (GHG) emissions, environmental policy, and regulation linked to initiatives such as the Paris Agreement.

#### Research Questions

The study examines how effective ensemble methods are at predicting ESG scores. It further investigates whether a hybrid ensemble approach predicts ESG scores more accurately than individual ensemble methods.

1. How effective are ensemble methods at improving the accuracy of ESG score predictions compared with traditional methods/models?
2. Can a hybrid of ensemble methods predict ESG scores more accurately than individual ensemble methods?
3. How do the predicted ESG scores correlate with financial performance?

#### Assumptions and Scope

1. Assumptions:
    - Accuracy and reliability: ESG performance is self-reported and voluntary, so there is a risk of greenwashing, which may affect the financial analysis.
    - Generalization: the effectiveness of the ensemble methods is assumed to be consistent across different time frames, across the different ESG pillars, and for stocks not listed in the Nasdaq Composite Index.
    
    
2. Scope:
    - The study focuses on ESG data related to environmental factors, specifically environmental policy, GHG emissions, and regulatory adherence related to environmental practices.
    - The analysis is limited to stocks within the Nasdaq Composite Index.
    - The primary focus is on ensemble methods, namely Random Forest and XGBoost.

In [1]:
# Importing necessary libraries

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.preprocessing import MinMaxScaler

#### Data Preparation

In [2]:
# Tickers and fields
env = pd.read_csv('Env_Cat.csv')
env.head()

Unnamed: 0,ticker,EU_TAX_EST_DNSH_ADP_LVL_1,EU_TAX_EST_DNSH_BIODIV_LVL_1,EU_TAX_EST_DNSH_MIT_LVL_1,EU_TAX_EST_DNSH_POLLUTION_LVL_1,EU_TAX_EST_ELIGIBLE_REV_PCT,EU_TAX_EST_DNSH_WASTE_LVL_1,BIODIVERSITY_POLICY,GREEN_BUILDING,EMISSION_REDUCTION,...,RENEWABLE_ELECTRICITY_TARGET_POL,SUS_SUP_GDL_ENC_ESG_AREA_PUB_DIS,CLIMATE_CHG_POLICY,ENVIRON_QUAL_MGT,ENVIRON_SUPPLY_MGT,WATER_POLICY,TOTAL_GHG_CO2_ESTIMATE_PER_SALES,TOTAL_GHG_ESTIMATE,GHG_SCOPE_1_ESTIMATE,GHG_SCOPE_2_ESTIMATE
0,AAL UW Equity,100.0,75.0,95.83,75.0,1.13446,50.0,0,1,1,...,1,1,1,1,1,1,0.984366,0.999269,0.998178,0.999977
1,AAME UQ Equity,0.0,0.0,0.0,0.0,37.0751,0.0,0,0,0,...,0,0,0,0,0,0,0.99675,0.999848,0.999771,0.9998
2,AAOI UQ Equity,25.0,25.0,12.5,33.33,100.0,20.0,0,0,0,...,0,0,0,1,0,0,0.992494,0.999649,0.999693,0.999252
3,AAON UW Equity,25.0,0.0,50.0,0.0,100.0,20.0,0,0,1,...,0,0,1,0,0,1,0.997414,0.999879,0.99985,0.999799
4,AAPL UW Equity,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,...,0,0,0,0,0,0,0.999963,0.999998,1.0,0.999994


In [3]:
env.shape

(1974, 22)

In [4]:
#Basic information about the data
env.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1974 entries, 0 to 1973
Data columns (total 22 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   ticker                            1974 non-null   object 
 1   EU_TAX_EST_DNSH_ADP_LVL_1         1758 non-null   float64
 2   EU_TAX_EST_DNSH_BIODIV_LVL_1      1758 non-null   float64
 3   EU_TAX_EST_DNSH_MIT_LVL_1         1758 non-null   float64
 4   EU_TAX_EST_DNSH_POLLUTION_LVL_1   1758 non-null   float64
 5   EU_TAX_EST_ELIGIBLE_REV_PCT       1914 non-null   float64
 6   EU_TAX_EST_DNSH_WASTE_LVL_1       1758 non-null   float64
 7   BIODIVERSITY_POLICY               1974 non-null   int64  
 8   GREEN_BUILDING                    1974 non-null   int64  
 9   EMISSION_REDUCTION                1974 non-null   int64  
 10  ENERGY_EFFIC_POLICY               1974 non-null   int64  
 11  INDEPENDENT_ASSESSMENT_CONDUCTED  1974 non-null   int64  
 12  RENEWA

In [5]:
#Summary statistics
env.describe()

Unnamed: 0,EU_TAX_EST_DNSH_ADP_LVL_1,EU_TAX_EST_DNSH_BIODIV_LVL_1,EU_TAX_EST_DNSH_MIT_LVL_1,EU_TAX_EST_DNSH_POLLUTION_LVL_1,EU_TAX_EST_ELIGIBLE_REV_PCT,EU_TAX_EST_DNSH_WASTE_LVL_1,BIODIVERSITY_POLICY,GREEN_BUILDING,EMISSION_REDUCTION,ENERGY_EFFIC_POLICY,...,RENEWABLE_ELECTRICITY_TARGET_POL,SUS_SUP_GDL_ENC_ESG_AREA_PUB_DIS,CLIMATE_CHG_POLICY,ENVIRON_QUAL_MGT,ENVIRON_SUPPLY_MGT,WATER_POLICY,TOTAL_GHG_CO2_ESTIMATE_PER_SALES,TOTAL_GHG_ESTIMATE,GHG_SCOPE_1_ESTIMATE,GHG_SCOPE_2_ESTIMATE
count,1758.0,1758.0,1758.0,1758.0,1914.0,1758.0,1974.0,1974.0,1974.0,1974.0,...,1974.0,1974.0,1974.0,1974.0,1974.0,1974.0,1974.0,1974.0,1974.0,1974.0
mean,12.746519,12.627986,12.052349,15.173066,30.799929,13.784716,0.034954,0.087639,0.161601,0.193516,...,0.02381,0.154002,0.118034,0.104863,0.164134,0.118034,0.984151,0.99774,0.998268,0.998473
std,21.588746,21.909535,23.083573,25.810168,45.337766,23.142902,0.183711,0.282841,0.368177,0.395154,...,0.152494,0.361042,0.322731,0.306455,0.370491,0.322731,0.044053,0.03176,0.023431,0.023393
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.986555,0.999552,0.999377,0.999479
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.995417,0.999845,0.999779,0.999809
75%,25.0,25.0,16.67,33.33,100.0,20.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.998142,0.999944,0.999924,0.999935
max,100.0,100.0,100.0,100.0,100.0,100.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [6]:
#Checking for null values
env.isnull().sum()

ticker                                0
EU_TAX_EST_DNSH_ADP_LVL_1           216
EU_TAX_EST_DNSH_BIODIV_LVL_1        216
EU_TAX_EST_DNSH_MIT_LVL_1           216
EU_TAX_EST_DNSH_POLLUTION_LVL_1     216
EU_TAX_EST_ELIGIBLE_REV_PCT          60
EU_TAX_EST_DNSH_WASTE_LVL_1         216
BIODIVERSITY_POLICY                   0
GREEN_BUILDING                        0
EMISSION_REDUCTION                    0
ENERGY_EFFIC_POLICY                   0
INDEPENDENT_ASSESSMENT_CONDUCTED      0
RENEWABLE_ELECTRICITY_TARGET_POL      0
SUS_SUP_GDL_ENC_ESG_AREA_PUB_DIS      0
CLIMATE_CHG_POLICY                    0
ENVIRON_QUAL_MGT                      0
ENVIRON_SUPPLY_MGT                    0
WATER_POLICY                          0
TOTAL_GHG_CO2_ESTIMATE_PER_SALES      0
TOTAL_GHG_ESTIMATE                    0
GHG_SCOPE_1_ESTIMATE                  0
GHG_SCOPE_2_ESTIMATE                  0
dtype: int64

In [7]:
#scores
scores = pd.read_csv('Env_Sc.csv')
scores.head()

Unnamed: 0,ticker,E_score
0,AAL UW Equity,96.3
1,AAME UQ Equity,52.08
2,AAOI UQ Equity,85.16
3,AAON UW Equity,83.33
4,AAPL UW Equity,44.38


In [8]:
#scores.shape

In [9]:
#Basic information about the data
scores.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1974 entries, 0 to 1973
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   ticker   1974 non-null   object 
 1   E_score  1974 non-null   float64
dtypes: float64(1), object(1)
memory usage: 31.0+ KB


In [10]:
#Summary statistics
scores.describe()

Unnamed: 0,E_score
count,1974.0
mean,50.025274
std,28.874698
min,0.05
25%,25.0425
50%,50.025
75%,75.015
max,100.0


In [11]:
#Checking for null values
scores.isnull().sum()

ticker     0
E_score    0
dtype: int64

In [12]:
#Fill null values with 0 and check for nulls again
env = env.fillna(0)
env.isnull().sum()

ticker                              0
EU_TAX_EST_DNSH_ADP_LVL_1           0
EU_TAX_EST_DNSH_BIODIV_LVL_1        0
EU_TAX_EST_DNSH_MIT_LVL_1           0
EU_TAX_EST_DNSH_POLLUTION_LVL_1     0
EU_TAX_EST_ELIGIBLE_REV_PCT         0
EU_TAX_EST_DNSH_WASTE_LVL_1         0
BIODIVERSITY_POLICY                 0
GREEN_BUILDING                      0
EMISSION_REDUCTION                  0
ENERGY_EFFIC_POLICY                 0
INDEPENDENT_ASSESSMENT_CONDUCTED    0
RENEWABLE_ELECTRICITY_TARGET_POL    0
SUS_SUP_GDL_ENC_ESG_AREA_PUB_DIS    0
CLIMATE_CHG_POLICY                  0
ENVIRON_QUAL_MGT                    0
ENVIRON_SUPPLY_MGT                  0
WATER_POLICY                        0
TOTAL_GHG_CO2_ESTIMATE_PER_SALES    0
TOTAL_GHG_ESTIMATE                  0
GHG_SCOPE_1_ESTIMATE                0
GHG_SCOPE_2_ESTIMATE                0
dtype: int64
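
Filling with 0 makes a missing EU taxonomy value indistinguishable from a genuine value of 0. One possible refinement, shown here on toy data and not something this notebook does, is to record a missingness flag before filling. The `_MISSING` suffix is a hypothetical naming choice.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value and one genuine zero.
toy = pd.DataFrame({
    "ticker": ["A", "B", "C"],
    "EU_TAX_EST_DNSH_MIT_LVL_1": [50.0, np.nan, 0.0],
})

col = "EU_TAX_EST_DNSH_MIT_LVL_1"
# Flag rows where the value was missing before imputation (hypothetical column name).
toy[col + "_MISSING"] = toy[col].isna().astype(int)
toy[col] = toy[col].fillna(0)
```

After this step rows B and C both hold 0 in the original column, but only B carries the flag.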

In [13]:
# Combine the two dataframes matching the fields and the scores

data = env.merge(scores, on='ticker')

data.head()

Unnamed: 0,ticker,EU_TAX_EST_DNSH_ADP_LVL_1,EU_TAX_EST_DNSH_BIODIV_LVL_1,EU_TAX_EST_DNSH_MIT_LVL_1,EU_TAX_EST_DNSH_POLLUTION_LVL_1,EU_TAX_EST_ELIGIBLE_REV_PCT,EU_TAX_EST_DNSH_WASTE_LVL_1,BIODIVERSITY_POLICY,GREEN_BUILDING,EMISSION_REDUCTION,...,SUS_SUP_GDL_ENC_ESG_AREA_PUB_DIS,CLIMATE_CHG_POLICY,ENVIRON_QUAL_MGT,ENVIRON_SUPPLY_MGT,WATER_POLICY,TOTAL_GHG_CO2_ESTIMATE_PER_SALES,TOTAL_GHG_ESTIMATE,GHG_SCOPE_1_ESTIMATE,GHG_SCOPE_2_ESTIMATE,E_score
0,AAL UW Equity,100.0,75.0,95.83,75.0,1.13446,50.0,0,1,1,...,1,1,1,1,1,0.984366,0.999269,0.998178,0.999977,96.3
1,AAME UQ Equity,0.0,0.0,0.0,0.0,37.0751,0.0,0,0,0,...,0,0,0,0,0,0.99675,0.999848,0.999771,0.9998,52.08
2,AAOI UQ Equity,25.0,25.0,12.5,33.33,100.0,20.0,0,0,0,...,0,0,1,0,0,0.992494,0.999649,0.999693,0.999252,85.16
3,AAON UW Equity,25.0,0.0,50.0,0.0,100.0,20.0,0,0,1,...,0,1,0,0,1,0.997414,0.999879,0.99985,0.999799,83.33
4,AAPL UW Equity,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,...,0,0,0,0,0,0.999963,0.999998,1.0,0.999994,44.38
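
Because `merge` defaults to an inner join, a ticker present in only one of the two files would be dropped without warning. The following sketch on toy frames (the frame names are hypothetical) shows how the `validate` and `indicator` options of `DataFrame.merge` could be used to check the join.

```python
import pandas as pd

env_toy = pd.DataFrame({"ticker": ["A", "B", "C"], "WATER_POLICY": [1, 0, 1]})
scores_toy = pd.DataFrame({"ticker": ["A", "B"], "E_score": [96.3, 52.08]})

# validate raises MergeError if a ticker is duplicated on either side;
# indicator adds a _merge column showing where each row came from.
merged = env_toy.merge(scores_toy, on="ticker", how="left",
                       validate="one_to_one", indicator=True)

# Tickers that have features but no score.
unmatched = merged.loc[merged["_merge"] != "both", "ticker"].tolist()
print(unmatched)  # ['C']
```

In the notebook both files have 1,974 rows, so an empty `unmatched` list would confirm that no stock was lost in the join.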


In [14]:
data.columns

Index(['ticker', 'EU_TAX_EST_DNSH_ADP_LVL_1', 'EU_TAX_EST_DNSH_BIODIV_LVL_1',
       'EU_TAX_EST_DNSH_MIT_LVL_1', 'EU_TAX_EST_DNSH_POLLUTION_LVL_1',
       'EU_TAX_EST_ELIGIBLE_REV_PCT', 'EU_TAX_EST_DNSH_WASTE_LVL_1',
       'BIODIVERSITY_POLICY', 'GREEN_BUILDING', 'EMISSION_REDUCTION',
       'ENERGY_EFFIC_POLICY', 'INDEPENDENT_ASSESSMENT_CONDUCTED',
       'RENEWABLE_ELECTRICITY_TARGET_POL', 'SUS_SUP_GDL_ENC_ESG_AREA_PUB_DIS',
       'CLIMATE_CHG_POLICY', 'ENVIRON_QUAL_MGT', 'ENVIRON_SUPPLY_MGT',
       'WATER_POLICY', 'TOTAL_GHG_CO2_ESTIMATE_PER_SALES',
       'TOTAL_GHG_ESTIMATE', 'GHG_SCOPE_1_ESTIMATE', 'GHG_SCOPE_2_ESTIMATE',
       'E_score'],
      dtype='object')

In [15]:
data.isna().sum()

ticker                              0
EU_TAX_EST_DNSH_ADP_LVL_1           0
EU_TAX_EST_DNSH_BIODIV_LVL_1        0
EU_TAX_EST_DNSH_MIT_LVL_1           0
EU_TAX_EST_DNSH_POLLUTION_LVL_1     0
EU_TAX_EST_ELIGIBLE_REV_PCT         0
EU_TAX_EST_DNSH_WASTE_LVL_1         0
BIODIVERSITY_POLICY                 0
GREEN_BUILDING                      0
EMISSION_REDUCTION                  0
ENERGY_EFFIC_POLICY                 0
INDEPENDENT_ASSESSMENT_CONDUCTED    0
RENEWABLE_ELECTRICITY_TARGET_POL    0
SUS_SUP_GDL_ENC_ESG_AREA_PUB_DIS    0
CLIMATE_CHG_POLICY                  0
ENVIRON_QUAL_MGT                    0
ENVIRON_SUPPLY_MGT                  0
WATER_POLICY                        0
TOTAL_GHG_CO2_ESTIMATE_PER_SALES    0
TOTAL_GHG_ESTIMATE                  0
GHG_SCOPE_1_ESTIMATE                0
GHG_SCOPE_2_ESTIMATE                0
E_score                             0
dtype: int64

#### Feature Engineering

The policy variables are binary (0 or 1), the regulation variables range from 0 to 100, and the GHG variables range from 0 to 1. We rescale all of them, together with the target E_score, to a common 0-100 range using min-max scaling, with the aim of improving performance and decreasing processing time.

In [31]:
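# All columns except the ticker: the 21 features and the E_score target
numeric_columns = data.columns[1:]

# Min-max scaler that maps every column to the 0-100 range
scaler = MinMaxScaler(feature_range=(0, 100))

# Fit the scaler on the full table and rebuild a DataFrame with the same index
data1 = pd.DataFrame(scaler.fit_transform(data[numeric_columns]), columns=numeric_columns, index=data.index)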
data1.head()

Unnamed: 0,EU_TAX_EST_DNSH_ADP_LVL_1,EU_TAX_EST_DNSH_BIODIV_LVL_1,EU_TAX_EST_DNSH_MIT_LVL_1,EU_TAX_EST_DNSH_POLLUTION_LVL_1,EU_TAX_EST_ELIGIBLE_REV_PCT,EU_TAX_EST_DNSH_WASTE_LVL_1,BIODIVERSITY_POLICY,GREEN_BUILDING,EMISSION_REDUCTION,ENERGY_EFFIC_POLICY,...,SUS_SUP_GDL_ENC_ESG_AREA_PUB_DIS,CLIMATE_CHG_POLICY,ENVIRON_QUAL_MGT,ENVIRON_SUPPLY_MGT,WATER_POLICY,TOTAL_GHG_CO2_ESTIMATE_PER_SALES,TOTAL_GHG_ESTIMATE,GHG_SCOPE_1_ESTIMATE,GHG_SCOPE_2_ESTIMATE,E_score
0,100.0,75.0,95.83,75.0,1.13446,50.0,0.0,100.0,100.0,100.0,...,100.0,100.0,100.0,100.0,100.0,98.436638,99.926938,99.817823,99.997676,96.298149
1,0.0,0.0,0.0,0.0,37.0751,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,99.675003,99.984785,99.977111,99.97998,52.056028
2,25.0,25.0,12.5,33.33,100.0,20.0,0.0,0.0,0.0,0.0,...,0.0,0.0,100.0,0.0,0.0,99.249397,99.964904,99.969277,99.925166,85.152576
3,25.0,0.0,50.0,0.0,100.0,20.0,0.0,0.0,100.0,100.0,...,0.0,100.0,0.0,0.0,100.0,99.741364,99.987885,99.984971,99.979908,83.321661
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,99.996269,99.999794,99.999973,99.999361,44.352176


In [32]:
# Separate the target variable 'E_score' from the features
X = data1.drop(columns=['E_score'])
y = data1['E_score']

# Split the data into training and testing sets (adjust the test_size as needed)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
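
In the cells above, the scaler is fit on the full table (including E_score) before the split, so the minimum and maximum of the test rows influence the scaling. A common alternative, sketched here on synthetic data with hypothetical variable names, is to split first and fit the scaler on the training features only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 1, size=(100, 3))   # synthetic stand-in for the features
y_toy = rng.uniform(0, 100, size=100)      # synthetic stand-in for E_score

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

# Learn min/max from the training rows only, then apply them to both splits.
feature_scaler = MinMaxScaler(feature_range=(0, 100))
X_tr_scaled = feature_scaler.fit_transform(X_tr)
X_te_scaled = feature_scaler.transform(X_te)   # may fall slightly outside 0-100
```

Leaving the target unscaled also keeps the error metrics in the original score units.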

#### Modeling

We will use a variety of models to predict scores including:

- Default (Baseline) - Linear Regression & Decision Trees
- Ensemble - Random Forest & Gradient Boosting
- Hybrid Model 

1. Linear Regression (Baseline Model)

In [33]:
from sklearn.linear_model import LinearRegression

model = LinearRegression()

model.fit(X_train, y_train)

# Make predictions on the test data
y_pred = model.predict(X_test)

# Evaluate the model's performance
lr_mse = mean_squared_error(y_test, y_pred)
lr_mae = mean_absolute_error(y_test, y_pred)
lr_r2 = r2_score(y_test, y_pred)

print("Coefficients:", model.coef_)
print("LR - Mean Squared Error:", lr_mse)
print("LR - Mean Absolute Error:", lr_mae)
print("LR - R-squared:", lr_r2)

Coefficients: [ 4.06557910e-01  2.01242630e-01 -2.96379602e-01  2.69279357e-01
  3.05316245e-01  2.61585733e-01  3.32595010e-02  2.91322093e-02
  1.18252925e-01  6.36527929e-02  1.75689268e-02 -8.25082315e-02
  6.69436145e-02 -1.38916546e-04 -1.34038252e-01 -9.34098168e-02
 -3.67089222e-02  1.14753991e+00  6.41683595e-01 -9.41772256e+00
  8.93420139e+00]
LR - Mean Squared Error: 165.91613578100834
LR - Mean Absolute Error: 10.289108132064618
LR - R-squared: 0.8010868793536596
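
For reference, the three metrics reported throughout can be computed by hand. The sketch below uses toy values (not taken from the dataset) and follows the standard definitions that `mean_squared_error`, `mean_absolute_error` and `r2_score` implement.

```python
import numpy as np

y_true = np.array([10.0, 20.0, 30.0, 40.0])   # toy values, not from the dataset
y_hat = np.array([12.0, 18.0, 33.0, 37.0])

residuals = y_true - y_hat
mse = np.mean(residuals ** 2)                  # mean squared error
mae = np.mean(np.abs(residuals))               # mean absolute error
ss_res = np.sum(residuals ** 2)                # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2) # total sum of squares
r2 = 1 - ss_res / ss_tot                       # coefficient of determination
```

MSE penalizes large errors more heavily than MAE, which is why the two can rank models differently.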


2. Decision Trees (Baseline Model)

In [34]:
from sklearn.tree import DecisionTreeRegressor

dt_model = DecisionTreeRegressor(random_state=42)

dt_model.fit(X_train, y_train)

# Make predictions on the test data
y_pred_dt = dt_model.predict(X_test)

# Evaluate the model's performance
mse_dt = mean_squared_error(y_test, y_pred_dt)
mae_dt = mean_absolute_error(y_test, y_pred_dt)
r2_dt = r2_score(y_test, y_pred_dt)

print("Decision Tree - Mean Squared Error:", mse_dt)
print("Decision Tree - Mean Absolute Error:", mae_dt)
print("Decision Tree - R-squared:", r2_dt)

Decision Tree - Mean Squared Error: 17.23864404020457
Decision Tree - Mean Absolute Error: 1.223514107867677
Decision Tree - R-squared: 0.9793329776781058


3. Random Forest

A random forest is an ensemble of multiple decision trees. Compared with a single decision tree (DT), it is less prone to overfitting because it averages the predictions of many trees, each trained on a bootstrap sample of the data.
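
To make the averaging concrete, the sketch below fits a small forest on synthetic data (the data and variable names are illustrative assumptions) and checks that the forest's prediction equals the mean of its individual trees, which scikit-learn exposes through `estimators_`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 100, size=(200, 4))               # synthetic features
y_toy = X_toy[:, 0] * 0.5 + rng.normal(0, 5, size=200)   # synthetic target

forest = RandomForestRegressor(n_estimators=20, random_state=42).fit(X_toy, y_toy)

# Each fitted tree is available in estimators_; the forest prediction is their mean.
per_tree = np.stack([tree.predict(X_toy[:5]) for tree in forest.estimators_])
manual_mean = per_tree.mean(axis=0)
matches = np.allclose(manual_mean, forest.predict(X_toy[:5]))
```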

In [35]:
from sklearn.ensemble import RandomForestRegressor

rf_model = RandomForestRegressor(n_estimators=100, random_state=42)

rf_model.fit(X_train, y_train)

# Make predictions on the test data
y_pred = rf_model.predict(X_test)

# Evaluate the model's performance
rf_mse = mean_squared_error(y_test, y_pred)
rf_mae = mean_absolute_error(y_test, y_pred)
rf_r2 = r2_score(y_test, y_pred)

print("RF - Mean Squared Error:", rf_mse)
print("RF - Mean Absolute Error:", rf_mae)
print("RF - R-squared:", rf_r2)

RF - Mean Squared Error: 7.898663274415982
RF - Mean Absolute Error: 1.014001102313065
RF - R-squared: 0.9905304703882299


4. XGBoost

XGBoost (Extreme Gradient Boosting) is a scalable, distributed implementation of gradient-boosted decision trees. It is an ensemble learning method that builds trees sequentially, each one correcting the errors of the previous ones, and combines these weak models to produce a stronger prediction.
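
The core idea of boosting can be sketched in a few lines. The loop below is plain gradient boosting for squared error using shallow scikit-learn trees on synthetic data; it is a simplification for illustration and not XGBoost's actual algorithm, which adds regularization and second-order information.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(300, 1))
y_toy = np.sin(X_toy[:, 0]) * 10 + rng.normal(0, 0.5, size=300)

learning_rate = 0.1
prediction = np.full_like(y_toy, y_toy.mean())   # start from the mean
mse_history = [np.mean((y_toy - prediction) ** 2)]

for _ in range(50):
    residuals = y_toy - prediction               # what the ensemble still gets wrong
    weak_tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_toy, residuals)
    prediction += learning_rate * weak_tree.predict(X_toy)
    mse_history.append(np.mean((y_toy - prediction) ** 2))
```

Each shallow tree is a weak model on its own; the training error recorded in `mse_history` falls as the corrections accumulate.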

In [36]:
import xgboost as xgb

xgb_model = xgb.XGBRegressor(random_state=42)

xgb_model.fit(X_train, y_train)

# Make predictions on the test data
y_pred_xgb = xgb_model.predict(X_test)

# Evaluate the model's performance
mse_xgb = mean_squared_error(y_test, y_pred_xgb)
mae_xgb = mean_absolute_error(y_test, y_pred_xgb)
r2_xgb = r2_score(y_test, y_pred_xgb)

print("XGBoost - Mean Squared Error:", mse_xgb)
print("XGBoost - Mean Absolute Error:", mae_xgb)
print("XGBoost - R-squared:", r2_xgb)

XGBoost - Mean Squared Error: 5.212598921013828
XGBoost - Mean Absolute Error: 1.0318610426060038
XGBoost - R-squared: 0.9937507324819501


#### Hybrid Models

1. Random Forest and Gradient Boosting: Combining the bagging-based Random Forest with the boosting-based XGBoost. We stack the two models: their predictions form the input features of a linear regression meta-model, which learns a weighted combination of them rather than taking a plain average.

In [37]:
# Make predictions on the test data using both models
y_pred_rf = rf_model.predict(X_test)
y_pred_xgb = xgb_model.predict(X_test)

# Build the stacking feature matrix: one column of predictions per base model
X_stack = np.column_stack((y_pred_rf, y_pred_xgb))

# Create a meta-model (Linear Regression) for stacking.
# Stacking ensembles several classification or regression models: the meta-model
# learns how to combine the base models' predictions to get better predictions.
meta_model = LinearRegression()

# Fit the meta-model on the stacked predictions.
# Caveat: it is fit on test-set predictions and y_test and evaluated on the same
# data below, so the reported metrics are in-sample and likely optimistic.
meta_model.fit(X_stack, y_test)

# Make predictions on the test data using the meta-model
stacked_predictions = meta_model.predict(X_stack)

# Evaluate the stacked model's performance
mse_stacked = mean_squared_error(y_test, stacked_predictions)
mae_stacked = mean_absolute_error(y_test, stacked_predictions)
r2_stacked = r2_score(y_test, stacked_predictions)

# Print the evaluation metrics for the stacked model
print("Stacked Model - Mean Squared Error:", mse_stacked)
print("Stacked Model - Mean Absolute Error:", mae_stacked)
print("Stacked Model - R-squared:", r2_stacked)

Stacked Model - Mean Squared Error: 5.0931130843850445
Stacked Model - Mean Absolute Error: 1.0368247597686469
Stacked Model - R-squared: 0.9938939813620243
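
An alternative way to stack, sketched below, keeps the test set out of meta-model training by using scikit-learn's `StackingRegressor`, which fits the meta-model on out-of-fold predictions from the training data. The data here is synthetic, and `GradientBoostingRegressor` is used as a stand-in for `XGBRegressor` to keep the sketch dependent on scikit-learn only; this is not the cell that produced the numbers above.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 100, size=(300, 5))   # synthetic features
y_toy = 0.4 * X_toy[:, 0] + 0.2 * X_toy[:, 1] + rng.normal(0, 3, size=300)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

stack = StackingRegressor(
    estimators=[
        ("rf", RandomForestRegressor(n_estimators=50, random_state=42)),
        ("gb", GradientBoostingRegressor(random_state=42)),  # stand-in for XGBRegressor
    ],
    final_estimator=LinearRegression(),
    cv=5,  # meta-model is trained on out-of-fold predictions from the training set
)
stack.fit(X_tr, y_tr)

# The test set is used only once, for evaluation.
test_r2 = r2_score(y_te, stack.predict(X_te))
```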


In [38]:
# The matrix of predictions by the two models
X_stack[:5,:]

array([[48.81770885, 48.13746643],
       [30.90865433, 31.92324257],
       [27.83281641, 30.19358253],
       [60.05962981, 61.24953079],
       [11.33674112, 11.31967926]])

2. Bagging and Boosting Ensembles: Combining bagging (e.g., Random Forest) and boosting (e.g., XGBoost) into a single hybrid ensemble. We train several Random Forest models and then fit an XGBoost model on their predictions.

    - Bagging: Ensemble learning method that is commonly used to reduce variance within a noisy dataset. Bagging fits base models on random subsets of the original dataset and then aggregates their individual predictions (by voting for classification or by averaging for regression) to form a final prediction.
    - Boosting: Ensemble modeling technique that builds a strong model from a number of weak models (mostly shallow decision trees). The weak models are added sequentially, each one focusing on the errors of the previous ones.

In [39]:
from sklearn.ensemble import BaggingRegressor

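# Train multiple Random Forest models.
# Every model is fit on the full training set; diversity comes only from the
# different random seeds (bootstrap samples drawn inside each forest).
# The seeds are drawn from an unseeded NumPy generator, so results vary between runs.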
num_bagged_models = 5
bagged_models = []

for i in range(num_bagged_models):
    rf_model = RandomForestRegressor(n_estimators=100, random_state=np.random.randint(1, 1000))
    bagged_models.append(rf_model)

# Fit each bagged model on the training data
for model in bagged_models:
    model.fit(X_train, y_train)

# Combine the Predictions of Bagged Models
bagged_predictions = np.zeros((len(X_test), num_bagged_models))

for i, model in enumerate(bagged_models):
    bagged_predictions[:, i] = model.predict(X_test)

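# Train a boosting model (XGBoost) on the bagged predictions.
# Caveat: the booster is fit on test-set predictions and y_test and evaluated on
# the same data below, so the reported metrics are in-sample and likely optimistic.
xgb_model = xgb.XGBRegressor(random_state=42)
xgb_model.fit(bagged_predictions, y_test)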

# Make final predictions using the boosting model.
# bagged_predictions_test is recomputed from X_test and is identical to bagged_predictions.
bagged_predictions_test = np.zeros((len(X_test), num_bagged_models))

for i, model in enumerate(bagged_models):
    bagged_predictions_test[:, i] = model.predict(X_test)

final_predictions = xgb_model.predict(bagged_predictions_test)

# Evaluate the hybrid ensemble model's performance
mse_hybrid = mean_squared_error(y_test, final_predictions)
mae_hybrid = mean_absolute_error(y_test, final_predictions)
r2_hybrid = r2_score(y_test, final_predictions)

# Print the evaluation metrics for the hybrid ensemble model
print("Hybrid Ensemble Model - Mean Squared Error:", mse_hybrid)
print("Hybrid Ensemble Model - Mean Absolute Error:", mae_hybrid)
print("Hybrid Ensemble Model - R-squared:", r2_hybrid)

Hybrid Ensemble Model - Mean Squared Error: 0.0470184902972982
Hybrid Ensemble Model - Mean Absolute Error: 0.1175445673553231
Hybrid Ensemble Model - R-squared: 0.9999436305903034
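
A variant of this hybrid that trains the booster without touching the test set can be sketched with `cross_val_predict`. The example uses synthetic data, fixed seeds, and `GradientBoostingRegressor` as a stand-in for `XGBRegressor`; it illustrates the idea under those assumptions and does not reproduce the metrics above.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_predict, train_test_split

rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 100, size=(300, 5))   # synthetic features
y_toy = 0.4 * X_toy[:, 0] + 0.2 * X_toy[:, 1] + rng.normal(0, 3, size=300)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

seeds = [0, 1, 2]  # fixed seeds keep the run reproducible
forests = [RandomForestRegressor(n_estimators=50, random_state=s) for s in seeds]

# Out-of-fold predictions on the training set become the booster's features.
oof = np.column_stack([cross_val_predict(f, X_tr, y_tr, cv=5) for f in forests])
booster = GradientBoostingRegressor(random_state=42).fit(oof, y_tr)  # stand-in for XGBRegressor

# Refit each forest on all training rows, then predict the untouched test set.
test_features = np.column_stack([f.fit(X_tr, y_tr).predict(X_te) for f in forests])
mse_test = mean_squared_error(y_te, booster.predict(test_features))
```

With this setup the test-set MSE measures out-of-sample error, so it can be compared directly with the single-model results.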


In [40]:
bagged_predictions

array([[47.90945473, 48.84082041, 48.83911956, 48.35297649, 48.27433717],
       [30.8190095 , 30.84002001, 30.83911956, 30.95197599, 30.87913957],
       [28.49084542, 28.29334667, 28.58689345, 28.43311656, 29.64282141],
       ...,
       [86.4128064 , 86.4010005 , 86.37958979, 86.42841421, 86.44482241],
       [11.33118362, 11.34330283, 11.3343034 , 11.34507367, 11.33555888],
       [41.45789561, 41.73190762, 39.92323662, 39.96710022, 41.02434551]])

In [41]:
results = {
    'Model': ['LR', 'DT', 'RF', 'XGB', 'Hybrid1', 'Hybrid2'],
    'MSE': [lr_mse, mse_dt, rf_mse, mse_xgb, mse_stacked, mse_hybrid],
    'R2': [lr_r2, r2_dt, rf_r2, r2_xgb, r2_stacked, r2_hybrid]
}

results_df = pd.DataFrame(results)

results_df


Unnamed: 0,Model,MSE,R2
0,LR,165.916136,0.801087
1,DT,17.238644,0.979333
2,RF,7.898663,0.99053
3,XGB,5.212599,0.993751
4,Hybrid1,5.093113,0.993894
5,Hybrid2,0.047018,0.999944
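
MSE is expressed in squared score units, which makes the table harder to read against a 0-100 score. A small sketch of converting it to RMSE follows, with the MSE values copied from the table above. The Hybrid1 and Hybrid2 figures come from meta-models that were fit on the test set, so they are not directly comparable with the other rows.

```python
import numpy as np
import pandas as pd

# MSE values copied from the results table above.
summary = pd.DataFrame({
    "Model": ["LR", "DT", "RF", "XGB", "Hybrid1", "Hybrid2"],
    "MSE": [165.916136, 17.238644, 7.898663, 5.212599, 5.093113, 0.047018],
})

# RMSE is in the same units as the 0-100 score.
summary["RMSE"] = np.sqrt(summary["MSE"])
summary = summary.sort_values("RMSE").reset_index(drop=True)
```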
