## Ensemble Learning

An ensemble is a collection of elements treated as a single unit rather than as individual components. An ensemble method builds multiple models and combines them to address a problem. Ensemble techniques improve a model's robustness and its capacity for generalization.
<br>
<img src="https://github.com/EDGE-Programe/Python-Basics/blob/master/Python_edge_program/Data%20Science/notebook_images/nb_24/ensemble1.png?raw=1" alt="Ensemble Learning" width=58% height=49% title="Ensemble Learning">
<br>
Ensemble learning is a machine learning technique that combines multiple models (learners) to improve overall predictive performance and make a machine learning system more robust. It is based on the idea that combining several weak learners (models that are only slightly better than random guessing) can often yield a strong learner (a model with high predictive accuracy).
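
The weak-to-strong idea can be checked with a small calculation. The sketch below assumes a binary problem in which every model is correct with the same probability and the models err independently of each other (an idealization that real ensembles only approximate). The helper name `majority_vote_accuracy` is hypothetical.

```python
from math import comb


def majority_vote_accuracy(n_models: int, p_correct: float) -> float:
    """Probability that a strict majority of n independent models is correct."""
    majority = n_models // 2 + 1
    return sum(
        comb(n_models, k) * p_correct**k * (1 - p_correct) ** (n_models - k)
        for k in range(majority, n_models + 1)
    )


# Each hypothetical weak learner is right 60% of the time.
for n in (1, 5, 25, 101):
    print(n, round(majority_vote_accuracy(n, 0.6), 4))
```

Under these assumptions the accuracy of the majority vote grows with the number of models, which is the intuition behind combining weak learners.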

Now, let's use a **multivariate dataset**, the QCM alcohol dataset from the UCI Machine Learning Repository, to study ensemble methods.

### Basic Ensemble Methods

**1. Averaging:** This approach is primarily applied in regression scenarios. It involves training several models separately and returning the average of their predictions as the result. The averaged output often outperforms the individual outputs because averaging reduces variance, although it is not guaranteed to beat the best single model. In the following illustration, we train three regression models (linear regression, XGBoost, and random forest) and average their predictions. The final prediction is stored in **`pred_final`**.

In [None]:
# loading the five QCM csv files (they are combined in the next cell)
import pandas as pd

qcm3 = pd.read_csv('../datasets/qcm_alcohol/QCM3.csv', sep = ';')
qcm6 = pd.read_csv('../datasets/qcm_alcohol/QCM6.csv', sep = ';')
qcm7 = pd.read_csv('../datasets/qcm_alcohol/QCM7.csv', sep = ';')
qcm10 = pd.read_csv('../datasets/qcm_alcohol/QCM10.csv', sep = ';')
qcm12 = pd.read_csv('../datasets/qcm_alcohol/QCM12.csv', sep = ';')

In [None]:
dataset = pd.concat([qcm3, qcm6, qcm7, qcm10, qcm12])
print("Shape of dataset: ", dataset.shape)

Shape of dataset:  (125, 15)


In [None]:
# training data

X = dataset.iloc[:, 0:10].values
y = dataset.iloc[:, [10,11,12,13,14]].values

In [None]:
! pip install xgboost



In [None]:
# importing utility modules
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# importing machine learning models for prediction
from sklearn.ensemble import RandomForestRegressor
import xgboost as xgb
from sklearn.linear_model import LinearRegression


# splitting the data into training and validation datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)

# initializing all the model objects with default parameters
model_1 = LinearRegression()
model_2 = xgb.XGBRegressor()
model_3 = RandomForestRegressor()

# training all the model on the training dataset
model_1.fit(X_train, y_train)
model_2.fit(X_train, y_train)
model_3.fit(X_train, y_train)

# predicting the output on the validation dataset
pred_1 = model_1.predict(X_test)
pred_2 = model_2.predict(X_test)
pred_3 = model_3.predict(X_test)

# final prediction after averaging on the prediction of all 3 models
pred_final = (pred_1+pred_2+pred_3)/3.0

# printing the mean squared error between real value and predicted value
print("MSE for Model_1:",mean_squared_error(y_test, pred_1))
print("MSE for Model_2:",mean_squared_error(y_test, pred_2))
print("MSE for Model_3:",mean_squared_error(y_test, pred_3))
print("MSE for Average Ensemble:",mean_squared_error(y_test, pred_final))

MSE for Model_1: 0.09631921967994472
MSE for Model_2: 0.07602632533272329
MSE for Model_3: 0.035407999999999995
MSE for Average Ensemble: 0.04763073766233296
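
The variance-reduction argument behind averaging can be illustrated with synthetic numbers. This is a minimal sketch, assuming three hypothetical models whose prediction errors are independent and equally noisy; real models trained on the same data are usually correlated, so the reduction in practice is smaller.

```python
import numpy as np

rng = np.random.default_rng(0)
true_value = 1.0
n_trials, n_models = 10_000, 3

# Each column simulates one hypothetical model: truth plus independent noise.
preds = true_value + rng.normal(0.0, 0.5, size=(n_trials, n_models))

single_var = preds[:, 0].var()       # variance of one model's predictions
avg_var = preds.mean(axis=1).var()   # variance of the averaged prediction

print("single model variance:", round(single_var, 4))
print("averaged variance:    ", round(avg_var, 4))
```

With independent errors, the variance of the average is roughly the single-model variance divided by the number of models.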


**2. Majority Voting:** In voting, multiple models are trained individually, and their predictions are combined to produce a final prediction, typically through one of these methods:
<br>
<img src="https://github.com/EDGE-Programe/Python-Basics/blob/master/Python_edge_program/Data%20Science/notebook_images/nb_24/ensemble2.jpg?raw=1" alt="Ensemble Learning" width=49% height=31% title="Ensemble Learning">
<br>

- **Hard Voting:** In hard voting, the ultimate prediction is the most frequently occurring prediction among all the models. Let's say we have three classifiers that made predictions for the output class, and their predictions are (A, A, B). In this case, the majority of the classifiers have predicted class A as the output. Therefore, the final prediction will be class A.
<br>
<img src="https://github.com/EDGE-Programe/Python-Basics/blob/master/Python_edge_program/Data%20Science/notebook_images/nb_24/ensemble4.jpg?raw=1" alt="Hard Voting" width=31% height=31% title="Hard Voting Classifier">
<br>

- **Soft Voting:** In contrast, soft voting has each model output a probability for every class instead of a single class label. The probabilities are averaged across models, and the predicted class is the one with the highest average probability. Consider three models whose predicted probabilities for class A are (0.30, 0.47, 0.53) and for class B are (0.20, 0.32, 0.40). The averages are 0.4333 for class A and 0.3067 for class B, so class A wins because it has the highest average probability among all the classifiers.
<br>
<img src="https://github.com/EDGE-Programe/Python-Basics/blob/master/Python_edge_program/Data%20Science/notebook_images/nb_24/ensemble3.jpg?raw=1" alt="Soft Voting" width=31% height=31% title="Soft Voting Classifier">
<br>

- **Weighted Voting:** Weighted voting operates under the premise that certain models possess higher predictive accuracy than others. Therefore, these more skillful models are assigned a greater influence or weight when contributing to the final prediction.
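
The hard and soft voting examples above can be reproduced in a few lines of plain Python. This is only a sketch of the arithmetic, using the labels and probabilities from the text; it is not how any particular library implements voting.

```python
from collections import Counter

# Hard voting: the three class labels from the example (A, A, B).
hard_votes = ["A", "A", "B"]
hard_winner = Counter(hard_votes).most_common(1)[0][0]

# Soft voting: per-model probabilities for each class, as in the example.
probs = {"A": [0.30, 0.47, 0.53], "B": [0.20, 0.32, 0.40]}
avg_probs = {label: sum(p) / len(p) for label, p in probs.items()}
soft_winner = max(avg_probs, key=avg_probs.get)

print("hard voting winner:", hard_winner)
print("average probabilities:", {k: round(v, 4) for k, v in avg_probs.items()})
print("soft voting winner:", soft_winner)
```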

In [None]:
# importing libraries
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# loading iris dataset
iris = load_iris()
X = iris.data
Y = iris.target

# train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.20, random_state = 42)

# group / ensemble of models
estimator = []
estimator.append(('LR',LogisticRegression(solver ='lbfgs',multi_class ='multinomial',max_iter = 200)))
estimator.append(('SVC', SVC(gamma ='auto', probability = True)))
estimator.append(('DTC', DecisionTreeClassifier()))

# Voting Classifier with hard voting
vot_hard = VotingClassifier(estimators = estimator, voting ='hard')
vot_hard.fit(X_train, y_train)
y_pred = vot_hard.predict(X_test)

# using accuracy_score metric to predict accuracy
score = accuracy_score(y_test, y_pred)
print("Hard Voting Score % f" % score)

# Voting Classifier with soft voting
vot_soft = VotingClassifier(estimators = estimator, voting ='soft')
vot_soft.fit(X_train, y_train)
y_pred = vot_soft.predict(X_test)

# using accuracy_score
score = accuracy_score(y_test, y_pred)
print("Soft Voting Score % f" % score)

Hard Voting Score  1.000000
Soft Voting Score  1.000000


### Weighted Voting Classifier

Weighted voting is useful when the models in the ensemble differ drastically in terms of their performance. In that case, we need a way to prioritize the models with the best accuracy.
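
To make the effect of the weights concrete, the sketch below checks that a soft `VotingClassifier` with weights returns the weighted average of the class probabilities of its fitted members. The choice of base models here (logistic regression, Gaussian naive Bayes, decision tree) and training on the full iris data are assumptions made for a short, fast illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = load_iris(return_X_y=True)
demo_weights = [0.9, 0.8, 0.76]

clf = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=200)),
        ("nb", GaussianNB()),
        ("dt", DecisionTreeClassifier(random_state=0)),
    ],
    voting="soft",
    weights=demo_weights,
)
clf.fit(X_demo, y_demo)

# Stack each fitted member's probabilities: shape (n_models, n_samples, n_classes).
member_probas = np.array([est.predict_proba(X_demo) for est in clf.estimators_])

# Weighted average over the model axis, computed by hand.
manual_probas = np.average(member_probas, axis=0, weights=demo_weights)

print(np.allclose(manual_probas, clf.predict_proba(X_demo)))
```

A model with a larger weight pulls the averaged probabilities toward its own prediction.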

In [None]:
# Weighted Voting Classifier with hard voting
weights = [0.9, 0.8, 0.76]
vot_hard = VotingClassifier(estimators = estimator, weights = weights, voting ='hard')
vot_hard.fit(X_train, y_train)
y_pred = vot_hard.predict(X_test)

# using accuracy_score metric to predict accuracy
score = accuracy_score(y_test, y_pred)
print("Weighted Hard Voting Score % f" % score)

# Weighted Voting Classifier with soft voting
weights = [0.9, 0.8, 0.76]
vot_soft = VotingClassifier(estimators = estimator, weights = weights, voting ='soft')
vot_soft.fit(X_train, y_train)
y_pred = vot_soft.predict(X_test)

# using accuracy_score
score = accuracy_score(y_test, y_pred)
print("Weighted Soft Voting Score % f" % score)

Weighted Hard Voting Score  1.000000
Weighted Soft Voting Score  1.000000


### Advantages:
- The voting architecture is straightforward to set up, especially when compared to stacking and blending, and it generally doesn't demand intricate fine-tuning.

- Employing multiple base learners in the voting system reduces vulnerability to the impact of individual models, enhancing the stability and reliability of predictions.

### Disadvantages:
- Handling conflicts in predictions among the models can pose a challenge, making it tricky to arrive at a meaningful final decision.

- Increasing the number of models in the ensemble voting model doesn't always lead to an improvement in the final performance, which can limit its effectiveness.

## Bagging Method

This approach is also referred to as bootstrap aggregating. In this method, each base model is trained on its own bag. A **`bag`** is a sample of the dataset drawn with replacement, usually until it matches the size of the complete dataset, so some observations appear more than once and others are left out. The final result is generated by consolidating the outputs of all the base models.

### Algorithm:

- Generate several datasets by randomly selecting observations from the training dataset with replacement.

- Apply a base model to each of these generated datasets independently.

- Aggregate the predictions from all the base models to produce the final output.

Bagging normally uses a single type of base model (an XGBoost regressor in the code below).
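
The three steps above can be sketched by hand. This is a minimal illustration on synthetic data, assuming decision trees as the base model and 25 bags; both choices are arbitrary, and `BaggingRegressor` handles these details in practice.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic regression data: a noisy sine curve.
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(0, 0.3, size=200)

n_bags = 25
models = []
for _ in range(n_bags):
    # Step 1: draw a bootstrap sample (same size as the data, with replacement).
    idx = rng.integers(0, len(X_demo), size=len(X_demo))
    # Step 2: fit one base model on that bag.
    models.append(DecisionTreeRegressor(random_state=0).fit(X_demo[idx], y_demo[idx]))

# Step 3: aggregate by averaging the predictions of all base models.
X_grid = np.linspace(-3, 3, 50).reshape(-1, 1)
all_preds = np.array([m.predict(X_grid) for m in models])
bagged_pred = all_preds.mean(axis=0)

truth = np.sin(X_grid[:, 0])
single_mse = np.mean((all_preds[0] - truth) ** 2)
bagged_mse = np.mean((bagged_pred - truth) ** 2)
print("single tree MSE:", round(single_mse, 4))
print("bagged MSE:     ", round(bagged_mse, 4))
```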

<br>
<img src="https://github.com/EDGE-Programe/Python-Basics/blob/master/Python_edge_program/Data%20Science/notebook_images/nb_24/ensemble_bagging.webp?raw=1" alt="Bagging" width=76% height=85% title="Bagging Method">
<br>

In [None]:
# training data
X = dataset.iloc[:, 0:10].values
y = dataset.iloc[:, [10,11,12,13,14]].values

In [None]:
# importing utility modules
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# importing machine learning models for prediction
import xgboost as xgb

# importing bagging module
from sklearn.ensemble import BaggingRegressor


# splitting the data into training and validation datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)

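# initializing the bagging model using XGBoost as base model with default parameters;
# the base model is passed positionally because the keyword was renamed from
# `base_estimator` to `estimator` in scikit-learn 1.2
model = BaggingRegressor(xgb.XGBRegressor())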

# training model
model.fit(X_train, y_train)

# predicting the output on the test dataset
pred = model.predict(X_test)

# printing the mean squared error between real value and predicted value
print(mean_squared_error(y_test, pred))



0.04805340355619929


**Aggregation:** This step combines the outputs of all base models into a single aggregate prediction, which typically has better accuracy and lower variance than the prediction of any one base model.

<br>
<img src="https://github.com/EDGE-Programe/Python-Basics/blob/master/Python_edge_program/Data%20Science/notebook_images/nb_24/ensemble_bagging2.png?raw=1" alt="Bagging" width=67% height=58% title="Bagging Method">
<br>

### Advantages:

- The modeling process is simple and doesn't require complex mathematical concepts, and it can handle missing values when the base learner supports them.

- Implementation is made easy by the scikit-learn package, which includes modules for combining predictions from each base learner.

- Bagging has a notable impact on reducing variance in high-variance classifiers. This is particularly beneficial for high-dimensional data, where it helps prevent overfitting to the training data.

- Bagging provides a built-in estimate of the generalization error, the out-of-bag error: each observation is predicted only by the base models whose bags did not contain it, and the resulting errors are averaged.
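
The out-of-bag estimate mentioned above is available in scikit-learn through the `oob_score` option. The sketch below uses synthetic data from `make_regression` and the default decision-tree base model, both chosen only for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor

X_demo, y_demo = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)

# oob_score=True scores each sample using only the base models
# whose bootstrap sample did not contain that sample.
oob_model = BaggingRegressor(n_estimators=50, oob_score=True, random_state=0)
oob_model.fit(X_demo, y_demo)

# For a regressor, oob_score_ is an R^2 value computed on out-of-bag predictions.
print("Out-of-bag R^2:", round(oob_model.oob_score_, 3))
```

Because the out-of-bag samples act as a held-out set, this estimate does not require a separate validation split.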

### Disadvantages:

- Bagging can be computationally intensive as it involves using multiple models.

- The process of averaging predictions can make it challenging to interpret the final result, reducing the model's transparency.

## Stacking Method

Stacking is an ensemble technique that merges various models, whether they are for classification or regression, using a meta-model, which could be a meta-classifier or a meta-regression model. To implement stacking, the base models are initially trained on the entire dataset, and subsequently, the meta-model is trained on the features generated as output by the base models. It's important to note that the base models employed in stacking are usually diverse in nature. The role of the meta-model is to discern and utilize the most informative features derived from the base models in order to attain the highest possible level of accuracy.
<br>
<img src="https://github.com/EDGE-Programe/Python-Basics/blob/master/Python_edge_program/Data%20Science/notebook_images/nb_24/Stacking_ensemble.png?raw=1" alt="Stacking" width=67% height=58% title="Stacking Method">
<br>

**Stacking Algorithm:**
1. Divide the training dataset into **n** subsets.
2. Train a base model, such as linear regression, on **n-1** of these subsets and make predictions for the remaining nth subset. Repeat this process for each of the n subsets in the training set.
3. Fit the base model using the entire training dataset.
4. Utilize this model to make predictions on the test dataset.
5. Repeat steps 2 to 4 for another base model, resulting in another set of predictions for both the training and test datasets.
6. Use the predictions made on the training dataset as additional features to construct a new model.
7. Employ this final model to generate predictions on the test dataset.
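
The steps above can be sketched with scikit-learn's `cross_val_predict`, which produces the out-of-fold predictions described in the second step. This is a minimal sketch on synthetic data, assuming two base models and a linear meta-model. As a simplification, the meta-model here is trained on the base-model predictions only rather than on the predictions added to the original features.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict, train_test_split

X_demo, y_demo = make_regression(n_samples=400, n_features=8, noise=15.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

base_models = [LinearRegression(), RandomForestRegressor(n_estimators=50, random_state=0)]

train_meta, test_meta = [], []
for model in base_models:
    # Steps 1-2: out-of-fold predictions over n = 5 subsets of the training data.
    train_meta.append(cross_val_predict(model, X_tr, y_tr, cv=5))
    # Steps 3-4: refit on the entire training set and predict the test set.
    test_meta.append(model.fit(X_tr, y_tr).predict(X_te))

# Step 5 is the loop itself: one column of predictions per base model.
train_meta = np.column_stack(train_meta)
test_meta = np.column_stack(test_meta)

# Steps 6-7: train the meta-model on the stacked predictions, then predict.
meta_model = LinearRegression().fit(train_meta, y_tr)
final_pred = meta_model.predict(test_meta)
print("stacked predictions shape:", final_pred.shape)
```

Using out-of-fold predictions keeps the meta-model from seeing predictions that a base model made on its own training rows.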

Stacking is a bit different from the basic ensembling methods because it has first-level and second-level models. Stacking features are first extracted by training all the first-level models on the dataset. A second-level model is then trained on the train stacking features, and this model predicts the final output from the test stacking features.
<br>
<img src="https://github.com/EDGE-Programe/Python-Basics/blob/master/Python_edge_program/Data%20Science/notebook_images/nb_24/stacking.png?raw=1" alt="Stacking" width=67% height=58% title="Stacking Method">
<br>

In [None]:
! pip install vecstack



In [None]:
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split

# Load the Boston Housing dataset
boston = fetch_openml(name='boston')

# Split the data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(boston.data, boston.target, test_size=0.2, random_state=42)



In [None]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

# Train the base Decision Tree Model
dt = DecisionTreeRegressor(random_state=42)
dt.fit(X_train, y_train)

In [None]:
# Train the base RandomForestRegressor Model
rf = RandomForestRegressor(random_state=42)
rf.fit(X_train, y_train)

In [None]:
# Train the base GradientBoostingRegressor Model
gb = GradientBoostingRegressor(random_state=42)
gb.fit(X_train, y_train)

In [None]:
# Make predictions on the validation set
dt_pred = dt.predict(X_val)
rf_pred = rf.predict(X_val)
gb_pred = gb.predict(X_val)

In [None]:
# Train the meta model
import numpy as np
from sklearn.linear_model import LinearRegression

# Combine the predictions of the base models into a single feature matrix
X_val_meta = np.column_stack((dt_pred, rf_pred, gb_pred))

# Train the meta-model on the combined feature matrix and the target values
meta_model = LinearRegression()
meta_model.fit(X_val_meta, y_val)

In [None]:
# Make predictions on new data
X_new = np.array([[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 1.3]])
print(X_new.shape)
dt_pred_new = dt.predict(X_new)
rf_pred_new = rf.predict(X_new)
gb_pred_new = gb.predict(X_new)

# Combine the predictions of the base models into a single feature matrix
X_new_meta = np.column_stack((dt_pred_new, rf_pred_new, gb_pred_new))

# Make a prediction using the meta-model
y_new_pred = meta_model.predict(X_new_meta)

print("Predicted median value of owner-occupied homes: ${:.2f} thousand".format(y_new_pred[0]))

(1, 13)
Predicted median value of owner-occupied homes: $49.75 thousand




# An Overview of Popular Ensemble Algorithms
In the preceding sections, we discussed various types of ensemble models. Now, let's provide a concise overview of some popular models.

## Random Forest
Random Forest is a widely used model capable of tackling both classification and regression problems. It consists of multiple decision trees trained using a technique called bagging. The final prediction of a Random Forest is obtained by averaging the predictions of the individual trees for regression, or by taking a majority vote among the trees for classification.

Before training a Random Forest, there are three key hyperparameters to set: **node size**, **the number of trees**, and **the number of features to be sampled**.

Randomness is introduced through a process known as feature bagging or feature randomness: at each split, only a random subset of the features is considered, which keeps the correlation among the decision trees low. This sets Random Forests apart from single decision trees, which consider all features at every split.
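
As a rough illustration, the three hyperparameters can be set in scikit-learn as shown below. Mapping "node size" to `min_samples_leaf` is an assumption (other parameters such as `min_samples_split` also control node size), and the values and synthetic data are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X_demo, y_demo = make_regression(n_samples=300, n_features=12, noise=5.0, random_state=0)

forest = RandomForestRegressor(
    n_estimators=100,     # the number of trees
    min_samples_leaf=3,   # node size: minimum samples in a leaf (assumed mapping)
    max_features=4,       # the number of features sampled at each split
    random_state=0,
)
forest.fit(X_demo, y_demo)

# For regression, the forest prediction is the mean of the per-tree predictions.
per_tree = np.array([tree.predict(X_demo[:5]) for tree in forest.estimators_])
print(np.allclose(per_tree.mean(axis=0), forest.predict(X_demo[:5])))
```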

## XGBoost
Extreme Gradient Boosting, abbreviated as XGBoost, serves both classification and regression tasks. XGBoost is designed to be highly efficient and scalable, implementing the gradient boosting decision tree framework.

It is particularly suitable for handling large-scale datasets and is compatible with major distributed environments like Hadoop, MPI (Message Passing Interface), and SGE (Sun Grid Engine).
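
The gradient boosting idea that XGBoost builds on can be sketched in a few lines. The code below is a plain, from-scratch version for squared error using scikit-learn trees on synthetic data. It is not XGBoost's implementation, which adds regularization, second-order information, and many engineering optimizations. The learning rate, tree depth, and number of rounds are arbitrary.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(300, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(0, 0.1, size=300)

learning_rate, n_rounds = 0.1, 100

# Start from a constant prediction (the mean of the targets).
prediction = np.full_like(y_demo, y_demo.mean())
initial_mse = np.mean((y_demo - prediction) ** 2)

trees = []
for _ in range(n_rounds):
    # For squared error, the negative gradient is simply the residual.
    residuals = y_demo - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_demo, residuals)
    # Each new tree corrects a fraction of the remaining error.
    prediction += learning_rate * tree.predict(X_demo)
    trees.append(tree)

final_mse = np.mean((y_demo - prediction) ** 2)
print("training MSE before boosting:", round(initial_mse, 4))
print("training MSE after boosting: ", round(final_mse, 4))
```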

## AdaBoost
AdaBoost, short for Adaptive Boosting, is one of the earliest boosting algorithms and was originally designed for binary classification. Its **'adaptive'** nature lies in the fact that it reassigns weights to each instance after every round, giving higher weights to misclassified instances so that the next weak learner focuses on them.
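
One round of this reweighting can be worked through by hand. The sketch below follows the classic discrete AdaBoost update for labels in {-1, +1}. The labels and the weak learner's predictions are made-up values with a single mistake.

```python
import numpy as np

y_true = np.array([1, 1, -1, -1, 1])
y_pred = np.array([1, -1, -1, -1, 1])  # hypothetical weak learner, wrong at index 1

# Start with uniform instance weights.
weights = np.full(len(y_true), 1 / len(y_true))

# Weighted error of the weak learner and its vote strength (alpha).
misclassified = y_pred != y_true
error = weights[misclassified].sum()
alpha = 0.5 * np.log((1 - error) / error)

# Increase the weights of misclassified instances, decrease the rest, renormalize.
weights = weights * np.exp(-alpha * y_true * y_pred)
weights /= weights.sum()

print("alpha:", round(alpha, 4))
print("updated weights:", np.round(weights, 4))
```

After the update, the single misclassified instance carries half of the total weight, so the next weak learner is pushed to classify it correctly.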