# Getting started with Ensemble Machine Learning

## Introduction to ensemble machine learning

Simply speaking, **<font color=cyan>ensemble machine learning refers to a technique that integrates output from multiple learners and is applied to a dataset to make a prediction</font>**. **These multiple learners are usually referred to as base learners**. **When multiple base models are used to extract predictions that are combined into one single prediction, that prediction is likely to provide better accuracy than individual base learners**.

**Ensemble models are known for providing an advantage over single models in terms of performance**. They can be applied to **both regression and classification problems**. You can either decide to **build ensemble models with algorithms from the same family or opt to pick them from different families**. **<font color=cyan>If multiple models are built on the same dataset using neural networks only, then that ensemble would be called a homogeneous ensemble model. If multiple models are built using different algorithms, such as support vector machines (SVMs), neural networks, and random forests, then the ensemble model would be called a heterogeneous ensemble model.</font>**

The construction of an ensemble model requires two steps:

1. **Base learners are designed and fit on the training data**
2. The **base learners are combined to form a single prediction model by using specific ensembling techniques** such as *max-voting*, *averaging*, and *weighted averaging*

The following diagram shows the structure of the ensemble model:

![Alt text](ensemble_ml_diagram.png)

However, to get an ensemble model that performs well, **the base learners themselves should be as accurate as possible**. A common way to **measure the performance of a model** is to **evaluate its generalization error**. The generalization error **measures how accurately a model is able to make a prediction on new data that the model hasn't seen**.

To perform well, the ensemble models require a sufficient amount of data. **Ensemble techniques prove to be more useful when you have large and non-linear datasets**.

> An ensemble model may overfit if too many models are included, although this isn't very common.

Irrespective of how well you fine-tune your models, **there's always the risk of high bias or high variance**. Even the best model can fail if the bias and variance aren't taken into account while training the model. Both bias and variance represent a kind of error in the predictions. In fact, **the total error is comprised of bias-related error, variance-related error, and unavoidable noise-related error (or irreducible error)**. The noise-related error is mainly due to noise in the training data and can't be removed. However, the errors due to bias and variance can be reduced.

The total error can be expressed as follows:

$\text{Total Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error}$

A measure such as **mean square error (MSE)** captures **all of these errors for a continuous target variable** and can be represented as follows:

$$
MSE = E[(Y - \hat{f}(x))^2]
$$

In this formula, $E$ denotes the **expected value**, $Y$ represents **the actual target values**, and $\hat{f}(x)$ is the **predicted values for the target variable**. The MSE can be broken down into its bias, variance, and noise components, as shown in the following formula, where $f(x)$ is the true underlying function and $\epsilon$ is the noise in the target:

$$
MSE = \underbrace{\left(E[\hat{f}(x)] - f(x)\right)^2}_{\text{Bias}^2} + \underbrace{E\left[\left(\hat{f}(x) - E[\hat{f}(x)]\right)^2\right]}_{\text{Variance}} + \underbrace{\mathrm{Var}(\epsilon)}_{\text{Noise}}
$$

**<font color=cyan>Bias refers to how far the expected value of our estimate is from the ground truth</font>**, whereas **<font color=cyan>variance measures how much the estimate deviates from its own expected value</font>**. **Estimators with a small MSE are desirable**. To minimize the MSE, we'd like the estimator to be centered on the ground truth (zero bias) and to deviate little from its expected value (low variance). In other words, we'd like to be confident (low variance, low uncertainty, a more peaked distribution) about the value of our estimate. **High bias degrades the performance of the algorithm even on the training dataset and leads to underfitting**. **High variance**, on the other hand, **is characterized by low training errors and high validation errors**. **Having high variance reduces the performance of the learners on unseen data, leading to overfitting**.
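
As a rough numerical check of the decomposition, the following sketch simulates many training sets and compares the empirical MSE with the sum of the estimated squared bias, variance, and noise. The constant true value, the noise level, and the deliberately biased "shrunken mean" estimator are all hypothetical choices made for illustration only.

```python
import random
import statistics

random.seed(0)

TRUE_VALUE = 2.0   # f(x): hypothetical ground truth
NOISE_SD = 1.0     # standard deviation of the irreducible noise
N_TRAIN = 10       # observations per simulated training set
N_TRIALS = 20000   # number of simulated training sets
SHRINK = 0.8       # deliberately biased estimator: 0.8 * sample mean

estimates = []
squared_errors = []
for _ in range(N_TRIALS):
    # Draw a fresh noisy training set and fit the (biased) estimator
    train = [TRUE_VALUE + random.gauss(0, NOISE_SD) for _ in range(N_TRAIN)]
    f_hat = SHRINK * statistics.fmean(train)
    # Score the estimate against a new, unseen noisy observation
    y_new = TRUE_VALUE + random.gauss(0, NOISE_SD)
    estimates.append(f_hat)
    squared_errors.append((y_new - f_hat) ** 2)

mean_estimate = statistics.fmean(estimates)
bias_sq = (mean_estimate - TRUE_VALUE) ** 2
variance = statistics.pvariance(estimates, mu=mean_estimate)
noise = NOISE_SD ** 2
mse = statistics.fmean(squared_errors)

print(f"bias^2={bias_sq:.3f} variance={variance:.3f} noise={noise:.3f}")
print(f"sum={bias_sq + variance + noise:.3f} empirical MSE={mse:.3f}")
```

Under these assumptions, the two printed totals should agree up to simulation error, which is what the decomposition predicts.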

> Ensemble models can reduce bias and/or variance in the models.
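
The variance-reduction effect can be illustrated with a small simulation. This is only a sketch under a strong assumption: each hypothetical base learner is unbiased and makes errors that are independent of the others. Real base learners trained on the same data are usually correlated, so the reduction in practice tends to be smaller.

```python
import random
import statistics

random.seed(1)

TRUE_VALUE = 3.0    # hypothetical quantity being predicted
NOISE_SD = 1.0      # spread of each individual learner's prediction
N_LEARNERS = 5      # number of base learners in the ensemble
N_TRIALS = 10000    # number of simulated prediction rounds

single_predictions = []
ensemble_predictions = []
for _ in range(N_TRIALS):
    # Each learner outputs the true value plus independent noise
    learner_outputs = [random.gauss(TRUE_VALUE, NOISE_SD) for _ in range(N_LEARNERS)]
    single_predictions.append(learner_outputs[0])
    ensemble_predictions.append(statistics.fmean(learner_outputs))

single_variance = statistics.pvariance(single_predictions)
ensemble_variance = statistics.pvariance(ensemble_predictions)

print(f"single learner variance: {single_variance:.3f}")
print(f"averaged ensemble variance: {ensemble_variance:.3f}")
```

With independent errors, the variance of the average is roughly the single-learner variance divided by the number of learners.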

## Max-voting

Max-voting, which is **generally used for classification problems**, is one of the simplest ways of combining predictions from multiple machine learning algorithms.

**In max-voting, each base model predicts a class for each sample, and each prediction counts as a vote. The class with the most votes becomes the final predicted class for that sample**.

For example, let's say we have an online survey in which consumers answer a question on a five-level Likert scale. We can assume that a few consumers will provide a rating of five, while others will provide a rating of four, and so on. If a majority, say more than $50\%$ of the consumers, provide a rating of four, then the final rating is taken as four. In this example, taking the final rating as four is the same as taking the mode of all of the ratings.
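
The survey analogy can be written in a few lines. The ratings below are made up for illustration.

```python
from collections import Counter

# Hypothetical survey responses on a five-level Likert scale
ratings = [4, 5, 4, 3, 4, 4, 5, 4, 2, 4]

# The mode is the rating that received the most "votes"
counts = Counter(ratings)
final_rating, votes = counts.most_common(1)[0]
share = votes / len(ratings)

print(f"final rating: {final_rating} ({share:.0%} of responses)")
```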

### Getting started

In [2]:
import os
import pandas as pd
# set working directory
os.chdir("../2_getting_started_with_ensemble_ml/")
os.getcwd()

'c:\\Users\\HP\\Documents\\Ensemble machine learning\\2_getting_started_with_ensemble_ml'

In [3]:
df_cryotherapydata = pd.read_csv('./Cryotherapy.csv')
df_cryotherapydata.head(5)

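| | sex | age | Time | Number_of_Warts | Type | Area | Result_of_Treatment |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 35 | 12.0 | 5 | 1 | 100 | 0 |
| 1 | 1 | 29 | 7.0 | 5 | 1 | 96 | 1 |
| 2 | 1 | 50 | 8.0 | 1 | 3 | 132 | 0 |
| 3 | 1 | 32 | 11.75 | 7 | 3 | 750 | 0 |
| 4 | 1 | 67 | 9.25 | 1 | 1 | 42 | 0 |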


### How to do it

You can create a voting ensemble model for a classification problem using the `VotingClassifier` class from scikit-learn.

In [4]:
# 1 Import the required libs for building the decision tree, SVM, and logistic regression models
# We also import VotingClassifier for max-voting

# Import required libs
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import VotingClassifier

# 2 We then move on to building our feature set and creating our train and test datasets

# we create train and test sample from our dataset
from sklearn.model_selection import train_test_split

# create feature and response sets
feature_columns = ['sex', 'age', 'Time', 'Number_of_Warts', 'Type', 'Area']
X = df_cryotherapydata[feature_columns]
Y = df_cryotherapydata['Result_of_Treatment']

# Create train & test sets
X_train, X_test, Y_train, Y_test = \
train_test_split(X, Y, test_size=0.20, random_state=1)

# 3 We build our models with the decision tree, SVM, and logistic regression algorithms

# create the sub models
estimators = []

dt_model = DecisionTreeClassifier(random_state=1)
estimators.append(('DecisionTree', dt_model))

svm_model = SVC(random_state=1)
estimators.append(('SupportVector', svm_model))

logit_model = LogisticRegression(random_state=1)
estimators.append(('Logistic Regression', logit_model))

# 4 We build individual models with each of the classifiers we've chosen:
from sklearn.metrics import accuracy_score

for each_estimator in (dt_model, svm_model, logit_model):
    each_estimator.fit(X_train, Y_train)
    Y_pred = each_estimator.predict(X_test)
    print(each_estimator.__class__.__name__, accuracy_score(Y_test, Y_pred))


DecisionTreeClassifier 0.8333333333333334
SVC 0.4444444444444444
LogisticRegression 0.9444444444444444


In [5]:
# 5 We proceed to ensemble our models and use VotingClassifier to score the accuracy
# of the ensemble model:

# Using VotingClassifier() to build ensemble model with Hard Voting
ensemble_model = VotingClassifier(estimators=estimators, voting='hard')

ensemble_model.fit(X_train, Y_train)
predicted_labels = ensemble_model.predict(X_test)

print("Classifier Accuracy using Hard Voting: ", accuracy_score(Y_test, predicted_labels))

Classifier Accuracy using Hard Voting:  0.8333333333333334


### How it works

`VotingClassifier` implements two types of voting—**hard** and **soft** voting. In **hard voting, the final class label is predicted as the class label that has been predicted most frequently by the classification models**. In other words, **the predictions from all classifiers are aggregated to predict the class that gets the most votes**. In simple terms, **it takes the mode of the predicted class labels.**

In hard voting, the predicted class label $\hat{y}$ is the majority vote of the classifiers $C_i$, where $i = 1, \dots, n$ indexes the classifiers:

$$
\hat{y} = mode\{C_1(x), C_2(x), \dots, C_n(x)\}
$$

As shown in the previous section, we have three models, one from the decision tree, one from the SVMs, and one from logistic regression. Let's say that the models classify a training observation as class $1$, class $0$, and class $1$ respectively. Then with majority voting, we have the following:

$$
\hat{y} = mode\{1, 0, 1\} = 1
$$

In this case, we would classify the observation as class $1$.
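
The same rule can be applied to several samples at once. The following is a minimal sketch with made-up predictions from three classifiers; it is not how `VotingClassifier` is implemented internally, and the tie handling here (first label encountered wins) is an arbitrary choice that doesn't matter with three binary voters.

```python
from collections import Counter

# Hypothetical class labels predicted by three classifiers for five samples
predictions = {
    "DecisionTree":       [1, 0, 1, 1, 0],
    "SupportVector":      [0, 0, 1, 0, 0],
    "LogisticRegression": [1, 1, 1, 0, 0],
}

def hard_vote(per_model_predictions):
    """Return the most frequent label per sample across all models."""
    voted = []
    # zip(*...) regroups the predictions sample by sample
    for sample_votes in zip(*per_model_predictions.values()):
        label, _ = Counter(sample_votes).most_common(1)[0]
        voted.append(label)
    return voted

ensemble_labels = hard_vote(predictions)
print(ensemble_labels)
```

The first sample reproduces the $mode\{1, 0, 1\} = 1$ case worked through above.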

In the preceding section, in Step 1, we imported the required libraries to build our models. In Step 2, we created our feature set. We also split our data to create the training and testing samples. In Step 3, we trained three models with the decision tree, SVMs, and logistic regression respectively. In Step 4, we looked at the accuracy score of each of the base learners, while in Step 5, we ensembled the models using `VotingClassifier()` and looked at the accuracy score of the ensemble model.

### There's more

Many classifiers can estimate class probabilities. In this case, the class probabilities predicted by the individual classifiers are averaged, and the class with the highest average probability is predicted. This is called **soft voting** and is recommended for an ensemble of well-calibrated classifiers.

In the scikit-learn library, many classification algorithms have the `predict_proba()` method to predict the class probabilities. To perform the ensemble with soft voting, simply replace `voting='hard'` with `voting='soft'` in `VotingClassifier()`.
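
To see how soft voting can differ from hard voting, consider the following sketch for a single sample. The probabilities are made up to mimic `predict_proba()` output from three classifiers: two of them weakly prefer class $0$, while the third strongly prefers class $1$.

```python
from collections import Counter

# Hypothetical [P(class 0), P(class 1)] for one sample, one row per classifier
model_probabilities = [
    [0.55, 0.45],
    [0.60, 0.40],
    [0.05, 0.95],
]

n_models = len(model_probabilities)
n_classes = len(model_probabilities[0])

# Soft voting: average each class probability across models, then pick the largest
average_probabilities = [
    sum(row[c] for row in model_probabilities) / n_models
    for c in range(n_classes)
]
soft_label = max(range(n_classes), key=lambda c: average_probabilities[c])

# Hard voting on the same outputs: each model votes for its most likely class
hard_votes = [max(range(n_classes), key=lambda c: row[c]) for row in model_probabilities]
hard_label = Counter(hard_votes).most_common(1)[0][0]

print("average probabilities:", average_probabilities)
print("soft voting label:", soft_label)
print("hard voting label:", hard_label)
```

Here hard voting follows the two weak votes, whereas soft voting lets the one confident classifier outweigh them.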

The following code creates an ensemble using soft voting:

In [6]:
# create the sub models
estimators = []

dt_model = DecisionTreeClassifier(random_state=1)
estimators.append(('DecisionTree', dt_model))

svm_model = SVC(random_state=1, probability=True)
estimators.append(('SupportVector', svm_model))

logit_model = LogisticRegression(random_state=1)
estimators.append(('Logistic Regression', logit_model))

for each_estimator in (dt_model, svm_model, logit_model):
    each_estimator.fit(X_train, Y_train)
    Y_pred = each_estimator.predict(X_test)
    print(each_estimator.__class__.__name__, accuracy_score(Y_test, Y_pred))

# Using VotingClassifier() to build ensemble model with soft voting
ensemble_model = VotingClassifier(estimators=estimators, voting='soft')
ensemble_model.fit(X_train, Y_train)
predicted_labels = ensemble_model.predict(X_test)
print("Classifier Accuracy using Soft Voting: ", accuracy_score(Y_test, predicted_labels))

DecisionTreeClassifier 0.8333333333333334
SVC 0.4444444444444444
LogisticRegression 0.9444444444444444
Classifier Accuracy using Soft Voting:  0.8888888888888888


> The **SVC class can't estimate class probabilities by default, so we've set its probability hyper-parameter to True** in the preceding code. With `probability=True`, SVC will be able to estimate class probabilities.

## Averaging

Averaging is **usually used for regression problems** or **can be used while estimating the probabilities in classification tasks**. Predictions are **extracted from multiple models and an average of the predictions is used to make the final prediction**.

### Getting ready

In [7]:
df_winedata = pd.read_csv("whitewines.csv")
df_winedata.head(5)

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6.7 | 0.62 | 0.24 | 1.1 | 0.039 | 6.0 | 62.0 | 0.9934 | 3.41 | 0.32 | 10.4 | 5 |
| 1 | 5.7 | 0.22 | 0.2 | 16.0 | 0.044 | 41.0 | 113.0 | 0.99862 | 3.22 | 0.46 | 8.9 | 6 |
| 2 | 5.9 | 0.19 | 0.26 | 7.4 | 0.034 | 33.0 | 123.0 | 0.995 | 3.49 | 0.42 | 10.1 | 6 |
| 3 | 5.3 | 0.47 | 0.1 | 1.3 | 0.036 | 11.0 | 74.0 | 0.99082 | 3.48 | 0.54 | 11.2 | 4 |
| 4 | 6.4 | 0.29 | 0.21 | 9.65 | 0.041 | 36.0 | 119.0 | 0.99334 | 2.99 | 0.34 | 10.933333 | 6 |


### How to do it

We have a dataset that is based on the properties of wines. Using this dataset, we'll build multiple regression models with the quality as our response variable. With multiple learners, we extract multiple predictions. The averaging technique takes the average of all of the predicted values for each sample:

In [10]:
# 1 import the libraries

# Import required libraries
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR

# 2 create feature and response variable set
from sklearn.model_selection import train_test_split

# create feature and response variables
feature_columns = ['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar','chlorides', 'free sulfur dioxide', 'total sulfur dioxide','density', 'pH', 'sulphates', 'alcohol']
X = df_winedata[feature_columns]
y = df_winedata['quality']

# 3 Split the data into training and testing sets
# create train & test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=1)

# 4 Building the base regression learners using linear regression, SVR and a decision tree:

# build base learners
linreg_model = LinearRegression()
svr_model = SVR()
regressiontree_model = DecisionTreeRegressor()

# fitting the model
linreg_model.fit(X_train, y_train)
svr_model.fit(X_train, y_train)
regressiontree_model.fit(X_train, y_train)

# 5 Use the base learners to make a prediction based on the test data
linreg_predictions = linreg_model.predict(X_test)
svr_predictions = svr_model.predict(X_test)
regtree_predictions = regressiontree_model.predict(X_test)

# 6 Add the predictions and divide by the number of base learners
# we divide the summation of the predictions by 3 i.e number of base
# learners
average_predictions=(linreg_predictions + svr_predictions + regtree_predictions)/3
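
The manual average can be cross-checked against scikit-learn's `VotingRegressor`, which averages the predictions of its base estimators, and each model can be scored with the mean squared error. The sketch below is an illustration under assumptions: it uses synthetic data from `make_regression` as a stand-in for the wine data, and fixes `random_state` on the tree so that the two approaches are comparable.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the wine data (an assumption, not the original dataset)
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

base_learners = [
    ("linreg", LinearRegression()),
    ("svr", SVR()),
    ("tree", DecisionTreeRegressor(random_state=1)),
]

# Manual averaging: fit each base learner and collect its test predictions
individual_predictions = []
for name, model in base_learners:
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    individual_predictions.append(predictions)
    print(name, mean_squared_error(y_test, predictions))

manual_average = np.mean(individual_predictions, axis=0)
print("manual average", mean_squared_error(y_test, manual_average))

# VotingRegressor fits its own copies of the estimators and averages their predictions
voting_model = VotingRegressor(estimators=base_learners)
voting_model.fit(X_train, y_train)
voting_predictions = voting_model.predict(X_test)
print("VotingRegressor", mean_squared_error(y_test, voting_predictions))
```

Whether the average beats the best individual learner depends on the data and the learners, so it's worth printing the individual scores alongside the ensemble score as done here.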

## Weighted averaging

Like averaging, weighted averaging is also used for regression tasks. Alternatively, it can be used while estimating probabilities in classification problems. Base learners are assigned different weights, which represent the importance of each model in the prediction.

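> With well-chosen weights, a weight-averaged model can be at least as good as your best model on the data used to pick the weights, because giving all of the weight to the best model reproduces it. This isn't guaranteed on unseen data.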

### Getting ready

In [11]:
df_cancerdata = pd.read_csv("wisc_bc_data.csv")
df_cancerdata.head(5)

| | id | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | points_mean | ... | radius_worst | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | points_worst | symmetry_worst | dimension_worst |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 87139402 | B | 12.32 | 12.39 | 78.85 | 464.1 | 0.1028 | 0.06981 | 0.03987 | 0.037 | ... | 13.5 | 15.64 | 86.97 | 549.1 | 0.1385 | 0.1266 | 0.1242 | 0.09391 | 0.2827 | 0.06771 |
| 1 | 8910251 | B | 10.6 | 18.95 | 69.28 | 346.4 | 0.09688 | 0.1147 | 0.06387 | 0.02642 | ... | 11.88 | 22.94 | 78.28 | 424.8 | 0.1213 | 0.2515 | 0.1916 | 0.07926 | 0.294 | 0.07587 |
| 2 | 905520 | B | 11.04 | 16.83 | 70.92 | 373.2 | 0.1077 | 0.07804 | 0.03046 | 0.0248 | ... | 12.41 | 26.44 | 79.93 | 471.4 | 0.1369 | 0.1482 | 0.1067 | 0.07431 | 0.2998 | 0.07881 |
| 3 | 868871 | B | 11.28 | 13.39 | 73.0 | 384.8 | 0.1164 | 0.1136 | 0.04635 | 0.04796 | ... | 11.92 | 15.77 | 76.53 | 434.0 | 0.1367 | 0.1822 | 0.08669 | 0.08611 | 0.2102 | 0.06784 |
| 4 | 9012568 | B | 15.19 | 13.21 | 97.65 | 711.8 | 0.07963 | 0.06934 | 0.03393 | 0.02657 | ... | 16.2 | 15.73 | 104.5 | 819.1 | 0.1126 | 0.1737 | 0.1362 | 0.08178 | 0.2487 | 0.06766 |


### How to do it

Here, we have a dataset based on the properties of cancerous tumors. Using this dataset, we'll **build multiple classification models with diagnosis as our response variable**. The diagnosis variable has the values `B` and `M`, which indicate **whether the tumor is benign or malignant**. With multiple learners, we extract multiple predictions. The weighted averaging technique takes a weighted average of the predicted values for each sample.

In this example, we consider the predicted probabilities as the output and use the `predict_proba()` function of the scikit-learn algorithms to predict the class probabilities:

In [12]:
# 1 Import the required libraries:

# Import required libraries
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

# 2 Create the response and feature sets:

# Create feature and response variable set
# We create train & test sample from our dataset
from sklearn.model_selection import train_test_split

# create feature & response variables
X = df_cancerdata.iloc[:, 2:32]
y = df_cancerdata['diagnosis']

> We retrieved the feature columns using the `iloc` indexer of the pandas DataFrame, which is purely integer-location based indexing for selection by position. The `iloc` indexer takes row and column selections in square brackets, in the form: `data.iloc[<row selection>, <column selection>]`. **The row and column selection can either be an integer list or a slice of rows and columns**. For example, it might look as follows: `df_cancerdata.iloc[2:100, 2:30]`.

In [14]:
# 3 We'll then split our data into training and testing sets:
# Create train & test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=1)

# 4 Build the base classifier models:
# create the sub models
estimators = []

dt_model = DecisionTreeClassifier()
estimators.append(('DecisionTree', dt_model))

svm_model = SVC(probability=True)
estimators.append(('SupportVector', svm_model))

logit_model = LogisticRegression()
estimators.append(('Logistic Regression', logit_model))

# 5 Fit the models on the training data:
dt_model.fit(X_train, y_train)
svm_model.fit(X_train, y_train)
logit_model.fit(X_train, y_train)

# 6 Use the predict_proba() function to predict the class probabilities:
dt_predictions = dt_model.predict_proba(X_test)
svm_predictions = svm_model.predict_proba(X_test)
logit_predictions = logit_model.predict_proba(X_test)

# 7 Assign different weights to each of the models to get our final predictions:
weighted_average_predictions = (dt_predictions * 0.3 + svm_predictions * 0.4 + logit_predictions * 0.3)
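
The weighted average above is still a table of class probabilities, not class labels. The following sketch shows one way to turn such probabilities into labels, using made-up `predict_proba()`-style arrays for three samples and the same 0.3/0.4/0.3 weights as in step 7. It assumes the probability columns follow the order of a fitted classifier's `classes_` attribute, taken here to be `['B', 'M']`.

```python
import numpy as np

# Assumed column order of the probabilities (from a fitted model's classes_ attribute)
classes = np.array(["B", "M"])

# Hypothetical predict_proba outputs: one row per sample, one column per class
dt_proba = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
svm_proba = np.array([[0.7, 0.3], [0.4, 0.6], [0.1, 0.9]])
logit_proba = np.array([[0.9, 0.1], [0.3, 0.7], [0.3, 0.7]])

# One weight per model; because they sum to 1, each output row still sums to 1
weights = np.array([0.3, 0.4, 0.3])

# Stack to shape (models, samples, classes) and average over the model axis
stacked = np.stack([dt_proba, svm_proba, logit_proba])
weighted_average = np.average(stacked, axis=0, weights=weights)

# Pick the class with the highest weighted probability for each sample
predicted_labels = classes[np.argmax(weighted_average, axis=1)]

print(weighted_average)
print(predicted_labels)
```

In the third sample, the decision tree alone would predict `B`, but the weighted average favors `M` because the other two models carry more combined weight.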

