# Ensemble Learning

> "Unity is strength"

This old saying captures the idea behind the powerful “ensemble methods” in machine learning.

Ensemble learning is a machine learning paradigm where multiple models (often called **“weak learners”**) are trained to solve the same problem and combined to get better results. The main hypothesis is that when weak models are correctly combined we can obtain more accurate and/or robust models.

Ensemble learning methods often top the rankings of machine learning competitions (including Kaggle’s competitions). They rest on the hypothesis that combining multiple models can often produce a much more powerful model.

# Different types of Ensemble Learning techniques

There are simple and advanced ensemble learning techniques.

**1. Simple:**

1. Max Voting
2. Averaging
3. Weighted Averaging

**2. Advanced:**

1. Bagging
2. Boosting

# Synthetic Dataset Generation

In [1]:
# dataset
from sklearn.datasets import make_classification
from sklearn.datasets import make_regression


# define the regression dataset
reg_X, reg_y = make_regression(n_samples=1000, n_features=20, n_informative=12, 
                               noise=0.1, random_state=5)

# define the classification dataset (5 classes)
cls_X, cls_y = make_classification(n_samples=1000, n_features=20, n_informative=12, 
                                   n_classes=5, n_redundant=6, random_state=5)
# summarize the dataset
print(reg_X.shape, cls_X.shape)

(1000, 20) (1000, 20)


In [2]:
# train/test split: first 700 samples for training, remaining 300 for testing
x_train = cls_X[:700]
y_train = cls_y[:700]

x_test = cls_X[700:]
y_test = cls_y[700:]

# Max Voting

The max voting method is generally used for classification problems. In this technique, multiple models make a prediction for each data point, and each model's prediction counts as one `vote`. The class that receives the most votes is used as the final prediction.

<img src='img/majority_voting.png' width='400'>

In [3]:
from scipy.stats import mode
from sklearn.metrics import accuracy_score
import numpy as np

# For ML
from sklearn import tree
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression


model1 = tree.DecisionTreeClassifier()
model2 = KNeighborsClassifier()
model3= LogisticRegression()

model1.fit(x_train,y_train)
model2.fit(x_train,y_train)
model3.fit(x_train,y_train)

pred1=model1.predict(x_test)
pred2=model2.predict(x_test)
pred3=model3.predict(x_test)

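# majority vote for each test sample (on a tie, scipy's mode returns the smallest label)
final_pred = []
for i in range(len(x_test)):
    # np.ravel(...)[0] yields a plain label whether mode returns a scalar or a 1-element array
    final_pred.append(np.ravel(mode([pred1[i], pred2[i], pred3[i]])[0])[0])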
    
print("result : ")
print(f"model1 accuracy { accuracy_score(y_test, pred1) }")
print(f"model2 accuracy { accuracy_score(y_test, pred2) }")
print(f"model3 accuracy { accuracy_score(y_test, pred3) }")

print(f"Max Voting accuracy { accuracy_score(y_test, final_pred) }")

result : 
model1 accuracy 0.49666666666666665
model2 accuracy 0.6933333333333334
model3 accuracy 0.55
Max Voting accuracy 0.6566666666666666
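To see the voting step in isolation, here is a minimal sketch on made-up toy predictions. The names `majority_vote`, `toy_pred1`, `toy_pred2` and `toy_pred3` are hypothetical and not part of the notebook. The last sample shows the tie case, where all three models disagree and the smallest label wins (an assumption of this helper that matches the behaviour of `scipy.stats.mode`).

```python
import numpy as np

def majority_vote(*predictions):
    """Return the most frequent label per sample; ties go to the smallest label."""
    stacked = np.vstack(predictions)              # shape: (n_models, n_samples)
    n_labels = int(stacked.max()) + 1
    # Count the votes each label receives, column by column (one column per sample)
    counts = np.apply_along_axis(np.bincount, 0, stacked, minlength=n_labels)
    return counts.argmax(axis=0)                  # argmax picks the first (smallest) label on ties

# Toy predictions from three hypothetical models for four samples
toy_pred1 = np.array([0, 1, 2, 2])
toy_pred2 = np.array([0, 2, 2, 1])
toy_pred3 = np.array([1, 2, 0, 0])

voted = majority_vote(toy_pred1, toy_pred2, toy_pred3)
print(voted)  # [0 2 2 0]
```

With only three voters and five classes, three-way ties can occur, so the tie-breaking rule can influence the ensemble's accuracy.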


# Averaging
Similar to the max voting technique, multiple predictions are made for each data point in averaging. In this method, we take an average of predictions from all the models and use it to make the final prediction. Averaging can be used for making predictions in regression problems or while calculating probabilities for classification problems.

<img src='img/average.png' width='400'>

In [4]:
model1 = tree.DecisionTreeClassifier()
model2 = KNeighborsClassifier()
model3= LogisticRegression()

model1.fit(x_train,y_train)
model2.fit(x_train,y_train)
model3.fit(x_train,y_train)

pred1=model1.predict_proba(x_test)
pred2=model2.predict_proba(x_test)
pred3=model3.predict_proba(x_test)

finalpred=(pred1+pred2+pred3)/3

print("result : ")
print(f"model1 accuracy { accuracy_score(y_test, np.argmax(pred1, axis=1)) }")
print(f"model2 accuracy { accuracy_score(y_test, np.argmax(pred2, axis=1)) }")
print(f"model3 accuracy { accuracy_score(y_test, np.argmax(pred3, axis=1)) }")

print(f"Avg. accuracy { accuracy_score(y_test, np.argmax(finalpred, axis=1)) }")

result : 
model1 accuracy 0.4866666666666667
model2 accuracy 0.6933333333333334
model3 accuracy 0.55
Avg. accuracy 0.5866666666666667
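The description above mentions that averaging also applies to regression, where the predicted values themselves are averaged. The following is a minimal sketch under assumptions: it regenerates a regression dataset with the same settings as earlier, and the three regressors are an arbitrary choice mirroring the classifiers used so far.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Same generation settings as the synthetic regression dataset
X, y = make_regression(n_samples=1000, n_features=20, n_informative=12,
                       noise=0.1, random_state=5)
X_tr, y_tr, X_te, y_te = X[:700], y[:700], X[700:], y[700:]

# Three different regressors (an illustrative choice)
regressors = [DecisionTreeRegressor(random_state=0),
              KNeighborsRegressor(),
              LinearRegression()]

# Fit each model and collect its predictions on the test set
preds = [reg.fit(X_tr, y_tr).predict(X_te) for reg in regressors]

# Final prediction = element-wise mean of the individual predictions
avg_pred = np.mean(preds, axis=0)

for reg, p in zip(regressors, preds):
    print(f"{type(reg).__name__} R2: {r2_score(y_te, p)}")
print(f"Averaged R2: {r2_score(y_te, avg_pred)}")
```

As with classification, the average is not guaranteed to beat the best single model; a weak member can pull the ensemble down.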


# Weighted Average

This is an extension of the averaging method. All models are assigned different weights defining the importance of each model for prediction.

<img src='img/weighted-unweighted.png' width='450'>

In [5]:
model1 = tree.DecisionTreeClassifier()
model2 = KNeighborsClassifier()
model3= LogisticRegression()

model1.fit(x_train,y_train)
model2.fit(x_train,y_train)
model3.fit(x_train,y_train)

pred1=model1.predict_proba(x_test)
pred2=model2.predict_proba(x_test)
pred3=model3.predict_proba(x_test)

finalpred=(pred1*0.2 + pred2*0.5 + pred3*0.3)

print("result : ")
print(f"model1 accuracy { accuracy_score(y_test, np.argmax(pred1, axis=1)) }")
print(f"model2 accuracy { accuracy_score(y_test, np.argmax(pred2, axis=1)) }")
print(f"model3 accuracy { accuracy_score(y_test, np.argmax(pred3, axis=1)) }")

print(f"w. avg. accuracy { accuracy_score(y_test, np.argmax(finalpred, axis=1)) }")

result : 
model1 accuracy 0.49333333333333335
model2 accuracy 0.6933333333333334
model3 accuracy 0.55
w. avg. accuracy 0.6933333333333334
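The effect of the weights is easier to see on a tiny hand-made example. The probabilities below are invented for illustration (two samples, three classes, three hypothetical models); the weights are the same 0.2, 0.5 and 0.3 used above. `np.average` applies the weights along the model axis.

```python
import numpy as np

# Made-up class probabilities: shape (n_models, n_samples, n_classes)
probs = np.array([
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],   # model 1: hard 0/1 probabilities
    [[0.2, 0.6, 0.2], [0.4, 0.4, 0.2]],   # model 2
    [[0.3, 0.5, 0.2], [0.1, 0.2, 0.7]],   # model 3
])
weights = [0.2, 0.5, 0.3]                  # one weight per model, summing to 1

unweighted = probs.mean(axis=0)
weighted = np.average(probs, axis=0, weights=weights)

print(unweighted.argmax(axis=1))  # [0 1]
print(weighted.argmax(axis=1))    # [1 1]
```

For the first sample, the plain average follows the over-confident model 1, while the weighted average sides with the two models that carry more weight. Because the weights sum to 1, each row of `weighted` is still a valid probability distribution.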


> # Task #1 (8 mins)

1. Make a synthetic dataset for classification with n_classes = 4
2. Try the Max Voting ensemble technique on that dataset

# Advanced Ensemble techniques
Let’s move on to the advanced techniques:

1. Bagging
2. Boosting

# Bagging

The idea behind bagging is to combine the results of multiple models (for instance, all decision trees) to get a generalized result. Here’s a question: if you train all the models on the same set of data and combine them, will it be useful? There is a high chance that these models will give the same result, since they are getting the same input.

So how can we solve this problem? One of the techniques is bootstrapping.

Bootstrapping is a sampling technique in which we create subsets of observations from the original dataset **with replacement**, so the same observation can appear more than once in a subset.

<img src='img/bootstrap.png' width='350'>
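A bootstrap sample can be drawn with NumPy by sampling row indices with replacement. This is a minimal sketch; the sample size of 1000 and the seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000                                   # number of observations in the original dataset

# Draw n row indices with replacement: some rows repeat, others are left out
indices = rng.choice(n, size=n, replace=True)

unique_fraction = np.unique(indices).size / n
out_of_bag = np.setdiff1d(np.arange(n), indices)   # rows never drawn

print(f"distinct rows in the sample: {unique_fraction:.1%}")
print(f"rows left out (out-of-bag): {out_of_bag.size}")
```

For large `n`, a bootstrap sample contains roughly 63% of the distinct original rows. Each model in a bagging ensemble therefore sees a noticeably different version of the data.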

Bagging is short for **"Bootstrap Aggregation"**.


<img src='img/bagging1.png' width='650'>


**Bagging algorithms:**

1. Bagging meta-estimator
2. Random forest


## Bagging meta-estimator

Bagging meta-estimator is an ensembling algorithm that can be used for both classification (BaggingClassifier) and regression (BaggingRegressor) problems. It follows the typical bagging technique to make predictions. Following are the steps for the bagging meta-estimator algorithm:

1. Random subsets are created from the original dataset (Bootstrapping).
2. The subset of the dataset includes all features.
3. A user-specified base estimator is fitted on each of these smaller sets.
4. Predictions from each model are combined to get the final result.
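The steps above can be sketched by hand with decision trees. This is an illustrative sketch, not scikit-learn's implementation: the dataset, the number of trees, and the use of a plain majority vote to combine predictions are all assumptions made here for simplicity.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Small synthetic dataset (settings chosen only for this sketch)
X, y = make_classification(n_samples=500, n_features=10, n_informative=6,
                           n_classes=3, random_state=0)
X_tr, y_tr, X_te, y_te = X[:350], y[:350], X[350:], y[350:]

rng = np.random.default_rng(0)
n_estimators = 10
all_preds = []

for _ in range(n_estimators):
    # Steps 1-2: bootstrap sample of the rows, keeping all features
    idx = rng.choice(len(X_tr), size=len(X_tr), replace=True)
    # Step 3: fit the base estimator on this sample
    base = DecisionTreeClassifier(random_state=0).fit(X_tr[idx], y_tr[idx])
    all_preds.append(base.predict(X_te))

all_preds = np.array(all_preds)            # shape: (n_estimators, n_test_samples)

# Step 4: combine by majority vote, one column (test sample) at a time
bagged_pred = np.array([np.bincount(col).argmax() for col in all_preds.T])

manual_bagging_accuracy = (bagged_pred == y_te).mean()
print(f"manual bagging accuracy: {manual_bagging_accuracy}")
```

`BaggingClassifier` wraps the same loop behind a single estimator and may combine the individual predictions differently (for example, by averaging predicted probabilities when the base estimator provides them).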



In [6]:
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import BaggingRegressor
from sklearn import tree

model = BaggingClassifier(n_estimators=5)

model.fit(x_train, y_train)
model.score(x_test,y_test)

0.61

In [7]:
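# KNN as the base estimator
# (scikit-learn 1.2 renamed `base_estimator` to `estimator`; newer versions need `estimator=`)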
model = BaggingClassifier(base_estimator=KNeighborsClassifier(), 
                          n_estimators=30)
model.fit(x_train, y_train)
model.score(x_test,y_test)

0.7166666666666667

In [8]:
# train & test for regression
Rx_train = reg_X[:700]
Ry_train = reg_y[:700]

Rx_test = reg_X[700:]
Ry_test = reg_y[700:]

> # Task #2 (5 mins)
1. Use Bagging meta-estimator for regression

In [9]:
# code here task #2


# Boosting

Before we go further, here’s another question for you: if a data point is incorrectly predicted by the first model, and then by the next one (and probably by all the models), will combining the predictions provide better results? Boosting is designed to handle such situations.

Boosting is a sequential process, where each subsequent model attempts to correct the errors of the previous model, so every model depends on the ones before it. Let’s understand the way boosting works in the steps below.

1. First, generate a random sample from the training dataset.
2. Train classifier model 1 on this sample and test it on the whole training dataset.
3. Calculate the error for each instance. If an instance is classified wrongly, increase its weight and create another sample.
4. Repeat this process until the system reaches high accuracy.

<img src='img/boosting.png' width='550'>
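One well-known algorithm that follows this reweighting idea is AdaBoost. The sketch below implements discrete AdaBoost for a binary problem with decision stumps. It is a simplified illustration under assumptions: the dataset and number of rounds are arbitrary, and instead of drawing a new sample each round it passes the instance weights directly to the learner through `sample_weight`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y01 = make_classification(n_samples=500, n_features=10, n_informative=6,
                             random_state=0)
y = 2 * y01 - 1                             # relabel classes as -1 / +1

n_rounds = 20
weights = np.full(len(X), 1 / len(X))       # start with equal instance weights
stumps, alphas = [], []

for _ in range(n_rounds):
    # Train a weak learner (depth-1 tree) on the weighted training set
    stump = DecisionTreeClassifier(max_depth=1, random_state=0)
    stump.fit(X, y, sample_weight=weights)
    pred = stump.predict(X)

    # Weighted error of this learner, clipped to avoid division by zero
    err = np.clip(np.sum(weights[pred != y]), 1e-10, 1 - 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)   # this learner's say in the final vote

    # Increase the weight of misclassified instances, decrease the rest
    weights *= np.exp(-alpha * y * pred)
    weights /= weights.sum()

    stumps.append(stump)
    alphas.append(alpha)

def boosted_predict(X_new):
    """Weighted vote of all stumps; the sign of the score gives the class."""
    scores = sum(a * s.predict(X_new) for a, s in zip(alphas, stumps))
    return np.sign(scores)

first_acc = (stumps[0].predict(X) == y).mean()
boosted_acc = (boosted_predict(X) == y).mean()
print(f"first stump (train): {first_acc}, boosted ensemble (train): {boosted_acc}")
```

Each round concentrates the weight on the instances the previous learners got wrong, which is how later models "correct" earlier ones. Accuracy here is measured on the training data only, so it says nothing about generalization.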

Some boosting algorithms: 

1. XGBoost
2. Light GBM
3. CatBoost

<img src='img/history.png' width='650'>


# XGBoost

XGBoost stands for eXtreme Gradient Boosting.

XGBoost is an advanced implementation of the gradient boosting algorithm. It has proved highly effective and is used extensively in machine learning competitions and hackathons. It has high predictive power and is often cited as being almost 10 times faster than other gradient boosting implementations. It also includes several forms of regularization, which reduce overfitting and improve overall performance; hence it is also known as a **regularized boosting** technique.
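To make the underlying idea concrete, here is a minimal sketch of plain gradient boosting for regression with squared error, built from scikit-learn trees. This is not XGBoost's algorithm (it has no regularization terms and none of XGBoost's speed optimizations); the dataset, learning rate, tree depth, and number of rounds are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=10, n_informative=6,
                       noise=0.1, random_state=0)

learning_rate = 0.1
n_rounds = 50

# Start from a constant prediction: the mean of the targets
prediction = np.full(len(y), y.mean())
initial_mse = np.mean((y - prediction) ** 2)

trees = []
for _ in range(n_rounds):
    # For squared error, the negative gradient is simply the residual
    residuals = y - prediction
    small_tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, residuals)
    # Move the ensemble prediction a small step in the direction of the residuals
    prediction += learning_rate * small_tree.predict(X)
    trees.append(small_tree)

final_mse = np.mean((y - prediction) ** 2)
print(f"training MSE before: {initial_mse:.1f}, after {n_rounds} rounds: {final_mse:.1f}")
```

Each tree is fitted to what the current ensemble still gets wrong, and the learning rate keeps any single tree from dominating.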

In [10]:
from xgboost import XGBClassifier, XGBRegressor

model=XGBClassifier()

model.fit(x_train, y_train)
model.score(x_test,y_test)

0.6933333333333334

> # Task #3 (5 mins)

Apply XGBRegressor on regression dataset and check score

In [11]:
# task #3 

# Light GBM

Before discussing how LightGBM works, let’s first understand why we need this algorithm when we have so many others (like the ones we have seen above). LightGBM is designed for very large datasets: compared to the other algorithms, it typically takes less time to train on a huge dataset.

LightGBM is a gradient boosting framework that uses tree-based algorithms and grows trees leaf-wise, whereas other algorithms grow them level-wise. The image below will help you understand the difference.

<img src='img/light.png' width='450'>

In [12]:
from lightgbm import LGBMClassifier, LGBMRegressor

model = LGBMClassifier()

model.fit(x_train, y_train, eval_set=(x_test, y_test))
model.score(x_test, y_test)

[1]	valid_0's multi_logloss: 1.55608
[2]	valid_0's multi_logloss: 1.50038
[3]	valid_0's multi_logloss: 1.44956
[4]	valid_0's multi_logloss: 1.40209
[5]	valid_0's multi_logloss: 1.36048
[6]	valid_0's multi_logloss: 1.32227
[7]	valid_0's multi_logloss: 1.28969
[8]	valid_0's multi_logloss: 1.25992
[9]	valid_0's multi_logloss: 1.22988
[10]	valid_0's multi_logloss: 1.20254
[11]	valid_0's multi_logloss: 1.1805
[12]	valid_0's multi_logloss: 1.15519
[13]	valid_0's multi_logloss: 1.13834
[14]	valid_0's multi_logloss: 1.11996
[15]	valid_0's multi_logloss: 1.10142
[16]	valid_0's multi_logloss: 1.0873
[17]	valid_0's multi_logloss: 1.07291
[18]	valid_0's multi_logloss: 1.06033
[19]	valid_0's multi_logloss: 1.04948
[20]	valid_0's multi_logloss: 1.03693
[21]	valid_0's multi_logloss: 1.0245
[22]	valid_0's multi_logloss: 1.01167
[23]	valid_0's multi_logloss: 0.999995
[24]	valid_0's multi_logloss: 0.990029
[25]	valid_0's multi_logloss: 0.979912
[26]	valid_0's multi_logloss: 0.970825
[27]	valid_0's multi

0.71

> # Task #4 (5 mins)

Apply LGBMRegressor on regression dataset and check score

In [13]:
# task #4

# CatBoost

The name `CatBoost` comes from two words: `Category` and `Boosting`.

Handling categorical variables is a tedious process, especially when you have a large number of such variables. When your categorical variables have too many labels (i.e. they have high cardinality), one-hot encoding adds one column per label, which greatly increases the dimensionality and makes the dataset difficult to work with.
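The growth in dimensionality is easy to check with NumPy. In this minimal sketch the `city` column is hypothetical, with made-up integer codes standing in for category labels; the one-hot matrix ends up with one column per distinct label.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical categorical column: 1000 rows, up to 200 distinct city codes
city = rng.integers(0, 200, size=1000)

# Map each value to the index of its label, then build the one-hot matrix
labels, codes = np.unique(city, return_inverse=True)
one_hot = np.zeros((city.size, labels.size), dtype=np.int8)
one_hot[np.arange(city.size), codes] = 1

print(f"1 original column -> {one_hot.shape[1]} one-hot columns")
```

A handful of such columns can turn a narrow table into one with hundreds or thousands of mostly-zero features.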

CatBoost can automatically deal with categorical variables and does not require extensive data preprocessing like other machine learning algorithms.

In [17]:
from catboost import CatBoostClassifier
model=CatBoostClassifier(iterations=30)

model.fit(x_train, y_train, eval_set=(x_test, y_test))
model.score(x_test, y_test)

Learning rate set to 0.409232
0:	learn: 1.4453159	test: 1.4906592	best: 1.4906592 (0)	total: 37.8ms	remaining: 1.09s
1:	learn: 1.3169959	test: 1.3969287	best: 1.3969287 (1)	total: 65.5ms	remaining: 917ms
2:	learn: 1.2202081	test: 1.3320256	best: 1.3320256 (2)	total: 92.2ms	remaining: 830ms
3:	learn: 1.1323737	test: 1.2613613	best: 1.2613613 (3)	total: 124ms	remaining: 804ms
4:	learn: 1.0799921	test: 1.2415714	best: 1.2415714 (4)	total: 151ms	remaining: 756ms
5:	learn: 1.0262965	test: 1.2253608	best: 1.2253608 (5)	total: 180ms	remaining: 721ms
6:	learn: 0.9699305	test: 1.1949114	best: 1.1949114 (6)	total: 212ms	remaining: 695ms
7:	learn: 0.9204620	test: 1.1624453	best: 1.1624453 (7)	total: 239ms	remaining: 656ms
8:	learn: 0.8748685	test: 1.1238489	best: 1.1238489 (8)	total: 267ms	remaining: 622ms
9:	learn: 0.8402588	test: 1.1033707	best: 1.1033707 (9)	total: 309ms	remaining: 618ms
10:	learn: 0.7912615	test: 1.0624828	best: 1.0624828 (10)	total: 345ms	remaining: 595ms
11:	learn: 0.762538

0.7233333333333334

> # Task #5 (5 mins)

Apply CatBoostRegressor on regression dataset and check score

# K-Fold Cross-Validation
Cross-validation is a resampling procedure used to evaluate machine learning models on a limited data sample.

The procedure has a single parameter called k that refers to the number of groups that a given data sample is to be split into. As such, the procedure is often called k-fold cross-validation. When a specific value for k is chosen, it may be used in place of k in the reference to the model, such as k=10 becoming 10-fold cross-validation.

The choice of k is usually 5 or 10, but there is no formal rule.
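A minimal sketch of 5-fold cross-validation with scikit-learn, assuming a dataset generated with the same settings as the classification data above and a KNN classifier as an arbitrary example model:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, n_features=20, n_informative=12,
                           n_classes=5, n_redundant=6, random_state=5)

# k = 5: the data is split into 5 groups; each group is the test fold exactly once
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
fold_sizes = [len(test_idx) for _, test_idx in kfold.split(X)]
print(fold_sizes)  # [100, 100, 100, 100, 100]

# One accuracy score per fold; the mean is the cross-validated estimate
scores = cross_val_score(KNeighborsClassifier(), X, y, cv=kfold, scoring="accuracy")
print(scores, scores.mean())
```

Every observation is used for testing exactly once and for training k-1 times, which makes better use of a limited sample than a single train/test split.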


<img src='img/cv.png' width='450'>


# Grid Search

A model hyperparameter is a characteristic of a model that is external to the model and whose value cannot be estimated from data. The value of the hyperparameter has to be set before the learning process begins. 

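Grid search looks for the best hyperparameters by trying every combination of values from a user-specified grid, scoring each combination (usually with cross-validation), and keeping the one that gives the most ‘accurate’ predictions.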

In [15]:
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier, XGBRegressor

model=XGBClassifier()
params = {'min_child_weight': [1, 3],
          'n_estimators': [10, 50, 100],
          'max_depth': [5, 7, 9],
         }

print(model)
grid = GridSearchCV(model, param_grid=params, scoring='accuracy', cv=5, verbose=1)
grid.fit(x_train, y_train)

# print information
print(f'Best score: {grid.best_score_}')
print(f'Best parameters >>> {grid.best_params_}')

grid.score(x_test, y_test)

XGBClassifier(base_score=None, booster=None, colsample_bylevel=None,
              colsample_bynode=None, colsample_bytree=None, gamma=None,
              gpu_id=None, importance_type='gain', interaction_constraints=None,
              learning_rate=None, max_delta_step=None, max_depth=None,
              min_child_weight=None, missing=nan, monotone_constraints=None,
              n_estimators=100, n_jobs=None, num_parallel_tree=None,
              objective='binary:logistic', random_state=None, reg_alpha=None,
              reg_lambda=None, scale_pos_weight=None, subsample=None,
              tree_method=None, validate_parameters=False, verbosity=None)
Fitting 5 folds for each of 18 candidates, totalling 90 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  90 out of  90 | elapsed:   53.6s finished


Best score: 0.6885714285714286
Best parameters >>> {'max_depth': 5, 'min_child_weight': 3, 'n_estimators': 100}


0.7
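The "18 candidates, totalling 90 fits" reported in the log follows directly from the grid: every combination of values is one candidate, and each candidate is fitted once per fold. A short sketch using scikit-learn's `ParameterGrid` on the same grid (the variable names `cv_folds`, `n_candidates` and `n_fits` are introduced here for illustration):

```python
from sklearn.model_selection import ParameterGrid

params = {'min_child_weight': [1, 3],
          'n_estimators': [10, 50, 100],
          'max_depth': [5, 7, 9],
         }
cv_folds = 5

# Every combination of the listed values is one candidate: 2 * 3 * 3
n_candidates = len(ParameterGrid(params))
n_fits = n_candidates * cv_folds

print(n_candidates, n_fits)  # 18 90
```

The number of fits grows multiplicatively with every hyperparameter added to the grid, so it pays to keep the lists of values short.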

> # Assignment (grid search):
**Tune hyperparameters using grid search for:**
1. XGBoost Regressor
2. LightGBM Regressor
3. CatBoost Regressor

Email your assignment to nurulaktertowhid@gmail.com

**Email Subject : Assignment (grid search) "Your Name"**

**Deadline : 24th July 2020 (7PM)**