<a class="anchor" id="0"></a>
# **AdaBoost Classifier Tutorial in Python**

Boosting algorithms such as AdaBoost, Gradient Boosting, and XGBoost are widely used machine learning algorithms. In this kernel, we will discuss the **AdaBoost algorithm**.

# **Notebook Contents** <a class="anchor" id="0.1"></a>

1. [Intro to Ensemble Machine Learning](#1)
    - [1.1. Bagging](#1.1)
    - [1.2. Boosting](#1.2)
    - [1.3. Stacking](#1.3)
1. [How are base-learners classified](#2)
1. [AdaBoost Classifier](#3)
1. [AdaBoost algorithm intuition](#4)
1. [Difference between AdaBoost and Gradient Boosting model](#5)
1. [AdaBoost implementation in Python](#6)
    - [6.1 Import libraries](#6.1)
    - [6.2 Load dataset](#6.2)
    - [6.3 EDA](#6.3)
    - [6.4 Split dataset into training and test set](#6.4)
    - [6.5 Build the AdaBoost model](#6.5)
    - [6.6 Evaluate model](#6.6)
    - [6.7 Further evaluation with SVC base estimator](#6.7)
1. [Advantages and disadvantages of AdaBoost](#7)
1. [Results and Conclusion](#8)

# **1. Intro to Ensemble Machine Learning** <a class="anchor" id="1"></a>

[Back to Notebook Contents](#0.1)


- An ensemble model is a composite model which combines a series of low performing or weak classifiers with the aim of creating a strong classifier.

- Here, the individual classifiers vote, and the final prediction label is the one chosen by majority voting.

- Now, these individual classifiers are combined according to some specific criterion to create an ensemble model.

- These ensemble models offer greater accuracy than individual or base classifiers.

- These models can be parallelized by allocating each base learner to a different machine.

- So, we can say that ensemble learning methods are meta-algorithms that combine several machine learning algorithms into a single predictive model to increase performance.

- Ensemble models are created according to some specific criterion as stated below:-

  - **Bagging** - They can be created to decrease model variance using bagging approach.
  
  - **Boosting** - They can be created to decrease model bias using a boosting approach.
  
  - **Stacking** - They can be created to improve model predictions using stacking approach.
  
  
- It can be depicted with the help of following diagram.

![Ensemble Machine Learning](https://res.cloudinary.com/dyd911kmh/image/upload/f_auto,q_auto:best/v1542651255/image_1_joyt3x.png)

### **1.1 Bagging** <a class="anchor" id="1.1"></a>

- **Bagging** stands for **bootstrap aggregation**.

- It combines multiple learners in a way to reduce the variance of estimates.

- For example, a random forest trains N decision trees on different random subsets of the data and takes a vote for the final prediction.

- Examples of **bagging ensemble** methods are **Random Forest** and **Extra Trees**.
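
The bootstrap-and-vote idea described above can be sketched by hand. This is only an illustration on synthetic data: the dataset from `make_classification`, the choice of 25 trees, and all variable names are assumptions made for the example, not part of this kernel's experiment.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary problem (labels are 0/1), used only for illustration
X, y = make_classification(n_samples=600, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
n_trees = 25  # odd, so a binary majority vote cannot tie
trees = []
for _ in range(n_trees):
    # Bootstrap: draw as many row indices as there are rows, with replacement
    idx = rng.integers(0, len(X_train), size=len(X_train))
    trees.append(DecisionTreeClassifier(random_state=0).fit(X_train[idx], y_train[idx]))

# Aggregate: majority vote over the individual tree predictions
votes = np.stack([tree.predict(X_test) for tree in trees])  # (n_trees, n_test)
bagged_pred = (votes.mean(axis=0) >= 0.5).astype(int)

bagged_accuracy = (bagged_pred == y_test).mean()
print("Bagged accuracy:", bagged_accuracy)
```

In practice, `sklearn.ensemble.BaggingClassifier` or `RandomForestClassifier` would normally be used instead of a hand-written loop like this.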

### **1.2 Boosting** <a class="anchor" id="1.2"></a>

- **Boosting** algorithms combine a set of weak classifiers to create a strong classifier.

- Strong classifiers offer an error rate close to 0.

- A boosting algorithm keeps track of the observations that the previous models failed to predict accurately.

- Boosting algorithms are often less affected by the overfitting problem, although they can still overfit noisy data.

- The following three algorithms have gained massive popularity in data science competitions.

  - AdaBoost (Adaptive Boosting)
  - Gradient Tree Boosting (GBM)
  - XGBoost
  
- We will discuss AdaBoost in this kernel and GBM and XGBoost in future kernels.

- Please refer to my previous kernel - [Bagging vs Boosting](https://www.kaggle.com/prashant111/bagging-vs-boosting?scriptVersionId=24194759) for a more detailed discussion on **Bagging** and **Boosting**.

### **1.3 Stacking** <a class="anchor" id="1.3"></a>

- **Stacking** (or stacked generalization) is an ensemble learning technique that combines the predictions of multiple base classification models into a new data set.

- This new data set is treated as the input data for another classifier.

- This classifier is then used to make the final prediction. Stacking is closely related to blending, and the two terms are often used interchangeably.
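
As a rough sketch of stacking, scikit-learn's `StackingClassifier` trains the base models and then fits a final classifier on their predictions. The synthetic dataset and the particular base models chosen here (a shallow tree and k-nearest neighbours, with logistic regression on top) are assumptions made only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only for illustration
X, y = make_classification(n_samples=600, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

stack = StackingClassifier(
    # Base models: their cross-validated predictions form the new data set
    estimators=[
        ("tree", DecisionTreeClassifier(max_depth=3, random_state=0)),
        ("knn", KNeighborsClassifier()),
    ],
    # Final classifier trained on the base models' predictions
    final_estimator=LogisticRegression(),
)
stack.fit(X_train, y_train)

stack_accuracy = stack.score(X_test, y_test)
print("Stacking accuracy:", stack_accuracy)
```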

# **2. How are base-learners classified** <a class="anchor" id="2"></a>

[Back to Notebook Contents](#0.1)


- Ensemble methods can be classified in two ways.


- On the basis of the arrangement of base learners, ensemble methods can be divided into two groups.

  - In parallel ensemble methods, base learners are generated in parallel, for example Random Forest.
  
  - In sequential ensemble methods, base learners are generated sequentially, for example AdaBoost.
  

- On the basis of the type of base learners, ensemble methods can be divided into two groups.

  - A homogeneous ensemble method uses the same type of base learner in each iteration.
  
  - A heterogeneous ensemble method uses different types of base learners in each iteration.

# **3. AdaBoost Classifier** <a class="anchor" id="3"></a>

[Back to Notebook Contents](#0.1)


- **AdaBoost**, or **Adaptive Boosting**, is an ensemble boosting classifier proposed by Yoav Freund and Robert Schapire in 1996.

- It combines multiple weak classifiers to increase the accuracy of the overall classifier.

- AdaBoost is an iterative ensemble method. It builds a strong classifier by combining multiple poorly performing classifiers, so that the combined classifier achieves high accuracy.

- The basic concept behind AdaBoost is to set the weights of the classifiers and of the training samples in each iteration so that unusual, hard-to-classify observations are predicted accurately.

- Any machine learning algorithm can be used as base classifier if it accepts weights on the training set.

- **AdaBoost** should meet two conditions:

   1. The classifier should be trained iteratively on various weighted training examples.
  
   2. In each iteration, it tries to provide an excellent fit for these examples by minimizing training error.

- To build an AdaBoost classifier, imagine that as a first base classifier we train a Decision Tree algorithm to make predictions on our training data.

- Now, following the methodology of AdaBoost, the weight of the misclassified training instances is increased.

- The second classifier is trained using the updated weights, and the procedure is repeated over and over again.

- At the end of every model prediction we end up boosting the weights of the misclassified instances so that the next model does a better job on them, and so on.

- AdaBoost adds predictors to the ensemble gradually making it better. The great disadvantage of this algorithm is that the model cannot be parallelized since each predictor can only be trained after the previous one has been trained and evaluated.

- Below are the steps for performing the AdaBoost algorithm:

  1. Initially, all observations are given equal weights.
  
  2. A model is built on a subset of data.
  
  3. Using this model, predictions are made on the whole dataset.
  
  4. Errors are calculated by comparing the predictions and actual values.
  
  5. While creating the next model, higher weights are given to the data points which were predicted incorrectly.
  
  6. Weights can be determined using the error value. For instance, the higher the error, the greater the weight assigned to the observation.
  
  7. This process is repeated until the error function does not change, or the maximum limit of the number of estimators is reached.
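
The steps above can be sketched from scratch for the binary case. This is a minimal illustration, not scikit-learn's implementation: it assumes labels in {-1, +1}, uses one-feature threshold "stumps" as weak learners, and applies the sample weights directly in the error computation instead of drawing a data subset. All function names are hypothetical.

```python
import numpy as np

def fit_stump(X, y, w):
    """Find the stump (feature, threshold, polarity) with the lowest weighted error."""
    best = (0, 0.0, 1, np.inf)
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for polarity in (1, -1):
                pred = np.where(polarity * (X[:, j] - thr) >= 0, 1, -1)
                err = w[pred != y].sum()
                if err < best[3]:
                    best = (j, thr, polarity, err)
    return best

def stump_predict(X, stump):
    j, thr, polarity, _ = stump
    return np.where(polarity * (X[:, j] - thr) >= 0, 1, -1)

def adaboost_fit(X, y, n_rounds=10):
    w = np.full(len(y), 1.0 / len(y))            # step 1: equal weights
    ensemble = []
    for _ in range(n_rounds):                    # step 7: repeat up to the limit
        stump = fit_stump(X, y, w)               # step 2: build a weak model
        pred = stump_predict(X, stump)           # step 3: predict on all data
        err = np.clip(stump[3], 1e-10, 1 - 1e-10)  # step 4: weighted error
        alpha = 0.5 * np.log((1 - err) / err)    # weight of this weak model
        w = w * np.exp(-alpha * y * pred)        # steps 5-6: raise weights of mistakes
        w = w / w.sum()
        ensemble.append((alpha, stump))
        if stump[3] == 0.0:                      # perfect fit: nothing left to correct
            break
    return ensemble

def adaboost_predict(X, ensemble):
    score = sum(alpha * stump_predict(X, stump) for alpha, stump in ensemble)
    return np.where(score >= 0, 1, -1)

# Toy data with a diagonal boundary that no single axis-aligned stump can fit
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)

ensemble = adaboost_fit(X, y, n_rounds=20)
train_accuracy = (adaboost_predict(X, ensemble) == y).mean()
print("Rounds used: %d, training accuracy: %.3f" % (len(ensemble), train_accuracy))
```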

# **4. AdaBoost algorithm intuition** <a class="anchor" id="4"></a>

[Back to Notebook Contents](#0.1)


- It works in the following steps:

   1. Initially, AdaBoost selects a training subset randomly.
  
   2. It iteratively trains the AdaBoost model by selecting the training set based on the accuracy of the predictions in the previous round.
  
   3. It assigns higher weights to wrongly classified observations so that the next iteration concentrates on classifying these observations correctly.
  
   4. It also assigns a weight to the trained classifier in each iteration according to its accuracy. The more accurate a classifier is, the higher its weight.
  
   5. This process iterates until the complete training data is fitted without any error or until the specified maximum number of estimators is reached.
  
   6. To classify, perform a weighted "vote" across all of the classifiers you built.
  
  
- The intuition can be depicted with the following diagram:

![AdaBoost Classifier](https://res.cloudinary.com/dyd911kmh/image/upload/f_auto,q_auto:best/v1542651255/image_3_nwa5zf.png)
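
For reference, the weights mentioned in steps 3 and 4 are usually written as follows for the binary case. The notation is an assumption introduced here, not taken from this kernel: labels $y_i \in \{-1, +1\}$, weak classifier $h_t$ in round $t$, and sample weights $w_i^{(t)}$ that sum to 1.

$$
\varepsilon_t = \sum_{i=1}^{n} w_i^{(t)} \, \mathbb{1}\left[h_t(x_i) \neq y_i\right],
\qquad
\alpha_t = \frac{1}{2} \ln \frac{1 - \varepsilon_t}{\varepsilon_t}
$$

$$
w_i^{(t+1)} = \frac{w_i^{(t)} \exp\left(-\alpha_t \, y_i \, h_t(x_i)\right)}{Z_t},
\qquad
H(x) = \operatorname{sign}\left(\sum_{t=1}^{T} \alpha_t \, h_t(x)\right)
$$

Here $Z_t$ is the constant that makes the new weights sum to 1. As a worked example, a classifier with weighted error $\varepsilon_t = 0.2$ gets $\alpha_t = \frac{1}{2}\ln 4 \approx 0.693$, so before normalization each misclassified sample has its weight multiplied by $e^{0.693} = 2$ and each correctly classified sample by $e^{-0.693} = 0.5$.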

# **5. Difference between AdaBoost and Gradient Boosting** <a class="anchor" id="5"></a>


[Back to Notebook Contents](#0.1)


- **AdaBoost** stands for **Adaptive Boosting**. It is a sequential ensemble machine learning technique. The general idea of boosting algorithms is to train predictors sequentially, where each subsequent model attempts to fix the errors of its predecessor.


- **GBM or Gradient Boosting** is also a sequential model. Gradient boosting calculates the gradient (derivative) of the loss function with respect to the prediction (instead of the features). It increases accuracy by minimizing the loss function (which measures the difference between actual and predicted values) and using the negative gradient of this loss as the target for the next iteration.


- The gradient boosting algorithm builds a first weak learner and calculates the loss function. It then builds a second learner to predict the error that remains after the first step. The process continues for the third learner, then the fourth learner, and so on until a certain threshold is reached.


- So, the question arises of how AdaBoost differs from the Gradient Boosting algorithm, since both of them are boosting techniques.


- Both AdaBoost and Gradient Boosting build weak learners in a sequential fashion. Originally, AdaBoost was designed in such a way that at every step the sample distribution was adapted to put more weight on misclassified samples and less weight on correctly classified samples. The final prediction is a weighted average of all the weak learners, where more weight is placed on stronger learners.


- Later, it was discovered that AdaBoost can also be expressed in terms of the more general framework of additive models with a particular loss function (the exponential loss).


- So, the main differences between AdaBoost and GBM are as follows:


  1. Gradient Boosting is a generic algorithm for finding approximate solutions to the additive modeling problem, while AdaBoost can be seen as a special case with a particular loss function (the exponential loss). Hence, Gradient Boosting is much more flexible.


  2. AdaBoost can be interpreted from a much more intuitive perspective and can be implemented without reference to gradients by reweighting the training samples based on the classifications from previous learners.


  3. In AdaBoost, shortcomings are identified by high-weight data points, while in Gradient Boosting, shortcomings of existing weak learners are identified by gradients.


  4. AdaBoost is more about 'voting weights', while Gradient Boosting is more about 'adding gradient optimization'.


  5. AdaBoost increases accuracy by giving more weight to the samples that the model misclassifies. At each iteration, the adaptive boosting algorithm changes the sample distribution by modifying the weights attached to each of the instances. It increases the weights of the wrongly predicted instances and decreases the weights of the correctly predicted instances.
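
To make the contrast concrete, here is a minimal sketch of gradient boosting for regression with squared loss, where each new tree is fitted to the residuals (the negative gradient of the loss with respect to the current prediction). The synthetic sine data, the tree depth, the learning rate, and the number of rounds are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data, used only for illustration
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())          # start from a constant model
mse_history = [np.mean((y - prediction) ** 2)]
trees = []

for _ in range(50):
    # For the loss 0.5 * (y - F)^2, the negative gradient w.r.t. F is the residual
    residual = y - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residual)
    prediction += learning_rate * tree.predict(X)  # add a damped correction
    trees.append(tree)
    mse_history.append(np.mean((y - prediction) ** 2))

print("Training MSE before boosting: %.4f" % mse_history[0])
print("Training MSE after boosting:  %.4f" % mse_history[-1])
```

Note that no sample weights appear here: the information about past mistakes is carried by the residuals, whereas AdaBoost carries it in the sample weights.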

# **6. AdaBoost implementation in Python** <a class="anchor" id="6"></a>

[Back to Notebook Contents](#0.1)


- Now, we come to the implementation part of AdaBoost algorithm in Python.

- The first step is to load the required libraries.

### 6.1 Import libraries <a class="anchor" id="6.1"></a>

In [9]:
import time
from datetime import datetime


### 6.2 Load dataset <a class="anchor" id="6.2"></a>

In [2]:
import torchvision

# Load Data
train_dataset = torchvision.datasets.MNIST(root='../../data/',
                                           train=True,
                                           download=True)

test_dataset = torchvision.datasets.MNIST(root='../../data/',
                                          train=False)



### 6.4 Split dataset into training set and test set <a class="anchor" id="6.4"></a>

In [4]:
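# `.data` and `.targets` replace the deprecated `train_data`/`train_labels`
# and `test_data`/`test_labels` attributes of the torchvision MNIST dataset.
# Take the first 5000 images and flatten each one: (5000, 28, 28) -> (5000, 784)
training_data = train_dataset.data.numpy()[:5000].reshape(5000, -1)
training_label = train_dataset.targets[:5000].numpy()

test_data = test_dataset.data.numpy()[:5000].reshape(5000, -1)
test_label = test_dataset.targets[:5000].numpy()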

In [5]:
print('Training data size: ', training_data.shape)
print('Training data label size:', training_label.shape)
print('Test data size: ', test_data.shape)
print('Test data label size:', test_label.shape)

Training data size:  (5000, 784)
Training data label size: (5000,)
Test data size:  (5000, 784)
Test data label size: (5000,)


### 6.5 Build the AdaBoost model <a class="anchor" id="6.5"></a>

In [6]:
# Import the AdaBoost classifier
from sklearn.ensemble import AdaBoostClassifier


# Create the AdaBoost classifier object with default parameters
clf = AdaBoostClassifier()

# Train the AdaBoost classifier
model1 = clf.fit(training_data, training_label)


# Predict the response for the test dataset
y_pred = model1.predict(test_data)

### Create Adaboost Classifier

- The most important parameters are `estimator`, `n_estimators` and `learning_rate`.

- **estimator** (named `base_estimator` in scikit-learn releases before 1.2) is the learning algorithm used to train the weak models. This will almost never need to be changed, because by far the most common learner to use with AdaBoost is a decision tree, which is this parameter's default.

- **n_estimators** is the number of models to iteratively train. It defaults to 50.

- **learning_rate** is the contribution of each model to the weights and defaults to 1. Reducing the learning rate means the weights are increased or decreased by a smaller amount, which makes the model train more slowly (but sometimes results in better performance scores).

- **loss** is exclusive to AdaBoostRegressor and sets the loss function to use when updating weights. It defaults to a linear loss function, but it can be changed to square or exponential.
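
A short sketch of how these parameters are passed and what the fitted model exposes. The synthetic dataset and the specific values (`n_estimators=100`, `learning_rate=0.5`) are assumptions chosen for illustration, not tuned settings for MNIST.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Synthetic data, used only for illustration
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The weak learner is left at its default: a depth-1 decision tree (a "stump")
clf = AdaBoostClassifier(n_estimators=100, learning_rate=0.5, random_state=0)
clf.fit(X_train, y_train)

print("Weak learner type:", type(clf.estimators_[0]).__name__)
print("Weak learner depth:", clf.estimators_[0].max_depth)
print("Number of fitted learners:", len(clf.estimators_))
print("Test accuracy:", clf.score(X_test, y_test))
```

Boosting can stop early, so the number of fitted learners may be smaller than `n_estimators`.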




### 6.6 Evaluate Model <a class="anchor" id="6.6"></a>

Let's estimate how accurately the classifier can predict the handwritten digits.

In [7]:
#import scikit-learn metrics module for accuracy calculation
from sklearn import metrics

# calculate and print model accuracy
print("Accuracy without best param:", metrics.accuracy_score(y_true=test_label, y_pred=y_pred), "\n")

Accuracy without best param: 0.4606 
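
A single accuracy number hides which classes are being confused. As a small sketch, `confusion_matrix` from the same `metrics` module breaks the result down per class. The label arrays below are made-up toy values with three classes, used only so the example is self-contained; on the real data they would be `test_label` and `y_pred`.

```python
import numpy as np
from sklearn import metrics

# Hypothetical toy labels (three classes), standing in for test_label and y_pred
toy_true = np.array([0, 0, 1, 1, 2, 2])
toy_pred = np.array([0, 1, 1, 1, 2, 0])

# Rows are true classes, columns are predicted classes
cm = metrics.confusion_matrix(toy_true, toy_pred)
print(cm)

# Accuracy is the diagonal (correct predictions) divided by the total count
toy_accuracy = metrics.accuracy_score(toy_true, toy_pred)
print("Accuracy:", toy_accuracy)
```

For a 10-class problem such as MNIST, random guessing gives about 10% accuracy, which is a useful baseline when reading the score above.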



### 6.7 The effect of the number of estimators
Let's see the effect of `n_estimators` on the same model.

In [10]:
print("start")
StartTime = time.time()

for i in range(10, 200, 10):
    clf = AdaBoostClassifier(n_estimators=i)

    # Train the AdaBoost classifier
    model1 = clf.fit(training_data, training_label)


    # Predict the response for the test dataset
    y_pred = model1.predict(test_data)

    acc = metrics.accuracy_score(y_true=test_label, y_pred=y_pred)
    print("n_estimators = %d, accuracy:%f" % (i, acc))

EndTime = time.time()
print('Total time %.2f s' % (EndTime - StartTime))


start
n_estimators = 10, accuracy:0.558400
n_estimators = 20, accuracy:0.545800
n_estimators = 30, accuracy:0.525000
n_estimators = 40, accuracy:0.508400
n_estimators = 50, accuracy:0.460600
n_estimators = 60, accuracy:0.453200
n_estimators = 70, accuracy:0.441000
n_estimators = 80, accuracy:0.447000
n_estimators = 90, accuracy:0.438800
n_estimators = 100, accuracy:0.447200
n_estimators = 110, accuracy:0.438400
n_estimators = 120, accuracy:0.446000
n_estimators = 130, accuracy:0.438200
n_estimators = 140, accuracy:0.446000
n_estimators = 150, accuracy:0.438200
n_estimators = 160, accuracy:0.446000
n_estimators = 170, accuracy:0.438400
n_estimators = 180, accuracy:0.446000
n_estimators = 190, accuracy:0.438600
Total time 256.14 s


- In this case, the best accuracy is 55.84%, obtained with 10 estimators. Accuracy declines as more estimators are added.
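
Refitting the model for every value of `n_estimators` is what makes the loop above slow. As a possible alternative, `AdaBoostClassifier.staged_score` yields the score after each boosting round of a single fitted model. The sketch below uses synthetic data rather than MNIST so that it is self-contained; the dataset and variable names are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Synthetic data, used only for illustration
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit once with the largest number of estimators of interest
clf = AdaBoostClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# staged[k] is the accuracy of the ensemble truncated to its first k + 1 learners
staged = list(clf.staged_score(X_test, y_test))
for n in range(10, len(staged) + 1, 10):
    print("n_estimators = %d, accuracy:%f" % (n, staged[n - 1]))

best_n = max(range(len(staged)), key=staged.__getitem__) + 1
print("Best number of estimators:", best_n)
```

Choosing `n_estimators` on the test set gives an optimistic estimate, so a separate validation split is the safer choice for model selection.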

### 6.7 Further evaluation with SVC base estimator <a class="anchor" id="6.7"></a>


- For further evaluation, we will use SVC as a base estimator as follows:

In [11]:
# load the required classifier
from sklearn.ensemble import AdaBoostClassifier


# import Support Vector Classifier
from sklearn.svm import SVC


# import scikit-learn metrics module for accuracy calculation
from sklearn.metrics import accuracy_score

# linear-kernel SVC; probability=True enables predict_proba
svc = SVC(probability=True, kernel='linear')


# create the AdaBoost classifier object
clf2 = AdaBoostClassifier(estimator=svc)


# train the AdaBoost classifier
model2 = clf2.fit(training_data, training_label)


# predict the response for the test dataset
y_pred2 = model2.predict(test_data)


# calculate and print model accuracy
print("Model Accuracy with SVC Base Estimator:", accuracy_score(test_label, y_pred2))


Model Accuracy with SVC Base Estimator: 0.8874




### 6.8 Further evaluation with SVC base estimator + n_estimators <a class="anchor" id="6.8"></a>


In [15]:

# create the AdaBoost classifier object
# (`estimator` replaces the older `base_estimator` keyword, matching the cell above)
clf3 = AdaBoostClassifier(n_estimators=10, estimator=svc)


# train the AdaBoost classifier
model3 = clf3.fit(training_data, training_label)


# predict the response for the test dataset
y_pred3 = model3.predict(test_data)


# calculate and print model accuracy
print("Model Accuracy with SVC Base Estimator + n_estimator:", accuracy_score(test_label, y_pred3))




Model Accuracy with SVC Base Estimator + n_estimator: 0.8884


### 6.9 Further evaluation with Decision Tree base estimator <a class="anchor" id="6.9"></a>

In [14]:
# load the required classifier
from sklearn.ensemble import AdaBoostClassifier


# import Decision Tree Classifier
from sklearn.tree import DecisionTreeClassifier


# import scikit-learn metrics module for accuracy calculation
from sklearn.metrics import accuracy_score

# decision tree with default settings (no depth limit)
DT = DecisionTreeClassifier()


# create the AdaBoost classifier object
clf4 = AdaBoostClassifier(estimator=DT)


# train the AdaBoost classifier
model4 = clf4.fit(training_data, training_label)


# predict the response for the test dataset
y_pred4 = model4.predict(test_data)


# calculate and print model accuracy
print("Model Accuracy with Decision Tree Estimator:", accuracy_score(test_label, y_pred4))


Model Accuracy with Decision Tree Estimator: 0.747


- In this case, we have obtained the following classification accuracies:
 - 46.06% with all parameters at their defaults.
 - 55.84% with the best number of estimators (`n_estimators=10`).
 - 88.74% with SVC as the estimator.
 - 88.84% with SVC as the estimator and `n_estimators=10`.
 - 74.70% with a decision tree as the estimator.



- The SVC base estimator gives better accuracy than the default estimator.
- The SVC base estimator gives better accuracy than the decision tree base estimator.
- Combining the SVC base estimator with `n_estimators=10` changes the accuracy only marginally (88.84% versus 88.74%).



# **7. Advantages and disadvantages of AdaBoost** <a class="anchor" id="7"></a>

[Back to Notebook Contents](#0.1)


- The advantages are as follows:

   1. AdaBoost is easy to implement.
  
   2. It iteratively corrects the mistakes of the weak classifier and improves accuracy by combining weak learners.
  
   3. We can use many base classifiers with AdaBoost.
  
   4. AdaBoost is often fairly resistant to overfitting on clean, low-noise data.

- The disadvantages are as follows:

   1. AdaBoost is sensitive to noisy data.
  
   2. It is highly affected by outliers because it tries to fit each point perfectly.
  
   3. AdaBoost is slower compared to XGBoost.

# **8. Results and Conclusion** <a class="anchor" id="8"></a>

[Back to Notebook Contents](#0.1)


- In this kernel, we have discussed AdaBoost classifier.

- We have discussed how the base-learners are classified.

- Then, we moved on to discuss the intuition behind the AdaBoost classifier.

- We have also discussed the differences between the AdaBoost classifier and GBM.

- Then, we presented the implementation of the AdaBoost classifier using the MNIST dataset.

- Lastly, we have discussed the advantages and disadvantages of AdaBoost classifier.

[Go to Top](#0)