### AdaBoost Classifier
---
**Elo notes**

AdaBoost, short for "Adaptive Boosting", is a machine learning meta-algorithm. It can be used in conjunction with many other types of learning algorithms to improve their performance. The output of the other learning algorithms ('weak learners') is combined into a weighted sum that represents the final output of the boosted classifier. AdaBoost is adaptive in the sense that subsequent weak learners are tweaked in favor of those instances misclassified by previous classifiers. AdaBoost is sensitive to noisy data and outliers. In some problems it can be less susceptible to overfitting than other learning algorithms. The individual learners can be weak, but as long as each one performs slightly better than random guessing (e.g., its error rate is smaller than 0.5 for binary classification), the final model can be proven to converge to a strong learner.

While every learning algorithm will tend to suit some problem types better than others, and will typically have many different parameters and configurations to be adjusted before achieving optimal performance on a dataset, AdaBoost (with decision trees as the weak learners) is often referred to as the best out-of-the-box classifier. When used with decision tree learning, information gathered at each stage of the AdaBoost algorithm about the relative 'hardness' of each training sample is fed into the tree growing algorithm such that later trees tend to focus on harder-to-classify examples.

Problems in machine learning often suffer from the curse of dimensionality — each sample may consist of a huge number of potential features (for instance, there can be 162,336 Haar features, as used by the Viola–Jones object detection framework, in a 24×24 pixel image window), and evaluating every feature can reduce not only the speed of classifier training and execution, but in fact reduce predictive power, per the Hughes Effect. Unlike neural networks and SVMs, the AdaBoost training process selects only those features known to improve the predictive power of the model, reducing dimensionality and potentially improving execution time as irrelevant features do not need to be computed.
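The feature-selection effect can be checked with a small sketch. It assumes scikit-learn's `AdaBoostClassifier` with its default depth-1 tree (a stump) as the weak learner, and uses synthetic data from `make_classification` as a stand-in; the names `X_wide`, `y_wide`, `wide_boost` and `n_rounds` are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# Synthetic stand-in data: 200 features, only 5 of them informative
X_wide, y_wide = make_classification(
    n_samples=500, n_features=200, n_informative=5,
    n_redundant=0, random_state=0,
)

# The default weak learner is a depth-1 decision tree (a stump),
# so each boosting round can split on at most one feature.
n_rounds = 20
wide_boost = AdaBoostClassifier(n_estimators=n_rounds, random_state=0)
wide_boost.fit(X_wide, y_wide)

# Features that no stump ever split on have zero importance
used = np.flatnonzero(wide_boost.feature_importances_)
print(f"{used.size} of {X_wide.shape[1]} features used")
```

With stumps, the number of features used can never exceed the number of boosting rounds, so the other features would not need to be computed at prediction time.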

**Training**

AdaBoost refers to a particular method of training a boosted classifier. A boosted classifier is a classifier of the form

${\displaystyle F_{T}(x)=\sum _{t=1}^{T}f_{t}(x)}$

**where each ${\displaystyle f_{t}}$ is a weak learner (weak classifier) that takes an object ${\displaystyle x}$ as input and returns a value indicating the class of the object**. For example, in the two-class problem, the sign of the weak learner output identifies the predicted object class and the absolute value gives the confidence in that classification. Similarly, the ${\displaystyle T}$-stage classifier ${\displaystyle F_{T}}$ is positive if the sample is believed to be in the positive class and negative otherwise.
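As a small illustration of the sign and confidence reading, the sketch below sums the outputs of three weak learners on four samples. The numbers in `f` are made up for the example and do not come from any trained model.

```python
import numpy as np

# Hypothetical real-valued outputs f_t(x) of three weak learners (rows)
# on four samples (columns); the values are invented for illustration.
f = np.array([
    [ 0.8, -0.2,  0.1, -0.9],
    [ 0.3, -0.6, -0.4, -0.1],
    [-0.2, -0.1,  0.5, -0.7],
])

F = f.sum(axis=0)                 # F_T(x) = sum over t of f_t(x)
labels = np.where(F >= 0, 1, -1)  # the sign gives the predicted class
confidence = np.abs(F)            # the magnitude gives the confidence

print(labels)
print(confidence)
```

The third sample shows why the sum matters: one learner votes negative, but the two positive votes outweigh it, so the combined prediction is positive with low confidence.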

**Each weak learner produces an output hypothesis ${\displaystyle h(x_{i})}$ (classifier) for each sample in the training set**. At each iteration ${\displaystyle t}$, a weak learner is selected and assigned a **coefficient** ${\displaystyle \alpha _{t}}$ such that the total training error ${\displaystyle E_{t}}$ of the resulting ${\displaystyle t}$-stage boosted classifier is minimized.

${\displaystyle E_{t}=\sum _{i}E[F_{t-1}(x_{i})+\alpha _{t}h(x_{i})]}$

Here **${\displaystyle F_{t-1}(x)}$ is the boosted classifier** that has been built up to the previous stage of training, ${\displaystyle E(F)}$ is some error function and ${\displaystyle f_{t}(x)=\alpha _{t}h(x)} $ is the weak learner that is being considered for addition to the final classifier.

**Weighting**

At each iteration of the training process, a weight ${\displaystyle w_{i,t}}$ is assigned to each sample in the training set, equal to the current error ${\displaystyle E(F_{t-1}(x_{i}))}$ on that sample. Before the first iteration no classifier exists yet, so all $N$ samples start with the same weight, $\frac{1}{N}$. These weights can be used to inform the training of the weak learner; for instance, decision trees can be grown that favor splitting sets of samples with high weights.

${\displaystyle w_{i,t}=E(F_{t-1}(x_{i})),\qquad w_{i,1}={\frac {1}{N}}}$
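One way to see how sample weights steer a tree is scikit-learn's `sample_weight` argument to `DecisionTreeClassifier.fit`. The sketch below is a toy example with hand-picked weights, not the weights AdaBoost would compute; `X_toy`, `y_toy`, `w_uniform` and `w_focused` are illustrative names.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X_toy = np.array([[0], [1], [2], [3], [4]])
y_toy = np.array([-1, -1, 1, -1, -1])   # the lone positive sits at x = 2

# Uniform weights: each of the N samples gets 1/N
w_uniform = np.full(len(y_toy), 1 / len(y_toy))
stump_uniform = DecisionTreeClassifier(max_depth=1, random_state=0)
stump_uniform.fit(X_toy, y_toy, sample_weight=w_uniform)

# Hand-picked weights that put most of the mass on the positive sample
w_focused = np.array([1, 1, 5, 1, 1]) / 9
stump_focused = DecisionTreeClassifier(max_depth=1, random_state=0)
stump_focused.fit(X_toy, y_toy, sample_weight=w_focused)

pred_uniform = stump_uniform.predict([[2]])[0]   # prediction at x = 2
pred_focused = stump_focused.predict([[2]])[0]
print(pred_uniform, pred_focused)
```

Under uniform weights the stump gives up on the positive sample because it is outvoted in its leaf; once that sample carries most of the weight, the same depth-1 tree classifies it correctly at the cost of its lighter neighbors.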





---
Suppose we have a data set ${\displaystyle \{(x_{1},y_{1}),\ldots ,(x_{N},y_{N})\}}$ where each item ${\displaystyle x_{i}}$ has an associated class ${\displaystyle y_{i}\in \{-1,1\}}$, and a set of weak classifiers ${\displaystyle \{k_{1},\ldots ,k_{L}\}}$, each of which outputs a classification ${\displaystyle k_{j}(x_{i})\in \{-1,1\}}$ for each item. After the ${\displaystyle (m-1)}$-th iteration our boosted classifier is a linear combination of the weak classifiers of the form:

${\displaystyle C_{(m-1)}(x_{i})=\alpha _{1}k_{1}(x_{i})+\cdots +\alpha _{m-1}k_{m-1}(x_{i})}$

At the ${\displaystyle m^{th}}$ iteration we want to extend this to a better boosted classifier by adding a multiple of one of the weak classifiers:

${\displaystyle C_{m}(x_{i})=C_{(m-1)}(x_{i})+\alpha _{m}k_{m}(x_{i})}$


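It remains to determine the weight ${\displaystyle \alpha _{m}}$ of the chosen weak classifier ${\displaystyle k_{m}}$. Define the total error ${\displaystyle E}$ of ${\displaystyle C_{m}}$ as the sum of its exponential loss over the data points, and let ${\displaystyle w_{i}^{(m)}=e^{-y_{i}C_{(m-1)}(x_{i})}}$ be the weight of sample ${\displaystyle i}$ at iteration ${\displaystyle m}$. Splitting the sum into correctly and incorrectly classified samples gives:

${\displaystyle E=\sum _{i=1}^{N}e^{-y_{i}C_{m}(x_{i})}=\sum _{y_{i}=k_{m}(x_{i})}w_{i}^{(m)}e^{-\alpha _{m}}+\sum _{y_{i}\neq k_{m}(x_{i})}w_{i}^{(m)}e^{\alpha _{m}}}$

Differentiating with respect to ${\displaystyle \alpha _{m}}$:

${\displaystyle {\frac {dE}{d\alpha _{m}}}=-\sum _{y_{i}=k_{m}(x_{i})}w_{i}^{(m)}e^{-\alpha _{m}}+\sum _{y_{i}\neq k_{m}(x_{i})}w_{i}^{(m)}e^{\alpha _{m}}}$

Setting this to zero and solving for ${\displaystyle \alpha _{m}}$ yields:

${\displaystyle \alpha _{m}={\frac {1}{2}}\ln \left({\frac {\sum _{y_{i}=k_{m}(x_{i})}w_{i}^{(m)}}{\sum _{y_{i}\neq k_{m}(x_{i})}w_{i}^{(m)}}}\right)}$

We calculate the weighted error rate of the weak classifier to be

${\displaystyle \epsilon _{m}={\frac {\sum _{y_{i}\neq k_{m}(x_{i})}w_{i}^{(m)}}{\sum _{i=1}^{N}w_{i}^{(m)}}}={\frac {\sum _{i=1}^{N}w_{i}^{(m)}I(y_{i}\neq k_{m}(x_{i}))}{\sum _{i=1}^{N}w_{i}^{(m)}}}}$

so it follows that:

${\displaystyle \alpha _{m}={\frac {1}{2}}\ln \left({\frac {1-\epsilon _{m}}{\epsilon _{m}}}\right)}$

which is the negative logit function multiplied by 0.5. Because ${\displaystyle \epsilon _{m}}$ is a ratio of weights, rescaling all weights by a common factor (for example, normalizing them to sum to 1) leaves ${\displaystyle \epsilon _{m}}$ and ${\displaystyle \alpha _{m}}$ unchanged.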

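Update weights for all $i$:

${\displaystyle w_{i}^{(m+1)}=w_{i}^{(m)}e^{-y_{i}\alpha _{m}k_{m}(x_{i})}}$

Since ${\displaystyle -y_{i}k_{m}(x_{i})=2I(y_{i}\neq k_{m}(x_{i}))-1}$, this equals ${\displaystyle w_{i}^{(m)}e^{-\alpha _{m}}e^{2\alpha _{m}I(y_{i}\neq k_{m}(x_{i}))}}$. The factor ${\displaystyle e^{-\alpha _{m}}}$ is the same for every sample and cancels when the weights are normalized, so the update is equivalent to multiplying the weights of the misclassified samples by ${\displaystyle e^{2\alpha _{m}}={\frac {1-\epsilon _{m}}{\epsilon _{m}}}}$ and leaving the others unchanged.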
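The error rate, coefficient and weight update can be put together in a minimal from-scratch sketch. It assumes labels in {-1, 1}, decision stumps as weak classifiers, and weights renormalized to sum to 1 after every round. The helper names (`fit_stump`, `adaboost_fit`, and so on) are hypothetical and are not the `Adaboostelo` class used in the cells that follow.

```python
import numpy as np

def stump_predict(X, j, thr, polarity):
    """Predict -polarity where feature j <= thr, and +polarity elsewhere."""
    return np.where(X[:, j] <= thr, -polarity, polarity)

def fit_stump(X, y, w):
    """Exhaustively search for the stump with the lowest weighted error."""
    best = (np.inf, 0, 0.0, 1)   # (error, feature, threshold, polarity)
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for polarity in (1, -1):
                pred = stump_predict(X, j, thr, polarity)
                err = np.sum(w[pred != y])
                if err < best[0]:
                    best = (err, j, thr, polarity)
    return best

def adaboost_fit(X, y, n_rounds=10):
    n = len(y)
    w = np.full(n, 1.0 / n)                  # uniform initial weights
    ensemble = []
    for _ in range(n_rounds):
        err, j, thr, polarity = fit_stump(X, y, w)
        # Weighted error rate, clipped to avoid log(0) or division by zero
        eps = np.clip(err / w.sum(), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - eps) / eps)
        pred = stump_predict(X, j, thr, polarity)
        w = w * np.exp(-alpha * y * pred)    # up-weight the mistakes
        w = w / w.sum()                      # renormalize to sum to 1
        ensemble.append((alpha, j, thr, polarity))
    return ensemble

def adaboost_predict(X, ensemble):
    score = sum(a * stump_predict(X, j, t, p) for a, j, t, p in ensemble)
    return np.where(score >= 0, 1, -1)

# Toy 1-D problem: the negatives sit in the middle
X_toy = np.arange(6, dtype=float).reshape(-1, 1)
y_toy = np.array([1, 1, -1, -1, 1, 1])
ensemble = adaboost_fit(X_toy, y_toy, n_rounds=3)
train_acc = np.mean(adaboost_predict(X_toy, ensemble) == y_toy)
print(train_acc)
```

On this toy set no single stump can do better than two mistakes out of six, yet the weighted vote of three stumps separates all six points. The exhaustive stump search is chosen for readability, not speed.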

In [90]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
# plot_partial_dependence was removed from scikit-learn; PartialDependenceDisplay.from_estimator replaces it
from sklearn.inspection import PartialDependenceDisplay, partial_dependence

from sklearn.tree import DecisionTreeClassifier

from sklearn.metrics import accuracy_score, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split, GridSearchCV

import numpy as np
import pandas as pd

from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt

%matplotlib inline

In [97]:
feature_names = pd.read_csv('feature_names.txt', header=None)[0].values
feature_names = [feature_name.replace("'", "") for feature_name in feature_names]

In [100]:
df = pd.read_csv('../data/spam_sample.csv', header=None, names=feature_names)

In [101]:
df[:2]

Unnamed: 0,word_freq_make,word_freq_address,word_freq_all,word_freq_3d,word_freq_our,word_freq_over,word_freq_remove,word_freq_internet,word_freq_order,word_freq_mail,...,char_freq_;,char_freq_(,char_freq_[,char_freq_!,char_freq_$,char_freq_#,capital_run_length_average,capital_run_length_longest,capital_run_length_total,spam
0,0.0,0.64,0.64,0.0,0.32,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.778,0.0,0.0,3.756,61,278,1
1,0.21,0.28,0.5,0.0,0.14,0.28,0.21,0.07,0.0,0.94,...,0.0,0.132,0.0,0.372,0.18,0.048,5.114,101,1028,1


In [102]:
y = df.pop('spam').values
feature_names = df.columns
X = df.values

In [103]:
X.shape

(4601, 57)

In [104]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2)

In [120]:
from adaboost_26 import Adaboostelo

In [121]:
aboost = Adaboostelo(n_classifiers=50)
aboost.fit(X_train, y_train)
aboost.score(X_test, y_test)
accuracy_score(y_test, aboost.predict(X_test))

0.93485342019543971
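For a reference point, the same fit-and-score pattern can be run with scikit-learn's own `AdaBoostClassifier`. Since `spam_sample.csv` and the `adaboost_26` module are not reproduced here, this sketch assumes synthetic data from `make_classification`, so its accuracy is not comparable to the figure above. The names `X_demo`, `y_demo` and `sk_boost` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the spam data (assumption: not the real dataset)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=20, n_informative=5,
    class_sep=2.0, random_state=0,
)
Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# 50 boosting rounds, matching n_classifiers=50 in the cell above
sk_boost = AdaBoostClassifier(n_estimators=50, random_state=0)
sk_boost.fit(Xtr, ytr)
sk_acc = accuracy_score(yte, sk_boost.predict(Xte))
print(sk_acc)
```

Fixing `random_state` in both the split and the classifier makes the run repeatable, which the unseeded `train_test_split` call above does not.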