### AdaBoost Algorithm

Given: $(x_1, y_1),\ldots,(x_m,y_m)$ where $x_i \in X$, $y_i \in Y = \{-1, +1\}$<br>
Initialize $D_1(i) = 1/m$.<br>
For $t = 1,\ldots,T$:

* Train weak learner using distribution $D_t$
* Get weak hypothesis $h_t: X \rightarrow \{-1, +1\}$ with error
$$e_t = \Pr_{i\sim D_t}[h_t(x_i)\neq y_i]$$
* Choose $\alpha_t = \frac{1}{2}\ln\left(\frac{1-e_t}{e_t}\right)$
* Update:
$$\begin{align}D_{t+1}(i) &= \frac{D_t(i)}{Z_t} \times
\begin{cases}
e^{-\alpha_t} & \text{if } h_t(x_i) = y_i\\
e^{\alpha_t} & \text{if } h_t(x_i) \neq y_i
\end{cases}\\
&= \frac{D_t(i)\exp(-\alpha_t y_i h_t(x_i))}{Z_t}
\end{align}$$
where $Z_t$ is a normalization factor (chosen so that $D_{t+1}$ will be a distribution).
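
As a quick sanity check of the update rule, the following sketch applies one round to a hypothetical five-example toy set (the labels and predictions are made up for illustration and are not taken from the iris data). With $\alpha_t$ chosen as above, the misclassified examples should end up carrying half of the total weight after normalization.

```python
import math

# Hypothetical toy data: true labels and one weak learner's predictions.
y = [1, 1, -1, -1, 1]
h = [1, -1, -1, -1, 1]          # the second example is misclassified

m = len(y)
D = [1 / m] * m                 # D_1(i) = 1/m

# Weighted error e_t: total weight of the misclassified examples.
e = sum(d for d, yi, hi in zip(D, y, h) if yi != hi)

# alpha_t = 1/2 * ln((1 - e_t) / e_t)
alpha = 0.5 * math.log((1 - e) / e)

# Unnormalized update D_t(i) * exp(-alpha_t * y_i * h_t(x_i)), then divide by Z_t.
unnormalized = [d * math.exp(-alpha * yi * hi) for d, yi, hi in zip(D, y, h)]
Z = sum(unnormalized)
D_next = [u / Z for u in unnormalized]

print("e =", e, "alpha =", alpha, "Z =", Z)
print("D_next =", D_next)
```

The weight of the single misclassified example rises from 0.2 to 0.5, while each correctly classified example drops to 0.125, so the next weak learner is pushed toward the example the current one got wrong.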

Output the final hypothesis:
$$
H(x) = \operatorname{sign}\left(\sum^T_{t=1}\alpha_t h_t(x)\right)
$$
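
The final hypothesis is a weighted majority vote. A minimal sketch for a single example, using hypothetical learner weights and votes (the function name `final_hypothesis` is illustrative, and mapping a tie to +1 is an assumption; `np.sign` would return 0 in that case):

```python
def final_hypothesis(alphas, predictions):
    """Weighted majority vote: sign of sum_t alpha_t * h_t(x) for one example."""
    score = sum(a * h for a, h in zip(alphas, predictions))
    # Assumption: a tie (score == 0) is mapped to +1.
    return 1 if score >= 0 else -1

# Hypothetical learner weights and votes for a single example.
alphas = [1.2, 0.7, 0.4]
print(final_hypothesis(alphas, [-1, 1, 1]))    # the first learner outvotes the other two
print(final_hypothesis(alphas, [1, -1, -1]))
```

Because the first learner's weight exceeds the combined weight of the other two, its vote decides the outcome in both calls.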

In [11]:
import pandas as pd
import matplotlib.pyplot as plt
import random
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, classification_report
import numpy as np

%matplotlib inline

In [5]:
df = pd.read_csv('../Data/iris.data', names = ['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth', 'Species'])
df['Species'] = df['Species'].apply(lambda x: x[5:])
df.head()

Unnamed: 0,SepalLength,SepalWidth,PetalLength,PetalWidth,Species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


Keep only the versicolor and virginica classes and encode their labels as $\{+1, -1\}$

In [6]:
# Copy the filtered frame so that adding columns does not trigger SettingWithCopyWarning
df = df[(df['Species']=='versicolor')|(df['Species']=='virginica')].copy()
df['Label'] = df['Species'].map({'versicolor': 1, 'virginica': -1})
df = df.drop('Species', axis = 1)
df.head()

Unnamed: 0,SepalLength,SepalWidth,PetalLength,PetalWidth,Label
50,7.0,3.2,4.7,1.4,1
51,6.4,3.2,4.5,1.5,1
52,6.9,3.1,4.9,1.5,1
53,5.5,2.3,4.0,1.3,1
54,6.5,2.8,4.6,1.5,1


Initialize the weights for the first weak learner: every example starts with $D_1(i) = 1/m$

In [8]:
df['D_1'] = 1/len(df)
df.head()

Unnamed: 0,SepalLength,SepalWidth,PetalLength,PetalWidth,Label,D_1
50,7.0,3.2,4.7,1.4,1,0.01
51,6.4,3.2,4.5,1.5,1,0.01
52,6.9,3.1,4.9,1.5,1,0.01
53,5.5,2.3,4.0,1.3,1,0.01
54,6.5,2.8,4.6,1.5,1,0.01


Loop over $T$ rounds, where $T$ is the number of weak learners (estimators). A depth-1 `DecisionTreeClassifier` (a decision stump) is used as the weak learner.

In [55]:
T = 4   # number of boosting rounds
alpha = []
for t in range(1, T+1):
    # Resample the training set (with replacement) according to the current distribution D_t
    sample = df.sample(len(df), replace=True, weights=df['D_'+str(t)])
    X_train = sample.iloc[:, 0:4]
    y_train = sample.iloc[:, 4]

    # Weak learner: a decision stump (tree of depth 1)
    stump = DecisionTreeClassifier(criterion='gini',
                                   random_state=100,
                                   max_depth=1)
    stump.fit(X_train, y_train)

    # Predict on the full data set and flag misclassified rows (1 = wrong, 0 = right)
    df['pred_'+str(t)] = stump.predict(df.iloc[:, 0:4])
    df['e_'+str(t)] = (df['Label'] != df['pred_'+str(t)]).astype(float)

    # Weighted error e_t under D_t
    e = (df['e_'+str(t)]*df['D_'+str(t)]).sum()

    # Learner weight alpha_t
    a = 0.5*np.log((1-e)/e)
    alpha.append(a)

    # Reweight the examples and normalize by Z_t to obtain D_{t+1}
    update = df['D_'+str(t)]*np.exp(-a*df['Label']*df['pred_'+str(t)])
    z = update.sum()
    df['D_'+str(t+1)] = update/z
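
One edge case is worth noting: if a stump happens to classify every example correctly, the weighted error is 0, the expression `0.5*np.log((1-e)/e)` divides by zero, and the next distribution is no longer well defined. A common workaround, sketched here as an assumption rather than as part of the notebook, is to clip the error away from 0 and 1 before taking the logarithm. The names `safe_alpha` and `eps` are hypothetical.

```python
import math

def safe_alpha(e, eps=1e-10):
    """alpha = 1/2 * ln((1 - e) / e), with e clipped to [eps, 1 - eps]."""
    e = min(max(e, eps), 1 - eps)
    return 0.5 * math.log((1 - e) / e)

print(safe_alpha(0.07))   # an ordinary weak learner
print(safe_alpha(0.5))    # no better than chance: alpha is 0
print(safe_alpha(0.0))    # perfect learner: large but finite alpha
```

For errors strictly between 0 and 1 the clipping has no effect, so the function agrees with the formula used in the loop.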

In [56]:
df

Unnamed: 0,SepalLength,SepalWidth,PetalLength,PetalWidth,Label,D_1,pred_1,e_1,D_2,pred_2,e_2,D_3,pred_3,e_3,D_4,pred_4,e_4,D_5
50,7.0,3.2,4.7,1.4,1,0.01,1,0.0,0.005376,1,0.0,0.002981,1,0.0,0.001865,1,0.0,0.001219
51,6.4,3.2,4.5,1.5,1,0.01,1,0.0,0.005376,1,0.0,0.002981,1,0.0,0.001865,1,0.0,0.001219
52,6.9,3.1,4.9,1.5,1,0.01,-1,1.0,0.071429,1,0.0,0.039608,1,0.0,0.024776,1,0.0,0.016198
53,5.5,2.3,4.0,1.3,1,0.01,1,0.0,0.005376,1,0.0,0.002981,1,0.0,0.001865,1,0.0,0.001219
54,6.5,2.8,4.6,1.5,1,0.01,1,0.0,0.005376,1,0.0,0.002981,1,0.0,0.001865,1,0.0,0.001219
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,-1,0.01,-1,0.0,0.005376,-1,0.0,0.002981,-1,0.0,0.001865,-1,0.0,0.001219
146,6.3,2.5,5.0,1.9,-1,0.01,-1,0.0,0.005376,-1,0.0,0.002981,1,1.0,0.007428,-1,0.0,0.004856
147,6.5,3.0,5.2,2.0,-1,0.01,-1,0.0,0.005376,-1,0.0,0.002981,-1,0.0,0.001865,-1,0.0,0.001219
148,6.2,3.4,5.4,2.3,-1,0.01,-1,0.0,0.005376,-1,0.0,0.002981,-1,0.0,0.001865,-1,0.0,0.001219


In [57]:
alpha = np.array(alpha)

In [58]:
alpha

array([1.29334467, 1.10807087, 0.69101627, 0.5895109 ])
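
Each learner weight encodes the weighted error of its round, since $\alpha_t = \frac{1}{2}\ln\left(\frac{1-e_t}{e_t}\right)$ can be inverted to $e_t = 1/(1 + e^{2\alpha_t})$. The sketch below applies this inversion to the `alpha` values printed above (hard-coded here so the snippet runs on its own).

```python
import math

# Learner weights copied from the output above.
alphas = [1.29334467, 1.10807087, 0.69101627, 0.5895109]

# Invert alpha = 1/2 * ln((1 - e) / e)  =>  e = 1 / (1 + exp(2 * alpha))
errors = [1 / (1 + math.exp(2 * a)) for a in alphas]

for t, (a, e) in enumerate(zip(alphas, errors), start=1):
    print(f"round {t}: alpha = {a:.4f}, weighted error = {e:.4f}")
```

The first round comes out at a weighted error of about 0.07, which is consistent with the `D_2` column: a misclassified row receives weight $D_1(i)/(2e_1) = 0.01/0.14 \approx 0.071429$. Later rounds have smaller $\alpha_t$ and therefore larger weighted errors, as the reweighting concentrates on harder examples.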

In [64]:
predictions = df[['pred_'+str(i) for i in range(1, T+1)]].values

In [60]:
H = np.sign(np.sum(alpha*predictions, axis=1))

In [61]:
c = confusion_matrix(df['Label'], H)
c

array([[46,  4],
       [ 2, 48]], dtype=int64)

In [63]:
print(classification_report(df['Label'], H))

              precision    recall  f1-score   support

          -1       0.96      0.92      0.94        50
           1       0.92      0.96      0.94        50

    accuracy                           0.94       100
   macro avg       0.94      0.94      0.94       100
weighted avg       0.94      0.94      0.94       100
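
The figures in the classification report can be reproduced by hand from the confusion matrix. The sketch below hard-codes the matrix printed above and relies on scikit-learn's convention that rows are true labels and columns are predicted labels, both sorted as (-1, +1). The `metrics` dictionary is an illustrative name.

```python
# Confusion matrix from the output above: rows = true labels, columns = predictions,
# both ordered as (-1, +1).
c = [[46, 4],
     [2, 48]]
labels = [-1, 1]

total = sum(sum(row) for row in c)
accuracy = (c[0][0] + c[1][1]) / total

metrics = {}
for k, label in enumerate(labels):
    predicted_k = c[0][k] + c[1][k]   # column sum: everything predicted as this class
    actual_k = sum(c[k])              # row sum: everything that truly is this class
    precision = c[k][k] / predicted_k
    recall = c[k][k] / actual_k
    f1 = 2 * precision * recall / (precision + recall)
    metrics[label] = {"precision": precision, "recall": recall, "f1": f1}

print(f"accuracy = {accuracy:.2f}")
for label, m in metrics.items():
    print(f"{label:>2}: precision = {m['precision']:.2f}, "
          f"recall = {m['recall']:.2f}, f1 = {m['f1']:.2f}")
```

Note that these scores are computed on the same 100 examples used for training, so they describe the fit to the training data rather than generalization to unseen data.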

