# AdaBoost

Let us implement the AdaBoost algorithm to build a powerful ensemble classifier from a set of weaker classifiers. Our base classifier will be a decision stump.

The training algorithm we will implement is as follows. We have $N$ training datapoints and are creating an ensemble of $k$ classifiers.

- Initialize the weights for all datapoints ($w_j = 1/N$ for $j = 1, 2, \dots, N$)
- For $i = 1$ to $k$
    - Form training set $D_i$ by sampling $N$ tuples (with replacement) from the full training dataset. The sampling probability for a tuple $(x_j,y_j)$ should be given by its corresponding weight $w_j$.
    - Use dataset $D_i$ to fit a decision stump $M_i$. You can use sklearn's DecisionTreeClassifier with max_depth=1 to fit a decision stump.
    - Calculate the error rate for $M_i$ using the sum of the weights of the misclassified points.
    $$err(M_i) = \sum_{j=1}^N w_j * \mathbb{1}\{y_j \ne M_i(x_j)\}$$
    - The weight of classifier $M_i$'s vote is computed as $\alpha_i = 0.5*\log(\frac{1-err(M_i)}{err(M_i)})$
    - Increase the weight of the misclassified training points, and decrease the weight of the correctly classified training points.
    $$w_j \leftarrow w_j * \exp\{- \alpha_i * y_j * M_i(x_j)\}$$
    - Remember to normalize the weights so that they sum to 1.
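To make the update rule concrete, here is a minimal sketch of a single boosting round worked out on toy values. The labels and stump predictions are hypothetical and chosen only so that the arithmetic is easy to follow; labels are assumed to be in $\{-1, +1\}$.

```python
import numpy as np

# Hypothetical toy values: true labels and one stump's predictions in {-1, +1}
y = np.array([1, 1, -1, -1, 1])
pred = np.array([1, -1, -1, -1, 1])    # only the second point is misclassified

w = np.ones(len(y)) / len(y)           # uniform initial weights, 1/N each

err = np.sum(w * (y != pred))          # weighted error: 0.2
alpha = 0.5 * np.log((1 - err) / err)  # vote weight: 0.5 * log(4) = log(2)

w = w * np.exp(-alpha * y * pred)      # misclassified point is scaled by 2, the rest by 0.5
w = w / w.sum()                        # renormalize so the weights sum to 1

print(err, alpha)
print(w)                               # [0.125 0.5 0.125 0.125 0.125]
```

After the update, the single misclassified point carries half of the total weight, so the next stump is pushed to classify it correctly.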

In [None]:
import numpy as np
import matplotlib.pyplot as plt

In [None]:
# Import the libraries / functions that you use in your solution
from sklearn.tree import DecisionTreeClassifier

def train_AdaBoost(X, y, k):
    
    classifiers = []
    alphas = []
    
    ### BEGIN SOLUTION
    
    # Accept DataFrames or arrays; labels are expected to be in {-1, +1}
    X = np.asarray(X)
    y = np.asarray(y).ravel()
    N = X.shape[0]
    
    # Initialize weights
    W = np.ones(N) / N
    
    for i in range(k):
        
        # Sample from the dataset according to weights
        sample = np.random.choice(N, size=N, replace=True, p=W)
        X_sample = X[sample]
        y_sample = y[sample]
        
        # Fit a decision stump
        classifier = DecisionTreeClassifier(max_depth=1)
        classifier.fit(X_sample, y_sample)
        
        # Calculate the weighted error rate over the full training set
        y_pred = classifier.predict(X)
        error = np.sum(W * (y != y_pred))
        
        # Guard: keep the error away from 0 and 1 so that the log stays finite
        error = np.clip(error, 1e-10, 1 - 1e-10)
        
        # Calculate the weight of classifier's vote
        alpha = 0.5 * np.log((1 - error) / error)
        
        # Increase the weight of misclassified points, decrease the rest
        W = W * np.exp(-alpha * y * y_pred)

        # Normalise Weights to sum to 1
        W = W / np.sum(W)
        
        # Append your classifier to the list classifiers
        classifiers.append(classifier)
        
        # Append your alpha to the list alphas
        alphas.append(alpha)
        
    ### END SOLUTION
    
    # classifiers and alphas need to be of type <class 'list'>
    return classifiers, alphas
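The resampling step is what lets an unweighted learner respond to the weights: points with a large $w_j$ simply show up more often in $D_i$. A small sketch of that effect, using a hypothetical weight vector and a seeded generator (both are assumptions made for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)                     # seeded only for repeatability

w = np.array([0.125, 0.5, 0.125, 0.125, 0.125])    # hypothetical weights, sum to 1
n_draws = 10_000

# Draw indices with replacement, with probability proportional to the weights
idx = rng.choice(len(w), size=n_draws, replace=True, p=w)

# Empirical frequency of each index in the resampled set
freq = np.bincount(idx, minlength=len(w)) / n_draws
print(freq)                                        # close to w; index 1 dominates
```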

To obtain predictions, the vote of each classifier $M_i$ is weighted by its corresponding coefficient $\alpha_i$.

$$\hat{y}(x) = \text{sign}\left\{\sum_{i=1}^k \alpha_i * M_i(x)\right\}$$
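As a small illustration of the weighted vote, the sketch below combines three stumps on four points. The stump outputs and vote weights are hypothetical; ties (a score of exactly 0) are sent to $-1$ here, which is one possible convention.

```python
import numpy as np

# Hypothetical outputs of k = 3 stumps on 4 points (rows: stumps, columns: points)
stump_preds = np.array([
    [ 1,  1, -1, -1],
    [ 1, -1, -1,  1],
    [-1,  1, -1,  1],
])
alphas = np.array([0.9, 0.4, 0.3])     # hypothetical vote weights

scores = alphas @ stump_preds          # weighted sum of votes for each point
y_hat = np.where(scores > 0, 1, -1)    # sign of the score, ties go to -1

print(scores)
print(y_hat)                           # [ 1  1 -1 -1]
```

On the last point, two of the three stumps vote $+1$, but the first stump's larger weight outvotes them, so the ensemble predicts $-1$.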

In [None]:
def predict_AdaBoost(X,classifiers, alphas):
    
    ### BEGIN SOLUTION

    X = np.asarray(X)
    N = X.shape[0]
    # 1-D accumulator, since classifier.predict returns an array of shape (N,)
    y_pred = np.zeros(N)

    for i in range(len(classifiers)):
        y_pred += alphas[i] * classifiers[i].predict(X)

    y_pred = np.sign(y_pred)
    y_pred[y_pred == 0] = -1
    
    ### END SOLUTION
    
    # y_pred needs to be of type <class 'numpy.ndarray'>
    return y_pred

The function below will help you plot the decision surface given by the algorithm.

In [None]:
def plot_AdaBoost(X, y, classifiers, alphas):
    
    # Accept DataFrames or arrays
    X = np.asarray(X)
    y = np.asarray(y).ravel()
    
    # Get limits of x and y for plotting the decision surface
    x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
    y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
    
    # Get points at a distance of h between the above limits 
    h = .02    
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    temp = np.c_[xx.ravel(), yy.ravel()]
    
    # Classify all the points on the grid
    P = predict_AdaBoost(temp, classifiers, alphas).reshape(yy.shape)
    
    # Plot the decision boundary and margin
    plt.pcolormesh(xx, yy, P, cmap=plt.cm.coolwarm, shading='auto')
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.coolwarm,edgecolor='k')
    plt.show()

Load the given datasets.

In [None]:
# Imports
import pandas as pd
import numpy as np

# Dataset Functions
def ReadDataset_NoHeaders(path):
    dataset = pd.read_csv(path, header=None)
    return dataset

def SaveDataset(dataset, path):
    dataset.to_csv(path, index=False)

In [None]:
# Paths (DRIVE_PATH is not set in this notebook and must already be defined)
DATASET_PATH = DRIVE_PATH + "MyDrive/PRML Assignments/Assignment 2/Datasets/Q2/"
# DATASET_PATH = DRIVE_PATH + "PRML/IITM/Question1"
train_path_X = DATASET_PATH + "X_train.csv"
train_path_Y = DATASET_PATH + "y_train.csv"
test_path_X = DATASET_PATH + "X_test.csv"
test_path_Y = DATASET_PATH + "y_test.csv"

In [None]:
# Load
Dataset_train_X = ReadDataset_NoHeaders(train_path_X)
Dataset_train_Y = ReadDataset_NoHeaders(train_path_Y)
Dataset_test_X = ReadDataset_NoHeaders(test_path_X)
Dataset_test_Y = ReadDataset_NoHeaders(test_path_Y)

In [None]:
print("Train Dataset X:", Dataset_train_X.shape)
print("Train Dataset Y:", Dataset_train_Y.shape)
print("Test Dataset X:", Dataset_test_X.shape)
print("Test Dataset Y:", Dataset_test_Y.shape)

Plot the training data as a scatter plot.

In [None]:
# Imports
import matplotlib.pyplot as plt

# Plot Functions
def PlotDataset(dataset, title=''):
    X = dataset.iloc[:, 0]
    Y = dataset.iloc[:, 1]
    plt.scatter(X, Y)
    plt.title(title)
    plt.show()

def PlotLabelledDataset(dataset, title=''):
    X = dataset[:, 0]
    Y = dataset[:, 1]
    C = dataset[:, 2]
    classes_unique = np.unique(C)
    for c in classes_unique:
        x = X[C == c]
        y = Y[C == c]
        plt.scatter(x, y, label=c)
    plt.title(title)
    plt.legend()
    plt.show()

In [None]:
# Only Points Scatter Plot
print("Train Datapoints")
PlotDataset(Dataset_train_X, "Train Datapoints")

# Scatter Plot with classes
Dataset_train = np.zeros((Dataset_train_X.shape[0], 3))
Dataset_train[:, :2] = Dataset_train_X
Dataset_train[:, 2] = Dataset_train_Y.iloc[:, 0]
print("Train Datapoints with Classes")
PlotLabelledDataset(Dataset_train, "Train Datapoints with Classes")

Use the train_AdaBoost function to train an AdaBoost model with k=5.

In [None]:
# Pass NumPy arrays: a 2-D feature matrix and a 1-D label vector
classifiers, alphas = train_AdaBoost(Dataset_train_X.to_numpy(), Dataset_train_Y.iloc[:, 0].to_numpy(), k=5)

Use the predict_AdaBoost function to make predictions on X_test.

In [None]:
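# Predict on the test set, as stated above
y_pred = predict_AdaBoost(Dataset_test_X.to_numpy(), classifiers, alphas)

Use the plot_AdaBoost function to plot the learnt decision surface.

In [None]:
# Show the decision surface together with the training points and their true labels
plot_AdaBoost(Dataset_train_X.to_numpy(), Dataset_train_Y.iloc[:, 0].to_numpy(), classifiers, alphas)

Compute the accuracy of the predictions on the test set.

In [None]:
# Flatten the predictions so that the comparison with the test labels is element-wise
accuracy = np.mean(np.ravel(y_pred) == Dataset_test_Y.iloc[:, 0].to_numpy())
print("Test accuracy:", accuracy)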
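One shape pitfall is worth checking when computing accuracy: if the predictions have shape `(N, 1)` and the labels have shape `(N,)`, NumPy broadcasting compares every prediction with every label instead of comparing them pairwise. A minimal sketch with hypothetical values:

```python
import numpy as np

y_true = np.array([1, -1, 1, -1])              # hypothetical labels, shape (4,)
y_col = np.array([[1], [-1], [-1], [-1]])      # hypothetical predictions, shape (4, 1)

pairwise = (y_col == y_true)                   # broadcasts to shape (4, 4)
print(pairwise.shape)

# Flatten the predictions first so the comparison is element-wise
accuracy = np.mean(y_col.ravel() == y_true)
print(accuracy)                                # 0.75
```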

Use the train_AdaBoost function to train an AdaBoost model with k=100.

Use the predict_AdaBoost function to make predictions on X_test.

Use the plot_AdaBoost function to plot the learnt decision surface.

Compute the accuracy of the predictions on the test set.
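
The k=5 versus k=100 comparison asked for above can be sketched end to end as follows. This is a self-contained illustration under stated assumptions, not the assignment's solution: it uses synthetic `make_moons` data in place of the CSV files, and the helper names `fit_boosted_stumps` and `vote` are hypothetical compact restatements of the training and prediction steps described earlier.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def fit_boosted_stumps(X, y, k, rng):
    """Compact restatement of the training loop; y must be in {-1, +1}."""
    n = X.shape[0]
    w = np.ones(n) / n
    stumps, alphas = [], []
    for _ in range(k):
        idx = rng.choice(n, size=n, replace=True, p=w)   # weighted resample
        stump = DecisionTreeClassifier(max_depth=1, random_state=0)
        stump.fit(X[idx], y[idx])
        pred = stump.predict(X)
        # Weighted error on the full training set, clipped so the log is finite
        err = np.clip(np.sum(w * (pred != y)), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)
        w = w * np.exp(-alpha * y * pred)
        w = w / w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas


def vote(X, stumps, alphas):
    """Weighted vote of the stumps; ties are sent to -1."""
    scores = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
    return np.where(scores > 0, 1, -1)


# Synthetic stand-in data (an assumption, not the assignment's CSV files)
X, y01 = make_moons(n_samples=400, noise=0.25, random_state=0)
y = 2 * y01 - 1                                          # map {0, 1} to {-1, +1}
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

results = {}
for k in (5, 100):
    stumps, alphas = fit_boosted_stumps(X_tr, y_tr, k, np.random.default_rng(0))
    results[k] = np.mean(vote(X_te, stumps, alphas) == y_te)
    print(f"k={k}: test accuracy = {results[k]:.3f}")
```

With the notebook's own functions, the same comparison amounts to calling `train_AdaBoost` with `k=100` and repeating the prediction, plotting, and accuracy cells.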