# Machine Learning SoSe21 Practice Class

Dr. Timo Baumann, Dr. Özge Alaçam, Björn Sygo <br>
Email: baumann@informatik.uni-hamburg.de, alacam@informatik.uni-hamburg.de, 6sygo@informatik.uni-hamburg.de


## Exercise 5
**Description:** Implement adaboost <br>
**Deadline:** Saturday, 29. Mai 2021, 23:59 <br>
**Working together:** You can work in pairs or triples but no larger teams are allowed. <br>
&emsp;&emsp;&emsp; &emsp; &emsp; &emsp; &emsp; Please adhere to the honor code discussed in class. <br>
&emsp;&emsp;&emsp; &emsp; &emsp; &emsp; &emsp; All members of the team must get involved in understanding and coding the solution.

## Submission: 
**Christoph Brauer, Linus Geewe, Moritz Lahann**

*Also put high-level comments that should be read before looking at your code and results.*

## Goal
The goal of this exercise is to implement Boosting based on a very simple base classifier.

You implementation should be sufficiently generic to <strong>handle an arbitrary number of dimensions</strong>.

### Choose your data

**Task 1** (10%): Choose and load your data.

For this exercise, you can choose between multiple datasets.
Choose either `dataCircle.txt`, which contains 2-dimensional points with the corresponding class label. The first 40 rows contain positive examples with the label 1 and the other 62 rows have negative examples with the label -1.
Alternatively, you may find it more interesting to use the features you extracted in the previous tasks from faces (you may use your own features, or use the feature computations from the sample solution). 
It will be more interesting to use the full set of faces, not just the small subset. 
(However, you may want to limit visualizations to 2 dimensions.)

Your implementation of the classifier shall not be limited to a fixed number of feature dimensions (e.g., 2 as in `dataCircle.txt`, or 6 as in your previous work on face detection) but should work with any number of features. However, there's no need to implement the data loading / feature computation for both data sets.

Set aside some randomly selected data as evaluation set. Alternatively you may want to re-use k-fold crossvalidation as implemented before (or from a sample solution).

In [49]:
import numpy as np
import math
import random

data = np.loadtxt('dataCircle.txt')
print(data.shape)

(102, 3)


#### Implement a simple classifier, implement weighted sample evaluation, and ensure that classifiers are (at least) "weak"

**Task 2** (30%):

Start by implementing a function that returns a simple <strong>decision stump</strong> classifier for training data and given sample weights (many of these weak classifiers will later be combined).
Standard AdaBoost searches for the best classifier in each step, by evaluating all possible classifiers in their performance on the weighted data and then choosing the one with the lowest weighted
error. The classifiers that are considered are the boundaries between two points that change classes.

In this task, you may also use a simple <strong>random decision boundary classifier </strong> that choses a random decision boundary in a randomly selected dimension of your data. (Full credit only if you implement the full AdaBoost approach to classifier selection.)

Implement an evaluation function that tests the quality of a classifier on a set of data using <strong>weighted accuracy</strong>.

A classifier with <50% accuracy is not a <strong>weak</strong> classifier. Remind yourself how you can build a weak binary classifier based on one with accuracy below 50%.

In [50]:
class DecisionStump():
    def __init__(self, data, weights):
        self.train(data, weights)
        self.threshold = 0
        self.feature_index = 0

    def train(self, data, weights):
        # choose a random feature from our data (data is in the shape of [Feature Feature ... Class])
        self.feature_index = random.randint(0, data.shape[1] - 2)

        # set a random decision boundary (within our data's possible range)
        print(np.min(data))
        print(np.max(data))
        self.threshold = random.uniform(np.min(data), np.max(data))
        print(self.threshold)
        

    def predict(self, sample):
        return 1 if sample[self.feature_index] > self.threshold else -1

def eval_classifier(data, weights, classifier):
    true_classes = data[:, -1]
    predicted_classes = [classifier.predict(sample) for sample in data]
    accuracy = 0
    for index in range(len(weights)):
        if true_classes[index] == predicted_classes[index]:
            accuracy += weights[index]

    # maybe:
    # np.sum(np.equal(true_classes, predicted_classes) * weights)
    return accuracy




#### AdaBoost

**Task 3** (30%): Use the previous function that creates weak classifiers to implement AdaBoost: 

Initialize weights, select weak classifier, compute alpha, reweigh samples, iterate.

If you use random decision boundaries for your weak classifiers, the classifier added in each iteration isn't optimal and you may need a high number of iterations until your algorithm performs well.

In [51]:
class AdaBoost():
    def __init__(self):
        self.classifiers = []
        self.classifier_weights = []
        self.weights = []


    def train(self, data, epochs):
        self.weights = [1 / len(data)] * len(data)
        for ep in range(epochs):
            classifier = DecisionStump(data, self.weights)
            acc = eval_classifier(data, self.weights, classifier)
            print(acc)

            error = self.error_candidate(data, classifier)
            alpha = self.alpha(error)

            self.classifier_weights.append(alpha)
            self.classifiers.append(classifier)

            best_error, best_classifier = self.select_classifier(data)
            alpha = self.alpha(best_error)

            self.weights = self.new_weights(alpha)


    def select_classifier(self, data):
        errors = [self.error_candidate(data, classifier) for classifier in self.classifiers]
        min_error = min(errors)
        min_classifier_index = errors.index(min_error)
        return min_error, self.classifiers[min_classifier_index]
            
    
    def error_candidate(self, data, classifier):
        sum = 0
        for sample, weight in zip(data, self.weights):
            predicted = classifier.predict(sample)
            if predicted != sample[2]:
                sum += weight
        return sum


    def alpha(self, error):
        return 0.5 * math.log((1 - error) / error)


    def new_weights(self, alpha):
        new_weights = np.array([weight * math.exp(alpha) for weight in self.weights])
        return new_weights / np.sum(new_weights)


    def predict(self, data):
        y_pred = []
        for sample in data:
            predictions = np.array([classifier.predict(sample) for classifier in self.classifiers])
            y_pred.append(math.copysign(1.0, sum(predictions * np.array(self.classifier_weights))))
        return y_pred


In [52]:
def accuracy(true, pred):
    sum = 0
    for index in range(len(true)):
        sum += true[index] == pred[index]
    return sum / len(true)

data = np.loadtxt('dataCircle.txt')
np.random.shuffle(data)
split_index = round(data.shape[0] * 0.8)
train = data[:split_index]
test = data[split_index:]

ensemble = AdaBoost()
ensemble.train(train, 20)
pred = ensemble.predict(test)
acc = accuracy(test[:, -1], pred)
print(acc)



-9.97164
9.99208
5.37714107980991
0.5000000000000002
-9.97164
9.99208
1.876606500703824
0.5000000000000002
-9.97164
9.99208
-7.245778852575333
0.5000000000000002
-9.97164
9.99208
7.292822839808874
0.5000000000000002
-9.97164
9.99208
5.30836563636184
0.5000000000000002
-9.97164
9.99208
9.007733089476023
0.5000000000000002
-9.97164
9.99208
5.470052319303763
0.5000000000000002
-9.97164
9.99208
9.295216874404446
0.5000000000000002
-9.97164
9.99208
4.6157431917861675
0.5000000000000002
-9.97164
9.99208
6.8679004695911985
0.5000000000000002
-9.97164
9.99208
-3.0240634714595176
0.5000000000000002
-9.97164
9.99208
4.27380850084201
0.5000000000000002
-9.97164
9.99208
2.083130159077209
0.5000000000000002
-9.97164
9.99208
7.108685160516984
0.5000000000000002
-9.97164
9.99208
0.07890278000513362
0.5000000000000002
-9.97164
9.99208
0.03991474204770995
0.5000000000000002
-9.97164
9.99208
-1.465585681431115
0.5000000000000002
-9.97164
9.99208
-4.512479791701816
0.5000000000000002
-9.97164
9.99208
4.3

### Evaluate your classifier and plot its inner workings

**Task 4** (15%):

Evaluate your training set error, as well as evaluation set errors over the iterations. For this it may be convenient if your final weighted ensemble classifier that is trained via AdaBoost can be restricted to use only the first _m_ classifiers (and alphas) afterwards.

#### Plot decision boundary/areas

**Task 5** (15%):

(If you use face data: pick two relevant feature dimensions only for this subtask and use only a small subset of the training data if runtime becomes an issue.)

Plot the decisions taken by your classifier in one of the following ways (or both): 
 * plot the first, second and third decision boundaries chosen by AdaBoost in a succession of plots. Also, plot the training samples in a size that is proportional to their weight after the first, second and third decision those same plots. Explain how the weight changes influence the next iteration's behaviour.
 * decide on the resolution of your image matrix (e.g., use a resolution of 100 samples over the $x_1$ and the $x_2$ range of your data), record the decisions of your classifier for all $x_1$/$x_2$ coordinates and color the image's pixels according to the decision. Plot the image. Add the training data as colored points to the plot as well. You may consult the corresponding code in the sample solution for Softmax classification.

### Report Submission

Prepare a report of your solution as a commented Jupyter notebook (using markdown for your results and comments); include figures and results.
If you must, you can also upload a PDF document with the report annexed with your Python code.

Upload your report file to the Machine Learning Moodle Course page. Please make sure that your submission team corresponds to the team's Moodle group that you're in.