# Homework 8
For this homework, the library _libsvm_ is needed. The easiest way to install this on _Mac OS X_ is using [_Homebrew_](http://brew.sh/) with the following command: `brew install libsvm`.

The following questions of this homework are answered and their solutions explained: questions 7, 8, 9 and 10.

## Polynomial kernel
In the first 2 questions, we need to experiment with 10-fold cross-validation using a polynomial kernel. We will first explain the workings of support vector machines, after which we will look at the use of a polynomial kernel. Afterwards, we see why cross-validation is useful and how it works.

The purpose of support vector machines is to find a separating hyperplane whose margin to the data is as large as possible.
As we saw in the lectures, we first need the distance from the separating hyperplane $w^T x + b = 0$ to its nearest data point, in which $w$ are the weights, $b$ is the bias and $x$ is a data point. After normalizing $w$ and $b$ so that $|w^T x_n + b| = 1$ for the nearest point, this distance was found to be $\frac{1}{\|w\|}$.
This is the value that needs to be maximized in order to get a margin as big as possible, which is equivalent to minimizing $\frac{1}{2}w^T w$. The minimization is subject to $y_n(w^T x_n + b) \geq 1$ for $n=1,2,\dots,N$ (in which $y_n$ is the target value of $x_n$), because the nearest point was normalized to give exactly 1 and every point must be on the correct side.
Using quadratic programming on the dual problem, we eventually obtain the Lagrange multipliers $\alpha=\alpha_1, \alpha_2, \dots, \alpha_N$. When $\alpha_n > 0$, we say that the data point $x_n$ is a support vector. The hyperplane that separates the points is completely determined by those support vectors and must lie between the support vectors of different target values.
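
To make the normalization concrete, here is a minimal sketch that checks the constraint and computes the margin for one candidate hyperplane. The four data points and the values of `w` and `b` are hand-picked for illustration and are not taken from the homework data.

```python
import numpy as np

# Hypothetical toy data: two points per class in 2D.
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.0]])
y = np.array([1, 1, -1, -1])

# Candidate hyperplane w^T x + b = 0, scaled so that the nearest
# points satisfy y_n (w^T x_n + b) = 1.
w = np.array([0.5, 0.5])
b = -1.0

# Left-hand side of the constraint y_n (w^T x_n + b) >= 1 for every point.
constraint_values = y * (X @ w + b)

# With this normalization the margin equals 1 / ||w||.
margin = 1.0 / np.linalg.norm(w)

# Geometric distance of every point to the hyperplane, for comparison.
distances = constraint_values / np.linalg.norm(w)

print("constraint values:", constraint_values)
print("margin 1/||w||:", margin)
print("smallest distance:", distances.min())
```

The smallest geometric distance coincides with $\frac{1}{\|w\|}$, which is why minimizing $w^T w$ under the constraint maximizes the margin.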

For data that is not linearly separable, a nonlinear transform is needed: a function $\phi$ maps the data from the $\mathcal{X}$ space to the $\mathcal{Z}$ space, with the aim that the data is linearly separable in the latter space.


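With a polynomial kernel, this transformation function $\phi: \mathcal{X} \rightarrow \mathcal{Z}$ is a polynomial of a certain order $Q$. The transformation is never carried out explicitly: the kernel $K(x, x') = (1 + x^T x')^Q$ directly gives the inner product of two transformed points.
There are however cases where the data is still not separable after a transformation. As a result, some points violate the margin (the constraint $y_n(w^T x_n + b) \geq 1$ no longer holds for them, with $y_n$ the target value). Each point gets a violation $\xi_n \geq 0$, and the total violation was found to be $\sum_{n=1}^N \xi_n$. A multiple of this value (with factor $C$) is added to the quantity that is minimized using quadratic programming, giving $\frac{1}{2}w^T w + C\sum_{n=1}^N \xi_n$, in order to allow some errors. The higher $C$, the more heavily violations are penalized and the fewer errors are tolerated in separating the training data. Intuitively, we see that decreasing $C$ makes the decision surface more smooth and simple. A margin that uses this $C$ is called a soft margin.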
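
The claim that a polynomial kernel equals an inner product in a transformed space can be checked numerically. The sketch below does this for $Q=2$ and two-dimensional inputs; the helper names `poly_kernel` and `phi_q2` and the two sample points are made up for illustration. In libsvm, the options `-t 1 -d 2 -g 1 -r 1` used later select exactly this kernel, since its polynomial kernel is `(gamma*u'*v + coef0)^degree`.

```python
import numpy as np

def poly_kernel(x, x_prime, Q=2):
    """Polynomial kernel (1 + x^T x')^Q."""
    return (1.0 + np.dot(x, x_prime)) ** Q

def phi_q2(x):
    """Explicit second-order transform of a 2D point whose inner
    product reproduces the kernel for Q = 2."""
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([1.0, s * x1, s * x2, x1 ** 2, x2 ** 2, s * x1 * x2])

# Two arbitrary (hypothetical) points in the input space.
a = np.array([0.3, -1.2])
b = np.array([0.7, 0.5])

kernel_value = poly_kernel(a, b)
explicit_value = np.dot(phi_q2(a), phi_q2(b))

print(kernel_value, explicit_value)
```

Both numbers agree up to rounding, so the support vector machine can work in the six-dimensional $\mathcal{Z}$ space while only ever evaluating the kernel on the original two features.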


For the first 2 exercises, we use cross-validation. In the slides, we first saw that using a small part of the training set ($K$ points out of $N$) to validate the learned hypothesis leads to a bad estimate, because so few points may not be representative for estimating the out-of-sample error, $E_{out}$. When we take a large part, however, we again get an accurate validation error ($E_{val}$), but because the model is learned from only $N-K$ data points, we have a bigger chance of getting a bad model for the data. Thus, we need to balance $K$.

The optimal situation however would be to have $K$ both small and large, thus getting a good model and a good estimate of $E_{out}$. To achieve this, we use every point once for validating, and the other times for training a model.
We separate the $N$ data points into a number of folds, $F$. The number of folds can be as large as $N$ itself, in which case $N$ iterations take place and only one point is used for validation each time.
In cross-validation, $K=\frac{N}{F}$, but these $K$ points differ each time as a different fold is used for validation.
The total cross-validation error will then be $E_{cv}=\frac{1}{F}\sum_{n=1}^F e_n$, in which $e_n=E_{val}(g_n^-)$ and $g_n^-$ is the model trained on $N-K$ datapoints.
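
The formula for $E_{cv}$ can be sketched with NumPy alone. This is an illustration under stated assumptions, not the code used for the answers: the function names, the synthetic two-cluster data and the nearest-class-mean classifier (a simple stand-in for the support vector machine) are all hypothetical.

```python
import numpy as np

def cross_validation_error(X, y, n_folds, fit, error, seed=0):
    """Return E_cv and the per-fold validation errors e_n."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))
    folds = np.array_split(indices, n_folds)  # F folds of about N/F points
    fold_errors = []
    for f in range(n_folds):
        val_idx = folds[f]
        train_idx = np.concatenate(
            [folds[g] for g in range(n_folds) if g != f])
        model = fit(X[train_idx], y[train_idx])   # g_n^- trained on N-K points
        fold_errors.append(error(model, X[val_idx], y[val_idx]))
    return np.mean(fold_errors), fold_errors

def fit_nearest_mean(X, y):
    """Stand-in learner: remember the mean of each class."""
    return {label: X[y == label].mean(axis=0) for label in (-1, 1)}

def classification_error(model, X, y):
    """Fraction of points assigned to the wrong class mean."""
    d_neg = np.linalg.norm(X - model[-1], axis=1)
    d_pos = np.linalg.norm(X - model[1], axis=1)
    predictions = np.where(d_pos < d_neg, 1, -1)
    return np.mean(predictions != y)

# Synthetic data: two Gaussian clusters with targets -1 and 1.
data_rng = np.random.default_rng(1)
X = np.vstack([data_rng.normal(-2, 1, (50, 2)),
               data_rng.normal(2, 1, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

E_cv, fold_errors = cross_validation_error(
    X, y, 10, fit_nearest_mean, classification_error)
print("E_cv:", E_cv)
```

Every point ends up in exactly one validation fold, so all $N$ points contribute to the estimate while each model is still trained on $N-K$ points.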

In all the exercises, we only keep data with target value (digit) 1 or 5 and replace these target values by -1 and 1 respectively.
After listing all the possible values of $C$ that need to be used with the support vector machine, we do a number of runs. In each run, we separate the training data randomly into 10 folds. Then, for each value of $C$, we iterate 10 times. Each time, we take a different fold on which the classifier will be validated and use the other 9 folds to train the classifier. This classifier is a support vector machine which is given the current value of $C$. After training, the classifier is evaluated and the error on the validation fold is saved. The mean of the 10 fold errors is the cross-validation error of that $C$ in that run. We then determine which $C$ yielded the lowest cross-validation error in each run. By counting which $C$ was selected most often as the one with the lowest error over all the runs, we know which $C$ value is the best.
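
The final voting step can be illustrated in isolation. The error values below are invented purely for this sketch and are not results from the homework; only the list of $C$ values matches the one used in the experiment.

```python
from collections import Counter
import numpy as np

possible_C = np.array([0.0001, 0.001, 0.01, 0.1, 1])

# Invented cross-validation errors: one row per run, one column per C.
run_errors = np.array([
    [0.010, 0.005, 0.005, 0.006, 0.007],
    [0.009, 0.004, 0.006, 0.006, 0.005],
    [0.008, 0.007, 0.006, 0.006, 0.009],
])

# np.argmin returns the first minimum, so within a run a tie goes to
# the smaller C because possible_C is sorted in increasing order.
winners = possible_C[np.argmin(run_errors, axis=1)]

# The C that wins the most runs is selected.
best_C, wins = Counter(winners).most_common(1)[0]
print("winner per run:", winners)
print("selected C:", best_C, "with", wins, "wins")
```

Counting winners per run, rather than averaging the errors over all runs first, means one run with an unusually bad split cannot dominate the choice.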

In [1]:
from svmutil import *
import pandas as pd
from sklearn.cross_validation import KFold
from collections import Counter
import numpy as np
pd.options.mode.chained_assignment = None

In [2]:
train = pd.read_table("features.train", sep=" +", header=None, engine='python')
train.columns = ["digit", "intensity", "symmetry"]
test = pd.read_table("features.test", sep=" +", header=None, engine='python')
test.columns = ["digit", "intensity", "symmetry"]

In [3]:
train.head()

| | digit | intensity | symmetry |
|---|---|---|---|
| 0 | 6 | 0.341092 | -4.528937 |
| 1 | 5 | 0.444131 | -5.496812 |
| 2 | 4 | 0.231002 | -2.88675 |
| 3 | 7 | 0.200275 | -3.534375 |
| 4 | 3 | 0.291936 | -4.352062 |


In [4]:
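# Keep only digits 1 and 5; copy so the assignments below do not touch `train`
filtered = train[(train.digit == 1) | (train.digit == 5)].copy()
# Relabel the targets: digit 1 becomes -1 (done first), digit 5 becomes 1
filtered.loc[filtered.digit == 1, "digit"] = -1
filtered.loc[filtered.digit == 5, "digit"] = 1
filtered = filtered.reset_index(drop=True)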

In [5]:
filtered.head()

| | digit | intensity | symmetry |
|---|---|---|---|
| 0 | 1 | 0.444131 | -5.496812 |
| 1 | -1 | 0.123043 | -0.707875 |
| 2 | -1 | 0.113859 | -0.931375 |
| 3 | -1 | 0.115371 | -0.386 |
| 4 | -1 | 0.102281 | -0.378812 |


In [6]:
possible_C = np.array([0.0001, 0.001, 0.01, 0.1, 1])
runs = 100

best_run_C = np.zeros(runs)
C_val_errors = {c: np.zeros(runs) for c in possible_C}

for run in range(runs):
    kf = KFold(len(filtered), n_folds=10, shuffle=True)
    C_errors = np.zeros(len(possible_C))
    for i,C in enumerate(possible_C):
        fold_errors = np.zeros(len(kf))
        for j, index_pair in enumerate(kf):
            train_index, val_index = index_pair
            
            # KFold yields positional indices, so select rows with .iloc
            train_fold = filtered.iloc[train_index]
            train_x = train_fold[["intensity", "symmetry"]].values.tolist()
            train_y = train_fold["digit"].values.tolist()

            val_fold = filtered.iloc[val_index]
            val_x = val_fold[["intensity", "symmetry"]].values.tolist()
            val_y = val_fold["digit"].values.tolist()
            
            m = svm_train(train_y, train_x, '-q -t 1 -d 2 -c {} -r 1 -g 1'.format(C))
            p_label, p_acc, p_val = svm_predict(val_y, val_x, m, "-q")
            fold_errors[j] = (100-p_acc[0])/100.
            
        mean_fold_error = np.mean(fold_errors)
        C_val_errors[C][run] = mean_fold_error
        C_errors[i] = mean_fold_error
    best_run_C[run] = possible_C[np.argmin(C_errors)]

### Question 7

In [7]:
best_c = Counter(best_run_C).most_common(1)[0][0]
print("Best C: ", best_c)

Best C:  0.001


The lowest cross validation error is achieved when $C=0.001$. As a result, our answer to this question is B.

### Question 8
In this question, we need to report the cross-validation error obtained with the best $C$ value. As these errors were already saved in order to determine which $C$ yielded the best results, reporting this value is easy: we average the saved errors for the best $C$ over all the runs.

In [8]:
print("Cross validation error when using C=", best_c, ": ", np.mean(C_val_errors[best_c]))

Cross validation error when using C= 0.001 :  0.0046828352115


The achieved cross validation error is closest to $0.005$, which means that our answer is C.

## RBF kernel
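For the next questions, a radial basis function (RBF) kernel is used. In the lectures, the RBF model is of the form:
$$
h(x) = \sum_{n=1}^N w_n e^{-\gamma \| x - x_n \|^2}
$$
The learning algorithm involves finding $w_n$. In the lectures, we find these values by minimizing $(h(x_n) - y_n)^2$ on the training points, where $y_n$ is the target value. In this homework, the same Gaussian serves as the kernel of the support vector machine, $K(x, x') = e^{-\gamma \| x - x' \|^2}$, so the weights follow from the SVM optimization instead. $\gamma$ is a parameter passed to the algorithm and controls the width of the Gaussians. A small $\gamma$ gives wide Gaussians: a single data point influences the function far away from itself, which yields a smoother function with a higher bias. A large $\gamma$ makes the influence of each point very local, which lets the function follow individual data points and leads to a higher variance.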
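
The effect of $\gamma$ on the reach of a single data point can be sketched as follows. The function name `rbf_kernel`, the two points and the three $\gamma$ values are chosen for illustration only; the experiment below uses $\gamma = 1$.

```python
import numpy as np

def rbf_kernel(x, center, gamma):
    """Gaussian RBF exp(-gamma * ||x - center||^2)."""
    diff = np.asarray(x, dtype=float) - np.asarray(center, dtype=float)
    return np.exp(-gamma * np.dot(diff, diff))

center = np.array([0.0, 0.0])
far_point = np.array([2.0, 0.0])  # squared distance to the center is 4

# Kernel value at the far point for a small, medium and large gamma.
values = {gamma: rbf_kernel(far_point, center, gamma)
          for gamma in (0.01, 1.0, 100.0)}

for gamma, value in values.items():
    print("gamma =", gamma, "-> kernel value", value)
```

With a small $\gamma$ the kernel value at the far point stays close to 1, so the center still has a strong influence there; with a large $\gamma$ it is practically zero and only the immediate neighbourhood of the center is affected.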

In [9]:
# Same filtering and relabeling as for the training data
filtered_test = test[(test.digit == 1) | (test.digit == 5)].copy()
filtered_test.loc[filtered_test.digit == 1, "digit"] = -1
filtered_test.loc[filtered_test.digit == 5, "digit"] = 1
filtered_test = filtered_test.reset_index(drop=True)

In [10]:
possible_C = [0.01, 1, 100, 10**4, 10**6]
E_ins = np.zeros(len(possible_C))
E_outs = np.zeros(len(possible_C))

for i,C in enumerate(possible_C):
    train_x = filtered[["intensity", "symmetry"]].values.tolist()
    train_y = filtered["digit"].values.tolist()
    
    test_x = filtered_test[["intensity", "symmetry"]].values.tolist()
    test_y = filtered_test["digit"].values.tolist()
    
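    # -t 2 selects the RBF kernel with gamma = 1 (-g 1); the degree option -d
    # only applies to the polynomial kernel, so it is left out here
    m = svm_train(train_y, train_x, '-q -t 2 -c {} -g 1'.format(C))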
    train_label, train_acc, train_val = svm_predict(train_y, train_x, m, "-q")
    test_label, test_acc, test_val = svm_predict(test_y, test_x, m, "-q")
    
    E_ins[i] = (100-train_acc[0])/100.
    E_outs[i] = (100-test_acc[0])/100.

### Question 9
For this question, we have to report the $C$ value which yields the lowest in-sample error. To do this, for each value of $C$ we train the SVM with that particular $C$ on all training data (of digits 1 and 5) and apply this classifier to the training and test data (of digits 1 and 5). The errors on the training data and test data are saved to separate arrays.
We then report the value of $C$ with the lowest corresponding error in the array of in-sample errors.

In [11]:
possible_C[np.argmin(E_ins)]

1000000

We see that we get the lowest in-sample error when $C=1000000$. This means that the answer to this question is E.

### Question 10
Here, we need to report the $C$ value that gave us the lowest out-of-sample error. As already mentioned, the errors on the test data were saved to an array. Thus, we now need to report the value of $C$ with the lowest corresponding error in this array of out-of-sample errors.

In [12]:
possible_C[np.argmin(E_outs)]

100

The lowest out-of-sample error is achieved when $C=100$. This corresponds to answer C.