# Exercise Session 7 -  Support Vector Machine (SVM)

Welcome to the 7th exercise session of CS233 - Introduction to Machine Learning.  

We will use Scikit-learn, a Python package of machine learning methods, in this exercise. We start with a toy binary classification example to understand linear SVM, then move on to a more difficult problem.


In [None]:
# Useful starting lines
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
%load_ext autoreload
%autoreload 2

# 1 Support Vector Machine (SVM)
SVM solves a linear classification problem, stated in its **primal form** as:  
    \begin{align}
        \underset{\tilde{\mathbf{w}},w_0}{\operatorname{min}}  \ \ & \frac{1}{2}\|\tilde{\mathbf{w}}\|^2 + C \sum^N_{i=1}\zeta_i \\
        \operatorname{subject \  to} \ \ &  y_i(\tilde{\mathbf{w}}^T\mathbf{x_i}+w_0) \geq 1-\zeta_i , \forall \  i \\
                          &  \zeta_i \geq 0 , \forall \  i
    \end{align}
where $\tilde{\mathbf{w}}$ and $w_0$ are the weights and the bias, $C$ is the penalty term, $\zeta_i$ is the slack of data point $i$, i.e. how far it lies on the wrong side of its margin, and $y_i \in\{-1,1\}$ is the label for binary classification. The margin width is $2/\|\tilde{\mathbf{w}}\|$, so minimizing $\|\tilde{\mathbf{w}}\|$ maximizes the margin. Because the data may not be linearly separable, some points must be allowed to violate the margin or even be misclassified. $\zeta_i$ is greater than zero when a data point violates its margin, and $C$ controls how heavily such violations are penalized. We choose the right value of $C$ for the given data using a validation set. The objective therefore trades a large margin against a limited amount of margin violation.
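
To make the roles of $\zeta_i$ and $C$ concrete, here is a minimal sketch that evaluates the primal objective for a fixed, hand-picked hyperplane. The helper `primal_objective`, the data `X_demo`, `y_demo` and the values `w_demo`, `w0_demo` are all hypothetical and chosen only for illustration; the hyperplane is not the optimum of any SVM fit.

```python
import numpy as np

def primal_objective(X, y, w, w0, C):
    """Evaluate 0.5*||w||^2 + C*sum(zeta) for a given (not necessarily optimal) w, w0."""
    functional_margins = y * (X @ w + w0)            # y_i (w^T x_i + w0)
    slack = np.maximum(0.0, 1.0 - functional_margins)  # smallest zeta_i satisfying both constraints
    objective = 0.5 * np.dot(w, w) + C * slack.sum()
    return objective, slack

# Hypothetical 2D data; the last two points violate the margin of the chosen hyperplane
X_demo = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 0.0], [4.0, 0.0], [2.5, 0.0], [1.5, 0.0]])
y_demo = np.array([-1, -1, 1, 1, 1, 1])
w_demo = np.array([1.0, 0.0])   # hand-picked weights (assumption)
w0_demo = -2.0                  # hand-picked bias (assumption)

objective, slack = primal_objective(X_demo, y_demo, w_demo, w0_demo, C=1.0)
print("slack per point:", slack)
print("primal objective:", objective)
```

The point at $(2.5, 0)$ is correctly classified but inside the margin, so its slack is between 0 and 1, while the point at $(1.5, 0)$ is misclassified and has a slack larger than 1.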

The corresponding **dual problem** is given by:
\begin{align}
    \underset{\{\alpha_i\}}{\operatorname{max}} \ \ 
    & \sum_{i=1}^N \alpha_i - \frac 1 2 \sum_{i=1}^N\sum_{j=1}^N \alpha_i\alpha_jy_iy_j\mathbf{x}_i^T\mathbf{x}_j  \\   
    \operatorname{subject \ to} & \ \ \sum_{i=1}^N \alpha_iy_i = 0 \\
                 & \ \ 0 \leq \alpha_i \leq C, \forall i \ \ 
\end{align}
**Question**
   * How can you write $\tilde{\mathbf{w}}$ using the $\alpha_i$s? This relates the primal and dual coefficients.
   * How is $y(\mathbf{x})$ represented using $\alpha_i$s?
 
**Answer**
   * $\tilde{\mathbf{w}} = \sum_{i=1}^N \alpha_iy_i\mathbf{x_i} $
   * Plugging this $\tilde{\mathbf{w}}$ into the decision function gives
     \begin{align}
       \hat{y}(\mathbf{x}) &= \tilde{\mathbf{w}}^T\mathbf{x} + w_0 \\
                           &= \sum_{i=1}^N \alpha_iy_i\mathbf{x}_i^T\mathbf{x} + w_0
     \end{align}
   * Since $\alpha_i = 0$ for every point that is not a support vector, the sum can be computed on the support vectors ($\delta$) only, 
       \begin{align}
       \hat{y}(\mathbf{x}) & = \sum_{i \in \delta} \alpha_iy_i\mathbf{x}_i^T\mathbf{x} + w_0
     \end{align}
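
The two answers above can be checked numerically on a tiny example. This is a sketch under stated assumptions: the two points, their labels and the values `alpha = [0.5, 0.5]` are hypothetical and were derived by hand for this two-point dataset (both points are support vectors of the hard-margin solution); they do not come from a fitted model.

```python
import numpy as np

# Hypothetical two-point dataset; both points are support vectors
X_sv = np.array([[0.0, 0.0], [2.0, 0.0]])
y_sv = np.array([-1.0, 1.0])
alpha = np.array([0.5, 0.5])   # hand-derived dual coefficients (assumption)

# Primal weights from the dual coefficients: w = sum_i alpha_i y_i x_i
w = (alpha * y_sv) @ X_sv

# Bias from a support vector lying on the margin: y_i (w^T x_i + w0) = 1
w0 = y_sv[1] - w @ X_sv[1]

# Decision value of a new point, computed in primal and in dual form
x_new = np.array([3.0, 1.0])
f_primal = w @ x_new + w0
f_dual = np.sum(alpha * y_sv * (X_sv @ x_new)) + w0

print("w =", w, "w0 =", w0)
print("primal form:", f_primal, "dual form:", f_dual)
```

Both forms give the same decision value, and the dual form only needs inner products between the new point and the support vectors.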


# 2 Scikit-Learn

Training an SVM classifier is not an easy task, so in this session we are going to use Scikit-Learn, a machine learning library written in Python in which most machine learning algorithms and tools are already implemented. In this exercise, we'll use this package to train and understand SVM. If you are interested in how to optimize an SVM, you can refer to [this](https://xavierbourretsicotte.github.io/SVM_implementation.html).

If you don't have scikit-learn yet, please install it in your conda environment by following the instructions at this link: https://scikit-learn.org/stable/install.html

Scikit-Learn has modules implemented broadly for 
- Data Transformations: https://scikit-learn.org/stable/data_transforms.html
- Model Selection and Training: https://scikit-learn.org/stable/model_selection.html
- Supervised Techniques: https://scikit-learn.org/stable/supervised_learning.html
- Unsupervised Techniques: https://scikit-learn.org/stable/unsupervised_learning.html

Most of the work happens under the hood, but the library still leaves you free to try out more complicated things.  
The methods to note here are
- `fit`: Train a model with the data
- `predict`: Use the model to predict on test data
- `score`: Return mean accuracy on the given test data
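
As a minimal sketch of how these three methods fit together, the snippet below trains a linear `SVC` on a small made-up dataset. The arrays `X_train`, `y_train`, `X_test` and `y_test` are hypothetical and only serve as an illustration.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical, clearly separable training data
X_train = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
                    [4.0, 4.0], [5.0, 4.0], [4.0, 5.0]])
y_train = np.array([-1, -1, -1, 1, 1, 1])

# Hypothetical held-out points with known labels
X_test = np.array([[0.5, 0.5], [4.5, 4.5]])
y_test = np.array([-1, 1])

model = SVC(kernel='linear', C=1.0)
model.fit(X_train, y_train)              # fit: train the model
pred = model.predict(X_test)             # predict: labels for new data
accuracy = model.score(X_test, y_test)   # score: mean accuracy on the given data

print("predictions:", pred)
print("accuracy:", accuracy)
```

The same `fit` / `predict` / `score` pattern applies to the other estimators in the library.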

Have a look at [this](https://scikit-learn.org/stable/tutorial/basic/tutorial.html#learning-and-predicting) for a simple example.

We will explore linear SVM in this session: [SVC](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC) with linear kernel. 

# 3 Binary Classification

Let's begin with a simple **binary** classification using Linear SVM.
The data is **linearly** separable.

In [None]:
# Simple data
from plots import plot_simple_data
x = np.array([[2,4],[1,4],[2,3],[6,-1],[7,-1],[5,-3]] )
y = np.array([-1,-1, -1, 1, 1 , 1 ])
plot_simple_data()

### 3.1 Linear SVM
In this part, you are asked to build an SVM classifier using SVC and to understand the outputs of the fitted model.

In [None]:
# Let's use SVC with a linear kernel
from sklearn.svm import SVC
from plots import plot


# 1. Declare a SVC with C=1.0 and kernel='linear'
## CODE HERE
clf = ...

# 2. use x and y to fit the model
## CODE HERE
 

# 3. show the fitted model
plot(x, y, clf)

print('w = ',clf.coef_)
print('w0 = ',clf.intercept_)
print('Number of support vectors for each class = ', clf.n_support_)
print('Support vectors = ', clf.support_vectors_)
print('Indices of support vectors = ', clf.support_)
print('a (Coefficients of the support vector in the decision function) = ', clf.dual_coef_)

In [None]:
# Use the weights (w) from the fitted model to predict the labels of input data points

def raw_predict(x, w, w0):
    '''
    given input data, w and w0, output the prediction result
    
    input:
    x: data, np.array of shape (N, D) where N is the number of datapoints and D is the dimension of features.
    w: weights, np.array of shape (D,)
    w0: bias, np.array of shape (1,)
    
    output:
    out: predictions, np.array of shape (N, ). tip: .astype(int) 
    '''
    ## CODE HERE
    
    
    return out

x_test = np.array([
    [4, 2],
    [ 6, -3]])

#output the predictions on x_test
## CODE HERE
raw_pred = ...

print("Prediction from the model: ", clf.predict(x_test))
print("Prediction from your implementation: ", raw_pred)
# compare element-wise; comparing .all() of each array would only compare two booleans
assert np.array_equal(raw_pred, clf.predict(x_test))



In [None]:
# Find out the indices of support vectors by using w and w0

decision_function_from_model = clf.decision_function(x) # this function outputs w^T x + w0 for each row of x

## we can also calculate the decision function manually
## CODE HERE
decision_function = ...

# compare element-wise, up to floating point tolerance
assert np.allclose(decision_function_from_model, decision_function)

## according to the condition that support vectors should satisfy
## CODE HERE tips: use np.where to put the condition in (np.where returns a tuple, so take its first element).
support_vector_indices = ...

print('I find the indices of support vectors = ', support_vector_indices)
# compare the index sets element-wise, independent of their order
assert np.array_equal(np.sort(support_vector_indices), np.sort(clf.support_))


### 3.2 Dual Coefficients VS Primal Coefficients

By using the `dual_coef_` attribute of the model, we can get the dual coefficients $\alpha_i$ of the support vectors.  
**Question** Scikit-learn returns the dual coefficients in a slightly different form. Can you identify the difference?

**Answer** The dual coefficients $\alpha_i$ must satisfy the constraint $0\leq \alpha_i \leq C$.  
Support vectors with $0<\alpha_i<C$ lie exactly on the margin, while those that violate the margin have $\alpha_i=C$.

Scikit-learn returns the coefficients multiplied by the class label, i.e. it returns $\alpha_iy_i$, where $y_i \in \{1,-1\}$. It also returns coefficients for the support vectors only.
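
The sign convention can be inspected directly. The sketch below, on a small hypothetical dataset (`X_demo`, `y_demo` are made up for illustration), recovers the non-negative $\alpha_i$ from `dual_coef_` by multiplying with the labels of the support vectors, using $y_i^2 = 1$.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical, clearly separable data with labels in {-1, 1}
X_demo = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
                   [4.0, 4.0], [5.0, 4.0], [4.0, 5.0]])
y_demo = np.array([-1, -1, -1, 1, 1, 1])

C = 1.0
model = SVC(kernel='linear', C=C).fit(X_demo, y_demo)

signed_coef = model.dual_coef_[0]      # alpha_i * y_i, support vectors only
sv_labels = y_demo[model.support_]     # labels y_i of the support vectors
alpha = signed_coef * sv_labels        # alpha_i, because y_i * y_i = 1

print("alpha_i * y_i:", signed_coef)
print("alpha_i      :", alpha)
print("sum of alpha_i * y_i:", signed_coef.sum())
```

The recovered $\alpha_i$ lie in $[0, C]$, and the signed coefficients sum to (numerically) zero, which is the equality constraint of the dual problem.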

Given support vectors ($\delta$) and their dual_coefficients, the weights can be computed by:
\begin{align}
\tilde{\mathbf{w}} & = \sum_{i \in \delta} \alpha_iy_i\mathbf{x}_i 
\end{align}


In [None]:
# Compute primal coefficients given dual coefficients and support vectors

def compute_w(dual_coef, support_vectors):
    '''
    given dual coefficients and support_vectors, compute the primal coefficients
    
    input:
    dual_coef: dual coefficients as returned by scikit-learn (alpha_i * y_i), np.array of shape (1, n) where n is the number of support vectors.
    support_vectors: np.array of shape (n, D) where n is the number of support vectors and D is the dimension of features.
    
    output:
    w: primal coefficients, np.array of shape (D, )
    '''
    ## CODE HERE
    w = ...
    
    return w


w = compute_w(clf.dual_coef_, clf.support_vectors_)

print("Primal coefficients from the model: ", clf.coef_[0])
print("Primal coefficients from your implementation: ", w)

# compare element-wise, up to floating point tolerance
assert np.allclose(w, clf.coef_[0])



### 3.3 Different C values
Let's try different values of C. In the code, vary the C value from 0.001 to 100 and notice the changes on a bigger dataset.  
**Question**: How does the margin vary with C? **Hint**: have a look at the optimization formulation above.

In [None]:
from sklearn.svm import SVC
from helpers import get_simple_dataset
from plots import plot

# Get the simple dataset
X, Y = get_simple_dataset()
plot(X,Y,None,dataOnly=True)

#Declare a SVM model with linear kernel and C=1.0
clf = SVC(kernel='linear', C=1.0)

#call the fit method
clf.fit(X, Y)

#plot the decision boundary
plot(X, Y, clf)


The above plot shows the decision boundary and margins of the learnt model. Encircled points are the support vectors.  
WARNING: if the margins go beyond the axis limits, they are either not shown or shown close to the decision boundary. 

In [None]:
# Vary C and plot the boundaries
# Use np.logspace to generate 6 C values from 1e-3 to 1e2 (i.e. 0.001 to 100)
## CODE HERE 

C = ...

for ...


**Answer**: A lower C penalizes margin violations less, so it allows more misclassification and yields a larger margin. A bigger C penalizes violations more, so it reduces misclassification and yields a smaller margin.
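
The trend can also be checked numerically, without plotting, by computing the margin width $2/\|\tilde{\mathbf{w}}\|$ of the fitted model for each value of C. This is only a sketch: the overlapping two-class data below is randomly generated for illustration and is not the dataset returned by `get_simple_dataset`.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical overlapping 2D classes drawn from two Gaussians
rng = np.random.default_rng(0)
X_neg = rng.normal(loc=[-1.0, -1.0], scale=1.0, size=(50, 2))
X_pos = rng.normal(loc=[1.0, 1.0], scale=1.0, size=(50, 2))
X_demo = np.vstack([X_neg, X_pos])
y_demo = np.array([-1] * 50 + [1] * 50)

C_values = np.logspace(-3, 2, 6)   # 0.001, 0.01, ..., 100
margin_widths = []
for c in C_values:
    model = SVC(kernel='linear', C=c).fit(X_demo, y_demo)
    width = 2.0 / np.linalg.norm(model.coef_)   # distance between the two margin lines
    margin_widths.append(width)
    print(f"C={c:g}: margin width={width:.3f}, support vectors={model.n_support_.sum()}")
```

On data like this, the printed margin width shrinks as C grows, and fewer points end up as support vectors.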

### Additional Reading (if interested)
- Multiclass SVM (Bishop, Section 7.1.3: Multiclass SVMs)
- Can we have a probabilistic interpretation of SVM? (Bishop, Section 7.2: Relevance Vector Machines)