

* What is a good threshold (decision boundary) for separating observations?
    + Place the threshold halfway between the observations on the edges of the two clusters
    + Assumption: "in general the larger the margin the lower the generalization error of the classifier"

* Goal: maximize the margin using a flat (affine) hyperplane in a high-dimensional space
    + i.e., "the hyper-plane that has the largest distance to the nearest training data points of any class (so-called functional margin)"
    + This is the maximum margin classifier
    
    <center><img src="pics/svm1.png" width="500"></center>


* Compared to Logistic Regression:
    + only one decision boundary
    + the decision boundary is only determined by the support vectors


* Definition
    + **How to represent a hyperplane in an equation?** 
    + How to define the margin?
    + Constraints: every training example must lie on the correct side of the hyperplane, at least the margin distance away from it
    <center><img src="pics/svm3.png" width="500"></center>
    

    + The final objective can be written as a quadratic programming problem:
    
         $$\min_{\mathbf{w}, b} \frac{1}{2}\|\mathbf{w}\|^2 \text { subject to } y_i\left(\mathbf{w}^T \mathbf{x}_i+b\right) \geq 1 \quad \forall i$$

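* Drawback of the hard margin: it is sensitive to outliers
    + A soft margin allows some misclassification (bias-variance tradeoff)
    + Hyperparameter: how many observations, including misclassified ones, are allowed inside the soft margin or on its edge (the support vectors)? In scikit-learn this is controlled by the penalty parameter `C`.
    + Slack variables measure how far each observation violates the margin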
    
* Soft margin classifier or Support Vector Machine
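
The quantities named in the definition above can be written out explicitly. This is the standard textbook formulation, added here as a sketch: the slack symbol $\xi_i$ and the penalty weight $C$ are notation assumed for illustration, and the labels are assumed to be $y_i \in \{-1, +1\}$.

$$\text{hyperplane: } \mathbf{w}^T \mathbf{x}+b=0, \qquad \text{margin width: } \frac{2}{\|\mathbf{w}\|}$$

$$\min_{\mathbf{w}, b, \boldsymbol{\xi}} \frac{1}{2}\|\mathbf{w}\|^2+C \sum_i \xi_i \text { subject to } y_i\left(\mathbf{w}^T \mathbf{x}_i+b\right) \geq 1-\xi_i, \quad \xi_i \geq 0 \quad \forall i$$

Minimizing $\|\mathbf{w}\|^2$ is the same as maximizing the margin width $2/\|\mathbf{w}\|$. Setting every $\xi_i = 0$ recovers the hard-margin problem.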


In [1]:
from sklearn import svm
X = [[0, 0], [1, 1]]
y = [0, 1]
clf = svm.SVC()
clf.fit(X, y)


In [3]:
# get support vectors
clf.support_vectors_


array([[0., 0.],
       [1., 1.]])

In [4]:
# get indices of support vectors
clf.support_


array([0, 1], dtype=int32)

In [5]:
# get number of support vectors for each class
clf.n_support_

array([1, 1], dtype=int32)
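
To make the soft-margin tradeoff concrete, the sketch below fits a linear `SVC` with several values of `C` and reports the margin width $2/\|\mathbf{w}\|$ and the number of support vectors. The data are synthetic (two Gaussian blobs chosen only for illustration), and the names `X_demo`, `y_demo`, `margin_width`, and `n_support` are not part of the original notebook.

```python
import numpy as np
from sklearn import svm

# Synthetic data (an assumption for illustration): two Gaussian blobs in 2-D.
rng = np.random.RandomState(0)
X_demo = np.vstack([rng.randn(20, 2) - 2, rng.randn(20, 2) + 2])
y_demo = np.array([0] * 20 + [1] * 20)

margin_width = {}
n_support = {}
for C in (0.01, 1.0, 100.0):
    # A linear kernel exposes the weight vector w through coef_.
    model = svm.SVC(kernel="linear", C=C).fit(X_demo, y_demo)
    w = model.coef_[0]
    margin_width[C] = 2.0 / np.linalg.norm(w)   # margin width = 2 / ||w||
    n_support[C] = int(model.n_support_.sum())  # total number of support vectors
    print(f"C={C:>6}: margin width={margin_width[C]:.3f}, "
          f"support vectors={n_support[C]}")
```

A small `C` penalizes margin violations only lightly, so the fitted margin is typically wider and more observations end up as support vectors. A large `C` approaches the hard-margin behaviour.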

## Kernel tricks
* Kernel functions systematically find a support vector classifier in a higher-dimensional space
    - linear: $\left\langle x, x^{\prime}\right\rangle$.
    - polynomial: $\left(\gamma\left\langle x, x^{\prime}\right\rangle+r\right)^d$, where $d$ is specified by parameter `degree`, $r$ by `coef0`.
    - radial basis function: $\exp \left(-\gamma\left\|x-x^{\prime}\right\|^2\right)$, where $\gamma$ is specified by parameter `gamma` and must be greater than 0.
    - sigmoid: $\tanh \left(\gamma\left\langle x, x^{\prime}\right\rangle+r\right)$, where $r$ is specified by `coef0`.
* Kernel functions never explicitly transform the data into the higher-dimensional space; they only compute the inner products the data would have there (the kernel trick)

<center><img src="pics/svm4.png" width="700"></center>
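
The point that a kernel never performs the transformation can be checked numerically. The sketch below assumes a degree-2 polynomial kernel with $\gamma = 1$ and $r = 0$ on 2-D inputs. It compares the kernel value with the inner product of an explicitly constructed feature map. The function `phi` and the points `a` and `b` are hypothetical, chosen only for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel


def phi(x):
    """Explicit degree-2 feature map for a 2-D input (gamma=1, r=0)."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])


a = np.array([1.0, 2.0])
b = np.array([3.0, -1.0])

# Route 1: map both points into 3-D, then take the inner product there.
explicit = phi(a) @ phi(b)

# Route 2: evaluate the kernel directly on the original 2-D points.
via_kernel = polynomial_kernel(
    a.reshape(1, -1), b.reshape(1, -1), degree=2, gamma=1.0, coef0=0.0
)[0, 0]

print(explicit, via_kernel)
```

Both routes give the same number up to floating-point rounding, but the kernel route only ever works with the original 2-D vectors.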

In [None]:
linear_svc = svm.SVC(kernel='linear')
linear_svc.kernel

rbf_svc = svm.SVC(kernel='rbf')
rbf_svc.kernel
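
As a rough illustration of why the kernel choice matters, the sketch below fits both kernels to the four XOR points, which no straight line can separate. The arrays `X_xor` and `y_xor` and the settings `C=10.0` and `gamma=2.0` are assumptions made for this example, not values from the original notebook.

```python
import numpy as np
from sklearn import svm

# XOR layout: opposite corners of the unit square share a class.
X_xor = np.array([[0, 0], [1, 1], [0, 1], [1, 0]])
y_xor = np.array([0, 0, 1, 1])

# Training accuracy of a linear kernel versus an RBF kernel on the same points.
linear_acc = svm.SVC(kernel="linear", C=10.0).fit(X_xor, y_xor).score(X_xor, y_xor)
rbf_acc = svm.SVC(kernel="rbf", gamma=2.0, C=10.0).fit(X_xor, y_xor).score(X_xor, y_xor)

print("linear:", linear_acc, "rbf:", rbf_acc)
```

The linear kernel cannot classify all four points correctly, while the RBF kernel separates them.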


## Multiclass Classification
* one-vs-one: n_classes * (n_classes - 1) / 2 classifiers, one per pair of classes (the combination formula)
* one-vs-rest: n_classes classifiers
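
The two counts can be compared directly. The helper `n_binary_classifiers` below is a hypothetical function written for this illustration; it is not part of scikit-learn.

```python
from math import comb


def n_binary_classifiers(n_classes, strategy):
    """Number of binary classifiers each multiclass strategy trains."""
    if strategy == "ovo":
        # One classifier per unordered pair of classes.
        return comb(n_classes, 2)
    if strategy == "ovr":
        # One classifier per class (that class versus all others).
        return n_classes
    raise ValueError("strategy must be 'ovo' or 'ovr'")


for k in (4, 10):
    print(k, n_binary_classifiers(k, "ovo"), n_binary_classifiers(k, "ovr"))
```

The one-vs-one count grows quadratically with the number of classes, while the one-vs-rest count grows linearly.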

In [13]:
X = [[0], [1], [2], [3]]
Y = [0, 1, 2, 3]
clf = svm.SVC(decision_function_shape='ovo')
clf.fit(X, Y)

dec = clf.decision_function([[1]])
dec.shape[1] # 4 classes: 4*3/2 = 6

6

In [19]:
clf.dual_coef_.shape  # (n_classes-1, n_SV)
clf.intercept_.shape # (n_classes * (n_classes - 1) / 2)

# Each intercept corresponds to one binary (one-vs-one) classifier. The order for classes 0 to n is
# “0 vs 1”, “0 vs 2”, … “0 vs n”, “1 vs 2”, “1 vs 3”, … “1 vs n”, … “n-1 vs n”

(6,)

In [7]:
clf.decision_function_shape = "ovr"
dec = clf.decision_function([[1]])
dec.shape[1] # 4 classes

4

LinearSVC implements the “one-vs-the-rest” multi-class strategy and therefore trains n_classes models.

In [10]:
lin_clf = svm.LinearSVC()
lin_clf.fit(X, Y)

dec = lin_clf.decision_function([[1]])
dec.shape[1]


4

In [11]:

lin_clf.coef_.shape  # (n_classes, n_features) 
lin_clf.intercept_.shape # (n_classes,)

# Each row of the coefficients corresponds to one of the n_classes “one-vs-rest” classifiers

(4, 1)

## Pass Activity

In [None]:
# Q1: Load the "digits" dataset from scikit-learn and print the dimensions of the dataset.
# Apply PCA to the dataset and select the first three components.
# Print the dimensions of the modified dataset and visualise the data using appropriate plotting tools.

# Q2: Classify the digit classes available in the dataset (use the modified dataset) using an SVM with an RBF kernel.
# Select an appropriate data splitting approach and performance metrics.
# Report the performance and the model hyper-parameters used.

In [None]:
# Q3: Load the "diabetes" dataset from scikit-learn and print the dimensions of the dataset.
# Apply the t-SNE method to reduce the dimension and select the first three components.
# Plot the selected components using an appropriate visualisation technique.

# Q5: Create a model for detecting diabetes using an SVM with a poly kernel.
# Select an appropriate data splitting approach and performance metrics.
# Report the performance and the model hyper-parameters used.

# Q6: Based on the model hyper-parameters used in Q2 and Q5, share your understanding of hyper-parameter tuning in ML model development.