# 7. Support Vector Machines

The Support Vector Machine (SVM) is a powerful algorithm widely used (_at least at the time of the video_) in both industry and academia.

To understand SVMs, we start from logistic regression and its sigmoid hypothesis:

$h_{\theta}(x) = \frac{1}{1+e^{-\theta^T x}} $

For a single example, our cost function will be: 

$Cost(h_{\theta}(x), y) = -y\log(h_{\theta}(x)) - (1-y)\log(1-h_{\theta}(x))$
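As a quick numerical check of the two formulas above, here is a minimal sketch in plain Python. The function names `sigmoid` and `logistic_cost` are illustrative choices, not part of the original notes, and `z` stands for $\theta^T x$.

```python
import math

def sigmoid(z):
    """Logistic hypothesis h = 1 / (1 + e^(-z)), where z = theta^T x."""
    return 1.0 / (1.0 + math.exp(-z))

def logistic_cost(z, y):
    """Per-example cost: -y*log(h) - (1 - y)*log(1 - h)."""
    h = sigmoid(z)
    return -y * math.log(h) - (1 - y) * math.log(1 - h)

# The cost for y = 1 shrinks as z grows; the cost for y = 0 does the opposite.
for z in (-2.0, 0.0, 2.0):
    print(z, logistic_cost(z, 1), logistic_cost(z, 0))
```

Note that the cost never reaches exactly zero for finite $z$, which is one of the things the piecewise-linear SVM costs change.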


### Graphical intuition

**Part 1**: $y = 1$

$Cost(h_{\theta}(x), y) = -\log(h_{\theta}(x)) = -\log\frac{1}{1+e^{-\theta^T x}}$

Now, writing $z = \theta^T x$, we modify this function and _split_ it into two linear parts (flat at zero for $z \ge 1$ and a straight line otherwise).

$Cost_1(z)$

![SVM Graphical Intuition 1](Figures/SVM_Graph_1.jpg)

**Part 2**: $y = 0$

$Cost(h_{\theta}(x), y) = -\log(1-h_{\theta}(x)) = -\log(1 - \frac{1}{1+e^{-\theta^T x}})$

We will now _split_ it in a similar way, this time with the threshold placed at $z = -1$ (flat at zero for $z \le -1$ and a straight line otherwise):

$Cost_0(z)$

![SVM Graphical Intuition 2](Figures/SVM_Graph_2.jpg)
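The two piecewise-linear costs can be sketched in a few lines of Python. The notes only fix the shape (flat on one side of the threshold, a straight line on the other), so the slopes of $-1$ and $+1$ used here are an assumption; they correspond to the common hinge-loss form.

```python
def cost_1(z):
    """Cost for y = 1: zero when z >= 1, rising linearly as z drops below 1.
    The slope of -1 is an assumption (hinge-loss form)."""
    return max(0.0, 1.0 - z)

def cost_0(z):
    """Cost for y = 0: zero when z <= -1, rising linearly as z goes above -1.
    The slope of +1 is an assumption (hinge-loss form)."""
    return max(0.0, 1.0 + z)

for z in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(z, cost_1(z), cost_0(z))
```

Unlike the logistic cost, both functions are exactly zero once $z$ is past the threshold on the correct side.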

### Cost Function

$\min_{\theta} C \sum_{i=1}^m \left[ y^{(i)} Cost_1(\theta^T x^{(i)}) + (1-y^{(i)}) Cost_0(\theta^T x^{(i)}) \right] + \frac{1}{2} \sum_{j=1}^n \theta^2_j$

**Notation change**: for logistic regression, the regularized cost function has the structure:

$ A + \lambda B$ (where $A$ is the data-fit term, $B$ is the regularization term, and $\lambda$ controls how much weight $B$ gets)

By convention, with SVM we will change to:

$ C A + B$ 

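In the end, it's just a different way to control the trade-off: with $C = \frac{1}{\lambda}$ we have $CA + B = \frac{1}{\lambda}(A + \lambda B)$, so both forms are minimized by the same $\theta$.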
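To make the $CA + B$ structure concrete, here is a minimal sketch of the SVM objective in plain Python. It assumes hinge-style costs with slopes of $\pm 1$, that every row of `X` starts with the bias feature $x_0 = 1$, and that $\theta_0$ is left out of the regularization term; the function name and the toy data are illustrative only.

```python
def svm_objective(theta, X, y, C):
    """Return C * A + B, where
    A = sum_i [y_i * cost_1(z_i) + (1 - y_i) * cost_0(z_i)], z_i = theta^T x_i,
    B = 0.5 * sum_{j >= 1} theta_j^2 (theta[0] is the unregularized bias)."""
    data_term = 0.0
    for x_i, y_i in zip(X, y):
        z = sum(t * v for t, v in zip(theta, x_i))
        cost_1 = max(0.0, 1.0 - z)  # cost when y = 1 (assumed slope -1)
        cost_0 = max(0.0, 1.0 + z)  # cost when y = 0 (assumed slope +1)
        data_term += y_i * cost_1 + (1 - y_i) * cost_0
    reg_term = 0.5 * sum(t * t for t in theta[1:])
    return C * data_term + reg_term

# Toy data: one positive example at x = 2 and one negative example at x = -2.
X = [[1.0, 2.0], [1.0, -2.0]]
y = [1, 0]

print(svm_objective([0.0, 1.0], X, y, C=1.0))   # 0.5: both margins met, only B remains
print(svm_objective([0.0, 0.25], X, y, C=1.0))  # 1.03125: smaller B, but A is now 1.0
```

Raising `C` makes the second parameter vector look even worse, since the margin violations in `A` are weighted more heavily than the size of `theta`.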

### Large Margin Intuition

For SVM:

If $y = 1$ we want $\theta^T x \ge 1$ (not just $\ge 0$)  
If $y = 0$ we want $\theta^T x \le -1$ (not just $\le 0$)

This means that we have an extra **margin** compared to logistic regression: it is not enough for $\theta^T x$ to fall on the correct side of 0, since the cost only drops to zero once $\theta^T x$ is at least 1 (for $y = 1$) or at most $-1$ (for $y = 0$).

### Kernels 

We know that computing high-order polynomial features can quickly become expensive, so how else can we define features for a non-linear decision boundary?

One useful idea is to compute new features based on how close $x$ is to predetermined points (in our example, _landmarks_).

Example of similarity function (**kernel**):  

$f_i = similarity(x, l^{(i)}) = \exp(-\frac{\|x - l^{(i)}\|^2}{2\sigma^2})$ (**Gaussian kernel**)

There are two extreme cases, which bound the range of values the kernel can take:

1. If $x \approx l^{(i)}$: $f_i \approx 1$
2. If $x$ far from $l^{(i)}$: $f_i \approx 0$
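The two extremes can be checked with a small sketch of the Gaussian kernel in plain Python. The function name `gaussian_kernel` and the sample points are illustrative choices, not part of the original notes.

```python
import math

def gaussian_kernel(x, landmark, sigma2):
    """similarity(x, l) = exp(-||x - l||^2 / (2 * sigma^2)); sigma2 is sigma squared."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, landmark))
    return math.exp(-sq_dist / (2.0 * sigma2))

landmark = [3.0, 5.0]
print(gaussian_kernel([3.0, 5.0], landmark, sigma2=1.0))    # x on the landmark: 1.0
print(gaussian_kernel([30.0, 50.0], landmark, sigma2=1.0))  # x far away: effectively 0
```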

But how do we choose and use kernels? 

1. We place one landmark at each training example: $l^{(i)} = x^{(i)}$ for $i = 1, \dots, m$
2. For any input $x$, we create a _feature vector_ $f$ whose components are $f_i = similarity(x, l^{(i)})$
3. Now, we predict $y = 1$ if $\theta^T f \ge 0$
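The three steps above can be sketched as follows. This is a minimal illustration, not a trained model: the training points and the parameter vector `theta` are hand-picked hypothetical values, and prepending a bias feature $f_0 = 1$ is an assumption not stated in the steps.

```python
import math

def gaussian_kernel(x, landmark, sigma2):
    """exp(-||x - l||^2 / (2 * sigma^2))"""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, landmark))
    return math.exp(-sq_dist / (2.0 * sigma2))

def kernel_features(x, landmarks, sigma2):
    """Feature vector f = [1, f_1, ..., f_m]; f_0 = 1 is an assumed bias feature."""
    return [1.0] + [gaussian_kernel(x, l, sigma2) for l in landmarks]

def predict(theta, x, landmarks, sigma2):
    """Predict y = 1 if theta^T f >= 0, else y = 0."""
    f = kernel_features(x, landmarks, sigma2)
    return 1 if sum(t * fi for t, fi in zip(theta, f)) >= 0 else 0

# Step 1: one landmark per training example (hypothetical data).
X_train = [[1.0, 1.0], [2.0, 2.0], [8.0, 8.0]]
landmarks = [list(x) for x in X_train]

# Hypothetical parameters: predict 1 near the first two landmarks, 0 elsewhere.
theta = [-0.5, 1.0, 1.0, 0.0]

print(predict(theta, [1.5, 1.5], landmarks, sigma2=1.0))  # close to landmarks 1 and 2
print(predict(theta, [8.0, 8.0], landmarks, sigma2=1.0))  # only close to landmark 3
```

In practice `theta` would come from minimizing the SVM cost function over the features $f$ rather than being set by hand.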

**SVM parameters**

$C = \frac{1}{\lambda}$

* Large $C$ = lower bias - high variance (small $\lambda$ - may lead to overfitting)
* Small $C$ = higher bias - low variance (large $\lambda$ - may lead to underfitting) 

$\sigma^2$

* Large $\sigma^2$: features $f_i$ vary more smoothly. Higher bias, lower variance. 
* Small $\sigma^2$: features $f_i$ vary less smoothly. Lower bias, higher variance.
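To see what "vary more smoothly" means for $\sigma^2$, the sketch below evaluates the Gaussian kernel at a few distances from a landmark for several values of $\sigma^2$. The helper name and the chosen values are illustrative assumptions.

```python
import math

def similarity_at_distance(distance, sigma2):
    """Gaussian kernel value for a point at the given distance from a landmark."""
    return math.exp(-(distance ** 2) / (2.0 * sigma2))

distances = (0.0, 1.0, 2.0, 3.0)
for sigma2 in (0.5, 1.0, 3.0):
    row = [round(similarity_at_distance(d, sigma2), 3) for d in distances]
    print(f"sigma^2 = {sigma2}: {row}")
```

With a small $\sigma^2$ the feature drops to almost zero within a short distance of the landmark, so each landmark only influences its immediate neighborhood; with a large $\sigma^2$ the decay is gradual.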

### Implementation

Although it is not recommended to write your own code for the minimization (there are libraries which do it more efficiently), we still have to specify:

* Parameter C
* Kernel:
    - **No** kernel > linear kernel (basically returns a linear classifier)
    - Gaussian kernel > need to choose $\sigma^2$
    - Other kernels (which need to satisfy Mercer's theorem for SVM packages' optimizations to work properly)
    
**Note**: Perform feature scaling before using the Gaussian kernel.
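Feature scaling matters here because the Gaussian kernel depends on $\|x - l\|^2$, so a feature with a much larger numeric range would dominate the distance. Below is a minimal sketch of one common choice, standardization to zero mean and unit standard deviation; the function name and the sample data (a size-like column and a count-like column) are hypothetical.

```python
import statistics

def standardize(X):
    """Scale every column of X to zero mean and unit (population) standard deviation.
    Constant columns are left centered but unscaled to avoid dividing by zero."""
    columns = list(zip(*X))
    means = [statistics.mean(col) for col in columns]
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in X]

# Hypothetical raw data: the first column is ~1000x larger than the second.
X = [[1000.0, 1.0], [2000.0, 3.0], [3000.0, 5.0]]
X_scaled = standardize(X)
for row in X_scaled:
    print(row)
```

After scaling, both columns contribute comparably to the squared distance. When predicting on new data, the same means and standard deviations computed on the training set should be reused.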

### Logistic Regression vs. SVM

With $n$ = number of features and $m$ = number of training examples, a reasonable rule of thumb is:

1. If $n$ is large relative to $m$ (e.g. text classification problem) > **logistic regression** or SVM with a linear kernel
2. If $n$ is small, $m$ is intermediate > **SVM** with Gaussian kernel
3. If $n$ is small, $m$ is large > create/add features, then use logistic regression or SVM with a linear kernel