## Exercise: 

We will use the weight-based function to solve a small problem with an SVM. The function has been incorporated into a widget that appears below.

You will need to find values for $\mathbf{w}$ and $b$ that give the best classifier.

$$\underbrace{\mathbf{w}^\top\mathbf{x}_\text{test} + b}_\text{Weight-based function}
    = b + \sum^{m}_{i=1}\alpha_i\mathbf{x}_{\text{test}}^\top \mathbf{x}_{\text{train}}[i,:]$$

## Solution:

One solution is $\mathbf{w} = [1,1]$, $b=-2$.
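
As a rough sketch of what the weight-based function computes, the snippet below applies $\mathbf{w} = [1,1]$, $b=-2$ to a few points. The points in `X` are hypothetical stand-ins, not the widget's data.

```python
import numpy as np

# Hypothetical toy points (NOT the widget's data), chosen for illustration.
X = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 2.0], [3.0, 1.0]])

w = np.array([1.0, 1.0])  # weights from the solution above
b = -2.0                  # bias from the solution above

scores = X @ w + b                     # w^T x + b for every row of X
labels = np.where(scores >= 0, 1, -1)  # the sign of the score picks the class

print(scores)  # [-2. -1.  2.  2.]
print(labels)  # [-1 -1  1  1]
```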

## Exercise:

Now, we'll solve the same problem with an SVM using the equivalent example-based function. The function has been incorporated into a widget that appears below.

You will need to find values for $\mathbf{a}$ (the vector of coefficients $\alpha_i$ in the formula below) and $b$ that give the best classifier.

$$\mathbf{w}^\top\mathbf{x}_\text{test} + b
    = \underbrace{
        b + \sum^{m}_{i=1}\alpha_i\mathbf{x}_{\text{test}}^\top \mathbf{x}_{\text{train}}[i,:]
       }_\text{Example-based function}
$$


# Solution

One solution that works here is $\mathbf{a} = [0,0,0,1,-1,0]$, $b = -4$.
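
The two forms are equivalent because $\mathbf{w} = \sum_i \alpha_i \mathbf{x}_{\text{train}}[i,:]$. The sketch below checks this numerically; `X_train` and `x_test` are hypothetical values chosen for illustration, not the widget's data.

```python
import numpy as np

# Hypothetical training set with six examples (NOT the widget's data).
X_train = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
                    [3.0, 3.0], [2.0, 2.0], [4.0, 3.0]])
a = np.array([0.0, 0.0, 0.0, 1.0, -1.0, 0.0])  # one coefficient per example
b = -4.0
x_test = np.array([3.0, 2.0])

# Example-based form: b + sum_i a_i * (x_test . x_train[i])
example_based = b + np.sum(a * (X_train @ x_test))

# Weight-based form: collapse the examples into a single weight vector first.
w = X_train.T @ a
weight_based = w @ x_test + b

print(w)                            # [1. 1.]
print(example_based, weight_based)  # 1.0 1.0
```

Only the examples with non-zero entries in `a` contribute to `w`.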

# Discussion

- Go back to the example-based widget.
- Compare two solutions that both get 100% accuracy. Do you see any advantages to either solution?  
    - Solution 1: $a=[1, -1, 1, 1, -1, -1]$
    - Solution 2: $a=[0, 0, 0, 1, -1, 0]$  
    
    
- If we keep $b=0$ our model is perfectly accurate. We could actually use several different values for $b$ and get the same score. How should we decide between them?

# Discussion Answers

- Recall from the regression lesson that we can use Lasso to do feature selection. Lasso regression pushes the coefficients of many features to zero. This is a clue that our model might be better off if we omitted those zero-coefficient features.
- SVMs use a similar process for examples. Instead of saving all of the training examples, SVMs only need to save the ones with a non-zero $a_i$.
    - The saved examples are called <u>**support vectors**</u>.

![](https://www.dropbox.com/s/lsbse60wi31lhl0/2018-12-06_11-59-43.png?raw=1)
- Let's think back to discussions of cross validation. While the widget above is displaying training accuracy, what we really care about is how accurate a model will be on new data. Errors on new data are often called **generalization error**.
- SVMs choose the $b$ that puts the greatest distance between the decision boundary and the nearest examples.
    - This strategy is what is called a <u>**maximum margin**</u> classifier.
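
To see support vectors in practice, here is a minimal sketch using scikit-learn's `SVC` on a hypothetical, linearly separable toy dataset. The data and the large `C` value are assumptions made for illustration.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical, linearly separable toy data (not the widget's data).
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
              [3.0, 3.0], [4.0, 3.0], [3.0, 4.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

# A large C approximates a hard-margin (maximum margin) classifier.
clf = SVC(kernel="linear", C=1e3).fit(X, y)

print(clf.support_)          # indices of the training examples that were kept
print(clf.support_vectors_)  # the kept examples themselves
print(clf.dual_coef_)        # signed coefficients of the kept examples

# The weight vector can be rebuilt from the support vectors alone.
w = clf.dual_coef_ @ clf.support_vectors_
print(np.allclose(w, clf.coef_))
```

Examples that are not listed in `clf.support_` have a zero coefficient and play no part in prediction.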

## Discussion:

Adjust the $C$ parameter. Describe how the model and its decision boundary are changing.

## Discussion Answers

- Higher $C$: tends to care more about increasing accuracy on the training set.
- Lower $C$: tends to care more about reducing generalization error.
- Very low $C$: results in all points being support vectors.
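
One way to observe this outside the widget is to count support vectors for several values of `C`. The sketch below assumes a synthetic, overlapping two-class dataset from `make_blobs`; the specific `C` values are arbitrary.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Synthetic, overlapping two-class data (an assumption for illustration).
X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.0, random_state=0)

n_support = {}
for C in [0.001, 1.0, 100.0]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    n_support[C] = int(clf.n_support_.sum())  # total over both classes
    print(f"C={C}: {n_support[C]} support vectors, "
          f"training accuracy={clf.score(X, y):.2f}")
```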

# Discussion

Test out the different kernels and datasets. Get a feel for what each parameter does. 

Describe in your own words:
- What sorts of functions do the poly and rbf kernels tend to learn?
- What does the `gamma` parameter do?

# Discussion Answers

- What sorts of functions do the poly and rbf kernels tend to learn?


The **poly kernel** tends to learn continuous, curved boundaries. The function implemented by the poly kernel is:

$$K(\mathbf{x},\mathbf{x}') = \left(\gamma\mathbf{x}^\top\mathbf{x}' + r \right)^d $$
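
As a small check of the formula, the sketch below computes the polynomial kernel between two vectors by hand and compares it with scikit-learn's `polynomial_kernel`. The vectors and parameter values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel

x = np.array([[1.0, 2.0]])        # arbitrary example vector
x_prime = np.array([[3.0, 0.5]])  # arbitrary second vector
gamma, r, d = 0.5, 1.0, 3

# (gamma * x^T x' + r)^d, written out directly.
manual = (gamma * (x @ x_prime.T) + r) ** d

# scikit-learn calls r `coef0` and d `degree`.
library = polynomial_kernel(x, x_prime, degree=d, gamma=gamma, coef0=r)

print(manual, library)  # both are [[27.]]
```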

The **RBF kernel** tends to learn continuous, curved boundaries. The function implemented by the RBF kernel is: 
    
$$K(\mathbf{x},\mathbf{x}') = e^{-\gamma\|\mathbf{x}-\mathbf{x}' \|^2} $$


What's happening under the hood is that the RBF kernel is comparing test points to each of the support vectors. When the test point is close to a support vector, that support vector's coefficient is weighted highly. When a test point is farther away, the coefficient is weighted lower. 
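
The comparison against support vectors can be written out explicitly. This sketch assumes a synthetic `make_moons` dataset and rebuilds the decision function of a fitted binary `SVC` as a kernel-weighted sum over its support vectors.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

# Synthetic two-class data (an assumption for illustration).
X, y = make_moons(n_samples=100, noise=0.2, random_state=0)
gamma = 1.0
clf = SVC(kernel="rbf", gamma=gamma).fit(X, y)

X_test = X[:5]

# K[i, j] = exp(-gamma * ||X_test[i] - support_vector[j]||^2)
K = rbf_kernel(X_test, clf.support_vectors_, gamma=gamma)

# Nearby support vectors get kernel values close to 1, distant ones close to 0,
# so each coefficient is scaled by how close its support vector is.
manual = K @ clf.dual_coef_.ravel() + clf.intercept_

print(np.allclose(manual, clf.decision_function(X_test)))
```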

Note: you might recognize that this function is similar to the PDF of the Normal distribution, 
    
$$f(x \mid \mu, \sigma^2) =\frac{1}{ \sqrt{2\pi\sigma^2}} e^{-\frac{ (x-\mu)^2}{2\sigma^2}}$$


- What does the `gamma` parameter do?


Note that in the RBF kernel, $\gamma \sim \frac{1}{\sigma^2}$ (matching the two exponents gives $\gamma = \frac{1}{2\sigma^2}$), so if we think of the RBF as a Normal distribution, increasing `gamma` is analogous to decreasing the standard deviation.

In more qualitative terms:
- decreasing `gamma` leads to smoother boundaries
- increasing `gamma` leads to boundaries that follow the training data more tightly.
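
A quick way to see this numerically is to compare training and held-out accuracy across several `gamma` values. The sketch below assumes a noisy synthetic `make_moons` dataset; the `gamma` values are arbitrary.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Noisy synthetic data (an assumption for illustration).
X, y = make_moons(n_samples=300, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

train_acc, test_acc = {}, {}
for gamma in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="rbf", gamma=gamma).fit(X_train, y_train)
    train_acc[gamma] = clf.score(X_train, y_train)
    test_acc[gamma] = clf.score(X_test, y_test)
    print(f"gamma={gamma}: train={train_acc[gamma]:.2f}, "
          f"test={test_acc[gamma]:.2f}")
```

A large gap between the training and test scores at high `gamma` suggests the boundary is following the training data too tightly.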

# Exercise

Confirm that SVMs trained with SGD and kernel approximation behave similarly to kernel SVMs.

- What differences do you notice?

# Solution

- What differences do you notice?
    - There are no support vectors. `SGDClassifier` learns a weight vector according to the weight-based linear function: $f(\mathbf{x}) = \mathbf{w}^\top \mathbf{x} + b$.
    - The hyperparameter `C` doesn't affect the SGD SVM. This is because `SGDClassifier` uses a separate parameter `alpha` which, depending on the kernel, is roughly $\alpha \sim \frac{1}{C}$.
    - Accuracy tends to increase as the number of components increases. However, be careful because the time to train also increases with the number of components. 
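
For reference, here is one possible setup for this comparison. It assumes `RBFSampler` as the kernel approximation and a synthetic `make_moons` dataset; the `gamma`, `alpha`, and `n_components` values are illustrative, not the notebook's settings.

```python
from sklearn.datasets import make_moons
from sklearn.kernel_approximation import RBFSampler
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Synthetic data and parameter values are assumptions for illustration.
X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
gamma = 2.0

# Exact kernel SVM: stores support vectors.
kernel_svm = SVC(kernel="rbf", gamma=gamma).fit(X, y)

# Approximate version: random features followed by a linear SVM trained with SGD.
approx_svm = make_pipeline(
    RBFSampler(gamma=gamma, n_components=200, random_state=0),
    SGDClassifier(loss="hinge", alpha=1e-3, random_state=0),
).fit(X, y)

sgd = approx_svm.named_steps["sgdclassifier"]
print("kernel SVM training accuracy:", kernel_svm.score(X, y))
print("SGD + RBFSampler training accuracy:", approx_svm.score(X, y))
print("weight shape:", sgd.coef_.shape)  # one weight per random feature
print("has support vectors:", hasattr(sgd, "support_vectors_"))
```

Raising `n_components` gives the linear model more random features to work with, at the cost of longer training.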