# HW3 Solution
## Student Name: Jason Miller

---

**NOTE:**
Change the notebook filename in this way:
```
hw3_solution_lastname_firstname.ipynb
```

---

As you can see, you can write an inline equation in this way: $P(\theta)$.

Or you can write a block equation in this way:
$$
\mathcal{N}(x \mid \mu,\sigma^2) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left(-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right)
$$

This is a [Markdown overview](https://colab.research.google.com/notebooks/markdown_guide.ipynb) if you are not familiar with this text editing formalism.

---


In [None]:
# You can only import these libraries
import matplotlib
import matplotlib.pyplot as plt
import numpy as np


## Problem 1

By considering the determinant of a 2×2 Gram matrix, 
show that a positive definite kernel function $k(x, x')$ 
satisfies the Cauchy-Schwarz inequality  
$k(x_1,x_2)^2 \leq k(x_1,x_1)\,k(x_2,x_2)$.

### Solution 1
Our textbook (Bishop, page 293) defines the Gram matrix K 
as the matrix whose elements are the dot products  
$K_{ij} = k(x_i,x_j) = \langle \phi(x_i),\phi(x_j) \rangle$  

Our textbook (Murphy, page 481) 
defines a positive definite kernel k as one for which
K is a positive definite matrix. 
A positive definite matrix has only positive eigenvalues,
and the determinant is the product of the eigenvalues,
so the determinant of K is positive. 

So far we have...

$K = \left( \begin{smallmatrix} k(x_1,x_1)&k(x_1,x_2) \\ 
k(x_2,x_1)&k(x_2,x_2) \end{smallmatrix} \right)$

$\det(K) = ad - bc = k(x_1,x_1)k(x_2,x_2)-k(x_1,x_2)k(x_2,x_1)$  

$\det(K)>0$  
implies  
$k(x_1,x_1)k(x_2,x_2) > k(x_1,x_2)k(x_2,x_1)$  

If  
$x_1=x_2$  
then the two rows of K are identical and  
$\det(K)=0$  
so, allowing for that case,  
$k(x_1,x_1)k(x_2,x_2) \geq k(x_1,x_2)k(x_2,x_1)$  

Our slides 
(deck 16, "SVM", page 33, "Computational Efficiency")
define a kernel function as a symmetric function:   
$k(x_1,x_2) = k(x_2,x_1)$    

So  
$k(x_1,x_1)k(x_2,x_2) \geq [k(x_1,x_2)]^2$
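
As a numerical sanity check of the inequality, the sketch below builds the 2×2 Gram matrix for random pairs of points and confirms that its determinant is non-negative. The Gaussian kernel, the helper names `gaussian_kernel` and `gram_2x2`, and the random test points are assumptions chosen for illustration; they are not part of the problem statement.

```python
import numpy as np

def gaussian_kernel(x1, x2):
    """Gaussian kernel with unit width; an illustrative choice of valid kernel."""
    diff = np.asarray(x1, dtype=float) - np.asarray(x2, dtype=float)
    return float(np.exp(-0.5 * np.dot(diff, diff)))

def gram_2x2(k, x1, x2):
    """Return [[k(x1,x1), k(x1,x2)], [k(x2,x1), k(x2,x2)]]."""
    return np.array([[k(x1, x1), k(x1, x2)],
                     [k(x2, x1), k(x2, x2)]])

rng = np.random.default_rng(0)
all_hold = True
for _ in range(1000):
    x1, x2 = rng.normal(size=3), rng.normal(size=3)
    K = gram_2x2(gaussian_kernel, x1, x2)
    # det(K) >= 0 is the same statement as k(x1,x2)^2 <= k(x1,x1) k(x2,x2)
    det_K = K[0, 0] * K[1, 1] - K[0, 1] * K[1, 0]
    # Small tolerance guards against floating-point round-off
    all_hold = all_hold and bool(det_K >= -1e-12)

print("Cauchy-Schwarz held for every sampled pair:", all_hold)
```

Passing the same point twice to `gram_2x2` reproduces the equality case, where both rows of K coincide and the determinant is zero.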

## Problem 2 Solution
### 1 = C  
In fig C, the decision boundary is linear, 
matching the identity kernel in eq 1. 
In blue, 3 of 4 SVs are inside the margin.
In red, at least 2 of 4 SVs are inside the margin. 
This indicates high tolerance for slack,
possibly matching the C=0.1 parameter in eq 1.

### 2 = B  
In fig B, the decision boundary is linear,
matching the identity kernel in eq 1 and 2.
There is only one (blue) SV in the margin.
This indicates less tolerance for slack,
possibly matching the C=1 parameter in eq 2.

### 3 = F
Eq 3 uses a 2nd order polynomial kernel.
We expect to see a quadratic decision boundary, as in fig D.
The fig F decision boundary looks linear,
but that could still be the solution with the polynomial kernel, 
especially since the only SVs are one red and one blue.
Fig F has no points in the margin,
consistent with the lack of slack variables in eq 3.

### 4 = A
Eq 4 and 5 use a Gaussian kernel.
This kernel is capable of the high-degree decision boundary 
in fig A and E.
For eq 4, fig A seems the better choice by the argument below.

### 5 = E
Eq 5 is similar to eq 4 and fig E is similar to fig A.
Compared to fig A, fig E shows more signs of overfitting:
the decision boundary is more tailored to the red points
and it uses nearly all the blue points as SVs. 
Where eq 4 has a 2 in the denominator, eq 5 has a 1.
The denominator is proportional to the variance of the Gaussian kernel.
Since eq 5 uses a smaller variance and a narrower Gaussian,
eq 5 is predisposed to overfit more than eq 4.
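
The effect of the denominator can be illustrated with a short sketch. It assumes the kernels have the form $\exp(-\|u-v\|^2 / d)$ with $d=2$ for eq 4 and $d=1$ for eq 5; the exact equations are not reproduced in this notebook, so that form, the function name `gaussian_kernel`, and the sample distances are all assumptions.

```python
import numpy as np

def gaussian_kernel(u, v, denom):
    """Assumed form exp(-||u - v||^2 / denom); not copied from eq 4 or eq 5."""
    diff = np.asarray(u, dtype=float) - np.asarray(v, dtype=float)
    return float(np.exp(-np.dot(diff, diff) / denom))

origin = np.zeros(2)
for dist in (0.5, 1.0, 2.0):
    point = np.array([dist, 0.0])
    k_wide = gaussian_kernel(origin, point, denom=2.0)    # denominator as in eq 4
    k_narrow = gaussian_kernel(origin, point, denom=1.0)  # denominator as in eq 5
    print(f"distance {dist}: denom 2 -> {k_wide:.3f}, denom 1 -> {k_narrow:.3f}")
```

Under this assumed form, the denominator-1 kernel is the square of the denominator-2 kernel, so it is smaller at every nonzero distance. Each support vector then influences a smaller neighborhood, which is consistent with the more tailored boundary in fig E.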

### None = D
The fig D decision boundary is consistent with a 2nd order polynomial kernel, but we assigned fig F to the only polynomial kernel.
Furthermore, fig D has problems:
the blue SV at the top right is an outlier;
the boundary should curve a bit more to incorporate 
one to two more red SVs.

## Problem 3

### 3 Background.
The parameter C is a regularizer (Bishop, page 332)
that affects the cost of including support vectors inside the margin.

The slides presented in class 
show four simulations of increasing C (deck 16, "SVM", slides 15-29).
In no example does the decision boundary move as C increases
(except one move that is so slight it could be a drawing artifact).

However, it seems that the decision boundary should move sometimes.
As C increases, the margin decreases, 
and that could disqualify some instances as support vectors.
As the set of support vectors decreases,
it would seem that the decision boundary could get rotated and translated.

Indeed, this is shown in one of our textbooks 
(James, Intro to Statistical Learning).
Unfortunately, the C in James has the inverse meaning.
In class, C is a coefficient of slack, so large C discourages slack.
In James, C is a cap on slack, so large C encourages more slack.
Nevertheless, James shows an example 
where each change in C leads to a narrower margin 
and different support vectors, including dramatic rotations of the decision boundary
(Figure 9.7 on page 348).

### 3 Problems and Solutions
Linear SVM with slack penalties (eq 0.5):  
$\min_{w,b,\xi} \frac{1}{2} w \cdot w + C \sum_{i=1}^n{\xi_i}$  
such that:  
$\xi_i \geq 0$  
and:  
$(w \cdot x_i + b) y_i - (1-\xi_i) \geq 0$
for all i.  
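
To make the role of C concrete, the sketch below evaluates the objective of eq 0.5 at a fixed $(w, b)$, using the smallest slack values that satisfy both constraints, $\xi_i = \max(0, 1 - y_i(w \cdot x_i + b))$. The function name `svm_objective` and the toy data are hypothetical and serve only as an illustration; no optimization is performed.

```python
import numpy as np

def svm_objective(w, b, X, y, C):
    """Evaluate eq 0.5 at (w, b) with the smallest feasible slacks."""
    margins = y * (X @ w + b)
    # Smallest slack meeting both constraints: xi_i = max(0, 1 - y_i (w . x_i + b))
    xi = np.maximum(0.0, 1.0 - margins)
    return 0.5 * np.dot(w, w) + C * xi.sum(), xi

# Hypothetical toy data: two points per class, one of each inside the margin
X = np.array([[2.0, 0.0], [0.5, 0.0], [-2.0, 0.0], [-0.5, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = np.array([1.0, 0.0]), 0.0

for C in (0.1, 1.0, 10.0):
    value, xi = svm_objective(w, b, X, y, C)
    print(f"C={C}: objective={value:.2f}, total slack={xi.sum():.2f}")
```

For a fixed boundary the total slack does not change, so the slack term grows linearly with C. A large C therefore gives the optimizer more incentive to pick a different $(w, b)$ with less slack.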

#### 3.1 As C increases, b will not increase.
POSSIBLE

Consider the case where the SVM already separated the data with no slack.
Then $\xi_i=0$ for all i and the second term of eq 0.5 is zero regardless of C. 
So, increased C will not increase b (or anything else) in this case.

Consider the case where the SVM did incorporate slack.
Then $\xi_i>0$ for some i.
Increased C encourages less total slack and narrower margin.
The decreased margin could induce different support vectors.
The different support vectors could induce a different decision boundary.
The new boundary might have different w and b.
Thus, increased C could increase b in this case.

#### 3.2 As C increases, more points will be misclassified. 
POSSIBLE

Consider the SVM that leaves several points within the margin. 
Increased C increases the cost of the points within the margin.
By eq 0.5, the new SVM must have less total slack and a narrower margin. 
But the new SVM could have fewer support vectors, 
and it could increase the slack of one point,
and that point could even be on the wrong side of the new decision boundary.
Thus, the misclassified count could increase, decrease, 
or remain the same.

#### 3.3 As C increases, the margin will not increase.
TRUE

Consider the SVM that separated the data with no slack.
Increasing C will have no effect on b or w.
The margin will not increase (or decrease) in this case.

Alternately, suppose $\xi_i>0$ for some i.
Thus, the SVM already traded margin width against slack to minimize eq 0.5.
Increasing C could push the SVM to a new minimum,
in which total slack goes down but $\|w\|$ is greater.
Greater $\|w\|$ means smaller margin (margin $=2/\|w\|$). 
Thus, the margin will not increase in this case either.

Thus, the margin does not increase in any case. 
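
The second case can be made precise with a standard exchange argument, sketched here under the assumption that eq 0.5 has a minimizer for each value of C. The symbols $C_1 < C_2$, the minimizers $(w_1, \xi^{(1)})$ and $(w_2, \xi^{(2)})$, and the total slacks $S_j = \sum_i \xi_i^{(j)}$ are introduced only for this sketch.

The constraints do not involve C, so each minimizer is feasible for the other problem:
$$
\tfrac{1}{2}\|w_1\|^2 + C_1 S_1 \leq \tfrac{1}{2}\|w_2\|^2 + C_1 S_2,
\qquad
\tfrac{1}{2}\|w_2\|^2 + C_2 S_2 \leq \tfrac{1}{2}\|w_1\|^2 + C_2 S_1
$$

Adding the two inequalities and cancelling the norm terms gives
$$
(C_2 - C_1)(S_2 - S_1) \leq 0
\quad\Rightarrow\quad
S_2 \leq S_1
$$

Substituting $S_2 - S_1 \leq 0$ into the first inequality gives
$$
\tfrac{1}{2}\|w_1\|^2 \leq \tfrac{1}{2}\|w_2\|^2 + C_1 (S_2 - S_1) \leq \tfrac{1}{2}\|w_2\|^2
\quad\Rightarrow\quad
\frac{2}{\|w_2\|} \leq \frac{2}{\|w_1\|}
$$

So a larger C gives total slack that is no larger and a margin that is no wider.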

## Problem 4

Consider the kernel:  
$k(u,v) = u \cdot v + 4(u \cdot v)^2$  
where the vectors u and v are 2-dimensional. 
This kernel is equal to an inner product φ(u) · φ(v)
for some definition of φ. What is the function φ?

### 4a Toward the Solution
Expand the given equation:  
$k(u,v) $  
$= u_1 v_1 + u_2 v_2 + 4(u_1 v_1)^2 + 8 u_1 v_1 u_2 v_2 + 4(u_2 v_2)^2$  

Our slides (deck 16, "SVM", page 31, "Quadratic Features"),
give an example of mapping 2D vectors to 3D space:  
$\phi(x) = (x_{1}^{2}, \sqrt{2} x_1 x_2, x_{2}^{2})$  
which has the convenient property that   
$k(u,v) = \langle \phi{(u)},\phi{(v)} \rangle = \langle u,v \rangle^2$  

Either formulation works out to:  
$k(u,v) $  
$ = (u_1 v_1)^2 + 2 u_1 v_1 u_2 v_2 + (u_2 v_2)^2$  
which is close to the desired result.

The desired result is achieved by scaling each quadratic feature by 2 
and adding two extra dimensions for the linear term. Thus, the feature extraction function 
$\phi$ maps 2D to 5D, as shown below.

### 4b Solution
$\phi(x) $  
$ = \phi(x_1,x_2)$  
$ = (2 x_{1}^{2}, 2 \sqrt{2} x_1 x_2, 2 x_{2}^{2}, x_1, x_2)$  

### 4c Check
$k(u,v) $  
$ = \langle \phi{(u)},\phi{(v)} \rangle $  
$ = [2 u_1^2, 2 \sqrt{2} u_1 u_2, 2 u_2^2, u_1, u_2] \cdot $
$   [2 v_1^2, 2 \sqrt{2} v_1 v_2, 2 v_2^2, v_1, v_2]$  
$ = 4 (u_1 v_1)^2 + 8 u_1 u_2 v_1 v_2 + 4 (u_2 v_2)^2 + u_1 v_1 + u_2 v_2$  
$ = u \cdot v + 4(u \cdot v)^2$  
Correct!
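
The algebraic check can also be confirmed numerically. This sketch compares the kernel with the inner product of the 5D features on random 2D vectors; the function names `kernel` and `phi` and the random samples are illustrative choices, not part of the problem.

```python
import numpy as np

def kernel(u, v):
    """k(u, v) = u . v + 4 (u . v)^2 from the problem statement."""
    dot = np.dot(u, v)
    return dot + 4.0 * dot ** 2

def phi(x):
    """Feature map from section 4b, mapping a 2D input to 5D."""
    x1, x2 = x
    return np.array([2 * x1**2, 2 * np.sqrt(2) * x1 * x2, 2 * x2**2, x1, x2])

rng = np.random.default_rng(0)
pairs = rng.normal(size=(1000, 2, 2))  # 1000 random (u, v) pairs of 2D vectors
max_gap = max(abs(kernel(u, v) - np.dot(phi(u), phi(v))) for u, v in pairs)
print("largest |k(u,v) - phi(u).phi(v)| over samples:", max_gap)
```

The gap should sit at floating-point round-off level, which agrees with the symbolic check above.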