## SVM Hypothesis
$\min_{\theta}\ h_{\theta}(x) = C*\sum_{i=1}^m[y^{(i)}*cost_1(\theta^T x^{(i)})+(1-y^{(i)})*cost_0(\theta^T x^{(i)})]+\frac{1}{2}\sum^n_{j=1}\theta^2_j$
- $Cost$ function in the equation above: a piecewise-linear function that mimics the shape of the logistic regression cost curve
    - If y = 1, we want $Z = \theta^T x^{(i)} \gg 0$
    - 0: otherwise
    - In the image below: the intercept on the z-axis is not necessarily at $z=1$
![W7-SVM-Cost](Plots/W7-SVM-Cost.png)
- $C$: the coefficient that replaces $\lambda$; when $C$ is large, the classifier becomes very sensitive to outliers, so a single outlier can shift the position of the hyperplane.
    - $C$ can be thought of as $\frac{1}{\lambda}$ in the regularized Logistic Regression model
    - When $C$ is large, or $\lambda$ is small, the model tends to have high variance (overfitting)
    - When $C$ is small, or $\lambda$ is large, the model tends to have high bias (underfitting)
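
The objective above can be sketched in a few lines of plain Python. This is a minimal illustration, not the course's reference code: it assumes the two cost functions are the standard hinge shapes with slope 1 and kinks at $z=1$ and $z=-1$, and that each row of `X` starts with a constant 1 so that `theta[0]` acts as the intercept and is left out of the regularization sum. The data values are made up.

```python
def cost_1(z):
    """Cost for a positive example (y = 1): zero once z >= 1 (assumed hinge shape)."""
    return max(0.0, 1.0 - z)


def cost_0(z):
    """Cost for a negative example (y = 0): zero once z <= -1 (assumed hinge shape)."""
    return max(0.0, 1.0 + z)


def svm_objective(theta, X, y, C):
    """C * (sum of per-example costs) + 0.5 * sum_{j>=1} theta_j^2."""
    data_term = 0.0
    for x_i, y_i in zip(X, y):
        z = sum(t * x for t, x in zip(theta, x_i))  # theta^T x
        data_term += y_i * cost_1(z) + (1 - y_i) * cost_0(z)
    reg_term = 0.5 * sum(t * t for t in theta[1:])  # intercept not regularized
    return C * data_term + reg_term


# Made-up data: the leading 1.0 in every row is the intercept feature.
X = [[1.0, 2.0], [1.0, -2.0], [1.0, 0.5]]
y = [1, 0, 1]
theta = [0.0, 1.0]

print(svm_objective(theta, X, y, C=1.0))    # data term 0.5 + regularization 0.5
print(svm_objective(theta, X, y, C=100.0))  # the same margin violation weighs 100x more
```

Only the third example sits inside the margin ($z = 0.5 < 1$), so it is the only one that contributes to the data term. Raising `C` makes that single violation dominate the objective, which is the sensitivity described above.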
    
### Decision Boundary - Large Margin Classifier
SVM tries to **maximize the margin** (the minimum distance) between the *hyperplane* and the *positive and negative examples*

#### Optimization Function
![W7-SVM-OPTIM](Plots/W7-SVM-OPTIM.png)
- Derived from the hypothesis function
- We assume the costs (the first half of the hypothesis function) become zero at the optimal boundary
    - Thus they are omitted from the objective function
    - Thus we have the two constraints
- Our goal becomes minimizing the sum of the squared $\theta_j$ (through the objective function) while still classifying the examples correctly (through the constraints)
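
For readers without the figure, the resulting problem can be written out as follows. This is a reconstruction from the surrounding text, assuming the cost terms reach zero at $\theta^T x^{(i)} \geq 1$ for positive examples and $\theta^T x^{(i)} \leq -1$ for negative examples:

$$
\min_{\theta}\ \frac{1}{2}\sum_{j=1}^n \theta_j^2
\quad \text{s.t.} \quad
\theta^T x^{(i)} \geq 1 \ \text{ if } y^{(i)} = 1,
\qquad
\theta^T x^{(i)} \leq -1 \ \text{ if } y^{(i)} = 0
$$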

#### Basic Math - Vector Inner Product
![W7-SVM-VECTORPRODUCT](Plots/W7-SVM-VECTORPRODUCT.png)
- $u$ and $v$ are two column vectors
- $||u||$ = length of the vector $u$: for example, $||u|| = \sqrt{u^2_1+u^2_2} \in \mathbb{R}$
- $p$ = signed length of the **projection** of $v$ onto $u$: $p = ||v||*cos(\alpha)$ (negative when $\alpha > 90°$)
- $u^T*v = ||u||*||v||*cos(\alpha) = p*||u|| = \sum_{i=1}^n u_i*v_i$
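
A quick numeric check of the identity $u^T v = p*||u||$, using made-up 2-D vectors. The helper names `dot` and `norm` are introduced here for illustration only.

```python
import math


def dot(a, b):
    """Inner product a^T b."""
    return sum(a_i * b_i for a_i, b_i in zip(a, b))


def norm(a):
    """Euclidean length ||a||."""
    return math.sqrt(dot(a, a))


def projection_length(v, u):
    """Signed length p of the projection of v onto u: ||v|| * cos(alpha)."""
    cos_alpha = dot(u, v) / (norm(u) * norm(v))
    return norm(v) * cos_alpha


u = [4.0, 0.0]
v = [3.0, 3.0]    # angle with u is below 90 degrees, so p > 0
w = [-2.0, 1.0]   # angle with u is above 90 degrees, so p < 0

print(dot(u, v), projection_length(v, u) * norm(u))
print(dot(u, w), projection_length(w, u) * norm(u))
```

In both cases the two printed numbers agree up to floating-point rounding, and the sign of $p$ follows the angle between the vectors.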

#### Transform the Optimization Function w.r.t. $||\theta||$
![W7-SVM-OPTIM2](Plots/W7-SVM-OPTIM2.png)
In the objective function: 
- $\sum_{j=1}^n \theta^2_j = (\sqrt{\sum_{j=1}^n \theta^2_j})^2 = (\sqrt{\theta_1^2+\theta_2^2+...})^2=||\theta||^2$

Constraints:
- $\theta^T x^{(i)} = p^{(i)}*||\theta||$
- Refer to the Vector Inner Product above
- $p^{(i)}$ is the projection of $x^{(i)}$ onto the vector $\theta$

Meaning:
- **The larger the projections $p^{(i)}$, the smaller $||\theta||$ can be while still satisfying the constraints, so minimizing $||\theta||^2$ favors large $p^{(i)}$**

#### Find the Optimal Decision Boundary with Math
![W7-SVM-BOUNDARY](Plots/W7-SVM-BOUNDARY.png)
*Left is non-optimal, and Right is Optimal*

The optimization tries to maximize the margin (the minimum distance between the boundary and the positive and negative data)
- The vector $\theta$ is perpendicular to the decision boundary
- $p^{(i)}$ is the projection of $x^{(i)}$ onto the vector $\theta$
- The objective function tries to make **$p$ as large as possible** in order to minimize $||\theta||$
    - Governed by the constraints: $p^{(i)}*||\theta|| \geq 1$ if $y^{(i)} = 1$, and $p^{(i)}*||\theta|| \leq -1$ if $y^{(i)} = 0$
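
The link between large projections and a small $||\theta||$ can be checked numerically. The sketch below is an illustration under simplifying assumptions: the boundary passes through the origin ($\theta_0 = 0$), the data points are made up, and `min_theta_norm` is a hypothetical helper that returns the smallest $||\theta||$ satisfying the constraints when $\theta$ points along a given direction.

```python
import math


def min_theta_norm(direction, X, y):
    """Smallest ||theta|| that meets p_i * ||theta|| >= 1 (y=1) or <= -1 (y=0).

    Returns None when the direction puts some point on the wrong side,
    because no rescaling of theta can then satisfy the constraints.
    """
    length = math.sqrt(sum(d * d for d in direction))
    unit = [d / length for d in direction]
    margins = []
    for x_i, y_i in zip(X, y):
        p_i = sum(a * b for a, b in zip(x_i, unit))  # signed projection onto theta
        margins.append(p_i if y_i == 1 else -p_i)    # positive means correct side
    smallest = min(margins)
    if smallest <= 0:
        return None
    return 1.0 / smallest


# Made-up, linearly separable data.
X = [[2.0, 2.0], [3.0, 1.0], [-2.0, -2.0], [-1.0, -3.0]]
y = [1, 1, 0, 0]

print(min_theta_norm([1.0, 1.0], X, y))  # every |p| is about 2.83, so ||theta|| can be small
print(min_theta_norm([1.0, 0.0], X, y))  # one point has |p| = 1, so ||theta|| must be 1
```

The direction with the larger minimum projection admits the smaller $||\theta||$, which is why the minimization ends up selecting the large-margin boundary.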
    
## SVM with Kernels
**Enable SVM to work on non-linear boundaries**

### Similarity Function 
Kernels are similarity functions describing the **proximity/similarity between a data point $x$ and one landmark $l^{(i)}$**

### Gaussian Kernel
$f_i = similarity(x,l^{(i)}) = exp(-\frac{||x-l^{(i)}||^2}{2*\sigma^2})$
- Numerator: the squared Euclidean distance between $x$ and a landmark $l^{(i)}$
- Denominator: $2\sigma^2$, where $\sigma^2$ is the variance coefficient
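
The formula translates directly into code. A minimal sketch in plain Python; the landmark and query points are made-up values.

```python
import math


def gaussian_kernel(x, landmark, sigma):
    """f = exp(-||x - l||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, landmark))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


landmark = [3.0, 5.0]
print(gaussian_kernel([3.0, 5.0], landmark, sigma=1.0))   # x equals the landmark
print(gaussian_kernel([3.1, 5.1], landmark, sigma=1.0))   # nearby point
print(gaussian_kernel([9.0, -2.0], landmark, sigma=1.0))  # distant point
```

The first call returns exactly 1, the second a value just below 1, and the third a value that is practically 0.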

#### Kernels and Classification
- If $x$ is close to $l^{(i)}$, i.e. $x\approx l^{(i)}$, then $-\frac{||x-l^{(i)}||^2}{2*\sigma^2}\approx 0$, so $f_i\approx 1$
- If $x$ is far from $l^{(i)}$, then $-\frac{||x-l^{(i)}||^2}{2*\sigma^2}\to -\infty$, so $f_i\approx 0$
- Therefore, we get a value between 0 and 1 that indicates **how close the data point $x$ is to a given landmark**.
- With multiple landmarks and the Kernel functions, we can **plot a non-linear boundary**

![W7-KERNEL-BOUNDARY](Plots/W7-KERNEL-BOUNDARY.png)
*Based on the inequality on the right, we can classify each data point and generate a non-linear boundary*
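
As a rough sketch of how such a rule classifies points, the code below predicts 1 when $\theta_0 + \theta_1 f_1 + \theta_2 f_2 + \theta_3 f_3 \geq 0$. The landmarks, the $\theta$ values, and $\sigma = 1$ are hypothetical numbers chosen for illustration, not values read from the figure.

```python
import math


def gaussian_kernel(x, landmark, sigma=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, landmark))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def predict(x, landmarks, theta, sigma=1.0):
    """Return 1 when theta_0 + sum_i theta_i * f_i >= 0, else 0."""
    features = [gaussian_kernel(x, l, sigma) for l in landmarks]
    score = theta[0] + sum(t * f for t, f in zip(theta[1:], features))
    return 1 if score >= 0 else 0


# Hypothetical landmarks and parameters.
landmarks = [[1.0, 1.0], [4.0, 4.0], [8.0, 1.0]]
theta = [-0.5, 1.0, 1.0, 0.0]

print(predict([1.2, 0.9], landmarks, theta))  # near the first landmark
print(predict([8.0, 1.0], landmarks, theta))  # on the third landmark, whose weight is 0
print(predict([6.0, 8.0], landmarks, theta))  # far from every landmark
```

With these numbers, points near the first or second landmark are predicted positive and everything else negative, so the positive region is a union of two blobs rather than a half-plane.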

### SVM with Kernels
We replace the $x^{(i)}$ in the original SVM with the kernel features $f^{(i)}$:
$\min_{\theta}\ h_{\theta}(x) = C*\sum_{i=1}^m[y^{(i)}*cost_1(\theta^T f^{(i)})+(1-y^{(i)})*cost_0(\theta^T f^{(i)})]+\frac{1}{2}\sum^n_{j=1}\theta^2_j$

#### Choose landmark $l^{(i)}$
We use every example $x^{(i)}$ in the training set as a landmark
- With $m$ training examples, we have $m$ landmarks
- Implementation tip: $\sum_j \theta_j^2 = \theta^T*\theta$
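
A small sketch of this construction: each training example is mapped to a vector of $m$ similarities, one per landmark, which gives an $m \times m$ feature matrix. The function name `kernel_features` and the data are illustrative assumptions.

```python
import math


def gaussian_kernel(a, b, sigma):
    sq_dist = sum((p - q) ** 2 for p, q in zip(a, b))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def kernel_features(X, sigma):
    """Row i holds f^(i) = [f_1, ..., f_m], using the rows of X as landmarks."""
    return [[gaussian_kernel(x_i, l_j, sigma) for l_j in X] for x_i in X]


# Made-up training set with m = 3 examples.
X = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]]
F = kernel_features(X, sigma=1.0)

for row in F:
    print([round(f, 4) for f in row])

# Implementation tip from above: the regularization sum is theta^T theta.
theta = [0.5, -1.0, 2.0]
print(sum(t * t for t in theta))
```

Every example has similarity 1 with its own landmark, so the diagonal of `F` is all ones, and the matrix is symmetric because the distance is.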

#### Choose Variance Coefficient $\sigma$
##### Visualize
![W7-SIGMA-VIZ](Plots/W7-SIGMA-VIZ.png)
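
The effect of $\sigma$ can also be seen numerically. The sketch below fixes a point at distance 2 from a landmark (made-up values) and evaluates the Gaussian kernel for a few widths.

```python
import math


def gaussian_kernel(x, landmark, sigma):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, landmark))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


x, landmark = [2.0, 0.0], [0.0, 0.0]  # fixed distance of 2 from the landmark
similarities = {}
for sigma in (0.5, 1.0, 3.0):
    similarities[sigma] = gaussian_kernel(x, landmark, sigma)
    print(sigma, round(similarities[sigma], 4))
```

A larger $\sigma$ makes the similarity fall off more slowly with distance, so the features vary more smoothly; this generally pushes the model toward higher bias and lower variance, while a small $\sigma$ does the opposite.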

## Implementation Notes
#### Use existing SVM software packages (liblinear, libsvm, etc.) to solve for parameters $\theta$

#### Choice of kernels
- Linear Kernels: effectively a linear classifier
    - Useful to prevent overfitting when you have a large number of features ($n$) but a small number of data points ($m$)
    - Performs very similarly to Logistic Regression
- Gaussian Kernels
    - Useful when you have a small number of features but a large amount of data, so you can fit a more complex decision boundary
    - Need to specify the variance coefficient $\sigma^2$
- Others: polynomial kernel, String Kernel, Chi-Square Kernel, histogram intersection kernel, etc.
- Not every similarity function is a valid kernel
    - It needs to satisfy *Mercer's Theorem* to ensure the optimization converges to the optimum
- Choose the kernel that performs the best on the cross-validation data
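
One possible way to run that comparison is with scikit-learn, whose `SVC` class is built on libsvm. This is a sketch under several assumptions: scikit-learn is installed, the data is a synthetic two-class set, and the candidate values for `C` and `gamma` are arbitrary. Note that scikit-learn parameterizes the Gaussian (RBF) kernel with `gamma`, which corresponds to $\frac{1}{2\sigma^2}$ in the notation above.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic two-class data with a curved boundary (illustrative only).
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

# Scale the features, then fit the SVM. gamma is only searched for the RBF kernel.
pipeline = make_pipeline(StandardScaler(), SVC())
param_grid = [
    {"svc__kernel": ["linear"], "svc__C": [0.1, 1, 10]},
    {"svc__kernel": ["rbf"], "svc__C": [0.1, 1, 10], "svc__gamma": [0.1, 1, 10]},
]

# 5-fold cross-validation over every kernel / C / gamma combination.
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

`best_params_` reports the kernel and hyperparameters with the highest mean cross-validation accuracy; a separate held-out test set would still be needed for an unbiased final estimate.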

#### Feature Scaling on all features before using the Gaussian kernel
- The calculation of the kernel includes the Euclidean distance between $x$ and the landmark
- A feature with a large range may dominate the distance
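
A small sketch of the problem and one common fix, standardization to zero mean and unit variance. The data (area in square feet, number of bedrooms) is made up, and `standardize` assumes no column is constant.

```python
import math


def standardize(X):
    """Rescale each column to zero mean and unit (population) standard deviation."""
    cols = list(zip(*X))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c))
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in X]


def share_of_last_feature(a, b):
    """Fraction of the squared distance ||a - b||^2 contributed by the last feature."""
    parts = [(p - q) ** 2 for p, q in zip(a, b)]
    return parts[-1] / sum(parts)


# Hypothetical houses: [area in square feet, number of bedrooms].
X = [[1000.0, 1.0], [1200.0, 5.0], [2000.0, 1.0]]
X_scaled = standardize(X)

print(share_of_last_feature(X[0], X[1]))                # bedrooms barely register
print(share_of_last_feature(X_scaled[0], X_scaled[1]))  # bedrooms now matter
```

Before scaling, a difference of four bedrooms contributes well under 1% of the squared distance between the first two houses; after scaling it contributes most of it.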

#### Multi-class Classification
- Many SVM packages already have built-in multi-class functionality
- Otherwise, use one-vs-all methods
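
A minimal sketch of the one-vs-all prediction step, assuming one binary classifier has already been trained per class and that each is summarized by a parameter vector. The class labels and $\theta$ values here are hypothetical.

```python
def one_vs_all_predict(x, thetas):
    """Pick the class whose classifier gives the largest score theta_k^T x."""
    scores = {
        label: sum(t * v for t, v in zip(theta, x))
        for label, theta in thetas.items()
    }
    return max(scores, key=scores.get)


# Hypothetical parameters for three classifiers; x[0] = 1 is the intercept feature.
thetas = {
    "cat": [-1.0, 2.0, -1.0],
    "dog": [-1.0, -1.0, 2.0],
    "bird": [0.5, -1.0, -1.0],
}

print(one_vs_all_predict([1.0, 2.0, 0.0], thetas))
print(one_vs_all_predict([1.0, 0.0, 2.0], thetas))
print(one_vs_all_predict([1.0, -1.0, -1.0], thetas))
```

Each classifier is trained to separate its own class from all the others, and at prediction time the most confident one wins.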

#### Logistic Regression vs. SVM
- If the number of features $n$ is large (relative to the size of the data), use logistic regression, or SVM with a linear kernel
- If $n$ is small (1-1K) and $m$ is intermediate (10-10K), use SVM with a Gaussian kernel
- If $n$ is small (1-1K) and $m$ is large (50K+), add more features, then use logistic regression, or SVM with a linear kernel
- A neural network is likely to work well in most of these settings, but may be slower to train