Support Vector Machines (SVM)

1) **Maximum Margin Classifier** (Old SVM) - a hyperplane that separates two classes of data points and is equidistant from the closest points of each class. The margin between the hyperplane and the closest data points is maximal.
* Creating a different type of decision boundary
* **Hyperplane** - linear separation of $p$-dimensional data points with a $(p-1)$-dimensional classifier
![hyperplane](https://upload.wikimedia.org/wikipedia/commons/b/b5/Svm_separating_hyperplanes_%28SVG%29.svg)
    * In general, an n-dimensional space can be separated by a $(n-1)$-dimensional hyperplane
        * example: split a plane (2D) with a line (1D)
            * $\beta_0+\beta_1x_{i1}+\beta_2x_{i2}>0$ when $y_i=+1$
            * $\beta_0+\beta_1x_{i1}+\beta_2x_{i2}<0$ when $y_i=-1$
        * example: split a space (3D) with a plane (2D)
    * In an $n$-dimensional space any hyperplane can be defined by $w \in \mathbb{R}^n$ and $b \in \mathbb{R}$. The hyperplane includes all $x \in \mathbb{R}^n$ where:
    * $w_0x_0+w_1x_1+\cdots+w_{n-1}x_{n-1}-b=w \cdot x-b=0$
        * $w$ and $b$ define a hyperplane
        * $\frac{w}{\Vert w \Vert}$ is the hyperplane's normal vector
        * $\frac{b}{\Vert w \Vert}$ is the hyperplane's distance from the origin
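
The hyperplane quantities above can be checked numerically. This is a minimal sketch in plain Python; the 2D line $x_0+x_1-2=0$ and the test points are hypothetical values picked for illustration, and `signed_distance` is a helper name introduced here, not something from the notes.

```python
import math

def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))

def norm(u):
    """Euclidean norm."""
    return math.sqrt(dot(u, u))

def signed_distance(w, b, x):
    """Signed distance from x to the hyperplane w . x - b = 0."""
    return (dot(w, x) - b) / norm(w)

# Hypothetical example: the line x0 + x1 - 2 = 0 in the plane
w = [1.0, 1.0]
b = 2.0

unit_normal = [wi / norm(w) for wi in w]   # w / ||w||
origin_distance = b / norm(w)              # b / ||w||

side_pos = signed_distance(w, b, [3.0, 3.0])  # w . x - b > 0
side_neg = signed_distance(w, b, [0.0, 0.0])  # w . x - b < 0
on_plane = signed_distance(w, b, [1.0, 1.0])  # w . x - b = 0

print(unit_normal, origin_distance, side_pos, side_neg, on_plane)
```

The sign of `signed_distance` tells which side of the hyperplane a point falls on, which is exactly the $>0$ / $<0$ rule used for $y_i=+1$ and $y_i=-1$.
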
* Defining Margin - the distance from the hyperplane to the nearest training data point
    * Maximize Margin - larger margin means better generalization (lower variance)
    * Goal of Maximum Margin Classifier - calculate $w$ and $b$ of the hyperplane such that the classes are split correctly and the margin is maximized
        * equation: $\big|w\cdot x^{(i)}-b\big|=1$ where $x^{(i)}$ is the closest point to the hyperplane
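        * What happens to the hyperplane when we scale $w$ and $b$ by some factor $c$?
            * Nothing: for any $c \neq 0$, $(cw, cb)$ describes the same hyperplane, so we are free to pick the scale at which the closest point satisfies the equation above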
    * If $x^{(i)}$ is the closest point to the hyperplane, then the distance from $x^{(i)}$ to the hyperplane is our margin. What is that distance? Take any point $x$ on the hyperplane (so $w\cdot x-b=0$) and project $x^{(i)}-x$ onto the unit normal:
        * margin  $\rightarrow \begin{aligned} d 
            & = \Big|\frac{w}{\Vert w \Vert} \cdot (x^{(i)}-x)\Big| \\
            & = \frac{\big|w\cdot x^{(i)}-w\cdot x\big|}{\Vert w \Vert} \\
            & = \frac{\big|(w\cdot x^{(i)}-b)-(w\cdot x-b)\big|}{\Vert w \Vert} \\
            & = \frac{1}{\Vert w \Vert}
            \end{aligned}$
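    * Initial idea: Maximize $\frac{1}{\Vert w \Vert}$
        * However, this objective is non-convex, which makes it hard to optimize directly
    * Reformulated: Minimize $\frac{1}{2}\Vert w \Vert^2$ subject to $y_i(w\cdot x^{(i)}-b)\geq 1$ for every training point, an equivalent convex quadratic program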
* **Support Vectors** - the training points that touch the margin; they alone define the maximum margin hyperplane
![max_margin_hyperplane](https://upload.wikimedia.org/wikipedia/commons/2/2a/Svm_max_sep_hyperplane_with_margin.png)
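
To make the margin and support vectors concrete, here is a small sketch in plain Python. The four training points and the hyperplane $(w, b)$ are hand-picked assumptions (not fitted by a solver), chosen so that the closest points satisfy $|w\cdot x^{(i)}-b|=1$.

```python
import math

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

# Hypothetical toy data: two positive and two negative points
X = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (-1.0, 0.0)]
y = [1, 1, -1, -1]

# Hand-picked hyperplane in canonical form (closest points give |w.x - b| = 1)
w, b = (0.5, 0.5), 1.0

def geometric_margin(w, b, X):
    """Distance from the hyperplane to the nearest point in X."""
    norm_w = math.sqrt(dot(w, w))
    return min(abs(dot(w, x) - b) for x in X) / norm_w

# y_i (w . x_i - b) >= 1 for every point means all constraints hold
functional = [yi * (dot(w, xi) - b) for xi, yi in zip(X, y)]

# Support vectors are the points where the constraint is tight (equals 1)
support_vectors = [xi for xi, f in zip(X, functional) if abs(f - 1.0) < 1e-9]

margin = geometric_margin(w, b, X)

# Scaling (w, b) by c leaves the hyperplane and its margin unchanged
c = 10.0
scaled_margin = geometric_margin(tuple(c * wi for wi in w), c * b, X)

print(functional, support_vectors, margin, scaled_margin)
```

In canonical form the computed margin equals $\frac{1}{\Vert w \Vert}$, and only the two points with a tight constraint would move the hyperplane if they were shifted.
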

2) Soft Margins and the "Kernel Trick" (Modern SVM)
* **Soft Margin (Support Vector Classifier)** - an extension of the Maximum Margin Classifier that allows some points to violate the margin; the constant $C$ sets the penalty for those violations (misclassification error)
    * Useful for data that is not linearly separable, noisy, has outliers
    * **Large $C$ (Harder margins)** - values classification accuracy on the training data over a large margin (generally: low bias, high variance)
    * **Small $C$ (Softer margins)** - values a large margin over classification accuracy on the training data (generally: high bias, low variance)
        * Because it tolerates margin violations, it can still fit datasets where the classes are not separable
        ![softer_margin_in_sep](inseparable_data_softer_margin.png)
        * Softer margins produce a better margin when there are outliers, separating the data in a more generalizable way
        ![softer_margin_in_out](soft_margins_outlier.png)
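
The effect of $C$ can be sketched with the standard hinge-loss form of the soft-margin objective, $\frac{1}{2}\Vert w \Vert^2 + C\sum_i \max(0,\, 1-y_i(w\cdot x^{(i)}-b))$, which these notes do not spell out. The 1D data, the outlier, and the two candidate hyperplanes below are assumptions made up for illustration; the code only compares two fixed candidates and is not an optimizer.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def soft_margin_objective(w, b, X, y, C):
    """0.5 * ||w||^2 + C * (sum of hinge losses on margin violations)."""
    reg = 0.5 * dot(w, w)
    hinge = sum(max(0.0, 1.0 - yi * (dot(w, xi) - b)) for xi, yi in zip(X, y))
    return reg + C * hinge

# Hypothetical 1D data; the last positive point (1.5) is an outlier
X = [(0.0,), (1.0,), (4.0,), (5.0,), (1.5,)]
y = [-1, -1, 1, 1, 1]

candidates = {
    # separates every point, but the margin is tiny (boundary at x = 1.25)
    "narrow": ((4.0,), 5.0),
    # ignores the outlier and keeps a wide margin (boundary at x = 2.5)
    "wide": ((2.0 / 3.0,), 5.0 / 3.0),
}

def preferred(C):
    """Name of the candidate with the lower objective for this C."""
    scores = {
        name: soft_margin_objective(w, b, X, y, C)
        for name, (w, b) in candidates.items()
    }
    return min(scores, key=scores.get)

print("C = 0.1 prefers:", preferred(0.1))
print("C = 100 prefers:", preferred(100.0))
```

With a small $C$ the penalty for the outlier is cheap, so the wide margin wins; with a large $C$ the same violation becomes expensive and the narrow, fully separating hyperplane wins.
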
* **Kernel Trick (Support Vector Machine)** - implicitly mapping the data into a higher-dimensional feature space, where data that is inseparable in the original space can become separable. The coordinates of the data in that space are never computed; only the inner products between the images of all pairs of data points in the feature space are.
![kernel_trick_dimension](kernel_trick_dimension.png)
    * The resulting algorithm is formally similar, except that every dot product is replaced by a nonlinear kernel function: $K(x^{(i)},x^{(j)})=\phi(x^{(i)})\cdot\phi(x^{(j)})\in \mathbb{R}$
        * saves some computation since we never need to compute $\phi$
        * opens new possibilities since a kernel can operate in infinite dimensions
* Kernel Functions:
    1. **Polynomial Kernel** - represents the similarity of vectors (training samples) in a feature space over polynomials of the original variables; equivalent to expanding the feature space with new polynomial features
![poly_kernel](poly_kernel.png)
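        * equation: $K(x^{(i)},x^{(j)})=(x^{(i)}\cdot x^{(j)}+c)^d$, with constant $c \geq 0$
        * feature expansion for $d=2$ (up to constant factors and a constant term): $(X_1,X_2) \rightarrow (X_1,X_2,X_1^2,X_2^2,X_1X_2)$
        * requires an extra hyperparameter, $d$, for "degree"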
    2. **(Gaussian) Radial Basis Function (RBF) Kernel** - equivalent to a dot product in an infinite-dimensional Hilbert space
![rbf_kernel](rbf_kernel.png)
        * equation: $K(x^{(i)},x^{(j)})=e^{-\gamma\Vert x^{(i)}-x^{(j)}\Vert^2}$
        * requires an extra hyperparameter, $\gamma$, for "gamma"
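
The two kernels can be sketched in a few lines of plain Python. The degree-2 check below assumes the common form $(x\cdot z+1)^2$ and a matching explicit feature map `phi_degree2` (a helper introduced here for illustration); the sample vectors and $\gamma$ values are arbitrary.

```python
import math

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def poly_kernel(x, z, degree=2, coef0=1.0):
    """Polynomial kernel (x . z + coef0) ** degree."""
    return (dot(x, z) + coef0) ** degree

def phi_degree2(x):
    """Explicit feature map whose dot product equals (x . z + 1) ** 2 in 2D."""
    x1, x2 = x
    r2 = math.sqrt(2.0)
    return (1.0, r2 * x1, r2 * x2, x1 * x1, x2 * x2, r2 * x1 * x2)

def rbf_kernel(x, z, gamma=0.5):
    """Gaussian RBF kernel exp(-gamma * ||x - z||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

x, z = (1.0, 2.0), (3.0, -1.0)

k_implicit = poly_kernel(x, z)                     # never builds phi
k_explicit = dot(phi_degree2(x), phi_degree2(z))   # builds phi, then dots

print(k_implicit, k_explicit)
print(rbf_kernel(x, z, gamma=0.5), rbf_kernel(x, z, gamma=2.0))
```

The two polynomial values agree, which is the kernel trick in miniature: the kernel gives the feature-space dot product without constructing the features. For the RBF kernel, a larger $\gamma$ makes similarity fall off faster with distance.
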
* Unbalanced Classes with SVM - adjust class weights to be inversely proportional to class frequencies, so that errors on the rarer class are penalized more heavily
![unbalanced_classes_svm](unbalanced_classes_svm.png)
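
One common way to compute such weights is `n_samples / (n_classes * class_count)`, which is the heuristic scikit-learn documents for `class_weight="balanced"`. The sketch below reimplements that formula in plain Python; the 10/90 label split and the function name `balanced_class_weights` are assumptions for illustration.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}

# Hypothetical unbalanced labels: 10 positives, 90 negatives
labels = [1] * 10 + [-1] * 90
weights = balanced_class_weights(labels)

print(weights)
```

The rare class gets the larger weight, and weight times class count comes out equal for both classes, so each class contributes the same total penalty. In an SVM the weight effectively scales $C$ per class.
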

3) Bias-Variance Tradeoff / SVM vs Logistic Regression
* **Bias** (For SVM)
    * example: a linear SVM looks for dividing hyperplanes in the input space *only*
    * for complex data, high-bias models often *underfit the data*
* **Variance** (For SVM)
    * example: an RBF SVM looks for dividing hyperplanes in an infinite-dimensional space
    * for simple data, high-variance models often *overfit* the data
* SVM vs Logistic Regression:
    * SVM maximizes the **margin** (whereas Logistic Regression maximizes the binomial log-likelihood)
    * (+) When classes are nearly separable, SVMs tend to do better than Logistic Regression
        * When they are not, Logistic Regression with a Ridge penalty and SVMs perform similarly
    * (-) When estimating probabilities, Logistic Regression is a better choice
    * (+) SVMs work well with kernels, whereas Logistic Regression with kernels can get too computationally expensive
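
One way to see the difference is to compare the per-point losses the two models minimize. This sketch assumes the standard hinge loss for the SVM and the negative log-likelihood (log loss) for Logistic Regression, both written as functions of the signed score $y_i f(x^{(i)})$; neither formula appears in the notes above, and the sample scores are arbitrary.

```python
import math

def hinge_loss(margin):
    """SVM loss: zero once a point is beyond the margin (margin >= 1)."""
    return max(0.0, 1.0 - margin)

def log_loss(margin):
    """Logistic loss: negative log-likelihood, positive for every point."""
    return math.log1p(math.exp(-margin))

def sigmoid(score):
    """Logistic Regression turns a raw score into a probability."""
    return 1.0 / (1.0 + math.exp(-score))

# margin = y * f(x): negative means misclassified
margins = [-1.0, 0.0, 1.0, 3.0]
table = [(m, hinge_loss(m), log_loss(m)) for m in margins]

for m, h, l in table:
    print(f"margin={m:+.1f}  hinge={h:.3f}  log={l:.3f}  p={sigmoid(m):.3f}")
```

The hinge loss ignores points that are safely past the margin, so only points near the boundary matter, while the log loss keeps rewarding larger scores and comes with a probability through the sigmoid.
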

4) Grid Search CV (Hyperparameter Tuning)
![grid_search_C_gamma](grid_search_C_gamma.png)
* Find $C$ and $\gamma$ by searching through values we expect might work well
* Use cross-validation accuracy to determine which values are best
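
A minimal sketch of this search using scikit-learn's `GridSearchCV` with an RBF `SVC`. The synthetic `make_moons` dataset, the candidate values for $C$ and $\gamma$, and the 5-fold setting are assumptions chosen to keep the example small; real grids usually span more orders of magnitude.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Hypothetical data: two interleaving half-circles, not linearly separable
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

# Candidate values we expect might work well (log-spaced)
param_grid = {
    "C": [0.1, 1, 10, 100],
    "gamma": [0.01, 0.1, 1, 10],
}

# Every (C, gamma) pair is scored by 5-fold cross-validation accuracy
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

best_params = search.best_params_
best_score = search.best_score_

print(best_params, round(best_score, 3))
```

`search.cv_results_` holds the mean accuracy for every grid cell, which is the kind of data a $C$ vs. $\gamma$ heatmap is drawn from.
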

5) SVM intuition
* Components of SVM:
    1. Hyperplane that separates data as well as possible
    2. Allowing some room for error (soft margin, $C$)
    3. Using kernels to accommodate non-linear class boundaries