# Thursday: Support Vector Machines (SVM)

Find a (hyper)plane that separates the classes in feature space.

**Terminology:**

* hyperplane (in $p$ dimensions): a flat affine subspace of dimension $p-1$
* in $p$ dimensions the mathematical equation of a hyperplane is:
$$\Large \beta_{0}+\beta_{1}X_{1}+\beta_{2}X_{2}+\cdots + \beta_{p}X_{p}=0$$
* if $\beta_0 = 0$ the hyperplane goes through the origin
* the normal vector $\beta = (\beta_1, \beta_2, \ldots, \beta_p)$ is the vector that is orthogonal (makes a 90 degree angle) to the hyperplane. It is used to calculate the distance of a point from the hyperplane.
* Writing $f(X)$ for the left-hand side of the equation, $f(X) > 0$ holds for points on the side the normal vector points to and $f(X) < 0$ for points on the other side.
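
The terminology above can be made concrete with a few lines of plain Python. This is only a minimal sketch: the hyperplane coefficients and the three points are made up for illustration, and `f` and `signed_distance` are hypothetical helper names.

```python
import math

def f(beta0, beta, x):
    """Evaluate f(x) = beta0 + beta_1*x_1 + ... + beta_p*x_p."""
    return beta0 + sum(b * xj for b, xj in zip(beta, x))

def signed_distance(beta0, beta, x):
    """Signed distance from x to the hyperplane f(x) = 0.

    Dividing by the length of the normal vector turns f(x) into a distance.
    """
    norm = math.sqrt(sum(b * b for b in beta))
    return f(beta0, beta, x) / norm

# Made-up hyperplane in 2 dimensions: -5 + 3*x1 + 4*x2 = 0
beta0, beta = -5.0, [3.0, 4.0]

on_plane = f(beta0, beta, [3.0, -1.0])                 # point lies on the hyperplane
pos_side = signed_distance(beta0, beta, [3.0, 4.0])    # side the normal vector points to
neg_side = signed_distance(beta0, beta, [0.0, 0.0])    # opposite side

print(on_plane, pos_side, neg_side)
```

The sign of the result tells on which side of the hyperplane a point lies, and the absolute value tells how far away it is.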

![image.png](attachment:60ddbe99-3e4a-471b-aaff-eb6add0f4da2.png)

The optimal hyperplane is the one with the biggest margin between the two classes. We constrain the sum of the squared betas to be one (i.e. the normal vector has length one), so that the left-hand side of the margin constraint further down is exactly the distance of a point from the hyperplane.
$$\Large \sum_{j=1}^{p}\beta_{j}^{2}=1$$

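We then maximize the margin $M$, the smallest distance from the hyperplane, where $y_i \in \{-1, 1\}$ is the class label and the constraint has to hold for every observation $i$: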
$$\Large y_{i}(\beta_{0}+\beta_{1}x_{i1}+\ldots+\beta_{p}x_{i p})\geq M$$

One can see in the image and from the formula that the hyperplane fitted to the data depends on only a few data points closest to the margin, hence the name support vectors (which these points are). The maximal margin classifier is thus sensitive to individual observations (prone to overfitting), and it has no solution at all when the classes overlap.
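
To illustrate the margin and the support vectors, the sketch below compares two candidate hyperplanes on a made-up, separable toy dataset. It does not solve the optimization problem; it only evaluates the margin $M$ of hyperplanes that are given by hand. The data, the two candidates and the helper names are assumptions for illustration.

```python
import math

def scaled_distances(beta0, beta, X, y):
    """Return y_i * f(x_i) for every point, rescaled so that sum(beta_j^2) = 1."""
    norm = math.sqrt(sum(b * b for b in beta))
    return [
        yi * (beta0 + sum(b * xj for b, xj in zip(beta, xi))) / norm
        for xi, yi in zip(X, y)
    ]

def margin_and_support(beta0, beta, X, y, tol=1e-9):
    """Margin M = smallest distance; support = indices of the points that attain it."""
    d = scaled_distances(beta0, beta, X, y)
    M = min(d)
    support = [i for i, di in enumerate(d) if abs(di - M) < tol]
    return M, support

# Made-up, linearly separable toy data with labels in {-1, +1}
X = [[1.0, 1.0], [2.0, 0.0], [0.0, 0.0], [4.0, 4.0], [5.0, 3.0], [6.0, 6.0]]
y = [-1, -1, -1, 1, 1, 1]

# Two hand-picked candidate hyperplanes that both separate the classes
M_a, support_a = margin_and_support(-5.0, [1.0, 1.0], X, y)  # x1 + x2 - 5 = 0
M_b, support_b = margin_and_support(-3.0, [1.0, 0.0], X, y)  # x1 - 3 = 0

print(M_a, support_a)
print(M_b, support_b)
```

Candidate `a` has the larger margin, and only the points closest to it determine that margin. Moving any of the other points (without crossing the margin) would not change the result.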

**What if the data is non-separable?**

Non-separable data does not allow us to construct a hard margin, so we use a soft margin. We introduce a slack term $\large \epsilon_i$ for every observation, which indicates how far that point may be on the wrong side of the margin:

$$\Large y_{i}(\beta_{0}+\beta_{1}x_{i1}+\beta_{2}x_{i2}+\cdots+\beta_{p}x_{ip})\geq M(1-\epsilon_{i})$$

We then set a budget $C$ for the total amount of "slack" $\large \epsilon$. This is denoted by:
$$\Large {{\epsilon_{i}}\geq0,}\ \sum_{i=1}^{n}\epsilon_{i}\leq C$$

We then again make the margin as big as possible, but now points may lie within the margin, as long as the sum of their slack terms stays within the budget. The budget $C$ thus becomes a tuning parameter, which tunes the bias/variance trade-off as seen below.
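
The role of the slack terms and the budget can be sketched in plain Python. The values of $y_i f(x_i)$, the margin `M` and the budgets are made up, and `slack_values`, `describe` and `within_budget` are hypothetical helper names.

```python
def slack_values(signed_margins, M):
    """Smallest epsilon_i >= 0 that satisfies y_i * f(x_i) >= M * (1 - epsilon_i)."""
    return [max(0.0, 1.0 - m / M) for m in signed_margins]

def describe(eps):
    """Translate a slack value into the position of the point."""
    if eps == 0:
        return "on the correct side of the margin"
    if eps <= 1:
        return "violates the margin"
    return "on the wrong side of the hyperplane"

def within_budget(eps, C):
    """Check the budget constraint sum(epsilon_i) <= C."""
    return sum(eps) <= C

# Made-up values of y_i * f(x_i) for four observations and a margin M = 1
signed_margins = [2.0, 1.0, 0.5, -0.5]
eps = slack_values(signed_margins, M=1.0)

for e in eps:
    print(e, describe(e))

print(within_budget(eps, C=2.5))  # total slack fits in this budget
print(within_budget(eps, C=1.0))  # this budget is too small for the same margin
```

With a small budget the optimizer has to shrink the margin until the slack fits, and with a large budget it can afford a wider margin with more violations.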

*IMPORTANT NOTE: the features need to be standardized, because the distance calculations treat all features as equals, so a feature measured on a larger scale would otherwise dominate*
![image.png](attachment:20f04f25-bbf9-44c8-8a9b-e77f07665845.png)
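
The standardization mentioned in the note can be sketched as follows. This is a minimal z-score example using only the standard library; the two feature columns (`income` and `age`) are made up for illustration.

```python
import statistics

def standardize(column):
    """Rescale a feature to mean 0 and standard deviation 1 (z-scores)."""
    mean = statistics.mean(column)
    sd = statistics.pstdev(column)  # population standard deviation
    return [(v - mean) / sd for v in column]

# Two made-up features on very different scales
income = [20000.0, 40000.0, 60000.0]
age = [20.0, 40.0, 60.0]

z_income = standardize(income)
z_age = standardize(age)

print(z_income)
print(z_age)
```

Before standardizing, differences in `income` are a thousand times larger than differences in `age`. Afterwards both features contribute on the same scale.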


**Non-linear decision boundaries**

A simple way of improving the fit is to expand the features with polynomial terms, so that a linear boundary in the enlarged feature space becomes a non-linear boundary in the original space. However, these polynomials get complex very fast, i.e. the number of terms to estimate grows rapidly with the number of features and the degree. Hence the use of kernels.
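
How fast the number of terms grows can be checked with a short calculation. A full polynomial of degree $d$ in $p$ features has $\binom{p+d}{d}$ terms (including the constant). The helper name `n_poly_terms` and the chosen values of $p$ and $d$ are only illustrative.

```python
from math import comb

def n_poly_terms(p, d):
    """Number of monomials of degree <= d in p variables, constant included."""
    return comb(p + d, d)

# Made-up combinations of number of features p and degree d
for p, d in [(2, 2), (10, 3), (100, 3)]:
    print(p, d, n_poly_terms(p, d))
```

For $p = 2$ and $d = 2$ these are the six terms $1, x_1, x_2, x_1^2, x_2^2, x_1 x_2$.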

*What are inner products*

The inner product (dot product) of two vectors:
$$\Large \langle x_{i},x_{i^{\prime}}\rangle=\sum_{j=1}^{p}x_{i j}x_{i^{\prime}j}$$

This is used to write the linear support vector classifier, which has $n$ parameters $\alpha_i$ (one per training observation). To estimate the parameters $\hat{\alpha}_i$ we need all the $\binom{n}{2}$ pairwise inner products between the $n$ points in the dataset:

$$\Large f(x)=\beta_{0}+\sum_{i=1}^{n}\alpha_{i}\langle x,x_{i}\rangle$$

Most of these alphas are zero; the observations for which they are non-zero form the set $S$, which is called the support set.
The inner products are often replaced by **kernels**, which are special functions of two points. The polynomial kernel computes the inner product needed for a polynomial of degree $d$ directly from the $p$ original features:
$$\Large K(x_{i},x_{i^{\prime}})=(1+\sum_{j=1}^{p}x_{i j}x_{i^{\prime}j})^{d}$$

A kernel like this works implicitly with the basis functions ($1, x, x^2, x^3, ...$) which make up the polynomial, without computing them one by one. The resulting polynomial is used to classify the data in the enlarged feature space. There are many types of kernels, e.g. linear, polynomial, radial basis function (RBF), and sigmoid. In the classifier only the terms with non-zero $\alpha_i$ are kept, so the sum runs over the support set $S$:
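
That the polynomial kernel really equals an inner product of basis functions can be verified numerically for $p = 2$ and $d = 2$. In this sketch the explicit feature map `phi` is written out by hand (the $\sqrt{2}$ factors are what makes the two sides match), and the two points are made up.

```python
import math

def inner(x, z):
    """Inner product <x, z> = sum_j x_j * z_j."""
    return sum(a * b for a, b in zip(x, z))

def poly_kernel(x, z, d=2):
    """Polynomial kernel K(x, z) = (1 + <x, z>)^d."""
    return (1 + inner(x, z)) ** d

def phi(x):
    """Explicit degree-2 basis functions for p = 2 features."""
    x1, x2 = x
    r2 = math.sqrt(2)
    return [1.0, r2 * x1, r2 * x2, x1 * x1, x2 * x2, r2 * x1 * x2]

# Two made-up points
x, z = [1.0, 2.0], [3.0, -1.0]

k_value = poly_kernel(x, z)            # computed from the 2 original features
explicit = inner(phi(x), phi(z))       # computed from the 6 expanded features

print(k_value, explicit)
```

Both routes give the same number, but the kernel never has to build the expanded features, which is what keeps high degrees affordable.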

$$\Large f(x)=\beta_{0}+\sum_{i\in{\cal S}}\alpha_{i}{\cal K}(x,x_{i})$$
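
The formula above can be evaluated directly once the support set is known. The sketch below assumes an RBF kernel of the form $K(x, x') = \exp(-\gamma \lVert x - x' \rVert^2)$; the support points, the values of $\alpha_i$, $\beta_0$ and $\gamma$ are made up rather than fitted, and the $\alpha_i$ carry the sign of the class as in the formula.

```python
import math

def rbf_kernel(x, z, gamma=0.5):
    """RBF kernel exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

def decision(x, beta0, support_points, alphas, kernel):
    """f(x) = beta0 + sum over the support set of alpha_i * K(x, x_i)."""
    return beta0 + sum(a * kernel(x, s) for a, s in zip(alphas, support_points))

# Made-up support set: one point per class, not the result of a fit
support_points = [[0.0, 0.0], [2.0, 2.0]]
alphas = [-1.0, 1.0]
beta0 = 0.0

f_near_first = decision([0.0, 0.0], beta0, support_points, alphas, rbf_kernel)
f_near_second = decision([2.0, 2.0], beta0, support_points, alphas, rbf_kernel)
f_halfway = decision([1.0, 1.0], beta0, support_points, alphas, rbf_kernel)

print(f_near_first, f_near_second, f_halfway)
```

The sign of `decision` gives the predicted class. A point exactly halfway between the two support points lands on the decision boundary in this toy setup.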

![image.png](attachment:6a52311a-f62a-41ea-938b-b145f408836d.png)

**More than 2 classes**

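* OVA: one-versus-all, fit one two-class classifier $\hat{f}_k(x)$ for each class against all other classes, so $K$ classifiers in total (here $K$ is the number of classes, not the kernel). Classify the observation $x^*$ to the class for which $\hat{f}_k(x^*)$ is the largest, i.e. the distance from the margin is largest.
* OVO: one-versus-one, fit a classifier for every pair of classes, which leads to $\large \binom{K}{2}$ classifiers (the number of all pairwise combinations). Classify the observation to the class that wins the most pairwise comparisons.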
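
Both strategies boil down to a simple decision rule once the two-class classifiers are fitted. The sketch below only shows that rule: the class names, the OVA scores and the OVO pairwise winners are made up, and no classifier is actually trained.

```python
from itertools import combinations
from math import comb

def predict_ova(scores):
    """OVA: pick the class k with the largest score f_k(x*)."""
    return max(scores, key=scores.get)

def predict_ovo(classes, pairwise_winner):
    """OVO: count the wins over all pairs and pick the class with the most wins."""
    votes = {c: 0 for c in classes}
    for pair in combinations(classes, 2):
        votes[pairwise_winner[pair]] += 1
    return max(votes, key=votes.get), votes

classes = ["a", "b", "c", "d"]

# Made-up OVA scores f_k(x*) for one observation x*
ova_scores = {"a": -0.3, "b": 1.2, "c": 0.4, "d": -1.0}

# Made-up outcome of each pairwise classifier for the same observation
pairwise_winner = {
    ("a", "b"): "b", ("a", "c"): "c", ("a", "d"): "a",
    ("b", "c"): "b", ("b", "d"): "b", ("c", "d"): "c",
}

ova_class = predict_ova(ova_scores)
ovo_class, votes = predict_ovo(classes, pairwise_winner)
n_pairs = comb(len(classes), 2)

print(ova_class, ovo_class, votes, n_pairs)
```

With four classes OVA needs 4 classifiers and OVO needs 6, so OVO grows faster with the number of classes, but each of its classifiers is fitted on the data of only two classes.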


When to use SVM or Logistic Regression: 
* SVM is better when classes are nearly separable
* LR (using a ridge or lasso penalty) is better when they are not separable.
* logistic regression estimates class probabilities directly; an SVM does not.
