# Maximal Margin Classifier
- The maximal margin classifier (MMC) applies when the classes are linearly separable, i.e. a hyperplane can separate them perfectly
- In that case the hyperplane is placed so that the margin, its distance to the nearest training points (the support vectors), is as large as possible
$$ \max\limits_{\beta_0, \beta_1, \dots, \beta_p, M} M$$
$$ \text{subject to}\ \sum_{j=1}^{p} \beta_j^2 = 1 $$
$$y_i (\beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + \dots + \beta_px_{ip} )\geq M \qquad \forall \quad i = 1, \dots, n $$

- Here
- $M$ is the margin: the distance from the hyperplane to the closest training points (a true distance because of the unit-norm constraint on the $\beta_j$)
- $y_i$ is the class label, with value $+1$ or $-1$
- $(\beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + \dots + \beta_px_{ip} ) = 0$ defines the hyperplane
- $(\beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + \dots + \beta_px_{ip} ) > 0$ corresponds to the positive class
- $(\beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + \dots + \beta_px_{ip} ) < 0$ corresponds to the negative class
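
As a rough illustration of the constraints above, the sketch below computes the margin $M$ that a given hyperplane achieves on a small data set. The data points and the two candidate hyperplanes are made up for illustration, and the helper name `margin` is hypothetical. It does not solve the optimisation problem; it only evaluates the objective for fixed coefficients.

```python
import math

def margin(beta0, beta, X, y):
    """Return min_i y_i * (beta0 + beta . x_i) after rescaling to sum(beta_j^2) = 1."""
    norm = math.sqrt(sum(b * b for b in beta))
    beta0_u = beta0 / norm                  # rescaling both keeps the same hyperplane
    beta_u = [b / norm for b in beta]
    return min(
        yi * (beta0_u + sum(b * xj for b, xj in zip(beta_u, xi)))
        for xi, yi in zip(X, y)
    )

# Toy, linearly separable data (illustrative values only)
X = [(1.0, 1.0), (2.0, 2.0), (-1.0, -1.0), (-2.0, -1.5)]
y = [1, 1, -1, -1]

m_diag = margin(0.0, [1.0, 1.0], X, y)  # candidate hyperplane x1 + x2 = 0
m_axis = margin(0.0, [1.0, 0.0], X, y)  # candidate hyperplane x1 = 0

print(m_diag, m_axis)
```

Both hyperplanes separate the classes (both margins are positive), but the first leaves more room, so the maximal margin classifier would prefer it over the second.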

# Support Vector Classifier
- The support vector classifier (SVC) is suitable when the classes overlap and no hyperplane separates them perfectly
- SVC allows a soft margin: some data points may fall inside the margin or be misclassified, which is controlled by the slack (error) terms $\epsilon_i$ and a tuning parameter $C$

$$ \max\limits_{\beta_0, \beta_1, \dots, \beta_p, \epsilon_1, \dots \epsilon_n, M} M$$
$$ \text{subject to}\ \sum_{j=1}^{p} \beta_j^2 = 1 $$
$$y_i (\beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + \dots + \beta_px_{ip} )\geq M (1-\epsilon_i),\qquad \epsilon_i \geq 0,\qquad \sum_{i=1}^n \epsilon_i \leq C $$

- Here
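- $\epsilon_i$ is the slack (error) term of observation $i$: $\epsilon_i = 0$ if the point is on the correct side of the margin, $0 < \epsilon_i \leq 1$ if it is inside the margin but correctly classified, and $\epsilon_i > 1$ if it is misclassified
- $C$ is a non-negative tuning parameter; it sets the limit on how much total error we are willing to accept
- The sum of the error terms must be less than or equal to $C$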
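
To make the role of $\epsilon_i$ and $C$ concrete, here is a minimal sketch that computes the smallest slack each point needs for a fixed hyperplane and margin, then checks the budget. The data, the hyperplane, the value of $M$ and the helper name `slacks` are all assumptions chosen for illustration.

```python
def slacks(beta0, beta, X, y, M):
    """Smallest epsilon_i with y_i * f(x_i) >= M * (1 - epsilon_i) and epsilon_i >= 0."""
    eps = []
    for xi, yi in zip(X, y):
        f = beta0 + sum(b * xj for b, xj in zip(beta, xi))
        eps.append(max(0.0, 1.0 - yi * f / M))
    return eps

# beta = (1, 0) already has unit norm, so the hyperplane is x1 = 0
X = [(2.0, 0.0), (0.5, 1.0), (-0.5, 0.0), (-2.0, 1.0)]
y = [1, 1, 1, -1]          # the third point sits on the wrong side of the hyperplane

eps = slacks(0.0, [1.0, 0.0], X, y, M=1.0)
total = sum(eps)

print(eps)                 # per-point slack
print(total <= 2.0)        # feasible with a budget of C = 2.0
print(total <= 1.5)        # infeasible with a tighter budget of C = 1.5
```

The second point needs a slack between 0 and 1 (inside the margin), and the third needs a slack above 1 (misclassified), so this hyperplane is only admissible when $C$ is large enough to cover both.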

# SVM (Support Vector Machine)
- The support vector machine is suitable when neither the MMC nor the SVC can separate the classes with a single hyperplane, i.e. when the decision boundary is non-linear.
- SVM projects the original data set into a higher-dimensional space using a kernel
- An appropriate hyperplane can then be fitted in that space with the SVC

### Non-Linear Decision Boundary classification
- For the sake of simplicity, let's use a 2nd-degree polynomial with no interaction terms
$$ \max\limits_{\beta_0, \beta_{11}, \beta_{12}, \dots, \beta_{p1}, \beta_{p2}, \epsilon_1, \dots \epsilon_n, M} M$$
$$\text{subject to} \qquad \sum_{j=1}^p \sum_{k=1}^2 \beta_{jk}^2 = 1$$
$$\qquad y_i \left(\beta_0 + \sum_{j=1}^p \beta_{j1}x_{ij} + \sum_{j=1}^p \beta_{j2}x_{ij}^2 \right) \geq M (1-\epsilon_i), \qquad  \epsilon_i \geq 0,\qquad \sum_{i=1}^n \epsilon_i \leq C$$

- Here,
- $\beta_{j1}$ is the coefficient for the original linear term
- $\beta_{j2}$ is the coefficient for the transformed polynomial term
- $x_{ij}$ is the original linear term
- $x_{ij}^2$ is the transformed polynomial term
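
The enlarged feature space can be sketched with a one-dimensional example where no single threshold on $x$ separates the classes, but a linear rule in $(x, x^2)$ does. The data and the coefficients below are hand-picked for illustration (they satisfy the unit-norm constraint but are not the result of an optimisation), and the helper names are hypothetical.

```python
def expand(x):
    """Map (x_1, ..., x_p) to (x_1, ..., x_p, x_1^2, ..., x_p^2), no interaction terms."""
    return list(x) + [v * v for v in x]

def f(beta0, beta, x):
    """Linear function of the expanded features: beta0 + sum_j beta_j1*x_j + sum_j beta_j2*x_j^2."""
    return beta0 + sum(b * z for b, z in zip(beta, expand(x)))

# 1-D data: the positive class lies on both sides of the negative class
X = [(-3.0,), (-2.0,), (-0.5,), (0.0,), (0.5,), (2.0,), (3.0,)]
y = [1, 1, -1, -1, -1, 1, 1]

# Hand-picked coefficients: beta_11 = 0, beta_12 = 1, beta_0 = -2, i.e. f(x) = x^2 - 2
beta0, beta = -2.0, [0.0, 1.0]

separated = all(yi * f(beta0, beta, xi) > 0 for xi, yi in zip(X, y))
print(separated)
```

The boundary $x^2 - 2 = 0$ is linear in the expanded features but corresponds to two cut points in the original variable, which is what makes the decision boundary non-linear.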

# Kernels
- When the classes are not linearly separable, kernels are used to map the data set into a higher-dimensional space
- The most common kernels are the **Linear Kernel**, the **Polynomial Kernel**, the **Radial Basis Function (RBF) or Gaussian Kernel**, and the **Sigmoid Kernel**
- Evaluating the kernel function directly on pairs of original data points, without explicitly computing their coordinates in the higher-dimensional space, is known as the **_Kernel Trick_**
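
A small numerical check can make the kernel trick tangible. The sketch below uses the degree-2 kernel $K(x, z) = (x \cdot z)^2$ on 2-D inputs and compares it with the inner product of an explicit 3-D feature map. The function names and the sample points are illustrative choices.

```python
import math

def poly2_kernel(x, z):
    """K(x, z) = (x . z)^2, computed entirely in the original 2-D space."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2

def phi(x):
    """Explicit 3-D feature map whose inner product equals poly2_kernel."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

x, z = (1.0, 2.0), (3.0, -1.0)

implicit = poly2_kernel(x, z)                           # no mapping needed
explicit = sum(a * b for a, b in zip(phi(x), phi(z)))   # map first, then inner product

print(implicit, explicit)   # agree up to floating-point rounding
```

Both routes give the same number, but the kernel never builds the higher-dimensional vectors, which matters when the feature space is very large or, as for the RBF kernel, infinite-dimensional.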

## Use of kernel in training set
- If the training set is not linearly separable, an RBF or polynomial kernel can be used when fitting the SVM so that the data is implicitly mapped to a higher-dimensional space

## Use of Kernel in prediction
- Since the SVM uses only the support vectors for prediction, the kernel function is evaluated between the new point and each support vector.
- The results are combined into the decision value $f(x)$ (see the decision function in the next subsection), and the prediction depends on its sign:
    - if $f(x) = 0$, the new point is on the decision boundary itself (in that case a tie-breaking rule is needed, such as always assigning a particular class or applying a bias to decide the class)
    - if $f(x) > 0$, the new point is assigned to the positive class (the magnitude also matters: a larger kernel value means the new point is more similar to that support vector)
    - if $f(x) < 0$, the new point is assigned to the opposite (negative) class.

### <u>Decision Function in SVM (i.e. for prediction)</u>
$$f(x) =\beta_0 + \sum_{i=1}^n \alpha_i \langle x, x_i \rangle$$

- Here, $\alpha_i$ is the Lagrange multiplier
- $\alpha_i \neq 0$ only for the support vectors
- $x$ is a test point and $x_i$ are the training points

Since $\alpha_i$ is zero for every point that is not a support vector, the sum only needs to run over the set $S$ of support-vector indices, and the formula can be rewritten as:
$$ f(x) = \beta_0 + \sum_{i \in S} \alpha_i \langle x, x_i \rangle$$

- Any kernel function can be plugged into the above function by replacing the term $\langle x, x_i \rangle$ with the kernel term $K(x, x_i)$
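
The decision function over the support vectors can be sketched as follows. The support vectors, the $\alpha_i$ values and $\beta_0$ are hypothetical numbers rather than a fitted model, and, as in the formula above, the sign of each $\alpha_i$ is assumed to carry the class of its support vector. Assigning ties to the positive class is an arbitrary convention chosen here.

```python
def linear_kernel(u, v):
    """Plain inner product <u, v>."""
    return sum(a * b for a, b in zip(u, v))

def decision_function(x, support_vectors, alphas, beta0, kernel):
    """f(x) = beta0 + sum over support vectors of alpha_i * K(x, x_i)."""
    return beta0 + sum(a * kernel(x, sv) for a, sv in zip(alphas, support_vectors))

def predict(x, support_vectors, alphas, beta0, kernel):
    """Classify by the sign of f(x); a tie (f(x) == 0) goes to +1 by convention."""
    return 1 if decision_function(x, support_vectors, alphas, beta0, kernel) >= 0 else -1

# Hypothetical model: one support vector per class
support_vectors = [(1.0, 1.0), (-1.0, -1.0)]
alphas = [0.5, -0.5]
beta0 = 0.0

f_pos = decision_function((2.0, 0.5), support_vectors, alphas, beta0, linear_kernel)
f_neg = decision_function((-1.0, -3.0), support_vectors, alphas, beta0, linear_kernel)

print(f_pos, predict((2.0, 0.5), support_vectors, alphas, beta0, linear_kernel))
print(f_neg, predict((-1.0, -3.0), support_vectors, alphas, beta0, linear_kernel))
```

Passing the kernel as an argument mirrors the substitution described above: swapping `linear_kernel` for another kernel function changes the shape of the boundary without touching the rest of the prediction code.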

## Linear Kernel
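- The linear kernel is suitable when the classes are already linearly separable (or nearly so) in the original space, so no transformation to a higher dimension is required; it is just the inner product, and using it gives back the support vector classifier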
$$K(x_1, x_2) = x_1 \cdot x_2 $$

- Here, $x_1$ is the test point and $x_2$ is some support vector

## Polynomial Kernel
$$k(x_i,x'_i) = ( x_i \cdot x'_i + c)^d$$
- Here, $d$ is the degree of the polynomial and $c$ is a constant offset
- Sometimes the dot product is written as the inner product $\langle x_i, x'_i \rangle$
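
A minimal sketch of the polynomial kernel, showing how $c$ and $d$ enter the formula. The function name, default values and sample points are illustrative assumptions.

```python
def polynomial_kernel(x, z, c=1.0, d=2):
    """k(x, z) = (x . z + c)^d"""
    return (sum(a * b for a, b in zip(x, z)) + c) ** d

x, z = (1.0, 2.0), (3.0, -1.0)   # x . z = 1.0

k_quadratic = polynomial_kernel(x, z, c=1.0, d=2)
k_cubic = polynomial_kernel(x, z, c=1.0, d=3)
k_linear = polynomial_kernel(x, z, c=0.0, d=1)   # reduces to the linear kernel

print(k_quadratic, k_cubic, k_linear)
```

With $c = 0$ and $d = 1$ the expression collapses to the plain dot product, so the linear kernel can be seen as a special case of the polynomial one.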

## Radial Basis Function
$$k(x_i, x'_i) = \exp \left( -\frac{\lVert x_i - x'_i \rVert^2}{2 \sigma^2} \right) $$
or, equivalently,
$$k(x_i, x'_i) = \exp \left( -\gamma \lVert x_i - x'_i \rVert^2 \right)$$

- Here,
- $\gamma = \frac{1}{2\sigma^2}$
- The $\gamma$ parameter controls the width of the bell-shaped curve: the larger the value of $\gamma$, the narrower the bell
- $\sigma$ is the scale parameter (standard deviation) of the Gaussian function, and $\sigma^2$ is its variance
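
The two parameterisations and the effect of $\gamma$ can be checked numerically. This is a minimal sketch; the function names, sample points and parameter values are illustrative assumptions.

```python
import math

def squared_distance(x, z):
    """Squared Euclidean distance ||x - z||^2."""
    return sum((a - b) ** 2 for a, b in zip(x, z))

def rbf_sigma(x, z, sigma):
    """k(x, z) = exp(-||x - z||^2 / (2 * sigma^2))"""
    return math.exp(-squared_distance(x, z) / (2 * sigma ** 2))

def rbf_gamma(x, z, gamma):
    """k(x, z) = exp(-gamma * ||x - z||^2)"""
    return math.exp(-gamma * squared_distance(x, z))

x, z = (0.0, 0.0), (1.0, 1.0)

sigma = 0.5
gamma = 1 / (2 * sigma ** 2)            # gamma = 1 / (2 sigma^2) = 2.0

same = math.isclose(rbf_sigma(x, z, sigma), rbf_gamma(x, z, gamma))

wide = rbf_gamma(x, z, gamma=0.1)       # small gamma: similarity decays slowly
narrow = rbf_gamma(x, z, gamma=10.0)    # large gamma: similarity decays quickly

print(same, wide, narrow)
```

For the same pair of points, the larger $\gamma$ gives a kernel value much closer to zero, which is the "narrower bell": only points very close to a support vector are treated as similar to it.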