# Least-Squares SVM

## SVM Review

### Introduction

* SVM is an approach to classification

* SVMs are based on 3 big ideas

    1. maximizing the margin

    2. duality

    3. kernels

Kernels allow a set of features to be mapped into a higher-dimensional, and therefore more expressive, feature space without incurring the full computational cost one might expect.

### Maximizing the margin

In binary classification problems, we consider an input space $X$ which is a subset of $\mathcal{R}^n$ with $n \geq 1$. The output space is the set $\{-1,1\}$, representing our two-class system. We are given a training set $S$ of $m$ examples, $S=\{(x_1, y_1), (x_2, y_2), ..., (x_m, y_m)\}$, drawn iid from $X$ according to an unknown distribution $D$. We want to select a hypothesis $h \in H$ that best predicts the classification of other points which are also drawn from $X$ according to $D$. One of the simplest classes of classification rules is the class of linear classifiers, or hyperplanes. A hyperplane $(w,b)$ separates a sample $S$ if for every $(x,y) \in S$ we have

$$sign(w^T x + b) = y$$

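The "margin" intuitively means the distance from the decision boundary to the closest training points. If $(w,b)$ is scaled so that the closest points satisfy $y_i(w^T x_i + b) = 1$, that distance is $1/||w||$, so maximizing the margin is equivalent to solving: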

$$\min \frac{1}{2} ||w||^2$$

$$\hbox{s.t.}\quad y_i(w^T x_i + b) \geq 1, \quad i=1,2,...,m$$
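As a small illustration of the separation condition and the constraints above, the following sketch checks a candidate hyperplane against a sample and computes its geometric margin. The toy data, the chosen $(w,b)$, and the function names are assumptions made for this example.

```python
import math

def functional_margins(w, b, X, y):
    """Return y_i * (w^T x_i + b) for every training example."""
    return [yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            for xi, yi in zip(X, y)]

def geometric_margin(w, b, X, y):
    """Smallest signed distance from any training point to the hyperplane."""
    norm_w = math.sqrt(sum(wj * wj for wj in w))
    return min(functional_margins(w, b, X, y)) / norm_w

# Hypothetical toy sample: two points per class, separable by the line x_1 = 0.
X = [(2.0, 0.0), (1.0, 1.0), (-1.0, 0.0), (-3.0, 2.0)]
y = [1, 1, -1, -1]
w, b = (1.0, 0.0), 0.0

margins = functional_margins(w, b, X, y)
separates = all(m > 0 for m in margins)   # sign(w^T x + b) == y for every point
feasible = all(m >= 1 for m in margins)   # constraints y_i (w^T x_i + b) >= 1 hold
gamma = geometric_margin(w, b, X, y)      # distance to the closest point
```

Here the closest points satisfy the constraint with equality, so the geometric margin equals $1/||w||$.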

### Non-linearly separable data

The above discussion assumed the existence of a linear classifier that can correctly classify all examples in a given training sample $S$. But what happens if the data is not fully separable? We relax the constraints $y_i (w^T x_i + b) \geq 1, \quad \forall_i$ slightly to allow for misclassified points. This is done by introducing slack variables $\xi_i, i=1, ..., m$. Our optimization problem becomes:

$$\min \frac{1}{2} ||w||^2 + C \sum_{i=1}^m \xi_i$$

$$\hbox{s.t.}\quad y_i (w^T x_i + b) - 1 + \xi_i \geq 0 \quad \forall_i = 1, ..., m \quad \hbox{and} \quad \xi_i \geq 0, i=1, 2, ..., m$$

where the parameter $C$ controls the trade-off between the slack variable penalty and the size of the margin.
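To make the role of the slack variables concrete, the sketch below computes, for a fixed hyperplane, the smallest $\xi_i$ that satisfies each relaxed constraint and then evaluates the soft-margin objective for two values of $C$. The data, the hyperplane, and the helper names are hypothetical and chosen only for illustration.

```python
def decision_value(w, b, x):
    """Evaluate w^T x + b."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b

def slack_variables(w, b, X, y):
    """Smallest xi_i >= 0 with y_i (w^T x_i + b) - 1 + xi_i >= 0."""
    return [max(0.0, 1.0 - yi * decision_value(w, b, xi))
            for xi, yi in zip(X, y)]

def soft_margin_objective(w, b, X, y, C):
    """0.5 * ||w||^2 + C * sum_i xi_i for the given hyperplane."""
    xi = slack_variables(w, b, X, y)
    return 0.5 * sum(wj * wj for wj in w) + C * sum(xi)

# Hypothetical sample: the second point sits inside the margin and the
# fourth point is on the wrong side of the hyperplane x_1 = 0.
X = [(2.0, 0.0), (0.5, 0.0), (-1.0, 0.0), (0.5, 1.0)]
y = [1, 1, -1, -1]
w, b = (1.0, 0.0), 0.0

xi = slack_variables(w, b, X, y)                        # [0.0, 0.5, 0.0, 1.5]
obj_small_C = soft_margin_objective(w, b, X, y, 1.0)    # 0.5 + 1 * 2.0
obj_large_C = soft_margin_objective(w, b, X, y, 10.0)   # 0.5 + 10 * 2.0
```

A slack between 0 and 1 marks a point inside the margin that is still correctly classified, while a slack above 1 marks a misclassified point. Raising $C$ makes the same violations more expensive.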

### Non-linear SVM

We map each $x_i$ into a higher-dimensional space, where the data may become linearly separable, using a map $\phi$. We then want to solve:

$$\min \frac{1}{2} ||w||^2 + C \sum_{i=1}^m \xi_i$$

$$\hbox{s.t.}\quad y_i (w^T \phi(x_i) + b) - 1 + \xi_i \geq 0 \quad \forall_i = 1, ..., m \quad \hbox{and} \quad \xi_i \geq 0, \forall_i$$

The Lagrangian is

$$L_p = \frac{1}{2} ||w||^2 + C \sum_{i=1}^m \xi_i - \sum_{i=1}^m \alpha_i (y_i (w^T \phi(x_i) + b ) - 1 + \xi_i ) - \sum_{i=1}^m \beta_i \xi_i$$

where $\alpha_i \geq 0$ and $\beta_i \geq 0$ are the Lagrange multipliers. Differentiating with respect to the primal variables and setting the derivatives to zero gives

$$\frac{\partial L_p}{\partial w} = 0 \quad \Rightarrow{} \quad w = \sum_{i=1}^m \alpha_i y_i \phi (x_i) $$

$$\frac{\partial L_p}{\partial b} = 0 \quad \Rightarrow{} \quad \sum_{i=1}^m \alpha_i y_i = 0$$

$$\frac{\partial L_p}{\partial \xi_i} = 0 \quad \Rightarrow{} \quad C = \alpha_i + \beta_i$$

Putting these results back into $L_p$ yields

$$\max_{\alpha} \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} y_i y_j \alpha_i \alpha_j \phi (x_i)^T \phi(x_j)$$

$$\hbox{s.t.}\quad 0 \leq \alpha_i \leq C \quad \forall_i \quad \hbox{and} \quad \sum \alpha_i y_i = 0$$

The constraint $0 \leq \alpha_i \leq C$ is called the "box constraint".

**Definition**: A kernel function is defined as a function $k$ s.t. $k(x,z) = \phi(x)^T \phi(z)$ .

Using a kernel function, the dual problem becomes:

$$\max_\alpha \sum \alpha_i - \frac{1}{2} \sum_{i,j} y_i y_j \alpha_i \alpha_j k(x_i, x_j)$$

$$\hbox{s.t.}\quad 0 \leq \alpha_i \leq C, \quad \forall_i \quad \hbox{and} \quad \sum \alpha_i y_i = 0$$

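A necessary and sufficient condition for $k$ to be a valid kernel is given by **Mercer's Theorem**: $k$ must be symmetric, and the kernel matrix with entries $k(x_i, x_j)$ must be positive semidefinite for every finite set of points.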
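The definition of a kernel can be checked numerically on a simple case. The sketch below takes the degree-2 polynomial kernel $(1 + x^T z)^2$ on two-dimensional inputs and compares it with the inner product of an explicit six-dimensional feature map. The particular form of `phi` and the test points are assumptions made for this illustration; other feature maps give the same kernel.

```python
import math

def poly_kernel(x, z):
    """Degree-2 polynomial kernel k(x, z) = (1 + x^T z)^2."""
    return (1.0 + sum(a * b for a, b in zip(x, z))) ** 2

def phi(x):
    """One explicit feature map for the kernel above (2-D inputs only)."""
    x1, x2 = x
    r2 = math.sqrt(2.0)
    return (1.0, r2 * x1, r2 * x2, x1 * x1, x2 * x2, r2 * x1 * x2)

def inner(u, v):
    """Ordinary dot product."""
    return sum(a * b for a, b in zip(u, v))

x, z = (0.5, -2.0), (3.0, 1.5)           # arbitrary test points
via_kernel = poly_kernel(x, z)            # needs only a 2-D dot product
via_features = inner(phi(x), phi(z))      # dot product in the 6-D feature space
```

Both routes give the same value, but the kernel never forms the higher-dimensional vectors, which is the computational saving that kernels provide.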

***

Since this is a convex quadratic program with linear constraints, it can be solved efficiently using standard quadratic programming software. After the dual problem is solved, the complementary slackness conditions imply 3 scenarios for the training data points $x_i$ and the Lagrange multipliers $\alpha_i$ associated with their classification constraints.

* $\alpha_i = 0$ and $\xi_i = 0 \quad \Rightarrow{} \quad$  the data point $x_i$ has been correctly classified and is not a support vector

* $0 < \alpha_i < C$ and $\xi_i = 0 \quad \Rightarrow{} \quad x_i$ is a support vector. *__Note:__ the SVs with $0 < \alpha_i < C$ are the unbounded support vectors; they lie exactly on the margin*

* $\alpha_i = C$ and $\xi_i \geq 0 \quad \Rightarrow{} \quad x_i$ is a support vector. *__Note:__ the SVs with $\alpha_i = C$ are the bounded support vectors, that is, they lie on or inside the margin (and are misclassified when $\xi_i > 1$)*
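The three scenarios can be read off a dual solution mechanically. The following sketch labels each training point from its multiplier, treating the middle case as the strict range $0 < \alpha_i < C$. The numerical tolerance and the example multipliers are hypothetical, since a QP solver rarely returns exact zeros.

```python
def classify_multiplier(alpha_i, C, tol=1e-8):
    """Label a training point by where alpha_i sits in the box [0, C]."""
    if alpha_i <= tol:
        return "non-support vector"
    if alpha_i >= C - tol:
        return "bounded support vector"
    return "unbounded support vector"

# Hypothetical dual solution for C = 1.0 (not computed from real data).
C = 1.0
alphas = [0.0, 0.37, 1.0, 1e-12]
labels = [classify_multiplier(a, C) for a in alphas]
```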

## Least Squares SVM

### Introduction

Standard SVMs are solved using quadratic programming methods. However, these methods are often time-consuming and are difficult to implement adaptively. Research has therefore been undertaken to use a quadratic error criterion instead of the L1-norm penalty on the slack variables used in the standard SVM. <u>Suykens et al. (1999)</u> formulated a modified SVM, the least squares SVM (LS-SVM), based on a quadratic error criterion with equality constraints.

### LS-SVM formulation

Suppose we are given a training set of $m$ data points $\{x_i, y_i\}_{i=1}^m$, where $x_i \in \mathcal{R}^n$ is the $i$th input vector and $y_i \in \{-1,1\}$ is its label.

We employ the idea of transforming the input patterns into a reproducing kernel Hilbert space (RKHS) by a mapping function $\phi(x)$. We define the reproducing kernel in the RKHS as $k(x,z) = \phi(x)^T \phi(z)$. In the RKHS, the discriminant function takes the form $f(x) = w^T \phi(x) + b$, where $w$ is the weight vector and $b \in \mathcal{R}$ is the bias term. The discriminant function of LS-SVM is obtained by solving the following minimization problem:

$$\min_{w,b,\xi} J(w,b,\xi) = \frac{1}{2} ||w||^2 + \frac{C}{2} \sum_{i=1}^m \xi_i^2$$

$$\hbox{s.t.} \quad y_i(w^T \phi(x_i) + b) = 1 - \xi_i, \quad i=1,...,m$$

$C$ is the regularization parameter and determines the trade-off between the fitting error minimization and smoothness. $\xi_i$ is the error.

Because the constraints are equalities, this problem is easily solved using Lagrange multipliers. The Lagrangian is

$$L(w,b,\xi, \alpha) = \frac{1}{2} ||w||^2 + \frac{C}{2} \sum_i \xi_i^2 - \sum_{i=1}^m \alpha_i (y_i (w^T \phi(x_i) + b  ) - 1 + \xi_i)$$

The conditions for optimality are obtained as follows:

$$\frac{\partial L}{\partial w} = 0 \quad \Rightarrow{} \quad w = \sum_{i=1}^m \alpha_i y_i \phi(x_i)$$

$$\frac{\partial L}{\partial b} = 0 \quad \Rightarrow{} \quad \sum \alpha_i y_i = 0$$

$$\frac{\partial L}{\partial \xi_i} = 0 \quad \Rightarrow{} \quad \alpha_i = C \xi_i \quad \hbox{therefore $\alpha_i$ can be + or -}$$

$$\frac{\partial L}{\partial \alpha_i} = 0 \quad \Rightarrow{} \quad y_i ( w^T \phi(x_i) + b) - 1 + \xi_i = 0$$

The Lagrange multipliers $\alpha_i$ can be either positive or negative because of the equality constraints. The condition $\frac{\partial L}{\partial \xi_i} = 0$ gives $\xi_i = \frac{\alpha_i}{C}$. Substituting this and $w$ from $\frac{\partial L}{\partial w} = 0$ into $\frac{\partial L}{\partial \alpha_j} = 0$ eliminates $w$ and $\xi$ and gives

$$y_j \left[ \sum_{i=1}^m \alpha_i y_i \phi(x_i)^T \phi(x_j) + b \right] - 1 + \frac{\alpha_j}{C} = 0 \Longleftrightarrow \sum_{i=1}^m \alpha_i y_i y_j \phi(x_i)^T \phi(x_j) + b y_j - 1 + \frac{\alpha_j}{C} = 0$$

Because each $\alpha_i$ and $b$ appear only to the first power, this is a linear equation in the unknowns. Collecting this equation for $j = 1, ..., m$, together with the condition from $\frac{\partial L}{\partial b} = 0$, whose matrix form is $\sum_i \alpha_i y_i = 0 \Leftrightarrow \underset{\sim}{y}^T \underset{\sim}{\alpha} = 0$, where $\underset{\sim}{y} = (y_1, y_2, ..., y_m)^T$ and $\underset{\sim}{\alpha} = (\alpha_1, \alpha_2, ..., \alpha_m)^T$, we get:

$$\sum_{i=1}^m \alpha_i y_i y_j \phi(x_i)^T \phi(x_j) + b y_j - 1 + \frac{\alpha_j}{C} = 0 \quad (j = 1, ..., m) \Longleftrightarrow b \underset{\sim}{y} + (z z^T + \frac{1}{C} I) \underset{\sim}{\alpha} = \underset{\sim}{1}$$

where $\underset{\sim}{z} = (z_1, z_2, ..., z_m)^T$ with $z_i = y_i \phi (x_i)$, so that row $i$ of $z$ is $z_i^T$. $I$ is the identity matrix and $\underset{\sim}{1} = (1, 1, ..., 1)^T$. We define $\Omega = z z^T$ with $\Omega_{ij} = y_i y_j k(x_i, x_j)$; then we get

$$\begin{bmatrix}
0 & \underset{\sim}{y}^T \\
\underset{\sim}{y} & \Omega + \frac{1}{C} I \\
\end{bmatrix} \begin{bmatrix}
b\\
\underset{\sim}{\alpha}\\
\end{bmatrix} = \begin{bmatrix}
0\\
\underset{\sim}{1}\\
\end{bmatrix}$$

This is a square linear system of the form $A u = c$ in the unknowns $u = (b, \underset{\sim}{\alpha}^T)^T$, so it can be solved as $u = A^{-1} c$.
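A minimal sketch of this training step is given below: it assembles the $(m+1) \times (m+1)$ block matrix from the labels and a kernel, then solves for $b$ and $\alpha$ by Gaussian elimination. The function names, the linear kernel, and the four data points are assumptions made for illustration.

```python
def solve_linear_system(A, rhs):
    """Solve A u = rhs by Gaussian elimination with partial pivoting."""
    n = len(A)
    # Augmented matrix [A | rhs], copied so the inputs are not modified.
    M = [list(map(float, A[i])) + [float(rhs[i])] for i in range(n)]
    for col in range(n):
        # Swap in the row with the largest pivot to limit round-off error.
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    # Back substitution on the upper-triangular system.
    u = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * u[j] for j in range(i + 1, n))
        u[i] = (M[i][n] - s) / M[i][i]
    return u

def lssvm_fit(X, y, kernel, C):
    """Solve [[0, y^T], [y, Omega + I/C]] [b; alpha] = [0; 1]."""
    m = len(X)
    A = [[0.0] + [float(yi) for yi in y]]          # first row: (0, y^T)
    for i in range(m):
        row = [float(y[i])]                        # first column: y
        for j in range(m):
            omega_ij = y[i] * y[j] * kernel(X[i], X[j])
            row.append(omega_ij + (1.0 / C if i == j else 0.0))
        A.append(row)
    rhs = [0.0] + [1.0] * m
    u = solve_linear_system(A, rhs)
    return u[0], u[1:]                             # b, alpha

def linear_kernel(x, z):
    return sum(a * b for a, b in zip(x, z))

# Hypothetical training set and regularization parameter.
X = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (4.0, 1.0)]
y = [-1, -1, 1, 1]
C = 10.0
b, alpha = lssvm_fit(X, y, linear_kernel, C)
```

In practice a library routine such as `numpy.linalg.solve` would normally replace the hand-written elimination; it is spelled out here only to keep the sketch free of dependencies.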

***

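The support values $\alpha_k$ are proportional to the errors at the data points ($\alpha_k = C \xi_k$). In general, the $\alpha_k$ values are not zero, so every training point contributes to the model. The model prediction has the same form as for the standard SVM and is given by the sign of $y(x)$: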

$$y(x) = \sum_{i=1}^m \alpha_i y_i k(x, x_i) + b$$

### Example 

$$\begin{matrix}
&x & y\\
&(-1,-1) & -1\\
\textcolor{green}{\hbox{sometimes these are}\quad \Rightarrow}&(-1,1) & 1 \\
\textcolor{green}{\hbox{the same}\quad \Rightarrow}&(1,-1) & 1\\
&(1,1) & -1\\
\end{matrix}$$

Let the kernel be $k(x,z) = (1 + x^T z)^2$. The matrix $\Omega$, with $\Omega_{ij} = y_i y_j k(x_i, x_j)$, is

$$\Omega = \begin{bmatrix}
9 & -1 & -1 & 1\\
-1 & 9 & 1 & -1\\
-1 & 1 & 9 & -1\\
1 & -1 & -1 & 9\\
\end{bmatrix}$$

Suppose we set $C=100$. Solving the linear system gives

$$\begin{bmatrix}
b\\
\underset{\sim}{\alpha}
\end{bmatrix} = \begin{bmatrix}
0 & \underset{\sim}{y}^T\\
\underset{\sim}{y} & \Omega + \frac{1}{C} I\\
\end{bmatrix}^{-1} \begin{bmatrix}
0\\
\underset{\sim}{1}\\
\end{bmatrix}
$$

Therefore, $b=0, \quad \underset{\sim}{\alpha} = (0.1248, 0.1248, 0.1248, 0.1248)^T$.
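The stated solution can be checked by substitution rather than by inverting the matrix. The sketch below rebuilds $\Omega$ from the data, plugs in $b = 0$ and equal multipliers, and evaluates the decision function at the training points. It assumes the third training point is $(1,-1)$, which is what the listed $\Omega$ implies (for instance $\Omega_{23} = 1$); the variable names are illustrative.

```python
def poly_kernel(x, z):
    """Kernel from the example: (1 + x^T z)^2."""
    return (1.0 + sum(a * b for a, b in zip(x, z))) ** 2

# XOR-style data; the third point is assumed to be (1, -1).
X = [(-1, -1), (-1, 1), (1, -1), (1, 1)]
y = [-1, 1, 1, -1]
C = 100.0
m = len(X)

omega = [[y[i] * y[j] * poly_kernel(X[i], X[j]) for j in range(m)]
         for i in range(m)]

# Every row of Omega sums to 8, so equal multipliers a must satisfy
# (8 + 1/C) * a = 1, i.e. a = 1 / 8.01.
b = 0.0
alpha = [1.0 / (8.0 + 1.0 / C)] * m

# First block row of the system: y^T alpha = 0.
constraint = sum(a * yi for a, yi in zip(alpha, y))

# Second block row: b*y + (Omega + I/C) alpha - 1 should vanish.
residuals = [
    b * y[j]
    + sum((omega[j][i] + (1.0 / C if i == j else 0.0)) * alpha[i]
          for i in range(m))
    - 1.0
    for j in range(m)
]

def decision(x):
    """y(x) = sum_i alpha_i y_i k(x, x_i) + b."""
    return sum(alpha[i] * y[i] * poly_kernel(x, X[i]) for i in range(m)) + b

predictions = [1 if decision(x) > 0 else -1 for x in X]
```

With these values both block rows of the system are satisfied, each multiplier rounds to 0.1248, and the sign of the decision function reproduces all four training labels.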

### Notes

For large values of $m$, an iterative method is needed to solve the system of linear equations. One such method is the Hestenes-Stiefel conjugate gradient method for solving $Ax = b$ with $A \in \mathcal{R}^{n \times n}$ symmetric positive definite and $b \in \mathcal{R}^n$.
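A minimal sketch of the conjugate gradient iteration is shown below. Note that the block matrix of the LS-SVM system has a zero on its diagonal, so it is not positive definite as written; the sketch therefore uses a small hypothetical symmetric positive definite matrix and only demonstrates the iteration itself. The function name, tolerance, and iteration cap are assumptions.

```python
def conjugate_gradient(A, b, tol=1e-10, max_iter=None):
    """Solve A x = b for symmetric positive definite A, starting from x = 0."""
    n = len(b)
    if max_iter is None:
        max_iter = 10 * n

    def dot(u, v):
        return sum(ui * vi for ui, vi in zip(u, v))

    def matvec(M, v):
        return [dot(row, v) for row in M]

    x = [0.0] * n
    r = [float(bi) for bi in b]     # residual b - A x for the initial x = 0
    p = list(r)                     # first search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:     # stop once the residual norm is small
            break
        Ap = matvec(A, p)
        step = rs_old / dot(p, Ap)  # exact line search along p
        x = [xi + step * pi for xi, pi in zip(x, p)]
        r = [ri - step * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        # New direction, conjugate to the previous ones with respect to A.
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x

# Hypothetical 2 x 2 SPD system with exact solution (1/11, 7/11).
A = [[4.0, 1.0], [1.0, 3.0]]
b_vec = [1.0, 2.0]
x_cg = conjugate_gradient(A, b_vec)
```

Each iteration needs only one matrix-vector product, which is why the method scales to large $m$ better than forming $A^{-1}$ explicitly.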