# SVM Adaptations

First, recall the SVM problem.

$$\min_{w, \xi} ||w||^2 + C \sum_i \xi_i$$

$$\hbox{s.t.} \quad y_i (w^T X_i + b) \geq 1 - \xi_i, \quad \forall_i = 1, ..., N \quad \hbox{and} \quad \xi_i \geq 0, \quad \forall_i = 1, ..., N$$

Every constraint can be satisfied if $\xi_i$ is sufficiently large. $C$ is a regularization parameter. A small $C$ allows constraints to be easily ignored, causing a large margin. A large $C$ makes constraints hard to ignore, causing a narrow margin. If $C$ is $\infty$, it enforces all constraints, causing a hard margin. This is a quadratic optimization problem and there is a unique minimum.

### Optimization

Learning an SVM has been formulated as a constrained optimization problem wrt $w$ and $\xi$. The constraint $y_i ( w^T X_i + b ) \geq 1 - \xi_i$ can be written more concisely as 

$$y_i f(X_i) \geq 1 - \xi_i \quad \hbox{with} \quad f(X_i) = w^T X_i + b$$

This means that $\xi_i \geq 1 - y_i f(X_i)$. With $\xi_i \geq 0$, we have 

$$\xi_i = \max (0, 1 - y_i f(X_i))$$

Hence, the learning problem is equivalent to the following unconstrained optimization problem over $w$ (and $b$):

$$\min_w \underbrace{||w||^2}_{\hbox{regularization}} + \underbrace{C \sum_{i=1}^N \max (0, 1 - y_i f(X_i)) }_{\hbox{loss function}}$$

Points are in 3 categories:

1. $y_i f(X_i) > 1$: point is outside the margin, no contribution to loss

2. $y_i f(X_i) = 1$: point is on the margin, no contribution to loss

3. $y_i f(X_i) < 1$: point violates margin constraint, so contributes to loss

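This loss function, $\max (0, 1 - y_i f(X_i))$, is called the "hinge" loss. It penalizes weight vectors that misclassify a point or that place a correctly classified point inside the margin. It is just one example of a loss function that can be used here.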
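As a minimal sketch of the unconstrained objective above, the following pure-Python functions evaluate $||w||^2 + C \sum_i \max(0, 1 - y_i f(X_i))$ for a linear $f$. The function names (`decision`, `hinge`, `svm_objective`) and the toy data are illustrative assumptions, not part of the original notes.

```python
def decision(w, b, x):
    """Linear score f(x) = w^T x + b."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def hinge(w, b, x, y):
    """Hinge loss max(0, 1 - y f(x)) for one labelled point."""
    return max(0.0, 1.0 - y * decision(w, b, x))


def svm_objective(w, b, X, Y, C):
    """||w||^2 + C * (sum of hinge losses), the unconstrained form."""
    reg = sum(wj * wj for wj in w)
    loss = sum(hinge(w, b, x, y) for x, y in zip(X, Y))
    return reg + C * loss


# Hypothetical toy data: one point per category listed above
# (outside the margin, on the margin, violating the margin).
w, b = [1.0, 0.0], 0.0
X = [[2.0, 0.0], [1.0, 0.0], [0.5, 0.0]]
Y = [1, 1, 1]
losses = [hinge(w, b, x, y) for x, y in zip(X, Y)]
```

Only the third point, with $y_i f(X_i) = 0.5 < 1$, contributes to the loss in this example.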

**Q1: Does the SVM cost function have a unique solution?**

**Q2: Does the solution depend on the starting point of an iterative optimization algorithm?**

If the cost function is convex, then a locally optimal point is globally optimal (provided the optimization is over a convex set).

The SVM cost function is convex in $w, b$.
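A short justification of this claim, given as a sketch rather than a full proof: for fixed $(X_i, y_i)$ the expression $1 - y_i (w^T X_i + b)$ is affine in $(w, b)$, the pointwise maximum of two affine functions is convex, $||w||^2$ is convex, and a nonnegative weighted sum of convex functions is convex.

$$\underbrace{||w||^2}_{\hbox{convex}} + C \sum_{i=1}^N \underbrace{\max \bigl( 0, 1 - y_i (w^T X_i + b) \bigr)}_{\hbox{max of two affine functions, hence convex}}, \qquad C > 0$$

For Q2, this means any local minimum reached by an iterative method is a global minimum, whatever the starting point. For Q1, the strict convexity of $||w||^2$ makes the minimizing $w$ unique.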

### Stochastic Gradient Descent algorithm for SVM

The SVM problem is a quadratic optimization problem because its objective is quadratic. It can be solved with a classical method such as quadratic programming (QP), but this is very slow. Since the reformulated problem has no constraints, we can instead use gradient descent, which is still very slow. This motivates stochastic gradient descent (SGD). We are interested in finding the $w$ that minimizes the following objective 

$$\min_w \sum_{i=1}^N f_i(w)$$

where the $f_i$ are functions from vectors $w$ to scalars. The idea of SGD is to iterate over the functions $f_i$ and update the vector $w$ after each iteration. The update at each iteration is 

$$w^{(t+1)} = w^{(t)} - \eta \nabla_w f_i (w^{(t)})$$

where $\nabla_w f_i (w)$ is the gradient of $f_i (w)$ and $\eta$ is the step size. We now apply this to the SVM. The functions $f_i$ involve the hinge term $\max (0, 1 - y_i f(X_i))$. The hinge loss is not differentiable at $y_i f(X_i) = 1$, so its gradient is not defined there. Instead, a subgradient is used: 

$$\nabla_w \Bigl[ \max \bigl\{ 0, 1 - y f(X) \bigr\} \Bigr] = \begin{cases}
-y X & \hbox{if } y f(X) < 1\\
0 & \hbox{otherwise}
\end{cases}$$

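Hence, writing the regularizer as $\frac{1}{2} ||w||^2$ so that its gradient is $w$, a subgradient of the per-example SVM objective $\frac{1}{2} ||w||^2 + C \max (0, 1 - y_i f(X_i))$ is 

$$\begin{cases}
w - C y_i X_i & \hbox{if } y_i f(X_i) < 1\\
w & \hbox{otherwise}
\end{cases}$$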

So, the stochastic subgradient descent algorithm is:

### Algorithm

Input: training set $S = \bigl\{ (\underset{\sim}{X_1}, y_1), ..., (\underset{\sim}{X_N}, y_N) \bigr\}, \quad C, \eta$

1. Initialize $w = 0$

2. For $t=1, ...$

    i. Choose an example $(\underset{\sim}{X_i}, y_i)$ uniformly at random.

    ii. Set $\eta_t = \frac{\eta}{\sqrt{t}}$.

    If $y_i f(X_i) < 1$, $w^{(t+1)} = w^{(t)} - \eta_t (w^{(t)} - C y_i \underset{\sim}{X_i}) = (1 - \eta_t) w^{(t)} + \eta_t C y_i \underset{\sim}{X_i}$

    Else, $w^{(t+1)} = w^{(t)} - \eta_t w^{(t)} = (1 - \eta_t) w^{(t)}$

3. Return $w$
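The algorithm can be sketched in pure Python as follows. This is an illustration under stated assumptions rather than a reference implementation: the bias $b$ is omitted so that $f(X) = w^T X$, the step size is taken to be $\eta_t = \eta / \sqrt{t}$, the loop runs for a fixed number of steps, and the names `sgd_svm` and `predict` as well as the toy data are hypothetical.

```python
import math
import random


def sgd_svm(X, Y, C=1.0, eta=0.5, n_steps=200, seed=0):
    """Stochastic subgradient descent for a linear SVM without bias.

    Uses the update from the algorithm above with eta_t = eta / sqrt(t).
    Assumption: f(x) = w^T x; append a constant feature to model b.
    """
    rng = random.Random(seed)           # seeded for reproducibility
    w = [0.0] * len(X[0])               # step 1: initialize w = 0
    for t in range(1, n_steps + 1):     # step 2: iterate
        i = rng.randrange(len(X))       # 2.i: pick an example uniformly
        x, y = X[i], Y[i]
        eta_t = eta / math.sqrt(t)      # 2.ii: decaying step size
        margin = y * sum(wj * xj for wj, xj in zip(w, x))
        # Both branches shrink w by the factor (1 - eta_t).
        w = [(1.0 - eta_t) * wj for wj in w]
        if margin < 1.0:
            # Margin violated: additionally move w toward y * x.
            w = [wj + eta_t * C * y * xj for wj, xj in zip(w, x)]
    return w                            # step 3: return w


def predict(w, x):
    """Sign of the linear score, returned as +1 or -1."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) >= 0 else -1


# Hypothetical, linearly separable toy data.
X = [[2.0, 0.0], [3.0, 1.0], [-2.0, 0.0], [-3.0, -1.0]]
Y = [1, 1, -1, -1]
w = sgd_svm(X, Y)
```

With this data every update adds a positive multiple of $y_i X_i$ to $w$, so the learned $w$ separates the two classes regardless of the sampling order.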

## One Class Classification

SVM is designed to perform binary classification. In some applications (e.g. fraud detection, outbreak detection in surveillance, anomaly detection, outlier detection), we do not have two-class data but only one-class data in the design phase, so the SVM method cannot be applied directly. This is called the one-class classification problem. <u>Schölkopf et al.</u> proposed the one-class SVM (OCSVM). OCSVM separates all the data from the origin and maximizes the distance from the hyperplane to the origin. <u>Tax and Duin (1999)</u> presented the support vector data description (SVDD).

### One-class SVM (OCSVM)

The OCSVM consists of solving the following primal problem. Assume a training set $\bigl\{ X_i \bigr\}_{i=1}^N, \quad X_i \in \mathbb{R}^p$. The primal problem is

$$\min_{w, \rho, \xi} \frac{1}{2} ||w||^2 - \rho + C \sum_{i=1}^N \xi_i$$

$$\hbox{s.t.} \quad w^T \phi(X_i) \geq \rho - \xi_i, i=1,2,...,N \quad \hbox{and} \quad \xi_i \geq 0, i=1,2,...,N$$

where $\xi_i$ is the slack variable, the parameter $C>0$ is introduced to control the influence of the slack variables, and $\phi$ is a function mapping data to a higher dimensional space (Hilbert). This problem can be solved through the Lagrange dual problem, which is usually easier to solve than the primal. The dual problem is:

$$\min_\alpha \frac{1}{2} \sum_{i,j=1}^N \alpha_i \alpha_j \phi(X_i)^T \phi(X_j)$$

$$\hbox{s.t.} \quad 0 \leq \alpha_i \leq C, \forall_i \quad \hbox{and} \quad \sum_{i=1}^N \alpha_i = 1$$

where $\alpha_i, i=1,2,...,N$ are the Lagrange multipliers. Note that the Lagrange multipliers are <u>nonnegative</u>. If we let $k(X, Y)$ represent the inner product $\phi(X)^T \phi(Y)$ in the higher-dimensional space, the dual problem becomes

$$\min_\alpha \frac{1}{2} \sum_{i,j=1}^N \alpha_i \alpha_j k(X_i, X_j)$$

$$\hbox{s.t.} \quad 0 \leq \alpha_i \leq C, \forall_i \quad \hbox{and} \quad \sum_{i=1}^N \alpha_i = 1$$

To solve, we follow these steps:

1. Get the Lagrangian

2. Take the derivative wrt the primal variable $w, \rho, \xi_i$

3. Substitute these into the Lagrangian $L$. This should give the dual problem.
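As a worked sketch of these three steps, the Lagrangian and its stationarity conditions are shown here. The multipliers $\beta_i \geq 0$ for the constraints $\xi_i \geq 0$ are notation introduced for this illustration.

$$\begin{align*}
L &= \frac{1}{2} ||w||^2 - \rho + C \sum_{i=1}^N \xi_i - \sum_{i=1}^N \alpha_i \bigl( w^T \phi(X_i) - \rho + \xi_i \bigr) - \sum_{i=1}^N \beta_i \xi_i\\
\frac{\partial L}{\partial w} &= w - \sum_i \alpha_i \phi(X_i) = 0 \Rightarrow w = \sum_i \alpha_i \phi(X_i)\\
\frac{\partial L}{\partial \rho} &= -1 + \sum_i \alpha_i = 0 \Rightarrow \sum_i \alpha_i = 1\\
\frac{\partial L}{\partial \xi_i} &= C - \alpha_i - \beta_i = 0 \Rightarrow 0 \leq \alpha_i \leq C
\end{align*}$$

Substituting these back, the terms in $\rho$ and $\xi_i$ cancel and $L = \frac{1}{2} ||w||^2 - ||w||^2 = -\frac{1}{2} \sum_{i,j} \alpha_i \alpha_j \phi(X_i)^T \phi(X_j)$. Maximizing this over $\alpha$ is the same as minimizing $\frac{1}{2} \sum_{i,j} \alpha_i \alpha_j \phi(X_i)^T \phi(X_j)$, which is the dual stated above.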

The dual is a quadratic program (QP) and can be solved with any quadratic programming software to obtain $\alpha^\star$. The hyperplane corresponding to this classification rule is $f(\underset{\sim}{X}) = \underset{\sim}{w}^T \phi(\underset{\sim}{X}) - \rho$, where $\underset{\sim}{w}$ is the normal vector and $\rho$ is a bias term. The next step is to evaluate $\rho$, which can be found from any support vector $X_s$ that lies on the hyperplane (a training point with $0 < \alpha_s^\star < C$). Alternatively, $\rho$ can be computed for every member of the set $S$ of $N_s$ support vectors and averaged:

$$\rho = \frac{1}{N_s} \sum_{s \in S} \Bigl( \sum_{i \in S} \alpha_i^\star k(X_s, X_i) \Bigr)$$

The decision function then becomes $f(\underset{\sim}{X}) = \sum_i \alpha_i^\star k(\underset{\sim}{X}, \underset{\sim}{X_i}) - \rho$. Each new observation $\underset{\sim}{u}$ is classified by evaluating $g(\underset{\sim}{u}) = \operatorname{sign} f(\underset{\sim}{u}) = \operatorname{sign} \Bigl( \sum_i \alpha_i^\star k( \underset{\sim}{u}, X_i) - \rho \Bigr)$. If $g(\underset{\sim}{u}) = 1$, then $\underset{\sim}{u}$ is classified as target (or normal). Otherwise, $\underset{\sim}{u}$ is classified as an outlier.
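The evaluation of $\rho$ and the classification rule can be sketched in pure Python. The multipliers below are placeholders chosen only to satisfy $\sum_i \alpha_i = 1$; they are not the output of a QP solver. The Gaussian kernel, its `gamma` parameter and the function names are likewise assumptions made for illustration.

```python
import math


def rbf_kernel(x, z, gamma=1.0):
    """Gaussian kernel exp(-gamma * ||x - z||^2); gamma is an assumed setting."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


def ocsvm_rho(X, alpha, kernel):
    """Average of sum_i alpha_i k(X_s, X_i) over support vectors (alpha_s > 0)."""
    sv = [s for s, a in enumerate(alpha) if a > 0]
    total = sum(sum(a_i * kernel(X[s], x_i) for a_i, x_i in zip(alpha, X))
                for s in sv)
    return total / len(sv)


def ocsvm_classify(u, X, alpha, rho, kernel):
    """Return +1 (target) if f(u) = sum_i alpha_i k(u, X_i) - rho >= 0, else -1."""
    f = sum(a_i * kernel(u, x_i) for a_i, x_i in zip(alpha, X)) - rho
    return 1 if f >= 0 else -1


# Placeholder multipliers (NOT solved from the dual); they sum to 1.
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
alpha = [1.0 / 3.0] * 3
rho = ocsvm_rho(X, alpha, rbf_kernel)

near = ocsvm_classify([1.0 / 3.0, 1.0 / 3.0], X, alpha, rho, rbf_kernel)
far = ocsvm_classify([10.0, 10.0], X, alpha, rho, rbf_kernel)
```

A point close to the training data gets a positive score and is labelled as target, while a distant point has kernel values near zero, so its score is close to $-\rho$ and it is labelled as an outlier.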

### SVDD

Assume a training set $\Bigl\{ X_i \Bigr\}_{i=1}^N, X_i \in \mathbb{R}^p$. The goal is to find a sphere, characterized by a center $\underset{\sim}{a}$ and a radius $r$, with minimum volume that envelops all of the data points in the training set. Mathematically, 

$$\min r^2 + C \sum_{i=1}^N \xi_i$$

$$\hbox{s.t.} || \phi(X_i) - a ||^2 \leq r^2 + \xi_i, \forall_i \quad \hbox{and} \quad \xi_i \geq 0, \forall_i$$

where $\xi_i, i=1,..., N$ are slack variables, $C>0$ is a given trade-off between the volume of the sphere and the number of target objects rejected and $\phi$ is a function mapping data to a higher dimensional space.

#### Solution of SVDD Problem

1. The lagrangian is

$$L = r^2 + C \sum_i \xi_i - \sum_{i=1}^N \alpha_i \Bigl( r^2 + \xi_i - || \phi(X_i) - a ||^2 \Bigr) - \sum_i \beta_i \xi_i$$

where $\alpha_i$ and $\beta_i$ are the lagrangian multipliers. 

2. Take the derivative wrt the primal variables

$$\frac{\partial L}{\partial r} = 2r - 2r \sum \alpha_i = 0 \Rightarrow \sum \alpha_i = 1$$

$$\frac{\partial L}{\partial a} = 2 \Bigl( a \sum \alpha_i - \sum \alpha_i \phi(X_i) \Bigr) = 0 \Rightarrow a = \sum \alpha_i \phi(X_i) \quad \hbox{(using $\sum \alpha_i = 1$)}$$

$$\frac{\partial L}{\partial \xi_i} = C - \alpha_i - \beta_i = 0, \forall_i $$

3. Substituting these results back into the Lagrangian yields the dual problem.

$$\max_\alpha \sum_i \alpha_i k(X_i, X_i) - \sum_{i,j} \alpha_i \alpha_j k(X_i, X_j)$$

$$\hbox{s.t.} \sum \alpha_i = 1 \quad \hbox{and} \quad 0 \leq \alpha_i \leq C, \forall_i$$

where $k(X,y) = \phi(X)^T \phi(y)$ is a kernel function. This is a quadratic programming problem and can be solved by any QP software. 
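The substitution in step 3 can be sketched as follows. Using $\sum_i \alpha_i = 1$ and $C - \alpha_i - \beta_i = 0$, the terms in $r^2$ and $\xi_i$ drop out of the Lagrangian, and with $a = \sum_i \alpha_i \phi(X_i)$:

$$\begin{align*}
L &= \sum_i \alpha_i ||\phi(X_i) - a||^2\\
&= \sum_i \alpha_i k(X_i, X_i) - 2 a^T \sum_i \alpha_i \phi(X_i) + a^T a \sum_i \alpha_i\\
&= \sum_i \alpha_i k(X_i, X_i) - a^T a\\
&= \sum_i \alpha_i k(X_i, X_i) - \sum_{i,j} \alpha_i \alpha_j k(X_i, X_j)
\end{align*}$$

The bound $0 \leq \alpha_i \leq C$ follows from $\alpha_i = C - \beta_i$ together with $\alpha_i \geq 0$ and $\beta_i \geq 0$.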

#### KKT conditions

<u>Stationarity</u>

$1 - \sum \alpha_i = 0$

$a - \sum \alpha_i \phi(X_i) = 0$

$ C - \alpha_i - \beta_i = 0$

<u>Primal admissibility</u>

$||\phi(X_i) - a||^2 \leq r^2 + \xi_i$

$\xi_i \geq 0$

<u>Dual admissibility</u>

$\alpha_i \geq 0$

$\beta_i \geq 0$

<u>Complementary slackness</u>

$\alpha_i \Bigl( ||\phi(X_i) - a||^2 - r^2 - \xi_i \Bigr) = 0$

$\beta_i \xi_i = 0$

From the first condition of the complementary slackness condition, 

$$\alpha_i \Bigl( ||\phi(X_i) - a||^2 - r^2 - \xi_i \Bigr) = 0 \Longleftrightarrow \alpha_i = 0 \quad \hbox{or} \quad ||\phi(X_i) - a||^2 - r^2 - \xi_i = 0$$

When an object $X_i$ satisfies the strict inequality $||\phi(X_i) - a||^2 < r^2 + \xi_i$, the constraint is inactive and the corresponding Lagrange multiplier is zero, $\alpha_i = 0$. For an object $X_i$ with $\alpha_i > 0$, the constraint must hold with equality, $||\phi(X_i) - a||^2 = r^2 + \xi_i$. Combined with $\beta_i \xi_i = 0$ and $\beta_i = C - \alpha_i$, this means

i. $\alpha_i = 0 \quad \xrightarrow{} \quad ||\phi(X_i) - a||^2 < r^2$

ii. $ 0 < \alpha_i < C \quad \xrightarrow{} \quad ||\phi(X_i) - a||^2 = r^2$

iii. $\alpha_i = C \quad \xrightarrow{} \quad ||\phi(X_i) - a||^2 > r^2$

Colloquially,

i. means that the data points whose $\alpha_i=0$ are inside the sphere

ii. means that the data points whose $0 < \alpha_i < C$ are on the boundary

iii. means the data points whose $\alpha_i = C$ fall outside the sphere and have non-zero $\xi_i$

Only objects $X_i$ with $\alpha_i > 0$ are needed in the data description. These objects are called the support vectors (SV). The center is given by $a = \sum_i \alpha_i \phi(X_i)$. The objects with $0 < \alpha_i < C$ lie on the boundary and are called unbounded SV. The objects with $\alpha_i = C$ sit at the upper bound and are called bounded SV. The radius $r$ is determined by taking any $X_s$ on the sphere boundary and calculating its distance to the center. This means that 

$$r^2 = ||\phi(X_s) - a ||^2 = k(X_s, X_s) - 2 \sum_{X_i \in SV} \alpha_i k(X_i, X_s) + \sum_{X_i, X_j \in SV} \alpha_i \alpha_j k(X_i, X_j)$$

To test an unseen point $Z$, the squared distance $||\phi(Z) - a||^2$ is calculated. $Z$ is accepted as a target if this squared distance is smaller than or equal to the squared radius $r^2$.

$$\begin{align*}
||\phi(Z) - a||^2 &= (\phi(Z) - a)^T (\phi(Z) - a)\\
&= \phi(Z)^T \phi(Z) - 2 a^T \phi(Z) + a^T a \\
&= k(Z,Z) - 2 \sum_i \alpha_i k(Z, X_i) + \sum_{i,j} \alpha_i \alpha_j k(X_i, X_j)
\end{align*}$$

This means that $Z$ is a target when

$$d^2(Z,a) = k(Z,Z) - 2 \sum_i \alpha_i k(Z, X_i) + \sum_{i,j} \alpha_i \alpha_j k(X_i, X_j) \leq r^2$$

Otherwise, $Z$ is an outlier.
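The radius computation and the test rule can be sketched in pure Python on a two-point example. The linear kernel, the function names and the data are assumptions made for illustration. For these two points, $\alpha = (0.5, 0.5)$ maximizes the dual with a linear kernel, and with $C = 1$ both points lie on the boundary.

```python
def linear_kernel(x, z):
    """Plain inner product, i.e. phi is the identity map (an assumption)."""
    return sum(xj * zj for xj, zj in zip(x, z))


def svdd_sq_distance(z, X, alpha, kernel):
    """Squared distance from phi(z) to the center a = sum_i alpha_i phi(X_i)."""
    cross = sum(a_i * kernel(z, x_i) for a_i, x_i in zip(alpha, X))
    const = sum(a_i * a_j * kernel(x_i, x_j)
                for a_i, x_i in zip(alpha, X)
                for a_j, x_j in zip(alpha, X))
    return kernel(z, z) - 2.0 * cross + const


def svdd_is_target(z, X, alpha, r2, kernel):
    """Accept z as target when its squared distance is at most r^2."""
    return svdd_sq_distance(z, X, alpha, kernel) <= r2


# Hypothetical data: two points, so the center is their midpoint (1, 0).
X = [[0.0, 0.0], [2.0, 0.0]]
alpha = [0.5, 0.5]

# Radius from a boundary support vector X_s (here the first point).
r2 = svdd_sq_distance(X[0], X, alpha, linear_kernel)

inside = svdd_is_target([1.0, 0.5], X, alpha, r2, linear_kernel)
outside = svdd_is_target([3.0, 0.0], X, alpha, r2, linear_kernel)
```

Here $r^2 = 1$. The point $(1, 0.5)$ has squared distance $0.25$ and is accepted, while $(3, 0)$ has squared distance $4$ and is rejected as an outlier.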

### Least Squares SVDD (LS-SVDD)

To derive a least-squares version of SVDD, we reformulate SVDD using a quadratic error criterion and equality constraints. Consider a training set $X_j, j=1,...,N$. We want to find a model that gives a closed boundary around the data. This closed boundary is a sphere defined by its center $\underset{\sim}{C}$ and radius $R$ (the scalar $C$ without a tilde remains the trade-off parameter). This means

$$\min R^2 + \frac{C}{2} \sum_{j=1}^N \xi_j^2$$

$$\hbox{s.t.} \quad ||\phi(X_j) - \underset{\sim}{C}||^2 = R^2 + \xi_j, \quad j=1,...,N$$

Here, the constraints for the slack variables $\xi_j \geq 0$ in the SVDD are no longer needed. Instead, one can think of the variable $\xi_j$ as an error realized by a training vector $X_j$ wrt the hypersphere.

#### Solution

The typical approach to solving this optimization problem is to use Lagrange multipliers. The Lagrangian is

$$L = R^2 + \frac{C}{2} \sum_j \xi_j^2 - \sum_j \alpha_j \Bigl( R^2 + \xi_j - || \phi(X_j) - \underset{\sim}{C} ||^2 \Bigr)$$

where $\alpha_j, j=1, ..., N$ are the Lagrangian multipliers, which can be either positive or negative due to the equality constraints. Differentiating $L$ wrt the primal variables and setting the derivatives to 0 gives

$$\frac{\partial L}{\partial R} = 2R \Bigl( 1 - \sum_j \alpha_j \Bigr) = 0 \Rightarrow \sum_j \alpha_j = 1$$

$$\frac{\partial L}{\partial \xi_j} = C \xi_j - \alpha_j = 0 \Rightarrow \alpha_j = C \xi_j$$

$$\begin{align*}
\frac{\partial L}{\partial \underset{\sim}{C}} = -2 \sum_j \alpha_j \bigl( \phi(X_j) - \underset{\sim}{C} \bigr) = 0 &\Rightarrow \underset{\sim}{C} = \frac{\sum_j \alpha_j \phi(X_j)}{\sum_j \alpha_j}\\
&\Rightarrow \underset{\sim}{C} = \sum_j \alpha_j \phi(X_j) \quad \hbox{(using $\sum_j \alpha_j = 1$)}
\end{align*}$$

From $\alpha_j = C \xi_j$, it follows that the support values $\alpha_j$ are proportional to the errors $\xi_j$ at the data points in the LS-SVDD case, while in SVDD, many support values are typically 0.

Inserting these derivations into the Lagrangian yields the dual problem

$$\max_\alpha \sum_j \alpha_j k(X_j, X_j) - \sum_{i,j} \alpha_i \alpha_j \Bigl( k(X_i, X_j) + \frac{1}{2C} \delta_{ij} \Bigr)$$

$$\hbox{s.t.} \sum_j \alpha_j = 1$$

$$\hbox{where}\quad \delta_{ij} = \begin{cases}
1 & \hbox{if } i=j\\
0 & o.w.\\
\end{cases}$$

$\delta_{ij}$ is called the Kronecker delta. <u>Guo et al. (2016)</u> suggested solving the dual problem by using QP. However, the dual problem involves only a single equality constraint, unlike SVDD, where there are multiple inequality constraints. It is therefore no longer an inequality-constrained quadratic program, as in the SVDD case, but an equality-constrained quadratic problem, and it has an analytic solution.

Let $\underset{\sim}{K}$ denote the Gram matrix with entries $K_{ij} = k(X_i, X_j)$. $\underset{\sim}{I_N}$ denotes the identity matrix. 

$$\underset{\sim}{H} = \underset{\sim}{K} + \frac{1}{2C} \underset{\sim}{I_N}, \quad \underset{\sim}{e} = (1,1,...,1)^T, \quad \underset{\sim}{\alpha} = \Bigl( \alpha_1, \alpha_2, ..., \alpha_N \Bigr)^T$$

and $\underset{\sim}{k}$ denotes the vector with entries $k_j = k(X_j, X_j), j=1,2,...,N$. It follows that

$$\underset{\sim}{\alpha} = \frac{1}{2} \underset{\sim}{H}^{-1} \biggl( \underset{\sim}{k} + \frac{2 - \underset{\sim}{e}^T \underset{\sim}{H}^{-1} \underset{\sim}{k} }{\underset{\sim}{e}^T \underset{\sim}{H}^{-1} \underset{\sim}{e} } \underset{\sim}{e} \biggr)$$

Also, since $\xi_j = \alpha_j / C$,

$$\underset{\sim}{\xi} = \frac{1}{2C} \underset{\sim}{H}^{-1} \biggl( \underset{\sim}{k} + \frac{2 - \underset{\sim}{e}^T \underset{\sim}{H}^{-1} \underset{\sim}{k} }{\underset{\sim}{e}^T \underset{\sim}{H}^{-1} \underset{\sim}{e} } \underset{\sim}{e} \biggr)$$

Once the analytic solution for $\underset{\sim}{\alpha}$ is obtained, the radius is

$$R^2 = \frac{1}{N} \sum_{s=1}^N \biggl(  k(X_s, X_s) - 2 \sum_j \alpha_j k(X_s, X_j) + \sum_{i,j} \alpha_i \alpha_j k(X_i, X_j)  \biggr)$$

A test vector $Z$ is a target if 

$$k(z,z) - 2 \sum_j \alpha_j k(z, X_j) + \sum_{i,j} \alpha_i \alpha_j k(X_i, X_j) \leq R^2$$
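The analytic solution for $\underset{\sim}{\alpha}$ can be sketched in pure Python. This is an illustration under assumptions: it uses $H = K + \frac{1}{2C} I_N$, which is what the dual above implies, a linear kernel, a small hand-written Gaussian elimination routine, and hypothetical names (`solve`, `ls_svdd_alpha`) and data.

```python
def linear_kernel(x, z):
    """Plain inner product (an assumed kernel choice)."""
    return sum(xj * zj for xj, zj in zip(x, z))


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):                     # back substitution
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def ls_svdd_alpha(X, C, kernel):
    """alpha = 1/2 H^{-1} (k + ((2 - e^T H^{-1} k) / (e^T H^{-1} e)) e)."""
    n = len(X)
    H = [[kernel(X[i], X[j]) + (1.0 / (2.0 * C) if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    k = [kernel(x, x) for x in X]
    u = solve(H, k)               # H^{-1} k
    v = solve(H, [1.0] * n)       # H^{-1} e
    scale = (2.0 - sum(u)) / sum(v)
    return [0.5 * (ui + scale * vi) for ui, vi in zip(u, v)]


# Hypothetical data: two points, trade-off parameter C = 1.
X = [[0.0, 0.0], [2.0, 0.0]]
alpha = ls_svdd_alpha(X, 1.0, linear_kernel)
```

For this symmetric example the result is $\alpha = (0.5, 0.5)$, which satisfies the equality constraint $\sum_j \alpha_j = 1$.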

<u>Solution of primal problem</u>

$$\min R^2 + \frac{C}{2} \sum \xi_i^2$$

$$\hbox{s.t.} \quad ||X_j - \underset{\sim}{C}||^2 = R^2 + \xi_j$$

Here, $\xi_j = ||X_j - \underset{\sim}{C}||^2 - R^2$. Substituting this into the objective and dividing by $C$ replaces the constrained optimization problem by the unconstrained optimization problem $\min \frac{\lambda}{2} R^2 + \frac{1}{2} \sum_j \Bigl( ||X_j - \underset{\sim}{C}||^2 - R^2 \Bigr)^2$ where $\lambda = \frac{2}{C}$. Let $J_1(R, \underset{\sim}{C}) = \frac{\lambda}{2} R^2 + \frac{1}{2} \sum_j \Bigl( ||X_j - \underset{\sim}{C}||^2 - R^2 \Bigr)^2$. Under the assumption that $J_1$ is strongly convex, a locally optimal point is globally optimal, which motivates a gradient descent method for LS-SVDD. This requires the gradients wrt the primal variables $R$ and $\underset{\sim}{C}$, which can be obtained in either of two ways:

1. Stack the variables into a single vector $\underset{\sim}{u} = \begin{pmatrix}\underset{\sim}{C}\\R\end{pmatrix}$ and differentiate $J_1$ wrt $\underset{\sim}{u}$

2. Or get the gradients wrt $\underset{\sim}{C}$ and $R$ separately: $\frac{\partial J_1}{\partial R}, \frac{\partial J_1}{\partial \underset{\sim}{C}}$

Let $J_2 = \Bigl(||X_j - \underset{\sim}{C}||^2 - R^2 \Bigr)^2$, then

$$\frac{\partial J_2}{\partial R} = -4R ||X_j - \underset{\sim}{C}||^2 + 4R^3 = -4R \Bigl( ||X_j - \underset{\sim}{C}||^2 - R^2 \Bigr)$$

$$\frac{\partial J_2}{\partial \underset{\sim}{C}} = -4 (X_j - \underset{\sim}{C}) \Bigl( ||X_j - \underset{\sim}{C}||^2 - R^2 \Bigr)$$

So, the gradient for $J_1$ becomes 

$$\frac{\partial J_1}{\partial R} = R \biggl( \lambda - 2 \sum_{j} \Bigl( ||X_j - \underset{\sim}{C}||^2 - R^2 \Bigr) \biggr)$$

$$\frac{\partial J_1}{\partial \underset{\sim}{C}} = -2 \sum_j (X_j - \underset{\sim}{C}) \Bigl( ||X_j - \underset{\sim}{C}||^2 - R^2 \Bigr)$$

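Next, the gradient descent update with step size $\eta$ is

$$R_{t+1} = R_t - \eta \frac{\partial J_1}{\partial R} \bigg|_{(R_t, \underset{\sim}{C_t})}, \qquad \underset{\sim}{C_{t+1}} = \underset{\sim}{C_t} - \eta \frac{\partial J_1}{\partial \underset{\sim}{C}} \bigg|_{(R_t, \underset{\sim}{C_t})}$$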
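The gradients and the update can be sketched in pure Python. The function names, the fixed step size, the starting point and the toy data (four points on the unit circle) are assumptions made for illustration. The step size has to be small enough for the objective to decrease.

```python
def sq_dist(x, c):
    """Squared Euclidean distance ||x - c||^2."""
    return sum((xj - cj) ** 2 for xj, cj in zip(x, c))


def J1(X, c, R, lam):
    """Objective lam/2 * R^2 + 1/2 * sum_j (||X_j - c||^2 - R^2)^2."""
    return 0.5 * lam * R * R + 0.5 * sum((sq_dist(x, c) - R * R) ** 2 for x in X)


def grad_J1(X, c, R, lam):
    """Gradients of J1 with respect to R and c, as derived above."""
    res = [sq_dist(x, c) - R * R for x in X]           # residuals per point
    dR = R * (lam - 2.0 * sum(res))
    dc = [-2.0 * sum((x[k] - c[k]) * r for x, r in zip(X, res))
          for k in range(len(c))]
    return dR, dc


def gradient_descent(X, c, R, lam, eta=0.01, n_steps=100):
    """Plain gradient descent on (c, R) with a fixed step size eta."""
    for _ in range(n_steps):
        dR, dc = grad_J1(X, c, R, lam)
        R = R - eta * dR
        c = [ck - eta * dk for ck, dk in zip(c, dc)]
    return c, R


# Hypothetical data and starting point.
X = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
c0, R0, lam = [0.5, 0.0], 1.5, 0.1
dR0, dc0 = grad_J1(X, c0, R0, lam)
c1, R1 = gradient_descent(X, c0, R0, lam, eta=0.01, n_steps=1)
```

At this starting point the residuals are $-2, 0, -1, -1$, so $\frac{\partial J_1}{\partial R} = 1.5 \, (0.1 + 8) = 12.15$ and the gradient wrt the center vanishes. A single step with $\eta = 0.01$ therefore shrinks $R$ and lowers $J_1$.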
