# 10 Generalized Additive Model and Support Vector Machine

## Generalized Additive Model

Let $x = [x_1,\dotsc,x_n]^T\in\mathbb R^n$ be a data point. If a model $f$ has the form
$$f(x) = f_1(x_1)+\dotsc+f_n(x_n),$$
i.e. it is a sum of univariate functions, one for each variable, then we call $f$ a generalized additive model (GAM).
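
To make the additive structure concrete, here is a minimal sketch that evaluates a model of this form. The three component functions are hypothetical choices made only for illustration; in practice each $f_j$ would be estimated from data.

```python
import math

# Hypothetical component functions f_1, f_2, f_3 (one per input variable).
components = [
    lambda t: t ** 2,       # f_1
    lambda t: math.sin(t),  # f_2
    lambda t: 3.0 * t,      # f_3
]

def additive_model(x, fs=components):
    """Evaluate f(x) = f_1(x_1) + ... + f_n(x_n)."""
    if len(x) != len(fs):
        raise ValueError("need exactly one component function per variable")
    return sum(f(xj) for f, xj in zip(fs, x))

value = additive_model([2.0, 0.0, 1.0])  # 2^2 + sin(0) + 3*1
```

Each variable enters only through its own function, so there are no interaction terms between variables.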

## Support Vector Machine

Suppose we are performing binary classification on data $(x_i,y_i)\in\mathbb R^p\times \{-1,+1\}$ $(i=1,\dotsc,n)$. We can try to find a linear function $f:\ \mathbb R^p\rightarrow \mathbb R$:
$$f(x_i) = \beta_0 +\beta_1 x_{i1}+\dotsc+\beta_p x_{ip}=[1,x_i^T]\beta$$
where $x_i = [x_{i1},\dotsc,x_{ip}]^T$.

Aim: find $\beta_0,\dotsc,\beta_p$ such that 
$$f(x_i) = \left\{\begin{aligned}>0 &\quad {\rm if}\ y_i>0\\<0 &\quad {\rm if}\ y_i<0\end{aligned}\right.\quad \quad (i=1,2,\dotsc,n).$$

**If such a coefficient vector** $\beta$ **exists**, then for a new data point $x$ we predict $\hat y(x)>0$ if $f(x)>0$, and $\hat y(x)<0$ otherwise.

This is the basic idea of the support vector machine (SVM).
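
The prediction rule can be sketched as follows. The coefficient vector `beta` is a hand-picked, hypothetical example (the line $x_1+x_2=1$), not a fitted model.

```python
def decision_value(beta, x):
    """Compute f(x) = beta_0 + beta_1 * x_1 + ... + beta_p * x_p."""
    return beta[0] + sum(b * xj for b, xj in zip(beta[1:], x))

def predict(beta, x):
    """Return +1 if f(x) > 0, and -1 otherwise."""
    return 1 if decision_value(beta, x) > 0 else -1

beta = [-1.0, 1.0, 1.0]  # hypothetical coefficients [beta_0, beta_1, beta_2]
label_a = predict(beta, [2.0, 2.0])  # f = 3  -> +1
label_b = predict(beta, [0.0, 0.0])  # f = -1 -> -1
```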

### Hard Margin

When such a coefficient vector $\beta$ exists, we say that the data are (linearly) separable. In that case there are usually infinitely many such $\beta$, so we must select the best one.

We define "best" in terms of the margin: among all unit-norm $\beta$, we choose the one that maximizes the smallest value of $y_i[1,x_i^T]\beta$ over the data.

$$\hat\beta ={\rm argmax}_{\beta}\left\{\gamma:\ \Vert\beta\Vert=1,\ \forall i,\ y_i[1,x_i^T]\beta\geqslant \gamma\right\}
={\rm argmax}_{\Vert \beta\Vert=1}\left\{\min_{1\leqslant i\leqslant n}y_i[1,x_i^T]\beta\right\}.$$
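
As a small illustration, the margin of a candidate $\beta$ can be computed by normalizing it to unit norm and taking the minimum of $y_i[1,x_i^T]\beta$. The toy data and the two candidate vectors below are assumptions made for this sketch; as in the formula above, the norm is taken over the full vector, including $\beta_0$.

```python
import math

def margin(beta, X, y):
    """Return min_i y_i [1, x_i^T] beta / ||beta|| for beta = [beta_0, ..., beta_p]."""
    norm = math.sqrt(sum(b * b for b in beta))
    scores = [
        yi * (beta[0] + sum(b * xj for b, xj in zip(beta[1:], xi)))
        for xi, yi in zip(X, y)
    ]
    return min(scores) / norm

# Hypothetical separable toy data in R^2 with labels in {-1, +1}.
X = [[2.0, 2.0], [3.0, 1.0], [0.0, 0.0], [-1.0, 0.5]]
y = [1, 1, -1, -1]

m_a = margin([-1.0, 1.0, 1.0], X, y)  # one separating candidate
m_b = margin([-2.0, 1.0, 1.0], X, y)  # another separating candidate
```

Both candidates separate the data (both margins are positive), but the second has the larger margin, so it is preferred under the hard-margin criterion.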

### Equivalent Reformulation

If the following problem has a minimizer $\beta$, then the minimizer is unique and $\frac{\beta}{\Vert\beta\Vert}$ is a maximizer of the hard-margin problem.
$$\hat\beta ={\rm argmin}_{\beta}\left\{\frac 12\Vert \beta\Vert^2:\quad \forall i, \ y_i[1,x_i^T]\beta\geqslant 1\right\}.$$

**Proof** Uniqueness follows because the objective $\frac12\Vert\beta\Vert^2$ is strictly convex and the feasible set is convex. Let $\beta$ be the minimizer (note that $\beta\neq 0$, since $\beta=0$ violates the constraints). We show by contradiction that $\frac{\beta}{\Vert\beta\Vert}$ is a maximizer of the hard-margin problem. Suppose some $\beta'$ with $\Vert\beta'\Vert=1$ is strictly better than $\frac{\beta}{\Vert\beta\Vert}$ in the sense of the hard margin, i.e.
$$m = \min_{1\leqslant i\leqslant n}y_i[1,x_i^T]\beta' >\min_{1\leqslant i\leqslant n}y_i[1,x_i^T]\frac{\beta}{\Vert\beta\Vert}\geqslant \frac{1}{\Vert\beta\Vert}.$$

Take $\beta'' = \frac{\beta'}{m}$ (note that $m>0$). Then for every $i$ we have
$$y_i[1,x_i^T]\beta'' = y_i[1,x_i^T]\beta'\frac{1}{m}\geqslant 1
$$
so $\beta''$ is feasible, and since $m>\frac{1}{\Vert\beta\Vert}$,
$$\Vert\beta''\Vert^2 = \frac{\Vert \beta'\Vert^2}{m^2}=\frac{1}{m^2}<\Vert\beta\Vert^2,$$
contradicting the selection of $\beta$ as the minimizer.
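
The rescaling step of the proof can be checked numerically. This is only a sanity check on hypothetical toy data: it takes a unit-norm separating vector $\beta'$ with margin $m$ and confirms that $\beta'/m$ satisfies all constraints and has norm $1/m$.

```python
import math

def scores(beta, X, y):
    """Return the list of y_i [1, x_i^T] beta."""
    return [
        yi * (beta[0] + sum(b * xj for b, xj in zip(beta[1:], xi)))
        for xi, yi in zip(X, y)
    ]

# Hypothetical separable toy data and an unnormalized separating vector.
X = [[2.0, 2.0], [3.0, 1.0], [0.0, 0.0], [-1.0, 0.5]]
y = [1, 1, -1, -1]
raw = [-2.0, 1.0, 1.0]

raw_norm = math.sqrt(sum(b * b for b in raw))
beta_prime = [b / raw_norm for b in raw]       # ||beta'|| = 1
m = min(scores(beta_prime, X, y))              # hard margin of beta'
beta_dprime = [b / m for b in beta_prime]      # beta'' = beta' / m

feasible = all(s >= 1.0 - 1e-12 for s in scores(beta_dprime, X, y))
norm_dprime = math.sqrt(sum(b * b for b in beta_dprime))
```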

### Lagrangian


### Soft Margin

However, the "equivalent reformulation" above may have no feasible $\beta$! In this case, we say that the data are not **separable**.

(Even when a solution exists, it might overfit.) In either case we can use the soft margin:

$$\beta  ={\rm argmin}_\beta \left\{\frac12\Vert \beta\Vert^2 +C\sum_{i=1}^n \epsilon_i\quad {\rm s.t.\ \ }  \forall i, \ y_i  [1,x_i^T]\beta\geqslant 1-\epsilon_i {\rm\  and\ } \epsilon_i\geqslant 0 \right\}$$
where $C$ is a hyperparameter.

For a fixed $\beta$, the optimal slack is clearly $\epsilon_i = \max\{0,\ 1-y_i  [1,x_i^T]\beta\}$. Thus, we can write the problem in the equivalent form:
$$\beta  ={\rm argmin}_\beta \left\{\frac12\Vert \beta\Vert^2 +C\sum_{i=1}^n \max\{0,\ 1- y_i[1,x_i^T] \beta \} \right\}$$
From the first formulation we can see that a misclassification $y_i \neq {\rm sgn}([1,x_i^T]\beta )$ is allowed as long as $\epsilon_i> 1$. So a minimizer $\beta$ exists even if the data are not linearly separable.

The term $C\sum_{i=1}^n \epsilon_i$ penalizes the occurrence of such violations.
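
One simple way to minimize the hinge-loss form is subgradient descent. The sketch below is an illustration under stated assumptions, not a production solver: the step size `lr`, the number of `steps`, and the toy data (which include one deliberately mislabeled point) are all hypothetical choices. Because subgradient steps need not decrease the objective, the function returns the best iterate it has seen.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def objective(beta, X, y, C):
    """0.5 * ||beta||^2 + C * sum_i max(0, 1 - y_i [1, x_i^T] beta)."""
    hinge = sum(
        max(0.0, 1.0 - yi * dot(beta, [1.0] + list(xi))) for xi, yi in zip(X, y)
    )
    return 0.5 * dot(beta, beta) + C * hinge

def fit_soft_margin(X, y, C=1.0, steps=2000, lr=0.01):
    """Subgradient descent on the hinge-loss form; returns the best iterate."""
    beta = [0.0] * (len(X[0]) + 1)
    best, best_val = list(beta), objective(beta, X, y, C)
    for _ in range(steps):
        grad = list(beta)  # gradient of 0.5 * ||beta||^2
        for xi, yi in zip(X, y):
            z = [1.0] + list(xi)
            if yi * dot(beta, z) < 1.0:  # hinge term is active for this point
                grad = [g - C * yi * zj for g, zj in zip(grad, z)]
        beta = [b - lr * g for b, g in zip(beta, grad)]
        val = objective(beta, X, y, C)
        if val < best_val:
            best, best_val = list(beta), val
    return best

# Hypothetical toy data; the last point is labeled -1 but sits among the +1 points.
X = [[2.0, 2.0], [3.0, 1.0], [2.5, 3.0], [0.0, 0.0], [-1.0, 0.5], [0.5, -1.0], [2.5, 2.5]]
y = [1, 1, 1, -1, -1, -1, -1]
C = 1.0

beta_hat = fit_soft_margin(X, y, C)
# Optimal slack variables implied by beta_hat.
slack = [max(0.0, 1.0 - yi * dot(beta_hat, [1.0] + xi)) for xi, yi in zip(X, y)]
```

A larger $C$ puts more weight on the slack penalty and pushes the solution toward the hard-margin behaviour; a smaller $C$ tolerates more violations in exchange for a smaller $\Vert\beta\Vert$.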




### Multiclass Classification

#### One-versus-one 

One-versus-one is one approach to multiclass classification. Assume there are $K$ classes, which give $\binom K2$ pairs of classes. We fit one SVM for each pair of classes, $\binom K2$ SVMs in total. For a new data point, each pairwise SVM votes for one of its two classes, and we predict the class that receives the most votes.
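
The voting scheme can be sketched as below. The pairwise coefficients are hand-picked, hypothetical values for three classes on the real line ($p=1$), standing in for fitted SVMs; a positive decision value for the pair $(a,b)$ is taken to mean a vote for class $a$. Ties are broken by whichever class was counted first, which is an arbitrary choice of this sketch.

```python
from collections import Counter

def decision_value(beta, x):
    """Compute [1, x^T] beta for beta = [beta_0, ..., beta_p]."""
    return beta[0] + sum(b * xj for b, xj in zip(beta[1:], x))

def ovo_predict(pairwise, x):
    """Each pairwise classifier casts one vote; return the class with most votes."""
    votes = Counter()
    for (a, b), beta in pairwise.items():
        winner = a if decision_value(beta, x) > 0 else b
        votes[winner] += 1
    return votes.most_common(1)[0][0]

# Hypothetical coefficients for K = 3 classes located near 0, 5 and 10.
pairwise = {
    (0, 1): [2.5, -1.0],  # positive when x < 2.5: vote for class 0
    (0, 2): [5.0, -1.0],  # positive when x < 5.0: vote for class 0
    (1, 2): [7.5, -1.0],  # positive when x < 7.5: vote for class 1
}

pred_low = ovo_predict(pairwise, [1.0])
pred_mid = ovo_predict(pairwise, [5.2])
pred_high = ovo_predict(pairwise, [9.0])
```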

#### One-versus-all 

For each class $k$, fit an SVM with coefficients $\beta^{(k)}$ that decides whether a data point belongs to class $k$ or not. Since each SVM outputs the value $[1,x^T]\beta^{(k)}$, we assign the data point to the class with the largest value.
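
A minimal sketch of this rule follows. The per-class coefficient vectors are hypothetical, hand-picked values rather than fitted SVMs, and the sketch assumes the decision values of the different classifiers are on comparable scales.

```python
def decision_value(beta, x):
    """Compute [1, x^T] beta for beta = [beta_0, ..., beta_p]."""
    return beta[0] + sum(b * xj for b, xj in zip(beta[1:], x))

def ova_predict(betas, x):
    """Return the class whose classifier gives the largest decision value."""
    return max(betas, key=lambda k: decision_value(betas[k], x))

# Hypothetical one-versus-all coefficients for three classes in R^2.
betas = {
    "a": [0.0, 1.0, 0.0],    # large when x_1 is large
    "b": [0.0, 0.0, 1.0],    # large when x_2 is large
    "c": [0.0, -1.0, -1.0],  # large when both coordinates are negative
}

pred_a = ova_predict(betas, [2.0, 0.5])
pred_b = ova_predict(betas, [0.0, 3.0])
pred_c = ova_predict(betas, [-1.0, -2.0])
```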



## Mercer Kernel

When the data are not separable, an alternative to the soft margin is to transform the data. We map $x_i$ to $\phi(x_i) = z_i\in \mathbb R^q$, or to an infinite-dimensional Hilbert space $\phi(x_i) = z_i\in \mathcal H$, so that the $z_i$ are separable or at least achieve a better separation in the sense of the soft margin.

Define the kernel of the mapping $\phi$ as the inner product of two mapped points:
$$K(x,y)= \langle \phi(x),\phi(y)\rangle.$$
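
As a concrete instance, take the degree-2 polynomial kernel $K(x,y)=(x^Ty)^2$ on $\mathbb R^2$, whose feature map can be written explicitly as $\phi(x)=[x_1^2,\ \sqrt2 x_1x_2,\ x_2^2]^T$. This particular kernel is an illustrative choice, not one fixed by the text. The sketch checks that evaluating the kernel directly agrees with the inner product of the mapped points.

```python
import math

def phi(x):
    """Explicit feature map R^2 -> R^3 for the kernel (x^T y)^2."""
    x1, x2 = x
    return [x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2]

def kernel(x, y):
    """Degree-2 polynomial kernel K(x, y) = (x^T y)^2."""
    return (x[0] * y[0] + x[1] * y[1]) ** 2

u, v = [1.0, 2.0], [3.0, -1.0]
direct = kernel(u, v)                                   # (3 - 2)^2
via_map = sum(a * b for a, b in zip(phi(u), phi(v)))    # 9 - 12 + 4
```

The kernel is computed without ever forming $\phi(x)$, which is what makes high-dimensional or infinite-dimensional feature spaces usable in practice.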



### Reproducing Kernel Hilbert Space 

#### Hilbert Space

A complete inner product space, i.e. one in which every Cauchy sequence converges with respect to the norm induced by the inner product, is a Hilbert space.

#### Reproducing Kernel Hilbert Space

(RKHS)

#### Bounded Norm

**Theorem**

**Lemma** (Riesz Representation)

#### Representer Theorem 
