# Definition
 
$$\begin{align*}
\text{quadratic error function:} &\qquad E_{q}(y_n - t_n) = \frac{1}{2}\big\{y_n - t_n\big\}^2 \tag{7.50}\\
\epsilon\text{-insensitive error function:} &\qquad E_{\epsilon}(y_n - t_n) = 
\left\{\begin{array}{ll}
0, &\text{if }|y_n - t_n|<\epsilon\\
|y_n - t_n|-\epsilon, &\text{otherwise}
\end{array}\right. \tag{7.51}
\end{align*}$$

The region $(y-\epsilon, y+\epsilon)$ is a tube of half-width $\epsilon$ (total width $2\epsilon$) centered on $y$. As the $\epsilon$-insensitive error function shows, samples whose targets $t_n$ lie inside the tube contribute nothing to the error, which is what makes the solution sparse. The error of a sample that lies on the boundary or outside the tube equals the distance from its target to the nearest boundary of the tube.
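To make the difference between the two error functions concrete, here is a minimal Python sketch that evaluates both for a single prediction. The function names and the sample values are illustrative assumptions, not part of the derivation.

```python
def eps_insensitive(y, t, eps):
    """Epsilon-insensitive error (7.51) for one prediction y and target t."""
    return max(0.0, abs(y - t) - eps)


def quadratic(y, t):
    """Half the squared residual of one sample, for comparison."""
    return 0.5 * (y - t) ** 2


# Hypothetical values: prediction y = 1.0 and tube half-width eps = 0.5.
y, eps = 1.0, 0.5
for t in (1.2, 1.5, 2.0, 0.0):
    # Targets inside the tube (here t = 1.2) get zero epsilon-insensitive error,
    # while the quadratic error is nonzero for every t != y.
    print(t, eps_insensitive(y, t, eps), quadratic(y, t))
```

Only the targets outside the tube are penalized, and the penalty grows linearly rather than quadratically with the residual.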

# Mathematical representation

## Problem
Our goal is to minimize a regularized error function given by

$$C\sum_{n=1}^N E_{\epsilon}\big(y_n - t_n\big) + \frac{1}{2}\|\mathbf{w}\|^2 \tag{7.52}$$
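As a rough illustration, the regularized error (7.52) can be evaluated directly for a candidate model. The sketch below assumes a linear model $y(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + b$ (that is, $\phi$ is the identity); the helper name `svr_objective` and the toy data are hypothetical.

```python
def svr_objective(w, b, xs, ts, C, eps):
    """Evaluate (7.52) for a linear model y(x) = w.x + b (phi assumed to be the identity)."""
    total_error = 0.0
    for x, t in zip(xs, ts):
        y = sum(wi * xi for wi, xi in zip(w, x)) + b
        total_error += max(0.0, abs(y - t) - eps)  # epsilon-insensitive error (7.51)
    return C * total_error + 0.5 * sum(wi * wi for wi in w)


# Hypothetical toy data: three 1-D inputs with their targets.
xs = [[0.0], [1.0], [2.0]]
ts = [0.0, 1.0, 4.0]
value = svr_objective(w=[1.0], b=0.0, xs=xs, ts=ts, C=10.0, eps=0.5)
print(value)
```

With these numbers only the third sample lies outside the tube, so it is the only one that contributes to the error term.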



## Constraints

The piecewise definition of the $\epsilon$-insensitive error function is awkward to optimize directly, so we first rewrite it case by case in a computationally easier form.

$$E_{\epsilon}(y_n - t_n) = 
\left\{\begin{array}{ll}
0, &\text{if }|y_n - t_n|<\epsilon\\
|y_n - t_n|-\epsilon, &\text{otherwise}
\end{array}\right.
= \left\{\begin{array}{ll}
0\geqslant t_n - y_n -\epsilon, &\text{if }0\leqslant t_n - y_n<\epsilon\\
0\geqslant y_n - t_n -\epsilon, &\text{if }-\epsilon< t_n - y_n\leqslant 0\\
t_n - y_n -\epsilon, &\text{if }t_n - y_n\geqslant \epsilon\\
y_n - t_n -\epsilon, &\text{if }t_n - y_n\leqslant -\epsilon\\
\end{array}\right.
$$

To apply Lagrange multipliers, we also need to remove the case conditions attached to these inequalities. To do so, we introduce two slack variables per sample:

$$E_{\epsilon}(y_n - t_n) = 
\left\{\begin{array}{ll}
\xi_n, &\text{if } t_n - y_n \geqslant 0\\
\hat{\xi}_n, &\text{if } t_n - y_n \leqslant 0\\
\end{array}\right.
\qquad
\text{where }
\xi_n\text{ and }\hat{\xi}_n\text{ satisfy }
\left\{\begin{array}{ll}
\xi_n\geqslant t_n - y_n -\epsilon\\
\xi_n\geqslant 0\\
\hat{\xi}_n\geqslant y_n - t_n -\epsilon\\
\hat{\xi}_n\geqslant 0\\
\end{array}\right.
$$

where $\xi_n$ denotes the error of a sample whose target lies above the tube ($t_n > y_n + \epsilon$), and $\hat{\xi}_n$ denotes the error of a sample whose target lies below the tube ($t_n < y_n - \epsilon$). Because minimizing the objective drives $\xi_n$ to zero unless the target lies above the tube, and $\hat{\xi}_n$ to zero unless it lies below, the error of an individual sample can be written in the form

$$E_{\epsilon}(y_n - t_n) = \xi_n + \hat{\xi}_n$$

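where the conditions $\xi_n\geqslant t_n - y_n -\epsilon$ and $\hat{\xi}_n\geqslant y_n - t_n -\epsilon$ are mutually exclusive in the sense that they cannot both hold with equality when $\epsilon > 0$: adding the two equalities would give $\xi_n + \hat{\xi}_n = -2\epsilon < 0$, which contradicts $\xi_n\geqslant 0$ and $\hat{\xi}_n\geqslant 0$.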
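The slack-variable form can be checked numerically. The following sketch, with hypothetical helper names and sample values, computes the smallest slacks allowed by the four constraints and confirms that their sum reproduces the $\epsilon$-insensitive error.

```python
def slacks(y, t, eps):
    """Smallest slack values allowed by the four constraints for one sample."""
    xi = max(0.0, t - y - eps)       # nonzero only if the target is above the tube
    xi_hat = max(0.0, y - t - eps)   # nonzero only if the target is below the tube
    return xi, xi_hat


def eps_insensitive(y, t, eps):
    """Epsilon-insensitive error (7.51), used here as the reference value."""
    return max(0.0, abs(y - t) - eps)


# Hypothetical samples: prediction y = 1.0 and tube half-width eps = 0.5.
y, eps = 1.0, 0.5
results = []
for t in (2.0, 1.25, 0.0):
    xi, xi_hat = slacks(y, t, eps)
    results.append((t, xi, xi_hat))
    # At most one slack is positive, and the sum equals the error of the sample.
    assert xi == 0.0 or xi_hat == 0.0
    assert xi + xi_hat == eps_insensitive(y, t, eps)
print(results)
```

Taking the smallest feasible slack is what the minimization of $C\sum_n(\xi_n+\hat{\xi}_n)$ does implicitly, which is why the constraints only need to state lower bounds.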

# Solution 

## Introduce Lagrange multiplier

According to the theory of Lagrange multipliers, we can solve the SVM regression problem by finding the solution of 

$$\underset{\mathbf{a}\geqslant 0,\mathbf{\hat{a}}\geqslant 0,\mathbf{\mu}\geqslant 0,\mathbf{\hat{\mu}}\geqslant 0}{\quad max\quad }\underset{\mathbf{w},b,\mathbf{\xi}, \mathbf{\hat{\xi}}}{\quad min\quad }L(\mathbf{w},b,\mathbf{\xi},\mathbf{\hat{\xi}},\mathbf{a},\mathbf{\hat{a}},\mathbf{\mu},\mathbf{\hat{\mu}})$$

where the Lagrangian function $L$ is obtained from the objective function and its constraints as follows

$$
\left.\begin{array}{ll}
\text{Problem:} & \displaystyle{\underset{\mathbf{w},b,\mathbf{\xi}, \mathbf{\hat{\xi}}}{\ min\ }C\sum_{n=1}^N(\xi_n+\hat{\xi}_n) + \frac{1}{2}\|\mathbf{w}\|^2 } \\
\text{Constraint 1:} &y_n-t_n+\epsilon+\xi_n\geqslant 0 \\
\text{Constraint 2:} &-y_n+t_n+\epsilon+\hat{\xi}_n\geqslant 0 \\
\text{Constraint 3:} &\xi_n\geqslant 0 \\
\text{Constraint 4:} &\hat{\xi}_n\geqslant 0 
\end{array}\right\}
\Rightarrow
\begin{align*}
L = &C\sum_{n=1}^N(\xi_n+\hat{\xi}_n) + \frac{1}{2}\|\mathbf{w}\|^2 -\sum_{n=1}^N(\mu_n\xi_n+\hat{\mu}_n\hat{\xi}_n) \\
&- \sum_{n=1}^N a_n(y_n-t_n+\epsilon+\xi_n) - \sum_{n=1}^N \hat{a}_n(-y_n+t_n+\epsilon+\hat{\xi}_n)
\end{align*}\tag{7.56}$$

Because the conditions $\xi_n\geqslant t_n - y_n -\epsilon$ and $\hat{\xi}_n\geqslant y_n - t_n -\epsilon$ are mutually exclusive, at most one of the Lagrange multipliers $a_n$ and $\hat{a}_n$ can be nonzero for each sample:

$$\begin{array}{ll}
\text{if } a_n\neq 0, &\hat{a}_n = 0\\
\text{if } \hat{a}_n\neq 0, &a_n = 0\\
\end{array}$$

To solve $\underset{\mathbf{w},b,\mathbf{\xi},\mathbf{\hat{\xi}}}{\ min\ }L$, we set the partial derivatives of $L$ with respect to $\mathbf{w}$, $b$, $\mathbf{\xi}$, $\mathbf{\hat{\xi}}$ to zero.

$$\begin{align*}
\frac{\partial L}{\partial\mathbf{w}} = 0 &\quad\Rightarrow\quad \mathbf{w} = \sum_{n=1}^N (a_n - \hat{a}_n) \phi(\mathbf{x}_n) \tag{7.57}\\
\frac{\partial L}{\partial b} = 0 &\quad\Rightarrow\quad \sum_{n=1}^N (a_n - \hat{a}_n) = 0 \tag{7.58}\\
\frac{\partial L}{\partial \xi_n} = 0 &\quad\Rightarrow\quad a_n = C-\mu_n \tag{7.59}\\
\frac{\partial L}{\partial \hat{\xi}_n} = 0 &\quad\Rightarrow\quad \hat{a}_n = C-\hat{\mu}_n \tag{7.60}
\end{align*}$$

where 
- (7.57) expresses the weight vector $\mathbf{w}$ as a linear combination of the feature vectors $\phi(\mathbf{x}_n)$, weighted by the differences $a_n - \hat{a}_n$ between the multipliers $\mathbf{a}$ and $\mathbf{\hat{a}}$.
- (7.58) is the linear equality constraint that the multipliers must satisfy.
- (7.59) bounds the multipliers $\mathbf{a}$. Since the theory of Lagrange multipliers requires $\mu_n\geqslant 0$, each multiplier in $\mathbf{a}$ must satisfy $a_n\leqslant C$.
- (7.60) bounds the multipliers $\mathbf{\hat{a}}$. Since the theory of Lagrange multipliers requires $\hat{\mu}_n\geqslant 0$, each multiplier in $\mathbf{\hat{a}}$ must satisfy $\hat{a}_n\leqslant C$.

Substituting these conditions into the Lagrangian function, we obtain

$$\bbox[#ffe0f0]{L(\mathbf{a},\mathbf{\hat{a}}) = -\frac{1}{2}\sum_{n=1}^N\sum_{m=1}^N (a_n - \hat{a}_n)(a_m - \hat{a}_m) k(\mathbf{x}_n,\mathbf{x}_m) - \epsilon\sum_{n=1}^N (a_n+\hat{a}_n) + \sum_{n=1}^N(a_n-\hat{a}_n)t_n} \tag{7.61}$$

which depends only on the multipliers $\mathbf{a}$ and $\mathbf{\hat{a}}$. As a result, our goal becomes solving a quadratic programming problem in $\mathbf{a}$ and $\mathbf{\hat{a}}$, subject to the linear equality constraint and the box constraints on $\mathbf{a}$ and $\mathbf{\hat{a}}$:

$$\bbox[#e0f0ff]{\underset{0\leqslant\mathbf{a}\leqslant C, 0\leqslant\mathbf{\hat{a}}\leqslant C}{\quad max\quad } L \quad s.t.\ \sum_{n=1}^N (a_n - \hat{a}_n) = 0 \quad\text{and}\quad 
\left\{\begin{array}{ll}
\text{if } a_n\neq 0, &\hat{a}_n = 0\\
\text{if } \hat{a}_n\neq 0, &a_n = 0\\
\end{array}\right.}$$

which can be solved using the sequential minimal optimization (SMO) algorithm.
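Whatever solver is used, the dual objective (7.61) and its constraints are easy to check for a candidate pair of multiplier vectors. The sketch below is not an SMO implementation; it only evaluates the objective and tests feasibility. The function names, the linear kernel, and the hand-picked toy multipliers are assumptions made for illustration.

```python
def linear_kernel(x, z):
    """k(x, z) = x . z, which corresponds to phi(x) = x (an assumption of this sketch)."""
    return sum(xi * zi for xi, zi in zip(x, z))


def dual_objective(a, a_hat, xs, ts, eps, kernel=linear_kernel):
    """Evaluate the dual Lagrangian (7.61) for given multipliers."""
    d = [an - ah for an, ah in zip(a, a_hat)]
    n_samples = len(xs)
    quad = sum(d[n] * d[m] * kernel(xs[n], xs[m])
               for n in range(n_samples) for m in range(n_samples))
    return (-0.5 * quad
            - eps * sum(an + ah for an, ah in zip(a, a_hat))
            + sum(dn * tn for dn, tn in zip(d, ts)))


def is_feasible(a, a_hat, C, tol=1e-9):
    """Check the box constraints, the equality constraint (7.58), and a_n * a_hat_n = 0."""
    box = all(0.0 <= v <= C for v in a + a_hat)
    balance = abs(sum(a) - sum(a_hat)) <= tol
    exclusive = all(an == 0.0 or ah == 0.0 for an, ah in zip(a, a_hat))
    return box and balance and exclusive


# Hypothetical toy problem: two 1-D samples and hand-picked multipliers.
xs = [[0.0], [2.0]]
ts = [0.0, 3.0]
a, a_hat = [0.0, 0.5], [0.5, 0.0]
value = dual_objective(a, a_hat, xs, ts, eps=0.5)
print(value, is_feasible(a, a_hat, C=1.0))
```

A solver such as SMO searches over the feasible multipliers for the pair that makes `dual_objective` as large as possible.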



## Solution of $\mathbf{w}$

By (7.57), the weight vector $\mathbf{w}$ is determined by the differences between the multipliers $\mathbf{a}$ and $\mathbf{\hat{a}}$. Once the SMO algorithm has returned the multipliers, we obtain the solution of $\mathbf{w}$ from

$$\mathbf{w}^\star = \sum_{n=1}^N (a_n - \hat{a}_n) \phi(\mathbf{x}_n)$$
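A small sketch of this step, assuming $\phi$ is the identity so that $\mathbf{w}^\star$ can be formed explicitly. It also checks that predicting with $\mathbf{w}^\star$ agrees with the kernel expansion $\sum_n (a_n - \hat{a}_n)k(\mathbf{x}, \mathbf{x}_n) + b$ obtained by inserting (7.57) into $y(\mathbf{x}) = \mathbf{w}^T\phi(\mathbf{x}) + b$. The function names, the multipliers, and the value of $b$ are hypothetical.

```python
def linear_kernel(x, z):
    """k(x, z) = x . z, which corresponds to phi(x) = x."""
    return sum(xi * zi for xi, zi in zip(x, z))


def weight_vector(a, a_hat, xs):
    """w* = sum_n (a_n - a_hat_n) * phi(x_n), assuming phi(x) = x."""
    w = [0.0] * len(xs[0])
    for an, ah, x in zip(a, a_hat, xs):
        for i, xi in enumerate(x):
            w[i] += (an - ah) * xi
    return w


def predict_primal(x, w, b):
    """y(x) = w . x + b, using the explicit weight vector."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def predict_dual(x, a, a_hat, xs, b, kernel=linear_kernel):
    """y(x) = sum_n (a_n - a_hat_n) k(x, x_n) + b, using only kernel evaluations."""
    return sum((an - ah) * kernel(x, xn) for an, ah, xn in zip(a, a_hat, xs)) + b


# Hypothetical multipliers and bias for two 1-D samples.
xs = [[0.0], [2.0]]
a, a_hat = [0.0, 0.5], [0.5, 0.0]
b = 0.5
w = weight_vector(a, a_hat, xs)
x_new = [3.0]
y_primal = predict_primal(x_new, w, b)
y_dual = predict_dual(x_new, a, a_hat, xs, b)
print(w, y_primal, y_dual)
```

For a nonlinear kernel, $\phi$ is usually not available explicitly, so only the kernel form of the prediction can be computed in practice.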


## Solution of $b$

<font color='grey'>*SMO also computes the value of $b$.*</font>

<font color='#aaaaaa'>

The theory of Lagrange multipliers implies that the solution satisfies the Karush-Kuhn-Tucker (KKT) conditions, which take the form

$$\left.\begin{array}{ll}
a_n\geqslant 0 \\
y_n-t_n+\epsilon+\xi_n\geqslant 0 \\
a_n (y_n-t_n+\epsilon+\xi_n) = 0 \\
\hat{a}_n\geqslant 0 \\
-y_n+t_n+\epsilon+\hat{\xi}_n\geqslant 0 \\
\hat{a}_n (-y_n+t_n+\epsilon+\hat{\xi}_n) = 0 \\
\mu_n \geqslant 0 \\
\xi_n \geqslant 0 \\
\mu_n\xi_n = 0 \\
\hat{\mu}_n \geqslant 0 \\
\hat{\xi}_n \geqslant 0 \\
\hat{\mu}_n\hat{\xi}_n = 0 \\
\hline
a_n = C-\mu_n \\
\hat{a}_n = C-\hat{\mu}_n \\
\text{if } a_n\neq 0, \hat{a}_n = 0\\
\text{if } \hat{a}_n\neq 0, a_n = 0
\end{array}\right\}
\Rightarrow
\left\{\begin{array}{lll}
\text{if }a_n = 0\text{ and }\hat{a}_n=0, & |y_n - t_n|<\epsilon &(\mathbf{x}_n \text{ inside the tube})\\
\text{if }0< (a_n \text{ or } \hat{a}_n)< C, & |y_n - t_n|=\epsilon &(\mathbf{x}_n \text{ on the boundary of the tube})\\
\text{if }(a_n \text{ or } \hat{a}_n) = C, & |y_n - t_n|>\epsilon &(\mathbf{x}_n \text{ outside the tube})\\
\end{array}\right.$$

where we have defined $y_n = y(\mathbf{x}_n) = \mathbf{w}^T\phi(\mathbf{x}_n) + b$. <font color='green'>The data points with $a_n\neq 0$ or $\hat{a}_n\neq 0$, which lie on the boundary of the tube or outside it, are called *support vectors*. </font>


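Hence, for any $\mathbf{x}_n$ whose multiplier lies strictly between $0$ and $C$, the corresponding slack variable is zero (because $\mu_n > 0$ or $\hat{\mu}_n > 0$), the point lies on the boundary of the tube, and the following equations hold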

$$\begin{array}{ll}
\text{if }0<a_n < C: & t_n - y_n = \epsilon\\
\text{if }0<\hat{a}_n < C: & y_n - t_n = \epsilon\\
\end{array}$$

Thus we can solve for $b$ using the following equations, in which $\mathbf{w}$ has been replaced by (7.57)

$$
\left\{\begin{array}{ll}
\text{if }0<a_n < C: & \displaystyle{b = t_n - \epsilon - \sum_{m=1}^N(a_m-\hat{a}_m)k(\mathbf{x}_n, \mathbf{x}_m)}\\
\text{if }0<\hat{a}_n < C: & \displaystyle{b = t_n + \epsilon - \sum_{m=1}^N(a_m-\hat{a}_m)k(\mathbf{x}_n, \mathbf{x}_m)}
\end{array}\right.
$$

In practice, it is better to average over all such estimates of $b$.
</font>
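The averaging of the estimates of $b$ can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `estimate_b`, the tolerance `tol`, the linear kernel, and the toy multipliers are all hypothetical, and the multipliers are taken as given rather than computed by a solver.

```python
def linear_kernel(x, z):
    """k(x, z) = x . z, which corresponds to phi(x) = x."""
    return sum(xi * zi for xi, zi in zip(x, z))


def estimate_b(a, a_hat, xs, ts, C, eps, kernel=linear_kernel, tol=1e-8):
    """Average the estimates of b over all samples with a multiplier strictly in (0, C)."""
    estimates = []
    for n, (xn, tn) in enumerate(zip(xs, ts)):
        # s = sum_m (a_m - a_hat_m) k(x_n, x_m), i.e. w . phi(x_n)
        s = sum((am - ahm) * kernel(xn, xm) for am, ahm, xm in zip(a, a_hat, xs))
        if tol < a[n] < C - tol:        # target on the upper boundary: t_n - y_n = eps
            estimates.append(tn - eps - s)
        elif tol < a_hat[n] < C - tol:  # target on the lower boundary: y_n - t_n = eps
            estimates.append(tn + eps - s)
    if not estimates:
        raise ValueError("no multiplier lies strictly between 0 and C")
    return sum(estimates) / len(estimates)


# Hypothetical toy problem: two 1-D samples with hand-picked multipliers.
xs = [[0.0], [2.0]]
ts = [0.0, 3.0]
a, a_hat = [0.0, 0.5], [0.5, 0.0]
b = estimate_b(a, a_hat, xs, ts, C=1.0, eps=0.5)
print(b)
```

In this toy problem both samples yield the same estimate, so the average changes nothing; with multipliers from a numerical solver the individual estimates typically differ slightly, which is why averaging helps.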