The general linear regression model takes the form

$$y(\mathbf{x}) = \mathbf{w}^T\phi(\mathbf{x})$$

Given a training set of $N$ input-target pairs, we would like to find a $\mathbf{w}$ that satisfies the following linear system.

$$\underbrace{\begin{bmatrix}
\phi_0(\mathbf{x}_1) &\phi_1(\mathbf{x}_1) &\cdots &\phi_{M-1}(\mathbf{x}_1)\\
\phi_0(\mathbf{x}_2) &\phi_1(\mathbf{x}_2) &\cdots &\phi_{M-1}(\mathbf{x}_2)\\
\vdots &\vdots &\ddots &\vdots\\
\phi_0(\mathbf{x}_N) &\phi_1(\mathbf{x}_N) &\cdots &\phi_{M-1}(\mathbf{x}_N)\\
\end{bmatrix}}_{\Phi}
\underbrace{\begin{bmatrix}
w_0\\ w_1\\ \vdots\\ w_{M-1}\\
\end{bmatrix}}_{\mathbf{w}}
=
\underbrace{\begin{bmatrix}
t_1\\ t_2\\ \vdots\\ t_{N}\\
\end{bmatrix}}_{\mathbf{t}}
$$

In most cases, the number of training points $N$ is larger than the number of basis functions $M$, so the system is overdetermined and in general has no exact solution. However, before looking into how the kernel method handles these equations, there is one concept that we need to make clear.

The vector $\mathbf{w}$ can be split into two components that lie in two orthogonal complementary subspaces.

$$\mathbf{w} = \hat{\mathbf{w}} + \mathbf{z}\qquad \text{where } \hat{\mathbf{w}}\in \Phi_{row},\ \mathbf{z}\in N(\Phi)$$

where $\hat{\mathbf{w}}$ is in the row space of $\Phi$ and $\mathbf{z}$ is in the null space of $\Phi$. Then the equation becomes

$$\Phi\mathbf{w} = \Phi(\hat{\mathbf{w}} +\mathbf{z}) = \Phi\hat{\mathbf{w}} + 0 = \Phi\hat{\mathbf{w}}$$

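The component $\mathbf{z}$ therefore has no effect on the predictions $\Phi\mathbf{w}$, while it can only increase the norm $\mathbf{w}^T\mathbf{w} = \hat{\mathbf{w}}^T\hat{\mathbf{w}} + \mathbf{z}^T\mathbf{z}$ that a regularization term penalizes. Thus the best solution for $\mathbf{w}$ lies in the row space of $\Phi$.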

$$\mathbf{w} = \hat{\mathbf{w}}$$
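The decomposition above can be checked numerically. The following is a minimal sketch, assuming a small random design matrix with more basis functions than samples (so that the null space is non-trivial); the sizes and variable names are illustrative only and do not come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical design matrix with N = 3 samples and M = 5 basis functions,
# chosen so that the null space of Phi is non-trivial.
Phi = rng.normal(size=(3, 5))
w = rng.normal(size=5)

# pinv(Phi) @ Phi is the orthogonal projector onto the row space of Phi.
P_row = np.linalg.pinv(Phi) @ Phi
w_hat = P_row @ w   # row-space component
z = w - w_hat       # null-space component

same_prediction = np.allclose(Phi @ w, Phi @ w_hat)
z_in_null_space = np.allclose(Phi @ z, 0.0)
norm_not_larger = np.linalg.norm(w_hat) <= np.linalg.norm(w)

print(same_prediction, z_in_null_space, norm_not_larger)
```

Dropping $\mathbf{z}$ leaves $\Phi\mathbf{w}$ unchanged and never increases the norm of $\mathbf{w}$, which is the reason for restricting attention to the row space.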

---------------------

# Dual Representations

Because the vector $\mathbf{w}$ is in the row space of $\Phi$, we can write it in the following form

$$\mathbf{w} = \Phi^T \mathbf{a} \tag{6.3}$$

By substituting $\mathbf{w} = \Phi^T \mathbf{a}$ into the regularized sum-of-squares error function, we obtain a new error function with respect to $\mathbf{a}$. Our goal then becomes finding the value of $\mathbf{a}$ that minimizes this error function.

$$\begin{align*}
&\text{Original Error Function:} &J(\mathbf{w}) &=\frac{1}{2}\sum_{n=1}^N\{\mathbf{w}^T\phi(\mathbf{x}_n)-t_n\}^2 + \frac{\lambda}{2}\mathbf{w}^T\mathbf{w} \tag{6.2}\\
&\text{New Error Function:} &J(\mathbf{a}) &=\frac{1}{2}\mathbf{a}^T\Phi\Phi^T\Phi\Phi^T\mathbf{a}-\mathbf{a}^T\Phi\Phi^T\mathbf{t}+\frac{1}{2}\mathbf{t}^T\mathbf{t}+\frac{\lambda}{2}\mathbf{a}^T\Phi\Phi^T\mathbf{a} \tag{6.5}
\end{align*}$$
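The step from (6.2) to (6.5) can be spelled out as follows. Stacking the $N$ residuals into a vector gives $\Phi\mathbf{w}-\mathbf{t}$, and with $\mathbf{w} = \Phi^T\mathbf{a}$ we have $\Phi\mathbf{w} = \Phi\Phi^T\mathbf{a}$ and $\mathbf{w}^T\mathbf{w} = \mathbf{a}^T\Phi\Phi^T\mathbf{a}$, so

$$\begin{align*}
J(\mathbf{a}) &= \frac{1}{2}(\Phi\Phi^T\mathbf{a}-\mathbf{t})^T(\Phi\Phi^T\mathbf{a}-\mathbf{t}) + \frac{\lambda}{2}\mathbf{a}^T\Phi\Phi^T\mathbf{a}\\
&= \frac{1}{2}\mathbf{a}^T\Phi\Phi^T\Phi\Phi^T\mathbf{a} - \mathbf{a}^T\Phi\Phi^T\mathbf{t} + \frac{1}{2}\mathbf{t}^T\mathbf{t} + \frac{\lambda}{2}\mathbf{a}^T\Phi\Phi^T\mathbf{a}
\end{align*}$$

where the two cross terms have been combined using the symmetry of $\Phi\Phi^T$.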

We now define the Gram matrix $\mathbf{K} = \Phi\Phi^T$, which is an $N\times N$ symmetric matrix with elements

$$\bbox[#ffe0f0]{K_{nm} = \phi(\mathbf{x}_n)^T\phi(\mathbf{x}_m) = k(\mathbf{x}_n,\mathbf{x}_m)} \tag{6.6}$$

where $k(\mathbf{x},\mathbf{x}')$ is called the <font color='red'>*kernel function*</font>. In terms of the Gram matrix, the error function simplifies to

$$J(\mathbf{a}) = \frac{1}{2}\mathbf{a}^T\mathbf{K}\mathbf{K}\mathbf{a}-\mathbf{a}^T\mathbf{K}\mathbf{t}+\frac{1}{2}\mathbf{t}^T\mathbf{t}+\frac{\lambda}{2}\mathbf{a}^T\mathbf{K}\mathbf{a} \tag{6.7}$$

Setting the gradient of $J(\mathbf{a})$ with respect to $\mathbf{a}$ to zero, we obtain the following solution

$$\mathbf{a} = (\mathbf{K}+\lambda\mathbf{I}_N)^{-1}\mathbf{t} \tag{6.8}$$
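As a short worked step, the gradient of the error function written in terms of $\mathbf{K}$ can be factored using the symmetry of $\mathbf{K}$:

$$\nabla_{\mathbf{a}}J(\mathbf{a}) = \mathbf{K}\mathbf{K}\mathbf{a} - \mathbf{K}\mathbf{t} + \lambda\mathbf{K}\mathbf{a} = \mathbf{K}\left\{(\mathbf{K}+\lambda\mathbf{I}_N)\mathbf{a} - \mathbf{t}\right\} = 0$$

The gradient vanishes whenever $(\mathbf{K}+\lambda\mathbf{I}_N)\mathbf{a} = \mathbf{t}$. For $\lambda > 0$ the matrix $\mathbf{K}+\lambda\mathbf{I}_N$ is invertible because $\mathbf{K}$ is positive semidefinite, which gives the solution (6.8). If $\mathbf{K}$ is singular, other stationary points exist, but they differ from (6.8) only by a vector in the null space of $\mathbf{K}$, which is also the null space of $\Phi^T$, so they yield the same $\mathbf{w} = \Phi^T\mathbf{a}$.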

If we substitute this back into the linear regression model, we obtain the following prediction for a new input $\mathbf{x}$

$$\bbox[#ffe0f0]{y(\mathbf{x}) = \mathbf{k}(\mathbf{x})^T(\mathbf{K}+\lambda\mathbf{I}_N)^{-1}\mathbf{t}} \tag{6.9}$$

where we have defined the vector $\mathbf{k}(\mathbf{x})$ with elements $k_n(\mathbf{x}) = k(\mathbf{x}_n,\mathbf{x})$.
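The dual solution can be compared numerically with the usual regularized least-squares solution $\mathbf{w} = (\Phi^T\Phi+\lambda\mathbf{I}_M)^{-1}\Phi^T\mathbf{t}$. The sketch below assumes one-dimensional inputs, a polynomial basis $\phi_j(x) = x^j$, and synthetic sine data; these choices, and all names in the code, are illustrative assumptions rather than part of the derivation.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, lam = 20, 4, 0.1

def phi(x):
    """Hypothetical polynomial basis: row n is [1, x_n, x_n**2, ..., x_n**(M-1)]."""
    return np.vander(np.atleast_1d(x), M, increasing=True)

# Synthetic training data (assumed for illustration).
x_train = rng.uniform(-1.0, 1.0, size=N)
t = np.sin(np.pi * x_train) + 0.1 * rng.normal(size=N)

Phi = phi(x_train)   # N x M design matrix
K = Phi @ Phi.T      # N x N Gram matrix, K_nm = phi(x_n)^T phi(x_m)

# Dual solution (6.8) and the weights recovered from it via (6.3).
a = np.linalg.solve(K + lam * np.eye(N), t)
w_dual = Phi.T @ a

# Primal regularized least-squares solution for comparison.
w_primal = np.linalg.solve(Phi.T @ Phi + lam * np.eye(M), Phi.T @ t)

# Prediction (6.9) at a new input, using only kernel evaluations.
x_new = 0.3
k_vec = Phi @ phi(x_new).ravel()   # k_n(x) = phi(x_n)^T phi(x)
y_dual = k_vec @ a
y_primal = w_primal @ phi(x_new).ravel()

print(np.allclose(w_dual, w_primal), np.isclose(y_dual, y_primal))
```

Both routes give the same weights and the same prediction; the dual route solves an $N\times N$ system, while the primal route solves an $M\times M$ one.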

-----------------------

# Advantages

1. In both the original parametric formulation and the kernel method, the computational cost is dominated by a matrix inversion. The parametric formulation needs to invert an $M\times M$ matrix, whereas the kernel method needs to invert an $N\times N$ matrix. Thus, if $N$ is less than $M$, the kernel method is computationally more efficient.
2. Because the dual formulation is expressed entirely in terms of the kernel function $k(\mathbf{x},\mathbf{x}')$, we can work directly in terms of kernels and avoid the explicit introduction of the feature vector $\phi(\mathbf{x})$, which allows us to implicitly use feature spaces of high, even infinite, dimensionality.
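The second point can be illustrated with a kernel whose feature vector is never formed. The sketch below assumes a Gaussian kernel, one-dimensional inputs, and noise-free sine targets; the function `gaussian_kernel`, the `length_scale` parameter, and the data are hypothetical choices made only for this example.

```python
import numpy as np

def gaussian_kernel(x1, x2, length_scale=0.5):
    """k(x, x') = exp(-(x - x')^2 / (2 * length_scale^2)) for 1-D inputs."""
    diff = x1[:, None] - x2[None, :]
    return np.exp(-diff**2 / (2.0 * length_scale**2))

# Assumed training data: noise-free samples of sin(pi * x).
x_train = np.linspace(-1.0, 1.0, 15)
t = np.sin(np.pi * x_train)
lam = 1e-3

# Gram matrix built directly from the kernel, without any phi(x).
K = gaussian_kernel(x_train, x_train)
a = np.linalg.solve(K + lam * np.eye(len(x_train)), t)   # (6.8)

# Predictions (6.9): each row of the cross-kernel matrix is k(x)^T.
x_new = np.array([-0.5, 0.0, 0.5])
y_new = gaussian_kernel(x_new, x_train) @ a

print(y_new)
```

Only kernel evaluations between pairs of inputs appear in the computation, so the dimensionality of the underlying feature space never enters the code.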


--------------------

# About the remaining sections

1. A kernel function does not have to be obtained from the inner product of basis functions; we can also construct a kernel directly. In Section 6.2, we will talk about techniques for constructing kernel functions.
2. In Section 6.3, we shall introduce a specific kernel.
3. In Section 6.4, we shall look into the kernel method from the perspective of Gaussian processes, which derives the expression (6.9) from probability theory.

