# Linear regression revisited

In this section, we see how Gaussian processes arise from the linear regression model.

1. <div style='background-color:#e0f0ff'>Here is a general linear regression model.
$$y(\mathbf{x}) = \mathbf{w}^T\phi(\mathbf{x}) \tag{6.49}$$
where $\mathbf{w}$ is an $M$-dimensional weight vector and $\phi(\mathbf{x})$ is an $M$-dimensional vector of basis functions.</div>
2. <div style='background-color:#f0e0ff'>Now we consider a prior distribution over $\mathbf{w}$ that is given by an isotropic Gaussian of the form
$$p(\mathbf{w}) = \mathcal{N}(\mathbf{w}|0, \alpha^{-1}I) \tag{6.50}$$
For any specific $\mathbf{x}$, $y(\mathbf{x})$ is also Gaussian distributed, because it is a linear combination of the Gaussian variables $w_1,\cdots,w_M$.</div>
3. <div style='background-color:#fff0e0'>In practice, we are given a training data set at the training data points $\mathbf{X} = (\mathbf{x}_1,\cdots,\mathbf{x}_N)^T$. We are therefore interested in the joint distribution of the function values $\mathbf{y} = (y_1,\cdots, y_N)^T$, where $y_n = y(\mathbf{x}_n)$. This vector is given by
$$\mathbf{y} = \Phi \mathbf{w}\quad \text{where}\quad \Phi = \big(\phi(\mathbf{x}_1),\cdots, \phi(\mathbf{x}_N)\big)^T\tag{6.51}$$
Here, $\mathbf{y}$ is a vector of random variables, one for each input point, with joint density $p\big(y_1, \cdots, y_N\big)$. This joint distribution is Gaussian because $\mathbf{y}$ is a linear transformation of the Gaussian vector $\mathbf{w}$. (Gaussian marginals $p(y_n)$ alone would not guarantee a Gaussian joint distribution.)</div>
4. <div style='background-color:#f0ffe0'>A multivariate Gaussian is fully determined by its mean and covariance.  
$$\color{red}{\begin{align*}
\mathbb{E}[\mathbf{y}] &= \Phi\mathbb{E}[\mathbf{w}] = (0,\cdots,0)^T \tag{6.52}\\
\text{cov}[\mathbf{y}] &= \mathbb{E}\big[(\mathbf{y}-\mathbb{E}[\mathbf{y}])(\mathbf{y}-\mathbb{E}[\mathbf{y}])^T\big]=\mathbb{E}[\mathbf{y}\mathbf{y}^T] = \Phi\mathbb{E}[\mathbf{w}\mathbf{w}^T]\Phi^T = \frac{1}{\alpha}\Phi\Phi^T = \mathbf{K} \tag{6.53}
\end{align*}}$$
where $\mathbf{K}$ is the Gram matrix with elements
$$\color{red}{K_{nm} = k(\mathbf{x}_n, \mathbf{x}_m) = \frac{1}{\alpha}\phi(\mathbf{x}_n)^T\phi(\mathbf{x}_m) \tag{6.54}}$$
and $k(\mathbf{x},\mathbf{x}')$ is the kernel function.</div>
5. <div style='background-color:#ffe0f0'>In this case, $y(\mathbf{x})$ is said to be a Gaussian process, because the values of $y(\mathbf{x})$ at any finite set of input points are jointly Gaussian distributed.</div>
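
The steps above can be checked numerically. The following is a minimal sketch, assuming a hypothetical Gaussian-bump basis (`gaussian_basis`, its `centers` and `width` are illustrative choices, not part of the original notes). It builds $\mathbf{K} = \frac{1}{\alpha}\Phi\Phi^T$ from (6.53) and compares it with a Monte Carlo estimate of $\text{cov}[\mathbf{y}]$ obtained by sampling $\mathbf{w}$ from the prior (6.50).

```python
import numpy as np

def gaussian_basis(x, centers, width=0.5):
    """Design matrix Phi (N x M) of Gaussian bumps; a hypothetical basis choice."""
    x = np.asarray(x, dtype=float).reshape(-1, 1)               # (N, 1)
    centers = np.asarray(centers, dtype=float).reshape(1, -1)   # (1, M)
    return np.exp(-0.5 * ((x - centers) / width) ** 2)          # (N, M)

def gram_from_basis(Phi, alpha):
    """K = (1/alpha) Phi Phi^T, as in (6.53) and (6.54)."""
    return Phi @ Phi.T / alpha

alpha = 2.0
x_train = np.linspace(-1.0, 1.0, 5)   # N = 5 input points
centers = np.linspace(-1.0, 1.0, 3)   # M = 3 basis functions
Phi = gaussian_basis(x_train, centers)
K = gram_from_basis(Phi, alpha)

# Monte Carlo check: draw w ~ N(0, alpha^{-1} I), form y = Phi w, estimate cov[y].
rng = np.random.default_rng(0)
W = rng.normal(scale=alpha ** -0.5, size=(200_000, centers.size))  # rows are draws of w
Y = W @ Phi.T                                                      # rows are draws of y
K_mc = Y.T @ Y / len(Y)                                            # estimates E[y y^T]; the mean is zero

# The gap is small and shrinks as the number of draws grows.
print(np.max(np.abs(K_mc - K)))
```

Note that $\mathbf{K}$ is $N \times N$ but has rank at most $M$ in this construction, since it is built from only $M$ basis functions.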

In general, a Gaussian process is defined as a probability distribution over functions $y(\mathbf{x})$ such that the set of values of $y(\mathbf{x})$ evaluated at an arbitrary set of points $\mathbf{x}_1,\cdots,\mathbf{x}_N$ jointly have a Gaussian distribution. 

In most applications, we will not have any prior knowledge about the mean of $y(\mathbf{x})$, and so by symmetry we take it to be zero. This is equivalent to choosing the mean of the prior over weight values $p(\mathbf{w}|\alpha)$ to be zero in the basis function viewpoint.

To evaluate $y(\mathbf{x})$, what matters is how the values at different points relate to one another, which is given by $\text{cov}[\mathbf{y}]$. This covariance ultimately determines the form of $y(\mathbf{x})$.

------------

# Add the noise

The vector $\mathbf{y}$ discussed above describes only the function values $y(\mathbf{x})$ themselves. We now need to take account of the noise on the observed target values, which are given by

$$t_n = y_n + \epsilon_n \tag{6.57}$$

Here we assume that the noise at each point has a Gaussian distribution, so that

$$p(t_n|y_n) = \mathcal{N}(t_n|y_n, \beta^{-1}) \tag{6.58}$$

where $\beta$ is a hyperparameter representing the precision of the noise. Because the noise is independent for each data point, the joint distribution of the targets $\mathbf{t} = (t_1,\cdots,t_N)^T$ conditioned on $\mathbf{y}$ is an isotropic Gaussian

$$\left.\begin{array}{ll}
p(\mathbf{t}|\mathbf{y}) = \mathcal{N}(\mathbf{t}|\mathbf{y}, \beta^{-1}I_N) & (6.59)\\
p(\mathbf{y}) = \mathcal{N}(\mathbf{y}|0, \mathbf{K}) &(6.60)\\
p(\mathbf{t}) = \int p(\mathbf{t}|\mathbf{y})p(\mathbf{y})d\mathbf{y} &(6.61)
\end{array}\right\}
\overset{(2.115)}{\Rightarrow} p(\mathbf{t}) = \mathcal{N}(\mathbf{t}|0, \mathbf{C})\quad \text{where}\quad C(\mathbf{x}_n,\mathbf{x}_m)=k(\mathbf{x}_n,\mathbf{x}_m)+\beta^{-1}\delta_{nm} \tag{6.62}$$

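where $\delta_{nm}=\left\{\begin{array}{ll}1 &n=m\\ 0 &\text{otherwise}\end{array}\right.$. The covariances simply add because the two Gaussian sources of randomness, the function values $y(\mathbf{x})$ and the noise $\epsilon$, are independent.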
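
As a rough numerical illustration of (6.62), the sketch below forms $\mathbf{C} = \mathbf{K} + \beta^{-1}I_N$ and compares it with the empirical covariance of targets sampled in two stages, first $\mathbf{y}$ and then the noise. The 3-point Gram matrix, the value of `beta`, and the helper name `noisy_covariance` are all assumptions made for the example.

```python
import numpy as np

def noisy_covariance(K, beta):
    """C = K + beta^{-1} I_N, as in (6.62)."""
    return K + np.eye(K.shape[0]) / beta

# Hypothetical 3-point Gram matrix (any symmetric positive semidefinite K would do).
K = np.array([[1.0, 0.8, 0.3],
              [0.8, 1.0, 0.8],
              [0.3, 0.8, 1.0]])
beta = 25.0   # noise precision, i.e. noise variance 1/beta = 0.04
C = noisy_covariance(K, beta)

# Sample t in two stages (y ~ N(0, K), then t = y + eps) and estimate cov[t].
rng = np.random.default_rng(1)
n_draws = 200_000
Y = rng.multivariate_normal(np.zeros(3), K, size=n_draws)   # rows are draws of y
T = Y + rng.normal(scale=beta ** -0.5, size=Y.shape)        # add independent noise
C_mc = T.T @ T / n_draws                                    # estimates E[t t^T]; the mean is zero

# The gap is small and shrinks as n_draws grows.
print(np.max(np.abs(C_mc - C)))
```

Only the diagonal of $\mathbf{K}$ changes: the noise inflates each variance by $\beta^{-1}$ and leaves the covariances between different points untouched.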

---------------

# Widely used kernel function
We mentioned in Section 6.2 that kernels can be constructed using a variety of techniques. Here is one commonly used kernel function.

$$k(\mathbf{x}_n, \mathbf{x}_m) = \underbrace{\theta_0 \exp\left\{-\frac{\theta_1}{2}\|\mathbf{x}_n-\mathbf{x}_m\|^2\right\}}_{\text{non-linear}} + \underbrace{\theta_2}_{\text{bias}} + \underbrace{\theta_3\mathbf{x}_n^T\mathbf{x}_m}_{\text{linear}} \tag{6.63}$$
where $\boldsymbol{\theta} = (\theta_0,\theta_1,\theta_2,\theta_3)^T$ are the hyperparameters.
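
The kernel (6.63) is straightforward to write in NumPy. The sketch below is one possible implementation, not a reference one: the function name `kernel_663`, the hyperparameter values, and the input grid are assumptions chosen for illustration. It also draws a few functions from the zero-mean Gaussian process prior $\mathcal{N}(\mathbf{0}, \mathbf{K})$ via a Cholesky factor.

```python
import numpy as np

def kernel_663(X1, X2, theta):
    """Kernel (6.63) between the rows of X1 (N1 x D) and X2 (N2 x D)."""
    theta0, theta1, theta2, theta3 = theta
    X1 = np.atleast_2d(X1).astype(float)
    X2 = np.atleast_2d(X2).astype(float)
    # Squared Euclidean distances ||x_n - x_m||^2 via broadcasting, shape (N1, N2).
    sq_dist = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return theta0 * np.exp(-0.5 * theta1 * sq_dist) + theta2 + theta3 * (X1 @ X2.T)

theta = (1.0, 4.0, 0.0, 0.0)                    # illustrative hyperparameter values
X = np.linspace(-1.0, 1.0, 50).reshape(-1, 1)   # 50 one-dimensional inputs
K = kernel_663(X, X, theta)

# Draw three functions from the zero-mean GP prior N(0, K).
# A small jitter on the diagonal keeps the Cholesky factorization numerically stable.
rng = np.random.default_rng(0)
L = np.linalg.cholesky(K + 1e-6 * np.eye(len(X)))
samples = L @ rng.standard_normal((len(X), 3))  # each column is one sampled function
```

With $\theta_2 = \theta_3 = 0$ only the exponential term is active; increasing $\theta_1$ makes the sampled functions vary more rapidly, while $\theta_0$ scales their overall amplitude.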

----------------

# Predictive Distribution




-----------------

# Learning the hyperparameters







---------------

# Automatic relevance determination