# Curve fitting with likelihood
### Curve fitting function
We return to the curve fitting problem from Introduction 1.1. The training data comprise $N$ input values and their corresponding target values  
$\begin{align*}\mathbb{x}&=(x_1,x_2,\cdots,x_N)^T\\ \mathbb{t}&=(t_1,t_2,\cdots,t_N)^T\end{align*}$  
Suppose these data are generated from a polynomial curve  
$$y(x,\mathbf{w})=w_0+w_1x+w_2x^2+w_3x^3+\cdots$$

For a fixed point $x=x_0$, the curve gives the corresponding value $y(x_0,\mathbf{w})$. Now place a Gaussian distribution over $t$ along the line $x=x_0$, with mean $\mu=y(x_0,\mathbf{w})$ and variance $\sigma^2=\beta^{-1}$, where $\beta$ is the precision (inverse variance) of this distribution. Thus we have  
$$p(t|x,\mathbf{w},\beta)=\mathcal{N}(t|y(x,\mathbf{w}), \beta^{-1})$$
which expresses the probability of $t$ given the curve $y(x,\mathbf{w})$ and the parameter $\beta$. We have the training data $\mathbb{x}$, $\mathbb{t}$ and assume the precision parameter $\beta$ to be fixed. Assuming the data points are drawn independently from this distribution, the likelihood function is  
$$p(\mathbb{t}|\mathbb{x},\mathbf{w},\beta)=\prod_{n=1}^N\mathcal{N}(t_n|y(x_n,\mathbf{w}),\beta^{-1})$$

To maximize the likelihood, we take the logarithm of both sides of this equation  
$$\ln p(\mathbb{t}|\mathbb{x},\mathbf{w},\beta)=-\frac{\beta}{2}\sum_{n=1}^N\{y(x_n,\mathbf{w})-t_n\}^2+\frac{N}{2}\ln\beta-\frac{N}{2}\ln(2\pi)$$
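
The step from the product to this logarithm can be spelled out by writing the Gaussian density explicitly and summing the logarithm of each factor. The following is a worked version of that step, using only the definitions above:

$$\begin{align*}
\mathcal{N}(t_n|y(x_n,\mathbf{w}),\beta^{-1})&=\left(\frac{\beta}{2\pi}\right)^{1/2}\exp\left\{-\frac{\beta}{2}\{y(x_n,\mathbf{w})-t_n\}^2\right\}\\
\ln\prod_{n=1}^N\mathcal{N}(t_n|y(x_n,\mathbf{w}),\beta^{-1})&=\sum_{n=1}^N\left[-\frac{\beta}{2}\{y(x_n,\mathbf{w})-t_n\}^2+\frac{1}{2}\ln\beta-\frac{1}{2}\ln(2\pi)\right]
\end{align*}$$

Summing the two constant terms over the $N$ data points gives the factors $\frac{N}{2}\ln\beta$ and $\frac{N}{2}\ln(2\pi)$.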

Up to the factor $-\beta/2$, the only part of the log likelihood that depends on $\mathbf{w}$ is  
$$\sum_{n=1}^{N}\{y(x_n,\mathbf{w})-t_n\}^2$$
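which is the sum-of-squares error function, exactly the same as the one we used in 1.1. Because this term enters the log likelihood with a negative sign, maximizing the likelihood with respect to $\mathbf{w}$ is equivalent to minimizing this error, and the best parameters $\mathbf{w}_{ML}$ are obtained from it.  
With the parameters $\mathbf{w}_{ML}$, we build the curve fitting function  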
$$y(x)=w_{ML0}+w_{ML1}x+w_{ML2}x^2+\cdots$$
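
As a rough illustration of how $\mathbf{w}_{ML}$ could be computed, the sketch below minimizes the sum-of-squares error with NumPy's least-squares solver. The function names `fit_polynomial_ml` and `y`, the polynomial order, and the synthetic sine-plus-noise data are assumptions made for this example and are not part of the notes.

```python
import numpy as np

def fit_polynomial_ml(x, t, M):
    """Return w_ML, the minimizer of the sum-of-squares error, for an order-M polynomial."""
    # Design matrix: row n is (1, x_n, x_n^2, ..., x_n^M).
    Phi = np.vander(x, M + 1, increasing=True)
    # Least squares solves min_w ||Phi w - t||^2, i.e. the sum-of-squares error.
    w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)
    return w_ml

def y(x, w):
    """Evaluate y(x, w) = w_0 + w_1 x + w_2 x^2 + ... at the points x."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    return np.vander(x, len(w), increasing=True) @ w

# Synthetic training data (assumed for illustration only).
rng = np.random.default_rng(0)
x_train = np.linspace(0.0, 1.0, 10)
t_train = np.sin(2 * np.pi * x_train) + rng.normal(scale=0.3, size=x_train.shape)

w_ml = fit_polynomial_ml(x_train, t_train, M=3)
print("w_ML =", w_ml)
print("y(0.5) =", y(0.5, w_ml))
```

The elements of `w_ml` are ordered from $w_{ML0}$ upwards, matching the curve fitting function above.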

### Maximum likelihood
To complete the maximum likelihood solution, we also have to determine the precision parameter $\beta$ of the Gaussian conditional distribution. Maximizing the likelihood function with respect to $\beta$ gives  
$$\frac{1}{\beta_{ML}}=\frac{1}{N}\sum_{n=1}^N\{y(x_n,\mathbf{w}_{ML})-t_n\}^2$$
Then the predictive distribution with the maximum likelihood parameters is  
$$p(t|x,\mathbf{w}_{ML},\beta_{ML})=\mathcal{N}(t|y(x,\mathbf{w}_{ML}),\beta^{-1}_{ML})$$
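
The estimate of $\beta$ can be sketched in a few lines: fit $\mathbf{w}_{ML}$ first, then take the reciprocal of the mean squared residual. The helper names `fit_ml` and `predictive_density` and the straight-line synthetic data are assumptions chosen for this example.

```python
import numpy as np

def fit_ml(x, t, M):
    """Return (w_ML, beta_ML) for an order-M polynomial with Gaussian noise."""
    Phi = np.vander(x, M + 1, increasing=True)
    w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)
    residuals = Phi @ w_ml - t
    # 1 / beta_ML is the mean squared residual of the fitted curve.
    beta_ml = 1.0 / np.mean(residuals ** 2)
    return w_ml, beta_ml

def predictive_density(t, x, w_ml, beta_ml):
    """Evaluate N(t | y(x, w_ML), 1 / beta_ML) for scalar t and x."""
    mean = np.polyval(w_ml[::-1], x)  # polyval expects the highest power first
    return np.sqrt(beta_ml / (2 * np.pi)) * np.exp(-0.5 * beta_ml * (t - mean) ** 2)

# Synthetic data (assumed): a straight line plus noise with standard deviation 0.2,
# so the true precision is 1 / 0.2**2 = 25.
rng = np.random.default_rng(1)
x_train = rng.uniform(0.0, 1.0, size=200)
t_train = 0.5 - x_train + rng.normal(scale=0.2, size=x_train.shape)

w_ml, beta_ml = fit_ml(x_train, t_train, M=1)
print("beta_ML =", beta_ml)
print("p(t=0 | x=0.5) =", predictive_density(0.0, 0.5, w_ml, beta_ml))
```

Note that $\mathbf{w}_{ML}$ does not depend on $\beta$, which is why the two parameters can be estimated one after the other.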

So far, apart from also estimating the noise precision $\beta$ and obtaining a predictive distribution, the result is no different from what we achieved in 1.1. Next we take a step towards a more Bayesian approach.  

--------------------------

# Bayesian curve fitting

### Curve fitting function
The likelihood function is expressed as  
<font color='blue'>$$p(\mathbb{t}|\mathbb{x},\mathbf{w},\beta)$$</font>
**<font color='blue'>which can be considered to be the probability distribution of observing the training data.</font>**  
Now, we introduce a **<font color='orange'>prior distribution</font>** over the polynomial coefficients $\mathbf{w}$. For simplicity, let us consider a Gaussian distribution of the form  
<font color='orange'>$$p(\mathbf{w}|\alpha)=\mathcal{N}(\mathbf{w}|\mathbf{0}, \alpha^{-1}\mathbf{I})=\left(\frac{\alpha}{2\pi}\right)^{(M+1)/2}\exp\left\{-\frac{\alpha}{2}\mathbf{w}^T\mathbf{w}\right\}$$</font>
where $\alpha$ is the precision of the distribution, and $M+1$ is the total number of elements in the vector $\mathbf{w}$ for an $M^{th}$ order polynomial.  
The expectation of this distribution is $\mathbf{0}$, which means that we expect $\mathbf{w}$ to lie near $\mathbf{0}$.  
The covariance of this distribution is $\alpha^{-1}\mathbf{I}$, which is controlled by the hand-chosen precision parameter $\alpha$.  

Using Bayes's theorem, the **<font color='red'>posterior distribution</font>** for $\mathbf{w}$ is proportional to the product of the prior distribution and the likelihood function  
$$\color{Red}{p(\mathbf{w}|\mathbb{x},\mathbb{t},\alpha,\beta)}\propto \color{Blue}{p(\mathbb{t}|\mathbb{x},\mathbf{w},\beta)}\color{Orange}{p(\mathbf{w}|\alpha)}$$
We can now determine $\mathbf{w}$ by finding its most probable value given the data, in other words by maximizing the posterior distribution (in practice, by minimizing its negative logarithm). This technique is called maximum posterior (**MAP**).  
The maximum of the posterior is given by the minimum of  
$$\frac{\beta}{2}\sum_{n=1}^N\{y(x_n,\mathbf{w})-t_n\}^2+\frac{\alpha}{2}\mathbf{w}^T\mathbf{w}$$
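where the first term is the sum-of-squares error function and the second term is the **regularization** term. Maximizing the posterior is therefore equivalent to minimizing a regularized sum-of-squares error with regularization parameter $\lambda=\alpha/\beta$.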
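
To see where this objective comes from, take the negative logarithm of the posterior and substitute the log likelihood and the Gaussian prior given earlier. A worked version of that step:

$$\begin{align*}
-\ln p(\mathbf{w}|\mathbb{x},\mathbb{t},\alpha,\beta)&=-\ln p(\mathbb{t}|\mathbb{x},\mathbf{w},\beta)-\ln p(\mathbf{w}|\alpha)+\text{const}\\
&=\frac{\beta}{2}\sum_{n=1}^N\{y(x_n,\mathbf{w})-t_n\}^2+\frac{\alpha}{2}\mathbf{w}^T\mathbf{w}+\text{const}
\end{align*}$$

Here $\text{const}$ collects every term that does not depend on $\mathbf{w}$, so it plays no role in the minimization.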

Derivation of the proportionality: <a href="https://math.stackexchange.com/questions/171226/stuck-with-handling-of-conditional-probability-in-bishops-pattern-recognition">Stuck with handling of conditional probability in Bishop's “Pattern Recognition and Machine Learning” (1.66)</a>
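
The MAP estimate has a closed form, because the objective $\frac{\beta}{2}\sum_n\{y(x_n,\mathbf{w})-t_n\}^2+\frac{\alpha}{2}\mathbf{w}^T\mathbf{w}$ is quadratic in $\mathbf{w}$. The sketch below sets its gradient to zero and solves the resulting linear system. The function names, the synthetic data, and the values of $\alpha$ and $\beta$ are assumptions chosen only for illustration.

```python
import numpy as np

def map_objective(w, Phi, t, alpha, beta):
    """beta/2 * sum-of-squares error + alpha/2 * w^T w."""
    r = Phi @ w - t
    return 0.5 * beta * (r @ r) + 0.5 * alpha * (w @ w)

def fit_polynomial_map(x, t, M, alpha, beta):
    """Return w_MAP for an order-M polynomial under a zero-mean Gaussian prior."""
    Phi = np.vander(x, M + 1, increasing=True)
    # Setting the gradient of the objective to zero gives
    # (alpha * I + beta * Phi^T Phi) w = beta * Phi^T t.
    A = alpha * np.eye(M + 1) + beta * Phi.T @ Phi
    return np.linalg.solve(A, beta * Phi.T @ t)

# Synthetic training data and hyperparameters (assumed for illustration only).
rng = np.random.default_rng(0)
x_train = np.linspace(0.0, 1.0, 10)
t_train = np.sin(2 * np.pi * x_train) + rng.normal(scale=0.3, size=x_train.shape)
alpha, beta = 5e-3, 11.1

w_map = fit_polynomial_map(x_train, t_train, M=9, alpha=alpha, beta=beta)
print("w_MAP =", w_map)
```

A larger $\alpha$ pulls the coefficients more strongly towards $\mathbf{0}$, which is the effect the prior is meant to have.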

### Predictive Distribution
In the curve fitting problem, we are given the training data $\mathbb{x}$ and $\mathbb{t}$, along with a new test point $x$, and our goal is to predict the value of $t$. We therefore wish to evaluate the predictive distribution $p(t|x,\mathbb{x},\mathbb{t})$. Here we shall assume that the parameters $\alpha$ and $\beta$ are fixed and known in advance.  
A Bayesian treatment simply corresponds to a consistent application of the sum and product rules of probability, which allow the predictive distribution to be written in the form  
$$p(t|x,\mathbb{x},\mathbb{t})=\int p(t|x,\mathbf{w})p(\mathbf{w}|\mathbb{x},\mathbb{t})d\mathbf{w}$$
where the dependence on $\alpha$ and $\beta$, which take fixed values, is omitted to simplify the notation.  
$p(t|x,\mathbf{w})$ is the polynomial curve distribution $p(t|x,\mathbf{w},\beta)$ with $\beta$ omitted.  
$p(\mathbf{w}|\mathbb{x},\mathbb{t})$ is the posterior distribution $p(\mathbf{w}|\mathbb{x},\mathbb{t},\alpha,\beta)$ with $\alpha$ and $\beta$ omitted, where $p(\mathbf{w}|\mathbb{x},\mathbb{t},\alpha,\beta)$ can be obtained by normalizing $p(\mathbb{t}|\mathbb{x},\mathbf{w},\beta)p(\mathbf{w}|\alpha)$.  

As a result, the predictive distribution is given by a Gaussian of the form  
$$p(t|x,\mathbb{x},\mathbb{t})=\mathcal{N}(t|m(x),s^2(x))$$
where the mean and variance are given by  
$$\begin{align*}
m(x) &=\beta\phi(x)^TS\sum_{n=1}^N\phi(x_n)t_n\\
s^2(x) &=\beta^{-1}+\phi(x)^TS\phi(x)
\end{align*}$$
Here the matrix $S$ is given by  
$$S^{-1}=\alpha \mathbf{I}+\beta\sum_{n=1}^N\phi(x_n)\phi(x_n)^T$$
where $\mathbf{I}$ is the unit matrix and we have defined the vector $\phi(x)$ with elements $\phi_i(x)=x^i$ for $i=0,\cdots,M$ (one element for each power of $x$).  
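
The formulas for $m(x)$, $s^2(x)$ and $S$ translate almost directly into matrix operations. The following is a minimal sketch under assumed settings: the function names `phi` and `predictive`, the synthetic data, and the values of $M$, $\alpha$ and $\beta$ are all chosen for illustration.

```python
import numpy as np

def phi(x, M):
    """Rows are phi(x)^T = (1, x, x^2, ..., x^M); output shape (len(x), M + 1)."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    return np.vander(x, M + 1, increasing=True)

def predictive(x_new, x_train, t_train, M, alpha, beta):
    """Return the predictive mean m(x) and variance s^2(x) at each point of x_new."""
    Phi = phi(x_train, M)                                # shape (N, M + 1)
    # S^{-1} = alpha * I + beta * sum_n phi(x_n) phi(x_n)^T
    S = np.linalg.inv(alpha * np.eye(M + 1) + beta * Phi.T @ Phi)
    P = phi(x_new, M)                                    # shape (K, M + 1)
    # m(x) = beta * phi(x)^T S sum_n phi(x_n) t_n
    mean = beta * P @ S @ (Phi.T @ t_train)
    # s^2(x) = 1 / beta + phi(x)^T S phi(x), computed row by row
    var = 1.0 / beta + np.sum((P @ S) * P, axis=1)
    return mean, var

# Synthetic training data and hyperparameters (assumed for illustration only).
rng = np.random.default_rng(0)
x_train = np.linspace(0.0, 1.0, 10)
t_train = np.sin(2 * np.pi * x_train) + rng.normal(scale=0.3, size=x_train.shape)
M, alpha, beta = 5, 5e-3, 11.1

x_grid = np.linspace(0.0, 1.0, 5)
mean, var = predictive(x_grid, x_train, t_train, M, alpha, beta)
for xg, m, s2 in zip(x_grid, mean, var):
    print(f"x={xg:.2f}  m(x)={m:.3f}  s(x)={np.sqrt(s2):.3f}")
```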

Looking at the variance analytically, the first term of $s^2(x)$ represents the uncertainty in the predicted value of $t$ due to the noise on the target variables. The second term arises from the uncertainty in the parameters $\mathbf{w}$ and is a consequence of the Bayesian treatment.
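
The two contributions to $s^2(x)$ can be inspected separately. The sketch below evaluates both terms at one test point for nested training sets of increasing size: the noise term $\beta^{-1}$ stays fixed, while the parameter term $\phi(x)^TS\phi(x)$ cannot grow as more inputs are added. The variance depends only on the inputs $x_n$, not on the targets, so no targets are generated. The function name `variance_terms` and all numerical settings are assumptions made for this example.

```python
import numpy as np

def variance_terms(x_new, x_train, M, alpha, beta):
    """Return (noise term, parameter term) of s^2(x) at a single test point x_new."""
    Phi = np.vander(x_train, M + 1, increasing=True)
    S = np.linalg.inv(alpha * np.eye(M + 1) + beta * Phi.T @ Phi)
    p = x_new ** np.arange(M + 1)          # phi(x_new) = (1, x, ..., x^M)
    return 1.0 / beta, float(p @ S @ p)

# Assumed settings: inputs drawn uniformly from [0, 1], nested subsets of size N.
rng = np.random.default_rng(0)
x_all = rng.uniform(0.0, 1.0, size=200)

results = []
for N in (5, 20, 200):
    noise, param = variance_terms(0.5, x_all[:N], M=3, alpha=5e-3, beta=11.1)
    results.append((noise, param))
    print(f"N={N:3d}  noise term={noise:.4f}  parameter term={param:.6f}")
```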

Derivation: see (3.53) and (3.54) in 3.3 Bayesian Linear Regression.