# Bayesian Approach to Curve Fitting

The goal in the curve fitting problem is to be able to make predictions for the target variable $ t $ given some new value of the input variable $ x $, on the basis of a set of training data comprising $ N $ input values $ \mathbf{x} = (x_1, \ldots, x_N)^T $ and their corresponding target values $\mathbf{t} = (t_1, \ldots, t_N)^T $.

We can express our uncertainty over the value of the target variable using a probability distribution. For this purpose, we shall assume that, given the value of $ x $, the corresponding value of $ t $ has a Gaussian distribution with a mean equal to the value $ y(x, w)$ of the polynomial curve. Thus, we have:

$$
p(t | x, w, \beta) = \mathcal{N}(t | y(x, w), \beta^{-1}) \tag{1.1}
$$

We now use the training data $ \{\mathbf{x}, \mathbf{t}\} $ to determine the values of the unknown parameters $w$ and $ \beta $ by maximum likelihood.

If the data are assumed to be drawn independently from the distribution (1.1), then the likelihood function is given by:

$$
p(\mathbf{t} | \mathbf{x}, w, \beta) = \prod_{n=1}^N \mathcal{N}(t_n | y(x_n, w), \beta^{-1}) \tag{1.2}
$$

As we did in the case of the simple Gaussian distribution earlier, it is convenient to maximize the logarithm of the likelihood function.


![image 3](image3.png)


Image taken from Bishop's book.

# Derivation of the Log-Likelihood for a Normally Distributed Random Variable  

For a normally distributed random variable $t_n \sim \mathcal{N}(y(x_n, w), \beta^{-1})$, the probability density function is given by:  

$$  
p(t_n \mid y(x_n, w), \beta^{-1}) = \left( \frac{\beta}{2\pi} \right)^{\frac{1}{2}} \exp\left\{ -\frac{\beta}{2} \left( t_n - y(x_n, w) \right)^2 \right\}.  
$$  

This is the Gaussian density function with mean $y(x_n, w)$ and precision $\beta$ (the precision is the inverse of the variance).  

---

## Step 1: Express the Likelihood Function Explicitly  

Using the Gaussian form for each $p(t_n \mid y(x_n, w), \beta^{-1})$ and substituting it into the likelihood function $p(t \mid x, w, \beta)$, we get:  

$$  
p(t \mid x, w, \beta) = \prod_{n=1}^N \left( \frac{\beta}{2\pi} \right)^{\frac{1}{2}} \exp\left\{ -\frac{\beta}{2} \left( t_n - y(x_n, w) \right)^2 \right\}.  
$$  

This is the expanded form of the likelihood.  

---

## Step 2: Take the Logarithm of the Likelihood Function  

Since it is generally easier to maximize the log-likelihood (logarithm of the likelihood function), we take the logarithm on both sides:  

$$  
\ln p(t \mid x, w, \beta) = \sum_{n=1}^N \left[ \frac{1}{2} \ln \left( \frac{\beta}{2\pi} \right) - \frac{\beta}{2} \left( t_n - y(x_n, w) \right)^2 \right].  
$$  

Now, split the terms inside the summation:  

$$  
\ln p(t \mid x, w, \beta) = \frac{N}{2} \ln \left( \frac{\beta}{2\pi} \right) - \frac{\beta}{2} \sum_{n=1}^N \left( t_n - y(x_n, w) \right)^2.  
$$  

---

## Step 3: Simplify the Terms  

Expand the first term:  

$$  
\frac{N}{2} \ln \left( \frac{\beta}{2\pi} \right) = \frac{N}{2} \ln \beta - \frac{N}{2} \ln (2\pi).  
$$  

The second term remains as it is:  

$$  
-\frac{\beta}{2} \sum_{n=1}^N \left( t_n - y(x_n, w) \right)^2.  
$$  

So, we can write the final expression for the log-likelihood as:  

$$  
\ln p(t \mid x, w, \beta) = -\frac{\beta}{2} \sum_{n=1}^N \left( t_n - y(x_n, w) \right)^2 + \frac{N}{2} \ln \beta - \frac{N}{2} \ln (2\pi).  
$$  
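
As a quick sanity check on the algebra, the sketch below compares the final closed-form expression with a direct term-by-term sum of Gaussian log-densities. The function names and the toy numbers are illustrative assumptions, not part of the original derivation.

```python
import math

def gaussian_log_pdf(t, mean, beta):
    """Log of N(t | mean, beta^{-1}), where beta is the precision."""
    return 0.5 * math.log(beta / (2.0 * math.pi)) - 0.5 * beta * (t - mean) ** 2

def log_likelihood(predictions, targets, beta):
    """Closed-form log-likelihood: -beta/2 * SSE + N/2 * ln(beta) - N/2 * ln(2*pi)."""
    n = len(targets)
    sse = sum((t - y) ** 2 for y, t in zip(predictions, targets))
    return -0.5 * beta * sse + 0.5 * n * math.log(beta) - 0.5 * n * math.log(2.0 * math.pi)

# Hypothetical toy data: model predictions y(x_n, w) and observed targets t_n.
predictions = [0.1, 0.9, 2.1, 2.9]
targets = [0.0, 1.0, 2.0, 3.0]
beta = 4.0  # assumed noise precision

closed_form = log_likelihood(predictions, targets, beta)
term_by_term = sum(gaussian_log_pdf(t, y, beta) for y, t in zip(predictions, targets))
print(closed_form, term_by_term)  # the two values should agree up to rounding
```

The two numbers agreeing confirms that taking the logarithm of the product and collecting terms loses nothing.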


# Maximizing the Log-Likelihood with Respect to $w$

To find the Maximum Likelihood Estimator (MLE) for $w$, we focus on the term involving $y(x_n, w)$, which depends on $w$. Let’s differentiate the log-likelihood with respect to $w$ and set the derivative equal to zero.

---

## Focus on the $w$-Dependent Part  

The $w$-dependent part of the log-likelihood is:  

$$  
-\frac{\beta}{2} \sum_{n=1}^N \left( y(x_n, w) - t_n \right)^2.  
$$  

Because $\beta > 0$, maximizing this term with respect to $w$ is equivalent to minimizing the sum of squared errors (SSE), which is a common objective in regression.  

---

## Minimizing the SSE  

To maximize the log-likelihood, we minimize the SSE:  

$$  
\min_w \sum_{n=1}^N \left( y(x_n, w) - t_n \right)^2.  
$$  

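This is a classic least squares problem. Solving it gives the optimal weights $w^*$.

To solve it in closed form, we assume the model is linear in its parameters, $ y(x_n, w) = w^T \phi(x_n) $, which includes the polynomial curve as a special case. The **sum of squared errors (SSE)** to minimize then becomes: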

$$
\min_{w} \sum_{n=1}^{N} \left( w^T \phi(x_n) - t_n \right)^2
$$

### Where:
- $w $ is the **weight vector** (parameters to be optimized),  
- $\phi(x_n) $ is the **feature vector** corresponding to the input $ x_n $,  
- $ t_n $ is the **target (output)** corresponding to $ x_n $,  
- $ w^T \phi(x_n) $ represents the **predicted output** for input $ x_n $.  

---

## Understanding the Design Matrix $ \Phi $

The matrix $ \Phi $ is called the **design matrix**, and each row of $ \Phi $ corresponds to a feature vector $ \phi(x_n) $, derived from the input data $ x_n $.

### Step 1: What is $ \phi(x_n) $ ?
The vector $ \phi(x_n) $ is the **feature vector** associated with the input $ x_n $ .

Suppose we have a set of $ N $ data points $ \{ x_1, x_2, \ldots, x_N \} $, and each $ x_n $ is mapped to a feature vector $ \phi(x_n) $, where:

$$
\phi(x_n) = 
\begin{bmatrix}
\phi_1(x_n) \\
\phi_2(x_n) \\
\vdots \\
\phi_M(x_n)
\end{bmatrix}
$$

with $ M $ components (features). This means that $ \phi(x_n) $ is a vector of length $ M $.

---

### Step 2: Building the Design Matrix  $ \Phi $
Now, we stack these $ N $ feature vectors (each of length $ M $) on top of each other to form the matrix $ \Phi $:

$$
\Phi = 
\begin{bmatrix}
\phi(x_1)^T \\
\phi(x_2)^T \\
\vdots \\
\phi(x_N)^T 
\end{bmatrix}
=
\begin{bmatrix}
\phi_1(x_1) & \phi_2(x_1) & \ldots & \phi_M(x_1) \\
\phi_1(x_2) & \phi_2(x_2) & \ldots & \phi_M(x_2) \\
\vdots & \vdots & \ddots & \vdots \\
\phi_1(x_N) & \phi_2(x_N) & \ldots & \phi_M(x_N) 
\end{bmatrix}
$$

---

### Step 3: Dimensions of $ \Phi $
From this construction, it is clear that:
- There are $ N $ rows in $ \Phi $, each corresponding to one input $ x_n $,
- There are $ M $ columns in $ \Phi $, each corresponding to a feature $ \phi_j(x_n) $ in the feature vector.

Thus, the matrix $ \Phi $ has dimensions $ N \times M $, where:
- $ N $: Number of data points (inputs),  
- $ M $: Number of features in each feature vector.  
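
To make the construction concrete, here is a minimal sketch that builds $ \Phi $ for a polynomial basis. The choice $ \phi_j(x) = x^{j-1} $ and the helper name `polynomial_design_matrix` are assumptions for illustration; the text above does not fix a particular basis.

```python
import numpy as np

def polynomial_design_matrix(x, num_features):
    """Return the N x M matrix whose n-th row is (1, x_n, x_n^2, ..., x_n^(M-1))."""
    x = np.asarray(x, dtype=float)
    # increasing=True orders the columns from the constant term upward.
    return np.vander(x, N=num_features, increasing=True)

x = np.array([0.0, 1.0, 2.0, 3.0])                 # N = 4 inputs
Phi = polynomial_design_matrix(x, num_features=3)  # M = 3 features
print(Phi.shape)  # one row per data point, one column per feature
print(Phi)
```

Each row of the result is the transposed feature vector $ \phi(x_n)^T $ for one input, matching the stacked form shown above.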

---

### Rewrite in Matrix Form  
To solve this more compactly, we can write the objective function in **matrix form**. Define:
- $ \Phi $ (the design matrix) as a matrix where each row is a feature vector $ \phi(x_n)^T $. Thus:

  $$
  \Phi = 
  \begin{bmatrix}
  \phi(x_1)^T \\
  \phi(x_2)^T \\
  \vdots \\
  \phi(x_N)^T
  \end{bmatrix}
  $$

  If $ \phi(x_n) $ has $ M $ components, then $ \Phi $ is an $ N \times M $ matrix.

- $ t $ as the **target vector**:

  $$
  t = 
  \begin{bmatrix}
  t_1 \\
  t_2 \\
  \vdots \\
  t_N
  \end{bmatrix}
  $$

- $ y $ as the vector of predictions $ y = \Phi w $, where:

  $$
  y = 
  \begin{bmatrix}
  w^T \phi(x_1) \\
  w^T \phi(x_2) \\
  \vdots \\
  w^T \phi(x_N)
  \end{bmatrix}
  $$

Now, the objective function becomes:

$$
\min_{w} \| \Phi w - t \|^2
$$

where $ \| \cdot \|^2 $ denotes the squared Euclidean norm.

---

### Step 4: Expand the Squared Term
Expand the squared norm:

$$
\| \Phi w - t \|^2 = (\Phi w - t)^T (\Phi w - t)
$$

Expanding this expression:

$$
(\Phi w - t)^T (\Phi w - t) = w^T \Phi^T \Phi w - 2 t^T \Phi w + t^T t
$$
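
The intermediate step behind this expansion may be worth spelling out. Multiplying out the product gives four terms:

$$
(\Phi w - t)^T (\Phi w - t) = w^T \Phi^T \Phi w - w^T \Phi^T t - t^T \Phi w + t^T t
$$

The cross term $ w^T \Phi^T t $ is a scalar, so it equals its own transpose $ t^T \Phi w $. The two cross terms therefore combine into $ - 2 t^T \Phi w $.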

This is a quadratic function of $ w $.

---

### Step 5: Differentiate with Respect to $ w $
To find the value of $ w $ that minimizes the objective, differentiate the above expression with respect to $ w $, and set the derivative to zero.

Differentiate term by term:

- $ \frac{\partial}{\partial w} \left( w^T \Phi^T \Phi w \right) = 2 \Phi^T \Phi w $
- $ \frac{\partial}{\partial w} \left( - 2 t^T \Phi w \right) = - 2 \Phi^T t $

The term $ t^T t $ is constant and vanishes after differentiation.

Thus, the gradient is:

$$
\frac{\partial}{\partial w} \left( \| \Phi w - t \|^2 \right) = 2 \Phi^T \Phi w - 2 \Phi^T t
$$

Set the derivative to zero:

$$
2 \Phi^T \Phi w - 2 \Phi^T t = 0
$$

Simplify:

$$
\Phi^T \Phi w = \Phi^T t
$$
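
One way to gain confidence in the gradient formula is to compare it with a finite-difference approximation. The sketch below does this on random data; the data, the seed, and the step size `eps` are arbitrary assumptions chosen for illustration.

```python
import numpy as np

def sse(w, Phi, t):
    """Sum of squared errors ||Phi w - t||^2."""
    r = Phi @ w - t
    return float(r @ r)

def sse_gradient(w, Phi, t):
    """Analytic gradient 2 Phi^T Phi w - 2 Phi^T t."""
    return 2.0 * Phi.T @ Phi @ w - 2.0 * Phi.T @ t

rng = np.random.default_rng(0)
Phi = rng.normal(size=(6, 3))  # hypothetical 6 x 3 design matrix
t = rng.normal(size=6)
w = rng.normal(size=3)

# Central finite differences, one coordinate of w at a time.
eps = 1e-6
numeric = np.zeros_like(w)
for j in range(w.size):
    step = np.zeros_like(w)
    step[j] = eps
    numeric[j] = (sse(w + step, Phi, t) - sse(w - step, Phi, t)) / (2.0 * eps)

analytic = sse_gradient(w, Phi, t)
print(np.max(np.abs(numeric - analytic)))  # should be close to zero
```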

---

### Step 6: Solve for $ w^*$
Assuming that $ \Phi^T \Phi $ is invertible, we can solve for the optimal weight vector $ w^* $:

$$
w^* = (\Phi^T \Phi)^{-1} \Phi^T t
$$
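
A minimal numerical sketch of this solution follows, assuming a cubic polynomial basis and synthetic noisy sine data (both are illustrative choices, not taken from the text). It solves the normal equations with `np.linalg.solve` rather than forming the inverse explicitly, which is usually the more numerically stable route, and cross-checks the result against `np.linalg.lstsq`.

```python
import numpy as np

# Hypothetical training data: noisy samples of sin(2*pi*x).
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 10)
t = np.sin(2.0 * np.pi * x) + 0.1 * rng.normal(size=x.size)

# Design matrix for a cubic polynomial: columns 1, x, x^2, x^3.
Phi = np.vander(x, N=4, increasing=True)

# Solve the normal equations Phi^T Phi w = Phi^T t.
w_star = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)

# Cross-check with NumPy's least-squares routine.
w_lstsq, *_ = np.linalg.lstsq(Phi, t, rcond=None)

print(w_star)
print(w_lstsq)
```

Both routes should return the same weights up to rounding when $ \Phi^T \Phi $ is invertible.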

---

### Step 7: Interpretation
This is the **closed-form solution** for the least squares problem. It shows that the optimal weights  $w^*$ depend on:
- $ \Phi^T \Phi $, which captures the relationships between the features,
- $ \Phi^T t $, which captures the relationship between the features and the targets $ t_n $.


# Maximizing with Respect to $\beta $

Now, we maximize the **log-likelihood** with respect to $ \beta $, holding $ w $ fixed at the maximum likelihood solution $ w^* $.

### Focus on the $ \beta $-Dependent Part:
Evaluated at $ w^* $, the log-likelihood is:

$$
\ln p(t \mid x, w^*, \beta) = -\frac{\beta}{2} \sum_{n=1}^{N} \left( y(x_n, w^*) - t_n \right)^2 + \frac{N}{2} \ln \beta - \frac{N}{2} \ln (2 \pi).
$$

Only the first two terms depend on $ \beta $. To find the value of $ \beta $ that maximizes this expression, we differentiate it with respect to $ \beta $ and set the derivative to zero:

$$
\frac{\partial}{\partial \beta} \ln p(t \mid x, w^*, \beta) = 0.
$$

---

## Step 1: Differentiate Term by Term  

### 1. Differentiate the First Term:
Let’s differentiate:

$$
\frac{\partial}{\partial \beta} \left[ - \frac{\beta}{2} \sum_{n=1}^{N} \left( y(x_n, w^*) - t_n \right)^2 \right].
$$

Since only $ \beta $ varies, and the summation inside is independent of $ \beta $, we get:

$$
\frac{\partial}{\partial \beta} \left[ - \frac{\beta}{2} \sum_{n=1}^{N} \left( y(x_n, w^*) - t_n \right)^2 \right] = - \frac{1}{2} \sum_{n=1}^{N} \left( y(x_n, w^*) - t_n \right)^2.
$$

---

### 2. Differentiate the Second Term:
Now, differentiate:

$$
\frac{\partial}{\partial \beta} \left( \frac{N}{2} \ln \beta \right).
$$

Using the derivative of $ \ln \beta $, we get:

$$
\frac{\partial}{\partial \beta} \left( \frac{N}{2} \ln \beta \right) = \frac{N}{2 \beta}.
$$

---

### 3. Differentiate the Third Term:
The third term, $ - \frac{N}{2} \ln (2 \pi) $, is constant with respect to $ \beta $, so its derivative is 0.

---

## Step 2: Set the Derivative to Zero
Now, we set the derivative of the entire expression to 0:

$$
-\frac{1}{2} \sum_{n=1}^{N} \left( y(x_n, w^*) - t_n \right)^2 + \frac{N}{2 \beta} = 0.
$$

---

## Step 3: Simplify and Solve for $ \beta $
Rearranging terms:

$$
\frac{N}{2 \beta} = \frac{1}{2} \sum_{n=1}^{N} \left( y(x_n, w^*) - t_n \right)^2.
$$

Multiplying both sides by $ 2 \beta $:

$$
N = \beta \sum_{n=1}^{N} \left( y(x_n, w^*) - t_n \right)^2.
$$

Finally, solve for $ \beta $:

$$
\beta = \frac{N}{\sum_{n=1}^{N} \left( y(x_n, w^*) - t_n \right)^2}.
$$

---

## Interpretation:
This is the **Maximum Likelihood Estimate (MLE)** for $ \beta $. Its inverse, $ 1/\beta $, is the mean of the squared residuals, so the estimated noise variance is the average squared error of the fitted curve on the training data.
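
The estimate is simple to compute once the residuals are known. The sketch below uses hypothetical predictions and targets; the function name `beta_ml` is an assumption for illustration. Note that the formula is undefined when the fit is exact, since the sum of squared residuals is then zero.

```python
def beta_ml(predictions, targets):
    """Maximum likelihood precision: N divided by the sum of squared residuals."""
    n = len(targets)
    sse = sum((y - t) ** 2 for y, t in zip(predictions, targets))
    return n / sse

# Hypothetical fitted values y(x_n, w*) and observed targets t_n.
predictions = [0.1, 0.9, 2.1, 2.9]
targets = [0.0, 1.0, 2.0, 3.0]

beta = beta_ml(predictions, targets)
noise_variance = 1.0 / beta  # mean squared residual
print(beta, noise_variance)  # roughly 100 and 0.01 for this toy data
```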


We now take a step toward a **Bayesian approach** by introducing a **prior distribution** over the polynomial coefficients $ w $. This allows us to incorporate prior knowledge into the model, which regularizes the solution and reduces overfitting.

---

## Prior Distribution

Let’s define the prior distribution over $ w $. For simplicity, we assume a **Gaussian prior**, which has the following form:

$$
p(w \mid \alpha) = \mathcal{N}(w \mid 0, \alpha^{-1} I),
$$

where:
- $ \mathcal{N}(w \mid 0, \alpha^{-1} I) $ represents a **multivariate Gaussian distribution** with:
  - Mean vector $ 0 $,
  - Covariance matrix $ \alpha^{-1} I $, where $ I $ is the identity matrix.
- $ \alpha $ is a positive parameter called the **precision** (the inverse of variance).
- $ w $ is a vector of polynomial coefficients, and for an $ M $-th order polynomial, $ w $ has $ M+1 $ elements. Note that $ M $ here denotes the polynomial order, whereas in the design matrix section it counted the number of features.

This Gaussian distribution can be expanded as follows:

$$
p(w \mid \alpha) = \left( \frac{\alpha}{2 \pi} \right)^{(M+1)/2} \exp \left( - \frac{\alpha}{2} w^\top w \right).
$$

### Interpretation:
- $ \alpha $ is called a **hyperparameter**, as it controls the spread (or precision) of the prior distribution over the polynomial coefficients $ w $.
- The term $ w^\top w $ is the **squared L2 norm** of the vector $ w $, that is, its squared Euclidean distance from the origin.

---

## Posterior Distribution

Using **Bayes’ Theorem**, we can now update our prior beliefs about $ w $ after observing data $ \{x, t\} $. Bayes’ Theorem is given by:

$$
p(w \mid x, t, \alpha, \beta) \propto p(t \mid x, w, \beta) p(w \mid \alpha),
$$

where:
- $ p(w \mid x, t, \alpha, \beta) $ is the **posterior distribution**, representing our updated belief about $ w $ after observing the data.
- $ p(t \mid x, w, \beta) $ is the **likelihood function**, representing the probability of observing the data given the parameters $ w $.
- $ p(w \mid \alpha) $ is the **prior distribution**, as defined above.
- $ \beta $ is another precision parameter that controls the noise in the likelihood function.

---

## Maximum A Posteriori (MAP) Estimation

To find the optimal value of $ w $, we maximize the posterior distribution $ p(w \mid x, t, \alpha, \beta) $. This approach is called **Maximum A Posteriori (MAP)** estimation.

Rather than maximizing the posterior directly, we can simplify the problem by taking the **logarithm** of the posterior. Because the logarithm is monotonically increasing, this preserves the location of the maximum, and it turns the product in Bayes' Theorem into a sum:

$$
\ln p(w \mid x, t, \alpha, \beta) = \ln p(t \mid x, w, \beta) + \ln p(w \mid \alpha) + \text{constant}.
$$

Since maximizing the log-posterior is equivalent to minimizing the negative log-posterior, we get:

$$
\text{Minimize:} \quad - \ln p(w \mid x, t, \alpha, \beta).
$$

Using the expressions for the likelihood $ p(t \mid x, w, \beta) $ and the prior $ p(w \mid \alpha) $, we can write the negative log-posterior as:

$$
-\ln p(w \mid x, t, \alpha, \beta) = \frac{\beta}{2} \sum_{n=1}^{N} \left( y(x_n, w) - t_n \right)^2 + \frac{\alpha}{2} w^\top w + \text{constant}.
$$
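
For readers who want the intermediate step, the two pieces being added are the negative log-likelihood and the negative log-prior, written out from the expressions derived earlier:

$$
-\ln p(t \mid x, w, \beta) = \frac{\beta}{2} \sum_{n=1}^{N} \left( y(x_n, w) - t_n \right)^2 - \frac{N}{2} \ln \beta + \frac{N}{2} \ln (2 \pi),
$$

$$
-\ln p(w \mid \alpha) = \frac{\alpha}{2} w^\top w - \frac{M+1}{2} \ln \left( \frac{\alpha}{2 \pi} \right).
$$

Every term that does not involve $ w $, including the normalizing factor hidden by the proportionality sign in Bayes' Theorem, is absorbed into the constant.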

---

## Interpretation and Connection to Regularization

Let’s examine this result in detail:

1. The first term:

   $$
   \frac{\beta}{2} \sum_{n=1}^{N} \left( y(x_n, w) - t_n \right)^2,
   $$

   corresponds to the **sum-of-squares error** we encountered earlier. This term measures how well the polynomial $ y(x, w) $ fits the observed data $ \{t_n\} $.

2. The second term:

   $$
   \frac{\alpha}{2} w^\top w,
   $$

   represents a **regularization term**, which penalizes large values of the polynomial coefficients $ w $. This discourages overfitting by favoring smaller coefficient values.

Thus, maximizing the posterior distribution is equivalent to minimizing the following **regularized sum-of-squares error function**:

$$
E(w) = \frac{\beta}{2} \sum_{n=1}^{N} \left( y(x_n, w) - t_n \right)^2 + \frac{\alpha}{2} w^\top w.
$$

---

## Regularization Parameter $ \lambda $

Notice that this objective function has the same form as the regularized sum-of-squares error function we encountered earlier. Dividing $ E(w) $ by the positive constant $ \beta / 2 $, which does not change the minimizer, gives:

$$
\frac{2}{\beta} E(w) = \sum_{n=1}^{N} \left( y(x_n, w) - t_n \right)^2 + \lambda w^\top w,
$$
where $ \lambda = \frac{\alpha}{\beta} $ is the **regularization parameter**.
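
For a model that is linear in its parameters, $ y(x_n, w) = w^T \phi(x_n) $, repeating the differentiation used in the least squares section on this regularized objective leads to $ (\Phi^T \Phi + \lambda I) w = \Phi^T t $. This closed form is a standard result that the text above does not derive, so the sketch below should be read as an illustration under that assumption. The data, the basis, and the values of `alpha` and `beta` are hypothetical.

```python
import numpy as np

def map_weights(Phi, t, alpha, beta):
    """MAP weights for a linear-in-parameters model with lambda = alpha / beta."""
    lam = alpha / beta
    num_features = Phi.shape[1]
    # Solve (Phi^T Phi + lambda I) w = Phi^T t.
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(num_features), Phi.T @ t)

# Hypothetical training data: noisy samples of sin(2*pi*x).
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 10)
t = np.sin(2.0 * np.pi * x) + 0.1 * rng.normal(size=x.size)

# Ninth-order polynomial: 10 coefficients for 10 data points, prone to overfitting.
Phi = np.vander(x, N=10, increasing=True)

w_weak_prior = map_weights(Phi, t, alpha=1e-3, beta=100.0)   # lambda = 1e-5
w_strong_prior = map_weights(Phi, t, alpha=10.0, beta=100.0)  # lambda = 0.1

print(np.linalg.norm(w_weak_prior), np.linalg.norm(w_strong_prior))
```

A larger $ \alpha $ (a tighter prior) yields a larger $ \lambda $ and pulls the coefficient vector toward zero, which is the regularizing effect described above.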

---

