

## The Predictive Distribution

The predictive distribution we aim to understand is:

$$
p(t \mid x, x_{\text{train}}, t_{\text{train}}) = \int p(t \mid x, w) \, p(w \mid x_{\text{train}}, t_{\text{train}}) \, dw,
$$
where:
- $x$ is the input for a **new test point** (the point at which we want to predict the target $t$).
- $x_{\text{train}}$ and $t_{\text{train}}$ are the inputs and outputs of the **training data** we’ve already observed.
- $w$ is the vector of **model parameters** (for example, the polynomial coefficients in polynomial regression).
- $p(t \mid x, w)$ is the **likelihood**, which tells us how likely a target value $t$ is, given the input $x$ and a particular set of parameters $w$.
- $p(w \mid x_{\text{train}}, t_{\text{train}})$ is the **posterior distribution** over $w$, which reflects our updated belief about the parameters after observing the training data.

### Understanding the Integral

The integral sums over all possible values of $w$, weighting each value by its posterior density $p(w \mid x_{\text{train}}, t_{\text{train}})$. This process of **averaging over the parameters** is called **marginalization**.

---

## Why Are We Integrating Over $ w $ ?

In Bayesian reasoning, we don’t just want a single "best guess" for the parameters $w$. Instead, we recognize that there may be many plausible values of $w$, each of which might give a slightly different prediction for the target value $t$.

To make a robust prediction, we average over all possible values of $w$, with each value weighted by how plausible it is under the posterior distribution $p(w \mid x_{\text{train}}, t_{\text{train}})$.

This gives us a **probabilistic prediction** that incorporates both the likelihood and our uncertainty about $w$.

---

## Breaking Down the Components

Let’s break down each term in the predictive distribution:

1. **Likelihood: $ p(t \mid x, w) $**  
   This tells us the probability of observing a target value $t$ for a given input $x$, assuming a specific set of parameters $w$. For example, in linear regression, this could be a Gaussian distribution centered on the predicted output $y(x, w)$.

2. **Posterior: $ p(w \mid x_{\text{train}}, t_{\text{train}}) $**  
   This reflects our belief about the parameters $w$ after observing the training data. It combines the prior distribution over $w$ with the likelihood of the observed training data.

3. **Integral (Marginalization):**  
   The integral:

   $$
   \int p(t \mid x, w) \, p(w \mid x_{\text{train}}, t_{\text{train}}) \, dw,
   $$
   averages the predictions from all possible parameter values, weighted by their posterior probabilities. This accounts for our uncertainty about $w$ and produces a more robust, probabilistic prediction.
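
To make the averaging concrete, here is a minimal sketch that approximates the integral by sampling. Everything specific in it is an illustrative assumption rather than part of the text: a one-parameter model $y(x, w) = w x$, a Gaussian posterior over $w$ with made-up mean and variance, and a made-up noise precision. For this toy case the integral also has a closed form, so the two numbers can be compared.

```python
import numpy as np

def gaussian_pdf(t, mean, var):
    """Density of N(t | mean, var), evaluated elementwise."""
    return np.exp(-0.5 * (t - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

rng = np.random.default_rng(0)

# Illustrative assumptions (not from the text): model y(x, w) = w * x,
# posterior w ~ N(post_mean, post_var), noise precision beta.
post_mean, post_var = 1.5, 0.04
beta = 4.0
x_new, t_query = 2.0, 3.2

# Monte Carlo version of the integral: draw w from the posterior and
# average the likelihood p(t | x, w) over the draws.
w_samples = rng.normal(post_mean, np.sqrt(post_var), size=200_000)
p_monte_carlo = gaussian_pdf(t_query, w_samples * x_new, 1.0 / beta).mean()

# Closed form for this linear-Gaussian toy model:
# N(t | post_mean * x, 1/beta + x^2 * post_var).
p_closed_form = gaussian_pdf(
    t_query, post_mean * x_new, 1.0 / beta + x_new ** 2 * post_var
)

print(p_monte_carlo, p_closed_form)
```

The two printed values should agree to a couple of decimal places; the remaining gap is sampling noise, which shrinks as the number of draws grows.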

---

## An Intuitive Analogy: Predicting Tomorrow’s Weather

Imagine you’re trying to predict the temperature tomorrow based on historical data. Different weather models (analogous to different values of $w$) might give slightly different predictions, and each model might have a different level of confidence.

For example:
- One model predicts 25°C with 60% confidence.
- Another model predicts 22°C with 30% confidence.
- A third model predicts 20°C with 10% confidence.

Instead of picking just one model, you could **average their predictions**, weighted by how confident you are in each one. This might give you a final prediction like:

$$
\mathbb{E}[\text{temperature}] = 0.6 \times 25 + 0.3 \times 22 + 0.1 \times 20 = 23.6\,^\circ\text{C},
$$

with some uncertainty around that value, because the models don’t all agree.

This is exactly what the integral is doing in Bayesian inference! It averages the predictions $p(t \mid x, w)$, weighted by the posterior probabilities $p(w \mid x_{\text{train}}, t_{\text{train}})$.
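
The weighted average in the analogy can be checked with a few lines of arithmetic. This is only a sketch of the analogy itself: the three temperatures and confidences are the ones listed above, and the variable names are made up for illustration.

```python
# Model predictions (degrees Celsius) and the confidence assigned to each.
temperatures = [25.0, 22.0, 20.0]
confidences = [0.6, 0.3, 0.1]

# Confidence-weighted mean: the discrete analogue of the integral over w.
expected = sum(c * temp for c, temp in zip(confidences, temperatures))

# Confidence-weighted spread around that mean (disagreement between models).
variance = sum(c * (temp - expected) ** 2 for c, temp in zip(confidences, temperatures))
spread = variance ** 0.5

print(round(expected, 1), round(spread, 1))  # 23.6 1.8
```

The spread of about 1.8°C is the discrete counterpart of the parameter-uncertainty term that appears in the predictive variance.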

---

## Why Is This Important?

This Bayesian approach has several key advantages:
- **Uncertainty Quantification:** The predictive distribution not only gives us a predicted value for $t$, but also provides a measure of uncertainty, which can be useful in decision-making.
- **Robust Predictions:** By averaging over many possible values of $w$, we reduce the risk of overfitting to a single set of parameters.
- **Flexibility:** The Bayesian framework can be extended to more complex models and priors, allowing us to incorporate domain-specific knowledge.


## Bayesian Linear Regression: Posterior Distribution

### Posterior Probability
The posterior distribution of the weights $w$ given the training data $(x_{\text{train}}, t_{\text{train}})$ is given by:  
$ p(w \mid x_{\text{train}}, t_{\text{train}}) \propto p(t_{\text{train}} \mid x_{\text{train}}, w) p(w) $  

where:
- $ p(t_{\text{train}} \mid x_{\text{train}}, w) $: Likelihood of the training outputs $t_{\text{train}}$ given the inputs $x_{\text{train}}$ and the weights $w$.
- $ p(w) $: Prior distribution over the weights $w$.

### Likelihood Function
The likelihood is given by:  
$$ p(t_{\text{train}} \mid x_{\text{train}}, w) = \prod_{n=1}^{N} p(t_n \mid x_n, w) = \prod_{n=1}^{N} \mathcal{N}(t_n \mid w^T \phi(x_n), \beta^{-1}) $$ 

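Here $\beta$ is the precision (inverse variance) of the observation noise. Writing $t$ for the vector of training targets $t_{\text{train}}$ and $\Phi$ for the design matrix whose $n$-th row is $\phi(x_n)^T$, the likelihood in exponential form is:  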
$$ p(t_{\text{train}} \mid x_{\text{train}}, w) \propto \exp\left(-\frac{\beta}{2} \| t - \Phi w \|^2 \right) $$
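
As a quick numerical check, the sketch below evaluates the log-likelihood twice: once as a sum of per-point Gaussian log-densities, and once in the vectorized form with $\| t - \Phi w \|^2$. The polynomial basis, the weights, and the synthetic data are assumptions chosen only for illustration.

```python
import numpy as np

def design_matrix(x, degree):
    """Polynomial features phi_j(x) = x**j, j = 0..degree (an assumed basis)."""
    return np.vander(x, degree + 1, increasing=True)

rng = np.random.default_rng(1)
beta = 25.0                          # assumed noise precision
w = np.array([0.5, -1.0, 2.0])       # assumed weights
x_train = np.linspace(-1.0, 1.0, 8)
Phi = design_matrix(x_train, degree=2)
t_train = Phi @ w + rng.normal(0.0, beta ** -0.5, size=x_train.shape)

# Product over data points, taken in log space: sum of Gaussian log-densities.
log_lik_pointwise = np.sum(
    0.5 * np.log(beta / (2.0 * np.pi)) - 0.5 * beta * (t_train - Phi @ w) ** 2
)

# Vectorized form: constant term plus -beta/2 * ||t - Phi w||^2.
N = len(t_train)
log_lik_vectorized = (
    0.5 * N * np.log(beta / (2.0 * np.pi))
    - 0.5 * beta * np.sum((t_train - Phi @ w) ** 2)
)

print(log_lik_pointwise, log_lik_vectorized)
```

The two values match, which confirms that the product of $N$ Gaussians and the single exponential differ only by the constant that $\propto$ hides.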

### Prior Distribution
We assume a Gaussian prior on $w$:  
$ p(w) = \mathcal{N}(w \mid 0, \alpha^{-1} I) $  

This prior has mean $0$ and covariance matrix $\alpha^{-1} I$, where $\alpha$ is the precision of the prior. In explicit form:  
$$p(w) \propto \exp\left(-\frac{\alpha}{2} w^T w\right) $$

### Posterior Derivation
Using Bayes' theorem, we combine the likelihood and prior. Since the prior is Gaussian and the likelihood is the exponential of a quadratic function of $w$, the posterior is also Gaussian:  
$ p(w \mid x_{\text{train}}, t_{\text{train}}) = \mathcal{N}(w \mid m_N, S_N) $  

where we need to determine the posterior mean $m_N$ and posterior covariance $S_N$.

### Expanding the Exponent
To derive $m_N$ and $S_N$, we start with:  
$ p(w \mid x_{\text{train}}, t_{\text{train}}) \propto \exp\left(-\frac{1}{2} \left[\beta (t - \Phi w)^T (t - \Phi w) + \alpha w^T w \right] \right) $  

Expand the quadratic term:  
$ \beta (t - \Phi w)^T (t - \Phi w) = \beta \left[t^T t - 2 w^T \Phi^T t + w^T \Phi^T \Phi w \right] $  

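Add the prior term and drop $\beta t^T t$, which does not depend on $w$ and is absorbed into the normalization constant:  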
$ \beta w^T \Phi^T \Phi w + \alpha w^T w - 2 \beta w^T \Phi^T t $  

Group the quadratic and linear terms:  
$ w^T (\beta \Phi^T \Phi + \alpha I) w - 2 \beta w^T \Phi^T t $

### Completing the Square
Reorganize the expression:  
$$ -\frac{1}{2} \left[ w^T A w - 2 w^T b \right] $$ 

where:
- $ A = \beta \Phi^T \Phi + \alpha I $  
- $ b = \beta \Phi^T t $  

Add and subtract $ b^T A^{-1} b $:  
$$ -\frac{1}{2} \left[ w^T A w - 2 w^T b + b^T A^{-1} b - b^T A^{-1} b \right] $$

This can be split into two terms:  
$$ -\frac{1}{2} \left[ (w^T A w - 2 w^T b + b^T A^{-1} b) - b^T A^{-1} b \right] $$

Rewrite the perfect-square term, using the fact that $A$ is symmetric:  
$$  w^T A w - 2 w^T b + b^T A^{-1} b = (w - A^{-1} b)^T A (w - A^{-1} b) $$

So, the expression becomes:  
$$ -\frac{1}{2} \left[ (w - A^{-1} b)^T A (w - A^{-1} b) - b^T A^{-1} b \right] $$
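
The perfect-square identity is easy to confirm numerically. The sketch below builds $A$ and $b$ from a random design matrix and checks both sides at a random $w$; the dimensions and the values of $\alpha$ and $\beta$ are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
alpha, beta = 2.0, 25.0          # assumed prior and noise precisions
Phi = rng.normal(size=(10, 4))   # random design matrix, N = 10, M = 4
t = rng.normal(size=10)          # random targets

A = beta * Phi.T @ Phi + alpha * np.eye(4)
b = beta * Phi.T @ t
w = rng.normal(size=4)           # arbitrary point at which to test the identity

A_inv_b = np.linalg.solve(A, b)  # A^{-1} b without forming the inverse

lhs = w @ A @ w - 2.0 * w @ b + b @ A_inv_b
rhs = (w - A_inv_b) @ A @ (w - A_inv_b)

print(lhs, rhs)
```

Both sides agree up to floating-point error for any choice of $w$, since $A$ is symmetric and positive definite whenever $\alpha > 0$.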

### Simplifying
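The term $b^T A^{-1} b$ does not depend on $w$, so it is absorbed into the normalization constant. Comparing what remains with the exponent of a Gaussian, $-\frac{1}{2} (w - m_N)^T S_N^{-1} (w - m_N)$, and recalling that $A = \beta \Phi^T \Phi + \alpha I$ and $b = \beta \Phi^T t$, we identify:  
$ m_N = A^{-1} b = (\beta \Phi^T \Phi + \alpha I)^{-1} (\beta \Phi^T t) $  
$ S_N = A^{-1} = (\beta \Phi^T \Phi + \alpha I)^{-1} $  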

Thus, the posterior mean and covariance are:  
$ m_N = S_N (\beta \Phi^T t) $  
$ S_N = (\beta \Phi^T \Phi + \alpha I)^{-1} $  
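
A minimal NumPy sketch of these two formulas follows. The function name `posterior`, the basis $\phi(x) = (1, x)$, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def posterior(Phi, t, alpha, beta):
    """Return (m_N, S_N) for prior N(w | 0, alpha^{-1} I) and noise precision beta."""
    M = Phi.shape[1]
    S_N = np.linalg.inv(beta * Phi.T @ Phi + alpha * np.eye(M))
    m_N = beta * S_N @ Phi.T @ t
    return m_N, S_N

# Hypothetical data: noisy samples of the line t = -0.3 + 0.5 x.
rng = np.random.default_rng(3)
alpha, beta = 2.0, 25.0
x_train = rng.uniform(-1.0, 1.0, size=20)
t_train = -0.3 + 0.5 * x_train + rng.normal(0.0, beta ** -0.5, size=20)
Phi = np.column_stack([np.ones_like(x_train), x_train])  # basis phi(x) = (1, x)

m_N, S_N = posterior(Phi, t_train, alpha, beta)
print("posterior mean:", m_N)
print("posterior covariance:\n", S_N)
```

Dividing through by $\beta$ shows that $m_N = (\Phi^T \Phi + \tfrac{\alpha}{\beta} I)^{-1} \Phi^T t$, so the posterior mean coincides with the ridge-regression solution with regularization strength $\alpha / \beta$.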


## Integrating the Product of Two Gaussians

Now, we want to integrate the product $p(t \mid x, w) \cdot p(w \mid x_{\text{train}}, t_{\text{train}})$, where both distributions are Gaussians.

This is a standard result in Gaussian integration. Because the mean of $p(t \mid x, w)$ is linear in $w$ and the posterior over $w$ is Gaussian, integrating out $w$ yields another Gaussian in $t$, whose mean and variance can be derived analytically.

## Predictive Distribution

The predictive distribution $p(t \mid x, x_{\text{train}}, t_{\text{train}})$ is also a Gaussian:
$$
p(t \mid x, x_{\text{train}}, t_{\text{train}}) = \mathcal{N}(t \mid m(x), s^2(x))
$$

We will now derive the predictive mean $m(x)$ and the predictive variance $s^2(x)$.

---

## Step 1: Deriving the Predictive Mean $m(x)$

The mean of the resulting Gaussian after marginalizing over $w$ is:
$$
m(x) = \mathbb{E}_{p(w \mid x_{\text{train}}, t_{\text{train}})} \left[ \phi(x)^T w \right]
$$

Using the linearity of expectation:
$$
m(x) = \phi(x)^T \, \mathbb{E}[w]
$$

Since the posterior mean of $w$ is $m_N$, we have:
$$
m(x) = \phi(x)^T m_N
$$

Substitute $m_N = \beta S_N \Phi^T t_{\text{train}}$:
$$
m(x) = \phi(x)^T (\beta S_N \Phi^T t_{\text{train}})
$$

Simplifying:
$$
m(x) = \beta \, \phi(x)^T S_N \Phi^T t_{\text{train}}
$$

Thus, the predictive mean depends on the basis functions $\phi(x)$, the posterior covariance $S_N$, and the training targets $t_{\text{train}}$.

---

## Step 2: Deriving the Predictive Variance $s^2(x)$

The predictive variance $s^2(x)$ has two contributions:

1. The noise variance $\beta^{-1}$, which reflects the inherent noise in the observations.
2. The uncertainty due to $w$, which comes from marginalizing over $w$ using its posterior distribution.

The formula for the predictive variance is:
$$
s^2(x) = \beta^{-1} + \text{Var}_{p(w \mid x_{\text{train}}, t_{\text{train}})} \left[ \phi(x)^T w \right]
$$

Using the variance formula for a linear transformation of a Gaussian random variable:
$$
s^2(x) = \beta^{-1} + \phi(x)^T S_N \phi(x)
$$

Here, $\phi(x)^T S_N \phi(x)$ is the contribution to the variance due to the uncertainty in the parameters $w$.
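
As a worked step, the parameter-uncertainty term follows directly from the definition of variance. Writing $\phi = \phi(x)$ and using $\mathbb{E}[w] = m_N$ and $\text{Cov}[w] = S_N$ under the posterior:

$$
\begin{aligned}
\text{Var}\left[ \phi^T w \right]
&= \mathbb{E}\left[ \left( \phi^T w - \phi^T m_N \right)^2 \right] \\
&= \mathbb{E}\left[ \phi^T (w - m_N)(w - m_N)^T \phi \right] \\
&= \phi^T \, \mathbb{E}\left[ (w - m_N)(w - m_N)^T \right] \phi \\
&= \phi^T S_N \phi.
\end{aligned}
$$

The two contributions add because the target can be written as $t = \phi^T w + \epsilon$ with noise $\epsilon \sim \mathcal{N}(0, \beta^{-1})$ independent of $w$, and the variance of a sum of independent terms is the sum of their variances.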

---

## Final Result: Predictive Mean and Variance

Thus, we have derived the mean and variance of the predictive distribution $p(t \mid x, x_{\text{train}}, t_{\text{train}})$:

**Predictive Mean:**
$$
m(x) = \beta \, \phi(x)^T S_N \Phi^T t_{\text{train}}
$$

**Predictive Variance:**
$$
s^2(x) = \beta^{-1} + \phi(x)^T S_N \phi(x)
$$
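
Putting the pieces together, the sketch below fits the posterior and evaluates the predictive mean and variance at two test inputs. The helper names (`design_matrix`, `fit_posterior`, `predict`), the polynomial basis, and the synthetic data are assumptions for illustration, not a reference implementation.

```python
import numpy as np

def design_matrix(x, degree=1):
    """Polynomial features phi_j(x) = x**j, j = 0..degree (an assumed basis)."""
    return np.vander(np.atleast_1d(x), degree + 1, increasing=True)

def fit_posterior(Phi, t, alpha, beta):
    """Posterior mean m_N and covariance S_N of the weights."""
    S_N = np.linalg.inv(beta * Phi.T @ Phi + alpha * np.eye(Phi.shape[1]))
    return beta * S_N @ Phi.T @ t, S_N

def predict(x_new, m_N, S_N, beta, degree=1):
    """Predictive mean m(x) and variance s^2(x) at each entry of x_new."""
    Phi_new = design_matrix(x_new, degree)
    mean = Phi_new @ m_N
    # Row-wise quadratic form phi(x)^T S_N phi(x), plus the noise variance.
    var = 1.0 / beta + np.einsum("ij,jk,ik->i", Phi_new, S_N, Phi_new)
    return mean, var

# Hypothetical data: noisy samples of the line t = -0.3 + 0.5 x on [-1, 1].
rng = np.random.default_rng(4)
alpha, beta = 2.0, 25.0
x_train = np.linspace(-1.0, 1.0, 15)
t_train = -0.3 + 0.5 * x_train + rng.normal(0.0, beta ** -0.5, size=x_train.shape)

m_N, S_N = fit_posterior(design_matrix(x_train), t_train, alpha, beta)
x_test = np.array([0.0, 3.0])  # one point inside the data range, one far outside
mean, var = predict(x_test, m_N, S_N, beta)
print("predictive mean:", mean)
print("predictive variance:", var)
```

In this example the variance never drops below the noise floor $\beta^{-1}$, and it is larger at $x = 3$ than at $x = 0$, reflecting greater parameter uncertainty away from the training inputs.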


![image4](image4.png)