In [1]:
from collections import *
import math
import numpy as np
import matplotlib.pyplot as plt
from typing import *
from scipy.stats import *
from scipy.special import *

from tqdm import trange

import torch
import torch.nn as nn
import torch.optim as optim
import torch.utils.data as data

<br>

# Bayesian Regression
---

Where we step away from simple point estimates on the parameters of our models.

<br>

### Full Bayesian Regression

In a regression task, we try to estimate the predictive distribution $p(t|x,\mathcal{D})$, where $\mathcal{D}$ is the training data set. If we use a parametric method, we have to fit the parameters $w$ of our model. These parameters are themselves given a prior distribution governed by a hyperparameter $\alpha$. We also have to account for the noise on the training targets, governed by a hyperparameter $\beta$.

In typical MAP regression, we try different combinations of $\alpha$ and $\beta$ and, for each, maximize the posterior $p(w|\mathcal{D},\alpha,\beta)$ with respect to $w$. This produces a point estimate of $w$ that we then use for future predictions. To do a full Bayesian regression, we would instead need to marginalize over $w$ as well as over the hyperparameters $\alpha$ and $\beta$ when making predictions:

&emsp; $\displaystyle p(t|x,\mathcal{D}) = \iiint p(t|x,w,\beta) \; p(w|\mathcal{D},\alpha,\beta) \; p(\alpha,\beta|\mathcal{D}) \; dw \; d\alpha \; d\beta$
&emsp; where
&emsp; $p(w|\mathcal{D},\alpha,\beta) \propto p(\mathcal{D}|w,\beta) \; p(w|\alpha)$
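For comparison, the MAP point estimate described above has a closed form in one common special case. The sketch below assumes a linear model $t = w^T \phi(x) + \epsilon$ with Gaussian noise of precision $\beta$ and an isotropic Gaussian prior $p(w|\alpha) = \mathcal{N}(w|0, \alpha^{-1} I)$; these modelling choices, the design matrix `Phi` and the function name `map_estimate` are assumptions made for illustration. Under them, maximizing $p(\mathcal{D}|w,\beta) \; p(w|\alpha)$ reduces to ridge regression:

```python
import numpy as np

def map_estimate(Phi, t, alpha, beta):
    """MAP weights for a linear-Gaussian model (assumed, not from the text).

    Phi   : (N, M) design matrix, one row of features per training point
    t     : (N,) training targets
    alpha : precision of the zero-mean isotropic Gaussian prior on w
    beta  : precision of the Gaussian observation noise
    """
    M = Phi.shape[1]
    # Setting the gradient of the log posterior to zero gives
    # (alpha * I + beta * Phi^T Phi) w = beta * Phi^T t
    A = alpha * np.eye(M) + beta * Phi.T @ Phi
    return np.linalg.solve(A, beta * Phi.T @ t)
```

Only the ratio $\alpha / \beta$ matters for this point estimate, which is why it behaves like a single regularization strength in ridge regression.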

To see these dependencies and the resulting marginalization more clearly, it helps to draw the model as a **graphical model**, in this case a **Bayesian network**.

<br>

### Evidence approximation

Because this integral is generally intractable, we usually resort to point estimates for $\alpha$ and $\beta$ and perform a restricted Bayesian inference in which we only marginalize over $w$:

&emsp; $\displaystyle p(t|x,\mathcal{D},\alpha,\beta) = \int p(t|x,w,\beta) \; p(w|\mathcal{D},\alpha,\beta) \; dw$
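This integral can be computed exactly in some models. As a minimal sketch, assume a linear model with Gaussian noise of precision $\beta$ and a prior $p(w|\alpha) = \mathcal{N}(w|0, \alpha^{-1} I)$ (assumptions chosen for illustration, as are the names `posterior` and `predictive`). The posterior over $w$ is then Gaussian, $\mathcal{N}(w|m_N, S_N)$, and the predictive distribution is Gaussian too:

```python
import numpy as np

def posterior(Phi, t, alpha, beta):
    """Gaussian posterior N(w | m_N, S_N) under the assumed linear-Gaussian model."""
    M = Phi.shape[1]
    S_N = np.linalg.inv(alpha * np.eye(M) + beta * Phi.T @ Phi)
    m_N = beta * S_N @ Phi.T @ t
    return m_N, S_N

def predictive(Phi_new, m_N, S_N, beta):
    """Mean and variance of p(t | x, D, alpha, beta) for each row of Phi_new."""
    mean = Phi_new @ m_N
    # Noise variance plus the uncertainty on w projected onto each feature vector
    var = 1.0 / beta + np.einsum("ij,jk,ik->i", Phi_new, S_N, Phi_new)
    return mean, var
```

The predictive variance has two parts: the irreducible noise $1/\beta$ and a term $\phi(x)^T S_N \phi(x)$ that reflects the remaining uncertainty on $w$. The second term is exactly what a point estimate of $w$ discards.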

Similarly to what maximum likelihood (ML) does for $w$, we estimate $\alpha$ and $\beta$ by maximizing the likelihood of the data with respect to them, with $w$ integrated out. This quantity is the marginal likelihood, also called the evidence. Because the procedure applies maximum likelihood one level up, to the hyperparameters, it is called **type 2 maximum likelihood** or, more often, the **evidence approximation**:

&emsp; $\displaystyle \alpha^*, \beta^* = \underset{\alpha, \beta}{\text{argmax}} \; p(\mathcal{D}|\alpha,\beta)$
&emsp; where
&emsp; $\displaystyle p(\mathcal{D}|\alpha,\beta) = \int p(\mathcal{D},w|\alpha,\beta) \; dw = \int p(\mathcal{D}|w,\beta) \; p(w|\alpha) \; dw$
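When both the likelihood and the prior are Gaussian, this integral has a closed form. The sketch below assumes a linear model with noise precision $\beta$ and prior $\mathcal{N}(w|0, \alpha^{-1} I)$; the function name `log_evidence` and the design matrix `Phi` are illustrative assumptions. It evaluates $\ln p(\mathcal{D}|\alpha,\beta)$ so that candidate values of $\alpha$ and $\beta$ can be compared directly:

```python
import numpy as np

def log_evidence(Phi, t, alpha, beta):
    """Log marginal likelihood ln p(t | alpha, beta) of the assumed linear-Gaussian model."""
    N, M = Phi.shape
    A = alpha * np.eye(M) + beta * Phi.T @ Phi    # posterior precision of w
    m_N = beta * np.linalg.solve(A, Phi.T @ t)    # posterior mean of w
    resid = t - Phi @ m_N
    # Regularized error evaluated at the posterior mean
    E = 0.5 * beta * resid @ resid + 0.5 * alpha * m_N @ m_N
    _, logdet_A = np.linalg.slogdet(A)
    return (0.5 * M * np.log(alpha) + 0.5 * N * np.log(beta)
            - E - 0.5 * logdet_A - 0.5 * N * np.log(2 * np.pi))
```

One simple, if crude, way to use it is to evaluate `log_evidence` on a grid of $(\alpha, \beta)$ values and keep the pair with the highest value.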

We can see here that $w$ plays the role of a **latent variable**. We can therefore use the EM algorithm to find point estimates for $\alpha$ and $\beta$.
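As a minimal sketch of this idea, assume again a linear model with Gaussian noise of precision $\beta$ and a prior $\mathcal{N}(w|0, \alpha^{-1} I)$ (the model, the function name `em_hyperparameters` and its default arguments are assumptions for illustration). The E step computes the Gaussian posterior over $w$ for the current $\alpha$ and $\beta$. The M step re-estimates $\alpha$ and $\beta$ by maximizing the expected complete-data log likelihood under that posterior:

```python
import numpy as np

def em_hyperparameters(Phi, t, alpha=1.0, beta=1.0, n_iter=200, tol=1e-8):
    """EM point estimates of alpha and beta for the assumed linear-Gaussian model."""
    N, M = Phi.shape
    PtP = Phi.T @ Phi

    def e_step(alpha, beta):
        # Posterior N(w | m_N, S_N) for fixed hyperparameters
        S_N = np.linalg.inv(alpha * np.eye(M) + beta * PtP)
        m_N = beta * S_N @ Phi.T @ t
        return m_N, S_N

    for _ in range(n_iter):
        m_N, S_N = e_step(alpha, beta)
        # M step: alpha = M / E[w^T w], with E[w^T w] = m_N^T m_N + Tr(S_N)
        alpha_new = M / (m_N @ m_N + np.trace(S_N))
        # M step: 1 / beta = (1 / N) * E[||t - Phi w||^2]
        resid = t - Phi @ m_N
        beta_new = N / (resid @ resid + np.trace(PtP @ S_N))
        converged = (abs(alpha_new - alpha) <= tol * alpha
                     and abs(beta_new - beta) <= tol * beta)
        alpha, beta = alpha_new, beta_new
        if converged:
            break

    # Posterior over w for the final hyperparameters
    m_N, S_N = e_step(alpha, beta)
    return alpha, beta, m_N, S_N
```

Each iteration costs one $M \times M$ matrix inversion. EM only guarantees a local maximum of the evidence, so the result can depend on the initial values of `alpha` and `beta`.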

<br>

# Ex: Linear regression (exact inference)
---

In [None]:
# TODO - use EM algorithm

<br>

# Ex: Linear regression (approximate inference)
---

In [None]:
# Use MCMC
# Use variational methods