# Bayesian Statistics


Another way of approaching statistical inference

## Getting Started: An Example

Let's regress income on years of schooling, from generated data:

$$ \text{income} = 100 + 20 \times \text{school.years} + \epsilon $$

<img src="figures/example_actuals.png" width="300">

We run this (very realistic) data through:
1. OLS - à la frequentist (what we all know as OLS)
2. OLS - à la Bayesian, with a flat prior
3. OLS - à la Bayesian, with a (wrongly) informed prior

## Estimation: frequentist

<img src="figures/example_frequentist.png" width="450">
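
As a rough sketch of the frequentist step, the snippet below simulates data from the equation in the example and fits OLS with NumPy. The sample size, noise level, range of schooling years and seed are assumptions made for illustration; they are not the settings behind the figure.

```python
import numpy as np

def simulate_income(n=200, noise_sd=30.0, seed=0):
    """Generate data from income = 100 + 20 * school_years + noise (assumed settings)."""
    rng = np.random.default_rng(seed)
    school_years = rng.uniform(0, 20, size=n)
    income = 100 + 20 * school_years + rng.normal(0, noise_sd, size=n)
    return school_years, income

def ols(x, y):
    """Return OLS coefficients (constant, slope) and their standard errors."""
    X = np.column_stack([np.ones_like(x), x])    # add the constant term
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    resid = y - X @ beta_hat
    s2 = resid @ resid / (len(y) - X.shape[1])   # unbiased residual variance
    se = np.sqrt(np.diag(s2 * XtX_inv))
    return beta_hat, se

school_years, income = simulate_income()
beta_hat, se = ols(school_years, income)
print("constant, slope:", beta_hat)
print("standard errors:", se)
```

The point estimates should land near 100 and 20, with the standard errors describing how much they would vary across repeated draws of the data.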


## Estimation: Bayesian with flat prior

The priors on both coefficients are $N(0,\frac{1}{10^{-6}})$, i.e. normal with mean 0 and variance $10^6$

<img src="figures/example_bayesian.png" width="450">

## Estimation: Bayesian with (wrongly) informed prior

The prior for the constant term is centered on 50, while the true value in the generated data is 100

<img src="figures/example_all_methods.png" width="450">
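
A minimal sketch of the two Bayesian variants, assuming the noise variance is known so that the posterior of the coefficients has a closed form. This is a simplification and not necessarily how the figures were produced. The prior standard deviation of 5 on the constant and the simulation settings are assumptions for illustration.

```python
import numpy as np

def normal_posterior(X, y, prior_mean, prior_var, noise_var):
    """Posterior mean and covariance of beta for y = X beta + eps, eps ~ N(0, noise_var),
    with independent normal priors beta_j ~ N(prior_mean[j], prior_var[j])."""
    prior_prec = np.diag(1.0 / np.asarray(prior_var, dtype=float))
    post_cov = np.linalg.inv(prior_prec + X.T @ X / noise_var)
    post_mean = post_cov @ (prior_prec @ prior_mean + X.T @ y / noise_var)
    return post_mean, post_cov

# Simulated data (assumed settings)
rng = np.random.default_rng(0)
n, noise_sd = 200, 30.0
school_years = rng.uniform(0, 20, size=n)
income = 100 + 20 * school_years + rng.normal(0, noise_sd, size=n)
X = np.column_stack([np.ones(n), school_years])

# Flat prior: N(0, 1e6) on both coefficients
flat_mean, _ = normal_posterior(X, income, np.zeros(2), [1e6, 1e6], noise_sd**2)

# (Wrongly) informed prior: constant centered on 50 with a small variance
informed_mean, _ = normal_posterior(
    X, income, np.array([50.0, 0.0]), [5.0**2, 1e6], noise_sd**2
)

ols_mean = np.linalg.lstsq(X, income, rcond=None)[0]
print("OLS:           ", ols_mean)
print("flat prior:    ", flat_mean)
print("informed prior:", informed_mean)
```

With the flat prior the posterior mean is essentially the OLS estimate. With the informed prior the constant is pulled toward 50 and the slope shifts to compensate.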


## Bayesian Statistics: A Brief History

<img src="figures/thomas_bayes.png" width="250">

- Thomas Bayes (1702 - 1761): a pastor, also trained as a statistician and logician. 
- A special case of what we now know as Bayes' theorem was formulated by him in work that remained unpublished during his lifetime
 - Laplace generalized the formulation afterwards, but kept Bayes' name attached to it
   - Laplace had already attached his name to a lot of things by then

The idea behind the theorem is to construct conditional probabilities by:
- Starting from one's beliefs, before the evidence is analyzed
- Analyzing the evidence and then updating these beliefs

More history and intuitive explanations can be found [here!](https://www.lesswrong.com/posts/RTt59BtFLqQbsSiqd/a-history-of-bayes-theorem)

# Bayesian vs Frequentist Statistics

### Frequentist Approach

Ingredients:
- model with parameter vector $\theta$
- data $y$ (randomly sampled from an infinite population)

Approach:
- There is a true $\theta_0$ that generated the data
- Estimators of $\theta_0$ are constructed by maximizing the likelihood of observing this particular draw of $y$
 - The resulting statistics (e.g. standard errors, confidence intervals) describe the population from which $y$ was sampled
 - This puts a lot of pressure on the draw of $y$ we observe (e.g. it must be random and sufficiently large)
- Main inference: a point estimator, $\hat{\theta_0}$
 - **Important**: inference is contingent on model assumptions. Can you think of one assumption we've seen so far today?

### Bayesian Approach

Ingredients:
- model with parameter vector $\theta$
- data $y$ (treated as given once observed; no repeated-sampling assumption)

Approach:
- Parameters are treated as random variables
 - Our belief about the distribution of $\theta$, before seeing the data, is the "prior"
- This prior belief is updated when the data is observed
 - The result of this updating is the posterior distribution of $\theta$
- Main inference: the posterior distribution of $\theta$. From this, one can get:
 - Mean, median, mode of $\theta$
 - Standard deviation, credible interval (CI) of $\theta$
 - ... and anything else you can think of extracting from a series of numbers!
- **Important**: inference conditions on the observed $y$ rather than on hypothetical repeated draws of it. But it can be heavily influenced by the prior beliefs
 

## Bayesian Statistics: Formula

Ingredients
- A statistical model, with a likelihood function $\rightarrow$ $p(y|\theta)$
- Prior belief (including not having a clue) about the parameter $\rightarrow$ $p(\theta)$

The aim is to get the posterior distribution, $p(\theta|y)$. This is done via Bayes' theorem:

$$ p(\theta|y) = \frac{p(y|\theta)p(\theta)}{p(y)} $$

The denominator, $p(y)$, is not part of our ingredients but is in the theorem!
 - It is obtained by integrating the numerator over $\theta$, $p(y) = \int p(y|\theta)p(\theta)\,d\theta$, so it is a constant independent of $\theta$ $\Rightarrow p(\theta|y) \propto p(y|\theta)p(\theta)$

We use the above formulation to calculate marginal posterior densities for all elements in $\theta$
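
To make the formula concrete, here is a small sketch that applies it numerically on a grid. The example (a success probability $\theta$ with 7 successes in 10 trials and a flat prior) is hypothetical and chosen only because the answer is easy to check by hand.

```python
import numpy as np

theta = np.linspace(0.0, 1.0, 1001)        # grid of candidate parameter values
prior = np.ones_like(theta)                # flat prior p(theta)

successes, trials = 7, 10                  # hypothetical data y
# Likelihood p(y | theta), up to a constant that cancels after normalizing
likelihood = theta**successes * (1 - theta)**(trials - successes)

unnormalized = likelihood * prior          # numerator of Bayes' theorem
d_theta = theta[1] - theta[0]
evidence = unnormalized.sum() * d_theta    # numerical stand-in for p(y)
posterior = unnormalized / evidence        # p(theta | y) on the grid

posterior_mean = (theta * posterior).sum() * d_theta
print(posterior_mean)
```

Under these assumptions the exact posterior is Beta(8, 4) with mean 8/12, which the grid result should be close to. The grid approach stops being practical once $\theta$ has more than a few dimensions.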

## Example: Bayesian OLS

Model: $y = X\beta + \epsilon$, with $\epsilon \sim N(0, \sigma^2 I_N)$

Ingredients:
- Likelihood: $p(y|\beta,\sigma) = \left(\frac{1}{\sigma\sqrt{2\pi}}\right)^N \exp\left(-\frac{1}{2\sigma^2}(y - X\beta)'(y - X\beta) \right)$
- Diffuse priors: $p(\beta) \propto 1$ and $p(\ln(\sigma^2)) \propto 1$

Joint posterior density, written in terms of $\sigma^2$ (the prior $p(\ln(\sigma^2)) \propto 1$ is equivalent to $p(\sigma^2) \propto 1/\sigma^2$):

$$p(\beta, \sigma^2 | y) \propto \left(\frac{1}{\sigma}\right)^{N + 2}\exp\left(-\frac{1}{2\sigma^2}(y - X\beta)'(y - X\beta)\right) $$

Marginal densities of $\beta$ and $\sigma^2$:

$$p(\beta| y) \propto \int_0^{\infty} \left(\frac{1}{\sigma}\right)^{N + 2}\exp\left(-\frac{1}{2\sigma^2}(y - X\beta)'(y - X\beta)\right) d\sigma^2 $$

$$p(\sigma^2| y) \propto \int_{-\infty}^{\infty} \left(\frac{1}{\sigma}\right)^{N + 2}\exp\left(-\frac{1}{2\sigma^2}(y - X\beta)'(y - X\beta)\right) d\beta $$


Now we are going to solve these integrals by hand!

### Integral by Hand

<img src="figures/solve_integral.jpg" width="450">
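
For this particular model the first integral does have a closed form. The sketch below treats the joint posterior as a density in $\sigma^2$ (an assumption of this derivation) and introduces the shorthand $S(\beta)$, $\hat{\beta}$, $k$, $\nu$ and $s^2$, none of which appear elsewhere in these slides. With $S(\beta) = (y - X\beta)'(y - X\beta)$, the integrand is an inverse-gamma kernel in $\sigma^2$:

$$ \int_0^{\infty} (\sigma^2)^{-(N/2 + 1)} \exp\left(-\frac{S(\beta)}{2\sigma^2}\right) d\sigma^2 = \Gamma\left(\frac{N}{2}\right)\left(\frac{S(\beta)}{2}\right)^{-N/2} \quad\Rightarrow\quad p(\beta|y) \propto S(\beta)^{-N/2} $$

Writing $\hat{\beta} = (X'X)^{-1}X'y$, $k$ for the number of coefficients, $\nu = N - k$ and $s^2 = (y - X\hat{\beta})'(y - X\hat{\beta})/\nu$, the sum of squares splits as $S(\beta) = \nu s^2 + (\beta - \hat{\beta})'X'X(\beta - \hat{\beta})$, so

$$ p(\beta|y) \propto \left[1 + \frac{(\beta - \hat{\beta})'X'X(\beta - \hat{\beta})}{\nu s^2}\right]^{-(\nu + k)/2} $$

which is the kernel of a multivariate Student-t distribution with $\nu$ degrees of freedom, location $\hat{\beta}$ and scale matrix $s^2 (X'X)^{-1}$. Under the diffuse prior, the posterior for $\beta$ is therefore centered on the OLS estimate.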


### Work Around The Integral - Sampling

The marginal posterior densities are not always straightforward, or even possible, to calculate by hand. Sampling algorithms help us by drawing samples from the posterior instead, so that the marginal densities can be approximated from the draws

- Each parameter ($\beta$ and $\sigma$) is sampled, conditional on the other parameters
- Recall: the parameter space is defined by the prior. So when we set the prior, we also set the range of possible values for $\beta$ and $\sigma$: values with zero prior density also have zero posterior density

Multiple techniques exist, but Markov chain Monte Carlo methods (MCMC, see [here](http://faculty.econ.ucdavis.edu/faculty/jorda/bayes/mcmc.pdf) to dig deeper) are the most widely used. Three popular MCMC techniques are:
- Gibbs Sampler
- Metropolis-Hastings Sampler
- Data Augmentation
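
As an illustration of the first technique, here is a minimal Gibbs sampler sketch for the Bayesian OLS example with diffuse priors. It alternates between the two conditional distributions, which under those priors are a normal for $\beta$ and an inverse-gamma for $\sigma^2$. The function name, the simulated data and the number of draws are assumptions for illustration.

```python
import numpy as np

def gibbs_ols(X, y, n_draws=2000, burn_in=500, seed=0):
    """Gibbs sampler for y = X beta + eps under diffuse priors
    (p(beta) and p(ln sigma^2) both proportional to 1)."""
    rng = np.random.default_rng(seed)
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    chol = np.linalg.cholesky(XtX_inv)       # chol @ chol.T == (X'X)^-1
    sigma2 = 1.0                             # arbitrary starting value
    betas = np.empty((n_draws, k))
    sigma2s = np.empty(n_draws)
    for i in range(burn_in + n_draws):
        # beta | sigma^2, y ~ N(beta_hat, sigma^2 (X'X)^-1)
        beta = beta_hat + np.sqrt(sigma2) * chol @ rng.standard_normal(k)
        # sigma^2 | beta, y ~ Inverse-Gamma(n/2, SSR(beta)/2)
        resid = y - X @ beta
        sigma2 = (resid @ resid / 2) / rng.gamma(shape=n / 2, scale=1.0)
        if i >= burn_in:                     # discard the burn-in draws
            betas[i - burn_in] = beta
            sigma2s[i - burn_in] = sigma2
    return betas, sigma2s

# Simulated data (assumed settings)
rng = np.random.default_rng(1)
n = 200
school_years = rng.uniform(0, 20, size=n)
income = 100 + 20 * school_years + rng.normal(0, 30.0, size=n)
X = np.column_stack([np.ones(n), school_years])

betas, sigma2s = gibbs_ols(X, income)
print("posterior means of beta:", betas.mean(axis=0))
print("posterior mean of sigma^2:", sigma2s.mean())
```

Each row of `betas` is one draw from the joint posterior, so the columns can be summarized separately to get the marginal posterior of each coefficient.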

The output of sampling is a Markov chain. Some things to keep in mind:
- Because each draw depends on the previous one, the values in the chain are correlated with each other, so thinning (keeping every $k^{th}$ draw) is often used
- Check for convergence, i.e. that the resulting Markov chain is indeed from the marginal posterior density
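
A small sketch of what thinning does to the correlation between successive draws. The chain here is a hypothetical AR(1) series standing in for real sampler output, and the thinning interval of 10 is an arbitrary choice.

```python
import numpy as np

def autocorr(chain, lag=1):
    """Sample autocorrelation of a 1-D chain at the given lag."""
    x = np.asarray(chain, dtype=float) - np.mean(chain)
    return float(x[:-lag] @ x[lag:] / (x @ x))

# Hypothetical, strongly autocorrelated chain (AR(1) with coefficient 0.9)
rng = np.random.default_rng(0)
chain = np.empty(20000)
chain[0] = 0.0
for t in range(1, len(chain)):
    chain[t] = 0.9 * chain[t - 1] + rng.standard_normal()

k = 10
thinned = chain[::k]                       # keep every k-th draw

print("lag-1 autocorrelation, full chain:   ", autocorr(chain))
print("lag-1 autocorrelation, thinned chain:", autocorr(thinned))
```

Thinning lowers the correlation between retained draws at the cost of discarding samples, so it mainly helps with storage and with summaries that assume roughly independent draws.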


## Convergence: An Art

In principle, it is not possible to directly test whether convergence has been achieved. Instead, here is what is done:
1. Run multiple (2, maybe 3) sampling runs in parallel. These are known as *chains*
2. At the end of your sampling, compare these chains:
 * Plot them, see that they overlap sufficiently
 * Run some diagnostic tests to see whether they differ
3. If you think that convergence has not been achieved, then re-run the simulations:
 * Add more chains
 * Increase burn-in sample
 * Change sampling algorithm
4. Repeat steps 2 and 3, until you are convinced convergence has been achieved
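
One common way to compare chains numerically is the Gelman-Rubin statistic ($\hat{R}$), which compares the variance between chains with the variance within them. The sketch below implements the classic version of the statistic; the slides do not say which test was used, so treat this as one possible choice. The chains are simulated stand-ins rather than real sampler output.

```python
import numpy as np

def gelman_rubin(chains):
    """Classic potential scale reduction factor (R-hat).
    `chains` has shape (n_chains, n_draws) for a single parameter."""
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    within = chains.var(axis=1, ddof=1).mean()     # W: average within-chain variance
    between = n * chain_means.var(ddof=1)          # B: between-chain variance
    pooled = (n - 1) / n * within + between / n    # pooled estimate of posterior variance
    return float(np.sqrt(pooled / within))

rng = np.random.default_rng(0)
# Three chains drawn from the same distribution: should look converged
converged = rng.normal(0.0, 1.0, size=(3, 1000))
# Same chains shifted apart: each one is stuck in a different region
stuck = converged + np.array([[0.0], [3.0], [-3.0]])

print("R-hat, overlapping chains:", gelman_rubin(converged))
print("R-hat, separated chains:  ", gelman_rubin(stuck))
```

Values close to 1 suggest the chains agree with each other; values well above 1 suggest more burn-in, more draws or a different sampler is needed. A value near 1 is still only evidence of convergence, not proof.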



## Bayesian Estimator

Congratulations! You have a Bayesian estimator! This is a series of numbers that you can:
* Calculate the mean of
* Calculate the standard deviation of
* Check whether the middle 95% of the numbers includes 0
* Check how much overlap there is between different parameters
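
The operations in this list can be sketched in a few lines. The draws below are simulated stand-ins for sampler output, and the `summarize` helper is a hypothetical name used only here.

```python
import numpy as np

def summarize(draws, level=0.95):
    """Mean, standard deviation and central interval of a series of posterior draws."""
    draws = np.asarray(draws, dtype=float)
    tail = (1 - level) / 2
    lo, hi = np.quantile(draws, [tail, 1 - tail])
    return {
        "mean": draws.mean(),
        "sd": draws.std(ddof=1),
        "interval": (lo, hi),
        "includes_zero": bool(lo <= 0.0 <= hi),
    }

rng = np.random.default_rng(0)
slope_draws = rng.normal(20.0, 0.4, size=4000)     # stand-in for one parameter's draws
other_draws = rng.normal(19.5, 0.4, size=4000)     # stand-in for a second parameter

summary = summarize(slope_draws)
# Share of draws in which the first parameter exceeds the second
prob_greater = float(np.mean(slope_draws > other_draws))

print(summary)
print("P(first > second):", prob_greater)
```

Because the estimator is just a collection of draws, any probability statement about the parameters reduces to counting the share of draws that satisfy it.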

It also holds information about your prior:
* If your prior was flat (i.e. you didn't know much about the parameter) $\rightarrow$ the Bayesian estimator is relatively wide
* If your prior was informative (i.e. you have some specialist knowledge) $\rightarrow$ the Bayesian estimator could be relatively narrow

So we are able to systematically include what we know and don't know

## (Dis)Advantages of the Bayesian Method


### Advantages

* Intuitive result
* Factors in prior knowledge in a transparent way
* Does not rely on repeated-sampling assumptions about the dataset
* Yields a full posterior distribution rather than a single point estimate

### Disadvantages

* Convergence is hard to verify
* Not as well-known as frequentist methods
* Typically requires much more computing power
* There is no rule of thumb for which prior you should use

## Take-Away

This is an alternative way of approaching statistical inference

* It can be useful for some cases
* It can be unnecessary for other cases

But here is how I like to imagine some benefits within ING:

* Setting priors $\rightarrow$ letting the business side put their specialist knowledge to good use
 * "Well I know that mortgage lending is negatively influenced by interest rates" (i.e. prior parameter is in the negative range)
* Bayesian estimator $\rightarrow$ giving probabilities of different outcomes to the business side
 * "60% of our forecasts are showing a profit, but 40% show that we will make a loss" 

# Thank You!

<img src="figures/xkcd_bayes.png" width="350">