# Bayesian Modeling and Markov Chain Monte Carlo

### Data Science 350

## Overview

This notebook introduces you to a general and flexible form of Bayesian modeling using **Markov chain Monte Carlo** (MCMC) methods.

![](img/Flips.png)


## Review of Bayes Theorem

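Bayes theorem allows us to compute the conditional probability of $A$ given $B$ from the conditional probability of $B$ given $A$: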

$$P(A|B) = \frac{P(A)P(B|A)}{P(B)}$$

This is a bit of a mess. But fortunately, we don't always need the denominator. We can rewrite Bayes Theorem as:

$$P(A|B) = k \cdot P(B|A)P(A)$$

Ignoring the normalization constant $k$, we get:

$$P(A|B) \propto P(B|A)P(A)$$

### Bayesian parameter estimation

How do we interpret the relationships shown above? We do this as follows:

$$Posterior\ Distribution \propto Likelihood \bullet Prior\ Distribution \\
Or\\
P(parameters|data) \propto P(data|parameters)P(parameters) $$

These relationships apply to the observed data distributions, or to parameters in a model (partial slopes, intercept, error distributions, lasso constant,…). 
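
As a minimal sketch of this proportionality, the code below estimates the probability of heads `p` for a coin on a grid of candidate values. The data (7 heads in 10 flips), the uniform prior, and the names `p_grid`, `prior`, `like`, `post`, and `p_map` are all assumptions made for illustration, not part of the course code.

```r
# Hypothetical data: 7 heads observed in 10 coin flips (illustrative only)
heads = 7
flips = 10

# Grid of candidate values for the parameter p = P(heads)
p_grid = seq(0, 1, length.out = 101)

prior = rep(1, length(p_grid))                      # uniform prior (an assumption)
like  = dbinom(heads, size = flips, prob = p_grid)  # likelihood P(data | p)

post = like * prior        # posterior is proportional to likelihood * prior
post = post / sum(post)    # normalize so the grid values sum to 1

p_map = p_grid[which.max(post)]  # grid point with the highest posterior
cat('Posterior mode on the grid =', p_map, '\n')
```

Because the prior is flat here, the posterior has the same shape as the likelihood. A non-uniform `prior` vector would shift the result toward the prior.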

### Frequentist vs. Bayesian models

Let's summarize the differences between the Bayesian and Frequentist views.

- Bayesian methods use priors to quantify what we know about parameters.
- Frequentists do not quantify anything about the parameters, using p-values and confidence intervals to express the unknowns about parameters.

Recalling that both views are useful, we can contrast these methods with a chart.

![](img/FrequentistBayes.jpg)

In [1]:
comp.like.2 = function(x, mu, sigma){
    # Normal likelihood kernel for the mean of x, evaluated on a grid:
    # rows index candidate means (mu), columns index candidate sigmas
    n = length(x)  # sample size, used in the exponent below
    l = matrix(0, nrow = length(mu), ncol = length(sigma))
    xBar = mean(x)
    cat(' Mean =', xBar, 'Standard deviation =', sd(x), '\n')
    for(i in seq_along(sigma)){
        sigmaSqr = sigma[i]^2
        l[, i] = sapply(mu, 
                        function(u) exp(- n* (xBar - u)^2 / (2 * sigmaSqr)))
    }
    l / sum(l) # Normalize and return
}

## Grid Sampling and Scalability

Real-world Bayes models have large numbers of parameters, even into the millions. A naive approach to Bayesian analysis would be to simply grid sample across the dimensions of the parameter space. However, grid sampling will not scale. To understand the scaling problem, do the following thought experiment, where each dimension is sampled 100 times:

- For a 1-parameter model: $100$ samples.
- For a 2-parameter model: $100^2 = 10^4$ samples.
- For a 3-parameter model: $100^3 = 10^6$ samples.
- For a 100-parameter model: $100^{100} = 10^{200}$ samples. 

As you can see, the computational complexity of grid sampling has **exponential scaling** with dimensionality. Clearly, we need a better approach. 
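
The arithmetic of the thought experiment can be checked with a few lines of R. This is only a sketch. The evaluation rate of one billion grid points per second is a made-up figure used to put the numbers in perspective.

```r
# Number of grid points when each dimension is sampled 100 times
n_params = c(1, 2, 3, 100)
grid_points = 100^n_params

data.frame(parameters = n_params, grid_points = grid_points)

# Order of magnitude (log10) of the run time in seconds for the
# 100-parameter grid, assuming a hypothetical 1e9 evaluations per second
log10(grid_points[4] / 1e9)
```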

## Introduction to Markov Chain Monte Carlo

Large-scale Bayesian models use a family of efficient sampling methods known as **Markov chain Monte Carlo sampling**. MCMC methods are computationally efficient, but it takes some effort to understand how they work and what to do when things go wrong.

### What is a Markov process?

As you might guess, MCMC sampling uses a chain of **Markov sampling processes**. The chain is built from a sequence of individual Markov processes. A Markov process is any process that makes a transition from one state to other states with probability $\Pi$ with **no dependency on past states**. In summary, a Markov process has the following properties:
- $\Pi$ only depends on the current state.
- The process can transition to one or more other states.
- The process can ‘transition’ to the current state.
- $\Pi$ is a matrix of dimension $N \times N$ for $N$ possible states.
- A Markov process is a random walk since any possible transition can occur from each state.

A Markov chain is a sequence of Markov transition processes:

$$P(X_{t + 1}| X_t = x_t, \ldots, X_0 = x_0) = P(X_{t + 1}| X_t = x_t)$$

We say that the Markov process is **memoryless**. The transition probability only depends on the current state, not any previous state. 

For a system with $N$ possible states we can write the transition matrix $\Pi$ for the probability of transition from one state to another:

$$\Pi = 
\begin{bmatrix}
\pi_{1,1} & \pi_{1,2} & \cdots & \pi_{1, N}\\
\pi_{2,1} & \pi_{2,2} & \cdots & \pi_{2,N}\\
\vdots & \vdots & \ddots & \vdots \\
\pi_{N,1} & \pi_{N,2} & \cdots & \pi_{N,N}
\end{bmatrix}\\
where\\
\pi_{i,j} = probability\ of\ transition\ from\ state\ i\ to\ state\ j\\
and\\
\pi_{i,i} = probability\ of\ staying\ in\ state\ i\\
further\\
\pi_{i,j} \ne \pi_{j,i}\ in\ general
$$

Notice that none of these probabilities depend on the previous state history.
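
To make the transition matrix concrete, here is a small simulation sketch. The 3-state matrix `Pi` and the helper `sim.chain` are hypothetical and chosen only for illustration. Each row of `Pi` holds the transition probabilities out of one state, so each row must sum to 1.

```r
set.seed(4567)

# Hypothetical 3-state transition matrix; row i holds the probabilities
# of moving from state i to each state j
Pi = matrix(c(0.5, 0.3, 0.2,
              0.1, 0.8, 0.1,
              0.4, 0.4, 0.2), nrow = 3, byrow = TRUE)
stopifnot(all(abs(rowSums(Pi) - 1) < 1e-12))

sim.chain = function(Pi, n.steps, start = 1){
    states = integer(n.steps)
    states[1] = start
    for(t in seq_len(n.steps)[-1]){
        # The next state is drawn using only the row of the current state,
        # so earlier states play no role (memoryless)
        states[t] = sample(nrow(Pi), size = 1, prob = Pi[states[t - 1], ])
    }
    states
}

chain = sim.chain(Pi, 10000)
table(chain) / length(chain)   # empirical fraction of time spent in each state
```

Note that `Pi[1, 2]` and `Pi[2, 1]` differ, which illustrates that $\pi_{i,j} \ne \pi_{j,i}$ in general.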

### MCMC and the Metropolis Algorithm

The first MCMC sampling algorithm was the **Metropolis-Hastings algorithm** (Metropolis et al. (1953), Hastings (1970)). This algorithm is often referred to as the Metropolis algorithm or the M-H algorithm. The Metropolis algorithm uses the following steps to estimate the density of the likelihood of the parameters:
1. Pick a starting point in your parameter space and evaluate the likelihood $p(data|parameters)$ at that point according to your model. This is your initial sample.
2. Choose a nearby point in parameter space randomly and evaluate the likelihood at this point.
  - If the $p(data | parameters)$ of the new point is greater than that of your current point, accept the new point and move there.
  - If the $p(data | parameters)$ of the new point is less than that of your current point, accept the new point only with a probability given by the ratio below. Otherwise, stay at the current point.
$$Acceptance\ probability\ = \frac{p(data | new\ parameters)}{p(data | previous\ parameters)}$$
3. Repeat step 2 many times.
3. Repeat step 2 many times.
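
The steps above can be sketched in R as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the data are simulated, the standard deviation is treated as known so that the mean `mu` is the only parameter, the proposal is a symmetric Normal jump, and the names `log.like`, `metropolis`, and `prop.sd` are hypothetical. The likelihood ratio is computed on the log scale to avoid numerical underflow.

```r
set.seed(2345)

# Hypothetical data: 50 draws from a Normal with mean 3 and known sd 1
x = rnorm(50, mean = 3, sd = 1)

# Log-likelihood of the data for a candidate mean mu (sd assumed known)
log.like = function(mu) sum(dnorm(x, mean = mu, sd = 1, log = TRUE))

metropolis = function(log.like, start, n.samples, prop.sd){
    chain = numeric(n.samples)
    chain[1] = start                       # step 1: starting point
    current.ll = log.like(start)
    for(i in 2:n.samples){                 # step 3: repeat step 2 many times
        # step 2: propose a nearby point with a symmetric Normal jump
        proposal = rnorm(1, mean = chain[i - 1], sd = prop.sd)
        proposal.ll = log.like(proposal)
        # Accept with probability min(1, likelihood ratio), on the log scale
        if(log(runif(1)) < proposal.ll - current.ll){
            chain[i] = proposal
            current.ll = proposal.ll
        } else {
            chain[i] = chain[i - 1]        # rejected: stay at the current point
        }
    }
    chain
}

chain = metropolis(log.like, start = 0, n.samples = 5000, prop.sd = 0.5)
samples = chain[-(1:500)]                  # drop the early samples near the starting point
cat('Mean of samples =', mean(samples), ' Sample mean of data =', mean(x), '\n')
```

A new point with a higher likelihood gives a positive log ratio and is always accepted, while a point with a lower likelihood is accepted with probability equal to the ratio, matching the two cases in step 2.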


Now that we have outlined the basic Metropolis MCMC algorithm, let's summarize some of its properties:

- The M-H algorithm eventually converges to the underlying distribution.
- We only have to visit $N$ points, not 1 trillion points.
- There is high serial correlation in the M-H chain, which slows convergence.
- We need to ‘tune’ the state selection probability distribution used to find the next point. For example, if we use a Normal distribution we need to pick its standard deviation $\sigma$.
  - If $\sigma$ is too small, the chain will only search the space slowly.
  - If $\sigma$ is too big, the chain makes large jumps, which also slows convergence.
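
The tuning trade-off can be seen in a small experiment. The sketch below is an assumption-laden illustration: the target is a standard Normal, the helper `run.chain` and the three proposal widths are hypothetical, and the exact numbers printed depend on the random seed. It reports the fraction of accepted proposals and the lag-1 serial correlation of each chain.

```r
set.seed(987)

# Metropolis sampler for a standard Normal target (hypothetical example);
# prop.sd is the standard deviation of the Normal proposal distribution
run.chain = function(prop.sd, n = 5000){
    chain = numeric(n)    # the chain starts at 0
    for(i in 2:n){
        proposal = rnorm(1, chain[i - 1], prop.sd)
        log.ratio = dnorm(proposal, log = TRUE) - dnorm(chain[i - 1], log = TRUE)
        chain[i] = if(log(runif(1)) < log.ratio) proposal else chain[i - 1]
    }
    chain
}

for(s in c(0.05, 2.5, 50)){
    chain = run.chain(s)
    accept = mean(diff(chain) != 0)                      # fraction of accepted proposals
    lag1 = acf(chain, lag.max = 1, plot = FALSE)$acf[2]  # lag-1 serial correlation
    cat('proposal sd =', s, ' acceptance =', round(accept, 2), 
        ' lag-1 autocorrelation =', round(lag1, 2), '\n')
}
```

With a very small proposal width nearly every step is accepted but the chain barely moves, and with a very large width most proposals are rejected and the chain repeats the same value. Both cases show up as high serial correlation.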


$$p(x; \alpha, \beta) = \frac{\beta^\alpha}{\Gamma(\alpha)} x^{-\alpha - 1} \exp\bigg(-\frac{\beta}{x}\bigg)$$

In [None]:
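dinvgamma = function(alpha, beta, x){
    # Inverse gamma density for x > 0 with shape alpha and scale beta
    beta^alpha / gamma(alpha) * x^(-alpha - 1) * exp(-beta / x)
}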