# Bayesian Modeling and Markov Chain Monte Carlo

### Data Science 350

## Overview

This notebook introduces you to a general and flexible form of Bayesian modeling using **Markov chain Monte Carlo** (MCMC) methods. 

![](img/Flips.png)


## Review of Bayes Theorem

Recall Bayes theorem:

$$P(A|B) = \frac{P(A)P(B|A)}{P(B)}$$

Computing the normalization $P(B)$ is often difficult. Fortunately, we don't always need the denominator. We can rewrite Bayes theorem as:

$$P(A|B) = k \cdot P(B|A) P(A)$$

Ignoring the normalization constant $k$, we get:

$$P(A|B) \propto P(B|A) P(A)$$

### Bayesian parameter estimation

How do we interpret the relationships shown above? We do this as follows:

$$Posterior\ Distribution \propto Likelihood \bullet Prior\ Distribution \\
Or\\
P(parameters | data) \propto P(data | parameters) P(parameters) $$

These relationships apply to the observed data distributions, or to parameters in a model (partial slopes, intercept, error distributions, lasso constant,…). 
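As a minimal sketch of the proportionality above, the R code below evaluates likelihood times prior on a grid of values for a single parameter and then normalizes numerically. The coin-flip data (7 heads in 10 flips), the flat prior, and all variable names are illustrative assumptions, not part of the original analysis.

```r
# Hypothetical data: 7 heads observed in 10 coin flips
heads <- 7
flips <- 10

# Grid of candidate values for the parameter p = P(heads)
p_grid <- seq(0, 1, length.out = 101)

# Prior: flat (uniform) over the grid, an assumption for illustration
prior <- rep(1, length(p_grid))

# Likelihood of the data at each candidate parameter value
lik <- dbinom(heads, size = flips, prob = p_grid)

# Unnormalized posterior: likelihood * prior
unnormalized <- lik * prior

# Dividing by the sum plays the role of the normalization constant
posterior <- unnormalized / sum(unnormalized)

# The grid value with the highest posterior probability
p_grid[which.max(posterior)]
```

Because the prior is flat, the posterior mode coincides with the maximum likelihood value. A different prior would shift it.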

### Frequentist vs. Bayesian models

Let's summarize the differences between the Bayesian and Frequentist views. 

- Bayesian methods use priors to quantify what we know about parameters.
- Frequentists do not quantify anything about the parameters, using p-values and confidence intervals to express the unknowns about parameters.

Accepting that both views are useful, we can contrast these methods with a chart.

![](img/FrequentistBayes.jpg)

## Grid Sampling and Scalability

Real-world Bayes models have large numbers of parameters, even into the millions. A naive approach to Bayesian analysis would be to simply grid sample across the dimensions of the parameter space. However, grid sampling will not scale. To understand the scaling problem, do the following thought experiment, where each dimension is sampled 100 times:

- For a 1-parameter model: $100$ samples.
- For a 2-parameter model: $100^2 = 10000$ samples.
- For a 3-parameter model: $100^3 = 10^6$ samples.
- For a 100-parameter model: $100^{100} = 10^{200}$ samples. 

As you can see, the computational complexity of grid sampling has **exponential scaling** with dimensionality. Clearly, we need a better approach. 
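The arithmetic of the thought experiment can be checked with a few lines of R. This is a small sketch; the helper name `grid_size` is hypothetical and introduced here only for illustration.

```r
# Number of grid evaluations when each dimension is sampled n_per_dim times
grid_size <- function(n_params, n_per_dim = 100) {
  n_per_dim^n_params
}

# Tabulate the grid size for models of increasing dimension
n_params <- c(1, 2, 3, 10, 100)
data.frame(parameters = n_params,
           grid_points = sapply(n_params, grid_size))
```

Each added parameter multiplies the number of grid points by 100, which is why the grid becomes impractical long before a model reaches even ten parameters.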

## Introduction to Markov Chain Monte Carlo

Large-scale Bayesian models use a family of efficient sampling methods known as **Markov chain Monte Carlo (MCMC) sampling**. MCMC methods are computationally efficient, but it takes some effort to understand how they work and what to do when things go wrong. 

### What is a Markov process?

As you might guess, MCMC sampling uses a chain of **Markov sampling processes**. The chain is built from a sequence of individual Markov processes. A Markov process is any process that makes a transition from one state to other states with probability $\Pi$ with **no dependency on past states**. In summary, a Markov process has the following properties:
- $\Pi$ only depends on the current state.
- A transition is made from the current state to one or more other states.
- The process can ‘transition’ to the current state, that is, stay where it is.
- The transition probabilities form a matrix $\Pi$ of dimension $N \times N$ for $N$ possible states.
- A Markov process is a random walk, since a transition to a new state, $j$, can occur from state $i$ whenever $\pi_{i,j} \gt 0$.

Since a Markov chain is a **memoryless** sequence of Markov transition processes, we can write:

$$P(X_{t + 1}| X_t = x_t, \ldots, X_0 = x_0) = P(X_{t + 1}| X_t = x_t)$$

Since the Markov process is memoryless, the transition probability only depends on the current state, not any previous states. 

For a system with $N$ possible states we can write the transition matrix $\Pi$ for the probability of transition from one state to another:

$$\Pi = 
\begin{bmatrix}
\pi_{1,1} & \pi_{1,2} & \cdots & \pi_{1, N}\\
\pi_{2,1} & \pi_{2,2} & \cdots & \pi_{2,N}\\
\vdots & \vdots & \ddots & \vdots \\
\pi_{N,1} & \pi_{N,2} & \cdots & \pi_{N,N}
\end{bmatrix}\\
where\\
\pi_{i,j} = probability\ of\ transition\ from\ state\ i\ to\ state\ j\\
and\\
\pi_{i,i} = probability\ of\ staying\ in\ state\ i\\
further\\
\pi_{i,j} \ne \pi_{j,i}\ in\ general
$$

Notice that none of these probabilities depend on the previous state history.
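To make the transition matrix concrete, here is a minimal R sketch that simulates a 3-state Markov chain. The matrix values, the chain length, and the variable names are assumptions chosen for illustration. Each new state is drawn using only the row of $\Pi$ for the current state, which is the memoryless property in action.

```r
set.seed(42)  # for reproducibility

# Hypothetical 3-state transition matrix: row i holds the probabilities
# of moving from state i to each state j, so each row sums to 1
Pi <- matrix(c(0.8, 0.1, 0.1,
               0.2, 0.6, 0.2,
               0.3, 0.3, 0.4),
             nrow = 3, byrow = TRUE)
stopifnot(all(abs(rowSums(Pi) - 1) < 1e-12))

# Simulate the chain: the next state is drawn using only the current state
n_steps <- 10000
states <- integer(n_steps)
states[1] <- 1
for (step in 2:n_steps) {
  states[step] <- sample(1:3, size = 1, prob = Pi[states[step - 1], ])
}

# Fraction of time the simulated chain spent in each state
empirical <- as.numeric(table(factor(states, levels = 1:3))) / n_steps

# Long-run distribution from repeated multiplication by Pi
long_run <- c(1, 0, 0)
for (k in 1:200) long_run <- as.numeric(long_run %*% Pi)

round(rbind(empirical, long_run), 3)
```

The two rows should roughly agree: the fraction of time the chain spends in each state approaches the long-run distribution implied by $\Pi$, regardless of the starting state.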

### MCMC and the Metropolis Algorithm

The first MCMC sampling algorithm developed was the **Metropolis-Hastings algorithm** (Metropolis et al. (1953), Hastings (1970)), abbreviated here as M-H. This algorithm is often referred to as the Metropolis algorithm. The Metropolis algorithm has the following steps to estimate the density of the likelihood of the parameters:
1. Pick a starting point in your parameter space and evaluate the posterior according to your model. In other words, take an initial sample of the likelihood $p(data|parameters)$.
2. Choose a nearby point in parameter space randomly and evaluate the likelihood at this point. A probability distribution is used to make this random selection. The Normal distribution is a common choice.
  - If the $p(data | parameters)$ of the new point is greater than your current point, accept the new point and move there.
  - If the $p(data | parameters)$ of the new point is less than your current point, only accept the new point with probability according to the ratio below. Otherwise, stay at the current point and record it as a sample again.  
$$Acceptance\ probability\ = \frac{p(data | new\ parameters)}{p(data | previous\ parameters)}$$
3. Repeat step 2 many times.
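
The two accept/reject cases in step 2 can be folded into a single rule. Writing $\theta$ for the current parameters and $\theta^*$ for the proposed ones (notation introduced here for illustration), and assuming a symmetric proposal distribution such as the Normal:

$$\alpha = \min \left( 1, \frac{p(data | \theta^*)}{p(data | \theta)} \right)$$

Draw $u$ from $U(0, 1)$ and accept the proposal if $u \lt \alpha$. For example, if the likelihood is $0.05$ at the current point and $0.02$ at the proposed point, the ratio is $0.4$ and the move is accepted about 40% of the time. If the proposed point has the higher likelihood, the ratio exceeds 1 and the move is always accepted.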


Now that we have outlined the basic Metropolis MCMC algorithm, let's examine some of its properties.

- Since the M-H algorithm samples the parameter space we only need to visit a limited number of points, rather than sample an entire grid. 
- The M-H algorithm is guaranteed to **eventually converge** to the underlying distribution.
- If there is high serial correlation from one sample to the next, the M-H chain converges slowly. 
- To ensure efficient convergence we need to ‘tune’ the state selection probability distribution used to find the next point. For example, if we use a Normal distribution we must pick $\sigma$. If $\sigma$ is too small, the chain will only search the space slowly, with small jumps. If $\sigma$ is too big, most of the large proposed jumps are rejected, which also slows convergence.
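
The effect of the proposal $\sigma$ can be seen in a one-dimensional sketch. The code below assumes a standard Normal target and a hypothetical helper, `metropolis_1d`, written only for this illustration. It reports the acceptance rate and the lag-1 autocorrelation for a small, a moderate, and a large proposal standard deviation.

```r
set.seed(1)

# Run a 1-D Metropolis sampler on a standard Normal target and
# return the chain along with its acceptance rate
metropolis_1d <- function(n_samples, proposal_sd, start = 0) {
  chain <- numeric(n_samples)
  chain[1] <- start
  accepted <- 0
  for (i in 2:n_samples) {
    proposal <- chain[i - 1] + rnorm(1, mean = 0, sd = proposal_sd)
    # Ratio of target densities at the proposed and current points
    ratio <- dnorm(proposal) / dnorm(chain[i - 1])
    if (runif(1) < ratio) {
      chain[i] <- proposal
      accepted <- accepted + 1
    } else {
      chain[i] <- chain[i - 1]
    }
  }
  list(chain = chain, accept_rate = accepted / (n_samples - 1))
}

# Compare a small, a moderate, and a large proposal standard deviation
for (s in c(0.05, 2.5, 50)) {
  result <- metropolis_1d(5000, proposal_sd = s)
  cat('proposal sd =', s,
      ' acceptance rate =', round(result$accept_rate, 3),
      ' lag-1 autocorrelation =',
      round(acf(result$chain, plot = FALSE)$acf[2], 3), '\n')
}
```

With a very small $\sigma$ nearly every proposal is accepted but successive samples are almost identical. With a very large $\sigma$ most proposals are rejected and the chain repeats the same value. Both extremes give high serial correlation, so a high acceptance rate on its own does not guarantee efficient sampling.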

### M-H algorithm example

Let's make these concepts concrete, by trying a simple example.

As a first step, let's plot a set of points with density determined by a bivariate Normal distribution. Execute the code below and examine the resulting plot.

In [None]:
library(MASS)
random_points = mvrnorm(10000, mu=c(0.5,0.5), Sigma=matrix(c(1,0.6,0.6,1), nrow=2))
plot(random_points[,1], random_points[,2], xlim=c(-4,4), ylim=c(-4,4), col=rgb(0,0,0,0.25),
     main = 'Draws from a bivariate Normal distribution', 
    xlab = 'X', ylab = 'Y')

This plot looks as expected. The density of the dots is proportional to the probabilities. You can see the effect of the covariance structure in these data.

As a next step, let's look at the density of the marginal probabilities of the $X$ and $Y$ variables. The code in the cell below plots histogram and density plots of the marginals. Execute this code and examine the result. 

In [None]:
par(mfrow = c(2,1))
hist(random_points[,1], freq = FALSE, breaks = 40,
     main = 'Marginal X distribution', 
     xlab = 'X')
lines(density(random_points[,1]))
hist(random_points[,2], freq = FALSE, breaks = 40,
     main = 'Marginal Y distribution', 
     xlab = 'Y') 
lines(density(random_points[,2]))
par(mfrow = c(1,1))

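You can see that these distributions are approximately Normal. The marginals of a bivariate Normal are themselves Normal, so any slight skew you see is the result of sampling variation. 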

Now, we are ready to sample these data using the M-H MCMC algorithm. The code in the cell below performs the following operations:

1. Compute the likelihood of the bi-variate Normal distribution. 
2. Initialize the chain.
3. Initialize some performance statistics.
4. Sample the likelihood of the data using the M-H algorithm.
5. Plot the result.

Execute this code and examine the result. 

In [None]:
# Given a point, our value at that point(x,y) will be the 
# value of the distribution at x,y:
likelihood = function(x,y){
  sigma = matrix(c(1,0.6,0.6,1), nrow=2)
  mu = c(0.5,0.5)
  dist = c(x,y) - mu
  # Bivariate Normal density: 1/sqrt(4*pi^2*|Sigma|) * exp(-0.5 * d' Sigma^-1 d)
  value = (1/sqrt(4*pi^2*det(sigma))) * exp((-1/2) * t(dist) %*% ginv(sigma) %*% t(t(dist)) )
  return(value)
}

# Where to start:
x_chain = 4
y_chain = -4
# Chain length:
chain_length = 10000

#Evaluate current position:
current_val = likelihood(x_chain,y_chain)
current_val

# Standard deviation of how far out to propose:
proposal_sd = .1

# Keep track of things:
accept_count = 0
reject_count = 0


for (n in 1:(chain_length-1)){ # chain length minus 1 because we already have a point (the starting point)
  proposed_x = x_chain[n] + rnorm(1, mean=0, sd=proposal_sd)
  proposed_y = y_chain[n] + rnorm(1, mean=0, sd=proposal_sd)
  proposed_val = likelihood(proposed_x, proposed_y)
  
  # Accept according to probability:
  if (runif(1) < (proposed_val/current_val)){
    x_chain = c(x_chain, proposed_x)
    y_chain = c(y_chain, proposed_y)
    current_val = proposed_val
    accept_count = accept_count + 1
  }else{
    x_chain = c(x_chain, x_chain[n])
    y_chain = c(y_chain, y_chain[n])
    reject_count = reject_count + 1
  } 
}

plot(x_chain, y_chain, col=rgb(0,0,0,0.25), xlim=c(-4,4), ylim=c(-4,4),
     main="MCMC values for a Bivariate Normal", xlab="x", ylab="y")


Notice the long 'tail' on the sampled distribution. This behavior arises from the initial wandering of the Markov chain as it finds the high probability regions of the distribution. This period in which the Markov chain wanders is known as the **burn-in period**.

The code in the cell below plots the same Markov chain, but with the first 1000 values (10% of the chain) removed. The remaining samples are from the post burn-in chain. 

In [None]:
# Burn in problem.  Solution?  Throw away first part of chain.
num_burnin = round(0.1*chain_length)
num_burnin

# Start one past the burn-in so that exactly num_burnin samples are dropped
plot(x_chain[(num_burnin + 1):chain_length], y_chain[(num_burnin + 1):chain_length],
     col=rgb(0,0,0,0.25), xlim=c(-4,4), ylim=c(-4,4),
     main="MCMC values for a Bivariate Normal with burn-in removed", xlab="x", ylab="y")

In the plot above you can see that there is no 'tail' in the sampled distribution. As expected, the tail was sampled during the burn-in period and was not significant in sampling the distribution.

Let's plot the density of the marginal distribution of these samples. We can then compare these densities to those of the original data we generated. The code in the cell below plots a histogram of the sampled marginal distributions along with the density of the original samples. Execute this code and compare the results.  

In [None]:
par(mfrow = c(2,1))
hist(x_chain[(num_burnin + 1):chain_length], freq = FALSE, breaks = 40,
     main = 'Marginal X distribution', 
     xlab = 'X')
lines(density(random_points[,1]))
hist(y_chain[(num_burnin + 1):chain_length], freq = FALSE, breaks = 40,
     main = 'Marginal Y distribution', 
     xlab = 'Y') 
lines(density(random_points[,2]))
par(mfrow = c(1,1))

Notice that the histograms of the sampled marginal distributions are close to the marginal density of the original data. There is some skew as a result of sampling error.

Next, let's compare the **maximum a posteriori (MAP)** point of the sampled marginal distributions to the original means for $x$ and $y$. The code in the cell below approximates the MAP using the `mean` function. Execute this code and compare the results to the original data with $x = 0.5$ and $y = 0.5$. 

In [None]:
# Estimate bivariate MAP from chain:
mcmc_map = c(mean(x_chain), mean(y_chain))
mcmc_map

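The MAP values of the sampled marginal distributions are close to the values for the original data. However, the mean approximation for the MAP seems to be biased by the skew in the sampled distribution. Note also that the mean here is computed over the whole chain, including the burn-in samples near the starting point, which pulls the estimate away from the true values.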

Let's turn our attention to the convergence properties of the M-H MCMC sampler. The **acceptance rate** and **rejection rate** are key convergence statistics for the M-H algorithm. A low acceptance rate and high rejection rate are signs of poor convergence. Execute the code in the cell below, which computes and displays these statistics, and examine the results.  

In [None]:
# Acceptance/Reject rate:
cat('Acceptance rate =', accept_count/chain_length, '\n')
cat('Rejection rate =', reject_count/chain_length, '\n')

These statistics indicate good convergence, with a fairly low rejection rate.

Another way to evaluate the convergence of MCMC algorithms is to look at the **trace** of the samples. The trace is a plot of the sample value against sample number. The code in the cell below plots the trace for both the $x$ and $y$ samples, including the burn-in period. Execute this code and examine the results. 

In [None]:
# Always look at the chain, we would like random noise centered around means
par(mfrow = c(2,1))
plot(x_chain, type="l", main = 'X chain', ylab = 'Value')
plot(y_chain, type="l", main = 'Y chain', ylab = 'Value')
par(mfrow = c(1,1))

Examine these sample traces. Notice that there is a significant excursion during the initial burn-in period. After the initial burn-in you can see that the sampling wanders around the mode of the distribution, as it should. 

Let's look at a close-up view of the portion of these traces just after the burn-in period. The code in the cell below plots samples 1000 to 2000. Execute this code and examine the results. 

In [None]:
## Look at a shorter segment of the chain
# Always look at the chain, we would like random noise centered around means
par(mfrow = c(2,1))
plot(x_chain[1000:2000], type="l", main = 'X chain', ylab = 'Value')
plot(y_chain[1000:2000], type="l", main = 'Y chain', ylab = 'Value')
par(mfrow = c(1,1))

Notice that, for the most part, the samples are centered on the MAP for $x$ and $y$. This is the ideal behavior of the M-H algorithm. 

## Gibbs Sampling and Hierarchical Models

With some experience with the Metropolis-Hastings MCMC algorithm, let's try a Bayes hierarchical model example using Gibbs sampled MCMC. Hierarchical models can be quite complex and provide a great deal of flexibility. The Gibbs sampler can provide a significant improvement in efficiency over the Metropolis-Hastings algorithm. 

### Gibbs sampling

The Metropolis Hastings algorithm is a useful tool. However, this algorithm can suffer from slow convergence for several reasons:

- Samples from the M-H algorithm generally have a fairly high serial correlation. This problem results from taking steps in random directions.
- As already mentioned, we need to ‘tune’ the state selection probability distribution used to find the next point. For example if we use Normal distribution we must pick $\sigma$. If $\sigma$ is too small, the chain will only search the space slowly, with small jumps. If $\sigma$ is too big, there are large jumps which slow convergence.

The Gibbs sampler (Geman and Geman, 1984) is an improved MCMC sampler which speeds convergence. The basic Gibbs sampler algorithm has the following steps:

1. For an N dimensional parameter space, $\{ \theta_1, \theta_2, \ldots, \theta_N \}$, find a random starting point. 
2. Starting with dimension $1$, cycle through each dimension in order, $\{1, 2, 3, \ldots, N\}$:  
  - Sample the conditional distribution of the parameter, given the data and the current values of the other parameters:
  $$p(\theta_1|D, \theta_2, \theta_3, \ldots, \theta_N)\\ 
  where\\
  D\ is\ the\ data$$
  - Repeat this sampling procedure for each remaining dimension in order, $\{2, 3, \ldots, N\}$.
3. Repeat step 2 until convergence.    

From this simplified description of the Gibbs sampling algorithm you can infer:

- When compared to the Metropolis-Hastings algorithm, the Gibbs sampler reduces serial correlation owing to the round-robin nature of the sampling.   
- There are no tuning parameters since sampling is based on the marginals of the likelihood.
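
As a minimal sketch of these steps, the code below applies Gibbs sampling to the same bivariate Normal used in the M-H example (means of 0.5, unit variances, covariance 0.6). For this target each conditional distribution is itself Normal, with mean $\mu_1 + \rho (y - \mu_2)$ and standard deviation $\sqrt{1 - \rho^2}$, so every draw is accepted and no proposal width is needed. The variable names `gx`, `gy` and the burn-in length of 500 are assumptions made for illustration.

```r
set.seed(7)

# Target: bivariate Normal with means 0.5, unit variances, covariance 0.6
mu <- c(0.5, 0.5)
rho <- 0.6
cond_sd <- sqrt(1 - rho^2)  # standard deviation of each conditional

n_samples <- 5000
gx <- numeric(n_samples)
gy <- numeric(n_samples)
gx[1] <- 4    # deliberately poor starting point
gy[1] <- -4

for (i in 2:n_samples) {
  # Dimension 1: draw x from p(x | y), using the most recent y
  gx[i] <- rnorm(1, mean = mu[1] + rho * (gy[i - 1] - mu[2]), sd = cond_sd)
  # Dimension 2: draw y from p(y | x), using the x just drawn
  gy[i] <- rnorm(1, mean = mu[2] + rho * (gx[i] - mu[1]), sd = cond_sd)
}

# Discard an initial burn-in segment, then summarize
keep <- 501:n_samples
c(mean_x = mean(gx[keep]), mean_y = mean(gy[keep]),
  cor_xy = cor(gx[keep], gy[keep]))
```

The sample means and correlation should land near 0.5 and 0.6. In models where the conditional distributions are not available in closed form, a tool such as JAGS constructs the samplers for you.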

### Hierarchical modeling example

In this case, we will undertake a univariate regression problem using synthetic data. The regression model has two parameters, a slope and an intercept. The variance of the data is an additional 'nuisance' parameter. To compute these parameters, accounting for their dependency, we will use a hierarchical Bayes model. 

Hierarchical Bayes models depend on the **chain rule** for Bayes theorem. The chain rule allows us to expand Bayes theorem to accommodate multi-parameter models. We can write the basic chain rule for Bayes theorem like this:

$$p(\theta, \sigma | D) \propto p(D| \theta, \sigma) p(\theta, \sigma)\\
\propto p(D | \theta) p(\theta | \sigma) p(\sigma)\\
\propto\ Likelihood\ *\ Prior\ of\ \theta\ given\ \sigma\ *\ Prior\ of\ \sigma$$

As you can see, a complex multi-parameter Bayesian model is transformed to a hierarchy. The hierarchy is a chain of prior distributions (unconditional and conditional) and a likelihood dependent only on one parameter.  

As a first step the code in the cell below generates bi-variate data with Normally distributed errors and plots the result. Execute this code to compute the data.     

In [None]:
## Set up the data set as a regression problem
require('rjags')
N <- 1000
x <- 1:N
epsilon <- rnorm(N, 0, 100)
y <- x + epsilon

plot(x, y, main = 'Synthetic data for Bayes regression problem')

The regression model has two parameters, an intercept and a slope, which we will call $a$ and $b$. We will use a **hierarchical Bayes model**. The model is considered hierarchical since the quantity we really want to know, the posterior distribution of the label, which we will refer to as $\hat{y}$, depends on the distribution of other model parameters. In this case, the posterior distribution of $\hat{y}$ depends on both the regression coefficients and an error term. We can visualize the hierarchical relationships in this model in the diagram below.

![](img/HierarchicalModel.jpg)
<center> **Hierarchical model for the posterior distribution of y** </center>

In mathematical terms we can define the hierarchical model as follows:
 
1. The prior of the dispersion, $\sigma$, is defined as the Uniform distribution:
$$U(0, 100)$$
2. The precision (inverse variance) of the label values is computed as a power function of $\sigma$:
$$\tau = \sigma^{-2}$$
3. The prior distributions of the regression model parameters, $a$ and $b$, are modeled as Normal distributions, where the second argument is the precision:
$$N(0, 0.01)$$
4. The regression model for estimating $\hat{y_i}$ is defined by:
$$\hat{y_i} = a + b x_i$$
5. The posterior distribution of the label values is modeled as a Normal distribution with mean $\hat{y_i}$ and precision $\tau$:
$$N(\hat{y_i}, \tau)$$
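
Putting these pieces together gives the joint posterior that the sampler targets. This is a sketch that writes each Normal in terms of its mean and precision, as the JAGS `dnorm` function does, with $\tau = \sigma^{-2}$. The product form assumes the observations are independent given the parameters:

$$p(a, b, \sigma | x, y) \propto \left[ \prod_{i=1}^{N} N(y_i | a + b x_i, \tau) \right] N(a | 0, 0.01)\ N(b | 0, 0.01)\ U(\sigma | 0, 100)$$

The term in brackets is the likelihood and the remaining three factors are the priors, following the same likelihood-times-prior pattern as the chain rule above.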

### Computing the model with JAGS

To compute the MCMC samples we will use the JAGS (Just Another Gibbs Sampler) package. JAGS is a multi-platform Gibbs sampler whose modeling language is derived from the BUGS language (Bayesian inference Using Gibbs Sampling). 

The JAGS model is defined in the `example.bug` file. The model definition in this file is shown here:

```
model
{
	for (i in 1:N)
	{
		y[i] ~ dnorm(y.hat[i], tau)
		y.hat[i] <- a + b * x[i]
	}
	a ~ dnorm(0, .01)
	b ~ dnorm(0, .01)
	tau <- pow(sigma, -2)
	sigma ~ dunif(0, 100)
}
```

Notice that the model definition in the BUGS language works from the bottom of the hierarchy up. 
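
If you do not have the model file on disk, one option is to write the model text to a file from R. The sketch below uses only base R. The temporary location and the name `model_file` are assumptions made for illustration, and the resulting path could be passed as the first argument of `jags.model` in place of a hard-coded path.

```r
# The model text, matching the definition shown above
model_text <- "
model
{
  for (i in 1:N)
  {
    y[i] ~ dnorm(y.hat[i], tau)
    y.hat[i] <- a + b * x[i]
  }
  a ~ dnorm(0, .01)
  b ~ dnorm(0, .01)
  tau <- pow(sigma, -2)
  sigma ~ dunif(0, 100)
}
"

# Write the model to a temporary file (hypothetical location)
model_file <- file.path(tempdir(), 'example.bug')
writeLines(model_text, model_file)

# Confirm the file exists and inspect its contents
file.exists(model_file)
cat(readLines(model_file), sep = '\n')
```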

With the model defined we need to execute it using the `rjags` package. This package orchestrates the execution of JAGS models from R. The cell below contains the code to compile our JAGS model. The call to the `jags.model` function uses the following arguments:

1. The path to the model file.
2. A list with the data, x, y, and the number of cases.
3. The number of chains to use for the MCMC sampling.
4. The number of adaptation ('burn-in') samples.

Execute this code to compile and run the model.

In [None]:
## Run the jags model
path = 'C:\\Users\\StevePC2\\Documents\\Git\\DataScience350\\Lecture9' # SET YOUR PATH HERE!!
full.path = file.path(path, 'example.bug')
jags.mod.reg <- jags.model(full.path,
                   data = list('x' = x,
                               'y' = y,
                               'N' = N),
                   n.chains = 4,
                   n.adapt = 1000)

***
**Your turn.** Use a reduced size data set of 50 samples to compute another Bayesian regression model. Ensure you do the following:
1. Use the R `sample` function to create an index for the samples of `x` and `y`.
2. You may wish to plot your sampled data to ensure your sampling worked as desired.
3. Use another name for your model.
4. Set `N` to 50 in the data list.
***

### Evaluating the model

With the model compiled and the posterior sampled, we can now extract the samples. The `coda.samples` function extracts the samples from the Markov chain in a form usable by the coda package. Execute this code to extract the samples. 

In [None]:
## Compute some samples
samples <- coda.samples(jags.mod.reg,
                        c('a', 'b'),
                        1000)

***
**Your turn.** Extract samples for the model you created with 50 data points. Make sure you give another name to these samples. 
***

We can now examine the convergence properties of the Gibbs sampled MCMC. As a first step, the code below plots the traces of the 4 chains and the marginal density of the intercept and slope parameters, $a$ and $b$. Execute this code and examine the results. 

In [None]:
library(coda)
plot(samples) # Plot the result

Examine these plots, noting the following:

1. The trace plots show the path of the 4 MCMC chains for the $a$ and $b$ model parameters in dotted lines.
2. The solid lines show the rolling value of the MAP.
3. The density plots for the $a$ and $b$ model parameters are shown on the right. The MAP value of the intercept is close to the actual value of 0.0, and the MAP value of the slope is close to the actual value of 1.0.
4. The rug plot at the base of the density plots shows the density of the samples. 

***
**Your turn.** Plot the samples you extracted from the model computed with 50 data points. How do these results compare to the model computed with 1000 data points?
***

Coda samples have a summary method. Execute the code in the cell below to print the summary of the Gibbs MCMC sampling. 

In [None]:
summary(samples) # Summary statistics

The Coda summary shows a lot of useful information, including:

1. The first block shows the properties of the Markov chain.
2. The second block contains the values of the coefficients along with error metrics which include: 
  - The standard deviation (SD) of the coefficient values. In this case you can see that the intercept, $a$, is close to 0, whereas the slope, $b$, is significant.
  - The sampling error (SE) is the error arising from the MCMC sampling.
  - The time series SE is the sampling error adjusted for serial correlation in the Markov chain.
3. The third block is a table of quantiles for the model parameters. In this case, the conclusions are similar to the ones possible from the SD. 

***
**Your turn.** Display the summary of the model you created with 50 data points. Compare the results to the model computed using 1000 data points.
***

Let's compare the results from Gibbs MCMC samples with a conventional linear model. The code in the cell below computes a linear model and prints the summary. Execute this code and compare the results to the Gibbs MCMC sample results. 

In [None]:
lm.mod = lm(y ~ x, data = data.frame(x = x, y = y))
summary(lm.mod)

The values and error estimates for the intercept and  slope parameters from the conventional linear model are close to those obtained with Gibbs MCMC sampling. However, the linear model provides no information on the  posterior distribution beyond these simple metrics. For example, there are no quantiles for the coefficients. 

***
**Your turn.** Compute and print the summary of a linear model for the 50 data points you used for your Bayesian model. Compare the results to the Bayesian model and the linear model computed using 1000 data points.
***

Another way to examine the convergence of a Markov chain is to plot the cumulative (running) values of the coefficients and their spread vs. the sample number. The `cumuplot` function creates just such a plot for a coda Markov chain, showing the running quantiles of each parameter. Execute the code in the cell below and examine the results for each chain. 

In [None]:
cumuplot(samples) # Cumulative mean for each chain

The plots show that each chain converges to similar values and with similar standard deviations. This indicates that the Gibbs sampled Markov chains are converging properly.

A plot of the **Gelman-Rubin statistic** (Gelman and Rubin, 1992) measures the ratio of the **variance shrinkage between chains** to the **variance shrinkage within chains**. The Gelman-Rubin statistic should converge to 1.0. The code in the cell below uses the `gelman.plot` function to produce a plot of the Gelman-Rubin statistic and its 97.5% credible interval vs. sample number. Execute this code and examine the result. 

In [None]:
gelman.plot(samples) # Gelman convergence plot

The plots show good convergence of the Markov chains for both model parameters. The Gelman-Rubin statistics converge to 1.0, as does the credible interval. 
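
To see what the statistic measures, the sketch below computes a basic version of the Gelman-Rubin potential scale reduction factor by hand in base R. The simulated chains and the function name `potential_scale_reduction` are assumptions made for illustration. The coda functions apply additional corrections, so their values may differ slightly from this simplified form.

```r
set.seed(11)

# Hypothetical example: 4 well-mixed chains of 1000 draws each, one per column
n_iter <- 1000
n_chains <- 4
chains <- matrix(rnorm(n_iter * n_chains, mean = 1, sd = 0.1),
                 nrow = n_iter, ncol = n_chains)

potential_scale_reduction <- function(chains) {
  n <- nrow(chains)
  W <- mean(apply(chains, 2, var))     # mean within-chain variance
  B <- n * var(colMeans(chains))       # between-chain variance
  var_hat <- (n - 1) / n * W + B / n   # pooled estimate of the variance
  sqrt(var_hat / W)
}

# Chains that agree give a value close to 1
potential_scale_reduction(chains)

# Shifting one chain away from the others inflates the statistic
shifted <- chains
shifted[, 1] <- shifted[, 1] + 0.5
potential_scale_reduction(shifted)
```

When all chains sample the same distribution, the between-chain variance adds almost nothing to the pooled estimate and the ratio is close to 1. A chain stuck in a different region pushes the value well above 1.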

As we already discussed, the convergence of MCMC algorithms is slowed by **autocorrelation** between the samples. The Coda package contains two functions for examining the autocorrelation in Markov chains. The `autocorr.diag` function provides a table of average autocorrelation values by model parameter. The `autocorr.plot` function creates autocorrelation function plots for each parameter and chain combination. To examine the results for the Gibbs sampled Markov chain execute the code in the cell below. 

In [None]:
## Look at the autocorrelation of the chain
autocorr.diag(samples)
autocorr.plot(samples)

You can see that there are several significant lag values of the ACF. This means that the convergence of the Markov chains was impeded by this autocorrelation.

Given the significant autocorrelation in the samples, we can compute an **effective sample size or ESS**. If there is significant autocorrelation the ESS will be significantly less than the raw sample size. We can compute the ESS as follows:

$$ESS = \frac{N}{1 + 2 \sum_k ACF(k)}$$
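
The formula can be applied directly to a single chain using the sample autocorrelation function. The sketch below assumes a simulated AR(1) series as a stand-in for an autocorrelated Markov chain, and truncates the sum at the first non-positive autocorrelation, which is one simple rule among several. The `effectiveSize` function in coda uses a different estimator, so its values need not match this one exactly.

```r
set.seed(3)

# Hypothetical autocorrelated chain: AR(1) process with coefficient 0.8
n <- 5000
chain <- as.numeric(arima.sim(model = list(ar = 0.8), n = n))

# Sample autocorrelations at lags 1, 2, ... (drop lag 0, which is always 1)
rho_k <- acf(chain, lag.max = 200, plot = FALSE)$acf[-1]

# Truncate the sum at the first non-positive autocorrelation, since later
# terms are dominated by noise (a simple truncation rule, assumed here)
first_nonpos <- which(rho_k <= 0)[1]
if (!is.na(first_nonpos)) rho_k <- rho_k[seq_len(first_nonpos - 1)]

# Effective sample size from the formula above
ess <- n / (1 + 2 * sum(rho_k))
ess
```

For an AR(1) process with coefficient 0.8 the theoretical value is $N (1 - 0.8) / (1 + 0.8)$, or roughly one ninth of the raw sample size, so the estimate should be far below 5000.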

The code in the cell below computes the ESS and rejection rate for the Gibbs sampled Markov chain. Execute this code and examine the results. 

In [None]:
## What is the effective size of the sample?
effectiveSize(samples) 
rejectionRate(samples)

You can see that the effective sample size is much lower than the raw sample size. Still, the effective sample size appears to be sufficient to provide good estimates of the posterior distributions of the parameters. 

## Summary

In this notebook you have done the following:

- Reviewed the basic properties of a Markov process.
- Performed a simple Markov chain Monte Carlo sampling using the Metropolis-Hastings algorithm.
- Created and computed a hierarchical Bayes model using Gibbs sampling.
- Evaluated the convergence of the model. 

#### Copyright 2017, Stephen F Elston. All rights reserved. 