# MCMC Diagnostics

Goals:
* Learn how to determine whether a Markov chain can reliably be used for inference

## References

todo

## Diagnostics

To be useful to us, a chain must
1. have converged to the posterior distribution
2. provide enough effectively independent samples to characterize it

What would make us confident of convergence?
* Is the chain stationary?
* Do independent chains started from overdispersed positions find the same solution?

How do we guess the number of independent samples?
* Check how well the chain appears to be exploring the distribution.
* Compare the autocorrelation length scale with the chain length.

There are numerical estimates that can help with this, but **they are not a substitute for human visual inspection**.

### Common misuses of/misconceptions about convergence

Convergence does **not** mean
* that parameters are "well constrained" by the data
* that the autocorrelation length is small
* that there are not occasional excursions beyond a locus in parameter space

### Convergence tests
* Inspection! There is no substitute.
* Gelman-Rubin statistic

### Convergence tests - inspection

<table>
    <tr>
        <td><img src="graphics/mc1_sandbox_ab.png" width=100%></td>
    </tr>
</table>

### Convergence tests - inspection
* How stationary does each sequence appear? 
* Are all chains sampling the same PDF?
<table>
    <tr>
        <td><img src="graphics/mc1_sandbox_a.png" width=100%></td>
    </tr>
</table>

### Convergence tests - inspection

<table>
    <tr>
        <td><img src="graphics/mc1_sandbox_b.png" width=100%></td>
    </tr>
</table>

Conservatively, we might remove the first $\sim2000$ steps based on this.

### Convergence tests - Gelman-Rubin statistic

This approach tests the similarity of independent chains intended to sample the same PDF. To be meaningful, the chains should start from different locations, and burn-in should be removed first.

For a given parameter, $\theta$, the $R$ statistic compares the variance across chains with the variance within a chain. Intuitively, if the chains are random-walking in very different places, i.e. not sampling the same distribution, $R$ will be large.

We'd like to see $R\approx 1$ (e.g. $R<1.1$ is often used).

### Convergence tests - Gelman-Rubin statistic
In detail, given chains $j=1,\ldots,m$, each of length $n$,

* Let $B=\frac{n}{m-1} \sum_j \left(\bar{\theta}_j - \bar{\theta}\right)^2$, where $\bar{\theta}_j$ is the average $\theta$ for chain $j$ and $\bar{\theta}$ is the global average. This is $n$ times the sample variance of the individual-chain averages for $\theta$.

* Let $W=\frac{1}{m}\sum_j s_j^2$, where $s_j^2$ is the estimated variance of $\theta$ within chain $j$. This is the average of the individual-chain variances for $\theta$.

* Let $V=\frac{n-1}{n}W + \frac{1}{n}B$. This is an estimate for the overall variance of $\theta$.

### Convergence tests - Gelman-Rubin statistic

Finally, $R=\sqrt{\frac{V}{W}}$.

Note that this calculation can also be used to track convergence of combinations of parameters, or anything else derived from them.
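
As a rough illustration of the definitions of $B$, $W$, $V$ and $R$ above, here is a minimal sketch for a single parameter. The function name `gelman_rubin` and the toy Gaussian "chains" are assumptions made for this example (they are not from the sandbox notebook), and only the Python standard library is used; a NumPy version would be shorter and faster.

```python
import random
from statistics import mean, variance


def gelman_rubin(chains):
    """Gelman-Rubin R for one parameter.

    `chains` is a list of m equal-length lists of samples,
    with burn-in already removed (hypothetical interface).
    """
    m = len(chains)
    n = len(chains[0])
    if m < 2 or n < 2:
        raise ValueError("need at least 2 chains with at least 2 samples each")
    if any(len(c) != n for c in chains):
        raise ValueError("all chains must have the same length")

    chain_means = [mean(c) for c in chains]
    # statistics.variance uses the 1/(m-1) normalization, so this is
    # B = n/(m-1) * sum_j (mean_j - global_mean)^2.
    B = n * variance(chain_means)
    # W is the average of the within-chain sample variances s_j^2.
    W = mean([variance(c) for c in chains])
    V = (n - 1) / n * W + B / n
    return (V / W) ** 0.5


rng = random.Random(0)
# Four toy "chains" drawn independently from the same Gaussian ...
same = [[rng.gauss(0.0, 1.0) for _ in range(2000)] for _ in range(4)]
# ... and four drawn from Gaussians with clearly different means.
shifted = [[rng.gauss(3.0 * j, 1.0) for _ in range(2000)] for j in range(4)]

R_same = gelman_rubin(same)
R_shifted = gelman_rubin(shifted)
print(f"same PDF:       R = {R_same:.3f}")
print(f"different PDFs: R = {R_shifted:.3f}")
```

The first case should give $R$ very close to 1, while the second should give $R$ well above the usual $1.1$ threshold, because the variance between chain averages dominates $V$.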

### Correlation tests
* Inspection! Again, no substitute.
* Autocorrelation of parameters

### Correlation tests - inspection
Do subsequent samples look particularly independent?
<table>
    <tr>
        <td><img src="graphics/mc1_sandbox_a.png" width=100%></td>
    </tr>
</table>

### Correlation tests - autocorrelation
The *autocorrelation* of a sequence, as a function of lag, $k$, is defined as

$\rho_k = \frac{\sum_{i=1}^{n-k}\left(\theta_{i} - \bar{\theta}\right)\left(\theta_{i+k} - \bar{\theta}\right)}{\sum_{i=1}^{n-k}\left(\theta_{i} - \bar{\theta}\right)^2} = \frac{\mathrm{Cov}_i\left(\theta_i,\theta_{i+k}\right)}{\mathrm{Var}(\theta)}$

The larger the lag needed to reach a small autocorrelation, the less informative individual samples are.

The `autocorrelation_plot()` function in `pandas.plotting` may be useful for this.
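
To see the definition of $\rho_k$ in action without any plotting, the sketch below evaluates it directly. The function name `autocorrelation` and the toy sequence are assumptions for illustration: the sequence is an AR(1) process, $\theta_{i+1} = \phi\,\theta_i + \epsilon_i$ with $\phi=0.9$, chosen because its autocorrelation is known to be $\phi^k$, which gives something to compare against. It is not a chain from the sandbox notebook.

```python
import random


def autocorrelation(theta, k):
    """Autocorrelation at lag k, with both sums running over i = 1..n-k."""
    n = len(theta)
    if not 0 <= k < n:
        raise ValueError("lag must satisfy 0 <= k < len(theta)")
    tbar = sum(theta) / n
    dev = [t - tbar for t in theta]
    num = sum(dev[i] * dev[i + k] for i in range(n - k))
    den = sum(dev[i] ** 2 for i in range(n - k))
    return num / den


# Toy correlated sequence (assumption: AR(1) with phi = 0.9, unit Gaussian noise).
phi = 0.9
rng = random.Random(0)
chain = [0.0]
for _ in range(19999):
    chain.append(phi * chain[-1] + rng.gauss(0.0, 1.0))

for k in (0, 1, 5, 10, 20):
    print(f"lag {k:2d}: estimated rho = {autocorrelation(chain, k):+.3f}, "
          f"AR(1) expectation = {phi ** k:.3f}")
```

The estimates should track $\phi^k$ to within sampling noise, falling off slowly with lag, which is the signature of a chain whose neighboring samples carry largely redundant information.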

### Correlation tests
<table>
    <tr>
        <td><img src="graphics/mc1_sandbox_acf-a.png" width=100%></td>
    </tr>
</table>

### Correlation tests
<table>
    <tr>
        <td><img src="graphics/mc1_sandbox_acf-b.png" width=100%></td>
    </tr>
</table>

We would be justified in thinning the chains by a factor of $\sim200$, apparently!
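
One way to arrive at a thinning factor like this numerically, rather than by reading it off a plot, is to look for the first lag at which the autocorrelation becomes small and keep only every that-many-th sample. The sketch below does this under stated assumptions: the helper names `autocorrelation` and `decorrelation_lag`, the threshold of $0.05$, and the toy AR(1) sequence are all choices made for this example, not a standard prescription, and the result should still be checked by eye.

```python
import random


def autocorrelation(theta, k):
    """Autocorrelation at lag k, with both sums running over i = 1..n-k."""
    n = len(theta)
    tbar = sum(theta) / n
    dev = [t - tbar for t in theta]
    num = sum(dev[i] * dev[i + k] for i in range(n - k))
    den = sum(dev[i] ** 2 for i in range(n - k))
    return num / den


def decorrelation_lag(theta, threshold=0.05, max_lag=200):
    """Smallest lag k >= 1 with |rho_k| < threshold, or None if none is found."""
    for k in range(1, max_lag + 1):
        if abs(autocorrelation(theta, k)) < threshold:
            return k
    return None


# Toy correlated sequence (assumption: AR(1) with coefficient 0.9).
rng = random.Random(1)
chain = [0.0]
for _ in range(19999):
    chain.append(0.9 * chain[-1] + rng.gauss(0.0, 1.0))

lag = decorrelation_lag(chain)
thinned = chain[::lag]  # keep every lag-th sample

print(f"thinning factor: {lag}")
print(f"samples kept: {len(thinned)} of {len(chain)}")
print(f"lag-1 autocorrelation before: {autocorrelation(chain, 1):+.3f}")
print(f"lag-1 autocorrelation after:  {autocorrelation(thinned, 1):+.3f}")
```

After thinning, neighboring samples should be nearly uncorrelated, at the cost of keeping far fewer of them.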

## Bonus numerical making-your-life-easier exercise: convergence

Write some code to perform the Gelman-Rubin convergence test. Try it out on

1. multiple chains from the sandbox notebook. Fiddle with the sampler to get chains that do/do not display nice convergence after e.g. 5000 steps.

2. multiple "chains" produced from independent sampling, e.g. from the inverse-transform or rejection examples above or one of the examples in previous chunks.

You'll be expected to test convergence from now on, so having a function to do so will be helpful.

MegaBonus: modify your code to compute the $R$ statistic for the eigenvectors of the covariance of the posterior (yikes). This can be informative when there are strong parameter degeneracies, as in the convergence example above. The eigenvectors can be estimated efficiently from a chain using singular value decomposition.