# Parameter inference and model comparison

These notes are based on Chapter 28 of David MacKay's superb book **[Information Theory, Inference and Learning Algorithms](http://www.inference.phy.cam.ac.uk/itila/book.html)**. 

## 1. The Bayesian workflow

  1. Assume that a particular model is true and **fit it to the data** i.e. use some method to estimate marginal posteriors for poorly constrained parameters. <br><br>
  
  2. Repeat step 1 for any other models that are under consideration. <br><br>
  
  3. Perform **model comparison** i.e. assign relative probabilities to the various alternative models. Note: this does not imply actually picking a single "best" model - for prediction, the Bayesian approach can incorporate output from multiple models. <br><br>
  
  4. Make predictions by summing over the outputs of all available models, weighted by their probabilities.

These steps are considered in turn below. As a reminder, Bayes' equation consists of four main parts

$$\text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{normalising constant}}$$

Recall also that the normalising constant is sometimes referred to as the "**probability of the data**" or the "**evidence**".

### 1.1. Parameter inference

If we assume that a particular model, $M_i$, is correct, we can write Bayes' equation as

$$P(\theta|D, M_i) = \frac{P(D|\theta, M_i)P(\theta|M_i)}{P(D|M_i)}$$

where $\theta$ is the vector of parameters for model $M_i$ and $D$ is the data (i.e. the observations) used for calibration.

If all of the model parameters, $\theta$, are continuous variables, we can re-write the denominator in this equation as

$$P(D|M_i) = \int_\theta{P(D|\theta, M_i)P(\theta|M_i) d\theta}$$

See the end of [notebook 1](http://nbviewer.ipython.org/github/JamesSample/enviro_mod_notes/blob/master/notebooks/01_Distributions.ipynb#1.4.-Marginal-and-conditional-distributions) for an explanation of where these equations come from.
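As a concrete illustration of the evidence integral, the sketch below evaluates it on a grid for a toy problem that is not taken from MacKay's chapter: a coin with unknown bias $\theta$, a uniform prior on $[0, 1]$ and a binomial likelihood. The data (`n_heads` out of `n_flips`) and all variable names are assumptions made purely for illustration. For this particular choice the integral has a known closed form, $1/(n+1)$ for $n$ flips, which gives a handy check on the numerics.

```python
from math import comb

import numpy as np

# Illustrative data (an assumption): 14 heads in 20 flips
n_flips, n_heads = 20, 14

# Grid over the single parameter, theta = P(heads)
theta = np.linspace(0.0, 1.0, 10001)
d_theta = theta[1] - theta[0]

# Likelihood P(D | theta, M) and uniform prior P(theta | M) on the grid
likelihood = comb(n_flips, n_heads) * theta**n_heads * (1.0 - theta)**(n_flips - n_heads)
prior = np.ones_like(theta)
integrand = likelihood * prior

# Evidence P(D | M): trapezoid-rule approximation to the integral over theta
evidence = d_theta * (integrand.sum() - 0.5 * (integrand[0] + integrand[-1]))

# Dividing by the evidence gives a properly normalised posterior density
posterior = integrand / evidence

print("Numerical evidence:", evidence)
print("Closed form 1/(n+1):", 1.0 / (n_flips + 1))
```

Grid integration like this is only practical for models with one or two parameters, because the number of grid points grows exponentially with the number of dimensions.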

To calibrate a single model to a dataset, we are primarily interested in finding two things:

  1. Some kind of **central estimate** for an appropriate parameter set (e.g. the median, mean or "best" parameter set), and <br><br>
  
  2. An indication of the **uncertainty** associated with the estimate for each parameter.
  
There are many ways of achieving these goals, depending on the complexity of the problem. 

#### 1.1.1. Gaussian approximation

For the simplest possible example of a one parameter model with an approximately Gaussian posterior distribution, we could simply report the **mean** and the **variance** to achieve the aims listed above. Most real problems are more complex, but for some multi-parameter models it may still be reasonable to assume an approximately Gaussian posterior in the region around the most promising parameter set. In this case, the two aims listed above can be achieved by:

  1. Finding the **[Maximum a posteriori (MAP)](https://en.wikipedia.org/wiki/Maximum_a_posteriori_estimation)** estimate of the parameters. This is just the location of the maximum of the posterior distribution, so finding it is an optimisation problem (see examples in notebooks 2 and 6). <br><br>
  
  2. Obtaining **confidence intervals** by considering the **local curvature** of the posterior distribution in the region surrounding the MAP. This can be done by evaluating the **[Hessian](https://en.wikipedia.org/wiki/Hessian_matrix)** of the negative log posterior at the MAP: its inverse provides an estimate of the parameter **covariance matrix**.
  
  As long as the posterior can be reasonably approximated by a multi-dimensional Gaussian in the vicinity of the MAP, reporting the **location of the MAP** and the **parameter covariance matrix** is simply the multi-dimensional equivalent of reporting the mean and variance in the 1D example mentioned above.
  
Note that, as the amount of data collected increases, the Gaussian approximation can often be surprisingly effective.
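The two steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code from notebooks 2 or 6: the straight-line model, the synthetic data, the known noise level `sigma`, the Gaussian priors and the helper `numerical_hessian` are all hypothetical choices made for the example. `scipy.optimize.minimize` is used for the optimisation and the Hessian is estimated with central finite differences.

```python
import numpy as np
from scipy.optimize import minimize

# Synthetic data from a straight line (illustrative assumption)
rng = np.random.default_rng(42)
x_obs = np.linspace(0.0, 10.0, 50)
sigma = 0.5       # Known noise standard deviation
prior_sd = 10.0   # Standard deviation of the zero-mean Gaussian priors
y_obs = 1.0 + 2.0 * x_obs + rng.normal(0.0, sigma, x_obs.size)


def neg_log_posterior(theta):
    """Negative log posterior, up to an additive constant."""
    intercept, slope = theta
    resid = y_obs - (intercept + slope * x_obs)
    neg_log_like = 0.5 * np.sum(resid**2) / sigma**2
    neg_log_prior = 0.5 * (intercept**2 + slope**2) / prior_sd**2
    return neg_log_like + neg_log_prior


def numerical_hessian(f, x, eps=1e-3):
    """Central finite-difference estimate of the Hessian of f at x."""
    x = np.asarray(x, dtype=float)
    k = x.size
    hess = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            e_i = np.zeros(k)
            e_j = np.zeros(k)
            e_i[i] = eps
            e_j[j] = eps
            hess[i, j] = (f(x + e_i + e_j) - f(x + e_i - e_j)
                          - f(x - e_i + e_j) + f(x - e_i - e_j)) / (4.0 * eps**2)
    return hess


# Step 1: the MAP is the minimum of the negative log posterior
result = minimize(neg_log_posterior, x0=[0.0, 0.0])
theta_map = result.x

# Step 2: the inverse Hessian at the MAP approximates the covariance matrix
hessian = numerical_hessian(neg_log_posterior, theta_map)
covariance = np.linalg.inv(hessian)
std_devs = np.sqrt(np.diag(covariance))

print("MAP estimate (intercept, slope):", theta_map)
print("Approximate posterior standard deviations:", std_devs)
```

Because this toy posterior is exactly Gaussian, the MAP and covariance matrix describe it completely. For a real model they are only a local approximation, so it is worth checking the result against a sampled posterior where possible.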

#### 1.1.2. Monte Carlo sampling

If the posterior is *not* well approximated by a Gaussian and the parameter space is not too large, it may be possible to use a simple Monte Carlo approach to generate a sample from the posterior (see [notebook 3](http://nbviewer.ipython.org/github/JamesSample/enviro_mod_notes/blob/master/notebooks/03_Monte_Carlo.ipynb)). **Marginal posteriors** can then be approximated by histograms, and either reported in their entirety (the "real" Bayesian approach) or summarised using appropriate statistics to provide central estimates and confidence intervals.
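The sketch below shows one simple Monte Carlo scheme, rejection sampling from the prior. It is not necessarily the method used in notebook 3, and the coin-flip data, sample size and variable names are assumptions chosen for illustration. Candidate values are drawn from a uniform prior and each is kept with probability proportional to its likelihood, so the retained values form a sample from the posterior.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative data (an assumption): 14 heads in 20 flips
n_flips, n_heads = 20, 14


def likelihood(theta):
    """Binomial likelihood, omitting the constant binomial coefficient."""
    return theta**n_heads * (1.0 - theta)**(n_flips - n_heads)


# Draw candidates from the uniform prior on [0, 1]
n_draws = 200_000
candidates = rng.uniform(0.0, 1.0, n_draws)

# Accept each candidate with probability likelihood / max(likelihood).
# The likelihood peaks at theta = n_heads / n_flips.
like_max = likelihood(n_heads / n_flips)
accepted = rng.uniform(0.0, 1.0, n_draws) < likelihood(candidates) / like_max
posterior_sample = candidates[accepted]

# Approximate the marginal posterior with a histogram and summarise it
density, bin_edges = np.histogram(posterior_sample, bins=50, range=(0.0, 1.0), density=True)
median = np.median(posterior_sample)
lower, upper = np.percentile(posterior_sample, [2.5, 97.5])

print("Accepted draws:", posterior_sample.size)
print("Median:", median)
print("Central 95% interval:", (lower, upper))
```

The acceptance rate falls quickly as the posterior becomes narrow relative to the prior or as parameters are added, which is why this kind of approach only suits small parameter spaces.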

#### 1.1.3. Markov chain Monte Carlo (MCMC) sampling

If the parameter space is large, simple Monte Carlo methods become inefficient. MCMC algorithms offer much more effective ways of searching large, multi-dimensional parameter spaces. See notebooks [4](http://nbviewer.ipython.org/github/JamesSample/enviro_mod_notes/blob/master/notebooks/04_MCMC.ipynb) and [6](http://nbviewer.ipython.org/github/JamesSample/enviro_mod_notes/blob/master/notebooks/06_Beyond_Metropolis.ipynb) for an overview of some possible algorithms.

Having generated a posterior sample, the procedure is exactly the same as for Monte Carlo methods: the sample is used to generate **marginal posterior histograms**, which provide information about the parameters and their uncertainty.
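For orientation, here is a minimal random-walk Metropolis sketch. It is deliberately applied to a one-parameter toy problem so that the result is easy to check, and it is not the implementation from notebooks 4 or 6. The coin-flip data, the proposal width `step_sd`, the chain length and the burn-in are all assumptions. Only the unnormalised log posterior is needed, because the normalising constant cancels in the acceptance ratio.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative data (an assumption): 14 heads in 20 flips, uniform prior on theta
n_flips, n_heads = 20, 14


def log_posterior(theta):
    """Unnormalised log posterior for the coin bias theta."""
    if not 0.0 < theta < 1.0:
        return -np.inf  # Zero prior probability outside (0, 1)
    return n_heads * np.log(theta) + (n_flips - n_heads) * np.log(1.0 - theta)


n_steps, step_sd = 20_000, 0.1
chain = np.empty(n_steps)
current = 0.5
current_lp = log_posterior(current)

for i in range(n_steps):
    # Symmetric Gaussian proposal centred on the current position
    proposal = current + rng.normal(0.0, step_sd)
    proposal_lp = log_posterior(proposal)

    # Metropolis rule: accept with probability min(1, posterior ratio)
    if np.log(rng.uniform()) < proposal_lp - current_lp:
        current, current_lp = proposal, proposal_lp
    chain[i] = current

# Discard the start of the chain as burn-in, then summarise as before
posterior_sample = chain[2_000:]
density, bin_edges = np.histogram(posterior_sample, bins=50, range=(0.0, 1.0), density=True)

print("Posterior mean:", posterior_sample.mean())
print("Central 95% interval:", np.percentile(posterior_sample, [2.5, 97.5]))
```

The same loop works for a vector of parameters if `current` and the proposal are arrays, although tuning the proposal then becomes the main practical difficulty.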

### 1.2. Model comparison

For a discrete set of models, $M_i$, we can write another version of Bayes' equation

$$P(M_i|D) = \frac{P(D|M_i)P(M_i)}{P(D)}$$

In this case, because the set of models is discrete, we can re-write the denominator as a sum, rather than an integral

$$P(D) = \sum_iP(D|M_i)P(M_i)$$

However, because $P(D)$ is the same for every model, it is common to drop this normalising constant and just write

$$P(M_i|D) \propto P(D|M_i)P(M_i)$$

The prior probability for each model, $P(M_i)$, represents our initial belief about how likely each model is to be correct. If we have no reason to prefer some models over others, $P(M_i)$ will be the same for all models. The likelihood term, $P(D|M_i)$, represents what the data can tell us about the probability of $M_i$ being the correct model. Note from section 1.1 that this term is the **evidence** (i.e. the denominator) in the form of Bayes' equation used above for parameter inference. This is generally the case: **the "evidence" from the parameter inference stage becomes a crucial component of Bayes' equation at the model comparison stage**.

In previous notebooks, we have generally ignored the normalising constant in Bayes' equation when performing **parameter inference**. This is fine as long as you're only interested in evaluating parameter-related uncertainty for a **single model**. However, as you can see, it is necessary to include it if you're interested in comparing multiple models.
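Once an evidence value is available for each model, turning it into posterior model probabilities is a short calculation. The sketch below assumes three hypothetical models with made-up log evidence values (they are placeholders, not results from any real analysis) and equal prior probabilities. Working with logarithms and `scipy.special.logsumexp` avoids numerical underflow, since evidence values are often extremely small.

```python
import numpy as np
from scipy.special import logsumexp

# Hypothetical log evidence values, ln P(D | M_i), for three models
log_evidence = np.array([-104.2, -101.7, -103.0])

# Equal prior probabilities, P(M_i), for the three models
model_prior = np.full(3, 1.0 / 3.0)

# ln[P(D | M_i) P(M_i)], the unnormalised log posterior for each model
log_unnorm = log_evidence + np.log(model_prior)

# Normalise: subtracting logsumexp is equivalent to dividing by P(D)
model_posterior = np.exp(log_unnorm - logsumexp(log_unnorm))

for i, p in enumerate(model_posterior, start=1):
    print(f"P(M_{i} | D) = {p:.3f}")
```

With equal priors, the ratio of any two posterior probabilities is just the ratio of the corresponding evidence values.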

#### 1.2.1. Occam's razor

**[Occam's razor](https://en.wikipedia.org/wiki/Occam%27s_razor)** embodies the intuitive principle that, if two models provide equally good explanations of the data, the simpler model should be preferred. This principle is automatically incorporated into Bayesian model comparison in a quantitative way, which is extremely useful. It arises because simple models have less flexibility and therefore make a narrower range of predictions than complex models. Because the total volume under the probability distribution $P(D|M_i)$ (taken over all possible datasets, $D$) must equal $1$, the probability density for more complex models tends to be more "spread out" (and therefore lower) than for simple models. *If* a simple model manages to match the data, it is therefore likely to have a higher concentration of probability density in the region of the observations than a more complex model with a wider range of output.
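A small worked example may help make this concrete. It is an illustration chosen for these notes rather than one taken from MacKay's chapter, and the function names are hypothetical. Suppose a coin is flipped $n$ times and a particular sequence containing $k$ heads is observed. Model $M_1$ says the coin is fair (no free parameters), while the more flexible model $M_2$ gives the coin an unknown bias with a uniform prior. Both evidence values are available in closed form: $P(D|M_1) = 0.5^n$ and $P(D|M_2) = 1/[(n+1)\binom{n}{k}]$.

```python
from math import comb


def evidence_fair(n_heads, n_flips):
    """P(D | M1): a fair coin gives every specific sequence probability 0.5**n."""
    return 0.5 ** n_flips


def evidence_unknown_bias(n_heads, n_flips):
    """P(D | M2): integral of theta^k (1 - theta)^(n - k) over a uniform prior."""
    return 1.0 / ((n_flips + 1) * comb(n_flips, n_heads))


def bayes_factor(n_heads, n_flips):
    """Evidence ratio P(D | M1) / P(D | M2); values above 1 favour the fair coin."""
    return evidence_fair(n_heads, n_flips) / evidence_unknown_bias(n_heads, n_flips)


# Data that a fair coin explains well: the simpler model is preferred
print("10 heads in 20 flips:", bayes_factor(10, 20))

# Data that a fair coin explains badly: the flexible model is preferred
print("18 heads in 20 flips:", bayes_factor(18, 20))
```

With 10 heads in 20 flips, both models can fit the data, but the evidence ratio is about 3.7 in favour of the fair coin, because $M_2$ spreads its probability over many outcomes that were not observed. With 18 heads in 20 flips the ratio drops well below 1, and the extra flexibility of $M_2$ is justified by the data.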

#### 1.2.2. Gaussian approximation

If the posterior can be well approximated by a Gaussian in the vicinity of the MAP (which, perhaps surprisingly, is often the case given enough data), then the **evidence** for model $M_i$, $P(D|M_i)$, can be estimated using the **best-fit likelihood** and **prior** and the **determinant** of the corresponding covariance matrix (which is obtained from the **Hessian**). See [equation 28.10](http://www.inference.phy.cam.ac.uk/itprnn/book.pdf) in David MacKay's book for further details.
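For reference, and using the notation of these notes rather than MacKay's own symbols, the approximation can be written as

$$P(D|M_i) \simeq P(D|\theta_{MAP}, M_i) \, P(\theta_{MAP}|M_i) \, \det{}^{-\frac{1}{2}}\left(\frac{A}{2\pi}\right)$$

where $A = -\nabla\nabla \ln P(\theta|D, M_i)$ is the Hessian of the negative log posterior, evaluated at $\theta_{MAP}$. For a model with $k$ parameters the final factor equals $(2\pi)^{k/2} \det(A)^{-1/2}$, so in logarithmic form (which is usually more convenient numerically)

$$\ln P(D|M_i) \simeq \ln P(D|\theta_{MAP}, M_i) + \ln P(\theta_{MAP}|M_i) + \frac{k}{2}\ln 2\pi - \frac{1}{2}\ln \det A$$

The first term rewards a good fit to the data. The remaining terms together form the "Occam factor", which penalises models whose posterior occupies only a small fraction of the prior parameter volume.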

For the case where the Gaussian approximation is valid, a full Bayesian model evaluation can therefore be performed in a surprisingly efficient way, without requiring any complicated and computationally intensive numerical sampling (e.g. MCMC). First, for each model:

  1. Find the MAP. <br><br>
  2. Use the Hessian evaluated at the MAP to find the parameter covariance matrix.
 
These two steps provide, for each model, an estimate of the "best" parameter set and an indication of parameter-related uncertainty. Then, for model comparison:

  1. Estimate the **evidence** for each model using the best-fit likelihood, $P(D|\theta_{MAP}, M_i)$, the prior evaluated at the best-fit parameters, $P(\theta_{MAP}|M_i)$, and the **determinant** of the Hessian (from above). <br><br>
  2. Multiply the evidence by the prior for each model and use the result to rank the models according to their probability. Predictions can be made by calculating a weighted sum of model predictions, with the weights based on the posterior probability of each model.
 
Note that this workflow uses only a single **optimisation** step per model - no complex sampling algorithms are required. As long as the Gaussian approximation is valid, it is therefore an extremely efficient way to perform model calibration, uncertainty analysis and comparison.
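The whole workflow can be sketched end to end as follows. This is an illustration under stated assumptions rather than a definitive implementation: the synthetic data, the two candidate models (a constant and a straight line), the known noise level, the Gaussian priors and the helper names `design_matrix`, `log_joint`, `numerical_hessian` and `laplace_log_evidence` are all hypothetical. Note that, unlike in parameter inference, the likelihood and prior must be properly normalised here, because their absolute values enter the evidence.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Synthetic data from a straight line (illustrative assumption)
rng = np.random.default_rng(0)
x_obs = np.linspace(0.0, 10.0, 30)
sigma = 1.0       # Known noise standard deviation
prior_sd = 10.0   # Standard deviation of the zero-mean Gaussian priors
y_obs = 1.0 + 0.5 * x_obs + rng.normal(0.0, sigma, x_obs.size)


def design_matrix(n_params):
    """Columns 1, x, ...: n_params=1 is a constant model, n_params=2 a line."""
    return np.vander(x_obs, n_params, increasing=True)


def log_joint(theta, X):
    """ln[P(D | theta, M) P(theta | M)], with both densities normalised."""
    log_like = norm.logpdf(y_obs, loc=X @ theta, scale=sigma).sum()
    log_prior = norm.logpdf(theta, loc=0.0, scale=prior_sd).sum()
    return log_like + log_prior


def numerical_hessian(f, x, eps=1e-3):
    """Central finite-difference estimate of the Hessian of f at x."""
    k = x.size
    hess = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            e_i = np.zeros(k)
            e_j = np.zeros(k)
            e_i[i] = eps
            e_j[j] = eps
            hess[i, j] = (f(x + e_i + e_j) - f(x + e_i - e_j)
                          - f(x - e_i + e_j) + f(x - e_i - e_j)) / (4.0 * eps**2)
    return hess


def laplace_log_evidence(n_params):
    """Return (log evidence, MAP, covariance) under the Gaussian approximation."""
    X = design_matrix(n_params)

    def objective(theta):
        return -log_joint(theta, X)

    # Find the MAP, then the Hessian of the negative log posterior at the MAP
    theta_map = minimize(objective, x0=np.zeros(n_params)).x
    hessian = numerical_hessian(objective, theta_map)
    _, log_det = np.linalg.slogdet(hessian)

    # ln P(D|M) ~ ln[likelihood * prior at MAP] + (k/2) ln(2 pi) - (1/2) ln det A
    log_ev = log_joint(theta_map, X) + 0.5 * n_params * np.log(2.0 * np.pi) - 0.5 * log_det
    return log_ev, theta_map, np.linalg.inv(hessian)


# Model comparison: equal prior probabilities for the two models
log_evidence = np.array([laplace_log_evidence(k)[0] for k in (1, 2)])
model_prior = np.array([0.5, 0.5])
log_unnorm = log_evidence + np.log(model_prior)
model_posterior = np.exp(log_unnorm - log_unnorm.max())
model_posterior /= model_posterior.sum()

print("Log evidence (constant, line):", log_evidence)
print("Posterior model probabilities:", model_posterior)
```

For these linear models with Gaussian noise and Gaussian priors the posterior is exactly Gaussian, so the approximation reproduces the true evidence. For non-linear models the result is only as good as the Gaussian approximation itself.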