This notebook is adapted from [a lesson from the 2017 KIPAC/StatisticalMethods course](https://github.com/KIPAC/StatisticalMethods/blob/2017winter/chunks/generative_models.ipynb), (c) 2017 Adam Mantz and Phil Marshall. The GPLv2 license applies.

# Evaluating Models

Goals:

* Be able to design and carry out goodness of fit and model comparison tests

* Understand and be prepared to use the Bayesian Evidence

Three related but distinct questions come under the heading of **model evaluation**.
1. Does a model describe (fit) the data well?
2. Does a model make accurate predictions about new data?
3. Is a model probable in light of the data?

Often (2) and (3) are directly related to **model comparison** or **selection**.

**A familiar example:** imagine we have a data set like this
<img src="../graphics/modelcheck-data.png" width=60%>

Specifically,
* we have precisely known $x$ values
* we have precisely known, Gaussian errors on $y$
* we're fitting a linear model, $\bar{y}(x)=b+mx$

In this case, the likelihood is $\propto e^{-\chi^2/2}$.

So is the posterior, given uniform priors on $m$ and $b$.

Assuming this model is correct, the distribution over data sets of $\hat{\chi}^2$ must follow a $\chi^2_\nu$ distribution, where
* $\hat{\chi}^2$ is the best-fit $\chi^2$ over parameters for a given data set
* the number of degrees of freedom $\nu=N_\mathrm{data}-N_\mathrm{params}$

Hence, the classical $\chi^2$ test looks at whether $\hat{\chi}^2$ is consistent with this distribution. If not, it's unlikely that our data came from the assumed model.

In this case, the value of $\hat{\chi}^2\approx104$ doesn't look good in light of the expectation.
<img src="../graphics/modelcheck-chisq.png" width=50%>
The probability $P(\chi^2\geq\hat{\chi}^2|\nu)$ ($\sim10^{-10}$ in this case) is called the **$p$-value** or **significance**.
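
The classical test above can be sketched in a few lines. This is a minimal illustration on a hypothetical, synthetic data set (not the one in the figure), assuming SciPy is available. The names `x`, `y`, `sigma`, `b_hat` and `m_hat` are placeholders chosen for the example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical data (not the data set in the figure): known x, known Gaussian
# errors on y, and a truth with some curvature that a straight line cannot follow.
x = np.linspace(0.0, 10.0, 20)
sigma = np.full_like(x, 1.0)
y = 1.0 + 0.5 * x + 0.2 * x**2 + rng.normal(0.0, sigma)

# Weighted least squares for y = b + m*x, which maximizes the Gaussian likelihood.
A = np.column_stack([np.ones_like(x), x])
(b_hat, m_hat), *_ = np.linalg.lstsq(A / sigma[:, None], y / sigma, rcond=None)

# Best-fit chi^2 and degrees of freedom, nu = N_data - N_params.
chisq_hat = np.sum(((y - (b_hat + m_hat * x)) / sigma) ** 2)
nu = len(x) - 2

# p-value: P(chi^2 >= chisq_hat | nu), from the chi^2 survival function.
p_value = stats.chi2.sf(chisq_hat, nu)
print(f"chi^2 = {chisq_hat:.1f} for nu = {nu}, p-value = {p_value:.2e}")
```

The survival function is used rather than `1 - cdf` because it stays accurate for very small $p$-values like the one quoted above.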

More generally, our likelihood won't have these nice, analytic properties. We can still construct analogous tests by simulating many mock data sets, each realized from the model with parameters drawn from the posterior distribution. Then
1. Compare the real data directly with the distribution of simulated data, or
2. Compare a summary statistic of the data with the distribution of summary statistics

Either way, there is some freedom to choose what aspect of the data we care about the model fitting well.

This is called **posterior predictive** checking.

Visual comparison of draws from the posterior with the data. (NB: no simulated data here...)
<table>
    <tr>
        <td><img src="../graphics/modelcheck-linear-posterior.png" width=90%></td>
        <td></td>
        <td><img src="../graphics/modelcheck-linear.png" width=90%></td>
    </tr>
</table>

Example test statistic: Pearson Correlation $r_{12}$

A single number we might choose to summarize the data set is

$T(d) = r_{12} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\left[ \sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2 \right]^{1/2}}$

For many posterior samples, we draw a replica dataset from the sampling distribution given the sampled parameter vector, and compute $T(d^{\rm rep})$, building up a histogram of $T$.

${\rm Pr}(T(d^{\rm rep})>T(d)\,|\,d) = 99.43\%$

<img src="../graphics/modelcheck-linear-TS.png" width=50%>
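
A rough sketch of this check follows, on hypothetical synthetic data rather than the data set shown. It relies on the fact that, for a model linear in its parameters with Gaussian errors and uniform priors, the posterior is exactly Gaussian, so posterior samples can be drawn directly without MCMC. All variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: known x, known Gaussian sigma, truth with some curvature.
x = np.linspace(0.0, 10.0, 20)
sigma = np.full_like(x, 1.0)
y = 1.0 + 0.5 * x + 0.2 * x**2 + rng.normal(0.0, sigma)

def pearson_r(x, y):
    """Pearson correlation coefficient, the test statistic T(d)."""
    dx, dy = x - x.mean(), y - y.mean()
    return np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2))

# Gaussian posterior for (b, m): mean = weighted least squares solution,
# covariance = (A^T C^-1 A)^-1 with C = diag(sigma^2).
A = np.column_stack([np.ones_like(x), x])
cov = np.linalg.inv(A.T @ (A / sigma[:, None] ** 2))
mean = cov @ A.T @ (y / sigma**2)
theta = rng.multivariate_normal(mean, cov, size=5000)

# One replica data set per posterior sample, drawn from the sampling distribution.
y_rep = theta @ A.T + rng.normal(0.0, sigma, size=(len(theta), len(x)))

# Distribution of T over replicas, compared with T for the real data.
T_rep = np.array([pearson_r(x, yr) for yr in y_rep])
T_obs = pearson_r(x, y)
p = np.mean(T_rep > T_obs)
print(f"T(d) = {T_obs:.4f}, Pr(T(d_rep) > T(d) | d) = {p:.4f}")
```

A probability very close to 0 or 1 indicates that the real data sit in the tail of what the fitted model predicts for this statistic.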

Test statistics $T(d,\theta)$ that are functions of both the data and the parameters are called **discrepancy measures**.

The log-likelihood is a common example.

Discrepancy measure: $T = \hat{\chi}^2$

${\rm Pr}(T(d^{\rm rep},\theta)>T(d,\theta)\,|\,d) \approx 0.0$

<img src="../graphics/modelcheck-linear-discrepancy.png" width=50%>
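
The same machinery works when $T$ depends on the parameters. The sketch below assumes the discrepancy is $\chi^2(d,\theta)$ evaluated at each posterior sample, which is one reasonable reading of the measure above. It uses hypothetical synthetic data and the exact Gaussian posterior that holds for a linear model with Gaussian errors and uniform priors.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data: known x, known Gaussian sigma, truth with some curvature.
x = np.linspace(0.0, 10.0, 20)
sigma = np.full_like(x, 1.0)
y = 1.0 + 0.5 * x + 0.2 * x**2 + rng.normal(0.0, sigma)

# Exact Gaussian posterior for (b, m) under uniform priors.
A = np.column_stack([np.ones_like(x), x])
cov = np.linalg.inv(A.T @ (A / sigma[:, None] ** 2))
mean = cov @ A.T @ (y / sigma**2)
theta = rng.multivariate_normal(mean, cov, size=5000)

# Model prediction and one replica data set for each posterior sample.
y_model = theta @ A.T
y_rep = y_model + rng.normal(0.0, sigma, size=y_model.shape)

def chisq(data, model):
    """Discrepancy T(d, theta): chi^2 of the data about the model prediction."""
    return np.sum(((data - model) / sigma) ** 2, axis=-1)

# Evaluate the discrepancy for real and replica data at the SAME theta.
T_obs = chisq(y, y_model)
T_rep = chisq(y_rep, y_model)
p = np.mean(T_rep > T_obs)
print(f"Pr(T(d_rep, theta) > T(d, theta) | d) = {p:.4f}")
```

Pairing each replica with the parameter vector that generated it is what distinguishes a discrepancy measure from a test statistic of the data alone.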

Any way we look at it, it's unlikely that we'd conclude the linear model fits these data well. How do we choose an alternative?

One way to compare the fitness of models is to look at question (2) in model evaluation: how accurately do they predict new data?

* Implicitly, this assumes we want a fit that works well with any *potential* data set, rather than just reproducing the one we have.
* In general, this means an "Occam's Razor"-like penalty for complexity should be involved.

In our example, we might add a quadratic term to the model: $y = b + m x + q x^2$. How do we quantify the improvement?

<table><tr>
<td><img src="../graphics/modelcheck-quadratic.png" width=80%></td>
<td><img src="../graphics/modelcheck-quadratic-discrepancy.png" width=80%></td>
</tr></table>

The gold standard for testing predictive accuracy is to get more data.

Short of that, the best option is **cross-validation**: fitting a model on many random subsets of the data and seeing how well it describes the complementary "out of sample" subsets.
* This method is ubiquitous in machine learning, where accurate out-of-sample prediction is usually the goal.
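
As an illustration, cross-validation for the linear and quadratic models might look like the sketch below. The data are synthetic, and the scoring choice (mean $\chi^2$ per held-out point) and the helper name `cv_score` are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical data: known x, known Gaussian sigma, quadratic truth.
x = np.linspace(0.0, 10.0, 40)
sigma = np.full_like(x, 1.0)
y = 1.0 + 0.5 * x + 0.2 * x**2 + rng.normal(0.0, sigma)

def cv_score(degree, n_splits=200, test_fraction=0.25):
    """Mean out-of-sample chi^2 per held-out point, over random splits."""
    n_test = int(test_fraction * len(x))
    scores = []
    for _ in range(n_splits):
        perm = rng.permutation(len(x))
        test, train = perm[:n_test], perm[n_test:]
        # Fit on the training subset only (polyfit weights are 1/sigma).
        coeffs = np.polyfit(x[train], y[train], degree, w=1.0 / sigma[train])
        # Score on the complementary, held-out subset.
        resid = (y[test] - np.polyval(coeffs, x[test])) / sigma[test]
        scores.append(np.mean(resid**2))
    return np.mean(scores)

score_linear = cv_score(1)
score_quadratic = cv_score(2)
print(f"linear: {score_linear:.2f}, quadratic: {score_quadratic:.2f}")
```

A model that predicts held-out data well should score close to 1 per point here. Overly flexible models are penalized automatically, because they fit noise in the training subset that does not recur in the test subset.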

Short of exhaustive cross-validation, a number of **information criteria** exist that (asymptotically) relate to predictive accuracy.

These have the advantage of being relatively quick to calculate from the results of a fit - either an MLE or a set of posterior samples - and include a penalty for models with greater freedom.

Some information criteria
* Akaike information criterion (AIC)
* Deviance information criterion (DIC)
* Watanabe-Akaike information criterion (WAIC)

The DIC has the advantage of being compatible with Bayesian analysis (unlike AIC), and not requiring the data to be cleanly separable into conditionally independent subsets (unlike WAIC).

$\mathrm{DIC} = \langle D(\theta) \rangle + p_D; \quad p_D = \langle D(\theta) \rangle - D(\langle\theta\rangle)$

where $D(\theta)=-2\log P(\mathrm{data}|\theta)$ and averages $\langle\rangle$ are over the posterior.

$p_D$ is an _effective number of free parameters_, i.e. the number of parameters primarily constrained by the data rather than by their priors.

The DIC thus doesn't necessarily count nuisance parameters used to marginalize out systematics as added "complexity".

Note that **lower** IC is preferable (larger likelihood and/or less model complexity).
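
The DIC can be estimated directly from posterior samples. The sketch below uses hypothetical synthetic data and polynomial models, for which the posterior under uniform priors is exactly Gaussian and can be sampled without MCMC. The function name `dic` is illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical data: known x, known Gaussian sigma, quadratic truth.
x = np.linspace(0.0, 10.0, 20)
sigma = np.full_like(x, 1.0)
y = 1.0 + 0.5 * x + 0.2 * x**2 + rng.normal(0.0, sigma)

def dic(degree, n_samples=20000):
    """Return (DIC, p_D) for a polynomial model of the given degree."""
    A = np.vander(x, degree + 1, increasing=True)
    # Exact Gaussian posterior under uniform priors on the coefficients.
    cov = np.linalg.inv(A.T @ (A / sigma[:, None] ** 2))
    mean = cov @ A.T @ (y / sigma**2)
    theta = rng.multivariate_normal(mean, cov, size=n_samples)

    def deviance(t):
        # D(theta) = -2 log P(data | theta) for independent Gaussian errors.
        resid = (y - t @ A.T) / sigma
        return np.sum(resid**2, axis=-1) + np.sum(np.log(2 * np.pi * sigma**2))

    D_mean = deviance(theta).mean()              # <D(theta)>
    p_D = D_mean - deviance(theta.mean(axis=0))  # <D(theta)> - D(<theta>)
    return D_mean + p_D, p_D

dic_linear, pD_linear = dic(1)
dic_quadratic, pD_quadratic = dic(2)
print(f"linear:    DIC = {dic_linear:.1f}, p_D = {pD_linear:.2f}")
print(f"quadratic: DIC = {dic_quadratic:.1f}, p_D = {pD_quadratic:.2f}")
```

With uniform priors every parameter is constrained by the data alone, so $p_D$ should come out close to the actual number of parameters (2 and 3 here).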

A somewhat motivated scale for interpreting differences in IC exists (named for Jeffreys):

| $e^{\Delta\mathrm{IC}}$  | Strength of evidence |
|--------|:------------------------:|
|  < 1   | Negative |
|  1-3   | Barely worth mentioning |
|  3-10  | Substantial |
| 10-30  | Strong |
| 30-100 | Very strong |
|   >100 | Decisive |

# Model Efficiency


* As well as predictive accuracy, the other virtue a model can have is *efficiency*. Typically we are interested in models that both fit the data well, and are also somehow "natural" - that is, not contrived or fine-tuned. 


* Contrived models have high likelihood in only small regions of their parameter spaces - and it turns out such models are penalized automatically by the "Bayesian evidence."

# Bayesian Evidence


* The Evidence, or "fully marginalized likelihood" (FML), is the denominator in Bayes' theorem (previously referred to as "just a normalization constant"), and is an integral over all parameter space:

$\;\;\;\;\;\;\;{\rm Pr}(d\,|\,H) = \int\;{\rm Pr}(d\,|\,\theta,H)\;{\rm Pr}(\theta\,|\,H)\;d\theta$


* The evidence is able to capture both a model's accuracy and its efficiency because it summarizes *all* the information we put into our model inferences, via both the data *and* our prior beliefs. 



# Model Comparison with the Bayesian Evidence

* The evidence for model $H$, ${\rm Pr}(d\,|\,H)$, enables a form of Bayesian hypothesis testing: model comparison with the "evidence ratio" or "Bayes Factor":

$\;\;\;\;\;\;\;R = \frac{{\rm Pr}(d\,|\,H_1)}{{\rm Pr}(d\,|\,H_0)}$


* This quantity is similar to a likelihood ratio, but it's a *fully marginalized likelihood ratio* - which is to say that it *takes into account our uncertainty about values of the parameters of each model by integrating over all plausible values of them.*

# Exercise: Evidence illustration 

In a 1D inference problem you have a Gaussian likelihood $L(\theta)$ that has peak value $L_{max}$, and a uniform prior of width $\Delta \theta$.

Sketch this situation, and provide a graphical interpretation (i.e. in terms of areas on your graph) of the Evidence integral $E$.

Make some more sketches for different choices of prior location and width, and comment on the relative values of the Evidence.

`# %load solutions/modelevaluation_exercise2.py`

# Evidence illustration: notes

1) The evidence can be made arbitrarily small by increasing the prior volume: the evidence is more conservative than focusing on the goodness of fit ($L_{\rm max}$) alone.  

2) The evidence is linearly sensitive to prior volume ($f$), but exponentially sensitive to goodness of fit ($L_{\rm max} \propto e^{-\chi^2_{\rm min}/2}$). It's still a likelihood, after all.

> If you assign a prior you don't believe, then you may not get out a meaningful value for ${\rm Pr}(d\,|\,H)$; however, the ratio $R$ can still be useful to think about even with arbitrary choices of prior.
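
A quick numerical check of note (1), under assumed values: a Gaussian likelihood with $L_{\rm max}=1$ and width $\sigma_L=1$ (both illustrative choices), and a uniform prior of width $\Delta\theta$ centered on the peak. When the prior is much wider than the likelihood, the evidence integral approaches $L_{\rm max}\,\sigma_L\sqrt{2\pi}/\Delta\theta$.

```python
import math
import numpy as np

# Assumed, illustrative likelihood parameters.
L_max, theta0, sigma_L = 1.0, 0.0, 1.0

def likelihood(theta):
    """Gaussian likelihood with peak value L_max at theta0."""
    return L_max * np.exp(-0.5 * ((theta - theta0) / sigma_L) ** 2)

def evidence(prior_width, prior_center=0.0, n_grid=20001):
    """Integrate likelihood * uniform prior over the prior range (trapezoid rule)."""
    half = prior_width / 2
    theta = np.linspace(prior_center - half, prior_center + half, n_grid)
    integrand = likelihood(theta) / prior_width  # prior density is 1/width
    return np.sum(0.5 * (integrand[1:] + integrand[:-1]) * np.diff(theta))

for width in (10.0, 100.0, 1000.0):
    approx = L_max * sigma_L * math.sqrt(2 * math.pi) / width
    print(f"width = {width:6.0f}: E = {evidence(width):.3e} (wide-prior limit {approx:.3e})")
```

Each tenfold increase in prior width lowers the evidence by a factor of ten, while $L_{\rm max}$ is unchanged. Moving `prior_center` away from the peak shows how a prior that misses the likelihood also reduces the evidence.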


# Model probabilities

The evidence ratio can, in principle, be combined with the ratio of priors for each model to give us the relative probability for each model being true, given the data:

$\frac{{\rm Pr}(H_1|d)}{{\rm Pr}(H_0|d)} = \frac{{\rm Pr}(d|H_1)}{{\rm Pr}(d|H_0)} \; \frac{{\rm Pr}(H_1)}{{\rm Pr}(H_0)}$


Prior probabilities for models are very difficult to assign in most practical problems (notice that no theorist ever provides them). 

# Model probabilities

One practical way to interpret the evidence ratio is to note that the evidence ratio updates the prior ratio into a posterior one. This means that:

  * if you think that, having seen the data, the two models are *still equally probable,*
  
  * then the evidence ratio in favor of $H_1$ is _the odds that you would have had to have been willing to take against $H_1$, before seeing the data._

# Example Evidence

* In the example above, we can compute the evidence for the linear and quadratic models, and form the odds ratio $R$.

```
log Evidence for Straight Line Model: -157.2
log Evidence for Quadratic Model: -120.7
Evidence ratio in favour of the Quadratic Model: 7e15 to 1
```

* The 36.5 unit difference in log evidence between the two models translates to a _huge_ odds ratio in favour of the quadratic model.
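
The arithmetic connecting the quoted log evidences to the evidence ratio can be checked directly, assuming the values are natural logarithms (which is consistent with the quoted ratio):

```python
import math

# Log evidences quoted above (assumed to be natural logs).
log_E_line = -157.2
log_E_quadratic = -120.7

# The evidence ratio is the exponential of the difference in log evidence.
delta_log_E = log_E_quadratic - log_E_line
R = math.exp(delta_log_E)
print(f"Delta log E = {delta_log_E:.1f}, R = {R:.1e}")
```

Working with differences of log evidences, and exponentiating only at the end, avoids underflow: the individual evidences here are of order $e^{-150}$.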

# Notes on the Evidence

* The Bayesian evidence is *qualitatively different* from classical model assessments. While classical assessments focus primarily on *prediction accuracy,* the evidence is the way in which information from the prior PDF propagates through into our posterior beliefs about the model as a whole.


* There are no inherent mathematical limitations to its use, in contrast to various other hypothesis tests that are only valid under certain assumptions (such as the models being nested, e.g. the classical $F$ test for comparing $\chi^2$ values). Any two models can be compared and the odds ratio computed.

# Calculating the Evidence

* The FML is in general quite difficult to calculate, since it involves averaging the likelihood over the prior. MCMC methods aimed at parameter fitting usually give us samples from the posterior - and these cannot, it turns out, be reprocessed so as to estimate the evidence stably.


* A number of sampling algorithms have been developed that *do* calculate the evidence, or provide the means to do so, during the process of sampling. These include:

  * Nested Sampling, Population Monte Carlo
  * Parallel Tempering, Thermodynamic Integration
  * ...

# Exercise: Evidence Estimators


  1) Write down the Evidence integral and approximate it as a sum over prior samples. What problems do you foresee with this approach to estimating the Evidence?

  2) Consider the ratio $P(\theta|d,H) / P(d|\theta,H)$ and its integral over all parameters $\theta$, and derive a second sample-based approximation for the Evidence. What problems do you foresee with this approach?

The [straight line example notebook](../examples/StraightLine/ModelEvaluation.ipynb) shows method (1) applied to our straight line example.
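
For orientation, here is a minimal sketch of method (1) on a deliberately simple toy problem, not the straight line example: inferring the mean `mu` of Gaussian data with known `sigma`, under a uniform prior on an assumed range. The toy problem, the prior range and all names are assumptions for illustration. A brute-force grid integral provides a reference value, which is only feasible because the problem is 1D.

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy problem: data with unknown mean mu, known sigma, uniform prior on [lo, hi].
sigma = 1.0
y = rng.normal(2.0, sigma, size=10)
lo, hi = -10.0, 10.0

def log_like(mu):
    """Gaussian log-likelihood for an array of candidate mu values."""
    mu = np.atleast_1d(mu)
    resid = (y[None, :] - mu[:, None]) / sigma
    return -0.5 * np.sum(resid**2, axis=1) - 0.5 * len(y) * np.log(2 * np.pi * sigma**2)

# Method (1): average the likelihood over samples drawn from the prior.
# Subtracting the maximum before exponentiating avoids numerical underflow.
mu_prior = rng.uniform(lo, hi, size=100_000)
log_L = log_like(mu_prior)
log_E_mc = log_L.max() + np.log(np.mean(np.exp(log_L - log_L.max())))

# Reference: trapezoid-rule integral of likelihood * prior on a fine grid.
mu_grid = np.linspace(lo, hi, 200_001)
integrand = np.exp(log_like(mu_grid)) / (hi - lo)
log_E_grid = np.log(np.sum(0.5 * (integrand[1:] + integrand[:-1]) * np.diff(mu_grid)))

print(f"log E (prior samples) = {log_E_mc:.3f}, log E (grid) = {log_E_grid:.3f}")
```

Try widening the prior range or adding data points, and watch how many of the prior samples contribute meaningfully to the average.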

# Information Criteria

A number of "information criteria" exist in the statistical literature, as *easier-to-calculate alternatives* to the Bayesian Evidence. Most have the form:

$IC(H) = D_{\max} + C; \quad D_\mathrm{max}=-2\log L_{\rm max}$

where $D=-2\log L$ is a common discrepancy (or deviance) measure.

* Here, the $L_{\rm max}$ term captures the goodness of fit of the model $H$, while the constant $C$ increases with model complexity. 

* Models with *low* $IC$ are preferred - for having high goodness of fit at low model complexity. 

# Information Criteria

Examples are
* Bayesian Information Criterion (BIC), where $C = K \log N$
* corrected Akaike Information Criterion (AICc), where $C = 2K + 2K(K+1)/(N-K-1)$

where $N$ is the number of data points and $K$ is the number of free parameters in a model.
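
Both criteria follow directly from a maximum likelihood fit. The sketch below evaluates them for linear and quadratic fits to hypothetical synthetic data with known Gaussian errors. The helper name `information_criteria` is illustrative.

```python
import numpy as np

rng = np.random.default_rng(6)

# Hypothetical data: known x, known Gaussian sigma, quadratic truth.
x = np.linspace(0.0, 10.0, 20)
sigma = np.full_like(x, 1.0)
y = 1.0 + 0.5 * x + 0.2 * x**2 + rng.normal(0.0, sigma)

def information_criteria(degree):
    """Return (BIC, AICc) for a polynomial fit of the given degree."""
    # Maximum likelihood fit (polyfit weights are 1/sigma).
    coeffs = np.polyfit(x, y, degree, w=1.0 / sigma)
    chisq = np.sum(((y - np.polyval(coeffs, x)) / sigma) ** 2)
    # -2 log L_max for independent Gaussian errors.
    D = chisq + np.sum(np.log(2 * np.pi * sigma**2))
    K, N = degree + 1, len(x)
    bic = D + K * np.log(N)
    aicc = D + 2 * K + 2 * K * (K + 1) / (N - K - 1)
    return bic, aicc

bic_linear, aicc_linear = information_criteria(1)
bic_quadratic, aicc_quadratic = information_criteria(2)
print(f"linear:    BIC = {bic_linear:.1f}, AICc = {aicc_linear:.1f}")
print(f"quadratic: BIC = {bic_quadratic:.1f}, AICc = {aicc_quadratic:.1f}")
```

The normalization term in $-2\log L_{\rm max}$ is the same for both models here, so only the $\chi^2$ difference and the complexity penalty affect the comparison.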

# Information Criteria

A more Bayes-friendly option is the Deviance Information Criterion, which incorporates averages over the posterior:

$\mathrm{DIC}(H) = \langle D(\theta) \rangle + p_D; \quad p_D = \langle D(\theta) \rangle - D(\langle\theta\rangle)$

where $D(\theta)=-2\log L(\theta)$.

$p_D$ is an _effective number of free parameters_, i.e. the number of parameters primarily constrained by the data rather than by their priors.

The DIC thus doesn't count nuisance parameters used to marginalize out systematics as "complexity", an ambiguous case for the other ICs.

# Information Criteria


* BIC is supposed to approximate $-2$ times the log evidence (in the limit of good data, linear model, and uniform priors).


* Differences in AIC approximate the relative _information loss_ when using the model to approximate the actual data generator. AIC focuses more on model accuracy than BIC. 


* DIC aims to minimize the discrepancy between a fitted model (accounting for its posterior uncertainty) and future data.

# Information Criteria


* Differences in IC are not easy to interpret in general, but tables similar to the Jeffreys scale exist. Testing against realistic simulated data on a case-by-case basis will probably provide the best guidance.

# Things to keep in mind

Here are some good things to keep in mind when carrying out, or reading about, model comparison: 


  * The evidence is only linearly sensitive to prior volume, but exponentially sensitive to goodness of fit. The best way to determine which model makes the data more probable is to *get better data.*
  
  
  * This piece of common sense is reflected by both the probability theory and the behaviour of scientists: in practice, in the low signal to noise regime most model comparison proceeds via _discussion in the literature_.

# Things to keep in mind

  * If you don't believe your priors, or indeed your models, then it may not be very meaningful to compare evidences across models: it can be a distraction from things you care about more - such as accuracy, or cost. 
  
  
  * A section of the Bayesian astronomy community holds this pragmatic view.

# Things to keep in mind

  * An oft-used phrase in evidence discussions is "garbage in, garbage out." 
  
  
  * However: the evidence is the $K$-dimensional marginalization integral of the likelihood over the prior, while most of the numerical results you put in your abstracts are $(K-1)$-dimensional marginalization integrals. _The degree to which you trust your evidence values should be commensurate with the degree to which you trust your other inferences._ 
  

# Things to keep in mind
  
 
  
  * The evidence appears in *second-level inferences*, when parts of your model that you previously considered to be constant now need to be varied and inferred. The FML can be considered the likelihood for a *model* (as opposed to for specific parameter values).

In fact...

## Model Selection by MCMC

There exists an even more general form of Metropolis-Hastings sampling, called **Metropolis-Hastings-Green**. Among other things, M-H-G allows a chain to move among models with different dimensionality.

This means that, in principle, we can marginalize over uncertainty about what model is correct in exactly the same way that we marginalize over possible parameter values.

# Model Evaluation Summary

* It is necessary but not sufficient to write down our model assumptions: we must also test them.

* Model checking for prediction accuracy (or "goodness of fit") is best first done _visually_, in _data space_.

* Posterior predictive checks, using well-designed test statistics and discrepancy measures, can then quantify the inaccuracy.

* The next step is typically model expansion: improving the model's ability to fit the data by changing its form, perhaps adding more parameters. 

* Comparing alternative models against each other can be done just with accuracy (via the check results), or with efficiency considerations as well.