# Evaluating Models

Goals:

* Be able to design and carry out a posterior predictive model check, and describe its relationship to a frequentist significance test

* Recognize the Bayesian Evidence, know how it is computed, and describe the difference between a Bayes Factor and a likelihood ratio 

# Further reading

* MacKay, Chapters 3 and 28

* Gelman et al., Chapters 6 and 7

# Model Evaluation

* You can't do inference without making assumptions.

* We must _test_ the hypotheses defined by our models.


>  Does the model provide an _accurate_ description of the data?
>  
>
>  Are there alternative models that are either more accurate, more efficient, or both?

# Model Checking

* How do we know if our model is any good? One property that "good" models have is *accuracy.*


* Accurate models generate data that is *like* the observed data. What does this mean? 


>  * A model's ability to "predict the data" can often be assessed _visually_
> 
>
>  * *Test statistics* can be designed to capture discrepancies between observed and predicted data

# Model Checking Example

We'll step through a simple [example analysis](../examples/Straightline) of some $(x, y \pm \sigma)$ data generated by a simple model:

<img src="../graphics/modelcheck-data.png" width=60%>

# Fitting a Straight Line $y = m x + b$

<img src="../graphics/modelcheck-linear-posterior.png" width=50%> This looks like a nice, precise measurement.

# Visual Check

* Does the model fit the data?

* The first thing to do is plot the model's predictions, in data space

* In this example, "realizing the model" means drawing lines of $m x + b$ through the data points

* We have posterior samples in $(m, b)$: let's realize them all, with low `alpha` to see where the probability mass is
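As a minimal sketch of "realizing the model", the snippet below evaluates one line per posterior sample on a grid in $x$. The posterior samples here are hypothetical stand-ins drawn from Gaussians, and the names `realize_lines`, `m_samples`, `b_samples` and `x_grid` are illustrative assumptions rather than part of the example analysis.

```python
import numpy as np


def realize_lines(m_samples, b_samples, x_grid):
    """Evaluate y = m*x + b on x_grid for every posterior sample.

    Returns an array of shape (n_samples, n_grid): one model line per row.
    """
    m = np.asarray(m_samples, dtype=float)[:, None]
    b = np.asarray(b_samples, dtype=float)[:, None]
    return m * np.asarray(x_grid, dtype=float)[None, :] + b


# Stand-in posterior samples (hypothetical values, for illustration only)
rng = np.random.default_rng(0)
m_samples = rng.normal(2.0, 0.1, size=500)
b_samples = rng.normal(1.0, 0.5, size=500)
x_grid = np.linspace(0.0, 10.0, 50)

lines = realize_lines(m_samples, b_samples, x_grid)

# Summarize where the probability mass is, at each x
lo, mid, hi = np.percentile(lines, [16, 50, 84], axis=0)

# To overlay the lines with matplotlib, plot each row with a low alpha, e.g.
#   for row in lines: plt.plot(x_grid, row, color="k", alpha=0.02)
```

Plotting every row with low `alpha` shows the same information as the percentile band, without having to choose which percentiles to draw.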

# Visual Check

<img src="../graphics/modelcheck-linear.png" width=60%>

# Quantifying Goodness of Fit

* Clearly the model looks inadequate: can we quantify this?

* In the frequentist approach, we would investigate the distribution of peak log likelihood values over an ensemble of (hypothetical) datasets, and compute the probability of getting a fit as bad as the maximum likelihood one or worse by chance. 

* In our case (or under the assumptions discussed in the ["approaches" notebook](approaches.ipynb)), $-2\log L_{\rm max}$ is equal to $\chi^2_{\rm min}$ up to an additive constant, and so follows a $\chi^2$ distribution.

# Chi-squared hypothesis testing

* The data can be summarized with the Maximum Likelihood estimators for $m$ and $b$, leading to a particular value for $\chi^2_{\rm min}$

* We expect $\chi^2_{\rm min}$ to be distributed as $\chi^2$ for $N_{\rm dof} = (N_{\rm data}-2)$ degrees of freedom. 

* We can ask for the probability of getting a dataset that gives our value of $\chi^2_{\rm min}$ or worse (greater), given that our dataset and all the hypothetical ones were generated from our model (the "null hypothesis"). This one-sided integral probability is the "p-value"
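The steps above can be sketched as follows, assuming independent Gaussian errors of known size $\sigma_i$ on each $y_i$. The data are synthetic and generated from a straight line purely for illustration, and `fit_line_chisq` is a hypothetical helper, not a function from the example analysis.

```python
import numpy as np
import scipy.stats


def fit_line_chisq(x, y, sigma):
    """Weighted least-squares fit of y = m*x + b; returns (m, b, chisq_min)."""
    x, y, sigma = (np.asarray(a, dtype=float) for a in (x, y, sigma))
    # Scale each row by 1/sigma so that ordinary least squares minimizes chi-squared
    A = np.vstack([x, np.ones_like(x)]).T / sigma[:, None]
    (m, b), *_ = np.linalg.lstsq(A, y / sigma, rcond=None)
    chisq_min = np.sum(((y - (m * x + b)) / sigma) ** 2)
    return m, b, chisq_min


# Synthetic data drawn from a straight line (illustrative values only)
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 30)
sigma = np.ones_like(x)
y = 2.0 * x + 1.0 + sigma * rng.standard_normal(x.size)

m, b, chisq_min = fit_line_chisq(x, y, sigma)
Ndof = x.size - 2  # two fitted parameters, m and b
pvalue = scipy.stats.chi2(Ndof).sf(chisq_min)  # Pr(chisq >= chisq_min | model)
print("chisq_min = {0:.1f}, Ndof = {1:d}, p-value = {2:.3g}".format(chisq_min, Ndof, pvalue))
```

Because these data really were generated by the model being tested, the p-value should usually come out unremarkable.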

# Chi-squared hypothesis testing

* The "p-value" gives the probability of getting a value of $\chi^2_{\rm min}$ as high as ours, or higher, by chance, given that the model is true. 

* If the p-value is less than some significance level of our choosing (e.g. 0.05), then the null hypothesis (our straight line model) would be rejected at this level.

* We can compute the p-value assuming a chi-squared distribution using `scipy.stats`:
```python
import scipy.stats
chisq = scipy.stats.chi2(Ndof)
pvalue = chisq.sf(chisq_min)
```

# Chi-squared hypothesis testing

* In our case, we would "reject the null hypothesis at the $10^{-10}$ significance level"
```
chisq_min =  104.2 p-value =  1.03e-10
```

* The "reduced chi-squared", $\hat{\chi}^2 = \chi^2_{\rm min} / N_{\rm dof}$, is often used by astronomers to quantify goodness of fit - but note that you need to know the number of degrees of freedom separately to be able to interpret it.


# Fisher's Approximation to the Chi-squared Distribution

* A useful, quick way to make sense of  $\chi^2_{\rm min}$ and $N_{\rm dof}$ values is to use Fisher's Gaussian approximation to the chi-squared distribution: 

$\;\;\;\;\;\sqrt{2\chi^2_{\rm min}} \sim \mathcal{N}\left( \sqrt{2 N_{\rm dof}-1}, 1 \right)$

* The difference between $\sqrt{2\chi^2_{\rm min}}$ and $\sqrt{2 N_{\rm dof}-1}$ is the "number of sigma" ($n_{\sigma}$) we are away from a good fit.

* In our case, the MLE model is about 7-sigma away from being a good fit.
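One way to get a feel for how good Fisher's approximation is: compare it with the "number of sigma" obtained by converting the exact $\chi^2$ p-value into a one-sided Gaussian tail. This is a sketch only; the values of `chisq_min` and `Ndof` below are illustrative and are not those of the example analysis, and both function names are hypothetical.

```python
import numpy as np
import scipy.stats


def nsigma_fisher(chisq_min, Ndof):
    """Fisher's Gaussian approximation: sqrt(2 chisq) - sqrt(2 Ndof - 1)."""
    return np.sqrt(2.0 * chisq_min) - np.sqrt(2.0 * Ndof - 1.0)


def nsigma_exact(chisq_min, Ndof):
    """Gaussian-equivalent significance of the exact one-sided chi-squared p-value."""
    p = scipy.stats.chi2(Ndof).sf(chisq_min)
    return scipy.stats.norm.isf(p)


chisq_min, Ndof = 50.0, 30  # illustrative values
print("Fisher: {0:.2f} sigma".format(nsigma_fisher(chisq_min, Ndof)))
print("Exact:  {0:.2f} sigma".format(nsigma_exact(chisq_min, Ndof)))
```

For very small p-values the exact conversion can underflow, which is one practical reason to reach for the approximation.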

# Exercise: $\chi^2$ intuition

For various choices of `chisq_min` and `Ndof`, extend and populate this table:

| **$\chi^2_{\rm min}$**  | **$N_{\rm dof}$**|  **$\hat{\chi}^2$** | **$p$ value** | **$n_{\sigma}$** |
|-------------------------|------------------|---------------------|---------------|------------------|
|          11             |        10        |          .          |      .        |        .         |
|         110             |       100        |          .          |      .        |        .         |
|        1100             |      1000        |          .          |      .        |        .         |



```python
import numpy as np
import scipy.stats


def assess_goodness_of_fit(chisq_min, Ndof):
    """
    Parameters
    ----------
    chisq_min : float
        Value of chisq following minimization
    Ndof : int
        No. of degrees of freedom (Ndata - Npars)

    Returns
    -------
    rchisq : float
        Reduced chi-squared value
    p : float
        p-value
    nsigma : float
        No. of sigma we are from an acceptable fit, using Fisher approximation
    """
    # Reduced chi-squared:
    rchisq = chisq_min / Ndof
    # p-value: probability of a chisq at least as large as chisq_min
    # (the survival function, not its complement)
    chisq = scipy.stats.chi2(Ndof)
    p = chisq.sf(chisq_min)
    # Nsigma from an acceptable fit:
    nsigma = np.sqrt(2.0*chisq_min) - np.sqrt(2.0*Ndof - 1.0)

    return rchisq, p, nsigma


# Populate table:
chisq_min, Ndof = 11, 10
rchisq, p, nsigma = assess_goodness_of_fit(chisq_min, Ndof)

print("|        {0:.1f}             |      {1:d}        |        {2:.1f}          |      {3:.3f}        |        {4:.1f}         |".format(chisq_min, Ndof, rchisq, p, nsigma))
```

# Bayesian hypothesis testing

The frequentist hypothesis testing treatment has several shortcomings, from the Bayesian point of view:

1. It provides no way to include the prior information in the model assessment.

2. Many recipes require the likelihood function to be such that a standard distribution (like chi-squared) is appropriate for the distribution over datasets of maximum log likelihood values.

3. It does not take into account the uncertainty on the estimators.

The Bayesian approach to hypothesis testing addresses these by focusing on the full posterior PDF for the model parameters.

# Posterior predictive model checking

Logic:

* If our model is the true one, then *replica* data generated by it should "look like" the one dataset we have (as in the frequentist approach). 


* This means that *summaries* of both the real dataset, $T(d)$, and the replica datasets, $T(d^{\rm rep})$, should follow the same distribution over noise realizations _and_ model parameters (new in Bayesian approach).


* If the real dataset was not generated with our model, then its summary may be an _outlier_ from the distribution of summaries of replica datasets.

# Posterior predictive model checking

* We can account for our uncertainty in the parameters $\theta$ by marginalizing them out. This is easily done with our posterior samples: for each sample $\theta$, draw one replica dataset $d^{\rm rep}$ from the model sampling distribution ${\rm Pr}(d^{\rm rep}\,|\,\theta)$, and then make the histogram of the resulting values of $T(d^{\rm rep})$.

# Posterior predictive $p$-values (PPP)

* Then, we can ask: what is the posterior probability for the summary $T$ to be greater than the observed summary $T(d)$? If this is very small or very large, we should be suspicious of our model - because it is not predicting the data very accurately.

$\;\;\;\;\;\;\;{\rm Pr}(T(d^{\rm rep})>T(d)\,|\,d)$

$\;\;\;\;\;\;\;\;\;\;\;= \int I(T(d^{\rm rep})>T(d))\,{\rm Pr}(d^{\rm rep}\,|\,\theta)\,{\rm Pr}(\theta\,|\,d)\;d\theta\,dd^{\rm rep}$

> Here $I$ is the "indicator function" - 1 or 0 according to the condition. 
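The integral above can be estimated by Monte Carlo: with one replica per posterior sample, it is just the fraction of replicas for which the indicator function is 1. The sketch below uses a deliberately simple toy model (Gaussian data with unknown mean and known unit variance, flat prior) so that posterior samples can be drawn directly; the function and argument names are hypothetical.

```python
import numpy as np


def posterior_predictive_pvalue(T, d, theta_samples, draw_replica, rng):
    """Estimate Pr(T(d_rep) > T(d) | d) using one replica per posterior sample."""
    T_obs = T(d)
    T_rep = np.array([T(draw_replica(theta, rng)) for theta in theta_samples])
    # Averaging the indicator function over samples gives the PPP
    return np.mean(T_rep > T_obs), T_rep


# Toy problem: d_i ~ N(mu, 1) with a flat prior on mu, so that
# mu | d ~ N(mean(d), 1/sqrt(N)) and posterior samples can be drawn directly
rng = np.random.default_rng(1)
d = rng.normal(3.0, 1.0, size=20)
mu_samples = rng.normal(d.mean(), 1.0 / np.sqrt(d.size), size=2000)


def draw_replica(mu, rng):
    """Sampling distribution Pr(d_rep | mu) for the toy model."""
    return rng.normal(mu, 1.0, size=d.size)


# Test statistic: the largest data value
ppp, T_rep = posterior_predictive_pvalue(np.max, d, mu_samples, draw_replica, rng)
print("Pr(T(d_rep) > T(d) | d) = {0:.3f}".format(ppp))
```

In a real analysis `mu_samples` would come from an MCMC run, and `draw_replica` would be the model's own sampling distribution.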

# Test statistic: Pearson Correlation $r_{12}$

* Let's take a simple linear correlation test statistic:

$\;\;\;\;\;\;\;T(d) = r_{12} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\left[ \sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2 \right]^{1/2}}$

* Then, we draw a replica dataset from the sampling distribution given a sample parameter vector, compute $T(d^{\rm rep})$, and repeat for all posterior samples to build up a histogram of $T$. We then compute the Posterior Predictive p-value (PPP). 
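A sketch of this procedure for the straight line model is given below. The data and the posterior samples in $(m, b)$ are hypothetical stand-ins (in practice the samples would come from the sampler), and `pearson_r` is simply a direct transcription of the formula for $r_{12}$.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson linear correlation coefficient r_12 between x and y."""
    dx = x - np.mean(x)
    dy = y - np.mean(y)
    return np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2))


rng = np.random.default_rng(2)
x = np.linspace(0.0, 10.0, 30)
sigma = np.full_like(x, 1.0)
# Hypothetical observed data
y = 2.0 * x + 1.0 + sigma * rng.standard_normal(x.size)

# Stand-in posterior samples for (m, b): illustrative values only
n_samples = 1000
m_samples = rng.normal(2.0, 0.06, size=n_samples)
b_samples = rng.normal(1.0, 0.35, size=n_samples)

# One replica dataset per posterior sample, drawn from Pr(d_rep | m, b)
y_rep = (m_samples[:, None] * x + b_samples[:, None]
         + sigma * rng.standard_normal((n_samples, x.size)))

T_obs = pearson_r(x, y)
T_rep = np.array([pearson_r(x, yr) for yr in y_rep])
ppp = np.mean(T_rep > T_obs)
print("T(d) = {0:.4f}, PPP = {1:.3f}".format(T_obs, ppp))
```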

# Test statistic: Pearson Correlation $r_{12}$

$\;\;\;\;\;\;\;{\rm Pr}(T(d^{\rm rep})>T(d)\,|\,d) = 99.43\%$

<img src="../graphics/modelcheck-linear-TS.png" width=50%>

# Interpreting test statistics

* If our model is true (and we're just uncertain about its parameters, given the data), we can compute the probability of getting a $T$ greater than that observed.


* Note that _we did not have to look up any particular standard distribution_ - we can simply compute the posterior predictive distribution given our generative model.


* This particular test statistic is not very powerful: better choices might put more acute stress on the model to perform, by focusing on the places where the model predictions are suspect.

# Discrepancy measures

* Test statistics $T(d,\theta)$ that are functions of both the data and the parameters are known as _discrepancy measures._


* Similar in spirit to the above, we can compute the posterior probability of getting $T(d^{\rm rep},\theta) > T(d,\theta)$:

$\;\;\;\;\;\;\;{\rm Pr}(T(d^{\rm rep},\theta)>T(d,\theta)\,|\,d)$
$\;\;\;\;\;\;\;\;\;\;\;= \int I(T(d^{\rm rep},\theta)>T(d,\theta))\,{\rm Pr}(d^{\rm rep}\,|\,\theta)\,{\rm Pr}(\theta\,|\,d)\;d\theta\,dd^{\rm rep}$


* The reduced $\chi^2(d,\theta)$ is a good example of a discrepancy measure; more generally we can use $\log \mathcal{L}$.
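With a discrepancy measure, both $T(d^{\rm rep},\theta)$ and $T(d,\theta)$ are evaluated at the same posterior sample $\theta$ before being compared. The sketch below does this with $\chi^2$ for a straight line model fitted to hypothetical, deliberately curved data. It assumes flat priors and Gaussian noise of known size, in which case the posterior for $(m, b)$ is Gaussian and can be sampled directly from the least-squares fit.

```python
import numpy as np


def chisq(y, x, sigma, m, b):
    """Discrepancy T(d, theta): chi-squared of data y about the line m*x + b.

    m and b may have shape (n_samples, 1); the sum runs over data points.
    """
    return np.sum(((y - (m * x + b)) / sigma) ** 2, axis=-1)


rng = np.random.default_rng(5)
x = np.linspace(0.0, 10.0, 30)
sigma = np.ones_like(x)
# Hypothetical data with curvature that a straight line cannot capture
y = 1.0 + 2.0 * x + 0.5 * x**2 + sigma * rng.standard_normal(x.size)

# Gaussian posterior for (m, b), centred on the weighted least-squares fit
coeffs, cov = np.polyfit(x, y, 1, w=1.0 / sigma, cov="unscaled")
samples = rng.multivariate_normal(coeffs, cov, size=2000)
m, b = samples[:, :1], samples[:, 1:]

# One replica per sample; discrepancies for replica and real data at the same theta
y_rep = m * x + b + sigma * rng.standard_normal((len(samples), x.size))
T_rep = chisq(y_rep, x, sigma, m, b)
T_obs = chisq(y, x, sigma, m, b)

ppp = np.mean(T_rep > T_obs)
print("Pr(T(d_rep, theta) > T(d, theta) | d) = {0:.3f}".format(ppp))
```

Dividing both `T_rep` and `T_obs` by the number of degrees of freedom gives the reduced $\chi^2$ without changing the resulting probability.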

# Discrepancy measure: $T = \hat{\chi}^2$

$\;\;\;\;\;\;\;{\rm Pr}(T(d^{\rm rep},\theta)>T(d,\theta)\,|\,d) = 0.0\%$

<img src="../graphics/modelcheck-linear-discrepancy.png" width=50%>

# $p$-values


* Bayesian posterior predictive $p$-values can be thought of as an extension of the classical hypothesis testing $p$-value


* The extension is to include prior information, and uncertainty in the parameters, via its focus on the full posterior PDF.

# Model expansion

* Once the model checks / hypothesis tests have revealed an inadequate model, we are back to the drawing board to design an expanded model

* Usually, improving the accuracy of the model means increasing its flexibility, either by relaxing hard assumptions or increasing the number of free parameters

* In our example, we might add a quadratic term to the model: $y = m x + b + q x^2$
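As a quick illustration of what model expansion buys, the sketch below fits both a straight line and a quadratic to the same hypothetical curved dataset by weighted least squares, and compares their $\chi^2_{\rm min}$ values. The data and the helper `chisq_min_poly` are illustrative assumptions, not the example analysis itself.

```python
import numpy as np
import scipy.stats


def chisq_min_poly(x, y, sigma, degree):
    """Minimum chi-squared and Ndof for a polynomial model of the given degree."""
    # np.polyfit weights multiply the residuals, so use 1/sigma (not 1/sigma**2)
    coeffs = np.polyfit(x, y, degree, w=1.0 / sigma)
    resid = (y - np.polyval(coeffs, x)) / sigma
    return np.sum(resid**2), x.size - (degree + 1)


rng = np.random.default_rng(6)
x = np.linspace(0.0, 10.0, 30)
sigma = np.ones_like(x)
# Hypothetical data generated with a quadratic term
y = 1.0 + 2.0 * x + 0.5 * x**2 + sigma * rng.standard_normal(x.size)

results = {}
for name, degree in [("linear", 1), ("quadratic", 2)]:
    c, Ndof = chisq_min_poly(x, y, sigma, degree)
    p = scipy.stats.chi2(Ndof).sf(c)
    results[name] = (c, Ndof, p)
    print("{0:>9s}: chisq_min = {1:.1f}, Ndof = {2:d}, p-value = {3:.3g}".format(name, c, Ndof, p))
```

The expanded model can never have a larger $\chi^2_{\rm min}$ than the model nested inside it, which is why goodness of fit alone cannot tell us whether the extra parameter was worth adding.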

# Quadratic model tests

<table><tr>
<td><img src="../graphics/modelcheck-quadratic.png" width=80%></td>
<td><img src="../graphics/modelcheck-quadratic-discrepancy.png" width=80%></td>
</tr></table>

# Model Efficiency


* As well as predictive accuracy, the other virtue a model can have is *efficiency*. Typically we are interested in models that both fit the data well, and are also somehow "natural" - that is, not contrived or fine-tuned. 


* Contrived models have high likelihood in only small regions of their parameter spaces - and it turns out such models are penalized automatically by the "Bayesian evidence."

# Bayesian Evidence


* The Evidence, or "fully marginalized likelihood" (FML), is the denominator in Bayes' theorem (previously referred to as "just a normalization constant"), and is an integral over all parameter space:

$\;\;\;\;\;\;\;{\rm Pr}(d\,|\,H) = \int\;{\rm Pr}(d\,|\,\theta,H)\;{\rm Pr}(\theta\,|\,H)\;d\theta$


* The evidence is able to capture both a model's accuracy and its efficiency because it summarizes *all* the information we put into our model inferences, via both the data *and* our prior beliefs. 



# Model Comparison with the Bayesian Evidence

* The evidence for model $H$, ${\rm Pr}(d\,|\,H)$, enables a form of Bayesian hypothesis testing: model comparison with the "evidence ratio" or "Bayes Factor":

$\;\;\;\;\;\;\;R = \frac{{\rm Pr}(d\,|\,H_1)}{{\rm Pr}(d\,|\,H_0)}$


* This quantity is similar to a likelihood ratio, but it's a *fully marginalized likelihood ratio* - which is to say that it *takes into account our uncertainty about values of the parameters of each model by integrating over all plausible values of them.*

# Exercise: Evidence illustration 

In a 1D inference problem you have a Gaussian likelihood $L(\theta)$ that has peak value $L_{max}$, and a uniform prior of width $\Delta \theta$.

Sketch this situation, and provide a graphical interpretation (i.e. in terms of areas on your graph) of the Evidence integral $E$.

Make some more sketches for different choices of prior location and width, and comment on the relative values of the Evidence.

```python
# %load solutions/modelevaluation_exercise2.py
```

# Evidence illustration: notes

1) The evidence can be made arbitrarily small by increasing the prior volume: the evidence is more conservative than focusing on the goodness of fit ($L_{\rm max}$) alone.  

2) The evidence is linearly sensitive to prior volume ($f$), but exponentially sensitive to goodness of fit ($L_{\rm max} \propto e^{-\chi^2_{\rm min}/2}$). It's still a likelihood, after all.

> If you assign a prior you don't believe, then you may not get out a meaningful value for ${\rm Pr}(d\,|\,H)$; however, the ratio $R$ can still be useful to think about even with arbitrary choices of prior.
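These notes can be checked numerically. For a Gaussian likelihood of peak height $L_{\rm max}$ and a uniform prior, the evidence is the area under the likelihood within the prior range, divided by the prior width. The sketch below evaluates this in closed form; the function name, argument names and numerical values are all illustrative assumptions.

```python
import numpy as np
import scipy.stats


def evidence_uniform_prior(L_max, theta0, width_L, prior_lo, prior_hi):
    """Evidence for a Gaussian likelihood under a uniform prior on [prior_lo, prior_hi].

    The likelihood is L_max * exp(-(theta - theta0)**2 / (2 * width_L**2)).
    """
    # Area under the likelihood curve that falls inside the prior range
    frac = (scipy.stats.norm.cdf(prior_hi, theta0, width_L)
            - scipy.stats.norm.cdf(prior_lo, theta0, width_L))
    area = L_max * width_L * np.sqrt(2.0 * np.pi) * frac
    # The uniform prior density is 1 / (prior width)
    return area / (prior_hi - prior_lo)


L_max, theta0, width_L = 1.0, 0.0, 1.0  # illustrative values

E_narrow = evidence_uniform_prior(L_max, theta0, width_L, -10.0, 10.0)
E_wide = evidence_uniform_prior(L_max, theta0, width_L, -20.0, 20.0)
E_offset = evidence_uniform_prior(L_max, theta0, width_L, 5.0, 25.0)

print("E (prior width 20):         {0:.4g}".format(E_narrow))
print("E (prior width 40):         {0:.4g}".format(E_wide))
print("E (prior misses the peak):  {0:.4g}".format(E_offset))
```

Doubling the width of a prior that already contains the whole likelihood halves the evidence, while moving the prior away from the peak costs far more.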


# Model probabilities

The evidence ratio can, in principle, be combined with the ratio of priors for each model to give us the relative probability for each model being true, given the data:

$\frac{{\rm Pr}(H_1|d)}{{\rm Pr}(H_0|d)} = \frac{{\rm Pr}(d|H_1)}{{\rm Pr}(d|H_0)} \; \frac{{\rm Pr}(H_1)}{{\rm Pr}(H_0)}$


Prior probabilities for models are very difficult to assign in most practical problems (notice that no theorist ever provides them). 

# Model probabilities

One practical way to interpret the evidence ratio is to note that the evidence ratio updates the prior ratio into a posterior one. This means that:

  * if you think that, having seen the data, the two models are *still equally probable,*
  
  * then the evidence ratio in favor of $H_1$ is _the odds that you would have had to have been willing to take against $H_1$, before seeing the data._

# Example Evidence

* In the example above, we can compute the evidence for the linear and quadratic models, and form the odds ratio $R$.

```
log Evidence for Straight Line Model: -157.2
log Evidence for Quadratic Model: -120.7
Evidence ratio in favour of the Quadratic Model: 7e15 to 1
```

* The 36.5 unit difference in log evidence between the two models translates to a _huge_ odds ratio in favour of the quadratic model.
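Taking the two log evidence values printed above at face value, the arithmetic behind the evidence ratio can be sketched as follows. The variable names are illustrative, and the equal prior odds used in the last step are an assumption made purely for the sake of the example.

```python
import numpy as np

# Log evidences quoted in the output above
logE_line = -157.2
logE_quad = -120.7

# The evidence ratio is the exponential of the difference in log evidence
delta = logE_quad - logE_line
R = np.exp(delta)
print("Difference in log evidence: {0:.1f}".format(delta))
print("Evidence ratio in favour of the quadratic model: {0:.0e} to 1".format(R))

# Prior odds Pr(H1)/Pr(H0) at which the two models would end up
# equally probable after seeing the data
break_even_prior_odds = 1.0 / R

# Posterior odds for an assumed prior odds ratio (here 1, i.e. no prior preference)
prior_odds = 1.0
posterior_odds = R * prior_odds
```

Working with differences of log evidences, and exponentiating only at the end, avoids underflow: the evidences themselves are far too small to compare comfortably as plain numbers.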

# Interpreting Bayes Factors

Jeffreys provided a scale to aid the interpretation of Bayes Factors:

| **R**  | **Strength of evidence**|
|--------|:------------------------|
|  < 1   | Negative |
|  1-3   | Barely worth mentioning |
|  3-10  | Substantial |
| 10-30  | Strong |
| 30-100 | Very strong |
|   >100 | Decisive |

Also: recall that evidence ratios can be interpreted in very similar ways to *odds* at the Bookmakers.

# Notes on the Evidence

* The Bayesian evidence is *qualitatively different* from other model assessments. While they focus primarily on *prediction accuracy,* the evidence is the way in which information from the prior PDF propagates through into our posterior beliefs about the model as a whole.


* There are no mathematical limitations to its use[[*citation needed*]](), in contrast to various other hypothesis tests that are only valid under certain assumptions (such as the models being nested). Any two models can be compared and the odds ratio computed.

# Calculating the Evidence

* The FML is in general quite difficult to calculate, since it involves averaging the likelihood over the prior. MCMC gives us samples from the posterior - and these cannot, it turns out, be reprocessed so as to estimate the evidence stably.


* A number of sampling algorithms have been developed that *do* calculate the evidence, during the process of sampling. These include:

  * Nested Sampling (including MultiNest and DNest)
  * Parallel Tempering, Thermodynamic Integration
  * ...

# Exercise: Evidence Estimators


  1) Write down the Evidence integral and approximate it as a sum over prior samples. What problems do you foresee with this approach to estimating the Evidence?

  2) Consider the ratio $P(\theta|d,H) / P(d|\theta,H)$ and its integral over all parameters $\theta$, and derive a second sample-based approximation for the Evidence. What problems do you foresee with this approach?

The [straight line example notebook](../examples/StraightLine/ModelEvaluation.ipynb) shows method (1) applied to our straight line example.
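For orientation, here is a minimal sketch of method (1) on a toy problem where the answer is known analytically: Gaussian data with unknown mean $\theta$ and a Gaussian prior on $\theta$. It is not the code from the example notebook; the model, the names and the sample sizes are all assumptions made for illustration.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal, norm


def log_evidence_from_prior_samples(logL):
    """Simple Monte Carlo: log of the mean likelihood over samples drawn from the prior."""
    logL = np.asarray(logL, dtype=float)
    # logsumexp avoids underflow when averaging very small likelihood values
    return logsumexp(logL) - np.log(logL.size)


# Toy model: d_i ~ N(theta, sigma), with prior theta ~ N(0, tau)
rng = np.random.default_rng(3)
sigma, tau, N = 1.0, 2.0, 10
d = rng.normal(1.0, sigma, size=N)

# Log likelihood evaluated at each prior sample
prior_samples = rng.normal(0.0, tau, size=20000)
logL = norm.logpdf(d[None, :], loc=prior_samples[:, None], scale=sigma).sum(axis=1)
logE_mc = log_evidence_from_prior_samples(logL)

# Analytic check: marginally, d ~ N(0, sigma^2 I + tau^2 J), with J a matrix of ones
cov = sigma**2 * np.eye(N) + tau**2 * np.ones((N, N))
logE_exact = multivariate_normal.logpdf(d, mean=np.zeros(N), cov=cov)

print("log Evidence (simple Monte Carlo): {0:.2f}".format(logE_mc))
print("log Evidence (analytic):           {0:.2f}".format(logE_exact))
```

The estimate is only as good as the number of prior samples that land where the likelihood is appreciable, so this approach degrades quickly as the prior gets broader or the number of parameters grows.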

# Information Criteria

A number of "information criteria" exist in the statistical literature, as *easier-to-calculate alternatives* to the Bayesian Evidence. Most have the form:

$XIC(H) = -2\log L_{\rm max} + C$

* Here, the $L_{\rm max}$ term captures the goodness of fit of the model $H$, while the constant $C$ increases with model complexity. 

* Models with *low* $XIC$ are preferred - for having high goodness of fit at low model complexity. 

* Popular choices are the Bayesian Information Criterion $BIC$, where $C = K \log N$, and the corrected Akaike Information Criterion $AICc$, where $C = 2K + 2K(K+1)/(N-K-1)$.
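The two criteria named above are simple enough to write down directly. In the sketch below, $K$ is the number of free parameters and $N$ the number of data points; the function names and the numbers in the usage example are hypothetical, and the example assumes Gaussian errors of known size so that $-2\log L_{\rm max}$ equals $\chi^2_{\rm min}$ up to a constant shared by both models.

```python
import math


def bic(log_L_max, K, N):
    """Bayesian Information Criterion: -2 log L_max + K log N."""
    return -2.0 * log_L_max + K * math.log(N)


def aicc(log_L_max, K, N):
    """Corrected Akaike Information Criterion: -2 log L_max + 2K + 2K(K+1)/(N-K-1)."""
    return -2.0 * log_L_max + 2.0 * K + 2.0 * K * (K + 1) / (N - K - 1)


# Hypothetical chi-squared values for a 2-parameter and a 3-parameter model
# fitted to the same N = 30 data points
N = 30
chisq_min_A, K_A = 45.0, 2
chisq_min_B, K_B = 30.0, 3

# The shared constant in -2 log L_max cancels in differences between models
delta_bic = bic(-0.5 * chisq_min_B, K_B, N) - bic(-0.5 * chisq_min_A, K_A, N)
delta_aicc = aicc(-0.5 * chisq_min_B, K_B, N) - aicc(-0.5 * chisq_min_A, K_A, N)
print("Delta BIC  (B - A) = {0:.2f}".format(delta_bic))
print("Delta AICc (B - A) = {0:.2f}".format(delta_aicc))
```

Negative differences here would favour model B, the one with the extra parameter.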

# Information Criteria

The Deviance Information Criterion is different, since its calculation requires taking expectation values over sets of posterior samples:

$DIC(H) = \langle -2\log L(\theta) \rangle + p_D$

where $p_D = \langle -2\log L(\theta) \rangle + 2 \log L(\langle\theta\rangle)$ is an _effective number of parameters_.

* Again, models with *low* $DIC$ are preferred - for having high goodness of fit at low model complexity. 
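A sketch of the DIC calculation from posterior samples is given below, using a one-parameter toy model (Gaussian data with unknown mean, known variance and a flat prior) for which $p_D$ should come out close to 1. The model and all names are assumptions made for illustration.

```python
import numpy as np
from scipy.stats import norm


def dic(log_L_samples, log_L_at_mean):
    """Deviance Information Criterion and effective number of parameters p_D.

    log_L_samples : log likelihood evaluated at each posterior sample
    log_L_at_mean : log likelihood evaluated at the posterior mean parameters
    """
    D_bar = np.mean(-2.0 * np.asarray(log_L_samples))
    p_D = D_bar + 2.0 * log_L_at_mean
    return D_bar + p_D, p_D


# Toy model: d_i ~ N(theta, sigma) with a flat prior on theta,
# so the posterior is N(mean(d), sigma / sqrt(N)) and can be sampled directly
rng = np.random.default_rng(4)
sigma, N = 1.0, 25
d = rng.normal(0.5, sigma, size=N)
theta_samples = rng.normal(d.mean(), sigma / np.sqrt(N), size=5000)


def log_L(theta):
    """Log likelihood for one or more values of theta (returns a 1D array)."""
    theta = np.atleast_1d(theta)
    return norm.logpdf(d[None, :], loc=theta[:, None], scale=sigma).sum(axis=1)


DIC, p_D = dic(log_L(theta_samples), log_L(theta_samples.mean())[0])
print("DIC = {0:.2f}, p_D = {1:.2f}".format(DIC, p_D))
```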

# Information Criteria


* $BIC$ is supposed to approximate the log evidence (in the limit of good data, linear model, and uniform priors)


* Differences in $AIC$ approximate the relative _information loss_ when using the model to approximate the actual data generator. $AIC$ focuses more on model accuracy than $BIC$. 


* $DIC$ can be computed from a set of posterior samples, and so also takes into account the prior uncertainty


* Differences in $XIC$ are not easy to interpret, although tables similar to the Jeffreys scale exist. In general, testing against realistic simulated data on a case by case basis will probably provide the best guidance.

# Things to keep in mind

Here are some good things to keep in mind when carrying out, or reading about, model comparison: 


  * The evidence is only linearly sensitive to prior volume, but exponentially sensitive to goodness of fit. The best way to determine which model makes the data more probable is to *get better data.*
  
  
  * This piece of common sense is reflected by both the probability theory and the behaviour of scientists: in practice, in the low signal to noise regime most model comparison proceeds via _discussion in the literature_.

# Things to keep in mind

  * If you don't believe your priors, or indeed your models, then it may not be very meaningful to compare evidences across models: it can be a distraction from things you care about more - such as accuracy, or cost. 
  
  
  * A section of the Bayesian astronomy community holds this pragmatic view.

# Things to keep in mind

  * An oft-used phrase in evidence discussions is "garbage in, garbage out." 
  
  
  * However: the evidence is the $K$-dimensional marginalization integral of the likelihood over the prior, while most of the numerical results you put in your abstracts are $(K-1)$-dimensional marginalization integrals. _The degree to which you trust your evidence values should be commensurate with the degree to which you trust your other inferences._ 
  

# Things to keep in mind
  
  * A reliably-computed evidence integral is a good indication of an accurately characterised posterior PDF. Indeed, some high performance sampling methods (such as nested sampling) were developed specifically *in order to compute the evidence accurately*.
  
  
  * The evidence appears in *second-level inferences*, when parts of your model that you previously considered to be constant now need to be varied and inferred. The FML is only one parameter liberation away from becoming a likelihood.  

# Model Evaluation Summary

* It is necessary but not sufficient to write down our model assumptions: we must also test them.

* Model checking, for prediction accuracy (or "goodness of fit"), is best first done _visually_, in _data space_.

* Posterior predictive checks, using well-designed test statistics and discrepancy measures, can then quantify the inaccuracy.

* The next step is typically model expansion: improving the model's ability to fit the data by changing its form, perhaps adding more parameters. 

* Comparing alternative models against each other can be done just with accuracy (via the check results), or with efficiency considerations as well, using the Bayesian evidence.