# Evaluating Models

Goals:

* Be able to design and carry out a posterior predictive model check, and describe its relationship to a classical significance test

* Recognize the Bayesian Evidence, know how it is computed, and describe the difference between a Bayes Factor and a likelihood ratio

* Assign priors in various different ways
 
 

# Further reading

* MacKay, Chapters 3 and 28

* Gelman et al, Chapters 6 and 7

# Model Evaluation

* You can't do inference without making assumptions.

* We need to _test_ the hypotheses defined by our models.


>  Does the model provide an _accurate_ description of the data?
>  
>
>  Are there alternative models that are either more accurate, more efficient, or both?

# Model Checking

* How do we know if our model is any good? One property that "good" models have is *accuracy.*


* Accurate models generate data that is *like* the observed data. What does this mean? First we have to define what similarity is, in this context. 


>  * *Visual impression* of a model's ability to "predict the data"
> 
>
>  * *Test statistics* that capture relevant features of the data

# Model Checking Example

We'll step through a simple [example analysis](../examples/Straightline) of some $(x, y, \sigma)$ data generated by a simple linear model:

<img src="../graphics/modelcheck-data.png" width=60%>

# Fitting a Straight Line $y = m x + b$

<img src="../graphics/modelcheck-linear-posterior.png" width=50%> This looks like a nice, precise measurement.

# Visual Check

* Does the model fit the data?

* The first thing to do is plot the model's predictions, in data space

* In this example, "realizing the model" means drawing lines of $m x + b$ through the data points

* We have posterior samples in $(m, b)$: let's realize them all, with low `alpha` to see where the probability mass is
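A minimal sketch of this visual check, assuming we already hold arrays of posterior samples. Here `x`, `y`, `sigma`, `m_samples` and `b_samples` are hypothetical stand-ins generated on the spot, not the data or posterior from the example analysis:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # only needed when running without a display
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)

# Hypothetical stand-ins for the real data and the posterior samples
x = np.linspace(0.0, 10.0, 30)
sigma = np.full_like(x, 1.0)
y = 2.0 * x + 1.0 + rng.normal(0.0, sigma)
m_samples = rng.normal(2.0, 0.05, size=500)
b_samples = rng.normal(1.0, 0.3, size=500)

# Realize every posterior sample as a line in data space:
# y_lines has shape (n_samples, n_grid)
xgrid = np.linspace(x.min(), x.max(), 100)
y_lines = m_samples[:, None] * xgrid[None, :] + b_samples[:, None]

fig, ax = plt.subplots()
# Low alpha: individual lines are faint, regions of high probability mass look dark
ax.plot(xgrid, y_lines.T, color="C0", alpha=0.02)
ax.errorbar(x, y, yerr=sigma, fmt="k.", zorder=10)
ax.set_xlabel("x")
ax.set_ylabel("y")
```

In a notebook, the figure would then be displayed inline, or written out with `fig.savefig(...)`.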

# Visual Check

<img src="../graphics/modelcheck-linear.png" width=60%>

# Quantifying Goodness of Fit

* Clearly the model looks inadequate: can we quantify this?

* In the frequentist approach, we would investigate the distribution of peak log likelihood values over (hypothetical) datasets, and compute the probability of getting a fit at least as bad as the maximum likelihood one by chance. 

* Under the assumptions discussed in the ["approaches" notebook](approaches.ipynb), $-2$ times the peak log likelihood (up to an additive constant) follows a $\chi^2$ distribution, which can (often) in turn be approximated by a Gaussian

# Chi-squared hypothesis testing

* The data can be summarized with the Maximum Likelihood estimators for $m$ and $b$, leading to a particular value for $\chi^2_{\rm min}$

* We expect $\chi^2_{\rm min}$ to be distributed as $\chi^2$ for $N_{\rm dof} = (N_{\rm data}-2)$ degrees of freedom. 

* We can ask for the probability of getting a dataset that gives our value of $\chi^2_{\rm min}$ or worse (greater), given that our dataset and all the hypothetical ones were generated from our model (the "null hypothesis"). This one-sided integral probability is the "p-value"

# Chi-squared hypothesis testing

* The "p-value" gives the probability of getting a value of $\chi^2_{\rm min}$, or higher, by chance, given that the model is true. 

* If the p-value is less than some significance level (e.g. 0.05), then the null hypothesis (our straight line model) would be rejected.

* We can compute the p-value assuming a chi-squared distribution using `scipy.stats`:
```python
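import scipy.stats

# Ndof = N_data - 2 and chisq_min are assumed to come from the maximum likelihood fit
chisq = scipy.stats.chi2(Ndof)   # chi-squared distribution with Ndof degrees of freedom
pvalue = chisq.sf(chisq_min)     # survival function: Pr(chi^2 >= chisq_min)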
```
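For completeness, here is a self-contained sketch of where `chisq_min` and `Ndof` might come from. It assumes Gaussian errors of known `sigma`, so that the maximum likelihood fit is a weighted least squares fit. The dataset is synthetic and hypothetical, so the numbers it prints are not those of the example analysis:

```python
import numpy as np
import scipy.stats

rng = np.random.default_rng(0)

# Hypothetical data: a straight line plus Gaussian noise of known sigma
x = np.linspace(0.0, 10.0, 30)
sigma = np.full_like(x, 0.5)
y = 1.5 * x + 2.0 + rng.normal(0.0, sigma)

# Weighted least squares is the maximum likelihood fit for Gaussian errors
A = np.vstack([x, np.ones_like(x)]).T          # design matrix for y = m x + b
(m_hat, b_hat), *_ = np.linalg.lstsq(A / sigma[:, None], y / sigma, rcond=None)

# Minimum chi-squared, degrees of freedom, and one-sided p-value
chisq_min = np.sum(((y - (m_hat * x + b_hat)) / sigma) ** 2)
Ndof = len(x) - 2
pvalue = scipy.stats.chi2(Ndof).sf(chisq_min)

print("chisq_min = ", chisq_min, "p-value = ", pvalue)
```

Since this synthetic dataset really was drawn from a straight line, we would not expect a tiny p-value here.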

# Chi-squared hypothesis testing

* In our case, we would "reject the null hypothesis at the $10^{-10}$ significance level"
```
chisq_min =  104.2 p-value =  1.03e-10
```

* The "reduced chi-squared", $\chi^2_{\rm min} / N_{\rm dof}$, is often used by astronomers to quantify goodness of fit - but note that you need to know the number of degrees of freedom separately to be able to interpret it.


# Fisher's Approximation to the Chi-squared Distribution

* A useful, quick way to make sense of  $\chi^2_{\rm min}$ and $N_{\rm dof}$ values is to use Fisher's Gaussian approximation to the chi-squared distribution: 

$\;\;\;\;\;\sqrt{2\chi^2_{\rm min}} \sim \mathcal{N}\left( \sqrt{2 N_{\rm dof}-1}, 1 \right)$

* The difference between $\sqrt{2\chi^2_{\rm min}}$ and $\sqrt{2 N_{\rm dof}-1}$ is the "number of sigma" we are away from a good fit.

* In our case, the MLE model is about 7-sigma away from being a good fit.
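As a quick sketch of this arithmetic: the helper name `fisher_n_sigma` is made up for illustration, and `Ndof = 28` is an assumed value (it is not quoted in these notes), chosen only to show the calculation with $\chi^2_{\rm min} = 104.2$.

```python
import numpy as np

def fisher_n_sigma(chisq_min, n_dof):
    """Approximate 'number of sigma' from Fisher's Gaussian approximation."""
    return np.sqrt(2.0 * chisq_min) - np.sqrt(2.0 * n_dof - 1.0)

# chisq_min is taken from the text; n_dof = 28 is an assumption for illustration
n_sigma = fisher_n_sigma(104.2, 28)
print("approximately", round(n_sigma, 1), "sigma from a good fit")
```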

# Bayesian Model Checking

* The frequentist hypothesis testing treatment has several shortcomings, from the Bayesian point of view

1. It provides no way to include the prior information in the model assessment.

2. It requires the likelihood function to be such that a standard distribution (like chi-squared) can be used to model the distribution over datasets of maximum log likelihood values.

The Bayesian approach to hypothesis testing addresses both of these by focusing on the full posterior PDF for the model parameters

# Posterior Predictive Model Checking

Logic:

* If our model is the true one, then *replica* data generated by it should "look like" the one dataset we have. 


* This means that *summaries* of both the real dataset, $T(d)$, and the replica datasets, $T(d^{\rm rep})$, should follow the same distribution over model parameters _and_ noise realizations 


* If the real dataset was not generated with our model, then its summary may be an _outlier_ from the distribution of summaries of replica datasets.

# Posterior Predictive Model Checking

* We can account for our uncertainty in the parameters $\theta$ by marginalizing them out. In practice this is easy: for each posterior sample $\theta$, draw one replica dataset $d^{\rm rep}$ from the model sampling distribution ${\rm Pr}(d^{\rm rep}\,|\,\theta)$, and then make the histogram of the resulting values of $T(d^{\rm rep})$.

# Posterior Predictive $p$-values

* Then, we can ask: what is the posterior probability for the replica summary $T(d^{\rm rep})$ to be greater than the observed summary $T(d)$? If this is very small or very large, we should be suspicious of our model - because it is not predicting the data very accurately.

$\;\;\;\;\;\;\;{\rm Pr}(T(d^{\rm rep})>T(d)\,|\,d) = \int I(T(d^{\rm rep})>T(d))\,{\rm Pr}(d^{\rm rep}\,|\,\theta)\,{\rm Pr}(\theta\,|\,d)\;d\theta\,dd^{\rm rep}$

> Here $I$ is the "indicator function" - 1 or 0 according to the condition. 
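The integral above can be estimated by simple Monte Carlo: one replica dataset per posterior sample, and the mean of the indicator function. The sketch below is not the example notebook's code. It assumes Gaussian errors and flat priors (so that the posterior for $(m, b)$ is itself Gaussian and can be sampled directly), a hypothetical curved dataset, and a made-up summary `T_curvature`:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data with some curvature, to be checked against a straight line
x = np.linspace(0.0, 10.0, 30)
sigma = np.full_like(x, 1.0)
y_obs = 0.2 * x**2 + 1.0 + rng.normal(0.0, sigma)

# With flat priors and Gaussian errors, the posterior for (m, b) is Gaussian,
# centered on the weighted least squares solution
A = np.vstack([x, np.ones_like(x)]).T
cov = np.linalg.inv(A.T @ (A / sigma[:, None] ** 2))
mean = cov @ (A.T @ (y_obs / sigma**2))
samples = rng.multivariate_normal(mean, cov, size=2000)   # columns: m, b

def T_curvature(x, y):
    """Made-up summary: mean of the outer thirds minus mean of the middle third."""
    n = len(y) // 3
    return 0.5 * (y[:n].mean() + y[-n:].mean()) - y[n:-n].mean()

def ppp_value(T, x, y_obs, sigma, samples, rng):
    """Monte Carlo estimate of Pr(T(d_rep) > T(d) | d)."""
    T_obs = T(x, y_obs)
    T_rep = np.empty(len(samples))
    for k, (m, b) in enumerate(samples):
        y_rep = m * x + b + rng.normal(0.0, sigma)   # one replica per sample
        T_rep[k] = T(x, y_rep)
    return np.mean(T_rep > T_obs), T_rep

ppp, T_rep = ppp_value(T_curvature, x, y_obs, sigma, samples, rng)
print("Pr(T(d_rep) > T(d) | d) =", ppp)
```

The array `T_rep` is what we would histogram, with $T(d)$ overlaid as a vertical line.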

# Test statistic: Pearson Correlation $r_{12}$

* Let's take a simple linear correlation test statistic:

$\;\;\;\;\;\;\;T(d) = r_{12} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\left[ \sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2 \right]^{1/2}}$

* Then, for each posterior sample parameter vector, we draw a replica dataset from the sampling distribution and compute $T(d^{\rm rep})$, building up a histogram of $T$ from which we compute the posterior predictive $p$-value (PPP value).
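The correlation statistic is simple to write down directly from its definition. In this sketch the function name `pearson_r` and the small test arrays are made up for illustration, and the result is compared against `np.corrcoef` as a sanity check:

```python
import numpy as np

def pearson_r(x, y):
    """Pearson linear correlation coefficient r_12 between x and y."""
    dx = x - np.mean(x)
    dy = y - np.mean(y)
    return np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2))

# Hypothetical example values
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

r = pearson_r(x, y)
r_check = np.corrcoef(x, y)[0, 1]   # NumPy's built-in estimate, for comparison
print(r, r_check)
```

A function like this can serve as the summary $T$, evaluated once on the real data and once on each replica dataset.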

# Test statistic: Pearson Correlation $r_{12}$

$\;\;\;\;\;\;\;{\rm Pr}(T(d^{\rm rep})>T(d)\,|\,d) = 99.43\%$

<img src="../graphics/modelcheck-linear-TS.png" width=60%>

# Interpreting Test Statistics

* If our model is true (and we're just uncertain about its parameters, given the data), we can compute the probability of getting a $T$ less than that observed.


* Note that _we did not have to look up any particular standard distribution_ - we can simply compute the posterior predictive distribution given our generative model.


* This particular test statistic is not very powerful: better choices might put more acute stress on the model to perform, by focusing on the places where the model predictions are suspect.

# Discrepancy Measures

* Test statistics $T(d,\theta)$ that are functions of both the data and the parameters are known as _discrepancy measures._


* Similar in spirit to the above, we can compute the posterior probability of getting $T(d^{\rm rep},\theta) > T(d,\theta)$:

$\;\;\;\;\;\;\;{\rm Pr}(T(d^{\rm rep},\theta)>T(d,\theta)\,|\,d) = \int I(T(d^{\rm rep},\theta)>T(d,\theta))\,{\rm Pr}(d^{\rm rep}\,|\,\theta)\,{\rm Pr}(\theta\,|\,d)\;d\theta\,dd^{\rm rep}$


* Reduced $\chi^2(d,\theta)$ is a good example of a discrepancy measure! 
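A minimal sketch of a discrepancy-measure check with $\chi^2(d,\theta)$ follows. The only change from a plain test statistic is that the observed discrepancy is re-evaluated for every posterior sample $\theta$. The dataset is hypothetical, and the posterior is assumed Gaussian (flat priors, Gaussian errors) so it can be sampled directly. Dividing by $N_{\rm dof}$ to get the reduced $\chi^2$ would rescale both sides equally and leave the probability unchanged.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data with curvature, checked against a straight-line model
x = np.linspace(0.0, 10.0, 30)
sigma = np.full_like(x, 1.0)
y_obs = 0.2 * x**2 + 1.0 + rng.normal(0.0, sigma)

# Gaussian posterior for (m, b), assuming flat priors and Gaussian errors
A = np.vstack([x, np.ones_like(x)]).T
cov = np.linalg.inv(A.T @ (A / sigma[:, None] ** 2))
mean = cov @ (A.T @ (y_obs / sigma**2))
samples = rng.multivariate_normal(mean, cov, size=2000)

def chisq(y, m, b):
    """Discrepancy T(d, theta): chi-squared of dataset y about the line m x + b."""
    return np.sum(((y - (m * x + b)) / sigma) ** 2)

T_obs = np.empty(len(samples))
T_rep = np.empty(len(samples))
for k, (m, b) in enumerate(samples):
    y_rep = m * x + b + rng.normal(0.0, sigma)   # replica drawn at this theta
    T_obs[k] = chisq(y_obs, m, b)                # observed data, same theta
    T_rep[k] = chisq(y_rep, m, b)

ppp = np.mean(T_rep > T_obs)
print("Pr(T(d_rep, theta) > T(d, theta) | d) =", ppp)
```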

# Discrepancy Measure: $\chi^2$

$\;\;\;\;\;\;\;{\rm Pr}(T(d^{\rm rep},\theta)>T(d,\theta)\,|\,d) = 0.0\%$

<img src="../graphics/modelcheck-linear-discrepancy.png" width=60%>

# $p$-values


* Bayesian posterior predictive $p$-values can be thought of as an extension of the classical hypothesis testing $p$-value


* The extension is to include prior information and to cope with arbitrary sampling distributions, via its focus on the full posterior PDF.

# Model Comparison

* If our model is inadequate, we'll need an alternative. How should we compare it to our first one? Accuracy is not the only consideration: [efficiency is also important](../notes/Evidence.ipynb).


* Returning to the straight line problem, let's [compare two models via the Bayesian evidence](../examples/StraightLine/ModelEvaluation.ipynb#Model-Expansion).

# Model Comparison with the Bayesian Evidence

* The evidence for model $H$, ${\rm Pr}(d\,|\,H)$, enables a form of Bayesian hypothesis testing: model comparison with the "evidence ratio" or "Bayes Factor":

$\;\;\;\;\;\;\;R = \frac{{\rm Pr}(d\,|\,H_1)}{{\rm Pr}(d\,|\,H_0)}$


* This quantity is similar to a likelihood ratio, but it's a *fully marginalized likelihood ratio* - which is to say that it *takes into account our uncertainty about values of the parameters of each model by integrating over them all.*

* As well as predictive accuracy, the other virtue a model can have is *efficiency*. Typically we are interested in models that both fit the data well, and are also somehow "natural" - that is, not contrived or fine-tuned. 


* Contrived models have high likelihood in only small regions of their parameter spaces - and it turns out such models are penalized automatically by the Bayesian evidence.


* The evidence is able to capture both a model's accuracy and its efficiency because it summarizes *all* the information we put into our model inferences, via both the data *and* our prior beliefs. 


* You can see this by inspection of the evidence, or fully marginalized likelihood (#FML), integral:

$\;\;\;\;\;\;\;{\rm Pr}(d\,|\,H) = \int\;{\rm Pr}(d\,|\,\theta,H)\;{\rm Pr}(\theta\,|\,H)\;d\theta$
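For a model with a single parameter, the evidence integral can be done by brute force. The sketch below assumes a hypothetical model $H$ (Gaussian data with unknown mean $\theta$ and a uniform prior on $\theta$). It compares numerical quadrature with a simple Monte Carlo average of the likelihood over prior draws. All names and numbers are illustrative only.

```python
import numpy as np
from scipy import integrate, stats

rng = np.random.default_rng(3)

# Hypothetical model H: d_i ~ Normal(theta, sigma), theta ~ Uniform(lo, hi)
sigma = 1.0
d = rng.normal(0.7, sigma, size=10)
lo, hi = -5.0, 5.0
prior_density = 1.0 / (hi - lo)

def likelihood(theta):
    """Pr(d | theta, H), vectorized over an array of theta values."""
    theta = np.atleast_1d(theta)
    pdfs = stats.norm.pdf(d[None, :], loc=theta[:, None], scale=sigma)
    return np.prod(pdfs, axis=1)

# Evidence: integrate likelihood x prior over theta
# (points= tells quad where the likelihood peak is)
evidence, abserr = integrate.quad(
    lambda t: likelihood(t)[0] * prior_density, lo, hi, points=[d.mean()]
)

# Cross-check: the evidence is the prior-averaged likelihood
theta_draws = rng.uniform(lo, hi, size=20000)
evidence_mc = np.mean(likelihood(theta_draws))

print("quadrature:", evidence, " Monte Carlo:", evidence_mc)
```

Neither approach scales well to many parameters, which is why dedicated evidence-estimation methods exist.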

# Evidence Illustration 

* The following figure might help illustrate how the evidence depends on both goodness of fit (through the likelihood) and the complexity of the model (via the prior). 

<img src="../graphics/evidence.png">

* In this 1D case, a Gaussian likelihood (red) is integrated over a uniform prior (blue): the evidence can be shown to be given by $E = f \times L_{\rm max}$, where $L_{\rm max}$ is the maximum possible likelihood, and $f$ is the fraction of the blue dashed area that is shaded red. $f$ is 0.31, 0.98, and 0.07 in each case.

The illustration above shows us a few things:

1) The evidence can be made arbitrarily small by increasing the prior volume: the evidence is more conservative than focusing on the goodness of fit ($L_{\rm max}$) alone.  Of course if you assign a prior you don't believe, then you should not expect to get out a meaningful answer for ${\rm Pr}(d\,|\,H)$.

2) The evidence is linearly sensitive to prior volume ($f$), but exponentially sensitive to goodness of fit ($L_{\rm max}$). It's still a likelihood, after all.
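Both points can be checked with a few lines of code. This sketch assumes the 1D setup of the figure: a Gaussian-shaped likelihood of width `s` and peak height `L_max`, and a uniform prior on `(lo, hi)`. It takes $f$ to be the area under the peak-normalized likelihood within the prior range, divided by the prior width. The function name and numbers are made up for illustration.

```python
import numpy as np
from scipy.special import erf

def evidence_uniform_prior(L_max, theta_hat, s, lo, hi):
    """Evidence for L(theta) = L_max exp(-(theta - theta_hat)^2 / (2 s^2))
    under a Uniform(lo, hi) prior. Returns (evidence, f)."""
    root2s = np.sqrt(2.0) * s
    # Area under the peak-normalized likelihood, inside the prior range
    area = s * np.sqrt(np.pi / 2.0) * (
        erf((hi - theta_hat) / root2s) - erf((lo - theta_hat) / root2s)
    )
    f = area / (hi - lo)   # fraction of the prior "box" filled by the likelihood
    return f * L_max, f

# Doubling the prior width halves the evidence (likelihood well inside the prior)
E_narrow, f_narrow = evidence_uniform_prior(1.0, 0.0, 0.5, -5.0, 5.0)
E_wide, f_wide = evidence_uniform_prior(1.0, 0.0, 0.5, -10.0, 10.0)
print(E_narrow / E_wide)

# A fit worse by Delta chi^2 = 10 lowers L_max, and so the evidence, by exp(-5)
E_worse, _ = evidence_uniform_prior(np.exp(-5.0), 0.0, 0.5, -5.0, 5.0)
print(E_worse / E_narrow)
```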

The evidence ratio can, in principle, be combined with the ratio of priors for each model to give us the relative probability for each model being true, given the data:

$\frac{{\rm Pr}(H_1|d)}{{\rm Pr}(H_0|d)} = \frac{{\rm Pr}(d|H_1)}{{\rm Pr}(d|H_0)} \; \frac{{\rm Pr}(H_1)}{{\rm Pr}(H_0)}$


Prior probabilities for models are very difficult to assign in most practical problems (notice that no theorist ever provides them). So, one way to interpret the evidence ratio is to note that:

  * If you think that having seen the data, the two models are *still equally probable,*
  
  * then the evidence ratio in favor of $H_1$ gives you the odds that you would have had to have been willing to take against $H_1$, before seeing the data.

  * That is: the evidence ratio updates the prior ratio into a posterior one - as usual.
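The update from prior odds to posterior odds is a single multiplication, usually best done with log evidences to avoid numerical underflow. In this sketch the function name and the log evidence values are hypothetical:

```python
import numpy as np

def posterior_odds(log_evidence_1, log_evidence_0, prior_odds=1.0):
    """Posterior odds Pr(H1|d) / Pr(H0|d) from log evidences and prior odds."""
    log_R = log_evidence_1 - log_evidence_0   # log of the evidence ratio R
    return np.exp(log_R) * prior_odds

# Hypothetical log evidences: H1 is favored by a factor R = exp(2)
log_E1, log_E0 = -10.0, -12.0
R = posterior_odds(log_E1, log_E0)            # equal prior odds: posterior odds = R

# Prior odds of 1/R (odds of R against H1) leave the two models equally probable
break_even = posterior_odds(log_E1, log_E0, prior_odds=1.0 / R)
print(R, break_even)
```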

# Model Evaluation

* It is necessary but not sufficient to write down our model assumptions: we must also test them.

* Model checking for prediction accuracy (or "goodness of fit") is best first done _visually_, in _data space_.

* Posterior predictive checks, using well-designed test statistics and discrepancy measures, can then quantify the inaccuracy.

* The next step is typically model expansion: improving the model's ability to fit the data by changing its form, perhaps adding more parameters. 

* Comparing alternative models against each other can be done just with accuracy (via the check results), or with efficiency considerations as well, using the Bayesian evidence.