# Notes: How to Avoid Fooling Ourselves
## Detection, Fishing, Experimenter Bias, and Consistency

Goals:

* See how discoveries can be assessed in both the Bayesian and Frequentist frameworks

* Understand the dangers of fishing expeditions

* Understand unconscious experimenter bias, and how blinding can mitigate it

* See how to quantify dataset consistency, when bringing multiple observations together

_"The first principle [of science] is that you must not fool yourself, and you are the easiest person to fool."_ - Richard Feynman

## Detection

Source detection is different in character from parameter estimation. Initially, we are less interested in the properties of a new source than we are in its existence. Detection is therefore a model comparison or hypothesis testing problem:

  * $H_0$: the "Null Hypothesis," that there is no source present
  
  * $H_1$: the "Alternative Hypothesis," that there is a source present, e.g. with flux $f$ and position $(x,y)$

### Bayesian Detection

Within Bayesian analysis, we calculate and compare the evidence for each model, $P(d|H_0)$ and $P(d|H_1)$. Their ratio gives the relative probability of getting the data under the alternative and null hypotheses (and is equal to the relative probabilities of the hypotheses being true, up to a model prior ratio).

Calculating the evidence ratio involves marginalizing over the alternative hypothesis' model parameters, given the prior PDFs that we assigned when defining $H_1$. Weakening the prior on the source position and flux (by, for example, expanding their ranges) makes any given point in parameter space less probable a priori. So, as we've seen, weaker priors decrease the evidence for $H_1$, making the detection less significant (but only linearly in the prior volume, remember).
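
The dependence on prior volume can be checked with a toy calculation. The sketch below is not from the original notes: it assumes a single measurement $d$ with Gaussian noise of known width, a null hypothesis with flux exactly zero, and an alternative with a uniform flux prior on $[0, f_{\rm max}]$. The function names and numbers are made up for illustration.

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def evidence_ratio(d, sigma, f_max):
    """P(d|H1) / P(d|H0) for one measurement d ~ Normal(f, sigma).

    H0 fixes f = 0; H1 assigns a uniform prior to f on [0, f_max] (an assumption).
    """
    ev_h0 = normal_pdf(d, 0.0, sigma)
    # Marginalize the Gaussian likelihood over the uniform prior analytically.
    ev_h1 = (normal_cdf(d / sigma) - normal_cdf((d - f_max) / sigma)) / f_max
    return ev_h1 / ev_h0

d, sigma = 4.0, 1.0  # a made-up "4 sigma" measurement
ratios = {f_max: evidence_ratio(d, sigma, f_max) for f_max in (10.0, 100.0, 1000.0)}
for f_max, ratio in ratios.items():
    print(f"f_max = {f_max:7.1f}   P(d|H1)/P(d|H0) = {ratio:.3g}")
```

Once the prior range comfortably contains the likelihood, each tenfold widening of the prior reduces the evidence ratio by a factor of ten, which is the linear scaling mentioned above.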

### Frequentist Detection

Instead of working towards the relative probabilities of two hypotheses, the frequentist approach is to attempt to reject the null hypothesis by showing that it would be too improbable for the data to have been generated by it. It turns out that the most powerful statistic to use in this hypothesis test is the likelihood ratio. ("Power" is defined to be the probability that the null hypothesis is rejected when the alternative hypothesis is true.)

To find the likelihood ratio, one would maximize the likelihood for the parameters of each hypothesis, and form the test statistic

$T_d = 2 \log \frac{L(\hat{\theta},H_1)}{L(H_0)}$,

where $\theta$ are the parameters describing the source. We then inspect the distribution of test statistics $T$ over an ensemble of hypothetical datasets generated from a model with no source (the null hypothesis), and compute the $p$-value $P(T > T_d)$, the probability that a source-free dataset would yield a test statistic at least as large as the one observed. If $p < \alpha$ we then say that "we can reject the null hypothesis at the $\alpha$ significance level" (equivalently, at the $100\,(1-\alpha)$ percent confidence level). We could find the distribution of $T$ by simulation or by approximating it as $\chi^2$, as we did when looking at [model evaluation](modelevaluation.ipynb).

#### Fishing expeditions

The test statistic depends only on the estimated values of the source parameters, and reports the likelihood ratio between two _discrete_ hypotheses. We need to account for the fact that we "went fishing" (i.e., we searched for the source throughout parameter space).

One way to do this in general is the classical _Bonferroni correction_. If we carry out $m$ independent "trials", each tested at significance level $\alpha$, we would expect to get at least one positive result by chance in a fraction $\alpha' = 1 - (1-\alpha)^m \leq m \alpha$ of cases. Even if the trials are not independent, the bound $\alpha' \leq m \alpha$ still holds. The Bonferroni correction therefore involves comparing $p$-values to a threshold $\alpha / m$, in order to test (and report) at the $\alpha$ significance level overall. The factor of $m$ relating the two thresholds is sometimes referred to as the "trials factor," and the issue described here is known in the statistics literature as the "multiple comparisons" problem, and in (astro)particle physics as the "look elsewhere effect." The correction is analogous to the decrease in evidence that we suffer from using wide priors in the Bayesian approach.
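
A quick simulation illustrates both the problem and the correction. This is a sketch under the assumption of $m$ independent trials with no real signal anywhere, so that every $p$-value is uniformly distributed; the values of $\alpha$ and $m$ are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, m, n_experiments = 0.05, 20, 100_000

# Under the null hypothesis, p-values are uniformly distributed on [0, 1].
p_values = rng.uniform(size=(n_experiments, m))
p_min = p_values.min(axis=1)

# Fraction of experiments with at least one spurious "detection".
rate_uncorrected = np.mean(p_min < alpha)      # each trial tested at alpha
rate_bonferroni = np.mean(p_min < alpha / m)   # each trial tested at alpha / m

expected_uncorrected = 1.0 - (1.0 - alpha) ** m

print(f"uncorrected false-positive rate: {rate_uncorrected:.3f} "
      f"(expected {expected_uncorrected:.3f}, bound m*alpha = {m * alpha:.2f})")
print(f"Bonferroni false-positive rate:  {rate_bonferroni:.3f} (target {alpha})")
```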

## Post-Hoc Analysis

The issues inherent in fishing are not limited to detection problems. Consider the following scenario:

> You take some data with the aim of fitting a linear relation. However, when you look at the data, it seems like a linear model won't be a good description, so you also fit a quadratic, and the posterior for the second-order coefficient is not consistent with zero. The paper, naturally, notes that you have detected a departure from the simple linear model expected.

The issue here is that the data are used more than once: both for defining the model to be fit and also for fitting the parameters of the model. This is not necessarily forbidden. But we need to consider the possibility that the features that motivated using a higher-order model were only a random fluctuation, with the true model actually being just linear. We can't take the constraint on the second-order coefficient at face value because we only included it in the model to begin with because the data appeared to prefer it.

Note that this consideration would be moot if we had planned to test linear vs quadratic models from the beginning, before ever looking at the data. But, more generally, we need to be cautious of cases where many additional hypotheses are introduced (even when they are not motivated by examining the data), simply because the chance of one of them randomly appearing significant increases. This, too, is known as fishing, and we would need to account for the number of tests done, as above.
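
To get a feel for the size of this effect, the following sketch simulates many datasets from a purely linear model and then tries several extensions of it, one at a time. Everything here is an illustrative assumption: the particular candidate terms, the known noise level, and the "2 sigma" criterion for calling a coefficient significant.

```python
import numpy as np

rng = np.random.default_rng(7)
n_pts, sigma, n_sims = 30, 1.0, 20_000
x = np.linspace(0.0, 1.0, n_pts)

# The true model is exactly linear; each column of y is one noisy realization.
y = (1.0 + 2.0 * x)[:, None] + rng.normal(0.0, sigma, (n_pts, n_sims))

# Hypothetical extra terms one might be tempted to try.
candidates = {
    "quadratic": x ** 2,
    "cubic": x ** 3,
    "sine": np.sin(2.0 * np.pi * x),
    "cosine": np.cos(2.0 * np.pi * x),
    "step": (x > 0.5).astype(float),
}

significant = {}
for name, term in candidates.items():
    A = np.column_stack([np.ones(n_pts), x, term])  # linear model plus one extra term
    cov = sigma ** 2 * np.linalg.inv(A.T @ A)       # parameter covariance (known sigma)
    beta = np.linalg.solve(A.T @ A, A.T @ y)        # least-squares fit to every dataset
    z = beta[2] / np.sqrt(cov[2, 2])                # extra coefficient in units of its error
    significant[name] = np.abs(z) > 2.0

rate_single = {name: flag.mean() for name, flag in significant.items()}
rate_any = np.any(list(significant.values()), axis=0).mean()

for name, rate in rate_single.items():
    print(f"{name:>9}: significant in {rate:.3f} of datasets")
print(f"at least one extension significant: {rate_any:.3f}")
```

Each extension on its own is spuriously "significant" at roughly the nominal rate, but the chance that at least one of them is significant is noticeably higher, even though the truth is linear in every simulated dataset.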

Post-hoc analysis is unavoidably common in astronomy, since our science is so often driven by noticing something unexpected in new observations. This doesn't mean we shouldn't report new discoveries, just that we should recognize the possibility that we are being fooled. The most robust way to test new models is with new data.

## Experimenter Bias

A particularly insidious post-hoc analysis problem is "unconscious experimenter bias". It might look like this:


> After producing some intermediate or final result, a researcher compares their findings with those of others or with prior expectations. If there is disagreement, the researcher is more likely to investigate potential problems before publishing, whereas if everything looks in agreement with expectations they might immediately publish.

In the worst case (short of actual fudging), a researcher might continue finding more and more bugs to fix, until the results come into line with expectations. (As Bayesians, we might ask: if your prior is so strong, why bother doing the experiment?) In the alternative hypothetical, the analysis might have just as many bugs, and the results might be completely wrong, yet it isn't checked as diligently. The same dynamic can be at play when one's work is being weighed by others; papers that confirm previous work and/or readers' biases are accepted with minimal scrutiny, while those that do not are looked at with suspicion. The net result is that there is a natural tendency towards "concordance" in the literature.

A practical solution to mitigate unconscious experimenter bias is this:
* The analysis team is prevented from completing their inference and viewing the results (including any intermediate results that can be used to predict the final result) until they deem their model complete (or, at least, adequate).
* The final inference is done once.
* Post-hoc analysis is then enabled, but with the understanding that it must be clearly identified as such, since experimenter bias is now potentially in play again.

The terms "blinding" and "blinded/unblinded" are frequently used in this context. [Endnote 1]

Before "opening the box" for the final inference run, one might use a method like:

* **Hidden signal box**: the subset of data believed most likely to contain the signal is removed

* **Hidden answer**: the numerical values of the parameters being measured are hidden

* **Adding/removing data (aka "salting")**: so that the team don't know whether they have detected a real signal or not

* Training on a **"pre-scaling" subset**: as in cross-validation
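
As one illustration, the "hidden answer" approach from the list above can be sketched by adding a secret offset to the parameter of interest wherever it is displayed. This is a minimal sketch, not a description of how any particular collaboration does it: the function names, the use of a secret string to seed the offset, and the uniform offset distribution are all assumptions of the example.

```python
import hashlib
import numpy as np

def hidden_offset(secret, scale):
    """Deterministic offset in [-scale, scale], derived from a secret string.

    Hypothetical helper: the analysts never print this value directly.
    """
    seed = int.from_bytes(hashlib.sha256(secret.encode()).digest()[:8], "big")
    return np.random.default_rng(seed).uniform(-scale, scale)

def blind(value, secret, scale):
    """Shift a parameter estimate by the hidden offset before anyone sees it."""
    return value + hidden_offset(secret, scale)

def unblind(blinded_value, secret, scale):
    """Remove the hidden offset; to be called once, when the box is opened."""
    return blinded_value - hidden_offset(secret, scale)

secret, scale = "agreed-upon passphrase", 0.1   # made-up values
estimate = 0.312                                # made-up parameter estimate

shown = blind(estimate, secret, scale)          # what plots and tables would display
print(f"value shown during analysis development: {shown:.3f}")
print(f"value after opening the box:             {unblind(shown, secret, scale):.3f}")
```

Because the same offset is applied every time, differences between analysis variants and the sizes of uncertainties remain visible for debugging, while the comparison with previous results or expectations stays hidden. The offset scale should be large compared with the differences one might otherwise be tempted to "fix."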

You might be wondering what the point of all this is, since the overall workflow hasn't changed: set up and perform an analysis, look at results, and (possibly) do something post-hoc. It's true that none of these methods is a panacea against experimenter bias. They have value to the extent that they force us to be skeptical of our own analysis, and therefore check everything carefully, to a greater extent than we might if everything came out as we expected immediately. This involves doing some qualitatively different things from what we might be used to:
* Organizing analyses in teams, and agreeing to abide by rules
* Temporarily censoring or adjusting datasets while inferences are developed
* Thinking _really, really_ hard about what diagnostics we might be able to look at without informing ourselves as to the results of interest

The perceived finality of opening the box is also beneficial. No one wants to have to admit that they went to all this trouble only to find a major bug after the box was open, so we usually end up checking everything as thoroughly as possible beforehand.

The process may be a pain, but it does seem to increase confidence in the results, both inside and outside the team.

## Consistency of independent constraints

When the same model can be constrained by independent data sets, those constraints might be perfectly consistent with one another. Hooray! We can jointly analyze the data, and get even tighter constraints.

On the other hand, the two sets of constraints could be inconsistent, or in tension with one another. We might conclude that there is a "systematic" error in the modeling of one or both data sets, or an error in the physical model.

How can we quantify the level of consistency between two sets of constraints (posterior PDFs)?

The most common (quick and dirty) quantification of tension between two datasets is the distance between the central values of their likelihoods (or posterior PDFs), in units of "sigma". Here "sigma" is usually taken to be the sum of the two posterior widths, in quadrature, assuming everything is Gaussian (quick/dirty, remember). This would normally be done with 1D or 2D marginalized posteriors, for a subset of parameters where we particularly care about consistency.

<img src="graphics/consistency.png"></img>
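
For 1D Gaussian posteriors, the quick-and-dirty tension measure described above amounts to a one-line calculation. The numbers below are made up, and the function name is just a label for this sketch.

```python
import math

def tension_sigma(mean_a, sigma_a, mean_b, sigma_b):
    """Separation of two central values in units of their widths added in quadrature."""
    return abs(mean_a - mean_b) / math.hypot(sigma_a, sigma_b)

n_sigma = tension_sigma(1.0, 0.3, 2.0, 0.4)

# Two-sided Gaussian probability of a discrepancy at least this large by chance.
p_chance = math.erfc(n_sigma / math.sqrt(2.0))

print(f"tension = {n_sigma:.2f} sigma, chance probability = {p_chance:.3f}")
```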

A more mathematically principled approach, which works in any number of dimensions, is to use the Bayesian evidence to quantify the overlap between likelihoods. Consider the following two models:

> $H_1$: Both datasets, $d_A$ and $d_B$, were generated from the same global model, with parameters $\theta$.

The evidence for $H_1$ is 

$P(\{d_A,d_B\}|H_1) = \int P(d_A|\theta,H_1)P(d_B|\theta,H_1)p(\theta|H_1)\,d\theta$.

This would be computed during a joint fit of both data sets.

> $H_2$: Each dataset, $d_A$ and $d_B$, was generated from its own local model, with parameters $\theta_A$ and $\theta_B$.

The evidence for $H_2$ is

$P(\{d_A,d_B\}|H_2) = \int P(d_A|\theta_A,H_2)P(d_B|\theta_B,H_2)p(\theta_A|H_2)p(\theta_B|H_2)\,d\theta_A d\theta_B = P(d_A|H_2)P(d_B|H_2)$.
    
For $H_2$, the evidence is just the product of the evidences computed during the two separate fits.

The Bayes factor is therefore

$\frac{P(\{d_A,d_B\}|H_2)}{P(\{d_A,d_B\}|H_1)} = \frac{P(d_A|H_2)P(d_B|H_2)}{P(\{d_A,d_B\}|H_1)}$.

If the inferences of $\theta_A$ and $\theta_B$ under $H_2$ are very different, we would see a large Bayes factor (the combined goodness of fit under $H_2$ would be greater than under $H_1$).
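
A toy version of this calculation can be done by brute-force integration. The sketch below assumes the simplest possible case, which is not spelled out in the notes: each dataset is a single Gaussian measurement of one parameter, and the same uniform prior is used for $\theta$, $\theta_A$ and $\theta_B$. The function name, prior range and measurement values are all illustrative.

```python
import numpy as np

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def bayes_factor_h2_over_h1(d_a, sigma_a, d_b, sigma_b, lo=-50.0, hi=50.0, n_grid=200_001):
    """P({d_A, d_B}|H2) / P({d_A, d_B}|H1) for two 1D Gaussian measurements.

    Assumes a uniform prior on [lo, hi] for every parameter; integrals are
    evaluated as simple sums on a fine grid.
    """
    theta = np.linspace(lo, hi, n_grid)
    dtheta = theta[1] - theta[0]
    prior = np.full_like(theta, 1.0 / (hi - lo))
    like_a = gauss(d_a, theta, sigma_a)
    like_b = gauss(d_b, theta, sigma_b)
    ev_h1 = np.sum(like_a * like_b * prior) * dtheta   # joint fit, shared theta
    ev_a = np.sum(like_a * prior) * dtheta             # separate fit to d_A
    ev_b = np.sum(like_b * prior) * dtheta             # separate fit to d_B
    return ev_a * ev_b / ev_h1

bf_consistent = bayes_factor_h2_over_h1(1.0, 0.3, 1.2, 0.4)   # measurements agree
bf_tension = bayes_factor_h2_over_h1(1.0, 0.3, 4.0, 0.4)      # measurements 6 sigma apart

print(f"Bayes factor (H2/H1), consistent data: {bf_consistent:.3g}")
print(f"Bayes factor (H2/H1), discrepant data: {bf_tension:.3g}")
```

As with detection, the result depends on the prior volume: widening the prior range penalizes $H_2$, which has more free parameters, so the Bayes factor should be interpreted with the chosen priors in mind.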

#### Optional further reading

* James Berger, ["The Bayesian Approach to Discovery"](https://indico.cern.ch/event/107747/contributions/32678/attachments/24371/35060/berger.pdf) (PHYSTAT 2011)

* Kyle Cranmer, ["Practical Statistics for the LHC"](https://arxiv.org/pdf/1503.07622.pdf)

* Gross & Vitells (2010), ["Trial factors for the look elsewhere effect in high energy physics"](https://arxiv.org/pdf/1005.1891.pdf)

* MacCoun & Perlmutter (2015), ["Hide Results to Seek the Truth"](http://www.nature.com/polopoly_fs/1.18510!/menu/main/topColumns/topLeftColumn/pdf/526187a.pdf) 

* Klein & Roodman (2005) ["Blind Analysis in Nuclear and Particle Physics"](https://www.pp.rhul.ac.uk/~cowan/stat/annurev.nucl.55.090704.pdf)

* Talks given at the 2017 KIPAC Workshop, "Blind Analysis in High-Stakes Survey Science: When, Why, and How?": http://kipac.github.io/Blinding/

#### Endnotes
1. The term "blinding" is somewhat problematic (personal opinion) and one that I am trying to avoid these days. First, there are some in the disabled community who, reasonably, object to the use of "blind" in the context of _voluntarily or intentionally_ not looking at something, as opposed to being fundamentally unable to do so. (Though I'm not aware of any particular outrage about the specific usage we're discussing here.) Second, the term is terribly imprecise. Saying you've done a blind analysis doesn't tell me _how_ experimenter bias was mitigated, and the details do matter. This imprecision also makes it easier for people to misuse the term as a buzzword, presenting work as superior when it may (and I've seen this done) be the complete _opposite_ of "blinded". Having said all that, I do not have an equally succinct replacement, and "blind" is the term you'll hear in the real world.