# Reconsidering the p-value

1. Revisiting the p-value
2. The Don't's and Do's
3. ATOM

The lecture draws from Wasserstein et al. (2019). "Moving to a world beyond 'p<0.05'." The American Statistician, 73(sup1), 1-19.

---
# 1. Revisiting the p-value

As we have mentioned before in this class, the p-value used in inference tests (specifically null hypothesis tests) is a delicate thing. Many people jump to asking whether or not a _p-value_ is larger or smaller than a target value (e.g., 0.05). But let's take a step back and probe a bit more what we actually mean when we say that a test is "statistically significant".

As we pointed out in the discussion on permutation tests, the Fisherian p-value reflects the probability that you would get data at least as extreme as the data you observed if the null hypothesis ($H_0$) were true.

$$ p = P(X | H_0) $$

Let's investigate what this means a little deeper. Most null hypothesis significance tests (NHSTs) assume that you can infer this probability by knowing the _sampling distribution_ of a test statistic given that $H_0$ is true. 

Think about it in the context of the resampling methods we discussed earlier. Imagine that you collect a particular sample of data, with a specific number of observations (e.g., N=20), and compute a given test statistic. For this example let's do something simple like a t-test. We want to know "is this significant?"

To understand this (and get a sampling distribution of our test statistic) we'll want to imagine a world where we can create a bunch of samples of the same size, but knowing that $H_0$ is true. To create a probability distribution we sample and sample and sample again, letting chance produce a distribution of t-statistics under the condition where the $H_0$ is true.

You can then calculate how likely your original test statistic is under this sampling distribution by counting the proportion of times you see a test statistic at least as large in the sampling distribution when $H_0$ is true.

$$ p = P( t(x^{null}|H_0) \geq t(x) ) $$

This process is illustrated in the figure below.

![Sampling Distribution](imgs/L19_SamplingDistribution.png)
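The procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the method used to produce the figure: the "null world" is assumed to be a normal population with mean 0, the observed sample is made up, and the function names (`t_statistic`, `simulated_p_value`) are hypothetical.

```python
import math
import random
import statistics


def t_statistic(sample, mu0=0.0):
    """One-sample t-statistic for H0: the population mean equals mu0."""
    standard_error = statistics.stdev(sample) / math.sqrt(len(sample))
    return (statistics.fmean(sample) - mu0) / standard_error


def simulated_p_value(observed, n_sims=5000, seed=0):
    """Proportion of null-world t-statistics at least as large as the observed one."""
    rng = random.Random(seed)
    n = len(observed)
    t_obs = t_statistic(observed)
    # Sample again and again at the same sample size, in a world where H0 is
    # true by construction (assumption: normal population with mean 0).
    null_ts = [
        t_statistic([rng.gauss(0.0, 1.0) for _ in range(n)])
        for _ in range(n_sims)
    ]
    return sum(t >= t_obs for t in null_ts) / n_sims


# Hypothetical observed sample: N=20 draws from a population whose mean is 0.5.
rng = random.Random(42)
observed = [rng.gauss(0.5, 1.0) for _ in range(20)]
print(f"t = {t_statistic(observed):.2f}, "
      f"one-sided p = {simulated_p_value(observed):.4f}")
```

The p-value here is one-sided, matching the $\geq$ in the equation above. A two-sided version would compare absolute values of the t-statistics instead.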

Now traditional parametric NHSTs don't do what we learned in the resampling methods lecture. Instead, the sampling distribution is assumed (provided that your data meet the assumptions of the test statistic) and we rely on look-up tables to estimate the p-value. This is mostly done for convenience given how difficult computation was before computers.
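To see how a look-up table relates to a simulated sampling distribution, the sketch below checks one tabulated number against simulation. It assumes normally distributed data with N=20, and uses 2.093, the standard two-sided 5% critical value for a t-distribution with 19 degrees of freedom. The function names are hypothetical.

```python
import math
import random
import statistics

T_CRIT = 2.093  # tabulated two-sided 5% critical value, t-distribution, 19 df


def t_statistic(sample):
    """One-sample t-statistic for H0: the population mean is 0."""
    standard_error = statistics.stdev(sample) / math.sqrt(len(sample))
    return statistics.fmean(sample) / standard_error


def exceedance_rate(n=20, n_sims=5000, seed=0):
    """Fraction of simulated null t-statistics beyond the tabulated cutoff."""
    rng = random.Random(seed)
    null_ts = (
        t_statistic([rng.gauss(0.0, 1.0) for _ in range(n)])
        for _ in range(n_sims)
    )
    return sum(abs(t) > T_CRIT for t in null_ts) / n_sims


print(f"simulated P(|t| > {T_CRIT}) = {exceedance_rate():.3f}")
```

When the test's assumptions hold, the simulated rate should land close to 0.05. If the data violate those assumptions (e.g., heavy skew with small N), the tabulated value and the simulation can drift apart.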



---

These NHSTs (i.e., tests that give you p-values) have several major problems.

<br>

(1) p-values depend on **unobserved data**.

The p-value is computed from a sampling distribution of hypothetical data sets that were never actually collected. In parametric statistics we assume a shape or family of the null distribution, as in the example used above. Now this may seem like a problem restricted to parametric statistics, but non-parametric statistics also make assumptions about the nature of chance (e.g., noise is _iid_), which may not always be correct.

<br>

(2) p-values depend on unknown and **subjective intentions.**

This makes sense if you consider the problem of "p-hacking". One form of p-hacking is when you repeatedly test your data, as you collect your sample, until the p-value passes a specific threshold. Here the subjective intention of "monitoring" your data leads to the case where you're more likely to interpret a finding as being significant when it is just a random configuration of $X$ that happens to be an extreme value.
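A small simulation makes the cost of this kind of monitoring concrete. The sketch below is illustrative only: it assumes a z-test with a known standard deviation of 1, a null world where the true mean is 0, and a hypothetical researcher who tests after every new observation from N=10 up to N=100.

```python
import math
import random
from statistics import NormalDist


def two_sided_p(sample):
    """Two-sided z-test p-value for H0: mean = 0, assuming a known SD of 1."""
    z = sum(sample) / math.sqrt(len(sample))
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def false_positive_rate(peek, n_experiments=2000, n_start=10, n_max=100,
                        alpha=0.05, seed=1):
    """Fraction of null experiments that end up declared 'significant'."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_experiments):
        # H0 is true in every simulated experiment: the mean really is 0.
        data = [rng.gauss(0.0, 1.0) for _ in range(n_max)]
        if peek:
            # Test after each new observation; stop at the first p < alpha.
            significant = any(
                two_sided_p(data[:n]) < alpha
                for n in range(n_start, n_max + 1)
            )
        else:
            # Test once, at the planned sample size.
            significant = two_sided_p(data) < alpha
        hits += significant
    return hits / n_experiments


print(f"fixed N: {false_positive_rate(peek=False):.3f}")
print(f"peeking: {false_positive_rate(peek=True):.3f}")
```

With a single planned test the false positive rate should sit near $\alpha$. With peeking it climbs well above it, even though the data and the test are identical. Only the intention behind the stopping rule has changed.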

Another subjective aspect of p-values is the very concept of using $\alpha = 0.05$ in the first place. This popular threshold is arbitrary and used only for historical reasons (see below).

<br>

(3) p-values do not quantify **statistical evidence**

The p-value is a measure of the _existence_, not the magnitude, of an effect. For example, a _p=0.1_ does not indicate substantially more evidence for a null hypothesis than a _p=0.2_ or _p=0.06_.

Consider the case of the **p postulate**, the assumption that equal p-values provide equal evidence against the null hypothesis. Imagine you have two experiments with different sample sizes: Experiment 1 has 10 participants, while Experiment 2 has 100 participants. Imagine that we perform the same NHST on both data sets and get the same p-value (e.g., _p=0.01_). The p-value only tells you how probable a result at least this extreme would be if the null were true, so taken at face value both experiments appear to provide equal evidence against the null hypothesis. However, Experiment 1 has a smaller sample size, which means that its observed effect size must be much larger than in Experiment 2 to reach the same p-value. This means there is _more_ evidence against $H_0$ in Experiment 1 than in Experiment 2. But this fact isn't reflected in the traditional NHST.
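The link between sample size and effect size at a fixed p-value can be worked out directly. The sketch below assumes a one-sample z-test, where $z = d\sqrt{n}$ and $d$ is the standardized effect size. A t-test would give somewhat different numbers, but the same pattern. The function name is hypothetical.

```python
import math
from statistics import NormalDist


def implied_effect_size(p, n):
    """Standardized effect size d that yields a two-sided p-value of p
    in a one-sample z-test with n observations (z = d * sqrt(n))."""
    z = NormalDist().inv_cdf(1.0 - p / 2.0)
    return z / math.sqrt(n)


for n in (10, 100):
    print(f"n = {n:>3}: d = {implied_effect_size(0.01, n):.2f}")
```

Under these assumptions, the experiment with 10 participants needs a standardized effect about $\sqrt{10} \approx 3.2$ times larger than the experiment with 100 participants to arrive at the same _p=0.01_.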

<br>

(4) p-values are **poorly understood**

![p values](imgs/L19_pvalues.png)

You can _never_ prove the $H_0$ with standard NHSTs! You can only measure evidence against it.
So you can never actually confirm that the null is true; you can only reject it or fail to reject it.

---
# 2. The Don't's and Do's

In the Wasserstein review that you read for this lecture, the authors lay out many ways in which NHSTs are incorrectly used.

The XKCD comic above highlights many ways in which p-values are incorrectly used. The authors of the reading highlight several things you should **not** do with p-values.

**Don't**:

* Base your conclusion solely on whether an effect was "statistically significant".

* Believe that an effect exists just because it was "statistically significant".

* Believe an effect _does not_ exist just because it was not "statistically significant".

* Believe your p-value gives you the probability that chance alone produced the effect (or the probability that your hypothesis is true).

The basic idea is that p-values above or below a threshold should not be the arbiter of truth. As the authors put it, "researchers are free to treat 'p=0.051' and 'p=0.049' as not being categorically different." Thus the magical significance of "p<0.05" disappears.

In fact, there is a strong case to be made for avoiding the term "statistical significance" altogether (which is why I kept it in quotes above). Fisher never intended statistical significance to imply scientific importance. **It was simply meant to be a flag to indicate that a result warrants further scrutiny.**

Yet, the misconception persists almost a century later. As a result, the 0.05 threshold has taken on an almost magical status in science. Careers are made or broken based on what side of the 0.05 line a critical test falls. It is becoming increasingly clear that this is causing more problems than it solves. As the authors point out, "no p-value can reveal the plausibility, presence, truth, or importance of an association or effect."

Now this does not mean that p-values themselves are meaningless. Quite the contrary. P-values tell you something very relevant about your data; it is just that reporting p-values requires also being able to interpret them correctly. We will explore the value of p-values in the next section.


---
# 3. ATOM

In order to advise on the proper way to interpret p-values, the authors composed the (somewhat awkwardly arranged) acronym: ATOM.

* [A]ccept uncertainty.

* Be [T]houghtful, [O]pen, and [M]odest.

Let us explore these ideas in order.

## Accept uncertainty

Statistics is a field premised on the issue of uncertainty (otherwise, we would just be doing plain old math). However, the introduction of p-values, and how most people interpret them, has pushed people to a false sense of certainty. _If_ a p-value falls below 0.05 (or whatever your $\alpha$ is), _then_ you are certain your effect is "real". 

The idea of accepting uncertainty is meant to get rid of this implicit comfort of relying on the p-value as a false measure of certainty. Rather, we should move to a stage where uncertainty is not only accepted, but embraced. There are many ways of quantifying uncertainty in order to inform the reader how to temper their inferences (we will go over one of these approaches in the next lecture).

The authors point out that there are two ways in which accepting and embracing uncertainty can change things. 

1. It pushes researchers to seek better measures and push for the highest quality of data possible.

2. It moves researchers away from the false certainty of "statistical significance" by default.

## Be thoughtful

Here the authors are endorsing "statistical thoughtfulness", which consists of relying on clearly defined (analytical) objectives (e.g., exploratory vs. hypothesis-driven analysis), investing in data quality, and pushing for multiple analytical approaches to the same question.

The authors promote several ideas of what "thoughtful research" does:

* It prioritizes quality of data production, experimental design, and quality of study execution.

* It considers context and prior evidence in evaluating a statistical finding (i.e., a p-value).

* It looks ahead to prospective outcomes in the context of theory and previous research. (No p-value lives in a vacuum.)

* It includes careful consideration of effect sizes.

* It considers mechanisms, possible confounds, internal/external validity, etc.

* It relies on a "toolbox" of methods, not just one.

* It focuses on clear, concise, and effective communication of the confidence or credibility of any result.

## Openness

Being thoughtful about your analysis leads to the concept of openness, which is the embracing of positive practices in the development and presentation of research work.

This mostly follows along the idea of [Open Science](https://en.wikipedia.org/wiki/Open_science) as a field. If you can share your data and the steps you took to get your analysis, it not only forces you to be thoughtful, but also promotes a variety of viewpoints on the same data set.

Openness primarily comes into play with reporting, but goes beyond just sharing code or data. For example, report p-values as continuous descriptive statistics, instead of just a binary "significant"/"non-significant" value (i.e., "p<0.05"). Also, include confidence measures on the effect itself (e.g., confidence intervals on the test statistic you are also evaluating with an NHST).
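As one possible way to put this reporting advice into practice, the sketch below reports an effect (a difference in group means) together with a bootstrap confidence interval and a continuous p-value. The data are made up, and the choice of a percentile bootstrap and a permutation test is an assumption made for illustration, not a recommendation from the reading.

```python
import random
import statistics


def mean_difference(a, b):
    return statistics.fmean(a) - statistics.fmean(b)


def bootstrap_ci(a, b, n_boot=5000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the difference in means."""
    rng = random.Random(seed)
    diffs = sorted(
        mean_difference(rng.choices(a, k=len(a)), rng.choices(b, k=len(b)))
        for _ in range(n_boot)
    )
    lower = diffs[int((1.0 - level) / 2.0 * n_boot)]
    upper = diffs[int((1.0 + level) / 2.0 * n_boot) - 1]
    return lower, upper


def permutation_p(a, b, n_perm=5000, seed=0):
    """Two-sided permutation p-value for the difference in means."""
    rng = random.Random(seed)
    observed = abs(mean_difference(a, b))
    pooled = list(a) + list(b)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # relabel the observations at random
        if abs(mean_difference(pooled[:len(a)], pooled[len(a):])) >= observed:
            count += 1
    # Count the observed labelling as one permutation so p is never exactly 0.
    return (count + 1) / (n_perm + 1)


# Hypothetical data: two groups of 30 observations each.
rng = random.Random(7)
group_a = [rng.gauss(0.4, 1.0) for _ in range(30)]
group_b = [rng.gauss(0.0, 1.0) for _ in range(30)]

low, high = bootstrap_ci(group_a, group_b)
print(f"difference = {mean_difference(group_a, group_b):.2f}, "
      f"95% CI [{low:.2f}, {high:.2f}], "
      f"p = {permutation_p(group_a, group_b):.3f}")
```

The printed line gives the reader the size of the effect, a range of plausible values for it, and the p-value as a number, rather than only a "significant"/"non-significant" label.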

Openness has the advantage of tackling issues like publication bias, where only positive (or significant) results are reported, leaving a false impression of the false positive or false negative rate. While this can mitigate the so-called "[file drawer problem](https://en.wikipedia.org/wiki/Publication_bias)" in scientific publication, it also forces researchers to recognize that evidence goes beyond a single study. Real evidence for or against an effect requires external replication, which means the validity of an effect isn't determined by a single lab or team, but by a collective field. Thus, providing sufficient information to allow others to execute the research in a meaningful way becomes crucial.

## Be modest

A critical part of openness is modesty, which requires an understanding and clear communication of the limitations of your work. There is no such thing as the "perfect study". All studies have their flaws. Clearly communicating those flaws and the limitations that they could introduce tempers the "flash" of a study, but also increases the reliability of the findings by not overstating the effects.

When it comes to evaluating hypotheses, this means recognizing that each NHST measures a **specific** and **narrow** problem. For example, a t-test only evaluates the null hypothesis that two groups, assumed to have equal underlying variances, have the same mean. This is a limited scope for the broader question "Are these two groups different?". Thus, you want to show how your effect holds across an entire family of relevant tests (e.g., t-test, Bayes Factor, permutation test).

This realization that answering a question requires looking at it from multiple angles further encourages openness to report _everything_ within a single study, because others in the future may wish to evaluate it with different tests that may not even exist yet. If your effect holds under these new conditions, it increases the perceived reliability of your findings.

Being modest also supports openness because it makes it safer to encourage others to reproduce your work. If you make bold (or outlandish) inferences from a single effect (or study), it is harder for researchers to replicate your _conclusions_ because they are inflated to begin with.

Finally, and most importantly, modesty means that you should take the role of neutral judge, as opposed to advocate, of your results. The results are what they are. They may add support to a theory (or not), but the only real conclusion requires extensive replication and validation work outside of your individual study or lab. So infer with caution and neutrality as much as possible.
    
---

## Some cautionary thoughts

While the points made by Wasserstein and colleagues (as well as the other articles in the special issue) are ideal, they can be tempered by practical considerations. For example, implementing the ATOM approach requires some dramatic cultural changes from how science is currently done. This is easier for people at more established stages of their career than for early career researchers. It is perfectly okay to be practical when striving for perfection. This doesn't mean ignoring the recommendations by Wasserstein et al.; it just means implementing them with care when you can. It is a slow push to change the system and it takes a collective effort.

