Understanding the p values

Thomas Roder edited this page Mar 14, 2023 · 6 revisions

Statistical tests used in Scoary

Fisher's test

Fisher's test determines if there are associations between two categorical variables.

Scoary2 uses Fisher's test to remove uncorrelated genes from the analysis. This avoids having to run the computationally intensive post-hoc test for all genes.

Because many genes are tested, the p-value from Fisher's test is adjusted for multiple testing, resulting in a q-value.
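As a rough illustration of the two steps above, the following sketch computes a two-sided Fisher's exact test for a 2x2 gene/trait table and then adjusts a list of p-values. The function names are hypothetical, and the Benjamini-Hochberg procedure is an assumption made for this example: this page does not state which correction method Scoary2 applies.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    For example: a = g+t+, b = g+t-, c = g-t+, d = g-t- isolate counts.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at most as likely as the observed one.
    total = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-7))
    return min(1.0, total)


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (assumed correction method)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values


# Hypothetical tables for three genes tested against one trait.
tables = [(8, 0, 0, 8), (3, 2, 2, 3), (6, 2, 1, 7)]
p_values = [fisher_exact_two_sided(*t) for t in tables]
q_values = benjamini_hochberg(p_values)
```

Genes whose q-value exceeds a chosen threshold would then be dropped before the more expensive post-hoc tests are run.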

Advantages

Simple and fast test that measures how strongly a gene and a trait correlate.

Disadvantages

Importantly, one of the assumptions of Fisher's test is violated in mGWAS: Each isolate does not have a random and independently distributed probability of exhibiting each state. (Why? See Brynildsrud, 2016, Figure 2.)

For this reason, the resulting p-value is not conservative enough and cannot be interpreted straightforwardly: Fisher's test will likely find too many significant associations. To deal with this, Scoary includes post-hoc tests based on pairwise comparisons (see below).

Further reading

Wikipedia, fast-fisher Python library

Best / worst pairwise comparisons

In contrast to Fisher's test, the pairwise comparisons test takes population structure into account: Instead of considering each isolate as an independent sample, this test focuses on evolutionary transitions. The goal is to find the maximum number of phylogenetically non-intersecting pairs of isolates that contrast in the state of both genotype and phenotype. See Brynildsrud, 2016.

Calculation

There are many ways of "picking" nodes in the tree as evolutionary transitions. Scoary2 implements the two extremes: the "best" (most optimistic) picking and the "worst" (most pessimistic) picking.

Let $c$ be the maximal number of non-intersecting contrasting pairs (see Brynildsrud, 2016).

  • A "best" picking is one with $c$ contrasting pairs where as many as possible ( $b$ ) support the hypothesis.
  • A "worst" picking is one with $c$ contrasting pairs where as many as possible ( $w$ ) contradict the hypothesis.

The null hypothesis $H_{0}$ is that $b$ (or $w$, respectively) does not differ significantly from $c/2$.

The p-values are calculated using the binomial test:

  • "best" p-value: $binom\_test(b,\ n=c)$
  • "worst" p-value: $binom\_test(w,\ n=c)$
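The binomial test above can be sketched in plain Python as follows. This version assumes a two-sided exact test with success probability 0.5, which is consistent with $H_{0}$ and with the values reported in the examples on this page (0.0078 for 8 of 8 pairs, 1 for 1 of 1 pair). The function is a stand-in written for illustration, not Scoary2's own implementation.

```python
from math import comb


def binom_test(k, n, p=0.5):
    """Two-sided exact binomial test (illustrative stand-in).

    k: number of supporting (b) or contradicting (w) pairs
    n: total number of contrasting pairs (c)
    """
    if n == 0:
        return 1.0
    pmf = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    # Sum the probabilities of all outcomes at most as likely as the observed one.
    threshold = pmf[k] * (1 + 1e-7)
    return min(1.0, sum(x for x in pmf if x <= threshold))


best_p = binom_test(8, n=8)   # b = 8, c = 8
worst_p = binom_test(0, n=8)  # w = 0, c = 8
```

Because the test is two-sided and symmetric at $p = 0.5$, $b = 8$ and $w = 0$ out of $c = 8$ give the same p-value.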

Advantages

Takes population structure into account: the focus is not on mere correlation as in Fisher's test, but on evolutionary transitions.

Makes very few assumptions on the evolutionary process.

Disadvantages

It is not clear how to interpret these p-values or the range between them (Brynildsrud, 2016). To get a more readily interpretable p-value, read the section on the permutation test below.

The pairwise comparisons test is arguably too conservative because it only takes into account a fraction of the available data (Maddison, 2014; Felsenstein, 1985; Grafen, 1996).

Note

Scoary2's parameter worst_cutoff allows one to skip traits where no gene has a "worst" p-value lower than a certain threshold, i.e. traits that very strongly correlate with the phylogeny. This can greatly speed up the analysis of datasets with very many traits and strong population structure effects (for example: multiple species).

Further reading

Read & Nee, 1995, Maddison, 2000, Brynildsrud, 2016

Post-hoc permutation test

The goal here is to calculate a p-value based on pairwise comparisons that is more readily interpretable than the "best" / "worst" p-values described above.

The idea is to make $N$ random permutations of the phenotype data and to calculate the values $c$ and $b$ for each permutation. This yields an empirical distribution of $b/c$ under the null hypothesis, since any association between the genotype and phenotype is broken by the random permutation.

If $r$ is the number of permuted $b/c$ values greater than or equal to the observed $b/c$, then the empirical p-value is $(r+1)/(N+1)$.
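The procedure can be sketched as below. Computing $b$ and $c$ requires the phylogeny and is not shown; instead, the sketch accepts any callable `statistic(genotype, phenotype)` that returns the test statistic. All names here are hypothetical, and the toy statistic in the usage lines (fraction of isolates where gene and trait agree) is only a placeholder for $b/c$.

```python
import random


def empirical_p_value(observed, permuted):
    """(r + 1) / (N + 1), where r counts permuted values >= observed."""
    r = sum(1 for value in permuted if value >= observed)
    return (r + 1) / (len(permuted) + 1)


def permutation_p_value(genotype, phenotype, statistic, n_permutations=1000, seed=0):
    """Shuffle the phenotype N times and compare against the observed statistic."""
    rng = random.Random(seed)
    observed = statistic(genotype, phenotype)
    shuffled = list(phenotype)
    permuted = []
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # breaks any genotype-phenotype association
        permuted.append(statistic(genotype, shuffled))
    return empirical_p_value(observed, permuted)


def agreement(genotype, phenotype):
    # Placeholder statistic; the test described above would use b / c instead.
    return sum(g == t for g, t in zip(genotype, phenotype)) / len(genotype)


genotype = [1, 1, 1, 1, 0, 0, 0, 0]
phenotype = [1, 1, 1, 1, 0, 0, 0, 0]
p = permutation_p_value(genotype, phenotype, agreement, n_permutations=999)
```

Adding 1 to both numerator and denominator means the empirical p-value can never be exactly zero; its smallest possible value is $1/(N+1)$.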

Advantages

Same advantages as Best / worst pairwise comparisons, but the p-value is readily interpretable:

  • Given the phylogeny and the distribution of the gene, it measures how likely a random trait is to yield at least as high a ratio of $b$ to $c$ as the observed one.
  • If the p-value is high, the trait strongly correlates with the phylogeny and its association with the gene is more likely to be spurious.
  • If the p-value is low, the trait weakly correlates with the phylogeny, lending more credence to the hypothesis that the gene is causally linked to the trait.

Disadvantages

Permutation tests are computationally intensive.

The pairwise comparisons test is arguably too conservative because it only takes into account a fraction of the available data (Maddison, 2014; Felsenstein, 1985; Grafen, 1996).

Further reading

Brynildsrud, 2016

Product of Fisher's q-value and post-hoc test (fq*ep)

This is not a p-value and cannot be interpreted as such. It is merely used as an empirically-derived score in Scoary2 to sort the genes by how promising they are.
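A minimal sketch of such a ranking is shown below. The record layout, the field names `fisher_q` and `empirical_p`, and the example values are invented for illustration. Sorting in ascending order (smallest product first) is an assumption based on both factors being small for promising genes.

```python
# Hypothetical per-gene results; values are made up for illustration.
genes = [
    {"gene": "geneA", "fisher_q": 0.020, "empirical_p": 0.50},
    {"gene": "geneB", "fisher_q": 0.001, "empirical_p": 0.01},
    {"gene": "geneC", "fisher_q": 0.004, "empirical_p": 0.10},
]

# Attach the score (a product, not a p-value) and sort by it.
for record in genes:
    record["fq_ep"] = record["fisher_q"] * record["empirical_p"]

ranked = sorted(genes, key=lambda record: record["fq_ep"])
ranked_names = [record["gene"] for record in ranked]
```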

Illustration

To illustrate what these p-values mean, let's consider these two examples:

  • blue: g+t+ (gene present, trait present)
  • green: g-t- (gene absent, trait absent)

Example 1

good evidence for causality

  • Fisher's test: 0.00016
  • $c$: 8 (eight non-intersecting contrasting pairs can be made)
  • $b$: 8 (all of them support the hypothesis: g-t- | g+t+)
  • $w$: 0 (none contradict the hypothesis: g-t+ | g+t-)
  • "best" p-value: 0.0078
  • "worst" p-value: 0.0078
  • permuted p-value: 0.083

Here, the trait and the gene strongly correlate (Fisher's test), and the correlation does not come from the population structure (best/worst/permuted p-values).

Example 2

weak evidence for causality

  • Fisher's test: 0.00016
  • $c$: 1 (only one contrasting pair possible: at the root node)
  • $b$: 1 (this pair supports the hypothesis)
  • $w$: 0 (none contradict the hypothesis: g-t+ | g+t-)
  • "best" p-value: 1
  • "worst" p-value: 1
  • permuted p-value: 1

Here, the trait and the gene correlate as strongly as in example 1 (Fisher's test), but there is only one evolutionary transition. We conclude that the data constitute weak evidence for causality, as illustrated by the best/worst/permuted p-values.