# Large scale inference and multiple testing

Basic approaches to statistical inference focus on the setting where a
researcher wishes to assess a single hypothesis by attempting to
falsify it. This has come to be known as "null hypothesis significance
testing", or the "Neyman-Pearson" approach to statistical inference.

A properly-conducted statistical hypothesis test controls the "type-I"
error rate, also known as the "false positive rate", which is the
probability of rejecting the null hypothesis when the null hypothesis
is true.  Typically, the type-I error rate is controlled at 5%, but
this is an arbitrary threshold.  When several hypotheses are tested
together, the analogous quantity is the "family-wise" error rate
(FWER), which is the probability of falsely rejecting at least one
true null hypothesis.

In practice, it is rarely the case that an empirical research project
strictly follows the framework for which null hypothesis significance
testing (NHST) was devised.  Research often involves assessing several
hypotheses with the same data, or may involve assessing hypotheses
that are suggested through preliminary analysis of the data.  Famous
approaches by Sidak, Scheffe, Bonferroni, and Tukey have been
developed to address this issue, and are effective in many settings.

More recently, alternative frameworks for statistical inference that
are suited for a broader set of research strategies have been devised.
One important notion is that of a "false discovery rate", which we
will discuss further below.  A branch of statistics often called
"large scale inference" has arisen to address statistical inference
for complex data analysis workflows.  We will illustrate a few basic
techniques in this area that can be carried out using the Python
statsmodels library, or by direct calculation using the Numpy Python
module.

# p-values and multiple testing

In many scientific investigations, the most interesting outcome that a
researcher can hope for is to "reject" a simple null hypothesis.  This
typically allows the researcher to claim that they have discovered a
new association, predictive relationship, or mechanism.  In other
words, classical NHST analyses are usually conducted in a setting
where the researcher wants to see the null hypothesis rejected.  To
counteract this, a primary goal of traditional (NHST) inference
procedures is to control the probability of incorrectly rejecting the
null hypothesis.  By convention, this probability (the type-I error
rate) is often bounded at 5%.

Since the type-I error rate cannot in practice be controlled at zero,
we allow the null hypothesis to be falsely rejected with a certain
positive probability (e.g. 5%).  This is the probability that a false
assertion is made based on a single test.  While a 5% rate of false
conclusions may be acceptable, the probability of making at least one
false assertion (the family-wise error rate, or FWER) grows as the
researcher conducts additional analyses related to the same research
aim.  Essentially, this gives the researcher multiple chances to
"win".  This is the fundamental issue that arises when applying the
NHST framework to more open-ended research pipelines.

To make this more concrete, below are several specific settings where
multiple testing can easily arise:

* _Subgroup analyses_: Suppose we have a clinical trial in which the
goal is to assess the effectiveness of a treatment.  The null
hypothesis is that the treatment has no effect, and the alternative
hypothesis is that the treatment is beneficial.  At the outset of such
a study, the researcher usually aims to show that the treatment is
beneficial in the population represented by the subjects in the
clinical trial.  But it is common to ask in addition if the treatment
is beneficial in a subpopulation, e.g. only in women, only in older
people, or only in people with a certain form of the disease.  If the
p-value for at least one of these tests is less than the conventional
0.05 threshold, the true "significance" of our discovery is unclear.
For example, if the p-value is less than 0.05 in the female sample,
but not in the whole sample or in any other subpopulation, then people
may be tempted to interpret this as if we had pre-specified that the
only question of interest was whether the treatment is effective for
women.  However, this is misleading.  If the treatment is in fact
totally ineffective for all people, and we assess it in five
subpopulations, then the FWER can exceed 20%, even if each test on its
own has a false positive rate of 5%.

* _Model selection_: Suppose we are conducting an observational study
to assess the association between a primary variable of interest and
an outcome.  For example, we may be interested in assessing whether
states with a particular regulation have slower economic growth than
states without the regulation.  Since such regulations are not
assigned at random, we cannot simply compare the economic growth rates
in states with and without the regulation.  Instead, we typically
identify several potential confounding factors and use some form of
regression analysis to assess whether the regulation is associated
with economic growth when comparing states that are otherwise similar.
In general, we will not know what the confounders are, so in practice
we will identify many potential confounders, screen out those that
seem to play no role, and assess the effect of interest in a model
that includes the confounders that seem to be important.  This process
of model selection is somewhat open-ended and may lead to a form of
implicit multiple testing.  The effect of interest will be assessed
many times, adjusting for different sets of potential confounding
variables.  Each model may control the false positive rate for the
effect of interest at 5%, but over all such models, the FWER will
exceed 5%, possibly by a large margin.

To begin, we will conduct some simple simulations to show how FWER
inflation can occur.  As always, we first import the Python libraries
that we will be using.

In [None]:
%matplotlib inline

# Remove these two lines, but make sure that you have the latest
# Statsmodels master from Github.  This is only needed for the
# RegressionFDR analysis at the end of the notebook.
import sys
sys.path.insert(0, "/afs/umich.edu/user/k/s/kshedden/statsmodels_fork/statsmodels")

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import statsmodels.api as sm
import numpy as np
from scipy.stats.distributions import norm

Very simplistically, suppose we conduct five independent tests and
report a positive finding if at least one of them looks interesting.  Here, we
work with Z-scores, noting that a Z-score that is bigger than two in
magnitude corresponds approximately to a two-sided p-value smaller than 0.05.

In [None]:
# The simulation sample size, any large number will do.
nrep = 10000

z = np.random.normal(size=(nrep, 5))
print((np.abs(z).max(1) > 2).mean())

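We obtain a value of around 20%, meaning that we have a 20% chance of
rejecting at least one null hypothesis falsely, even though the
probability of rejecting each individual null hypothesis falsely is
around 5%.  Note that in this simple case where the hypotheses are
independent, we did not need to use simulation to calculate this
quantity.  It is equal to $1 - (1 - \alpha)^5$, where $\alpha$ is the
per-test false positive rate.  For $\alpha = 0.05$ this gives
approximately 0.23, and for the threshold $|Z| > 2$ used above
($\alpha \approx 0.046$) it gives approximately 0.21.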
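
As a quick check on this arithmetic, the sketch below computes the
per-test false positive rate implied by a threshold of 2, and the
resulting FWER for five independent tests.  The helper name
`fwer_independent` is ours, introduced only for illustration.

```python
from scipy.stats import norm

def fwer_independent(alpha, m):
    """FWER for m independent tests, each with type-I error rate alpha."""
    return 1 - (1 - alpha)**m

# Two-sided type-I error rate when rejecting for |Z| > 2
alpha2 = 2 * norm.cdf(-2)
print(alpha2)

# FWER for five independent tests at this threshold, and at exactly 5%
print(fwer_independent(alpha2, 5))
print(fwer_independent(0.05, 5))
```

The formula only applies when the tests are independent.  With
dependent tests the FWER is generally smaller, which is why simulation
is used in the regression example that follows.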

Next we have a slightly more elaborate example.  We have an outcome
$y$ and a variable of interest $x$, along with five potential
confounders.  We regress $y$ on $x$ and one of the confounders,
yielding five regressions.  If $y$ and $x$ are statistically
associated in any of these five models, we claim that an association
has been found.  In the example below, the outcome is independent of
all covariates.  Thus, all rejections of the null hypothesis are false
positives.

In [None]:
# Number of simulation replications, any large value will do
nrep = 500

# Sample size
n = 100

# Correlation between each confounder and the exposure
r = 0.4

reject = 0
for i in range(nrep):

    x = np.random.normal(size=(n, 6))

    # The outcome is independent of all predictors
    y = np.random.normal(size=n)

    # Check if we reject the null of no treatment effect
    # for at least one choice of confounder
    reject1 = 0
    for j in range(1, 6):
        x[:, j] = r*x[:, 0] + np.sqrt(1 - r**2)*x[:, j]
        zscore = sm.OLS(y, x[:, [0, j]]).fit().tvalues
        if np.abs(zscore[0]) > 2:
            reject1 += 1
    reject += int(reject1 > 0)

# The false positive rate of this procedure
print(reject / nrep)

Based on the simulation above, we see that the FWER is around double
its "nominal" value of 0.05.

# Simple ways to address multiple testing

A very simple approach for addressing multiple testing is to multiply
all p-values by the number of tests that were performed.  This is the
"Bonferroni adjustment".  Equivalently, we can define a test statistic
threshold based on a per-test type-I error rate of $\alpha/m$, where
$\alpha$ (e.g. 0.05) is the desired FWER, and $m$ is the number of
tests.
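
The two equivalent forms of the adjustment can be sketched as follows.
The p-values here are made up for illustration, and the comparison
with `multipletests` is included only as a sanity check.

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

# Hypothetical p-values from m = 5 tests
pv_demo = np.array([0.001, 0.012, 0.03, 0.2, 0.6])
m_demo = len(pv_demo)
alpha = 0.05

# Form 1: multiply each p-value by the number of tests, capping at 1
bonf = np.minimum(m_demo * pv_demo, 1)
print(bonf)

# Form 2: compare the raw p-values to the per-test level alpha / m
print(pv_demo < alpha / m_demo)

# The same adjustment as computed by statsmodels
reject_sm, bonf_sm, _, _ = multipletests(pv_demo, alpha=alpha, method="bonferroni")
print(bonf_sm)
```

Both forms lead to the same decisions: a test is rejected exactly when
its adjusted p-value falls below $\alpha$.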

The Bonferroni adjustment can be moderately to extremely conservative
if the tests being conducted are dependent on each other.  If the
tests are independent, the Bonferroni procedure is not conservative to
a meaningful degree.  By "dependent" here, we mean that the test
statistics used to decide the results of the tests are
non-independent random variables.

The performance of the Bonferroni adjustment is illustrated in the
following simulation study.  We use correlation coefficients here as
an example.  All population correlation coefficients are zero, so all
rejected null hypotheses are false positives.  Our hypothesis tests
are conducted by applying a variance-stabilizing transformation to the
Pearson correlation coefficient (the "Fisher transformation").  The
results of this simulation show that when the tests are independent (r
= 0), the Bonferroni approach is tight (at least one null hypothesis
is rejected in around 5% of the simulated data sets).  But when the
tests are dependent (r > 0), the Bonferroni approach becomes
conservative.

In [None]:
# Number of tests
m = 100

# Sample size
n = 100

# Monte Carlo replications (any large number will do)
nrep = 1000

# Generate data in which each column follows an AR(1) process with
# autocorrelation r, so that the m rows (one per test) are dependent
def genar(m, n, r):
    x = np.random.normal(size=(m, n))
    for i in range(1, m):
        x[i, :] = r*x[i-1, :] + np.sqrt(1 - r**2)*x[i, :]
    return x

for r in 0, 0.99:
    reject = 0
    for k in range(nrep):

        x1 = genar(m, n, r)
        x2 = genar(m, n, r)

        # One correlation coefficient (one test) per row
        c = [np.corrcoef(x1[i, :], x2[i, :])[0, 1] for i in range(m)]
        c = np.asarray(c)

        # Apply a variance-stabilizing transformation to the correlation
        # estimates
        f = 0.5 * np.log((1 + c) / (1 - c))

        # Z-scores
        z = f * np.sqrt(n - 3)

        # p-values
        pv = 2 * norm.cdf(-np.abs(z))

        # Bonferroni adjusted p-values
        bpv = m * pv

        reject += np.any(bpv < 0.05)

    print(reject / nrep)

# False Discovery Rates

The Bonferroni approach aims to control the family-wise error rate,
which is the probability of making at least one false statement among
all claims made in a study.  If the cost of a false positive is high (e.g.
the consequences are great, and it is expensive or difficult to
conduct additional studies on independent data to further validate the
finding), this is arguably the proper approach to take.  But in other
research settings, it is relatively cheap to follow up all the initial
positive findings from a study with additional validations.  In this
type of work, it seems too conservative to control the probability of
even one false claim being made, especially since this reduces our
power for making discoveries.

The False Discovery Rate (FDR) is the proportion of all claimed
positives that are false positives.  Controlling the FDR is arguably
more appropriate than controlling the FWER in settings where it is
relatively easy to validate findings.  Here are two concrete examples
where people may argue that controlling the FDR is more appropriate
than controlling the FWER:

* Suppose we are screening financial transactions for possible fraud.
Each day, millions of transactions are screened, and a small fraction
of them look suspicious.  The suspicious transactions are then
manually checked.  Under FWER control, we aim to limit the probability
that even one transaction is deemed suspicious when it is in fact
legitimate.  If we think of every day as being a replicate of this
approach, and we control FWER at 5%, then on 19 out of 20 days, no
false positives will occur.  On the other hand, if we control the FDR
at 5%, this means that of all transactions flagged as suspicious, 95%
of them (on average) turn out to be fraudulent.  That is, 95% of the
effort spent assessing suspicious transactions leads to the conclusion
that the transaction is indeed problematic.

* Suppose we are screening drug candidates for a disease using an "in
vitro" assay that can be carried out rapidly and cheaply by robots.
There are millions of drug candidates.  A candidate that looks
possibly "active" will be assessed further with additional assays.  It
is expected that most of the positives from the first round will turn
out to be false positives, but if even one succeeds this is a major
victory.  If we control the FDR, even at a slack value such as 90%,
then we are operating in a way that is consistent with the project
goals.

Note that while controlling the FWER is indeed more effective than
controlling the FDR at limiting false positives, it achieves this by
using a stricter threshold to determine what is declared a positive.
This means that FWER-controlling procedures will generally have lower
power than FDR-controlling procedures.
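
To illustrate the difference in power, the sketch below applies a
FWER-controlling procedure (Bonferroni) and an FDR-controlling
procedure (Benjamini-Hochberg) to the same simulated p-values.  The
number of tests, the number of true alternatives, and the effect size
are arbitrary choices made for illustration.

```python
import numpy as np
from scipy.stats import norm
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)

# Hypothetical setting: 1000 tests, the first 50 are true alternatives
m_demo, q_demo = 1000, 50
z_demo = rng.normal(size=m_demo)
z_demo[:q_demo] += 3.5
pv_demo = 2 * norm.cdf(-np.abs(z_demo))

# Reject while controlling the FWER (Bonferroni) or the FDR (BH) at 5%
rej_fwer = multipletests(pv_demo, alpha=0.05, method="bonferroni")[0]
rej_fdr = multipletests(pv_demo, alpha=0.05, method="fdr_bh")[0]

# True positives found by each approach
print(rej_fwer[:q_demo].sum(), rej_fdr[:q_demo].sum())

# False positives made by each approach
print(rej_fwer[q_demo:].sum(), rej_fdr[q_demo:].sum())
```

At the same nominal level, every hypothesis rejected by Bonferroni is
also rejected by Benjamini-Hochberg, so the FDR approach can only
match or exceed the number of discoveries.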

There are a number of ways to calculate the FDR.  Here we focus on two
of them.  One is a simplified version of the "Benjamini-Hochberg"
FDR, which uses simple empirical proportions that directly mimic the
population version of the FDR statistic.  The other is a "local"
version of the FDR devised by Efron.

## Global FDR

Suppose we have null hypotheses $H_1, H_2, \ldots, H_m$.  Define
$N_i=1$ if null hypothesis $i$ is true, and $N_i=0$ otherwise.  Define
$R_i=1$ if null hypothesis $i$ is rejected at a particular evidence
threshold $T$, and $R_i=0$ otherwise.  Then the FDR can be defined as

$$
\frac{\sum_i R_i\cdot N_i}{\sum_i R_i} =
\frac{m^{-1}\sum_i R_i\cdot N_i}{m^{-1}\sum_i R_i}.
$$

A simple way to estimate this quantity is to assume (for the sake of
calculation) that all the hypotheses are null, i.e., that $N_i = 1$
for all $i$.  This generally leads to only a small amount of bias,
since most of the null hypotheses actually are true in most practical
settings.  In this case, the numerator $m^{-1}\sum_i R_i\cdot N_i$ is
approximately the type-I error rate of the individual tests when
conducted at threshold $T$, and the denominator $m^{-1} \sum_i R_i$
can be estimated by the observed proportion of rejected tests.

Thus, for a given testing threshold $T$, we can estimate the FDR of
the procedure.  There are various ways to obtain the FDR for a
specific test.  One approach is to take the test statistic $T_i$ for
test $i$ (so that $R_i = {\cal I}(T_i \ge T)$), and calculate the FDR
for $T = T_i$.  The FDR for this procedure can be used to define an
FDR for test $i$.  There are alternative procedures known as "step-up"
procedures that are often used in practice to obtain the FDR for a
specific hypothesis.  But this approach gives similar results and is
more intuitive.
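
A minimal sketch of this estimate, applied to every test at once, is
given below.  It assumes that the test statistics are Z-scores that
are standard normal under the null, and that evidence is measured by
$|Z|$, so the null tail probability is two-sided.  The function name
`simple_fdr` and the simulated Z-scores are ours, for illustration.

```python
import numpy as np
from scipy.stats import norm, rankdata

def simple_fdr(z):
    """Estimated FDR for each test, using its own |Z| as the threshold."""
    za = np.abs(np.asarray(z, dtype=float))
    m = len(za)
    # Numerator: null probability of a |Z| value at least this large
    null_tail = 2 * norm.cdf(-za)
    # Denominator: observed proportion of |Z| values at least this large
    obs_tail = rankdata(-za, method="max") / m
    return np.minimum(null_tail / obs_tail, 1)

# Hypothetical Z-scores: 2000 tests, the first 40 are true alternatives
rng = np.random.default_rng(0)
z_demo = rng.normal(size=2000)
z_demo[:40] += 4
fdr_demo = simple_fdr(z_demo)

# Average estimated FDR among true alternatives and among true nulls
print(fdr_demo[:40].mean(), fdr_demo[40:].mean())
```

The ratio is capped at 1 since an estimated FDR above 1 has no
meaningful interpretation.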

In [None]:
np.random.seed(194)

# Number of tests
m = 10000

# Sample size per group
n = 50

# Number of true alternatives
q = 50

x1 = np.random.normal(size=(m, n))
x2 = np.random.normal(size=(m, n))

# The first q tests are true alternatives, the others
# are true nulls
x1[0:q, :] += 0.5

# Z-scores
se = np.sqrt(x1.var(1)/n + x2.var(1)/n)
z = (x1.mean(1) - x2.mean(1)) / se
za = np.abs(z)

# The Z-score for the first test (which is a true alternative)
print(za[0])

# The number of tests that are at least as strong as the first test
# in terms of evidence against the null
print(np.sum(za >= za[0]))

# The expected number of tests that would be as strongly against
# the null as the first test, if all null hypotheses were true.
# Since za holds absolute Z-scores, the two-sided tail probability
# is used.
print(m * 2 * norm.cdf(-za[0]))

Based on the results above, a rough estimate of the FDR at threshold
$T=2.74$ is $61.2/86 \approx 0.71$.  This is also the FDR for the
first test.  The FDR values for the other tests can be calculated
similarly.  This approach to estimating the FDR has been called the
"Bayesian FDR".

We see that in this setting, a Z-score of 2.74 gives an FDR value of
0.71.  Below we calculate the single-test p-value, and the Bonferroni
adjusted p-value for this Z-score.  The single test p-value is very
small, but due to the effect of multiple testing, it's not clear what
this means in terms of overall evidence against the null hypothesis.
The Bonferroni-corrected p-value is greater than 1.  It would usually
be reported as 1, indicating that the FWER is close to 100%.

The Z-score of 4.566, also calculated below, is the minimum Z-score
magnitude that would achieve a Bonferroni adjusted p-value of 0.05
when $m=10,000$ tests are performed.  As we can see, $Z=2.74$ is far
below the needed value of 4.566.  Thus we see that when 10,000 tests
are performed, $Z=2.74$ is not sufficient evidence to be confident
that a null hypothesis does not hold.

In [None]:
# The p-value for an observed Z-score of 2.74.
p = 2*norm.cdf(-2.74)
print(p)

# The Bonferroni adjusted p-value
print(m*p)

# Minimum Z-score magnitude needed to achieve Bonferroni
# adjusted p-value less than 0.05.
print(-norm.ppf(0.025/m))

Statsmodels provides several different ways to calculate FDR values,
but the "Bayesian" approach discussed above is not one of them.  In
the next cell, we calculate the "Benjamini-Hochberg" FDR for the same
simulated data considered above.  We see that of the 50 "true
alternatives", 6 have FDR < 0.1, and would likely be considered as
"discoveries" in practice.  None of the "true nulls" has FDR < 0.1, so
in this case, we achieve an FDR of 0, with 6 discoveries.  The
achieved FDR is random, and on average, the true FDR of a test is no
greater than its estimated FDR.

In [None]:
# The simple FDR estimate for the first test, using the two-sided
# tail probability since za holds absolute Z-scores
fdr0 = 2*norm.cdf(-za[0]) / np.mean(za >= za[0])
print(fdr0)

pv = 2*norm.cdf(-za)
_, gfdr, _, _ = sm.stats.multipletests(pv, method="fdr_bh")
print(np.sum(gfdr[0:q] < 0.1))
print(np.sum(gfdr[q:] < 0.1))
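
The Benjamini-Hochberg values returned by `multipletests` can also be
obtained by direct calculation with Numpy.  The sketch below is one
way to write the step-up adjustment.  The function name `bh_adjust`
and the simulated p-values are ours, for illustration.

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

def bh_adjust(pv):
    """Benjamini-Hochberg adjusted p-values (step-up procedure)."""
    pv = np.asarray(pv, dtype=float)
    m = len(pv)
    order = np.argsort(pv)
    # Scale the i-th smallest p-value by m / i
    scaled = pv[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working from the largest p-value down
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    # Return the values in the original order, capped at 1
    adj = np.empty(m)
    adj[order] = np.minimum(scaled, 1)
    return adj

# Hypothetical p-values: mostly uniform, with ten very small values
rng = np.random.default_rng(0)
pv_demo = rng.uniform(size=200)
pv_demo[:10] *= 1e-4

adj = bh_adjust(pv_demo)
adj_sm = multipletests(pv_demo, method="fdr_bh")[1]
print(np.allclose(adj, adj_sm))
```

The scaling by $m/i$ is the same ratio of expected to observed
rejections used in the simple estimate above.  The monotonicity step
is what distinguishes the step-up procedure from it.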

We won't cover this further here, but in practice it is important to consider
whether the Z-scores (or p-values) being considered in an FDR analysis are
statistically independent, in the sense discussed above.  The basic approaches
to FDR control discussed above are robust to a certain amount of dependence.
There are some alternative approaches to estimating FDR values that handle
stronger dependence of certain forms, but approaches that remain valid under
arbitrary patterns of dependence (such as the Benjamini-Yekutieli procedure)
tend to be quite conservative.

## Local FDR

An alternative approach to FDR known as "local FDR" has been advocated
by Efron. At a high level, the distinction between FDR and local FDR
is that local FDR is based on densities while FDR is based on tail
probabilities.  In global FDR, if we have an evidence threshold $T$
(i.e.  something that we compare a Z-score to), then we compare the
number of tests with $Z > T$ to the expected number of such tests.  In
local FDR, we compare the number of tests with $Z \approx T$ to the
expected number of such tests.

Local and global FDR can both be defined in terms of Z-scores. The
local FDR at a particular Z-score value is defined as the ratio of two
densities evaluated at Z, $f(Z)/g(Z)$. The numerator density $f$ is
the density of null Z-scores, and the denominator density $g$ is the
density of all Z-scores, which is presumed to be a mixture of null and
non-null Z-scores.

If the local FDR takes on a value of, say, 0.1 at $Z\approx 2.5$, this
means that the actual distribution of Z-scores generates values around
2.5 at 10 times the rate that the reference (null) distribution
generates such values.

In most cases, $f$ is simply a standard normal density, since most
test statistics can be taken to follow a standard normal distribution
when the null hypothesis is true. The denominator density $g$ could be
estimated with a simple histogram method, but it is most commonly
estimated using Poisson regression, following an approach known as
"Lindsey's method". The density $g$ is modeled as

$$
g(z) = \exp\left(\sum_j \beta_j z^j\right)
$$

In Lindsey's method, the parameters $\beta$ are estimated using
Poisson regression. The observed range of Z-scores is partitioned into
bins, and we count the number of Z-scores that fall into each
bin. These counts are regressed against the bin centers (and a
polynomial basis of these values) using Poisson regression. This is
essentially a way of smoothing a histogram, using Poisson regression to
do the smoothing.

The cell below illustrates how to calculate local FDR values using statsmodels.
In this case, local FDR discovered only 4 of the "true alternatives", and like
the global FDR achieved an FDR of 0.  In general, either the local or
global FDR can be more powerful, depending on the setting.

In [None]:
lfdr = sm.stats.local_fdr(z)

print(np.sum(lfdr[0:q] < 0.1))
print(np.sum(lfdr[q:] < 0.1))

# Regression FDR and the knockoff filter

The approaches to FDR discussed above are suited for "marginal
screening". This generally means that distinct variables are considered
in each test.  A different, but related question arises when fitting
regression models with large numbers of covariates. Here, we are faced
with the question of variable selection, for a collection of $p$
covariates $x_1, \ldots, x_p$ that may predict an outcome $y$.  In
regression variable selection, the null hypotheses are not asking
about the presence or absence of marginal dependence, e.g. through
${\rm cor}(x_j, y)$. Instead, we are asking whether $x_j$ is
independent of $y$ conditioned on $\{x_k, k\ne j\}$.

There are many methods for variable selection.  Here we focus on a
recently-proposed approach to variable selection that utilizes the
idea of FDR. The FDR approaches discussed above would not directly
address this question, since they look at marginal rather than
conditional relationships. One possible resolution is the "knockoff
filter".

The basic idea of the knockoff filter is that we augment our covariate
set with a collection of "knockoffs", doubling the number of
covariates in the model. The knockoff variables are in one-to-one
correspondence with the actual variables, so we can write
$\tilde{x}_j$ as the knockoff counterpart to $x_j$. The knockoff
variables need to be constructed in a very particular way. First, two
knockoff variables must be correlated with each other in the same way
that their non-knockoff counterparts are correlated. That is, ${\rm
cov}(\tilde{x}_j, \tilde{x}_k) = {\rm cov}(x_j, x_k)$.  In addition,
the knockoff variables need to be coupled to their non-knockoff
counterparts: ${\rm cov}(x_j, \tilde{x}_k) = {\rm cov}(x_j, x_k)$,
where $j\ne k$, and ${\rm cov}(x_j, \tilde{x}_j) = 1 - s_j$ (taking
the covariates to be standardized to unit variance), where $s_j$
is a tuning parameter which we will not discuss in detail here.
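
Assuming that the covariates are standardized to unit variance, these
conditions can be collected into a single statement about the joint
covariance matrix of the actual and knockoff variables.  Here we
write $\Sigma$ for the covariance matrix of $x = (x_1, \ldots, x_p)$
and $s = (s_1, \ldots, s_p)$.  The symbol $\Sigma$ is notation
introduced only for this summary.

$$
{\rm cov}\begin{pmatrix} x \\ \tilde{x} \end{pmatrix} =
\begin{pmatrix}
\Sigma & \Sigma - {\rm diag}(s) \\
\Sigma - {\rm diag}(s) & \Sigma
\end{pmatrix}.
$$

The off-diagonal blocks agree with $\Sigma$ except on the diagonal,
where the entries are $\Sigma_{jj} - s_j = 1 - s_j$.  The values $s_j$
must be chosen so that this joint matrix is positive semidefinite.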

The knockoff filter works by regressing $y$ on all the variables (both
the actual variables and their knockoff counterparts). We then define
a statistic that measures whether a given actual variable has a
stronger role in the model than its knockoff counterpart. A basic
choice for such a statistic would be $T_j \equiv |\beta_j| -
|\beta_{j+p}|$.  Next, we order the variables based on decreasing
values of $T_j$, placing the variables with greatest $T_j$ at the
beginning of the list. Finally a "stepdown" procedure is used to
assign an FDR to each variable in this list. We will not discuss the
details of the stepdown procedure here.

The knockoff procedure can be seen to accurately control the FDR under
fairly weak conditions. Note in particular that we do not need to do
any detailed theoretical analysis of the particular modeling procedure
being used, which would be needed for most classical
approaches to inference. For this reason, the knockoff filter can
be applied to many modern modeling methods like the Lasso
which are difficult to approach in a rigorous inferential manner using
other techniques.

In [None]:
from statsmodels.stats import knockoff_regeffects as kr

n = 500
p = 20

# Generate covariates that are somewhat dependent
x = np.random.normal(size=(n, p))
r = 0.2
for j in range(1, p):
    x[:, j] = r*x[:, j-1] + np.sqrt(1-r**2)*x[:, j]

# Generate the dependent variable of the regression
b = np.zeros(p)
b[0:6] = [1, 0, -1, 0, 1, 0]
ey = np.dot(x, b)
y = ey + np.random.normal(size=n)

# This is one of several "effect testers" that we can
# choose.  This one works well for regression models
# with non-orthogonal designs.
tester = kr.OLSEffects()

kn = sm.stats.RegressionFDR(y, x, tester, "equi")
print(kn.summary())
