# Selective inference for a toy example
Here we investigate "selective inference" in the toy example of [Wang et al (2018)](https://www.biorxiv.org/content/10.1101/501114v1). We show that the approach will sometimes select the wrong variables (which is inevitable when variables are perfectly correlated) and then assign them highly significant *p*-values. This happens because, even though the wrong variables are selected, their coefficients within the wrong model can still be estimated precisely.
```{r knitr-opts, include=FALSE}
knitr::opts_chunk$set(comment = "#",collapse = TRUE,results = "hold")
```
## Load packages
First, load the [selective inference](https://cran.r-project.org/package=selectiveInference) package.
```{r load-pkgs, warning=FALSE, message=FALSE}
library(selectiveInference)
```
## Simulate data
Now simulate some data with $x_1 = x_2$ and $x_3 = x_4$, and with
effects at variables 1 and 4. (We simulate $p = 100$ variables rather
than $p = 1000$ so that the example runs faster.)
```{r sim-data}
set.seed(15)
n <- 500
p <- 100
x <- matrix(rnorm(n*p),n,p)
x[,2] <- x[,1]
x[,4] <- x[,3]
b <- rep(0,p)
b[1] <- 1
b[4] <- 1
y <- drop(x %*% b + rnorm(n))
```
## Run selective inference
Unfortunately, the selective inference methods do not accept a design
matrix with duplicate columns, so we wrap both calls in `try`.
```{r run-fs-1}
try(fsfit <- fs(x,y))
try(larfit <- lar(x,y))
```
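One way to spot this situation before calling `fs` or `lar` is to look for exact duplicate columns and to check the rank of the design matrix. The sketch below uses a small made-up matrix, `x_demo`, rather than the simulated `x`, and relies only on base R. Whether rank deficiency is the precise reason the package rejects such input is an assumption here, not something taken from the package documentation.

```{r check-dup-cols}
# Small hypothetical design matrix (for illustration only):
# column 2 is an exact copy of column 1.
x_demo <- cbind(c(1,2,3,4),c(1,2,3,4),c(1,0,-1,0))

# Indices of columns that exactly duplicate an earlier column.
dup_cols <- which(duplicated(x_demo,MARGIN = 2))
dup_cols

# With a duplicated column, the rank is smaller than the number of
# columns.
x_rank <- qr(x_demo)$rank
x_rank
```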
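So we add a small amount of Gaussian noise (standard deviation 0.1) to columns 2 and 4 of `x`, so that they are no longer exact copies of columns 1 and 3 but remain very highly correlated with them.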
```{r tweak-x}
x[,2] <- x[,1] + rnorm(n,0,0.1)
x[,4] <- x[,3] + rnorm(n,0,0.1)
cor(x[,1],x[,2])
cor(x[,3],x[,4])
```
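The two correlations printed above can be anticipated with a short calculation. If $x_1$ has unit variance and the added noise is independent of it with standard deviation $\sigma = 0.1$, then the covariance between $x_1$ and $x_1 + \text{noise}$ is 1 and the variance of $x_1 + \text{noise}$ is $1 + \sigma^2$, so the population correlation is $1/\sqrt{1 + \sigma^2}$. The chunk below evaluates this expression; `noise_sd` and `expected_cor` are names introduced here for illustration, and the sample correlations will differ slightly from this value.

```{r expected-cor}
# Standard deviation of the noise added to the copied columns.
noise_sd <- 0.1

# Population correlation between a unit-variance variable and the
# same variable plus independent noise.
expected_cor <- 1/sqrt(1 + noise_sd^2)
expected_cor
```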
Now run the forward selection again, computing sequential *p*-values and
confidence intervals.
```{r run-fs-2}
fsfit <- fs(x,y)
out <- fsInf(fsfit)
print(out)
```
## Summary
From the above output, we see that the selective inference method
selected variables 1 and 3 with very small *p*-values. Since the true
effect is at variable 4, not variable 3, we know that variable 3 is a
false selection, so it might seem bad that its *p*-value is small.
However, these *p*-values do not measure the significance of the
variable selection; they measure the significance of the coefficient
of the selected variable, *conditional on the selection event.*

Put another way, selective inference is not trying to assess
uncertainty in which variables should be selected, and is certainly
not trying to produce inferences of the form $$(b_1 \neq 0 \text{ OR }
b_2 \neq 0) \text{ AND } (b_3 \neq 0 \text{ OR } b_4 \neq 0),$$ which
was the goal of [Wang et al (2018)](https://www.biorxiv.org/content/10.1101/501114v1).
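
The point that a wrongly selected variable can still receive a very small *p*-value can be checked without any selection machinery. The following minimal sketch simulates a fresh, smaller data set in which the effect is at variable 4, then fits ordinary least squares using only the near-duplicate variable 3. The names `n_toy`, `x3`, `x4`, `y_toy`, `fit_wrong` and `coef_tab`, and the seed, are assumptions made for this illustration and are not part of the analysis above. This is plain `lm`, not selective inference, so it only illustrates why the coefficient in the wrong model is estimated precisely.

```{r wrong-model-lm}
set.seed(1)
n_toy <- 500
x4 <- rnorm(n_toy)               # variable with the true effect
x3 <- x4 + rnorm(n_toy,0,0.1)    # near-copy of x4
y_toy <- x4 + rnorm(n_toy)       # effect of size 1 at variable 4 only

# Fit the "wrong" model, which includes x3 but not x4.
fit_wrong <- lm(y_toy ~ x3)
coef_tab <- coef(summary(fit_wrong))

# Estimate, standard error and p-value for the wrongly chosen variable.
coef_tab["x3",c("Estimate","Std. Error","Pr(>|t|)")]
```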