
# Selective inference for a toy example

Here we investigate "selective inference" in the toy example of [Wang et al (2018)](https://www.biorxiv.org/content/10.1101/501114v1). We show that the approach will sometimes select the wrong variables (this is inevitable when variables are perfectly correlated) and then assign them highly significant *p*-values. This happens because, even though the wrong variables are selected, their coefficients within the wrong model can be estimated precisely.

 
```{r knitr-opts, include=FALSE}
knitr::opts_chunk$set(comment = "#",collapse = TRUE,results = "hold")
```

## Load packages

First, load the [selective inference](https://cran.r-project.org/package=selectiveInference) package.

```{r load-pkgs, warning=FALSE, message=FALSE}
library(selectiveInference)
```

## Simulate data

Now simulate some data with $x_1 = x_2$ and $x_3 = x_4$, and with
effects at variables 1 and 4. (We simulate $p = 100$ variables rather
than $p = 1000$ so that the example runs faster.)

```{r sim-data}
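set.seed(15)
n     <- 500
p     <- 100

# Independent standard normal predictors, with columns 2 and 4 replaced
# by exact copies of columns 1 and 3.
x     <- matrix(rnorm(n*p),n,p)
x[,2] <- x[,1]
x[,4] <- x[,3]

# The only nonzero effects are at variables 1 and 4.
b     <- rep(0,p)
b[1]  <- 1
b[4]  <- 1
y     <- drop(x %*% b + rnorm(n))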
```

## Run selective inference

Unfortunately, the forward stepwise (`fs`) and least angle regression
(`lar`) functions in the package won't accept duplicate columns, so
we wrap both calls in `try`.

```{r run-fs-1}
try(fsfit <- fs(x,y))
try(larfit <- lar(x,y))
```
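
The restriction on duplicate columns can be checked ahead of time with base R alone. The sketch below builds a small hypothetical matrix `x_demo` (not the `x` used in this analysis), copies one column, then uses `duplicated` with `MARGIN = 2` to find exact copies and `qr` to show the resulting loss of rank. Whether this matches the check that `fs` and `lar` perform internally is an assumption; the code is only meant to illustrate the problem.

```{r check-duplicates}
# Hypothetical toy matrix, kept separate from the `x` used above.
set.seed(1)
x_demo     <- matrix(rnorm(50*5),50,5)
x_demo[,2] <- x_demo[,1]

# Indices of columns that are exact copies of an earlier column.
dup_cols <- which(duplicated(x_demo,MARGIN = 2))

# An exact copy adds no new information, so the rank is below ncol(x_demo).
rank_demo <- qr(x_demo)$rank

dup_cols
rank_demo
```

Here column 2 is flagged as a copy, and the rank is 4 rather than 5.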

So we modify `x` by adding a small amount of noise, so that the identical columns aren't quite identical but remain very highly correlated.

```{r tweak-x}
x[,2] <- x[,1] + rnorm(n,0,0.1) 
x[,4] <- x[,3] + rnorm(n,0,0.1)
cor(x[,1],x[,2])
cor(x[,3],x[,4])
```

Now run forward stepwise selection again, then compute sequential
*p*-values and confidence intervals with `fsInf`.

```{r run-fs-2}
fsfit <- fs(x,y)
out   <- fsInf(fsfit)
print(out)
```

## Summary

From the above output, we see that the selective inference method
selected variables 1 and 3 with very small *p*-values. Of course, we
know that variable 3 is a false selection (the true effect is at
variable 4, with which variable 3 is almost perfectly correlated), so
it might seem bad that its *p*-value is small. However, these
*p*-values do not measure the significance of the variable selection.
They measure the significance of the coefficient of the selected
variable, *conditional on the selection event.*

Put another way, selective inference is not trying to assess
uncertainty in which variables should be selected, and is certainly
not trying to produce inferences of the form $$(b_1 \neq 0 \text{ OR }
b_2 \neq 0) \text{ AND } (b_3 \neq 0 \text{ OR } b_4 \neq 0),$$ which
was the goal of [Wang et al (2018)](https://www.biorxiv.org/content/10.1101/501114v1).
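
The point that a wrongly selected variable can still have a precisely estimated coefficient can be seen in a much simpler setting. The sketch below is a hypothetical illustration that uses ordinary least squares via `lm`, not the conditional inference computed by `fsInf`. All names (`z1`, `z2`, `y_demo`, `fit_wrong`) are made up for this example: `z2` carries the true effect, and `z1` is a noisy near-copy of `z2` with no effect of its own.

```{r wrong-model-lm}
# Hypothetical data, kept separate from the objects used above.
set.seed(1)
n_demo <- 500
z2     <- rnorm(n_demo)              # Variable with the true effect.
z1     <- z2 + rnorm(n_demo,0,0.1)   # Near-copy of z2; no effect of its own.
y_demo <- z2 + rnorm(n_demo)

# Fit the "wrong" model, which contains z1 only. This is plain lm
# inference, with no adjustment for selection.
fit_wrong <- lm(y_demo ~ z1)
coef_z1   <- summary(fit_wrong)$coefficients["z1",]
coef_z1
```

The estimated coefficient of `z1` is close to 1 with a small standard error and a tiny *p*-value, even though `z1` has no effect of its own. A small *p*-value therefore says nothing about whether `z1` or `z2` is the variable that matters.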