Update high correlation simulation notebook

stephenslab · Feb 7, 2020 · 3ed0e7b · 3ed0e7b
1 parent 302374d
commit 3ed0e7b
Show file tree

Hide file tree

Showing 9 changed files with 260 additions and 1,367 deletions.
diff --git a/manuscript_results/motivating_example.ipynb b/manuscript_results/motivating_example.ipynb
@@ -8,7 +8,7 @@
    "source": [
     "# A demonstration of SuSiE's motivations\n",
     "\n",
-    "This document explains with toy example illustration the unique type of inference SuSiE is interested in."
+    "This document explains with toy example illustration the unique type of inference SuSiE is interested in. For a more complicated where variables are in high but not perfect correlation please see [this notebook](motivating_example_high_corr.html)."
    ]
   },
   {
@@ -849,9 +849,9 @@
      ""
     ]
    ],
-   "version": "0.17.2"
+   "version": "0.21.0"
   }
  },
  "nbformat": 4,
- "nbformat_minor": 2
+ "nbformat_minor": 4
 }
diff --git a/manuscript_results/motivating_example_high_corr.ipynb b/manuscript_results/motivating_example_high_corr.ipynb
@@ -0,0 +1,257 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Experiment with variables of given high correlation structure\n",
+    "\n",
+    "This notebook is meant to address to a shared concern from two referees. The [motivating example](motivating_example.html) in the manuscript was designed to be a simple toy for illustrating the novel type of inference SuSiE offers. Here are some slightly more complicated examples, based on the motivating example, but with variables in high (rather than perfect) correlations with each other."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## $x_1$ and $x_2$ are highly correlated"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Following a reviewer's\n",
+    "suggestion, we simulated two variables, $x_1$ and $x_2$, with high but\n",
+    "not perfect correlation ($0.9$). Specifically, we simulated $n = 600$\n",
+    "samples stored as an $X_{600 \\times 2}$ matrix, in which each row was\n",
+    "drawn *i.i.d.* from a normal distribution with mean zero and\n",
+    "$\\mathrm{cor}(x_1, x_2) = 0.9$. We then simulated $y_i = x_{i1} \\beta_1\n",
+    "+ x_{i2} \\beta_2 + \\varepsilon_i$, with $\\beta_1 = 1, \\beta_2 = 0$,\n",
+    "and $\\varepsilon_i$ *i.i.d.* normal with zero mean and standard\n",
+    "deviation of 3. We performed 1,000 replicates of this simulation\n",
+    "(generated with different random number seeds)."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "In this simulation, the correlation between $x_1$ and $x_2$ is still\n",
+    "sufficiently high (0.9) to make distinguishing between the two\n",
+    "variables somewhat possible, but not non entirely straightforward. For\n",
+    "example, when we run lasso (using `cv.glmnet` from the `glmnet`\n",
+    "R package) on these data it wrongly selected $x_2$ as having\n",
+    "non-zero coefficient in about 10\\% of the simulations (95 out of\n",
+    "1,000), and correctly selected $x_1$ in about 96\\% of simulations (956\n",
+    "out of 1,000). Note that the lasso does not assess uncertainty in\n",
+    "variable selection, so these results are not directly comparable\n",
+    "with SuSiE CSs below; however, the lasso results demonstrate that\n",
+    "distinguishing the correct variable here is possible, but not so easy\n",
+    "that the example is uninteresting."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Ideally, then, SuSiE should identify variable $x_1$ as an effect\n",
+    "variable and drop $x_2$ as often as possible. However, due to the high\n",
+    "correlation between the variables, it is inevitable that some\n",
+    "95\\% SuSiE credible sets (CS) will also contain $x_2$. Most\n",
+    "important is that we should avoid, as much as possible, reporting a CS\n",
+    "that contains *only* $x_2$, since the goal is that 95\\% of CSs\n",
+    "should contain at least one effect variable.  The SuSiE results (SuSiE version 0.9.1 on R 3.5.2) \n",
+    "are summarized below. The code used for the simulation [can be found here](https://github.com/stephenslab/susie-paper/blob/master/src/ref_3_question.R)."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "| CSs   | count |\n",
+    "| :---- | ----: |\n",
+    "| (1)   |   829 |\n",
+    "| (1,2) |   169 |\n",
+    "|   **(2)**    |     2 |"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Highlighted in **bold** are CSs that do *not* contain\n",
+    "the true effect variable --- there are 2 of them out of 1,000 CSs\n",
+    "detected. In summary, SuSiE precisely identifies the effect\n",
+    "variable in a single CS in the majority (83\\%) of the simulations, and\n",
+    "provides a \"valid\" CS (*i.e.*, one containing an effect\n",
+    "variable) in almost all simulations (998 out of 1,000). Further, even\n",
+    "when SuSiE reports a CS including both variables, it consistently\n",
+    "assigns higher posterior inclusion probability (PIP) to the correct\n",
+    "variable, $x_1$: among the 169 CSs that contain both variables, the\n",
+    "median PIPs for $x_1$ and $x_2$ were 0.86 and 0.14, respectively."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## When an additional non-effect variable highly correlated with both variable groups"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Another referee suggested the following:\n",
+    "\n",
+    "> Suppose\n",
+    "we have another predictor $x_5$, which is both correlated with $(x_1,\n",
+    "x_2)$ and $(x_3, x_4)$. Say $\\mathrm{cor}(x_1, x_5) = 0.9$,\n",
+    "$\\mathrm{cor}(x_2, x_5) = 0.7$, and $\\mathrm{cor}(x_5, x_3)\n",
+    "= \\mathrm{cor}(x_5, x_4) = 0.8$. Does the current method assign $x_5$\n",
+    "to the $(x_1, x_2)$ group or the $(x_3, x_4)$ group?"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Following the suggestion, we simulated $x_1, \\ldots, x_5$ from a\n",
+    "multivariate normal with zero mean and the covariance matrix\n",
+    "approximately as given in Table below. (Since this matrix is\n",
+    "not quite positive definite, in our R code we used `nearPD` from\n",
+    "the `Matrix` package to generate the nearest positive definite\n",
+    "matrix --- the entries of the resulting covariance matrix differ only\n",
+    "very slightly from those in Table below, with a maximum\n",
+    "absolute difference of 0.0025 between corresponding elements in the\n",
+    "two matrices).\n",
+    "\n",
+    "|         | $x_1$ | $x_2$ | $x_3$ | $x_4$ | $x_5$ |\n",
+    "| ------: | ------: | ------: | ------: | ------: | ------: |\n",
+    "| $x_1$ |    1.00 |    0.92 |    0.70 |    0.70 |    0.90 |\n",
+    "| $x_2$ |    0.92 |    1.00 |    0.70 |    0.70 |    0.70 |\n",
+    "| $x_3$ |    0.70 |    0.70 |    1.00 |    0.92 |    0.80 |\n",
+    "| $x_4$ |    0.70 |    0.70 |    0.92 |    1.00 |    0.80 |\n",
+    "| $x_5$ |    0.90 |    0.70 |    0.80 |    0.80 |    1.00 |\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We simulated $n = 600$ samples from this\n",
+    "multivariate normal distribution, then we simulated $n = 600$\n",
+    "responses $y_i$ from the regression model $y_i = x_{i1} \\beta_1 +\n",
+    "\\cdots x_{i5} \\beta_5 + \\varepsilon_i$, with $\\beta = (0, 1, 1, 0,\n",
+    "0)^T$, and $\\varepsilon_i$ *i.i.d.* normal with zero mean and\n",
+    "standard deviation of 3. We repeated this simulation procedure 1,000\n",
+    "times with different random seeds, and each time we fit a SuSiE\n",
+    "model to the simulated data by running the IBSS algorithm.\\footnote{To\n",
+    "simplify the example, we ran the IBSS algorithm with $L = 2$, and\n",
+    "fixed the $\\sigma_0^2 = 1$. Similar results were obtained when we used\n",
+    "larger values of $L$, and when $\\sigma_0^2$ was estimated. For more\n",
+    "details on how the data were simulated and how the SuSiE models\n",
+    "were fitted to the data sets, [see this script](https://github.com/stephenslab/susie-paper/blob/master/src/ref_4_question.R)."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Like the toy motivating example given in the paper, in this simulation\n",
+    "the first two predictors are strongly correlated with each other, so\n",
+    "it may be difficult to distinguish among them, and likewise for the\n",
+    "third and fourth predictors. The fifth predictor, which has no effect\n",
+    "on $y$, potentially complicates matters because it is also strongly\n",
+    "correlated with the other predictors. Despite this complication, our\n",
+    "basic goal remains the same: the Credible Sets inferred by SuSiE\n",
+    "should capture the true effects most of the time, while also\n",
+    "minimizing \"false positive\" CSs that do not contain any true\n",
+    "effects. (Further, each CS should, ideally, be as small as possible.)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Table below summarizes the results of these simulations:\n",
+    "the left-hand column gives a unique result (a combination of CSs), and\n",
+    "the right-hand column gives the number of times this unique result\n",
+    "occurred among the 1,000 simulations. The CS combinations are ordered\n",
+    "by the frequency of their occurrence in the simulations. We highlight\n",
+    "in **bold** CSs that do not contain a true effect.\n",
+    "\n",
+    "| CSs            | count |\n",
+    "| :------------- | ----: |\n",
+    "| (2), (3)       |   551 |\n",
+    "| (2), (3,4)     |   212 |\n",
+    "| (1,2), (3)     |   176 |\n",
+    "| (1,2), (3,4)   |    38 |\n",
+    "| (2), (3,4,5)   |     9 |\n",
+    "| **(1)**, (3,4)        |     3 |\n",
+    "| (2),  **(4)**         |     3 |\n",
+    "| (1,2), (3,4,5) |     2 |\n",
+    "| **(1)**, (3)          |     1 |\n",
+    "| (1,2), **(4)**        |     1 |\n",
+    "| (2), (3,5)     |     1 |\n",
+    "| (3), (1,2,5)   |     1 |\n",
+    "| (3), (1,2,3)   |     1 |\n",
+    "| (3,4), (1,2,4) |     1 |"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "In the majority (551) of the simulations, SuSiE precisely identiied\n",
+    "the true effect variables, and no others. In most other cases,\n",
+    "SuSiE identified two CSs, each containing a correct effect variable, and\n",
+    "with one or more other variables included due to high correlation with\n",
+    "the true-effect variable. The referee asks specifically about how the\n",
+    "additional variable $x_5$ behaves in this example. In practice, $x_5$\n",
+    "was rarely included in a CS. In the few cases where $x_5$ *was*\n",
+    "included in a CS, the results were consistent with the simulation\n",
+    "setting; $x_5$ was included more frequently with $x_3$ and/or $x_4$\n",
+    "(12 times) rather than $x_2$ and/or $x_1$ (only once). In no\n",
+    "simulations did SuSiE form a large group that contains all five\n",
+    "predictors."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "This example actually highlights the benefits of SuSiE compared to\n",
+    "alternative approaches (e.g., hierinf) that *first* cluster the\n",
+    "variables into groups based on the correlation structure, then test\n",
+    "the groups. As we pointed out in the manuscript, this alternative\n",
+    "approach (first cluster variables into groups, then test groups) would\n",
+    "work well in the toy example in the paper, but in general it requires\n",
+    "*ad hoc* decisions about how to cluster variables. In this more\n",
+    "complex example raised by the referee, it is far from clear how to\n",
+    "cluster the variables. \\SuSiE avoids this problem because there is\n",
+    "no pre-clustering of variables; instead, the SuSiE CSs are computed\n",
+    "directly from an (approximate) posterior distribution (which takes\n",
+    "into account how the variables $x$ are correlated with each other, as\n",
+    "well as their relationship with $y$)."
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "R",
+   "language": "R",
+   "name": "ir"
+  },
+  "language_info": {
+   "codemirror_mode": "r",
+   "file_extension": ".r",
+   "mimetype": "text/x-r-source",
+   "name": "R",
+   "pygments_lexer": "r",
+   "version": "3.5.3"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}