analysis/index.Rmd

---
site: workflowr::wflow_site
title: "Truncated Adaptive Shrinkage (`truncash`)"
output:
  workflowr::wflow_html:
    toc: false
---

`truncash` (Truncated ASH) is an exploratory project with Matthew, built on [`ashr`].

* [Matthew's initial observation on null, correlated data](voom_null.html)

[Prof. Matthew Stephens](http://stephenslab.uchicago.edu/) did a quick investigation of the p values and z scores obtained for simulated null data (using just voom transform, no correction) from real RNA-seq data of [GTEx](http://www.gtexportal.org/home/).  Here is what he found.

"I found something that I hadn’t realized, although is obvious in hindsight: although you sometimes see inflation under null of $p$-values/$z$-scores, the most extreme values are not inflated compared with expectations (and tend to be deflated). That is the histograms of $p$-values that show inflation near $0$ (and deflation near $1$) actually hide something different going on in the very left hand side near $0$.  The qq-plots are clearer… showing most extreme values are deflated, or not inflated.  This is expected under positive correlation i think.  For example, if all $z$-scores were the same (complete correlation), then most extreme of n would just be $N(0,1)$. but if independent the most extreme of n would have longer tails..."

Matthew's initial observation inspired this project.  If under positive correlation, the most extreme tend to be not inflated, maybe we can use them to control the false discoveries.  Meanwhile, if the moderate are more prone to inflation due to correlation, maybe it's better to make only partial use of their information.

* [Occurrence of extreme observations](ExtremeOccurrence.html)

As [Prof. Michael Stein](https://galton.uchicago.edu/~stein/) pointed during a conversation with Matthew, if the marginal distribution is correct then the expected number exceeding any threshold should be correct.  So if the tail is "usually"" deflated, it should be that with some small probability there are many large $z$-scores (even in the tail).  Therefore, if "on average" we have the right number of large $z$-scores/small $p$-values, and "usually" we have too few, then "rarely" we should have too many.  A simulation is run to check this intuition.

* [Two FWER-controlling procedures on correlated null](StepDown.html)

In order to understand the behavior of $p$-values of top expressed, correlated genes under the global null, simulated from GTEx data, we apply two FWER-controlling multiple comparison procedures, Holm's "step-down" ([Holm 1979]) and Hochberg's "step-up." ([Hochberg 1988])

* [`truncash` Model and first simulations](truncash.html)

* [Pipeline for simulating null data](nullpipeline.html)

Using a toy model to examine and document the pipeline to simulate null summary statistics at each step, including `edgeR::calcNormFactors`, `limma::voom`, `limma::lmFit`, `limma::eBayes`.

* [FDR on Null, Part 1](FDR_Null.html)
* [FDR on Null, Part 2](FDR_null_betahat.html)

Apply two FDR-controlling procedures, BH and BY, as well as two $s$ value models, `ash` and `truncash` to the simulated, correlated null data, and compare the numbers of false discoveries (by definition, all discoveries should be false) obtained.  Part 1 uses $z$ scores only, Part 2 uses $\hat \beta$ and moderated $\hat s$.

* [$\hat\pi_0$ estimated in correlated global null](pihat0_null.html)

$\hat\pi_0$ estimated by `ash` and `truncash` with $T = 1.96$ on correlated global null data simulated from GTEx/Liver.  Ideally they should be close to $1$.

* [Ordered $p$ values vs critical values](cutoff_null.html)

For various FWER / FDR controlling procedures, and for `truncash`, what matters the most is the behavior of the most extreme observations.  Here these extreme $p$ values are plotted along with common critical values used by various procedures, in order to shed light on their behavior.

It's very exploratory.  May be related to Extreme Value Thoery and Concentration of Measure.  To be continued.

* [Single most extreme observation](SingleExtOb.html)

What will happen if we allow the threshold in `truncash` dependent on data?  Let's start from the case when we only know the single most extreme observation.


* [Handling $t$ likelihood](t-likelihood.html)

When moving to $t$ likelihood, or in other words, when taking randomness of $\hat s$ into consideration, things get trickier.  Here we review several techiniques currently used in Matthew's lab, regarding $t$ likelihood and uniform mixture priors, based on a discussion with Matthew.

* [Histogram of correlated $z$ scores, random data sets](correlated_z.html)
* [Histogram of correlated $z$ scores, `ash`-hostile data sets](correlated_z_2.html)
* [Histogram of correlated $z$ scores, `BH`-hostile data sets](correlated_z_3.html)

An implicit key question of this inquiry is: what's the behavior of $z$ scores under dependency?  We take a look at their histograms.  First for randomly sampled data sets.  Second for those most "hostile" to `ash`.  Third for those most "hostile" to `BH`.  The bottom line is we are reproducing what Efron observed in microarray data, that "the theoretical null may fail" in different ways.  Now the key questions are

1. Why the theoretical null may fail?  What does it mean by **correlation**?

2. Can `truncash` make `ash` more robust against some of the foes that make the theoretical null fail?

3. Generally, how robust is empirical Bayes? Is empirical Bayes robust or non-robust to certain kinds of correlation?

* [Fitting empirical null with Gaussian derivatives: Theory](gaussian_derivatives.html)
* [Fitting empirical null with Gaussian derivatives: Examples](gaussian_derivatives_2.html)
* [Fitting empirical null with Gaussian derivatives: Numerical issues](gaussian_derivatives_3.html)
* [Fitting empirical null with Gaussian derivatives: Large correlations](gaussian_derivatives_4.html)
* [Fitting empirical null with Gaussian derivatives: Delta function](gd_delta.html)

Inspired by [Schwartzman 2010](http://amstat.tandfonline.com/doi/abs/10.1198/jasa.2010.tm10237), we experiment a new way to tackle "empirical null."

* [Fitting empirical null with Gaussian derivatives: Weight constraints](gaussian_derivatives_5.html)

In Gaussian derivatives decomposition, weights $W_k$ and especially [normalized weights](gaussian_derivatives_5.html#numerical_issues) $W_k^s = W_k\sqrt{k!}$ contain substantial information. In order to produce a proper density, and in order to regularize the fitted density to make it describe a plausible empiricall correlated null, constraints need to be imposed on the weights.

* [True signal vs correlated null: small effects](alternative.html)
* [True signal vs correlated null: larger effects](alternative2.html)

Both true effects and correlation can distort the empirical distribution away from the standard normal $N(0, 1)$, and Gaussian derivatives are presumably able to fit both.  Therefore, is there a way to let Gaussian derivatives with [a reasonable number of reasonable weights](gaussian_derivatives_5.html#weight_constraints) automatically identify correlated null from true effects?  It works well when effects are *not too small* right now.

* [Fitting $N\left(0, \sigma^2\right)$ with Gaussian derivatives](fitting_normal.html)

Continuing from the empirical studies above, we are looking at why $N\left(0, \sigma^2\right)$ when $\sigma^2$ is small can be satisfactorily fitted by a limited number of Gaussian derivatives.

* [Diagnostic plots for `ASH`](diagnostic_plot.html)
* [Diagnostic plots for `ASH` on correlated null](diagnostic_correlated_z.html)
* [Diagnostic plots for `ASH` on correlated null: `BH` vs $\hat\pi_0$](diagnostic_correlated_z_2.html)
* [Diagnostic plots for `ASH` on correlated null: `BH` vs `lfsr`](diagnostic_correlated_z_3.html)

Under the exchangeability assumption, the goodness of fit of empirical Bayes can be measured by the behavior of $\left\{\hat F_j = \hat F_{\hat\beta_j | \hat s_j}\left(\hat\beta_j\mid\hat s_j\right)\right\}$.  Meanwhile, if `ASH` is applied to correlated null $z$ scores, estimated $\hat g$ will not only be different from $\delta_0$; moreover, with this estimated $\hat g$, $\left\{\hat F_j\right\}$ might not behave like $\text{Unif}\left[0, 1\right]$.

* [Diagnostic plots for uniform distribution](diagnostic_plot_2.html)
* [Diagnostic plots for `ASH`](diagnostic_ash.html)
* [`ashr::plot_diagnostic()` in `ashr`](plot_diagnostic.html)

Essentially we hope to tell whether $n$ random samples are uniformly distributed, and the task seems more complicated than what it sounds.  There are multiple ways to do it, and some of them have been absorbed into the `ashr` package.

* [Marginal distribution of $z$ scores: Correlated null](marginal_z.html)
* [Marginal distribution of $z$ scores: Correlated non-null](marginal_z_alternative.html)

Take a look at the null $z$ scores we've been experimenting with so far and check whether they truly are marginally $N\left(0, 1\right)$.  It's an enhanced version of the simulation on [occurrence of extreme observations](ExtremeOccurrence.html).  The non-null $z$ scores are produced for comparison.

* [Can we create correlated null from the raw data?](create_null.html)

This whole project comes from a ubiquitous issue that in the simultaneous inference problem, the deviation from what would be expected under the null can come from both distortion due to correlation and enrichment due to true effects.  Ideally, it would be perfect if we could create controls which keep the distortion by correlation unchanged, but remove the true effects.  Efron in a series of papers (for example, [Efron et al 2001](http://www.tandfonline.com/doi/abs/10.1198/016214501753382129)) suggests to create the null from the raw data by resampling or permutation.  We apply the idea to our data sets, and the results are not promising.

* [Normal means inference with heteroskedastic and correlated noise: Model](ash_gd.html)

The previous investigation is boiled down to a prototypical problem: the classic normal means inference, but with heteroskedastic and especially correlated noise.  Gaussian derivatives and `ASH` provide a way to tackle it.

* [Implementation by `Rmosek`: Correlated null](Rmosek.html)
* [Implementation by `Rmosek`: Simulated non-null](gd-ash.html)

`cvxr` is still under active development.  We are moving to the more established `Rmosek`, exploring and experimenting this convex optimization package.

* [Improvement on implementation with `Rmosek`: Primal and dual](mosek_reg.html)
* [Improvement on implementation with `Rmosek`: Normalization](mosek_reg_2.html)
* [Improvement on implementation with `Rmosek`: Regularization](mosek_reg_3.html)

Several ways are explored to make fitting Gaussian derivatives more efficient and accurate.

* [Regularized Gaussian Derivatives: Correlated null](mosek_reg_4.html)
* [Regularized Gaussian Derivatives & `ASH`: Simulated true effects with homoscedastic correlated noises](mosek_reg_5.html)
* [Regularized Gaussian Derivatives & `ASH`: Simulated true effects with heteroskedastic correlated noises](mosek_reg_6.html)

Fitting the corrrelated null with Gaussian derivatives has two constraints: non-negativity and orderly decay, and both are intractable.  We are using $L_1$ regularization to try to solve both.

* [Simulated signals added to real data: Part 1](real_data_simulation.html)
* [Simulated signals added to real data: Part 2](real_data_simulation_2.html)
* [Simulated signals added to real data: Part 3](real_data_simulation_3.html)
* [Simulated signals added to real data: Part 4](real_data_simulation_4.html)
* [Simulated signals added to real data: Part 5](real_data_simulation_5.html)

Using the `seqgendiff` pipeline developed by [David](http://dcgerard.github.io/), [Mengyin](https://github.com/mengyin), and Matthew, we are adding simulated $\pi_0\delta_0 + \left(1 - \pi0\right)N\left(0, \sigma^2\right)$ signals to GTEx/Liver RNA-seq expression matrix, and compare our method with [`BH`](http://www.jstor.org/stable/2346101), [`qvalue`](http://bioconductor.org/packages/release/bioc/html/qvalue.html), [`locfdr`](https://cran.r-project.org/web/packages/locfdr/index.html), [`ash`](https://github.com/stephens999/ashr), [`sva`](https://www.bioconductor.org/packages/release/bioc/html/sva.html), [`cate`](https://cran.r-project.org/web/packages/cate/index.html), [`mouthwash`](https://github.com/dcgerard/vicar).  The pipeline and result here are mostly exploratory and in development stage.  More mature version will follow.

* [Completely synthetic correlated null: Low rank simulation](simulated_correlated_null.html)
* [Completely synthetic correlated null: Multiple groups of closely correlated null](simulated_correlated_null_2.html)
* [Completely synthetic correlated null: More low rank simulation ](simulated_correlated_null_3.html)
* [Completely synthetic correlated null: Limit case of low rank simulation](simulated_correlated_null_4.html)

Following [Gao](http://home.uchicago.edu/gaow/)'s suggestion, we are taking a look at whether Gaussian derivatives can satisfactorily fit synthetic correlated $N\left(0, 1\right)$ $z$ scores.  Suggested by Matthew, **it might have something to do with the distribution of eigenvalues in a random matrix**, and the [Wigner semicircle distribution](https://en.wikipedia.org/wiki/Wigner_semicircle_distribution).  It's also an interesting comparison with Gaussian mixtures.

* [Simulated true effects with realistic correlated noises: Large and moderate SNR](simulation_real_se.html)
* [Simulated true effects with realistic correlated noises: Small SNR](simulation_real_se_2.html)

The large-scale simulation testing for GD-ASH with simulated effects under different sparsity, strength, and shape, and realistic correlated noises.

* [Why AUC of $p$ values and `BH` might be different?](auc_pvalue.html)

AUC-wise, $p$ values and $BH$ should perform almost the same, since `BH` doesn't change the order of $p$ values, but in practice their performance might differ.  It turns out ties should be one important reason.

* [Group meeting presentation on `GD-ASH`](group_meeting_0608.html)

Talked about using Gaussian derivatives to handle correlation on Matthew's group meeting.

* [`GD-ASH` applied to Smemo et al 2014 data](smemo.html)
* [`GD-ASH` applied to Smemo et al 2014 data: Fitting `ASH` first](smemo_2.html)

An RNA-seq data set of mouse hearts contains $23K$+ genes and only 4 samples, two for each condition.  `GD-ASH` applying to this data set raises several interesting questions.

* [`GD-ASH` with Gaussian derivative likelihood: Initial simulation](gd_lik.html)
* [`GD-ASH` with Gaussian derivative likelihood: Model, formulae, and simulation](gd_lik_2.html)

When the noise is correlated, the likelihood should be correlated.  However, in the [`GD-ASH`](ash_gd.html) model, noise is essentially treated to be independently sampled from an empirical null fitted by Gaussian derivatives, so it would make more sense to use Gaussian derivatives, rather than simple Gaussian, as likelihood.

* [Primal vs Dual in `Rmosek`](rmosek_primal_dual.html)
* [Primal vs Dual in `Rmosek`: With or without $L_1$ regularization](rmosek_primal_dual_2.html)

A more in-depth look into the [primal vs dual](mosek_reg.html) problem in `Rmosek` implementation.

* [`CASH` on Poisson thinned real data](seqgendiff.html)

Explore the `seqgendiff::poisthin` function developed by [David](https://github.com/dcgerard/seqgendiff).

* [Data analysis: Efron's BRCA data](brca.html)

A close look at Efron's classic BRCA data, used in many of his papers to illustrate the empirical Bayes idea which inspires this project.

* [`CASH` Simulation, Part 1: Overall data sets](cash_sim_1.html)
* [`CASH` Simulation, Part 2: Over-dispersed data sets](cash_sim_2.html)
* [`CASH` Simulation, Part 3: Under-dispersed data sets](cash_sim_3.html)
* [`CASH` Simulation, Part 4: Different effect size distributions](cash_sim_4.html)
* [`CASH` Simulation, Part 5: pFDP variability](cash_sim_5.html)
* [`CASH` Simulation, Part 6: Synthetic vs real data](cash_sim_6.html)
* [`CASH` Simulation, Part 7: More on synthetic data](cash_sim_7.html)

Using the same framework of [Gaussian derivative likelihood](gd_lik_2.html), extensive simulations are performed to compare `CASH` with other methods, smaller simulations are used to facilitate the exploration, different visualization methods are explored to illustrate the advantages of `CASH`.

* [`CASH` on FDR, Part 1: $g = 0.5\delta_0 + 0.5N(0, \sigma_n^2)$](cash_fdr_1.html)
* [`CASH` on FDR, Part 2: $g = 0.5\delta_0 + 0.5N(0, (2\sigma_n)^2)$](cash_fdr_2.html)
* [`CASH` on FDR, Part 3: $g = 0.9\delta_0 + 0.1N(0, \sigma_n^2)$](cash_fdr_3.html)
* [`CASH` on FDR, Part 4: $g = 0.9\delta_0 + 0.1N(0, (2\sigma_n)^2)$](cash_fdr_4.html)
* [`CASH` on FDR, Part 5: $g = 0.5\delta_0 + 0.25N(-2\sigma_n, \sigma_n^2) + 0.25N(2\sigma_n, \sigma_n^2)$](cash_fdr_5.html)
* [`CASH` on FDR, Part 6: $g = 0.6\delta_0 + 0.3N(0, \sigma_n^2) + 0.1N(0, (2\sigma_n)^2)$](cash_fdr_6.html)

An extensive study of `CASH` and other methods in different scenarios shows that 1. `CASH` produces more accurate $\hat\pi_0$; 2. `CASH` produces less variable false discovery proportions at nominal FDR cutoff; 3. `CASH` tends to overfit when the noise is under-dispersed or looks like independent; 4. `CASH` doesn't work well when *Unimodal Assumption (UA)* is broken.

* [`CASH` on Deconvolution: Accuracy of $\hat g$](cash_deconv.html)

On the deconvolution front, `CASH` generates more robust $\hat g$ than `ASH` in the correlated noise case, and doesn't show noticeable overfitting in the independent noise case.

* [Poster for O'Bayes 2017](poster_obayes17.html)

First public presentation of `CASH`.

* [Average correlation in GTEx data: 5 livers vs 5 livers](average_cor_gtex.html)
* [Average correlation in GTEx data: 50 livers vs 50 livers](average_cor_gtex_2.html)

A very important quantity for the shape of the empirical distribution of a large number of correlated null is **average pairwise correlation** $=\sqrt{\overline{\rho_{ij}^2}}$.

* [Plots for the `CASH` Paper](cash_plots.html)

* [Data analysis: Efron's Leukemia data](efron_leukemia.html)

A close look at Efron's classic BRCA data, used in many of his papers to illustrate the empirical null idea.

* [The coefficients of Gaussian derivatives for correlated null](gd_w.html)

* [The underappreciated and surprising robustness of `BH` for correlation](BH_robustness.html). See more systemic exploration at [BHq](https://github.com/LSun/BHq)

---

Next it's a series of investigation of a number of variable selection and false discovery rate control methods in linear regression.

1. Commonly used design matrix $X$ simulation schemes and its effects on **average pairwise correlation** in $X$ $=\sqrt{\overline{\rho\left(X_i, X_j\right)^2}}$ and in $\hat\beta$ $=\sqrt{\overline{\rho\left(\hat\beta_i, \hat\beta_j\right)^2}}$

* [Design matrix $X$ and its effect on the empirical distribution of the correlated null](design_matrix.html)
* [Design matrix $X$ and its effect on the correlation pattern of $X$ and $\hat\beta$](design_matrix_2.html)

2. Initial comparison

* [Linear regression variable selection: Independent columns in $X$](knockoff_4.html)
* [Linear regression variable selection: $\Sigma_X$ is Toeplitz](knockoff.html)
* [Linear regression variable selection: $X$ has high collinearity](knockoff_2.html)
* [Linear regression variable selection: Start from $\Sigma_{\hat\beta} / \sigma_e^2 = (X^TX)^{-1}$](knockoff_3.html)

3. Scenarios that might be less friendly to `knockoff`

* [`Knockoff` on small signals](knockoff_6.html)
* [`Knockoff` on design matrix correlation](knockoff_5.html)
* [Column randomization for `Knockoff`: Factor model for $\hat\beta$](knockoff_7.html)
* [Detailed simulations for Fixed-$X$ and Model-$X$ knockoffs](knockoff_8.html)
* [`Knockoff` vs Bayesian: Normal signals](knockoff_9.html)
* [`Knockoff` vs Bayesian: Large signals](knockoff_10.html)
* [The variability of FDP by `Knockoff`](knockoff_var.html)

[`ashr`]: https://github.com/stephens999/ashr