Varistran is an R package providing a Variance Stabilizing Transformation appropriate for RNA-Seq data, and a variety of diagnostic plots based on such transformation.
Varistran is most easily installed from GitHub using devtools:
By default, library size estimation is by TMM (Robinson and Oshlack, 2010), implemented in edgeR from BioConductor. You will need to install this manually if you haven't already:
Varistran also uses various CRAN packages, which will be installed automatically using the
devtools::install_github method above. Refer to the DESCRIPTION file for the full list of dependencies.
I intend to keep adding features to Varistran. Therefore, I recommend using its functions via
varistran:: rather than attaching its namespace with
library(Biobase) download.file("http://bowtie-bio.sourceforge.net/recount/ExpressionSets/bottomly_eset.RData", "bottomly_eset.RData") load("bottomly_eset.RData") counts <- exprs(bottomly.eset) strain <- phenoData(bottomly.eset)$strain experiment.number <- factor( phenoData(bottomly.eset)$experiment.number ) design <- model.matrix(~ strain + experiment.number)
Say you have a count matrix
counts and a design matrix
design. To perform a variance stabilizing transformation:
y <- varistran::vst(counts, design=design)
By default, Anscombe's (1948) variance stabilizing transformation for the negative binomial distribution is used. This behaves like log2 for large counts (log2 Counts-Per-Million if
cpm=T is given).
An appropraite dispersion is estimated with the aid of the design matrix. If omitted, this defaults to a column of ones, for blind estimation of the dispersion. This might slightly over-estimate the dispersion. A third possibility is to estimate the dispersion with edgeR.
plot_stability allows assessment of how well the variance has been stabilized. Ideally this will produce a horizontal line, but counts below 5 will always show a drop off in variance.
varistran::plot_stability(y, counts, design=design)
plot_biplot provides a two-dimensional overview of your samples and genes using Principle Components Analysis (similar concept to
plotMDS in limma):
plot_heatmap draws a heatmap.
Varistran's various diagnostic plots are also available as a Shiny app, which can be launched with:
y is not given it is calculated with
This also includes the limma bioconductor package's
plotMDS MDS plot (Ritchie et al, 2015).
Download the source code, and ensure that the Bioconductor packages in the "Suggests:" field of the DESCRIPTION file are installed. Then a suite of tests can be run with:
Outputs are placed in a directory called
Sources of data used in these tests are:
The "arab" dataset provided in the NBPSeq package (Di et al, 2011).
Simulated data following negative binomial distributions.
Dispersion estimates are compared to those calculated by the edgeR biocnoductor package's
estimateGLMCommonDisp function (McCarthy, Chen and Smyth, 2012) and by the DESeq2 bioconductor package's
DESeq function (Love, Huber and Anders, 2014).
Please email questions about using this software to the author, email@example.com.
Please file bug reports and feature requests by filing a bug report, or by contacting the author.
Pull requests gratefully considered.
To cite this R package, use:
Harrison, Paul F. 2017. "Varistran: Anscombe's variance stabilizing transformation for RNA-seq gene expression data." The Journal of Open Source Software 2 (16). doi:10.21105/joss.00257
Anscombe, Francis J. 1948. "The Transformation of Poisson, Binomial and Negative-Binomial Data." Biometrika 35 (3/4): 246–54.
Di, Yamming, Daniel W. Schafer, Jason S. Cumbie and Jeff H. Chang. 2011. "The NBP Negative Binomial Model for Assessing Differential Gene Expression from RNA-Seq" Statistical Applications in Genetics and Molecular Biology 10 (1). doi:10.2202/1544-6115.1637
Frazee, Alyssa C., Ben Langmead and Jeffrey T. Leek. 2011. "ReCount: A multi-experiment resource of analysis-ready RNA-seq gene count datasets." BMC Bioinformatics 12: 449. doi:10.1186/1471-2105-12-449
Love, Michael I., Wolfgang Huber and Simon Anders. 2014. "Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2." Genome Biology 15 (12): 550. doi:10.1186/s13059-014-0550-8
McCarthy, Davis J., Yunshun Chen and Gordon K. Smyth. 2012. "Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation." Nucleic Acids Research 40 (10): 4288-4297. doi:10.1093/nar/gks042
Ritchie, Matthew E., Belinda Phipson, Di Wu, Yifang Hu, Charity W. Law, Wei Shi and Gordon K. Smyth. 2015. "limma powers differential expression analyses for RNA-sequencing and microarray studies." Nucleic Acids Research 43 (7): e47. doi:10.1093/nar/gkv007
Robinson, Mark D. and Alicia Oshlack. 2010. "A scaling normalization method for differential expression analysis of RNA-seq data." Genome Biology 11 (3): R25. doi:10.1186/gb-2010-11-3-r25