Fast estimation of Gaussian Mixture Copula Models
The GMCM package offers R functions that perform high-dimensional meta-analysis (Li et. al., 2011) and general unsupervised cluster analysis (Tewari et. al., 2011) using Gaussian Copula Mixture Models in a very fast manner.
Gaussian copula mixture models (GMCMs) are a very flexible alternative to gaussian mixture models in unsupervised cluster analysis for continuous data where non-Gaussian clusters are present. GMCMs model the ranks of the observed data and are thus invariant to monotone increasing transformations of the data, i.e. they are semi-parametric and only the ordering of the data is important. Alternatively, a special-case of the GMCMs can be used for a novel meta-analysis approach in high-dimensional settings. In this context, the model tries to cluster results which agree and don't agree on statistical evidence into a reproducible and irreproducible group.
The optimization of the complicated likelihood function is difficult, however. The GMCM package utilizes Rcpp and RcppArmadillo to evaluate the likelihood function quickly and arrive at a parameter estimate using various standard numerical optimization routines.
Additional information and documentation can be found by running
help(package = "GMCM") in R for help and examples.
The core user functions of GMCM are
The released and tested version of GMCM is available at CRAN (Comprehensive R Archive Network). It can be easily be installed from within R by running
If you wish to install the latest version of GMCM directly from the master branch here at GitHub, run
Note, that this version is in development and is different from the version at CRAN. As such, it may be unstable. Be sure that you have the package development prerequisites if you wish to install the package from the source.
When installed, run
news(package = "GMCM") to view the latest notable changes of GMCM.
- Bilgrau, Anders Ellern; Eriksen, Poul Svante; Rasmussen, Jakob Gulddahl Dybkaer, Karen; Johnsen, Hans E; Boegsted, Martin. (2015) "Unsupervised Clustering and Meta-analysis using Gaussian Mixture Copula Models." Accepted for 'Journal of Statistical Software'.
Meta Analysis example
This is a very short tutorial for using the special GMCM for meta analysis. To illustrate we load the
u133VsExon dataset within the package. The dataset contains 19,577 p-values for the null hypothesis of no differential gene expression between two cell types for each of two different experiments called
# Load and show data data(u133VsExon) head(u133VsExon, n = 3) # u133 exon #ENSG00000265096 1.756104e-01 1.072572e-01 #ENSG00000152495 1.779757e-03 6.741108e-10 #ENSG00000198040 5.370574e-03 1.505019e-03
?u133VsExon for more information.
Next, we subset the data, rank it, and visualize it:
# Subsetting data to reduce computation time u133VsExon <- u133VsExon[1:5000, ] # Ranking and scaling, remember large values should be critical to the null! uhat <- Uhat(1 - u133VsExon) par(mfrow = c(1,2)) # Visualizing P-values and the ranked P-values plot(u133VsExon, cex = 0.5, pch = 4, col = "tomato", main = "P-values", xlab = "P-value (U133)", ylab = "P-value (Exon)") plot(uhat, cex = 0.5, pch = 4, col = "tomato", main = "Ranked P-values", xlab = "rank(1-P) (U133)", ylab = "rank(1-P) (Exon)")
Here each point represent a gene. The genes in the lower left of the first panel and correspondingly in the upper right of the second panel are the seemingly reproducible genes.
They have a low p-value and thus a high rank and both experiments.
Next, we do the actual fit using the core user function
fit.meta.GMCM with a L-BFGS-like procedure and subsequently compute the IDR values:
fit <- fit.meta.GMCM(uhat, init.par = c(0.5, 1, 1, 0.5), method = "L-BFGS", pgtol = 1e-2, verbose = TRUE) idr <- get.IDR(uhat, par = fit) # Compute IDR values and classify table(idr$K) # Show results, 1 = irreproducible, 2 = reproducible # 1 2 #4136 864
Where we see that 864 of the 5000 genes are deemed reproducible by the model at the default 0.05 threshold. The clustering results and the reproducible genes can be visualized in the following manner:
par(mfrow = c(1,2)) plot(u133VsExon, cex = 0.5, pch = 4, main = "Classified genes", col = c("tomato", "steelblue")[idr$K], xlab = "P-value (U133)", ylab = "P-value (Exon)") plot(uhat, cex = 0.5, pch = 4, main = "Classified genes", col = c("tomato", "steelblue")[idr$K], xlab = "rank(1-P) (U133)", ylab = "rank(1-P) (Exon)")
The model indeed capture genes in the upper right (of the second panel) and classify them as reproducible.