SIMLR and CIMLR Multi-kernel LeaRning methods
In this repository we provide implementations in both R and Matlab of SIMLR (https://www.nature.com/articles/nmeth.4207) and CIMLR (https://www.biorxiv.org/content/early/2018/02/16/267245). These methods were originally applied to single-cell and cancer genomic data, but they are in principle capable of effectively and efficiently learning similarities in all the contexts where diverse and heterogeneous statistical characteristics of the data make the problem harder for standard approaches.
The main branch of the repository (named SIMLR) provides the code (both R and Matlab) for the two methods (namely SIMLR and CIMLR), but no data, in order to keep the download small. Some example data, together with the implementations, are provided in the branch SIMLR_full (a larger download). Note that these data are provided purely as examples and should not be used in place of the ones provided in the respective publications.
Moreover, the tools are also available on Bioconductor (https://www.bioconductor.org/packages/release/bioc/html/SIMLR.html). The branch master of this repository refers to the stable version on Bioconductor and the development branch of this repository refers to the development version on Bioconductor (https://www.bioconductor.org/packages/devel/bioc/html/SIMLR.html).
The standard implementations of SIMLR are provided in the scripts SIMLR.R for R and SIMLR.m for Matlab. Similarly, we provide in the scripts CIMLR.R and CIMLR.m the implementations of CIMLR in both R and Matlab.
Besides the standard implementation of SIMLR, we also provide SIMLR_Large_Scale to handle large scale datasets (scripts SIMLR_Large_Scale.R and SIMLR_Large_Scale.m for R and Matlab) and SIMLR_Feature_Ranking to rank the most important features for the learned similarities (scripts SIMLR_Feature_Ranking.R and SIMLR_Feature_Ranking.m for R and Matlab). Note that this last function can also be used to prioritize features for the results of CIMLR, provided that the input data are modified as follows: CIMLR takes as input a list of multiple data types (multi-omics) for the same samples (rows in the input matrices), which needs to be transformed into a single input matrix with the same rows, obtained by appending all the features (columns) of the matrices in the list. As an example, if you provide CIMLR with 4 input matrices of 100 samples and 1000 features each, the input to compute the feature ranking will be one matrix of 100 rows and 4000 columns.
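The transformation described above amounts to a column-wise concatenation. The following is a minimal sketch on randomly generated data; the variable names (omics, combined) are hypothetical and are not part of the SIMLR or CIMLR code.

```r
set.seed(42)
n_samples <- 100
n_features <- 1000

# Hypothetical multi-omic input: a list of 4 matrices sharing the same
# 100 samples (rows), each with 1000 features (columns)
omics <- lapply(1:4, function(i) {
    matrix(rnorm(n_samples * n_features), nrow = n_samples, ncol = n_features)
})

# Append all the features (columns) into one matrix with the same rows
combined <- do.call(cbind, omics)

dim(combined)  # 100 4000
```

The combined matrix is what would then be passed to the feature ranking together with the similarities learned by CIMLR; please check the demo scripts for the exact arguments and matrix orientation expected by SIMLR_Feature_Ranking.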
Finally, we also provide scripts to estimate the number of clusters from the data, as suggested in the original papers: SIMLR_Estimate_Number_of_Clusters.R and SIMLR_Estimate_Number_of_Clusters.m for SIMLR, and CIMLR_Estimate_Number_of_Clusters.R and CIMLR_Estimate_Number_of_Clusters.m for CIMLR.
Single-cell RNA-seq technologies enable high throughput gene expression measurement of individual cells, and allow the discovery of heterogeneity within cell populations. Measurement of cell-to-cell gene expression similarity is critical for the identification, visualization and analysis of cell populations. However, single-cell data introduce challenges to conventional measures of gene expression similarity because of the high level of noise, outliers and dropouts. We develop a novel similarity-learning framework, SIMLR (Single-cell Interpretation via Multi-kernel LeaRning), which learns an appropriate distance metric from the data for dimension reduction, clustering and visualization. SIMLR is capable of separating known subpopulations more accurately in single-cell data sets than do existing dimension reduction methods. Additionally, SIMLR demonstrates high sensitivity and accuracy on high-throughput peripheral blood mononuclear cells (PBMC) data sets generated by the GemCode single-cell technology from 10x Genomics.
SIMLR offers three main unique advantages over previous methods: (1) it learns a distance metric that best fits the structure of the data via combining multiple kernels. This is important because the diverse statistical characteristics due to large noise and dropout effect of single-cell data produced today do not easily fit specific statistical assumptions made by standard dimension reduction algorithms. The adoption of multiple kernel representations provides a better fit to the true underlying statistical distribution of the specific input scRNA-seq data set; (2) SIMLR addresses the challenge of high levels of dropout events that can significantly weaken cell-to-cell similarities even under an appropriate distance metric, by employing graph diffusion, which improves weak similarity measures that are likely to result from noise or dropout events; (3) in contrast to some previous analyses that pre-select gene subsets of known function, SIMLR is unsupervised, thus allowing de novo discovery from the data. We empirically demonstrate that SIMLR produces more reliable clusters than commonly used linear methods, such as principal component analysis (PCA), and nonlinear methods, such as t-distributed stochastic neighbor embedding (t-SNE), and we use SIMLR to provide 2-D and 3-D visualizations that assist with the interpretation of single-cell data derived from several diverse technologies and biological samples.
Furthermore, here we also provide an implementation of SIMLR (see SIMLR large scale) capable of handling large scale datasets.
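To give an intuition of what "combining multiple kernels" means, the following toy sketch builds a few Gaussian kernels with different bandwidths and mixes them into one similarity matrix. This is only an illustration and not the SIMLR algorithm: the bandwidths and the uniform weights are assumptions made for the example, whereas SIMLR learns the kernel weights together with the similarity from the data and additionally applies graph diffusion.

```r
set.seed(1)

# Toy data: 20 cells (rows) x 5 genes (columns)
X <- matrix(rnorm(20 * 5), nrow = 20, ncol = 5)

# Pairwise Euclidean distances between cells
D <- as.matrix(dist(X))

# Hypothetical bandwidths, one Gaussian kernel per bandwidth
sigmas <- c(0.5, 1, 2)
kernels <- lapply(sigmas, function(s) exp(-D^2 / (2 * s^2)))

# Uniform weights (an assumption for this sketch; SIMLR learns them)
w <- rep(1 / length(kernels), length(kernels))

# Weighted combination of the kernels into a single similarity matrix
K <- Reduce(`+`, Map(`*`, w, kernels))
```

Each kernel captures similarity at a different scale, so the combination is less tied to a single distributional assumption than any one kernel alone.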
Outcomes for cancer patients vary greatly even within the same tumor type, and characterization of molecular subtypes of cancer holds important promise for improving prognosis and personalized treatment. This promise has motivated recent efforts to produce large amounts of multidimensional genomic ("multi-omic") data, but current algorithms still face challenges in the integrated analysis of such data. Here we present Cancer Integration via Multikernel Learning (CIMLR), a new cancer subtyping method that integrates multi-omic data to reveal molecular subtypes of cancer. CIMLR extends the original implementation of SIMLR to take as input multiple data matrices corresponding to different types of measurements upon the same set of tumors. We applied CIMLR to multi-omic data from 32 cancer types and showed significant improvements in both computational efficiency and ability to extract biologically meaningful cancer subtypes. The discovered subtypes exhibited significant differences in patient survival for 21 of the 32 studied cancer types. Our analysis revealed integrated patterns of gene expression, methylation, point mutations and copy number changes in multiple cancers and highlights patterns specifically associated with poor patient outcomes.
The latest version of the manuscript related to SIMLR is published in Nature Methods at https://www.nature.com/articles/nmeth.4207, while the latest draft of the CIMLR manuscript can be found as a preprint at https://www.biorxiv.org/content/early/2018/02/16/267245. We also provide a paper describing the software, which is published in PROTEOMICS and can be found at http://onlinelibrary.wiley.com/doi/10.1002/pmic.201700232/full.
When using SIMLR, please cite Wang, Bo, et al. "Visualization and analysis of single-cell RNA-seq data by kernel-based similarity learning." Nature methods 14.4 (2017): 414.
For CIMLR, please cite Ramazzotti, Daniele, et al. "Multi-omic tumor data reveal diversity of molecular mechanisms underlying survival." bioRxiv (2018): 267245.
The citation of Wang, Bo, et al. "SIMLR: A Tool for Large‐Scale Genomic Analyses by Multi‐Kernel Learning." Proteomics 18.2 (2018) is optional, although appreciated.
INSTALLING SIMLR R Bioconductor IMPLEMENTATION
As mentioned, both SIMLR and CIMLR are also hosted on Bioconductor at https://bioconductor.org/packages/release/bioc/html/SIMLR.html and can be installed as follows. To install the package directly from Bioconductor, run the following commands directly from R:
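On recent versions of R, Bioconductor packages are installed through the BiocManager package. The commands below are a minimal sketch under that assumption; older R versions may require a different Bioconductor installer.

```r
# Install BiocManager from CRAN if it is not already available
if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}

# Install SIMLR (which also provides CIMLR) from Bioconductor and load it
BiocManager::install("SIMLR")
library("SIMLR")
```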
Moreover, it is also possible to install the Github version of the tool from R by using the R library devtools.
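library("devtools")
# Latest stable version (also available on the Bioconductor release repository)
install_github("BatzoglouLabSU/SIMLR", ref = 'master')
# Latest development version (also available on the Bioconductor devel repository)
install_github("BatzoglouLabSU/SIMLR", ref = 'development')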
The "master" branch hosts the latest stable version of the code which is also available on Bioconductor on the stable repository, while the "development" branch hosts the latest version that is on the devel repository on Bioconductor.
We describe next the procedure to manually install our software in case one wishes to do so.
RUNNING THE R IMPLEMENTATION
We provide the R demo code to run SIMLR on 4 examples in the script R_main_demo_SIMLR.R. Furthermore, we provide a large scale implementation of SIMLR (see large scale implementation) with 1 example in the script R_main_demo_SIMLR_Large_Scale.R. A demo for the estimation of the number of clusters by SIMLR is also provided in the script R_main_demo_SIMLR_Estimate_Number_of_Clusters.R.
The R demo code to run CIMLR is also provided in the script R_main_demo_CIMLR.R. Besides this, a demo for the estimation of the number of clusters by CIMLR can be found in the script R_main_demo_CIMLR_Estimate_Number_of_Clusters.R.
The R libraries required to run the demos can be installed by running the script install_R_libraries.R. We now present a set of requirements to run the examples.
- Required R libraries. Our tool requires 2 R packages to run, namely the Matrix package (see https://cran.r-project.org/web/packages/Matrix/index.html) to handle sparse matrices and the parallel package (see https://stat.ethz.ch/R-manual/R-devel/library/parallel/doc/parallel.pdf) for a parallel implementation of the kernel estimation.
To run the large scale analysis, it is necessary to install 4 more packages, namely the Rcpp package (see https://cran.r-project.org/web/packages/Rcpp/index.html), the pracma package (see https://cran.r-project.org/web/packages/pracma/index.html), the RcppAnnoy package (see https://cran.rstudio.com/web/packages/RcppAnnoy/index.html) and the RSpectra package (see https://cran.r-project.org/web/packages/RSpectra/index.html).
Furthermore, to run the examples, we require the igraph package (see http://igraph.org/r/) to compute the normalized mutual information metric and the grDevices package (see https://stat.ethz.ch/R-manual/R-devel/library/grDevices/html/00Index.html) to color the plots.
All these packages can be installed with the R built-in install.packages function.
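As an alternative to the provided script install_R_libraries.R, the following sketch checks which of the packages listed above are missing from the current R installation. The variable names are hypothetical, and the installation line is left commented out so that nothing is installed without an explicit choice.

```r
# Packages named in the requirements above
required <- c("Matrix", "parallel",                        # core
              "Rcpp", "pracma", "RcppAnnoy", "RSpectra",   # large scale
              "igraph", "grDevices")                       # examples

# TRUE for each package whose namespace can be loaded
is_installed <- vapply(required, requireNamespace, logical(1), quietly = TRUE)

# Names of the packages that still need to be installed
missing_pkgs <- required[!is_installed]
print(missing_pkgs)

# Uncomment to install the missing packages from CRAN
# install.packages(missing_pkgs)
```

The parallel and grDevices packages ship with R itself, so they should never appear among the missing ones.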
- External C code. We make use of an external C program during the computations. The code is located in the R directory in the file projsplx_R.c. To compile the program, run the command R CMD SHLIB -c projsplx_R.c in a shell.
An OS X pre-compiled file is also provided. Note: if there are issues in compiling the .c file, try removing the pre-compiled files (i.e., projsplx_R.o and projsplx_R.so).
- Example datasets. The 6 example datasets are provided in the directory data of the branch SIMLR_full. Note that these data have undergone some pre-processing and are provided purely as examples; they should not be used in place of the ones provided in the respective publications.
Specifically, the dataset of Test_1_mECS.RData refers to http://www.ncbi.nlm.nih.gov/pubmed/25599176, Test_2_Kolod.RData refers to http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4595712/, Test_3_Pollen.RData refers to http://www.ncbi.nlm.nih.gov/pubmed/25086649 and Test_4_Usoskin.RData refers to http://www.ncbi.nlm.nih.gov/pubmed/25420068.
Moreover, for the large scale example, the dataset of Test_5_Zeisel.RData refers to https://www.ncbi.nlm.nih.gov/pubmed/25700174.
Finally, we provide the dataset Test_6_gliomas_multi_omic_data.RData from https://www.ncbi.nlm.nih.gov/pubmed/26061751 to test CIMLR.
RUNNING THE MATLAB IMPLEMENTATION
We also provide the MATLAB code to run SIMLR on the 5 examples in the scripts Matlab_main_demo_SIMLR.m and Matlab_main_demo_SIMLR_Large_Scale.m. Furthermore, we provide two demos for CIMLR with the data from https://www.ncbi.nlm.nih.gov/pubmed/26061751.
Please refer to the directory MATLAB and the file README.txt within for further details.