oocPCA

oocPCA is an Intel MKL-based, out-of-core C++ implementation of randomized SVD for rank-k approximation of matrices that are too large to fit into memory.
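To illustrate why randomized SVD suits the out-of-core setting, here is a minimal NumPy sketch. It is not the oocPCA implementation (which is C++ on Intel MKL); the function names `row_blocks` and `blocked_randomized_svd`, the `oversample` parameter, and the two-pass structure are assumptions made for illustration. The point it shows is that the large matrix only ever appears in products that can be accumulated one row block at a time.

```python
import numpy as np

def row_blocks(A, block_rows):
    """Yield consecutive row blocks of A (a stand-in for reading blocks from disk)."""
    for start in range(0, A.shape[0], block_rows):
        yield A[start:start + block_rows]

def blocked_randomized_svd(blocks_fn, n_cols, k, oversample=10, seed=0):
    """Rank-k randomized SVD that touches the matrix one row block at a time.

    blocks_fn: zero-argument callable returning a fresh iterator over row blocks.
    Hypothetical helper for illustration; not part of oocPCA's API.
    """
    rng = np.random.default_rng(seed)
    omega = rng.standard_normal((n_cols, k + oversample))

    # Pass 1: sketch Y = A @ omega, computed block by block.
    Y = np.vstack([block @ omega for block in blocks_fn()])
    Q, _ = np.linalg.qr(Y)  # orthonormal basis for the range of the sketch

    # Pass 2: B = Q.T @ A, accumulated block by block.
    B = np.zeros((Q.shape[1], n_cols))
    row = 0
    for block in blocks_fn():
        B += Q[row:row + block.shape[0]].T @ block
        row += block.shape[0]

    # Small dense SVD of B, then lift the left singular vectors back.
    U_small, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ U_small
    return U[:, :k], s[:k], Vt[:k]
```

With an in-memory array `A`, a call might look like `blocked_randomized_svd(lambda: row_blocks(A, 1000), A.shape[1], k=10)`. Only the sketch `Y`, the basis `Q`, and the small matrix `B` need to be held in memory at once, each with `k + oversample` columns or rows.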

Two interfaces are available: an R wrapper called oocRPCA and a command-line interface. OS X and Linux users can install the pre-compiled binaries by following the steps outlined below:

R Package Installation

if(!require(devtools)) install.packages("devtools") # If not already installed
devtools::install_github("KlugerLab/oocPCA")

OR:

  1. Clone this git repository
  2. cd fastRPCA
  3. R CMD INSTALL .

Please see the documentation for usage: ?oocPCA_CSV

R Testing Suite

Test cases for this software use the popular testing package testthat:

if(!require(testthat)) install.packages('testthat')
testthat::test_dir(sprintf("%s/testthat", system.file("tests", package="oocRPCA")))

Features

  • Variety of input formats and use cases
  • All matrix algebra is done with Intel MKL (a pre-compiled version is already linked), so the heavy computation runs on optimized BLAS and LAPACK routines
  • The calculations are 'blocked', allowing them to run 'out-of-core' when necessary, so that the user can specify the maximum amount of memory to be used.
    • CSV files: when too large for memory, read block by block from the hard drive
    • BED files: when too large for memory, stored in a compressed 2-bit-per-SNP format and decompressed block by block for calculations
  • Row-centering and column-centering
  • Imputation by averaging of missing data for GWAS
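
The compressed genotype storage and mean imputation listed above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not oocPCA's code: the 2-bit code table below is modeled on the PLINK BED convention (four genotypes per byte, lowest-order bits first, `01` meaning missing), and the names `unpack_2bit`, `GENOTYPE`, and `decode_snp` are hypothetical.

```python
# Assumed code table, modeled on the PLINK BED convention:
# 00 -> 0 copies, 10 -> 1 copy, 11 -> 2 copies, 01 -> missing.
GENOTYPE = {0b00: 0.0, 0b10: 1.0, 0b11: 2.0, 0b01: None}

def unpack_2bit(packed, n_values):
    """Unpack n_values 2-bit codes from bytes, lowest-order bits first."""
    codes = []
    for byte in packed:
        for shift in (0, 2, 4, 6):
            if len(codes) == n_values:
                return codes  # remaining bits are padding
            codes.append((byte >> shift) & 0b11)
    return codes

def decode_snp(packed, n_samples):
    """Decode one SNP and replace missing genotypes with the SNP's observed mean."""
    values = [GENOTYPE[code] for code in unpack_2bit(packed, n_samples)]
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed) if observed else 0.0
    return [mean if v is None else v for v in values]
```

Because each SNP decodes independently, a block of SNPs can be unpacked, imputed, and used in a matrix product while the rest of the data stays in its packed form, which takes a quarter of a byte per genotype.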

References

If you use oocPCA, please cite:

George C. Linderman, Manas Rachh, Jeremy G. Hoskins, Stefan Steinerberger, Yuval Kluger. (2017). Efficient Algorithms for t-distributed Stochastic Neighborhood Embedding. arXiv preprint.

Development

Compiling from source

This implementation relies heavily on a highly optimized implementation of BLAS and LAPACK, the Intel Math Kernel Library (MKL), which is available as a free download. The lib folder contains a custom-built shared library, but the headers cannot be distributed. As such, to compile from source, Intel MKL must be installed on your machine. To recompile the custom-built shared library, follow the instructions in lib/generate_custom_mkl.sh.

TODO

  • Ignore column and row headers for the CSV input
  • Windows Support

About

Out-of-Core Principal Component Analysis
