This R package enables the R user to perform common data manipulation operations, as found in popular packages such as reshape2, on very large data sets stored on Hadoop. Like rmr, it relies on Hadoop mapreduce to perform its tasks, but it provides a familiar plyr-like interface while hiding many of the mapreduce details.
- Hadoop-capable equivalents of well-known data.frame functions, including functions from reshape2, plus sampling, quantiles, counting and more.
- Simple but powerful ways of applying many functions operating on data frames to Hadoop data sets.
- Simple but powerful ways to group data.
- All of the above can be combined by normal functional composition: delayed evaluation helps mitigate any performance penalty of doing so by minimizing the number of Hadoop jobs launched to evaluate an expression.
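To make the last point concrete, here is a toy illustration of delayed evaluation written in plain base R. It is only a sketch of the general idea, not plyrmr's actual implementation: the names lazy_input, lazy_step, keep_efficient, add_ratio and run_job are hypothetical, and a "job" is simulated by a counter rather than by Hadoop.

```r
# Toy model of delayed evaluation. NOT plyrmr code; all names are hypothetical.

jobs_launched <- 0  # counts simulated "Hadoop jobs"

# Wrap a data frame together with a (still empty) list of pending operations.
lazy_input <- function(df) {
  structure(list(data = df, steps = list()), class = "lazy_pipe")
}

# Record an operation instead of running it right away.
lazy_step <- function(pipe, f) {
  pipe$steps <- c(pipe$steps, list(f))
  pipe
}

# Two example operations, each an ordinary function on data frames.
keep_efficient <- function(pipe) {
  lazy_step(pipe, function(df) df[df$mpg > 15, , drop = FALSE])
}
add_ratio <- function(pipe) {
  lazy_step(pipe, function(df) transform(df, carb.per.cyl = carb / cyl))
}

# Evaluation: apply all pending steps in order within one simulated job.
run_job <- function(pipe) {
  jobs_launched <<- jobs_launched + 1
  Reduce(function(df, f) f(df), pipe$steps, pipe$data)
}

# Normal functional composition; nothing runs until run_job is called.
result <- run_job(add_ratio(keep_efficient(lazy_input(mtcars))))
```

In this sketch, composing two operations still costs a single simulated job, because the steps are only recorded until the result is requested. An eager design would launch one job per operation.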
The current version has a major release number of zero (0.x.y). As the numbering suggests, the package should be considered work in progress and the API is not cast in stone yet. We seek feedback at an early stage to drive further development. This package has a GitHub repo; please feel free to open an issue there to discuss problems, existing or missing features, and anything else that requires an answer from the developers. For general discussions, head to the RHadoop forum.
Prerequisites and installation
- rmr 3.2.0 or higher.
- plyrmr installed on each node of a Hadoop cluster together with its dependencies (see the DESCRIPTION file, imports: lines). The package memoise requires special instructions. First load the package devtools. Then, for memoise, issue this command at the R prompt: install_github("RevolutionAnalytics/memoise"). The reason is that its maintainer, the excellent @hadley, would not accept our pull request for no particular reason, nor does he plan to submit to CRAN in the foreseeable future. Hence we were forced into a, hopefully temporary, fork.
- For plyrmr itself, see Releases.
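The memoise step described above can be written out as a short R session, to be repeated on every node. This is a sketch: the install_github call is the one given in the text, while the check that installs devtools from CRAN when it is missing is an added assumption.

```r
# Assumption: devtools may not be installed yet and can be fetched from CRAN.
if (!requireNamespace("devtools", quietly = TRUE)) {
  install.packages("devtools")
}

# Load devtools, as the instructions above require.
library(devtools)

# Install the forked memoise from GitHub (command taken from the text).
install_github("RevolutionAnalytics/memoise")
```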
- A Tutorial