

Antonio Piccolboni edited this page Feb 12, 2015 · 6 revisions

 

Overview

This R package enables the R user to perform common data manipulation operations, as found in popular packages such as plyr and reshape2, on very large data sets stored on Hadoop. Like rmr, it relies on Hadoop mapreduce to perform its tasks, but it provides a familiar plyr-like interface while hiding many of the mapreduce details. plyrmr provides:

  • Hadoop-capable equivalents of well known data.frame functions: transmute and bind.cols generalize over transform and summarize; select from dplyr; melt and dcast from reshape2; sampling, quantiles, counting and more.
  • Simple but powerful ways of applying many functions operating on data frames to Hadoop data sets: gapply and magic.wand.
  • Simple but powerful ways to group data: group, group.f, gather and ungroup.
  • All of the above can be combined by normal functional composition: delayed evaluation helps mitigate any performance penalty of doing so by minimizing the number of Hadoop jobs launched to evaluate an expression.
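As a rough illustration of composition with delayed evaluation, the sketch below groups a data set and summarizes each group. It assumes a data set with the columns of mtcars already stored at the hypothetical path /tmp/mtcars, and it uses input and as.data.frame (not described above) to refer to the stored data and to trigger the computation; treat the exact calls as assumptions to check against the package documentation.

```r
# A minimal sketch, not a definitive recipe.
# Assumption: a data set shaped like mtcars already exists at the
# hypothetical path /tmp/mtcars on the file system used by the backend.
library(rmr2)    # assumption: the rmr prerequisite is installed as package rmr2
library(plyrmr)

# Optional: run without a cluster while experimenting.
rmr.options(backend = "local")

# Refer to the stored data; nothing is computed at this point.
big.mtcars = input("/tmp/mtcars")

# Compose grouping and summarization; evaluation is still delayed.
mean.mpg.by.cyl = transmute(group(big.mtcars, cyl), mean.mpg = mean(mpg))

# Asking for a data frame triggers the computation.
as.data.frame(mean.mpg.by.cyl)
```

Because the intermediate objects only describe the computation, the grouping and the summarization can be evaluated together rather than as separate jobs.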

Status

The current version has a major release number of zero (0.x.y). As the numbering suggests, the package should be considered a work in progress, and the API is not yet set in stone. We seek feedback at an early stage to drive further development. This package has a GitHub repo; please feel free to open an issue there to discuss problems, existing or missing features, or anything else that requires an answer from the developers. For general discussions, head to the RHadoop forum.

Prerequisites and installation

  • rmr 3.2.0 or higher.
  • plyrmr installed on each node of a Hadoop cluster together with its dependencies (see the Depends: and Imports: lines of the DESCRIPTION file). The package memoise requires special instructions: first load the package devtools, then issue this command at the R prompt: install_github("RevolutionAnalytics/memoise"). The reason is that its maintainer, the excellent @hadley, would not accept our pull request, for no particular reason, nor does he plan to submit to CRAN in the foreseeable future. Hence we were forced into a, hopefully temporary, fork.
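The memoise step above can be written out as a short R session. This is a sketch that assumes devtools is already installed and that the node can reach GitHub; it would need to be repeated on each node of the cluster.

```r
# Assumption: devtools is already installed; if not, something like
# install.packages("devtools") would be needed first.
library(devtools)

# Install the fork of memoise named in the instructions above.
install_github("RevolutionAnalytics/memoise")
```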

To download plyrmr see Releases.

Contents