j-martens edited this page Nov 24, 2015 · 129 revisions

RHadoop is a collection of five R packages that allow users to manage and analyze data with Hadoop. The packages are tested before every release on recent releases of the Cloudera and Hortonworks Hadoop distributions, and should be broadly compatible with open source Hadoop and MapR's distribution. We normally test on recent Revolution R/Microsoft R and CentOS releases, but we expect all the RHadoop packages to work on a recent release of open source R and Linux.

RHadoop consists of the following packages:

Package Name | Description
rhdfs | Provides basic connectivity to the Hadoop Distributed File System. R programmers can browse, read, write, and modify files stored in HDFS from within R. Install this package only on the node that will run the R client.
rhbase | Provides basic connectivity to the HBase distributed database, using the Thrift server. R programmers can browse, read, write, and modify tables stored in HBase from within R. Install this package only on the node that will run the R client.
plyrmr | Enables the R user to perform common data manipulation operations, as found in popular packages such as plyr and reshape2, on very large data sets stored on Hadoop. Like rmr2, it relies on Hadoop MapReduce to perform its tasks, but it provides a familiar plyr-like interface while hiding many of the MapReduce details. Install this package on every node in the cluster.
rmr2 | Allows R developers to perform statistical analysis in R via Hadoop MapReduce functionality on a Hadoop cluster. Install this package on every node in the cluster.
ravro | Adds the ability to read and write Avro files on local and HDFS file systems, and adds an Avro input format for rmr2. Install this package only on the node that will run the R client.

 

  More Information

 

Problems, Suggestions, Interesting Examples

 

News/Notes

  • 02/12/2015 plyrmr 0.6.0 Spark backend now fully functional, some programming-friendly changes and updated tests. See Releases.
  • 02/12/2015 rmr 3.3.1 Bug fixes and updated tests. See Releases.
  • 12/8/2014 plyrmr 0.5.0. New function VAR helps using plyrmr in programs. More basic data frame functions in their big data version. Transparent caching of intermediate results. Fast aggregation functions for the small groups case. See the Changelog.
  • 12/8/2014 rmr 3.3.0 dfs.ls and Avro input format, different default Hadoop settings and bug fixes. See the Changelog.
  • 8/27/2014 ravro 1.0.4 gives read/write access to Avro files
  • 8/15/2014 plyrmr 0.4.0 brings fast aggregations and swappable backends. See the Changelog.
  • 8/15/2014 rmr 3.2.0 released, mostly bug fixes. See the Changelog.
  • 7/9/2014 rhbase 1.2.1 released with CDH5 compatibility
  • 6/28/2014 rmr 3.1.2 released, adds Windows compatibility. See the Changelog.
  • 5/19/2014 plyrmr 0.3.0 released with partial ungroup, extension packs and improved quantile.cols and count.cols, plus a raft of bug fixes. See the Changelog.
  • 5/19/2014 rmr 3.1.1 released, a bugfix release. See the Changelog.
  • 3/31/2014 plyrmr 0.2.0 released with a simplified API and lots of new features. See the Changelog.
  • 3/27/2014 rmr 3.1.0 released. More flexible tmp dir selection, hbase input filters and many bugs squashed. See the Changelog.
  • 2/10/2014 rmr 3.0.0 released. Faster, do I need to say more? See the Changelog.
  • 11/11/2013 rhdfs 1.0.8 released. Compatibility with Hadoop 2.2.0.
  • 10/14/2013 rhdfs 1.0.7 released. Update for HDP 1.3 Windows
  • 10/9/2013 plyrmr 0.1.0 is available.
  • 10/7/2013 rmr 2.3.0 is available. See the Changelog.
  • 9/27/2013 A preview for plyrmr is available.
  • 6/18/2013 rmr 2.2.2 released, two bug fixes, one very important, upgrade recommended. See the Changelog.
  • 6/21/2013 rhbase 1.2.0 released, adds character serialization, fixes raw.
  • 6/21/2013 rhdfs 1.0.6 released, adds Windows compatibility.
  • 6/20/2013 rmr 2.2.1 released, adds Windows compatibility, some speed improvements and bug fixes. See the Changelog.
  • 4/18/2013 rmr 2.2.0 released, with flexible I/O formats for equijoins, configurable HDFS tempdir, a more convenient rmr.str for debugging, better error messages and many bugfixes. See the Changelog.
  • 3/7/2013 rhbase 1.1.1 released, fixes an issue with CR/LF breaking the build on some platforms
  • 2/25/2013 rmr 2.1.0 released, improves speed, adds in-memory combiners and more vectorization, status and counters, hbase input and more. See the Changelog.
  • 2/5/2013 Created package-specific repos to better support development. See the announcement.
  • 12/4/2012 rmr 2.0.2 released with lighter dependencies and multiple bug fixes. See the Changelog.
  • 10/29/2012 rmr 2.0.1 released with multiple bug fixes and tested against most major Hadoop distros. See the Changelog.
  • 10/18/2012 rhbase (1.1) added 'filterstring' support for scan operations on HBase tables (HBase 0.92 or later)
  • 10/1/2012 rmr 2.0 released, simplest and fastest rmr yet, makes everything vectorized and gives first class status to structured data. See the Changelog.
  • 9/10/2012 branched rmr-2.0 to prepare for the next release. Also provided a tgz file for download. Many changes and documentation still mostly out of date. Try it if you are able to read the source code. Feedback is welcome.
  • 7/30/2012 rmr 1.3.1 tested on major Hadoop distros and Rmd docs — See the Changelog.
  • 7/17/2012 rhdfs (1.0.4) change to handle different classpaths in the init function
  • 7/13/2012 rmr 1.3 with vectorized API — See the Changelog.
  • 5/18/2012 rhdfs (1.0.3) bug fix in function hdfs.file
  • 4/14/2012 rhbase (1.0.4) and rhdfs (1.0.2) minor bug fixes and some cleanup for R CMD check
  • 3/30/2012 rmr 1.2.2 fixes from.dfs for some obscure platforms and prepares for apache 1.0.2 compatibility (more flexible w.r.t. hadoop layout)
  • 3/13/2012 New version of rhbase (1.0.3) that supports both "native" and "raw" serialization
  • 2/27/2012 rmr version 1.2 with binary formats and other goodies available — See the Changelog.
  • 2/11/2012 New version of rhbase (1.0.2) that installs with thrift 0.8 or greater.
  • 2/1/2012 Binary format now default in dev, passes all normal checks. Please test.
  • 1/24/2012 - Merged branch binary-io into dev. Please note some non-backward compatible changes in the API intended to strike a compromise between flexibility and ease of use in the IO department.
  • 12/7/2011 - Version 1.1 of the package rmr is available. See the Changelog for details.
  • 9/29/2011 - Version 1.0.1 available - fixes some minor defects with R CMD check tests on the packages