RHadoop is a collection of five R packages that allow users to manage and analyze data with Hadoop. The packages are regularly tested (and always before a release) on recent releases of the Cloudera and Hortonworks Hadoop distributions and should have broad compatibility with open source Hadoop and MapR's distribution. We normally test on recent Revolution R and CentOS releases, but we expect all the RHadoop packages to work on a recent release of open source R and Linux.
RHadoop consists of the following packages:
- NEW! ravro - read and write files in Avro format
- plyrmr - higher level plyr-like data processing for structured data, powered by
- rmr - functions providing Hadoop MapReduce functionality in R
- rhdfs - functions providing file management of the HDFS from within R
- rhbase - functions providing database management for the HBase distributed database from within R
- Problems, suggestions, interesting examples? Post a message on the RHadoop Google Group
- Overview of RHadoop, from the Revolution Analytics blog.
- Slides and Replay of 30-minute presentation about RHadoop, "Leveraging R in Hadoop Environments".
- R in a Nutshell, 2nd edition devotes a good part of the last chapter to RHadoop. "The most mature (and best integrated) project for R and Hadoop is RHadoop."
- For developers
Questions: Please participate in our discussion group.
- 02/12/2015 plyrmr 0.6.0 Spark backend now fully functional, some programming-friendly changes and updated tests. See Releases.
- 02/12/2015 rmr 3.3.1 Bug fixes and updated tests. See Releases.
- 12/8/2014 plyrmr 0.5.0. New function `VAR` helps with using plyrmr in programs. More basic data frame functions in their big data version. Transparent caching of intermediate results. Fast aggregation functions for the small groups case. See the Changelog.
- 12/8/2014 rmr 3.3.0. `dfs.ls` and Avro input format, different default Hadoop settings and bug fixes. See the Changelog.
- 8/27/2014 ravro 1.0.4 gives read/write access to Avro files
- 8/15/2014 plyrmr 0.4.0 brings fast aggregations and swappable backends. See the Changelog.
- 8/15/2014 rmr 3.2.0 released, mostly bug fixes. See the Changelog.
- 7/9/2014 rhbase 1.2.1 released with CDH5 compatibility
- 6/28/2014 rmr 3.1.2 released, adds Windows compatibility. See the Changelog.
- 5/19/2014 plyrmr 0.3.0 released with partial `ungroup`, extension packs and improved `count.cols`, plus a raft of bug fixes. See the Changelog.
- 5/19/2014 rmr 3.1.1 released, a bugfix release. See the Changelog.
- 3/31/2014 plyrmr 0.2.0 released with a simplified API and lots of new features. See the Changelog.
- 3/27/2014 rmr 3.1.0 released. More flexible tmp dir selection, hbase input filters and many bugs squashed. See the Changelog.
- 2/10/2014 rmr 3.0.0 released. Faster, do I need to say more? See the Changelog.
- 11/11/2013 rhdfs 1.0.8 released. Compatibility with Hadoop 2.x.
- 10/14/2013 rhdfs 1.0.7 released. Update for HDP 1.3 Windows
- 10/9/2013 plyrmr 0.1.0 is available.
- 10/7/2013 rmr 2.3.0 is available. See the Changelog.
- 9/27/2013 A preview for plyrmr is available.
- 6/18/2013 rmr 2.2.2 released, two bug fixes, one very important, upgrade recommended. See the Changelog.
- 6/21/2013 rhbase 1.2.0 released, adds
- 6/21/2013 rhdfs 1.0.6 released, adds Windows compatibility.
- 6/20/2013 rmr 2.2.1 released, adds Windows compatibility, some speed improvements and bug fixes. See the Changelog.
- 4/18/2013 rmr 2.2.0 released, with flexible I/O formats for equijoins, configurable HDFS tempdir, a more convenient `rmr.str` for debugging, better error messages and many bug fixes. See the Changelog.
- 3/7/2013 rhbase 1.1.1 released, fixes an issue with CR/LF breaking the build on some platforms
- 2/25/2013 rmr 2.1.0 released, improves speed, adds in-memory combiners and more vectorization, status and counters, hbase input and more. See the Changelog.
- 2/5/2013 Created package-specific repos to better support development. See the announcement.
- 12/4/2012 rmr 2.0.2 released with lighter dependencies and multiple bug fixes. See the Changelog.
- 10/29/2012 rmr 2.0.1 released with multiple bug fixes and tested against most major Hadoop distros. See the Changelog.
- 10/18/2012 rhbase (1.1) added 'filterstring' support for scan operations on HBase tables (HBase 0.92 or later)
- 10/1/2012 rmr 2.0 released, simplest and fastest rmr yet, makes everything vectorized and gives first class status to structured data. See the Changelog.
- 9/10/2012 branched rmr-2.0 to prepare for the next release. Also provided a tgz file for download. Many changes and documentation still mostly out of date. Try it if you are able to read the source code. Feedback is welcome.
- 7/30/2012 rmr 1.3.1 tested on major Hadoop distros and Rmd docs — See the Changelog.
- 7/17/2012 rhdfs (1.0.4) change to handle different classpaths in the init function
- 7/13/2012 rmr 1.3 with vectorized API — See the Changelog.
- 5/18/2012 rhdfs (1.0.3) bug fix in function hdfs.file
- 4/14/2012 rhbase (1.0.4) and rhdfs (1.0.2) minor bug fixes and some cleanup for R CMD check
- 3/30/2012 rmr 1.2.2 fixes from.dfs for some obscure platforms and prepares for Apache Hadoop 1.0.2 compatibility (more flexible w.r.t. Hadoop layout)
- 3/13/2012 New version of rhbase (1.0.3) that supports both "native" and "raw" serialization
- 2/27/2012 rmr version 1.2 with binary formats and other goodies available — See the Changelog.
- 2/11/2012 New version of rhbase (1.0.2) that installs with thrift 0.8 or greater.
- 2/1/2012 Binary format now default in dev, passes all normal checks. Please test.
- 1/24/2012 - Merged branch binary-io into dev. Please note some non-backward compatible changes in the API intended to strike a compromise between flexibility and ease of use in the IO department.
- 12/7/2011 - Version 1.1 of the package rmr is available. See the Changelog for details.
- 9/29/2011 - Version 1.0.1 available - fixes some minor defects with R CMD check tests on the packages