RHadoop is a collection of three R packages that allow users to manage and analyze data with Hadoop. The packages have been implemented and tested on Cloudera's distribution of Hadoop (CDH3 and CDH4) and R 2.15.0. The packages have also been tested with Revolution R 4.3, 5.0, and 6.0. For rmr, see Compatibility.
RHadoop consists of the following packages:
- rmr - functions providing Hadoop MapReduce functionality in R
- rhdfs - functions providing file management of the HDFS from within R
- rhbase - functions providing database management for the HBase distributed database from within R
- Having problems? Post a message on the RHadoop Google Group.
- Overview of RHadoop, from the Revolution Analytics blog.
- Slides and replay of a 30-minute presentation about RHadoop, "Leveraging R in Hadoop Environments".
- R in a Nutshell, 2nd edition devotes a good part of the last chapter to RHadoop. "The most mature (and best integrated) project for R and Hadoop is RHadoop."
- Learning Resources
- Contribute to the RHadoop project
- Live from the net
Questions: Please participate in our discussion group. For private questions, please use the above email address.
- 12/4/2012 rmr-2.0.2 released with lighter dependencies and multiple bug fixes. See Changelog.
- 10/29/2012 rmr-2.0.1 released with multiple bug fixes and tested against most major Hadoop distros. See Changelog.
- 10/18/2012 rhbase (1.1) added 'filterstring' support for scan operations on HBase tables (HBase 0.92 or later)
- 10/1/2012 rmr-2.0 released, simplest and fastest rmr yet, makes everything vectorized and gives first class status to structured data. See Changelog
- 9/10/2012 branched rmr-2.0 to prepare for the next release. Also provided a tgz file for download. Many changes and documentation still mostly out of date. Try it if you are able to read the source code. Feedback is welcome.
- 7/30/2012 rmr 1.3.1 tested on major Hadoop distros and Rmd docs — see Changelog.
- 7/17/2012 rhdfs (1.0.4) change to handle different classpaths in the init function
- 7/13/2012 rmr 1.3 with vectorized API — see Changelog.
- 5/18/2012 rhdfs (1.0.3) bug fix in function hdfs.file
- 4/14/2012 rhbase (1.0.4) and rhdfs (1.0.2) minor bug fixes and some cleanup for R CMD check
- 3/30/2012 rmr 1.2.2 fixes from.dfs for some obscure platforms and prepares for apache 1.0.2 compatibility (more flexible w.r.t. hadoop layout)
- 3/13/2012 New version of rhbase (1.0.3) that supports both "native" and "raw" serialization
- 2/27/2012 rmr version 1.2 with binary formats and other goodies available — see Changelog.
- 2/11/2012 New version of rhbase (1.0.2) that installs with thrift 0.8 or greater.
- 2/1/2012 Binary format now default in dev, passes all normal checks. Please test.
- 1/24/2012 - Merged branch binary-io into dev. Please note some non-backward-compatible changes in the API, intended to strike a compromise between flexibility and ease of use in the IO department.
- 12/7/2011 - Version 1.1 of the package rmr is available. See the Changelog for details.
- 9/29/2011 - Version 1.0.1 available - fixes some minor defects with R CMD check tests on the packages