RHadoop is a collection of five R packages that allow users to manage and analyze data with Hadoop. The packages are tested before every release on recent releases of the Cloudera and Hortonworks Hadoop distributions, and should be broadly compatible with open source Hadoop and MapR's distribution. We normally test on recent Revolution R/Microsoft R and CentOS releases, but we expect all the RHadoop packages to work on a recent release of open source R and Linux.
RHadoop consists of the following packages:
| Package | Description |
|---|---|
| rhdfs | This package provides basic connectivity to the Hadoop Distributed File System. R programmers can browse, read, write, and modify files stored in HDFS from within R. Install this package only on the node that will run the R client. |
| rhbase | This package provides basic connectivity to the HBase distributed database, using the Thrift server. R programmers can browse, read, write, and modify tables stored in HBase from within R. Install this package only on the node that will run the R client. |
| plyrmr | This package enables the R user to perform common data manipulation operations, as found in popular packages such as |
| rmr2 | A package that allows R developers to perform statistical analysis in R via Hadoop MapReduce functionality on a Hadoop cluster. Install this package on every node in the cluster. |
| ravro | A package that adds the ability to read and write |
- Overview of RHadoop from the Revolution Analytics blog
- Slides and Replay of 30-minute presentation about RHadoop, "Leveraging R in Hadoop Environments"
- R in a Nutshell, 2nd edition devotes a good part of the last chapter to RHadoop. "The most mature (and best integrated) project for R and Hadoop is RHadoop."
- For developers
Problems, Suggestions, Interesting Examples
- Post on the RHadoop Forum
- 02/12/2015 plyrmr 0.6.0 Spark backend now fully functional, some programming-friendly changes and updated tests. See Releases.
- 02/12/2015 rmr 3.3.1 Bug fixes and updated tests. See Releases.
- 12/8/2014 plyrmr 0.5.0. New function `VAR` helps when using plyrmr in programs. More basic data frame functions in their big data version. Transparent caching of intermediate results. Fast aggregation functions for the small groups case. See the Changelog.
- 12/8/2014 rmr 3.3.0. `dfs.ls` and Avro input format, different default Hadoop settings and bug fixes. See the Changelog.
- 8/27/2014 ravro 1.0.4 gives read/write access to Avro files.
- 8/15/2014 plyrmr 0.4.0 brings fast aggregations and swappable backends. See the Changelog.
- 8/15/2014 rmr 3.2.0 released, mostly bug fixes. See the Changelog.
- 7/9/2014 rhbase 1.2.1 released with CDH5 compatibility
- 6/28/2014 rmr 3.1.2 released, adds Windows compatibility. See the Changelog.
- 5/19/2014 plyrmr 0.3.0 released with partial `ungroup`, extension packs and improved `count.cols`, plus a raft of bug fixes. See the Changelog.
- 5/19/2014 rmr 3.1.1 released, a bugfix release. See the Changelog.
- 3/31/2014 plyrmr 0.2.0 released with a simplified API and lots of new features. See the Changelog.
- 3/27/2014 rmr 3.1.0 released. More flexible tmp dir selection, hbase input filters and many bugs squashed. See the Changelog.
- 2/10/2014 rmr 3.0.0 released. Faster, do I need to say more? See the Changelog.
- 11/11/2013 rhdfs 1.0.8 released. Compatibility with Hadoop 3.0.0.
- 10/14/2013 rhdfs 1.0.7 released. Update for HDP 1.3 Windows
- 10/9/2013 plyrmr 0.1.0 is available.
- 10/7/2013 rmr 2.3.0 is available. See the Changelog.
- 9/27/2013 A preview for plyrmr is available.
- 6/18/2013 rmr 2.2.2 released, two bug fixes, one very important, upgrade recommended. See the Changelog.
- 6/21/2013 rhbase 1.2.0 released, adds
- 6/21/2013 rhdfs 1.0.6 released, adds Windows compatibility.
- 6/20/2013 rmr 2.2.1 released, adds Windows compatibility, some speed improvements and bug fixes. See the Changelog.
- 4/18/2013 rmr 2.2.0 released, with flexible I/O formats for equijoins, configurable HDFS tempdir, a more convenient `rmr.str` for debugging, better error messages and many bugfixes. See the Changelog.
- 3/7/2013 rhbase 1.1.1 released, fixes an issue with CR/LF breaking the build on some platforms
- 2/25/2013 rmr 2.1.0 released, improves speed, adds in-memory combiners and more vectorization, status and counters, hbase input and more. See the Changelog.
- 2/5/2013 Created package-specific repos to better support development. See the announcement.
- 12/4/2012 rmr 2.0.2 released with lighter dependencies and multiple bug fixes. See the Changelog.
- 10/29/2012 rmr 2.0.1 released with multiple bug fixes and tested against most major Hadoop distros. See the Changelog.
- 10/18/2012 rhbase (1.1) added 'filterstring' support for scan operations on HBase tables (HBase 0.92 or later)
- 10/1/2012 rmr 2.0 released, simplest and fastest rmr yet, makes everything vectorized and gives first class status to structured data. See the Changelog.
- 9/10/2012 Branched rmr-2.0 to prepare for the next release. Also provided a tgz file for download. There are many changes, and the documentation is still mostly out of date. Try it if you are able to read the source code. Feedback is welcome.
- 7/30/2012 rmr 1.3.1 tested on major Hadoop distros and Rmd docs. See the Changelog.
- 7/17/2012 rhdfs (1.0.4) change to handle different classpaths in the init function
- 7/13/2012 rmr 1.3 with vectorized API. See the Changelog.
- 5/18/2012 rhdfs (1.0.3) bug fix in function hdfs.file
- 4/14/2012 rhbase (1.0.4) and rhdfs (1.0.2) minor bug fixes and some cleanup for R CMD check
- 3/30/2012 rmr 1.2.2 fixes from.dfs for some obscure platforms and prepares for Apache Hadoop 1.0.2 compatibility (more flexible w.r.t. Hadoop layout)
- 3/13/2012 New version of rhbase (1.0.3) that supports both "native" and "raw" serialization
- 2/27/2012 rmr version 1.2 with binary formats and other goodies available. See the Changelog.
- 2/11/2012 New version of rhbase (1.0.2) that installs with thrift 0.8 or greater.
- 2/1/2012 Binary format now default in dev, passes all normal checks. Please test.
- 1/24/2012 - Merged branch binary-io into dev. Please note some non-backward-compatible changes in the API, intended to strike a compromise between flexibility and ease of use in the I/O department.
- 12/7/2011 - Version 1.1 of the package rmr is available. See the Changelog for details.
- 9/29/2011 - Version 1.0.1 available - fixes some minor defects with R CMD check tests on the packages