This R package allows an R programmer to perform statistical analysis via MapReduce on a Hadoop cluster.
Prerequisites and installation
- A Hadoop cluster: CDH3 and higher or Apache 1.0.2 and higher, but limited to mr1, not mr2. Compatibility with mr2 starts with Apache 2.2.0 or HDP2. For configuration suggestions, see Memory management in rmr2.
- R installed on each node of the cluster (R 2.14.1 or newer). Revolution R Community 4.3 or higher can also be used, provided you create a symbolic link from /usr/bin/Revoscript to /usr/bin/Rscript.
- Install the required R packages on each node. Check the DESCRIPTION file, Imports lines, for the most up-to-date list of dependencies. The suggested testthat (available on CRAN) and rhdfs are needed only for testing. ravro is needed only for testing or to use the avro input format.
- rmr2 itself needs to be installed on each node. Download it from the Release page and then, at the shell prompt, enter:
    R CMD INSTALL rmr2_<specific version>.tar.gz
  Note that rmr2 is not available on CRAN.
- Make sure that, on every node, the packages are installed in a default location accessible to all users (R will run on the cluster as a different user from the one who started the R interpreter where the mapreduce calls were executed).
- Make sure that the environment variables HADOOP_CMD and HADOOP_STREAMING are properly set. The former should point to the main hadoop command, the latter to the streaming jar, a file called something like hadoop-streaming*.jar that is part of most Hadoop distributions. For some distributions, HADOOP_HOME is still sufficient for R to find everything it needs, so if that works for you, you can keep it that way, but it is no longer recommended. Optionally, you can set HDFS_CMD if rmr2 can't find the hdfs executable; a missing hdfs executable only results in some deprecation warnings. Its value should be the path to the hdfs executable. For example:
    export HADOOP_CMD=/usr/bin/hadoop
    export HADOOP_STREAMING=/usr/lib/hadoop/contrib/streaming/hadoop-streaming-<version>.jar
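Since a wrong path in HADOOP_CMD or HADOOP_STREAMING tends to surface only when a job is submitted, it can help to check both variables up front. The following is a minimal sketch of such a check, not something shipped with rmr2: the function name check_rmr2_env and the variable rmr2_env_status are hypothetical, and the script assumes only a POSIX shell.

```shell
# Hypothetical preflight helper (not part of rmr2).
# Prints one line per problem and returns non-zero if any is found.
check_rmr2_env() {
    rmr2_env_status=0

    # HADOOP_CMD should point to the main hadoop command (an executable file).
    if [ -z "${HADOOP_CMD:-}" ]; then
        echo "HADOOP_CMD is not set"
        rmr2_env_status=1
    elif [ ! -x "$HADOOP_CMD" ]; then
        echo "HADOOP_CMD is not an executable file: $HADOOP_CMD"
        rmr2_env_status=1
    fi

    # HADOOP_STREAMING should point to the streaming jar (a readable file).
    if [ -z "${HADOOP_STREAMING:-}" ]; then
        echo "HADOOP_STREAMING is not set"
        rmr2_env_status=1
    elif [ ! -r "$HADOOP_STREAMING" ]; then
        echo "HADOOP_STREAMING is not a readable file: $HADOOP_STREAMING"
        rmr2_env_status=1
    fi

    return "$rmr2_env_status"
}
```

After exporting the two variables, calling check_rmr2_env in the same shell prints nothing and returns 0 when both paths look usable. It only inspects the local node, so it would need to be run on each node where the variables matter.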
For people who use RPMs for their deployments, courtesy of jseidman, we have RPMs for rmr and its dependencies. These RPMs are available in this repo: https://github.com/jseidman/pkgs. Note that currently there are only CentOS 5.5 64-bit RPMs, but the source files to create the RPMs are in the same repo, so it should be easy to build for other RH-based distros. jseidman reports using RPMs along with Puppet to deploy all packages, applications, etc. to their (Orbitz) Hadoop clusters.
For people who use EC2 (not EMR), in the source package under the tools directory there is a whirr script to fire up an EC2 rmr cluster.
If you use Globus Provision, check out https://github.com/nbest937/gp-rhadoop (very alpha as of this edit), courtesy of nbest.
MapR provides specific instructions for their distribution of Hadoop.