This R package allows an R programmer to perform statistical analysis via MapReduce on a Hadoop cluster.
Check the Imports lines of the package's DESCRIPTION file for the most up to date list of dependencies.
The suggested testthat (available on CRAN) and rhdfs are needed only for testing.
ravro is needed only for testing or to use the avro input format.
rmr2 itself needs to be installed on each node. Download it from the Release page and then, at the shell prompt, enter R CMD INSTALL rmr2_<specific version>.tar.gz. rmr2 is not available on CRAN.
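The per-node installation described above could be scripted along the following lines. This is only a sketch under stated assumptions: the function name install_rmr2_on_nodes, the DRY_RUN variable, the /tmp staging directory, and password-less scp/ssh access to every node are hypothetical and not part of rmr2.

```shell
# Sketch: copy an rmr2 tarball to each node and install it there.
# install_rmr2_on_nodes, DRY_RUN and the /tmp staging path are illustrative
# assumptions, not something rmr2 provides.

install_rmr2_on_nodes() {
  local tarball="$1"
  shift
  local base
  base="$(basename "$tarball")"
  local node
  for node in "$@"; do
    if [ "${DRY_RUN:-0}" = "1" ]; then
      # Dry run: only print the commands that would be executed.
      echo "scp $tarball $node:/tmp/$base"
      echo "ssh $node R CMD INSTALL /tmp/$base"
    else
      # Copy the tarball to the node, then install it with R on that node.
      scp "$tarball" "$node:/tmp/$base" || return 1
      ssh "$node" "R CMD INSTALL /tmp/$base" || return 1
    fi
  done
}
```

Setting DRY_RUN=1 before calling the function with the tarball path followed by the node names prints the scp and ssh commands without running them, which is a cheap way to review the plan before touching the cluster.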
Make sure that the environment variables HADOOP_CMD and HADOOP_STREAMING are properly set. The former should point to the main hadoop command, the latter to the streaming jar, a file called something like hadoop-streaming*.jar that is part of most hadoop distributions. For some distributions, HADOOP_HOME is still sufficient for R to find everything that's needed, so if that works for you, you can keep it that way, but it is no longer recommended. Optionally, you can set HDFS_CMD if rmr2 can't find the hdfs executable, which only results in some deprecation warnings. Its value should be the path to the hdfs executable. A possible setup is:
    export HADOOP_CMD=/usr/bin/hadoop
    export HADOOP_STREAMING=/usr/lib/hadoop/contrib/streaming/hadoop-streaming-<version>.jar
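Before starting R, it can help to confirm that these variables point at something real. The following is a minimal sketch, assuming a POSIX-like shell; check_rmr2_env is a hypothetical helper name and is not provided by rmr2 or Hadoop.

```shell
# Sketch: sanity-check the Hadoop-related environment variables.
# check_rmr2_env is a hypothetical helper, not part of rmr2.

check_rmr2_env() {
  local status=0
  # HADOOP_CMD should be set and point to an executable file.
  if [ -z "${HADOOP_CMD:-}" ] || [ ! -x "$HADOOP_CMD" ]; then
    echo "HADOOP_CMD is unset or does not point to an executable" >&2
    status=1
  fi
  # HADOOP_STREAMING should be set and point to an existing file (the jar).
  if [ -z "${HADOOP_STREAMING:-}" ] || [ ! -f "$HADOOP_STREAMING" ]; then
    echo "HADOOP_STREAMING is unset or does not point to a file" >&2
    status=1
  fi
  return "$status"
}
```

Calling check_rmr2_env after the export lines returns 0 when both variables look usable and a non-zero status otherwise, with a message on standard error for each problem found. It only checks that the paths exist, not that the jar is a working streaming jar.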
For people who use RPMs for their deployments, RPMs for rmr and its dependencies are available, courtesy of jseidman, in this repo: https://github.com/jseidman/pkgs. Currently there are only CentOS 5.5 64-bit RPMs, but the source files to create the RPMs are in the same repo, so it should be easy to build them for other RH-based distros. jseidman reports using RPMs along with Puppet to deploy all packages, applications, etc. to their (Orbitz) Hadoop clusters.
For people who use EC2 (not EMR), the source package includes, under the tools directory, a Whirr script to fire up an EC2 rmr cluster.
If you use Globus Provision, check out https://github.com/nbest937/gp-rhadoop (very alpha as of this edit), courtesy of nbest.
MapR provides specific instructions for their distribution of Hadoop.