This R package allows an R programmer to perform statistical analysis via MapReduce on a Hadoop cluster.
Check the package's DESCRIPTION file, Depends: line, for the most up-to-date list of dependencies. The suggested quickcheck is needed only for testing, and a link to it can be found on its repo.
Install the package from the downloaded tarball with R CMD INSTALL rmr2_<specific version>.tar.gz. rmr2 is not available on CRAN.
Make sure that the environment variables HADOOP_CMD and HADOOP_STREAMING are properly set. The former should point to the main hadoop command, the latter to the streaming jar, a file called something like hadoop-streaming*.jar that is part of most Hadoop distributions. For some distributions, HADOOP_HOME is still sufficient for R to find everything that's needed, so if that works for you, you can keep it that way, but it is no longer recommended. Optionally, you can set HDFS_CMD if rmr can't find the hdfs executable; not finding it only results in some deprecation warnings. Its value should be the path to the hdfs executable. For example:
export HADOOP_CMD=/usr/bin/hadoop
export HADOOP_STREAMING=/usr/lib/hadoop/contrib/streaming/hadoop-streaming-<version>.jar
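As a quick sanity check on the settings above, the following is a minimal shell sketch that reports whether HADOOP_CMD and HADOOP_STREAMING are set and point at existing files, and validates HDFS_CMD only when it is set. The function name check_rmr_env is hypothetical and not part of rmr2; the sketch only inspects the environment and does not call Hadoop.

```shell
#!/bin/sh
# Hypothetical helper (not part of rmr2): verify the environment
# variables described above before loading the package.
check_rmr_env() {
    rc=0
    # Both of these are expected to be set and to name an existing file.
    for name in HADOOP_CMD HADOOP_STREAMING; do
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "$name is not set"
            rc=1
        elif [ ! -e "$value" ]; then
            echo "$name points to a missing file: $value"
            rc=1
        else
            echo "$name ok: $value"
        fi
    done
    # HDFS_CMD is optional, so only validate it when it is set.
    if [ -n "${HDFS_CMD:-}" ] && [ ! -e "$HDFS_CMD" ]; then
        echo "HDFS_CMD points to a missing file: $HDFS_CMD"
        rc=1
    fi
    return $rc
}

# Run the check; a failure here only prints a hint and does not abort.
check_rmr_env || echo "Fix the settings above before loading rmr2."
```

Running this in the same shell session (or startup file) as the export lines catches a mistyped path before an rmr2 job does.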
Because of the variability between distros, we have collected some observations about these settings in a dedicated page.
For people who use RPMs for their deployments, courtesy of jseidman, we have RPMs for rmr and its dependencies. These RPMs are available in this repo: https://github.com/jseidman/pkgs. Note that currently there are only CentOS 5.5 64-bit RPMs, but the source files to create the RPMs are in the same repo, so it should be easy to build for other RH-based distros. jseidman reports using RPMs along with Puppet to deploy all packages, applications, etc. to their (Orbitz) Hadoop clusters.
For people who use EC2 (not EMR), the source package includes, under the tools directory, a Whirr script to fire up an EC2 rmr cluster.
If you use Globus Provision, check out https://github.com/nbest937/gp-rhadoop (very alpha as of this edit), courtesy of nbest.
MapR provides specific instructions for their distribution of Hadoop.
Last edited by Antonio Piccolboni,