Many users have complained that they cannot have arbitrary software installed on their Hadoop clusters. Even if they get R installed, the problem moves to the packages, which are installed and updated much more often. So, after some coaching by the friendly people at Hortonworks, I think there is a path to running rmr without installing anything on the cluster. We first need a true binary build of R and all the needed packages. It's important that the cluster be homogeneous (say, all RHEL 5, or all Ubuntu Natty, or whatever) and that you can build the necessary binaries on a machine that matches the nodes. You would then create a jar of that build and put it at some location in HDFS. It's about 60 MB, so if you are doing big data that should be peanuts. Then, when rmr calls streaming, you would specify that jar with the -cacheArchive option and, via the -cmdenv option, point to the copy of R inside it, or maybe just specify the path to Rscript in the right way. Once we have a proof of concept we can hide the gory details from the user to a degree.
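The streaming invocation described above might look roughly like the sketch below. The archive name, HDFS path, mapper script, and the "Rdist" symlink name are all assumptions (the thread doesn't fix them); the snippet only composes and prints the command rather than running it, since an actual run needs a cluster:

```shell
# Sketch only: R-bundle.jar, /apps, mapper.R and the "Rdist" link name are assumptions.
# -cacheArchive unpacks the jar in each task's working directory and symlinks it
# under the name given after the '#'.
ARCHIVE="hdfs:///apps/R-bundle.jar#Rdist"

CMD="hadoop jar \$HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
  -cacheArchive $ARCHIVE \
  -cmdenv PATH=Rdist/bin:\$PATH \
  -mapper 'Rdist/bin/Rscript mapper.R' \
  -input /in -output /out"

# Print the composed command instead of executing it.
echo "$CMD"
```

The `#Rdist` fragment is what lets the tasks find the unpacked R tree at a stable relative path, so `-cmdenv` (or the mapper command itself) can reference `Rdist/bin/Rscript` without knowing anything about the node's filesystem.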
This is the general plan.
Time to turn this into real development; I will move the code under tools in the 0-install branch.
I deleted the code snippets from this thread because there is no reason to look at them anymore; the authoritative version is in the 0-install branch under tools/0-install. It looks like we are about to pass the automated tests with this one in the next few minutes. Unfortunately it required some changes to the R code as well, and the merge with dev will be complicated if we want to support two deploy approaches (pre-install or 0-install). The idea is that you run setup-jar, which creates a jar file with R and all the packages, set one environment variable, and you are ready to run mapreduce jobs, with no sysadmin involvement required. As it stands it will work only on Ubuntu, but generalizations should be possible. Further dependencies on my specific environment (EC2) may exist; please report back with any problems if you give it a try.
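The intended user workflow, as described above, could be sketched like this. Only `setup-jar` is named in the thread; the environment-variable name, HDFS path, and job script are hypothetical placeholders:

```shell
# Workflow sketch. Everything except setup-jar is an assumed name.
set -e

# 1. Build the bundle once, on a machine matching the cluster nodes:
#      ./setup-jar        # creates the jar with R and all packages
#    (commented out: requires a node-like build machine)

# 2. Point rmr at the uploaded archive (variable name is hypothetical):
export RMR_R_ARCHIVE="hdfs:///apps/R-bundle.jar"

# 3. Run jobs as usual; rmr would read the variable and pass the archive
#    to streaming via -cacheArchive on its own:
#      Rscript my-job.R
#    (commented out: requires a running cluster)

echo "archive set to: $RMR_R_ARCHIVE"
```

The point of the single environment variable is that steps 2 and 3 need no root access and no sysadmin: everything node-local comes out of the cached archive at task launch.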
All checks passed. I am not closing this yet, hoping to hear news from other platforms, or to have a chance to test other platforms myself.
Just to note that we are passing all tests at this time; see #66.
This approach is too hacky, makes too many assumptions, and leaves the cluster in an inconsistent state; we have moved on.