0-install deployments #66

Closed
piccolbo opened this Issue Mar 28, 2012 · 4 comments

Collaborator

piccolbo commented Mar 28, 2012

Many users have complained that they cannot have arbitrary software installed on their Hadoop clusters. Even if they get R installed, the problem moves to the packages, which get installed and updated much more often. So after some coaching by the friendly people at Hortonworks, I think there can be a path to run rmr without installing anything on the cluster. We first need a true binary version of R and all the needed packages. It's important that the cluster be homogeneous (say all RHEL 5 or Ubuntu natty or whatever) and that you can build the necessary binaries on a machine that's like one of the nodes. Then you would create a jar of it and put it at some location in HDFS. It's about 60 MB, so if you are doing big data that should be peanuts. Then when rmr calls streaming, you would specify that jar with the -cacheArchive option and, with the -cmdenv option, point to the R copy in there, or maybe just specify the path to Rscript in the right way. Once we have a proof of concept we can hide the gory details from the user to a degree.
This is the general plan.
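
To make the plan concrete, here is a minimal sketch of how the streaming command line could be assembled. It is not the rmr implementation. The function name, the archive layout, the link name and the choice of environment variable are all assumptions for illustration; only the `-cacheArchive`, `-cmdenv`, `-file`, `-input`, `-output`, `-mapper` and `-reducer` streaming options are taken as given.

```python
def build_streaming_command(streaming_jar, archive_uri, link_name,
                            rscript_in_archive, map_script, reduce_script,
                            input_path, output_path, env=None):
    """Return the argv list for a Hadoop streaming job that runs R from a
    cached archive instead of an R installed on the nodes.

    Hypothetical helper: every parameter name here is an assumption.
    """
    # Hadoop unpacks the archive on each node and symlinks it into the
    # task's working directory under the name given after '#'.
    rscript = "./{}/{}".format(link_name, rscript_in_archive)

    cmd = ["hadoop", "jar", streaming_jar,
           "-cacheArchive", "{}#{}".format(archive_uri, link_name)]

    # Environment variables for the tasks, e.g. to point at the bundled R.
    for name, value in sorted((env or {}).items()):
        cmd += ["-cmdenv", "{}={}".format(name, value)]

    cmd += ["-input", input_path,
            "-output", output_path,
            "-mapper", "{} {}".format(rscript, map_script),
            "-reducer", "{} {}".format(rscript, reduce_script),
            # Ship the R scripts themselves alongside the job.
            "-file", map_script,
            "-file", reduce_script]
    return cmd
```

The list can be passed to `subprocess.call` on a machine with a Hadoop client. Which variable to set with `-cmdenv` (for example `R_HOME`) and where Rscript sits inside the archive depend on how the binary R was built, so treat those values as placeholders.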


piccolbo commented Apr 12, 2012

Time to turn this into real development. I will move the code under tools in the 0-install branch.


piccolbo commented Apr 14, 2012

I deleted the code snippets in here because there is no reason to look at them anymore. The authoritative version is in the 0-install branch under tools/0-install. It looks like we are about to pass the automated tests with this one in the next few minutes. Unfortunately it required some changes in the R code as well, and the merge with dev will be complicated if we want to allow two deploy approaches (pre-install or 0-install). The idea is that you run setup-jar, which will create a jar file with R and all the packages, set one env variable, and you are ready to run mapreduce jobs, no sysadmin involvement required. As it is, it will work only with Ubuntu, but generalizations should be possible. Further dependencies on my specific environment (EC2) may exist. Please report back with any problems if you give it a try.
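
For readers who only want the gist of the packaging step, the following is a rough sketch of bundling an R tree into a jar. It is not the setup-jar script from the branch. The function name and directory layout are hypothetical, and it relies only on the fact that a jar is a zip archive.

```python
import os
import zipfile


def make_bundle_jar(r_root, jar_path):
    """Pack every file under r_root into a jar (zip) archive at jar_path.

    Hypothetical helper for illustration. Returns the number of files added.
    jar_path should live outside r_root so the archive does not include itself.
    """
    count = 0
    with zipfile.ZipFile(jar_path, "w", zipfile.ZIP_DEFLATED) as jar:
        for dirpath, _dirnames, filenames in os.walk(r_root):
            for name in sorted(filenames):
                full = os.path.join(dirpath, name)
                # Store paths relative to r_root so the archive unpacks
                # to bin/, library/, etc. rather than an absolute path.
                jar.write(full, os.path.relpath(full, r_root))
                count += 1
    return count
```

The resulting file could then be copied to the cluster with `hadoop fs -put`. The name of the env variable that tells rmr where the jar lives is defined in the branch and is not reproduced here.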


piccolbo commented Apr 15, 2012

All checks passed. I am not closing this yet, hoping to hear news from other platforms or to have a chance to test other platforms myself.


piccolbo commented Mar 6, 2013

This approach is too hacky, makes too many assumptions, and leaves the cluster in an inconsistent state, so we moved on.

@piccolbo piccolbo closed this Mar 6, 2013
