rmr - Add ability to modify per-job config parameters #31

Closed
jseidman opened this Issue Dec 1, 2011 · 3 comments


@jseidman
jseidman commented Dec 1, 2011

As documented on the wiki, a conscious design decision was made not to provide an interface for setting per-job config parameters. However, there are scenarios where developers need to set job-specific parameters on the jobconf, such as mapred.reduce.tasks. Functionality should be added to provide this ability.

@piccolbo
Collaborator
piccolbo commented Dec 1, 2011

Hi, I propose this compromise between API simplicity and the ability to control Hadoop behaviour (slightly different from our initial conversation, but not in substantial ways):

  1. There is one catch-all option for tuning parameters, called tuning.params.
  2. It is organized as a named list of named lists: the first level has backends as names and lists as values; the nested lists can be any named lists.
  3. Each backend reads only the elements of the outer list that are named after it, so that hadoop will ignore options meant for local and vice versa.
  4. The interpretation of backend-specific options is, indeed, left to the backend, under the only constraint that mapreduce(x, y, z, tuning.params) computes the same results (the order is not important) no matter what the value of tuning.params is. Any other use of the option would be considered incorrect. Tests may be added to enforce this.
  5. For the hadoop streaming backend the nested list would be construed as a command-line addition; that is, if the argument is list(hadoop = list(number.of.reducers = 100, number.of.mappers = 200)), then when the hadoop streaming backend is used the following options would be appended to the command line:
    --number.of.reducers 100 --number.of.mappers 200
    or whatever the appropriate syntax for streaming is. If the value of an option is TRUE, it would be used as a switch with no option value; if FALSE, it would be skipped. (See the sketch after this list.)
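To make rule 5 concrete, here is a minimal sketch (not rmr's actual internals) of how a streaming backend might translate its slice of tuning.params into a command-line addition; the "--name value" flag syntax and the function name tuning.to.cmd.args are assumptions for illustration:

tuning.to.cmd.args <- function(tuning.params, backend = "hadoop") {
  # Rule 3: read only the elements of the outer list named after this backend
  opts <- tuning.params[[backend]]
  if (is.null(opts)) return("")
  args <- mapply(
    function(name, value) {
      if (isTRUE(value)) paste0("--", name)      # TRUE: bare switch, no value
      else if (identical(value, FALSE)) NULL     # FALSE: skip the option entirely
      else paste0("--", name, " ", value)        # anything else: flag plus value
    },
    names(opts), opts)
  paste(unlist(args), collapse = " ")
}

tuning.to.cmd.args(list(hadoop = list(number.of.reducers = 100, number.of.mappers = 200)))
# "--number.of.reducers 100 --number.of.mappers 200"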

One aspect I am still uncertain about is whether it would be more useful for people to set this on a per-job basis or as package options. In the specific case you suggest, it seems reasonable to write

library(rmr)

rmr.tuning.options(hadoop = list(mapred.reduce.tasks = 0.9 * reduce.slots.in.your.cluster))

mapreduce(...)
mapreduce(...)

This also has the advantage of keeping the program logic cleaner.
Other options may be best set per job, as in

mapreduce(..., tuning.options = list(hadoop = list(some.options = some.value)))

We could do both and have the package options act as defaults for the per-job ones. It just makes things a bit more complicated (for me to write).
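A hypothetical sketch of that "both" variant, assuming per-job options override package-level defaults (modifyList does a recursive merge of named lists; the variable names below are illustrative):

defaults <- list(hadoop = list(mapred.reduce.tasks = 50))
per.job  <- list(hadoop = list(mapred.reduce.tasks = 10, mapred.map.tasks = 20))

# Per-job values win; defaults fill in anything the job does not set
effective <- modifyList(defaults, per.job)
# effective$hadoop$mapred.reduce.tasks is now 10; mapred.map.tasks (20) is added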

I would like to solicit comments on this ticket and the proposed solution. I have to warn that we will be unable to include this in 1.1, but it will go into dev ASAP. Thanks.

@piccolbo
Collaborator
piccolbo commented Dec 8, 2011

I went for the per-job setting. It is not cast in stone. Use with care. Eventually I would like to integrate something like https://www.cs.duke.edu/~shivnath/amr.html in the spirit of taking low level details away from the user, but until then this will have to do. In dev now.

@piccolbo piccolbo closed this Dec 8, 2011
@jseidman
jseidman commented Dec 9, 2011

I think per-job is an acceptable solution, and it corresponds to how this is supported in other interfaces (e.g. streaming, RHIPE). I'll go ahead and test this out.
