As documented on the wiki, a conscious design decision was made not to provide an interface for setting per-job config parameters. However, there are scenarios where developers need to set job-specific parameters on the jobconf, such as mapred.reduce.tasks. Functionality should be added to provide this ability.
Hi, I propose this compromise between API simplicity and the ability to control Hadoop behaviour (slightly different from our initial conversation, though not in substantial ways):
One aspect I am still uncertain about is whether it would be more useful for people to set this on a per-job basis or as package options. In the specific case you suggest, it seems reasonable to write
rmr.tuning.options(hadoop = list(mapred.reduce.tasks = 0.9 * reduce.slots.in.your.cluster))
This also has the advantage of keeping the program logic cleaner.
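For concreteness, here is a minimal sketch of how such a package-level setter could work, assuming it is built on R's own options() mechanism; the function name rmr.tuning.options is taken from the line above, and everything else is hypothetical rather than actual rmr code:

rmr.tuning.options <- function(...) {
  new <- list(...)
  # with no arguments, act as a getter for the current defaults
  if (length(new) == 0)
    return(getOption("rmr.tuning.options", list()))
  current <- getOption("rmr.tuning.options", list())
  # merge the new settings over the existing ones and store them globally
  options(rmr.tuning.options = modifyList(current, new))
  invisible(current)
}

# set a cluster-wide default once, near the top of a script
rmr.tuning.options(hadoop = list(mapred.reduce.tasks = 18))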
Other options may be best set per-job, as in
mapreduce(..., tuning.options = list(hadoop = list(some.options = some.value)))
We could do both and have the package options act as defaults for the per-job ones. It just makes things a bit more complicated (for me to write).
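If we did do both, the merge could be as simple as overlaying the per-job list on top of the package-level defaults. A sketch, reusing the hypothetical rmr.tuning.options getter from above (merge.tuning is likewise a made-up name, not part of rmr):

merge.tuning <- function(per.job = list()) {
  defaults <- getOption("rmr.tuning.options", list())
  # per-job values win on conflicts; modifyList recurses into nested lists
  modifyList(defaults, per.job)
}

# with the default set above, a single job can still override it locally
merge.tuning(list(hadoop = list(mapred.reduce.tasks = 4)))
# hadoop$mapred.reduce.tasks is 4 for this job, 18 everywhere else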
I would like to solicit comments on this ticket and the proposed solution. I have to warn that we will be unable to include this in 1.1, but it will go into dev ASAP. Thanks.
I went for the per-job setting. It is not cast in stone. Use with care. Eventually I would like to integrate something like https://www.cs.duke.edu/~shivnath/amr.html in the spirit of taking low-level details away from the user, but until then this will have to do. In dev now.
I think per-job is an acceptable solution, and it corresponds to how it's supported in other interfaces (e.g. streaming, RHIPE). I'll go ahead and test this out.
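For reference, the streaming comparison is apt because a named list of jobconf parameters maps one-to-one onto Hadoop Streaming's generic -D options. A purely illustrative helper (not rmr code) that renders such a list as the corresponding command-line fragment:

jobconf.flags <- function(params) {
  # render each name = value pair as a Hadoop generic -D option
  paste(sprintf("-D %s=%s", names(params), unlist(params)), collapse = " ")
}

jobconf.flags(list(mapred.reduce.tasks = 16, mapred.job.name = "my.job"))
# "-D mapred.reduce.tasks=16 -D mapred.job.name=my.job"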