
controlling number of map and reduce tasks #164

Closed
piccolbo opened this issue Dec 10, 2012 · 1 comment

@piccolbo
Collaborator

I would like to reopen the issue touched upon in #143. It is clear that my original position, namely that the number of map and reduce tasks is best set on a per-cluster basis (based on my own experience and Cloudera recommendations), is not compatible with the variety of workloads that users are creating.

At one extreme we have network-IO-heavy jobs, like a web crawler (not that many people write web crawlers in R, but for lack of a better example), for which the desired number of tasks per node may be in the hundreds (unless async IO is used, which I don't think is available in R). In the middle we have local-IO-heavy processes, like a simple filter; then CPU-heavy jobs, say model fitting; and finally memory-heavy jobs, for which only a small number of tasks, maybe even only one, can run effectively on each node.

Even with the next generation of MapReduce, a job can't make the system aware of what its resource needs will be, with the exception of memory. Therefore some per-job configuration will be needed for the foreseeable future. My preference is for a tasks-per-node setting rather than a total number of tasks, since it generalizes better across clusters of different sizes and jobs of different sizes. For instance, a memory-intensive job where each node can only run one process will run with a total number of tasks equal to the cluster size. It's difficult to query the cluster size from a streaming application; I couldn't find a clean way of doing it, although there is a list of slaves in the configuration directory. Expressed as tasks per node, though, the answer is a constant 1.

There are four properties we can set to influence this behavior: mapred.map.tasks and mapred.tasktracker.map.tasks.maximum, plus the equivalent two for reduce. The "maximum" property is not job-level, it is tasktracker-level, so it needs a restart to take effect and clearly won't help with a mixed workload. mapred.map.tasks influences but doesn't actually determine the number of map tasks. And even if one sets mapred.map.tasks to 10 and the request is honored verbatim by Hadoop, those 10 tasks could all run on one node, or one per node on a 10-node cluster. This is not very helpful in optimizing job execution. Additional thoughts are welcome.

This discussion has additional information.
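As a concrete illustration of the per-job configuration being discussed (not the resolution of this issue), the sketch below passes job-level Hadoop properties through to a streaming job from R. It assumes rmr2's backend.parameters argument and the classic MR1 property names; as noted above, these settings only influence the total number of tasks and say nothing about how they are spread across nodes.

```r
library(rmr2)

# A minimal sketch, assuming rmr2's backend.parameters argument, which forwards
# -D key=value options to the underlying hadoop streaming invocation.
# mapred.map.tasks and mapred.reduce.tasks are job-level hints only: Hadoop may
# ignore the map count and is free to pack several tasks onto one node.
small.input <- to.dfs(1:1000)

out <- mapreduce(
  input  = small.input,
  map    = function(k, v) keyval(v %% 10, v),
  reduce = function(k, vv) keyval(k, sum(vv)),
  backend.parameters = list(
    hadoop = list(D = "mapred.map.tasks=10",
                  D = "mapred.reduce.tasks=5")
  )
)

from.dfs(out)
```

The tasktracker-level maximums (mapred.tasktracker.map.tasks.maximum and its reduce counterpart) cannot be set this way; they live in the tasktracker configuration and require a restart, which is exactly why they don't help with mixed workloads.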

@piccolbo
Collaborator Author

piccolbo commented Mar 5, 2013

Now tracked as RevolutionAnalytics/rmr2#10.

piccolbo closed this as completed Mar 5, 2013