
controlling number of map and reduce tasks #164

Closed
piccolbo opened this issue Dec 10, 2012 · 1 comment

@piccolbo
Collaborator

I would like to reopen the issue touched upon in #143. It is clear that my original position, namely that the number of map and reduce tasks is best set on a per-cluster basis (based on my own experience and Cloudera recommendations), is not compatible with the variety of workloads that users are creating.

At one extreme we have network-IO-heavy jobs, like a web crawler (not that many people write web crawlers in R, but for lack of a better example), for which the desired number of tasks per node may be in the hundreds (unless async IO is used, which I don't think is available in R). In the middle we have local-IO-heavy processes, like a simple filter; then CPU-heavy jobs, say model fitting; and finally memory-heavy jobs, for which only a small number of tasks, maybe even only one, can run effectively on each node.

Even with the next generation of MapReduce, a job can't make the system aware of what its resource needs will be, with the exception of memory. Therefore some per-job configuration will be needed for the foreseeable future. My preference is for a tasks-per-node setting rather than a total number of tasks, since it generalizes better across clusters of different sizes and jobs of different sizes. For instance, a memory-intensive job where each node can only run one process will run with a total number of tasks equal to the cluster size. It's difficult to query the cluster size from a streaming application; I couldn't find a clean way of doing it, although there is a list of slaves in the configuration directory. Expressed as tasks per node, though, the answer is a constant 1.

There are four properties we can set to influence this behavior: mapred.map.tasks and mapred.tasktracker.map.tasks.maximum, plus the equivalent two for reduce. The "maximum" property is not job-level, it is tasktracker-level, so it needs a restart to take effect and clearly won't help with a mixed workload. mapred.map.tasks influences but doesn't actually determine the number of map tasks. And even if one sets mapred.map.tasks to 10 and the request is honored verbatim by Hadoop, those 10 tasks could all run on one node, or one per node on a 10-node cluster. This is not very helpful in optimizing job execution. Additional thoughts are welcome.

This discussion has additional information.
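As a concrete illustration of the per-job configuration being discussed (not the resolution of this issue), the sketch below passes job-level Hadoop properties through to a streaming job from R. It assumes rmr2's backend.parameters argument and the classic MR1 property names; as noted above, these settings only influence the total number of tasks and say nothing about how they are spread across nodes.

```r
library(rmr2)

# A minimal sketch, assuming rmr2's backend.parameters argument, which forwards
# -D key=value options to the underlying hadoop streaming invocation.
# mapred.map.tasks and mapred.reduce.tasks are job-level hints only: Hadoop may
# ignore the map count and is free to pack several tasks onto one node.
small.input <- to.dfs(1:1000)

out <- mapreduce(
  input  = small.input,
  map    = function(k, v) keyval(v %% 10, v),
  reduce = function(k, vv) keyval(k, sum(vv)),
  backend.parameters = list(
    hadoop = list(D = "mapred.map.tasks=10",
                  D = "mapred.reduce.tasks=5")
  )
)

from.dfs(out)
```

The tasktracker-level maximums (mapred.tasktracker.map.tasks.maximum and its reduce counterpart) cannot be set this way; they live in the tasktracker configuration and require a restart, which is exactly why they don't help with mixed workloads.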

@piccolbo
Collaborator Author

piccolbo commented Mar 5, 2013

Now tracked as RevolutionAnalytics/rmr2#10.

piccolbo closed this as completed Mar 5, 2013