
Optimizer or planner for mapreduce jobs #83

piccolbo opened this Issue Apr 26, 2012 · 1 comment



piccolbo commented Apr 26, 2012

It is common practice to apply transformations to mapreduce programs that change the number and nature of the jobs involved, usually to minimize I/O while preserving the same function. Hive, Pig and Cascading all do this, for example. In rmr it is a little more challenging because of

  • the I/O-bound assumption behind, for instance, the Cascading optimizer (called a planner in that context), which is not necessarily true for complex analytics programs.
  • the variety of programs that can be written with rmr. It is not a small, restricted special-purpose language; it allows the full power of R. So it is going to be difficult to apply general transformations while preserving semantic equivalence.
  • the unavailability of some advanced Java-only features, such as multiple output formats.

On the positive side are the reflection capabilities of R, which allow inspecting the parse tree, for instance. A small example of what could be done is the function optimize in the source, which is completely untested. The only optimization it applies is to reduce a chain of mapreduce calls that has a reduce only at the end to a single mapreduce job by composing the mappers.
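To make the mapper-composition idea concrete, here is a minimal language-neutral sketch. It is written in Python rather than R so that it does not depend on, or misrepresent, the rmr API or the untested optimize function. Every name in it (compose_mappers, run_job, tokenize, lowercase, sum_counts) is hypothetical, and run_job is only a toy in-memory stand-in for a real job. It assumes each mapper is a pure function from one (key, value) record to a list of (key, value) pairs; with side effects or state shared across records, the fused job would not be guaranteed to be equivalent.

```python
def compose_mappers(*mappers):
    """Fuse several map stages into a single mapper (hypothetical helper).

    Each mapper takes (key, value) and returns a list of (key, value) pairs.
    """
    def fused(key, value):
        pairs = [(key, value)]
        for mapper in mappers:
            # Feed every pair emitted by the previous stage to the next stage.
            pairs = [out for k, v in pairs for out in mapper(k, v)]
        return pairs
    return fused


def run_job(records, mapper, reducer=None):
    """Toy in-memory stand-in for one mapreduce job (not a real framework)."""
    mapped = [out for k, v in records for out in mapper(k, v)]
    if reducer is None:
        return mapped  # map-only job
    groups = {}
    for k, v in mapped:
        groups.setdefault(k, []).append(v)  # group values by key
    return [out for k, vs in groups.items() for out in reducer(k, vs)]


def tokenize(_, line):
    return [(word, 1) for word in line.split()]


def lowercase(word, count):
    return [(word.lower(), count)]


def sum_counts(word, counts):
    return [(word, sum(counts))]


records = [(None, "To be or not to be")]

# Unoptimized: a map-only job followed by a job with a reduce at the end.
chained = run_job(run_job(records, tokenize), lowercase, sum_counts)

# Optimized: one job whose mapper is the composition of both mappers.
fused = run_job(records, compose_mappers(tokenize, lowercase), sum_counts)
```

Both plans produce the same word counts here, but the fused plan avoids materializing the intermediate output of the first job, which is where the I/O saving would come from.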


piccolbo commented Mar 6, 2013

@piccolbo piccolbo closed this Mar 6, 2013
