This is unscientific (I benchmarked only one run of each method, with a single data set and a single piece of code), but I have found a ~16% performance penalty for rmr 1.2 (vs. streaming) -- and an easy way to eliminate it.
I'm running jseidman's sample code (https://github.com/jseidman/hadoop-R) against a full year's airline data file (2004.csv, uncompressed, 639MB, 7,129,271 lines) from http://stat-computing.org/dataexpo/2009/2004.csv.bz2
I'm running on a 1+5 m1.large cluster running CentOS 4.6 spun up from my whirr script.
rmr 1.2: 27.83333 minutes
streaming: 23.88333 minutes
penalty = 16.5%
with '--byte-compile' used for installation of rmr, RJSONIO, itertools, and digest:
rmr 1.2: 22.159939801693 minutes (20% improvement)
streaming: 23.3833333 minutes (1.9% improvement)
penalty = gone!
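For anyone who wants to check the arithmetic, here is a minimal sketch of how the penalty figures can be recomputed from the timings above. The helper name `penalty` is made up for illustration; it simply reports how much slower the first timing is than the second, as a percentage (a negative value means the first one is faster).

```shell
#!/bin/sh
# penalty A B: percentage by which timing A exceeds timing B.
# "penalty" is a hypothetical helper name, not part of rmr or Hadoop.
penalty() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", (a / b - 1) * 100 }'
}

# Timings in minutes, copied from the runs above.
penalty 27.83333 23.88333           # rmr 1.2 vs. streaming, default install
penalty 22.159939801693 23.3833333  # rmr 1.2 vs. streaming, byte-compiled install
```

The first call prints 16.5 and the second prints a negative number, which is what "penalty = gone" means here: with byte-compiled packages the single rmr run came in slightly faster than the single streaming run.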
Note that refactoring out the mapper and reducer and throwing cmpfun()'s around them isn't necessary -- and doesn't help.
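In case it helps, this is roughly what installing with the flag could look like from the command line. It is only a sketch: the tarball names are hypothetical placeholders (real files carry version numbers), the helper name `byte_compile_install` is made up, and by default it only prints the commands so you can check them before running anything.

```shell
#!/bin/sh
# Print (or run) an R CMD INSTALL command with byte compilation enabled.
# byte_compile_install is a hypothetical helper; tarball paths are placeholders.
byte_compile_install() {
  cmd="R CMD INSTALL --byte-compile $1"
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$cmd"   # default: show the command without executing it
  else
    $cmd          # set DRY_RUN=0 to actually install (requires R on the PATH)
  fi
}

# Dependencies first, then rmr itself.
for pkg in digest itertools RJSONIO rmr; do
  byte_compile_install "${pkg}.tar.gz"
done
```

Run it once as-is to see the four commands, then again with `DRY_RUN=0` on each node once the tarball paths point at real files.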
Thanks to Nathan McIntyre of the LA User Group and John Versotek of the Boston Predictive Analytics Meetup for prompting me to take a look at relative performance. Nathan had found a significant performance penalty using rmr 1.1 and a noop analysis.
I will add the flag to my fork if you want to test it out.
Awesome, please make the change on dev so that I can do an automatic merge. 1.1 had the append bug: reduce was quadratic in the size of the grouping. Please reach out to Nathan or let me connect with him. We will probably devote 1.3 to speed issues; this kind of feedback is gold.
Jeffrey – can you add some details on things like number of nodes and number of reducers? Can you test against the full data set?
@piccolbo -- sorry, I am new to the whole git thing. How do I do that?
@jseidman -- this was using 5 data nodes and 1 name node per my hacked whirr script (https://github.com/jeffreybreen/RHadoop/blob/master/rmr/pkg/tools/whirr/hadoop-ec2-centos.properties):
whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker,5 hadoop-datanode+hadoop-tasktracker
I have since destroyed the cluster (the meter was running... :), but I think I remember seeing "10" in the web UI -- not sure if that was # of mappers or reducers though.
I haven't downloaded the full data set yet -- I was actually only interested in one day for some comparison with some more detailed data I have.
Jeffrey, just check out dev
git checkout dev
make your changes, commit, and open a pull request as you did before. At least I hope that works; I'm no git expert either. If you read the discussions on stackoverflow, you get a sense that nobody is a git expert. There's no agreement on anything.
too many GUIs nowadays -- let me know if you got it!