Every repository with this icon (
Every repository with this icon (
| name | age | message | |
|---|---|---|---|
| |
.gitignore | Tue Jan 06 14:07:15 -0800 2009 | |
| |
Makefile | Tue Jan 06 14:07:15 -0800 2009 | |
| |
README.markdown | Tue Jan 06 14:41:06 -0800 2009 | |
| |
data/ | Tue Jan 06 14:07:15 -0800 2009 |
About
Ted Gould's post about transcoding music files using Make got me thinking about combining Make and RecordStream into a barebones MapReduce framework. As a demonstration, I wrote a partial solution to Tim Bray's Wide Finder challenge, to calculate hit counts from web server logs.
RecordStream (also known as "recs") is a collection of Unix tools for
producing and processing files full of JSON objects. It was developed by
my friends Ben Bernard and Keith Amling. Since RecordStream files contain one
object per line, the recs tools are easy to script in combination with
traditional Unix tools like head and grep.
Aside: Unfortunately, the version of RecordStream on Google Code is very old, and in particular does not have much in the way of usage or installation instructions. Hopefully Ben and Keith can get approval to publish the many, many, many improvements they and others have made since they started working at Amazon.com.
I think of Make as a functional programming environment for shell scripting. It's simple to write a Makefile where each rule's output depends only on the specified input files. Since all dependencies between rules are declared in the Makefile, we get some nice features for free:
- The entire process can be parallelized (using
make -j) with no explicit instructions from the programmer. Runningmake -j 2on my dual-core workstation resulted in a 1.75× speed-up for this program. - By saving intermediate files, we can do incremental MapReduce updates
(like CouchDB). Just add or remove log files from the data directory
and re-run
make. The new files will be analyzed, then combined with cached results from any already-processed files. If you had a huge collection of files, you could cache multiple levels of aggregate results (e.g. weekly, monthly, daily) to make incremental updates even more efficient. - Saving intermediate files also prevents other types of duplicate work. For example, the first step in my Wide Finder program converts the logs to into RecordStream's JSON format. This is more work than necessary for the Wide Finder problem alone, but in the real world you could then reuse the results to do other calculations.
- You can kill the process at any point and later restart it near where it stopped—or even migrate it to another machine.
Here's my source code: 16 line of code, not counting comments or blank lines. It's not as fast as any of the heavily-optimized versions, but it demonstrates a practical way to do parallel data processing with very little effort.
Instructions
- Check out RecordStream. Add the commands in the
bindirectory to your$PATH, and copy the contents of the lib directory into your Perl include path (e.g./usr/local/lib/site_perl). - Clone this git repository:
git clone git://github.com/mbrubeck/wf-recs.git - Run
makeinside your cloned repository. Usemake -j 4to run four concurrent jobs. (Adjust for the number of processor cores in your machine.)
Exercise for the reader: Integrate this with a distributed Make program, to make it run across multiple computers on a network.
Notes
The sample data in the data directory are the 10,000-line
http://www.tbray.org/tmp/o10k.ap, split into ten files. I wimped out and
did not include file-splitting in the code itself.
Eric Wong's Wide Finder 2 entry also used Make, although his code is currently unavailable.







