public
Description: Infinite Monkeywrench - A frameworks for collecting, peeling, and sharing delicious bananas of data.
Homepage: http://infinitemonkeywrench.org
Clone URL: git://github.com/infochimps/imw.git
imw /
name age message
file .gitignore Sun Sep 27 21:03:06 -0700 2009 Add Manifest to git repo so github will use it? [dhruvbansal]
file CHANGELOG Tue May 13 12:10:52 -0700 2008 Initial Commit -- trashing all our BZR history [Dhruv Bansal]
file Manifest Wed Oct 07 22:51:48 -0700 2009 Couple of bugs with expanding paths were fixed. [dhruvbansal]
file README-commands Tue May 13 12:10:52 -0700 2008 Initial Commit -- trashing all our BZR history [Dhruv Bansal]
file README-license Tue May 13 12:10:52 -0700 2008 Initial Commit -- trashing all our BZR history [Dhruv Bansal]
file README-organization.txt Tue Nov 25 21:15:19 -0800 2008 Kinda know how to organize this... let's find out [mrflip]
file README-overview Tue May 13 12:10:52 -0700 2008 Initial Commit -- trashing all our BZR history [Dhruv Bansal]
file README.textile Thu Oct 15 11:21:25 -0700 2009 Fixed a bug with reading files, added a random ... [dhruvbansal]
file Rakefile Sun Sep 27 16:01:35 -0700 2009 Ready to make a gem. [dhruvbansal]
file TODO Tue May 13 12:10:52 -0700 2008 Initial Commit -- trashing all our BZR history [Dhruv Bansal]
file TODO-scraper.TODO Mon Sep 15 04:27:40 -0700 2008 delicious import working well ; Starting import... [mrflip]
directory etc/ Sun Sep 27 13:33:35 -0700 2009 Merge branch 'master' of git@github.com:infochi... [dhruvbansal]
directory lib/ Tue Nov 10 17:05:57 -0800 2009 Bug where if env var HOME wasn't set then IMW w... [dhruvbansal]
directory meta/ Sun Sep 27 13:33:35 -0700 2009 Merge branch 'master' of git@github.com:infochi... [dhruvbansal]
directory old/ Thu Jul 31 20:03:33 -0700 2008 moving rspec script out the way -- the gem shou... [mrflip]
directory spec/ Thu Oct 15 11:21:25 -0700 2009 Fixed a bug with reading files, added a random ... [dhruvbansal]
README.textile

Intro

Infinite monkeywrench is a frameworks to simplify the tasks of acquiring,
extracting, transforming and loading data.

  • It’s built, designed and tested for manipulating datasets as small as 1k and
    as large as hundreds of gigabytes.
  • Minimize programmer time even at the expense of increasing run time.
    These tasks only need to be run once. (And they deal scalably with
    incremental updates: see ‘Lazy’ below.)
  • Runtime scales with data. One MB of data for testing, will run in about
    1000’th the time to process one GB of data for real.
  • Simple parallelization. Tell imw it’s #3 out of 5 (or 500) workers and it will
    process only that fraction of the input.
  • Lazy evaluation, like ‘make’: imw lets you define dependency chains (and comes
    already knowing a few). It won’t scrape new data if there’s nothing new to
    scrape, and if you need to generate file “frobnozz” before you can process
    file “marklar”, imw generates frobnozz if and only if it doesn’t already exist.
  • Realistic. IMW is built to handle real data as she is spoke:
  • Beautiful, schematized, formatted data from the infochimps.org collection.
  • Most popular file formats: XML, YAML, CSV, JSON
  • Parsing flat files becomes an easy two-liner piece of code.
  • Messy data in some backwater format with no schema still sucks but a lot less
    than it used to.
  • Scrape a web page tree and nimbly extract the table data from each page.
  • Although obviously it’s more work to acquire a sloppy dataset than a
    well-defined one, imw degrades well — you write code for exactly and only the
    tasks that make your dataset bizarre.
  • IMW is toolset agnostic. If you have pre-existing routines to parse or
    acquire some format, imw is happy to call those tools at any step. You can
    have imw manage the acquisition and loading, but replace the entire munging
    step with a simple 'sh "perl dostuff.pl"' or even 'make'.

Setup

Since I’m not smart enough to get this bootstrapped the right way, add this to
your .profile (either that, or ensmarten me about the right way):

  1. Wield Infinite Monkeywrench (+1 to data munging, charisma, THAC0):
    export IMW_ROOT=$HOME/ics/imw
    export PATH=“$IMW_ROOT”/bin:“$PATH
    export RUBYLIB=“$IMW_ROOT”/lib:“$RUBYLIB

These directories will be created under $IMW_ROOT (more about each later):

  • pool/(cat)/(subcat)/(pool)/ — Data pool processing code
  • ripd/com.reverse.url/dirs/files.ext — scraped data from elsewhere
  • rawd/(cat)/(subcat)/(pool)/ — working copy of raw data
  • dump/(cat)/(subcat)/(pool)/ — intermediate data
  • fixd/(cat)/(subcat)/(pool)/ — completely processed data
  • pkgd/(cat)/(subcat)/(pool)/ — compressed & bundled distributable

If any directory is called for but found missing, IMW will create it (except for
dump/, which will be linked to /tmp/imw). However, if the directory is there,
IMW will leave it the heck alone. So feel free to replace each directory with a
symbolic link (putting rawd/, dump/ and fixd/ directories on a large, fast
drive, ripd/ and pkgd/ on large, slow drives for instance.).

Scaffolding

@ imw generate pool=foo/bar/myhappypool @

Actually processing your data

The Infinite Monkeywrench is built atop ActiveResource, the data model that
powers Ruby on Rails. This gives us

  • a well-known, well-tested, database-agnostic data model
  • ability to export from that datamodel to sqlite3, csv, yaml, xml and JSON with
    vernacular structure
  • active_resource for both an outgoing API and an ingoing data socket: if
    someone creates an active_resource facade to an external API we get the
    data

It also demonstrates our commitment to the “minimize programmer time, not run
time” philosophy, but it’s worked for us so far.