| name | age | message |
|---|---|---|
| .gitignore | Sun Sep 27 21:03:06 -0700 2009 | |
| CHANGELOG | Tue May 13 12:10:52 -0700 2008 | |
| Manifest | Wed Oct 07 22:51:48 -0700 2009 | |
| README-commands | Tue May 13 12:10:52 -0700 2008 | |
| README-license | Tue May 13 12:10:52 -0700 2008 | |
| README-organization.txt | Tue Nov 25 21:15:19 -0800 2008 | |
| README-overview | Tue May 13 12:10:52 -0700 2008 | |
| README.textile | Thu Oct 15 11:21:25 -0700 2009 | |
| Rakefile | Sun Sep 27 16:01:35 -0700 2009 | |
| TODO | Tue May 13 12:10:52 -0700 2008 | |
| TODO-scraper.TODO | Mon Sep 15 04:27:40 -0700 2008 | |
| etc/ | Sun Sep 27 13:33:35 -0700 2009 | |
| lib/ | Tue Nov 10 17:05:57 -0800 2009 | |
| meta/ | Sun Sep 27 13:33:35 -0700 2009 | |
| old/ | Thu Jul 31 20:03:33 -0700 2008 | |
| spec/ | Thu Oct 15 11:21:25 -0700 2009 | |

Intro
Infinite Monkeywrench (IMW) is a framework to simplify the tasks of acquiring,
extracting, transforming and loading data.
- It’s built, designed and tested for manipulating datasets as small as 1k and
as large as hundreds of gigabytes.
- Minimize programmer time even at the expense of increasing run time.
These tasks only need to be run once. (And they deal scalably with
incremental updates: see ‘Lazy’ below.)
- Runtime scales with data. One MB of test data will run in about
one-thousandth of the time it takes to process one GB of real data.
- Simple parallelization. Tell imw it’s #3 out of 5 (or 500) workers and it will
process only that fraction of the input.
- Lazy evaluation, like ‘make’: imw lets you define dependency chains (and comes
already knowing a few). It won’t scrape new data if there’s nothing new to
scrape, and if you need to generate file “frobnozz” before you can process
file “marklar”, imw generates frobnozz if and only if it doesn’t already exist.
- Realistic. IMW is built to handle real data as she is spoke:
  - Beautiful, schematized, formatted data from the infochimps.org collection.
  - Most popular file formats: XML, YAML, CSV, JSON.
  - Parsing flat files becomes an easy two-liner piece of code.
  - Messy data in some backwater format with no schema still sucks, but a lot
    less than it used to.
  - Scrape a web page tree and nimbly extract the table data from each page.
  - Although obviously it’s more work to acquire a sloppy dataset than a
    well-defined one, imw degrades well: you write code for exactly and only
    the tasks that make your dataset bizarre.
- IMW is toolset agnostic. If you have pre-existing routines to parse or
acquire some format, imw is happy to call those tools at any step. You can
have imw manage the acquisition and loading, but replace the entire munging
step with a simple 'sh "perl dostuff.pl"' or even 'make'.
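
The parallelization and lazy-evaluation points above can be sketched in a few lines of plain Ruby. This is only an illustration of the ideas, not IMW's actual code: `shard_for` and `generate_unless_exists` are hypothetical names, and the round-robin split is an assumption about how "worker 3 of 5" might divide the input.

```ruby
require 'tmpdir'

# Hypothetical helper (not part of IMW): keep only the inputs that belong to
# this worker. Workers are numbered 1..total, as in "#3 out of 5".
# Assumption: inputs are dealt out round-robin by position.
def shard_for(inputs, worker, total)
  raise ArgumentError, 'worker must be in 1..total' unless (1..total).cover?(worker)
  inputs.each_with_index.select { |_, i| i % total == worker - 1 }.map(&:first)
end

# Hypothetical helper (not part of IMW): run the block only when the target
# file is absent, like a make rule with no prerequisites.
# Returns true if the block ran, false if the file was already there.
def generate_unless_exists(path)
  return false if File.exist?(path)
  yield path
  true
end

files = (1..10).map { |n| "part-#{n}.csv" }
p shard_for(files, 3, 5)   # the files at positions 3 and 8 of the list

Dir.mktmpdir do |dir|
  frobnozz = File.join(dir, 'frobnozz')
  2.times do
    ran = generate_unless_exists(frobnozz) { |path| File.write(path, "generated\n") }
    puts ran ? 'generated frobnozz' : 'frobnozz already exists, skipping'
  end
end
```

Every input lands in exactly one shard, so five workers started with numbers 1 through 5 cover the whole list between them without coordinating. The second pass over `frobnozz` does no work because the file is already there.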
Setup
Since I’m not smart enough to get this bootstrapped the right way, add this to
your .profile (either that, or ensmarten me about the right way):
    # Wield Infinite Monkeywrench (+1 to data munging, charisma, THAC0):
    export IMW_ROOT=$HOME/ics/imw
    export PATH="$IMW_ROOT/bin:$PATH"
    export RUBYLIB="$IMW_ROOT/lib:$RUBYLIB"
These directories will be created under $IMW_ROOT (more about each later):
- pool/(cat)/(subcat)/(pool)/ : data pool processing code
- ripd/com.reverse.url/dirs/files.ext : scraped data from elsewhere
- rawd/(cat)/(subcat)/(pool)/ : working copy of raw data
- dump/(cat)/(subcat)/(pool)/ : intermediate data
- fixd/(cat)/(subcat)/(pool)/ : completely processed data
- pkgd/(cat)/(subcat)/(pool)/ : compressed & bundled distributable
If any directory is called for but found missing, IMW will create it (except for
dump/, which will be linked to /tmp/imw). However, if the directory is there,
IMW will leave it the heck alone. So feel free to replace each directory with a
symbolic link (for instance, putting rawd/, dump/ and fixd/ on a large, fast
drive and ripd/ and pkgd/ on large, slow drives).
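
The create-if-missing rule described above could look roughly like the following. This is a sketch under assumptions, not IMW's implementation: `ensure_imw_dirs` is a hypothetical name, and the link target for dump/ is passed in as a parameter so the example does not touch /tmp/imw.

```ruby
require 'fileutils'
require 'tmpdir'

# Directory names are taken from the list above.
IMW_DIRS = %w[pool ripd rawd dump fixd pkgd].freeze

# Hypothetical helper (not part of IMW): create each missing directory under
# root. dump/ is created as a symlink to dump_target rather than a directory.
# Anything that already exists, directory or symlink, is left untouched.
# Returns the names that were created on this call.
def ensure_imw_dirs(root, dump_target)
  FileUtils.mkdir_p(root)
  IMW_DIRS.select do |name|
    path = File.join(root, name)
    next false if File.exist?(path) || File.symlink?(path)
    if name == 'dump'
      FileUtils.mkdir_p(dump_target)
      File.symlink(dump_target, path)
    else
      FileUtils.mkdir_p(path)
    end
    true
  end
end

Dir.mktmpdir do |tmp|
  root    = File.join(tmp, 'imw')
  scratch = File.join(tmp, 'scratch')          # stands in for /tmp/imw
  FileUtils.mkdir_p(File.join(root, 'rawd'))   # pretend rawd/ was set up by hand
  p ensure_imw_dirs(root, scratch)             # every name except rawd
  p ensure_imw_dirs(root, scratch)             # empty: nothing left to create
end
```

The `File.symlink?` check matters for links whose target is currently missing (an unmounted drive, say): `File.exist?` follows the link and reports false, and without the extra check the helper would try to create a directory over the link.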
Scaffolding
    imw generate pool=foo/bar/myhappypool
Actually processing your data
The Infinite Monkeywrench is built atop ActiveResource, the data model that
powers Ruby on Rails. This gives us
- a well-known, well-tested, database-agnostic data model
- ability to export from that data model to sqlite3, csv, yaml, xml and JSON
  with vernacular structure
- active_resource for both an outgoing API and an ingoing data socket: if
  someone creates an active_resource facade to an external API, we get the
  data
It also demonstrates our commitment to the “minimize programmer time, not run
time” philosophy, but it’s worked for us so far.
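
To make the "one data model, many export formats" idea concrete, here is a minimal sketch using only Ruby's standard `json` and `yaml` libraries. It does not use ActiveResource or any IMW API; the `Row` struct and its fields are made up for illustration.

```ruby
require 'json'
require 'yaml'

# Hypothetical record type standing in for one row of a processed dataset.
Row = Struct.new(:id, :name, :size_kb)

rows = [Row.new(1, 'frobnozz', 12), Row.new(2, 'marklar', 340)]

# Reduce the records to plain hashes with string keys, so every exporter
# sees the same ordinary structure.
plain = rows.map { |r| r.to_h.transform_keys(&:to_s) }

json_text = JSON.pretty_generate(plain)   # JSON array of objects
yaml_text = plain.to_yaml                 # YAML sequence of mappings

puts json_text
puts yaml_text
```

Each format gets the same list of records in its own native shape, which is roughly what "vernacular structure" asks for. A real data model would add the sqlite3, csv and xml targets in the same way.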







