Description: Our own collection of data mungers to feed infochimps.org
Homepage: http://infochimps.org/
Clone URL: git://github.com/infochimps/infochimps-data.git
|_. type|_. name|_. age|_. message|_. author|
|file|.gitignore|Mon Aug 04 05:32:38 -0700 2008|fixing .gitignore|mrflip|
|file|README|Tue May 13 12:18:32 -0700 2008|Initial Commit, trashing the BZR history. Eat ...|Dhruv Bansal|
|file|README-commands|Tue May 13 12:18:32 -0700 2008|Initial Commit, trashing the BZR history. Eat ...|Dhruv Bansal|
|file|README-license|Tue May 13 12:18:32 -0700 2008|Initial Commit, trashing the BZR history. Eat ...|Dhruv Bansal|
|file|README-overview|Tue May 13 12:18:32 -0700 2008|Initial Commit, trashing the BZR history. Eat ...|Dhruv Bansal|
|directory|culture/|Tue Sep 09 08:19:22 -0700 2008|Getting old schema in place to load|mrflip|
|directory|db/|Tue Sep 09 08:19:22 -0700 2008|Getting old schema in place to load|mrflip|
|directory|demographics/|Tue Sep 09 08:19:22 -0700 2008|Getting old schema in place to load|mrflip|
|directory|engineering/|Tue Sep 09 08:19:22 -0700 2008|Getting old schema in place to load|mrflip|
|directory|geo/|Tue Sep 09 08:19:22 -0700 2008|Getting old schema in place to load|mrflip|
|directory|health/|Tue Sep 09 08:19:22 -0700 2008|Getting old schema in place to load|mrflip|
|directory|huge/|Wed Sep 10 12:48:56 -0700 2008|Repairing old schemata|mrflip|
|directory|icss/|Wed Sep 10 23:13:59 -0700 2008|Merge branch 'master' of git@github-infochimps:...|dhruvbansal|
|directory|join/|Tue Sep 09 08:19:22 -0700 2008|Getting old schema in place to load|mrflip|
|directory|language/|Mon Sep 29 12:35:29 -0700 2008|Word freq DB loading is WAY TOO SLOW|mrflip|
|directory|math/|Tue Sep 09 08:19:22 -0700 2008|Getting old schema in place to load|mrflip|
|directory|money/|Tue Sep 09 08:19:22 -0700 2008|Getting old schema in place to load|mrflip|
|directory|politics/|Sat Sep 27 02:16:32 -0700 2008|Wordcloud Normalization|mrflip|
|directory|scaffolds/|Tue Sep 16 23:52:10 -0700 2008|Deploying|mrflip|
|directory|science/|Tue Sep 09 08:19:22 -0700 2008|Getting old schema in place to load|mrflip|
|directory|social/|Mon Sep 29 13:14:06 -0700 2008|Merge branch 'master' of git@github.com:mrflip/...|mrflip|
|directory|sport/|Tue Sep 09 08:19:22 -0700 2008|Getting old schema in place to load|mrflip|
|directory|time/|Tue Sep 09 08:19:22 -0700 2008|Getting old schema in place to load|mrflip|
|directory|weather/|Mon Sep 29 08:59:36 -0700 2008|weather file|mrflip|
README
h1. Intro

Infinite Monkeywrench (IMW) is a framework to simplify the tasks of acquiring,
extracting, transforming and loading data.

* It's built, designed and tested for manipulating datasets as small as 1k and
  as large as hundreds of gigabytes.

* Minimize **programmer time** even at the expense of increasing run time.
  These tasks only need to be run once.  (And they deal scalably with
  incremental updates: see 'Lazy' below.)

* Runtime scales with data.  A 1 MB test sample will run in about
  one-thousandth of the time it takes to process 1 GB of real data.
  
* Simple parallelization. Tell imw it's #3 out of 5 (or 500) workers and it will
  process only that fraction of the input.

* Lazy evaluation, like 'make': imw lets you define dependency chains (and comes
  already knowing a few).  It won't scrape new data if there's nothing new to
  scrape, and if you need to generate file "frobnozz" before you can process
  file "marklar", imw generates frobnozz if and only if it doesn't already exist.

* Realistic.  IMW is built to handle real data as she is spoke:

** Beautiful, schematized, formatted data from the infochimps.org collection.
** Most popular file formats: XML, YAML, CSV, JSON
** Parsing flat files becomes an easy two-liner piece of code.
** Messy data in some backwater format with no schema still sucks but a lot less
   than it used to.
** Scrape a web page tree and nimbly extract the table data from each page.

* Although obviously it's more work to acquire a sloppy dataset than a
  well-defined one, imw degrades well -- you write code for exactly and only the
  tasks that make your dataset bizarre.

* IMW is toolset agnostic.  If you have pre-existing routines to parse or
  acquire some format, imw is happy to call those tools at any step. You can
  have imw manage the acquisition and loading, but replace the entire munging
  step with a simple @'sh "perl dostuff.pl"'@ or even @'make'@.
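
Two of the behaviours listed above, worker sharding and make-style lazy generation, can be sketched in a few lines of plain Ruby. This is only an illustration of the ideas, not IMW's actual API: the helper names @shard@ and @generate_unless_exists@, and the @part-N.csv@ file names, are made up for the example.

```ruby
require "fileutils"
require "tmpdir"

# Hypothetical helper (not IMW's API): keep only the inputs that belong to
# worker `index` out of `total`, counting workers from 1.
def shard(inputs, index, total)
  inputs.each_with_index.select { |_, i| i % total == index - 1 }.map(&:first)
end

# Hypothetical helper (not IMW's API): run the block to build `path` only
# when the file does not already exist. Returns true if the block ran.
def generate_unless_exists(path)
  return false if File.exist?(path)
  FileUtils.mkdir_p(File.dirname(path))
  yield path
  true
end

# Ten made-up input files, split across five workers; this is worker #3.
files   = (1..10).map { |n| "part-#{n}.csv" }
worker3 = shard(files, 3, 5)

Dir.mktmpdir do |dir|
  frobnozz = File.join(dir, "frobnozz")
  # The first call builds the file; the second finds it and does nothing.
  first  = generate_unless_exists(frobnozz) { |p| File.write(p, "built once\n") }
  second = generate_unless_exists(frobnozz) { |p| File.write(p, "rebuilt\n") }
  puts worker3.inspect, first, second
end
```

In this sketch worker #3 of 5 ends up with @part-3.csv@ and @part-8.csv@, and the second call to @generate_unless_exists@ leaves the existing file untouched. Round-robin assignment is just one way to split the input; splitting by contiguous ranges would serve the same purpose.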

h2. Setup

Since I'm not smart enough to get this bootstrapped the right way, add this to
your .profile (either that, or ensmarten me about the right way):
  ### Wield Infinite Monkeywrench (+1 to data munging, charisma, THAC0):
  export IMW_ROOT=$HOME/ics/imw
  export PATH="$IMW_ROOT"/bin:"$PATH"
  export RUBYLIB="$IMW_ROOT"/lib:"$RUBYLIB"

These directories will be created under $IMW_ROOT (more about each later):
* @pool/(cat)/(subcat)/(pool)/@    -- Data pool processing code
* @ripd/com.reverse.url/dirs/files.ext@  -- scraped data from elsewhere
* @rawd/(cat)/(subcat)/(pool)/@   -- working copy of raw data
* @dump/(cat)/(subcat)/(pool)/@   -- intermediate data
* @fixd/(cat)/(subcat)/(pool)/@   -- completely processed data
* @pkgd/(cat)/(subcat)/(pool)/@   -- compressed & bundled distributable

If any directory is called for but found missing, IMW will create it (except for
@dump/@, which will be linked to /tmp/imw).  However, if the directory is there,
IMW will leave it the heck alone. So feel free to replace each directory with a
symbolic link (for instance, putting @rawd/@, @dump/@ and @fixd/@ on a large,
fast drive and @ripd/@ and @pkgd/@ on large, slow drives).
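
The create-if-missing rule described above might look roughly like the following. This is a sketch under assumptions, not IMW's real code: @ensure_imw_dirs@ is a hypothetical name, and the @dump_target@ parameter stands in for /tmp/imw so the example can run anywhere.

```ruby
require "fileutils"
require "tmpdir"

# Hypothetical helper (not IMW's actual code).
# Creates any missing top-level directory under `root`, links dump/ to
# `dump_target`, and leaves every existing entry alone.
def ensure_imw_dirs(root, dump_target)
  %w[pool ripd rawd fixd pkgd].each do |name|
    path = File.join(root, name)
    # File.exist? follows symlinks; File.symlink? also catches a dangling
    # link, so anything already placed there is left untouched.
    next if File.exist?(path) || File.symlink?(path)
    FileUtils.mkdir_p(path)
  end

  dump = File.join(root, "dump")
  unless File.exist?(dump) || File.symlink?(dump)
    FileUtils.mkdir_p(dump_target)   # make sure the link has a target
    File.symlink(dump_target, dump)  # dump/ is a link, not a real directory
  end
end

Dir.mktmpdir do |tmp|
  root = File.join(tmp, "imw")
  ensure_imw_dirs(root, File.join(tmp, "scratch"))
  puts Dir.children(root).sort.inspect
end
```

Because existing entries are skipped, a @rawd/@ that you have already replaced with a symbolic link to a bigger drive stays exactly as you left it.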

h2. Scaffolding

@imw generate pool=foo/bar/myhappypool@



h2. Actually processing your data

The Infinite Monkeywrench is built atop ActiveRecord, the data model that
powers Ruby on Rails.  This gives us

* a well-known, well-tested, database-agnostic data model
* ability to export from that datamodel to sqlite3, csv, yaml, xml and JSON with
  vernacular structure
* active_resource for both an outgoing API and an ingoing data socket: if
  someone creates an active_resource facade to an external API we get the
  data

This choice also demonstrates our commitment to the "minimize programmer time,
not run time" philosophy: it costs some run time, but it has worked for us so far.
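
As a rough illustration of the export idea above, the sketch below serializes the same records to JSON and YAML. It deliberately uses only Ruby's standard library rather than the Rails data model, so it shows the shape of the idea and not IMW's implementation; the @export@ helper and the sample word-count rows are hypothetical.

```ruby
require "json"
require "yaml"

# Hypothetical helper (not IMW's API): serialize an array of plain hashes
# into one of the supported text formats.
def export(rows, format)
  case format
  when :json then JSON.generate(rows)
  when :yaml then YAML.dump(rows)
  else raise ArgumentError, "unsupported format: #{format.inspect}"
  end
end

# Made-up sample records; in IMW these would come from the data model.
rows = [
  { "word" => "monkey", "count" => 12 },
  { "word" => "wrench", "count" => 7 },
]

puts export(rows, :json)
puts export(rows, :yaml)
```

Both outputs parse back into the same array of hashes, which is the property that matters when one dataset is published in several formats.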