infochimps / imw

Infinite Monkeywrench - A framework for collecting, peeling, and sharing delicious bananas of data.

http://infinitemonkeywrench.org


Rewrote lots of specs and now we have Tar, Zip, and Archiver specs 
passing. 
dhruvbansal (author)
Tue Feb 09 13:16:36 -0800 2010
commit  53e978b30ecd0cf83a395924994ea64ed79c0a21
tree    08d8246a5d5e0455124c85d674c54eab35ea2931
parent  7d9b5d37b372a68a2f41c133953ebb38a93d8e9c
imw /
name age
history
message
file .gitignore Thu Feb 04 13:23:37 -0800 2010 Made URI be Addressable::URI, and to use heuris... [mrflip]
file CHANGELOG Tue May 13 12:10:52 -0700 2008 Initial Commit -- trashing all our BZR history [Dhruv Bansal]
file LICENSE Tue Dec 01 19:09:03 -0800 2009 Removed all the cruft. [dhruvbansal]
file README.rdoc Tue Feb 02 09:44:08 -0800 2010 README. [dhruvbansal]
file Rakefile Wed Dec 02 12:46:02 -0800 2009 Moved the code for the packager into IMW. [dhruvbansal]
file TODO Thu Feb 04 13:23:37 -0800 2010 Made URI be Addressable::URI, and to use heuris... [mrflip]
file VERSION Thu Jan 07 02:11:14 -0800 2010 Version bump to 0.1.1 [Dhruv Bansal]
directory bin/ Thu Jan 07 21:06:35 -0800 2010 Datasets can be added to IMW's repository and m... [dhruvbansal]
directory etc/ Thu Dec 03 18:33:01 -0800 2009 Added a packager to move to S3. [dhruvbansal]
directory lib/ Tue Feb 09 13:16:36 -0800 2010 Rewrote lots of specs and now we have Tar, Zip,... [dhruvbansal]
directory old/ Thu Jan 07 03:03:57 -0800 2010 Got rid of a lot of cruft in lib/imw/dataset. [dhruvbansal]
directory spec/ Tue Feb 09 13:16:36 -0800 2010 Rewrote lots of specs and now we have Tar, Zip,... [dhruvbansal]
README.rdoc

Overview

The Infinite Monkeywrench (IMW) is a Ruby framework to simplify the tasks of acquiring, extracting, transforming, loading, and packaging data. It has the following goals:

  • Minimize programmer time even at the expense of increasing run time.
  • Take data through a full transformation from raw source to packaged purity in as few lines of code as possible.
  • Treat data records as objects as much as possible.
  • Reuse, rather than reimplement, better code that already exists in other libraries (FasterCSV, I’m talkin’ to you).
  • Make what’s common easy without making what’s uncommon impossible.
  • Work with messy data as well as clean data.
  • Let you incorporate your own tools wherever you choose to.

The Infinite Monkeywrench is a powerful tool but it is not always the right one to use. IMW is *not* designed for

  • Scraping vast amounts of data (use Wuclan, Monkeyshines, and Edamame.)
  • Really, really big datasets (use Wukong and Hadoop)
  • Data mining
  • Data visualization

Setup

IMW is hosted on Gemcutter so it’s easy to install.

You’ll have to set up Gemcutter if you haven’t already

  $ sudo gem install gemcutter
  $ gem tumble

and then install IMW

  $ sudo gem install imw

IMW Basics

The central goal of IMW is to make the workflow involved in processing a dataset, from raw source to finished product, as simple as possible.

To help achieve this goal, IMW creates lots of convenient structures and methods. The following sections provide a tour of these.

It is assumed that you’ve installed IMW and required it in a script via

  require 'rubygems'
  require 'imw'

Paths

IMW holds a registry of paths that you can define on the fly or store in a configuration file.

  IMW.add_path(:dropbox, "/var/www/public/dropbox")
  IMW.path_to(:dropbox)  #=> "/var/www/public/dropbox"

You can combine paths together dynamically.

  IMW.add_path(:raw, "/data/raw")
  IMW.path_to(:raw, "my/dataset") #=> "/data/raw/my/dataset"
  IMW.add_path(:rejects, :raw, "rejects")
  IMW.path_to(:rejects) #=> "/data/raw/rejects"

Altering one path updates any paths defined in terms of it:

  IMW.add_path(:raw, "/data2/raw")
  IMW.path_to(:rejects) #=> "/data2/raw/rejects", not "/data/raw/rejects"
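
The behaviour above (symbols resolved at lookup time, so redefining :raw changes :rejects) can be illustrated with a small registry. This is only a sketch of the idea, not IMW's actual implementation; the PathRegistry class and its internals are hypothetical names invented for the example.

```ruby
# Hypothetical sketch of a path registry; not IMW's real code.
class PathRegistry
  def initialize
    @paths = {}
  end

  # Store the raw segments. Symbols are kept as-is and only resolved
  # when the path is looked up, which is what makes updates propagate.
  def add_path(name, *segments)
    @paths[name] = segments
  end

  # Expand registered names (plus any extra string segments) into a path.
  def path_to(*segments)
    File.join(*segments.map { |segment| expand(segment) })
  end

  private

  def expand(segment)
    return segment unless segment.is_a?(Symbol)
    raise ArgumentError, "unknown path #{segment.inspect}" unless @paths.key?(segment)
    path_to(*@paths[segment])
  end
end

registry = PathRegistry.new
registry.add_path(:raw, "/data/raw")
registry.add_path(:rejects, :raw, "rejects")
rejects_before = registry.path_to(:rejects)

registry.add_path(:raw, "/data2/raw")
rejects_after = registry.path_to(:rejects)
```

Because :rejects stores the symbol :raw rather than the string it pointed to at definition time, the second lookup follows the new definition of :raw.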

Files & Directories

Use IMW.open to open files. The object returned by IMW.open obeys the usual semantics of a File object but it has new methods to manipulate and parse the file.

  f1 = IMW.open("/path/to/file")
  f1.read() # does what you think

  # methods familiar from File are available
  f1.size
  f1.writable?

  # use a bang or a 'w' to write
  writable_file = IMW.open!('/some/path') # similar to open('/some/path', 'w')

  # as well as methods to manipulate the file on the filesystem
  f2 = f1.cp("/new/path/to/file") # also try cp_to_dir
  f1.exist? # true
  f3 = f1.mv("/yet/another/path") # also try mv_to_dir
  f1.exist? # false

IMW also knows about directories

  d = IMW.open('/tmp')
  d.directory? # true
  d['*'] # Dir['/tmp/*']
  d.mv('/parent/dir')
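
For comparison, the copy, move, and glob operations above map onto Ruby's standard library roughly as follows. This sketch uses only FileUtils, File, and Dir; how IMW itself implements cp, mv, and [] is not shown in this README, and the file names are made up for the example.

```ruby
require 'fileutils'
require 'tmpdir'

results = {}

# Work inside a throwaway directory that is removed when the block ends.
Dir.mktmpdir do |dir|
  original = File.join(dir, "file.txt")
  File.write(original, "some data\n")

  # Copying leaves the original in place.
  FileUtils.cp(original, File.join(dir, "copy.txt"))
  results[:original_after_cp] = File.exist?(original)

  # Moving removes it from its old location.
  FileUtils.mv(original, File.join(dir, "moved.txt"))
  results[:original_after_mv] = File.exist?(original)

  # Globbing a directory, as d['*'] does above.
  results[:listing] = Dir[File.join(dir, "*")].map { |path| File.basename(path) }.sort
end
```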

Remote Files

Many operations defined for files are also defined for arbitrary URIs through the open-uri library.

Files can readily be opened, read, and downloaded from the Internet

  site = IMW.open('http://infochimps.org') #=> Recognized as an HTML document
  site.read() # does what you think
  site.cp('/some/local/path')
  site.exist? # will work in many cases

(Writing to remote sources isn’t enabled yet.)

Archives & Compressed Files

IMW works with a variety of archiving and compression programs (see IMW::EXTERNAL_PROGRAMS) to make packaging/unpackaging data easy.

  bz2   = IMW.open('/path/to/big_file.bz2')
  zip   = IMW.open('/path/to/archive.zip')
  targz = IMW.open('/path/to/archive.tar.gz')

  # IMW recognizes files by extension
  bz2.archive?      # false
  bz2.compressed?   # true
  zip.archive?      # true
  zip.compressed?   # false
  targz.archive?    # true
  targz.compressed? # true

  # decompress or compress files
  big_file = bz2.decompress! # skip the ! to preserve the original
  new_bz2  = big_file.compress!

  # extract and package archives
  zip.extract    # files show up in working directory
  targz.extract  # no need to decompress first
  new_tarbz2 = IMW.open!('/new/archive.tar').create(['/path1', '/path/2']).compress!
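
Recognizing files by extension, as archive? and compressed? do above, can be sketched with two patterns. The ExtensionSniffer module and the extension lists here are assumptions chosen to reproduce the table of answers above; IMW's real rules may cover more formats.

```ruby
# Hypothetical sketch of extension-based detection; not IMW's real code.
module ExtensionSniffer
  # Extensions that bundle several files together.
  ARCHIVE    = /\.(tar|zip|tar\.gz|tgz|tar\.bz2|tbz2)\z/
  # Extensions that indicate a compressed byte stream.
  COMPRESSED = /\.(gz|tgz|bz2|tbz2)\z/

  def self.archive?(path)
    path.match?(ARCHIVE)
  end

  def self.compressed?(path)
    path.match?(COMPRESSED)
  end
end

paths = ['/path/to/big_file.bz2', '/path/to/archive.zip', '/path/to/archive.tar.gz']
answers = paths.map do |path|
  [File.basename(path), ExtensionSniffer.archive?(path), ExtensionSniffer.compressed?(path)]
end
```

Following the README's convention, a .zip file counts as an archive but not as a compressed file, while .tar.gz is both.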

Data Formats

IMW encourages you to work with data as Ruby objects as much as possible by providing methods to parse common data formats directly into Ruby.

The actual parsing is always handled by a separate library appropriate for the data format, so it will be fast. If you’re familiar with that library, you can use many of its functions directly on the object returned by IMW.open.

IMW uses classes (defined in IMW::Files) to interface with each data type. The choice of class is determined by the extension of the path supplied to IMW.open.

  IMW.open('file.csv')  #=> IMW::Files::Csv
  IMW.open('file.xml')  #=> IMW::Files::Xml
  IMW.open('file.html') #=> IMW::Files::Html

  # default choice will be a text file
  IMW.open('strange_filename.wuzz') #=> IMW::Files::Text

  # but you can force a particular choice
  IMW.open('strange_filename.wuzz', :as => :csv)  #=> IMW::Files::Csv
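
The selection rule described above (extension decides, text is the default, :as overrides) can be sketched as a lookup table. The format_for method and FORMAT_BY_EXTENSION table are hypothetical, and they return symbols rather than the IMW::Files classes to keep the example self-contained.

```ruby
# Hypothetical sketch of extension-based format selection; not IMW's real code.
FORMAT_BY_EXTENSION = {
  ".csv"  => :csv,
  ".xml"  => :xml,
  ".html" => :html
}.freeze

def format_for(path, options = {})
  # An explicit :as option wins; otherwise fall back to the extension,
  # and to plain text when the extension is not recognized.
  options.fetch(:as) do
    FORMAT_BY_EXTENSION.fetch(File.extname(path).downcase, :text)
  end
end

by_extension = format_for('file.csv')
fallback     = format_for('strange_filename.wuzz')
forced       = format_for('strange_filename.wuzz', :as => :csv)
```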

Some formats are extremely regular (CSV, JSON, YAML, &c.) and can immediately be converted to simple Ruby objects. Other formats (flat files, HTML, XML, &c.) require parsing before they can be unambiguously converted to Ruby objects.

As an example, consider flat, delimited files. They are extremely regular and IMW uses FasterCSV to automatically parse them into nested arrays, the only sensible and unambiguous Ruby representation of their data:

  delimit1 = IMW.open('/path/to/csv') # IMW::Files::Csv
  delimit1.entries #=> array of arrays of entries
  delimit1.each do |row|
    # passes in parsed rows
    ...
  end

  # if there's a funny delimiter, it can be passed as an option (in
  # this case identical to what would be passed to FasterCSV under the
  # hood)
  delimit2 = IMW.open('/path/to/file.csv', :col_sep => " ")
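
To see what "array of arrays" and :col_sep mean without IMW installed, the same parse can be tried with Ruby's csv library, which is what FasterCSV became from Ruby 1.9 onward. This sketch assumes a modern Ruby and parses in-memory strings; the sample data is made up.

```ruby
require 'csv'

# Comma-delimited input parses into one array per row.
comma_rows = CSV.parse("name,address\nJohn Chimpo,123 Fake St.\n")

# A different delimiter is passed as :col_sep, the same option name
# that the README passes through IMW.open.
space_rows = CSV.parse("1 2 3\n4 5 6\n", col_sep: " ")
```

Every field comes back as a string, so numeric columns still need converting in a later step.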

HTML files, on the other hand, are more complex and typically have to be parsed before being converted to plain Ruby objects:

  # Grab a tiny link from the bottom of Google's homepage
  doc = IMW.open('http://www.google.com') # IMW::Files::Html
  doc.parse('p a') # 'Privacy'

More complex parsers can also be built

  # Grab each row from an HTML table
  doc = IMW.open('/path/to/data.html')
  doc.parse :employees => ["tr", { :name => "td.name", :address => "td.address" } ]
  #=> [{:name => "John Chimpo", :address => "123 Fake St."}, {...}, ... ]

See IMW::Parsers::HtmlParser for details on parsing HTML (and similar) files. Examine the other parsers in IMW::Parsers for details on parsing other data formats.

The IMW Workflow

The workflow of IMW can be roughly summarized as follows:

rip::

  Data is obtained from a source.  IMW allows you to download data
  from the web, obtain it by querying databases, or use other services
  like rsync, ftp, &c. to pull it in from another computer.

extract::

  Ripped data is often compressed or otherwise archived and needs to
  be extracted.  It may also be sliced in many ways (excluding certain
  years, say) to reduce the volume to only what is required.

parse::

  Data is parsed into Ruby objects and stored.

munge::

  All the parsed data is combined, reconciled, and further processed
  into a final form.

package::

  The data is archived and compressed as necessary and moved to an
  outbox, staging server, S3 bucket, &c.
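
The five steps can be pictured as a chain of small functions, each feeding the next. The sketch below uses only Ruby's standard library and an in-memory gzipped string in place of a real download; the method names mirror the step names, but the data, the since: parameter, and the summary format are all invented for illustration and are not part of IMW.

```ruby
require 'zlib'

# rip: stand-in for downloading a compressed file from a remote source.
def rip
  Zlib.gzip("2008,10\n2009,20\n2010,30\n")
end

# extract: decompress, then slice away the years that are not needed.
def extract(blob, since:)
  Zlib.gunzip(blob).lines.select { |line| line.split(",").first.to_i >= since }
end

# parse: turn each line into a plain Ruby object.
def parse(lines)
  lines.map do |line|
    year, count = line.chomp.split(",")
    { year: year.to_i, count: count.to_i }
  end
end

# munge: combine the parsed records into their final form.
def munge(records)
  "years=#{records.map { |r| r[:year] }.join('-')} total=#{records.sum { |r| r[:count] }}\n"
end

# package: compress the result, ready to be moved elsewhere.
def package(summary)
  Zlib.gzip(summary)
end

records  = parse(extract(rip, since: 2009))
summary  = munge(records)
packaged = package(summary)
```

Keeping each step a separate function means any one of them can be swapped for your own tool without touching the rest.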

Not all datasets

Datasets

Tasks & Dependencies

Directory Structure

Records

IMW on the Command Line

Repositories

Running Tasks
