Fork of mislav/scraper
Description: A cute HTML scraper / data extraction tool
Clone URL: git://github.com/drnic/scraper.git
drnic (author)
Mon Oct 26 20:33:57 -0700 2009
commit  6d94ba0f8d5f3f276a546addb526824129eea0d0
tree    f0a206d6ebd265525c1b096e5f49801e9b38b1ce
parent  de45d03fdbce6198ceba1af6a5e9885c99018e13
README.md

Scraper

Scraper is a cute HTML screen-scraping tool.

require 'scraper'
require 'open-uri'

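class BlogScraper < Scraper
  # The page's title element, exposed as blog.title
  element :title

  # Each div.hentry becomes one entry in blog.articles
  elements 'div.hentry' => :articles do
    element 'h2' => :title      # heading inside the article
    element 'a/@href' => :url   # href attribute of the article's link
  end
end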

blog = BlogScraper.parse open('http://example.com')

blog.title
#=> "My blog title"

blog.articles.first.title
#=> "First article title"

blog.articles.first.url
#=> "http://example.com/article"

There are sample scripts in the "examples/" directory; run them with:

ruby -rubygems examples/<script>.rb

See the wiki for more on how to use Scraper.

Requirements

None. Well, Nokogiri is required if you pass in raw HTML content that needs to be parsed, as in the example above. Otherwise you can initialize the scraper with an Hpricot document, or with any other object that implements the at(selector) and search(selector) methods.
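
To make the at(selector) / search(selector) contract concrete, here is a minimal sketch of an object that satisfies it, written in plain Ruby with no gems. StubNode, document_like? and the text reader are hypothetical names invented for this illustration; they are not part of Scraper, Nokogiri or Hpricot. The README does not say how such an object is handed to the scraper or which other methods Scraper calls on the returned nodes, so the sketch only exercises the two methods directly.

```ruby
# Hypothetical stand-in for a parsed document (not a Scraper/Nokogiri/Hpricot class).
# It maps selector strings to pre-built lists of child nodes instead of parsing HTML.
class StubNode
  attr_reader :text # assumption: a way to read a node's content

  def initialize(text = nil, children = {})
    @text = text
    @children = children # selector string => array of StubNode
  end

  # All nodes registered under the selector; empty array when nothing matches.
  def search(selector)
    @children.fetch(selector, [])
  end

  # First node registered under the selector, or nil when nothing matches.
  def at(selector)
    search(selector).first
  end
end

# Duck-type check for the two methods named in the requirements.
def document_like?(obj)
  obj.respond_to?(:at) && obj.respond_to?(:search)
end

# Stub data mirroring the blog example above.
article = StubNode.new(nil,
  'h2'      => [StubNode.new('First article title')],
  'a/@href' => [StubNode.new('http://example.com/article')])

doc = StubNode.new(nil,
  'title'      => [StubNode.new('My blog title')],
  'div.hentry' => [article])

document_like?(doc) #=> true

# Manual equivalent of the BlogScraper declaration, using only at/search.
title = doc.at('title').text
articles = doc.search('div.hentry').map do |node|
  { :title => node.at('h2').text, :url => node.at('a/@href').text }
end
```

The point of the sketch is only that the requirement is an interface, not a specific parser: at returns a single match (or nil) and search returns a collection of matches.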