Description: A Ruby library to parse the content out of web pages, such as BBC News pages. Used by the News Sniffer project.
Clone URL: git://github.com/johnl/web-page-parser.git

Web Page Parser

Web Page Parser is a Ruby library to parse the content out of web pages, such as BBC News pages. It strips out all non-textual content, leaving the title, the publication date and an array of paragraphs. It currently supports only BBC News pages, but new parsers are planned and can be added easily.

It is used by the News Sniffer project, which parses and archives news articles to keep track of how they change.

Example usage

  require 'web-page-parser'
  require 'open-uri'

  url = "http://news.bbc.co.uk/1/hi/uk/8041972.stm"
  page_data = URI.open(url).read # Kernel#open no longer fetches URLs on Ruby 3.0+

  page = WebPageParser::ParserFactory.parser_for(:url => url, :page => page_data)

  puts page.title # MPs hit back over expenses claims
  puts page.date # 2009-05-09T18:58:59+00:00
  puts page.content.first # The wife of author Ken Follett and ...
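
Because the parsed result is just a title, a date and an array of paragraphs, two versions of the same article can be compared with plain Ruby. The following is a minimal sketch of that idea, not how News Sniffer actually does it. ParsedPage, page_fingerprint and added_paragraphs are hypothetical names introduced here for illustration. The Struct only stands in for a parsed page so the example runs without network access, and the paragraph text is made up.

```ruby
require 'digest'

# Hypothetical stand-in for a parsed page; it only mirrors the title, date
# and content readers used in the example above.
ParsedPage = Struct.new(:title, :date, :content)

# Build a fingerprint from the title and paragraphs. The date is left out
# so that a changed timestamp alone does not count as an edit (an assumption).
def page_fingerprint(page)
  Digest::SHA256.hexdigest(([page.title] + page.content).join("\n"))
end

# Return the paragraphs that appear in the new version but not in the old one.
def added_paragraphs(old_page, new_page)
  new_page.content - old_page.content
end

# Two made-up versions of the same article.
old_page = ParsedPage.new("MPs hit back over expenses claims",
                          "2009-05-09T18:58:59+00:00",
                          ["First paragraph.", "Second paragraph."])
new_page = ParsedPage.new(old_page.title, old_page.date,
                          ["First paragraph.", "Second paragraph, reworded."])

if page_fingerprint(old_page) != page_fingerprint(new_page)
  puts "Article changed:"
  added_paragraphs(old_page, new_page).each { |para| puts "  + #{para}" }
end
```

Storing only the fingerprint is enough to notice that an article changed. Keeping the paragraph array as well makes it possible to show what changed.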

More Info

Web Page Parser was written by John Leach.

The code is available on GitHub.