| name | age | message |
|---|---|---|
| LICENSE | Sat May 09 14:25:01 -0700 2009 | |
| README.rdoc | Sat May 09 14:25:01 -0700 2009 | |
| Rakefile | Sat May 09 12:49:50 -0700 2009 | |
| lib/ | Sat Jun 20 14:40:11 -0700 2009 | |
| spec/ | Sat Jun 20 14:40:11 -0700 2009 | |
| web-page-parser.gemspec | Sat Jun 20 14:41:36 -0700 2009 | |
README.rdoc
Web Page Parser
Web Page Parser is a Ruby library that extracts the content from web pages, such as BBC News pages. It strips out all non-textual content, leaving the title, the publication date and an array of paragraphs. It currently supports only BBC News pages, but new parsers are planned and can be added easily.
It is used by the News Sniffer project, which parses and archives news articles to keep track of how they change.
Example usage
  require 'web-page-parser'
  require 'open-uri'

  url = "http://news.bbc.co.uk/1/hi/uk/8041972.stm"
  page_data = open(url).read
  page = WebPageParser::ParserFactory.parser_for(:url => url, :page => page_data)

  puts page.title         # MPs hit back over expenses claims
  puts page.date          # 2009-05-09T18:58:59+00:00
  puts page.content.first # The wife of author Ken Follett and ...
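As a rough illustration of how a project like News Sniffer might use the parsed title and paragraphs to notice when an article changes, here is a minimal sketch. It is not taken from Web Page Parser or News Sniffer: `ParsedPage`, `page_fingerprint` and `page_changed?` are hypothetical names, and `ParsedPage` is a plain stand-in that only mirrors the `title` and `content` accessors shown above, so the sketch runs without network access.

```ruby
require 'digest'

# Hypothetical stand-in for a parsed page. Only the title and content
# accessors mirror the usage example; this is not the library's class.
ParsedPage = Struct.new(:title, :content)

# Hypothetical helper: hash the title and paragraphs into one fingerprint.
def page_fingerprint(page)
  Digest::SHA256.hexdigest(([page.title] + page.content).join("\n"))
end

# Hypothetical helper: two versions differ if their fingerprints differ.
def page_changed?(old_page, new_page)
  page_fingerprint(old_page) != page_fingerprint(new_page)
end

first = ParsedPage.new("MPs hit back over expenses claims",
                       ["First paragraph.", "Second paragraph."])
second = ParsedPage.new("MPs hit back over expenses claims",
                        ["First paragraph.", "Second paragraph, revised."])

puts page_changed?(first, first)  # false
puts page_changed?(first, second) # true
```

Comparing fingerprints rather than full text keeps the stored state small. An archive that needs to show what changed would have to keep the paragraphs themselves.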
More Info
Web Page Parser was written by John Leach.
The code is available on GitHub.







