Attempts to extract readable content and embedded links from HTML markup & web pages

dragnet

This is still very experimental.

Extracting readable content from HTML markup

This was inspired by the Readability bookmarklet. The goal is to extract meaningful, readable content from HTML markup, from sources such as blog articles and publications. It also attempts to extract the links embedded within the readable content.

Approach

Given the vast nastiness of the HTML tag soup regurgitated by some blog engines, it's very hard to get fully clean content from a page on any kind of consistent basis. What works for one chunk of HTML might not work for another. We want to get content that is as clean as we can manage, while keeping the approach as generic as possible.

A basic overview of what this library does:

  1. Try to extract any hEntry microformat items from the page and return those.
  2. Collect all paragraphs (if none are found, collect text nodes and/or divs as a last resort).
  3. Iterate over the paragraphs, ascending the hierarchy and scoring each parent based on some common keywords and word count.
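
The three steps above can be sketched roughly as follows. This is not dragnet's actual code: the `Node` class is a hypothetical stand-in for a parsed DOM so the example runs without any gems, and the keyword lists, the 25-point bonus, and the per-level weighting are all assumptions made for illustration. The only detail taken from the microformat itself is that hEntry items carry the `hentry` class.

```ruby
# Hypothetical stand-in for a parsed DOM node (not part of dragnet).
class Node
  attr_reader :tag, :css, :text, :children
  attr_accessor :parent

  def initialize(tag, css: "", text: "")
    @tag = tag
    @css = css        # class/id names, space separated
    @text = text
    @children = []
    @parent = nil
  end

  def add(child)
    child.parent = self
    @children << child
    child
  end
end

# Assumed keyword lists; the library's real keywords are not listed here.
POSITIVE = /article|body|content|entry|post|text/i
NEGATIVE = /comment|footer|sidebar|nav/i

# Returns the root and every node below it.
def descendants(root)
  found = []
  stack = [root]
  until stack.empty?
    current = stack.pop
    found << current
    stack.concat(current.children)
  end
  found
end

# Step 3: walk up from each block, crediting every ancestor with the
# block's word count (assumed weighting: divided by the distance),
# then adjust each candidate by keyword matches on its class/id names.
def score_parents(blocks)
  scores = Hash.new(0)
  blocks.each do |block|
    words = block.text.split.size
    ancestor = block.parent
    level = 1
    while ancestor
      scores[ancestor] += words / level
      ancestor = ancestor.parent
      level += 1
    end
  end
  scores.keys.each do |candidate|
    scores[candidate] += 25 if candidate.css =~ POSITIVE
    scores[candidate] -= 25 if candidate.css =~ NEGATIVE
  end
  scores
end

def extract(root)
  nodes = descendants(root)

  # Step 1: prefer hEntry microformat items when present.
  hentries = nodes.select { |n| n.css.split.include?("hentry") }
  return hentries unless hentries.empty?

  # Step 2: collect paragraphs, falling back to divs as a last resort.
  blocks = nodes.select { |n| n.tag == "p" }
  blocks = nodes.select { |n| n.tag == "div" } if blocks.empty?

  best = score_parents(blocks).max_by { |_, score| score }
  best ? [best.first] : []
end

page = Node.new("body")
main = page.add(Node.new("div", css: "post-content"))
side = page.add(Node.new("div", css: "sidebar"))
main.add(Node.new("p", text: "one two three four five six"))
main.add(Node.new("p", text: "seven eight nine ten"))
side.add(Node.new("p", text: "short blurb"))

puts extract(page).first.css
```

In this toy page the `post-content` div collects the word counts of both of its paragraphs plus the keyword bonus, while the `sidebar` div is penalized, so the former is returned as the readable content.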

Notable Troublesome URLs

[x] Readability doesn't parse correctly : [-] Readability parses correctly

TODO

  • Parsing multi-page articles.
  • Consider searching for a 'print' link on the page and using that content instead. The print version tends to be a cleaner copy of the original, and it also tends to bypass the multi-page article issue.

Note on Patches/Pull Requests

  • Fork the project.
  • Make your feature addition or bug fix.
  • Add tests for it. This is important so I don't break it in a future version unintentionally.
  • Send me a pull request. Bonus points for topic branches.

Copyright

Copyright (c) 2009 Justin Palmer. See LICENSE for details.
