Description: Anemone web-spider framework
Homepage: http://anemone.rubyforge.org
Clone URL: git://github.com/chriskite/anemone.git
chriskite (author)
Sun Nov 01 15:18:59 -0800 2009

Anemone

Anemone is a web spider framework that can spider a domain and collect useful information about the pages it visits. It is versatile, allowing you to write your own specialized spider tasks quickly and easily.

See anemone.rubyforge.org for more information.

Features

  • Multi-threaded design for high performance
  • Tracks 301 HTTP redirects to understand a page’s aliases
  • Built-in breadth-first search (BFS) algorithm for determining page depth
  • Allows exclusion of URLs based on regular expressions
  • Choose the links to follow on each page with focus_crawl()
  • HTTPS support
  • Records response time for each page
  • CLI program can list all pages in a domain, calculate page depths, and more
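
To make the page-depth and URL-exclusion features above concrete, here is a minimal sketch in plain Ruby. It does not use Anemone and is not Anemone's implementation. It assumes a small, made-up in-memory link graph (LINKS) in place of real HTTP requests, and the method name page_depths and its skip: parameter are hypothetical names chosen for illustration.

```ruby
# Hypothetical link graph standing in for a real site (an assumption for
# illustration): each key is a page URL, each value lists the URLs it links to.
LINKS = {
  "/"            => ["/about", "/blog", "/admin/login"],
  "/about"       => ["/", "/team"],
  "/blog"        => ["/blog/post-1", "/about"],
  "/blog/post-1" => ["/team"],
  "/team"        => [],
  "/admin/login" => ["/admin/secret"]
}

# Breadth-first traversal from `start`, returning a hash of { url => depth }.
# Depth is the minimum number of links followed to reach the page.
# URLs matching any regular expression in `skip` are never visited.
def page_depths(links, start, skip: [])
  depths = { start => 0 }
  queue  = [start]

  until queue.empty?
    url = queue.shift                      # FIFO order gives breadth-first search
    links.fetch(url, []).each do |link|
      next if depths.key?(link)            # already reached at a shallower or equal depth
      next if skip.any? { |pattern| link =~ pattern }
      depths[link] = depths[url] + 1
      queue << link
    end
  end

  depths
end

# Compute depths from the root, excluding everything under /admin.
page_depths(LINKS, "/", skip: [%r{\A/admin}]).each do |url, depth|
  puts "#{depth}  #{url}"
end
```

Because the queue is processed first-in, first-out, each page is first reached along a shortest link path, so the recorded depth is the minimum click distance from the start page. Pages reachable only through an excluded URL (here /admin/secret) never appear in the result.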

Examples

See the scripts under the lib/anemone/cli directory for examples of several useful Anemone tasks.

Requirements

  • nokogiri
  • robots