Description: Anemone web-spider framework
Homepage: http://anemone.rubyforge.org
Clone URL: git://github.com/chriskite/anemone.git
chriskite (author)
Wed Dec 16 12:06:09 -0800 2009
commit  7b58fecdb142bc126967ba70888ad9c61eae193a
tree    625253b63efb93691798f84cb26c6986b91d65e4
parent  6bbe30cd489b385fdb83113e79c3af9a803e4279
type       name             age                             message [author]
file       CHANGELOG.rdoc   Wed Dec 16 12:06:09 -0800 2009  update README and CHANGELOG [chriskite]
file       LICENSE.txt      Tue Apr 14 12:14:47 -0700 2009  initial import [Chris Kite]
file       README.rdoc      Wed Dec 16 12:06:09 -0800 2009  update README and CHANGELOG [chriskite]
file       anemone.gemspec  Wed Nov 04 09:06:58 -0800 2009  added new specs to gemspec [chriskite]
directory  bin/             Thu Oct 22 20:51:37 -0700 2009  change interperter line on anemoen cli bin [chriskite]
directory  lib/             Wed Dec 16 12:05:50 -0800 2009  require .tch file extension on TokyoCabinet [chriskite]
directory  spec/            Wed Dec 16 12:05:50 -0800 2009  require .tch file extension on TokyoCabinet [chriskite]
README.rdoc

Anemone

Anemone is a web spider framework that can spider a domain and collect useful information about the pages it visits. It is versatile, allowing you to write your own specialized spider tasks quickly and easily.

See anemone.rubyforge.org for more information.

Features

  • Multi-threaded design for high performance
  • Tracks 301 HTTP redirects to understand a page’s aliases
  • Built-in BFS algorithm for determining page depth
  • Allows exclusion of URLs based on regular expressions
  • Choose the links to follow on each page with focus_crawl()
  • HTTPS support
  • Records response time for each page
  • CLI program can list all pages in a domain, calculate page depths, and more
  • Obeys robots.txt
  • In-memory or persistent storage of pages during crawl, using TokyoCabinet or PStore
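
To make the page-depth and URL-exclusion features concrete, here is a minimal standalone sketch of the underlying idea: a breadth-first search over a link graph that skips URLs matching given regular expressions. It is not Anemone's own implementation. The LINKS graph and the page_depths method are hypothetical names invented for this illustration, and the graph is an in-memory hash rather than real fetched pages.

```ruby
# Hypothetical in-memory link graph: each URL maps to the URLs it links to.
LINKS = {
  '/'             => ['/about', '/blog', '/admin/login'],
  '/about'        => ['/', '/team'],
  '/blog'         => ['/blog/post-1', '/about'],
  '/blog/post-1'  => ['/team'],
  '/team'         => [],
  '/admin/login'  => ['/admin/secret'],
  '/admin/secret' => []
}.freeze

# Breadth-first search from root, ignoring any link that matches one of
# skip_patterns. Returns a hash of URL => depth, where depth is the smallest
# number of links that must be followed to reach the URL from root.
def page_depths(links, root, skip_patterns = [])
  depths = { root => 0 }
  queue = [root]
  until queue.empty?
    url = queue.shift
    links.fetch(url, []).each do |link|
      next if depths.key?(link)                               # already reached
      next if skip_patterns.any? { |pattern| link =~ pattern } # excluded URL
      depths[link] = depths[url] + 1
      queue << link
    end
  end
  depths
end

# Compute depths from the root page while excluding everything under /admin.
depths = page_depths(LINKS, '/', [%r{\A/admin}])
depths.each { |url, depth| puts "#{depth} #{url}" }
```

Because BFS visits pages in order of increasing distance, the first time a URL is reached is also its shortest path from the root, which is why each depth is recorded once and never updated.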

Examples

See the scripts under the lib/anemone/cli directory for examples of several useful Anemone tasks.
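
For orientation, a crawl might look roughly like the sketch below. It assumes the anemone gem is installed and that Anemone.crawl, skip_links_like, and on_every_page behave as described in the project documentation for your installed version (only focus_crawl is named in this README). The crawl_site method name, the URL passed to it, and the patterns are placeholders chosen for illustration.

```ruby
# Sketch only: needs the anemone gem. crawl_site is a hypothetical helper name.
def crawl_site(root_url)
  # Loaded lazily so this file can be read without the gem installed.
  require 'anemone'

  Anemone.crawl(root_url) do |anemone|
    # Exclude URLs that match any of these regular expressions (placeholders).
    anemone.skip_links_like(/\.pdf$/i, %r{/private/})

    # Choose which links to follow: here, at most the first ten on each page.
    anemone.focus_crawl { |page| page.links.first(10) }

    # Run this block once for every page that is fetched.
    anemone.on_every_page do |page|
      puts page.url
    end
  end
end
```

Calling crawl_site("http://www.example.com/") would start the crawl; nothing is fetched until the method is invoked.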

Requirements

  • nokogiri
  • robots