jkraemer / rdig

Crawler and content extractor for building a full text index of a website's contents. Uses Ferret for indexing.

This URL has Read+Write access

rdig /
README
= RDig

RDig provides an HTTP crawler and content extraction utilities
to help building a site search for web sites or intranets. Internally,
Ferret is used for the full text indexing. After creating a config file 
for your site, the index can be built with a single call to rdig.

RDig depends on Ferret (>= 0.10.0) and, for parsing HTML, on either
Hpricot (>= 0.4) or the RubyfulSoup library (>= 1.0.4). As I know no way 
to specify such an OR dependency in a gem specification, the gem depends
on Hpricot. If this is a problem for you, install the gem with --force and 
manually do a +gem install rubyful_soup+.

== basic usage


=== Index creation
- create a config file based on the template in doc/examples
- to create an index:
    rdig -c CONFIGFILE
- to run a query against the index (just to try it out)
    rdig -c CONFIGFILE -q 'your query'
  this will dump the first 10 search results to STDOUT

=== Handle search in your application:
  require 'rdig'
  require 'rdig_config'   # load your config file here
  search_results = RDig.searcher.search(query, options={})

see RDig::Search::Searcher for more information.


== usage in rails

- add to config/environment.rb :
    require 'rdig'
    require 'rdig_config'
- place rdig_config.rb into config/ directory.
- build index:
    rdig -c config/rdig_config.rb
- in your controller that handles the search form:
    search_results = RDig.searcher.search(params[:query])
    @results = search_results[:list]
    @hitcount = search_results[:hitcount]

=== search result paging
Use the :first_doc and :num_docs options to implement 
paging through search results. 
(:num_docs is 10 by default, so without using these options only the first 10
results will be retrieved)


== sample configuration

from doc/examples/config.rb. The tag_selector properties are called 
with a BeautifulSoup instance as parameter. See the RubyfulSoup 
Site[http://www.crummy.com/software/RubyfulSoup/documentation.html] for more info about this cool lib.
You can also have a look at the +html_content_extractor+ unit test.

:include:doc/examples/config.rb