public
Fork of ulbrich/couchsphinx
Description: A full text indexing extension for MongoDB using Sphinx.
Homepage:
Clone URL: git://github.com/burke/mongosphinx.git
name age message
file .gitignore Thu Aug 27 12:34:22 -0700 2009 mostly ported [burke]
file README.rdoc Fri Sep 25 10:38:09 -0700 2009 remove references to couchsphinx in readme [burke]
file Rakefile Fri Aug 28 09:43:33 -0700 2009 ITS PEANUT BUTTER COMMIT TIME [burke]
directory lib/ Fri Aug 28 13:12:13 -0700 2009 works! [burke]
file mongosphinx.gemspec Thu Aug 27 12:34:22 -0700 2009 mostly ported [burke]
file mongosphinx.rb Fri Aug 28 09:43:33 -0700 2009 ITS PEANUT BUTTER COMMIT TIME [burke]
README.rdoc

MongoSphinx

The MongoSphinx library implements an interface between MongoDBand Sphinx supporting MongoMapper to automatically index objects in Sphinx. It tries to act as transparent as possible: Just an additional method in MongoMapper and some Sphinx configuration are needed to get going.

Prerequisites

MongoSphinx needs gems MongoMapper and Riddle as well as a running Sphinx and a MongoDB installation.

  sudo gem sources -a http://gems.github.com  # Only needed once!
  sudo gem install riddle
  sudo gem install mongomapper
  sudo gem install burke-mongosphinx

No additional configuraton is needed for interfacing with MongoDB: Setup is done when MongoMapper is able to talk to the MongoDB server.

A proper "sphinx.conf" file and a script for retrieving index data have to be provided for interfacing with Sphinx: Sorry, no UltraSphinx like magic… :-) Depending on the amount of data, more than one index may be used and indexes may be consolidated from time to time.

This is a sample configuration for a single "main" index:

  searchd {
    address = 0.0.0.0
    port = 3312

    log = ./sphinx/searchd.log
    query_log = ./sphinx/query.log
    pid_file = ./sphinx/searchd.pid
  }

  source mongoblog {
    type = xmlpipe2

    xmlpipe_command = "rake sphinx:genxml"
  }

  index mongoblog {
    source = mongoblog

    charset_type = utf-8
    path = ./sphinx/sphinx_index_main
  }

Notice the line "xmlpipe_command =". This is what the indexer runs to generate its input. You can change this to whatever works best for you, but I set it up as a rake task, with the following in `lib/tasks/sphinx.rake` .

Here, :fields is a list of fields to export. Performance tends to suffer if you export everything, so you’ll probably want to just list the fields you’re indexing.

  namespace :sphinx do
    task :genxml => :environment do
      puts MongoSphinx::Indexer::XMLDocset.new(Food.all(:fields => 'name')).to_s
    end
  end

Models

Use method fulltext_index to enable indexing of a model. The default is to index all attributes but it is recommended to provide a list of attribute keys.

A side effect of calling this method is, that MongoSphinx overrides the default of letting MongoDB create new IDs: Sphinx only allows numeric IDs and MongoSphinx forces new objects with the name of the class, a hyphen and an integer as ID (e.g. Post-38497238). Again: Only these objects are indexed due to internal restrictions of Sphinx.

Sample:

  class Post
    include MongoMapper::Document

    key :title, String
    key :body, String

    fulltext_index :title, :body
  end

Add options :server and :port to fulltext_index if the Sphinx server to query is running on a different server (defaults to "localhost" with port 3312).

Here is a full-featured sample setting additional options:

  fulltext_index :title, :body, :server => 'my.other.server', :port => 3313

Indexing

Automatically starting the reindexing of objects the moment new objects are created can be implemented by adding a filter to the model class:

 after_save :reindex
 def reindex
  `sudo indexer --all --rotate` # Configure sudo to allow this call...
  end

This or a similar callback should be added to all models needing instant indexing. If indexing is not that crucial or load is high, some additional checks for the time of the last call should be added.

Keep in mind that reindexing is not incremental, and xml is generated to pass data from mongo to sphinx. It’s not a speedy operation on large datasets.

Queries

An additional instance method by_fulltext_index is added for each fulltext indexed model. This method takes a Sphinx query like "foo @title bar", runs it within the context of the current class and returns an Array of matching MongoDB documents.

Samples:

  Post.by_fulltext_index('first')
  => [...]

  post = Post.by_fulltext_index('this is @title post').first
  post.title
  => "First Post"
  post.class
  => Post

Additional options :match_mode, :limit and :max_matches can be provided to customize the behaviour of Riddle. Option :raw can be set to true to do no lookup of the document IDs but return the raw IDs instead.

Sample:

  Post.by_fulltext_index('my post', :limit => 100)

Copyright & License

Copyright © 2009 Burke Libbey, Ryan Neufeld

CouchSphinx Copyright © 2009 Holtzbrinck Digital GmbH, Jan Ulbrich

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.