Indexing configuration

Florian Hanke edited this page Apr 14, 2011 · 6 revisions

Indexing

Indexing options define what Picky does with your data, for example where it splits the text into the words that get indexed.

Indexing is defined in app/application.rb, like all other index-specific options.

Examples

Define the default indexing behaviour, or the behaviour for a specific index, by calling indexing(options) with various options (described below, in the new Ruby 1.9 hash style):

class PickySearch < Application

  # ...

  indexing removes_characters: /[^a-zA-Z0-9\.]/,
           stopwords: /\b(and|or|in|on|is|has)\b/,
           splits_text_on: /\s/,
           removes_characters_after_splitting: /\./,
           substitutes_characters_with: CharacterSubstituters::WestEuropean.new,
           normalizes_words: [
             [/(.*)hausen/, '\1hn'],
             [/\b(\w*)str(eet)?/, '\1st']
           ]

  # ...

  index = Index::Memory.new(:some_index) do
    # ...
    indexing removes_characters: /[^a-zA-Z0-9\.]/,
             stopwords: /\b(and|or|in|on|is|has)\b/,
             splits_text_on: /\s/,
             removes_characters_after_splitting: /\./,
             substitutes_characters_with: CharacterSubstituters::WestEuropean.new,
             normalizes_words: [
               [/(.*)hausen/, '\1hn'],
               [/\b(\w*)str(eet)?/, '\1st']
             ]
    # ...
  end

end

This example does:

  • not remove letters, numbers, or periods (any other character is stripped).
  • remove the words and, or, in, on, is, and has from the indexed text (unless such a word is the only word occurring in the data).
  • split the text on whitespace: “fish market” is indexed as “fish”, “market”.
  • remove characters after splitting, in the same manner as removes_characters but applied to each split word (here, periods).
  • substitute certain West European special characters, e.g. “ü” with “ue”, or “ø” with “o”. (if you have them indexed that way)
  • normalize “Petershausen” (and similar) to “Petershn”, and “…street”/“…str” to “…st”.
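The steps above can be sketched by hand in plain Ruby. This is an illustration, not Picky's internal code, and the whitespace handling is simplified (whitespace is kept in the whitelist so the text can still be split afterwards):

```ruby
# Hand-rolled simulation of the example's indexing filters (illustrative only).
text = "fish market on baker street."

text  = text.gsub(/[^a-zA-Z0-9\.\s]/, '')          # removes_characters (whitespace kept so we can split)
text  = text.gsub(/\b(and|or|in|on|is|has)\b/, '') # stopwords
words = text.split(/\s/).reject(&:empty?)          # splits_text_on
words = words.map { |w| w.gsub(/\./, '') }         # removes_characters_after_splitting
words = words.map { |w| w.sub(/\b(\w*)str(eet)?/, '\1st') } # normalizes_words

tokens = words.map(&:to_sym)
# tokens == [:fish, :market, :baker, :st]
```

Note how “on” is dropped as a stopword and “street” is normalized to “st”.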

indexing options

Note: The options are almost the same as in the Searching Configuration.

The options are:

  • removes_characters(regexp)
  • stopwords(regexp)
  • splits_text_on(regexp)
  • removes_characters_after_splitting(regexp)
  • substitutes_characters_with(substituter)
  • normalizes_words(array of [regexp, replacement])
  • rejects_token_if(a_lambda)

By default, there is only one of these defined:
splits_text_on(/\s/)
So, if none of the above options is defined, Picky splits on whitespaces (\s).

The process

First the text is processed as a whole, then split into words, and finally made into tokens.

The sections below appear in the order the filters are applied.

substitutes_characters_with(substituter)

This is the very first step. Here, characters can be replaced in the text using a character substituter.

A character substituter is an object that has a #substitute(text) method that returns a text.

Currently, there is only CharacterSubstituters::WestEuropean.new (see [[CharacterSubstituters|Charactersubstituters-configuration]]).

Example:
substitutes_characters_with: CharacterSubstituters::WestEuropean.new
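Since any object responding to #substitute(text) works, you can write your own. A minimal sketch (the class name and mapping below are made up for illustration, not part of Picky):

```ruby
# A hypothetical character substituter: all it needs is a #substitute(text)
# method that returns the substituted text.
class SimpleUmlautSubstituter
  MAPPING = { 'ä' => 'ae', 'ö' => 'oe', 'ü' => 'ue' }.freeze

  def substitute text
    text.gsub(/[äöü]/) { |char| MAPPING[char] }
  end
end

result = SimpleUmlautSubstituter.new.substitute('Müller')
# result == "Mueller"
```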

removes_characters(regexp)

Defines what characters are removed from the indexed text.

Example:
removes_characters: /[0-9]/ if you don’t want any numbers to make it into the search engine.

Note that it is case sensitive, so /[a-z]/ will only remove lowercase letters.

Also note that Picky needs :, ", ~, and * to function properly, so please don’t remove these.

If you wish to define a whitelist, use [^...], e.g. /[^ïôåñëäöüa-zA-Z0-9\s\/\-\,\&\.\"\~\*\:]/i.
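A plain-Ruby illustration of what such a whitelist regexp removes (simplified whitelist, not Picky internals):

```ruby
# Everything NOT in the whitelist [a-zA-Z0-9\s] is stripped.
whitelist = /[^a-zA-Z0-9\s]/
cleaned = "café #1 is open!".gsub(whitelist, '')
# cleaned == "caf 1 is open"
```

Note how “é” is simply dropped here; this is one reason substitutes_characters_with runs before this step.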

stopwords(regexp)

Defines what words are removed from the text, after removing specific characters.

Example:
stopwords: /\b(and|the|of|it|in|for)\b/i would remove a number of stopwords, case-insensitively.

Note that if a stopword occurs alone, e.g. just "and", it is not removed from the text.
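In plain Ruby, the effect of such a stopwords regexp looks like this (illustrative, not Picky internals):

```ruby
# Stopwords are blanked out, then the remaining words survive splitting.
stopwords = /\b(and|the|of|it|in|for)\b/i
words = "The Lord of the Rings".gsub(stopwords, '').split(/\s/).reject(&:empty?)
# words == ["Lord", "Rings"]
```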

splits_text_on(regexp)

Define how the text is split into tokens. Tokens are what Picky works with and tries to find in indexes.

So, if you define splits_text_on(/\s/), then Picky will split input text "my beautiful query" into tokens [:my, :beautiful, :query].
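Sketched in plain Ruby (tokens shown as symbols, as in the text above):

```ruby
# Splitting on whitespace, then symbolizing each word, as Picky's tokens are symbols.
tokens = "my beautiful query".split(/\s/).map(&:to_sym)
# tokens == [:my, :beautiful, :query]
```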

normalizes_words(array of [regexp, replacement])

Defines rules for replacing words after splitting.

Example:
normalizes_words: [ [/\$(\w+)/i, '\1 dollars'] ]
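Applying that rule by hand in plain Ruby (illustrative):

```ruby
# Each [regexp, replacement] pair is applied to each word.
rules = [[/\$(\w+)/i, '\1 dollars']]
word  = "$100"
rules.each { |regexp, replacement| word = word.sub(regexp, replacement) }
# word == "100 dollars"
```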

removes_characters_after_splitting(regexp)

This is the same as removes_characters (see above), but after splitting.
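This is useful for characters that only matter at word boundaries, e.g. trailing periods that splitting leaves attached to the last word (plain-Ruby sketch):

```ruby
# After splitting, each word is cleaned individually.
words = "baker street.".split(/\s/)
words = words.map { |w| w.gsub(/\./, '') }
# words == ["baker", "street"]
```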

rejects_token_if(a_lambda)

Defines a lambda that can reject tokens: any token for which the lambda returns true is dropped. This is the last step.

Example:
rejects_token_if: lambda { |token| token.blank? || token == :i_dont_like_this_token }
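In plain Ruby, the effect of such a lambda looks like this. Note that #blank? comes from ActiveSupport; the sketch below uses a plain-Ruby equivalent, and the rejected symbol is made up for illustration:

```ruby
# Tokens for which the lambda returns true are dropped.
reject_token = lambda { |token| token.to_s.strip.empty? || token == :meaningless }

tokens = [:fish, :"", :meaningless, :market]
kept   = tokens.reject(&reject_token)
# kept == [:fish, :market]
```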