Expose Lucene's codec api #2411

Closed
martijnvg opened this issue Nov 14, 2012 · 0 comments

This issue adds the option to configure a PostingsFormat and assign it to a field in the mapping. This is an expert-level feature, and in almost all cases Elasticsearch's defaults will suit your needs.

Configuring a postings format per field

Several postings formats are configured by default and can be used in your mapping:

  • pulsing - A postings format that encodes the postings list for terms with low document frequency directly in the term dictionary.
  • direct - A postings format that wraps the default postings format at write time, but loads the terms and postings lists directly into memory as raw arrays at read time. This postings format is exceptionally memory intensive, but can give a substantial increase in search performance.
  • memory - A postings format that loads and stores terms and postings lists in memory using an FST. Acts like a cached postings list.
  • bloom_default - Maintains a bloom filter for the indexed terms, which is stored on disk, and builds on top of the default postings format. This postings format is useful for low document frequency terms and lets seeks to terms that don't exist fail fast.
  • bloom_pulsing - Similar to the bloom_default postings format, but builds on top of the pulsing postings format.
  • default - The default postings format, used if none is specified.

A postings_format attribute can be configured on any field. Example mapping:

{
  "person" : {
     "properties" : {
         "second_person_id" : {"type" : "string", "postings_format" : "pulsing"}
     }
  }
}
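As a further sketch, assuming the same mapping structure, different fields can use different postings formats, and a field without a postings_format attribute falls back to the default. The field names person_id and name are hypothetical and only serve as illustration:

```json
{
  "person" : {
     "properties" : {
         "person_id" : {"type" : "string", "postings_format" : "bloom_default"},
         "name" : {"type" : "string"}
     }
  }
}
```

Here person_id is assumed to hold mostly unique values (low document frequency per term), which is the case the bloom_default description targets.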

Configuring a custom postings format

It is possible to instantiate custom postings formats. These are specified via the index settings.

{
   "codec" : {
      "postings_format" : {
         "my_format" : {
            "type" : "pulsing",
            "freq_cut_off" : "5"
         } 
      }
   }
}

In the above example, freq_cut_off is set to 5 (it defaults to 1). This tells the pulsing postings format to inline the postings list of any term with a document frequency lower than or equal to 5 in the term dictionary.
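The issue text does not show how a custom format is attached to a field. A minimal sketch, assuming a custom format is referenced by its configured name in the same way as the built-in ones, would be a mapping like the following (reusing the second_person_id field from the earlier example):

```json
{
  "person" : {
     "properties" : {
         "second_person_id" : {"type" : "string", "postings_format" : "my_format"}
     }
  }
}
```

Under that assumption, terms in second_person_id that occur in at most 5 documents would have their postings lists inlined in the term dictionary.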

Note: when we document this, we need to properly document and expose all the configuration options for all codecs.

kimchy added a commit that referenced this issue Feb 17, 2013
uses more hash iterations, yet require less memory for the same fpp
relates to #2411
imotov pushed a commit to imotov/elasticsearch that referenced this issue Feb 18, 2013
uses more hash iterations, yet require less memory for the same fpp
relates to elastic#2411