Allow FieldData loading to be filtered #2874

Closed
s1monw opened this issue Apr 9, 2013 · 0 comments
s1monw commented Apr 9, 2013

FieldData Filter

FieldData is an in-memory representation of the term dictionary in an uninverted form. Under certain circumstances this representation can grow very large, for example on high-cardinality fields such as tokenized full text. Depending on the use case, filtering the terms that are held in the FieldData representation can substantially improve execution performance and application stability.

FieldData filters can be applied on a per-segment basis. During FieldData loading, the terms enumeration is passed through a filter predicate that either accepts or rejects each term.
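
To make the accept/reject idea concrete, here is a minimal Python sketch of passing a per-segment term enumeration through a predicate. It is an illustration only, not the Elasticsearch implementation: `load_terms`, `load_field_data`, and the list-of-lists segment model are hypothetical names introduced for this example.

```python
from typing import Callable, Iterable, Iterator, List

# Hypothetical type: a predicate that accepts (True) or rejects (False) a term.
TermFilter = Callable[[str], bool]


def load_terms(term_enum: Iterable[str], accept: TermFilter) -> Iterator[str]:
    """Yield only the terms from one segment that the predicate accepts."""
    for term in term_enum:
        if accept(term):
            yield term


def load_field_data(segments: List[List[str]], accept: TermFilter) -> List[List[str]]:
    """Apply the filter to each segment independently (per-segment basis)."""
    return [list(load_terms(segment, accept)) for segment in segments]


# Toy data: each inner list stands in for one segment's term enumeration.
segments = [["a", "quick", "the"], ["fox", "the"]]
rejected = {"a", "the"}
loaded = load_field_data(segments, lambda term: term not in rejected)
```

Because the predicate is evaluated while terms are enumerated, rejected terms never reach the in-memory structure in this sketch.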

Note: this feature is only supported on string fields

Frequency Filter

The Frequency Filter acts as a high/low-pass filter on the document frequency of each term within the segment that is being loaded into FieldData. It rejects terms that are very frequent or very infrequent, based either on absolute frequencies or on percentages. Percentages are relative to the number of documents in the segment or, more precisely, to the number of documents that have at least one value in the field being loaded in the current segment.

Here is an example mapping:

{
    "tweet" : {
        "properties" : {
            "locale" : {
                "type" : "string",
                "fielddata" : "format=paged_bytes;filter.frequency.min=0.001;filter.frequency.max=0.1",
                "index" : "analyzed"
            }
        }
    }
}
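
The `fielddata` value in the mapping above is a single string of `key=value` pairs separated by semicolons. As a rough sketch of how such a string could be split into individual settings, the Python below uses a hypothetical helper, `parse_fielddata_settings`; how Elasticsearch itself parses the string is not described here, so treat the splitting rules as an assumption drawn from the example.

```python
from typing import Dict


def parse_fielddata_settings(spec: str) -> Dict[str, str]:
    """Split 'k1=v1;k2=v2' into a dict (hypothetical helper, assumed format)."""
    settings: Dict[str, str] = {}
    for part in spec.split(";"):
        part = part.strip()
        if not part:
            continue  # tolerate a trailing or doubled semicolon
        # Split on the first '=' only, so values may themselves contain '='.
        key, sep, value = part.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {part!r}")
        settings[key.strip()] = value.strip()
    return settings


spec = "format=paged_bytes;filter.frequency.min=0.001;filter.frequency.max=0.1"
settings = parse_fielddata_settings(spec)
```

Values stay as strings in this sketch; converting `filter.frequency.min` and `filter.frequency.max` to numbers is left to the caller.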

Parameters

  • filter.frequency.min - the minimum document frequency (inclusive) a term must have in order to be loaded into memory. Either a percentage if < 1.0 or an absolute value. 0 if omitted.
  • filter.frequency.max - the maximum document frequency (inclusive) a term may have in order to be loaded into memory. Either a percentage if < 1.0 or an absolute value. 0 if omitted.
  • filter.frequency.min_segment_size - the minimum number of documents in a segment in order for the filter to be applied. Small segments might be omitted with this setting.
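
The interaction of these three parameters can be sketched in Python as follows. This is an illustrative model, not the actual implementation: `frequency_filter`, `to_absolute`, and the dict of document frequencies are hypothetical. One assumption is called out in the code: an omitted `max` is treated here as "no upper bound", which is a choice made for this sketch only.

```python
from typing import Dict, Optional


def to_absolute(threshold: float, docs_with_field: int) -> float:
    """Read values below 1.0 as a fraction of the documents that have a value."""
    return threshold * docs_with_field if threshold < 1.0 else threshold


def frequency_filter(
    doc_freqs: Dict[str, int],
    docs_with_field: int,
    segment_size: int,
    min_freq: float = 0.0,
    max_freq: Optional[float] = None,  # assumption: None means no upper bound
    min_segment_size: int = 0,
) -> Dict[str, int]:
    """Keep terms whose document frequency lies in [min, max], both inclusive."""
    if segment_size < min_segment_size:
        return dict(doc_freqs)  # segment too small: filter is not applied
    low = to_absolute(min_freq, docs_with_field)
    high = float("inf") if max_freq is None else to_absolute(max_freq, docs_with_field)
    return {term: df for term, df in doc_freqs.items() if low <= df <= high}


# Toy segment: 1000 documents, all of which have a value in the field.
doc_freqs = {"the": 900, "search": 50, "xyzzy": 2}
kept = frequency_filter(doc_freqs, 1000, 1000, min_freq=0.01, max_freq=0.1)
unfiltered = frequency_filter(
    doc_freqs, 1000, 1000, min_freq=0.01, max_freq=0.1, min_segment_size=5000
)
```

With these toy numbers the thresholds resolve to roughly 10 and 100 documents, so the very common term and the very rare term are both rejected, while the segment below `min_segment_size` is loaded unfiltered.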

Regular Expression Filter

The regular expression filter applies a regular expression to each term during loading and only loads terms into memory that match the given regular expression.

Here is an example mapping:

{
    "tweet" : {
        "properties" : {
            "locale" : {
                "type" : "string",
                "fielddata" : "format=paged_bytes;filter.regex=^en_.*",
                "index" : "analyzed"
            }
        }
    }
}
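
As a small illustration of the effect of the pattern `^en_.*` from the mapping above, the Python sketch below keeps only matching terms. `regex_filter` is a hypothetical helper, and Python's `re` module stands in for whatever regex engine is actually used, so matching semantics beyond this simple anchored pattern are an assumption.

```python
import re
from typing import Iterable, List


def regex_filter(terms: Iterable[str], pattern: str) -> List[str]:
    """Keep only the terms that match the pattern (hypothetical helper)."""
    compiled = re.compile(pattern)
    # search() relies on the pattern's own '^' anchor to pin it to the start.
    return [term for term in terms if compiled.search(term)]


terms = ["en_US", "en_GB", "de_DE", "fr_en_"]
kept_terms = regex_filter(terms, r"^en_.*")
```

Only the terms that start with `en_` survive; `fr_en_` contains the substring but is rejected because of the leading anchor.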