Distributed percolator engine #3173
Comments
Currently the queries are executed against the MemoryIndex sequentially, in a loop. Has any thought been given to somehow combining queries, in case there are common sub-queries that are needlessly executed over and over?
@itsadok I haven't thought about it, but perhaps percolate queries could be organized in a tree-like structure, so that only a subset of the percolate queries has to be evaluated when percolating a document.
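A minimal sketch of the idea raised above, assuming a toy query model that is not Elasticsearch's or Lucene's: each registered query is a conjunction of term clauses, and clause results are cached per document so that a clause shared by several queries is evaluated only once. The names `percolate_shared` and `term_matches`, and the query ids, are hypothetical.

```python
from typing import Dict, FrozenSet, List, Tuple

# Toy model (an assumption, not Lucene's): a clause is a (field, term) pair,
# and a query is a set of clauses that must all match.
Clause = Tuple[str, str]
Query = FrozenSet[Clause]


def term_matches(doc: Dict[str, str], clause: Clause) -> bool:
    """Return True if the term is a whitespace-separated token of the field."""
    field, term = clause
    return term.lower() in doc.get(field, "").lower().split()


def percolate_shared(
    doc: Dict[str, str], queries: Dict[str, Query]
) -> Tuple[List[str], int]:
    """Return (sorted matching query ids, number of clause evaluations)."""
    cache: Dict[Clause, bool] = {}  # clause results, valid for this document only
    evaluations = 0
    matched: List[str] = []
    for query_id, clauses in queries.items():
        ok = True
        for clause in sorted(clauses):
            if clause not in cache:
                cache[clause] = term_matches(doc, clause)
                evaluations += 1
            if not cache[clause]:
                ok = False
                break  # a conjunction fails on its first non-matching clause
        if ok:
            matched.append(query_id)
    return sorted(matched), evaluations


queries = {
    "q1": frozenset({("body", "elasticsearch"), ("body", "percolator")}),
    "q2": frozenset({("body", "elasticsearch"), ("body", "lucene")}),
    "q3": frozenset({("body", "elasticsearch")}),
}
doc = {"body": "The Elasticsearch percolator runs queries against a document"}
matched_ids, clause_evaluations = percolate_shared(doc, queries)
```

In this example the shared `elasticsearch` clause is evaluated once instead of three times. A tree-like structure would go further by skipping whole groups of queries, which this sketch does not attempt.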
Following up on: https://twitter.com/Argorak/statuses/357893281193017344 I really miss a way to attach the result of percolation to the document itself. It doesn't necessarily have to be in the document itself; a child document would be useful as well. I often use the percolator to categorize incoming data. This data is provided by external services and is messy, though fixable by simple "search and clean". We use the percolator to register (sometimes user-created) queries that map those entries to our internal values. Currently, this works like this:
Allowing this within the percolation step itself would vastly reduce our network overhead in this case and (if bulk percolation happens) also allow us to do bulk actions in one step.
So it is like a percolate post-write operation, one that would update a specific part of the percolated document based on the percolate matches and then index the updated document? There are no plans for this kind of feature. You can create an issue for it if you want, so that the idea doesn't get lost.
Was this merged to master? If so, what version?
@missinglink Yes, this is now on master. It will be part of Elasticsearch when the 1.0 beta is released.
Background
Redesigning the percolate engine is targeted for version 1.0. The main reason the rewrite is necessary is that the current percolate engine doesn't scale. The idea is that percolating a document should be executed in the same manner as a distributed search request.
In the current approach queries are stored in a single primary shard index that is automatically replicated to each data node. This allows the percolation to happen locally. When a large number of queries are indexed into this _percolator index, percolating a document starts to take too long. Also, all queries are loaded into memory (Map: query uid -> Lucene Query), so heap space issues can occur. On top of this, with the current API the query always needs to be indexed into the _percolator index, and the type is the name of the index the query is percolated for. So scaling out the percolator feature is needed to share the percolator execution and memory load.
Because percolation will be a distributed request, the percolate option in the index API is scheduled to be removed. The main reason is that we can't block and wait in the index API for a distributed percolate request to complete. The percolate request may take longer to complete than the actual index request (we currently percolate during replication) and would thus slow down the actual index request.
To replace the percolate-while-indexing option, one just needs to run the percolate API directly after the index API has returned. The percolate API will remain a realtime API.
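The index-then-percolate sequence described above could look roughly like the following on the client side. This is only a sketch: `index_then_percolate`, the two callables, and the query id `es-tweets` are hypothetical stand-ins rather than an actual Elasticsearch client API, and the fake functions exist only to show the order of the calls.

```python
from typing import Any, Callable, Dict, List

Doc = Dict[str, Any]


def index_then_percolate(
    index_doc: Callable[[str, str, str, Doc], None],
    percolate_doc: Callable[[str, str, Doc], List[str]],
    index: str,
    doc_type: str,
    doc_id: str,
    doc: Doc,
) -> List[str]:
    """Index the document first, then percolate it as a separate request."""
    index_doc(index, doc_type, doc_id, doc)  # returns once indexing is done
    return percolate_doc(index, doc_type, doc)  # no longer blocks the index call


# In-memory stand-ins for the two API calls (assumptions for illustration).
calls: List[str] = []


def fake_index(index: str, doc_type: str, doc_id: str, doc: Doc) -> None:
    calls.append("index")


def fake_percolate(index: str, doc_type: str, doc: Doc) -> List[str]:
    calls.append("percolate")
    return ["es-tweets"] if "elasticsearch" in doc.get("body", "").lower() else []


matches = index_then_percolate(
    fake_index,
    fake_percolate,
    "twitter",
    "tweet",
    "1",
    {"body": "Trying out Elasticsearch percolation"},
)
```

Because the two steps are separate calls, a slow percolation delays only the caller that asked for it, not the index request itself.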
Implementation plan
The percolator index type approach stores the percolate queries in a special _percolator type with its own mapping, either in the same index where the actual data is or in a different index (a dedicated percolation index, which might require different sharding behavior compared to the index that holds the actual data and is being searched on). This approach also allows the percolator to scale beyond the single-shard execution we have today, meaning we both partition the percolated queries and distribute the percolate execution.
Store a query in the twitter index:
Percolating a document uses the same REST endpoint:
The response initially doesn't change. The REST endpoint will also support a routing query string parameter, so that documents are only percolated against queries in specific shards.
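The partitioning and routing described above can be sketched as a small in-memory simulation. This is an illustration under stated assumptions, not the planned implementation: `ShardedPercolator`, `shard_for`, and the byte-sum hash are hypothetical, queries are plain Python predicates, and the shards are looped over in one process where a real cluster would fan the request out to nodes.

```python
from typing import Callable, Dict, List, Optional

Doc = Dict[str, str]
Matcher = Callable[[Doc], bool]


def shard_for(key: str, num_shards: int) -> int:
    """Map a routing key to a shard (byte sum is an illustrative stand-in hash)."""
    return sum(key.encode("utf-8")) % num_shards


class ShardedPercolator:
    """Registered queries are partitioned across shards by routing key."""

    def __init__(self, num_shards: int) -> None:
        self.shards: List[Dict[str, Matcher]] = [{} for _ in range(num_shards)]

    def register(
        self, query_id: str, matcher: Matcher, routing: Optional[str] = None
    ) -> None:
        # Without an explicit routing value, the query id decides the shard.
        key = routing if routing is not None else query_id
        self.shards[shard_for(key, len(self.shards))][query_id] = matcher

    def percolate(self, doc: Doc, routing: Optional[str] = None) -> List[str]:
        # With routing, only one shard is consulted; otherwise all of them.
        if routing is None:
            targets = self.shards
        else:
            targets = [self.shards[shard_for(routing, len(self.shards))]]
        matches: List[str] = []
        for shard in targets:
            matches.extend(qid for qid, matcher in shard.items() if matcher(doc))
        return sorted(matches)  # merge step: combine the per-shard results


percolator = ShardedPercolator(num_shards=4)
percolator.register(
    "es", lambda d: "elasticsearch" in d["body"].lower(), routing="user-1"
)
percolator.register(
    "lucene", lambda d: "lucene" in d["body"].lower(), routing="user-2"
)
doc = {"body": "Elasticsearch is built on Lucene"}
all_matches = percolator.percolate(doc)
routed_matches = percolator.percolate(doc, routing="user-1")
```

Here the unrouted request visits every shard and finds both queries, while the request routed with `user-1` only sees the queries stored on that key's shard.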
During regular searches, we will automatically filter out documents with the _percolator type (only if it exists, so it is only added as overhead if explicitly used). We won't filter out the _percolator type if it is explicitly specified in the search request, since users might still want to search and get back the percolated queries.
Backwards compatibility
The plan is not to keep backwards compatibility with the current percolate implementation. Percolate queries indexed via the old infrastructure will need to be migrated to the new infrastructure. The 'old' _percolator index won't be removed, so the queries can easily be copied to the new infrastructure by using a scan search request.
Post redesign
After the redesign has been implemented, adding more features to the percolator is next. One of them is highlighting which parts of the query matched the document.
The idea is to have different response modes. For example:
count - A count of how many queries matched the document.
compact - Returns a list of query ids that matched the document (just like we do today).
verbose - Returns a body per matched query. This body could, for example, hold a query highlight in the future.
Here are a few thoughts on post features for percolator:
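To make the response modes listed above concrete, here is a minimal sketch of how a result formatter might branch on the mode. The function name `format_response` and every response field name (`count`, `matches`, `id`, `highlight`) are assumptions made for illustration; the issue does not define the response format.

```python
from typing import Any, Dict, List


def format_response(matched_ids: List[str], mode: str = "compact") -> Dict[str, Any]:
    """Shape the percolate result according to the requested response mode."""
    if mode == "count":
        # Only the number of matching queries.
        return {"count": len(matched_ids)}
    if mode == "compact":
        # The ids of the matching queries.
        return {"matches": list(matched_ids)}
    if mode == "verbose":
        # One body per matching query; "highlight" is a placeholder for
        # a possible future query highlight.
        return {"matches": [{"id": qid, "highlight": None} for qid in matched_ids]}
    raise ValueError(f"unknown response mode: {mode}")


matched = ["es-tweets", "lucene-tweets"]  # hypothetical query ids
count_response = format_response(matched, "count")
compact_response = format_response(matched, "compact")
verbose_response = format_response(matched, "verbose")
```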