
Distributed percolator engine #3173

Closed

@martijnvg

Description

Background

Redesigning the percolator engine is targeted for version 1.0. The main reason the rewrite is necessary is that the current percolate engine doesn't scale. The idea is that percolating a document should be executed in the same manner as a distributed search request.

In the current approach queries are stored in a single-primary-shard index that is auto-replicated to each data node. This allows the percolation to happen locally. When a large number of queries is indexed into this _percolator index, percolating a document simply starts to take too long. Also, all queries are loaded into memory (a map from query uid to Lucene Query), so heap space issues can occur. On top of that, with the current API the query always needs to be indexed into the _percolator index, and the type is the name of the index the query is percolated for. So scaling out the percolator feature is needed to distribute the percolator execution and memory load.
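
For reference, registering a query with the current (pre-redesign) API looks roughly like this; the query id is arbitrary and the body follows the same shape as the examples below:

# Current API: the index is always _percolator, the type names the target index.
curl -XPUT 'localhost:9200/_percolator/twitter/p_es' -d '{
    "query" : {
        "match" : {
            "message" : "elasticsearch"
        }
    }
}'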

Because percolation will be a distributed request, the percolate option in the index API is scheduled to be removed. The main reason is that we can't block and wait in the index API for a distributed percolate request to complete. The percolate request may take longer to complete than the actual index request (we currently percolate during replication) and would thus slow down the actual index request.

To substitute the percolate-while-indexing option, one just needs to run the percolate API directly after the index API returns. The percolate API will remain a real-time API.
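
For illustration, the index-then-percolate sequence would look roughly like this, reusing the index, type, and document from the examples in this issue:

# Index the document first ...
curl -XPUT 'localhost:9200/twitter/tweet/1' -d '{
    "message" : "Bonsai tree in elasticsearch office"
}'

# ... then percolate the same document explicitly.
curl -XGET 'localhost:9200/twitter/tweet/_percolate' -d '{
    "doc" : {
        "message" : "Bonsai tree in elasticsearch office"
    }
}'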

Implementation plan

The percolator index type approach stores the percolate queries in a special _percolator type, with its own mapping, either in the same index as the actual data or in a different index (a dedicated percolation index, which might require different sharding behavior compared to the index that holds the actual data and is being searched on). This approach also allows the percolator to scale beyond the single-shard execution we have today, meaning we both partition the percolate queries and distribute the percolate execution.

Store a query in the twitter index:

curl -XPUT 'localhost:9200/twitter/_percolator/p_es' -d '{
    "query" : {
        "match" : {
            "message" : "elasticsearch"
        }
    }
}'

Percolating a document uses the same REST endpoint:

curl -XGET 'localhost:9200/twitter/tweet/_percolate' -d '{
    "doc" : {
        "message" : "Bonsai tree in elasticsearch office"
    }
}'

The response initially doesn't change. The REST endpoint will also support a routing query string parameter, to allow documents to be percolated only against queries in specific shards.
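
A sketch of how routing might be passed; the parameter follows the convention of the existing search API, though the exact form isn't finalized in this issue:

# Restrict percolation to the shard(s) the routing value maps to.
curl -XGET 'localhost:9200/twitter/tweet/_percolate?routing=user1' -d '{
    "doc" : {
        "message" : "Bonsai tree in elasticsearch office"
    }
}'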

During regular searches we will automatically filter out documents with the _percolator type (only if it exists, so it's only added as overhead if explicitly used). We won't filter out the _percolator type if it is explicitly specified in the search request, since users might still want to search and get back the percolate queries.
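
For example, to search the registered queries themselves, one would target the _percolator type explicitly; a sketch using the existing search API:

# Explicitly searching the _percolator type returns the stored queries.
curl -XGET 'localhost:9200/twitter/_percolator/_search' -d '{
    "query" : {
        "match_all" : {}
    }
}'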

Backwards compatibility

The plan is not to keep backwards compatibility with the current percolate implementation. Percolate queries indexed via the old infrastructure will need to be migrated to the new planned infrastructure. The 'old' _percolator index won't be removed, so the queries can easily be copied to the new infrastructure by using a scan search request.
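
A migration could be driven by a scan search over the old index, roughly like the sketch below; the scroll time and page size are arbitrary, and each returned query would then be re-indexed into the new per-index _percolator type:

# Scroll through all queries stored in the old _percolator index.
curl -XGET 'localhost:9200/_percolator/_search?search_type=scan&scroll=1m&size=50' -d '{
    "query" : {
        "match_all" : {}
    }
}'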

Post redesign

After the redesign has been implemented, the next step is adding more features to the percolator. One of them is highlighting which parts of the query matched the document.

The idea is to have different response modes, for example (see the sketch after this list):

  • count - A count of how many queries matched the document.
  • compact - Returns a list of the query ids that matched the document (just like we do today).
  • verbose - Returns a body per matched query. This body could, for example, hold a query highlight in the future.
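
A rough sketch of how a mode could be requested; the percolate_mode parameter name is purely hypothetical, since this issue doesn't define the request syntax:

# Hypothetical: only return a count of matching queries.
curl -XGET 'localhost:9200/twitter/tweet/_percolate?percolate_mode=count' -d '{
    "doc" : {
        "message" : "Bonsai tree in elasticsearch office"
    }
}'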

Here are a few thoughts on follow-up features for the percolator:

  • Support additional operations such as highlighting.
  • Allow bulk percolation. There are two options: a simple bulk that uses the MemoryIndex to percolate the documents one by one, or somehow bulk index the docs into an in-memory index (MemoryIndex doesn't support more than one document; possibly a RAM based directory?) and then execute the queries against it. Bulk percolation will still need to be distributed and broken down into shard-level bulks.
  • Support percolating an existing document by specifying an index, type, id and optionally a version instead of an actual document (see the sketch after this list).
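
A sketch of what percolating an existing document might look like; the URL form with the document id and the version parameter is hypothetical, as the issue doesn't specify it:

# Hypothetical: percolate the already-indexed tweet with id 1,
# optionally pinning the expected document version.
curl -XGET 'localhost:9200/twitter/tweet/1/_percolate?version=2'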
