Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Distributed percolator engine #3173

Closed
martijnvg opened this issue Jun 12, 2013 · 6 comments
Closed

Distributed percolator engine #3173

martijnvg opened this issue Jun 12, 2013 · 6 comments

Comments

@martijnvg
Copy link
Member

Background

Redesigning the percolate engine is targeted for version 1.0. The main reason why the rewrite is necessary is that the current perculate engine doesn't scale. The idea is that perculating a document should be executed in the same manner as a distributed search request.

In the current approach queries are stored in a single primary shard index, that is auto replicated to each data node. This allows the percolation to happen locally. In the case that large amount of queries are index into this _percolator index, percolating document just start to take to long. Also all queries are loaded into memory (Map: query uid -> Lucene Query), so in this case heap space issues can occur. On top of this with the current api the query always need to get index into the _percolator index and the type is the name of the index the query is percolated for. So scaling out the percolator feature is needed for sharing the percolator execution and memory load.

Because of the fact that percolation will be a distributed request, the perculate option in the index api is scheduled to be removed. The main reason behind this is that we can't block and wait in the index api for a distributed percolate request to complete. The perculate request may take longer to complete then the actual index request (we currently perculate during replication) and thus slowing down the actual index request.

To substitute the percolate while indexing option, one just needs to run percolate api directly after the index api returned. The percolate api will remain to be a realtime api.

Implementation plan

The percolator index type approach stores the percolate queries in a special _percolator type with its own mapping in the same index where the actual data is or in a different index (dedicated percolation index, which might require different sharding behavior compared to the index that holds actual data and being search on). This approach also allows percolator to scale beyond the single shard exection we have today, meaning we both partition the percolated queries, and distribute the percolate execution.

Store a query in the twitter index:

curl -XPUT 'localhost:9200/twitter/_percolator/p_es' -d '{
    "query" : {
        "match" : {
            "message" : "elasticsearch"
        }
    }
}'

Percolating a document uses the same rest end point:

curl -XGET 'localhost:9200/twitter/tweet/_percolate' -d '{
    "doc" : {
        "message" : "Bonsai tree in elasticsearch office"
    }
}'

The response initially doesn't change. The rest endpoint will also support a routing query string parameter, to allow documents to only be percolated on queries in specific shards.

During regular searches, we will automatically filter out documents with the _percolator type (only if it exists, so its only added as an overhead if explicitly used). We won't filter _percolator type if explciitly specified in the search request since users might still want to search and get back the percolated queries.

Backwards compatibility

The plan is not to keep backwards compatibility with the current percolate implementation. Percolate queries indexed via the old infrastructure will need to be migrated into the new planned infrastructure. The 'old' _percolate index won't be removed, so the queries can easily be copied to the new infrastructure by using a scan search request.

Post redesign

After the redesign has been implemented adding more features to the percolator is next. One of them is to highlight what parts of the query matched with the document.

The idea is have different response modes. For example:

  • count - A count of how many queries matched with the document.
  • compact - Returns a list of query ids that have matched with the document. (just like we do today)
  • verbose - Returns a body per matched query. This body can for example hold a query highlight in the future.

Here are a few thoughts on post features for percolator:

  • Support additional operations such as highlighting
  • Allow to do bulk percolation. Two options, simple bulk and use the MemoryIndex to do it one by one, or somehow bulk index the docs into an in memory index (MemoryIndex does't support more than one index, possible RAM based dir?), and then execute the queries against it. Bulk percolation will still need to be distributed and broken down into shard level bulks.
  • Support percolating an existing document, by specifying an index, type and id and optionally an version instead of an actual document.
@ghost ghost assigned martijnvg Jun 12, 2013
@itsadok
Copy link
Contributor

itsadok commented Jun 13, 2013

Currently the queries are executed against the MemoryIndex sequentially, in a loop. Has any thought been given to somehow combining queries, in case there are common sub-queries that are needlessly executed over and over?

@martijnvg
Copy link
Member Author

@itsadok I haven't thought about it, but perhaps percolate queries can be structured in a tree like structure, so that only a part of the percolate queries have to be evaluated when percolating a document.

@skade
Copy link
Contributor

skade commented Jul 18, 2013

Following up on:

https://twitter.com/Argorak/statuses/357893281193017344

I really miss the feature to somehow attach the result of percolation into the document itself. This doesn't necessarily be in the document itself, e.g. a child document would be useful as well. I often use the percolator to categorize incoming data. This data is given by external services and is messy, though fixable by simple "search and clean". We use the percolator to register (sometimes user-created) queries that map those entries to our internal values.

Currently, this works like this:

  • Percolate the document
  • Write the result to the categories field
  • Index the document

Allowing this within the percolation step itself would vastly reduce our network overhead in this case and (if bulk percolation happens) also allow us to do bulk actions in one step.

@martijnvg
Copy link
Member Author

So It is like a percolate post write operation? That would update a specific part of the percolated documents based on the percolate matches and then index this updated document.

There're no plans for this kind of feature. You can create an issue for this feature if you want, then this idea doesn't get lost.

martijnvg added a commit that referenced this issue Jul 19, 2013
@missinglink
Copy link
Contributor

Was this merged to master? if so what version?

@martijnvg
Copy link
Member Author

@missinglink Yes, this is now on master. Will be part of Elasticsearch when 1.0 beta will be released.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants