Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use new parallel reader API in ES 5.x #778

Closed
costin opened this issue Jun 2, 2016 · 4 comments
Closed

Use new parallel reader API in ES 5.x #778

costin opened this issue Jun 2, 2016 · 4 comments

Comments

@costin
Copy link
Member

costin commented Jun 2, 2016

ES 5.x has a proposal/PR elastic/elasticsearch/pull/18237 for allowing parallel reads for an alias/index based on a query based on an arbitrary number of slices. This goes beyond the capabilities of scan/scroll in ES 1.x/2.x and it's something that is desired in highly parallel environments or those that have memory/processing constrains (but high number of instances).

@garyelephant
Copy link

When will this feature be added ?

@Q-RK
Copy link

Q-RK commented Jul 5, 2016

It will help me a lot

@costin
Copy link
Member Author

costin commented Jul 12, 2016

@jimferenczi Can you please link(if you already have a ticket in place) or update this issue with your progress? Thanks.

@jimczi
Copy link
Contributor

jimczi commented Jul 21, 2016

@costin you can find the progress here:
#812

jimczi added a commit to jimczi/elasticsearch-hadoop that referenced this issue Jul 26, 2016
This commits changes how we split the query in multiple partitions.
For cluster running with version prior to v5.x:
* We create one partition for each shard of each requested index.
* If an alias is requested the search routing and the alias filter are respected.
* The partition is no longer attached to a node nor an ip. Only the shardId and index name are defined in order to be able to use any replica in the cluster when the partition is consumed. This makes the retry possible if a node disapears during a job.
* The ability to consume a partition on the node that is responsible for the index/shardId has been removed temporarily and should be re-added in a follow up.

For cluster ruuning with version v5.x:
* We first split by index then by shard and finally by the maximum number of documents allowed per partition (configurable through the new option named es.input.maxdocsperpartition. For instance an index with 5 shards, 1M documents and a maximum number of documents allowed per partition equals to 100,000, a match all query would be splitted in 50 partitions, 10 partitions per shard.
* If an alias is requested the search routing and the alias filter are respected.

Fixes elastic#778
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants