
elasticsearch spark size option to limit the number of documents returned #546

Closed
kim333 opened this issue Sep 8, 2015 · 4 comments
kim333 commented Sep 8, 2015

As @costin mentioned in #469, I have been trying to use the batch.size setting to control the number of documents returned from Elasticsearch into an RDD.

However, looking at the configuration options, the only batch.size settings I could find were es.batch.size.bytes and es.batch.size.entries, which do not appear to limit the number of documents returned from Elasticsearch. When I tried these options, elasticsearch-spark did not limit the results either.

What is the option to limit the number of documents returned from elasticsearch-spark?
Thanks

costin (Member) commented Oct 29, 2015

Provided a fix in master through es.scroll.limit. By default it reads all entries; when a positive value is specified, it limits the reads to that number.
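A minimal sketch of how the new setting could be passed from PySpark (the option name es.scroll.limit is from the comment above; the cluster address and the index name "my-index" are assumptions, and the snippet requires a running Elasticsearch cluster plus the elasticsearch-hadoop connector on the Spark classpath):

```python
# Sketch: cap the documents read per scroll with es.scroll.limit.
# Assumes a local Elasticsearch node and a hypothetical index "my-index".
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("es-scroll-limit-demo").getOrCreate()

df = (spark.read
      .format("org.elasticsearch.spark.sql")
      .option("es.nodes", "localhost:9200")
      .option("es.scroll.limit", "1000")  # at most 1000 docs per scroll
      .load("my-index"))

df.show()
```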

@costin costin closed this as completed Oct 29, 2015
kim333 (Author) commented Nov 11, 2015

Thanks! It works nicely. I am guessing the es.scroll.limit value is applied per shard? When I set it to 1 and run the job, I get 5 results (an index with 5 primary shards).

costin (Member) commented Nov 11, 2015

Yes. This is explained in the docs:

Number of total results/items returned by each individual scroll. A negative value indicates that all documents that match should be returned. Do note that this applies per scroll which is typically bound to one of the job tasks. Thus the total number of documents returned is LIMIT * NUMBER_OF_SCROLLS (OR TASKS)
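The arithmetic in the quoted passage can be sketched directly; the numbers below match the observation earlier in this thread (limit 1, 5 primary shards, one scroll/task per shard). The helper name is hypothetical:

```python
def total_docs_returned(scroll_limit, num_tasks):
    """Total documents a job returns when es.scroll.limit applies per scroll/task."""
    if scroll_limit < 0:   # a negative value means "return all matches"
        return None        # no fixed upper bound
    return scroll_limit * num_tasks

# es.scroll.limit = 1 against an index with 5 primary shards (5 tasks)
# yields 5 documents overall, as observed above.
print(total_docs_returned(1, 5))  # -> 5
```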

JihanZhuang commented

When the index is empty, I get the following error:
org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest: ElasticsearchIllegalArgumentException[Malformed scrollId []]
Is there another way to limit the number of docs returned?
