[es 1.4.4] Scroll query returns 0 results before fully done #10417
Comments
When the query returns 0 results prematurely, do you have shard failures in the response? |
We've seen shard failures before where 1 or 2 shards would fail and, instead of 1000 results, it would return 900 or 800. We didn't log the shard failures this time; supposing all 5 shards failed, any idea why that would happen? I'll add logging for that and circle back. |
@hjz can you provide the actual code you're using for scroll requests, plus any exceptions? |
Here's the query to start the scroll:

    // Opens a SCAN scroll and returns (total hits, scroll id).
    def startScrollQueryGetTotalHitsAndScrollId(index: String, filter: BaseFilterBuilder): (Long, String) = {
      val r = ES.client.prepareSearch(index)
        .setTypes(User.eventType)
        .setSearchType(SearchType.SCAN)
        .setFrom(0)
        .setSize(config.scrollSizePerShard)   // batch size per shard
        .setScroll(scrollTime)                // scroll keep-alive
      r.setQuery(QueryBuilders.filteredQuery(QueryBuilders.matchAllQuery(), filter))
      val resp = r.execute().actionGet(config.scrollTimeoutSecs, TimeUnit.SECONDS)
      (resp.getHits.getTotalHits, resp.getScrollId)
    }

Subsequent lookups are then done via:

    val scrollResp = ES.client
      .prepareSearchScroll(scrollId)
      .setScroll(scrollTime)
      .execute()
      .actionGet(config.scrollTimeoutSecs, TimeUnit.SECONDS)
    // Report any shard failures so partial batches are visible.
    if (scrollResp.getFailedShards > 0) {
      val shards = scrollResp.getShardFailures
      shards.foreach { s =>
        val msg = s"Shard failed scroll query: ${s.index()}, shardId: ${s.shardId()} reason: ${s.reason()}}"
        hipChatService.errorMessage(msg, createAlert = false)
      }
    }
    // Map the hits to ApiUser and return the scroll id for the next request.
    (scrollResp.getHits.getHits.map { sh =>
      ApiUser.fromSource(sh.getId, sh.source())
    }.toSeq, scrollResp.getScrollId)

I'll follow up with exceptions. |
@hjz I don't see where you are getting the scroll_id from the previous scroll request and using it for the next scroll request. Normally there would be some |
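For reference, the usual 1.x scan-and-scroll loop looks roughly like the sketch below; every response carries a scroll id that has to be passed to the next scroll request. Names such as ES.client, scrollTime, index and process(...) are placeholders borrowed from the code above or assumed for illustration.

    // Open the scan scroll, then keep asking for the next batch until a
    // response comes back with no hits. The keep-alive is renewed per request.
    var resp = ES.client.prepareSearch(index)
      .setSearchType(SearchType.SCAN)
      .setScroll(scrollTime)
      .setSize(config.scrollSizePerShard)
      .setQuery(QueryBuilders.matchAllQuery())
      .execute().actionGet()

    do {
      resp = ES.client.prepareSearchScroll(resp.getScrollId)
        .setScroll(scrollTime)
        .execute().actionGet()
      resp.getHits.getHits.foreach(hit => process(hit))   // process(...) is hypothetical
    } while (resp.getHits.getHits.nonEmpty)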
Yes, that's what we do. I omitted some code there. Essentially:

    // Fetches the next batch for the given scroll id, logging shard failures.
    // (Wrapped in Try so the loop below can match on Success/Failure; some of
    // the surrounding code was omitted from the original paste.)
    def getNextScroll(scrollId: String): Try[(Seq[ApiUser], String)] = Try {
      val scrollResp = ES.client
        .prepareSearchScroll(scrollId)
        .setScroll(scrollTime)
        .execute()
        .actionGet(config.scrollTimeoutSecs, TimeUnit.SECONDS)
      if (scrollResp.getFailedShards > 0) {
        val shards = scrollResp.getShardFailures
        shards.foreach { s =>
          val msg = s"Shard failed scroll query: ${s.index()}, shardId: ${s.shardId()} reason: ${s.reason()}}"
          hipChatService.errorMessage(msg, createAlert = false)
        }
      }
      (scrollResp.getHits.getHits.map { sh =>
        ApiUser.fromSource(sh.getId, sh.source())
      }.toSeq, scrollResp.getScrollId)
    }

The loop that drives it:

    val (totalHits, startScrollId) = startScrollQueryGetTotalHitsAndScrollId(index, filter)
    var currentScrollId = startScrollId
    var batch = 1
    var remaining = totalHits
    var retries = 0
    var batchHits = 0L
    do {
      val startTime = DateTime.now
      getNextScroll(currentScrollId) match {
        case Success((users, nextScrollId)) =>
          batchHits = users.size.toLong   // how many hits this batch returned (elided in the original paste)
          currentScrollId = nextScrollId
          batch += 1
        case Failure(e) =>
          batchHits = 0
          hipChatService.sendMessage(s"$logHdr scroll query error. ${e.getMessage}. retry $retries", color = HipChatNotificationColor.Red)
      }
      // If hits was 0 but there are users remaining, something fubared, so sleep for BatchRetrySeconds and retry
      if (batchHits == 0 && remaining > 0) {
        retries += 1
        Logger.error(s"$logHdr Elasticsearch scroll request failed, retries: $retries. Sleeping for ${config.scrollRetrySleepDuration.seconds} seconds")
        Thread.sleep(config.scrollRetrySleepDuration.millis)
      }
      remaining -= batchHits
    } while (remaining > 0 && retries < config.scrollMaxRetries) |
@hjz Have you gotten any logs for this issue yet? |
    // Starts just after the initial scroll query
    2015-04-20 17:34:11,281 [ERROR] application - Shard failed scroll query: null, shardId: -1 reason: RemoteTransportException[[datanode4][inet[/x.x.x.248:9300]][indices:data/read/search[phase/scan/scroll]]]; nested: EsRejectedExecutionException[rejected execution (queue capacity 1000) on org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler@1cf09828]; }
    2015-04-20 17:34:11,282 [ERROR] application - Shard failed scroll query: null, shardId: -1 reason: RemoteTransportException[[datanode4][inet[/x.x.x.248:9300]][indices:data/read/search[phase/scan/scroll]]]; nested: EsRejectedExecutionException[rejected execution (queue capacity 1000) on org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler@57767fad]; }
    2015-04-20 17:34:11,283 [ERROR] application - Shard failed scroll query: null, shardId: -1 reason: RemoteTransportException[[datanode4][inet[/x.x.x.248:9300]][indices:data/read/search[phase/scan/scroll]]]; nested: EsRejectedExecutionException[rejected execution (queue capacity 1000) on org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler@779578ef]; }
    2015-04-20 17:34:11,305 [ERROR] application - [xxx Batch 3 - 14531 remaining] [200 DONE] Fetching: 56.319 secs. Queuing: 0.015 secs. 3.550253843149785 msgs/sec
    2015-04-20 17:34:51,414 [ERROR] application - [xxx Batch 4 - 14331 remaining] [200 DONE] Fetching: 40.094 secs. Queuing: 0.013 secs. 4.986660682673848 msgs/sec
    2015-04-20 17:34:59,458 [ERROR] application - [xxx Batch 5 - 14131 remaining] [200 DONE] Fetching: 8.029 secs. Queuing: 0.013 secs. 24.869435463814973 msgs/sec
    2015-04-20 17:34:59,574 [ERROR] application - [xxx Batch 6 - 13931 remaining] [200 DONE] Fetching: 0.099 secs. Queuing: 0.015 secs. 1754.3859649122805 msgs/sec
    2015-04-20 17:34:59,646 [ERROR] application - [xxx Batch 7 - 13731 remaining] [200 DONE] Fetching: 0.057 secs. Queuing: 0.012 secs. 2898.550724637681 msgs/sec
    2015-04-20 17:34:59,865 [ERROR] application - [xxx Batch 9 - 13331 remaining] [200 DONE] Fetching: 0.098 secs. Queuing: 0.027 secs. 1600.0 msgs/sec
    2015-04-20 17:34:59,994 [ERROR] application - [xxx Batch 10 - 13131 remaining] [200 DONE] Fetching: 0.083 secs. Queuing: 0.045 secs. 1562.5 msgs/sec
    2015-04-20 17:35:00,009 [ERROR] application - [xxx Batch 11 - 12931 remaining] [9 DONE] Fetching: 0.011 secs. Queuing: 0.002 secs. 692.3076923076924 msgs/sec
    2015-04-20 17:35:00,012 [ERROR] application - [xxx Batch 12 - 12922 remaining] [0 DONE] Fetching: 0.0 secs. Queuing: 0.0 secs. NaN msgs/sec
    2015-04-20 17:35:00,015 [ERROR] application - [xxx Batch 13 - 12922 remaining] Elasticsearch scroll request failed, retries: 1. Sleeping for 5 seconds
    2015-04-20 17:35:00,071 [WARN ] n.k.r.connection.AbstractConnection - Lockdown ended.
    2015-04-20 17:35:05,017 [ERROR] application - [xxx Batch 13 - 12922 remaining] [0 DONE] Fetching: 0.001 secs. Queuing: 0.0 secs. 0.0 msgs/sec
    2015-04-20 17:35:05,018 [ERROR] application - [xxx Batch 14 - 12922 remaining] Elasticsearch scroll request failed, retries: 2. Sleeping for 5 seconds
    2015-04-20 17:35:x.x.x [ERROR] application - [xxx Batch 14 - 12922 remaining] [0 DONE] Fetching: 0.0 secs. Queuing: 0.001 secs. 0.0 msgs/sec
    2015-04-20 17:35:10,021 [ERROR] application - [xxx Batch 15 - 12922 remaining] Elasticsearch scroll request failed, retries: 3. Sleeping for 5 seconds
    2015-04-20 17:35:15,023 [ERROR] application - [xxx Batch 15 - 12922 remaining] [0 DONE] Fetching: 0.001 secs. Queuing: 0.0 secs. 0.0 msgs/sec
    2015-04-20 17:35:15,025 [ERROR] application - [xxx Batch 16 - 12922 remaining] Elasticsearch scroll request failed, retries: 4. Sleeping for 5 seconds
    2015-04-20 17:35:20,027 [ERROR] application - [xxx Batch 16 - 12922 remaining] [0 DONE] Fetching: 0.001 secs. Queuing: 0.0 secs. 0.0 msgs/sec
    2015-04-20 17:35:20,030 [ERROR] application - [xxx Batch 17 - 12922 remaining] Elasticsearch scroll request failed, retries: 5. Sleeping for 5 seconds
    2015-04-20 17:35:25,032 [ERROR] application - [xxx Batch 17 - 12922 remaining] [0 DONE] Fetching: 0.001 secs. Queuing: 0.001 secs. 0.0 msgs/sec
    2015-04-20 17:35:25,034 [ERROR] application - [xxx Batch 18 - 12922 remaining] Elasticsearch scroll request failed, retries: 6. Sleeping for 5 seconds
    2015-04-20 17:35:30,036 [ERROR] application - [xxx Batch 18 - 12922 remaining] [0 DONE] Fetching: 0.001 secs. Queuing: 0.0 secs. 0.0 msgs/sec
    2015-04-20 17:35:30,038 [ERROR] application - [xxx Batch 19 - 12922 remaining] Elasticsearch scroll request failed, retries: 7. Sleeping for 5 seconds
    2015-04-20 17:35:30,350 [WARN ] n.k.r.connection.AbstractConnection - Lockdown ended.
    2015-04-20 17:35:35,040 [ERROR] application - [xxx Batch 19 - 12922 remaining] [0 DONE] Fetching: 0.0 secs. Queuing: 0.001 secs. 0.0 msgs/sec
    2015-04-20 17:35:35,042 [ERROR] application - [xxx Batch 20 - 12922 remaining] Elasticsearch scroll request failed, retries: 8. Sleeping for 5 seconds
    2015-04-20 17:35:40,044 [ERROR] application - [xxx Batch 20 - 12922 remaining] [0 DONE] Fetching: 0.001 secs. Queuing: 0.001 secs. 0.0 msgs/sec
    2015-04-20 17:35:40,046 [ERROR] application - [xxx Batch 21 - 12922 remaining] Elasticsearch scroll request failed, retries: 9. Sleeping for 5 seconds
    2015-04-20 17:35:45,048 [ERROR] application - [xxx Batch 21 - 12922 remaining] [0 DONE] Fetching: 0.001 secs. Queuing: 0.0 secs. 0.0 msgs/sec
    2015-04-20 17:35:45,050 [ERROR] application - [xxx Batch 22 - 12922 remaining] Elasticsearch scroll request failed, retries: 10. Sleeping for 5 seconds
    2015-04-20 17:35:45,415 [WARN ] n.k.r.connection.AbstractConnection - Lockdown ended.
    2015-04-20 17:35:50,051 [ERROR] application - [xxx Batch 22 - 12922 remaining] Some users failed to fetch: remaining, 12922 retries: 10
    // We stop trying here |
@hjz For your value of |
We have the _all field turned off. This is hitting 15 shards with 2 replicas. No doc value formats. |
I seem to have this issue too... I'm scanning a little more than 100 indices with
on :
I'm using pattern I haven't noticed it would happen with sorted |
I figured out why this happened to me: the scan slows down so much that it runs past the scroll keep-alive, so the scroll context expires.
|
I have had some trouble with this as well. Sadly, it's been long enough that I won't be able to remember the details for debugging. (I just happened to see this issue while looking for something else.) What I can say is that I've been able to (as far as I can tell) eliminate the possibility of missing data during a scroll by counting the number of documents I've seen and checking at the end of the scroll that it exactly matches the total_hits reported by the initial query. This seems like a good sanity check for anyone doing a scan&scroll, as there's no reason for the hits to differ from the total_hits. |
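A minimal sketch of that kind of check, reusing the names from the code earlier in the thread (ES.client, scrollTime, Logger); expectedTotal is assumed to be the getTotalHits value from the initial scan response:

    // Count every hit seen while draining the scroll and compare the count
    // against the total_hits reported by the initial query.
    def scrollAndVerify(initialScrollId: String, expectedTotal: Long): Unit = {
      var seen = 0L
      var batch = ES.client.prepareSearchScroll(initialScrollId)
        .setScroll(scrollTime)
        .execute().actionGet()
      while (batch.getHits.getHits.nonEmpty) {
        seen += batch.getHits.getHits.length
        batch = ES.client.prepareSearchScroll(batch.getScrollId)
          .setScroll(scrollTime)
          .execute().actionGet()
      }
      if (seen != expectedTotal)
        Logger.error(s"Scroll returned $seen documents but total_hits was $expectedTotal")
    }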
The original issue looks like the cluster was overloaded. I don't think there is anything to do here. Closing |
Hey Clinton,
Yeah, our clusters are generally pretty heavily loaded, but I don't know
This error actually happens to us all the time, and we simply detect it and
Thanks,
|
@vvcephei see these in the logs from #10417 (comment): the EsRejectedExecutionException[rejected execution (queue capacity 1000) ...] shard failures.
You can keep an eye on your search thread pool queue size. If you're regularly filling it up then either you need (a) more efficient queries or (b) more hardware. |
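A sketch of one way to watch that from the same transport client used above, via the nodes stats API with thread pool statistics (getter names follow the 1.x Java API as far as I recall and are worth double-checking):

    import scala.collection.JavaConverters._

    // Poll queue/rejected counts for the "search" thread pool on every node.
    val stats = ES.client.admin().cluster().prepareNodesStats()
      .setThreadPool(true)
      .execute().actionGet()

    stats.getNodes.foreach { node =>
      node.getThreadPool.asScala
        .filter(_.getName == "search")
        .foreach { tp =>
          Logger.info(s"${node.getNode.getName}: search active=${tp.getActive} queue=${tp.getQueue} rejected=${tp.getRejected}")
        }
    }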
We're on ES 1.4.4, running a 16-node cluster plus 3 dedicated client nodes (c4.xlarge), which run all our queries.
We're seeing intermittent failures on scroll queries on larger data sets of ~1 million documents. This has happened while we were on 1.3.2 as well.
Today, it failed after fetching 5000 documents (5 requests, each fetch is 1000 documents), then started returning 0. Retrying the scroll query again worked until completion. Each fetch took 1-2 seconds before failure and then started returning almost immediately with 0 results.
This index in particular has 5 shards and 26gb of data total.
Scroll size per shard: 200
Scroll time: 4 minutes
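For context, with SearchType.SCAN the size is per shard, so these settings amount to an initial request roughly like the sketch below (index is a placeholder), and 5 shards at 200 each explains the ~1000 documents per fetch mentioned above:

    // Sketch of the initial scan request implied by the settings above:
    // size 200 per shard (5 shards -> up to 1000 docs per round trip),
    // scroll keep-alive of 4 minutes, renewed on each subsequent request.
    val resp = ES.client.prepareSearch(index)
      .setSearchType(SearchType.SCAN)
      .setSize(200)
      .setScroll("4m")
      .setQuery(QueryBuilders.matchAllQuery())
      .execute().actionGet()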
Let me know if there's other information I can provide to help diagnose this.