
[es 1.4.4] Scroll query returns 0 results before fully done #10417

Closed · hjz opened this issue Apr 3, 2015 · 16 comments

@hjz commented Apr 3, 2015

We're on ES 1.4.4, running a 16-node cluster + 3 dedicated client nodes (c4.xlarge) which run all our queries.

We're seeing intermittent failures on scroll queries on larger data sets of ~1 million documents. This has happened while we were on 1.3.2 as well.

Today, it failed after fetching 5000 documents (5 requests, each fetch is 1000 documents), then started returning 0 results. Retrying the scroll query worked until completion. Each fetch took 1-2 seconds before the failure, after which requests started returning almost immediately with 0 results.

This index in particular has 5 shards and 26 GB of data total.

Scroll size per shard: 200
Scroll time: 4 minutes

Let me know if there's other information I can provide to help diagnose this.

@jpountz (Contributor) commented Apr 3, 2015

When the query returns 0 results prematurely, do you have shard failures in the response?
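
In sketch form, such a check looks like this with the Java client (matching the style of the code posted later in the thread; `resp` stands in for any SearchResponse, and `log` is a hypothetical logger):

    // Sketch: a scroll response can "succeed" overall while individual shards
    // failed, so check the failure count before trusting the (possibly partial) hits.
    if (resp.getFailedShards > 0) {
      resp.getShardFailures.foreach { f =>
        // index()/shardId() may be null/-1 when the failure happened on a remote node
        log.warn(s"shard failure: index=${f.index()} shard=${f.shardId()} reason=${f.reason()}")
      }
    }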

@hjz (Author) commented Apr 3, 2015

We've seen shard failures before where 1 or 2 shards would fail, and instead of 1000 results it would return 900 or 800.

We didn't log the shard failures this time. Supposing all 5 shards failed, any idea why that would happen? I'll add logging for that and circle back.

@clintongormley
@hjz can you provide the actual code you're using for scroll requests, plus any exceptions?

@hjz (Author) commented Apr 6, 2015

Here's the query to start the scroll:

  def startScrollQueryGetTotalHitsAndScrollId(index: String, filter: BaseFilterBuilder): (Long, String) = {
    val r = ES.client.prepareSearch(index)
      .setTypes(User.eventType)
      .setSearchType(SearchType.SCAN)
      .setFrom(0)
      .setSize(config.scrollSizePerShard)
      .setScroll(scrollTime)

    r.setQuery(QueryBuilders.filteredQuery(QueryBuilders.matchAllQuery(), filter))

    val resp = r.execute().actionGet(config.scrollTimeoutSecs, TimeUnit.SECONDS)

    (resp.getHits.getTotalHits, resp.getScrollId)
  }

Subsequent lookups are then done via:

      val scrollResp = ES.client
        .prepareSearchScroll(scrollId)
        .setScroll(scrollTime)
        .execute()
        .actionGet(config.scrollTimeoutSecs, TimeUnit.SECONDS)

      if (scrollResp.getFailedShards > 0) {
        val shards = scrollResp.getShardFailures
        shards.foreach { s =>
          val msg = s"Shard failed scroll query: ${s.index()}, shardId: ${s.shardId()} reason: ${s.reason()}}"
          hipChatService.errorMessage(msg, createAlert = false)
        }
      }

      (scrollResp.getHits.getHits.map { sh =>
        ApiUser.fromSource(sh.getId, sh.source() )
      }.toSeq, scrollResp.getScrollId)

I'll follow up with exceptions.

@clintongormley
@hjz I don't see where you are getting the scroll_id from the previous scroll request and using it for the next scroll request.

Normally there would be some while loop which does a scroll request, checks if there are any hits, then issues a new scroll request, using the scroll_id returned by the previous scroll request.
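
For reference, a minimal sketch of that loop in the same Java-client style as the snippets above (`ES.client` and `scrollTime` are taken from the earlier code; `startScan()` and `processHits()` are hypothetical helpers):

    // Sketch of the canonical scan/scroll loop: every response carries the
    // scroll_id to pass to the *next* request, and the loop ends when a
    // scroll request returns no hits.
    var resp = startScan()                  // initial SearchType.SCAN request (hypothetical)
    var scrollId = resp.getScrollId         // SCAN returns no hits on the initial response
    var hits = resp.getHits.getHits
    do {
      resp = ES.client
        .prepareSearchScroll(scrollId)
        .setScroll(scrollTime)              // the keep-alive is renewed on every request
        .execute()
        .actionGet()
      scrollId = resp.getScrollId           // always take the id from the latest response
      hits = resp.getHits.getHits
      if (hits.nonEmpty) processHits(hits)  // hypothetical per-batch handler
    } while (hits.nonEmpty)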

@hjz (Author) commented Apr 6, 2015

Yes, that's what we do. I omitted some code there. Essentially:

import scala.util.{Failure, Success, Try}

def getNextScroll(scrollId: String): Try[(Seq[ApiUser], String)] = Try {
  val scrollResp = ES.client
    .prepareSearchScroll(scrollId)
    .setScroll(scrollTime)
    .execute()
    .actionGet(config.scrollTimeoutSecs, TimeUnit.SECONDS)

  if (scrollResp.getFailedShards > 0) {
    val shards = scrollResp.getShardFailures
    shards.foreach { s =>
      val msg = s"Shard failed scroll query: ${s.index()}, shardId: ${s.shardId()} reason: ${s.reason()}"
      hipChatService.errorMessage(msg, createAlert = false)
    }
  }

  (scrollResp.getHits.getHits.map { sh =>
    ApiUser.fromSource(sh.getId, sh.source())
  }.toSeq, scrollResp.getScrollId)
}

val (totalHits, startScrollId) = startScrollQueryGetTotalHitsAndScrollId(index, filter)
var currentScrollId = startScrollId
var batch = 1
var remaining = totalHits
var retries = 0
var batchHits = 0L

do {
  val startTime = DateTime.now

  getNextScroll(currentScrollId) match {
    case Success((users, nextScrollId)) =>
      batchHits = users.size
      currentScrollId = nextScrollId
      batch += 1

    case Failure(e) =>
      batchHits = 0
      hipChatService.sendMessage(s"$logHdr scroll query error. ${e.getMessage}. retry $retries", color = HipChatNotificationColor.Red)
  }

  // If hits was 0 but there are users remaining, something fubared, so sleep for BatchRetrySeconds and retry
  if (batchHits == 0 && remaining > 0) {
    retries += 1
    Logger.error(s"$logHdr Elasticsearch scroll request failed, retries: $retries. Sleeping for ${config.scrollRetrySleepDuration.seconds} seconds")
    Thread.sleep(config.scrollRetrySleepDuration.millis)
  }

  remaining -= batchHits
} while (remaining > 0 && retries < config.scrollMaxRetries)

@pickypg (Member) commented Apr 9, 2015

@hjz Have you gotten any logs for this issue yet?

@hjz (Author) commented Apr 20, 2015

// Starts just after the initial scroll query

2015-04-20 17:34:11,281 [ERROR] application - Shard failed scroll query: null, shardId: -1 reason: RemoteTransportException[[datanode4][inet[/x.x.x.248:9300]][indices:data/read/search[phase/scan/scroll]]]; nested: EsRejectedExecutionException[rejected execution (queue capacity 1000) on org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler@1cf09828]; }

2015-04-20 17:34:11,282 [ERROR] application - Shard failed scroll query: null, shardId: -1 reason: RemoteTransportException[[datanode4][inet[/x.x.x.248:9300]][indices:data/read/search[phase/scan/scroll]]]; nested: EsRejectedExecutionException[rejected execution (queue capacity 1000) on org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler@57767fad]; }

2015-04-20 17:34:11,283 [ERROR] application - Shard failed scroll query: null, shardId: -1 reason: RemoteTransportException[[datanode4][inet[/x.x.x.248:9300]][indices:data/read/search[phase/scan/scroll]]]; nested: EsRejectedExecutionException[rejected execution (queue capacity 1000) on org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler@779578ef]; }

2015-04-20 17:34:11,305 [ERROR] application - [xxx Batch 3 - 14531 remaining] [200 DONE] Fetching: 56.319 secs. Queuing: 0.015 secs. 3.550253843149785 msgs/sec

2015-04-20 17:34:51,414 [ERROR] application - [xxx Batch 4 - 14331 remaining] [200 DONE] Fetching: 40.094 secs. Queuing: 0.013 secs. 4.986660682673848 msgs/sec

2015-04-20 17:34:59,458 [ERROR] application - [xxx Batch 5 - 14131 remaining] [200 DONE] Fetching: 8.029 secs. Queuing: 0.013 secs. 24.869435463814973 msgs/sec

2015-04-20 17:34:59,574 [ERROR] application - [xxx Batch 6 - 13931 remaining] [200 DONE] Fetching: 0.099 secs. Queuing: 0.015 secs. 1754.3859649122805 msgs/sec

2015-04-20 17:34:59,646 [ERROR] application - [xxx Batch 7 - 13731 remaining] [200 DONE] Fetching: 0.057 secs. Queuing: 0.012 secs. 2898.550724637681 msgs/sec

2015-04-20 17:34:59,865 [ERROR] application - [xxx Batch 9 - 13331 remaining] [200 DONE] Fetching: 0.098 secs. Queuing: 0.027 secs. 1600.0 msgs/sec

2015-04-20 17:34:59,994 [ERROR] application - [xxx Batch 10 - 13131 remaining] [200 DONE] Fetching: 0.083 secs. Queuing: 0.045 secs. 1562.5 msgs/sec

2015-04-20 17:35:00,009 [ERROR] application - [xxx Batch 11 - 12931 remaining] [9 DONE] Fetching: 0.011 secs. Queuing: 0.002 secs. 692.3076923076924 msgs/sec

2015-04-20 17:35:00,012 [ERROR] application - [xxx Batch 12 - 12922 remaining] [0 DONE] Fetching: 0.0 secs. Queuing: 0.0 secs. NaN msgs/sec

2015-04-20 17:35:00,015 [ERROR] application - [xxx Batch 13 - 12922 remaining] Elasticsearch scroll request failed, retries: 1. Sleeping for 5 seconds

2015-04-20 17:35:00,071 [WARN ] n.k.r.connection.AbstractConnection - Lockdown ended.

2015-04-20 17:35:05,017 [ERROR] application - [xxx Batch 13 - 12922 remaining] [0 DONE] Fetching: 0.001 secs. Queuing: 0.0 secs. 0.0 msgs/sec

2015-04-20 17:35:05,018 [ERROR] application - [xxx Batch 14 - 12922 remaining] Elasticsearch scroll request failed, retries: 2. Sleeping for 5 seconds

2015-04-20 17:35:x.x.x [ERROR] application - [xxx Batch 14 - 12922 remaining] [0 DONE] Fetching: 0.0 secs. Queuing: 0.001 secs. 0.0 msgs/sec

2015-04-20 17:35:10,021 [ERROR] application - [xxx Batch 15 - 12922 remaining] Elasticsearch scroll request failed, retries: 3. Sleeping for 5 seconds

2015-04-20 17:35:15,023 [ERROR] application - [xxx Batch 15 - 12922 remaining] [0 DONE] Fetching: 0.001 secs. Queuing: 0.0 secs. 0.0 msgs/sec

2015-04-20 17:35:15,025 [ERROR] application - [xxx Batch 16 - 12922 remaining] Elasticsearch scroll request failed, retries: 4. Sleeping for 5 seconds

2015-04-20 17:35:20,027 [ERROR] application - [xxx Batch 16 - 12922 remaining] [0 DONE] Fetching: 0.001 secs. Queuing: 0.0 secs. 0.0 msgs/sec

2015-04-20 17:35:20,030 [ERROR] application - [xxx Batch 17 - 12922 remaining] Elasticsearch scroll request failed, retries: 5. Sleeping for 5 seconds

2015-04-20 17:35:25,032 [ERROR] application - [xxx Batch 17 - 12922 remaining] [0 DONE] Fetching: 0.001 secs. Queuing: 0.001 secs. 0.0 msgs/sec

2015-04-20 17:35:25,034 [ERROR] application - [xxx Batch 18 - 12922 remaining] Elasticsearch scroll request failed, retries: 6. Sleeping for 5 seconds

2015-04-20 17:35:30,036 [ERROR] application - [xxx Batch 18 - 12922 remaining] [0 DONE] Fetching: 0.001 secs. Queuing: 0.0 secs. 0.0 msgs/sec

2015-04-20 17:35:30,038 [ERROR] application - [xxx Batch 19 - 12922 remaining] Elasticsearch scroll request failed, retries: 7. Sleeping for 5 seconds

2015-04-20 17:35:30,350 [WARN ] n.k.r.connection.AbstractConnection - Lockdown ended.

2015-04-20 17:35:35,040 [ERROR] application - [xxx Batch 19 - 12922 remaining] [0 DONE] Fetching: 0.0 secs. Queuing: 0.001 secs. 0.0 msgs/sec

2015-04-20 17:35:35,042 [ERROR] application - [xxx Batch 20 - 12922 remaining] Elasticsearch scroll request failed, retries: 8. Sleeping for 5 seconds

2015-04-20 17:35:40,044 [ERROR] application - [xxx Batch 20 - 12922 remaining] [0 DONE] Fetching: 0.001 secs. Queuing: 0.001 secs. 0.0 msgs/sec

2015-04-20 17:35:40,046 [ERROR] application - [xxx Batch 21 - 12922 remaining] Elasticsearch scroll request failed, retries: 9. Sleeping for 5 seconds

2015-04-20 17:35:45,048 [ERROR] application - [xxx Batch 21 - 12922 remaining] [0 DONE] Fetching: 0.001 secs. Queuing: 0.0 secs. 0.0 msgs/sec

2015-04-20 17:35:45,050 [ERROR] application - [xxx Batch 22 - 12922 remaining] Elasticsearch scroll request failed, retries: 10. Sleeping for 5 seconds

2015-04-20 17:35:45,415 [WARN ] n.k.r.connection.AbstractConnection - Lockdown ended.

2015-04-20 17:35:50,051 [ERROR] application - [xxx Batch 22 - 12922 remaining] Some users failed to fetch: remaining, 12922 retries: 10

// We stop trying here

@pickypg (Member) commented Apr 21, 2015

@hjz For your value of index, are you using "_all" (or equivalent) by chance? How many shards are actually associated with that request?

@hjz (Author) commented Apr 22, 2015

We have _all turned off. This is hitting 15 shards, each with 2 replicas. No doc value formats.

@l15k4 commented Jun 19, 2015

I seem to have this issue too... I'm scanning a little more than 100 indices with keepAlive=10s:

            "number_of_replicas": "1",
            "number_of_shards": "5",

on :

   "number_of_nodes": 4,
   "number_of_data_nodes": 4,
   "active_primary_shards": 530,
   "active_shards": 1060,

I'm using pattern mi_* for scanning those 100+ indices.

I haven't seen it happen with a sorted QueryThenFetch scroll, though I couldn't scroll nearly as much data that way as I can with Scan. When scanning, it returns 0 results after ~100M documents out of 350M. I don't check SearchResponse#getShardFailures, though. What should I do when there is a failure? Repeat the request with the same scroll_id?

@l15k4 commented Jun 19, 2015

I figured out why this happened to me: scanning slows down so much that it overruns the 10s keepAlive period, which leads to SearchResponse#shardFailures. These logs document it:

100320 elements processed in 17 seconds
100320 elements processed in 13 seconds
100320 elements processed in 20 seconds
100320 elements processed in 20 seconds
100320 elements processed in 13 seconds
100320 elements processed in 32 seconds
100320 elements processed in 17 seconds
100320 elements processed in 53 seconds
100320 elements processed in 37 seconds
100320 elements processed in 38 seconds
100320 elements processed in 64 seconds
 - Shard failed scroll query: null, shardId: -1 reason: SearchContextMissingException[No search context found for id [208081]]
 - Shard failed scroll query: null, shardId: -1 reason: SearchContextMissingException[No search context found for id [208084]]
 - Shard failed scroll query: null, shardId: -1 reason: RemoteTransportException[[High Evolutionary][inet[/172.31.47.53:9300]][indices:data/read/search[phase/scan/scroll]]]; nested: SearchContextMissingException[No search context found for id [221366]];
 - Shard failed scroll query: null, shardId: -1 reason: SearchContextMissingException[No search context found for id [208092]]
 - Shard failed scroll query: null, shardId: -1 reason: SearchContextMissingException[No search context found for id [208086]]
 - Shard failed scroll query: null, shardId: -1 reason: SearchContextMissingException[No search context found for id [208093]]
 - Shard failed scroll query: null, shardId: -1 reason: SearchContextMissingException[No search context found for id [208089]]
 - Shard failed scroll query: null, shardId: -1 reason: RemoteTransportException[[High Evolutionary][inet[/172.31.47.53:9300]][indices:data/read/search[phase/scan/scroll]]]; nested: SearchContextMissingException[No search context found for id [221369]];
 - Shard failed scroll query: null, shardId: -1 reason: SearchContextMissingException[No search context found for id [208087]]
 - Shard failed scroll query: null, shardId: -1 reason: RemoteTransportException[[High Evolutionary][inet[/172.31.47.53:9300]][indices:data/read/search[phase/scan/scroll]]]; nested: SearchContextMissingException[No search context found for id [221371]];
 - Shard failed scroll query: null, shardId: -1 reason: RemoteTransportException[[High Evolutionary][inet[/172.31.47.53:9300]][indices:data/read/search[phase/scan/scroll]]]; nested: SearchContextMissingException[No search context found for id [221372]];
 - ScanSource just completed...
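
If the keep-alive is the culprit, the fix is simply a longer one: the keep-alive is renewed by every scroll request, so it only has to outlive the slowest single batch (64 seconds in the log above), not the whole scan. A minimal sketch with the Java client (`ES.client` and `scrollId` as in the code earlier in the thread):

    import org.elasticsearch.common.unit.TimeValue

    // Sketch: renew the scroll with a keep-alive that comfortably exceeds the
    // slowest observed batch, instead of the 10s that let contexts expire.
    val resp = ES.client
      .prepareSearchScroll(scrollId)
      .setScroll(TimeValue.timeValueMinutes(5))
      .execute()
      .actionGet()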

@vvcephei
I have had some trouble with this as well. Sadly, it's been long enough that I won't be able to remember the details for debugging. (I just happened to see this issue while looking for something else.)

What I can say is that I've been able to (as far as I can tell) eliminate the possibility of missing data during a scroll by counting the number of documents I've seen and checking at the end of the scroll that it exactly matches the total_hits reported by the initial query.

This seems like a good sanity check for anyone doing a scan&scroll, as there's no reason for the hits to differ from the total_hits.
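
A sketch of that sanity check, assuming a loop in the style of the code above that accumulates a `seen` counter (`initialResp` here is a hypothetical handle on the first response):

    // Sketch: compare documents actually received against the total promised
    // by the initial query; any shortfall means the scroll silently lost data.
    val totalHits = initialResp.getHits.getTotalHits  // from the first (SCAN) response
    var seen = 0L
    // inside the scroll loop: seen += resp.getHits.getHits.length
    if (seen != totalHits) {
      // e.g. discard the partial output and restart the scroll from scratch
      throw new IllegalStateException(s"scroll returned $seen of $totalHits documents")
    }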

@clintongormley
The original issue looks like the cluster was overloaded. I don't think there is anything to do here. Closing.

@vvcephei
Hey Clinton,

Yeah, our clusters are generally pretty heavily loaded, but I don't know how to quantify it being overloaded.

This error actually happens to us all the time, and we simply detect it and start the scroll again.

Thanks,
John

On Friday, January 15, 2016, Clinton Gormley notifications@github.com wrote:

Closed #10417.

@clintongormley
@vvcephei see these lines in the logs from the earlier comment on #10417:

[indices:data/read/search[phase/scan/scroll]]]; nested: EsRejectedExecutionException[rejected execution (queue capacity 1000)

You can keep an eye on your search thread pool queue size. If you're regularly filling it up, then you need either (a) more efficient queries or (b) more hardware.
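
One way to poll that from the same Java client is the nodes-stats API, which exposes per-node thread pool counters. A sketch, assuming the 1.x client API and the `ES.client` handle from the code above:

    import scala.collection.JavaConverters._

    // Sketch: report queue depth and cumulative rejections for the search pool
    // on every node; a queue that keeps filling up, or rising rejection counts,
    // means the cluster can't keep up with the query load.
    val stats = ES.client.admin().cluster()
      .prepareNodesStats()
      .setThreadPool(true)
      .execute()
      .actionGet()
    stats.getNodes.foreach { node =>
      node.getThreadPool.asScala
        .filter(_.getName == "search")
        .foreach { s =>
          println(s"${node.getNode.getName}: queue=${s.getQueue} rejected=${s.getRejected}")
        }
    }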
