
[es 1.4.4] Scroll query returns 0 results before fully done #10417

Closed · hjz opened this issue Apr 3, 2015 · 16 comments

@hjz commented Apr 3, 2015

We're on ES 1.4.4, running a 16-node cluster + 3 dedicated client nodes (c4.xlarge) which run all our queries.

We're seeing intermittent failures on scroll queries on larger data sets of ~1 million documents. This has happened while we were on 1.3.2 as well.

Today, it failed after fetching 5000 documents (5 requests, each fetch is 1000 documents), then started returning 0 results. Retrying the scroll query worked until completion. Each fetch took 1-2 seconds before the failure, after which requests started returning almost immediately with 0 results.

This index in particular has 5 shards and 26 GB of data total.

Scroll size per shard: 200
Scroll time: 4 minutes

Let me know if there's other information I can provide to help diagnose this.

@jpountz (Contributor) commented Apr 3, 2015

When the query returns 0 results prematurely, do you have shard failures in the response?
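
In sketch form, such a check looks like this with the Java client (matching the style of the code posted later in the thread; `resp` stands in for any SearchResponse, and `log` is a hypothetical logger):

    // Sketch: a scroll response can "succeed" overall while individual shards
    // failed, so check the failure count before trusting the (possibly partial) hits.
    if (resp.getFailedShards > 0) {
      resp.getShardFailures.foreach { f =>
        // index()/shardId() may be null/-1 when the failure happened on a remote node
        log.warn(s"shard failure: index=${f.index()} shard=${f.shardId()} reason=${f.reason()}")
      }
    }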

@hjz (Author) commented Apr 3, 2015

We've seen shard failures before where 1 or 2 shards would fail, and instead of 1000 results it would return 900 or 800.

We didn't log the shard failures this time. Supposing all 5 shards failed, any idea why that would happen? I'll add logging for that and circle back.

@clintongormley
@hjz can you provide the actual code you're using for scroll requests, plus any exceptions?

@hjz (Author) commented Apr 6, 2015

Here's the query to start the scroll:

  def startScrollQueryGetTotalHitsAndScrollId(index: String, filter: BaseFilterBuilder): (Long, String) = {
    val r = ES.client.prepareSearch(index)
      .setTypes(User.eventType)
      .setSearchType(SearchType.SCAN)
      .setFrom(0)
      .setSize(config.scrollSizePerShard)
      .setScroll(scrollTime)

    r.setQuery(QueryBuilders.filteredQuery(QueryBuilders.matchAllQuery(), filter))

    val resp = r.execute().actionGet(config.scrollTimeoutSecs, TimeUnit.SECONDS)

    (resp.getHits.getTotalHits, resp.getScrollId)
  }

Subsequent lookups are then done via:

      val scrollResp = ES.client
        .prepareSearchScroll(scrollId)
        .setScroll(scrollTime)
        .execute()
        .actionGet(config.scrollTimeoutSecs, TimeUnit.SECONDS)

      if (scrollResp.getFailedShards > 0) {
        val shards = scrollResp.getShardFailures
        shards.foreach { s =>
          val msg = s"Shard failed scroll query: ${s.index()}, shardId: ${s.shardId()} reason: ${s.reason()}}"
          hipChatService.errorMessage(msg, createAlert = false)
        }
      }

      (scrollResp.getHits.getHits.map { sh =>
        ApiUser.fromSource(sh.getId, sh.source() )
      }.toSeq, scrollResp.getScrollId)

I'll follow up with exceptions.

@clintongormley
@hjz I don't see where you are getting the scroll_id from the previous scroll request and using it for the next scroll request.

Normally there would be some while loop which does a scroll request, checks if there are any hits, then issues a new scroll request, using the scroll_id returned by the previous scroll request.
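
For reference, a minimal sketch of that loop in the same Java-client style as the snippets above (`ES.client` and `scrollTime` are taken from the earlier code; `startScan()` and `processHits()` are hypothetical helpers):

    // Sketch of the canonical scan/scroll loop: every response carries the
    // scroll_id to pass to the *next* request, and the loop ends when a
    // scroll request returns no hits.
    var resp = startScan()                  // initial SearchType.SCAN request (hypothetical)
    var scrollId = resp.getScrollId         // SCAN returns no hits on the initial response
    var hits = resp.getHits.getHits
    do {
      resp = ES.client
        .prepareSearchScroll(scrollId)
        .setScroll(scrollTime)              // the keep-alive is renewed on every request
        .execute()
        .actionGet()
      scrollId = resp.getScrollId           // always take the id from the latest response
      hits = resp.getHits.getHits
      if (hits.nonEmpty) processHits(hits)  // hypothetical per-batch handler
    } while (hits.nonEmpty)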

@hjz (Author) commented Apr 6, 2015

Yes, that's what we do. I omitted some code there. Essentially:

import scala.util.{Failure, Success, Try}

def getNextScroll(scrollId: String): Try[(Seq[ApiUser], String)] = Try {
  val scrollResp = ES.client
    .prepareSearchScroll(scrollId)
    .setScroll(scrollTime)
    .execute()
    .actionGet(config.scrollTimeoutSecs, TimeUnit.SECONDS)

  if (scrollResp.getFailedShards > 0) {
    val shards = scrollResp.getShardFailures
    shards.foreach { s =>
      val msg = s"Shard failed scroll query: ${s.index()}, shardId: ${s.shardId()} reason: ${s.reason()}"
      hipChatService.errorMessage(msg, createAlert = false)
    }
  }

  (scrollResp.getHits.getHits.map { sh =>
    ApiUser.fromSource(sh.getId, sh.source())
  }.toSeq, scrollResp.getScrollId)
}

val (totalHits, startScrollId) = startScrollQueryGetTotalHitsAndScrollId(index, filter)
var currentScrollId = startScrollId
var batch = 1
var remaining = totalHits
var retries = 0
var batchHits = 0L

do {
  val startTime = DateTime.now

  getNextScroll(currentScrollId) match {
    case Success((users, nextScrollId)) =>
      batchHits = users.size
      currentScrollId = nextScrollId
      batch += 1

    case Failure(e) =>
      batchHits = 0
      hipChatService.sendMessage(s"$logHdr scroll query error. ${e.getMessage}. retry $retries", color = HipChatNotificationColor.Red)
  }

  // If hits was 0 but there are users remaining, something fubared, so sleep for BatchRetrySeconds and retry
  if (batchHits == 0 && remaining > 0) {
    retries += 1
    Logger.error(s"$logHdr Elasticsearch scroll request failed, retries: $retries. Sleeping for ${config.scrollRetrySleepDuration.seconds} seconds")
    Thread.sleep(config.scrollRetrySleepDuration.millis)
  }

  remaining -= batchHits
} while (remaining > 0 && retries < config.scrollMaxRetries)

@pickypg (Member) commented Apr 9, 2015

@hjz Have you gotten any logs for this issue yet?

@hjz (Author) commented Apr 20, 2015

// Starts just after the initial scroll query

2015-04-20 17:34:11,281 [ERROR] application - Shard failed scroll query: null, shardId: -1 reason: RemoteTransportException[[datanode4][inet[/x.x.x.248:9300]][indices:data/read/search[phase/scan/scroll]]]; nested: EsRejectedExecutionException[rejected execution (queue capacity 1000) on org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler@1cf09828]; }

2015-04-20 17:34:11,282 [ERROR] application - Shard failed scroll query: null, shardId: -1 reason: RemoteTransportException[[datanode4][inet[/x.x.x.248:9300]][indices:data/read/search[phase/scan/scroll]]]; nested: EsRejectedExecutionException[rejected execution (queue capacity 1000) on org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler@57767fad]; }

2015-04-20 17:34:11,283 [ERROR] application - Shard failed scroll query: null, shardId: -1 reason: RemoteTransportException[[datanode4][inet[/x.x.x.248:9300]][indices:data/read/search[phase/scan/scroll]]]; nested: EsRejectedExecutionException[rejected execution (queue capacity 1000) on org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler@779578ef]; }

2015-04-20 17:34:11,305 [ERROR] application - [xxx Batch 3 - 14531 remaining] [200 DONE] Fetching: 56.319 secs. Queuing: 0.015 secs. 3.550253843149785 msgs/sec

2015-04-20 17:34:51,414 [ERROR] application - [xxx Batch 4 - 14331 remaining] [200 DONE] Fetching: 40.094 secs. Queuing: 0.013 secs. 4.986660682673848 msgs/sec

2015-04-20 17:34:59,458 [ERROR] application - [xxx Batch 5 - 14131 remaining] [200 DONE] Fetching: 8.029 secs. Queuing: 0.013 secs. 24.869435463814973 msgs/sec

2015-04-20 17:34:59,574 [ERROR] application - [xxx Batch 6 - 13931 remaining] [200 DONE] Fetching: 0.099 secs. Queuing: 0.015 secs. 1754.3859649122805 msgs/sec

2015-04-20 17:34:59,646 [ERROR] application - [xxx Batch 7 - 13731 remaining] [200 DONE] Fetching: 0.057 secs. Queuing: 0.012 secs. 2898.550724637681 msgs/sec

2015-04-20 17:34:59,865 [ERROR] application - [xxx Batch 9 - 13331 remaining] [200 DONE] Fetching: 0.098 secs. Queuing: 0.027 secs. 1600.0 msgs/sec

2015-04-20 17:34:59,994 [ERROR] application - [xxx Batch 10 - 13131 remaining] [200 DONE] Fetching: 0.083 secs. Queuing: 0.045 secs. 1562.5 msgs/sec

2015-04-20 17:35:00,009 [ERROR] application - [xxx Batch 11 - 12931 remaining] [9 DONE] Fetching: 0.011 secs. Queuing: 0.002 secs. 692.3076923076924 msgs/sec

2015-04-20 17:35:00,012 [ERROR] application - [xxx Batch 12 - 12922 remaining] [0 DONE] Fetching: 0.0 secs. Queuing: 0.0 secs. NaN msgs/sec

2015-04-20 17:35:00,015 [ERROR] application - [xxx Batch 13 - 12922 remaining] Elasticsearch scroll request failed, retries: 1. Sleeping for 5 seconds

2015-04-20 17:35:00,071 [WARN ] n.k.r.connection.AbstractConnection - Lockdown ended.

2015-04-20 17:35:05,017 [ERROR] application - [xxx Batch 13 - 12922 remaining] [0 DONE] Fetching: 0.001 secs. Queuing: 0.0 secs. 0.0 msgs/sec

2015-04-20 17:35:05,018 [ERROR] application - [xxx Batch 14 - 12922 remaining] Elasticsearch scroll request failed, retries: 2. Sleeping for 5 seconds

2015-04-20 17:35:x.x.x [ERROR] application - [xxx Batch 14 - 12922 remaining] [0 DONE] Fetching: 0.0 secs. Queuing: 0.001 secs. 0.0 msgs/sec

2015-04-20 17:35:10,021 [ERROR] application - [xxx Batch 15 - 12922 remaining] Elasticsearch scroll request failed, retries: 3. Sleeping for 5 seconds

2015-04-20 17:35:15,023 [ERROR] application - [xxx Batch 15 - 12922 remaining] [0 DONE] Fetching: 0.001 secs. Queuing: 0.0 secs. 0.0 msgs/sec

2015-04-20 17:35:15,025 [ERROR] application - [xxx Batch 16 - 12922 remaining] Elasticsearch scroll request failed, retries: 4. Sleeping for 5 seconds

2015-04-20 17:35:20,027 [ERROR] application - [xxx Batch 16 - 12922 remaining] [0 DONE] Fetching: 0.001 secs. Queuing: 0.0 secs. 0.0 msgs/sec

2015-04-20 17:35:20,030 [ERROR] application - [xxx Batch 17 - 12922 remaining] Elasticsearch scroll request failed, retries: 5. Sleeping for 5 seconds

2015-04-20 17:35:25,032 [ERROR] application - [xxx Batch 17 - 12922 remaining] [0 DONE] Fetching: 0.001 secs. Queuing: 0.001 secs. 0.0 msgs/sec

2015-04-20 17:35:25,034 [ERROR] application - [xxx Batch 18 - 12922 remaining] Elasticsearch scroll request failed, retries: 6. Sleeping for 5 seconds

2015-04-20 17:35:30,036 [ERROR] application - [xxx Batch 18 - 12922 remaining] [0 DONE] Fetching: 0.001 secs. Queuing: 0.0 secs. 0.0 msgs/sec

2015-04-20 17:35:30,038 [ERROR] application - [xxx Batch 19 - 12922 remaining] Elasticsearch scroll request failed, retries: 7. Sleeping for 5 seconds

2015-04-20 17:35:30,350 [WARN ] n.k.r.connection.AbstractConnection - Lockdown ended.

2015-04-20 17:35:35,040 [ERROR] application - [xxx Batch 19 - 12922 remaining] [0 DONE] Fetching: 0.0 secs. Queuing: 0.001 secs. 0.0 msgs/sec

2015-04-20 17:35:35,042 [ERROR] application - [xxx Batch 20 - 12922 remaining] Elasticsearch scroll request failed, retries: 8. Sleeping for 5 seconds

2015-04-20 17:35:40,044 [ERROR] application - [xxx Batch 20 - 12922 remaining] [0 DONE] Fetching: 0.001 secs. Queuing: 0.001 secs. 0.0 msgs/sec

2015-04-20 17:35:40,046 [ERROR] application - [xxx Batch 21 - 12922 remaining] Elasticsearch scroll request failed, retries: 9. Sleeping for 5 seconds

2015-04-20 17:35:45,048 [ERROR] application - [xxx Batch 21 - 12922 remaining] [0 DONE] Fetching: 0.001 secs. Queuing: 0.0 secs. 0.0 msgs/sec

2015-04-20 17:35:45,050 [ERROR] application - [xxx Batch 22 - 12922 remaining] Elasticsearch scroll request failed, retries: 10. Sleeping for 5 seconds

2015-04-20 17:35:45,415 [WARN ] n.k.r.connection.AbstractConnection - Lockdown ended.

2015-04-20 17:35:50,051 [ERROR] application - [xxx Batch 22 - 12922 remaining] Some users failed to fetch: remaining, 12922 retries: 10

// We stop trying here

@pickypg (Member) commented Apr 21, 2015

@hjz For your value of index, are you using "_all" (or equivalent) by chance? How many shards are actually associated with that request?

@hjz (Author) commented Apr 22, 2015

We have _all turned off. This is hitting 15 shards, each with 2 replicas. No doc value formats.

@l15k4 commented Jun 19, 2015

I seem to have this issue too... I'm scanning a little more than 100 indices with keepAlive=10s:

            "number_of_replicas": "1",
            "number_of_shards": "5",

on :

   "number_of_nodes": 4,
   "number_of_data_nodes": 4,
   "active_primary_shards": 530,
   "active_shards": 1060,

I'm using pattern mi_* for scanning those 100+ indices.

I haven't seen it happen with a sorted QueryThenFetch scroll, though I couldn't scroll nearly as much data that way as I can with Scan. When scanning, it returns 0 results after ~100M documents out of 350M. I don't check SearchResponse#getShardFailures, though. What should I do when there is a failure? Repeat the request with the same scroll_id?

@l15k4 commented Jun 19, 2015

I figured out why this happened to me: scanning slows down so much that it overruns the 10s keepAlive period, which leads to SearchResponse#shardFailures. These logs document it:

100320 elements processed in 17 seconds
100320 elements processed in 13 seconds
100320 elements processed in 20 seconds
100320 elements processed in 20 seconds
100320 elements processed in 13 seconds
100320 elements processed in 32 seconds
100320 elements processed in 17 seconds
100320 elements processed in 53 seconds
100320 elements processed in 37 seconds
100320 elements processed in 38 seconds
100320 elements processed in 64 seconds
 - Shard failed scroll query: null, shardId: -1 reason: SearchContextMissingException[No search context found for id [208081]]
 - Shard failed scroll query: null, shardId: -1 reason: SearchContextMissingException[No search context found for id [208084]]
 - Shard failed scroll query: null, shardId: -1 reason: RemoteTransportException[[High Evolutionary][inet[/172.31.47.53:9300]][indices:data/read/search[phase/scan/scroll]]]; nested: SearchContextMissingException[No search context found for id [221366]];
 - Shard failed scroll query: null, shardId: -1 reason: SearchContextMissingException[No search context found for id [208092]]
 - Shard failed scroll query: null, shardId: -1 reason: SearchContextMissingException[No search context found for id [208086]]
 - Shard failed scroll query: null, shardId: -1 reason: SearchContextMissingException[No search context found for id [208093]]
 - Shard failed scroll query: null, shardId: -1 reason: SearchContextMissingException[No search context found for id [208089]]
 - Shard failed scroll query: null, shardId: -1 reason: RemoteTransportException[[High Evolutionary][inet[/172.31.47.53:9300]][indices:data/read/search[phase/scan/scroll]]]; nested: SearchContextMissingException[No search context found for id [221369]];
 - Shard failed scroll query: null, shardId: -1 reason: SearchContextMissingException[No search context found for id [208087]]
 - Shard failed scroll query: null, shardId: -1 reason: RemoteTransportException[[High Evolutionary][inet[/172.31.47.53:9300]][indices:data/read/search[phase/scan/scroll]]]; nested: SearchContextMissingException[No search context found for id [221371]];
 - Shard failed scroll query: null, shardId: -1 reason: RemoteTransportException[[High Evolutionary][inet[/172.31.47.53:9300]][indices:data/read/search[phase/scan/scroll]]]; nested: SearchContextMissingException[No search context found for id [221372]];
 - ScanSource just completed...
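
If the keep-alive is the culprit, the fix is simply a longer one: the keep-alive is renewed by every scroll request, so it only has to outlive the slowest single batch (64 seconds in the log above), not the whole scan. A minimal sketch with the Java client (`ES.client` and `scrollId` as in the code earlier in the thread):

    import org.elasticsearch.common.unit.TimeValue

    // Sketch: renew the scroll with a keep-alive that comfortably exceeds the
    // slowest observed batch, instead of the 10s that let contexts expire.
    val resp = ES.client
      .prepareSearchScroll(scrollId)
      .setScroll(TimeValue.timeValueMinutes(5))
      .execute()
      .actionGet()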

@vvcephei
I have had some trouble with this as well. Sadly, it's been long enough that I won't be able to remember the details for debugging. (I just happened to see this issue while looking for something else.)

What I can say is that I've been able to (as far as I can tell) eliminate the possibility of missing data during a scroll by counting the number of documents I've seen and checking at the end of the scroll that it exactly matches the total_hits reported by the initial query.

This seems like a good sanity check for anyone doing a scan&scroll, as there's no reason for the hits to differ from the total_hits.
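
A sketch of that sanity check, assuming a loop in the style of the code above that accumulates a `seen` counter (`initialResp` here is a hypothetical handle on the first response):

    // Sketch: compare documents actually received against the total promised
    // by the initial query; any shortfall means the scroll silently lost data.
    val totalHits = initialResp.getHits.getTotalHits  // from the first (SCAN) response
    var seen = 0L
    // inside the scroll loop: seen += resp.getHits.getHits.length
    if (seen != totalHits) {
      // e.g. discard the partial output and restart the scroll from scratch
      throw new IllegalStateException(s"scroll returned $seen of $totalHits documents")
    }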

@clintongormley
The original issue looks like the cluster was overloaded. I don't think there is anything to do here. Closing.

@vvcephei
Hey Clinton,

Yeah, our clusters are generally pretty heavily loaded, but I don't know how to quantify it being overloaded.

This error actually happens to us all the time, and we simply detect it and start the scroll again.

Thanks,
John

On Friday, January 15, 2016, Clinton Gormley notifications@github.com wrote:

Closed #10417.

@clintongormley
@vvcephei see these lines in the logs from the earlier comment on #10417:

[indices:data/read/search[phase/scan/scroll]]]; nested: EsRejectedExecutionException[rejected execution (queue capacity 1000)

You can keep an eye on your search thread pool queue size. If you're regularly filling it up, then you need either (a) more efficient queries or (b) more hardware.
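
One way to poll that from the same Java client is the nodes-stats API, which exposes per-node thread pool counters. A sketch, assuming the 1.x client API and the `ES.client` handle from the code above:

    import scala.collection.JavaConverters._

    // Sketch: report queue depth and cumulative rejections for the search pool
    // on every node; a queue that keeps filling up, or rising rejection counts,
    // means the cluster can't keep up with the query load.
    val stats = ES.client.admin().cluster()
      .prepareNodesStats()
      .setThreadPool(true)
      .execute()
      .actionGet()
    stats.getNodes.foreach { node =>
      node.getThreadPool.asScala
        .filter(_.getName == "search")
        .foreach { s =>
          println(s"${node.getNode.getName}: queue=${s.getQueue} rejected=${s.getRejected}")
        }
    }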
