
Stuck on shard recovery, NPE in _recovery API #6430

Closed
magnhaug opened this issue Jun 6, 2014 · 5 comments

magnhaug commented Jun 6, 2014

I'm running a cluster of 5 nodes. After a normal reboot, recovery of certain shards started as normal. However, one of the replica shards being recovered stalled when it had reached about 233 MB of the 1.4 GB total shard size on the primary.

Here's what I've found out so far:

  • Shards might stop mid-recovery at any size: 0 bytes, a few KB, or several GB.
  • Shards stopped mid-recovery might even report a higher size_in_bytes than the primary shard.
  • Shards of any size might experience this (smallest index: 2 very small documents; biggest: millions of large documents).
  • The translog was empty for all shards on the sick node.
  • Accessing the _recovery API on any node in the cluster produces {"error":"NullPointerException[null]","status":500} (verified multiple times; see the reproduction sketch after this list). No NPE in a healthy cluster.
  • Even with the log level at TRACE there is nothing in the logs, not even after the NPE.
  • Waiting for hours does not seem to fix anything.
  • Rebooting the node with the stuck shard fixes it (another node picks up the recovery and most often succeeds).
  • Observed on ElasticSearch 1.1.1 and 1.1.2 on RHEL 6.5 with Sun Java 1.7.0_55. Not observed in the same cluster configuration when running 0.90.x or 0.20.x.

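A minimal reproduction of the NPE, assuming the same host/port as in the output below (any node in the cluster returns the same error):

curl -s "tsl0mag19:2500/_recovery?pretty"
# => {"error":"NullPointerException[null]","status":500}
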
Output from _cat/recovery/ on my index:

curl "tsl0mag19:2500/_cat/recovery/meta"
meta 0 2332847 replica init n/a                tsl0mag15.skead.no n/a n/a 0 0.0%   0    0.0%
meta 0 17687   replica done tsl0mag16.skead.no tsl0mag18.skead.no n/a n/a 1 100.0% 7286 100.0%
meta 0 3634    replica done tsl0mag19.skead.no tsl0mag16.skead.no n/a n/a 9 100.0% 7286 100.0%

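(Side note: the _cat output above has no column headers; the _cat APIs print them when the v flag is added, e.g. curl "tsl0mag19:2500/_cat/recovery/meta?v".)
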
The segments have not yet been created/registered:

curl "tsl0mag19:2500/_cat/segments/meta"
meta 0 r 151.187.99.218 _7lhu 354450 1 3 3.6kb 443 true  true  4.7 true
meta 0 r 151.187.99.218 _ajjy 491902 4 3 3.4kb   0 true  false 4.7 true
meta 0 r 151.187.99.218 _ajp5 492089 1 2 3.3kb 442 false true  4.7 true
meta 0 p 151.187.99.216 _7lhu 354450 1 3 3.6kb 443 true  true  4.7 true
meta 0 p 151.187.99.216 _ajk4 491908 1 0 2.9kb   0 true  false 4.7 true
meta 0 p 151.187.99.216 _ajpm 492106 1 0 2.9kb 442 false true  4.7 true

Here is the stack trace from the ElasticSearch process that should be recovering this shard:
https://gist.github.com/magnhaug/11fa5750fe76a6adca4b
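
(A dump like the one linked can be captured with jstack; <pid> below is a placeholder for the actual Elasticsearch process id:)

jstack -l <pid> > es-threads.txt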

Here are the contents from the indices folder:
https://dl.dropboxusercontent.com/u/233260280/unhealthy_shard.tar.gz (problematic shard)
https://dl.dropboxusercontent.com/u/233260280/healthy_shard.tar.gz (primary shard, for reference)

This is how a sample stuck shard looks in HEAD:

{
  routing: {
    state: INITIALIZING
    primary: false
    node: 85EJGc_1RjyZMBuc2QluCw
    relocating_node: null
    shard: 0
    index: meta
  }
  state: RECOVERING
  index: {
    size_in_bytes: 3566
  }
}
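
(HEAD presumably reads these stats from the indices status API, so the same per-shard detail should also be retrievable without the plugin, e.g.:)

curl "tsl0mag19:2500/meta/_status?pretty"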

This was recovering from the following primary shard:

{
  routing: {
    state: STARTED
    primary: true
    node: ZrYUKHXZSFaftExxueYo3w
    relocating_node: null
    shard: 0
    index: meta
  }
  state: STARTED
  index: {
    size_in_bytes: 10489
  }
  translog: {
    id: 1401785020764
    operations: 805
  }
  docs: {
    num_docs: 2
    max_doc: 5
    deleted_docs: 3
  }
  merges: {
    current: 0
    current_docs: 0
    current_size_in_bytes: 0
    total: 0
    total_time_in_millis: 0
    total_docs: 0
    total_size_in_bytes: 0
  }
  refresh: {
    total: 263
    total_time_in_millis: 2145
  }
  flush: {
    total: 0
    total_time_in_millis: 0
  }
}
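
(Note on the docs stats above: in Lucene, max_doc = num_docs + deleted_docs, which checks out here: 5 = 2 + 3, consistent with the two-document index mentioned in the list above.)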
@martijnvg
Member

I don't know why the shard recovery got stuck here. The NPE you saw in the recovery API is likely to be fixed by #6190.

@magnhaug
Author

magnhaug commented Sep 1, 2014

@s1monw I'm guessing this is fixed by #6808?
The only difference I see between this bug report and #6808 is that I've seen this happen mid-recovery, once the replica shard has already reached a certain size.

@s1monw
Contributor

s1monw commented Sep 1, 2014

@magnhaug I could totally buy that this is fixed by #6808; it's at least the same symptoms. I'd vote for closing this!

@magnhaug
Author

magnhaug commented Oct 6, 2014

I'll close this when we've run 1.3.x for a while and not seen any problems with recovery.
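
(For reference, the version each node is running can be verified with curl "tsl0mag19:2500/_cat/nodes?v&h=name,version".)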

@clintongormley

Closing - please reopen if you see the problem recur.
