Red Cluster State: failed to obtain in-memory shard lock and closeShard NPE #23939

Closed
sikelong123 opened this issue Apr 6, 2017 · 6 comments

@sikelong123

Elasticsearch version:
5.0.0

Plugins installed: [null]

JVM version:
"1.8.0_72"

OS version:
Linux version 2.6.32-431.el6.x86_64 (mockbuild@c6b8.bsys.dev.centos.org) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-4) (GCC) ) #1 SMP Fri Nov 22 03:15:09 UTC 2013

Description of the problem including expected versus actual behavior:
The master node disconnected and rejoined; afterwards we saw "failed to obtain in-memory shard lock" errors and a NullPointerException in closeShard.

Steps to reproduce:
1. Had a 2-node cluster with 114 indices, 449,322,260 docs, total index size 160 GB.
2. Things were working fine for a while, then the master node left (disconnected) and rejoined, and the exceptions below happened.

Provide logs (if relevant):
There is a NullPointerException; so far, I haven't found out why. This problem also appeared after the node disconnected, a re-election happened, and the node rejoined the cluster. My log looks like this:

[2017-04-03T10:04:48,503][WARN ][o.e.i.IndexService ] [mogu015052] [es_xp_item_mgj] [2] failed to close store on shard removal (reason: [initialization failed])
java.lang.NullPointerException
at org.elasticsearch.index.IndexService.closeShard(IndexService.java:409) [elasticsearch-5.0.0.jar:5.0.0]
at org.elasticsearch.index.IndexService.createShard(IndexService.java:361) [elasticsearch-5.0.0.jar:5.0.0]
at org.elasticsearch.indices.IndicesService.createShard(IndicesService.java:449) [elasticsearch-5.0.0.jar:5.0.0]
at org.elasticsearch.indices.IndicesService.createShard(IndicesService.java:137) [elasticsearch-5.0.0.jar:5.0.0]
at org.elasticsearch.indices.cluster.IndicesClusterStateService.createShard(IndicesClusterStateService.java:534) [elasticsearch-5.0.0.jar:5.0.0]
at org.elasticsearch.indices.cluster.IndicesClusterStateService.createOrUpdateShards(IndicesClusterStateService.java:511) [elasticsearch-5.0.0.jar:5.0.0]
at org.elasticsearch.indices.cluster.IndicesClusterStateService.clusterChanged(IndicesClusterStateService.java:200) [elasticsearch-5.0.0.jar:5.0.0]
at org.elasticsearch.cluster.service.ClusterService.runTasksForExecutor(ClusterService.java:708) [elasticsearch-5.0.0.jar:5.0.0]
at org.elasticsearch.cluster.service.ClusterService$UpdateTask.run(ClusterService.java:894) [elasticsearch-5.0.0.jar:5.0.0]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:444) [elasticsearch-5.0.0.jar:5.0.0]
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:237) [elasticsearch-5.0.0.jar:5.0.0]
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:200) [elasticsearch-5.0.0.jar:5.0.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_72]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_72]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_72]
[2017-04-03T10:04:48,504][WARN ][o.e.i.c.IndicesClusterStateService] [mogu015052] [[es_xp_item_mgj][2]] marking and sending shard failed due to [failed to create shard]
java.io.IOException: failed to obtain in-memory shard lock
at org.elasticsearch.index.IndexService.createShard(IndexService.java:355) ~[elasticsearch-5.0.0.jar:5.0.0]
at org.elasticsearch.indices.IndicesService.createShard(IndicesService.java:449) ~[elasticsearch-5.0.0.jar:5.0.0]
at org.elasticsearch.indices.IndicesService.createShard(IndicesService.java:137) ~[elasticsearch-5.0.0.jar:5.0.0]
at org.elasticsearch.indices.cluster.IndicesClusterStateService.createShard(IndicesClusterStateService.java:534) [elasticsearch-5.0.0.jar:5.0.0]
at org.elasticsearch.indices.cluster.IndicesClusterStateService.createOrUpdateShards(IndicesClusterStateService.java:511) [elasticsearch-5.0.0.jar:5.0.0]
at org.elasticsearch.indices.cluster.IndicesClusterStateService.clusterChanged(IndicesClusterStateService.java:200) [elasticsearch-5.0.0.jar:5.0.0]
at org.elasticsearch.cluster.service.ClusterService.runTasksForExecutor(ClusterService.java:708) [elasticsearch-5.0.0.jar:5.0.0]
at org.elasticsearch.cluster.service.ClusterService$UpdateTask.run(ClusterService.java:894) [elasticsearch-5.0.0.jar:5.0.0]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:444) [elasticsearch-5.0.0.jar:5.0.0]
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:237) [elasticsearch-5.0.0.jar:5.0.0]
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:200) [elasticsearch-5.0.0.jar:5.0.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_72]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_72]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_72]
Caused by: org.elasticsearch.env.ShardLockObtainFailedException: [es_xp_item_mgj][2]: obtaining shard lock timed out after 5000ms
at org.elasticsearch.env.NodeEnvironment$InternalShardLock.acquire(NodeEnvironment.java:711) ~[elasticsearch-5.0.0.jar:5.0.0]
at org.elasticsearch.env.NodeEnvironment.shardLock(NodeEnvironment.java:630) ~[elasticsearch-5.0.0.jar:5.0.0]
at org.elasticsearch.index.IndexService.createShard(IndexService.java:285) ~[elasticsearch-5.0.0.jar:5.0.0]
... 13 more

@sikelong123
Author

The first problem: the NullPointerException.
The second problem: on rejoin, java.io.IOException: failed to obtain in-memory shard lock.

@ywelsch
Contributor

ywelsch commented Apr 6, 2017

  1. That NullPointerException is harmless, and it has already been fixed by "Abort snapshots on a node that leaves the cluster" (#21084), which is included in ES v5.0.2+. Please upgrade.
  2. Is your cluster health still red? Does running curl -XPOST 'localhost:9200/_cluster/reroute?retry_failed' solve the issue? If not, can you post the output of curl -XGET 'localhost:9200/_cluster/allocation/explain'? (Both commands are shown formatted below.)
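For convenience, here is a minimal sketch of the two commands above as they might be run from a shell; the ?pretty and =true parts are additions here for readability and explicitness, not part of the original comment:

# Retry shards whose allocation previously failed, e.g. after shard lock timeouts.
curl -XPOST 'localhost:9200/_cluster/reroute?retry_failed=true&pretty'

# If the cluster is still red, ask Elasticsearch why an unassigned shard cannot be allocated.
curl -XGET 'localhost:9200/_cluster/allocation/explain?pretty'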

@sikelong123
Author

A lot of primary shards were lost before the reboot. After the restart there was no problem.

@sikelong123
Author

@ywelsch

@sikelong123
Author

Old GC logs:
First:
[2017-04-03T08:18:52,492][WARN ][o.e.m.j.JvmGcMonitorService] [mogu015052] [gc][young][4633755][90625] duration [3.2s], collections [1]/[1.6m], total [3.2s]/[1.4h], memory [57.1gb]->[32.2gb]/[59.9gb], all_pools {[young] [24.7gb]->[266.6mb]/[24.8gb]}{[survivor] [515.9mb]->[0b]/[3gb]}{[old] [31.8gb]->[31.9gb]/[32gb]}
[2017-04-03T08:18:52,492][WARN ][o.e.m.j.JvmGcMonitorService] [mogu015052] [gc][old][4633755][2718] duration [1.6m], collections [2]/[1.6m], total [1.6m]/[29.6m], memory [57.1gb]->[32.2gb]/[59.9gb], all_pools {[young] [24.7gb]->[266.6mb]/[24.8gb]}{[survivor] [515.9mb]->[0b]/[3gb]}{[old] [31.8gb]->[31.9gb]/[32gb]}
[2017-04-03T08:18:52,492][WARN ][o.e.m.j.JvmGcMonitorService] [mogu015052] [gc][4633755] overhead, spent [1.6m] collecting in the last [1.6m]
Second:
[2017-04-03T08:19:42,618][WARN ][o.e.m.j.JvmGcMonitorService] [mogu015052] [gc][4633804] overhead, spent [1.2s] collecting in the last [1.9s]
[2017-04-03T08:21:20,793][WARN ][o.e.m.j.JvmGcMonitorService] [mogu015052] [gc][old][4633811][2720] duration [1.5m], collections [2]/[1.5m], total [1.5m]/[31.2m], memory [59.8gb]->[32.6gb]/[59.9gb], all_pools {[young] [24.8gb]->[635.3mb]/[24.8gb]}{[survivor] [3gb]->[0b]/[3gb]}{[old] [31.9gb]->[31.9gb]/[32gb]}
[2017-04-03T08:21:20,793][WARN ][o.e.m.j.JvmGcMonitorService] [mogu015052] [gc][4633811] overhead, spent [1.5m] collecting in the last [1.5m]

Third:
[2017-04-03T08:23:58,283][WARN ][o.e.m.j.JvmGcMonitorService] [mogu015052] [gc][old][4633880][2722] duration [1.4m], collections [1]/[1.4m], total [1.4m]/[32.7m], memory [59.6gb]->[32.3gb]/[59.9gb], all_pools {[young] [24.8gb]->[398.6mb]/[24.8gb]}{[survivor] [2.8gb]->[0b]/[3gb]}{[old] [31.9gb]->[31.9gb]/[32gb]}
[2017-04-03T08:23:58,283][WARN ][o.e.m.j.JvmGcMonitorService] [mogu015052] [gc][4633880] overhead, spent [1.4m] collecting in the last [1.4m]
[2017-04-03T08:26:09,411][WARN ][o.e.m.j.JvmGcMonitorService] [mogu015052] [gc][old][4633925][2723] duration [1.4m], collections [1]/[1.4m], total [1.4m]/[34.1m], memory [59.8gb]->[32.3gb]/[59.9gb], all_pools {[young] [24.8gb]->[401mb]/[24.8gb]}{[survivor] [3gb]->[0b]/[3gb]}{[old] [31.9gb]->[31.9gb]/[32gb]}
[2017-04-03T08:26:09,411][WARN ][o.e.m.j.JvmGcMonitorService] [mogu015052] [gc][4633925] overhead, spent [1.4m] collecting in the last [1.4m]
[2017-04-03T08:26:23,477][INFO ][o.e.m.j.JvmGcMonitorService] [mogu015052] [gc][4633939] overhead, spent [433ms] collecting in the last [1s]
[2017-04-03T08:27:56,157][WARN ][o.e.m.j.JvmGcMonitorService] [mogu015052] [gc][old][4633941][2725] duration [1.5m], collections [2]/[1.5m], total [1.5m]/[35.6m], memory [56.1gb]->[31.9gb]/[59.9gb], all_pools {[young] [24.2gb]->[19.3mb]/[24.8gb]}{[survivor] [0b]->[0b]/[3gb]}{[old] [31.9gb]->[31.9gb]/[32gb]}
[2017-04-03T08:27:56,157][WARN ][o.e.m.j.JvmGcMonitorService] [mogu015052] [gc][4633941] overhead, spent [1.5m] collecting in the last [1.5m]
Fourth:
[2017-04-03T08:34:08,761][WARN ][o.e.m.j.JvmGcMonitorService] [mogu015052] [gc][old][4634225][2735] duration [1.4m], collections [1]/[1.4m], total [1.4m]/[37.2m], memory [56.9gb]->[32.3gb]/[59.9gb], all_pools {[young] [24.2gb]->[374.5mb]/[24.8gb]}{[survivor] [844.2mb]->[0b]/[3gb]}{[old] [31.8gb]->[31.9gb]/[32gb]}
[2017-04-03T08:34:08,761][WARN ][o.e.m.j.JvmGcMonitorService] [mogu015052] [gc][4634225] overhead, spent [1.4m] collecting in the last [1.4m]
[2017-04-03T08:36:22,252][WARN ][o.e.m.j.JvmGcMonitorService] [mogu015052] [gc][old][4634259][2736] duration [1.6m], collections [1]/[1.6m], total [1.6m]/[38.9m], memory [59.8gb]->[32.6gb]/[59.9gb], all_pools {[young] [24.8gb]->[671.6mb]/[24.8gb]}{[survivor] [3gb]->[0b]/[3gb]}{[old] [31.9gb]->[32gb]/[32gb]}
[2017-04-03T08:36:22,253][WARN ][o.e.m.j.JvmGcMonitorService] [mogu015052] [gc][4634259] overhead, spent [1.6m] collecting in the last [1.6m]

Fifth:
[2017-04-03T08:38:32,952][WARN ][o.e.m.j.JvmGcMonitorService] [mogu015052] [gc][old][4634288][2737] duration [1.6m], collections [1]/[1.7m], total [1.6m]/[40.6m], memory [59.6gb]->[32.4gb]/[59.9gb], all_pools {[young] [24.8gb]->[436mb]/[24.8gb]}{[survivor] [2.8gb]->[0b]/[3gb]}{[old] [32gb]->[31.9gb]/[32gb]}
[2017-04-03T08:38:32,952][WARN ][o.e.m.j.JvmGcMonitorService] [mogu015052] [gc][4634288] overhead, spent [1.6m] collecting in the last [1.7m]
Sixth:
[2017-04-03T08:40:46,187][WARN ][o.e.m.j.JvmGcMonitorService] [mogu015052] [gc][old][4634324][2738] duration [1.6m], collections [1]/[1.6m], total [1.6m]/[42.2m], memory [59.7gb]->[32.1gb]/[59.9gb], all_pools {[young] [24.8gb]->[203mb]/[24.8gb]}{[survivor] [2.9gb]->[0b]/[3gb]}{[old] [31.9gb]->[31.9gb]/[32gb]}
[2017-04-03T08:40:46,187][WARN ][o.e.m.j.JvmGcMonitorService] [mogu015052] [gc][4634324] overhead, spent [1.6m] collecting in the last [1.6m]

Seventh:
[2017-04-03T08:42:56,802][WARN ][o.e.m.j.JvmGcMonitorService] [mogu015052] [gc][old][4634370][2739] duration [1.4m], collections [1]/[1.4m], total [1.4m]/[43.6m], memory [59.8gb]->[32.2gb]/[59.9gb], all_pools {[young] [24.8gb]->[247.9mb]/[24.8gb]}{[survivor] [3gb]->[0b]/[3gb]}{[old] [31.9gb]->[31.9gb]/[32gb]}
[2017-04-03T08:42:56,802][WARN ][o.e.m.j.JvmGcMonitorService] [mogu015052] [gc][4634370] overhead, spent [1.4m] collecting in the last [1.4m]
[2017-04-03T08:45:07,732][WARN ][o.e.m.j.JvmGcMonitorService] [mogu015052] [gc][old][4634400][2740] duration [1.6m], collections [1]/[1.6m], total [1.6m]/[45.3m], memory [59.7gb]->[32.4gb]/[59.9gb], all_pools {[young] [24.8gb]->[500.6mb]/[24.8gb]}{[survivor] [2.9gb]->[0b]/[3gb]}{[old] [31.9gb]->[31.9gb]/[32gb]}
[2017-04-03T08:45:07,732][WARN ][o.e.m.j.JvmGcMonitorService] [mogu015052] [gc][4634400] overhead, spent [1.6m] collecting in the last [1.6m]
Eighth:
[2017-04-03T08:47:19,016][WARN ][o.e.m.j.JvmGcMonitorService] [mogu015052] [gc][old][4634428][2741] duration [1.7m], collections [1]/[1.7m], total [1.7m]/[47.1m], memory [59.8gb]->[32.5gb]/[59.9gb], all_pools {[young] [24.8gb]->[532.6mb]/[24.8gb]}{[survivor] [3gb]->[0b]/[3gb]}{[old] [31.9gb]->[31.9gb]/[32gb]}
[2017-04-03T08:47:19,028][WARN ][o.e.m.j.JvmGcMonitorService] [mogu015052] [gc][4634428] overhead, spent [1.7m] collecting in the last [1.7m]

......
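
As an aside, and not something stated in this thread: the GC lines above show a heap of roughly 60 GB in which each old-generation collection takes 1.4 to 1.7 minutes. Pauses of that length can exceed the default zen fault-detection timeouts, which is consistent with the node being dropped from the cluster. A commonly cited mitigation is to keep the heap at no more than about half of the machine's RAM and below ~32 GB so compressed object pointers stay enabled; a hypothetical config/jvm.options sketch, with illustrative values only:

# config/jvm.options -- illustrative heap settings, size to your own hardware
-Xms31g
-Xmx31g

Whether smaller heaps are feasible here depends on the workload; this is offered only as context for the GC pauses shown above.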

But why does this lead to the second problem?
Also, why is the primary shard lost?
The following issue is similar to my problem; I also changed some of the parameters as suggested there, but the second problem above still happens:
https://github.com/elastic/elasticsearch/issues/23199
@ywelsch

@ywelsch
Contributor

ywelsch commented Apr 6, 2017

But why does this lead to the second problem?

It's all explained in #23199:

The ShardLockObtainFailedException manifests as a result of cluster instabilities having nodes come and go quickly. More resources on data nodes can, but are not guaranteed to, remedy such situations.

When the cluster forms again after long GC / network disconnects, and the master allocates the shards to the problematic node, the node first has to properly shut down the existing shard resources. This can take a while, depending on the operations that were previously running. This manifests in the ShardLock failures, as the new shard instance cannot be created while the old one is not fully cleaned up. We are currently investigating how we can provide more insight into which types of operations are preventing the old shard instance from being released, which will require code changes.
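
As a general-purpose diagnostic, and not something prescribed in this thread, the hot threads API can give a rough view of what a node is still busy with while the old shard instance is being released, for example:

# Show the busiest threads on the affected node (the node name is taken from the logs above).
curl -XGET 'localhost:9200/_nodes/mogu015052/hot_threads'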

Also, why is the primary shard lost?

It's not lost. What happens is that the shard cannot be allocated as long as the locks of the previous instance are not being released.

As this is a duplicate of #23199 now, I'm closing this one.

@ywelsch ywelsch closed this as completed Apr 6, 2017