Red Cluster State: failed to obtain in-memory shard lock and closeShard NPE #23939

Closed
sikelong123 opened this issue Apr 6, 2017 · 6 comments

@sikelong123

Elasticsearch version:
5.0.0

Plugins installed: [null]

JVM version:
"1.8.0_72"

OS version:
Linux version 2.6.32-431.el6.x86_64 (mockbuild@c6b8.bsys.dev.centos.org) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-4) (GCC) ) #1 SMP Fri Nov 22 03:15:09 UTC 2013

Description of the problem including expected versus actual behavior:
The master node disconnected and rejoined; afterwards we saw "failed to obtain in-memory shard lock" errors and a NullPointerException in closeShard.

Steps to reproduce:
1. Had a 2-node cluster with 114 indices, 449,322,260 docs, total index size 160 GB.
2. Things were working fine for a while, then the master node left (disconnected) and rejoined, and the exceptions below happened.

Provide logs (if relevant):
There is a NullPointerException; so far, I haven't found out why. This problem also appeared after the node disconnected, a re-election happened, and the node rejoined the cluster. My log looks like this:

[2017-04-03T10:04:48,503][WARN ][o.e.i.IndexService ] [mogu015052] [es_xp_item_mgj] [2] failed to close store on shard removal (reason: [initialization failed])
java.lang.NullPointerException
at org.elasticsearch.index.IndexService.closeShard(IndexService.java:409) [elasticsearch-5.0.0.jar:5.0.0]
at org.elasticsearch.index.IndexService.createShard(IndexService.java:361) [elasticsearch-5.0.0.jar:5.0.0]
at org.elasticsearch.indices.IndicesService.createShard(IndicesService.java:449) [elasticsearch-5.0.0.jar:5.0.0]
at org.elasticsearch.indices.IndicesService.createShard(IndicesService.java:137) [elasticsearch-5.0.0.jar:5.0.0]
at org.elasticsearch.indices.cluster.IndicesClusterStateService.createShard(IndicesClusterStateService.java:534) [elasticsearch-5.0.0.jar:5.0.0]
at org.elasticsearch.indices.cluster.IndicesClusterStateService.createOrUpdateShards(IndicesClusterStateService.java:511) [elasticsearch-5.0.0.jar:5.0.0]
at org.elasticsearch.indices.cluster.IndicesClusterStateService.clusterChanged(IndicesClusterStateService.java:200) [elasticsearch-5.0.0.jar:5.0.0]
at org.elasticsearch.cluster.service.ClusterService.runTasksForExecutor(ClusterService.java:708) [elasticsearch-5.0.0.jar:5.0.0]
at org.elasticsearch.cluster.service.ClusterService$UpdateTask.run(ClusterService.java:894) [elasticsearch-5.0.0.jar:5.0.0]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:444) [elasticsearch-5.0.0.jar:5.0.0]
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:237) [elasticsearch-5.0.0.jar:5.0.0]
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:200) [elasticsearch-5.0.0.jar:5.0.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_72]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_72]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_72]
[2017-04-03T10:04:48,504][WARN ][o.e.i.c.IndicesClusterStateService] [mogu015052] [[es_xp_item_mgj][2]] marking and sending shard failed due to [failed to create shard]
java.io.IOException: failed to obtain in-memory shard lock
at org.elasticsearch.index.IndexService.createShard(IndexService.java:355) ~[elasticsearch-5.0.0.jar:5.0.0]
at org.elasticsearch.indices.IndicesService.createShard(IndicesService.java:449) ~[elasticsearch-5.0.0.jar:5.0.0]
at org.elasticsearch.indices.IndicesService.createShard(IndicesService.java:137) ~[elasticsearch-5.0.0.jar:5.0.0]
at org.elasticsearch.indices.cluster.IndicesClusterStateService.createShard(IndicesClusterStateService.java:534) [elasticsearch-5.0.0.jar:5.0.0]
at org.elasticsearch.indices.cluster.IndicesClusterStateService.createOrUpdateShards(IndicesClusterStateService.java:511) [elasticsearch-5.0.0.jar:5.0.0]
at org.elasticsearch.indices.cluster.IndicesClusterStateService.clusterChanged(IndicesClusterStateService.java:200) [elasticsearch-5.0.0.jar:5.0.0]
at org.elasticsearch.cluster.service.ClusterService.runTasksForExecutor(ClusterService.java:708) [elasticsearch-5.0.0.jar:5.0.0]
at org.elasticsearch.cluster.service.ClusterService$UpdateTask.run(ClusterService.java:894) [elasticsearch-5.0.0.jar:5.0.0]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:444) [elasticsearch-5.0.0.jar:5.0.0]
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:237) [elasticsearch-5.0.0.jar:5.0.0]
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:200) [elasticsearch-5.0.0.jar:5.0.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_72]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_72]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_72]
Caused by: org.elasticsearch.env.ShardLockObtainFailedException: [es_xp_item_mgj][2]: obtaining shard lock timed out after 5000ms
at org.elasticsearch.env.NodeEnvironment$InternalShardLock.acquire(NodeEnvironment.java:711) ~[elasticsearch-5.0.0.jar:5.0.0]
at org.elasticsearch.env.NodeEnvironment.shardLock(NodeEnvironment.java:630) ~[elasticsearch-5.0.0.jar:5.0.0]
at org.elasticsearch.index.IndexService.createShard(IndexService.java:285) ~[elasticsearch-5.0.0.jar:5.0.0]
... 13 more

@sikelong123
Author

The first problem: the NullPointerException.
The second problem: on rejoin, java.io.IOException: failed to obtain in-memory shard lock.

@ywelsch
Contributor

ywelsch commented Apr 6, 2017

  1. That NullPointerException is harmless, and it has already been fixed by "Abort snapshots on a node that leaves the cluster" (#21084), which is included in ES v5.0.2+. Please upgrade.
  2. Is your cluster health still red? Does running curl -XPOST 'localhost:9200/_cluster/reroute?retry_failed' solve the issue? If not, can you post the output of curl -XGET 'localhost:9200/_cluster/allocation/explain'? (Both commands are shown formatted below.)
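For convenience, here is a minimal sketch of the two commands above as they might be run from a shell; the ?pretty and =true parts are additions here for readability and explicitness, not part of the original comment:

# Retry shards whose allocation previously failed, e.g. after shard lock timeouts.
curl -XPOST 'localhost:9200/_cluster/reroute?retry_failed=true&pretty'

# If the cluster is still red, ask Elasticsearch why an unassigned shard cannot be allocated.
curl -XGET 'localhost:9200/_cluster/allocation/explain?pretty'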

@sikelong123
Author

A lot of primary shards were lost before the reboot. After the restart there was no problem.

@sikelong123
Author

@ywelsch

@sikelong123
Author

Old GC logs:
First:
[2017-04-03T08:18:52,492][WARN ][o.e.m.j.JvmGcMonitorService] [mogu015052] [gc][young][4633755][90625] duration [3.2s], collections [1]/[1.6m], total [3.2s]/[1.4h], memory [57.1gb]->[32.2gb]/[59.9gb], all_pools {[young] [24.7gb]->[266.6mb]/[24.8gb]}{[survivor] [515.9mb]->[0b]/[3gb]}{[old] [31.8gb]->[31.9gb]/[32gb]}
[2017-04-03T08:18:52,492][WARN ][o.e.m.j.JvmGcMonitorService] [mogu015052] [gc][old][4633755][2718] duration [1.6m], collections [2]/[1.6m], total [1.6m]/[29.6m], memory [57.1gb]->[32.2gb]/[59.9gb], all_pools {[young] [24.7gb]->[266.6mb]/[24.8gb]}{[survivor] [515.9mb]->[0b]/[3gb]}{[old] [31.8gb]->[31.9gb]/[32gb]}
[2017-04-03T08:18:52,492][WARN ][o.e.m.j.JvmGcMonitorService] [mogu015052] [gc][4633755] overhead, spent [1.6m] collecting in the last [1.6m]
Second:
[2017-04-03T08:19:42,618][WARN ][o.e.m.j.JvmGcMonitorService] [mogu015052] [gc][4633804] overhead, spent [1.2s] collecting in the last [1.9s]
[2017-04-03T08:21:20,793][WARN ][o.e.m.j.JvmGcMonitorService] [mogu015052] [gc][old][4633811][2720] duration [1.5m], collections [2]/[1.5m], total [1.5m]/[31.2m], memory [59.8gb]->[32.6gb]/[59.9gb], all_pools {[young] [24.8gb]->[635.3mb]/[24.8gb]}{[survivor] [3gb]->[0b]/[3gb]}{[old] [31.9gb]->[31.9gb]/[32gb]}
[2017-04-03T08:21:20,793][WARN ][o.e.m.j.JvmGcMonitorService] [mogu015052] [gc][4633811] overhead, spent [1.5m] collecting in the last [1.5m]

Third:
[2017-04-03T08:23:58,283][WARN ][o.e.m.j.JvmGcMonitorService] [mogu015052] [gc][old][4633880][2722] duration [1.4m], collections [1]/[1.4m], total [1.4m]/[32.7m], memory [59.6gb]->[32.3gb]/[59.9gb], all_pools {[young] [24.8gb]->[398.6mb]/[24.8gb]}{[survivor] [2.8gb]->[0b]/[3gb]}{[old] [31.9gb]->[31.9gb]/[32gb]}
[2017-04-03T08:23:58,283][WARN ][o.e.m.j.JvmGcMonitorService] [mogu015052] [gc][4633880] overhead, spent [1.4m] collecting in the last [1.4m]
[2017-04-03T08:26:09,411][WARN ][o.e.m.j.JvmGcMonitorService] [mogu015052] [gc][old][4633925][2723] duration [1.4m], collections [1]/[1.4m], total [1.4m]/[34.1m], memory [59.8gb]->[32.3gb]/[59.9gb], all_pools {[young] [24.8gb]->[401mb]/[24.8gb]}{[survivor] [3gb]->[0b]/[3gb]}{[old] [31.9gb]->[31.9gb]/[32gb]}
[2017-04-03T08:26:09,411][WARN ][o.e.m.j.JvmGcMonitorService] [mogu015052] [gc][4633925] overhead, spent [1.4m] collecting in the last [1.4m]
[2017-04-03T08:26:23,477][INFO ][o.e.m.j.JvmGcMonitorService] [mogu015052] [gc][4633939] overhead, spent [433ms] collecting in the last [1s]
[2017-04-03T08:27:56,157][WARN ][o.e.m.j.JvmGcMonitorService] [mogu015052] [gc][old][4633941][2725] duration [1.5m], collections [2]/[1.5m], total [1.5m]/[35.6m], memory [56.1gb]->[31.9gb]/[59.9gb], all_pools {[young] [24.2gb]->[19.3mb]/[24.8gb]}{[survivor] [0b]->[0b]/[3gb]}{[old] [31.9gb]->[31.9gb]/[32gb]}
[2017-04-03T08:27:56,157][WARN ][o.e.m.j.JvmGcMonitorService] [mogu015052] [gc][4633941] overhead, spent [1.5m] collecting in the last [1.5m]
Fourth:
[2017-04-03T08:34:08,761][WARN ][o.e.m.j.JvmGcMonitorService] [mogu015052] [gc][old][4634225][2735] duration [1.4m], collections [1]/[1.4m], total [1.4m]/[37.2m], memory [56.9gb]->[32.3gb]/[59.9gb], all_pools {[young] [24.2gb]->[374.5mb]/[24.8gb]}{[survivor] [844.2mb]->[0b]/[3gb]}{[old] [31.8gb]->[31.9gb]/[32gb]}
[2017-04-03T08:34:08,761][WARN ][o.e.m.j.JvmGcMonitorService] [mogu015052] [gc][4634225] overhead, spent [1.4m] collecting in the last [1.4m]
[2017-04-03T08:36:22,252][WARN ][o.e.m.j.JvmGcMonitorService] [mogu015052] [gc][old][4634259][2736] duration [1.6m], collections [1]/[1.6m], total [1.6m]/[38.9m], memory [59.8gb]->[32.6gb]/[59.9gb], all_pools {[young] [24.8gb]->[671.6mb]/[24.8gb]}{[survivor] [3gb]->[0b]/[3gb]}{[old] [31.9gb]->[32gb]/[32gb]}
[2017-04-03T08:36:22,253][WARN ][o.e.m.j.JvmGcMonitorService] [mogu015052] [gc][4634259] overhead, spent [1.6m] collecting in the last [1.6m]

Fifth:
[2017-04-03T08:38:32,952][WARN ][o.e.m.j.JvmGcMonitorService] [mogu015052] [gc][old][4634288][2737] duration [1.6m], collections [1]/[1.7m], total [1.6m]/[40.6m], memory [59.6gb]->[32.4gb]/[59.9gb], all_pools {[young] [24.8gb]->[436mb]/[24.8gb]}{[survivor] [2.8gb]->[0b]/[3gb]}{[old] [32gb]->[31.9gb]/[32gb]}
[2017-04-03T08:38:32,952][WARN ][o.e.m.j.JvmGcMonitorService] [mogu015052] [gc][4634288] overhead, spent [1.6m] collecting in the last [1.7m]
Sixth:
[2017-04-03T08:40:46,187][WARN ][o.e.m.j.JvmGcMonitorService] [mogu015052] [gc][old][4634324][2738] duration [1.6m], collections [1]/[1.6m], total [1.6m]/[42.2m], memory [59.7gb]->[32.1gb]/[59.9gb], all_pools {[young] [24.8gb]->[203mb]/[24.8gb]}{[survivor] [2.9gb]->[0b]/[3gb]}{[old] [31.9gb]->[31.9gb]/[32gb]}
[2017-04-03T08:40:46,187][WARN ][o.e.m.j.JvmGcMonitorService] [mogu015052] [gc][4634324] overhead, spent [1.6m] collecting in the last [1.6m]

Seventh:
[2017-04-03T08:42:56,802][WARN ][o.e.m.j.JvmGcMonitorService] [mogu015052] [gc][old][4634370][2739] duration [1.4m], collections [1]/[1.4m], total [1.4m]/[43.6m], memory [59.8gb]->[32.2gb]/[59.9gb], all_pools {[young] [24.8gb]->[247.9mb]/[24.8gb]}{[survivor] [3gb]->[0b]/[3gb]}{[old] [31.9gb]->[31.9gb]/[32gb]}
[2017-04-03T08:42:56,802][WARN ][o.e.m.j.JvmGcMonitorService] [mogu015052] [gc][4634370] overhead, spent [1.4m] collecting in the last [1.4m]
[2017-04-03T08:45:07,732][WARN ][o.e.m.j.JvmGcMonitorService] [mogu015052] [gc][old][4634400][2740] duration [1.6m], collections [1]/[1.6m], total [1.6m]/[45.3m], memory [59.7gb]->[32.4gb]/[59.9gb], all_pools {[young] [24.8gb]->[500.6mb]/[24.8gb]}{[survivor] [2.9gb]->[0b]/[3gb]}{[old] [31.9gb]->[31.9gb]/[32gb]}
[2017-04-03T08:45:07,732][WARN ][o.e.m.j.JvmGcMonitorService] [mogu015052] [gc][4634400] overhead, spent [1.6m] collecting in the last [1.6m]
Eighth:
[2017-04-03T08:47:19,016][WARN ][o.e.m.j.JvmGcMonitorService] [mogu015052] [gc][old][4634428][2741] duration [1.7m], collections [1]/[1.7m], total [1.7m]/[47.1m], memory [59.8gb]->[32.5gb]/[59.9gb], all_pools {[young] [24.8gb]->[532.6mb]/[24.8gb]}{[survivor] [3gb]->[0b]/[3gb]}{[old] [31.9gb]->[31.9gb]/[32gb]}
[2017-04-03T08:47:19,028][WARN ][o.e.m.j.JvmGcMonitorService] [mogu015052] [gc][4634428] overhead, spent [1.7m] collecting in the last [1.7m]

......
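
As an aside, and not something stated in this thread: the GC lines above show a heap of roughly 60 GB in which each old-generation collection takes 1.4 to 1.7 minutes. Pauses of that length can exceed the default zen fault-detection timeouts, which is consistent with the node being dropped from the cluster. A commonly cited mitigation is to keep the heap at no more than about half of the machine's RAM and below ~32 GB so compressed object pointers stay enabled; a hypothetical config/jvm.options sketch, with illustrative values only:

# config/jvm.options -- illustrative heap settings, size to your own hardware
-Xms31g
-Xmx31g

Whether smaller heaps are feasible here depends on the workload; this is offered only as context for the GC pauses shown above.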

But why does this lead to the second problem?
Also, why is the primary shard lost?
The following issue is similar to my problem; I also changed some of the parameters as suggested there, but the second problem above still happens:
https://github.com/elastic/elasticsearch/issues/23199
@ywelsch

@ywelsch
Contributor

ywelsch commented Apr 6, 2017

But why does this lead to the second problem?

It's all explained in #23199:

The ShardLockObtainFailedException manifests as a result of cluster instabilities having nodes come and go quickly. More resources on data nodes can, but are not guaranteed to, remedy such situations.

When the cluster forms again after long GC / network disconnects, and the master allocates the shards to the problematic node, the node first has to properly shut down the existing shard resources. This can take a while, depending on the operations that were previously running. This manifests in the ShardLock failures, as the new shard instance cannot be created while the old one is not fully cleaned up. We are currently investigating how we can provide more insight into which types of operations are preventing the old shard instance from being released, which will require code changes.
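
As a general-purpose diagnostic, and not something prescribed in this thread, the hot threads API can give a rough view of what a node is still busy with while the old shard instance is being released, for example:

# Show the busiest threads on the affected node (the node name is taken from the logs above).
curl -XGET 'localhost:9200/_nodes/mogu015052/hot_threads'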

Also, why is the primary shard lost?

It's not lost. What happens is that the shard cannot be allocated as long as the locks of the previous instance are not being released.

As this is a duplicate of #23199 now, I'm closing this one.

@ywelsch ywelsch closed this as completed Apr 6, 2017