[CI] org.elasticsearch.action.admin.cluster.allocation.ClusterAllocationExplainIT#testDelayShards fails due to NPE #19488

polyfractal · 2016-07-18T20:20:34Z

So the test failure is in explaining the allocation of shards, but I believe the actual problem is an NPE in ShardRouting, where unassignedInfo is somehow null: https://github.com/elastic/elasticsearch/blob/master/core/src/main/java/org/elasticsearch/cluster/routing/ShardRouting.java#L243

CI failure: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+periodic/1584/console

I can usually recreate with this, although it appears to only happen 1 in 50 or so tests:

gradle :core:integTest -Dtests.seed=A82D8E5D38BF0A20 -Dtests.class=org.elasticsearch.action.admin.cluster.allocation.ClusterAllocationExplainIT -Dtests.method="testDelayShards" -Dtests.security.manager=true -Dtests.locale=nn-NO -Dtests.timezone=Europe/London -Dtests.iters=100

The test kills a random node, which usually goes smoothly. But sometimes it NPE's out and causes all the test to drop all the other nodes:

1> [2016-07-18 15:52:11,081][INFO ][org.elasticsearch.action.admin.cluster.allocation] --> stopping a random node
  1> [2016-07-18 15:52:11,081][INFO ][org.elasticsearch.test   ] Closing random node [node_t2] 
  1> [2016-07-18 15:52:11,081][INFO ][org.elasticsearch.node   ] [node_t2] stopping ...
  1> [2016-07-18 15:52:11,090][INFO ][org.elasticsearch.node   ] [node_t2] stopped
  1> [2016-07-18 15:52:11,090][ERROR][org.elasticsearch.discovery.local] [node_t2] unexpected failure during [local-disco-update]
  1> java.lang.NullPointerException
  1>    at org.elasticsearch.cluster.routing.ShardRouting.allocatedPostIndexCreate(ShardRouting.java:243)
  1>    at org.elasticsearch.gateway.ReplicaShardAllocator.processExistingRecoveries(ReplicaShardAllocator.java:81)
  1>    at org.elasticsearch.gateway.GatewayAllocator.allocateUnassigned(GatewayAllocator.java:139)
  1>    at org.elasticsearch.cluster.routing.allocation.AllocationService.reroute(AllocationService.java:356)
  1>    at org.elasticsearch.cluster.routing.allocation.AllocationService.reroute(AllocationService.java:330)
  1>    at org.elasticsearch.cluster.routing.allocation.AllocationService.reroute(AllocationService.java:315)
  1>    at org.elasticsearch.discovery.local.LocalDiscovery$3.execute(LocalDiscovery.java:234)
  1>    at org.elasticsearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:45)
  1>    at org.elasticsearch.cluster.service.ClusterService.runTasksForExecutor(ClusterService.java:552)
  1>    at org.elasticsearch.cluster.service.ClusterService$UpdateTask.run(ClusterService.java:855)
  1>    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:450)
  1>    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:237)
  1>    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:200)
  1>    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  1>    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  1>    at java.lang.Thread.run(Thread.java:745)
  1> [2016-07-18 15:52:11,090][INFO ][org.elasticsearch.node   ] [node_t2] closing ...
  1> [2016-07-18 15:52:11,092][INFO ][org.elasticsearch.node   ] [node_t2] closed
  1> [2016-07-18 15:52:11,092][INFO ][org.elasticsearch.plugins] [transport_client_node_t0] no modules loaded
  1> [2016-07-18 15:52:11,092][INFO ][org.elasticsearch.plugins] [transport_client_node_t0] loaded plugin [org.elasticsearch.test.transport.AssertingLocalTransport$TestPlugin]
  1> [2016-07-18 15:52:11,096][INFO ][org.elasticsearch.transport] [transport_client_node_t0] publish_address {local[60]}, bound_addresses {local[60]}
  1> [2016-07-18 15:52:11,103][INFO ][org.elasticsearch.action.admin.cluster.allocation] [ClusterAllocationExplainIT#testDelayShards { seed=[A82D8E5D38BF0A20:5A6033F2A8FE37B]}]: finished test
  1> [2016-07-18 15:52:11,103][INFO ][org.elasticsearch.action.admin.cluster.allocation] [ClusterAllocationExplainIT#testDelayShards { seed=[A82D8E5D38BF0A20:5A6033F2A8FE37B]}]: cleaning up after test
  1> [2016-07-18 15:52:20,900][WARN ][org.elasticsearch.cluster] [node_t0] failed to connect to node {node_t2}{2gVawe0zRDqmYcUcdUZBAw}{Dw9SY5WaQ8-ewtmirmJgrA}{local}{local[59]} (tried [1] times)
  1> ConnectTransportException[[node_t2][local[59]] Failed to connect]
  1>    at org.elasticsearch.transport.local.LocalTransport.connectToNode(LocalTransport.java:184)
  1>    at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:272)
  1>    at org.elasticsearch.cluster.NodeConnectionsService.validateNodeConnected(NodeConnectionsService.java:108)
  1>    at org.elasticsearch.cluster.NodeConnectionsService$ConnectionChecker.doRun(NodeConnectionsService.java:133)
  1>    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:510)
  1>    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
  1>    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  1>    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  1>    at java.lang.Thread.run(Thread.java:745)
  1> [2016-07-18 15:52:20,927][WARN ][org.elasticsearch.cluster] [node_t1] failed to connect to node {node_t2}{2gVawe0zRDqmYcUcdUZBAw}{Dw9SY5WaQ8-ewtmirmJgrA}{local}{local[59]} (tried [1] times)
  1> ConnectTransportException[[node_t2][local[59]] Failed to connect]
  1>    at org.elasticsearch.transport.local.LocalTransport.connectToNode(LocalTransport.java:184)
  1>    at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:272)
  1>    at org.elasticsearch.cluster.NodeConnectionsService.validateNodeConnected(NodeConnectionsService.java:108)
  1>    at org.elasticsearch.cluster.NodeConnectionsService$ConnectionChecker.doRun(NodeConnectionsService.java:133)
  1>    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:510)
  1>    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
  1>    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  1>    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  1>    at java.lang.Thread.run(Thread.java:745)
HEARTBEAT J0 PID(25998@DualDesk): 2016-07-18T15:52:22, stalled for 11.5s at: ClusterAllocationExplainIT.testDelayShards { seed=[A82D8E5D38BF0A20:5A6033F2A8FE37B]}
HEARTBEAT J0 PID(25998@DualDesk): 2016-07-18T15:52:32, stalled for 21.5s at: ClusterAllocationExplainIT.testDelayShards { seed=[A82D8E5D38BF0A20:5A6033F2A8FE37B]}
  1> [2016-07-18 15:52:40,905][WARN ][org.elasticsearch.action.admin.cluster.node.stats] [node_t0] ignoring unexpected response [null] of type [null], expected [NodeStats] or [FailedNodeException]
  1> [2016-07-18 15:52:41,115][INFO ][org.elasticsearch.node   ] [node_t0] stopping ...
  1> [2016-07-18 15:52:41,120][INFO ][org.elasticsearch.cluster.routing.allocation] [node_t1] Cluster health status changed from [GREEN] to [RED] (reason: [elected as master]).
  1> [2016-07-18 15:52:41,120][INFO ][org.elasticsearch.cluster.service] [node_t1] master {new {node_t1}{lnZrFKUNTdS1JxgsR2k_-Q}{zdgmZzecTwKFnyvQ7g4DOA}{local}{local[58]}, previous {node_t0}{v36fo6K7Qz-E2KyaZUxZ4Q}{zhtJYDKQTzGmIQUjL9-kWA}{local}{local[57]}}, removed {{node_t0}{v36fo6K7Qz-E2KyaZUxZ4Q}{zhtJYDKQTzGmIQUjL9-kWA}{local}{local[57]},{node_t2}{2gVawe0zRDqmYcUcdUZBAw}{Dw9SY5WaQ8-ewtmirmJgrA}{local}{local[59]},}, reason: local-disco-update
  1> [2016-07-18 15:52:41,121][INFO ][org.elasticsearch.cluster.routing] [node_t1] scheduling reroute for delayed shards in [59.9s] (6 delayed shards)
  1> [2016-07-18 15:52:41,123][INFO ][org.elasticsearch.node   ] [node_t0] stopped
  1> [2016-07-18 15:52:41,123][INFO ][org.elasticsearch.node   ] [node_t0] closing ...
  1> [2016-07-18 15:52:41,126][INFO ][org.elasticsearch.node   ] [node_t0] closed
  1> [2016-07-18 15:52:41,126][INFO ][org.elasticsearch.node   ] [node_t1] stopping ...
  1> [2016-07-18 15:52:41,140][INFO ][org.elasticsearch.node   ] [node_t1] stopped
  1> [2016-07-18 15:52:41,140][INFO ][org.elasticsearch.node   ] [node_t1] closing ...
  1> [2016-07-18 15:52:41,142][INFO ][org.elasticsearch.node   ] [node_t1] closed

Which ultimately makes the test fail with:

> Throwable #1: ElasticsearchException[unable to find any shards to explain [ClusterAllocationExplainRequest[useAnyUnassignedShard=true,includeYesDecisions?=false] in the routing table]
   >    at org.elasticsearch.action.admin.cluster.allocation.TransportClusterAllocationExplainAction.masterOperation(TransportClusterAllocationExplainAction.java:310)
   >    at org.elasticsearch.action.admin.cluster.allocation.TransportClusterAllocationExplainAction.masterOperation(TransportClusterAllocationExplainAction.java:68)
   >    at org.elasticsearch.action.support.master.TransportMasterNodeAction.masterOperation(TransportMasterNodeAction.java:85)
   >    at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$3.doRun(TransportMasterNodeAction.java:169)
   >    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:510)
   >    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
   >    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
   >    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
   >    at java.lang.Thread.run(Thread.java:745)Throwable #2: java.lang.AssertionError: ClusterHealthResponse has timed out - returned: [{
   >   "cluster_name" : "TEST-CHILD_VM=[0]-CLUSTER_SEED=[-7292795637849911491]-HASH=[127918AC5CC6]-cluster",
   >   "status" : "green",
   >   "timed_out" : true,
   >   "number_of_nodes" : 3,
   >   "number_of_data_nodes" : 3,
   >   "active_primary_shards" : 5,
   >   "active_shards" : 10,
   >   "relocating_shards" : 0,
   >   "initializing_shards" : 0,
   >   "unassigned_shards" : 0,
   >   "delayed_unassigned_shards" : 0,
   >   "number_of_pending_tasks" : 0,
   >   "number_of_in_flight_fetch" : 0,
   >   "task_max_waiting_in_queue_millis" : 0,
   >   "active_shards_percent_as_number" : 100.0
   > }]
   > Expected: is <false>
   >      but: was <true>
   >    at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
   >    at org.elasticsearch.test.hamcrest.ElasticsearchAssertions.assertNoTimeout(ElasticsearchAssertions.java:111)
   >    at org.elasticsearch.test.ESIntegTestCase.ensureClusterSizeConsistency(ESIntegTestCase.java:1031)
   >    at org.elasticsearch.test.ESIntegTestCase.afterInternal(ESIntegTestCase.java:528)
   >    at org.elasticsearch.test.ESIntegTestCase.after(ESIntegTestCase.java:1980)
   >    at java.lang.Thread.run(Thread.java:745)

Assigning to @ywelsch since it looks like you're intimately familiar with this class, but /cc'ing @dakrone in case it's I misjudged and it's related to cluster explanations somehow.

The text was updated successfully, but these errors were encountered:

s1monw · 2016-07-19T09:16:54Z

this comes from the fact that we might sometimes don't have unassinged info if we are initializing. For instance, this particular failure comes from here:

at org.elasticsearch.cluster.routing.ShardRouting.<init>(ShardRouting.java:80)
  2>        at org.elasticsearch.cluster.routing.ShardRouting.removeRelocationSource(ShardRouting.java:391)
  2>        at org.elasticsearch.cluster.routing.RoutingNodes.removeRelocationSource(RoutingNodes.java:500)
  2>        at org.elasticsearch.cluster.routing.allocation.AllocationService.cancelShard(AllocationService.java:570)
  2>        at org.elasticsearch.cluster.routing.allocation.AllocationService.applyFailedShard(AllocationService.java:534)
  2>        at org.elasticsearch.cluster.routing.allocation.AllocationService.deassociateDeadNodes(AllocationService.java:423)
  2>        at org.elasticsearch.cluster.routing.allocation.AllocationService.reroute(AllocationService.java:347)

I think not having unassigned info here is fine, we just need to check for null in that method.

$@polyfractal$ polyfractal added the >test-failure Triaged test failures from CI label Jul 18, 2016

$@polyfractal$ polyfractal assigned ywelsch Jul 18, 2016

ywelsch mentioned this issue Jul 19, 2016

Fix NPE when initializing replica shard has no UnassignedInfo #19491

Merged

ywelsch closed this as completed in #19491 Jul 19, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[CI] org.elasticsearch.action.admin.cluster.allocation.ClusterAllocationExplainIT#testDelayShards fails due to NPE #19488

[CI] org.elasticsearch.action.admin.cluster.allocation.ClusterAllocationExplainIT#testDelayShards fails due to NPE #19488

polyfractal commented Jul 18, 2016

s1monw commented Jul 19, 2016 •

edited

[CI] org.elasticsearch.action.admin.cluster.allocation.ClusterAllocationExplainIT#testDelayShards fails due to NPE #19488

[CI] org.elasticsearch.action.admin.cluster.allocation.ClusterAllocationExplainIT#testDelayShards fails due to NPE #19488

Comments

polyfractal commented Jul 18, 2016

s1monw commented Jul 19, 2016 • edited

s1monw commented Jul 19, 2016 •

edited