Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CI] org.elasticsearch.action.admin.cluster.allocation.ClusterAllocationExplainIT#testDelayShards fails due to NPE #19488

Closed
polyfractal opened this issue Jul 18, 2016 · 1 comment
Assignees
Labels
>test-failure Triaged test failures from CI

Comments

@polyfractal
Copy link
Contributor

So the test failure is in explaining the allocation of shards, but I believe the actual problem is an NPE in ShardRouting, where unassignedInfo is somehow null: https://github.com/elastic/elasticsearch/blob/master/core/src/main/java/org/elasticsearch/cluster/routing/ShardRouting.java#L243

CI failure: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+periodic/1584/console

I can usually recreate with this, although it appears to only happen 1 in 50 or so tests:

gradle :core:integTest -Dtests.seed=A82D8E5D38BF0A20 -Dtests.class=org.elasticsearch.action.admin.cluster.allocation.ClusterAllocationExplainIT -Dtests.method="testDelayShards" -Dtests.security.manager=true -Dtests.locale=nn-NO -Dtests.timezone=Europe/London -Dtests.iters=100

The test kills a random node, which usually goes smoothly. But sometimes it NPE's out and causes all the test to drop all the other nodes:

1> [2016-07-18 15:52:11,081][INFO ][org.elasticsearch.action.admin.cluster.allocation] --> stopping a random node
  1> [2016-07-18 15:52:11,081][INFO ][org.elasticsearch.test   ] Closing random node [node_t2] 
  1> [2016-07-18 15:52:11,081][INFO ][org.elasticsearch.node   ] [node_t2] stopping ...
  1> [2016-07-18 15:52:11,090][INFO ][org.elasticsearch.node   ] [node_t2] stopped
  1> [2016-07-18 15:52:11,090][ERROR][org.elasticsearch.discovery.local] [node_t2] unexpected failure during [local-disco-update]
  1> java.lang.NullPointerException
  1>    at org.elasticsearch.cluster.routing.ShardRouting.allocatedPostIndexCreate(ShardRouting.java:243)
  1>    at org.elasticsearch.gateway.ReplicaShardAllocator.processExistingRecoveries(ReplicaShardAllocator.java:81)
  1>    at org.elasticsearch.gateway.GatewayAllocator.allocateUnassigned(GatewayAllocator.java:139)
  1>    at org.elasticsearch.cluster.routing.allocation.AllocationService.reroute(AllocationService.java:356)
  1>    at org.elasticsearch.cluster.routing.allocation.AllocationService.reroute(AllocationService.java:330)
  1>    at org.elasticsearch.cluster.routing.allocation.AllocationService.reroute(AllocationService.java:315)
  1>    at org.elasticsearch.discovery.local.LocalDiscovery$3.execute(LocalDiscovery.java:234)
  1>    at org.elasticsearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:45)
  1>    at org.elasticsearch.cluster.service.ClusterService.runTasksForExecutor(ClusterService.java:552)
  1>    at org.elasticsearch.cluster.service.ClusterService$UpdateTask.run(ClusterService.java:855)
  1>    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:450)
  1>    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:237)
  1>    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:200)
  1>    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  1>    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  1>    at java.lang.Thread.run(Thread.java:745)
  1> [2016-07-18 15:52:11,090][INFO ][org.elasticsearch.node   ] [node_t2] closing ...
  1> [2016-07-18 15:52:11,092][INFO ][org.elasticsearch.node   ] [node_t2] closed
  1> [2016-07-18 15:52:11,092][INFO ][org.elasticsearch.plugins] [transport_client_node_t0] no modules loaded
  1> [2016-07-18 15:52:11,092][INFO ][org.elasticsearch.plugins] [transport_client_node_t0] loaded plugin [org.elasticsearch.test.transport.AssertingLocalTransport$TestPlugin]
  1> [2016-07-18 15:52:11,096][INFO ][org.elasticsearch.transport] [transport_client_node_t0] publish_address {local[60]}, bound_addresses {local[60]}
  1> [2016-07-18 15:52:11,103][INFO ][org.elasticsearch.action.admin.cluster.allocation] [ClusterAllocationExplainIT#testDelayShards { seed=[A82D8E5D38BF0A20:5A6033F2A8FE37B]}]: finished test
  1> [2016-07-18 15:52:11,103][INFO ][org.elasticsearch.action.admin.cluster.allocation] [ClusterAllocationExplainIT#testDelayShards { seed=[A82D8E5D38BF0A20:5A6033F2A8FE37B]}]: cleaning up after test
  1> [2016-07-18 15:52:20,900][WARN ][org.elasticsearch.cluster] [node_t0] failed to connect to node {node_t2}{2gVawe0zRDqmYcUcdUZBAw}{Dw9SY5WaQ8-ewtmirmJgrA}{local}{local[59]} (tried [1] times)
  1> ConnectTransportException[[node_t2][local[59]] Failed to connect]
  1>    at org.elasticsearch.transport.local.LocalTransport.connectToNode(LocalTransport.java:184)
  1>    at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:272)
  1>    at org.elasticsearch.cluster.NodeConnectionsService.validateNodeConnected(NodeConnectionsService.java:108)
  1>    at org.elasticsearch.cluster.NodeConnectionsService$ConnectionChecker.doRun(NodeConnectionsService.java:133)
  1>    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:510)
  1>    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
  1>    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  1>    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  1>    at java.lang.Thread.run(Thread.java:745)
  1> [2016-07-18 15:52:20,927][WARN ][org.elasticsearch.cluster] [node_t1] failed to connect to node {node_t2}{2gVawe0zRDqmYcUcdUZBAw}{Dw9SY5WaQ8-ewtmirmJgrA}{local}{local[59]} (tried [1] times)
  1> ConnectTransportException[[node_t2][local[59]] Failed to connect]
  1>    at org.elasticsearch.transport.local.LocalTransport.connectToNode(LocalTransport.java:184)
  1>    at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:272)
  1>    at org.elasticsearch.cluster.NodeConnectionsService.validateNodeConnected(NodeConnectionsService.java:108)
  1>    at org.elasticsearch.cluster.NodeConnectionsService$ConnectionChecker.doRun(NodeConnectionsService.java:133)
  1>    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:510)
  1>    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
  1>    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  1>    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  1>    at java.lang.Thread.run(Thread.java:745)
HEARTBEAT J0 PID(25998@DualDesk): 2016-07-18T15:52:22, stalled for 11.5s at: ClusterAllocationExplainIT.testDelayShards { seed=[A82D8E5D38BF0A20:5A6033F2A8FE37B]}
HEARTBEAT J0 PID(25998@DualDesk): 2016-07-18T15:52:32, stalled for 21.5s at: ClusterAllocationExplainIT.testDelayShards { seed=[A82D8E5D38BF0A20:5A6033F2A8FE37B]}
  1> [2016-07-18 15:52:40,905][WARN ][org.elasticsearch.action.admin.cluster.node.stats] [node_t0] ignoring unexpected response [null] of type [null], expected [NodeStats] or [FailedNodeException]
  1> [2016-07-18 15:52:41,115][INFO ][org.elasticsearch.node   ] [node_t0] stopping ...
  1> [2016-07-18 15:52:41,120][INFO ][org.elasticsearch.cluster.routing.allocation] [node_t1] Cluster health status changed from [GREEN] to [RED] (reason: [elected as master]).
  1> [2016-07-18 15:52:41,120][INFO ][org.elasticsearch.cluster.service] [node_t1] master {new {node_t1}{lnZrFKUNTdS1JxgsR2k_-Q}{zdgmZzecTwKFnyvQ7g4DOA}{local}{local[58]}, previous {node_t0}{v36fo6K7Qz-E2KyaZUxZ4Q}{zhtJYDKQTzGmIQUjL9-kWA}{local}{local[57]}}, removed {{node_t0}{v36fo6K7Qz-E2KyaZUxZ4Q}{zhtJYDKQTzGmIQUjL9-kWA}{local}{local[57]},{node_t2}{2gVawe0zRDqmYcUcdUZBAw}{Dw9SY5WaQ8-ewtmirmJgrA}{local}{local[59]},}, reason: local-disco-update
  1> [2016-07-18 15:52:41,121][INFO ][org.elasticsearch.cluster.routing] [node_t1] scheduling reroute for delayed shards in [59.9s] (6 delayed shards)
  1> [2016-07-18 15:52:41,123][INFO ][org.elasticsearch.node   ] [node_t0] stopped
  1> [2016-07-18 15:52:41,123][INFO ][org.elasticsearch.node   ] [node_t0] closing ...
  1> [2016-07-18 15:52:41,126][INFO ][org.elasticsearch.node   ] [node_t0] closed
  1> [2016-07-18 15:52:41,126][INFO ][org.elasticsearch.node   ] [node_t1] stopping ...
  1> [2016-07-18 15:52:41,140][INFO ][org.elasticsearch.node   ] [node_t1] stopped
  1> [2016-07-18 15:52:41,140][INFO ][org.elasticsearch.node   ] [node_t1] closing ...
  1> [2016-07-18 15:52:41,142][INFO ][org.elasticsearch.node   ] [node_t1] closed

Which ultimately makes the test fail with:

> Throwable #1: ElasticsearchException[unable to find any shards to explain [ClusterAllocationExplainRequest[useAnyUnassignedShard=true,includeYesDecisions?=false] in the routing table]
   >    at org.elasticsearch.action.admin.cluster.allocation.TransportClusterAllocationExplainAction.masterOperation(TransportClusterAllocationExplainAction.java:310)
   >    at org.elasticsearch.action.admin.cluster.allocation.TransportClusterAllocationExplainAction.masterOperation(TransportClusterAllocationExplainAction.java:68)
   >    at org.elasticsearch.action.support.master.TransportMasterNodeAction.masterOperation(TransportMasterNodeAction.java:85)
   >    at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$3.doRun(TransportMasterNodeAction.java:169)
   >    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:510)
   >    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
   >    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
   >    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
   >    at java.lang.Thread.run(Thread.java:745)Throwable #2: java.lang.AssertionError: ClusterHealthResponse has timed out - returned: [{
   >   "cluster_name" : "TEST-CHILD_VM=[0]-CLUSTER_SEED=[-7292795637849911491]-HASH=[127918AC5CC6]-cluster",
   >   "status" : "green",
   >   "timed_out" : true,
   >   "number_of_nodes" : 3,
   >   "number_of_data_nodes" : 3,
   >   "active_primary_shards" : 5,
   >   "active_shards" : 10,
   >   "relocating_shards" : 0,
   >   "initializing_shards" : 0,
   >   "unassigned_shards" : 0,
   >   "delayed_unassigned_shards" : 0,
   >   "number_of_pending_tasks" : 0,
   >   "number_of_in_flight_fetch" : 0,
   >   "task_max_waiting_in_queue_millis" : 0,
   >   "active_shards_percent_as_number" : 100.0
   > }]
   > Expected: is <false>
   >      but: was <true>
   >    at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
   >    at org.elasticsearch.test.hamcrest.ElasticsearchAssertions.assertNoTimeout(ElasticsearchAssertions.java:111)
   >    at org.elasticsearch.test.ESIntegTestCase.ensureClusterSizeConsistency(ESIntegTestCase.java:1031)
   >    at org.elasticsearch.test.ESIntegTestCase.afterInternal(ESIntegTestCase.java:528)
   >    at org.elasticsearch.test.ESIntegTestCase.after(ESIntegTestCase.java:1980)
   >    at java.lang.Thread.run(Thread.java:745)

Assigning to @ywelsch since it looks like you're intimately familiar with this class, but /cc'ing @dakrone in case it's I misjudged and it's related to cluster explanations somehow.

@s1monw
Copy link
Contributor

s1monw commented Jul 19, 2016

this comes from the fact that we might sometimes don't have unassinged info if we are initializing. For instance, this particular failure comes from here:

at org.elasticsearch.cluster.routing.ShardRouting.<init>(ShardRouting.java:80)
  2>        at org.elasticsearch.cluster.routing.ShardRouting.removeRelocationSource(ShardRouting.java:391)
  2>        at org.elasticsearch.cluster.routing.RoutingNodes.removeRelocationSource(RoutingNodes.java:500)
  2>        at org.elasticsearch.cluster.routing.allocation.AllocationService.cancelShard(AllocationService.java:570)
  2>        at org.elasticsearch.cluster.routing.allocation.AllocationService.applyFailedShard(AllocationService.java:534)
  2>        at org.elasticsearch.cluster.routing.allocation.AllocationService.deassociateDeadNodes(AllocationService.java:423)
  2>        at org.elasticsearch.cluster.routing.allocation.AllocationService.reroute(AllocationService.java:347)

I think not having unassigned info here is fine, we just need to check for null in that method.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>test-failure Triaged test failures from CI
Projects
None yet
Development

No branches or pull requests

3 participants