You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
The test kills a random node, which usually goes smoothly. But sometimes it NPE's out and causes all the test to drop all the other nodes:
1> [2016-07-18 15:52:11,081][INFO ][org.elasticsearch.action.admin.cluster.allocation] --> stopping a random node
1> [2016-07-18 15:52:11,081][INFO ][org.elasticsearch.test ] Closing random node [node_t2]
1> [2016-07-18 15:52:11,081][INFO ][org.elasticsearch.node ] [node_t2] stopping ...
1> [2016-07-18 15:52:11,090][INFO ][org.elasticsearch.node ] [node_t2] stopped
1> [2016-07-18 15:52:11,090][ERROR][org.elasticsearch.discovery.local] [node_t2] unexpected failure during [local-disco-update]
1> java.lang.NullPointerException
1> at org.elasticsearch.cluster.routing.ShardRouting.allocatedPostIndexCreate(ShardRouting.java:243)
1> at org.elasticsearch.gateway.ReplicaShardAllocator.processExistingRecoveries(ReplicaShardAllocator.java:81)
1> at org.elasticsearch.gateway.GatewayAllocator.allocateUnassigned(GatewayAllocator.java:139)
1> at org.elasticsearch.cluster.routing.allocation.AllocationService.reroute(AllocationService.java:356)
1> at org.elasticsearch.cluster.routing.allocation.AllocationService.reroute(AllocationService.java:330)
1> at org.elasticsearch.cluster.routing.allocation.AllocationService.reroute(AllocationService.java:315)
1> at org.elasticsearch.discovery.local.LocalDiscovery$3.execute(LocalDiscovery.java:234)
1> at org.elasticsearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:45)
1> at org.elasticsearch.cluster.service.ClusterService.runTasksForExecutor(ClusterService.java:552)
1> at org.elasticsearch.cluster.service.ClusterService$UpdateTask.run(ClusterService.java:855)
1> at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:450)
1> at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:237)
1> at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:200)
1> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
1> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
1> at java.lang.Thread.run(Thread.java:745)
1> [2016-07-18 15:52:11,090][INFO ][org.elasticsearch.node ] [node_t2] closing ...
1> [2016-07-18 15:52:11,092][INFO ][org.elasticsearch.node ] [node_t2] closed
1> [2016-07-18 15:52:11,092][INFO ][org.elasticsearch.plugins] [transport_client_node_t0] no modules loaded
1> [2016-07-18 15:52:11,092][INFO ][org.elasticsearch.plugins] [transport_client_node_t0] loaded plugin [org.elasticsearch.test.transport.AssertingLocalTransport$TestPlugin]
1> [2016-07-18 15:52:11,096][INFO ][org.elasticsearch.transport] [transport_client_node_t0] publish_address {local[60]}, bound_addresses {local[60]}
1> [2016-07-18 15:52:11,103][INFO ][org.elasticsearch.action.admin.cluster.allocation] [ClusterAllocationExplainIT#testDelayShards { seed=[A82D8E5D38BF0A20:5A6033F2A8FE37B]}]: finished test
1> [2016-07-18 15:52:11,103][INFO ][org.elasticsearch.action.admin.cluster.allocation] [ClusterAllocationExplainIT#testDelayShards { seed=[A82D8E5D38BF0A20:5A6033F2A8FE37B]}]: cleaning up after test
1> [2016-07-18 15:52:20,900][WARN ][org.elasticsearch.cluster] [node_t0] failed to connect to node {node_t2}{2gVawe0zRDqmYcUcdUZBAw}{Dw9SY5WaQ8-ewtmirmJgrA}{local}{local[59]} (tried [1] times)
1> ConnectTransportException[[node_t2][local[59]] Failed to connect]
1> at org.elasticsearch.transport.local.LocalTransport.connectToNode(LocalTransport.java:184)
1> at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:272)
1> at org.elasticsearch.cluster.NodeConnectionsService.validateNodeConnected(NodeConnectionsService.java:108)
1> at org.elasticsearch.cluster.NodeConnectionsService$ConnectionChecker.doRun(NodeConnectionsService.java:133)
1> at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:510)
1> at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
1> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
1> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
1> at java.lang.Thread.run(Thread.java:745)
1> [2016-07-18 15:52:20,927][WARN ][org.elasticsearch.cluster] [node_t1] failed to connect to node {node_t2}{2gVawe0zRDqmYcUcdUZBAw}{Dw9SY5WaQ8-ewtmirmJgrA}{local}{local[59]} (tried [1] times)
1> ConnectTransportException[[node_t2][local[59]] Failed to connect]
1> at org.elasticsearch.transport.local.LocalTransport.connectToNode(LocalTransport.java:184)
1> at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:272)
1> at org.elasticsearch.cluster.NodeConnectionsService.validateNodeConnected(NodeConnectionsService.java:108)
1> at org.elasticsearch.cluster.NodeConnectionsService$ConnectionChecker.doRun(NodeConnectionsService.java:133)
1> at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:510)
1> at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
1> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
1> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
1> at java.lang.Thread.run(Thread.java:745)
HEARTBEAT J0 PID(25998@DualDesk): 2016-07-18T15:52:22, stalled for 11.5s at: ClusterAllocationExplainIT.testDelayShards { seed=[A82D8E5D38BF0A20:5A6033F2A8FE37B]}
HEARTBEAT J0 PID(25998@DualDesk): 2016-07-18T15:52:32, stalled for 21.5s at: ClusterAllocationExplainIT.testDelayShards { seed=[A82D8E5D38BF0A20:5A6033F2A8FE37B]}
1> [2016-07-18 15:52:40,905][WARN ][org.elasticsearch.action.admin.cluster.node.stats] [node_t0] ignoring unexpected response [null] of type [null], expected [NodeStats] or [FailedNodeException]
1> [2016-07-18 15:52:41,115][INFO ][org.elasticsearch.node ] [node_t0] stopping ...
1> [2016-07-18 15:52:41,120][INFO ][org.elasticsearch.cluster.routing.allocation] [node_t1] Cluster health status changed from [GREEN] to [RED] (reason: [elected as master]).
1> [2016-07-18 15:52:41,120][INFO ][org.elasticsearch.cluster.service] [node_t1] master {new {node_t1}{lnZrFKUNTdS1JxgsR2k_-Q}{zdgmZzecTwKFnyvQ7g4DOA}{local}{local[58]}, previous {node_t0}{v36fo6K7Qz-E2KyaZUxZ4Q}{zhtJYDKQTzGmIQUjL9-kWA}{local}{local[57]}}, removed {{node_t0}{v36fo6K7Qz-E2KyaZUxZ4Q}{zhtJYDKQTzGmIQUjL9-kWA}{local}{local[57]},{node_t2}{2gVawe0zRDqmYcUcdUZBAw}{Dw9SY5WaQ8-ewtmirmJgrA}{local}{local[59]},}, reason: local-disco-update
1> [2016-07-18 15:52:41,121][INFO ][org.elasticsearch.cluster.routing] [node_t1] scheduling reroute for delayed shards in [59.9s] (6 delayed shards)
1> [2016-07-18 15:52:41,123][INFO ][org.elasticsearch.node ] [node_t0] stopped
1> [2016-07-18 15:52:41,123][INFO ][org.elasticsearch.node ] [node_t0] closing ...
1> [2016-07-18 15:52:41,126][INFO ][org.elasticsearch.node ] [node_t0] closed
1> [2016-07-18 15:52:41,126][INFO ][org.elasticsearch.node ] [node_t1] stopping ...
1> [2016-07-18 15:52:41,140][INFO ][org.elasticsearch.node ] [node_t1] stopped
1> [2016-07-18 15:52:41,140][INFO ][org.elasticsearch.node ] [node_t1] closing ...
1> [2016-07-18 15:52:41,142][INFO ][org.elasticsearch.node ] [node_t1] closed
Which ultimately makes the test fail with:
> Throwable #1: ElasticsearchException[unable to find any shards to explain [ClusterAllocationExplainRequest[useAnyUnassignedShard=true,includeYesDecisions?=false] in the routing table]
> at org.elasticsearch.action.admin.cluster.allocation.TransportClusterAllocationExplainAction.masterOperation(TransportClusterAllocationExplainAction.java:310)
> at org.elasticsearch.action.admin.cluster.allocation.TransportClusterAllocationExplainAction.masterOperation(TransportClusterAllocationExplainAction.java:68)
> at org.elasticsearch.action.support.master.TransportMasterNodeAction.masterOperation(TransportMasterNodeAction.java:85)
> at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$3.doRun(TransportMasterNodeAction.java:169)
> at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:510)
> at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)Throwable #2: java.lang.AssertionError: ClusterHealthResponse has timed out - returned: [{
> "cluster_name" : "TEST-CHILD_VM=[0]-CLUSTER_SEED=[-7292795637849911491]-HASH=[127918AC5CC6]-cluster",
> "status" : "green",
> "timed_out" : true,
> "number_of_nodes" : 3,
> "number_of_data_nodes" : 3,
> "active_primary_shards" : 5,
> "active_shards" : 10,
> "relocating_shards" : 0,
> "initializing_shards" : 0,
> "unassigned_shards" : 0,
> "delayed_unassigned_shards" : 0,
> "number_of_pending_tasks" : 0,
> "number_of_in_flight_fetch" : 0,
> "task_max_waiting_in_queue_millis" : 0,
> "active_shards_percent_as_number" : 100.0
> }]
> Expected: is <false>
> but: was <true>
> at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
> at org.elasticsearch.test.hamcrest.ElasticsearchAssertions.assertNoTimeout(ElasticsearchAssertions.java:111)
> at org.elasticsearch.test.ESIntegTestCase.ensureClusterSizeConsistency(ESIntegTestCase.java:1031)
> at org.elasticsearch.test.ESIntegTestCase.afterInternal(ESIntegTestCase.java:528)
> at org.elasticsearch.test.ESIntegTestCase.after(ESIntegTestCase.java:1980)
> at java.lang.Thread.run(Thread.java:745)
Assigning to @ywelsch since it looks like you're intimately familiar with this class, but /cc'ing @dakrone in case it's I misjudged and it's related to cluster explanations somehow.
The text was updated successfully, but these errors were encountered:
this comes from the fact that we might sometimes don't have unassinged info if we are initializing. For instance, this particular failure comes from here:
at org.elasticsearch.cluster.routing.ShardRouting.<init>(ShardRouting.java:80)
2> at org.elasticsearch.cluster.routing.ShardRouting.removeRelocationSource(ShardRouting.java:391)
2> at org.elasticsearch.cluster.routing.RoutingNodes.removeRelocationSource(RoutingNodes.java:500)
2> at org.elasticsearch.cluster.routing.allocation.AllocationService.cancelShard(AllocationService.java:570)
2> at org.elasticsearch.cluster.routing.allocation.AllocationService.applyFailedShard(AllocationService.java:534)
2> at org.elasticsearch.cluster.routing.allocation.AllocationService.deassociateDeadNodes(AllocationService.java:423)
2> at org.elasticsearch.cluster.routing.allocation.AllocationService.reroute(AllocationService.java:347)
I think not having unassigned info here is fine, we just need to check for null in that method.
So the test failure is in explaining the allocation of shards, but I believe the actual problem is an NPE in
ShardRouting
, whereunassignedInfo
is somehow null: https://github.com/elastic/elasticsearch/blob/master/core/src/main/java/org/elasticsearch/cluster/routing/ShardRouting.java#L243CI failure: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+periodic/1584/console
I can usually recreate with this, although it appears to only happen 1 in 50 or so tests:
The test kills a random node, which usually goes smoothly. But sometimes it NPE's out and causes all the test to drop all the other nodes:
Which ultimately makes the test fail with:
Assigning to @ywelsch since it looks like you're intimately familiar with this class, but /cc'ing @dakrone in case it's I misjudged and it's related to cluster explanations somehow.
The text was updated successfully, but these errors were encountered: