AsyncShardFetch can hang if there are new nodes in cluster state #11615

bleskes · 2015-06-11T21:24:44Z

The AsyncShardFetch retrieves shard information from the different nodes in order to determine the best location for unassigned shards. The class uses TransportNodesListGatewayStartedShards and TransportNodesListShardStoreMetaData to fetch this information. These actions inherit from TransportNodesAction and are activated using a list of node ids. Those node ids are extracted from the cluster state that is used to assign shards.

If we perform a reroute and add new nodes in the same cluster state update task, the AsyncShardFetch administration may be based on a different cluster state than the one TransportNodesAction uses to resolve nodes. This is a problem because TransportNodesAction filters out unknown nodes, so the AsyncShardFetch bookkeeping can end up waiting for responses that never arrive and hang.

This commit fixes this by allowing node resolution to be overridden in TransportNodesAction and by using the exact node ids transferred by AsyncShardFetch.
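To illustrate the mismatch, here is a minimal standalone sketch. It is not the Elasticsearch code: the class `FetchSketch` and its methods are hypothetical names, and node ids are modelled as plain strings. It only shows how filtering requested ids against a different set of known nodes can leave an id permanently pending, and how using the caller's ids verbatim avoids that.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration only; none of these names exist in Elasticsearch.
final class FetchSketch {

    // Filtering behaviour: ids missing from the resolver's view of the cluster are silently dropped.
    static Set<String> resolveAgainstState(Set<String> requested, Set<String> knownNodes) {
        Set<String> resolved = new TreeSet<>(requested);
        resolved.retainAll(knownNodes);
        return resolved;
    }

    // Overridden behaviour: use exactly the ids the caller passed in.
    static Set<String> resolveExact(Set<String> requested) {
        return new TreeSet<>(requested);
    }

    // Ids the caller still expects an answer for once every resolved node has been accounted for.
    static Set<String> stillPending(Set<String> requested, Set<String> resolved) {
        Set<String> pending = new TreeSet<>(requested);
        pending.removeAll(resolved);
        return pending;
    }
}
```

If the caller requests `node1`, `node2` and `node3` while the resolver only knows the first two, the filtering variant leaves `node3` pending indefinitely. With exact resolution every requested id is accounted for, under the sketch's assumption that each resolved node yields either a response or a failure.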

NOTE: this is currently not an issue, as we never add nodes and reroute in the same task. The mismatch is nonetheless dangerous and should be fixed.

kimchy · 2015-06-11T22:49:40Z

LGTM. Even though we don't have this problem today, I think it's worth backporting.

Internal: AsyncShardFetch can hang if there are new nodes in cluster state

Closes elastic#11615

bleskes · 2015-06-12T10:21:22Z

Pushed to 1.x & 1.6

bleskes added >bug v2.0.0-beta1 review labels Jun 11, 2015

bleskes added a commit that referenced this pull request Jun 12, 2015

Merge pull request #11615 from bleskes/async_fetch_non_existent_nodes

df8a300

Internal: AsyncShardFetch can hang if there are new nodes in cluster state

bleskes merged commit df8a300 into elastic:master Jun 12, 2015

kevinkluge removed the review label Jun 12, 2015

bleskes added the v1.6.1 label Jun 12, 2015
