Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AsyncShardFetch can hang if there are new nodes in cluster state #11615

Merged
merged 1 commit into from
Jun 12, 2015

Conversation

bleskes
Copy link
Contributor

@bleskes bleskes commented Jun 11, 2015

The AsyncShardFetch retrieves shard information from the different nodes in order to detirment the best location for unassigned shards. The class uses TransportNodesListGatewayStartedShards and TransportNodesListShardStoreMetaData in order to fetch this information. These actions, inherit from TransportNodesAction and are activated using a list of node ids. Those node ids are extracted from the cluster state that is used to assign shards.

If we perform a reroute and adding new nodes in the same cluster state update task, it is possible that the AsyncShardFetch administration is based on
a different cluster state then the one used by TransportNodesAction to resolve nodes. This can cause a problem since TransportNodesAction filters away unknown nodes, causing the administration in AsyncShardFetch to get confused.

This commit fixes this allowing to override node resolving in TransportNodesAction and uses the exact node ids transfered by AsyncShardFetch.

NOTE: this is currently not an issue as we never add nodes and reroute in the same task. This is however dangerous and should be fixed.

…state

The AsyncShardFetch retrieves shard information from the different nodes in order to detirment the best location for unassigned shards. The class uses TransportNodesListGatewayStartedShards and TransportNodesListShardStoreMetaData in order to fetch this information. These actions, inherit from TransportNodesAction and are activated using a list of node ids. Those node ids are extracted from the cluster state that is used to assign shards.

If we perform a reroute and adding new news in the same cluster state update task, it is possible that the AsyncShardFetch administration is based on
a different cluster state then the one used by TransportNodesAction to resolve nodes. This can cause a problem since TransportNodesAction filters away unkown nodes, causing the administration in AsyncShardFetch to get confused.

This commit fixes this allowing to override node resolving in TransportNodesAction and uses the exact node ids transfered by AsyncShardFetch
@kimchy
Copy link
Member

kimchy commented Jun 11, 2015

LGTM, even though this we don't have this problem today, I think its worth back porting it.

bleskes added a commit that referenced this pull request Jun 12, 2015
Internal: AsyncShardFetch can hang if there are new nodes in cluster state
@bleskes bleskes merged commit df8a300 into elastic:master Jun 12, 2015
@kevinkluge kevinkluge removed the review label Jun 12, 2015
bleskes added a commit to bleskes/elasticsearch that referenced this pull request Jun 12, 2015
…state

The AsyncShardFetch retrieves shard information from the different nodes in order to detirment the best location for unassigned shards. The class uses TransportNodesListGatewayStartedShards and TransportNodesListShardStoreMetaData in order to fetch this information. These actions, inherit from TransportNodesAction and are activated using a list of node ids. Those node ids are extracted from the cluster state that is used to assign shards.

If we perform a reroute and adding new news in the same cluster state update task, it is possible that the AsyncShardFetch administration is based on
a different cluster state then the one used by TransportNodesAction to resolve nodes. This can cause a problem since TransportNodesAction filters away unkown nodes, causing the administration in AsyncShardFetch to get confused.

This commit fixes this allowing to override node resolving in TransportNodesAction and uses the exact node ids transferred by AsyncShardFetch

Closes elastic#11615
bleskes added a commit to bleskes/elasticsearch that referenced this pull request Jun 12, 2015
…state

The AsyncShardFetch retrieves shard information from the different nodes in order to detirment the best location for unassigned shards. The class uses TransportNodesListGatewayStartedShards and TransportNodesListShardStoreMetaData in order to fetch this information. These actions, inherit from TransportNodesAction and are activated using a list of node ids. Those node ids are extracted from the cluster state that is used to assign shards.

If we perform a reroute and adding new news in the same cluster state update task, it is possible that the AsyncShardFetch administration is based on
a different cluster state then the one used by TransportNodesAction to resolve nodes. This can cause a problem since TransportNodesAction filters away unkown nodes, causing the administration in AsyncShardFetch to get confused.

This commit fixes this allowing to override node resolving in TransportNodesAction and uses the exact node ids transferred by AsyncShardFetch

Closes elastic#11615
@bleskes bleskes added the v1.6.1 label Jun 12, 2015
@bleskes
Copy link
Contributor Author

bleskes commented Jun 12, 2015

Pushed to 1.x & 1.6

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants