Do not start a recovery process if the primary shard is currently allocated on a node which is not part of the cluster state #6024

bleskes · 2014-05-02T18:19:14Z

If a source node disconnects during recovery, the target node responds by cancelling the recovery. Typically the master then removes the disconnected node from the cluster state and promotes another shard copy to primary. The updated cluster state is sent to all nodes, and the target node starts recovering from the new primary. However, if losing the node causes the node count to drop below min_master_node, the master steps down and does not promote a shard immediately. When a new master is elected, it may publish a cluster state (whose purpose is to announce the new master) in which the routing has not yet been updated. This caused the target node to start a recovery from a non-existent node. Previously, we aborted that recovery without cleaning up the shard, which caused subsequent, correct cluster states to be ignored. Instead, we should not start the recovery process at all and wait for another cluster state to come in.
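The check described above can be sketched roughly as follows. This is a minimal illustration in Python, not the actual Elasticsearch change: `ClusterState`, `recovery_source`, and all field names are hypothetical stand-ins for the real cluster state and routing classes. The idea is simply that the target node only starts a recovery when the node hosting the primary is still listed in the cluster state it just received; otherwise it does nothing and waits for the next cluster state.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Optional


@dataclass(frozen=True)
class ClusterState:
    """Hypothetical, simplified view of a published cluster state."""
    version: int
    master_node_id: str
    node_ids: FrozenSet[str]        # nodes currently part of the cluster
    primaries: Dict[int, str]       # shard id -> node id hosting the primary


def recovery_source(state: ClusterState, shard_id: int) -> Optional[str]:
    """Return the node to recover from, or None if recovery must wait.

    None means: do not start (and later abort) a recovery; leave the shard
    untouched and re-evaluate when the next cluster state arrives.
    """
    primary_node = state.primaries.get(shard_id)
    if primary_node is None:
        # No primary is assigned yet.
        return None
    if primary_node not in state.node_ids:
        # Stale routing: the primary is still listed on a node that has
        # already left the cluster (e.g. right after a master re-election).
        return None
    return primary_node


# A new master announces itself before the routing table is updated:
stale = ClusterState(
    version=2,
    master_node_id="node_c",
    node_ids=frozenset({"node_b", "node_c"}),
    primaries={0: "node_a"},  # node_a has already disconnected
)
# A later state in which the primary has been promoted on a live node:
updated = ClusterState(
    version=3,
    master_node_id="node_c",
    node_ids=frozenset({"node_b", "node_c"}),
    primaries={0: "node_c"},
)

print(recovery_source(stale, 0))    # None: wait for another cluster state
print(recovery_source(updated, 0))  # node_c: safe to start recovery
```

Returning "no source" rather than starting and then aborting a recovery avoids leaving the shard in a half-initialized state, which is what caused later cluster states to be ignored.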

kimchy · 2014-05-02T21:19:42Z

LGTM

…ocated on a node which is not part of the cluster state

If a source node disconnects during recovery, the target node responds by canceling the recovery. Typically the master then removes the disconnected node from the cluster state and promotes another shard copy to primary. The updated cluster state is sent to all nodes, and the target node starts recovering from the new primary. However, if losing the node causes the node count to drop below min_master_node, the master steps down and does not promote a shard immediately. When a new master is elected, it may publish a cluster state (whose purpose is to announce the new master) in which the routing has not yet been updated. This caused the target node to start a recovery from a non-existent node. Previously, we aborted that recovery without cleaning up the shard, which caused subsequent, correct cluster states to be ignored. Instead, we should not start the recovery process at all and wait for another cluster state to come in.

Closes #6024

bleskes added v1.2.0 labels May 2, 2014

bleskes closed this in 694bf28 May 2, 2014

bleskes deleted the bug/recovery_missing_node branch May 20, 2014 09:15

clintongormley added the :Distributed/Recovery label Jun 7, 2015
