
Sync up snapshot shard status on a master restart #11450

Conversation

imotov (Contributor) commented Jun 2, 2015

When a snapshot operation on a particular shard finishes, the data node where the shard resides sends an update shard status request to the master node to indicate that the operation on the shard is done. When the master node receives the command, it queues a cluster state update task and acknowledges receipt of the command to the data node.
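
A minimal sketch of that handshake, assuming hypothetical stand-in types (Transport, ClusterService, ShardDoneRequest) rather than the actual Elasticsearch classes:

    // Hypothetical stand-ins for the transport and cluster state machinery.
    interface Transport { void sendToMaster(ShardDoneRequest request, Runnable onAck); }
    interface ClusterService { void submitStateUpdateTask(Runnable task); }

    record ShardDoneRequest(String snapshot, int shardId) {}

    class MasterHandler {
        private final ClusterService clusterService;

        MasterHandler(ClusterService clusterService) {
            this.clusterService = clusterService;
        }

        void onShardDone(ShardDoneRequest request, Runnable ack) {
            // The status change is only queued here; the data node is acked
            // before the task actually runs. If the master restarts while the
            // task is still queued, the shard's completed status is lost.
            clusterService.submitStateUpdateTask(() -> markShardDone(request));
            ack.run();
        }

        private void markShardDone(ShardDoneRequest request) {
            // Apply the completed shard status to the cluster state (elided).
        }
    }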

The update snapshot shard status tasks have relatively low priority, so during cluster instability they tend to get stuck at the end of the queue. If the master node is restarted before processing these tasks, the information about the shards can be lost: the new master assumes they are still in progress while the data node considers them already done.

This commit adds a retry mechanism that compares the cluster state of a newly elected master with the current state of snapshot shards and updates the cluster state on the master again if needed.
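
A hedged sketch of what such a retry can look like on the data node side, reusing the hypothetical types from the sketch above; the names here are illustrative, not the actual SnapshotsService code:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    class ShardStatusSyncer {
        // Shards this data node has already finished, keyed by shard id.
        private final Map<Integer, ShardDoneRequest> completedShards = new ConcurrentHashMap<>();
        private final Transport transport;

        ShardStatusSyncer(Transport transport) {
            this.transport = transport;
        }

        // Called on every cluster state update; masterChanged and the master's
        // view of in-progress snapshot shards are illustrative inputs.
        void onNewClusterState(boolean masterChanged, Map<Integer, String> masterInProgressShards) {
            if (!masterChanged) {
                return; // updates are only lost across a master restart/election
            }
            for (Map.Entry<Integer, String> shard : masterInProgressShards.entrySet()) {
                ShardDoneRequest done = completedShards.get(shard.getKey());
                if (done != null) {
                    // The new master still thinks this shard is snapshotting, but
                    // it already finished locally: the original status update was
                    // lost, so send it again.
                    transport.sendToMaster(done, () -> completedShards.remove(shard.getKey()));
                }
            }
        }
    }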

Closes #11314

@imotov added the >bug, review, :Distributed/Snapshot/Restore (anything directly related to the `_snapshot/*` APIs), v1.6.0, and v2.0.0-beta1 labels on Jun 2, 2015
@@ -830,7 +836,17 @@ private void processIndexShardSnapshots(SnapshotMetaData snapshotMetaData) {
     for (Map.Entry<ShardId, SnapshotMetaData.ShardSnapshotStatus> shard : entry.shards().entrySet()) {
         IndexShardSnapshotStatus snapshotStatus = snapshotShards.shards.get(shard.getKey());
         if (snapshotStatus != null) {
-            snapshotStatus.abort();
+            if (snapshotStatus.stage() == IndexShardSnapshotStatus.Stage.STARTED) {

s1monw (Contributor) commented on this diff: Can we use switch/case statements for this? It seems easier to read.
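
For illustration, a sketch of how that check could read as a switch over the stage; the DONE and FAILURE branches and the notify helpers are assumptions about the surrounding code, not the merged diff:

    switch (snapshotStatus.stage()) {
        case STARTED:
            // snapshot is still running on this shard and can be aborted
            snapshotStatus.abort();
            break;
        case DONE:
            // already finished locally; report success to the master
            notifySuccessfulSnapshotShard(shard.getKey()); // hypothetical helper
            break;
        case FAILURE:
            notifyFailedSnapshotShard(shard.getKey()); // hypothetical helper
            break;
        default:
            break;
    }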

s1monw (Contributor) commented Jun 3, 2015

Left a minor comment; LGTM otherwise.

@imotov force-pushed the issue-11314-update-snapshot-shards-on-master-change branch from 5286c18 to f0e6add on June 3, 2015.
@imotov merged commit f0e6add into elastic:master on Jun 3, 2015.
@imotov removed the review label on Jun 3, 2015.
@clintongormley changed the title from "Snapshot/Restore: sync up snapshot shard status on a master restart" to "Sync up snapshot shard status on a master restart" on Jun 8, 2015.
@imotov deleted the issue-11314-update-snapshot-shards-on-master-change branch on May 1, 2020.
Development

Successfully merging this pull request may close these issues.

Snapshot/Restore: restart of a master node during snapshot can lead to hanging snapshots