Add ability to snapshot relocating primary shards #6139
Conversation
I want to do a deeper review, but how do we handle bw compat here? If a node is from a prev version and reads the "WAITING" state, which it doesn't know about, does it just fail?
Yes, a node with the old version will fail to process the new cluster state if there is a running snapshot that was started while a primary shard participating in the snapshot was relocating. That's why, as I mentioned in the comment, we cannot have snapshots running during a 1.1->1.2 rolling upgrade. We will need to add this to the release notes. Alternatively, I could add a check for the presence of older-version nodes and, if they are present, fall back to the old behavior. That's not going to be a bulletproof solution though, since an older-version node can show up later and will not be able to join the cluster. What do you think?
+1 for best effort, but it might make sense to have this as a separate commit so it's easier to review? I will do a review of this in the meanwhile.
Added version check as a separate commit. |
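The fallback discussed above can be sketched roughly as follows. This is a hypothetical simplification, not the actual Elasticsearch implementation: versions are modeled as plain comparable ints, and the class and method names are made up for illustration; the real code would read node versions from the cluster state.

```java
import java.util.List;

// Hypothetical sketch of the version-check fallback: before emitting the new
// WAITING shard state, scan the cluster's node versions and fall back to the
// old behavior (fail the shard instead of waiting) if any node predates the
// version that introduced WAITING.
public class WaitingStateCompat {

    static final int VERSION_WITH_WAITING_STATE = 1_02_00; // stands in for 1.2.0

    // True only if every node in the cluster understands the WAITING state.
    static boolean clusterSupportsWaitingState(List<Integer> nodeVersions) {
        for (int version : nodeVersions) {
            if (version < VERSION_WITH_WAITING_STATE) {
                return false; // an old node would fail to parse WAITING
            }
        }
        return true;
    }
}
```

As the discussion notes, this is only best effort: an older node that joins after the snapshot has started would still fail.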
} else {
    this.shards = shards;
    HashMap<String, ImmutableList.Builder<ShardId>> waitingIndices = null;
Can we name this differently? It's super hard to differentiate between the member and the local var. I also would love to not use null as the marker, but make it final and check if it is empty. The extra object creation here is not worth the complexity IMO. We might even think of putting this into a separate method for readability, i.e. Map<X,Y> findWaitingIndices()?
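A minimal sketch of the suggested refactoring, using plain JDK collections in place of the ShardId and Guava ImmutableList.Builder types from the codebase; the ShardEntry record and State enum here are stand-ins for illustration only:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the review suggestion: instead of building the waiting-indices
// map inline with a null marker, extract it into a dedicated method that
// always returns a (possibly empty) map.
public class WaitingIndices {

    enum State { STARTED, WAITING }

    record ShardEntry(String index, int shardId, State state) {}

    // Returns index name -> shard ids that are still waiting.
    // An empty map (rather than null) signals "nothing is waiting".
    static Map<String, List<Integer>> findWaitingIndices(List<ShardEntry> shards) {
        Map<String, List<Integer>> waiting = new HashMap<>();
        for (ShardEntry shard : shards) {
            if (shard.state() == State.WAITING) {
                waiting.computeIfAbsent(shard.index(), k -> new ArrayList<>())
                       .add(shard.shardId());
            }
        }
        return waiting;
    }
}
```

Returning an empty map keeps the caller's check as a simple `isEmpty()` instead of a null comparison, at the cost of one extra allocation in the common case.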
So basically, if we snapshot an index that has a primary that is initializing or relocating, we just wait for the shards to start up. I think we should document this behavior and also test that we can abort such a waiting snapshot, just in case it never gets to a state where we are able to snapshot. I wonder if we should also add tests where we have snapshots in multiple states, i.e. chaos snapshotting while indexing and relocating, and just make sure we are not seeing any failures or broken snapshots?
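The abort requirement raised above can be sketched as a small transition rule. The state names below mirror the discussion, not the actual Elasticsearch snapshot enum, and the class is hypothetical:

```java
// Sketch of the abort path discussed above: a snapshot shard in WAITING
// (primary still initializing/relocating) must be abortable so the snapshot
// cannot hang forever if the shard never starts.
public class SnapshotShardState {

    enum State { WAITING, STARTED, ABORTED, DONE, FAILED }

    // Abort is legal from WAITING (snapshot never started for this shard)
    // and STARTED (snapshot in progress); terminal states are left as-is.
    static State abort(State current) {
        switch (current) {
            case WAITING:
            case STARTED:
                return State.ABORTED;
            default:
                return current;
        }
    }
}
```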
@s1monw I've implemented suggested changes. Thanks! |
long start = System.currentTimeMillis();
// Produce chaos for 30 sec or until snapshot is done, whichever comes first
int randomIndices = 0;
while (System.currentTimeMillis() - start < 30000 && !snapshotIsDone("test-repo", "test-snap")) {
Could we make this test somehow operations-based rather than time-based? The problem with time-based tests is that you have a very hard time getting anywhere near reproducibility. I know this is multithreaded anyway, so it won't really reproduce, but we can nicely randomize the num ops per thread, which makes it a bit nicer :)
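The operations-based variant could look roughly like the sketch below. This is a hypothetical simplification: `SnapshotCheck`, `runChaos`, and the repo/snapshot names are stand-ins modeled on the test snippet above, not real test-framework APIs.

```java
// Sketch of the review suggestion: bound the chaos loop by a (randomizable)
// number of operations instead of wall-clock time, so a given seed pins down
// how much work the test performs.
public class ChaosLoop {

    interface SnapshotCheck {
        boolean snapshotIsDone(String repo, String snapshot);
    }

    // Runs up to maxOps chaos iterations, stopping early once the snapshot
    // completes. Returns the number of operations actually performed.
    static int runChaos(int maxOps, SnapshotCheck check, Runnable chaosOp) {
        int ops = 0;
        while (ops < maxOps && !check.snapshotIsDone("test-repo", "test-snap")) {
            chaosOp.run(); // e.g. index a doc, delete an index, relocate a shard
            ops++;
        }
        return ops;
    }
}
```

In a real test, `maxOps` would come from the randomized-testing seed, so a failing run reports a seed that also fixes the operation count.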
FYI - I removed the labels in favor of the issue
LGTM
This change adds a new cluster state that waits for the relocation of a shard to finish before starting the snapshot process. Because this change adds a new snapshot state, pre-1.2.0 nodes will not be able to join a 1.2.0 cluster that is currently running a snapshot/restore operation. Closes elastic#5531
This change adds a new cluster state that waits for the relocation of a shard to finish before starting the snapshot process. Because this change adds a new snapshot state, it will not be possible to do a rolling upgrade from the 1.1 branch to 1.2 while a snapshot is being executed in the cluster.
Closes #5531