
Add ability to snapshot replicating primary shards #6139

Merged

Conversation

@imotov
Contributor

imotov commented May 12, 2014

This change adds a new cluster state that waits for the replication of a shard to finish before starting the snapshot process. Because this change adds a new snapshot state, it will not be possible to do a rolling upgrade from the 1.1 branch to 1.2 while a snapshot is being executed in the cluster.

Closes #5531
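
For illustration only, a rough sketch of the idea; the type and value names below are illustrative, not the actual Elasticsearch code. The point is that a per-shard snapshot entry can now be in a WAITING state that is not treated as completed, so the snapshot does not proceed for that shard until its primary has started.

// Illustrative sketch, not the real Elasticsearch types.
public enum ShardSnapshotState {
    INIT,     // shard snapshot has been scheduled
    WAITING,  // primary is initializing or relocating; wait for it to start
    STARTED,  // shard snapshot is running
    SUCCESS,  // shard snapshot finished
    FAILED;   // shard snapshot failed

    // WAITING is deliberately not a completed state, so the snapshot only
    // finishes once the shard starts (or the snapshot is aborted).
    public boolean completed() {
        return this == SUCCESS || this == FAILED;
    }
}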

@s1monw
Contributor

s1monw commented May 13, 2014

I want to do a deeper review, but how do we handle backwards compatibility here? If a node is from a previous version and reads the "WAITING" state, which it doesn't know about, does it just fail?

@imotov
Contributor Author

imotov commented May 13, 2014

Yes, a node with the old version will fail when trying to process the new cluster state if there is a running snapshot that was started while a primary shard participating in the snapshot was relocating. That's why, as I mentioned in the comment, we cannot have snapshots running during a 1.1->1.2 rolling upgrade. We will need to add this to the release notes. Alternatively, I could also add a check for the presence of older-version nodes and, if they are present, fall back to the old behavior. That's not going to be a bulletproof solution though, since an older-version node can suddenly show up later and will not be able to join the cluster. What do you think?
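
A rough sketch of that best-effort check; NodeVersion, the 1.2 constant and the method names are illustrative stand-ins, not the real Elasticsearch API:

// Illustrative only: before relying on the new WAITING shard state, check
// whether every node in the cluster is on a version that understands it.
// If an older node is present, fall back to the pre-1.2 behavior instead of
// publishing a cluster state the old node cannot deserialize.
class WaitingStateCompatCheck {

    // Stand-in for a node version; real code would use the cluster's Version type.
    static class NodeVersion {
        final int major, minor;
        NodeVersion(int major, int minor) { this.major = major; this.minor = minor; }
        boolean before(NodeVersion o) {
            return major < o.major || (major == o.major && minor < o.minor);
        }
    }

    static final NodeVersion MIN_WAITING_VERSION = new NodeVersion(1, 2);

    static boolean clusterSupportsWaitingState(Iterable<NodeVersion> nodeVersions) {
        for (NodeVersion v : nodeVersions) {
            if (v.before(MIN_WAITING_VERSION)) {
                return false; // an older node is present: use the old behavior
            }
        }
        return true;
    }
}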

@s1monw
Contributor

s1monw commented May 13, 2014

+1 for best effort, but it might make sense to have this as a separate commit so it's easier to review? I will do a review of this in the meantime.

@imotov
Contributor Author

imotov commented May 14, 2014

Added version check as a separate commit.

} else {
this.shards = shards;
HashMap<String, ImmutableList.Builder<ShardId>> waitingIndices = null;
Contributor

Can we name this differently? It's super hard to differentiate between the member and the local var. I also would love to not use null as the marker but make it final and check if it is empty; the extra object creation here is not worth the complexity IMO. We might even think of putting this into a separate method for readability, i.e. Map<X,Y> findWaitingIndices()?
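
For illustration, a sketch of what such a findWaitingIndices() helper could look like; ShardId, ShardSnapshotStatus and State below are simplified stand-ins for the real types, not the actual implementation:

import com.google.common.collect.ImmutableList;
import com.google.common.collect.ImmutableMap;
import java.util.HashMap;
import java.util.Map;

// Sketch of the suggested refactoring: compute the waiting indices in a
// dedicated method instead of building the map inline with a null marker.
class WaitingIndicesExample {

    enum State { INIT, WAITING, STARTED, SUCCESS, FAILED }

    static class ShardId {
        final String index;
        final int id;
        ShardId(String index, int id) { this.index = index; this.id = id; }
        String getIndex() { return index; }
    }

    static class ShardSnapshotStatus {
        final State state;
        ShardSnapshotStatus(State state) { this.state = state; }
        State state() { return state; }
    }

    // Groups the shards that are in the WAITING state by index name.
    static ImmutableMap<String, ImmutableList<ShardId>> findWaitingIndices(
            Map<ShardId, ShardSnapshotStatus> shards) {
        Map<String, ImmutableList.Builder<ShardId>> waitingByIndex = new HashMap<>();
        for (Map.Entry<ShardId, ShardSnapshotStatus> entry : shards.entrySet()) {
            if (entry.getValue().state() == State.WAITING) {
                ShardId shardId = entry.getKey();
                ImmutableList.Builder<ShardId> builder = waitingByIndex.get(shardId.getIndex());
                if (builder == null) {
                    builder = ImmutableList.builder();
                    waitingByIndex.put(shardId.getIndex(), builder);
                }
                builder.add(shardId);
            }
        }
        ImmutableMap.Builder<String, ImmutableList<ShardId>> result = ImmutableMap.builder();
        for (Map.Entry<String, ImmutableList.Builder<ShardId>> entry : waitingByIndex.entrySet()) {
            result.put(entry.getKey(), entry.getValue().build());
        }
        return result.build();
    }
}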

@s1monw
Contributor

s1monw commented May 14, 2014

So basically, if we snapshot an index that has a primary that is initializing or relocating, we just wait for the shards to start up. I think we should document this behavior and also test that we can abort such a waiting snapshot, just in case it never gets to a state where we are able to snapshot. I wonder if we should also add tests where we have snapshots in multiple states, i.e. chaos snapshotting while indexing and relocating, and just make sure we are not seeing any failures or broken snapshots?

@imotov
Contributor Author

imotov commented May 16, 2014

@s1monw I've implemented suggested changes. Thanks!

long start = System.currentTimeMillis();
// Produce chaos for 30 sec or until the snapshot is done, whichever comes first
int randomIndices = 0;
while (System.currentTimeMillis() - start < 30000 && !snapshotIsDone("test-repo", "test-snap")) {
Contributor

Could we make this test somehow operations-based rather than time-based? The problem with time-based tests is that you have a very, very hard time getting anywhere near reproducibility. I know this is multithreaded anyway, so it won't really reproduce, but we can nicely randomize the number of ops per thread, which makes it a bit nicer :)
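
For illustration, a sketch of an operations-based variant of the loop above; randomIntBetween(), performRandomChaosOperation() and snapshotIsDone() are assumed helpers, not necessarily the methods in the actual test:

// Illustrative only: bound the chaos loop by a randomized operation count
// instead of wall-clock time, so a fixed seed fixes the amount of work done.
int maxOps = randomIntBetween(50, 200);
int opsPerformed = 0;
while (opsPerformed < maxOps && !snapshotIsDone("test-repo", "test-snap")) {
    performRandomChaosOperation();  // e.g. create/delete an index or relocate a shard
    opsPerformed++;
}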

@s1monw removed the bug label May 19, 2014
@s1monw
Contributor

s1monw commented May 19, 2014

FYI - I removed the labels in favor of the issue

@s1monw
Contributor

s1monw commented May 20, 2014

LGTM

This change adds a new cluster state that waits for the replication of a shard to finish before starting the snapshot process. Because this change adds a new snapshot state, pre-1.2.0 nodes will not be able to join a 1.2.0 cluster that is currently running a snapshot/restore operation.

Closes elastic#5531
@imotov merged commit 91c7892 into elastic:master May 20, 2014
@s1monw removed the review label Jun 18, 2014
@clintongormley added the >bug, v2.0.0-beta1, v1.2.0 and :Distributed/Snapshot/Restore labels Jun 7, 2015
@imotov deleted the issue-5531-snapshot-relocating-primaries branch May 1, 2020