Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Snapshot restore operations throttle more than specified #13828

Merged
merged 1 commit into from Oct 3, 2015

Conversation

ywelsch
Copy link
Contributor

@ywelsch ywelsch commented Sep 28, 2015

Lucene's RateLimiter can do too much sleeping on small values (see also #6018).
The issue here is that calls to "pause" are not properly guarded in "restoreFile".

Instead of simply adding the guard, this commit uses the RateLimitingInputStream similar as for "snapshotFile".

@ywelsch
Copy link
Contributor Author

ywelsch commented Sep 29, 2015

@mikemccand can you have a look at this PR?

@clintongormley clintongormley added >bug review :Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs labels Sep 29, 2015
@ywelsch ywelsch self-assigned this Sep 30, 2015
@mikemccand
Copy link
Contributor

LGTM, this is a very bad bug! That small 4096 buffer means we are way over-throttling recovery :( We should push this back to 1.7.x?

@mikemccand
Copy link
Contributor

Are we directly calling RateLimiter.pause anywhere else in ES?

@ywelsch
Copy link
Contributor Author

ywelsch commented Sep 30, 2015

I checked the other places, and they are all properly guarded. Maybe RateLimiter.pause could have an assertion?

@mikemccand
Copy link
Contributor

Maybe RateLimiter.pause could have an assertion?

Well, it's called (correctly) by RateLimitingInputStream ... maybe we could ban ES from directly calling RateLimiter.pause using our forbidden APIs?

@ywelsch ywelsch force-pushed the fix-snapshot-restore-throttling branch from bae0790 to c95b2e2 Compare September 30, 2015 17:50
@ywelsch
Copy link
Contributor Author

ywelsch commented Sep 30, 2015

I'm very much in favour of back porting this to 1.7.x, as I saw a 10x slowdown on a production cluster.

I added your suggestions. Not closing the RateLimitingInputStream is ok, as it just delegates the close to PartSliceStream (which is now auto-closed).

Just banning RateLimiter.pause does not work as it is used in other places of ES (e.g. RecoverySourceHandler).

@mikemccand
Copy link
Contributor

Just banning RateLimiter.pause does not work as it is used in other places of ES (e.g. RecoverySourceHandler).

OK thanks, and it looks like it does the right thing (checks getMinPauseCheckBytes).

@mikemccand
Copy link
Contributor

LGTM, thanks @ywelsch!

@ywelsch
Copy link
Contributor Author

ywelsch commented Oct 1, 2015

@clintongormley which ES versions should we push this back to?

@clintongormley
Copy link

@ywelsch i'm good with 1.7.x and above

Lucene's RateLimiter can do too much sleeping on small values (see also elastic#6018).
The issue here is that calls to "pause" are not properly guarded in "restoreFile".

Instead of simply adding the guard, this commit uses the RateLimitingInputStream similar as for "snapshotFile".

Closes elastic#13828
@ywelsch ywelsch force-pushed the fix-snapshot-restore-throttling branch from 968419e to 03a4e22 Compare October 3, 2015 14:39
@ywelsch ywelsch changed the title Restore snapshot operations throttle more than specified Snapshot restore operations throttle more than specified Oct 3, 2015
ywelsch pushed a commit that referenced this pull request Oct 3, 2015
Snapshot restore operations throttle more than specified
@ywelsch ywelsch merged commit cdb0037 into elastic:master Oct 3, 2015
ywelsch pushed a commit that referenced this pull request Oct 3, 2015
Lucene's RateLimiter can do too much sleeping on small values (see also #6018).
The issue here is that calls to "pause" are not properly guarded in "restoreFile".

Instead of simply adding the guard, this commit uses the RateLimitingInputStream similar as for "snapshotFile".

Closes #13828
ywelsch pushed a commit that referenced this pull request Oct 3, 2015
Lucene's RateLimiter can do too much sleeping on small values (see also #6018).
The issue here is that calls to "pause" are not properly guarded in "restoreFile".

Instead of simply adding the guard, this commit uses the RateLimitingInputStream similar as for "snapshotFile".

Closes #13828
ywelsch pushed a commit that referenced this pull request Oct 3, 2015
Lucene's RateLimiter can do too much sleeping on small values (see also #6018).
The issue here is that calls to "pause" are not properly guarded in "restoreFile".

Instead of simply adding the guard, this commit uses the RateLimitingInputStream similar as for "snapshotFile".

Closes #13828
ywelsch pushed a commit that referenced this pull request Oct 3, 2015
Lucene's RateLimiter can do too much sleeping on small values (see also #6018).
The issue here is that calls to "pause" are not properly guarded in "restoreFile".

Instead of simply adding the guard, this commit uses the RateLimitingInputStream similar as for "snapshotFile".

Closes #13828
@pickypg pickypg added the v2.1.0 label Feb 27, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>bug :Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs v1.7.3 v2.0.0 v2.1.0 v2.2.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants