Fix for bug #1 #48

Closed

@ishan ishan commented Jan 29, 2013

If a .META. or -ROOT- lookup for a probe request leads to a
NonRecoverableException (e.g. when the region server serving either
is down, leading to an underlying RegionServerStoppedException), the
probe's callback (RetryRPC) is not executed. Now, if an upstream
request comes in and sets the meta cache to a stale entry, all further
requests will go to the wrong region server and get an NSRE
(NotServingRegionException). The cache entry for the region is never
deleted, as that is the responsibility of the probe, which is now
"lost"! The bug manifests itself at the client in various ways, like
the client hanging on region server restarts or continuous
PleaseThrottleExceptions even after the region is back online. This
patch fixes the issue by triggering the callback when there is a
NonRecoverableException for a request.
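
To make the failure mode concrete, here is a minimal toy model of it. It is written in Python rather than the client's own language, and it is not the client's actual code: `ToyClient`, `handle_nsre`, `probing`, `region_cache`, and `run_scenario` are hypothetical names invented for this illustration, and the assumption that a pending probe suppresses new probes for the same region is a simplification. The sketch only shows why a probe whose callback never fires can leave a stale cache entry in place, and how firing the callback on a non-recoverable failure avoids that.

```python
class NonRecoverableException(Exception):
    """Stand-in for the client's non-recoverable lookup failure."""


class ToyClient:
    """Hypothetical, heavily simplified model of the region cache and probes."""

    def __init__(self, callback_on_failure):
        # True models the patched behaviour, False the unpatched one.
        self.callback_on_failure = callback_on_failure
        self.region_cache = {}      # region name -> region server
        self.probing = set()        # regions with a probe in flight
        self.meta_available = True  # is the server hosting .META. up?

    def _lookup_meta(self, region):
        if not self.meta_available:
            raise NonRecoverableException("server hosting .META. is stopped")
        return "rs-new"

    def handle_nsre(self, region):
        """Called when a request for `region` fails with an NSRE."""
        if region in self.probing:
            return  # assumption: an in-flight probe already owns recovery
        self.probing.add(region)

        def on_probe_done(_result):
            # The probe's callback is the only place the entry is dropped.
            self.probing.discard(region)
            self.region_cache.pop(region, None)

        try:
            self._lookup_meta(region)
        except NonRecoverableException as exc:
            if self.callback_on_failure:
                on_probe_done(exc)  # patched: callback still fires
            return                  # unpatched: the probe is "lost"
        on_probe_done(None)


def run_scenario(callback_on_failure):
    """Return True if a stale cache entry survives the scenario."""
    client = ToyClient(callback_on_failure)
    client.meta_available = False
    client.handle_nsre("region-a")              # probe's meta lookup fails
    client.region_cache["region-a"] = "rs-old"  # upstream request caches a stale entry
    client.meta_available = True
    client.handle_nsre("region-a")              # a later request hits an NSRE
    return "region-a" in client.region_cache
```

In this model, `run_scenario(False)` leaves the stale entry behind because the lost probe keeps blocking recovery, while `run_scenario(True)` clears it once the meta lookup works again.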

ishan commented Jan 29, 2013

Fixes #1

tsuna commented Jan 30, 2013

Good catch! I think it's possible that this was indeed the reason why sometimes the client seemingly stops retrying to find where ROOT or META is.

@ghost ghost assigned tsuna Jan 30, 2013

tsuna commented Feb 3, 2013

I've rolled out this fix in production at Arista. I'm gonna let it bake there for a few days. I used to trip on this bug maybe about once a week. We'll see if I manage to reproduce the issue with this fix. I'm fairly confident you nailed it, so if that turns out to be right, I'll cut a 1.4.1 release just for this one change.

ishan commented Feb 4, 2013

I'm doing the same here at Rocketfuel. We face this issue when we do a rolling restart of the cluster. I'll report here if our clients survive the next rolling restart without any issues.

tsuna commented Feb 15, 2013

Tumblr confirmed that they too no longer see the issue with this fix. Thank you so much for your contribution. You cannot imagine how many hours I spent looking at the code to try to nail down this bug. I've been looking in the wrong place all along.

Your change is in v1.4.1, merged as 86c82ab

@tsuna tsuna closed this Feb 15, 2013

ishan commented Feb 16, 2013

Happy to help! It was fun chasing this one down. :)
