Fix for bug #1 #48

ishan · 2013-01-29T17:32:27Z

If a .META. or -ROOT- lookup for a probe request fails with a
NonRecoverableException (e.g. when the region server hosting either
is down, which surfaces as an underlying RegionServerStoppedException),
the probe's callback (RetryRPC) is never executed. If an upstream
request then comes in and sets the meta cache to a stale entry, all
further requests go to the wrong region server and fail with an NSRE
(NotServingRegionException). The cache entry for the region is never
deleted, because deleting it is the responsibility of the probe, which
is now "lost"! The bug manifests itself at the client in various ways,
such as the client hanging on region server restarts, or continuous
PleaseThrottleExceptions even after the region is back online. This
patch fixes the issue by triggering the callback when a request fails
with a NonRecoverableException.
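
To make the failure mode concrete, here is a rough toy model in Java. It is not asynchbase code: the class, its fields, and its methods are all hypothetical names made up for illustration, and the meta lookup is stood in for by a plain CompletableFuture. It only shows why skipping the callback on the error path leaves the region stuck in the "being probed" state.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.CompletableFuture;

// Toy model of the bug described above. None of these names are real
// asynchbase APIs; they are assumptions made for this sketch only.
final class ProbeCallbackSketch {

  // Regions currently marked unavailable, each with one probe in flight.
  final Set<String> regionsBeingProbed = new HashSet<>();

  // Stand-in for the probe's retry callback: it is the only code that
  // clears the "unavailable" state for the region.
  void retryCallback(String region) {
    regionsBeingProbed.remove(region);
  }

  // Sends a probe whose meta lookup is modelled by the given future.
  // runCallbackOnError == false models the old behaviour,
  // runCallbackOnError == true models the behaviour after the patch.
  void sendProbe(String region,
                 CompletableFuture<String> metaLookup,
                 boolean runCallbackOnError) {
    regionsBeingProbed.add(region);
    metaLookup.whenComplete((location, error) -> {
      if (error == null || runCallbackOnError) {
        retryCallback(region);
      }
      // Otherwise the probe is "lost": nothing will ever clean up.
    });
  }

  // Runs one probe whose lookup fails, and reports whether the region
  // is still stuck in the probed set afterwards.
  static boolean staysStuck(boolean runCallbackOnError) {
    ProbeCallbackSketch client = new ProbeCallbackSketch();
    CompletableFuture<String> failedLookup = new CompletableFuture<>();
    failedLookup.completeExceptionally(
        new RuntimeException("stand-in for NonRecoverableException"));
    client.sendProbe("region-1", failedLookup, runCallbackOnError);
    return client.regionsBeingProbed.contains("region-1");
  }
}
```

In this model, `staysStuck(false)` leaves the region marked forever, which is the hang described above, while `staysStuck(true)` clears it because the callback also runs on the failure path.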

ishan · 2013-01-29T17:34:34Z

Fixes #1

tsuna · 2013-01-30T07:02:01Z

Good catch! I think it's possible that this was indeed the reason why the client sometimes seemingly stopped retrying to find where ROOT or META is.

tsuna · 2013-02-03T20:49:18Z

I've rolled out this fix in production at Arista. I'm gonna let it bake there for a few days. I used to trip on this bug maybe about once a week. We'll see if I manage to reproduce the issue with this fix. I'm fairly confident you nailed it, so if that turns out to be right, I'll cut a 1.4.1 release just for this one change.

ishan · 2013-02-04T02:33:07Z

I'm doing the same here at Rocketfuel. We hit this issue when we do a rolling restart of the cluster. I'll report here if our clients survive the next rolling restart without any issues.

tsuna · 2013-02-15T08:56:12Z

Tumblr confirmed that they too no longer see the issue with this fix. Thank you so much for your contribution. You cannot imagine how many hours I spent looking at the code to try to nail down this bug. I'd been looking in the wrong place all along.

Your change is in v1.4.1, merged as 86c82ab

ishan · 2013-02-16T02:27:16Z

Happy to help! It was fun chasing this one down. :)

ghost assigned tsuna Jan 30, 2013

tsuna closed this Feb 15, 2013
