Client fails to recover on full HBase cluster restarts #1

tsuna opened this Issue · 3 comments



We keep running into this issue at StumbleUpon: whenever we do a full cluster restart, clients get stuck, especially ones that are experiencing high write throughput (1000 QPS and more). Essentially the client knows that every region is NSRE'd or down and eventually everything is waiting on -ROOT- to come back online. But at some point the client "forgets" to retry finding -ROOT- and thus remains stuck forever.


I think I finally nailed down this one. It's not a race condition, but a flaw in the logic that handles NSRE. I need to write a unit test to try to reproduce this in a controlled fashion, but I think the problem goes like this:

  • One RPC gets NSRE'd on some region that's being split.
  • After a small timeout, an exists probe is sent to check if the region is still NSRE'd.
  • The exists probe causes a .META. lookup and its target region is set to be another region, because the target is now a daughter region.
  • The probe reaches the RS but the daughter region is also NSRE'd.
  • The probe comes back to handleNSRE but the lookup in got_nsre now searches for the daughter's name.
  • A new probe is created for the daughter, and the first probe is queued up.
  • The new probe doesn't succeed because the daughter is still NSRE'd, and for some reason the code doesn't register a timeout to retry again later.

The scenario is not entirely clear yet.
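
To make the suspected sequence easier to follow, here is a minimal toy model of it in Java. It is only a sketch under stated assumptions, not the client's actual code: the class `NsreScenario`, the fields `gotNsre` and `armedTimers`, the methods `handleNsre` and `timerFired`, and the `armTimer` flag are all hypothetical stand-ins for the real `got_nsre` / `handleNSRE` / `NSRETimer` machinery. The flag exists only to simulate the hypothesized missing retry in the last step.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy model of the suspected sequence. Hypothetical; NOT the client's real code. */
final class NsreScenario {
  /** Region name -> RPCs parked until that region is back online (probe first). */
  final Map<String, Deque<String>> gotNsre = new HashMap<>();
  /** Regions with a retry timer currently armed (stand-in for NSRETimer). */
  final Set<String> armedTimers = new HashSet<>();
  private int probeCount = 0;

  /**
   * Called when an RPC fails with NSRE on the given region.
   * Passing armTimer=false simulates the suspected bug (no retry scheduled).
   */
  void handleNsre(String rpc, String region, boolean armTimer) {
    Deque<String> parked = gotNsre.get(region);
    if (parked == null) {
      // First NSRE seen under this region name: create a fresh probe for it.
      parked = new ArrayDeque<>();
      parked.add("probe-" + (++probeCount) + "@" + region);
      gotNsre.put(region, parked);
    }
    // The failed RPC (possibly an older probe) waits behind the probe.
    parked.add(rpc);
    if (armTimer) {
      armedTimers.add(region);
    }
  }

  /** The timer fires: it is consumed and the region's probe is sent out. */
  String timerFired(String region) {
    armedTimers.remove(region);
    return gotNsre.get(region).peekFirst();
  }

  /** Replays the bullet list above. The parent entry is never cleaned up here. */
  static NsreScenario run(boolean armTimerForDaughter) {
    NsreScenario s = new NsreScenario();
    // An RPC gets NSRE'd on a region that is being split.
    s.handleNsre("put-1", "parent", true);
    // After the timeout, the probe is sent.
    String probe = s.timerFired("parent");
    // The .META. lookup retargets the probe to the daughter, which is also
    // NSRE'd, so the probe comes back and is looked up under the daughter's name.
    s.handleNsre(probe, "daughter", armTimerForDaughter);
    return s;
  }

  public static void main(String[] args) {
    NsreScenario s = run(false);
    System.out.println("parked RPCs:  " + s.gotNsre);
    System.out.println("armed timers: " + s.armedTimers);
  }
}
```

With `run(false)`, the model ends with an entry for the daughter and no armed timer, which is the stuck state described above. With `run(true)`, a timer is armed for the daughter and a retry would happen.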


I wrote a unit test for the scenario above, and it works fine. So this isn't it. There's something else I'm missing. How come there's an entry in got_nsre without an NSRETimer for it? I saw this bug during a load test a few days ago, but I still don't understand how it crops up.
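
One way to catch the bad state when it appears (for example during a load test) would be to check the invariant directly: every region with parked RPCs should have a retry timer. The following is a hedged sketch of such a check. The class `NsreInvariant`, the method `regionsWithoutTimer`, and the shapes of both arguments are assumptions made for illustration; the real `got_nsre` structure and `NSRETimer` bookkeeping may look quite different.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical debugging aid; not part of the client. */
final class NsreInvariant {
  /**
   * Returns, in sorted order, the region names that have parked RPCs
   * but no armed retry timer. An empty result means the invariant holds.
   */
  static Set<String> regionsWithoutTimer(
      Map<String, ? extends Collection<?>> gotNsre, Set<String> armedTimers) {
    Set<String> orphans = new TreeSet<>(gotNsre.keySet());
    orphans.removeAll(armedTimers);
    return orphans;
  }

  public static void main(String[] args) {
    Map<String, List<String>> gotNsre = new HashMap<>();
    gotNsre.put("parent", Arrays.asList("probe-1", "put-1"));
    gotNsre.put("daughter", Arrays.asList("probe-2", "probe-1"));
    Set<String> armedTimers = new HashSet<>();
    armedTimers.add("parent");
    // Only "daughter" has parked RPCs without a timer.
    System.out.println(regionsWithoutTimer(gotNsre, armedTimers)); // prints [daughter]
  }
}
```

Logging the result of such a check whenever an entry is added or a timer fires would at least narrow down which code path leaves an entry behind.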

@tsuna tsuna closed this