Client fails to recover on full HBase cluster restarts #1

Closed
tsuna opened this Issue Mar 18, 2011 · 3 comments

1 participant

@tsuna
OpenTSDB member

We keep running into this issue at StumbleUpon: whenever we do a full cluster restart, clients get stuck, especially those handling high write throughput (1000 QPS or more). Essentially the client knows that every region is NSRE'd or down, and eventually everything ends up waiting for -ROOT- to come back online. But at some point the client "forgets" to retry finding -ROOT- and thus remains stuck forever.
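
For reference, the expected behavior is roughly: keep re-probing for -ROOT- on a timer until it's back, then replay everything that was queued while it was gone. A minimal sketch of that retry loop (all names here are made up; this isn't the actual asynchbase code):

```java
import java.util.Timer;
import java.util.TimerTask;

final class RootLocator {
  private final Timer timer = new Timer("root-locator", true);
  private static final long RETRY_DELAY_MS = 1000;

  /** Tries to find -ROOT-; on failure it MUST re-schedule itself. */
  void locateRoot() {
    final boolean found = tryLookupRootFromZooKeeper();  // hypothetical helper
    if (found) {
      retryPendingRpcs();  // hypothetical: re-send RPCs queued while -ROOT- was down
      return;
    }
    // The crucial part: every failure path has to re-arm this timer.
    // If any path returns without scheduling the retry, the client
    // "forgets" about -ROOT- and stays stuck forever.
    timer.schedule(new TimerTask() {
      @Override public void run() { locateRoot(); }
    }, RETRY_DELAY_MS);
  }

  private boolean tryLookupRootFromZooKeeper() { return false; }  // placeholder
  private void retryPendingRpcs() {}                              // placeholder
}
```

The symptom suggests that some failure path returns without re-arming that timer.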

@tsuna
OpenTSDB member

I think I finally nailed down this one. It's not a race condition, but a flaw in the logic that handles NSRE. I need to write a unit test to try to reproduce this in a controlled fashion, but I think the problem goes like this:

  • One RPC gets NSRE'd on some region that's being split.
  • After a small timeout, an exists probe is sent to check whether the region is still NSRE'd.
  • The exists probe causes a .META. lookup, and its target is set to a different region, because that part of the key range is now served by a daughter region.
  • The probe reaches the RS, but the daughter region is also NSRE'd.
  • The probe comes back to handleNSRE, but the lookup in got_nsre now searches for the daughter's name.
  • A new probe is created for the daughter, and the first probe is queued up behind it.
  • The new probe doesn't succeed because the daughter is still NSRE'd, and for some reason the code doesn't register a timeout to retry again later.

The scenario is not entirely clear yet.
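
To make the suspected flow concrete, here's a rough sketch of how I picture the bookkeeping around got_nsre; the types and helper names are simplified stand-ins, not the real handleNSRE:

```java
import java.util.ArrayList;
import java.util.HashMap;

final class NsreSketch {
  static final class Rpc {}  // hypothetical stand-in for an outstanding RPC

  // got_nsre maps a region name to the RPCs queued while that region is NSRE'd.
  private final HashMap<String, ArrayList<Rpc>> got_nsre =
      new HashMap<String, ArrayList<Rpc>>();

  /** Called when an RPC (possibly an exists probe) comes back NSRE'd. */
  void handleNSRE(final Rpc rpc, final String regionName) {
    ArrayList<Rpc> queued = got_nsre.get(regionName);
    if (queued == null) {
      // First NSRE seen for this region (e.g. the daughter of a split):
      // a new entry and a new probe are created here.
      queued = new ArrayList<Rpc>();
      got_nsre.put(regionName, queued);
      sendExistsProbe(regionName);     // hypothetical
      scheduleProbeRetry(regionName);  // <-- the step that appears to go missing
    }
    queued.add(rpc);  // park the RPC until the region comes back
  }

  private void sendExistsProbe(final String regionName) {}     // placeholder
  private void scheduleProbeRetry(final String regionName) {}  // placeholder
}
```

If the retry is only scheduled on some of the paths into that if block, a daughter region can end up parked in got_nsre with nothing left to wake it up.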

@tsuna
OpenTSDB member

I wrote a unit test for the scenario above, and it works fine, so this isn't it. There's something else I'm missing. How come there's an entry in got_nsre without an NSRETimer for it? I saw this bug during a load test a few days ago, but I still don't understand how it crops up.
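
One way to pin this down in the tests might be an invariant check: every region present in got_nsre should have a retry timer armed. Purely illustrative (the field names are stand-ins for whatever actually tracks the timers):

```java
import java.util.Map;

final class NsreInvariant {
  static final class NSRETimer {}  // hypothetical placeholder

  /** Fails loudly if a region is marked NSRE'd but has no retry timer armed. */
  static void assertEveryNsreHasTimer(final Map<String, ?> got_nsre,
                                      final Map<String, NSRETimer> nsre_timers) {
    for (final String region : got_nsre.keySet()) {
      if (!nsre_timers.containsKey(region)) {
        throw new AssertionError("region " + region
            + " is in got_nsre but has no NSRETimer scheduled");
      }
    }
  }
}
```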

@tsuna
OpenTSDB member
tsuna closed this Feb 15, 2013