Client fails to recover on full HBase cluster restarts #1

Closed
tsuna opened this Issue · 3 comments

@tsuna
Owner

We keep running into this issue at StumbleUpon: whenever we do a full cluster restart, clients get stuck, especially ones that are experiencing high write throughput (1000 QPS and more). Essentially the client knows that every region is NSRE'd (i.e. its RegionServer answers with NotServingRegionException) or down, and eventually everything is waiting on -ROOT- to come back online. But at some point the client "forgets" to retry finding -ROOT- and thus remains stuck forever.

@tsuna
Owner

I think I finally nailed down this one. It's not a race condition, but a flaw in the logic that handles NSRE. I need to write a unit test to try to reproduce this in a controlled fashion, but I think the problem goes like this:

  • One RPC gets NSRE'd on some region that's being split
  • After a small timeout, an exists probe is sent to check if the region is still NSRE'd.
  • The exists probe causes a .META. lookup and its target region is set to be another region, because the target is now a daughter region.
  • The probe reaches the RS but the daughter region is also NSRE'd.
  • The probe comes back to handleNSRE but the lookup in got_nsre now searches for the daughter's name.
  • A new probe is created for the daughter, and the first probe is queued up.
  • The new probe doesn't succeed because the daughter is still NSRE'd, and for some reason the code doesn't register a timeout to retry again later.

The scenario is not entirely clear yet.
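
The last step of the hypothesis (a probe fails again and no timeout is registered to retry) can be illustrated with a small model. This is only a sketch under stated assumptions, not asynchbase's code: the class `NsreRetrySketch`, its methods, and the `rearmOnFailedProbe` flag are hypothetical, it tracks a single region, and a boolean stands in for an NSRETimer.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical single-region model of NSRE retry bookkeeping (not asynchbase code). */
public class NsreRetrySketch {
    private final Deque<String> parked = new ArrayDeque<>(); // RPCs waiting on the region
    private final boolean rearmOnFailedProbe;                // the behavior being compared
    private boolean timerArmed = false;                      // stands in for an NSRETimer

    public NsreRetrySketch(boolean rearmOnFailedProbe) {
        this.rearmOnFailedProbe = rearmOnFailedProbe;
    }

    /** An RPC was NSRE'd: park it, and arm a timer if it is the first one. */
    public void handleNsre(String rpc) {
        if (parked.isEmpty()) {
            timerArmed = true; // first RPC for the region: schedule the probe
        }
        parked.add(rpc);
    }

    /** The timer fires and the probe is sent; returns true once RPCs are released. */
    public boolean timerFired(boolean regionOnline) {
        if (!timerArmed) {
            throw new IllegalStateException("no timer armed");
        }
        timerArmed = false;              // a timer fires only once
        if (regionOnline) {
            parked.clear();              // region is back: release everything
            return true;
        }
        timerArmed = rearmOnFailedProbe; // still NSRE'd: retry later, or never
        return false;
    }

    /** True when RPCs are parked but nothing will ever wake them up. */
    public boolean isStuck() {
        return !parked.isEmpty() && !timerArmed;
    }

    public int parkedCount() {
        return parked.size();
    }

    public static void main(String[] args) {
        NsreRetrySketch noRearm = new NsreRetrySketch(false);
        noRearm.handleNsre("put-1");
        noRearm.timerFired(false);
        System.out.println("without re-arm, stuck: " + noRearm.isStuck());

        NsreRetrySketch rearm = new NsreRetrySketch(true);
        rearm.handleNsre("put-1");
        rearm.timerFired(false);
        System.out.println("with re-arm, stuck: " + rearm.isStuck());
    }
}
```

In this model, once a failed probe leaves the timer unarmed, RPCs that arrive later are simply queued behind the first one and nothing is ever retried, which matches the "stuck forever" symptom.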

@tsuna
Owner

I wrote a unit test for the scenario above, and it works fine, so this isn't it. There's something else I'm missing: how come there's an entry in got_nsre without an NSRETimer for it? I saw this bug during a load test a few days ago, but I still don't understand how it crops up.
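
One way to catch the bad state when it happens is to check the invariant directly: every region with an entry in got_nsre should also have a pending timer. The helper below is a hedged sketch of such a check. `NsreAudit` and `regionsWithoutTimer` are hypothetical names, and the two collections are assumed to be snapshots of region names taken from the client's internal state by some means not shown here.

```java
import java.util.Collection;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical debugging aid; not part of asynchbase. */
public class NsreAudit {
    /**
     * Returns the regions that have parked RPCs (an entry in a got_nsre-like
     * map) but no retry timer, i.e. regions nobody will ever probe again.
     */
    public static Set<String> regionsWithoutTimer(Collection<String> nsredRegions,
                                                  Collection<String> regionsWithTimer) {
        Set<String> orphans = new TreeSet<>(nsredRegions); // sorted for stable logs
        orphans.removeAll(regionsWithTimer);
        return orphans;
    }
}
```

Running a check like this periodically during a load test, and logging any non-empty result, would narrow down when an entry loses its timer.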

@bdd bdd referenced this issue from a commit
Commit has since been removed from the repository and is no longer available.
@tsuna tsuna closed this