We keep running into this issue at StumbleUpon: whenever we do a full cluster restart, clients get stuck, especially those with high write throughput (1000 QPS and more). Essentially the client knows that every region is NSRE'd or down, and eventually everything ends up waiting on -ROOT- to come back online. But at some point the client "forgets" to retry finding -ROOT- and thus remains stuck forever.
I think I finally nailed this one down. It's not a race condition, but a flaw in the logic that handles NSREs. I need to write a unit test to try to reproduce this in a controlled fashion, but I think the problem goes like this:
The scenario is not entirely clear yet.
I wrote a unit test for the scenario above, and it passes, so this isn't it. There's something else I'm missing. How come there's an entry in got_nsre without a NSRETimer for it? I saw this bug during a load test a few days ago, but I still don't understand how it crops up.
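To make the suspected failure mode concrete, here's a minimal sketch (not asynchbase code; the class and method names are hypothetical) of how an entry could end up in a got_nsre-style map with no timer to clear it: if the path that records the NSRE can return before the retry timer is scheduled, the entry lingers forever and every RPC queued behind it hangs.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of the suspected bug: a region gets
// recorded as NSRE'd, but the timer that would eventually probe the
// region and clear the entry is never scheduled, so queued RPCs wait
// forever. All names here are made up for the sketch.
public class NsreSketch {
  // region name -> whether a retry timer was scheduled for it
  static final Map<String, Boolean> gotNsre = new HashMap<>();

  static void handleNsre(final String region, final boolean schedulingFails) {
    gotNsre.put(region, Boolean.FALSE);  // mark region as NSRE'd
    if (schedulingFails) {
      // BUG: we bail out here, leaving an entry in gotNsre
      // with no timer that will ever remove it.
      return;
    }
    gotNsre.put(region, Boolean.TRUE);   // timer scheduled; entry will clear
  }

  static boolean isStuck(final String region) {
    final Boolean timerScheduled = gotNsre.get(region);
    return timerScheduled != null && !timerScheduled;  // NSRE'd, no timer
  }
}
```

Under this hypothesis, the fix is to guarantee that every code path that inserts into the map either schedules the timer or removes the entry again.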
Fixed by OpenTSDB/asynchbase@86c82ab