-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
VReplication: Improve handling of vplayer stalls #15797
base: main
Are you sure you want to change the base?
Conversation
Review ChecklistHello reviewers! 👋 Please follow this checklist when reviewing this Pull Request. General
Tests
Documentation
New flags
If a workflow is added or modified:
Backward compatibility
|
fe9ed6d
to
03a5b03
Compare
Signed-off-by: Matt Lord <mattalord@gmail.com>
03a5b03
to
bee30d4
Compare
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #15797 +/- ##
==========================================
- Coverage 68.47% 68.26% -0.22%
==========================================
Files 1562 1541 -21
Lines 197083 197145 +62
==========================================
- Hits 134962 134585 -377
- Misses 62121 62560 +439 ☔ View full report in Codecov by Sentry. |
Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
cb5fc2e
to
b78059d
Compare
Signed-off-by: Matt Lord <mattalord@gmail.com>
b78059d
to
d4517f2
Compare
…eout Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
@rohit-nayak-ps I could also just remove the transaction duration (stallHandler) part and stick with the simpler heartbeat based check. I do believe that will still be able to catch all of the cases, including the one seen in the original production issue. In that case the position was not getting updated via the replicated transaction or the heartbeat recording, so we should still have gotten the stalled vttablet log and the issue noted in the workflow's message field. In any event, I did another test with the latest state:
|
This might indeed be a good path to choose, since your other change is catching the known issues. Also while goroutines are lightweight, generating goroutines for every transaction could also lead to unpredictable performance impacts due to garbage collection, memory usage, scheduling issues etc especially since we will be enabling it in a high-qps situation. |
6050c19
to
d9a276c
Compare
Signed-off-by: Matt Lord <mattalord@gmail.com>
d9a276c
to
3839e90
Compare
@rohit-nayak-ps Having more than 1 active/concurrent goroutine per I've added comments and a new unit test that demonstrate it now working as intended: 3839e90 I wanted to get it working so that even if I do end up discarding it here we'll still have it in the PR history if we ever want to revive it or reuse it for something else. With all that being said, let me know what you think. I'm fine removing it for now since it was only well into my testing and debugging that I realized the existing heartbeat mechanism combined with the |
6f9d494
to
d1f65b6
Compare
Signed-off-by: Matt Lord <mattalord@gmail.com>
d1f65b6
to
d2fea8d
Compare
@mattlord now that's it's been said, it cannot be unsaid... :) |
Signed-off-by: Matt Lord <mattalord@gmail.com>
3ebfc1d
to
c666b14
Compare
@rohit-nayak-ps and @shlomi-noach this removes all of the What we lose w/o it is the ability to perform out-of-band monitoring and errors. Meaning that the heartbeat method will only detect a stall when it was due to a failure to commit the transaction which updates the timestamp for the workflow (whether it was done on its own or as part of replicating user generated events). If you both prefer that then I'll update the PR description accordingly. Thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mattlord thank you, I think we should go with this change, seeing that it's so succinct.
Good news is that I can repeat the exact scenario and symptoms seen in production with the Bad news is that neither of the two methods added here are generating the helpful error message. I'm still getting:
So I need to keep digging... |
Signed-off-by: Matt Lord <mattalord@gmail.com>
@rohit-nayak-ps and @shlomi-noach OK, now that I'm able to repeat the exact issue seen in production (which kicked off this work) — I was missing a detail in that test case which I only realized this week — I know that I've now fixed it with the latest commit. Before that commit, even with the
It's quite some time before that error surfaces and when in the state leading up to it you cannot even With the latest commit, however (
And you can What the PR now does is detect stalls on both "ends" in the vplayer:
I will now update the PR description as well. I will also add the |
Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
It says this in the description:
What causes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice catch!
@rohit-nayak-ps I'm sorry, I was wrong on the For the
That's here: vitess/go/vt/vttablet/tabletmanager/vreplication/controller.go Lines 322 to 326 in 8c97857
Which is waiting for That goroutine in turn is blocked on this one:
Which is here at line 293 (line numbers are a little off as this was done in my PR branch): vitess/go/vt/vttablet/tabletmanager/vreplication/vplayer.go Lines 289 to 294 in 8c97857
And the goroutine for that function:
Which is trying and retrying to commit the last relay log batch, cycling through queries like this:
Each of which are doing a table scan so take a few seconds. So it looks like we process the stop request and cancel the controller's context, which then causes the |
Description
Please see the issue for details about the problems we're attempting to solve/improve in this PR.
The improvements here are about detecting when we're not making any progress in the vplayer (running/replicating phase of a vreplication workflow) and showing/logging meaningful errors to replace the eventual generic EOF errors seen before this PR:
It's quite some time before that error surfaces when in this stalled state and when in the state leading up to this error you cannot even
stop
the workflow either as things have backed up so badly (thestop
command hangs for some time before eventually completing, see here).With this PR, you would now by default get this helpful error instead every 5 minutes (in the vreplication workflow message field, in the
_vt.vreplication_log
table, and in the vttablet logs):And you can
stop
the workflow during this time as we're not letting things back up so badly.What the PR now does is detect stalls on both "ends" in the vplayer:
Related Issue(s)
Checklist