VReplication: Improve handling of vplayer stalls #15797

Open
mattlord wants to merge 36 commits into main from vplayer_batch_trx_timeout

Conversation

mattlord
Contributor

@mattlord mattlord commented Apr 25, 2024

Description

⚠️ I do not want to merge this into v20 just before RC, so I'm adding the Do Not Merge label until the RC is cut and main becomes v21. I will then remove this label and add the v21.0.0 milestone.

Please see the issue for details about the problems we're attempting to solve/improve in this PR.

The improvements here are about detecting when we're not making any progress in the vplayer (running/replicating phase of a vreplication workflow) and showing/logging meaningful errors to replace the eventual generic EOF errors seen before this PR:

vttablet: rpc error: code = Unknown desc = stream (at source tablet) error @ 1cb76134-1dc9-11ef-82c4-5f4020e10986:1-537840: unexpected EOF\nio.ReadFull(packet body of length 2658) failed (errno 2013) (sqlstate HY000)

It takes quite some time for that error to surface when in this stalled state, and in the period leading up to it you cannot even stop the workflow because things have backed up so badly (the stop command hangs for some time before eventually completing; see here).

With this PR, you would now by default get this helpful error instead every 5 minutes (in the vreplication workflow message field, in the _vt.vreplication_log table, and in the vttablet logs):

error in stream 3, will retry after 5s: relay log I/O stalled: progress stalled; vplayer was unable to replicate the transaction in a timely manner; examine the target mysqld instance health and the replicated queries' EXPLAIN output to see why queries are taking unusually long

And you can stop the workflow during this time as we're not letting things back up so badly.

What the PR now does is detect stalls on both "ends" in the vplayer:

  1. It will detect relay log I/O stalls — where we cannot successfully replicate the previous relay log contents and keep retrying, so the relay log fills up with the next batch and we cannot write to it anymore because we hit the max relay log size / items. This is a stall on the input side of the vplayer.
  2. It will detect heartbeat recording failures — where we cannot record the heartbeat in the vreplication workflow record within the minimum heartbeat interval. This is a stall on the output side of the vplayer. (See the illustrative sketch after this list.)
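
For illustration only (not the actual Vitess implementation), here is a minimal Go sketch of those two checks. The names writeToRelayLog, recordHeartbeat, and progressDeadline, as well as the channel standing in for the relay log, are assumptions made for the example rather than real Vitess identifiers.

package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errStalled mirrors the kind of error this PR surfaces when no progress is being made.
var errStalled = errors.New("relay log I/O stalled: progress stalled; examine the target mysqld instance health")

// writeToRelayLog illustrates the input-side check: if the relay log is full because the
// previous batch is still being applied, and it stays full past the deadline, report a
// stall instead of blocking silently.
func writeToRelayLog(ctx context.Context, relay chan []string, batch []string, progressDeadline time.Duration) error {
	select {
	case relay <- batch:
		return nil
	case <-time.After(progressDeadline):
		return errStalled
	case <-ctx.Done():
		return ctx.Err()
	}
}

// recordHeartbeat illustrates the output-side check: the heartbeat must be recorded in
// the workflow record within the allowed interval, otherwise the target is considered stalled.
func recordHeartbeat(ctx context.Context, commit func(context.Context) error, interval time.Duration) error {
	hbCtx, cancel := context.WithTimeout(ctx, interval)
	defer cancel()
	if err := commit(hbCtx); err != nil {
		return fmt.Errorf("failed to record heartbeat within %v: %w", interval, err)
	}
	return nil
}

func main() {
	ctx := context.Background()

	relay := make(chan []string, 1)     // capacity 1 stands in for the max relay log size/items
	relay <- []string{"previous batch"} // the previous batch was never drained
	if err := writeToRelayLog(ctx, relay, []string{"next batch"}, 100*time.Millisecond); err != nil {
		fmt.Println("input side:", err)
	}

	slowCommit := func(c context.Context) error { // stand-in for the UPDATE of the workflow record
		select {
		case <-time.After(time.Second):
			return nil
		case <-c.Done():
			return c.Err()
		}
	}
	if err := recordHeartbeat(ctx, slowCommit, 100*time.Millisecond); err != nil {
		fmt.Println("output side:", err)
	}
}

The actual PR wires equivalent checks into the existing relay log and heartbeat handling and uses the configured vplayer-progress-deadline; this sketch only shows the shape of the deadline checks.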

Related Issue(s)

Checklist

  • "Backport to:" labels have been added if this change should be back-ported to release branches
  • If this change is to be back-ported to previous releases, a justification is included in the PR description
  • Tests were added or are not required
  • Did the new or modified tests pass consistently locally and on CI?
  • Documentation: ⚠️ TODO: docs PR to come after review approvals ⚠️

Contributor

vitess-bot bot commented Apr 25, 2024

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

  • Ensure that the Pull Request has a descriptive title.
  • Ensure there is a link to an issue (except for internal cleanup and flaky test fixes); new features should have an RFC that documents use cases and test cases.

Tests

  • Bug fixes should have at least one unit or end-to-end test, enhancement and new features should have a sufficient number of tests.

Documentation

  • Apply the release notes (needs details) label if users need to know about this change.
  • New features should be documented.
  • There should be some code comments as to why things are implemented the way they are.
  • There should be a comment at the top of each new or modified test to explain what the test does.

New flags

  • Is this flag really necessary?
  • Flag names must be clear and intuitive, use dashes (-), and have a clear help text.

If a workflow is added or modified:

  • Each item in Jobs should be named in order to mark it as required.
  • If the workflow needs to be marked as required, the maintainer team must be notified.

Backward compatibility

  • Protobuf changes should be wire-compatible.
  • Changes to _vt tables and RPCs need to be backward compatible.
  • RPC changes should be compatible with vitess-operator.
  • If a flag is removed, then it should also be removed from vitess-operator and arewefastyet, if used there.
  • vtctl command output order should be stable and awk-able.

@vitess-bot vitess-bot bot added the NeedsBackportReason, NeedsDescriptionUpdate, NeedsIssue, and NeedsWebsiteDocsUpdate labels on Apr 25, 2024
@mattlord mattlord added this to In progress in VReplication via automation Apr 25, 2024
@mattlord mattlord added the Component: VReplication and Type: Bug labels and removed the NeedsDescriptionUpdate, NeedsWebsiteDocsUpdate, NeedsIssue, and NeedsBackportReason labels on Apr 25, 2024
@github-actions github-actions bot added this to the v20.0.0 milestone Apr 25, 2024
@mattlord mattlord force-pushed the vplayer_batch_trx_timeout branch 2 times, most recently from fe9ed6d to 03a5b03 on April 25, 2024 16:03
Signed-off-by: Matt Lord <mattalord@gmail.com>

codecov bot commented Apr 25, 2024

Codecov Report

Attention: Patch coverage is 82.85714% with 6 lines in your changes missing coverage. Please review.

Project coverage is 68.26%. Comparing base (3b09eb2) to head (a79516f).
Report is 23 commits behind head on main.

Files                                                    Patch %   Lines
.../vt/vttablet/tabletmanager/vreplication/vplayer.go    61.53%    5 Missing ⚠️
go/vt/vttablet/tabletmanager/vreplication/flags.go        0.00%    1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #15797      +/-   ##
==========================================
- Coverage   68.47%   68.26%   -0.22%     
==========================================
  Files        1562     1541      -21     
  Lines      197083   197145      +62     
==========================================
- Hits       134962   134585     -377     
- Misses      62121    62560     +439     


Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
@mattlord mattlord added the Type: Enhancement (Logical improvement, somewhere between a bug and feature) label and removed the Type: Bug label on May 16, 2024
Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
…eout

Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
@mattlord
Contributor Author

mattlord commented May 25, 2024

@rohit-nayak-ps I could also just remove the transaction duration (stallHandler) part and stick with the simpler heartbeat based check. I do believe that will still be able to catch all of the cases, including the one seen in the original production issue. In that case the position was not getting updated via the replicated transaction or the heartbeat recording, so we should still have gotten the stalled vttablet log and the issue noted in the workflow's message field.

In any event, I did another test with the latest state:

❯ grep "stall handler" /opt/vtdataroot/tmp/*
...
/opt/vtdataroot/tmp/vttablet.pslord.matt.log.INFO.20240525-023958.12895:I0525 02:40:27.532924   12895 vplayer.go:897] StallHandler-2527617803385045679: After stopping the stall handler goroutine for 15s, total: 422, active: 0 -- totalStallHandlerCount: 36
/opt/vtdataroot/tmp/vttablet.pslord.matt.log.INFO.20240525-023958.12895:I0525 02:40:27.537210   12895 vplayer.go:878] StallHandler-2527617803385045679: Resetting the stall handler timer to 15s
/opt/vtdataroot/tmp/vttablet.pslord.matt.log.INFO.20240525-023958.12895:I0525 02:40:27.537223   12895 vplayer.go:893] StallHandler-2527617803385045679: Starting the stall handler goroutine for 15s, total: 423, active: 1, totalStallHandlerCount: 36
/opt/vtdataroot/tmp/vttablet.pslord.matt.log.INFO.20240525-023958.12895:I0525 02:40:27.537449   12895 vplayer.go:878] StallHandler-2527617803385045679: Resetting the stall handler timer to 15s
/opt/vtdataroot/tmp/vttablet.pslord.matt.log.INFO.20240525-023958.12895:I0525 02:40:27.537781   12895 vplayer.go:917] StallHandler-2527617803385045679: Stopping the stall handler timer
/opt/vtdataroot/tmp/vttablet.pslord.matt.log.INFO.20240525-023958.12895:I0525 02:40:27.537791   12895 vplayer.go:905] StallHandler-2527617803385045679: the stall handler timer was stopped
/opt/vtdataroot/tmp/vttablet.pslord.matt.log.INFO.20240525-023958.12895:I0525 02:40:27.537793   12895 vplayer.go:897] StallHandler-2527617803385045679: After stopping the stall handler goroutine for 15s, total: 423, active: 0 -- totalStallHandlerCount: 36

@rohit-nayak-ps
Contributor

@rohit-nayak-ps I could also just remove the transaction duration (stallHandler) part and stick with the simpler heartbeat based check.

This might indeed be a good path to choose, since your other change is catching the known issues.

Also, while goroutines are lightweight, generating goroutines for every transaction could also lead to unpredictable performance impacts due to garbage collection, memory usage, scheduling issues, etc., especially since we will be enabling it in a high-qps situation.

Signed-off-by: Matt Lord <mattalord@gmail.com>
@mattlord
Contributor Author

mattlord commented May 26, 2024

@rohit-nayak-ps I could also just remove the transaction duration (stallHandler) part and stick with the simpler heartbeat based check.

This might indeed be a good path to choose, since your other change is catching the known issues.

Also, while goroutines are lightweight, generating goroutines for every transaction could also lead to unpredictable performance impacts due to garbage collection, memory usage, scheduling issues, etc., especially since we will be enabling it in a high-qps situation.

@rohit-nayak-ps Having more than 1 active/concurrent goroutine per vplayer/stallHandler was a bug. The intention was for this to be a lock-free, idempotent stall handler that was very difficult to mess up — precisely because it's a high-QPS component with a somewhat awkward way of managing the transactions, where you may, for example, do multiple BEGINs before a COMMIT. There should only ever be 0 or 1 goroutines active/running per vplayer/stallHandler instance. After spending so much time on the investigation/testing/debugging part of the work, I was a bit too eager to mark this as ready for review once my manual test and the new unit test were passing. That was sloppy on my part — I missed the bug in my self review and I clearly didn't have adequate testing; I apologize. Your thorough review was much appreciated though, as you caught it. ❤️
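
Purely to illustrate the kind of lock-free, at-most-one-goroutine design being described (this is not the actual stallHandler code from this PR), a simplified watcher could keep a single long-lived goroutine per vplayer and an atomically updated progress timestamp. All names below are made up for the example.

package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// stallWatcher is a hypothetical stall detector: progress() is called after every
// successful commit and is lock-free; a single long-lived goroutine checks whether
// progress has been made within the timeout.
type stallWatcher struct {
	timeout      time.Duration
	lastProgress atomic.Int64 // unix nanos of the last committed transaction
	onStall      func()
	done         chan struct{}
}

func newStallWatcher(timeout time.Duration, onStall func()) *stallWatcher {
	w := &stallWatcher{timeout: timeout, onStall: onStall, done: make(chan struct{})}
	w.lastProgress.Store(time.Now().UnixNano())
	go w.run()
	return w
}

// progress records that the vplayer just committed something.
func (w *stallWatcher) progress() {
	w.lastProgress.Store(time.Now().UnixNano())
}

func (w *stallWatcher) run() {
	ticker := time.NewTicker(w.timeout / 4)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			last := time.Unix(0, w.lastProgress.Load())
			if time.Since(last) > w.timeout {
				w.onStall()
				w.progress() // report at most once per stalled window
			}
		case <-w.done:
			return
		}
	}
}

func (w *stallWatcher) stop() { close(w.done) }

func main() {
	w := newStallWatcher(200*time.Millisecond, func() {
		fmt.Println("vplayer stalled: no transaction committed within the deadline")
	})
	defer w.stop()

	for i := 0; i < 3; i++ { // healthy replication: progress keeps arriving
		time.Sleep(50 * time.Millisecond)
		w.progress()
	}
	time.Sleep(500 * time.Millisecond) // nothing commits anymore, so the stall callback fires
}

The point is only that the watcher owns at most one goroutine and that the hot path (progress) is a single atomic store, so it stays cheap even at high QPS.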

I've added comments and a new unit test that demonstrate it now working as intended: 3839e90

I wanted to get it working so that even if I do end up discarding it here we'll still have it in the PR history if we ever want to revive it or reuse it for something else.

With all that being said, let me know what you think. I'm fine removing it for now, since it was only well into my testing and debugging that I realized the existing heartbeat mechanism, combined with the vreplicationMinimumHeartbeatUpdateInterval flag/variable, was already supposed to effectively serve the role of detecting stalls. The problem was a bug in that handling where we cleared the pending heartbeat counter before we were able to successfully save/record the heartbeat (we assumed it would be immediately successful).

So now that I'm well aware of it, and have also confirmed that mechanism is working as it should (it's used in the ROW based test case in the new TestPlayerStalls unit test), I think that the stallHandler mechanism is at least largely redundant. If I'm being honest, the main reason I would not want to delete it is that I spent some time on it. 🙂 That's obviously not a great reason to keep it, and now that it's working and saved somewhere (commit history here), I'm totally fine removing it if you and others prefer it. Thanks again!
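
To make the ordering bug mentioned above concrete, here is a minimal sketch under the assumption of a pending-heartbeat counter like the one described; fakeVPlayer and its field/method names are hypothetical stand-ins, not the real vplayer types.

package main

import (
	"errors"
	"fmt"
)

// fakeVPlayer is a stand-in for the real vplayer state used only for this example.
type fakeVPlayer struct {
	numAccumulatedHeartbeats int
	updateTimeUpdated        func() error // stand-in for the workflow-record UPDATE
}

// Buggy ordering: the pending-heartbeat counter is cleared before we know the write succeeded,
// so a stalled or failing commit never trips the minimum-interval stall check.
func recordHeartbeatBuggy(vp *fakeVPlayer) error {
	vp.numAccumulatedHeartbeats = 0
	return vp.updateTimeUpdated()
}

// Fixed ordering: clear the counter only after the write actually succeeded.
func recordHeartbeatFixed(vp *fakeVPlayer) error {
	if err := vp.updateTimeUpdated(); err != nil {
		return err
	}
	vp.numAccumulatedHeartbeats = 0
	return nil
}

func main() {
	failing := func() error { return errors.New("commit stalled") }

	vp := &fakeVPlayer{numAccumulatedHeartbeats: 3, updateTimeUpdated: failing}
	_ = recordHeartbeatBuggy(vp)
	fmt.Println("buggy ordering, pending heartbeats after a failed write:", vp.numAccumulatedHeartbeats) // 0: the stall is hidden

	vp = &fakeVPlayer{numAccumulatedHeartbeats: 3, updateTimeUpdated: failing}
	_ = recordHeartbeatFixed(vp)
	fmt.Println("fixed ordering, pending heartbeats after a failed write:", vp.numAccumulatedHeartbeats) // 3: the stall stays visible
}

With the fixed ordering, a stalled or failed write leaves the pending count intact, so the minimum-heartbeat-interval check can still flag the stall.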

Signed-off-by: Matt Lord <mattalord@gmail.com>
@shlomi-noach
Contributor

I think that the stallHandler mechanism is at least largely redundant. If I'm being honest, the main reason I would not want to delete it is that I spent some time on it.

@mattlord now that it's been said, it cannot be unsaid... :)
Looking at the PR changes, the stall handler takes a considerable part of the actual logic changes. So I'd be curious to see a solution where the stall handler does not exist.

Signed-off-by: Matt Lord <mattalord@gmail.com>
@mattlord
Contributor Author

@rohit-nayak-ps and @shlomi-noach this removes all of the stallHandler related code: c666b14

What we lose without it is the ability to perform out-of-band monitoring and error reporting. That means the heartbeat method will only detect a stall when the stall causes a failure to commit the transaction that updates the timestamp for the workflow (whether that update was done on its own or as part of replicating user-generated events).

If you both prefer that then I'll update the PR description accordingly. Thanks!

Contributor

@shlomi-noach shlomi-noach left a comment


@mattlord thank you, I think we should go with this change, seeing that it's so succinct.

@rohit-nayak-ps
Contributor

@mattlord thank you, I think we should go with this change, seeing that it's so succinct.

Same here. Thanks @mattlord.

@mattlord
Contributor Author

mattlord commented May 29, 2024

The good news is that I can repeat the exact scenario and symptoms seen in production with the test case added here, which demonstrates a more typical scenario that would be seen in production: #15974

The bad news is that neither of the two methods added here is generating the helpful error message. I'm still getting:

                {
                  "id": "21",
                  "stream_id": "1",
                  "type": "Message",
                  "state": "Running",
                  "created_at": {
                    "seconds": "1716998552",
                    "nanoseconds": 0
                  },
                  "updated_at": {
                    "seconds": "1716998552",
                    "nanoseconds": 0
                  },
                  "message": "vttablet: rpc error: code = Unknown desc = stream (at source tablet) error @ 1cb76134-1dc9-11ef-82c4-5f4020e10986:1-546395: unexpected EOF\nio.ReadFull(packet body of length 78) failed (errno 2013) (sqlstate HY000)",
                  "count": "1"
                }

So I need to keep digging...

Signed-off-by: Matt Lord <mattalord@gmail.com>
@mattlord
Contributor Author

@rohit-nayak-ps and @shlomi-noach OK, now that I'm able to repeat the exact issue seen in production (which kicked off this work) — I was missing a detail in that test case which I only realized this week — I know that I've now fixed it with the latest commit.

Before that commit, even with the stallHandler related code in place, I was still only getting the eventual not-terribly-helpful error that had been seen in production:

# from the 'vtctldclient workflow show' output
"message": "vttablet: rpc error: code = Unknown desc = stream (at source tablet) error @ 1cb76134-1dc9-11ef-82c4-5f4020e10986:1-537840: unexpected EOF\nio.ReadFull(packet body of length 2658) failed (errno 2013) (sqlstate HY000)",

❯ grep -Ri stalled ${VTDATAROOT}/tmp/vttablet*ERROR*

It's quite some time before that error surfaces, and in the state leading up to it you cannot even stop or show the workflow because things have backed up so badly.

With the latest commit, however (vplayer-progress-deadline=5m and we are in fact not making any progress — the position in the workflow is not moving forward):

 # from the 'vtctldclient workflow show' output
"message": "relay log I/O stalled: progress stalled; vplayer was unable to replicate the transaction in a timely manner; examine the target mysqld instance health and the replicated queries' EXPLAIN output to see why queries are taking unusually long",

❯ grep -Ri stalled ${VTDATAROOT}/tmp/vttablet*ERROR*
E0530 19:24:37.570804   71715 dbclient.go:128] error in stream 3, will retry after 5s: relay log I/O stalled: progress stalled; vplayer was unable to replicate the transaction in a timely manner; examine the target mysqld instance health and the replicated queries' EXPLAIN output to see why queries are taking unusually long
E0530 19:31:04.982780   71715 dbclient.go:128] error in stream 3, will retry after 5s: relay log I/O stalled: progress stalled; vplayer was unable to replicate the transaction in a timely manner; examine the target mysqld instance health and the replicated queries' EXPLAIN output to see why queries are taking unusually long
E0530 19:37:34.468556   71715 dbclient.go:128] error in stream 3, will retry after 5s: relay log I/O stalled: progress stalled; vplayer was unable to replicate the transaction in a timely manner; examine the target mysqld instance health and the replicated queries' EXPLAIN output to see why queries are taking unusually long
...

And you can start/stop/show the workflow during this time as we're not letting things back up so badly.

What the PR now does is detect stalls on both "ends" in the vplayer:

  1. It will detect relay log I/O stalls — where we cannot successfully replicate the previous relay log contents and keep retrying, so the relay log fills up with the next batch and we cannot write to it anymore because we hit the max relay log size / items. This is a stall on the input side of the vplayer.
  2. It will detect heartbeat recording failures — where we cannot record the heartbeat in the workflow record within the minimum heartbeat interval. This is a stall on the output side of the vplayer.

I will now update the PR description as well. I will also add the Do Not Merge label on the PR since we should not merge this until we've cut the v20 RC and main becomes v21. Given that, no rush on the reviews but I appreciate it whenever you can. 🙏

Signed-off-by: Matt Lord <mattalord@gmail.com>
@mattlord mattlord removed this from the v20.0.0 milestone May 31, 2024
Signed-off-by: Matt Lord <mattalord@gmail.com>
@rohit-nayak-ps
Contributor

It says this in the description:

It's quite some time before that error surfaces when in this stalled state and when in the state leading up to this error you cannot even stop or show the workflow either as things have backed up so badly.

What causes show to hang?

Contributor

@rohit-nayak-ps rohit-nayak-ps left a comment


Nice catch!

@mattlord
Contributor Author

mattlord commented May 31, 2024

It says this in the description:

It's quite some time before that error surfaces when in this stalled state and when in the state leading up to this error you cannot even stop or show the workflow either as things have backed up so badly.

What causes show to hang?

@rohit-nayak-ps I'm sorry, I was wrong on the show part. I've confirmed that in new tests and now that I think about it, I WAS able to run show during the production incident as well. It was only stop that hung (I'll update the description accordingly). (I did see show hang in earlier tests but that was due to things being hung in a different way from various tests.)

For the stop in my tests...

1 @ 0x10032f118 0x1002f8b64 0x1002f8724 0x10128d288 0x101291f78 0x1014f0bc0 0x1014f0ba5 0x10192670c 0x101906f30 0x1008c9cc8 0x1008cdd40 0x1008c808c 0x100369854
#	0x10128d287	vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication.(*controller).Stop+0x47				vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication/controller.go:325
#	0x101291f77	vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication.(*Engine).exec+0xcd7					vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication/engine.go:452
#	0x1014f0bbf	vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication.(*Engine).Exec+0xb6f					vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication/engine.go:348
#	0x1014f0ba4	vitess.io/vitess/go/vt/vttablet/tabletmanager.(*TabletManager).UpdateVReplicationWorkflow+0xb54			vitess.io/vitess/go/vt/vttablet/tabletmanager/rpc_vreplication.go:472
#	0x10192670b	vitess.io/vitess/go/vt/vttablet/grpctmserver.(*server).UpdateVReplicationWorkflow+0x11b				vitess.io/vitess/go/vt/vttablet/grpctmserver/server.go:423
#	0x101906f2f	vitess.io/vitess/go/vt/proto/tabletmanagerservice._TabletManager_UpdateVReplicationWorkflow_Handler+0x1bf	vitess.io/vitess/go/vt/proto/tabletmanagerservice/tabletmanagerservice_grpc.pb.go:1634
#	0x1008c9cc7	google.golang.org/grpc.(*Server).processUnaryRPC+0xb57								google.golang.org/grpc@v1.63.2/server.go:1369
#	0x1008cdd3f	google.golang.org/grpc.(*Server).handleStream+0xb1f								google.golang.org/grpc@v1.63.2/server.go:1780
#	0x1008c808b	google.golang.org/grpc.(*Server).serveStreams.func2.1+0x8b							google.golang.org/grpc@v1.63.2/server.go:1019

That's here:

func (ct *controller) Stop() {
	ct.cancel()
	ct.blpStats.Stop()
	<-ct.done
}

Which is waiting for run() to end as that done channel is closed in a defer there.

That goroutine in turn is blocked on this one:

1 @ 0x10032f118 0x1002f8b64 0x1002f8724 0x1012ba348 0x1012ba254 0x1012b9a1c 0x1012bd548 0x1012bcb58 0x10128c28c 0x10128b9a8 0x100369854
#	0x1012ba347	vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication.(*vplayer).fetchAndApply.func4+0x37	vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication/vplayer.go:304
#	0x1012ba253	vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication.(*vplayer).fetchAndApply+0x573	vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication/vplayer.go:309
#	0x1012b9a1b	vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication.(*vplayer).play+0x31b		vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication/vplayer.go:210
#	0x1012bd547	vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication.(*vreplicator).replicate+0x5d7	vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication/vreplicator.go:311
#	0x1012bcb57	vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication.(*vreplicator).Replicate+0x27	vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication/vreplicator.go:178
#	0x10128c28b	vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication.(*controller).runBlp+0x52b		vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication/controller.go:257
#	0x10128b9a7	vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication.(*controller).run+0x67		vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication/controller.go:153

Which is here at line 293 (line numbers are a little off as this was done in my PR branch):

case err := <-streamErr:
	defer func() {
		// cancel and wait for the other thread to finish.
		cancel()
		<-applyErr
	}()

And the goroutine for that function:

1 @ 0x10032f118 0x100328948 0x100362b70 0x10040efc8 0x100410310 0x100410301 0x1004b0d98 0x1004c0a14 0x100542c14 0x1003a87c0 0x100dd6ff8 0x100dd6fd1 0x100dd7230 0x100dec9b8 0x100deb964 0x100deb588 0x100deb008 0x101232080 0x1012b8684 0x1012b8c74 0x1012b8c61 0x1012b96c4 0x1012baa28 0x10129bf5c 0x10129aa14 0x1012ba8f8 0x1012bc19c 0x1012bb5ac 0x1012ba394 0x100369854
#	0x100362b6f	internal/poll.runtime_pollWait+0x9f								runtime/netpoll.go:345
#	0x10040efc7	internal/poll.(*pollDesc).wait+0x27								internal/poll/fd_poll_runtime.go:84
#	0x10041030f	internal/poll.(*pollDesc).waitRead+0x1ff							internal/poll/fd_poll_runtime.go:89
#	0x100410300	internal/poll.(*FD).Read+0x1f0									internal/poll/fd_unix.go:164
#	0x1004b0d97	net.(*netFD).Read+0x27										net/fd_posix.go:55
#	0x1004c0a13	net.(*conn).Read+0x33										net/net.go:179
#	0x100542c13	bufio.(*Reader).Read+0x1b3									bufio/bufio.go:241
#	0x1003a87bf	io.ReadAtLeast+0x9f										io/io.go:335
#	0x100dd6ff7	io.ReadFull+0x47										io/io.go:354
#	0x100dd6fd0	vitess.io/vitess/go/mysql.(*Conn).readHeaderFrom+0x20						vitess.io/vitess/go/mysql/conn.go:389
#	0x100dd722f	vitess.io/vitess/go/mysql.(*Conn).readEphemeralPacket+0x6f					vitess.io/vitess/go/mysql/conn.go:427
#	0x100dec9b7	vitess.io/vitess/go/mysql.(*Conn).readComQueryResponse+0x37					vitess.io/vitess/go/mysql/query.go:544
#	0x100deb963	vitess.io/vitess/go/mysql.(*Conn).ReadQueryResult+0x53						vitess.io/vitess/go/mysql/query.go:396
#	0x100deb587	vitess.io/vitess/go/mysql.(*Conn).ExecuteFetchMulti+0x97					vitess.io/vitess/go/mysql/query.go:364
#	0x100deb007	vitess.io/vitess/go/mysql.(*Conn).ExecuteFetch+0x27						vitess.io/vitess/go/mysql/query.go:318
#	0x10123207f	vitess.io/vitess/go/vt/binlog/binlogplayer.(*dbClientImpl).ExecuteFetch+0x2f			vitess.io/vitess/go/vt/binlog/binlogplayer/dbclient.go:144
#	0x1012b8683	vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication.(*vdbClient).ExecuteFetch+0x1b3	vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication/vdbclient.go:123
#	0x1012b8c73	vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication.(*vdbClient).Execute+0x43		vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication/vdbclient.go:166
#	0x1012b8c60	vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication.(*vdbClient).ExecuteWithRetry+0x30	vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication/vdbclient.go:170
#	0x1012b96c3	vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication.newVPlayer.func1+0x43		vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication/vplayer.go:128
#	0x1012baa27	vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication.(*vplayer).applyRowEvent.func1+0xd7	vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication/vplayer.go:348
#	0x10129bf5b	vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication.execParsedQuery+0x5b			vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication/replicator_plan.go:561
#	0x10129aa13	vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication.(*TablePlan).applyChange+0x4a3	vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication/replicator_plan.go:412
#	0x1012ba8f7	vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication.(*vplayer).applyRowEvent+0x1e7	vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication/vplayer.go:373
#	0x1012bc19b	vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication.(*vplayer).applyEvent+0x82b		vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication/vplayer.go:665
#	0x1012bb5ab	vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication.(*vplayer).applyEvents+0x64b		vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication/vplayer.go:547
#	0x1012ba393	vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication.(*vplayer).fetchAndApply.func2+0x33	vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication/vplayer.go:281

Which is trying and retrying to commit the last relay log batch, cycling through queries like this:

| 460472 | vt_filtered     | localhost       | vt_commerce | Query            |      1 | updating                                                        | update _vt_vrp_a505f42c1f5611ef9043920702940ee1_20240531100407_ set updated_at='2024-05-31 10:05:38', created_at='2024-05-29 11:54:37', colstuff='foobar' where id=(convert('126062630024235_id' using utf8mb4)) and id2='186062188424265_id2' |

Each of those queries is doing a table scan, so each one takes a few seconds.

So it looks like we process the stop request and cancel the controller's context, which causes the vstreamer to generate an io.EOF stream error. The vplayer receives that error and then waits for the fetch-and-apply goroutine to end and close its error channel before run() can complete, close its done channel, and let the controller finally end. I think we're stuck down in the MySQL layer for some time (it DOES eventually stop), and that layer does not check the context.
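
As a hedged, self-contained reduction of that chain (illustrative only, not the actual controller/vplayer code), the hang can be reproduced with a done channel that is closed in run()'s defer plus an apply goroutine stuck in a blocking call that ignores the context:

package main

import (
	"context"
	"fmt"
	"time"
)

// controller is a hypothetical reduction of the handshake described above: Stop() cannot
// return until run() closes done, and run() cannot return while the apply goroutine is
// stuck in a blocking call that does not watch the context.
type controller struct {
	cancel context.CancelFunc
	done   chan struct{}
}

func newController() *controller {
	ctx, cancel := context.WithCancel(context.Background())
	ct := &controller{cancel: cancel, done: make(chan struct{})}
	go ct.run(ctx)
	return ct
}

func (ct *controller) run(ctx context.Context) {
	defer close(ct.done) // Stop() unblocks only once this defer runs
	applyErr := make(chan error, 1)
	go func() {
		// Stand-in for the apply loop stuck in a blocking MySQL call that ignores ctx.
		time.Sleep(2 * time.Second)
		applyErr <- nil
	}()
	<-ctx.Done() // the cancellation / stream error arrives quickly...
	<-applyErr   // ...but run() still has to wait for the blocking apply to finish
}

func (ct *controller) Stop() {
	ct.cancel()
	<-ct.done
}

func main() {
	ct := newController()
	time.Sleep(100 * time.Millisecond) // let the "apply" goroutine get stuck
	start := time.Now()
	ct.Stop() // hangs until the blocking "query" finally returns
	fmt.Printf("Stop() returned after %v\n", time.Since(start).Round(100*time.Millisecond))
}

Here Stop() returns only after the blocking "query" finishes, which matches the behavior described above: the workflow does eventually stop, but only once the MySQL call completes.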

Labels
Component: VReplication · Do Not Merge · Type: Enhancement (Logical improvement, somewhere between a bug and feature)
Projects
VReplication
In progress
Development

Successfully merging this pull request may close these issues.

Bug Report: VPlayer does not detect stalls
3 participants