Fix resume data loss: route heartbeat coords through applyEventsQueue#34
Merged
grodowski merged 3 commits intoMay 21, 2026
Merged
Conversation
onChangelogHeartbeatEvent was mutating applier.CurrentCoordinates directly from the streamer goroutine, before any DML that preceded the heartbeat was applied to the ghost table. The checkpoint loop reads CurrentCoordinates as "applied through this GTID" and could persist a checkpoint whose LastTrxCoords was ahead of what was actually applied. If gh-ost crashed before applyEventsQueue drained, --resume read that checkpoint and called StartSyncGTID with the persisted set; MySQL treated the un-applied GTIDs as already-seen and never re-streamed them. The ghost table silently lost those DMLs and cut-over produced a stale table. Fix: enqueue a tableWriteFunc onto applyEventsQueue that performs the coords bump. The apply goroutine executes it in order, after the DMLs the streamer enqueued before the heartbeat, restoring the invariant. Adds TestMigratorHeartbeatDoesNotAdvancePastUnappliedDML, which fails at the previous HEAD and passes after the fix; also asserts queue ordering to guard against future changes that wrap the heartbeat enqueue in a goroutine. Co-authored-by: Bastian Bartmann <bastian.bartmann@shopify.com>
coding-chimp
approved these changes
May 20, 2026
Comment on lines
+332
to
+343
| // Route the coords bump through applyEventsQueue so it is ordered after | ||
| // any DMLs the streamer enqueued before this heartbeat. | ||
| // | ||
| // If we instead mutated applier.CurrentCoordinates directly from this | ||
| // streamer goroutine, the checkpoint loop (Migrator.Checkpoint) could | ||
| // observe coords that include DMLs still sitting un-applied in | ||
| // applyEventsQueue and write a checkpoint row whose LastTrxCoords is | ||
| // AHEAD of what has actually been applied to the ghost table. If gh-ost | ||
| // then crashes before the queue drains, resume reads that checkpoint and | ||
| // calls StartSyncGTID with the persisted set; MySQL treats the un-applied | ||
| // GTIDs as already-seen and never re-streams them, so the ghost table | ||
| // silently loses those DMLs and cut-over produces a stale table. |
There was a problem hiding this comment.
Not sure we really need this comment in the code. We have a test and context in git.
I would make it more concise if we intend to keep it.
Co-authored-by: Bastian Bartmann <bastian.bartmann@shopify.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Member
Author
|
FYI added d96ad02 to align with github#1639 |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
onChangelogHeartbeatEventwas mutatingapplier.CurrentCoordinatesdirectly from the streamer goroutine, before any DML that preceded the heartbeat was applied to the ghost table. The checkpoint loop readsCurrentCoordinatesas "applied through this GTID" and could persist a checkpoint whoseLastTrxCoordswas ahead of what was actually applied.If
gh-ostcrashed beforeapplyEventsQueuedrained,--resumeread that checkpoint and calledStartSyncGTIDwith the persisted set; MySQL treated the un-applied GTIDs as already-seen and never re-streamed them. The ghost table silently lost those DMLs and cut-over produced a stale table.Fix: enqueue the heartbeat through
applyEventsQueue. The apply goroutine executes it in order, after the DMLs the streamer enqueued before the heartbeat, restoring the invariant.Adds
TestMigratorHeartbeatDoesNotAdvancePastUnappliedDML, which fails at the previous HEAD and passes after the fix.script/cibuildreturns with no formatting errors, build errors or unit test errors.