[LLM EXPERIMENT] mysql: Add regression test for backup/restore DROP TABLE race #35926

Draft
bosconi wants to merge 3 commits into main from jc-repro-mysql-backup-restore-race

@bosconi (Member) commented Apr 9, 2026

Summary

Regression test and investigation for database-issues#7683: during a mysqldump --all-databases restore, the MySQL replication worker can miss a DROP TABLE event because verify_schemas() queries the current state of information_schema rather than parsing the DDL statement. The flood of DDL events from system tables causes the worker to fall behind the binlog stream, and by the time verify_schemas runs, the table has already been recreated.
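
The race described above can be sketched with a toy model. This is an illustration only, not Materialize's code: the `Ddl` enum, `replay`, and the `lag` parameter are hypothetical names, and the upstream catalog is reduced to a set of table names. The worker processes binlog event `i` while upstream has already applied `lag` further events, so a state-based check (the analogue of querying `information_schema`) sees the recreated table, while a statement-based check sees the drop.

```rust
use std::collections::HashSet;

/// A DDL event as it would appear in a (simulated) binlog.
#[derive(Debug, Clone, PartialEq)]
enum Ddl {
    DropTable(String),
    CreateTable(String),
}

/// Applies one event to the simulated upstream catalog (set of existing tables).
fn apply(catalog: &mut HashSet<String>, event: &Ddl) {
    match event {
        Ddl::DropTable(t) => {
            catalog.remove(t);
        }
        Ddl::CreateTable(t) => {
            catalog.insert(t.clone());
        }
    }
}

/// Statement-based detection: inspect the event itself.
fn is_drop_of(event: &Ddl, table: &str) -> bool {
    matches!(event, Ddl::DropTable(t) if t.as_str() == table)
}

/// Replays the binlog while upstream is `lag` events ahead of the worker.
/// Returns (detected by current-state check, detected by statement check).
fn replay(binlog: &[Ddl], table: &str, lag: usize) -> (bool, bool) {
    // Assumption: the tracked table exists before the first event.
    let mut catalog: HashSet<String> = HashSet::from([table.to_string()]);
    let mut applied = 0;
    let (mut by_state, mut by_statement) = (false, false);
    for (i, event) in binlog.iter().enumerate() {
        // Upstream has already applied everything up to event i + lag.
        let upstream_pos = (i + 1 + lag).min(binlog.len());
        while applied < upstream_pos {
            apply(&mut catalog, &binlog[applied]);
            applied += 1;
        }
        // State-based check: does the table exist *right now*?
        by_state |= !catalog.contains(table);
        // Statement-based check: is *this event* a drop of the table?
        by_statement |= is_drop_of(event, table);
    }
    (by_state, by_statement)
}
```

With a binlog of `DROP TABLE t` followed by `CREATE TABLE t`, a lag of zero lets both checks see the drop, whereas a lag of one event is enough for the state-based check to miss it.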

This is the same bug observed in nightly builds 15970 and 15981, where the backup_restore_mysql scenario in mysql-cdc-resumption failed with `query succeeded, but expected error containing "Source error"`. PR #35910 commented out that assertion to unblock CI.

Commits

  1. mysql: Add regression test for backup/restore DROP TABLE race: Adds a minimal reproduction (test/mysql-cdc/drop-recreate/) that runs mysqldump and a piped restore against a single tracked table. It fails identically to the nightly.

  2. test: Match assertion to existing nightly failure: Changes `contains:table was dropped` to `contains:Source error` so CI output matches the nightly failure messages exactly.

  3. PARTIAL fix: Parses DROP TABLE statements directly in handle_query_event instead of calling verify_schemas. Also adds a docstring noting the relationship to mysql-cdc-resumption/backup_restore_mysql.

Investigation findings

The partial fix in commit 3 correctly detects the drop, but the error does not reach queries:

| What | Status | Evidence |
|---|---|---|
| DDL parsing | Works | Correctly parses `` DROP TABLE IF EXISTS `t` /* generated by server */ ``; note the server-appended comment that must be stripped |
| Error emission | Works | `DefiniteError::TableDropped` emitted via `give_fueled` at the correct GTID |
| Health status | Works | Source export transitions to Stalled in `mz_source_statuses` with the correct error |
| Data blocking | Works | `handle_rows_event` filters errored outputs; no new data from the recreated table reaches the dataflow |
| Query result | Broken | `SELECT * FROM drop_recreate` returns the original snapshot data (`1 before`), not a Source error |
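
As an illustration of the DDL-parsing row, here is a minimal sketch of extracting table names from a DROP TABLE statement after stripping the server-appended comment. It is not the code from commit 3: `strip_block_comments` and `parse_drop_table` are hypothetical helpers, and the sketch assumes no comment markers, commas, or dots appear inside quoted identifiers and ignores the optional TEMPORARY, RESTRICT, and CASCADE keywords.

```rust
/// Removes `/* ... */` block comments from a SQL statement.
/// Simplification: comment markers inside quoted identifiers are not handled.
fn strip_block_comments(sql: &str) -> String {
    let mut out = String::with_capacity(sql.len());
    let mut rest = sql;
    while let Some(start) = rest.find("/*") {
        out.push_str(&rest[..start]);
        match rest[start + 2..].find("*/") {
            // Skip past the closing marker.
            Some(end) => rest = &rest[start + 2 + end + 2..],
            // Unterminated comment: discard the remainder.
            None => {
                rest = "";
            }
        }
    }
    out.push_str(rest);
    out
}

/// Returns the unqualified table names of a `DROP TABLE [IF EXISTS] a, b`
/// statement, or `None` if the statement is not a DROP TABLE.
fn parse_drop_table(sql: &str) -> Option<Vec<String>> {
    let cleaned = strip_block_comments(sql);
    let mut tokens = cleaned.split_whitespace().peekable();
    if !tokens.next()?.eq_ignore_ascii_case("DROP") {
        return None;
    }
    if !tokens.next()?.eq_ignore_ascii_case("TABLE") {
        return None;
    }
    // Optional `IF EXISTS`.
    if tokens.peek().map_or(false, |t| t.eq_ignore_ascii_case("IF")) {
        tokens.next();
        if !tokens.next()?.eq_ignore_ascii_case("EXISTS") {
            return None;
        }
    }
    let rest = tokens.collect::<Vec<_>>().join(" ");
    let names: Vec<String> = rest
        .trim_end_matches(';')
        .split(',')
        .map(|name| {
            // Keep the last component of `db`.`table` and drop the backticks.
            let last = name.trim().rsplit('.').next().unwrap_or("");
            last.trim_matches('`').to_string()
        })
        .filter(|name| !name.is_empty())
        .collect();
    if names.is_empty() {
        None
    } else {
        Some(names)
    }
}
```

A production parser would more likely reuse a real SQL tokenizer; the point here is only that the decision is made from the statement text, not from the current upstream catalog.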

The error emitted at the DROP TABLE GTID does not make it to persist at a timestamp the query can read. The likely cause is in the reclock operator or persist sink — the error gets stuck at a GTID timestamp that the output's frontier never advances past.
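
The hypothesis in the paragraph above can be pictured with a toy model. This is not Materialize's reclock or persist implementation: `Output`, `Update`, and `advance_frontier` are hypothetical names, and source positions are plain integers standing in for GTIDs. The sketch assumes an update stamped at position `p` only becomes readable once the output's frontier moves strictly beyond `p`.

```rust
/// An update produced by the (simulated) source for one output.
#[derive(Debug, Clone, PartialEq)]
enum Update {
    Row(String),
    Error(String),
}

/// Toy output: updates wait in `pending` until the frontier passes them,
/// and only `durable` updates are visible to queries.
struct Output {
    pending: Vec<(u64, Update)>, // (source position, update)
    durable: Vec<Update>,
}

impl Output {
    /// Releases every pending update whose position is strictly below `frontier`.
    fn advance_frontier(&mut self, frontier: u64) {
        let (ready, waiting): (Vec<_>, Vec<_>) = std::mem::take(&mut self.pending)
            .into_iter()
            .partition(|(pos, _)| *pos < frontier);
        self.durable.extend(ready.into_iter().map(|(_, update)| update));
        self.pending = waiting;
    }

    /// A query fails if any durable update is an error; otherwise it returns the rows.
    fn query(&self) -> Result<Vec<String>, String> {
        let mut rows = Vec::new();
        for update in &self.durable {
            match update {
                Update::Error(e) => return Err(format!("Source error: {}", e)),
                Update::Row(r) => rows.push(r.clone()),
            }
        }
        Ok(rows)
    }
}
```

In this model, an error stamped at position 5 stays invisible while the frontier sits at 5, so a query keeps returning the snapshot row; it only surfaces once the frontier reaches 6. Whether the real pipeline behaves this way is exactly what remains to be confirmed.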

Completing this fix requires understanding the remap/reclock/persist pipeline for per-output errors, which is beyond the scope of this PR.

Test plan

  • `bin/mzcompose --find mysql-cdc run drop-recreate` fails with `query succeeded, but expected error containing "Source error"`, which confirms the race condition (CI build 120314)
  • Once fix is complete: test passes, and verify-source-failed.td can be uncommented in mysql-cdc-resumption

Generated with Claude Code

Adds a test that reproduces the race condition in database-issues#7683:
during a mysqldump --all-databases restore, the replication worker's
verify_schemas() queries the CURRENT state of information_schema to
detect table drops. The flood of DDL events from the restore causes the
worker to fall behind, and by the time verify_schemas runs for a tracked
table's DROP TABLE event, that table has already been recreated — so the
drop goes undetected and the source silently continues.

This test is expected to fail until the fix (parsing DROP TABLE
statements directly instead of querying MySQL state) is implemented.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Change `contains:table was dropped` to `contains:Source error` so our
CI output is identical to the nightly failures in builds 15970/15981.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@bosconi changed the title from "mysql: Add regression test for backup/restore DROP TABLE race" to "[LLM EXPERIMENT] mysql: Add regression test for backup/restore DROP TABLE race" on Apr 10, 2026

Add docstring noting relationship to mysql-cdc-resumption.

Also include the partial fix: parse DROP TABLE statements directly
instead of calling verify_schemas. This correctly detects the drop
(health transitions to Stalled, errored_outputs blocks new rows) but
the error emitted via give_fueled does not reach the query result.
The error likely gets stuck in the reclock/persist pipeline.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@bosconi force-pushed the jc-repro-mysql-backup-restore-race branch from a3e2af8 to 76b335e on April 10, 2026 01:43