Fix premature wake-up in MergeTreeTransaction::afterCommit#104708
Merged
alexey-milovidov merged 6 commits intoMay 14, 2026
Merged
Fix premature wake-up in MergeTreeTransaction::afterCommit#104708alexey-milovidov merged 6 commits into
alexey-milovidov merged 6 commits into
Conversation
Contributor
|
Workflow [PR], commit [6e5cdd1] Summary: ✅ AI ReviewSummaryThis PR reorders Final Verdict
|
tuanpach
added a commit
to tuanpach/ClickHouse
that referenced
this pull request
May 12, 2026
Reviewer feedback on PR ClickHouse#104708 (clickhouse-gh[bot]): the 2s sleep was timing-based and could read `removal_csn_visible_during_pause` before `afterCommit` actually paused under load. Switch to `SYSTEM WAIT FAILPOINT transaction_after_commit_pause PAUSE`, which checks the failpoint's current pause state and returns deterministically once the background thread is paused. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
tuanpach
added a commit
to tuanpach/ClickHouse
that referenced
this pull request
May 12, 2026
Reviewer feedback on PR ClickHouse#104708 (clickhouse-gh[bot]): the 2s sleep was timing-based and could read `removal_csn_visible_during_pause` before `afterCommit` actually paused under load. Switch to `SYSTEM WAIT FAILPOINT transaction_after_commit_pause PAUSE`, which checks the failpoint's current pause state and returns deterministically once the background thread is paused.
dab7449 to
58e071a
Compare
tuanpach
added a commit
to tuanpach/ClickHouse
that referenced
this pull request
May 13, 2026
Reviewer feedback on PR ClickHouse#104708 (clickhouse-gh[bot]): the 2s sleep was timing-based and could read `removal_csn_visible_during_pause` before `afterCommit` actually paused under load. Switch to `SYSTEM WAIT FAILPOINT transaction_after_commit_pause PAUSE`, which checks the failpoint's current pause state and returns deterministically once the background thread is paused.
58e071a to
a29c1b0
Compare
tuanpach
added a commit
to tuanpach/ClickHouse
that referenced
this pull request
May 13, 2026
Reviewer feedback on PR ClickHouse#104708 (clickhouse-gh[bot]): without an EXIT trap, an early test failure between SYSTEM ENABLE FAILPOINT and the explicit DISABLE at the end (for example a SYSTEM WAIT FAILPOINT ... PAUSE timeout) would leave transaction_after_commit_pause active and could block any later MergeTreeTransaction::afterCommit running on the same server. The trap is idempotent with the existing disables, so it is purely defense in depth.
After fault_probability_after_commit fires in TransactionLog::commitTransaction, commit finalization is deferred to runUpdatingThread. The bg thread calls MergeTreeTransaction::afterCommit, which flips the atomic CSN (waking client's waitStateChange) before persisting per-part version metadata. On slow storage (S3) the client can issue the next query before removal_csn / creation_csn become visible via system.parts. Reproduced in CI on PR ClickHouse/clickhouse-private#56144 (job: Stateless tests, amd_debug, meta in keeper, distributed plan, s3 storage, parallel) where 04057_transaction_version_metadata_lifecycle returned 0 rows for committed_removal_csn_positive instead of 2. This commit only adds diagnostic hooks; afterCommit logic is unchanged: - transaction_force_unknown_state_after_commit: deterministically takes the "connection lost after commit" code path so finalization is deferred. - transaction_after_commit_pause: pauses inside afterCommit between the CSN flip and the per-part disk writes. The new test 04141_transaction_after_commit_no_premature_wakeup enables both failpoints, runs BEGIN/DROP PARTITION/COMMIT asynchronously, and verifies that the foreground COMMIT is still in flight while per-part metadata is visible. With the current (buggy) ordering the test diffs against the reference; the fix that swaps the CSN flip with the disk-write loop makes it match.
`afterCommit` previously flipped the per-transaction `csn` atomic before persisting per-part version metadata. The atomic flip is what `MergeTreeTransaction::waitStateChange` is blocked on, so the foreground `COMMIT` could return to the client while `setAndStoreRemovalCSN` / `setAndStoreCreationCSN` were still running. On slow storage (S3) the next query could read `system.parts` and see stale `removal_csn = 0`. This only manifested when `fault_probability_after_commit` fired and the finalization was handed off to `runUpdatingThread`; the inline (no-fault) path was already correct because the whole function ran before `commitTransaction` returned. Fix: move the disk-backed `setAndStore...CSN` loops above `csn.exchange`, and pass `assigned_csn` directly since `this->csn` is still `Tx::CommittingCSN` during the loops. Once `waitStateChange` unblocks, every part already exposes the new CSN through `VersionMetadata::getInfo`. Crash-safety unchanged: a partial loop is still recoverable on restart via the TID -> CSN lookup in `TransactionLog::getCSN`. Verified with 04141_transaction_after_commit_no_premature_wakeup, which now passes (previously diffed against its reference under the new failpoints).
Style check requires the keyword 'Ok' in catches that intentionally swallow exceptions. The catch wraps a test-only failpoint inside a `noexcept` function, so it must not propagate.
The wrapping was inconsistent with the surrounding code: the per-part `setAndStoreCreationCSN` / `setAndStoreRemovalCSN` calls in the same `noexcept` function are not wrapped, even though they do S3 I/O. `pauseFailPoint` only takes a mutex and waits on a condition variable, so it is even less likely to throw. Removing the catch also resolves the empty-catch style warning at the right level.
Reviewer feedback on PR ClickHouse#104708 (clickhouse-gh[bot]): the 2s sleep was timing-based and could read `removal_csn_visible_during_pause` before `afterCommit` actually paused under load. Switch to `SYSTEM WAIT FAILPOINT transaction_after_commit_pause PAUSE`, which checks the failpoint's current pause state and returns deterministically once the background thread is paused.
Reviewer feedback on PR ClickHouse#104708 (clickhouse-gh[bot]): without an EXIT trap, an early test failure between SYSTEM ENABLE FAILPOINT and the explicit DISABLE at the end (for example a SYSTEM WAIT FAILPOINT ... PAUSE timeout) would leave transaction_after_commit_pause active and could block any later MergeTreeTransaction::afterCommit running on the same server. The trap is idempotent with the existing disables, so it is purely defense in depth.
c8baf20 to
6e5cdd1
Compare
Contributor
LLVM Coverage Report
Changed lines: 100.00% (29/29) | lost baseline coverage: 2 line(s) · Uncovered code |
alexey-milovidov
approved these changes
May 14, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Under
fault_probability_after_commit,TransactionLog::commitTransactionwrites the CSN znode to ZooKeeper, then a fake hardware error pushes the transaction ontounknown_state_listfor the backgroundrunUpdatingThreadto finalize. That thread callsMergeTreeTransaction::afterCommit, which today flips the per-transactioncsnatomic before persistingcreation_csn/removal_csnon each part. The atomic flip is the signalMergeTreeTransaction::waitStateChangeis blocked on, so the foregroundCOMMITreturns to the client while the per-partsetAndStoreRemovalCSNloop is still running. On slow storage (S3) the next query readssystem.partsand sees staleremoval_csn = 0.This PR moves the per-part metadata writes to run before
csn.exchange. OncewaitStateChangeunblocks,system.partsalready exposes the new CSN. Crash-safety is unchanged: a partial loop is still recoverable on restart via the TID -> CSN lookup inTransactionLog::getCSN.Two failpoints are added to make the race deterministic and a regression test (
04141_transaction_after_commit_no_premature_wakeup) exercises both invariants:commit_still_running— the foregroundCOMMITmust not return whileafterCommitis paused.removal_csn_visible_during_pause— both parts must already exposeremoval_csn > 0while the pause is held.Fixes the flaky test reported in #103152.
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Fix a race in
MergeTreeTransaction::afterCommitwhere, after a connection loss between writing the commit CSN to ZooKeeper and finalizing the transaction, theCOMMITresponse could reach the client before the newcreation_csn/removal_csnbecame visible insystem.parts.Documentation entry for user-facing changes