
Fix premature wake-up in MergeTreeTransaction::afterCommit #104708

Merged
alexey-milovidov merged 6 commits into ClickHouse:master from tuanpach:fix-afterCommit-removal-csn-race
May 14, 2026

Conversation

@tuanpach
Member

Under fault_probability_after_commit, TransactionLog::commitTransaction writes the CSN znode to ZooKeeper, then a fake hardware error pushes the transaction onto unknown_state_list for the background runUpdatingThread to finalize. That thread calls MergeTreeTransaction::afterCommit, which today flips the per-transaction csn atomic before persisting creation_csn / removal_csn on each part. The atomic flip is the signal MergeTreeTransaction::waitStateChange is blocked on, so the foreground COMMIT returns to the client while the per-part setAndStoreRemovalCSN loop is still running. On slow storage (S3) the next query reads system.parts and sees stale removal_csn = 0.

This PR moves the per-part metadata writes to run before csn.exchange. Once waitStateChange unblocks, system.parts already exposes the new CSN. Crash-safety is unchanged: a partial loop is still recoverable on restart via the TID -> CSN lookup in TransactionLog::getCSN.
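The reordering can be illustrated with a toy model. This is a minimal sketch, not the ClickHouse code: `ToyPart`, `ToyTransaction`, and the mutex/condition-variable wait are hypothetical stand-ins, and the real `waitStateChange` may be implemented differently. It only shows why publishing the transaction CSN last guarantees that a woken waiter already sees the per-part values.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <thread>
#include <vector>

using CSN = std::uint64_t;
constexpr CSN CommittingCSN = 0;  // hypothetical stand-in for Tx::CommittingCSN

struct ToyPart
{
    std::atomic<CSN> removal_csn{0};  // 0 means "not written yet"
};

struct ToyTransaction
{
    std::atomic<CSN> csn{CommittingCSN};
    std::mutex mutex;
    std::condition_variable cv;
    std::vector<ToyPart *> removed_parts;

    // Fixed ordering: write per-part metadata first, publish the CSN last.
    void afterCommit(CSN assigned_csn)
    {
        for (ToyPart * part : removed_parts)
            part->removal_csn.store(assigned_csn);  // stands in for setAndStoreRemovalCSN

        {
            std::lock_guard lock(mutex);
            csn.store(assigned_csn);  // stands in for csn.exchange
        }
        cv.notify_all();
    }

    // Blocks until the transaction leaves the "committing" state.
    void waitStateChange()
    {
        std::unique_lock lock(mutex);
        cv.wait(lock, [this] { return csn.load() != CommittingCSN; });
    }
};

// Runs afterCommit on a background thread and reports whether both parts
// already expose the new CSN at the moment the waiter wakes up.
bool commitAndCheckVisibility()
{
    ToyPart a;
    ToyPart b;
    ToyTransaction txn;
    txn.removed_parts = {&a, &b};

    std::thread background([&] { txn.afterCommit(42); });
    txn.waitStateChange();
    bool visible = a.removal_csn.load() == 42 && b.removal_csn.load() == 42;
    background.join();
    return visible;
}
```

With the old ordering (CSN store before the loop), the same check could observe `removal_csn == 0` on one or both parts.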

Two failpoints are added to make the race deterministic and a regression test (04141_transaction_after_commit_no_premature_wakeup) exercises both invariants:

  • commit_still_running — the foreground COMMIT must not return while afterCommit is paused.
  • removal_csn_visible_during_pause — both parts must already expose removal_csn > 0 while the pause is held.
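The two invariants above can be sketched in a toy model. Everything here is hypothetical (`ToyPauseFailPoint`, `checkInvariants`, the value 42), and the sketch assumes the pause sits after the per-part writes and before the CSN flip, which is what the two invariants imply for the fixed ordering.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <thread>

using CSN = std::uint64_t;
constexpr CSN CommittingCSN = 0;  // hypothetical stand-in for Tx::CommittingCSN

// Toy pause failpoint: parks the calling thread until resume() is called.
class ToyPauseFailPoint
{
public:
    void pause()
    {
        std::unique_lock lock(mutex);
        paused = true;
        cv.notify_all();  // wake anyone waiting for the pause to begin
        cv.wait(lock, [this] { return resumed; });
        paused = false;
    }

    // Returns true once a thread is parked in pause(), false on timeout.
    bool waitUntilPaused(std::chrono::milliseconds timeout)
    {
        std::unique_lock lock(mutex);
        return cv.wait_for(lock, timeout, [this] { return paused; });
    }

    void resume()
    {
        {
            std::lock_guard lock(mutex);
            resumed = true;
        }
        cv.notify_all();
    }

private:
    std::mutex mutex;
    std::condition_variable cv;
    bool paused = false;
    bool resumed = false;
};

struct InvariantResults
{
    bool commit_still_running = false;
    bool removal_csn_visible_during_pause = false;
    CSN final_csn = 0;
};

InvariantResults checkInvariants()
{
    std::atomic<CSN> part_a_removal_csn{0};
    std::atomic<CSN> part_b_removal_csn{0};
    std::atomic<CSN> txn_csn{CommittingCSN};
    ToyPauseFailPoint failpoint;

    // Toy afterCommit: per-part writes, then the pause, then the CSN flip.
    std::thread background([&]
    {
        part_a_removal_csn.store(42);
        part_b_removal_csn.store(42);
        failpoint.pause();
        txn_csn.store(42);
    });

    InvariantResults results;
    if (failpoint.waitUntilPaused(std::chrono::seconds(10)))
    {
        // While the background thread is parked, the commit has not finished...
        results.commit_still_running = txn_csn.load() == CommittingCSN;
        // ...but both parts already expose a non-zero removal CSN.
        results.removal_csn_visible_during_pause
            = part_a_removal_csn.load() > 0 && part_b_removal_csn.load() > 0;
    }
    failpoint.resume();
    background.join();
    results.final_csn = txn_csn.load();
    return results;
}
```

Waiting on the failpoint's paused state, rather than sleeping for a fixed interval, keeps the check independent of how loaded the machine is.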

Fixes the flaky test reported in #103152.

Changelog category (leave one):

  • CI Fix or Improvement (changelog entry is not required)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Fix a race in MergeTreeTransaction::afterCommit where, after a connection loss between writing the commit CSN to ZooKeeper and finalizing the transaction, the COMMIT response could reach the client before the new creation_csn / removal_csn became visible in system.parts.

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

@clickhouse-gh
Contributor

clickhouse-gh Bot commented May 12, 2026

Workflow [PR], commit [6e5cdd1]

AI Review

Summary

This PR reorders MergeTreeTransaction::afterCommit so per-part creation_csn / removal_csn are persisted before the transaction csn state flip, and adds deterministic failpoints plus a focused stateless regression test for the unknown-status commit path. After reviewing the current diff, touched code paths, and existing discussion threads, I found no remaining correctness, safety, or test-reliability issues.

Final Verdict
  • Status: ✅ Approve

@clickhouse-gh clickhouse-gh Bot added the pr-ci label May 12, 2026
Comment thread tests/queries/0_stateless/04141_transaction_after_commit_no_premature_wakeup.sh Outdated
tuanpach added a commit to tuanpach/ClickHouse that referenced this pull request May 12, 2026
Reviewer feedback on PR ClickHouse#104708 (clickhouse-gh[bot]): the 2s sleep was
timing-based and could read `removal_csn_visible_during_pause` before
`afterCommit` actually paused under load. Switch to
`SYSTEM WAIT FAILPOINT transaction_after_commit_pause PAUSE`, which checks the
failpoint's current pause state and returns deterministically once the
background thread is paused.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@tuanpach tuanpach force-pushed the fix-afterCommit-removal-csn-race branch from dab7449 to 58e071a Compare May 12, 2026 14:40
@tuanpach tuanpach force-pushed the fix-afterCommit-removal-csn-race branch from 58e071a to a29c1b0 Compare May 13, 2026 01:44
tuanpach added a commit to tuanpach/ClickHouse that referenced this pull request May 13, 2026
Reviewer feedback on PR ClickHouse#104708 (clickhouse-gh[bot]): without an EXIT trap,
an early test failure between SYSTEM ENABLE FAILPOINT and the explicit DISABLE
at the end (for example a SYSTEM WAIT FAILPOINT ... PAUSE timeout) would leave
transaction_after_commit_pause active and could block any later
MergeTreeTransaction::afterCommit running on the same server. The trap is
idempotent with the existing disables, so it is purely defense in depth.
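The test itself is a shell script, so the actual mechanism is a shell EXIT trap. The same cleanup-on-every-exit-path idea is sketched below in C++ as a scope guard, purely as an analogy: `ToyFailPoint`, `FailPointGuard`, and `runTestBody` are hypothetical names, not part of ClickHouse.

```cpp
#include <cassert>
#include <stdexcept>

// Toy failpoint whose disable() is idempotent, like SYSTEM DISABLE FAILPOINT
// is assumed to be in the test.
struct ToyFailPoint
{
    bool enabled = false;
    int disable_calls = 0;

    void enable() { enabled = true; }
    void disable() { enabled = false; ++disable_calls; }  // safe to call repeatedly
};

// Plays the role of the EXIT trap: disables the failpoint on every exit path.
class FailPointGuard
{
public:
    explicit FailPointGuard(ToyFailPoint & fp_) : fp(fp_) { fp.enable(); }
    ~FailPointGuard() { fp.disable(); }

    FailPointGuard(const FailPointGuard &) = delete;
    FailPointGuard & operator=(const FailPointGuard &) = delete;

private:
    ToyFailPoint & fp;
};

bool runTestBody(ToyFailPoint & fp, bool fail_early)
{
    FailPointGuard guard(fp);
    if (fail_early)
        throw std::runtime_error("wait for pause timed out");  // simulated early failure
    fp.disable();  // explicit disable at the end of the happy path
    return true;
}
```

On the happy path the failpoint is disabled twice (explicitly and by the guard), which is harmless because disabling is idempotent. On the early-failure path only the guard runs, and that is the case the trap exists for.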
tuanpach added 6 commits May 13, 2026 10:40
After fault_probability_after_commit fires in TransactionLog::commitTransaction,
commit finalization is deferred to runUpdatingThread. The background thread calls
MergeTreeTransaction::afterCommit, which flips the atomic CSN (waking the client's
waitStateChange) before persisting per-part version metadata. On slow storage
(S3) the client can issue the next query before removal_csn / creation_csn
become visible via system.parts.

Reproduced in CI on PR ClickHouse/clickhouse-private#56144 (job: Stateless
tests, amd_debug, meta in keeper, distributed plan, s3 storage, parallel) where
04057_transaction_version_metadata_lifecycle returned 0 rows for
committed_removal_csn_positive instead of 2.

This commit only adds diagnostic hooks; afterCommit logic is unchanged:
  - transaction_force_unknown_state_after_commit: deterministically takes the
    "connection lost after commit" code path so finalization is deferred.
  - transaction_after_commit_pause: pauses inside afterCommit between the CSN
    flip and the per-part disk writes.

The new test 04141_transaction_after_commit_no_premature_wakeup enables both
failpoints, runs BEGIN/DROP PARTITION/COMMIT asynchronously, and verifies that
the foreground COMMIT is still in flight while per-part metadata is visible.
With the current (buggy) ordering the test diffs against the reference; the
fix that swaps the CSN flip with the disk-write loop makes it match.

`afterCommit` previously flipped the per-transaction `csn` atomic before
persisting per-part version metadata. The atomic flip is what
`MergeTreeTransaction::waitStateChange` is blocked on, so the foreground
`COMMIT` could return to the client while `setAndStoreRemovalCSN` /
`setAndStoreCreationCSN` were still running. On slow storage (S3) the next
query could read `system.parts` and see stale `removal_csn = 0`.

This only manifested when `fault_probability_after_commit` fired and the
finalization was handed off to `runUpdatingThread`; the inline (no-fault) path
was already correct because the whole function ran before
`commitTransaction` returned.

Fix: move the disk-backed `setAndStore...CSN` loops above `csn.exchange`, and
pass `assigned_csn` directly since `this->csn` is still `Tx::CommittingCSN`
during the loops. Once `waitStateChange` unblocks, every part already exposes
the new CSN through `VersionMetadata::getInfo`.

Crash-safety unchanged: a partial loop is still recoverable on restart via the
TID -> CSN lookup in `TransactionLog::getCSN`.
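The recovery argument can be sketched as follows. This is a toy model under stated assumptions: `ToyPartMetadata`, `ToyTransactionLog`, and `recoverRemovalCSN` are hypothetical, the TID is simplified to an integer, and the real `TransactionLog::getCSN` signature and restart logic are not reproduced here.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <optional>

using CSN = std::uint64_t;
using TID = std::uint64_t;  // simplified, hypothetical stand-in for a transaction id

struct ToyPartMetadata
{
    TID removal_tid = 0;  // 0 means "not removed by any transaction"
    CSN removal_csn = 0;  // 0 means "CSN not persisted yet"
};

struct ToyTransactionLog
{
    std::map<TID, CSN> committed;  // TID -> CSN for committed transactions

    std::optional<CSN> getCSN(TID tid) const
    {
        auto it = committed.find(tid);
        if (it == committed.end())
            return std::nullopt;
        return it->second;
    }
};

// On restart: if the per-part loop was interrupted, the part still carries the
// removing TID but no CSN, so the CSN is looked up in the log and filled in.
void recoverRemovalCSN(ToyPartMetadata & part, const ToyTransactionLog & log)
{
    if (part.removal_tid == 0 || part.removal_csn != 0)
        return;  // nothing to recover
    if (auto csn = log.getCSN(part.removal_tid))
        part.removal_csn = *csn;
}
```

Because the lookup depends only on the TID stored with the part and on the log, it does not matter whether the interrupted loop ran before or after the CSN flip.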

Verified with 04141_transaction_after_commit_no_premature_wakeup, which now
passes (previously diffed against its reference under the new failpoints).

Style check requires the keyword 'Ok' in catches that intentionally swallow
exceptions. The catch wraps a test-only failpoint inside a `noexcept` function,
so it must not propagate.

The wrapping was inconsistent with the surrounding code: the per-part
`setAndStoreCreationCSN` / `setAndStoreRemovalCSN` calls in the same `noexcept`
function are not wrapped, even though they do S3 I/O. `pauseFailPoint` only
takes a mutex and waits on a condition variable, so it is even less likely to
throw. Removing the catch also resolves the empty-catch style warning at the
right level.

Reviewer feedback on PR ClickHouse#104708 (clickhouse-gh[bot]): the 2s sleep was
timing-based and could read `removal_csn_visible_during_pause` before
`afterCommit` actually paused under load. Switch to
`SYSTEM WAIT FAILPOINT transaction_after_commit_pause PAUSE`, which checks the
failpoint's current pause state and returns deterministically once the
background thread is paused.

Reviewer feedback on PR ClickHouse#104708 (clickhouse-gh[bot]): without an EXIT trap,
an early test failure between SYSTEM ENABLE FAILPOINT and the explicit DISABLE
at the end (for example a SYSTEM WAIT FAILPOINT ... PAUSE timeout) would leave
transaction_after_commit_pause active and could block any later
MergeTreeTransaction::afterCommit running on the same server. The trap is
idempotent with the existing disables, so it is purely defense in depth.
@tuanpach tuanpach force-pushed the fix-afterCommit-removal-csn-race branch from c8baf20 to 6e5cdd1 Compare May 13, 2026 10:41
@clickhouse-gh
Contributor

clickhouse-gh Bot commented May 13, 2026

LLVM Coverage Report

| Metric | Baseline | Current | Δ |
|---|---|---|---|
| Lines | 84.10% | 84.10% | +0.00% |
| Functions | 92.10% | 92.10% | +0.00% |
| Branches | 76.50% | 76.60% | +0.10% |

Changed lines: 100.00% (29/29) | lost baseline coverage: 2 line(s)

@alexey-milovidov alexey-milovidov self-assigned this May 14, 2026
@alexey-milovidov alexey-milovidov added this pull request to the merge queue May 14, 2026
Merged via the queue into ClickHouse:master with commit 45d7aa2 May 14, 2026
166 checks passed
@robot-clickhouse robot-clickhouse added the pr-synced-to-cloud The PR is synced to the cloud repo label May 14, 2026

Labels

pr-ci pr-synced-to-cloud The PR is synced to the cloud repo

3 participants