Fix silent data loss in DistributedAsyncInsertBatch::recoverBatch when middle file is broken #105281
`recoverBatch` iterates over `files` to check that every batch file has a readable header, but inside the loop it reads `files.back()` instead of the current loop variable `file`. As a result, only the last file's header is ever validated, and `total_rows`/`total_bytes` are accumulated N times from that single file. When the abnormal-shutdown recovery path runs against a batch where the last `.bin` is intact but a middle one is corrupted, `recoverBatch` returns `true`, then `sendBatch` fails on the corrupted file and the ENTIRE batch (including the intact files) is moved to `broken/`, silently losing those rows. Use the loop variable so each file's header is validated on its own and broken middle files cause recovery to return `false`, after which `current_batch.txt` is removed and the surviving files are re-processed individually. Closes ClickHouse#101745
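The failure mode described above can be illustrated with a small Python model. This is only a sketch under stated assumptions, not ClickHouse code: `BatchFile`, `recover_batch_buggy`, `recover_batch_fixed`, and `process` are hypothetical names, and a file's header is reduced to a single `header_ok` flag.

```python
from dataclasses import dataclass


@dataclass
class BatchFile:
    """Toy stand-in for a queued .bin file (hypothetical, not a ClickHouse type)."""
    name: str
    rows: int
    header_ok: bool  # False models a truncated or corrupted file


def recover_batch_buggy(files):
    """Models the bug: every iteration inspects the last file."""
    total_rows = 0
    for _file in files:
        checked = files[-1]  # bug: should be the loop variable
        if not checked.header_ok:
            return False, 0
        total_rows += checked.rows
    return True, total_rows


def recover_batch_fixed(files):
    """Models the fix: each file is validated on its own."""
    total_rows = 0
    for file in files:
        if not file.header_ok:
            return False, 0
        total_rows += file.rows
    return True, total_rows


def process(files, recover):
    """Return (delivered, broken) file names under this toy model."""
    ok, _ = recover(files)
    if ok:
        # The batch is sent as a unit: one unreadable file fails the whole
        # send, and every file of the batch is then treated as broken.
        if all(f.header_ok for f in files):
            return [f.name for f in files], []
        return [], [f.name for f in files]
    # Recovery rejected the batch, so files are handled one by one and
    # only the unreadable ones are treated as broken.
    delivered = [f.name for f in files if f.header_ok]
    broken = [f.name for f in files if not f.header_ok]
    return delivered, broken


# Intact last file, corrupted middle file: the scenario from the description.
files = [
    BatchFile("1.bin", rows=10, header_ok=True),
    BatchFile("2.bin", rows=20, header_ok=False),
    BatchFile("3.bin", rows=30, header_ok=True),
]
```

In this model the buggy variant accepts the batch and counts the last file's rows three times, so `process` marks all three files as broken. The fixed variant rejects the batch, and only `2.bin` ends up broken.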
|
cc @azat @devcrafter — could you review this? One-character fix for the Pre-PR validation gate (per TASK.md Phase 4 Step 9):
session-id: cron:clickhouse-ci-task-worker:20260518-221500 |
|
Workflow [PR], commit [75d0404] Summary: ✅
AI Review
Summary
This PR fixes a real recovery correctness bug in
Final Verdict
Status: ✅ Approve |
|
@groeneai fix CI failures |
…bin files The regression test simulated an abnormal shutdown via `stop_clickhouse()`, which sends `SIGTERM` and triggers `StorageDistributed::flushAndPrepareForShutdown`. That helper calls `flushClusterNodesAllDataImpl(..., flush=true)` whenever `flush_on_detach=1`, which is mandatory together with `background_insert_batch=1`. As a result, the three queued `.bin` files were flushed to `node_shard` during graceful shutdown — bypassing `SYSTEM STOP DISTRIBUTED SENDS` — and the simulated abnormal state was set up against an already-emptied queue. After restart, the only remaining file was the manually-truncated `2.bin`, which `recoverBatch` correctly rejected and `markAsBroken` moved to `broken/`. The shard, however, already contained rows 1, 2, and 3 from the pre-shutdown flush, so the assertion `rows == ['1', '3']` failed with the actual rows `['1', '2', '3']`. Switch to `stop_clickhouse(kill=True)` so the server is `SIGKILL`-ed before any flush can happen, and the three queued `.bin` files survive on disk for the recovery scenario the test wants to exercise. Also sanity-check that the three files are still present right after the kill, so future regressions in the helper or in the shutdown ordering fail with a clear message rather than at the final assertion. The production-code fix in `DistributedAsyncInsertBatch.cpp` is unchanged. CI report: ClickHouse#105281
|
Hi @tiandiwonder, thanks for the approval and the heads-up on CI. Triage of the 7 failing jobs:
1) Six integration test failures: all the same failure in the new regression test (real, PR-caused)
All six integration jobs failed identically on
Root cause: the new test, not the C++ fix. The test used
Server log from one of the failing CI runs confirms this exactly:
By the time the test wrote
Fix pushed (5597598): switch to
2)
|
| # | Job | Cause | Action |
|---|---|---|---|
| 1-6 | 6× integration test | New test uses graceful shutdown which bypasses SYSTEM STOP DISTRIBUTED SENDS and flushes the queue | Fixed in 5597598 (kill=True) |
| 7 | Unit tests (msan, function_prop_fuzzer) / FunctionsStress | Chronic trunk flake, #104877 | None; wait for re-run |
CI re-run is in flight on the new head. PTAL once it goes green.
… setup Per @azat's review on ClickHouse#105281: use a single-node cluster with `prefer_localhost_replica = 0` to force the async-insert batch queue path (instead of the local-shortcut), so the regression scenario only needs one Docker container instead of two. The semantics are unchanged - the test still creates three `.bin` files, hard-kills the server, plants a `current_batch.txt` plus a truncated middle file, restarts, and verifies that `recoverBatch` detects the broken middle file and delivers the two intact files to the (now self-targeted) `local` table. Local run on a binary without the `recoverBatch` fix on this branch hits the expected end-of-test assertion (`[] == ['1', '3']`), proving the new scaffolding still produces the three queued files, the hard kill preserves them, the restart succeeds, and the final delivery correctness check still catches the bug. CI will validate the with-fix branch.
|
Thanks @azat, your commit
Local validation against a binary on this branch without the
That confirms the new scaffolding still produces the three queued
Pre-PR validation gate re-attested on the refactored test (per TASK.md Phase 4 Step 9):
Session: |
LLVM Coverage Report
Changed lines: 100.00% (4/4) · Uncovered code |
|
Hi, could this be a backport candidate for
Affected code:
Why this might apply: This is a data loss bug in
Caveat: I verified that the modified code exists in those branches, but not whether the bug is actually reachable there; a new caller in master may be what makes this newly observable. If older branches can't reach the affected path, please ignore this; otherwise consider |
The strict equality output == b"0\t0\n1\t0\n2\t0\n3\t0\n" assumed that exactly two blocks (4 rows) would reach the client before the cancel signal interrupted execution. The exact number depends on how the server-side block production races against cancel propagation, which is timing-sensitive under load. The flakiness became visible in Integration tests (amd_tsan, 6/6) after PR ClickHouse#105281 (test_distributed_async_insert_batch_recovery) shifted test_grpc_protocol from the lighter 5/6 shard into 6/6 via the deterministic round-robin assignment in get_optimal_test_batch. Relax the assertion to verify the real invariants: cancel was delivered, output is a strict prefix of the full 10-row result, and execution was actually interrupted (fewer than 10 rows received). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
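The relaxed invariants from this commit message can be sketched as a small Python check. This is a minimal illustration, not the actual test code: `is_valid_cancelled_output` and `FULL_RESULT` are hypothetical names, and the shape of the full 10-row result is an assumption extrapolated from the 4-row literal quoted above.

```python
# Assumed full result: rows "0\t0" through "9\t0", extrapolated from the
# 4-row literal in the commit message (not confirmed by the thread).
FULL_RESULT = b"".join(b"%d\t0\n" % i for i in range(10))


def is_valid_cancelled_output(output: bytes, full_result: bytes, cancelled: bool) -> bool:
    """Check the three invariants instead of an exact row count."""
    if not cancelled:
        return False  # the cancel signal must have been delivered
    if not full_result.startswith(output):
        return False  # received rows must be a prefix of the full result
    # Execution must have been interrupted: fewer rows than the full result.
    return output.count(b"\n") < full_result.count(b"\n")
```

A check of this shape accepts any number of delivered blocks before the cancel takes effect, which is the property that varies with timing under load.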
Fixes #101745.
`DistributedAsyncInsertBatch::recoverBatch` iterates over `files` to validate that every batch file has a readable header, but inside the loop it reads `files.back()` instead of the current loop variable `file`. As a result, only the last file's header is ever validated; the loop just re-reads that one file N times and accumulates its rows N times into `total_rows`/`total_bytes`.

If a `Distributed` table is mid-batch when the server is killed (between `unlink(*.bin)` and `unlink(current_batch.txt)`) and the last `.bin` in the batch is intact while a middle one is corrupted, `recoverBatch` returns `true`, `sendBatch` is then called, hits the corrupted file, throws, and the whole batch (including the intact files) is moved to `broken/`. The rows in the intact files are silently lost.

Using the loop variable makes each file's header validated on its own, so a broken middle file causes recovery to return `false`. `current_batch.txt` is then removed by the unconditional `fs::remove` in `processFilesWithBatching`, and the surviving files are re-processed individually through the normal pending-files path, where only the actually broken file is moved to `broken/`. The rest reach the remote shard.

Bug originally found by automated review of #72939 (cc @clickgapai), confirmed on master at `f86671aa80af`. Triage requested by @azat in #101745.

A new integration test `test_distributed_async_insert_batch_recovery` simulates the abnormal-shutdown state by stopping `node_dist`, manually writing a `current_batch.txt` referencing three `.bin` files, truncating the middle one to 0 bytes, then restarting and verifying that the two intact files reach `node_shard` and only the corrupted file ends up in `broken/`.

Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Fix silent data loss in `Distributed` async inserts when recovering from an abnormal shutdown: if the last `.bin` file in a saved batch was intact but a middle one was corrupted, `DistributedAsyncInsertBatch::recoverBatch` would only validate the last file's header and then `sendBatch` would mark the entire batch (including the intact files) as broken, losing their rows. Each file's header is now validated individually so only the actually broken file is moved to `broken/` and the surviving rows reach the remote shard.
Documentation entry for user-facing changes