Fix silent data loss in DistributedAsyncInsertBatch::recoverBatch when middle file is broken by groeneai · Pull Request #105281 · ClickHouse/ClickHouse

groeneai · 2026-05-18T22:28:04Z

DistributedAsyncInsertBatch::recoverBatch iterates over files to validate
that every batch file has a readable header, but inside the loop it reads
files.back() instead of the current loop variable file. As a result, only
the last file's header is ever validated; the loop just re-reads that one file
N times and accumulates its rows N times into total_rows/total_bytes.

If a Distributed table is mid-batch when the server is killed (between
unlink(*.bin) and unlink(current_batch.txt)) and the last .bin in the
batch is intact while a middle one is corrupted, recoverBatch returns
true, sendBatch is then called, hits the corrupted file, throws, and
the whole batch — including the intact files — is moved to broken/. The
rows in the intact files are silently lost.

Using the loop variable makes each file's header validated on its own, so a
broken middle file causes recovery to return false. current_batch.txt is
then removed by the unconditional fs::remove in
processFilesWithBatching, and the surviving files are re-processed
individually through the normal pending-files path, where only the actually
broken file is moved to broken/. The rest reach the remote shard.

Bug originally found by automated review of #72939 (cc @clickgapai), confirmed
on master at f86671aa80af. Triage requested by @azat in #101745.

A new integration test test_distributed_async_insert_batch_recovery
simulates the abnormal-shutdown state by stopping node_dist, manually
writing a current_batch.txt referencing three .bin files, truncating the
middle one to 0 bytes, then restarting and verifying that the two intact
files reach node_shard and only the corrupted file ends up in broken/.

Changelog category (leave one):

Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Fix silent data loss in Distributed async inserts when recovering from an abnormal shutdown: if the last .bin file in a saved batch was intact but a middle one was corrupted, DistributedAsyncInsertBatch::recoverBatch would only validate the last file's header and then sendBatch would mark the entire batch — including the intact files — as broken, losing their rows. Each file's header is now validated individually so only the actually broken file is moved to broken/ and the surviving rows reach the remote shard.

Documentation entry for user-facing changes

Documentation entry needed

`recoverBatch` iterates over `files` to check that every batch file has a readable header, but inside the loop it reads `files.back()` instead of the current loop variable `file`. As a result, only the last file's header is ever validated, and `total_rows`/`total_bytes` are accumulated N times from that single file. When the abnormal-shutdown recovery path runs against a batch where the last `.bin` is intact but a middle one is corrupted, `recoverBatch` returns `true`, then `sendBatch` fails on the corrupted file and the ENTIRE batch (including the intact files) is moved to `broken/`, silently losing those rows. Use the loop variable so each file's header is validated on its own and broken middle files cause recovery to return `false`, after which `current_batch.txt` is removed and the surviving files are re-processed individually. Closes ClickHouse#101745

groeneai · 2026-05-18T22:28:47Z

cc @azat @devcrafter — could you review this? One-character fix for the recoverBatch copy-paste bug you flagged in #101745 (files.back() → file inside for (const auto & file : files)), plus a regression test that simulates the abnormal-shutdown state with a corrupted middle .bin file and verifies the intact files survive.

Pre-PR validation gate (per TASK.md Phase 4 Step 9):

(a) Deterministic repro: integration test test_distributed_async_insert_batch_recovery simulates the exact abnormal-shutdown state (manual current_batch.txt + truncated middle .bin).
(b) Root cause: recoverBatch iterates for (const auto & file : files) but reads files.back() inside the loop, so only the last file's header is validated; total_rows/total_bytes are accumulated N times from that one file. Confirmed at src/Storages/Distributed/DistributedAsyncInsertBatch.cpp:222 on master.
(c) Fix matches root cause: single-character change — read file (loop variable) instead of files.back().
(d) Test intent preserved: new test added; no existing tests modified.
(e) Both directions: with the fix, recoverBatch correctly detects the broken middle file and returns false, so current_batch.txt is removed (per the unconditional fs::remove in processFilesWithBatching) and the surviving files are re-processed individually through the normal pending-files loop. Without the fix, recoverBatch returns true on an intact last file and sendBatch then marks every file in the batch as broken — losing the intact files. Local build (clang-21, debug) passes; full integration-test run will be exercised in CI.
(f) Fix is general, not narrow: this is the only call site in DistributedAsyncInsertBatch.cpp with this pattern — sendBatch (line 258), sendSeparateFiles (line 316), and serialize (line 177) all correctly use the loop variable.

session-id: cron:clickhouse-ci-task-worker:20260518-221500

clickhouse-gh · 2026-05-18T23:24:00Z

Workflow [PR], commit [75d0404]

Summary: ✅

AI Review

Summary

This PR fixes a real recovery correctness bug in DistributedAsyncInsertBatch::recoverBatch (validating file instead of repeatedly validating files.back()), and adds a focused integration regression test that reproduces abnormal-shutdown recovery with a corrupted middle batch file. The current code at PR head matches the stated contract, prior discussion points were addressed in the updated test layout, and the current Praktika report for this PR shows no failed checks.

Final Verdict

Status: ✅ Approve

tiandiwonder · 2026-05-19T01:55:00Z

@groeneai fix CI failures

…bin files The regression test simulated an abnormal shutdown via `stop_clickhouse()`, which sends `SIGTERM` and triggers `StorageDistributed::flushAndPrepareForShutdown`. That helper calls `flushClusterNodesAllDataImpl(..., flush=true)` whenever `flush_on_detach=1`, which is mandatory together with `background_insert_batch=1`. As a result, the three queued `.bin` files were flushed to `node_shard` during graceful shutdown — bypassing `SYSTEM STOP DISTRIBUTED SENDS` — and the simulated abnormal state was set up against an already-emptied queue. After restart, the only remaining file was the manually-truncated `2.bin`, which `recoverBatch` correctly rejected and `markAsBroken` moved to `broken/`. The shard, however, already contained rows 1, 2, and 3 from the pre-shutdown flush, so the assertion `rows == ['1', '3']` failed with the actual rows `['1', '2', '3']`. Switch to `stop_clickhouse(kill=True)` so the server is `SIGKILL`-ed before any flush can happen, and the three queued `.bin` files survive on disk for the recovery scenario the test wants to exercise. Also sanity-check that the three files are still present right after the kill, so future regressions in the helper or in the shutdown ordering fail with a clear message rather than at the final assertion. The production-code fix in `DistributedAsyncInsertBatch.cpp` is unchanged. CI report: ClickHouse#105281

groeneai · 2026-05-19T03:02:34Z

Hi @tiandiwonder, thanks for the approval and the heads-up on CI. Triage of the 7 failing jobs:

1) Six integration test failures — all the same failure in the new regression test (real, PR-caused)

All six integration jobs failed identically on test_distributed_async_insert_batch_recovery/test.py::test_recover_batch_with_broken_middle_file:

assert rows == [str(i) for i in indices if i != middle_idx]
AssertionError: Expected the two intact files to be delivered, got: ['1', '2', '3']
assert ['1', '2', '3'] == ['1', '3']

Root cause: the new test, not the C++ fix. The test used stop_clickhouse() (graceful, SIGTERM) to simulate an abnormal shutdown. Graceful shutdown calls StorageDistributed::flushAndPrepareForShutdown, which calls flushClusterNodesAllDataImpl(..., flush=true) whenever flush_on_detach=1. And flush_on_detach=0 is forbidden together with background_insert_batch=1 (enforced in the StorageDistributed constructor). So graceful shutdown flushed all three .bin files to the shard, bypassing SYSTEM STOP DISTRIBUTED SENDS.

Server log from one of the failing CI runs confirms this exactly:

01:40:52.796 SYSTEM STOP DISTRIBUTED SENDS dist
01:40:53.249 INSERT (1)
01:40:53.306 Skipping send data over distributed table.     ← blocker works for the bg path
01:40:53.720 INSERT (2)
01:40:54.195 INSERT (3)
01:40:54.996 Sending a batch of 3 files to node_shard:9000 (3.00 rows, 12.00 B bytes).
01:40:55.066 Sent a batch of 3 files (took 72 ms).
01:40:55.066 Pending INSERT blocks flushed, took 73 ms.    ← from flushAndPrepareForShutdown
01:40:55.066 Start preparing for shutdown

By the time the test wrote current_batch.txt and truncate -s 0 2.bin, the queue was empty (only the truncated 2.bin survived, recreated by truncate) and the shard already had rows 1, 2, 3. The C++ recoverBatch fix actually worked — the server log shows Renamed .../2.bin to .../broken/2.bin — but the assertion still failed because the shard had been populated before the simulated abnormal state existed.

Fix pushed (5597598): switch to stop_clickhouse(kill=True) to SIGKILL the server before any graceful flush can run, plus a sanity check that the three .bin files survive the kill. The production-code change in DistributedAsyncInsertBatch.cpp is unchanged.

2) `Unit tests (msan, function_prop_fuzzer)` — known chronic flake (not PR-related)

FunctionsStress FAIL
Logical error: 'Function writeSlice expects same column types for GenericValueSlice and GenericArraySink.'

This is the chronic FunctionsStress flake tracked under #104877. CIDB over the last 30 days:

116 failures across 54 distinct PRs (including 10 master hits)
The exact writeSlice ... GenericValueSlice/GenericArraySink pattern hits all three sanitizer variants (asan, msan, tsan) on function_prop_fuzzer across 8+ unrelated PRs.
This PR touches only src/Storages/Distributed/DistributedAsyncInsertBatch.cpp and the new test — nothing in src/Functions//src/AggregateFunctions/, so the failure is structurally impossible to attribute to this PR.

No action needed on this PR; will retry on the next CI run.

Summary

#	Job	Cause	Action
1-6	6× integration test	New test uses graceful shutdown which bypasses `SYSTEM STOP DISTRIBUTED SENDS` and flushes the queue	Fixed in `5597598` (`kill=True`)
7	`Unit tests (msan, function_prop_fuzzer)` / `FunctionsStress`	Chronic trunk flake, #104877	None — wait for re-run

CI re-run is in flight on the new head. PTAL once it goes green.

…/test.py

@azat

… setup Per @azat's review on ClickHouse#105281: use a single-node cluster with `prefer_localhost_replica = 0` to force the async-insert batch queue path (instead of the local-shortcut), so the regression scenario only needs one Docker container instead of two. The semantics are unchanged - the test still creates three `.bin` files, hard-kills the server, plants a `current_batch.txt` plus a truncated middle file, restarts, and verifies that `recoverBatch` detects the broken middle file and delivers the two intact files to the (now self-targeted) `local` table. Local run on a binary without the `recoverBatch` fix on this branch hits the expected end-of-test assertion (`[] == ['1', '3']`), proving the new scaffolding still produces the three queued files, the hard kill preserves them, the restart succeeds, and the final delivery correctness check still catches the bug. CI will validate the with-fix branch.

groeneai · 2026-05-19T07:10:30Z

Thanks @azat — your commit 8c540555 covers comments 1+2, and I've now applied comment 3 in commit 75d0404b: collapsed the two-node setup to a single node with prefer_localhost_replica = 0 on each INSERT to force the async-insert batch queue path. The test_cluster definition now points at the same node via <host>node</host>, and the test asserts both directions on that one node.

Local validation against a binary on this branch without the recoverBatch fix applied to the source tree:

test_distributed_async_insert_batch_recovery/test.py:145
    assert rows == [str(i) for i in indices if i != middle_idx]
E   AssertionError: Expected the two intact files to be delivered, got: []
E   assert [] == ['1', '3']

That confirms the new scaffolding still produces the three queued .bin files (so prefer_localhost_replica = 0 correctly routes through the async batch queue instead of the local-shortcut path), kill=True preserves them on disk, the restart + SYSTEM FLUSH DISTRIBUTED cycle exercises the recovery path, and the final delivery-correctness assertion still catches the bug (rows lost when the C++ fix is absent). CI will validate the with-fix direction on groeneai/fix-distributed-async-insert-batch-recover-101745.

Pre-PR validation gate re-attested on the refactored test (per TASK.md Phase 4 Step 9):

(a) Deterministic repro: yes — without the source-side fix on this branch, the assertion at line 145 deterministically fails with [] == ['1', '3'].
(b) Root cause: unchanged — DistributedAsyncInsertBatch::recoverBatch iterates over files but reads files.back() inside the loop, so it only validates the last file's header. When the middle file is corrupted but the last is intact, recovery returns true, sendBatch then fails on the broken file, and the whole batch (including the intact files) is moved to broken/ — silent data loss for the intact rows.
(c) Fix matches root cause: yes — the C++ change (commit 4219ef06) reads the current file variable inside the loop, so the broken middle file is detected during recovery, current_batch.txt is removed, and the intact files are re-processed individually.
(d) Test intent preserved: yes — the single-node setup still exercises the full async-insert batch queue: three separate INSERTs produce three separate .bin files, a hard kill leaves the queue intact for the abnormal-shutdown reproducer, and the final assertion still catches the silent data-loss symptom (lost rows would land in [], not ['1', '3']).
(e) Both directions demonstrated: failing direction shown locally above. CI on the new HEAD will validate the passing direction with the fix-included binary.

Session: cron:clickhouse-ci-task-worker:20260519-070000

clickhouse-gh · 2026-05-19T10:20:32Z

LLVM Coverage Report

Metric	Baseline	Current	Δ
Lines	84.20%	84.20%	+0.00%
Functions	91.40%	91.40%	+0.00%
Branches	76.60%	76.60%	+0.00%

Changed lines: 100.00% (4/4) · Uncovered code

Full report · Diff report

clickgapai · 2026-05-19T11:22:21Z

Hi — could this be a backport candidate for 26.4, 26.3 (LTS), 26.2, 25.8 (LTS)? No backport label was found, and the affected code appears to exist in those branches based on a file/function presence check.

Affected code: DistributedAsyncInsertBatch::recoverBatch in src/Storages/Distributed/DistributedAsyncInsertBatch.cpp — introduced in 25.8.

Why this might apply: This is a data loss bug in DistributedAsyncInsertBatch::recoverBatch. The bug was introduced in PR #72939 (merged 2025-03-08) when the file was first created with the recoverBatch function. Since this predates all supported release branches (25.8 LTS released 2025-08-29), all branches contain the buggy code and need the fix. Silent data loss after abnormal server shutdown is a P1 severity issue.

Caveat: I verified that the modified code exists in those branches, but not whether the bug is actually reachable there — a new caller in master may be what makes this newly observable. If older branches can't reach the affected path, please ignore this; otherwise consider pr-must-backport or a version label like v26.4-must-backport.

The strict equality output == b"0\t0\n1\t0\n2\t0\n3\t0\n" assumed that exactly two blocks (4 rows) would reach the client before the cancel signal interrupted execution. The exact number depends on how the server-side block production races against cancel propagation, which is timing-sensitive under load. The flakiness became visible in Integration tests (amd_tsan, 6/6) after PR ClickHouse#105281 (test_distributed_async_insert_batch_recovery) shifted test_grpc_protocol from the lighter 5/6 shard into 6/6 via the deterministic round-robin assignment in get_optimal_test_batch. Relax the assertion to verify the real invariants: cancel was delivered, output is a strict prefix of the full 10-row result, and execution was actually interrupted (fewer than 10 rows received). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

groeneai mentioned this pull request May 18, 2026

recoverBatch() uses files.back() instead of loop variable 'file', causing all batch files to be marked as broken when a middle file is corrupted #101745

Closed

alexey-milovidov added the can be tested Allows running workflows for external contributors label May 18, 2026

clickhouse-gh Bot added the pr-bugfix Pull request with bugfix, not backported by default label May 18, 2026

tiandiwonder self-assigned this May 19, 2026

tiandiwonder approved these changes May 19, 2026

View reviewed changes

azat reviewed May 19, 2026

View reviewed changes

Comment thread tests/integration/test_distributed_async_insert_batch_recovery/test.py Outdated

azat reviewed May 19, 2026

View reviewed changes

Comment thread tests/integration/test_distributed_async_insert_batch_recovery/test.py Outdated

Update tests/integration/test_distributed_async_insert_batch_recovery…

8c54055

…/test.py

azat reviewed May 19, 2026

View reviewed changes

Comment thread tests/integration/test_distributed_async_insert_batch_recovery/test.py Outdated

azat self-assigned this May 19, 2026

azat approved these changes May 19, 2026

View reviewed changes

azat enabled auto-merge May 19, 2026 07:19

azat added this pull request to the merge queue May 19, 2026

Merged via the queue into ClickHouse:master with commit 4a1dff4 May 19, 2026
165 checks passed

robot-ch-test-poll4 added the pr-synced-to-cloud The PR is synced to the cloud repo label May 19, 2026

Algunenano mentioned this pull request May 20, 2026

Fix flaky test_grpc_protocol::test_cancel_while_generating_output #105415

Merged

1 task

groeneai mentioned this pull request May 20, 2026

Add system.constraints table #105337

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix silent data loss in DistributedAsyncInsertBatch::recoverBatch when middle file is broken#105281

Fix silent data loss in DistributedAsyncInsertBatch::recoverBatch when middle file is broken#105281
azat merged 4 commits into
ClickHouse:masterfrom
groeneai:groeneai/fix-distributed-async-insert-batch-recover-101745

groeneai commented May 18, 2026

Uh oh!

groeneai commented May 18, 2026

Uh oh!

clickhouse-gh Bot commented May 18, 2026 •

edited

Loading

Uh oh!

tiandiwonder commented May 19, 2026

Uh oh!

groeneai commented May 19, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

groeneai commented May 19, 2026

Uh oh!

clickhouse-gh Bot commented May 19, 2026

Uh oh!

Uh oh!

clickgapai commented May 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

Conversation

groeneai commented May 18, 2026

Changelog category (leave one):

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Uh oh!

groeneai commented May 18, 2026

Uh oh!

clickhouse-gh Bot commented May 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

AI Review

Summary

Final Verdict

Uh oh!

tiandiwonder commented May 19, 2026

Uh oh!

groeneai commented May 19, 2026

1) Six integration test failures — all the same failure in the new regression test (real, PR-caused)

2) Unit tests (msan, function_prop_fuzzer) — known chronic flake (not PR-related)

Summary

Uh oh!

Uh oh!

Uh oh!

Uh oh!

groeneai commented May 19, 2026

Uh oh!

clickhouse-gh Bot commented May 19, 2026

LLVM Coverage Report

Uh oh!

Uh oh!

clickgapai commented May 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

clickhouse-gh Bot commented May 18, 2026 •

edited

Loading

2) `Unit tests (msan, function_prop_fuzzer)` — known chronic flake (not PR-related)