Skip to content

Fix silent data loss in DistributedAsyncInsertBatch::recoverBatch when middle file is broken#105281

Merged
azat merged 4 commits into
ClickHouse:masterfrom
groeneai:groeneai/fix-distributed-async-insert-batch-recover-101745
May 19, 2026
Merged

Fix silent data loss in DistributedAsyncInsertBatch::recoverBatch when middle file is broken#105281
azat merged 4 commits into
ClickHouse:masterfrom
groeneai:groeneai/fix-distributed-async-insert-batch-recover-101745

Conversation

@groeneai
Copy link
Copy Markdown
Contributor

Fixes #101745.

DistributedAsyncInsertBatch::recoverBatch iterates over files to validate
that every batch file has a readable header, but inside the loop it reads
files.back() instead of the current loop variable file. As a result, only
the last file's header is ever validated; the loop just re-reads that one file
N times and accumulates its rows N times into total_rows/total_bytes.

If a Distributed table is mid-batch when the server is killed (between
unlink(*.bin) and unlink(current_batch.txt)) and the last .bin in the
batch is intact while a middle one is corrupted, recoverBatch returns
true, sendBatch is then called, hits the corrupted file, throws, and
the whole batch — including the intact files — is moved to broken/. The
rows in the intact files are silently lost.

Using the loop variable makes each file's header validated on its own, so a
broken middle file causes recovery to return false. current_batch.txt is
then removed by the unconditional fs::remove in
processFilesWithBatching, and the surviving files are re-processed
individually through the normal pending-files path, where only the actually
broken file is moved to broken/. The rest reach the remote shard.

Bug originally found by automated review of #72939 (cc @clickgapai), confirmed
on master at f86671aa80af. Triage requested by @azat in #101745.

A new integration test test_distributed_async_insert_batch_recovery
simulates the abnormal-shutdown state by stopping node_dist, manually
writing a current_batch.txt referencing three .bin files, truncating the
middle one to 0 bytes, then restarting and verifying that the two intact
files reach node_shard and only the corrupted file ends up in broken/.

Changelog category (leave one):

  • Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Fix silent data loss in Distributed async inserts when recovering from an abnormal shutdown: if the last .bin file in a saved batch was intact but a middle one was corrupted, DistributedAsyncInsertBatch::recoverBatch would only validate the last file's header and then sendBatch would mark the entire batch — including the intact files — as broken, losing their rows. Each file's header is now validated individually so only the actually broken file is moved to broken/ and the surviving rows reach the remote shard.

Documentation entry for user-facing changes

  • Documentation entry needed

`recoverBatch` iterates over `files` to check that every batch file has a
readable header, but inside the loop it reads `files.back()` instead of
the current loop variable `file`. As a result, only the last file's
header is ever validated, and `total_rows`/`total_bytes` are accumulated
N times from that single file.

When the abnormal-shutdown recovery path runs against a batch where the
last `.bin` is intact but a middle one is corrupted, `recoverBatch`
returns `true`, then `sendBatch` fails on the corrupted file and the
ENTIRE batch (including the intact files) is moved to `broken/`,
silently losing those rows.

Use the loop variable so each file's header is validated on its own and
broken middle files cause recovery to return `false`, after which
`current_batch.txt` is removed and the surviving files are re-processed
individually.

Closes ClickHouse#101745
@groeneai
Copy link
Copy Markdown
Contributor Author

cc @azat @devcrafter — could you review this? One-character fix for the recoverBatch copy-paste bug you flagged in #101745 (files.back()file inside for (const auto & file : files)), plus a regression test that simulates the abnormal-shutdown state with a corrupted middle .bin file and verifies the intact files survive.

Pre-PR validation gate (per TASK.md Phase 4 Step 9):

  • (a) Deterministic repro: integration test test_distributed_async_insert_batch_recovery simulates the exact abnormal-shutdown state (manual current_batch.txt + truncated middle .bin).
  • (b) Root cause: recoverBatch iterates for (const auto & file : files) but reads files.back() inside the loop, so only the last file's header is validated; total_rows/total_bytes are accumulated N times from that one file. Confirmed at src/Storages/Distributed/DistributedAsyncInsertBatch.cpp:222 on master.
  • (c) Fix matches root cause: single-character change — read file (loop variable) instead of files.back().
  • (d) Test intent preserved: new test added; no existing tests modified.
  • (e) Both directions: with the fix, recoverBatch correctly detects the broken middle file and returns false, so current_batch.txt is removed (per the unconditional fs::remove in processFilesWithBatching) and the surviving files are re-processed individually through the normal pending-files loop. Without the fix, recoverBatch returns true on an intact last file and sendBatch then marks every file in the batch as broken — losing the intact files. Local build (clang-21, debug) passes; full integration-test run will be exercised in CI.
  • (f) Fix is general, not narrow: this is the only call site in DistributedAsyncInsertBatch.cpp with this pattern — sendBatch (line 258), sendSeparateFiles (line 316), and serialize (line 177) all correctly use the loop variable.

session-id: cron:clickhouse-ci-task-worker:20260518-221500

@alexey-milovidov alexey-milovidov added the can be tested Allows running workflows for external contributors label May 18, 2026
@clickhouse-gh
Copy link
Copy Markdown
Contributor

clickhouse-gh Bot commented May 18, 2026

Workflow [PR], commit [75d0404]

Summary:


AI Review

Summary

This PR fixes a real recovery correctness bug in DistributedAsyncInsertBatch::recoverBatch (validating file instead of repeatedly validating files.back()), and adds a focused integration regression test that reproduces abnormal-shutdown recovery with a corrupted middle batch file. The current code at PR head matches the stated contract, prior discussion points were addressed in the updated test layout, and the current Praktika report for this PR shows no failed checks.

Final Verdict

Status: ✅ Approve

@clickhouse-gh clickhouse-gh Bot added the pr-bugfix Pull request with bugfix, not backported by default label May 18, 2026
@tiandiwonder tiandiwonder self-assigned this May 19, 2026
@tiandiwonder
Copy link
Copy Markdown
Contributor

@groeneai fix CI failures

…bin files

The regression test simulated an abnormal shutdown via `stop_clickhouse()`,
which sends `SIGTERM` and triggers
`StorageDistributed::flushAndPrepareForShutdown`. That helper calls
`flushClusterNodesAllDataImpl(..., flush=true)` whenever
`flush_on_detach=1`, which is mandatory together with
`background_insert_batch=1`. As a result, the three queued `.bin` files
were flushed to `node_shard` during graceful shutdown — bypassing
`SYSTEM STOP DISTRIBUTED SENDS` — and the simulated abnormal state was
set up against an already-emptied queue.

After restart, the only remaining file was the manually-truncated
`2.bin`, which `recoverBatch` correctly rejected and `markAsBroken`
moved to `broken/`. The shard, however, already contained rows 1, 2,
and 3 from the pre-shutdown flush, so the assertion
`rows == ['1', '3']` failed with the actual rows `['1', '2', '3']`.

Switch to `stop_clickhouse(kill=True)` so the server is `SIGKILL`-ed
before any flush can happen, and the three queued `.bin` files survive
on disk for the recovery scenario the test wants to exercise. Also
sanity-check that the three files are still present right after the
kill, so future regressions in the helper or in the shutdown ordering
fail with a clear message rather than at the final assertion.

The production-code fix in `DistributedAsyncInsertBatch.cpp` is
unchanged.

CI report: ClickHouse#105281
@groeneai
Copy link
Copy Markdown
Contributor Author

Hi @tiandiwonder, thanks for the approval and the heads-up on CI. Triage of the 7 failing jobs:

1) Six integration test failures — all the same failure in the new regression test (real, PR-caused)

All six integration jobs failed identically on test_distributed_async_insert_batch_recovery/test.py::test_recover_batch_with_broken_middle_file:

assert rows == [str(i) for i in indices if i != middle_idx]
AssertionError: Expected the two intact files to be delivered, got: ['1', '2', '3']
assert ['1', '2', '3'] == ['1', '3']

Root cause: the new test, not the C++ fix. The test used stop_clickhouse() (graceful, SIGTERM) to simulate an abnormal shutdown. Graceful shutdown calls StorageDistributed::flushAndPrepareForShutdown, which calls flushClusterNodesAllDataImpl(..., flush=true) whenever flush_on_detach=1. And flush_on_detach=0 is forbidden together with background_insert_batch=1 (enforced in the StorageDistributed constructor). So graceful shutdown flushed all three .bin files to the shard, bypassing SYSTEM STOP DISTRIBUTED SENDS.

Server log from one of the failing CI runs confirms this exactly:

01:40:52.796 SYSTEM STOP DISTRIBUTED SENDS dist
01:40:53.249 INSERT (1)
01:40:53.306 Skipping send data over distributed table.     ← blocker works for the bg path
01:40:53.720 INSERT (2)
01:40:54.195 INSERT (3)
01:40:54.996 Sending a batch of 3 files to node_shard:9000 (3.00 rows, 12.00 B bytes).
01:40:55.066 Sent a batch of 3 files (took 72 ms).
01:40:55.066 Pending INSERT blocks flushed, took 73 ms.    ← from flushAndPrepareForShutdown
01:40:55.066 Start preparing for shutdown

By the time the test wrote current_batch.txt and truncate -s 0 2.bin, the queue was empty (only the truncated 2.bin survived, recreated by truncate) and the shard already had rows 1, 2, 3. The C++ recoverBatch fix actually worked — the server log shows Renamed .../2.bin to .../broken/2.bin — but the assertion still failed because the shard had been populated before the simulated abnormal state existed.

Fix pushed (5597598): switch to stop_clickhouse(kill=True) to SIGKILL the server before any graceful flush can run, plus a sanity check that the three .bin files survive the kill. The production-code change in DistributedAsyncInsertBatch.cpp is unchanged.

2) Unit tests (msan, function_prop_fuzzer) — known chronic flake (not PR-related)

FunctionsStress FAIL
Logical error: 'Function writeSlice expects same column types for GenericValueSlice and GenericArraySink.'

This is the chronic FunctionsStress flake tracked under #104877. CIDB over the last 30 days:

  • 116 failures across 54 distinct PRs (including 10 master hits)
  • The exact writeSlice ... GenericValueSlice/GenericArraySink pattern hits all three sanitizer variants (asan, msan, tsan) on function_prop_fuzzer across 8+ unrelated PRs.
  • This PR touches only src/Storages/Distributed/DistributedAsyncInsertBatch.cpp and the new test — nothing in src/Functions//src/AggregateFunctions/, so the failure is structurally impossible to attribute to this PR.

No action needed on this PR; will retry on the next CI run.

Summary

# Job Cause Action
1-6 6× integration test New test uses graceful shutdown which bypasses SYSTEM STOP DISTRIBUTED SENDS and flushes the queue Fixed in 5597598 (kill=True)
7 Unit tests (msan, function_prop_fuzzer) / FunctionsStress Chronic trunk flake, #104877 None — wait for re-run

CI re-run is in flight on the new head. PTAL once it goes green.

Comment thread tests/integration/test_distributed_async_insert_batch_recovery/test.py Outdated
Comment thread tests/integration/test_distributed_async_insert_batch_recovery/test.py Outdated
Comment thread tests/integration/test_distributed_async_insert_batch_recovery/test.py Outdated
@azat azat self-assigned this May 19, 2026
… setup

Per @azat's review on ClickHouse#105281: use a single-node cluster with
`prefer_localhost_replica = 0` to force the async-insert batch queue
path (instead of the local-shortcut), so the regression scenario only
needs one Docker container instead of two. The semantics are unchanged
- the test still creates three `.bin` files, hard-kills the server,
plants a `current_batch.txt` plus a truncated middle file, restarts,
and verifies that `recoverBatch` detects the broken middle file and
delivers the two intact files to the (now self-targeted) `local`
table.

Local run on a binary without the `recoverBatch` fix on this branch
hits the expected end-of-test assertion (`[] == ['1', '3']`), proving
the new scaffolding still produces the three queued files, the hard
kill preserves them, the restart succeeds, and the final delivery
correctness check still catches the bug. CI will validate the
with-fix branch.
@groeneai
Copy link
Copy Markdown
Contributor Author

Thanks @azat — your commit 8c540555 covers comments 1+2, and I've now applied comment 3 in commit 75d0404b: collapsed the two-node setup to a single node with prefer_localhost_replica = 0 on each INSERT to force the async-insert batch queue path. The test_cluster definition now points at the same node via <host>node</host>, and the test asserts both directions on that one node.

Local validation against a binary on this branch without the recoverBatch fix applied to the source tree:

test_distributed_async_insert_batch_recovery/test.py:145
    assert rows == [str(i) for i in indices if i != middle_idx]
E   AssertionError: Expected the two intact files to be delivered, got: []
E   assert [] == ['1', '3']

That confirms the new scaffolding still produces the three queued .bin files (so prefer_localhost_replica = 0 correctly routes through the async batch queue instead of the local-shortcut path), kill=True preserves them on disk, the restart + SYSTEM FLUSH DISTRIBUTED cycle exercises the recovery path, and the final delivery-correctness assertion still catches the bug (rows lost when the C++ fix is absent). CI will validate the with-fix direction on groeneai/fix-distributed-async-insert-batch-recover-101745.

Pre-PR validation gate re-attested on the refactored test (per TASK.md Phase 4 Step 9):

  • (a) Deterministic repro: yes — without the source-side fix on this branch, the assertion at line 145 deterministically fails with [] == ['1', '3'].
  • (b) Root cause: unchanged — DistributedAsyncInsertBatch::recoverBatch iterates over files but reads files.back() inside the loop, so it only validates the last file's header. When the middle file is corrupted but the last is intact, recovery returns true, sendBatch then fails on the broken file, and the whole batch (including the intact files) is moved to broken/ — silent data loss for the intact rows.
  • (c) Fix matches root cause: yes — the C++ change (commit 4219ef06) reads the current file variable inside the loop, so the broken middle file is detected during recovery, current_batch.txt is removed, and the intact files are re-processed individually.
  • (d) Test intent preserved: yes — the single-node setup still exercises the full async-insert batch queue: three separate INSERTs produce three separate .bin files, a hard kill leaves the queue intact for the abnormal-shutdown reproducer, and the final assertion still catches the silent data-loss symptom (lost rows would land in [], not ['1', '3']).
  • (e) Both directions demonstrated: failing direction shown locally above. CI on the new HEAD will validate the passing direction with the fix-included binary.

Session: cron:clickhouse-ci-task-worker:20260519-070000

@azat azat enabled auto-merge May 19, 2026 07:19
@clickhouse-gh
Copy link
Copy Markdown
Contributor

clickhouse-gh Bot commented May 19, 2026

LLVM Coverage Report

Metric Baseline Current Δ
Lines 84.20% 84.20% +0.00%
Functions 91.40% 91.40% +0.00%
Branches 76.60% 76.60% +0.00%

Changed lines: 100.00% (4/4) · Uncovered code

Full report · Diff report

@azat azat added this pull request to the merge queue May 19, 2026
Merged via the queue into ClickHouse:master with commit 4a1dff4 May 19, 2026
165 checks passed
@clickgapai
Copy link
Copy Markdown
Contributor

Hi — could this be a backport candidate for 26.4, 26.3 (LTS), 26.2, 25.8 (LTS)? No backport label was found, and the affected code appears to exist in those branches based on a file/function presence check.

Affected code: DistributedAsyncInsertBatch::recoverBatch in src/Storages/Distributed/DistributedAsyncInsertBatch.cpp — introduced in 25.8.

Why this might apply: This is a data loss bug in DistributedAsyncInsertBatch::recoverBatch. The bug was introduced in PR #72939 (merged 2025-03-08) when the file was first created with the recoverBatch function. Since this predates all supported release branches (25.8 LTS released 2025-08-29), all branches contain the buggy code and need the fix. Silent data loss after abnormal server shutdown is a P1 severity issue.

Caveat: I verified that the modified code exists in those branches, but not whether the bug is actually reachable there — a new caller in master may be what makes this newly observable. If older branches can't reach the affected path, please ignore this; otherwise consider pr-must-backport or a version label like v26.4-must-backport.

@robot-ch-test-poll4 robot-ch-test-poll4 added the pr-synced-to-cloud The PR is synced to the cloud repo label May 19, 2026
alexey-milovidov pushed a commit to jh0x/ClickHouse that referenced this pull request May 20, 2026
The strict equality output == b"0\t0\n1\t0\n2\t0\n3\t0\n" assumed that
exactly two blocks (4 rows) would reach the client before the cancel
signal interrupted execution. The exact number depends on how the
server-side block production races against cancel propagation, which is
timing-sensitive under load.

The flakiness became visible in Integration tests (amd_tsan, 6/6) after
PR ClickHouse#105281 (test_distributed_async_insert_batch_recovery) shifted
test_grpc_protocol from the lighter 5/6 shard into 6/6 via the
deterministic round-robin assignment in get_optimal_test_batch.

Relax the assertion to verify the real invariants: cancel was delivered,
output is a strict prefix of the full 10-row result, and execution was
actually interrupted (fewer than 10 rows received).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

can be tested Allows running workflows for external contributors pr-bugfix Pull request with bugfix, not backported by default pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

recoverBatch() uses files.back() instead of loop variable 'file', causing all batch files to be marked as broken when a middle file is corrupted

6 participants