TRUNCATE TABLE on a MergeTree table is not crash-atomic without an explicit BEGIN TRANSACTION. If the server is killed while TRUNCATE is in progress, the table is left in a partial state on restart — neither full nor empty. For a 5-part / 500-row table, the post-crash count is observed as 100, 200, 300, or 400 rows depending on where the kill landed. No error is logged.
What the user sees
-- Setup: 5 parts × 100 rows = 500 rows.
CREATE TABLE t (id UInt64, val UInt64) ENGINE = MergeTree ORDER BY id;
INSERT INTO t SELECT number, number*7 FROM numbers(  1, 100);
INSERT INTO t SELECT number, number*7 FROM numbers(101, 100);
INSERT INTO t SELECT number, number*7 FROM numbers(201, 100);
INSERT INTO t SELECT number, number*7 FROM numbers(301, 100);
INSERT INTO t SELECT number, number*7 FROM numbers(401, 100);
SELECT count() FROM t;  -- 500 (as expected)

-- Now: TRUNCATE TABLE t, then SIGKILL the server mid-TRUNCATE.
-- Restart the server cleanly.
SELECT count() FROM t;  -- 300 ← BUG: not 0, not 500
On-disk forensics after the crash:
data/store/<uuid>/all_1_1_1/ ← empty replacement part (committed)
data/store/<uuid>/all_2_2_1/ ← empty replacement part (committed)
data/store/<uuid>/all_3_3_0/ ← original source part still active (100 rows)
data/store/<uuid>/all_4_4_0/ ← original source part still active (100 rows)
data/store/<uuid>/all_5_5_0/ ← original source part still active (100 rows)
Why it happens
TRUNCATE doesn't directly delete parts. Instead it creates N empty "covering" replacement parts (one per source part at a higher mutation level) and renames them into the store one at a time. The rename loop is at src/Storages/MergeTree/MergeTreeData.cpp:8430:
void MergeTreeData::Transaction::renameParts()
{
    for (const auto & part_need_rename : precommitted_parts_need_rename)
    {
        LOG_TEST(data.log, "Renaming part to {}", part_need_rename->name);
        part_need_rename->renameTo(part_need_rename->name, true);
    }
    precommitted_parts_need_rename.clear();
}
A plain for loop with one rename(2) per part. No batched-atomic primitive, no journal, no startup recovery to roll forward or back. A SIGKILL between iterations leaves K replacements renamed in and N-K source parts still active. The recovery path loads what it sees: the K committed empty replacements "win" over their corresponding source parts (higher level), but the remaining N-K source parts stay active and visible to queries.
An strace of a clean, complete TRUNCATE confirms 5 individual renameat(2) calls for the empty replacements, plus 5 more for the source → delete_tmp_ cleanup.
Killing the server between any two of these renames leaves partial state.
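The partial state described above can be modelled without a server. The following is a minimal sketch, not ClickHouse code: plain directories stand in for parts, a `rows` file stands in for part data, and the names `rename_until_crash` and `visible_rows` are hypothetical helpers introduced only for this illustration. It assumes the simplified rule stated above, that a renamed-in level-1 part hides the level-0 part covering the same block.

```shell
#!/usr/bin/env bash
# Toy model of the rename loop. Directories stand in for parts; this is an
# illustration of the failure mode, not ClickHouse's real storage layout.
store=$(mktemp -d)

# Five "source parts" with 100 rows each, plus five prepared empty replacements.
for i in 1 2 3 4 5; do
    mkdir "$store/all_${i}_${i}_0"
    seq 1 100 > "$store/all_${i}_${i}_0/rows"
    mkdir "$store/tmp_empty_all_${i}_${i}_1"
    : > "$store/tmp_empty_all_${i}_${i}_1/rows"
done

# Rename replacements into place one at a time.
# Stops after $1 renames to simulate a SIGKILL between loop iterations.
rename_until_crash() {
    local crash_after=$1 renamed=0 i
    for i in 1 2 3 4 5; do
        if [ "$renamed" -ge "$crash_after" ]; then
            return 0
        fi
        mv "$store/tmp_empty_all_${i}_${i}_1" "$store/all_${i}_${i}_1"
        renamed=$((renamed + 1))
    done
}

# "Startup": for each block, the level-1 part wins if it was renamed in,
# otherwise the original level-0 part stays visible.
visible_rows() {
    local total=0 i n
    for i in 1 2 3 4 5; do
        if [ -d "$store/all_${i}_${i}_1" ]; then
            n=$(wc -l < "$store/all_${i}_${i}_1/rows")
        else
            n=$(wc -l < "$store/all_${i}_${i}_0/rows")
        fi
        total=$((total + n))
    done
    echo "$total"
}

rename_until_crash 2
rows_after_crash=$(visible_rows)
echo "visible rows after crash: $rows_after_crash"
```

Stopping after 2 of 5 renames leaves three source parts visible, which matches the 300-row count in the example above. Changing the argument to `rename_until_crash` moves the result through the other intermediate values.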
Reproduce
The kill window is narrow (the rename loop runs in milliseconds). A shell-level kill -9 with random ms-delays does not reliably hit it. Two ways to land the kill deterministically:
With gdb (simplest, no external code):
# Terminal 1: start server normally (without --daemon so gdb can attach).
clickhouse-server --config-file=<conf>

# Terminal 2: set up the data.
clickhouse-client -q "CREATE TABLE t (id UInt64, val UInt64) ENGINE = MergeTree ORDER BY id"
for i in 0 1 2 3 4; do
    clickhouse-client -q "INSERT INTO t SELECT number, number*7 FROM numbers($((i*100+1)), 100)"
done
clickhouse-client -q "SELECT count() FROM t"   # 500

# Terminal 3: attach gdb and break inside the rename loop.
gdb -p $(pidof clickhouse-server) -batch -ex "
  set pagination off
  set print pretty on
  break MergeTreeData.cpp:8435
  commands
    silent
    if \$_hit_bkpt_num > 0
      shell echo 'rename iteration hit'
    end
    continue
  end
  continue
" &

# Terminal 2: trigger TRUNCATE; the breakpoint will fire once per rename.
clickhouse-client -q "TRUNCATE TABLE t" &

# Terminal 4: after the breakpoint has hit 1-2 times, SIGKILL the server.
sleep 0.3 && kill -9 $(pidof clickhouse-server)

# Restart the server normally.
clickhouse-server --config-file=<conf> &
sleep 5
clickhouse-client -q "SELECT count() FROM t"   # 100/200/300/400: bug
With strace -e inject (kernel-level, no gdb needed):
# Attach to a running server (or start under strace):
strace -f -p <pid> -e inject=renameat:signal=SIGKILL:when=2 -e trace=renameat &

# Issue TRUNCATE in a separate client.
clickhouse-client -q "TRUNCATE TABLE t"
The bug requires precise crash timing — a regular kill -9 $(pidof clickhouse-server) from a shell loop won't reproduce it because the rename loop completes before the kill lands (0 out of 20 attempts in my testing). A crash-durability fuzzer (or one of the methods above) is needed to hit the window.
My fuzzer uses LD_PRELOAD to intercept renameat(2) and SIGKILLs the server on the Nth call. With it, the bug reproduces 5/5 times at the same seed, with the simulator-induced power-loss layer turned OFF. The bug is in ClickHouse, not in the fuzzer.
Expected behavior
TRUNCATE TABLE t should be atomic. After a crash mid-TRUNCATE and a clean restart, SELECT count() FROM t should return either 500 (TRUNCATE rolled back) or 0 (TRUNCATE committed) — never an intermediate value.
Two possible fix shapes:
Journal-based recovery: write a sentinel file (e.g., truncate_in_progress) into the table's metadata directory before the rename loop, remove it after the loop returns. Startup recovery sees the sentinel and either rolls forward (finishes the remaining renames) or rolls back (moves committed empty parts back to tmp_empty_* and reactivates source parts).
Document the limitation and direct users to transactions. The transactional path at StorageMergeTree.cpp:2290-2297 IS atomic. Update the TRUNCATE docs to say TRUNCATE TABLE is not crash-atomic without BEGIN TRANSACTION. (Soft alternative, since transactions are still experimental.)
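The journal-based option can be sketched on a toy model. This is only an illustration of the roll-forward variant under simplifying assumptions: plain directories stand in for parts, `truncate_in_progress` is the sentinel name suggested above, and `truncate_with_sentinel` and `recover_roll_forward` are hypothetical names, not existing ClickHouse functions. A real implementation would also need to fsync the sentinel and its parent directory, which this sketch omits.

```shell
#!/usr/bin/env bash
# Toy model only: directories stand in for parts; nothing here is ClickHouse code.
store=$(mktemp -d)
sentinel="$store/truncate_in_progress"

for i in 1 2 3 4 5; do
    mkdir "$store/all_${i}_${i}_0"              # source part
    mkdir "$store/tmp_empty_all_${i}_${i}_1"    # prepared empty replacement
done

# Count replacements that have not been renamed into place yet.
pending_count() {
    find "$store" -maxdepth 1 -type d -name 'tmp_empty_*' | wc -l | tr -d ' '
}

# TRUNCATE with a journal: record intent, rename, then clear the record.
# Returns 1 after $1 renames to simulate a SIGKILL; the sentinel stays behind.
truncate_with_sentinel() {
    local crash_after=$1 renamed=0 i
    : > "$sentinel"
    for i in 1 2 3 4 5; do
        if [ "$renamed" -ge "$crash_after" ]; then
            return 1
        fi
        mv "$store/tmp_empty_all_${i}_${i}_1" "$store/all_${i}_${i}_1"
        renamed=$((renamed + 1))
    done
    rm -f "$sentinel"
}

# Startup recovery: if the sentinel exists, finish the renames the crashed
# TRUNCATE did not reach, then remove the sentinel.
recover_roll_forward() {
    local tmp name
    if [ ! -e "$sentinel" ]; then
        return 0
    fi
    for tmp in "$store"/tmp_empty_*; do
        [ -d "$tmp" ] || continue
        name=$(basename "$tmp")
        mv "$tmp" "$store/${name#tmp_empty_}"
    done
    rm -f "$sentinel"
}

truncate_with_sentinel 2 || echo "crashed mid-TRUNCATE"
pending_before=$(pending_count)
recover_roll_forward
pending_after=$(pending_count)
echo "pending replacements: before=$pending_before after=$pending_after"
```

Roll-forward is the simpler of the two directions in this model because the prepared replacements are still on disk after the crash. Rolling back would instead have to move the already-committed empty parts out of the way, as described above.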
Severity
Not data loss in the strict sense — the source-part rows are still on disk under their original directories (just hidden by the partially-committed covering parts). But it's silent data masking: a scripted TRUNCATE-then-SELECT pipeline gets a wrong count back and may proceed assuming TRUNCATE succeeded. To recover, the operator has to inspect system.parts / system.detached_parts and figure out what happened.
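As a starting point for that inspection, something like the following can list what the server loaded for the table. It is a sketch: `inspect_parts_query` is a hypothetical helper, the table name `t` comes from the example above, and the query uses only the `name`, `active`, `level`, and `rows` columns of `system.parts`. The `clickhouse-client` call is left commented out because it needs a running server.

```shell
#!/usr/bin/env bash
# Build a query that lists the parts the server currently knows about for one table.
inspect_parts_query() {
    local table=$1
    printf "SELECT name, active, level, rows FROM system.parts WHERE database = currentDatabase() AND table = '%s' ORDER BY name" "$table"
}

query=$(inspect_parts_query t)
echo "$query"

# With a running server:
#   clickhouse-client -q "$query"
```

In the partial state, a mix of level-1 parts with 0 rows and active level-0 parts with 100 rows each would be the signature to look for.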
Reproduce on the most recent release?
Yes. The non-atomic loop at MergeTreeData.cpp:8430 is unchanged on master (current HEAD). Reproduces on 26.4.1.1. Same code path has likely been present since renameAndCommitEmptyParts was introduced — earlier versions likely affected too.
Additional context
Same atomicity class as #104464 (RESTORE leaves orphan empty table after mid-restore crash). Different DDL surface, identical mechanism: multi-step DDL commit implemented as a sequential filesystem loop without a crash-recovery primitive.
renameAndCommitEmptyParts is also called from dropPart and dropPartition (StorageMergeTree.cpp:2375, 2462), so DROP PARTITION and DETACH PARTITION on a multi-part partition likely share the same gap (not yet tested in this report).
The transactional path at StorageMergeTree.cpp:2290-2297 uses removePartsFromWorkingSet(txn.get(), ..., true, ...) and IS atomic by design — but transactions are still experimental and off by default.
Test gap: tests/integration/test_truncate_* has no crash-during-TRUNCATE coverage. An integration test that injects a SIGKILL during the rename loop and asserts count() ∈ {0, N} would catch this.
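The assertion such a test needs is small. Below is a sketch of just the check, written as a standalone shell helper rather than in the integration test framework; `is_atomic_outcome` is a hypothetical name, and the `clickhouse-client` wiring is commented out because it needs a running server and a crash-injection step such as the gdb or strace methods above.

```shell
#!/usr/bin/env bash
# Succeeds when the observed count is one of the two atomic outcomes:
# 0 (TRUNCATE committed) or the full pre-TRUNCATE count (TRUNCATE rolled back).
is_atomic_outcome() {
    local observed=$1 expected_full=$2
    [ "$observed" -eq 0 ] || [ "$observed" -eq "$expected_full" ]
}

# Example wiring after a crash and restart (needs a running server):
#   observed=$(clickhouse-client -q "SELECT count() FROM t")
#   is_atomic_outcome "$observed" 500 || echo "FAIL: partial TRUNCATE, count=$observed"

for observed in 0 100 300 500; do
    if is_atomic_outcome "$observed" 500; then
        echo "$observed: ok"
    else
        echo "$observed: partial state"
    fi
done
```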
Found by a crash-durability fuzzer (ClickFawkes-style) that uses LD_PRELOAD to intercept renameat(2) and SIGKILL the server on the Nth matching call. ClickHouse strips LD_PRELOAD from its environment via secure_getenv; the fuzzer patches the binary (sed -i 's/LD_PRELOAD/XX_PRELOAD/g') to use a renamed env var, then sets XX_PRELOAD=... instead. This detail isn't important for the bug itself — it just explains how the fuzzer reached deterministic reproduction.
Error message and/or stacktrace
None — the failure is silent. No errors at server startup; logs show ordinary part-loading messages and the table comes online with the partial count.