TRUNCATE TABLE leaves partial state after mid-rename crash on MergeTree #104624

@zlareb1

Summary

TRUNCATE TABLE on a MergeTree table is not crash-atomic without an explicit BEGIN TRANSACTION. If the server is killed while TRUNCATE is in progress, the table is left in a partial state on restart — neither full nor empty. For a 5-part / 500-row table, the post-crash count is observed as 100, 200, 300, or 400 rows depending on where the kill landed. No error is logged.

What the user sees

-- Setup: 5 parts × 100 rows = 500 rows.
CREATE TABLE t (id UInt64, val UInt64) ENGINE = MergeTree ORDER BY id;
INSERT INTO t SELECT number, number*7 FROM numbers(  1, 100);
INSERT INTO t SELECT number, number*7 FROM numbers(101, 100);
INSERT INTO t SELECT number, number*7 FROM numbers(201, 100);
INSERT INTO t SELECT number, number*7 FROM numbers(301, 100);
INSERT INTO t SELECT number, number*7 FROM numbers(401, 100);

SELECT count() FROM t;          -- 500   (as expected)

-- Now: TRUNCATE TABLE t, then SIGKILL the server mid-TRUNCATE.
-- Restart the server cleanly.

SELECT count() FROM t;          -- 300   ← BUG: not 0, not 500

On-disk forensics after the crash:

data/store/<uuid>/all_1_1_1/   ← empty replacement part (committed)
data/store/<uuid>/all_2_2_1/   ← empty replacement part (committed)
data/store/<uuid>/all_3_3_0/   ← original source part still active (100 rows)
data/store/<uuid>/all_4_4_0/   ← original source part still active (100 rows)
data/store/<uuid>/all_5_5_0/   ← original source part still active (100 rows)

Why it happens

TRUNCATE doesn't directly delete parts. Instead it creates N empty "covering" replacement parts (one per source part, each with the same block range at a higher level, e.g. all_1_1_0 → all_1_1_1) and renames them into the store one at a time. The rename loop is at src/Storages/MergeTree/MergeTreeData.cpp:8430:

void MergeTreeData::Transaction::renameParts()
{
    for (const auto & part_need_rename : precommitted_parts_need_rename)
    {
        LOG_TEST(data.log, "Renaming part to {}", part_need_rename->name);
        part_need_rename->renameTo(part_need_rename->name, true);
    }
    precommitted_parts_need_rename.clear();
}

A plain for loop with one rename(2) per part. No batched-atomic primitive, no journal, no startup recovery to roll forward or back. A SIGKILL between iterations leaves K replacements renamed in and N-K source parts still active. The recovery path loads what it sees: the K committed empty replacements "win" over their corresponding source parts (higher level), but the remaining N-K source parts stay active and visible to queries.
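The partial state described above can be reproduced in a toy model. The sketch below is not ClickHouse code: ToyPart, stateAfterCrash, visibleRows, and fiveSourceParts are hypothetical names, and the "highest level per block wins" rule is a simplification of how covering parts are resolved at load time.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy model only: none of these types or functions exist in ClickHouse.
struct ToyPart
{
    std::uint64_t block;  // min_block == max_block for a single-insert part
    std::uint32_t level;  // 0 = original source part, 1 = empty covering part
    std::uint64_t rows;
};

// On-disk state if the process dies after `renames_done` iterations of a
// sequential rename loop: every source part is still present, and the first
// `renames_done` empty replacements have been renamed into place.
std::vector<ToyPart> stateAfterCrash(const std::vector<ToyPart> & source, std::size_t renames_done)
{
    assert(renames_done <= source.size());
    std::vector<ToyPart> on_disk = source;
    for (std::size_t i = 0; i < renames_done; ++i)
        on_disk.push_back({source[i].block, source[i].level + 1, 0});
    return on_disk;
}

// Simplified startup loading: a part is hidden if another part with the same
// block has a higher level; rows of all remaining (active) parts are summed.
std::uint64_t visibleRows(const std::vector<ToyPart> & on_disk)
{
    std::uint64_t total = 0;
    for (const auto & candidate : on_disk)
    {
        bool covered = false;
        for (const auto & other : on_disk)
            if (other.block == candidate.block && other.level > candidate.level)
                covered = true;
        if (!covered)
            total += candidate.rows;
    }
    return total;
}

// Five level-0 parts of 100 rows each, mirroring the setup in this report.
std::vector<ToyPart> fiveSourceParts()
{
    std::vector<ToyPart> parts;
    for (std::uint64_t block = 1; block <= 5; ++block)
        parts.push_back({block, 0, 100});
    return parts;
}
```

In this model, visibleRows(stateAfterCrash(fiveSourceParts(), 2)) evaluates to 300, the same count as in the example above; 0 and 5 completed renames give 500 and 0 respectively, which are the only two outcomes an atomic TRUNCATE would allow.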

Strace of a complete clean TRUNCATE confirms 5 individual renameat(2) calls for the empty replacements (plus 5 more for source→delete_tmp_ cleanup):

renameat(AT_FDCWD, ".../tmp_empty_all_1_1_1/", AT_FDCWD, ".../all_1_1_1/") = 0
renameat(AT_FDCWD, ".../tmp_empty_all_2_2_1/", AT_FDCWD, ".../all_2_2_1/") = 0
renameat(AT_FDCWD, ".../tmp_empty_all_3_3_1/", AT_FDCWD, ".../all_3_3_1/") = 0
renameat(AT_FDCWD, ".../tmp_empty_all_4_4_1/", AT_FDCWD, ".../all_4_4_1/") = 0
renameat(AT_FDCWD, ".../tmp_empty_all_5_5_1/", AT_FDCWD, ".../all_5_5_1/") = 0
renameat(AT_FDCWD, ".../all_1_1_0",  AT_FDCWD, ".../delete_tmp_all_1_1_0") = 0
... (4 more)

Killing the server between any two of these renames leaves partial state.

Reproduce

The kill window is narrow (the rename loop runs in milliseconds). A shell-level kill -9 with random ms-delays does not reliably hit it. Two ways to land the kill deterministically:

With gdb (simplest, no external code):

# Terminal 1: start server normally (without --daemon so gdb can attach).
clickhouse-server --config-file=<conf>

# Terminal 2: set up the data.
clickhouse-client -q "CREATE TABLE t (id UInt64, val UInt64) ENGINE = MergeTree ORDER BY id"
for i in 0 1 2 3 4; do
    clickhouse-client -q "INSERT INTO t SELECT number, number*7 FROM numbers($((i*100+1)), 100)"
done
clickhouse-client -q "SELECT count() FROM t"                    # 500

# Terminal 3: attach gdb and break inside the rename loop. `ignore 1 2` lets
# the first two hits pass; on the third hit gdb stops the server and `kill`
# terminates it, so two renames have completed (assuming line 8435 is reached
# once per iteration, before that iteration's rename).
# Change the ignore count (1-4) to move the kill point.
gdb -p "$(pidof clickhouse-server)" -batch \
    -ex "set pagination off" \
    -ex "break MergeTreeData.cpp:8435" \
    -ex "ignore 1 2" \
    -ex "continue" \
    -ex "kill"

# Terminal 2: once gdb has finished attaching and resumed the server,
# trigger TRUNCATE. The client will report a lost connection when gdb
# kills the server.
clickhouse-client -q "TRUNCATE TABLE t"

# Restart the server normally.
clickhouse-server --config-file=<conf> &
sleep 5
clickhouse-client -q "SELECT count() FROM t"                    # 100/200/300/400 — bug

With strace -e inject (kernel-level, no gdb needed):

# Attach to a running server (or start under strace):
strace -f -p <pid> -e inject=renameat:signal=SIGKILL:when=2 -e trace=renameat &

# Issue TRUNCATE in a separate client.
clickhouse-client -q "TRUNCATE TABLE t"

The bug requires precise crash timing — a regular kill -9 $(pidof clickhouse-server) from a shell loop won't reproduce it because the rename loop completes before the kill lands (0 out of 20 attempts in my testing). A crash-durability fuzzer (or one of the methods above) is needed to hit the window.

In a fuzzer that intercepts renameat(2) via LD_PRELOAD and sends SIGKILL on the Nth call, I get 5/5 reproductions at the same seed with the simulator-induced power-loss layer turned OFF. The bug is in ClickHouse, not in the fuzzer.

Expected behavior

TRUNCATE TABLE t should be atomic. After a crash mid-TRUNCATE and a clean restart, SELECT count() FROM t should return either 500 (TRUNCATE rolled back) or 0 (TRUNCATE committed) — never an intermediate value.

Two possible fix shapes:

  • Journal-based recovery: write a sentinel file (e.g., truncate_in_progress) into the table's metadata directory before the rename loop, remove it after the loop returns. Startup recovery sees the sentinel and either rolls forward (finishes the remaining renames) or rolls back (moves committed empty parts back to tmp_empty_* and reactivates source parts).
  • Document the limitation and direct users to transactions. The transactional path at StorageMergeTree.cpp:2290-2297 IS atomic. Update the TRUNCATE docs to say TRUNCATE TABLE is not crash-atomic without BEGIN TRANSACTION. (Soft alternative, since transactions are still experimental.)
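A minimal sketch of the first (journal-based, roll-forward) shape, using only std::filesystem. Every name here (RenamePlan, SENTINEL, writeSentinel, applyPlan, truncateWithSentinel, recoverOnStartup) is hypothetical and not ClickHouse API; the sentinel format (one tab-separated from/to pair per line) is an assumption, and the fsync calls a real implementation would need for durability are omitted.

```cpp
#include <cassert>
#include <filesystem>
#include <fstream>
#include <string>
#include <utility>
#include <vector>

namespace fs = std::filesystem;

// Hypothetical: list of {from, to} directory names relative to the table directory.
using RenamePlan = std::vector<std::pair<std::string, std::string>>;

const char * const SENTINEL = "truncate_in_progress";

// Record the full plan before touching any part directory.
void writeSentinel(const fs::path & table_dir, const RenamePlan & plan)
{
    const fs::path tmp = table_dir / (std::string(SENTINEL) + ".tmp");
    {
        std::ofstream out(tmp, std::ios::trunc);
        for (const auto & [from, to] : plan)
            out << from << '\t' << to << '\n';
    }
    // A real implementation would fsync the file and its directory here.
    fs::rename(tmp, table_dir / SENTINEL);  // publish the sentinel in one step
}

// Apply every rename that has not happened yet. Safe to call repeatedly:
// a pair whose source directory is already gone is skipped.
void applyPlan(const fs::path & table_dir, const RenamePlan & plan)
{
    assert(fs::is_directory(table_dir));
    for (const auto & [from, to] : plan)
        if (fs::exists(table_dir / from))
            fs::rename(table_dir / from, table_dir / to);
}

// Normal path: sentinel first, then the renames, then remove the sentinel.
void truncateWithSentinel(const fs::path & table_dir, const RenamePlan & plan)
{
    writeSentinel(table_dir, plan);
    applyPlan(table_dir, plan);
    fs::remove(table_dir / SENTINEL);
}

// Startup path: if a sentinel survived a crash, finish the recorded renames.
// Returns true if recovery ran.
bool recoverOnStartup(const fs::path & table_dir)
{
    const fs::path sentinel = table_dir / SENTINEL;
    if (!fs::exists(sentinel))
        return false;

    RenamePlan plan;
    {
        std::ifstream in(sentinel);
        std::string line;
        while (std::getline(in, line))
        {
            const auto tab = line.find('\t');
            if (tab != std::string::npos)
                plan.emplace_back(line.substr(0, tab), line.substr(tab + 1));
        }
    }
    applyPlan(table_dir, plan);
    fs::remove(sentinel);
    return true;
}
```

Rolling forward is the simpler of the two directions in this sketch because the plan is idempotent: a crash during recovery itself just leaves the sentinel in place for the next startup. Rolling back would additionally need to undo renames that already completed.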

Severity

Not data loss in the strict sense — the source-part rows are still on disk under their original directories (just hidden by the partially-committed covering parts). But it's silent data masking: a scripted TRUNCATE-then-SELECT pipeline gets a wrong count back and may proceed assuming TRUNCATE succeeded. To recover, the operator has to inspect system.parts / system.detached_parts and figure out what happened.

Reproduce on the most recent release?

Yes. The non-atomic loop at MergeTreeData.cpp:8430 is unchanged on master (current HEAD). Reproduces on 26.4.1.1. Same code path has likely been present since renameAndCommitEmptyParts was introduced — earlier versions likely affected too.

Additional context

  • Same atomicity class as #104464 (RESTORE leaves orphan empty table after mid-restore crash). Different DDL surface, identical mechanism: multi-step DDL commit implemented as a sequential filesystem loop without a crash-recovery primitive. renameAndCommitEmptyParts is also called from dropPart and dropPartition (StorageMergeTree.cpp:2375, 2462), so DROP PARTITION and DETACH PARTITION on a multi-part partition likely share the same gap (not yet tested in this report).
  • The transactional path at StorageMergeTree.cpp:2290-2297 uses removePartsFromWorkingSet(txn.get(), ..., true, ...) and IS atomic by design — but transactions are still experimental and off by default.
  • Test gap: tests/integration/test_truncate_* has no crash-during-TRUNCATE coverage. An integration test that injects a SIGKILL during the rename loop and asserts count() ∈ {0, N} would catch this.
  • Found by a crash-durability fuzzer (ClickFawkes-style) that uses LD_PRELOAD to intercept renameat(2) and SIGKILL the server on the Nth matching call. ClickHouse strips LD_PRELOAD from its environment via secure_getenv; the fuzzer patches the binary (sed -i 's/LD_PRELOAD/XX_PRELOAD/g') to use a renamed env var, then sets XX_PRELOAD=... instead. This detail isn't important for the bug itself — it just explains how the fuzzer reached deterministic reproduction.

Error message and/or stacktrace

None — the failure is silent. No errors at server startup; logs show ordinary part-loading messages and the table comes online with the partial count.

Labels

bug: Confirmed user-visible misbehaviour in official release
