TRUNCATE TABLE leaves partial state after mid-rename crash on MergeTree

### Summary

`TRUNCATE TABLE` on a `MergeTree` table is **not crash-atomic** without an explicit `BEGIN TRANSACTION`. If the server is killed while TRUNCATE is in progress, the table is left in a partial state on restart — neither full nor empty. For a 5-part / 500-row table, the post-crash count is observed as `100`, `200`, `300`, or `400` rows depending on where the kill landed. No error is logged.

### What the user sees

```sql
-- Setup: 5 parts × 100 rows = 500 rows.
CREATE TABLE t (id UInt64, val UInt64) ENGINE = MergeTree ORDER BY id;
INSERT INTO t SELECT number, number*7 FROM numbers(  1, 100);
INSERT INTO t SELECT number, number*7 FROM numbers(101, 100);
INSERT INTO t SELECT number, number*7 FROM numbers(201, 100);
INSERT INTO t SELECT number, number*7 FROM numbers(301, 100);
INSERT INTO t SELECT number, number*7 FROM numbers(401, 100);

SELECT count() FROM t;          -- 500   (as expected)

-- Now: TRUNCATE TABLE t, then SIGKILL the server mid-TRUNCATE.
-- Restart the server cleanly.

SELECT count() FROM t;          -- 300   ← BUG: not 0, not 500
```

On-disk forensics after the crash:

```
data/store/<uuid>/all_1_1_1/   ← empty replacement part (committed)
data/store/<uuid>/all_2_2_1/   ← empty replacement part (committed)
data/store/<uuid>/all_3_3_0/   ← original source part still active (100 rows)
data/store/<uuid>/all_4_4_0/   ← original source part still active (100 rows)
data/store/<uuid>/all_5_5_0/   ← original source part still active (100 rows)
```

### Why it happens

TRUNCATE doesn't delete parts directly. Instead it creates N empty "covering" replacement parts, one per source part and one level higher (for example, `all_1_1_0` is covered by `all_1_1_1`), and renames them into the store one at a time. The rename loop is at [`src/Storages/MergeTree/MergeTreeData.cpp:8430`](https://github.com/ClickHouse/ClickHouse/blob/master/src/Storages/MergeTree/MergeTreeData.cpp#L8430):

```cpp
void MergeTreeData::Transaction::renameParts()
{
    for (const auto & part_need_rename : precommitted_parts_need_rename)
    {
        LOG_TEST(data.log, "Renaming part to {}", part_need_rename->name);
        part_need_rename->renameTo(part_need_rename->name, true);
    }
    precommitted_parts_need_rename.clear();
}
```

This is a plain `for` loop that issues one rename per part. There is no batched-atomic primitive, no journal, and no startup recovery that rolls the operation forward or back. A `SIGKILL` between iterations leaves K replacements renamed in and N-K source parts still active. On restart the server loads what it finds: each of the K committed empty replacements covers its source part (it has a higher level), while the remaining N-K source parts stay active and visible to queries. `count()` therefore returns the rows of those N-K parts.
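The arithmetic behind the intermediate counts can be sketched with a small in-memory model. This is an illustration only: `ModelPart`, `rowsVisibleAfterCrash`, and `makeReportLayout` are hypothetical names, not ClickHouse types, and the model assumes that a committed replacement hides exactly its own source part.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical in-memory model of a part; not a ClickHouse type.
struct ModelPart
{
    std::string name;   // e.g. "all_3_3_0"
    std::size_t rows;   // rows visible while the part is active
    bool covered;       // true once its empty replacement has been renamed in
};

// Simulates the sequential rename loop and stops after `renames_before_kill`
// iterations, as if SIGKILL arrived at that point. Returns the row count a
// reader would see after restart in this model.
std::size_t rowsVisibleAfterCrash(std::vector<ModelPart> parts, std::size_t renames_before_kill)
{
    assert(renames_before_kill <= parts.size());

    for (std::size_t i = 0; i < renames_before_kill; ++i)
        parts[i].covered = true;   // one rename committed, one source part hidden

    std::size_t visible = 0;
    for (const auto & part : parts)
        if (!part.covered)
            visible += part.rows;
    return visible;
}

// Builds the 5 x 100-row layout used in this report.
std::vector<ModelPart> makeReportLayout()
{
    std::vector<ModelPart> parts;
    for (int i = 1; i <= 5; ++i)
    {
        const std::string n = std::to_string(i);
        parts.push_back({"all_" + n + "_" + n + "_0", 100, false});
    }
    return parts;
}
```

With the report's layout, `rowsVisibleAfterCrash(makeReportLayout(), 2)` gives 300, which matches the on-disk forensics above. Zero and five completed renames give 500 and 0, the only two outcomes an atomic TRUNCATE would allow.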

Strace of a complete clean TRUNCATE confirms 5 individual `renameat(2)` calls for the empty replacements (plus 5 more for source→`delete_tmp_` cleanup):

```
renameat(AT_FDCWD, ".../tmp_empty_all_1_1_1/", AT_FDCWD, ".../all_1_1_1/") = 0
renameat(AT_FDCWD, ".../tmp_empty_all_2_2_1/", AT_FDCWD, ".../all_2_2_1/") = 0
renameat(AT_FDCWD, ".../tmp_empty_all_3_3_1/", AT_FDCWD, ".../all_3_3_1/") = 0
renameat(AT_FDCWD, ".../tmp_empty_all_4_4_1/", AT_FDCWD, ".../all_4_4_1/") = 0
renameat(AT_FDCWD, ".../tmp_empty_all_5_5_1/", AT_FDCWD, ".../all_5_5_1/") = 0
renameat(AT_FDCWD, ".../all_1_1_0",  AT_FDCWD, ".../delete_tmp_all_1_1_0") = 0
... (4 more)
```

Killing the server between any two of these renames leaves partial state.

### Reproduce

The kill window is narrow (the rename loop runs in milliseconds). A shell-level `kill -9` with random ms-delays does not reliably hit it. Two ways to land the kill deterministically:

**With gdb (simplest, no external code):**

```bash
# Terminal 1: start server normally (without --daemon so gdb can attach).
clickhouse-server --config-file=<conf>

# Terminal 2: set up the data.
clickhouse-client -q "CREATE TABLE t (id UInt64, val UInt64) ENGINE = MergeTree ORDER BY id"
for i in 0 1 2 3 4; do
    clickhouse-client -q "INSERT INTO t SELECT number, number*7 FROM numbers($((i*100+1)), 100)"
done
clickhouse-client -q "SELECT count() FROM t"                    # 500

# Terminal 3: attach gdb and stop inside the rename loop.
# `ignore 1 2` skips the first two hits of breakpoint 1, so gdb stops on the
# third loop iteration; `kill` then terminates the stopped server, which is
# equivalent to a SIGKILL at that exact point.
gdb -p $(pidof clickhouse-server) -batch \
    -ex "set pagination off" \
    -ex "break MergeTreeData.cpp:8435" \
    -ex "ignore 1 2" \
    -ex "continue" \
    -ex "kill"

# Terminal 2: while gdb is waiting in `continue`, trigger TRUNCATE.
# The client reports a lost connection when the server is killed.
clickhouse-client -q "TRUNCATE TABLE t"

# Restart the server normally.
clickhouse-server --config-file=<conf> &
sleep 5
clickhouse-client -q "SELECT count() FROM t"                    # 100/200/300/400 — bug
```

**With `strace -e inject` (syscall-level, no gdb needed):**

```bash
# Attach to a running server (or start under strace):
strace -f -p <pid> -e inject=renameat:signal=SIGKILL:when=2 -e trace=renameat &

# Issue TRUNCATE in a separate client.
clickhouse-client -q "TRUNCATE TABLE t"
```

The bug requires precise crash timing — a regular `kill -9 $(pidof clickhouse-server)` from a shell loop won't reproduce it because the rename loop completes before the kill lands (0 out of 20 attempts in my testing). A crash-durability fuzzer (or one of the methods above) is needed to hit the window.

In my fuzzer, which intercepts `renameat(2)` via `LD_PRELOAD` and sends `SIGKILL` on the Nth call, the bug reproduced in **5/5 runs at the same seed** with the simulated power-loss layer turned OFF. The bug is in ClickHouse, not in the fuzzer.

### Expected behavior

`TRUNCATE TABLE t` should be atomic. After a crash mid-TRUNCATE and a clean restart, `SELECT count() FROM t` should return either 500 (TRUNCATE rolled back) or 0 (TRUNCATE committed) — never an intermediate value.
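The invariant stated above is small enough to write down as a check. The sketch below is a hypothetical helper, not part of ClickHouse or its test suite: `isAtomicOutcome` and `atomicityViolations` are names invented for illustration, and the counts are assumed to come from `SELECT count() FROM t` after each crash-and-restart run.

```cpp
#include <cstdint>
#include <vector>

// True when a post-restart row count is consistent with an atomic TRUNCATE:
// either it rolled back (all rows present) or it committed (no rows).
bool isAtomicOutcome(std::uint64_t rows_before, std::uint64_t rows_after_restart)
{
    return rows_after_restart == 0 || rows_after_restart == rows_before;
}

// Returns every observed count that violates the invariant, in input order.
std::vector<std::uint64_t> atomicityViolations(
    std::uint64_t rows_before, const std::vector<std::uint64_t> & observed_counts)
{
    std::vector<std::uint64_t> violations;
    for (const auto count : observed_counts)
        if (!isAtomicOutcome(rows_before, count))
            violations.push_back(count);
    return violations;
}
```

A crash-injection test could sweep the kill point across the rename loop, collect the post-restart counts, and fail if `atomicityViolations(500, counts)` is non-empty.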

Two possible fix shapes:

- **Journal-based recovery:** write a sentinel file (e.g., `truncate_in_progress`) into the table's metadata directory before the rename loop, remove it after the loop returns. Startup recovery sees the sentinel and either rolls forward (finishes the remaining renames) or rolls back (moves committed empty parts back to `tmp_empty_*` and reactivates source parts).
- **Document the limitation and direct users to transactions.** The transactional path at `StorageMergeTree.cpp:2290-2297` IS atomic. Update the [TRUNCATE docs](https://clickhouse.com/docs/sql-reference/statements/truncate) to say `TRUNCATE TABLE` is not crash-atomic without `BEGIN TRANSACTION`. (Soft alternative, since transactions are still experimental.)
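The journal-based shape could look roughly like the sketch below. It is a minimal illustration under stated assumptions, not a proposed patch: every name here (`RenameList`, `writeJournal`, `renameWithJournal`, `recoverRollForward`, the `.tmp` file) is hypothetical, it uses plain `std::filesystem` instead of ClickHouse's disk abstraction, it omits the `fsync` calls a durable journal needs, and it implements only the roll-forward variant. The `crash_after` parameter exists solely to simulate the kill.

```cpp
#include <cassert>
#include <cstddef>
#include <filesystem>
#include <fstream>
#include <string>
#include <utility>
#include <vector>

namespace fs = std::filesystem;

// Hypothetical: {from, to} directory names relative to the table directory.
using RenameList = std::vector<std::pair<std::string, std::string>>;

const char * const JOURNAL_NAME = "truncate_in_progress";

// Persist the intended renames before touching any part directory.
// The journal is written to a temporary name and published with one rename,
// so recovery never sees a half-written journal. fsync is omitted here.
void writeJournal(const fs::path & table_dir, const RenameList & renames)
{
    const fs::path tmp = table_dir / "truncate_in_progress.tmp";
    {
        std::ofstream out(tmp, std::ios::trunc);
        for (const auto & [from, to] : renames)
            out << from << '\t' << to << '\n';
    }   // stream is closed here, before the journal is published
    fs::rename(tmp, table_dir / JOURNAL_NAME);
}

// The rename loop, guarded by the journal. Returning early after
// `crash_after` renames simulates a SIGKILL: the journal stays on disk.
void renameWithJournal(const fs::path & table_dir, const RenameList & renames, std::size_t crash_after)
{
    assert(fs::is_directory(table_dir));
    writeJournal(table_dir, renames);

    std::size_t done = 0;
    for (const auto & [from, to] : renames)
    {
        if (done == crash_after)
            return;
        fs::rename(table_dir / from, table_dir / to);
        ++done;
    }
    fs::remove(table_dir / JOURNAL_NAME);   // all renames done: commit
}

// Startup recovery (roll-forward): finish whatever the journal says is
// still pending. Returns the number of renames completed during recovery.
std::size_t recoverRollForward(const fs::path & table_dir)
{
    const fs::path journal = table_dir / JOURNAL_NAME;
    if (!fs::exists(journal))
        return 0;   // no TRUNCATE was in flight

    std::size_t completed = 0;
    {
        std::ifstream in(journal);
        std::string line;
        while (std::getline(in, line))
        {
            const auto tab = line.find('\t');
            if (tab == std::string::npos)
                continue;   // skip malformed lines
            const fs::path from = table_dir / line.substr(0, tab);
            const fs::path to = table_dir / line.substr(tab + 1);
            if (fs::exists(from) && !fs::exists(to))
            {
                fs::rename(from, to);
                ++completed;
            }
        }
    }
    fs::remove(journal);
    return completed;
}
```

Roll-forward is the simpler of the two directions because the empty replacement parts already exist under their `tmp_empty_*` names. A roll-back variant would instead move the committed replacements away and would need the same journal to know which ones to undo.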

### Severity

Not data loss in the strict sense — the source-part rows are still on disk under their original directories (just hidden by the partially-committed covering parts). But it's **silent data masking**: a scripted `TRUNCATE`-then-`SELECT` pipeline gets a wrong count back and may proceed assuming TRUNCATE succeeded. To recover, the operator has to inspect `system.parts` / `system.detached_parts` and figure out what happened.

### Reproduce on the most recent release?

Yes. The non-atomic loop at `MergeTreeData.cpp:8430` is unchanged on `master` (current `HEAD`), and the bug reproduces on 26.4.1.1. The same code path has likely been present since `renameAndCommitEmptyParts` was introduced, so earlier versions are likely affected too.

### Additional context

- Same atomicity class as #104464 (RESTORE leaves orphan empty table after mid-restore crash). Different DDL surface, identical mechanism: a multi-step DDL commit implemented as a sequential filesystem loop without a crash-recovery primitive.
- `renameAndCommitEmptyParts` is also called from `dropPart` and `dropPartition` (`StorageMergeTree.cpp:2375, 2462`), so `DROP PARTITION` and `DETACH PARTITION` on a multi-part partition likely share the same gap (not yet tested in this report).
- The transactional path at `StorageMergeTree.cpp:2290-2297` uses `removePartsFromWorkingSet(txn.get(), ..., true, ...)` and IS atomic by design — but transactions are still experimental and off by default.
- Test gap: `tests/integration/test_truncate_*` has no crash-during-TRUNCATE coverage. An integration test that injects a SIGKILL during the rename loop and asserts `count() ∈ {0, N}` would catch this.
- Found by a crash-durability fuzzer (ClickFawkes-style) that uses `LD_PRELOAD` to intercept `renameat(2)` and SIGKILL the server on the Nth matching call. ClickHouse strips `LD_PRELOAD` from its environment via `secure_getenv`; the fuzzer patches the binary (`sed -i 's/LD_PRELOAD/XX_PRELOAD/g'`) to use a renamed env var, then sets `XX_PRELOAD=...` instead. This detail isn't important for the bug itself — it just explains how the fuzzer reached deterministic reproduction.

### Error message and/or stacktrace

None — the failure is silent. No errors at server startup; logs show ordinary part-loading messages and the table comes online with the partial count.

