Skip 01154_move_partition_long on Azure blob storage #103857

Open
groeneai wants to merge 1 commit into ClickHouse:master from
groeneai:groeneai/fix-flaky-01154-move-partition-long-azure

Conversation

@groeneai
Contributor

@groeneai groeneai commented May 1, 2026

Why

Test 01154_move_partition_long.sh is chronically flaky on a single CI configuration: Stateless tests (arm_asan_ubsan, azure, sequential). CIDB shows 23 failures vs 4583 successes on that one config over 30 days, while every other configuration the test runs on (amd_debug, amd_asan_ubsan with distributed plan / db disk, amd_tsan local + s3, amd_binary, arm_binary) sits at 0 failures across roughly 30K runs.
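The imbalance is easy to quantify. A quick sketch using the counts above (Python as a neutral calculator; the 30K figure is the approximate total given above, not an exact CIDB number):

```python
# Failure rate on the single flaky config vs. all other configs (CIDB, 30 days)
azure_fail, azure_ok = 23, 4583
other_fail, other_runs = 0, 30_000  # "roughly 30K runs", approximate

azure_rate = azure_fail / (azure_fail + azure_ok)
print(f"azure config failure rate: {azure_rate:.2%}")
print(f"all other configs: {other_fail} failures in ~{other_runs} runs")
```

That per-run rate of roughly 0.5% on one config, against zero failures everywhere else, is the basis for treating this as infrastructure flakiness rather than a test bug.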

The failing config affected at least 18 distinct PRs and 5 master runs in the last 30 days, including PRs #103733, #103723, #103263, #101158, #101143, #101023, #100943, #100383, #100332, #100177, #100035, #99802, #99543, #102354, #101973, #101788, #101401, and others — it costs developers a CI re-run almost every day.

Failure modes

Both modes stem from slow Azure object-storage I/O combined with the ASan + UBSan overhead amplifying the 40-second concurrency-chaos workload (16 replicated tables, 8 background threads doing mixed INSERT, MOVE PARTITION, REPLACE PARTITION, DROP PARTITION, OPTIMIZE, DROP PART):

  1. TimeoutError in _cleanup_database — the test framework's per-test deadline expires while waiting for the cleanup HTTP request to come back from a slow Azure backend.

  2. result differs with reference showing extra rows from the test's own diagnostic line:

    SELECT table, lost_part_count
    FROM system.replicas
    WHERE database = currentDatabase() AND lost_part_count != 0

    Slow I/O widens the window where concurrent DROP_RANGE entries race against in-flight INSERT propagation, producing genuine but timing-dependent lost parts. Some failures also include many curl: (28) Operation timed out after 120001 milliseconds lines.

Why no-azure-blob-storage

The test verifies ReplicatedMergeTree replication invariants under chaos. It is not testing Azure-specific behavior. Coverage of those invariants is preserved by every non-Azure configuration the test still runs on, including S3 storage on amd_tsan. Disabling on Azure removes the slow-I/O variable while keeping every meaningful sanitizer/storage combination.

This follows the established pattern of disabling such long replication-stress tests on Azure: PR #95313 added no-azure-blob-storage to the sibling tests 01169_alter_partition_isolation_stress.sh, 01171_mv_select_insert_isolation_long.sh, 01443_merge_truncate_long.sh, and 01516_drop_table_stress_long.sh for the same reason. 01154_move_partition_long.sh is the missing companion.

A previous attempt (PR #85974, closed unmerged) tried to address this by raising s3_max_single_read_retries. That approach did not converge.

Closes #80315.

CIDB query for reference (decoded from the playground URL below):

    SELECT check_name, test_status, count()
    FROM default.checks
    WHERE test_name = '01154_move_partition_long'
      AND check_start_time > now() - INTERVAL 30 DAY
    GROUP BY check_name, test_status

https://play.clickhouse.com/?user=play&query=SELECT+check_name%2C+test_status%2C+count%28%29+FROM+default.checks+WHERE+test_name%3D%2701154_move_partition_long%27+AND+check_start_time%3Enow%28%29-INTERVAL+30+DAY+GROUP+BY+check_name%2C+test_status

Changelog category (leave one):

  • CI Fix or Improvement (changelog entry is not required)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Skip flaky test 01154_move_partition_long on Azure blob storage runs.

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

This concurrent replication-chaos test (40s of mixed INSERT, MOVE PARTITION,
REPLACE PARTITION, DROP PARTITION, OPTIMIZE, DROP PART against 16 replicated
tables) reliably exceeds the test framework timeout on the
`Stateless tests (arm_asan_ubsan, azure, sequential)` configuration. CIDB shows
23 failures / 4583 successes on that single config over 30 days, while every
other configuration (amd_debug, amd_asan_ubsan with distributed plan / db disk,
amd_tsan local + s3, amd_binary, arm_binary) is at 0 failures across ~30K runs.

The two failure modes both stem from slow Azure object-storage I/O combined with
the ASan + UBSan overhead:

- `TimeoutError` in `_cleanup_database` after the chaos finishes: the cleanup
  HTTP request blocks longer than the framework's per-test deadline.
- `result differs with reference` showing extra rows from
  `SELECT table, lost_part_count FROM system.replicas WHERE lost_part_count != 0`
  — the chaos-test diagnostic check fires because slow I/O widens the window
  for races between concurrent `DROP_RANGE` entries and INSERT propagation,
  yielding genuine but timing-dependent lost parts.

The test verifies `ReplicatedMergeTree` replication invariants under chaos; it
is not testing Azure-specific behavior. Coverage of those invariants is
preserved by every non-Azure configuration the test still runs on.

This follows the established pattern of disabling such long replication-stress
tests on Azure: PR ClickHouse#95313 added `no-azure-blob-storage` to sibling tests
including `01169_alter_partition_isolation_stress.sh`,
`01171_mv_select_insert_isolation_long.sh`, `01443_merge_truncate_long.sh`,
`01516_drop_table_stress_long.sh` for the same reason.

Closes ClickHouse#80315

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@groeneai
Contributor Author

groeneai commented May 1, 2026

Pre-PR validation gate (per worker policy):

(a) Deterministic repro? N/A by design — the fix is a configuration tag that the test runner is verified to recognize (tests/clickhouse-test:2605 skips when no-azure-blob-storage tag is set and args.azure_blob_storage is true). The failing CI configuration is verified by CIDB to be exclusively Stateless tests (arm_asan_ubsan, azure, sequential).
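For reference, the runner-side check this relies on can be sketched as follows. This is an illustrative reimplementation of the behavior described above; the function name and signature are hypothetical, not the actual tests/clickhouse-test code:

```python
def should_skip_test(tags: set[str], azure_blob_storage: bool) -> bool:
    """Hypothetical helper mirroring the check described at
    tests/clickhouse-test:2605: a test tagged no-azure-blob-storage
    is skipped when the runner is in Azure blob storage mode."""
    return "no-azure-blob-storage" in tags and azure_blob_storage

# The tag only removes the Azure configuration; all other configs still run the test.
print(should_skip_test({"long", "no-azure-blob-storage"}, True))   # True  (Azure: skipped)
print(should_skip_test({"long", "no-azure-blob-storage"}, False))  # False (non-Azure: runs)
print(should_skip_test({"long"}, True))                            # False (untagged: runs)
```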

(b) Root cause explained? Yes. Slow Azure object-storage I/O combined with ASan+UBSan overhead amplifies the 40-second chaos workload past the framework deadline AND widens the timing window for races between concurrent DROP_RANGE and INSERT propagation. The test's own lost_part_count != 0 diagnostic — designed to surface real replication races — fires when those races land within the test window. The PR description has the failure modes spelled out.

(c) Fix matches root cause? Yes. The test verifies ReplicatedMergeTree chaos invariants, not Azure-specific behavior. Skipping Azure removes the slow-I/O variable that makes the chaos window unmanageable on this single config, identical to PR #95313's treatment of sibling tests 01169, 01171, 01443, 01516.

(d) Test intent preserved? Yes. Test still runs on amd_debug, amd_asan_ubsan (distributed plan / db disk), amd_tsan local + s3, amd_binary, arm_binary — all of which ran cleanly across ~30K invocations in the last 30 days. Replication-chaos coverage is fully maintained.

(e) Both directions demonstrated? Demonstrated via CIDB rather than a local rerun: the test fails 23 / 4606 times only on arm_asan_ubsan, azure, sequential and never on the seven other production configs. The tag will mechanically eliminate that single failing combination via a code path verified to exist in tests/clickhouse-test. Local repro is impractical because we don't have an arm_asan_ubsan + Azurite box, and the failure rate is roughly 0.5% even on that config.

(f) Fix is general, not narrow patch? Yes. The same change has already been applied to every sibling stress test in PR #95313. 01154 was the missing companion to that batch.

Session id: cron:clickhouse-ci-task-worker:20260501-131500

@groeneai
Contributor Author

groeneai commented May 1, 2026

cc @alesapin — could you take a look? This adds the same no-azure-blob-storage tag to 01154_move_partition_long.sh that PR #95313 already applied to its sibling chaos tests (01169 / 01171 / 01443 / 01516). Test-only one-liner; CIDB justification is in the PR description and validation comment above.

@alexey-milovidov alexey-milovidov added the can be tested Allows running workflows for external contributors label May 1, 2026
@clickhouse-gh
Contributor

clickhouse-gh Bot commented May 1, 2026

Workflow [PR], commit [29b92ab]

Summary:

    job_name                                             test_name                status  info
    Stateless tests (amd_msan, WasmEdge, parallel, 1/2)                           FAIL
                                                         04051_pk_analysis_stats  FAIL    cidb

AI Review

Summary

This PR adds no-azure-blob-storage to 01154_move_partition_long.sh and documents the rationale inline, aligning the test with existing sibling long stress tests that are already excluded on Azure blob storage due to infra-specific flakiness. I did not find correctness, safety, or maintainability issues in the proposed change.

ClickHouse Rules

  • Deletion logging
  • Serialization versioning
  • Core-area scrutiny
  • No test removal
  • Experimental gate
  • No magic constants
  • Backward compatibility
  • SettingsChangesHistory.cpp
  • PR metadata quality
  • Safe rollout
  • Compilation time
  • No large/binary files

Final Verdict

Status: ✅ Approve

@alexey-milovidov
Member

@groeneai, fix the 04051_pk_analysis_stats test and send a separate PR, or link the existing fix here. Then update the branch here.

@alexey-milovidov alexey-milovidov self-assigned this May 6, 2026
alexey-milovidov pushed a commit that referenced this pull request May 6, 2026
These two tests (`01192_rename_database_zookeeper.sh` and
`01307_multiple_leaders_zookeeper.sh`) reliably exceed the test
framework timeout on the `Stateless tests (arm_asan_ubsan, azure,
sequential)` configuration, while passing on every other configuration
without a single failure.

The failure mode is `Azure::Storage::StorageException: 408 Timeout:
connect timed out: 127.0.0.1:10000` — the Azurite mock storage
backend cannot keep up under the combined load of `arm` + ASan/UBSan
instrumentation slowdown plus the test's concurrent `INSERT` workload.

Both tests target ZooKeeper / `ReplicatedMergeTree` behaviour, not
Azure-specific code paths, so excluding them on the Azure blob storage
config does not reduce coverage of the feature being tested.

CIDB evidence (30-day window):

  Test                                Azure config        Other configs
  01192_rename_database_zookeeper     19 FAIL / 4482 OK   0 FAIL / 50K+ OK
  01307_multiple_leaders_zookeeper    14 FAIL / 4491 OK   0 FAIL / 45K+ OK

This follows the same pattern used by sibling concurrent
replication-chaos tests that already carry the `no-azure-blob-storage`
tag: `01154_move_partition_long.sh` (PR #103857),
`01169_alter_partition_isolation_stress.sh`,
`01171_mv_select_insert_isolation_long.sh`,
`01443_merge_truncate_long.sh`, `01516_drop_table_stress_long.sh`.
@groeneai
Contributor Author

groeneai commented May 6, 2026

@alexey-milovidov filed the fix as #104272.

Investigation summary for the chronic-flaky label on 04051_pk_analysis_stats:

The single transient failure on this PR (#103857) at commit 29b92abb, run 2026-05-01 16:39:04 in Stateless tests (amd_msan, WasmEdge, parallel, 1/2), was flagged "not reproducible" by the diagnoser (6/6 reruns passed). It is not blocking anymore — re-running the failed shard should be enough.

Verified locally on a debug build that the pinned test passes 50/50 and 100/100 even with adversarial overrides that disable QCC, prewhere, multiple-prewhere-read-steps, and force --session_timezone America/Mazatlan.


Labels

can be tested Allows running workflows for external contributors pr-ci


Development

Successfully merging this pull request may close these issues.

01154_move_partition_long is flaky

2 participants