Skip 01154_move_partition_long on Azure blob storage #103857
groeneai wants to merge 1 commit into ClickHouse:master from
Conversation
This concurrent replication-chaos test (40s of mixed INSERT, MOVE PARTITION, REPLACE PARTITION, DROP PARTITION, OPTIMIZE, DROP PART against 16 replicated tables) reliably exceeds the test framework timeout on the `Stateless tests (arm_asan_ubsan, azure, sequential)` configuration. CIDB shows 23 failures / 4583 successes on that single config over 30 days, while every other configuration (amd_debug, amd_asan_ubsan with distributed plan / db disk, amd_tsan local + s3, amd_binary, arm_binary) is at 0 failures across ~30K runs.

The two failure modes both stem from slow Azure object-storage I/O combined with the ASan + UBSan overhead:

- `TimeoutError` in `_cleanup_database` after the chaos finishes: the cleanup HTTP request blocks longer than the framework's per-test deadline.
- `result differs with reference` showing extra rows from `SELECT table, lost_part_count FROM system.replicas WHERE lost_part_count != 0` — the chaos-test diagnostic check fires because slow I/O widens the window for races between concurrent `DROP_RANGE` entries and INSERT propagation, yielding genuine but timing-dependent lost parts.

The test verifies `ReplicatedMergeTree` replication invariants under chaos; it is not testing Azure-specific behavior. Coverage of those invariants is preserved by every non-Azure configuration the test still runs on.

This follows the established pattern of disabling such long replication-stress tests on Azure: PR ClickHouse#95313 added `no-azure-blob-storage` to sibling tests including `01169_alter_partition_isolation_stress.sh`, `01171_mv_select_insert_isolation_long.sh`, `01443_merge_truncate_long.sh`, `01516_drop_table_stress_long.sh` for the same reason.

Closes ClickHouse#80315

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Pre-PR validation gate (per worker policy):

- (a) Deterministic repro? N/A by design — the fix is a configuration tag that the test runner is verified to recognize (
- (b) Root cause explained? Yes. Slow Azure object-storage I/O combined with
- (c) Fix matches root cause? Yes. The test verifies
- (d) Test intent preserved? Yes. Test still runs on
- (e) Both directions demonstrated? Demonstrated via CIDB rather than a local rerun: the test fails 23 / 4606 times only on
- (f) Fix is general, not narrow patch? Yes. The same change has already been applied to every sibling stress test in PR #95313.

Session id:
cc @alesapin — could you take a look? This adds the same
Workflow [PR], commit [29b92ab] Summary: ❌
AI Review

Summary: This PR adds

ClickHouse Rules
Final Verdict

Status: ✅ Approve
@groeneai, fix the
These two tests (`01192_rename_database_zookeeper.sh` and `01307_multiple_leaders_zookeeper.sh`) reliably exceed the test framework timeout on the `Stateless tests (arm_asan_ubsan, azure, sequential)` configuration, while passing on every other configuration without a single failure. The failure mode is `Azure::Storage::StorageException: 408 Timeout: connect timed out: 127.0.0.1:10000` — the Azurite mock storage backend cannot keep up under the combined load of `arm` + ASan/UBSan instrumentation slowdown plus the tests' concurrent `INSERT` workload.

Both tests target ZooKeeper / `ReplicatedMergeTree` behaviour, not Azure-specific code paths, so excluding them on the Azure blob storage config does not reduce coverage of the feature being tested.

CIDB evidence (30-day window):

| Test | Azure config | Other configs |
|---|---|---|
| 01192_rename_database_zookeeper | 19 FAIL / 4482 OK | 0 FAIL / 50K+ OK |
| 01307_multiple_leaders_zookeeper | 14 FAIL / 4491 OK | 0 FAIL / 45K+ OK |

This follows the same pattern used by sibling concurrent replication-chaos tests that already carry the `no-azure-blob-storage` tag: `01154_move_partition_long.sh` (PR #103857), `01169_alter_partition_isolation_stress.sh`, `01171_mv_select_insert_isolation_long.sh`, `01443_merge_truncate_long.sh`, `01516_drop_table_stress_long.sh`.
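For scale, the failure counts quoted here work out to sub-half-percent per-run flake rates at roughly 150 runs per day. A quick arithmetic sketch (numbers taken from the CIDB counts above; per-day figures assume the 30-day window):

```python
# Per-run flake rate and daily run volume implied by the 30-day CIDB counts
# quoted above (19/4482 and 14/4491 on the Azure config).
counts = {
    "01192_rename_database_zookeeper": (19, 4482),
    "01307_multiple_leaders_zookeeper": (14, 4491),
}
for test, (fail, ok) in counts.items():
    total = fail + ok
    print(f"{test}: {fail / total:.2%} per run, "
          f"~{total / 30:.0f} runs/day, ~{fail / 30:.1f} failures/day")
```

Low per-run rates, but at this run volume they still translate to a failed check every day or two.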
@alexey-milovidov filed the fix as #104272. Investigation summary for the chronic-flaky label on
The single transient failure on this PR (#103857) at commit

Verified locally on a debug build that the pinned test passes 50/50 and 100/100 even with adversarial overrides that disable QCC, prewhere, multiple-prewhere-read-steps, and force
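As a statistical aside on what "passes 50/50 and 100/100" establishes (not from the PR itself): by the rule of three, N consecutive clean runs bound the true per-run failure probability at roughly 3/N with ~95% confidence.

```python
# Rule of three: after n consecutive successes with zero failures, the
# approximate 95% upper confidence bound on the per-run failure rate is 3 / n.
def flake_rate_upper_bound_95(clean_runs: int) -> float:
    return 3.0 / clean_runs

for n in (50, 100):
    print(f"{n} clean runs -> per-run flake rate likely below "
          f"{flake_rate_upper_bound_95(n):.1%}")
```

So 100 clean runs rule out any flake rate above ~3%, well above the sub-percent rates that chronic CI flakes typically exhibit; CIDB volume remains the stronger evidence.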
Why

Test `01154_move_partition_long.sh` is chronically flaky on a single CI configuration: `Stateless tests (arm_asan_ubsan, azure, sequential)`. CIDB shows 23 failures vs 4583 successes on that one config over 30 days, while every other configuration the test runs on (`amd_debug`, `amd_asan_ubsan` with distributed plan / db disk, `amd_tsan` local + s3, `amd_binary`, `arm_binary`) sits at 0 failures across roughly 30K runs.

The failing config affected at least 18 distinct PRs and 5 master runs in the last 30 days, including PRs #103733, #103723, #103263, #101158, #101143, #101023, #100943, #100383, #100332, #100177, #100035, #99802, #99543, #102354, #101973, #101788, #101401, and others — it costs developers a CI re-run almost every day.

Failure modes

Both modes stem from slow Azure object-storage I/O combined with the `ASan+UBSan` overhead amplifying the 40-second concurrency-chaos workload (16 replicated tables, 8 background threads doing mixed `INSERT`, `MOVE PARTITION`, `REPLACE PARTITION`, `DROP PARTITION`, `OPTIMIZE`, `DROP PART`):

- `TimeoutError` in `_cleanup_database` — the test framework's per-test deadline expires while waiting for the cleanup HTTP request to come back from a slow Azure backend.
- `result differs with reference` showing extra rows from the test's own diagnostic line `SELECT table, lost_part_count FROM system.replicas WHERE lost_part_count != 0`. Slow I/O widens the window where concurrent `DROP_RANGE` entries race against in-flight INSERT propagation, producing genuine but timing-dependent lost parts. Some failures also include many `curl: (28) Operation timed out after 120001 milliseconds` lines.

Why `no-azure-blob-storage`

The test verifies `ReplicatedMergeTree` replication invariants under chaos. It is not testing Azure-specific behavior. Coverage of those invariants is preserved by every non-Azure configuration the test still runs on, including S3 storage on `amd_tsan`. Disabling on Azure removes the slow-I/O variable while keeping every meaningful sanitizer/storage combination.

This follows the established pattern of disabling such long replication-stress tests on Azure: PR #95313 added `no-azure-blob-storage` to the sibling tests `01169_alter_partition_isolation_stress.sh`, `01171_mv_select_insert_isolation_long.sh`, `01443_merge_truncate_long.sh`, and `01516_drop_table_stress_long.sh` for the same reason. `01154_move_partition_long.sh` is the missing companion.

A previous attempt (PR #85974, closed unmerged) tried to address this by raising `s3_max_single_read_retries`. That approach did not converge.

Closes #80315.
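ClickHouse stateless tests declare runner directives in a leading comment line (`# Tags: ...` in `.sh` tests, `-- Tags: ...` in `.sql` tests), and the fix amounts to adding `no-azure-blob-storage` to that line. A simplified model of the gating idea — illustrative only, not the actual `tests/clickhouse-test` code, and the skip-set matching shown here is an assumption:

```python
# Simplified model (not the real clickhouse-test implementation) of how a
# "Tags:" header line gates a stateless test out of a given CI configuration.
def parse_tags(script_text: str) -> set[str]:
    """Return the tag set declared in a test's leading Tags: comment."""
    for line in script_text.splitlines():
        if line.startswith(("# Tags:", "-- Tags:")):
            tags = line.split("Tags:", 1)[1]
            return {t.strip() for t in tags.split(",")}
    return set()

def should_skip(script_text: str, config_skip_tags: set[str]) -> bool:
    # Skip a test whose declared tags intersect the config's skip set.
    return bool(parse_tags(script_text) & config_skip_tags)

header = "#!/usr/bin/env bash\n# Tags: long, no-azure-blob-storage\n"
print(should_skip(header, {"no-azure-blob-storage"}))  # skipped on the Azure config
print(should_skip(header, {"no-s3-storage"}))          # still runs everywhere else
```

The point of the one-line tag change: the test body is untouched, only the set of configurations it runs on shrinks.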
CIDB query for reference:
https://play.clickhouse.com/?user=play&query=SELECT+check_name%2C+test_status%2C+count%28%29+FROM+default.checks+WHERE+test_name%3D%2701154_move_partition_long%27+AND+check_start_time%3Enow%28%29-INTERVAL+30+DAY+GROUP+BY+check_name%2C+test_status
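The underlying query is easier to read once decoded from the URL above; a small sketch that extracts it:

```python
from urllib.parse import parse_qs, urlparse

# The play.clickhouse.com link above URL-encodes the CIDB query; pull it out
# of the query string so it can be read at a glance.
url = (
    "https://play.clickhouse.com/?user=play&query=SELECT+check_name%2C+test_status%2C"
    "+count%28%29+FROM+default.checks+WHERE+test_name%3D%2701154_move_partition_long%27"
    "+AND+check_start_time%3Enow%28%29-INTERVAL+30+DAY+GROUP+BY+check_name%2C+test_status"
)
query = parse_qs(urlparse(url).query)["query"][0]
print(query)
```

This prints the plain SQL: a count of check outcomes per check name and test status for `01154_move_partition_long` over the trailing 30 days.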
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Skip flaky test `01154_move_partition_long` on Azure blob storage runs.

Documentation entry for user-facing changes