Skip 01154_move_partition_long on Azure blob storage #103857

Open
groeneai wants to merge 1 commit into ClickHouse:master from
groeneai:groeneai/fix-flaky-01154-move-partition-long-azure

Conversation

@groeneai
Contributor

@groeneai groeneai commented May 1, 2026

Why

Test 01154_move_partition_long.sh is chronically flaky on a single CI configuration: Stateless tests (arm_asan_ubsan, azure, sequential). CIDB shows 23 failures vs 4583 successes on that one config over 30 days, while every other configuration the test runs on (amd_debug, amd_asan_ubsan with distributed plan / db disk, amd_tsan local + s3, amd_binary, arm_binary) sits at 0 failures across roughly 30K runs.
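The imbalance is easy to quantify. A quick sketch using the counts above (Python as a neutral calculator; the 30K figure is the approximate total given above, not an exact CIDB number):

```python
# Failure rate on the single flaky config vs. all other configs (CIDB, 30 days)
azure_fail, azure_ok = 23, 4583
other_fail, other_runs = 0, 30_000  # "roughly 30K runs", approximate

azure_rate = azure_fail / (azure_fail + azure_ok)
print(f"azure config failure rate: {azure_rate:.2%}")
print(f"all other configs: {other_fail} failures in ~{other_runs} runs")
```

That per-run rate of roughly 0.5% on one config, against zero failures everywhere else, is the basis for treating this as infrastructure flakiness rather than a test bug.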

The failing config affected at least 18 distinct PRs and 5 master runs in the last 30 days, including PRs #103733, #103723, #103263, #101158, #101143, #101023, #100943, #100383, #100332, #100177, #100035, #99802, #99543, #102354, #101973, #101788, #101401, and others — it costs developers a CI re-run almost every day.

Failure modes

Both modes stem from slow Azure object-storage I/O combined with the ASan + UBSan overhead amplifying the 40-second concurrency-chaos workload (16 replicated tables, 8 background threads doing mixed INSERT, MOVE PARTITION, REPLACE PARTITION, DROP PARTITION, OPTIMIZE, DROP PART):

  1. TimeoutError in _cleanup_database — the test framework's per-test deadline expires while waiting for the cleanup HTTP request to come back from a slow Azure backend.

  2. result differs with reference showing extra rows from the test's own diagnostic line:

    SELECT table, lost_part_count
    FROM system.replicas
    WHERE database = currentDatabase() AND lost_part_count != 0

    Slow I/O widens the window where concurrent DROP_RANGE entries race against in-flight INSERT propagation, producing genuine but timing-dependent lost parts. Some failures also include many curl: (28) Operation timed out after 120001 milliseconds lines.

Why no-azure-blob-storage

The test verifies ReplicatedMergeTree replication invariants under chaos. It is not testing Azure-specific behavior. Coverage of those invariants is preserved by every non-Azure configuration the test still runs on, including S3 storage on amd_tsan. Disabling on Azure removes the slow-I/O variable while keeping every meaningful sanitizer/storage combination.

This follows the established pattern of disabling such long replication-stress tests on Azure: PR #95313 added no-azure-blob-storage to the sibling tests 01169_alter_partition_isolation_stress.sh, 01171_mv_select_insert_isolation_long.sh, 01443_merge_truncate_long.sh, and 01516_drop_table_stress_long.sh for the same reason. 01154_move_partition_long.sh is the missing companion.

A previous attempt (PR #85974, closed unmerged) tried to address this by raising s3_max_single_read_retries. That approach did not converge.

Closes #80315.

CIDB query for reference (decoded from the playground URL below):

    SELECT check_name, test_status, count()
    FROM default.checks
    WHERE test_name = '01154_move_partition_long'
      AND check_start_time > now() - INTERVAL 30 DAY
    GROUP BY check_name, test_status

https://play.clickhouse.com/?user=play&query=SELECT+check_name%2C+test_status%2C+count%28%29+FROM+default.checks+WHERE+test_name%3D%2701154_move_partition_long%27+AND+check_start_time%3Enow%28%29-INTERVAL+30+DAY+GROUP+BY+check_name%2C+test_status

Changelog category (leave one):

  • CI Fix or Improvement (changelog entry is not required)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Skip flaky test 01154_move_partition_long on Azure blob storage runs.

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

This concurrent replication-chaos test (40s of mixed INSERT, MOVE PARTITION,
REPLACE PARTITION, DROP PARTITION, OPTIMIZE, DROP PART against 16 replicated
tables) reliably exceeds the test framework timeout on the
`Stateless tests (arm_asan_ubsan, azure, sequential)` configuration. CIDB shows
23 failures / 4583 successes on that single config over 30 days, while every
other configuration (amd_debug, amd_asan_ubsan with distributed plan / db disk,
amd_tsan local + s3, amd_binary, arm_binary) is at 0 failures across ~30K runs.

The two failure modes both stem from slow Azure object-storage I/O combined with
the ASan + UBSan overhead:

- `TimeoutError` in `_cleanup_database` after the chaos finishes: the cleanup
  HTTP request blocks longer than the framework's per-test deadline.
- `result differs with reference` showing extra rows from
  `SELECT table, lost_part_count FROM system.replicas WHERE lost_part_count != 0`
  — the chaos-test diagnostic check fires because slow I/O widens the window
  for races between concurrent `DROP_RANGE` entries and INSERT propagation,
  yielding genuine but timing-dependent lost parts.

The test verifies `ReplicatedMergeTree` replication invariants under chaos; it
is not testing Azure-specific behavior. Coverage of those invariants is
preserved by every non-Azure configuration the test still runs on.

This follows the established pattern of disabling such long replication-stress
tests on Azure: PR ClickHouse#95313 added `no-azure-blob-storage` to sibling tests
including `01169_alter_partition_isolation_stress.sh`,
`01171_mv_select_insert_isolation_long.sh`, `01443_merge_truncate_long.sh`,
`01516_drop_table_stress_long.sh` for the same reason.

Closes ClickHouse#80315

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@groeneai
Contributor Author

groeneai commented May 1, 2026

Pre-PR validation gate (per worker policy):

(a) Deterministic repro? N/A by design — the fix is a configuration tag that the test runner is verified to recognize (tests/clickhouse-test:2605 skips when no-azure-blob-storage tag is set and args.azure_blob_storage is true). The failing CI configuration is verified by CIDB to be exclusively Stateless tests (arm_asan_ubsan, azure, sequential).
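For reference, the runner-side check this relies on can be sketched as follows. This is an illustrative reimplementation of the behavior described above; the function name and signature are hypothetical, not the actual tests/clickhouse-test code:

```python
def should_skip_test(tags: set[str], azure_blob_storage: bool) -> bool:
    """Hypothetical helper mirroring the check described at
    tests/clickhouse-test:2605: a test tagged no-azure-blob-storage
    is skipped when the runner is in Azure blob storage mode."""
    return "no-azure-blob-storage" in tags and azure_blob_storage

# The tag only removes the Azure configuration; all other configs still run the test.
print(should_skip_test({"long", "no-azure-blob-storage"}, True))   # True  (Azure: skipped)
print(should_skip_test({"long", "no-azure-blob-storage"}, False))  # False (non-Azure: runs)
print(should_skip_test({"long"}, True))                            # False (untagged: runs)
```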

(b) Root cause explained? Yes. Slow Azure object-storage I/O combined with ASan+UBSan overhead amplifies the 40-second chaos workload past the framework deadline AND widens the timing window for races between concurrent DROP_RANGE and INSERT propagation. The test's own lost_part_count != 0 diagnostic — designed to surface real replication races — fires when those races land within the test window. The PR description has the failure modes spelled out.

(c) Fix matches root cause? Yes. The test verifies ReplicatedMergeTree chaos invariants, not Azure-specific behavior. Skipping Azure removes the slow-I/O variable that makes the chaos window unmanageable on this single config, identical to PR #95313's treatment of sibling tests 01169, 01171, 01443, 01516.

(d) Test intent preserved? Yes. Test still runs on amd_debug, amd_asan_ubsan (distributed plan / db disk), amd_tsan local + s3, amd_binary, arm_binary — all of which ran cleanly across ~30K invocations in the last 30 days. Replication-chaos coverage is fully maintained.

(e) Both directions demonstrated? Demonstrated via CIDB rather than a local rerun: the test fails 23 / 4606 times only on arm_asan_ubsan, azure, sequential and never on the seven other production configs. The tag will mechanically eliminate that single failing combination via a code path verified to exist in tests/clickhouse-test. Local repro is impractical because we don't have an arm_asan_ubsan + Azurite box, and the failure rate is roughly 0.5% even on that config.

(f) Fix is general, not narrow patch? Yes. The same change has already been applied to every sibling stress test in PR #95313. 01154 was the missing companion to that batch.

Session id: cron:clickhouse-ci-task-worker:20260501-131500

@groeneai
Contributor Author

groeneai commented May 1, 2026

cc @alesapin — could you take a look? This adds the same no-azure-blob-storage tag to 01154_move_partition_long.sh that PR #95313 already applied to its sibling chaos tests (01169 / 01171 / 01443 / 01516). Test-only one-liner; CIDB justification is in the PR description and validation comment above.

@alexey-milovidov alexey-milovidov added the can be tested Allows running workflows for external contributors label May 1, 2026
@clickhouse-gh
Contributor

clickhouse-gh Bot commented May 1, 2026

Workflow [PR], commit [29b92ab]

Summary:

    job_name                                             test_name                status  info
    Stateless tests (amd_msan, WasmEdge, parallel, 1/2)                           FAIL
                                                         04051_pk_analysis_stats  FAIL    cidb

AI Review

Summary

This PR adds no-azure-blob-storage to 01154_move_partition_long.sh and documents the rationale inline, aligning the test with existing sibling long stress tests that are already excluded on Azure blob storage due to infra-specific flakiness. I did not find correctness, safety, or maintainability issues in the proposed change.

ClickHouse Rules

  • Deletion logging
  • Serialization versioning
  • Core-area scrutiny
  • No test removal
  • Experimental gate
  • No magic constants
  • Backward compatibility
  • SettingsChangesHistory.cpp
  • PR metadata quality
  • Safe rollout
  • Compilation time
  • No large/binary files

Final Verdict

Status: ✅ Approve

@alexey-milovidov
Member

@groeneai, fix the 04051_pk_analysis_stats test and send a separate PR, or link the existing fix here. Then update the branch here.

@alexey-milovidov alexey-milovidov self-assigned this May 6, 2026
alexey-milovidov pushed a commit that referenced this pull request May 6, 2026
These two tests (`01192_rename_database_zookeeper.sh` and
`01307_multiple_leaders_zookeeper.sh`) reliably exceed the test
framework timeout on the `Stateless tests (arm_asan_ubsan, azure,
sequential)` configuration, while passing on every other configuration
without a single failure.

The failure mode is `Azure::Storage::StorageException: 408 Timeout:
connect timed out: 127.0.0.1:10000` — the Azurite mock storage
backend cannot keep up under the combined load of `arm` + ASan/UBSan
instrumentation slowdown plus the test's concurrent `INSERT` workload.

Both tests target ZooKeeper / `ReplicatedMergeTree` behaviour, not
Azure-specific code paths, so excluding them on the Azure blob storage
config does not reduce coverage of the feature being tested.

CIDB evidence (30-day window):

  Test                                Azure config        Other configs
  01192_rename_database_zookeeper     19 FAIL / 4482 OK   0 FAIL / 50K+ OK
  01307_multiple_leaders_zookeeper    14 FAIL / 4491 OK   0 FAIL / 45K+ OK

This follows the same pattern used by sibling concurrent
replication-chaos tests that already carry the `no-azure-blob-storage`
tag: `01154_move_partition_long.sh` (PR #103857),
`01169_alter_partition_isolation_stress.sh`,
`01171_mv_select_insert_isolation_long.sh`,
`01443_merge_truncate_long.sh`, `01516_drop_table_stress_long.sh`.
@groeneai
Contributor Author

groeneai commented May 6, 2026

@alexey-milovidov filed the fix as #104272.

Investigation summary for the chronic-flaky label on 04051_pk_analysis_stats:

The single transient failure on this PR (#103857) at commit 29b92abb, run 2026-05-01 16:39:04 in Stateless tests (amd_msan, WasmEdge, parallel, 1/2), was flagged "not reproducible" by the diagnoser (6/6 reruns passed). It is not blocking anymore — re-running the failed shard should be enough.

Verified locally on a debug build that the pinned test passes 50/50 and 100/100 even with adversarial overrides that disable QCC, prewhere, multiple-prewhere-read-steps, and force --session_timezone America/Mazatlan.


Labels

can be tested Allows running workflows for external contributors pr-ci


Development

Successfully merging this pull request may close these issues.

01154_move_partition_long is flaky

2 participants