Fix flaky test 03710_parallel_alter_comment_rename_selects timeout #102883
groeneai wants to merge 1 commit into ClickHouse:master
… slow CI builds

The test 03710_parallel_alter_comment_rename_selects times out on arm_asan_ubsan azure (ASAN 3x slowdown + ARM overhead + Azure I/O) because 25 iterations with 4 threads per job create heavy lock contention that pushes individual RENAME/ALTER DATABASE queries past the 30s per-query timeout.

Reducing from 25 to 10 iterations keeps 240 concurrent operations per engine (10 runs x 4 threads x 3 job types x 2 ops each), which is still sufficient for deadlock detection while completing in ~14s on debug (vs 32s before), well within timeout limits even under 8x sanitizer slowdown.
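The operation counts quoted above are a plain product of the test parameters. A minimal sketch of that arithmetic, assuming 3 job types (RENAME, ALTER COMMENT, SELECT), 2 operations per job iteration, and 2 database engines as stated in this PR; the function and constant names are illustrative and are not taken from the test script:

```python
# Illustrative arithmetic only; all names here are hypothetical,
# not identifiers from the actual test script.
JOB_TYPES = 3     # RENAME, ALTER COMMENT, SELECT (per the PR description)
OPS_PER_JOB = 2   # operations per job iteration (per the commit message)
ENGINES = 2       # the PR description mentions 2 engines


def ops_per_engine(runs: int, threads_per_job: int = 4) -> int:
    """Operations issued against one engine over a full test run."""
    return runs * threads_per_job * JOB_TYPES * OPS_PER_JOB


def concurrent_threads(threads_per_job: int = 4) -> int:
    """Threads competing for database-level locks at the same time."""
    return threads_per_job * JOB_TYPES


print(ops_per_engine(10))            # 240
print(ops_per_engine(25))            # 600
print(ops_per_engine(25) * ENGINES)  # 1200
print(concurrent_threads())          # 12
```

Under these assumptions the thread count does not depend on the number of runs, so lowering the iteration count shortens how long contention is sustained rather than how many threads contend at once.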
Pre-PR Validation (session cron:clickhouse-ci-task-worker:20260416-094500)
a) Deterministic repro? Yes. The test takes 32s on debug with RUNS=25, projecting to 160-256s under ARM ASAN Azure (5-8x overhead). Individual queries exceed the 30s QUERY_TIMEOUT due to ASAN slowdown combined with 12-thread contention on database-level locks.
b) Root cause? THREADS_PER_JOB=4 across 3 job types (RENAME + ALTER COMMENT + SELECT) creates 12 concurrent threads competing for database-level locks, and RUNS=25 sustains that contention for 25 iterations. Under ASAN (3x) + ARM (~1.5x) + Azure I/O, sustained contention causes individual RENAME/ALTER DATABASE operations to exceed the 30s per-query timeout.
c) Fix matches root cause? Yes. Reducing RUNS from 25 to 10 cuts total operations by 60%, reducing sustained contention so individual queries complete faster.
d) Test intent preserved? Yes. 10 iterations × 4 threads × 3 job types × 2 ops each = 240 operations per engine. Deadlocks manifest in 1-3 iterations; 10 is more than sufficient.
e) Both directions? RUNS=25: 32s debug → ~256s ARM ASAN Azure (timeout). RUNS=10: 14s debug → ~112s ARM ASAN Azure (within limits). 10/10 passes locally.
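The projections in items a) and e) are the debug-build runtime multiplied by an assumed slowdown factor. A rough sketch of that estimate, using the 5-8x overhead range quoted above; the helper names are hypothetical, and linear scaling is itself an assumption, since lock contention may not scale linearly with sanitizer overhead:

```python
# Back-of-the-envelope estimate; helper names are hypothetical and
# linear scaling with the slowdown factor is an assumption.
def projected_seconds(debug_seconds: float, slowdown: float) -> float:
    """Estimated wall time on a slower build, scaling the debug time linearly."""
    return debug_seconds * slowdown


def reduction_percent(old_runs: int, new_runs: int) -> float:
    """Percentage drop in total operations when lowering the run count."""
    return 100 * (old_runs - new_runs) / old_runs


# RUNS=25: 32s on debug, 5-8x overhead under ARM ASAN Azure
print(projected_seconds(32, 5), projected_seconds(32, 8))  # 160 256
# RUNS=10: 14s on debug, worst-case 8x overhead
print(projected_seconds(14, 8))                            # 112
# Operations removed by going from 25 to 10 runs
print(reduction_percent(25, 10))                           # 60.0
```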
cc @tavplubix, could you review this? It reduces iterations in the parallel ALTER/RENAME DATABASE test from 25 to 10 to prevent timeouts on slow CI builds (arm_asan_ubsan azure). The test has timed out 14 times on master in the last 30 days.
Workflow [PR], commit [34d0919]
Summary: ✅

AI Review
Summary
This PR reduces the default
ClickHouse Rules
Final Verdict
Changelog category (leave one):
Changelog entry:
<changelog_will_not_be_checked>
What is the problem?
The test 03710_parallel_alter_comment_rename_selects times out on arm_asan_ubsan, azure, sequential CI checks. CIDB shows 14 master failures in 30 days (13 on amd_coverage daily March 18-31, 1 on arm_asan_ubsan April 15) plus 5+ PR failures across unrelated PRs in the last 3 days, all timeouts (exit code 124).

The root cause is that with RUNS=25 and THREADS_PER_JOB=4, the test creates 12 concurrent threads doing RENAME DATABASE / ALTER DATABASE / SELECT operations (1200 total operations across 2 engines). Under ASAN (3x slowdown) + ARM overhead + Azure I/O latency, the heavy lock contention pushes individual queries past the 30s per-query QUERY_TIMEOUT set by the test.

How was it fixed?
Reduce RUNS from 25 to 10. This:

The test already has no-msan (added Jan 2026 for the same timeout issue on MSAN builds). The RUNS variable remains configurable via environment for anyone who wants to run more iterations locally.

Local verification
10/10 passes with randomization on debug build, all completing in 13-15 seconds: