Fix flaky test 03710_parallel_alter_comment_rename_selects timeout #102883

Open
groeneai wants to merge 1 commit into ClickHouse:master from groeneai:fix/03710-reduce-runs-timeout

@groeneai
Contributor

Changelog category (leave one):

  • CI Fix or Improvement (changelog entry is not required)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

<changelog_will_not_be_checked>

What is the problem?

The test 03710_parallel_alter_comment_rename_selects times out on arm_asan_ubsan, azure, sequential CI checks. CIDB shows 14 master failures in 30 days (13 on amd_coverage daily March 18-31, 1 on arm_asan_ubsan April 15) plus 5+ PR failures across unrelated PRs in the last 3 days — all timeouts (exit code 124).
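
For readers unfamiliar with the number, 124 is the exit status that GNU coreutils `timeout` returns when it has to stop a command that overran its limit. The sketch below only demonstrates that convention. It assumes a runner that wraps the test with `timeout`, which this PR does not show; the helper name `run_with_limit`, the 1-second limit, and the `sleep` stand-in are all hypothetical.

```shell
#!/usr/bin/env bash
# Assumption: the time limit is enforced with GNU coreutils `timeout`.
# `run_with_limit` is a hypothetical helper, not part of the ClickHouse test runner.
run_with_limit() {
    local limit_seconds="$1"
    shift
    timeout "$limit_seconds" "$@"
}

# `sleep 3` stands in for a test that runs longer than its limit.
run_with_limit 1 sleep 3
status=$?
echo "exit status: $status"   # 124 means the limit was hit
```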

The root cause is that with RUNS=25 and THREADS_PER_JOB=4, the test creates 12 concurrent threads doing RENAME DATABASE / ALTER DATABASE / SELECT operations (1200 total operations across 2 engines). Under ASAN (3x slowdown) + ARM overhead + Azure I/O latency, the heavy lock contention pushes individual queries past the 30s per-query QUERY_TIMEOUT set by the test.
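
The thread and operation counts above can be re-derived with a few lines of shell arithmetic. This is a sketch using the figures quoted in this description: `RUNS` and `THREADS_PER_JOB` are the test's own variables, while `JOB_TYPES`, `OPS_PER_JOB`, and `ENGINES` are hypothetical names for the "3 job types", "2 ops", and "2 engines" factors.

```shell
#!/usr/bin/env bash
# Figures quoted in the PR description. Only RUNS and THREADS_PER_JOB are
# names taken from the test; the other variable names are illustrative.
RUNS=25
THREADS_PER_JOB=4
JOB_TYPES=3      # RENAME DATABASE, ALTER DATABASE, SELECT
OPS_PER_JOB=2
ENGINES=2

concurrent_threads=$(( THREADS_PER_JOB * JOB_TYPES ))
total_ops=$(( RUNS * THREADS_PER_JOB * JOB_TYPES * OPS_PER_JOB * ENGINES ))

echo "concurrent threads: $concurrent_threads"   # 12
echo "total operations:   $total_ops"            # 1200
```

Setting `RUNS=10` in the same sketch gives 480 total operations while the thread count stays at 12, since the number of threads does not depend on `RUNS`.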

How was it fixed?

Reduce RUNS from 25 to 10. This:

  • Reduces total operations from 1200 to 480 (10 × 4 threads × 3 job types × 2 ops × 2 engines)
  • Reduces test runtime from ~32s to ~14s on debug build (about a 56% reduction)
  • Projected ARM ASAN Azure runtime: ~112s (vs ~256s before), well within the 600s test runner timeout
  • Preserves test intent: 10 iterations × 4 threads × 3 job types × 2 ops = 240 operations per engine, issued from 12 concurrent threads. That is more than sufficient for deadlock detection, since deadlocks are lock-ordering bugs that manifest within 1-3 iterations.
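
The relative savings in the list above follow directly from the before and after figures. A minimal sketch, assuming the approximate debug-build timings quoted here (32s and 14s); the helper name `percent_reduction` is hypothetical and integer division rounds the result down.

```shell
#!/usr/bin/env bash
# Integer percentage drop from `before` to `after`, rounded down.
percent_reduction() {
    local before="$1" after="$2"
    echo $(( (before - after) * 100 / before ))
}

ops_drop=$(percent_reduction 1200 480)     # total operations
runtime_drop=$(percent_reduction 32 14)    # approximate debug-build seconds

echo "operations: -${ops_drop}%"       # -60%
echo "runtime:    -${runtime_drop}%"   # -56%
```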

The test already carries the no-msan tag (added Jan 2026 for the same timeout issue on MSAN builds). The RUNS variable remains configurable via the environment for anyone who wants to run more iterations locally.
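
One common way for a shell test to stay configurable is a default-value parameter expansion. The sketch below assumes the script reads `RUNS` this way; the diff is not shown in this conversation, so the exact line may differ, and the helper name `effective_runs` is hypothetical.

```shell
#!/usr/bin/env bash
# Assumption: RUNS is read with a default-value expansion, so an exported
# or inline RUNS wins and 10 is used only when nothing is set.
effective_runs() {
    echo "${RUNS:-10}"
}

unset RUNS
echo "default:  $(effective_runs)"           # 10
echo "override: $(RUNS=25 effective_runs)"   # 25
```

Under that assumption, prefixing the usual test command with `RUNS=25` would restore the previous iteration count for a local run.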

Local verification

10/10 passes with randomization on debug build, all completing in 13-15 seconds:

03710_parallel_alter_comment_rename_selects: [ OK ] 14.21 sec.
03710_parallel_alter_comment_rename_selects: [ OK ] 14.12 sec.
...
03710_parallel_alter_comment_rename_selects: [ OK ] 13.83 sec.
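
A quick way to confirm the timing range from such output is to extract the duration field. This sketch uses only the three result lines shown above (the elided runs are not reproduced) and assumes the duration is always the second-to-last whitespace-separated field.

```shell
#!/usr/bin/env bash
# The three result lines shown above; the elided runs are left out.
results='03710_parallel_alter_comment_rename_selects: [ OK ] 14.21 sec.
03710_parallel_alter_comment_rename_selects: [ OK ] 14.12 sec.
03710_parallel_alter_comment_rename_selects: [ OK ] 13.83 sec.'

# Count passing lines and find the slowest run.
passed=$(printf '%s\n' "$results" | grep -c '\[ OK \]')
slowest=$(printf '%s\n' "$results" |
    awk '{ t = $(NF-1) + 0; if (t > max) max = t } END { print max }')

echo "passed: $passed, slowest: ${slowest}s"   # passed: 3, slowest: 14.21s
```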

… slow CI builds

The test 03710_parallel_alter_comment_rename_selects times out on
arm_asan_ubsan azure (ASAN 3x slowdown + ARM overhead + Azure I/O)
because 25 iterations with 4 threads create heavy lock contention
that pushes individual RENAME/ALTER DATABASE queries past the 30s
per-query timeout.

Reducing from 25 to 10 iterations keeps 240 concurrent operations
per engine (10 runs x 4 threads x 3 job types x 2 ops each),
which is still sufficient for deadlock detection while completing
in ~14s on debug (vs 32s before), well within timeout limits even
under 8x sanitizer slowdown.
@groeneai
Contributor Author

Pre-PR Validation (session cron:clickhouse-ci-task-worker:20260416-094500)

a) Deterministic repro? Yes — test takes 32s on debug with RUNS=25, projecting to 160-256s under ARM ASAN Azure (5-8x overhead). Individual queries exceed the 30s QUERY_TIMEOUT due to ASAN slowdown × 12-thread lock contention on database-level locks.

b) Root cause? RUNS=25 × THREADS_PER_JOB=4 creates 12 concurrent threads competing for database-level locks (RENAME + ALTER COMMENT + SELECT). Under ASAN (3x) + ARM (~1.5x) + Azure I/O, sustained contention causes individual RENAME/ALTER DATABASE operations to exceed the 30s per-query timeout.

c) Fix matches root cause? Yes — reducing RUNS from 25 to 10 reduces total operations by 60%, reducing sustained contention so individual queries complete faster.

d) Test intent preserved? Yes — 10 iterations × 4 threads × 3 job types × 2 ops = 240 operations per engine. Deadlocks manifest in 1-3 iterations; 10 is more than sufficient.

e) Both directions? RUNS=25: 32s debug → ~256s ARM ASAN Azure (timeout). RUNS=10: 14s debug → ~112s ARM ASAN Azure (within limits). 10/10 passes locally.
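
The projections in a) and e) amount to multiplying the debug-build runtime by an overhead factor. A minimal sketch, assuming runtime scales linearly with the 5-8x overhead range quoted in a); real lock contention may not scale linearly, and the helper name `project` is hypothetical.

```shell
#!/usr/bin/env bash
# Projected wall-clock seconds = debug-build seconds x overhead factor.
# Linear scaling is the assumption made in this validation, not a measurement.
project() {
    local debug_seconds="$1" factor="$2"
    echo $(( debug_seconds * factor ))
}

echo "RUNS=25: $(project 32 5)s to $(project 32 8)s"   # RUNS=25: 160s to 256s
echo "RUNS=10: $(project 14 5)s to $(project 14 8)s"   # RUNS=10: 70s to 112s
```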

@groeneai
Contributor Author

cc @tavplubix — could you review this? It reduces iterations in the parallel ALTER/RENAME DATABASE test from 25 to 10 to prevent timeouts on slow CI builds (arm_asan_ubsan azure). The test was timing out with 14 master failures in 30 days.

@nikitamikhaylov added the "can be tested" label (Allows running workflows for external contributors) on Apr 16, 2026
@clickhouse-gh
Contributor

clickhouse-gh Bot commented Apr 16, 2026

Workflow [PR], commit [34d0919]

AI Review

Summary

This PR reduces the default RUNS in 03710_parallel_alter_comment_rename_selects.sh from 25 to 10 to lower timeout risk on slower CI environments while preserving concurrent coverage structure. Based on the diff, this is a focused CI-stability adjustment with low risk to product correctness.

ClickHouse Rules

  • Deletion logging
  • Serialization versioning
  • Core-area scrutiny
  • No test removal
  • Experimental gate
  • No magic constants
  • Backward compatibility
  • SettingsChangesHistory.cpp
  • PR metadata quality
  • Safe rollout
  • Compilation time
  • No large/binary files

Final Verdict

  • Status: ✅ Approve

@clickhouse-gh (Bot) added the "pr-ci" label on Apr 16, 2026
@evillique self-assigned this on Apr 16, 2026