
Enhance test execution and timeout configurations #3457

Merged
mkoura merged 3 commits into master from tune_cluster_instance_distribution on May 8, 2026

mkoura (Collaborator) commented May 8, 2026

This pull request introduces significant improvements to the test cluster scheduling and resource management system, focusing on smarter distribution of test workloads and better prioritization for smoke and heavy tests. The changes include a new scheduling strategy for cluster instances, the introduction of a dedicated fast lane for smoke tests in the xdist scheduler, and adjustments to test timeouts for more robust test execution.

Cluster Instance Scheduling Improvements:

  • Added the _make_instances_order method to cluster_getter.py to prioritize heavy tests for head cluster instances and pack light tests onto tail instances, reducing resource contention and improving scheduling efficiency.
  • Updated the instance iteration logic in get_cluster_instance to use the new precomputed order, ensuring better allocation of cluster resources.

Test Scheduler Enhancements (xdist):

  • Introduced smoke test prioritization: when enough workers are available, a subset is dedicated to running smoke tests, ensuring these short tests are not delayed by longer or heavier tests. This includes new constants, worker selection logic, and scheduler routing for smoke tests.
  • Modified the test collection process to tag smoke tests and ensure they are routed to the dedicated smoke worker lane.
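
As an illustration of the tagging step, the sketch below marks smoke tests by appending a suffix to the node ID so that a scheduler can recognize them later. The `@smoke` suffix, both function names, and the sample node IDs are hypothetical and are not taken from `xdist_scheduler.py`; the actual encoding may differ.

```python
# Hypothetical tag; the real scheduler may encode smoke tests differently.
SMOKE_TAG = "@smoke"


def tag_nodeid(nodeid: str, marker_names: set[str]) -> str:
    """Append the smoke tag to a node ID when the test carries a `smoke` marker."""
    if "smoke" in marker_names and not nodeid.endswith(SMOKE_TAG):
        return f"{nodeid}{SMOKE_TAG}"
    return nodeid


def is_smoke(nodeid: str) -> bool:
    """Return True when the node ID was tagged as a smoke test."""
    return nodeid.endswith(SMOKE_TAG)


# Made-up collection result: node ID -> names of markers on that test.
collected = {
    "tests/test_tx.py::test_transfer": {"smoke"},
    "tests/test_pools.py::test_stake_pool": {"long"},
}
tagged = [tag_nodeid(nodeid, markers) for nodeid, markers in collected.items()]
smoke_lane = [nodeid for nodeid in tagged if is_smoke(nodeid)]
```

Carrying the tag inside the node ID means the scheduler only needs the list of collected IDs to decide routing, which is the information xdist workers already report.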

Timeout and Configuration Adjustments:

  • Increased the default session timeout for testnets from 20h to 24h in run_tests.sh to reduce premature test termination.
  • Extended both the soft and hard grace periods for cluster resource cleanup to allow for longer test sessions.
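
For scale, the following sketch converts the figures quoted in this PR's commit message (grace period 28800/30000 -> 36000/37800 seconds, SESSION_TIMEOUT 20h -> 24h) into hours. The commit message does not say which value is the soft limit and which is the hard one, so the labels `first` and `second` are placeholders, and `as_hours` is a helper invented for this example.

```python
from datetime import timedelta

# Values quoted in the commit message: (before, after), in seconds.
GRACE_PERIODS = {
    "first": (28_800, 36_000),
    "second": (30_000, 37_800),
}
# Session timeout for testnets: (before, after), in hours.
SESSION_TIMEOUT_HOURS = (20, 24)


def as_hours(seconds: int) -> float:
    """Convert a duration in seconds to fractional hours."""
    return timedelta(seconds=seconds) / timedelta(hours=1)


for name, (old, new) in GRACE_PERIODS.items():
    print(f"{name}: {as_hours(old):.2f}h -> {as_hours(new):.2f}h (+{new - old}s)")

old_timeout, new_timeout = SESSION_TIMEOUT_HOURS
print(f"session timeout: {old_timeout}h -> {new_timeout}h")
```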

Other Improvements:

  • Minor refactoring and documentation updates to clarify the new scheduling logic and marker handling.

These updates collectively make the test infrastructure more robust, efficient, and responsive to different test types and workloads.

mkoura added 2 commits May 8, 2026 15:25

Extend the wait windows to match: cluster grace period 28800/30000 ->
36000/37800 seconds, and target_testnets SESSION_TIMEOUT 20h -> 24h.

Light tests (no `lock_resources`, only `CLUSTER` in `use_resources`) now
iterate cluster instances starting from the last two. Heavy tests start
from the head. This keeps earlier instances free for heavy tests, which
otherwise struggle to obtain an instance once light tests spread across
all of them.

Iteration order is computed by the new `_make_instances_order` helper,
called outside of the global cluster lock to keep the locked section as
short as possible.
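
The ordering described in this commit message can be sketched roughly as follows. This is not the code from `cluster_getter.py`: the function names, the `tail_size` parameter, the use of the plain string `"CLUSTER"`, and the fallback order once the tail instances are busy (remaining instances from the end toward the head) are all assumptions made for illustration.

```python
def is_light_test(lock_resources: list[str], use_resources: list[str]) -> bool:
    """Treat a test as light when it locks nothing and uses only the cluster itself."""
    return not lock_resources and set(use_resources) == {"CLUSTER"}


def make_instances_order(
    num_instances: int, light: bool, tail_size: int = 2
) -> list[int]:
    """Return the order in which cluster instances are tried.

    Heavy tests walk the instances from the head. Light tests try the last
    `tail_size` instances first, then fall back toward the head (assumed).
    """
    instances = list(range(num_instances))
    if not light or tail_size <= 0 or num_instances <= tail_size:
        return instances
    head, tail = instances[:-tail_size], instances[-tail_size:]
    return tail + head[::-1]


heavy_order = make_instances_order(5, light=False)
light_order = make_instances_order(5, light=is_light_test([], ["CLUSTER"]))
```

Because the function depends only on the instance count and the test's resource requirements, it can be evaluated before taking any lock, which matches the stated goal of keeping the locked section short.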
@mkoura mkoura requested a review from saratomaz as a code owner May 8, 2026 15:12
@mkoura mkoura requested a review from Copilot May 8, 2026 20:39
Contributor

Copilot AI left a comment

Pull request overview

This PR improves the test execution infrastructure by optimizing how cluster instances are selected under parallel load and by enhancing the pytest-xdist scheduler to prioritize smoke tests, along with extending timeouts to better accommodate long-running sessions.

Changes:

  • Increased default testnet session timeout and extended cluster cleanup grace periods to reduce premature termination.
  • Added a cluster instance iteration strategy that biases light tests toward “tail” instances to keep “head” instances available for heavy tests.
  • Enhanced the custom xdist scheduler with a dedicated smoke-test fast lane when sufficient workers are available.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| runner/run_tests.sh | Increases default SESSION_TIMEOUT for testnets to 24h. |
| cardano_node_tests/pytest_plugins/xdist_scheduler.py | Adds smoke-test routing/dedicated worker lane and updates nodeid tagging logic. |
| cardano_node_tests/cluster_management/cluster_getter.py | Extends grace periods and introduces precomputed instance iteration ordering for improved scheduling. |

Comment thread cardano_node_tests/pytest_plugins/xdist_scheduler.py Outdated
When the run has at least SMOKE_DEDICATED_THRESHOLD (10) workers and the
collection contains smoke tests, reserve SMOKE_DEDICATED_COUNT (2) of the
lowest-id workers as a smoke-only fast lane. Other workers continue to
schedule any work, including smoke. This prevents smoke tests from being
queued behind long/heavy tests on overloaded workers.
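
A minimal sketch of the reservation rule described in this commit message is shown below. The two constants and their values come from the message; the function names, the assumption that worker IDs follow the usual pytest-xdist `gw<N>` pattern, and the routing helper are illustrative and not the scheduler's actual code.

```python
SMOKE_DEDICATED_THRESHOLD = 10  # minimum number of workers before a lane is reserved
SMOKE_DEDICATED_COUNT = 2  # number of workers reserved for smoke tests


def worker_index(worker_id: str) -> int:
    """Extract the numeric part of a worker ID such as `gw3` (assumed format)."""
    return int(worker_id[2:])


def select_smoke_workers(worker_ids: list[str], has_smoke_tests: bool) -> list[str]:
    """Pick the lowest-id workers as the smoke-only lane, or none at all."""
    if not has_smoke_tests or len(worker_ids) < SMOKE_DEDICATED_THRESHOLD:
        return []
    # Sort numerically so that `gw10` comes after `gw2`.
    return sorted(worker_ids, key=worker_index)[:SMOKE_DEDICATED_COUNT]


def eligible_workers(
    test_is_smoke: bool, worker_ids: list[str], smoke_workers: list[str]
) -> list[str]:
    """Smoke tests may run anywhere; other tests avoid the reserved lane."""
    if test_is_smoke:
        return list(worker_ids)
    return [w for w in worker_ids if w not in smoke_workers]


workers = [f"gw{i}" for i in range(12)]
lane = select_smoke_workers(workers, has_smoke_tests=True)
others = eligible_workers(False, workers, lane)
```

With fewer than ten workers, or with no smoke tests collected, `select_smoke_workers` returns an empty list and every worker schedules any test, so small runs behave as before.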
@mkoura mkoura force-pushed the tune_cluster_instance_distribution branch from cc2a150 to 4a94cd5 Compare May 8, 2026 20:54
@mkoura mkoura requested a review from Copilot May 8, 2026 20:54
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

Comment thread cardano_node_tests/pytest_plugins/xdist_scheduler.py
@mkoura mkoura merged commit c519d63 into master May 8, 2026
7 checks passed
@mkoura mkoura deleted the tune_cluster_instance_distribution branch May 8, 2026 21:06