1. Problem
pytest-randomly is valuable because it surfaces hidden order dependencies, leaked global state, missing cleanup, and other test-suite bugs that normal deterministic runs miss. The problem is where that signal appears today: in the middle of human-driven development and CI work that is usually about something else.
When an order-dependent failure surfaces in the middle of unrelated work, it is rarely actionable in the moment. A developer is pulled away from the change they were trying to make, CI reruns become the default response, and the team spends time deciding whether a failure is "real" before anyone can even start fixing it. In other words, pytest-randomly is doing useful chaos testing, but it is delivering the interrupt to the wrong place.
The goal is to keep the bug-finding value while moving the noise out of the normal developer path.
2. Proposed tool
Move pytest-randomly from the human-driven test environment into a dedicated nightly flake-finding workflow, then attach an AI agent to that workflow.
The nightly job would run the relevant CUDA Python test suites with randomized ordering across a controlled matrix and multiple seeds. When a failure occurs, an agent would automatically:
- Collect the failing workflow run, job, matrix entry, pytest seed, logs, and artifacts (a log-parsing sketch follows this list).
- Reproduce or narrow the failure using the same seed and a smaller subset of tests.
- Diagnose the likely order dependency or leaked state.
- Create a GitHub issue with the failure signature, reproduction command, seed, suspected cause, and affected tests.
- Open a PR when the fix is straightforward.
- Continue on PR review feedback when reviewers leave comments, while allowing a human to take over if local GPU debugging is more practical.
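As a concrete illustration of the collection step, here is a minimal sketch of pulling the seed out of a captured job log and emitting a reproduction command. The log file name is a placeholder; the regex targets the "Using --randomly-seed=<N>" line that pytest-randomly prints in its session header.

```python
# Minimal sketch: extract the pytest-randomly seed from a captured job log
# and build the reproduction command the agent would attach to an issue.
import re
from pathlib import Path

SEED_RE = re.compile(r"Using --randomly-seed=(\d+)")

def extract_seed(log_text: str) -> int | None:
    match = SEED_RE.search(log_text)
    return int(match.group(1)) if match else None

log_text = Path("nightly-job.log").read_text()  # hypothetical artifact name
seed = extract_seed(log_text)
if seed is not None:
    print(f"reproduce with: pytest --randomly-seed={seed} <failing test paths>")
```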
This makes pytest-randomly a scheduled maintenance signal rather than a surprise interruption. Humans still review fixes, but they should only enter the loop after the agent has already produced the useful context: reproducible seed, minimized test order when possible, and either a proposed patch or a clearly scoped issue.
3. Metric it moves
Primary metric: fewer random-order failures reaching human-driven CI and local development.
Useful supporting metrics:
- Number of pytest-randomly failures detected nightly.
- Percentage of failures with an automatically minimized reproduction.
- Percentage of failures where the agent opens a PR.
- Median time from nightly failure to issue creation.
- Median time from nightly failure to merged fix.
- Reduction in CI reruns caused by suspected test-order flakes.
- Developer-reported interruptions from unrelated randomized test failures.
4. Rough effort
Medium.
The first phase is small: disable randomized test reordering in normal developer workflows, or remove pytest-randomly from default test dependency groups and install it only in the nightly job.
The CI work is moderate but should reuse existing pieces. The repository already has scheduled CI, repeated test runs, and reusable wheel-test workflows. The agent integration is the larger part: it needs access to workflow logs, a reproducible command format, issue/PR creation, and guardrails about when it should stop and ask for human help.
Existing CI hooks to reuse
There are several relevant pieces already in the repo:
- .github/workflows/ci.yml already runs on a daily schedule at midnight UTC.
- Scheduled ci.yml runs already pass nruns: 5 into the wheel test jobs, using the existing pytest-repeat plumbing.
- .github/workflows/test-wheel-linux.yml and .github/workflows/test-wheel-windows.yml already expose reusable inputs such as build-type, matrix_filter, and nruns.
- ci/test-matrix.yml already separates pull-request and nightly matrices, although the current nightly entries are empty.
- ci/tools/run-tests currently invokes pytest with --randomly-dont-reorganize for cuda.bindings and cuda.core, which means the project is already partially opting out of randomized ordering even while pytest-randomly is installed.
PR #1987 is also directly relevant. It adds a new .github/workflows/ci-nightly.yml orchestrator for optional-dependency testing, scheduled at 2 AM UTC, and extends the wheel-test workflows so they can download wheels from the latest successful main CI run and select test modes via the existing matrix/filter machinery.
If PR #1987 lands, the pytest-randomly work should probably build on the same pattern rather than inventing a parallel structure. The clean options are:
- Add a nightly-random-order mode to the new ci-nightly.yml flow.
- Create a sibling nightly workflow that reuses the same lookup-run-id, artifact download, test-mode, and matrix-filter conventions.
Either way, the important design point is to avoid rebuilding wheels just to hunt random-order flakes. Use known-good artifacts from the latest successful main run, spend the nightly budget on randomized test execution, and keep the normal PR path deterministic.
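A hedged sketch of the lookup step, using the standard GitHub REST API to find the latest successful main CI run whose wheel artifacts the nightly job would reuse. The owner/repo pair is an assumption; ci.yml is the workflow file named above.

```python
# Sketch: look up the latest successful main-branch ci.yml run so the nightly
# randomized job can download its wheels instead of rebuilding.
import requests

OWNER, REPO = "NVIDIA", "cuda-python"  # assumption: the CUDA Python repository
url = f"https://api.github.com/repos/{OWNER}/{REPO}/actions/workflows/ci.yml/runs"
resp = requests.get(
    url,
    params={"branch": "main", "status": "success", "per_page": 1},
    headers={"Accept": "application/vnd.github+json"},
    timeout=30,
)
resp.raise_for_status()
runs = resp.json()["workflow_runs"]
if runs:
    print("latest good run id:", runs[0]["id"])  # feed into artifact download
```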
Rollout plan
Phase 1: Make normal runs deterministic
Remove random-order behavior from local and PR-oriented paths. Depending on what is least disruptive, this could mean either:
- Remove pytest-randomly from default developer test dependency groups and install it only in the nightly workflow.
- Keep the dependency but make --randomly-dont-reorganize the default for normal test entry points (sketched below).
The key point is that developers should not discover unrelated order dependencies while trying to land unrelated work.
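A minimal sketch of the second option, assuming pytest-randomly stays installed: a root conftest.py hook appends --randomly-dont-reorganize unless a hypothetical RANDOMIZE_TESTS variable, which only the nightly job would set, opts back in.

```python
# conftest.py (repository root) — sketch: deterministic order by default,
# randomized order only when the nightly job opts in.
import os

def pytest_load_initial_conftests(early_config, parser, args):
    # --randomly-dont-reorganize keeps pytest-randomly's seeding behavior but
    # disables test shuffling; RANDOMIZE_TESTS is a hypothetical opt-in.
    if os.environ.get("RANDOMIZE_TESTS") != "1" and "--randomly-dont-reorganize" not in args:
        args.append("--randomly-dont-reorganize")
```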
Phase 2: Add the nightly randomized workflow
Add a scheduled workflow that runs the relevant test suites with pytest-randomly enabled and with explicit seed reporting. Start with a focused matrix to keep signal high and runner cost controlled, then expand after the loop proves useful.
Initial scope could be:
- cuda.pathfinder, cuda.bindings, and cuda.core tests.
- Linux first, then Windows once log capture and reproduction are solid.
- A small number of seeds per matrix entry.
- Reuse nruns or add a clearer randomly-seeds input so failures can be tied to exact seeds rather than only to repeated runs (a seed-driving sketch follows this list).
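The seed-driving sketch referenced above: run the suite once per explicit seed so any failure maps directly to a reproduction command. The seed list and test path are placeholders for whatever a randomly-seeds style input would supply.

```python
# Sketch: one pytest invocation per explicit seed; a failure is then tied to
# an exact --randomly-seed value rather than to an opaque repeated run.
import subprocess
import sys

seeds = [1001, 1002, 1003]  # hypothetical per-matrix-entry seed list
for seed in seeds:
    cmd = [sys.executable, "-m", "pytest", f"--randomly-seed={seed}", "tests/"]
    if subprocess.run(cmd).returncode != 0:
        print(f"failure: reproduce with {' '.join(cmd)}")
```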
Phase 3: Attach the AI agent
Run an agent after nightly failures. The agent should:
- Fetch the failing workflow run and logs.
- Extract the failing matrix, package, test path, seed, and first useful traceback.
- Rerun with the same seed, then bisect or minimize the test order when feasible (see the bisection sketch after this list).
- Identify leaked state such as environment variables, global module state, CUDA context/device state, monkeypatches, random seeds, temporary files, or process-wide configuration.
- Prefer fixes that restore test isolation over fixes that hide the failure with reruns, sleeps, or broad skips.
- Open an issue when diagnosis is useful but a fix is not safe.
- Open a PR when the fix is scoped and reviewable.
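The bisection step could look roughly like the sketch below. fails_with and minimize are illustrative helpers, not an existing API; it assumes pytest runs explicitly listed node ids in the given order once pytest-randomly is disabled with -p no:randomly.

```python
# Sketch: binary-search the tests that ran before the failing one (in the
# seeded order) down to a minimal set that still triggers the failure.
import subprocess
import sys

def fails_with(prefix: list[str], failing: str) -> bool:
    """True if `failing` fails when run after the tests in `prefix`."""
    cmd = [sys.executable, "-m", "pytest", "-p", "no:randomly", *prefix, failing]
    return subprocess.run(cmd).returncode != 0

def minimize(prefix: list[str], failing: str) -> list[str]:
    while len(prefix) > 1:
        half = len(prefix) // 2
        first, second = prefix[:half], prefix[half:]
        if fails_with(second, failing):
            prefix = second   # a culprit is in the later half
        elif fails_with(first, failing):
            prefix = first    # a culprit is in the earlier half
        else:
            break             # the failure needs tests from both halves
    return prefix
```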
Phase 4: Review loop and ownership
Treat agent-created PRs like normal engineering work. Reviewers can leave comments for the agent to address, but they can also take over immediately when a failure requires hands-on GPU debugging or broader test-design judgment.
Issue labels should make the queue easy to triage, for example test-flake, pytest-randomly, and package-specific labels such as cuda-core or cuda-bindings.
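For the issue-filing step, a hedged sketch using the standard GitHub REST issues endpoint; the repository name, token handling, and payload contents are placeholders, and the labels mirror the triage scheme above.

```python
# Sketch: open a triage-ready issue carrying the seed, job URL, matrix entry,
# and reproduction command, with labels from the scheme described above.
import os
import requests

OWNER, REPO = "NVIDIA", "cuda-python"  # assumption: the CUDA Python repository
payload = {
    "title": "random-order flake: test_foo fails after test_bar (seed 1002)",
    "body": (
        "Seed: 1002\n"
        "Failing job: <workflow run URL>\n"
        "Matrix entry: <matrix entry>\n"
        "Reproduction: pytest --randomly-seed=1002 <test paths>\n"
    ),
    "labels": ["test-flake", "pytest-randomly", "cuda-core"],
}
resp = requests.post(
    f"https://api.github.com/repos/{OWNER}/{REPO}/issues",
    json=payload,
    headers={"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"},
    timeout=30,
)
resp.raise_for_status()
```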
Guardrails
- Do not use pytest-rerunfailures as the fix for random-order failures unless the underlying failure is genuinely external (a fixture sketch after this list shows the preferred fix shape).
- Do not silence failures with broad skips.
- Every issue should include the seed, failing job URL, matrix entry, and reproduction command.
- Every PR should explain the leaked state or order dependency it fixes.
- Start with a bounded nightly budget and expand only when the signal-to-noise ratio is good.
- Keep normal PR CI deterministic, except for explicit opt-in debugging jobs.
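To make the "fix isolation, don't rerun" guardrail concrete, here is a minimal fixture-shaped sketch; CUDA_VISIBLE_DEVICES is just an illustrative example of the kind of process-wide state that leaks between tests.

```python
# Sketch: restore leaked state with an autouse fixture instead of papering
# over the failure with reruns or skips.
import os
import pytest

@pytest.fixture(autouse=True)
def restore_cuda_visible_devices():
    saved = os.environ.get("CUDA_VISIBLE_DEVICES")
    yield  # the test runs here and may mutate the variable
    if saved is None:
        os.environ.pop("CUDA_VISIBLE_DEVICES", None)
    else:
        os.environ["CUDA_VISIBLE_DEVICES"] = saved
```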
Success criteria
After this is working, pytest-randomly should still find latent test-suite bugs, but those bugs should arrive as prepared issues or small PRs instead of surprise failures in unrelated development work.
A good first milestone would be:
- Normal PR and local test runs no longer reorder tests unexpectedly.
- A nightly randomized workflow runs successfully on a small matrix.
- Failures automatically produce issues with reproduction details.
- At least one order-dependent flake is fixed from an agent-generated diagnosis or PR.