
[Pitch] Nightly pytest-randomly flake finder + AI fixer #2088

@rwgk

Description

Generated with Cursor GPT-5.5 1M High


1. Problem

pytest-randomly is valuable because it surfaces hidden order dependencies, leaked global state, missing cleanup, and other test-suite bugs that normal deterministic runs miss. The problem is where that signal appears today: in the middle of human-driven development and CI work that is usually about something else.

When an order-dependent failure appears opportunistically, it is rarely actionable in the moment. A developer is pulled away from the change they were trying to make, CI reruns become the default response, and the team spends time deciding whether a failure is "real" before anyone can even start fixing it. In other words, pytest-randomly is doing useful chaos testing, but it is delivering the interrupt to the wrong place.

The goal is to keep the bug-finding value while moving the noise out of the normal developer path.

2. Proposed tool

Move pytest-randomly from the human-driven test environment into a dedicated nightly flake-finding workflow, then attach an AI agent to that workflow.

The nightly job would run the relevant CUDA Python test suites with randomized ordering across a controlled matrix and multiple seeds. When a failure occurs, an agent would automatically:

  1. Collect the failing workflow run, job, matrix entry, pytest seed, logs, and artifacts.
  2. Reproduce or narrow the failure using the same seed and a smaller subset of tests.
  3. Diagnose the likely order dependency or leaked state.
  4. Create a GitHub issue with the failure signature, reproduction command, seed, suspected cause, and affected tests.
  5. Open a PR when the fix is straightforward.
  6. Continue on PR review feedback when reviewers leave comments, while allowing a human to take over if local GPU debugging is more practical.

This makes pytest-randomly a scheduled maintenance signal rather than a surprise interruption. Humans still review fixes, but they should only enter the loop after the agent has already produced the useful context: reproducible seed, minimized test order when possible, and either a proposed patch or a clearly scoped issue.

3. Metric it moves

Primary metric: fewer random-order failures reaching human-driven CI and local development.

Useful supporting metrics:

  • Number of pytest-randomly failures detected nightly.
  • Percentage of failures with an automatically minimized reproduction.
  • Percentage of failures where the agent opens a PR.
  • Median time from nightly failure to issue creation.
  • Median time from nightly failure to merged fix.
  • Reduction in CI reruns caused by suspected test-order flakes.
  • Developer-reported interruptions from unrelated randomized test failures.
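
The time-based metrics above reduce to one aggregation over per-failure records. A minimal sketch, assuming each nightly failure is tracked as a dict with ISO-8601 timestamps (the record shape and key names are illustrative, not an existing schema):

```python
from datetime import datetime
from statistics import median

def median_hours(records, start_key, end_key):
    """Median hours between two timestamps, over records that have both."""
    deltas = [
        (datetime.fromisoformat(r[end_key]) - datetime.fromisoformat(r[start_key]))
        .total_seconds() / 3600
        for r in records
        if r.get(start_key) and r.get(end_key)
    ]
    return median(deltas) if deltas else None
```

The same function covers both medians in the list, e.g. `median_hours(records, "failed_at", "issue_at")` and `median_hours(records, "failed_at", "merged_at")`.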

4. Rough effort

Medium.

The first phase is small: disable randomized test reordering in normal developer workflows, or remove pytest-randomly from default test dependency groups and install it only in the nightly job.

The CI work is moderate but should reuse existing pieces. The repository already has scheduled CI, repeated test runs, and reusable wheel-test workflows. The agent integration is the larger part: it needs access to workflow logs, a reproducible command format, issue/PR creation, and guardrails about when it should stop and ask for human help.

Existing CI hooks to reuse

There are several relevant pieces already in the repo:

  • .github/workflows/ci.yml already runs on a daily schedule at midnight UTC.
  • Scheduled ci.yml runs already pass nruns: 5 into the wheel test jobs, using the existing pytest-repeat plumbing.
  • .github/workflows/test-wheel-linux.yml and .github/workflows/test-wheel-windows.yml already expose reusable inputs such as build-type, matrix_filter, and nruns.
  • ci/test-matrix.yml already separates pull-request and nightly matrices, although the current nightly entries are empty.
  • ci/tools/run-tests currently invokes pytest with --randomly-dont-reorganize for cuda.bindings and cuda.core, which means the project is already partially opting out of randomized ordering even while pytest-randomly is installed.

PR #1987 is also directly relevant. It adds a new .github/workflows/ci-nightly.yml orchestrator for optional-dependency testing, scheduled at 2 AM UTC, and extends the wheel-test workflows so they can download wheels from the latest successful main CI run and select test modes via the existing matrix/filter machinery.

If PR #1987 lands, the pytest-randomly work should probably build on the same pattern rather than inventing a parallel structure. The clean options are:

  1. Add a nightly-random-order mode to the new ci-nightly.yml flow.
  2. Create a sibling nightly workflow that reuses the same lookup-run-id, artifact download, test-mode, and matrix-filter conventions.

Either way, the important design point is to avoid rebuilding wheels just to hunt random-order flakes. Use known-good artifacts from the latest successful main run, spend the nightly budget on randomized test execution, and keep the normal PR path deterministic.

Rollout plan

Phase 1: Make normal runs deterministic

Remove random-order behavior from local and PR-oriented paths. Depending on what is least disruptive, this could mean either:

  • Remove pytest-randomly from default developer test dependency groups and install it only in the nightly workflow.
  • Keep the dependency but make --randomly-dont-reorganize the default for normal test entry points.
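
The second option is a one-line configuration change, assuming pytest reads its settings from `pyproject.toml` (a sketch, not the project's actual config):

```toml
# pyproject.toml — keep pytest-randomly installed, but make deterministic
# ordering the default for every normal test entry point.
[tool.pytest.ini_options]
addopts = "--randomly-dont-reorganize"
```

The nightly job can then restore randomization by overriding the option (for example `pytest -o addopts=""`), while the first option avoids the override entirely by not installing the plugin outside the nightly dependency group.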

The key point is that developers should not be forced to debug latent order dependencies while trying to land unrelated work.

Phase 2: Add the nightly randomized workflow

Add a scheduled workflow that runs the relevant test suites with pytest-randomly enabled and with explicit seed reporting. Start with a focused matrix to keep signal high and runner cost controlled, then expand after the loop proves useful.

Initial scope could be:

  • cuda.pathfinder, cuda.bindings, and cuda.core tests.
  • Linux first, then Windows once log capture and reproduction are solid.
  • A small number of seeds per matrix entry.
  • Reuse nruns or add a clearer randomly-seeds input so failures can be tied to exact seeds rather than only to repeated runs.
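
A hypothetical `randomly-seeds` driver could be as simple as one pytest invocation per explicit seed, so every failure is tied to a reproducible ordering rather than to an anonymous repeat. A sketch (the function names and input shape are assumptions):

```python
import subprocess
import sys

def seed_commands(seeds, pytest_args=()):
    """Build one pytest invocation per seed, with the seed pinned explicitly."""
    return [
        [sys.executable, "-m", "pytest", f"--randomly-seed={seed}", *pytest_args]
        for seed in seeds
    ]

def run_all(seeds, pytest_args=()):
    """Run every seed even after a failure; return the seeds that failed."""
    failed = []
    for seed, cmd in zip(seeds, seed_commands(seeds, pytest_args)):
        if subprocess.run(cmd).returncode != 0:
            failed.append(seed)
    return failed
```

Continuing past the first failure matters here: a nightly run that stops at the first bad seed under-reports how flaky the suite actually is.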

Phase 3: Attach the AI agent

Run an agent after nightly failures. The agent should:

  • Fetch the failing workflow run and logs.
  • Extract the failing matrix, package, test path, seed, and first useful traceback.
  • Rerun with the same seed, then bisect or minimize the test order when feasible.
  • Identify leaked state such as environment variables, global module state, CUDA context/device state, monkeypatches, random seeds, temporary files, or process-wide configuration.
  • Prefer fixes that restore test isolation over fixes that hide the failure with reruns, sleeps, or broad skips.
  • Open an issue when diagnosis is useful but a fix is not safe.
  • Open a PR when the fix is scoped and reviewable.

Phase 4: Review loop and ownership

Treat agent-created PRs like normal engineering work. Reviewers can leave comments for the agent to address, but they can also take over immediately when a failure requires hands-on GPU debugging or broader test-design judgment.

Issue labels should make the queue easy to triage, for example test-flake, pytest-randomly, and package-specific labels such as cuda-core or cuda-bindings.

Guardrails

  • Do not use pytest-rerunfailures as the fix for random-order failures unless the underlying failure is genuinely external.
  • Do not silence failures with broad skips.
  • Every issue should include the seed, failing job URL, matrix entry, and reproduction command.
  • Every PR should explain the leaked state or order dependency it fixes.
  • Start with a bounded nightly budget and expand only when the signal-to-noise ratio is good.
  • Keep normal PR CI deterministic, except for explicit opt-in debugging jobs.
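
The "every issue should include..." guardrail is easiest to enforce with a single template the agent must go through. A hypothetical formatter (field names are illustrative):

```python
def format_flake_issue(seed, job_url, matrix_entry, repro_command,
                       suspected_cause=None):
    """Build an issue body that always carries the required repro fields."""
    lines = [
        f"Seed: {seed}",
        f"Failing job: {job_url}",
        f"Matrix entry: {matrix_entry}",
        f"Reproduction: {repro_command}",
    ]
    if suspected_cause:
        lines.append(f"Suspected cause: {suspected_cause}")
    return "\n".join(lines)
```

Routing all issue creation through one such function makes the guardrail structural rather than a review-time checklist item.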

Success criteria

After this is working, pytest-randomly should still find latent test-suite bugs, but those bugs should arrive as prepared issues or small PRs instead of surprise failures in unrelated development work.

A good first milestone would be:

  • Normal PR and local test runs no longer reorder tests unexpectedly.
  • A nightly randomized workflow runs successfully on a small matrix.
  • Failures automatically produce issues with reproduction details.
  • At least one order-dependent flake is fixed from an agent-generated diagnosis or PR.

Metadata

Assignees: none

Labels: CI/CD (CI/CD infrastructure), test (Improvements or additions to tests), triage (Needs the team's attention)
