
[Pitch] Nightly pytest-randomly flake finder + AI fixer #2088

@rwgk

Description

Generated with Cursor GPT-5.5 1M High


1. Problem

pytest-randomly is valuable because it surfaces hidden order dependencies, leaked global state, missing cleanup, and other test-suite bugs that normal deterministic runs miss. The problem is where that signal appears today: in the middle of human-driven development and CI work that is usually about something else.

When an order-dependent failure appears opportunistically, it is rarely actionable in the moment. A developer is pulled away from the change they were trying to make, CI reruns become the default response, and the team spends time deciding whether a failure is "real" before anyone can even start fixing it. In other words, pytest-randomly is doing useful chaos testing, but it is delivering the interrupt to the wrong place.

The goal is to keep the bug-finding value while moving the noise out of the normal developer path.

2. Proposed tool

Move pytest-randomly from the human-driven test environment into a dedicated nightly flake-finding workflow, then attach an AI agent to that workflow.

The nightly job would run the relevant CUDA Python test suites with randomized ordering across a controlled matrix and multiple seeds. When a failure occurs, an agent would automatically:

  1. Collect the failing workflow run, job, matrix entry, pytest seed, logs, and artifacts.
  2. Reproduce or narrow the failure using the same seed and a smaller subset of tests.
  3. Diagnose the likely order dependency or leaked state.
  4. Create a GitHub issue with the failure signature, reproduction command, seed, suspected cause, and affected tests.
  5. Open a PR when the fix is straightforward.
  6. Continue on PR review feedback when reviewers leave comments, while allowing a human to take over if local GPU debugging is more practical.

This makes pytest-randomly a scheduled maintenance signal rather than a surprise interruption. Humans still review fixes, but they should only enter the loop after the agent has already produced the useful context: reproducible seed, minimized test order when possible, and either a proposed patch or a clearly scoped issue.

3. Metric it moves

Primary metric: fewer random-order failures reaching human-driven CI and local development.

Useful supporting metrics:

  • Number of pytest-randomly failures detected nightly.
  • Percentage of failures with an automatically minimized reproduction.
  • Percentage of failures where the agent opens a PR.
  • Median time from nightly failure to issue creation.
  • Median time from nightly failure to merged fix.
  • Reduction in CI reruns caused by suspected test-order flakes.
  • Developer-reported interruptions from unrelated randomized test failures.
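
The time-based metrics above reduce to one aggregation over per-failure records. A minimal sketch, assuming each nightly failure is tracked as a dict with ISO-8601 timestamps (the record shape and key names are illustrative, not an existing schema):

```python
from datetime import datetime
from statistics import median

def median_hours(records, start_key, end_key):
    """Median hours between two timestamps, over records that have both."""
    deltas = [
        (datetime.fromisoformat(r[end_key]) - datetime.fromisoformat(r[start_key]))
        .total_seconds() / 3600
        for r in records
        if r.get(start_key) and r.get(end_key)
    ]
    return median(deltas) if deltas else None
```

The same function covers both medians in the list, e.g. `median_hours(records, "failed_at", "issue_at")` and `median_hours(records, "failed_at", "merged_at")`.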

4. Rough effort

Medium.

The first phase is small: disable randomized test reordering in normal developer workflows, or remove pytest-randomly from default test dependency groups and install it only in the nightly job.

The CI work is moderate but should reuse existing pieces. The repository already has scheduled CI, repeated test runs, and reusable wheel-test workflows. The agent integration is the larger part: it needs access to workflow logs, a reproducible command format, issue/PR creation, and guardrails about when it should stop and ask for human help.

Existing CI hooks to reuse

There are several relevant pieces already in the repo:

  • .github/workflows/ci.yml already runs on a daily schedule at midnight UTC.
  • Scheduled ci.yml runs already pass nruns: 5 into the wheel test jobs, using the existing pytest-repeat plumbing.
  • .github/workflows/test-wheel-linux.yml and .github/workflows/test-wheel-windows.yml already expose reusable inputs such as build-type, matrix_filter, and nruns.
  • ci/test-matrix.yml already separates pull-request and nightly matrices, although the current nightly entries are empty.
  • ci/tools/run-tests currently invokes pytest with --randomly-dont-reorganize for cuda.bindings and cuda.core, which means the project is already partially opting out of randomized ordering even while pytest-randomly is installed.

PR #1987 is also directly relevant. It adds a new .github/workflows/ci-nightly.yml orchestrator for optional-dependency testing, scheduled at 2 AM UTC, and extends the wheel-test workflows so they can download wheels from the latest successful main CI run and select test modes via the existing matrix/filter machinery.

If PR #1987 lands, the pytest-randomly work should probably build on the same pattern rather than inventing a parallel structure. The clean options are:

  1. Add a nightly-random-order mode to the new ci-nightly.yml flow.
  2. Create a sibling nightly workflow that reuses the same lookup-run-id, artifact download, test-mode, and matrix-filter conventions.

Either way, the important design point is to avoid rebuilding wheels just to hunt random-order flakes. Use known-good artifacts from the latest successful main run, spend the nightly budget on randomized test execution, and keep the normal PR path deterministic.

Rollout plan

Phase 1: Make normal runs deterministic

Remove random-order behavior from local and PR-oriented paths. Depending on what is least disruptive, this could mean either:

  • Remove pytest-randomly from default developer test dependency groups and install it only in the nightly workflow.
  • Keep the dependency but make --randomly-dont-reorganize the default for normal test entry points.
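
The second option is a one-line configuration change, assuming pytest reads its settings from `pyproject.toml` (a sketch, not the project's actual config):

```toml
# pyproject.toml — keep pytest-randomly installed, but make deterministic
# ordering the default for every normal test entry point.
[tool.pytest.ini_options]
addopts = "--randomly-dont-reorganize"
```

The nightly job can then restore randomization by overriding the option (for example `pytest -o addopts=""`), while the first option avoids the override entirely by not installing the plugin outside the nightly dependency group.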

The key point is that developers should not be forced to debug latent order dependencies while trying to land unrelated work.

Phase 2: Add the nightly randomized workflow

Add a scheduled workflow that runs the relevant test suites with pytest-randomly enabled and with explicit seed reporting. Start with a focused matrix to keep signal high and runner cost controlled, then expand after the loop proves useful.

Initial scope could be:

  • cuda.pathfinder, cuda.bindings, and cuda.core tests.
  • Linux first, then Windows once log capture and reproduction are solid.
  • A small number of seeds per matrix entry.
  • Reuse nruns or add a clearer randomly-seeds input so failures can be tied to exact seeds rather than only to repeated runs.
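
A hypothetical `randomly-seeds` driver could be as simple as one pytest invocation per explicit seed, so every failure is tied to a reproducible ordering rather than to an anonymous repeat. A sketch (the function names and input shape are assumptions):

```python
import subprocess
import sys

def seed_commands(seeds, pytest_args=()):
    """Build one pytest invocation per seed, with the seed pinned explicitly."""
    return [
        [sys.executable, "-m", "pytest", f"--randomly-seed={seed}", *pytest_args]
        for seed in seeds
    ]

def run_all(seeds, pytest_args=()):
    """Run every seed even after a failure; return the seeds that failed."""
    failed = []
    for seed, cmd in zip(seeds, seed_commands(seeds, pytest_args)):
        if subprocess.run(cmd).returncode != 0:
            failed.append(seed)
    return failed
```

Continuing past the first failure matters here: a nightly run that stops at the first bad seed under-reports how flaky the suite actually is.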

Phase 3: Attach the AI agent

Run an agent after nightly failures. The agent should:

  • Fetch the failing workflow run and logs.
  • Extract the failing matrix, package, test path, seed, and first useful traceback.
  • Rerun with the same seed, then bisect or minimize the test order when feasible.
  • Identify leaked state such as environment variables, global module state, CUDA context/device state, monkeypatches, random seeds, temporary files, or process-wide configuration.
  • Prefer fixes that restore test isolation over fixes that hide the failure with reruns, sleeps, or broad skips.
  • Open an issue when diagnosis is useful but a fix is not safe.
  • Open a PR when the fix is scoped and reviewable.

Phase 4: Review loop and ownership

Treat agent-created PRs like normal engineering work. Reviewers can leave comments for the agent to address, but they can also take over immediately when a failure requires hands-on GPU debugging or broader test-design judgment.

Issue labels should make the queue easy to triage, for example test-flake, pytest-randomly, and package-specific labels such as cuda-core or cuda-bindings.

Guardrails

  • Do not use pytest-rerunfailures as the fix for random-order failures unless the underlying failure is genuinely external.
  • Do not silence failures with broad skips.
  • Every issue should include the seed, failing job URL, matrix entry, and reproduction command.
  • Every PR should explain the leaked state or order dependency it fixes.
  • Start with a bounded nightly budget and expand only when the signal-to-noise ratio is good.
  • Keep normal PR CI deterministic, except for explicit opt-in debugging jobs.
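
The "every issue should include..." guardrail is easiest to enforce with a single template the agent must go through. A hypothetical formatter (field names are illustrative):

```python
def format_flake_issue(seed, job_url, matrix_entry, repro_command,
                       suspected_cause=None):
    """Build an issue body that always carries the required repro fields."""
    lines = [
        f"Seed: {seed}",
        f"Failing job: {job_url}",
        f"Matrix entry: {matrix_entry}",
        f"Reproduction: {repro_command}",
    ]
    if suspected_cause:
        lines.append(f"Suspected cause: {suspected_cause}")
    return "\n".join(lines)
```

Routing all issue creation through one such function makes the guardrail structural rather than a review-time checklist item.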

Success criteria

After this is working, pytest-randomly should still find latent test-suite bugs, but those bugs should arrive as prepared issues or small PRs instead of surprise failures in unrelated development work.

A good first milestone would be:

  • Normal PR and local test runs no longer reorder tests unexpectedly.
  • A nightly randomized workflow runs successfully on a small matrix.
  • Failures automatically produce issues with reproduction details.
  • At least one order-dependent flake is fixed from an agent-generated diagnosis or PR.

Metadata

Assignees: none

Labels: CI/CD (CI/CD infrastructure), test (Improvements or additions to tests), triage (Needs the team's attention)
