Skip to content

ci(gpu) + test(pi07): drop temp push trigger + scope CUDA drain (cleanup of #285)#286

Merged
shuheng-liu merged 1 commit into
mainfrom
claude/fix-gpu-tests-TStWF
May 8, 2026
Merged

ci(gpu) + test(pi07): drop temp push trigger + scope CUDA drain (cleanup of #285)#286
shuheng-liu merged 1 commit into
mainfrom
claude/fix-gpu-tests-TStWF

Conversation

@shuheng-liu
Copy link
Copy Markdown
Member

What this does

Cleanup left by #285's squash-merge.

#285 went through with the temporary gpu_test.yml push trigger I had added on the fix branch to manually dispatch GPU CI without API auth. The maintainer marked the PR ready and squash-merged it before my follow-up cleanup commit landed, so the trigger leaked into main. This PR drops it.

While I'm here, also addressing the claude[bot] nit from the original review (PR #285):
tests/policies/test_pi07_low_level.py — the pre-init CUDA drain (set_device(0) / synchronize() / empty_cache()) was at function scope, so it ran even when an earlier test in the session had already initialized the PG. In that case NCCL is already bound to whichever device that test picked and re-pinning here can't unbind it, so the drain only does useful work on the path where we own the init. Move it inside the if not already_initialized: block alongside the actual init_process_group call. Comment is updated to spell out the asymmetry.

No functional change to GPU behavior — the GPU CI run on dd42b69 (Run Pytest on GPU job 74911614535, 13min, success) already confirmed the underlying fix works.

How it was tested

  • pre-commit run --files tests/policies/test_pi07_low_level.py .github/workflows/gpu_test.yml — all hooks pass.
  • pytest -m "not gpu" -n auto tests/policies/test_pi07_cpu.py tests/policies/test_pi07_low_level.py — 69 passed, no regressions.
  • The gpu_test.yml change is trigger-only; the gpu-test job body is unchanged so a re-run on the nightly cron (or manual workflow_dispatch) still exercises the same fix that already passed on dd42b69.

How to checkout & try? (for the reviewer)

git fetch origin claude/fix-gpu-tests-TStWF
git checkout claude/fix-gpu-tests-TStWF
git diff main -- .github/workflows/gpu_test.yml tests/policies/test_pi07_low_level.py
pytest -m "not gpu" -n auto tests/policies/test_pi07_cpu.py tests/policies/test_pi07_low_level.py

Checklist

  • I have added Google-style docstrings to important functions and ensured function parameters are typed.
  • My PR includes policy-related changes.
    • If the above is checked: I have run the GPU pytests (pytest -m "gpu") and regression tests.

Generated by Claude Code

Cleanup left by #285's squash-merge:
- Remove the gpu_test.yml `push` trigger I added to dispatch GPU CI
  on the fix branch — the run on dd42b69 already confirmed the fix.
- Address the claude[bot] nit on #285: scope the pre-init CUDA drain
  (set_device + synchronize + empty_cache) to the path where we own
  the init. When a prior test had already initialized the PG, NCCL
  is already bound to that device and re-pinning here can't unbind
  it, so the drain is only meaningful when we own init.
@shuheng-liu shuheng-liu added the test label May 8, 2026 — with Claude
@shuheng-liu shuheng-liu self-assigned this May 8, 2026
@shuheng-liu shuheng-liu marked this pull request as ready for review May 8, 2026 19:21
@shuheng-liu shuheng-liu merged commit 22063d2 into main May 8, 2026
5 of 6 checks passed
@shuheng-liu shuheng-liu deleted the claude/fix-gpu-tests-TStWF branch May 8, 2026 19:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant