ci: add cu12/cu13 matrix to GPU unit tests by ivanbasov · Pull Request #36 · NVIDIA/Ising-Decoding

ivanbasov · 2026-04-01T23:33:44Z

Summary

Expand gpu-tests to a 3 × 2 matrix (Python 3.11/3.12/3.13 × cu124/cu130) so both CUDA 12.x and CUDA 13.0 PyTorch wheels are exercised on every push
Add code/requirements_public_gpu_cu12.txt and code/requirements_public_gpu_cu13.txt — each inherits training deps and pins the matching cupy-cuda12x / cupy-cuda13x wheel for zero-copy DLPack GPU transfers
TORCH_CUDA, VENV_DIR, and REQ_FILE are namespaced per matrix cell to prevent venv collisions between cu12 and cu13 runs
CPU unit tests (ci.yml) are unchanged — the cpu PyTorch wheel is CUDA-version-agnostic, so a cu12/cu13 matrix there adds noise with no signal
mid-gpu-tests and gpu-coverage are also unchanged (each uses a single canonical environment)
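
As a rough illustration of the per-cell namespacing, the sketch below enumerates the matrix and derives the three values for each cell. The venv naming follows the `.venv_train_<python>_<tag>` pattern shown in the test plan. The helper name `cell_env` and the rule that maps a wheel tag to a requirements file are assumptions made for this example, not the workflow's actual logic.

```python
from itertools import product

PYTHON_VERSIONS = ["3.11", "3.12", "3.13"]
TORCH_CUDA_TAGS = ["cu124", "cu130"]  # wheel tags as listed in the summary


def cell_env(python_version: str, torch_cuda: str) -> dict[str, str]:
    """Derive per-cell variables (hypothetical helper, for illustration only)."""
    # Assumption: the requirements file is chosen by CUDA major version,
    # i.e. the first four characters of the wheel tag ("cu12" or "cu13").
    cuda_major = torch_cuda[:4]
    return {
        "TORCH_CUDA": torch_cuda,
        "VENV_DIR": f".venv_train_{python_version}_{torch_cuda}",
        "REQ_FILE": f"code/requirements_public_gpu_{cuda_major}.txt",
    }


cells = [cell_env(py, cu) for py, cu in product(PYTHON_VERSIONS, TORCH_CUDA_TAGS)]

# Every cell needs its own venv directory, otherwise cu12 and cu13 runs collide.
assert len({c["VENV_DIR"] for c in cells}) == len(cells)

for c in cells:
    print(c["VENV_DIR"], c["REQ_FILE"])
```

Including both the Python version and the wheel tag in the directory name is what keeps the six environments distinct.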

Test plan

Verify all 6 gpu / py* / cu* jobs pass on the PR branch
Confirm venv paths don't collide (.venv_train_3.12_cu124 vs .venv_train_3.12_cu130)
Check cupy import succeeds inside each venv on the GPU runner
Confirm CPU unit-tests matrix is unaffected
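
The cupy import check could be scripted along these lines. This is a minimal sketch, not part of the PR: `check_gpu_stack` is a hypothetical helper, and it assumes a cupy release whose arrays implement the DLPack protocol so that `torch.from_dlpack` can consume them directly. On a host without a GPU, or without the packages installed, it records the failure instead of raising.

```python
def check_gpu_stack() -> dict:
    """Report whether torch and cupy import and share a buffer via DLPack."""
    report = {"torch": False, "cupy": False, "zero_copy": False}
    try:
        import torch
        report["torch"] = True

        import cupy
        report["cupy"] = True

        # Allocate on the GPU with cupy, then view the same buffer from torch.
        x = cupy.arange(4, dtype=cupy.float32)
        t = torch.from_dlpack(x)

        # Zero-copy means both objects point at the same device address.
        report["zero_copy"] = t.data_ptr() == x.data.ptr
    except Exception as exc:  # ImportError, or CUDA errors on GPU-less hosts
        report["error"] = f"{type(exc).__name__}: {exc}"
    return report


if __name__ == "__main__":
    print(check_gpu_stack())
```

Running it once inside each venv on the GPU runner would cover the import check and confirm that the DLPack path shares memory rather than copying.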

🤖 Generated with Claude Code


ivanbasov · 2026-04-02T22:13:11Z

It seems that valid PyTorch CUDA 12.x wheel tags are cu121, cu124, cu128 (cu126 is not published).
Switching to 128 then.


bmhowe23 · 2026-04-02T22:15:30Z

> It seems that valid PyTorch CUDA 12.x wheel tags are cu121, cu124, cu128 (cu126 is not published). Switching to 128 then.

Which version of torch are we using? The ones that I see support 12.6:

ivanbasov · 2026-04-02T22:43:45Z

> Which version of torch are we using? The ones that I see support 12.6:

cu126 wheels do exist (PyTorch 2.6+) and would work on pre-Blackwell GPUs. Our CI runner is an RTX Pro 6000 (Blackwell, SM_120), and Blackwell kernel images were not included until cu128. A cu126 wheel installs successfully but then hits cudaErrorNoKernelImageForDevice at runtime on this GPU, which is exactly what we saw. cu128 is the minimum needed to cover Blackwell.
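
A mismatch of this kind can be detected before any kernel launches by comparing the device's compute capability with the architectures compiled into the installed wheel. The sketch below is illustrative only: `arch_supported` is a hypothetical helper, the capability and architecture lists in the usage note are placeholder values, and PTX entries (`compute_*`) that could be JIT-compiled are deliberately ignored.

```python
def arch_supported(capability: tuple[int, int], arch_list: list[str]) -> bool:
    """Return True if arch_list holds a binary usable on the given device.

    A compiled `sm_XY` image runs on devices with the same major version
    and an equal or higher minor version.
    """
    dev_major, dev_minor = capability
    for arch in arch_list:
        if not arch.startswith("sm_"):
            continue  # skip PTX entries such as "compute_90"
        digits = arch[3:]
        if not digits.isdigit():
            continue  # skip suffixed targets such as "sm_90a"
        major, minor = int(digits[:-1]), int(digits[-1])
        if major == dev_major and minor <= dev_minor:
            return True
    return False


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None
    if torch is not None and torch.cuda.is_available():
        cap = torch.cuda.get_device_capability()
        archs = torch.cuda.get_arch_list()
        print(cap, archs, arch_supported(cap, archs))
```

For example, `arch_supported((12, 0), ["sm_80", "sm_90"])` is false, while adding `"sm_120"` to the list makes it true. Running a check like this as an early CI step would turn a late runtime error into an explicit failure.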

bmhowe23 · 2026-04-02T22:45:29Z

> > Which version of torch are we using? The ones that I see support 12.6:
>
> cu126 wheels do exist (PyTorch 2.6+) and would work on pre-Blackwell GPUs. Our CI runner is an RTX Pro 6000 (Blackwell, SM_120), and Blackwell kernel images were not included until cu128. A cu126 wheel installs successfully but then hits cudaErrorNoKernelImageForDevice at runtime on this GPU, which is exactly what we saw. cu128 is the minimum needed to cover Blackwell.

OK good point. Yes, Blackwell is only supported on >=12.8. Thanks for the explanation!

* fix(ci): disable torch.compile in orientation training to prevent segfault

  torch.compile=on combined with DataLoader spawn workers during LER validation causes a segfault (20 leaked semaphores, core dumped). Set PREDECODER_TORCH_COMPILE=0 for the Train all orientations step.

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Revert "fix(ci): disable torch.compile in orientation training to prevent segfault"

  This reverts commit 7f0f6c8.

* ci: add cu12/cu13 matrix to GPU unit tests

  Expand the gpu-tests job to a 3×2 matrix (Python 3.11/3.12/3.13 × cu124/cu130) so both CUDA 12.x and CUDA 13.x PyTorch wheels are exercised on every push. TORCH_CUDA and VENV_DIR are namespaced per matrix cell to prevent venv collisions. REQ_FILE selects the new cu-specific requirements files that add the matching cupy wheel (cupy-cuda12x / cupy-cuda13x) for zero-copy DLPack GPU transfers. CPU unit tests are unchanged; the cpu wheel is CUDA-version-agnostic.

  Signed-off-by: Ivan Basov <ibasov@nvidia.com>
  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(ci): correct venv path in multi-worker DataLoader test step

  The source command was missing _${{ matrix.torch-cuda }} suffix, so the multi-worker step would fail to activate the correct venv created by check_python_compat.sh.

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* ci: bump representative CUDA 12.x wheel from cu124 to cu126

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* ci: use cu128 instead of cu126 for CUDA 12.x wheel

  cu126 is not published by PyTorch; pip fell back to a CPU build causing cudaErrorNoKernelImageForDevice on all cu126 matrix cells. cu128 is a valid published wheel tag.

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Signed-off-by: Ivan Basov <ibasov@nvidia.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

ivanbasov and others added 3 commits April 1, 2026 16:38

Revert "fix(ci): disable torch.compile in orientation training to pre…

9f5db0f

…vent segfault" This reverts commit 7f0f6c8.

ivanbasov marked this pull request as draft April 1, 2026 23:44

fix(ci): correct venv path in multi-worker DataLoader test step

e8e3f75

The source command was missing _${{ matrix.torch-cuda }} suffix, so the multi-worker step would fail to activate the correct venv created by check_python_compat.sh. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

ivanbasov force-pushed the worktree-cu-matrix branch from 0d2b674 to e8e3f75 Compare April 1, 2026 23:54

ivanbasov requested review from bmhowe23 and kvmto April 2, 2026 00:02

ivanbasov marked this pull request as ready for review April 2, 2026 00:02

bmhowe23 reviewed Apr 2, 2026


Comment thread .github/workflows/ci-gpu.yml Outdated

ci: bump representative CUDA 12.x wheel from cu124 to cu126

8a500df

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

ci: use cu128 instead of cu126 for CUDA 12.x wheel

ea041f7

cu126 is not published by PyTorch; pip fell back to a CPU build causing cudaErrorNoKernelImageForDevice on all cu126 matrix cells. cu128 is a valid published wheel tag. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
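
Whether pip actually installed the requested build can be checked from the version string, which separates a CPU fallback from a CUDA build that merely lacks kernels for the device. The following is a sketch under stated assumptions: `wheel_tag` and `matches_requested` are hypothetical helpers, the version strings in the usage note are illustrative, and the check relies on PyTorch's convention of appending a local tag such as `+cu128` or `+cpu` to `torch.__version__`.

```python
from typing import Optional


def wheel_tag(torch_version: str) -> Optional[str]:
    """Return the local build tag of a version string, e.g. 'cu128' or 'cpu'.

    Returns None when the version carries no '+tag' suffix.
    """
    _, sep, local = torch_version.partition("+")
    return local if sep else None


def matches_requested(torch_version: str, requested: str) -> bool:
    """True if the installed build tag equals the requested TORCH_CUDA value."""
    return wheel_tag(torch_version) == requested


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None
    if torch is not None:
        # torch.version.cuda is None on CPU-only builds.
        print(torch.__version__, wheel_tag(torch.__version__), torch.version.cuda)
```

For instance, `matches_requested("2.7.0+cpu", "cu126")` is false, so a CI step asserting on it would fail at install time instead of at the first kernel launch.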

bmhowe23 approved these changes Apr 3, 2026


ivanbasov merged commit 3ce44fe into NVIDIA:main Apr 3, 2026
16 checks passed

ivanbasov deleted the worktree-cu-matrix branch April 3, 2026 00:27


ivanbasov merged 6 commits into NVIDIA:main from ivanbasov:worktree-cu-matrix

3 participants
