ci: add cu12/cu13 matrix to GPU unit tests #36

Merged: ivanbasov merged 6 commits into NVIDIA:main from ivanbasov:worktree-cu-matrix on Apr 3, 2026
@ivanbasov
Member

Summary

  • Expand gpu-tests to a 3 × 2 matrix (Python 3.11/3.12/3.13 × cu124/cu130) so both CUDA 12.x and CUDA 13.0 PyTorch wheels are exercised on every push
  • Add code/requirements_public_gpu_cu12.txt and code/requirements_public_gpu_cu13.txt — each inherits training deps and pins the matching cupy-cuda12x / cupy-cuda13x wheel for zero-copy DLPack GPU transfers
  • TORCH_CUDA, VENV_DIR, and REQ_FILE are namespaced per matrix cell to prevent venv collisions between cu12 and cu13 runs
  • CPU unit tests (ci.yml) are unchanged — the cpu PyTorch wheel is CUDA-version-agnostic, so a cu12/cu13 matrix there adds noise with no signal
  • mid-gpu-tests and gpu-coverage are also unchanged (each uses a single canonical environment)
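
The per-cell namespacing described above can be sketched in plain shell. This is only an illustration, not the workflow's actual code: the helper names `venv_dir_for` and `req_file_for` are hypothetical, and the rule that maps a wheel tag to a requirements file is an assumption based on the two file names listed in the summary.

```shell
# Hypothetical helpers; the real workflow sets VENV_DIR and REQ_FILE itself.

venv_dir_for() {
    # $1 = Python version (e.g. 3.12), $2 = PyTorch wheel tag (e.g. cu124)
    printf '.venv_train_%s_%s\n' "$1" "$2"
}

req_file_for() {
    # Assumed mapping: pick the requirements file by CUDA major version.
    case "$1" in
        cu12*) printf 'code/requirements_public_gpu_cu12.txt\n' ;;
        cu13*) printf 'code/requirements_public_gpu_cu13.txt\n' ;;
        *)     printf 'unsupported wheel tag: %s\n' "$1" >&2; return 1 ;;
    esac
}

# Walk the 3 x 2 matrix and print the names each cell would use.
for py in 3.11 3.12 3.13; do
    for cuda in cu124 cu130; do
        printf 'py%s %s -> %s %s\n' "$py" "$cuda" \
            "$(venv_dir_for "$py" "$cuda")" "$(req_file_for "$cuda")"
    done
done
```

Because both the Python version and the wheel tag are part of the directory name, two cells that share a Python version still get separate venvs.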

Test plan

  • Verify all 6 gpu / py* / cu* jobs pass on the PR branch
  • Confirm venv paths don't collide (.venv_train_3.12_cu124 vs .venv_train_3.12_cu130)
  • Check cupy import succeeds inside each venv on the GPU runner
  • Confirm CPU unit-tests matrix is unaffected
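
The collision check in the test plan could be automated along these lines. This is a rough sketch: the function name `matrix_venv_dirs` is hypothetical, and the naming pattern is inferred from the two example paths above.

```shell
# Hypothetical check: list every venv path the matrix would create
# and report any name that appears more than once.

matrix_venv_dirs() {
    for py in 3.11 3.12 3.13; do
        for cuda in cu124 cu130; do
            printf '.venv_train_%s_%s\n' "$py" "$cuda"
        done
    done
}

total=$(matrix_venv_dirs | wc -l | tr -d '[:space:]')
duplicates=$(matrix_venv_dirs | sort | uniq -d)

if [ -n "$duplicates" ]; then
    printf 'colliding venv paths:\n%s\n' "$duplicates" >&2
else
    printf '%s venv paths, no collisions\n' "$total"
fi
```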

🤖 Generated with Claude Code

ivanbasov and others added 3 commits April 1, 2026 16:38
…fault

torch.compile=on combined with DataLoader spawn workers during LER
validation causes a segfault (20 leaked semaphores, core dumped).
Set PREDECODER_TORCH_COMPILE=0 for the Train all orientations step.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
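
One way to scope such a setting to a single command is sketched below. The wrapper name `run_without_compile` is hypothetical, and the `sh -c` line is only a stand-in, since the real training command is not shown in this PR.

```shell
# Hypothetical wrapper: run one command with torch.compile disabled
# via the PREDECODER_TORCH_COMPILE variable named in the commit message.
run_without_compile() {
    env PREDECODER_TORCH_COMPILE=0 "$@"
}

# Stand-in for the training step; it only echoes the variable it sees.
run_without_compile sh -c 'echo "PREDECODER_TORCH_COMPILE=$PREDECODER_TORCH_COMPILE"'
```

Using `env` keeps the override local to the wrapped command, so later steps in the same shell are unaffected.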
Expand the gpu-tests job to a 3×2 matrix (Python 3.11/3.12/3.13 ×
cu124/cu130) so both CUDA 12.x and CUDA 13.x PyTorch wheels are
exercised on every push. TORCH_CUDA and VENV_DIR are namespaced per
matrix cell to prevent venv collisions. REQ_FILE selects the new
cu-specific requirements files that add the matching cupy wheel
(cupy-cuda12x / cupy-cuda13x) for zero-copy DLPack GPU transfers.

CPU unit tests are unchanged — the cpu wheel is CUDA-version-agnostic.

Signed-off-by: Ivan Basov <ibasov@nvidia.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@ivanbasov ivanbasov marked this pull request as draft April 1, 2026 23:44
The source command was missing _${{ matrix.torch-cuda }} suffix, so the
multi-worker step would fail to activate the correct venv created by
check_python_compat.sh.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
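
A minimal sketch of an activation step that fails loudly when the suffix is wrong or missing. It assumes the venv lives in the working directory under the `.venv_train_<python>_<tag>` pattern; the function name `activate_cell_venv` is hypothetical, and the two arguments stand in for the workflow's matrix values.

```shell
# Hypothetical helper: activate the venv for one matrix cell.
activate_cell_venv() {
    # $1 = Python version (e.g. 3.12), $2 = wheel tag (e.g. cu128)
    venv=".venv_train_${1}_${2}"
    if [ ! -f "$venv/bin/activate" ]; then
        # Without this guard a wrong path only surfaces as a later import error.
        printf 'venv not found: %s\n' "$venv" >&2
        return 1
    fi
    . "$venv/bin/activate"
}

# Example call (commented out because it needs an existing venv):
# activate_cell_venv 3.12 cu128
```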
@ivanbasov ivanbasov force-pushed the worktree-cu-matrix branch from 0d2b674 to e8e3f75 Compare April 1, 2026 23:54
@ivanbasov ivanbasov requested review from bmhowe23 and kvmto April 2, 2026 00:02
@ivanbasov ivanbasov marked this pull request as ready for review April 2, 2026 00:02
Comment thread .github/workflows/ci-gpu.yml Outdated
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@ivanbasov
Member Author

It seems that valid PyTorch CUDA 12.x wheel tags are cu121, cu124, cu128 (cu126 is not published).
Switching to 128 then.

cu126 is not published by PyTorch; pip fell back to a CPU build causing
cudaErrorNoKernelImageForDevice on all cu126 matrix cells. cu128 is a
valid published wheel tag.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
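
A guard of the following shape could catch a mismatched wheel right after installation. It is a sketch under stated assumptions: the helper names are hypothetical, the index URL follows the `https://download.pytorch.org/whl/<tag>` pattern used by PyTorch's wheel index, and the check relies on `torch.__version__` carrying a local suffix such as `+cu128` or `+cpu`.

```shell
# Hypothetical helpers for pinning and verifying the wheel tag.

torch_index_url() {
    # $1 = wheel tag, e.g. cu128
    printf 'https://download.pytorch.org/whl/%s\n' "$1"
}

wheel_matches_tag() {
    # $1 = version string reported by torch.__version__, $2 = expected tag
    case "$1" in
        *"+$2") return 0 ;;
        *)      return 1 ;;
    esac
}

torch_index_url cu128

# In CI the version string would come from the installed package, e.g.:
#   installed=$(python -c 'import torch; print(torch.__version__)')
#   wheel_matches_tag "$installed" "$TORCH_CUDA" || exit 1
```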
@bmhowe23
Collaborator

bmhowe23 commented Apr 2, 2026

It seems that valid PyTorch CUDA 12.x wheel tags are cu121, cu124, cu128 (cu126 is not published). Switching to 128 then.

Which version of torch are we using? The ones that I see support 12.6:
[image]

@ivanbasov
Member Author

Which version of torch are we using? The ones that I see support 12.6:

cu126 wheels do exist (PyTorch 2.6+) and would work on pre-Blackwell GPUs. Our CI runner is an RTX Pro 6000 (Blackwell, SM_100), and SM_100 kernel images were not included until cu128. cu126 installs successfully but then hits cudaErrorNoKernelImageForDevice at runtime on this GPU — which is exactly what we saw. cu128 is the minimum needed to cover Blackwell

@bmhowe23
Collaborator

bmhowe23 commented Apr 2, 2026

Which version of torch are we using? The ones that I see support 12.6:

cu126 wheels do exist (PyTorch 2.6+) and would work on pre-Blackwell GPUs. Our CI runner is an RTX Pro 6000 (Blackwell, SM_100), and SM_100 kernel images were not included until cu128. cu126 installs successfully but then hits cudaErrorNoKernelImageForDevice at runtime on this GPU — which is exactly what we saw. cu128 is the minimum needed to cover Blackwell

OK good point. Yes, Blackwell is only supported on >=12.8. Thanks for the explanation!
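
The failure mode discussed in this thread (a wheel that installs fine but lacks kernels for the runner's GPU) can be checked before running tests. The sketch below is simplified: `arch_supported` is a hypothetical helper, the architecture list is made up for illustration, and an exact string match ignores PTX forward compatibility. In practice the list would come from `torch.cuda.get_arch_list()` and the device architecture from `torch.cuda.get_device_capability()`.

```shell
# Hypothetical helper: does the wheel's compiled arch list contain the device arch?
arch_supported() {
    # $1 = device arch (e.g. sm_100)
    # $2 = space-separated arch list compiled into the wheel
    for arch in $2; do
        [ "$arch" = "$1" ] && return 0
    done
    return 1
}

# Illustrative values only; a real list depends on the installed wheel:
#   python -c 'import torch; print(" ".join(torch.cuda.get_arch_list()))'
wheel_archs="sm_80 sm_90"

if arch_supported sm_100 "$wheel_archs"; then
    echo "sm_100: kernels present"
else
    echo "sm_100: no kernel image in this wheel"
fi
```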

@ivanbasov ivanbasov merged commit 3ce44fe into NVIDIA:main Apr 3, 2026
16 checks passed
@ivanbasov ivanbasov deleted the worktree-cu-matrix branch April 3, 2026 00:27
ivanbasov added a commit that referenced this pull request Apr 10, 2026
* fix(ci): disable torch.compile in orientation training to prevent segfault

torch.compile=on combined with DataLoader spawn workers during LER
validation causes a segfault (20 leaked semaphores, core dumped).
Set PREDECODER_TORCH_COMPILE=0 for the Train all orientations step.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Revert "fix(ci): disable torch.compile in orientation training to prevent segfault"

This reverts commit 7f0f6c8.

* ci: add cu12/cu13 matrix to GPU unit tests

Expand the gpu-tests job to a 3×2 matrix (Python 3.11/3.12/3.13 ×
cu124/cu130) so both CUDA 12.x and CUDA 13.x PyTorch wheels are
exercised on every push. TORCH_CUDA and VENV_DIR are namespaced per
matrix cell to prevent venv collisions. REQ_FILE selects the new
cu-specific requirements files that add the matching cupy wheel
(cupy-cuda12x / cupy-cuda13x) for zero-copy DLPack GPU transfers.

CPU unit tests are unchanged — the cpu wheel is CUDA-version-agnostic.

Signed-off-by: Ivan Basov <ibasov@nvidia.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(ci): correct venv path in multi-worker DataLoader test step

The source command was missing _${{ matrix.torch-cuda }} suffix, so the
multi-worker step would fail to activate the correct venv created by
check_python_compat.sh.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* ci: bump representative CUDA 12.x wheel from cu124 to cu126

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* ci: use cu128 instead of cu126 for CUDA 12.x wheel

cu126 is not published by PyTorch; pip fell back to a CPU build causing
cudaErrorNoKernelImageForDevice on all cu126 matrix cells. cu128 is a
valid published wheel tag.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Signed-off-by: Ivan Basov <ibasov@nvidia.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
3 participants