ci: add cu12/cu13 matrix to GPU unit tests #36
…fault
torch.compile=on combined with DataLoader spawn workers during LER validation causes a segfault (20 leaked semaphores, core dumped). Set PREDECODER_TORCH_COMPILE=0 for the Train all orientations step.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…vent segfault"
This reverts commit 7f0f6c8.
Expand the gpu-tests job to a 3×2 matrix (Python 3.11/3.12/3.13 × cu124/cu130) so both CUDA 12.x and CUDA 13.x PyTorch wheels are exercised on every push. TORCH_CUDA and VENV_DIR are namespaced per matrix cell to prevent venv collisions. REQ_FILE selects the new cu-specific requirements files that add the matching cupy wheel (cupy-cuda12x / cupy-cuda13x) for zero-copy DLPack GPU transfers. CPU unit tests are unchanged; the cpu wheel is CUDA-version-agnostic.
Signed-off-by: Ivan Basov <ibasov@nvidia.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The source command was missing _${{ matrix.torch-cuda }} suffix, so the
multi-worker step would fail to activate the correct venv created by
check_python_compat.sh.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
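The per-cell venv namespacing behind this fix can be sketched as follows. This is a minimal illustration, not the workflow's actual code. The `venv_dir` helper is hypothetical, and the `.venv_train_<python>_<cuda>` naming pattern is assumed from the venv names quoted in the PR description.

```python
from itertools import product

PYTHON_VERSIONS = ["3.11", "3.12", "3.13"]
TORCH_CUDA_TAGS = ["cu124", "cu130"]  # tags named in the matrix commit message


def venv_dir(python_version: str, torch_cuda: str) -> str:
    """Hypothetical helper: build a venv directory name unique to one matrix cell."""
    return f".venv_train_{python_version}_{torch_cuda}"


# Every (Python version, CUDA tag) pair is one matrix cell.
cells = list(product(PYTHON_VERSIONS, TORCH_CUDA_TAGS))
venvs = [venv_dir(py, cu) for py, cu in cells]

# Without the CUDA suffix, both cells for a given Python version would
# resolve to the same directory, which is the collision the suffix avoids.
unsuffixed = {f".venv_train_{py}" for py, _ in cells}

print(f"{len(cells)} cells, {len(set(venvs))} distinct venvs, "
      f"{len(unsuffixed)} distinct names without the suffix")
```

A step that activates the venv has to build the path from the same two matrix values as the step that created it. Otherwise it points at a directory that does not exist.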
0d2b674 to e8e3f75
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
It seems that valid PyTorch CUDA 12.x wheel tags are cu121, cu124, cu128 (cu126 is not published).
cu126 is not published by PyTorch; pip fell back to a CPU build causing cudaErrorNoKernelImageForDevice on all cu126 matrix cells. cu128 is a valid published wheel tag.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
cu126 wheels do exist (PyTorch 2.6+) and would work on pre-Blackwell GPUs. Our CI runner is an RTX Pro 6000 (Blackwell, SM_100), and SM_100 kernel images were not included until cu128. cu126 installs successfully but then hits cudaErrorNoKernelImageForDevice at runtime on this GPU, which is exactly what we saw. cu128 is the minimum needed to cover Blackwell.
OK good point. Yes, Blackwell is only supported on >=12.8. Thanks for the explanation!
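The failure mode discussed in this thread (a wheel installs cleanly, but the GPU's architecture is missing from its compiled kernel images) could be caught by a pre-flight check before the tests run. The following is a minimal sketch under stated assumptions. `has_kernel_image` and `check_installed_torch` are hypothetical helper names. The arch lists in the example are illustrative and are not the real contents of any wheel. The check is deliberately conservative: it ignores PTX JIT fallback and architecture-specific suffixes. It relies only on `torch.cuda.get_arch_list()` and `torch.cuda.get_device_capability()`.

```python
def has_kernel_image(arch_list, capability):
    """Return True if arch_list has a native sm_XY entry for the device capability."""
    major, minor = capability
    return f"sm_{major}{minor}" in arch_list


def check_installed_torch():
    """Run the check against the installed torch build; None if it cannot be run."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return has_kernel_image(
        torch.cuda.get_arch_list(),            # e.g. ["sm_80", "sm_90", ...]
        torch.cuda.get_device_capability(),    # e.g. (9, 0)
    )


# Illustrative arch lists only (assumption, not taken from real wheels).
older_build = ["sm_80", "sm_86", "sm_90"]
newer_build = ["sm_80", "sm_86", "sm_90", "sm_100"]

print(has_kernel_image(older_build, (10, 0)))  # no native image for this device
print(has_kernel_image(newer_build, (10, 0)))  # native image present
print(check_installed_torch())                 # None when torch or a GPU is absent
```

Running a check like this at the start of a GPU job would turn a runtime `cudaErrorNoKernelImageForDevice` into an early, explicit failure.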
* fix(ci): disable torch.compile in orientation training to prevent segfault
  torch.compile=on combined with DataLoader spawn workers during LER validation causes a segfault (20 leaked semaphores, core dumped). Set PREDECODER_TORCH_COMPILE=0 for the Train all orientations step.
  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Revert "fix(ci): disable torch.compile in orientation training to prevent segfault"
  This reverts commit 7f0f6c8.
* ci: add cu12/cu13 matrix to GPU unit tests
  Expand the gpu-tests job to a 3×2 matrix (Python 3.11/3.12/3.13 × cu124/cu130) so both CUDA 12.x and CUDA 13.x PyTorch wheels are exercised on every push. TORCH_CUDA and VENV_DIR are namespaced per matrix cell to prevent venv collisions. REQ_FILE selects the new cu-specific requirements files that add the matching cupy wheel (cupy-cuda12x / cupy-cuda13x) for zero-copy DLPack GPU transfers. CPU unit tests are unchanged; the cpu wheel is CUDA-version-agnostic.
  Signed-off-by: Ivan Basov <ibasov@nvidia.com>
  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(ci): correct venv path in multi-worker DataLoader test step
  The source command was missing _${{ matrix.torch-cuda }} suffix, so the multi-worker step would fail to activate the correct venv created by check_python_compat.sh.
  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* ci: bump representative CUDA 12.x wheel from cu124 to cu126
  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* ci: use cu128 instead of cu126 for CUDA 12.x wheel
  cu126 is not published by PyTorch; pip fell back to a CPU build causing cudaErrorNoKernelImageForDevice on all cu126 matrix cells. cu128 is a valid published wheel tag.
  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---------
Signed-off-by: Ivan Basov <ibasov@nvidia.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

Summary
* Expand `gpu-tests` to a 3 × 2 matrix (Python 3.11/3.12/3.13 × `cu124`/`cu130`) so both CUDA 12.x and CUDA 13.0 PyTorch wheels are exercised on every push
* Add `code/requirements_public_gpu_cu12.txt` and `code/requirements_public_gpu_cu13.txt`; each inherits training deps and pins the matching `cupy-cuda12x` / `cupy-cuda13x` wheel for zero-copy DLPack GPU transfers
* `TORCH_CUDA`, `VENV_DIR`, and `REQ_FILE` are namespaced per matrix cell to prevent venv collisions between cu12 and cu13 runs
* CPU unit tests (`ci.yml`) are unchanged; the `cpu` PyTorch wheel is CUDA-version-agnostic, so a cu12/cu13 matrix there adds noise with no signal
* `mid-gpu-tests` and `gpu-coverage` are also unchanged (each uses a single canonical environment)

Test plan
* `gpu / py* / cu*` jobs pass on the PR branch
* […] (`.venv_train_3.12_cu124` vs. `.venv_train_3.12_cu130`)
* `unit-tests` matrix is unaffected

🤖 Generated with Claude Code