feat(ci): add multi-GPU tests and CI job for DDP validation by ivanbasov · Pull Request #37 · NVIDIA/Ising-Decoding

ivanbasov · 2026-04-01T23:39:19Z

Summary

- Add code/tests/test_multi_gpu.py with three short-running tests (skipped unless torch.cuda.device_count() >= 2):
  - NCCL all_reduce: 2 ranks each hold (rank+1), verify sum = 3.0
  - DDP forward+backward: wraps PreDecoderModelMemory_v1 (d=3) in DDP, asserts all gradients are finite after backward
  - Per-rank data generation: QCDataGeneratorTorch output lands on cuda:{rank}
- Add multi-gpu-tests job to ci-gpu.yml running on linux-amd64-gpu-rtxpro6000-latest-2 (post-merge only, 20 min timeout), with a 2-GPU DDP smoke train+inference and LER ≤ 0.35 check

Notes

mp.spawn with file:// rendezvous avoids port conflicts between parallel test runs
The smoke training step calls local_run.sh directly with GPUS=2, because smoke_run.sh hardcodes GPUS=1 and so cannot be reused
Runner label linux-amd64-gpu-rtxpro6000-latest-2 is assumed based on the existing -1 single-GPU naming pattern; confirm this label exists in the runner pool before merging, otherwise the job will queue indefinitely

Test plan

Confirm linux-amd64-gpu-rtxpro6000-latest-2 runner label exists
Verify test_multi_gpu.py skips cleanly on single-GPU CI runners (existing gpu-tests job)
multi-gpu-tests job passes on a 2-GPU runner after merge to main
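
The "skips cleanly" item can be illustrated with a small guard. This is a sketch under assumptions, not the decorator used in test_multi_gpu.py: visible_gpu_count, requires_two_gpus and TestNeedsTwoGPUs are hypothetical names, and it uses the standard-library unittest rather than whatever test runner the repository uses.

```python
import unittest


def visible_gpu_count() -> int:
    """Number of CUDA devices PyTorch can see, or 0 if PyTorch is missing."""
    try:
        import torch
    except ImportError:
        return 0
    return torch.cuda.device_count()


# Evaluated once at import time, so the whole class is skipped up front.
requires_two_gpus = unittest.skipUnless(
    visible_gpu_count() >= 2, "needs at least 2 CUDA devices")


@requires_two_gpus
class TestNeedsTwoGPUs(unittest.TestCase):
    def test_two_devices_visible(self):
        # Only reached on a multi-GPU runner.
        self.assertGreaterEqual(visible_gpu_count(), 2)
```

On a single-GPU runner the test is reported as skipped rather than failed, so the existing gpu-tests job stays green.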

🤖 Generated with Claude Code

…fault torch.compile=on combined with DataLoader spawn workers during LER validation causes a segfault (20 leaked semaphores, core dumped). Set PREDECODER_TORCH_COMPILE=0 for the Train all orientations step. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…vent segfault" This reverts commit 7f0f6c8.

- Add code/tests/test_multi_gpu.py with three test classes (skipped unless torch.cuda.device_count() >= 2):
  - TestNCCLCommunication: verifies NCCL all_reduce sum across 2 ranks
  - TestDDPForwardBackward: DDP forward+backward with PreDecoder, checks finite gradients
  - TestMultiGPUDataGenerator: QCDataGeneratorTorch places output on the correct cuda:{rank} device per rank
  Uses mp.spawn with file:// rendezvous to avoid port conflicts.
- Add multi-gpu-tests job to ci-gpu.yml:
  - Runs on linux-amd64-gpu-rtxpro6000-latest-2 (2-GPU runner)
  - Post-merge only (if: main + needs: gpu-tests), 20 min timeout
  - Verifies >=2 GPUs are visible before proceeding
  - Runs test_multi_gpu.py then a 2-GPU DDP smoke train+inference via local_run.sh with GPUS=2 (smoke_run.sh hardcodes GPUS=1)
  - LER check <= 0.35 matches the existing gpu-tests threshold

Signed-off-by: Ivan Basov <ibasov@nvidia.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- Apply yapf formatting (blank lines before top-level functions, assert formatting) to test_multi_gpu.py
- Remove `if: github.ref == 'refs/heads/main'` from multi-gpu-tests so the job appears in PR CI checks (was invisible on pull-request branches)

Signed-off-by: Ivan Basov <ibasov@nvidia.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

check_ler_from_log.py looks for [LER Validation] lines, which are emitted during training, not inference. The multi-gpu-tests check was incorrectly pointing at the inference log.

Signed-off-by: Ivan Basov <ibasov@nvidia.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
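
To make the failure mode concrete, here is a rough sketch of a log check of this kind. It is not the repository's check_ler_from_log.py: the function names are hypothetical, and the log line format (the LER being the last number after the [LER Validation] tag) is an assumption for illustration only.

```python
import re
from typing import Iterable, Optional

# Matches plain and scientific-notation numbers such as 0.28 or 3e-2.
_FLOAT = re.compile(r"[-+]?(?:\d+\.\d*|\.\d+|\d+)(?:[eE][-+]?\d+)?")


def last_ler(lines: Iterable[str],
             tag: str = "[LER Validation]") -> Optional[float]:
    """Return the LER from the last tagged line, or None if no line matches."""
    ler = None
    for line in lines:
        if tag not in line:
            continue
        # Assumption: the LER is the last number following the tag.
        numbers = _FLOAT.findall(line.split(tag, 1)[1])
        if numbers:
            ler = float(numbers[-1])
    return ler


def check_ler(lines: Iterable[str], threshold: float = 0.35) -> bool:
    """True only if a tagged line was found and its LER is within threshold."""
    ler = last_ler(lines)
    return ler is not None and ler <= threshold
```

With a parser like this, a log that contains no [LER Validation] lines (such as an inference log) can never pass the check, which is why the step has to read the training log.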

bmhowe23 · 2026-04-02T01:02:32Z

Thank you, @ivanbasov. @kvmto - do you know if the tests in this PR would've caught the multi-GPU integration issue you previously mentioned today (with respect to cuStabilizer)?

kvmto · 2026-04-02T09:34:21Z

> Thank you, @ivanbasov. @kvmto - do you know if the tests in this PR would've caught the multi-GPU integration issue you previously mentioned today (with respect to cuStabilizer)?

This overlaps with what I found. I would merge it.

kvmto

Thanks!


ivanbasov and others added 4 commits March 30, 2026 11:54

Revert "fix(ci): disable torch.compile in orientation training to prevent segfault"

9d3fa08

This reverts commit 7f0f6c8.

ivanbasov marked this pull request as draft April 1, 2026 23:46

ivanbasov requested review from bmhowe23 and kvmto April 2, 2026 00:12

ivanbasov marked this pull request as ready for review April 2, 2026 00:35

kvmto approved these changes Apr 2, 2026

kvmto merged commit 2c6f22e into NVIDIA:main Apr 2, 2026
13 checks passed

ivanbasov mentioned this pull request Apr 7, 2026

Need to add multi-GPU testing to our regular tests #35

Closed

bmhowe23 linked an issue Apr 7, 2026 that may be closed by this pull request

Need to add multi-GPU testing to our regular tests #35

Closed

kvmto merged 5 commits into NVIDIA:main from ivanbasov:worktree-multi-gpu
