feat(ci): add multi-GPU tests and CI job for DDP validation #37

Merged: kvmto merged 5 commits into NVIDIA:main from ivanbasov:worktree-multi-gpu on Apr 2, 2026

@ivanbasov (Member)

Summary

  • Add code/tests/test_multi_gpu.py with three short-running tests (skip unless torch.cuda.device_count() >= 2):
    • NCCL all_reduce: 2 ranks each hold (rank+1), verify sum = 3.0
    • DDP forward+backward: wraps PreDecoderModelMemory_v1 (d=3) in DDP, asserts all gradients finite after backward
    • Per-rank data generation: QCDataGeneratorTorch output lands on cuda:{rank}
  • Add multi-gpu-tests job to ci-gpu.yml running on linux-amd64-gpu-rtxpro6000-latest-2 (post-merge only, 20 min timeout), with a 2-GPU DDP smoke train+inference and LER ≤ 0.35 check
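
The first two tests listed above could look roughly like the following. This is a minimal sketch under stated assumptions, not the contents of test_multi_gpu.py: the class and helper names are hypothetical, and a small `torch.nn.Linear` stands in for PreDecoderModelMemory_v1, whose constructor is not shown in this PR.

```python
import os
import tempfile
import unittest

try:
    import torch
    import torch.distributed as dist
    import torch.multiprocessing as mp
    from torch.nn.parallel import DistributedDataParallel as DDP
    NUM_GPUS = torch.cuda.device_count()
except ImportError:  # lets the module import (and skip) where PyTorch is absent
    torch = None
    NUM_GPUS = 0

WORLD_SIZE = 2


def _setup(rank, init_file):
    """Bind this process to cuda:{rank} and join the NCCL process group."""
    torch.cuda.set_device(rank)
    dist.init_process_group(
        backend="nccl",
        init_method=f"file://{init_file}",
        rank=rank,
        world_size=WORLD_SIZE,
    )


def _all_reduce_worker(rank, init_file):
    _setup(rank, init_file)
    try:
        value = torch.tensor([float(rank + 1)], device=f"cuda:{rank}")
        dist.all_reduce(value, op=dist.ReduceOp.SUM)
        assert value.item() == 3.0  # 1.0 + 2.0 across the two ranks
    finally:
        dist.destroy_process_group()


def _ddp_worker(rank, init_file):
    _setup(rank, init_file)
    try:
        # Stand-in model (assumption); the PR wraps PreDecoderModelMemory_v1.
        model = DDP(torch.nn.Linear(8, 4).to(f"cuda:{rank}"), device_ids=[rank])
        loss = model(torch.randn(16, 8, device=f"cuda:{rank}")).sum()
        loss.backward()
        for param in model.parameters():
            assert param.grad is not None
            assert torch.isfinite(param.grad).all()
    finally:
        dist.destroy_process_group()


def _spawn(worker):
    """Run `worker` on WORLD_SIZE processes with a fresh rendezvous file."""
    with tempfile.TemporaryDirectory() as tmp:
        init_file = os.path.join(tmp, "rendezvous")
        mp.spawn(worker, args=(init_file,), nprocs=WORLD_SIZE, join=True)


@unittest.skipUnless(NUM_GPUS >= 2, "requires at least 2 CUDA devices")
class MultiGPUSketch(unittest.TestCase):
    def test_nccl_all_reduce(self):
        _spawn(_all_reduce_worker)

    def test_ddp_forward_backward(self):
        _spawn(_ddp_worker)
```

On a single-GPU or CPU-only machine both tests are reported as skipped rather than failed, which is the behaviour the existing gpu-tests job relies on.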

Notes

  • mp.spawn with file:// rendezvous avoids port conflicts between parallel test runs
  • The smoke training step calls local_run.sh directly with GPUS=2; smoke_run.sh hardcodes GPUS=1, so it cannot be reused
  • Runner label linux-amd64-gpu-rtxpro6000-latest-2 is assumed based on the existing -1 single-GPU naming pattern; confirm this label exists in the runner pool before merging, otherwise the job will queue indefinitely
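
To illustrate the rendezvous note: a `file://` init method needs only a path that no other run is using, so giving every test run its own temporary directory removes the shared resource that a fixed TCP port would be. The helper below is a hypothetical sketch (the name `file_rendezvous` is not from this PR) that uses only the standard library.

```python
import contextlib
import shutil
import tempfile
from pathlib import Path


@contextlib.contextmanager
def file_rendezvous():
    """Yield a unique file:// init_method string and clean it up afterwards.

    The rendezvous file itself is not created here; the process group
    creates it on first use inside the fresh temporary directory.
    """
    directory = tempfile.mkdtemp(prefix="ddp_rendezvous_")
    try:
        yield f"file://{Path(directory) / 'store'}"
    finally:
        shutil.rmtree(directory, ignore_errors=True)
```

The yielded string would be passed as `init_method` to `torch.distributed.init_process_group` in each spawned rank; two test runs on the same host then never contend for the same rendezvous.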

Test plan

  • Confirm linux-amd64-gpu-rtxpro6000-latest-2 runner label exists
  • Verify test_multi_gpu.py skips cleanly on single-GPU CI runners (existing gpu-tests job)
  • multi-gpu-tests job passes on a 2-GPU runner after merge to main

🤖 Generated with Claude Code

ivanbasov and others added 4 commits March 30, 2026 11:54
…fault

torch.compile=on combined with DataLoader spawn workers during LER
validation causes a segfault (20 leaked semaphores, core dumped).
Set PREDECODER_TORCH_COMPILE=0 for the Train all orientations step.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add code/tests/test_multi_gpu.py with three test classes (skipped
  unless torch.cuda.device_count() >= 2):
  - TestNCCLCommunication: verifies NCCL all_reduce sum across 2 ranks
  - TestDDPForwardBackward: DDP forward+backward with PreDecoder, checks
    finite gradients
  - TestMultiGPUDataGenerator: QCDataGeneratorTorch places output on the
    correct cuda:{rank} device per rank
  Uses mp.spawn with file:// rendezvous to avoid port conflicts.

- Add multi-gpu-tests job to ci-gpu.yml:
  - Runs on linux-amd64-gpu-rtxpro6000-latest-2 (2-GPU runner)
  - Post-merge only (if: main + needs: gpu-tests), 20 min timeout
  - Verifies >=2 GPUs are visible before proceeding
  - Runs test_multi_gpu.py then a 2-GPU DDP smoke train+inference via
    local_run.sh with GPUS=2 (smoke_run.sh hardcodes GPUS=1)
  - LER check <= 0.35 matches the existing gpu-tests threshold

Signed-off-by: Ivan Basov <ibasov@nvidia.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
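
The "verifies >=2 GPUs are visible" step is not shown in this conversation. One possible way to do such a preflight check, sketched here as an assumption rather than as the workflow's actual step, is to count the device lines printed by `nvidia-smi -L`. The function names are hypothetical.

```python
import re
import subprocess

# `nvidia-smi -L` prints one line per device, e.g. "GPU 0: <name> (UUID: ...)".
_GPU_LINE = re.compile(r"^GPU \d+:", re.MULTILINE)


def count_gpus(listing: str) -> int:
    """Count device lines in the text output of `nvidia-smi -L`."""
    return len(_GPU_LINE.findall(listing))


def require_gpus(minimum: int = 2) -> int:
    """Exit with an error message unless at least `minimum` GPUs are visible."""
    listing = subprocess.run(
        ["nvidia-smi", "-L"], check=True, capture_output=True, text=True
    ).stdout
    found = count_gpus(listing)
    if found < minimum:
        raise SystemExit(f"expected >= {minimum} GPUs, found {found}")
    return found
```

Failing fast here keeps a mislabelled single-GPU runner from silently skipping every multi-GPU test and reporting a green job.
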
- Apply yapf formatting (blank lines before top-level functions, assert
  formatting) to test_multi_gpu.py
- Remove `if: github.ref == 'refs/heads/main'` from multi-gpu-tests so
  the job appears in PR CI checks (was invisible on pull-request branches)

Signed-off-by: Ivan Basov <ibasov@nvidia.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
ivanbasov marked this pull request as draft April 1, 2026 23:46
check_ler_from_log.py looks for [LER Validation] lines which are
emitted during training, not inference. Was incorrectly pointing at
the inference log.

Signed-off-by: Ivan Basov <ibasov@nvidia.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
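
The fix above matters because a checker that only reads `[LER Validation]` lines has nothing to inspect in a log that never emits them. The sketch below shows the idea; it is not check_ler_from_log.py. The line format (`LER = <number>` after the tag) and the choice to compare the last reported value against the threshold are assumptions made for illustration.

```python
import re

LER_TAG = "[LER Validation]"
# Hypothetical format: "<tag> ... LER = 0.29" (or "LER: 0.29").
_LER_VALUE = re.compile(r"LER\s*[=:]\s*([0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?)")


def extract_lers(log_text):
    """Return every LER value found on a tagged line, in log order."""
    values = []
    for line in log_text.splitlines():
        _, tag, rest = line.partition(LER_TAG)
        if not tag:
            continue  # untagged line, e.g. anything in an inference log
        match = _LER_VALUE.search(rest)
        if match:
            values.append(float(match.group(1)))
    return values


def check_ler(log_text, threshold=0.35):
    """True if the last reported LER is within the threshold.

    Raises ValueError when the log has no tagged lines, so pointing the
    check at the wrong log fails loudly instead of passing vacuously.
    """
    values = extract_lers(log_text)
    if not values:
        raise ValueError("no [LER Validation] lines found; is this the training log?")
    return values[-1] <= threshold
```
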
ivanbasov requested review from bmhowe23 and kvmto April 2, 2026 00:12
ivanbasov marked this pull request as ready for review April 2, 2026 00:35
bmhowe23 (Collaborator) commented Apr 2, 2026

Thank you, @ivanbasov. @kvmto - do you know if the tests in this PR would've caught the multi-GPU integration issue you previously mentioned today (with respect to cuStabilizer)?

kvmto (Collaborator) commented Apr 2, 2026

> Thank you, @ivanbasov. @kvmto - do you know if the tests in this PR would've caught the multi-GPU integration issue you previously mentioned today (with respect to cuStabilizer)?

This overlaps with what I found. I would merge it.

kvmto (Collaborator) left a comment

Thanks!

kvmto merged commit 2c6f22e into NVIDIA:main Apr 2, 2026
13 checks passed
bmhowe23 linked an issue Apr 7, 2026 that may be closed by this pull request
ivanbasov added a commit that referenced this pull request Apr 10, 2026
* fix(ci): disable torch.compile in orientation training to prevent segfault

torch.compile=on combined with DataLoader spawn workers during LER
validation causes a segfault (20 leaked semaphores, core dumped).
Set PREDECODER_TORCH_COMPILE=0 for the Train all orientations step.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Revert "fix(ci): disable torch.compile in orientation training to prevent segfault"

This reverts commit 7f0f6c8.

* feat(ci): add multi-GPU tests and CI job for DDP validation

- Add code/tests/test_multi_gpu.py with three test classes (skipped
  unless torch.cuda.device_count() >= 2):
  - TestNCCLCommunication: verifies NCCL all_reduce sum across 2 ranks
  - TestDDPForwardBackward: DDP forward+backward with PreDecoder, checks
    finite gradients
  - TestMultiGPUDataGenerator: QCDataGeneratorTorch places output on the
    correct cuda:{rank} device per rank
  Uses mp.spawn with file:// rendezvous to avoid port conflicts.

- Add multi-gpu-tests job to ci-gpu.yml:
  - Runs on linux-amd64-gpu-rtxpro6000-latest-2 (2-GPU runner)
  - Post-merge only (if: main + needs: gpu-tests), 20 min timeout
  - Verifies >=2 GPUs are visible before proceeding
  - Runs test_multi_gpu.py then a 2-GPU DDP smoke train+inference via
    local_run.sh with GPUS=2 (smoke_run.sh hardcodes GPUS=1)
  - LER check <= 0.35 matches the existing gpu-tests threshold

Signed-off-by: Ivan Basov <ibasov@nvidia.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(ci): fix YAPF formatting and show multi-gpu-tests on PRs

- Apply yapf formatting (blank lines before top-level functions, assert
  formatting) to test_multi_gpu.py
- Remove `if: github.ref == 'refs/heads/main'` from multi-gpu-tests so
  the job appears in PR CI checks (was invisible on pull-request branches)

Signed-off-by: Ivan Basov <ibasov@nvidia.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(ci): check training log for LER in multi-gpu-tests

check_ler_from_log.py looks for [LER Validation] lines which are
emitted during training, not inference. Was incorrectly pointing at
the inference log.

Signed-off-by: Ivan Basov <ibasov@nvidia.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Signed-off-by: Ivan Basov <ibasov@nvidia.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
ivanbasov added a commit that referenced this pull request Apr 10, 2026

Development

Successfully merging this pull request may close these issues.

Need to add multi-GPU testing to our regular tests

3 participants