Skip to content

[2.8] Respect visible GPUs in resource manager#4563

Merged
YuanTingHsieh merged 2 commits into
NVIDIA:2.8from
YuanTingHsieh:codex/respect-visible-gpus-poc
May 11, 2026
Merged

[2.8] Respect visible GPUs in resource manager#4563
YuanTingHsieh merged 2 commits into
NVIDIA:2.8from
YuanTingHsieh:codex/respect-visible-gpus-poc

Conversation

@YuanTingHsieh
Copy link
Copy Markdown
Collaborator

@YuanTingHsieh YuanTingHsieh commented May 8, 2026

Summary

  • Restrict GPUResourceManager host validation to integer GPU IDs selected by CUDA_VISIBLE_DEVICES.
  • Preserve CUDA documented semantics for unset, empty, and invalid integer CUDA_VISIBLE_DEVICES entries.
  • Add unit coverage for mixed-memory GPUs, selected GPU IDs, empty visibility, and invalid-index stopping behavior.

Why

POC clients started with nvflare poc start -gpu ... receive CUDA_VISIBLE_DEVICES, but GPUResourceManager was validating memory against every physical GPU returned by nvidia-smi. On hosts with a small display GPU plus large training GPUs, the startup check failed on the unused display GPU.

Validation

  • python3 -m compileall -q nvflare/app_common/resource_managers/gpu_resource_manager.py tests/unit_test/app_common/resource_managers/gpu_resource_manager_test.py
  • git diff --check
  • Manual stubbed GPUResourceManager checks for mixed-memory visible GPU scenarios

@YuanTingHsieh YuanTingHsieh changed the title [codex] Respect visible GPUs in resource manager [2.8] Respect visible GPUs in resource manager May 8, 2026
@YuanTingHsieh YuanTingHsieh marked this pull request as ready for review May 8, 2026 22:14
Copilot AI review requested due to automatic review settings May 8, 2026 22:14
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 8, 2026

Greptile Summary

This PR restricts GPUResourceManager's host validation to only the GPUs visible via CUDA_VISIBLE_DEVICES, fixing a startup failure on hosts with mixed-memory GPUs (e.g., a small display GPU alongside large training GPUs) when nvflare poc start -gpu is used.

  • Adds _get_cuda_visible_device_indices to parse integer GPU IDs from CUDA_VISIBLE_DEVICES, preserving CUDA semantics for the unset, empty, and invalid-integer cases; UUID/MIG values and NoDevFiles intentionally fall back to using all host GPUs (documented in the docstring).
  • Changes resource_gpu_ids to use actual GPU IDs (e.g., [4]) rather than sequential 0-based indices when a visible-GPU subset is active, so downstream reserved_resources keys reflect real nvidia-smi IDs.
  • Adds eight new unit tests covering mixed-memory, selected-ID, empty-visibility, out-of-range-index stopping, and missing-memory-entry error paths.

Confidence Score: 5/5

Safe to merge; the change is narrowly scoped to the host validation path during initialization and introduces no regressions in the resource check/reserve/release lifecycle.

The new _get_cuda_visible_device_indices helper correctly mirrors CUDA documented behaviour for all targeted cases. Resource keys now carry real GPU IDs rather than 0-based indices when a visible subset is active, which is the intentional design change. The existing parametrized tests still pass because the class-level mock returns sequential IDs and CUDA_VISIBLE_DEVICES is cleared between tests. Known limitations are documented in the function docstring and are outside the declared scope of this PR.

No files require special attention.

Important Files Changed

Filename Overview
nvflare/app_common/resource_managers/gpu_resource_manager.py Adds _get_cuda_visible_device_indices and _get_managed_host_gpu_ids helpers; refactors GPUResourceManager.init to scope host validation and resource keys to visible GPUs only. Logic is correct for all declared cases; the non-integer UUID/NoDevFiles fall-through to all host GPUs is documented.
tests/unit_test/app_common/resource_managers/gpu_resource_manager_test.py Adds eight targeted tests for the new CUDA_VISIBLE_DEVICES logic; updates the class-level fixture to also mock get_host_gpu_memory_total, and adds a function-scoped clear_cuda_visible_devices autouse fixture to prevent env leakage between tests.
docs/programming_guide/resource_manager_and_consumer.rst Documentation updated to reflect that the startup check is now restricted to CUDA_VISIBLE_DEVICES-selected GPUs; wording is accurate and consistent with the new implementation.

Reviews (4): Last reviewed commit: "Merge branch '2.8' into codex/respect-vi..." | Re-trigger Greptile

Comment thread nvflare/app_common/resource_managers/gpu_resource_manager.py Outdated
Comment thread nvflare/app_common/resource_managers/gpu_resource_manager.py Outdated
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates GPUResourceManager initialization to respect CUDA_VISIBLE_DEVICES when validating and selecting GPU resources, avoiding host-level validation failures on GPUs that are not visible/selected for the current process (e.g., a small display GPU). It also adds unit tests and updates documentation to reflect the new behavior.

Changes:

  • Parse CUDA_VISIBLE_DEVICES (integer GPU indices) and restrict managed GPU IDs and startup memory/count validation to those visible IDs.
  • Adjust GPU memory validation to check only selected/managed GPUs rather than all physical GPUs.
  • Add unit tests covering mixed-memory GPUs, selected GPU IDs, empty visibility, and invalid-index stopping behavior; update docs accordingly.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

File Description
nvflare/app_common/resource_managers/gpu_resource_manager.py Parse CUDA_VISIBLE_DEVICES and restrict managed GPU IDs + init-time host validation to visible/selected GPUs.
tests/unit_test/app_common/resource_managers/gpu_resource_manager_test.py Add fixtures and unit tests validating the new CUDA visibility behavior and edge cases.
docs/programming_guide/resource_manager_and_consumer.rst Document that initialization checks are restricted to GPUs specified by CUDA_VISIBLE_DEVICES.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread docs/programming_guide/resource_manager_and_consumer.rst Outdated
Comment thread nvflare/app_common/resource_managers/gpu_resource_manager.py Outdated
Comment thread nvflare/app_common/resource_managers/gpu_resource_manager.py Outdated
Comment thread nvflare/app_common/resource_managers/gpu_resource_manager.py
@YuanTingHsieh YuanTingHsieh force-pushed the codex/respect-visible-gpus-poc branch from 1455b03 to 6a8add1 Compare May 8, 2026 22:27
@YuanTingHsieh YuanTingHsieh force-pushed the codex/respect-visible-gpus-poc branch from 6a8add1 to 8eab747 Compare May 8, 2026 22:31
@YuanTingHsieh YuanTingHsieh merged commit a54949c into NVIDIA:2.8 May 11, 2026
24 checks passed
@YuanTingHsieh YuanTingHsieh deleted the codex/respect-visible-gpus-poc branch May 11, 2026 20:10
YuanTingHsieh added a commit to YuanTingHsieh/NVFlare that referenced this pull request May 13, 2026
## Summary
- Restrict GPUResourceManager host validation to integer GPU IDs
selected by CUDA_VISIBLE_DEVICES.
- Preserve CUDA documented semantics for unset, empty, and invalid
integer CUDA_VISIBLE_DEVICES entries.
- Add unit coverage for mixed-memory GPUs, selected GPU IDs, empty
visibility, and invalid-index stopping behavior.

## Why
POC clients started with `nvflare poc start -gpu ...` receive
`CUDA_VISIBLE_DEVICES`, but GPUResourceManager was validating memory
against every physical GPU returned by nvidia-smi. On hosts with a small
display GPU plus large training GPUs, the startup check failed on the
unused display GPU.

## Validation
- `python3 -m compileall -q
nvflare/app_common/resource_managers/gpu_resource_manager.py
tests/unit_test/app_common/resource_managers/gpu_resource_manager_test.py`
- `git diff --check`
- Manual stubbed GPUResourceManager checks for mixed-memory visible GPU
scenarios

(cherry picked from commit a54949c)
pcnudde pushed a commit that referenced this pull request May 13, 2026
## Summary

Port the selected 2.8 fixes back to `main` in 2.8 merge order:

- #4528 Add warnings for missing study data mappings
- #4538 Update deploy prepare launcher docs
- #4550 Align `Run.get_result()` with the `clean_up` parameter spelling
- #4561 Clarify `remove_client` token cleanup semantics
- #4563 Respect `CUDA_VISIBLE_DEVICES` in the GPU resource manager
- #4574 Fix Docker SJ workspace tmpfs permissions
- #4576 Narrow client failure reporting for generic launcher execution
errors
- #4583 Fix tracking recipe integration test

---------

Signed-off-by: YuanTingHsieh <yuantingh@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants