[2.8] Respect visible GPUs in resource manager by YuanTingHsieh · Pull Request #4563 · NVIDIA/NVFlare

YuanTingHsieh · 2026-05-08T22:08:01Z

Summary

Restrict GPUResourceManager host validation to integer GPU IDs selected by CUDA_VISIBLE_DEVICES.
Preserve CUDA documented semantics for unset, empty, and invalid integer CUDA_VISIBLE_DEVICES entries.
Add unit coverage for mixed-memory GPUs, selected GPU IDs, empty visibility, and invalid-index stopping behavior.

Why

POC clients started with nvflare poc start -gpu ... receive CUDA_VISIBLE_DEVICES, but GPUResourceManager was validating memory against every physical GPU returned by nvidia-smi. On hosts with a small display GPU plus large training GPUs, the startup check failed on the unused display GPU.

Validation

python3 -m compileall -q nvflare/app_common/resource_managers/gpu_resource_manager.py tests/unit_test/app_common/resource_managers/gpu_resource_manager_test.py
git diff --check
Manual stubbed GPUResourceManager checks for mixed-memory visible GPU scenarios

greptile-apps · 2026-05-08T22:17:46Z

Greptile Summary

This PR restricts GPUResourceManager's host validation to only the GPUs visible via CUDA_VISIBLE_DEVICES, fixing a startup failure on hosts with mixed-memory GPUs (e.g., a small display GPU alongside large training GPUs) when nvflare poc start -gpu is used.

Adds _get_cuda_visible_device_indices to parse integer GPU IDs from CUDA_VISIBLE_DEVICES, preserving CUDA semantics for the unset, empty, and invalid-integer cases; UUID/MIG values and NoDevFiles intentionally fall back to using all host GPUs (documented in the docstring).
Changes resource_gpu_ids to use actual GPU IDs (e.g., [4]) rather than sequential 0-based indices when a visible-GPU subset is active, so downstream reserved_resources keys reflect real nvidia-smi IDs.
Adds eight new unit tests covering mixed-memory, selected-ID, empty-visibility, out-of-range-index stopping, and missing-memory-entry error paths.

Confidence Score: 5/5

Safe to merge; the change is narrowly scoped to the host validation path during initialization and introduces no regressions in the resource check/reserve/release lifecycle.

The new _get_cuda_visible_device_indices helper correctly mirrors CUDA documented behaviour for all targeted cases. Resource keys now carry real GPU IDs rather than 0-based indices when a visible subset is active, which is the intentional design change. The existing parametrized tests still pass because the class-level mock returns sequential IDs and CUDA_VISIBLE_DEVICES is cleared between tests. Known limitations are documented in the function docstring and are outside the declared scope of this PR.

No files require special attention.

Important Files Changed

Filename	Overview
nvflare/app_common/resource_managers/gpu_resource_manager.py	Adds _get_cuda_visible_device_indices and _get_managed_host_gpu_ids helpers; refactors GPUResourceManager.init to scope host validation and resource keys to visible GPUs only. Logic is correct for all declared cases; the non-integer UUID/NoDevFiles fall-through to all host GPUs is documented.
tests/unit_test/app_common/resource_managers/gpu_resource_manager_test.py	Adds eight targeted tests for the new CUDA_VISIBLE_DEVICES logic; updates the class-level fixture to also mock get_host_gpu_memory_total, and adds a function-scoped clear_cuda_visible_devices autouse fixture to prevent env leakage between tests.
docs/programming_guide/resource_manager_and_consumer.rst	Documentation updated to reflect that the startup check is now restricted to CUDA_VISIBLE_DEVICES-selected GPUs; wording is accurate and consistent with the new implementation.

_{Reviews (4): Last reviewed commit: "Merge branch '2.8' into codex/respect-vi..." | Re-trigger Greptile}

Copilot

Pull request overview

This PR updates GPUResourceManager initialization to respect CUDA_VISIBLE_DEVICES when validating and selecting GPU resources, avoiding host-level validation failures on GPUs that are not visible/selected for the current process (e.g., a small display GPU). It also adds unit tests and updates documentation to reflect the new behavior.

Changes:

Parse CUDA_VISIBLE_DEVICES (integer GPU indices) and restrict managed GPU IDs and startup memory/count validation to those visible IDs.
Adjust GPU memory validation to check only selected/managed GPUs rather than all physical GPUs.
Add unit tests covering mixed-memory GPUs, selected GPU IDs, empty visibility, and invalid-index stopping behavior; update docs accordingly.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

File	Description
`nvflare/app_common/resource_managers/gpu_resource_manager.py`	Parse `CUDA_VISIBLE_DEVICES` and restrict managed GPU IDs + init-time host validation to visible/selected GPUs.
`tests/unit_test/app_common/resource_managers/gpu_resource_manager_test.py`	Add fixtures and unit tests validating the new CUDA visibility behavior and edge cases.
`docs/programming_guide/resource_manager_and_consumer.rst`	Document that initialization checks are restricted to GPUs specified by `CUDA_VISIBLE_DEVICES`.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

## Summary - Restrict GPUResourceManager host validation to integer GPU IDs selected by CUDA_VISIBLE_DEVICES. - Preserve CUDA documented semantics for unset, empty, and invalid integer CUDA_VISIBLE_DEVICES entries. - Add unit coverage for mixed-memory GPUs, selected GPU IDs, empty visibility, and invalid-index stopping behavior. ## Why POC clients started with `nvflare poc start -gpu ...` receive `CUDA_VISIBLE_DEVICES`, but GPUResourceManager was validating memory against every physical GPU returned by nvidia-smi. On hosts with a small display GPU plus large training GPUs, the startup check failed on the unused display GPU. ## Validation - `python3 -m compileall -q nvflare/app_common/resource_managers/gpu_resource_manager.py tests/unit_test/app_common/resource_managers/gpu_resource_manager_test.py` - `git diff --check` - Manual stubbed GPUResourceManager checks for mixed-memory visible GPU scenarios (cherry picked from commit a54949c)

## Summary Port the selected 2.8 fixes back to `main` in 2.8 merge order: - #4528 Add warnings for missing study data mappings - #4538 Update deploy prepare launcher docs - #4550 Align `Run.get_result()` with the `clean_up` parameter spelling - #4561 Clarify `remove_client` token cleanup semantics - #4563 Respect `CUDA_VISIBLE_DEVICES` in the GPU resource manager - #4574 Fix Docker SJ workspace tmpfs permissions - #4576 Narrow client failure reporting for generic launcher execution errors - #4583 Fix tracking recipe integration test --------- Signed-off-by: YuanTingHsieh <yuantingh@nvidia.com>

YuanTingHsieh changed the title ~~[codex] Respect visible GPUs in resource manager~~ [2.8] Respect visible GPUs in resource manager May 8, 2026

YuanTingHsieh marked this pull request as ready for review May 8, 2026 22:14

Copilot AI review requested due to automatic review settings May 8, 2026 22:14

Copilot started reviewing on behalf of YuanTingHsieh May 8, 2026 22:15 View session

greptile-apps Bot reviewed May 8, 2026

View reviewed changes

Comment thread nvflare/app_common/resource_managers/gpu_resource_manager.py Outdated

Comment thread nvflare/app_common/resource_managers/gpu_resource_manager.py Outdated

Copilot AI reviewed May 8, 2026

View reviewed changes

YuanTingHsieh force-pushed the codex/respect-visible-gpus-poc branch from 1455b03 to 6a8add1 Compare May 8, 2026 22:27

Respect visible GPUs in resource manager

8eab747

YuanTingHsieh force-pushed the codex/respect-visible-gpus-poc branch from 6a8add1 to 8eab747 Compare May 8, 2026 22:31

Merge branch '2.8' into codex/respect-visible-gpus-poc

9d83a44

pcnudde approved these changes May 11, 2026

View reviewed changes

YuanTingHsieh merged commit a54949c into NVIDIA:2.8 May 11, 2026
24 checks passed

YuanTingHsieh deleted the codex/respect-visible-gpus-poc branch May 11, 2026 20:10

YuanTingHsieh mentioned this pull request May 13, 2026

Port selected 2.8 fixes to main #4595

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[2.8] Respect visible GPUs in resource manager#4563

[2.8] Respect visible GPUs in resource manager#4563
YuanTingHsieh merged 2 commits into
NVIDIA:2.8from
YuanTingHsieh:codex/respect-visible-gpus-poc

YuanTingHsieh commented May 8, 2026 •

edited

Loading

Uh oh!

greptile-apps Bot commented May 8, 2026 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

YuanTingHsieh commented May 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Why

Validation

Uh oh!

greptile-apps Bot commented May 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

YuanTingHsieh commented May 8, 2026 •

edited

Loading

greptile-apps Bot commented May 8, 2026 •

edited

Loading