[2.8] Respect visible GPUs in resource manager #4563
Greptile Summary
This PR restricts
Confidence Score: 5/5
Safe to merge; the change is narrowly scoped to the host validation path during initialization and introduces no regressions in the resource check/reserve/release lifecycle.
- The new `_get_cuda_visible_device_indices` helper correctly mirrors CUDA documented behaviour for all targeted cases.
- Resource keys now carry real GPU IDs rather than 0-based indices when a visible subset is active, which is the intentional design change.
- The existing parametrized tests still pass because the class-level mock returns sequential IDs and `CUDA_VISIBLE_DEVICES` is cleared between tests.
- Known limitations are documented in the function docstring and are outside the declared scope of this PR.
No files require special attention.
Important Files Changed
Reviews (4): Last reviewed commit: "Merge branch '2.8' into codex/respect-vi..."
Pull request overview
This PR updates GPUResourceManager initialization to respect CUDA_VISIBLE_DEVICES when validating and selecting GPU resources, avoiding host-level validation failures on GPUs that are not visible/selected for the current process (e.g., a small display GPU). It also adds unit tests and updates documentation to reflect the new behavior.
Changes:
- Parse `CUDA_VISIBLE_DEVICES` (integer GPU indices) and restrict managed GPU IDs and startup memory/count validation to those visible IDs.
- Adjust GPU memory validation to check only selected/managed GPUs rather than all physical GPUs.
- Add unit tests covering mixed-memory GPUs, selected GPU IDs, empty visibility, and invalid-index stopping behavior; update docs accordingly.
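The parsing rules listed above can be sketched roughly as follows. This is a minimal illustration, not the PR's `_get_cuda_visible_device_indices` helper: the function name `parse_visible_device_indices` and its `physical_count` parameter are hypothetical, and the sketch assumes integer indices only (GPU UUIDs, MIG identifiers, and duplicate entries are not handled).

```python
from typing import List, Optional


def parse_visible_device_indices(value: Optional[str], physical_count: int) -> List[int]:
    """Return the physical GPU indices selected by a CUDA_VISIBLE_DEVICES value.

    Hypothetical helper for illustration only. `value` is the raw environment
    variable (None when unset); `physical_count` is the number of GPUs on the host.
    """
    # Unset variable: every physical GPU is visible.
    if value is None:
        return list(range(physical_count))

    visible: List[int] = []
    for token in value.split(","):
        try:
            index = int(token.strip())
        except ValueError:
            # Non-integer (or empty) entry: stop, keeping only earlier entries.
            break
        if index < 0 or index >= physical_count:
            # Out-of-range index: stop, keeping only earlier entries.
            break
        visible.append(index)
    return visible
```

In a real process the first argument would typically come from `os.environ.get("CUDA_VISIBLE_DEVICES")`. An empty string yields an empty list, so no GPU is treated as visible, while an invalid entry hides itself and everything after it.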
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| nvflare/app_common/resource_managers/gpu_resource_manager.py | Parse CUDA_VISIBLE_DEVICES and restrict managed GPU IDs + init-time host validation to visible/selected GPUs. |
| tests/unit_test/app_common/resource_managers/gpu_resource_manager_test.py | Add fixtures and unit tests validating the new CUDA visibility behavior and edge cases. |
| docs/programming_guide/resource_manager_and_consumer.rst | Document that initialization checks are restricted to GPUs specified by CUDA_VISIBLE_DEVICES. |
1455b03 to 6a8add1
6a8add1 to 8eab747
## Summary
- Restrict GPUResourceManager host validation to integer GPU IDs selected by CUDA_VISIBLE_DEVICES.
- Preserve CUDA documented semantics for unset, empty, and invalid integer CUDA_VISIBLE_DEVICES entries.
- Add unit coverage for mixed-memory GPUs, selected GPU IDs, empty visibility, and invalid-index stopping behavior.

## Why
POC clients started with `nvflare poc start -gpu ...` receive `CUDA_VISIBLE_DEVICES`, but GPUResourceManager was validating memory against every physical GPU returned by nvidia-smi. On hosts with a small display GPU plus large training GPUs, the startup check failed on the unused display GPU.

## Validation
- `python3 -m compileall -q nvflare/app_common/resource_managers/gpu_resource_manager.py tests/unit_test/app_common/resource_managers/gpu_resource_manager_test.py`
- `git diff --check`
- Manual stubbed GPUResourceManager checks for mixed-memory visible GPU scenarios

(cherry picked from commit a54949c)
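To make the display-GPU failure concrete, here is a small sketch of a startup check that only inspects visible GPUs. It is an illustration under assumptions, not the GPUResourceManager code: `validate_visible_gpus`, its parameters, and the choice to return the first `num_gpus` visible IDs are all hypothetical, and the memory figures are made up for the example.

```python
from typing import Dict, List


def validate_visible_gpus(
    memory_gib_by_gpu: Dict[int, float],
    visible_ids: List[int],
    num_gpus: int,
    mem_per_gpu_gib: float,
) -> List[int]:
    """Check count and memory against visible GPUs only (hypothetical helper).

    Returns the real GPU IDs that would be managed; raises ValueError when the
    request cannot be satisfied by the visible GPUs.
    """
    if num_gpus > len(visible_ids):
        raise ValueError(
            f"requested {num_gpus} GPUs but only {len(visible_ids)} are visible"
        )
    for gpu_id in visible_ids:
        # GPUs outside visible_ids (e.g. a small display GPU) are never inspected.
        if memory_gib_by_gpu[gpu_id] < mem_per_gpu_gib:
            raise ValueError(
                f"GPU {gpu_id} has {memory_gib_by_gpu[gpu_id]} GiB, "
                f"less than the requested {mem_per_gpu_gib} GiB"
            )
    # Assumption: manage the first num_gpus visible GPUs, keyed by real GPU ID.
    return visible_ids[:num_gpus]


# Made-up host: GPU 0 is a small display GPU, GPUs 1 and 2 are training GPUs.
host_memory_gib = {0: 2.0, 1: 80.0, 2: 80.0}
managed = validate_visible_gpus(host_memory_gib, [1, 2], num_gpus=2, mem_per_gpu_gib=16.0)
```

Passing `[0, 1, 2]` as the visible IDs with the same memory request would raise on GPU 0, which corresponds to the pre-fix situation where every physical GPU was validated.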
## Summary
Port the selected 2.8 fixes back to `main` in 2.8 merge order:
- #4528 Add warnings for missing study data mappings
- #4538 Update deploy prepare launcher docs
- #4550 Align `Run.get_result()` with the `clean_up` parameter spelling
- #4561 Clarify `remove_client` token cleanup semantics
- #4563 Respect `CUDA_VISIBLE_DEVICES` in the GPU resource manager
- #4574 Fix Docker SJ workspace tmpfs permissions
- #4576 Narrow client failure reporting for generic launcher execution errors
- #4583 Fix tracking recipe integration test

---------

Signed-off-by: YuanTingHsieh <yuantingh@nvidia.com>