
Add GPU detection and auto-assign GPUs #983

Merged: 112 commits into develop from feature/auto_assign_gpus, May 1, 2023
Conversation

@shuds13 (Member) commented Mar 13, 2023

*Re-opened after merge from develop (was #980); now going straight to develop, superseding #928.*


Extension to the gpu_detect branch. Adds executor options to assign CPUs and GPUs to slots, using either the method supplied in configuration or one determined from the detected environment. This simplifies sim functions and makes them more portable.

There are different ways this could be specified in the arguments to submit. The current approach uses `auto_assign_gpus` and, optionally, `match_procs_to_gpus`, where `auto_assign_gpus` uses all GPUs assigned to the current worker. The method for setting GPUs is determined by the MPI runner, a specified platform (via the env variable LIBE_PLATFORM), or a default.
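As a minimal sketch of how these options might be used from inside a sim function: `auto_assign_gpus` and `match_procs_to_gpus` are the arguments this PR adds, while the app name "forces" and the surrounding executor calls are illustrative assumptions following the usual libEnsemble pattern (app registered in the calling script, executor retrieved via `Executor.executor`):

```python
import os

from libensemble.executors.executor import Executor

# The platform can be named explicitly instead of relying on detection.
# Typically this is exported in the batch submission script; the value
# "summit" here is illustrative.
os.environ["LIBE_PLATFORM"] = "summit"

# Inside a sim function: retrieve the executor and submit the registered app.
exctr = Executor.executor
task = exctr.submit(
    app_name="forces",         # assumed to be registered in the calling script
    auto_assign_gpus=True,     # use all GPUs assigned to the current worker
    match_procs_to_gpus=True,  # launch one MPI process per assigned GPU
)
task.wait()
```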

Checklists

Before merging:

  • Check all print statements are removed
  • Check TODO comments (and any others no longer relevant) are removed
  • Docstrings correct for new functions
  • Type hints correct for new functions
  1. Initial single-node testing (using the CUDA variable-resources regression test and forces_gpu):
  • Laptop (including dry_run mock MPI runners)
  • Spock (works with detection - uses srun options; can also work with ROCR_VISIBLE_DEVICES)
  • Crusher (works with detection - issue across GCDs is a known issue, not caused by this PR)
  • Summit (detection works)
  • Perlmutter (works with srun options given - or a system env var - and now also works with the default mpirun/CUDA_VISIBLE_DEVICES!)
  • Polaris
  • Frontier (works with detection, even if the platform is not provided)
  • Swing
  • Sunspot - currently needs GPU/worker (and MPI rank) affinity set via a bash script; a separate branch covers this
  2. Advanced testing: multiple nodes / mpi4py comms / awkward partitions:
  • Frontier
  • Summit
  • Perlmutter
  • Polaris

Summary of approach and changes (see the sketch after this list):
  • GPU setting precedence: check for known systems first (just Summit so far); then, if needed, look at the environment; then, if needed, detect (remotely or locally)
  • Default GPU setting for each MPI runner (or CUDA_VISIBLE_DEVICES)
  • mpi_executor can access platform info (obtained via the env variable LIBE_PLATFORM)
  • Adds test test_GPU_variable_resources.py
  • Adds sim_f six_hump_camel_GPU_variable_resources (checks settings)
  • Stores environment variable settings in the task
  • Adds GPU setting type GPU_SET_CLI_GPT (GPUs per task); the default for jsrun
  • Adds check_gpu_setting function in tools
  • Mocks runs for each MPI runner and checks GPU settings
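A sketch of how the new check_gpu_setting helper might be used to verify the setting applied to a task; the import path `libensemble.tools.test_support` and the argument names are assumptions based on the description above, not confirmed by this PR text:

```python
from libensemble.executors.executor import Executor
from libensemble.tools.test_support import check_gpu_setting  # assumed location

# Inside a sim function: submit with auto-assigned GPUs, then verify the
# runner-specific setting (e.g., srun/jsrun options or CUDA_VISIBLE_DEVICES)
# that was applied to the task. With assert_setting=True, a mismatch raises.
exctr = Executor.executor
task = exctr.submit(app_name="forces", auto_assign_gpus=True)
task.wait()

check_gpu_setting(task, assert_setting=True, print_setting=True)
```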
@shuds13 requested review from jmlarson1 and jlnav, April 26, 2023
Review comments on docs/platforms/frontier.rst (resolved)
@shuds13 merged commit 6f0e87c into develop, May 1, 2023 (15 checks passed)
@shuds13 moved this from In progress to In review/testing in libE Kanban, May 24, 2023
@shuds13 moved this from In review/testing to Done in libE Kanban, Jul 12, 2023
@jmlarson1 deleted the feature/auto_assign_gpus branch, July 20, 2023