
Add GPU detection and auto-assign GPUs #983

Merged: 112 commits into develop from feature/auto_assign_gpus, May 1, 2023
Conversation

@shuds13 (Member) commented Mar 13, 2023

*Re-opened after merge from develop (was #980); now going straight to develop, superseding #928.*


Extension to the gpu_detect branch. Adds executor options to assign CPUs and GPUs to slots, using either the method supplied in configuration or one determined from the detected environment. This simplifies sim functions and makes them more portable.

There are different ways this could be specified in the arguments to submit. The current approach uses `auto_assign_gpus` and, optionally, `match_procs_to_gpus`, where `auto_assign_gpus` uses all GPUs assigned to the current worker. The method for setting GPUs is determined by the MPI runner, a specified platform (via the env variable LIBE_PLATFORM), or a default.
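As a minimal sketch of how these options might be used from inside a sim function: `auto_assign_gpus` and `match_procs_to_gpus` are the arguments this PR adds, while the app name "forces" and the surrounding executor calls are illustrative assumptions following the usual libEnsemble pattern (app registered in the calling script, executor retrieved via `Executor.executor`):

```python
import os

from libensemble.executors.executor import Executor

# The platform can be named explicitly instead of relying on detection.
# Typically this is exported in the batch submission script; the value
# "summit" here is illustrative.
os.environ["LIBE_PLATFORM"] = "summit"

# Inside a sim function: retrieve the executor and submit the registered app.
exctr = Executor.executor
task = exctr.submit(
    app_name="forces",         # assumed to be registered in the calling script
    auto_assign_gpus=True,     # use all GPUs assigned to the current worker
    match_procs_to_gpus=True,  # launch one MPI process per assigned GPU
)
task.wait()
```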

Checklists

Before merging:

  • Check all print statements are removed
  • Check TODO comments (and any others no longer relevant) are removed
  • Docstrings correct for new functions
  • Type hints correct for new functions
  1. Initial single-node testing (using the CUDA variable-resources regression test and forces_gpu):
  • Laptop (including dry_run mock MPI runners)
  • Spock (works with detection - uses srun options; can also work with ROCR_VISIBLE_DEVICES)
  • Crusher (works with detection - issue across GCDs is a known issue, not caused by this PR)
  • Summit (detection works)
  • Perlmutter (works with srun options given - or a system env var - and now also works with the default mpirun/CUDA_VISIBLE_DEVICES!)
  • Polaris
  • Frontier (works with detection, even if the platform is not provided)
  • Swing
  • Sunspot - currently needs GPU/worker (and MPI rank) affinity set via a bash script; a separate branch covers this
  2. Advanced testing: multiple nodes / mpi4py comms / awkward partitions:
  • Frontier
  • Summit
  • Perlmutter
  • Polaris

Summary of approach and changes (see the sketch after this list):
  • GPU setting precedence: check for known systems first (just Summit so far); then, if needed, look at the environment; then, if needed, detect (remotely or locally)
  • Default GPU setting for each MPI runner (or CUDA_VISIBLE_DEVICES)
  • mpi_executor can access platform info (obtained via the env variable LIBE_PLATFORM)
  • Adds test test_GPU_variable_resources.py
  • Adds sim_f six_hump_camel_GPU_variable_resources (checks settings)
  • Stores environment variable settings in the task
  • Adds GPU setting type GPU_SET_CLI_GPT (GPUs per task); the default for jsrun
  • Adds check_gpu_setting function in tools
  • Mocks runs for each MPI runner and checks GPU settings
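A sketch of how the new check_gpu_setting helper might be used to verify the setting applied to a task; the import path `libensemble.tools.test_support` and the argument names are assumptions based on the description above, not confirmed by this PR text:

```python
from libensemble.executors.executor import Executor
from libensemble.tools.test_support import check_gpu_setting  # assumed location

# Inside a sim function: submit with auto-assigned GPUs, then verify the
# runner-specific setting (e.g., srun/jsrun options or CUDA_VISIBLE_DEVICES)
# that was applied to the task. With assert_setting=True, a mismatch raises.
exctr = Executor.executor
task = exctr.submit(app_name="forces", auto_assign_gpus=True)
task.wait()

check_gpu_setting(task, assert_setting=True, print_setting=True)
```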
@shuds13 requested review from jmlarson1 and jlnav, April 26, 2023
Review comments on docs/platforms/frontier.rst (resolved)
@shuds13 merged commit 6f0e87c into develop, May 1, 2023 (15 checks passed)
@shuds13 moved this from In progress to In review/testing in libE Kanban, May 24, 2023
@shuds13 moved this from In review/testing to Done in libE Kanban, Jul 12, 2023
@jmlarson1 deleted the feature/auto_assign_gpus branch, July 20, 2023