Issue with gpu-related slurm settings on Perlmutter #4834

Closed
ndkeen opened this issue Mar 16, 2022 · 6 comments · Fixed by #4928 or E3SM-Project/scream#1632
Labels: Machine Files, pm-gpu (Perlmutter machine at NERSC, GPU nodes)

ndkeen commented Mar 16, 2022

Early on with Perlmutter, the following was one documented way to submit GPU batch jobs:

#SBATCH  --nodes=1
#SBATCH  --exclusive
#SBATCH  --constraint=gpu
#SBATCH  --gpus-per-task=1
#SBATCH  --gpu-bind=map_gpu:0,1,2,3

But now (I'm not sure exactly when it started), with both stand-alone HOMME and SCREAMv1 cime attempts, I get errors like this:

1: (GTL DEBUG: 1) cuIpcOpenMemHandle: invalid argument, CUDA_ERROR_INVALID_VALUE, line no 272
3: (GTL DEBUG: 3) cuIpcOpenMemHandle: invalid argument, CUDA_ERROR_INVALID_VALUE, line no 272
0: (GTL DEBUG: 0) cuIpcOpenMemHandle: invalid argument, CUDA_ERROR_INVALID_VALUE, line no 272

However, after experimenting, I found that this works:

#SBATCH  --nodes=1
#SBATCH  --exclusive 
#SBATCH  --constraint=gpu
#SBATCH  --gpus=$np
where $np is the number of MPI ranks

which doesn't make sense to me yet, but I wanted to start a thread. I could make this change in config_batch.xml, but I prefer the original settings, as it looks like I would need to use the {{ total_tasks }} variable.
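For reference, a rough sketch of what that alternative would look like in config_batch.xml (untested; it assumes the existing gnugpu directives block and that {{ total_tasks }} expands to the MPI rank count):

    <directives compiler="gnugpu">
      <directive> --exclusive</directive>
      <directive> --constraint=gpu</directive>
      <!-- request one GPU per MPI rank in total, instead of --gpus-per-task/--gpu-bind -->
      <directive> --gpus={{ total_tasks }}</directive>
    </directives>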

For now, we can work around it by passing flags to case.submit:

./case.submit -a="--gpus-per-task=0 --gpu-bind=none --gpus=$np"

Note that we do NOT see this issue with the MMF test -- it seems to be OK with either of the above slurm settings; however, it's ~2.5x slower using the second (i.e., faster with what we have currently).

NERSC reports: "something did change in Slurm behavior, it now uses cgroups to enforce binding" and suggests using --gpu-bind=none

Ah, ok, I think this can be fixed by:

#SBATCH  --nodes=1
#SBATCH  --exclusive
#SBATCH  --constraint=gpu
#SBATCH  --gpus-per-task=1
#SBATCH  --gpu-bind=none

ndk/machinefiles/PM-gpu-bind-none

Well, it works, but this is still 2.5x slower for the MMF test I tried, so it's clearly not what we want. NERSC isn't sure how long this state will exist, so we might wait a bit.

ndkeen commented Apr 1, 2022

This is still an outstanding issue. We currently use #SBATCH --gpu-bind=map_gpu:0,1,2,3; however, this causes a runtime error with HOMME/SCREAM. After trying many things, I have only been able to get it working with #SBATCH --gpu-bind=none. That is an easy enough change; however, if we use this setting for existing MMF cases, it slows performance down by 2.5x. NERSC said they are still investigating.

This brings up an interesting point: we don't currently have a way of using different slurm settings based on the type of E3SM application. Perhaps that could be considered, as there may not always be slurm settings on a given machine that work optimally for everything we want to try.

For now, a user can change the xml file locally, or do this at submit time: ./case.submit -a="--gpu-bind=none"
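The local edit would be roughly a one-line change in the machine's gnugpu directives block (a sketch only, assuming the current layout of cime_config/machines/config_batch.xml):

    <directives compiler="gnugpu">
      <directive> --gpus-per-task=1</directive>
      <!-- replace --gpu-bind=map_gpu:0,1,2,3 with none until NERSC resolves the cgroup binding change -->
      <directive> --gpu-bind=none</directive>
    </directives>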

rljacob commented Apr 3, 2022

So sbatch settings need to be set according to what is in the compset? If the ATM has "MMF" in it, use one setting; if it has "SCREAM", use another?

ndkeen commented Apr 4, 2022

Yes, that could allow for a fix for now if we could do that (i.e., if we could test on MMF being present in the compset). But it might not be worth the trouble until NERSC has resolved the issue.

ndkeen commented Apr 27, 2022

I tried the following:

    <directives COMPSET=".%MMF." compiler="gnugpu">
      <directive> --gpus-per-task=1</directive>
      <directive> --gpu-bind=map_gpu:0,1,2,3</directive>
      <!--directive COMPSET=".%MMF."> -d-gpu-bind=map_gpu:0,1,2,3</directive-->
    </directives>

but I got error:

Batch_system_type is nersc_slurm
ERROR: Command: '/usr/bin/xmllint --xinclude --noout --schema /global/cfs/cdirs/e3sm/ndk/bio-apr11/cime/config/xml_schemas/config_batch.xsd /global/cfs/cdirs/e3sm/ndk/bio-apr11/cime_config/machines/config_batch.xml' failed with error '/global/cfs/cdirs/e3sm/ndk/bio-apr11/cime_config/machines/config_batch.xml:360: element directives: Schemas validity error : Element 'directives', attribute 'COMPSET': The attribute 'COMPSET' is not allowed.
/global/cfs/cdirs/e3sm/ndk/bio-apr11/cime_config/machines/config_batch.xml fails to validate' from dir '/global/cfs/cdirs/e3sm/ndk/bio-apr11/cime/scripts'
/pscratch/sd/n/ndk/e3sm_scratch/perlmutter/bio-apr11/f30.F2000SCREAMv1.ne30_ne30.bio-apr11.gnugpu.12s.n001a4x16.Hremap512.K00.RECe.N576.ts150.s8: No such file or directory.

Jim suggested:

if you modify cime/config/xml_schemas/config_batch.xsd, you should be able to fix this

  <xs:element name="directives">
    <xs:complexType>
      <xs:sequence>
        <xs:element maxOccurs="unbounded" ref="directive"/>
      </xs:sequence>
      <xs:attribute ref="queue"/>
      <xs:attribute name="compiler"/>
      <xs:attribute name="mpilib"/>
      <xs:attribute name="threaded" type="xs:boolean"/>
    </xs:complexType>
  </xs:element>

replace the attribute lines with
<xs:anyAttribute/>

but I haven't tried that yet
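If I follow the suggestion, the element would end up looking like this (an untested sketch of the proposed edit):

  <xs:element name="directives">
    <xs:complexType>
      <xs:sequence>
        <xs:element maxOccurs="unbounded" ref="directive"/>
      </xs:sequence>
      <!-- accept any attribute (e.g. COMPSET) as a selector, instead of the fixed list -->
      <xs:anyAttribute/>
    </xs:complexType>
  </xs:element>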

jgfouca self-assigned this May 3, 2022
jgfouca added a commit that referenced this issue May 4, 2022
Split config_pesall into component-specific config_pes

PE-layouts are picked based on active component of a case or based on
the prior config_pesall for all-active compsets.

This PR also comes with a CIME update:
To a3c94512e105ff1f21adf500fd317ac56961635e

Changes:
1) Add RUNDIR as an accessible setting in the cmake build system
2) First step in the direction of implementing async IO in CESM
3) Add numeric time-stamp to jenkins archiving
4) Update grid schema
5) Set component-specific config_pes in E3SM
6) Allow any case env to be used as a directives selector in config_batch.xml

Fixes #4834

[BFB]

* azamat/pes/split-config-pesall:
  CIME update
  Add cime update to set component-specific PES_SPEC_FILE
  Split config_pesall into component-specific config_pes
ndkeen commented May 6, 2022

I verified that with the change above to the xsd file, I can test on the compset string in config_batch.xml and get the slurm directives I wanted for SCREAM/MMF. But I need to wait for the CIME change to go in first.
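For reference, a hypothetical SCREAM-selected block illustrating the kind of selection that now validates (a sketch only; not necessarily what will end up being committed):

    <!-- hypothetical: disable GPU binding only for SCREAM compsets -->
    <directives COMPSET=".*SCREAM.*" compiler="gnugpu">
      <directive> --gpus-per-task=1</directive>
      <directive> --gpu-bind=none</directive>
    </directives>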

jgfouca closed this as completed in 25ea3db May 9, 2022
@ndkeen
Copy link
Contributor Author

ndkeen commented May 11, 2022

I don't think we can close this yet, as we still need the corresponding change that will use it for the work-around.
