Issue with gpu-related slurm settings on Perlmutter #4834

Closed
ndkeen opened this issue Mar 16, 2022 · 6 comments · Fixed by #4928 or E3SM-Project/scream#1632
Labels: Machine Files, pm-gpu (Perlmutter machine at NERSC, GPU nodes)

ndkeen commented Mar 16, 2022

Early on with Perlmutter, the following was one documented way to submit GPU batch jobs:

#SBATCH  --nodes=1
#SBATCH  --exclusive
#SBATCH  --constraint=gpu
#SBATCH  --gpus-per-task=1
#SBATCH  --gpu-bind=map_gpu:0,1,2,3

But now (I'm not sure exactly when it started), with both stand-alone HOMME and SCREAMv1 cime attempts, I get errors like this:

1: (GTL DEBUG: 1) cuIpcOpenMemHandle: invalid argument, CUDA_ERROR_INVALID_VALUE, line no 272
3: (GTL DEBUG: 3) cuIpcOpenMemHandle: invalid argument, CUDA_ERROR_INVALID_VALUE, line no 272
0: (GTL DEBUG: 0) cuIpcOpenMemHandle: invalid argument, CUDA_ERROR_INVALID_VALUE, line no 272

However, after experimenting, I found that this works:

#SBATCH  --nodes=1
#SBATCH  --exclusive 
#SBATCH  --constraint=gpu
#SBATCH  --gpus=$np
where $np is the number of MPI ranks

which doesn't make sense to me yet, but I wanted to start a thread. I could make this change in config_batch.xml, but I prefer the original settings, as it looks like I would need to use the {{ total_tasks }} variable.
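For reference, a rough sketch of what that alternative would look like in config_batch.xml (untested; it assumes the existing gnugpu directives block and that {{ total_tasks }} expands to the MPI rank count):

    <directives compiler="gnugpu">
      <directive> --exclusive</directive>
      <directive> --constraint=gpu</directive>
      <!-- request one GPU per MPI rank in total, instead of --gpus-per-task/--gpu-bind -->
      <directive> --gpus={{ total_tasks }}</directive>
    </directives>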

For now, we can work around it by passing flags to case.submit:

./case.submit -a="--gpus-per-task=0 --gpu-bind=none --gpus=$np"

Note that we do NOT see this issue with the MMF test -- it seems to be OK with either of the above slurm settings; however, it's ~2.5x slower using the second (i.e., faster with what we have currently).

NERSC reports: "something did change in Slurm behavior, it now uses cgroups to enforce binding" and suggests using --gpu-bind=none

Ah, ok, I think this can be fixed by:

#SBATCH  --nodes=1
#SBATCH  --exclusive
#SBATCH  --constraint=gpu
#SBATCH  --gpus-per-task=1
#SBATCH  --gpu-bind=none

ndk/machinefiles/PM-gpu-bind-none

Well, it works, but this is still 2.5x slower for the MMF test I tried, so it's clearly not what we want. NERSC isn't sure how long this state will exist, so we might wait a bit.

ndkeen commented Apr 1, 2022

This is still an outstanding issue. We currently use #SBATCH --gpu-bind=map_gpu:0,1,2,3; however, this causes a runtime error with HOMME/SCREAM. After trying many things, I have only been able to get it working with #SBATCH --gpu-bind=none. That is an easy enough change; however, if we use this setting for existing MMF cases, it slows performance down by 2.5x. NERSC said they are still investigating.

This brings up an interesting point: we don't currently have a way of using different slurm settings based on the type of E3SM application. Perhaps that could be considered, as there may not always be slurm settings on a given machine that work optimally for everything we want to try.

For now, a user can change the xml file locally, or do this at submit time: ./case.submit -a="--gpu-bind=none"
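The local edit would be roughly a one-line change in the machine's gnugpu directives block (a sketch only, assuming the current layout of cime_config/machines/config_batch.xml):

    <directives compiler="gnugpu">
      <directive> --gpus-per-task=1</directive>
      <!-- replace --gpu-bind=map_gpu:0,1,2,3 with none until NERSC resolves the cgroup binding change -->
      <directive> --gpu-bind=none</directive>
    </directives>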

rljacob commented Apr 3, 2022

So sbatch settings need to be set according to what is in the compset? If the ATM has "MMF" in it, use one setting; if it has "SCREAM", use another?

ndkeen commented Apr 4, 2022

Yes, that could allow for a fix for now if we could do that (i.e., if we could test on MMF being present in the compset). But it might not be worth the trouble until NERSC has resolved the issue.

ndkeen commented Apr 27, 2022

I tried the following:

    <directives COMPSET=".%MMF." compiler="gnugpu">
      <directive> --gpus-per-task=1</directive>
      <directive> --gpu-bind=map_gpu:0,1,2,3</directive>
      <!--directive COMPSET=".%MMF."> -d-gpu-bind=map_gpu:0,1,2,3</directive-->
    </directives>

but I got error:

Batch_system_type is nersc_slurm
ERROR: Command: '/usr/bin/xmllint --xinclude --noout --schema /global/cfs/cdirs/e3sm/ndk/bio-apr11/cime/config/xml_schemas/config_batch.xsd /global/cfs/cdirs/e3sm/ndk/bio-apr11/cime_config/machines/config_batch.xml' failed with error '/global/cfs/cdirs/e3sm/ndk/bio-apr11/cime_config/machines/config_batch.xml:360: element directives: Schemas validity error : Element 'directives', attribute 'COMPSET': The attribute 'COMPSET' is not allowed.
/global/cfs/cdirs/e3sm/ndk/bio-apr11/cime_config/machines/config_batch.xml fails to validate' from dir '/global/cfs/cdirs/e3sm/ndk/bio-apr11/cime/scripts'
/pscratch/sd/n/ndk/e3sm_scratch/perlmutter/bio-apr11/f30.F2000SCREAMv1.ne30_ne30.bio-apr11.gnugpu.12s.n001a4x16.Hremap512.K00.RECe.N576.ts150.s8: No such file or directory.

Jim suggested:

if you modify cime/config/xml_schemas/config_batch.xsd, you should be able to fix this

  <xs:element name="directives">
    <xs:complexType>
      <xs:sequence>
        <xs:element maxOccurs="unbounded" ref="directive"/>
      </xs:sequence>
      <xs:attribute ref="queue"/>
      <xs:attribute name="compiler"/>
      <xs:attribute name="mpilib"/>
      <xs:attribute name="threaded" type="xs:boolean"/>
    </xs:complexType>
  </xs:element>

replace the attribute lines with
<xs:anyAttribute/>

but I haven't tried that yet
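If I follow the suggestion, the element would end up looking like this (an untested sketch of the proposed edit):

  <xs:element name="directives">
    <xs:complexType>
      <xs:sequence>
        <xs:element maxOccurs="unbounded" ref="directive"/>
      </xs:sequence>
      <!-- accept any attribute (e.g. COMPSET) as a selector, instead of the fixed list -->
      <xs:anyAttribute/>
    </xs:complexType>
  </xs:element>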

jgfouca self-assigned this May 3, 2022
jgfouca added a commit that referenced this issue May 4, 2022
Split config_pesall into component-specific config_pes

PE-layouts are picked based on active component of a case or based on
the prior config_pesall for all-active compsets.

This PR also comes with a CIME update:
To a3c94512e105ff1f21adf500fd317ac56961635e

Changes:
1) Add RUNDIR as an accessible setting in the cmake build system
2) First step in the direction of implementing async IO in CESM
3) Add numeric time-stamp to jenkins archiving
4) Update grid schema
5) Set component-specific config_pes in E3SM
6) Allow any case env to be used as a directives selector in config_batch.xml

Fixes #4834

[BFB]

* azamat/pes/split-config-pesall:
  CIME update
  Add cime update to set component-specific PES_SPEC_FILE
  Split config_pesall into component-specific config_pes
ndkeen commented May 6, 2022

I verified that with the change above to the xsd file, I can test on the compset string in config_batch.xml and get the slurm directives I wanted for SCREAM/MMF. But I need to wait for the CIME change to go in first.
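For reference, a hypothetical SCREAM-selected block illustrating the kind of selection that now validates (a sketch only; not necessarily what will end up being committed):

    <!-- hypothetical: disable GPU binding only for SCREAM compsets -->
    <directives COMPSET=".*SCREAM.*" compiler="gnugpu">
      <directive> --gpus-per-task=1</directive>
      <directive> --gpu-bind=none</directive>
    </directives>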

jgfouca closed this as completed in 25ea3db May 9, 2022
@ndkeen
Copy link
Contributor Author

ndkeen commented May 11, 2022

I don't think we can close this yet, as we still need the corresponding change that will use it for the work-around.
