Frequently asked questions

What is PMIx?

PMIx (Process Management Interface for Exascale) is the interface used by Slurm to launch and wire up MPI applications; check the Slurm MPI Users Guide for details. With PMIx, Slurm is responsible for launching the tasks and mpirun is not needed.
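
For example, an MPI application is launched directly through srun rather than with mpirun (a minimal sketch; the application name mpiapp, node count and task count are illustrative):

# Slurm launches the MPI tasks itself through PMIx, no mpirun wrapper is involved:
$ srun --mpi=pmix -N2 --ntasks-per-node=1 ./mpiapp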

How do I configure pyxis for multi-node workloads through PMIx?

Make sure you configure enroot with the extra PMIx hook, as described in the enroot configuration. If it doesn't work, check the slurmd configuration.
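
For reference, a typical setup looks like the sketch below (the hook file name and paths are those of a default enroot installation and may differ on your system):

# Enable enroot's extra Slurm PMI hook (file name and paths may differ per installation):
$ sudo cp /usr/share/enroot/hooks.d/50-slurm-pmi.sh /etc/enroot/hooks.d/
# Verify that slurmd supports PMIx; "pmix" should appear in the list of MPI plugin types:
$ srun --mpi=list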

Why is MPI_Init sometimes failing under PMIx?

Under a PMIx allocation (i.e. srun --mpi=pmix), you can only call MPI_Init once. In other words, you can't have srun execute a script that launches multiple MPI applications in sequence.

Instead, you can save the container state with --container-name and then do multiple invocations of srun, one for each MPI application:

# From the login-node:
$ salloc -N2
$ srun --container-name=tf --container-image=tensorflow bash -c 'apt-get update && apt-get install -y ...'
$ srun --mpi=pmix --container-name=tf mpiapp1 ....
$ srun --mpi=pmix --container-name=tf mpiapp2 ....

Are there any known limitations when using PMIx under Slurm?

Under a PMIx allocation, you can only call MPI_Init once (see above).
In addition, MPI_Comm_spawn is known not to be available with PMIx under Slurm.

Why am I not seeing the pyxis output when using srun?

This is a known issue in older versions of Slurm when using srun --pty. We recommend using at least Slurm 20.02.5 and pyxis 0.8.1 to solve this problem.

Can I use pyxis arguments with sbatch?

You can use sbatch --container-image with pyxis 0.12 and later. It will run the sbatch script inside the container; as a consequence, you will not be able to use srun from within the containerized sbatch script.
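
For instance, a minimal sketch of a containerized batch script (the image, script name, and job options are illustrative):

$ cat job.sbatch
#!/bin/bash
#SBATCH -N1
# This whole script runs inside the container; do not call srun from here.
python3 train.py

$ sbatch --container-image=tensorflow job.sbatch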

Is there an equivalent to -p/--publish from Docker?

Enroot does not create a network namespace for the container, so you don't need to "publish" ports like with Docker: networking behaves the same as running outside the container (similar to Docker's --network=host). However, as an unprivileged user you won't be able to listen on privileged ports (ports 1 to 1023 by default).
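
As an illustration (the image, port, and hostname are arbitrary), a service started inside the container can be reached directly on the compute node's address, without any port mapping:

# Start a server inside the container on an unprivileged port:
$ srun --container-image=python:3 python3 -m http.server 8080
# From another machine, connect directly to the compute node:
$ curl http://<compute-node>:8080/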

Why am I getting errors when using the --export argument of Slurm?

For example, with ENROOT_RUNTIME_PATH ${XDG_RUNTIME_DIR}/enroot in enroot.conf:

$ srun --export NVIDIA_VISIBLE_DEVICES=0 --container-image nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi -L
pyxis: importing docker image: nvidia/cuda:12.4.1-base-ubuntu22.04
slurmstepd: error: pyxis: child 1692947 failed with error code: 1
slurmstepd: error: pyxis: failed to import docker image
slurmstepd: error: pyxis: printing enroot log file:
slurmstepd: error: pyxis:     /usr/bin/enroot: line 44: HOME: unbound variable
slurmstepd: error: pyxis:     /usr/bin/enroot: line 44: XDG_RUNTIME_DIR: unbound variable
slurmstepd: error: pyxis:     mkdir: cannot create directory '/run/enroot': Permission denied
slurmstepd: error: pyxis: couldn't start container
slurmstepd: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
slurmstepd: error: Failed to invoke spank plugin stack
slurmstepd: error: pyxis: child 1692966 failed with error code: 1

In this case, the issue is that --export will unset all other environment variables from the user environment, and only set NVIDIA_VISIBLE_DEVICES=0. It is recommended to add the ALL option when using --export:

$ srun --export ALL,NVIDIA_VISIBLE_DEVICES=0 --container-image nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi -L
pyxis: importing docker image: nvidia/cuda:12.4.1-base-ubuntu22.04
pyxis: imported docker image: nvidia/cuda:12.4.1-base-ubuntu22.04
GPU 0: NVIDIA GeForce RTX 3070 (UUID: GPU-acce903c-39ee-787e-3dbc-f1d82df43fe7)

This behavior can be surprising for users familiar with Docker, as the --export argument of Slurm does not behave like the --env argument of Docker Engine.
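
As a rough comparison (the variable name MYVAR is illustrative), Docker's --env adds a variable on top of the container's environment, while Slurm's --export replaces the exported environment unless ALL is included:

# Docker: --env adds one variable, the rest of the container environment is preserved
$ docker run --env MYVAR=42 ubuntu env
# Slurm + pyxis: keep ALL so the user environment is not discarded
$ srun --export ALL,MYVAR=42 --container-image ubuntu env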