Felix Abecassis edited this page Aug 14, 2020 · 20 revisions

enroot configuration

Pyxis uses the enroot container utility and relies on the enroot system configuration for most of its behavior. The system administrator should therefore customize the enroot configuration for their clusters, by setting environment variables and by enabling or disabling hooks.

The extra hooks are not enabled by default because they are stored in a separate directory. Copy them to the main hook directory to enable them:

$ sudo cp /usr/share/enroot/hooks.d/50-slurm-pmi.sh /usr/share/enroot/hooks.d/50-slurm-pytorch.sh /etc/enroot/hooks.d

enroot configuration (example)

Here is an example of an enroot configuration file for a cluster:

$ cat /etc/enroot/enroot.conf
ENROOT_RUNTIME_PATH /run/enroot/user-$(id -u)
ENROOT_CACHE_PATH /raid/enroot-cache/group-$(id -g)
ENROOT_DATA_PATH /tmp/enroot-data/user-$(id -u)
ENROOT_SQUASH_OPTIONS -noI -noD -noF -noX -no-duplicates
ENROOT_MOUNT_HOME n
ENROOT_RESTRICT_DEV y
ENROOT_ROOTFS_WRITABLE y
  • ENROOT_RUNTIME_PATH is the working directory for enroot. It is recommended to use a tmpfs (RAM).
  • ENROOT_CACHE_PATH is where docker layers are stored. It is recommended to use persistent local storage.
  • ENROOT_DATA_PATH is the directory where the filesystems of running containers are stored. If your compute nodes have sufficient memory, it is recommended to use a tmpfs for faster container start. Note that /tmp is not a tmpfs by default on Ubuntu, but it is one on our cluster.
  • ENROOT_SQUASH_OPTIONS controls the compression parameters for squashfs files.
    In our case we disable compression since squashfs files are used only as an intermediate image format when importing a container image.
  • ENROOT_MOUNT_HOME n disables mounting the home directories of users by default in containers. The home directory can still be mounted with --container-mount-home.
  • ENROOT_RESTRICT_DEV y isolates device files inside the container by default. This is useful if you want to allow users to set NVIDIA_VISIBLE_DEVICES so that only a subset of the GPUs is accessible inside their containers.
  • ENROOT_ROOTFS_WRITABLE y makes the containers writable by default, so that users can install additional packages if needed.
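
Since the recommendations above depend on which filesystem backs each path, it can help to check this on a compute node before settling on a configuration. The following is a minimal sketch, assuming GNU coreutils `stat` is available; the `fs_type` helper is hypothetical and the three directories simply mirror the example configuration.

```shell
#!/bin/sh
# fs_type PATH: print the filesystem type backing PATH (for example "tmpfs").
# If PATH does not exist yet, walk up to the nearest existing parent.
# Hypothetical helper for illustration; relies on GNU stat (-f -c %T).
fs_type() {
    p=$1
    while [ ! -e "$p" ] && [ "$p" != "/" ]; do
        p=$(dirname "$p")
    done
    stat -f -c %T "$p"
}

# Directories taken from the example enroot.conf; adjust to your cluster.
for dir in /run/enroot /tmp/enroot-data /raid/enroot-cache; do
    printf '%s: %s\n' "$dir" "$(fs_type "$dir")"
done
```

A path reported as `tmpfs` is backed by RAM, so it suits ENROOT_RUNTIME_PATH but not a cache that should survive reboots.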

Slurm configuration

Pyxis and enroot can be used with exclusive (OverSubscribe=EXCLUSIVE) or shared node access in Slurm. It is simpler to start with a setup using exclusive node access.

The default value of PlugStackConfig should work fine if you follow the installation steps, but you might want to set it explicitly if you run other plugins (see bug 9081):

PlugStackConfig=/etc/slurm/plugstack.conf 

Slurm prolog

Depending on your enroot configuration, you might need a custom Slurm prolog to ensure that the configured enroot directories are available. For instance, here is a template for a Slurm prolog script that creates the enroot directories and sets their permissions:
https://github.com/NVIDIA/deepops/blob/20.08/roles/slurm/templates/etc/slurm/prolog.d/50-all-enroot-dirs
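
For readers who want a self-contained starting point, a prolog along these lines could look like the sketch below. It is not the DeepOps template: the `make_dir` helper, the modes, and the hard-coded paths (which mirror the example enroot.conf) are assumptions, and the script assumes it runs as root with SLURM_JOB_UID and SLURM_JOB_USER set by Slurm.

```shell
#!/bin/sh
# Hypothetical Slurm prolog sketch: create the per-user enroot directories
# before the job starts. Paths and modes are assumptions, not pyxis defaults.
set -eu

# make_dir DIR OWNER GROUP MODE: create DIR with the given ownership and mode.
make_dir() {
    mkdir -p "$1"
    chown "$2:$3" "$1"
    chmod "$4" "$1"
}

# Slurm exports SLURM_JOB_UID and SLURM_JOB_USER to the prolog environment.
if [ -n "${SLURM_JOB_UID:-}" ] && [ -n "${SLURM_JOB_USER:-}" ]; then
    uid=$SLURM_JOB_UID
    gid=$(id -g "$SLURM_JOB_USER")
    make_dir "/run/enroot/user-$uid" "$uid" "$gid" 0700
    make_dir "/tmp/enroot-data/user-$uid" "$uid" "$gid" 0700
    # The example cache path is per group, so group members need write access.
    make_dir "/raid/enroot-cache/group-$gid" root "$gid" 0770
fi
```

The directory names must match what enroot.conf expands to for the job's user, otherwise enroot will look in a different location than the one the prolog prepared.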

Slurm epilog

Unnamed containers are removed automatically at the end of a job step, but named containers are not removed automatically. It is recommended to add a Slurm epilog script that cleans up ENROOT_RUNTIME_PATH and ENROOT_DATA_PATH after a job completes, to make sure the storage used by all containers is reclaimed.
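
Such an epilog could be sketched as follows. This is a minimal illustration, not a script shipped with pyxis: the `clean_dir` helper and the paths (taken from the example enroot.conf) are assumptions, and it assumes exclusive node access, so that no other job of the same user is still using containers on the node when the epilog runs.

```shell
#!/bin/sh
# Hypothetical Slurm epilog sketch: reclaim the storage used by the containers
# of the job's user. Assumes exclusive node access and the example paths.
set -eu

# clean_dir DIR: delete everything inside DIR but keep DIR itself.
# Does nothing if DIR does not exist.
clean_dir() {
    if [ -d "$1" ]; then
        find "$1" -mindepth 1 -delete
    fi
}

# Slurm exports SLURM_JOB_UID to the epilog environment.
if [ -n "${SLURM_JOB_UID:-}" ]; then
    clean_dir "/run/enroot/user-$SLURM_JOB_UID"
    clean_dir "/tmp/enroot-data/user-$SLURM_JOB_UID"
fi
```

With shared node access, a cleanup like this would need an additional check that the user has no other job running on the node, since named containers may be reused across job steps.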

Slurm plugstack configuration

Pyxis currently has three arguments that can be modified in the Slurm plugstack configuration:

  • runtime_path is similar to ENROOT_RUNTIME_PATH: it is where pyxis stores temporary squashfs images when importing a docker image. It is recommended to use a tmpfs if your systems have sufficient memory.
  • remap_root controls whether a user will see themselves as UID 0 (root) or their usual UID inside the container. See --no-container-remap-root.
  • execute_entrypoint controls whether the entrypoint defined in the container image is executed when pyxis starts the container. See --no-container-entrypoint.

If no arguments are specified in the plugstack configuration, the default values are equivalent to:

$ cat /etc/slurm/plugstack.conf.d/pyxis.conf 
required /usr/local/lib/slurm/spank_pyxis.so runtime_path=/run/pyxis remap_root=1 execute_entrypoint=0