Verify CUDA environment variables #213
Currently enroot trusts CUDA environment variables and calls nvidia-container-cli without checking whether the NVIDIA drivers are installed and whether the required shared libraries are present, e.g. libnvidia-ml.so.1.
This is problematic on Slurm clusters with a mix of CPU and GPU nodes. Slurm copies environment variables from the submission node, and there is little control over those. So a user with CUDA environment variables set in their shell on a login node cannot run jobs on CPU nodes, as enroot will error out when it fails to find libnvidia-ml.so.1.
$ NVIDIA_VISIBLE_DEVICES=all srun --container-image=centos grep PRETTY /etc/os-release
slurmstepd: error: pyxis: container start failed with error code: 1
slurmstepd: error: pyxis: printing enroot log file:
slurmstepd: error: pyxis: [WARN] Kernel module nvidia_uvm is not loaded. Make sure the NVIDIA device driver is installed and loaded.
slurmstepd: error: pyxis: nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory
slurmstepd: error: pyxis: [ERROR] /etc/enroot/hooks.d/98-nvidia.sh exited with return code 1
slurmstepd: error: pyxis: couldn't start container
slurmstepd: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
slurmstepd: error: Failed to invoke spank plugin stack
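A minimal sketch of the kind of guard the issue asks for (this is an assumption about a possible fix, not enroot's actual code): before the NVIDIA hook invokes nvidia-container-cli, probe whether libnvidia-ml.so.1 is resolvable via ldconfig and skip GPU setup gracefully on nodes without the driver, instead of failing the whole job.

```shell
#!/bin/sh
# Hypothetical pre-check for a hook like /etc/enroot/hooks.d/98-nvidia.sh.
# Probe the dynamic linker cache for the NVIDIA management library.
if ldconfig -p 2>/dev/null | grep -q 'libnvidia-ml\.so\.1'; then
    have_nvidia=1
else
    have_nvidia=0
fi

if [ "$have_nvidia" -eq 0 ]; then
    # On a CPU-only node: warn and skip, rather than erroring out.
    echo "[WARN] libnvidia-ml.so.1 not found; skipping NVIDIA hook" >&2
    # A real hook would 'exit 0' here before calling nvidia-container-cli.
fi
```

With a guard like this, the CUDA environment variables inherited from the login node would be harmless on CPU nodes, and the job in the example above would run instead of failing in the hook.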