
Verify CUDA environment variables #213

@errordeveloper

Description


Currently enroot trusts CUDA environment variables and calls nvidia-container-cli without checking whether the NVIDIA driver is installed and the required shared libraries are present, e.g. libnvidia-ml.so.1.
This is problematic on a Slurm cluster with a mix of CPU and GPU nodes. Slurm copies environment variables from the head node, and there is little control over those. So a user with CUDA environment variables set in their shell on a login node cannot run jobs on CPU nodes, as enroot errors out when it fails to find libnvidia-ml.so.1.

$ NVIDIA_VISIBLE_DEVICES=all srun --container-image=centos grep PRETTY /etc/os-release
slurmstepd: error: pyxis: container start failed with error code: 1
slurmstepd: error: pyxis: printing enroot log file:
slurmstepd: error: pyxis:     [WARN] Kernel module nvidia_uvm is not loaded. Make sure the NVIDIA device driver is installed and loaded.
slurmstepd: error: pyxis:     nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory
slurmstepd: error: pyxis:     [ERROR] /etc/enroot/hooks.d/98-nvidia.sh exited with return code 1
slurmstepd: error: pyxis: couldn't start container
slurmstepd: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
slurmstepd: error: Failed to invoke spank plugin stack
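One possible approach would be for the hook to verify the driver is actually usable before invoking nvidia-container-cli, and to skip GPU setup as a no-op otherwise. A minimal sketch, assuming the hook is allowed to exit successfully on driver-less nodes (the `driver_present` function name is illustrative, not enroot's actual code):

```shell
#!/bin/bash
# Hypothetical guard for a hook like /etc/enroot/hooks.d/98-nvidia.sh:
# skip GPU container setup on nodes without an NVIDIA driver instead of
# failing the whole job.

driver_present() {
    # nvidia-container-cli needs the driver's management library to be
    # resolvable by the dynamic linker.
    ldconfig -p | grep -q 'libnvidia-ml\.so\.1' || return 1
    # The nvidia_uvm kernel module must also be loaded (see the warning
    # in the log above).
    grep -q '^nvidia_uvm ' /proc/modules || return 1
    return 0
}

if driver_present; then
    # ... original hook logic invoking nvidia-container-cli goes here ...
    :
else
    echo "NVIDIA driver not found; skipping GPU container setup" >&2
fi
```

With such a guard, stray NVIDIA_VISIBLE_DEVICES values inherited from a login node would be harmlessly ignored on CPU-only nodes rather than aborting the job.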
