
Verify CUDA environment variables #213

@errordeveloper

Description


Currently enroot trusts CUDA environment variables and calls nvidia-container-cli without checking whether the NVIDIA driver is installed and the required shared libraries are present, e.g. libnvidia-ml.so.1.
This is problematic on a Slurm cluster with a mix of CPU and GPU nodes. Slurm copies environment variables from the head node, and there is little control over those. So a user with CUDA environment variables set in their shell on a login node cannot run jobs on CPU nodes, as enroot errors out when it fails to find libnvidia-ml.so.1.

$ NVIDIA_VISIBLE_DEVICES=all srun --container-image=centos grep PRETTY /etc/os-release
slurmstepd: error: pyxis: container start failed with error code: 1
slurmstepd: error: pyxis: printing enroot log file:
slurmstepd: error: pyxis:     [WARN] Kernel module nvidia_uvm is not loaded. Make sure the NVIDIA device driver is installed and loaded.
slurmstepd: error: pyxis:     nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory
slurmstepd: error: pyxis:     [ERROR] /etc/enroot/hooks.d/98-nvidia.sh exited with return code 1
slurmstepd: error: pyxis: couldn't start container
slurmstepd: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
slurmstepd: error: Failed to invoke spank plugin stack
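One possible approach would be for the hook to verify the driver is actually usable before invoking nvidia-container-cli, and to skip GPU setup as a no-op otherwise. A minimal sketch, assuming the hook is allowed to exit successfully on driver-less nodes (the `driver_present` function name is illustrative, not enroot's actual code):

```shell
#!/bin/bash
# Hypothetical guard for a hook like /etc/enroot/hooks.d/98-nvidia.sh:
# skip GPU container setup on nodes without an NVIDIA driver instead of
# failing the whole job.

driver_present() {
    # nvidia-container-cli needs the driver's management library to be
    # resolvable by the dynamic linker.
    ldconfig -p | grep -q 'libnvidia-ml\.so\.1' || return 1
    # The nvidia_uvm kernel module must also be loaded (see the warning
    # in the log above).
    grep -q '^nvidia_uvm ' /proc/modules || return 1
    return 0
}

if driver_present; then
    # ... original hook logic invoking nvidia-container-cli goes here ...
    :
else
    echo "NVIDIA driver not found; skipping GPU container setup" >&2
fi
```

With such a guard, stray NVIDIA_VISIBLE_DEVICES values inherited from a login node would be harmlessly ignored on CPU-only nodes rather than aborting the job.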
