This repository was archived by the owner on Jan 22, 2024. It is now read-only.

nvidia-docker on Azure broken after restart #349


Description

@adinin

We run a few machines on Azure with nvidia-docker installed. The instances are restarted overnight (they are used on demand). After starting the instances today, containers launched with nvidia-docker can no longer access the NVIDIA drivers.

Running nvidia-smi on the host returns the expected output:

>nvidia-smi
Thu Mar 23 20:22:57 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.39                 Driver Version: 375.39                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | B60C:00:00.0     Off |                    0 |
| N/A   39C    P8    27W / 149W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Running nvidia-smi inside the nvidia/cuda container fails with the following:

>nvidia-docker run --rm nvidia/cuda nvidia-smi
container_linux.go:247: starting container process caused "exec: \"nvidia-smi\": executable file not found in $PATH"
docker: Error response from daemon: oci runtime error: container_linux.go:247: starting container process caused "exec: \"nvidia-smi\": executable file not found in $PATH".

We have previously dealt with the NVIDIA drivers disappearing after a restart, but this issue is different: nvidia-smi works perfectly on the host after the restart.
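
In case it helps with triage: as far as we understand, with nvidia-docker 1.0 nvidia-smi is not shipped inside the nvidia/cuda image but is bind-mounted from a driver volume created by nvidia-docker-plugin, so the "not found in $PATH" error presumably means that volume is missing or not being mounted after the restart. Checks we plan to run next (assuming a systemd host and the plugin's default port 3476):

>sudo systemctl status nvidia-docker          # plugin service should be active
>curl -s http://localhost:3476/docker/cli     # should print the --volume-driver/--volume/--device args
>docker volume ls | grep nvidia_driver        # driver volume, e.g. nvidia_driver_375.39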

We tried removing nvidia-docker, uninstalling the NVIDIA drivers, restarting, and reinstalling everything, but that did not help. Everything was installed according to the wiki/azure instructions in this repo.
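
For reference, the install roughly followed the wiki quickstart (the release version below is from memory, so treat it as approximate):

>wget -P /tmp https://github.com/NVIDIA/nvidia-docker/releases/download/v1.0.1/nvidia-docker_1.0.1-1_amd64.deb
>sudo dpkg -i /tmp/nvidia-docker_1.0.1-1_amd64.deb
>nvidia-docker run --rm nvidia/cuda nvidia-smi    # sanity check; this worked before the restart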

We would appreciate any help you can provide.
