We run a few machines on Azure with nvidia-docker on them. The instances are used on demand and are restarted overnight. After starting the instances today, containers launched with nvidia-docker could no longer access the NVIDIA drivers.
Running nvidia-smi on the host returns the expected output:
>nvidia-smi
Thu Mar 23 20:22:57 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.39                 Driver Version: 375.39                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | B60C:00:00.0     Off |                    0 |
| N/A   39C    P8    27W / 149W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Running nvidia-smi inside the nvidia/cuda container fails with the following:
>nvidia-docker run --rm nvidia/cuda nvidia-smi
container_linux.go:247: starting container process caused "exec: \"nvidia-smi\": executable file not found in $PATH"
docker: Error response from daemon: oci runtime error: container_linux.go:247: starting container process caused "exec: \"nvidia-smi\": executable file not found in $PATH".
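In case it helps narrow this down, here is roughly what we can check on the host after a restart, assuming the standard nvidia-docker 1.x setup with nvidia-docker-plugin (the volume name and path below are derived from the 375.39 driver version reported above, so they may differ):
>sudo systemctl status nvidia-docker                           # is the plugin service running?
>curl -s http://localhost:3476/v1.0/docker/cli                 # what CLI args would the plugin inject?
>docker volume ls | grep nvidia_driver                         # does the driver volume exist?
>docker volume inspect nvidia_driver_375.39
>ls /var/lib/nvidia-docker/volumes/nvidia_driver/375.39/bin/   # is nvidia-smi actually in the volume?
If the volume is missing or empty, that would explain the "executable file not found in $PATH" error, since nvidia-smi is only available inside the container through that mounted volume.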
We have previously dealt with the NVIDIA drivers disappearing after a restart, but this issue is different because nvidia-smi works fine on the host after the restart.
We tried removing nvidia-docker, uninstalling the NVIDIA drivers, rebooting, and reinstalling everything from scratch; that didn't help. Everything was installed according to the wiki/Azure instructions in this repo.
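For completeness, the reinstall was essentially the sequence from the wiki; roughly the following, with exact driver package and nvidia-docker versions approximate:
>sudo apt-get purge -y nvidia-docker 'nvidia-*'
>sudo apt-get install -y nvidia-375                            # host driver matching 375.39 above
>wget -P /tmp https://github.com/NVIDIA/nvidia-docker/releases/download/v1.0.1/nvidia-docker_1.0.1-1_amd64.deb
>sudo dpkg -i /tmp/nvidia-docker_1.0.1-1_amd64.deb
>sudo systemctl restart nvidia-docker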
Appreciate any help you can provide.