Error when trying to run nvidia-docker #358

Closed
NeoZeromus opened this Issue Apr 3, 2017 · 20 comments

NeoZeromus commented Apr 3, 2017

root@Computor:/home/alex# nvidia-docker-plugin
nvidia-docker-plugin | 2017/04/03 15:24:24 Loading NVIDIA unified memory
nvidia-docker-plugin | 2017/04/03 15:24:24 Loading NVIDIA management library
nvidia-docker-plugin | 2017/04/03 15:24:24 Discovering GPU devices
nvidia-docker-plugin | 2017/04/03 15:24:24 Provisioning volumes at /var/lib/nvidia-docker/volumes
nvidia-docker-plugin | 2017/04/03 15:24:24 Serving plugin API at /run/docker/plugins
nvidia-docker-plugin | 2017/04/03 15:24:24 Serving remote API at localhost:3476
nvidia-docker-plugin | 2017/04/03 15:24:24 Error: listen tcp 127.0.0.1:3476: bind: address already in use

alex@Computor:~$ systemctl status nvidia-docker
● nvidia-docker.service - NVIDIA Docker plugin
Loaded: loaded (/lib/systemd/system/nvidia-docker.service; disabled; vendor preset: enabled)
Active: inactive (dead)
Docs: https://github.com/NVIDIA/nvidia-docker/wiki

alex@Computor:~$ nvidia-docker run --rm -ti crisbal/torch-rnn:cuda7.5 bash
docker: Error response from daemon: create nvidia_driver_375.39: create nvidia_driver_375.39: Error looking up volume plugin nvidia-docker: legacy plugin: plugin not found.
See 'docker run --help'.

alex@Computor:~$ ldconfig -p | grep nvidia-ml
libnvidia-ml.so.1 (libc6,x86-64) => /usr/lib/nvidia-375/libnvidia-ml.so.1
libnvidia-ml.so.1 (libc6) => /usr/lib32/nvidia-375/libnvidia-ml.so.1
libnvidia-ml.so (libc6,x86-64) => /usr/lib/nvidia-375/libnvidia-ml.so
libnvidia-ml.so (libc6) => /usr/lib32/nvidia-375/libnvidia-ml.so

Thank you in advance. I have read through the other issues, but I can't seem to get this to work. I'm trying to use it for torch-rnn.

Member

flx42 commented Apr 5, 2017

You shouldn't start nvidia-docker-plugin manually; it should be started by systemd.
After restarting the machine, what's the output of journalctl -n -u nvidia-docker?

armoreal commented Apr 7, 2017

I have the same issue.
armor@armor2x:~$ service nvidia-docker status
● nvidia-docker.service - NVIDIA Docker plugin
Loaded: loaded (/lib/systemd/system/nvidia-docker.service; disabled; vendor preset: enabled)
Active: inactive (dead)
Docs: https://github.com/NVIDIA/nvidia-docker/wiki
armor@armor2x:~$ nvidia-docker run --rm nvidia/cuda nvidia-smi
docker: Error response from daemon: create nvidia_driver_375.39: Error looking up volume plugin nvidia-docker: legacy plugin: plugin not found.
See 'docker run --help'.
armor@armor2x:~$ journalctl -n -u nvidia-docker
-- No entries --

Also note that while installing I got this (error code 9):
armor@armor2x:~$ sudo dpkg -i /tmp/nvidia-docker*.deb
Configuring user
useradd: group nvidia-docker exists - if you want to add this user to that group, use -g.
dpkg: error processing package nvidia-docker (--install):
subprocess installed post-installation script returned error exit status 9
Seems like a problem with the user or user group?

NeoZeromus commented Apr 7, 2017

Same output, @flx42. Should I do a fresh reinstall?

Member

flx42 commented Apr 7, 2017

@armoreal uninstall nvidia-docker, remove the nvidia-docker, then try a fresh reinstall.

@NeoZeromus are you saying you still have listen tcp 127.0.0.1:3476: bind: address already in use? And you didn't run nvidia-docker-plugin manually this time?

NeoZeromus commented Apr 7, 2017

@flx42 exactly

Member

flx42 commented Apr 7, 2017

@NeoZeromus Do you have another service running on this port?

$ sudo lsof -i :3476
COMMAND    PID          USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
nvidia-do 8318 nvidia-docker   19u  IPv4  81085      0t0  TCP localhost:3476 (LISTEN)
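
A minimal sketch of how the check above could be scripted, so that the process holding port 3476 is reported in one line. The `listeners` helper is a hypothetical name, not part of nvidia-docker; it only post-processes the output of the `lsof -i :3476` command shown above and assumes the column layout from that sample.

```shell
#!/bin/sh
# Hypothetical helper: read `lsof -i :PORT` output on stdin and print
# "COMMAND PID USER" for every socket in the LISTEN state.
# NR > 1 skips the header row that lsof prints first.
listeners() {
    awk 'NR > 1 && /\(LISTEN\)/ { print $1, $2, $3 }'
}

# Typical use on a live system (needs lsof and root):
#   sudo lsof -i :3476 | listeners
```

If the reported user is nvidia-docker, the plugin is already running as a service, and starting a second copy by hand would be expected to fail with "address already in use".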

armoreal commented Apr 8, 2017

@flx42 I tried to reinstall several times, but got the same result.
But I noticed that my situation is different from @NeoZeromus's. I am able to manually run sudo nvidia-docker-plugin & and then nvidia-docker run works properly.

loretoparisi commented Apr 27, 2017

Having the same error here.

I have

$ modinfo -F version nvidia
367.57

I did

docker volume create --driver=nvidia-docker --name=nvidia_driver_$(modinfo -F version nvidia)

and the volume is there

sudo docker volume ls | grep nvidia
nvidia-docker       nvidia_driver_367.57

I run the container with

docker run -it --device=/dev/nvidiactl --device=/dev/nvidia-uvm --device=/dev/nvidia0 --volume-driver nvidia-docker -v nvidia_driver_367.57:/usr/local/nvidia:ro $IMAGE bash

and I have

$ journalctl -n -u nvidia-docker
-- Logs begin at Thu 2017-02-09 23:42:51 UTC, end at Thu 2017-04-27 17:36:00 UTC. --
Apr 12 19:31:39 nvidia-docker-loreto nvidia-docker-plugin[8831]: /usr/bin/nvidia-docker-plugin | 2017/04/12 19:31:39 Received mount request for volume 'nvidia_driver_367.57'
Apr 12 19:36:45 nvidia-docker-loreto nvidia-docker-plugin[8831]: /usr/bin/nvidia-docker-plugin | 2017/04/12 19:36:45 Received unmount request for volume 'nvidia_driver_367.57'

so it should be fine since it is listening

ubuntu@nvidia-docker-loreto:~$ sudo lsof -i :3476
COMMAND    PID          USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
nvidia-do 8831 nvidia-docker   13u  IPv4  63280      0t0  TCP localhost:3476 (LISTEN)

dbkinghorn commented May 1, 2017

I just hit a problem on a fresh install: nvidia-modprobe was not installed. I installed it and rebooted, and everything was fine.
This may or may not help with the specific problems mentioned above, but it did take care of things when I got the error message "Error looking up volume plugin nvidia-docker: legacy plugin: plugin not found".

puyash commented May 4, 2017

Reboot after install solved it for me too.

grisaitis commented May 12, 2017

@dbkinghorn How did you install the driver? If you used the cuda package, it should include nvidia-modprobe.

In my experience, installing the drivers and CUDA using NVIDIA's apt PPA works really well. Just make sure you don't have a preexisting runfile installation of either the driver or CUDA. Those should be uninstallable with /usr/bin/nvidia-uninstall (for the driver) and /usr/local/cuda-8.0/bin/uninstall_cuda_8.0.pl (for CUDA 8.0, for example).

dbkinghorn commented May 12, 2017

@grisaitis
When I do a docker setup I don't necessarily install CUDA on the host at all; I use a docker container for that.
I do Ubuntu installs from server, then add a desktop and the NVIDIA drivers from the graphics drivers PPA. If you want to see the setup, look for the HPC blog at "Puget Systems". I have a series of docker and nvidia-docker posts there that have all the details of getting a nice setup working. There is enough detail that you can script the install. I was showing this stuff at GTC and it was a big hit. Best wishes.

Member

3XX0 commented Jun 13, 2017

Did you find the problem on your setup?

liuguiyangnwpu commented Jun 21, 2017

After running nvidia-docker run -it df300f686ea4 /bin/bash, I got this error:

-- Logs begin at Wed 2017-06-21 15:12:23 CST, end at Wed 2017-06-21 16:27:18 CST. --
Jun 21 16:27:04 guiyang sudo[6056]: pam_unix(sudo:session): session closed for user root
Jun 21 16:27:18 guiyang sudo[6075]:  guiyang : TTY=pts/17 ; PWD=/home/guiyang ; USER=root ; COMMAND=/usr/bin/nvidia-docker run -it df300f686ea4 /bin/bash
Jun 21 16:27:18 guiyang sudo[6075]: pam_unix(sudo:session): session opened for user root by (uid=0)
Jun 21 16:27:18 guiyang dockerd[1330]: time="2017-06-21T16:27:18.443791696+08:00" level=error msg="Handler for GET /v1.27/containers/df300f686ea4/json returned error: No such container: df300f686ea4"
Jun 21 16:27:18 guiyang dockerd[1330]: time="2017-06-21T16:27:18.456286981+08:00" level=error msg="Handler for GET /v1.27/containers/df300f686ea4/json returned error: No such container: df300f686ea4"
Jun 21 16:27:18 guiyang dockerd[1330]: time="2017-06-21T16:27:18.464654980+08:00" level=error msg="Handler for GET /v1.27/volumes/nvidia_driver_375.66 returned error: get nvidia_driver_375.66: no such volume"
Jun 21 16:27:18 guiyang nvidia-docker-plugin[5513]: /usr/bin/nvidia-docker-plugin | 2017/06/21 16:27:18 Received create request for volume 'nvidia_driver_375.66'
Jun 21 16:27:18 guiyang nvidia-docker-plugin[5513]: /usr/bin/nvidia-docker-plugin | 2017/06/21 16:27:18 Error: mkdir /var/lib/nvidia-docker/volumes/nvidia_driver/375.66: permission denied
Jun 21 16:27:18 guiyang dockerd[1330]: time="2017-06-21T16:27:18.624923075+08:00" level=error msg="Handler for POST /v1.27/containers/create returned error: create nvidia_driver_375.66: VolumeDriver.Create: internal error, che
Jun 21 16:27:18 guiyang sudo[6075]: pam_unix(sudo:session): session closed for user root

grisaitis commented Jun 21, 2017

@liuguiyangnwpu hm, a permissions error with mkdir.

What happens when you try creating that directory as root (/var/lib/nvidia-docker/volumes/nvidia_driver/375.66)? Does the parent directory /var/lib/nvidia-docker/volumes/nvidia_driver exist?

liuguiyangnwpu commented Jun 21, 2017

Hi @grisaitis,
the first problem is

Jun 21 16:27:18 guiyang dockerd[1330]: time="2017-06-21T16:27:18.443791696+08:00" level=error msg="Handler for GET /v1.27/containers/df300f686ea4/json returned error: No such container: df300f686ea4"
Jun 21 16:27:18 guiyang dockerd[1330]: time="2017-06-21T16:27:18.456286981+08:00" level=error msg="Handler for GET /v1.27/containers/df300f686ea4/json returned error: No such container: df300f686ea4"

The system log says No such container: df300f686ea4, but I ran it with the image ID shown by docker images.
After I run

systemctl restart nvidia-docker

it does create /var/lib/nvidia-docker/volumes/nvidia_driver,
and that directory contains

bin lib lib64

But when I run

nvidia-docker run -it image-id /bin/bash

some other errors occur.

Member

3XX0 commented Jun 28, 2017

Try removing /var/lib/nvidia-docker and restart nvidia-docker.
Then run nvidia-docker run --rm nvidia/cuda nvidia-smi to check your install.

If it fails, paste the logs given by journalctl -u nvidia-docker
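
The three steps above could be wrapped in a small script. This is only a sketch under assumptions: the `run` and `reset_nvidia_docker` names and the `DRY_RUN` variable are hypothetical, and the script defaults to printing the commands rather than executing them, since removing /var/lib/nvidia-docker is destructive and needs root.

```shell
#!/bin/sh
# Hypothetical wrapper: with DRY_RUN=1 (the default) each command is only
# printed with a "+ " prefix; with DRY_RUN=0 it is executed as given.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

reset_nvidia_docker() {
    # Remove the plugin's state directory, as suggested in the comment above.
    run rm -rf /var/lib/nvidia-docker
    # Restart the systemd service that runs nvidia-docker-plugin.
    run systemctl restart nvidia-docker
    # Smoke test: the container should print the nvidia-smi table.
    run nvidia-docker run --rm nvidia/cuda nvidia-smi
}

# Dry run:   reset_nvidia_docker
# Real run:  DRY_RUN=0; reset_nvidia_docker   (as root)
```

If the last step still fails, the logs from journalctl -u nvidia-docker remain the thing to paste.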

liuguiyangnwpu commented Jun 30, 2017

@3XX0 thanks !

3XX0 closed this Aug 21, 2017

pseudotensor commented Nov 20, 2017

FYI, after installing 2.0 and finding that it didn't work, I wanted to go back to 1.0, but the install failed even after a purge, etc. I had to run:

sudo rm -rf /usr/bin/nvidia-docker /var/lib/nvidia-docker/
# remove nvidia-docker group entry in /etc/group

Then I reinstalled 1.0. Without this cleanup the install failed and I got an error similar to the one described here.
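
Before reinstalling, it may help to confirm whether a leftover group entry is still present, since the dpkg failure reported earlier in this thread came from useradd complaining that the nvidia-docker group already exists. A minimal sketch, assuming the standard group(5) file format; the `has_group` helper is a hypothetical name, and using `groupdel` instead of editing /etc/group by hand is a suggestion that does not appear in the thread.

```shell
#!/bin/sh
# Hypothetical helper: has_group FILE NAME
# Succeeds if NAME has an entry in FILE, where FILE uses the group(5)
# format (name:password:gid:members), as /etc/group does.
has_group() {
    grep -q "^$2:" "$1"
}

# Typical use on a live system:
#   if has_group /etc/group nvidia-docker; then
#       echo "stale nvidia-docker group found"
#       # sudo groupdel nvidia-docker   # alternative to hand-editing /etc/group
#   fi
```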

Member

flx42 commented Nov 20, 2017

@pseudotensor please file a new issue for the 2.0 problem you faced. We have some improvements to do in the installation guide.
