This repository has been archived by the owner on Jan 22, 2024. It is now read-only.

Remote nvidia-docker occasionally fails #123

Closed
jimfleming opened this issue Jun 27, 2016 · 35 comments

@jimfleming

Here's the error I get:

nvidia-docker | 2016/06/27 19:24:29 Error: Get http://plugin/docker/cli?vol=nvidia_driver&dev=: EOF

And the bash script:

# point nvidia-docker at the remote nvidia-docker-plugin over SSH
export NV_HOST="ssh://username@$(docker-machine ip $MACHINE_NAME):"
# load the docker-machine key so nvidia-docker can authenticate over SSH
eval $(ssh-agent -s)
ssh-add ~/.docker/machine/machines/$MACHINE_NAME/id_rsa
nvidia-docker run "$DOCKER_IMAGE" "$DOCKER_ARG"

FWIW, this works correctly ~80% of the time. Any tips on troubleshooting this further or possible causes would be super helpful.

@flx42 (Member) commented Jun 28, 2016

I tried to repro the problem using the instructions here, but I couldn't.

Which version of nvidia-docker do you have? And when you say 80% of the time, do you mean it works 80% of the time when accessing the same machine?
Or 80% of the machines you provision have this problem?

@jimfleming (Author)

Thanks for the reply. Here's the info:

Which version of nvidia-docker do you have?

nvidia-docker version: 1.0.0rc3

And when you say 80% of the time, do you mean it works 80% of the time when accessing the same machine?
Or 80% of the machines you provision have this problem?

It works on roughly 80% of the machines I've provisioned.

@flx42 (Member) commented Jun 28, 2016

So far no luck reproducing the nvidia-docker issue, but I do get docker-machine hanging on

Installing Docker...

quite reliably :)

@jimfleming (Author)

Hmm, I haven't run into that myself. In the last week I've started 92 remote instances on EC2; 16 of those failed with the above error. None have failed by hanging.

Is there any debug output from nvidia-docker that would be helpful to capture the next time I encounter the issue?

@flx42 (Member) commented Jun 29, 2016

@3XX0 Did you ever see anything like that? Any ideas on how to troubleshoot it?

@3XX0 (Member) commented Jun 30, 2016

Can you try with Go 1.6? You need to change the first line here and rebuild from source:
https://github.com/NVIDIA/nvidia-docker/blob/master/tools/Dockerfile.build#L1
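
For reference, the change would just be the base-image tag on that first line (a sketch, assuming the file currently pins a golang:1.5 image and the rest of it stays untouched):

# tools/Dockerfile.build, first line only
FROM golang:1.6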

@jimfleming (Author)

Sure, I'll give it a try.

flx42 added the bug label on Jul 1, 2016
@3XX0 (Member) commented Jul 11, 2016

Did it fix the issue?

@flx42 (Member) commented Aug 3, 2016

Ping @jimfleming

@jimfleming (Author)

Sorry, I did not get around to rebuilding with a newer Go. On the upside, I don't seem to be seeing the issue anymore. (I've rebuilt the AMI for other reasons, which may have had an effect.)

@amacneil

I am seeing this issue a small percentage of the time on our CI servers. Note that I'm not using the nvidia-docker wrapper, only the docker CLI arguments (fetched in a previous step from the HTTP API):

docker run -t --cpuset-cpus=2-3,5,10-11,13 --volume-driver=nvidia-docker --volume=nvidia_driver_352.63:/usr/local/nvidia:ro --device=/dev/nvidiactl --device=/dev/nvidia-uvm --device=/dev/nvidia0 --device=/dev/nvidia1 --device=/dev/nvidia2 --device=/dev/nvidia3 image foo.sh
Error response from daemon: create nvidia_driver_352.63: Post http://%2Fvar%2Flib%2Fnvidia-docker%2Fnvidia-docker.sock/VolumeDriver.Create: http: ContentLength=44 with Body length 0
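
For context, those CLI arguments come from nvidia-docker-plugin's remote API in a previous build step, roughly like this (a sketch, assuming the plugin's default remote API port 3476):

# ask the plugin for the docker CLI arguments, then splice them into docker run
NV_ARGS=$(curl -s http://localhost:3476/docker/cli)
docker run -t $NV_ARGS image foo.sh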

Is the recommended fix still to rebuild with Go 1.6? I'm OK with waiting until the next version of nvidia-docker is released if that is likely to fix it too. Let me know if you would like me to attempt any debugging steps, as this is happening at least once per day for us.

@flx42 (Member) commented Aug 24, 2016

@amacneil We don't know, actually; we weren't able to reproduce the problem.
If you can test with Go 1.6 and confirm that it removes the bug, we will gladly bump the Go version before the 1.0.0 release.

@amacneil

@flx42 Any thoughts on using Go 1.7? Have you tested with that at all?

@3XX0 (Member) commented Aug 24, 2016

No, we didn't test it, but I expect it would work just fine.
We have to stick with Go 1.5 due to a Docker issue (see #83).

@amacneil

OK, I rebuilt with Go 1.6 for now, so I'll report back on whether we continue to see this issue. If Go 1.6 does fix it, it sounds like you may still be stuck on 1.5 for the time being... or would that just mean you need to drop support for Docker 1.9?

@flx42 (Member) commented Aug 24, 2016

Yes, we'll drop Docker 1.9 if that fixes the issue. But I have a feeling it will be difficult to know for sure whether the problem is fixed by this change.

@amacneil

OK, sounds good. We've been seeing the error above at least a few times per day. I just rolled out a custom deb with Go 1.6, so if we don't see anything for a couple of days that will be a good sign.

Also, if there's any other info I could collect to help debug this let me know.

@amacneil

Unfortunately, we're still seeing this issue with the Go 1.6 build. I'm honestly not sure whether it's the same problem as the OP's (the error message is different), but it's definitely causing occasional failures in our CI environment.

Error response from daemon: create nvidia_driver_352.63: Post http://%2Fvar%2Flib%2Fnvidia-docker%2Fnvidia-docker.sock/VolumeDriver.Create: http: ContentLength=44 with Body length 0

Is there anything I can do to help collect debug logs or try to diagnose this?

@amacneil

My issue is actually much more similar to #150

/var/log/upstart/docker.log:

time="2016-08-25T21:17:55.021899212Z" level=warning msg="Unable to connect to plugin: /var/lib/nvidia-docker/nvidia-docker.sock:/VolumeDriver.Create, retrying in 1s"
time="2016-08-25T21:17:56.022472840Z" level=warning msg="Unable to connect to plugin: /var/lib/nvidia-docker/nvidia-docker.sock:/VolumeDriver.Create, retrying in 2s"
time="2016-08-25T21:17:58.023103941Z" level=warning msg="Unable to connect to plugin: /var/lib/nvidia-docker/nvidia-docker.sock:/VolumeDriver.Create, retrying in 4s"
time="2016-08-25T21:18:02.023665967Z" level=warning msg="Unable to connect to plugin: /var/lib/nvidia-docker/nvidia-docker.sock:/VolumeDriver.Create, retrying in 8s"
time="2016-08-25T21:18:18.636113840Z" level=error msg="Handler for POST /v1.21/containers/create returned error: create nvidia_driver_352.63: Post http://%2Fvar%2Flib%2Fnvidia-docker%2Fnvidia-docker.sock/VolumeDriver.Create: http: ContentLength=44 with Body length 0"

/var/log/upstart/nvidia-docker.log:

/usr/bin/nvidia-docker-plugin | 2016/08/25 20:52:31 Loading NVIDIA unified memory
/usr/bin/nvidia-docker-plugin | 2016/08/25 20:52:31 Loading NVIDIA management library
/usr/bin/nvidia-docker-plugin | 2016/08/25 20:52:34 Discovering GPU devices
/usr/bin/nvidia-docker-plugin | 2016/08/25 20:52:35 Provisioning volumes at /var/lib/nvidia-docker/volumes
/usr/bin/nvidia-docker-plugin | 2016/08/25 20:52:35 Serving plugin API at /var/lib/nvidia-docker
/usr/bin/nvidia-docker-plugin | 2016/08/25 20:52:35 Serving remote API at localhost:3476
/usr/bin/nvidia-docker-plugin | 2016/08/25 21:01:37 Received activate request
/usr/bin/nvidia-docker-plugin | 2016/08/25 21:01:37 Plugins activated [VolumeDriver]
/usr/bin/nvidia-docker-plugin | 2016/08/25 21:17:49 Received create request for volume 'nvidia_driver_352.63'

@3XX0 (Member) commented Aug 25, 2016

Yeah, I'm pretty sure there is a bug there: https://github.com/NVIDIA/nvidia-docker/blob/master/tools/src/nvidia-docker/remote.go#L53-L63

Change the occurrences of the r variable in these lines to another name (e.g. r2); you will need to change the = into := as well. See if that fixes the issue.
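
To illustrate the kind of change being suggested, here is a simplified, hypothetical sketch (not the actual remote.go code): the response was being assigned into a name that was already in use, and the fix is to introduce a fresh variable with :=.

package main

import (
	"fmt"
	"net/http"
)

func queryPlugin(host string) error {
	r := "http://" + host + "/docker/cli" // r already holds the request URL

	// buggy pattern (schematically): the response was also assigned to r,
	// clobbering the earlier value
	// fixed pattern: declare a new variable r2 with := instead
	r2, err := http.Get(r)
	if err != nil {
		return err
	}
	defer r2.Body.Close()

	fmt.Println("GET", r, "->", r2.Status)
	return nil
}

func main() {
	_ = queryPlugin("localhost:3476")
}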

I will commit a fix asap.

@amacneil

Rebuilt deb with your suggestion, will test today.

@3XX0 (Member) commented Aug 29, 2016

Any update?

amacneil added a commit to amacneil/nvidia-docker that referenced this issue Aug 29, 2016
@amacneil

Unfortunately, I saw my issue again today (running the patched deb). The source of the applied patch is here: amacneil@78a02f0

Same error message:

Error response from daemon: create nvidia_driver_352.63: Post http://%2Fvar%2Flib%2Fnvidia-docker%2Fnvidia-docker.sock/VolumeDriver.Create: http: ContentLength=44 with Body length 0

Per my comment above, I still think my issue may actually be different from the OP's in this thread.

Any other ideas to try?

@3XX0 (Member) commented Aug 30, 2016

Yes, sorry, I got confused. I think my patch fixes the initial bug in this thread.
This one looks like it's coming from the Docker daemon.

I think there is a bug here in the retry case. Do you see the warning Unable to connect to plugin... in your Docker logs?

@amacneil

Yep, I see that; the full Docker logs are in my earlier comment.

I'm currently running Docker 1.11.2. I could try 1.12, although I assumed the older version would be more stable.

@3XX0 (Member) commented Aug 30, 2016

We should file a bug against Docker then.
It looks like the bug is present in every version, so 1.11 vs. 1.12 shouldn't matter.

Now, it's odd that Docker is retrying in your case. Maybe we should ensure that the plugin is started before the Docker daemon; we do that in systemd, but apparently not in upstart.

Can you try adding "and started docker" to the "start on" line in /etc/init/nvidia-docker.conf? This should force the dependency order.
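
A sketch of what that would look like (the existing conditions in the stanza are a placeholder here, since they depend on the installed nvidia-docker.conf):

# /etc/init/nvidia-docker.conf
start on (<existing conditions> and started docker)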

@amacneil

Just so I'm clear, what do you think the bug in Docker is?

I don't think this is a startup dependency issue, because in this case nvidia-docker started 30+ minutes before we saw the bug.

@3XX0 (Member) commented Aug 30, 2016

This is odd. Did you check that /var/lib/nvidia-docker/nvidia-docker.sock is still there when the issue occurs?

Regarding the bug, I think these lines need to move inside the for loop, because Do resets the request body.
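
In other words, something like this (a simplified, hypothetical sketch of the retry pattern, not Docker's actual plugin-client code): the request body is a one-shot reader, so re-sending the same request after Do has drained it is what produces "ContentLength=44 with Body length 0"; rebuilding the request inside the loop avoids that.

package main

import (
	"bytes"
	"fmt"
	"net/http"
	"time"
)

func postWithRetry(client *http.Client, url string, payload []byte) (*http.Response, error) {
	var resp *http.Response
	var err error
	for retries := uint(0); retries < 5; retries++ {
		// building the request (and its body) inside the loop is the fix;
		// the buggy version builds it once outside and reuses it, so the
		// retries go out with an already-drained body
		req, rerr := http.NewRequest("POST", url, bytes.NewReader(payload))
		if rerr != nil {
			return nil, rerr
		}
		resp, err = client.Do(req)
		if err == nil {
			return resp, nil
		}
		time.Sleep(time.Second << retries) // back off: 1s, 2s, 4s, 8s, 16s
	}
	return nil, err
}

func main() {
	client := &http.Client{Timeout: 5 * time.Second}
	// placeholder endpoint and payload, just to exercise the helper
	resp, err := postWithRetry(client, "http://plugin.example/VolumeDriver.Create", []byte(`{"Name":"nvidia_driver_352.63"}`))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}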

@amacneil

I see. It would still be interesting to know why nvidia-docker didn't respond correctly the first time.

I'll check for that socket the next time I see the bug; I don't have a machine available right now.

@amacneil commented Sep 7, 2016

Following up here. I never really fixed this error, but I came up with an acceptable workaround. I noticed that it only ever happens on the first CI build after booting a machine (even though sometimes there was a 15-30 minute delay between the machine booting and the first build being run).

To work around it, I simply added nvidia-docker run --rm nvidia/cuda:7.5-runtime nvidia-smi to my userdata boot script. If this fails, the machine never joins my worker pool and is eventually terminated as unhealthy by my auto scaling group.
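
Roughly like this (a sketch; how the instance registers with the worker pool will depend on your setup):

#!/bin/bash
# userdata boot script excerpt: sanity-check the GPU stack before joining the pool
set -e

# if this fails, the script aborts, the worker never registers, and the
# auto scaling group eventually replaces the instance as unhealthy
nvidia-docker run --rm nvidia/cuda:7.5-runtime nvidia-smi

# ...normal worker registration / CI agent startup continues here...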

So I didn't get to the bottom of it, but it's no longer an issue since we are just cycling the machines that hit the problem. Hopefully this or a similar solution helps anyone discovering this thread in the future.

@cancan101

Any update on this?

@flx42 (Member) commented Nov 8, 2016

@cancan101: are you having exactly the same problem? Do you have simple repro steps?

@cancan101 commented Nov 8, 2016

I am getting this error (slightly different from the OP's, the same as the later posts):

Error response from daemon: Post http:///var/lib/nvidia-docker/nvidia-docker.sock/VolumeDriver.Create: http: ContentLength=44 with Body length 0

I see it so far 100% of the time on the first call to nvidia-docker on a given instance. I am calling:

nvidia-docker run --rm nvidia/cuda nvidia-smi

After that call fails, subsequent calls succeed.

$ nvidia-docker version
NVIDIA Docker: 1.0.0~rc.3

Client:
 Version:      1.9.1
 API version:  1.21
 Go version:   go1.4.2
 Git commit:   a34a1d5
 Built:        Fri Nov 20 13:12:04 UTC 2015
 OS/Arch:      linux/amd64

Server:
 Version:      1.9.1
 API version:  1.21
 Go version:   go1.4.2
 Git commit:   a34a1d5
 Built:        Fri Nov 20 13:12:04 UTC 2015
 OS/Arch:      linux/amd64

@3XX0 (Member) commented Feb 6, 2017

I think this is fixed in 1.0.0.

@haoyangz commented Sep 26, 2017

Still observing what @cancan101 reported, now with 1.0.1. I also put nvidia-docker run --rm nvidia/cuda nvidia-smi at the beginning of my script, but it still fails about 10% of the time.

$ nvidia-docker version
NVIDIA Docker: 1.0.1

Client:
 Version:      17.05.0-ce
 API version:  1.29
 Go version:   go1.7.5
 Git commit:   89658be
 Built:        Thu May  4 22:06:06 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.05.0-ce
 API version:  1.29 (minimum version 1.12)
 Go version:   go1.7.5
 Git commit:   89658be
 Built:        Thu May  4 22:06:06 2017
 OS/Arch:      linux/amd64
 Experimental: false
