This repository has been archived by the owner on Jan 22, 2024. It is now read-only.

Remote nvidia-docker occasionally fails #123

Closed
jimfleming opened this issue Jun 27, 2016 · 35 comments

@jimfleming

Here's the error I get:

nvidia-docker | 2016/06/27 19:24:29 Error: Get http://plugin/docker/cli?vol=nvidia_driver&dev=: EOF

And the bash script:

# point nvidia-docker at the remote nvidia-docker-plugin over SSH
export NV_HOST="ssh://username@$(docker-machine ip $MACHINE_NAME):"
# load the docker-machine key so nvidia-docker can authenticate over SSH
eval $(ssh-agent -s)
ssh-add ~/.docker/machine/machines/$MACHINE_NAME/id_rsa
nvidia-docker run "$DOCKER_IMAGE" "$DOCKER_ARG"

FWIW, this works correctly ~80% of the time. Any tips on troubleshooting this further or possible causes would be super helpful.

@flx42 (Member) commented Jun 28, 2016

I tried to repro the problem using the instructions here, but I couldn't.

Which version of nvidia-docker do you have? And when you say 80% of the time, do you mean it works 80% of the time when accessing the same machine?
Or 80% of the machines you provision have this problem?

@jimfleming (Author)

Thanks for the reply. Here's the info:

Which version of nvidia-docker do you have?

nvidia-docker version: 1.0.0rc3

And when you say 80% of the time, do you mean it works 80% of the time when accessing the same machine?
Or 80% of the machines you provision have this problem?

It works on roughly 80% of the machines I've provisioned.

@flx42 (Member) commented Jun 28, 2016

So far no luck reproducing the nvidia-docker issue, but I do get docker-machine hanging on

Installing Docker...

quite reliably :)

@jimfleming (Author)

Hmm, I haven't run into that myself. In the last week I've started 92 remote instances on EC2; 16 of those failed with the above error. None have failed by hanging.

Is there any debug output from nvidia-docker that would be helpful to capture the next time I encounter the issue?

@flx42 (Member) commented Jun 29, 2016

@3XX0 Did you ever see anything like that? Any ideas on how to troubleshoot it?

@3XX0 (Member) commented Jun 30, 2016

Can you try with Go 1.6? You need to change the first line here and rebuild from source:
https://github.com/NVIDIA/nvidia-docker/blob/master/tools/Dockerfile.build#L1
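
For reference, the change would just be the base-image tag on that first line (a sketch, assuming the file currently pins a golang:1.5 image and the rest of it stays untouched):

# tools/Dockerfile.build, first line only
FROM golang:1.6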

@jimfleming (Author)

Sure, I'll give it a try.

flx42 added the bug label on Jul 1, 2016
@3XX0 (Member) commented Jul 11, 2016

Did it fix the issue?

@flx42 (Member) commented Aug 3, 2016

Ping @jimfleming

@jimfleming (Author)

Sorry, I did not get around to rebuilding with a newer Go. On the upside, I don't seem to be seeing the issue anymore. (I've rebuilt the AMI for other reasons, which may have had an effect.)

@amacneil

I am seeing this issue a small percentage of the time on our CI servers. Note that I'm not using the nvidia-docker wrapper, only the docker CLI arguments (fetched in a previous step from the HTTP API):

docker run -t --cpuset-cpus=2-3,5,10-11,13 --volume-driver=nvidia-docker --volume=nvidia_driver_352.63:/usr/local/nvidia:ro --device=/dev/nvidiactl --device=/dev/nvidia-uvm --device=/dev/nvidia0 --device=/dev/nvidia1 --device=/dev/nvidia2 --device=/dev/nvidia3 image foo.sh
Error response from daemon: create nvidia_driver_352.63: Post http://%2Fvar%2Flib%2Fnvidia-docker%2Fnvidia-docker.sock/VolumeDriver.Create: http: ContentLength=44 with Body length 0
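
For context, those CLI arguments come from nvidia-docker-plugin's remote API in a previous build step, roughly like this (a sketch, assuming the plugin's default remote API port 3476):

# ask the plugin for the docker CLI arguments, then splice them into docker run
NV_ARGS=$(curl -s http://localhost:3476/docker/cli)
docker run -t $NV_ARGS image foo.sh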

Is the recommended fix still to rebuild with Go 1.6? I'm OK with waiting until the next version of nvidia-docker is released if that is likely to fix it too. Let me know if you would like me to attempt any debugging steps, as this is happening at least once per day for us.

@flx42 (Member) commented Aug 24, 2016

@amacneil We don't know, actually; we weren't able to reproduce the problem.
If you can test with Go 1.6 and confirm that it removes the bug, we will gladly bump the Go version before the 1.0.0 release.

@amacneil

@flx42 Any thoughts on using Go 1.7? Have you tested with that at all?

@3XX0 (Member) commented Aug 24, 2016

No, we didn't test it, but I expect it would work just fine.
We have to stick with Go 1.5 due to a Docker issue (see #83).

@amacneil

OK, I rebuilt with Go 1.6 for now, so I'll report back on whether we continue to see this issue. If Go 1.6 does fix it, it sounds like you may still be stuck on 1.5 for the time being... or would that just mean you need to drop support for Docker 1.9?

@flx42 (Member) commented Aug 24, 2016

Yes, we'll drop Docker 1.9 if that fixes the issue. But I have a feeling it will be difficult to know for sure whether the problem is fixed by this change.

@amacneil

OK, sounds good. We've been seeing the error above at least a few times per day. I just rolled out a custom deb with Go 1.6, so if we don't see anything for a couple of days that will be a good sign.

Also, if there's any other info I could collect to help debug this let me know.

@amacneil

Unfortunately, we're still seeing this issue with the Go 1.6 build. I'm honestly not sure whether it's the same problem as the OP's (the error message is different), but it's definitely causing occasional failures in our CI environment.

Error response from daemon: create nvidia_driver_352.63: Post http://%2Fvar%2Flib%2Fnvidia-docker%2Fnvidia-docker.sock/VolumeDriver.Create: http: ContentLength=44 with Body length 0

Is there anything I can do to help collect debug logs or try to diagnose this?

@amacneil

My issue is actually much more similar to #150

/var/log/upstart/docker.log:

time="2016-08-25T21:17:55.021899212Z" level=warning msg="Unable to connect to plugin: /var/lib/nvidia-docker/nvidia-docker.sock:/VolumeDriver.Create, retrying in 1s"
time="2016-08-25T21:17:56.022472840Z" level=warning msg="Unable to connect to plugin: /var/lib/nvidia-docker/nvidia-docker.sock:/VolumeDriver.Create, retrying in 2s"
time="2016-08-25T21:17:58.023103941Z" level=warning msg="Unable to connect to plugin: /var/lib/nvidia-docker/nvidia-docker.sock:/VolumeDriver.Create, retrying in 4s"
time="2016-08-25T21:18:02.023665967Z" level=warning msg="Unable to connect to plugin: /var/lib/nvidia-docker/nvidia-docker.sock:/VolumeDriver.Create, retrying in 8s"
time="2016-08-25T21:18:18.636113840Z" level=error msg="Handler for POST /v1.21/containers/create returned error: create nvidia_driver_352.63: Post http://%2Fvar%2Flib%2Fnvidia-docker%2Fnvidia-docker.sock/VolumeDriver.Create: http: ContentLength=44 with Body length 0"

/var/log/upstart/nvidia-docker.log:

/usr/bin/nvidia-docker-plugin | 2016/08/25 20:52:31 Loading NVIDIA unified memory
/usr/bin/nvidia-docker-plugin | 2016/08/25 20:52:31 Loading NVIDIA management library
/usr/bin/nvidia-docker-plugin | 2016/08/25 20:52:34 Discovering GPU devices
/usr/bin/nvidia-docker-plugin | 2016/08/25 20:52:35 Provisioning volumes at /var/lib/nvidia-docker/volumes
/usr/bin/nvidia-docker-plugin | 2016/08/25 20:52:35 Serving plugin API at /var/lib/nvidia-docker
/usr/bin/nvidia-docker-plugin | 2016/08/25 20:52:35 Serving remote API at localhost:3476
/usr/bin/nvidia-docker-plugin | 2016/08/25 21:01:37 Received activate request
/usr/bin/nvidia-docker-plugin | 2016/08/25 21:01:37 Plugins activated [VolumeDriver]
/usr/bin/nvidia-docker-plugin | 2016/08/25 21:17:49 Received create request for volume 'nvidia_driver_352.63'

@3XX0 (Member) commented Aug 25, 2016

Yeah, I'm pretty sure there is a bug there: https://github.com/NVIDIA/nvidia-docker/blob/master/tools/src/nvidia-docker/remote.go#L53-L63

Change the occurrences of the r variable in these lines to another name (e.g. r2); you will need to change the = into := as well. See if that fixes the issue.
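
To illustrate the kind of change being suggested, here is a simplified, hypothetical sketch (not the actual remote.go code): the response was being assigned into a name that was already in use, and the fix is to introduce a fresh variable with :=.

package main

import (
	"fmt"
	"net/http"
)

func queryPlugin(host string) error {
	r := "http://" + host + "/docker/cli" // r already holds the request URL

	// buggy pattern (schematically): the response was also assigned to r,
	// clobbering the earlier value
	// fixed pattern: declare a new variable r2 with := instead
	r2, err := http.Get(r)
	if err != nil {
		return err
	}
	defer r2.Body.Close()

	fmt.Println("GET", r, "->", r2.Status)
	return nil
}

func main() {
	_ = queryPlugin("localhost:3476")
}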

I will commit a fix asap.

@amacneil

Rebuilt deb with your suggestion, will test today.

@3XX0 (Member) commented Aug 29, 2016

Any update?

amacneil added a commit to amacneil/nvidia-docker that referenced this issue Aug 29, 2016
@amacneil

Unfortunately, I saw my issue again today (running the patched deb). The source of the applied patch is here: amacneil@78a02f0

Same error message:

Error response from daemon: create nvidia_driver_352.63: Post http://%2Fvar%2Flib%2Fnvidia-docker%2Fnvidia-docker.sock/VolumeDriver.Create: http: ContentLength=44 with Body length 0

Per my comment above, I still think my issue may actually be different from the OP's in this thread.

Any other ideas to try?

@3XX0 (Member) commented Aug 30, 2016

Yes, sorry, I got confused. I think my patch fixes the initial bug in this thread.
This one looks like it's coming from the Docker daemon.

I think there is a bug here in the retry case. Do you see the warning Unable to connect to plugin... in your Docker logs?

@amacneil

Yep, I see that; the full Docker logs are in my earlier comment.

I'm currently running Docker 1.11.2. I could try 1.12, although I assumed the older version would be more stable.

@3XX0 (Member) commented Aug 30, 2016

We should file a bug against Docker then.
It looks like the bug is present in every version, so 1.11 vs. 1.12 shouldn't matter.

Now, it's odd that Docker is retrying in your case. Maybe we should ensure that the plugin is started before the Docker daemon; we do that in systemd, but apparently not in upstart.

Can you try adding "and started docker" to the "start on" line in /etc/init/nvidia-docker.conf? This should force the dependency order.
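
A sketch of what that would look like (the existing conditions in the stanza are a placeholder here, since they depend on the installed nvidia-docker.conf):

# /etc/init/nvidia-docker.conf
start on (<existing conditions> and started docker)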

@amacneil

Just so I'm clear, what do you think the bug in Docker is?

I don't think this is a startup dependency issue, because in this case nvidia-docker started 30+ minutes before we saw the bug.

@3XX0 (Member) commented Aug 30, 2016

This is odd. Did you check that /var/lib/nvidia-docker/nvidia-docker.sock is still there when the issue occurs?

Regarding the bug, I think these lines need to move inside the for loop, because Do resets the request body.
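
In other words, something like this (a simplified, hypothetical sketch of the retry pattern, not Docker's actual plugin-client code): the request body is a one-shot reader, so re-sending the same request after Do has drained it is what produces "ContentLength=44 with Body length 0"; rebuilding the request inside the loop avoids that.

package main

import (
	"bytes"
	"fmt"
	"net/http"
	"time"
)

func postWithRetry(client *http.Client, url string, payload []byte) (*http.Response, error) {
	var resp *http.Response
	var err error
	for retries := uint(0); retries < 5; retries++ {
		// building the request (and its body) inside the loop is the fix;
		// the buggy version builds it once outside and reuses it, so the
		// retries go out with an already-drained body
		req, rerr := http.NewRequest("POST", url, bytes.NewReader(payload))
		if rerr != nil {
			return nil, rerr
		}
		resp, err = client.Do(req)
		if err == nil {
			return resp, nil
		}
		time.Sleep(time.Second << retries) // back off: 1s, 2s, 4s, 8s, 16s
	}
	return nil, err
}

func main() {
	client := &http.Client{Timeout: 5 * time.Second}
	// placeholder endpoint and payload, just to exercise the helper
	resp, err := postWithRetry(client, "http://plugin.example/VolumeDriver.Create", []byte(`{"Name":"nvidia_driver_352.63"}`))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}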

@amacneil

I see. It would still be interesting to know why nvidia-docker didn't respond correctly the first time.

I'll check for that socket the next time I see the bug; I don't have a machine available right now.

@amacneil commented Sep 7, 2016

Following up here. I never really fixed this error, but I came up with an acceptable workaround. I noticed that it only ever happens on the first CI build after booting a machine (even though sometimes there was a 15-30 minute delay between the machine booting and the first build being run).

To work around it, I simply added nvidia-docker run --rm nvidia/cuda:7.5-runtime nvidia-smi to my userdata boot script. If this fails, the machine never joins my worker pool and is eventually terminated as unhealthy by my auto scaling group.
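
Roughly like this (a sketch; how the instance registers with the worker pool will depend on your setup):

#!/bin/bash
# userdata boot script excerpt: sanity-check the GPU stack before joining the pool
set -e

# if this fails, the script aborts, the worker never registers, and the
# auto scaling group eventually replaces the instance as unhealthy
nvidia-docker run --rm nvidia/cuda:7.5-runtime nvidia-smi

# ...normal worker registration / CI agent startup continues here...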

So I didn't get to the bottom of it, but it's no longer an issue since we are just cycling the machines that hit the problem. Hopefully this or a similar solution helps anyone discovering this thread in the future.

@cancan101

Any update on this?

@flx42 (Member) commented Nov 8, 2016

@cancan101: are you having exactly the same problem? Do you have simple repro steps?

@cancan101 commented Nov 8, 2016

I am getting this error (slightly different from the OP's, the same as the later posts):

Error response from daemon: Post http:///var/lib/nvidia-docker/nvidia-docker.sock/VolumeDriver.Create: http: ContentLength=44 with Body length 0

I see it so far 100% of the time on the first call to nvidia-docker on a given instance. I am calling:

nvidia-docker run --rm nvidia/cuda nvidia-smi

After that call fails, subsequent calls succeed.

$ nvidia-docker version
NVIDIA Docker: 1.0.0~rc.3

Client:
 Version:      1.9.1
 API version:  1.21
 Go version:   go1.4.2
 Git commit:   a34a1d5
 Built:        Fri Nov 20 13:12:04 UTC 2015
 OS/Arch:      linux/amd64

Server:
 Version:      1.9.1
 API version:  1.21
 Go version:   go1.4.2
 Git commit:   a34a1d5
 Built:        Fri Nov 20 13:12:04 UTC 2015
 OS/Arch:      linux/amd64

@3XX0 (Member) commented Feb 6, 2017

I think this is fixed in 1.0.0.

@haoyangz commented Sep 26, 2017

Still observing what @cancan101 reported, now with 1.0.1. I also put nvidia-docker run --rm nvidia/cuda nvidia-smi at the beginning of my script, but it still fails about 10% of the time.

$ nvidia-docker version
NVIDIA Docker: 1.0.1

Client:
 Version:      17.05.0-ce
 API version:  1.29
 Go version:   go1.7.5
 Git commit:   89658be
 Built:        Thu May  4 22:06:06 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.05.0-ce
 API version:  1.29 (minimum version 1.12)
 Go version:   go1.7.5
 Git commit:   89658be
 Built:        Thu May  4 22:06:06 2017
 OS/Arch:      linux/amd64
 Experimental: false
