Remote nvidia-docker occasionally fails #123
Comments
I tried to repro the problem using the instructions here, but I couldn't. Which version of |
Thanks for the reply. Here's the info:
nvidia-docker version: 1.0.0rc3
~80% of the machines I've provisioned. |
So far no luck reproducing the nvidia-docker issue, but I do get
quite reliably :) |
Hmm, I haven't run into that myself. In the last week I've started 92 remote instances on EC2; 16 of those failed with the above error. None have failed from hanging. Is there any debug output from |
@3XX0 Did you ever see anything like that? Any idea to troubleshoot the issue? |
Can you try with Go 1.6? You need to change the first line here and rebuild from source: |
Sure, I'll give it a try. |
Did it fix the issue? |
Ping @jimfleming |
Sorry, I did not get around to rebuilding with a newer Go. On the upside, I don't seem to be seeing the issue anymore. (I've rebuilt the AMI for other reasons which may have had an effect.) |
I am seeing this issue a small percentage of the time on our CI servers. Note that I'm not using nvidia-docker wrapper, only the docker CLI arguments (fetched in a previous step from the http api):
Is the recommended fix still to rebuild with Go 1.6? I'm ok with waiting until the next version of nvidia-docker is released if that is likely to fix it too. Let me know if you would like to attempt any debugging steps as this seems to be happening at least once per day for us. |
@amacneil We don't know, actually; we weren't able to reproduce the problem. |
@flx42 Any thoughts on using Go 1.7? Have you guys tested with that at all? |
No, we didn't test it, but I guess it will work just fine. |
Ok, I rebuilt with Go 1.6 for now so I'll report back whether we continue to see this issue. If hypothetically Go 1.6 fixes it, it sounds like you guys may still be stuck using 1.5 for the time being... or would that just mean that you need to drop support for docker 1.9? |
Yes, we'll drop docker 1.9 if that fixes the issue. But I have the feeling it's going to be difficult to know for sure if the problem is fixed by this change. |
Ok, sounds good. We've been seeing the error above at least a few times per day. I just rolled out a custom deb with Go 1.6 so if we don't see anything for a couple of days it will be a good sign. Also, if there's any other info I could collect to help debug this let me know. |
Unfortunately, I'm still seeing this issue with the Go 1.6 build. I'm honestly not sure if it's the same problem as OP's in this thread (different error message), but it's definitely causing occasional failures in our CI environment.
Is there anything I can do to help collect debug logs or try to diagnose this? |
My issue is actually much more similar to #150.
/var/log/upstart/docker.log:
/var/log/upstart/nvidia-docker.log:
|
Yeah, pretty sure there is a bug there: https://github.com/NVIDIA/nvidia-docker/blob/master/tools/src/nvidia-docker/remote.go#L53-L63
Change the occurrence of the
I will commit a fix asap. |
Rebuilt deb with your suggestion, will test today. |
Any update? |
As suggested by NVIDIA#123 (comment)
Unfortunately, I saw my issue again today (when running the patched deb). The source of the applied patch is here: amacneil@78a02f0. Same error message:
Per comment above I still think my issue may actually be different than OP in this thread. Any other ideas to try? |
Yes, sorry, I got confused. I think my patch fixes the initial bug in this thread. I think there is a bug here in case of retry. Do you see the warning |
Yep, I see that. See the full Docker logs in my earlier comment. I'm currently running Docker 1.11.2. I could try with 1.12, although I assumed that the older version would be more stable. |
We should file a bug against Docker then. Now, it's weird that Docker is retrying in your case. Maybe we should ensure that the plugin is started before the Docker daemon. We do it in systemd but not in upstart apparently. Can you try adding |
Just so I'm clear, what do you think is the bug in docker? I don't think this is a startup dependency issue, because in this case nvidia-docker started 30+ minutes before we saw the bug. |
This is odd. Did you check that
Regarding the bug, I think these lines need to go in the for loop because |
I see. It would still be interesting to know why nvidia-docker didn't respond correctly the first time. I'll check for that socket next time I see the bug; I don't have a machine available right now. |
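For anyone wanting to run the socket check discussed above, here is a minimal sketch in Python. It is not part of nvidia-docker; the function name `check_unix_socket` and its status strings are made up for illustration, and the socket path is left as a parameter because the exact path is not given in this thread. It only distinguishes between a missing path, a path that is not a socket, a socket file that nothing is listening on, and a socket that accepts connections.

```python
import os
import socket
import stat


def check_unix_socket(path, timeout=1.0):
    """Return a short status string for the Unix socket at `path`.

    Hypothetical helper: `path` is whatever plugin socket you want to inspect.
    """
    try:
        mode = os.stat(path).st_mode
    except FileNotFoundError:
        return "missing"
    if not stat.S_ISSOCK(mode):
        return "not a socket"
    client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    client.settimeout(timeout)
    try:
        client.connect(path)
    except OSError:
        # The socket file exists, but the connection was refused or timed out
        return "stale"
    finally:
        client.close()
    return "listening"
```

Calling `check_unix_socket(path)` right after a failure, and again a few seconds later, might help tell apart "the plugin was not listening yet" from "the plugin was listening but answered incorrectly".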
Following up here. I never really fixed this error, but I came up with an acceptable workaround. I noticed that it only ever happens on the first CI build after booting a machine (even though sometimes there was a 15-30 minute delay between the machine booting and the first build being run). To solve this I simply added
So, I didn't get to the bottom of it, but it's no longer an issue since we are just cycling machines with the problem. Hopefully this or a similar solution might help anyone discovering this thread in future. |
Any update on this? |
@cancan101: are you having exactly the same problem? Do you have simple repro steps? |
I am getting this (slightly different error from OP, same as later post):
I see it so far 100% of the time on the first call to
After that call fails, subsequent calls succeed.
|
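Since the report above is that the first call fails and later calls succeed, one stopgap (not a fix) is to wrap the failing command in a retry. The sketch below is an assumption-laden illustration in Python: `run_with_retry`, its parameters, and the backoff policy are hypothetical and not part of nvidia-docker or Docker, and `cmd` stands for whatever command fails on the first call in your setup.

```python
import subprocess
import time


def run_with_retry(cmd, attempts=3, delay=2.0):
    """Run `cmd` (a list of arguments) until it exits 0 or attempts run out.

    Returns the subprocess.CompletedProcess of the last attempt.
    Hypothetical helper for illustration only.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            return result
        if attempt < attempts:
            # Linear backoff: wait a little longer after each failure
            time.sleep(delay * attempt)
    return result
```

The caller still has to check `returncode` on the returned object, and logging `stderr` from the failed attempts is worth keeping, since a retry like this hides the underlying problem rather than explaining it.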
I think this is fixed with 1.0.0 |
Still observing what @cancan101 reported with 1.0.1. Now I also put
|
Here's the error I get:
And the bash script:
FWIW, this works correctly ~80% of the time. Any tips on troubleshooting this further or possible causes would be super helpful.