UDP from Docker/Podman machine on Mac/Windows times out after 90 seconds #654
Comments
As expected, using an Ubuntu machine with regular docker seems to work fine. After trying a few other Docker Desktop replacement systems including OrbStack and Rancher, I ended up with Colima and the containerd backend as provided by nerdctl. Unfortunately that system is still not nearly mature enough (and has its own special UDP issues), so I'm giving up on Mac development for now until I can work out what is going on.
Thank you for the incredibly detailed and researched information. I'm a little confused by the "UDP connection timeout" since UDP is connectionless. COSMOS determines that a UDP interface is "connected" simply by whether we allocated a port. Are we saying that the UDP socket stops listening for packets after 90s, so that future packet transmissions are not received?
The language is kind of tricky. I'm using "timeout" for lack of a better word rather than to mean some actual error related to a closed connection (since, as you point out, UDP itself is connectionless). The issue is sort of the other way around: a UDP socket (apparently) stops sending packets out of Docker Desktop if it does not send any for 90 seconds. Now that you mention it though, I'm not positive I've tested the other direction (whether a UDP socket stops listening if it gets nothing for 90 seconds). I've been focused on the commanding side and have not considered the telemetry side since it just worked (in our HERMES use-case, there are separate UDP ports for command and telemetry). From the point of view of the COSMOS containers, everything is working as intended (it works fine on a real Linux machine running docker directly). I haven't yet raised this issue with Podman Machine or Docker Desktop (or OrbStack or....) since I wasn't totally clear about the language to describe the problem.
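The behavior described above (one socket, a send, a long silence, then another send) can be sketched as a small probe. This is only an illustration of the idea, not the test script used in this thread: the function names `idle_gap_probe` and `collect` are made up for the example, and the listener is assumed to be any UDP receiver reachable from where the probe runs.

```python
import socket
import time


def idle_gap_probe(dest, idle_seconds, payload=b"ping"):
    """Send two datagrams from ONE UDP socket, separated by an idle gap.

    Returns the local (source) port that both datagrams were sent from.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload + b" before idle", dest)
        source_port = sock.getsockname()[1]  # port the OS picked on first send
        time.sleep(idle_seconds)  # stay silent, like a break between commands
        sock.sendto(payload + b" after idle", dest)
        return source_port


def collect(listener, count, timeout):
    """Receive up to `count` datagrams; stop early if `timeout` elapses."""
    listener.settimeout(timeout)
    received = []
    try:
        while len(received) < count:
            data, addr = listener.recvfrom(2048)
            received.append((data, addr[1]))  # keep payload and source port
    except socket.timeout:
        pass
    return received
```

Running `idle_gap_probe` inside a container with `idle_seconds` above 90, against a listener on the host, would be one way to check whether the second datagram still arrives. On a plain loopback interface both datagrams should show up with the same source port.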
Our primary development platform is macOS with Docker Desktop, so we'll see if we can reproduce at some point.
I put together two quick scripts for showing the issue; hopefully the UDP issue replicates on your systems too. All of the TCP messages work as expected, but the final UDP message just never comes through.
TCP Tester
UDP Tester
The TCP test works fine for me. With the UDP one I actually get nothing, which is surprising, and I'll have to experiment some more.
👋 OrbStack developer here. Came across this issue while searching GitHub and thought I'd chime in. I can't speak for other Docker providers, but in OrbStack's case, there are two issues here:
ncat only seems to accept up to one "connection", so it stops printing packets after the source port changes. I think it's possible to fix the connection persistence issue, but it'll be challenging. Feel free to open an issue on the OrbStack repo as well.
Thought about this some more and figured out a relatively simple solution. Both issues should be fixed in the newly-released OrbStack v0.10.2. The reproducer above still won't work because the client's source port [...] There's still another NAT layer that might cause issues after 2–3 minutes of idle, but I think it's not much different from standard Docker-on-Linux setups. Let me know how it goes with the real COSMOS use case!
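To illustrate why a changed source port can make a listener go quiet, here is a loopback-only sketch. It assumes one possible mechanism, namely that the listener calls `connect()` on its UDP socket after the first datagram; that is an assumption made for the demo, not a claim about ncat's internals, and `demo_single_peer_listener` is a hypothetical name.

```python
import socket


def recv_or_none(sock, timeout=0.5):
    """Return the next datagram payload, or None if nothing arrives in time."""
    sock.settimeout(timeout)
    try:
        return sock.recvfrom(2048)[0]
    except socket.timeout:
        return None


def demo_single_peer_listener():
    """Show that a connect()-ed UDP socket ignores other source ports."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    first = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)  # other source port
    try:
        listener.bind(("127.0.0.1", 0))
        dest = listener.getsockname()

        first.sendto(b"hello", dest)
        listener.settimeout(2.0)
        data, peer = listener.recvfrom(2048)
        listener.connect(peer)  # from now on, only `peer` is accepted

        second.sendto(b"from new port", dest)
        from_second = recv_or_none(listener)  # filtered out by the kernel

        first.sendto(b"from original port", dest)
        from_first = recv_or_none(listener)  # same address and port: delivered
        return data, from_second, from_first
    finally:
        for s in (listener, first, second):
            s.close()
```

If a NAT layer between container and host rewrites the source port after an idle period, a listener of this kind would see the later packets as coming from a stranger, which matches the symptom of packets no longer being printed.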
I don't think I fully understand why the source port matching the ncat server port on localhost only works for me for very short sleep times, but that's not actually an issue in production. I do see that the new OrbStack has fixed the UDP issue if I use an alternative source port! That's very exciting! It also works in COSMOS as expected. Thank you @kdrag0n!! I can develop on COSMOS much, much faster now. I've not seen a multi-minute idle timeout from Docker-on-Linux, so hopefully we've dodged that issue here as well (somehow). It seems like I should probably go ahead and report some form of this issue to Docker Desktop as well as to Podman, but first I need to clean up my test scripts to use a different source port than the ncat server port.
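A test client that pins its own source port, and refuses to reuse the server's port, could look roughly like the following. This is a sketch under assumptions, not the cleaned-up script mentioned above: `send_from_port` is a hypothetical helper, and passing `0` as the source port lets the OS choose one.

```python
import socket


def send_from_port(payload, dest, source_port=0):
    """Send one datagram from an explicitly chosen local UDP port.

    `dest` is a (host, port) tuple. Returns the source port actually used.
    """
    if source_port == dest[1]:
        # The case the thread wants to avoid: client and server on one port.
        raise ValueError("source port must differ from the server port")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.bind(("", source_port))  # choose the source port before sending
        sock.sendto(payload, dest)
        return sock.getsockname()[1]
```

For example, `send_from_port(b"test", ("127.0.0.1", 1234), source_port=4321)` would send from port 4321 to a listener on port 1234; both port numbers here are placeholders.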
Great to hear, thanks for sharing your results! Also good to know that the other potential source of timeouts isn't actually a problem.
This is primarily a docker/podman or user issue, but I figured I'd include my writeup here anyways. Something in the network stack of both Docker and Podman seems to time out a UDP connection after 90 seconds. In our COSMOS use-case, this shows up when we send commands to our platform computer emulator and then take a break. After returning, the connection has to be reset (closing and reopening the interface) in order to send more commands.
Working backwards, I ruled out issues with the emulator hardware, the COSMOS gem, COSMOS' ecosystem and my laptop itself. At this point I'm pretty sure that the issue occurs at the interface between the Linux machine that runs docker inside Docker Desktop (or Podman machine) and the host. The issue does not occur with a Docker installation on a pure Linux box.
I've identified this behavior on macOS 13.3.1 on an M2 Max and Windows 11 on an Intel i7, primarily running Docker Desktop, though I've also tried Podman and OrbStack on the Mac.
The following is from a conversation with @MTI-twalker
I’ve tested regular ncat out of the host machines without any timeout issues. I then proceeded to try nc out of the COSMOS operator container and got the timeout behavior. Next step is to try a generic docker container without COSMOS. Something as basic as using nc in a pure alpine container like
docker run -p 1234:1234/udp -it alpine ash
ends up timing out silently after 90 seconds. So I do think this is a Docker issue. I did find a few items that sound very similar to what I’m seeing:
https://forums.docker.com/t/udp-stream-timeout/114185
https://stackoverflow.com/questions/58031315/udp-port-forwarding-not-working-with-docker-on-windows-10
Tragically yes, everything that isn’t Linux-based has to use some form of Docker Desktop (or alternative) to create a Linux machine that then runs docker inside it. I’ve now learned far more than I wanted to know about the underlying bits of the Docker ecosystem. I did note that the newest Docker Desktop for Mac (as of a few days ago) uses Google’s gVisor instead of their original vpnKit system. I’m doubtful that the switch to gVisor is related (especially since the Windows version is still using vpnKit) but it is at least one lead. https://www.docker.com/blog/docker-desktop-4-19/
I’ve been searching high and low and I can find lots of people reporting somewhat similar timeout issues, though not the specific 90 seconds that I see.
I tried updating vpnKitMaxPortIdleTime to 0 in ~/Library/Group\ Containers/group.com.docker/settings.json per docker/for-mac#2197, but I’m not sure whether it makes any difference. The default is 300, but I’m definitely getting a 90 second timeout rather than a 300 second timeout.
https://github.com/docker/for-win/issues/2639 (Perhaps windows equivalent of Mac issue 2197?)
https://github.com/moby/vpnkit/issues/587 (Maybe related, though based on server load supposedly)
https://github.com/docker/for-win/issues/8861 (Seems like the opposite issue, too many requests causes problems not too few)
https://github.com/moby/moby/issues/8795 (Maybe related, requires flushing the “conntrack” table to fix)
https://forums.docker.com/t/udp-stream-timeout/114185 (Unsolved forum post that specifically mentions UDP)
https://stackoverflow.com/questions/68639603/inactive-tcp-sockets-disconnecting-in-docker-for-windows-wsl-2 (Goes back to the config file)
In terms of my own debugging, I’ve found that I can sometimes actually get a “Ncat: Connection refused” if I try to send messages back from the Mac host running an ncat listener to the docker container running nc after 90 seconds of no messages. Reloading the listener does not make a difference; nc on the container has to be restarted.
In the settings.json, switching back to "networkType": "vpnkit" does not seem to make a difference.
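For repeatable experiments with these settings, the edits to settings.json can be scripted. The sketch below assumes the file is plain JSON with top-level keys, which is what the `vpnKitMaxPortIdleTime` and `networkType` entries mentioned above suggest; `set_docker_desktop_setting` is a hypothetical helper, not part of any Docker tooling.

```python
import json
from pathlib import Path


def set_docker_desktop_setting(settings_path, key, value):
    """Set one top-level key in a JSON settings file.

    Returns the previous value (None if the key was absent) so the
    change can be reverted after the experiment.
    """
    path = Path(settings_path).expanduser()
    settings = json.loads(path.read_text())
    previous = settings.get(key)
    settings[key] = value
    path.write_text(json.dumps(settings, indent=2) + "\n")
    return previous
```

A call such as `set_docker_desktop_setting("~/Library/Group Containers/group.com.docker/settings.json", "vpnKitMaxPortIdleTime", 0)` would apply the change tried above. Docker Desktop most likely needs to be restarted before it picks up the new value, and it is worth backing up the file first.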