UDP from Docker/Podman machine on Mac/Windows times out after 90 seconds #654

Closed
scandey opened this issue May 7, 2023 · 11 comments
Labels
triage More information is needed

Comments

@scandey

scandey commented May 7, 2023

This is primarily a docker/podman or user issue, but I figured I'd include my writeup here anyway. Something in the network stack of both Docker and Podman seems to time out a UDP connection after 90 seconds. In our COSMOS use-case, this shows up when we send commands to our platform computer emulator and then take a break. After returning, the connection has to be reset (closing and reopening the interface) in order to send more commands.

Working backwards, I ruled out issues with the emulator hardware, the COSMOS gem, COSMOS' ecosystem and my laptop itself. At this point I'm pretty sure that the issue occurs at the interface between the linux machine that runs docker inside Docker Desktop (or Podman machine) and the host. The issue does not occur on a pure Linux box Docker installation.

I've identified this behavior on macOS 13.3.1 on an M2 Max and Windows 11 on an Intel i7, primarily running Docker Desktop, though I've also tried Podman and OrbStack on the Mac.

The following is from a conversation with @MTI-twalker

I’ve tested regular ncat out of the host machines without any timeout issues. I then proceeded to try nc out of the COSMOS operator container and got the timeout behavior. The next step was to try a generic docker container without COSMOS. Even something as basic as using nc in a pure alpine container, started with docker run -p 1234:1234/udp -it alpine ash, ends up timing out silently after 90 seconds. So I do think this is a Docker issue.

I did find a few items that sound very similar to what I’m seeing:
https://forums.docker.com/t/udp-stream-timeout/114185
https://stackoverflow.com/questions/58031315/udp-port-forwarding-not-working-with-docker-on-windows-10

Tragically yes, everything that isn’t Linux-based has to use some form of Docker Desktop (or alternative) to create a linux machine that then runs docker inside it. I’ve now learned far more than I wanted to know about the underlying bits of the Docker ecosystem. I did note that the newest Docker Desktop for Mac (as of a few days ago) uses Google’s gVisor instead of their original vpnKit system. I’m doubtful that the switch to gVisor is related (especially since the Windows version is still using vpnKit) but it is at least one lead.  https://www.docker.com/blog/docker-desktop-4-19/

I’ve been searching high and low and I can find lots of people reporting somewhat similar timeout issues, though not the specific 90 seconds that I see.

I tried updating vpnKitMaxPortIdleTime to 0 in ~/Library/Group\ Containers/group.com.docker/settings.json per docker/for-mac#2197, but I’m not sure whether it makes any difference. The default is 300, but I’m definitely getting a 90 second timeout rather than a 300 second timeout.
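To confirm which value is actually in effect before and after such an edit, a small sketch like the following could read the key back from the file. The path and key name are the ones mentioned above; the `setting_value` helper is hypothetical, and the line-based match assumes the key and its numeric value sit on the same line.

```shell
#!/bin/sh
# Print the numeric value of a key from a JSON settings file.
# Naive line-based match: assumes "key": <number> appears on one line.
setting_value() {
    file=$1
    key=$2
    sed -n "s/.*\"$key\"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p" "$file"
}

# Docker Desktop for Mac settings path, as quoted in this thread.
SETTINGS="$HOME/Library/Group Containers/group.com.docker/settings.json"
if [ -f "$SETTINGS" ]; then
    echo "vpnKitMaxPortIdleTime = $(setting_value "$SETTINGS" vpnKitMaxPortIdleTime)"
fi
```

If nothing is printed after the `=`, the key is probably absent from the file and the default applies.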

https://github.com/docker/for-win/issues/2639 (Perhaps windows equivalent of Mac issue 2197?)
https://github.com/moby/vpnkit/issues/587 (Maybe related, though based on server load supposedly)
https://github.com/docker/for-win/issues/8861 (Seems like the opposite issue, too many requests causes problems not too few)
https://github.com/moby/moby/issues/8795 (Maybe related, requires flushing the “conntrack” table to fix)
https://forums.docker.com/t/udp-stream-timeout/114185 (Unsolved forum post that specifically mentions UDP)
https://stackoverflow.com/questions/68639603/inactive-tcp-sockets-disconnecting-in-docker-for-windows-wsl-2 (Goes back to the config file)

In terms of my own debugging, I’ve found that I can sometimes actually get a “Ncat: Connection refused” if I try to send messages back from the Mac host running an ncat listener to the docker container running nc after 90 seconds of no messages. Reloading the listener does not make a difference; nc on the container has to be restarted.

In the settings.json, switching back to "networkType": "vpnkit" does not seem to make a difference.

@scandey
Author

scandey commented May 9, 2023

As expected, using an Ubuntu machine with regular docker seems to work fine.

After trying a few other Docker Desktop replacement systems including OrbStack and Rancher, I ended up with Colima and the containerd backend as provided by nerdctl. Unfortunately that system is still not nearly mature enough (and has its own special UDP issues) so I'm giving up on the Mac development for now till I can work out what is going on.

@jmthomas
Member

jmthomas commented May 9, 2023

Thank you for the incredibly detailed and researched information. I'm a little confused by the "UDP connection timeout" since UDP is connectionless. COSMOS determines that a UDP interface is "connected" simply by whether we allocated a port. Are we saying that the UDP socket stops listening to packets after 90s? So future packet transmissions are not received?

@jmthomas jmthomas added the triage More information is needed label May 9, 2023
@scandey
Author

scandey commented May 9, 2023

The language is kind of tricky, I'm using "timeout" for lack of a better word rather than some actual error related to a closed connection or anything (since as you point out UDP itself is connectionless). The issue is sort of the other way around, a UDP socket stops (apparently) sending packets out of Docker Desktop if it does not send any for 90 seconds. Now that you mention it though, I'm not positive I've tested the other way around (whether a UDP socket stops listening if it gets nothing for 90 seconds). I've been focused on the commanding side and not considered the telemetry side since it just worked (in our HERMES use-case, there are separate UDP ports for command and telemetry).

From the point of view of the COSMOS containers, everything is working as intended (it works fine on a real Linux machine running docker directly). I haven't yet raised this issue with Podman Machine or Docker Desktop (or OrbStack or....) since I wasn't totally clear about the language to describe the problem.

@jmthomas
Member

jmthomas commented May 9, 2023

Our primary development platform is Mac OS with Docker Desktop so we'll see if we can reproduce at some point.

@scandey
Author

scandey commented May 9, 2023

I put together two quick scripts for showing the issue, hopefully the UDP issue replicates on your systems too. All of the TCP messages work as expected, but the final UDP message just never comes through.

TCP Tester

#!/bin/sh

echo "TCP test: set up a TCP receiver in another terminal with ncat -lk 1234 (requires ncat from nmap-ncat)"
sleep 10
echo "starting sending TCP packets 10, 30, 60 and 90 seconds apart"
docker run --rm --name sender alpine ash -c "{ echo 'tcp 10 seconds'; date; sleep 10; date; } | timeout 11 nc -p 1234 host.docker.internal 1234"
sleep 1
docker run --rm --name sender alpine ash -c "{ echo 'tcp 30 seconds'; date; sleep 30; date; } | timeout 31 nc -p 1234 host.docker.internal 1234"
sleep 1
docker run --rm --name sender alpine ash -c "{ echo 'tcp 60 seconds'; date; sleep 60; date; } | timeout 61 nc -p 1234 host.docker.internal 1234"
sleep 1
docker run --rm --name sender alpine ash -c "{ echo 'tcp 90 seconds'; date; sleep 90; date; } | timeout 91 nc -p 1234 host.docker.internal 1234"
sleep 1

UDP Tester

#!/bin/sh

echo "UDP test: start a UDP receiver in another terminal with ncat -lu 1234 (requires ncat from nmap-ncat)"
sleep 10
echo "starting sending UDP packets 10, 30, 60 and 90 seconds apart"
docker run --rm --name sender alpine ash -c "{ echo 'udp 10 seconds'; date; sleep 10; date; } | timeout 11 nc -u -p 1234 host.docker.internal 1234"
sleep 1
docker run --rm --name sender alpine ash -c "{ echo 'udp 30 seconds'; date; sleep 30; date; } | timeout 31 nc -u -p 1234 host.docker.internal 1234"
sleep 1
docker run --rm --name sender alpine ash -c "{ echo 'udp 60 seconds'; date; sleep 60; date; } | timeout 61 nc -u -p 1234 host.docker.internal 1234"
sleep 1
docker run --rm --name sender alpine ash -c "{ echo 'udp 90 seconds'; date; sleep 90; date; } | timeout 91 nc -u -p 1234 host.docker.internal 1234"
sleep 1

@ryanmelt
Member

The TCP test works fine for me. With the UDP one I actually get nothing, which is surprising, and I'll have to experiment some more.

@kdrag0n

kdrag0n commented May 14, 2023

👋 OrbStack developer here. Came across this issue while searching GitHub and thought I'd chime in.

I can't speak for other Docker providers, but in OrbStack's case, there are two issues here:

  1. A bug causing UDP packets with the same source port + destination IP & port to get dropped after the timeout. I've found the cause and fixed it for the next version — thanks for raising the issue!
  2. The fact that there's an internal NAT between the host and Docker. This means that there has to be some sort of connection timeout to keep resource usage under control. As a result, the client's source port changes after the timeout and the server thinks there's a new "connection":
17:02:02.061255 IP 127.0.0.1.49478 > 127.0.0.1.1234: UDP, length 29
17:02:04.083464 IP 127.0.0.1.59394 > 127.0.0.1.1234: UDP, length 44
17:02:34.087749 IP 127.0.0.1.58358 > 127.0.0.1.1234: UDP, length 29
17:02:36.105397 IP 127.0.0.1.59200 > 127.0.0.1.1234: UDP, length 44

ncat only seems to accept up to one "connection", so it stops printing packets after the source port changes.
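One way to spot this in a capture is to pull the source port out of each line and flag every packet whose port differs from the previous one. The following is only a sketch: the `source_port_changes` function name is illustrative, and it assumes tcpdump-style lines in the same layout as the four shown above.

```shell
#!/bin/sh
# Flag every packet whose UDP source port differs from the previous packet's.
# Assumed input layout: "<time> IP <src>.<sport> > <dst>.<dport>: UDP, length N"
source_port_changes() {
    awk '$2 == "IP" {
        n = split($3, parts, /\./)   # last dot-separated field is the port
        port = parts[n]
        if (prev != "" && port != prev)
            print $1, "source port changed:", prev, "->", port
        prev = port
    }'
}

# Sample lines copied from the capture above.
source_port_changes <<'EOF'
17:02:02.061255 IP 127.0.0.1.49478 > 127.0.0.1.1234: UDP, length 29
17:02:04.083464 IP 127.0.0.1.59394 > 127.0.0.1.1234: UDP, length 44
17:02:34.087749 IP 127.0.0.1.58358 > 127.0.0.1.1234: UDP, length 29
17:02:36.105397 IP 127.0.0.1.59200 > 127.0.0.1.1234: UDP, length 44
EOF
```

A receiver that treats the first source address and port as its only peer, as ncat appears to do here, would ignore everything after the first flagged line.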

I think it's possible to fix the connection persistence issue, but it'll be challenging. Feel free to open an issue on the OrbStack repo as well.

@kdrag0n

kdrag0n commented May 16, 2023

Thought about this some more and figured out a relatively simple solution. Both issues should be fixed in the newly-released OrbStack v0.10.2.

The reproducer above still won't work because the client's source port 1234 conflicts with the ncat server port on the macOS host side. It should work if you change it to -p 1235.
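For reference, a sketch of the UDP reproducer with that change applied might look like this. It keeps the same alpine/nc commands as the original script and only swaps the source port; the `SRC_PORT`, `DST_PORT` and `sender_cmd` names and the `run` argument are illustrative additions, not part of the original script.

```shell
#!/bin/sh
# UDP reproducer with a source port that differs from the listener port.
DST_PORT=1234   # port the ncat listener on the host is bound to
SRC_PORT=1235   # container-side source port, deliberately not DST_PORT

# Print the command line executed inside the alpine container for one idle gap.
sender_cmd() {
    gap=$1
    echo "{ echo 'udp $gap seconds'; date; sleep $gap; date; } | timeout $((gap + 1)) nc -u -p $SRC_PORT host.docker.internal $DST_PORT"
}

# Only talk to Docker when invoked as: sh udp_test.sh run
if [ "${1:-}" = "run" ]; then
    echo "UDP test: start a UDP receiver in another terminal with ncat -lu $DST_PORT"
    sleep 10
    for gap in 10 30 60 90; do
        docker run --rm --name sender alpine ash -c "$(sender_cmd "$gap")"
        sleep 1
    done
fi
```

Running it without the `run` argument only defines the helper, which makes it easy to print and inspect the exact command for a given gap before involving Docker.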

There's still another NAT layer that might cause issues after 2–3 minutes of idle, but I think it's not much different from standard Docker-on-Linux setups. Let me know how it goes with the real COSMOS use case!

@jmthomas
Member

@kdrag0n Thanks so much for finding this issue and finding a fix. @scandey I'll leave this open until you get a chance to try it out.

@scandey
Author

scandey commented May 19, 2023

I don't think I fully understand why the source port matching the ncat server port on localhost only works for me for very short sleep times, but that's not actually an issue in production.

I do see that the new OrbStack has fixed the UDP issue if I use an alternative source port! That's very exciting! It also works in COSMOS as expected. Thank you @kdrag0n!! I can develop on COSMOS much much faster now. I've not seen a multi-minute idle timeout from the Docker-on-Linux, so hopefully we've dodged that issue here as well (somehow).

It seems like I should probably go ahead and report some form of this issue to Docker Desktop as well as to Podman, but first I need to clean up my test scripts to use a different source port than the ncat server port.

@kdrag0n

kdrag0n commented May 19, 2023

Great to hear, thanks for sharing your results! Also good to know that the other potential source of timeouts isn't actually a problem.
