MTU issue for large UDP packets with native routing and DSR #32601

Open
lgyurci opened this issue May 17, 2024 · 5 comments
Labels
area/loadbalancing: Impacts load-balancing and Kubernetes service implementations
feature/dsr: Relates to Cilium's Direct-Server-Return feature for KPR.
info-completed: The GH issue has received a reply from the author
kind/community-report: This was reported by a user in the Cilium community, eg via Slack.
kind/question: Frequently asked questions & answers. This issue will be linked from the documentation's FAQ.
needs/triage: This issue requires triaging to establish severity and next steps.
sig/datapath: Impacts bpf/ or low-level forwarding details, including map management and monitor messages.

Comments

@lgyurci

lgyurci commented May 17, 2024

Is there an existing issue for this?

  • I have searched the existing issues

What happened?

We have an environment with DSR and native routing, and a UDP service (SIP) with 3 endpoints running on 3 different nodes. The MTU on the network is 1500 bytes. When a large, external 1500-byte UDP packet gets sent to the service, it gets routed to one of the nodes, which DNATs it to one of the pods. If that pod is running on a different node, the packet gets sent over the network to that node, and because of DSR an 8-byte header gets added to it, making it 1508 bytes long, which exceeds the MTU. In the end, something on the path drops the packet (either the host or the switch; we have an L2 network, and I didn't look into exactly which).

For TCP this is not an issue as far as I know, since this 8-byte header only gets added to packets with the SYN flag, which are almost never large.
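
For reference, a minimal Go sketch of the size arithmetic described above. The 8-byte DSR overhead and the 1500-byte MTU are taken from this report; the check itself is purely illustrative.

```go
// Illustrative only: the numbers come from the report above.
package main

import "fmt"

func main() {
	const (
		linkMTU     = 1500 // MTU of the underlying L2 network
		dsrOverhead = 8    // bytes Cilium adds for DSR when forwarding to a remote backend
	)

	packet := 1500 // full-size external UDP (SIP) packet
	forwarded := packet + dsrOverhead

	fmt.Printf("forwarded size: %d bytes, link MTU: %d\n", forwarded, linkMTU)
	if forwarded > linkMTU {
		fmt.Println("oversized: something on the path silently drops it")
	}
}
```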

Cilium Version

Client: 1.15.1 a368c8f 2024-02-14T22:16:57+00:00 go version go1.21.6 linux/amd64
Daemon: 1.15.1 a368c8f 2024-02-14T22:16:57+00:00 go version go1.21.6 linux/amd64

Kernel Version

5.14.0-362.24.1.el9_3.0.1.x86_64

Kubernetes Version

v1.26.15+rke2r1

Regression

No response

Sysdump

No response

Relevant log output

No response

Anything else?

No response

Cilium Users Document

  • Are you a user of Cilium? Please add yourself to the Users doc

Code of Conduct

  • I agree to follow this project's Code of Conduct
lgyurci added the kind/bug, kind/community-report, and needs/triage labels on May 17, 2024
@lmb
Contributor

lmb commented May 21, 2024

What do you think is the bug here? That Cilium generates a packet of 1508 bytes? If yes, what else should it do (keep in mind that we can't avoid encapsulation)?

Can you either drop Cilium's MTU to 1492, or increase your underlying network's MTU to >= 1508?
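
As a purely illustrative sketch of those two options (not an official tool), the snippet below reads the underlay interface's MTU and derives the matching values; the device name eth0 and the 8-byte DSR overhead are assumptions.

```go
// Sketch, not an official tool: derive the two candidate MTU settings from
// the underlay device. "eth0" and the 8-byte DSR overhead are assumptions.
package main

import (
	"fmt"
	"net"
)

const dsrOverhead = 8

func main() {
	link, err := net.InterfaceByName("eth0") // hypothetical underlay NIC
	if err != nil {
		panic(err)
	}

	fmt.Printf("underlay MTU: %d\n", link.MTU)
	fmt.Printf("option A: set Cilium's MTU to %d and leave the underlay alone\n", link.MTU-dsrOverhead)
	fmt.Printf("option B: raise the underlay to at least %d and keep Cilium at %d\n", 1500+dsrOverhead, 1500)
}
```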

lmb added the need-more-info label on May 21, 2024
@lgyurci
Author

lgyurci commented May 22, 2024

I'm not sure if it's possible via eBPF, but Cilium should fragment the packet, like any other router would in a scenario like this.

But I would also like to suggest what it shouldn't do: break random network connections without any warning and put the task of debugging onto the end user. This issue took a significant amount of time to figure out (SIP connections breaking randomly, good luck), and I would at least expect a mention of this in the documentation, along the lines of "you should watch out for this if you are doing this", even if the solution is simply to increase/decrease the MTU in certain places. Make it appear in a red alert box on the native routing documentation page.

I could increase the MTU on the hosts, and that's what I did in the end, but I feel like this is more of a workaround than a solution. It turns out the host network interfaces were dropping the packets, because they automatically set the MRU to the same value as the MTU, at the hardware level, so we couldn't even see those packets with tcpdump. However, now the hosts technically could send 1508-byte packets to other network entities, which is not ideal (Cilium's MTU was left at 1500, which meant we had to take it out of automatic MTU mode), so we implemented another workaround: a bond interface on top of the physical one with a 1500 MTU. I hope you see the absurdity of this situation. In my opinion, you shouldn't have to manually set the MTU anywhere, anytime.
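
For reference, a rough Go sketch of the check behind that layered workaround; the interface names (eno1, bond0) are hypothetical, and the 8-byte overhead is the one from this report.

```go
// Rough check of the layering described above: the physical NIC (raised to
// 1508) has to leave room for the DSR overhead on top of the MTU used by the
// bond/Cilium side (1500). Interface names here are hypothetical.
package main

import (
	"fmt"
	"net"
)

const dsrOverhead = 8

func main() {
	phys, err := net.InterfaceByName("eno1") // hypothetical physical NIC (MTU 1508)
	if err != nil {
		panic(err)
	}
	bond, err := net.InterfaceByName("bond0") // hypothetical bond carrying the 1500 MTU
	if err != nil {
		panic(err)
	}

	if bond.MTU+dsrOverhead > phys.MTU {
		fmt.Printf("DSR traffic can still exceed %s's MTU: %d + %d > %d\n",
			phys.Name, bond.MTU, dsrOverhead, phys.MTU)
		return
	}
	fmt.Printf("%s (MTU %d) leaves room for %s (MTU %d) plus the DSR overhead\n",
		phys.Name, phys.MTU, bond.Name, bond.MTU)
}
```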

Sorry if this comment came across as aggressive. I really like what you do, and I like Cilium as a piece of software as well; I've just been through a few hours of debugging and am now a bit emotionally attached to this case.

github-actions bot added the info-completed label and removed the need-more-info label on May 22, 2024
@julianwiedmann
Member

👋 Maybe have a look at #21825?

julianwiedmann added the kind/question, sig/datapath, area/loadbalancing, and feature/dsr labels and removed the kind/bug label on May 22, 2024
@lmb
Contributor

lmb commented May 22, 2024

Sorry if this comment came across as aggressive. I really like what you do, and I like Cilium as a piece of software as well; I've just been through a few hours of debugging and am now a bit emotionally attached to this case.

Sure, it does come across that way. I empathise with you, but you are using software that is made available to you for free. Debugging is the price you pay. Expressing your frustration this way to a stranger trying to help is not the way to go.

I'm not sure if it's possible via eBPF, but Cilium should fragment the packet, like any other router would in a scenario like this.

Fragmentation is a tricky topic. I'm not sure we have plans to do outbound fragmentation. Enabling PMTU discovery should at least help for compliant clients.
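
To illustrate what a "compliant client" would do (a sketch, not Cilium code): on Linux, a UDP sender can opt into PMTU discovery via the IP_MTU_DISCOVER socket option, so the kernel sets DF on outgoing datagrams and lowers the cached path MTU when an ICMP "fragmentation needed" comes back. The destination address below is hypothetical.

```go
// Linux-only sketch of a "compliant client": force PMTU discovery on a UDP
// socket so the kernel sets DF and honours ICMP "fragmentation needed".
// The address is hypothetical; error handling is kept minimal.
package main

import (
	"fmt"
	"net"

	"golang.org/x/sys/unix"
)

func main() {
	conn, err := net.Dial("udp4", "198.51.100.10:5060") // hypothetical SIP service IP
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	raw, err := conn.(*net.UDPConn).SyscallConn()
	if err != nil {
		panic(err)
	}

	// Set DF on outgoing datagrams and let the kernel cache a lower path MTU
	// when a "fragmentation needed" ICMP arrives.
	raw.Control(func(fd uintptr) {
		if err := unix.SetsockoptInt(int(fd), unix.IPPROTO_IP,
			unix.IP_MTU_DISCOVER, unix.IP_PMTUDISC_DO); err != nil {
			panic(err)
		}
	})

	// A 1472-byte payload fills a 1500-byte IPv4 frame (20 IP + 8 UDP headers).
	if _, err := conn.Write(make([]byte, 1472)); err != nil {
		fmt.Println("send failed:", err) // e.g. EMSGSIZE once a smaller path MTU is cached
	}

	// Read back the path MTU the kernel currently assumes for this destination.
	raw.Control(func(fd uintptr) {
		if mtu, err := unix.GetsockoptInt(int(fd), unix.IPPROTO_IP, unix.IP_MTU); err == nil {
			fmt.Println("cached path MTU:", mtu)
		}
	})
}
```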

However, now the hosts technically could send 1508-byte packets to other network entities, which is not ideal (Cilium's MTU was left at 1500, which meant we had to take it out of automatic MTU mode), so we implemented another workaround: a bond interface on top of the physical one with a 1500 MTU.

Could you lower the default route MTU to 1500 instead?

Coming back to making this easier to debug: oversized packets should emit DROP_FRAG_NEEDED in cilium-dbg monitor. Did you check that output?

@lgyurci
Author

lgyurci commented May 23, 2024

Thanks for the help.

I can lower the default route MTU to 1500, but we have a lot of other (dynamically added) routes on the nodes, and setting an MTU for those would be tricky.

There is no such output in cilium-dbg monitor. The packets are dropped by the network card at the hardware level, not by Cilium.
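
For anyone hitting the same thing: a small sketch for spotting drops that never reach tcpdump or cilium-dbg monitor, by reading the NIC's sysfs statistics. The device name is hypothetical, and which counter oversized frames end up in depends on the driver; ethtool -S usually exposes more detailed, driver-specific counters.

```go
// Sketch for spotting drops that never reach tcpdump or cilium-dbg monitor:
// dump the NIC's sysfs statistics. The device name is hypothetical, and which
// counter oversized frames land in depends on the driver.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

func main() {
	dev := "eno1" // hypothetical physical NIC
	for _, counter := range []string{"rx_dropped", "rx_errors", "rx_length_errors", "rx_over_errors"} {
		path := filepath.Join("/sys/class/net", dev, "statistics", counter)
		data, err := os.ReadFile(path)
		if err != nil {
			continue // counter not exposed by this driver
		}
		fmt.Printf("%-17s %s\n", counter, strings.TrimSpace(string(data)))
	}
}
```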
