Potential MTU related miscalculation #26
Setting …
Related: #23
Would you mind sharing what NIC you are using for this test?
I'm on aliyun (a Chinese cloud provider), and the instance is using a virtio NIC:
…
Okay, then either set force software GSO or disable GSO (set …)
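For reference, here is a sketch of the "disable GSO" option using `ethtool`; the interface name `eth0` is a placeholder (substitute your actual device):

```shell
# Inspect the current offload settings (feature names vary slightly by driver)
ethtool -k eth0 | grep -E 'generic-segmentation-offload|tcp-segmentation-offload'

# Disable GSO and TSO so every packet leaves the stack already at MTU size
ethtool -K eth0 gso off tso off
```

Note that disabling offloads this way trades the oversized-packet problem for per-packet stack overhead, as discussed later in this thread.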
The issue seems to have something to do with the kernel version too; I tried reverting to kernel 5.17.0-1028-oem, …
I am on 5.17 and the issue is always there for me. I don't understand why you don't see it; maybe Homa is running in a downgraded mode where it always resends packets at MTU size (that is the case for Intel NICs). You can check through …
They were indeed my error, so the only issue now is GSO. I hope we can fix it (Homa is yielding terrible performance compared to TCP now).
You mean performance with software GSO enforced under virtio?
I set …
Yes, you are correct. I guess John hasn't really optimized performance for NICs other than Mellanox yet. GSO can at least work with #23, but the software GSO implementation in homa_offload.c is also slow AFAIK; you can test it yourself. In my opinion we first need to figure out how TCP TSO works under virtio (i.e., does virtio just tell the kernel it doesn't support TSO, so the kernel calls the software segmentation function TCP provides, or does part of the virtio driver code perform TSO?). And also …
My guess is virtio claims to support TSO (from ethtool -k) and therefore the kernel doesn't ask Homa to perform software GSO. But when it comes to the virtio driver actually sending packets, TSO doesn't work because of a protocol number mismatch. And I guess it is the same for Intel NICs.
The problem here comes from differences in support for TSO among NICs. My development of Homa has been with Mellanox NICs, and I discovered that those NICs are willing to perform TSO on packets that use protocols other than TCP. I then modified Homa's packet headers to look as much like TCP headers as possible (and, in particular, not to use fields that will get modified in awkward ways by TSO). With this approach, Homa can piggyback on TSO for segmentation offload, which improves performance considerably. I made this behavior the default in Homa, assuming (hoping?) that other NICs would be as flexible as the Mellanox ones.

Unfortunately, it turns out that some (maybe most?) NICs simply discard packets requesting TSO if they don't also specify the TCP protocol; this is almost certainly what's happening with @NickCao. Thus, TSO needs to be disabled for these NICs. One way to do that is to reduce max_gso_size to the network MTU. Another is for Homa to use GSO to do segmentation in software rather than hardware. I added support for this in Homa a while ago, assuming that software GSO would be invoked automatically if the NIC driver rejected the packet, but that seems not to be the case either. PR #23 adds a configuration parameter to ask Homa to use software GSO rather than TSO; I'll get that incorporated shortly.

Unfortunately, neither of these solutions is very satisfying. Reducing max_gso_size means that the Linux network stack must be traversed for every MTU-sized packet; this limits throughput even if the MTU is increased to 9000 bytes. Performing GSO in software means that oversized packets can be sent through the Linux stack, which is nice, but when they are split up at the driver level, all of the data has to be copied from the large GSO packet to smaller network packets. Unfortunately this is quite slow, and limits throughput.
I've been poking around to see if there is a way that the smaller packets can actually reference data in the original large packet, thereby avoiding copies, but I haven't yet found a way to do that (since virtually all NICs support TSO for TCP, there's probably not much incentive for Linux kernel developers to make GSO faster). The best solution to this problem would be to fix other NICs so they don't reject non-TCP packets that request TSO. My hope is that this check is done in the driver software, in which case it should be easy to simply remove an "if" statement. If the check is done in the NIC hardware, then this may be hard to fix...
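To make the max_gso_size tradeoff concrete, here is a back-of-the-envelope sketch; the message size and per-packet payload are illustrative assumptions, not measurements from this thread:

```shell
MSG=$((1024 * 1024))   # a 1 MB message (illustrative)
PAYLOAD=1400           # approximate data bytes in a 1500-byte MTU packet
GSO=$((64 * 1024))     # a typical maximum GSO packet size

# Stack traversals when max_gso_size is clamped to the MTU (one per packet):
echo $(( (MSG + PAYLOAD - 1) / PAYLOAD ))   # 749
# Stack traversals when 64 KB GSO packets are allowed:
echo $(( (MSG + GSO - 1) / GSO ))           # 16
```

Under these assumptions, clamping max_gso_size to the MTU costs roughly 47x more trips through the Linux stack per message, which is why neither workaround is satisfying.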
I actually have other thoughts to share regarding virtio. First, it is inevitable that modern cloud architectures, and therefore data center networks, include virtual machines. And virtio is nearly the de-facto NIC to use for independent kernel virtualization, making it really important if we want to push Homa into real deployment. Another thought is that virtio's TSO behavior is fundamentally not compatible with (current) Homa even if TSO works. If I am understanding correctly, Homa expects incoming packets to always be MTU-sized, regardless of …
This observation indicates virtio doesn't do MTU-size segmentation anyway between two machines sharing the same node (i.e. no physical NIC involved), but uses a trickier number. I don't know where this 7240 (7292) comes from yet. That makes the packet sniffing in force-software-GSO.zip in #23 make a lot more sense. So I tried the same experiment again with Homa, observed the packets on the bridge (current HomaModule inserted on both machines), and started tcpdump at the host (bridge / switch).
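A capture like the one described could be set up roughly as follows; `br0` is a placeholder for the host's bridge device, and rather than guessing Homa's IP protocol number, the filter simply matches everything that is neither TCP nor UDP:

```shell
# On the host, watch non-TCP/non-UDP IP packets crossing the bridge.
# ip[9] is the IP protocol byte: 6 = TCP, 17 = UDP.
# -v prints packet lengths, which reveals the segment sizes virtio emits.
tcpdump -i br0 -n -v 'ip and ip[9] != 6 and ip[9] != 17'
```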
From the results above, we can see Homa first sends the data as normal, virtio on the sender side performed non-MTU-size segmentation, and the receiver indeed received the segments but didn't know how to handle them. But it is good that the receiver at least registered the RPC and asked for a RESEND. In conclusion, I would say virtio's behavior for Homa TSO is quite similar to TCP TSO, although the segmentation size is a bit different.
I agree that it's important for Homa to work in virtual machine environments, and I would expect/hope it would work with virtio. As far as I know, Homa has no expectation about the size of incoming packets; it should accept whatever size arrives over the network. My interpretation of the tcpdump and Homa timetrace output in @breakertt's message is that a Homa client attempted to transmit a large request packet (8700 bytes?) but this packet never got to the server. Since the client didn't receive a response to the RPC request, it eventually transmitted a RESEND request for the RPC result; this is the first packet logged by homa_gro_receive, with type 0x12. When the server sees the RESEND, it realizes that it never received a request, so it turns around and issues a RESEND for the request. When the client resends, it does so using only MTU-sized packets; these packets are successfully received by the server. I think the problem is that the large initial packet is not being forwarded by someone below Homa. I suspect that virtio discarded the packet because it requested TSO and the IP protocol isn't TCP. Homa should be quite happy if virtio segmentation produces packets with different sizes than traditional TSO, as long as it transmits the data in some form.
Oh yes, I think you are right. I didn't notice that the first packet logged is not even a DATA packet. It must be dropped somewhere between the switch (virtual bridge) and the receiver.
This means it is not a Homa problem, and it will take time to figure out what is wrong in this scenario. So maybe in the near future we still need to either set …
It's been a long time since I've used tcpdump, so I'm pretty rusty on it. Does the tcpdump trace in your message indicate that the large request packet actually made it onto the "wire", for some definition of wire? If so, that would mean that the source virtio isn't dropping the packet. Is it possible that the receiving machine is dropping the large packet for some reason?
The bridge indeed saw the packet coming out of the sender machine.
That is also my assumption. And for the virtio NIC, Homa TSO behaviour on the sender machine is quite similar to TCP TSO, as I mentioned above.
When testing Homa with nccl-tests, I noticed that small test cases (whose messages fall under the MTU) pass, while large test cases fail. Looking through the logs it seems that the packets never make it to the other end. tcpdump confirms that it's Homa sending oversized packets with the DF bit set.
(Background: I'm on Ubuntu 22.04, with kernel 6.1.0-1007-oem)
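A sketch of how such oversized DF packets can be spotted with tcpdump; `eth0` and the 1500-byte MTU are placeholders for the actual device and its configured MTU:

```shell
# Show IP packets longer than the nominal MTU that have the DF bit set.
# ip[6] holds the IP flags; 0x40 is the Don't-Fragment bit.
# 'greater 1500' matches packets longer than 1500 bytes.
tcpdump -i eth0 -n 'ip and (ip[6] & 0x40 != 0) and greater 1500'
```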