Potential MTU related miscalculation #26

Open
NickCao opened this issue Mar 8, 2023 · 18 comments

@NickCao

NickCao commented Mar 8, 2023

When testing Homa with nccl-tests, I noticed that small test cases (whose messages fall under the MTU) pass, while large test cases fail. Looking through the logs, it seems that the packets never make it to the other end. tcpdump confirms that it's Homa sending oversized packets with the DF bit set.

13:56:10.888085 IP (tos 0xa0, ttl 64, id 0, offset 0, flags [DF], proto unknown (253), length 4237)
    172.25.230.89 > 172.25.230.90:  exptest-253 4217

(Background: I'm on Ubuntu 22.04, with kernel 6.1.0-1007-oem)

NickCao changed the title from "Potential MTU related miscalcuation" to "Potential MTU related miscalculation" on Mar 8, 2023
@NickCao
Author

NickCao commented Mar 8, 2023

Setting sysctl net.homa.max_gso_size=1500 does improve the situation. So maybe GSO is to blame.

@NickCao
Author

NickCao commented Mar 8, 2023

Related: #23

@breakertt
Contributor

Would you mind sharing what NIC you are using for this test?

@NickCao
Author

NickCao commented Mar 8, 2023

I'm on Aliyun (a Chinese cloud provider), and the instance is using a virtio NIC:

00:06.0 Ethernet controller: Red Hat, Inc. Virtio network device

@breakertt
Contributor

> I'm on Aliyun (a Chinese cloud provider), and the instance is using a virtio NIC:
>
> 00:06.0 Ethernet controller: Red Hat, Inc. Virtio network device

Okay, then either forcing software GSO or disabling GSO (by setting max_gso_size) should make Homa work.

@NickCao
Author

NickCao commented Mar 8, 2023

The issue seems to have something to do with the kernel version too: I tried reverting to kernel 5.17.0-1028-oem, and all the strange issues are gone.

Edit: still having trouble with specific message sizes, but that could be my programming error. So why is GSO broken for virtio-net?

@breakertt
Contributor

breakertt commented Mar 8, 2023

I am on 5.17 and the issue is always there for me. I don't understand why you don't see it; maybe Homa is running in a degraded mode where it always resends packets at MTU size (that is the case for Intel NICs). You can check through ttprint.py.

@NickCao
Author

NickCao commented Mar 8, 2023

> Edit: still having trouble with specific message sizes, but that could be my programming error.

They were indeed my errors, so the only issue now is GSO. I hope we can fix it (Homa is yielding terrible performance compared to TCP right now).

@breakertt
Contributor

> Edit: still having trouble with specific message sizes, but that could be my programming error.
>
> They were indeed my errors, so the only issue now is GSO. I hope we can fix it (Homa is yielding terrible performance compared to TCP right now).

You mean performance with software GSO enforced under virtio?

@NickCao
Author

NickCao commented Mar 8, 2023

I set max_gso_size to the MTU; that means disabling GSO altogether, I guess?

@breakertt
Contributor

breakertt commented Mar 8, 2023

> I set max_gso_size to the MTU; that means disabling GSO altogether, I guess?

Yes, you are correct. I guess John hasn't really optimized performance for NICs other than Mellanox yet. GSO can at least work with #23, but the software GSO implementation in homa_offload.c is also slow AFAIK; you can test it yourself.

In my opinion, we first need to figure out how TCP TSO works in virtio (i.e. does it just tell the kernel that virtio doesn't support TSO and call the software segmentation function TCP provides, or does part of the virtio driver code perform TSO itself). And also:

  1. Why skb_shinfo(skb)->gso_type = SKB_GSO_TCPV6; cannot make virtio fall back to software TSO for Homa if TSO is unsupported.
  2. A way to improve software GSO in homa_offload.c regardless of whether virtio supports TSO or not; there will always be legacy NICs that need it.

My guess is that virtio claims to support TSO (per ethtool -k), and therefore the kernel doesn't ask Homa to perform software GSO. But when the virtio driver actually sends the packets, TSO doesn't work because of the protocol number mismatch. And I guess it is the same for Intel NICs.
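
For reference, here is a simplified sketch of the decision I mean, loosely paraphrasing validate_xmit_skb() in net/core/dev.c (simplified, with a made-up wrapper name; not the exact kernel code): the stack only falls back to software segmentation when the device does not advertise the GSO feature the skb needs, so if virtio claims TSO support, Homa's oversized skb is handed to the driver unsegmented.

```c
#include <linux/err.h>
#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* Simplified sketch, paraphrasing validate_xmit_skb() in net/core/dev.c:
 * software segmentation only happens when the device lacks the feature
 * the skb asks for.  If virtio advertises TSO, the large skb passes
 * through unsegmented and the driver/host must deal with it. */
static struct sk_buff *maybe_segment_in_software(struct sk_buff *skb)
{
	netdev_features_t features = netif_skb_features(skb);

	if (netif_needs_gso(skb, features)) {
		/* Device can't segment this skb: do it in software now. */
		struct sk_buff *segs = skb_gso_segment(skb, features);

		if (IS_ERR(segs))
			return NULL;	/* segmentation failed: drop */
		if (segs) {
			consume_skb(skb);
			skb = segs;	/* list of MTU-sized skbs */
		}
	}
	/* Otherwise the oversized skb goes to the driver as-is, relying on
	 * the NIC (or the virtio host) to perform TSO. */
	return skb;
}
```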

@johnousterhout
Member

The problem here comes from differences in support for TSO among NICs. My development of Homa has been with Mellanox NICs, and I discovered that those NICs are willing to perform TSO on packets that use protocols other than TCP. I then modified Homa's packet headers to look as much like TCP headers as possible (and, in particular, not to use fields that will get modified in awkward ways by TSO). With this approach, Homa can piggyback on TSO for segmentation offload, which improves performance considerably. I made this behavior the default in Homa, assuming (hoping?) that other NICs would be as flexible as the Mellanox ones.
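
As a minimal illustration of the GSO marking described here (a sketch with an invented helper name, not the actual HomaModule code): the large outgoing skb is marked for segmentation through the GSO fields of its skb_shared_info, with gso_type set to a TCP type so that NIC TSO will hopefully accept it even though the IP protocol is not TCP.

```c
#include <linux/kernel.h>
#include <linux/skbuff.h>

/* Sketch only (not HomaModule code): mark a large outgoing skb so that the
 * stack or NIC will segment it into seg_size-byte pieces.  Using a TCP
 * gso_type is what lets a non-TCP transport piggyback on TSO hardware that
 * only knows about TCP. */
static void mark_skb_for_tso(struct sk_buff *skb, unsigned int seg_size)
{
	struct skb_shared_info *shinfo = skb_shinfo(skb);

	shinfo->gso_size = seg_size;                          /* payload bytes per segment */
	shinfo->gso_segs = DIV_ROUND_UP(skb->len, seg_size);  /* approximate segment count */
	shinfo->gso_type = SKB_GSO_TCPV6;                     /* pretend to be TCP for TSO */
}
```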

Unfortunately, it turns out that some (maybe most?) NICs simply discard packets requesting TSO if they don't also specify the TCP protocol; this is almost certainly what's happening with @NickCao. Thus, TSO needs to be disabled for these NICs. One way to do that is to reduce max_gso_size to the network MTU. Another is for Homa to use GSO to do segmentation in software rather than hardware. I added support for this in Homa a while ago, assuming that the software GSO would be invoked automatically if the NIC driver rejected the packet, but that seems not to be the case either. PR #23 adds a configuration parameter to ask Homa to use software GSO rather than TSO; I'll get that incorporated shortly.

Unfortunately, neither of these solutions is very satisfying. Reducing max_gso_size means that the Linux network stack must be traversed for every MTU-sized packet; this limits throughput even if the MTU is increased to 9000 bytes. Performing GSO in software means that oversized packets can get sent through the Linux stack, which is nice, but when they are split up at driver level, all of the data has to be copied from the large GSO packet to smaller network packets. Unfortunately this is quite slow, and limits throughput. I've been poking around to see if there is a way that the smaller packets can actually reference data in the original large packet, thereby avoiding copies, but I haven't yet figured out a way to do that (since virtually all NICs support TSO for TCP, there's probably not much incentive for Linux kernel developers to make GSO faster).

The best solution to this problem is to fix other NICs so they don't reject non-TCP packets that request TSO. My hope is that this check is done in the driver software, in which case it should be easy to simply remove an "if" statement. If the check is done in the NIC hardware, then this may be hard to fix...
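
Purely as a hypothetical illustration (not taken from any particular driver), the kind of driver-level check in question might look roughly like the sketch below; relaxing such a test would be the hoped-for one-line fix. The thread's tcpdump output shows Homa using IP protocol 253, which is what a TCP-only test like this would reject.

```c
#include <linux/in.h>
#include <linux/ip.h>
#include <linux/skbuff.h>

/* Hypothetical illustration only (not from any particular driver): a
 * transmit-path check like this, keyed on the real IP protocol number
 * rather than the skb's gso_type, would silently reject Homa's oversized
 * packets (IP protocol 253) even though they are marked for TSO. */
static bool hw_tso_accepts_skb(const struct sk_buff *skb)
{
	if (!skb_is_gso(skb))
		return true;	/* no segmentation requested */

	/* Only TCP segments are offloaded; everything else is refused. */
	return ip_hdr(skb)->protocol == IPPROTO_TCP;
}
```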

@breakertt
Contributor

I actually have other thoughts to share regarding virtio.

First, it is inevitable that modern cloud architectures, and therefore data center networks, include virtual machines. And virtio is nearly the de facto NIC when considering independent-kernel virtualization, making it really important if we want to push Homa into real deployment.

Another thought is that the behavior of virtio TSO is fundamentally incompatible with (current) Homa even if TSO works. If I am understanding correctly, Homa expects incoming packets to always be MTU-sized, whether they come through homa_softirq or are bypassed by homa_gro_receive. I dug a bit into the TSO behavior with TCP, which can explain some things. I set up two virtual machines (with virtio) running TCP; the client exits immediately after sending the packet, and the server exits immediately after receiving it. The client sends a 10000-byte payload to the server through TCP. The following is what I saw from tcpdump at the receiver node:

...Handshake Packets...
18:27:07.548195 enp1s0 In  IP (tos 0x0, ttl 64, id 43905, offset 0, flags [DF], proto TCP (6), length 7292)
    192.168.122.100.46416 > 192.168.122.101.2000: Flags [P.], cksum 0x9289 (incorrect -> 0xb2ce), seq 1:7241, ack 1, win 502, options [nop,nop,TS val 3990326198 ecr 617160495], length 7240
18:27:07.548196 enp1s0 In  IP (tos 0x0, ttl 64, id 43910, offset 0, flags [DF], proto TCP (6), length 2812)
    192.168.122.100.46416 > 192.168.122.101.2000: Flags [P.], cksum 0x8109 (incorrect -> 0x162c), seq 7241:10001, ack 1, win 502, options [nop,nop,TS val 3990326198 ecr 617160495], length 2760
...Handwave Packets...

This observation indicates that virtio doesn't do MTU-sized segmentation anyway between two machines sharing the same node (i.e. no physical NIC involved), but instead uses a trickier number. I don't know where this 7240 (7292) comes from yet. It also makes the packet capture in force-software-GSO.zip in #23 make a lot more sense.

So I tried the same experiment again with Homa, observed the packets on the bridge (current HomaModule inserted on both machines), and ran tcpdump at the host (bridge / switch):

18:40:49.958697 vnet0 P   IP (tos 0xc0, ttl 64, id 0, offset 0, flags [DF], proto unknown (253), length 8700)
    192.168.122.100 > 192.168.122.101:  ip-proto-253 8680
18:40:49.958716 vnet0 P   IP (tos 0xc0, ttl 64, id 0, offset 0, flags [DF], proto unknown (253), length 1580)
...Control Packets...
18:47:24.309026 vnet0 P   IP (tos 0xa0, ttl 64, id 0, offset 0, flags [DF], proto unknown (253), length 1500)
    192.168.122.100 > 192.168.122.101:  ip-proto-253 1480
18:47:24.309034 vnet1 Out IP (tos 0xa0, ttl 64, id 0, offset 0, flags [DF], proto unknown (253), length 1500)
    192.168.122.100 > 192.168.122.101:  ip-proto-253 1480
18:47:24.309039 vnet0 P   IP (tos 0xa0, ttl 64, id 0, offset 0, flags [DF], proto unknown (253), length 1500)
    192.168.122.100 > 192.168.122.101:  ip-proto-253 1480
18:47:24.309041 vnet1 Out IP (tos 0xa0, ttl 64, id 0, offset 0, flags [DF], proto unknown (253), length 1500)
    192.168.122.100 > 192.168.122.101:  ip-proto-253 1480
18:47:24.309042 vnet0 P   IP (tos 0xa0, ttl 64, id 0, offset 0, flags [DF], proto unknown (253), length 1500)
    192.168.122.100 > 192.168.122.101:  ip-proto-253 1480
18:47:24.309043 vnet1 Out IP (tos 0xa0, ttl 64, id 0, offset 0, flags [DF], proto unknown (253), length 1500)
    192.168.122.100 > 192.168.122.101:  ip-proto-253 1480
18:47:24.309051 vnet0 P   IP (tos 0xa0, ttl 64, id 0, offset 0, flags [DF], proto unknown (253), length 1500)
    192.168.122.100 > 192.168.122.101:  ip-proto-253 1480
18:47:24.309051 vnet1 Out IP (tos 0xa0, ttl 64, id 0, offset 0, flags [DF], proto unknown (253), length 1500)
    192.168.122.100 > 192.168.122.101:  ip-proto-253 1480
18:47:24.309053 vnet0 P   IP (tos 0xa0, ttl 64, id 0, offset 0, flags [DF], proto unknown (253), length 1500)
    192.168.122.100 > 192.168.122.101:  ip-proto-253 1480
18:47:24.309054 vnet1 Out IP (tos 0xa0, ttl 64, id 0, offset 0, flags [DF], proto unknown (253), length 1500)
    192.168.122.100 > 192.168.122.101:  ip-proto-253 1480
18:47:24.309055 vnet0 P   IP (tos 0xa0, ttl 64, id 0, offset 0, flags [DF], proto unknown (253), length 1500)
    192.168.122.100 > 192.168.122.101:  ip-proto-253 1480
18:47:24.309056 vnet1 Out IP (tos 0xa0, ttl 64, id 0, offset 0, flags [DF], proto unknown (253), length 1500)
    192.168.122.100 > 192.168.122.101:  ip-proto-253 1480
18:47:24.309057 vnet0 P   IP (tos 0xa0, ttl 64, id 0, offset 0, flags [DF], proto unknown (253), length 1500)
    192.168.122.100 > 192.168.122.101:  ip-proto-253 1480
18:47:24.309058 vnet1 Out IP (tos 0xa0, ttl 64, id 0, offset 0, flags [DF], proto unknown (253), length 1500)
$ ./util/ttprint.py                                                      
    0.000 us (+   0.000 us) [C00] First event has timestamp 4666854084992 (cpu_ghz 2.419200000000000)                 
    0.000 us (+   0.000 us) [C02] homa_recvmsg starting, port 2000, pid 1271, flags 1                                 
    1.886 us (+   1.886 us) [C02] reaped 0 skbs, 0 rpcs; 0 skbs remain for port 2000                                  
    1.908 us (+   0.023 us) [C02] Checking nonblocking, flags 1                                                       
   54.442 us (+  52.534 us) [C02] Poll ended unsuccessfully on socket 2000, pid 1271                                  
   54.663 us (+   0.221 us) [C02] homa_wait_for_message sleeping, pid 1271                                            
2571381.620 us (+2571326.957 us) [C03] homa_gro_receive got packet from 0xc0a87a64 id 11, type 0x12, priority 7       
2571383.626 us (+   2.005 us) [C03] homa_softirq: first packet from 0xc0a87a64:32772, id 11, type 18                  
2571386.317 us (+   2.692 us) [C03] resend request for unknown id 11, peer 0xc0a87a64:32772, offset 0; responding with
 UNKNOWN                                                                                                              
2571386.552 us (+   0.234 us) [C03] sending unknown to 0xc0a87a64:32772 for id 11                                     
2571673.010 us (+ 286.458 us) [C03] homa_gro_receive got packet from 0xc0a87a64 id 1, offset 0, priority 6            
2571676.141 us (+   3.131 us) [C03] homa_gro_receive got packet from 0xc0a87a64 id 1, offset 0, priority 6            
2571677.545 us (+   1.404 us) [C03] homa_gro_receive got packet from 0xc0a87a64 id 1, offset 0, priority 6   
2571678.894 us (+   1.349 us) [C03] homa_gro_complete chose core 4 for id 1 offset 0 with IDLE_NEW policy             
2571690.430 us (+  11.536 us) [C03] homa_softirq: first packet from 0xc0a87a64:32772, id 11, type 16                  
2571710.828 us (+  20.398 us) [C03] homa_rpc_handoff handed off id 11 to pid 1271 on core 2                           
2571711.697 us (+   0.868 us) [C03] incoming data packet, id 11, peer 0xc0a87a64, offset 0/10000                      
2571711.797 us (+   0.100 us) [C03] Incoming message for id 11 has 10000 unscheduled bytes                            
2571729.350 us (+  17.553 us) [C03] incoming data packet, id 11, peer 0xc0a87a64, offset 1420/10000                   
2571729.801 us (+   0.451 us) [C03] incoming data packet, id 11, peer 0xc0a87a64, offset 2840/10000                   
2571733.652 us (+   3.851 us) [C03] homa_gro_receive got packet from 0xc0a87a64 id 1, offset 0, priority 6            
2571735.049 us (+   1.397 us) [C03] homa_gro_receive got packet from 0xc0a87a64 id 1, offset 0, priority 6            
2571736.618 us (+   1.569 us) [C03] homa_gro_receive got packet from 0xc0a87a64 id 1, offset 0, priority 6            
2571738.090 us (+   1.472 us) [C03] homa_gro_receive got packet from 0xc0a87a64 id 1, offset 0, priority 6            
2571739.582 us (+   1.492 us) [C03] homa_gro_receive got packet from 0xc0a87a64 id 11, offset 9940, priority 6        
2571739.893 us (+   0.311 us) [C03] homa_softirq: first packet from 0xc0a87a64:32772, id 11, type 16                  
2571740.227 us (+   0.334 us) [C03] incoming data packet, id 11, peer 0xc0a87a64, offset 9940/10000         
2571741.028 us (+   0.801 us) [C03] homa_gro_complete chose core 5 for id 1 offset 0 with IDLE_NEW policy             
2571743.208 us (+   2.180 us) [C03] homa_softirq: first packet from 0xc0a87a64:32772, id 11, type 16
2571743.363 us (+   0.155 us) [C03] incoming data packet, id 11, peer 0xc0a87a64, offset 4260/10000
2571743.759 us (+   0.396 us) [C03] incoming data packet, id 11, peer 0xc0a87a64, offset 5680/10000
2571744.089 us (+   0.330 us) [C03] incoming data packet, id 11, peer 0xc0a87a64, offset 7100/10000
2571744.356 us (+   0.267 us) [C03] incoming data packet, id 11, peer 0xc0a87a64, offset 8520/10000
2571751.489 us (+   7.133 us) [C02] homa_wait_for_message found rpc id 11, pid 1271

From the results above, we can see that Homa first sent the data as normal, virtio on the sender side performed non-MTU-sized segmentation, and the receiver indeed received the packets but didn't know how to handle them. But it is good that the receiver at least registered the RPC and asked for a RESEND. In conclusion, I would say virtio's behavior for Homa TSO is quite similar to TCP TSO, although the segmentation size is a bit different.

@johnousterhout
Member

I agree that it's important for Homa to work in virtual machine environments, and I would expect/hope it would work with virtio. As far as I know, Homa has no expectation about the size of incoming packets; it should accept whatever size arrives over the network. My interpretation of the tcpdump and Homa timetrace output in @breakertt's message is that a Homa client attempted to transmit a large request packet (8700 bytes?) but this packet never got to the server. Since the client didn't receive a response to the RPC request, it eventually transmitted a RESEND request for the RPC result; this is the first packet logged by homa_gro_receive, with type 0x12. When the server sees the RESEND, it realizes that it never received the request, so it turns around and issues a RESEND for the request. When the client resends, it does so using only MTU-sized packets; these packets are successfully received by the server.

I think the problem is that the large initial packet is not being forwarded by someone below Homa. I suspect that virtio discarded the packet because it requested TSO and the IP protocol isn't TCP. Homa should be quite happy if virtio segmentation produces packets with different sizes than traditional TSO, as long as it transmits the data in some form.

@breakertt
Contributor

Oh yes, I think you are right. I didn't notice that the first packet seen is not even a DATA packet. It must be dropped somewhere between the switch (virtual bridge) and the receiver.

@breakertt
Contributor

> Oh yes, I think you are right. I didn't notice that the first packet seen is not even a DATA packet. It must be dropped somewhere between the switch (virtual bridge) and the receiver.

This means it is not a Homa problem, and it will take time to figure out what is wrong in this scenario. So unfortunately, for the near future we will still need to either set max_gso_size or force software GSO.

@johnousterhout
Member

It's been a long time since I've used tcpdump so I'm pretty rusty on it. Does the tcpdump trace in your message indicate that the large request packet actually made it onto the "wire", for some definition of wire? If so, that would mean that the source virtio isn't dropping the packet. Is it possible that the receiving machine is dropping the large packet for some reason?

@breakertt
Contributor

> It's been a long time since I've used tcpdump so I'm pretty rusty on it. Does the tcpdump trace in your message indicate that the large request packet actually made it onto the "wire", for some definition of wire?

The bridge indeed saw the packet coming out of the sender machine.

> If so, that would mean that the source virtio isn't dropping the packet. Is it possible that the receiving machine is dropping the large packet for some reason?

That is also my assumption. And for the virtio NIC, Homa's TSO behaviour on the sender machine is quite similar to TCP TSO, as I mentioned above.
