Implement TUN offloads #141

Open
LeeSmet opened this issue Feb 28, 2024 · 3 comments
Labels
type_feature New feature or request

Comments

@LeeSmet
Contributor

LeeSmet commented Feb 28, 2024

Performance profiles show that most time is currently spent in 3 places:

  • read from tun
  • write to tun
  • write to peers

One option which could improve the situation is #102, since larger packets naturally mean fewer syscalls (in the case of a tcp stream in the overlay). The problem there is that larger packets will need to be fragmented if the lower-layer link has a smaller MTU (which is the reason why the MTU is currently set at 1400). While we currently only use stream-based connections, keeping individual packets at MTU 1400 leaves the door open for plain UDP at some point.

The proper way to handle this instead would be to enable TSO (and USO and GRO, while we are at it). Unfortunately, not a lot of info is readily available about this. In a first stage, we'll limit this to Linux. From what I did manage to find so far:

  • Reading from the tun can produce a packet bigger than the MTU (and similarly, a bigger packet can be written).
  • Checksum offloading is required to be enabled.
  • We can implement ioctls on the tun created by the library, so we don't really have to write the tun code from scratch.
  • Ideally vnet_hdr is enabled (this can be done at startup, which requires library changes, or seemingly with an ioctl later, which is the preferred path for now). This prepends a vnet_hdr struct with offload info to the start of the packet (see the sketch after this list).
  • Since the segmentation boundary is defined in the packet, we can't do a vectored write into a bunch of packet buffers. Instead we'll need to allocate a large buffer first, from which we then copy the data.
  • We can allocate this buffer once and reuse it.
  • Since we already have a single read/write loop set up, we can also reuse this buffer for both GRO and TSO/USO.
  • In theory, we can send unsegmented packets with the leading header to peers (if I understand this correctly), but then peers won't be able to handle the packet if they don't support offloading. So packets must be fragmented before sending, which also makes this backward compatible with legacy code.
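
For reference, here is a rough, untested sketch of what the offload setup and the leading header could look like in Rust. The constant values are copied from `<linux/if_tun.h>` and `<linux/virtio_net.h>` and should be double-checked against the kernel headers; the `libc` crate is assumed, and the function names (`enable_offloads`, `split_vnet_read`) are purely illustrative, not part of the tun library we use.

```rust
// Rough sketch (untested): enable TUN offloads and split off the
// virtio_net_hdr that precedes every packet once IFF_VNET_HDR is active.
// Constants are copied from <linux/if_tun.h> / <linux/virtio_net.h>.
use std::io;
use std::os::fd::RawFd;

// From <linux/if_tun.h>: TUNSETOFFLOAD = _IOW('T', 208, unsigned int).
const TUNSETOFFLOAD: libc::c_ulong = 0x4004_54d0;
const TUN_F_CSUM: libc::c_uint = 0x01; // checksum offload, prerequisite for the rest
const TUN_F_TSO4: libc::c_uint = 0x02; // TCP segmentation offload, IPv4
const TUN_F_TSO6: libc::c_uint = 0x04; // TCP segmentation offload, IPv6
const TUN_F_USO4: libc::c_uint = 0x20; // UDP segmentation offload, IPv4 (newer kernels)
const TUN_F_USO6: libc::c_uint = 0x40; // UDP segmentation offload, IPv6 (newer kernels)

/// Ask the kernel to hand us (and accept from us) packets larger than the MTU.
fn enable_offloads(tun_fd: RawFd) -> io::Result<()> {
    let flags = TUN_F_CSUM | TUN_F_TSO4 | TUN_F_TSO6 | TUN_F_USO4 | TUN_F_USO6;
    // SAFETY: plain ioctl on a valid tun fd, passing an integer argument.
    if unsafe { libc::ioctl(tun_fd, TUNSETOFFLOAD, flags as libc::c_ulong) } < 0 {
        return Err(io::Error::last_os_error());
    }
    Ok(())
}

/// Header prepended to every packet when IFF_VNET_HDR is set (10 bytes),
/// mirroring struct virtio_net_hdr from <linux/virtio_net.h>.
#[repr(C)]
#[derive(Clone, Copy, Debug, Default)]
struct VirtioNetHdr {
    flags: u8,        // e.g. VIRTIO_NET_HDR_F_NEEDS_CSUM
    gso_type: u8,     // VIRTIO_NET_HDR_GSO_NONE / _TCPV4 / _TCPV6 / ...
    hdr_len: u16,     // length of the IP + transport headers
    gso_size: u16,    // payload size of each segment on the wire
    csum_start: u16,  // where checksumming starts
    csum_offset: u16, // where to store the checksum, relative to csum_start
}

const VNET_HDR_LEN: usize = std::mem::size_of::<VirtioNetHdr>();

/// Split one read from the tun into the vnet header and the (possibly
/// larger-than-MTU) packet body. Byte order of the fields follows the
/// device's vnet endianness settings; native order is assumed here.
fn split_vnet_read(buf: &[u8]) -> Option<(VirtioNetHdr, &[u8])> {
    if buf.len() < VNET_HDR_LEN {
        return None;
    }
    let hdr = VirtioNetHdr {
        flags: buf[0],
        gso_type: buf[1],
        hdr_len: u16::from_ne_bytes([buf[2], buf[3]]),
        gso_size: u16::from_ne_bytes([buf[4], buf[5]]),
        csum_start: u16::from_ne_bytes([buf[6], buf[7]]),
        csum_offset: u16::from_ne_bytes([buf[8], buf[9]]),
    };
    Some((hdr, &buf[VNET_HDR_LEN..]))
}
```

If something along these lines pans out, the preallocated buffer from the points above would be handed to something like `split_vnet_read` after every read, and a header would be written in front of the payload on the TSO/USO write path.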
@iwanbk
Member

iwanbk commented Apr 25, 2024

> Performance profiles show that most time is currently spent in 3 places:

@LeeSmet

curious, how did you do the profiling?

@LeeSmet
Contributor Author

LeeSmet commented Apr 25, 2024

In my global cargo config I have a section which defines a `profiling` profile, which just adds debug symbols on top of the project's configured release profile:

```toml
[profile.profiling]
inherits = "release"
debug = true
```

Then I build with `cargo build --profile profiling`. This binary is then run with samply (`sudo -E samply record ./target/profiling/mycelium {args}`). The resulting profile can then be inspected (it uses the Firefox Profiler UI by default) to see where the application spends its time.

@iwanbk
Member

iwanbk commented Apr 25, 2024

nice 👍
