Implement TUN offloads #141

Open
LeeSmet opened this issue Feb 28, 2024 · 3 comments
Labels
type_feature New feature or request

Comments

@LeeSmet
Contributor

LeeSmet commented Feb 28, 2024

Performance profiles show that most time is currently spent in 3 places:

  • read from tun
  • write to tun
  • write to peers

One option which could improve the situation is #102, since larger packets naturally mean fewer syscalls (in the case of a tcp stream in the overlay). The problem there is that larger packets will need to be fragmented if the lower-layer link has a smaller MTU (which is the reason why the MTU is currently set at 1400). While we currently only use stream-based connections, keeping individual packets at MTU 1400 leaves the door open for plain UDP at some point.

The proper way to handle this instead would be to enable TSO (and USO and GRO, while we are at it). Unfortunately, not a lot of info is readily available about this. In a first stage, we'll limit this to Linux. From what I did manage to find so far:

  • Reading from the tun can produce a packet bigger than the MTU (and similarly, a bigger packet can be written).
  • Checksum offloading is required to be enabled.
  • We can implement ioctls on the tun created by the library, so we don't really have to write the tun code from scratch.
  • Ideally vnet_hdr is enabled (this can be done at startup, which requires library changes, or seemingly with an ioctl later, which is the preferred path for now). This prepends a vnet_hdr struct with offload info to the start of the packet (see the sketch after this list).
  • Since the segmentation boundary is defined in the packet, we can't do a vectored write into a bunch of packet buffers. Instead we'll need to allocate a large buffer first, from which we then copy the data.
  • We can allocate this buffer once and reuse it.
  • Since we already have a single read/write loop set up, we can also reuse this buffer for both GRO and TSO/USO.
  • In theory, we can send unsegmented packets with the leading header to peers (if I understand this correctly), but then peers won't be able to handle the packet if they don't support offloading. So packets must be fragmented before sending, which also makes this backward compatible with legacy code.
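
For reference, here is a rough, untested sketch of what the offload setup and the leading header could look like in Rust. The constant values are copied from `<linux/if_tun.h>` and `<linux/virtio_net.h>` and should be double-checked against the kernel headers; the `libc` crate is assumed, and the function names (`enable_offloads`, `split_vnet_read`) are purely illustrative, not part of the tun library we use.

```rust
// Rough sketch (untested): enable TUN offloads and split off the
// virtio_net_hdr that precedes every packet once IFF_VNET_HDR is active.
// Constants are copied from <linux/if_tun.h> / <linux/virtio_net.h>.
use std::io;
use std::os::fd::RawFd;

// From <linux/if_tun.h>: TUNSETOFFLOAD = _IOW('T', 208, unsigned int).
const TUNSETOFFLOAD: libc::c_ulong = 0x4004_54d0;
const TUN_F_CSUM: libc::c_uint = 0x01; // checksum offload, prerequisite for the rest
const TUN_F_TSO4: libc::c_uint = 0x02; // TCP segmentation offload, IPv4
const TUN_F_TSO6: libc::c_uint = 0x04; // TCP segmentation offload, IPv6
const TUN_F_USO4: libc::c_uint = 0x20; // UDP segmentation offload, IPv4 (newer kernels)
const TUN_F_USO6: libc::c_uint = 0x40; // UDP segmentation offload, IPv6 (newer kernels)

/// Ask the kernel to hand us (and accept from us) packets larger than the MTU.
fn enable_offloads(tun_fd: RawFd) -> io::Result<()> {
    let flags = TUN_F_CSUM | TUN_F_TSO4 | TUN_F_TSO6 | TUN_F_USO4 | TUN_F_USO6;
    // SAFETY: plain ioctl on a valid tun fd, passing an integer argument.
    if unsafe { libc::ioctl(tun_fd, TUNSETOFFLOAD, flags as libc::c_ulong) } < 0 {
        return Err(io::Error::last_os_error());
    }
    Ok(())
}

/// Header prepended to every packet when IFF_VNET_HDR is set (10 bytes),
/// mirroring struct virtio_net_hdr from <linux/virtio_net.h>.
#[repr(C)]
#[derive(Clone, Copy, Debug, Default)]
struct VirtioNetHdr {
    flags: u8,        // e.g. VIRTIO_NET_HDR_F_NEEDS_CSUM
    gso_type: u8,     // VIRTIO_NET_HDR_GSO_NONE / _TCPV4 / _TCPV6 / ...
    hdr_len: u16,     // length of the IP + transport headers
    gso_size: u16,    // payload size of each segment on the wire
    csum_start: u16,  // where checksumming starts
    csum_offset: u16, // where to store the checksum, relative to csum_start
}

const VNET_HDR_LEN: usize = std::mem::size_of::<VirtioNetHdr>();

/// Split one read from the tun into the vnet header and the (possibly
/// larger-than-MTU) packet body. Byte order of the fields follows the
/// device's vnet endianness settings; native order is assumed here.
fn split_vnet_read(buf: &[u8]) -> Option<(VirtioNetHdr, &[u8])> {
    if buf.len() < VNET_HDR_LEN {
        return None;
    }
    let hdr = VirtioNetHdr {
        flags: buf[0],
        gso_type: buf[1],
        hdr_len: u16::from_ne_bytes([buf[2], buf[3]]),
        gso_size: u16::from_ne_bytes([buf[4], buf[5]]),
        csum_start: u16::from_ne_bytes([buf[6], buf[7]]),
        csum_offset: u16::from_ne_bytes([buf[8], buf[9]]),
    };
    Some((hdr, &buf[VNET_HDR_LEN..]))
}
```

If something along these lines pans out, the preallocated buffer from the points above would be handed to something like `split_vnet_read` after every read, and a header would be written in front of the payload on the TSO/USO write path.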
@iwanbk
Member

iwanbk commented Apr 25, 2024

> Performance profiles show that most time is currently spent in 3 places:

@LeeSmet

curious, how did you do the profiling?

@LeeSmet
Contributor Author

LeeSmet commented Apr 25, 2024

In my global cargo config I have a section which defines a `profiling` profile, which just adds debug symbols on top of the project's configured release profile:

```toml
[profile.profiling]
inherits = "release"
debug = true
```

Then I build with `cargo build --profile profiling`. This binary is then run with samply (`sudo -E samply record ./target/profiling/mycelium {args}`). The resulting profile can then be inspected (it uses the Firefox Profiler UI by default) to see where the application spends its time.

@iwanbk
Member

iwanbk commented Apr 25, 2024

nice 👍
