Mainstream some day? #1
Hi @CodeFetch and thanks for your message. The main reason is definitely performance: with this kernel module we are basically moving the whole data plane (not the control plane!) into kernel space, similarly to other device drivers or tunnel implementations. On the other hand, it allows us to greatly simplify the Linux part of the userspace implementation, as it doesn't need to handle user data anymore. io_uring sounds interesting, but at the moment it doesn't align with the direction we are taking (moving the data plane into the kernel directly), so it wouldn't be meaningful to spend energy on that side right now. Still, if you want to play with it and work on a PoC (that may result in clean patches) for OpenVPN, please do.
This approach is interesting for platforms that will not or cannot support our DCO kernel module. I would expect DCO to still be faster (as network packets will still take the fastest path between the physical network interface and the virtual one, without any context switching at all). But a faster userspace implementation with io_uring might be useful on lower-end routers where getting ovpn-dco running is too difficult, or in setups where the user insists on non-recommended non-GCM ciphers, compression, or other protocol features not available in the ovpn-dco module. Is io_uring supported on platforms other than Linux? I believe I read something about Jens Axboe being involved, who is a Linux kernel developer.
@dsommers io_uring is only supported by Linux as far as I know. It lets you define ring buffers that a kernel thread works on, which allows networking in userspace with almost no context switches. Together with MSG_ZEROCOPY this allows performance equal to running natively in the kernel, minus the skb_copy operation for sending packets. As most resource-limited devices that would profit from this are practically CPEs with higher downstream bandwidth usage, this would not hurt much, I guess. Tests have shown that raw encryption performance in userspace is near what a kernel thread can achieve, e.g. using WireGuard.
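As a rough mental model (a toy simulation only, not the real kernel interface), io_uring replaces per-operation syscalls with two shared ring buffers: the application pushes submission entries (SQEs) and later reaps completion entries (CQEs), while the kernel drains the submission ring in batches. The class and method names below are invented for illustration:

```python
from collections import deque

class ToyRing:
    """Toy model of io_uring's submission/completion queues.
    Real io_uring shares these rings with the kernel via mmap();
    here the 'kernel' step is just an ordinary method call."""
    def __init__(self):
        self.sq = deque()  # submission queue (app -> kernel)
        self.cq = deque()  # completion queue (kernel -> app)

    def submit(self, op, data):
        # App side: enqueue work without a per-operation syscall.
        self.sq.append((op, data))

    def kernel_process(self):
        # Stand-in for the kernel draining the submission ring.
        while self.sq:
            op, data = self.sq.popleft()
            if op == "send":
                self.cq.append(("sent", len(data)))

ring = ToyRing()
for pkt in (b"hello", b"world!"):
    ring.submit("send", pkt)   # two ops queued, zero "syscalls"
ring.kernel_process()          # one batched kernel pass
results = list(ring.cq)        # completions reaped without syscalls
```

The point of the model is that N packets no longer cost N kernel entries; the application and kernel communicate through shared memory and only occasionally synchronize.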
Thanks! So the advantage of ovpn-dco will basically be that it can utilize all the CPU cores on the data plane. OpenVPN 2.x is (still!) single-threaded and will therefore hit some limitations on the server side when more clients are connected, and I expect io_uring in an OpenVPN implementation to be limited in that regard as well. However, for servers with only one client connected, the performance difference might not be that big.
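For context, one common way a userspace VPN server *could* spread work across cores (this is an illustration, not something OpenVPN 2.x implements) is the Linux SO_REUSEPORT socket option, which lets several worker sockets bind the same UDP port so the kernel load-balances incoming flows across them:

```python
import socket

def make_worker_socket(port):
    """Create a UDP socket that shares a port with other workers.
    SO_REUSEPORT (Linux >= 3.9) lets the kernel distribute incoming
    datagrams across all sockets bound to the same address/port."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    s.bind(("127.0.0.1", port))
    return s

# Two "workers" sharing one port; each could run on its own core.
a = make_worker_socket(0)      # port 0: let the OS pick a free port
port = a.getsockname()[1]
b = make_worker_socket(port)   # second worker joins the same port
```

Each worker would then run its own receive/encrypt/send loop, which is roughly how the kernel-side ovpn-dco gets its multi-core advantage for free.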
@CodeFetch Could you share some performance numbers demonstrating the advantage of io_uring?
@huangya90 I don't have any statistics on that anymore, but it was a consistent 20-30% increase in throughput on an Intel 4820K, single-threaded. I profiled it with oprofile: the recvmsg/sendmsg syscalls, which previously accounted for 20-30%, are gone, so that matches.
@huangya90 Keep in mind there is more potential for improvement. As far as I know, OpenVPN not only lacks multithreading support, but it also has no buffer pool and isn't optimized for cache hotness.
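The buffer-pool point can be sketched briefly (illustrative code, not OpenVPN's): instead of allocating a fresh buffer per packet, a pool hands out preallocated buffers and reuses the most recently freed one, which also tends to keep it in the CPU cache:

```python
class BufferPool:
    """Minimal fixed-size packet buffer pool (illustration only).
    Reusing the same few buffers avoids per-packet allocation and
    keeps recently used buffers cache-hot."""
    def __init__(self, count, size):
        self._free = [bytearray(size) for _ in range(count)]

    def acquire(self):
        # LIFO: the most recently released buffer is the most
        # likely to still be in the CPU cache.
        return self._free.pop() if self._free else None

    def release(self, buf):
        self._free.append(buf)

pool = BufferPool(count=4, size=2048)
buf = pool.acquire()
buf[:5] = b"hello"       # fill with packet data
pool.release(buf)
again = pool.acquire()   # the same object comes back (LIFO reuse)
```

A real implementation would add locking (or per-thread pools) and a fallback when the pool is exhausted.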
If so, speeding things up in kernel space is much better. Please refer to the performance numbers [1] of ovpn-dco tested earlier. [1] https://www.mail-archive.com/openvpn-devel@lists.sourceforge.net/msg21584.html
@huangya90 That's the gain of fastd, which is better optimized than OpenVPN. The numbers don't look comparable to me. I guess he used a multicore processor, and there must have been AES hardware acceleration, as ChaCha should otherwise be faster.
Hi,
On Wed, Nov 17, 2021 at 12:29:28PM -0800, Vincent Wiemann wrote:
> @huangya90 That's the gain of fastd which is better optimized than OpenVPN.
> The numbers don't look comparable to me. I guess he used a multicore
> processor and there must have been AES hardware acceleration as ChaCha
> should be faster.
The goal of DCO is, of course, to use all CPU threads.
On processors where AES-NI is available, it is a waste of energy to not
go for AES :-) - on ARM, Chacha is more efficient. DCO can do both.
gert
--
"If was one thing all people took for granted, was conviction that if you
feed honest figures into a computer, honest figures come out. Never doubted
it myself till I met a computer with a sense of humor."
Robert A. Heinlein, The Moon is a Harsh Mistress
Gert Doering - Munich, Germany
@cron2 Yes, but it's like comparing apples with Microsoft. Using all CPU cores is possible with the userspace implementation, too, but it's not implemented in OpenVPN. ovpn-dco will definitely perform better than a userspace version, but not by that much without hardware encryption on a single-core machine. BTW, does ovpn-dco also support layer 2 tunnels?
@CodeFetch We are using OpenVPN on a layer 2 device (tap0), and the process looks like below (RX side). As you can see, it will go into and come out of the kernel twice. I am not sure what fastd does on a layer 2 device/tunnel; do you think that io_uring could also benefit the OpenVPN userspace application (like 20-30%)? I would like to give it a try and would like to hear some advice on io_uring. Thanks!
@andywangevertz Indeed. You will save the syscalls for the reads/writes of the TAP device. TAP devices can ordinarily only accept one packet at a time per syscall.
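The one-packet-per-syscall behaviour can be demonstrated without a TAP device: a Unix datagram socketpair behaves the same way in this respect, since each recv() call returns exactly one packet, so N packets cost N syscalls unless they are batched (e.g. via recvmmsg() or io_uring):

```python
import socket

# A datagram socketpair behaves like a TAP fd here: each recv()
# syscall returns exactly one packet, so three packets cost three
# receive syscalls (plus three send syscalls on the other side).
tx, rx = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)
packets = [b"pkt1", b"pkt2", b"pkt3"]
for p in packets:
    tx.send(p)                                # one syscall per packet

received = [rx.recv(2048) for _ in packets]   # one syscall per packet
tx.close()
rx.close()
```

This per-packet cost is exactly the overhead that batched submission (or moving the data plane into the kernel, as ovpn-dco does) removes.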
I am closing this, but for further discussions, please reach out to the openvpn-devel mailing list, where a broader audience will be able to join the conversation. |
Hi! I've been fiddling with VPNs for some years now. Did you write this module for performance reasons? Shall this become a mainstream module some day?
We had performance issues with OpenVPN some years back due to the context switching of TUN/TAP devices: every packet read requires a syscall. An optimized userspace clone of OpenVPN called fastd was therefore created. It is faster than OpenVPN, but we still had the context-switch bottleneck, so I wrote a hacky patch for fastd to utilize io_uring. With that, the context-switch bottleneck is gone.
Maybe you could adopt this for the OpenVPN userspace version to reach a performance similar to the kernel module:
CodeFetch/fastd@059cdf7