NAT64 performance evaluation #282
First of all, I want to thank you for the amazing work you did on NAT64.
My thesis deals with NAT64 solutions for Linux, including Jool. I did some performance evaluation, and I want to share the results with you and discuss some optimization possibilities. I ran throughput tests on a 10 Gbps topology.
NAT64 router:
- Intel(R) Xeon(R) CPU D-1587 @ 1.70GHz
- 4x 4GB DDR4 2133MHz
- Ethernet Connection X552 10 GbE SFP+
- Linux 3.10.0-693.17.1.el7.netx.x86_64
I tested Tayga, Jool, Jool in a network namespace, and, for comparison, pure IPv6 routing, pure IPv4 routing, and iptables masquerade. I also captured the CPU load on the NAT64 router during the tests. Here are the results.
Jool performs well and can route around 1 Mpps on my topology. That is great, but compared to regular iptables masquerade (3 Mpps), there is still room for optimization.
I tried to research what slows it down. I took some perf captures during the tests and found that a lot of time is spent waiting on locks.
The first capture shows that the waiting happens in the rfc6056_f function.
I suppose the function randomly assigns port numbers based on MD5 checksums. The MD5 checksums are calculated using the crypto_shash functions inside a critical section. I could not figure out why this code is in a critical section. Are the crypto_shash functions thread-unsafe? As an experiment, I removed the critical section (deleted the lock and unlock calls). It seemed to work, but there was no performance increase. Another perf capture showed that the time is now spent in the get_random_bytes function, which is used to generate the IPv4 Identification field for every packet.
I tried replacing the random Identification generation with a static value. The performance increase was about 30%.
An interesting fact: to get the 30% increase, both "optimizations" must be present. Removing just the get_random_bytes calls has no impact.
I was wondering whether it is necessary to generate a random Identification for every packet. I captured Tayga's translations, and it seems that Tayga sets the Identification to zero. Cisco IOS 15.4(1)T: same story.
I read RFCs 2765, 6145, and 7915; each defines the IP/ICMP Translation Algorithm and obsoletes the previous one. The newest one, RFC 7915, appears to make the Identification mandatory:
"Identification: Set according to a Fragment Identification generator at the translator."
On the other hand, generating a random Identification makes reuse of the IP ID field occur probabilistically (RFC 4963).
My questions are: Is the critical section in rfc6056_f mandatory? Is generating the IPv4 Identification necessary? Can the Identification generation be optimized further, for example by generating a random Identification when the BIB entry is created and then just incrementing it with every packet? Is there anything else I can do to get better performance?
Wow, thank you for your hard work!
Probably not. I don't remember, because this was 3 years ago, but my guess is that I immediately defaulted to locking all concurrent usage of the variable
However, according to the Linux history, that choice was completely misdirected: "shash is reentrant." So it does look like the spinlock can be safely discarded.
I was, in fact, considering turning
ID generation is mandatory now. They killed the zero ID when they purged atomic fragments. I can't recall the rationale off the top of my head, but presumably it is buried somewhere in RFC 8021.
RFCs 8021 and 7915 are somewhat new. Tayga and Cisco are probably following the old rules.
Well, I'd be very surprised if we didn't think of just using a global monotonic counter. There should be a reason for this. I think there is some security risk if the ID is predictable, but this tends to slide off my brain over time. Let me check my e-mails.
Even if the ID is meant to be random, what's not mandatory is the usage of the
Yeah, they seem to use
It creates a random base number only the first time, and then increases monotonically. On subsequent calls, it takes less than 1/20th of the time
Guess we should use that as well.
Uploaded optimizations to the issue282 branch. For the moment, the code only compiles in kernels 4.2+.
I found a quirk while tweaking:
I'll try fixing it tomorrow.
Super interesting stuff @JohnyGemityg!
I am assuming that in both situations, the Jool instance is of the Netfilter type? Have you tried the iptables type too? It would be interesting to see whether there is any difference in performance between the two.
If you have the time and opportunity, it would also be very interesting to see how SIIT mode compares to NAT64.
Yes, all tests were of the Netfilter type.
I tried SIIT and the result is 2 Mpps, which is the maximum I can get on the current setup. So: great result, no problems.
I also tried Netfilter vs. iptables.
I would say there is no difference.