
NAT64 performance evaluation #282

Closed
JohnyGemityg opened this issue Apr 16, 2019 · 18 comments

@JohnyGemityg commented Apr 16, 2019

Hi,

First of all, I want to thank you for the amazing work you did for NAT64.

My thesis deals with NAT64 solutions for Linux, including Jool. I did some performance evaluation and would like to share the results with you and discuss some optimization possibilities. I ran throughput tests on a 10 Gbps topology.

[image]

NAT64 router:
- CPU: Intel(R) Xeon(R) CPU D-1587 @ 1.70GHz
- RAM: 4x 4GB DDR4 2133MHz
- NIC: Ethernet Connection X552 10 GbE SFP+
- Kernel: Linux 3.10.0-693.17.1.el7.netx.x86_64

I tested Tayga, Jool, and Jool in a network namespace, plus pure IPv6 routing, pure IPv4 routing, and iptables masquerade for comparison. I also captured the CPU load on the NAT64 router during the tests. Here are the results.

[image]

[image]

Jool performs well and can route around 1 Mpps on my topology. That is great, but compared to regular iptables masquerade (3 Mpps), there is still room for optimization.

I did some research to find out what slows it down, taking perf captures during the tests. It turns out that a lot of time is spent waiting on locks.

[image]

The first capture shows that the waiting happens in the rfc6056_f function.

[image]

As far as I can tell, the function assigns port numbers pseudo-randomly based on MD5 checksums. The MD5 checksums are computed using the crypto_shash functions inside a critical section. I could not figure out why this code needs a critical section. Are the crypto_shash functions thread-unsafe? As an experiment, I removed the critical section (deleted the lock and unlock calls). It seemed to work, but there was no performance increase. Another perf capture showed that the time is now spent in the get_random_bytes function, which is called to generate the IPv4 Identification field for every packet.
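For context, the structure the port selection presumably follows is RFC 6056's "algorithm 3" (double-hash port selection), where a keyed hash F() offsets a counter into the ephemeral range. Below is a hypothetical userspace sketch; a toy FNV-1a hash stands in for the real MD5-based F(), and all names are made up. The relevant observation is that F() is a pure function of its inputs, so nothing in it obviously requires a lock.

```c
#include <stdint.h>

#define EPHEMERAL_MIN 1024u
#define EPHEMERAL_MAX 65535u

/* Stand-in for RFC 6056's F(): a keyed hash over the connection
 * addresses. It is a pure function of its inputs; no shared state
 * is touched, hence no obvious need for a spinlock. */
static uint32_t toy_f(uint32_t saddr, uint32_t daddr,
		      uint16_t dport, uint32_t secret)
{
	uint32_t h = 2166136261u;
	uint32_t in[4] = { saddr, daddr, dport, secret };
	int i;

	for (i = 0; i < 4; i++) {
		h ^= in[i];
		h *= 16777619u;
	}
	return h;
}

/* Choose an ephemeral port: offset a counter by the hash and wrap
 * the result into the ephemeral range. */
static uint16_t pick_port(uint32_t saddr, uint32_t daddr,
			  uint16_t dport, uint32_t secret,
			  uint32_t next_ephemeral)
{
	uint32_t num = EPHEMERAL_MAX - EPHEMERAL_MIN + 1;
	uint32_t off = toy_f(saddr, daddr, dport, secret);

	return (uint16_t)(EPHEMERAL_MIN + (next_ephemeral + off) % num);
}
```

The same 5-tuple and counter always yield the same port, which is what makes lockless evaluation plausible.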

[image]

I tried removing the random Identification generation and setting the field to a static value. The performance increase was about 30%.

Interestingly, both "optimizations" must be present to get the 30% increase; removing just the get_random_bytes calls has no impact.

I was wondering whether it is necessary to generate a random Identification for every packet. I captured Tayga's translations, and it seems that Tayga sets the Identification to zero. Cisco IOS 15.4(1)T, same story.

I read RFCs 2765, 6145, and 7915; each defines the IP/ICMP Translation Algorithm and obsoletes its predecessor. The newest, RFC 7915, seems to say that generating the Identification is now mandatory:

"Identification: Set according to a Fragment Identification generator at the translator."

On the other hand, as RFC 4963 points out, generating random Identifications causes reuse of the IP ID field to occur probabilistically.

So my questions are: Is the critical section in rfc6056_f mandatory? Is generating the IPv4 Identification necessary? Can the Identification generation be optimized further, for example by generating a random Identification when the BIB entry is created and then just incrementing it with every packet? Is there anything else I can do to improve performance?

Thank you.

@ydahhrk (Member) commented Apr 16, 2019

Wow, thank you for your hard work!

I could not figure out why is this code in the critical section. Are crypto_shash functions thread unsafe?

Probably not. I don't remember, because this was 3 years ago, but my guess is that I defaulted to locking all concurrent usage of the shash variable simply because it is global and not constant. I tend to do that.

However, according to Linux's history, this choice was completely misdirected: "shash is reentrant." So it does look like the spinlock can be safely discarded.

I was, in fact, considering turning shash into per-cpu variables to prevent the spinning, but the priority of this has always been below a bunch of other stuff. But the spinlock removal actually sounds like it could be done in a snap. (Perhaps in a pull request, even. ;) )
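A userspace analogue of the per-CPU idea might look like the following sketch. In the kernel this would be DEFINE_PER_CPU plus get_cpu_ptr(); here C11 _Thread_local stands in, and the struct contents are purely illustrative.

```c
#include <stdint.h>

/* Illustrative stand-in for per-CPU hash state: each execution
 * context owns its own copy, so concurrent callers never contend
 * and no spinlock is needed. */
struct hash_state {
	uint32_t scratch[4]; /* imagine crypto_shash working memory */
};

static _Thread_local struct hash_state per_thread_state;

/* Returns this thread's private state; two threads calling this
 * concurrently get distinct pointers and cannot race. */
static struct hash_state *get_local_state(void)
{
	return &per_thread_state;
}
```

The trade-off versus simply dropping the lock is memory: one state copy per CPU (or thread) instead of one shared, reentrant object.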

I was wondering whether it is necessary to generate a random Identification for every packet. I captured Tayga's translations, and it seems that Tayga sets the Identification to zero. Cisco IOS 15.4(1)T, same story.

ID generation is mandatory now. They killed the zero ID when they purged atomic fragments. I can't recall the rationale off the top of my head, but presumably it is buried somewhere in RFC 8021.

RFCs 8021 and 7915 are somewhat new. Tayga and Cisco are probably following the old rules.

Can the Identification generation be optimized further, for example by generating a random Identification when the BIB entry is created and then just incrementing it with every packet?

Well, I'm going to be very surprised if we didn't think of just having a global monotonic counter; there should be a reason we rejected it. I think there is some security risk if the ID is predictable, but details like this tend to slide off my brain over time. Let me check my e-mails.

Even if the ID is meant to be random, what's not mandatory is the use of the get_random_bytes() function specifically. Maybe we could offer a slightly less random but faster option. If it's really slowing Jool down that much, I'd guess the kernel uses some other method that we should rip off.

@ydahhrk (Member) commented Apr 16, 2019

For example, generating a random Identification when the BIB entry is created and then just incrementing it with every packet?

I'd guess the kernel uses some other method that we should rip off.

Yeah, they seem to use __ip_select_ident().

It creates a random base number only the first time, and then increases monotonically. On subsequent calls, it takes less than 1/20th of the time get_random_bytes() does.

Guess we should use that as well.
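The scheme described above (pay for randomness once to pick a base, then emit values by cheap increment) can be sketched in userspace like this. The names are hypothetical, and the real __ip_select_ident() additionally perturbs the counter per flow.

```c
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

static uint32_t ident_counter;
static int ident_seeded;

/* Produce an IPv4 Identification value: the expensive randomness
 * happens only on the first call; afterward the per-packet cost
 * is just an increment and a truncation to 16 bits. */
static uint16_t next_ident(void)
{
	if (!ident_seeded) {
		srand((unsigned int)time(NULL));
		ident_counter = (uint32_t)rand(); /* random base, once */
		ident_seeded = 1;
	}
	return (uint16_t)ident_counter++;
}
```

Consecutive calls return consecutive IDs (mod 2^16), which preserves the uniqueness-within-flow property RFC 7915 cares about without calling a CSPRNG per packet.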

ydahhrk added a commit that referenced this issue Apr 16, 2019

Apply two optimizations:
1. Remove spinlock from the RFC 6056 code.
   The protected variable was reentrant, so the lock was pointless.
2. Remove get_random_bytes() from the algorithm that computes the
   IPv4 Identification field.
   The alternative, __ip_select_ident(), seems to be the kernel's
   intended Identification generator.

Progress on #282.

I still don't know why both optimizations are apparently needed
to see any improvement. Hmmm...
@ydahhrk (Member) commented Apr 16, 2019

I uploaded the optimizations to the issue282 branch. For the moment, the code only compiles on kernels 4.2+.

I found a quirk while tweaking: next_ephemeral (from RFC 6056, algorithm 3) is not used. It seems the relevant code was lost during some old refactor. This is probably also slowing some operations down.
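To illustrate why the lost counter matters: in RFC 6056's algorithm 3, next_ephemeral must advance after every allocation, so that consecutive connections to the same destination probe different ports instead of restarting from the same offset. A minimal sketch with hypothetical names:

```c
#include <stdint.h>

static uint32_t next_ephemeral;

/* Return the current counter value and advance it. If the
 * post-increment is dropped (as in the refactor mentioned above),
 * every allocation starts its port search from the same point,
 * which lengthens collision chains under load. */
static uint32_t take_ephemeral(void)
{
	return next_ephemeral++;
}
```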

I'll try fixing it tomorrow.

@ydahhrk ydahhrk added the Performance label Apr 16, 2019

@JohnyGemityg (Author) commented Apr 17, 2019

Good job! I am going to update the kernel on the test router and rerun the tests.

@JohnyGemityg (Author) commented Apr 17, 2019

Kernel: 4.4.178-1.el7.elrepo.x86_64
Pure IPv6: 2.3 Mpps
Jool-master: 0.72 Mpps
Jool-issue282: 1.14 Mpps

I think this is a great result.

@ydahhrk (Member) commented Apr 17, 2019

Nice!

Question: When you tested Jool in namespace, were the clients located in the same machine?

@JohnyGemityg (Author) commented Apr 17, 2019

In all tests, pc-gen generates traffic to pc-col through rtr-netx (the NAT64). Src: 2001:db8:111::2, dst: 2001:db8:4::192.168.112.2.

@ydahhrk (Member) commented Apr 17, 2019

Oh.

So isn't the 100% CPU utilization in the A.8.d graph rather worrying?

@JohnyGemityg (Author) commented Apr 17, 2019

Yes, it is. The PPS result in the virtual network namespace is low: 444 Kpps in the current setup. This means the 10 Gbit line is not saturated and the processor is overloaded. I suppose it's the namespace overhead.

[image]

@ydahhrk (Member) commented Apr 17, 2019

Hmm. But it still sucks that A.8.c stays at 80% while iptables NAT (A.8.e) seems to stay at 40%. (What's with the holes?)

Are you planning to redo A.8.c and A.8.d with the new optimizations?

Otherwise, how are you generating the traffic?

@JohnyGemityg (Author) commented Apr 17, 2019

The traffic is generated with PF_RING. It reads a PCAP file and replays it to the network card.

The holes in the IPv4 graph are due to bad flow distribution between cores. I can run it again with more traffic flows to utilize all cores.

Jool-282
[image]

Jool-282-namespace
[image]

Pure IPv6
[image]

@ydahhrk (Member) commented Apr 17, 2019

Ok. Indeed, the results look reasonable and the performance seems solid to me.

I'll work on finishing the next_ephemeral code and try to release Jool 4.0.1 next week.

If you find more bottlenecks, reports will be welcome.

@ydahhrk ydahhrk added this to the 4.0.1 milestone Apr 17, 2019

@JohnyGemityg (Author) commented Apr 17, 2019

Ok, thanks.

@ydahhrk (Member) commented Apr 17, 2019

BTW: This was a pretty substantial contribution. If you want credits in Jool's README, just state what you'd like included.

@JohnyGemityg (Author) commented Apr 18, 2019

It would be an honor! Jan Pokorny - FIT VUTBR. Thank you.

@toreanderson (Contributor) commented Apr 21, 2019

Super interesting stuff @JohnyGemityg! 👏👍

I assume that in both situations the Jool instance is of the Netfilter type? Have you tried the iptables type too? It would be interesting to see whether there is any difference in performance between the two.

If you have the time and opportunity, it would also be very interesting to see how SIIT mode compares to NAT64.

@JohnyGemityg (Author) commented Apr 23, 2019

Hi @toreanderson,

yes, all tests used the Netfilter type.

I tried SIIT and the result is 2 Mpps, which is the maximum I can get on the current setup. So: a great result, no problems.

I also tried the Netfilter vs Iptables.

netfilter: 1139.4 kpps
iptables: 1154.9 kpps

I would say there is no difference.

@ydahhrk (Member) commented Apr 26, 2019

v4.0.1 released; closing bug.

@ydahhrk ydahhrk closed this Apr 26, 2019
