OOM on Jool boxes, possible leak? #410
Some additional data: After 12 hours I'm up to about 3GiB of memory in use. The machine that most recently locked up showed OOM messages. The processes that were killed were all listed as <2MiB RSS, so my guess is that jool is eating up kernel memory:
I rebooted the machine but didn't put it back into service immediately so I could get a baseline on memory. It was receiving sync state information, but not forwarding any traffic. I watched /proc/slabinfo for a bit and while session_nodes and bib_nodes were going up, jool_joold_nodes stayed constant. As soon as I started forwarding traffic through the box, jool_joold_nodes started to increase regularly. I then failed over and looked at the machine that was running out of memory. As soon as it wasn't forwarding traffic, jool_joold_nodes stayed constant even as session_nodes and bib_nodes fluctuated. So it seems that the memory consumption is specifically from packet processing and not states. I restarted the joold processes (I have 2 instances in different netns), and as soon as I did that /proc/slabinfo showed the jool_joold_nodes count drop to < 1000. Free memory on the box shot back up to 3.5GiB. I did not need to unload the kernel module; simply restarting the joold instance was enough to reclaim the memory. Any thoughts on what I should check on next? |
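(For anyone else chasing this, a minimal way to watch those same slab caches, assuming the cache names shown above; nothing Jool-specific beyond the grep pattern:)

```
# Sample the Jool-related slab caches every 2 seconds; names taken from /proc/slabinfo above
watch -n 2 "grep -E 'jool_joold_nodes|session_nodes|bib_nodes' /proc/slabinfo"
```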
We have a similar issue and are not sure what you need: Jool 4.1.10 on a 6.1.42 kernel, and we see increases of 1.5 MB/minute on |
joold->queue is a listing of joold sessions whose fetch needs to be postponed because of the Netlink channel's limitations. Quite surprisingly, the code was never actually fetching them, which is why they were queuing indefinitely. I suspect this has gone unnoticed because, depending on the environment, joold seemingly needs lots of traffic before it starts queuing. I myself had to disable ss-flush-asap to be able to replicate the bug. Likely f1xes #410, but I noticed a couple of extra issues that need to be addressed before a new release. It seems joold has gotten dusty again. It makes me wonder whether people are using it. In particular, I had to disable --advertise to prevent it from synchronizing uninitialized memory. Will need to look into this in the following days.
I believe I fixed it. Patch uploaded to branch issue410. I noticed two more bugs while reviewing joold, so I will need to work on this further before the release is ready. For the sake of transparency:
|
Thank you so much for looking into this! I'm just trying to understand your bug descriptions and have a couple questions:
To answer your musing in the commit message, I am definitely using Jool! My campus is working toward being IPv6-native internally, which means I need to have NAT64 functioning to provide connectivity to the legacy internet. My other favorite firewall/router (OpenBSD) doesn't have all the NAT64 features that Jool does, so that's why we are using it. I realize it's bugfix/maintenance at this point, but I'm hoping that since all the functionality I need is already in the project we can keep running it to provide this service. Thanks! |
I didn't actually test it, but the code review says "yes." (By which I mean it doesn't work at all.)
Yes;
Sure, but the queuing is probably already fixed.
I'm glad, but I was specifically wondering about joold. It's an optional feature that has often been problematic (to the point that I actually regret having implemented it), and the number of bug reports doesn't seem to match the state of the code. I mean, I casually found three bugs in a single review. That's not healthy. |
We actually had it off (we're doing active/passive but with two instances as a kind of poor-man's load balancing). I've enabled it again for now.
My apologies for the misunderstanding; I thought it was the project as a whole. In terms of joold, a vote of support for that as well. We picked Jool in part due to the state sync feature as we wanted to run things in a cluster. Thanks again! |
For #410. I more or less finished the implementation (albeit not the testing), but the unit tests revealed that one of my assumptions regarding Generic Netlink is incorrect. nla_put() cannot be trusted to respect the allocated packet size, because alloc_skb() can reserve more tail area than was requested. This means ss-max-payload has to be enforced manually. That by itself wouldn't be enough to justify another rewrite, but after thinking about it, I realized a more natural implementation would also reduce the range of the spinlock... which is currently not the best. Lots of performance to be gained from switching. So, checkpointing. I might have to return to this implementation if the new one turns out to be unviable.
Ok, all three bugs have been fixed in the issue410 branch. Two notes:
Can you run it from source? It'd be great to get some feedback before I release it. |
Sorry for the delay on this; some other emergencies kept getting in the way. I have checked out the issue410 branch from git and built Jool from source. I changed our script to use ss-max-sessions-per-packet and dialed the value down to 8 (we're using HSR, which has a lower MTU, so I wanted to make sure we weren't going over). Everything is up and running and I've done some light testing. /proc/slabinfo shows bib entries going up, but no runaway joold_nodes, which is great! I'll give it a few days to see if memory use climbs, or if we notice any regressions. Thanks for the quick patch! |
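(In case it helps anyone replicating this: a sketch of how such a global is applied, assuming the default instance in the current namespace and the option name used in the issue410 branch:)

```
# Cap sessions per joold packet; the value 8 was chosen for a lower-MTU (HSR) sync link
sudo jool global update ss-max-sessions-per-packet 8
```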
Reopening until release. |
And just to provide an update, memory usage is holding steady on the machines, with no mention of joold_nodes. We're still in light testing (students have not returned to campus), but the indicators seem very promising. |
Update from today. Looks like we might still be leaking memory, but not in the same place (and not nearly as quickly, since I just noticed this after 1+ week of uptime). Would you like me to open a separate issue to track? Short version is kernel memory is growing, and this time it's all being hogged by hugepages:
I will see if unloading the kernel module resets it when I have a minute to try. |
I had a minute. ;-) Shutting down the jool instance and unloading the kernel modules did not release the memory. I'm rebooting this box now to recover the RAM and we'll keep an eye on it. So, not certain that it's a jool issue (e.g., could be netfilter or something else), but that's all that this box is doing so it seems connected. |
I personally don't mind either way. I suppose it'd depend on whether gehoern wants to keep getting notifications from this thread.
Do you still get increasing hugepages if you disable joold? If you run joold in active/passive mode, which instance gets the hugepages? Most of Jool's allocations are tracked by this thing. If you enable it, wait for some hugepages, then unload the module; maybe Jool will tell us what it's leaking.
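(A rough sketch of that procedure, with placeholder instance and module names; the leak report only appears if the tracker was compiled in:)

```
# After some hugepages have accumulated, tear Jool down and check the kernel log
sudo jool instance remove "<instance name>"    # repeat per instance / netns
sudo modprobe -r jool jool_siit jool_common    # module names may differ by version/setup
sudo dmesg | tail -n 50                        # the leak report, if any, shows up here
```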
|
We're active/active so it's happening on both hosts. I'm trying to turn down journaling to make sure it's not just logging that's chewing up memory. I recompiled with the requested flag on one of the machines. Starting it back up and watching dmesg for a few minutes, I got:
Several traces happened in a short time frame (dozens), but the IP/ports seem to be the same for most (I only count two distinct tuples across all the messages). After that it got quiet again for a bit, with just some more "Too many sessions deferred" messages. I can send the full dmesg output for this brief test if that's helpful. |
Just to be clear: This is not really strange, it just means the two instances did not start at the same time (an
This is indeed strange. Also, none of this is a consequence of the
Sorry, I didn't realize I had to unload the module. When I did, there wasn't anything in dmesg about memory, just the unloading messages:
Is there somewhere else I should check? |
No. Did you uninstall the Debian package? If both the Debian package and a custom Jool are installed, the former takes precedence.
Note: if you uninstall the Debian package, you also lose the systemd unit file. If you're starting Jool via systemd, here are instructions to restore the unit files. Start with the

I uploaded a new commit to issue410. It's optional. It reduces the session desync messages, and also bumps Jool's version number slightly, so you can more easily tell whether you're running the custom binary or the Debian-packaged Jool:
|
Might help monitor joold, as well as debug #410. Print them with jool stat display --all | grep JSTAT_JOOLD
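(A hedged usage sketch for a box like the one above with instances in separate network namespaces; the namespace names are placeholders:)

```
# Dump the joold counters per instance; namespace names are examples only
for ns in ns-nat64-a ns-nat64-b; do
    echo "== $ns =="
    sudo ip netns exec "$ns" jool stat display --all | grep JSTAT_JOOLD
done
```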
I'm still worried about you getting so many

So, new commit. Version number bumped to 4.1.10.2. This one prints a few extra stats that might help us monitor kernelside joold's state:
You can get their descriptions by adding
|
OK, recompiled off that branch with the new commits. Been running for about 16 hours. I took a stats dump and then unloaded the module. Still not seeing any leak info in dmesg. We have 3 instances (1 SIIT, 2 NAT64) running on the box; here are the stats from all:
However, I'm not sure if the numbers for any of these counters are alarming. Anything I should try to drill down on? |
Waitwaitwaitwaitwait. WhAt iS HaPpEnInG?!?
The numbers more or less tell me that both NAT64 instances are receiving ALL the joold packets; even the ones they themselves multicasted. This is backwards, and it seems to confirm the theory that they're running on the same machine. They're subscribed to literally the exact same Netlink multicast group... in the same kernel.

We're going to have to zoom out for a sec and scrutinize your setup. Please draw your network, provide your configuration, and explain what you're trying to do. (If you need to obfuscate your IPs, please obfuscate them consistently.)

The entire point of Active/Active is load balancing. You don't get anything out of the load balancing if both artifacts are running on the same processor. If all you want is a backup instance in case one of them crashes, PLEASE do Active/Passive, and also place the instances in different (at least virtual) machines. Jool is a kernel module, which means that if one of them crashes, the entire system will go down with it. (And that includes the other Jool, if it's on the same system.)
You don't need to wait 16 hours to find out if you did this correctly. Just compile the module with
If you don't see the " |
I realize I haven't explained the architecture, and it sounds like I'm crazy. I'll DM you the details on the network, as it's more than I can easily fit here (and I don't want to post a full net diagram). It probably is crazy, but hopefully not in a breaking sort of way. For now, regarding the module, I think the issue is that I'm installing via DKMS. When I run "make", that must only build userspace (make install doesn't change what's in /lib/modules). How do I pass the MEMLEAK flags to dkms? |
One way is to add
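(For the record, a rough sketch of the DKMS route under a couple of assumptions: the path and version below are examples, and the exact define and build variable Jool's makefiles expect aren't spelled out here, so treat them as placeholders:)

```
# 1. Edit the module's DKMS config (example path/version):
sudoedit /usr/src/jool-4.1.10/dkms.conf
# 2. Append the define to the MAKE[0] line, e.g. via a kbuild variable:
#    MAKE[0]="make ... CFLAGS_MODULE=-D<MEMLEAK_DEFINE>"
# 3. Rebuild and reinstall the module through DKMS:
sudo dkms build jool/4.1.10 --force
sudo dkms install jool/4.1.10 --force
```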
|
Observations by stat:
Oddities by stat:
(I'd expect all the
I'd expect both primaries to be queuing. The secondaries should not be queuing, unless one of the primaries has gone offline. This might just be a consequence of the previous oddity; B is flushing quickly, which means queuing is discouraged.

Observations by instance:
This instance received 7311 packets from IPv6; they were all translated successfully, and they were all joold'd queuelessly because they all happened more than

It didn't receive any IPv4 packets; not even as responses to the IPv6 packets. It seems the Internet is correctly routing all lower packets to A, while the LAN isn't. (Though stranded lower packets are few.) This is the only instance that isn't printing
The counters suggest this instance is receiving more traffic that belongs to its primary, through both IPv's. It also had to drop some sessions. However, the advertise suggests there was a period in which the primary was inactive, and I suppose that explains both quirks.
Other than

But
|
Maybe someone left a cron job that suddenly spikes the traffic at midnight, or something? I suppose I could add stat counters to keep track of the queue length and ACK RTT. You could decide your
Sure, they might be recovering on their own. Also, many sessions joold sends are redundant because they need to be updated very shortly afterwards. |
Sorry, been very busy these last couple weeks. I'm going to dump out the current status of the counters just so we have recent numbers and so we can see if there are any changes. After this I'll double-check all the settings you mentioned (
|
OK, I have reconfigured the instances with the following
It's not possible to restart both instances simultaneously without causing an outage, so I restarted node b followed by node a. Below are the stats for both shortly after both instances were brought back online so we have a basis for comparison:
|
A quick check-in after running this for about a week. Anecdotally, the number of "Too many sessions deferred!" messages has diminished; we're now seeing hours between some of the messages, whereas before it was several times per hour. We still have some bursts that are closer together, but overall the rate is slowing. Assuming this improvement is from tweaking the parameters, should we continue to change them? I'm guessing we would need to continue increasing

Below I'll post our current counters just for reference, and so we can get an absolute sense of how things are going.
|
Analysis attached. (Tip: Hide columns B, C, G, and H.) I don't see anything alarming anymore, though I still don't quite like the numbers. Excluding the sessions it dropped because of

But no Netlink packets are getting dropped, so it seems they're getting lost in the network.
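(If it helps rule out the wire, one quick check for drops on the link carrying the sync multicast; the interface name is a placeholder:)

```
# Look for RX/TX drop counters on the interface used for joold traffic
ip -s link show dev <sync-iface>
```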
Well... ideally yes, though the
Hmm... HMMMM. 1 second is still a lot in computer time. I'm trying to think how we might graph the average RTT with those ACKs, but since there are a lot every second, the data set of a single day would probably be gargantuan.
The more data you can send at a time, the less you have to queue. You can dramatically reduce your needed capacity by sending more greedily. I mean... your current capacity is 2048. Since you can fit 26 sessions per packet... that's a 79-sized packet queue. You have 79 packets waiting for their turn, and you can only send one at a time, each having to wait for the previous one's ACK. And while that happens, more sessions are arriving. And sure, since those traffic spikes seem to be rare, a mega queue will probably be able to retain the sessions long enough for the queue to gradually clear naturally. But it's one of those things where increasing

The way to know if you're saturating the Netlink channel is by checking whether the ACKs are getting dropped. That would be
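(Spelling out the arithmetic above: the queue length in packets is the session capacity divided by the sessions that fit in one packet, rounded up:)

```
# 2048-session capacity, 26 sessions per packet -> 79 packets waiting at worst
echo $(( (2048 + 26 - 1) / 26 ))
```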
I'm trying to think why you would not want to do this, but I'm coming out empty-handed. Outside of the unit tests and cleanup, Jool never traverses more than
Right... I think. I just realized the word "flush" is misleading. I don't see anything in your text that suggests you've been misled, but just to be clear: it's not a strict, full "flush." It only "flushes" the amount of sessions it can send at a time (in your case, 26), because it doesn't drop them; it just forces a single packet out. Sessions that arrive while the queue is full are dropped. So in times of stress, the queue fills up and new sessions are dropped. The deadline "flushes" 26 sessions, so 26 new sessions make it in, and the rest are dropped again.
|
(Also, remember that |
Thank you so much for that spreadsheet; it helped make a lot of things clearer. Additionally, your description of

I've lowered
Wait... what? ... ... [Scratches head]

Ok. Look, full disclosure. Let's let the code talk. That function returns

As you can see, the

So yes; it can be used to skip slow ACKs (assuming they're really being slow, IDK), but in essence it's more of a general stagnation-avoidance policy. Sorry for the confusion.
ACK. |
OK, we've been back for a bit so I gathered some more stats and pushed them into a spreadsheet. The loss rates are less than 1%, but don't look a whole lot better than the last time (this is with a 100ms flush deadline). Everything is still being ACKed, so I'll lower the flush deadline again to see if there are any changes. |
Did you also increase |
Yes, simultaneously decreased

I've lost track a little bit of what the different numbers mean. Should I continue to lower the deadline and increase the buffer, or now that
Yes
The capacity and deadline only target
Probably; I don't see any other explanation. If you want to analyze it, I just uploaded one more test commit: Stats on the userspace daemons. This one is simpler: fcc5ccc. Only if you want, though; at this point I feel like the bug is fixed already.
Sorry, was on vacation and then busy. Trying out the latest commit so we can put this issue to bed. Unfortunately,
Maybe an off-by-one error in the arg processing?
I'm getting a -8 exit code. My guess is I need to pass a third arg to |
See the commit message of fcc5ccc.
You're right. |
Patch uploaded: b1e5021 |
Ah, sorry. I did see that go by but got distracted and only looked at the man page. I've got that figured out now. There is still an off-by-one in
With that change I'm up and running and I am able to query the stats socket, so I think we're all set. Thank you again for all of your help and patience with this issue! I'm fine to close this out now, and look forward to having it all rolled up in a future release. |
Huhhhhhhhhhhhh? 🤦🤦🤦🤦🤦🤦🤦 How did I miss this during the release? How did I miss it during the TESTING? WHAT |
You know what, I'm going to upgrade to argp. This is clearly not working out. |
For #410. Ugh. Ran out of time, and I still have some issues with it. Also, it's missing documentation. Will keep grinding next weekend.
Before:

```
echo '{ "port": "9999" }' > statsocket.json
joold netsocket.json modsocket.json statsocket.json
```

Now:

```
joold netsocket.json modsocket.json 9999
```

Restores the fcc5ccc interface. It's less consistent, but eliminates the need to re-explain the third argument in #410. I don't mind the inconsistency, because `joold` has been superseded by `jool session proxy` anyway.
This option is just a liability at this point, and its ill-advised default is a trap. Early flushing is no longer an option; SS always queues now. Rather than Active/Active, it's best to set up two Active/Passive couples, per #410.
Hello again. Think I'm finally happy with it.
I decided to deprecate that, and move userspace joold to
Similarly,

You don't need to update your scripts because, even though the old commands are deprecated, they still work, and I'm not really in a rush to delete them. But it'd be great if you could confirm I didn't break something again. The code is in the |
Version 4.1.13 released; closing. |
We have two jool boxes in an active/active load-sharing setup that we're testing for our campus. Things have been fine for months in limited testing. This week we added more test clients to the boxes and have been getting several machine lockups requiring a hard reboot. The message on the screen is typically out of memory.
This is jool 4.1.8.0 on Debian Bullseye.
I'm not a kernel expert, but in some of the other memory-issue reports people mentioned /proc/slabinfo. I sampled it every 2 seconds, and the "jool_joold_nodes" line is increasing constantly (this is on a machine that's been up less than 2 hours, and 'jool session display' lists approximately 12,000 sessions):
I sampled active sessions in jool and even when those decreased, the slabs continued to increase. Meanwhile, "available" memory (as reported by top/free) has been steadily decreasing (several MiB per minute). Since we've increased the number of users, the machines have needed a reboot in as few as 20 hours.
I don't know enough about the kernel structures to know what jool_joold_nodes represents, but I'm guessing it shouldn't be monotonically increasing. I'm happy to gather any additional data that may be helpful.
One of my two boxes is locked up at the moment, but once I'm back to fully redundant I can try things like restarting jool or unloading the kernel module to see if we can recover memory without rebooting.