Kernel panic with Jool v3.5.6 (kernel BUG at .../skbuff.h:1826) #247
Within 24 hours of each other, two of our SIIT-DC BRs running SIIT Jool v3.5.2 on Ubuntu 14.04.5 with kernel 4.4.0-51-generic panicked in a very similar fashion. They had both been running stable for over half a year, so the incidents are highly suspect. Possibly another «packet of death» issue akin to issue #232?

The only information I have about it is the traces that were output to the serial console as they crashed:

Trace from SIIT-DC BR 1

Trace from SIIT-DC BR 2

Tore

Comments
Fun times ahead. ^^" Thanks.

It's pretty safe to assume that it crashed here. The line is saying "throw a panic if ..."

When did you update to 3.5.2?
I upgraded on the 6th of December 2016 (to fix #232). Both servers had been running stable since then.
Sorry; I'm still completely empty-handed. I didn't find anything strange by reviewing the code, and the new unit test reports nothing but happiness.

I'm wondering if you might be willing to run a custom version of 3.5.2 that would, instead of crashing, dump a bunch of additional information in the log and drop the packet. The reason I'm hesitant to commit this enhancement to 3.5.4 is that the changes needed to pull this off are kernel-aware, and might therefore introduce new issues in other kernels. (I'm probably being a crybaby. This kernel code hasn't changed at all in many years, and likely won't for a long time...)

Did you update Jool to 3.5.3, and the kernel to the latest generic, after the crash? If so, would you rather I branch these changes off 3.5.3 instead of 3.5.2?

Aside: When you say "within 24 hours of each other", could this by any chance be roughly the same amount of time that separated their activation? I.e., did you start the first BR, and then after 24 hours start the second one?

Are you using the atomic configuration feature?
I'd be happy to try running a version with extra debugging. Just give me a git revision to check out and I'll upgrade in my next available maintenance window.

I did not upgrade Jool following the crash (still running 3.5.2), but the kernel was upgraded to 4.4.0-79-generic after the crash. It will be upgraded to 4.4.0-81-generic (or anything newer that might come out in the meantime) on the next boot, unless you'd rather I didn't.

On December 6th the two nodes were booted five minutes after one another, so they did not have the exact same uptime when crashing. To be specific, node A was booted on 2016-12-06 01:24, while node B was booted at 01:29. Node B was then the first to crash, on 2017-06-10 16:52, followed by node A on 2017-06-11 00:15.

For what it's worth, there's also a third SIIT-DC BR in my network, running virtualised in KVM. This one was also rebooted on December 6th (01:11) in order to upgrade to 4.4.0-51-generic and Jool 3.5.2. Same versions and same configuration. It has not crashed yet.

Considering that I was the one to initially suggest it, I am a little bit embarrassed to say that I still haven't gotten around to updating my automation scripts to use the atomic configuration feature...
Thanks. Working on the custom build right now. I still need to test it, so I wouldn't recommend downloading it yet.
:/ I can't rule out the possibility that this might be a kernel bug (which at this point feels likely), in which case this update might prevent us from seeing the crash again. But I feel that I can't ask you to downgrade to 4.4.0-51-generic, since that could net you unnecessary security issues.

Incidentally, is this bug preventing you from updating Jool to 3.5.3? Or do you have some other reason to stick to 3.5.2? Because if you updated the kernel anyway, we could just move over to the other extreme and try to debug on the latest everything.
Ok, no problem. That simplifies this.
Ok, thanks.
I don't recall any security fixes in the kernel since -51-generic that would be applicable to these nodes (no local users or network-exposed services), so I can downgrade if you want (provided I can still find the deb packages for -51-generic somewhere).

The reason I haven't upgraded to 3.5.3 is simply that the changes since 3.5.2 didn't seem relevant to me. Unlike Jool (which is pinned to a specific git revision), the kernel is updated automatically on disk by the apt packaging system, so when the node crashed and was rebooted it activated the most recent version. It wasn't a conscious decision to update the kernel but not Jool, in case I gave you that impression.

In any case, I'll run whatever versions you want me to. Just let me know what you prefer.
Okay, that's perfect. (Are you sure you need to redownload the kernel? My Ubuntu Server doesn't uninstall old kernels by default.)

I just finished testing it. The commit is the same: 21b97b1. Please go back to 4.4.0-51-generic to maximize the probability of getting the error again, and look out for output that looks somewhat like this:
It should not crash. The expression "Bug #247 happened!" is constant, which might help for grepping. "Dropping packet" is also constant. And I guess we'll have to wait another several months.
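A minimal sketch of what such a log-and-drop validation could look like (hypothetical, not Jool's actual code; the function and message layout are assumptions, and only the "Bug #247 happened!" and "Dropping packet" strings are known to be the real, constant markers):

```c
#include <linux/errno.h>
#include <linux/printk.h>
#include <linux/skbuff.h>

/* Hypothetical sketch: detect the inconsistency that would later make the
 * kernel's BUG_ON() fire, log it, and drop the packet instead of panicking. */
static int validate_skb(struct sk_buff *skb)
{
	/* skb->len must never be smaller than skb->data_len (the paged,
	 * non-linear portion); if it is, a later pull would panic. */
	if (unlikely(skb->len < skb->data_len)) {
		pr_err("Bug #247 happened!\n");
		pr_err("skb->len: %u, skb->data_len: %u\n",
				skb->len, skb->data_len);
		pr_err("Dropping packet.\n");
		kfree_skb(skb);
		return -EINVAL;
	}
	return 0;
}
```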
All right. Now both the BRs that crashed earlier are running Jool 21b97b1 on 4.4.0-51-generic.
It seems silly to only keep these validations in an obscure previous branch. If the #247 crash happens to blow up in the latest master, I will have only myself to blame for not getting the output I need. So I'm committing these debugging tweaks for posterity; I will revert them once #247 is fixed.

I just checked: every supported kernel so far will crash if this condition is met, so this should not introduce any issues. (At least up until kernel 4.13-rc1.)
To clarify: to make sure that this bug does not affect more people, I have merged these tweaks (validation, debugging messages and packet drop) into the latest master. Hopefully, this will improve the probability of someone else reporting the output as well. If you want to download Jool 3.5.4, don't let me stop you.

Downgrading this issue to non-critical, because it shouldn't crash anymore.
New crash last night, with Jool v3.5.6. The events seem very similar to the ones originally reported last summer. This time, the line it complains about is skbuff.h:1826.

My two physical boxes crashed within a minute of each other, while my third virtual one stayed up and saved the day. The three nodes are anycasted, so if there was some sort of incoming stream of «packets of death» from somewhere, it'd get rerouted from node to node as each individual one crashed, causing a domino effect. If that was the case, though, it would appear that the virtual instance is immune.

Here are the traces that appeared on the serial console as they crashed:
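For reference, the code at that spot in 4.4-era kernels is quite possibly the pull helper below (verbatim from include/linux/skbuff.h, though the exact line number varies by version); it panics whenever a pull would leave the skb claiming fewer total bytes than its paged portion:

```c
static inline unsigned char *__skb_pull(struct sk_buff *skb, unsigned int len)
{
	skb->len -= len;
	BUG_ON(skb->len < skb->data_len);
	return skb->data += len;
}
```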
NOOOOOOOOOOOOOOOOOOOOO, I MESSED UP THE FAILURE DETECTION CONDITIONAL orz (╯°□°)╯︵ ┻━┻ WHAT WAS I THINKIIIIIIIIIIIIIIIIING

All right, it's worth a fresh review anyway. I'm going to be looking into it again next week. Update to the latest commit (68b19ea) so it doesn't crash violently anymore. Hopefully I got it right this time.
Haven't really found the bug, but I'm questioning some of the code anyway. Uploading tweaks.

While computing the nexthdr of a given IPv6 packet, the code was performing the following assignment:

Local network header pointer = skb_network_header(skb) + local network header offset

(By "local" I mean the scope of the variable.) This operation is reused for outer and inner packets, and strongly requires and assumes that skb_network_offset(skb) is zero:

- For outer packets, it doesn't make sense to add anything to skb_network_header(skb) to get a pointer to the network header, and "local network header offset" is skb_network_offset(skb).
- For inner packets, "local network header offset" is an offset from skb->data, which means that skb_network_header(skb) needs to equal skb->data. (Hence, skb_network_offset(skb) = 0.)

This is supposed to always be the case, which is why this isn't really a bug, unless the kernel itself has another bug to match.

I thought about putting a validation in place to ensure that the network header offset is zero for all outer packets, but then I realized that the whole reason I was making this so convoluted was that I didn't want the code to touch skb->data, instead relying on the available kernel API. I noticed that there is at least one function around (offset_to_ptr()) that touches it anyway. (And I don't think that there is a way around that unless the packet.c module is completely rewritten.) So the new code is:

Local network header pointer = skb->data + local network header offset

This actually expresses the intent much more clearly; see the sketch below.

The following are old assumptions that still stand:

- The skb_network_offset() offset is relative to skb->data.
- The skb_header_pointer() offset is relative to skb->data.

The following is no longer an assumption, and Jool should continue working fine if the kernel happens to break it:

- skb_network_offset(skb) during a packet's entrance to a Netfilter hook is zero.

So that's what this commit comes down to: I removed a potential future bug opportunity and made the code clearer as a bonus.
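A condensed sketch of the change described above (illustrative only; the function name and the hdr_offset parameter are made up for this example, not Jool's exact code):

```c
#include <linux/ipv6.h>
#include <linux/skbuff.h>

/* hdr_offset is relative to skb->data: it is skb_network_offset(skb) for
 * outer packets, or the offset of an ICMP error's internal packet for
 * inner packets. */
static struct ipv6hdr *hdr6_from_offset(struct sk_buff *skb,
		unsigned int hdr_offset)
{
	/* Old computation: correct only while skb_network_offset(skb) == 0,
	 * since hdr_offset is defined relative to skb->data:
	 *
	 *	return (struct ipv6hdr *)
	 *			(skb_network_header(skb) + hdr_offset);
	 */

	/* New computation: states the actual intent directly. */
	return (struct ipv6hdr *)(skb->data + hdr_offset);
}
```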
The code was assuming that IPv6 headers were located in the data area of skbs. This is not necessarily true, particularly for ICMP internal packets. I can't think of a way this could have caused the bug, but I'm still looking.
The code wasn't catching NULL as a potential result from skb_pull(). In all fairness, nobody does, really. Standard skb operations usually crash the kernel on failure; I don't really understand why this specific error condition yields NULL instead of BUG(), and no other kernel citizens seem to care about it.

Also, this is obviously not the cause of #247; the crash would have to happen later if this were the case.
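The defensive pattern being added is roughly this (a sketch; the function and its caller contract are illustrative):

```c
#include <linux/errno.h>
#include <linux/skbuff.h>

/* skb_pull() returns NULL (rather than BUG()ing) when asked to pull more
 * bytes than the skb holds, so that result has to be treated as a drop. */
static int pull_header(struct sk_buff *skb, unsigned int hdr_len)
{
	if (!skb_pull(skb, hdr_len)) {
		/* Truncated or corrupt packet; refuse to translate it. */
		return -EINVAL;
	}
	return 0;
}
```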
Okay, I managed to design a nuke packet that triggers an error similar to the one you got. I think that publishing the specs of the packet would be a bad idea, since no one aside from attackers would benefit from the information. But I think it's safe to say that it's a very, very forced packet. You can't just put it on the network and expect Jool to crash; the kernel also needs to append some impressive random miracles on its own. I wouldn't say that there is a bug in the kernel, since the packet technically does not break any contracts, but damn, that's some convoluted black magic there.

The bug has been fixed since commit 5c90ead.

I can only conjecture as to why you're getting the crashes so close to each other, which is why I'd like to leave the data collection and validations in place for at least another year, just in case. With the release of Jool 3.5.7 I'm going to close this issue (if you don't close it earlier, that is), but feel free to reopen if the crash finds you again, as usual.

Thanks, Tore. I'm only 99% sure that we fixed the bug, but I'm glad we got that particular crack all patched up.
Great job! 👍 I'll upgrade to 3.5.7 at first opportunity. I'm quite okay with the details of the nuke packet not being made public, although I certainly am curious... Well, you have my e-mail address, if you're willing to share more details privately. ;-)
Ok. Just to clarify: remember that you don't need to wait for the release; it's just a formality so people can see an incrementing number and don't have to install autoconf. The patches are already collapsed into master, and I run the entire test suite before committing to that.
I think that's fine, but I will release 3.5.7 first.
I got the following in the kernel output on one of my boxes today, running Jool 3.5.7 on kernel 4.4.0-121-generic. It's been up since the 8th of May, so basically right after we last had communication on this issue. Even though 3.5.7 is deprecated, I thought you might be interested.
It's not, though. I do not intend to work on new features for it, but we need to keep it around until 4.0 feels time-tested. Thanks. I will work on this on the weekend. |
Just... wow.
Here's a theory: the original packet wasn't IPv4 but rather an IPv6 packet hairpinned into IPv4 (hairpinning: the translated packet's destination maps back into the translator, so it gets translated a second time). Pros:
Cons:
What is ...?

Also: The offending packet was supposedly sent from ...

TODOs:
So the packet must have originally been sent from ...

That said, hairpinning is enabled:

$ sudo jool_siit | grep hairpin
--eam-hairpin-mode: 2 (intrinsic)
Are you planning to stick to 3.5? I added more debugging messages, but I'm not sure if you want to keep monitoring this.

In case you want to continue: since the bug seems to require several coincidences in unison to show tangible symptoms, I conjecture that it might be affecting more packets than we're noticing. I've placed broader detection conditionals (thanks to the new information), and hopefully the next output should appear relatively soon.

Sample output 1:
Sample output 2:
Sample output 3:
Just FYI, the ...

Is it OK to stick to the new commit, or would you like me to release 3.5.8?
I plan on upgrading to v4 eventually, but it will probably take a while before I get around to it. I am happy to run c4629dd until then - you don't have to make a new release just for my sake.
Ok, the crash hasn't shown up in more than 2 years. I don't have the stomach to remove the detection conditional yet, but I think it's high time to unclutter the issue list. Feel free to reopen if the bug shows up again.