Kernel crash with Jool v3.5.1 #232
Comments
Thank you. It looks like Jool's fault to me. The bug is probably present in the 3.4 series too. Going to allocate time for a review asap... |
Was defrag active? In other words, what is the output of
? (we'll just have to assume your current output is the same it had when it crashed) |
Looks that way:
It is extremely likely that this was the case before the crash, too. |
Thanks |
So I've been scanning the code for roughly a week now and I feel like I should report something, for the sake of collecting my thoughts here if nothing else.

We've been trying to find #232. It's one hell of a bug. On one hand, it's easy to tell from the stack trace that the crash happened during the translation, from IPv4 to IPv6, of the *inner* payload of an ICMP error caused by a TCP packet. On the other hand, there is no way to reproduce it yet, the review is yielding little more than optimizations, and the failure rate (once in the two years the relevant code has existed) suggests that the problem is something otherworldly (i.e. undefined behavior, which could have been triggered anywhere in the kernel).

There was no hairpinning involved. ICMP errors are never supposed to be fragmented. Even if this particular packet were, the crash happened during the first fragment's copy. The fact that SIIT is the one that crashed is fortunate since it means there's less code to worry about.

It crashed during one of the `memcpy()`s of the kernel's `skb_copy_bits()`, sitting at Jool's `copy_payload()`. The crash looks like a typical memory access fault, which would mean that at least one of the following fields had an incorrect value during the copy:

- `state->in.skb`
  - `skb->len`
  - `skb->data`
  - `skb->data_len`
  - `skb_shinfo(skb)->nr_frags`
  - `skb_shinfo(skb)->frags`
  - `skb_shinfo(skb)->frag_list`
- `pkt_payload_offset(&state->in)`
  - `skb->head`
  - `skb->network_header`
  - `skb->data`
  - `pkt->payload`
- `pkt_payload(&state->out)`
  - `pkt->payload`
- `pkt_payload_len_frag(&state->out)`
  - `skb->len`
  - `skb->data_len`
  - `pkt->payload`
  - `skb->head`
  - `skb->network_header`
  - `skb_shinfo(skb)->nr_frags`
  - `skb_shinfo(skb)->frags`
  - `skb_shinfo(skb)->frag_list`

Jool rarely needs to edit the incoming packet, and when it does it's via the kernel API. The borked field is most likely one of the outgoing ones.

------------------------

Ok so I haven't necessarily fixed the bug but I did find room for improvement.
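To make the "memory access fault" reasoning above concrete, here is a minimal userspace sketch of a copy that validates both the source and the destination range before calling `memcpy()`. It is not Jool or kernel code: `struct toy_buf` and `toy_copy_bits()` are hypothetical names invented for illustration, and the buffers are plain linear arrays (no paged or fragmented data). The point is only that a stale length, offset or payload pointer on either side is enough to send `memcpy()` out of bounds.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical userspace stand-in for a linear packet buffer; not a kernel type. */
struct toy_buf {
	unsigned char *data; /* first valid byte */
	size_t len;          /* number of valid bytes starting at data */
};

/*
 * Copy len bytes from src (starting at offset) into dst (starting at
 * dst_offset). Returns 0 on success, or -EFAULT if either range does not
 * fit inside its buffer, in which case nothing is copied.
 */
static int toy_copy_bits(const struct toy_buf *src, size_t offset,
			 struct toy_buf *dst, size_t dst_offset, size_t len)
{
	if (!src || !dst || !src->data || !dst->data)
		return -EFAULT;

	/* Written as subtractions so that offset + len cannot overflow. */
	if (offset > src->len || len > src->len - offset)
		return -EFAULT;
	if (dst_offset > dst->len || len > dst->len - dst_offset)
		return -EFAULT;

	memcpy(dst->data + dst_offset, src->data + offset, len);
	return 0;
}
```

In this toy model, a bad value in any incoming-side field shows up as a failed source check, and a bad value in any outgoing-side field shows up as a failed destination check. In the real code path nothing validates the destination, which is consistent with suspecting the outgoing fields first.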
Fragment translation is one area where I feel Jool is too kernel-aware and, though I don't see even potential problems in this code now (given that fragmentation has little to do with the crashed packet to begin with), future kernel refactors regarding fragment representation can come back and shoot me in the foot.

The problem is that Jool is copying subsequent packet payload *and even pages* when a simple reference grab can do the job. Subsequent fragments lack headers so they can theoretically be quirklessly shared between incoming and translated packets. Fixing this would have the additional benefit of speeding up translation since only head data (not paged nor fragmented) would need to be copied. I can also see it trumping the offloading problem but I've been there before and I'm not getting my hopes up.

IIRC, I implemented it as it is because the kernel's suggested fragment-transparent solution does not necessarily account for the potential header growth (from IPv4 to IPv6) and the kernel can find itself in deep trouble if an skb cannot be `skb_push`ed enough. I did some tests however, and it seems that precisely when fragmentation is involved the kernel tends to reserve plenty of excess headroom for some reason. So I might be on to something.

I also found other small errors but I don't see a kernel panic coming out of any of them. In fact, since this is the first time I've seen them I'm somewhat skeptical as to whether I actually fixed something or introduced more problems. Currently testing. If anything, this commit should stick because I added and updated loads of documentation during the review.
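The header growth concern can be sketched in userspace as follows. This is an illustration under stated assumptions, not Jool or kernel code: `struct toy_pkt`, `toy_headroom()`, `toy_header_growth()` and `toy_push()` are hypothetical names that only mimic the idea behind the kernel's headroom and `skb_push()`. The header sizes are the standard ones (20 bytes for an IPv4 header without options, 40 for the fixed IPv6 header, 8 for the IPv6 fragment header).

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_IPV4_HDR_LEN 20 /* IPv4 header without options */
#define TOY_IPV6_HDR_LEN 40 /* fixed IPv6 header */
#define TOY_FRAG_HDR_LEN 8  /* IPv6 fragment extension header */

/* Hypothetical stand-in for the head/data pointers of a packet buffer. */
struct toy_pkt {
	unsigned char *head; /* start of the allocation */
	unsigned char *data; /* start of the current outermost header */
};

/* Bytes available in front of data. */
static size_t toy_headroom(const struct toy_pkt *pkt)
{
	return (size_t)(pkt->data - pkt->head);
}

/*
 * Extra bytes needed in front of the packet when an IPv4 header of
 * ipv4_hdr_len bytes is replaced in place by an IPv6 header (plus a
 * fragment header if the packet is fragmented).
 */
static size_t toy_header_growth(size_t ipv4_hdr_len, bool fragmented)
{
	size_t ipv6_len = TOY_IPV6_HDR_LEN + (fragmented ? TOY_FRAG_HDR_LEN : 0);

	return (ipv6_len > ipv4_hdr_len) ? ipv6_len - ipv4_hdr_len : 0;
}

/* Move data back by len bytes; return NULL instead of underrunning head. */
static unsigned char *toy_push(struct toy_pkt *pkt, size_t len)
{
	if (len > toy_headroom(pkt))
		return NULL; /* not enough headroom; pkt is left untouched */

	pkt->data -= len;
	return pkt->data;
}
```

With 16 bytes of headroom, for example, `toy_push(&pkt, toy_header_growth(TOY_IPV4_HDR_LEN, false))` returns `NULL` because 20 extra bytes are needed. The toy version fails softly so the caller can fall back to copying; the real `skb_push()` does not offer that option, which is why the available headroom has to be checked before sharing the buffer.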
Hi, @toreanderson. Question: was offloading turned off? |
Yes.
$ ethtool --show-offload eth0 | grep receive-offload
generic-receive-offload: off
large-receive-offload: off [fixed]
$ ethtool --show-offload eth1 | grep receive-offload
generic-receive-offload: off
large-receive-offload: off [fixed]
$ ethtool --show-offload bond0 | grep receive-offload
generic-receive-offload: off
large-receive-offload: on |
IIIIIIIII FFFFFFFFFOOOOOOOOOOUUUUUUUUUUNNNNNNNDDDDDDDDDDD IIIIIIITTTTTTTT!!! Well, yours crashed in
The bug has been fixed since commit 52deab1. In other words, it has been fixed all along. I'm so angry. Releasing 3.5.2. |
Includes
- A fix to the 6791 pool: was always using host addresses, regardless of whether the pool had elements or not.
- More graybox improvements.
- More comments.
Nice work! 👍 Could you say if the triggering factor is some kind of malformed packet or just some random memory corruption or similar that is very unlikely to happen very often? That is, should I consider this a security issue that could be triggered by a specially crafted packet sent from anywhere on the Internet? |
The trigger is a single, very specific packet that is all an attacker needs to murder the kernel. The packet itself is unlikely to happen naturally. (The "security vulnerability" tag is rather redundant because |
By the way: The bug takes a slightly different shape in Jool 3.4, and it's unclear to me whether it is conducive to a panic or not. Jool 3.4.6 will be released on Monday regardless. |
Sorry for the inconvenience. |
Not at all, thanks for the quick follow up! |
Not sure if the bug yields a panic in 3.4, but at the very least this will prevent some legitimate packets from being dropped.
Hi, will you apply for a CVE number for this bug? I just happened to notice this bug at random, and if there had been a CVE for it, it would have been possible to detect it in an automated way. Best regards |
Ok, request sent. I used the "Distributed Weakness Filing Project" (iwantacve.org) option. |
One of our SIIT-DC BRs just crashed. It's an x86_64 server running Ubuntu 14.04.5 and kernel 4.4.0-45-generic. This could be the hardware going faulty for all I know (it's the first time this has happened), but I'm including the oops from the serial console below. It mentions various Jool-related functions, so I'm assuming you'd be interested in taking a look.