Kernel crash with Jool v3.5.1 #232

Closed
toreanderson opened this Issue Nov 10, 2016 · 14 comments

@toreanderson
Contributor

One of our SIIT-DC BRs just crashed. It's an x86_64 server running Ubuntu 14.04.5 and kernel 4.4.0-45-generic. This could be the hardware going faulty for all I know (it's the first time this has happened), but I'm including the oops from the serial console below. It mentions various Jool-related functions, so I'm assuming you'd be interested in taking a look.

[1418084.279692] BUG: unable to handle kernel paging request at ffff88007a402000
[1418084.284507] IP: [<ffffffff813e2d12>] __memcpy+0x12/0x20
[1418084.292070] PGD 2205067 PUD 47ffff067 PMD 274a98063 PTE 800000007a402161
[1418084.295390] Oops: 0003 [#1] SMP 
[1418084.312621] Modules linked in: mptctl 8021q garp mrp stp llc jool_siit(OE) ipmi_ssif bonding gpio_ich coretemp kvm_intel kvm ast irqbypass ttm drm_kms_helper input_leds joydev drm fb_sys_fops syscopyarea sysfillrect tpm_infineon 8250_fintek sysimgblt mac_hid ipmi_si ipmi_msghandler i7core_edac edac_core ioatdma lpc_ich shpchp i5500_temp lp parport xt_mark ip6table_mangle nf_log_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables nf_log_ipv4 nf_log_common xt_LOG xt_tcpudp nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack iptable_filter ip_tables x_tables btrfs xor raid6_pq igb mptsas hid_generic i2c_algo_bit mptscsih dca ahci usbhid mptbase ptp libahci hid scsi_transport_sas pps_core fjes
[1418084.416369] CPU: 8 PID: 0 Comm: swapper/8 Tainted: G          IOE   4.4.0-45-generic #66~14.04.1-Ubuntu
[1418084.433389] Hardware name: SUN MICROSYSTEMS SUN FIRE X4170 SERVER          /ASSY,MOTHERBOARD,X4170, BIOS 07060309 07/10/2013
[1418084.453445] task: ffff880276a0f080 ti: ffff880276a18000 task.ti: ffff880276a18000
[1418084.472151] RIP: 0010:[<ffffffff813e2d12>]  [<ffffffff813e2d12>] __memcpy+0x12/0x20
[1418084.491447] RSP: 0018:ffff880277d03750  EFLAGS: 00010202
[1418084.492814] RAX: ffff88006a69f4fc RBX: 00000000fffffffc RCX: 000000001e053a9f
[1418084.511816] RDX: 0000000000000004 RSI: ffff88024d27f712 RDI: ffff88007a401ffc
[1418084.514370] RBP: ffff880277d037a8 R08: ffff88007b344000 R09: ffff880277d03a08
[1418084.532811] R10: 0000000000000000 R11: 0000000000000002 R12: 0000000000000028
[1418084.552382] R13: 00000000fffffffc R14: ffff88006a69f4fc R15: 000000000000002c
[1418084.555226] FS:  0000000000000000(0000) GS:ffff880277d00000(0000) knlGS:0000000000000000
[1418084.573104] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[1418084.591776] CR2: ffff88007a402000 CR3: 0000000001e0c000 CR4: 00000000000006e0
[1418084.594332] Stack:
[1418084.611423]  ffffffff816e2d66 0000000000004600 ffff880400000060 12d353951b5a0575
[1418084.615153]  ffffffff816f8450 ffff88007b344000 ffff880277d03a08 ffff88023d51cbfe
[1418084.633019]  ffff88006a69f4e8 ffff880277d03aa8 0000000000000060 ffff880277d037c0
[1418084.651732] Call Trace:
[1418084.652690]  <IRQ> 
[1418084.653349]  [<ffffffff816e2d66>] ? skb_copy_bits+0x66/0x2d0
[1418084.671567]  [<ffffffff816f8450>] ? dev_queue_xmit+0x10/0x20
[1418084.673536]  [<ffffffffc0446420>] copy_payload+0x90/0xe0 [jool_siit]
[1418084.691774]  [<ffffffffc044498d>] ttp46_tcp+0x10d/0x170 [jool_siit]
[1418084.694030]  [<ffffffffc04465be>] ttpcomm_translate_inner_packet+0xee/0x240 [jool_siit]
[1418084.712967]  [<ffffffffc044398f>] post_icmp6error+0xaf/0x1a0 [jool_siit]
[1418084.731816]  [<ffffffffc04444a3>] ttp46_icmp+0x143/0x520 [jool_siit]
[1418084.733552]  [<ffffffffc044685d>] translating_the_packet+0xad/0x2f0 [jool_siit]
[1418084.752745]  [<ffffffffc044ce50>] core_common+0x20/0xf0 [jool_siit]
[1418084.754707]  [<ffffffffc044cfc7>] core_4to6+0xa7/0x130 [jool_siit]
[1418084.773072]  [<ffffffffc0453185>] hook_ipv4+0x15/0x20 [jool_siit]
[1418084.791478]  [<ffffffff8172b2cd>] nf_iterate+0x5d/0x70
[1418084.793705]  [<ffffffff8172b346>] nf_hook_slow+0x66/0xc0
[1418084.811468]  [<ffffffff817326a3>] ip_rcv+0x303/0x3e0
[1418084.813389]  [<ffffffff81731c80>] ? inet_del_offload+0x40/0x40
[1418084.832558]  [<ffffffff816f57ab>] __netif_receive_skb_core+0x36b/0x9c0
[1418084.835651]  [<ffffffffc051a850>] ? bond_resend_igmp_join_requests_delayed+0x80/0x80 [bonding]
[1418084.854153]  [<ffffffff816f5e18>] __netif_receive_skb+0x18/0x60
[1418084.872538]  [<ffffffff816f5e83>] netif_receive_skb_internal+0x23/0x80
[1418084.876184]  [<ffffffff816f6a93>] napi_gro_receive+0xc3/0x110
[1418084.893145]  [<ffffffffc01ab97d>] igb_clean_rx_irq+0x38d/0x6c0 [igb]
[1418084.895906]  [<ffffffffc01ac013>] igb_poll+0x363/0x720 [igb]
[1418084.913155]  [<ffffffff81036929>] ? sched_clock+0x9/0x10
[1418084.915810]  [<ffffffff810ac8d2>] ? sched_clock_cpu+0x72/0xa0
[1418084.939181]  [<ffffffff810a6655>] ? check_preempt_curr+0x75/0x90
[1418084.942383]  [<ffffffff810a6689>] ? ttwu_do_wakeup+0x19/0xe0
[1418084.953452]  [<ffffffff816f6274>] net_rx_action+0x164/0x350
[1418084.956085]  [<ffffffff81081f7d>] __do_softirq+0xdd/0x290
[1418084.973315]  [<ffffffff81082355>] irq_exit+0x95/0xa0
[1418084.975518]  [<ffffffff817fcee6>] do_IRQ+0x56/0xd0
[1418084.993256]  [<ffffffff817fafc2>] common_interrupt+0x82/0x82
[1418084.996847]  <EOI> 
[1418085.011423]  [<ffffffff81695a35>] ? cpuidle_enter_state+0xd5/0x250
[1418085.015568]  [<ffffffff81695a14>] ? cpuidle_enter_state+0xb4/0x250
[1418085.033139]  [<ffffffff81695be7>] cpuidle_enter+0x17/0x20
[1418085.035561]  [<ffffffff810be102>] call_cpuidle+0x32/0x60
[1418085.053152]  [<ffffffff81695bc3>] ? cpuidle_select+0x13/0x20
[1418085.055579]  [<ffffffff810be3b9>] cpu_startup_entry+0x289/0x350
[1418085.073459]  [<ffffffff8104f319>] start_secondary+0x149/0x170
[1418085.075869] Code: 74 0e 48 8b 43 60 48 2b 43 50 88 43 4e 5b 5d c3 e8 b4 fc ff ff eb eb 90 90 66 66 90 66 90 48 89 f8 48 89 d1 48 c1 e9 03 83 e2 07 <f3> 48 a5 89 d1 f3 a4 c3 66 0f 1f 44 00 00 48 89 f8 48 89 d1 f3 
[1418085.120794] RIP  [<ffffffff813e2d12>] __memcpy+0x12/0x20
[1418085.132582]  RSP <ffff880277d03750>
[1418085.134732] CR2: ffff88007a402000
[1418085.140858] ---[ end trace 7537c99c0e03fc19 ]---
[1418085.140864] BUG: unable to handle kernel paging request at 0000571a90000190
[1418085.140871] IP: [<ffffffff817ac7bc>] rt6_score_route+0x10c/0x1c0
[1418085.140873] PGD 0 
[1418085.140874] Oops: 0000 [#2] SMP 
[1418085.140900] Modules linked in: mptctl 8021q garp mrp stp llc jool_siit(OE) ipmi_ssif bonding gpio_ich coretemp kvm_intel kvm ast irqbypass ttm drm_kms_helper input_leds joydev drm fb_sys_fops syscopyarea sysfillrect tpm_infineon 8250_fintek sysimgblt mac_hid ipmi_si ipmi_msghandler i7core_edac edac_core ioatdma lpc_ich shpchp i5500_temp lp parport xt_mark ip6table_mangle nf_log_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables nf_log_ipv4 nf_log_common xt_LOG xt_tcpudp nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack iptable_filter ip_tables x_tables btrfs xor raid6_pq igb mptsas hid_generic i2c_algo_bit mptscsih dca ahci usbhid mptbase ptp libahci hid scsi_transport_sas pps_core fjes
[1418085.140903] CPU: 15 PID: 0 Comm: swapper/15 Tainted: G      D   IOE   4.4.0-45-generic #66~14.04.1-Ubuntu
[1418085.140904] Hardware name: SUN MICROSYSTEMS SUN FIRE X4170 SERVER          /ASSY,MOTHERBOARD,X4170, BIOS 07060309 07/10/2013
[1418085.140904] task: ffff880476893e80 ti: ffff8804768fc000 task.ti: ffff8804768fc000
[1418085.140907] RIP: 0010:[<ffffffff817ac7bc>]  [<ffffffff817ac7bc>] rt6_score_route+0x10c/0x1c0
[1418085.140908] RSP: 0018:ffff88047fdc36e8  EFLAGS: 00010206
[1418085.140909] RAX: ffff880473df0000 RBX: 000000000000000a RCX: 000000000000001c
[1418085.140910] RDX: ffff88007a4007c0 RSI: 0000000000000008 RDI: 0000000000000000
[1418085.140910] RBP: ffff88047fdc3700 R08: 00000000c000022a R09: 0000000001020100
[1418085.140911] R10: 0000000010000000 R11: 00000000e86a0900 R12: 0000571a90000000
[1418085.140912] R13: 0000000000000002 R14: ffff88047fdc37ac R15: ffff88047fdc37ab
[1418085.140913] FS:  0000000000000000(0000) GS:ffff88047fdc0000(0000) knlGS:0000000000000000
[1418085.140914] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[1418085.140915] CR2: 0000571a90000190 CR3: 0000000001e0c000 CR4: 00000000000006e0
[1418085.140915] Stack:
[1418085.140917]  ffff88027627fd80 0000000000000000 0000000000000002 ffff88047fdc3750
[1418085.140918]  ffffffff817acbd8 0000000000000100 ffff88047fdc3950 0000000000000000
[1418085.140920]  0000000000000002 0000000000000000 0000000000000400 ffff88047fdc3930
[1418085.140920] Call Trace:
[1418085.140924]  <IRQ> 
[1418085.140924]  [<ffffffff817acbd8>] find_match+0x78/0x300
[1418085.140926]  [<ffffffff817acf36>] ip6_pol_route.isra.40+0xd6/0x5e0
[1418085.140928]  [<ffffffff817ad470>] ? ip6_pol_route_input+0x30/0x30
[1418085.140930]  [<ffffffff817ad49a>] ip6_pol_route_output+0x2a/0x30
[1418085.140935]  [<ffffffff817d5187>] fib6_rule_action+0xb7/0x1f0
[1418085.140940]  [<ffffffff81714227>] fib_rules_lookup+0xc7/0x160
[1418085.140949]  [<ffffffffc0453c20>] ? interface_contains+0x40/0xe0 [jool_siit]
[1418085.140951]  [<ffffffff817d5474>] fib6_rule_lookup+0x44/0xa0
[1418085.140953]  [<ffffffff817ad470>] ? ip6_pol_route_input+0x30/0x30
[1418085.140955]  [<ffffffff817ab33d>] ip6_route_output_flags+0xdd/0x120
[1418085.140961]  [<ffffffffc044cacf>] __route6+0x10f/0x200 [jool_siit]
[1418085.140968]  [<ffffffffc044cc25>] route+0x45/0x50 [jool_siit]
[1418085.140974]  [<ffffffffc044cc53>] sendpkt_send+0x23/0x200 [jool_siit]
[1418085.140980]  [<ffffffffc044cec6>] core_common+0x96/0xf0 [jool_siit]
[1418085.140986]  [<ffffffffc044cfc7>] core_4to6+0xa7/0x130 [jool_siit]
[1418085.140993]  [<ffffffffc0453185>] hook_ipv4+0x15/0x20 [jool_siit]
[1418085.140995]  [<ffffffff8172b2cd>] nf_iterate+0x5d/0x70
[1418085.140997]  [<ffffffff8172b346>] nf_hook_slow+0x66/0xc0
[1418085.140998]  [<ffffffff817326a3>] ip_rcv+0x303/0x3e0
[1418085.140999]  [<ffffffff81731c80>] ? inet_del_offload+0x40/0x40
[1418085.141002]  [<ffffffff816f57ab>] __netif_receive_skb_core+0x36b/0x9c0
[1418085.141007]  [<ffffffffc051a850>] ? bond_resend_igmp_join_requests_delayed+0x80/0x80 [bonding]
[1418085.141009]  [<ffffffff816f5e18>] __netif_receive_skb+0x18/0x60
[1418085.141010]  [<ffffffff816f5e83>] netif_receive_skb_internal+0x23/0x80
[1418085.141012]  [<ffffffff816f6a93>] napi_gro_receive+0xc3/0x110
[1418085.141018]  [<ffffffffc01ab97d>] igb_clean_rx_irq+0x38d/0x6c0 [igb]
[1418085.141023]  [<ffffffffc01ac013>] igb_poll+0x363/0x720 [igb]
[1418085.141026]  [<ffffffff810f74c0>] ? tick_sched_do_timer+0x30/0x30
[1418085.141028]  [<ffffffff816f6274>] net_rx_action+0x164/0x350
[1418085.141031]  [<ffffffff81081f7d>] __do_softirq+0xdd/0x290
[1418085.141033]  [<ffffffff81082355>] irq_exit+0x95/0xa0
[1418085.141034]  [<ffffffff817fcee6>] do_IRQ+0x56/0xd0
[1418085.141037]  [<ffffffff817fafc2>] common_interrupt+0x82/0x82
[1418085.141040]  <EOI> 
[1418085.141040]  [<ffffffff81695a35>] ? cpuidle_enter_state+0xd5/0x250
[1418085.141041]  [<ffffffff81695a14>] ? cpuidle_enter_state+0xb4/0x250
[1418085.141043]  [<ffffffff81695be7>] cpuidle_enter+0x17/0x20
[1418085.141045]  [<ffffffff810be102>] call_cpuidle+0x32/0x60
[1418085.141047]  [<ffffffff81695bc3>] ? cpuidle_select+0x13/0x20
[1418085.141048]  [<ffffffff810be3b9>] cpu_startup_entry+0x289/0x350
[1418085.141051]  [<ffffffff8104f319>] start_secondary+0x149/0x170
[1418085.141064] Code: 01 d9 01 ce b9 20 00 00 00 2b 4a 08 48 8b 12 d3 ee 48 8d 14 f2 4c 8b 22 4d 85 e4 75 0e e9 83 00 00 00 4d 8b 24 24 4d 85 e4 74 7a <49> 3b 84 24 90 01 00 00 75 ed 41 8b 94 24 9c 01 00 00 41 8b 8c 
[1418085.141067] RIP  [<ffffffff817ac7bc>] rt6_score_route+0x10c/0x1c0
[1418085.141067]  RSP <ffff88047fdc36e8>
[1418085.141068] CR2: 0000571a90000190
[1418085.141070] ---[ end trace 7537c99c0e03fc1a ]---
[1418085.141071] BUG: unable to handle kernel 
[1418085.141072] Kernel panic - not syncing: Fatal exception in interrupt
[1418085.141074] paging request at 0000571a90000190
[1418085.141078] IP: [<ffffffff817ac7bc>] rt6_score_route+0x10c/0x1c0
[1418085.141079] PGD 0 
[1418085.141081] Oops: 0000 [#3] SMP 
[1418085.141109] Modules linked in: mptctl 8021q garp mrp stp llc jool_siit(OE) ipmi_ssif bonding gpio_ich coretemp kvm_intel kvm ast irqbypass ttm drm_kms_helper input_leds joydev drm fb_sys_fops syscopyarea sysfillrect tpm_infineon 8250_fintek sysimgblt mac_hid ipmi_si ipmi_msghandler i7core_edac edac_core ioatdma lpc_ich shpchp i5500_temp lp parport xt_mark ip6table_mangle nf_log_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables nf_log_ipv4 nf_log_common xt_LOG xt_tcpudp nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack iptable_filter ip_tables x_ta
@ydahhrk
Member
ydahhrk commented Nov 10, 2016 edited

Thank you.

It looks like Jool's fault to me. The bug is probably present in the 3.4 series too.

Going to allocate time for a review asap...

@ydahhrk
Member
ydahhrk commented Nov 15, 2016

Was defrag active? In other words, what is the output of

$ lsmod | grep defrag

?

(we'll just have to assume the current output is the same as it was when the machine crashed)

@toreanderson
Contributor

Looks that way:

$ lsmod | grep defrag
nf_defrag_ipv6         36864  1 nf_conntrack_ipv6
nf_defrag_ipv4         16384  1 nf_conntrack_ipv4

It is extremely likely that this was the case before the crash, too.

@ydahhrk
Member
ydahhrk commented Nov 15, 2016

Thanks

@ydahhrk ydahhrk added a commit that referenced this issue Nov 19, 2016
@ydahhrk ydahhrk Improve the translation code slightly
So I've been scanning the code for roughly a week now and I feel
like I should report something, for the sake of collecting my
thoughts here if nothing else.

We've been trying to find #232. It's one hell of a bug. On one
hand, it's easy to tell from the stack trace that the crash
happened during the translation of the *inner* payload of an ICMP
error from IPv4 to IPv6 caused by a TCP packet. On the other hand,
there is no way to reproduce it yet, the review is yielding little
more than optimizations and the failure rate (once in the two years
the relevant code has existed) suggests that the problem is
something otherworldly (i.e. undefined behavior, which could have
been triggered anywhere in the kernel).

There was no hairpinning involved. ICMP errors are never supposed
to be fragmented. Even if this particular packet were, the crash
happened during the first fragment's copy. The fact that SIIT is
the one that crashed is fortunate since it means there's less code
to worry about.

It crashed during one of the `memcpy()`s of the kernel's
`skb_copy_bits()`, sitting at Jool's `copy_payload()`. The crash
looks like a typical memory access fault, which would mean that at
least one of the following fields had an incorrect value during the
copy:

	state->in.skb
		skb->len
		skb->data
		skb->data_len
		skb_shinfo(skb)->nr_frags
		skb_shinfo(skb)->frags
		skb_shinfo(skb)->frag_list

	pkt_payload_offset(&state->in)
		skb->head
		skb->network_header
		skb->data
		pkt->payload

	pkt_payload(&state->out)
		pkt->payload

	pkt_payload_len_frag(&state->out)
		skb->len
		skb->data_len
		pkt->payload
		skb->head
		skb->network_header
		skb_shinfo(skb)->nr_frags
		skb_shinfo(skb)->frags
		skb_shinfo(skb)->frag_list

Jool rarely needs to edit the incoming packet and when it does it's
via the kernel API. The borked field is most likely one of the
outgoing ones.
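
To make the failure mode concrete, here is a minimal userspace sketch of the kind of sanity check that catches a bad offset or length before the copy happens. It is not Jool or kernel code: `struct toy_pkt` and `toy_copy_bits()` are hypothetical names invented for illustration, and the assumption that the bad value is a negative offset or length is mine, since the review above does not say which field was wrong.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a linear packet buffer (not a kernel sk_buff). */
struct toy_pkt {
	const unsigned char *data;
	int len; /* number of valid bytes in data */
};

/*
 * Copy len bytes starting at offset from the packet into dst.
 * Returns 0 on success, or -EFAULT if the requested range is negative
 * or does not fit inside the packet.
 */
static int toy_copy_bits(const struct toy_pkt *pkt, int offset, void *dst,
			 int len)
{
	assert(pkt != NULL && dst != NULL);

	/* A negative offset or length would send memcpy() out of bounds. */
	if (offset < 0 || len < 0)
		return -EFAULT;

	/* Written as a subtraction so that offset + len cannot overflow. */
	if (offset > pkt->len - len)
		return -EFAULT;

	memcpy(dst, pkt->data + offset, (size_t)len);
	return 0;
}
```

With an 8-byte packet, a call such as `toy_copy_bits(&pkt, -4, dst, 4)` is rejected instead of reading before the buffer, and a range that runs past the end is rejected as well.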

------------------------

Ok so I haven't necessarily fixed the bug but I did find room for
improvement. Fragment translation is one area where I feel Jool is
too kernel-aware and, though I don't see even potential problems in
this code now (given that fragmentation has little to do with the
crashed packet to begin with), future kernel refactors regarding
fragment representation can come back and shoot me in the foot. The
problem is that Jool is copying subsequent packet payload *and even
pages* when a simple reference grab can do the job. Subsequent
fragments lack headers so they can theoretically be quirklessly
shared between incoming and translated packets. Fixing this would
have the additional benefit of speeding up translation since only
head data (not paged nor fragmented) would need to be copied. I can
also see it trumping the offloading problem but I've been there
before and I'm not getting my hopes up.

IIRC, I implemented it as it is because the kernel's suggested
fragment-transparent solution does not necessarily account for the
potential header growth (from IPv4 to IPv6) and the kernel can find
itself in deep trouble if an skb cannot be `skb_push`ed enough. I
did some tests however, and it seems that precisely when
fragmentation is involved the kernel tends to reserve plenty of
excess headroom for some reason. So I might be on to something.
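
The header growth mentioned above is simple arithmetic, sketched here in userspace C. The function names are hypothetical and this is not how Jool computes it. The sizes are the standard ones: 20 to 60 bytes for an IPv4 header, 40 bytes for the fixed IPv6 header, and 8 bytes for the IPv6 fragment header. In the kernel, the available headroom would come from something like `skb_headroom(skb)`.

```c
#include <assert.h>
#include <stdbool.h>

#define IPV4_MIN_HDR_LEN  20 /* IPv4 header without options */
#define IPV6_HDR_LEN      40 /* fixed IPv6 header */
#define IPV6_FRAG_HDR_LEN  8 /* IPv6 fragment extension header */

/*
 * Extra bytes needed in front of the payload when an IPv4 header of
 * ipv4_hdr_len bytes is replaced by an IPv6 header, optionally followed
 * by a fragment header. Never negative: a long IPv4 header (with
 * options) can already be larger than its IPv6 replacement.
 */
static int extra_headroom_4to6(int ipv4_hdr_len, bool needs_frag_hdr)
{
	int ipv6_len = IPV6_HDR_LEN + (needs_frag_hdr ? IPV6_FRAG_HDR_LEN : 0);
	int growth = ipv6_len - ipv4_hdr_len;

	return growth > 0 ? growth : 0;
}

/* True if the buffer can be pushed far enough for the new headers. */
static bool headroom_suffices(int available_headroom, int ipv4_hdr_len,
			      bool needs_frag_hdr)
{
	assert(ipv4_hdr_len >= IPV4_MIN_HDR_LEN);
	return available_headroom >=
	       extra_headroom_4to6(ipv4_hdr_len, needs_frag_hdr);
}
```

For an option-less IPv4 header the translated packet needs 20 more bytes of headroom, or 28 if a fragment header has to be added, which is the amount a shared (non-copied) buffer would have to be able to absorb.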

I also found other small errors but I don't see a kernel panic
coming out of any of them. In fact, since this is the first time
I've seen them I'm somewhat skeptical as to whether I actually
fixed something or introduced more problems. Currently testing.

If anything, this commit should stick because I added and updated
loads of documentation during the review.
52deab1
@TheRedTrainer
Contributor
TheRedTrainer commented Nov 22, 2016 edited

Hi, @toreanderson. Question: was offloading turned off?

ethtool --show-offload eth0 | grep receive-offload

@toreanderson
Contributor

Yes.

$ ethtool --show-offload eth0 | grep receive-offload
generic-receive-offload: off
large-receive-offload: off [fixed]
$ ethtool --show-offload eth1 | grep receive-offload
generic-receive-offload: off
large-receive-offload: off [fixed] 
$ ethtool --show-offload bond0 | grep receive-offload
generic-receive-offload: off
large-receive-offload: on
@ydahhrk ydahhrk added a commit that referenced this issue Nov 23, 2016
@ydahhrk ydahhrk Progress on #232 25df84f
@ydahhrk
Member
ydahhrk commented Dec 2, 2016 edited

IIIIIIIII FFFFFFFFFOOOOOOOOOOUUUUUUUUUUNNNNNNNDDDDDDDDDDD IIIIIIITTTTTTTT!!!

Well, yours crashed in __memcpy. Mine crashed in memcpy. Close enough.

BUG: unable to handle kernel paging request at cec3b000
IP: [<c13035f4>] memcpy+0x14/0x30
*pdpt = 0000000001aae001 *pde = 0000000010e88063 *pte = 800000000ec3b161
Oops: 0003 [#1] SMP
Modules linked in: jool_siit(OX) vboxsf(OX) snd_intel8x0 snd_ac97_codec ac97_bus openvswitch gre vxlan snd_pcm ip_tunnel snd_page_alloc libcrc32c snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq snd_seq_device snd_timer vboxvideo(OX) snd drm joydev vboxguest(OX) rfcomm bnep bluetooth serio_raw soundcore i2c_piix4 video parport_pc mac_hid ppdev lp parport hid_generic usbhid hid psmouse ahci libahci e1000 pata_acpi
CPU: 0 PID: 0 Comm: swapper/0 Tainted: G           OX 3.13.0-103-generic #150-Ubuntu
Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
task: c1934a00 ti: ddc0a000 task.ti: c1928000
EIP: 0060:[<c13035f4>] EFLAGS: 00210212 CPU: 0
EIP is at memcpy+0x14/0x30
EAX: cc70defc EBX: fffffff9 ECX: 3f6b4bbd EDX: d00c3ed2
ESI: d25f0fd6 EDI: cec3b000 EBP: ddc0ba2c ESP: ddc0ba20
 DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
CR0: 8005003b CR2: cec3b000 CR3: 1b5d6000 CR4: 000006f0
Stack:
 fffffff9 00000029 00000021 ddc0ba5c c1571153 026433c6 d0c5a840 00000001
 00000000 cc70defc c99d2428 d0718a69 d0c5a840 00000044 fffffff9 ddc0ba7c
 e26a5486 fffffff9 00000014 00000000 cc70deb8 ddc0bc48 cc70dee8 ddc0bab8
Call Trace:
 [<c1571153>] skb_copy_bits+0x53/0x1d0
 [<e26a5486>] copy_payload+0x76/0xd0 [jool_siit]
 [<e26a39f5>] ttp46_tcp+0xf5/0x160 [jool_siit]
 [<e26a5617>] ttpcomm_translate_inner_packet+0xd7/0x240 [jool_siit]
 [<c15a7a5f>] ? nlmsg_notify+0x4f/0xb0
 [<c159c61f>] ? sch_direct_xmit+0x3f/0x190
 [<e26a2947>] post_icmp6error+0xa7/0x1c0 [jool_siit]
 [<e26a3613>] ttp46_icmp+0x253/0x540 [jool_siit]
 [<e26a588d>] translating_the_packet+0x8d/0x2a0 [jool_siit]
 [<e26a948c>] ? key_contains+0x1c/0x30 [jool_siit]
 [<e26a7c78>] ? pkt_init_ipv4+0x2d8/0x4a0 [jool_siit]
 [<e26ab852>] core_common+0x22/0xf0 [jool_siit]
 [<e26abf86>] ? xlator_find+0xa6/0xb0 [jool_siit]
 [<e26ab9b1>] core_4to6+0x91/0x130 [jool_siit]
 [<e0fcdc1f>] ? e1000_xmit_frame+0x89f/0xe40 [e1000]
 [<c1580001>] ? napi_gro_frags+0xc1/0x140
 [<c15a7a5f>] ? nlmsg_notify+0x4f/0xb0
 [<c159c61f>] ? sch_direct_xmit+0x3f/0x190
 [<c1581303>] ? __dev_queue_xmit+0x83/0x490
 [<c11303a6>] ? put_page+0x26/0x30
 [<c1572bc6>] ? skb_free_head+0x46/0x60
 [<c1570b24>] ? kfree_skbmem+0x34/0x90
 [<e26b14a2>] hook_ipv4+0x12/0x20 [jool_siit]
 [<c15a996c>] nf_iterate+0x6c/0x80
 [<c15af330>] ? inet_del_offload+0x30/0x30
 [<c15a99dc>] nf_hook_slow+0x5c/0x100
 [<c15af330>] ? inet_del_offload+0x30/0x30
 [<c15afdba>] ip_rcv+0x33a/0x430
 [<c15af330>] ? inet_del_offload+0x30/0x30
 [<c157f677>] __netif_receive_skb_core+0x577/0x750
 [<c157f866>] __netif_receive_skb+0x16/0x60
 [<c157f8cf>] netif_receive_skb+0x1f/0x80
 [<c15800e7>] napi_gro_receive+0x67/0x90
 [<e0fce4e5>] e1000_clean_rx_irq+0x275/0x4d0 [e1000]
 [<c1572b9c>] ? skb_free_head+0x1c/0x60
 [<c1570b24>] ? kfree_skbmem+0x34/0x90
 [<e0fcf37d>] e1000_clean+0x1cd/0x7f0 [e1000]
 [<c1063b43>] ? internal_add_timer+0x13/0x40
 [<c1065347>] ? mod_timer+0xe7/0x1f0
 [<c13617c9>] ? cursor_timer_handler+0x39/0x40
 [<c157fba0>] net_rx_action+0x110/0x210
 [<c105cf90>] __do_softirq+0xd0/0x250
 [<c105cec0>] ? cpu_callback+0x190/0x190
 <IRQ>
 [<c105d3c5>] ? irq_exit+0x95/0xa0
 [<c166e035>] ? do_IRQ+0x45/0xb0
 [<c107a837>] ? hrtimer_start+0x27/0x30
 [<c166ddf3>] ? common_interrupt+0x33/0x38
 [<c1017b30>] ? mwait_idle+0x50/0x70
 [<c10182b6>] ? arch_cpu_idle+0x26/0x30
 [<c10a7a21>] ? cpu_startup_entry+0x201/0x250
 [<c1655312>] ? rest_init+0x62/0x70
 [<c19c9ae7>] ? start_kernel+0x3a9/0x3af
 [<c19c9575>] ? repair_env_string+0x51/0x51
 [<c19c939c>] ? i386_start_kernel+0x137/0x13a
Code: 54 2b 43 50 88 43 4e 5b 5d c3 90 8d 74 26 00 e8 63 fc ff ff eb e8 90 55 89 e5 57 56 53 66 66 66 66 90 89 cb 89 c7 c1 e9 02 89 d6 <f3> a5 89 d9 83 e1 03 74 02 f3 a4 5b 5e 5f 5d c3 8d b6 00 00 00
EIP: [<c13035f4>] memcpy+0x14/0x30 SS:ESP 0068:ddc0ba20
CR2: 00000000cec3b000
---[ end trace c48c84d63fd53114 ]---
Kernel panic - not syncing: Fatal exception in interrupt
Kernel Offset: 0x0 from 0xc1000000 (relocation range: 0xc0000000-0xe07effff)

The bug has been fixed since commit 52deab1. In other words, it has been fixed all along. I'm so angry.

Releasing 3.5.2.

@ydahhrk ydahhrk added this to the 3.5.2 milestone Dec 2, 2016
@ydahhrk ydahhrk added a commit that referenced this issue Dec 2, 2016
@ydahhrk ydahhrk Add more improvements inspired by the #232 review
Includes

- A fix to the 6791 pool: was always using host addresses,
  regardless of whether the pool had elements or not.
- More graybox improvements.
- More comments.
d62b2ff
@ydahhrk ydahhrk closed this Dec 2, 2016
@toreanderson
Contributor

Nice work! 👍

Could you say if the triggering factor is some kind of malformed packet or just some random memory corruption or similar that is very unlikely to happen very often? That is, should I consider this a security issue that could be triggered by a specially crafted packet sent from anywhere on the Internet?

@ydahhrk
Member
ydahhrk commented Dec 3, 2016 edited

The trigger is a single, very specific packet that is all an attacker needs to murder the kernel. The packet itself is unlikely to happen naturally.

(The "security vulnerability" tag is rather redundant because every critical bug so far has been a security vulnerability oops! except for this one, and every security vulnerability should logically be treated as a serious bug.)

@ydahhrk
Member
ydahhrk commented Dec 3, 2016

By the way:

The bug takes a slightly different shape in Jool 3.4, and it's unclear to me whether it is conducive to a panic or not.

Jool 3.4.6 will be released on Monday regardless.

@ydahhrk
Member
ydahhrk commented Dec 3, 2016

Sorry for the inconvenience.

@toreanderson
Contributor

Not at all, thanks for the quick follow-up!

@ydahhrk ydahhrk added a commit that referenced this issue Dec 6, 2016
@ydahhrk ydahhrk Mirror the #232 tweaks on the 3.4 code
Not sure if the bug yields a panic in 3.4, but at the very least
this will prevent some legitimate packets from being dropped.
cb2eb07
@alexanderkjall
Contributor

Hi, will you apply for a CVE number for this bug? I just happened to notice this bug at random, and if there had been a CVE for it, it would have been possible to detect it in an automated way.

Best regards

@ydahhrk
Member
ydahhrk commented Dec 31, 2016

> Hi, will you apply for a CVE number for this bug? I just happened to notice this bug at random, and if there had been a CVE for it, it would have been possible to detect it in an automated way.

Ok, request sent. I used the "Distributed Weakness Filing Project" (iwantacve.org) option.
