This repository has been archived by the owner on May 4, 2020. It is now read-only.

Kernel Panic on 2/27 build with USG #97

Open
paulg1981 opened this issue Mar 1, 2019 · 64 comments

Comments

@paulg1981

Hello, I have been using these releases with great success for months. I installed the 2/27 build yesterday and upon restart I receive a kernel panic with the updated version. I reset the device to defaults and installed again and received the same issue. I downgraded to the previous release and everything works as expected. Anyone got any pointers to help troubleshoot? Is it just a bad build for the USG3P? Any advice or assistance would be appreciated!

@Dr-Escher

Same issue after upgrading to the latest release. The device has been stuck in a reboot loop with occasional ping responses in between.

Package: wireguard-e50-0.0.20190227-1
Device: ER-X-SFP
Firmware: EdgeOS v1.10.9.5166958.190213.1952

@phillipmcmahon

Same issue for me on an ER-6P. I upgraded remotely and now the unit is down, with no Internet at the site. Once I get serial access I can post more info.

What testing is done on these packages prior to being released?

@NimlothPL

Mar  2 14:33:13 USG3P kernel: CPU 1 Unable to handle kernel paging request at virtual address 0000000000000000, epc == ffffffffc012ced8, ra == ffffffffc0b9314c
Mar  2 14:33:13 USG3P kernel: Oops[#1]:
Mar  2 14:33:13 USG3P kernel: CPU: 1 PID: 4103 Comm: ip Tainted: P           O 3.10.107-UBNT #1
Mar  2 14:33:13 USG3P kernel: task: 800000041c20e0e0 ti: 800000000c030000 task.ti: 800000000c030000
Mar  2 14:33:13 USG3P kernel: $ 0   : 0000000000000000 0000000000000004 ffffffffc0660000 ffffffffc050b3e8
Mar  2 14:33:13 USG3P kernel: $ 4   : 0000000000000001 00000000000012d0 ffffffffc0b9314c 800000000c033670
Mar  2 14:33:13 USG3P kernel: $ 8   : ffffffffffffff9d 800000041d296cc0 ffffffffc050b3e8 000000001a5f4728
Mar  2 14:33:13 USG3P kernel: $12   : 0000000000000008 ffffffffc025c878 ffffffffd76c0898 0000000000000000
Mar  2 14:33:13 USG3P kernel: $16   : 800000041d296000 0000000000000000 00000000000012d0 ffffffffc0531980
Mar  2 14:33:13 USG3P kernel: $20   : 800000041db09e10 800000041d296000 0000000000000000 ffffffffc080a380
Mar  2 14:33:13 USG3P kernel: $24   : 0000000005733924 0000000027f2031c
Mar  2 14:33:13 USG3P kernel: $28   : 800000000c030000 800000000c033710 800000000c033780 ffffffffc0b9314c
Mar  2 14:33:13 USG3P kernel: Hi    : 0000000000000000
Mar  2 14:33:13 USG3P kernel: Lo    : 1dcbc89e99000000
Mar  2 14:33:13 USG3P kernel: epc   : ffffffffc012ced8 kmem_cache_alloc+0x30/0x150
Mar  2 14:33:13 USG3P kernel:    Tainted: P           O
Mar  2 14:33:13 USG3P kernel: ra    : ffffffffc0b9314c wg_pubkey_hashtable_alloc+0x1c/0xd8 [wireguard]
Mar  2 14:33:13 USG3P kernel: Status: 10008ce3  KX SX UX KERNEL EXL IE
Mar  2 14:33:13 USG3P kernel: Cause : 00800008
Mar  2 14:33:13 USG3P kernel: BadVA : 0000000000000000
Mar  2 14:33:13 USG3P kernel: PrId  : 000d0601 (Cavium Octeon+)
Mar  2 14:33:13 USG3P kernel: Modules linked in: wireguard(O) ip_tunnel xt_mark xt_nat 8021q garp stp llc ipt_MASQUERADE xt_set nf_conntrack_ipv6 nf_defrag_ipv6 xt_comment xt_conntrack ip_set_bitmap_port xt_TCPMSS xt_tcpudp ip6table_mangle ip6table_filter ip6table_raw ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 iptable_mangle xt_CT iptable_raw nf_nat_pptp nf_conntrack_pptp nf_conntrack_proto_gre nf_nat_h323 nf_conntrack_h323 nf_nat_proto_gre nf_nat_tftp nf_nat_ftp nf_nat nf_conntrack_tftp nf_conntrack_ftp nf_conntrack iptable_filter ip_tables x_tables ip_set_hash_net ip_set nfnetlink configfs unifigpio(PO) unifihal(PO) cvm_ipsec_kame(O) ipv6 imq cavium_ip_offload(PO) ubnt_nf_app(PO) tdts(PO) octeon_rng rng_core octeon_ethernet mdio_octeon ethernet_mem octeon_common of_mdio ubnt_platform(PO) libphy [last unloaded: nf_conntrack_sip]
Mar  2 14:33:13 USG3P kernel: Process ip (pid: 4103, threadinfo=800000000c030000, task=800000041c20e0e0, tls=0000000077a5b490)
Mar  2 14:33:13 USG3P kernel: Stack : 800000041d296000 800000041d296680 800000000c033780 ffffffffc0b9314c
Mar  2 14:33:13 USG3P kernel:     800000041d296000 ffffffffc0b8d01c 800000041db09e00 800000041db09e00
Mar  2 14:33:13 USG3P kernel:     ffffffffc0531980 800000000c033780 ffffffffc0531980 ffffffffc0346a5c
Mar  2 14:33:13 USG3P kernel:     800000000c033780 ffffffffc0346768 0000000000000000 0000000000000000
Mar  2 14:33:13 USG3P kernel:     0000000000000000 800000041db09e20 0000000000000000 0000000000000000
Mar  2 14:33:13 USG3P kernel:     0000000000000000 0000000000000000 0000000000000000 0000000000000000
Mar  2 14:33:13 USG3P kernel: last message repeated 2 times
Mar  2 14:33:13 USG3P kernel:     800000041db09e28 0000000000000000 0000000000000000 0000000000000000
Mar  2 14:33:13 USG3P kernel:     0000000000000000 0000000000000000 0000000000000000 0000000000000000
Mar  2 14:33:13 USG3P kernel:     ...
Mar  2 14:33:13 USG3P kernel: Call Trace:
Mar  2 14:33:13 USG3P kernel: [<ffffffffc012ced8>] kmem_cache_alloc+0x30/0x150
Mar  2 14:33:13 USG3P kernel: [<ffffffffc0b9314c>] wg_pubkey_hashtable_alloc+0x1c/0xd8 [wireguard]
Mar  2 14:33:13 USG3P kernel: [<ffffffffc0b8d01c>] wg_newlink+0xac/0x3c8 [wireguard]
Mar  2 14:33:13 USG3P kernel: [<ffffffffc0346a5c>] rtnl_newlink+0x434/0x538
Mar  2 14:33:13 USG3P kernel:
Mar  2 14:33:13 USG3P kernel:
Mar  2 14:33:13 USG3P kernel: Code: 0080882d  ffb00000  9f840020 <de220000> 000420f8  0064202d  dc840000  0044382d  dcec0008
Mar  2 14:33:13 USG3P kernel: ---[ end trace 0588e2b9fdef1fd0 ]---

@phillipmcmahon

Seems to be quite an issue. Maybe pull this release until more is known about why this is happening.

@Lochnair
Owner

Lochnair commented Mar 2, 2019

@phillipmcmahon Agreed. I've pulled the 1.10 packages for now. As for testing before release - most of the time, there is none, as I don't really have equipment to test on.

@NimlothPL Thanks for the stacktrace. Seems related to this commit. I'll ask Jason about it.

@evenfowler

I was able to fix this on a USG 4 Pro with the help of single user mode.

I connected a serial console cable and then caught the U-Boot console by pressing a key before it continued booting. You should see something like:

U-Boot 2012.04.01 (UBNT Build Version: e221_002_01aa9) (Aug 17 2018 - 01:13:14)

Skipping PCIe port 0 BIST, in EP mode, can't tell if clocked.
Skipping PCIe port 1 BIST, reset not done. (port not configured)
BIST check passed.
UBNT_E220 r1:1, r2:14, serial #: 000000FFFFFF
MPR 13-02102-14
Core clock: 1000 MHz, IO clock: 600 MHz, DDR clock: 533 MHz (1066 Mhz DDR)
Base DRAM address used by u-boot: 0x8f800000, size: 0x800000
DRAM: 2 GiB
Clearing DRAM...... done
Flash: 8 MiB
Net:   octeth0, octeth1, octeth2, octeth3
MMC:   Octeon MMC/SD0: 0
USB:   USB EHCI 1.00
scanning bus for devices... 1 USB Device(s) found
Type the command 'usb start' to scan for USB storage devices.

Hit any key to stop autoboot:  0 
Octeon ubnt_e220# 

Once in the U-Boot console I ran printenv to find the bootcmd value.

Octeon ubnt_e220# printenv
autoload=n
baudrate=115200
boardname=ubnt_e220
bootcmd=fatload mmc 0 $(loadaddr) vmlinux.64;bootoctlinux $(loadaddr) numcores=2 endbootargs mem=0 root=/dev/mmcblk0p2 rootdelay=10 rw rootsqimg=squashfs.img rootsqwdir=w mtdparts=phys_mapped_flash:640k(boot0),640k(boot1),64k(eeprom)
bootdelay=0

I copied the value of bootcmd and appended the word "single", which tells the kernel to boot into single-user mode.

The actual command I ran at the U-Boot console was:

fatload mmc 0 $(loadaddr) vmlinux.64;bootoctlinux $(loadaddr) numcores=2 endbootargs mem=0 root=/dev/mmcblk0p2 rootdelay=10 rw rootsqimg=squashfs.img rootsqwdir=w mtdparts=phys_mapped_flash:640k(boot0),640k(boot1),64k(eeprom) single

Once in single user mode I uninstalled the deb package using dpkg and then rebooted.

dpkg --remove wireguard
shutdown -r now

If you're on a Unifi-enabled board you'll get provisioning errors when the Unifi controller tries to commit a config that specifies a WireGuard interface (assuming you persisted the WireGuard config using a config.gateway.json file on the controller). Simply ignore that, then install the working version and let the controller re-provision the device now that it knows what a wireguard interface type is.

@zx2c4
Collaborator

zx2c4 commented Mar 3, 2019

Thanks for the report. I'll look into it.

@phillipmcmahon

phillipmcmahon commented Mar 3, 2019

I'm happy to test basic install, reboot, and simple functionality on the hardware I have: an ER-X-SFP and an ER-6P, both running the 1.10 firmware branch.

@zx2c4
Collaborator

zx2c4 commented Mar 3, 2019

If you've got a working toolchain, would you mind building with this patch and letting me know if that "fixes" it?

diff --git a/src/compat/compat.h b/src/compat/compat.h
index 7a61e4c1..7c2d5125 100644
--- a/src/compat/compat.h
+++ b/src/compat/compat.h
@@ -466,11 +466,13 @@ static inline void *kvmalloc_ours(size_t size, gfp_t flags)
 {
 	gfp_t kmalloc_flags = flags;
 	void *ret;
+#ifndef CONFIG_CAVIUM_OCTEON_IPFWD_OFFLOAD
 	if (size > PAGE_SIZE) {
 		kmalloc_flags |= __GFP_NOWARN;
 		if (!(kmalloc_flags & __GFP_REPEAT) || (size <= PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER))
 			kmalloc_flags |= __GFP_NORETRY;
 	}
+#endif
 	ret = kmalloc(size, kmalloc_flags);
 	if (ret || size <= PAGE_SIZE)
 		return ret;

@aswild
Contributor

aswild commented Mar 3, 2019

Same issue on my ER-4 with FW v2.0.0. I ran make deb-e300 from commit 2877098 of the v2.0 branch. Had to use the reset button and restore a backup.

[**    ] A start job is running for UBNT Routing Daemons (57s / no limit)CPU 2 Unable to handle kernel paging request at virtual address 0000000400000000, epc == ffffffff80956b74, ra == 8
Oops[#1]:
CPU: 2 PID: 3995 Comm: ip Tainted: P           O    4.9.79-UBNT #1
task: 800000004d322700 task.stack: 800000004421c000
$ 0   : 0000000000000000 0000000000000000 ffffffff80f70000 ffffffff80def658
$ 4   : 0000000400000000 0000000000000002 0000000000000000 ffffffffc056bd48
$ 8   : 000000006239a4de ffffffff80def658 da451be76a5f3a20 a7fdf6cb8743060e
$12   : 0000000000000000 ffffffff80ab969c 0000000028bcd81f 800000004d01bda8
$16   : 0000000400000000 ffffffff808c0000 00000000024012c0 0000000000000001
$20   : 800000004d01b780 ffffffffc0570000 ffffffff80e1eb00 ffffffffc0581e90
$24   : 000000001215c592 ffffffffd8a70a1c
$28   : 800000004421c000 800000004421f7a0 800000004421f830 ffffffffc056bd48
Hi    : 0000000000000006
Lo    : ccccccccccccccd7
epc   : ffffffff80956b74 kmem_cache_alloc+0x34/0x160
ra    : ffffffffc056bd48 wg_pubkey_hashtable_alloc+0x28/0xe8 [wireguard]
Status: 10009ce3        KX SX UX KERNEL EXL IE
Cause : 00800008 (ExcCode 02)
BadVA : 0000000400000000
PrId  : 000d9602 (Cavium Octeon III)
Modules linked in: wireguard(O) ip6_udp_tunnel udp_tunnel 8021q garp stp llc ipt_MASQUERADE nf_nat_masquerade_ipv4 xt_NETMAP xt_set nf_log_ipv4 ipt_REJECT nf_reject_ipv4 nf_log_ipv6 nf_l6
Process ip (pid: 3995, threadinfo=800000004421c000, task=800000004d322700, tls=00000000770cb490)
Stack : ffffffff80956b40 ffffffff808c0000 ffffffff808bbb10 ffffffffc056bd48
        800000004d01b000 ffffffffc056572c 0000000000000003 800000004d01b000
        ffffffff80e1eb00 8000000047cf0000 0000000000000000 800000004421f830
        0000000000000000 ffffffff80c1223c 0000000000000000 0000000000000000
        8000000047cf0000 ffffffff80c11d3c 0000000000000000 0000000000000000
        0000000000000000 8000000047cf0020 0000000000000000 0000000000000000
        0000000000000000 0000000000000000 0000000000000000 0000000000000000
        0000000000000000 0000000000000000 0000000000000000 0000000000000000
        0000000000000000 0000000000000000 0000000000000000 0000000000000000
        8000000047cf0028 0000000000000000 0000000000000000 0000000000000000
        ...
Call Trace:
[<ffffffff80956b74>] kmem_cache_alloc+0x34/0x160
[<ffffffffc056bd48>] wg_pubkey_hashtable_alloc+0x28/0xe8 [wireguard]
[<ffffffffc056572c>] wg_newlink+0xdc/0x3e0 [wireguard]
[<ffffffff80c1223c>] rtnl_newlink+0x674/0x750
Code: 00a0902d  0060482d  9f850018 <de020000> 000528f8  7c652a0a  64420008  7c45620a  9f880018

---[ end trace d08fbf877d376bec ]---
Kernel panic - not syncing: Fatal exception
Rebooting in 60 seconds..

@aswild
Contributor

aswild commented Mar 3, 2019

@zx2c4 I tried your patch but it didn't help on my ER-4 (v2.0.0, kernel 4.9.79).

I changed your #ifndef to #if !defined(CONFIG_CAVIUM_OCTEON_IPFWD_OFFLOAD) && !defined(CONFIG_CAVIUM_IPFWD_OFFLOAD) since it looks like the config name changed in the new kernel (verified with #error that the block wasn't compiled in), but still the same panic when I create a wireguard device.

@aswild
Contributor

aswild commented Mar 3, 2019

@Lochnair wireguard-v2.0-e300-0.0.20190227-1.deb from the 0.0.20190227 github release panics for me, you may want to pull the v2.0 binaries too.

@zx2c4
Collaborator

zx2c4 commented Mar 3, 2019

Alright let's take it a step further then and use an entirely different allocator and see if that makes the problem go away. Then at least we'll have some idea of what we're looking at:

diff --git a/src/compat/compat.h b/src/compat/compat.h
index 7a61e4c1..cbf9427a 100644
--- a/src/compat/compat.h
+++ b/src/compat/compat.h
@@ -464,6 +464,7 @@ static inline __be32 our_inet_confirm_addr(struct net *net, struct in_device *in
 #include <linux/slab.h>
 static inline void *kvmalloc_ours(size_t size, gfp_t flags)
 {
+#ifndef CONFIG_CAVIUM_OCTEON_IPFWD_OFFLOAD
 	gfp_t kmalloc_flags = flags;
 	void *ret;
 	if (size > PAGE_SIZE) {
@@ -474,6 +475,7 @@ static inline void *kvmalloc_ours(size_t size, gfp_t flags)
 	ret = kmalloc(size, kmalloc_flags);
 	if (ret || size <= PAGE_SIZE)
 		return ret;
+#endif
 	return __vmalloc(size, flags, PAGE_KERNEL);
 }
 static inline void *kvzalloc_ours(size_t size, gfp_t flags)

@zx2c4
Collaborator

zx2c4 commented Mar 3, 2019

Is this the right firmware for that stacktrace, btw? https://dl.ubnt.com/firmwares/edgemax/v2.0.x/ER-e300.v2.0.0.5155284.tar

@aswild
Contributor

aswild commented Mar 3, 2019

@zx2c4 Thanks, it looks like this patch works!

For the 4.9 kernel I changed your patch slightly, since the _OCTEON was removed from the config name (and code was moved from arch/mips/cavium-octeon to drivers/net/ethernet/cavium/octeon):

diff --git a/src/compat/compat.h b/src/compat/compat.h
index 7a61e4c..0131d22 100644
--- a/src/compat/compat.h
+++ b/src/compat/compat.h
@@ -464,6 +464,7 @@ static inline __be32 our_inet_confirm_addr(struct net *net, struct in_device *in
 #include <linux/slab.h>
 static inline void *kvmalloc_ours(size_t size, gfp_t flags)
 {
+#if !defined(CONFIG_CAVIUM_OCTEON_IPFWD_OFFLOAD) && !defined(CONFIG_CAVIUM_IPFWD_OFFLOAD)
        gfp_t kmalloc_flags = flags;
        void *ret;
        if (size > PAGE_SIZE) {
@@ -474,6 +475,7 @@ static inline void *kvmalloc_ours(size_t size, gfp_t flags)
        ret = kmalloc(size, kmalloc_flags);
        if (ret || size <= PAGE_SIZE)
                return ret;
+#endif
        return __vmalloc(size, flags, PAGE_KERNEL);
 }
 static inline void *kvzalloc_ours(size_t size, gfp_t flags)

Yes, that's the right firmware for my stacktrace (but @NimlothPL's earlier in the thread is for a different firmware/kernel/hardware).

Ubiquiti still hasn't updated their downloads page for v2.0, nor provided a final GPL archive, so I'm building with kernel source from the v2.0.0/master branch of @Lochnair's kernel_e300 repo (based on ubnt's 2.0.0-beta2 GPL release).

@zx2c4
Collaborator

zx2c4 commented Mar 3, 2019

Do you need CONFIG_CAVIUM_IPFWD_OFFLOAD specified in the other part of compat.h where we special case weird offloading logic?

@aswild
Contributor

aswild commented Mar 3, 2019

I didn't touch that part of compat.h when building, but it looks like CONFIG_CAVIUM_IPFWD_OFFLOAD should be included there too. (all I've tested so far is simple pings that probably don't touch the offload engine)

In skbuff.h, struct cvm_packet_info cvm_info; is added to sk_buff for #ifdef CONFIG_CAVIUM_NET_PACKET_FWD_OFFLOAD

@zx2c4
Collaborator

zx2c4 commented Mar 3, 2019

> I didn't touch that part of compat.h when building, but it looks like CONFIG_CAVIUM_IPFWD_OFFLOAD should be included there too. (all I've tested so far is simple pings that probably don't touch the offload engine)

Before I add it, I'd be very grateful if you could do some comparison to show that it's the right thing to do.

Also, with regards to the real bug here, we now know there's something gravely wrong with the slab allocator (kmalloc_caches[15] is an invalid pointer), but we don't know why or how to mitigate that. Think you could send me the output of cat /proc/slabinfo?

@zx2c4
Collaborator

zx2c4 commented Mar 3, 2019

> For the 4.9 kernel I changed your patch slightly

Woah woah are you saying that this bug is present on their 4.9 kernel too? Not just their 3.10? Or did you not actually try to trigger it on the 4.9 yet?

@aswild
Contributor

aswild commented Mar 3, 2019

> Before I add it, I'd be very grateful if you could do some comparison to show that it's the right thing to do.

Checking that now and doing some iperf3 benchmarking.

> are you saying that this bug is present on their 4.9 kernel too?

Yep, all of my building/testing today has been on the 4.9 kernel, I don't have 3.10 running on anything (and it'd probably be tricky to downgrade)

@zx2c4
Collaborator

zx2c4 commented Mar 3, 2019

Gotcha, thanks for clarifying. I've been looking at the wrong kernel sources! Awaiting cat /proc/slabinfo when you have a chance.

@aswild
Contributor

aswild commented Mar 3, 2019

Here's /proc/slabinfo. wireguard is loaded and configured with only the allocator change made to compat.h (not skb_scrub_packet).

slabinfo - version: 2.1
# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
nf_conntrack_expect      0      0    224   18    1 : tunables    0    0    0 : slabdata      0      0      0
nf_conntrack         156    315    384   21    2 : tunables    0    0    0 : slabdata     15     15      0
ip6-frags              0      0    200   20    1 : tunables    0    0    0 : slabdata      0      0      0
tw_sock_TCPv6         16     16    248   16    1 : tunables    0    0    0 : slabdata      1      1      0
request_sock_TCPv6      0      0    304   26    2 : tunables    0    0    0 : slabdata      0      0      0
TCPv6                 64     64   2048   16    8 : tunables    0    0    0 : slabdata      4      4      0
cfq_queue             68     68    240   17    1 : tunables    0    0    0 : slabdata      4      4      0
mqueue_inode_cache     18     18    896   18    4 : tunables    0    0    0 : slabdata      1      1      0
fat_inode_cache        0      0    656   24    4 : tunables    0    0    0 : slabdata      0      0      0
fat_cache              0      0     40  102    1 : tunables    0    0    0 : slabdata      0      0      0
squashfs_inode_cache   2925   2925    640   25    4 : tunables    0    0    0 : slabdata    117    117      0
jbd2_transaction_s     64     64    256   16    1 : tunables    0    0    0 : slabdata      4      4      0
jbd2_journal_handle    340    340     48   85    1 : tunables    0    0    0 : slabdata      4      4      0
jbd2_journal_head    340    340    120   34    1 : tunables    0    0    0 : slabdata     10     10      0
jbd2_revoke_table_s    256    256     16  256    1 : tunables    0    0    0 : slabdata      1      1      0
jbd2_revoke_record_s      0      0     32  128    1 : tunables    0    0    0 : slabdata      0      0      0
ext2_inode_cache       0      0    712   23    4 : tunables    0    0    0 : slabdata      0      0      0
ext4_inode_cache     306    306    936   17    4 : tunables    0    0    0 : slabdata     18     18      0
ext4_allocation_context    128    128    128   32    1 : tunables    0    0    0 : slabdata      4      4      0
ext4_system_zone     102    102     40  102    1 : tunables    0    0    0 : slabdata      1      1      0
ext4_io_end          384    384     64   64    1 : tunables    0    0    0 : slabdata      6      6      0
ext4_extent_status    510    510     40  102    1 : tunables    0    0    0 : slabdata      5      5      0
mbcache                0      0     56   73    1 : tunables    0    0    0 : slabdata      0      0      0
dio                    0      0    640   25    4 : tunables    0    0    0 : slabdata      0      0      0
posix_timers_cache     18     18    216   18    1 : tunables    0    0    0 : slabdata      1      1      0
UNIX                 224    224   1152   28    8 : tunables    0    0    0 : slabdata      8      8      0
ip4-frags             44     44    184   22    1 : tunables    0    0    0 : slabdata      2      2      0
flow_cache           144    144    112   36    1 : tunables    0    0    0 : slabdata      4      4      0
tw_sock_TCP           64     64    248   16    1 : tunables    0    0    0 : slabdata      4      4      0
request_sock_TCP     104    104    304   26    2 : tunables    0    0    0 : slabdata      4      4      0
TCP                   68     68   1920   17    8 : tunables    0    0    0 : slabdata      4      4      0
hugetlbfs_inode_cache     29     29    552   29    4 : tunables    0    0    0 : slabdata      1      1      0
eventpoll_pwq        280    280     72   56    1 : tunables    0    0    0 : slabdata      5      5      0
inotify_inode_mark    184    184     88   46    1 : tunables    0    0    0 : slabdata      4      4      0
request_queue         17     17   1848   17    8 : tunables    0    0    0 : slabdata      1      1      0
blkdev_requests      552    552    344   23    2 : tunables    0    0    0 : slabdata     24     24      0
blkdev_ioc           156    156    104   39    1 : tunables    0    0    0 : slabdata      4      4      0
sock_inode_cache     300    300    640   25    4 : tunables    0    0    0 : slabdata     12     12      0
file_lock_cache       76     76    208   19    1 : tunables    0    0    0 : slabdata      4      4      0
net_namespace          0      0   5632    5    8 : tunables    0    0    0 : slabdata      0      0      0
shmem_inode_cache   2025   2025    640   25    4 : tunables    0    0    0 : slabdata     81     81      0
proc_inode_cache    1695   1728    592   27    4 : tunables    0    0    0 : slabdata     64     64      0
sigqueue             100    100    160   25    1 : tunables    0    0    0 : slabdata      4      4      0
bdev_cache            84     84    768   21    4 : tunables    0    0    0 : slabdata      4      4      0
kernfs_node_cache  10132  10132    120   34    1 : tunables    0    0    0 : slabdata    298    298      0
mnt_cache            210    210    384   21    2 : tunables    0    0    0 : slabdata     10     10      0
inode_cache         4857   5490    536   30    4 : tunables    0    0    0 : slabdata    183    183      0
dentry             23463  24696    192   21    1 : tunables    0    0    0 : slabdata   1176   1176      0
iint_cache             0      0     80   51    1 : tunables    0    0    0 : slabdata      0      0      0
buffer_head        31356  31356    104   39    1 : tunables    0    0    0 : slabdata    804    804      0
nsproxy              292    292     56   73    1 : tunables    0    0    0 : slabdata      4      4      0
files_cache          105    105    768   21    4 : tunables    0    0    0 : slabdata      5      5      0
signal_cache         396    396    896   18    4 : tunables    0    0    0 : slabdata     22     22      0
sighand_cache        153    161   4224    7    8 : tunables    0    0    0 : slabdata     23     23      0
task_struct          232    243   3328    9    8 : tunables    0    0    0 : slabdata     27     27      0
anon_vma            4736   4736     64   64    1 : tunables    0    0    0 : slabdata     74     74      0
shared_policy_node    340    340     48   85    1 : tunables    0    0    0 : slabdata      4      4      0
numa_policy          170    170     24  170    1 : tunables    0    0    0 : slabdata      1      1      0
radix_tree_node     1708   1708    584   28    4 : tunables    0    0    0 : slabdata     61     61      0
idr_layer_cache      255    255   2096   15    8 : tunables    0    0    0 : slabdata     17     17      0
kmalloc-8192          80     80   8192    4    8 : tunables    0    0    0 : slabdata     20     20      0
kmalloc-4096        1354   1808   4096    8    8 : tunables    0    0    0 : slabdata    226    226      0
kmalloc-2048         306    320   2048   16    8 : tunables    0    0    0 : slabdata     20     20      0
kmalloc-1024        1605   1664   1024   16    4 : tunables    0    0    0 : slabdata    104    104      0
kmalloc-512         3051   3552    512   16    2 : tunables    0    0    0 : slabdata    222    222      0
kmalloc-256         1738   1984    256   16    1 : tunables    0    0    0 : slabdata    124    124      0
kmalloc-192         5985   5985    192   21    1 : tunables    0    0    0 : slabdata    285    285      0
kmalloc-128        15360  15552    128   32    1 : tunables    0    0    0 : slabdata    486    486      0
kmalloc-96          7350   7350     96   42    1 : tunables    0    0    0 : slabdata    175    175      0
kmalloc-64         18221  20032     64   64    1 : tunables    0    0    0 : slabdata    313    313      0
kmalloc-32          1664   1664     32  128    1 : tunables    0    0    0 : slabdata     13     13      0
kmalloc-16          2304   2304     16  256    1 : tunables    0    0    0 : slabdata      9      9      0
kmalloc-8           6144   6144      8  512    1 : tunables    0    0    0 : slabdata     12     12      0
kmem_cache_node      128    128     64   64    1 : tunables    0    0    0 : slabdata      2      2      0
kmem_cache            80     80    256   16    1 : tunables    0    0    0 : slabdata      5      5      0

@aswild
Contributor

aswild commented Mar 4, 2019

Rebuilt wireguard with skb_scrub_packet patched for CONFIG_CAVIUM_IPFWD_OFFLOAD and it works too.

iperf3 might be slightly faster when terminating wireguard in the ER4 and then forwarding to a LAN host with the skb_scrub_packet patch, but it was pretty close.

@zx2c4
Collaborator

zx2c4 commented Mar 4, 2019

This is a bit of a frustrating situation as I don't have things set up to keep trying stuff, so it's quite hard to debug, and the octeon kernel won't build for qemu. If you've got a lot of patience, there are a million things I'm curious about in trying to track this bug down. For example:

diff --git a/mm/slab_common.c b/mm/slab_common.c
index 622f6b6ae..29861409a 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -980,6 +980,7 @@ static void __init new_kmalloc_cache(int idx, unsigned long flags)
 {
 	kmalloc_caches[idx] = create_kmalloc_cache(kmalloc_info[idx].name,
 					kmalloc_info[idx].size, flags);
+	pr_err("SARU making cache %d is 0x%llx called %s size %lu flags 0x%x\n", idx, kmalloc_caches[idx], kmalloc_info[idx].name, kmalloc_info[idx].size, flags);
 }
 
 /*
@@ -992,6 +993,7 @@ void __init create_kmalloc_caches(unsigned long flags)
 	int i;
 
 	for (i = KMALLOC_SHIFT_LOW; i <= KMALLOC_SHIFT_HIGH; i++) {
+		pr_err("SARU iteration %d, pre-state: 0x%llx\n", i, kmalloc_caches[i]);
 		if (!kmalloc_caches[i])
 			new_kmalloc_cache(i, flags);
 

Got IRC or something? Might be easier to work through it there, if you're up for that.

@aswild
Contributor

aswild commented Mar 4, 2019

I can dig up an IRC client, but I'm not super comfortable testing out kernel patches. When I soft-bricked at first, I wasn't able to break into a bootloader shell and don't know what would happen if I got stuck with an unbootable kernel.

Happy to test out wireguard patches as long as my roommate's not using the internet.

P.S. I sympathize with the struggle of debugging without hardware, and really appreciate your help on this issue!

@zx2c4
Collaborator

zx2c4 commented Mar 4, 2019

Okay what if you patch wireguard with the below and see at which point it crashes (i.e. send me the whole dmesg output):

diff --git a/src/main.c b/src/main.c
index 4b5b58e8..cda15a94 100644
--- a/src/main.c
+++ b/src/main.c
@@ -20,8 +20,20 @@
 
 static int __init mod_init(void)
 {
+	unsigned long i;
+	void *ohnose;
 	int ret;
 
+	for (i = 0; i < ilog2(0x100000000); ++i) {
+		pr_err("About to allocate size %lu, index %d", 1UL << i, kmalloc_index(1UL << i));
+		ohnose = kmalloc(1UL << i, GFP_KERNEL);
+		if (!ohnose) {
+			pr_err("Allocation failed at size %lu\n", 1UL << i);
+			break;
+		}
+		kfree(ohnose);
+	}
+
 	if ((ret = chacha20_mod_init()) || (ret = poly1305_mod_init()) ||
 	    (ret = chacha20poly1305_mod_init()) || (ret = blake2s_mod_init()) ||
 	    (ret = curve25519_mod_init()))

@aswild
Contributor

aswild commented Mar 5, 2019

Sure, I can try that out (as soon as I can find a reasonable maintenance window). One issue is that systemd seems to capture most of the kernel output once it starts so the prints before the panic might get dropped. I'll play around with printk levels to see if I can make them hit the console unconditionally.

@zx2c4
Collaborator

zx2c4 commented Mar 5, 2019

Those are pr_err prints, so they should be somewhat unconditional.

I wasn't aware edgemax had moved to systemd.

@aswild
Contributor

aswild commented Mar 5, 2019

Yeah, EdgeOS v2.0 switched to Debian Stretch with systemd. Here's the output after insmod with the kmalloc patch. Interestingly it didn't panic in this context. I did rmmod wireguard then insmod /tmp/wireguard.ko.

Here's the dmesg output starting after the insmod. Did you want the full log starting at boot?

[94275.974092] wireguard: About to allocate size 1, index 5
[94275.977934] wireguard: About to allocate size 2, index 5
[94275.981942] wireguard: About to allocate size 4, index 5
[94275.985803] wireguard: About to allocate size 8, index 5
[94275.989814] wireguard: About to allocate size 16, index 5
[94275.993733] wireguard: About to allocate size 32, index 5
[94275.997839] wireguard: About to allocate size 64, index 6
[94276.001759] wireguard: About to allocate size 128, index 7
[94276.005948] wireguard: About to allocate size 256, index 8
[94276.009955] wireguard: About to allocate size 512, index 9
[94276.014144] wireguard: About to allocate size 1024, index 10
[94276.018324] wireguard: About to allocate size 2048, index 11
[94276.022679] wireguard: About to allocate size 4096, index 12
[94276.026867] wireguard: About to allocate size 8192, index 13
[94276.031223] wireguard: About to allocate size 16384, index 14
[94276.035506] wireguard: About to allocate size 32768, index 15
[94276.039951] wireguard: About to allocate size 65536, index 16
[94276.044235] wireguard: About to allocate size 131072, index 17
[94276.048768] wireguard: About to allocate size 262144, index 18
[94276.053128] wireguard: About to allocate size 524288, index 19
[94276.057679] wireguard: About to allocate size 1048576, index 20
[94276.062147] wireguard: About to allocate size 2097152, index 21
[94276.066814] wireguard: About to allocate size 4194304, index 22
[94276.071356] wireguard: About to allocate size 8388608, index 23
[94276.076194] wireguard: About to allocate size 16777216, index 24
[94276.081217] wireguard: About to allocate size 33554432, index 25
[94276.087004] wireguard: About to allocate size 67108864, index 26
[94276.091534] ------------[ cut here ]------------
[94276.094880] WARNING: CPU: 0 PID: 19738 at mm/page_alloc.c:3544 __alloc_pages_nodemask+0x2f8/0xca8
[94276.102452] Modules linked in: wireguard(O+) sch_fq_codel sch_htb xt_nat xt_multiport ip6_udp_tunnel udp_tunnel 8021q garp stp llc ipt_MASQUERADE nf_nat_masquerade_ipv4 xt_NETMAP xt_set nf_log_ipv4 ipt_REJECT nf_reject_ipv4 nf_log_ipv6 nf_log_common nf_conntrack_ipv6 nf_defrag_ipv6 xt_LOG xt_tcpudp xt_comment xt_conntrack ip_set_bitmap_port ip6table_mangle ip6table_filter ip6table_raw ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 iptable_mangle xt_CT iptable_raw nf_nat_h323 nf_conntrack_h323 nf_nat_sip nf_conntrack_sip nf_nat_tftp nf_nat_ftp nf_conntrack_tftp nf_conntrack_ftp ip_set_hash_net ip_set nfnetlink iptable_filter cvm_ipsec_kame(O) imq cavium_ip_offload(O) ubnt_nf_app(O) tdts(PO) octeon_rng rng_core nf_nat_pptp nf_nat_proto_gre nf_conntrack_pptp nf_conntrack_proto_gre
[94276.172413]  nf_nat nf_conntrack ubnt_platform(PO) ip_tables x_tables ipv6 [last unloaded: wireguard]
[94276.180422] CPU: 0 PID: 19738 Comm: insmod Tainted: P           O    4.9.79-UBNT #1
[94276.186772] Stack : 0000000000000000 0000000000000004 0000000000000006 0000000000000000
[94276.193528]         ffffffff80e00000 ffffffff80f65eb0 ffffffff80f60000 ffffffff80e00000
[94276.200283]         0000000000000000 0000000000000000 0000000000000047 0000000000000000
[94276.207037]         ffffffff80f60000 ffffffff808c07c8 0000000000000004 ffffffff808c18c8
[94276.213791]         0000000000000000 0000000000000000 0000000000000000 ffffffff80f60000
[94276.220545]         ffffffff80d7a468 ffffffff80df3f07 8000000046418d00 ffffffff80f5c300
[94276.227300]         0000000000004d1a 0000000000000000 0000000000100001 ffffffff808fae64
[94276.234054]         ffffffff808e7b20 8000000047cbb860 8000000047cbb978 ffffffff80aa9234
[94276.240809]         0000000000000000 ffffffff808c2000 000000000000000a ffffffff80d7a468
[94276.247563]         0000000000000000 ffffffff808601c8 0000000000000000 0000000000000000
[94276.254318]         ...
[94276.255482] Call Trace:
[94276.256631] [<ffffffff808601c8>] show_stack+0x90/0xb0
[94276.260383] [<ffffffff80aa9234>] dump_stack+0x84/0xc0
[94276.264134] [<ffffffff8087eb08>] __warn+0x100/0x118
[94276.267712] [<ffffffff809066e8>] __alloc_pages_nodemask+0x2f8/0xca8
[94276.272681] [<ffffffff80922e54>] kmalloc_order+0x14/0x80
[94276.276728] [<ffffffffc05c7250>] mod_init+0x250/0x3b4 [wireguard]
[94276.281535] [<ffffffff80800610>] do_one_initcall+0x40/0x140
[94276.285809] [<ffffffff808fb2ac>] do_init_module+0x64/0x1b4
[94276.289995] [<ffffffff808eaa4c>] load_module+0x1dcc/0x2090
[94276.294177] [<ffffffff808eafc4>] SyS_finit_module+0xcc/0xf0
[94276.298449] [<ffffffff8086deec>] syscall_common+0x18/0x3c
[94276.302616] ---[ end trace 3be245c725359407 ]---
[94276.305945] wireguard: Allocation failed at size 67108864
[94276.310101] wireguard: WireGuard 0.0.20190227 loaded. See www.wireguard.com for information.
[94276.317258] wireguard: Copyright (C) 2015-2019 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
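
For anyone reading the log above: for the power-of-two sizes printed, the index column follows a simple pattern. It is the smallest n with 2^n >= size, with a floor of 5. The sketch below reproduces only that pattern. It is inferred from this log alone, not from UBNT's kernel source, and the helper name `log_index` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: reproduces the "index" column of the log above for
# power-of-two sizes. The floor of 5 is inferred from the log, not taken
# from the UBNT kernel source.
log_index() {
    size=$1
    index=0
    # Find the smallest index with 2^index >= size.
    while [ $((1 << index)) -lt "$size" ]; do
        index=$((index + 1))
    done
    # Sizes up to 32 bytes all report index 5 in the log.
    if [ "$index" -lt 5 ]; then
        index=5
    fi
    echo "$index"
}

log_index 32        # prints 5
log_index 64        # prints 6
log_index 67108864  # prints 26, the size whose allocation failed
```

Every size up to 33554432 (index 25) was allocated without a warning in this run; only the 67108864-byte request triggered the WARNING and the "Allocation failed" message.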

@phillipmcmahon

Has there been any progress on this? I am happy to test packages (assuming there is no risk of bricking my ER-6P; it has a serial port, but I am not sure how far I can screw things up), and if someone can point me in the right direction for setting up a compile toolchain, I will gladly assist with that too.

@zx2c4
Collaborator

zx2c4 commented Apr 11, 2019

Quiet, yes, but not forgotten. Lots of unexpected travel precluding my access to the hardware right now. I'd suggest @Lochnair apply the workaround I posted above to his builds until I'm back home and can figure out what UBNT is doing to their kernels.

@phillipmcmahon

Appreciate the response, and it is good to know that things will pick up again at some point. There has been another release of WireGuard in the meantime, v0.0.20190406.

@zx2c4
Collaborator

zx2c4 commented Apr 11, 2019

Indeed. I'm the one who made that release :)

I don't expect it will fix the kmalloc problem, though.

@phillipmcmahon

haha, my bad. I should know whom I am talking with next time :)

@dampfklon

I can confirm 0406 still crashes without the patch

@coreyhines

I am willing to test on an ER-4 running EdgeOS firmware 2.0.1 if the deb packages go back up again.

@Lochnair
Owner

Packages with the patch applied are available from the build server now:

If they work for you, I'll tag a new release with them.

@phillipmcmahon

phillipmcmahon commented Apr 15, 2019

Fingers crossed, installing now on my 6P...

Update: Installed, rebooted and it all came back up and within these first few minutes it looks ok. My WireGuard client connected without issue and traffic is-a-flowing. I will keep hammering it this evening and see if something "bad" happens.

Early to call it, but thanks a lot.

@phillipmcmahon

Several GB have passed through the multiple WG interfaces I have installed on my 6P. All looks pretty solid. No issues noted as of yet.

@aswild
Contributor

aswild commented Apr 15, 2019

Thanks for the build! The 2.0 package seems sane on my ER4 v2.0.1

@coreyhines

coreyhines commented Apr 16, 2019 via email

@dc361

dc361 commented Apr 16, 2019

Corey -- try your peer configuration without the IPv6 default network. I've had a problem with this in the last few versions and have had to use a script to add it after the link is up, using the wg command directly. For some reason, on the ERs, if ::/0 (or 0::/0) is present in the saved config, it doesn't work.
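
A rough sketch of what such a script could run once the link is up. The interface name `wg0`, the key placeholder, the helper name `ipv6_default_cmd`, and the IPv4 default network in the list are all assumptions for illustration, not values from this thread. The command is printed rather than executed so it can be reviewed first.

```shell
#!/bin/sh
# Assumed values: replace with your own interface name and peer public key.
WG_IF="wg0"
PEER_KEY="<peer-public-key>"

# Hypothetical helper: print the wg(8) command that adds the IPv6 default
# network to a peer. "allowed-ips" in "wg set" replaces the peer's whole
# list, so the (assumed) IPv4 default network is repeated alongside ::/0.
ipv6_default_cmd() {
    echo "wg set $1 peer $2 allowed-ips 0.0.0.0/0,::/0"
}

ipv6_default_cmd "$WG_IF" "$PEER_KEY"
```

Adjust the allowed-ips list to match whatever the peer is supposed to carry; anything left out of the list is dropped from the peer when the command runs.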

@coreyhines

coreyhines commented Apr 16, 2019 via email

@phillipmcmahon

Corey, in addition to removing IPv6, also set route-allowed-ips to false.

You might want to try the Ubiquiti forum for further assistance.

@coreyhines

coreyhines commented Apr 16, 2019 via email

@dampfklon

Thanks for the update.
Installed on E50 v1.10; it works without problems.

@coreyhines

coreyhines commented Apr 16, 2019 via email

@acejacek
Contributor

Report: version 406 was installed on an EdgeRouter Lite-3 (e100) a few days ago and has operated fine since.

@jmturner

I've been running it for two days now on my ERL-3 and all looks good.

@phillipmcmahon

Looks like this might be good to formally push to a release.

Thanks again for the work done to get the fix in and the packages out.

@coreyhines

coreyhines commented Apr 19, 2019 via email

@benklop

benklop commented Apr 24, 2019

I have an ER-X and an ER-Lite that are currently just sitting in a box. Would this hardware be helpful for testing, so this sort of thing doesn't occur again? If so, I'm more than happy to either donate them or make them available in some other way.

@coreyhines

coreyhines commented Jun 8, 2019 via email

aswild added a commit to aswild/vyatta-wireguard-build that referenced this issue Jun 9, 2019
Full support for building wireguard for the UBNT e300 (ER-4/ER-6P/ER-12)

Git submodules:
  * WireGuard 0.0.20190601
  * libmnl 1.0.4
  * musl 1.1.22
  * vyatta-wireguard package and scripts v2.0 branch

Git LFS objects:
  * Cavium Octeon gcc 4.7 toolchain (OCTEON-SDK-5.1-tools.tar.xz) from
    OCTEON-SDK-5.1.tbz in https://github.com/Cavium-Open-Source-Distributions/OCTEON-SDK/
    I repackaged only the toolchain into a tar.xz to save space and avoid
    slow bzip2 decompression.
  * Kernel source (e300_kernel_5174690-gbd11043d0ccc.tgz) from the
    EdgeMAX v2.0.1 GPL release https://dl.ubnt.com/firmwares/edgemax/v2.0.x/GPL.ER-e300.v2.0.1.5174690.tar.bz2
    I repackaged just the kernel source tarball to save space and avoid
    slow bzip2 decompression.

Other:
  * only-use-__vmalloc-for-now.patch from https://gist.github.com/Lochnair/805bf9ab96742d0fe1c25e4130268307
    See Lochnair/vyatta-wireguard#97 for
    context and history
@aswild
Contributor

aswild commented Jun 9, 2019

Hi Corey,
It's probably best not to hijack this (already very long) issue thread with unrelated questions about releases and builds, but we're here now so I'll help anyway.

UBNT hasn't released the GPL archive for v2.0.3, but the kernel hasn't changed enough since v2.0.1 to matter; the same WireGuard binaries/packages will work on v2.0.1 and v2.0.3, at least for my e300 ER-4.

The "unknown symbol" error is due to the wireguard module's dependencies on udp_tunnel and ip6_udp_tunnel. The best solution is to use modprobe wireguard instead of insmod /path/to/wireguard.ko (since modprobe handles module dependencies). Alternatively, modprobe udp_tunnel and ip6_udp_tunnel before insmod-ing wireguard.
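
The alternative described above (loading the dependencies first, then insmod) can be sketched as follows. The `run` and `load_wireguard` helpers and the `DRY_RUN` variable are hypothetical names for illustration, and `/tmp/wireguard.ko` is just the path used earlier in this thread.

```shell
#!/bin/sh
# Preferred, when the module is installed where modprobe can find it:
#   modprobe wireguard
# The functions below cover the manual alternative for a locally built .ko.

# Hypothetical helper: print the command when DRY_RUN=1, otherwise run it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Load the two modules wireguard depends on, then insert the module itself.
# Stops at the first command that fails.
load_wireguard() {
    ko_path=$1
    run modprobe udp_tunnel &&
    run modprobe ip6_udp_tunnel &&
    run insmod "$ko_path"
}

# Dry run: show what would be executed without touching the kernel.
DRY_RUN=1
load_wireguard /tmp/wireguard.ko
```

Setting `DRY_RUN=0` and running the script as root would execute the three commands for real.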

For anyone wanting to build their own packages, I finally published and documented my build scripts: https://github.com/aswild/vyatta-wireguard-build. Only e300 v2.0.x is supported right now since that's what I use, but it should be straightforward to add other platforms.

@coreyhines

coreyhines commented Jun 9, 2019 via email

@zx2c4
Collaborator

zx2c4 commented Jun 9, 2019

Please don't hijack this. I'm going to look into this eventually, but there's already way too much noise to keep straight in documenting what's going on.

Repository owner locked as spam and limited conversation to collaborators Jun 9, 2019