
Kernel Panic or Lockup on DS3615xs image #21

Open · WiteWulf opened this issue Sep 17, 2021 · 74 comments
Labels: awaiting-feedback, bug

Comments

@WiteWulf

WiteWulf commented Sep 17, 2021

A number of users on the forum are reporting kernel panics on baremetal installs, or lockups of guests on virtualisation platforms, when running the DS3615xs image specifically. This is typically triggered by running docker in general, or certain docker images, but in some circumstances may also be caused by high IO load. A common feature is the use of databases (notably influxdb, mariadb, mysql and elasticsearch), but also nginx and jdownloader2.

This has been observed on baremetal HP Gen7 and Gen8 servers, and under Proxmox and ESXi, with a variety of Xeon CPUs (E3-1265L V2, E3-1270 V2, E3-1241 V3, E3-1220L V2 and E3-1265L V4), as well as Celeron and AMD CPUs.

Most users are on DSM 7.0.1-RC1, but I also observed this behaviour on DSM 6.2.4.

(edit: also confirmed to affect the 7.0 beta and 7.0.1, i.e. not just the release candidate)

Conversely, a number of users with DS918+ images have reported no issues with running docker or known problematic images (in my case influxdb causes a 100% reproducible crash).
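For anyone trying to reproduce this, a minimal sketch of what triggers it for me (the image tag and container name are illustrative, not taken from a specific report):

# starting a stock influxdb container is enough to trigger the lockup/panic on my DS3615xs install
docker run -d --name influxdb-test -p 8086:8086 influxdb:1.8
# then watch the serial console or /var/log/messages for the watchdog firing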

On my baremetal HP Gen8 running 6.2.4 I get the following console output before a reboot:

[  191.452302] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 3

[  191.487637] CPU: 3 PID: 19775 Comm: containerd-shim Tainted: PF          O 3.10.105 #25556

[  191.528112] Hardware name: HP ProLiant MicroServer Gen8, BIOS J06 04/04/2019

[  191.562597]  ffffffff814c904d ffffffff814c8121 0000000000000010 ffff880109ac8d58

[  191.599118]  ffff880109ac8cf0 0000000000000000 0000000000000003 000000000000002c

[  191.634943]  0000000000000003 ffffffff80000001 0000000000000010 ffff880103817c00

[  191.670604] Call Trace:

[  191.682506]  <NMI>  [<ffffffff814c904d>] ? dump_stack+0xc/0x15

[  191.710494]  [<ffffffff814c8121>] ? panic+0xbb/0x1ce

[  191.735108]  [<ffffffff810a0922>] ? watchdog_overflow_callback+0xb2/0xc0

[  191.768203]  [<ffffffff810b152b>] ? __perf_event_overflow+0x8b/0x240

[  191.799789]  [<ffffffff810b02d4>] ? perf_event_update_userpage+0x14/0xf0

[  191.834349]  [<ffffffff81015411>] ? intel_pmu_handle_irq+0x1d1/0x360

[  191.865505]  [<ffffffff81010026>] ? perf_event_nmi_handler+0x26/0x40

[  191.897683]  [<ffffffff81005fa8>] ? do_nmi+0xf8/0x3e0

[  191.922372]  [<ffffffff814cfa53>] ? end_repeat_nmi+0x1e/0x7e

[  191.950899]  <<EOE>> 

[  191.961095] Rebooting in 3 seconds..

This is seen by others on baremetal when using docker. Virtualisation platform users see 100% CPU usage on their xpenology guest and it becomes unresponsive, requiring a restart of the guest. The majority of kernel panics cite containerd-shim as being at fault, but sometimes (rarely) it will list a process being run inside a docker container (notably influxdb in my case).

This is notably similar to an issue logged with RHEL a number of years ago that they note was fixed in a subsequent kernel release:
https://access.redhat.com/solutions/1354963

@WiteWulf WiteWulf changed the title Docker Causing Kernel Panic or Lockup on DS3615XS image Docker Causing Kernel Panic or Lockup on DS3615xs image Sep 17, 2021
@OrpheeGT

Like @WiteWulf, I have the same issue on an ESXi VM (HP Gen8).

It happens randomly when I run just a simple nginx container:

[ 923.411516] device dockerdbd5db9 entered promiscuous mode
[ 923.413414] IPv6: ADDRCONF(NETDEV_UP): dockerdbd5db9: link is not ready
[ 923.533359] <redpill/smart_shim.c:794> Handling ioctl(0x31f) for /dev/sda
[ 923.534667] <redpill/smart_shim.c:624> Got SMART command - looking for feature=0xd0
[ 923.536000] <redpill/smart_shim.c:376> Generating fake SMART values
[ 924.059100] IPv6: ADDRCONF(NETDEV_CHANGE): dockerdbd5db9: link becomes ready
[ 924.060327] docker0: port 1(dockerdbd5db9) entered forwarding state
[ 924.061379] docker0: port 1(dockerdbd5db9) entered forwarding state
[ 924.062448] IPv6: ADDRCONF(NETDEV_CHANGE): docker0: link becomes ready
[ 939.050818] docker0: port 1(dockerdbd5db9) entered forwarding state
[ 975.910411] <redpill/smart_shim.c:794> Handling ioctl(0x31f) for /dev/sda
[ 975.911848] <redpill/smart_shim.c:624> Got SMART command - looking for feature=0xd0
[ 975.913185] <redpill/smart_shim.c:376> Generating fake SMART values
[ 1035.865304] <redpill/smart_shim.c:794> Handling ioctl(0x31f) for /dev/sda
[ 1035.866688] <redpill/smart_shim.c:624> Got SMART command - looking for feature=0xd0
[ 1035.867966] <redpill/smart_shim.c:376> Generating fake SMART values
[ 1095.820121] <redpill/smart_shim.c:794> Handling ioctl(0x31f) for /dev/sda
[ 1095.821465] <redpill/smart_shim.c:624> Got SMART command - looking for feature=0xd0
[ 1095.822764] <redpill/smart_shim.c:376> Generating fake SMART values
[ 1167.089552] BUG: soft lockup - CPU#1 stuck for 41s! [runc:19580]
[ 1167.090581] Modules linked in: nfnetlink xfrm_user xfrm_algo fuse bridge stp aufs macvlan veth xt_conntrack xt_addrtype nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables ipt_MASQUERADE xt_REDIRECT xt_nat iptable_nat nf_nat_ipv4
nf_nat xt_recent xt_iprange xt_limit xt_state xt_tcpudp xt_multiport xt_LOG nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack iptable_filter ip_tables x_tables 8021q vhost_scsi(O) vhost(O) tcm_loop(O) iscsi_target_mod(O) target_core_ep(O) target_core_multi_file(O) target_core_file(O) target_core_iblock(O) target_core_mod(O) syno_extent_pool(PO) rodsp_ep(O) cdc_acm ftdi_sio ch341(OF) cp210x(OF) usbserial udf isofs loop synoacl_vfs(PO) btrfs zstd_decompress ecryptfs zstd_compress xxhash xor raid6_pq aesni_intel glue_helper lrw gf128mul ablk_helper zram(C) bromolow_synobios(PO) hid_generic usbhid hid usblp bnx2x(O) mdio mlx5_core(O) mlx4_en(O) mlx4_core(O) mlx_compat(O) qede(O) qed(O) atlantic_v2(O) atlantic(O) tn40xx(O) i40e(O) ixgbe(O) be2net(O) i2c_algo_bit igb(O) dca e1000e(O) sg dm_snapshot crc_itu_t crc_ccitt psnap p8022 llc zlib_deflate libcrc32c hfsplus md4 hmac sit tunnel4 ipv6 flashcache_syno(O) flashcache(O) syno_flashcache_control(O) dm_mod crc32c_intel cryptd arc4 sha256_generic sha1_generic ecb aes_x86_64 authenc des_generic ansi_cprng cts md5 cbc cpufreq_powersave cpufreq_performance mperf processor thermal_sys cpufreq_stats freq_table vxlan ip_tunnel vmxnet3(F) etxhci_hcd mpt2sas(O) usb_storage xhci_hcd uhci_hcd ehci_pci ehci_hcd usbcore usb_common redpill(OF) [last unloaded: bromolow_synobios]
[ 1167.115855] CPU: 1 PID: 19580 Comm: runc Tainted: PF C O 3.10.108 #42214
[ 1167.117041] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 11/12/2020
[ 1167.118750] task: ffff8800b656c820 ti: ffff880099568000 task.ti: ffff880099568000
[ 1167.119968] RIP: 0010:[] [] generic_exec_single+0x76/0xe0
[ 1167.121380] RSP: 0000:ffff88009956bc20 EFLAGS: 00000202
[ 1167.122253] RAX: 00000000000008fb RBX: 00000037ffffffc8 RCX: 000000000000008a
[ 1167.123402] RDX: 0000000000000008 RSI: 00000000000000fb RDI: ffffffff81606628
[ 1167.124552] RBP: ffff88009956bc60 R08: ffff880134888758 R09: 0000000000000020
[ 1167.125708] R10: 0000000000004041 R11: 0000000000000000 R12: 0000004000000001
[ 1167.126858] R13: 0000000000000000 R14: 0000005000000041 R15: ffffffff81894a98
[ 1167.128009] FS: 00007f17f0044740(0000) GS:ffff88013dd00000(0000) knlGS:0000000000000000
[ 1167.129322] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1167.130253] CR2: 000000c00024ef68 CR3: 00000000aad04000 CR4: 00000000001607e0
[ 1167.131420] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1167.132574] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 1167.133726] Stack:
[ 1167.134067] 0000000000000001 ffff88009956bcb0 0000000000000000 ffffffff818a90d0
[ 1167.135366] ffffffff8102fcc0 ffffffff8109007e 0000000000000000 ffffffff818a90d0
[ 1167.136656] ffff88013dc13980 ffff88013dc13980 ffffffff8102fcc0 ffff88009956bcc0
[ 1167.137952] Call Trace:
[ 1167.138364] [] ? do_flush_tlb_all+0x160/0x160
[ 1167.139339] [] ? smp_call_function_single+0x12e/0x150
[ 1167.140431] [] ? do_flush_tlb_all+0x160/0x160
[ 1167.141413] [] ? flush_tlb_page+0x72/0x130
[ 1167.142352] [] ? ptep_clear_flush+0x22/0x30
[ 1167.143304] [] ? do_wp_page+0x2ad/0x8c0
[ 1167.144194] [] ? handle_pte_fault+0x38d/0x9e0
[ 1167.145166] [] ? handle_mm_fault+0x135/0x2e0
[ 1167.146123] [] ? __do_page_fault+0x14a/0x500
[ 1167.147094] [] ? vfs_read+0x140/0x170
[ 1167.147956] [] ? SyS_read+0x84/0xb0
[ 1167.148793] [] ? page_fault+0x22/0x30
[ 1167.149660] Code: 89 55 08 48 89 2a e8 8a 78 41 00 4c 39 f3 75 0f 44 89 e7 48 8b 05 fb f1 78 00 e8 96 4d 20 00 f6 45 20 01 74 08 f3 90 f6 45 20 01 <75> f8 5b 5d 41 5c 41 5d 41 5e c3 0f 1f 80 00 00 00 00 4c 8d 6b
[ 1215.050796] BUG: soft lockup - CPU#1 stuck for 41s! [runc:19580]
[ 1215.051822] Modules linked in: nfnetlink xfrm_user xfrm_algo fuse bridge stp aufs macvlan veth xt_conntrack xt_addrtype nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables ipt_MASQUERADE xt_REDIRECT xt_nat iptable_nat nf_nat_ipv4
nf_nat xt_recent xt_iprange xt_limit xt_state xt_tcpudp xt_multiport xt_LOG nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack iptable_filter ip_tables x_tables 8021q vhost_scsi(O) vhost(O) tcm_loop(O) iscsi_target_mod(O) target_core_ep(O) target_core_multi_file(O) target_core_file(O) target_core_iblock(O) target_core_mod(O) syno_extent_pool(PO) rodsp_ep(O) cdc_acm ftdi_sio ch341(OF) cp210x(OF) usbserial udf isofs loop synoacl_vfs(PO) btrfs zstd_decompress ecryptfs zstd_compress xxhash xor raid6_pq aesni_intel glue_helper lrw gf128mul ablk_helper zram(C) bromolow_synobios(PO) hid_generic usbhid hid usblp bnx2x(O) mdio mlx5_core(O) mlx4_en(O) mlx4_core(O) mlx_compat(O) qede(O) qed(O) atlantic_v2(O) atlantic(O) tn40xx(O) i40e(O) ixgbe(O) be2net(O) i2c_algo_bit igb(O) dca e1000e(O) sg dm_snapshot crc_itu_t crc_ccitt psnap p8022 llc zlib_deflate libcrc32c hfsplus md4 hmac sit tunnel4 ipv6 flashcache_syno(O) flashcache(O) syno_flashcache_control(O) dm_mod crc32c_intel cryptd arc4 sha256_generic sha1_generic ecb aes_x86_64 authenc des_generic ansi_cprng cts md5 cbc cpufreq_powersave cpufreq_performance mperf processor thermal_sys cpufreq_stats freq_table vxlan ip_tunnel vmxnet3(F) etxhci_hcd mpt2sas(O) usb_storage xhci_hcd uhci_hcd ehci_pci ehci_hcd usbcore usb_common redpill(OF) [last unloaded: bromolow_synobios]
[ 1215.076956] CPU: 1 PID: 19580 Comm: runc Tainted: PF C O 3.10.108 #42214
[ 1215.078142] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 11/12/2020
[ 1215.079854] task: ffff8800b656c820 ti: ffff880099568000 task.ti: ffff880099568000
[ 1215.081064] RIP: 0010:[] [] generic_exec_single+0x76/0xe0
[ 1215.082475] RSP: 0000:ffff88009956bc20 EFLAGS: 00000202
[ 1215.083338] RAX: 00000000000008fb RBX: 00000037ffffffc8 RCX: 000000000000008a
[ 1215.084487] RDX: 0000000000000008 RSI: 00000000000000fb RDI: ffffffff81606628
[ 1215.085646] RBP: ffff88009956bc60 R08: ffff880134888758 R09: 0000000000000020
[ 1215.086797] R10: 0000000000004041 R11: 0000000000000000 R12: 0000004000000001
[ 1215.087947] R13: 0000000000000000 R14: 0000005000000041 R15: ffffffff81894a98
[ 1215.089099] FS: 00007f17f0044740(0000) GS:ffff88013dd00000(0000) knlGS:0000000000000000
[ 1215.090400] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1215.091331] CR2: 000000c00024ef68 CR3: 00000000aad04000 CR4: 00000000001607e0
[ 1215.092492] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1215.093658] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 1215.094824] Stack:
[ 1215.095165] 0000000000000001 ffff88009956bcb0 0000000000000000 ffffffff818a90d0
[ 1215.096461] ffffffff8102fcc0 ffffffff8109007e 0000000000000000 ffffffff818a90d0
[ 1215.097748] ffff88013dc13980 ffff88013dc13980 ffffffff8102fcc0 ffff88009956bcc0
[ 1215.099035] Call Trace:
[ 1215.099449] [] ? do_flush_tlb_all+0x160/0x160
[ 1215.100423] [] ? smp_call_function_single+0x12e/0x150
[ 1215.101505] [] ? do_flush_tlb_all+0x160/0x160
[ 1215.102478] [] ? flush_tlb_page+0x72/0x130
[ 1215.103412] [] ? ptep_clear_flush+0x22/0x30
[ 1215.104354] [] ? do_wp_page+0x2ad/0x8c0
[ 1215.105244] [] ? handle_pte_fault+0x38d/0x9e0
[ 1215.106221] [] ? handle_mm_fault+0x135/0x2e0
[ 1215.107180] [] ? __do_page_fault+0x14a/0x500
[ 1215.108140] [] ? vfs_read+0x140/0x170
[ 1215.109001] [] ? SyS_read+0x84/0xb0
[ 1215.109836] [] ? page_fault+0x22/0x30
[ 1215.110699] Code: 89 55 08 48 89 2a e8 8a 78 41 00 4c 39 f3 75 0f 44 89 e7 48 8b 05 fb f1 78 00 e8 96 4d 20 00 f6 45 20 01 74 08 f3 90 f6 45 20 01 <75> f8 5b 5d 41 5c 41 5d 41 5e c3 0f 1f 80 00 00 00 00 4c 8d 6b
[ 1263.012041] BUG: soft lockup - CPU#1 stuck for 41s! [runc:19580]
[ 1263.013068] Modules linked in: nfnetlink xfrm_user xfrm_algo fuse bridge stp aufs macvlan veth xt_conntrack xt_addrtype nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables ipt_MASQUERADE xt_REDIRECT xt_nat iptable_nat nf_nat_ipv4
nf_nat xt_recent xt_iprange xt_limit xt_state xt_tcpudp xt_multiport xt_LOG nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack iptable_filter ip_tables x_tables 8021q vhost_scsi(O) vhost(O) tcm_loop(O) iscsi_target_mod(O) target_core_ep(O) target_core_multi_file(O) target_core_file(O) target_core_iblock(O) target_core_mod(O) syno_extent_pool(PO) rodsp_ep(O) cdc_acm ftdi_sio ch341(OF) cp210x(OF) usbserial udf isofs loop synoacl_vfs(PO) btrfs zstd_decompress ecryptfs zstd_compress xxhash xor raid6_pq aesni_intel glue_helper lrw gf128mul ablk_helper zram(C) bromolow_synobios(PO) hid_generic usbhid hid usblp bnx2x(O) mdio mlx5_core(O) mlx4_en(O) mlx4_core(O) mlx_compat(O) qede(O) qed(O) atlantic_v2(O) atlantic(O) tn40xx(O) i40e(O) ixgbe(O) be2net(O) i2c_algo_bit igb(O) dca e1000e(O) sg dm_snapshot crc_itu_t crc_ccitt psnap p8022 llc zlib_deflate libcrc32c hfsplus md4 hmac sit tunnel4 ipv6 flashcache_syno(O) flashcache(O) syno_flashcache_control(O) dm_mod crc32c_intel cryptd arc4 sha256_generic sha1_generic ecb aes_x86_64 authenc des_generic ansi_cprng cts md5 cbc cpufreq_powersave cpufreq_performance mperf processor thermal_sys cpufreq_stats freq_table vxlan ip_tunnel vmxnet3(F) etxhci_hcd mpt2sas(O) usb_storage xhci_hcd uhci_hcd ehci_pci ehci_hcd usbcore usb_common redpill(OF) [last unloaded: bromolow_synobios]
[ 1263.038226] CPU: 1 PID: 19580 Comm: runc Tainted: PF C O 3.10.108 #42214
[ 1263.039404] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 11/12/2020
[ 1263.041115] task: ffff8800b656c820 ti: ffff880099568000 task.ti: ffff880099568000
[ 1263.042323] RIP: 0010:[] [] generic_exec_single+0x72/0xe0
[ 1263.043737] RSP: 0000:ffff88009956bc20 EFLAGS: 00000202
[ 1263.044601] RAX: 00000000000008fb RBX: 00000037ffffffc8 RCX: 000000000000008a
[ 1263.045752] RDX: 0000000000000008 RSI: 00000000000000fb RDI: ffffffff81606628
[ 1263.046902] RBP: ffff88009956bc60 R08: ffff880134888758 R09: 0000000000000020
[ 1263.048050] R10: 0000000000004041 R11: 0000000000000000 R12: 0000004000000001
[ 1263.049200] R13: 0000000000000000 R14: 0000005000000041 R15: ffffffff81894a98
[ 1263.050353] FS: 00007f17f0044740(0000) GS:ffff88013dd00000(0000) knlGS:0000000000000000
[ 1263.051657] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1263.052590] CR2: 000000c00024ef68 CR3: 00000000aad04000 CR4: 00000000001607e0
[ 1263.053755] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1263.054916] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 1263.056066] Stack:
[ 1263.056408] 0000000000000001 ffff88009956bcb0 0000000000000000 ffffffff818a90d0
[ 1263.057707] ffffffff8102fcc0 ffffffff8109007e 0000000000000000 ffffffff818a90d0
[ 1263.058996] ffff88013dc13980 ffff88013dc13980 ffffffff8102fcc0 ffff88009956bcc0
[ 1263.060286] Call Trace:
[ 1263.060700] [] ? do_flush_tlb_all+0x160/0x160
[ 1263.061676] [] ? smp_call_function_single+0x12e/0x150
[ 1263.062757] [] ? do_flush_tlb_all+0x160/0x160
[ 1263.063729] [] ? flush_tlb_page+0x72/0x130
[ 1263.064661] [] ? ptep_clear_flush+0x22/0x30
[ 1263.065607] [] ? do_wp_page+0x2ad/0x8c0
[ 1263.066499] [] ? handle_pte_fault+0x38d/0x9e0
[ 1263.067473] [] ? handle_mm_fault+0x135/0x2e0
[ 1263.068432] [] ? __do_page_fault+0x14a/0x500
[ 1263.069392] [] ? vfs_read+0x140/0x170
[ 1263.070257] [] ? SyS_read+0x84/0xb0
[ 1263.071093] [] ? page_fault+0x22/0x30
[ 1263.071959] Code: 89 5d 00 48 89 55 08 48 89 2a e8 8a 78 41 00 4c 39 f3 75 0f 44 89 e7 48 8b 05 fb f1 78 00 e8 96 4d 20 00 f6 45 20 01 74 08 f3 90 45 20 01 75 f8 5b 5d 41 5c 41 5d 41 5e c3 0f 1f 80 00 00 00
[ 1310.973284] BUG: soft lockup - CPU#1 stuck for 41s! [runc:19580]
[ 1310.974511] Modules linked in: nfnetlink xfrm_user xfrm_algo fuse bridge stp aufs macvlan veth xt_conntrack xt_addrtype nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables ipt_MASQUERADE xt_REDIRECT xt_nat iptable_nat nf_nat_ipv4
nf_nat xt_recent xt_iprange xt_limit xt_state xt_tcpudp xt_multiport xt_LOG nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack iptable_filter ip_tables x_tables 8021q vhost_scsi(O) vhost(O) tcm_loop(O) iscsi_target_mod(O) target_core_ep(O) target_core_multi_file(O) target_core_file(O) target_core_iblock(O) target_core_mod(O) syno_extent_pool(PO) rodsp_ep(O) cdc_acm ftdi_sio ch341(OF) cp210x(OF) usbserial udf isofs loop synoacl_vfs(PO) btrfs zstd_decompress ecryptfs zstd_compress xxhash xor raid6_pq aesni_intel glue_helper lrw gf128mul ablk_helper zram(C) bromolow_synobios(PO) hid_generic usbhid hid usblp bnx2x(O) mdio mlx5_core(O) mlx4_en(O) mlx4_core(O) mlx_compat(O) qede(O) qed(O) atlantic_v2(O) atlantic(O) tn40xx(O) i40e(O) ixgbe(O) be2net(O) i2c_algo_bit igb(O) dca e1000e(O) sg dm_snapshot crc_itu_t crc_ccitt psnap p8022 llc zlib_deflate libcrc32c hfsplus md4 hmac sit tunnel4 ipv6 flashcache_syno(O) flashcache(O) syno_flashcache_control(O) dm_mod crc32c_intel cryptd arc4 sha256_generic sha1_generic ecb aes_x86_64 authenc des_generic ansi_cprng cts md5 cbc cpufreq_powersave cpufreq_performance mperf processor thermal_sys cpufreq_stats freq_table vxlan ip_tunnel vmxnet3(F) etxhci_hcd mpt2sas(O) usb_storage xhci_hcd uhci_hcd ehci_pci ehci_hcd usbcore usb_common redpill(OF) [last unloaded: bromolow_synobios]
[ 1310.999665] CPU: 1 PID: 19580 Comm: runc Tainted: PF C O 3.10.108 #42214
[ 1311.000847] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 11/12/2020
[ 1311.002566] task: ffff8800b656c820 ti: ffff880099568000 task.ti: ffff880099568000
[ 1311.003774] RIP: 0010:[] [] generic_exec_single+0x70/0xe0
[ 1311.005185] RSP: 0000:ffff88009956bc20 EFLAGS: 00000202
[ 1311.006051] RAX: 00000000000008fb RBX: 00000037ffffffc8 RCX: 000000000000008a
[ 1311.007199] RDX: 0000000000000008 RSI: 00000000000000fb RDI: ffffffff81606628
[ 1311.008351] RBP: ffff88009956bc60 R08: ffff880134888758 R09: 0000000000000020
[ 1311.009500] R10: 0000000000004041 R11: 0000000000000000 R12: 0000004000000001
[ 1311.010651] R13: 0000000000000000 R14: 0000005000000041 R15: ffffffff81894a98
[ 1311.011800] FS: 00007f17f0044740(0000) GS:ffff88013dd00000(0000) knlGS:0000000000000000
[ 1311.013101] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1311.014034] CR2: 000000c00024ef68 CR3: 00000000aad04000 CR4: 00000000001607e0
[ 1311.015186] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1311.016339] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 1311.017487] Stack:
[ 1311.017827] 0000000000000001 ffff88009956bcb0 0000000000000000 ffffffff818a90d0
[ 1311.019131] ffffffff8102fcc0 ffffffff8109007e 0000000000000000 ffffffff818a90d0
[ 1311.020420] ffff88013dc13980 ffff88013dc13980 ffffffff8102fcc0 ffff88009956bcc0
[ 1311.021711] Call Trace:
[ 1311.022125] [] ? do_flush_tlb_all+0x160/0x160
[ 1311.023100] [] ? smp_call_function_single+0x12e/0x150
[ 1311.024179] [] ? do_flush_tlb_all+0x160/0x160
[ 1311.025149] [] ? flush_tlb_page+0x72/0x130
[ 1311.026081] [] ? ptep_clear_flush+0x22/0x30
[ 1311.027023] [] ? do_wp_page+0x2ad/0x8c0
[ 1311.027912] [] ? handle_pte_fault+0x38d/0x9e0
[ 1311.028885] [] ? handle_mm_fault+0x135/0x2e0
[ 1311.029842] [] ? __do_page_fault+0x14a/0x500
[ 1311.030800] [] ? vfs_read+0x140/0x170
[ 1311.031665] [] ? SyS_read+0x84/0xb0
[ 1311.032502] [] ? page_fault+0x22/0x30
[ 1311.033367] Code: ef 48 89 5d 00 48 89 55 08 48 89 2a e8 8a 78 41 00 4c 39 f3 75 0f 44 89 e7 48 8b 05 fb f1 78 00 e8 96 4d 20 00 f6 45 20 01 74 08 90 f6 45 20 01 75 f8 5b 5d 41 5c 41 5d 41 5e c3 0f 1f 80 00
[ 1358.934527] BUG: soft lockup - CPU#1 stuck for 41s! [runc:19580]
[ 1358.935547] Modules linked in: nfnetlink xfrm_user xfrm_algo fuse bridge stp aufs macvlan veth xt_conntrack xt_addrtype nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables ipt_MASQUERADE xt_REDIRECT xt_nat iptable_nat nf_nat_ipv4
nf_nat xt_recent xt_iprange xt_limit xt_state xt_tcpudp xt_multiport xt_LOG nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack iptable_filter ip_tables x_tables 8021q vhost_scsi(O) vhost(O) tcm_loop(O) iscsi_target_mod(O) target_core_ep(O) target_core_multi_file(O) target_core_file(O) target_core_iblock(O) target_core_mod(O) syno_extent_pool(PO) rodsp_ep(O) cdc_acm ftdi_sio ch341(OF) cp210x(OF) usbserial udf isofs loop synoacl_vfs(PO) btrfs zstd_decompress ecryptfs zstd_compress xxhash xor raid6_pq aesni_intel glue_helper lrw gf128mul ablk_helper zram(C) bromolow_synobios(PO) hid_generic usbhid hid usblp bnx2x(O) mdio mlx5_core(O) mlx4_en(O) mlx4_core(O) mlx_compat(O) qede(O) qed(O) atlantic_v2(O) atlantic(O) tn40xx(O) i40e(O) ixgbe(O) be2net(O) i2c_algo_bit igb(O) dca e1000e(O) sg dm_snapshot crc_itu_t crc_ccitt psnap p8022 llc zlib_deflate libcrc32c hfsplus md4 hmac sit tunnel4 ipv6 flashcache_syno(O) flashcache(O) syno_flashcache_control(O) dm_mod crc32c_intel cryptd arc4 sha256_generic sha1_generic ecb aes_x86_64 authenc des_generic ansi_cprng cts md5 cbc cpufreq_powersave cpufreq_performance mperf processor thermal_sys cpufreq_stats freq_table vxlan ip_tunnel vmxnet3(F) etxhci_hcd mpt2sas(O) usb_storage xhci_hcd uhci_hcd ehci_pci ehci_hcd usbcore usb_common redpill(OF) [last unloaded: bromolow_synobios]
[ 1358.960892] CPU: 1 PID: 19580 Comm: runc Tainted: PF C O 3.10.108 #42214
[ 1358.962069] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 11/12/2020
[ 1358.963782] task: ffff8800b656c820 ti: ffff880099568000 task.ti: ffff880099568000
[ 1358.964993] RIP: 0010:[] [] generic_exec_single+0x76/0xe0
[ 1358.966409] RSP: 0000:ffff88009956bc20 EFLAGS: 00000202
[ 1358.967273] RAX: 00000000000008fb RBX: 00000037ffffffc8 RCX: 000000000000008a
[ 1358.968425] RDX: 0000000000000008 RSI: 00000000000000fb RDI: ffffffff81606628
[ 1358.969578] RBP: ffff88009956bc60 R08: ffff880134888758 R09: 0000000000000020
[ 1358.970728] R10: 0000000000004041 R11: 0000000000000000 R12: 0000004000000001
[ 1358.971880] R13: 0000000000000000 R14: 0000005000000041 R15: ffffffff81894a98
[ 1358.973036] FS: 00007f17f0044740(0000) GS:ffff88013dd00000(0000) knlGS:0000000000000000
[ 1358.974341] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1358.975275] CR2: 000000c00024ef68 CR3: 00000000aad04000 CR4: 00000000001607e0
[ 1358.976431] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1358.977581] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 1358.978736] Stack:
[ 1358.979078] 0000000000000001 ffff88009956bcb0 0000000000000000 ffffffff818a90d0
[ 1358.980383] ffffffff8102fcc0 ffffffff8109007e 0000000000000000 ffffffff818a90d0
[ 1358.981674] ffff88013dc13980 ffff88013dc13980 ffffffff8102fcc0 ffff88009956bcc0
[ 1358.982966] Call Trace:
[ 1358.983382] [] ? do_flush_tlb_all+0x160/0x160
[ 1358.984360] [] ? smp_call_function_single+0x12e/0x150
[ 1358.985444] [] ? do_flush_tlb_all+0x160/0x160
[ 1358.986419] [] ? flush_tlb_page+0x72/0x130
[ 1358.987352] [] ? ptep_clear_flush+0x22/0x30
[ 1358.988299] [] ? do_wp_page+0x2ad/0x8c0
[ 1358.989190] [] ? handle_pte_fault+0x38d/0x9e0
[ 1358.990180] [] ? handle_mm_fault+0x135/0x2e0
[ 1358.991141] [] ? __do_page_fault+0x14a/0x500
[ 1358.992101] [] ? vfs_read+0x140/0x170
[ 1358.992966] [] ? SyS_read+0x84/0xb0
[ 1358.993803] [] ? page_fault+0x22/0x30
[ 1358.994669] Code: 89 55 08 48 89 2a e8 8a 78 41 00 4c 39 f3 75 0f 44 89 e7 48 8b 05 fb f1 78 00 e8 96 4d 20 00 f6 45 20 01 74 08 f3 90 f6 45 20 01 <75> f8 5b 5d 41 5c 41 5d 41 5e c3 0f 1f 80 00 00 00 00 4c 8d 6b
[ 1406.895770] BUG: soft lockup - CPU#1 stuck for 41s! [runc:19580]
[ 1406.896809] Modules linked in: nfnetlink xfrm_user xfrm_algo fuse bridge stp aufs macvlan veth xt_conntrack xt_addrtype nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables ipt_MASQUERADE xt_REDIRECT xt_nat iptable_nat nf_nat_ipv4
nf_nat xt_recent xt_iprange xt_limit xt_state xt_tcpudp xt_multiport xt_LOG nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack iptable_filter ip_tables x_tables 8021q vhost_scsi(O) vhost(O) tcm_loop(O) iscsi_target_mod(O) target_core_ep(O) target_core_multi_file(O) target_core_file(O) target_core_iblock(O) target_core_mod(O) syno_extent_pool(PO) rodsp_ep(O) cdc_acm ftdi_sio ch341(OF) cp210x(OF) usbserial udf isofs loop synoacl_vfs(PO) btrfs zstd_decompress ecryptfs zstd_compress xxhash xor raid6_pq aesni_intel glue_helper lrw gf128mul ablk_helper zram(C) bromolow_synobios(PO) hid_generic usbhid hid usblp bnx2x(O) mdio mlx5_core(O) mlx4_en(O) mlx4_core(O) mlx_compat(O) qede(O) qed(O) atlantic_v2(O) atlantic(O) tn40xx(O) i40e(O) ixgbe(O) be2net(O) i2c_algo_bit igb(O) dca e1000e(O) sg dm_snapshot crc_itu_t crc_ccitt psnap p8022 llc zlib_deflate libcrc32c hfsplus md4 hmac sit tunnel4 ipv6 flashcache_syno(O) flashcache(O) syno_flashcache_control(O) dm_mod crc32c_intel cryptd arc4 sha256_generic sha1_generic ecb aes_x86_64 authenc des_generic ansi_cprng cts md5 cbc cpufreq_powersave cpufreq_performance mperf processor thermal_sys cpufreq_stats freq_table vxlan ip_tunnel vmxnet3(F) etxhci_hcd mpt2sas(O) usb_storage xhci_hcd uhci_hcd ehci_pci ehci_hcd usbcore usb_common redpill(OF) [last unloaded: bromolow_synobios]
[ 1406.922153] CPU: 1 PID: 19580 Comm: runc Tainted: PF C O 3.10.108 #42214
[ 1406.923331] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 11/12/2020
[ 1406.925042] task: ffff8800b656c820 ti: ffff880099568000 task.ti: ffff880099568000
[ 1406.926248] RIP: 0010:[] [] generic_exec_single+0x76/0xe0
[ 1406.927665] RSP: 0000:ffff88009956bc20 EFLAGS: 00000202
[ 1406.928529] RAX: 00000000000008fb RBX: 00000037ffffffc8 RCX: 000000000000008a
[ 1406.929681] RDX: 0000000000000008 RSI: 00000000000000fb RDI: ffffffff81606628
[ 1406.930836] RBP: ffff88009956bc60 R08: ffff880134888758 R09: 0000000000000020
[ 1406.931992] R10: 0000000000004041 R11: 0000000000000000 R12: 0000004000000001
[ 1406.933151] R13: 0000000000000000 R14: 0000005000000041 R15: ffffffff81894a98
[ 1406.934309] FS: 00007f17f0044740(0000) GS:ffff88013dd00000(0000) knlGS:0000000000000000
[ 1406.935625] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1406.936560] CR2: 000000c00024ef68 CR3: 00000000aad04000 CR4: 00000000001607e0
[ 1406.937715] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1406.938873] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 1406.940031] Stack:
[ 1406.940373] 0000000000000001 ffff88009956bcb0 0000000000000000 ffffffff818a90d0
[ 1406.941681] ffffffff8102fcc0 ffffffff8109007e 0000000000000000 ffffffff818a90d0
[ 1406.942975] ffff88013dc13980 ffff88013dc13980 ffffffff8102fcc0 ffff88009956bcc0
[ 1406.944267] Call Trace:
[ 1406.944682] [] ? do_flush_tlb_all+0x160/0x160
[ 1406.945658] [] ? smp_call_function_single+0x12e/0x150
[ 1406.946744] [] ? do_flush_tlb_all+0x160/0x160
[ 1406.947719] [] ? flush_tlb_page+0x72/0x130
[ 1406.948654] [] ? ptep_clear_flush+0x22/0x30
[ 1406.949603] [] ? do_wp_page+0x2ad/0x8c0
[ 1406.950494] [] ? handle_pte_fault+0x38d/0x9e0
[ 1406.951471] [] ? handle_mm_fault+0x135/0x2e0
[ 1406.952433] [] ? __do_page_fault+0x14a/0x500
[ 1406.953394] [] ? vfs_read+0x140/0x170
[ 1406.954260] [] ? SyS_read+0x84/0xb0
[ 1406.955098] [] ? page_fault+0x22/0x30
[ 1406.955966] Code: 89 55 08 48 89 2a e8 8a 78 41 00 4c 39 f3 75 0f 44 89 e7 48 8b 05 fb f1 78 00 e8 96 4d 20 00 f6 45 20 01 74 08 f3 90 f6 45 20 01 <75> f8 5b 5d 41 5c 41 5d 41 5e c3 0f 1f 80 00 00 00 00 4c 8d 6b

The CPU jumps to 100%, DSM becomes unresponsive, and I must reset the VM.


@WiteWulf

There have been a couple of further reports from users with Celeron CPUs (J1900), so this is not limited to Xeon CPUs. All are running the DS3615xs image.

@OrpheeGT

OrpheeGT commented Sep 21, 2021

Hello,

I tried with DSM 6.2.4; DSM hangs as soon as docker starts, and I must reset the VM.

BUG: soft lockup - CPU#1 stuck for 41s! [fileindexd:12641]
Modules linked in: hid_generic cifs udf isofs loop tcm_loop(O) iscsi_target_mod(O) target_core_ep(O) target_core_multi_file(O) target_core_file(O) target_core_iblock(O) target_core_mod(O) syno_extent_pool(PO) rodsp_ep(O) usbhid hid usblp bromolow_synobios(PO) exfat(O) btrfs synoacl_vfs(PO) zlib_deflate hfsplus md4 hmac bnx2x(O) libcrc32c mdio mlx5_core(O) mlx4_en(O) mlx4_core(O) mlx_compat(O) compat(O) qede(O) qed(O) atlantic(O) tn40xx(O) i40e(O) ixgbe(O) be2net(O) igb(O) i2c_algo_bit e1000e(O) dca vxlan fuse vfat fat crc32c_intel aesni_intel glue_helper lrw gf128mul ablk_helper arc4 cryptd ecryptfs sha256_generic sha1_generic ecb aes_x86_64 authenc des_generic ansi_cprng cts md5 cbc cpufreq_conservative cpufreq_powersave cpufreq_performance cpufreq_ondemand mperf processor thermal_sys cpufreq_stats freq_table dm_snapshot crc_itu_t crc_ccitt quota_v2 quota_tree psnap p8022 llc sit tunnel4 ip_tunnel ipv6 zram(C) sg etxhci_hcd usb_storage xhci_hcd uhci_hcd ehci_pci ehci_hcd usbcore usb_common redpill(OF) [last unloaded: bromolow_synobios]
CPU: 1 PID: 12641 Comm: fileindexd Tainted: PF C O 3.10.105 #25556
Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 11/12/2020
task: ffff8801284fc800 ti: ffff880117a70000 task.ti: ffff880117a70000
RIP: 0010:[] [] generic_exec_single+0x68/0xe0
RSP: 0018:ffff880117a73cc0 EFLAGS: 00000202
RAX: 00000000000008fb RBX: ffff8801377dbec0 RCX: 0000000000000002
RDX: ffffffff816057c8 RSI: 00000000000000fb RDI: ffffffff816057c8
RBP: ffff88013dc12a80 R08: ffff88012788d358 R09: 0000000000000000
R10: 00007fc8c029a0a0 R11: ffff880115878ac0 R12: 8000000111e93067
R13: ffffea0003cc1348 R14: ffff8801284fc800 R15: ffffffff810d6cf5
FS: 00007fc8c2508700(0000) GS:ffff88013dd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fc8c2407000 CR3: 00000001156e4000 CR4: 00000000001607e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Stack:
0000000000000000 ffff880117a73d50 0000000000000001 ffffffff8186ec10
ffffffff8102f9e0 ffffffff81087e55 0000000000000001 0000000000000000
ffff88013dc12a80 ffff88013dc12a80 ffffffff8102f9e0 ffff880117a73d70
Call Trace:
[] ? do_flush_tlb_all+0x170/0x170
[] ? smp_call_function_single+0xd5/0x160
[] ? do_flush_tlb_all+0x170/0x170
[] ? flush_tlb_mm_range+0x22c/0x300
[] ? tlb_flush_mmu.part.66+0x29/0x80
[] ? tlb_finish_mmu+0x3d/0x40
[] ? unmap_region+0xbe/0x100
[] ? show_vfsmnt+0x104/0x140
[] ? vma_rb_erase+0x121/0x260
[] ? do_munmap+0x2ed/0x690
[] ? vm_munmap+0x36/0x50
[] ? SyS_munmap+0x5/0x10
[] ? system_call_fastpath+0x22/0x27
Code: 48 89 c6 48 89 5d 08 4c 89 ef 48 89 2b 48 89 53 08 48 89 1a e8 aa 68 44 00 4c 39 f5 74 6b f6 43 20 01 74 0f 0f 1f 80 00 00 00 00 90 f6 43 20 01 75 f8 5b 5d 41 5c 41 5d 41 5e c3 0f 1f 80 00

Edit:

Actually, it is not only docker.

I installed Moments on 6.2.4 and tried to import 20 files in a row...

The system froze:

[ 1230.717458] BUG: soft lockup - CPU#1 stuck for 41s! [fileindexd:22537]
[ 1230.718657] Modules linked in: cifs udf isofs loop hid_generic tcm_loop(O) iscsi_target_mod(O) target_core_ep(O) target_core_multi_file(O) target_core_file(O) target_core_iblock(O) target_core_mod(O) syno_extent_pool(PO) rodsp_ep(O) usbhid hid usblp bromolow_synobios(PO) exfat(O) btrfs synoacl_vfs(PO) zlib_deflate hfsplus md4 hmac bnx2x(O) libcrc32c mdio mlx5_core(O) mlx4_en(O) mlx4_core(O) mlx_compat(O) compat(O) qede(O) qed(O) atlantic(O) tn40xx(O) i40e(O) ixgbe(O) be2net(O) igb(O) i2c_algo_bit e1000e(O) dca vxlan fuse vfat fat crc32c_intel aesni_intel glue_helper lrw gf128mul ablk_helper arc4 cryptd ecryptfs sha256_generic sha1_generic ecb aes_x86_64 authenc des_generic ansi_cprng cts md5 cbc cpufreq_conservative cpufreq_powersave cpufreq_performance cpufreq_ondemand mperf processor thermal_sys cpufreq_stats freq_table dm_snapshot crc_itu_t crc_ccitt quota_v2 quota_tree psnap p8022 llc sit tunnel4 ip_tunnel ipv6 zram(C) sg etxhci_hcd usb_storage xhci_hcd uhci_hcd ehci_pci ehci_hcd usbcore usb_common redpill(OF) [last unloaded: bromolow_synobios]
[ 1230.736700] CPU: 1 PID: 22537 Comm: fileindexd Tainted: PF C O 3.10.105 #25556
[ 1230.738014] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 11/12/2020
[ 1230.739795] task: ffff8800a83bf040 ti: ffff8800a6114000 task.ti: ffff8800a6114000
[ 1230.741052] RIP: 0010:[] [] generic_exec_single+0x6e/0xe0
[ 1230.742562] RSP: 0018:ffff8800a6117cc0 EFLAGS: 00000202
[ 1230.743461] RAX: 00000000000008fb RBX: 0000000000000001 RCX: 00000000000000d8
[ 1230.744658] RDX: ffffffff816057c8 RSI: 00000000000000fb RDI: ffffffff816057c8
[ 1230.745856] RBP: ffff88013dc12a80 R08: ffff88008cff6358 R09: 0000000000000000
[ 1230.747056] R10: 0000000000000022 R11: ffff88013390bf80 R12: ffff88008d58f7d0
[ 1230.748254] R13: ffff880108538788 R14: ffffffff81108d4d R15: ffff8800a6117da0
[ 1230.749452] FS: 00007fa28e90b700(0000) GS:ffff88013dd00000(0000) knlGS:0000000000000000
[ 1230.750805] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1230.751775] CR2: 00007fa2a361e000 CR3: 000000008d470000 CR4: 00000000001607e0
[ 1230.752994] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1230.754159] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 1230.755356] Stack:
[ 1230.755708] 0000000000000000 ffff8800a6117d50 0000000000000001 ffffffff8186ec10
[ 1230.757041] ffffffff8102f9e0 ffffffff81087e55 0000000000000001 0000000000000000
[ 1230.758370] ffff88013dc12a80 ffff88013dc12a80 ffffffff8102f9e0 ffff8800a6117d70
[ 1230.759697] Call Trace:
[ 1230.760135] [] ? do_flush_tlb_all+0x170/0x170
[ 1230.761141] [] ? smp_call_function_single+0xd5/0x160
[ 1230.762246] [] ? do_flush_tlb_all+0x170/0x170
[ 1230.763252] [] ? flush_tlb_mm_range+0x22c/0x300
[ 1230.764307] [] ? tlb_flush_mmu.part.66+0x29/0x80
[ 1230.765354] [] ? tlb_finish_mmu+0x3d/0x40
[ 1230.766300] [] ? unmap_region+0xbe/0x100
[ 1230.767236] [] ? vma_rb_erase+0x121/0x260
[ 1230.768185] [] ? do_munmap+0x2ed/0x690
[ 1230.769095] [] ? vm_munmap+0x36/0x50
[ 1230.769972] [] ? SyS_munmap+0x5/0x10
[ 1230.770859] [] ? system_call_fastpath+0x22/0x27
[ 1230.771886] Code: 08 4c 89 ef 48 89 2b 48 89 53 08 48 89 1a e8 aa 68 44 00 4c 39 f5 74 6b f6 43 20 01 74 0f 0f 1f 80 00 00 00 00 f3 90 f6 43 20 01 <75> f8 5b 5d 41 5c 41 5d 41 5e c3 0f 1f 80 00 00 00 00 4c 8d 6d

The CPU rises to 100%.

@WiteWulf could you try?

Moments seems to be continuing to import images, but very slowly, and DSM is unresponsive.

Edit 2: now, as soon as the Moments app starts, it hangs DSM... I can't prevent it from starting.

@labrouss

labrouss commented Sep 22, 2021

After uploading a batch of photos with face detection enabled, the same happens on baremetal running the DS3615xs 7.0.1 image with an AMD CPU. I get consistent hangs/freezes.

Unfortunately I have no serial console to capture the soft lockup output.

I will try increasing the watchdog threshold to 60 seconds and see if this helps:

echo 60 > /proc/sys/kernel/watchdog_thresh              # apply immediately
echo "kernel.watchdog_thresh = 60" >> /etc/sysctl.conf  # persist across reboots
sysctl -p /etc/sysctl.conf

sysctl -a | grep -i watch
kernel.watchdog_thresh = 60

@WiteWulf

After uploading a batch of photos, the same happens on baremetal. I get consistent hangs/freezes.

Can you confirm:

  • hardware in use (including CPU model)
  • DSM version and architecture (bromolow or apollolake)

...and can you post the serial console output when you get a hang/freeze?

Hangs/freezes have only been observed so far on virtualisation platforms (such as ESXi and Proxmox), with kernel panics on baremetal.

@WiteWulf

@WiteWulf could you try?

Moments seems to be continuing to import images, but very slowly, and DSM is unresponsive.

Edit 2: now, as soon as the Moments app starts, it hangs DSM... I can't prevent it from starting.

I'm on 7.0.1-RC1, so I don't have Moments, but I installed Photos (I wasn't using it previously) and uploaded ~250 images to it. CPU usage barely got above 20% (it looks like it's single-threaded), but disk writes were much higher than anything I've seen when docker workloads have caused kernel panics (which makes me think this isn't related to high disk throughput). The server did not kernel panic or lock up.

@OrpheeGT

@WiteWulf I confirm Synology Photos does not suffer from the same issue... My DSM 7.0.1 RC does not have the issue either.
It happened on DSM 6.2.4 with Synology Moments.

@labrouss

labrouss commented Sep 22, 2021

@OrpheeGT You get a soft lockup on fileindexd, which is invoked at various times to index files for various applications.

[ 1230.717458] BUG: soft lockup - CPU#1 stuck for 41s! [fileindexd:22537]

Increase the watchdog threshold to 60 seconds (the maximum).

@WiteWulf

I noticed it was actually fileindexd that was named in the soft lockup output you posted, so I pointed Universal Search at a ~2TB music folder on my NAS to see if that crashes.

And...BOOM! synoelasticd (another database process) kernel panics:

[77556.729069] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 4
[77556.764160] CPU: 4 PID: 14678 Comm: synoelasticd Tainted: PF          O 3.10.108 #42214
[77556.803544] Hardware name: HP ProLiant MicroServer Gen8, BIOS J06 04/04/2019
[77556.838036]  ffffffff814a2759 ffffffff814a16b1 0000000000000010 ffff880409b08d60
[77556.874402]  ffff880409b08cf8 0000000000000000 0000000000000004 0000000000000001
[77556.910661]  0000000000000004 ffffffff80000001 0000000000000030 ffff8803f4cd4c00
[77556.947092] Call Trace:
[77556.958946]  <NMI>  [<ffffffff814a2759>] ? dump_stack+0xc/0x15
[77556.986979]  [<ffffffff814a16b1>] ? panic+0xbb/0x1df
[77557.011271]  [<ffffffff810a9eb8>] ? watchdog_overflow_callback+0xa8/0xb0
[77557.043922]  [<ffffffff810db7d3>] ? __perf_event_overflow+0x93/0x230
[77557.075077]  [<ffffffff810da612>] ? perf_event_update_userpage+0x12/0xf0
[77557.107901]  [<ffffffff810152a4>] ? intel_pmu_handle_irq+0x1b4/0x340
[77557.138530]  [<ffffffff814a9d06>] ? perf_event_nmi_handler+0x26/0x40
[77557.169348]  [<ffffffff814a944e>] ? do_nmi+0xfe/0x440
[77557.194810]  [<ffffffff814a8a53>] ? end_repeat_nmi+0x1e/0x7e
[77557.222568]  <<EOE>> 
[77557.233041] Rebooting in 3 seconds..

@labrouss

@WiteWulf Yes, the last thing I saw while running htop was synoelasticd as well.

@WiteWulf

WiteWulf commented Sep 22, 2021

So far I've seen kernel panics mostly related to:

  • influxdb
  • mysql
  • mariadb
  • elasticsearch

Elasticsearch seems to be the back-end used by Synology Universal Indexer, and the others have been installed by users in docker containers.
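For anyone else chasing this, the process that was on the stuck CPU is named in the lockup/panic lines themselves, so a crude way to keep an eye on it over SSH or the serial console (plain shell, nothing DSM-specific assumed) is:

# poll the kernel ring buffer for lockup lines; for soft lockups the bracketed name at the
# end (e.g. [runc:19580]) is the stuck process, for hard lockups check the following Comm: line
while true; do
    dmesg | grep -E 'soft lockup|hard LOCKUP' | tail -n 5
    sleep 10
done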

@OrpheeGT

So the title can be renamed; it is not only docker. @WiteWulf, you made it crash on DSM 7?

@WiteWulf WiteWulf changed the title Docker Causing Kernel Panic or Lockup on DS3615xs image Kernel Panic or Lockup on DS3615xs image Sep 22, 2021
@WiteWulf

Yes, Universal Search crashed on DSM7 while indexing a large folder (many large files).

I've changed the title to reflect that it's not just docker causing issues.

@OrpheeGT

Synology Photos must work differently: I added around 20 images to it in a row, as I did with Moments, but had no issue with it...

@OrpheeGT

OrpheeGT commented Sep 23, 2021

I may have done it wrong, but I removed the register_pmu_shim line that @ttg-public asked @WiteWulf to remove.

Then I tried to import 20 files in a row with the Synology Moments app:

[ 20.206998] <redpill/smart_shim.c:644> Got SMART command - looking for feature=0xd0
[ 20.210142] <redpill/smart_shim.c:388> Generating fake SMART values
[ 20.215051] <redpill/smart_shim.c:359> ATA_CMD_ID_ATA confirmed no SMART support - pretending it's there
[ 20.377729] <redpill/smart_shim.c:359> ATA_CMD_ID_ATA confirmed no SMART support - pretending it's there
[ 20.754611] BTRFS: device label 2021.09.22-14:16:50 v25556 devid 1 transid 383 /dev/md2
[ 20.758828] BTRFS info (device md2): enabling auto syno reclaim space
[ 20.761395] BTRFS info (device md2): use ssd allocation scheme
[ 20.763715] BTRFS info (device md2): using free space tree
[ 20.766100] BTRFS: has skinny extents
[ 21.534082] <redpill/smart_shim.c:644> Got SMART command - looking for feature=0xd0
[ 21.535867] <redpill/smart_shim.c:388> Generating fake SMART values
[ 21.891207] usbcore: registered new interface driver usblp
[ 22.162279] <redpill/smart_shim.c:644> Got SMART command - looking for feature=0xd0
[ 22.163660] <redpill/smart_shim.c:388> Generating fake SMART values
[ 22.560571] <redpill/smart_shim.c:644> Got SMART command - looking for feature=0xd0
[ 22.562195] Synotify use 16384 event queue size
[ 22.562213] Synotify use 16384 event queue size
[ 22.564978] <redpill/smart_shim.c:388> Generating fake SMART values
[ 22.567083] <redpill/smart_shim.c:644> Got SMART command - looking for feature=0xd0
[ 22.569133] <redpill/smart_shim.c:388> Generating fake SMART values
[ 22.571174] <redpill/smart_shim.c:644> Got SMART command - looking for feature=0xd5
[ 22.573226] <redpill/smart_shim.c:514> Generating fake WIN_SMART log=0 entries
[ 22.653817] <redpill/smart_shim.c:359> ATA_CMD_ID_ATA confirmed no SMART support - pretending it's there
[ 22.751817] <redpill/smart_shim.c:644> Got SMART command - looking for feature=0xd0
[ 22.753898] <redpill/smart_shim.c:388> Generating fake SMART values
[ 22.756290] <redpill/smart_shim.c:644> Got SMART command - looking for feature=0xd1
[ 22.758341] <redpill/smart_shim.c:455> Generating fake SMART thresholds
[ 22.764050] <redpill/smart_shim.c:359> ATA_CMD_ID_ATA confirmed no SMART support - pretending it's there
[ 22.776168] <redpill/smart_shim.c:644> Got SMART command - looking for feature=0xd0
[ 22.778231] <redpill/smart_shim.c:388> Generating fake SMART values
[ 22.971983] <redpill/smart_shim.c:359> ATA_CMD_ID_ATA confirmed no SMART support - pretending it's there
[ 23.136111] <redpill/smart_shim.c:644> Got SMART command - looking for feature=0xd0
[ 23.137513] <redpill/smart_shim.c:388> Generating fake SMART values
[ 23.145039] <redpill/smart_shim.c:644> Got SMART command - looking for feature=0xd1
[ 23.146412] <redpill/smart_shim.c:455> Generating fake SMART thresholds
[ 23.152186] <redpill/smart_shim.c:359> ATA_CMD_ID_ATA confirmed no SMART support - pretending it's there
[ 23.165414] <redpill/smart_shim.c:644> Got SMART command - looking for feature=0xd0
[ 23.166767] <redpill/smart_shim.c:388> Generating fake SMART values
[ 23.237426] <redpill/smart_shim.c:359> ATA_CMD_ID_ATA confirmed no SMART support - pretending it's there
[ 23.239306] iSCSI:target_core_rodsp_server.c:1027:rodsp_server_init RODSP server started, login_key(001132417efd).
[ 23.316261] iSCSI:extent_pool.c:766:ep_init syno_extent_pool successfully initialized
[ 23.355192] <redpill/smart_shim.c:644> Got SMART command - looking for feature=0xd0
[ 23.356547] <redpill/smart_shim.c:388> Generating fake SMART values
[ 23.366572] iSCSI:target_core_device.c:617:se_dev_align_max_sectors Rounding down aligned max_sectors from 4294967295 to 4294967288
[ 23.368681] iSCSI:target_core_lunbackup.c:361:init_io_buffer_head 512 buffers allocated, total 2097152 bytes successfully
[ 23.373280] Synotify use 16384 event queue size
[ 23.381171] <redpill/smart_shim.c:644> Got SMART command - looking for feature=0xd1
[ 23.382530] <redpill/smart_shim.c:455> Generating fake SMART thresholds
[ 23.398635] <redpill/smart_shim.c:359> ATA_CMD_ID_ATA confirmed no SMART support - pretending it's there
[ 23.412314] <redpill/smart_shim.c:644> Got SMART command - looking for feature=0xd0
[ 23.413674] <redpill/smart_shim.c:388> Generating fake SMART values
[ 23.429148] <redpill/smart_shim.c:359> ATA_CMD_ID_ATA confirmed no SMART support - pretending it's there
[ 23.582194] iSCSI:target_core_file.c:146:fd_attach_hba RODSP plugin for fileio is enabled.
[ 23.583646] iSCSI:target_core_file.c:153:fd_attach_hba ODX Token Manager is enabled.
[ 23.585700] iSCSI:target_core_multi_file.c:91:fd_attach_hba RODSP plugin for multifile is enabled.
[ 23.587375] iSCSI:target_core_ep.c:786:ep_attach_hba RODSP plugin for epio is enabled.
[ 23.588717] iSCSI:target_core_ep.c:793:ep_attach_hba ODX Token Manager is enabled.
[ 23.835435] usbcore: registered new interface driver usbhid
[ 23.836477] usbhid: USB HID core driver
[ 23.912684] input: VMware VMware Virtual USB Mouse as /devices/pci0000:00/0000:00:11.0/0000:02:00.0/usb2/2-1/2-1:1.0/input/input0
[ 23.914751] hid-generic 0003:0E0F:0003.0001: input: USB HID v1.10 Mouse [VMware VMware Virtual USB Mouse] on usb-0000:02:00.0-1/input0
[ 24.459275] loop: module loaded
[ 24.469102] <redpill/smart_shim.c:359> ATA_CMD_ID_ATA confirmed no SMART support - pretending it's there
[ 24.988731] warning: `nginx' uses 32-bit capabilities (legacy support in use)
[ 25.587723] <redpill/smart_shim.c:359> ATA_CMD_ID_ATA confirmed no SMART support - pretending it's there
[ 25.590445] ata5.00: configured for UDMA/100
[ 25.591176] ata5: EH complete
[ 25.592038] <redpill/smart_shim.c:644> Got SMART command - looking for feature=0xd8
[ 25.593386] <redpill/smart_shim.c:654> Attempted ATA_SMART_ENABLE modification!
[ 25.595248] <redpill/smart_shim.c:359> ATA_CMD_ID_ATA confirmed no SMART support - pretending it's there

Xpen_624 login: [ 25.623023] ata5.00: configured for UDMA/100
[ 25.623774] ata5: EH complete
[ 25.624479] <redpill/smart_shim.c:359> ATA_CMD_ID_ATA confirmed no SMART support - pretending it's there
[ 29.023868] <redpill/intercept_execve.c:87> Blocked /usr/syno/bin/syno_pstore_collect from running
[ 29.232232] <redpill/smart_shim.c:644> Got SMART command - looking for feature=0xd0
[ 29.233668] <redpill/smart_shim.c:388> Generating fake SMART values
[ 29.338667] <redpill/smart_shim.c:644> Got SMART command - looking for feature=0xd0
[ 29.340481] <redpill/smart_shim.c:388> Generating fake SMART values
[ 29.343761] <redpill/memory_helper.c:18> Disabling memory protection for page(s) at ffffffffa09f4c50+12/1 (<<ffffffffa09f4000)
[ 29.346380] <redpill/override_symbol.c:244> Obtaining lock for
[ 29.348127] <redpill/override_symbol.c:244> Writing original code to
[ 29.349960] <redpill/override_symbol.c:244> Released lock for
[ 29.351690] <redpill/override_symbol.c:219> Obtaining lock for
[ 29.353406] <redpill/override_symbol.c:219> Writing trampoline code to
[ 29.355276] <redpill/override_symbol.c:219> Released lock for
[ 29.357033] <redpill/bios_hwcap_shim.c:66> proxying GetHwCapability(id=3)->support => real=1 [org_fout=0, ovs_fout=0]
[ 29.449363] <redpill/smart_shim.c:359> ATA_CMD_ID_ATA confirmed no SMART support - pretending it's there
[ 29.458673] synobios write K to /dev/ttyS1 failed
[ 29.474660] <redpill/bios_shims_collection.c:43> mfgBIOS: nullify zero-int for VTK_SET_HDD_ACT_LED
[ 29.477850] <redpill/override_symbol.c:244> Obtaining lock for
[ 29.479807] <redpill/override_symbol.c:244> Writing original code to
[ 29.481891] <redpill/override_symbol.c:244> Released lock for
[ 29.483826] <redpill/override_symbol.c:219> Obtaining lock for
[ 29.485779] <redpill/override_symbol.c:219> Writing trampoline code to
[ 29.487895] <redpill/override_symbol.c:219> Released lock for
[ 29.489816] <redpill/bios_hwcap_shim.c:66> proxying GetHwCapability(id=2)->support => real=1 [org_fout=0, ovs_fout=0]
[ 29.501901] <redpill/bios_shims_collection.c:35> mfgBIOS: nullify zero-int for VTK_SET_DISK_LED
[ 29.509993] <redpill/bios_shims_collection.c:35> mfgBIOS: nullify zero-int for VTK_SET_DISK_LED
[ 29.513433] <redpill/bios_shims_collection.c:42> mfgBIOS: nullify zero-int for VTK_SET_PHY_LED
[ 29.515742] <redpill/bios_shims_collection.c:36> mfgBIOS: nullify zero-int for VTK_SET_PWR_LED
[ 29.520330] <redpill/bios_shims_collection.c:35> mfgBIOS: nullify zero-int for VTK_SET_DISK_LED
[ 29.526123] <redpill/bios_shims_collection.c:35> mfgBIOS: nullify zero-int for VTK_SET_DISK_LED
[ 29.537159] <redpill/bios_shims_collection.c:35> mfgBIOS: nullify zero-int for VTK_SET_DISK_LED
[ 29.544131] <redpill/bios_shims_collection.c:35> mfgBIOS: nullify zero-int for VTK_SET_DISK_LED
[ 29.550320] <redpill/bios_shims_collection.c:35> mfgBIOS: nullify zero-int for VTK_SET_DISK_LED
[ 29.556763] <redpill/bios_shims_collection.c:35> mfgBIOS: nullify zero-int for VTK_SET_DISK_LED
[ 29.563306] <redpill/bios_shims_collection.c:35> mfgBIOS: nullify zero-int for VTK_SET_DISK_LED
[ 29.569841] <redpill/bios_shims_collection.c:35> mfgBIOS: nullify zero-int for VTK_SET_DISK_LED
[ 29.579919] <redpill/bios_shims_collection.c:35> mfgBIOS: nullify zero-int for VTK_SET_DISK_LED
[ 29.587099] <redpill/bios_shims_collection.c:35> mfgBIOS: nullify zero-int for VTK_SET_DISK_LED
[ 29.602319] <redpill/bios_shims_collection.c:35> mfgBIOS: nullify zero-int for VTK_SET_DISK_LED
[ 29.627963] <redpill/bios_shims_collection.c:35> mfgBIOS: nullify zero-int for VTK_SET_DISK_LED
[ 29.636158] <redpill/bios_shims_collection.c:35> mfgBIOS: nullify zero-int for VTK_SET_DISK_LED
[ 30.369451] init: syno-check-disk-compatibility main process (12582) terminated with status 255
[ 30.428262] ip_tables: (C) 2000-2006 Netfilter Core Team
[ 30.435840] nf_conntrack version 0.5.0 (16384 buckets, 65536 max)
[ 30.507642] ip6_tables: (C) 2000-2006 Netfilter Core Team
[ 30.555223] aufs 3.10.x-20141110
[ 30.565205] Bridge firewalling registered
[ 32.920553] IPv6: ADDRCONF(NETDEV_UP): docker0: link is not ready
[ 37.436982] Synotify use 16384 event queue size
[ 37.438642] Synotify use 16384 event queue size
[ 38.686780] <redpill/bios_shims_collection.c:44> mfgBIOS: nullify zero-int for VTK_GET_MICROP_ID
[ 39.145534] <redpill/override_symbol.c:244> Obtaining lock for
[ 39.148066] <redpill/override_symbol.c:244> Writing original code to
[ 39.150551] <redpill/override_symbol.c:244> Released lock for
[ 39.152899] <redpill/override_symbol.c:219> Obtaining lock for
[ 39.155222] <redpill/override_symbol.c:219> Writing trampoline code to
[ 39.157757] <redpill/override_symbol.c:219> Released lock for
[ 39.159684] <redpill/bios_hwcap_shim.c:66> proxying GetHwCapability(id=3)->support => real=1 [org_fout=0, ovs_fout=0]
[ 40.327575] Synotify use 16384 event queue size
[ 40.329555] Synotify use 16384 event queue size
[ 41.392997] init: synocontentextractd main process ended, respawning
[ 43.327169] <redpill/override_symbol.c:244> Obtaining lock for
[ 43.329183] <redpill/override_symbol.c:244> Writing original code to
[ 43.331258] <redpill/override_symbol.c:244> Released lock for
[ 43.333175] <redpill/override_symbol.c:219> Obtaining lock for
[ 43.335111] <redpill/override_symbol.c:219> Writing trampoline code to
[ 43.337222] <redpill/override_symbol.c:219> Released lock for
[ 43.339135] <redpill/bios_hwcap_shim.c:66> proxying GetHwCapability(id=3)->support => real=1 [org_fout=0, ovs_fout=0]
[ 48.883979] Synotify use 16384 event queue size
[ 51.178339] <redpill/bios_shims_collection.c:44> mfgBIOS: nullify zero-int for VTK_GET_MICROP_ID
[ 51.536832] <redpill/override_symbol.c:244> Obtaining lock for
[ 51.539283] <redpill/override_symbol.c:244> Writing original code to
[ 51.541795] <redpill/override_symbol.c:244> Released lock for
[ 51.544112] <redpill/override_symbol.c:219> Obtaining lock for
[ 51.546456] <redpill/override_symbol.c:219> Writing trampoline code to
[ 51.549007] <redpill/override_symbol.c:219> Released lock for
[ 51.551336] <redpill/bios_hwcap_shim.c:66> proxying GetHwCapability(id=3)->support => real=1 [org_fout=0, ovs_fout=0]
[ 82.539367] <redpill/smart_shim.c:644> Got SMART command - looking for feature=0xd0
[ 82.543593] <redpill/smart_shim.c:388> Generating fake SMART values
[ 139.927989] BUG: soft lockup - CPU#1 stuck for 41s! [fileindexd:12879]
[ 139.929144] Modules linked in: bridge stp aufs macvlan veth xt_conntrack xt_addrtype nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables ipt_MASQUERADE xt_REDIRECT xt_nat iptable_nat nf_nat_ipv4 nf_nat xt_recent xt_iprange xt_limit xt_state xt_tcpudp xt_multiport xt_LOG nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack iptable_filter ip_tables x_tables cifs udf isofs loop hid_generic tcm_loop(O) iscsi_target_mod(O) target_core_ep(O) target_core_multi_file(O) target_core_file(O) target_core_iblock(O) target_core_mod(O) syno_extent_pool(PO) rodsp_ep(O) usbhid hid usblp bromolow_synobios(PO) exfat(O) btrfs synoacl_vfs(PO) zlib_deflate hfsplus md4 hmac bnx2x(O) libcrc32c mdio mlx5_core(O) mlx4_en(O) mlx4_core(O) mlx_compat(O) compat(O) qede(O) qed(O) atlantic(O) tn40xx(O) i40e(O) ixgbe(O) be2net(O) igb(O) i2c_algo_bit e1000e(O) dca vxlan fuse vfat fat crc32c_intel aesni_intel glue_helper lrw gf128mul ablk_helper arc4 cryptd ecryptfs sha256_generic sha1_generic ecb aes_x86_64 authenc des_generic ansi_cprng cts md5 cbc cpufreq_conservative cpufreq_powersave cpufreq_performance cpufreq_ondemand mperf processor thermal_sys cpufreq_stats freq_table dm_snapshot crc_itu_t crc_ccitt quota_v2 quota_tree psnap p8022 llc sit tunnel4 ip_tunnel ipv6 zram(C) sg etxhci_hcd usb_storage xhci_hcd uhci_hcd ehci_pci ehci_hcd usbcore usb_common redpill(OF) [last unloaded: bromolow_synobios]
[ 139.952060] CPU: 1 PID: 12879 Comm: fileindexd Tainted: PF C O 3.10.105 #25556
[ 139.953333] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 11/12/2020
[ 139.955060] task: ffff8801327bd800 ti: ffff88011e8b4000 task.ti: ffff88011e8b4000
[ 139.956274] RIP: 0010:[] [] generic_exec_single+0x6e/0xe0
[ 139.957725] RSP: 0018:ffff88011e8b7cc0 EFLAGS: 00000202
[ 139.958592] RAX: 00000000000008fb RBX: 0000000000000001 RCX: 0000000000000014
[ 139.959747] RDX: ffffffff816057c8 RSI: 00000000000000fb RDI: ffffffff816057c8
[ 139.960903] RBP: ffff88013dc12a80 R08: ffff88011448fe58 R09: 0000000000000000
[ 139.962058] R10: 0000000000000022 R11: ffff8800b49decc0 R12: ffff8800b0596890
[ 139.963214] R13: ffff88011a1c4170 R14: ffffffff81108d4d R15: ffff88011e8b7da0
[ 139.964374] FS: 00007fba3cf4c700(0000) GS:ffff88013dd00000(0000) knlGS:0000000000000000
[ 139.965669] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 139.966602] CR2: 00007fba3ce4b000 CR3: 000000011b332000 CR4: 00000000001607e0
[ 139.967762] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 139.968916] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 139.970047] Stack:
[ 139.970381] 0000000000000000 ffff88011e8b7d50 0000000000000001 ffffffff8186ec10
[ 139.971675] ffffffff8102f9e0 ffffffff81087e55 0000000000000001 0000000000000000
[ 139.972974] ffff88013dc12a80 ffff88013dc12a80 ffffffff8102f9e0 ffff88011e8b7d70
[ 139.974283] Call Trace:
[ 139.974715] [] ? do_flush_tlb_all+0x170/0x170
[ 139.975699] [] ? smp_call_function_single+0xd5/0x160
[ 139.976778] [] ? do_flush_tlb_all+0x170/0x170
[ 139.977758] [] ? flush_tlb_mm_range+0x22c/0x300
[ 139.978772] [] ? tlb_flush_mmu.part.66+0x29/0x80
[ 139.979793] [] ? tlb_finish_mmu+0x3d/0x40
[ 139.980718] [] ? unmap_region+0xbe/0x100
[ 139.981630] [] ? vma_rb_erase+0x121/0x260
[ 139.982555] [] ? do_munmap+0x2ed/0x690
[ 139.983437] [] ? vm_munmap+0x36/0x50
[ 139.984293] [] ? SyS_munmap+0x5/0x10
[ 139.985165] [] ? system_call_fastpath+0x22/0x27
[ 139.986173] Code: 08 4c 89 ef 48 89 2b 48 89 53 08 48 89 1a e8 aa 68 44 00 4c 39 f5 74 6b f6 43 20 01 74 0f 0f 1f 80 00 00 00 00 f3 90 f6 43 20 01 <75> f8 5b 5d 41 5c 41 5d 41 5e c3 0f 1f 80 00 00 00 00 4c 8d 6d
[ 191.885952] BUG: soft lockup - CPU#1 stuck for 44s! [fileindexd:12879]
[ 191.887102] Modules linked in: bridge stp aufs macvlan veth xt_conntrack xt_addrtype nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables ipt_MASQUERADE xt_REDIRECT xt_nat iptable_nat nf_nat_ipv4 nf_nat xt_recent xt_iprange xt_limit xt_state xt_tcpudp xt_multiport xt_LOG nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack iptable_filter ip_tables x_tables cifs udf isofs loop hid_generic tcm_loop(O) iscsi_target_mod(O) target_core_ep(O) target_core_multi_file(O) target_core_file(O) target_core_iblock(O) target_core_mod(O) syno_extent_pool(PO) rodsp_ep(O) usbhid hid usblp bromolow_synobios(PO) exfat(O) btrfs synoacl_vfs(PO) zlib_deflate hfsplus md4 hmac bnx2x(O) libcrc32c mdio mlx5_core(O) mlx4_en(O) mlx4_core(O) mlx_compat(O) compat(O) qede(O) qed(O) atlantic(O) tn40xx(O) i40e(O) ixgbe(O) be2net(O) igb(O) i2c_algo_bit e1000e(O) dca vxlan fuse vfat fat crc32c_intel aesni_intel glue_helper lrw gf128mul ablk_helper arc4 cryptd ecryptfs sha256_generic sha1_generic ecb aes_x86_64 authenc des_generic ansi_cprng cts md5 cbc cpufreq_conservative cpufreq_powersave cpufreq_performance cpufreq_ondemand mperf processor thermal_sys cpufreq_stats freq_table dm_snapshot crc_itu_t crc_ccitt quota_v2 quota_tree psnap p8022 llc sit tunnel4 ip_tunnel ipv6 zram(C) sg etxhci_hcd usb_storage xhci_hcd uhci_hcd ehci_pci ehci_hcd usbcore usb_common redpill(OF) [last unloaded: bromolow_synobios]
[ 191.910526] CPU: 1 PID: 12879 Comm: fileindexd Tainted: PF C O 3.10.105 #25556
[ 191.911840] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 11/12/2020
[ 191.913618] task: ffff8801327bd800 ti: ffff88011e8b4000 task.ti: ffff88011e8b4000
[ 191.914869] RIP: 0010:[] [] generic_exec_single+0x6a/0xe0
[ 191.916330] RSP: 0018:ffff88011e8b7cc0 EFLAGS: 00000202
[ 191.917225] RAX: 00000000000008fb RBX: 0000000000000001 RCX: 0000000000000014
[ 191.918416] RDX: ffffffff816057c8 RSI: 00000000000000fb RDI: ffffffff816057c8
[ 191.919609] RBP: ffff88013dc12a80 R08: ffff88011448fe58 R09: 0000000000000000
[ 191.920803] R10: 0000000000000022 R11: ffff8800b49decc0 R12: ffff8800b0596890
[ 191.921996] R13: ffff88011a1c4170 R14: ffffffff81108d4d R15: ffff88011e8b7da0
[ 191.923191] FS: 00007fba3cf4c700(0000) GS:ffff88013dd00000(0000) knlGS:0000000000000000
[ 191.924546] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 191.925512] CR2: 00007fba3ce4b000 CR3: 000000011b332000 CR4: 00000000001607e0
[ 191.926717] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 191.927917] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 191.929108] Stack:
[ 191.929461] 0000000000000000 ffff88011e8b7d50 0000000000000001 ffffffff8186ec10
[ 191.930797] ffffffff8102f9e0 ffffffff81087e55 0000000000000001 0000000000000000
[ 191.932131] ffff88013dc12a80 ffff88013dc12a80 ffffffff8102f9e0 ffff88011e8b7d70
[ 191.933483] Call Trace:
[ 191.933912] [] ? do_flush_tlb_all+0x170/0x170
[ 191.934921] [] ? smp_call_function_single+0xd5/0x160
[ 191.936030] [] ? do_flush_tlb_all+0x170/0x170
[ 191.937038] [] ? flush_tlb_mm_range+0x22c/0x300
[ 191.938076] [] ? tlb_flush_mmu.part.66+0x29/0x80
[ 191.939129] [] ? tlb_finish_mmu+0x3d/0x40
[ 191.940083] [] ? unmap_region+0xbe/0x100
[ 191.941023] [] ? vma_rb_erase+0x121/0x260
[ 191.941975] [] ? do_munmap+0x2ed/0x690
[ 191.942886] [] ? vm_munmap+0x36/0x50
[ 191.943767] [] ? SyS_munmap+0x5/0x10
[ 191.944648] [] ? system_call_fastpath+0x22/0x27
[ 191.945684] Code: c6 48 89 5d 08 4c 89 ef 48 89 2b 48 89 53 08 48 89 1a e8 aa 68 44 00 4c 39 f5 74 6b f6 43 20 01 74 0f 0f 1f 80 00 00 00 00 f3 90 43 20 01 75 f8 5b 5d 41 5c 41 5d 41 5e c3 0f 1f 80 00 00 00

So it still shows the same behaviour for me...

@labrouss
Copy link

labrouss commented Sep 23, 2021

I confirm that the universal search on the DS3615 image freezes the system on both physical and virtual machines, even with the latest code. Looks like the redpill module is crashing, as I get ATA disk and btrfs errors.

On DS918, it does not have the same effect.

@WiteWulf
Copy link
Author

@labrouss can you confirm that your baremetal/physical system is freezing and not kernel panicking? All other reports I've had indicate kernel panic followed by reboot on baremetal, and freeze/lockup on virtual. It's important to be clear what you're experiencing here.

@WiteWulf
Copy link
Author

As per @ttg-public's suggestion on the forum, "Can you try deleting the line with "register_pmu_shim" from redpill_main.c (in init_(void) function) and rebuilding the kernel module?" I rebuilt the loader without the PMU shim and am still seeing the kernel panics when loading my influxdb docker container.

[  338.055690] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 6
[  338.091670] CPU: 6 PID: 21097 Comm: containerd-shim Tainted: PF          O 3.10.108 #42214
[  338.132114] Hardware name: HP ProLiant MicroServer Gen8, BIOS J06 04/04/2019
[  338.168045]  ffffffff814a2759 ffffffff814a16b1 0000000000000010 ffff880409b88d60
[  338.205031]  ffff880409b88cf8 0000000000000000 0000000000000006 0000000000000001
[  338.241507]  0000000000000006 ffffffff80000001 0000000000000030 ffff8803f4d4dc00
[  338.278173] Call Trace:
[  338.290006]  <NMI>  [<ffffffff814a2759>] ? dump_stack+0xc/0x15
[  338.318839]  [<ffffffff814a16b1>] ? panic+0xbb/0x1df
[  338.342727]  [<ffffffff810a9eb8>] ? watchdog_overflow_callback+0xa8/0xb0
[  338.375043]  [<ffffffff810db7d3>] ? __perf_event_overflow+0x93/0x230
[  338.405804]  [<ffffffff810da612>] ? perf_event_update_userpage+0x12/0xf0
[  338.438356]  [<ffffffff810152a4>] ? intel_pmu_handle_irq+0x1b4/0x340
[  338.469218]  [<ffffffff814a9d06>] ? perf_event_nmi_handler+0x26/0x40
[  338.500130]  [<ffffffff814a944e>] ? do_nmi+0xfe/0x440
[  338.525060]  [<ffffffff814a8a53>] ? end_repeat_nmi+0x1e/0x7e
[  338.552408]  <<EOE>>
[  338.562333] Rebooting in 3 seconds..

@ilovepancakes95
Copy link

I confirm that the universal search on the DS3615 image freezes the system on both physical and virtual machines, even with the latest code. Looks like the redpill module is crashing, as I get ATA disk and btrfs errors.

On DS918, it does not have the same effect.

For the record, I am not experiencing this issue. Using universal search/having it index the entire home folder does not result in any issues. I don't have a lot of data in the home folder, as this is a test VM, so I'm not sure if your problem only occurs with higher I/O load and lots of data to index. I am running DSM 7 41222 RedPill 3615xs on ESXi v7.

@OrpheeGT
Copy link

I confirm that the universal search on the DS3615 image freezes the system on both physical and virtual machines, even with the latest code. Looks like the redpill module is crashing, as I get ATA disk and btrfs errors.
On DS918, it does not have the same effect.

For the record, I am not experiencing this issue. Using universal search/having it index the entire home folder does not result in any issues. I don't have a lot of data in the home folder, as this is a test VM, so I'm not sure if your problem only occurs with higher I/O load and lots of data to index. I am running DSM 7 41222 RedPill 3615xs on ESXi v7.

Could you just try installing Docker and an influxdb container on it?
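
For anyone wanting to reproduce this, a minimal sketch of spinning up such a container from the DSM command line is below (the influxdb:1.8 tag matches the image used in the stress tests later in this thread; the container name and volume path are purely illustrative):

sudo docker pull influxdb:1.8
sudo docker run -d --name influxdb-test \
    -p 8086:8086 \
    -v /volume1/docker/influxdb:/var/lib/influxdb \
    influxdb:1.8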

@WiteWulf
Copy link
Author

For the record, I am not experiencing this issue. Using universal search/having it index the entire home folder does not result in any issues. I don't have a lot of data in the home folder, as this is a test VM, so I'm not sure if your problem only occurs with higher I/O load and lots of data to index. I am running DSM 7 41222 RedPill 3615xs on ESXi v7.

I found that simply indexing a few photos didn't cause problems, but pointing it at a very large music folder (2TB, 160k files) caused a kernel panic after a few minutes.

@labrouss
Copy link

labrouss commented Sep 23, 2021

Both the physical and virtual systems are not panicking and not rebooting, but are experiencing soft lockups. The issue with universal search is consistent and has been verified numerous times.

I have a VM running an old image of the DS3615 with RedPill v0.5-git-23578eb. Universal search does not have the same effect. It works as expected.

You can find your version of the redpill module by running: dmesg | grep RedPill

You can also do a "git checkout 23578eb", recompile, inject the module into your rd.gz and check.
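
A rough sketch of that procedure, assuming the redpill-lkm sources build as a standard out-of-tree kernel module against a matching DSM kernel tree (the kernel source path is illustrative, and repacking redpill.ko into rd.gz is left to whatever loader build process you already use):

git checkout 23578eb
# standard out-of-tree module build against the matching DSM kernel sources (path is illustrative)
make -C /path/to/dsm-kernel-source M=$(pwd) modules
# inject the rebuilt redpill.ko into rd.gz with your usual loader build, boot, then confirm the version:
dmesg | grep RedPill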

@WiteWulf
Copy link
Author

You can also do a "git checkout 23578eb" , recompile, inject the module into your rd.gz and check

Interesting that that commit is related to the PMU shim, notably bug fixes and enabling it by default.

So, to review:

  • on the latest redpill code your baremetal server with the DS3615xs image locks up when Universal Search is indexing
  • on the older '23578eb' release (where the PMU fix was introduced and enabled by default) you do not see the same lockups with Universal Search

Thanks for the info!

@WiteWulf
Copy link
Author

WiteWulf commented Sep 23, 2021

The latest redpill code (3474d9b) actually seems more prone to kernel panicking than previous versions. My influxdb container was guaranteed to crash the system every time I started it, but the others I had were typically stable and non-problematic. I'm now seeing immediate kernel panics when starting a mysql container that previously didn't cause any problems.

I've gone back to a slightly older build (021ed51), and mysql (operating as a database backend for librenms) starts and runs without problems.

(FYI, I'm not using 021ed51 for any reason other than that I already had a USB stick with it on. I don't perceive it to be more stable than any other commit.)

@WiteWulf
Copy link
Author

As an experiment, I took the HDDs out of my Gen8 and put an old spare drive in, booted it from the stick with redpill 021ed51 on it and did a fresh install. I installed a load of docker containers (mysql, mariadb, influxdb and nginx) and pretty soon mysqld kernel panic'd:

Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 4
[ 1421.492280] CPU: 4 PID: 28085 Comm: mysqld Tainted: PF        C O 3.10.108 #42214
[ 1421.529738] Hardware name: HP ProLiant MicroServer Gen8, BIOS J06 04/04/2019
[ 1421.564106]  ffffffff814a2759 ffffffff814a16b1 0000000000000010 ffff880409b08d60
[ 1421.599651]  ffff880409b08cf8 0000000000000000 0000000000000004 0000000000000001
[ 1421.635065]  0000000000000004 ffffffff80000001 0000000000000030 ffff8803f4cd4c00
[ 1421.671546] Call Trace:
[ 1421.683853]  <NMI>  [<ffffffff814a2759>] ? dump_stack+0xc/0x15
[ 1421.712185]  [<ffffffff814a16b1>] ? panic+0xbb/0x1df
[ 1421.737554]  [<ffffffff810a9eb8>] ? watchdog_overflow_callback+0xa8/0xb0
[ 1421.771289]  [<ffffffff810db7d3>] ? __perf_event_overflow+0x93/0x230
[ 1421.802524]  [<ffffffff810da612>] ? perf_event_update_userpage+0x12/0xf0
[ 1421.838746]  [<ffffffff810152a4>] ? intel_pmu_handle_irq+0x1b4/0x340
[ 1421.870141]  [<ffffffff814a9d06>] ? perf_event_nmi_handler+0x26/0x40
[ 1421.901388]  [<ffffffff814a944e>] ? do_nmi+0xfe/0x440
[ 1421.925556]  [<ffffffff814a8a53>] ? end_repeat_nmi+0x1e/0x7e
[ 1421.953482]  <<EOE>> 
[ 1422.998713] Shutting down cpus with NMI
[ 1423.017980] Rebooting in 3 seconds..

I basically wanted to establish whether or not the crashes were related to data or configuration carried over from my previous 6.2.3 install, and it doesn't look like it is.

@WiteWulf
Copy link
Author

WiteWulf commented Sep 24, 2021

Quick follow-up on the above: I had to push the system to make it crash. Influxdb didn't do it straight away as it does on my "live" server, nor did Universal Search indexing a few GB of data. Once I booted back into my live system it crashed again while starting up docker, and came back up scrubbing the disks. While the disks were scrubbing, any attempt to start docker would cause another kernel panic, but after leaving it to finish scrubbing overnight I was able to start everything except the influxdb container without trouble.

This suggests to me that the problem is related to load somehow. It's nothing as obvious as CPU load, or disk throughput, as I can quite happily stream 4k video and transcode 1080p content with Plex on this system. It's something harder to pin down.

@labrouss
Copy link

labrouss commented Sep 24, 2021

I see you are running 7.0.1-42214, can you try with a 7.0 loader instead? As I tested, though, it's not possible to recover to an earlier version, so you will have to reinstall.

@ceozero
Copy link

ceozero commented Sep 30, 2021

On a Dell R740xd with ESXi 7.0.2:
Using 918-DSM 7.0.1 everything is normal.
With 3615-DSM 7.0.1 you cannot install Docker, otherwise it will crash when you reboot it.
Hope this gets fixed soon.
image

@txb2d
Copy link

txb2d commented Oct 1, 2021

gen8 8 x Intel(R) Xeon(R) CPU E3-1265L V2 @ 2.50GHz
Proxmox VE (PVE) system
Running Docker and Synology Photos face recognition, the system crashes!!!

@WiteWulf
Copy link
Author

WiteWulf commented Oct 1, 2021

Via Google translate, for the benefit of the non-Chinese speaking:

Run docer and synologyphoto face recognition, the system will crash! ! !

@ceozero
Copy link

ceozero commented Oct 1, 2021

@WiteWulf
Did you feed back this bug to the forum?
The team doesn't seem to reply to other posts.
https://xpenology.com/forum/topic/45795-redpill-the-new-loader-for-624-discussion/

@WiteWulf
Copy link
Author

WiteWulf commented Oct 1, 2021

@WiteWulf
Did you feed back this bug to the forum?
The team doesn't seem to reply to other posts.
https://xpenology.com/forum/topic/45795-redpill-the-new-loader-for-624-discussion/

It's been covered in depth on the forum, both in the main Redpill thread and in its own dedicated thread. Both are easy to find with the search function.

ThorGroup tend to reply to the forum threads at most once a week; there was a big update from them overnight in which they addressed this issue a number of times.

@ceozero
Copy link

ceozero commented Oct 1, 2021

Well, if there is any solution, please @ me. Thank you.

@WiteWulf
Copy link
Author

WiteWulf commented Oct 1, 2021

Sorry, no, that’s not how this works 😀

If you want to keep up to date, either “follow” the repo or the issue here on GitHub, or follow the thread on the forums. Use the tools available to you…

@ttg-public ttg-public added the bug Something isn't working label Oct 2, 2021
@ttg-public
Copy link
Member

We don't want to say that but... we THINK we nailed it down ;)

It looks like we were accidentally touching some of the registers which weren't preserved by the caller. This is an exception with a few syscalls in Linux (the full stack frame is not preserved for performance reasons). After numerous articles and plenty of research we concluded it's not worth fighting for literally 1-3 CPU cycles per binary and going ASM-crazy (for comparison, a memory access is one to a few hundred times slower, so it only matters at datacenter scale with millions of executions), and added an unconditional jump to the execve() override. With any modern CPU (we tested as far back as one of the first Core 2 Duos) the impact is immeasurable due to branch prediction ;)

The fact it worked on 918+ seems to be sheer luck: the GCC used to compile the 918+ kernel didn't touch these registers (or didn't touch them enough to cause a problem). These "random" soft lockups were caused by the interaction of the swapper process and context switching. That's why docker + a database was a PERFECT triggering scenario, as it stresses memory, uses swap extensively, executes many processes and switches contexts like crazy (which is, as a side note, why running a database in a container in any performance-oriented application is really hard to pull off correctly).

There are two tests we were running which consistently crashed the 3615xs instance on Proxmox:

# Test 1 - installing and removing mysql container; usually crashed 512MB instance within the first try/seconds
while true; do
    sudo docker rm -f test-mysql ;
    sudo docker image prune --all -f ;
    sudo docker run --name test-mysql -e MYSQL_ROOT_PASSWORD=test -d mysql:latest &&
    echo "OK"
done


# Test 2 - on instances with more RAM stress-testing the server usually triggered the crash in under a minute
docker run --name test-mysql -e MYSQL_ROOT_PASSWORD=test -d mysql:latest
docker exec -it test-mysql bash
# execute mysql -u root -p => then execute any SQL like SHOW DATABASES;
# for some reason running mysqlslap right away ends up in denied connections?
mysqlslap --user=root -p --auto-generate-sql --number-char-cols=100 --number-int-cols=100 --detach=10 --concurrency=200 --iterations=5000

That being said, once we debugged the issue the fix wasn't that difficult - proper stack protection and an indirect jump seem to fix the problem. This is really the commit which introduced the mechanism: 122470d (it's slightly noisy as we moved one file; the important change is the comment on override_syscall() and the macros above that function). The fix for the execve() interception was a matter of adjusting syntax: cdc0e2d

@WiteWulf: your reports were very useful. Can you poke around and see if the new version fixes the problem? We tested it in multiple scenarios and weren't able to crash it but...

p.s. If anyone wants to get nerdy about performance impacts, Cloudflare has a nice blog post about a similar problem: https://blog.cloudflare.com/branch-predictor/

@pocopico
Copy link

pocopico commented Oct 2, 2021

@ttg-public @WiteWulf

Yes! At least now universal search (synoelasticd) doesn't crash the system!

This was the most consistent crash, and I was able to replicate it on both physical and virtual machines, which in my case varied from AMD to Intel CPUs, old and newer CPUs, etc.

I will continue testing

@OrpheeGT
Copy link

OrpheeGT commented Oct 2, 2021

To be disregarded, did not git pull... my bad.

@WiteWulf
Copy link
Author

WiteWulf commented Oct 2, 2021

Well, it's definitely an improvement, but not fixed. This is just a "feeling", but I can get more containers running than I could on 3474d9b before it crashes.

I had all my docker containers running, and was importing a load of data to the influxdb when it crashed:

[  393.963862] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 1
[  393.998533] CPU: 1 PID: 23812 Comm: python3 Tainted: PF          O 3.10.108 #42218
[  394.035043] Hardware name: HP ProLiant MicroServer Gen8, BIOS J06 04/04/2019
[  394.069815]  ffffffff814a2759 ffffffff814a16b1 0000000000000010 ffff880409a48d60
[  394.106286]  ffff880409a48cf8 0000000000000000 0000000000000001 0000000000000001
[  394.143049]  0000000000000001 ffffffff80000001 0000000000000030 ffff8803f52ddc00
[  394.179034] Call Trace:
[  394.190684]  <NMI>  [<ffffffff814a2759>] ? dump_stack+0xc/0x15
[  394.219155]  [<ffffffff814a16b1>] ? panic+0xbb/0x1df
[  394.243880]  [<ffffffff810a9eb8>] ? watchdog_overflow_callback+0xa8/0xb0
[  394.276362]  [<ffffffff810db7d3>] ? __perf_event_overflow+0x93/0x230
[  394.307194]  [<ffffffff810da612>] ? perf_event_update_userpage+0x12/0xf0
[  394.339880]  [<ffffffff810152a4>] ? intel_pmu_handle_irq+0x1b4/0x340
[  394.371296]  [<ffffffff814a9d06>] ? perf_event_nmi_handler+0x26/0x40
[  394.402719]  [<ffffffff814a944e>] ? do_nmi+0xfe/0x440
[  394.426839]  [<ffffffff814a8a53>] ? end_repeat_nmi+0x1e/0x7e
[  394.454698]  <<EOE>> 
[  394.465349] Rebooting in 3 seconds..

Notably it wasn't the influxdb or containerd-shim processes that it flagged this time, though. It was the python script I was running to gather the data and import to the influxdb.

Plex Media Server next...

Nope:

[  453.200262] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 3
[  453.235373] CPU: 3 PID: 23981 Comm: Plex Media Serv Tainted: PF          O 3.10.108 #42218
[  453.275990] Hardware name: HP ProLiant MicroServer Gen8, BIOS J06 04/04/2019
[  453.310620]  ffffffff814a2759 ffffffff814a16b1 0000000000000010 ffff880409ac8d60
[  453.348126]  ffff880409ac8cf8 0000000000000000 0000000000000003 0000000000000001
[  453.384619]  0000000000000003 ffffffff80000001 0000000000000030 ffff8803f5372c00
[  453.420535] Call Trace:
[  453.432447]  <NMI>  [<ffffffff814a2759>] ? dump_stack+0xc/0x15
[  453.461377]  [<ffffffff814a16b1>] ? panic+0xbb/0x1df
[  453.486080]  [<ffffffff810a9eb8>] ? watchdog_overflow_callback+0xa8/0xb0
[  453.519279]  [<ffffffff810db7d3>] ? __perf_event_overflow+0x93/0x230
[  453.549937]  [<ffffffff810da612>] ? perf_event_update_userpage+0x12/0xf0
[  453.582923]  [<ffffffff810152a4>] ? intel_pmu_handle_irq+0x1b4/0x340
[  453.614385]  [<ffffffff814a9d06>] ? perf_event_nmi_handler+0x26/0x40
[  453.646830]  [<ffffffff814a944e>] ? do_nmi+0xfe/0x440
[  453.671481]  [<ffffffff814a8a53>] ? end_repeat_nmi+0x1e/0x7e
[  453.699295]  <<EOE>> 
[  453.709435] Rebooting in 3 seconds..

After the reboot I just left Plex running in the background and it crashed again without me even pushing it with any updates/scans.

@pocopico
Copy link

pocopico commented Oct 2, 2021

@WiteWulf @ttg-public

I left Test #1 running and had no crash until I reached the pull limit! That's definitely an improvement. I still need to test it on a slower CPU, though.

@OrpheeGT
Copy link

OrpheeGT commented Oct 2, 2021

So I rebuilt my image (I did it wrong the first time), this time with a git pull done.
Synology Moments seems to be OK now when adding around 40 pictures in a row...

Edit: inserted 185 in a row:
image

It imported them correctly, but after that it froze a bit... CPU at 100%:
image

But no soft lockup

Serial console logs :

[ 28.496812] <redpill/bios_hwcap_shim.c:66> proxying GetHwCapability(id=3)->support => real=1 [org_fout=0, ovs_fout=0]
[ 28.904537] <redpill/bios_shims_collection.c:43> mfgBIOS: nullify zero-int for VTK_SET_HDD_ACT_LED
[ 28.908402] <redpill/bios_shims_collection.c:35> mfgBIOS: nullify zero-int for VTK_SET_DISK_LED
[ 28.912443] <redpill/bios_shims_collection.c:35> mfgBIOS: nullify zero-int for VTK_SET_DISK_LED
[ 28.916223] <redpill/bios_shims_collection.c:35> mfgBIOS: nullify zero-int for VTK_SET_DISK_LED
[ 28.919735] <redpill/bios_shims_collection.c:35> mfgBIOS: nullify zero-int for VTK_SET_DISK_LED
[ 28.923155] <redpill/bios_shims_collection.c:35> mfgBIOS: nullify zero-int for VTK_SET_DISK_LED
[ 28.926507] <redpill/bios_shims_collection.c:35> mfgBIOS: nullify zero-int for VTK_SET_DISK_LED
[ 28.929672] <redpill/bios_shims_collection.c:35> mfgBIOS: nullify zero-int for VTK_SET_DISK_LED
[ 28.932944] <redpill/bios_shims_collection.c:35> mfgBIOS: nullify zero-int for VTK_SET_DISK_LED
[ 28.937190] <redpill/bios_shims_collection.c:35> mfgBIOS: nullify zero-int for VTK_SET_DISK_LED
[ 28.941205] <redpill/bios_shims_collection.c:35> mfgBIOS: nullify zero-int for VTK_SET_DISK_LED
[ 28.945578] <redpill/bios_shims_collection.c:35> mfgBIOS: nullify zero-int for VTK_SET_DISK_LED
[ 28.949115] <redpill/bios_shims_collection.c:35> mfgBIOS: nullify zero-int for VTK_SET_DISK_LED
[ 28.952771] <redpill/bios_shims_collection.c:35> mfgBIOS: nullify zero-int for VTK_SET_DISK_LED
[ 28.958468] <redpill/bios_shims_collection.c:35> mfgBIOS: nullify zero-int for VTK_SET_DISK_LED
[ 28.962119] <redpill/bios_shims_collection.c:35> mfgBIOS: nullify zero-int for VTK_SET_DISK_LED
[ 29.007892] <redpill/pmu_shim.c:310> Got 2 bytes from PMU: reason=1 hex={2d 38} ascii="-8"
[ 29.009316] <redpill/pmu_shim.c:239> Executing cmd OUT_STATUS_LED_ON_GREEN handler cmd_shim_noop+0x0/0x30 [redpill]
[ 29.011055] <redpill/pmu_shim.c:45> vPMU received OUT_STATUS_LED_ON_GREEN using 1 bytes - NOOP
[ 31.316069] Synotify use 16384 event queue size
[ 31.316901] Synotify use 16384 event queue size
[ 31.800159] <redpill/bios_shims_collection.c:44> mfgBIOS: nullify zero-int for VTK_GET_MICROP_ID
[ 33.260153] Synotify use 16384 event queue size
[ 33.260999] Synotify use 16384 event queue size
[ 35.389368] init: synocontentextractd main process ended, respawning
[ 40.617909] Synotify use 16384 event queue size
[ 77.034590] type=1400 audit(1633172764.561:2): apparmor="DENIED" operation="capable" parent=1 profile="/volume*/@appstore/SynologyMoments/usr/sbin/synophoto-thumbnail-provider" pid=13817 comm="synophoto-thumb" pid=13817 comm="synophoto-thumb" capability=36 capname="block_suspend"
[ 77.182689] type=1400 audit(1633172764.709:3): apparmor="DENIED" operation="capable" parent=1 profile="/volume*/@appstore/SynologyMoments/usr/sbin/synophoto-thumbnail-provider" pid=13817 comm="synophoto-thumb" pid=13817 comm="synophoto-thumb" capability=36 capname="block_suspend"
[ 83.321087] <redpill/smart_shim.c:644> Got SMART command - looking for feature=0xd0
[ 83.322463] <redpill/smart_shim.c:388> Generating fake SMART values
[ 143.276691] <redpill/smart_shim.c:644> Got SMART command - looking for feature=0xd0
[ 143.278081] <redpill/smart_shim.c:388> Generating fake SMART values
[ 203.228901] <redpill/smart_shim.c:644> Got SMART command - looking for feature=0xd0
[ 203.230224] <redpill/smart_shim.c:388> Generating fake SMART values
[ 263.183224] <redpill/smart_shim.c:644> Got SMART command - looking for feature=0xd0
[ 263.184594] <redpill/smart_shim.c:388> Generating fake SMART values
[ 278.841375] <redpill/bios_shims_collection.c:44> mfgBIOS: nullify zero-int for VTK_GET_MICROP_ID
[ 323.138300] <redpill/smart_shim.c:644> Got SMART command - looking for feature=0xd0
[ 323.139673] <redpill/smart_shim.c:388> Generating fake SMART values
[ 327.134839] <redpill/smart_shim.c:644> Got SMART command - looking for feature=0xd0
[ 327.136215] <redpill/smart_shim.c:388> Generating fake SMART values
[ 383.092309] <redpill/smart_shim.c:644> Got SMART command - looking for feature=0xd0
[ 383.093719] <redpill/smart_shim.c:388> Generating fake SMART values
[ 443.050291] <redpill/smart_shim.c:644> Got SMART command - looking for feature=0xd0
[ 443.051679] <redpill/smart_shim.c:388> Generating fake SMART values
[ 463.522386] <redpill/bios_shims_collection.c:44> mfgBIOS: nullify zero-int for VTK_GET_MICROP_ID
[ 463.790425] <redpill/override_symbol.c:242> Obtaining lock for
[ 463.791748] <redpill/override_symbol.c:242> Writing original code to
[ 463.793135] <redpill/override_symbol.c:242> Released lock for
[ 463.794407] <redpill/override_symbol.c:217> Obtaining lock for
[ 463.795694] <redpill/override_symbol.c:217> Writing trampoline code to
[ 463.797086] <redpill/override_symbol.c:217> Released lock for

[ 463.798337] <redpill/bios_hwcap_shim.c:66> proxying GetHwCapability(id=3)->support => real=1 [org_fout=0, ovs_fout=0]
[ 502.999601] <redpill/smart_shim.c:644> Got SMART command - looking for feature=0xd0
[ 503.000962] <redpill/smart_shim.c:388> Generating fake SMART values

@OrpheeGT
Copy link

OrpheeGT commented Oct 2, 2021

While uploading 800 pictures:
image

[ 9773.624456] <redpill/bios_shims_collection.c:44> mfgBIOS: nullify zero-int for VTK_GET_MICROP_ID
[ 9774.007617] <redpill/override_symbol.c:242> Obtaining lock for
[ 9774.008964] <redpill/override_symbol.c:242> Writing original code to
[ 9774.010353] <redpill/override_symbol.c:242> Released lock for
[ 9774.011649] <redpill/override_symbol.c:217> Obtaining lock for
[ 9774.012948] <redpill/override_symbol.c:217> Writing trampoline code to
[ 9774.014346] <redpill/override_symbol.c:217> Released lock for
[ 9774.015665] <redpill/bios_hwcap_shim.c:66> proxying GetHwCapability(id=3)->support => real=1 [org_fout=0, ovs_fout=0]
image

@WiteWulf
Copy link
Author

WiteWulf commented Oct 2, 2021

@OrpheeGT feeling brave enough to upgrade to 7.0.1? :)

@ceozero
Copy link

ceozero commented Oct 2, 2021

I just recompiled it. After several hours of testing, it didn't crash.
A forced restart or power failure can accelerate the occurrence of the crash.
Previously, it would crash after being forced to restart 2-3 times. I've now tested it dozens of times and it doesn't seem to crash.
@WiteWulf

@ttg-public
Copy link
Member

ttg-public commented Oct 2, 2021

I had all my docker containers running, and was importing a load of data to the influxdb when it crashed:

[  393.963862] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 1
[  393.998533] CPU: 1 PID: 23812 Comm: python3 Tainted: PF          O 3.10.108 #42218
[  394.035043] Hardware name: HP ProLiant MicroServer Gen8, BIOS J06 04/04/2019
[  394.069815]  ffffffff814a2759 ffffffff814a16b1 0000000000000010 ffff880409a48d60
[  394.106286]  ffff880409a48cf8 0000000000000000 0000000000000001 0000000000000001
[  394.143049]  0000000000000001 ffffffff80000001 0000000000000030 ffff8803f52ddc00
[  394.179034] Call Trace:
[  394.190684]  <NMI>  [<ffffffff814a2759>] ? dump_stack+0xc/0x15
[  394.219155]  [<ffffffff814a16b1>] ? panic+0xbb/0x1df
[  394.243880]  [<ffffffff810a9eb8>] ? watchdog_overflow_callback+0xa8/0xb0
[  394.276362]  [<ffffffff810db7d3>] ? __perf_event_overflow+0x93/0x230
[  394.307194]  [<ffffffff810da612>] ? perf_event_update_userpage+0x12/0xf0
[  394.339880]  [<ffffffff810152a4>] ? intel_pmu_handle_irq+0x1b4/0x340
[  394.371296]  [<ffffffff814a9d06>] ? perf_event_nmi_handler+0x26/0x40
[  394.402719]  [<ffffffff814a944e>] ? do_nmi+0xfe/0x440
[  394.426839]  [<ffffffff814a8a53>] ? end_repeat_nmi+0x1e/0x7e
[  394.454698]  <<EOE>> 
[  394.465349] Rebooting in 3 seconds..

Notably it wasn't the influxdb or containerd-shim processes that it flagged this time, though. It was the python script I was running to gather the data and import to the influxdb.

@WiteWulf:
Hard lockups are quite different from soft ones. They happen when interrupts themselves weren't processed for ~10s for some reason. We only touch IRQs in a very limited set of scenarios. We tried to isolate a test case here, as Plex has huge variability, so testing with it may be very hard. Since you reported influx having issues, we turned to a synthetic influx stress test run like so:

docker run --name influx-test -d -p 8086:8086 -v $PWD:/var/lib/influxdb influxdb:1.8
docker exec -it influx-test sh

# inside of the container:
wget https://golang.org/dl/go1.17.1.linux-amd64.tar.gz &&
    tar -C /usr/local -xzf go1.17.1.linux-amd64.tar.gz &&
    rm go1* &&
    export PATH=$PATH:/usr/local/go/bin &&
    go get -v 'github.com/influxdata/influx-stress/cmd/...'
/root/go/bin/influx-stress insert -f --stats

Indeed this causes a lot of CPU usage and plenty of swapping, along with sparse "hrtimer: interrupt took X ns" messages in dmesg (which by itself usually means the system is overloaded, but stable). However, we weren't able to trigger a hard lockup crash. The issue with hard lockups is that the watchdog may actually be wrong and crash the system even if it's not stuck in an infinite loop/lock but just too slow to unlock. That is unusual and shouldn't really happen.

  1. Can you try increasing the threshold (echo 30 > /proc/sys/kernel/watchdog_threshold) and see if it's still crashing?

  2. If that doesn't help, can you try to boot with the nmi_watchdog=0 kernel option set? This will disable lockup detection (which is not something you ever want to do normally); a command sketch follows this list. Then try to run your workload.

    • In such a state, can you check if the system is really locking up?
    • It may be that once you finish importing data and processing the Plex cache the situation normalizes (i.e. the CPU usage drops to idle and the system works as it should)
    • If the system locks up in such a state (i.e. it's not accessible) it may be a real lockup.
  3. Can you try crashing your system with the stress test above? We weren't able to, but maybe you will? We didn't try this on any of our MicroServer Gen8s, only on a VM.
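
For convenience, a sketch of the knobs mentioned in steps 1 and 2 above (note that on the affected builds the sysctl file is actually watchdog_thresh, as pointed out further down in this thread; the value is in seconds, and nmi_watchdog=0 has to go on the kernel command line at boot):

# step 1: raise the lockup detection threshold at runtime
echo 30 > /proc/sys/kernel/watchdog_thresh

# step 2: disable the NMI watchdog for the next boot by appending nmi_watchdog=0
# to the kernel command line in the loader's grub/json config, then verify after boot:
cat /proc/sys/kernel/nmi_watchdog    # 0 = disabled, 1 = enabled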

We found one more place which can technically cause an infinite loop under very specific circumstances, but that shouldn't cause a hard lockup crash even if it occurs (see 0065c89).

@OrpheeGT: your results look normal. The CPU usage peaks due to moments processing. The reported locks/unlocks are actually our debug messages to diagnose if something crashes within the lock - if you see lock => unlock sequence it's good, it would be a problem if it was a lock with no corresponding unlock.
Also, we see you removed the addresses (ffff.....) while posting a log snippet here - you don't have to. They have no identifying information in them (these are addresses in memory which are semi-random every reboot).

@WiteWulf
Copy link
Author

WiteWulf commented Oct 2, 2021

Okay, everyone else has gone to bed so I've got time to tinker :)

I ran part 1 of the test, running the influx-test container with the stress test data, and it crashed within seconds:

[41152.354507] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 1
[41152.389469] CPU: 1 PID: 5121 Comm: influxd Tainted: PF          O 3.10.108 #42218
[41152.425738] Hardware name: HP ProLiant MicroServer Gen8, BIOS J06 04/04/2019
[41152.460622]  ffffffff814a2759 ffffffff814a16b1 0000000000000010 ffff880409a48d60
[41152.498165]  ffff880409a48cf8 0000000000000000 0000000000000001 0000000000000001
[41152.535348]  0000000000000001 ffffffff80000001 0000000000000030 ffff8803f52ddc00
[41152.571634] Call Trace:
[41152.583761]  <NMI>  [<ffffffff814a2759>] ? dump_stack+0xc/0x15
[41152.612922]  [<ffffffff814a16b1>] ? panic+0xbb/0x1df
[41152.637273]  [<ffffffff810a9eb8>] ? watchdog_overflow_callback+0xa8/0xb0
[41152.670285]  [<ffffffff810db7d3>] ? __perf_event_overflow+0x93/0x230
[41152.701594]  [<ffffffff810da612>] ? perf_event_update_userpage+0x12/0xf0
[41152.734834]  [<ffffffff810152a4>] ? intel_pmu_handle_irq+0x1b4/0x340
[41152.766158]  [<ffffffff814a9d06>] ? perf_event_nmi_handler+0x26/0x40
[41152.797370]  [<ffffffff814a944e>] ? do_nmi+0xfe/0x440
[41152.822551]  [<ffffffff814a8a53>] ? end_repeat_nmi+0x1e/0x7e
[41152.849821]  <<EOE>> 
[41152.859964] Rebooting in 3 seconds..

NB: due to this crash, all subsequent tests were carried out with data-scrubbing running in the background and a consequent CPU and storage load imposed on the system.

Moving on, I tried setting echo 30 > /proc/sys/kernel/watchdog_threshold but /proc/sys/kernel/watchdog_threshold doesn't exist on my system, however /proc/sys/kernel/watchdog_thresh does. I don't know if this was a typo or a genuine difference between our systems.
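
For anyone checking their own build, a quick way to see which watchdog sysctls the kernel actually exposes (a generic /proc listing, nothing redpill-specific):

ls /proc/sys/kernel/ | grep -i watchdog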

I ran echo 30 > /proc/sys/kernel/watchdog_thresh instead and tried the stress test again...

Watching the output from top on the host while I ran it I saw that influxd was running at around 500% CPU until it panic'd again after running for less than a minute:

[  712.721022] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 0
[  712.756529] CPU: 0 PID: 18136 Comm: influxd Tainted: PF          O 3.10.108 #42218
[  712.793911] Hardware name: HP ProLiant MicroServer Gen8, BIOS J06 04/04/2019
[  712.828355]  ffffffff814a2759 ffffffff814a16b1 0000000000000010 ffff880409a08d60
[  712.864657]  ffff880409a08cf8 0000000000000000 0000000000000000 0000000000000001
[  712.900318]  0000000000000000 ffffffff80000001 0000000000000030 ffff8803f521d000
[  712.936297] Call Trace:
[  712.948230]  <NMI>  [<ffffffff814a2759>] ? dump_stack+0xc/0x15
[  712.976803]  [<ffffffff814a16b1>] ? panic+0xbb/0x1df
[  713.001207]  [<ffffffff810a9eb8>] ? watchdog_overflow_callback+0xa8/0xb0
[  713.034067]  [<ffffffff810db7d3>] ? __perf_event_overflow+0x93/0x230
[  713.065591]  [<ffffffff810da612>] ? perf_event_update_userpage+0x12/0xf0
[  713.098068]  [<ffffffff810152a4>] ? intel_pmu_handle_irq+0x1b4/0x340
[  713.129805]  [<ffffffff814a9d06>] ? perf_event_nmi_handler+0x26/0x40
[  713.161300]  [<ffffffff814a944e>] ? do_nmi+0xfe/0x440
[  713.185952]  [<ffffffff814a8a53>] ? end_repeat_nmi+0x1e/0x7e
[  713.213759]  <<EOE>> 
[  713.224341] Rebooting in 3 seconds..

Part 2, booting with nmi_watchdog=0. I ran the stress test and experienced no crashes. I left it running for about ten minutes with high CPU load and no crashes:
image

I then fired up my influxdb container that's usually guaranteed to crash. It showed high cpu for a few seconds after launch but didn't crash. I then ran the big import task that crashed it last time and it completed successfully with no crashes.

I then started up my other containers (domoticz, librenms, mysql, Grafana, ubooquity) with no crashes.

Finally I launched Plex Media Server, with no crash, and initiated a number of library updates and media analyses:
image

It did not crash during any of these tests with the NMI watchdog disabled, and the server remained responsive and usable throughout. I was watching the output of htop in an SSH session (which updated regularly) and navigating around the web UI.

Despite the obvious risk of running a system without the NMI watchdog running I'm going to leave it like this overnight and see how it goes.

I think we might be getting somewhere...

@OrpheeGT
Copy link

OrpheeGT commented Oct 3, 2021

@OrpheeGT: your results look normal. The CPU usage peaks due to moments processing. The reported locks/unlocks are actually our debug messages to diagnose if something crashes within the lock - if you see lock => unlock sequence it's good, it would be a problem if it was a lock with no corresponding unlock. Also, we see you removed the addresses (ffff.....) while posting a log snippet here - you don't have to. They have no identifying information in them (these are addresses in memory which are semi-random every reboot).

I did not hide anything, actually.
Maybe they are not shown in the telnet serial console?

Most of the time I watch the live telnet serial console instead of /var/log/messages.

@WiteWulf
Copy link
Author

WiteWulf commented Oct 4, 2021

@ttg-public on the forums you posted: "A different hard lockup seems to exist still but this shouldn't affect most of the people and judging by the last post in the issue it may not even be a bug but an over-reactive kernel protection mechanism."

It's significant that I never had this on 6.2.3 with Jun's bootloader, but have seen it on 6.2.4, 7.0 and 7.0.1 since moving to redpill. I appreciate that it may be an issue introduced by Synology between 6.2.3 and 6.2.4, but imho it's more likely an interaction between redpill, the kernel and the NMI watchdog.

One way to test this would be to have redpill load 6.2.3 and see if the problem persists. I don't know how much work would be involved in creating a 6.2.3 build target but it would potentially isolate the problem to either redpill or a change introduced by Synology in 6.2.4.

FWIW, I've been running this system with nmi_watchdog=0 since Saturday night with nothing obviously going wrong.

@OrpheeGT
Copy link

OrpheeGT commented Oct 5, 2021

May I ask, do you know why it does not happen with Jun's loader (soft and hard lockups)? Were you able to catch it?

@WiteWulf
Copy link
Author

WiteWulf commented Oct 6, 2021

This looks interesting: 19f745a

@WiteWulf
Copy link
Author

WiteWulf commented Oct 8, 2021

I just built a new DS3615xs 7.0.1-42218 image using the latest code commit (0df1ec8) and booted with the NMI watchdog enabled. The system survived the docker containers starting, but as soon as Plex started updating stuff it kernel panic'd.

I've rebooted with 'nmi_watchdog=0' set and it's stable again.

@WiteWulf
Copy link
Author

Hey folks, big development over on the forums:
https://xpenology.com/forum/topic/45795-redpill-the-new-loader-for-624-discussion/?do=findComment&comment=220150

tl;dr disabling my NC360T PCIe NIC and instead adding the tg3 extension and using the onboard NIC massively improves system stability.

I never thought to ask others suffering with kernel panics if they were also using a PCIe NIC. I'll gather info and report back.

@WiteWulf
Copy link
Author

Hmmm, not looking so good in the cold light of day :) Multiple reboots overnight when Plex started doing library updates and media analysis. I've asked Kouill to share their boot image with me so I can test on my hardware. I'll report back.

@WiteWulf
Copy link
Author

@ttg-public

I've found a strange behaviour with the NMI watchdog while rebooting and testing.

I currently have nmi_watchdog=0 in my grub.conf to keep the system stable. As demonstrated previously the system does not kernel panic under load with this.

I wanted to test a new bootloader and accidentally left nmi_watchdog=0 in the json file, so when the system booted it wouldn't crash. I spotted this and did 'echo 1 > /proc/sys/kernel/nmi_watchdog'. The kernel confirmed this on the console with 'NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter.' but subsequently running the influx-test it does not crash.

This is reproducible: booting with NMI watchdog enabled results in the previously demonstrated kernel panics, but booting with NMI watchdog disabled then enabling it once the system has finished booting leaves it stable.
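
To make that comparison concrete, a minimal sketch of the two boot scenarios (only the nmi_watchdog flag, the sysctl path and the kernel message come from the tests above; where exactly the flag lives in your grub/json config depends on your loader build):

# boot A: no nmi_watchdog=0 on the kernel command line -> the stress tests above panic the host
# boot B: boot with nmi_watchdog=0 on the command line, then re-enable the watchdog at runtime:
cat /proc/sys/kernel/nmi_watchdog       # reads 0 after booting with the flag
echo 1 > /proc/sys/kernel/nmi_watchdog  # console: "NMI watchdog: enabled on all CPUs, ..."
# re-running the influx stress test in this state did not crash the system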

@Lasx
Copy link

Lasx commented Feb 12, 2022

@ttg-public

I've found a strange behaviour with the NMI watchdog while rebooting and testing.

I currently have nmi_watchdog=0 in my grub.conf to keep the system stable. As demonstrated previously the system does not kernel panic under load with this.

I wanted to test a new bootloader and accidentally left nmi_watchdog=0 in the json file, so when the system booted it wouldn't crash. I spotted this and did 'echo 1 > /proc/sys/kernel/nmi_watchdog'. The kernel confirmed this on the console with 'NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter.' but subsequently running the influx-test it does not crash.

This is reproducible: booting with NMI watchdog enabled results in the previously demonstrated kernel panics, but booting with NMI watchdog disabled then enabling it once the system has finished booting leaves it stable.

Has the problem been solved? Can I upgrade?

@WiteWulf
Copy link
Author

WiteWulf commented Mar 3, 2022

I would say this was a workaround rather than a solution. As with all things related to redpill, it comes with a risk of data loss: don't use it with data you don't have backed up elsewhere or don't care about losing.
