Linux 6.9-rc2 breaks suspend on newer hardware - system doesn't wake up #9096

Closed
marmarek opened this issue Apr 7, 2024 · 6 comments
Labels: affects-4.2, C: kernel, C: power management, diagnosed, P: default, T: bug

Comments

marmarek commented Apr 7, 2024

Qubes OS release

R4.2

Brief summary

An openQA test run of Linux 6.9-rc2 failed the suspend test on several systems. The console log ends at the suspend of dom0 and contains nothing about resume.

Steps to reproduce

  1. Install 6.9-rc2 in dom0 (https://gitlab.com/QubesOS/qubes-linux-kernel/-/pipelines/1242555227)
  2. Suspend the system
  3. Try to wake it up

Expected behavior

The system wakes up normally. It works on 6.8.4.

Actual behavior

https://openqa.qubes-os.org/tests/overview?distri=qubesos&version=4.2&build=202404061643-4.2&groupid=12

Resume (or possibly still suspend?) fails on a number of systems. Specifically, it fails on all systems with ADL CPUs (including all certified systems with this CPU: both Dasharo FidelisGuard/Nitrokey Pro and Novacustom NV41), and also on AMD 4500U. It fails on Nitrokey Pro 2 too (14th gen CPU). However, it works on qemu, on all ThinkPad-based systems (much older CPUs), and, interestingly, on the StarLabs one (which has a 13th gen CPU).

Another factor common to most failed cases is Dasharo firmware (but the AMD system has stock firmware, so it isn't a 100% match).

marmarek added the T: bug, C: kernel, and P: default labels Apr 7, 2024
marmarek self-assigned this Apr 7, 2024
andrewdavidwong added the C: power management, affects-4.2, and needs diagnosis labels Apr 8, 2024

marmarek commented Apr 8, 2024

This looks like an issue with sys-net suspending. After a timeout, sys-usb and dom0 finally suspend properly, and after waking up, dom0 and sys-usb seem to be functional, but the network doesn't work anymore.

marmarek commented Apr 8, 2024

On at least one system, it is a deadlock on device unbind from the igc driver:

Details

[   84.553112] Call Trace:
[   84.553118]  <TASK>
[   84.553123]  __schedule+0x23b/0x5c0
[   84.553134]  schedule+0x27/0xa0
[   84.553142]  schedule_preempt_disabled+0x15/0x30
[   84.553152]  __mutex_lock.constprop.0+0x34c/0x6a0
[   84.553165]  unregister_netdevice_notifier+0x25/0xc0
[   84.553178]  netdev_trig_deactivate+0x1e/0x60 [ledtrig_netdev]
[   84.553195]  led_trigger_set+0x105/0x340
[   84.553206]  led_classdev_unregister+0x4a/0x110
[   84.553219]  release_nodes+0x3d/0xb0
[   84.553229]  devres_release_all+0x8c/0xc0
[   84.553238]  device_del+0x27a/0x3f0
[   84.553248]  unregister_netdevice_many_notify+0x46a/0x6a0
[   84.553260]  unregister_netdevice_queue+0xf0/0x130
[   84.553271]  unregister_netdev+0x1c/0x30
[   84.553280]  igc_remove+0xe3/0x1d0 [igc]
[   84.553298]  pci_device_remove+0x3f/0xb0
[   84.553308]  device_release_driver_internal+0x19f/0x200
[   84.553320]  unbind_store+0xa1/0xb0
[   84.553329]  kernfs_fop_write_iter+0x11f/0x200
[   84.553341]  vfs_write+0x293/0x460 
[   84.553351]  ksys_write+0x6f/0xf0  
[   84.553360]  do_syscall_64+0x87/0x170
[   84.553368]  ? syscall_exit_work+0xf3/0x120
[   84.553378]  ? syscall_exit_to_user_mode+0x69/0x220
[   84.553389]  ? do_syscall_64+0x96/0x170
[   84.553397]  ? do_syscall_64+0x96/0x170
[   84.553404]  ? do_syscall_64+0x96/0x170
[   84.553412]  ? do_syscall_64+0x96/0x170
[   84.553420]  ? __irq_exit_rcu+0x4b/0xb0
[   84.553429]  entry_SYSCALL_64_after_hwframe+0x71/0x79
[   84.553439] RIP: 0033:0x7b46ae7c5ee4
[   84.553446] RSP: 002b:00007ffe580c2dd8 EFLAGS: 00000202 ORIG_RAX: 0000000000000001 
[   84.553460] RAX: ffffffffffffffda RBX: 000000000000000d RCX: 00007b46ae7c5ee4
[   84.553474] RDX: 000000000000000d RSI: 00006458ac50b4b0 RDI: 0000000000000001 
[   84.553487] RBP: 00007ffe580c2e00 R08: 0000000000000073 R09: 0000000000000001
[   84.553500] R10: 0000000000000000 R11: 0000000000000202 R12: 000000000000000d 
[   84.553514] R13: 00006458ac50b4b0 R14: 00007b46ae8965c0 R15: 00007b46ae893f20
[   84.553528]  </TASK>   

Time to enable lockdep and see what happens.
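
One possible reading of the trace above, offered as an interpretation rather than something stated in this report: `unregister_netdev` already holds a non-recursive mutex when, further down the same call chain, `unregister_netdevice_notifier` blocks in `__mutex_lock` trying to take it again. The minimal Python sketch below models only that shape. The function names are borrowed from the trace purely as labels, `threading.Lock` is a stand-in for the kernel mutex, and the timeout exists only so the example returns instead of hanging.

```python
import threading

# Stand-in for a non-recursive kernel mutex (assumption: models rtnl_mutex).
rtnl_mutex = threading.Lock()


def unregister_notifier(timeout):
    """Inner step that needs the lock; returns True if it could take it."""
    acquired = rtnl_mutex.acquire(timeout=timeout)
    if acquired:
        rtnl_mutex.release()
    return acquired


def unregister_netdev(timeout=0.1):
    """Outer step: takes the lock, then reaches the inner step while holding it."""
    with rtnl_mutex:
        # The same thread already owns the lock, so this attempt cannot succeed.
        return unregister_notifier(timeout)


if __name__ == "__main__":
    print("inner step alone acquired lock:", unregister_notifier(0.1))
    print("inner step under outer lock acquired lock:", unregister_netdev())
```

Without the timeout, the nested acquire would block forever, which would match a task stuck in `__mutex_lock` as in the trace.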

marmarek commented Apr 8, 2024

Lockdep says:

Details

[   18.587056] igc 0000:00:07.0 ens7: PHC removed
[   18.589316] 
[   18.589322] ======================================================
[   18.589329] WARNING: possible circular locking dependency detected
[   18.589335] 6.9.0-rc2-1.qubes.fc32.x86_64 #378 Not tainted
[   18.589340] ------------------------------------------------------
[   18.589347] prepare-suspend/1145 is trying to acquire lock:
[   18.589352] ffff897494bc37b8 (&led_cdev->trigger_lock){+.+.}-{3:3}, at: led_classdev_unregister+0x32/0x110
[   18.589367] 
[   18.589367] but task is already holding lock: 
[   18.589373] ffffffffb034dfa8 (rtnl_mutex){+.+.}-{3:3}, at: unregister_netdev+0xe/0x20
[   18.589384]
[   18.589384] which lock already depends on the new lock.
[   18.589384] 
[   18.589391] 
[   18.589391] the existing dependency chain (in reverse order) is:
[   18.589399] 
[   18.589399] -> #1 (rtnl_mutex){+.+.}-{3:3}:
[   18.589407]        __mutex_lock+0xb2/0xbd0
[   18.589413]        set_device_name+0x2d/0x140 [ledtrig_netdev]
[   18.589423]        netdev_trig_activate+0x1a6/0x220 [ledtrig_netdev]
[   18.589432]        led_trigger_set+0x20f/0x340
[   18.589438]        led_trigger_register+0x16d/0x1a0
[   18.589443]        do_one_initcall+0x6f/0x3d0
[   18.589451]        do_init_module+0x60/0x240
[   18.589459]        init_module_from_file+0x86/0xc0
[   18.589465]        idempotent_init_module+0x126/0x2c0
[   18.589471]        __x64_sys_finit_module+0x5a/0xb0
[   18.589477]        do_syscall_64+0x96/0x190
[   18.589482]        entry_SYSCALL_64_after_hwframe+0x71/0x79
[   18.589490] 
[   18.589490] -> #0 (&led_cdev->trigger_lock){+.+.}-{3:3}:
[   18.589498]        __lock_acquire+0x13e7/0x2180
[   18.589505]        lock_acquire+0xd5/0x2f0
[   18.589510]        down_write+0x2a/0xc0
[   18.589515]        led_classdev_unregister+0x32/0x110
[   18.589522]        devres_release_all+0xb5/0x110
[   18.589530]        device_del+0x275/0x3f0
[   18.589535]        unregister_netdevice_many_notify+0x5ba/0x870
[   18.589543]        unregister_netdevice_queue+0xf3/0x130
[   18.589549]        unregister_netdev+0x18/0x20
[   18.589555]        igc_remove+0xe1/0x1c0 [igc]
[   18.589566]        pci_device_remove+0x3b/0xb0
[   18.589574]        device_release_driver_internal+0x1a5/0x210
[   18.589581]        unbind_store+0x9d/0xb0
[   18.589587]        kernfs_fop_write_iter+0x15b/0x210
[   18.589595]        vfs_write+0x2bd/0x560
[   18.589601]        ksys_write+0x71/0xf0
[   18.589608]        do_syscall_64+0x96/0x190
[   18.589614]        entry_SYSCALL_64_after_hwframe+0x71/0x79
[   18.589620] 
[   18.589620] other info that might help us debug this:
[   18.589620] 
[   18.589628]  Possible unsafe locking scenario:
[   18.589628] 
[   18.589635]        CPU0                    CPU1
[   18.589640]        ----                    ----
[   18.589645]   lock(rtnl_mutex);
[   18.589650]                                lock(&led_cdev->trigger_lock);
[   18.589657]                                lock(rtnl_mutex);
[   18.589664]   lock(&led_cdev->trigger_lock);
[   18.589670] 
[   18.589670]  *** DEADLOCK ***
[   18.589670] 
[   18.589676] 4 locks held by prepare-suspend/1145:
[   18.589682]  #0: ffff8974873a7420 (sb_writers#3){.+.+}-{0:0}, at: ksys_write+0x71/0xf0
[   18.589693]  #1: ffff897495886288 (&of->mutex){+.+.}-{3:3}, at: kernfs_fop_write_iter+0x114/0x210
[   18.589704]  #2: ffff8974820991b0 (&dev->mutex){....}-{3:3}, at: device_release_driver_internal+0x39/0x210
[   18.589715]  #3: ffffffffb034dfa8 (rtnl_mutex){+.+.}-{3:3}, at: unregister_netdev+0xe/0x20
[   18.589726] 
[   18.589726] stack backtrace:
[   18.589731] CPU: 1 PID: 1145 Comm: prepare-suspend Not tainted 6.9.0-rc2-1.qubes.fc32.x86_64 #378
[   18.589741] Hardware name: Xen HVM domU, BIOS 4.17.3 03/12/2024
[   18.589748] Call Trace:
[   18.589752]  <TASK>
[   18.589755]  dump_stack_lvl+0x73/0xb0
[   18.589761]  check_noncircular+0x148/0x160
[   18.589766]  ? stack_trace_save+0x4a/0x70
[   18.589773]  __lock_acquire+0x13e7/0x2180
[   18.589780]  lock_acquire+0xd5/0x2f0
[   18.589786]  ? led_classdev_unregister+0x32/0x110
[   18.589793]  down_write+0x2a/0xc0
[   18.589798]  ? led_classdev_unregister+0x32/0x110
[   18.589804]  led_classdev_unregister+0x32/0x110
[   18.589811]  devres_release_all+0xb5/0x110
[   18.589816]  device_del+0x275/0x3f0
[   18.589821]  unregister_netdevice_many_notify+0x5ba/0x870
[   18.589829]  unregister_netdevice_queue+0xf3/0x130
[   18.589835]  unregister_netdev+0x18/0x20
[   18.589840]  igc_remove+0xe1/0x1c0 [igc]
[   18.589850]  pci_device_remove+0x3b/0xb0
[   18.589855]  device_release_driver_internal+0x1a5/0x210
[   18.589861]  unbind_store+0x9d/0xb0
[   18.589867]  kernfs_fop_write_iter+0x15b/0x210
[   18.589874]  vfs_write+0x2bd/0x560
[   18.589880]  ksys_write+0x71/0xf0
[   18.589886]  do_syscall_64+0x96/0x190
[   18.589891]  ? find_held_lock+0x2b/0x80
[   18.589896]  ? lock_release+0x143/0x2c0
[   18.589902]  ? do_user_addr_fault+0x354/0x8a0
[   18.589909]  ? exc_page_fault+0x126/0x260
[   18.589916]  entry_SYSCALL_64_after_hwframe+0x71/0x79
[   18.589922] RIP: 0033:0x76426194fee4
[   18.589927] Code: c7 00 16 00 00 00 b8 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 80 3d 85 74 0d 00 00 74 13 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 54 c3 0f 1f 00 55 48 89 e5 48 83 ec 20 48 89
[   18.589946] RSP: 002b:00007ffe69a0ca98 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
[   18.589955] RAX: ffffffffffffffda RBX: 000000000000000d RCX: 000076426194fee4
[   18.589963] RDX: 000000000000000d RSI: 000058ae60024480 RDI: 0000000000000001
[   18.589971] RBP: 00007ffe69a0cac0 R08: 0000000000000000 R09: 0000000000000001
[   18.589979] R10: 0000000000000004 R11: 0000000000000202 R12: 000000000000000d
[   18.589987] R13: 000058ae60024480 R14: 0000764261a205c0 R15: 0000764261a1df20
[   18.589997]  </TASK>
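
The report above describes two code paths that take the same pair of locks in opposite order: the trigger registration path takes `&led_cdev->trigger_lock` and then `rtnl_mutex`, while the unbind path holds `rtnl_mutex` and then asks for `&led_cdev->trigger_lock`. The sketch below is a simplified illustration of that kind of check, not lockdep's actual implementation: it records "acquired while holding" edges and searches for a cycle. The function `find_lock_cycle` and its input format are hypothetical.

```python
from collections import defaultdict


def find_lock_cycle(chains):
    """Return a list of lock names forming a cycle, or None if there is none.

    Each chain lists locks in the order one code path acquires them.
    """
    # Edge A -> B means "B was acquired while A was held".
    graph = defaultdict(set)
    for chain in chains:
        for i, held in enumerate(chain):
            for later in chain[i + 1:]:
                graph[held].add(later)

    def dfs(node, path, on_path, done):
        if node in on_path:
            # Reached a lock already on the current path: cycle found.
            return path[path.index(node):] + [node]
        if node in done:
            return None
        on_path.add(node)
        path.append(node)
        for nxt in sorted(graph[node]):
            cycle = dfs(nxt, path, on_path, done)
            if cycle:
                return cycle
        path.pop()
        on_path.remove(node)
        done.add(node)
        return None

    done = set()
    for start in sorted(graph):
        cycle = dfs(start, [], set(), done)
        if cycle:
            return cycle
    return None


if __name__ == "__main__":
    chains = [
        # Chain #1 in the report: trigger registration path.
        ["led_cdev->trigger_lock", "rtnl_mutex"],
        # Chain #0 in the report: device unbind path.
        ["rtnl_mutex", "led_cdev->trigger_lock"],
    ]
    cycle = find_lock_cycle(chains)
    print(" -> ".join(cycle) if cycle else "no cycle")
```

If both paths took the locks in the same order, the graph would have no cycle and the function would return `None`.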

marmarek commented Apr 8, 2024

Reported here: https://lore.kernel.org/netdev/ZhRD3cOtz5i-61PB@mail-itl/T/#u

marmarek added the waiting for upstream label Apr 8, 2024
andrewdavidwong added the diagnosed label and removed the needs diagnosis label Apr 9, 2024
@marmarek
Member Author

A fix has already been developed: https://lore.kernel.org/netdev/20240411-igc_led_deadlock-v2-1-b758c0c88b2b@linutronix.de/T/#u. Hopefully it will make it into the next -rc.

marmarek commented May 2, 2024

The fix is merged; Linux 6.9-rc6 doesn't have this problem anymore.

marmarek closed this as completed May 2, 2024
andrewdavidwong removed the waiting for upstream label May 2, 2024