Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Combination of dom0 busy cpu and suspension triggers soft lock sometimes #7795

Closed
logoerthiner1 opened this issue Sep 27, 2022 · 2 comments
Closed
Labels
affects-4.1 This issue affects Qubes OS 4.1. C: kernel P: default Priority: default. Default priority for new issues, to be replaced given sufficient information. T: bug Type: bug report. A problem or defect resulting in unintended behavior in something that exists.

Comments

@logoerthiner1
Copy link

How to file a helpful issue

Qubes OS release

R4.1, kernel 5.18.16 / 5.15.64 (I believe that former exhibits the behavior mostly as the kernel is SMP PREEMPT_DYNAMIC)

Brief summary

I am still suffering from #7340 and testing in which scenario the suspension fails. I was testing running the same command systemctl suspend over and over under different circumstances, with various number of VM open and various xenpm options (#7794), and comparing with .
Here I report a behavior related to some soft lock similar to #7696. This can be somewhat reliably reproduced.

When I start zero vms and suspend with only dom0, the suspension success rate is high (but not 100%, the other cases falls in #7340 mostly). When I run a command python3 -c "while 1:pass" in another terminal, the suspension goes less smoothly - it does not fail with a #7340 but with a #7696 - after several times of suspension, the later suspensions have long delay or even unable to continue; at one time when this happens (sustemctl suspend run but the system does not go to sleep), I tried to Ctrl-C the testing python3 process and the moment it dies the system sleeps. It seems that dom0 cpu usage is special and when dom0 cpu is in constant usage, many operations get unstable.

python3 -c "while 1:pass" is a testing command that does nothing but keep the CPU busy in dom0, however the behavior matches many existing issues and can be seem as a hint on many stability problems with various setups. I wonder if anyone can reproduce similar issues.

Steps to reproduce

  1. boot command for xen: cpufreq=xen:powersave (may be related; however without this my suspension experiment does not keep long though)
  2. with no VM started, try repeatly systemctl suspend to make sure that your suspension works mostly
  3. open a dom0 terminal with python3 -c "while 1: pass"
  4. repeatly `systemctl suspend

Expected behavior

Suspension works equally and no soft lock happens

Actual behavior

Soft lock occurs more and more frequently: usually after 3 suspensions the system gets unstable with softlock problems

@logoerthiner1 logoerthiner1 added P: default Priority: default. Default priority for new issues, to be replaced given sufficient information. T: bug Type: bug report. A problem or defect resulting in unintended behavior in something that exists. labels Sep 27, 2022
@andrewdavidwong andrewdavidwong added C: kernel needs diagnosis Requires technical diagnosis from developer. Replace with "diagnosed" or remove if otherwise closed. labels Sep 27, 2022
@andrewdavidwong andrewdavidwong added this to the Release 4.1 updates milestone Sep 27, 2022
@logoerthiner1
Copy link
Author

@marmarek @DemiMarie

When I am doing suspend-resuming experiment with one core python3 -c "while 1:pass", on one suspend the computer become non-responsive. This is probably a more general dom0 linux problem.

xen 4.14.5-10 kernel 6.0.2-2

[  249.271098] ata1.00: exception Emask 0x0 SAct 0x3c SErr 0x0 action 0x6 frozen
[  249.271107] ata1.00: failed command: WRITE FPDMA QUEUED
[  249.271109] ata1.00: cmd 61/48:10:50:1d:62/00:00:d9:00:00/40 tag 2 ncq dma 36864 out
                        res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  249.271117] ata1.00: status: { DRDY }
[  249.271120] ata1.00: failed command: WRITE FPDMA QUEUED
[  249.271122] ata1.00: cmd 61/08:18:80:46:76/00:00:d9:00:00/40 tag 3 ncq dma 4096 out
                        res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  249.271128] ata1.00: status: { DRDY }
[  249.271130] ata1.00: failed command: WRITE FPDMA QUEUED
[  249.271131] ata1.00: cmd 61/08:20:f8:86:b2/00:00:d9:00:00/40 tag 4 ncq dma 4096 out
                        res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  249.271136] ata1.00: status: { DRDY }
[  249.271138] ata1.00: failed command: WRITE FPDMA QUEUED
[  249.271139] ata1.00: cmd 61/08:28:08:87:b2/00:00:d9:00:00/40 tag 5 ncq dma 4096 out
                        res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  249.271144] ata1.00: status: { DRDY }
[  249.271149] ata1: hard resetting link
[  249.579615] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[  254.645146] ata1.00: qc timeout (cmd 0xec)
[  254.646263] ata1.00: failed to IDENTIFY (I/O error, err_mask=0x4)
[  254.646270] ata1.00: revalidation failed (errno=-5)
[  254.646274] ata1: hard resetting link
[  254.955442] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[  265.397155] ata1.00: qc timeout (cmd 0xec)
[  265.397165] ata1.00: failed to IDENTIFY (I/O error, err_mask=0x4)
[  265.397168] ata1.00: revalidation failed (errno=-5)
[  265.397183] ata1: limiting SATA link speed to 3.0 Gbps
[  265.397186] ata1: hard resetting link
[  265.707377] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
[  296.117137] ata1.00: qc timeout (cmd 0xec)
[  296.118235] ata1.00: failed to IDENTIFY (I/O error, err_mask=0x4)
[  296.118242] ata1.00: revalidation failed (errno=-5)
[  296.118245] ata1.00: disable device
[  296.427795] EXT4-fs error (de[  296.427509] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 320)vice dm-4) in ex
t4_writepages:28[  296.427546] ata1: EH complete40: IO failure

[  296.427599] sd 0:0:0:0: [sda] tag#28 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=77s
[  296.427617] sd 0:0:0:0: [sda] tag#28 CDB: Write(10) 2a 00 d9 b2 87 08 00 00 08 00
[  296.427619] I/O error, dev sda, sector 3652355848 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 2
[  296.427632] EXT4-fs warning (device dm-4): ext4_end_bio:343: I/O error 10 writing to inode 1182507 starting block 2920449)
[  296.427638] Buffer I/O error on device dm-4, logical block 2920449
[  296.427657] sd 0:0:0:0: [sda] tag#29 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=77s
[  296.427660] sd 0:0:0:0: [sda] tag#29 CDB: Write(10) 2a 00 d9 b2 86 f8 00 00 08 00
[  296.427662] I/O error, dev sda, sector 3652355832 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 2
[  296.427667] EXT4-fs warning (device dm-4): ext4_end_bio:343: I/O error 10 writing to inode 1182507 starting block 2920447)
[  296.427670] Buffer I/O error on device dm-4, logical block 2920447
[  296.427675] sd 0:0:0:0: [sda] tag#31 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=77s
[  296.427678] sd 0:0:0:0: [sda] tag#31 CDB: Write(10) 2a 00 d9 76 46 80 00 00 08 00
[  296.427679] I/O error, dev sda, sector 3648407168 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 2
[  296.427684] EXT4-fs warning (device dm-4): ext4_end_bio:343: I/O error 10 writing to inode 1182507 starting block 3114496)
[  296.427688] Buffer I/O error on device dm-4, logical block 3114496
[  296.427693] sd 0:0:0:0: [sda] tag#6 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=97s
[  296.427695] sd 0:0:0:0: [sda] tag#6 CDB: Write(10) 2a 00 d9 62 1d 50 00 00 48 00
[  296.427984] E[  296.427697] I/O error, dev sda, sector 3647085904 op 0x1:(WRITE) flags 0x800 phys_seg 9 prio class 2XT4-fs error (de
vice dm-4): ext4[  296.427770] sd 0:0:0:0: [sda] tag#7 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s_journal_check_s
tart:83: comm sy[  296.427771] Aborting journal on device dm-4-8.stemd-journal: D
etected aborted [  296.427779] sd 0:0:0:0: [sda] tag#7 CDB: Write(10) 2a 00 d9 b2 87 28 00 00 10 00journal

[  296.427782] I/O error, dev sda, sector 3652355880 op 0x1:(WRITE) flags 0x0 phys_seg 2 prio class 2
[  296.427795] EXT4-fs error (device dm-4) in ext4_writepages:2840: IO failure
[  296.427798] EXT4-fs warning (device dm-4): ext4_end_bio:343: I/O error 10 writing to inode 1182507 starting block 2920453)
[  296.427984] EXT4-fs error (device dm-4): ext4_journal_check_start:83: comm systemd-journal: Detected aborted journal
[  296.429153] Buffer I/O error on device dm-4, logical block 2920453
[  296.429159] Buffer I/O error on device dm-4, logical block 2920454
[  296.444478] sd 0:0:0:0: [sda] tag#8 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
[  296.444479] sd 0:0:0:0: [sda] tag#8 CDB: Write(10) 2a 00 d9 b2 87 50 00 00 08 00
[  296.444480] I/O error, dev sda, sector 3652355920 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 2
[  296.444482] EXT4-fs warning (device dm-4): ext4_end_bio:343: I/O error 10 writing to inode 1182507 starting block 2920458)[  296.450157] E
XT4-fs error (de[  296.444484] Buffer I/O error on device dm-4, logical block 2920458vice dm-4): ext4
_journal_check_s[  296.444486] sd 0:0:0:0: [sda] tag#11 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0start:83: comm kw
orker/u16:26: De[  296.444487] sd 0:0:0:0: [sda] tag#11 CDB: Write(10) 2a 00 d9 b2 87 60 00 00 08 00tected aborted j
ournal
[  296.444488] I/O error, dev sda, sector 3652355936 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 2
[  296.444490] EXT4-fs warning (device dm-4): ext4_end_bio:343: I/O error 10 writing to inode 1182507 starting block 2920460)
[  296.444491] Buffer I/O error on device dm-4, logical block 2920460
[  296.444514] sd 0:0:0:0: [sda] tag#12 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
[  296.444515] sd 0:0:0:0: [sda] tag#12 CDB: Write(10) 2a 00 d9 b2 87 78 00 00 10 00
[  296.444516] I/O error, dev sda, sector 3652355960 op 0x1:(WRITE) flags 0x0 phys_seg 2 prio class 2
[  296.444519] EXT4-fs warning (device dm-4): ext4_end_bio:343: I/O error 10 writing to inode 1182507 starting block 2920464)
[  296.444520] Buffer I/O error on device dm-4, logical block 2920463
[  296.444521] Buffer I/O error on device dm-4, logical block 2920464
[  296.444525] sd 0:0:0:0: [sda] tag#13 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
[  296.453631] E[  296.444526] sd 0:0:0:0: [sda] tag#13 CDB: Write(10) 2a 00 d9 b2 87 d8 00 00 08 00XT4-fs (dm-4): R
emounting filesy[  296.444527] I/O error, dev sda, sector 3652356056 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 2stem read-only

[  296.453632] E[  296.444529] EXT4-fs warning (device dm-4): ext4_end_bio:343: I/O error 10 writing to inode 1182507 starting block 2920475)XT4-fs (dm-4): R
emounting filesy[  296.444530] Buffer I/O error on device dm-4, logical block 2920475stem read-only

[  296.462102] E[  296.444533] sd 0:0:0:0: [sda] tag#14 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0sXT4-fs (dm-4): f
ailed to convert[  296.444534] sd 0:0:0:0: [sda] tag#14 CDB: Write(10) 2a 00 d9 b2 88 08 00 00 08 00 unwritten exten
ts to written ex[  296.444534] I/O error, dev sda, sector 3652356104 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 2tents -- potenti
al data loss!  ([  296.444537] EXT4-fs warning (device dm-4): ext4_end_bio:343: I/O error 10 writing to inode 1182507 starting block 2920481)inode 1182507, e
rror -30)
[  296.444542] EXT4-fs warning (device dm-4): ext4_end_bio:343: I/O error 10 writing to inode 1182507 starting block 2920483)[  296.465600] E
XT4-fs (dm-4): f[  296.450132] Buffer I/O error on dev dm-4, logical block 2655236, lost sync page writeailed to convert
 unwritten exten[  296.450157] EXT4-fs error (device dm-4): ext4_journal_check_start:83: comm kworker/u16:26: Detected aborted journalts to written ex
tents -- potenti[  296.450162] Buffer I/O error on dev dm-4, logical block 0, lost sync page writeal data loss!  (
inode 1182507, e[  296.450641] JBD2: Error -5 detected when updating journal superblock for dm-4-8.rror -30)

[  296.453000] EXT4-fs (dm-4): previous I/O error to superblock detected[  296.469435] E
XT4-fs (dm-4): f[  296.453048] Buffer I/O error on dev dm-4, logical block 0, lost sync page writeailed to convert
 unwritten exten[  296.453063] EXT4-fs (dm-4): previous I/O error to superblock detectedts to written ex
tents -- potenti[  296.453619] Buffer I/O error on dev dm-4, logical block 0, lost sync page writeal data loss!  (
inode 1182507, e[  296.453629] EXT4-fs (dm-4): I/O error while writing superblockrror -30)

[  296.453631] EXT4-fs (dm-4): Remounting filesystem read-only[  296.472658] E
XT4-fs (dm-4): f[  296.453632] EXT4-fs (dm-4): Remounting filesystem read-onlyailed to convert
 unwritten exten[  296.462102] EXT4-fs (dm-4): failed to convert unwritten extents to written extents -- potential data loss!  (inode 1182507, ts to written exerror -30)tents -- potenti
al data loss!  ([  296.465600] EXT4-fs (dm-4): failed to convert unwritten extents to written extents -- potential data loss!  (inode 1182507, inode 1182755, eerror -30)rror -30)

[  296.469435] EXT4-fs (dm-4): failed to convert unwritten extents to written extents -- potential data loss!  (inode 1182507, error -30)
[  296.472658] EXT4-fs (dm-4): failed to convert unwritten extents to written extents -- potential data loss!  (inode 1182755, error -30)

@logoerthiner1
Copy link
Author

Close with #7340

@andrewdavidwong andrewdavidwong removed the needs diagnosis Requires technical diagnosis from developer. Replace with "diagnosed" or remove if otherwise closed. label Nov 24, 2022
@andrewdavidwong andrewdavidwong added the affects-4.1 This issue affects Qubes OS 4.1. label Aug 8, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
affects-4.1 This issue affects Qubes OS 4.1. C: kernel P: default Priority: default. Default priority for new issues, to be replaced given sufficient information. T: bug Type: bug report. A problem or defect resulting in unintended behavior in something that exists.
Projects
None yet
Development

No branches or pull requests

2 participants