page fault when booting Centos and RHEL HVMs #1943

Closed
lorenzog opened this Issue May 3, 2016 · 23 comments

lorenzog commented May 3, 2016

Qubes OS version (e.g., R3.1):

3.1

Affected TemplateVMs (e.g., fedora-23, if applicable):

None - affects HVMs


Expected behavior:

HVM boots regularly

Actual behavior:

HVM does not boot 9 times out of 10 - logs show a page fault error

Steps to reproduce the behavior:

  • Install CentOS 7 or RHEL from CD/DVD
  • Do a default install on xvda - no customisation, minimal system
  • Reboot

General notes:

Installation from CDROM/DVD works fine - system boots without a problem. However once installed, the first reboot never gets anywhere but instead shows a blank screen with a non-blinking cursor. VM shutdown via Qubes VM manager doesn't stop it; I have to kill it instead.

This happens with the following VMs:

  • A CentOS 7 VM migrated from VirtualBox using qemu-img
  • A freshly installed CentOS 7 HVM, using an ISO from the CentOS website
  • A freshly installed CentOS 7 HVM (kernel: 3.10.0-327.el7.x86_64)
  • A freshly installed RHEL 7.2 (same kernel as CentOS)

This does NOT happen on Ubuntu Server 16.04, kernel 4.4.0-21.

The -dm.log file says (error at the bottom):


Xen Minimal OS!
  start_info: 0x56e000(VA)
    nr_pages: 0x2c00
  shared_inf: 0x282da000(MA)
     pt_base: 0x571000(VA)
nr_pt_frames: 0x7
    mfn_list: 0x558000(VA)
   mod_start: 0x0(VA)
     mod_len: 0
       flags: 0x0
    cmd_line:  -d 40
       stack: 0x516cc0-0x536cc0
MM: Init
      _text: 0x0(VA)
     _etext: 0x109012(VA)
   _erodata: 0x158000(VA)
     _edata: 0x15e4c8(VA)
stack start: 0x516cc0(VA)
       _end: 0x5575c8(VA)
  start_pfn: 57b
    max_pfn: 2c00
Mapping memory range 0x800000 - 0x2c00000
setting 0x0-0x158000 readonly
skipped 1000
MM: Initialise page allocator for 58d000(58d000)-2c00000(2c00000)
MM: done
Demand map pfns at 2c01000-0x2002c01000.
Heap resides at 2002c02000-4002c02000.
Initialising timer interface
Initialising console ... done.
gnttab_table mapped at 0x2c01000.
Initialising scheduler
Thread "Idle": pointer: 0x0x2002c02050, stack: 0x0x5b0000
Thread "xenstore": pointer: 0x0x2002c02800, stack: 0x0x5c0000
xenbus initialised on irq 1 mfn 0x2fa98
Thread "shutdown": pointer: 0x0x2002c02fb0, stack: 0x0x5d0000
main.c: dummy main: start_info=0x536dc0
Thread "main": pointer: 0x0x2002c03760, stack: 0x0x5e0000
Thread "pcifront": pointer: 0x0x2002c03f50, stack: 0x0x5f0000
pcifront_watches: waiting for backend path to appear device/pci/0/backend
dom vm is at /vm/8a25f436-c040-43a2-8ea6-f2a670af5321
"main" "-d" "40" "-d" "40" "-domain-name" "rhel7.2server" "-vnc" "none" "-videoram" "16" "-std-vga" "-boot" "dc" "-usb" "-usbdevice" "tablet" "-acpi" "-vcpus" "8" "-vcpu_avail" "0xff" "-net" "nic,vlan=0,macaddr=00:16:3e:5e:6c:19,model=rtl8139" "-net" "tap,vlan=0,ifname=vif40.0-emu,bridge=(null),script=no,downscript=no" "-net" "lwip,client_ip=10.137.2.27,server_ip=10.137.2.254,dns=10.137.2.1,gw=10.137.2.1,netmask=255.255.255.0" 
domid: 40
domid: 40
************************ NETFRONT for device/vif/0 **********


net TX ring size 256
net RX ring size 256
backend at /local/domain/2/backend/vif/41/0
mac is 00:16:3e:5e:6c:19
**************************
tap_open((null)) -> 3
Waiting for network.
IP a8902fe netmask ffffff00 gateway a890201.
TCP/IP bringup begins.
Thread "tcpip_thread": pointer: 0x0x2002c07d60, stack: 0x0x710000
TCP/IP bringup ends.
registering DHCP server
Network is ready.
xs_daemon_open -> 5, 0x15ce48
Using xvda for guest's hda
******************* BLKFRONT for /local/domain/41/device/vbd/51712 **********


backend at /local/domain/0/backend/vbd/41/51712
41943040 sectors of 512 bytes
**************************
blk_open(/local/domain/41/device/vbd/51712) -> 6
Using xvdb for guest's hdb
******************* BLKFRONT for /local/domain/41/device/vbd/51728 **********


backend at /local/domain/0/backend/vbd/41/51728
4194304 sectors of 512 bytes
**************************
blk_open(/local/domain/41/device/vbd/51728) -> 7
xs_directory(/local/domain/41/device/vkbd): ENOENT
xs_directory(/local/domain/41/device/vfb): ENOENT
xs_watch(device-model/40/logdirty/cmd, logdirty)
Watching device-model/40/logdirty/cmd
xs_watch(device-model/40/command, dm-command)
Watching device-model/40/command
xs_watch(/local/domain/40/cpu, vcpu-set)
Watching /local/domain/40/cpu
xs_read(/local/domain/0/backend/pci/40/0/msitranslate): EACCES
xs_read(/local/domain/0/backend/pci/40/0/power_mgmt): EACCES
qemu_map_cache_init nr_buckets = 10000 size 4194304
shared page at pfn feffd
buffered io page at pfn feffb
Guest uuid = 8a25f436-c040-43a2-8ea6-f2a670af5321
xs_watch(/local/domain/0/backend/console/40, be:0x144fb4:40:0x1583e0)
xs_directory(/local/domain/0/backend/console/40): EACCES
xs_watch(/local/domain/0/backend/vkbd/40, be:0x14045c:40:0x158380)
xs_directory(/local/domain/0/backend/vkbd/40): EACCES
evtchn_open() -> 8
xc_evtchn_bind_interdomain(40, 3) = 0
xc_evtchn_bind_interdomain(40, 5) = 0
xc_evtchn_bind_interdomain(40, 6) = 0
xc_evtchn_bind_interdomain(40, 7) = 0
xc_evtchn_bind_interdomain(40, 8) = 0
xc_evtchn_bind_interdomain(40, 9) = 0
xc_evtchn_bind_interdomain(40, 10) = 0
xc_evtchn_bind_interdomain(40, 11) = 0
xc_evtchn_bind_interdomain(40, 4) = 0
populating video RAM at ff000000
mapping video RAM from ff000000
xs_read(device-model/40/disable_pf): ENOENT
Register xen platform.
Done register platform.
xs_watch(/local/domain/40/log-throttling, /local/domain/40/log-throttling)
platform_fixed_ioport: changed ro/rw state of ROM memory area. now is rw state.
qubes_gui/init: 657
qubes_gui/init: 666
qubes_gui/init: 669
qubes_gui/init: 678
evtchn_open() -> 10
xc_evtchn_bind_unbound_port(0) = 0
xs_daemon_open -> 11, 0x15cf08
qubes_gui/init[708]: version sent, waiting for xorg conf
qubes gui initialized
resize to 640x480@32, 2560 required
can't store dev vc:80Cx24C name for domid 40 in /serial/0 from a stub domain
xs_read_watch() -> /local/domain/40/log-throttling /local/domain/40/log-throttling
xs_read(/local/domain/40/log-throttling): ENOENT
xs_read(/local/domain/40/log-throttling): read error
qemu: ignoring not-understood drive `/local/domain/40/log-throttling'
medium change watch on `/local/domain/40/log-throttling' - unknown device, ignored
resize to 720x400@32, 2880 required
xs_read_watch() -> /local/domain/40/cpu vcpu-set
vcpu-set: watch node error.
[xenstore_process_vcpu_set_event]: /local/domain/40/cpu has no CPU!
I/O request not ready: 0, ptr: 0, port: 0, data: 0, count: 0, size: 0
xs_read_watch() -> device-model/40/command dm-command
xs_read(device-model/40/command): ENOENT
I/O request not ready: 0, ptr: 0, port: 0, data: 0, count: 0, size: 0
xs_read_watch() -> device-model/40/logdirty/cmd logdirty
xs_read(device-model/40/logdirty/cmd): ENOENT
Log-dirty: no command yet.
I/O request not ready: 0, ptr: 0, port: 0, data: 0, count: 0, size: 0
I/O request not ready: 0, ptr: 0, port: 0, data: 0, count: 0, size: 0
I/O request not ready: 0, ptr: 0, port: 0, data: 0, count: 0, size: 0
I/O request not ready: 0, ptr: 0, port: 0, data: 0, count: 0, size: 0
I/O request not ready: 0, ptr: 0, port: 0, data: 0, count: 0, size: 0
I/O request not ready: 0, ptr: 0, port: 0, data: 0, count: 0, size: 0
vga s->lfb_addr = f0000000 s->lfb_end = f1000000 
platform_fixed_ioport: changed ro/rw state of ROM memory area. now is rw state.
platform_fixed_ioport: changed ro/rw state of ROM memory area. now is ro state.
xs_daemon_open -> 12, 0x15cf28
qubes_gui/init[719]: got xorg conf, creating window
qubes_gui/init: 726
dumping mfns: n=282, w=720, h=400, bpp=32
configure msg, x/y 600 365 (was 0 0), w/h 720 400
Unknown PV product 3 loaded in guest
PV driver build 1
close(6)
close blk: backend=/local/domain/0/backend/vbd/41/51712 node=/local/domain/41/device/vbd/51712
close(7)
close blk: backend=/local/domain/0/backend/vbd/41/51728 node=/local/domain/41/device/vbd/51728
region type 1 at [c100,c200).
region type 0 at [f2000000,f2000100).
squash iomem [f2000000, f2000100).
close(3)
close network: backend at /local/domain/2/backend/vif/41/0
vga s->lfb_addr = f0000000 s->lfb_end = f1000000 
vga s->lfb_addr = f0000000 s->lfb_end = f1000000 
vga s->lfb_addr = f0000000 s->lfb_end = f1000000 
vga s->lfb_addr = f0000000 s->lfb_end = f1000000 
vga s->lfb_addr = f0000000 s->lfb_end = f1000000 
vga s->lfb_addr = f0000000 s->lfb_end = f1000000 
vga s->lfb_addr = f0000000 s->lfb_end = f1000000 
Page fault at linear address 427aa28, rip 20759, regs 0x5ef5b8, sp 5ef660, our_sp 0x5ef580, code 2
Thread: main
RIP: e030:[<0000000000020759>] 
RSP: e02b:00000000005ef660  EFLAGS: 00010202
RAX: 000000000427aa28 RBX: 00000000e4a73a28 RCX: 0000000000000001
RDX: 0000000000000008 RSI: 0000002002c752fa RDI: 000000000427aa28
RBP: 00000000005ef660 R08: 0000000000000004 R09: 00000000f2000000
R10: 0000000000000000 R11: 000000000000000c R12: 0000000000000008
R13: 0000000000000008 R14: 0000002002c752fa R15: 0000000000001000
base is 0x5ef660 caller is 0x20f28
base is 0x5ef6b0 caller is 0x50c22
base is 0x5ef950 caller is 0x50e9d
base is 0x5ef980 caller is 0x35b5
base is 0x5ef9a0 caller is 0x6a86
base is 0x5efa10 caller is 0x21f27
base is 0x5efa60 caller is 0x950d
base is 0x5efdf0 caller is 0xd7f27
base is 0x5effe0 caller is 0x3423

5ef650: 60 f6 5e 00 00 00 00 00 2b e0 00 00 00 00 00 00
5ef660: b0 f6 5e 00 00 00 00 00 28 0f 02 00 00 00 00 00
5ef670: 08 00 00 00 00 00 00 00 50 89 d8 02 01 00 00 00
5ef680: b0 f6 5e 00 00 00 00 00 b0 52 c7 02 20 00 00 00

5ef650: 60 f6 5e 00 00 00 00 00 2b e0 00 00 00 00 00 00
5ef660: b0 f6 5e 00 00 00 00 00 28 0f 02 00 00 00 00 00
5ef670: 08 00 00 00 00 00 00 00 50 89 d8 02 01 00 00 00
5ef680: b0 f6 5e 00 00 00 00 00 b0 52 c7 02 20 00 00 00

20740: c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00
20750: 55 48 89 e5 89 d1 c1 e9 03 f3 48 a5 f7 c2 04 00
20760: 00 00 74 01 a5 f7 c2 02 00 00 00 74 02 66 a5 f7
20770: c2 01 00 00 00 74 01 a4 5d c3 55 48 89 e5 5d c3
Pagetable walk from virt 427aa28, base 571000:
 L4 = 0000000033932067 (0x572000)  [offset = 0]
  L3 = 000000003ef5a067 (0x573000)  [offset = 0]
   L2 = 000000003156c067 (0x700000)  [offset = 21]
    L1 = 0000000000000000 [offset = 7a]
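
For readers decoding the dump above, here is a minimal sketch of how the page-table offsets and the error code relate to the faulting address. It assumes the standard x86-64 4-level paging layout (9 index bits per level, 12-bit page offset); the function names are illustrative and are not part of Mini-OS or qemu. As a side note, the bytes at rip 0x20759 (f3 48 a5) decode as rep movsq, and RDI equals the faulting address, so this looks like a memory copy writing to an unmapped page.

```python
# Decode an x86-64 4-level page-table walk and a page-fault error code.
# Assumes standard 4 KiB paging; helper names are illustrative only.

def walk_offsets(virt):
    """Return the (L4, L3, L2, L1) table indexes for a virtual address."""
    return tuple((virt >> shift) & 0x1FF for shift in (39, 30, 21, 12))


def decode_error_code(code):
    """Decode the low three bits of an x86 page-fault error code."""
    return {
        "present": bool(code & 1),  # False: no mapping for the page
        "write": bool(code & 2),    # True: the access was a write
        "user": bool(code & 4),     # False: user-mode bit is clear
    }


# Faulting address and error code taken from the log above.
print([hex(o) for o in walk_offsets(0x427AA28)])
print(decode_error_code(2))
```

The L2 and L1 indexes (0x21 and 0x7a) match the offsets printed in the walk, and the empty L1 entry agrees with the "present" bit being clear.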


Related issues:

Relevant labels:

@marmarek marmarek added this to the Release 3.1 updates milestone May 3, 2016

marmarek May 3, 2016

Member

This looks to be somewhere in the USB emulation code.

PS: I've cut the log down to only the last VM run.

lorenzog May 3, 2016

5 minutes after writing this I realised the problem was the kernel. Upgrading to a 4.x kernel solved the issue for RHEL. As always, one feels better after they've called the doctor....

Now investigating Centos7 before closing this

lorenzog May 3, 2016

Update - RHEL booted successfully once, after that never booted again. Error seems to be identical (see report below).

The VM does not have any USB or PCI devices attached to it. Could it be a video output problem?

[...] 
vga s->lfb_addr = f0000000 s->lfb_end = f1000000 
vga s->lfb_addr = f0000000 s->lfb_end = f1000000 
vga s->lfb_addr = f0000000 s->lfb_end = f1000000 
Page fault at linear address 4307008, rip 20759, regs 0x5ef5b8, sp 5ef660, our_sp 0x5ef580, code 2
Thread: main
RIP: e030:[<0000000000020759>] 
RSP: e02b:00000000005ef660  EFLAGS: 00010202
RAX: 0000000004307008 RBX: 00000000ec000008 RCX: 0000000000000001
RDX: 0000000000000008 RSI: 0000002002c752fa RDI: 0000000004307008
RBP: 00000000005ef660 R08: 0000000000000004 R09: 00000000f2000000
R10: 0000000000000000 R11: 000000000000000c R12: 0000000000000008
R13: 0000000000000008 R14: 0000002002c752fa R15: 0000000000001000
base is 0x5ef660 caller is 0x20f28
base is 0x5ef6b0 caller is 0x50c22
base is 0x5ef950 caller is 0x50e9d
base is 0x5ef980 caller is 0x35b5
base is 0x5ef9a0 caller is 0x6a86
base is 0x5efa10 caller is 0x21f27
base is 0x5efa60 caller is 0x950d
base is 0x5efdf0 caller is 0xd7f27
base is 0x5effe0 caller is 0x3423

5ef650: 60 f6 5e 00 00 00 00 00 2b e0 00 00 00 00 00 00
5ef660: b0 f6 5e 00 00 00 00 00 28 0f 02 00 00 00 00 00
5ef670: 08 00 00 00 00 00 00 00 50 89 d8 02 01 00 00 00
5ef680: b0 f6 5e 00 00 00 00 00 b0 52 c7 02 20 00 00 00

5ef650: 60 f6 5e 00 00 00 00 00 2b e0 00 00 00 00 00 00
5ef660: b0 f6 5e 00 00 00 00 00 28 0f 02 00 00 00 00 00
5ef670: 08 00 00 00 00 00 00 00 50 89 d8 02 01 00 00 00
5ef680: b0 f6 5e 00 00 00 00 00 b0 52 c7 02 20 00 00 00

20740: c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00
20750: 55 48 89 e5 89 d1 c1 e9 03 f3 48 a5 f7 c2 04 00
20760: 00 00 74 01 a5 f7 c2 02 00 00 00 74 02 66 a5 f7
20770: c2 01 00 00 00 74 01 a4 5d c3 55 48 89 e5 5d c3
Pagetable walk from virt 4307008, base 571000:
 L4 = 0000000030bb6067 (0x572000)  [offset = 0]
  L3 = 0000000030bb5067 (0x573000)  [offset = 0]
   L2 = 0000000022421067 (0x5af000)  [offset = 21]
    L1 = 0000000000000000 [offset = 107]

@lorenzog lorenzog changed the title from page fault when booting HVM with kernel 3.10.0 from Centos, RHEL to page fault when booting Centos and RHEL HVMs May 3, 2016

marmarek May 3, 2016

Member

There is an emulated USB tablet device (see the qemu command line in that log, which includes "-usb" "-usbdevice" "tablet").
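
A quick way to spot emulated USB devices in a -dm.log is to parse the quoted argument line that qemu prints at startup. This is only a sketch: the helper name is made up, and the sample line is a shortened copy of the one in the log above.

```python
import shlex


def usb_devices(argv_line):
    """Return the values passed to -usbdevice in a quoted qemu argv line."""
    args = shlex.split(argv_line)
    return [args[i + 1] for i, arg in enumerate(args[:-1]) if arg == "-usbdevice"]


# Shortened copy of the argv line from the -dm.log in the original report.
log_line = (
    '"main" "-d" "40" "-domain-name" "rhel7.2server" "-vnc" "none" '
    '"-std-vga" "-boot" "dc" "-usb" "-usbdevice" "tablet" "-acpi" "-vcpus" "8"'
)
print(usb_devices(log_line))
```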

lorenzog May 3, 2016

The HVM config file contains this line:

<input type='tablet' bus='usb'/>

I've tried editing with virsh edit but changing 'tablet' to 'mouse' or 'keyboard' results in this error message:

error: XML document failed to validate against schema: Unable to validate doc against /usr/share/libvirt/schemas/domain.rng Extra element os in interleave Element domain failed to validate content

To be honest any change in that file results in this error. I'm a bit at a loss - how do I remove the emulated USB tablet?

marmarek May 3, 2016

Member

virsh will not help: the domain config is regenerated at each domain startup. If you want to try a manual change, dump that config to a file, edit it, then pass it to qvm-start via --custom-config=...

lorenzog May 4, 2016

Brilliant, I was able to achieve reliable boot. Thank you very much! Good hint :)

Solution:

Changing 'tablet' and 'usb' into 'mouse' and 'ps2' allowed reliable boot (albeit with the message BUG: soft lockup - CPU#0 stuck for 23s [...])
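
The edit itself can be sketched as a plain string replacement on the dumped config. The function name and the sample fragment are hypothetical; only the two input elements come from this thread.

```python
def swap_input_device(config_text):
    """Replace the emulated USB tablet with a PS/2 mouse in a domain config."""
    old = "<input type='tablet' bus='usb'/>"
    new = "<input type='mouse' bus='ps2'/>"
    if old not in config_text:
        # Refuse to return an unchanged config silently.
        raise ValueError("tablet input element not found; config left unchanged")
    return config_text.replace(old, new)


# Hypothetical fragment standing in for a dumped domain config.
sample = """<devices>
    <input type='tablet' bus='usb'/>
</devices>"""
print(swap_input_device(sample))
```

The edited text would then be saved to a file and handed to qvm-start through --custom-config, as suggested earlier in the thread.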

Question:

Is there some documentation on how to make those changes permanent to the .conf file? I suppose domain config is located somewhere else, but I can't find out exactly where.

marmarek May 4, 2016

Member

No, the domain config is generated at each startup (unless --custom-config is used). If you really want to, you can edit its template, /usr/share/qubes/vm-template-hvm.xml, but that will affect all HVMs, and the file will be overwritten on update.
The right way to go is to fix the USB controller emulator so that it does not crash, even with a buggy driver.

lorenzog May 4, 2016

Fair enough. Thanks.

andrewdavidwong May 4, 2016

Member

Can we close this as solved?

marmarek May 4, 2016

Member

I'd leave it open for that qemu bug.

andrewdavidwong May 4, 2016

Member

Ok, sounds good.

dlmetcalf May 17, 2016

Thanks for setting the priority of this ticket to Major.

Given most users need to run RHEL or Ubuntu for their day job work, support for these mainstream OS's could have a significant impact on uptake. Increasing the number of people who can/do use Qubes in their day job, might increase odds of raising donations (or code contributions). I know several places I could 'pitch' Qubes to, but latest RHEL & Ubuntu LTS support as templates & HVMs (of which this ticket is a start) is definitely a prerequisite.

p.s. The volume of work such a small Qubes team is achieving is super impressive BTW!

JoeThielen Jul 28, 2016

I am having this same issue with CentOS 7 (1511) Minimal ISO downloaded from CentOS.org (sha256sum: f90e4d28fa377669b2db16cbcb451fcb9a89d2460e3645993e30e137ac37d284).

As noted above, when the VM does boot, it will show the "BUG: soft lockup - CPU#0 stuck for" notice too. At one point in trying to diagnose this (before finding this thread), I tried varying the number of CPUs allotted to the VM, and I seemed to have more success with just one. If I tried more than one I seemed to have more issues. However, that was a while ago, and I may be confusing the issue.

I will try the 'tablet'/'usb' to 'mouse'/'ps2' change in /usr/share/qubes/vm-template-hvm.xml to see if that helps long-term. I just did a quick test with three different CentOS HVMs, booting them several times each, and now they seem to boot every time, after a minute or two of waiting for the CPU lockup thing...

Thanks to all who are reporting on this as well as working on it! I thought I was doing something noobish/wrong.

pedro7 Aug 4, 2016

Does intel_idle.max_cstate=7 help?

JoeThielen Aug 4, 2016

Pedro, that doesn't seem to do anything for me.

Here is what I did:

  • Started CentOS 7 HVM
  • I added that to my /etc/default/grub file in the GRUB_CMDLINE_LINUX parameter.
  • Ran grub2-mkconfig --output=/boot/grub2/grub.cfg
  • Restarted the HVM.

It still sat there for a minute or more, finally giving me the "BUG: soft lockup - CPU#0 stuck for" message, and then booted. I restarted again and at the grub prompt I hit "e" to make sure the parameter showed up on the command line, and it was indeed there.

Let me know if I should have done something different. And thanks for the suggestion.

JoeThielen Sep 1, 2016

I can confirm this is still an issue on R3.2-rc2. Not the reliable booting issue, but the "BUG: soft lockup - CPU#0 stuck for" message/issue. I tried the intel_idle.max_cstate=7 thing as well as editing /usr/share/qubes/vm-template-hvm.xml. I still get the delay and message every time.

JoeThielen Sep 2, 2016

I can also confirm this is still an issue on R3.2-rc3. I did witness a CentOS 7 Minimal HVM fail to boot before I edited /usr/share/qubes/vm-template-hvm.xml, so I can also confirm that is still an issue too. The /usr/share/qubes/vm-template-hvm.xml workaround works, but I still get the "BUG: soft lockup - CPU#0 stuck for" message/issue.

JoeThielen Sep 23, 2016

I think I've figured out the "BUG: soft lockup - CPU#0 stuck for" issue that causes delays when the HVM is booting. Looks like, at least in my situation, it's related to the bochs_drm Linux module. Looks like it has something to do with being a frame buffer driver. If I disable that module on the Linux kernel command line, then no more error!

So basically:

  • Start CentOS 7 HVM
  • I added "modprobe.blacklist=bochs_drm" to my /etc/default/grub file in the GRUB_CMDLINE_LINUX parameter.
  • I REMOVED "rhgb" from my /etc/default/grub file in the GRUB_CMDLINE_LINUX parameter.
  • Ran grub2-mkconfig --output=/boot/grub2/grub.cfg
  • Restarted the HVM.
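
The /etc/default/grub edit in the steps above can be sketched as a small text transformation. This is an illustration only: the function name and the sample file contents are hypothetical, and the default argument uses the modprobe.blacklist= spelling that kmod's modprobe reads from the kernel command line (pass a different tuple if your setup differs).

```python
import re


def update_grub_cmdline(grub_text, remove=("rhgb",),
                        add=("modprobe.blacklist=bochs_drm",)):
    """Return grub_text with GRUB_CMDLINE_LINUX arguments removed and added."""
    def rewrite(match):
        # Drop unwanted arguments, then append missing ones (idempotent).
        args = [a for a in match.group(2).split() if a not in remove]
        args += [a for a in add if a not in args]
        return match.group(1) + '"' + " ".join(args) + '"'

    new_text, count = re.subn(
        r'^(GRUB_CMDLINE_LINUX=)"([^"]*)"', rewrite, grub_text, flags=re.MULTILINE
    )
    if count == 0:
        raise ValueError("GRUB_CMDLINE_LINUX line not found")
    return new_text


# Hypothetical file contents; a real /etc/default/grub will differ.
sample = 'GRUB_TIMEOUT=5\nGRUB_CMDLINE_LINUX="crashkernel=auto rhgb quiet"\n'
print(update_grub_cmdline(sample))
```

After writing the result back, grub2-mkconfig --output=/boot/grub2/grub.cfg still has to be run inside the HVM, as in the steps above.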

It looks like this has nothing to do with the original issue in this thread. I latched onto it because the "BUG: soft lockup - CPU#0" issue was mentioned and I had assumed they were linked. But I tried a lot of different things in my attempts to fix this issue. I tried various edits to /usr/share/qubes/vm-template-hvm.xml, then reverted it to stock and noticed that this issue still occurred even when the original <input type='tablet' bus='usb'/> was in there. Then I did some googling and found a bunch of posts relating to clocksource=jiffies, so I tried that with Xen and the HVMs, and that didn't fix anything either. I forget what finally turned me onto the bochs_drm module.

Anyway, the final result is that, at least in my case, there were two separate issues:

  • /usr/share/qubes/vm-template-hvm.xml needed to have the <input type='tablet' bus='usb'/> changed to <input type='mouse' bus='ps2'/>
    • This made HVM booting reliable
  • The "BUG: soft lockup - CPU#0" issue documented above, which now seems resolved for me after disabling the bochs_drm module.

marmarek Sep 23, 2016

Member

Thanks for the info!

Is the mouse position reliable with this change? AFAIR it easily
desynchronizes. Do you have any Windows installation (without Qubes
Windows Tools installed) to check it there too?

The latter change can probably be added here:
https://www.qubes-os.org/doc/linux-hvm-tips/

Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?

JoeThielen Sep 24, 2016

I've only been using CentOS 7 HVM for services, not anything GUI. However, during CentOS GUI install I can say that, yes, I did have problems with desynchronized mouse position.

JoeThielen Sep 24, 2016

@marmarek I've created a pull request in that doc page you mentioned with the steps to resolve the issue.

@unman unman referenced this issue in QubesOS/qubes-doc Apr 19, 2017

Merged

Explain how to boot with HVM kernel errors. #366

unman Apr 19, 2017

Member

@andrewdavidwong I've explained how to boot an HVM with this error in QubesOS/qubes-doc#366, and JoeThielen's original PR resolved the issue.
This may be closed.
