Fatal bug on R 3.2 and intel integrated graphics prevents boot with kernel 1000:4.9.45-21. Kinda really bad. #3165

Open
tonsimple opened this Issue Oct 10, 2017 · 15 comments

tonsimple commented Oct 10, 2017

So I got around to updating my dom0 (finally!) and ran into a bug that completely breaks everything (fortunately there's a workaround, but it's a lame and unhealthy one, so it's still bad)

Qubes OS version (e.g., R3.2):

R 3.2

Affected TemplateVMs (e.g., fedora-23, if applicable):

dom0


Steps to reproduce the behavior:

Have an Asus ROG Hero VIII motherboard and a Skylake Intel processor (other Intels and motherboards might be affected, but tested on this config)

Use integrated graphics (because Nvidia 1070 GPU support is still completely broken on R 3.2)

Update all dom0 packages, particularly the kernel
The kernel version should be 1000:4.9.45-21

Now try a reboot

Expected behavior:

System booting normally, or at least getting to LUKS password entry splash screen

Actual behavior:

Almost complete boot failure.

The GRUB menu displays, but once it tries to actually start Qubes, it shows a black screen.
No error, no nothing.
Attempt to enter LUKS password blindly also gives nothing.

General notes:

There is a "workaround" (that allowed me to boot and work normally, including posting this)

The workaround is to go into advanced options menu item in GRUB menu and choose to boot Qubes with a different kernel
(all other kernels work fine, currently posting this using 1000:4.4.67-13 kernel)

Can't provide logs because with kernel 1000:4.9.45-21 I can't even enter disk encryption password, and with kernel 1000:4.4.67-13 it boots without any issues at all (it seems)

Using an old kernel in dom0 really rubs me the wrong way, so any pointers as to how to get it to boot with 1000:4.9.45-21 are greatly appreciated.


Related issues:

marmarek Oct 17, 2017

Member

@rtiangha any idea? Some options to try?

I'd start with modeset=0 kernel option.

tonsimple Oct 19, 2017

I've managed to catch an error message via cellphone cam (by just waiting for a few hours). I'll try to upload the photo a bit later on.

tonsimple Oct 19, 2017

Okay, here's the "log"

It appears on screen after several hours (though the messages speak about tens of seconds of "starving")

[image: stuff]

tonsimple Dec 7, 2017

I hate to be naggy, but the issue persists on kernel 4.9.56-21 and is really getting me worried

P.S.:
Oh, and modeset=0 does nothing

0spinboson Dec 8, 2017

Did you actually blacklist nouveau (or the PCIe slot containing the 1070)? It may still be trying to initialize the GPU and getting stuck, because I'm not sure 4.9 supports the Pascal series at all.

tonsimple Dec 10, 2017

@0spinboson
How do I blacklist a PCIe slot at "grub level"?

It hangs before the disk encryption password can be entered so stuff I put in "/etc/modprobe.d/" can't be relevant here, no?

0spinboson Dec 10, 2017

what should work is adding 'rd.qubes.hide_pci=0d:00.0' (or whatever the PCI ID is that's shown when you run lspci) to /etc/default/grub (or by editing the grub boot parameters during boot). It's odd that it freezes so early in the boot process, though.
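
For anyone following along, here is a rough Python sketch of the /etc/default/grub edit described above. It assumes the dom0 kernel options sit in a double-quoted GRUB_CMDLINE_LINUX="..." line; add_kernel_param is a made-up helper name, not a Qubes tool, and 0d:00.0 is only the example address from this comment.

```python
import re


def add_kernel_param(grub_default_text, param, variable="GRUB_CMDLINE_LINUX"):
    """Return the file text with `param` appended to the given variable.

    Assumption: the variable is written as NAME="space separated options".
    Lines for other variables (e.g. GRUB_CMDLINE_LINUX_DEFAULT) are untouched.
    """
    pattern = re.compile(r'^(\s*' + re.escape(variable) + r'=")(.*)("\s*)$')
    out = []
    for line in grub_default_text.splitlines():
        match = pattern.match(line)
        if match:
            params = match.group(2).split()
            if param not in params:  # do not add the same option twice
                params.append(param)
            line = match.group(1) + " ".join(params) + match.group(3)
        out.append(line)
    return "\n".join(out)


# Hypothetical usage (run as root in dom0, after backing up the file):
#   text = open("/etc/default/grub").read()
#   new_text = add_kernel_param(text, "rd.qubes.hide_pci=0d:00.0")
```

After changing /etc/default/grub, the GRUB configuration usually has to be regenerated before the option takes effect (on Fedora-based BIOS installs that is typically done with grub2-mkconfig), so check the Qubes documentation for your setup.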

tonsimple Dec 13, 2017

@0spinboson

Tried that, no cigar but the error messages about stalling no longer appear at all

tonsimple Jan 2, 2018

@0spinboson okay, if I blacklist BOTH NVIDIA VGA device and NVIDIA audio device it starts spamming the NMI error messages right away with no wait at all (but still never reaches the password entry screen)

Not sure if that helps fixing it :( but hope it does...
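
In case it is useful for double-checking which devices were hidden, here is a small Python sketch that pulls the NVIDIA addresses out of plain lspci output and builds the boot option from them. It assumes the default lspci line format (bus:device.function, then a description) and that rd.qubes.hide_pci accepts a comma-separated list; hide_pci_option is a hypothetical helper name.

```python
import re

# A default lspci line starts with an address such as "01:00.0",
# followed by the device class and vendor/device description.
LSPCI_LINE = re.compile(r"^([0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7])\s+(.*)$")


def hide_pci_option(lspci_text, vendor="NVIDIA"):
    """Build an rd.qubes.hide_pci=... option for all devices of one vendor.

    Returns None when no line mentions the vendor.
    Assumption: several addresses may be joined with commas.
    """
    addresses = []
    for line in lspci_text.splitlines():
        match = LSPCI_LINE.match(line.strip())
        if match and vendor.lower() in match.group(2).lower():
            addresses.append(match.group(1))
    if not addresses:
        return None
    return "rd.qubes.hide_pci=" + ",".join(addresses)
```

Feeding it the text printed by lspci in dom0 should list both the VGA and the audio function of the card, if both are reported under the NVIDIA vendor name.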

tonsimple Feb 24, 2018

@marmarek @0spinboson
A little update:

with most recent updates (dunno what changed...) nomodeset allows the system to boot (password can be entered and boot completes) but lightdm dies right away and I am limited to command line interface (fortunately, that's enough to revert things back to normal without too much pain)

Also, with most recent updates the "starved for jiffies" message never ever shows up (when booting most recent kernel without nomodeset option)
Just black screen, forever.

marmarek Feb 24, 2018

Member

What version do you have now? The testing repository has 4.14.18 now, so you may want to check whether that fixes anything. But take care during the update not to remove the working kernel: you may want to raise the installonly_limit option in /etc/dnf/dnf.conf, or remove one of the broken kernels.
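
A minimal Python sketch of the installonly_limit change, assuming /etc/dnf/dnf.conf is a plain INI file with a [main] section. raise_installonly_limit is a hypothetical helper, the default of 3 and the "0 means unlimited" rule are taken from my reading of the dnf.conf documentation, and note that configparser drops comments when it rewrites the text.

```python
import configparser
import io


def raise_installonly_limit(conf_text, minimum=5):
    """Return dnf.conf text with installonly_limit raised to at least `minimum`.

    Assumptions: the option lives in [main]; a missing option means the
    dnf default of 3; a value of 0 means "unlimited" and is left alone.
    """
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    if not parser.has_section("main"):
        parser.add_section("main")
    current = parser.getint("main", "installonly_limit", fallback=3)
    if current != 0 and current < minimum:
        parser.set("main", "installonly_limit", str(minimum))
    buffer = io.StringIO()
    parser.write(buffer, space_around_delimiters=False)
    return buffer.getvalue()
```

With a higher limit, dnf keeps more kernel versions installed side by side, so a kernel update is less likely to push out the one that still boots.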

marmarek Feb 24, 2018

Member

Also try the iommu=no-igfx option for Xen (on the xen.gz line in GRUB).

loadcorp Feb 25, 2018

@tonsimple did you resolve your problem? Could you tell me which updates you got and which version you installed?
What exactly did you do with nomodeset, and where?

tonsimple Feb 25, 2018

@marmarek
iommu=no-igfx causes immediate crash (the screen flashes a few times, then system shuts down)

I'll try out the new testing kernel and report results.

@loadcorp

My current solution is to increase installonly_limit, then install kernel 1000:4.4.67-13, then boot using it.
But it's not really a "solution" and sooner or later a properly working kernel will become needed.

nomodeset allows boot, but GUI (lightdm) environment won't start/dies right away, so it's a "no go"

tonsimple Feb 27, 2018

@marmarek @0spinboson

Well, I tried it with the 4.14.18 from testing and there is good news and bad news

Good news:
It actually gets to the password entry screen, password gets accepted, logs are written, and it reaches desktop environment user password entry screen.

HOWEVER
Bad news:
after user's password is entered the desktop fails to load.
The cursor is there, but neither the shortcuts on desktop nor anything else (like the VM manager) ever show up.

But, since logs are being written, I managed to retrieve logs from this "almost but not quite" successful boot.

Here it is
badboot2018.txt

The "oldie but still goodie" kernel 1000:4.4.67-13 still works flawlessly.

Just in case it helps determine "what went wrong this time", I attach a log of a "completely healthy boot" using the kernel 1000:4.4.67-13
goodboot2018.txt
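
For comparing the two attached logs, something like the following Python sketch might help narrow things down: it lists messages that appear in the bad boot but not in the good one, ignoring timestamps and PIDs. The log layout is an assumption (syslog-style date prefixes, bracketed kernel timestamps, bracketed PIDs), and both function names are made up for illustration.

```python
import re

SYSLOG_TS = re.compile(r"^[A-Z][a-z]{2}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2}\s+")
KERNEL_TS = re.compile(r"\[\s*\d+\.\d+\]\s*")
PID = re.compile(r"\[\d+\]")


def normalize(line):
    """Strip parts of a log line that differ between any two boots."""
    line = SYSLOG_TS.sub("", line.strip())  # e.g. "Feb 27 10:00:01 "
    line = KERNEL_TS.sub("", line)          # e.g. "[   12.345678] "
    return PID.sub("", line)                # e.g. "[2211]"


def lines_only_in_bad(good_text, bad_text):
    """Return normalized bad-boot lines that never occur in the good boot."""
    good = {normalize(line) for line in good_text.splitlines()}
    seen = set()
    result = []
    for line in bad_text.splitlines():
        norm = normalize(line)
        if norm and norm not in good and norm not in seen:
            seen.add(norm)
            result.append(norm)
    return result


# Hypothetical usage with the attachments from this comment:
#   extra = lines_only_in_bad(open("goodboot2018.txt").read(),
#                             open("badboot2018.txt").read())
```

Since the two boots use different kernels, many of the differences will be harmless version noise, so the output is only a starting point for reading the logs by hand.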

P.S.:
Should I start a separate issue for this behavior, post it to dev forum, or anything else? It is, after all, distinct from "wontboot" behavior observed on non-testing kernels...
