Try some newer kernel for Dom0 and VM #560

Closed
marmarek opened this Issue Mar 8, 2015 · 28 comments

@marmarek
Member

marmarek commented Mar 8, 2015

Reported by wikimaster on 5 May 2012 10:51 UTC
Problems with 3.2.7:

  • Doesn't support suspend on some systems (e.g. Vaio)
  • Problems with some USB modems in netvms (e.g. on my T420s)
  • udev lockup when booting on T420s, which I started experiencing recently... (#551)

Ideally we could have some newer kernel and get rid completely of the old 2.6.38, with all its hundreds of god-knows-written-by-whom patches and the unsupported xenlinux architecture...

Migrated-From: https://wiki.qubes-os.org/ticket/560

@marmarek

Member

marmarek commented Mar 8, 2015

Comment by marmarek on 9 May 2012 14:25 UTC
3.3.5 is in my kernel repo, devel-3.3 branch. There were many changes since 3.2, especially in the ACPI S3 part, but fortunately Konrad provides a branch in his repo with ACPI S3 patches for 3.3. It looks like 3.4 will fully support ACPI S3 OOTB :)

Tested on a Samsung laptop and it looks OK.

The new nvidia driver (not committed to the dom0-updates repo yet) now compiles cleanly, but still doesn't work - the kernel messages look like some infinite loop in an interrupt handler... The same kernel on bare metal doesn't have this problem. Haven't looked at this deeper yet.

@marmarek

Member

marmarek commented Mar 8, 2015

Comment by joanna on 12 May 2012 12:30 UTC
Just tried the 3.3.5-1 with xen 4.1.2-14, so without this commit:

http://10.141.1.101/?p=marmarek/xen.git;a=commitdiff;h=4bbba736120d67b7d7b7551d881f1f7e50ccbd55

My T420s hung upon the first S3 resume :/

@marmarek

Member

marmarek commented Mar 8, 2015

Comment by marmarek on 12 May 2012 20:02 UTC
This patch (note that a kernel patch is also required) can help here - C-states can be the cause of the freeze at suspend/resume. You can also try to disable cpuidle: I don't remember the exact parameters, but it's something like (AFAIR both are required):
xen: no-cpuidle
dom0 kernel: processor.max_cstate=0 intel_idle.max_cstate=0
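
For illustration, these options go on the hypervisor line (Xen) and on the dom0 kernel line of the boot entry. In a GRUB-legacy setup this would look roughly like the sketch below - paths, version numbers, and the surrounding arguments are made up, and the exact spelling of the Xen flag should be checked against the Xen command-line documentation:

```
# /boot/grub/menu.lst - hypothetical entry, not verbatim from any system
title Qubes (cpuidle disabled)
    root (hd0,0)
    kernel /xen.gz no-cpuidle
    module /vmlinuz-3.3.5-1.pvops.qubes.x86_64 root=/dev/mapper/dmroot ro processor.max_cstate=0 intel_idle.max_cstate=0
    module /initramfs-3.3.5-1.pvops.qubes.x86_64.img
```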

@marmarek

Member

marmarek commented Mar 8, 2015

Comment by marmarek on 12 May 2012 22:54 UTC
Perhaps this hang was introduced by msi-after-sleep.patch? Have you tried the 3.3.5 kernel without this xen patch?

@marmarek

Member

marmarek commented Mar 8, 2015

Comment by joanna on 16 May 2012 14:37 UTC
Interestingly, there is one more thing that breaks on this 3.3.5 kernel compared to other dom0 kernels -- namely, the gui daemon seems to have horrible performance! This is easily visible when one opens a browser with some long page and tries to scroll up and down. The graphics lag significantly. When I switch back to the previous dom0 kernel, this problem vanishes... maybe it's connected with how Xen/Dom0 manages the CPU, i.e. it slows it down too much?

@marmarek

Member

marmarek commented Mar 8, 2015

Comment by marmarek on 20 May 2012 00:05 UTC
Interesting, I don't see such symptoms on my system (with an nvidia GPU on the nouveau driver).
You can check power management parameters and statistics using the xenpm tool.
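
The xenpm checks being suggested might look like the following (a sketch run in dom0; the exact subcommand set and output format vary by Xen version, so treat these invocations as illustrative rather than verbatim):

```
xenpm get-cpufreq-para      # current governor and P-state (frequency) parameters
xenpm get-cpuidle-states    # C-state residency counters per CPU
xenpm start 10              # collect CPU utilization statistics for 10 seconds
```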

@marmarek

Member

marmarek commented Mar 8, 2015

Comment by joanna on 21 May 2012 08:31 UTC
Later I could also see that with this 3.3.5 kernel in Dom0, some other graphics, specifically the plymouth splash blue screens, are also drawn very slowly. This suggests there is some problem with supporting Intel Sandy Bridge integrated GPUs (strangely, OpenGL, as used by Dom0 KDE, seems to work fine).

Anyway, a natural question arises: how can we offer a user a choice of several different kernels for Dom0? If we just put them into our current yum repo, then yum will automatically always pick the newest one, without offering an easy option to install any of the previous ones.

We could modify the installer so that several select kernels are installed and grub offers a choice at boot (also for the installer). This would be good for hardware compatibility. Nevertheless, how can we prevent further kernel updates from removing the oldest ones?

@marmarek

Member

marmarek commented Mar 8, 2015

Comment by marmarek on 21 May 2012 12:29 UTC
As for multiple kernels: yum already has support for multiple installed kernels. The number of kernels installed simultaneously can be set via ''installonly_limit'' in yum.conf (3 by default).

What did you get from the xenpm tool?
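
The knob in question lives in dom0's yum configuration; a hypothetical excerpt showing where it sits:

```
# /etc/yum.conf (dom0) - hypothetical excerpt
[main]
# how many "install-only" packages (i.e. kernels) to keep in parallel;
# yum removes the oldest one when a new install would exceed the limit
installonly_limit=3
```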

@marmarek

Member

marmarek commented Mar 8, 2015

Comment by joanna on 26 May 2012 09:38 UTC
Shall we increase the installonly limit to something bigger, say 16? I don't really see a reason to prevent the user from trying a few kernels... Currently, with this set to 3, we risk that a user removes a working kernel while trying 3 newer kernels...

@marmarek

Member

marmarek commented Mar 8, 2015

Comment by marmarek on 27 May 2012 23:36 UTC
The reason to keep this limit low is disk space. We have 500MB in /boot (one kernel takes about 20MB), so there should be no problem there. But modules take about 130MB per kernel, which can be significant with a number like 10 (especially when using a not-so-big SSD drive).
IMHO we can increase the default to something like 5, but not more. If the user wants to experiment with different kernel versions, they can increase this limit manually.

BTW I've just pushed the devel-3.4 branch with a brand new 3.4 kernel (+ some patches - ACPI S3 under Xen still isn't in upstream). I had some problems in a VM on this kernel ("null paging request" while loading iptables modules), but cannot reproduce them... Besides that it looks good (especially in dom0). Maybe it fixes some problems mentioned in the original ticket description and/or the Intel GPU performance issue?

@marmarek

Member

marmarek commented Mar 8, 2015

Comment by joanna on 28 May 2012 21:38 UTC
I just tried the 3.4 kernel (+ Xen with the "xen: allow dom0 to update C-/P-/T- state management info" patch). The good news is that suspend seems to work fine now on my T420s. The bad news is that the graphics performance is just as poor as on the 3.3 kernel, as reported above. I played with xenpm and made sure to set the following:

  1. xenpm set-scaling-governor performance
  2. xenpm set-max-cstate 0

I have verified that my processor: 1) keeps staying in the P0 state (so, max frequency), and 2) keeps staying in the C0 state.

Those settings didn't change anything regarding the graphics performance, which is really visibly poor (e.g. when scrolling large websites in a browser).

Is there any other setting in xenpm I could try?

@marmarek

Member

marmarek commented Mar 8, 2015

Comment by joanna on 28 May 2012 21:39 UTC
Ah, one additional side effect that seems to be related to running 3.4 as the dom0 kernel (or perhaps xen with this C/P/T patch?) is that I keep getting the following crash in various VMs (which run the 3.2.7-4 kernel):

[   21.018929] BUG: unable to handle kernel paging request at 00000000988c9400
[   21.018943] IP: [<ffffffffa00b3e82>] ____nf_conntrack_find+0x2/0x180 [nf_conntrack]
[   21.018958] PGD 11a79067 PUD 0
[   21.018964] Oops: 0002 [#1] SMP
[   21.018970] CPU 3
[   21.018973] Modules linked in: bnep bluetooth rfkill ipt_REJECT xt_state xt_tcpudp iptable_filter ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ip_tables x_tables xen_netfront pcspkr u2mfn(O) xen_blkback xen_evtchn autofs4 ext4 jbd2 crc16 dm_snapshot xen_blkfront [last unloaded: scsi_wait_scan]
[   21.019017]
[   21.019021] Pid: 1006, comm: firefox Tainted: G           O 3.2.7-4.pvops.qubes.x86_64 #1
[   21.019029] RIP: e030:[<ffffffffa00b3e82>]  [<ffffffffa00b3e82>] ____nf_conntrack_find+0x2/0x180 [nf_conntrack]
[   21.019041] RSP: e02b:ffff8800114878c0  EFLAGS: 00010282
[   21.019045] RAX: 00000000988c9400 RBX: ffff880011487938 RCX: 00000000988c94c1
[   21.019051] RDX: ffff880011487938 RSI: 0000000000000000 RDI: ffffffff819e6ec0
[   21.019057] RBP: ffff8800114878f8 R08: 0000000020a9a821 R09: 000000007fad0e16
[   21.019063] R10: ffff880011487950 R11: 0000000000000000 R12: 0000000000000000
[   21.019069] R13: 00000000988c94c1 R14: ffffffff819e6ec0 R15: 0000000000000000
[   21.019079] FS:  00007fe0784f6700(0000) GS:ffff880018fad000(0000) knlGS:0000000000000000
[   21.019086] CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
[   21.019092] CR2: 00000000988c9400 CR3: 0000000010a55000 CR4: 0000000000002660
[   21.019098] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   21.019104] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[   21.019111] Process firefox (pid: 1006, threadinfo ffff880011486000, task ffff8800114be440)
[   21.019117] Stack:
[   21.019121]  ffffffffa00b403d ffff8800114878f8 ffff8800118382c0 0000000000000000
[   21.019132]  ffffffff819e6ec0 0000000000000000 0000000000000002 ffff8800114879a8
[   21.019142]  ffffffffa00b44a2 ffffffffa00d1420 ffffffffa00be640 ffffffffa00be640
[   21.019153] Call Trace:
[   21.019162]  [<ffffffffa00b403d>] ? __nf_conntrack_find_get+0x3d/0x180 [nf_conntrack]
[   21.019173]  [<ffffffffa00b44a2>] nf_conntrack_in+0x2d2/0x6a0 [nf_conntrack]
[   21.019183]  [<ffffffff8139d6e8>] ? ip_generic_getfrag+0x88/0xa0
[   21.019191]  [<ffffffffa00d0689>] ipv4_conntrack_local+0x49/0x50 [nf_conntrack_ipv4]
[   21.019200]  [<ffffffff813910d5>] nf_iterate+0x85/0xb0
[   21.019207]  [<ffffffff8139be70>] ? ip_options_build+0x210/0x210
[   21.019214]  [<ffffffff8139125d>] nf_hook_slow+0x6d/0x140
[   21.019221]  [<ffffffff8139be70>] ? ip_options_build+0x210/0x210
[   21.019228]  [<ffffffff8139e0fe>] __ip_local_out+0x9e/0xa0
[   21.019234]  [<ffffffff8139e431>] ip_local_out+0x11/0x30
[   21.019240]  [<ffffffff8139e466>] ip_send_skb+0x16/0x50
[   21.019248]  [<ffffffff813bfe98>] udp_send_skb+0x108/0x390
[   21.019255]  [<ffffffff81396129>] ? ipv4_dst_check+0x39/0x40
[   21.019262]  [<ffffffff8139d660>] ? ip_append_page+0x530/0x530
[   21.019270]  [<ffffffff813c107d>] udp_sendmsg+0x2ed/0x8a0
[   21.019278]  [<ffffffff8100a05f>] ? xen_restore_fl_direct_reloc+0x4/0x4
[   21.019288]  [<ffffffff81128997>] ? kmem_cache_alloc+0x77/0x110
[   21.019296]  [<ffffffff813ca0b3>] inet_sendmsg+0x43/0xb0
[   21.019303]  [<ffffffff81352884>] sock_sendmsg+0xe4/0x110
[   21.019309]  [<ffffffff81396d8e>] ? ip_route_output_slow+0x1ae/0x520
[   21.019321]  [<ffffffff81065462>] ? local_bh_enable_ip+0x22/0xa0
[   21.019331]  [<ffffffff81447a80>] ? _raw_spin_unlock_bh+0x10/0x20
[   21.019338]  [<ffffffff813561fb>] ? release_sock+0xdb/0x110
[   21.019345]  [<ffffffff81352a04>] sys_sendto+0x104/0x140
[   21.019356]  [<ffffffff8103a2a8>] ? pvclock_clocksource_read+0x58/0xd0
[   21.019362]  [<ffffffff81009e60>] ? xen_clocksource_read+0x20/0x30
[   21.019369]  [<ffffffff81009ff9>] ? xen_clocksource_get_cycles+0x9/0x10
[   21.019378]  [<ffffffff81088a52>] ? getnstimeofday+0x52/0xe0
[   21.019386]  [<ffffffff8144f752>] system_call_fastpath+0x16/0x1b
[   21.019391] Code: 00 01 00 00 09 bc 0b a0 00 00 00 50 e8 0b a0 00 00 a8 43 61 18 e1 00 00 00 01 00 00 00 02 00 00 00 40 06 00 00 00 00 00 00 b0 00 <00> 00 00 00 00 00 01 00 00 00 02 00 00 00 30 05 00 00 00 00 00
[   21.019473] RIP  [<ffffffffa00b3e82>] ____nf_conntrack_find+0x2/0x180 [nf_conntrack]
[   21.019487]  RSP <ffff8800114878c0>
[   21.019491] CR2: 00000000988c9400
[   21.019496] ---[ end trace 24141b02a27a02d2 ]---
@marmarek

Member

marmarek commented Mar 8, 2015

Comment by marmarek on 28 May 2012 22:00 UTC
The crash can be caused by the latest SKB slots patch... I don't have any other idea how a change in the dom0 kernel could cause VM kernel crashes, especially in the network subsystem - they are totally isolated.

@marmarek

Member

marmarek commented Mar 8, 2015

Comment by marmarek on 30 May 2012 09:21 UTC
Updated SKB patch (from xen-devel):
http://git.qubes-os.org/gitweb/?p=marmarek/kernel.git;a=commit;h=0ebad7661fe512bdc0d3c6169208f5e7eef8fb39

@marmarek

Member

marmarek commented Mar 8, 2015

Comment by joanna on 5 Jun 2012 21:23 UTC
Finally I managed to reproduce the hang on my 3.2.7-3 Dom0 kernel:

Key slot 0 unlocked.
                Welcome to Qubes
                Press 'I' to enter interactive startup.
Starting udev: udevd-work[810]: '/usr/bin/vmmouse_detect' unexpected exit with status 0x000b
udevd-work[816]: '/usr/bin/vmmouse_detect' unexpected exit with status 0x000b
udevd[802]: worker [850] unexpectedly returned with status 0x0100
udevd[802]: worker [850] failed while handling '/devices/virtual/block/loop114'
udevd[802]: worker [1421] unexpectedly returned with status 0x0100
udevd[802]: worker [1421] failed while handling '/devices/virtual/block/loop192'
udevd[802]: worker [809] unexpectedly returned with status 0x0100
udevd[802]: worker [809] failed while handling '/devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0/block/sda/sda1'
udevd[802]: worker [838] unexpectedly returned with status 0x0100
udevd[802]: worker [838] failed while handling '/devices/virtual/block/loop201'
udevd[802]: worker [872] unexpectedly returned with status 0x0100
udevd[802]: worker [872] failed while handling '/devices/virtual/block/loop205'
udevd[802]: worker [875] unexpectedly returned with status 0x0100
udevd[802]: worker [875] failed while handling '/devices/virtual/block/loop206'
udevd[802]: worker [876] unexpectedly returned with status 0x0100
udevd[802]: worker [876] failed while handling '/devices/virtual/block/loop207'
udevd[802]: worker [885] unexpectedly returned with status 0x0100
udevd[802]: worker [885] failed while handling '/devices/virtual/block/loop209'
udevd[802]: worker [1058] unexpectedly returned with status 0x0100
udevd[802]: worker [1058] failed while handling '/devices/virtual/block/loop212'
udevd[802]: worker [1158] unexpectedly returned with status 0x0100
udevd[802]: worker [1158] failed while handling '/devices/virtual/block/loop217'

@marmarek

Member

marmarek commented Mar 8, 2015

Comment by marmarek on 6 Jun 2012 00:24 UTC
Perhaps the reason is the slowness of xenstore. Our udev handler exposes block devices to xenstore, but also removes them when it decides not to export some device - regardless of whether it was exported earlier or not. In dom0 we have 256 loop devices, which are handled by udev simultaneously, which produces at least that many simultaneous xenstore-rm calls (empty loop devices are hidden from qvm-block). Maybe there is even some deadlock in xenstore...

I've just modified the udev handler to save the information whether a device is exported to xenstore and to call xenstore-rm only when necessary (will push it in the near future).
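
The described change can be sketched as a pair of shell helpers. This is a hypothetical illustration, not the real handler: the stamp directory, function names, and xenstore paths are made up, and the actual xenstore-write/xenstore-rm invocations are elided as comments.

```shell
#!/bin/sh
# Sketch of the "only remove what we actually exported" logic:
# record an export with a stamp file, and issue the (expensive)
# xenstore-rm only when a previous export actually happened.
STAMP_DIR="${STAMP_DIR:-/run/qubes-block-exports}"   # hypothetical path

export_device() {
    dev="$1"
    mkdir -p "$STAMP_DIR"
    # xenstore-write .../block-devices/"$dev" ...    (elided)
    touch "$STAMP_DIR/$dev"    # remember that we exported this device
}

remove_device() {
    dev="$1"
    # Skip xenstore-rm entirely for devices we never exported - this
    # avoids ~256 simultaneous xenstore-rm calls for empty loop devices.
    if [ -e "$STAMP_DIR/$dev" ]; then
        # xenstore-rm .../block-devices/"$dev"       (elided)
        rm -f "$STAMP_DIR/$dev"
    fi
}
```

With this structure, an udev "remove" event for a device that was never exported becomes a no-op instead of a xenstore transaction.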

@marmarek

Member

marmarek commented Mar 8, 2015

Comment by joanna on 11 Jun 2012 09:07 UTC
Have you pushed?

@marmarek

Member

marmarek commented Mar 8, 2015

Comment by marmarek on 11 Jun 2012 09:12 UTC
Into gitpro - yes. Not to public git yet - will do it later today.

@marmarek

Member

marmarek commented Mar 8, 2015

Comment by marmarek on 15 Jun 2012 13:24 UTC
Above mentioned commit: http://git.qubes-os.org/?p=marmarek/core.git;a=commit;h=3a8427cee57cab2a0f10c00586a8ccd967462aa5

Also pushed 3.4.2 to the devel-3.4 branch. 3.4 up to 3.4.1 had a rather strange problem which manifested itself in messing up pages read from a loop device. On my test system it caused crashes during netvm boot because of loading broken modules (reads from /lib/modules returned messed-up file contents!). Details: http://lists.xen.org/archives/html/xen-devel/2012-06/msg00537.html

But still, I suspect nothing new here in terms of Intel GPU performance.

@marmarek

Member

marmarek commented Mar 8, 2015

Comment by joanna on 17 Jun 2012 20:47 UTC
Sadly, this commit didn't help with udev on my kernel. It still tends to hang on udev during boot for a few minutes...

@marmarek

Member

marmarek commented Mar 8, 2015

Modified by joanna on 20 Jun 2012 12:47 UTC

@marmarek marmarek added P: critical and removed P: major labels Mar 8, 2015

@marmarek

marmarek commented Mar 8, 2015

Comment by marmarek on 20 Jun 2012 22:47 UTC
Can you catch which udev action causes this (and whether it really is a udev action)? Perhaps some sysrq will help, or some additional detail in the logfile. The udev messages above didn't contain info on which action failed...
Did you try 3.4.2?
Regarding GPU performance, I can revert the commit pointed out by Radoslaw Szkodzinski, but I can't predict the side effects... Maybe it's worth a try?
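To catch the failing action, something along these lines might help (a sketch assuming root; `debug_udev_hang` is a hypothetical helper, and the `udevadm` options are those of the udev of that era):

```shell
# Sketch: raise udev log verbosity and dump blocked tasks via sysrq
# while the boot-time hang is happening.
debug_udev_hang() {
    udevadm control --log-priority=debug   # verbose udev logging to syslog
    udevadm monitor --kernel --udev &      # print events as they arrive
    echo w > /proc/sysrq-trigger           # log tasks in uninterruptible sleep
    dmesg | tail -n 50                     # inspect the sysrq-w output
}
```

The sysrq-w dump should show which task (and which kernel call path) udev is blocked in.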

@marmarek

marmarek commented Mar 8, 2015

Comment by marmarek on 24 Jun 2012 10:14 UTC
I've pushed 3.4.4 (which contains some minor fixes) and also added a fix for the GPU performance (reverted the commit pointed out by R. Szkodzinski and applied the correct fix from Konrad's git repo).
In dom0 running this kernel I constantly see load >= 1.0, but no process consumes CPU time... Besides that, everything seems to be working.
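Load >= 1.0 with no visible CPU consumer usually means a task stuck in uninterruptible sleep (state D), which counts toward the load average without using CPU time. A quick check (a sketch; works on any Linux):

```shell
# /proc/loadavg fields: 1/5/15-minute averages, running/total tasks, last PID
cat /proc/loadavg

# Tasks in state D (uninterruptible sleep) inflate the load average
# without consuming CPU time; list any such tasks.
ps -eo state,pid,comm | awk '$1 == "D" {print "blocked:", $2, $3}' || true
```

If a task shows up here permanently, a sysrq-w dump (`echo w > /proc/sysrq-trigger`, as root) will show where in the kernel it is stuck.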

@marmarek

marmarek commented Mar 8, 2015

Comment by joanna on 25 Jun 2012 21:45 UTC
The GPU performance seems ok on the 3.4.4 kernel :)

@marmarek

marmarek commented Mar 8, 2015

Comment by joanna on 28 Jun 2012 15:01 UTC
Regarding the 3.4.4 kernel running in a VM -- I constantly observe my WiFi card getting stuck and becoming effectively defunct in the netvm:

[10658.740087] iwlwifi 0000:00:02.0: Queue 12 stuck for 10000 ms.
[10658.740099] iwlwifi 0000:00:02.0: Current SW read_ptr 58 write_ptr 154
[10658.740175] iwlwifi 0000:00:02.0: Current HW read_ptr 58 write_ptr 154
[10658.740185] iwlwifi 0000:00:02.0: On demand firmware reload
[10658.740771] ieee80211 phy0: Hardware restart was requested
[10658.740958] iwlwifi 0000:00:02.0: L1 Enabled; Disabling L0S
[10658.741157] iwlwifi 0000:00:02.0: Radio type=0x1-0x2-0x0

This is seen by the user as networking not working anymore. The solution is to click disconnect and then connect again in the NM applet.
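The same disconnect/reconnect can also be done from a terminal with nmcli (a sketch; `restart_wifi` is a hypothetical helper and the connection name is a placeholder you must supply):

```shell
# Sketch: bounce a NetworkManager connection from the command line,
# equivalent to clicking disconnect/connect in the NM applet.
restart_wifi() {
    con="${1:?usage: restart_wifi <connection-name>}"
    nmcli con down id "$con"    # tear the connection down
    nmcli con up id "$con"      # and bring it back up
}
```

Usage: `restart_wifi "MyWifi"` (use the connection name shown in the applet).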

@marmarek

marmarek commented Mar 8, 2015

Comment by joanna on 30 Jun 2012 12:41 UTC
It seems like 3.4.4 works fine as a Dom0 kernel on my laptop...

@marmarek

marmarek commented Mar 8, 2015

Comment by joanna on 6 Jul 2012 08:12 UTC
... actually, 3.4.4 has one problem when I run it as Dom0 on my laptop -- namely, quite often kscreenlocker or kwin becomes somehow defunct after S3 resume, and I need to manually go to a text console, log in, kill kwin, switch back to X, and restart kwin in order to be able to continue working under X. It might be something GPU-related, perhaps.

In any case, I think we should choose 3.2.7-7 as both the Dom0 and the default VM kernel. The 3.2.7 branch has gotten the most testing with Qubes so far...

@marmarek

marmarek commented Mar 8, 2015

Comment by joanna on 9 Jul 2012 08:42 UTC
If we all agree, I will close this ticket.

@marmarek marmarek added the worksforme label Mar 8, 2015

@marmarek marmarek closed this Mar 8, 2015
