google-chrome-stable 43.0.2357 crashes with SIGBUS under Xen PV #1003
Comments
rootkovska added the bug, C: kernel, C: xen, P: critical labels on May 21, 2015
rootkovska added this to the Release 3.0 milestone on May 21, 2015
rootkovska
May 23, 2015
Member
I've done some more investigations...
First, I booted Xen passing xsave=0, which forced Xen not to expose AVX to guests. I could confirm that glibc now used __memmove_ssse3() instead of __memmove_avx_unaligned(), as expected. Sadly, Chrome continued crashing. Usually I got the crash here (this is google-chrome-stable-43.0.2357.65-1.x86_64 on an fc21 template with glibc-2.20-8.fc21.x86_64 and randomize_va_space=0):
Program received signal SIGBUS, Bus error.
[Switching to Thread 0x7fffdbc19700 (LWP 880)]
0x00007ffff18aa2dd in __memmove_ssse3 () from /lib64/libc.so.6
(gdb) x/3i 0x7ffff18aa2dd
=> 0x7ffff18aa2dd <__memmove_ssse3+797>: movdqa %xmm1,-0x10(%rdi)
0x7ffff18aa2e2 <__memmove_ssse3+802>: sub $0x10,%rdi
0x7ffff18aa2e6 <__memmove_ssse3+806>: cmp $0x80,%rdx
(gdb) i r rdi
rdi 0x7fffbf351ab0 140736401316528
(gdb) x *0x7fffbf351ab0-0x10
So, it looks like rdi was messed up. Given that the same Chrome on the same FC21 on the same CPU, but running on bare metal (i.e. without Xen), does not crash, I started suspecting this is caused by Xen improperly restoring register state under some specific circumstances (which Chrome is just unlucky enough to trigger)...
I thus tried one more test -- I assigned the VM where Chrome was running only one vCPU and pinned it to one physical CPU, and also made sure that none of the other VMs in the system ever used that physical CPU, so it was exclusively for the Chrome VM. Now, admittedly, it took significantly longer to crash Chrome. More specifically: it still crashed like hell, but only to the "Aw, Snap" stage, and I was able to reload the crashed tab without restarting the whole Chrome. Only after some dozens of tab crashes did I finally see a SIGBUS crash in __memmove_ssse3()... This doesn't rule out, of course, that Xen is still messing up the vCPU context in some way, which is currently the only explanation I can think of.
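A follow-up check one could do at this point (a sketch, assuming the gdb session is still attached; RENDERER_PID is a placeholder for the crashed renderer's PID) is to see whether the faulting destination address actually lies inside a file-backed mapping, for example of a file under /dev/shm:
# Inside gdb, "(gdb) info proc mappings" lists all mappings, so one can look up
# which region (if any) contains the value of rdi.
# From a shell, the same information is in the proc filesystem:
grep 'dev/shm' /proc/RENDERER_PID/maps   # any mappings of /dev/shm files?
less /proc/RENDERER_PID/maps             # full mapping list, to locate the rdi value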
rootkovska
May 23, 2015
Member
One line didn't paste correctly in the comment above:
(gdb) i r rdi
rdi 0x7fffbf351ab0 140736401316528
(gdb) x *0x7fffbf351ab0-0x10
Cannot access memory at address 0x7fffbf351ab0
So it shows that rdi, or some other register (or memory location) that was earlier used to load rdi, was messed up somehow.
marmarek
May 23, 2015
Member
Why do you assume it's Xen's fault? There are a dozen simpler, more probable explanations. It looks like some generic software bug, such as passing messed-up arguments to memmove. If assigning only one CPU to the VM somehow helps, it is most likely some race between threads there. If it's not a bug in Chrome itself, maybe it's in some library it uses. And as you can see, the available CPU features result in a different code path being taken, so likely the set available on your machine, under Xen PV, happens to trigger a buggy code path.
If Xen messed up something as essential as saving rdi, you would get crashes all the time - most likely you would not even manage to boot the OS.
Check the yum log for the latest updates and try to downgrade libraries one by one (or perform some binary search).
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab
rootkovska
May 23, 2015
Member
The same chrome/fedora package combination works flawlessly on bare metal, as mentioned already. If that were indeed a bug in Chrome, others would already be experiencing it, I think.
marmarek
May 23, 2015
Member
I guess you are not the only one (or are those only your crashes?):
https://retrace.fedoraproject.org/faf/problems/?component_names=google-chrome-stable&function_names=__memmove_avx_unaligned
There are also a lot of crash reports with "anonymous function" - unfortunately the retrace service does not allow filtering on signal ("error name").
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab
Apparently these were only my crash reports...
rootkovska
May 27, 2015
Member
Just got the same bug on Qubes R2 running on another machine, and on fc20 template. This was shortly after I upgraded to google-chrome-stable-43.0.2357.81-1.x86_64...
marmarek added the wontfix label on May 27, 2015
Ok, so this is clearly a Chrome bug.
marmarek closed this on May 27, 2015
rootkovska
May 28, 2015
Member
Not sure it is clearly a Chrome bug, because it does seem to manifest itself only under Xen (and probably only in PV guests)...
rootkovska reopened this on Jun 1, 2015
cjschroeder
Jul 30, 2015
I see the same thing running in a Docker container:
#0 0x00007fb1df8428f6 __memmove_avx_unaligned (libc.so.6)
as well as
#0 0x00007fd320f9dd66 in __memcpy_avx_unaligned () at /lib64/libc.so.6
Have you found a resolution by chance?
I have seen this on Fedora 22 also.
It seems that in the case of Docker, the shared memory filesystem is capped at 64M.
The resolution:
umount -f /dev/shm
mount -t tmpfs -o rw,nodev,noexec,nosuid,relatime,size=512M tmpfs /dev/shm
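A less disruptive variant of the same idea (a sketch, not something tested in this thread) is to resize the tmpfs in place instead of force-unmounting it:
mount -o remount,size=512M /dev/shm   # grow the existing tmpfs; 512M is just an example
df -h /dev/shm                        # verify the new size
As far as I know, later Docker releases also gained a --shm-size option for docker run, which sets the container's /dev/shm size without touching mounts inside the container.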
robkinyon
Sep 4, 2015
I am also experiencing this problem with Chrome. Increasing the tmpfs with sudo mount -o remount,size=1024M /tmp does help delay the "Aww Snap!", but it still happens. I generally only see it on pages with tons of images or JS/Flash. I've never seen it on just normal pages, even when I have a ton of tabs open or when doing development.
All of my VMs are based on the Fedora 21 template from R3rc2 (installed a week ago) and running at the full default allocation of 4G RAM.
marmarek
Sep 4, 2015
Member
@rootkovska (or anyone else) can you do a simple test? Try to reproduce this on bare-metal Fedora 21 with a small /dev/shm:
mount /dev/shm -o remount,size=150M
(or even smaller)
It may also be about /tmp, which is likewise backed by tmpfs and is quite small by default in Qubes (half of the initial memory size).
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab
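Spelled out, the suggested test could look roughly like this (a sketch; the 150M size comes from the command above, the rest is an assumption about how one would drive the reproduction):
sudo mount -o remount,size=150M /dev/shm   # shrink /dev/shm on the bare-metal box
df -h /dev/shm /tmp                        # confirm the sizes before testing
google-chrome-stable https://maps.google.com &
watch -n1 df -h /dev/shm                   # watch tmpfs usage while zooming in and out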
v6ak
Sep 6, 2015
I don't use Chromium much, so I can't provide any statistically significant data. However, if this issue is caused by low space on /tmp, then this problem should not be present on Debian templates, as those don't have tmpfs on /tmp by default. (At least, my Debian-8 on Qubes-3.0-RC2 doesn't have that.) Unless there is low space in /root…
Well, I haven't found anybody complaining about problems on Debian. This could either be a coincidence or support the hypothesis.
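A quick way to check this per template (a sketch; df and findmnt should be present in both the Fedora and Debian templates) is to look at what actually backs /tmp and /dev/shm and how large they are:
df -hT /tmp /dev/shm   # filesystem type (tmpfs or disk-backed) and size
findmnt -T /tmp        # the exact mount, source, and options backing /tmp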
marmarek
Oct 4, 2015
Member
I think I can confirm this is about /dev/shm size. I can no longer reproduce the problem after using 1GB for /dev/shm. Also, monitoring its usage during heavy Google Maps zooming shows about 200MB used, which is clearly more than 150MB (the default in Qubes).
I still think that Chrome should handle this somehow more gracefully, but at least we know how to mitigate the problem.
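Worth noting why this shows up as SIGBUS rather than an ordinary allocation error: when a process touches a page of a memory-mapped file and the filesystem cannot provide backing for it (for example a full tmpfs), the kernel delivers SIGBUS at the faulting instruction, which matches the faulting movdqa store in the backtraces above. A rough way to observe and mitigate it (a sketch; the 1G size matches what was used here, the rest is an assumption):
watch -n1 df -h /dev/shm                 # watch usage while zooming Google Maps
sudo mount -o remount,size=1G /dev/shm   # grow the tmpfs for the current boot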
robkinyon
Oct 5, 2015
I can also confirm putting /dev/shm at 1G in fstab works.
Is there a recommended way of setting /dev/shm to 1G so that it survives reboots of VMs?
Thanks,
Rob Kinyon
http://greenfishbluefish.com/
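One common way to make such a setting persistent (a sketch of a generic fstab override, not the official Qubes fix mentioned below) is an explicit /dev/shm entry in /etc/fstab:
# /etc/fstab - example entry; 1G chosen to match the size that worked above
tmpfs   /dev/shm   tmpfs   defaults,size=1G   0 0
After editing, a reboot (or an explicit remount) applies the new size. In a Qubes AppVM, /etc is normally provided by the template, so the change would usually need to go into the template to persist.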
marmarek
Oct 5, 2015
Member
This will be fixed in the upcoming update (qubes-core-vm 3.0.18).
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab
rootkovska commented May 21, 2015
In order to reproduce: open e.g. maps.google.com and zoom in and out a few times. On my system it crashes 100% of the time.
The crash reports point to __memmove_avx_unaligned(), however I have checked under gdb that this function is normally called many times by Chrome and normally executes fine. So this does not seem related to AVX instructions not being accessible to the Xen PV guest... Rather, it seems like either some AVX instruction causes this SIGBUS only with a special memory access pattern or, more likely, the bug is triggered by some code higher up which just results in _memmove* being called with wrong arguments. However, it is curious that this crash seems to manifest itself only under Xen -- I just checked a bare-metal Fedora 21 with the same google-chrome (and on the same laptop) and it works fine...
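A few quick checks that are easy to run inside the guest before reaching for gdb (a sketch; these commands are not from the original report):
grep -c avx /proc/cpuinfo   # count of cpuinfo lines mentioning avx; 0 means AVX is not exposed to this PV guest
df -h /dev/shm /tmp         # tmpfs sizes, which turned out to matter later in this thread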