google-chrome-stable 43.0.2357 crashes with SIGBUS under Xen PV #1003

Closed
rootkovska opened this Issue May 21, 2015 · 16 comments


rootkovska commented May 21, 2015

To reproduce: open e.g. maps.google.com and zoom in and out a few times. On my system it crashes 100% of the time.

The crash reports point to __memmove_avx_unaligned(); however, I have checked under gdb that this function is normally called many times by Chrome and usually executes fine. So this does not seem to be related to AVX instructions being inaccessible to the Xen PV guest... Rather, it seems that either some AVX instruction causes this SIGBUS only with a particular memory access pattern or, more likely, the bug is triggered by some code higher up that results in __memmove* being called with wrong arguments. It is curious, though, that this crash seems to manifest itself only under Xen -- I just checked a baremetal Fedora 21 with the same google-chrome (and on the same laptop) and it works fine...
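For anyone who wants to repeat that gdb check, a minimal sketch (the renderer PID is a placeholder; these are plain gdb commands, nothing Chrome-specific):

gdb -p <renderer-pid>
(gdb) handle SIGBUS stop print        # make gdb stop when the SIGBUS is delivered
(gdb) break __memmove_avx_unaligned   # confirm the function is hit routinely and returns fine
(gdb) continue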

rootkovska commented May 23, 2015

I've done some more investigation...
First, I booted Xen with xsave=0, which forced Xen not to expose AVX to guests. I could confirm that glibc now used __memmove_ssse3() instead of __memmove_avx_unaligned(), as expected. Sadly, Chrome continued to crash. Usually I got the crash here (this is for google-chrome-stable-43.0.2357.65-1.x86_64 on an fc21 template with glibc-2.20-8.fc21.x86_64 and randomize_va_space=0):

Program received signal SIGBUS, Bus error.
[Switching to Thread 0x7fffdbc19700 (LWP 880)]
0x00007ffff18aa2dd in __memmove_ssse3 () from /lib64/libc.so.6

(gdb) x/3i 0x7ffff18aa2dd
=> 0x7ffff18aa2dd <__memmove_ssse3+797>:        movdqa %xmm1,-0x10(%rdi)
   0x7ffff18aa2e2 <__memmove_ssse3+802>:        sub    $0x10,%rdi
   0x7ffff18aa2e6 <__memmove_ssse3+806>:        cmp    $0x80,%rdx

(gdb) i r rdi
rdi            0x7fffbf351ab0   140736401316528
(gdb) x *0x7fffbf351ab0-0x10

So, it looks like rdi was messed up. Given that the same Chrome on the same FC21, on the same CPU, but running on baremetal (i.e. without Xen) does not crash, I started suspecting this is caused by Xen improperly restoring register state under some specific circumstances (which Chrome is just unlucky enough to trigger)...

I thus tried one more test -- I assigned only one vCPU to the VM where Chrome was running, pinned it to one physical CPU, and also made sure that all the other VMs in the system never use that same physical CPU, so it was exclusively for the Chrome VM. Now, admittedly, it took significantly longer to crash Chrome. More specifically: it still crashed a lot, but only to the "Aw, Snap" stage, and I was able to reload the crashed tab without restarting the whole Chrome. After a few dozen tab crashes, I finally saw a SIGBUS crash in __memmove_ssse3()... This doesn't rule out, of course, that Xen still messes up the vCPU context in some way, which is currently the only explanation I can think of.
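For reference, this kind of pinning can be done with the xl toolstack roughly as follows (a sketch with a made-up domain name and arbitrary CPU numbers; other VMs would need the same treatment, and Qubes normally manages vCPU placement itself):

xl vcpu-set chrome-vm 1          # leave the VM with a single vCPU
xl vcpu-pin chrome-vm 0 3        # pin that vCPU to physical CPU 3
xl vcpu-pin Domain-0 all 0-2     # keep dom0's vCPUs off CPU 3 so it stays exclusive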


rootkovska commented May 23, 2015

One line didn't paste correctly in the comment above:

(gdb) i r rdi
rdi            0x7fffbf351ab0   140736401316528
(gdb) x *0x7fffbf351ab0-0x10
Cannot access memory at address 0x7fffbf351ab0

So it shows that rdi, or some other register (or memory location) used earlier to load rdi, was messed up somehow.


marmarek commented May 23, 2015

Why do you assume it's Xen's fault? There are a dozen simpler, more probable explanations. It looks like some generic software bug, such as passing messed-up arguments to memmove. If assigning only one CPU to the VM helps somehow, most likely there is some race between threads. If it isn't a bug in Chrome itself, maybe it's in some library it uses. And as you can see, the available CPU features result in different code paths, so the set available on your machine under Xen PV likely happens to trigger a buggy code path.

If Xen messed up something as essential as saving rdi, you would get crashes all the time -- most likely you would not even manage to boot the OS.

Check the yum log for recent updates and try downgrading libraries one by one (or do a binary search).
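(For reference, one way to do that search with yum -- just a sketch, with a placeholder transaction id:)

yum history list                    # recent transactions, newest first
yum history info <transaction-id>   # see which packages a transaction changed
yum downgrade glibc                 # roll back a single suspect package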



rootkovska commented May 23, 2015

The same Chrome/Fedora package combination works flawlessly on baremetal, as mentioned already. If this were indeed a bug in Chrome, others would already be experiencing it, I think.


marmarek commented May 23, 2015

I guess you are not the only one (or are those only your crashes?):
https://retrace.fedoraproject.org/faf/problems/?component_names=google-chrome-stable&function_names=__memmove_avx_unaligned
There are also a lot of crash reports with "anonymous function" -- unfortunately the retrace service does not allow filtering on signal ("error name").



rootkovska commented May 25, 2015

Apparently those were only my crash reports...


rootkovska commented May 27, 2015

Just got the same bug on Qubes R2 running on another machine, and on an fc20 template. This was shortly after I upgraded to google-chrome-stable-43.0.2357.81-1.x86_64...


marmarek added the wontfix label May 27, 2015

marmarek commented May 27, 2015

OK, so this is clearly a Chrome bug.


marmarek closed this May 27, 2015

rootkovska commented May 28, 2015

Not sure it's clearly a Chrome bug, because it does seem to manifest itself only under Xen (and probably only on PV guests)...


rootkovska reopened this Jun 1, 2015

cjschroeder commented Jul 30, 2015

I see the same thing running in a Docker container:

#0 0x00007fb1df8428f6 __memmove_avx_unaligned (libc.so.6)

as well as

#0 0x00007fd320f9dd66 in __memcpy_avx_unaligned () at /lib64/libc.so.6

Have you found a resolution by chance?

I have seen this on Fedora 22 also.

It seems that in the case of Docker, the shared memory filesystem is capped at 64M. The workaround:

umount -f /dev/shm
mount -t tmpfs -o rw,nodev,noexec,nosuid,relatime,size=512M tmpfs /dev/shm


robkinyon commented Sep 4, 2015

I am also experiencing this problem with Chrome. Increasing the tmpfs with sudo mount -o remount,size=1024M /tmp does help delay the "Aww Snap!", but it still happens. I generally only see it on pages with tons of images or JS/Flash. I've never seen it on just normal pages, even when I have a ton of tabs open or when doing development.

All of my VMs use the Fedora 21 template from R3rc2 (installed a week ago) and run at the full default allocation of 4G of RAM.


marmarek commented Sep 4, 2015

@rootkovska (or anyone else), can you do a simple test? Try to reproduce this on bare-metal Fedora 21 with a small /dev/shm:

mount /dev/shm -o remount,size=150M

(or even smaller). It may also be about /tmp, which is likewise backed by tmpfs and is quite small by default in Qubes (half of the initial memory size).



v6ak commented Sep 6, 2015

I don't use Chromium much, so I can't provide any statistically significant data. However, if this issue is caused by low space on /tmp, then the problem should not be present on Debian templates, as those don't have tmpfs on /tmp by default. (At least, my Debian 8 on Qubes 3.0-RC2 doesn't have that.) Unless there is low space in /root…

Well, I haven't found anybody complaining about problems on Debian. That could either be a coincidence or support the hypothesis.
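(A quick way to check whether /tmp is tmpfs-backed on a given template, for anyone comparing -- findmnt prints nothing if /tmp is not a separate mount:)

findmnt /tmp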


marmarek commented Oct 4, 2015

I think I can confirm this is about /dev/shm size. I can no longer reproduce the problem after using 1GB for /dev/shm. Also, monitoring its usage during heavy Google Maps zooming shows about 200MB used, which is clearly more than 150MB (the default in Qubes).
I still think Chrome should handle this somewhat more gracefully, but at least we know how to mitigate the problem.
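(For anyone repeating the measurement, a minimal way to watch the usage live while zooming -- the 1-second interval is arbitrary:)

watch -n 1 df -h /dev/shm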


robkinyon commented Oct 5, 2015

I can also confirm that putting /dev/shm at 1G in fstab works.

Is there a recommended way of setting /dev/shm to 1G so that it survives reboots of VMs?
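For illustration, the /dev/shm line in /etc/fstab looks roughly like this (a sketch only; the exact options may differ from what Qubes ends up shipping):

tmpfs  /dev/shm  tmpfs  defaults,size=1G  0  0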



marmarek commented Oct 5, 2015

This will be fixed in an upcoming update (qubes-core-vm 3.0.18).



marmarek added a commit to marmarek/old-qubes-core-agent-linux that referenced this issue Oct 11, 2015

Enlarge /tmp and /dev/shm
The initial size of these tmpfs-mounted directories is calculated as 50% of RAM at VM startup time, which happens to be quite a small number, like 150M. Having such a small /tmp and/or /dev/shm apparently isn't enough for some applications such as Google Chrome. So set the size statically to 1GB, which would be the case for a baremetal system with 2GB of RAM.

Fixes QubesOS/qubes-issues#1003

(cherry picked from commit 2a39adf)