VMs sometimes fail to autostart--maybe a race condition #1075
Comments
marmarek added the bug and C: core labels on Jul 21, 2015
marmarek added this to the Release 3.0 milestone on Jul 21, 2015
mig5 commented on Jul 28, 2015
This is a real nightmare when the VM that fails to start is your Yubikey USBVM and you can't log in as a result :)
I can reproduce this exactly. Any tips on things to try and I can report results. I'm pretty good with Python, but I lack the understanding of the race condition (it does feel like one) and of how VMs start up under the hood to know what to modify.
marmarek (Member) commented on Jul 28, 2015
I'm aware of the issue and also able to reproduce it.

Take a look at /usr/lib/qubes/prepare-volatile-img.sh (qubes-core-admin/linux/aux-tools/prepare-volatile-img.sh). This script is responsible for creating volatile.img, which consists of two partitions:
- swap (1G)
- space for the overlay over root.img (same size as root.img) - just empty space

This script creates the image, initializes the partition table and then attaches that image as a partitioned loop device (loop3p1, loop3p2 etc). This step is apparently racy - when multiple instances of this script are called at the same time, sometimes two (or more) instances think that they have the same loop device number (loop3, for example).

I think for now the solution would be to guard this script with some lock - for example using the flock utility.

Later we might move that image preparation to the VM's initramfs, so dom0 doesn't need to know anything about its layout. But it is too late for such a change to go into R3.0.
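For illustration only, here is a minimal sketch of such a flock guard around the body of prepare-volatile-img.sh; the lock-file path and the file-descriptor number are arbitrary choices for the example, not taken from any actual patch:

    # Serialize concurrent runs of this script on one lock file.
    LOCKFILE=/var/run/qubes/volatile-img.lock

    (
        # Block here until no other instance holds the lock.
        flock 9

        # ... existing script body: create the image, write the partition
        # table, attach the partitioned loop device, run mkswap ...

    ) 9>"$LOCKFILE"

The lock is released automatically when the subshell exits and file descriptor 9 is closed.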
I'd really appreciate help here with fixing this issue :)
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab
mig5 commented on Jul 28, 2015
I've implemented a flock fix for this, with a simple loop until the lock is free. The pull request is at QubesOS/qubes-core-admin#4 if you wish to implement it.
The gist at https://gist.github.com/mig5/a5a6f194b8221263d6f9 has an example excerpt of the journalctl output. My 3 VMs have started every time, 5 times over, after applying the patch. Look for the message 'Lock is already obtained.. waiting 5 seconds til we try again'.
To avoid an infinite loop we set a max counter and just give up if it fails - but it's worked for me every time since, on the second try.
Let me know what you think.
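A rough sketch of the retry approach described above (not the exact code from the pull request; the lock path, sleep interval and retry limit are illustrative):

    LOCKFILE=/var/run/qubes/volatile-img.lock
    MAX_TRIES=12
    tries=0

    # Open the lock file on fd 9, then poll for the lock with flock -n.
    exec 9>"$LOCKFILE"
    until flock -n 9; do
        tries=$((tries + 1))
        if [ "$tries" -ge "$MAX_TRIES" ]; then
            echo "Lock could not be acquired, giving up" >&2
            exit 1
        fi
        echo "Lock is already obtained.. waiting 5 seconds til we try again"
        sleep 5
    done

    # ... prepare volatile.img while holding the lock ...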
marmarek (Member) commented on Jul 28, 2015
Maybe just "flock" (without -n), instead of a loop?
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab
mig5 commented on Jul 28, 2015
My concern was that the other VMs would just fail to start if a lock was present at all, hence a loop to try again. We want to keep retrying for a sane amount of time until they succeed.
I'll try it out though.
marmarek (Member) commented on Jul 28, 2015
flock should just wait for the other process to release the lock.
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab
mig5 commented on Jul 30, 2015
Sorry, you are right that flock will wait by default, but I tried different methods and couldn't find a way to get it to work. Maybe because I couldn't work out at which step to run it (before which command) and therefore when to release it.
I've given up for now; my original way certainly works well, just a bit of overkill I guess. I hope it sets someone else on the right path.
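For reference, one way to sidestep the question of where to take the lock and when to release it is flock's command-wrapper form, in which flock itself holds the lock for the lifetime of the wrapped command. This is only a sketch, not the code that was merged; the lock path and the caller-side variable names are made up for the example:

    # Run the whole preparation step under the lock; flock acquires it
    # (waiting if needed) and releases it automatically when
    # prepare-volatile-img.sh exits.
    flock /var/run/qubes/volatile-img.lock \
        /usr/lib/qubes/prepare-volatile-img.sh "$volatile_img" "$root_size_mb"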
adrelanos (Member) commented on Jul 30, 2015
I guess we also want non-zero exit codes in case no lock could be acquired?
"Lock is already obtained" - this sounds positive. Suggested change: "Lock could not be acquired"
adrelanos (Member) commented on Jul 30, 2015
Working on a patch. Do you want the following code block inside or outside the flock?
# Total image size in MB: root overlay space + swap + 512 MB extra
TOTAL_SIZE=$[ $ROOT_SIZE + $SWAP_SIZE + 512 ]
# Allocate a sparse file of that size
truncate -s ${TOTAL_SIZE}M "$FILENAME"
# Partition it: p1 = swap (type S), p2 = Linux partition for the root overlay (type L)
sfdisk --no-reread -u M "$FILENAME" > /dev/null 2> /dev/null <<EOF
0,${SWAP_SIZE},S
,${ROOT_SIZE},L
EOF
A commit was added to adrelanos/qubes-core-admin that referenced this issue on Jul 30, 2015
adrelanos referenced this issue in QubesOS/qubes-core-admin on Jul 30, 2015: fix VM autostart race condition #5 (Merged)
adrelanos (Member) commented on Jul 30, 2015
Please check this pull request: QubesOS/qubes-core-admin#5

Is that what you had in mind, @marmarek?

I haven't tested this yet; could you have a look and perhaps test it, @mig5?

Stuff to consider:
- Will the folder /var/run/qubes/ be guaranteed to exist?
- Will flock be installed? In other words, do you have a package dependency on flock already?
marmarek (Member) commented on Jul 30, 2015
On Thu, Jul 30, 2015 at 08:48:11AM -0700, Patrick Schleizer wrote:

> Please check this pull request: QubesOS/qubes-core-admin#5
> Is that what you had in mind, @marmarek?

I think that will do. I'll merge it (as well as other waiting PRs) sometime tomorrow or this weekend.

> I haven't tested this yet; could you have a look and perhaps test it, @mig5?
> Stuff to consider:
> - Will the folder /var/run/qubes/ be guaranteed to exist?

Yes, it is created using the tmpfiles service.

> - Will flock be installed? In other words, do you have a package dependency on flock already?

It is part of the util-linux package, so yes.
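As a quick sanity check (not from this thread, just a standard way to query package ownership on a Fedora-based dom0), one can confirm which package provides flock:

    # Should report util-linux as the owning package
    rpm -qf "$(command -v flock)"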
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab
mig5 commented on Jul 30, 2015
@adrelanos thanks, the patch works for me.
For safety, is it worth using the -w flag to flock, so that if something goes wrong with the loop setup the lock isn't held indefinitely (and thus potentially prevent a boot)? Just an idea; it might be fine as is.
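A sketch of that -w idea; the 60-second value and the lock path are illustrative, not a recommendation from the thread:

    # Give up if the lock cannot be taken within 60 seconds, so a stuck
    # instance cannot block other VM startups forever.
    exec 9>/var/run/qubes/volatile-img.lock
    if ! flock -w 60 9; then
        echo "Lock could not be acquired within 60 seconds" >&2
        exit 1
    fi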
marmarek (Member) commented on Jul 31, 2015
Regarding system startup, systemd already enforces some timeout (AFAIR 90s), so I don't think it is necessary.
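For reference, the effective start timeout of a unit can be checked with systemctl, e.g. for one of the qubes-vm@ services that appears in the log below:

    systemctl show qubes-vm@work.service -p TimeoutStartUSec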
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab
nvesely commented on Jul 21, 2015
On R3-rc1 I have 7 VMs set to autostart altogether, and usually only net, firewall, vpn, and one or two others will actually start. The other 2 or 3 will fail. I've included a journalctl log of the VM autostart service failing, individually and in context.
I tried to figure this out myself, but am just learning Python. It seems like the volatile image creation is crashing in the traceback, but the error message says a loop device is busy. From what I understand, the loop device is a snapshot of the root.img from the TemplateVM. The name is supposed to be connected to the inode number, but it seems like they are just sequential (e.g., this one is loop11). Is the command mkswap -f ${loopdev}p1 > /dev/null in /usr/lib/qubes/prepare-volatile-img.sh what's failing? I read the TemplateImplementation documentation and thought I understood it, but that seems to put me a ways from understanding the actual template implementation...
journalctl of just the research VM startup:
Full journalctl log:
Jul 19 21:25:41 dom0 systemd[1]: Started Qubes NetVM startup.
Jul 19 21:25:41 dom0 systemd[1]: Starting Start Qubes VM media...
Jul 19 21:25:41 dom0 systemd[1]: Starting Start Qubes VM usbvm...
Jul 19 21:25:41 dom0 systemd[1]: Starting Start Qubes VM research...
Jul 19 21:25:41 dom0 systemd[1]: Starting Start Qubes VM vault...
Jul 19 21:25:41 dom0 systemd[1]: Starting Start Qubes VM work...
Jul 19 21:25:41 dom0 qvm-start[2441]: --> Creating volatile image: /var/lib/qubes/appvms/usbvm/volatile.img...
Jul 19 21:25:41 dom0 qvm-start[2444]: --> Creating volatile image: /var/lib/qubes/appvms/work/volatile.img...
Jul 19 21:25:41 dom0 qvm-start[2442]: --> Creating volatile image: /var/lib/qubes/appvms/research/volatile.img...
Jul 19 21:25:41 dom0 qvm-start[2443]: --> Creating volatile image: /var/lib/qubes/appvms/vault/volatile.img...
Jul 19 21:25:41 dom0 qvm-start[2440]: --> Creating volatile image: /var/lib/qubes/appvms/media/volatile.img...
Jul 19 21:25:41 dom0 kernel: loop11: p1 p2
Jul 19 21:25:41 dom0 kernel: loop12: p1 p2
Jul 19 21:25:41 dom0 kernel: loop14: p1 p2
Jul 19 21:25:42 dom0 qvm-start[2442]: /dev/loop11p1: Device or resource busy
Jul 19 21:25:42 dom0 qvm-start[2440]: /dev/loop13p1: No such file or directory
Jul 19 21:25:42 dom0 qvm-start[2442]: Traceback (most recent call last):
Jul 19 21:25:42 dom0 qvm-start[2442]: File "/usr/bin/qvm-start", line 125, in
Jul 19 21:25:42 dom0 qvm-start[2442]: main()
Jul 19 21:25:42 dom0 qvm-start[2442]: File "/usr/bin/qvm-start", line 109, in main
Jul 19 21:25:42 dom0 qvm-start[2442]: xid = vm.start(verbose=options.verbose, preparing_dvm=options.preparing_dvm, start_guid=not options.noguid, notify_function=tray_notify_generic if options.tray else None)
Jul 19 21:25:42 dom0 qvm-start[2442]: File "/usr/lib64/python2.7/site-packages/qubes/modules/000QubesVm.py", line 1696, in start
Jul 19 21:25:42 dom0 qvm-start[2442]: self.storage.prepare_for_vm_startup(verbose=verbose)
Jul 19 21:25:42 dom0 qvm-start[2442]: File "/usr/lib64/python2.7/site-packages/qubes/storage/init.py", line 196, in prepare_for_vm_startup
Jul 19 21:25:42 dom0 qvm-start[2440]: Traceback (most recent call last):
Jul 19 21:25:42 dom0 qvm-start[2440]: File "/usr/bin/qvm-start", line 125, in
Jul 19 21:25:42 dom0 qvm-start[2440]: main()
Jul 19 21:25:42 dom0 qvm-start[2440]: File "/usr/bin/qvm-start", line 109, in main
Jul 19 21:25:42 dom0 qvm-start[2440]: xid = vm.start(verbose=options.verbose, preparing_dvm=options.preparing_dvm, start_guid=not options.noguid, notify_function=tray_notify_generic if options.tray else None)
Jul 19 21:25:42 dom0 qvm-start[2440]: File "/usr/lib64/python2.7/site-packages/qubes/modules/000QubesVm.py", line 1696, in start
Jul 19 21:25:42 dom0 qvm-start[2440]: self.storage.prepare_for_vm_startup(verbose=verbose)
Jul 19 21:25:42 dom0 qvm-start[2440]: File "/usr/lib64/python2.7/site-packages/qubes/storage/init.py", line 196, in prepare_for_vm_startup
Jul 19 21:25:42 dom0 qvm-start[2442]: self.reset_volatile_storage(verbose=verbose)
Jul 19 21:25:42 dom0 qvm-start[2440]: self.reset_volatile_storage(verbose=verbose)
Jul 19 21:25:42 dom0 qvm-start[2440]: File "/usr/lib64/python2.7/site-packages/qubes/storage/init.py", line 193, in reset_volatile_storage
Jul 19 21:25:42 dom0 qvm-start[2440]: self.volatile_img, str(self.root_img_size / 1024 / 1024)])
Jul 19 21:25:42 dom0 qvm-start[2440]: File "/usr/lib64/python2.7/subprocess.py", line 542, in check_call
Jul 19 21:25:42 dom0 qvm-start[2442]: File "/usr/lib64/python2.7/site-packages/qubes/storage/init.py", line 193, in reset_volatile_storage
Jul 19 21:25:42 dom0 qvm-start[2442]: self.volatile_img, str(self.root_img_size / 1024 / 1024)])
Jul 19 21:25:42 dom0 qvm-start[2442]: File "/usr/lib64/python2.7/subprocess.py", line 542, in check_call
Jul 19 21:25:42 dom0 qvm-start[2441]: --> Loading the VM (type = AppVM)...
Jul 19 21:25:42 dom0 qvm-start[2442]: raise CalledProcessError(retcode, cmd)
Jul 19 21:25:42 dom0 qvm-start[2442]: subprocess.CalledProcessError: Command '['/usr/lib/qubes/prepare-volatile-img.sh', '/var/lib/qubes/appvms/research/volatile.img', '10240']' returned non-zero exit status 1
Jul 19 21:25:42 dom0 qvm-start[2440]: raise CalledProcessError(retcode, cmd)
Jul 19 21:25:42 dom0 qvm-start[2440]: subprocess.CalledProcessError: Command '['/usr/lib/qubes/prepare-volatile-img.sh', '/var/lib/qubes/appvms/media/volatile.img', '10240']' returned non-zero exit status 1
Jul 19 21:25:42 dom0 qvm-start[2444]: --> Loading the VM (type = AppVM)...
Jul 19 21:25:42 dom0 qvm-start[2443]: --> Loading the VM (type = AppVM)...
Jul 19 21:25:42 dom0 libvirtd[1224]: End of file while reading data: Input/output error
Jul 19 21:25:42 dom0 systemd[1]: qubes-vm@media.service: main process exited, code=exited, status=1/FAILURE
Jul 19 21:25:42 dom0 systemd[1]: Failed to start Start Qubes VM media.
Jul 19 21:25:42 dom0 systemd[1]: Unit qubes-vm@media.service entered failed state.
Jul 19 21:25:42 dom0 libvirtd[1224]: End of file while reading data: Input/output error
Jul 19 21:25:42 dom0 systemd[1]: qubes-vm@research.service: main process exited, code=exited, status=1/FAILURE
Jul 19 21:25:42 dom0 systemd[1]: Failed to start Start Qubes VM research.
Jul 19 21:25:42 dom0 systemd[1]: Unit qubes-vm@research.service entered failed state.