Missing LVM snapshot volume results in data loss (R4-rc1) #3164
Comments
0spinboson commented Oct 10, 2017

So attaching an ISO worked for you? Wonder why I'm special.
doug1 commented Oct 10, 2017

@0spinboson Yeah, actually that took some work. I ended up doing "losetup -f" on the ISO image and then using the VM Settings GUI to boot the empty VM from the new loop block device.
andrewdavidwong added the "bug" and "C: core" labels, Oct 11, 2017
andrewdavidwong added this to the Release 4.0 milestone, Oct 11, 2017
HW42 commented Oct 12, 2017

I experienced (probably) the same. On a test machine, sys-net lost its private volume.

I think I now understand what's happening:

1. vm.start() starts the VM.
2. The VM halts.
3. domain-shutdown gets generated.
4. A second vm.start() gets the startup_lock before vm.on_domain_shutdown_coro() gets it. The VM starts.
5. vm.on_domain_shutdown_coro() acquires the lock and runs vm.storage.stop(): so ['remove', self.vid], then ['clone', self._vid_snap, self.vid], and then ['remove', self._vid_snap].
6. The VM halts again.
7. vm.on_domain_shutdown_coro() runs vm.storage.stop(). Now ['remove', self.vid] deletes the only remaining volume containing the private image.
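For illustration, here is a minimal asyncio toy model of that ordering (not the actual Qubes code; the class and method names merely mirror the ones discussed above): once the shutdown cleanup is only scheduled as a task, a second start() can run to completion before the cleanup ever acquires the startup lock.

```python
import asyncio

class ToyVM:
    """Toy stand-in for the real domain object; illustrative only."""

    def __init__(self):
        self.startup_lock = asyncio.Lock()
        self.running = False

    async def start(self, tag):
        async with self.startup_lock:
            self.running = True
            print(f"[{tag}] start() holds the lock, VM is (re)started")

    async def on_domain_shutdown_coro(self):
        async with self.startup_lock:
            # In the real code this runs vm.storage.stop(); by the time it
            # gets the lock here, the VM may already be running again.
            print(f"cleanup runs storage.stop() while running={self.running}")

async def main():
    vm = ToyVM()
    await vm.start("first")
    vm.running = False                                  # the VM halts
    cleanup = asyncio.ensure_future(vm.on_domain_shutdown_coro())
    await vm.start("second")     # wins the lock before the scheduled cleanup
    await cleanup                # cleanup now runs against the restarted VM

asyncio.run(main())
```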
So we have two problems:

1. thin_volume._commit() does not check whether the snapshot it wants to commit actually exists.
2. The TODO in vm.on_domain_shutdown() has not been addressed:

       # TODO: ensure that domain haven't been started _before_ this
       # coroutine got a chance to acquire a lock
Re (2): I think we should add vm.halt_pending to keep track of whether the VM has been started but no stop event has been seen yet, and then reflect this in the value returned by vm.get_power_state(). Any better ideas?
Additionally we might want to consider:

- Logging lvm operations at the default log level. This makes reconstructing what happened much easier.
- Making a backup before a remove-and-clone operation (creating a snapshot should be cheap), at least if we remove self.vid. This should prevent data loss even if something breaks the synchronization.
@doug1: Thanks for your report and analysis. This was quite helpful to track down the problem.
marmarek (Member) commented Oct 12, 2017

> Making a backup before a remove-and-clone operation (creating a snapshot should be cheap), at least if we remove self.vid. This should prevent data loss even if something breaks the synchronization.

It is done if revisions_to_keep is set to something positive. Anyway, I think this can be improved by renaming the volume instead of removing it in the first step, and removing it only after a successful clone.

> Re (2): I think we should add vm.halt_pending to keep track of whether the VM has been started but no stop event has been seen yet, and then reflect this in the value returned by vm.get_power_state(). Any better ideas?

You mean set vm.halt_pending in vm.start(), or in vm.on_domain_shutdown()? Ideally we'd take vm.startup_lock in vm.on_domain_shutdown(), but unfortunately that is not possible (it isn't a coroutine).
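For what it's worth, a rough sketch of that rename-first idea (purely illustrative, not the code from any actual PR; qubes_lvm() and the 'rename' operation are stand-ins for whatever the storage driver really provides):

```python
def qubes_lvm(cmd):
    """Stand-in for the driver's LVM helper; here it only echoes the request."""
    print('lvm op:', cmd)

def commit_with_backup(vid, vid_snap):
    """Promote the snapshot, keeping the old volume until the clone succeeded."""
    backup = vid + '-back'
    qubes_lvm(['rename', vid, backup])        # keep the old data reachable
    try:
        qubes_lvm(['clone', vid_snap, vid])   # promote the snapshot
    except Exception:
        qubes_lvm(['rename', backup, vid])    # roll back if the clone failed
        raise
    qubes_lvm(['remove', backup])             # only now drop the old volume
    qubes_lvm(['remove', vid_snap])
```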
HW42 commented Oct 12, 2017

> > Making a backup before a remove-and-clone operation (creating a snapshot should be cheap), at least if we remove self.vid. This should prevent data loss even if something breaks the synchronization.
>
> It is done if revisions_to_keep is set to something positive. Anyway, I think this can be improved by renaming the volume instead of removing it in the first step, and removing it only after a successful clone.

Right, this sounds good.

> > Re (2): I think we should add vm.halt_pending to keep track of whether the VM has been started but no stop event has been seen yet, and then reflect this in the value returned by vm.get_power_state(). Any better ideas?
>
> You mean set vm.halt_pending in vm.start(), or in vm.on_domain_shutdown()?

vm.halt_pending was a bad name proposal. I mean setting it in vm.start() and unsetting it once we have seen and handled the shutdown (and until then vm.get_power_state() returns 'Transient' instead of 'Halted'). So better something like vm.running. BTW, is it intentional that vm.is_running() does not imply vm.get_power_state() == 'Running'?

> Ideally we'd take vm.startup_lock in vm.on_domain_shutdown(), but unfortunately that is not possible (it isn't a coroutine).

Even if it were possible, I'm not sure it would be enough, since I'm not sure libvirt guarantees that libvirt_domain.isActive() == True until VIR_DOMAIN_EVENT_STOPPED has been delivered.
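A bare-bones sketch of what that flag could look like (hypothetical names, not the real qubes code): remember that a start happened and no stop event has been handled yet, and report 'Transient' instead of 'Halted' until the cleanup has run.

```python
class PowerStateSketch:
    """Minimal model of the proposed flag; illustrative only."""

    def __init__(self):
        self._stop_handled = True       # nothing started yet, nothing pending

    def on_start(self):
        self._stop_handled = False      # a future stop still needs cleanup

    def on_stop_handled(self):
        self._stop_handled = True       # shutdown event seen and processed

    def get_power_state(self, libvirt_active):
        if libvirt_active:
            return 'Running'
        # The domain is gone in libvirt, but its shutdown has not been
        # handled yet, so do not report it as cleanly halted.
        return 'Halted' if self._stop_handled else 'Transient'
```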
marmarek (Member) commented Oct 12, 2017

> vm.halt_pending was a bad name proposal. I mean setting it in vm.start() and unsetting it once we have seen and handled the shutdown (and until then vm.get_power_state() returns 'Transient' instead of 'Halted'). So better something like vm.running.

Maybe it should be a lock, taken at vm.start() and released by vm.on_domain_shutdown_coro()? But I fear deadlocks if VIR_DOMAIN_EVENT_STOPPED goes missing for some reason (like a libvirt restart). What should happen if vm.running == True but libvirt_domain.isActive() == False and someone calls vm.start()? Actively wait for vm.running == False? Raise an exception?

Currently the code correctly recovers from the case when VIR_DOMAIN_EVENT_STOPPED isn't delivered at all.

Maybe a better course of action would be ensuring that:

a) vm.on_domain_shutdown_coro() is called before vm.start(), if the domain was running before;
b) vm.on_domain_shutdown_coro() does nothing if it was called after the domain was started again.

This means that even if VIR_DOMAIN_EVENT_STOPPED arrives at a totally different time (or never), vm.on_domain_shutdown_coro() will still be called after domain shutdown (unless someone stops qubesd for that time).

So, to answer my question above: if vm.running == True and someone calls vm.start(), but vm.is_running() == False, call vm.on_domain_shutdown_coro() and then set vm.running = False. And in vm.on_domain_shutdown_coro(), exit early if vm.running == False. This requires careful handling of vm.startup_lock. What do you think?

> Even if it were possible, I'm not sure it would be enough, since I'm not sure libvirt guarantees that libvirt_domain.isActive() == True until VIR_DOMAIN_EVENT_STOPPED has been delivered.

To be honest I don't know. Looking at the code it isn't obvious.

> BTW, is it intentional that vm.is_running() does not imply vm.get_power_state() == 'Running'?

It is vm.is_running() == (vm.get_power_state() != 'Halted'). This is a shortcut to ease the various tests for whether domain files/config can be safely manipulated. Probably a better name would be is_active()...
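One way that a)/b) ordering could be realized, as a simplified sketch (hypothetical helper names; the real code would additionally need the careful startup_lock handling mentioned above): start() runs the pending cleanup itself if the stop event never arrived, and the cleanup coroutine bails out if there is nothing left to do.

```python
import asyncio

class VMSketch:
    """Toy model of the proposed a)/b) ordering; not the real qubes code."""

    def __init__(self):
        self.startup_lock = asyncio.Lock()
        self.running = False           # a start happened, stop not handled yet
        self._libvirt_active = False   # stand-in for libvirt_domain.isActive()

    def is_running(self):
        return self._libvirt_active

    async def _cleanup_locked(self):
        print('storage.stop() and other shutdown cleanup')
        self.running = False

    async def start(self):
        async with self.startup_lock:
            if self.running and not self.is_running():
                # (a) the domain died but its stop event was lost or is late:
                # run the cleanup before starting again.
                await self._cleanup_locked()
            self._libvirt_active = True   # pretend the domain was started
            self.running = True

    async def on_domain_shutdown_coro(self):
        async with self.startup_lock:
            if not self.running or self.is_running():
                # (b) already cleaned up, or the domain was started again in
                # the meantime: a late or duplicate event does nothing.
                return
            await self._cleanup_locked()
```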
marmarek referenced this issue in QubesOS/qubes-core-admin, Oct 12, 2017: storage/lvm: remove old volume only after successfully cloning new one #158 (merged)
See linked PR for data loss prevention.
HW42 commented Oct 13, 2017

> > vm.halt_pending was a bad name proposal. I mean setting it in vm.start() and unsetting it once we have seen and handled the shutdown (and until then vm.get_power_state() returns 'Transient' instead of 'Halted'). So better something like vm.running.
>
> Maybe it should be a lock, taken at vm.start() and released by vm.on_domain_shutdown_coro()? But I fear deadlocks if VIR_DOMAIN_EVENT_STOPPED goes missing for some reason (like a libvirt restart). What should happen if vm.running == True but libvirt_domain.isActive() == False and someone calls vm.start()? Actively wait for vm.running == False? Raise an exception?
>
> Currently the code correctly recovers from the case when VIR_DOMAIN_EVENT_STOPPED isn't delivered at all.
>
> Maybe a better course of action would be ensuring that:
>
> a) vm.on_domain_shutdown_coro() is called before vm.start(), if the domain was running before;
> b) vm.on_domain_shutdown_coro() does nothing if it was called after the domain was started again.
>
> This means that even if VIR_DOMAIN_EVENT_STOPPED arrives at a totally different time (or never), vm.on_domain_shutdown_coro() will still be called after domain shutdown (unless someone stops qubesd for that time).
>
> So, to answer my question above: if vm.running == True and someone calls vm.start(), but vm.is_running() == False, call vm.on_domain_shutdown_coro() and then set vm.running = False. And in vm.on_domain_shutdown_coro(), exit early if vm.running == False. This requires careful handling of vm.startup_lock. What do you think?

If we set vm.running = False then the cleanup will not run. Also, after thinking about it, I currently see no way to fix it such that a) the cleanup gets run and b) if the VM stops again very quickly, the cleanup is guaranteed to run only once.

The best similar solution I came up with, which is able to deal with unreliable events, involved counting starts and shutdowns. But it's not ideal and this is getting quite complex, so I thought about another strategy.

The idea is that we register for libvirt events only for the domain we are interested in. By doing so we can stop the events by unregistering. While doing this I also moved the lock logic before firing the event, so that the events are ordered meaningfully and the lock is held inside the event handlers, as is already the case for domain-spawn, domain-start, etc.

Current WIP version at https://github.com/HW42/qubes-core-admin/commits/hw42/wip/stopped-event. Attention: it's barely tested.
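For context, registering for a single domain's lifecycle events with the plain libvirt Python bindings looks roughly like this (a sketch only; the WIP branch integrates this with qubesd's asyncio loop, and 'sys-net' and the connection URI are just examples):

```python
import libvirt

# An event loop implementation must be registered before opening the
# connection, and virEventRunDefaultImpl() must be pumped (e.g. in a thread)
# for callbacks to fire; that plumbing is omitted here.
libvirt.virEventRegisterDefaultImpl()
conn = libvirt.open('xen:///')
dom = conn.lookupByName('sys-net')

def lifecycle_cb(conn, dom, event, detail, opaque):
    if event == libvirt.VIR_DOMAIN_EVENT_STOPPED:
        print(dom.name(), 'stopped; run the shutdown cleanup for this domain')

# Register only for the one domain we care about (dom), not for all domains;
# the returned id can later be passed to conn.domainEventDeregisterAny().
cb_id = conn.domainEventRegisterAny(
    dom, libvirt.VIR_DOMAIN_EVENT_ID_LIFECYCLE, lifecycle_cb, None)
```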
A commit referencing this issue was added to marmarek/qubes-core-admin on Oct 15, 2017.

marmarek closed this issue in marmarek/qubes-core-admin@021047f on Oct 16, 2017.
qubesos-bot commented Oct 16, 2017

Automated announcement from builder-github:

The package qubes-core-dom0-4.0.10-1.fc25 has been pushed to the r4.0 testing repository for dom0. To test this update, please install it with the following command:

sudo qubes-dom0-update --enablerepo=qubes-dom0-current-testing
qubesos-bot added the r4.0-dom0-cur-test label, Oct 16, 2017
qubesos-bot referenced this issue in QubesOS/updates-status, Oct 16, 2017: core-admin v4.0.10 (r4.0) #271 (closed)
HW42 commented Oct 17, 2017

Updated my branch with the improved event handling.

I not only fixed the logic flaw you noticed but also addressed another problem: the previous version relied on the callback not being called anymore after domainEventDeregisterAny(), but the libvirt documentation is not clear on whether that holds, or whether there might be some unhandled event in the queue that triggers the callback after it has been unregistered.

When I added a second lock for the stopped-event handling (instead of using startup_lock) I also noticed that the logic can be (IMO) simplified.

This version should be quite resistant to unreliable stopped events (missing, duplicated, delayed).

I have not opened a PR yet since I first want to test whether I can trigger the race (and then verify my patch).
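To illustrate the defensive side of that (a sketch with hypothetical names, not the actual branch): with a dedicated lock and a "stop pending" flag, a duplicated, delayed, or post-deregistration stopped event simply becomes a no-op.

```python
import asyncio

class StoppedEventSketch:
    """Idempotent stopped-event handling; illustrative, not the real branch."""

    def __init__(self):
        self._stopped_lock = asyncio.Lock()   # separate from startup_lock
        self._stop_pending = False

    def domain_started(self):
        self._stop_pending = True             # exactly one cleanup is now owed

    async def on_stopped(self):
        async with self._stopped_lock:
            if not self._stop_pending:
                return                        # late or duplicate event: ignore
            self._stop_pending = False
            await self._cleanup()

    async def _cleanup(self):
        print('storage.stop() etc. runs exactly once per started domain')
```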
This was referenced Oct 17, 2017
A commit referencing this issue was added to marmarek/qubes-core-admin on Oct 20, 2017.

A commit referencing this issue was added to HW42/qubes-core-admin on Oct 20, 2017.

HW42 referenced this issue in QubesOS/qubes-core-admin, Oct 20, 2017: qubes/vm: Improve stopped event handling #159 (merged)

A commit referencing this issue was added to HW42/qubes-core-admin on Oct 20, 2017.

A commit referencing this issue was added to marmarek/qubes-core-admin on Oct 21, 2017.

qubesos-bot referenced this issue in QubesOS/updates-status, Oct 21, 2017: core-admin v4.0.11 (r4.0) #284 (closed)

A commit referencing this issue was added to HW42/qubes-core-admin on Oct 21, 2017.
qubesos-bot commented Oct 30, 2017

Automated announcement from builder-github:

The package qubes-core-dom0-4.0.11-1.fc25 has been pushed to the r4.0 stable repository for dom0. To install this update, please use the standard update command:

sudo qubes-dom0-update

Or update dom0 via Qubes Manager.
doug1 commented Oct 10, 2017 (edited once by doug1, most recently Oct 10, 2017)

R4-rc1 plus some updates from testing.

While attempting to install Windows 7 in a standalone HVM, I performed many start/stop/kill cycles, which possibly got things into a strange state. For whatever reason, my Win7 root volume was missing its snapshot volume when it should have been there. The only manual LVM command I had executed was the GUI-recommended LVM resize on the root volume about an hour earlier; after that I performed several more stop/start cycles while installing Win7 patches.

After a later "stop" command, I lost my root volume completely and was unable to start the VM again. I believe that ThinVolume.stop() called ThinVolume._commit(), which does not check that the snapshot volume (self._vid_snap) exists (i.e. is_dirty) before removing the main volume (self.vid) and cloning the snapshot:

https://github.com/QubesOS/qubes-core-admin/blob/b8d45c214d96516f807fa3d4f9f4bb133b1ef4d6/qubes/storage/lvm.py#L274

I believe that adding a check here would prevent anyone who gets into this weird state from losing data in the future. Thank you. -Doug
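A sketch of the check being suggested (illustrative only; qubes_lvm() and the tiny Volume class below are stand-ins for the real ThinVolume internals, and is_dirty here simply means the -snap volume exists):

```python
def qubes_lvm(cmd):
    """Stand-in for the driver's LVM helper; here it only echoes the request."""
    print('lvm op:', cmd)

class Volume:
    def __init__(self, vid, existing_volumes):
        self.vid = vid
        self.vid_snap = vid + '-snap'
        self._existing = existing_volumes

    def is_dirty(self):
        # Dirty means the snapshot volume exists and may hold new data.
        return self.vid_snap in self._existing

def commit(volume):
    if not volume.is_dirty():
        # No snapshot to promote: bail out instead of removing volume.vid,
        # which would be the only remaining copy of the data.
        return False
    qubes_lvm(['remove', volume.vid])
    qubes_lvm(['clone', volume.vid_snap, volume.vid])
    qubes_lvm(['remove', volume.vid_snap])
    return True
```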