Backup restore and template installation should write directly to LVM volumes #3230
Comments
andrewdavidwong added the bug, C: core labels Oct 28, 2017
andrewdavidwong added this to the Release 4.0 milestone Oct 28, 2017
marmarek
Oct 29, 2017
Member
Template installation is much less of an issue, as it is limited by the template builder to 10GB. It is also much trickier to solve, as RPM does not like to write directly to a block device (or rather, to a socket/pipe - remember that it now uses the Admin API, so you can also install a template from a management VM).
As for backup restore - only parts are stored as files (100MB each), and in parallel they are uploaded to the actual VM volume (using the Admin API). But currently there is no limit on how many such parts are queued. This is because (currently) you can't control the speed of archive extraction (either tar or qfile-unpacker). That would require either adding an additional layer (a cat-like process, used to pause data input when needed), or somehow instructing the extractor process to pause (SIGSTOP/SIGCONT? that could be fragile...).
Not storing those fragments as files at all would be very tricky, because you need to verify a fragment before doing anything with it, and you can do that only when the full fragment is extracted. You cannot (should not) start parsing its content in any way before verification.
An alternative could be using tmpfs, or using memory directly (a Python object). But that could easily lead to OOM, especially when restoring using a VM (aka "paranoid mode").
qubesuser
Oct 29, 2017
I think one could write the fragments directly to an LVM volume (for instance using tar --to-stdout and piping to dd), verify either by reading from the VM volume or by teeing the data to the verification, and then rename the VM volume to $vm-private if verification passes.
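A rough sketch of that pipeline (pool, volume, and file names, the size, and the hash plumbing are assumptions for illustration, not the actual qvm-backup-restore code) could look like:

```sh
# Create a temporary thin volume in the pool (names and size are hypothetical).
lvcreate -T qubes_dom0/pool00 -V 10G -n restore-tmp-private

# Stream one archive member straight onto the volume instead of staging it as a file.
tar --extract --to-stdout -f backup.tar vm1/private.img \
    | dd of=/dev/qubes_dom0/restore-tmp-private bs=1M conv=fsync status=none

# Verify by reading the data back from the volume; the data length (here a
# placeholder of 2 GiB) would have to be recorded alongside the backup.
actual=$(head -c 2147483648 /dev/qubes_dom0/restore-tmp-private | sha512sum | cut -d' ' -f1)

# $expected_sha512 is a hypothetical variable holding the expected hash.
# Rename the volume into place only if verification passes; drop it otherwise.
if [ "$actual" = "$expected_sha512" ]; then
    lvrename qubes_dom0/restore-tmp-private qubes_dom0/vm-vm1-private
else
    lvremove -f qubes_dom0/restore-tmp-private
fi
```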
For templates, probably the best solution is not to ship them in the RPMs, but rather to distribute them like the installation ISOs: provide only a download link and hash in the RPM, have the RPM install script download the image and pipe it via qrexec to dom0 while the checksum is verified in parallel, and finalize the install only if the checksum verification succeeds.
marmarek
Oct 29, 2017
Member
This is too late if you want to keep clean task separation. The principle is to do nothing with the data until it gets verified. While the current implementation may indeed allow that (not using the volume until it gets renamed to $vm-private), we should not make such an assumption. Also keep in mind that the backup restore tool should not assume direct access to LVM. It uses the Admin API to upload volume content. So such a mechanism would require introducing some additional action to rename a volume, or separate "upload" and "commit" actions. Reading the volume back for verification is intentionally not supported through the Admin API, but that isn't a problem here, because you can calculate the data hash on the fly (and in fact the scrypt tool we use there already does that).
There is also one technical detail - you need to somehow pass individual fragments to scrypt for decryption and verification. While its output could be redirected somewhere, for input you need to separate individual VMs' volumes (and their fragments), so just tar --to-stdout isn't feasible, because you'll get all of them concatenated.
The backup archive is split into fragments exactly to allow limiting the temporary space needed to make a backup and to restore it. The latter is not implemented, but the current architecture should allow it.
na--
Oct 29, 2017
@qubesuser: I think that if the other issue you reported is fixed, this one would not be that big of a deal.
@marmarek: If this is up-to-date, that means that there's a tar extraction of the huge backup file at the beginning. The tar options --checkpoint= and --checkpoint-action=exec=... can be used to limit the speed of the archive extraction with some artificial sleep. It's an ugly hack, but I use it for a task that pipes tar extraction of huge files into /tmp and processes them as they are being extracted.
Here's the code I use: tar --checkpoint=20000 --checkpoint-action=exec='sleep "$(stat -f --format="(((%b-%a)/%b)^5)*30" /tmp | bc -l)"' --extract --verbose __other_tar_args__ | program_to_process_extracted_files
Ugly as sin, but it causes tar to sleep progressively more as /tmp fills up, so that program_to_process_extracted_files can catch up with processing and deleting the already extracted files. For more complex flow control logic, tar can call an external script that implements it, for example "pause extraction of file n until file n-2 is processed and removed" or something of the sort, which should be much less fragile than signalling tar externally.
Edit: link to the tar checkpoint documentation: https://www.gnu.org/software/tar/manual/html_section/tar_26.html and https://www.gnu.org/software/tar/manual/html_section/tar_29.html
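For illustration, a hypothetical external checkpoint script along those lines (the staging path and the limit are made up, not taken from Qubes) might be:

```sh
#!/bin/sh
# throttle-extract.sh - hypothetical checkpoint hook, invoked e.g. as:
#   tar --checkpoint=200 --checkpoint-action=exec=./throttle-extract.sh \
#       --extract -f backup.tar -C /var/tmp/restore-staging
# Wait while more than MAX_PENDING already-extracted fragments are still
# sitting in the staging directory, waiting to be processed and deleted.
STAGING=/var/tmp/restore-staging
MAX_PENDING=2

while [ "$(ls -1 "$STAGING" 2>/dev/null | wc -l)" -gt "$MAX_PENDING" ]; do
    sleep 1
done
```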
na--
referenced this issue
Oct 29, 2017
Closed
dom0 root filesystem not mounted with discard on thin provisioning #3226
marmarek
Oct 29, 2017
Member
> The tar options --checkpoint= and --checkpoint-action=exec=... can be used to limit the speed of the archive extraction with some artificial sleep. It's an ugly hack but I use it for a task that needs piping tar extraction of huge files in /tmp and processing them as they are being extracted.
Tar is used there only if the backup file is exposed directly to dom0. If it is loaded from some VM (like sys-usb), then qfile-unpacker is used. But in that case we could add such an option ourselves.
qubesuser
Oct 30, 2017
Yeah, it would need some sort of upload+commit interface (with hashes computed on the fly): ideally one where a qrexec connection is kept open until a commit command is sent, and the VM/volume is deleted automatically when the connection is broken or upon booting the system (to handle the system being hard rebooted during restore).
Not totally sure how to set up the input with tar. Maybe it could be possible to create the private.img.XXX files as UNIX sockets or fifos and convince tar to write into them instead of recreating them? (perhaps tar --overwrite does that, not sure). Or use tar --to-stdout with a single-file filelist, if tar can seek efficiently (but this requires that the input be a file and not a pipe from another VM, unless tar is run in the other VM). Alternatively, one could even just use tar --to-stdout with all the files and have them concatenated, and then split them afterwards, since the size of each fragment is known (or can be determined by separately running tar -t).
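A sketch of that last variant, assuming the fragment sizes were collected beforehand (for example from tar -tvf); all names and sizes here are illustrative only:

```sh
# Hypothetical fragment sizes in bytes, e.g. parsed from `tar -tvf backup.tar`.
sizes="104857600 104857600 52428800"

tar --extract --to-stdout -f backup.tar \
    vm1/private.img.000 vm1/private.img.001 vm1/private.img.002 |
{
    i=0
    for size in $sizes; do
        # Read exactly $size bytes of the concatenated stream into one fragment file.
        dd of="fragment.$i" bs="$size" count=1 iflag=fullblock status=none
        i=$((i + 1))
    done
}
```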
marmarek
Oct 30, 2017
Member
Generally it is too late for major changes to the backup (or other) architecture for Qubes 4.0. Upload+commit may be a good idea for Qubes 4.1. Splitting concatenated files, or placing fifos for tar to write to, is IMO too fragile to consider at all. The backup mechanism is complex enough already.
One thing we may consider at this stage is slowing down tar/qfile-unpacker enough to not require too much space in /tmp. --checkpoint-action is interesting, but the exact command there needs to be adjusted. I'd put something there that is controlled from the Python script, and from there make sure that no more than X files/size units are waiting to be handled. For example: read 1 byte from a pipe at each checkpoint, and from Python write 1 byte after each file is handled. Also put X bytes into the pipe at the beginning. A classic token solution.
What is the "checkpoint" ("record") unit? I thought it might be one tar block (512 bytes), but according to a simple test with --checkpoint=1 it is closer to "a file" (though sometimes two small files fit between checkpoints). Do you know of any documentation about this? @na--
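A minimal sketch of that token idea, assuming a FIFO shared between the extractor and the restore process (all paths, the token count, and the checkpoint interval are hypothetical):

```sh
# Keep the FIFO open read-write from the controlling script, so token bytes
# persist between reads and readers block instead of hitting EOF.
mkfifo /run/restore-tokens
exec 3<>/run/restore-tokens

# Put X tokens (here 4) into the pipe up front.
printf 'xxxx' >&3

# Extractor side: tar blocks at every checkpoint until a token byte is available.
mkdir -p /var/tmp/restore-staging
tar --checkpoint=200 \
    --checkpoint-action=exec='dd bs=1 count=1 if=/run/restore-tokens of=/dev/null status=none' \
    --extract -f backup.tar -C /var/tmp/restore-staging &

# Consumer side: after each fragment has been verified and uploaded,
# return one token so tar may proceed:
#   printf 'x' >&3
```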
na--
Oct 31, 2017
@marmarek: sorry, I'm not sure. I've read only what's in the tar manual and it's not very specific. I remember fiddling with the options until it was good enough and leaving it at that, since in my case it was not for something very important. I thought that a record is one tar block, but apparently not.
jpouellet
Nov 9, 2017
Contributor
@qubesuser can you elaborate on what exactly you envision an upload+commit interface doing and looking like?
Just the ability to write a stream directly to pool storage with some temporary name guaranteed to never be used by any VM, returning perhaps some token to be used by admin.vm.volume.CloneTo or such?
qubesuser commented Oct 27, 2017
Qubes OS version:
R4.0-rc2
Steps to reproduce the behavior:
Expected behavior:
dom0 disk space usage does not change significantly.
Backups with VMs larger than half the size of the disk can be restored.
Actual behavior:
dom0 disk space usage changes significantly because the data is first written to a file in the dom0 root and then copied over.
Backups with VMs larger than half the size of the disk cannot be restored, since there is not enough disk space for both the copy on the dom0 root and the LVM volume.
General notes:
This is a big issue for restoring large VMs. Fixing it would also allow using a smaller dom0 root, rather than sizing it to be as large as the thin pool, saving the GBs wasted on filesystem structures for an unnecessarily big filesystem (that would also require making sure log files don't grow out of control).