Multiply speed of backups by not using / storage during backups #1652
Comments
Rudd-O commented on Jan 17, 2016
A better way to do this would be to entirely forego the HMAC via openssl, and do the HMAC within Python. That way the HMAC can be computed as you read from the original file (or the encrypted output from openssl), stored in a memory variable, and then added to the backup output. I just don't know how to do that in Python, but I know it should be doable.
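A minimal sketch of what that could look like with the standard-library hmac module, assuming the data is read from a pipe (for example openssl's stdout); the function name, key handling, and digest choice are illustrative assumptions, not backup.py code:

```python
import hashlib
import hmac

def copy_with_hmac(in_stream, out_stream, key: bytes, bufsize=64 * 1024) -> str:
    """Copy in_stream to out_stream, computing the HMAC on the fly in memory."""
    mac = hmac.new(key, digestmod=hashlib.sha512)
    while True:
        block = in_stream.read(bufsize)
        if not block:
            break
        mac.update(block)        # HMAC accumulates as the data streams through
        out_stream.write(block)
    return mac.hexdigest()       # can then be appended to the backup output
```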
andrewdavidwong (Member) commented on Jan 17, 2016
A better way to do this would be to entirely forego the HMAC via openssl, and do the HMAC within Python.
Would this preserve the ability to do manual HMAC verification with openssl? One of the main benefits of the current backup system is that it allows non-programmer users to recover their data relatively easily using common tools like tar and openssl on any Linux system.
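Whether that property survives depends only on both sides agreeing on the algorithm and key. A small compatibility check along these lines could demonstrate it; the `openssl dgst ... -hmac` invocation and its "...= <hex>" output format are assumptions about the external tool, not about backup.py:

```python
import hashlib
import hmac
import subprocess

def openssl_hmac_hex(data: bytes, passphrase: str) -> str:
    """HMAC of data as the openssl command-line tool reports it."""
    out = subprocess.run(
        ["openssl", "dgst", "-sha512", "-hmac", passphrase],
        input=data, capture_output=True, check=True)
    return out.stdout.decode().rsplit("= ", 1)[1].strip()

data = b"some backup chunk"
passphrase = "secret"
python_hex = hmac.new(passphrase.encode(), data, hashlib.sha512).hexdigest()
assert python_hex == openssl_hmac_hex(data, passphrase)   # same digest either way
```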
marmarek (Member) commented on Jan 17, 2016
The problem is that you don't know the chunk size (needed in the tar header) before it's finished, mostly because of compression, but also because of sparse files.
As for HMAC calculation, it isn't such a big issue, as you can simply read from openssl's stdout (a pipe). Calculating the HMAC in Python would also be fine, especially after solving #971.
Rudd-O commented on Jan 17, 2016
@axon-qubes the only change is how the tar file is written: not using tar, but using tarfile.py. Everything else stays the same. Naturally, the HMAC in Python should output the same as the openssl HMAC.
@marmarek, can you explain that tar header issue? I lack an understanding of it. Thanks.
marmarek (Member) commented on Jan 18, 2016
File list of the outer archive looks like this:
vm1/private.img.000
vm1/private.img.000.hmac
vm1/private.img.001
vm1/private.img.001.hmac
vm1/private.img.002
vm1/private.img.002.hmac
...
To write any of these to the output stream, you first need to send a file header, which contains the file size. You don't know the file size until that file is fully written, so you can't pipe it directly from the inner tar layer. Theoretically you could cache it in memory instead of /var/tmp, but IMO that isn't a good idea: it can lead to out-of-memory. It also isn't good for speed: right now the backup tool creates the chunk files and sends them to the output device at the same time (so while chunk N+1 is being prepared, chunk N is written to the output device). Piping those two stages together would mean that one would have to wait for the other.
But if this is only about SSD lifetime, and you have enough RAM, maybe using tmpfs would be enough? /tmp is mounted as tmpfs by default, so simply an option to specify the temporary directory would do the job. What do you think, @Rudd-O?
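A short illustration of that constraint, using the standard tarfile module (the names and chunk contents here are made up): TarInfo.size has to be filled in before addfile() can emit the header, so each chunk must be complete somewhere before it can enter the outer archive.

```python
import io
import tarfile

def add_chunk(outer: tarfile.TarFile, name: str, chunk: bytes) -> None:
    info = tarfile.TarInfo(name=name)
    info.size = len(chunk)                   # the size goes into the tar header
    outer.addfile(info, io.BytesIO(chunk))   # only then can the data follow

with tarfile.open("backup.tar", mode="w") as outer:
    add_chunk(outer, "vm1/private.img.000", b"<compressed, encrypted chunk>")
    add_chunk(outer, "vm1/private.img.000.hmac", b"<hmac of that chunk>\n")
```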
Rudd-O commented on Jan 18, 2016
The big files are always max 100 MB in size. It's even hardcoded in line 632 of backup.py. This isn't going to lead anyone into an OOM situation.
If you are worried about garbage collection churn, just allocate the 100 MB as a buffer, and then read into the buffer until the buffer is full. Right now, you're writing to chunkfile_p -- why not modify wait_backup_feedback to write to the buffer instead?
1. Allocate a bytearray(1024x1024x1024).
2. Pass the bytearray to wait_backup_feedback.
3. Make wait_backup_feedback write the output of in_stream into the bytearray (len = in_stream.readinto(bytearray)).
4. Make wait_backup_feedback return "size_limit" if the len of the read is 1024x1024x1024.
5. Now you have the data, and you can compute the file name xxxx.XXX, as well as the tarinfo size / attributes. Construct a tarinfo structure, and write it to disk (see the sketch after this list).
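A rough sketch of those steps; the name wait_backup_feedback and the chunk naming come from the discussion, but the signatures and the 100 MB constant here are assumptions rather than the actual backup.py code:

```python
import io
import tarfile

CHUNK_SIZE = 100 * 1024 * 1024             # the hardcoded 100 MB chunk size

def wait_backup_feedback(in_stream, buf: bytearray) -> int:
    """Fill buf from in_stream; return how many bytes were actually read."""
    view, filled = memoryview(buf), 0
    while filled < len(buf):
        n = in_stream.readinto(view[filled:])
        if not n:                          # EOF before the buffer filled up
            break
        filled += n
    return filled                          # == len(buf) means "size_limit"

def write_chunk(outer: tarfile.TarFile, name: str, buf: bytearray, length: int):
    info = tarfile.TarInfo(name=name)      # e.g. "vm1/private.img.000"
    info.size = length                     # known, because the chunk is in RAM
    outer.addfile(info, io.BytesIO(bytes(buf[:length])))  # copies; fine for a sketch
```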
To improve parallelism and therefore performance, you can push the tarinfo structure into a Queue.Queue of a certain maximum depth, and run the "write the outer tar" process in a thread, consuming tarinfos from the Queue as you go along. You would need a circular buffer, but it's doable. In fact, the hmac process can also read from that buffer, as well as produce tarinfos that get pushed into that queue, so the tarinfo gets computed and written in parallel. I will shortly post pseudocode of that.
This way, the only limits are:
- the read speed of the device backing the VMs
- the write speed of the device receiving the backup
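A rough illustration of the Queue-plus-writer-thread idea above (not the promised pseudocode, and using the Python 3 queue module rather than the Queue.Queue of the Python 2 code of that era): producers push finished (TarInfo, bytes) pairs into a bounded queue and a single thread drains them into the outer tar, so in-flight memory is capped by the queue depth times the chunk size.

```python
import io
import queue
import tarfile
import threading

def start_tar_writer(outer: tarfile.TarFile, depth: int = 4):
    chunks = queue.Queue(maxsize=depth)    # bounds the number of in-flight chunks

    def writer():
        while True:
            item = chunks.get()
            if item is None:               # sentinel: backup finished
                break
            info, data = item
            outer.addfile(info, io.BytesIO(data))

    thread = threading.Thread(target=writer)
    thread.start()
    return chunks, thread

# producer side: chunks.put((tarinfo, chunk_bytes))
# shutdown:      chunks.put(None); thread.join()
```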
marmarek (Member) commented on Jan 18, 2016
The big files are always max 100 MB in size. It's even hardcoded in line 632 of backup.py. This isn't going to lead anyone into an OOM situation.
But you can have multiple of those files. The maximum number is also hardcoded: 10. So it is 1 GB.
To improve parallelism and therefore performance, you can push the tarinfo structure into a Queue.Queue of a certain maximum depth, and run the "write the outer tar" process into a thread,
This is actually very similar to the current implementation.
Anyway, using tmpfs for those files would give the same performance benefits, and that code will also be much easier to understand and debug.
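A tiny sketch of the "option to specify temporary directory" idea, with a hypothetical helper name: pointing the staging directory at /tmp (tmpfs by default) keeps the chunk files off the SSD without any other change to the backup flow.

```python
import os
import tempfile

def make_staging_dir(base: str = "/tmp") -> str:
    """Create the chunk staging directory, defaulting to the tmpfs-backed /tmp."""
    return tempfile.mkdtemp(prefix="qvm-backup-", dir=base)

staging = make_staging_dir(os.environ.get("TMPDIR", "/tmp"))
```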
Rudd-O commented on Jan 19, 2016
As promised:
https://gist.github.com/Rudd-O/da8bc169e2cccb3a3707
This thing goes faster than qvm-backup, writes nothing to SSD or HDD, and deals with parallelism (mostly) correctly. You can plug more parallel tasks and connect them with stdin/stdout as you see fit. My choice was lambdas, but hey, what else could I choose given the circumstances?
Given that we're reading from a pipe, my desires to use mmap() or something like that went poof in the air. Very sad. But anyway, I spent a few hours, sorry for the time it took me to reply, to get this together. Here is a more parallel, threadsafe, error-detecting, shitty, incomplete qvm-backup. Ouch.
The big feature of this chunkacode is that it bounds memory consumption to a multiple of (Numtasks*Sizereadbuffer), and you can regulate those parameters as the end user (you shouldn't have to, though). With numtasks=4, you can have at most 1 Sizereadbuffer being read into, and 3 Sizereadbuffers being processed and written to the tarfile at the same time (that's 1 buffer being filled, 1 or 2 separate buffers being written to the tarfile, and the complement of 1 or 2 buffers being hmac'd on its way to the tarfile). The number of tasks (excluding the read task, which happens in the main thread) never exceeds 4 under the default values.
My experiments locally -- look at the time.sleep(5)'s to simulate slow writes -- show that with Sizereadbuffer=256MB and numtasks=4, you never go above 600 MB RAM (all swappable). If, as a user, you want the thing to take 1 GB, 256 MB, or whatever, you can tune those parameters. Assuming a decent-sized swap file, and also assuming worst conditions, there will be no OOM with numtasks=4 and the 1 GB read size, just a slowdown.
The right thing to do here is to allocate N threads' memory buffers, but I could not find a way to make that work with tarfile.py. That sucks, really, but it's only a limitation that can be removed by writing the right code; the parallelism framework is there to make it work. The optimal values are N being the lower of the number of processors and the number of Sizereadbuffers that fit in the memory of the dom0 divided by half the number of processors, and of course Sizereadbuffer being the largest chunk of contiguously allocatable RAM without swapping, divided by two.
Using tmpfs would probably help a lot. But there is no reason that I, the end user, should shadow /var/tmp on the whole system with a tmpfs just so I can run a backup. This should work right out of the box. So let's not force hacks onto users.
(Yes, there's the pesky "you don't have a free gig of RAM? you're kinda fucked now" problem with my code. That was sort of already a problem with the current code -- without that gig of RAM, stuff gets swapped out of existence and into a crawling user experience anyway -- but I'll admit it's a problem with this code too. It can be fixed by reducing the chunk size. That reduces the total size of what can be backed up, of course (1024*1024*1024*256, or whatever value, instead of *999), but then again that limitation was an outstanding problem in qvm-backup. Also, smaller chunks mean targeted tampering with data has less of an impact on uncompressed backups.)
marmarek (Member) commented on Jan 19, 2016
But there is no reason me, the end user, should shadow /var/tmp on the whole system with a tmpfs, just so I can run a backup. This should work right out of the box. So let's not force hacks down users' experiences.
Yes, of course. I've written: "/tmp is mounted as tmpfs by default, so simply an option to specify temporary directory would do the job. What do you think @Rudd-O?" Maybe it should even have some automatic detection, based on available RAM?
deals with parallelism (mostly) correctly.
I don't see how the order of files in the output archive would be preserved - probably it isn't. But that would be a minor change - simply plugging a Queue in somewhere there (instead of the lock in SerializedTarWriter?).
Rudd-O commented on Jan 19, 2016
Yes, correct, you could just plug a queue, which would of course make the whole write block on the longest write and you'd lose some performance that way. But the question is why. The order doesn't matter, does it? If it does, I would like to know why and, of course, then it's trivial to revise the code for that purpose.
A meh workaround would be to specify the tempdir in the current script, or to make it honor TMPDIR. That might already be the case. I did not try that.
A commit was added to marmarek/old-qubes-core-admin that referenced this issue on Jan 20, 2016.
marmarek referenced this issue in marmarek/old-qubes-core-admin on Jan 20, 2016: "backup: Allow to specify custom temporary directory" #10 (merged).
marmarek (Member) commented on Jan 20, 2016
Yes, correct, you could just plug a queue, which would of course make the whole write block on the longest write and you'd lose some performance that way. But the question is why. The order doesn't matter, does it? If it does, I would like to know why and, of course, then it's trivial to revise the code for that purpose.
Generally to ease the restore process. If the hmac file comes just after its data file, you can verify the data file immediately after receiving it, and abort if it's invalid or missing. Otherwise it would be somewhat trickier to make sure that all the files were verified (and harder to audit that code). This way you can also extract data files as they come - in parallel with retrieving the backup stream - because you can assume that the data files are in order (and abort if they aren't).
A meh workaround would be to specify the tempdir in the current script, or to make it honor TMPDIR. That might already be the case. I did not try that.
No it doesn't honor TMPDIR (https://github.com/QubesOS/qubes-core-admin/blob/master/core/backup.py#L513)...
Anyway take a look at this:
https://github.com/marmarek/qubes-core-admin/pull/10
On a 3.5 GB sample VM, SSD disk, no compression, no encryption, it makes a difference, but not that big: 2:53 vs 3:19.
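A sketch of the restore-side argument above (helper names and digest are assumptions, not the restore code): because each .hmac follows its data file in the stream, a chunk can be verified, and handed off for extraction, as soon as it arrives, and the restore can abort immediately on a missing or invalid HMAC.

```python
import hashlib
import hmac

def verify_stream(chunks, key: bytes):
    """chunks: iterable of (name, payload) pairs in archive order."""
    pending = None                       # (name, payload) awaiting its .hmac
    for name, payload in chunks:
        if name.endswith(".hmac"):
            if pending is None or name != pending[0] + ".hmac":
                raise ValueError("HMAC file out of order: %s" % name)
            expected = payload.decode().strip()
            actual = hmac.new(key, pending[1], hashlib.sha512).hexdigest()
            if not hmac.compare_digest(actual, expected):
                raise ValueError("Invalid HMAC for %s" % pending[0])
            yield pending                # safe to extract this chunk now
            pending = None
        else:
            if pending is not None:
                raise ValueError("Missing HMAC for %s" % pending[0])
            pending = (name, payload)
    if pending is not None:
        raise ValueError("Missing HMAC for %s" % pending[0])
```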
Rudd-O commented on Jan 20, 2016
Ah. In that case ordering would work fine.
So, any chance that my proto code could be adapted to qubes-backup?
marmarek (Member) commented on Jan 20, 2016
So, any chance that my proto code could be adapted to qubes-backup?
Maybe as part of work on #971 (together with #1213), at least parts of it. Some concerns about your approach:
- is the tar format the same (as tar -cO --posix)? (just requires checking; see the sketch after this list)
- what about non-SSD disks? concurrent handling of multiple files may even decrease performance...
- having temporary data as files makes debugging and error recovery easier (error recovery actually matters more for restoring data than for making the backup); and for debugging errors like using too much memory, if those were simple files in /tmp, it would be trivial to see what is there...
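On the first point, tarfile can be told which archive format to emit, so matching the current tar -cO --posix output is mostly a matter of picking the right format= value and comparing headers on a sample file. A minimal sketch (streaming mode, helper name assumed):

```python
import tarfile

def open_outer(fileobj, posix: bool = True) -> tarfile.TarFile:
    fmt = tarfile.PAX_FORMAT if posix else tarfile.GNU_FORMAT
    # "w|" streams straight to fileobj (a pipe or file), no seeking required
    return tarfile.open(fileobj=fileobj, mode="w|", format=fmt)
```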
Anyway, surely not for the final R3.1, but I hope it's obvious.
Rudd-O commented on Jan 20, 2016
- The tar format can be chosen to be ustar or posix. It's in the library.
- Handling files concurrently when the read sizes are 100 MB (they are) poses no more effort on rotating disks than on SSDs. It would be a different story if the reads were of many tiny files that cause the heads to seek like crazy. Reading several files at the same time also has the virtue that the I/O elevator can be more efficient and read more bytes per second, increasing throughput. If this would decrease the performance of other tasks on the machine, then what's likely needed is to set the disk priority of the backup process to idle or batch.
- Writing temporary files that get deleted doesn't make it easier to debug the problem; backtraces, debuggers and tests do. Writing temporary files during the backup also doesn't help with restoring data (though perhaps writing them during the restore might).
I understand it can't be done for 3.1, though. No worries.
The general issue here is not just to make the backup faster, but to save the machine from doing unnecessary work. Writing to /tmp saves work but -- because a copy must be written and then read, adding more context switches -- not as much as loading the data into buffers and writing it directly.
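For the disk-priority point above, one way to drop the backup's I/O priority from Python is via psutil (an assumption: psutil is not necessarily available in dom0; this is the in-process equivalent of running under ionice -c 3):

```python
import psutil

def lower_io_priority() -> None:
    # Idle I/O class: the backup only gets disk time nothing else wants,
    # so it cannot starve interactive VMs on a rotating disk.
    psutil.Process().ionice(psutil.IOPRIO_CLASS_IDLE)
```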
Rudd-O commented on Jan 17, 2016
The bottleneck during backups is the write to /var/tmp. This isn't necessary.
What I learned going through the backup code is fairly simple. You tar a file chunk into a file within /var/tmp (possibly with a compressor pipe and possibly with an encryptor pipe), then you hmac the file using openssl.
That isn't needed. You can use the tarfile Python module to produce the proper tar file (no need to execute tar into a temporary directory), push that through openssl encrypt, and also push it (with a tee) into both hmac and the final file. The tarfile module lets you specify the path conversion, so you don't need to deal with the string conversion specification on the tar_sparse command line.
This would easily double the speed at which backups happen on my machine, perhaps more. Avoiding writing temporary files in /var/tmp is actually a huge win.
I can't do the code myself, for a variety of reasons beyond the scope of this issue, but I'm happy to explain the workings of the tarfile module.
The pipeline would need to be rearranged accordingly. This is trickier than I portrayed it, but it will provide a several-fold improvement in backup performance, and it will save the lifetime of the SSDs for everyone running Qubes on an SSD.
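A minimal end-to-end sketch of the kind of pipeline described in this issue, under stated assumptions: the openssl invocation, buffer size, and function names are illustrative, and real code would still need the chunking, per-chunk HMAC, and error handling discussed in the comments above.

```python
import hashlib
import hmac
import subprocess
import tarfile
import threading

def backup(paths, out_path: str, passphrase: str) -> str:
    # tarfile -> openssl enc -> (tee) -> HMAC + output file; nothing in /var/tmp
    enc = subprocess.Popen(
        ["openssl", "enc", "-aes-256-cbc", "-pass", "pass:" + passphrase],
        stdin=subprocess.PIPE, stdout=subprocess.PIPE)
    mac = hmac.new(passphrase.encode(), digestmod=hashlib.sha512)

    def tee():                                # drain the encrypted stream
        with open(out_path, "wb") as out:
            while True:
                block = enc.stdout.read(64 * 1024)
                if not block:
                    break
                mac.update(block)             # "tee" branch 1: the HMAC
                out.write(block)              # "tee" branch 2: the backup file

    drainer = threading.Thread(target=tee)
    drainer.start()
    with tarfile.open(fileobj=enc.stdin, mode="w|") as tar:
        for path in paths:
            tar.add(path)                     # arcname=... could rewrite paths here
    enc.stdin.close()                         # EOF lets openssl flush and exit
    drainer.join()
    return mac.hexdigest()                    # stored alongside the backup
```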