Improve qvm-backup key derivation/management #971
Comments
marmarek
added
enhancement
C: core
labels
May 3, 2015
marmarek
added this to the Release 3.0 milestone
May 3, 2015
marmarek
modified the milestones:
Release 3.1,
Release 3.0
May 13, 2015
marmarek
added
the
help wanted
label
Jun 15, 2015
marmarek
Jun 15, 2015
Member
We need someone who knows cryptography to check the discussed key derivation scheme, and especially to answer the questions in this message: https://groups.google.com/d/msg/qubes-devel/CZ7WRwLXcnk/1N0sYf6lVvUJ
andrewdavidwong
Oct 22, 2015
Member
We may never get the assistance of a cryptographer on this. For now, we should at least pass the -md option to openssl and specify sha256 (to match the key length of aes256). In other words:
openssl enc -md sha256 -e -aes-256 [...]
This would be better than nothing. Currently, md5 is used by default, which is actually weakening security for users who select passphrases larger than 128 bits (probably almost all Qubes users).
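For context, when `openssl enc` is not given a raw key via -K, it derives the key and IV from the passphrase with OpenSSL's EVP_BytesToKey. A minimal Python sketch of that derivation (with count = 1, as `openssl enc` uses), showing how the -md digest choice changes the derived key; the passphrase and salt below are made-up example values:

```python
import hashlib

def evp_bytes_to_key(password: bytes, salt: bytes, key_len: int, iv_len: int,
                     md: str = "md5") -> tuple[bytes, bytes]:
    # Sketch of OpenSSL's EVP_BytesToKey with count=1: each digest-sized
    # block of key material is one hash of (previous block || password || salt).
    d, material = b"", b""
    while len(material) < key_len + iv_len:
        d = hashlib.new(md, d + password + salt).digest()
        material += d
    return material[:key_len], material[key_len:key_len + iv_len]

# aes-256-cbc needs a 32-byte key and a 16-byte IV; openssl uses an 8-byte salt
key_md5, _ = evp_bytes_to_key(b"example passphrase", b"12345678", 32, 16, "md5")
key_sha, _ = evp_bytes_to_key(b"example passphrase", b"12345678", 32, 16, "sha256")
```

With md5, the first 16 bytes of the AES-256 key are a single md5 digest of passphrase and salt, which is the weakness discussed above.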
marmarek
Oct 22, 2015
Member
On Thu, Oct 22, 2015 at 02:54:10AM -0700, Axon wrote:
We may never get the assistance of a cryptographer on this. For now, we should at least pass the -md option to openssl and specify sha256 (to match the key length of aes256). In other words: openssl enc -md sha256 -e -aes-256 [...] This would be better than nothing. Currently, md5 is used by default, which is actually weakening security for users who select passphrases larger than 128 bits (probably almost all Qubes users).
As the user can (theoretically) freely choose the encryption algorithm, sha256 alone may not be a universal option. Do you know any way to get the key length from an openssl cipher name? Or should we just search for some numbers in it?...
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
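As a rough illustration of the "search for some numbers" idea from the comment above (purely a hypothetical sketch; names like aes-256-cbc encode the key size, but not all cipher names do):

```python
import re

def key_bits_from_cipher_name(name: str):
    # Hypothetical helper: extract the key size from an OpenSSL cipher
    # name such as "aes-256-cbc" or "camellia-128-cbc". Returns None for
    # names without an explicit size (e.g. "des3", "bf-cbc"), which would
    # still require a lookup table.
    m = re.search(r"\b(128|192|256)\b", name)
    return int(m.group(1)) if m else None
```

This is exactly the fragility hinted at in the question: it works for the common AES/Camellia names but not universally.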
andrewdavidwong
Oct 23, 2015
Member
As the user can (theoretically) freely choose encryption algorithm, just sha256 may not be a universal option.
Two ideas:
- Set the default to sha256, but give the user the option to change it. If the user wants to choose a different encryption algorithm, then it's up to her to also choose an appropriate message digest algorithm to accompany it.
- Hard code sha512 for everything, then let each encryption algorithm take as many truncated bits as it can (assuming that would work).
Do you know any way to get key length from openssl cipher name? Or just search for some numbers in it?...
I'm not aware of any. That would be my guess too...
andrewdavidwong
Oct 23, 2015
Member
Based on this and after further research, I think it would be best to go with GPG for encryption (but not for integrity-verification).
Ideally, we would read from the user-defined preferences in ~/.gnupg/gpg.conf in dom0. This would allow users to set the following options (among others):
--s2k-cipher-algo name
Use name as the cipher algorithm used to protect secret keys. The default cipher
is CAST5. This cipher is also used for conventional encryption if
--personal-cipher-preferences and --cipher-algo is not given.
--s2k-digest-algo name
Use name as the digest algorithm used to mangle the passphrases. The default
algorithm is SHA-1.
--s2k-mode n
Selects how passphrases are mangled. If n is 0 a plain passphrase (which is not
recommended) will be used, a 1 adds a salt to the passphrase and a 3 (the
default) iterates the whole process a number of times (see --s2k-count). Unless
--rfc1991 is used, this mode is also used for conventional encryption.
--s2k-count n
Specify how many times the passphrase mangling is repeated. This value may range
between 1024 and 65011712 inclusive. The default is inquired from gpg-agent.
Note that not all values in the 1024-65011712 range are legal and if an illegal
value is selected, GnuPG will round up to the nearest legal value. This option
is only meaningful if --s2k-mode is 3.
Recommended defaults:
gpg --encrypt --symmetric --s2k-cipher-algo AES256 --s2k-digest-algo SHA512 --s2k-mode 3 --s2k-count 65011712 --compress-algo bzip2
marmarek
Oct 23, 2015
Member
This seems to be the best option.
Some minor questions/improvements/thoughts:
- There should be no --encrypt, as it is for asymmetric encryption.
- We currently have compression handled independently of encryption; I think we should keep that and use --compress-algo none. Our current approach allows specifying any "filter" program to do the actual compression. Personally I like pigz (parallel gzip) for much better performance on an 8-core system :)
- Generally I'm fine with --s2k-count 65011712; on my system it needs 1.00s to encrypt just a "test" string (so mostly a key derivation benchmark), but it may be somewhat longer on older systems; do we care? On a Thinkpad T61 (Core 2 Duo T7300) - 1.42s. Ok, not bad at all.
- What about integrity protection (openssl)? Is it a problem that the passphrase is used directly here (at least I think it is)? If yes, and if we want to change that, it would require some more work to make it compatible with older formats, because of the integrity protection of the backup header (only after its verification can we get properties like backup format version, encryption algorithm etc).
- Backup data will be integrity-protected twice - once by openssl, and additionally by gpg. I don't think it is a problem, but I'm not a cryptographer.
- I don't think we should parse ~/.gnupg/gpg.conf. Either let use the defaults (possibly changed by user in that file), or override some option.
adrelanos
Oct 23, 2015
Member
As for most secure gpg settings, check this out.
Quote https://github.com/Whonix/anon-gpg-tweaks
"Security and privacy enhancements for gnupg's config file for user
"user" in /home/user/.gnupg/gpg.conf." See also:
https://raw.github.com/ioerror/torbirdy/master/gpg.conf and
ioerror/torbirdy#11.
- Backup data will be integrity-protected twice - once by openssl, and additionally by gpg. I don't think it is a problem, but I'm not a cryptographer.
Not a bad thought. In the past there were "known plaintext attacks", where it was recommended not to encrypt known plaintext, as this would weaken the ciphertext. Any serious algorithm nowadays must pass this.
Normally it should not be a problem. If it were, it would be a critical cipher weakness.
I am just worried about the extra complexity of double encryption with respect to manual backup recovery. But if you think it's fine, go for it.
- I don't think we should parse
~/.gnupg/gpg.conf. Either let use the defaults (possibly changed by user in that file), or override some option.
Yes, programs should not rely on user settings in ~/.gnupg/gpg.conf
indeed. Configurable options is a pony.
gpg supports --no-options (skip ~/.gnupg/gpg.conf) and --homedir.
By the way, in Qubes Q3 (and earlier) you cannot continue the backup if you uncheck the encryption box. [I personally use full disk encryption for all my disks, so the encryption of QVMM is nice, but I prefer to disable it to ease manual recovery (in case the backup is ever damaged, a version conflict or whatever).]
marmarek
Oct 23, 2015
Member
- Backup data will be integrity-protected twice - once by openssl, and additionally by gpg. I don't think it is a problem, but I'm not a cryptographer.
Not a bad thought. In the past there were "known plaintext attacks", where it was recommended not to encrypt known plaintext, as this would weaken the ciphertext. Any serious algorithm nowadays must pass this.
Normally it should not be a problem. If it were, it would be a critical cipher weakness.
I am just worried about the extra complexity of double encryption with respect to manual backup recovery. But if you think it's fine, go for it.
Not encryption, just integrity protection - (.hmac files).
By the way, in Qubes Q3 (and earlier) you cannot continue the backup if you uncheck the encryption box. [I personally use full disk encryption for all my disks, so the encryption of QVMM is nice, but I prefer to disable it to ease manual recovery (in case the backup is ever damaged, a version conflict or whatever).]
Can you create a separate ticket for it? I was not aware of this problem...
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
andrewdavidwong
Oct 24, 2015
Member
- Should be no --encrypt as it is for asymmetric encryption
Oops. Right.
- We currently have compression handled independently of encryption, I think we should keep that and use --compress-algo none. Our current approach allows to specify any "filter" program to do the actual compression. Personally I like pigz (parallel gzip) for much better performance on 8-core system :)
Sounds good to me. :)
- Generally I'm fine with --s2k-count 65011712; on my system it needs 1.00s to encrypt just "test" string (so mostly key derivation benchmark), but it may be somehow longer on older systems; do we care? On Thinkpad T61 (Core 2 Duo T7300) - 1.42s. Ok, not bad at all.
I suggested 65011712 as the default because it's the maximum. I think we should assume Qubes users want the "most secure" options by default. If they value backward compatibility more, they can override the defaults. (The security gain of the "most secure defaults" strategy is in some cases debatable, but the choice of defaults acts as an important reputational signal for a security-oriented OS.)
- What about integrity protection (openssl)? is it a problem that the passphrase is used directly here (at least I think it is)?
Yes, this is the other part of the problem.
If yes and if we want to change that, it would require some more work to make it compatible with older formats, because integrity protection of backup header (only after its verification we can get some properties like backup format version, encryption algorithm etc).
True...
- Backup data will be integrity-protected twice - once by openssl, and additionally by gpg. I don't think it is a problem, but I'm not a cryptographer.
IANAC either. I take it integrity-protection from gpg can't be turned off?
- I don't think we should parse ~/.gnupg/gpg.conf.
The only reason I suggested parsing ~/.gnupg/gpg.conf is because that seemed like the most logical place for the user to put their custom settings, but certainly if it's dangerous to parse it (or undesirable for another reason), then let's not do it. I really only meant to suggest that users should be able to specify their own options for s2k-cipher-algo, s2k-digest-algo, s2k-mode, and s2k-count (and any others that should be available).
Either let use the defaults
Sorry, I don't understand. Can you rephrase?
(possibly changed by user in that file),
Which file?
or override some option.
Would it be something like this?
qvm-backup --encrypt --s2k-count 1024 [...]
andrewdavidwong
Oct 24, 2015
Member
Yes, programs should not rely on user settings in ~/.gnupg/gpg.conf indeed.
I'm glad you and @marmarek pointed this out to me. As mentioned in my previous message, my suggestion about reading from ~/.gnupg/gpg.conf was really just an expression of my desire for users to be able to choose their own custom settings for gpg to use.
However, now I'm curious: Why is it such a bad idea for programs to read from/rely on ~/.gnupg/gpg.conf?
Configurable options is a pony.
What does this mean?
marmarek
Oct 24, 2015
Member
On Fri, Oct 23, 2015 at 07:16:42PM -0700, Axon wrote:
- Backup data will be integrity-protected twice - once by openssl, and additionally by gpg. I don't think it is a problem, but I'm not a cryptographer.
IANAC either. I take it integrity-protection from gpg can't be turned off?
Yes, I think it can't be disabled.
- I don't think we should parse ~/.gnupg/gpg.conf.
The only reason I suggested parsing ~/.gnupg/gpg.conf is because that seemed like the most logical place for the user to put their custom settings, but certainly if it's dangerous to parse it (or undesirable for another reason), then let's not do it. I really only meant to suggest that users should be able to specify their own options for s2k-cipher-algo, s2k-digest-algo, s2k-mode, and s2k-count (and any others that should be available).
Either let use the defaults
Sorry, I don't understand. Can you rephrase?
(possibly changed by user in that file),
Which file?
qvm-backup (one program) should not read ~/.gnupg/gpg.conf (the configuration of another program). We have no control over it: the syntax can change, its location may change, some other options may be introduced (including another file?) etc.
So for each option we should either set it explicitly on the command line (always), or not set it at all. Do not try to guess whether the user set some custom value in gpg.conf or not.
or override some option.
Would it be something like this?
qvm-backup --encrypt --s2k-count 1024 [...]
Yes.
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
|
On Fri, Oct 23, 2015 at 07:16:42PM -0700, Axon wrote:
Yes, I think it can't be disabled.
qvm-backup (one program) should not read So for each option we should either set it explicitly on command line
Yes. Best Regards, |
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
adrelanos
Oct 24, 2015
Member
Axon:
Yes, programs should not rely on user settings in ~/.gnupg/gpg.conf indeed.
I'm glad you and @marmarek pointed this out to me. As mentioned in my previous message, my suggestion about reading from ~/.gnupg/gpg.conf was really just an expression of my desire for users to be able to choose their own custom settings for gpg to use.
However, now I'm curious: Why is it such a bad idea for programs to read from/rely on ~/.gnupg/gpg.conf?
It's not super bad, not a huge issue, just bad practice, because that file is for user settings, not arbitrary tools. Avoiding it prevents conflicts with whatever the user did. It's hard to modify a user config file, especially if there is no .d folder, and especially in the home folder. By avoiding it, it gets easier to configure things tailored for the tool in question.
Configurable options is a pony.
What does this mean?
An optional, non-essential, nice to have, thing if someone contributes it.
adrelanos
Oct 24, 2015
Member
Axon:
integrity-protection from gpg can't be turned off?
I don't know if it should be turned off (why?) but it can be configured:
--force-mdc / --disable-mdc
andrewdavidwong
Oct 24, 2015
Member
qvm-backup (one program) should not read ~/.gnupg/gpg.conf (the configuration of another program). We have no control over it: the syntax can change, its location may change, some other options may be introduced (including another file?) etc.
So for each option we should either set it explicitly on the command line (always), or not set it at all. Do not try to guess whether the user set some custom value in gpg.conf or not.
Ok, that makes sense. Thank you for explaining it to me. :)
andrewdavidwong
Oct 24, 2015
Member
It's not super bad, not a huge issue, just bad practice, because that file is for user settings, not arbitrary tools. Avoiding it prevents conflicts with whatever the user did. It's hard to modify a user config file, especially if there is no .d folder, and especially in the home folder. By avoiding it, it gets easier to configure things tailored for the tool in question.
Right, makes sense.
An optional, non-essential, nice to have, thing if someone contributes it.
Oh, I see. Thanks. :)
I don't know if it should be turned off (why?) but it can be configured: [...]
Oh, ok. I guess the only reason to turn it off would be if we're worried about double integrity-protection causing problems, but it sounds like that's not really a concern.
pqg
Oct 25, 2015
FWIW, I wonder whether the minimum-effort variation would be to apply the key stretching just once, to encrypt a random master key.
e.g.:
- A random Master Key, MK, of 512 bits is generated.
- This key is recorded in a file in the archive, say ./master-key, encrypted under the user's passphrase, using a key stretching method of some variety (I'll get back to the details of such later).
- The encryption key EK is derived as EK = HMAC_SHA512(MK, "qubes_backup_encryption_key"). EK, truncated to 256 bits, is used to encrypt files in the current manner, but it is supplied to openssl via the parameter -K (i.e. as a raw key); I suspect this also means that the IV will have to be generated by the backup program and be manually fed in to openssl via -iv.
- The Authentication Master Key AMK is derived as AMK = HMAC_SHA512(MK, "qubes_backup_authentication_key").
- For each file, an Authentication Key AK_<filepath> is derived as AK_<filepath> = HMAC_SHA512(AMK, "<filepath>").
- Each file has its HMAC taken and stored in an adjacent <blah>.hmac file, in the current manner.
Using a unique key for each file's MAC resolves a possible issue where an attacker could swap files around in the archive, potentially migrating an infection from a lower-security VM to a higher-security VM.
This also avoids having an attacker brute-force the user's passphrase against one/any of the HMACs.
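The derivations in this proposal are simple enough to sketch directly. A non-normative Python illustration (the purpose strings are the ones proposed above; the file paths and key storage are hypothetical):

```python
import hmac
import hashlib
import os

MK = os.urandom(64)  # random 512-bit Master Key, to be stored passphrase-encrypted

def derive(key: bytes, purpose: bytes) -> bytes:
    # Every subkey is an independent HMAC-SHA512 output keyed by its parent key
    return hmac.new(key, purpose, hashlib.sha512).digest()

EK = derive(MK, b"qubes_backup_encryption_key")[:32]  # 256-bit key for `openssl enc -K`
AMK = derive(MK, b"qubes_backup_authentication_key")  # Authentication Master Key

def file_hmac_key(filepath: str) -> bytes:
    # Per-file HMAC key: moving a file to a different path invalidates its .hmac
    return derive(AMK, filepath.encode())

k1 = file_hmac_key("vm-work/private.img")
k2 = file_hmac_key("vm-personal/private.img")
```

Because each per-file key differs, an attacker cannot swap .hmac files between paths, and no HMAC is keyed directly with the passphrase.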
A NIST-specified KDF could be used on the Master Key to derive the other keys instead, but none lend themselves to being specified in an openssl one-liner. Also, 2 of the 3 algos in NIST 800-108 more or less reduce to the above HMAC invocations if we don't need subkeys longer than the HMAC's block size.
Alas, this proposal requires more operations before verifying the ./backup-header file, which is bad both for code exposure concerns and backwards compatibility. However, preserving the ./backup-header.hmac in its current form will leave a method by which, as mentioned, an attacker can bypass the key stretching and brute-force the bare passphrase (HashCat has an "HMAC-SHA256 (key = $pass)" mode already that might be able to be dropped in to serve such a purpose, though I've not tested it).
I've not looked at the code yet, but if GnuPG does not do /too much/ parsing, maybe it's still okay to use it for key stretching?
There was actually a competition held recently to pick a good key stretching ("password hashing") algorithm. The winner was an entrant called Argon2, which beat out other entrants including yescrypt, which is an improved scrypt variant. Note that a key issue with scrypt is that it allows time/memory trade-offs to an extent that's undesirable in an optimal key stretching algorithm, though this issue can be mitigated by tuning the scrypt invocation to require sufficiently large memory. And all of these key stretching algorithms wipe the floor with GnuPG's S2K Mode 3, which is just an iterated hash, and therefore not memory-hard at all (i.e. GPUs and ASICs can attack it much faster than the CPUs that will be computing it under normal circumstances).
Of course, the problem with exotic key stretching algos is that it's hard to use them without making manual backup recovery dependent on non-ubiquitous software. And a million iterations of SHA512 is much better than no stretching at all.
I'll look into this a bit more later, and report back with any developments :)
marmarek
Oct 25, 2015
Member
On Sun, Oct 25, 2015 at 10:13:47AM -0700, pqg wrote:
FWIW, I wonder whether the minimum effort variation would be to apply the key stretching just once, to encrypt a random master key.
Generally we want to avoid doing too much low-level crypto handling,
because we are not cryptographers.
Using a unique key for each file's MAC resolves a possible issue where an attacker could swap files around in the archive, potentially migrating an infection from a lower-security VM to a higher-security VM.
This isn't a problem, because paths are stored inside the archive(s). The outer layer is only for convenience (streaming the backup while creating it, partial restore). Take a look at point 7 here:
https://www.qubes-os.org/en/doc/backup-emergency-restore-v3/
Alas, this proposal requires more operations before verifying the ./backup-header file, which is bad both for code exposure concerns and backwards compatibility. However, preserving the ./backup-header.hmac in its current form will leave a method by which, as mentioned, an attacker can bypass the key stretching and brute-force the bare passphrase (HashCat has an "HMAC-SHA256 (key = $pass)" mode already that might be able to be dropped in to serve such a purpose, though I've not tested it).
Yes, IMO this is currently the main concern. When we get to the ./backup-header, we can have KDF parameters, salt, iteration count, or any other configuration needed for restoring the rest of the backup. But before that, we need something working without such knowledge...
I've not looked at the code yet, but if GnuPG does not do /too much/ parsing, maybe it's still okay to use it for key stretching?
I think we previously concluded that we don't want to expose gpg to any unverified input.
Of course, the problem with exotic key stretching algos is that it's hard to use them without making manual backup recovery dependent on non-ubiquitous software. And a million iterations of SHA512 is much better than no stretching at all.
Also, using exotic algorithms isn't a good idea, because those haven't received as much review/analysis/testing.
But generally, IMHO scrypt as a KDF (exactly the thing it was designed for, right?) should be ok. If we need to employ a KDF ourselves at all...
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
andrewdavidwong
Oct 26, 2015
Member
There was actually a competition held recently to pick a good key stretching ("password hashing") algorithm. The winner was an entrant called Argon2, which beat out other entrants including yescrypt, which is an improved scrypt variant. Note that a key issue with scrypt is that it allows time/memory trade-offs to an extent that's undesirable in an optimal key stretching algorithm, though this issue can be mitigated by tuning the scrypt invocation to require sufficiently large memory. And all of these key stretching algorithms wipe the floor with GnuPG's S2K Mode 3, which is just an iterated hash, and therefore not memory-hard at all (i.e. GPUs and ASICs can attack it much faster than the CPUs that will be computing it under normal circumstances).
Of course, the problem with exotic key stretching algos is that it's hard to use them without making manual backup recovery dependent on non-ubiquitous software. And a million iterations of SHA512 is much better than no stretching at all.
Exactly. I'm skeptical that the practical security benefits of using a newer KDF will outweigh the benefits of being able to do manual backup recovery in a variety of disaster scenarios. Security-conscious users (which almost by definition includes all Qubes users) ought to be using reasonably long, high-entropy passphrases (I'll refer to these as "good" passphrases for the sake of brevity).
The current system (in which openssl applies a single round of md5) punishes users who pick good passphrases by capping maximum entropy at a measly 128 bits. We should fix that. However, my understanding (correct me if I'm wrong) is that S2K or PBKDF2, applied to good passphrases, will be uncrackable for the foreseeable long-term future (i.e., even if we assume many quadrillions of guesses per second, it would still take at least millions of years). If that's correct, then it seems to me that we should instead select the most trusted, tested, ubiquitous tools available. My backup is no good to me if I can't recover the data because the only software available to me is whatever's on an old Ubuntu disc, for example. (We can't predict in advance what kinds of emergency scenarios Qubes users might find themselves in, and under what conditions they might need manual access to their data.)
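The "millions of years" intuition checks out with rough arithmetic (illustrative numbers only, not a claim about any particular attacker):

```python
# Back-of-envelope: time to exhaust a good passphrase's keyspace by brute
# force, assuming a generous rate of one quadrillion guesses per second and
# a passphrase with ~100 bits of entropy (e.g. a long diceware phrase).
guesses_per_second = 10**15
entropy_bits = 100
seconds = 2**entropy_bits / guesses_per_second
years = seconds / (365.25 * 24 * 3600)
# years comes out around 4e7, i.e. tens of millions of years, so once the
# passphrase itself is high-entropy, the choice of stretching algorithm
# matters far less than recoverability.
```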
andrewdavidwong
Oct 26, 2015
Member
Having said all that, if scrypt provides exactly what we're looking for, then it may be worthwhile to go with that. (And advise users to store a copy of scrypt with their backups, or something.) If anyone can get this right, it's Colin Percival.
mfc
Oct 26, 2015
Member
we should add a crypto tag to this (and any other similar issues) and I will create a thread with Cure53 (or NCC / iSec Partners) about this. They can provide free crypto advice to OTF-funded projects.
marmarek added the crypto label Oct 26, 2015
Great idea!
pqg
Oct 26, 2015
I concur with your position that, if at all possible, the data should be recoverable with an "old Ubuntu disc".
The scrypt binary and python-scrypt module don't seem to be currently packaged with Fedora anyway (they are available on Debian Jessie; I've not checked elsewhere). And keeping scrypt alongside the backup would be problematic if we need to invoke scrypt /before/ we can verify the integrity of anything.
As an aside, I think we should distinguish between 2 uses of "KDF":
- A means by which to derive subkeys of arbitrary length from a master key of fixed length.
- A means by which to perform "key stretching" (and salting) so as to make brute force attacks more costly on user-supplied keying material.
For instance, scrypt could provide function 2, but not function 1. Function 1 is served by things like NIST 800-108, which in the special case of all subkeys being of equal or shorter length than the underlying hash function, reduces to HMAC(MK, "\x01<subkey_purpose_specifier_string>\x00\x20"), where the \x01 and \x00 are no longer serving a particular purpose in this reduced form, and \x20 is the length of the resulting subkey. More conservative KDFs (purpose 1) use a single keystream that feeds back the result of the previous HMAC invocations or, more conservative still, they create one feedback pipeline whose output is passed to a second HMAC invocation to produce the actual keystream (TLSv1.2 does this, see section 5). However, all of these are prohibitively awkward in the "old Ubuntu disc" scenario. By contrast:
echo -n qubes_backup_authentication_key | openssl dgst -sha256 -hmac <Hex(Master Key)>
...should work almost anywhere, yield the desired different keys for different purposes, and be just as secure as long as SHA-256 is a strong hash function. (The fancier derivation methods are just meant to give some additional protection if it should turn out that the underlying hash function is weaker than expected.) Though note that the Master Key (randomly generated or the result of scrypt or some other key stretching) will be a binary blob, which will be tough to pass via -hmac unless encoded by hex (as above) or base64, which is kind of unsightly protocol-wise. On OpenSSL >= v1.0.0 (Ubuntu since at least 2011) the following invocation could be used instead to pass the master key hex-encoded:
echo -n qubes_backup_authentication_key | openssl dgst -sha256 -mac hmac -macopt hexkey:<Hex(Master Key)>
So that the output is actually HMAC_SHA256(MK, "qubes_backup_authentication_key") rather than HMAC_SHA256(Hex(MK), "qubes_backup_authentication_key") as it would have to be with -hmac.
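For illustration, the same reduced-form derivation can be sketched with Python's standard library. The master key value and the "qubes_backup_encryption_key" purpose string below are hypothetical (only the authentication purpose string appears above):

```python
import hmac
import hashlib

# Hypothetical 32-byte master key (in practice: random, or the output of
# a key-stretching step such as scrypt).
master_key = bytes.fromhex("00112233445566778899aabbccddeeff" * 2)

def derive_subkey(mk: bytes, purpose: bytes) -> bytes:
    """Reduced-form NIST 800-108 subkey: HMAC-SHA256(MK, purpose)."""
    return hmac.new(mk, purpose, hashlib.sha256).digest()

# Different purpose strings pick out independent subkeys from one master key.
auth_key = derive_subkey(master_key, b"qubes_backup_authentication_key")
enc_key = derive_subkey(master_key, b"qubes_backup_encryption_key")  # hypothetical name

print(len(auth_key), auth_key != enc_key)
```

The `-mac hmac -macopt hexkey:<Hex(Master Key)>` invocation above produces the same bytes as `derive_subkey`, since both compute HMAC-SHA256 keyed with the raw (not hex-encoded) master key.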
However, my understanding (correct me if I'm wrong) is that S2K or PBKDF2, applied to good passphrases, will be uncrackable for the foreseeable long-term future (i.e., even if we assume many quadrillions of guesses per second, it would still take at least millions of years).
With no key stretching, e.g. just taking the SHA-256 of the passphrase, an attacker capable of 10 quadrillion tries a second for 10 million years would exhaust an ~101-bit keyspace. This represents approximately an 8-word diceware passphrase (~103 bits). With ~1 million rounds of key stretching, as in S2K Mode 3 at max settings, ~81 bits of input keyspace could be searched, which requires a diceware passphrase of 7 words (~90 bits), though you would get pretty close with 6 words (~78 bits).
A GPU running hashcat can currently trial on the order of 2 billion SHA-256 sums per second (2*10^9/sec) so 10^16/sec represents a powerful adversary indeed. 10 million years is of course just providing a robust margin of error, since in 30 years the crypto will be completely obsolete, and in 100 years we'll all be dead.
So 128 bits is still enough for anyone. Unless your threat model includes the arrival of quantum computers in the very near term, in which case you should double all your symmetric crypto key length requirements.
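The keyspace figures above check out with a quick back-of-envelope calculation (the attack rate, time horizon, and round count are the assumptions stated in the text):

```python
import math

GUESSES_PER_SEC = 1e16                 # "10 quadrillion tries a second"
SECONDS_PER_YEAR = 365.25 * 24 * 3600
YEARS = 1e7                            # 10 million years

total_guesses = GUESSES_PER_SEC * SECONDS_PER_YEAR * YEARS
print(round(math.log2(total_guesses), 1))            # ~101 bits searchable, no stretching

STRETCH = 2 ** 20                      # ~1 million rounds of key stretching
print(round(math.log2(total_guesses / STRETCH), 1))  # ~81 bits with stretching
```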
That said, as you originally advanced, even if nothing else changes from this discussion, -md sha256 should be added to the openssl enc invocations. It's much more conservative.
None of this, of course, helps with the MACs, where we really want to salt and stretch the user's passphrase before doing anything, even if the same resulting key is used for both verification and encryption.
Adding a salt not only avoids precomputation attacks, but binds the files to the particular backup. While the renaming attack I previously feared is indeed impossible, I don't /think/ that there's currently any protection against shuffling in an identically-named VM from a different backup, as long as both backups were made with the same passphrase.
However, there then needs to be some way to store the salt and, if variable, any key stretching parameters, and load them before verification. Would it be an acceptable risk to parse just the key=values out of the ./backup-header and interpret those necessary to establish the key, then check the ./backup-header.hmac, and finally interpret the rest of the values (e.g. if, in the future, a timestamp is added to the header, it would be good to not have to parse that until after verification)? Setting hard line and file length limits might help limit the attack surface.
This certainly presents less attack surface than any way I can think of using GPG to do pre-verification stretching. Though, I'm aware I haven't actually proposed another way to perform key stretching yet. Python has baked-in hashlib.pbkdf2_hmac(), but only since v3.4 and v2.7.8, which are far too recent (mid 2014). I'd hoped it might be possible to use gpg --symmetric on a fixed value, hash the resulting "file", and use that as the key, but I've not been able to find a way to pass a salt and/or IV to GPG, so it unfortunately generates a random one on each invocation. The scrypt binary has the same problem, assuming its availability issues could be overcome.
The bcrypt functionality in python-passlib may be a reasonable bet. I can't guarantee how ubiquitous it is, but it's in Debian >=7 and backports for 6, definitely in Ubuntu >= 12.04, and available on Fedora 21. bcrypt has been fairly heavily popularized and so should be common enough; I suspect that a working implementation should be more readily obtainable for our hypothetical frazzled power user than scrypt.
On the other hand, the scrypt binary is pretty easy to build if you've got gcc and autotools. As mentioned, it's only good for encrypting/decrypting files, not plain key stretching, so we'd be back to the ./master-key idea or similar, but it still presents a considerably smaller attack surface than GnuPG.
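Where the Python-version constraint on `hashlib.pbkdf2_hmac()` is acceptable, the stretching step itself is a one-liner; a sketch with illustrative (not vetted) parameters:

```python
import hashlib
import os

passphrase = b"correct horse battery staple"  # example passphrase
salt = os.urandom(16)      # would need to be stored in the backup header
rounds = 100_000           # illustrative cost parameter

# Available only in Python >= 2.7.8 / 3.4, as noted above.
master_key = hashlib.pbkdf2_hmac("sha256", passphrase, salt, rounds, dklen=32)
print(master_key.hex())
```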
marmarek
Oct 26, 2015
Member
Adding a salt not only avoids precomputaion attacks, but binds the files to the particular backup. While the renaming attack I previously feared is indeed impossible, I don't /think/ that there's currently any protection against shuffling in an identically-named VM from a different backup, as long as both backups were made with the same passphrase.
Yes, such an attack is probably possible.
However, there then needs to be some way to store the salt and, if variable, any key stretching parameters, and load them before verification. Would it be an acceptable risk to parse just the key=values out of the
./backup-headerand interpret those necessary to establish the key, then check the./backup-header.hmac, and finally interpret the rest of the values (e.g. if, in the future, a timestamp is added to the header, it would be good to not have to parse that until after verification)? Setting hard line and file length limits might help limit the attack surface.
This might be some option if nothing better comes to us. Those pre-verification options need to be clearly marked (some prefix in the name?) and reduced to an absolute minimum. If we'd have to go that way, I think this should include a version field, which would determine most of the parameters. For example it would be a bad idea to include the hash algorithm in that pre-authentication set, because the attacker would be able to replace it with some weak value (vide "cipher_null" in LUKS header[1]).
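Such a two-stage parse might look like the following sketch. The field names, limits, and whitelist here are hypothetical, not the actual backup-header format:

```python
# Stage 1 reads only a whitelisted, version-determined subset of keys
# before the HMAC check; all other fields wait until after verification.

MAX_HEADER_BYTES = 4096
MAX_LINE_BYTES = 256
PRE_VERIFICATION_KEYS = {"version", "salt", "kdf_rounds"}  # hypothetical names

def parse_pre_verification(raw: bytes) -> dict:
    if len(raw) > MAX_HEADER_BYTES:
        raise ValueError("backup-header too large")
    fields = {}
    for line in raw.splitlines():
        if len(line) > MAX_LINE_BYTES:
            raise ValueError("header line too long")
        key, sep, value = line.partition(b"=")
        k = key.decode("ascii", "strict")
        if sep and k in PRE_VERIFICATION_KEYS:
            fields[k] = value.decode("ascii", "strict")
    return fields

hdr = parse_pre_verification(b"version=3\nsalt=ab12\nkdf_rounds=100000\nhash=sha256\n")
# Only whitelisted keys survive stage 1; "hash" stays unparsed
# until after the HMAC check.
print(hdr)
```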
The bcrypt functionality in
python-passlibmay be a reasonable bet. I can't guarantee how ubiquitous it is, but it's in Debian >=7 and backports for 6, definitely in Ubuntu >= 12.04, and available on Fedora 21. bcrypt has been fairly heavily popularized and so should common enough; I suspect that a working implementation should be more readily obtainable for our hypothetical frazzled power user than scrypt.
And according to previous comments, bcrypt is also a somewhat sensible choice, right?
On the other hand, the scrypt binary is pretty easy to build if you've got gcc and autotools. As mentioned, it's only good for encrypting/decrypting files, not plain key stretching, so we'd be back to the
./master-keyidea or similar, but it still presents a considerably smaller attack surface than GnuPG.
I think we can simply have a hex-encoded, encrypted master key included in the backup header.
And as the backup format is going to be more and more complex, maybe it's a good idea to include an emergency restoration script as one of the plain text, integrity protected files (similar to backup-header)? So the emergency procedure would include manual verification of the backup header and that script, and then simply running the script.
[1]
https://github.com/QubesOS/qubes-secpack/blob/master/QSBs/qsb-019-2015.txt
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
andrewdavidwong
Oct 27, 2015
Member
I notice that all of pqg's messages have disappeared from this thread. pqg, did you intentionally delete them? If so, may I ask why? (Feel free to send me an encrypted message if you prefer.) Note that this may create difficulties for people who try to read this thread in the future (e.g., cryptographers interested in helping with this issue).
marmarek
Oct 27, 2015
Member
I notice that all of pqg's messages have disappeared from this thread.
As the whole account...
andrewdavidwong
Oct 27, 2015
Member
As the whole account...
Ah. In light of that, I'm not sure what to make of his/her/their previous suggestions...
pqg
Oct 27, 2015
Apparently I tripped over some bot heuristics and was hellbanned. Presumably because I am using Tor and posted some comments with some links in them.
I trust that you can see me again?
Yes.
mig5 referenced this issue Jan 4, 2016: Backup encryption passphrase is exposed to dom0 process list #1582 (Closed)
marmarek referenced this issue Jan 17, 2016: Multiply speed of backups by not using / storage during backups #1652 (Open)
Rudd-O
Jan 17, 2016
Hi. Use the scrypt source code for key derivation. It's simple, good license, and lets you create arbitrarily hard derivations. I'm using it for a project I'm working on, and it works exactly fine.
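A sketch of scrypt-based stretching using Python's built-in binding (`hashlib.scrypt`, which requires Python >= 3.6 built against OpenSSL 1.1+; the parameters below are illustrative, not a vetted recommendation):

```python
import hashlib

key = hashlib.scrypt(
    b"correct horse battery staple",  # example passphrase
    salt=b"per-backup-random-salt",   # in practice: random, stored with the backup
    n=2**14, r=8, p=1,                # CPU/memory cost, block size, parallelism
    dklen=32,
)
print(key.hex())
```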
marmarek
Jan 17, 2016
Member
Take a look at #971 (comment) and #971 (comment), mostly about scrypt availability and usage for sole key derivation. One of the important requirements for the backup format here is the ability to restore the data using commonly available tools, not necessarily on Qubes OS.
marmarek
Jan 22, 2016
Member
Python has baked-in hashlib.pbkdf2_hmac(), but only since v3.4 and v2.7.8, which are far too recent (mid 2014).
There is pbkdf2 implementation in passlib python module (http://pythonhosted.org/passlib/lib/passlib.utils.pbkdf2.html), for quite a long time. As for bcrypt in that library, I can see only an API to create a password hash, not sure how can it be applied as key derivation...
Rudd-O
Oct 31, 2016
On 10/31/2016 09:43 PM, Jean-Philippe Ouellet wrote:
According to that paper, duplicati considers neither confidentiality
nor integrity. Statements from that paper like:
"We chose the SHA256 hash as that seems to be a safe bet, although the cryptographic properties of SHA are not required." and
This is an unfair criticism.
Duplicati lets you choose plain AES like Qubes backup today, or GPG if
you dislike the choice of SHA256. Furthermore, the software is open
source — LGPL 2.1 to be exact — so if you want to plug your own
algorithm, you're free to do so.
"If a required block is not found after scanning the index files, block files are downloaded at random, until all known blocks are found. This enables a restore, even in cases without any index files."are rather concerning when considering an adversary who maliciously
modifies your backups.
Duh, if an adversary modifies your files, it does not matter whether you
have to download a thousand 1 MB blocks or 1 GB file, you still can't
restore! This is a 100% frivolous criticism, my man.
I like the idea of a simple incremental backup solution which treats
the actual blocks as completely opaque
That is precisely what Duplicati is — the blocks are opaque to the
storage medium — except it isn't in the full+incremental scheme — it's
fully content-addressable, which is superior in many ways.
(the content-parsing to determine optimal chunking boundaries of other
backup solutions sounds like an interesting attack surface to me), but
to me this solution alone is dangerously insufficient.
I do not believe you have formed that opinion based on valid facts.
Rudd-O
http://rudd-o.com/
Rudd-O
Oct 31, 2016
On 10/31/2016 09:46 PM, Marek Marczykowski-Górecki wrote:
@Rudd-O https://github.com/Rudd-O indeed interesting, but offtopic
here. It says nothing about encryption and integrity protection which
is main issue discussed here. It's mostly on topic of #858
#858
EDIT: @jpouellet https://github.com/jpouellet was faster about
details, so not repeating it here.
I recognize it's off topic, as the paper does not speak of integrity
protection. The integrity protection of the implementation is provided
by the choice of encryption and hashing algorithm, which is not the
topic of the paper (the paper focuses on explaining the unencrypted data
structures, which I thought would be a good read for you).
I will be trying to implement something like this for Qubes OS, since I
need this storage system for a variety of reasons. I will report back
with progress.
Rudd-O
http://rudd-o.com/
marmarek
Oct 31, 2016
Member
I don't think deduplication (especially across VMs) is a good idea for a secure backup solution at all. There are multiple things that can go wrong. For example, if an attacker controls one of the VMs, he can store some data there and later, by observing the backup size, guess whether the data is also present in some other VM, thus leaking information about the content of another VM. Probably not feasible for leaking private keys (something not known to the attacker), but very real for finding out who has some known data (like breaking the anonymity of some Tor user).
Also, today SHA256 is considered safe, but this format rely on it to have never, ever any collision, not even unintentional one. This drawback is even mentioned in that paper.
And this is just what I can think of in 5 minutes. There are probably more. So, any solution must not share anything across VMs.
jpouellet
Nov 1, 2016
Contributor
On Mon, Oct 31, 2016 at 6:40 PM, Rudd-O notifications@github.com wrote:
On 10/31/2016 09:43 PM, Jean-Philippe Ouellet wrote:
"If a required block is not found after scanning the index files,
block files are downloaded at
random, until all known blocks are found. This enables a restore,
even in cases without any index files." are rather concerning when considering an adversary who maliciously
modifies your backups.
Duh, if an adversary modifies your files, it does not matter whether you
have to download a thousand 1 MB blocks or 1 GB file, you still can't
restore! This is a 100% frivolous criticism, my man.
Sure, they could harm availability, but if properly authenticated, the attacker's capabilities end there.
But, if you don't have a trusted way to identify and authenticate the root of your merkle-tree index thing, (which sounds to be the case with this system as described when restoring without an already trusted index) then your adversary may also gain arbitrary write on restore, and that is plenty of leverage for further attacks.
I like the idea of a simple incremental backup solution which treats
the actual blocks as completely opaque
That is precisely what Duplicati is — the blocks are opaque to the
storage medium — except it isn't in the full+incremental scheme — it's
fully content-addressable, which is superior in many ways.
Right. I was saying that's the main concept I like from it, as opposed to other systems.
(the content-parsing to determine optimal chunking boundaries of other
backup solutions sounds like an interesting attack surface to me), but
to me this solution alone is dangerously insufficient.
I do not believe you have formed that opinion based on valid facts.
Sorry to disappoint.
It makes sense from the perspective of one optimizing for storage above all else to want to split streams on intelligent boundaries such that you can re-use as many already-stored content-addressed chunks as possible.
Unfortunately, determining the most intelligent boundary is seriously non-trivial, and as one tries to achieve better results by incorporating more format-specific knowledge, complexity (and with it opportunity for vulnerabilities) necessarily explodes.
Consider libchop's anchor-based chopper based on this paper.
Upon a cursory examination, the implementation appears to include a partial domain-specific JIT compiler (!!!) just to choose where to split data!! I'd much prefer something which treated things as completely opaque blobs.
Sure, one could argue it could be effectively sandboxed and pass back only a list of indices to chop at, but still... for stuff likely running in Dom0, definitely do not want!
And for those who say "yeah, but... how likely is it for it to actually have a vuln in that... It can't be that complex... it's just some minor string manipulation", just look at the libarchive exploit that recently affected FreeBSD's update system.
Rudd-O
Nov 2, 2016
On 10/31/2016 11:09 PM, Marek Marczykowski-Górecki wrote:
I don't think deduplication (especially across VMs) is a good idea for
secure backup solution at all. There are multiple things that can go
wrong. For example - if attacker control one of the VMs, can store
some data there, and later observing backup size guess if the data is
present also in some other VM, thus leaking information about content
of another VM.
This is an objection that is incredibly far-fetched. First of all, you
are suggesting that an attacker controls a VM that is backed up, AND
also controls the backup data. That's already incredibly improbable.
Second, you're also suggesting that the attacker writes some data that
happens to be deduplicated, which means the attacker ALREADY HAS the
data he was trying to discover BY DEFINITION.
I do not think this is a valid objection.
But, if you fear this extremely improbable thing, you can just switch
off deduplication. Presto, problem gone.
Below I explain how that can be accomplished.
Probably not feasible for leaking private keys (something not known to
the attacker), but very real for finding who have some known data
(like breaking anonymity of some Tor user).
Also, today SHA256 is considered safe, but this format rely on it to
have /never, ever/ any collision, not even unintentional one. This
drawback is even mentioned in that paper.
I'm sure new algorithms can be added to the code. It's open source.
And this is just what I can think of in 5 minutes. There are probably
more. So, any solution must not share anything across VMs.
Just turn dedup off!
It's super simple to do that.
Here's just one way to do it:
During backup, at the point where backup is hashing the data to see if
there are any duplicates, simply hash the data with the path name of the
file being stored.
Bam, dedup between VMs is now off.
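The per-path salting idea above can be sketched as follows (illustrative only; this is not Duplicati's actual hashing code):

```python
import hashlib

def chunk_id(vm_path: str, chunk: bytes) -> str:
    """Content hash scoped to one VM/path, so identical chunks in
    different VMs never collapse into one deduplicated block."""
    h = hashlib.sha256()
    h.update(vm_path.encode() + b"\x00" + chunk)
    return h.hexdigest()

same_data = b"identical chunk contents"
# Same bytes, different VM paths -> different chunk IDs, no cross-VM dedup.
print(chunk_id("vm-work/root.img", same_data) ==
      chunk_id("vm-personal/root.img", same_data))
```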
Best thing: all of this is implementable because the code is open source.
Another way to turn dedup off: always do a full backup. That way this
mysterious attacker you posit cannot compare before+after backup sizes.
Bam, problem solved!
Fundamentally, some people care about the remote scenarios you propose,
and others do not. To cater for them, there should be a setting that
allows them to toggle the behavior you dislike.
Rudd-O
http://rudd-o.com/
Rudd-O
Nov 2, 2016
On 11/01/2016 12:19 AM, Jean-Philippe Ouellet wrote:
Sure, they could harm availability, but if properly authenticated, the
attacker's capabilities end there.
But, if you don't have a trusted way to identify and authenticate the
root of your merkle-tree index thing, (which sounds to be the case
with this system as described when restoring without an already
trusted index)
I don't think that's the case. That merkle root will not decrypt if you
give it the wrong password.
Feel free to correct me with the relevant technical details.
jpouellet
Nov 2, 2016
Contributor
On Wed, Nov 2, 2016 at 6:42 AM, Rudd-O notifications@github.com wrote:
On 10/31/2016 11:09 PM, Marek Marczykowski-Górecki wrote:
I don't think deduplication (especially across VMs) is a good idea for
secure backup solution at all. This is an objection that is incredibly far-fetched. First of all, you
are suggesting that an attacker controls a VM that is backed up, AND
also controls the backup data. That's already incredibly improbable.
Second, you're also suggesting that the attacker writes some data that
happens to be deduplicated, which means the attacker ALREADY HAS the
data he was trying to discover BY DEFINITION.
Yes, the adversary knows what the data is, but does not know you
have it (and you wish to keep that a secret).
I do not think this is a valid objection.
I think this is a very valid objection.
Consider the case where you are a whistle-blower trying to expose some
corruption or whatever. You have a non-networked "secrets" vm you use
to analyze some cache of documents exposing said corruption that you
stole from some database. The adversary suspects you may possess said
documents, but wants proof before making you disappear.
By injecting a single block of said data into the disk a different
(less-trusted) vm (plenty of ways to do that... via browser cache for
example) and observing deduplication, then the adversary can confirm
the presence of data whose existence you wished to keep secret.
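The attack can be modeled with a toy content-addressed store (purely illustrative, not any real backup code):

```python
import hashlib

class DedupStore:
    """Toy content-addressed store: a duplicate block adds no extra bytes."""

    def __init__(self):
        self.blocks = {}

    def put(self, data: bytes) -> None:
        self.blocks[hashlib.sha256(data).hexdigest()] = data

    def size(self) -> int:
        return sum(len(b) for b in self.blocks.values())

store = DedupStore()
store.put(b"incriminating document, block 1")  # backed up from the secrets VM

before = store.size()
store.put(b"incriminating document, block 1")  # attacker plants the same block
after = store.size()                           # via a less-trusted VM

# after == before: the store already held the block, so possession is confirmed.
```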
But, if you fear this extremely improbable thing, you can just switch
off deduplication. Presto, problem gone. Fundamentally, some people care about the remote scenarios you propose,
and others do not. To cater for them, there should be a setting that
allows them to toggle the behavior you dislike.
I believe that is not the correct philosophy for an open source
project aiming to provide improved security and privacy for an
audience including less-technical users.
You expect some journalist wanting to hide the fact that they have
certain documents to be aware of duplication-based
content-confirmation attacks, be aware that qubes' backup system is
vulnerable by default, be aware that there is an option to turn that
off, and actually do so before it matters? I do not think this is a
reasonable model for the intended users.
Personally, I'm a great fan of OpenBSD-style decision rationale, where
choices are made to be secure by default and have few knobs to turn.
Consistent application of that philosophy protects users.
jpouellet
Nov 2, 2016
Contributor
On Wed, Nov 2, 2016 at 6:43 AM, Rudd-O notifications@github.com wrote:
I don't think that's the case. That merkle root will not decrypt if you
give it the wrong password. Feel free to correct me with the relevant technical details.
But, if you don't have a trusted way to identify and authenticate the
root of your merkle-tree index thing, (which sounds to be the case
I was assuming things were implemented as described in the paper, with
no additional integrity protection mechanisms wrapping the indexes.
If this is not the case, then I am mistaken.
Content-confirmation attacks associated with
convergent-encryption-based deduplication schemes like this one still
remain problematic though.
Rudd-O
Nov 3, 2016
On 11/02/2016 02:07 PM, Jean-Philippe Ouellet wrote:
Fundamentally, some people care about the remote scenarios you propose,
and others do not. To cater for them, there should be a setting that
allows them to toggle the behavior you dislike. I believe that is not the correct philosophy for an open source
project aiming to provide improved security and privacy for an
audience including less-technical users.
What's wrong with the setting, if the setting is toggled to the secure
value by default?
Settings are a perfectly valid way of changing how software behaves.
defuse
Nov 3, 2016
@Rudd-O: I was going to reply here about why settings are bad in general, but I felt it was off-topic, so I moved it into a blog post: https://bqp.io/on-software-settings.html (here it's the "second problem" that applies most).
jpouellet
Nov 3, 2016
Contributor
Another post that explains things more eloquently than I could: http://www.daemonology.net/blog/2009-09-04-complexity-is-insecurity.html
That post also happens to be from the author of scrypt :)
EDIT: This also applies to my argument against trying to chunk data at intelligent boundaries rather than fixed-size blocks. While the former clearly results in superior deduplication, the latter is more secure.
defuse
Nov 14, 2016
Heads up that I added a feature to crack Qubes v3.1 (and, I think, but have not verified, v3.2) backup encryption (default settings) to CrackStation: https://bqp.io/qubes-backup-encryption-added-to-crackstation.html I hope by doing this I'll attract some more cryptographers in here making sure the next version is rock solid.
In case anyone wants a short-term workaround, what I've done personally is to encrypt my backup disk with LUKS, and then store my Qubes-encrypted backups on there.
andrewdavidwong
Dec 11, 2016
Member
Heads up that I added a feature to crack Qubes v3.1 (and think, but have not verified v3.2) backup encryption (default settings) to CrackStation: https://bqp.io/qubes-backup-encryption-added-to-crackstation.html
Yes, this applies to 3.2 also.
I hope by doing this I'll attract some more cryptographers in here making sure the next version is rock solid.
Thanks, @defuse.
Related question: Theoretically, what's the fastest way to brute force a Qubes backup under the current scheme? There are at least two ways to go, right?
1. Keep trying 32-digit hexadecimal strings (i.e., possible outputs of md5(passphrase)) as aes-256-cbc keys until one works. (Assuming that the passphrase has >= 128 bits of entropy.)
2. Keep computing openssl dgst -sha512 -hmac '<passphrase>' backup-header until one matches backup-header.hmac.
Is 1 theoretically faster because it entails a smaller keyspace? Or is there another, faster way?
The reason I ask is because this would seem to affect estimates of how strong/weak the current system is (i.e., how long it would take to brute force given different values of passphrase entropy).
jpouellet
Dec 11, 2016
Contributor
- Keep trying 32-digit hexadecimal strings (i.e., possible outputs of md5(passphrase)) as aes-256-cbc keys until one works. (Assuming that the passphrase has >= 128-bits of entropy.)
No, 128 bits is an unthinkably large state space to try to brute-force. The set of md5(likely_passphrases) is sparse within that space, so you would definitely want to start with likely passphrase candidates and pass those through md5 to generate candidate aes keys, not try to randomly guess a correct 128-bit value (assumed to be so incredibly unlikely that it is why many standards target the 128-bit effective security level).
The larger problem with this approach though is checking whether or not a resulting passphrase is correct. You then need to use the resulting key to try to decrypt the filesystem, and check if the plaintext looks like a filesystem. The HMAC-cracking approach has the benefit that it is immediately obvious whether or not the result is correct.
- Keep computing openssl dgst -sha512 -hmac '' backup-header until one matches backup-header.hmac.
Yes. This.
Is 1 theoretically faster because it entails a smaller keyspace? Or is there another, faster way?
No. The question is not about 128 bits vs 256 bits (both are assumed to be sufficient in the crypto literature). The design flaw here is that the HMAC key and encryption key are derived from exactly the same input, and an HMAC construction allows fast (cheap-to-attack) checking of whether or not a key is correct (because it is designed for cheap message authentication using shared secrets), whereas a strong KDF is designed to be expensive to compute (and that expense does not need to be paid by an attacker who can bypass it by computing the HMAC for each candidate passphrase instead).
Disclaimer: I am not a cryptographer, I just like reading and thinking about applied crypto, and occasionally have to crack passwords
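For illustration, the cheap per-candidate check described above can be sketched like this (a toy sketch assuming the 3.x layout, where backup-header.hmac is HMAC-SHA512 over the header keyed directly with the passphrase, as openssl dgst -sha512 -hmac produces):

```python
import hashlib
import hmac

def candidate_matches(candidate: str, header: bytes, header_hmac_hex: str) -> bool:
    # One HMAC-SHA512 per guess; no slow KDF stands between the attacker
    # and the yes/no answer.
    tag = hmac.new(candidate.encode(), header, hashlib.sha512).hexdigest()
    return hmac.compare_digest(tag, header_hmac_hex)

def crack(wordlist, header: bytes, header_hmac_hex: str):
    for candidate in wordlist:
        if candidate_matches(candidate, header, header_hmac_hex):
            return candidate
    return None
```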
jpouellet
Dec 11, 2016
Contributor
The reason I ask is because this would seem to affect estimates of how strong/weak the current system is (i.e., how long it would take to brute force given different values of passphrase entropy).
In its current (3.2) state, we essentially have the cheapest KDF possible, so I think the only safe way to use it is to have a passphrase that is already sufficiently high entropy for use directly as a key.
When I used qvm-backup (to transfer VMs from old laptop to new laptop), I used a key from dd if=/dev/random, encrypted the result asymmetrically to a key generated on the destination machine, signed that with a key generated on the origin machine, performed the backup & transfer, then decrypted the used symmetric key on the destination machine.
I think such an asymmetric scheme is really the way to go for various reasons, but I've yet to put my arguments together and formally make a case for it. tl;dr - it allows us to do away with passphrases and have a safely-stored offline backup-recovery key (which is not needed at backup-creation time), and a backup-origin signing key. This would enable the creation of encrypted and authenticated backups in a hostile environment where entering a backup passphrase would be problematic.
jpouellet
Dec 11, 2016
Contributor
tl;dr - allows to do away with passphrases and have a safely-stored offline backup-recovery key
That recovery key could (should) of course be protected with a passphrase itself, but it need not be used (or even known) at backup-creation time.
andrewdavidwong
Dec 11, 2016
Member
- Keep trying 32-digit hexadecimal strings (i.e., possible outputs of md5(passphrase)) as aes-256-cbc keys until one works. (Assuming that the passphrase has >= 128-bits of entropy.)
No, 128 bits is an unthinkably large state space to try to brute-force. The set of md5(likely_passphrases) is sparse within that space, so you would definitely want to start with likely passphrase candidates and pass those through md5 to generate candidate aes keys, not try to randomly guess a correct 128-bit value (assumed to be so incredibly unlikely that it is why many standards target the 128-bit effective security level).
Of course. I didn't say anything about randomly guessing 128-bit values. It goes without saying that you would prioritize the possible outputs of likely passphrases before unlikely ones. (This is, of course, an oversimplification of what sophisticated password crackers actually do.)
The larger problem with this approach though is checking whether or not a resulting passphrase is correct. You then need to use the resulting key to try to decrypt the filesystem, and check if the plaintext looks like a filesystem. The HMAC-cracking approach has the benefit that it is immediately obvious whether or not the result is correct.
That's one thing I was wondering about. I suppose it depends on how much additional time it takes to attempt to decrypt the file with each key.
Is 1 theoretically faster because it entails a smaller keyspace? Or is there another, faster way?
No. The question is not about 128-bits vs 256-bits (both are assumed to be sufficient in crypto literature).
You're missing the point. The question is which one is theoretically faster. The question is not whether both would take a sufficiently long time.
The design flaw here is that is the HMAC and encryption key are derived from exactly the same input, and an HMAC construction allows fast (cheap to attack) checking of whether or not a key is correct (because it is designed for cheap message authentication using shared secrets), whereas a strong KDF is designed to be expensive-to-compute (and that expense does not need to be paid by an attacker who can bypass it by computing the HMAC for each candidate passphrase instead).
I know. You just repeated the issue that I originally reported.
jpouellet
Dec 11, 2016
Contributor
I know. You just repeated the issue that I originally reported.
Err, oops. Indeed. This thread has been so long I'd forgotten you were the OP. :) My apologies if you feel I've insulted your intelligence by my explanation.
defuse
Dec 11, 2016
Related question: Theoretically, what's the fastest way to brute force a Qubes backup under the current scheme? There are at least two ways to go, right?
1. Keep trying 32-digit hexadecimal strings (i.e., possible outputs of md5(passphrase)) as aes-256-cbc keys until one works. (Assuming that the passphrase has >= 128 bits of entropy.)
2. Keep computing openssl dgst -sha512 -hmac '<passphrase>' backup-header until one matches backup-header.hmac.
These are about equivalent, if you ignore the difference in time it takes to compute MD5 then AES vs. HMAC-SHA512.
The reason I ask is because this would seem to affect estimates of how strong/weak the current system is (i.e., how long it would take to brute force given different values of passphrase entropy).
The worst problem with the current system is that the password isn't salted. Without a salt, the attacker can perform their brute-force attack once, saving all the (HMAC, password) pairs they compute into a lookup table, then quickly crack any HMAC in that list with a fast (<1 second) lookup. That's a lot cheaper than having to re-run the brute-force attack for each hash (which using a salt ensures). Fixing this changes the asymptotics of the best attack from O(N) precomputation plus O(1) per hash to crack, to no precomputation and O(N) per hash to crack.
The second problem is that we want to use some sort of slow key derivation function on the password, like Argon2, scrypt, bcrypt, etc, to increase the cost of attacks further.
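A toy sketch of that precomputation advantage (illustrative only; the header bytes and wordlist are made up, and real attacks use large wordlists and GPUs):

```python
import hashlib
import hmac
import os

HEADER = b"backup-header contents"
GUESSES = ["password", "hunter2", "correct horse battery staple"]

# Unsalted: build the (tag -> passphrase) table once, then reuse it against
# every backup whose header has this layout. O(N) once, O(1) per victim.
table = {hmac.new(g.encode(), HEADER, hashlib.sha512).hexdigest(): g
         for g in GUESSES}

def crack_unsalted(target_hex: str):
    return table.get(target_hex)

# Salted: a random per-backup salt never appears in any precomputed table,
# so the O(N) guessing work has to be redone for each backup.
def salted_tag(passphrase: str, salt: bytes) -> str:
    return hmac.new(passphrase.encode(), salt + HEADER, hashlib.sha512).hexdigest()
```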
marmarek
Dec 11, 2016
Member
See #971 (comment). I think it has all been covered already (mostly by using the scrypt utility).
marmarek
Jan 12, 2017
Member
One issue with the new backup format is that each chunk (currently 100MB) has its key derived separately - and that takes time (by design). I think the easiest thing to do is to increase the chunk size to, for example, 1GB, to minimize the impact.
Another solution would be to derive the key once and use it for all the chunks, but that will require more work because:
- cannot use the scrypt tool directly - for example, go back to openssl enc/openssl dgst, this time with explicit keys (derived using the scrypt KDF) instead of the very weak built-in KDF; or find some other tool
- needs some other way to bind an encrypted chunk to its name (currently done by prefixing the chunk name to the passphrase fed into the scrypt KDF)
And this will also make the manual restore harder.
Looking at this long discussion, I don't want to go through it all over again...
andrewdavidwong
Jan 13, 2017
Member
One issue with the new backup format is that each chunk (currently 100MB) has its key derived separately - and that takes time (by design). I think the easiest thing to do is to increase the chunk size to, for example, 1GB, to minimize the impact.
Is this "just" a performance issue, or is it also a security issue? How much of a performance impact are we talking about? Why is the current chunk size 100M in the first place?
Another solution would be to derive the key once and use it for all the chunks, but it will require more work because: [...]
Both of those options sound undesirable. Are there any downsides to just using a 1GB chunk size?
Looking at this long discussion, I don't want to go through it all over again...
Not sure what you mean here. Why would we have to go through it all again?
marmarek
Jan 13, 2017
Member
On Fri, Jan 13, 2017 at 01:44:14AM -0800, Andrew David Wong wrote:
> One issue with the new backup format is that each chunk (currently 100MB) has its key derived separately - and that takes time (by design). I think the easiest thing to do is to increase the chunk size to, for example, 1GB, to minimize the impact.
Is this "just" a performance issue, or is it also a security issue? How much of a performance impact are we talking about? Why is the current chunk size 100M in the first place?
Yes, I think it is "just" a performance issue. 100MB was arbitrarily chosen as a
trade-off between space needed in dom0 temporary directory and
performance. There is also a second arbitrary number here: how many
chunks can be queued in dom0 - currently it is 10. Each chunk needs to
be first created in dom0 and only when the whole chunk is ready it can
be sent to the VM/final storage. Having at most 10*100MB chunks means it
requires at most 1GB, which is feasible to keep even in tmpfs (RAM), for
performance. Something like 10*1GB, or even 5*1GB, will probably be too
much, so the disk will need to be used (lowering performance).
> Looking at this long discussion, I don't want to go through it all over again...
Not sure what you mean here. Why would we have to go through it all again?
There are many tools, and assessing which is the best is tricky,
especially when none of us is a cryptographer...
--
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
andrewdavidwong
Jan 14, 2017
Member
In that case, I agree with you. Better to keep it simple and just increase the chunk size to 1GB (or some similarly suitable size).
ptitdoc
Feb 16, 2017
Keep in mind that some people want to backup things because their disk is almost full. That was the reason of using chunks in the first place.
I feel that 1GB chunks are dangerous in such cases. For instance, if I have 5GB free, the backup should not queue more than 4 chunks of 1GB.
Then if the backup fails because there is not enough free space, the files may not be properly removed, and the user ends up with a broken system.
With lvm thin provisioning, I usually end up with a read-only device with IO errors because of a full disk, with no opportunity to clear some free space without running a live CD.
ptitdoc
Feb 16, 2017
Just to add to the discussion: I worked on an alternative version that still uses openssl enc and dgst while using Python's hashlib key derivation function:
- perform key derivation using python PBKDF2 hashlib function
- add a .salt file to the backup file to support PBKDF2
- use openssl enc and dgst with the derived key as the KEY without using openssl password derivation mechanisms
- add a random .iv file in the backup for each chunk
- backup can still be decrypted manually using a small python script to perform PBKDF2 key derivation, reading IVs and feeding it to the openssl enc or dgst tool.
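A sketch of the key-derivation part of that scheme (the iteration count, salt size, and variable names are illustrative picks of mine, not values from the actual patch):

```python
import hashlib
import os

passphrase = b"example passphrase"

salt = os.urandom(32)   # would be stored next to the backup as the .salt file
iv = os.urandom(16)     # one random IV per chunk, stored as that chunk's .iv file

# Derive an explicit 256-bit key; openssl then receives -K/-iv directly
# instead of applying its own weak built-in password derivation, e.g.:
#   openssl enc -aes-256-cbc -K <key hex> -iv <iv hex> -in chunk -out chunk.enc
key = hashlib.pbkdf2_hmac("sha256", passphrase, salt, 100_000, dklen=32)
```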
Problems are:
- in this case we are relying on Python's random generator (os.urandom) and Python's implementation of PBKDF2 (based on the documentation, it may use the openssl implementation)
- it may still be prone to cryptographic attack because of errors in handling or sizes of IVs ...
- the raw derived backup key is passed as a parameter to openssl, and is present in many places in memory.
marmarek
Feb 16, 2017
Member
Keep in mind that some people want to backup things because their disk is almost full. That was the reason of using chunks in the first place.
Then tmpfs could be used (so RAM + swap). When using 1GB chunks, the maximum number of them should probably be decreased to 5 or so (currently it's 10). The number of chunks in the queue is not really important as long as it isn't 1 - it should only be large enough to allow uploading one chunk in parallel while preparing the next one. Maybe a good compromise would be 4x500MB?
ptitdoc
Feb 16, 2017
Yes, most Qubes users probably use at least 8GB of RAM, and most VMs currently need to be stopped to perform backups. So we can hope that there is 4GB of free space in dom0, or maybe we can check for that before performing a backup, and warn the user about bad performance if the check fails?
andrewdavidwong
Feb 16, 2017
Member
If the performance impact is significant (e.g., at least several minutes faster per qvm-backup) on systems with sufficient RAM, then I recommend keeping the larger 1GB chunk size. As @ptitdoc pointed out, most Qubes systems will have at least 8GB. The ones that have less will never be running many VMs anyway, so it won't be hard to shut them down for qvm-backup. Meanwhile, the vast majority of users will benefit from performance gains on a task they do (or should be doing) frequently. Across all Qubes users, this seems like it could easily add up to thousands of hours saved over the lifetime of R4.0.
marmarek
Feb 16, 2017
Member
It's about 1s for the KDF alone. Let's do some math: assume a 100GB backup, with VM sizes being a multiple of 1GB. This gives:
- 1GB chunk: 1m 40s overhead for KDF
- 500MB chunk: 3m 20s overhead for KDF
- 100MB chunk: 16m 40s overhead for KDF
Is ~3.5m bad?
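The figures above follow directly from chunk count times ~1s per KDF run; a quick back-of-the-envelope check:

```python
def kdf_overhead_seconds(total_mb, chunk_mb, kdf_seconds=1.0):
    """Total KDF overhead when the KDF is re-run once per chunk."""
    chunks = -(-total_mb // chunk_mb)  # ceiling division
    return chunks * kdf_seconds


# 100GB backup, ~1s KDF, the three chunk sizes discussed:
for chunk_mb in (1000, 500, 100):
    s = int(kdf_overhead_seconds(100_000, chunk_mb))
    print("%4d MB chunks: %dm %02ds" % (chunk_mb, s // 60, s % 60))
```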
andrewdavidwong
Feb 16, 2017
Member
Suppose a single user makes weekly backups for a year:
(3.5 minutes * 52 weeks) / 60 minutes = ~3 hours saved per year
Suppose a little under 10% of the estimated Qubes userbase does this:
~3 hours * 2,000 users = ~6,000 hours saved per year
So, even with conservative assumptions, we'd be saving users thousands of hours per year, collectively. And the only cost is that users with only 4GB of RAM may have to shut down their VMs before they run qvm-backup, which they'd probably do anyway? Seems worth it to me.
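The two estimates above, written out (the per-backup saving and user count are the assumptions stated in the comment):

```python
minutes_saved_per_backup = 3.5   # from the chunk-size math above
backups_per_year = 52            # weekly backups
users = 2000                     # ~10% of the estimated userbase

hours_per_user = minutes_saved_per_backup * backups_per_year / 60
total_hours = hours_per_user * users
print("per user: ~%.1f h/year, collectively: ~%d h/year"
      % (hours_per_user, total_hours))  # ~3.0 h/year, ~6066 h/year
```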
marmarek
Feb 16, 2017
Member
I don't see why we should look at the collective time of the Qubes userbase. It isn't really an observable value for anyone. But it does make sense to look at time spent/saved per user per year.
grey-olli
Feb 16, 2017
On Feb 16, 2017 14:47, "Andrew David Wong" <notifications@github.com> wrote:
If the performance impact is significant (e.g., at least several minutes
faster per qvm-backup) on systems with sufficient RAM, then I recommend
keeping the larger 1GB chunk size. As @ptitdoc <https://github.com/ptitdoc>
pointed out, most Qubes systems will have at least 8GB. The ones that have
less will never be running many VMs anyway, so it won't be hard to shut
them down for qvm-backup. Meanwhile, the vast majority of users will
benefit from performance gains on a task they do (or should be doing)
frequently. Across all Qubes users, this seems like it could easily add up
to thousands of hours saved over the lifetime
Users should be told the backup's requirements explicitly before they start
it. Whatever chunk size you agree on, the requirement to have X gigabytes
of disk or free RAM should be shown before the process is allowed to
start. BTW: if you need that much disk space, why not reserve it by
creating a loop-mounted image of that size for the backup operation, then
use it and finally delete it? If the minimum memory could be allocated the
same way, why not do that at backup start?
jpouellet
Feb 16, 2017
Contributor
I don't see a fundamental reason why the KDF needs to be recomputed for each chunk... the output of the KDF should be safely cacheable, just as we must cache the plaintext passphrase between chunks today.
Do I understand correctly that the current KDF-recomputation-on-each-chunk is just an artifact of wanting to build the backup system out of guaranteed-to-be-easily-obtainable-in-the-future programs and treat each such program as a black box, and the single scrypt binary being responsible for both computing the KDF and doing the bulk encryption, thus having no clean way to cache the KDF output? Or is there some other reason?
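When the KDF and the bulk cipher are separate steps, caching is indeed straightforward: run the expensive KDF once per backup, then derive a cheap per-chunk key from the cached master key. A sketch of the idea using Python's `hashlib.scrypt` and HMAC as a cheap sub-key derivation — this illustrates the caching structure only, not the on-disk format produced by the actual `scrypt` utility:

```python
import hashlib
import hmac
import os


def master_key(passphrase: bytes, salt: bytes) -> bytes:
    """Expensive step: run scrypt once per backup, not once per chunk."""
    return hashlib.scrypt(passphrase, salt=salt,
                          n=2 ** 14, r=8, p=1, dklen=32)


def chunk_key(master: bytes, chunk_index: int) -> bytes:
    """Cheap per-chunk derivation from the cached master key."""
    return hmac.new(master, b"chunk-%d" % chunk_index,
                    hashlib.sha256).digest()


salt = os.urandom(16)
mk = master_key(b"correct horse battery staple", salt)  # paid once
keys = [chunk_key(mk, i) for i in range(3)]             # nearly free
```

With a standalone `scrypt` binary treated as a black box, this separation is exactly what is unavailable, which matches the explanation below.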
Yes, exactly this one alone.
andrewdavidwong commented Apr 26, 2015
See:
https://groups.google.com/d/msg/qubes-devel/CZ7WRwLXcnk/u_rZPoVxL5IJ