Improve qvm-backup key derivation/management #971

Closed
andrewdavidwong opened this Issue Apr 26, 2015 · 150 comments

Comments


@marmarek marmarek added this to the Release 3.0 milestone May 3, 2015

@marmarek marmarek modified the milestones: Release 3.1, Release 3.0 May 13, 2015

@marmarek (Member) commented Jun 15, 2015

We need someone who knows cryptography to check the discussed key derivation scheme, and especially to answer the questions in this message: https://groups.google.com/d/msg/qubes-devel/CZ7WRwLXcnk/1N0sYf6lVvUJ


@andrewdavidwong (Member) commented Oct 22, 2015

We may never get the assistance of a cryptographer on this. For now, we should at least pass the -md option to openssl and specify sha256 (to match the key length of aes256). In other words:

openssl enc -md sha256 -e -aes-256 [...]

This would be better than nothing. Currently, md5 is used by default, which actually weakens security for users whose passphrases carry more than 128 bits of entropy (probably almost all Qubes users).
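
For illustration, a fuller invocation along these lines might look as follows (a hypothetical sketch only: the concrete cipher, file names, and passphrase handling are placeholders, not the actual qvm-backup code):

openssl enc -e -aes-256-cbc -md sha256 -salt \
    -pass pass:"$BACKUP_PASSPHRASE" \
    -in private.img.000 -out private.img.000.enc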


@marmarek (Member) commented Oct 22, 2015

On Thu, Oct 22, 2015 at 02:54:10AM -0700, Axon wrote:

We may never get the assistance of a cryptographer on this. For now, we should at least pass the -md option to openssl and specify sha256 (to match the key length of aes256). In other words:

openssl enc -md sha256 -e -aes-256 [...]

This would be better than nothing. Currently, md5 is used by default, which is actually weakening security for users who select passphrases larger than 128 bits (probably almost all Qubes users).

As the user can (theoretically) freely choose the encryption algorithm, just
sha256 may not be a universal option. Do you know any way to get the key
length from an openssl cipher name? Or just search for some numbers in
it?...

Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?


@andrewdavidwong (Member) commented Oct 23, 2015

As the user can (theoretically) freely choose the encryption algorithm, just sha256 may not be a universal option.

Two ideas:

  1. Set the default to sha256, but give the user the option to change it. If the user wants to choose a different encryption algorithm, then it's up to her to also choose an appropriate message digest algorithm to accompany it.
  2. Hard code sha512 for everything, then let each encryption algorithm take as many truncated bits as it can (assuming that would work).

Do you know any way to get key length from openssl cipher name? Or just search for some numbers in it?...

I'm not aware of any. That would be my guess too...


@andrewdavidwong (Member) commented Oct 23, 2015

Based on this and after further research, I think it would be best to go with GPG for encryption (but not for integrity-verification).

Ideally, we would read from the user-defined preferences in ~/.gnupg/gpg.conf in dom0. This would allow users to set the following options (among others):

--s2k-cipher-algo name
      Use name as the cipher algorithm used to protect secret keys.  The default cipher
      is CAST5. This cipher is also used for  conventional  encryption  if  --personal-
      cipher-preferences and --cipher-algo is not given.

--s2k-digest-algo name
      Use  name  as  the  digest algorithm used to mangle the passphrases.  The default
      algorithm is SHA-1.

--s2k-mode n
      Selects how passphrases are mangled. If n is 0 a plain passphrase (which  is  not
      recommended)  will  be  used,  a  1  adds  a  salt to the passphrase and a 3 (the
      default) iterates the whole process a number of times (see --s2k-count).   Unless
      --rfc1991 is used, this mode is also used for conventional encryption.

--s2k-count n
      Specify how many times the passphrase mangling is repeated.  This value may range
      between 1024 and 65011712 inclusive.  The default  is  inquired  from  gpg-agent.
      Note  that  not all values in the 1024-65011712 range are legal and if an illegal
      value is selected, GnuPG will round up to the nearest legal value.   This  option
      is only meaningful if --s2k-mode is 3.

Recommended defaults:

gpg --encrypt --symmetric --s2k-cipher-algo AES256 --s2k-digest-algo SHA512 --s2k-mode 3 --s2k-count 65011712 --compress-algo bzip2
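
Applied to a single backup file, the proposed defaults might look roughly like this (a hypothetical sketch; the file names are placeholders, gpg prompts for the passphrase, and --encrypt is omitted here because --symmetric alone already performs passphrase-based encryption):

gpg --symmetric \
    --s2k-cipher-algo AES256 --s2k-digest-algo SHA512 \
    --s2k-mode 3 --s2k-count 65011712 \
    --compress-algo bzip2 \
    --output private.img.gpg private.img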
@marmarek (Member) commented Oct 23, 2015

This seems to be the best option.
Some minor questions/improvements/thoughts:

  1. There should be no --encrypt, as it is for asymmetric encryption.
  2. We currently have compression handled independently of encryption; I think we should keep that and use --compress-algo none. Our current approach allows specifying any "filter" program to do the actual compression. Personally I like pigz (parallel gzip) for much better performance on an 8-core system :)
  3. Generally I'm fine with --s2k-count 65011712; on my system it takes 1.00 s to encrypt just the string "test" (so it is mostly a key derivation benchmark), but it may be somewhat longer on older systems; do we care? On a Thinkpad T61 (Core 2 Duo T7300): 1.42 s. Ok, not bad at all. (A rough timing sketch follows this list.)
  4. What about integrity protection (openssl)? Is it a problem that the passphrase is used directly here (at least I think it is)? If yes, and if we want to change that, it would require some more work to keep it compatible with older formats, because of the integrity protection of the backup header (only after its verification can we read properties like backup format version, encryption algorithm, etc.).
  5. Backup data will be integrity-protected twice - once by openssl, and additionally by gpg. I don't think it is a problem, but I'm not a cryptographer.
  6. I don't think we should parse ~/.gnupg/gpg.conf. Either let use the defaults (possibly changed by user in that file), or override some option.
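
For reference, a rough way to reproduce the timing above (a sketch only; the passphrase and file names are placeholders, and depending on the GnuPG version --pinentry-mode loopback may also be needed):

echo test > /tmp/s2k-test
time gpg --batch --yes --symmetric --passphrase "benchmark passphrase" \
    --s2k-mode 3 --s2k-count 65011712 --s2k-digest-algo SHA512 \
    --output /dev/null /tmp/s2k-test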
@adrelanos (Member) commented Oct 23, 2015

As for most secure gpg settings, check this out.

Quote https://github.com/Whonix/anon-gpg-tweaks

"Security and privacy enhancements for gnupg's config file for user
"user" in /home/user/.gnupg/gpg.conf." See also:
https://raw.github.com/ioerror/torbirdy/master/gpg.conf and
ioerror/torbirdy#11.

  1. Backup data will be integrity-protected twice - once by openssl, and additionally by gpg. I don't think it is a problem, but I'm not a cryptographer.

Not a bad thought. In the past there were "known plaintext attacks", where
it was recommended not to encrypt known plaintext, as this could
weaken the ciphertext. Any serious algorithm nowadays must withstand this.

Normally it should not be a problem. If it were, it would be a critical cipher weakness.

I am just worried about the extra complexity of double encryption with
regard to manual backup recovery. But if you think it's fine, go for it.

  1. I don't think we should parse ~/.gnupg/gpg.conf. Either let use the defaults (possibly changed by user in that file), or override some option.

Yes, programs should not rely on user settings in ~/.gnupg/gpg.conf
indeed. Configurable options is a pony.

gpg supports --no-options (skip ~/.gnupg/gpg.conf) and --homedir.
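
For example (a sketch; the homedir path here is just a placeholder):

gpg --no-options --homedir /var/lib/qubes/backup-gnupg \
    --symmetric --s2k-cipher-algo AES256 --s2k-digest-algo SHA512 \
    --output chunk.gpg chunk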

By the way, in Qubes Q3 (and earlier) you cannot continue the backup if
you uncheck the encryption box. (I personally use full disk encryption
for all my disks, so the encryption of QVMM is nice, but I prefer to
disable it to ease manual recovery, in case the backup is ever damaged,
there is a version conflict, or whatever.)


@marmarek (Member) commented Oct 23, 2015

  1. Backup data will be integrity-protected twice - once by openssl, and additionally by gpg. I don't think it is a problem, but I'm not a cryptographer.

Not a bad thought. In the past there were "known plaintext attacks", where
it was recommended not to encrypt known plaintext, as this could
weaken the ciphertext. Any serious algorithm nowadays must withstand this.

Normally it should not be a problem. If it were, it would be a critical cipher weakness.

I am just worried about the extra complexity of double encryption with
regard to manual backup recovery. But if you think it's fine, go for it.

Not encryption, just integrity protection - (.hmac files).

By the way, in Qubes Q3 (and earlier) you cannot continue the backup if
you uncheck the encryption box. (I personally use full disk encryption
for all my disks, so the encryption of QVMM is nice, but I prefer to
disable it to ease manual recovery, in case the backup is ever damaged,
there is a version conflict, or whatever.)

Can you create a separate ticket for it? I was not aware of this
problem...

Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?


@andrewdavidwong (Member) commented Oct 24, 2015

@marmarek:

  1. Should be no --encrypt as it is for asymmetric encryption

Oops. Right.

  1. We currently have compression handled independently of encryption, I think we should keep that and use --compress-algo none. Our current approach allows to specify any "filter" program to do the actual compression. Personally I like pigz (parallel gzip) for much better performance on 8-core system :)

Sounds good to me. :)

  1. Generally I'm fine with --s2k-count 65011712; on my system it needs 1.00s to encrypt just "test" string (so mostly key derivation benchmark), but it may be somehow longer on older systems; do we care? On Thinkpad T61 (Core 2 Duo T7300) - 1.42s. Ok, not bad at all.

I suggested 65011712 as the default because it's the maximum. I think we should assume Qubes users want the "most secure" options by default. If they value backward compatibility more, they can override the defaults. (The security gain of the "most secure defaults" strategy is in some cases debatable, but the choice of defaults acts as an important reputational signal for a security-oriented OS.)

  1. What about integrity protection (openssl)? is it a problem that the passphrase is used directly here (at least I think it is)?

Yes, this is the other part of the problem.

If yes and if we want to change that, it would require some more work to make it compatible with older formats, because integrity protection of backup header (only after its verification we can get some properties like backup format version, encryption algorithm etc).

True...

  1. Backup data will be integrity-protected twice - once by openssl, and additionally by gpg. I don't think it is a problem, but I'm not a cryptographer.

IANAC either. I take it integrity-protection from gpg can't be turned off?

  1. I don't think we should parse ~/.gnupg/gpg.conf.

The only reason I suggested parsing ~/.gnupg/gpg.conf is because that seemed like the most logical place for the user to put their custom settings, but certainly if it's dangerous to parse it (or undesirable for another reason), then let's not do it. I really only meant to suggest that users should be able to specify their own options for s2k-cipher-algo, s2k-digest-algo, s2k-mode, and s2k-count (and any others that should be available).

Either let use the defaults

Sorry, I don't understand. Can you rephrase?

(possibly changed by user in that file),

Which file?

or override some option.

Would it be something like this?

qvm-backup --encrypt --s2k-count 1024 [...]
@andrewdavidwong (Member) commented Oct 24, 2015

@adrelanos:

Yes, programs should not rely on user settings in ~/.gnupg/gpg.conf indeed.

I'm glad you and @marmarek pointed this out to me. As mentioned in my previous message, my suggestion about reading from ~/.gnupg/gpg.conf was really just an expression of my desire for users to be able to choose their own custom settings for gpg to use.

However, now I'm curious: Why is it such a bad idea for programs to read from/rely on ~/.gnupg/gpg.conf?

Configurable options is a pony.

What does this mean?


@marmarek (Member) commented Oct 24, 2015

On Fri, Oct 23, 2015 at 07:16:42PM -0700, Axon wrote:

  1. Backup data will be integrity-protected twice - once by openssl, and additionally by gpg. I don't think it is a problem, but I'm not a cryptographer.

IANAC either. I take it integrity-protection from gpg can't be turned off?

Yes, I think it can't be disabled.

  1. I don't think we should parse ~/.gnupg/gpg.conf.

The only reason I suggested parsing ~/.gnupg/gpg.conf is because that seemed like the most logical place for the user to put their custom settings, but certainly if it's dangerous to parse it (or undesirable for another reason), then let's not do it. I really only meant to suggest that users should be able to specify their own options for s2k-cipher-algo, s2k-digest-algo, s2k-mode, and s2k-count (and any others that should be available).

Either let use the defaults

Sorry, I don't understand. Can you rephrase?

(possibly changed by user in that file),

Which file?

qvm-backup (one program) should not read ~/.gnupg/gpg.conf
(the configuration of another program). We have no control over it: the syntax
can change, its location may change, some other options may be
introduced (including another file?), etc.

So for each option we should either set it explicitly on the command line
(always), or not set it at all. Do not try to guess whether the user set some
custom value in gpg.conf or not.

or override some option.

Would it be something like this?

qvm-backup --encrypt --s2k-count 1024 [...]

Yes.

Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?


@adrelanos (Member) commented Oct 24, 2015

Axon:

@adrelanos:

Yes, programs should not rely on user settings in ~/.gnupg/gpg.conf indeed.

I'm glad you and @marmarek pointed this out to me. As mentioned in my previous message, my suggestion about reading from ~/.gnupg/gpg.conf was really just an expression of my desire for users to be able to choose their own custom settings for gpg to use.

However, now I'm curious: Why is it such a bad idea for programs to read from/rely on ~/.gnupg/gpg.conf?

It's not super bad, not a huge issue, just bad practice, because that
file is for user settings, not arbitrary tools. Avoiding it prevents conflicts with
whatever the user did. It's hard to modify a user config file,
specifically if there is no .d folder, and especially in the home
folder. By avoiding it, it gets easier to keep the configuration tailored to the
tool in question.

Configurable options is a pony.

What does this mean?

An optional, non-essential, nice-to-have thing, if someone contributes it.


@adrelanos (Member) commented Oct 24, 2015

Axon:

integrity-protection from gpg can't be turned off?

I don't know if it should be turned off (why?) but it can be configured:

--force-mdc / --disable-mdc
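
For example (a sketch; file names are placeholders):

gpg --symmetric --force-mdc --output chunk.gpg chunk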


@andrewdavidwong (Member) commented Oct 24, 2015

qvm-backup (one program) should not read ~/.gnupg/gpg.conf (configuration of another program). We have no control over it, syntax can change, its location may change, some other options may be introduced (including another file?) etc.

So for each option we should either set it explicitly on command line (always), or not set at all. Do not try to guess whether user set some custom value in gpg.conf or not.

Ok, that makes sense. Thank you for explaining it to me. :)


@andrewdavidwong (Member) commented Oct 24, 2015

It's not super bad, not a huge issue, just bad practice, because that file is for user settings, not arbitrary tools. Avoiding it prevents conflicts with whatever the user did. It's hard to modify a user config file, specifically if there is no .d folder, and especially in the home folder. By avoiding it, it gets easier to keep the configuration tailored to the tool in question.

Right, makes sense.

An optional, non-essential, nice-to-have thing, if someone contributes it.

Oh, I see. Thanks. :)

I don't know if it should be turned off (why?) but it can be configured:[...]

Oh, ok. I guess the only reason to turn it off would be if we're worried about double integrity-protection causing problems, but it sounds like that's not really a concern.


@pqg commented Oct 25, 2015

FWIW, I wonder whether the minimum effort variation would be to apply the key stretching just once, to encrypt a random master key.

e.g.:

  1. A random Master Key, MK, of 512 bits is generated.
  2. This key is recorded in a file in the archive, say ./master-key, encrypted under the user's passphrase, using a key stretching method of some variety (I'll get back to the details of such later).
  3. The encryption key EK is derived as EK = HMAC_SHA512(MK, "qubes_backup_encryption_key")
  4. EK, truncated to 256 bits, is used to encrypt files in the current manner, but it is supplied to openssl via the parameter -K (i.e. as a raw key); I suspect this also means that the IV will have to be generated by the backup program and be manually fed in to openssl via -iv.
  5. The Authentication Master Key AMK is derived as AMK = HMAC_SHA512(MK, "qubes_backup_authentication_key")
  6. For each file, an Authentication Key AK_<filepath> is derived as AK_<filepath> = HMAC_SHA512(AMK, "<filepath>")
  7. Each file has its HMAC taken and stored in an adjacent <blah>.hmac file, in the current manner. (A shell sketch of these derivations follows this list.)
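
A rough shell sketch of the scheme above (illustrative only; the label strings, file names, hex encoding, and cipher are assumptions rather than a finalized format, and step 2, protecting ./master-key with a stretched passphrase, is omitted):

# 1. Generate a random 512-bit master key (hex-encoded for convenience).
MK=$(openssl rand -hex 64)

# 3./5. Derive the encryption and authentication master keys from MK.
EK=$(echo -n qubes_backup_encryption_key | \
    openssl dgst -sha512 -mac hmac -macopt hexkey:"$MK" | sed 's/^.*= //')
AMK=$(echo -n qubes_backup_authentication_key | \
    openssl dgst -sha512 -mac hmac -macopt hexkey:"$MK" | sed 's/^.*= //')

# 4. Encrypt a file under EK truncated to 256 bits, passing the raw key and a
#    freshly generated IV to openssl (no passphrase-based derivation involved).
IV=$(openssl rand -hex 16)
openssl enc -e -aes-256-cbc -K "${EK:0:64}" -iv "$IV" \
    -in vm1/private.img -out vm1/private.img.enc

# 6./7. Derive a per-file authentication key and compute that file's HMAC.
AK=$(echo -n "vm1/private.img" | \
    openssl dgst -sha512 -mac hmac -macopt hexkey:"$AMK" | sed 's/^.*= //')
openssl dgst -sha512 -mac hmac -macopt hexkey:"$AK" \
    vm1/private.img.enc > vm1/private.img.enc.hmac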

Using a unique key for each file's MAC resolves a possible issue where an attacker could swap files around in the archive, potentially migrating an infection from a lower-security VM to a higher-security VM.

This also avoids having an attacker brute-force the user's passphrase against one/any of the HMACs.

A NIST-specified KDF could be used on the Master Key to derive the other keys instead, but none lend themselves to being specified in an openssl one-liner. Also, 2 of the 3 algos in NIST 800-108 more or less reduce to the above HMAC invocations if we don't need subkeys longer than the HMAC's block size.

Alas, this proposal requires more operations before verifying the ./backup-header file, which is bad both for code exposure concerns and backwards compatibility. However, preserving the ./backup-header.hmac in its current form will leave a method by which, as mentioned, an attacker can bypass the key stretching and brute-force the bare passphrase (HashCat has an "HMAC-SHA256 (key = $pass)" mode already that might be able to be dropped in to serve such a purpose, though I've not tested it).

I've not looked at the code yet, but if GnuPG does not do /too much/ parsing, maybe it's still okay to use it for key stretching?

There was actually a competition held recently to pick a good key stretching ("password hashing") algorithm. The winner was an entrant called Argon2, which beat out other entrants including yescrypt, which is an improved scrypt variant. Note that a key issue with scrypt is that it allows time/memory trade-offs to an extent that's undesirable in an optimal key stretching algorithm, though this issue can be mitigated by tuning the scrypt invocation to require sufficiently large memory. And all of these key stretching algorithms wipe the floor with GnuPG's S2K Mode 3, which is just an iterated hash, and therefore not memory-hard at all (i.e. GPUs and ASICs can attack it much faster than the CPUs that will be computing it under normal circumstances).

Of course, the problem with exotic key stretching algos is that it's hard to use them without making manual backup recovery dependent on non-ubiquitous software. And a million iterations of SHA512 is much better than no stretching at all.

I'll look into this a bit more later, and report back with any developments :)


@marmarek (Member) commented Oct 25, 2015

On Sun, Oct 25, 2015 at 10:13:47AM -0700, pqg wrote:

FWIW, I wonder whether the minimum effort variation would be to apply the key stretching just once, to encrypt a random master key.

Generally we want to avoid doing too much low-level crypto handling,
because we are not cryptographers.

Using a unique key for each file's MAC resolves a possible issue where an attacker could swap files around in the archive, potentially migrating an infection from a lower-security VM to a higher-security VM.

This isn't a problem, because paths are stored inside the archive(s). The outer layer is only for convenience (streaming the backup while creating it, partial restore). Take a look at point 7 here:
https://www.qubes-os.org/en/doc/backup-emergency-restore-v3/

Alas, this proposal requires more operations before verifying the ./backup-header file, which is bad both for code exposure concerns and backwards compatibility. However, preserving the ./backup-header.hmac in its current form will leave a method by which, as mentioned, an attacker can bypass the key stretching and brute-force the bare passphrase (HashCat has an "HMAC-SHA256 (key = $pass)" mode already that might be able to be dropped in to serve such a purpose, though I've not tested it).

Yes, IMO this is currently the main concern. When we get to the ./backup-header, we can have KDF parameters, salt, iteration count, or any other configuration needed for restoring the rest of the backup. But before that, we need something that works without such knowledge...

I've not looked at the code yet, but if GnuPG does not do /too much/ parsing, maybe it's still okay to use it for key stretching?

I think we previously concluded that we don't want to expose gpg to any unverified input.

Of course, the problem with exotic key stretching algos is that it's hard to use them without making manual backup recovery dependent on non-ubiquitous software. And a million iterations of SHA512 is much better than no stretching at all.

Also, using exotic algorithms isn't a good idea, because they haven't received that much review/analysis/testing.

But generally, IMHO scrypt as a KDF (exactly the thing it was designed for, right?) should be ok. If we need to employ a KDF ourselves at all...

Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?


@andrewdavidwong (Member) commented Oct 26, 2015

There was actually a competition held recently to pick a good key stretching ("password hashing") algorithm. The winner was an entrant called Argon2, which beat out other entrants including yescrypt, which is an improved scrypt variant. Note that a key issue with scrypt is that it allows time/memory trade-offs to an extent that's undesirable in an optimal key stretching algorithm, though this issue can be mitigated by tuning the scrypt invocation to require sufficiently large memory. And all of these key stretching algorithms wipe the floor with GnuPG's S2K Mode 3, which is just an iterated hash, and therefore not memory-hard at all (i.e. GPUs and ASICs can attack it much faster than the CPUs that will be computing it under normal circumstances).

Of course, the problem with exotic key stretching algos is that it's hard to use them without making manual backup recovery dependent on non-ubiquitous software. And a million iterations of SHA512 is much better than no stretching at all.

Exactly. I'm skeptical that the practical security benefits of using a newer KDF will outweigh the benefits of being able to do manual backup recovery in a variety of disaster scenarios. Security-conscious users (which almost by definition includes all Qubes users) ought to be using reasonably long, high-entropy passphrases (I'll refer to these as "good" passphrases for the sake of brevity).

The current system (in which openssl applies a single round of md5) punishes users who pick good passphrases by capping maximum entropy at a measly 128 bits. We should fix that. However, my understanding (correct me if I'm wrong) is that S2K or PBKDF2, applied to good passphrases, will be uncrackable for the foreseeable long-term future (i.e., even if we assume many quadrillions of guesses per second, it would still take at least millions of years). If that's correct, then it seems to me that we should instead select the most trusted, tested, ubiquitous tools available. My backup is no good to me if I can't recover the data because the only software available to me is whatever's on an old Ubuntu disc, for example. (We can't predict in advance what kinds of emergency scenarios Qubes users might find themselves in, and under what conditions they might need manual access to their data.)


@andrewdavidwong (Member) commented Oct 26, 2015

Having said all that, if scrypt provides exactly what we're looking for, then it may be worthwhile to go with that. (And advise users to store a copy of scrypt with their backups, or something.) If anyone can get this right, it's Colin Percival.


@mfc (Member) commented Oct 26, 2015

We should add a crypto tag to this (and any other similar issues), and I will create a thread with Cure53 (or NCC / iSec Partners) about this. They can provide free crypto advice to OTF-funded projects.


@marmarek marmarek added the crypto label Oct 26, 2015

@marmarek (Member) commented Oct 26, 2015

Great idea!


@pqg commented Oct 26, 2015

@axon-qubes

I concur with your position that, if at all possible, the data should be recoverable with an "old Ubuntu disc".

The scrypt binary and python-scrypt module don't seem to be currently packaged with Fedora anyway (they are available on Debian Jessie; I've not checked elsewhere). And keeping scrypt alongside the backup would be problematic if we need to invoke scrypt /before/ we can verify the integrity of anything.

As an aside, I think we should distinguish between 2 uses of "KDF":

  1. A means by which to derive subkeys of arbitrary length from a master key of fixed length.
  2. A means by which to perform "key stretching" (and salting) so as to make brute force attacks more costly on user-supplied keying material.

For instance, scrypt could provide function 2, but not function 1. Function 1 is served by things like NIST 800-108, which in the special case of all subkeys being of equal or shorter length than the underlying hash function, reduces to HMAC(MK, "\x01<subkey_purpose_specifier_string>\x00\x20"), where the \x01 and \x00 are no longer serving a particular purpose in this reduced form, and \x20 is the length of the resulting subkey. More conservative KDFs (purpose 1) use a single keystream that feeds back the result of the previous HMAC invocations or, more conservative still, they create one feedback pipeline whose output is passed to a second HMAC invocation to produce the actual keystream (TLSv1.2 does this, see section 5). However, all of these are prohibitively awkward in the "old Ubuntu disc" scenario. By contrast:

echo -n qubes_backup_authentication_key | openssl dgst -sha256 -hmac <Hex(Master Key)>

...should work almost anywhere, yield the desired different keys for different purposes, and be just as secure as long as SHA-256 is a strong hash function. (The fancier derivation methods are just meant to give some additional protection if it should turn out that the underlying hash function is weaker than expected.) Though note that the Master Key (randomly generated or the result of scrypt or some other key stretching) will be a binary blob, which will be tough to pass via -hmac unless encoded by hex (as above) or base64, which is kind of unsightly protocol-wise. On OpenSSL >= v1.0.0 (Ubuntu since at least 2011) the following invocation could be used instead to pass the master key hex-encoded:

echo -n qubes_backup_authentication_key | openssl dgst -sha256 -mac hmac -macopt hexkey:<Hex(Master Key)>

So that the output is actually HMAC_SHA256(MK, "qubes_backup_authentication_key") rather than HMAC_SHA256(Hex(MK), "qubes_backup_authentication_key") as it would have to be with -hmac.

However, my understanding (correct me if I'm wrong) is that S2K or PBKDF2, applied to good passphrases, will be uncrackable for the foreseeable long-term future (i.e., even if we assume many quadrillions of guesses per second, it would still take at least millions of years).

With no key stretching (e.g. just take the SHA-256 of the passphrase), an attacker capable of 10 quadrillion tries a second for 10 million years would exhaust an ~101-bit keyspace (10^16 guesses/s × 10^7 years × ~3.15×10^7 s/year ≈ 3×10^30 ≈ 2^101 trials). This represents approximately an 8-word diceware passphrase (~103 bits). With ~1 million rounds of key stretching, as in S2K Mode 3 at max settings, ~81 bits of input keyspace could be searched, which requires a diceware passphrase of 7 words (~90 bits), though you would get pretty close with 6 words (~78 bits).

A GPU running hashcat can currently trial on the order of 2 billion SHA-256 sums per second (2*10^9/sec) so 10^16/sec represents a powerful adversary indeed. 10 million years is of course just providing a robust margin of error, since in 30 years the crypto will be completely obsolete, and in 100 years we'll all be dead.

So 128 bits is still enough for anyone. Unless your threat model includes the arrival of quantum computers in the very near term, in which case you should double all your symmetric crypto key length requirements.

That said, as you originally advanced, even if nothing else changes from this discussion, -md sha256 should be added to the openssl enc invocations. It's much more conservative.

None of this, of course, helps with the MACs, where we really want to salt and stretch the user's passphrase before doing anything, even if the same resulting key is used for both verification and encryption.

Adding a salt not only avoids precomputation attacks, but binds the files to the particular backup. While the renaming attack I previously feared is indeed impossible, I don't /think/ that there's currently any protection against shuffling in an identically-named VM from a different backup, as long as both backups were made with the same passphrase.

However, there then needs to be some way to store the salt and, if variable, any key stretching parameters, and load them before verification. Would it be an acceptable risk to parse just the key=values out of the ./backup-header and interpret those necessary to establish the key, then check the ./backup-header.hmac, and finally interpret the rest of the values (e.g. if, in the future, a timestamp is added to the header, it would be good to not have to parse that until after verification)? Setting hard line and file length limits might help limit the attack surface.

This certainly presents less attack surface than any way I can think of using GPG to do pre-verification stretching. Though, I'm aware I haven't actually proposed another way to perform key stretching yet. Python has baked-in hashlib.pbkdf2_hmac(), but only since v3.4 and v2.7.8, which are far too recent (mid 2014). I'd hoped it might be possible to use gpg --symmetric on a fixed value, hash the resulting "file", and use that as the key, but I've not been able to find a way to pass a salt and/or IV to GPG, so it unfortunately generates a random one on each invocation. The scrypt binary has the same problem, assuming its availability issues could be overcome.

The bcrypt functionality in python-passlib may be a reasonable bet. I can't guarantee how ubiquitous it is, but it's in Debian >=7 and backports for 6, definitely in Ubuntu >= 12.04, and available on Fedora 21. bcrypt has been fairly heavily popularized and so should common enough; I suspect that a working implementation should be more readily obtainable for our hypothetical frazzled power user than scrypt.

On the other hand, the scrypt binary is pretty easy to build if you've got gcc and autotools. As mentioned, it's only good for encrypting/decrypting files, not plain key stretching, so we'd be back to the ./master-key idea or similar, but it still presents a considerably smaller attack surface than GnuPG.

pqg commented Oct 26, 2015

@axon-qubes

I concur with your position that, if at all possible, the data should be recoverable with an "old Ubuntu disc".

The scrypt binary and python-scrypt module don't seem to be currently packaged with Fedora anyway (they are available on Debian Jessie; I've not checked elsewhere). And keeping scrypt alongside the backup would be problematic if we need to invoke scrypt /before/ we can verify the integrity of anything.

As an aside, I think we should distinguish between 2 uses of "KDF":

  1. A means by which to derive subkeys of arbitrary length from a master key of fixed length.
  2. A means by which to perform "key stretching" (and salting) so as to make brute force attacks more costly on user-supplied keying material.

For instance, scrypt could provide function 2, but not function 1. Function 1 is served by things like NIST 800-108, which in the special case of all subkeys being of equal or shorter length than the underlying hash function, reduces to HMAC(MK, "\x01<subkey_purpose_specifier_string>\x00\x20"), where the \x01 and \x00 are no longer serving a particular purpose in this reduced form, and \x20 is the length of the resulting subkey. More conservative KDFs (purpose 1) use a single keystream that feeds back the result of the previous HMAC invocations or, more conservative still, they create one feedback pipeline whose output is passed to a second HMAC invocation to produce the actual keystream (TLSv1.2 does this, see section 5). However, all of these are prohibitively awkward in the "old Ubuntu disc" scenario. By contrast:

echo -n qubes_backup_authentication_key | openssl dgst -sha256 -hmac <Hex(Master Key)>

...should work almost anywhere, yield the desired different keys for different purposes, and be just as secure as long as SHA-256 is a strong hash function. (The fancier derivation methods are just meant to give some additional protection if it should turn out that the underlying hash function is weaker than expected.) Though note that the Master Key (randomly generated or the result of scrypt or some other key stretching) will be a binary blob, which will be tough to pass via -hmac unless encoded by hex (as above) or base64, which is kind of unsightly protocol-wise. On OpenSSL >= v1.0.0 (Ubuntu since at least 2011) the following invocation could be used instead to pass the master key hex-encoded:

echo -n qubes_backup_authentication_key | openssl dgst -sha256 -mac hmac -macopt hexkey:<Hex(Master Key)>

So that the output is actually HMAC_SHA256(MK, "qubes_backup_authentication_key") rather than HMAC_SHA256(Hex(MK), "qubes_backup_authentication_key") as it would have to be with -hmac.
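
For illustration, the same per-purpose derivation can be written with nothing but the Python standard library; the second label below is a hypothetical encryption-subkey label, not something the current backup format defines:

import hmac
import hashlib

def derive_subkey(master_key, purpose):
    # Reduced NIST 800-108 form discussed above: HMAC_SHA256(MK, purpose).
    # master_key is the raw binary master key; purpose is a label string.
    return hmac.new(master_key, purpose, hashlib.sha256).digest()

mk = bytes.fromhex("00" * 32)  # placeholder master key, for illustration only
auth_key = derive_subkey(mk, b"qubes_backup_authentication_key")
enc_key = derive_subkey(mk, b"qubes_backup_encryption_key")  # hypothetical label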

However, my understanding (correct me if I'm wrong) is that S2K or PBKDF2, applied to good passphrases, will be uncrackable for the foreseeable long-term future (i.e., even if we assume many quadrillions of guesses per second, it would still take at least millions of years).

With no key stretching, e.g. just taking the SHA-256 of the passphrase, an attacker capable of 10 quadrillion tries a second for 10 million years would exhaust an ~101-bit keyspace. This corresponds roughly to an 8-word diceware passphrase (~103 bits). With ~1 million rounds of key stretching, as in S2K Mode 3 at max settings, only an ~81-bit input keyspace could be searched, which calls for a diceware passphrase of 7 words (~90 bits), though you would get pretty close with 6 words (~78 bits).

A GPU running hashcat can currently trial on the order of 2 billion SHA-256 sums per second (2*10^9/sec) so 10^16/sec represents a powerful adversary indeed. 10 million years is of course just providing a robust margin of error, since in 30 years the crypto will be completely obsolete, and in 100 years we'll all be dead.
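
As a sanity check on those numbers (a sketch, assuming ~3.15*10^7 seconds per year and treating ~1 million rounds as 2^20):

import math

guesses_per_second = 10**16              # the powerful adversary above
seconds = 3.15e7 * 10**7                 # ~10 million years
total_guesses = guesses_per_second * seconds
print(math.log2(total_guesses))          # ~101 bits searchable with no stretching
print(math.log2(total_guesses / 2**20))  # ~81 bits with ~1 million rounds of stretching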

So 128 bits is still enough for anyone. Unless your threat model includes the arrival of quantum computers in the very near term, in which case you should double all your symmetric crypto key length requirements.

That said, as you originally advanced, even if nothing else changes from this discussion, -md sha256 should be added to the openssl enc invocations. It's much more conservative.

None of this, of course, helps with the MACs, where we really want to salt and stretch the user's passphrase before doing anything, even if the same resulting key is used for both verification and encryption.

Adding a salt not only avoids precomputation attacks, but binds the files to the particular backup. While the renaming attack I previously feared is indeed impossible, I don't /think/ that there's currently any protection against shuffling in an identically-named VM from a different backup, as long as both backups were made with the same passphrase.

However, there then needs to be some way to store the salt and, if variable, any key stretching parameters, and load them before verification. Would it be an acceptable risk to parse just the key=values out of the ./backup-header and interpret those necessary to establish the key, then check the ./backup-header.hmac, and finally interpret the rest of the values (e.g. if, in the future, a timestamp is added to the header, it would be good to not have to parse that until after verification)? Setting hard line and file length limits might help limit the attack surface.
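
A rough sketch of that two-stage parsing, in Python; the whitelisted field names and the size limit are made up for illustration and are not part of the existing backup-header format:

import hmac
import hashlib

MAX_HEADER_SIZE = 4096                                  # hypothetical hard limit
PRE_AUTH_KEYS = {"version", "kdf-salt", "kdf-rounds"}   # hypothetical whitelist

def load_backup_header(header_bytes, header_hmac, derive_key):
    if len(header_bytes) > MAX_HEADER_SIZE:
        raise ValueError("backup-header too large")
    fields = dict(line.split("=", 1)
                  for line in header_bytes.decode("ascii").splitlines()
                  if "=" in line)
    # Stage 1: interpret only the fields needed to establish the key.
    pre_auth = dict((k, v) for k, v in fields.items() if k in PRE_AUTH_KEYS)
    key = derive_key(pre_auth)
    # Stage 2: verify the whole header before trusting anything else in it.
    expected = hmac.new(key, header_bytes, hashlib.sha512).digest()
    if not hmac.compare_digest(expected, header_hmac):
        raise ValueError("backup-header HMAC verification failed")
    return fields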

This certainly presents less attack surface than any way I can think of using GPG to do pre-verification stretching. Though, I'm aware I haven't actually proposed another way to perform key stretching yet. Python has baked-in hashlib.pbkdf2_hmac(), but only since v3.4 and v2.7.8, which are far too recent (mid 2014). I'd hoped it might be possible to use gpg --symmetric on a fixed value, hash the resulting "file", and use that as the key, but I've not been able to find a way to pass a salt and/or IV to GPG, so it unfortunately generates a random one on each invocation. The scrypt binary has the same problem, assuming its availability issues could be overcome.
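
For completeness, on a Python new enough to have it, the built-in call would look roughly like this (the salt handling and iteration count are illustrative, not a proposed final choice):

import hashlib
import os

passphrase = "correct horse battery staple"  # example only
salt = os.urandom(16)                        # would have to be stored in the backup header
master_key = hashlib.pbkdf2_hmac("sha256", passphrase.encode("utf-8"), salt, 1000000, dklen=32)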

The bcrypt functionality in python-passlib may be a reasonable bet. I can't guarantee how ubiquitous it is, but it's in Debian >=7 and backports for 6, definitely in Ubuntu >= 12.04, and available on Fedora 21. bcrypt has been fairly heavily popularized and so should be common enough; I suspect that a working implementation should be more readily obtainable for our hypothetical frazzled power user than scrypt.

On the other hand, the scrypt binary is pretty easy to build if you've got gcc and autotools. As mentioned, it's only good for encrypting/decrypting files, not plain key stretching, so we'd be back to the ./master-key idea or similar, but it still presents a considerably smaller attack surface than GnuPG.

@marmarek

Member

marmarek commented Oct 26, 2015

Adding a salt not only avoids precomputaion attacks, but binds the files to the particular backup. While the renaming attack I previously feared is indeed impossible, I don't /think/ that there's currently any protection against shuffling in an identically-named VM from a different backup, as long as both backups were made with the same passphrase.

Yes, such attack probably is possible.

However, there then needs to be some way to store the salt and, if variable, any key stretching parameters, and load them before verification. Would it be an acceptable risk to parse just the key=values out of the ./backup-header and interpret those necessary to establish the key, then check the ./backup-header.hmac, and finally interpret the rest of the values (e.g. if, in the future, a timestamp is added to the header, it would be good to not have to parse that until after verification)? Setting hard line and file length limits might help limit the attack surface.

This might be an option if nothing better comes to us. Those
pre-verification options need to be clearly marked (some prefix in the
name?) and reduced to an absolute minimum. If we have to go that way, I
think this should include a version field, which would determine most of the
parameters. For example, it would be a bad idea to include the hash algorithm
in that pre-authentication set, because an attacker would be able to
replace it with some weak value (see "cipher_null" in the LUKS header [1]).

The bcrypt functionality in python-passlib may be a reasonable bet. I can't guarantee how ubiquitous it is, but it's in Debian >=7 and backports for 6, definitely in Ubuntu >= 12.04, and available on Fedora 21. bcrypt has been fairly heavily popularized and so should common enough; I suspect that a working implementation should be more readily obtainable for our hypothetical frazzled power user than scrypt.

And according to previous comments, bcrypt is also a somewhat sensible
choice, right?

On the other hand, the scrypt binary is pretty easy to build if you've got gcc and autotools. As mentioned, it's only good for encrypting/decrypting files, not plain key stretching, so we'd be back to the ./master-key idea or similar, but it still presents a considerably smaller attack surface than GnuPG.

I think we can simply have a hex-encoded, encrypted master key included in
the backup header.

And as the backup format is going to be more and more complex, maybe
it's a good idea to include an emergency restoration script as one of the
plain-text, integrity-protected files (similar to backup-header)? Then the
emergency procedure would consist of manually verifying the backup header
and that script, and then simply running the script.

[1]
https://github.com/QubesOS/qubes-secpack/blob/master/QSBs/qsb-019-2015.txt

Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?

@andrewdavidwong

Member

andrewdavidwong commented Oct 27, 2015

I notice that all of pqg's messages have disappeared from this thread. pqg, did you intentionally delete them? If so, may I ask why? (Feel free to send me an encrypted message if you prefer.) Note that this may create difficulties for people who try to read this thread in the future (e.g., cryptographers interested in helping with this issue).

@marmarek

Member

marmarek commented Oct 27, 2015

I notice that all of pqg's messages have disappeared from this thread.

As the whole account...

@andrewdavidwong

Member

andrewdavidwong commented Oct 27, 2015

As the whole account...

Ah. In light of that, I'm not sure what to make of his/her/their previous suggestions...

@pqg

pqg commented Oct 27, 2015

Apparently I tripped over some bot heuristics and was hellbanned. Presumably because I am using Tor and posted some comments with some links in them.

I trust that you can see me again?

@marmarek

Member

marmarek commented Oct 27, 2015

I trust that you can see me again?

Yes.

@Rudd-O

Rudd-O commented Jan 17, 2016

Hi. Use the scrypt source code for key derivation. It's simple, has a good license, and lets you create arbitrarily hard derivations. I'm using it for a project I'm working on, and it works just fine.

@marmarek

Member

marmarek commented Jan 17, 2016

Take a look at #971 (comment) and #971 (comment), mostly about scrypt availability and its use for key derivation alone. One of the important requirements for the backup format here is the ability to restore the data using commonly available tools, not necessarily on Qubes OS.

@marmarek

Member

marmarek commented Jan 22, 2016

Python has baked-in hashlib.pbkdf2_hmac(), but only since v3.4 and v2.7.8, which are far too recent (mid 2014).

There is a pbkdf2 implementation in the passlib python module (http://pythonhosted.org/passlib/lib/passlib.utils.pbkdf2.html), and it has been there for quite a long time. As for bcrypt in that library, I can see only an API for creating a password hash; I'm not sure how it can be applied as key derivation...

@Rudd-O

Rudd-O commented Oct 31, 2016

On 10/31/2016 09:43 PM, Jean-Philippe Ouellet wrote:

According to that paper, duplicati considers neither confidentiality
nor integrity.

Statements from that paper like:

"We chose the SHA­256 hash as that seems to be a safe bet,
although the cryptographic properties of SHA are not required."

and

This is an unfair criticism.

Duplicati lets you choose plain AES like Qubes backup today, or GPG if
you dislike the choice of SHA256. Furthermore, the software is open
source — LGPL 2.1 to be exact — so if you want to plug your own
algorithm, you're free to do so.

"If a required block is not found after scanning the index files,
block files are downloaded at
random, until all known blocks are found. This enables a restore,
even in cases without any index files."

are rather concerning when considering an adversary who maliciously
modifies your backups.

Duh, if an adversary modifies your files, it does not matter whether you
have to download a thousand 1 MB blocks or one 1 GB file; you still can't
restore! This is a 100% frivolous criticism, my man.

I like the idea of a simple incremental backup solution which treats
the actual blocks as completely opaque

That is precisely what Duplicati is (the blocks are opaque to the
storage medium), except it isn't a full+incremental scheme; it's
fully content-addressable, which is superior in many ways.

(the content-parsing to determine optimal chunking boundaries of other
backup solutions sounds like an interesting attack surface to me), but
to me this solution alone is dangerously insufficient.

I do not believe you have formed that opinion based on valid facts.

Rudd-O
http://rudd-o.com/
@Rudd-O

Rudd-O commented Oct 31, 2016

On 10/31/2016 09:46 PM, Marek Marczykowski-Górecki wrote:

@Rudd-O https://github.com/Rudd-O indeed interesting, but offtopic
here. It says nothing about encryption and integrity protection which
is main issue discussed here. It's mostly on topic of #858
#858
EDIT: @jpouellet https://github.com/jpouellet was faster about
details, so not repeating it here.

I recognize it's off topic, as the paper does not speak of integrity
protection. The integrity protection of the implementation is provided
by the choice of encryption and hashing algorithm, which is not the
topic of the paper (the paper focuses on explaining the unencrypted data
structures, which I thought would be a good read for you).

I will be trying to implement something like this for Qubes OS, since I
need this storage system for a variety of reasons. I will report back
with progress.

Rudd-O
http://rudd-o.com/
@marmarek

Member

marmarek commented Oct 31, 2016

I don't think deduplication (especially across VMs) is a good idea for a secure backup solution at all. There are multiple things that can go wrong. For example, if an attacker controls one of the VMs, they can store some data there and later, by observing the backup size, guess whether the same data is also present in some other VM, thus leaking information about the content of another VM. Probably not feasible for leaking private keys (something not known to the attacker), but very real for finding out who has some known data (like breaking the anonymity of some Tor user).
Also, today SHA256 is considered safe, but this format relies on it never, ever having any collision, not even an unintentional one. This drawback is even mentioned in that paper.

And this is just what I can think of in 5 minutes. There are probably more. So, any solution must not share anything across VMs.

@jpouellet

Contributor

jpouellet commented Nov 1, 2016

On Mon, Oct 31, 2016 at 6:40 PM, Rudd-O notifications@github.com wrote:

On 10/31/2016 09:43 PM, Jean-Philippe Ouellet wrote:

"If a required block is not found after scanning the index files,
block files are downloaded at
random, until all known blocks are found. This enables a restore,
even in cases without any index files."

are rather concerning when considering an adversary who maliciously
modifies your backups.

Duh, if an adversary modifies your files, it does not matter whether you
have to download a thousand 1 MB blocks or 1 GB file, you still can't
restore! This is a 100% frivolous criticism, my man.

Sure, they could harm availability, but if properly authenticated, the attacker's capabilities end there.

But, if you don't have a trusted way to identify and authenticate the root of your merkle-tree index thing, (which sounds to be the case with this system as described when restoring without an already trusted index) then your adversary may also gain arbitrary write on restore, and that is plenty of leverage for further attacks.

I like the idea of a simple incremental backup solution which treats
the actual blocks as completely opaque

That is precisely what Duplicati is — the blocks are opaque to the
storage medium — except it isn't in the full+incremental scheme — it's
fully content-addressable, which is superior in many ways.

Right. I was saying that's the main concept I like from it, as opposed to other systems.

(the content-parsing to determine optimal chunking boundaries of other
backup solutions sounds like an interesting attack surface to me), but
to me this solution alone is dangerously insufficient.

I do not believe you have formed that opinion based on valid facts.

Sorry to disappoint.

It makes sense from the perspective of one optimizing for storage above all else to want to split streams on intelligent boundaries such that you can re-use as many already-stored content-addressed chunks as possible.

Unfortunately, determining the most intelligent boundary is seriously non-trivial, and as one tries to achieve better results by incorporating more format-specific knowledge, complexity (and with it opportunity for vulnerabilities) necessarily explodes.

Consider libchop's anchor-based chopper based on this paper.

Upon a cursory examination, the implementation appears to include a partial domain-specific JIT compiler (!!!) just to choose where to split data!! I'd much prefer something which treated things as completely opaque blobs.

Sure, one could argue it could be effectively sandboxed and pass back only a list of indices to chop at, but still... for stuff likely running in Dom0, definitely do not want!

And for those who say "yeah, but... how likely is it for it to actually have a vuln in that... It can't be that complex... it's just some minor string manipulation", just look at the libarchive exploit that recently affected FreeBSD's update system.

@Rudd-O

Rudd-O commented Nov 2, 2016

On 10/31/2016 11:09 PM, Marek Marczykowski-Górecki wrote:

I don't think deduplication (especially across VMs) is a good idea for
secure backup solution at all. There are multiple things that can go
wrong. For example - if attacker control one of the VMs, can store
some data there, and later observing backup size guess if the data is
present also in some other VM, thus leaking information about content
of another VM.

This is an objection that is incredibly far-fetched. First of all, you
are suggesting that an attacker controls a VM that is backed up, AND
also controls the backup data. That's already incredibly improbable.
Second, you're also suggesting that the attacker writes some data that
happens to be deduplicated, which means the attacker ALREADY HAS the
data he was trying to discover BY DEFINITION.

I do not think this is a valid objection.

But, if you fear this extremely improbable thing, you can just switch
off deduplication. Presto, problem gone.

Below I explain how that can be accomplished.

Probably not feasible for leaking private keys (something not known to
the attacker), but very real for finding who have some known data
(like breaking anonymity of some Tor user).
Also, today SHA256 is considered safe, but this format rely on it to
have /never, ever/ any collision, not even unintentional one. This
drawback is even mentioned in that paper.

I'm sure new algorithms can be added to the code. It's open source.

And this is just what I can think of in 5 minutes. There are probably
more. So, any solution must not share anything across VMs.

Just turn dedup off!

It's super simple to do that.

Here's just one way to do it:

During backup, at the point where the backup is hashing the data to see if
there are any duplicates, simply hash the data together with the path name of the
file being stored (see the sketch below).

Bam, dedup between VMs is now off.
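
A minimal sketch of that idea (purely illustrative, not Duplicati's actual code): mixing a per-file or per-VM scope into the chunk hash makes identical data hash to different block IDs in different VMs, so it is never shared between them.

import hashlib

def chunk_id(data, scope=b""):
    # scope == b""  -> global dedup (identical data shares one block everywhere)
    # scope == path -> dedup confined to that file/VM
    return hashlib.sha256(scope + b"\x00" + data).hexdigest()

block = b"64 KiB of file content..."
print(chunk_id(block))                                             # shared
print(chunk_id(block, b"/var/lib/qubes/appvms/work/private.img"))  # per-VM only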

Best thing: all of this is implementable because the code is open source.

Another way to turn dedup off: always do a full backup. That way this
mysterious attacker you posit cannot compare before+after backup sizes.

Bam, problem solved!

Fundamentally, some people care about the remote scenarios you propose,
and others do not. To cater for them, there should be a setting that
allows them to toggle the behavior you dislike.

Rudd-O
http://rudd-o.com/
@Rudd-O

Rudd-O commented Nov 2, 2016

On 11/01/2016 12:19 AM, Jean-Philippe Ouellet wrote:

Sure, they could harm availability, but if properly authenticated, the
attacker's capabilities end there.

But, if you don't have a trusted way to identify and authenticate the
root of your merkle-tree index thing, (which sounds to be the case
with this system as described when restoring without an already
trusted index)

I don't think that's the case. That merkle root will not decrypt if you
give it the wrong password.

Feel free to correct me with the relevant technical details.

Rudd-O
http://rudd-o.com/
@jpouellet

Contributor

jpouellet commented Nov 2, 2016

On Wed, Nov 2, 2016 at 6:42 AM, Rudd-O notifications@github.com wrote:

On 10/31/2016 11:09 PM, Marek Marczykowski-Górecki wrote:

I don't think deduplication (especially across VMs) is a good idea for
secure backup solution at all.

This is an objection that is incredibly far-fetched. First of all, you
are suggesting that an attacker controls a VM that is backed up, AND
also controls the backup data. That's already incredibly improbable.
Second, you're also suggesting that the attacker writes some data that
happens to be deduplicated, which means the attacker ALREADY HAS the
data he was trying to discover BY DEFINITION.

Yes, the adversary knows what the data is, but does not know you
have it (and you wish to keep that a secret).

I do not think this is a valid objection.

I think this is a very valid objection.

Consider the case where you are a whistle-blower trying to expose some
corruption or whatever. You have a non-networked "secrets" vm you use
to analyze some cache of documents exposing said corruption that you
stole from some database. The adversary suspects you may possess said
documents, but wants proof before making you disappear.

By injecting a single block of said data into the disk a different
(less-trusted) vm (plenty of ways to do that... via browser cache for
example) and observing deduplication, then the adversary can confirm
the presence of data you wished to keep the existence of a secret.

But, if you fear this extremely improbable thing, you can just switch
off deduplication. Presto, problem gone.

Fundamentally, some people care about the remote scenarios you propose,
and others do not. To cater for them, there should be a setting that
allows them to toggle the behavior you dislike.

I believe that is not the correct philosophy for an open source
project aiming to provide improved security and privacy for an
audience including less-technical users.

You expect some journalist wanting to hide the fact that they have
certain documents to be aware of duplication-based
content-confirmation attacks, be aware that qubes' backup system is
vulnerable by default, be aware that there is an option to turn that
off, and actually do so before it matters? I do not think this is a
reasonable model for the intended users.

Personally, I'm a great fan of OpenBSD-style decision rationale, where
choices are made to be secure by default and have few knobs to turn.
Consistent application of that philosophy protects users.

@jpouellet

Contributor

jpouellet commented Nov 2, 2016

On Wed, Nov 2, 2016 at 6:43 AM, Rudd-O notifications@github.com wrote:

I don't think that's the case. That merkle root will not decrypt if you
give it the wrong password.

Feel free to correct me with the relevant technical details.

But, if you don't have a trusted way to identify and authenticate the
root of your merkle-tree index thing, (which sounds to be the case

I was assuming things were implemented as described in the paper, with
no additional integrity protection mechanisms wrapping the indexes.

If this is not the case, then I am mistaken.

Content-confirmation attacks associated with
convergent-encryption-based deduplication schemes like this one still
remain problematic though.
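
For readers unfamiliar with the term: in convergent encryption the key for each block is derived from the block's own plaintext, so identical plaintext always yields identical ciphertext. That is exactly what makes cross-VM dedup work, and also what enables content confirmation. A minimal sketch (not any particular tool's implementation):

import hashlib

def convergent_block_key(plaintext):
    # The key depends only on the content, so two parties holding the same
    # block independently derive the same key and the same ciphertext.
    return hashlib.sha256(plaintext).digest()

def block_address(plaintext):
    # Deterministic storage address; seeing a duplicate address confirms
    # that an identical plaintext block was already stored.
    return hashlib.sha256(convergent_block_key(plaintext)).hexdigest()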

@Rudd-O

Rudd-O commented Nov 3, 2016

On 11/02/2016 02:07 PM, Jean-Philippe Ouellet wrote:

Fundamentally, some people care about the remote scenarios you propose,
and others do not. To cater for them, there should be a setting that
allows them to toggle the behavior you dislike.

I believe that is not the correct philosophy for an open source
project aiming to provide improved security and privacy for an
audience including less-technical users.

What's wrong with the setting, if the setting is toggled to the secure
value by default?

Settings are a perfectly valid way of changing how software behaves.

Rudd-O
http://rudd-o.com/
@defuse

defuse commented Nov 3, 2016

@Rudd-O: I was going to reply here about why settings are bad in general, but I felt it was off-topic, so I moved it into a blog post: https://bqp.io/on-software-settings.html (here it's the "second problem" that applies most).

@jpouellet

Contributor

jpouellet commented Nov 3, 2016

Another post that explains things more eloquently than I could: http://www.daemonology.net/blog/2009-09-04-complexity-is-insecurity.html

That post also happens to be from the author of scrypt :)

EDIT: This also applies to my argument against trying to chunk data at intelligent boundaries rather than fixed-size blocks. While the former clearly results in superior deduplication, the latter is more secure.
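
For comparison, the fixed-size alternative is essentially trivial; a sketch:

def fixed_size_chunks(stream, size=64 * 1024):
    # Treats the input as a completely opaque byte stream: no parsing,
    # no format-specific heuristics, no attacker-influenced control flow.
    while True:
        block = stream.read(size)
        if not block:
            break
        yield block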

@defuse

defuse commented Nov 14, 2016

Heads up that I added a feature to crack Qubes v3.1 (and, I think, but have not verified, v3.2) backup encryption (default settings) to CrackStation: https://bqp.io/qubes-backup-encryption-added-to-crackstation.html I hope by doing this I'll attract some more cryptographers in here making sure the next version is rock solid.

In case anyone wants a short-term workaround, what I've done personally is to encrypt my backup disk with LUKS, and then store my Qubes-encrypted backups on there.

@andrewdavidwong

Member

andrewdavidwong commented Dec 11, 2016

Heads up that I added a feature to crack Qubes v3.1 (and think, but have not verified v3.2) backup encryption (default settings) to CrackStation: https://bqp.io/qubes-backup-encryption-added-to-crackstation.html

Yes, this applies to 3.2 also.

I hope by doing this I'll attract some more cryptographers in here making sure the next version is rock solid.

Thanks, @defuse.

Related question: Theoretically, what's the fastest way to brute force a Qubes backup under the current scheme? There are at least two ways to go, right?

  1. Keep trying 32-digit hexadecimal strings (i.e., possible outputs of md5(passphrase)) as aes-256-cbc keys until one works. (Assuming that the passphrase has >= 128-bits of entropy.)
  2. Keep computing openssl dgst -sha512 -hmac '<passphrase>' backup-header until one matches backup-header.hmac.

Is 1 theoretically faster because it entails a smaller keyspace? Or is there another, faster way?

The reason I ask is because this would seem to affect estimates of how strong/weak the current system is (i.e., how long it would take to brute force given different values of passphrase entropy).

@jpouellet

Contributor

jpouellet commented Dec 11, 2016

  1. Keep trying 32-digit hexadecimal strings (i.e., possible outputs of md5(passphrase)) as aes-256-cbc keys until one works. (Assuming that the passphrase has >= 128-bits of entropy.)

No, 128 bits is an unthinkably large state space to try to brute-force. The set of md5(likely_passphrases) is sparse within that space, so you would definitely want to start with likely passphrase candidates and pass those through md5 to generate candidate aes keys, not try to randomly guess a correct 128-bit value (assumed to be so incredibly unlikely that it is why many standards target the 128-bit effective security level).

The larger problem with this approach though is checking whether or not a resulting passphrase is correct. You then need to use the resulting key to try to decrypt the filesystem, and check if the plaintext looks like a filesystem. The HMAC-cracking approach has the benefit that it is immediately obvious whether or not the result is correct.

  2. Keep computing openssl dgst -sha512 -hmac '<passphrase>' backup-header until one matches backup-header.hmac.

Yes. This.

Is 1 theoretically faster because it entails a smaller keyspace? Or is there another, faster way?

No. The question is not about 128 bits vs 256 bits (both are assumed to be sufficient in the crypto literature). The design flaw here is that the HMAC key and the encryption key are derived from exactly the same input, and an HMAC construction allows fast (cheap to attack) checking of whether or not a key is correct (because it is designed for cheap message authentication using shared secrets), whereas a strong KDF is designed to be expensive to compute (and that expense does not need to be paid by an attacker who can bypass it by computing the HMAC for each candidate passphrase instead).
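
To make that asymmetry concrete, the attacker's check is essentially one HMAC-SHA512 over the small backup-header per guess, equivalent to the openssl dgst invocation quoted above (a sketch, not CrackStation's actual code):

import hmac
import hashlib

def try_passphrases(header_bytes, target_hmac_hex, candidates):
    # No salt and no stretching: the cost per candidate passphrase is a single
    # hash pass over a few hundred bytes, which is why GPUs chew through this.
    for passphrase in candidates:
        digest = hmac.new(passphrase.encode("utf-8"), header_bytes, hashlib.sha512).hexdigest()
        if digest == target_hmac_hex:
            return passphrase
    return None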

Disclaimer: I am not a cryptographer, I just like reading and thinking about applied crypto, and occasionally have to crack passwords

@jpouellet

Contributor

jpouellet commented Dec 11, 2016

The reason I ask is because this would seem to affect estimates of how strong/weak the current system is (i.e., how long it would take to brute force given different values of passphrase entropy).

In its current (3.2) state, we essentially have the cheapest KDF possible, so I think the only safe way to use it is to have a passphrase that is already sufficiently high entropy for use directly as a key.
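
For anyone taking that route in the meantime, a passphrase with enough entropy to serve directly as a key can be generated along these lines (illustrative):

import binascii
import os

# 32 random bytes = 256 bits of entropy, hex-encoded so it can be typed and stored as text.
passphrase = binascii.hexlify(os.urandom(32)).decode("ascii")
print(passphrase)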

When I used qvm-backup (to transfer VMs from old laptop to new laptop), I used a key from dd if=/dev/random, encrypted the result asymmetrically to a key generated on the destination machine, signed that with a key generated on the origin machine, performed the backup & transfer, then decrypted the used symmetric key on the destination machine.

I think such an asymmetric scheme is really the way to go for various reasons, but I've yet to put my arguments together and formally make a case for it. tl;dr - it allows doing away with passphrases in favor of a safely-stored offline backup-recovery key (which is not needed at backup-creation time) and a backup-origin signing key. This would enable the creation of encrypted and authenticated backups in a hostile environment where entering a backup passphrase would be problematic.

@jpouellet commented Dec 11, 2016 (Contributor)

tl;dr - allows to do away with passphrases and have a safely-stored offline backup-recovery key

That recovery key could (should) of course be protected with a passphrase itself, but it need not be used (or even known) at backup-creation time.

@andrewdavidwong commented Dec 11, 2016 (Member)
  1. Keep trying 32-digit hexadecimal strings (i.e., possible outputs of md5(passphrase)) as aes-256-cbc keys until one works. (Assuming that the passphrase has >= 128-bits of entropy.)

No, 128 bits is an unthinkably large state space to try to brute-force. The set of md5(likely_passphrases) is sparse within that space, so you would definitely want to start with likely passphrase candidates and pass those through md5 to generate candidate aes keys, not try to randomly guess a correct 128-bit value (assumed to be so incredibly unlikely that it is why many standards target the 128-bit effective security level).

Of course. I didn't say anything about randomly guessing 128-bit values. It goes without saying that you would prioritize the possible outputs of likely passphrases before unlikely ones. (This is, of course, an oversimplification of what sophisticated password crackers actually do.)

The larger problem with this approach though is checking whether or not a resulting passphrase is correct. You then need to use the resulting key to try to decrypt the filesystem, and check if the plaintext looks like a filesystem. The HMAC-cracking approach has the benefit that it is immediately obvious whether or not the result is correct.

That's one thing I was wondering about. I suppose it depends on how much additional time it takes to attempt to decrypt the file with each key.

Is 1 theoretically faster because it entails a smaller keyspace? Or is there another, faster way?

No. The question is not about 128 bits vs 256 bits (both are assumed to be sufficient in the crypto literature).

You're missing the point. The question is which one is theoretically faster. The question is not whether both would take a sufficiently long time.

The design flaw here is that the HMAC and encryption key are derived from exactly the same input, and an HMAC construction allows fast (cheap-to-attack) checking of whether or not a key is correct (because it is designed for cheap message authentication using shared secrets), whereas a strong KDF is designed to be expensive to compute (and that expense does not need to be paid by an attacker who can bypass it by computing the HMAC for each candidate passphrase instead).

I know. You just repeated the issue that I originally reported.

@jpouellet commented Dec 11, 2016 (Contributor)

I know. You just repeated the issue that I originally reported.

Err, oops. Indeed. This thread has been so long I'd forgotten you were the OP. :) My apologies if you feel I've insulted your intelligence by my explanation.

@defuse commented Dec 11, 2016

Related question: Theoretically, what's the fastest way to brute force a Qubes backup under the current scheme? There are at least two ways to go, right?

  1. Keep trying 32-digit hexadecimal strings (i.e., possible outputs of md5(passphrase)) as aes-256-cbc keys until one works. (Assuming that the passphrase has >= 128-bits of entropy.)
  2. Keep computing openssl dgst -sha512 -hmac '<passphrase>' backup-header until one matches backup-header.hmac.

These are about equivalent, if you ignore the difference in time it takes to compute MD5 then AES vs. HMAC-SHA512.

The reason I ask is because this would seem to affect estimates of how strong/weak the current system is (i.e., how long it would take to brute force given different values of passphrase entropy).

The worst problem with the current system is that the password isn't salted. Without a salt, the attacker can perform their brute-force attack once, saving all the (HMAC, password) pairs they compute into a lookup table, then quickly crack any HMAC in that list with a fast (<1 second) lookup. That's a lot cheaper than having to re-run the brute-force attack for each hash (which using a salt ensures). Fixing this changes the asymptotics of the best attack from O(N) precomputation plus O(1) per hash to crack, to no precomputation and O(N) per hash to crack.

The second problem is that we want to use some sort of slow key derivation function on the password, like Argon2, scrypt, bcrypt, etc, to increase the cost of attacks further.
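
For illustration, a minimal sketch (using Python's hashlib.scrypt; the cost parameters and the 64-byte output split are illustrative choices, not a proposal for the actual backup format) of salted, deliberately slow derivation that yields separate encryption and authentication keys:

```python
import os
import hashlib

# Sketch only: a random salt plus a memory-hard KDF, with distinct keys
# for encryption and authentication. The scrypt cost parameters below are
# illustrative, not a recommendation for qvm-backup.
passphrase = b"correct horse battery staple"
salt = os.urandom(16)  # stored alongside the backup; it is not secret

master = hashlib.scrypt(passphrase, salt=salt,
                        n=2**15, r=8, p=1,
                        maxmem=64 * 1024 * 1024, dklen=64)
enc_key, mac_key = master[:32], master[32:]  # AES-256 key and HMAC key
```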

@marmarek commented Dec 11, 2016 (Member)

See #971 (comment). I think it has all been covered already (mostly by using the scrypt utility).

@marmarek commented Jan 12, 2017 (Member)

One issue with the new backup format is that each chunk (currently 100 MB) has its key derived separately, and that takes time (by design). I think the easiest thing to do is to increase the chunk size to, for example, 1 GB, to minimize the impact.
Another solution would be to derive the key once and use it for all the chunks, but that would require more work, because:

  • we cannot use the scrypt tool directly - for example, we would go back to openssl enc/openssl dgst, this time with explicit keys (derived using the scrypt KDF) instead of the very weak built-in KDF - or find some other tool
  • we would need some other way to bind each encrypted chunk to its name (currently done by prefixing the chunk name to the passphrase fed into the scrypt KDF)

And this would also make manual restore harder.
Looking at this long discussion, I don't want to go through it all again...

@andrewdavidwong commented Jan 13, 2017 (Member)

One issue with the new backup format is that each chunk (currently 100 MB) has its key derived separately, and that takes time (by design). I think the easiest thing to do is to increase the chunk size to, for example, 1 GB, to minimize the impact.

Is this "just" a performance issue, or is it also a security issue? How much of a performance impact are we talking about? Why is the current chunk size 100M in the first place?

Other solution would be to derive key once use it for all the chunks, but it will require more work because: [...]

Both of those options sound undesirable. Are there any downsides to just using a 1GB chunk size?

Looking at this long discussion, I don't want to go through it all along again...

Not sure what you mean here. Why would we have to go through it all again?

@marmarek commented Jan 13, 2017 (Member)

This comment has been minimized.

@andrewdavidwong commented Jan 14, 2017 (Member)

In that case, I agree with you. Better to keep it simple and just increase the chunk size to 1GB (or some similarly suitable size).

@ptitdoc commented Feb 16, 2017

Keep in mind that some people want to backup things because their disk is almost full. That was the reason of using chunks in the first place.

I feel that 1 GB chunks are dangerous in such cases. For instance, if I have 5 GB free, the backup should not queue more than 4 chunks of 1 GB.
Then if the backup fails because there is not enough free space, the files may not be properly removed, and the user ends up with a broken system.

With LVM thin provisioning, I usually end up with a read-only device with IO errors because of a full disk, with no opportunity to clear some free space without running a live CD.

@ptitdoc commented Feb 16, 2017

Just to add to the discussion: I worked on an alternative version that still uses openssl enc and dgst while using the Python hashlib key derivation function (a sketch of the key-derivation part follows below):

  • perform key derivation using the Python hashlib PBKDF2 function
  • add a .salt file to the backup to support PBKDF2
  • use openssl enc and dgst with the derived key as the key, without using openssl's password derivation mechanisms
  • add a random .iv file in the backup for each chunk
  • the backup can still be decrypted manually, using a small Python script to perform the PBKDF2 key derivation, reading the IVs, and feeding them to the openssl enc or dgst tool

The problems are:

  • in this case we are relying on the Python random generator os.urandom and the Python implementation of PBKDF2 (based on the documentation, it may use the OpenSSL implementation)
  • it may still be prone to cryptographic attacks because of errors in the handling or sizes of the IVs
  • the raw derived backup key is passed as a parameter to openssl, and is present in many places in memory
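
A minimal sketch of the key-derivation half of that approach (Python's hashlib.pbkdf2_hmac with a stored salt; the file name backup.salt, the iteration count, and the key length are illustrative, not the actual layout):

```python
import os
import hashlib

# Sketch of the PBKDF2 step described above. The salt (and a per-chunk IV)
# would be stored next to the backup; 'backup.salt' and the iteration count
# are illustrative choices only.
passphrase = b"my backup passphrase"
salt = os.urandom(32)
with open("backup.salt", "wb") as f:
    f.write(salt)

key = hashlib.pbkdf2_hmac("sha512", passphrase, salt,
                          iterations=1_000_000, dklen=32)

# The hex form is what would be passed to `openssl enc ... -K <hex> -iv <hex>`
# instead of letting openssl derive a key from the passphrase itself.
print(key.hex())
```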

@marmarek commented Feb 16, 2017 (Member)

Keep in mind that some people want to backup things because their disk is almost full. That was the reason of using chunks in the first place.

Then tmpfs could be used (so RAM + swap). When using 1 GB chunks, the maximum number of them should probably be decreased to 5 or so (currently it's 10). The number of chunks in the queue is not really important as long as it isn't 1; it only needs to be large enough to allow uploading one chunk in parallel while preparing the next one. Maybe a good compromise would be 4x500 MB?

@ptitdoc commented Feb 16, 2017

Yes, most Qubes users probably have at least 8 GB of RAM, and most VMs currently need to be stopped to perform backups. So we can hope that there is 4 GB of free space in dom0, or maybe we can check for that before performing a backup and warn the user about bad performance if the check fails?

@andrewdavidwong commented Feb 16, 2017 (Member)

If the performance impact is significant (e.g., at least several minutes faster per qvm-backup) on systems with sufficient RAM, then I recommend keeping the larger 1GB chunk size. As @ptitdoc pointed out, most Qubes systems will have at least 8GB. The ones that have less will never be running many VMs anyway, so it won't be hard to shut them down for qvm-backup. Meanwhile, the vast majority of users will benefit from performance gains on a task they do (or should be doing) frequently. Across all Qubes users, this seems like it could easily add up to thousands of hours saved over the lifetime of R4.0.

@marmarek commented Feb 16, 2017 (Member)

It's about 1 s for the KDF alone. Let's do some math: assume a 100 GB backup, with VM sizes being multiples of 1 GB. This gives:

  • 1 GB chunks: 1m 40s overhead for KDF
  • 500 MB chunks: 3m 20s overhead for KDF
  • 100 MB chunks: 16m 40s overhead for KDF

Is ~3.5m bad?
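
As a quick sanity check of those figures (a sketch assuming roughly 1 s of KDF work per chunk, as stated above):

```python
# Rough check of the overhead figures above: ~1 s of KDF work per chunk.
backup_gb = 100
for chunk_mb in (1000, 500, 100):
    chunks = backup_gb * 1000 // chunk_mb
    seconds = chunks * 1  # ~1 s per key derivation
    print(f"{chunk_mb} MB chunks: {chunks} chunks, "
          f"{seconds // 60}m {seconds % 60}s of KDF overhead")
```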

@andrewdavidwong commented Feb 16, 2017 (Member)

Suppose a single user makes weekly backups for a year:

(3.5 minutes * 52 weeks) / 60 minutes = ~3 hours saved per year

Suppose a little under 10% of the estimated Qubes userbase does this:

~3 hours * 2,000 users = ~6,000 hours saved per year

So, even with conservative assumptions, we'd be saving users thousands of hours per year, collectively. And the only cost is that users with only 4GB of RAM may have to shut down their VMs before they run qvm-backup, which they'd probably do anyway? Seems worth it to me.

@marmarek commented Feb 16, 2017 (Member)

I don't see why we should look at the collective time of the Qubes userbase; it isn't really an observable value for anyone. But it does make sense to look at the time spent/saved per year.

@grey-olli commented Feb 16, 2017

This comment has been minimized.

@jpouellet commented Feb 16, 2017 (Contributor)

I don't see a fundamental reason why the KDF needs to be recomputed for each chunk... the output of the KDF should be safely cacheable, just as we must cache the plaintext passphrase between chunks today.

Do I understand correctly that the current KDF-recomputation-on-each-chunk is just an artifact of wanting to build the backup system out of guaranteed-to-be-easily-obtainable-in-the-future programs and treat each such program as a black box, and the single scrypt binary being responsible for both computing the KDF and doing the bulk encryption, thus having no clean way to cache the KDF output? Or is there some other reason?
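
One possible shape of that (a sketch, not the current backup format): pay the scrypt cost once per backup, cache the result, and derive cheap per-chunk keys from it while still binding each key to its chunk name, e.g. via HMAC. The chunk names and cost parameters below are illustrative:

```python
import os
import hmac
import hashlib

# Sketch: run the memory-hard KDF once per backup and cache the result,
# then derive per-chunk keys cheaply. The HMAC-based per-chunk step and
# the chunk names are illustrative; this is not the current scrypt-based
# backup format.
passphrase = b"backup passphrase"
salt = os.urandom(16)

# Expensive step, computed once and reused for every chunk.
master_key = hashlib.scrypt(passphrase, salt=salt,
                            n=2**15, r=8, p=1,
                            maxmem=64 * 1024 * 1024, dklen=32)

def chunk_key(name: str) -> bytes:
    # Cheap step, run per chunk; binds the derived key to the chunk's name.
    return hmac.new(master_key, name.encode(), hashlib.sha512).digest()[:32]

k1 = chunk_key("vm1/private.img.000")
k2 = chunk_key("vm1/private.img.001")
assert k1 != k2
```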

@marmarek commented Feb 16, 2017 (Member)

Yes, exactly this one reason alone.
