Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ca-derivations can create malformed NAR serializations #8113

Open
Kiskae opened this issue Mar 27, 2023 · 8 comments
Open

ca-derivations can create malformed NAR serializations #8113

Kiskae opened this issue Mar 27, 2023 · 8 comments
Labels
bug ca-derivations Derivations with content addressed outputs

Comments

@Kiskae
Copy link

Kiskae commented Mar 27, 2023

Describe the bug

Currently the ca-derivations feature implements its hash rewriting as a context-free replace over the NAR serialization of the output, but if any filename in the nar archive includes the hash this can break the lexical order of entries in the nar archive.

Steps To Reproduce

The following derivation creates a package that is very sensitive to changes in entry order:

runCommand "break-ca" {} ''
  mkdir -p $out
  touch $out/$(basename $out)
  touch $out/$(basename $out | head -c 2)
  touch $out/$(basename $out | head -c 2)zzz
  ls -la $out
''

On my machine trying to make this content-addressed results in the following error:

nix store make-content-addressed /nix/store/kpjx0d09414zwc76xyg57b4kx64b7z9b-break-ca
error: ca hash mismatch importing path '/nix/store/kxfp1fqd8xq2iksqk2id6gfvmn867yxl-break-ca';
         specified: sha256:1n1lkhdaaxgf0f35a8if4smh9d8k8ylr4q19zwy9jlbljjqm47hc
         got:       sha256:1ah0w6nnkc8z0r8ph1klv3gd879n0n7f8dbn1s42hj7w0p96q3ns

Expected behavior

I expect this to be broken since the requirements of the NAR serialization conflict with the context-free hash rewriting. One way to solve this is to ignore filenames during hash rewriting, but that will complicate the rewriting process. I'm mostly creating this issue so the behavior can be documented.

nix-env --version output

nix-env --version
nix-env (Nix) 2.13.3

Priorities

Add 👍 to issues you find important.

@Kiskae Kiskae added the bug label Mar 27, 2023
@thufschmitt thufschmitt added the ca-derivations Derivations with content addressed outputs label Mar 29, 2023
@thufschmitt
Copy link
Member

That is quite unfortunate 🤔

I think we could solve that by parsing and rebuilding the NAR during hash rewriting, but I'm a bit afraid of the performance impact for big paths. I guess the best way to know would be to try it

@Kiskae
Copy link
Author

Kiskae commented Mar 29, 2023

That would also make it a lot harder to verify the hash afterwards, since that also relies on the fact that hash-rewriting is done in-place and doesn't touch anything besides those hashes.

@layus
Copy link
Member

layus commented Apr 5, 2023

Just a random idea: The rewrite only happens on an internal, temporary nar serialisation. That nar is never uploaded, or made visible to the users. It's rater (non ca derivation -> nar serialize -> ca-path filter -> narDeserialize -> dump to ca path), right ? So one could add an internal flag to the nar deserializer to accept nars that are "invalid" due to a wrong ordering. Than ca logic could turn it on for it's internal needs, and no-one ever needs to know that from the outside.

@layus
Copy link
Member

layus commented Apr 5, 2023

Oh, but the hash of the content also relies on that ordering, right ? so we actually need a rewrite-aware nar serializer. That is a bit more tricky, but feasible too. Way better than fixing an already composed nar.

@Kiskae
Copy link
Author

Kiskae commented Apr 5, 2023

So one could add an internal flag to the nar deserializer to accept nars that are "invalid" due to a wrong ordering.

The verification scheme for ca-derivations relies on the fact that you can reproduce the 'internal' nar serialization by doing the substitution again with the final hash, so this makes it impossible to verify the output.

Oh, but the hash of the content also relies on that ordering, right

Yes, since the contents are serialized in topological order based on their file path, if the order of entries changes then the file content also gets moved around.

This is pretty similar to the issues ca-derivations have with compressed data, where the build-time hash leaks into the output in a way that cannot be substituted for a stable value. As long as hash substitution is limited to the file contents and symlinks the nar serialization should remain valid.

@layus
Copy link
Member

layus commented Apr 5, 2023

As of now we do

  narStr := narSerialize(path) // (a full, in memory copy)
  narStr := replaceStrings(narStr, substitutions) // (a second copy, replaces the first one)
  newHash := hashModulo(narStr, oldHash, newHash) // (streaming traversal, replace oldHashBy zero, also hash the replaced positions.)
  casPath := writeToStore(newPath, rewriteStringsStream(narStr, {oldHash -> newHash})) // a streaming traversal

This is not particularly efficient, and could be turned into

  // An improved narStream, that replaces strings and keeps the indexes where substitutions of the modulus happened.
  // Then hashes the resulting positions in the final hash.
  // No complete copy in memory, one filesystem traversal
  // Because rewriting happens before dumping to nar, it can happen before directory entries are sorted.
  // keeping track of rewrite positions is however very complex.
  newHash := narHashWithStringsReplaceModulo(path, substitutions, modulus = oldHash)
  
  casPath := writeToStore(newPath, narHashWithStringsReplace(path, substitutions U { oldHash -> newHash })) // a second streaming filesystem traversal.

This new algo has no in-memory copy of the path content. (better for perfs, a bit more risky if the content changes on disk). The narHashWithStringsReplaceModulo is tricky, but not impossible. Mostly rewrite all the names before using them, and all the content. Keeping rewrite indexes is really tricky however.

The bad property is that we still cannot hash a content addressed path, as self-references in path names will break it's nar serialisation. We need to check that paths are indeed content-adressed using

assertEqual(caPath, pathFromHash(narHashWithStringsReplaceModulo(path, {}, modulus = caPath.hashPart)

i.e., there is no way to test the ca-validity of a caPath just from it's nar serialization.

The only other option I see is a narStreamRewriteStream, which would be more generic, but would have to maintain an in-memory representation of the tree encoded in the nar, and stream it with dir entries reordered according to the rewritten strings. That's... maybe the best way to go 🤔

@Ericson2314
Copy link
Member

Ericson2314 commented Apr 17, 2023

That sounds like a good plan to me; would you have time to take a crack at implementing it?

You might also take a look at #4282, which cleans up some of the same code and would make for a nicer starting point (if it passes CI!)

@Kiskae
Copy link
Author

Kiskae commented Sep 21, 2023

I see the linked PR has been merged, but it appears to still be a direct find/replace over the serialized NAR. So this issue should still be present.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug ca-derivations Derivations with content addressed outputs
Projects
None yet
Development

No branches or pull requests

4 participants