ca-derivations can create malformed NAR serializations #8113

Kiskae · 2023-03-27T15:01:46Z

Describe the bug

Currently the ca-derivations feature implements its hash rewriting as a context-free replace over the NAR serialization of the output, but if any filename in the nar archive includes the hash this can break the lexical order of entries in the nar archive.

Steps To Reproduce

The following derivation creates a package that is very sensitive to changes in entry order:

runCommand "break-ca" {} ''
  mkdir -p $out
  touch $out/$(basename $out)
  touch $out/$(basename $out | head -c 2)
  touch $out/$(basename $out | head -c 2)zzz
  ls -la $out
''

On my machine trying to make this content-addressed results in the following error:

❯ nix store make-content-addressed /nix/store/kpjx0d09414zwc76xyg57b4kx64b7z9b-break-ca
error: ca hash mismatch importing path '/nix/store/kxfp1fqd8xq2iksqk2id6gfvmn867yxl-break-ca';
         specified: sha256:1n1lkhdaaxgf0f35a8if4smh9d8k8ylr4q19zwy9jlbljjqm47hc
         got:       sha256:1ah0w6nnkc8z0r8ph1klv3gd879n0n7f8dbn1s42hj7w0p96q3ns

Expected behavior

I expect this to be broken since the requirements of the NAR serialization conflict with the context-free hash rewriting. One way to solve this is to ignore filenames during hash rewriting, but that will complicate the rewriting process. I'm mostly creating this issue so the behavior can be documented.

nix-env --version output

❯ nix-env --version
nix-env (Nix) 2.13.3

Priorities

Add 👍 to issues you find important.

The text was updated successfully, but these errors were encountered:

thufschmitt · 2023-03-29T08:53:08Z

That is quite unfortunate 🤔

I think we could solve that by parsing and rebuilding the NAR during hash rewriting, but I'm a bit afraid of the performance impact for big paths. I guess the best way to know would be to try it

Kiskae · 2023-03-29T10:11:43Z

That would also make it a lot harder to verify the hash afterwards, since that also relies on the fact that hash-rewriting is done in-place and doesn't touch anything besides those hashes.

layus · 2023-04-05T13:19:30Z

Just a random idea: The rewrite only happens on an internal, temporary nar serialisation. That nar is never uploaded, or made visible to the users. It's rater (non ca derivation -> nar serialize -> ca-path filter -> narDeserialize -> dump to ca path), right ? So one could add an internal flag to the nar deserializer to accept nars that are "invalid" due to a wrong ordering. Than ca logic could turn it on for it's internal needs, and no-one ever needs to know that from the outside.

layus · 2023-04-05T13:21:33Z

Oh, but the hash of the content also relies on that ordering, right ? so we actually need a rewrite-aware nar serializer. That is a bit more tricky, but feasible too. Way better than fixing an already composed nar.

Kiskae · 2023-04-05T13:46:01Z

So one could add an internal flag to the nar deserializer to accept nars that are "invalid" due to a wrong ordering.

The verification scheme for ca-derivations relies on the fact that you can reproduce the 'internal' nar serialization by doing the substitution again with the final hash, so this makes it impossible to verify the output.

Oh, but the hash of the content also relies on that ordering, right

Yes, since the contents are serialized in topological order based on their file path, if the order of entries changes then the file content also gets moved around.

This is pretty similar to the issues ca-derivations have with compressed data, where the build-time hash leaks into the output in a way that cannot be substituted for a stable value. As long as hash substitution is limited to the file contents and symlinks the nar serialization should remain valid.

layus · 2023-04-05T19:09:27Z

As of now we do

  narStr := narSerialize(path) // (a full, in memory copy)
  narStr := replaceStrings(narStr, substitutions) // (a second copy, replaces the first one)
  newHash := hashModulo(narStr, oldHash, newHash) // (streaming traversal, replace oldHashBy zero, also hash the replaced positions.)
  casPath := writeToStore(newPath, rewriteStringsStream(narStr, {oldHash -> newHash})) // a streaming traversal

This is not particularly efficient, and could be turned into

  // An improved narStream, that replaces strings and keeps the indexes where substitutions of the modulus happened.
  // Then hashes the resulting positions in the final hash.
  // No complete copy in memory, one filesystem traversal
  // Because rewriting happens before dumping to nar, it can happen before directory entries are sorted.
  // keeping track of rewrite positions is however very complex.
  newHash := narHashWithStringsReplaceModulo(path, substitutions, modulus = oldHash)
  
  casPath := writeToStore(newPath, narHashWithStringsReplace(path, substitutions U { oldHash -> newHash })) // a second streaming filesystem traversal.

This new algo has no in-memory copy of the path content. (better for perfs, a bit more risky if the content changes on disk). The narHashWithStringsReplaceModulo is tricky, but not impossible. Mostly rewrite all the names before using them, and all the content. Keeping rewrite indexes is really tricky however.

The bad property is that we still cannot hash a content addressed path, as self-references in path names will break it's nar serialisation. We need to check that paths are indeed content-adressed using

assertEqual(caPath, pathFromHash(narHashWithStringsReplaceModulo(path, {}, modulus = caPath.hashPart)

i.e., there is no way to test the ca-validity of a caPath just from it's nar serialization.

The only other option I see is a narStreamRewriteStream, which would be more generic, but would have to maintain an in-memory representation of the tree encoded in the nar, and stream it with dir entries reordered according to the rewritten strings. That's... maybe the best way to go 🤔

Ericson2314 · 2023-04-17T23:45:04Z

That sounds like a good plan to me; would you have time to take a crack at implementing it?

You might also take a look at #4282, which cleans up some of the same code and would make for a nicer starting point (if it passes CI!)

Kiskae · 2023-09-21T11:50:42Z

I see the linked PR has been merged, but it appears to still be a direct find/replace over the serialized NAR. So this issue should still be present.

Kiskae added the bug label Mar 27, 2023

thufschmitt added the ca-derivations Derivations with content addressed outputs label Mar 29, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ca-derivations can create malformed NAR serializations #8113

ca-derivations can create malformed NAR serializations #8113

Kiskae commented Mar 27, 2023

thufschmitt commented Mar 29, 2023

Kiskae commented Mar 29, 2023

layus commented Apr 5, 2023

layus commented Apr 5, 2023

Kiskae commented Apr 5, 2023

layus commented Apr 5, 2023

Ericson2314 commented Apr 17, 2023 •

edited

Loading

Kiskae commented Sep 21, 2023

ca-derivations can create malformed NAR serializations #8113

ca-derivations can create malformed NAR serializations #8113

Comments

Kiskae commented Mar 27, 2023

thufschmitt commented Mar 29, 2023

Kiskae commented Mar 29, 2023

layus commented Apr 5, 2023

layus commented Apr 5, 2023

Kiskae commented Apr 5, 2023

layus commented Apr 5, 2023

Ericson2314 commented Apr 17, 2023 • edited Loading

Kiskae commented Sep 21, 2023

Ericson2314 commented Apr 17, 2023 •

edited

Loading