[RFC 0062] Content-addressed paths #62

Merged
merged 34 commits into from Jan 12, 2022
Changes from 27 commits
Commits
6d001f3
CAP RFC: First draft
thufschmitt Sep 19, 2019
435fc42
typo
thufschmitt Dec 11, 2019
7b26144
Apply @grahamc's suggestions
regnat Dec 11, 2019
81099b2
nix code -> Nix expression
thufschmitt Dec 11, 2019
4277386
Break-up the big introduction paragraph
thufschmitt Dec 11, 2019
7af7d2c
Rename to match the PR number
thufschmitt Dec 12, 2019
5fec861
Rename the drv attribute to __contentAddressed
thufschmitt Dec 12, 2019
9edc11f
Mention the GC issue
thufschmitt Jan 8, 2020
5717351
Remove the ambiguity on what an `output` is
thufschmitt Jan 8, 2020
1a844cc
Replace aliases paths by a pathOf mapping
thufschmitt Jan 15, 2020
26ae77e
Move the example after the design description
thufschmitt Jan 15, 2020
bbdca7e
Rephrase the design
thufschmitt Jan 15, 2020
63f3eca
Add shepherd team
thufschmitt Jan 16, 2020
a6d2f38
Rewrite the RFC to account for the RFC meeting comments
thufschmitt Feb 17, 2020
140e093
Add a section about leaking output paths
thufschmitt Feb 17, 2020
288dcb4
Merge remote-tracking branch 'upstream/master' into cas-rfc
Ericson2314 Mar 14, 2020
60e7da3
Merge pull request #5 from Ericson2314/cas-rfc-new-template
regnat Mar 18, 2020
1115a0d
Refine the design summary
thufschmitt Mar 18, 2020
13938de
Rename dependency-addressed into input-addressed
thufschmitt Mar 18, 2020
3a25f7f
minor fixup after comments
thufschmitt Mar 25, 2020
3a18867
Apply suggestions from code review
regnat Jun 19, 2020
fa16e86
Update rfcs/0062-content-addressed-paths.md
Mic92 Oct 22, 2020
94b65bd
Update the terminology to match the in the implementation
thufschmitt Apr 14, 2021
7ed4481
Reword the detailed design presentation
thufschmitt Apr 14, 2021
fb4c61d
Quote some strings in the yaml frontmatter
thufschmitt Apr 14, 2021
841fe3f
Add a design paragraph about the remote caching
thufschmitt Apr 14, 2021
27bd048
Lift the determinism requirement
thufschmitt Apr 14, 2021
1e8fab7
Typo
edolstra May 31, 2021
9772625
Apply suggestions from code review
edolstra May 31, 2021
02ae2b5
Rewrite the RFC
thufschmitt Jun 2, 2021
2d74fed
Make the python samples a bit more pythonic
regnat Jun 2, 2021
168a149
Explicit that unresolved dependencies are eval-time
thufschmitt Jun 2, 2021
427abed
Prettify
thufschmitt Jun 2, 2021
f275669
Make the end-goal an experiment
regnat Dec 10, 2021
339 changes: 339 additions & 0 deletions rfcs/0062-content-addressed-paths.md
@@ -0,0 +1,339 @@
---
feature: Simple content-addressed store paths
start-date: 2019-08-14
author: Théophane Hufschmitt
co-authors: (find a buddy later to help out with the RFC)
shepherd-team: "@layus, @edolstra and @Ericson2314"
shepherd-leader: "@edolstra"
related-issues: (will contain links to implementation PRs)
---

# Summary

[summary]: #summary

Add basic support for content-addressed store paths to Nix.

We plan here to make it possible to mark certain store paths as
content-addressed (CA), while keeping the others input-addressed as they are
now (modulo some mandatory drv rewriting before the build, see below).

By making this opt-in, we can impose arbitrary limitations on the paths that
are allowed to be CA, to avoid some tricky issues that can arise with
content-addressability.

In particular, we restrict ourselves to paths that only include textual
self-references (_e.g._ no self-reference hidden inside a zip file).

That way we don't have to worry about the fact that hash-rewriting is only an
approximation.

We also leave the option to lift these restrictions later.

The implementation of this RFC is already partially integrated into Nix, behind
the `ca-derivations` experimental flag.

# Motivation

[motivation]: #motivation

Having a content-addressed store with Nix (aka the "Intensional store") is a
long-time dream of the community: a design for it already took up a whole
chapter in [Eelco's PhD thesis][nixphd].

This was never done because it represents quite a big change in Nix's model,
with some non-trivial implications (regarding the trust model in
particular).
Even without going all the way down to a fully intensional model, we can
make specific paths content-addressed, which can give some important benefits of
the intensional store at a much lower price. In particular, setting some
critical derivations as content-addressed can lead to some substantial build
cutoffs.

# Detailed design

[design]: #detailed-design

When it comes to computing the output paths of a derivation, the current Nix
model, known as the “input-addressed” model (also sometimes referred to as the
“extensional” model) works (roughly) as follows:

- A Derivation is a data-structure that specifies how to build a package.
  Derivations can refer to other derivations.
- All these derivations have a “hash-modulo” associated to them, which is defined by:
  - Some derivations known as “fixed-output” have a known result (for example
    because they fetch a tarball from the internet, and we assume that this
    tarball will stay immutable).
    These have their output hash manually defined (and this hash will be
    checked against the actual hash of their output when they get built).
  - All the others have a hash that's recursively computed by the following algorithm:
    - If a derivation doesn't depend on any other derivation, then we just hash its representation,
    - Otherwise, we substitute each occurrence of a dependency by its hash modulo and hash the result.
- For each output of a derivation, we compute the associated output path by
  hashing the hash modulo of the derivation and the output name.
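
The recursion above can be sketched in a few lines of Python. This is only a toy model: the `Derivation` class, the `"fixed:"` prefix, the way fields are joined before hashing and the truncated hex digest are all assumptions made for illustration, not the serialisation Nix actually uses.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional, Tuple


def sha256_hex(text: str) -> str:
    return hashlib.sha256(text.encode()).hexdigest()


@dataclass(frozen=True)
class Derivation:
    """Toy derivation (hypothetical structure, not the Nix .drv format)."""
    name: str
    builder: str
    inputs: Tuple["Derivation", ...] = ()
    fixed_output_hash: Optional[str] = None  # only set for fixed-output derivations


def hash_modulo(drv: Derivation) -> str:
    # Fixed-output derivations are identified by their declared output hash,
    # whatever the way they are built.
    if drv.fixed_output_hash is not None:
        return sha256_hex("fixed:" + drv.fixed_output_hash)
    # Otherwise, substitute each dependency by its hash modulo and hash the
    # result. With no dependency this just hashes the derivation itself.
    dep_hashes = sorted(hash_modulo(dep) for dep in drv.inputs)
    return sha256_hex("\n".join([drv.name, drv.builder, *dep_hashes]))


def output_path(drv: Derivation, output: str = "out") -> str:
    # The output path is derived from the hash modulo and the output name.
    digest = sha256_hex(hash_modulo(drv) + "!" + output)[:32]
    return f"/nix/store/{digest}-{drv.name}"
```

In this model, two fixed-output derivations that declare the same output hash have the same hash modulo, so anything depending on either of them gets the same output path.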

This proposal adds a new kind of derivation: “floating content-addressed
derivations”, which are similar to fixed-output derivations in that they are
stored in a content-addressed path, but don't have this output hash specified
ahead of time.

For this to work properly, we need to extend the current build process, as well
as the caching and remote building systems so that they are able to take into
account the specifics of these new derivations.

## Nix-build process

For the sake of clarity, we will refer to the current model (where the
derivations are indexed by their inputs, also sometimes called "extensional") as
the `input-addressed` model.

### Output mappings

For each output `output` of a derivation `drv`, we define

- its **Output Id** `DrvOutput(drv, output)` as the tuple `(hashModulo(drv), output)`.
This id uniquely identifies the output.
We textually represent this as `hashModulo(drv)!output`.
- its **realisation** `Realisation(outputId)` containing
1. The path `path` at which this output is stored (either content-defined or input-defined depending on the type of derivation)
2. An optional set `signatures` of signatures certifying the above

In an input-addressed-only world, the concrete path for a derivation output was a pure function of this output's id that could be computed at eval-time. However, this won't be the case anymore once we allow CA derivations, so we now need to store the results of the `Realisation` function in the Nix database as a new table:

```sql
create table if not exists Realisation (
    drvHash integer not null,    -- hashModulo(drv)
    outputName text not null,    -- name of the output
    outputPath integer not null  -- path at which the output is stored
);
```
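
As a rough illustration of how such a table could be used, here is a sketch with Python's standard `sqlite3` module. The helper names are hypothetical, and the sketch stores the hash and the path as text and adds a primary key for readability; both are assumptions, since the table above declares these columns as integers and has no key.

```python
import sqlite3
from typing import Optional


def open_db() -> sqlite3.Connection:
    db = sqlite3.connect(":memory:")
    # Assumption: text columns and a primary key, unlike the schema above.
    db.execute(
        """
        create table if not exists Realisation (
            drvHash text not null,
            outputName text not null,
            outputPath text not null,
            primary key (drvHash, outputName)
        )
        """
    )
    return db


def register_realisation(db, drv_hash: str, output_name: str, output_path: str) -> None:
    # Record where the output identified by `drv_hash!output_name` lives.
    db.execute(
        "insert or replace into Realisation values (?, ?, ?)",
        (drv_hash, output_name, output_path),
    )


def query_realisation(db, drv_hash: str, output_name: str) -> Optional[str]:
    row = db.execute(
        "select outputPath from Realisation where drvHash = ? and outputName = ?",
        (drv_hash, output_name),
    ).fetchone()
    return row[0] if row else None
```

A lookup for an output that was never registered returns `None`, which is the case where the path cannot be known before building.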

### Building a non-CA derivation

#### Resolved derivations

As is already the case internally in Nix, we define a **basic derivation** as a derivation that doesn't depend on any derivation output (except its own). In other words, a basic derivation is a derivation whose only inputs are either

- Placeholders for its own outputs (from the `placeholder` builtin)
- Existing store paths
Member

Suggested change
- Existing store paths
Existing store paths (including *built* content-addressed resolved derivations' output paths)


For a derivation `drv` whose input derivations have all been realised, we define its **associated resolved derivation** `resolved(drv)` as `drv` in which we replace every input derivation `inDrv` of `drv` by `Realisation(inDrv).path`, and update the output hash accordingly.

`resolved` is (intentionally) not injective: If `drv` and `drv'` only differ because one depends on `dep` and the other on `dep'`, but `dep` and `dep'` are content-addressed and have the same output hash, then `resolved(drv)` and `resolved(drv')` will be equal.
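
A minimal sketch of `resolved`, assuming a derivation is modelled as a plain dictionary. The dictionary layout, the `resolve` name and the store paths are hypothetical and only serve to show why the function is not injective.

```python
def resolve(drv: dict, realisations: dict) -> dict:
    """Replace every input derivation output by its realised store path.

    `realisations` maps an output id such as "dep.drv!out" to a store path.
    """
    realised_paths = [realisations[output_id] for output_id in drv["inputDrvs"]]
    return {
        "name": drv["name"],
        "builder": drv["builder"],
        # A resolved derivation no longer depends on any derivation output...
        "inputDrvs": [],
        # ...only on existing store paths.
        "inputSrcs": sorted(set(drv["inputSrcs"]) | set(realised_paths)),
    }


# `dep` and `dep'` are content-addressed and produced the same output.
realisations = {
    "dep.drv!out": "/nix/store/aaaa-dep",
    "dep'.drv!out": "/nix/store/aaaa-dep",
}
drv = {"name": "pkg", "builder": "make", "inputDrvs": ["dep.drv!out"], "inputSrcs": []}
drv_prime = {"name": "pkg", "builder": "make", "inputDrvs": ["dep'.drv!out"], "inputSrcs": []}
```

Here `drv` and `drv_prime` differ, but `resolve` maps them to the same resolved derivation.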

#### Build process

When asked to build a derivation `drv`, we instead:

1. Compute `resolved(drv)`.
2. Substitute and build `resolved(drv)` like a normal derivation.
   Possibly this is a no-op because it may be that `resolved(drv)` has already been built.
3. Add a new mapping `Realisation(drv!${output}) == ${output}(resolved(drv))` for each output `output` of `drv` (signing the mapping if need be).

### Building a CA derivation

A **CA derivation** is a derivation with the `__contentAddressed` argument set
to `true` and the `outputHashAlgo` set to a value that is a valid hash name
recognized by Nix (see the description for `outputHashAlgo` at
<https://nixos.org/nix/manual/#sec-advanced-attributes> for the current allowed
values).

The process for building a content-addressed derivation `drv` is the following:

- We build it like a normal derivation (see above).
For each output `$outputId` of the derivation, this gives us a (temporary) output path `$out`.
- We compute a cryptographic hash `$chash` of `$out`[^modulo-hashing]
- We move `$out` to `/nix/store/$chash-$name`
- We store the mapping `Realisation($outputId) == "/nix/store/$chash-$name"`

[^modulo-hashing]:
    We can possibly normalize all the self-references before
    computing the hash and rewrite them when moving the path to handle paths with
    self-references, but this isn't strictly required for a first iteration.
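
The hash-then-move step can be sketched as follows. This is a simplification under stated assumptions: the output is a single file, the hash is a truncated hex SHA-256 of its raw bytes, and `content_address` is a hypothetical helper. It ignores self-references and does not reproduce how Nix serialises or names store paths.

```python
import hashlib
import os
import shutil
import tempfile


def content_address(tmp_out: str, store_dir: str, name: str) -> str:
    """Move a freshly built output to a path derived from its content."""
    with open(tmp_out, "rb") as f:
        chash = hashlib.sha256(f.read()).hexdigest()[:32]
    final_path = os.path.join(store_dir, f"{chash}-{name}")
    if os.path.exists(final_path):
        # Same content is already in the store: discard the new copy.
        os.remove(tmp_out)
    else:
        shutil.move(tmp_out, final_path)
    return final_path


def demo() -> tuple:
    store_dir = tempfile.mkdtemp()
    paths = []
    for tmp_name in ("tmp-build-1", "tmp-build-2"):
        tmp_out = os.path.join(store_dir, tmp_name)
        with open(tmp_out, "w") as f:
            f.write("same build result")
        paths.append(content_address(tmp_out, store_dir, "hello"))
    return store_dir, paths
```

Two builds that produce the same bytes in different temporary locations end up at the same final path, which is what makes the later build cutoff possible.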

### Example

In this example, we have the following Nix expression:

```nix
rec {
  contentAddressed = mkDerivation {
    name = "contentAddressed";
    __contentAddressed = true;
    … # Some extra arguments
  };
  dependent = mkDerivation {
    name = "dependent";
    buildInputs = [ contentAddressed ];
    … # Some extra arguments
  };
  transitivelyDependent = mkDerivation {
    name = "transitivelyDependent";
    buildInputs = [ dependent ];
    … # Some extra arguments
  };
}
```

Suppose that we want to build `transitivelyDependent`.
What will happen is the following:

1. We instantiate the Nix expression. This gives us three derivations:
   `contentAddressed.drv`, `dependent.drv` and `transitivelyDependent.drv`.
2. We build `contentAddressed.drv`.
   - We first compute `resolved(contentAddressed.drv)`.
   - We realise `resolved(contentAddressed.drv)`. This gives us an output path
     `out(resolved(contentAddressed.drv))`.
   - We move `out(resolved(contentAddressed.drv))` to its content-addressed path
     `ca(contentAddressed.drv)`, which derives from
     `sha256(out(resolved(contentAddressed.drv)))`.
   - We register in the db that `Realisation(contentAddressed.drv!out) == { .path = ca(contentAddressed.drv) }`.
3. We build `dependent.drv`.
   - We first compute `resolved(dependent.drv)`.
     This gives us a new derivation identical to `dependent.drv`, except that `contentAddressed.drv!out` is replaced by `Realisation(contentAddressed.drv!out).path == ca(contentAddressed.drv)`.
   - We realise `resolved(dependent.drv)`. This gives us an output path
     `out(resolved(dependent.drv))`.
   - We register in the db that `Realisation(dependent.drv!out) == { .path = out(resolved(dependent.drv)) }`.
4. We build `transitivelyDependent.drv`.
   - We first compute `resolved(transitivelyDependent.drv)`.
     This gives us a new derivation identical to `transitivelyDependent.drv`, except that `dependent.drv!out` is replaced by `Realisation(dependent.drv!out).path == out(resolved(dependent.drv))`.
   - We realise `resolved(transitivelyDependent.drv)`. This gives us an output path `out(resolved(transitivelyDependent.drv))`.
   - We register in the db that `Realisation(transitivelyDependent.drv!out) == { .path = out(resolved(transitivelyDependent.drv)) }`.

Now suppose that we replace `contentAddressed` by `contentAddressed'`, which evaluates to a new derivation `contentAddressed'.drv` such that the output of `contentAddressed'.drv` is the same as the output of `contentAddressed.drv` (say we change a comment in a source file of `contentAddressed`).
We try to rebuild the new `transitivelyDependent`. What happens is the following:

1. We instantiate the Nix expression. This gives us three new derivations:
   `contentAddressed'.drv`, `dependent'.drv` and `transitivelyDependent'.drv`.
2. We build `contentAddressed'.drv`.
   - We first compute `resolved(contentAddressed'.drv)`.
   - We realise `resolved(contentAddressed'.drv)`. This gives us an output path `out(resolved(contentAddressed'.drv))`.
   - We compute `ca(contentAddressed'.drv)` and notice that the path already exists (since it's the same as the one we built previously), so we discard the result.
   - We register in the db that `Realisation(contentAddressed'.drv!out) == { .path = ca(contentAddressed'.drv) }` (which also equals `Realisation(contentAddressed.drv!out)`).
3. We build `dependent'.drv`.
   - We first compute `resolved(dependent'.drv)`.
     This gives us a new derivation identical to `dependent'.drv`, except that `contentAddressed'.drv!out` is replaced by `Realisation(contentAddressed'.drv!out).path == ca(contentAddressed'.drv)`.
   - We notice that `resolved(dependent'.drv) == resolved(dependent.drv)` (since `ca(contentAddressed'.drv) == ca(contentAddressed.drv)`), so we just return the already existing path.
4. We build `transitivelyDependent'.drv`.
   - We first compute `resolved(transitivelyDependent'.drv)`.
   - Here again, we notice that `resolved(transitivelyDependent'.drv)` is the same as `resolved(transitivelyDependent.drv)`, so we don't build anything.
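
The two walkthroughs above can be replayed with a toy in-memory store. Everything in this sketch is an assumption made for illustration: the `ToyStore` class, the way a resolved derivation is keyed, the recipe strings and the fake path format. It only shows where the build cutoff comes from.

```python
import hashlib
from typing import Optional


def digest(*parts: str) -> str:
    return hashlib.sha256("\0".join(parts).encode()).hexdigest()[:12]


class ToyStore:
    def __init__(self):
        self.realisations = {}      # derivation id -> realised output path
        self.resolved_outputs = {}  # key of resolved(drv) -> output path
        self.builds = []            # derivations that really got built

    def realise(self, drv_id: str, name: str, recipe: str, deps=(),
                ca_content: Optional[str] = None) -> str:
        # Resolve: replace each dependency by its realised path.
        inputs = tuple(self.realisations[dep] for dep in deps)
        key = digest(name, recipe, *inputs)
        if key not in self.resolved_outputs:
            self.builds.append(drv_id)
            if ca_content is not None:
                # CA derivation: the path derives from the build result.
                path = f"/nix/store/{digest(ca_content)}-{name}"
            else:
                # Input-addressed: the path derives from resolved(drv).
                path = f"/nix/store/{key}-{name}"
            self.resolved_outputs[key] = path
        self.realisations[drv_id] = self.resolved_outputs[key]
        return self.realisations[drv_id]


store = ToyStore()
store.realise("contentAddressed.drv", "contentAddressed", "cc main.c  # v1",
              ca_content="binary")
store.realise("dependent.drv", "dependent", "make", deps=["contentAddressed.drv"])
store.realise("transitivelyDependent.drv", "transitivelyDependent", "make",
              deps=["dependent.drv"])

# Only a comment changes: the derivation differs, the build result does not.
store.realise("contentAddressed'.drv", "contentAddressed", "cc main.c  # v2",
              ca_content="binary")
store.realise("dependent'.drv", "dependent", "make", deps=["contentAddressed'.drv"])
store.realise("transitivelyDependent'.drv", "transitivelyDependent", "make",
              deps=["dependent'.drv"])
```

After the second round, `store.builds` contains the three original derivations plus `contentAddressed'.drv` only: the two dependents resolve to derivations that were already built.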

## Remote caching

A consequence of this change is that a store path is now just a meaningless
blob of data if it doesn't have its associated `realisation` metadata.
Besides, Nix can't know the output path of a content-addressed derivation
before building it anymore, so it can't ask the remote store for it.

As a consequence, the remote cache protocol is extended to not simply
work on store paths, but rather at the realisation level:

- The store interface now specifies a new method
```
queryRealisation : DrvOutput -> Maybe Realisation
```
- The substitution loop in Nix first calls this method to ask the remote for the

Suggested change
- The substitution loop in Nix fist calls this method to ask the remote for the
- The substitution loop in Nix first calls this method to ask the remote for the

realisation of the current derivation output.
If this first call succeeds, then it fetches the corresponding output path
like before. Then, it registers the realisation in the database.
- The binary caches now have a new toplevel folder `/realisations` storing
these realisations
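
The substitution flow described above might look like the following sketch. The `ToyRemoteStore` class, its `fetch_path` method and the dictionaries standing in for the local store are hypothetical; only `queryRealisation` comes from the interface described in this section.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Realisation:
    path: str
    signatures: tuple = ()


class ToyRemoteStore:
    def __init__(self, realisations: Dict[str, Realisation], paths: Dict[str, bytes]):
        self._realisations = realisations  # "drvHash!output" -> Realisation
        self._paths = paths                # store path -> content

    def query_realisation(self, drv_output: str) -> Optional[Realisation]:
        return self._realisations.get(drv_output)

    def fetch_path(self, path: str) -> bytes:
        return self._paths[path]


def substitute(drv_output: str, remote: ToyRemoteStore,
               local_paths: Dict[str, bytes],
               local_realisations: Dict[str, Realisation]) -> bool:
    # First ask the remote for the realisation of this derivation output.
    realisation = remote.query_realisation(drv_output)
    if realisation is None:
        return False  # nothing to substitute, the caller has to build
    # Then fetch the corresponding output path like before.
    if realisation.path not in local_paths:
        local_paths[realisation.path] = remote.fetch_path(realisation.path)
    # Finally register the realisation locally.
    local_realisations[drv_output] = realisation
    return True
```

The caller only learns the output path through the realisation, which is why the lookup has to happen before any path is fetched.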

# Drawbacks

[drawbacks]: #drawbacks

- Obviously, this makes the Nix model more complicated than it currently is. In
particular, the caching model needs some modifications (see [caching]);

- We specify that only a sub-category of derivations can safely be marked as
`contentAddressed`, but there's no way to enforce these restrictions;
Member

It's a bigger problem than it might look like, as it means that trivial updates can break the CA marking for reasons not worth mentioning in the upstream changelog.

Member Author

Definitely :)

Maybe that could be clearly stated, but the original scope of this work was to be able to mark very specific derivations that were clearly guaranteed to be deterministic, in which case the problem was less important

Member

Well, the question «why not just propagate CA» shows that writing more is a good idea.

I do think that stressing the limitations in a few key places is also a nice thing to do (people should be able to apply RFC as passed, not what was intended and not what was discussed, after all… we should not treat ourselves worse than we treat computers!)


- This will probably be a breaking change for some tooling since the output path
that's stored in the `.drv` files doesn't correspond to an actual on-disk
path.

# Alternatives

[alternatives]: #alternatives

[RFC 0017][] is another proposal with the
same end-goal. The big difference between these two is in the scope they cover:
RFC 0017 is about fundamentally changing the base model of Nix, while this
proposal suggests making only the minimal amount of changes to the current
model to allow the content-addressed model to live in parallel (which would open
the way to a fully content-addressed store as in RFC 0017, but in a much more
incremental way).

Eventually this RFC should be subsumed by RFC 0017.

# Unresolved questions

[unresolved]: #unresolved-questions
Member

Should we have functionality that allows to build a CA package twice with different apparent output paths, and optionally with different parallelism settings? The build of the package obviously fails if the CA unification doesn't lead to the same result.

Should we mandate that Hydra uses this functionality? Should it be on by default?

Member

Per https://github.com/NixOS/rfcs/pull/62/files#r357243841 I think we can deal with non-deterministic derivation just fine.

Member

Well, for binary cache transparency it is much better if you can build something locally, then regain connectivity and fetch stuff from a cache, then fetch stuff from a different cache, then build some more locally, etc.

Member

So in my mind that's equally risky with and without content addressable derivations. The only difference is one lets you know if something goes wrong, and one doesn't.

Member

No. Build nondeterminism doesn't introduce significant behaviour changes, so as long as the expectations are not broken (yeah, we install you into this output path and your dependencies into those paths, and that is not going to change), it will be mostly usable. There are a few CPU-dependent optimisations from time to time, they are annoying.

With CA things are actually moved around, so even though everything would still work when assembled together, the assembling part will be failing. It is Nix, not the code that is built by Nix, that would fail to do things because of nondeterminism.

Member

I think trying to keep going despite non-determinism incoherence is a misfeature. You can always evict your own CA mappings (can keep the builds themselves for easy "rollback") and align with cache.reflex-frp.org and keep going.

Member

I think if things are marked CA, of course it is a good idea to catch failures. But what you propose will not catch much, because a typical derivation is only built once (ever) by Hydra, later Hydra will use the binary cache. Also my proposal includes feeding different «apparent» output paths to the same build with the same dependencies, which has a better chance of discovering compressed self-references.

Member

I am OK with running --check and trying different temporary output paths. Catching non-determinism I don't think is important, because it's really clashes that we care about. However, catching self-references is important as we have to be able to move the thing.

Member

Well, varying the output paths is something that doesn't follow from anything Nix does, so it has to be spelled explicitly.


## Caching of non-deterministic paths

[caching]: #caching

A big question is about mixing remote-caching and non-determinism.
As [Eelco's PhD thesis][nixphd] states, caching CA paths raises a number of
questions when building that path is non-deterministic (because two different
stores can have two different outputs for the same path, which might lead to
some dependencies being duplicated in the closure of a dependency).

The current implementation has a naive approach that just forbids fetching a
path if the local system has a different realisation for the same drv output.
This approach is simple and correct, but it's possible that it might not be
good enough in practice as it can result in a totally useless binary cache in
some pathological cases.

There exist some better solutions to this problem (including one presented in
Eelco's thesis), but they are much more complex, so it's probably not worth
investing in them until we're sure that they are needed.
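
The naive rule can be written as a one-line check. This is a sketch, assuming local realisations are kept in a dictionary keyed by the textual output id; the `may_substitute` name is hypothetical.

```python
def may_substitute(drv_output: str, remote_path: str, local_realisations: dict) -> bool:
    """Refuse a remote path if we already hold a different realisation."""
    local_path = local_realisations.get(drv_output)
    return local_path is None or local_path == remote_path


local_realisations = {"abc!out": "/nix/store/aaaa-hello"}
```

With this local state, a remote offering `/nix/store/bbbb-hello` for `abc!out` is rejected, and so is everything in that cache that depends on it, which is how a cache can become useless in pathological cases.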

## Garbage collection

Another major open issue is garbage collection of the realisations table. It's
not clear when entries should be deleted. The paths in the domain are "fake" so
we can't use them for expiration. The paths in the codomain could be used (i.e.
if a path is GC'ed, we delete the alias entries that map to it) but it's not
clear whether that's desirable since you may want to bring back the path via
substitution in the future.

## Ensuring that no temporary output path leaks in the result

One possible issue with the CA model is that the output paths get moved after
being built, which breaks self-references. Hash rewriting solves this in most
cases, but it is only a heuristic and there is no way to truly ensure that we
don't leak a self-reference (for example, if a self-reference appears in a
zipped file, as is often the case for man pages or Java jars, the
hash-rewriting machinery won't detect it). Leaking self-references is
annoying since:

- These self-references change each time the inputs of the derivation change,
  making CA useless (because the output will _always_ change when the inputs
  change)
- More annoyingly, these references become dangling and can cause runtime
  failures

We do, however, have a way to detect these: if we have leaking self-references, then
the output will change if we artificially change its output path. This could be
integrated in the `--check` option of `nix-store`.
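
A sketch of that detection idea, under simplifying assumptions: a "build" is a Python function from a temporary output path to the bytes it produces, and textual self-references are normalized with a plain byte replacement. The two builders and the temporary paths are made up for the example and do not reflect how `nix-store --check` works.

```python
import zlib


def plain_builder(out_path: str) -> bytes:
    # Embeds its own output path as plain text.
    return f"prefix={out_path}/bin\n".encode()


def zipped_builder(out_path: str) -> bytes:
    # Embeds its own output path inside compressed data.
    return zlib.compress(plain_builder(out_path))


def leaks_self_reference(builder, path_a: str, path_b: str) -> bool:
    """Build at two different paths and compare the normalized results."""
    out_a = builder(path_a).replace(path_a.encode(), b"<self>")
    out_b = builder(path_b).replace(path_b.encode(), b"<self>")
    # If the results still differ, the output path leaked in a form that
    # textual rewriting cannot see.
    return out_a != out_b


PATH_A = "/nix/store/tmp-aaaa-hello"
PATH_B = "/nix/store/tmp-bbbb-hello"
```

The plain builder passes the check because its only self-reference is textual, while the compressed one is flagged.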

# Future work

[future]: #future-work

This RFC tries as much as possible to provide a solid foundation for building
CA paths with Nix, leaving as much room as possible for future extensions.
Contributor

Suggested change
ca paths with Nix, leaving as much room as possible for future extensions.
CA paths with Nix, leaving as much room as possible for future extensions.

In particular:

- Consolidate the caching model to make it more efficient in presence of
non-deterministic derivations
- (hopefully, one day) make the CA model the default one in Nix
- Investigate the consequences in terms of privilege requirements
- Build a trust model on top of the content-addressed model to share store paths
Member

reference the reserved truster field from here


[rfc 0017]: https://github.com/NixOS/rfcs/pull/17
[nixphd]: https://nixos.org/~eelco/pubs/phd-thesis.pdf