From 6d001f364f890416200e0e1642c738c3d6f37b7b Mon Sep 17 00:00:00 2001 From: regnat Date: Thu, 19 Sep 2019 10:43:37 +0200 Subject: [PATCH 01/32] CAP RFC: First draft --- rfcs/0060-content-addressed-paths.md | 281 +++++++++++++++++++++++++++ 1 file changed, 281 insertions(+) create mode 100644 rfcs/0060-content-addressed-paths.md diff --git a/rfcs/0060-content-addressed-paths.md b/rfcs/0060-content-addressed-paths.md new file mode 100644 index 000000000..92402243f --- /dev/null +++ b/rfcs/0060-content-addressed-paths.md @@ -0,0 +1,281 @@ +--- +feature: Simple content-adressed store paths +start-date: 2019-08-14 +author: Théophane Hufschmitt +co-authors: (find a buddy later to help our with the RFC) +shepherd-team: (names, to be nominated and accepted by RFC steering committee) +shepherd-leader: (name to be appointed by RFC steering committee) +related-issues: (will contain links to implementation PRs) +--- + +# Summary + +[summary]: #summary + +Add some basic but simple support for content-adressed store paths to Nix. + +We plan here to give the possibility to mark certain store paths as +content-adressed (ca), while keeping the other dependency-adressed as they are +now (modulo some mandatory drv rewriting before the build, see below) + +By making this opt-in, we can impose arbitrary limitations to the paths that +are allowed to be ca to avoid some tricky issues that can arise with +content-adressability. +In particular, we restrict ourselves to paths without any non-textual +self-reference (_i.e_ a self-reference hidden inside a zip file) and known to +be deterministic (for caching reasons, see [#caching]). +That way we don't have to worry about the fact that hash-rewriting is only an +approximation nor by the semantics of the distribution of non-deterministic +paths, **but** we also leave the option to lift these restrictions later. + +This RFC already has a (somewhat working) POC at +. 
+ +# Motivation + +[motivation]: #motivation + +Having a content-adressed store with Nix (aka the "Intensional store") is a +long-time dream of the community − a design for that was already taking a whole +chapter in [Eelco's PHD thesis][nixphd]. + +This was never done because it represents a quite big change in Nix's model, +with some non-totally-solved implications (regarding the trust model in +particular). +Even without going all the way down to a fully intensional model (yet), we can +make certain paths content-adressed, which can give some important benefits of +the intensional store at a much lower price. In particular, setting some +critical derivations as content-adressed can lead to some substancial build +cutoffs. + +# Detailed design + +[design]: #detailed-design + +In all that follows, we pretend that each derivation has only one output. +This doesn't change the reasoning but makes things easier to state. + +The gist of the design is that + +- Some derivations can be marked as content-adressed (ca), in which case their + output will be moved to a path `ca` determined only by its content after the + build +- Each (non content-adressed) derivation will have two outputs: A `static` one + computed at evaluation time and a `dynamic` one computed from the dynamic + outputs of its dependencies. 
These outputs may be identical if the derivation + doesn't (transitively) depend on any ca derivation +- just prior to being realized, each derivation gets rewritten by replacing + each of its dependencies by its `dynamic` or `ca` path + +## Example + +Since the design is non trivial, better start with an example to give an +intuition of what's happening: + +In this example, we have the following nix code: + +```nix +rec { + contentAdressed = mkDerivation { + name = "contentAdressed"; + contentAdressed = true; + … # Some extra arguments + }; + dependent = mkDerivation { + name = "dependent"; + buildInputs = [ contentAdressed ]; + … # Some extra arguments + }; + transitivelyDependent = mkDerivation { + name = "transitivelyDependent"; + buildInputs = [ dependent ]; + … # Some extra arguments + }; +} +``` + +Suppose that we want to build `transitivelyDependent`. +What will happen is the following + +- We instantiate the nix code, this gives us three drv files: + `contentAdressed.drv`, `dependent.drv` and `transitivelyDependent.drv` +- We build `contentAdressed.drv`. + - We first compute `dynamic(contentAdressed.drv)` to replace its + inputs by their real output path. Since there is none, we + have here `dynamic(contentAdressed.drv) == contentAdressed.drv` + - We realise `dynamic(contentAdressed.drv)`. This gives us an output path + `out(dynamic(contentAdressed.drv))` + - We move `out(dynamic(contentAdressed.drv))` to its content-adressed path + `ca(contentAdressed.drv)` which derives from + `sha256(out(dynamic(contentAdressed.drv)))` +- We build `dependent.drv` + - We first compute `dynamic(dependent.drv)` to replace its + inputs by their real output path. + In that case, we replace `contentAdressed.drv!out` by + `ca(contentAdressed.drv)` + - We realise `dynamic(dependent.drv)`. 
This gives us an output path + `out(dynamic(dependent.drv))` +- We build `transitivelyDependent.drv` + - We first compute `dynamic(transitivelyDependent.drv)` to replace its + inputs by their real output path. + In that case, that means replacing `dependent.drv!out` by + `out(dynamic(dependent.drv))` + - We realise `dynamic(transitivelyDependent.drv)`. This gives us an output path + `out(dynamic(transitivelyDependent.drv))` + +Now suppose that we slightly change the definition of `contentAdressed` in such +a way that `contentAdressed.drv` will be modified, but its output will be the +same. We try to rebuild the new `transitivelyDependent`. What happens is the +following: + +- We instantiate the nix code, this gives us three new drv files: + `contentAdressed.drv`, `dependent.drv` and `transitivelyDependent.drv` +- We build `contentAdressed.drv`. + - We first compute `dynamic(contentAdressed.drv)` to replace its + inputs by their real output path. Since there is none, we + have here `dynamic(contentAdressed.drv) == contentAdressed.drv` + - We realise `dynamic(contentAdressed.drv)`. This gives us an output path + `out(dynamic(contentAdressed.drv))` + - We compute `ca(contentAdressed.drv)` and notice that the + path already exists (since it's the same as the one we built previously), + so we discard the result. +- We build `dependent.drv` + - We first compute `dynamic(dependent.drv)` to replace its + inputs by their real output path. + In that case, we replace `contentAdressed.drv!out` by + `ca(contentAdressed.drv)` + - We notice that `dynamic(dependent.drv)` is the same as before (since + `ca(contentAdressed.drv)` is the same as before), so we + just return the already existing path +- We build `transitivelyDependent.drv` + - We first compute `dynamic(transitivelyDependent.drv)` to replace its + inputs by their real output path. 
+ In that case, that means replacing `dependent.drv!out` by + `out(dynamic(dependent.drv))` + - Here again, we notice that `dynamic(transitivelyDependent.drv)` is the same as before, + so we don't build anything + +## nix-build process + +### Aliases paths + +To allow this, we add a new type of store path: aliases paths. +These paths don't actually exist in the store, just in the database and point to +another path (so they are morally symlinks, but inside the db rather than +on-disk) + +### Building a ca derivation + +ca derivations are derivations with the `contentAdressed` argument set to +`true`. + +The process for building a content-adressed derivation is the following: + +- We build it like a normal derivation to get an output path `$out`. +- We compute a cryptographic hash `$chash` of `$out`[^modulo-hashing] +- We move `$out` to `/nix/store/$chash-$name` +- We create an alias path from `$out` to `/nix/store/$chash-$name` + +[^modulo-hashing]: + + We can possibly normalize all the self-references before + computing the hash and rewrite them when moving the path to handle paths with + self-references, but this isn't strictly required for a first iteration + +### Building a normal derivation + +The process for building a normal derivation is the following: + +- We look into the drv for all the inputs paths of the build +- For each input path, we look whether the path is an alias. If so we replace it + by its target +- We compute the `dynamic` output of the derivation from the patched version +- We then try to substitute and build the new derivation +- We create an alias path from the `static` output to the `dynamic` one + +## Wrapping it up + +# Drawbacks + +[drawbacks]: #drawbacks + +- Obviously, this makes the Nix model more complicated than what it is now. 
In + particular, the caching model needs some modifications (see [caching]); + +- We specify that only a sub-category of derivations can safely be marked as + `contentAdressed`, but there's no way to enforce these restricitions; + +- This will probably be a breaking-change for some tooling since the output path + that's stored in the `.drv` files doesn't correspond to the actual on-disk + path the output will be stored in (because it might just be an alias for the + other path) + +# Alternatives + +[alternatives]: #alternatives + +[RFC 0017][] is another proposal with the +same end-goal. The big difference between these two is in the scope they cover: +RFC 0017 is about fundamentally changing the base model of Nix, while this +proposal suggests to make only the minimal amount of changes to the current +model to allow the content-adressed model to live in parallel (which would open +the way to a fully content-adressed store as RFC0017, but in a much more +incremental way). + +Eventually this RFC should be subsumed by RFC0017. + +# Unresolved questions + +[unresolved]: #unresolved-questions + +## Caching + +[caching]: #caching + +The big unresolved question is about the caching of content-adressed paths. +As [Eelco's phd thesis][nixphd] states it, caching ca paths raises a number of +questions when building that path is non-deterministic (because two different +stores can have two different outputs for the same path, which might lead to +some dependencies being duplicated in the closure of a dependency). +There exist some solutions to this problem (including one presented in Eelco's +thesis), but for the sake of simplicity, this RFC simply forbids to mark a +derivation as ca if its build is not deterministic (although there's no real +way to check that so it's up to the author of the derivation to ensure that it +is the case). + +## Client support + +The bulk of the job here is done by the nix daemon. 
+ +Depending on the details of the current Nix implementation, there might or +might not be a need for the client to also support it (which would require the +daemon and the client to be updated in synchronously) + +## Old Nix versions and caching + +What happens (and should happen) if a nix not supporting the cas model queries +a cache with cas paths in it is not clear yet. + +In particular, the content (and the existence) of the physical path of the +static derivation isn't decided. A backwards-compatible choice would be to make +this a symlink to the dynamic path, but this is also very leaky and potentially +unsound. + +# Future work + +[future]: #future-work + +This RFC tries as much as possible to provide a solid foundation for building +ca paths with Nix, leaving as much room as possible for future extensions. +In particular: + +- Add some path-rewriting to allow derivations with self-references to be built + as ca +- Consolidate the caching model to allow non-deterministic derivations to be + built as ca +- (hopefully, one day) make the CA model the default one in Nix +- Investigate the consequences in term of privileges requirements +- Build a trust model on top of the content-adressed model to share store paths + +[rfc 0017]: https://github.com/NixOS/rfcs/pull/17 +[nixphd]: https://nixos.org/~eelco/pubs/phd-thesis.pdf From 435fc425054fdcb1165c89fa3906941e64f85d48 Mon Sep 17 00:00:00 2001 From: regnat Date: Wed, 11 Dec 2019 17:03:29 +0100 Subject: [PATCH 02/32] typo --- rfcs/0060-content-addressed-paths.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/rfcs/0060-content-addressed-paths.md b/rfcs/0060-content-addressed-paths.md index 92402243f..fd0e3e6f6 100644 --- a/rfcs/0060-content-addressed-paths.md +++ b/rfcs/0060-content-addressed-paths.md @@ -23,7 +23,7 @@ are allowed to be ca to avoid some tricky issues that can arise with content-adressability. 
In particular, we restrict ourselves to paths without any non-textual self-reference (_i.e_ a self-reference hidden inside a zip file) and known to -be deterministic (for caching reasons, see [#caching]). +be deterministic (for caching reasons, see [caching]). That way we don't have to worry about the fact that hash-rewriting is only an approximation nor by the semantics of the distribution of non-deterministic paths, **but** we also leave the option to lift these restrictions later. From 7b261448e18cef5abd279741cc1500b15283e063 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Th=C3=A9ophane=20Hufschmitt?= Date: Wed, 11 Dec 2019 17:32:27 +0100 Subject: [PATCH 03/32] Apply @grahamc's suggestions Co-Authored-By: Graham Christensen --- rfcs/0060-content-addressed-paths.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/rfcs/0060-content-addressed-paths.md b/rfcs/0060-content-addressed-paths.md index fd0e3e6f6..efda5213d 100644 --- a/rfcs/0060-content-addressed-paths.md +++ b/rfcs/0060-content-addressed-paths.md @@ -29,7 +29,7 @@ approximation nor by the semantics of the distribution of non-deterministic paths, **but** we also leave the option to lift these restrictions later. This RFC already has a (somewhat working) POC at -. +. # Motivation @@ -42,8 +42,8 @@ chapter in [Eelco's PHD thesis][nixphd]. This was never done because it represents a quite big change in Nix's model, with some non-totally-solved implications (regarding the trust model in particular). -Even without going all the way down to a fully intensional model (yet), we can -make certain paths content-adressed, which can give some important benefits of +Even without going all the way down to a fully intensional model, we can +make specific paths content-adressed, which can give some important benefits of the intensional store at a much lower price. In particular, setting some critical derivations as content-adressed can lead to some substancial build cutoffs. 
From 81099b23108b0bc7acb81130f4bc4b884806459f Mon Sep 17 00:00:00 2001 From: regnat Date: Wed, 11 Dec 2019 17:33:41 +0100 Subject: [PATCH 04/32] nix code -> Nix expression --- rfcs/0060-content-addressed-paths.md | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/rfcs/0060-content-addressed-paths.md b/rfcs/0060-content-addressed-paths.md index efda5213d..f7041afbe 100644 --- a/rfcs/0060-content-addressed-paths.md +++ b/rfcs/0060-content-addressed-paths.md @@ -72,7 +72,7 @@ The gist of the design is that Since the design is non trivial, better start with an example to give an intuition of what's happening: -In this example, we have the following nix code: +In this example, we have the following Nix expression: ```nix rec { @@ -97,7 +97,7 @@ rec { Suppose that we want to build `transitivelyDependent`. What will happen is the following -- We instantiate the nix code, this gives us three drv files: +- We instantiate the Nix expression, this gives us three drv files: `contentAdressed.drv`, `dependent.drv` and `transitivelyDependent.drv` - We build `contentAdressed.drv`. - We first compute `dynamic(contentAdressed.drv)` to replace its @@ -128,7 +128,7 @@ a way that `contentAdressed.drv` will be modified, but its output will be the same. We try to rebuild the new `transitivelyDependent`. What happens is the following: -- We instantiate the nix code, this gives us three new drv files: +- We instantiate the Nix expression, this gives us three new drv files: `contentAdressed.drv`, `dependent.drv` and `transitivelyDependent.drv` - We build `contentAdressed.drv`. - We first compute `dynamic(contentAdressed.drv)` to replace its @@ -155,7 +155,7 @@ following: - Here again, we notice that `dynamic(transitivelyDependent.drv)` is the same as before, so we don't build anything -## nix-build process +## Nix-build process ### Aliases paths @@ -245,7 +245,7 @@ is the case). ## Client support -The bulk of the job here is done by the nix daemon. 
+The bulk of the job here is done by the Nix daemon. Depending on the details of the current Nix implementation, there might or might not be a need for the client to also support it (which would require the @@ -253,7 +253,7 @@ daemon and the client to be updated in synchronously) ## Old Nix versions and caching -What happens (and should happen) if a nix not supporting the cas model queries +What happens (and should happen) if a Nix not supporting the cas model queries a cache with cas paths in it is not clear yet. In particular, the content (and the existence) of the physical path of the From 427738682460756a9c25de458dda727ad88b078e Mon Sep 17 00:00:00 2001 From: regnat Date: Wed, 11 Dec 2019 17:40:27 +0100 Subject: [PATCH 05/32] Break-up the big introduction paragraph As suggested in https://github.com/NixOS/rfcs/pull/62/files#r356694585 --- rfcs/0060-content-addressed-paths.md | 13 +++++++++---- 1 file changed, 9 insertions(+), 4 deletions(-) diff --git a/rfcs/0060-content-addressed-paths.md b/rfcs/0060-content-addressed-paths.md index f7041afbe..2ca47d59e 100644 --- a/rfcs/0060-content-addressed-paths.md +++ b/rfcs/0060-content-addressed-paths.md @@ -21,12 +21,17 @@ now (modulo some mandatory drv rewriting before the build, see below) By making this opt-in, we can impose arbitrary limitations to the paths that are allowed to be ca to avoid some tricky issues that can arise with content-adressability. -In particular, we restrict ourselves to paths without any non-textual -self-reference (_i.e_ a self-reference hidden inside a zip file) and known to -be deterministic (for caching reasons, see [caching]). + +In particular, we restrict ourselves to paths that are: + +- without any non-textual self-reference (_i.e_ a self-reference hidden inside a zip file) +- known to be deterministic (for caching reasons, see [caching]). 
+ That way we don't have to worry about the fact that hash-rewriting is only an approximation nor by the semantics of the distribution of non-deterministic -paths, **but** we also leave the option to lift these restrictions later. +paths. + +We also leave the option to lift these restrictions later. This RFC already has a (somewhat working) POC at . From 7af7d2c92767931da94e107844f40b293fea451e Mon Sep 17 00:00:00 2001 From: regnat Date: Thu, 12 Dec 2019 09:47:51 +0100 Subject: [PATCH 06/32] Rename to match the PR number --- ...content-addressed-paths.md => 0062-content-addressed-paths.md} | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename rfcs/{0060-content-addressed-paths.md => 0062-content-addressed-paths.md} (100%) diff --git a/rfcs/0060-content-addressed-paths.md b/rfcs/0062-content-addressed-paths.md similarity index 100% rename from rfcs/0060-content-addressed-paths.md rename to rfcs/0062-content-addressed-paths.md From 5fec861fa409f12ff3412c328d619141e673a71e Mon Sep 17 00:00:00 2001 From: regnat Date: Thu, 12 Dec 2019 09:50:15 +0100 Subject: [PATCH 07/32] Rename the drv attribute to __contentAddressed Makes it more in line with other "magic" attributes like `__structuredAttrs` Also fix the spelling --- rfcs/0062-content-addressed-paths.md | 58 ++++++++++++++-------------- 1 file changed, 29 insertions(+), 29 deletions(-) diff --git a/rfcs/0062-content-addressed-paths.md b/rfcs/0062-content-addressed-paths.md index 2ca47d59e..5d3f79dc2 100644 --- a/rfcs/0062-content-addressed-paths.md +++ b/rfcs/0062-content-addressed-paths.md @@ -81,14 +81,14 @@ In this example, we have the following Nix expression: ```nix rec { - contentAdressed = mkDerivation { - name = "contentAdressed"; - contentAdressed = true; + contentAddressed = mkDerivation { + name = "contentAddressed"; + __contentAddressed = true; … # Some extra arguments }; dependent = mkDerivation { name = "dependent"; - buildInputs = [ contentAdressed ]; + buildInputs = [ contentAddressed ]; …
# Some extra arguments }; transitivelyDependent = mkDerivation { @@ -103,21 +103,21 @@ Suppose that we want to build `transitivelyDependent`. What will happen is the following - We instantiate the Nix expression, this gives us three drv files: - `contentAdressed.drv`, `dependent.drv` and `transitivelyDependent.drv` -- We build `contentAdressed.drv`. - - We first compute `dynamic(contentAdressed.drv)` to replace its + `contentAddressed.drv`, `dependent.drv` and `transitivelyDependent.drv` +- We build `contentAddressed.drv`. + - We first compute `dynamic(contentAddressed.drv)` to replace its inputs by their real output path. Since there is none, we - have here `dynamic(contentAdressed.drv) == contentAdressed.drv` - - We realise `dynamic(contentAdressed.drv)`. This gives us an output path - `out(dynamic(contentAdressed.drv))` - - We move `out(dynamic(contentAdressed.drv))` to its content-adressed path - `ca(contentAdressed.drv)` which derives from - `sha256(out(dynamic(contentAdressed.drv)))` + have here `dynamic(contentAddressed.drv) == contentAddressed.drv` + - We realise `dynamic(contentAddressed.drv)`. This gives us an output path + `out(dynamic(contentAddressed.drv))` + - We move `out(dynamic(contentAddressed.drv))` to its content-adressed path + `ca(contentAddressed.drv)` which derives from + `sha256(out(dynamic(contentAddressed.drv)))` - We build `dependent.drv` - We first compute `dynamic(dependent.drv)` to replace its inputs by their real output path. - In that case, we replace `contentAdressed.drv!out` by - `ca(contentAdressed.drv)` + In that case, we replace `contentAddressed.drv!out` by + `ca(contentAddressed.drv)` - We realise `dynamic(dependent.drv)`. This gives us an output path `out(dynamic(dependent.drv))` - We build `transitivelyDependent.drv` @@ -128,29 +128,29 @@ What will happen is the following - We realise `dynamic(transitivelyDependent.drv)`. 
This gives us an output path `out(dynamic(transitivelyDependent.drv))` -Now suppose that we slightly change the definition of `contentAdressed` in such -a way that `contentAdressed.drv` will be modified, but its output will be the +Now suppose that we slightly change the definition of `contentAddressed` in such +a way that `contentAddressed.drv` will be modified, but its output will be the same. We try to rebuild the new `transitivelyDependent`. What happens is the following: - We instantiate the Nix expression, this gives us three new drv files: - `contentAdressed.drv`, `dependent.drv` and `transitivelyDependent.drv` -- We build `contentAdressed.drv`. - - We first compute `dynamic(contentAdressed.drv)` to replace its + `contentAddressed.drv`, `dependent.drv` and `transitivelyDependent.drv` +- We build `contentAddressed.drv`. + - We first compute `dynamic(contentAddressed.drv)` to replace its inputs by their real output path. Since there is none, we - have here `dynamic(contentAdressed.drv) == contentAdressed.drv` - - We realise `dynamic(contentAdressed.drv)`. This gives us an output path - `out(dynamic(contentAdressed.drv))` - - We compute `ca(contentAdressed.drv)` and notice that the + have here `dynamic(contentAddressed.drv) == contentAddressed.drv` + - We realise `dynamic(contentAddressed.drv)`. This gives us an output path + `out(dynamic(contentAddressed.drv))` + - We compute `ca(contentAddressed.drv)` and notice that the path already exists (since it's the same as the one we built previously), so we discard the result. - We build `dependent.drv` - We first compute `dynamic(dependent.drv)` to replace its inputs by their real output path. 
- In that case, we replace `contentAdressed.drv!out` by - `ca(contentAdressed.drv)` + In that case, we replace `contentAddressed.drv!out` by + `ca(contentAddressed.drv)` - We notice that `dynamic(dependent.drv)` is the same as before (since - `ca(contentAdressed.drv)` is the same as before), so we + `ca(contentAddressed.drv)` is the same as before), so we just return the already existing path - We build `transitivelyDependent.drv` - We first compute `dynamic(transitivelyDependent.drv)` to replace its @@ -171,7 +171,7 @@ on-disk) ### Building a ca derivation -ca derivations are derivations with the `contentAdressed` argument set to +ca derivations are derivations with the `__contentAddressed` argument set to `true`. The process for building a content-adressed derivation is the following: @@ -208,7 +208,7 @@ The process for building a normal derivation is the following: particular, the caching model needs some modifications (see [caching]); - We specify that only a sub-category of derivations can safely be marked as - `contentAdressed`, but there's no way to enforce these restricitions; + `contentAddressed`, but there's no way to enforce these restricitions; - This will probably be a breaking-change for some tooling since the output path that's stored in the `.drv` files doesn't correspond to the actual on-disk From 9edc11f6596c7dc8d344d79e2eecf530d0e3e587 Mon Sep 17 00:00:00 2001 From: regnat Date: Wed, 8 Jan 2020 07:13:01 +0100 Subject: [PATCH 08/32] Mention the GC issue --- rfcs/0062-content-addressed-paths.md | 9 +++++++++ 1 file changed, 9 insertions(+) diff --git a/rfcs/0062-content-addressed-paths.md b/rfcs/0062-content-addressed-paths.md index 5d3f79dc2..e862dff24 100644 --- a/rfcs/0062-content-addressed-paths.md +++ b/rfcs/0062-content-addressed-paths.md @@ -266,6 +266,15 @@ static derivation isn't decided. A backwards-compatible choice would be to make this a symlink to the dynamic path, but this is also very leaky and potentially unsound. 
+## Garbage collection + +Another major open issue is garbage collection of the aliases table. It's not +clear when entries should be deleted. The paths in the domain are "fake" so we +can't use them for expiration. The paths in the codomain could be used (i.e. if +a path is GC'ed, we delete the alias entries that map to it) but it's not clear +whether that's desirable since you may want to bring back the path via +substitution in the future. + # Future work [future]: #future-work From 5717351febdfd89893a9320f09bf768aef0c61ef Mon Sep 17 00:00:00 2001 From: regnat Date: Wed, 8 Jan 2020 07:37:20 +0100 Subject: [PATCH 09/32] Remove the ambiguity on what an `output` is --- rfcs/0062-content-addressed-paths.md | 12 +++++------- 1 file changed, 5 insertions(+), 7 deletions(-) diff --git a/rfcs/0062-content-addressed-paths.md b/rfcs/0062-content-addressed-paths.md index e862dff24..6d51841b2 100644 --- a/rfcs/0062-content-addressed-paths.md +++ b/rfcs/0062-content-addressed-paths.md @@ -60,17 +60,15 @@ cutoffs. In all that follows, we pretend that each derivation has only one output. This doesn't change the reasoning but makes things easier to state. -The gist of the design is that +The gist of the design is that: - Some derivations can be marked as content-adressed (ca), in which case their output will be moved to a path `ca` determined only by its content after the build -- Each (non content-adressed) derivation will have two outputs: A `static` one - computed at evaluation time and a `dynamic` one computed from the dynamic - outputs of its dependencies. 
These outputs may be identical if the derivation - doesn't (transitively) depend on any ca derivation -- just prior to being realized, each derivation gets rewritten by replacing - each of its dependencies by its `dynamic` or `ca` path +- When asked to build a derivation, Nix will instead compute a `dynamic` + version of that derivation (where all the ca dependencies are replaced by + their content addressed path), build this dynamic derivation and link back + the original one to this build result. ## Example From 1a844ccbe3a2e9c2c6f15236c046beb78bc482aa Mon Sep 17 00:00:00 2001 From: regnat Date: Wed, 15 Jan 2020 06:38:10 +0100 Subject: [PATCH 10/32] Replace aliases paths by a pathOf mapping --- rfcs/0062-content-addressed-paths.md | 20 ++++++++++---------- 1 file changed, 10 insertions(+), 10 deletions(-) diff --git a/rfcs/0062-content-addressed-paths.md b/rfcs/0062-content-addressed-paths.md index 6d51841b2..22fc5fc7e 100644 --- a/rfcs/0062-content-addressed-paths.md +++ b/rfcs/0062-content-addressed-paths.md @@ -160,12 +160,13 @@ following: ## Nix-build process -### Aliases paths +### Output mappings -To allow this, we add a new type of store path: aliases paths. -These paths don't actually exist in the store, just in the database and point to -another path (so they are morally symlinks, but inside the db rather than -on-disk) +A major consequence of allowing content-addressed derivations is that the +actual output path of a derivation might not match its output hash anymore. + +To express this, we introduce a new mapping `pathOf` that associates the hash +of every live derivation to its store path. ### Building a ca derivation @@ -177,7 +178,8 @@ The process for building a content-adressed derivation is the following: - We build it like a normal derivation to get an output path `$out`. 
- We compute a cryptographic hash `$chash` of `$out`[^modulo-hashing] - We move `$out` to `/nix/store/$chash-$name` -- We create an alias path from `$out` to `/nix/store/$chash-$name` +- We create a mapping from `$dhash` (the hash computed at eval-time) to + `/nix/store/$chash-$name` [^modulo-hashing]: @@ -189,12 +191,10 @@ The process for building a content-adressed derivation is the following: The process for building a normal derivation is the following: -- We look into the drv for all the inputs paths of the build -- For each input path, we look whether the path is an alias. If so we replace it - by its target +- We replace each input derivation `drv` by `pathOf(dhash(drv))` - We compute the `dynamic` output of the derivation from the patched version - We then try to substitute and build the new derivation -- We create an alias path from the `static` output to the `dynamic` one +- We add a new mapping `pathOf(dhash(drv)) = out(dynamic)` ## Wrapping it up From 26ae77e1b9d969d697c1c47646f13c7c7da45e6e Mon Sep 17 00:00:00 2001 From: regnat Date: Wed, 15 Jan 2020 06:39:27 +0100 Subject: [PATCH 11/32] Move the example after the design description --- rfcs/0062-content-addressed-paths.md | 79 +++++++++++++--------------- 1 file changed, 38 insertions(+), 41 deletions(-) diff --git a/rfcs/0062-content-addressed-paths.md b/rfcs/0062-content-addressed-paths.md index 22fc5fc7e..3a3d2e3a0 100644 --- a/rfcs/0062-content-addressed-paths.md +++ b/rfcs/0062-content-addressed-paths.md @@ -70,10 +70,45 @@ The gist of the design is that: their content addressed path), build this dynamic derivation and link back the original one to this build result. -## Example +## Nix-build process + +### Output mappings + +A major consequence of allowing content-addressed derivations is that the +actual output path of a derivation might not match its output hash anymore. 
+ +To express this, we introduce a new mapping `pathOf` that associates the hash +of every live derivation to its store path. -Since the design is non trivial, better start with an example to give an -intuition of what's happening: +### Building a ca derivation + +ca derivations are derivations with the `__contentAddressed` argument set to +`true`. + +The process for building a content-adressed derivation is the following: + +- We build it like a normal derivation to get an output path `$out`. +- We compute a cryptographic hash `$chash` of `$out`[^modulo-hashing] +- We move `$out` to `/nix/store/$chash-$name` +- We create a mapping from `$dhash` (the hash computed at eval-time) to + `/nix/store/$chash-$name` + +[^modulo-hashing]: + + We can possibly normalize all the self-references before + computing the hash and rewrite them when moving the path to handle paths with + self-references, but this isn't strictly required for a first iteration + +### Building a normal derivation + +The process for building a normal derivation is the following: + +- We replace each input derivation `drv` by `pathOf(dhash(drv))` +- We compute the `dynamic` output of the derivation from the patched version +- We then try to substitute and build the new derivation +- We add a new mapping `pathOf(dhash(drv)) = out(dynamic)` + +## Example In this example, we have the following Nix expression: @@ -158,44 +193,6 @@ following: - Here again, we notice that `dynamic(transitivelyDependent.drv)` is the same as before, so we don't build anything -## Nix-build process - -### Output mappings - -A major consequence of allowing content-addressed derivations is that the -actual output path of a derivation might not match its output hash anymore. - -To express this, we introduce a new mapping `pathOf` that associates the hash -of every live derivation to its store path. - -### Building a ca derivation - -ca derivations are derivations with the `__contentAddressed` argument set to -`true`. 
-
-The process for building a content-adressed derivation is the following:
-
-- We build it like a normal derivation to get an output path `$out`.
-- We compute a cryptographic hash `$chash` of `$out`[^modulo-hashing]
-- We move `$out` to `/nix/store/$chash-$name`
-- We create a mapping from `$dhash` (the hash computed at eval-time) to
- `/nix/store/$chash-$name`
-
-[^modulo-hashing]:
-
- We can possibly normalize all the self-references before
- computing the hash and rewrite them when moving the path to handle paths with
- self-references, but this isn't strictly required for a first iteration
-
-### Building a normal derivation
-
-The process for building a normal derivation is the following:
-
-- We replace each input derivation `drv` by `pathOf(dhash(drv))`
-- We compute the `dynamic` output of the derivation from the patched version
-- We then try to substitute and build the new derivation
-- We add a new mapping `pathOf(dhash(drv)) = out(dynamic)`
-
 ## Wrapping it up

 # Drawbacks

From bbdca7ed32718ccb409855bcd4904b8c21244e16 Mon Sep 17 00:00:00 2001
From: regnat
Date: Wed, 15 Jan 2020 09:13:15 +0100
Subject: [PATCH 12/32] Rephrase the design

In particular, replace `static` and `dynamic` by `symbolic` and `resolved`
---
 rfcs/0062-content-addressed-paths.md | 89 ++++++++++++++++------------
 1 file changed, 52 insertions(+), 37 deletions(-)

diff --git a/rfcs/0062-content-addressed-paths.md b/rfcs/0062-content-addressed-paths.md
index 3a3d2e3a0..911be50c3 100644
--- a/rfcs/0062-content-addressed-paths.md
+++ b/rfcs/0062-content-addressed-paths.md
@@ -65,10 +65,15 @@ The gist of the design is that:
 - Some derivations can be marked as content-adressed (ca), in which case their
 output will be moved to a path `ca` determined only by its content after the
 build
-- When asked to build a derivation, Nix will instead compute a `dynamic`
- version of that derivation (where all the ca dependencies are replaced by
- their content addressed path), build this dynamic
derivation and link back
- the original one to this build result.
+- We introduce the notion of a `resolved derivation` which is a derivation that
+ doesn't refer to any other derivation but only to concrete store paths.
+ To prevent ambiguities, we might speak of a `symbolic derivation` to
+ designate a derivation that's not necessarily resolved.
+ We also define a `resolving` function that given a symbolic derivation
+ returns a new resolved derivation with the same semantics.
+- When asked to build a derivation, Nix will first resolve it, build the
+ resolved derivation and link back the symbolic one to the out path of the
+ resolved one.

 ## Nix-build process

@@ -79,6 +84,7 @@ actual output path of a derivation might not match its output hash anymore.

 To express this, we introduce a new mapping `pathOf` that associates the hash
 of every live derivation to its store path.
+By extension, we also define `pathOf(drv) = pathOf(hash(drv))`

 ### Building a ca derivation

@@ -87,7 +93,7 @@ ca derivations are derivations with the `__contentAddressed` argument set to

 The process for building a content-adressed derivation is the following:

-- We build it like a normal derivation to get an output path `$out`.
+- We build it like a normal derivation (see below) to get an output path `$out`.
 - We compute a cryptographic hash `$chash` of `$out`[^modulo-hashing]
 - We move `$out` to `/nix/store/$chash-$name`
 - We create a mapping from `$dhash` (the hash computed at eval-time) to
@@ -101,12 +107,26 @@ The process for building a content-adressed derivation is the following:

 ### Building a normal derivation

-The process for building a normal derivation is the following:
+#### Resolved derivations

-- We replace each input derivation `drv` by `pathOf(dhash(drv))`
-- We compute the `dynamic` output of the derivation from the patched version
-- We then try to substitute and build the new derivation
-- We add a new mapping `pathOf(dhash(drv)) = out(dynamic)`
+We define a `resolved derivation` as a derivation that has no reference to any
+other derivation (but can refere to store paths).
+
+For a derivation `drv` whose input derivations have all been realised, we define
+its `associated resolved derivation` of `drv` (`resolved(drv)`) as
+`drv` in which we replace every input derivation `inDrv` of `drv` by
+`pathOf(inDrv)` (and update the output hash accordingly).
+
+`resolved` is (intentionally) not injective: If `drv` and `drv'` only differ because one depends on `dep` and the other on `dep'`, but `dep` and `dep'` are content-addressed and have the same output hash, then `resolved(drv)` and `resolved(drv')` will be equal.
+
+Derivations that don't transitively depend on any ca derivation are “equivalent” to their associated resolved derivation in that they refer to the same inputs and have the same output hash.
+
+#### Build process
+
+When asked to build a derivation `drv`, we instead:
+
+1. Try to substitute and build `resolved(drv)`. Possibly this is a no-op because it may be that `resolved(drv)` has already been built.
+2.
Add a new mapping `pathOf(hash(drv)) = out(resolved(drv))`

 ## Example

@@ -138,28 +158,28 @@ What will happen is the following
 - We instantiate the Nix expression, this gives us three drv files:
 `contentAddressed.drv`, `dependent.drv` and `transitivelyDependent.drv`
 - We build `contentAddressed.drv`.
- - We first compute `dynamic(contentAddressed.drv)` to replace its
+ - We first compute `resolved(contentAddressed.drv)` to replace its
 inputs by their real output path. Since there is none, we
- have here `dynamic(contentAddressed.drv) == contentAddressed.drv`
- - We realise `dynamic(contentAddressed.drv)`. This gives us an output path
- `out(dynamic(contentAddressed.drv))`
- - We move `out(dynamic(contentAddressed.drv))` to its content-adressed path
+ have here `resolved(contentAddressed.drv) == contentAddressed.drv`
+ - We realise `resolved(contentAddressed.drv)`. This gives us an output path
+ `out(resolved(contentAddressed.drv))`
+ - We move `out(resolved(contentAddressed.drv))` to its content-adressed path
 `ca(contentAddressed.drv)` which derives from
- `sha256(out(dynamic(contentAddressed.drv)))`
+ `sha256(out(resolved(contentAddressed.drv)))`
 - We build `dependent.drv`
- - We first compute `dynamic(dependent.drv)` to replace its
+ - We first compute `resolved(dependent.drv)` to replace its
 inputs by their real output path.
 In that case, we replace `contentAddressed.drv!out` by
 `ca(contentAddressed.drv)`
- - We realise `dynamic(dependent.drv)`. This gives us an output path
- `out(dynamic(dependent.drv))`
+ - We realise `resolved(dependent.drv)`. This gives us an output path
+ `out(resolved(dependent.drv))`
 - We build `transitivelyDependent.drv`
- - We first compute `dynamic(transitivelyDependent.drv)` to replace its
+ - We first compute `resolved(transitivelyDependent.drv)` to replace its
 inputs by their real output path.
 In that case, that means replacing `dependent.drv!out` by
- `out(dynamic(dependent.drv))`
- - We realise `dynamic(transitivelyDependent.drv)`.
This gives us an output path
- `out(dynamic(transitivelyDependent.drv))`
+ `out(resolved(dependent.drv))`
+ - We realise `resolved(transitivelyDependent.drv)`. This gives us an output path
+ `out(resolved(transitivelyDependent.drv))`

 Now suppose that we slightly change the definition of `contentAddressed` in such
 a way that `contentAddressed.drv` will be modified, but its output will be the
@@ -169,28 +189,28 @@ following:
 - We instantiate the Nix expression, this gives us three new drv files:
 `contentAddressed.drv`, `dependent.drv` and `transitivelyDependent.drv`
 - We build `contentAddressed.drv`.
- - We first compute `dynamic(contentAddressed.drv)` to replace its
+ - We first compute `resolved(contentAddressed.drv)` to replace its
 inputs by their real output path. Since there is none, we
- have here `dynamic(contentAddressed.drv) == contentAddressed.drv`
- - We realise `dynamic(contentAddressed.drv)`. This gives us an output path
- `out(dynamic(contentAddressed.drv))`
+ have here `resolved(contentAddressed.drv) == contentAddressed.drv`
+ - We realise `resolved(contentAddressed.drv)`. This gives us an output path
+ `out(resolved(contentAddressed.drv))`
 - We compute `ca(contentAddressed.drv)` and notice that the
 path already exists (since it's the same as the one we built previously),
 so we discard the result.
 - We build `dependent.drv`
- - We first compute `dynamic(dependent.drv)` to replace its
+ - We first compute `resolved(dependent.drv)` to replace its
 inputs by their real output path.
 In that case, we replace `contentAddressed.drv!out` by
 `ca(contentAddressed.drv)`
- - We notice that `dynamic(dependent.drv)` is the same as before (since
+ - We notice that `resolved(dependent.drv)` is the same as before (since
 `ca(contentAddressed.drv)` is the same as before), so we
 just return the already existing path
 - We build `transitivelyDependent.drv`
- - We first compute `dynamic(transitivelyDependent.drv)` to replace its
+ - We first compute `resolved(transitivelyDependent.drv)` to replace its
 inputs by their real output path.
 In that case, that means replacing `dependent.drv!out` by
- `out(dynamic(dependent.drv))`
- - Here again, we notice that `dynamic(transitivelyDependent.drv)` is the same as before,
+ `out(resolved(dependent.drv))`
+ - Here again, we notice that `resolved(transitivelyDependent.drv)` is the same as before,
 so we don't build anything

 ## Wrapping it up
@@ -256,11 +276,6 @@ daemon and the client to be updated in synchronously)
 What happens (and should happen) if a Nix not supporting the cas model queries
 a cache with cas paths in it is not clear yet.

-In particular, the content (and the existence) of the physical path of the
-static derivation isn't decided. A backwards-compatible choice would be to make
-this a symlink to the dynamic path, but this is also very leaky and potentially
-unsound.
-
 ## Garbage collection

 Another major open issue is garbage collection of the aliases table.
It's not

From 63f3eca1cbcdb8506beab6ba0f9c9dd8f776a4d0 Mon Sep 17 00:00:00 2001
From: regnat
Date: Thu, 16 Jan 2020 15:52:57 +0100
Subject: [PATCH 13/32] Add shepherd team

---
 rfcs/0062-content-addressed-paths.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/rfcs/0062-content-addressed-paths.md b/rfcs/0062-content-addressed-paths.md
index 911be50c3..5285c519e 100644
--- a/rfcs/0062-content-addressed-paths.md
+++ b/rfcs/0062-content-addressed-paths.md
@@ -3,7 +3,7 @@ feature: Simple content-adressed store paths
 start-date: 2019-08-14
 author: Théophane Hufschmitt
 co-authors: (find a buddy later to help our with the RFC)
-shepherd-team: (names, to be nominated and accepted by RFC steering committee)
+shepherd-team: @layus, @edolstra and @Ericson2314
 shepherd-leader: (name to be appointed by RFC steering committee)
 related-issues: (will contain links to implementation PRs)
 ---

From a6d2f38cb50f918cc6835fd105714b5295310022 Mon Sep 17 00:00:00 2001
From: regnat
Date: Mon, 17 Feb 2020 12:18:33 +0100
Subject: [PATCH 14/32] Rewrite the RFC to account for the RFC meeting comments

- Add the notion of `drvOutputId`
- Replace the alias paths by a `PathOf(drvOutputId)` function
- Mention the notion of "truster" in the `PathOf` function
---
 rfcs/0062-content-addressed-paths.md | 193 +++++++++++++--------------
 1 file changed, 94 insertions(+), 99 deletions(-)

diff --git a/rfcs/0062-content-addressed-paths.md b/rfcs/0062-content-addressed-paths.md
index 5285c519e..4dcb1c2e2 100644
--- a/rfcs/0062-content-addressed-paths.md
+++ b/rfcs/0062-content-addressed-paths.md
@@ -57,14 +57,11 @@ cutoffs.

 [design]: #detailed-design

-In all that follows, we pretend that each derivation has only one output.
-This doesn't change the reasoning but makes things easier to state.
-
 The gist of the design is that:

-- Some derivations can be marked as content-adressed (ca), in which case their
- output will be moved to a path `ca` determined only by its content after the
- build
+- Some derivations can be marked as content-adressed (ca), in which case each
+ one of their output will be moved to a path `ca` determined only by its
+ content after the build
 - We introduce the notion of a `resolved derivation` which is a derivation that
 doesn't refer to any other derivation but only to concrete store paths.
 To prevent ambiguities, we might speak of a `symbolic derivation` to
@@ -79,54 +76,73 @@ The gist of the design is that:

 ### Output mappings

-A major consequence of allowing content-addressed derivations is that the
-actual output path of a derivation might not match its output hash anymore.
-
-To express this, we introduce a new mapping `pathOf` that associates the hash
-of every live derivation to its store path.
-By extension, we also define `pathOf(drv) = pathOf(hash(drv))`
-
-### Building a ca derivation
-
-ca derivations are derivations with the `__contentAddressed` argument set to
-`true`.
+For each output `output` of a derivation `drv`, we define

-The process for building a content-adressed derivation is the following:
+- its output id **DrvOutputId(drv, output)** as the tuple `(hash(drv), output, truster)`, where `truster` is a reserved field for future use and currently always set to `"world"`.
+ This id uniquely identifies the output.
+ We textually represent this as `hash(drv)!output[@truster]`.
+- its concrete path **PathOf(outputId)** as the path on which the output will be stored on disk.

-- We build it like a normal derivation (see below) to get an output path `$out`.
-- We compute a cryptographic hash `$chash` of `$out`[^modulo-hashing]
-- We move `$out` to `/nix/store/$chash-$name`
-- We create a mapping from `$dhash` (the hash computed at eval-time) to
- `/nix/store/$chash-$name`
+> Unresolved: should we already include the `truster` field in `DrvOutputId`
+> even if it's not used atm? What would be the cost of adding it later?

-[^modulo-hashing]:
+In a dependency-addressed-only world, the concrete path for a derivation output was a pure function of this output's id that could be computed at eval-time. However this won't be the case anymore once we allow content-addressed derivations, so we now need to store the results the `PathOf` function in the Nix database as a new table:

- We can possibly normalize all the self-references before
- computing the hash and rewrite them when moving the path to handle paths with
- self-references, but this isn't strictly required for a first iteration
+```sql
+create table if not exists PathOf (
+ drv integer not null,
+ output text not null,
+ truster integer not null,
+ path integer not null,
+)
+```

 ### Building a normal derivation

 #### Resolved derivations

-We define a `resolved derivation` as a derivation that has no reference to any
-other derivation (but can refere to store paths).
+We define a **resolved derivation** as a derivation whose only references are either:

-For a derivation `drv` whose input derivations have all been realised, we define
-its `associated resolved derivation` of `drv` (`resolved(drv)`) as
-`drv` in which we replace every input derivation `inDrv` of `drv` by
-`pathOf(inDrv)` (and update the output hash accordingly).
+- Self references
+- References to the outputs of other (non content-addresed) resolved derivations
+- Existing store paths

-`resolved` is (intentionally) not injective: If `drv` and `drv'` only differ because one depends on `dep` and the other on `dep'`, but `dep` and `dep'` are content-addressed and have the same output hash, then `resolved(drv)` and `resolved(drv')` will be equal.
+For a derivation `drv` whose input derivations have all been realised, we define its **associated resolved derivation** `resolved(drv)` as `drv` in which we replace every input derivation `inDrv` of `drv` by `pathOf(inDrv)` (and update the output hash accordingly).
+
+> This doesn't have the property that for a derivation that doesn't depend on any CA derivation `resolved(drv) == drv`. I think that this is a rather big issue so we'll have to find a way to get this property back (but feel free to correct me if you think that it isn't a big deal)

-Derivations that don't transitively depend on any ca derivation are “equivalent” to their associated resolved derivation in that they refer to the same inputs and have the same output hash.
+`resolved` is (intentionally) not injective: If `drv` and `drv'` only differ because one depends on `dep` and the other on `dep'`, but `dep` and `dep'` are content-addressed and have the same output hash, then `resolved(drv)` and `resolved(drv')` will be equal.

 #### Build process

 When asked to build a derivation `drv`, we instead:

-1. Try to substitute and build `resolved(drv)`. Possibly this is a no-op because it may be that `resolved(drv)` has already been built.
-2. Add a new mapping `pathOf(hash(drv)) = out(resolved(drv))`
+1. Compute `resolved(drv)`
+2. Substitute and build `resolved(drv)` like a normal derivation.
+ Possibly this is a no-op because it may be that `resolved(drv)` has already been built.
+3.
Add a new mapping `pathOf(drv!${output}) == ${output}(resolved(drv))` for each output `output` of `drv`
+
+### Building a ca derivation
+
+A **ca derivation** is a derivation with the `__contentAddressed` argument set
+to `true` and the `outputHashAlgo` set to a value that is a valid hash name
+recognized by Nix (see the description for `outputHashAlgo` at
+ for the current allowed
+values).
+
+The process for building a content-adressed derivation `drv` is the following:
+
+- We build it like a normal derivation (see above).
+ For each output `$outputId` of the derivation, this gives us a (temporary) output path `$out`.
+ - We compute a cryptographic hash `$chash` of `$out`[^modulo-hashing]
+ - We move `$out` to `/nix/store/$chash-$name`
+ - We store the mapping `PathOf($outputId) == "/nix/store/$chash-$name"`
+
+[^modulo-hashing]:
+
+ We can possibly normalize all the self-references before
+ computing the hash and rewrite them when moving the path to handle paths with
+ self-references, but this isn't strictly required for a first iteration

 ## Example

@@ -155,65 +171,45 @@ rec {
 Suppose that we want to build `transitivelyDependent`.
 What will happen is the following

-- We instantiate the Nix expression, this gives us three drv files:
- `contentAddressed.drv`, `dependent.drv` and `transitivelyDependent.drv`
-- We build `contentAddressed.drv`.
- - We first compute `resolved(contentAddressed.drv)` to replace its
- inputs by their real output path. Since there is none, we
- have here `resolved(contentAddressed.drv) == contentAddressed.drv`
- - We realise `resolved(contentAddressed.drv)`. This gives us an output path
- `out(resolved(contentAddressed.drv))`
- - We move `out(resolved(contentAddressed.drv))` to its content-adressed path
- `ca(contentAddressed.drv)` which derives from
- `sha256(out(resolved(contentAddressed.drv)))`
-- We build `dependent.drv`
- - We first compute `resolved(dependent.drv)` to replace its
- inputs by their real output path.
- In that case, we replace `contentAddressed.drv!out` by
- `ca(contentAddressed.drv)`
- - We realise `resolved(dependent.drv)`. This gives us an output path
- `out(resolved(dependent.drv))`
-- We build `transitivelyDependent.drv`
- - We first compute `resolved(transitivelyDependent.drv)` to replace its
- inputs by their real output path.
- In that case, that means replacing `dependent.drv!out` by
- `out(resolved(dependent.drv))`
- - We realise `resolved(transitivelyDependent.drv)`. This gives us an output path
- `out(resolved(transitivelyDependent.drv))`
-
-Now suppose that we slightly change the definition of `contentAddressed` in such
-a way that `contentAddressed.drv` will be modified, but its output will be the
-same. We try to rebuild the new `transitivelyDependent`. What happens is the
-following:
-
-- We instantiate the Nix expression, this gives us three new drv files:
- `contentAddressed.drv`, `dependent.drv` and `transitivelyDependent.drv`
-- We build `contentAddressed.drv`.
- - We first compute `resolved(contentAddressed.drv)` to replace its
- inputs by their real output path. Since there is none, we
- have here `resolved(contentAddressed.drv) == contentAddressed.drv`
- - We realise `resolved(contentAddressed.drv)`. This gives us an output path
- `out(resolved(contentAddressed.drv))`
- - We compute `ca(contentAddressed.drv)` and notice that the
- path already exists (since it's the same as the one we built previously),
- so we discard the result.
-- We build `dependent.drv`
- - We first compute `resolved(dependent.drv)` to replace its
- inputs by their real output path.
- In that case, we replace `contentAddressed.drv!out` by
- `ca(contentAddressed.drv)`
- - We notice that `resolved(dependent.drv)` is the same as before (since
- `ca(contentAddressed.drv)` is the same as before), so we
- just return the already existing path
-- We build `transitivelyDependent.drv`
- - We first compute `resolved(transitivelyDependent.drv)` to replace its
- inputs by their real output path.
- In that case, that means replacing `dependent.drv!out` by
- `out(resolved(dependent.drv))`
- - Here again, we notice that `resolved(transitivelyDependent.drv)` is the same as before,
- so we don't build anything
-
-## Wrapping it up
+1. We instantiate the Nix expression, this gives us three drv files:
+ `contentAddressed.drv`, `dependent.drv` and `transitivelyDependent.drv`
+2. We build `contentAddressed.drv`.
+ - We first compute `resolved(contentAddressed.drv)`.
+ - We realise `resolved(contentAddressed.drv)`. This gives us an output path
+ `out(resolved(contentAddressed.drv))`
+ - We move `out(resolved(contentAddressed.drv))` to its content-adressed path
+ `ca(contentAddressed.drv)` which derives from
+ `sha256(out(resolved(contentAddressed.drv)))`
+ - We register in the db that `pathOf(contentAddressed.drv!out) == ca(contentAddressed.drv)`
+3. We build `dependent.drv`
+ - We first compute `resolved(dependent.drv)`.
+ This gives us a new derivation identical to `dependent.drv`, except that `contentAddressed.drv!out` is replaced by `pathOf(contentAddressed.drv!out) == ca(contentAddressed.drv)`
+ - We realise `resolved(dependent.drv)`. This gives us an output path
+ `out(resolved(dependent.drv))`
+ - We register in the db that `pathOf(dependent.drv!out) == out(resolved(dependent.drv))` We build `transitivelyDependent.drv`
+4.
We build `transitivelyDependent.drv`
+ - We first compute `resolved(transitivelyDependent.drv)`
+ This gives us a new derivation identical to `transitivelyDependent.drv`, except that `dependent.drv!out` is replaced by `pathOf(dependent.drv!out) == out(resolved(dependent.drv))`
+ - We realise `resolved(transitivelyDependent.drv)`. This gives us an output path `out(resolved(transitivelyDependent.drv))`
+ - We register in the db that `pathOf(transitivelyDependent.drv!out) == out(resolved(transitivelyDependent.drv))`
+
+Now suppose that we replace `contentAddressed` by `contentAddressed'`, which evaluates to a new derivation `contentAddressed'.drv` such that the output of `contentAddressed'.drv` is the same as the output of `contentAddressed.drv` (say we change a comment in a source file of `contentAddressed`).
+We try to rebuild the new `transitivelyDependent`. What happens is the following:
+
+1. We instantiate the Nix expression, this gives us three new drv files:
+ `contentAddressed'.drv`, `dependent'.drv` and `transitivelyDependent'.drv`
+2. We build `contentAddressed'.drv`.
+ - We first compute `resolved(contentAddressed'.drv)`
+ - We realise `resolved(contentAddressed'.drv)`. This gives us an output path `out(resolved(contentAddressed'.drv))`
+ - We compute `ca(contentAddressed'.drv)` and notice that the path already exists (since it's the same as the one we built previously), so we discard the result.
+ - We register in the db that `pathOf(contentAddressed.drv'!out) == ca(contentAddressed'.drv)` ( also equals to `ca(contentAddressed.drv)`)
+3. We build `dependent'.drv`
+ - We first compute `resolved(dependent'.drv)`.
+ This gives us a new derivation identical to `dependent'.drv`, except that `contentAddressed'.drv!out` is replaced by `pathOf(contentAddressed'.drv!out) == ca(contentAddressed'.drv)`
+ - We notice that `resolved(dependent'.drv) == resolved(dependent.drv)` (since `ca(contentAddressed'.drv) == ca(contentAddressed.drv)`), so we just return the already existing path
+4. We build `transitivelyDependent'.drv`
+ - We first compute `resolved(transitivelyDependent'.drv)`
+ - Here again, we notice that `resolved(transitivelyDependent'.drv)` is the same as `resolved(transitivelyDependent.drv)`, so we don't build anything

 # Drawbacks

@@ -226,9 +222,8 @@ following:
 `contentAddressed`, but there's no way to enforce these restricitions;

 - This will probably be a breaking-change for some tooling since the output path
- that's stored in the `.drv` files doesn't correspond to the actual on-disk
- path the output will be stored in (because it might just be an alias for the
- other path)
+ that's stored in the `.drv` files doesn't correspond to an actual on-disk
+ path.

 # Alternatives

From 140e09334c64599204b58c84a23a5e67c26f1190 Mon Sep 17 00:00:00 2001
From: regnat
Date: Mon, 17 Feb 2020 15:47:50 +0100
Subject: [PATCH 15/32] Add a section about leaking output paths

---
 rfcs/0062-content-addressed-paths.md | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/rfcs/0062-content-addressed-paths.md b/rfcs/0062-content-addressed-paths.md
index 4dcb1c2e2..65a0ca696 100644
--- a/rfcs/0062-content-addressed-paths.md
+++ b/rfcs/0062-content-addressed-paths.md
@@ -280,6 +280,16 @@ a path is GC'ed, we delete the alias entries that map to it) but it's not clear
 whether that's desirable since you may want to bring back the path via
 substitution in the future.

+## Ensuring that no temporary output path leaks in the result
+
+One possible issue with the ca model is that the output paths get moved after being built, which breaks self-references.
Hash rewriting solves this in most cases, but it is only heuristic and there is no way to truly ensure that we don't leak a self-reference (for example if a self-reference appears in a zipped file − like it's often the case for man pages or java jars, the hash-rewriting machinery won't detect it).
+Having leaking self-references is annoying since
+
+- These self-references change each time the inputs of the derivation change, making ca useless (because the output will _always_ change when the input change)
+- More annoyingly, these references become dangling and can cause runtime failures
+
+We however have a way to dectect these: If we have leaking self-references then the output will change if we artificially change its output path. This could be integrated in the `--check` option of `nix-store`.

 # Future work

 [future]: #future-work

From 1115a0d09d348da85a69f572cfda9bd41204a463 Mon Sep 17 00:00:00 2001
From: regnat
Date: Wed, 18 Mar 2020 07:17:27 +0100
Subject: [PATCH 16/32] Refine the design summary

---
 rfcs/0062-content-addressed-paths.md | 20 ++++++++------------
 1 file changed, 8 insertions(+), 12 deletions(-)

diff --git a/rfcs/0062-content-addressed-paths.md b/rfcs/0062-content-addressed-paths.md
index 65a0ca696..a28e531ae 100644
--- a/rfcs/0062-content-addressed-paths.md
+++ b/rfcs/0062-content-addressed-paths.md
@@ -59,18 +59,14 @@ cutoffs.

 The gist of the design is that:

-- Some derivations can be marked as content-adressed (ca), in which case each
- one of their output will be moved to a path `ca` determined only by its
- content after the build
-- We introduce the notion of a `resolved derivation` which is a derivation that
- doesn't refer to any other derivation but only to concrete store paths.
- To prevent ambiguities, we might speak of a `symbolic derivation` to
- designate a derivation that's not necessarily resolved.
- We also define a `resolving` function that given a symbolic derivation
- returns a new resolved derivation with the same semantics.
-- When asked to build a derivation, Nix will first resolve it, build the
- resolved derivation and link back the symbolic one to the out path of the
- resolved one.
+- Derivations can be marked as content-adressed (ca), in which case each
+ one of their output will be moved to content-addressed `ca` store path.
+ This extends the current notion of "fixed-output" derivations.
+- We introduce the notion of "resolving" a derivation, which extends to
+ arbitrary `ca` derivations the current behavior of replacing fixed-outputs
+ derivations by their output hash.
+- We refine the build process so that every derivation is first normalized
+ before being realized

 ## Nix-build process

From 13938dec7928b5579f0077bbeb11ceea5788096c Mon Sep 17 00:00:00 2001
From: regnat
Date: Wed, 18 Mar 2020 07:29:05 +0100
Subject: [PATCH 17/32] Rename dependency-addressed into input-addressed

And define the term at the begining of the RFC
---
 rfcs/0062-content-addressed-paths.md | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/rfcs/0062-content-addressed-paths.md b/rfcs/0062-content-addressed-paths.md
index a28e531ae..64e6095e5 100644
--- a/rfcs/0062-content-addressed-paths.md
+++ b/rfcs/0062-content-addressed-paths.md
@@ -15,7 +15,7 @@ related-issues: (will contain links to implementation PRs)
 Add some basic but simple support for content-adressed store paths to Nix.
 We plan here to give the possibility to mark certain store paths as
-content-adressed (ca), while keeping the other dependency-adressed as they are
+content-adressed (ca), while keeping the other input-adressed as they are
 now (modulo some mandatory drv rewriting before the build, see below)

 By making this opt-in, we can impose arbitrary limitations to the paths that
@@ -70,6 +70,10 @@ The gist of the design is that:

 ## Nix-build process

+For the sake of clarity, we will refer to the current model (where the
+derivations are indexed by their inputs, also sometimes called "extensional") as
+the `input-addressed` model
+
 ### Output mappings

 For each output `output` of a derivation `drv`, we define
@@ -82,7 +86,7 @@ For each output `output` of a derivation `drv`, we define
 > Unresolved: should we already include the `truster` field in `DrvOutputId`
 > even if it's not used atm? What would be the cost of adding it later?

-In a dependency-addressed-only world, the concrete path for a derivation output was a pure function of this output's id that could be computed at eval-time. However this won't be the case anymore once we allow content-addressed derivations, so we now need to store the results the `PathOf` function in the Nix database as a new table:
+In a input-addressed-only world, the concrete path for a derivation output was a pure function of this output's id that could be computed at eval-time.
However this won't be the case anymore once we allow content-addressed derivations, so we now need to store the results the `PathOf` function in the Nix database as a new table:

 ```sql
 create table if not exists PathOf (

From 3a25f7f88753ac7323ab71ee8f92901727f0769f Mon Sep 17 00:00:00 2001
From: regnat
Date: Wed, 25 Mar 2020 17:18:22 +0100
Subject: [PATCH 18/32] minor fixup after comments

---
 rfcs/0062-content-addressed-paths.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/rfcs/0062-content-addressed-paths.md b/rfcs/0062-content-addressed-paths.md
index 64e6095e5..1ae646d7f 100644
--- a/rfcs/0062-content-addressed-paths.md
+++ b/rfcs/0062-content-addressed-paths.md
@@ -97,13 +97,13 @@ create table if not exists PathOf (
 )
 ```

-### Building a normal derivation
+### Building a non-ca derivation

 #### Resolved derivations

 We define a **resolved derivation** as a derivation whose only references are either:

-- Self references
+- Placeholders for the its own outputs (from the `placeholder` builtin)
 - References to the outputs of other (non content-addresed) resolved derivations
 - Existing store paths

From 3a188677a8ed9e8d4cbf897765cc4d18f22c58d2 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Th=C3=A9ophane=20Hufschmitt?=
Date: Fri, 19 Jun 2020 10:13:56 +0200
Subject: [PATCH 19/32] Apply suggestions from code review

Co-authored-by: asymmetric
Co-authored-by: Profpatsch
---
 rfcs/0062-content-addressed-paths.md | 42 ++++++++++++++--------------
 1 file changed, 21 insertions(+), 21 deletions(-)

diff --git a/rfcs/0062-content-addressed-paths.md b/rfcs/0062-content-addressed-paths.md
index 1ae646d7f..e256ea74f 100644
--- a/rfcs/0062-content-addressed-paths.md
+++ b/rfcs/0062-content-addressed-paths.md
@@ -24,11 +24,11 @@ content-adressability.
 In particular, we restrict ourselves to paths that are:

-- without any non-textual self-reference (_i.e_ a self-reference hidden inside a zip file)
+- only include textual self-references (_e.g._ no self-reference hidden inside a zip file)
 - known to be deterministic (for caching reasons, see [caching]).

 That way we don't have to worry about the fact that hash-rewriting is only an
-approximation nor by the semantics of the distribution of non-deterministic
+approximation nor about the semantics of the distribution of non-deterministic
 paths.

 We also leave the option to lift these restrictions later.
@@ -44,13 +44,13 @@ Having a content-adressed store with Nix (aka the "Intensional store") is a
 long-time dream of the community − a design for that was already taking a whole
 chapter in [Eelco's PHD thesis][nixphd].

-This was never done because it represents a quite big change in Nix's model,
-with some non-totally-solved implications (regarding the trust model in
+This was never done because it represents quite a big change in Nix's model,
+with some non-trivial implications (regarding the trust model in
 particular).
 Even without going all the way down to a fully intensional model, we can
 make specific paths content-adressed, which can give some important benefits of
 the intensional store at a much lower price. In particular, setting some
-critical derivations as content-adressed can lead to some substancial build
+critical derivations as content-adressed can lead to some substantial build
 cutoffs.

 # Detailed design
@@ -60,12 +60,12 @@ cutoffs.
 The gist of the design is that:

 - Derivations can be marked as content-adressed (ca), in which case each
-  one of their output will be moved to content-addressed `ca` store path.
+  of their outputs will be moved to a CA store path.
   This extends the current notion of "fixed-output" derivations.
 - We introduce the notion of "resolving" a derivation, which extends to
   arbitrary `ca` derivations the current behavior of replacing fixed-outputs
   derivations by their output hash.
-- We refine the build process so that every derivation is first normalized
+- We refine the build process so that every derivation is normalized
   before being realized

 ## Nix-build process
@@ -78,7 +78,7 @@ the `input-addressed` model

 For each output `output` of a derivation `drv`, we define

-- its output id **DrvOutputId(drv, output)** as the tuple `(hash(drv), output, truster)`, where `truster` is a reserved field for future use and currently always set to `"world"`.
+- its `outputId` **DrvOutputId(drv, output)** as the tuple `(hash(drv), output, truster)`, where `truster` is a reserved field for future use and currently always set to `"world"`.
   This id uniquely identifies the output.
   We textually represent this as `hash(drv)!output[@truster]`.
 - its concrete path **PathOf(outputId)** as the path on which the output will be stored on disk.
@@ -86,7 +86,7 @@ For each output `output` of a derivation `drv`, we define
 > Unresolved: should we already include the `truster` field in `DrvOutputId`
 > even if it's not used atm? What would be the cost of adding it later?

-In a input-addressed-only world, the concrete path for a derivation output was a pure function of this output's id that could be computed at eval-time. However this won't be the case anymore once we allow content-addressed derivations, so we now need to store the results the `PathOf` function in the Nix database as a new table:
+In a input-addressed-only world, the concrete path for a derivation output was a pure function of this output's id that could be computed at eval-time. However this won't be the case anymore once we allow CA derivations, so we now need to store the results of the `PathOf` function in the Nix database as a new table:

 ```sql
 create table if not exists PathOf (
@@ -103,8 +103,8 @@ create table if not exists PathOf (

 We define a **resolved derivation** as a derivation whose only references are either:

-- Placeholders for the its own outputs (from the `placeholder` builtin)
-- References to the outputs of other (non content-addresed) resolved derivations
+- Placeholders for its own outputs (from the `placeholder` builtin)
+- References to the outputs of other (non CA) resolved derivations
 - Existing store paths

 For a derivation `drv` whose input derivations have all been realised, we define its **associated resolved derivation** `resolved(drv)` as `drv` in which we replace every input derivation `inDrv` of `drv` by `pathOf(inDrv)` (and update the output hash accordingly).
@@ -122,9 +122,9 @@ When asked to build a derivation `drv`, we instead:
    Possibly this is a no-op because it may be that `resolved(drv)` has already been built.
 3. Add a new mapping `pathOf(drv!${output}) == ${output}(resolved(drv))` for each output `output` of `drv`

-### Building a ca derivation
+### Building a CA derivation

-A **ca derivation** is a derivation with the `__contentAddressed` argument set
+A **CA derivation** is a derivation with the `__contentAddressed` argument set
 to `true` and the `outputHashAlgo` set to a value that is a valid hash name
 recognized by Nix (see the description for `outputHashAlgo` at
 for the current allowed
@@ -215,7 +215,7 @@ We try to rebuild the new `transitivelyDependent`. What happens is the following

 [drawbacks]: #drawbacks

-- Obviously, this makes the Nix model more complicated than what it is now. In
+- Obviously, this makes the Nix model more complicated than it currently is. In
   particular, the caching model needs some modifications (see [caching]);

 - We specify that only a sub-category of derivations can safely be marked as
@@ -248,13 +248,13 @@ Eventually this RFC should be subsumed by RFC0017.
 [caching]: #caching

 The big unresolved question is about the caching of content-adressed paths.
-As [Eelco's phd thesis][nixphd] states it, caching ca paths raises a number of
+As [Eelco's phd thesis][nixphd] states, caching CA paths raises a number of
 questions when building that path is non-deterministic (because two different
 stores can have two different outputs for the same path, which might lead to
 some dependencies being duplicated in the closure of a dependency).
 There exist some solutions to this problem (including one presented in Eelco's
 thesis), but for the sake of simplicity, this RFC simply forbids to mark a
-derivation as ca if its build is not deterministic (although there's no real
+derivation as CA if its build is not deterministic (although there's no real
 way to check that so it's up to the author of the derivation to ensure that it
 is the case).

@@ -282,10 +282,10 @@ substitution in the future.

 ## Ensuring that no temporary output path leaks in the result

-One possible issue with the ca model is that the output paths get moved after being built, which breaks self-references. Hash rewriting solves this in most cases, but it is only heuristic and there is no way to truly ensure that we don't leak a self-reference (for example if a self-reference appears in a zipped file − like it's often the case for man pages or java jars, the hash-rewriting machinery won't detect it).
-Having leaking self-references is annoying since
+One possible issue with the CA model is that the output paths get moved after being built, which breaks self-references. Hash rewriting solves this in most cases, but it is only a heuristic and there is no way to truly ensure that we don't leak a self-reference (for example if a self-reference appears in a zipped file − like is often the case for man pages or Java jars, the hash-rewriting machinery won't detect it).
+Having leaking self-references is annoying since:

-- These self-references change each time the inputs of the derivation change, making ca useless (because the output will _always_ change when the input change)
+- These self-references change each time the inputs of the derivation change, making CA useless (because the output will _always_ change when the input change)
 - More annoyingly, these references become dangling and can cause runtime failures

 We however have a way to dectect these: If we have leaking self-references then the output will change if we artificially change its output path. This could be integrated in the `--check` option of `nix-store`.
@@ -299,9 +299,9 @@ ca paths with Nix, leaving as much room as possible for future extensions.
 In particular:

 - Add some path-rewriting to allow derivations with self-references to be built
-  as ca
+  as CA
 - Consolidate the caching model to allow non-deterministic derivations to be
-  built as ca
+  built as CA
 - (hopefully, one day) make the CA model the default one in Nix
 - Investigate the consequences in term of privileges requirements
 - Build a trust model on top of the content-adressed model to share store paths

From fa16e86cb0daccb1bb24ff32d3677cad9dec998a Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?J=C3=B6rg=20Thalheim?=
Date: Thu, 22 Oct 2020 14:58:28 +0200
Subject: [PATCH 20/32] Update rfcs/0062-content-addressed-paths.md

---
 rfcs/0062-content-addressed-paths.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/rfcs/0062-content-addressed-paths.md b/rfcs/0062-content-addressed-paths.md
index e256ea74f..be77a3515 100644
--- a/rfcs/0062-content-addressed-paths.md
+++ b/rfcs/0062-content-addressed-paths.md
@@ -4,7 +4,7 @@ start-date: 2019-08-14
 author: Théophane Hufschmitt
 co-authors: (find a buddy later to help our with the RFC)
 shepherd-team: @layus, @edolstra and @Ericson2314
-shepherd-leader: (name to be appointed by RFC steering committee)
+shepherd-leader: @edolstra
 related-issues: (will contain links to implementation PRs)
 ---

From 94b65bd3e8c0db4c680164e074f7bf42d7d47d0a Mon Sep 17 00:00:00 2001
From: regnat
Date: Wed, 14 Apr 2021 06:53:05 +0200
Subject: [PATCH 21/32] Update the terminology to match the in the implementation

---
 rfcs/0062-content-addressed-paths.md | 55 +++++++++++++---------------
 1 file changed, 25 insertions(+), 30 deletions(-)

diff --git a/rfcs/0062-content-addressed-paths.md b/rfcs/0062-content-addressed-paths.md
index be77a3515..d954bd1b3 100644
--- a/rfcs/0062-content-addressed-paths.md
+++ b/rfcs/0062-content-addressed-paths.md
@@ -33,8 +33,8 @@ paths.

 We also leave the option to lift these restrictions later.

-This RFC already has a (somewhat working) POC at
-.
+The implementation of this RFC is already partially integrated into Nix, behind
+the `ca-derivation` experimental flag.

 # Motivation

@@ -78,22 +78,20 @@ the `input-addressed` model

 For each output `output` of a derivation `drv`, we define

-- its `outputId` **DrvOutputId(drv, output)** as the tuple `(hash(drv), output, truster)`, where `truster` is a reserved field for future use and currently always set to `"world"`.
+- its **Output Id** `DrvOutput(drv, output)` as the tuple `(hashModulo(drv), output)`.
   This id uniquely identifies the output.
-  We textually represent this as `hash(drv)!output[@truster]`.
-- its concrete path **PathOf(outputId)** as the path on which the output will be stored on disk.
+  We textually represent this as `hashModulo(drv)!output`.
+- its **realisation** `Realisation(outputId)` containing
+  1. The path `path` at which this output is stored (either content-defined or input-defined depending on the type of derivation)
+  2. An optional set `signatures` of signatures certifying the above

-> Unresolved: should we already include the `truster` field in `DrvOutputId`
-> even if it's not used atm? What would be the cost of adding it later?
-
-In a input-addressed-only world, the concrete path for a derivation output was a pure function of this output's id that could be computed at eval-time. However this won't be the case anymore once we allow CA derivations, so we now need to store the results of the `PathOf` function in the Nix database as a new table:
+In a input-addressed-only world, the concrete path for a derivation output was a pure function of this output's id that could be computed at eval-time. However this won't be the case anymore once we allow CA derivations, so we now need to store the results of the `Realisation` function in the Nix database as a new table:

 ```sql
-create table if not exists PathOf (
-  drv integer not null,
-  output text not null,
-  truster integer not null,
-  path integer not null,
+create table if not exists Realisation (
+  drvHash integer not null,
+  outputName text not null,
+  outputPath integer not null,
 )
 ```

@@ -101,15 +99,12 @@ create table if not exists PathOf (

 #### Resolved derivations

-We define a **resolved derivation** as a derivation whose only references are either:
+As it is already internally the case in Nix, we define a **basic derivation** as a derivation that doesn't depend on any derivation output (except its own). Said otherwise, a basic derivation is a derivation whose only inputs are either

 - Placeholders for its own outputs (from the `placeholder` builtin)
-- References to the outputs of other (non CA) resolved derivations
 - Existing store paths

-For a derivation `drv` whose input derivations have all been realised, we define its **associated resolved derivation** `resolved(drv)` as `drv` in which we replace every input derivation `inDrv` of `drv` by `pathOf(inDrv)` (and update the output hash accordingly).
-
-> This doesn't have the property that for a derivation that doesn't depend on any CA derivation `resolved(drv) == drv`. I think that this is a rather big issue so we'll have to find a way to get this property back (but feel free to correct me if you think that it isn't a big deal)
+For a derivation `drv` whose input derivations have all been realised, we define its **associated resolved derivation** `resolved(drv)` as `drv` in which we replace every input derivation `inDrv` of `drv` by `Realisation(inDrv).path`, and update the output hash accordingly.
 `resolved` is (intentionally) not injective: If `drv` and `drv'` only differ because one depends on `dep` and the other on `dep'`, but `dep` and `dep'` are content-addressed and have the same output hash, then `resolved(drv)` and `resolved(drv')` will be equal.

@@ -120,7 +115,7 @@ When asked to build a derivation `drv`, we instead:
 1. Compute `resolved(drv)`
 2. Substitute and build `resolved(drv)` like a normal derivation.
    Possibly this is a no-op because it may be that `resolved(drv)` has already been built.
-3. Add a new mapping `pathOf(drv!${output}) == ${output}(resolved(drv))` for each output `output` of `drv`
+3. Add a new mapping `Realisation(drv!${output}) == ${output}(resolved(drv))` for each output `output` of `drv` (signing the mapping if needs be)

 ### Building a CA derivation

@@ -136,7 +131,7 @@ The process for building a content-adressed derivation `drv` is the following:
   For each output `$outputId` of the derivation, this gives us a (temporary) output path `$out`.
   - We compute a cryptographic hash `$chash` of `$out`[^modulo-hashing]
   - We move `$out` to `/nix/store/$chash-$name`
-  - We store the mapping `PathOf($outputId) == "/nix/store/$chash-$name"`
+  - We store the mapping `Realisation($outputId) == "/nix/store/$chash-$name"`

 [^modulo-hashing]:

@@ -171,7 +166,7 @@ rec {
 Suppose that we want to build `transitivelyDependent`.
 What will happen is the following

-1. We instantiate the Nix expression, this gives us three drv files:
+1. We instantiate the Nix expression. This gives us three derivations:
    `contentAddressed.drv`, `dependent.drv` and `transitivelyDependent.drv`
 2. We build `contentAddressed.drv`.
    - We first compute `resolved(contentAddressed.drv)`.
@@ -180,32 +175,32 @@ What will happen is the following
    - We move `out(resolved(contentAddressed.drv))` to its content-adressed path
      `ca(contentAddressed.drv)` which derives from
      `sha256(out(resolved(contentAddressed.drv)))`
-   - We register in the db that `pathOf(contentAddressed.drv!out) == ca(contentAddressed.drv)`
+   - We register in the db that `Realisation(contentAddressed.drv!out) == { .path = ca(contentAddressed.drv) }`
 3. We build `dependent.drv`
    - We first compute `resolved(dependent.drv)`.
-     This gives us a new derivation identical to `dependent.drv`, except that `contentAddressed.drv!out` is replaced by `pathOf(contentAddressed.drv!out) == ca(contentAddressed.drv)`
+     This gives us a new derivation identical to `dependent.drv`, except that `contentAddressed.drv!out` is replaced by `Realisation(contentAddressed.drv!out).path == ca(contentAddressed.drv)`
    - We realise `resolved(dependent.drv)`. This gives us an output path `out(resolved(dependent.drv))`
-   - We register in the db that `pathOf(dependent.drv!out) == out(resolved(dependent.drv))` We build `transitivelyDependent.drv`
+   - We register in the db that `Realisation(dependent.drv!out) == { .path = out(resolved(dependent.drv)) }`
 4. We build `transitivelyDependent.drv`
    - We first compute `resolved(transitivelyDependent.drv)`
-     This gives us a new derivation identical to `transitivelyDependent.drv`, except that `dependent.drv!out` is replaced by `pathOf(dependent.drv!out) == out(resolved(dependent.drv))`
+     This gives us a new derivation identical to `transitivelyDependent.drv`, except that `dependent.drv!out` is replaced by `Realisation(dependent.drv!out).path == out(resolved(dependent.drv))`
    - We realise `resolved(transitivelyDependent.drv)`. This gives us an output path `out(resolved(transitivelyDependent.drv))`
-   - We register in the db that `pathOf(transitivelyDependent.drv!out) == out(resolved(transitivelyDependent.drv))`
+   - We register in the db that `Realisation(transitivelyDependent.drv!out) == { .path = out(resolved(transitivelyDependent.drv)) }`

 Now suppose that we replace `contentAddressed` by `contentAddressed'`, which evaluates to a new derivation `contentAddressed'.drv` such that the output of `contentAddressed'.drv` is the same as the output of `contentAddressed.drv` (say we change a comment in a source file of `contentAddressed`).
 We try to rebuild the new `transitivelyDependent`. What happens is the following:

-1. We instantiate the Nix expression, this gives us three new drv files:
+1. We instantiate the Nix expression. This gives us three new derivations:
    `contentAddressed'.drv`, `dependent'.drv` and `transitivelyDependent'.drv`
 2. We build `contentAddressed'.drv`.
    - We first compute `resolved(contentAddressed'.drv)`
    - We realise `resolved(contentAddressed'.drv)`. This gives us an output path `out(resolved(contentAddressed'.drv))`
    - We compute `ca(contentAddressed'.drv)` and notice that the path already exists (since it's the same as the one we built previously), so we discard the result.
-   - We register in the db that `pathOf(contentAddressed.drv'!out) == ca(contentAddressed'.drv)` ( also equals to `ca(contentAddressed.drv)`)
+   - We register in the db that `Realisation(contentAddressed.drv'!out) == { .path = ca(contentAddressed'.drv) }` ( also equals to `Realisation(contentAddressed.drv!out)`)
 3. We build `dependent'.drv`
    - We first compute `resolved(dependent'.drv)`.
-     This gives us a new derivation identical to `dependent'.drv`, except that `contentAddressed'.drv!out` is replaced by `pathOf(contentAddressed'.drv!out) == ca(contentAddressed'.drv)`
+     This gives us a new derivation identical to `dependent'.drv`, except that `contentAddressed'.drv!out` is replaced by `Realisation(contentAddressed'.drv!out).path == ca(contentAddressed'.drv)`
    - We notice that `resolved(dependent'.drv) == resolved(dependent.drv)` (since `ca(contentAddressed'.drv) == ca(contentAddressed.drv)`), so we just return the already existing path
 4. We build `transitivelyDependent'.drv`
    - We first compute `resolved(transitivelyDependent'.drv)`

From 7ed44819760f1ce9f3e6e8788c5da04f33d9f55d Mon Sep 17 00:00:00 2001
From: regnat
Date: Wed, 14 Apr 2021 10:54:08 +0200
Subject: [PATCH 22/32] Reword the detailed design presentation

---
 rfcs/0062-content-addressed-paths.md | 36 ++++++++++++++++++++--------
 1 file changed, 26 insertions(+), 10 deletions(-)

diff --git a/rfcs/0062-content-addressed-paths.md b/rfcs/0062-content-addressed-paths.md
index d954bd1b3..219e43d25 100644
--- a/rfcs/0062-content-addressed-paths.md
+++ b/rfcs/0062-content-addressed-paths.md
@@ -57,16 +57,32 @@ cutoffs.

 [design]: #detailed-design

-The gist of the design is that:
-
-- Derivations can be marked as content-adressed (ca), in which case each
-  of their outputs will be moved to a CA store path.
-  This extends the current notion of "fixed-output" derivations.
-- We introduce the notion of "resolving" a derivation, which extends to
-  arbitrary `ca` derivations the current behavior of replacing fixed-outputs
-  derivations by their output hash.
-- We refine the build process so that every derivation is normalized
-  before being realized
+When it comes to computing the output paths of a derivation, the current Nix
+model, known as the “input-addressd” model (also sometimes referred to as the
+“extensional” model) works (roughly) as follows:
+
+- A Derivation is a data-structure that specifies how to build a package.
+  Derivations can refer to other derivations
+- All these derivations have a “hash-modulo” associated to them, which is defined by:
+  - Some derivations known as “fixed-output” have a known result (for example
+    because they fetch a tarball from the internet, and we assume that this
+    tarball will stay immutable).
+    These have their output hash manually defined (and this hash will be
+    checked against the actual hash of their output when they get built)
+  - All the others have a hash that's recursively computed by the following algorithm:
+    - If a derivation doesn't depend on any other derivation, then we just hash its representation,
+    - Otherwise, we substitute each occurence of a dependency by its hash modulo and hash the result.
+- For each output of a derivation, we compute the associated output path by
+  hashing the hash modulo of the derivation and the output name.
+
+This proposal adds a new kind of derivation: “floating content-addressed
+derivations”, which are similar to fixed-output derivations in that they are
+stored in a content-addressed path, but don't have this output hash specified
+ahead of time.
+
+For this to work properly, we need to extend the current build process, as well
+as the caching and remote building systems so that they are able to take into
+account the specificies of these new derivations.

 ## Nix-build process

From fb4c61d0a2f0432a708a040bc55635c802121c37 Mon Sep 17 00:00:00 2001
From: regnat
Date: Wed, 14 Apr 2021 10:54:38 +0200
Subject: [PATCH 23/32] Quote some strings in the yaml frontmatter

---
 rfcs/0062-content-addressed-paths.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/rfcs/0062-content-addressed-paths.md b/rfcs/0062-content-addressed-paths.md
index 219e43d25..a9a3968aa 100644
--- a/rfcs/0062-content-addressed-paths.md
+++ b/rfcs/0062-content-addressed-paths.md
@@ -3,8 +3,8 @@ feature: Simple content-adressed store paths
 start-date: 2019-08-14
 author: Théophane Hufschmitt
 co-authors: (find a buddy later to help our with the RFC)
-shepherd-team: @layus, @edolstra and @Ericson2314
-shepherd-leader: @edolstra
+shepherd-team: "@layus, @edolstra and @Ericson2314"
+shepherd-leader: "@edolstra"
 related-issues: (will contain links to implementation PRs)
 ---

From 841fe3f5f6126b45876c8c3da38d29386b51bdcb Mon Sep 17 00:00:00 2001
From: regnat
Date: Wed, 14 Apr 2021 11:21:45 +0200
Subject: [PATCH 24/32] Add a design paragraph about the remote caching
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

And update the “unresolved questions” to take it into account
---
 rfcs/0062-content-addressed-paths.md | 87 +++++++++++++++++-----------
 1 file changed, 54 insertions(+), 33 deletions(-)

diff --git a/rfcs/0062-content-addressed-paths.md b/rfcs/0062-content-addressed-paths.md
index a9a3968aa..e84be8f2a 100644
--- a/rfcs/0062-content-addressed-paths.md
+++ b/rfcs/0062-content-addressed-paths.md
@@ -155,7 +155,7 @@ The process for building a content-adressed derivation `drv` is the following:
     computing the hash and rewrite them when moving the path to handle paths with
     self-references, but this isn't strictly required for a first iteration

-## Example
+### Example

 In this example, we have the following Nix expression:

@@ -222,6 +222,27 @@ We try to rebuild the new `transitivelyDependent`. What happens is the following
    - We first compute `resolved(transitivelyDependent'.drv)`
    - Here again, we notice that `resolved(transitivelyDependent'.drv)` is the same as `resolved(transitivelyDependent.drv)`, so we don't build anything

+## Remote caching
+
+A consequence of this change is that a store path is now just a meaningless
+blob of data if it doesn't have its associated `realisation` metadata −
+besides, Nix can't know the output path of a content-addressed derivation
+before building it anymore, so it can't ask the remote store for it.
+
+As a consequence, the remote cache protocols is extended to not simply
+work on store paths, but rather at the realisation level:
+
+- The store interface now specifies a new method
+  ```
+  queryRealisation : DrvOutput -> Maybe Realisation
+  ```
+- The substitution loop in Nix fist calls this method to ask the remote for the
+  realisation of the current derivation output.
+  If this first call succeeds, then it fetches the corresponding output path
+  like before. Then, it registers the realisation in the database.
+- The binary caches now have a new toplevel folder `/realisations` storing
+  these realisations
 # Drawbacks

 [drawbacks]: #drawbacks
@@ -254,52 +275,54 @@ Eventually this RFC should be subsumed by RFC0017.

 [unresolved]: #unresolved-questions

-## Caching
+## Caching of non-deterministic paths

 [caching]: #caching

-The big unresolved question is about the caching of content-adressed paths.
+A big question is about mixing remote-caching and non-determinism.
 As [Eelco's phd thesis][nixphd] states, caching CA paths raises a number of
 questions when building that path is non-deterministic (because two different
 stores can have two different outputs for the same path, which might lead to
 some dependencies being duplicated in the closure of a dependency).
-There exist some solutions to this problem (including one presented in Eelco's
-thesis), but for the sake of simplicity, this RFC simply forbids to mark a
-derivation as CA if its build is not deterministic (although there's no real
-way to check that so it's up to the author of the derivation to ensure that it
-is the case).
-
-## Client support
-
-The bulk of the job here is done by the Nix daemon.
-
-Depending on the details of the current Nix implementation, there might or
-might not be a need for the client to also support it (which would require the
-daemon and the client to be updated in synchronously)

-## Old Nix versions and caching
+The current implementation has a naive approach that just forbids fetching a
+path if the local system has a different realisation for the same drv output.
+This approach is simple and correct, but it's possible that it might not be
+good-enough in practice as it can result in a totally useless binary cache in
+some pathological cases.

-What happens (and should happen) if a Nix not supporting the cas model queries
-a cache with cas paths in it is not clear yet.
+There exist some better solutions to this problem (including one presented in
+Eelco's thesis), but there are much more complex, so it's probably not worth
+investing in them until we're sure that they are needed.

 ## Garbage collection

-Another major open issue is garbage collection of the aliases table. It's not
-clear when entries should be deleted. The paths in the domain are "fake" so we
-can't use them for expiration. The paths in the codomain could be used (i.e. if
-a path is GC'ed, we delete the alias entries that map to it) but it's not clear
-whether that's desirable since you may want to bring back the path via
+Another major open issue is garbage collection of the realisations table. It's
+not clear when entries should be deleted. The paths in the domain are "fake" so
+we can't use them for expiration. The paths in the codomain could be used (i.e.
+if a path is GC'ed, we delete the alias entries that map to it) but it's not
+clear whether that's desirable since you may want to bring back the path via
 substitution in the future.

 ## Ensuring that no temporary output path leaks in the result

-One possible issue with the CA model is that the output paths get moved after being built, which breaks self-references. Hash rewriting solves this in most cases, but it is only a heuristic and there is no way to truly ensure that we don't leak a self-reference (for example if a self-reference appears in a zipped file − like is often the case for man pages or Java jars, the hash-rewriting machinery won't detect it).
-Having leaking self-references is annoying since:
+One possible issue with the CA model is that the output paths get moved after
+being built, which breaks self-references. Hash rewriting solves this in most
+cases, but it is only a heuristic and there is no way to truly ensure that we
+don't leak a self-reference (for example if a self-reference appears in a
+zipped file − like is often the case for man pages or Java jars, the
+hash-rewriting machinery won't detect it). Having leaking self-references is
+annoying since:

-- These self-references change each time the inputs of the derivation change, making CA useless (because the output will _always_ change when the input change)
-- More annoyingly, these references become dangling and can cause runtime failures
+- These self-references change each time the inputs of the derivation change,
+  making CA useless (because the output will _always_ change when the input
+  change)
+- More annoyingly, these references become dangling and can cause runtime
+  failures

-We however have a way to dectect these: If we have leaking self-references then the output will change if we artificially change its output path. This could be integrated in the `--check` option of `nix-store`.
+We however have a way to dectect these: If we have leaking self-references then
+the output will change if we artificially change its output path. This could be
+integrated in the `--check` option of `nix-store`.

 # Future work

@@ -309,10 +332,8 @@ This RFC tries as much as possible to provide a solid foundation for building
 ca paths with Nix, leaving as much room as possible for future extensions.
 In particular:

-- Add some path-rewriting to allow derivations with self-references to be built
-  as CA
-- Consolidate the caching model to allow non-deterministic derivations to be
-  built as CA
+- Consolidate the caching model to make it more efficient in presence of
+  non-deterministic derivations
 - (hopefully, one day) make the CA model the default one in Nix
 - Investigate the consequences in term of privileges requirements
 - Build a trust model on top of the content-adressed model to share store paths

From 27bd048a669895d0c3d5167aad99275296c2cd62 Mon Sep 17 00:00:00 2001
From: regnat
Date: Wed, 14 Apr 2021 13:50:42 +0200
Subject: [PATCH 25/32] Lift the determinism requirement

It's not required anymore by the current remote caching semantics
---
 rfcs/0062-content-addressed-paths.md | 9 +++------
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/rfcs/0062-content-addressed-paths.md b/rfcs/0062-content-addressed-paths.md
index e84be8f2a..35216caf6 100644
--- a/rfcs/0062-content-addressed-paths.md
+++ b/rfcs/0062-content-addressed-paths.md
@@ -22,14 +22,11 @@ By making this opt-in, we can impose arbitrary limitations to the paths that
 are allowed to be ca to avoid some tricky issues that can arise with
 content-adressability.

-In particular, we restrict ourselves to paths that are:
-
-- only include textual self-references (_e.g._ no self-reference hidden inside a zip file)
-- known to be deterministic (for caching reasons, see [caching]).
+In particular, we restrict ourselves to paths that only include textual
+self-references (_e.g._ no self-reference hidden inside a zip file).

 That way we don't have to worry about the fact that hash-rewriting is only an
-approximation nor about the semantics of the distribution of non-deterministic
-paths.
+approximation

 We also leave the option to lift these restrictions later.

From 1e8fab71a8433c915fa0a55382b3b4ddf066abe2 Mon Sep 17 00:00:00 2001
From: Eelco Dolstra
Date: Mon, 31 May 2021 16:06:01 +0200
Subject: [PATCH 26/32] Typo

Co-authored-by: davidak
---
 rfcs/0062-content-addressed-paths.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/rfcs/0062-content-addressed-paths.md b/rfcs/0062-content-addressed-paths.md
index 35216caf6..5924801ca 100644
--- a/rfcs/0062-content-addressed-paths.md
+++ b/rfcs/0062-content-addressed-paths.md
@@ -55,7 +55,7 @@ cutoffs.
 [design]: #detailed-design

 When it comes to computing the output paths of a derivation, the current Nix
-model, known as the “input-addressd” model (also sometimes referred to as the
+model, known as the “input-addressed” model (also sometimes referred to as the
 “extensional” model) works (roughly) as follows:

 - A Derivation is a data-structure that specifies how to build a package.

From 97726251f432a49225e90b3901df7c9a1f8dc0d3 Mon Sep 17 00:00:00 2001
From: Eelco Dolstra
Date: Mon, 31 May 2021 16:07:14 +0200
Subject: [PATCH 27/32] Apply suggestions from code review

Co-authored-by: davidak
---
 rfcs/0062-content-addressed-paths.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/rfcs/0062-content-addressed-paths.md b/rfcs/0062-content-addressed-paths.md
index 5924801ca..95c2725b9 100644
--- a/rfcs/0062-content-addressed-paths.md
+++ b/rfcs/0062-content-addressed-paths.md
@@ -12,11 +12,11 @@ related-issues: (will contain links to implementation PRs)

 [summary]: #summary

-Add some basic but simple support for content-adressed store paths to Nix.
+Add some basic but simple support for content-addressed store paths to Nix. We plan here to give the possibility to mark certain store paths as content-adressed (ca), while keeping the other input-adressed as they are -now (modulo some mandatory drv rewriting before the build, see below) +now (modulo some mandatory drv rewriting before the build, see below). By making this opt-in, we can impose arbitrary limitations to the paths that are allowed to be ca to avoid some tricky issues that can arise with @@ -26,7 +26,7 @@ In particular, we restrict ourselves to paths that only include textual self-references (_e.g._ no self-reference hidden inside a zip file). That way we don't have to worry about the fact that hash-rewriting is only an -approximation +approximation. We also leave the option to lift these restrictions later. From 02ae2b5ffd63879b49f36c7664dfdabfd8bd5ef6 Mon Sep 17 00:00:00 2001 From: regnat Date: Wed, 2 Jun 2021 09:05:51 +0200 Subject: [PATCH 28/32] Rewrite the RFC MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit This is mostly a full-rewrite of the RFC to 1. Make it more “incremental”: A first part just describes the minimal model upon which everything is based, and a second one shows different extensions of this model to add more features 2. Remove the big ugly examples that don’t add much value because they aren’t really readable 3. Add a python pseudo-code pseudo-formalisation of the RFC. This is imho both more readable and precise than nested bullet-points of handwaved language --- rfcs/0062-content-addressed-paths.md | 469 ++++++++++++++++++--------- 1 file changed, 312 insertions(+), 157 deletions(-) diff --git a/rfcs/0062-content-addressed-paths.md b/rfcs/0062-content-addressed-paths.md index 95c2725b9..f9952f8bd 100644 --- a/rfcs/0062-content-addressed-paths.md +++ b/rfcs/0062-content-addressed-paths.md @@ -54,38 +54,71 @@ cutoffs. 
[design]: #detailed-design +*In everything that follows, most algorithms and data-structures will be expressed as pseudo-python snippets* + When it comes to computing the output paths of a derivation, the current Nix model, known as the “input-addressed” model (also sometimes referred to as the “extensional” model) works (roughly) as follows: -- A Derivation is a data-structure that specifies how to build a package. - Derivations can refer to other derivations -- All these derivations have a “hash-modulo” associated to them, which is defined by: - - Some derivations known as “fixed-output” have a known result (for example - because they fetch a tarball from the internet, and we assume that this - tarball will stay immutable). - These have their output hash manually defined (and this hash will be - checked against the actual hash of their output when they get built) - - All the others have a hash that's recursively computed by the following algorithm: - - If a derivation doesn't depend on any other derivation, then we just hash its representation, - - Otherwise, we substitute each occurence of a dependency by its hash modulo and hash the result. -- For each output of a derivation, we compute the associated output path by - hashing the hash modulo of the derivation and the output name. - -This proposal adds a new kind of derivation: “floating content-addressed -derivations”, which are similar to fixed-output derivations in that they are -stored in a content-addressed path, but don't have this output hash specified -ahead of time. - -For this to work properly, we need to extend the current build process, as well -as the caching and remote building systems so that they are able to take into -account the specificies of these new derivations. - -## Nix-build process - -For the sake of clarity, we will refer to the current model (where the -derivations are indexed by their inputs, also sometimes called "extensional") as -the `input-addressed` model +1. 
A Nix language expression gets evaluated to a `derivation` +2. This `derivation` is a data-structure describing how to build a package. In particular it contains + 1. A set of derivation outputs which will be used as input for the build + 2. A set of store paths that will be used as input for the build + 3. The build recipe proper (a script to run, with a set of environment + variables). This recipe can refer input paths or derivations by + interpolating their store path. + 4. The output paths into which the derivation will be installed. + These are computed from a hash of the other elements of the derivation. + +The “input-addressed” designation comes from the way the output paths are +computed: They derive from the derivation data-structure, which is the input of +the build. + +The idea behind the “content-addressed” model is that rather than deriving +these output paths from the inputs of the build, we derive them from the output +(the produced store path). + +Nix already supports a special-case of content-addressed derivations with the +so-called “fixed-output” derivations. These are derivations that are +content-addressed, but whose output hash has to be specified in advance, and +are used in particular to fetch data from the internet (as the constraint that +the hash has to be specified in advance means that we can relax the sandbox for +these derivations). + +To fully support this content-addressed model, we need to extend the current +build process, as well as the caching and remote building systems so that they +are able to take into account the specificies of these new derivations. + +Fully supporting content-addressed derivations requires some deep changes to the Nix model. +For the sake of readability, we’ll first present a simplistic model that support them in a very basic way, and then extend this model in several different ways to improve the support. 
+ +## Basic support + +The input-addressed build process is roughly the following: + +```python +def nix_build(expr : NixExpr) -> [StorePath] : + resulting_derivation = eval(expr) + build_derivation( + resulting_derivation, + resulting_derivation.all_outputs(), + ) + return resulting_derivation.all_output_paths() + +def build_derivation(derivation : Derivation, outputsToBuild: [str]) -> (): + # Build all the inputs + for (inputDrv, requiredOutputs) in derivation.inputDrvs: + build_derivation(inputDrv, requiredOutputs) + # Run the build script, now that all the inputs are here + runBuildScript(derivation) +``` + +The main change required by the content-addressed model is that we can’t know +the output paths of a derivation before building it. + +This means that the Derivations as they are produced by the evaluator can’t +either know their output path, nor explicitely refer to their dependencies by +their output path. ### Output mappings @@ -94,11 +127,26 @@ For each output `output` of a derivation `drv`, we define - its **Output Id** `DrvOutput(drv, output)` as the tuple `(hashModulo(drv), output)`. This id uniquely identifies the output. We textually represent this as `hashModulo(drv)!output`. -- its **realisation** `Realisation(outputId)` containing - 1. The path `path` at which this output is stored (either content-defined or input-defined depending on the type of derivation) - 2. An optional set `signatures` of signatures certifying the above +- its **realisation** `Realisation(outputId)` containing the path `outputPath` at which this output is stored (either content-defined or input-defined depending on the type of derivation) + +```python +class DrvOutput: + derivationHash : Hash + outputName : str + +class Realisation: + id : DrvOutput + outputPath : StorePath +``` + +In a input-addressed-only world, the concrete path for a derivation output was a pure function of this output's id that could be computed at eval-time. 
However this won't be the case anymore once we allow CA derivations, so we now need a way to register this information in the store: -In a input-addressed-only world, the concrete path for a derivation output was a pure function of this output's id that could be computed at eval-time. However this won't be the case anymore once we allow CA derivations, so we now need to store the results of the `Realisation` function in the Nix database as a new table: +```python +def registerRealisation(store : Store, realisation : Realisation): + ... +``` + +For the local store, this function will store the realisation information in the Nix database as a new table: ```sql create table if not exists Realisation ( @@ -108,118 +156,139 @@ create table if not exists Realisation ( ) ``` -### Building a non-ca derivation - -#### Resolved derivations +### Resolved derivations As it is already internally the case in Nix, we define a **basic derivation** as a derivation that doesn't depend on any derivation output (except its own). Said otherwise, a basic derivation is a derivation whose only inputs are either - Placeholders for its own outputs (from the `placeholder` builtin) - Existing store paths -For a derivation `drv` whose input derivations have all been realised, we define its **associated resolved derivation** `resolved(drv)` as `drv` in which we replace every input derivation `inDrv` of `drv` by `Realisation(inDrv).path`, and update the output hash accordingly. +For a derivation `drv` whose input derivations have all been realised, we define its **associated resolved derivation** `resolved(drv)` as `drv` in which +we replace every input derivation `inDrv` of `drv` by `Realisation(inDrv).path`. `resolved` is (intentionally) not injective: If `drv` and `drv'` only differ because one depends on `dep` and the other on `dep'`, but `dep` and `dep'` are content-addressed and have the same output hash, then `resolved(drv)` and `resolved(drv')` will be equal. 
-#### Build process - -When asked to build a derivation `drv`, we instead: - -1. Compute `resolved(drv)` -2. Substitute and build `resolved(drv)` like a normal derivation. - Possibly this is a no-op because it may be that `resolved(drv)` has already been built. -3. Add a new mapping `Realisation(drv!${output}) == ${output}(resolved(drv))` for each output `output` of `drv` (signing the mapping if needs be) - -### Building a CA derivation - -A **CA derivation** is a derivation with the `__contentAddressed` argument set -to `true` and the `outputHashAlgo` set to a value that is a valid hash name -recognized by Nix (see the description for `outputHashAlgo` at - for the current allowed -values). - -The process for building a content-adressed derivation `drv` is the following: - -- We build it like a normal derivation (see above). - For each output `$outputId` of the derivation, this gives us a (temporary) output path `$out`. - - We compute a cryptographic hash `$chash` of `$out`[^modulo-hashing] - - We move `$out` to `/nix/store/$chash-$name` - - We store the mapping `Realisation($outputId) == "/nix/store/$chash-$name"` - -[^modulo-hashing]: - - We can possibly normalize all the self-references before - computing the hash and rewrite them when moving the path to handle paths with - self-references, but this isn't strictly required for a first iteration - -### Example - -In this example, we have the following Nix expression: - -```nix -rec { - contentAddressed = mkDerivation { - name = "contentAddressed"; - __contentAddressed = true; - … # Some extra arguments - }; - dependent = mkDerivation { - name = "dependent"; - buildInputs = [ contentAddressed ]; - … # Some extra arguments - }; - transitivelyDependent = mkDerivation { - name = "transitivelyDependent"; - buildInputs = [ dependent ]; - … # Some extra arguments - }; -} +### content-addressed build process + +We now need to update the build process as: + +```python +def build_derivation(derivation : Derivation, 
outputsToBuild: [str]) -> Map[DrvOutput, Realisation]: + inputRealisations : Map[DrvOutput, Realisation] = {} + # Build all the inputs, and store the newly built realisations + for (inputDrv, requiredOutputs) in derivation.inputDrvs: + inputRealisations += build_derivation(inputDrv, requiredOutputs) + + # We now need to “resolve” our realisation to replace all the symbolic + # references to its inputs by their actual store path + derivationToBuild : BasicDerivation = resolved(inputDrv, inputRealisations) + + # At that point, we might realise that the resolved derivation is actually + # something that we have already built. In that case we just return + # the existing result. + if (isBuilt(derivationToBuild)): + return queryOutputs(derivationToBuild, outputsToBuild) + + # The build script needs to know where to install stuff (so that for + # example `make install` can work properly). + # We obviously don’t know the final path yet, but we can assign some + # temporary output paths to the derivation that will be used during the + # build. + assignScratchOutputPaths(derivationToBuild) + + # Run the build script on the new resolved derivation + runBuildScript(derivationToBuild) + + # Move the newly built outputs to their final (content-addressed) paths, + # and return the corresponding realisations. + return moveToCAPaths(derivationToBuild.outputs) +``` + +## Extensions + +### Self-references + +A store path `/nix/store/abc-foo` is said to be **self-referential** if the +content of the path mentions the path `/nix/store/abc-foo` itself (and this +mention of the store path is called a **self-reference**). + +A lot of store paths happen to be self-referential (for example a path that contains both an dynamic library and an executable using that library will likely have the `rpath` of the exectuable mention the absolute path to the library). + +It happens that these are problematic with content-addressed derivations, because +1. 
A self-reference means that the output path depends on the temporary path that has been used during the build (potentially breaking reproducibility as there’s no guaranty for this path to be stable), +2. More annoyingly, a self-reference means that the path can’t be moved freely (otherwise the self-reference would become dangling). + +However, under the assumption that self-references only appear textually in the output (*i.e* running strings on a file that contains self-references will print all the self-references out), we can: + +- Build the derivation on a temporary directory (`/nix/store/someArbitraryHash-foo`, the path provided by the function `assignScratchOutputPaths` above) +- Replace all the occurences of `someArbitraryHash` by a fixed magic value +- Compute the hash of the resulting path to determine the final path +- Replace the occurences of the magic value by the final path hash +- Move the result to the final path. + +This is obviously a hack, however it seems to work very well in practice, due to the fact that: +- The string that we search for is a cryptographic hash that’s unlikely to occur by accident in the output path, +- Very few programs store self-references in a non-purely textual way + +In addition, it is possible to detect the cases where this hash-rewriting isn’t total (see [the corresponding future work](#ensuring-that-no-temporary-output-path-leaks-in-the-result)). + +### Mixing CA and non-CA derivations + +The model so far assumes that the whole world switches to content-addressed derivations. +It’s however possible to freely mix content- and input-addressed derivations in the same Nix store, and even in the same closure: + +The algorithm for building content-addressed derivations extends the algorithm for building input-addressed derivations in two ways: +1. Before running the build script, it resolves the derivation +2. When running the build script, it uses some temporary outputs, and moves them to their final location afterwards. 
+ +Only the second part assumes that the derivation is content-addressed, and we can use two-different code-paths for the build-step: + +```python +def build_derivation(derivation : Derivation, outputsToBuild: [str]) -> Map[DrvOutput, Realisation]: + # Build the dependencies and resolve the derivation like before + derivationToBuild = ... + + if (derivationToBuild.isContentAddressed()): + assignScratchOutputPaths(derivationToBuild) + runBuildScript(derivationToBuild) + return moveToCAPaths(derivationToBuild.outputs) + else: + runBuildScript(derivationToBuild) + # If the derivation isn’t content-addressed, then it already knows its + # own output paths + return derivationToBuild.outputs() +``` + +For backwards-compatibility, we must change the algorithm a bit further: Resolving an input-addressed derivation changes its input derivation and input path sets (it replaces every input derivation by the corresponding store paths). +This means that it also has to change the output paths (as these depend on the inputs of the derivation). + +That’s something that we don’t want for the derivations that are already valid today, so we must bypass the resolving step for these derivations (which is okay as these derivations don’t need to be resolved). 
+ +```python +def build_derivation(derivation : Derivation, outputsToBuild: [str]) -> Map[DrvOutput, Realisation]: + inputRealisations : Map[DrvOutput, Realisation] = {} + # Build all the inputs, and store the newly built realisations + for (inputDrv, requiredOutputs) in derivation.inputDrvs: + inputRealisations += build_derivation(inputDrv, requiredOutputs) + + derivationToBuild = + derivation if derivation.isStrictlyInputAddressed() + else resolved(derivation, inputRealisations) + + if (derivationToBuild.isContentAddressed()): + assignScratchOutputPaths(derivationToBuild) + runBuildScript(derivationToBuild) + return moveToCAPaths(derivationToBuild.outputs) + else: + runBuildScript(derivationToBuild) + # If the derivation isn’t content-addressed, then it already knows its + # own output paths + return derivationToBuild.outputs() ``` -Suppose that we want to build `transitivelyDependent`. -What will happen is the following - -1. We instantiate the Nix expression. This gives us three derivations: - `contentAddressed.drv`, `dependent.drv` and `transitivelyDependent.drv` -2. We build `contentAddressed.drv`. - - We first compute `resolved(contentAddressed.drv)`. - - We realise `resolved(contentAddressed.drv)`. This gives us an output path - `out(resolved(contentAddressed.drv))` - - We move `out(resolved(contentAddressed.drv))` to its content-adressed path - `ca(contentAddressed.drv)` which derives from - `sha256(out(resolved(contentAddressed.drv)))` - - We register in the db that `Realisation(contentAddressed.drv!out) == { .path = ca(contentAddressed.drv) }` -3. We build `dependent.drv` - - We first compute `resolved(dependent.drv)`. - This gives us a new derivation identical to `dependent.drv`, except that `contentAddressed.drv!out` is replaced by `Realisation(contentAddressed.drv!out).path == ca(contentAddressed.drv)` - - We realise `resolved(dependent.drv)`. 
This gives us an output path - `out(resolved(dependent.drv))` - - We register in the db that `Realisation(dependent.drv!out) == { .path = out(resolved(dependent.drv)) }` -4. We build `transitivelyDependent.drv` - - We first compute `resolved(transitivelyDependent.drv)` - This gives us a new derivation identical to `transitivelyDependent.drv`, except that `dependent.drv!out` is replaced by `Realisation(dependent.drv!out).path == out(resolved(dependent.drv))` - - We realise `resolved(transitivelyDependent.drv)`. This gives us an output path `out(resolved(transitivelyDependent.drv))` - - We register in the db that `Realisation(transitivelyDependent.drv!out) == { .path = out(resolved(transitivelyDependent.drv)) }` - -Now suppose that we replace `contentAddressed` by `contentAddressed'`, which evaluates to a new derivation `contentAddressed'.drv` such that the output of `contentAddressed'.drv` is the same as the output of `contentAddressed.drv` (say we change a comment in a source file of `contentAddressed`). -We try to rebuild the new `transitivelyDependent`. What happens is the following: - -1. We instantiate the Nix expression. This gives us three new derivations: - `contentAddressed'.drv`, `dependent'.drv` and `transitivelyDependent'.drv` -2. We build `contentAddressed'.drv`. - - We first compute `resolved(contentAddressed'.drv)` - - We realise `resolved(contentAddressed'.drv)`. This gives us an output path `out(resolved(contentAddressed'.drv))` - - We compute `ca(contentAddressed'.drv)` and notice that the path already exists (since it's the same as the one we built previously), so we discard the result. - - We register in the db that `Realisation(contentAddressed.drv'!out) == { .path = ca(contentAddressed'.drv) }` ( also equals to `Realisation(contentAddressed.drv!out)`) -3. We build `dependent'.drv` - - We first compute `resolved(dependent'.drv)`. 
- This gives us a new derivation identical to `dependent'.drv`, except that `contentAddressed'.drv!out` is replaced by `Realisation(contentAddressed'.drv!out).path == ca(contentAddressed'.drv)` - - We notice that `resolved(dependent'.drv) == resolved(dependent.drv)` (since `ca(contentAddressed'.drv) == ca(contentAddressed.drv)`), so we just return the already existing path -4. We build `transitivelyDependent'.drv` - - We first compute `resolved(transitivelyDependent'.drv)` - - Here again, we notice that `resolved(transitivelyDependent'.drv)` is the same as `resolved(transitivelyDependent.drv)`, so we don't build anything - -## Remote caching +### Remote caching + +#### Basic principles A consequence of this change is that a store path is now just a meaningless blob of data if it doesn't have its associated `realisation` metadata − @@ -230,15 +299,106 @@ As a consequence, the remote cache protocols is extended to not simply work on store paths, but rather at the realisation level: - The store interface now specifies a new method + ```python + def queryRealisation(output : DrvOutput) -> Maybe Realisation ``` - queryRealisation : DrvOutput -> Maybe Realisation - ``` + + If the store knows about the given derivation output, it will return the associated realisation, otherwise it will return `None`. - The substitution loop in Nix fist calls this method to ask the remote for the realisation of the current derivation output. If this first call succeeds, then it fetches the corresponding output path - like before. Then, it registers the realisation in the database. -- The binary caches now have a new toplevel folder `/realisations` storing - these realisations + like before. 
Then, it registers the realisation in the database: + + ```python + def substitute_realisation(substituter : Store, wantedOutput : DrvOutput) -> Maybe Realisation: + maybeRealisation = substituter.queryRealisation(wantedOutput) + if not maybeRealisation: + return None + substitute_path(substituter, maybeRealisation.outputPath) + return maybeRealisation + ``` + +On the binary cache side, they now have a new toplevel folder `/realisation` to store these realisations. +This folder contains a set of files of the form `{drvOutput}.doi`, each of them containing a Json serialisation of the realisation corresponding to the given `drvOutput`. + +#### The “two-glibc” issue + +As stated in [Eelco’s thesis][nixphd], remote caching of content-addressed derivations can be problematic in conjonction with non-determinism: + +A typical scenario where this can happen is: + +- Alice has `glibc` and `libfoo` built on her local store (with `libfoo` depending on `glibc`) +- She wants to build `firefox`, which depends on `libfoo` and `libbar` +- It happens that Bob-the-binary-cache contains `libbar`. `libbar` depends on `glibc`, but because the build of `glibc` isn’t deterministic, Bob actually has a different `glibc` (living in a different store path) than Alice. +- Alice fetches `libbar` from Bob. She also fetches Bob’s `glibc` as it’s a dependency of `libbar` +- Now alice uses `libfoo` and `libbar` to build `firefox`. But that means that `firefox` has both Alice’s `glibc` and Bob’s `glibc` in his closure (despite having only one specified in the derivation). After five hours of building, she starts `firefox` and it crashes with a cryptic “duplicated symbol” error. Now Alice is angry because Nix didn’t deliver on its promise of reproducibility and reliability. + +The easiest way out of here is to make sure that Alice can’t have two different outputs for the same `glibc` dependency locally. So in the present case, she can’t use the `libfoo` that Bob offers as it wouldn’t be compatible. 
+ +The first step to that end, is to enforce the fact that a store can’t have more than one realisation for each derivation output. So it’s illegal to register the realisation for Alice’s `glibc` and Bob’s `glibc` at the same time. +We must also extend the notion of Realisation to keep track of their dependencies: In the example above, when the substitution mechanism will try to substitute a realisation for `libfoo` from Bob it, it will query Bob for the realisation, see that its output path is `/nix/store/abc-libfoo` and substitute this path (with its dependencies, so including `/nix/store/123-glibc`). But it will never try to register a realisation for Glibc. + +To fix this, we must extend a bit the notion of realisation, to keep track of its dependencies: On Bob’s store, `libfoo` is realised as `/nix/store/abc-libfoo`, but this realisation depends on the fact that `glibc` is realised as `/nix/store/123-glibc`. + +- Realisations now contain a `dependencies` field, which is a map from `DrvOutput` to `StorePath`: + + ```python + class Realisation: + id : DrvOutput + outputPath : StorePath + dependencies : Map[DrvOutput, StorePath] + ``` +- We add the constraint that realisations should form a closure in a store, meaning that if a store has the realisation for `foo!out` with a dependency on `bar!out->/nix/store/bar`, then the store must also have a realisation for `bar!out` whose output path is `/nix/store/bar` +- The realisation loop now keep tracks of these realisations to enforce this closure invariant: + ```python + # Returns true (and warns) iff we already have a realisation for the given + # derivation output, and that realisation has a different output path + # than the expected one. 
+ def is_incompatible(drvOutput, expectedStorePath): + maybeLocalRealisation = localStore.queryRealisation(drvOutput) + if (maybeLocalRealisation and maybeLocalRealisation.outputPath != expectedStorePath): + warn(f"The substituter {substituter} has an incompatible realisation for {dependentDrvOutput}") + return true + return false + + + def substitute_realisation(substituter : Store, wantedOutput : DrvOutput) -> Maybe Realisation: + maybeRealisation = substituter.queryRealisation(wantedOutput) + if not maybeRealisation: + return None + + # Try substituting the derivations we depend on + for (dependentDrvOutput, expectedStorePath) in maybeRealisation.dependencies: + if is_incompatible(dependentDrvOutput, expectedStorePath) + return None + else: + substitute_realisation(substituter, wantedOutput) + + # Finally substitute the store path itself + substitute_path(substituter, maybeRealisation.outputPath) + return maybeRealisation + ``` + +### Signatures + +Input-addressed paths need to be signed because there’s no way to verify their content (short of rebuilding them and praying that the build is deterministic of course): If `/nix/store/123-foo` is input-addressed, then there’s no direct relation between the hash `123` and the content of the store path. + +Content-addressed paths on the other hand don’t need a signature: If `/nix/store/123-foo` is content-addressed, then `123` is supposed to be a hash of the content of the path, and that can be easily checked. +However, content-addressed realisations must be signed as there’s no simple deterministic relation between a derivation and its output paths. To that end, we extend the `Realisation` type to also include a set of signatures. + +```python +class Realisation: + ... + + signatures : Set[str] + + def sign(key : PrivateKey): + ... + def verify_signature(key : PublicKey): + ... 
+``` + +We also update `registerRealisation` for the local store to check these signatures before actually registering anything in the database. # Drawbacks @@ -251,8 +411,8 @@ work on store paths, but rather at the realisation level: `contentAddressed`, but there's no way to enforce these restricitions; - This will probably be a breaking-change for some tooling since the output path - that's stored in the `.drv` files doesn't correspond to an actual on-disk - path. + that's available at eval-time and stored in the `.drv` files doesn't + correspond to an actual on-disk path. # Alternatives @@ -292,14 +452,13 @@ There exist some better solutions to this problem (including one presented in Eelco's thesis), but there are much more complex, so it's probably not worth investing in them until we're sure that they are needed. -## Garbage collection +# Future work -Another major open issue is garbage collection of the realisations table. It's -not clear when entries should be deleted. The paths in the domain are "fake" so -we can't use them for expiration. The paths in the codomain could be used (i.e. -if a path is GC'ed, we delete the alias entries that map to it) but it's not -clear whether that's desirable since you may want to bring back the path via -substitution in the future. +[future]: #future-work + +This RFC tries as much as possible to provide a solid foundation for building +ca paths with Nix, leaving as much room as possible for future extensions. +In particular: ## Ensuring that no temporary output path leaks in the result @@ -321,19 +480,15 @@ We however have a way to dectect these: If we have leaking self-references then the output will change if we artificially change its output path. This could be integrated in the `--check` option of `nix-store`. 
-# Future work +## Make content-addressed derivations compatible with other Nix features -[future]: #future-work +As presented here, content-addressed derivations are incompatible with a few Nix features (in particular import from derivation and recursive Nix). -This RFC tries as much as possible to provide a solid foundation for building -ca paths with Nix, leaving as much room as possible for future extensions. -In particular: +## Enabling a truly multi-user trust-model + +One of the theoretical advantages of the content-addressed model is that it separates the trust (materialised by the realisations) and the storage (the store paths), meaning that several users can share the same Nix store, but have each a different trust relation to it. -- Consolidate the caching model to make it more efficient in presence of - non-deterministic derivations -- (hopefully, one day) make the CA model the default one in Nix -- Investigate the consequences in term of privileges requirements -- Build a trust model on top of the content-adressed model to share store paths +This means that each user could be a “trusted-user” for its own view of the store, without affecting the others. 
[rfc 0017]: https://github.com/NixOS/rfcs/pull/17 [nixphd]: https://nixos.org/~eelco/pubs/phd-thesis.pdf From 2d74fedf2f227ff8145ea92cb674fbbd7b97b6ab Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Th=C3=A9ophane=20Hufschmitt?= Date: Wed, 2 Jun 2021 11:29:26 +0200 Subject: [PATCH 29/32] Make the python samples a bit more pythonic Co-authored-by: zseri --- rfcs/0062-content-addressed-paths.md | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/rfcs/0062-content-addressed-paths.md b/rfcs/0062-content-addressed-paths.md index f9952f8bd..8373b1005 100644 --- a/rfcs/0062-content-addressed-paths.md +++ b/rfcs/0062-content-addressed-paths.md @@ -256,7 +256,7 @@ def build_derivation(derivation : Derivation, outputsToBuild: [str]) -> Map[DrvO runBuildScript(derivationToBuild) # If the derivation isn’t content-addressed, then it already knows its # own output paths - return derivationToBuild.outputs() + return derivationToBuild.outputs ``` For backwards-compatibility, we must change the algorithm a bit further: Resolving an input-addressed derivation changes its input derivation and input path sets (it replaces every input derivation by the corresponding store paths). 
@@ -283,7 +283,7 @@ def build_derivation(derivation : Derivation, outputsToBuild: [str]) -> Map[DrvO
     runBuildScript(derivationToBuild)
     # If the derivation isn’t content-addressed, then it already knows its
     # own output paths
-    return derivationToBuild.outputs()
+    return derivationToBuild.outputs
 ```
 
 ### Remote caching
@@ -312,7 +312,7 @@ work on store paths, but rather at the realisation level:
 ```python
 def substitute_realisation(substituter : Store, wantedOutput : DrvOutput) -> Maybe Realisation:
     maybeRealisation = substituter.queryRealisation(wantedOutput)
-    if not maybeRealisation:
+    if maybeRealisation is None:
         return None
     substitute_path(substituter, maybeRealisation.outputPath)
     return maybeRealisation
@@ -358,13 +358,13 @@ To fix this, we must extend a bit the notion of realisation, to keep track of it
       maybeLocalRealisation = localStore.queryRealisation(drvOutput)
       if (maybeLocalRealisation and maybeLocalRealisation.outputPath != expectedStorePath):
           warn(f"The substituter {substituter} has an incompatible realisation for {dependentDrvOutput}")
-          return true
-      return false
+          return True
+      return False
 
 
   def substitute_realisation(substituter : Store, wantedOutput : DrvOutput) -> Maybe Realisation:
       maybeRealisation = substituter.queryRealisation(wantedOutput)
-      if not maybeRealisation:
+      if maybeRealisation is None:
           return None
 
       # Try substituting the derivations we depend on

From 168a149bb7c611aae369f62d82d4795f58f63c9e Mon Sep 17 00:00:00 2001
From: regnat
Date: Wed, 2 Jun 2021 11:45:20 +0200
Subject: [PATCH 30/32] Explicit that unresolved dependencies are eval-time

---
 rfcs/0062-content-addressed-paths.md | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/rfcs/0062-content-addressed-paths.md b/rfcs/0062-content-addressed-paths.md
index 8373b1005..44de7c489 100644
--- a/rfcs/0062-content-addressed-paths.md
+++ b/rfcs/0062-content-addressed-paths.md
@@ -116,9 +116,11 @@ def build_derivation(derivation : Derivation, outputsToBuild: [str]) ->
(): The main change required by the content-addressed model is that we can’t know the output paths of a derivation before building it. -This means that the Derivations as they are produced by the evaluator can’t -either know their output path, nor explicitely refer to their dependencies by -their output path. +This means that the Nix evaluator doesn’t know the output paths of the +dependencies it manipulates (it *could* know them if they are already built, but +that would be a blatant purity hole), so these derivations can’t neither embed +their own output path, nor explicitely refer to their dependencies by their +output path. ### Output mappings From 427abed085e935b8443caa71330e389fce4549e0 Mon Sep 17 00:00:00 2001 From: regnat Date: Wed, 2 Jun 2021 11:57:09 +0200 Subject: [PATCH 31/32] Prettify --- rfcs/0062-content-addressed-paths.md | 43 ++++++++++++++++------------ 1 file changed, 25 insertions(+), 18 deletions(-) diff --git a/rfcs/0062-content-addressed-paths.md b/rfcs/0062-content-addressed-paths.md index 44de7c489..2b76816c1 100644 --- a/rfcs/0062-content-addressed-paths.md +++ b/rfcs/0062-content-addressed-paths.md @@ -54,7 +54,7 @@ cutoffs. [design]: #detailed-design -*In everything that follows, most algorithms and data-structures will be expressed as pseudo-python snippets* +_In everything that follows, most algorithms and data-structures will be expressed as pseudo-python snippets_ When it comes to computing the output paths of a derivation, the current Nix model, known as the “input-addressed” model (also sometimes referred to as the @@ -62,13 +62,13 @@ model, known as the “input-addressed” model (also sometimes referred to as t 1. A Nix language expression gets evaluated to a `derivation` 2. This `derivation` is a data-structure describing how to build a package. In particular it contains - 1. A set of derivation outputs which will be used as input for the build - 2. A set of store paths that will be used as input for the build - 3. 
The build recipe proper (a script to run, with a set of environment - variables). This recipe can refer input paths or derivations by - interpolating their store path. - 4. The output paths into which the derivation will be installed. - These are computed from a hash of the other elements of the derivation. + 1. A set of derivation outputs which will be used as input for the build + 2. A set of store paths that will be used as input for the build + 3. The build recipe proper (a script to run, with a set of environment + variables). This recipe can refer input paths or derivations by + interpolating their store path. + 4. The output paths into which the derivation will be installed. + These are computed from a hash of the other elements of the derivation. The “input-addressed” designation comes from the way the output paths are computed: They derive from the derivation data-structure, which is the input of @@ -79,7 +79,7 @@ these output paths from the inputs of the build, we derive them from the output (the produced store path). Nix already supports a special-case of content-addressed derivations with the -so-called “fixed-output” derivations. These are derivations that are +so-called “fixed-output” derivations. These are derivations that are content-addressed, but whose output hash has to be specified in advance, and are used in particular to fetch data from the internet (as the constraint that the hash has to be specified in advance means that we can relax the sandbox for @@ -117,7 +117,7 @@ The main change required by the content-addressed model is that we can’t know the output paths of a derivation before building it. 
 This means that the Nix evaluator doesn’t know the output paths of the
-dependencies it manipulates (it *could* know them if they are already built, but
+dependencies it manipulates (it _could_ know them if they are already built, but
 that would be a blatant purity hole), so these derivations can’t neither embed
 their own output path, nor explicitely refer to their dependencies by their
 output path.
@@ -217,10 +217,11 @@ mention of the store path is called a **self-reference**).
 
 A lot of store paths happen to be self-referential (for example a path that contains both an dynamic library and an executable using that library will likely have the `rpath` of the exectuable mention the absolute path to the library).
 It happens that these are problematic with content-addressed derivations, because
+
 1. A self-reference means that the output path depends on the temporary path that has been used during the build (potentially breaking reproducibility as there’s no guaranty for this path to be stable),
 2. More annoyingly, a self-reference means that the path can’t be moved freely (otherwise the self-reference would become dangling).
 
-However, under the assumption that self-references only appear textually in the output (*i.e* running strings on a file that contains self-references will print all the self-references out), we can:
+However, under the assumption that self-references only appear textually in the output (_i.e_ running strings on a file that contains self-references will print all the self-references out), we can:
 
 - Build the derivation on a temporary directory (`/nix/store/someArbitraryHash-foo`, the path provided by the function `assignScratchOutputPaths` above)
 - Replace all the occurences of `someArbitraryHash` by a fixed magic value
@@ -229,6 +230,7 @@ However, under the assumption that self-references only appear textually in the
 - Move the result to the final path.
 
 This is obviously a hack, however it seems to work very well in practice, due to the fact that:
+
 - The string that we search for is a cryptographic hash that’s unlikely to occur by accident in the output path,
 - Very few programs store self-references in a non-purely textual way
 
@@ -240,6 +242,7 @@ The model so far assumes that the whole world switches to content-addressed deri
 It’s however possible to freely mix content- and input-addressed derivations in the same Nix store, and even in the same closure:
 
 The algorithm for building content-addressed derivations extends the algorithm for building input-addressed derivations in two ways:
+
 1. Before running the build script, it resolves the derivation
 2. When running the build script, it uses some temporary outputs, and moves them to their final location afterwards.
 
@@ -301,11 +304,13 @@ As a consequence, the remote cache protocols is extended to not simply
 work on store paths, but rather at the realisation level:
 
 - The store interface now specifies a new method
+
   ```python
   def queryRealisation(output : DrvOutput) -> Maybe Realisation
   ```
   If the store knows about the given derivation output, it will return the
   associated realisation, otherwise it will return `None`.
+
 - The substitution loop in Nix fist calls this method to ask the remote for the
   realisation of the current derivation output.
If this first call succeeds, then it fetches the corresponding output path @@ -344,14 +349,16 @@ To fix this, we must extend a bit the notion of realisation, to keep track of it - Realisations now contain a `dependencies` field, which is a map from `DrvOutput` to `StorePath`: - ```python - class Realisation: - id : DrvOutput - outputPath : StorePath - dependencies : Map[DrvOutput, StorePath] - ``` + ```python + class Realisation: + id : DrvOutput + outputPath : StorePath + dependencies : Map[DrvOutput, StorePath] + ``` + - We add the constraint that realisations should form a closure in a store, meaning that if a store has the realisation for `foo!out` with a dependency on `bar!out->/nix/store/bar`, then the store must also have a realisation for `bar!out` whose output path is `/nix/store/bar` - The realisation loop now keep tracks of these realisations to enforce this closure invariant: + ```python # Returns true (and warns) iff we already have a realisation for the given # derivation output, and that realisation has a different output path @@ -469,7 +476,7 @@ being built, which breaks self-references. Hash rewriting solves this in most cases, but it is only a heuristic and there is no way to truly ensure that we don't leak a self-reference (for example if a self-reference appears in a zipped file − like is often the case for man pages or Java jars, the -hash-rewriting machinery won't detect it). Having leaking self-references is +hash-rewriting machinery won't detect it). 
Having leaking self-references is annoying since: - These self-references change each time the inputs of the derivation change, From f2756692f5e06edf073c0006f9f982582b27feda Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Th=C3=A9ophane=20Hufschmitt?= Date: Fri, 10 Dec 2021 15:35:10 +0100 Subject: [PATCH 32/32] Make the end-goal an experiment MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Simplify the paperwork and just get this to FCP because right now it’s stuck in a hole --- rfcs/0062-content-addressed-paths.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/rfcs/0062-content-addressed-paths.md b/rfcs/0062-content-addressed-paths.md index 2b76816c1..742b02d26 100644 --- a/rfcs/0062-content-addressed-paths.md +++ b/rfcs/0062-content-addressed-paths.md @@ -12,7 +12,7 @@ related-issues: (will contain links to implementation PRs) [summary]: #summary -Add some basic but simple support for content-addressed store paths to Nix. +Add some experimental support for content-addressed store paths to Nix. We plan here to give the possibility to mark certain store paths as content-adressed (ca), while keeping the other input-adressed as they are