[RFC 0152] local-overlay store #152

Draft
wants to merge 6 commits into
base: master

Conversation

Ericson2314
Member

@Ericson2314 Ericson2314 commented Jun 14, 2023

Add a new local-overlay store implementation to Nix. This will be a local store that is layered upon another local filesystem store (local store or daemon). This allows locally extending a shared store that is periodically updated with additional store objects.
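As a rough mental model of the layering described above, the following Python sketch treats each store as a dictionary of store paths. The `ToyStore` and `ToyOverlayStore` classes and the example paths are hypothetical illustrations, not Nix's implementation or API; a real store also involves a filesystem mount and an SQLite database.

```python
from dataclasses import dataclass, field


@dataclass
class ToyStore:
    """Hypothetical stand-in for a store: store path -> set of references."""
    objects: dict = field(default_factory=dict)

    def has(self, path: str) -> bool:
        return path in self.objects


@dataclass
class ToyOverlayStore:
    """Hypothetical overlay: reads consult both layers, writes go to upper."""
    lower: ToyStore  # shared store, never written to by the overlay
    upper: ToyStore = field(default_factory=ToyStore)  # local additions

    def has(self, path: str) -> bool:
        return self.upper.has(path) or self.lower.has(path)

    def add(self, path: str, references=()) -> str:
        """Register a store object; return the layer that ends up holding it."""
        if self.lower.has(path):
            return "lower"  # already provided by the shared store
        self.upper.objects[path] = set(references)
        return "upper"


shared = ToyStore({"/nix/store/aaa-glibc": set()})
overlay = ToyOverlayStore(lower=shared)

print(overlay.add("/nix/store/aaa-glibc"))  # reuses the shared object
print(overlay.add("/nix/store/bbb-hello", {"/nix/store/aaa-glibc"}))
# The shared store later gains an object; the overlay sees it without copying.
shared.objects["/nix/store/ccc-curl"] = set()
print(overlay.has("/nix/store/ccc-curl"))
```

The point of the sketch is only that local additions never touch the shared layer, while objects added to the shared layer become visible to every overlay on top of it.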

This work is sponsored by Replit

Ericson2314 and others added 3 commits June 14, 2023 09:51
Co-authored-by: Ryan Mulligan <ryan@ryantm.com>
Co-authored-by: Connor Brewster <cbrewster@hey.com>
Co-authored-by: Ben Radford <benradf@users.noreply.github.com>
Co-authored-by: Divam <dfordivam@protonmail.com>
@Ericson2314 Ericson2314 changed the title local-overlay store [RFC 0152]: local-overlay store Jun 14, 2023
@nixos-discourse

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/super-colliding-nix-stores/28462/17

The `local-overlay` store can serve as a crucial tool to bridge these two modes of using Nix.
The lower store can be as before
--- however the artifacts were disseminated in the "hidden Nix" first phase of adoption
--- perhaps with only a small tweak to expose the DB / daemon socket if it wasn't before.

Member

Is exposing the socket used anywhere in the proposal, or is it just mentioned as a separate possibility (with relevant metadata sharing done via reading the underlying SQLite DB)?

Member Author

@Ericson2314 Ericson2314 Jun 27, 2023

The local overlay store is potentially a consumer of the socket provided by another Nix daemon. A Nix daemon can also be spun up using the local overlay store instead of the local store.

Basically, no new socket code is needed for this. As far as I can tell, everything one would want with sockets already works without limitation.

Member

A consumer — doing what with the socket?

Member Author

Acting as a regular client, like anything else using the daemon. It will in fact only use it to read metadata; the lower store can be read-only.

Member

I have reread the paragraph, and even with your explanations I am not sure what process the paragraph as written describes.

Member Author

  1. Serve the Nix store dir; clients don't use Nix but do use Nix-built things
  2. Serve either the socket or the SQLite database; clients can use Nix with that store but with restrictions (e.g. perhaps only read-only)
  3. Use the local overlay store (the writable upper layer makes the read-only lower layer less of an issue)

Does that make sense?

Member

So in step 1 each snapshot gets its own store path, and the rest of the store is visible? (I guess one could also share a fixed-path directory with links to all the currently relevant snapshots?)

@rhendric
Member

What if, instead of requiring that the lower store grow monotonically, the overlay store maintained GC roots in the lower for any lower paths it references? I haven't thought this through to the level of detail in the spec, but it strikes me as an alternative worth considering; not being able to run garbage collection on the lower store would certainly be a barrier for personal use (perhaps not a central goal for this RFC) and possibly for org use as well—storage is cheap but it doesn't round down to free in every context.
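One way to picture this suggestion: the overlay would register a root in the lower store for every lower path that an upper-layer object refers to directly, and the lower store's collector would keep the closure of those roots. The sketch below is only an illustration under that assumption; `lower_roots_for_overlay`, `live_closure`, and the sample paths are hypothetical and do not correspond to existing Nix functions.

```python
def lower_roots_for_overlay(upper_refs, lower_refs):
    """Lower-store paths directly referenced by upper-layer objects."""
    return {
        ref
        for refs in upper_refs.values()
        for ref in refs
        if ref in lower_refs and ref not in upper_refs
    }


def live_closure(roots, refs_by_path):
    """Everything reachable from the roots by following references."""
    live, stack = set(), list(roots)
    while stack:
        path = stack.pop()
        if path in live or path not in refs_by_path:
            continue
        live.add(path)
        stack.extend(refs_by_path[path])
    return live


# Hypothetical reference graphs: store path -> set of referenced paths.
lower = {
    "/nix/store/aaa-glibc": set(),
    "/nix/store/bbb-openssl": {"/nix/store/aaa-glibc"},
    "/nix/store/ccc-unused": set(),
}
upper = {"/nix/store/ddd-app": {"/nix/store/bbb-openssl"}}

roots = lower_roots_for_overlay(upper, lower)
collectable = set(lower) - live_closure(roots, lower)
print(sorted(roots))
print(sorted(collectable))
```

Under this model, only the lower paths outside the rooted closure would be candidates for collection. The cost is that every overlay needs a way to write roots into the lower store.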

@ryantm
Member

ryantm commented Jun 20, 2023

I work for Replit (who sponsored work on this RFC) and helped out with the creation of this RFC. I am happy to serve as a Shepherd on this RFC, but also happy to cheer from the sidelines if people (or the Steering Committee) see this as a conflict of interest.

@baloo
Member

baloo commented Jun 21, 2023

I'm generally excited about this. I have a slightly different target than Replit has.
I'm shipping large firmware images (multiple GB each) with a nix/store that is immutable and signed (via dm-verity and secureboot). The lower overlay ships with a document that provides a way to rebuild the db with everything that was put in the store, and the db is rebuilt at runtime. Configuration is then evaluated and a new system is built in the upper store.

This would solve the initial import of the DB which is still not that fast (although probably way faster than importing 16TB worth of derivations!).
For our use-case, I still believe we'll trash the upper layer at each boot because it's easier to reason about.

@Ericson2314
Member Author

@rhendric That is useful for some things, but probably not the use-case of large numbers of consumers all sharing the same underlying store --- it is pretty important the underlying store be truly read-only in that case, including any GC roots.

@Ericson2314
Member Author

@baloo We have separately thought about those sorts of issues, including a persistent upper store that then "pivots" onto a new lower store when one does an upgrade of NixOS (and I suppose GC of the old generations). I think the pivoting feature is a nice future work item.

@kevincox kevincox added the status: open for nominations Open for shepherding team nominations label Jun 28, 2023
@kevincox
Contributor

This RFC is now open for shepherd nominations!

@Ericson2314
Member Author

I nominate @roberth

@7c6f434c
Member

That is useful for some things, but probably not the use-case of large numbers of consumers all sharing the same underlying store

Now this makes me wonder if local-overlay sounds a bit more local than this use-case (from a naming point of view)…

@Ericson2314
Member Author

I suppose I would deprecate all the "local" and "remote" names as misleading. I just picked local-overlay on a moment's notice to match the existing pattern.

@nixos-discourse

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/tweag-nix-dev-update-51/30870/1

@lheckemann
Member

This RFC has not acquired enough shepherds. This typically shows lack of interest from the community. In order to progress a full shepherd team is required. Consider trying to raise interest by posting in Discourse, talking in Matrix or reaching out to people that you know.

If not enough shepherds can be found in the next month we will close this RFC until we can find enough interested participants. The PR can be reopened at any time if more shepherd nominations are made.

See more info on the Nix RFC process here

@ryantm
Member

ryantm commented Jul 26, 2023

@baloo, @rickynils, @zhaofengli, @arianvp, @edolstra would any of you be open to shepherding this along?

I don't see much controversy or drama here and, in my opinion, the document is in good shape too, so overall I expect the shepherd work to be low-commitment.

If you've never done it before, you can look at https://github.com/NixOS/rfcs/blob/master/rfcs/0036-rfc-process-team-amendment.md#shepherd-team for more information about being a Shepherd.

@arianvp
Member

arianvp commented Aug 1, 2023

I can shepherd

@arianvp
Member

arianvp commented Aug 3, 2023

I also poked the #nixos-systemd channel, as some people there are looking at appliance images / immutable nix-store partitions, and that seems to have a lot of overlap with this RFC.


We could have a single FUSE mount that could manually implement the "bind on demand" semantics described above without cluttering the mount namespace with an entry per each shared store object.
FUSE however is quite primitive, in that every read need to be shuffled via the FUSE server.
There is nothing like a "control plane vs data plane" separation where Nix could tell the OS "this directory is that other directory", and the OS can do the rest without involving the FUSE server.

Member

This is false: FUSE passthrough can bypass the FUSE server.
Multiple people in the Linux filesystem ecosystem are working on bringing FUSE filesystems to roughly on-par performance with in-kernel filesystems, or on reusing the existing filesystems.

https://lwn.net/Articles/932060/
https://github.com/extfuse/extfuse
https://lpc.events/event/16/contributions/1339/

I am not convinced that we should drop this alternative: I feel it offers the most flexibility and compatibility across all the use cases, instead of a very limited solution based on OverlayFS.

If you are interested in chatting more about how to make this alternative possible, feel free to ping me. I know quite a bit about FUSE filesystems and I am planning to write a nixstorefs at some point, which will rely on FUSE semantics first.

I also challenge the claim of "worse performance" than in-kernel mounting solutions. It would be good to bring data on that; if you only pay the open cost, this is quite cheap.

Member

Anyway, I have mixed feelings about the RFC because I feel like the FUSE approach is a much better route than this one. I am not satisfied, and I feel it should be more ambitious if it is going to take an RFC route.

Member Author

Oh that is fantastic! That is getting at the heart of the problems with FUSE. I wish I had known about this earlier.

However, even if I did, I would have advocated starting with the OverlayFS approach. That is because all these things are not yet mainline, and kernel development / trying out bleeding edge features is vastly more expensive.

IMO the right thing to do is

  1. Accept what we have on an experimental basis, using it to drum up interest and move us towards being able to pool resources on this.
  2. Get in communication with these Kernel devs; indeed I was already emailing back and forth with Amir Goldstein about some restrictions in OverlayFS.
  3. Try to be an early adopter of this stuff as it matures; nudge its development so it better meets our needs.

Also CC @flokli, because Tvix may be better positioned to be at the vanguard of trying this stuff out, as they are already exploring FUSE.

Member

Disclosure: I am also a tvix-store developer ;)

Member

Alright, your argument is sound and convinced me :).

Member Author

Oh my bad I didn't realize that. @RaitoBezarius you should shepherd this :).

Member

Aaaaaaaaaaa, OK for the nerdsnipe.

@ballit6782 ballit6782 Oct 31, 2023

There may be another approach with FUSE: using MergerFS with symlinkify=true and cache.symlinks=true. This way, FUSE is only used to resolve the symlink:

  • for a lot of applications, the opened directory fd will be reused, so most operations will run directly on the underlying store
  • later uses of the symlink would be cached by the kernel, so even without fuse-ebpf, we'd need the jump to userspace only once per symlink

This also relies on lower stores only growing monotonically when used, so that links would not go stale.
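The first bullet can be illustrated without MergerFS at all. In the sketch below a plain symlink stands in for what a symlinkify-style mount would present (an assumption made for illustration; the directory names are hypothetical, and the code expects a Linux or other Unix system). Once the directory behind the symlink is opened, the descriptor refers to the underlying directory, and later lookups relative to it no longer pass through the symlink.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical layout: "lower" is the real store, "merged" only has symlinks.
    real_dir = os.path.join(tmp, "lower", "abc-hello")
    os.makedirs(real_dir)
    with open(os.path.join(real_dir, "greeting"), "w") as f:
        f.write("hello\n")

    merged = os.path.join(tmp, "merged")
    os.makedirs(merged)
    link = os.path.join(merged, "abc-hello")
    os.symlink(real_dir, link)

    # Opening through the symlink resolves it once...
    dir_fd = os.open(link, os.O_RDONLY | os.O_DIRECTORY)
    try:
        # ...and the descriptor is the underlying directory itself.
        same_dir = os.path.samestat(os.fstat(dir_fd), os.stat(real_dir))

        # Later lookups are relative to the descriptor, not the symlink.
        file_fd = os.open("greeting", os.O_RDONLY, dir_fd=dir_fd)
        with os.fdopen(file_fd) as f:
            content = f.read()
    finally:
        os.close(dir_fd)

print(same_dir, repr(content))
```

This says nothing about how MergerFS itself behaves or performs; it only shows the kernel-level property the argument relies on.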

@8aed

8aed commented Nov 3, 2023

There is some more context provided by a developer in this thread in the linux-fsdevel mailing list :

Specifically, renaming directories and files in lower that were already
copied up is going to have a weird outcome.

It doesn't say anything about adding directories/files in lower that already exist in upper layers, though.

Also, someone else then adds this comment about allowing changes to the lower fs :

Best way to keep things simple is to only add functionality when
someone actually needs it (and can test it). This has been the design
policy in overlayfs and it worked wonderfully.

Maybe we can reach out to linux-fsdevel and describe our use case. If we only require "extending" the lower filesystem online, it's possible that it wouldn't require a lot of changes.

@nixos-discourse

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/nix-in-the-wild-project-idx-flox-blog/35025/2

@infinisil infinisil changed the title [RFC 0152]: local-overlay store [RFC 0152] local-overlay store Nov 16, 2023
@nixos-discourse

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/sharing-nix-portable-store-between-multiple-users/36571/5

@lheckemann
Member

@tomberek any chance the shepherd team could have a meeting sometime soon? Some of @ballit6782's concerns don't seem to have been responded to and it would be good to have these addressed in some way so the RFC can move forwards :)

@nixos-discourse

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/replit-software-engineer-java-kotlin-or-generalist-developer-experience/39045/1

@infinisil
Member

Another ping: please give an update on the state of this RFC, if any. If no progress is happening, it would be best if the RFC is marked as a draft.

@ballit6782

I think that for this RFC to progress, one of these things needs to happen:

  1. some clarification from fsdevel as to what kind of changes, if any, we are allowed to make on the overlay's upper filesystem
  2. add the restriction that the upper store should always be frozen if the lower store is used
  3. choosing to experimentally rely on things that "seem to work" but that are not part of the kernel's guarantees (as I understand them)
  4. use a FUSE-based filesystem to make this feature work, as proposed in the alternatives

I think that options 2 and 3 are not viable (frozen upper stores would be largely useless).

On the subject of FUSE alternatives, I think that this paragraph in the RFC does not take into account the caching mechanisms of Linux's VFS :

FUSE however is quite primitive, in that every read need to be shuffled via the FUSE server. There is nothing like a "control plane vs data plane" separation where Nix could tell the OS "this directory is that other directory", and the OS can do the rest without involving the FUSE server. That means the performance of FUSE are potentially worse than these in-kernel mounting solutions.

For example, if using MergerFS with the options symlinkify=true and cache.symlinks=true, MergerFS would in effect be telling the OS "this directory is that other directory". Furthermore, using cache.symlinks, only the first read of a root-level symlink in the store would involve userland code. While this might involve a bit more memory usage as there would be more dentries to cache, I think that they are largely offset by the memory usage that overlayed stores can bring.

MergerFS also opens up other possibilities that are very interesting for my use-case of the local-overlay store, namely providing "late persistence", by adding a tmpfs branch that can be dynamically shrunk to 0. And being able to set fsname=nix would also be quite neat.

I have been planning for some time to run complete tests of this in practice, but have not found the time or motivation recently. However, if the shepherds think that this is a viable path for this RFC, I can document some experiments, run some benchmarks (especially measuring the additional context switches and how they scale) and try to provide an updated version of the RFC that would rely on MergerFS.

I'd also like to add: I think it would be great if the RFC could state more clearly whether this feature could be used for multiple layered stores. I know this would be a very niche feature, but there are cases where it would be very useful (for integration with the Shufflecake layered plausible-deniability storage system). It would be helpful to know whether that would be a supported use case.

@nixos-discourse

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/rfcsc-meeting-2024-03-05/40851/1

@nixos-discourse

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/rfcsc-meeting-2024-03-19/41829/1

@nixos-discourse

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/replaceruntimedependency-guix-grafts-and-other-approaches-to-the-fast-upgrade-problem/42610/1

@MMesch MMesch marked this pull request as draft April 2, 2024 15:02
@MMesch

MMesch commented Apr 2, 2024

Since there was no activity for a while we decided to mark this as a draft. Feel free to undraft it any time once activity picks up again.

@nixos-discourse

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/rfcsc-meeting-2024-04-02/42643/1

@nixos-discourse

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/super-colliding-nix-stores/28462/19

@jonringer
Contributor

jonringer commented Apr 12, 2024

I think this is super cool usage of Nix, and would hate for it to fall by the wayside. I'll happily be a shepherd.

I read the Discourse comments without realizing they were 10 months apart, one asking for shepherds and one stating implementation progress xD. Oops!

Congrats on getting this closer to being landed. 🎉

@tomberek
Contributor

Looks like the notes never made it here (my mistake): from Feb 01

Quick sync : local-overlay store

Present: raitobezarius, tomberek, John Ericson, ryantm.

Question : Update on where we are?

From an RFC perspective, it seems like things have stalled, but there has been progress and we should write that down.

local-overlay store usage at replit

Started using it a few months ago. Objective: reduce the copying of the lower store database (gigabytes in size). In the past, every time a user wanted to perform a Nix action, the database had to be copied first. With the local overlay store, this copy is eliminated.

It seems to be working fine, we have no bug report regarding Nix so far, though we may not have the most advanced usecase of Nix workloads.

We just finished the work for persisting the upper store and we will start testing that internally.

Where do we want to drive this?

John is trying to merge the PR, he worked with Ben from Tweag on this.

Tvix

flokli has been working on store composition. Tvix has no database like Nix's SQLite; we are looking at building more compatibility and getting a theory of composition of stores. We will need at least six more months to have interesting results.

replit has contracted flokli to work on plumbing the blob store / CA store for their main package storage.

Conclusion: This group has been working more as a special interest group / working group than a purely RFC group, and we are happy with the results it brought. ryantm mentions that he's happy to have gone through the RFC process because it crystallized a lot of important information, and the RFCSC has been helpful in keeping everyone else in sync on what work is going on.

@tomberek
Contributor

Update

NixOS/nix#8397 is now merged into Nix master branch.

@kevinh-canva

There are mentions both in this RFC and on Discourse that the local-overlay store would break if the lower store is garbage-collected.

Can I get a clarification on this: does this mean that if a user runs nix-store --gc against an already existing local-overlay store, it will be a faux delete (the store object will still exist in the lower store), and there will be no corruption?

And does the garbage collection concern only apply to cases where a store object is somehow deleted from the lower store prior to mounting? In that case, would running nix-store --verify --repair after mounting fix it?
