Nix and IPFS #859

Open
vcunat opened this Issue Mar 24, 2016 · 106 comments

@vcunat
Member

vcunat commented Mar 24, 2016

(I wanted to split this thread off from #296 (comment).)

Let's discuss the relationship with IPFS here. As I see it, what would be appreciated most is a decentralized way to distribute nix-stored data.

What we might start with

The easiest usable step might be to allow distribution of fixed-output derivations over IPFS. Those are paths that are already content-addressed, typically by a (truncated) sha256 over either a flat file or a tar-like dump of a directory tree; more details are in the docs. These paths are mainly used for compressed tarballs of sources. This step alone should avoid lots of problems with unstable upstream downloads, assuming we could convince enough nixers to serve their files over IPFS.
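
For illustration, a typical fixed-output fetch in nixpkgs looks roughly like the sketch below (the hash is a placeholder); since the sha256 alone determines the output, the actual bytes could come from any mirror, including IPFS:

src = fetchurl {
  url = "mirror://gnu/hello/hello-2.10.tar.gz";
  # flat sha256 over the tarball (placeholder value)
  sha256 = "0000000000000000000000000000000000000000000000000000";
};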

Converting hashes

One of the difficulties is that we use different kinds of hashing than IPFS does, and I don't think it would be good to require converting the many thousands of hashes in our expressions. (Note that it's infeasible to convert among those hashes unless you have the whole content.) IPFS people might best suggest how to work around this. I imagine we would want to "serve" a mapping from the hashes we use to IPFS's hashes, perhaps realized through IPNS. (I don't know the details of IPFS's design, I'm afraid.) One advantage is that the nix-style hash can easily be verified at the end, after the path has been obtained in whatever way.
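
As a rough illustration (layout and names here are purely hypothetical), such a mapping could be little more than a published table from nix-style hashes to IPFS objects, resolvable through a well-known IPNS name:

{
  # hypothetical mapping entry: nix flat sha256 -> IPFS object with the same content
  "sha256:<nix hash of foo-1.0.tar.gz>" = "/ipfs/<IPFS hash of foo-1.0.tar.gz>";
  "sha256:<nix hash of bar-2.3.tar.xz>" = "/ipfs/<IPFS hash of bar-2.3.tar.xz>";
}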

Non-fixed content

If we get that far, it shouldn't be too hard to manage distributing everything via IPFS, as for all other derivations we use something we could call indirect content addressing. To explain that, let's look at how we distribute binaries now – our binary caches. We hash the build recipe, including all its recipe dependencies, and we inspect the corresponding narinfo URL on cache.nixos.org. If our build farm has built that recipe, that file contains various information, mainly the hashes of the contents of the resulting outputs of that build and cryptographic signatures over them.

Note that this narinfo step just converts our problem to the previous fixed-output case, and the conversion itself seems very reminiscent of IPNS.
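
For readers who haven't seen one, a narinfo record on cache.nixos.org looks roughly like this (all hashes and sizes below are placeholders):

StorePath: /nix/store/<hash>-hello-2.10
URL: nar/<file hash>.nar.xz
Compression: xz
FileHash: sha256:<hash of the compressed NAR>
FileSize: <size in bytes>
NarHash: sha256:<hash of the uncompressed NAR>
NarSize: <size in bytes>
References: <hash>-glibc-2.27 <hash>-hello-2.10
Sig: cache.nixos.org-1:<signature>

Serving the same kind of record over IPNS instead of HTTPS would give exactly the recipe-to-content mapping described above.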

Deduplication

Note that nix-built stuff has significantly greater than usual potential for chunk-level deduplication. Very often we rebuild a package only because something in a dependency has changed, so only very minor changes are expected in the results, mainly just exchanging the references to runtime dependencies as their paths have changed. (On rare occasions even the lengths of the paths change.) There's great potential to save on that during distribution of binaries, which would be utilized by implementing the section above, and even potential for saving disk space in comparison to our current way of hardlinking equal files (the next paragraph).

Saving disk space

Another use might be to actually store the files in a filesystem similar to what IPFS uses. That seems a more complex and tricky thing to deploy; e.g., I'm not sure anyone trusts the implementation of that FS enough yet to have their whole OS running off it.

It's probably premature to speculate too much on this use ATM; I'll just say I can imagine having symlinks from /nix/store/foo to /ipfs/*, representing the locally trusted version of that path. (That works around the problems related to making /nix/store/foo content-addressed.) Perhaps it could start as a per-path opt-in, so one could move only the less vital paths out of /nix/store itself.


I can personally help with bridging the two communities in my spare time. Not too long ago, I spent many months researching various ways to handle "highly redundant" data, mainly from the point of view of theoretical computer science.

@ehmry

Member

ehmry commented Mar 24, 2016

I'm curious what the most minimal way would be to associate store paths with IPFS objects while interfering as little as possible with IPFS-unaware tools.

@vcunat

Member

vcunat commented Mar 24, 2016

I described such a way in the second paragraph from the bottom. It should work with IPFS and the nix store as they are, perhaps with some script that would move the data, create the symlink, and pin the path in IPFS to avoid losing it during GC. (It could be unpinned when nix deletes the symlink during its own GC.)
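
A minimal sketch of such a script (assuming a running go-ipfs daemon with /ipfs mounted via ipfs mount; the path is a placeholder, and this ignores the executable-bit issue discussed further down):

#!/bin/sh
# hypothetical helper: move one store path into IPFS and leave a symlink behind
path=/nix/store/<some-path>       # placeholder
hash=$(ipfs add -Q -r "$path")    # add (and implicitly pin) the tree; -Q prints only the final hash
rm -rf "$path"                    # caution: bypasses the nix database; illustration only
ln -s "/ipfs/$hash" "$path"       # the locally trusted copy now lives in IPFS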

@ehmry

Member

ehmry commented Mar 24, 2016

I was thinking about storing store objects in something that wouldn't require a daemon, but of course you can't have everything.

@Ericson2314

Member

Ericson2314 commented Mar 24, 2016

@vcunat Great write-up! More thoughts on this later, but one thing that gets me is the tension between wanting incremental goals and avoiding work we don't need long term. For example, it will take some heroics to use our current hashing schemes, but for things like dedup and the intensional store we'd want to switch to what IPFS already does (or much closer to that) anyway.

Maybe the best first step is a new non-flat/non-NAR hashing strategy for fixed-output derivations? We can slowly convert nixpkgs to use that, and get IPFS mirroring and dedup in the fixed-output case. Another step is using git tree hashes for fetchgit. We already want to do that, and I suspect IPFS would want that too for other users. IPFS's multihash can certainly be heavily abused for such a thing :).

@Ericson2314

Member

Ericson2314 commented Mar 24, 2016

For me the end goal should be using only IPNS for the derivation -> build map. Any long-term trust-based compatibility map between hashing schemes makes the perfectionist in me sad :).

@vcunat

Member

vcunat commented Mar 24, 2016

For example, it will take some heroics to use our current hashing schemes, but for things like dedup and the intensional store we'd want to switch to what IPFS already does (or much closer to that) anyway.

I meant that we would "use" some IPFS hashes but also maintain a mapping from our current hashes, perhaps run over IPNS, so that it would still be possible to run our fetchurl { sha256 = "..." } without modification. Note that it's these flat tarball hashes that most upstreams release and sign, and that's not going to change anytime soon; moreover, there's not much point in trying to deduplicate compressed tarballs anyway. (We might choose to use uncompressed sources instead, but that's another, partially independent decision I'm not sure about.)

@Ericson2314

Member

Ericson2314 commented Mar 24, 2016

For single files / IPFS blobs, we should be able to hash the same way without modification.

@Ericson2314

Member

Ericson2314 commented Mar 24, 2016

But for VCS fetches we currently do a recursive/NAR hash, right? That is what I was worried about.

@Ericson2314

Member

Ericson2314 commented Mar 24, 2016

@ehmry I assume it would be pretty easy to make the nix store an immutable FUSE filesystem backed by IPFS (hopefully such a thing exists already). Down the road I'd like to have package references and the other things currently in the SQLite database also backed by IPFS: they would "appear" in the FUSE filesystem as specially-named symlinks/hard links/duplicated sub-directories. "referrers" is the only field I'm aware of that would just be a cache on top. Nix would keep track of roots, but IPFS would do GC itself, in the obvious way.

@cleverca22

Contributor

cleverca22 commented Apr 8, 2016

One idea I had was to keep all outputs in NAR format and have the FUSE layer dynamically unpack things on demand; that could then be used with some other planned IPFS features to share a file without copying it into the block storage.

Then you get a compressed store and don't have to keep two copies of everything (the NAR for sharing and the installed files).

@nmikhailov

nmikhailov commented Apr 8, 2016

@cleverca22 Yeah, I had the same thoughts about that; it's unclear how much this would impact performance, though.

@cleverca22

Contributor

cleverca22 commented Apr 8, 2016

We could keep a cache of recently used files in a normal tmpfs and relay things over to that to boost performance back up.

@davidar

davidar commented Apr 8, 2016

@cleverca22 Another idea that was mentioned previously was to add NAR support to IPFS, so that we can transparently unpack it as we currently do with TAR (ipfs tar --help).

@Ericson2314

Member

Ericson2314 commented Apr 8, 2016

NAR sucks, though: no file-level dedup that we could otherwise get for free. The above might be fine as a temporary step, but Nix should learn about a better format.

@davidar

davidar commented Apr 9, 2016

@Ericson2314 Another option that was mentioned was for Nix and IPFS (and perhaps others) to try to standardise on a common archive format.

@Ericson2314

Member

Ericson2314 commented Apr 9, 2016

@davidar Sure, that's always good. For the shortish term, I was leaning towards a stripped-down unixfs with just the attributes NAR cares about. As far as Nix is concerned, this is basically the same format, just with a different hashing scheme.

@Ericson2314

Member

Ericson2314 commented Apr 9, 2016

Yeah, looking at Car, it seems to be both an "IPFS schema" over the IPFS Merkle DAG (unless it just reuses unixfs) and an interchange format for packing the DAG into one binary blob.

The former is cool, but I don't think Nix even needs the latter (except perhaps as a new way to fall back on HTTP etc. when IPFS is not available, while still using a compatible format). For normal operation, I'd hope Nix could just ask IPFS to populate the FUSE filesystem that is the store, given a hash, and everything else would be transparent.

@cleverca22

Contributor

cleverca22 commented Apr 9, 2016

https://github.com/cleverca22/fusenar

I now have a NixOS container booting with a FUSE filesystem at /nix/store, which mmaps a bunch of .nar files and transparently reads the requested files.

@knupfer

knupfer commented Jul 20, 2016

What is currently missing for using IPFS? How could I contribute? I really need this feature for work.

@knupfer

knupfer commented Jul 20, 2016

Pinging @jbenet and @whyrusleeping, because they are only subscribed to the old issue.

@copumpkin

Member

copumpkin commented Jul 20, 2016

@knupfer I think writing a fetchIPFS would be a pretty easy first step. Deeper integration will be more work and require touching Nix itself.

@knupfer

knupfer commented Jul 28, 2016

OK, I'm working on it, but there are some problems. Apparently IPFS doesn't save the executable flag, so stuff like stdenv doesn't work, because it expects an executable configure script. The alternative would be to distribute tarballs rather than directories, but that would be clearly inferior because it would exclude deduplication at the file level. Any thoughts on that? I could make every file executable, but that would not be very nice...

@copumpkin

Member

copumpkin commented Jul 28, 2016

@knupfer It's not great, but would it be possible to distribute a "permissions spec file" paired with a derivation, specifying file modes out of band? Think of it as a JSON file or whatever format: your tool pulls the contents from IPFS, then applies the file modes to the directory as specified in the spec. The spec could be identified uniquely by the folder it's a spec for.
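
To make that concrete, the spec itself could be as small as a map from paths to file modes, e.g. (format entirely hypothetical):

{
  "configure": "0755",
  "scripts/install.sh": "0755",
  "README": "0644"
}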

@copumpkin

Member

copumpkin commented Jul 28, 2016

In fact, the unit of distribution could be something like:

{
  "contents": "/ipfs/12345",
  "permissions": "/ipfs/647123"
}
@knupfer

knupfer commented Jul 28, 2016

Yep, that would work, albeit it makes it more complicated for the user to add sources to IPFS. But we could, for example, give fetchIPFS an additional URL that isn't on IPFS, and if it fetches from the normal web, automatically generate the permissions file and add that to IPFS... I'll think a bit about it.

@davidak

Contributor

davidak commented Jul 28, 2016

ipfs doesn't save the executable flag

should it? @jbenet

How is ipfs-npm doing it? Maybe it also just distributes tarballs; that is of course not the most elegant solution.

@shlevy added the backlog label Apr 1, 2018

@CMCDragonkai

CMCDragonkai commented May 5, 2018

I'm wondering whether the new Nix 2.0 store abstraction would help here, by adding an IPFS store.

@vcunat

Member

vcunat commented May 5, 2018

For reference, the experiments around https://github.com/NixIPFS found that IPFS isn't able to offer reasonable performance for the CDN part, at least not yet.

@CMCDragonkai

CMCDragonkai commented May 5, 2018

@vcunat

Member

vcunat commented May 5, 2018

I don't remember any definite results, except that it wasn't usable. @mguentner might remember more.

@mguentner

mguentner commented May 5, 2018

@CMCDragonkai No runnable benchmark, just personal experience.
Here you can read about the last deployment:

https://github.com/NixIPFS/infrastructure/blob/master/ipfs_mirror/logbook_20170311.txt

I have no idea how IPFS behaves currently, but I assume that the DHT management traffic is still a problem. Without a DHT you have to manually connect the instances.
Please note that IPFS itself works fine for smaller datasets (<= 1 GiB) but does not compare well against old-timers like rsync (which we used in a second deployment of nixipfs-scripts).

@davidak

Contributor

davidak commented May 5, 2018

@whyrusleeping is aware of these things

He wrote in some issue at the end of 2017:

In general, with each update we've had improvements that reduce bandwidth consumption.

So it might already be "usable" for this use case?

It is still not fixed completely. Here are some related issues to follow.

ipfs/go-ipfs#2828
ipfs/go-ipfs#3429
ipfs/go-ipfs#3065

@parkan

parkan commented Sep 5, 2018

Would love to revive this; is anyone on the Nix side actively involved as of now?

@davidak

Contributor

davidak commented Sep 6, 2018

@parkan I don't think so. The linked IPFS issues in my last comment are still open, so we have to wait for fixes (or get involved there and help resolve them).

@vcunat

Member

vcunat commented Sep 6, 2018

@parkan: as written above, there were severe performance problems with IPFS for our use case. I haven't heard of them being addressed, but I haven't been watching IPFS changes...

@parkan

parkan commented Sep 6, 2018

Gotcha, thanks for the TL;DR 😄

There's ongoing work on improving DHT performance, but the most effective approach will likely involve non-DHT-based content routing. I'll review the work in @NixIPFS to see if there's anything obvious we can do today.

Are there stats somewhere on things like the total number of installed machines, cached binaries, etc.?

@vcunat

Member

vcunat commented Sep 6, 2018

@parkan: there's a list of binary packages for a single snapshot (~70k of them). We have roughly three times that amount at any single moment (stable + unstable + staging), and we probably don't want to abandon older snapshots before a few weeks/months have passed, though subsequent snapshots will usually share most of the count (and size). Overall I'd guess it might be on the order of hundreds of gigabytes of data to keep available at once (maybe a couple of terabytes, I don't know).

I suppose the publishing rate of new data in GB/day would be interesting for this purpose (DHT write throughput), but I don't know how to get that data easily. The "download" traffic would also matter: I expect large amounts, given that a single system update can easily cause hundreds of MB in downloads from the CDN, and GitHub shows roughly a thousand unique visitors to the repo each day (even though by default you download source snapshots directly from the CDN instead of via git).

I'm sure I saw some stats at a NixCon, but I can't find them, and the numbers might have doubled by now. @AmineChikhaoui: any idea whether it's easy to get similar stats from Amazon, or who could know/do that?

@mguentner

mguentner commented Sep 6, 2018

@parkan

  • total number of installed machines: 0 (once 3)
  • cached binaries: 0 (once ~400 GiB, roughly 40 jobsets of nixpkgs)

The project is dead at the moment because no one showed interest. I decided that I won't force something if the community is happy with the AW$ solution.

The @NixIPFS project was also an attempt to free the NixOS project of the AW$ dependency, which seemed really silly and naive to me.

Since a simple rsync mirror already fulfills that requirement, I went ahead with that. However, I found nobody who wanted to commit servers and time.
The idea would have been a setup with mirrorbits (made redundant with Redis Sentinel) and optional geo-DNS. Old issue

Ping me if you need assistance.

@Warbo

Contributor

Warbo commented Sep 6, 2018

I appreciate the scaling issues with serving NARs, etc. over IPFS, but it looks like this "full-blown" approach has derailed the orthogonal issue of making external sources more reliable (described under "What we might start with" in the first comment).

I've certainly encountered things like URLs and repos disappearing (e.g. disappearing TeXLive packages, people deleting their GitHub repos/accounts after the Microsoft acquisition, etc.), which has required me to search the Web for the new location (if one even exists...) and alter "finished" projects to point at these new locations. This is especially frustrating for things like reproducible scientific experiments, where experimental results are tagged with a particular revision of the code, but that revision no longer works (even with everything pinned) because the old URLs have died.

As I see it, there are two problems that look like low-hanging fruit:

The first is to make a fetchFromIPFS function which doesn't require hardcoding an HTTP gateway. This could be as simple as e.g.

fetchFromIPFS = { contentHash, sha256 }: fetchurl {
  inherit sha256;
  url = "https://ipfs.io/ipfs/${contentHash}";
}

This prevents having HTTP gateways scattered all over Nix files, and allows a future implementation to e.g. look for a local IPFS node, which would (a) remove the gateway dependency, (b) use the local IPFS cache and (c) have access to private nodes e.g. on a LAN.

The second issue is that personally, I would like to use a set of sources, a bit like metalink files or magnet links. The reason is that upstream HTTP URLs might be unreliable, but so might IPFS! At the moment, fixed-output derivations offer a false dichotomy: we must trust one source (except for the hardcoded mirrors.nix), so we can either hope that upstream works or force ourselves to reliably host things forever (whether through IPFS or otherwise). Whilst I don't trust upstreams to not disappear, I trust my own hosting ability even less!

I'm not sure how this would work internally, but I would love the ability to say e.g.

src = fetchAny [
  (fetchFromIPFS { inherit sha256; contentHash = "abc"; })
  (fetchurl { inherit sha256; url = http://example.org/library-1.0.tar.lz; })
  (fetchurl { inherit sha256; url = http://chriswarbo.net/files/library-1.0.tar.lz; })
];

The same goes for other fetching mechanisms too, e.g.

src = fetchAny [
  (fetchFromGitHub { inherit rev sha256; owner = "Warbo"; repo = "..."; })
  (fetchgit { inherit rev sha256; url = http://chriswarbo.net/git/...; })
  (fetchFromIPFS { inherit sha256; contentHash = "..."; })  # I also mirror repos to IPFS/IPNS
];

Whilst all of the hash conversion, Hydra integration, etc. discussed in this thread would be nice, simple mechanisms like the above would be a great help to me, at least. I could have a go at writing them myself, if there's consensus that I'm not barking up the wrong tree? ;)

@vcunat

Member

vcunat commented Sep 6, 2018

I don't think it's orthogonal at all. Sources are cached in the CDN as well (only once in a longer while, IIRC). EDIT: maybe only fetchurl-based sources ATM, I think, but those are the vast majority, and it's not a blocker anyway, as they're only store paths again. Current example: NixOS/nixpkgs#46202

@vcunat

Member

vcunat commented Sep 6, 2018

I must admit it's difficult to compete with these CDNs as long as someone pays for/donates them. My systems commonly update at 100 Mb/s with replies in under 5 ms. I'm convinced this WIP has taken lots of effort to get to this stage, but making it close to the CDN would surely take many times more. I personally am "interested" in this, but it's a matter of priorities, and I've been overloaded with Nix stuff that works much worse than the content distribution does...

@Warbo

Contributor

Warbo commented Sep 6, 2018

@vcunat Just to be clear (I can't tell if you were replying to me or not), my thoughts above were mostly concerned with custom packages (of which I have a lot ;)), which have no CDN, etc., rather than "official" things from nixpkgs.

@vcunat

Member

vcunat commented Sep 6, 2018

OK, in this whole issue I've only been considering sources used in the official nixpkgs repository, plus the binaries generated from them by hydra.nixos.org. The ability to seamlessly go beyond that would be nice, but it feels like overstretching my wishlist.

@Warbo

Contributor

Warbo commented Sep 6, 2018

Whoops, never mind; it looks like https://github.com/NixOS/nixpkgs/tree/master/pkgs/build-support/fetchipfs basically does what I described (fetch from a local IPFS node, with an HTTP fallback)!

@CMCDragonkai

CMCDragonkai commented Sep 7, 2018

Just a note: proprietary sources are not cached in the CDN, and I find these tend to break the most. In one instance the source link is not even encoded in nixpkgs (cuDNN), and you're expected to log in to NVIDIA to get it. I did, however, find an automatable link for acquiring cuDNN.

My original goal here was to have transparent IPFS fetching, so you don't need to special-case the fetches; it just works reproducibly, because the first time a fetch is performed, the result gets put into an IPFS node.

@AmineChikhaoui

Member

AmineChikhaoui commented Sep 7, 2018

@vcunat I think @edolstra has a script that generates a few stats/graphs from the binary cache; if I'm not mistaken, the latest was shared at https://nixos.org/~eelco/cache-stats/. I believe it should be possible to generate that again.
Is that what you're looking for?

@vcunat

Member

vcunat commented Sep 7, 2018

I think that's exactly the link I had seen. It's data until December 2017, but that should still be good enough for a rough picture.

Unfree packages aren't cached as a matter of policy; in some cases even distribution of the sources isn't legally allowed by the author. Yes, switching to IPFS would make it possible to decentralize that decision (and the legal responsibility), which might improve the situation from your point of view. But... you can use fetchIPFS for those already ;-) (and convince people to "serve" them via IPFS). I don't expect anyone would oppose switching the undownloadable ones to fetchIPFS in upstream nixpkgs.

@cleverca22

Contributor

cleverca22 commented Sep 7, 2018

@Warbo
https://github.com/NixOS/nixpkgs/blob/082169ab029b4a111309f7d9a795b88e6429222c/pkgs/build-support/fetchurl/default.nix#L38-L43

pkgs.fetchurl already supports a list of URLs and will try each one in order until one returns something, so it's just a matter of generating a call to fetchurl that knows the IPFS hash, the sha256, and the original upstream URL.
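
For example, something along these lines should already work today (gateway URL and hashes are placeholders):

src = fetchurl {
  urls = [
    "https://ipfs.io/ipfs/<IPFS hash of the tarball>"   # tried first
    "https://example.org/library-1.0.tar.gz"            # upstream fallback
  ];
  sha256 = "<usual fixed-output hash>";
};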

@Warbo

Contributor

Warbo commented Sep 7, 2018

@cleverca22 Wow, now that you point it out it's obvious; I've looked through that code so many times, but the ability to give multiple URLs didn't stick in my mind, maybe because I've not used it (because I forgot it was possible... and so on) :P

I've moved my other thoughts to #2408 since they're not specific to IPFS.

@CMCDragonkai

CMCDragonkai commented Sep 12, 2018

Unfree packages aren't cached as a matter of policy; in some cases even distribution of the sources isn't legally allowed by the author. Yes, switching to IPFS would make it possible to decentralize that decision (and the legal responsibility), which might improve the situation from your point of view. But... you can use fetchIPFS for those already ;-) (and convince people to "serve" them via IPFS). I don't expect anyone would oppose switching the undownloadable ones to fetchIPFS in upstream nixpkgs.

I also want to get these standard deep-learning weights into Nixpkgs: https://github.com/fchollet/deep-learning-models/releases

But they are large fixed-output derivations. Weights effectively represent source code, now that more and more deep-learning applications are coming out; libpostal is one example.

Someone on IRC mentioned they shouldn't be cached by Hydra, or something like that. In any case, I want to make use of Nix for scientific reproducibility, and the only way to truly make Nix usable for all of these use cases without bogging down the official Nix caching systems with all our large files is to decentralize the responsibility. So that's another reason IPFS would be important here.

I was wondering whether anyone has considered Dat?


On another note, I previously did some work on getting Hydra integrated with IPFS. To do that we had to look more deeply into IPFS's functionality, specifically its libp2p framework. We have moved on to other things for now, but we have some knowledge of this particular area. For deeper integration between Nix and IPFS beyond just fetchIPFS, feel free to open issues at https://github.com/MatrixAI?utf8=%E2%9C%93&q=libp2p&type=&language=.

@LnL7

Contributor

LnL7 commented Sep 12, 2018

I might be missing something, but I'm not sure what's particularly large about that.

@vcunat

Member

vcunat commented Sep 12, 2018

I see two downloads of over 300 MiB each, so perhaps that. (I don't know the particulars at all.)
