Nix and IPFS #859

Open
vcunat opened this issue Mar 24, 2016 · 161 comments
@vcunat commented Mar 24, 2016

(I wanted to split this thread off from #296 (comment).)

Let's discuss the relationship with IPFS here. As I see it, what would mainly be appreciated is a decentralized way to distribute Nix-stored data.

What we might start with

The easiest usable step might be to allow distribution of fixed-output derivations over IPFS. Those are paths that are already content-addressed, typically by a (truncated) sha256 over either a flat file or a tar-like dump of a directory tree; more details are in the docs. These paths are mainly used for compressed tarballs of sources. This step by itself should avoid lots of problems with unstable upstream downloads, assuming we can convince enough nixers to serve their files over IPFS.
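For illustration, such a source is pinned today by the sha256 that nix-prefetch-url reports, and the very same tarball could additionally be published and fetched over IPFS (the URL and hashes below are placeholders):

$ nix-prefetch-url https://example.org/foo-1.0.tar.gz
<sha256 that goes into fetchurl { sha256 = "..."; }>
$ ipfs add foo-1.0.tar.gz
added <ipfs-hash> foo-1.0.tar.gz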

Converting hashes

One of the difficulties is that we use different kinds of hashes than IPFS does, and I don't think it would be good to require converting those many thousands of hashes in our expressions. (Note that it's infeasible to convert among those hashes unless you have the whole content.) IPFS people might best suggest how to work around this. I imagine we want to "serve" a mapping from the hashes we use to IPFS's hashes, perhaps realized through IPNS. (I don't know the details of IPFS's design, I'm afraid.) There's the advantage that one can easily verify the nix-style hash at the end, after obtaining the path in whatever way.
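As a rough sketch of that mapping idea (the /ipns layout below is purely hypothetical, not an existing service), a client could resolve the Nix-style hash to an IPFS object and still verify the usual sha256 afterwards:

$ nixhash=<sha256 from the Nix expression>
$ cid=$(ipfs cat /ipns/<mapping-root>/sha256/$nixhash)   # look up the corresponding IPFS hash
$ ipfs get /ipfs/$cid -o foo-1.0.tar.gz                  # fetch, then check the sha256 as usual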

Non-fixed content

If we get that far, it shouldn't be too hard to manage distributing everything via IPFS, since for all other derivations we use something we could call indirect content addressing. To explain, let's look at how we distribute binaries now – our binary caches. We hash the build recipe, including all of its recipe dependencies, and we inspect the corresponding narinfo URL on cache.nixos.org. If our build farm has built that recipe, that file contains various information, mainly the hashes of the contents of the resulting outputs of that build and cryptographic signatures of them.
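For reference, the lookup today looks roughly like this (hash and field values are placeholders, fields abbreviated):

$ curl https://cache.nixos.org/<hash-part-of-store-path>.narinfo
StorePath: /nix/store/<hash>-hello-2.10
URL: nar/<filehash>.nar.xz
Compression: xz
FileHash: sha256:<hash of the compressed NAR>
NarHash: sha256:<hash of the uncompressed NAR>
References: <store paths the outputs refer to>
Sig: cache.nixos.org-1:<signature>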

Note that this narinfo step just converts our problem to the previous fixed-output case, and the conversion itself seems very reminiscent of IPNS.

Deduplication

Note that Nix-built stuff has significantly greater-than-usual potential for chunk-level deduplication. Very often we rebuild a package only because something in a dependency has changed, so only very minor changes are expected in the results, mainly just exchanging the references to runtime dependencies as their paths have changed. (On rare occasions even the lengths of the paths change.) There's great potential to save on that during distribution of binaries, which would be realized by implementing the section above, and even some potential for saving disk space in comparison to our current way of hardlinking equal files (see the next section).

Saving disk space

Another use might be to actually store the files in an FS similar to what IPFS uses. That seems a more complex and tricky thing to deploy; e.g. I'm not sure anyone trusts the implementation of that FS enough yet to have the whole OS running off it.

It's probably premature to speculate too much on this use ATM; I'll just note that I can imagine having symlinks from /nix/store/foo to /ipfs/*, representing the locally trusted version of that path. (That works around the problems related to making /nix/store/foo content-addressed.) Perhaps it could start as a per-path opt-in, so one could move only the less vital paths out of /nix/store itself.


I can personally help with bridging the two communities in my spare time. Not too long ago, I spent many months researching various ways to handle "highly redundant" data, mainly from the point of view of theoretical computer science.

@ehmry commented Mar 24, 2016

I'm curious what the most minimal way would be to associate store paths with IPFS objects while interfering as little as possible with IPFS-unaware tools.

@vcunat commented Mar 24, 2016

I described such a way in the second paragraph from the bottom. It should work with IPFS and the Nix store as they are, perhaps with some script that would move the data, create the symlink, and pin the path in IPFS to avoid losing it during GC. (It could be unpinned when Nix deletes the symlink during GC.)
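A minimal sketch of such a script, assuming an IPFS mount at /ipfs (paths are placeholders):

$ cid=$(ipfs add -r -q /nix/store/<hash>-foo | tail -n1)   # copy the tree into the IPFS blockstore; adding also pins it
$ rm -r /nix/store/<hash>-foo
$ ln -s /ipfs/$cid /nix/store/<hash>-foo                   # the locally trusted version now sits behind the symlink
# when Nix garbage-collects the symlink: ipfs pin rm $cid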

@ehmry commented Mar 24, 2016

I was thinking about storing store objects in something that wouldn't require a daemon, but of course you can't have everything.

@Ericson2314 commented Mar 24, 2016

@vcunat Great write-up! More thoughts on this later, but one thing that gets me is the tension between wanting incremental goals and avoiding work we don't need long term. For example, it will take some heroics to use our current hashing schemes, but for things like dedup and the intensional store we'd want to switch to what IPFS already does (or much closer to that) anyway.

Maybe the best first step is a new non-flat/non-NAR hashing strategy for fixed-output derivations? We can slowly convert nixpkgs to use that, and get IPFS mirroring and dedup in the fixed-output case. Another step is using git tree hashes for fetchgit. We already want to do that, and I suspect IPFS would want that too for other users. IPFS's multihash can certainly be heavily abused for such a thing :).

@Ericson2314 commented Mar 24, 2016

For me the end goal should be only using IPNS for the derivation -> build map. Any trust-based compatibility map between hashing schemes long term makes the perfectionist in me sad :).

@vcunat commented Mar 24, 2016

For example it will take some heroics to use our current hashing schemes, but for things like dedup and the intensional store we'd want to switch to what IPFS already does (or much closer to that) anyways.

I meant that we would "use" some IPFS hashes but also utilize a mapping from our current hashes, perhaps run over IPNS, so that it would still be possible to run our fetchurl { sha256 = "..." } without modification. Note that it's these flat tarball hashes that most upstreams release and sign, and that's not going to change anytime soon; moreover, there's not much point in trying to deduplicate compressed tarballs anyway. (We might choose to use uncompressed sources instead, but that's another, partially independent decision I'm not sure about.)

@Ericson2314 commented Mar 24, 2016

For single files / IPFS blobs, we should be able to hash the same way without modification.

@Ericson2314 commented Mar 24, 2016

But for VCS fetches we currently do a recursive/nar hash right? That is what I was worried about.

@Ericson2314 commented Mar 24, 2016

@ehmry I assume it would be pretty easy to make the Nix store an immutable FUSE filesystem backed by IPFS (hopefully such a thing already exists). Down the road I'd like to have package references and the other things currently in the SQLite database also backed by IPFS: they would "appear" in the FUSE filesystem as specially-named symlinks/hard-links/duplicated sub-directories. "referrers" is the only field I'm aware of that would be a cache on top. Nix would keep track of roots, but IPFS would do GC itself, in the obvious way.

@cleverca22 commented Apr 8, 2016

One idea I had was to keep all outputs in NAR format and have the FUSE layer dynamically unpack things on demand; that could then be used with some other planned IPFS features to share a file without copying it into the block storage.

Then you get a compressed store and don't have to store two copies of everything (the NAR for sharing and the installed tree).

@nmikhailov commented Apr 8, 2016

@cleverca22 yeah, I had the same thoughts about that; it's unclear how much this would impact performance though.

@cleverca22 commented Apr 8, 2016

We could keep a cache of recently used files in a normal tmpfs and relay reads over to that to boost performance back up.

@davidar commented Apr 8, 2016

@cleverca22 another idea that was mentioned previously was to add support for NAR to IPFS, so that we can transparently unpack it as we currently do with TAR (ipfs tar --help).

@Ericson2314 commented Apr 8, 2016

NAR sucks though---no file-level dedup we could otherwise get for free. The above might be fine as a temporary step, but Nix should learn about a better format.

@davidar commented Apr 9, 2016

@Ericson2314 another option that was mentioned was for Nix and IPFS (and perhaps others) to try to standardise on a common archive format

@Ericson2314 commented Apr 9, 2016

@davidar Sure, that's always good. For the shortish term, I was leaning towards a stripped-down unixfs with just the attributes NAR cares about. As far as Nix is concerned this is basically the same format, just with a different hashing scheme.

@Ericson2314 commented Apr 9, 2016

Yeah, looking at Car, it seems to be both an "IPFS schema" over the IPFS Merkle DAG (unless it just reuses unixfs), and an interchange format for packing the DAG into one binary blob.

The former is cool, but I don't think Nix even needs the latter (except perhaps as a new way to fall back on HTTP etc. if IPFS is not available, while using a compatible format). For normal operation, I'd hope Nix could just ask IPFS to populate the FUSE filesystem that is the store given a hash, and everything else would be transparent.

@cleverca22 commented Apr 9, 2016

https://github.com/cleverca22/fusenar

I now have a NixOS container booting with a FUSE filesystem at /nix/store, which mmaps a bunch of .nar files and transparently reads the requested files.

@knupfer commented Jul 20, 2016

What is currently missing for using IPFS? How could I contribute? I really need this feature for work.

@knupfer commented Jul 20, 2016

Pinging @jbenet and @whyrusleeping because they are only mentioned on the old issue.

@copumpkin commented Jul 20, 2016

@knupfer I think writing a fetchIPFS would be a pretty easy first step. Deeper integration will be more work and require touching Nix itself.
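To give an idea of the scope, the builder of such a fixed-output fetchIPFS would roughly boil down to (the gateway address and hashes are placeholders, nothing settled):

$ curl -fL "http://127.0.0.1:8080/ipfs/<cid>" -o "$out"   # fetch via a local IPFS gateway, with a public gateway as fallback

Nix would then verify the declared sha256 of $out exactly as it already does for fetchurl.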

@knupfer commented Jul 28, 2016

Ok, I'm working on it, but there are some problems. Apparently IPFS doesn't save the executable flag, so stuff like stdenv doesn't work, because it expects an executable configure. The alternative would be to distribute tarballs instead of directories, but that would be clearly inferior because it would exclude deduplication at the file level. Any thoughts on that? I could make every file executable, but that would not be very nice...

@copumpkin commented Jul 28, 2016

@knupfer it's not great, but would it be possible to distribute a "permissions spec file" paired with a derivation, which specifies file modes out of band? Think of it as a JSON file (or whatever format) that your thing pulls from IPFS and then applies the file modes to the contents of the directory, as specified in the spec. The spec could be uniquely identified by the folder it's a spec for.

@copumpkin commented Jul 28, 2016

In fact, the unit of distribution could be something like:

{
  "contents": "/ipfs/12345",
  "permissions": "/ipfs/647123"
}
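A hedged sketch of consuming such a pair, assuming (purely for illustration) that the spec maps relative paths to modes, e.g. {"configure": "755"}:

$ ipfs get /ipfs/12345 -o ./out                    # the contents, without modes
$ ipfs cat /ipfs/647123 \
    | jq -r 'to_entries[] | "\(.value) \(.key)"' \
    | while read -r mode path; do chmod "$mode" "./out/$path"; done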
@knupfer commented Jul 28, 2016

Yep, that would work, although it makes it more complicated for the user to add sources to IPFS. But we could, for example, give an additional URL in the fetchIPFS which wouldn't be in IPFS, and if it fetches from the normal web, automatically generate the permissions file and add that to IPFS... I'll think a bit about it.

@davidak commented Jul 28, 2016

ipfs doesn't save the executable flag

should it? @jbenet

how is ipfs-npm doing it? maybe it also just distributes tarballs. that is of course not the most elegant solution.

@kamadorueda commented Nov 9, 2020

If it's not but it's available on a binary cache, stream it from the binary cache to the user AND add it to the user IPFS node.

I missed this bit. Currently this can't be done by hitting the gateway directly. However I wonder if it would just be easier to have a cron job that adds the current store to IPFS every once and a while instead of a proxy? However either solution would be good.

Also if we are doing this how does the user publish this info? Just uploading the nar isn't enough to let other people use it.

For the moment users are just Mirroring binary caches over IPFS.

In other words, users download the data they need from either the upstream binary cache or the nearest IPFS node that has it.

The upstream binary cache is what holds the .narinfo files (small metadata files), and the distributed IPFS swarm (other people) is what holds the nar.xz files (potentially big content files).

Users CAN'T announce store paths at their discretion (security and trust problems; it's hard to implement, though possible).

Users can only announce and receive from peers store paths that are in the upstream binary cache (cache.nixos.org, etc). If it exists on the upstream binary cache then it's trusted.

I think we can start implementing this read-only proxy; the benefits are HUGE, mainly in cost savings for all involved parties and in speed of transfer. This benefits both the nixos/cachix infrastructure and end users.

Implementing a write-proxy is possible (hard, but possible); I just think it's better to go step by step, solving problems and adding value every day. Start with the smallest possible change that changes things for the better.
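For what it's worth, from the client's point of view such a read-only proxy could look like just another substituter in nix.conf (the local address is an assumption, not a settled design):

# /etc/nix/nix.conf
substituters = http://127.0.0.1:8080 https://cache.nixos.org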

@kevincox commented Nov 9, 2020

It sounded like this was being done by the client, right? This is just data you already have on disk. And since the Nix store is immutable, IPFS doesn't even need to copy the data.

I think even creating an ipns directory of all keys in the cache (and regularly updating it) is a bit too much by itself.

I doubt it. The amount of work per narinfo is based on the depth of the IPFS directory tree. IPFS can easily store thousands of directory entries in a single block, so the depth is logarithmic with that base. This means that while the amount of work will grow over time, it will still be relatively small.

The slightly more concerning number may be that the NixOS project may want to host all of those narinfo files forever. This will likely require something slightly more complicated than just pinning the tree; however, we currently pay for all of the narinfo and nar files on S3, so I can't imagine that it is much worse.

I would love to see info on the total size of narinfo files in the current cache.nixos.org.

@kevincox commented Nov 9, 2020

The upstream binary cache is who has the .narinfo files

Ah, so this is just proxying the narinfo requests? The doc isn't very clear on the difference between how the narinfo vs the nar are handled.

If you are just proxying the narinfo you can do something very cool. You can just transform the url parameter to point at the user's preferred gateway. (I'm assuming that that field supports absolute URLs, if not it shouldn't be that hard to add).
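Concretely, the proxy would only need to rewrite one line of the narinfo, something like (values are placeholders):

# before (relative to the upstream cache)
URL: nar/<filehash>.nar.xz
# after (pointing at the user's preferred gateway)
URL: http://127.0.0.1:8080/ipfs/<cid>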

Then your proxy doesn't even see the nar requests. (And performance becomes mostly a non-issue).

Furthermore, if this becomes widespread then we can at some point start publishing all the narinfos (pointing to IPFS) directly and remove the need for the proxy altogether. This also allows people to publish their own caches via IPFS without needing to serve HTTP at all.

@kamadorueda commented Nov 9, 2020

From the NixOS team's perspective, they pay for S3 storage + data transfer.

If we implement the proxy as I propose it, the NixOS team would spend the same on S3 storage, but less on data transfer, because some objects would be fetched by clients from other clients in the IPFS network instead of from S3 (or CloudFront).

Basically, users become a small CDN server for the derivations they use, care about, and have locally.

There is no need for pinning services; $0 cost for it.

Users benefit from speed, binary caches benefit from cost savings, win-win; the added cost is the time it takes us (the volunteers) to create such software: https://github.com/kamadorueda/nix-ipfs

@iavael commented Nov 9, 2020

@kevincox wouldn't the IPNS approach require listing all the keys of the binary cache for every cache update? I don't think there are just thousands of them; it's more likely there are much, much more.
And I haven't even touched on the properties of IPFS and its scalability. First of all, is it practical to create a listing of millions of keys (or even dozens/hundreds of millions) for every cache update with a cron job?

@kamadorueda commented Nov 9, 2020

The upstream binary cache is who has the .narinfo files

Ah, so this is just proxying the narinfo requests? The doc isn't very clear on the difference between how the narinfo vs the nar are handled.

  • nix requests for the narinfo files always go to the upstream binary cache
  • nix requests for the nar.xz files go to the upstream binary cache OR another peer that has that nar.xz file, and the user then becomes a peer for that nar.xz file

Is it clearer now?

This is because nar.xz files are content-addressed, but narinfo files are not. IPFS is content-addressed, and that's why this is possible with nar.xz but not with narinfos.

@kevincox commented Nov 9, 2020

wouldn't ipns approach require to list all keys of binary cache for every cache update

No, you can do incremental updates. It is just a tree, and you don't need to recompute unchanged subtrees. (Although currently the implementations that do this are not the best. I think we can use the go-ipfs mutable filesystem API, since the scale of narinfos is small. In the future we may need to implement something new, but that shouldn't be too hard.)
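A minimal sketch with the go-ipfs MFS commands (paths and names are illustrative):

$ ipfs files mkdir -p /cache
$ cid=$(ipfs add -q <hash>.narinfo)                  # add the new or changed narinfo
$ ipfs files cp /ipfs/$cid /cache/<hash>.narinfo     # splice it into the mutable tree; unchanged subtrees are reused
$ root=$(ipfs files stat --hash /cache)
$ ipfs name publish /ipfs/$root                      # point the IPNS name at the new root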

@kevincox commented Nov 9, 2020

nix requests for the narinfo files go to the upstream binary cache always

However, IIUC we need to proxy the request so that we can modify it to point the url field at the proxy. (Although I guess since most caches use relative URLs we don't actually change anything, but in theory we would need to for non-relative URLs.)

nix requests for the nar.xz files go to the upstream binary cache OR another peer that has that nar.xz file, and the user then becomes a peer for that nar.xz file

That makes sense. One thing to be aware of here is the timeout for when the file isn't on IPFS yet. This may result in more fetches, but otherwise the user could be left waiting forever.

@kamadorueda commented Nov 9, 2020

That makes sense. One thing to be aware of here is the timeout for when the file isn't on IPFS yet. This may result in more fetches but otherwise the user could be left there waiting forever.

Sure, this one is easy! Thanks

@lordcirth commented Nov 10, 2020

Yes, you'll want a short timeout on the IPFS lookup. If something doesn't exist, it can take a long time for IPFS to decide that by default - you can't really prove it doesn't exist, you just have to decide when to give up. Since you have a good fallback, the best user experience is to give up much more quickly than normal. However, if I understand correctly, fetching the file from cache.nixos.org still results in adding the file to IPFS for future users, right?

@kamadorueda commented Nov 10, 2020

I just updated the document taking into account everything you guys said! The change has so many deltas that I think it's faster to read it all again

https://github.com/kamadorueda/nix-ipfs/blob/latest/README.md


Yes, you'll want a short timeout on the IPFS lookup. If something doesn't exist, it can take a long time for IPFS to decide that by default - you can't really prove it doesn't exist, you just have to decide when to give up. Since you have a good fallback, the best user experience is to give up much more quickly than normal. However, if I understand correctly, fetching the file from cache.nixos.org still results in adding the file to IPFS for future users, right?

yes, that's right! you may want to read this section (added a few minutes ago) https://github.com/kamadorueda/nix-ipfs/blob/latest/README.md#implementing-the-local-server

@kevincox commented Nov 10, 2020

We turn this FileHash into an IPFS CID by calling a remote translation service

I'm pretty sure there is no need for a translation service. You can just decode and re-encode the hash.

The only other nit is that you hardcode the assumption that nars live at nar/*, which I don't think is required.

@kamadorueda commented Nov 10, 2020

We turn this FileHash into an IPFS CID by calling a remote translation service

I'm pretty sure there is no need for a translation service. You can just decode and re-encode the hash.

Man, I did the math trying to translate the nix-sha256 into the IPFS CID and couldn't :(

I think I couldn't do it because the CID stores the hash of the merkle-whatever-techy-thing-composed-of-chunked-bits-with-metadata-and-raw-data-together instead of the nix-sha256 of the raw data only

so nix_sha256_to_ipfs_cid(nix_sha256_hash_as_string) is not possible in terms of math operations. It's possible in terms of OS/network commands if we download the entire data in order to ipfs add it and get the merkle-whatever hash (but that defeats the purpose of the entire project)

If you have any idea on this, please tell us! Of course that translation service is something I'd prefer not to develop (and pay for), but it seems needed for now

The only other nit is that you hardcode the assumption that nars live at nar/*, which I don't think is required.

That's true, although nothing to worry about for now, I think. If we follow the URL field of the .narinfo everything will be ok

@kamadorueda commented Nov 10, 2020

We turn this FileHash into an IPFS CID by calling a remote translation service

I'm pretty sure the is no need for a translation service. You can just decode and re-encode the hash.

If it's not possible in terms of math only (I wish I were wrong), something really helpful that would save us the translation service would be having a new field for the IPFSCID in the .narinfo.

In such a case I think nix-copy-closure should be modified to add: IPFSCID = $(ipfs add -q --only-hash <.nar.xz>) (this just hashes; it stores nothing on the host)

@kamadorueda commented Nov 10, 2020

$ nix-hash --type sha256 --to-base16 17g1n8hxhq7h5h4jh0vy15pp6l1yyy1rg9mdq3pi60znnj53dzzz
ffff368ab4f60313efc0ada69783f73e50736f097e0328092cf060d821b2e19d

$ sha256sum 17g1n8hxhq7h5h4jh0vy15pp6l1yyy1rg9mdq3pi60znnj53dzzz.nar.xz 
ffff368ab4f60313efc0ada69783f73e50736f097e0328092cf060d821b2e19d  17g1n8hxhq7h5h4jh0vy15pp6l1yyy1rg9mdq3pi60znnj53dzzz.nar.xz

$ ipfs add -q 17g1n8hxhq7h5h4jh0vy15pp6l1yyy1rg9mdq3pi60znnj53dzzz.nar.xz
QmPW7pVJGdV4wkANRgZDmTnMiQvUrwy4EnQpVn4qHAdrTj

https://cid.ipfs.io/#QmPW7pVJGdV4wkANRgZDmTnMiQvUrwy4EnQpVn4qHAdrTj

base58btc - cidv0 - dag-pb - (sha2-256 : 256 : 1148914FBEEBDBB92D2DEC92697CFA76D7D36DA30339F84FCE76222941015BA2)

ipfs sha256: 1148914FBEEBDBB92D2DEC92697CFA76D7D36DA30339F84FCE76222941015BA2
nix  sha256: ffff368ab4f60313efc0ada69783f73e50736f097e0328092cf060d821b2e19d

The IPFS hash is the hash of a data structure composed of metadata and linked chunks; the Nix hash is just the hash of the raw content.


@kevincox commented Nov 11, 2020

Ah shoot, you are right. The file will at least have the proto wrapper. And it gets more complicated if the file is multiple blocks in size (which it probably is). I think I was confused by the IPFS git model because it has isomorphic hashes. However, it appears that it doesn't really work; it just breaks for files larger than a block. I guess I'll sleep on it and see if there is something clever we can do.

In such a case I think nix-copy-closure should be modified to add: IPFSCID = $(ipfs add -q --only-hash <.nar.xz>) (this just hashes; it stores nothing on the host)

Of course this forces the chunking strategy to be the current default. It would probably be better to use variable length hashing. (This is probably something worth adding to the current design). But either way encoding the CID without actually pinning the file to IPFS or somehow indicating the chunking method will probably result in issues down the line.

@Ericson2314 commented Nov 11, 2020

I do still hope my idea at the bottom, https://discuss.ipfs.io/t/git-on-ipfs-links-and-references/730/24, will work. It could work for NARs too (modern IPFS under the hood cares more about the underlying multihash than the multicodec part of the CID).

@kamadorueda commented Nov 11, 2020

In such a case I think nix-copy-closure should be modified to add: IPFSCID = $(ipfs add -q --only-hash <.nar.xz>) (this just hashes; it stores nothing on the host)

Of course this forces the chunking strategy to be the current default.

this one can be specified

-s, --chunker string - Chunking algorithm, size-[bytes], rabin-[min]-[avg]-[max] or buzhash. Default: size-262144.

so maybe adding another field to the .narinfo, IPFSChunking = size-262144, could work

this way the ipfs add can be reproduced on any host, past or future
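Spelled out, the command any host could rerun to reproduce the CID might look like this (the file name is a placeholder, and nothing is stored thanks to --only-hash):

$ ipfs add -q --only-hash --chunker=size-262144 <hash>.nar.xz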

from a user perspective the ipfs get will work for any chunking strategy

It would probably be better to use variable length hashing. (This is probably something worth adding to the current design). But either way encoding the CID without actually pinning the file to IPFS or somehow indicating the chunking method will probably result in issues down the line.

Maybe, yes; someone who reads the .narinfo could be tempted to think the file is pinned/stored somewhere on the IPFS swarm and then discover it's not.

At the end of the day I think this is kind of intended behaviour: everyone knows data can be available, and then not! Only data that people care about remains over time.

@kevincox commented Nov 11, 2020

so maybe adding another field to the .narinfo: IPFSChunking = size=262144, could work

Yeah, I think that would be a necessary addition if we are going to do that.

from a user perspective the ipfs get will work for any chunking strategy

Yes, but my understanding is that this proposal relies on users uploading the nar. And if they can't upload the nar and end up with the same hash, no one will ever be able to download it from IPFS.

At the end of the day I think this is kind of intended behaviour: everyone knows data can be available, and then not! Only data that people care about remains over time.

This is an okay guarantee if we want to keep the fallback forever. However, it would be nice if this were a solution that could potentially replace the current cache. (Of course an HTTP gateway would be provided for those not using IPFS natively.)


I'm starting to wonder if this is the best approach. What about something like this:

  1. Publish a feed of narinfo files published to cache.nixos.org.
  2. Write a service that consumes this feed and:
     1. Implements an IPFS store that is backed by HTTP, and uses this to advertise the hash.
     2. Publishes its own narinfo files that point at IPFS (the URL field is modified to /ipfs/{hash}).
        • For now this could be any sort of storage.
        • Eventually it would be nice to use a directory published in IPNS.

The obvious downside is that the service itself will use more bandwidth as it needs to upload the nar files (hopefully only occasionally). It also requires writing an IPFS Store HTTP backend that doesn't yet exist (AFAIK).

The upsides are:

  1. The cache could transparently be made self-standing. By hosting the nar files directly we can remove S3 from the equation (if one day this becomes the most popular solution).
  2. No extra software to run on the client. The client only needs to run an IPFS node. (Or use a public gateway, but it's probably best to encourage running your own node.)
  3. Could be run "fully decentralized" with the IPNS directory. (Although we need to have someone publishing it)
  4. The "uploader" is the one writing the hash so there is no concern with chunking. It can be changed at any time and be fine.

I think the thing I like about this is that it is simple to the user. It just looks like we are hosting a cache over IPFS. They don't need to worry about proxies and explicitly uploading files that they downloaded.


It is probably also worth pointing out that the long-term solution is probably to drop nar files altogether and just upload the directories themselves to IPFS. I think all of the proposals could work fine with this, you just need to add a field to the narinfo saying that it is a directory rather than an archive. However this would require much bigger client-side changes and would not be directly accessible over HTTP gateways. So I think that is a long way off.

@lordcirth commented Nov 11, 2020

I think chunking should be set to Rabin; if the majority of these packages are going to be uploaded by this implementation anyway, there is little downside to being non-standard. Rabin is more advanced and should save on incremental update sizes. Though maybe that doesn't apply to compressed files?

@kevincox commented Nov 12, 2020

It depends on how you do the compression. gzip has the --rsyncable option, but the xz CLI doesn't appear to have it. Basically, if you reset the compression stream every once in a while (ideally using some rolling hash yourself), the hashes can sync up.

Of course the ideal IPFS chunking would just reuse the compression chunking but I don't think that is supported by the current implementation (and would need to be custom for each compression format).

So I guess the answer is that as of today it probably won't help much.

@ohAitch commented Nov 12, 2020

You could always do the https://bup.github.io/ thing and store the explicitly-chunked-into-files compressed version in IPFS, especially if the chunks are small enough that you know they won't be separately chunked by IPFS itself.

@lordcirth commented Nov 12, 2020

I guess if there is no big win, then it is best to stick with IPFS defaults. Incremental dedup between versions isn't the main reason we want IPFS anyway.

@kamadorueda commented Dec 26, 2020

Guys let me introduce you to a beta of CachIPFS, an encrypted (and private) read/write Nix binary cache over IPFS

https://4shells.com/docs#about-cachipfs

Sure it has a lot of rough edges, but it is an MVP that works, so let me know your thoughts!


Demo.sh
Demo GIF

@bbigras commented Dec 30, 2020

@kamadorueda I tried CachIPFS twice. The first run took a long time (which I guess is normal) but the second run still took over 11 hours. Is that normal? I was publishing my nix-config.

I wasn't running 4s cachipfs daemon since in the demo it looks like it's only used for fetching.

@kamadorueda commented Dec 30, 2020

I think having faster second and later executions is a must

The current algorithm is very naive:

  • create a temporary directory
  • do nix copy the /nix/store/path you want to publish to the directory
  • encrypt the files
  • add them to IPFS
  • publish them to the CachIPFS coordinator

But naive is slow
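Roughly, in shell terms (the encryption step below is a placeholder command, not necessarily what CachIPFS actually does):

$ tmp=$(mktemp -d)
$ nix copy --to "file://$tmp" /nix/store/<hash>-foo              # dump the closure as a local binary cache (nar/ + .narinfo files)
$ for f in "$tmp"/nar/*; do encrypt-with-secret-key "$f"; done   # placeholder for the encryption step
$ ipfs add -r -q "$tmp" | tail -n1                               # add the whole cache directory and get the root CID
# ...then register the resulting CID with the CachIPFS coordinator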

We'll definitely improve it, thanks for the feedback @bbigras, I'm taking note! 😄


Does someone have a use case beyond publishing/retrieving from a private cache? We'd love to hear it

@bbigras commented Dec 31, 2020

Does someone have a use case beyond publishing/retrieving from a private cache? We'd love to hear it

If both a stranger and I publish our nix-configs with CachIPFS, could both caches be used by the two of us?

Maybe another similar use case would be: if a lot of people are using CachIPFS and they are all using the same channel (let's say unstable), it could be nice and efficient for all of them to share the same stuff.

If the files are trustable, that is. I haven't read everything on CachIPFS yet.

@kamadorueda commented Jan 2, 2021

If both a stranger and I publish our nix-configs with CachIPFS, could both caches be used by the two of us?

Yes! As long as both of you use the same CachIPFS API token

Every account has associated:

  • an api token
  • a secret key for encryption/decryption
  • a private binary cache that can be accessed with the api token and encrypted/decrypted with the secret key

We'll have the ability to rotate those secrets soon. If many machines (you + your friend) use the same api token, they use the same encryption keys, and upload/retrieve from the same private binary cache

This is the private layer of CachIPFS, it requires trust but this is actually a feature (we don't want untrusted people to read/modify our data)

Maybe another similar use case would be: if a lot of people are using CachIPFS and they are all using the same channel (let's say unstable), it could be nice and efficient for all of them to share the same stuff.

If the files are trustable, that is. I haven't read everything on CachIPFS yet.

I'm thinking about this one; this would be the CachIPFS public layer, in which all Nix users share the binary cache with all other Nix users. This creates a distributed binary cache over IPFS (this is my dream and purpose)

The problem is security. An attacker can place a virus under /nix/store/gqm07as49jn3gqmxlxrgpnqhzmm18374-gcc-9.3.0 and upload it to the binary cache. If someone else requires gcc, they download the virus instead of gcc. This is why trust is very important: you only want to fetch data from people you can trust (not attackers)

But trust can be negotiated in many ways:

  • users can decide which users to share data with. Sounds like the CachIPFS private layer? Well, it is
  • with algorithms: https://www.tweag.io/blog/2020-12-16-trustix-announcement/, which is not very different from the CachIPFS private layer, but we find this solution very cool; we just need to find a way to guarantee to the network, with 100% confidence, that they are downloading legit Nix store paths (I take security very seriously)
  • by implementing cryptographic protocols that make trust unnecessary, like git: https://blog.ipfs.io/2020-09-08-nix-ipfs-milestone-1/. This is by far the ideal solution, but the implementation is hard and it may be many months/years until we can offer a good user experience to the community; also, it may not be possible to have 100% content-addressability in all cases (https://www.tweag.io/blog/2020-11-18-nix-cas-self-references/), and those cases will require trust anyway

This is a very exciting topic, we are thinking about it every day

In the big picture, CachIPFS can be defined as a let's-implement-something-useful-with-the-things-we-have-today

@bbigras commented May 12, 2021

Any progress on IPFS now that there's the call to test Content-addressed Nix?

@Ericson2314 commented May 12, 2021

@bbigras We have not done any more IPFS work lately, because the implementation was basically complete and the main blocker is consensus around merging. But rest assured, all the recent work polishing content-addressed Nix builds upon the foundation for CA that we laid with @regnat and @edolstra last summer while working on IPFS × Nix, and a win for content-addressed Nix is a win for IPFS × Nix.

I have merged master into our outstanding PRs from time to time; maybe it's time for me to do that again.
