
Cargo.lock considered harmful #327063

Open
Atemu opened this issue Jul 14, 2024 · 41 comments
Labels
6.topic: hygiene 6.topic: rust 9.needs: maintainer feedback significant Novel ideas, large API changes, notable refactorings, issues with RFC potential, etc.

Comments

@Atemu
Member

Atemu commented Jul 14, 2024

Introduction

I've been doing a little investigation into the impact of Cargo.lock files because, if you run ncdu against a Nixpkgs checkout, they're usually the largest individual files you come across, and Rust packages are frequently at the top of any given sub-directory.

AFAICT the functionality to import Cargo.lock has existed since May 2021. Usage has exploded since:

$ for ver in 21.05 22.05 23.05 24.05 ; do echo -n "nixos-$ver " ; git checkout --quiet NixOS/nixos-$ver && fd '^Cargo.lock$' . | wc -l ; done
nixos-21.05 4
nixos-22.05 15
nixos-23.05 208
nixos-24.05 316

Measurements

Next I measured the total disk usage of all Cargo.lock files combined:

$ for ver in 21.05 22.05 23.05 24.05 ; do echo -n "nixos-$ver " ; git checkout --quiet NixOS/nixos-$ver && fd '^Cargo.lock$' . -X du -b | cut -f 1 | jq -s 'add' ; done
nixos-21.05 156292
nixos-22.05 113643
nixos-23.05 12505511
nixos-24.05 24485533

24MiB!

Realistically though, anyone who cares about space efficiency in any way will use compression, so I measured again with each Cargo.lock compressed individually:

$ for ver in 21.05 22.05 23.05 24.05 ; do echo -n "nixos-$ver " ; git checkout --quiet NixOS/nixos-$ver && fd '^Cargo.lock$' . -x sh -c 'gzip -9 < {} | wc -c' | jq -s 'add' ; done
nixos-21.05 38708
nixos-22.05 30104
nixos-23.05 3075375
nixos-24.05 5986458

Further, evidence in #320528 (comment) suggests that handling Cargo.lock adds significant eval overhead. Eval time for Nixpkgs via nix-env is ~28% lower if parsing/handling of Cargo.lock files is stubbed.

Analysis

Just ~300/116231 packages (~0.25%) make up ~6MiB of our ~41MiB compressed nixpkgs tarball, which is about 15% in relative terms (18.5KiB per package).
For comparison, our hackage-packages.nix containing the entire Hackage package set (18191 packages) is ~2.3MiB compressed (133 Bytes per package).
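The per-package figures quoted above can be rechecked with a quick calculation (a sketch using the 24.05 and Hackage numbers from the measurements above):

```python
# Recomputing the per-package figures, using the numbers measured above.
cargo_locks_gz = 5_986_458        # bytes: all Cargo.lock files in 24.05, gzipped
n_rust_pkgs = 316
hackage_gz = 2.3 * 1024 * 1024    # bytes: hackage-packages.nix, compressed
n_haskell_pkgs = 18_191

per_rust = cargo_locks_gz / n_rust_pkgs / 1024   # KiB of lockfile per Rust package
per_haskell = hackage_gz / n_haskell_pkgs        # bytes per Haskell package
print(f"{per_rust:.1f} KiB vs {per_haskell:.0f} B per package")
# → 18.5 KiB vs 133 B per package
```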

Breaking down eval time by package reveals that each Cargo.lock takes on average about 76.67 ms to handle/parse.

Discussion

I do not believe that this trend is sustainable, especially given the likely increasing importance of Rust in the coming years. If we had one order of magnitude more Rust packages in Nixpkgs and assumed the same amount of data per package as we currently observe, the Rust packages alone would take up ~54 MiB compressed.

If nothing is done, I could very well see the compressed Nixpkgs tarball bloat beyond 100MiB in just a few years.

Extrapolating eval time does not paint a bright picture either: If we assume one order of magnitude more Cargo.lock packages again, evaluating just those packages would take ~4x as long as evaluating the entire rest of Nixpkgs currently does.

This does not scale.

Solutions

I'm not deep into Rust packaging, but I remember vendorHash being the predominant pattern a few years ago. It did not have any of these issues, as it's just one 32-byte hash literal per package.

Would it be possible to revert to using vendorHashes again?

(At least for packages in Nixpkgs, having Cargo.lock support available for external use is fine.)

What else could be done to mitigate this situation?

Limitations/Future work

Files were compressed individually, adding gzip overhead for each lockfile. You could create a tarball out of all Cargo.lock files and compress it as a whole to mitigate this effect.
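That per-file overhead can be illustrated with a self-contained toy example (hypothetical synthetic lockfile data, not actual Nixpkgs measurements): compressing similar files in one stream lets the compressor reuse its dictionary, while per-file compression pays the gzip header and a cold dictionary every time.

```python
import gzip

# A repetitive toy lockfile, standing in for the many similar Cargo.lock files.
lock = 50 * b'[[package]]\nname = "libc"\nversion = "0.2.155"\nsource = "registry"\n'

individually = 3 * len(gzip.compress(lock, 9))  # three files, each gzipped alone
combined = len(gzip.compress(3 * lock, 9))      # the same data in a single stream
assert combined < individually
print(individually, combined)
```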

I found some Cargo.lock files that have a different name or a prefix/suffix; these were not considered.


CC @NixOS/rust

@Atemu Atemu added 6.topic: rust 6.topic: hygiene 9.needs: maintainer feedback significant Novel ideas, large API changes, notable refactorings, issues with RFC potential, etc. labels Jul 14, 2024
@nixos-discourse

This issue has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/cargo-lock-considered-harmful/49047/1

@superherointj
Contributor

To add to the evidence:
#221716 (comment)

@emilazy
Member

emilazy commented Jul 14, 2024

I don’t think this should be prohibited entirely because I think it’s the only option if upstream doesn’t provide a Cargo.lock or it is somehow inadequate.

@diogotcorreia
Member

I'm currently including a Cargo.lock in a package I maintain (pgvecto-rs) because upstream uses git dependencies and vendorHash does not work with that last time I checked. I'd love to be able to stop including Cargo.lock in nixpkgs, so let me know if there is an alternative.

@superherointj
Contributor

I don’t think this should be prohibited entirely because I think it’s the only option if upstream doesn’t provide a Cargo.lock or it is somehow inadequate.

When the wrong thing is made too easy, only the wrong thing gets done.
We should therefore encourage working with upstream first.
For cases where that fails, we need to discuss the correct process.

@emilazy
Member

emilazy commented Jul 14, 2024

I don’t think blocking Nixpkgs on upstream packaging issues is a viable approach, especially ones that primarily only affect us. We wouldn’t have a package set at all. In this case, many upstreams are actively unwilling to maintain a Cargo.lock file in version control.

@patka-123
Contributor

The same goes for composer.lock files for php packages. There are soooo many upstreams that will never accept a lockfile, no matter how much you talk with them.

Perhaps we can still allow lockfiles, with the recommendation that you first talk to upstream to see if they are willing to accept it there. And reduce the tarball size slowly that way?

@Atemu
Member Author

Atemu commented Jul 14, 2024

@emilazy I don't think @superherointj suggested doing that. They suggested prioritising upstream collaboration and only falling back to in-tree workarounds when collaboration fails, with a proper process in place for that fallback.

I don't think we need to eradicate Cargo.lock entirely myself. There are likely edge-cases where anything else is simply impossible. It needs to be the exception rather than the norm though; something that 30 packages might do, not 300.


@patka-123 w.r.t. composer.lock and friends, also see #327064.


I have two further thoughts on possible solutions:

  1. Consider having a rustPackages set for deps, similar to haskellPackages or pythonPackages if technologically feasible. It'd be a lot more work but would make for much higher quality packages and we'd never run into issues like this.
  2. Consider generating lockfiles ourselves but storing them elsewhere. We could create a separate repo which would house lockfiles for projects where upstream lockfiles don't exist. In nixpkgs, we'd simply fetchurl from there. We don't actually need the lockfiles to be in the same repo as version granularity should be enough; we just need an immutable lockfile for a certain version of the upstream software somewhere. This could actually be a cross-distro effort as we're certainly not the only ones relying on lockfiles for r13y (any GUIX readers here?).

@emilazy
Member

emilazy commented Jul 14, 2024

I’m all for upstream collaboration – I’ve opened like 7 upstream PRs during Nixpkgs work in the past month or so – but from my experience of upstream responsivity, I don’t think we can viably have a workflow like “work with upstream, then work around when that fails”. A lot of the time upstreams just take ages to even acknowledge an issue or PR, and even when they do it can take several rounds to achieve mutual understanding and consensus. “Work with upstream, apply a workaround in the meantime, then remove it if upstream collaboration succeeds” is much better for Nixpkgs maintainers and users. I think that making sure that workarounds get removed in a timely fashion when no longer necessary, and aren’t introduced unnecessarily in the first place, are more effective sites of intervention.

@superherointj
Contributor

A simple way to avoid carrying a lockfile is to ask upstream to add one.

My proposal:

  • When making a PR that adds a lockfile, we should open an issue upstream and reference it in a comment near the lockfile code. If upstream accepts the lockfile, the package maintainer should update the package to remove it from Nixpkgs. If not, we need some other solution. (Which is unclear at the moment.)

Can we agree with this?

@emilazy
Member

emilazy commented Jul 14, 2024

I think that is a reasonable enough expectation for packages with non‐dormant upstreams that haven’t previously expressed an opinion on lock files, when there is no other obstacle to removing the lock file (e.g. Git dependencies or us needing to extensively patch it for our own purposes), yeah.

Note that Cargo upstream used to explicitly recommend committing lock files for applications but not libraries, but they have since changed their guidance and it is now considerably more equivocal. So we don’t really have anything slam‐dunk that we can point people to here.

@Atemu
Member Author

Atemu commented Jul 14, 2024

Consider generating lockfiles ourselves but storing them elsewhere. We could create a separate repo which would house lockfiles for projects where upstream lockfiles don't exist. In nixpkgs, we'd simply fetchurl from there. We don't actually need the lockfiles to be in the same repo as version granularity should be enough; we just need an immutable lockfile for a certain version of the upstream software somewhere. This could actually be a cross-distro effort as we're certainly not the only ones relying on lockfiles for r13y (any GUIX readers here?).

I've written this out more thoroughly in an amendment in #327064

@emilazy
Member

emilazy commented Jul 14, 2024

I think that the Rust stuff can’t process a fetcher‐produced Cargo.lock because it’d be IFD. I’m not sure how it works in the case where you just use a cargoHash, though, and I could be mixing it up with one of the third‐party Nix Rust build solutions.

@lucasew
Contributor

lucasew commented Jul 14, 2024

What if Nix were able to import files from inside zip files?

Flakes and such could then download zip files instead of the extracted tree, and Nix could load files from inside the zip without unpacking. It wouldn't solve the problem, but it could be a little easier on the IOPS, and thus probably more usable on slower storage.

@emilazy
Member

emilazy commented Jul 14, 2024

See the perpetual‐work‐in‐progress lazy trees work. It’s a hard problem, unfortunately, so we shouldn’t hold our breath for it.

@Aleksanaa
Member

I haven't studied the problem of git dependencies carefully, but couldn't this be solved by using some script (instead of cargo) to parse Cargo.lock at build time (instead of eval time)?

@Aleksanaa
Member

The original plan in #217084 was to replace every cargoHash with Cargo.lock, but this was never implemented. That PR also lists some benefits of migrating to Cargo.lock.

@minijackson
Member

minijackson commented Jul 14, 2024

I think I remember people recommending parsing the Cargo.lock file instead of using cargoHash, in order to have an easier time scanning for known vulnerabilities in dependencies.

For example, with nix path-info -r --derivation '.#gnvim-unwrapped' you can see without building anything which source dependencies it has.

But with nix path-info -r --derivation '.#bingrep' you just get a bingrep-0.11.0-vendor.tar.gz.drv.

Is this something we want to abandon, and either have ecosystem-specific tools, or wait for something like recursive Nix?

@alyssais
Member

We don't need all the data in the Cargo.lock file. (We could drop the dependencies array that every package has.)

@Mic92
Member

Mic92 commented Jul 14, 2024

The other big reason we vendor Cargo.lock is git dependencies. Is there maybe a way to make them work without dependency vendoring?

@alyssais
Member

Eventually, the way I'd like this to work would be that we have deduplication as well — some big file that defines every version of a Cargo package required for Nixpkgs, that we'd only have to process once. We could even eventually deduplicate semver-compatible packages, since the Rust ecosystem is very good about this. This would mitigate the problem of relying on upstream to update for dependency crates for security fixes.

But this would require some tooling to keep that file up to date when adding / updating a package. The quicker fix would be to remove the dependency information as suggested above, which would just require doing the deletion, and modifying the code that checks the Cargo.lock file matches to allow this.

@Mic92
Member

Mic92 commented Jul 14, 2024

We don't need all the data in the Cargo.lock file. (We could drop the dependencies array that every package has.)

This would help with filesystem size, but does it help with RAM usage?
Is the RAM usage coming from that, or from the many small derivations that get created from these crate sources?

@emilazy
Member

emilazy commented Jul 14, 2024

I like the idea of One Gigantic Lock, but a directory with one package per file would probably be better than one file, even if less efficient on disk, because we won’t be fighting Git merges constantly. And of course we’d probably still want per‐package local additions/overrides for things like random Git fork dependencies.

@Aleksanaa
Member

some big file that defines every version of a Cargo package required for Nixpkgs, that we'd only have to process once.

If I remember correctly, Flutter currently does this in part, which also allows us to patch these dependencies (otherwise that would become very troublesome).

The quicker fix would be to remove the dependency information as suggested above, which would just require doing the deletion, and modifying the code that checks the Cargo.lock file matches to allow this.

This is how Arch Linux handles Go, and it seems to be the same for Rust. However, the problem is that people in the Rust ecosystem are already accustomed to vendoring, which may undermine our assumptions about version compatibility.

@Mic92
Member

Mic92 commented Jul 14, 2024

Yeah, we have prior art with big files with pkgs/development/node-packages/node-packages.nix, which is not really fun to update.

@alyssais
Member

This is how Arch Linux handles Go, and it seems to be the same for Rust. However, the problem is that people in the Rust ecosystem are already accustomed to vendoring, which may undermine our assumptions about version compatibility.

I think you misunderstand. Deleting the version information from Cargo.lock files in Nixpkgs would not change what versions of dependencies are used. The real Cargo.lock file would still be used in the derivation — we just don't need that information at eval time, because the Cargo.lock file would still contain a list of all the packages required for the build, just not which ones depended on which other ones.
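A minimal sketch of what that stripping could look like (an assumption on my part: a line-based filter over the Cargo.lock v3 layout, where each dependencies array spans multiple lines; a real implementation would want a proper TOML round-tripper):

```python
def strip_dependencies(lock_text: str) -> str:
    """Drop the per-package `dependencies = [...]` arrays from a Cargo.lock,
    keeping every [[package]] stanza's name/version/source/checksum."""
    out, skipping = [], False
    for line in lock_text.splitlines():
        if skipping:
            # Array entries are indented; the closing bracket is flush left.
            if line.rstrip() == "]":
                skipping = False
            continue
        if line.startswith("dependencies = ["):
            # Cargo.lock v3 writes this array across multiple lines.
            skipping = True
            continue
        out.append(line)
    return "\n".join(out)
```

The output still lists every package with its version and checksum, so the eval-time consumer loses nothing it actually needs; only the dependency edges are gone.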

@diogotcorreia
Member

The other big reason we vendor Cargo.lock are git dependencies. Is there maybe a way to make them work without dependency vendoring?

@Mic92 I've just tried using the Cargo.lock from src on a random package that has git dependencies and it works (the package builds successfully at least). Is there any drawback for doing this?

i.e. do

cargoLock.lockFile = "${src}/Cargo.lock";

instead of

cargoLock.lockFile = ./Cargo.lock;
Full diff
diff --git a/pkgs/applications/misc/faircamp/Cargo.lock b/pkgs/applications/misc/faircamp/Cargo.lock
deleted file mode 100644
index deeaca6b86be..000000000000
--- a/pkgs/applications/misc/faircamp/Cargo.lock
+++ /dev/null
@@ -1,3169 +0,0 @@
-# This file is automatically @generated by Cargo.
-# It is not intended for manual editing.
-version = 3

// -snip-

diff --git a/pkgs/applications/misc/faircamp/default.nix b/pkgs/applications/misc/faircamp/default.nix
index b243dccf9734..9e359f370aea 100644
--- a/pkgs/applications/misc/faircamp/default.nix
+++ b/pkgs/applications/misc/faircamp/default.nix
@@ -27,7 +27,7 @@ rustPlatform.buildRustPackage rec {
   };
 
   cargoLock = {
-    lockFile = ./Cargo.lock;
+    lockFile = "${src}/Cargo.lock";
     outputHashes = {
       "enolib-0.4.2" = "sha256-FJuWKcwjoi/wKfTzxghobNWblhnKRdUvHOejhpCF7kY=";
     };

@Aleksanaa
Member

Aleksanaa commented Jul 14, 2024

I've just tried using the Cargo.lock from src on a random package that has git dependencies and it works (the package builds successfully at least). Is there any drawback for doing this?

That's IFD (import from derivation), which is not allowed in Nixpkgs because Nix 2.3 doesn't support it (?) and it would introduce extra overhead on Hydra. If you commit this code to Nixpkgs, CI will fail immediately.

@MostAwesomeDude
Contributor

I've talked with enough upstreams to be fully convinced that they will intentionally make the wrong choice about Cargo.lock. In particular, many of them think of a false "library" vs "binary" dichotomy based on poor advice that used to be enshrined in Cargo's FAQ, and while the live FAQ no longer has this exact wording, it's still a very popular meme among Rust crate authors. We can't just think of ourselves and hope that upstreams will come around to our way of thinking.

@lucasew
Contributor

lucasew commented Jul 14, 2024

Flutter packages today convert pubspec.lock to JSON; the same thing could be done for Rust. By the way, flet-client-flutter is living proof that the update can be done automatically. Which is better: generating and importing Nix files with the data directly, or using something like lib.importJSON?

@emilazy
Member

emilazy commented Jul 14, 2024

In fairness we don’t really care about lock files for libraries, as far as I know; i.e. it’s only our leaf‐ish application packages that consume Cargo.locks.

(IMO importJSON > import every time.)

@natsukium
Member

Also, as people realize the benefits of maturin, more and more Cargo.lock files will be vendored in Python packages.

@Mic92
Member

Mic92 commented Jul 15, 2024

Yeah, we have prior art with big files with pkgs/development/node-packages/node-packages.nix, which is not really fun to update.

I thought about this for longer, and maybe we can set this up in a way that doesn't suck as much, though it will still require some engineering. Say we had a way for users to submit a Cargo.lock file to some sort of pastebin prior to submitting the update to Nixpkgs, and then we had tooling that could work like this:

  # stored in a separate file so it's easy to discover with find without having to
  # evaluate nix code
  cargoDeps = fetchFromNixpkgsCargoLocks ./nix-cargo.lock;

./nix-cargo.lock

sha256-aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

For projects that do have a Cargo.lock in their own source tree, we wouldn't need the pastebin; we could instead specify how to get the Cargo.lock from the source package.

In the context of the pull request this would use IFD, but when we merge it to Nixpkgs, we could use something like a merge queue that converts the IFD into a data structure stored in Nixpkgs:

 all-cargo-locks.nix
 # unclear how to deduplicate version information, but one thing we could do is just
 # storing all versions that we do need per crate in one file.
 crates/li/libc.json
 crates/ni/nix.json
 crates/to/tokio.json
 ...

If the file is 100% auto-generated, we should not have issues with merge conflicts as we can just delete the whole thing.

We would also need to modify our CI tooling to allow IFD from this service that provides our Cargo.lock files.

@Mic92
Copy link
Member

Mic92 commented Jul 15, 2024

Of course, if we had computed derivations as in NixOS/rfcs#92,
then we could make this whole thing a lot more elegant and compute these crate fetcher derivations at build time instead of eval time.

@dani0854
Contributor

dani0854 commented Jul 15, 2024

I did a similar analysis of Cargo.lock files out of curiosity a few weeks ago,
but I focused more on the git repository itself rather than on tarballs.

git rev-list --objects --filter=object:type=blob origin/master | git cat-file --batch-check='%(objectsize:disk) %(rest)' | awk '{n=split($2,a,"/"); arr[a[n]]+=$1} END {for (i in arr) print arr[i], i}' | sort --numeric-sort --reverse | numfmt --to=iec-i --suffix=B --padding=7 --round=nearest | head -n 50

Turns out that Cargo.lock is far from the worst offender with regard to git repository size; hackage-packages.nix is the worst.

 787MiB hackage-packages.nix
 477MiB all-packages.nix
 222MiB default.nix
 164MiB python-packages.nix
  92MiB node-packages.nix
  60MiB perl-packages.nix
  35MiB maintainer-list.nix
  31MiB generated.nix
  31MiB melpa-generated.nix
  26MiB cran-packages.nix
  22MiB recipes-archive-melpa.json
  18MiB Cargo.lock
  12MiB melpa-stable-generated.nix
  11MiB yarn.lock

Though I do agree that with regard to tarballs, it's a different story.

@matklad
Member

matklad commented Jul 15, 2024

Eventually, the way I'd like this to work would be that we have deduplication as well — some big file that defines every version of a Cargo package required for Nixpkgs

Thinking from first principles, it seems the way this should work is that Nixpkgs should define a _sub_set of crates.io packages and use that set as a registry for Cargo:

[source.crates-io]
replace-with = "nixpkgs-flavored-crates-io"

[registries.nixpkgs-flavored-crates-io]
index = "/nix/store/..."

So, when building a Rust project, the upstream Cargo.lock is ignored, and a new Cargo.lock is generated instead, based on the Nixpkgs crates.io registry replacement.

The problems here:

  • Now Nixpkgs is on the hook for updating every Rust library, as opposed to just the leaf applications. Automation is mandatory here.
  • Every Rust application would then depend on this registry. So updating a library which isn't used by an app would still change the app's inputs and, consequently, cause a rebuild of the app. Ideally this is solved with content-addressed derivations, such that we only check that the generated Cargo.lock stays the same, even if the registry as a whole changed, and then skip busting the hashes for the application.
  • Then there's a philosophical question of whether we should package stuff as upstream intended, or whether we, as a distribution, should unify dependencies across everything we package.

@alyssais
Member

We used to do this and stopped. I don't know why.

@PedroHLC
Member

For now, can we detect the cases in which this is being used without need? An empty outputHashes might be a good indicator, but I've seen projects that don't even provide their own Cargo.lock in-repo (like https://github.com/sched-ext/scx).

If we could throw errors on completely unnecessary usages (suggesting cargoHash instead), we could slow the growth of this burden until we reinvent the wheel with something more efficient…

@alyssais
Member

but I've seen projects that don't even provide their own Cargo.lock in-repo (like https://github.com/sched-ext/scx).

For these, as long as there are no git dependencies, there's still no need to read Cargo.lock from Nix with all the eval penalty that imposes. We can just copy the file into the source tree, which should be very low overhead. And that would mean we can treat empty outputHashes as a robust indicator. (Though there should be an option for out-of-tree projects to allow it.)

@emilazy
Member

emilazy commented Jul 16, 2024

Does copying it into the tree in postPatch or whatever work with the fixed‐output derivation mechanism? I understand that patching the Cargo files has to be done separately because of that?

@alyssais
Member

You have to use cargoPatches or some other mechanism that works with the FOD. (There's no real reason we shouldn't just apply the whole patch phase in the FOD, except that at first we didn't so now we can't change it without breaking hashes.)
