-
-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make multi-user installer idempotent #7603
base: master
Are you sure you want to change the base?
Conversation
There are instructions at https://nixos.org/manual/nix/stable/contributing/hacking.html#installer-tests for setting up the installer jobs to run in your own fork. |
I guess you'd need to manually test the generated installer or maybe tweak the install test workflow to run the installation twice to test this specific fix. (may also need to add some cleanup to keep it from blocking on some other non-idempotence) |
@abathur Ah, thanks for the pointer! I found that page but it didn't seem to be what I was looking for. I got that setup working, which is of course nice, but running the CI just takes forever. What I was really looking for was this: $ nix-build -A hydraJobs.installerTests.ubuntu-22-04.x86_64-linux.install-force-daemon This allows me to build and test locally automatically. Very convenient. But yes, good point, I have to adapt the CI as well to avoid a regression on this in the future. However, the CI is using the cachix/install-nix action right now, which does a single-user install AND skips the installation if it finds the nix command, so that will be a little more involved. I also saw that there's #7215 which I also ran into with the automated tests, but I think for the purposes of this PR I would just add all the manual steps to the automated test before the second install attempt. |
Alright, I made the multi-user installer fully idempotent now, and the test was successful!
So that's looking good.
I'm not sure if making the github CI-pipeline test for this as well makes sense, as long as hydra is doing it. |
I tested this again, and it does seem like 250c118 from this PR kind-of fixes the Potentially, we could also fully delete /nix/* first (which might also resolve other issues), but I feel this is way too nuclear and has potential to break and/or delete other parts of the installation like per-user profiles and their generations, which ideally should be left alone during a re-install. |
YES! I finally figured it out! In bb0c4b9, this change was made: @@ -741,2 +741,2 @@
_sudo "to copy the basic Nix files to the new store at $NIX_ROOT/store"
\
- cp -RLp ./store/* "$NIX_ROOT/store/"
+ cp -RPp ./store/* "$NIX_ROOT/store/" Which made sure symlinks are copied verbatim and not followed. While technically correct, this created an incompatibility with installs made BEFORE this commit was released. Before, symlinks to directories would be followed and deep-copied. For example, libkrb contains a symlink $ ls -lF /nix/store/5sxcmklgrgl7lsij8bp9a98iws4q8fw0-libkrb5-1.19.2
total 4
dr-xr-xr-x 1 root root 44 Jan 1 1970 bin/
dr-xr-xr-x 1 root root 1108 Jan 1 1970 lib/
lrwxrwxrwx 1 root root 3 Jan 1 1970 sbin/
dr-xr-xr-x 1 root root 10 Jan 1 1970 share/ But when running the post-bb0c4b9 installer, it tries to create this: $ ls -lF /nix/store/5sxcmklgrgl7lsij8bp9a98iws4q8fw0-libkrb5-1.19.2
total 4
dr-xr-xr-x 1 root root 44 Jan 1 1970 bin/
dr-xr-xr-x 1 root root 1108 Jan 1 1970 lib/
lrwxrwxrwx 1 root root 3 Jan 1 1970 sbin -> bin/
dr-xr-xr-x 1 root root 10 Jan 1 1970 share/ Which is not possible and fails with the dreaded message
Now I can confidently say that this PR will fix all the I also found two related outdated issues that I believe were caused by 475fc10, cannot occur since bb0c4b9 and will not regress after this PR. They're basically the opposite problem, trying to overwrite a link with a directory: I will rebase the PR and add a short version of and link to this explanation to the commit that fixes the issue. After that it is ready to merge from my side. |
Fixes NixOS#6679 and all issues that contain `cp: cannot overwrite directory ... with non-directory` errors. These were caused by 475fc10 and bb0c4b9. Or rather, installations after 475fc10 erroneously followed and deep-copied symlinks, which was fixed in bb0c4b9. This meant installations installed with the installer released between these commits had some paths in their nix store with directories where symlinks should have been, causing the fixed installer to try to overwrite them with symlinks. The -n will not overwrite existing files, which is fine inside of the nix-store as identical store paths will have identical content. For additional details and examples, see NixOS#7603 (comment)
This pull request has been mentioned on NixOS Discourse. There might be relevant details there: https://discourse.nixos.org/t/prs-ready-for-review/3032/1778 |
Triaged in the Nix team meeting:
|
This pull request has been mentioned on NixOS Discourse. There might be relevant details there: https://discourse.nixos.org/t/2023-03-03-nix-team-meeting-minutes-37/25998/1 |
Fixes NixOS#6679 and all issues that contain `cp: cannot overwrite directory ... with non-directory` errors. These were caused by 475fc10 and bb0c4b9. Or rather, installations after 475fc10 erroneously followed and deep-copied symlinks, which was fixed in bb0c4b9. This meant installations installed with the installer released between these commits had some paths in their nix store with directories where symlinks should have been, causing the fixed installer to try to overwrite them with symlinks. The -n will not overwrite existing files, which is fine inside of the nix-store as identical store paths will have identical content. For additional details and examples, see NixOS#7603 (comment)
@abathur could you take a look? |
Yes, the GA tests aren't great for velocity--but they're still a big step up from the status quo, and I think it's ideal to lock in whatever gains we can get here with both sets of tests.
I'm pretty sure it uses multiuser and has for a while as long as you're on macOS or have systemd.
Ah. Yes. That's annoying (for our case...). I spent some time trying to find a ~cheap way to force this to work and do have a candidate: abathur@100c3d5#diff-b803fcb7f17ed9235f1e5cb1fcd2f5d3b2838429d4368ae4c57ce4436577f03fR120-R129 The few lines above the highlight are a failed attempt to just overwrite GITHUB_PATH (which had no effect), but removing nix from the default profile did let the cachix action try to go ahead. (There might be a better way to do this?) I added a set of jobs that try this with the current release installer, and a set that try it on top of your PR. Here's the CI run https://github.com/abathur/nix/actions/runs/5329640965 The sanity-check jobs fail where I think we expect them to:
The install-test jobs both also fail, so I'm curious what you think about where they tip over (maybe these are out of scope?):
|
@abathur Thanks for taking the time! Yes, the sanity-checks are exactly what we expect. But actually, the failing installer-test on macOS is also expected. See the issues I linked in the initial post:
This is what was fixed by adding the -n flag to cp. The unit file on linux I don't know about, I didn't test that. I assume it happens here, on line 101, but can't really explain why. I also searched the issue tracker, and it seems not a single person has ever reported this issue before. |
Fixes NixOS#6679 and all issues that contain `cp: cannot overwrite directory ... with non-directory` errors. These were caused by 475fc10 and bb0c4b9. Or rather, installations after 475fc10 erroneously followed and deep-copied symlinks, which was fixed in bb0c4b9. This meant installations installed with the installer released between these commits had some paths in their nix store with directories where symlinks should have been, causing the fixed installer to try to overwrite them with symlinks. The -n will not overwrite existing files, which is fine inside of the nix-store as identical store paths will have identical content. For additional details and examples, see NixOS#7603 (comment)
I am not sure I understand why this failure is expected. The latter set of tests includes your code (https://github.com/abathur/nix/blame/100c3d5400f315eec04cfdbeb99a20a51b66932c/scripts/install-multi-user.sh#L833), so it doesnt appear that the new flag stops the error? |
Ah sorry, I misunderstood. Yeah I guess. I checked the installer the pipeline downloaded from your cachix, and it seems to indeed be the correct version, so I have no idea why it doesn't work in CI but works when running the hydra tests locally. |
I tested this installer on a spare MacBook with an existing nix 2.13.2 install and it too failed at the cp step with overwrite directory errors. I'm not very familiar with the vm tests, so I don't have much of a mental model for why they might be differing. Are you sure the macOS variant ran? |
This will fix #6679 and #7215.
Additionally, this PR
cp: cannot overwrite directory ... with non-directory
errors