enabling ca-derivations can cause deadlock on "waiting for exclusive access to the Nix store for ca drvs..." #6666
Comments
I've worked around this by forcing the upgrade (update nix.conf, force a build of any drv, even if it fails) on machines when they're idle. I didn't get a verified reproduction, but it seems pretty easy to make it deadlock because I still got it several times when I tried without the initial drain. Perhaps if the machine is doing any build at the same time as the upgrade, it will deadlock.
Still happening in 2.8.1
I also encountered this. My hunch is that it's a NixOS issue (NixOS/nixpkgs#177142): Somehow …
There's an interesting tradeoff here. Upgrading the schema on the fly means that just using a new version might make the store backwards-incompatible. OTOH, making that more independent would make things more complex both implementation-wise (because the code would have to deal with old db versions) and for users (because they would have to think about it).
Fwiw it's not :) The db schema change is intentionally just adding new tables/indexes/triggers, so that the old schema is still valid (meaning that it's always possible to transparently roll back to a non-ca world).
The problem is not due to the actual schema change, as far as I can tell, but to the locking. Presumably sqlite doesn't know it's a "safe" change or doesn't support atomic safe updates or whatever, or you wouldn't have needed that lock at all. But since you do need the lock, now you have deadlocks... so I'd say it definitely was risky, in that it permanently wedges up any nix operation until you back out by reverting the config change and restarting the daemon. In our case the backing out is also done via nix, so it's worse than usual when nix itself is the problem.

My first thought was that something as dangerous as a manual global lock and schema upgrade should be done manually and explicitly, but I guess nix has been implicitly upgrading its schema for a long time now, and maybe the real problem is the manual lock part. And nix is full of manual locks, so maybe it's just that they have rules about acquire order or something that got broken for the schema upgrade code path... which is always the thing with manual locks.

Anyway, I wound up recreating a manual upgrade by hand with the explicit drain, add enable-ca, realize a dummy drv, then undrain... so you could say the real problem is still the lock and I was working around it with the drain. So maybe what nix needs is a less error-prone locking mechanism. However, I do feel that crossing a backwards-incompatible line should be done explicitly so we know. Otherwise it's too easy to try a canary to find bugs, find said bugs, but then be stuck again because rollback doesn't work. And rollback is something nix advertises that it is good at!

It's surely more general than just nixos-rebuild, since we don't use that and still saw it. It seems to happen if you are doing any nix operation when it gets added, but I didn't 100% verify that, just observed it. My other thought was that it happened if GC happened to be running, but I later saw it when GC was not running, so maybe not just that.
@thufschmitt I doubt this comes from … Something to avoid this would be killing off new processes rather than waiting for the lock.
I also encountered this while doing nix copy, and killing the remote nix daemon fixed it.
Were you copying to …?
In our case, I copied as a non-root user and the daemon logged the connection (trusted)
Yes. Are we not supposed to do that? Or do we need to set …?
Nope, there's nothing wrong with that in general. It's just unfortunate in that case because I assume the …

Is it calling …?
That might be the case if they were building, but normally they shouldn't keep a lock on the db – these are only grabbed temporarily when needed, and because it's all nicely wrapped in an RAII interface, I have reasonable confidence that they don't leak. But I wouldn't trust that too blindly either, so maybe they were keeping a lock after all. Were these running as root?
@thufschmitt It's calling …
still an issue

how to break your nixos:

1. add ca-derivations to experimental-features in your flake:

```nix
# /etc/nixos/flake.nix
{
  inputs.nixpkgs.url = "github:NixOS/nixpkgs/b00aa8ded743862adc8d6cd3220e91fb333b86d3"; # nixpkgs-unstable-2022-08-13
  outputs = { self, ... }@inputs: {
    nixosConfigurations.laptop1 = inputs.nixpkgs.lib.nixosSystem {
      system = "x86_64-linux";
      specialArgs = {
        inherit inputs;
      };
      modules = [
        ({ pkgs, ... }: {
          # add: ca-derivations
          nix.extraOptions = ''
            experimental-features = nix-command flakes recursive-nix impure-derivations ca-derivations
          '';
        })
      ];
    };
  };
}
```

2. rebuild

```sh
sudo nixos-rebuild switch
```

3. try to rebuild again

```sh
sudo nixos-rebuild switch -v
# waiting for exclusive access to the Nix store for ca drvs...
```

how to fix your nixos:

4. remove ca-derivations from /etc/nix/nix.conf:

```sh
sudo mv /etc/nix/nix.conf /etc/nix/nix.conf.broken
sudo cp /etc/nix/nix.conf.broken /etc/nix/nix.conf
sudo sed -i -E 's/(experimental-features =.*) ca-derivations/\1/' /etc/nix/nix.conf
```

5. revert step 1 = remove ca-derivations from the flake

6. rebuild

```sh
sudo nixos-rebuild switch
```
I ran into this when upgrading to …

I cancelled the stuck command, rebooted and just retried. That worked for me, which fits with @elaforge's theory that the deadlock happens when the machine is doing a build at the same time as the upgrade.

I would also guess that … should not run into this at all unless the new Nix really interferes with itself somehow.
in the LocalStore constructor, a shared lock (lockType = ltRead) is acquired in local-store.cc:264. this succeeds:

```cpp
// nix/src/libstore/local-store.cc
LocalStore::LocalStore(const Params & params)
// line 264
    /* Acquire the big fat lock in shared mode to make sure that no
       schema upgrade is in progress. */
    Path globalLockPath = dbDir + "/big-lock"; // /nix/var/nix/db/big-lock
    globalLock = openLockFile(globalLockPath.c_str(), true);
    if (!lockFile(globalLock.get(), ltRead, false)) {
        printInfo("waiting for the big Nix store lock...");
        lockFile(globalLock.get(), ltRead, true); // this succeeds
    }
```

an exclusive lock (lockType = ltWrite) is acquired in local-store.cc:92. this hangs:

```cpp
// nix/src/libstore/local-store.cc
void migrateCASchema(SQLite& db, Path schemaPath, AutoCloseFD& lockFd)
// line 92
    if (!lockFile(lockFd.get(), ltWrite, false)) {
        printInfo("waiting for exclusive access to the Nix store for ca drvs...");
        lockFile(lockFd.get(), ltWrite, true); // this hangs
    }
```

in theory, the exclusive lock is released (downgraded back to a shared lock) in local-store.cc:162:

```cpp
// nix/src/libstore/local-store.cc
void migrateCASchema(SQLite& db, Path schemaPath, AutoCloseFD& lockFd)
// line 162
    lockFile(lockFd.get(), ltRead, true);
```

not working:

```cpp
// nix/src/libstore/local-store.cc
void migrateCASchema(SQLite& db, Path schemaPath, AutoCloseFD& lockFd)
// line 92
    printInfo("checking exclusive lock");
    if (!lockFile(lockFd.get(), ltWrite, false)) {
        printInfo("checking shared lock");
        if (lockFile(lockFd.get(), ltRead, false)) {
            printInfo("releasing shared lock ...");
            lockFile(lockFd.get(), ltNone, true);
            printInfo("releasing shared lock done");
        }
        printInfo("waiting for exclusive access to the Nix store for ca drvs...");
        lockFile(lockFd.get(), ltWrite, true); // this hangs
    }
```

todo: …
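To make the sequence above easier to play with outside of Nix, here is a minimal sketch, assuming (as the excerpts suggest) that lockFile() is a thin wrapper around flock(2); the file name and everything not quoted from local-store.cc is made up for illustration. The child process stands in for any other store client that took the shared lock in the LocalStore constructor and then just keeps running; the parent then asks for the exclusive lock the way migrateCASchema does and stays blocked until the child lets go.

```cpp
// sketch.cc - illustration only, not Nix code; assumes lockFile() maps onto flock(2).
// Build: g++ -std=c++17 sketch.cc -o sketch && ./sketch
#include <sys/file.h>
#include <sys/wait.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

int main() {
    const char * lockPath = "./big-lock"; // stand-in for /nix/var/nix/db/big-lock
    int fd = open(lockPath, O_RDWR | O_CREAT, 0600);
    if (fd < 0) { perror("open"); return 1; }

    pid_t pid = fork();
    if (pid == 0) {
        // Child: any other store client. It takes the shared lock in the
        // constructor (~ lockFile(globalLock.get(), ltRead, true)) and then
        // simply keeps running, as a daemon or a long build would.
        int cfd = open(lockPath, O_RDWR);
        flock(cfd, LOCK_SH);
        printf("other client: holding shared lock\n");
        sleep(10);
        flock(cfd, LOCK_UN);
        printf("other client: released shared lock\n");
        _exit(0);
    }

    sleep(1);           // let the other client grab its shared lock first
    flock(fd, LOCK_SH); // our own constructor-time shared lock
    printf("migrating client: waiting for exclusive access...\n");
    // ~ lockFile(lockFd.get(), ltWrite, true) in migrateCASchema: this only
    // returns once *no* other process holds the shared lock, i.e. only after
    // the other client above releases it or exits.
    flock(fd, LOCK_EX);
    printf("migrating client: got exclusive lock\n");
    flock(fd, LOCK_UN);
    waitpid(pid, nullptr, 0);
    return 0;
}
```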
I tried reproducing it on a fresh VM, but couldn't manage to do so (apart from manually keeping a Nix process open on the side). Anyone encountering this, can you try running …?
Just got it while trying to upgrade from …
If you can build a self-contained example that exhibits this, that would be truly awesome!
Just ran into this during a regular update of my desktop machine (NixOS 22.11 mainly), had to use the "edit nix.conf" workaround described above.
Ran into this problem again today.
Edit: it seems like PackageKit is holding a lock on it? So, after killing the packagekit process, everything resumed as expected.
I am truly not an expert on this, but it looks like there's a TOC/TOU race between when Nix checks that no database migration is happening, by acquiring a shared lock on big-lock (nix/src/libstore/local-store.cc, lines 264 to 272 at 26c7602), and when it later acts on this check by acquiring an exclusive lock (nix/src/libstore/local-store.cc, lines 92 to 95 at 26c7602).

So if you start two Nix processes (or indeed two processes using the Nix store library, like PackageKit) at the same time, they can both get a shared lock on big-lock; whichever one then tries to run the migration blocks waiting for the exclusive lock, which it can't get while the other keeps its shared lock.

I guess a solution would be to try to get an exclusive lock right away, and only downgrade it to a shared lock after we've done the migration (or we've decided we don't need to do one). Am I making any sense?
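For what it's worth, here is a rough sketch of that ordering (exclusive first, downgrade afterwards), again assuming plain flock(2) underneath; caSchemaNeedsMigration() and runMigration() are hypothetical placeholders, and this is not what local-store.cc currently does.

```cpp
// ordering.cc - sketch of the suggested lock ordering, illustration only.
#include <sys/file.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

// Hypothetical stand-ins for the real checks and migration in local-store.cc.
static bool caSchemaNeedsMigration() { return false; }
static void runMigration() { printf("migrating CA schema...\n"); }

int main() {
    int fd = open("./big-lock", O_RDWR | O_CREAT, 0600);
    if (fd < 0) { perror("open"); return 1; }

    // 1. Take the exclusive lock up front, before settling into the
    //    long-lived shared lock that everyone else will be waiting behind.
    if (flock(fd, LOCK_EX | LOCK_NB) != 0) {
        printf("waiting for exclusive access to the Nix store...\n");
        flock(fd, LOCK_EX);
    }

    // 2. Migrate (or decide nothing needs doing) while holding it.
    if (caSchemaNeedsMigration())
        runMigration();

    // 3. Only now downgrade to the shared lock used for normal operation.
    //    The downgrade is not atomic with flock(2) either, but by this point
    //    another process briefly grabbing the exclusive lock is harmless,
    //    since the schema is already up to date.
    flock(fd, LOCK_SH);

    printf("store open, holding shared lock\n");
    return 0;
}
```

The obvious cost is that every store open would then briefly contend for the exclusive lock, which is presumably why the current code only asks for it when a migration is actually pending.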
There are subtleties though: I was unable to get a deadlock to happen on my machine, even with a very simple C program that just calls flock() twice (shared, then exclusive). On Linux, upgrading the lock is not atomic: the existing shared lock is dropped before the exclusive one is requested, so one of the contending processes ends up winning. This lack of atomicity prevents the deadlock, which makes me wonder if there are situations where it does upgrade the lock atomically. People who experienced this bug: what version of Linux are/were you on? What type of filesystem is your Nix store on?

I will still submit a PR under the assumption that this is what's happening, but it certainly is puzzling...
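For anyone who wants to repeat the experiment, this is roughly the kind of two-call test described above (again assuming the lock in question is a flock(2) lock). Run two copies at once against the same directory: both take the shared lock, both then ask for the exclusive one, and because the kernel drops the old lock before establishing the new one, one of them wins instead of both deadlocking.

```cpp
// upgrade.cc - the "call it twice" test, illustration only; assumes flock(2).
// Build: g++ -std=c++17 upgrade.cc -o upgrade, then run two copies concurrently.
#include <sys/file.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

int main() {
    int fd = open("./big-lock", O_RDWR | O_CREAT, 0600);
    if (fd < 0) { perror("open"); return 1; }

    flock(fd, LOCK_SH); // ~ the constructor's lockFile(..., ltRead, true)
    printf("pid %d: got shared lock\n", (int) getpid());
    sleep(2);           // give the second copy time to take its shared lock too

    printf("pid %d: upgrading to exclusive...\n", (int) getpid());
    // flock(2): converting a lock is not guaranteed to be atomic; the existing
    // lock is removed first and the new one requested afterwards. So by the
    // time both copies reach this point, neither still holds its shared lock,
    // and one of them acquires the exclusive lock.
    flock(fd, LOCK_EX);
    printf("pid %d: got exclusive lock\n", (int) getpid());

    sleep(1);
    flock(fd, LOCK_UN);
    return 0;
}
```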
@milahu could you elaborate on this? Did you confirm that releasing the lock with …
NixOS (22.11 mainly), ZFS
Still happening on Nix 2.17.0. Occurred when running …

Doing a …
Describe the bug
After adding ca-derivations to experimental-features in nix.conf, many servers have all nix operations stuck on "waiting for exclusive access to the Nix store for ca drvs...". It looks like this is a deadlock when upgrading the sqlite schema, but I don't know when it happens. It seems to have happened on most of our servers, though, so it must be easy to trigger. Perhaps it happens when nix-daemon is restarted with the setting added and builds are currently in progress? In any case, everything is then permanently deadlocked at that point, so existing builds completing doesn't fix it.
Aside from the deadlock, tools to inspect the sqlite schema version and status, and possibly to upgrade it independently, would be less error-prone than attempting to upgrade the schema implicitly on the fly. I wasn't expecting that merely enabling a feature would be a risky operation.
nix-env --version output

2.4.1
Additional context
This has been separately reported in NixOS/nixops#1520, but I don't think this is a nixops-only problem.