acquiring/releasing lock: Resource deadlock avoided #2207

Closed
nh2 opened this issue Jun 3, 2018 · 20 comments

Comments

@nh2
Contributor

nh2 commented Jun 3, 2018

Nix 2.0.2 as invoked via nixops.

After upgrading the host machine running nixops to 18.03, I hit a failure for the first time, and it now seems to recur nondeterministically for some of the things I'm building:

node-1> building '/nix/store/hwkicwds4avmss6jhjw6ndplmqbhk2kk-busybox-1.28.1.drv'...
node-1> acquiring/releasing lock: Resource deadlock avoided

The problem goes away after running the build a couple of times, because eventually the build succeeds and the result is then in the Nix store.

acquiring/releasing lock: Resource deadlock avoided is apparently a string Google has never seen.

I'm not quite sure what component is emitting it; Resource deadlock avoided seems to be some system error message.

@dtzWill
Member

dtzWill commented Jun 9, 2018

FWIW, the acquiring/releasing lock prefix looks like it comes from here: https://github.com/NixOS/nix/blob/master/src/libstore/pathlocks.cc#L56
(or ~10 lines later).

No clue about why this happens, though-- maybe a cycle in the builders/targets? :/.
Very interesting...

@typetetris
Contributor

Had this too, running the same nix-build in two different shells. Maybe the order in which the dependencies are built is not deterministic? (It is just a wild guess, but that could explain a deadlock, if a lock is acquired for every store path to be built.)

@DzmitrySudnik

Just hit me too. I was running multiple integration tests in parallel, and some of them were using nix-shell with a specific package. One of the tests failed with:

copying path '/nix/store/r8yzw6si82i9h3rg2xgs95pnig4g5ijr-terraform-0.11.8-bin' from 'https://cache.nixos.org'...
acquiring/releasing lock: Resource deadlock avoided
error: build of '/nix/store/4255l52x7fv2y5hsp2z7f1ns85928035-terraform-0.11.8.drv' failed

and the next attempt on a fresh VM worked.

@copumpkin
Member

@edolstra I've seen this too at various random times. Any idea what might cause it?

From eyeballing the man page, it seems like there's a slight chance we're checking for the wrong return value. The code that @dtzWill points out checks == 0 for success and assumes failure otherwise. The man page seems to specify that != -1 is the condition for success, although all live instances I've ever observed have returned 0. Perhaps in some weird corner case it returns nonzero values? Other projects I looked at on GitHub seem to compare against -1 too.

In a few places I've seen some suggestion that the fcntl can return the pid of a process that has the lock on the file, but I don't see that really specified anywhere.

@nh2
Contributor Author

nh2 commented Nov 20, 2018

man fcntl on my Ubuntu 16.04:

RETURN VALUE
       For a successful call, the return value depends on the operation:

       F_DUPFD  The new descriptor.

       F_GETFD  Value of file descriptor flags.

       F_GETFL  Value of file status flags.

       F_GETLEASE
                Type of lease held on file descriptor.

       F_GETOWN Value of descriptor owner.

       F_GETSIG Value of signal sent when read or write becomes possible, or zero for traditional SIGIO behavior.

       F_GETPIPE_SZ, F_SETPIPE_SZ
                The pipe capacity.

       F_GET_SEALS
                A bit mask identifying the seals that have been set for the inode referred to by fd.

       All other commands
                Zero.

       On error, -1 is returned, and errno is set appropriately.

F_SETLKW and F_SETLK aren't in the list, so "All other commands: Zero" should apply for success of this operation.
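To make the two success checks discussed here concrete, here is a minimal Python sketch (not the Nix code). It relies on Python's fcntl.fcntl returning the integer result of the C call when given an integer argument, and raising OSError where C would return -1. The helper names ok_if_zero, ok_if_not_minus_one and sample_return_values are made up for illustration.

```python
import fcntl
import os
import tempfile


def ok_if_zero(ret):
    """The check the thread attributes to pathlocks.cc: success means ret == 0."""
    return ret == 0


def ok_if_not_minus_one(ret):
    """The check suggested from the man page: success means ret != -1."""
    return ret != -1


def sample_return_values():
    """Collect raw fcntl() return values for a few commands on a temp file."""
    results = {}
    with tempfile.TemporaryFile() as f:
        fd = f.fileno()
        # F_GETFD: returns the descriptor flags, which need not be zero.
        flags = fcntl.fcntl(fd, fcntl.F_GETFD)
        results["F_GETFD"] = flags
        # F_SETFD: one of the "all other commands", so success is zero.
        results["F_SETFD"] = fcntl.fcntl(fd, fcntl.F_SETFD, flags)
        # F_DUPFD: returns a new descriptor, so success is some value >= 0.
        dup = fcntl.fcntl(fd, fcntl.F_DUPFD, 0)
        results["F_DUPFD"] = dup
        os.close(dup)
    return results


if __name__ == "__main__":
    for name, ret in sample_return_values().items():
        print(name, ret, ok_if_zero(ret), ok_if_not_minus_one(ret))
```

The two checks can only disagree for commands that return a meaningful value on success, such as F_DUPFD or F_GETFD. If the man page excerpt above is accurate, the lock commands are not among them, so for F_SETLK and F_SETLKW both checks should behave the same.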

@edolstra
Member

More fun facts from the manpage: "The deadlock-detection algorithm employed by the kernel when dealing with F_SETLKW requests can yield ... false positives (EDEADLK errors when there is no deadlock). ... In addition, the kernel may falsely indicate a deadlock when two or more processes created using the clone(2) CLONE_FILES flag place locks that appear (to the kernel) to conflict." Note that threads are created using CLONE_FILES.

BTW edolstra@58d1980 gets rid of POSIX file locks. It might fix this problem as a side-effect. flock() doesn't detect deadlocks.
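If EDEADLK can be a false positive, one conceivable mitigation is to retry the blocking lock a bounded number of times. The sketch below is only an illustration of that idea in Python; it is not what Nix does, and nobody in this thread has tested it against this bug. The function name lock_with_retry and its parameters are hypothetical. fcntl.lockf with LOCK_EX and no LOCK_NB takes a blocking POSIX lock, which is the kind of request the quoted man page passage is about.

```python
import errno
import fcntl
import time


def lock_with_retry(fd, attempts=5, delay=0.1, lock_fn=fcntl.lockf):
    """Take a blocking exclusive lock, retrying only when EDEADLK is reported.

    Returns the number of attempts used. Any other error, or EDEADLK on the
    final attempt, is re-raised to the caller. lock_fn is injectable so the
    retry logic can be exercised without provoking the kernel.
    """
    for attempt in range(attempts):
        try:
            lock_fn(fd, fcntl.LOCK_EX)
            return attempt + 1
        except OSError as e:
            if e.errno != errno.EDEADLK or attempt == attempts - 1:
                raise
            # Possibly a spurious deadlock report: back off briefly and retry.
            time.sleep(delay)
```

Retrying cannot tell a false positive from a real deadlock, so a genuine deadlock would still surface as an error once the attempts run out.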

@nh2
Contributor Author

nh2 commented Nov 20, 2018

Nice, so it looks like what we really need is somebody to try writing a reproducer (perhaps a script that uses nix-shell in parallel, as suggested above), and see if that commit makes it irreproducible.
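A reproducer along those lines could start from a small harness like the following Python sketch. It runs the same command several times in parallel and counts how many runs printed the error text. The names run_parallel and count_deadlock_errors are made up, and the harness has not been shown to trigger the bug.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

# The system error text reported at the top of this issue.
ERROR_TEXT = "Resource deadlock avoided"


def run_parallel(cmd, copies=8):
    """Run `cmd` `copies` times concurrently; return (exit code, output) pairs."""
    def one(_):
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # the error may appear on either stream
            text=True,
        )
        return proc.returncode, proc.stdout

    with ThreadPoolExecutor(max_workers=copies) as pool:
        return list(pool.map(one, range(copies)))


def count_deadlock_errors(results):
    """Count the runs whose combined output contains the error text."""
    return sum(ERROR_TEXT in output for _, output in results)
```

A call might look like run_parallel(["nix-shell", "-p", "terraform", "--run", "true"], copies=32). The package and the count there are only guesses based on the reports above. Since the failure reportedly disappears once the path is in the store, each trial would presumably need a store that does not yet contain it.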

@joshenders

For what it's worth, we've been able to hit this message pretty reliably in our builds with 32 concurrent build agents attempting to use nix. Our installation is multi-user but all the processes are trying to write to a common /nix directory which must use some locking mechanism.

@domenkozar
Member

@joshenders can you test with @edolstra's patch?

@domenkozar
Member

http://0pointer.de/blog/projects/locking.html explains how POSIX locks are not even thread safe, so I'm not entirely convinced that edolstra@effa4be prevents much.

@edolstra
Member

I don't follow, since the patch gets rid of POSIX locks.

@domenkozar
Member

What I mean is, I'm not entirely convinced that we need a schema bump - I'd really like to backport this to the maintenance branch.

I'll try to come up with a way to reproduce this easily and we can try different Nix versions without nix-daemon.

@domenkozar
Member

@edolstra any objections to cherry-picking edolstra@58d1980 to master?

@edolstra
Member

Yes, we can't cherry-pick it because it's a schema change (it also requires edolstra@effa4be).

@edolstra
Member

Also, there is no evidence that edolstra@58d1980 actually fixes this issue.

@joshenders

joshenders commented Jan 16, 2019

I can reliably reproduce this issue in our environment and so I might be able to test 58d1980.

I’ve worked around it temporarily by preventing processes from calling nix concurrently.
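The comment does not say how the serialisation was done. One way to get that effect is a wrapper that holds an exclusive lock on a shared file while the command runs, sketched below in Python. The name run_serialised and the default lock path are hypothetical, and this is not necessarily the workaround actually used.

```python
import fcntl
import subprocess


def run_serialised(cmd, lock_path="/tmp/nix-serialise.lock"):
    """Run `cmd` while holding an exclusive flock() on `lock_path`.

    Other processes using the same wrapper and lock path block until the
    current command has finished, so the wrapped commands never overlap.
    """
    # Append mode creates the file if needed without truncating it.
    with open(lock_path, "a") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)  # blocks until the lock is free
        try:
            return subprocess.run(cmd).returncode
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
```

This only serialises callers that go through the wrapper, and it gives up build parallelism across agents, which fits the description of a temporary workaround.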

@joshenders

Just an update: planning on testing @edolstra's diff early next week. Should I be able to cherry-pick this commit cleanly onto the 2.2.1 tag?

@joshenders

joshenders commented Jan 31, 2019

@edolstra A checkout of 2.2.1 with effa4be and 58d1980 cherry-picked from your repo isn't building cleanly. A build of 2.2.1 without effa4be and 58d1980 is building and testing cleanly. Are there other dependent commits I'm missing? Should I be building directly from your repo?

I'm invoking the build scripts with:
nix-build release.nix -A build.x86_64-darwin

@rickynils
Member

@edolstra @joshenders I also experience problems similar to what is reported here, and I also tried 58d1980, on top of the 2.2.2 tag. Built like this:

  nixpkgs.overlays = [(self: super: {
    nix = self.nixUnstable;
    nixUnstable = super.nixUnstable.override rec {
      name = "nix-2.2_${suffix}";
      suffix = "53bd077_bsd_locks";
      src = super.fetchFromGitHub {
        owner = "evidentiae";
        repo = "nix";
        rev = "53bd077967847e8e85a46f0aa12158b1dc8a2214";
        sha256 = "1n37ang2x5xdxyl2q9yibilxjxsqmsxf9ldxx1mi9388xs84y3rj";
      };
    };
  })];

The compilation part goes through, but the repair.sh test seems to hang. Nix interrupts the build after 3600s of no output. Building without the patch (so, just using the 2.2.2 tag above) works fine.

dtzWill referenced this issue Aug 9, 2019
POSIX file locks are essentially incompatible with multithreading. BSD
locks have much saner semantics. We need this now that there can be
multiple concurrent LocalStore::buildPaths() invocations.
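The difference in semantics that the commit message refers to can be seen from within a single process. POSIX locks belong to the process, while BSD locks belong to the open file description. The Python sketch below opens the same file twice, locks the first descriptor, then tries a non-blocking lock on the second. The function name second_lock_conflicts is made up; fcntl.lockf takes a POSIX (fcntl) lock and fcntl.flock a BSD one.

```python
import errno
import fcntl
import os
import tempfile


def second_lock_conflicts(lock_fn):
    """Lock one descriptor, then try a second descriptor for the same file.

    Returns True if the second, non-blocking attempt is refused, and False
    if it is granted even though the first lock is still held.
    """
    with tempfile.NamedTemporaryFile() as tmp:
        fd1 = os.open(tmp.name, os.O_RDWR)
        fd2 = os.open(tmp.name, os.O_RDWR)
        try:
            lock_fn(fd1, fcntl.LOCK_EX)
            try:
                lock_fn(fd2, fcntl.LOCK_EX | fcntl.LOCK_NB)
                return False  # granted: the two descriptors do not exclude each other
            except OSError as e:
                if e.errno in (errno.EAGAIN, errno.EACCES, errno.EWOULDBLOCK):
                    return True  # refused: the first lock is honoured
                raise
        finally:
            os.close(fd1)
            os.close(fd2)


if __name__ == "__main__":
    print("POSIX lockf conflicts:", second_lock_conflicts(fcntl.lockf))
    print("BSD flock conflicts:  ", second_lock_conflicts(fcntl.flock))
```

On Linux with a local filesystem, I would expect the POSIX case to report no conflict and the BSD case to report one. That would mean two threads of one process cannot exclude each other with POSIX locks, but can with flock().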
@edolstra
Member

edolstra commented Oct 9, 2019

Will close this. Please reopen if anybody sees this issue on Nix >= 2.3.

@edolstra edolstra closed this as completed Oct 9, 2019
9 participants