Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wait for build users when none are available #3564

Merged
merged 4 commits into from May 18, 2020

Conversation

@balsoft
Copy link
Member

balsoft commented May 5, 2020

Solves #1216

Descriptions from there:

It would be nice to have the option of waiting for a build user to become free instead of failing to build.
I'm talking about this error: error: all build users are currently in use; consider creating additional users and adding them to the ‘nixbld’ group
Is there ever a case where failing is the preferred option to waiting? Hopefully waiting can become the default behaviour.

Additional implementation details: when build users are available, the behavior is exactly the same as before (choose the first available user, take a lock on it, use it). However, when no build users are available at the moment, repeat the search every two seconds. Whenever a build user becomes available, choose it and continue as usual. Previous behavior was to error out.

Tested on:

  • NixOS
  • Non-NixOS Linux
  • Darwin
  • other platforms.

Note:
I have tried implementing a more "proper" solution with libfswatch (to only check the userpool when it changes), but for some reason the callback didn't trigger when lock files were released. If you believe that my work on implementing this with libfswatch is of interest, I can publish it.

@domenkozar
Copy link
Member

domenkozar commented May 5, 2020

Thank you a thousand times!

@LnL7
Copy link
Member

LnL7 commented May 5, 2020

I think this also resolves the fact that we make 32 build users in the installer, this was mostly the reason for it.

@balsoft
Copy link
Member Author

balsoft commented May 5, 2020

I think this also resolves the fact that we make 32 build users in the installer, this was mostly the reason for it.

This is my next nixpkgs PR, yes ;)

src/libstore/build.cc Outdated Show resolved Hide resolved
@edolstra
Copy link
Member

edolstra commented May 6, 2020

This is not a good solution because it puts the entire build loop to sleep (since it's single-threaded). So no other builds/substitutions will make any progress until another process releases a build user. In fact, it seems that this will deadlock if this process itself has acquired all build users (since then it can't make progress to release any build users).

For comparison, if we can't acquire a lock on an output path, the DerivationGoal is put to sleep so that other goals can make progress:

    if (!outputLocks.lockPaths(lockFiles, "", false)) {
        worker.waitForAWhile(shared_from_this());
        return;
    }
@balsoft
Copy link
Member Author

balsoft commented May 6, 2020

@edolstra Thanks for the suggestion, I'll implement the same logic.

@balsoft
Copy link
Member Author

balsoft commented May 8, 2020

Please, re-evaluate if possible.

I have changed the behaviour to match that of output locks. However, now #2069 is even more of an issue.

@domenkozar

This comment has been minimized.

Copy link
Member

domenkozar commented on src/libstore/build.cc in 772e5db May 8, 2020

Don't want to increase scope even further, but it would be ideal if there was a way to pass around why it's waiting to print the correct error message.

@balsoft
Copy link
Member Author

balsoft commented May 8, 2020

My testing methodology, for anyone interested:

  1. Run nix-daemon from this version of nix
  2. Create a file
    test-nix.nix
with import <nixpkgs> {};
let newPackage = n: hello.overrideAttrs (_: { name = "hello-${toString builtins.currentTime}-${toString n}"; });
in buildEnv { name = "env"; ignoreCollisions = true; paths = builtins.genList newPackage 100; }
  1. nix build -f test-nix.nix --max-jobs $BUILDUSERS+3 (to test for deadlock mentioned above)
  2. nix build -f test-nix.nix as many times as needed to run out of build users.

If the last two commands succeed and use up all available build users while not crashing and not using users two times over, my patch is correct.

I'd be very grateful if someone tested this and reported the results on non-NixOS linux and/or Mac.

@balsoft
Copy link
Member Author

balsoft commented May 8, 2020

@domenkozar I had a temptation to do so here, but I think it merits a separate PR. I'll create one now.

Actually, I think it's slightly more complex than I anticipated. We need to monitor for changes of the reason for waiting.

@balsoft
Copy link
Member Author

balsoft commented May 8, 2020

I'll come back to #2069 when I have time.

@balsoft
Copy link
Member Author

balsoft commented May 8, 2020

@domenkozar fixed (architecturally) in #3577, will rebase that if/when this PR is merged.

@domenkozar domenkozar requested a review from edolstra May 11, 2020
@@ -530,6 +532,12 @@ class UserLock
UserLock::UserLock()
{
assert(settings.buildUsersGroup != "");
createDirs(settings.nixStateDir + "/userpool");
/* Mark that user is not enabled by default */
uid = 0;

This comment has been minimized.

Copy link
@edolstra

edolstra May 12, 2020

Member

0 seems a risky choice as a sentinel value for UIDs given that 0 == root. So if there's ever any confusion, we might end up doing an operation as/to root. Might be better to use -1 or std::optional<uid_t>.

This comment has been minimized.

Copy link
@edolstra

edolstra May 12, 2020

Member

Alternatively, to ensure that UserLock always represents a valid user, have a builder function that returns std::unique_ptr<UserLock> so it can represent failure to acquire a user.

This comment has been minimized.

Copy link
@balsoft

balsoft May 12, 2020

Author Member

I've thought about adding a function that can represent a failure, and it does appear to be a better solution from some perspectives, but I didn't want to make the diff too big (since this is my first PR). Now that it's inevitably going to be massive, I'll wrap the result in optional and make it represent failure.

@@ -1391,6 +1394,30 @@ void DerivationGoal::tryToBuild()
{
trace("trying to build");

/* If `build-users-group' is not empty, then we have to build as
one of the members of that group. */

This comment has been minimized.

Copy link
@edolstra

edolstra May 12, 2020

Member

If I understand this correctly, it will try to acquire a build user even when we're doing a remote build, which is unnecessary. Maybe it's better to add a new state to the state machine, e.g. startBuild, with a transition from tryToBuild to startBuild if we're doing a local build.

This comment has been minimized.

Copy link
@balsoft

balsoft May 12, 2020

Author Member

Oh yes, this sounds very much like something I didn't even think about.

I'll try to wrap my head around this and implement it next weekend.

Thank you again!

This comment has been minimized.

Copy link
@balsoft

balsoft May 14, 2020

Author Member

Added a state and (hopefully) fixed this.

@balsoft
Copy link
Member Author

balsoft commented May 14, 2020

@edolstra I've (hopefully) fixed both of issues you've found, please look at it again when you have time.

The CI seems to be slightly broken (it can't find a builder for macOS).

I've tested this change on my NixOS desktop as explained above.

@edolstra edolstra merged commit 0ed946a into NixOS:master May 18, 2020
0 of 2 checks passed
0 of 2 checks passed
tests (ubuntu-18.04) tests (ubuntu-18.04)
Details
tests (macos) tests (macos)
Details
@edolstra
Copy link
Member

edolstra commented May 18, 2020

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

5 participants
You can’t perform that action at this time.