Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
nixos xkb-layouts-exist: try to debug on Hydra
I hate having to do this. We're unable to reproduce the problem locally.
- Loading branch information
2014db3
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW below would appear to fix the failure for me:
From my observations the failure here is non-deterministic as I also had trouble reproducing this initially. After many attempts I managed to reproduce it somewhat reliably however it still behaves somewhat erratically with build randomly succeeding or failing.
For reasons I don't fully understand the above patch makes the build consistent for me, I just completed 100 runs without a failure:
With patch reverted:
It seems this might be an issue in nix itself as in the failing cases it appears to be killing the builder process before it exits on its own, hence the
signal 9 (Killed)
error. The process gets killed after producing full output (touch $out
) but before it fully finishes executing. It is unclear to me as to why that occurs and why the behavior is random.EDIT I'm testing on top of current unstable + xkb commit.
2014db3
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe the
exec
in top-level process confuses nix daemon.2014db3
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
exec
confusing things was my hypothesis as well but again not sure as to why :-(. Based on output produced this could be related to this comment and the assumption that the process has already exited, that for some reason doesn't hold in this case.2014db3
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I don't see why
exec
should cause EOF, but I'm not so skilled in POSIX/bash details.2014db3
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess @pbogdan is on the right track here, this looks like some kind of race condition.
In theory this should simulate something similar what @pbogdan has tried but I couldn't reproduce it :-/
@pbogdan: Which Nix version did you test this on?
2014db3
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, it doesn't seem deterministic even on Hydra.
2014db3
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vcunat: The patch from @pbogdan however has the problem that if the grep fails the build process is just failing without the error message. I'm going to push a similar patch in a minute.
2014db3
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@aszlig this is on
nix-env (Nix) 1.12pre5413_b4b1f452
2014db3
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pbogdan: Okay, I've tested it on the some version, maybe my machine is not loaded enough to trigger the race condition. Could you try whether you can also reproduce it with the one-liner I mentioned?
2014db3
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The early exit with
exec
wasn't exactly readable to me, too, though perhaps it's about getting used to such stuff.2014db3
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@aszlig :
Behavior is random, I can get few iterations to complete but eventually it fails before completing a full run.
2014db3
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pushed in e82d126.
2014db3
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this clearly was just a workaround. I really think we should fix this in Nix, but we first need to figure out why exactly this happens. @pbogdan: Please let me know whether you could reproduce this.
2014db3
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice cooperation ❤️
2014db3
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pbogdan: Ah, sorry, didn't see your comment.
This is an
strace -e trace=close
of the test case:So it seems that fd 1/2 have been closed before the actual process substitution, but the close was within a subprocess so in theory it shouldn't have any effect on the builder.
2014db3
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just tested and the original commit + patch passes 100 iterations 😄 . Here's what I was seeing with just the original commit in case it helps:
Success:
Failure:
EDIT above is a build without sandboxing. I verified sandboxing fails in the same way but didn't capture strace output for that.
2014db3
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pbogdan: Oh nice! So it seems Nix only cares about FD 1 and because it is closed before FD 2, it kills the process.
2014db3
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, but it doesn't seem to have to do with this, because
stderr
is dup()ed tostderr
. I'll write up a Nix issue.2014db3
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See NixOS/nix#1426.
2014db3
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So
touch
presumably closes stdout immediately because it doesn't need it?2014db3
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vcunat: Not immediately, but on exit.
2014db3
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See http://git.savannah.gnu.org/cgit/coreutils.git/tree/src/touch.c?id=e13fe20049166da816cc5102f13210264c866008#n169
2014db3
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ugh, I probably don't want to follow that fd juggling with
fd_reopen
, etc. It's enough here to know that it's closed "prematurely".