Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nixosTests.pt2-clone and others hang #144613

Closed
fgaz opened this issue Nov 4, 2021 · 8 comments · Fixed by #144679
Closed

nixosTests.pt2-clone and others hang #144613

fgaz opened this issue Nov 4, 2021 · 8 comments · Fixed by #144679
Labels
0.kind: bug Something is broken 0.kind: regression Something that worked before working no longer 6.topic: testing Tooling for automated testing of packages and modules

Comments

@fgaz
Copy link
Member

fgaz commented Nov 4, 2021

Describe the bug

pt2-clone.tests/nixosTests.pt2-clone, nixosTests.ft2-clone, nixosTests.shattered-pixel-dungeon, nixosTests.xterm (and probably others) hangs:

[...]
machine # [   25.071572] systemd[952]: Reached target Current graphical user session.
(27.39 seconds)
machine # [   26.830709] systemd-udevd[687]: controlC0: Process '/nix/store/i1jkln6xh32aw11xmjk4p8f14cpzszjz-alsa-utils-1.2.5.1/sbin/alsactl restore 0' failed with exit code 99.
machine # [   26.889435] systemd[1]: Reached target Sound Card.
machine # [   26.918075] systemd[952]: Reached target Sound Card.
^Cerror: interrupted by the user

Full log: pt2-clone.tests.txt

Steps To Reproduce

nix-build -A pt2-clone.tests

Expected behavior

[...]
machine: must succeed: xwininfo -root -tree | sed 's/.*0x[0-9a-f]* \"\([^\"]*\)\".*/\1/; t; d'
(0.08 seconds)
(3.31 seconds)
machine: must succeed: sleep 5
(5.10 seconds)
machine: sending monitor command: screendump /build/tmp2_cesjz4/ppm
machine: making screenshot /nix/store/yscyb0ymgmn8nshkb260gmqca05krsg0-vm-test-run-pt2-clone/screen.png
machine: sending monitor command: screendump /nix/store/yscyb0ymgmn8nshkb260gmqca05krsg0-vm-test-run-pt2-clone/screen.png.ppm
(0.15 seconds)
(62.30 seconds)
test script finished in 62.45s
cleanup
kill machine (pid 8)
machine # qemu-system-x86_64: terminating on signal 15 from pid 6 (/nix/store/8dxxjbiyxwkvh53q5kh6nydla2anacgi-python3-3.9.6/bin/python3)
(0.09 seconds)
/nix/store/yscyb0ymgmn8nshkb260gmqca05krsg0-vm-test-run-pt2-clone

Additional context

I ran a bisect and it looks like 1640359 (PR #142747) is the cause

1640359f333d8af40bf1f3e7961943ea04c6d1a1 is the first bad commit
commit 1640359f333d8af40bf1f3e7961943ea04c6d1a1
Author: Janne Heß <janne@hess.ooo>
Date:   Sun Oct 24 15:46:45 2021 +0200

    nixos/test-runner: Fix execute() flakiness
    
    Instead of using the magic string, we now just base64-encode everything
    and check for a newline.

 .../development/writing-nixos-tests.section.md     |  5 +++
 .../development/writing-nixos-tests.section.xml    |  8 ++++-
 nixos/lib/test-driver/test-driver.py               | 40 +++++++++++++++-------
 nixos/tests/hibernate.nix                          |  4 +--
 nixos/tests/kexec.nix                              |  2 +-
 5 files changed, 43 insertions(+), 16 deletions(-)

Notify maintainers

Author of the above commit: @dasJ

Metadata

Please run nix-shell -p nix-info --run "nix-info -m" and paste the result.

[user@system:~]$ nix-shell -p nix-info --run "nix-info -m"
 - system: `"x86_64-linux"`
 - host os: `Linux 5.10.48, NixOS, 21.05.1412.cf59fbd5396 (Okapi)`
 - multi-user?: `yes`
 - sandbox: `yes`
 - version: `nix-env (Nix) 2.4pre20210601_5985b8b`
 - channels(root): `"nixos-21.05.1412.cf59fbd5396"`
 - channels(fgaz): `"nixpkgs-21.11pre326821.509d236edf0, stable-20.09.4407.1c1f5649bb9"`
 - nixpkgs: `/home/fgaz/.nix-defexpr/channels/nixpkgs`

Maintainer information:

# a list of nixpkgs attributes affected by the problem
attribute:
- nixosTests.pt2-clone
- nixosTests.ft2-clone
- nixosTests.shattered-pixel-dungeon
- nixosTests.xterm
# a list of nixos modules affected by the problem
module:
@fgaz fgaz added 0.kind: bug Something is broken 0.kind: regression Something that worked before working no longer 6.topic: testing Tooling for automated testing of packages and modules labels Nov 4, 2021
@dasJ
Copy link
Member

dasJ commented Nov 4, 2021

This fails because the command never exits and so base64 never exit which causes the loop to hang forever. Any ideas @tfc @roberth? If not we can just revert my PR

A workaround could be adding yet another flag to execute() but that's not really nice

@roberth
Copy link
Member

roberth commented Nov 4, 2021

succeed has always returned stdout, so then how did that work before? Is it forking a process that keeps the pipe open? That's asking for SIGPIPE in the child, unless it doesn't write anything ever of course. What a mess. Perhaps we should have distinct methods for each use case.
Maybe we can force-close the pipe when the command exits? We might need to change it for a socket connection pair though, so we can use a socket man 2 shutdown.

@roberth
Copy link
Member

roberth commented Nov 4, 2021

I've skipped a few steps here. The reason to do so is we can force the shutdown in-band, without risking the loss of buffered data.
Perhaps a hackier way to achieve this is to have a process that behaves like cat until it sees a specific string in its input, and then exits before writing that input to its output. Iirc linux man 2 write is atomic up to 4kb, so interleaving (invaliding the special string) shouldn't be an issue.

Or we could just use this of course https://docs.microsoft.com/en-us/windows/win32/api/namedpipeapi/nf-namedpipeapi-disconnectnamedpipe :trollface:
Or maybe I'm just an idiot who doesn't know the obvious way to shut down a pipe on unix.

@fgaz fgaz changed the title pt2-clone.tests hangs pt2-clone.tests and others hang Nov 4, 2021
@fgaz fgaz changed the title pt2-clone.tests and others hang nixosTests.pt2-clone and others hang Nov 4, 2021
@dasJ
Copy link
Member

dasJ commented Nov 4, 2021

We surely don't want to shut down the pipe since it's a socket that connects us to the backdoor service or am I misunderstanding your point here?

@dasJ
Copy link
Member

dasJ commented Nov 4, 2021

Turns out tests still hang sometimes so my fix didn't really fix the root issue. We can probably just revert it and live with the pain of hanging tests

@roberth
Copy link
Member

roberth commented Nov 4, 2021

UPDATE: I'm not a fan of the solution I describe here.

We surely don't want to shut down the pipe since it's a socket that connects us to the backdoor service or am I misunderstanding your point here?

Yeah we only want to terminate the command. I wasn't clear about which pipe.

# conceptually
($cmd | base64) | testScript
      ^
    this pipe
    is the pipe we want to close, so that
    base64 can terminate and testScript
    can continue

That's not something we can do with a normal unix fifo, as far as I'm aware.
A socket connection can be closed but that will require a lot of custom stuff.
So what we want is something like this:

($cmd; while ! [[ -f pid ]]; sleep 0.1; done; kill $(<pid) ) | (cat & echo $!>pid) | base64

except this can throw away unbuffered data when cat didn't have enough time to forward stuff to base64.
That's where the socket connection comes in. At this point, bash is not of much use anymore.

with-terminating-pipe $cmd | base64

Excuse my pseudocode.

with-terminating-pipe:

# first perform some rituals to get a pair of socket _connections_
socketPath <- mkTemp
socket <- createUnixDomainSocket(socketPath)
listen(socket)
cmdSide <- openAndBind(socketPath)
wrapperSide <- accept(socket)
close(socketPath)
rm(socketPath)

# we now have cmdSide and wrapperSide file descriptors,
# connected to each other. Capable of bidirectional communication
# as well, but we'll treat them as fifo fds. Should be fine.
# point is, we now can use `man 2 shutdown` on them.
# We can also shut down the direction we don't need, but that's not important.

shovelThread <- detachThread {
  while (data <- read(wrapperSide)) {
    write(stdout, data)
  }
}
# now run $cmd with its output going into our socket pair
cmdExitStatus <- forkAndWaitForExit {
  stdin <- /dev/null
  stdout <- cmdSide
  stderr <- /dev/console # or maybe it already is, idk
  return system($cmd)
}
# $cmd has exited, so it has put all its data in the buffer.
# now we can close it without losing data
shutdown(cmdSide)

# now we wait for sholveThread to encounter the closure and finish
wait(shovelThread)

# output is safely on its way to base64.
# we're almost done, just need to propagate the exit code.
exit(cmdExitStatus)

@roberth
Copy link
Member

roberth commented Nov 5, 2021

I think we should just wait for stdout. It's a sensible thing to do. See #144679

lovesegfault added a commit that referenced this issue Nov 5, 2021
…t-wait-stdout

nixosTest: document wait for stdout behavior, fix #144613
@roberth
Copy link
Member

roberth commented Nov 6, 2021

Sensible but not great. See #144875

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
0.kind: bug Something is broken 0.kind: regression Something that worked before working no longer 6.topic: testing Tooling for automated testing of packages and modules
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants