Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NixOS python tests intermittently fail with Fatal Python error #86889

Closed
srhb opened this issue May 5, 2020 · 13 comments
Closed

NixOS python tests intermittently fail with Fatal Python error #86889

srhb opened this issue May 5, 2020 · 13 comments
Labels
0.kind: bug 6.topic: nixos 6.topic: testing Tooling for automated testing of packages and modules
Milestone

Comments

@srhb
Copy link
Contributor

srhb commented May 5, 2020

Describe the bug

[...]
osd2 # [    5.294516] systemd[1]: Starting Create Volatile Files and Directories...
monA # [    3.796676] systemd[1]: Reached target Containers.
osd0 # [    6.469903] systemd[1]: Found device /dev/hvc0.
osd2 # [    5.307825] systemd[1]: Started Create Volatile Files and Directories.
(0.67 seconds)
monA: waiting for success: ceph osd stat | grep -e '3 osds: 3 up[^,]*, 3 in'
Fatal Python error: could not acquire lock for <_io.BufferedWriter name='<stderr>'> at interpreter shutdown, possibly due to daemon threads

Thread 0x00007ffff64c4700 (most recent call first):
  File "/nix/store/ygl9dx10myiy1ka6kgi35prn6c535xp8-nixos-test-driver/bin/.nixos-test-driver-wrapped", line 94 in eprint
  File "/nix/store/ygl9dx10myiy1ka6kgi35prn6c535xp8-nixos-test-driver/bin/.nixos-test-driver-wrapped", line 745 in process_serial_output

Thread 0x00007ffff5cc3700 (most recent call first):
  File "/nix/store/ygl9dx10myiy1ka6kgi35prn6c535xp8-nixos-test-driver/bin/.nixos-test-driver-wrapped", line 94 in eprint
  File "/nix/store/ygl9dx10myiy1ka6kgi35prn6c535xp8-nixos-test-driver/bin/.nixos-test-driver-wrapped", line 745 in process_serial_output

Thread 0x00007ffff54c2700 (most recent call first):
  File "/nix/store/ygl9dx10myiy1ka6kgi35prn6c535xp8-nixos-test-driver/bin/.nixos-test-driver-wrapped", line 94 in eprint
  File "/nix/store/ygl9dx10myiy1ka6kgi35prn6c535xp8-nixos-test-driver/bin/.nixos-test-driver-wrapped", line 745 in process_serial_output

Thread 0x00007ffff4cc1700 (most recent call first):
  File "/nix/store/ygl9dx10myiy1ka6kgi35prn6c535xp8-nixos-test-driver/bin/.nixos-test-driver-wrapped", line 94 in eprint
  File "/nix/store/ygl9dx10myiy1ka6kgi35prn6c535xp8-nixos-test-driver/bin/.nixos-test-driver-wrapped", line 745 in process_serial_output

Current thread 0x00007ffff7893f40 (most recent call first):
(0.79 seconds)
monA: waiting for success: ceph -s | grep 'mgr: a(active,'
(0.79 seconds)
monA: waiting for success: ceph -s | grep 'HEALTH_OK'
/nix/store/vilh5ays3ymz3xkwk0fri2a70lha7pfc-stdenv-linux/setup: line 1271:     3 Aborted                 (core dumped) LOGFILE=$out/log.xml tests='exec(os.environ["testScript"])' /nix/store/wkmmx61nyr0ndh8kh1vpyh8l6csc5jp9-nixos-test-driver-basic-multi-node-ceph-cluster/bin/nixos-test-driver
builder for '/nix/store/33jy4sdn9jzxkmrdlidr6pzgxilanh3y-vm-test-run-basic-multi-node-ceph-cluster.drv' failed with exit code 134
error: build of '/nix/store/33jy4sdn9jzxkmrdlidr6pzgxilanh3y-vm-test-run-basic-multi-node-ceph-cluster.drv', '/nix/store/5p6zrdni0q24p2qwmxmjdx3n72mwc9yk-vm-test-run-basic-single-node-ceph-cluster.drv' failed

To Reproduce
This one is hard to reproduce. The above snippet is from nix-build nixos/tests/ceph-single-node.nix on a feature commit on top of release-20.03 e5cb3ef

I know @aanderse has seen it happen in other python tests.

@veprbl veprbl added 6.topic: testing Tooling for automated testing of packages and modules 6.topic: nixos labels May 5, 2020
@flokli
Copy link
Contributor

flokli commented May 6, 2020

I also saw this in various tests.

@FRidh
Copy link
Member

FRidh commented May 8, 2020

Some general explanation regarding this issue
https://stackoverflow.com/questions/45267439/fatal-python-error-and-bufferedwriter

_thread.start_new_thread(process_serial_output, ())

@FRidh FRidh added this to the 20.09 milestone May 8, 2020
@flokli
Copy link
Contributor

flokli commented May 9, 2020

As explained in #86486 (comment), our method of writing to stderr isn't thread-safe. I gave moving all this to the native python logging module, which should be threadsafe in my wip-test-driver-logging branch.

@roberth
Copy link
Member

roberth commented May 22, 2020

@flokli Your patch works well! It fixes a test that was failing consistently in a private code base. Could you make a PR?

nitpick: I noticed it's possible to do logger.getLogger() to get the root logger. Not sure if that's equal to logger.getLogger("") but at least for configuring the handler I think it's best to use logger.getLogger(). Maybe I don't know about some python conventions though :)

@flokli
Copy link
Contributor

flokli commented May 22, 2020

@roberth This was mostly a quick hack to see if it fixes things :-)

I'm still waiting on some reaction to #87191 (comment). Knowing whether log outputs will still be part of the output or not, and if there will be some structured/nested format will affect that code.

@roberth
Copy link
Member

roberth commented May 22, 2020

@flokli Perhaps it makes sense to add the logging code to #86486 which also needs it

@flokli
Copy link
Contributor

flokli commented May 22, 2020

Hm, that's yet another instance of "general flakyness". I downgraded the referenced PR to a draft.

I'd really like to see the necessary discussion to happen, before putting more time into this.

@khumba
Copy link
Contributor

khumba commented Jul 1, 2020

Well darn, I'd been redirecting stdout to stderr to see test progress in real time, but with machine.succeed("... >&2") now, this leads to hung tests.

This is easily reproducible with the test below. With 1000 iterations, it passes. With 2000 iterations, it produces the could not acquire lock error (though adding a time.sleep(1) at the end of the script eliminates the problem). With 3000 iterations or anything higher, regardless of sleeps, the test hangs.

let
  usePython = true;

  driver =
    if usePython
    then <nixpkgs/nixos/tests/make-test-python.nix>
    else <nixpkgs/nixos/tests/make-test.nix>;

  test = import driver {
    name = "nixos-log-flooding-test";

    machine = { pkgs, ... }: {
      environment.systemPackages = [
        (pkgs.writeScriptBin "do-test" ''
          seq 2000 | while read N; do
            echo "************************************************** $N"
          done
        '')
      ];
    };

    testScript = if usePython then ''
      machine.succeed("do-test >&2")
    '' else ''
      $machine->succeed("do-test >&2");
    '';
  };
in test

@Ma27
Copy link
Member

Ma27 commented Jul 8, 2020

Just stumbled upon this as well. If we don't fix this before 20.09 we shouldn't drop the perl driver IMHO.

@flokli
Copy link
Contributor

flokli commented Jul 8, 2020

As expressed in #86486 (comment), I'd like to get some feedback on how to treat "test results" - see #87191 (comment) before putting more time into #86486.

@Ma27
Copy link
Member

Ma27 commented Jul 8, 2020

Sorry, should've read the entire thread more closely. Will take a look.

@Ma27
Copy link
Member

Ma27 commented Jul 12, 2020

FYI #86486 has been closed due to a lack of time to get this finished. It would be awesome if anyone could finish this! :)

@flokli
Copy link
Contributor

flokli commented Aug 28, 2020

The logging has been moved to use pythons logging module - this should have been fixed by #96254.

@flokli flokli closed this as completed Aug 28, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
0.kind: bug 6.topic: nixos 6.topic: testing Tooling for automated testing of packages and modules
Projects
None yet
Development

No branches or pull requests

7 participants