nixos/lib/test-driver: use QMP API to watch for VM state #257535

RaitoBezarius · 2023-09-27T00:42:33Z

Description of changes

I realized that our test framework is terrible when it comes to boot-level crashes or hangs.

This is because the framework has no way to distinguish between a legitimate long-running computation that happens to print "panic" on the screen and an actual boot-level panic that prevents any further progress.

Who is responsible for knowing this? Physical computers usually have POST codes, which communicate the current state of the machine, e.g. panicked, running, shut down, memory failure, etc.

In VMs, there's no reason we could not have the same and even better.

This PR attempts to bring that by wiring up QMP, the rich API that QEMU provides, which can keep us informed of the current VM state: https://www.qemu.org/docs/master/interop/qemu-qmp-ref.html#qapidoc-81

The only challenge of this PR is that the QEMU QMP API is inherently asynchronous, and our code is very synchronous. No problem, we will:

The aim of this PR is to provide just enough infrastructure to demonstrate a fix for the issue described above.
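To make the asynchronous/synchronous mismatch concrete, here is a minimal sketch of a blocking QMP client in Python. It is not the code from this PR: the class name `MiniQMP` and its methods are hypothetical, and it assumes QEMU sends one JSON object per line. It relies only on the documented handshake (a greeting, then `qmp_capabilities`) and on the fact that asynchronous events carry an `"event"` key, so they can arrive interleaved with command replies.

```python
import json
import socket
from typing import Any


class MiniQMP:
    """Hypothetical, tiny blocking QMP client (illustration only)."""

    def __init__(self, sock: socket.socket) -> None:
        self.sock = sock
        self.reader = sock.makefile("r", encoding="utf-8")
        # Events that showed up while we were waiting for a command reply.
        self.events: list[dict[str, Any]] = []
        # QEMU speaks first: a greeting object with a top-level "QMP" key.
        greeting = json.loads(self.reader.readline())
        if "QMP" not in greeting:
            raise RuntimeError(f"unexpected greeting: {greeting!r}")
        # Other commands are rejected until capabilities are negotiated.
        self.execute("qmp_capabilities")

    @classmethod
    def from_path(cls, path: str) -> "MiniQMP":
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        sock.connect(path)
        return cls(sock)

    def execute(self, command: str, **arguments: Any) -> Any:
        request: dict[str, Any] = {"execute": command}
        if arguments:
            request["arguments"] = arguments
        self.sock.sendall(json.dumps(request).encode() + b"\n")
        # Block until the reply arrives, setting aside any interleaved events.
        while True:
            line = self.reader.readline()
            if not line:
                raise ConnectionError("QMP peer closed the connection")
            message = json.loads(line)
            if "event" in message:
                self.events.append(message)
            elif "error" in message:
                raise RuntimeError(message["error"].get("desc", "QMP error"))
            elif "return" in message:
                return message["return"]
```

With a real VM, something like `MiniQMP.from_path(path).execute("query-status")` would return the run state, where `path` is whatever unix socket QEMU was told to expose QMP on. Events are only noticed while a command is in flight, which is exactly the limitation of a purely synchronous client.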

Things done

nixos/lib/test-driver/test_driver/machine.py

mweinelt · 2023-09-29T02:26:59Z

Triggering eval after it was fixed on master.

@ofborg eval

RaitoBezarius · 2023-09-29T19:33:00Z

@tfc I would like to get this merged as-is so that downstream consumers can start building on top of it.

Once we better understand how we want to drive this, we can incrementally improve the new APIs.

I also want to discuss the new APIs with @nikstur and @ElvishJerricco.

nixos/lib/test-driver/test_driver/qmp.py

RaitoBezarius · 2023-09-29T20:16:52Z

Thank you for the in-depth review, @tfc! I will address all of it.

tfc · 2023-10-20T12:39:01Z

@ofborg test login

RaitoBezarius · 2023-10-20T13:24:06Z

I think I fucked up and didn't rebase everything, so I will rerun the test.

Now that we have a QMP client, we can wire it up in the test driver. For now, it is almost completely useless because it needs a constantly running "event loop", especially for listening to events. In the next commits, we will gradually enable more and more use cases.
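As an illustration of why event listening needs something loop-like, here is a small sketch of polling a QMP socket for events that are already waiting, without blocking the caller. The function name `read_pending_events` and its parameters are hypothetical and not part of the test driver; the sketch assumes one JSON object per line and simply discards anything that is not an event.

```python
import json
import select
import socket
from typing import Any


def read_pending_events(
    sock: socket.socket, buffer: bytearray, timeout: float = 0.0
) -> list[dict[str, Any]]:
    """Return QMP events already available on sock; wait at most `timeout` seconds."""
    while True:
        readable, _, _ = select.select([sock], [], [], timeout)
        if not readable:
            break
        chunk = sock.recv(4096)
        if not chunk:
            break  # peer closed the connection
        buffer.extend(chunk)
        timeout = 0.0  # after the first read, only take what is already there

    events: list[dict[str, Any]] = []
    # Parse complete lines only; a partial line stays in the buffer for next time.
    while b"\n" in buffer:
        line, _, rest = bytes(buffer).partition(b"\n")
        buffer[:] = rest
        if not line.strip():
            continue
        message = json.loads(line)
        if "event" in message:
            events.append(message)
    return events
```

A caller would have to invoke this regularly, for example between test steps, and look for event names such as `SHUTDOWN` or `GUEST_PANICKED`; which events matter is an assumption here, not something this PR specifies.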

tfc · 2023-10-22T08:21:29Z

@ofborg test login

AleXoundOS · 2024-01-26T14:32:58Z

@RaitoBezarius, this PR introduced a regression on non-NixOS machines!
Here is a minimal NixOS test that reproduces it:

before PR, works: nix run github:alexoundos/minimal-nixos-test-flake?rev=b1a57b24eb74dde875ba441fe37a47940a4ef3bc#checks.x86_64-linux.dummy-test.driver
after PR, fails: nix run github:alexoundos/minimal-nixos-test-flake?rev=ecaf8ebcf3eb4b457744a501c279ee858f77a93d#checks.x86_64-linux.dummy-test.driver

Traceback (most recent call last):
  File "/nix/store/0q8sk7njr8nbbp2gww6m4yrm2f52j375-nixos-test-driver-1.1/bin/.nixos-test-driver-wrapped", line 9, in <module>
    sys.exit(main())
             ^^^^^^
  File "/nix/store/0q8sk7njr8nbbp2gww6m4yrm2f52j375-nixos-test-driver-1.1/lib/python3.11/site-packages/test_driver/__init__.py", line 126, in main
    driver.run_tests()
  File "/nix/store/0q8sk7njr8nbbp2gww6m4yrm2f52j375-nixos-test-driver-1.1/lib/python3.11/site-packages/test_driver/driver.py", line 159, in run_tests
    self.test_script()
  File "/nix/store/0q8sk7njr8nbbp2gww6m4yrm2f52j375-nixos-test-driver-1.1/lib/python3.11/site-packages/test_driver/driver.py", line 151, in test_script
    exec(self.tests, symbols, None)
  File "<string>", line 1, in <module>
  File "/nix/store/0q8sk7njr8nbbp2gww6m4yrm2f52j375-nixos-test-driver-1.1/lib/python3.11/site-packages/test_driver/driver.py", line 169, in start_all
    machine.start()
  File "/nix/store/0q8sk7njr8nbbp2gww6m4yrm2f52j375-nixos-test-driver-1.1/lib/python3.11/site-packages/test_driver/machine.py", line 1132, in start
    self.qmp_client = QMPSession.from_path(self.qmp_path)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nix/store/0q8sk7njr8nbbp2gww6m4yrm2f52j375-nixos-test-driver-1.1/lib/python3.11/site-packages/test_driver/qmp.py", line 50, in from_path
    return cls(sock)
           ^^^^^^^^^
  File "/nix/store/0q8sk7njr8nbbp2gww6m4yrm2f52j375-nixos-test-driver-1.1/lib/python3.11/site-packages/test_driver/qmp.py", line 44, in __init__
    self.send("qmp_capabilities")
  File "/nix/store/0q8sk7njr8nbbp2gww6m4yrm2f52j375-nixos-test-driver-1.1/lib/python3.11/site-packages/test_driver/qmp.py", line 98, in send
    return self._wait_for_new_result()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nix/store/0q8sk7njr8nbbp2gww6m4yrm2f52j375-nixos-test-driver-1.1/lib/python3.11/site-packages/test_driver/qmp.py", line 58, in _wait_for_new_result
    self.read_pending_messages()
  File "/nix/store/0q8sk7njr8nbbp2gww6m4yrm2f52j375-nixos-test-driver-1.1/lib/python3.11/site-packages/test_driver/qmp.py", line 62, in read_pending_messages
    line = self.reader.readline()
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/nix/store/qp5zys77biz7imbk6yy85q5pdv7qk84j-python3-3.11.6/lib/python3.11/socket.py", line 706, in readinto
    return self._sock.recv_into(b)
           ^^^^^^^^^^^^^^^^^^^^^^^
ConnectionResetError: [Errno 104] Connection reset by peer

(Sorry, the Traceback is not exactly from this test run, but the messages are the same)

AleXoundOS · 2024-01-26T14:46:27Z

@tfc, regardless of what causes the problem on non-NixOS machines, I think it would be nice to have an option to disable QMP.

RaitoBezarius · 2024-01-26T15:05:11Z

It would be nice to know the root cause of the failure on non-NixOS machines, though; that's not normal.

AleXoundOS · 2024-01-26T18:06:45Z

> It would be nice to know the root cause though of the failure on non-NixOS machine, that's not normal.

Agree. But I confirmed this behavior on 2 non-NixOS machines already. It's currently a blocker for our team, since we depend on running the test driver, passing arguments to it in non-interactive mode.

@RaitoBezarius, I should have noted that driverInteractive works just fine on non-NixOS: nix run github:alexoundos/minimal-nixos-test-flake?rev=ecaf8ebcf3eb4b457744a501c279ee858f77a93d#checks.x86_64-linux.dummy-test.driverInteractive --no-interactive. However, that mode still opens a QEMU window, which is not desirable (the behavior was the same before the PR).

RaitoBezarius · 2024-01-26T18:09:14Z

Understandable. I guess you can revert this commit in the meantime, but we really need a root cause analysis or more logs on what's going on. I don't have any non-NixOS system that can run a VM test, so unfortunately I cannot reproduce anything you have sent so far.

It would be helpful to have full system details such as the OS, whether the sandbox is enabled, etc.: whatever nix-info gives you.

ElvishJerricco · 2024-01-27T11:00:05Z

@AleXoundOS I am unable to reproduce your problem on Debian 12 using nix run github:alexoundos/minimal-nixos-test-flake?rev=ecaf8ebcf3eb4b457744a501c279ee858f77a93d#checks.x86_64-linux.dummy-test.driver.

AleXoundOS · 2024-01-31T13:42:25Z

> I am unable to reproduce your problem on Debian 12

Tested on Fedora and it works too!

The problem is indeed specific to Arch Linux. Both machines which fail are Arch Linux in my case.

lorenzleutgeb · 2024-04-15T20:06:53Z

(I just arrived at this PR after brief confusion, so I'm going to leave this here for posterity.)

When qemu fails to run, e.g. because you specified fatally wrong QEMU_OPTS, then you might get a stack trace that will (on the surface) look like an error related to QMP. Look closer, and you'll realize that the problem occurs even earlier. Below, the root cause for being unable to connect to the QMP socket is the address resolution that failed earlier.

$ QEMU_OPTS="-vnc example.invalid:1337,reverse=on" ./result/bin/nixos-test-driver

Finished at 21:55:29 after 0s
[...]
additionally exposed symbols:
    client, server,
    vlan1,
    start_all, test_script, machines, vlans, driver, log, os, create_machine, subtest, run_tests, join_all, retry, serial_stdout_off, serial_stdout_on, polling_condition, Machine
>>> start_all()
start all VMs
client: starting vm
mke2fs 1.47.0 (5-Feb-2023)
qemu-kvm: -vnc example.invalid:1337,reverse=on: address resolution failed for example.invalid:5500: Name or service not known
Traceback (most recent call last):
  File "", line 1, in 
  File "/nix/store/…-nixos-test-driver-1.1/lib/python3.11/site-packages/test_driver/driver.py", line 173, in start_all
    machine.start()
  File "/nix/store/…-nixos-test-driver-1.1/lib/python3.11/site-packages/test_driver/machine.py", line 1089, in start
    self.qmp_client = QMPSession.from_path(self.qmp_path)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nix/store/…-nixos-test-driver-1.1/lib/python3.11/site-packages/test_driver/qmp.py", line 50, in from_path
    return cls(sock)
           ^^^^^^^^^
  File "/nix/store/…-nixos-test-driver-1.1/lib/python3.11/site-packages/test_driver/qmp.py", line 44, in __init__
    self.send("qmp_capabilities")
  File "/nix/store/…-nixos-test-driver-1.1/lib/python3.11/site-packages/test_driver/qmp.py", line 98, in send
    return self._wait_for_new_result()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nix/store/…-nixos-test-driver-1.1/lib/python3.11/site-packages/test_driver/qmp.py", line 58, in _wait_for_new_result
    self.read_pending_messages()
  File "/nix/store/…-nixos-test-driver-1.1/lib/python3.11/site-packages/test_driver/qmp.py", line 62, in read_pending_messages
    line = self.reader.readline()
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/nix/store/…-python3-3.11.8/lib/python3.11/socket.py", line 706, in readinto
    return self._sock.recv_into(b)
           ^^^^^^^^^^^^^^^^^^^^^^^
ConnectionResetError: [Errno 104] Connection reset by peer

Of course it'd be nice if the testing framework would abort directly after the qemu failure. I don't know and didn't check whether that's possible.
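One possible way to surface the real cause, sketched under assumptions: wrap the QMP connection attempt and, if it fails with a socket error, check whether the VM process has already exited. The helper name `connect_or_explain` and the `grace` parameter are hypothetical and not part of the test driver; the sketch only uses `subprocess.Popen.wait` from the standard library.

```python
import subprocess
from typing import Callable, TypeVar

T = TypeVar("T")


def connect_or_explain(
    process: subprocess.Popen, connect: Callable[[], T], grace: float = 1.0
) -> T:
    """Run connect(); if it fails because the VM process died, say so explicitly."""
    try:
        return connect()
    except OSError as error:
        # ConnectionResetError and FileNotFoundError are both OSError subclasses.
        try:
            # Give the process a moment to finish exiting before judging.
            exit_code = process.wait(timeout=grace)
        except subprocess.TimeoutExpired:
            exit_code = None
        if exit_code is None:
            # Process is still alive: this is a genuine socket problem.
            raise
        raise RuntimeError(
            f"VM process exited with status {exit_code} before the QMP "
            "handshake finished; its own error output is the root cause"
        ) from error
```

In the scenario above, `connect` would be something like `lambda: QMPSession.from_path(self.qmp_path)`, and the resulting error would point at the failed QEMU process instead of ending in `ConnectionResetError`.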

github-actions bot added 6.topic: python 6.topic: nixos labels Sep 27, 2023

RaitoBezarius mentioned this pull request Sep 27, 2023

Hercules CI disabled for stuck VM tests nix-community/lanzaboote#213

Closed

ofborg bot added 8.has: package (new) 11.by: package-maintainer 10.rebuild-darwin: 1-10 10.rebuild-linux: 1-10 labels Sep 27, 2023

Mic92 reviewed Sep 27, 2023

nixos/lib/test-driver/test_driver/machine.py (outdated, resolved)

RaitoBezarius force-pushed the vmstate branch from 8b4c520 to f8f646c Compare September 29, 2023 02:19

RaitoBezarius marked this pull request as ready for review September 29, 2023 19:11

RaitoBezarius requested a review from tfc as a code owner September 29, 2023 19:11

RaitoBezarius force-pushed the vmstate branch from f8f646c to 65275f9 Compare September 29, 2023 19:11

github-actions bot removed the 6.topic: python label Sep 29, 2023

ofborg bot added 10.rebuild-darwin: 0 and removed 10.rebuild-darwin: 1-10 labels Sep 29, 2023

tfc reviewed Sep 29, 2023

nixos/lib/test-driver/test_driver/qmp.py (outdated, resolved)

tfc reviewed Sep 29, 2023

nixos/lib/test-driver/test_driver/qmp.py (outdated, resolved)

tfc reviewed Sep 29, 2023

nixos/lib/test-driver/test_driver/qmp.py (outdated, resolved)

tfc reviewed Sep 29, 2023

nixos/lib/test-driver/test_driver/qmp.py (resolved)

RaitoBezarius force-pushed the vmstate branch 5 times, most recently from f4e1371 to 9c749b7 Compare October 19, 2023 15:46

RaitoBezarius force-pushed the vmstate branch from 9c749b7 to f94876a Compare October 21, 2023 11:03

tfc merged commit dda77fc into NixOS:master Oct 22, 2023
22 of 23 checks passed
