nixos/lib/test-driver: use QMP API to watch for VM state #257535

Merged
merged 1 commit into from Oct 22, 2023

@RaitoBezarius RaitoBezarius commented Sep 27, 2023

Description of changes

Following nix-community/lanzaboote#213

I realized that our test framework is terrible when it comes to boot-level crashes or hangs.

This is because the framework has no way to distinguish between a legitimate long-running computation that happens to show a panic message on the screen and an actual boot-level panic that prevents any further progress.

Who is responsible for knowing this? Computers usually have a concept of POST codes, which communicate the current state of the machine, e.g. panicked, running, shut down, memory failure, etc.

In VMs, there's no reason we could not have the same and even better.

This is what we attempt to bring by wiring up the QMP API, the rich API that QEMU has internally, which can keep us informed of the current VM state: https://www.qemu.org/docs/master/interop/qemu-qmp-ref.html#qapidoc-81.
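To make the idea concrete, here is a minimal sketch of asking QEMU for the VM run state over QMP. It is not the code from this PR: the function names `qmp_command` and `query_vm_status` are made up for illustration, and the sketch assumes the default line-delimited JSON output (no pretty-printing). The protocol steps themselves (greeting, `qmp_capabilities`, `query-status`) are standard QMP.

```python
import json
import socket


def qmp_command(reader, writer, name):
    """Send one QMP command and return the content of its "return" field."""
    writer.write(json.dumps({"execute": name}).encode() + b"\n")
    writer.flush()
    while True:
        line = reader.readline()
        if not line:
            raise ConnectionError("QMP socket closed before a reply arrived")
        message = json.loads(line)
        if "return" in message:
            return message["return"]
        if "error" in message:
            raise RuntimeError(message["error"].get("desc", "QMP error"))
        # Anything else is an asynchronous event; this sketch ignores it.


def query_vm_status(sock):
    """Do the QMP handshake on a connected socket and return the run state."""
    reader = sock.makefile("rb")
    writer = sock.makefile("wb")
    # QEMU speaks first: a greeting object with a top-level "QMP" key.
    greeting = json.loads(reader.readline())
    if "QMP" not in greeting:
        raise RuntimeError("unexpected QMP greeting")
    # Commands are only accepted after capabilities negotiation.
    qmp_command(reader, writer, "qmp_capabilities")
    # The "status" field holds values such as "running", "paused",
    # "shutdown" or "guest-panicked".
    return qmp_command(reader, writer, "query-status")["status"]
```

With a real VM, `sock` would be an `AF_UNIX` socket connected to whatever path was passed to QEMU's `-qmp unix:...,server=on,wait=off` option; the path itself is up to the caller.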

The only challenge of this PR is that the QEMU QMP API is inherently asynchronous, and our code is very synchronous. No problem, we will:

The aim of this PR is to provide enough infrastructure to demonstrate a fix for the issue mentioned above.
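As an illustration of the asynchronous-versus-synchronous mismatch described above, one common approach is to block until the command's reply arrives and park any events that show up in the meantime. The sketch below is an assumption about how this can be done, not the PR's implementation; `BlockingQMP`, `execute` and `pending_events` are hypothetical names, and the QMP greeting and `qmp_capabilities` handshake are assumed to have happened already.

```python
import json
from collections import deque


class BlockingQMP:
    """Hypothetical synchronous wrapper around a QMP line stream."""

    def __init__(self, reader, writer):
        # reader/writer are binary file objects, e.g. from sock.makefile().
        self.reader = reader
        self.writer = writer
        # Events seen while waiting for a reply are kept here, in order.
        self.pending_events = deque()

    def execute(self, command, arguments=None):
        """Send a command and block until its reply arrives."""
        request = {"execute": command}
        if arguments is not None:
            request["arguments"] = arguments
        self.writer.write(json.dumps(request).encode() + b"\n")
        self.writer.flush()
        while True:
            line = self.reader.readline()
            if not line:
                raise ConnectionError("QMP stream closed while waiting for a reply")
            message = json.loads(line)
            if "event" in message:
                # Not our reply: remember the event and keep reading.
                self.pending_events.append(message)
            elif "error" in message:
                raise RuntimeError(message["error"].get("desc", "QMP error"))
            elif "return" in message:
                return message["return"]
```

The trade-off is that events are only noticed when a command is sent; nothing reads the socket between commands.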

Things done

  • Built on platform(s)
    • x86_64-linux
    • aarch64-linux
    • x86_64-darwin
    • aarch64-darwin
  • For non-Linux: Is sandbox = true set in nix.conf? (See Nix manual)
  • Tested, as applicable:
  • Tested compilation of all packages that depend on this change using nix-shell -p nixpkgs-review --run "nixpkgs-review rev HEAD". Note: all changes have to be committed, also see nixpkgs-review usage
  • Tested basic functionality of all binary files (usually in ./result/bin/)
  • 23.11 Release Notes (or backporting 23.05 Release notes)
    • (Package updates) Added a release notes entry if the change is major or breaking
    • (Module updates) Added a release notes entry if the change is significant
    • (Module addition) Added a release notes entry if adding a new NixOS module
  • Fits CONTRIBUTING.md.

@mweinelt
Member

Triggering eval after it was fixed on master.

@ofborg eval

@RaitoBezarius
Member Author

@tfc I would like to get this merged as-is so that any downstream consumer can start building things on top of it.

Once we better understand how we want to drive this, we can incrementally improve the new APIs.

I also want to discuss the new APIs with @nikstur and @ElvishJerricco.

@RaitoBezarius
Member Author

Thank you for the in-depth review @tfc ! Will address all of this.

@tfc
Contributor

tfc commented Oct 20, 2023

@ofborg test login

@RaitoBezarius
Member Author

RaitoBezarius commented Oct 20, 2023 via email

Now that we have a QMP client, we can wire it up in the test driver.

For now, it is almost completely useless because it needs a constant "event loop", especially for event listening.

In the next commits, we will slowly enable more and more use cases.
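For readers wondering what such an "event loop" could look like, here is one possible sketch using a background thread that keeps reading the QMP stream and queues events for the synchronous test code. This is an assumption for illustration only, not what the PR or its follow-up commits do; `start_event_pump` and `wait_for_event` are hypothetical names. It covers only the event half: command replies read by the pump are dropped.

```python
import json
import queue
import threading


def start_event_pump(reader):
    """Read QMP lines on a daemon thread and queue the asynchronous events."""
    events = queue.Queue()

    def pump():
        # Iterating a binary file object yields one line per QMP message.
        for line in reader:
            message = json.loads(line)
            if "event" in message:
                events.put(message)
        # Sentinel telling consumers that the stream has ended.
        events.put(None)

    threading.Thread(target=pump, daemon=True).start()
    return events


def wait_for_event(events, name, timeout):
    """Block until an event called `name` arrives.

    `timeout` applies to each queue read, not to the whole wait;
    queue.Empty is raised if nothing arrives within that time.
    """
    while True:
        message = events.get(timeout=timeout)
        if message is None:
            raise ConnectionError("QMP stream closed before the event arrived")
        if message["event"] == name:
            return message
```

A test could then call something like `wait_for_event(events, "GUEST_PANICKED", timeout=60)` to fail fast on a boot-level panic instead of waiting for an unrelated timeout.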
@tfc
Contributor

tfc commented Oct 22, 2023

@ofborg test login

@tfc tfc merged commit dda77fc into NixOS:master Oct 22, 2023
22 of 23 checks passed
@AleXoundOS
Contributor

@RaitoBezarius, this PR introduced a regression on non-NixOS machines!
Here is a minimal NixOS test:

  • before PR, works: nix run github:alexoundos/minimal-nixos-test-flake?rev=b1a57b24eb74dde875ba441fe37a47940a4ef3bc#checks.x86_64-linux.dummy-test.driver
  • after PR, fails: nix run github:alexoundos/minimal-nixos-test-flake?rev=ecaf8ebcf3eb4b457744a501c279ee858f77a93d#checks.x86_64-linux.dummy-test.driver
Traceback (most recent call last):
  File "/nix/store/0q8sk7njr8nbbp2gww6m4yrm2f52j375-nixos-test-driver-1.1/bin/.nixos-test-driver-wrapped", line 9, in <module>
    sys.exit(main())
             ^^^^^^
  File "/nix/store/0q8sk7njr8nbbp2gww6m4yrm2f52j375-nixos-test-driver-1.1/lib/python3.11/site-packages/test_driver/__init__.py", line 126, in main
    driver.run_tests()
  File "/nix/store/0q8sk7njr8nbbp2gww6m4yrm2f52j375-nixos-test-driver-1.1/lib/python3.11/site-packages/test_driver/driver.py", line 159, in run_tests
    self.test_script()
  File "/nix/store/0q8sk7njr8nbbp2gww6m4yrm2f52j375-nixos-test-driver-1.1/lib/python3.11/site-packages/test_driver/driver.py", line 151, in test_script
    exec(self.tests, symbols, None)
  File "<string>", line 1, in <module>
  File "/nix/store/0q8sk7njr8nbbp2gww6m4yrm2f52j375-nixos-test-driver-1.1/lib/python3.11/site-packages/test_driver/driver.py", line 169, in start_all
    machine.start()
  File "/nix/store/0q8sk7njr8nbbp2gww6m4yrm2f52j375-nixos-test-driver-1.1/lib/python3.11/site-packages/test_driver/machine.py", line 1132, in start
    self.qmp_client = QMPSession.from_path(self.qmp_path)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nix/store/0q8sk7njr8nbbp2gww6m4yrm2f52j375-nixos-test-driver-1.1/lib/python3.11/site-packages/test_driver/qmp.py", line 50, in from_path
    return cls(sock)
           ^^^^^^^^^
  File "/nix/store/0q8sk7njr8nbbp2gww6m4yrm2f52j375-nixos-test-driver-1.1/lib/python3.11/site-packages/test_driver/qmp.py", line 44, in __init__
    self.send("qmp_capabilities")
  File "/nix/store/0q8sk7njr8nbbp2gww6m4yrm2f52j375-nixos-test-driver-1.1/lib/python3.11/site-packages/test_driver/qmp.py", line 98, in send
    return self._wait_for_new_result()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nix/store/0q8sk7njr8nbbp2gww6m4yrm2f52j375-nixos-test-driver-1.1/lib/python3.11/site-packages/test_driver/qmp.py", line 58, in _wait_for_new_result
    self.read_pending_messages()
  File "/nix/store/0q8sk7njr8nbbp2gww6m4yrm2f52j375-nixos-test-driver-1.1/lib/python3.11/site-packages/test_driver/qmp.py", line 62, in read_pending_messages
    line = self.reader.readline()
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/nix/store/qp5zys77biz7imbk6yy85q5pdv7qk84j-python3-3.11.6/lib/python3.11/socket.py", line 706, in readinto
    return self._sock.recv_into(b)
           ^^^^^^^^^^^^^^^^^^^^^^^
ConnectionResetError: [Errno 104] Connection reset by peer

(Sorry, the Traceback is not exactly from this test run, but the messages are the same)

@AleXoundOS
Contributor

@tfc, regardless of the problem cause on non-NixOS machines, I think it would be nice to have an option to disable QMP.

@RaitoBezarius
Member Author

It would be nice to know the root cause of the failure on non-NixOS machines though; that's not normal.

@AleXoundOS
Contributor

It would be nice to know the root cause of the failure on non-NixOS machines though; that's not normal.

Agreed. But I have already confirmed this behavior on 2 non-NixOS machines. It's currently a blocker for our team, since we depend on running the test driver in non-interactive mode and passing arguments to it.

@RaitoBezarius, I should've noted that driverInteractive works just fine on non-NixOS: nix run github:alexoundos/minimal-nixos-test-flake?rev=ecaf8ebcf3eb4b457744a501c279ee858f77a93d#checks.x86_64-linux.dummy-test.driverInteractive --no-interactive. However, that mode still opens a QEMU window, which is not desirable (the behavior was the same before this PR).

@RaitoBezarius
Member Author

Understandable. I guess you can revert this commit in the meantime, but we really need a root cause analysis or more logs on what's going on. I don't have any non-NixOS systems that can run a VM test, so unfortunately I cannot reproduce anything you have sent so far.

It'd be helpful to have full system details such as the OS, whether the sandbox is enabled, etc., i.e. what nix-info gives you.

@ElvishJerricco
Contributor

ElvishJerricco commented Jan 27, 2024

@AleXoundOS I am unable to reproduce your problem on Debian 12 using nix run github:alexoundos/minimal-nixos-test-flake?rev=ecaf8ebcf3eb4b457744a501c279ee858f77a93d#checks.x86_64-linux.dummy-test.driver.

@AleXoundOS
Contributor

I am unable to reproduce your problem on Debian 12

Tested on Fedora and it works too!

The problem is indeed specific to Arch Linux. Both machines which fail are Arch Linux in my case.

@lorenzleutgeb
Member

lorenzleutgeb commented Apr 15, 2024

(I just arrived at this PR after brief confusion, so I'm going to leave this here for posterity.)

When qemu fails to run, e.g. because you specified fatally wrong QEMU_OPTS, then you might get a stack trace that will (on the surface) look like an error related to QMP. Look closer, and you'll realize that the problem occurs even earlier. Below, the root cause for being unable to connect to the QMP socket is the address resolution that failed earlier.

$ QEMU_OPTS="-vnc example.invalid:1337,reverse=on" ./result/bin/nixos-test-driver
Finished at 21:55:29 after 0s
[...]
additionally exposed symbols:
    client, server,
    vlan1,
    start_all, test_script, machines, vlans, driver, log, os, create_machine, subtest, run_tests, join_all, retry, serial_stdout_off, serial_stdout_on, polling_condition, Machine
>>> start_all()
start all VMs
client: starting vm
mke2fs 1.47.0 (5-Feb-2023)
qemu-kvm: -vnc example.invalid:1337,reverse=on: address resolution failed for example.invalid:5500: Name or service not known
Traceback (most recent call last):
  File "", line 1, in 
  File "/nix/store/…-nixos-test-driver-1.1/lib/python3.11/site-packages/test_driver/driver.py", line 173, in start_all
    machine.start()
  File "/nix/store/…-nixos-test-driver-1.1/lib/python3.11/site-packages/test_driver/machine.py", line 1089, in start
    self.qmp_client = QMPSession.from_path(self.qmp_path)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nix/store/…-nixos-test-driver-1.1/lib/python3.11/site-packages/test_driver/qmp.py", line 50, in from_path
    return cls(sock)
           ^^^^^^^^^
  File "/nix/store/…-nixos-test-driver-1.1/lib/python3.11/site-packages/test_driver/qmp.py", line 44, in __init__
    self.send("qmp_capabilities")
  File "/nix/store/…-nixos-test-driver-1.1/lib/python3.11/site-packages/test_driver/qmp.py", line 98, in send
    return self._wait_for_new_result()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nix/store/…-nixos-test-driver-1.1/lib/python3.11/site-packages/test_driver/qmp.py", line 58, in _wait_for_new_result
    self.read_pending_messages()
  File "/nix/store/…-nixos-test-driver-1.1/lib/python3.11/site-packages/test_driver/qmp.py", line 62, in read_pending_messages
    line = self.reader.readline()
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/nix/store/…-python3-3.11.8/lib/python3.11/socket.py", line 706, in readinto
    return self._sock.recv_into(b)
           ^^^^^^^^^^^^^^^^^^^^^^^
ConnectionResetError: [Errno 104] Connection reset by peer

Of course it'd be nice if the testing framework would abort directly after the qemu failure. I don't know and didn't check whether that's possible.
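One way this could be approached, sketched here under assumptions and not checked against the actual test driver, is to look at the QEMU child process whenever the QMP connection fails, and report the early exit instead of the socket error. `connect_or_explain` and its parameters are hypothetical names; `process` is assumed to be the `subprocess.Popen` object for QEMU and `connect` any callable that opens the QMP session.

```python
import subprocess


def connect_or_explain(process, connect, grace_seconds=1.0):
    """Call connect(); if it fails because QEMU already died, say so."""
    try:
        return connect()
    except (ConnectionError, FileNotFoundError) as error:
        # Give the child a moment to finish exiting so that its exit
        # status becomes visible to us.
        try:
            exit_code = process.wait(timeout=grace_seconds)
        except subprocess.TimeoutExpired:
            exit_code = None
        if exit_code is None:
            # QEMU is still running, so this is a genuine QMP problem.
            raise
        raise RuntimeError(
            f"QEMU exited with status {exit_code} before QMP was usable; "
            "check its output for the real error"
        ) from error
```

In the traceback above, `connect` would correspond to something like the `QMPSession.from_path(self.qmp_path)` call in `machine.py`; how the driver would actually pass its QEMU process object is left open here.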
