i686 swap problems in tests (master) #23107

Closed
vcunat opened this issue Feb 23, 2017 · 16 comments
Labels
0.kind: regression (Something that worked before working no longer), 6.topic: nixos

Comments

@vcunat
Member

vcunat commented Feb 23, 2017

Issue description

We're reproducibly getting problems in the hibernate test, but only on i686-linux.

machine# [   11.350824] swapon[633]: swapon: /dev/vdb: read swap header failed
machine# [   11.358592] systemd[1]: dev-vdb.swap: Swap process exited, code=exited status=255
machine# [   11.372330] systemd[1]: Failed to activate swap /dev/vdb.
machine# [   11.377620] systemd[1]: Dependency failed for Swap.
machine# [   11.382812] systemd[1]: swap.target: Job swap.target/start failed with result 'dependency'.
machine# [   11.392925] systemd[1]: dev-vdb.swap: Unit entered failed state.

As a result, the channel is blocked, missing security updates, etc. I don't know if the problem also happens outside VM or outside tests.
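
For the swapon failure in the log above, the message alone does not say whether the device was unreadable or just lacked a recognizable swap signature. A minimal sketch of a manual check, assuming a 4096-byte page size (the i686 default) and the standard Linux swap layout, where the 10-byte signature sits at the end of the first page. The function names `swap_signature` and `has_swap_signature` are made up for illustration:

```shell
# Print the 10 bytes at the end of the first page of a swap area.
# $1 is the device or image file; $2 is the page size (assumed 4096 if omitted).
swap_signature() {
  dd if="$1" bs=1 skip=$(( ${2:-4096} - 10 )) count=10 2>/dev/null | tr -d '\0'
}

# Succeed if the signature is one of the two regular swap signatures.
has_swap_signature() {
  case "$(swap_signature "$@")" in
    SWAPSPACE2|SWAP-SPACE) return 0 ;;
    *) return 1 ;;
  esac
}
```

Inside the test VM this could be run as `has_swap_signature /dev/vdb && echo "signature present"`. It only inspects the signature, so a positive result does not prove the rest of the header is valid.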

@abbradar
Member

Aaargh, I would have looked at this but I've accidentally reverted to a configuration with unstable Nix! So now I need to reinstall NixOS again or I can't run tests because of #22868 D:

@vcunat
Member Author

vcunat commented Feb 23, 2017

Oh, is there not a way to go back? I'd think you "only" need to downgrade the database schema.

@abbradar
Member

abbradar commented Feb 23, 2017

Actually I was just stupid and hadn't read the manual well enough:

# nix-build -A nixUnstable -o nix-new
# nix-build -A nix -o nix-old
# nixos-rebuild switch # to configuration with stable Nix
# nix-new/bin/nix-store --dump-db > db.nix
# mv /nix/var /nix/var.old
# systemctl restart nix-daemon.socket
# cat db.nix | nix-old/bin/nix-store --load-db

Not a nice procedure but better than a reinstall!

@vcunat
Member Author

vcunat commented Feb 23, 2017

It would seem nice if we improved the situation, as it's one of the rare things you can't just roll back.

@abbradar
Member

abbradar commented Feb 23, 2017

A rough sketch of how that could work:

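  systemd.services.nix-daemon.preStart = ''
    # check-schema is a placeholder: some way to test whether the new Nix
    # can read the current database schema.
    if ! check-schema; then
      # Dump the database with the previously active Nix, which understands it.
      /nix/var/nix/current/bin/nix-store --dump-db > /nix/var/nix/db.dump
      mv /nix/var/nix/db /nix/var/nix/db.old
      if ! ${cfg.package}/bin/nix-store --load-db < /nix/var/nix/db.dump; then
        # Loading failed: restore the old database and abort.
        rm -rf /nix/var/nix/db
        mv /nix/var/nix/db.old /nix/var/nix/db
        echo "Failed to downgrade database schema" >&2
        exit 1
      fi
      rm /nix/var/nix/db.dump
      rm -rf /nix/var/nix/db.old
    fi
    # Point "current" at the Nix that now owns the database
    # (ln takes the target first, then the link name).
    ln -sfn ${cfg.package} /nix/var/nix/current
  '';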

There's still the question of how to lock out concurrent access...
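
On the locking question, one possible direction is an advisory lock taken with `flock` from util-linux around the whole migration. This is only a sketch: the helper name `with_db_lock` and the lock file path are hypothetical, and the lock only serializes processes that take the same lock. It does not stop other Nix clients from touching the database.

```shell
# Run a command while holding an exclusive advisory lock on a lock file.
# $1 is the lock file (a hypothetical path chosen by the caller, not one Nix uses);
# the remaining arguments are the command to run.
with_db_lock() (
  lock_file=$1
  shift
  # Open the lock file on fd 9 and block until the exclusive lock is ours.
  exec 9>"$lock_file"
  flock -x 9
  # The lock is released when this subshell exits and fd 9 is closed.
  "$@"
)
```

A migration script could then be wrapped as `with_db_lock /run/nix-db-migrate.lock sh migrate.sh`, where both the lock path and the script name are placeholders.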

@abbradar
Member

I get this when running this test:

machine# KVM internal error. Suberror: 1
machine# emulation failure
machine# EAX=d70d9dc8 EBX=00018000 ECX=0000007b EDX=00000000
machine# ESI=ceb74000 EDI=d584ee00 EBP=d70d9e24 ESP=d70d9dc4
machine# EIP=d584ee00 EFL=00010083 [--S---C] CPL=0 II=0 A20=1 SMM=0 HLT=0
machine# ES =007b 00000000 ffffffff 00c0f300 DPL=3 DS   [-WA]
machine# CS =0060 00000000 ffffffff 00c09b00 DPL=0 CS32 [-RA]
machine# SS =0068 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
machine# DS =007b 00000000 ffffffff 00c0f300 DPL=3 DS   [-WA]
machine# FS =00d8 0169f000 ffffffff 00809300 DPL=0 DS16 [-WA]
machine# GS =00e0 d7647200 00000018 00409100 DPL=0 DS   [--A]
machine# LDT=0000 00000000 ffffffff 00000000
machine# TR =0080 d7645080 0000206b 00008b00 DPL=0 TSS32-busy
machine# GDT=     d763c000 000000ff
machine# IDT=     fffba000 000007ff
machine# CR0=80050033 CR2=d603b478 CR3=0bfb3000 CR4=000006b0
machine# DR0=00000000 DR1=00000000 DR2=00000000 DR3=00000000 
machine# DR6=ffff0ff0 DR7=00000400
machine# EFER=0000000000000000
machine# Code=00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 <00> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

@vcunat
Member Author

vcunat commented Feb 23, 2017

Well, that looks very different from the errors on Hydra.

@abbradar
Member

Eeeeh, first I couldn't reproduce my error anymore and now it just works.

@abbradar
Member

So, out of 10 runs it succeeds 7 times and fails 3 times without any visible errors. I couldn't reproduce the KVM error.
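
Counting pass/fail over repeated runs like this can be scripted. A minimal sketch, where `count_runs` is a made-up helper name and the command passed to it stands in for however the hibernate test is invoked locally:

```shell
# Run a command N times and report how many runs passed or failed.
# Usage: count_runs N COMMAND [ARGS...]
count_runs() (
  total=$1
  shift
  pass=0
  fail=0
  i=0
  while [ "$i" -lt "$total" ]; do
    # Only the exit status is counted; output from the command is discarded.
    if "$@" >/dev/null 2>&1; then
      pass=$((pass + 1))
    else
      fail=$((fail + 1))
    fi
    i=$((i + 1))
  done
  echo "pass=$pass fail=$fail total=$total"
)
```

For example, `count_runs 10 ./run-hibernate-test.sh` (a placeholder script name) prints a single summary line. Since output is discarded, logs would need to be saved separately to diagnose the failing runs.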

@abbradar
Member

abbradar commented Feb 23, 2017

I'm testing this with dezgeg@7bc78cd (see #23020) now, just in case.

EDIT: no luck with that either.

@abbradar
Member

Still no luck with determining the cause. We may just move on and mark this test as non-mandatory given that it's unstable...

@abbradar
Member

abbradar commented Feb 25, 2017

More testing; I think I see three distinct patterns:

  1. Test succeeds;
  2. Test fails, waiting indefinitely for the machine to respond to a network request after resume;
  3. (rare) Internal KVM error.

This shows up in the log in situation 2:

machine# [    1.529586] PM: Image loading progress: 100%
machine# [    1.530081] PM: Image loading done.
machine# [    1.530342] PM: Read 89340 kbytes in 0.22 seconds (406.09 MB/s)
machine# [    1.531371] Suspending console(s) (use no_console_suspend to debug)
vde_switch: EOF data port: Interrupted system call

i.e. something happens with vde_switch or qemu, maybe the machine suddenly halts -- not sure...

Either way, this is not because the machine hangs, netcat doesn't start, or something like that -- on successful runs we see more logs from the machine, like:

machine# [    1.537849] Suspending console(s) (use no_console_suspend to debug)
machine# [    5.336002] PM: freeze of devices complete after 21.770 msecs
machine# [    5.336133] PM: late freeze of devices complete after 0.127 msecs
machine# [    5.338409] PM: noirq freeze of devices complete after 2.272 msecs
machine# [    5.338411] ACPI: Preparing to enter system sleep state S4
machine# [    5.338435] PM: Saving platform NVS memory
machine# [    5.338435] Disabling non-boot CPUs ...
machine# [    5.338671] PM: Creating hibernation image:
machine# [    5.342469] PM: Need to copy 22304 pages
machine# [    1.587801] kvm-clock: cpu 0, msr 0:17fd9001, primary cpu clock, resume
machine# [    1.587824] PM: Restoring platform NVS memory
machine# [    1.588518] Suspended for 3.543 seconds
machine# [    1.588597] ACPI: Waking up from system sleep state S4
machine# [    1.590730] PM: noirq restore of devices complete after 2.101 msecs
machine# [    1.590853] PM: early restore of devices complete after 0.060 msecs
machine# [    1.593212] rtc_cmos 00:00: System wakeup disabled by ACPI
machine# [    1.594761] usb usb1: root hub lost power or was reset
machine# [    1.783424] ata2.00: configured for MWDMA2
machine# [    1.950600] usb 1-1: reset full-speed USB device number 2 using uhci_hcd
machine# [    2.091663] PM: restore of devices complete after 498.495 msecs
machine# [    2.099568] Restarting tasks ...
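
The three patterns can be told apart roughly by searching a saved test log for the strings quoted in this thread. A heuristic sketch only: `classify_log` is a made-up helper, and a log containing "Restarting tasks" shows that the machine resumed, not that the whole test passed.

```shell
# Classify a saved test log by the markers seen in the excerpts above.
# $1 is the path to the log file.
classify_log() {
  if grep -q 'KVM internal error' "$1"; then
    echo kvm-error    # pattern 3: QEMU/KVM aborted
  elif grep -q 'Restarting tasks' "$1"; then
    echo resumed      # pattern 1: the kernel got through the resume path
  else
    echo no-resume    # pattern 2: nothing logged after the image was loaded
  fi
}
```

Running it over the logs of a batch of runs, for example `for f in logs/*.log; do classify_log "$f"; done | sort | uniq -c`, gives a count per pattern (the `logs/` directory is an assumed location).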

@vcunat
Member Author

vcunat commented Feb 25, 2017

Apparently some attempts on Hydra passed now and nixos-unstable got updated :-)

@vcunat
Member Author

vcunat commented Mar 2, 2017

I dropped it from the tested job for now, as i686 is lower-priority, etc.

globin pushed a commit that referenced this issue Mar 2, 2017
@abbradar
Member

abbradar commented Apr 2, 2017

Possible fix: affce1e.

@fpletz
Member

fpletz commented Oct 14, 2017

The hibernate tests for i686 have been green for a while so this is fixed.

@fpletz fpletz closed this as completed Oct 14, 2017
adrianpk added a commit to adrianpk/nixpkgs that referenced this issue May 31, 2024