i686 swap problems in tests (master) #23107

vcunat · 2017-02-23T14:34:18Z

Issue description

We're reproducibly getting problems in the hibernate test, but only on i686-linux.

machine# [   11.350824] swapon[633]: swapon: /dev/vdb: read swap header failed
machine# [   11.358592] systemd[1]: dev-vdb.swap: Swap process exited, code=exited status=255
machine# [   11.372330] systemd[1]: Failed to activate swap /dev/vdb.
machine# [   11.377620] systemd[1]: Dependency failed for Swap.
machine# [   11.382812] systemd[1]: swap.target: Job swap.target/start failed with result 'dependency'.
machine# [   11.392925] systemd[1]: dev-vdb.swap: Unit entered failed state.

As a result, the channel is blocked, missing security updates, etc. I don't know if the problem also happens outside VM or outside tests.

abbradar · 2017-02-23T16:21:32Z

Aaargh, I would have looked at this, but I've accidentally reverted to a configuration with unstable Nix! So now I need to reinstall NixOS again, or I can't run tests because of #22868 D:

vcunat · 2017-02-23T16:37:14Z

Oh, is there not a way to go back? I'd think you "only" need to downgrade the database schema.

abbradar · 2017-02-23T16:46:03Z

Actually, I was just stupid and hadn't read the manual carefully enough:

# nix-build -A nixUnstable -o nix-new
# nix-build -A nix -o nix-old
# nixos-rebuild switch # to configuration with stable Nix
# nix-new/bin/nix-store --dump-db > db.nix
# mv /nix/var /nix/var.old
# systemctl restart nix-daemon.socket
# cat db.nix | nix-old/bin/nix-store --load-db

Not a nice procedure but better than a reinstall!

vcunat · 2017-02-23T16:52:50Z

It would seem nice if we improved the situation, as it's one of the rare things you can't just roll back.

abbradar · 2017-02-23T17:11:04Z

A rough sketch of how that could work:

  systemd.services.nix-daemon.preStart = ''
    # Check that the configured Nix can read the current schema
    # (check-schema is a placeholder; no such command exists yet)
    if ! check-schema; then
      # Dump with the previously used Nix, which understands the newer schema
      /nix/var/nix/current/bin/nix-store --dump-db > /nix/var/nix/db.dump
      mv /nix/var/nix/db /nix/var/nix/db.old
      if ! ${cfg.package}/bin/nix-store --load-db < /nix/var/nix/db.dump; then
        rm -rf /nix/var/nix/db
        mv /nix/var/nix/db.old /nix/var/nix/db
        echo "Failed to downgrade database schema" >&2
        exit 1
      fi
      rm /nix/var/nix/db.dump
      rm -rf /nix/var/nix/db.old
    fi
    # Remember which Nix last ran against this database
    ln -sfn ${cfg.package} /nix/var/nix/current
  '';

There's still the question of how to lock against concurrent access...
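One way the locking question could be approached, purely as a sketch: run the migration behind an exclusive `flock(1)` lock so that two invocations never dump and reload at the same time. The `with_db_lock` helper and the lock path are hypothetical, and this only serializes callers that use the same helper; it does not by itself stop an already-running nix-daemon from touching the database.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command while holding an exclusive lock file.
# The lock path is an assumption, not something Nix itself defines.
with_db_lock() {
  local lockfile="$1"; shift
  (
    # The lock file is opened on fd 9; block until we hold it exclusively.
    flock -x 9 || exit 1
    "$@"
  ) 9>"$lockfile"
}
```

For example, `with_db_lock /nix/var/nix/db-migrate.lock some-migration-command` would run the (hypothetical) migration command only once the lock is held, and release the lock when the subshell exits.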

abbradar · 2017-02-23T17:20:12Z

I get this when running the test:

machine# KVM internal error. Suberror: 1
machine# emulation failure
machine# EAX=d70d9dc8 EBX=00018000 ECX=0000007b EDX=00000000
machine# ESI=ceb74000 EDI=d584ee00 EBP=d70d9e24 ESP=d70d9dc4
machine# EIP=d584ee00 EFL=00010083 [--S---C] CPL=0 II=0 A20=1 SMM=0 HLT=0
machine# ES =007b 00000000 ffffffff 00c0f300 DPL=3 DS   [-WA]
machine# CS =0060 00000000 ffffffff 00c09b00 DPL=0 CS32 [-RA]
machine# SS =0068 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
machine# DS =007b 00000000 ffffffff 00c0f300 DPL=3 DS   [-WA]
machine# FS =00d8 0169f000 ffffffff 00809300 DPL=0 DS16 [-WA]
machine# GS =00e0 d7647200 00000018 00409100 DPL=0 DS   [--A]
machine# LDT=0000 00000000 ffffffff 00000000
machine# TR =0080 d7645080 0000206b 00008b00 DPL=0 TSS32-busy
machine# GDT=     d763c000 000000ff
machine# IDT=     fffba000 000007ff
machine# CR0=80050033 CR2=d603b478 CR3=0bfb3000 CR4=000006b0
machine# DR0=00000000 DR1=00000000 DR2=00000000 DR3=00000000 
machine# DR6=ffff0ff0 DR7=00000400
machine# EFER=0000000000000000
machine# Code=00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 <00> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

vcunat · 2017-02-23T17:57:15Z

Well, that looks very different from the errors on Hydra.

abbradar · 2017-02-23T18:07:34Z

Eeeeh, first I couldn't reproduce my error anymore and now it just works.

abbradar · 2017-02-23T18:13:38Z

So, out of 10 runs, it succeeds 7 times and fails 3 times without any visible errors. I couldn't reproduce the KVM error.
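For anyone wanting to measure the flakiness the same way, a minimal sketch of a repeat runner follows. The `run_repeatedly` name is made up for illustration, and the command you pass in is whatever invokes the test on your machine.

```shell
#!/usr/bin/env bash
# Run a command N times and report how many runs passed and failed.
# Usage: run_repeatedly <count> <command> [args...]
run_repeatedly() {
  local runs="$1"; shift
  local passed=0 failed=0 i
  for ((i = 1; i <= runs; i++)); do
    # Discard the command's output; only its exit status is counted.
    if "$@" >/dev/null 2>&1; then
      passed=$((passed + 1))
    else
      failed=$((failed + 1))
    fi
  done
  echo "passed=$passed failed=$failed"
}
```

Called as `run_repeatedly 10 ./run-hibernate-test.sh` (a hypothetical wrapper script), it prints a single summary line with the two counts.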

abbradar · 2017-02-23T18:31:15Z

I'm testing this with dezgeg@7bc78cd (see #23020) now, just in case.

EDIT: no luck with that either.

abbradar · 2017-02-24T17:28:58Z

Still no luck with determining the cause. We may just move on and mark this test as non-mandatory given that it's unstable...

abbradar · 2017-02-25T13:12:51Z

More testing; I think I see three distinct patterns:

1. Test succeeds;
2. Test fails, waiting indefinitely for the machine to respond to a network request after resume;
3. (rare) Internal KVM error.

This shows up in the log in situation 2:

machine# [    1.529586] PM: Image loading progress: 100%
machine# [    1.530081] PM: Image loading done.
machine# [    1.530342] PM: Read 89340 kbytes in 0.22 seconds (406.09 MB/s)
machine# [    1.531371] Suspending console(s) (use no_console_suspend to debug)
vde_switch: EOF data port: Interrupted system call

I.e. something happens with vde_switch or qemu; maybe the machine suddenly halts -- not sure...

Either way, this is not because the machine hangs, netcat doesn't start, or something like that -- on successful runs we see more log output from the machine, like:

machine# [    1.537849] Suspending console(s) (use no_console_suspend to debug)
machine# [    5.336002] PM: freeze of devices complete after 21.770 msecs
machine# [    5.336133] PM: late freeze of devices complete after 0.127 msecs
machine# [    5.338409] PM: noirq freeze of devices complete after 2.272 msecs
machine# [    5.338411] ACPI: Preparing to enter system sleep state S4
machine# [    5.338435] PM: Saving platform NVS memory
machine# [    5.338435] Disabling non-boot CPUs ...
machine# [    5.338671] PM: Creating hibernation image:
machine# [    5.342469] PM: Need to copy 22304 pages
machine# [    1.587801] kvm-clock: cpu 0, msr 0:17fd9001, primary cpu clock, resume
machine# [    1.587824] PM: Restoring platform NVS memory
machine# [    1.588518] Suspended for 3.543 seconds
machine# [    1.588597] ACPI: Waking up from system sleep state S4
machine# [    1.590730] PM: noirq restore of devices complete after 2.101 msecs
machine# [    1.590853] PM: early restore of devices complete after 0.060 msecs
machine# [    1.593212] rtc_cmos 00:00: System wakeup disabled by ACPI
machine# [    1.594761] usb usb1: root hub lost power or was reset
machine# [    1.783424] ata2.00: configured for MWDMA2
machine# [    1.950600] usb 1-1: reset full-speed USB device number 2 using uhci_hcd
machine# [    2.091663] PM: restore of devices complete after 498.495 msecs
machine# [    2.099568] Restarting tasks ...

vcunat · 2017-02-25T21:20:46Z

Apparently some attempts on Hydra passed now and nixos-unstable got updated :-)

vcunat · 2017-03-02T06:36:19Z

I dropped it from the tested job for now, as i686 is lower-priority, etc.

abbradar · 2017-04-02T09:33:32Z

Possible fix: affce1e.

fpletz · 2017-10-14T20:24:23Z

The hibernate tests for i686 have been green for a while, so this is fixed.

vcunat added the 0.kind: regression, 1.severity: blocker, 1.severity: security, and 6.topic: nixos labels Feb 23, 2017

abbradar mentioned this issue Feb 23, 2017

nix service: try to downgrade schema #23117

Closed

7 tasks

vcunat removed the 1.severity: blocker and 1.severity: security labels Feb 25, 2017

vcunat added a commit that referenced this issue Mar 2, 2017

tested job: drop the hibernate test on i686 for now

45344fd

/cc #23107.

globin pushed a commit that referenced this issue Mar 2, 2017

tested job: drop the hibernate test on i686 for now

bfca6a9

/cc #23107. (cherry picked from commit 45344fd)

fpletz closed this as completed Oct 14, 2017

adrianpk added a commit to adrianpk/nixpkgs that referenced this issue May 31, 2024

tested job: drop the hibernate test on i686 for now

1e1b70f

/cc NixOS#23107. (cherry picked from commit 45344fd)
