system instability after updating systems #106791

Closed
ghost opened this issue Dec 12, 2020 · 10 comments
ghost commented Dec 12, 2020

Describe the bug
I updated 6 systems to the latest staging commit yesterday to get openssl 1.1.1i.
Shortly after that, monitoring started reporting these systems as down every few hours. I could ssh into them, but the systemd process wouldn't respond (for example, "reboot" and "systemctl" didn't work). See the logs below.
I then downgraded to the latest master today, but the same thing kept happening on all 6 servers.

This is the hardware. It's all rented from Hetzner Online.
2x Intel(R) Core(TM) i7-8700 CPU / MSI Z370 GAMING PLUS
4x Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz / Gigabyte B360 HD3P-LM

To Reproduce
Steps to reproduce the behavior:
I am not sure if this can be reproduced reliably. If you rent a server with an i7-8700 or i9-9900K from Hetzner, try upgrading your system to the latest master and see what happens.

Expected behavior
The system should not crash.

Screenshots

[ 1991.670981] show_signal_msg: 39 callbacks suppressed
[ 1991.670983] type1[13843]: segfault at fffffffffffffffe ip 0000000000401625 sp 00007fffffffa8f0 error 5 in type1[401000+2000]
[ 1991.670987] Code: fa ff ff 48 c7 05 2f 3a 00 00 00 00 00 00 eb dd 0f 1f 44 00 00 48 63 05 2d 3a 00 00 48 8b 15 2a 3a 00 00 41 54 55 48 89 fd 53 <80> 7c 02 fe 0d 49 89 c4 75 0e 41 83 ec 01 44 89 25 0a 3a 00 00 49
[ 1991.758554] type1[13854]: segfault at fffffffffffffffe ip 0000000000401625 sp 00007fffffffa8f0 error 5 in type1[401000+2000]
[ 1991.758558] Code: fa ff ff 48 c7 05 2f 3a 00 00 00 00 00 00 eb dd 0f 1f 44 00 00 48 63 05 2d 3a 00 00 48 8b 15 2a 3a 00 00 41 54 55 48 89 fd 53 <80> 7c 02 fe 0d 49 89 c4 75 0e 41 83 ec 01 44 89 25 0a 3a 00 00 49
[  574.515888] show_signal: 37 callbacks suppressed
[  574.515891] traps: gjs-console[22290] trap int3 ip:7ffff7de3bf5 sp:7fffffff2730 error:0 in libglib-2.0.so.0.6600.3[7ffff7da9000+85000]
[  575.103587] traps: gjs-console[22936] trap int3 ip:7ffff7de3bf5 sp:7fffffff2980 error:0 in libglib-2.0.so.0.6600.3[7ffff7da9000+85000]
[  575.186718] traps: gjs-console[22966] trap int3 ip:7ffff7de3bf5 sp:7fffffff2710 error:0 in libglib-2.0.so.0.6600.3[7ffff7da9000+85000]
[ 2063.395936] traps: systemd[1] general protection fault ip:557329718d71 sp:7ffe90d66100 error:0 in systemd[557329694000+cf000]
[mil@build-worker-04:~]$ systemctl
Failed to list units: Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)

[mil@build-worker-04:~]$
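
When comparing crashes across machines, it can help to reduce each kernel `segfault` line to the offset of the faulting instruction inside the mapped object (`ip` minus the mapping base), since that offset stays the same even when the load address changes. A minimal Python sketch, assuming the log lines have the shape shown above; `parse_segfault` is a hypothetical helper name, not part of any tool mentioned in this issue:

```python
import re

# Matches kernel user-space segfault reports shaped like the lines in the log above.
SEGFAULT_RE = re.compile(
    r"(?P<comm>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<error>[0-9a-f]+) "
    r"in (?P<object>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def parse_segfault(line):
    """Return the fields of a kernel segfault line, or None if the line does not match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    ip = int(m["ip"], 16)
    base = int(m["base"], 16)
    return {
        "comm": m["comm"],
        "pid": int(m["pid"]),
        "fault_addr": int(m["addr"], 16),
        "error": int(m["error"], 16),  # the kernel prints this value in hex
        "object": m["object"],
        "offset": ip - base,  # offset of the faulting instruction within the mapping
    }


if __name__ == "__main__":
    line = (
        "[ 1991.670983] type1[13843]: segfault at fffffffffffffffe "
        "ip 0000000000401625 sp 00007fffffffa8f0 error 5 in type1[401000+2000]"
    )
    info = parse_segfault(line)
    print(info["object"], hex(info["offset"]))
```

Identical offsets for the same binary on different hosts would suggest the same crash site rather than unrelated failures.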

Additional context
Add any other context about the problem here.

Notify maintainers

Metadata

 - system: `"x86_64-linux"`
 - host os: `Linux 5.9.14, NixOS, 21.03pre-git (Okapi)`
 - multi-user?: `yes`
 - sandbox: `yes`
 - version: `nix-env (Nix) 2.3.9`
 - nixpkgs: `/etc/src/nixpkgs`

Maintainer information:

# a list of nixpkgs attributes affected by the problem
attribute:
- systemd
- glib
# a list of nixos modules affected by the problem
module:
@ghost ghost added the 0.kind: bug label Dec 12, 2020
Author

ghost commented Dec 12, 2020

Not 100% sure if this is related but nixUnstable just segfaulted on another machine

[  997.000597] nix[20191]: segfault at 7fe4ec021000 ip 00007fe522b24d40 sp 00007fe4bd885dd0 error 4 in libgc.so.1.4.3[7fe522b1d000+1b000]
[  997.000607] Code: 83 ec 08 48 85 f6 74 3d 48 83 ed 08 48 39 eb 77 34 48 8b 05 ea e1 01 00 4c 8b 20 48 8b 05 f0 e1 01 00 4c 8b 28 0f 1f 44 00 00 <48> 8b 3b 4c 39 e7 72 0a 4c 39 ef 73 05 e8 4e fd ff ff 48 83 c3 08
[nix-shell:~/proj/nixfiles]$ nix build --experimental-features nix-command -f . deploy.all
[203.6 MiB DL]Segmentation fault (core dumped)

[nix-shell:~/proj/nixfiles]$ nix build --experimental-features nix-command -f . deploy.all
Segmentation fault (core dumped)

[nix-shell:~/proj/nixfiles]$

How to (probably) reproduce:

nix build -f https://github.com/nixos/nixpkgs/archive/889638221cb6ca61468b2681830c57e8a80d1b26.tar.gz nixUnstable
result/bin/nix build --experimental-features nix-command -f https://git.petabyte.dev/petabyteboy/nixfiles/archive/2f8dc9c3090bee76986e305855688780174db61e.tar.gz deploy.x86_64-linux

CC @edolstra

Contributor

xaverdh commented Dec 13, 2020

sounds like a broken user session.. did you try restarting (running reboot as root should work)?

Author

ghost commented Dec 13, 2020

> sounds like a broken user session.. did you try restarting (running reboot as root should work)?

Yes. For the Nix segfault: it was reproducible across sessions and across systems.
For the libsystemd one: running reboot as root did not work because the system init process was not responding.

Member

chvp commented Dec 13, 2020

I'm seeing this as well (or at least, I think I'm seeing the same thing). This is on latest nixos-unstable-small.

Dec 13 15:09:46 hostname kernel: show_signal_msg: 1 callbacks suppressed
Dec 13 15:09:46 hostname kernel: systemd[1]: segfault at 7f4ade96ff70 ip 00007f4ade96ff70 sp 00007ffd8063df58 error 15 in libc-2.32.so[7f4ade96f000+2000]
Dec 13 15:09:46 hostname kernel: Code: 00 00 30 ff 96 de 4a 7f 00 00 30 84 1d b2 53 56 00 00 70 95 1d b2 53 56 00 00 50 ff 96 de 4a 7f 00 00 50 ff 96 de 4a 7f 00 00 <80> b5 22 b2 53 56 00 00 80 b5 22 b2 53 56 00 00 70 ff 96 de 4a 7f
Dec 13 15:09:46 hostname systemd-coredump[745]: Due to PID 1 having crashed coredump collection will now be turned off.
Dec 13 15:09:51 hostname systemd-coredump[745]: Cannot resolve systemd-coredump user. Proceeding to dump core as root: No such process
Dec 13 15:09:51 hostname systemd-coredump[745]: Process 743 (systemd) of user 0 dumped core.
Dec 13 15:09:51 hostname systemd[1]: Caught <SEGV>, dumped core as pid 743.
Dec 13 15:09:51 hostname systemd[1]: Freezing execution.

Reboot only works with systemctl reboot --force --force (and does not fix the problem).
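
The `error` field in the kernel line above is the x86 page-fault error code, printed in hex, so `error 15` is `0x15`. Decoding its bits can narrow down what kind of access failed. A small sketch using the architectural bit meanings; the function name and the wording of the labels are my own:

```python
# x86 page-fault error code bits: (mask, meaning if set, meaning if clear).
PF_BITS = [
    (0x01, "protection violation (page was present)", "page not present"),
    (0x02, "write access", "read/fetch access"),
    (0x04, "user mode", "kernel mode"),
    (0x08, "reserved bit set in page table", None),
    (0x10, "instruction fetch", None),
]


def decode_pf_error(code_hex):
    """Decode the hex error code from a kernel 'segfault at ... error N' line."""
    code = int(code_hex, 16)
    meanings = []
    for mask, if_set, if_clear in PF_BITS:
        label = if_set if code & mask else if_clear
        if label is not None:
            meanings.append(label)
    return meanings


if __name__ == "__main__":
    for code in ("15", "5", "4"):
        print(code, "->", ", ".join(decode_pf_error(code)))
```

For `15`, this reports a user-mode instruction fetch from a present page. Together with the fault address being equal to `ip`, that would be consistent with execution jumping into non-executable memory, though this is an interpretation and not something the log states directly.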

Member

chvp commented Dec 13, 2020

I have the coredump I can give to someone if that would help. (Not posting publicly since I have no idea what is in there.) This is the backtrace:

#0  0x00007f4ade7ed587 in kill () from /nix/store/1yvpgm763b3hvg8q4fzpzmflr5674x4j-glibc-2.32-10/lib/libc.so.6
#1  0x00005653b108ecdf in crash ()
#2  <signal handler called>
#3  0x00007f4ade96ff70 in main_arena () from /nix/store/1yvpgm763b3hvg8q4fzpzmflr5674x4j-glibc-2.32-10/lib/libc.so.6
#4  0x00007f4adebb33d7 in base_bucket_hash ()
   from /nix/store/19b2j9a2k9plk17rqb92cvr5bkjsb011-systemd-247/lib/systemd/libsystemd-shared-247.so
#5  0x00007f4adebb55ab in _hashmap_remove_value ()
   from /nix/store/19b2j9a2k9plk17rqb92cvr5bkjsb011-systemd-247/lib/systemd/libsystemd-shared-247.so
#6  0x00005653b1108d7d in unit_unwatch_pid ()
#7  0x00005653b10cf549 in manager_invoke_sigchld_event ()
#8  0x00005653b10cf931 in manager_dispatch_sigchld ()
#9  0x00007f4adec5043a in source_dispatch ()
   from /nix/store/19b2j9a2k9plk17rqb92cvr5bkjsb011-systemd-247/lib/systemd/libsystemd-shared-247.so
#10 0x00007f4adec508b1 in sd_event_dispatch ()
   from /nix/store/19b2j9a2k9plk17rqb92cvr5bkjsb011-systemd-247/lib/systemd/libsystemd-shared-247.so
#11 0x00007f4adec50f98 in sd_event_run ()
   from /nix/store/19b2j9a2k9plk17rqb92cvr5bkjsb011-systemd-247/lib/systemd/libsystemd-shared-247.so
#12 0x00005653b10d932a in manager_loop ()
#13 0x00005653b10891dc in main ()

Member

chvp commented Dec 13, 2020

Could it be this issue: systemd/systemd#17768? The backtraces in there seem similar to me.

@chvp chvp mentioned this issue Dec 13, 2020
10 tasks
Author

ghost commented Dec 13, 2020

Thanks a lot for discovering what is probably the cause and fixing it! I will close this issue for now unless this keeps happening with systemd 247.1.

@ghost ghost closed this as completed Dec 13, 2020
Member

pinpox commented Dec 16, 2020

Is there a workaround until this hits nixos-unstable? I'm running into this issue on multiple servers and it's causing quite some trouble.

Any estimates on when this will be out of staging?

Author

ghost commented Dec 16, 2020

It's simple: don't update to latest master. systemd 247 never reached the nixos-unstable channel, so if you are using that channel you should not see these issues in the first place.

Member

chvp commented Dec 16, 2020

nixos-unstable-small has systemd 247 (which is how I first ran into this issue). I temporarily switched to the nixos-unstable channel to roll back to systemd 246.6.
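
For anyone checking several servers, the version numbers mentioned in this thread can be compared mechanically. The sketch below is only an illustration: it assumes that exactly version 247 is suspect and that 247.1 carries the fix, which is what this thread expects but has not confirmed. The host names and the `is_suspect` helper are hypothetical.

```python
def parse_version(text):
    """Turn a dotted version string such as '246.6' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


# Assumption taken from this thread: plain 247 shows the crash, 247.1 is expected to fix it.
SUSPECT_VERSION = (247,)


def is_suspect(version):
    """Return True if the given systemd version string is the one reported as crashing."""
    return parse_version(version) == SUSPECT_VERSION


if __name__ == "__main__":
    # Hypothetical inventory of host name -> systemd version.
    inventory = {"host-a": "247", "host-b": "246.6", "host-c": "247.1"}
    for host, version in inventory.items():
        status = "suspect" if is_suspect(version) else "ok"
        print(host, version, status)
```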

This issue was closed.