
bgpd: stuck in unresponsive state #18606

@g00g1

Description

I don't fully understand what happened; I can only attach the relevant logs: frr-bgpd.txt

Version

FRRouting 10.2.1 (redacted) on Linux(5.14.0-427.22.1.el9_4.x86_64).
Copyright 1996-2005 Kunihiro Ishiguro, et al.
configured with:
    '--build=x86_64-redhat-linux-gnu' '--host=x86_64-redhat-linux-gnu' '--program-prefix=' '--disable-dependency-tracking' '--prefix=/usr' '--exec-prefix=/usr' '--bindir=/usr/bin' '--datadir=/usr/share' '--includedir=/usr/include' '--libdir=/usr/lib64' '--libexecdir=/usr/libexec' '--sharedstatedir=/var/lib' '--mandir=/usr/share/man' '--infodir=/usr/share/info' '--sbindir=/usr/lib/frr' '--sysconfdir=/etc' '--localstatedir=/var' '--disable-static' '--disable-werror' '--enable-multipath=256' '--enable-vtysh' '--enable-ospfclient' '--enable-ospfapi' '--enable-rtadv' '--enable-ldpd' '--enable-pimd' '--enable-pim6d' '--enable-pbrd' '--enable-nhrpd' '--enable-eigrpd' '--enable-babeld' '--enable-vrrpd' '--enable-user=frr' '--enable-group=frr' '--enable-vty-group=frrvty' '--enable-fpm' '--enable-watchfrr' '--disable-bgp-vnc' '--enable-isisd' '--enable-rpki' '--enable-bfdd' '--enable-pathd' '--disable-grpc' '--enable-snmp' 'build_alias=x86_64-redhat-linux-gnu' 'host_alias=x86_64-redhat-linux-gnu' 'PKG_CONFIG_PATH=:/usr/lib64/pkgconfig:/usr/share/pkgconfig' 'CC=gcc' 'CXX=g++' 'LT_SYS_LIBRARY_PATH=/usr/lib64:'

How to reproduce

N/A

Expected behavior

watchfrr should send SIGKILL (kill -9) to the unresponsive bgpd process once the SIGTERM (kill -15) attempt times out.
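
For reference, the escalation I expect can be sketched as below. This is a generic illustration of "SIGTERM, then SIGKILL after a timeout", not watchfrr's actual implementation; the function name `stop_with_escalation` and the timeout values are hypothetical.

```python
import signal
import subprocess
import sys


def stop_with_escalation(proc, term_timeout):
    """Ask `proc` to exit with SIGTERM; if it is still alive after
    `term_timeout` seconds, force it down with SIGKILL.
    Returns the exit status (negative signal number on POSIX)."""
    proc.send_signal(signal.SIGTERM)
    try:
        return proc.wait(timeout=term_timeout)
    except subprocess.TimeoutExpired:
        # The process ignored or could not handle SIGTERM in time.
        proc.kill()  # SIGKILL on POSIX
        return proc.wait()


if __name__ == "__main__":
    # Stand-in for an unresponsive daemon: it ignores SIGTERM and sleeps.
    stubborn = (
        "import signal, time;"
        "signal.signal(signal.SIGTERM, signal.SIG_IGN);"
        "print('ready', flush=True);"
        "time.sleep(60)"
    )
    child = subprocess.Popen(
        [sys.executable, "-c", stubborn], stdout=subprocess.PIPE, text=True
    )
    child.stdout.readline()  # wait until SIGTERM is being ignored
    print("exit status:", stop_with_escalation(child, term_timeout=1.0))
```

In my case the second step apparently never happened, so bgpd stayed in the unresponsive state until I sent SIGKILL by hand.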

Actual behavior

bgpd became unresponsive, and watchfrr did not kill it for two hours.

Additional context

This was logged just before bgpd became unresponsive:

Apr 8, 2025 @ 07:19:58.706	[VH6Z7-MNSN0][EC 33554511] 2001:7b8:62b:1:0:d4ff:fe72:7848(Unknown) has not made any SendQ progress for 1 holdtime (9s), peer overloaded?
Apr 8, 2025 @ 07:20:04.448	[VH6Z7-MNSN0][EC 33554511] 2001:7b8:62b:1:0:d4ff:fe72:7848(Unknown) has not made any SendQ progress for 1 holdtime (9s), peer overloaded?
Apr 8, 2025 @ 07:20:07.355	[JQ5A9-TEQYM][EC 33554512] 2001:7b8:62b:1:0:d4ff:fe72:7848(Unknown) has not made any SendQ progress for 2 holdtimes (18s), terminating session

I have resolved my problem by killing bgpd with SIGKILL.

Relevant backtrace (produced on systemctl restart frr):

Apr 8, 2025 @ 09:57:43.803  Received signal 11 at 1744106263 (si_addr 0x2d0, PC 0x7f441988b5f2); aborting...
Apr 8, 2025 @ 09:57:43.804  /lib64/libfrr.so.0(zlog_backtrace_sigsafe+0x71) [0x7f4419cceeb1]
Apr 8, 2025 @ 09:57:43.804  /lib64/libfrr.so.0(zlog_signal+0xf5) [0x7f4419ccf0b5]
Apr 8, 2025 @ 09:57:43.804  /lib64/libfrr.so.0(+0x109e45) [0x7f4419d09e45]
Apr 8, 2025 @ 09:57:43.804  /lib64/libc.so.6(+0x3e6f0) [0x7f441983e6f0]
Apr 8, 2025 @ 09:57:43.804  /lib64/libc.so.6(+0x8b5f2) [0x7f441988b5f2]
Apr 8, 2025 @ 09:57:43.804  /lib64/libfrr.so.0(+0xb69fd) [0x7f4419cb69fd]
Apr 8, 2025 @ 09:57:43.805  /lib64/libfrr.so.0(frr_pthread_stop_all+0x57) [0x7f4419cb5087]
Apr 8, 2025 @ 09:57:43.805  /lib64/libfrr.so.0(frr_pthread_finish+0x1d) [0x7f4419cb68dd]
Apr 8, 2025 @ 09:57:43.805  /lib64/libfrr.so.0(frr_fini+0x78) [0x7f4419cc5b78]
Apr 8, 2025 @ 09:57:43.805  /usr/lib/frr/bgpd(sigint+0x20b) [0x55cb8d69d09b]
Apr 8, 2025 @ 09:57:43.805  /lib64/libfrr.so.0(frr_sigevent_process+0x53) [0x7f4419d08b43]
Apr 8, 2025 @ 09:57:43.805  /lib64/libfrr.so.0(event_fetch+0x6b5) [0x7f4419d1cc85]
Apr 8, 2025 @ 09:57:43.805  /lib64/libfrr.so.0(frr_run+0xe3) [0x7f4419cc5933]
Apr 8, 2025 @ 09:57:43.805  /usr/lib/frr/bgpd(main+0x3f2) [0x55cb8d6943e2]
Apr 8, 2025 @ 09:57:43.806  /lib64/libc.so.6(+0x29590) [0x7f4419829590]
Apr 8, 2025 @ 09:57:43.806  /lib64/libc.so.6(__libc_start_main+0x80) [0x7f4419829640]
Apr 8, 2025 @ 09:57:43.806  /usr/lib/frr/bgpd(_start+0x25) [0x55cb8d695125]

gdb symbols:

Reading symbols from /usr/lib/debug/usr/lib64/libfrr.so.0.0.0-10.2.1-01.el9.x86_64.debug...
(gdb) info symbol 0x109e45
core_handler + 181 in section .text of /usr/lib64/libfrr.so.0.0.0
(gdb) info symbol 0xb69fd
fpt_halt + 61 in section .text of /usr/lib64/libfrr.so.0.0.0
gdb /lib64/libc.so.6


(gdb) info symbol 0x3e6f0
__restore_rt in section .text of /usr/lib64/libc.so.6
(gdb) info symbol 0x8b5f2
__pthread_clockjoin_ex + 34 in section .text of /usr/lib64/libc.so.6
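
The frames looked up above are the ones the backtrace prints as `module(+0xOFFSET)` with no symbol name. A small sketch for pulling those module/offset pairs out of a log so they can be passed to `info symbol`; the helper name `unresolved_frames` is hypothetical, and the regex only assumes the frame layout shown in the backtrace above.

```python
import re

# Matches frames such as:
#   /lib64/libfrr.so.0(zlog_signal+0xf5) [0x7f4419ccf0b5]
#   /lib64/libc.so.6(+0x8b5f2) [0x7f441988b5f2]
FRAME_RE = re.compile(
    r"(?P<module>/\S+?)\((?P<symbol>[^()+]*)\+(?P<offset>0x[0-9a-fA-F]+)\)"
    r"\s+\[(?P<addr>0x[0-9a-fA-F]+)\]"
)


def unresolved_frames(log_text):
    """Return (module, offset) for every frame that has no symbol name."""
    return [
        (m.group("module"), m.group("offset"))
        for m in FRAME_RE.finditer(log_text)
        if not m.group("symbol")
    ]


if __name__ == "__main__":
    sample = """\
Apr 8, 2025 @ 09:57:43.804  /lib64/libfrr.so.0(zlog_signal+0xf5) [0x7f4419ccf0b5]
Apr 8, 2025 @ 09:57:43.804  /lib64/libfrr.so.0(+0x109e45) [0x7f4419d09e45]
Apr 8, 2025 @ 09:57:43.804  /lib64/libc.so.6(+0x8b5f2) [0x7f441988b5f2]
Apr 8, 2025 @ 09:57:43.804  /lib64/libfrr.so.0(+0xb69fd) [0x7f4419cb69fd]
"""
    for module, offset in unresolved_frames(sample):
        # Print the gdb command to run against that module's debug symbols.
        print(f"{module}: (gdb) info symbol {offset}")
```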

Checklist

  • I have searched the open issues for this bug.
  • I have not included sensitive information in this report.

Metadata

Labels

triage: Needs further investigation
