8.2.2 stuck with high number of peers/routes with RPKI #10826

Closed
liuxyon opened this issue Mar 18, 2022 · 23 comments · Fixed by #11138
@liuxyon

liuxyon commented Mar 18, 2022

I am running FRR v8.2.2, using the Ubuntu 20.04 and Debian 11 packages, on an Ubuntu 21.10 system. The routing system gets stuck for no apparent reason, causing FRR to crash. I haven't found the cause yet. Is there any way to find out why?

I would also like to request FRR packages for the latest Ubuntu releases, such as Ubuntu 21.10 and 21.04.

@liuxyon liuxyon added the triage Needs further investigation label Mar 18, 2022
@ton31337
Member

Can you provide at least a configuration?

@donaldsharp
Member

Or logs? This is a pretty useless bug report.

@liuxyon
Author

liuxyon commented Mar 18, 2022

2022/03/19 02:40:39 STATIC: [MRN6F-AYZC4] Terminating on signal
2022/03/19 02:40:39 ZEBRA: [XVBTQ-5QTVQ] Terminating on signal
2022/03/19 02:40:39 ZEBRA: [GE156-FS0MJ][EC 100663299] stream_read_try: read failed on fd 39: Connection reset by peer
2022/03/19 02:40:39 ZEBRA: [VXKFG-8SJRV][EC 4043309121] Client 'static' encountered an error and is shutting down.
2022/03/19 02:40:39 ZEBRA: [YDZ55-W3VM6] release_daemon_table_chunks: Released 0 table chunks
2022/03/19 02:40:39 ZEBRA: [JPSA8-5KYEA] client 17 disconnected 141713 bgp routes removed from the rib
2022/03/19 02:40:39 ZEBRA: [S929C-NZR3N] client 17 disconnected 0 bgp nhgs removed from the rib
2022/03/19 02:40:39 ZEBRA: [YDZ55-W3VM6] release_daemon_table_chunks: Released 0 table chunks
2022/03/19 02:40:39 ZEBRA: [JPSA8-5KYEA] client 32 disconnected 0 vnc routes removed from the rib
2022/03/19 02:40:39 ZEBRA: [S929C-NZR3N] client 32 disconnected 0 vnc nhgs removed from the rib
2022/03/19 02:40:39 ZEBRA: [YDZ55-W3VM6] release_daemon_table_chunks: Released 0 table chunks
2022/03/19 02:40:39 ZEBRA: [JPSA8-5KYEA] client 39 disconnected 0 static routes removed from the rib
2022/03/19 02:40:39 ZEBRA: [S929C-NZR3N] client 39 disconnected 0 static nhgs removed from the rib
2022/03/19 02:40:41 ZEBRA: [QS0NJ-H5QKJ] Zebra final shutdown
2022/03/19 02:44:40 ZEBRA: [V98V0-MTWPF] client 17 says hello and bids fair to announce only bgp routes vrf=0
2022/03/19 02:44:40 ZEBRA: [V98V0-MTWPF] client 32 says hello and bids fair to announce only vnc routes vrf=0
2022/03/19 02:44:40 ZEBRA: [V98V0-MTWPF] client 39 says hello and bids fair to announce only static routes vrf=0
2022/03/19 02:44:40 BGP: [GNAYN-F5F1G] Computing addpath IDs for addpath type All
2022/03/19 02:44:40 BGP: [MNE5N-K0G4Z] Resetting peer 2602:fed1:ca1:b::11 due to change in addpath config
2022/03/19 02:44:43 BGP: [M59KS-A3ZXZ] bgp_update_receive: rcvd End-of-RIB for IPv6 Unicast from 2a0f:85c1:22:a:1:: in vrf default
2022/03/19 02:46:40 BGP: [MNE5N-K0G4Z] Resetting peer (null) due to change in addpath config
2022/03/19 02:46:42 BGP: [M59KS-A3ZXZ] bgp_update_receive: rcvd End-of-RIB for IPv6 Unicast from 2602:fed1:ca1:b::11 in vrf default
2022/03/19 05:08:03 BGP: [MNE5N-K0G4Z] Resetting peer (null) due to change in addpath config
2022/03/19 05:08:05 BGP: [M59KS-A3ZXZ] bgp_update_receive: rcvd End-of-RIB for IPv6 Unicast from 2602:fed1:ca1:b::11 in vrf default

@liuxyon
Author

liuxyon commented Mar 19, 2022

Some IP addresses have been modified or hidden.

!
! Zebra configuration saved from vty
! 2022/03/12 19:53:42
!
frr version 8.1
frr defaults traditional
!
hostname sir
log file /etc/frr/frr.log
!
!
!
router bgp 29753
bgp router-id 134.196.121.55
no bgp ebgp-requires-policy
no bgp default ipv4-unicast
no bgp network import-check
neighbor 2602:fed2:ca1:b::11 remote-as 65105
neighbor 2602:fed2:ca1:b::11 description "my local "
neighbor 2602:fed2:ca1:b::11 disable-connected-check
neighbor 2602:fed2:ca1:b::11 update-source wg1
neighbor 2602:fed2:ca1:b::11 advertisement-interval 0
neighbor 2602:fed2:ca1:b::11 disable-connected-check
neighbor 2a09:5c0:fe0:8c::1 remote-as 68057
neighbor 2a09:5c0:fe0:8c::1 description tunnelbroke
neighbor 2a09:5c0:fe0:8c::1 update-source AS68057
neighbor 2a09:5c0:fe0:8c::1 advertisement-interval 0
neighbor 2a09:5c0:fe0:8c::1 capability dynamic
neighbor 2a09:5c0:fe0:8c::1 sender-as-path-loop-detection
neighbor 2a0f:85c1:22:a:1:: remote-as 306628
neighbor 2a0f:85c1:22:a:1:: description "AS306628 "
neighbor 2a0f:85c1:22:a:1:: disable-connected-check
neighbor 2a0f:85c1:22:a:1:: update-source ens19
neighbor 2a0f:85c1:22:a:1:: capability dynamic
!
address-family ipv4 unicast
exit-address-family
!
address-family ipv6 unicast
network 2602:fed1:ca1::/48
neighbor 2602:fed1:ca1:b::11 activate
neighbor 2602:fed1:ca1:b::11 addpath-tx-all-paths
neighbor 2602:fed1:ca1:b::11 next-hop-self
neighbor 2602:fed1:ca1:b::11 remove-private-AS all
neighbor 2602:fed1:ca1:b::11 soft-reconfiguration inbound
neighbor 2602:fed1:ca1:b::11 prefix-list mycn6out out
neighbor 2a0f:85c3:22:a:1:: activate
neighbor 2a0f:85c3:22:a:1:: remove-private-AS all
neighbor 2a0f:85c3:22:a:1:: soft-reconfiguration inbound
neighbor 2a0f:85c3:22:a:1:: prefix-list ipv6in in
neighbor 2a0f:85c3:22:a:1:: prefix-list myv6out out
neighbor 2a0f:85c3:22:a:1:: route-map A01 in
neighbor 2a0f:85c3:22:a:1:: route-map 80 out
exit-address-family
!
exit
!
ipv6 prefix-list ipv6in seq 105 deny ::1/128
ipv6 prefix-list ipv6in seq 110 deny ::/128
ipv6 prefix-list ipv6in seq 120 deny 3ffe::/16 le 128
ipv6 prefix-list ipv6in seq 130 deny 2001:db8::/32 le 128
ipv6 prefix-list ipv6in seq 140 deny 2001::/32
ipv6 prefix-list ipv6in seq 150 deny 2001::/32 le 128
ipv6 prefix-list ipv6in seq 160 permit 2002::/16
ipv6 prefix-list ipv6in seq 170 deny 2002::/16 le 128
ipv6 prefix-list ipv6in seq 180 deny ::/8 le 128
ipv6 prefix-list ipv6in seq 190 deny fe00::/9 le 128
ipv6 prefix-list ipv6in seq 200 deny ff00::/8 le 128
ipv6 prefix-list ipv6in seq 205 permit 2000::/3 le 48
ipv6 prefix-list ipv6in seq 900 deny ::/0 le 128
ipv6 prefix-list ipv6in seq 999 deny any
ipv6 prefix-list mycn6out seq 5 deny ::1/128
ipv6 prefix-list mycn6out seq 10 deny ::/128
ipv6 prefix-list mycn6out seq 15 deny 3ffe::/16 le 128
ipv6 prefix-list mycn6out seq 20 deny 2001:db8::/32 le 128
ipv6 prefix-list mycn6out seq 25 deny 2001:10::/28 le 128
ipv6 prefix-list mycn6out seq 30 deny 2001:2::/48 le 128
ipv6 prefix-list mycn6out seq 35 deny 100::/64 le 128
ipv6 prefix-list mycn6out seq 40 deny ::/8 le 128
ipv6 prefix-list mycn6out seq 45 deny fc00::/7 le 128
ipv6 prefix-list mycn6out seq 50 deny ff00::/8 le 128
ipv6 prefix-list mycn6out seq 55 deny 2002::/16 le 128
ipv6 prefix-list mycn6out seq 60 deny ::/0 ge 49 le 128
ipv6 prefix-list mycn6out seq 110 permit 2000::/3 le 48
ipv6 prefix-list mycn6out seq 999 deny any
ipv6 prefix-list myv6out seq 50 permit 2602:fed3:7021::/48
ipv6 prefix-list myv6out seq 100 permit 2602:fed1:ca1::/48
ipv6 prefix-list myv6out seq 999 deny any
!
bgp as-path access-list 2 seq 5 deny ^([0-9]+)(\1)+$
bgp as-path access-list 2 seq 10 permit .*
bgp as-path access-list 99 seq 5 permit (4294967[0-1][0-9][0-9])|(42949672[0-8][0-9])|(429496729[0-4])_
bgp as-path access-list 99 seq 10 permit (42949[0-5][0-9][0-9][0-9][0-9])|(429496[0-6][0-9][0-9][0-9])
bgp as-path access-list 99 seq 15 permit (429[0-3][0-9][0-9][0-9][0-9][0-9][0-9])|(4294[0-8][0-9][0-9][0-9][0-9][0-9])
bgp as-path access-list 99 seq 20 permit (6449[6-9])|(6450[0-9])|(6451[0-1])|(6553[6-9])|(6554[0-9])|(6555[0-1])
bgp as-path access-list 99 seq 25 permit 0
bgp as-path access-list 99 seq 30 permit 1310[0-6][0-9]|13107[0-1]
bgp as-path access-list 99 seq 35 permit 23456
bgp as-path access-list 99 seq 40 permit 42[0-8][0-9][0-9][0-9][0-9][0-9][0-9][0-9]
bgp as-path access-list 99 seq 45 permit 6(4(5(1[2-9]|[2-9][0-9])|[6-9][0-9][0-9])|5([0-4][0-9][0-9]|5([0-2][0-9]|3[0-5])))
bgp as-path access-list 99 seq 50 permit 6555[2-9]|655[6-9][0-9]|65[6-9][0-9][0-9]|6[6-9][0-9][0-9][0-9]
bgp as-path access-list 99 seq 55 permit [7-9][0-9][0-9][0-9][0-9]|1[0-2][0-9][0-9][0-9][0-9]|130[0-9][0-9][0-9]
!
!
route-map 80 permit 50
set local-preference 100
set metric 0
exit
!
route-map A01 deny 11
match as-path 99
exit
!
route-map A01 deny 20
match rpki invalid
exit
!
route-map A01 permit 25
match as-path 2
exit
!
route-map A01 permit 30
match rpki notfound
set local-preference 100
set metric 0
set as-path prepend last-as 1
exit
!
route-map A01 permit 50
match rpki valid
set local-preference 200
set metric 0
exit
!
route-map 05 deny 20
match rpki invalid
exit
!
route-map 05 permit 30
match rpki notfound
set metric 0
exit
!
route-map 05 permit 50
match rpki valid
set metric 0
exit
!
route-map A02 deny 11
match as-path 99
exit
!
route-map A02 deny 20
match rpki invalid
exit
!
route-map A02 permit 25
match as-path 2
exit
!
route-map A02 permit 30
match rpki notfound
set local-preference 100
set metric 0
set as-path prepend last-as 5
exit
!
route-map A02 permit 50
match rpki valid
set local-preference 100
set metric 0
set as-path prepend last-as 3
exit
!
route-map 11 permit 30
set as-path prepend 29753
exit
!
route-map 13 permit 30
set as-path prepend 29753 29753 29753
exit
!
!
!
!
rpki
rpki polling_period 900
rpki cache 134.196.1.55 3323 preference 1
rpki cache 2602:fed1:ca1::face 3323 preference 2
exit
!

@ton31337
Member

Is this the whole log? "Terminating on signal" suggests it was killed with SIGTERM or SIGINT.

@liuxyon
Author

liuxyon commented Mar 22, 2022

Yes, that is the whole log.
Also, when I enter a vtysh command, FRR produces no output.
[Screenshot: IMG_20220322_184530]

I have tested on 4 servers and it happens on all of them. All 4 servers run Ubuntu 21.10.

@liuxyon
Author

liuxyon commented Mar 22, 2022

2022/03/22 20:50:01 ZEBRA: [SWQK6-6JY63][EC 4043309074] 0:254:2602:fed3:7021::/48: Failed to enqueue dataplane install
2022/03/22 20:50:01 ZEBRA: [SWQK6-6JY63][EC 4043309074] 0:254:2a06:e882:119::/48: Failed to enqueue dataplane install
2022/03/22 20:50:01 ZEBRA: [SWQK6-6JY63][EC 4043309074] 0:254:2a0d:2405:511::/48: Failed to enqueue dataplane install
2022/03/22 20:50:01 ZEBRA: [SWQK6-6JY63][EC 4043309074] 0:254:2a10:2f02:100::/48: Failed to enqueue dataplane install
2022/03/22 20:50:03 STATIC: [MRN6F-AYZC4] Terminating on signal
2022/03/22 20:50:04 ZEBRA: [VXKFG-8SJRV][EC 4043309121] Client 'static' encountered an error and is shutting down.
2022/03/22 20:50:05 ZEBRA: [XVBTQ-5QTVQ] Terminating on signal
2022/03/22 20:50:06 ZEBRA: [YDZ55-W3VM6] release_daemon_table_chunks: Released 0 table chunks
2022/03/22 20:50:06 ZEBRA: [JPSA8-5KYEA] client 16 disconnected 141674 bgp routes removed from the rib
2022/03/22 20:50:06 ZEBRA: [S929C-NZR3N] client 16 disconnected 0 bgp nhgs removed from the rib
2022/03/22 20:50:06 ZEBRA: [YDZ55-W3VM6] release_daemon_table_chunks: Released 0 table chunks
2022/03/22 20:50:06 ZEBRA: [JPSA8-5KYEA] client 31 disconnected 0 vnc routes removed from the rib
2022/03/22 20:50:06 ZEBRA: [S929C-NZR3N] client 31 disconnected 0 vnc nhgs removed from the rib
2022/03/22 20:50:06 ZEBRA: [YDZ55-W3VM6] release_daemon_table_chunks: Released 0 table chunks
2022/03/22 20:50:06 ZEBRA: [JPSA8-5KYEA] client 38 disconnected 0 static routes removed from the rib
2022/03/22 20:50:06 ZEBRA: [S929C-NZR3N] client 38 disconnected 0 static nhgs removed from the rib
2022/03/22 20:50:09 ZEBRA: [YAF85-253AP][EC 100663299] buffer_flush_available: write error on fd 43: Broken pipe
2022/03/22 20:50:09 ZEBRA: [THHDB-YPEY6][EC 100663299] vtysh_flush: write error to fd 43, closing
2022/03/22 20:50:09 ZEBRA: [QS0NJ-H5QKJ] Zebra final shutdown
2022/03/22 20:54:22 ZEBRA: [V98V0-MTWPF] client 17 says hello and bids fair to announce only bgp routes vrf=0
2022/03/22 20:54:22 ZEBRA: [V98V0-MTWPF] client 32 says hello and bids fair to announce only vnc routes vrf=0
2022/03/22 20:54:22 ZEBRA: [V98V0-MTWPF] client 39 says hello and bids fair to announce only static routes vrf=0
2022/03/22 20:54:22 BGP: [GNAYN-F5F1G] Computing addpath IDs for addpath type All
2022/03/22 20:54:22 BGP: [MNE5N-K0G4Z] Resetting peer 2602:fede:ca1:b::11 due to change in addpath config
2022/03/22 20:54:25 BGP: [M59KS-A3ZXZ] bgp_update_receive: rcvd End-of-RIB for IPv6 Unicast from 2a0f:85c2:22:a:1:: in vrf default
2022/03/22 20:54:54 BGP: [MNE5N-K0G4Z] Resetting peer (null) due to change in addpath config
2022/03/22 20:54:56 BGP: [M59KS-A3ZXZ] bgp_update_receive: rcvd End-of-RIB for IPv6 Unicast from 2602:fede:ca1:b::11 in vrf default

@qlyoung
Member

qlyoung commented Apr 1, 2022

It's too hard to figure out what you are trying to show when you dump information this way. I've asked you to use the template repeatedly and you never do it. You need to use the template in order for others to make sense of the issues you're reporting.

@qlyoung qlyoung closed this as completed Apr 1, 2022
@liuxyon
Author

liuxyon commented Apr 3, 2022

The same problem has been reported on the mailing list:

Today's Topics:

  1. BGPD hanging in FRR 8.2.2 (Philip Smith)
    Message: 1
    Date: Sat, 2 Apr 2022 20:47:42 +0100
    From: Philip Smith philip@nsrc.org
    To: frog@lists.frrouting.org
    Subject: [FROG] BGPD hanging in FRR 8.2.2
    Message-ID: 54869a9a-07db-2033-cc16-c0b8a6612060@nsrc.org
    Content-Type: text/plain; charset=UTF-8; format=flowed

Hi everyone,

Just following up on my previous note about BGPD hanging in FRR 8.2.2. I
now have more info to share.

As background, I've got around 60 BGP feeds total in 30 different
"views", to form a route collector for analysis work I'm doing of the
global R&E routing table.

This hang seems to have a period of 5-7 days. Using FRR 8.2.2 on Ubuntu
20.04. Not had any issue with FRR 8.1.0; this only started with FRR 8.2.2.

The latest hang earlier today allowed a colleague to grab debug info
which I hope will help.

/var/log/frr/frr.log shows entries like this:

Apr 2 11:46:42 frr watchfrr[52904]: [T58XM-TP956][EC 268435457] bgpd
state -> unresponsive : no response yet to ping sent 90 seconds ago
Apr 2 11:46:42 frr watchfrr[52904]: [YFT0P-5Q5YX] Forked background
command [pid 1674696]: /usr/lib/frr/watchfrr.sh restart bgpd
Apr 2 11:47:02 frr watchfrr[52904]: [ZE9RA-19PS5] restart bgpd child
process 1674696 still running after 20 seconds, sending signal 15
Apr 2 11:47:02 frr watchfrr[52904]: [SK7QP-A2GT9] restart bgpd process
1674696 terminated due to signal 15

Apr 2 14:18:03 frr watchfrr[52904]: [YFT0P-5Q5YX] Forked background
command [pid 1697956]: /usr/lib/frr/watchfrr.sh restart bgpd
Apr 2 14:18:23 frr watchfrr[52904]: [ZE9RA-19PS5] restart bgpd child
process 1697956 still running after 20 seconds, sending signal 15
Apr 2 14:18:23 frr watchfrr[52904]: [SK7QP-A2GT9] restart bgpd process
1697956 terminated due to signal 15

which just repeat every 10 minutes or so.

A few hours earlier I was getting:

Apr 1 22:53:19 frr bgpd[52925]: [YZRX4-ZXG0C][EC 100663315] Thread
Starvation: {(thread *)0x5566a35c01a0 arg=0x556682b31da0 timer r=-5.940
bgp_announce_route_timer_expired() &paf->t_announce_route from
bgpd/bgp_route.c:4763} was scheduled to pop greater than 4s ago

Apr 1 23:24:34 frr bgpd[52925]: [YZRX4-ZXG0C][EC 100663315] Thread
Starvation: {(thread *)0x5567954b16c0 arg=0x556682f14870 timer r=-5.224
bgp_announce_route_timer_expired() &paf->t_announce_route from
bgpd/bgp_route.c:4763} was scheduled to pop greater than 4s ago

Trying to connect by vtysh prints message of day, but never a command
prompt. Same if trying to connect via telnet.

The only way out is a kill -9 of the BGPD process, followed by a
"systemctl restart frr".

The process stack for bgpd shows:

root@frr:~# cat /proc/52925/stack
[<0>] futex_wait_queue_me+0xbb/0x120
[<0>] futex_wait+0x105/0x290
[<0>] do_futex+0x157/0x4d0
[<0>] __x64_sys_futex+0x13f/0x170
[<0>] do_syscall_64+0x57/0x190
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

Thread debugging shows:

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
__pthread_clockjoin_ex (threadid=139670697043712, thread_return=0x0,
clockid=<optimized out>, abstime=<optimized out>,
block=<optimized out>) at pthread_join_common.c:145
145 pthread_join_common.c: No such file or directory.
(gdb) bt
#0 __pthread_clockjoin_ex (threadid=139670697043712, thread_return=0x0,
clockid=<optimized out>, abstime=<optimized out>,
block=<optimized out>) at pthread_join_common.c:145
#1 0x00007f07b1f3d985 in ?? () from /lib/x86_64-linux-gnu/librtr.so.0
#2 0x00007f07b1f38dc1 in rtr_mgr_stop () from
/lib/x86_64-linux-gnu/librtr.so.0
#3 0x00007f07b1f53ef0 in ?? () from
/usr/lib/x86_64-linux-gnu/frr/modules/bgpd_rpki.so
#4 0x00007f07b1f53f7d in ?? () from
/usr/lib/x86_64-linux-gnu/frr/modules/bgpd_rpki.so
#5 0x00007f07b1f543ca in ?? () from
/usr/lib/x86_64-linux-gnu/frr/modules/bgpd_rpki.so
#6 0x00007f07b2586621 in thread_call () from
/usr/lib/x86_64-linux-gnu/frr/libfrr.so.0
#7 0x00007f07b2540198 in frr_run () from
/usr/lib/x86_64-linux-gnu/frr/libfrr.so.0
#8 0x00005566800b6678 in main ()

I've got about 2.5Mbytes of strace which I'll happily unicast to whoever
would like to have a look at it. It looks very repetitive/boring to my
non-developer eye, like something's got stuck waiting for something else.

BTW, this is what's running (after I killed and restarted), including
command line options:

1707406 ? S<s 0:02 /usr/lib/frr/watchfrr -d -F traditional
zebra bgpd staticd
1707423 ? S<sl 0:01 /usr/lib/frr/zebra -d -F traditional -A
127.0.0.1 -s 90000000
1707428 ? S<sl 17:03 /usr/lib/frr/bgpd -d -F traditional -Z -M rpki
1707435 ? S<s 0:00 /usr/lib/frr/staticd -d -F traditional -A
127.0.0.1

Any ideas? I'd hate to revert to 8.1 but...

philip
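
For anyone who wants to scan a log for the two symptoms quoted in the message above (watchfrr marking a daemon unresponsive, and "Thread Starvation" warnings), here is a minimal Python sketch. The regular expressions are assumptions derived only from the sample lines in this thread, not from FRR's logging code, and the helper name `summarize` is hypothetical. Whitespace is collapsed first because the mail client wrapped each record across several lines.

```python
import re
from collections import Counter

# One or more bracketed tags such as "[T58XM-TP956][EC 268435457]".
TAGS = r"(?:\[[^\]]+\])+"

# Patterns inferred from the sample log lines quoted above (an assumption).
UNRESPONSIVE = re.compile(
    r"watchfrr\[\d+\]: " + TAGS + r" (\w+) state -> unresponsive"
)
STARVATION = re.compile(
    r"(\w+)\[\d+\]: " + TAGS + r" Thread Starvation: .*?r=(-?\d+\.\d+) (\w+)\(\)"
)


def summarize(log_text):
    """Return (unresponsive counts per daemon, list of starvation events)."""
    # Collapse all whitespace so records wrapped across lines match again.
    flat = " ".join(log_text.split())
    unresponsive = Counter(UNRESPONSIVE.findall(flat))
    starved = [
        (daemon, func, float(remaining))
        for daemon, remaining, func in STARVATION.findall(flat)
    ]
    return unresponsive, starved
```

Each starvation entry is a tuple of daemon name, the timer callback that was late, and the `r=` value from the log line.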

@ton31337 ton31337 added the bgp label Apr 3, 2022
@ton31337 ton31337 self-assigned this Apr 3, 2022
@ton31337 ton31337 reopened this Apr 3, 2022
@ton31337
Member

ton31337 commented Apr 3, 2022

I'll try to replicate and work on this.

@ton31337 ton31337 changed the title from "frr v8.2.2 system stuck" to "8.2.2 stuck with high number of peers/routes with RPKI" Apr 3, 2022
@ton31337
Member

ton31337 commented Apr 3, 2022

@liuxyon could you enable `debug rpki`? It would also be useful to have the output of `show memory | include RPKI` (captured as late as possible before it stops responding), plus `ps aufx | grep bgpd` and `free -m`.
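
Since the hang only shows up after days, one way to have this data "as late as possible" is to record it on a schedule. The Python sketch below is only an illustration of that idea: running the command non-interactively through `vtysh -c`, the 10-second timeout, and the names `COMMANDS`, `run_command` and `snapshot` are all assumptions, and the substring filters merely stand in for `| include RPKI` and `| grep bgpd`. The timeout matters because a hung daemon may never answer vtysh.

```python
import datetime
import subprocess

# (label, argv, substring filter). The filters stand in for the
# `| include RPKI` and `| grep bgpd` parts of the requested commands.
COMMANDS = [
    ("show memory | include RPKI", ["vtysh", "-c", "show memory"], "RPKI"),
    ("ps aufx | grep bgpd", ["ps", "aufx"], "bgpd"),
    ("free -m", ["free", "-m"], None),
]


def run_command(argv, timeout=10):
    """Return stdout of argv, or a one-line marker if it fails or hangs."""
    try:
        done = subprocess.run(
            argv, capture_output=True, text=True, timeout=timeout
        )
        return done.stdout
    except (OSError, subprocess.TimeoutExpired) as exc:
        return f"<failed: {exc}>"


def snapshot(runner=run_command, now=None):
    """Build one timestamped text snapshot covering all commands."""
    now = now or datetime.datetime.now()
    parts = [f"==== {now.isoformat(timespec='seconds')} ===="]
    for label, argv, needle in COMMANDS:
        parts.append(f"******** ({label}) ********")
        for line in runner(argv).splitlines():
            # Keep failure markers even when a filter is active.
            if needle is None or needle in line or line.startswith("<failed"):
                parts.append(line)
    return "\n".join(parts) + "\n"
```

Appending the return value of `snapshot()` to a file from an hourly job would leave a trail leading up to the next hang. The `runner` parameter exists only so the function can be exercised without a running FRR.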

@liuxyon
Author

liuxyon commented Apr 4, 2022

> and work on this

Since version 8.2.2 is unusable for us, we have all gone back to version 8.1.

@ton31337
Member

ton31337 commented Apr 4, 2022

I can't replicate this with 100k routes and two full RPKI validators (cache servers), but I just found a memory leak (which might be the cause; I don't know, which is why I asked for more details).

@pfsinoz

pfsinoz commented Apr 4, 2022

@ton31337 I'm still staying with 8.2.2 and happy to help troubleshoot this. Will get you debug rpki etc when it next happens. For me, it's every 5 days this happens (sorry, we'll have to wait). Got 60 peers, probably 5 of them giving me full tables in v4 and v6, the rest just global R&E routes (which is about 20k IPv4 and 6k IPv6). Let me know if anything else needed.

@ton31337
Member

ton31337 commented Apr 4, 2022

@pfsinoz cool, let me know when you have more details (as I described in a previous comment).

@pfsinoz

pfsinoz commented Apr 4, 2022

@ton31337 is the stack trace I have from the last hang of any use at all?

@ton31337
Member

ton31337 commented Apr 4, 2022

> @ton31337 is the stack trace I have from the last hang of any use at all?

At least it makes it quite clear that this is RPKI-related...

@pfsinoz

pfsinoz commented Apr 5, 2022

BTW, just for the record, this is what things look like with "situation normal":

******** (sh memory | include RPKI) *******
BGP RPKI Cache server         :  1222452 variable  49351728  2444538 109382928
BGP RPKI Cache server group   :        0    120           0        1       120
******** (free -m) ******
              total        used        free      shared  buff/cache   available
Mem:           9961        7030         308           1        2621        2613
Swap:             0           0           0
******** (ps aufx | grep bgpd) ******
root     1707406  0.0  0.0   8328  3036 ?        S<s  Apr02   0:36 /usr/lib/frr/watchfrr -d -F traditional zebra bgpd staticd
frr      1707428  6.2 61.5 6587548 6278184 ?     S<sl Apr02 259:33 /usr/lib/frr/bgpd -d -F traditional -Z -M rpki

Now we just have to wait for the next hang - probably 3-4 days time.

@ton31337
Member

@pfsinoz maybe you have more details about this?

@pfsinoz

pfsinoz commented Apr 19, 2022

@ton31337 Frustratingly it has not hung since! I'm still waiting, still gathering the data every hour. I've had to restart the system once for another reason, but still no hang since. This is the latest snapshot, from about 40 minutes ago:

******** (sh memory | include RPKI) *******
BGP RPKI Cache server         :  1231005 variable  49697848  2476764 116684816
BGP RPKI Cache server group   :        0    120           0        1       120
******** (free -m) ******
              total        used        free      shared  buff/cache   available
Mem:           9960        7102         322           1        2535        2541
Swap:             0           0           0
******** (ps aufx | grep bgpd) ******
root         771  0.0  0.0   8332  3060 ?        S<s  Apr13   1:14 /usr/lib/frr/watchfrr -d -F traditional zebra bgpd staticd
frr          812  6.3 60.3 6511740 6154024 ?     S<sl Apr13 579:04 /usr/lib/frr/bgpd -d -F traditional -Z -M rpki

I've had a couple of instances where `sh ip bgp` on a full feed caused the clogin session driven by my scripts to time out, but it is not repeatable.
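
To compare snapshots like the two posted in this thread without eyeballing them, a small parser can help. The Python sketch below assumes the five values after the colon are, in order, current allocation count, size, total bytes, maximum count and maximum bytes; that column meaning is an assumption based on the usual `show memory` header and should be checked against your own output. The names `MemRow`, `parse_row` and `growth` are hypothetical.

```python
from typing import NamedTuple


class MemRow(NamedTuple):
    name: str
    current: int    # assumed: current number of allocations
    size: str       # "variable" or a fixed size, kept as text
    total: int      # assumed: current total bytes
    max_count: int  # assumed: peak number of allocations
    max_bytes: int  # assumed: peak total bytes


def parse_row(line):
    """Parse one 'Name : a b c d e' row of the memory listing."""
    name, _, rest = line.partition(":")
    current, size, total, max_count, max_bytes = rest.split()
    return MemRow(
        name.strip(), int(current), size,
        int(total), int(max_count), int(max_bytes),
    )


def growth(earlier, later):
    """Return (change in allocation count, change in total bytes)."""
    return later.current - earlier.current, later.total - earlier.total
```

Feeding it the "BGP RPKI Cache server" rows from the Apr 5 and Apr 19 snapshots gives the change between the two, which is easier to track over many hourly samples than the raw rows.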

@ton31337 ton31337 added regression and removed triage Needs further investigation labels May 3, 2022
@pfsinoz

pfsinoz commented May 4, 2022

@ton31337 just a quick update... FRR has been up and running for the last 12 days now and has not exhibited the hang issue. The full BGP feeds do pause for about 20-30 seconds when I do a "sh ip bgp" on them, but I can replicate that on other FRR versions too. I'm left wondering if there were any validator issues that perhaps led to "funny" VRPs being sent to FRR, but I can't even think what those might be. Just weird that the issue has seemingly gone away all by itself. I'm happy to test new/updated code if need be.

@ton31337
Member

ton31337 commented May 4, 2022

@pfsinoz thank you for the update. We are going to revert the latest connection-handling changes (workarounds), since those problems are fixed in librtr itself (0.8.0). See #11138.

You just have to make sure you have librtr version 0.8.0.
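
For scripted checks of that requirement, a version comparison along these lines may be enough. This Python sketch only compares version strings; how you obtain the installed librtr version (for example from your package manager) is left open, and the handling of Debian-style epochs and revision suffixes is an assumption. The function names are hypothetical.

```python
import re


def parse_version(text):
    """Turn '0.8.0', '0.8.0-1' or '1:0.8.0-1' into a tuple of ints."""
    # Drop an optional Debian-style epoch ("1:") before the version proper.
    upstream = text.strip().split(":", 1)[-1]
    match = re.match(r"\d+(?:\.\d+)*", upstream)
    if match is None:
        raise ValueError(f"no version number in {text!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def at_least(installed, minimum="0.8.0"):
    """True if the installed version is the minimum or newer."""
    have, want = parse_version(installed), parse_version(minimum)
    # Pad the shorter tuple with zeros so "0.8" compares equal to "0.8.0".
    width = max(len(have), len(want))
    have += (0,) * (width - len(have))
    want += (0,) * (width - len(want))
    return have >= want
```

Comparing integer tuples instead of raw strings avoids the classic mistake of treating "0.10.0" as older than "0.8.0".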

@dylanjamesdev

I'm currently facing this exact issue: FRR keeps crashing and does not recover.
