FRRouting 9.1-0~ubuntu22.04.1 - bgpd segfault #15543
Comments
For what it's worth, here are some more segfaults:
rt2:
Nothing worth noting happened during these timestamps; no traffic was being pushed or anything else. The systems are pretty much idling right now |
You are going to need to add debug symbols and give us the decode. As it stands we don't know where/how FRR was compiled and as such we cannot do anything with the segfault data as given. You'll need to provide this to us |
@donaldsharp this FRR is from the official FRR-Repos (https://deb.frrouting.org/frr jammy frr-stable) but yeah just instruct me with what debug flags you need and I'll toss them in there. As it stands the segfault happens every few minutes |
Install the debug symbols from there, then. When the next crash happens, give us the decode |
@donaldsharp Got a helper to decode the segfault? Or do you want me to load FRR into valgrind? Or what? I'm a bit lost with the terminology here |
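For readers following along: a kernel segfault line reports the faulting instruction pointer (ip) and the mapping it fell in as name[base+size], so ip minus base gives an offset into that mapping. The thread does not spell out a decode method, so treat the following as one possible first step rather than what the maintainers asked for. It is a minimal bash sketch; segv_offset is a hypothetical helper, and the sample line is made up for illustration, not taken from this issue.

```shell
#!/usr/bin/env bash
# Sketch: extract "ip" and the mapping base from a kernel segfault line
# and print the object name with the offset (ip - base).

# segv_offset LINE -> prints "<object> 0x<offset>", fails if LINE does not match
segv_offset() {
    local line=$1
    # ip <hex> ... in <object>[<base>+<size>]
    local re='ip ([0-9a-f]+) .* in ([^[]+)\[([0-9a-f]+)\+[0-9a-f]+\]'
    if [[ $line =~ $re ]]; then
        local ip=${BASH_REMATCH[1]}
        local obj=${BASH_REMATCH[2]}
        local base=${BASH_REMATCH[3]}
        # 16#... tells bash arithmetic to read the value as hexadecimal
        printf '%s 0x%x\n' "$obj" $(( 16#$ip - 16#$base ))
    else
        return 1
    fi
}

# Made-up sample line (NOT from this issue):
sample='bgpd[1234]: segfault at 10 ip 00007f0000012345 sp 00007ffc00000000 error 4 in bgpd[7f0000000000+200000]'
segv_offset "$sample"
```

With debug symbols installed, the offset can then be tried with something like addr2line -f -C -e /usr/lib/frr/bgpd OFFSET. The binary path is an assumption about the package layout, and depending on how the binary's segments are mapped the offset may need adjusting before it resolves to the right function.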
I guess Donald means this: apt install frr-dbgsym (make sure you install frr-dbgsym from the same repository you installed the frr package from). Then restart frr, wait for the next crash, and post the logs, the config, and whatever else you can see. That way the logs may show where and why the crash happens.
Running bgpd under valgrind can also be useful; this writes the log to ~/bgpd-valgrind.log:
valgrind -s --leak-check=full --trace-children=yes --log-file=~/bgpd-valgrind.log bgpd -d -f /etc/frr/bgpd.conf -F traditional -u frr -g frr -A 127.0.0.1
However, if your bgpd is heavily loaded (and it is, since you receive/announce a couple of full internet BGP tables, right?), this will be hard, because the bgpd process will be extremely slow and will consume a lot of CPU when started under valgrind. |
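The advice above hinges on frr and frr-dbgsym coming from the same repository. One way to sanity-check that after installing is to compare the installed version strings, on the assumption (mine, not stated in the thread) that a matching dbgsym package carries the same version as the main package. A minimal sketch; same_version, installed_version and check_dbgsym are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Sketch: verify that frr and frr-dbgsym are installed at the same version.

# same_version A B -> succeeds only if both strings are non-empty and equal
same_version() {
    [ -n "$1" ] && [ "$1" = "$2" ]
}

# installed_version PKG -> prints the installed version, or nothing if unknown
installed_version() {
    dpkg-query -W -f='${Version}' "$1" 2>/dev/null || true
}

# check_dbgsym -> reports whether the two packages match
check_dbgsym() {
    local frr_v dbg_v
    frr_v=$(installed_version frr)
    dbg_v=$(installed_version frr-dbgsym)
    if same_version "$frr_v" "$dbg_v"; then
        echo "frr and frr-dbgsym match: $frr_v"
    else
        echo "mismatch: frr='$frr_v' frr-dbgsym='$dbg_v'" >&2
        return 1
    fi
}
```

Calling check_dbgsym on the router prints either the common version or both strings, which is handy to paste into the issue together with the crash output.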
@IvayloJ how can I run that without watchfrr messing with it? |
@IvayloJ I tried disabling bgpd in daemons and then run the cmd you gave me but it immediately returned:
I don't think it even connected to any peers |
@IvayloJ I added This made bgpd churn at 100% CPU and be absolutely unusable. But it also didn't segfault after ~2 hrs at a rock-solid 100% CPU. I removed that line and it immediately segfaulted on startup. Sounds like a race condition to me |
Managed to capture some perf samples if that helps |
As I wrote, it will be hard, because your bgpd is heavily loaded with a couple of full internet BGP tables. Regardless, if you want to try valgrind, you first have to stop all frr processes:
--- log in as root (sudo su -), or run all commands with sudo
--- systemctl stop frr (or /etc/init.d/frr stop)
--- wait until all frr processes are gone (ps -ax | grep frr)
--- first start zebra: zebra -d -f /etc/frr/zebra.conf -F traditional -u frr -g frr -A 127.0.0.1 -s 90000000
--- then start bgpd under valgrind: valgrind -s --leak-check=full --trace-children=yes --log-file=~/bgpd-valgrind.log bgpd -d -f /etc/frr/bgpd.conf -F traditional -u frr -g frr -A 127.0.0.1
But as I said, it will be hard and most likely won't work, because your bgpd is heavily loaded. I can 100% confirm that frr 8.1 to 9.1 runs without unusual crashes on Debian 8/9/10/11/12 (very close to your Ubuntu) as well as on my custom builds for Slackware 13 to 15; some of those installs handle 20+ peers and up to 5 full internet BGP IPv4+IPv6 tables, plus RPKI checks and nearly 100k additional prefixes. |
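The manual sequence above can be sketched as a small bash script. This is an illustration under assumptions, not a tested procedure for this setup: run, wait_until_gone, frr_running and valgrind_session are hypothetical helper names, the daemon list in frr_running is a guess at what "all frr processes" covers, and the log path uses $HOME rather than relying on tilde expansion inside an option argument. By default it only prints the commands (DRY_RUN=1).

```shell
#!/usr/bin/env bash
# Sketch: stop FRR, wait for its daemons to exit, then start zebra and
# run bgpd under valgrind. Commands are taken from the comment above.

# run CMD... -> print the command in dry-run mode, execute it otherwise
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# wait_until_gone TIMEOUT_SECONDS CHECK_CMD... -> poll once per second until
# CHECK_CMD fails (nothing running); give up with status 1 after TIMEOUT
wait_until_gone() {
    local timeout=$1
    shift
    local waited=0
    while "$@"; do
        [ "$waited" -ge "$timeout" ] && return 1
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}

# frr_running -> succeeds while any of the listed daemons is alive
# (daemon list is an assumption; adjust to what is enabled on the box)
frr_running() {
    local d
    for d in watchfrr zebra bgpd ospfd staticd; do
        pgrep -x "$d" >/dev/null && return 0
    done
    return 1
}

valgrind_session() {
    run systemctl stop frr
    if [ "${DRY_RUN:-1}" != "1" ]; then
        wait_until_gone 60 frr_running || { echo "FRR still running" >&2; return 1; }
    fi
    run zebra -d -f /etc/frr/zebra.conf -F traditional -u frr -g frr -A 127.0.0.1 -s 90000000
    run valgrind -s --leak-check=full --trace-children=yes \
        --log-file="$HOME/bgpd-valgrind.log" \
        bgpd -d -f /etc/frr/bgpd.conf -F traditional -u frr -g frr -A 127.0.0.1
}
```

Calling valgrind_session prints the three commands for review; running it as root with DRY_RUN=0 executes them. Starting the daemons by hand like this sidesteps the service manager, which is the point of stopping frr first.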
I got frr-dbgsym installed, but the segfault kernel message hasn't changed a bit. Sure, the config is as simple as it can be:
Stripped out the unused route-maps that are there for later use, and changed IPs and ASNs. The peers in peer-group TMP also run FRR, but 5.x, on my Gentoo boxes since the dawn of time; they push full tables to these newer routers, which are supposed to replace them. |
Valgrind just caught a sigsegv |
I found the 0x4000000018 stackdump via valgrind finally:
|
Downgraded to the 8.1 package from the Ubuntu 22.04 repos (not the FRRouting repos) and no more segfaults! So there's some regression in 9.1 |
Are you able to compile from master? It should (?) be fixed here. At least from what I see in the Valgrind trace, it is the related function |
@f0o Looking at your valgrind log and your config, maybe there really is a problem, probably in the redistribute. Are you redistributing the full internet BGP table over OSPF, or from OSPF into BGP? Is there some external script on that machine that may change the kernel route tables in the meantime? Is it only bgpd that crashes, or did you see zebra/ospfd crash too?
I guess this is your production setup and it is probably hard for you to keep debugging, but if you can compile frr from source (as @ton31337 is asking), it will be very useful for catching and fixing this. I have never used OSPF, nor redistribute for a large number of routes, so I don't have much experience there and have never had such crashes (I only ever work with BGP for large numbers of internet routes). So I don't have a test setup to simulate your case in a controlled environment.
@ton31337 It is probably related, but to me it looks like it comes from redistributing large tables without proper locking/checking of thread/process data memory, or something like that. An illegal 8-byte read looks like a pointer, and at bgp_zebra.c:2630 there is a call to zlog_err() with *dest as an argument... My first guess in the dark is that something (a thread/process) freed the prefix pointer (because it disappeared from the route table, for example) while another thread/process was trying to use that pointer at the same time. And because of the huge number of prefixes and their flapping on the internet, it shows up at random intervals and more often. |
Hi @IvayloJ I do run OVS, which has its own bug with full tables where it blocks execution; might that be related? I don't redistribute full tables but I do import them into VRFs. The odd part is that the downgrade to 8.1 really fixed it entirely, without any other modification to the config or system. So 8.1->9.1 introduced a regression somewhere. It is also only bgpd that crashes; zebra and ospfd are both happy |
Hi @f0o "I don't redistribute full tables but I do import them into VRFs." I am a bit confused now, because I don't see VRFs in your config, nor any other route tables you work with except main (254). Between 8.1 and 9.1 there are a lot of changes (commits). In a large program with so many functions (like frr), it is very common for a small change somewhere to have a big impact in a completely different place in the code, and it is never easy to catch. If 8.1 is good for your setup, you can keep going with it (for me it has also worked for years), but 9.1 has some important fixes and improvements. At some point you will have to upgrade, no choice (and you could hit this issue again, because nobody fixed it in the meantime).
If my guesses are right, maybe you can build a stress-test setup for your case with a couple of virtual machines and a little bash scripting... Make the config as close as possible to the setup where you hit the issue (connect the same number of virtual machines over BGP to the test machine). Make the test machine redistribute a route table. Write a little script/program that puts (let's say 100k) routes into a table in an endless loop (if a route does not exist in the table, install it). Write another script, again an endless loop, that removes a single random route from the same table on every iteration, with a sleep of a few milliseconds. Let it run for hours, and if there is no problem, play with the sleep times. This way you will have something very close to your scenario, I hope. |
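The two loops suggested above might look like the bash sketch below. Everything specific in it is a placeholder: the table id 100, the dummy0 interface, the 10.x.y.z/32 test prefixes, and the function names route_prefix, add_loop and del_loop are all hypothetical. It also uses ip route replace in place of "check, then install if missing", which is simpler but rewrites routes that already exist.

```shell
#!/usr/bin/env bash
# Sketch: churn routes in a scratch kernel table to stress whatever
# imports or redistributes that table. Needs root and iproute2 to run.

TABLE=${TABLE:-100}       # hypothetical scratch routing table id
DEV=${DEV:-dummy0}        # hypothetical interface the test routes point at
COUNT=${COUNT:-100000}    # number of test routes

# route_prefix INDEX -> deterministic /32 inside 10.0.0.0/8 for that index
route_prefix() {
    local i=$1
    echo "10.$(( i / 65536 % 256 )).$(( i / 256 % 256 )).$(( i % 256 ))/32"
}

# add_loop -> endlessly (re)install every test route
add_loop() {
    local i
    while true; do
        for (( i = 0; i < COUNT; i++ )); do
            ip route replace "$(route_prefix "$i")" dev "$DEV" table "$TABLE"
        done
    done
}

# del_loop [DELAY] -> endlessly delete one random test route, then sleep
del_loop() {
    local delay=${1:-0.005} i
    while true; do
        # RANDOM is only 15 bits wide, so combine two draws to cover COUNT
        i=$(( (RANDOM * 32768 + RANDOM) % COUNT ))
        ip route del "$(route_prefix "$i")" table "$TABLE" 2>/dev/null || true
        sleep "$delay"
    done
}
```

Nothing runs until the functions are called. On a test VM, one would create the interface first (for example ip link add dummy0 type dummy, then ip link set dummy0 up), start add_loop in the background, and run del_loop with different delays while watching bgpd.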
The OpenVSwitch issue was resolved with a patch recently that I'm running right now. OVS did iterate through all routes with every change, I'm not sure if that iteration caused any blocking effect on the kernel interface there which would make bgpd have issues adding/removing routes. Regarding the VRFs, the configs have changed since the downgrade to 8.1 which effectively fixed any and all segfaults so now VRFs are introduced and these routers are now in production. The segfault occurred without any VRFs or redistribute. Once I added a transit into the mix the bgpd was segfaulting every few minutes very reliably. |
@f0o I am already a little lost in this issue, but anyway my mind has never been in the "found" state :) "The segfault occurred without any VRFs or redistribute. Once I added a transit into the mix the bgpd was segfaulting every few minutes very reliably." What do you mean by "Once I added a transit into the mix"? Define this exactly. You can even write out the commands you ran, or whatever describes it more precisely. |
"Once I added a transit into the mix" as in: added another BGP peer that isn't IBGP and supplies me with full tables.
|
Which perf version did you use? |
perf version 5.15.143 |
Description
bgpd segfaults frequently seemingly out of nowhere:
Config is super slim, IBGP full mesh with 4 nodes sharing full-tables (~2.4M routes in kernel).
Hardware is 46G RAM and 56 threads (Xeon Gold 6132) running Ubuntu 22.04 LTS. None of it is pegged; the box is pretty idle.
Version
How to reproduce
Unclear, it's only handling 3 peers with full ipv4 tables in IBGP, no filtering or VRF or anything fancy done.
It does happen every so often seemingly without triggers
Expected behavior
Not segfault
Actual behavior
Segfault
Additional context
Happy to provide more logs, I saw some core_handler memstat print outs close to the segfault line
Checklist