BGP sessions stuck in Idle state for a long time/parts of config missing depending on config size. #12049
Comments
Do you have
Yeah, I already checked that, and there is no “end” at the bottom of the config. But it does seem kind of similar to that issue.
Will check this. @ShaabamRouter is this the full config (in the description)?
I was doing some more testing today. I noticed that after I start frr and it's in the broken state, if I re-read the config in with:
Also, I haven't been able to replicate the issue when using separate config files for each daemon. It seems like it might only happen when using the integrated config (service integrated-vtysh-config).
I think I found the issue: my config was too long (or my CPU too slow) for the default watchfrr restart-timeout of 20 seconds. When watchfrr was starting all the services and loading all the configs, it must have been taking just over 20 seconds. Adding watchfrr_options="-T 30" to the daemons file resolves the issue. Log output from when it was broken:
After increasing the timeout:
It might be a good idea to add a note about that setting to the default/sample daemons file.
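As a rough illustration of the workaround described above, the relevant line in the daemons file might look like the following. This is only a sketch: the file path in the comment is the usual packaged location and is an assumption, and 30 seconds is simply the value that worked in this report, not a recommended setting.

```shell
# /etc/frr/daemons (excerpt; the path is assumed, adjust for your install)
# Raise the watchfrr restart timeout from the default of 20 seconds to 30,
# so a slow host or a large integrated config has time to finish loading.
watchfrr_options="-T 30"
```

If the file already sets watchfrr_options, the -T value would presumably need to be merged into the existing line rather than added as a second assignment, since a later assignment replaces the earlier one.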
@ShaabamRouter what about #12055?
@ShaabamRouter what are the resources of the router you use? I would like to replicate this, but I can't (for me, loading the given config is a fast operation: 2 seconds).
I'm running this machine in GNS3 using QEMU. The virtual machine is allocated 2 vCores and 2048MB of RAM. The underlying host has dual X5650 processors and 48GB of RAM.
@ShaabamRouter could you test this patch 2ab760f? And check how it works with the default (-T, --restart-timeout) settings and your config?
@ton31337 I've tested the patch and everything is working!!! Thanks!! Is this going to get pushed into the next production release?
To Reproduce
Once frr.conf gets large enough, parts of the config go missing when FRR starts, and BGP sessions stay stuck in the Idle state for about 10 minutes before coming up.
Try loading this config from frr.conf
After loading the config, BGP is stuck in the Idle state and the last route-map entry is missing.
As you can see above, the output of show run shows that the entry "route-map some-important-map permit 20" has vanished from the config, and the peers are stuck idle.
If I remove these entries from the frr.conf to "make room" and restart frr:
It works fine:
And when I show run I get the last route-map:
Expected behavior
On startup FRR should be able to load the full config and peers should not get stuck in an idle state.
The frr.conf config I posted is just a simple example that replicates the same issue we're having with our actual production config. Our real config probably only has 250 or so prefix list entries and a lot of route-map entries. From what I've seen, I could have caused the same issue by just adding a lot of route maps instead of prefix lists. I just used prefix lists in this example since it's easier to see.
I've also tried to nail down the exact point where the issue happens by adding and removing prefix list entries one by one. I noticed that the BGP sessions hang idle first, and you have to add a few more prefix list entries before the route maps start to disappear.
It also doesn't seem to be 100% consistent: when you're close to the number of entries at which it works or breaks, the same number of prefix list entries doesn't always produce the same result.
Which makes me think maybe this isn't so much about the length of the config itself as about how long it takes for the process to read it in. Maybe it's timeout related?
Or could I just be exhausting some kind of memory limit?
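For anyone trying to reproduce the threshold search described above, a small helper along these lines could generate an arbitrary number of prefix list entries to paste into a test copy of frr.conf. It is a minimal sketch: the function name gen_prefix_lists, the list name TEST-LIST, and the 10.x.y.0/24 prefixes are hypothetical placeholders and are not taken from the config in this report.

```shell
#!/bin/sh
# Print COUNT prefix-list entries in FRR syntax on stdout.
# TEST-LIST and the 10.x.y.0/24 prefixes are made-up placeholders.
gen_prefix_lists() {
  count="$1"
  i=1
  while [ "$i" -le "$count" ]; do
    # seq advances in steps of 5; the 2nd and 3rd octets are derived from i
    printf 'ip prefix-list TEST-LIST seq %d permit 10.%d.%d.0/24\n' \
      "$((i * 5))" "$((i / 256 % 256))" "$((i % 256))"
    i=$((i + 1))
  done
}
```

For example, `gen_prefix_lists 300 > prefix-lists.txt` writes 300 entries to a file; varying the count makes it easier to see roughly where sessions start hanging and where route-map entries start to disappear on a given machine.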
Versions