Bgp nexthop group optimisations #15488
Conversation
There is a CPU issue in zebra when BGP installs and removes a large number of routes at the same time: the vtysh and the shell become unreachable. This is the case in BGP failover scenarios with two peers, when one of the peers becomes unreachable. For each route change, nexthop tracking is called to evaluate the impact of the new route (un)availability. Two observations can be made:
- When a specific route changes and a covering route exists (such as the default route present in this setup), nexthop tracking is called. There is no need to call nexthop tracking repeatedly for the same default prefix, knowing that the dplane_result thread handles batches of routes at the same time.
- The first picture from the link below indicates that nexthop tracking consumes time, and keeping this activity in the zebra main thread will still result in STARVATION messages.
Propose to separate the NHT notifications from the dplane_result thread by creating a queue list that stores the prefixes to evaluate against nexthop tracking. Before enqueuing a prefix, a check is done that the same prefix has not already been queued. The processing is done in a separate 'rib_process_nht_thread_loop' function call.
Link: FRRouting#16028
Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
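The dedup-before-enqueue idea can be sketched as follows. This is an illustrative Python model of the approach, not the actual zebra C code; the class and method names are made up for the sketch.

```python
from collections import deque

class NhtQueue:
    """Deduplicating work queue for nexthop-tracking (NHT) evaluations:
    a prefix is enqueued at most once until it has been processed."""

    def __init__(self):
        self._queue = deque()
        self._pending = set()  # prefixes already enqueued, to skip duplicates

    def enqueue(self, prefix):
        # Skip the enqueue if the same prefix is already waiting.
        if prefix in self._pending:
            return False
        self._pending.add(prefix)
        self._queue.append(prefix)
        return True

    def process_all(self, evaluate):
        """Drain the queue in a separate processing loop, calling the
        NHT evaluation once per distinct queued prefix."""
        count = 0
        while self._queue:
            prefix = self._queue.popleft()
            self._pending.discard(prefix)
            evaluate(prefix)
            count += 1
        return count
```

With this shape, a burst of route changes covered by the same default prefix triggers a single NHT evaluation instead of one per route.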
Add a vty command for debug log messages related to the BGP nexthop group implementation.
Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
There is no way to control how the BGP routes should be sent to zebra. Add a general config knob to determine the nexthop behaviour of BGP.
Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
By default, all BGP updates will attempt to use NHG identifiers. Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
The bgp_debug_zebra_nh() function only uses zlog traces to dump a nexthop. Nexthop information should also be available in buffer format. Add a second function, bgp_debug_zebra_nh_buffer(), that takes a buffer as a parameter and returns the nexthop information in it.
Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
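The refactoring pattern (format into a buffer, have the logging variant reuse it) can be illustrated with a small Python sketch; the function names and the dict-based nexthop shape here are invented for the example, not the bgpd structures.

```python
def nh_to_buffer(nexthop: dict) -> str:
    """Format a nexthop into a string buffer instead of emitting it
    straight to the log (the bgp_debug_zebra_nh_buffer() idea)."""
    parts = [nexthop["gate"]]
    if nexthop.get("vrf"):
        parts.append(f"(vrf {nexthop['vrf']})")
    if nexthop.get("label"):
        parts.append(f"label {nexthop['label']}")
    return " ".join(parts)

def nh_debug_log(nexthop: dict, log=print):
    # The zlog-style wrapper now reuses the buffer variant,
    # so both code paths share one formatting routine.
    log(nh_to_buffer(nexthop))
```

Callers that only need the text (for example a show command) call the buffer variant directly; the debug path keeps its old behaviour.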
On a transit BGP device that receives a full routing table from one ISP, if a failover happens, BGP needs to remove the installed routes, which may take time. It takes time because the ZAPI interface between BGP and zebra, which is in charge of transmitting route add/update/removal messages, has to process a large number of route messages during the transition time. To reduce the number of messages at the ZAPI level, a series of patches is proposed:
- Split the route information in two, separating the prefix information from the nexthop information. Two different messages will be issued.
- Avoid sending route updates when only the identified nexthop has changed.
This commit relies on the nexthop group library extensions:
- A nexthop group identifier is created at the BGP level, reusing the RECURSION, IBGP and SRTE flags defined in nexthop_group.h.
- The nexthop group NHG_ADD message is sent before ROUTE_ADD is sent with the nexthop group identifier.
- The NHG_DELETE message is sent when the last ROUTE_DEL message is sent.
The changes:
- only apply to BGP paths which have a single nexthop;
- do not apply to EVPN routes, directly connected routes, blackhole routes, or IPv4-mapped IPv6 nexthops;
- do not apply to routes resolved over a blackhole or the default route;
- do not apply to non-host routes that resolve over themselves.
This commit does not yet avoid sending route updates.
Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
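The message ordering described above (NHG_ADD before the first ROUTE_ADD that references the group, NHG_DELETE after the last ROUTE_DEL) can be modelled as a small reference-counted sketch. The message names follow the commit text; the bookkeeping class is purely illustrative.

```python
class NhgZapiSketch:
    """Toy model of the ZAPI ordering: a nexthop group is announced
    before its first user and withdrawn after its last user."""

    def __init__(self):
        self.refcount = {}   # nhg_id -> number of routes using it
        self.messages = []   # ZAPI messages in emission order

    def route_add(self, prefix, nhg_id):
        if self.refcount.get(nhg_id, 0) == 0:
            # First route to use this group: announce the group first.
            self.messages.append(("NHG_ADD", nhg_id))
        self.refcount[nhg_id] = self.refcount.get(nhg_id, 0) + 1
        self.messages.append(("ROUTE_ADD", prefix, nhg_id))

    def route_del(self, prefix, nhg_id):
        self.messages.append(("ROUTE_DEL", prefix, nhg_id))
        self.refcount[nhg_id] -= 1
        if self.refcount[nhg_id] == 0:
            # Last route using this group: withdraw the group.
            self.messages.append(("NHG_DELETE", nhg_id))
```

Note how two routes sharing one nexthop produce a single NHG_ADD/NHG_DELETE pair, which is where the ZAPI message reduction comes from.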
Nexthop group utilities will be used by BGP; until now, only the all_protocol_startup test was using them. Move some functions from all_protocol_startup to the 'tests/topotests/lib/nexthopgroup.py' file. route_get_nhg_id(), verify_nexthop_group(), and verify_route_nexthop_group() are modified to handle the router name parameter.
Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
Some basic path information is used by the 'show bgp nexthops detail' command. This information only deals with path information, and could be reused by other show commands to list bgp_paths. Move the specific code from bgp_nexthop.c to bgpd.c:
- bgp_show_bgp_path_info_flags() is renamed to bgp_path_info_show_flags()
- add a new function bgp_path_info_display() to display this information
The AS of the BGP path is displayed based on the BGP instance where the path resides.
Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
Add a vty function that displays the nexthop groups used by BGP.

> r1# show bgp nexthop-group detail
> ID: 75757571, #paths 1
> Flags: 0x0000
> State: 0x0001 (Installed)
> via 172.31.10.7 (vrf vrf1) inactive, weight 1
> Paths:
> 1/1 192.0.2.7/32 VRF vrf1 flags 0x418
> ID: 75757572, #paths 1
> Flags: 0x0000
> State: 0x0001 (Installed)
> via 172.31.11.8 (vrf vrf2) inactive, weight 1
> Paths:
> 1/1 192.0.2.7/32 VRF vrf2 flags 0x418
> ID: 75757575, #paths 1
> Flags: 0x0003 (allowRecursion, internalBgp)
> State: 0x0001 (Installed)
> via 192.0.2.6 (vrf default) inactive, weight 1
> Paths:
> 1/1 192.0.2.9/32 VRF default flags 0x418
> ID: 75757576, #paths 1
> Flags: 0x0003 (allowRecursion, internalBgp)
> State: 0x0001 (Installed)
> via 192.0.2.6 (vrf default) inactive, label 6000, weight 1
> Paths:
> 1/1 192.0.2.9/32 VRF vrf1 flags 0x4018

Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
This topology uses an IGP network with BGP between r1 and r5/r6 on the one hand, and two CEs at the edge on the other hand. The bgp nexthop-group mode is used, meaning that BGP gathers prefixes which share a nexthop together, and uses the NHG_ADD ZAPI messages to inform ZEBRA. Consequently, ZEBRA will pick an identifier (either the one proposed by BGP or another one chosen by ZEBRA), and will attach that identifier to the BGP nexthop. The test ensures that the identifier is correctly used in the ZEBRA RIB and the Linux FIB, even when the IGP resolution changes (interfaces going down, or a change of the remote IGP label value).
Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
The nexthop group support is not applicable to ECMP BGP paths. For instance, when BGP addpath is configured, only prefixes associated with a single nexthop use nexthop groups at the BGP level. Unlock the possibility to use nexthop groups for ECMP paths. The choice is to create a nexthop group hierarchy with multiple nexthops. For each path to install, two kinds of nexthop group objects are used: nexthops and groups.
- Link the parent nexthop group to the path (with pi->bgp_nhg).
- Link the original bgp_nhg_cache nexthop to the path (with pi->bgp_nhg_nexthop).
- Handle the installation order at the protocol level: the nexthop first, the parent after.
- Add the nexthop extension to 'show bgp nexthop-group'.
- Unlock the ability to handle multiple paths with NHGs.
- A double hash list is used to work around frr_each_safe() limitations: this technical choice is made in order to be able to detach both nexthops and groups in a single walk. Otherwise, ASAN errors are detected.
Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
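The two-level hierarchy and its installation order can be sketched as follows. This Python model is illustrative only: the class names and the NHG_ADD_GROUP message name are invented for the sketch, not actual bgpd or ZAPI identifiers.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class NhgNexthop:
    """Singleton nexthop entry (stand-in for a bgp_nhg_cache nexthop)."""
    gate: str
    nhg_id: int

@dataclass
class NhgGroup:
    """Parent group that references singleton nexthops by identifier,
    mirroring the two-level ECMP hierarchy."""
    nhg_id: int
    children: tuple  # tuple of NhgNexthop

def install_order(group: NhgGroup):
    """Installation order at the protocol level: each child nexthop
    first, then the parent group that references them."""
    msgs = [("NHG_ADD", nh.nhg_id) for nh in group.children]
    msgs.append(("NHG_ADD_GROUP", group.nhg_id,
                 tuple(nh.nhg_id for nh in group.children)))
    return msgs
```

A path then holds two links, one to the parent group and one to its own singleton nexthop, matching the pi->bgp_nhg / pi->bgp_nhg_nexthop split described above.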
Add technical information about BGP nexthop groups. Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
The BGP documentation needs to mention how to alleviate memory and CPU issues in scenarios where BGP may flap multiple times. The 'nexthop-group' commands are introduced (show, debug, config).
Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
The nexthop group test suite is extended with multipath support. The test ensures that on a given multipath configuration, when a failover happens, the new resulting paths use the same nexthop group without the failed nexthop. Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
Until now, only the PATH_SELECTED path had a link to the parent nexthop group. This may be a problem if we have a single path with no bgp_nhg, and a bgp_nexthop. Instead of parsing the whole path list for a given prefix, reuse the bgp_nhg pointer for all entries. Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
When using BGP nexthop groups, there is no need to reinstall a route when only its nexthop needs to be updated for some reason (IGP change, or the same incoming BGP update received with a nexthop modification). Fix this by introducing a flag at the node level to record whether a route update is needed. This feature works only if the 'bgp suppress-fib-enabled' functionality is used; without it, the BGP service does not know whether a route has already been installed.
Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
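The decision logic (reinstall the route only when necessary, otherwise just refresh the nexthop group) can be sketched like this. Everything here is an illustrative model; the NHG_UPDATE message name and the `installed` flag are assumptions standing in for the node-level flag the commit describes.

```python
class RouteNode:
    """Minimal route node carrying the install-state flag that lets
    BGP decide whether a full route update is needed."""

    def __init__(self, prefix, nhg_id):
        self.prefix = prefix
        self.nhg_id = nhg_id
        self.installed = False  # known only with suppress-fib semantics

def update(node, nhg_id, send):
    if node.installed and nhg_id == node.nhg_id:
        return  # nothing changed: no message at all
    if node.installed:
        # Route already in the FIB: refreshing the nexthop group
        # is enough, skip the route reinstall.
        node.nhg_id = nhg_id
        send(("NHG_UPDATE", nhg_id))
        return
    # First installation: a full route message is required.
    node.nhg_id = nhg_id
    send(("ROUTE_ADD", node.prefix, nhg_id))
    node.installed = True
```

Without the install-state knowledge (the suppress-fib case in the commit message), the first branch could never be taken safely, which is why the optimisation is gated on that feature.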
When using bgp nexthop groups, it is possible to avoid route updates, and this test ensures that the number of zapi route messages from BGP to ZEBRA can be reduced, for instance when a failover happens. Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
Modify the current test to add recursive BGP routes: those routes resolve over BGP routes, similar to what happens when a full routing table is received. The IGP failure test checks that the NHIDs of those recursive routes are not changed, and that a nexthop group update message to ZEBRA is enough to change the routing for all incoming BGP routes. The BFD failover test is implemented, and warns that the nexthop group ID has changed.
Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
When not enabled by default, BGP nexthop groups are not tested against the following kinds of route entries:
- route entry resolved over the default route
- route entry resolved over itself
- route entry resolved over a blackhole route
- connected route entry (mplsvpn case)
- route entry using an IPv4-mapped IPv6 address
This test suite covers the above use cases.
Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
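The exclusion cases can be collected into a single eligibility predicate, sketched below. The field names in the dict are invented for the illustration; bgpd keeps this information in its own structures.

```python
def nhg_eligible(route: dict) -> bool:
    """Return True if a route entry may use a BGP nexthop group,
    following the exclusion list above."""
    # Resolved over the default route, over itself, or over a blackhole.
    if route.get("resolved_over") in ("default", "self", "blackhole"):
        return False
    # Directly connected routes (the mplsvpn case).
    if route.get("connected"):
        return False
    # IPv4-mapped IPv6 nexthop addresses.
    if route.get("ipv4_mapped_ipv6_nexthop"):
        return False
    return True
```

A test suite covering these cases would exercise one route per branch and check that zebra receives plain route messages, not NHG identifiers, for the ineligible ones.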
When BGP uses nexthop groups by default, after unconfiguring the 'bgp no-rib' functionality, the BGP routes are re-added. Almost every BGP route is correct in the ZEBRA RIB, except one uninstalled route, '192.168.101.0/24', which has its nexthop flag set to '3', indicating that the nexthop is active. Originally, with bgp nexthop-group disabled, when a route is not installed, its nexthop is not installed either. But with nexthop groups, the referenced nexthop may be used by another route which is installed. Consider this flag as normal by ignoring the flags value on that specific route.
Fixes: 251931a ("tests: Enhance bgp_features topotest suite with tests for 'bgp no-rib'")
Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
The step8 fails to see 2 nexthop entries in the ZEBRA RIB for the '203.0.113.8/30' prefix. BGP receives 2 eBGP paths.

> # show bgp ipv4
> [..]
> *= 203.0.113.8/30 198.51.100.10 0 64503 i
> *> 10.0.3.2 0 64502 i

Observed output:

> # show ip route
> [..]
> B>* 203.0.113.8/30 [20/0] via 10.0.3.2, r1-eth1, weight 1, 00:22:49

Expected result:

> # show ip route
> [..]
> B>* 203.0.113.8/30 [20/0] via 10.0.3.2, r1-eth1, weight 1, 00:00:21
> * via 198.51.100.10 inactive, weight 1, 00:00:21

This test demonstrates that BGP does not allow route recursion for those prefixes, and ZEBRA has to use only the 10.0.3.2 nexthop (which is connected). Actually, because BGP nexthop groups are configured by default, the nexthop '198.51.100.10' is passed to ZEBRA, but ZEBRA will consider this nexthop inactive, because the address does not match any connected interface: with or without nexthop groups, only one nexthop is active. This commit proposes to only check that exactly one nexthop is active, without counting the total number of nexthops.
Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
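The relaxed check can be expressed as a small helper. This is an illustrative sketch of the test-side logic; the list-of-dicts input shape is an assumption, not the exact zebra JSON schema.

```python
def one_nexthop_active(nexthops) -> bool:
    """Verify that exactly one nexthop is active, instead of counting
    the total number of nexthops (which differs depending on whether
    BGP nexthop groups are in use)."""
    return sum(1 for nh in nexthops if nh.get("active")) == 1
```

Both the observed output (one nexthop) and the expected output (one active plus one inactive nexthop) satisfy this check, which is the point of the change.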
An issue appears when BGP wants to add the 203.0.113.0/30 prefix learned from multiple eBGP paths: the 10.0.3.2 nexthop does not have the 'fib' flag set, whereas it is expected to. The test fails only when the BGP nexthop group mode is used. The flag value does not change the behaviour: only the nexthop without the 'duplicate nexthop removed' flag is not installed. Because the test deals with detecting duplicated nexthops, do the check on the 'duplicate' flag instead.
Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
The test fails because the unexpected 40.0.0.0/8 prefix is received by r3. It should have been filtered out at installation time; consequently, the BGP update should not have been sent to r3. The issue happens because the 'ip protocol' command is completely bypassed for ZAPI-based routes that use nexthop group identifiers. Let's disable the BGP nexthop group functionality when such a configuration is present.
Link: FRRouting#15169
Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
The test fails: the 'ip protocol bgp' command does not work with BGP routes based on nexthop group identifiers. Let's disable the BGP nexthop group functionality in this test.
Link: FRRouting#15169
Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
After disabling the 'bgp suppress-no-fib' service, the test fails to find the expected BGP route to '192.168.0.7': two paths are available (from BGP and OSPF), only the OSPF one is selected, and the calculated recursive nexthop for BGP should not be installed. Indeed, the '192.168.0.7' prefix selected in the FIB is the one from OSPF. In the test, the bgp nexthop-group functionality is turned on: the BGP route for '192.168.0.7' uses a shared recursive nexthop which is already installed. But because the BGP route has a bigger distance than the OSPF one, the BGP route is not installed. Because the nexthop is shared rather than specific to the route, ignore the flags value of the nexthop in this test.
Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
During the tests with a full routing table, a CPU issue appears on a multihomed setup with BGP peering over a primary and a backup nexthop. The optimisation is related to the handling of NHG events in zebra.
Hi @mjstapp
The NHT information is presented in 1996c17#diff-e8f99c58bc731e761209c2a21ee19d4a004718048d9517342b3875634ea0cd13R105, unless you want me to present something else.
A piece of code in zebra has been added in this pull request.
ci:rerun ospf6 issue?
This patch set proposes BGP changes. Among the reasons for those changes:
Link: https://github.com/pguibert6WIND/SONiC/blob/proposal_bgp/doc/pic/bgp_pic_arch_doc.md#9-frrouting-ipc-messaging-from-bgp-to-zebra