BGP nexthop group optimisations #15488

Open. Wants to merge 37 commits into base: master.
Conversation

pguibert6WIND
Member

This patch set proposes a series of BGP changes. Among the motivations for those changes:

  • split the BGP route information between prefix and nexthop. This change applies to single-path routes as well as multiple-path routes (with the addpath functionality).
  • reduce the number of messages exchanged between BGP and ZEBRA during failover scenarios

Link: https://github.com/pguibert6WIND/SONiC/blob/proposal_bgp/doc/pic/bgp_pic_arch_doc.md#9-frrouting-ipc-messaging-from-bgp-to-zebra

There is a CPU issue in ZEBRA when BGP installs and removes
a lot of routes at the same time. The vtysh and shell become
unreachable. This is the case of BGP failover scenarios with
two peers, and one of the peers becoming unreachable.

For each route change, it appears that nexthop tracking is
called to evaluate the impact of a route becoming available
or unavailable. Two observations can be made:

- In the case of a specific route change, if a less specific route
exists (for instance the default route present in this setup),
then nexthop tracking is called. There is no need to call nexthop
tracking again for the same default prefix, knowing that the
dplane_result thread handles bulks of routes at the same time.

- The first picture from the link below indicates that nexthop
tracking consumes time, and keeping this activity in the zebra
main thread will still result in STARVATION messages.

This change separates the NHT notifications from the dplane_result
thread by creating a queue list that stores the prefixes to
evaluate against nexthop tracking. Before enqueuing a prefix, a
check verifies that the same prefix has not already been queued.
The processing is done in a separate 'rib_process_nht_thread_loop'
function.
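The enqueue-with-dedup step can be sketched as follows; this is a minimal standalone model with hypothetical names (the actual patch queues prefixes from the dplane_result thread and drains them in rib_process_nht_thread_loop), not FRR code:

```c
/* Sketch of a deduplicated NHT work queue; names and the fixed-size
 * array are hypothetical stand-ins for zebra's queue infrastructure. */
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NHT_QUEUE_MAX 64

struct nht_queue {
	char prefixes[NHT_QUEUE_MAX][64]; /* prefix strings, e.g. "0.0.0.0/0" */
	int count;
};

/* Return true if the prefix is already queued for NHT evaluation. */
static bool nht_queue_contains(const struct nht_queue *q, const char *prefix)
{
	for (int i = 0; i < q->count; i++)
		if (strcmp(q->prefixes[i], prefix) == 0)
			return true;
	return false;
}

/* Enqueue only if not already present: many route changes resolving
 * over the same (e.g. default) prefix trigger a single NHT pass. */
static bool nht_queue_enqueue(struct nht_queue *q, const char *prefix)
{
	if (q->count >= NHT_QUEUE_MAX || nht_queue_contains(q, prefix))
		return false;
	strncpy(q->prefixes[q->count++], prefix, 63);
	return true;
}

/* Drained later by a dedicated thread loop, separate from the
 * dplane_result thread. Returns the number of prefixes processed. */
static int nht_queue_drain(struct nht_queue *q)
{
	int processed = q->count;
	q->count = 0;
	return processed;
}
```

The dedup check is what collapses N identical default-prefix evaluations into one.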

Link: FRRouting#16028

Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
Add a vty command for debug log messages related to the BGP
nexthop group implementation.

Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
There is no way to control how BGP routes are sent to zebra.
Add a general configuration knob to determine the nexthop
behaviour of BGP.
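As a sketch only, the knob could look like the following vtysh configuration fragment; the exact command syntax is an assumption based on the 'nexthop-group' show/debug/config commands mentioned later in this series:

```
router bgp 65000
 ! hypothetical syntax: toggle whether BGP announces routes to zebra
 ! using nexthop group (NHG) identifiers
 bgp nexthop-group
exit
```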

Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
By default, all BGP updates will attempt to use NHG identifiers.

Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
The bgp_debug_zebra_nh() function only uses zlog traces to dump a
nexthop. The nexthop information should also be available in buffer
format.

Add a second function, bgp_debug_zebra_nh_buffer(), that takes a
buffer as a parameter and returns the nexthop information in it.
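A minimal sketch of the buffer-based variant, using a simplified stand-in structure (the real bgp_debug_zebra_nh_buffer() operates on BGP/zebra nexthop types; everything in this snippet is hypothetical):

```c
/* Sketch: format a nexthop into a caller-supplied buffer instead of
 * logging it directly, so "show" commands can reuse the rendering. */
#include <assert.h>
#include <stdio.h>
#include <string.h>

struct nh_stub {
	const char *addr;    /* e.g. "172.31.10.7" */
	const char *vrf;     /* e.g. "vrf1" */
	unsigned int weight;
};

/* Returns buf so the call can be used inline in a vty_out()/zlog(). */
static const char *nh_dump_buffer(const struct nh_stub *nh, char *buf,
				  size_t size)
{
	snprintf(buf, size, "via %s (vrf %s), weight %u", nh->addr,
		 nh->vrf, nh->weight);
	return buf;
}
```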

Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
On a transit BGP device that receives a full routing table from one
ISP, if a failover happens, then BGP needs to remove the
installed routes... which may take time.

This takes time because the ZAPI interface between BGP and ZEBRA,
which is in charge of transmitting route additions/updates/removals,
has to process a lot of route messages during the transition time.

To reduce the number of messages at the ZAPI level, a series of
patches is proposed:
- split the route information in two, separating the prefix
information from the nexthop information. Two different
messages will be issued
- avoid sending route updates when only the identified
nexthop has changed.

This commit relies on the nexthop group library extensions.
- A nexthop group identifier is created at BGP level, and reuses
the RECURSION, IBGP and SRTE flags defined in nexthop_group.h
- The nexthop group NHG_ADD message is sent before ROUTE_ADD is
sent with the nexthop group identifier.
- The NHG_DELETE message is sent when the last ROUTE_DEL message
is sent.

The changes:
- only apply to BGP paths which have a single nexthop.
- do not apply to EVPN routes or directly connected routes or blackhole
routes, or ipv4 mapped ipv6 nexthops
- do not apply to routes resolved over blackhole, default route
- do not apply to non host routes that resolve over themselves

The changes do not avoid sending route updates in this commit.
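The message ordering described above (NHG_ADD before the first ROUTE_ADD carrying the identifier, NHG_DELETE after the last ROUTE_DEL) can be modelled with a reference count; this is a hypothetical sketch, not the actual ZAPI code:

```c
/* Sketch of NHG/route message ordering, keyed on how many installed
 * routes reference a given nexthop group identifier. */
#include <assert.h>

enum msg { MSG_NHG_ADD, MSG_ROUTE_ADD, MSG_ROUTE_DEL, MSG_NHG_DEL };

struct nhg_ref {
	unsigned int id; /* nexthop group identifier chosen by BGP */
	int refcnt;      /* installed routes using this NHG */
};

/* First message to send when installing a route over this NHG:
 * NHG_ADD must precede the first ROUTE_ADD referencing the id. */
static enum msg route_install(struct nhg_ref *nhg)
{
	return nhg->refcnt++ == 0 ? MSG_NHG_ADD : MSG_ROUTE_ADD;
}

/* NHG_DELETE is only sent once the last referencing route is gone. */
static enum msg route_uninstall(struct nhg_ref *nhg)
{
	return --nhg->refcnt == 0 ? MSG_NHG_DEL : MSG_ROUTE_DEL;
}
```

Keeping the NHG alive until the last ROUTE_DEL is what lets many prefixes share one nexthop object in zebra.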

Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
Nexthop group utilities will be used by BGP, and
until now, only the all_protocol_startup test was
using them.

Move some functions from all_protocol_startup to
'tests/topotests/lib/nexthopgroup.py' file.
route_get_nhg_id(), verify_nexthop_group(), and
verify_route_nexthop_group() are modified to handle
the router name parameter.

Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
Some basic path information is used by the 'show bgp nexthops detail'
command. It only deals with path attributes, and could
be used by other show commands to list bgp_paths.

Move the specific code from bgp_nexthop.c to bgpd.c
- bgp_show_bgp_path_info_flags() is renamed to bgp_path_info_show_flags()
- add a new function bgp_path_info_display() to display this information

The AS of the BGP path is displayed based on the BGP instance
where the path resides.

Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
Add a vty function that displays the nexthop groups
used by BGP.

> r1# show bgp nexthop-group detail
> ID: 75757571, #paths 1
>   Flags: 0x0000
>   State: 0x0001 (Installed)
>            via 172.31.10.7 (vrf vrf1) inactive, weight 1
>   Paths:
>     1/1 192.0.2.7/32 VRF vrf1 flags 0x418
> ID: 75757572, #paths 1
>   Flags: 0x0000
>   State: 0x0001 (Installed)
>            via 172.31.11.8 (vrf vrf2) inactive, weight 1
>   Paths:
>     1/1 192.0.2.7/32 VRF vrf2 flags 0x418
> ID: 75757575, #paths 1
>   Flags: 0x0003 (allowRecursion, internalBgp)
>   State: 0x0001 (Installed)
>            via 192.0.2.6 (vrf default) inactive, weight 1
>   Paths:
>     1/1 192.0.2.9/32 VRF default flags 0x418
> ID: 75757576, #paths 1
>   Flags: 0x0003 (allowRecursion, internalBgp)
>   State: 0x0001 (Installed)
>            via 192.0.2.6 (vrf default) inactive, label 6000, weight 1
>   Paths:
>     1/1 192.0.2.9/32 VRF vrf1 flags 0x4018

Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
This topology uses an IGP network with BGP between r1 and
r5/r6 on one hand, and two CEs at the edge on the other.
The bgp nexthop-group mode is used, meaning that BGP gathers
together prefixes which share a nexthop, and uses the NHG_ADD
zapi messages to inform ZEBRA.

Consequently, ZEBRA will pick an identifier (either the one
proposed by BGP or another one chosen by ZEBRA), and will bind
that identifier to the BGP nexthop.

The test ensures that this identifier is correctly used in the
ZEBRA RIB and the Linux FIB, even when the IGP resolution
changes (interfaces going down, or the remote IGP label value changing).

Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
Until now, nexthop group support was not applicable to ECMP BGP
paths. For instance, when BGP addpath is configured, only prefixes
associated with a single nexthop use nexthop groups at the BGP level.

Unlock the possibility to use nexthop groups for ECMP paths.
The choice is done to create a nexthop group hierarchy with
multiple nexthops.

For each path to install, two nexthop groups will be used:
nexthops and groups.
- Link the parent nexthop group to the path (with pi->bgp_nhg)
- Link the original bgp_nhg_cache nexthop to the path (with pi->bgp_nhg_nexthop).
- Handle the installation order at proto level: the nexthop first, the parent after
- Add nexthop extension in 'show bgp nexthop-group'
- unlock the ability to handle multiple paths with NHGs
- A double hash list is used to work around frr_each_safe()
limitations: this choice makes it possible to detach both
nexthops and groups in a single walk; otherwise, ASAN errors are
detected.
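The two-level hierarchy can be sketched as follows; the structure and names are hypothetical stand-ins for bgp_nhg_cache, and the point is the installation order: singleton nexthop groups first, then the parent group that references them:

```c
/* Sketch of the two-level NHG hierarchy used for ECMP paths: each
 * singleton nexthop gets its own group id, and a parent group
 * references the children by id. */
#include <assert.h>

#define MAX_ECMP 8

struct bgp_nhg_sketch {
	unsigned int id;
	int nchildren;                  /* number of singleton children */
	unsigned int child_ids[MAX_ECMP];
};

/* Installation order at proto level: children (nexthops) first,
 * parent (group of groups) after, so the parent only references
 * identifiers that zebra already knows. Returns entries written. */
static int install_order(const struct bgp_nhg_sketch *parent,
			 unsigned int out_ids[], int max)
{
	int n = 0;

	for (int i = 0; i < parent->nchildren && n < max; i++)
		out_ids[n++] = parent->child_ids[i]; /* singletons first */
	if (n < max)
		out_ids[n++] = parent->id;           /* parent last */
	return n;
}
```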

Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
Add technical information about BGP nexthop groups.

Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
The BGP documentation needs to mention how to
alleviate memory and CPU issues in scenarios where
BGP may flap multiple times. The 'nexthop-group' commands
are introduced (show, debug, config).

Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
The nexthop group test suite is extended with multipath support.
The test ensures that on a given multipath configuration, when a
failover happens, the new resulting paths use the same nexthop
group without the failed nexthop.

Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
Until now, only the PATH_SELECTED path had a link to the
parent nexthop group. This may be a problem if we have a
single path with no bgp_nhg, and a bgp_nexthop.

Instead of parsing the whole path list for a given prefix,
reuse the bgp_nhg pointer for all entries.

Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
When using BGP nexthop-groups with routes, there is no need to
reinstall routes when only the nexthop has to be updated for any
reason (IGP change, or the same incoming BGP update received with a
nexthop modification).

Fix this by introducing a flag at the node level, to know whether a
route update is needed. This feature works when the
'bgp suppress-fib-pending' functionality is used. Without it, the
BGP service does not know whether a route has already been installed.
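The node-level decision can be sketched like this; the structure and names are hypothetical, and the sketch assumes FIB install feedback is available (as provided by the suppress-fib machinery):

```c
/* Sketch of the node-level flag deciding whether a ZAPI route
 * message must be resent, given FIB install feedback. */
#include <assert.h>
#include <stdbool.h>

struct route_node_sketch {
	bool installed;       /* FIB feedback: route already installed */
	bool nexthop_changed; /* only the NHG contents changed */
	bool prefix_changed;  /* prefix/attribute-level change */
};

/* With NHGs, a pure nexthop change is handled by an NHG update
 * alone: no route message is needed for an installed route. */
static bool route_update_needed(const struct route_node_sketch *rn)
{
	if (!rn->installed)
		return true; /* never installed: full route message */
	return rn->prefix_changed; /* nexthop-only change: skip it */
}
```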

Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
When using bgp nexthop groups, it is possible to avoid route
updates, and this test ensures that the number of zapi route
messages from BGP to ZEBRA can be reduced, for instance when
a failover happens.

Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
Modify the current test to add recursive BGP routes: those routes
will resolve over BGP routes, similar to what happens when a full
routing table is received.

The IGP failure test checks that the NHIDs of those recursive
routes are not changed, and that a nexthop group update message to
ZEBRA is enough to change the routing of all incoming BGP routes.

The BFD failover test is implemented, and warns that the nexthop
group ID has changed.

Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
Until now, BGP nexthop groups were not tested against the
following kinds of route entries:
- route entry resolved over default route
- route entry resolved over itself
- route entry resolved over blackhole route
- route entry connected (case mplsvpn)
- route entry using ipv4 mapped ipv6 address

This test suite covers the above use cases.
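The excluded cases, matching the list in this commit and the restrictions stated earlier in the series, can be sketched as a single predicate; names are hypothetical:

```c
/* Sketch of the eligibility checks that keep certain routes out of
 * the BGP nexthop-group path, mirroring the cases listed above. */
#include <assert.h>
#include <stdbool.h>

struct path_sketch {
	bool resolved_over_default;
	bool resolved_over_blackhole;
	bool resolved_over_self;   /* route resolving over itself */
	bool connected;            /* directly connected (e.g. mplsvpn) */
	bool v4_mapped_v6_nexthop; /* IPv4-mapped IPv6 nexthop */
};

/* A path uses an NHG only when none of the exclusions apply;
 * otherwise BGP falls back to the classic per-route ZAPI message. */
static bool path_eligible_for_nhg(const struct path_sketch *p)
{
	return !(p->resolved_over_default || p->resolved_over_blackhole ||
		 p->resolved_over_self || p->connected ||
		 p->v4_mapped_v6_nexthop);
}
```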

Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
When BGP uses nexthop groups by default, after unconfiguring the
'bgp no-rib' functionality, the BGP routes are re-added. Almost
every BGP route is ok in the ZEBRA RIB, except one uninstalled
route '192.168.101.0/24' which has its nexthop flag set to '3',
indicating that the nexthop is active.

Originally, when bgp nexthop-group is disabled, when a route is
not installed, its nexthop is not installed either. But with
nexthop-groups, the referenced nexthop may be used by another
route which is installed.

Consider this flag as normal by ignoring the flags value on that
specific route.

Fixes: 251931a ("tests: Enhance bgp_features topotest suite with tests for 'bgp no-rib'")
Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
Step 8 fails to see 2 nexthop entries in the ZEBRA RIB for the
'203.0.113.8/30' prefix.

BGP receives 2 eBGP paths.
> # show bgp ipv4
> [..]
> *= 203.0.113.8/30   198.51.100.10                          0 64503 i
> *>                  10.0.3.2                               0 64502 i

Observed output:

> # show ip route
> [..]
> B>* 203.0.113.8/30 [20/0] via 10.0.3.2, r1-eth1, weight 1, 00:22:49

Expected result:

> # show ip route
> [..]
> B>* 203.0.113.8/30 [20/0] via 10.0.3.2, r1-eth1, weight 1, 00:00:21
>  *                       via 198.51.100.10 inactive, weight 1, 00:00:21

This test demonstrates that BGP does not allow route recursivity for
those prefixes, and ZEBRA only uses the path with the 10.0.3.2 nexthop
(which is connected).

Actually, because BGP nexthop groups are configured by default, the
nexthop '198.51.100.10' is passed to ZEBRA, but ZEBRA considers this
nexthop inactive, because the address does not match any connected
interface: with or without nexthop groups, only one nexthop is active.

This commit proposes to check only that one nexthop is active,
without counting the total number of nexthops.

Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
An issue appears when BGP wants to add the 203.0.113.0/30 prefix from
eBGP multiple paths: the 10.0.3.2 nexthop does not have the 'fib'
flag set, whereas it is expected to.

The test fails only when the BGP nexthop group mode is used. The flag
value does not change the behaviour. Only the nexthop that does not
carry the 'duplicate nexthop removed' flag is not installed.

Because the test deals with duplicate nexthop detection, check the
'duplicate' flag instead.

Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
The test fails because r3 receives the unexpected 40.0.0.0/8 prefix.
It should have been filtered out at installation time; consequently,
the BGP update should not have been sent to r3.

The issue happens because the 'ip protocol' command is completely
bypassed for ZAPI-based routes that use nexthop group identifiers.
Let's disable the BGP nexthop group functionality when such a
configuration is present.

Link: FRRouting#15169

Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
The test fails: the 'ip protocol bgp' command does not work with BGP
routes based on nexthop group identifiers.
Let's disable the BGP nexthop group functionality in this test.

Link: FRRouting#15169

Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
After disabling the 'bgp suppress-fib-pending' service, the test fails
to find the expected BGP route to '192.168.0.7': two paths are
available (from BGP and OSPF), and only the OSPF one is selected,
but the calculated recursive nexthop for BGP should not be installed.

Actually, in the FIB, the '192.168.0.7' prefix selected is the one
from OSPF.

In the test, the bgp nexthop-group functionality is turned on: the BGP
route for '192.168.0.7' uses a shared recursive nexthop which is
already installed. But because the BGP route has a higher distance than
the OSPF one, the BGP route is not installed.

Because the nexthop is shared rather than owned by the route, ignore
the nexthop flags value in this test.

Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>

pguibert6WIND commented May 31, 2024

During the tests with a full routing table, a CPU issue appeared on a multihomed setup with BGP peering over a primary and a backup nexthop.
The primary root cause was that an 'asan' build was used and consumed too much memory; there was a lot of swap.
However, an optimisation is welcome to handle this high CPU usage.
When the failover happens, the shell and vtysh are hardly reachable.
The attached graph shows that the dataplane result handling takes a lot of time.
On each route node installed/uninstalled, NHT is called to evaluate other routes.
It appears that if a default route is present, NHT is systematically invoked for this prefix as many times as there are routes installed (whereas only one NHT event is necessary).

[graph1: CPU profile of the dataplane result handling]

The optimisation is related to the handling of NHG events in zebra.
Extra code has been added; this extra code parses the list of route nodes and takes time.
Beyond that addition, the NHG event handling in zebra is optimised in this pull request too.

@pguibert6WIND
Member Author

Hi @mjstapp

> I'm very concerned about some of what's here, and I discussed it in a comment earlier. I think it would be more successful to split this work into multiple PRs:
>
> 1. make it possible for zebra to resolve recursive nexthop groups owned by daemons (as it resolves implicit groups that it creates and owns)

#16028

> 2. enable BGP to create and use explicit nhgs for its routes. the policy implications will have to be carefully presented. NHT events trigger BGP to refresh its groups.

NHT information is presented in 1996c17#diff-e8f99c58bc731e761209c2a21ee19d4a004718048d9517342b3875634ea0cd13R105

unless you want me to present something.

> 3. add your "zebra manages bgp's routes" proposal. as I've said, I think this is a mistake, but I think the discussion would be better if this aspect were separate from the other, more mechanical aspects.

A piece of code in zebra has been added in this pull request:

1e64cc6

@pguibert6WIND pguibert6WIND marked this pull request as ready for review June 3, 2024 08:19
@pguibert6WIND
Member Author

ci:rerun ospf6 issue?
