@raja-rajasekar
Contributor

zebra: avoid redundant NHG kernel install for singleton-equivalent groups

Duplicate nexthops (e.g., EVPN routes)

Problem:
When zebra receives duplicate nexthops (e.g., two paths resolving
to the same 192.168.1.1 for EVPN routes), it can end up installing a
singleton NHG pointing to another singleton NHG.

For example: ip nexthop group (bgp_evpn_rt5-test_evpn_multipath)
id 34 via 192.168.1.1 dev bridge-101 scope link proto zebra onlink
id 44 group 34 proto zebra
id 45 group 33 proto zebra

Zebra's view is that:
  - NHG 34 is a singleton NHG
  - NHG 44 is a singleton NHG which has a duplicate nexthop and depends
    on NHG 34.

However, the kernel and lower layers don't care about the duplicate nexthops,
and this redundant NHG installation can cause resource exhaustion at scale.

Fix:
Add intelligence to detect and skip kernel installation of multipath
NHGs that are functionally equivalent to a singleton.

This means that the NHG zebra creates from the received NHs is
still maintained, but the installed NHG differs from the displayed one, i.e.:
    root@r2:/tmp/topotests/bgp_evpn_rt5.test_bgp_evpn/r2# vtysh -c "sh ip route vrf vrf-101 10.0.101.1/32 ne"
    Routing entry for 10.0.101.1/32
      Known via "bgp", distance 200, metric 0, vrf vrf-101, best
      Last update 00:00:32 ago
      Flags: Recursion iBGP Selected
      Status: None
      Nexthop Group ID: 108
      Installed Nexthop Group ID: 34 >>>>>>>>>
      Received Nexthop Group ID: 108 >>>>>>>>>
        192.168.1.1, via bridge-101 onlink, weight 1
        192.168.1.1, via bridge-101 (duplicate nexthop removed) onlink, weight 1
Signed-off-by: Rajasekar Raja <rajasekarr@nvidia.com>
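
The detection step described under "Fix" can be illustrated with a minimal sketch. This is not zebra's C implementation; the `Nexthop` tuple and the `singleton_equivalent` helper are hypothetical names, and the set of fields compared is an assumption made for illustration.

```python
from typing import NamedTuple, Optional, Sequence


class Nexthop(NamedTuple):
    """Hypothetical nexthop identity: fields a duplicate check might compare."""
    gateway: str
    ifname: str
    onlink: bool = False
    weight: int = 1


def singleton_equivalent(nexthops: Sequence[Nexthop]) -> Optional[Nexthop]:
    """Return the one distinct nexthop if a multi-entry group collapses to it."""
    unique = set(nexthops)
    if len(nexthops) > 1 and len(unique) == 1:
        return next(iter(unique))
    return None


# Two paths resolving to the same nexthop, as in the EVPN example above.
dup_group = [
    Nexthop("192.168.1.1", "bridge-101", onlink=True),
    Nexthop("192.168.1.1", "bridge-101", onlink=True),
]
# A genuine two-way group (the second address is made up for contrast).
ecmp_group = [
    Nexthop("192.168.1.1", "bridge-101", onlink=True),
    Nexthop("192.168.1.2", "bridge-101", onlink=True),
]

print(singleton_equivalent(dup_group))
print(singleton_equivalent(ecmp_group))
```

Only the first group is reported as collapsing to a single nexthop; the second remains a real multipath group.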

@frrbot frrbot bot added tests Topotests, make check, etc zebra labels Nov 12, 2025
@raja-rajasekar raja-rajasekar force-pushed the rajasekarr/nhg_singleton_issue branch from 9fe045e to 2c97b98 Compare November 12, 2025 19:41
@raja-rajasekar
Contributor Author

Test locally:

Without Fix: Baseline

root@r2:/tmp/topotests/bgp_evpn_rt5.test_bgp_evpn/r2# ip nexthop show
id 16 via 192.168.2.101 dev eth-rr scope link proto zebra 
id 31 via ::ffff:192.168.1.1 dev bridge-102 scope link proto zebra onlink 
id 32 via 192.168.1.1 dev bridge-102 scope link proto zebra onlink 
id 33 via ::ffff:192.168.1.1 dev bridge-101 scope link proto zebra onlink 
id 34 via 192.168.1.1 dev bridge-101 scope link proto zebra onlink 
id 44 group 34 proto zebra 
id 45 group 33 proto zebra 
id 46 group 32 proto zebra 
id 47 group 31 proto zebra 
root@r2:/tmp/topotests/bgp_evpn_rt5.test_bgp_evpn/r2# vtysh -c "sh ip route vrf vrf-101 ne "
Codes: K - kernel route, C - connected, L - local, S - static,
       R - RIP, O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
       T - Table, v - VNC, V - VNC-Direct, A - Babel, D - SHARP,
       F - PBR, f - OpenFabric, t - Table-Direct,
       > - selected route, * - FIB route, q - queued, r - rejected, b - backup
       t - trapped, o - offload failure

IPv4 unicast VRF vrf-101:
B>* 10.0.101.1/32 [200/0] (44) via 192.168.1.1, bridge-101 onlink, weight 1, 00:00:04
                               via 192.168.1.1, bridge-101 onlink (dup), weight 1, 00:00:04
L * 10.0.101.2/32 (11) is directly connected, loop101, weight 1, 00:00:54
C>* 10.0.101.2/32 (11) is directly connected, loop101, weight 1, 00:00:54
root@r2:/tmp/topotests/bgp_evpn_rt5.test_bgp_evpn/r2# vtysh -c "sh nexthop-group rib 44"
ID: 44 (zebra)
     RefCnt: 2
     Uptime: 00:00:18
     VRF: default(No AFI)
     Nexthop Count: 2
     Flags: 0x3
     Valid, Installed
     Depends: (34)
        via 192.168.1.1, bridge-101 (vrf vrf-101) onlink, weight 1
        via 192.168.1.1, bridge-101 (vrf vrf-101) onlink (dup), weight 1
root@r2:/tmp/topotests/bgp_evpn_rt5.test_bgp_evpn/r2# vtysh -c "sh nexthop-group rib 34"
ID: 34 (zebra)
     RefCnt: 7
     Uptime: 00:00:24
     VRF: default(IPv4)
     Nexthop Count: 1
     Flags: 0x3
     Valid, Installed
     Interface Index: 5
        via 192.168.1.1, bridge-101 (vrf vrf-101) onlink, weight 1
     Dependents: (44)
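
The redundant entries in the baseline `ip nexthop show` output above can be spotted mechanically. The following is a rough sketch that scans that text for groups listing exactly one member; the helper name `single_member_groups` is hypothetical, and the parsing assumes the `id N group M[/M...]` line layout seen in the output.

```python
import re

# Baseline output copied from the "Without Fix" run above.
BASELINE = """\
id 16 via 192.168.2.101 dev eth-rr scope link proto zebra
id 31 via ::ffff:192.168.1.1 dev bridge-102 scope link proto zebra onlink
id 32 via 192.168.1.1 dev bridge-102 scope link proto zebra onlink
id 33 via ::ffff:192.168.1.1 dev bridge-101 scope link proto zebra onlink
id 34 via 192.168.1.1 dev bridge-101 scope link proto zebra onlink
id 44 group 34 proto zebra
id 45 group 33 proto zebra
id 46 group 32 proto zebra
id 47 group 31 proto zebra
"""


def single_member_groups(text):
    """Map group id -> member id for groups that list exactly one member."""
    result = {}
    for line in text.splitlines():
        match = re.match(r"id (\d+) group (\S+)", line.strip())
        if not match:
            continue
        members = match.group(2).split("/")
        if len(members) == 1:
            # A member may carry a ",weight" suffix; keep only the id.
            result[int(match.group(1))] = int(members[0].split(",")[0])
    return result


print(single_member_groups(BASELINE))
```

For the baseline text this reports groups 44, 45, 46 and 47, each wrapping one singleton nexthop.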

With Fix:

root@r2:/tmp/topotests/bgp_evpn_rt5.test_bgp_evpn/r2# ip nexthop show
id 16 via 192.168.2.101 dev eth-rr scope link proto zebra 
id 31 via ::ffff:192.168.1.1 dev bridge-102 scope link proto zebra onlink 
id 32 via 192.168.1.1 dev bridge-102 scope link proto zebra onlink 
id 33 via ::ffff:192.168.1.1 dev bridge-101 scope link proto zebra onlink 
id 34 via 192.168.1.1 dev bridge-101 scope link proto zebra onlink 
root@r2:/tmp/topotests/bgp_evpn_rt5.test_bgp_evpn/r2# vtysh -c "sh ip route vrf vrf-101 ne"
Codes: K - kernel route, C - connected, L - local, S - static,
       R - RIP, O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
       T - Table, v - VNC, V - VNC-Direct, A - Babel, D - SHARP,
       F - PBR, f - OpenFabric, t - Table-Direct,
       > - selected route, * - FIB route, q - queued, r - rejected, b - backup
       t - trapped, o - offload failure

IPv4 unicast VRF vrf-101:
B>  10.0.101.1/32 [200/0] (108) via 192.168.1.1, bridge-101 onlink, weight 1, 00:00:11
                                via 192.168.1.1, bridge-101 onlink (dup), weight 1, 00:00:11
L * 10.0.101.2/32 (11) is directly connected, loop101, weight 1, 00:01:10
C>* 10.0.101.2/32 (11) is directly connected, loop101, weight 1, 00:01:10
root@r2:/tmp/topotests/bgp_evpn_rt5.test_bgp_evpn/r2# vtysh -c "sh ip route vrf vrf-101 10.0.101.1/32 ne"
Routing entry for 10.0.101.1/32
  Known via "bgp", distance 200, metric 0, vrf vrf-101, best
  Last update 00:00:32 ago
  Flags: Recursion iBGP Selected 
  Status: None 
  Nexthop Group ID: 108
  Installed Nexthop Group ID: 34
  Received Nexthop Group ID: 108
    192.168.1.1, via bridge-101 onlink, weight 1
    192.168.1.1, via bridge-101 (duplicate nexthop removed) onlink, weight 1

root@r2:/tmp/topotests/bgp_evpn_rt5.test_bgp_evpn/r2# vtysh -c "sh nexthop-group rib 108"
ID: 108 (zebra)
     RefCnt: 2
     Uptime: 00:00:56
     VRF: default(No AFI)
     Nexthop Count: 2
     Flags: 0x1
     Valid
     Depends: (34)
        via 192.168.1.1, bridge-101 (vrf vrf-101) onlink, weight 1
        via 192.168.1.1, bridge-101 (vrf vrf-101) onlink (dup), weight 1
root@r2:/tmp/topotests/bgp_evpn_rt5.test_bgp_evpn/r2# vtysh -c "sh nexthop-group rib 34"
ID: 34 (zebra)
     RefCnt: 3
     Uptime: 00:01:00
     VRF: default(IPv4)
     Nexthop Count: 1
     Flags: 0x3
     Valid, Installed
     Interface Index: 5
        via 192.168.1.1, bridge-101 (vrf vrf-101) onlink, weight 1
     Dependents: (108)

Displayed 3 routes and 4 total paths

@raja-rajasekar raja-rajasekar force-pushed the rajasekarr/nhg_singleton_issue branch from d5ed5eb to 56dacda Compare November 12, 2025 23:40
… EVPN tests

Validate singleton-equivalent NHG optimization in existing EVPN tests

Signed-off-by: Rajasekar Raja <rajasekarr@nvidia.com>
Add a test for singleton-equivalent handling in multipath scenarios
(4 paths, 2 sets of duplicate NHs), i.e. NHG X [A,A,B,B]

Signed-off-by: Rajasekar Raja <rajasekarr@nvidia.com>
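
To make the [A,A,B,B] case concrete, here is a small sketch of an order-preserving de-duplication. The thread does not show how the PR itself handles this group, so this only illustrates the reduction to two distinct nexthops; `unique_nexthops` and the address used for B are hypothetical.

```python
def unique_nexthops(nexthops):
    """Order-preserving de-duplication of hashable nexthop keys."""
    return list(dict.fromkeys(nexthops))


A = ("192.168.1.1", "bridge-101")
B = ("192.168.1.2", "bridge-101")  # made-up second nexthop

received = [A, A, B, B]
effective = unique_nexthops(received)

print(len(received), "paths ->", len(effective), "distinct nexthops")
```

Unlike [A,A], this group is not equivalent to a singleton: two distinct nexthops remain after the duplicates are dropped.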
@raja-rajasekar raja-rajasekar force-pushed the rajasekarr/nhg_singleton_issue branch from 56dacda to e95517a Compare November 12, 2025 23:42

def _bgp_check_nexthop():
    output = json.loads(r1.vtysh_cmd("show ip route 10.10.10.10/32 json"))
    # With singleton-equivalent NHG optimization applied to duplicate nexthops,
Member

IMO, we don't need these at all. Just remove `installed` and `fib`, and that should be fine, because we are only checking the nexthop addresses.

Contributor Author

Done
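
The review suggestion above, comparing only nexthop addresses and ignoring the `installed`/`fib` flags, could look roughly like the sketch below. The helper `nexthop_ips` is hypothetical, and the JSON is a hand-written sample in an assumed shape (a prefix key mapping to route entries with a `nexthops` list), not captured vtysh output.

```python
import json


def nexthop_ips(route_json, prefix):
    """Collect the distinct nexthop IPs reported for a prefix, ignoring flags."""
    ips = set()
    for entry in route_json.get(prefix, []):
        for nexthop in entry.get("nexthops", []):
            if "ip" in nexthop:
                ips.add(nexthop["ip"])
    return sorted(ips)


# Hand-written sample (assumed shape, documentation address range).
sample = json.loads("""
{
  "10.10.10.10/32": [
    {
      "protocol": "bgp",
      "nexthops": [
        {"ip": "192.0.2.1", "active": true, "fib": true},
        {"ip": "192.0.2.1", "active": true, "duplicate": true}
      ]
    }
  ]
}
""")

print(nexthop_ips(sample, "10.10.10.10/32"))
```

Because the check never looks at `fib` or `installed`, it gives the same answer whether or not the multipath NHG itself was programmed into the kernel.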

…mization

Make changes in bgp_dynamic_capability to adapt to singleton-equivalent
NHG optimization

Signed-off-by: Rajasekar Raja <rajasekarr@nvidia.com>
@raja-rajasekar raja-rajasekar force-pushed the rajasekarr/nhg_singleton_issue branch from e95517a to 0621fb0 Compare November 18, 2025 06:57
@donaldsharp
Member

The question is: when should the deduplication pass actually happen?

Should we have this happen at reception of the nexthops from the upper level protocol?
After the nexthop resolution but not before we send to the dplane?

@raja-rajasekar
Contributor Author

> The question is: when should the deduplication pass actually happen?
>
> Should we have this happen at reception of the nexthops from the upper level protocol? After the nexthop resolution but not before we send to the dplane?

My original approach was to do it even before we create the NHG, i.e. upon reception of the nexthops from the upper level protocol. But this means we DON'T preserve what the upper level protocol sends us.

But after discussing with you/Mark, to preserve what the upper protocols sent, I am doing the deduplication at kernel installation time: zebra maintains the full NHG internally but installs only the singleton-equivalent to the kernel.
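
As a rough model of the install-time approach described above (not the actual zebra code), the sketch below keeps the received NHG intact in a table and only decides which ID to program. `Nhg`, `NhgTable`, `find_singleton` and `install_id_for` are hypothetical names; the IDs mirror the "With Fix" output earlier in the thread.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

NexthopKey = Tuple[str, str]  # (gateway, interface), an assumed identity


@dataclass
class Nhg:
    nhg_id: int
    nexthops: Tuple[NexthopKey, ...]
    installed: bool = False


class NhgTable:
    def __init__(self) -> None:
        self.by_id: Dict[int, Nhg] = {}

    def add(self, nhg: Nhg) -> Nhg:
        self.by_id[nhg.nhg_id] = nhg
        return nhg

    def find_singleton(self, key: NexthopKey) -> Optional[Nhg]:
        """Find an existing NHG that holds exactly this one nexthop."""
        for nhg in self.by_id.values():
            if nhg.nexthops == (key,):
                return nhg
        return None

    def install_id_for(self, nhg: Nhg) -> int:
        """ID to program: an existing singleton if the group collapses to it."""
        distinct = set(nhg.nexthops)
        if len(nhg.nexthops) > 1 and len(distinct) == 1:
            singleton = self.find_singleton(next(iter(distinct)))
            if singleton is not None:
                # The received NHG stays in the table but is not installed.
                return singleton.nhg_id
        nhg.installed = True
        return nhg.nhg_id


nh = ("192.168.1.1", "bridge-101")
table = NhgTable()
table.add(Nhg(34, (nh,), installed=True))
received = table.add(Nhg(108, (nh, nh)))

print("received:", received.nhg_id, "installed:", table.install_id_for(received))
```

The received group 108 keeps both entries and stays uninstalled, while routes using it are programmed against singleton 34, which matches the "Installed Nexthop Group ID: 34 / Received Nexthop Group ID: 108" display.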

@mjstapp mjstapp self-requested a review December 8, 2025 20:52
@mjstapp
Contributor

mjstapp commented Dec 8, 2025

yeah, I think some of this probably comes from the history, the legacy of the logic that existed before the NHG concepts came in, when the code thought about individual nexthops only. it sort of feels like we're missing a step, and that legacy code is pushing us into increasingly complicated workarounds - in this area, in the "caching" PR that's also open. maybe it would make sense to think about a less nexthop-oriented approach, maybe just at the point where we've done the nexthop resolution/validity check.

@raja-rajasekar
Contributor Author

> yeah, I think some of this probably comes from the history, the legacy of the logic that existed before the NHG concepts came in, when the code thought about individual nexthops only. it sort of feels like we're missing a step, and that legacy code is pushing us into increasingly complicated workarounds - in this area, in the "caching" PR that's also open. maybe it would make sense to think about a less nexthop-oriented approach, maybe just at the point where we've done the nexthop resolution/validity check.

Agreed, Mark. I think the NHG code is convoluted to such an extent that sitting down and re-designing it might help. Otherwise, we will have to keep adding fixes that complicate it further. But as for this PR, let me know what needs to be done @donaldsharp @mjstapp

@raja-rajasekar raja-rajasekar marked this pull request as draft January 22, 2026 21:48
@raja-rajasekar
Contributor Author

Some EVPN tests are failing internally; let me rework the fix.
