Zebra fixup nhg handling from kernel #20732

Merged
riw777 merged 3 commits into FRRouting:master from donaldsharp:zebra_fixup_nhg_handling_from_kernel on Feb 11, 2026

@donaldsharp
Member

Zebra was not properly handling nhg's received from the kernel. Make it right.

The decoding of the netlink message into a dplane ctx is storing the
nhg id not the nhe_id.  Let's actually retrieve the right value.

Signed-off-by: Donald Sharp <sharpd@nvidia.com>

FRR is currently receiving routes from the kernel that have nexthop groups
but is ignoring those nexthop groups and is creating new ones that do
not actually match what is in the kernel.  Modify the code such that
we track the kernel `found` nexthop groups and we allow routes received
from the kernel that use those nhg's to actually have the right values.

Signed-off-by: Donald Sharp <sharpd@nvidia.com>

Create a test that shows that when nhg's, and routes using those nhg's,
are received from the kernel, they are handled correctly in zebra.

Signed-off-by: Donald Sharp <sharpd@nvidia.com>
@frrbot frrbot bot added the tests (Topotests, make check, etc) and zebra labels Feb 9, 2026
@greptile-apps

greptile-apps bot commented Feb 9, 2026

Greptile Overview

Greptile Summary

This PR improves zebra’s handling of nexthop-group (NHG) IDs received from the kernel by (1) reading the NHG ID from the dplane context when decoding netlink route updates, and (2) adding a new NHG flag (NEXTHOP_GROUP_RECEIVED_FROM_EXTERNAL) that is set when routes/NHGs originate from kernel-provided IDs. A new topotest (zebra_kernel_nhg) installs kernel nexthops/NHGs and verifies that show ip route nexthop-group summary json reports the expected receivedNexthopGroupId values.

Main concern is that nexthop_active_update() now treats the new “received from external” flag the same as PROTO_OWNED(), routing kernel-originated NHGs through the protocol-owned active-update path, which may bypass zebra’s normal ACTIVE nexthop refresh logic. The new topotest also asserts kernel NHG deduplication behavior for two identical traditional routes, which may be kernel-dependent and can cause spurious failures.

Confidence Score: 3/5

  • This PR is directionally correct but has a likely logic error in NHG active-update handling and a potentially kernel-dependent test assertion.
  • Core changes are small and targeted (correct NHG ID accessor; new flag and test). However, treating kernel-received NHGs as proto-owned in nexthop_active_update() can lead to incorrect ACTIVE nexthop refresh behavior, and the new test’s equality assertion on kernel-assigned NHG IDs may not be stable across environments.
  • zebra/zebra_nhg.c; tests/topotests/zebra_kernel_nhg/test_zebra_kernel_nhg.py

Important Files Changed

| Filename | Overview |
| --- | --- |
| tests/topotests/zebra_kernel_nhg/r1/frr.conf | Adds minimal router config for new topotest; no functional concerns. |
| tests/topotests/zebra_kernel_nhg/test_zebra_kernel_nhg.py | Adds new topotest installing kernel NHGs/routes and checking receivedNexthopGroupId; includes an assertion that two traditional routes must share the same NHG ID which may be kernel-dependent/flaky. |
| zebra/rt_netlink.c | Fixes route-change parsing to read NHG ID via dplane_ctx_get_nhg_id() instead of get_nhe_id(); change is narrow but should be validated against dplane ctx semantics. |
| zebra/zebra_nhg.c | Marks new NHGs as received-from-external and changes nexthop active update to treat that flag like PROTO_OWNED(), routing updates through proto-owned logic; likely incorrect for kernel-originated NHGs (see comment at nexthop_active_update). |
| zebra/zebra_nhg.h | Adds NEXTHOP_GROUP_RECEIVED_FROM_EXTERNAL flag bit (1<<10) to NHG flags; bit position appears free. |
| zebra/zebra_rib.c | When adding a route by re->nhe_id, sets NEXTHOP_GROUP_RECEIVED_FROM_EXTERNAL on the temporary NHE before copying; looks consistent with intent to preserve kernel-received NHG IDs. |

Sequence Diagram

sequenceDiagram
    participant Kernel
    participant Netlink as zebra/rt_netlink.c
    participant DPlane as zebra_dplane_ctx
    participant RIB as zebra_rib.c
    participant NHG as zebra_nhg.c
    participant VTY as zebra_vty.c

    Kernel->>Netlink: "RTM_NEWROUTE (nhid / nexthops)"
    Netlink->>DPlane: "Populate ctx (zd_nhg_id, nhe.id, ng)"
    Netlink->>Netlink: "nhe_id = dplane_ctx_get_nhg_id(ctx)"
    Netlink->>RIB: "rib_add_multipath(..., re->nhe_id = nhe_id)"

    alt "Route uses explicit NHG id"
        RIB->>RIB: "Mark NHE RECEIVED_FROM_EXTERNAL"
        RIB->>NHG: "zebra_nhe_copy(&nhe, 0)"
    else "Route uses traditional via"
        RIB->>NHG: "zebra_nhe_copy(&nhe, 0)"
    end

    Kernel->>Netlink: "RTM_NEWNEXTHOP (NHG object)"
    Netlink->>NHG: "nhg_ctx_process_new(ctx)"
    NHG->>NHG: "SET_FLAG(RECEIVED_FROM_EXTERNAL)"

    alt "PROTO_OWNED or RECEIVED_FROM_EXTERNAL"
        NHG->>NHG: "proto_nhg_nexthop_active_update()"
    else "zebra-owned"
        NHG->>NHG: "refresh ACTIVE nexthops"
    end

    VTY->>NHG: "Query route/NHG summary"
    NHG-->>VTY: "Include receivedNexthopGroupId"

@greptile-apps greptile-apps bot left a comment

6 files reviewed, 2 comments

Comment on lines +3194 to 3196 of zebra/zebra_nhg.c
        if (PROTO_OWNED(re->nhe) ||
            CHECK_FLAG(re->nhe->flags, NEXTHOP_GROUP_RECEIVED_FROM_EXTERNAL))
                return proto_nhg_nexthop_active_update(&re->nhe->nhg);

Kernel NHGs treated as proto-owned

nexthop_active_update() routes any NHG with NEXTHOP_GROUP_RECEIVED_FROM_EXTERNAL through proto_nhg_nexthop_active_update(). That path is meant for protocol-owned NHGs (checked via PROTO_OWNED()), and it skips the normal zebra active-resolution flow that copies the NHE and runs nexthop_list_active_update()/reinstall logic. For kernel-originated NHGs, this can leave route entries with stale ACTIVE nexthop state when interfaces/recursion change. Consider handling RECEIVED_FROM_EXTERNAL separately (i.e., don’t reuse the proto-owned update path).

Comment on lines +122 to +127 of tests/topotests/zebra_kernel_nhg/test_zebra_kernel_nhg.py
logger.info("Traditional prefix %s missing in summary output", prefix)
return False

nhg_40 = route_json["10.40.0.0/24"][0].get("receivedNexthopGroupId")
nhg_60 = route_json["10.60.0.0/24"][0].get("receivedNexthopGroupId")
if nhg_40 != nhg_60:

Assumes stable NHG deduping

This test asserts that two “traditional” routes (10.40/24 and 10.60/24) that share the same gateway must have the same receivedNexthopGroupId. That’s not guaranteed by zebra itself (the ID is kernel-reported) and can vary by kernel version/config depending on whether/when it deduplicates/merges NHGs for identical nexthops. This can make the test fail even when zebra is correctly reporting kernel NHG IDs.
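
If the equality check did turn out to be unstable on some kernels, one looser alternative might be to assert only that each traditional route reports some `receivedNexthopGroupId`, without requiring the two IDs to match. The following is a minimal sketch, not code from this PR: `kernel_nhg_ids_reported` is a hypothetical helper, and the JSON shape (prefix mapped to a list of route entries) is assumed from the `route_json[...][0].get(...)` accesses quoted above. The sample IDs are made up.

```python
def kernel_nhg_ids_reported(route_json, prefixes):
    """Return True when every prefix has a first route entry that
    carries a non-null receivedNexthopGroupId.

    Hypothetical helper for illustration; not part of the PR's test.
    """
    for prefix in prefixes:
        entries = route_json.get(prefix)
        if not entries:
            # Prefix missing from the summary output (or empty list).
            return False
        if entries[0].get("receivedNexthopGroupId") is None:
            # Route present, but zebra did not report a kernel NHG ID.
            return False
    return True


# Made-up sample shaped like the accesses in the quoted test lines.
# The two IDs differ on purpose: the check does not require equality.
sample = {
    "10.40.0.0/24": [{"receivedNexthopGroupId": 17}],
    "10.60.0.0/24": [{"receivedNexthopGroupId": 23}],
}
print(kernel_nhg_ids_reported(sample, ["10.40.0.0/24", "10.60.0.0/24"]))
```

This trades away coverage of the deduplication behavior itself, so it would only make sense if that behavior is not what the test is meant to pin down.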

@donaldsharp
Member Author

@Mergifyio backport dev/10.6 stable/10.5 stable/10.4 stable/10.3 stable/10.2 stable/10.1 stable/10.0

@mergify

mergify bot commented Feb 9, 2026

backport dev/10.6 stable/10.5 stable/10.4 stable/10.3 stable/10.2 stable/10.1 stable/10.0

✅ Backports have been created

Details

Cherry-pick of 4c55e4f has failed:

On branch mergify/bp/stable/10.2/pr-20732
Your branch is up to date with 'origin/stable/10.2'.

You are currently cherry-picking commit 4c55e4f0f.
  (fix conflicts and run "git cherry-pick --continue")
  (use "git cherry-pick --skip" to skip this patch)
  (use "git cherry-pick --abort" to cancel the cherry-pick operation)

Unmerged paths:
  (use "git add <file>..." to mark resolution)
	both modified:   zebra/rt_netlink.c

no changes added to commit (use "git add" and/or "git commit -a")

To fix up this pull request, you can check it out locally. See documentation: https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/reviewing-changes-in-pull-requests/checking-out-pull-requests-locally

Cherry-pick of 4c55e4f has failed:

On branch mergify/bp/stable/10.1/pr-20732
Your branch is up to date with 'origin/stable/10.1'.

You are currently cherry-picking commit 4c55e4f0f.
  (fix conflicts and run "git cherry-pick --continue")
  (use "git cherry-pick --skip" to skip this patch)
  (use "git cherry-pick --abort" to cancel the cherry-pick operation)

Unmerged paths:
  (use "git add <file>..." to mark resolution)
	both modified:   zebra/rt_netlink.c

no changes added to commit (use "git add" and/or "git commit -a")

Cherry-pick of ce6311a has failed:

On branch mergify/bp/stable/10.1/pr-20732
Your branch is ahead of 'origin/stable/10.1' by 1 commit.
  (use "git push" to publish your local commits)

You are currently cherry-picking commit ce6311a7a.
  (fix conflicts and run "git cherry-pick --continue")
  (use "git cherry-pick --skip" to skip this patch)
  (use "git cherry-pick --abort" to cancel the cherry-pick operation)

Changes to be committed:
	modified:   zebra/zebra_nhg.c

Unmerged paths:
  (use "git add <file>..." to mark resolution)
	both modified:   zebra/zebra_nhg.h
	both modified:   zebra/zebra_rib.c

To fix up this pull request, you can check it out locally. See documentation: https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/reviewing-changes-in-pull-requests/checking-out-pull-requests-locally

Cherry-pick of 4c55e4f has failed:

On branch mergify/bp/stable/10.0/pr-20732
Your branch is up to date with 'origin/stable/10.0'.

You are currently cherry-picking commit 4c55e4f0f.
  (fix conflicts and run "git cherry-pick --continue")
  (use "git cherry-pick --skip" to skip this patch)
  (use "git cherry-pick --abort" to cancel the cherry-pick operation)

Unmerged paths:
  (use "git add <file>..." to mark resolution)
	both modified:   zebra/rt_netlink.c

no changes added to commit (use "git add" and/or "git commit -a")

Cherry-pick of ce6311a has failed:

On branch mergify/bp/stable/10.0/pr-20732
Your branch is ahead of 'origin/stable/10.0' by 1 commit.
  (use "git push" to publish your local commits)

You are currently cherry-picking commit ce6311a7a.
  (fix conflicts and run "git cherry-pick --continue")
  (use "git cherry-pick --skip" to skip this patch)
  (use "git cherry-pick --abort" to cancel the cherry-pick operation)

Changes to be committed:
	modified:   zebra/zebra_nhg.c

Unmerged paths:
  (use "git add <file>..." to mark resolution)
	both modified:   zebra/zebra_nhg.h
	both modified:   zebra/zebra_rib.c

To fix up this pull request, you can check it out locally. See documentation: https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/reviewing-changes-in-pull-requests/checking-out-pull-requests-locally

Member

@riw777 riw777 left a comment

looks good ... I don't think the AI comments are a concern on this one

@riw777 riw777 merged commit f5bc4ec into FRRouting:master Feb 11, 2026
25 checks passed
donaldsharp added a commit that referenced this pull request Feb 11, 2026
Zebra fixup nhg handling from kernel (backport #20732)
donaldsharp added a commit that referenced this pull request Feb 11, 2026
Zebra fixup nhg handling from kernel (backport #20732)
donaldsharp added a commit that referenced this pull request Feb 11, 2026
Zebra fixup nhg handling from kernel (backport #20732)
donaldsharp added a commit that referenced this pull request Feb 11, 2026
Zebra fixup nhg handling from kernel (backport #20732)