Zebra fixup nhg handling from kernel#20732
Conversation
The decoding of the netlink message into a dplane ctx is storing the nhg id not the nhe_id. Let's actually retrieve the right value. Signed-off-by: Donald Sharp <sharpd@nvidia.com>
FRR is currently receiving routes from the kernel that have nexthop groups but is ignoring those nexthop groups and is creating new ones that do not actually match what is in the kernel. Modify the code such that we track the kernel `found` nexthop groups and we allow routes received from the kernel that use those nhg's actually have the right values. Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Create a test that shows when receiving nhg's from the kernel and routes using those nhg's that they are handled correctly in zebra. Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Greptile OverviewGreptile SummaryThis PR improves zebra’s handling of nexthop-group (NHG) IDs received from the kernel by (1) reading the NHG ID from the dplane context when decoding netlink route updates, and (2) adding a new NHG flag ( Main concern is that Confidence Score: 3/5
Important Files Changed
Sequence DiagramsequenceDiagram
participant Kernel
participant Netlink as zebra/rt_netlink.c
participant DPlane as zebra_dplane_ctx
participant RIB as zebra_rib.c
participant NHG as zebra_nhg.c
participant VTY as zebra_vty.c
Kernel->>Netlink: "RTM_NEWROUTE (nhid / nexthops)"
Netlink->>DPlane: "Populate ctx (zd_nhg_id, nhe.id, ng)"
Netlink->>Netlink: "nhe_id = dplane_ctx_get_nhg_id(ctx)"
Netlink->>RIB: "rib_add_multipath(..., re->nhe_id = nhe_id)"
alt "Route uses explicit NHG id"
RIB->>RIB: "Mark NHE RECEIVED_FROM_EXTERNAL"
RIB->>NHG: "zebra_nhe_copy(&nhe, 0)"
else "Route uses traditional via"
RIB->>NHG: "zebra_nhe_copy(&nhe, 0)"
end
Kernel->>Netlink: "RTM_NEWNEXTHOP (NHG object)"
Netlink->>NHG: "nhg_ctx_process_new(ctx)"
NHG->>NHG: "SET_FLAG(RECEIVED_FROM_EXTERNAL)"
alt "PROTO_OWNED or RECEIVED_FROM_EXTERNAL"
NHG->>NHG: "proto_nhg_nexthop_active_update()"
else "zebra-owned"
NHG->>NHG: "refresh ACTIVE nexthops"
end
VTY->>NHG: "Query route/NHG summary"
NHG-->>VTY: "Include receivedNexthopGroupId"
|
| if (PROTO_OWNED(re->nhe) || | ||
| CHECK_FLAG(re->nhe->flags, NEXTHOP_GROUP_RECEIVED_FROM_EXTERNAL)) | ||
| return proto_nhg_nexthop_active_update(&re->nhe->nhg); |
There was a problem hiding this comment.
Kernel NHGs treated as proto-owned
nexthop_active_update() routes any NHG with NEXTHOP_GROUP_RECEIVED_FROM_EXTERNAL through proto_nhg_nexthop_active_update(). That path is meant for protocol-owned NHGs (checked via PROTO_OWNED()), and it skips the normal zebra active-resolution flow that copies the NHE and runs nexthop_list_active_update()/reinstall logic. For kernel-originated NHGs, this can leave route entries with stale ACTIVE nexthop state when interfaces/recursion change. Consider handling RECEIVED_FROM_EXTERNAL separately (i.e., don’t reuse the proto-owned update path).
Prompt To Fix With AI
This is a comment left during a code review.
Path: zebra/zebra_nhg.c
Line: 3194:3196
Comment:
**Kernel NHGs treated as proto-owned**
`nexthop_active_update()` routes any NHG with `NEXTHOP_GROUP_RECEIVED_FROM_EXTERNAL` through `proto_nhg_nexthop_active_update()`. That path is meant for protocol-owned NHGs (checked via `PROTO_OWNED()`), and it skips the normal zebra active-resolution flow that copies the NHE and runs `nexthop_list_active_update()`/reinstall logic. For kernel-originated NHGs, this can leave route entries with stale ACTIVE nexthop state when interfaces/recursion change. Consider handling `RECEIVED_FROM_EXTERNAL` separately (i.e., don’t reuse the proto-owned update path).
How can I resolve this? If you propose a fix, please make it concise.| logger.info("Traditional prefix %s missing in summary output", prefix) | ||
| return False | ||
|
|
||
| nhg_40 = route_json["10.40.0.0/24"][0].get("receivedNexthopGroupId") | ||
| nhg_60 = route_json["10.60.0.0/24"][0].get("receivedNexthopGroupId") | ||
| if nhg_40 != nhg_60: |
There was a problem hiding this comment.
Assumes stable NHG deduping
This test asserts that two “traditional” routes (10.40/24 and 10.60/24) that share the same gateway must have the same receivedNexthopGroupId. That’s not guaranteed by zebra itself (the ID is kernel-reported) and can vary by kernel version/config depending on whether/when it deduplicates/merges NHGs for identical nexthops. This can make the test fail even when zebra is correctly reporting kernel NHG IDs.
Prompt To Fix With AI
This is a comment left during a code review.
Path: tests/topotests/zebra_kernel_nhg/test_zebra_kernel_nhg.py
Line: 122:127
Comment:
**Assumes stable NHG deduping**
This test asserts that two “traditional” routes (10.40/24 and 10.60/24) that share the same gateway must have the same `receivedNexthopGroupId`. That’s not guaranteed by zebra itself (the ID is kernel-reported) and can vary by kernel version/config depending on whether/when it deduplicates/merges NHGs for identical nexthops. This can make the test fail even when zebra is correctly reporting kernel NHG IDs.
How can I resolve this? If you propose a fix, please make it concise.|
@Mergifyio backport dev/10.6 stable/10.5 stable/10.4 stable/10.3 stable/10.2 stable/10.1 stable/10.0 |
✅ Backports have been createdDetails
Cherry-pick of 4c55e4f has failed: To fix up this pull request, you can check it out locally. See documentation: https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/reviewing-changes-in-pull-requests/checking-out-pull-requests-locally
Cherry-pick of 4c55e4f has failed: Cherry-pick of ce6311a has failed: To fix up this pull request, you can check it out locally. See documentation: https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/reviewing-changes-in-pull-requests/checking-out-pull-requests-locally
Cherry-pick of 4c55e4f has failed: Cherry-pick of ce6311a has failed: To fix up this pull request, you can check it out locally. See documentation: https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/reviewing-changes-in-pull-requests/checking-out-pull-requests-locally |
riw777
left a comment
There was a problem hiding this comment.
looks good ... I don't think the AI comments are a concern on this one
Zebra fixup nhg handling from kernel (backport #20732)
Zebra fixup nhg handling from kernel (backport #20732)
Zebra fixup nhg handling from kernel (backport #20732)
Zebra fixup nhg handling from kernel (backport #20732)
Zebra was not properly handling received nhg's from the kernel. Make it right