frr: resync grout state when this one restarts (#522)
Conversation
📝 Walkthrough
The changes add mutex-based thread-safety to if_map (introducing a global ifindex mutex and a new public function).
🧹 Nitpick comments (1)
frr/zebra_dplane_grout.c (1)
267-268: Manual bitfield reset; consider using a helper if available.
The manual `memset` and `n = 0` reset works correctly, but FRR may have a `bf_reset()` or similar macro for cleaner bitfield clearing. If not available, this approach is acceptable.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@frr/zebra_dplane_grout.c` around lines 267-268: the manual clearing of the bitfield (memset on grout_ctx.sync_vrf.data and setting grout_ctx.sync_vrf.n = 0) should be replaced with the project's bitfield reset helper if one exists; locate the bitfield API used elsewhere (e.g., bf_reset, BITFIELD_RESET, bitfield_reset) and call that with grout_ctx.sync_vrf (or its address) instead of the raw memset and n = 0; otherwise leave the current code as-is if no helper is available.
Force-pushed from 5ba4b9e to b679d81.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@frr/if_map.c`:
- Around lines 118-125: The early return inside the frr_with_mutex block can leave the mutex locked. Instead, perform the find and record the outcome in a local boolean (e.g., found) or pointer variable, and do not return from inside frr_with_mutex. Specifically, call grout_to_frr_find(&grout_to_frr_mappings, &key) inside frr_with_mutex, store the outcome in a local variable, only call grout_to_frr_del and frr_to_grout_del if found, and move the final return (false/true) after the frr_with_mutex block so the mutex is always released by frr_with_mutex.
- Around lines 66-86: The code inside the frr_with_mutex(...) block returns early on the error paths (after the calls to grout_to_frr_add and frr_to_grout_add), which prevents the frr_with_mutex loop from executing its unlock step. Change the logic to use a local bool result (e.g., success = true) and set it to false on the error conditions, perform the necessary cleanup calls (grout_to_frr_del, XFREE) as currently done, then break out of the frr_with_mutex block and return the local result after the block. Reference the existing symbols grout_to_frr_add, frr_to_grout_add, grout_to_frr_del, XFREE, mapping and the frr_with_mutex macro when making the change.
grout_ctx.client is shared between the dplane thread (grout_client_send_recv) and the main thread (sync functions) with no synchronization. gr_api_client has no locking: its response queue and the static message_id counter in gr_api_client_send can be corrupted by concurrent access. Most of the time this is not an issue, as the initial sync completes before the dplane thread starts processing, but as soon as interfaces are created in zebra, the dplane thread may start sending requests concurrently with the main thread's streaming reads.

Add a dedicated sync_client to grout_ctx for exclusive use by the main thread sync functions (grout_sync_ifaces, grout_sync_ifaces_addresses, grout_sync_nhs, grout_sync_routes). The existing client field remains for the dplane thread.

Signed-off-by: Maxime Leroy <maxime@leroys.fr>
Reviewed-by: Robin Jarry <rjarry@redhat.com>
The ifindex/VRF mapping hash tables are accessed from both the main thread (grout_sync_ifaces via grout_link_change, grout_sync_* via vrf_grout_to_frr lookups) and the dplane thread (zd_grout_process via grout_add_del_route, grout_add_del_nexthop, grout_add_del_address). FRR hash tables (DECLARE_HASH) are not thread-safe: concurrent add, del or clear from one thread while the other traverses the hash can corrupt the internal structure or cause a use-after-free on the ifindex_mapping entries. This race exists from initial startup, since grout_sync_ifaces populates the mappings on the main thread while the dplane provider may already be processing queued requests that perform lookups on the dplane thread.

Add a pthread_mutex_t to serialize all hash table operations.

Signed-off-by: Maxime Leroy <maxime@leroys.fr>
Reviewed-by: Robin Jarry <rjarry@redhat.com>
grout_sync_ifaces and grout_sync_addrs (renamed from grout_sync_ifaces_addresses) were running on the zebra main thread instead of the dplane thread. Unlike the kernel datapath, where the initial netlink sync completes before the dplane thread starts, the grout plugin syncs via the frr_config_post hook, which fires after the dplane thread is already running. At that point, dplane_read_notifications is actively processing iface and address events, so the sync must run on the same thread to avoid concurrent access.

Schedule grout_sync_ifaces and grout_sync_addrs on dplane_get_thread_master() so that API calls using the sync_client happen on the thread that owns it.

Additionally, all VRFs were being synced concurrently by scheduling one event per VRF at the end of grout_sync_ifaces. Replace this with a sequential chain: grout_sync_ifaces starts the first VRF, and grout_sync_routes chains to the next VRF when done. This avoids interleaving API requests from multiple VRFs and ensures that the single dg_t_dplane_sync event pointer is not overwritten. Use a bitfield instead of a stack-allocated bool array to track which VRFs need syncing, so the set persists across the chained callbacks.

Signed-off-by: Maxime Leroy <maxime@leroys.fr>
Reviewed-by: Robin Jarry <rjarry@redhat.com>
When grout restarts, the zebra notification socket receives EOF. This triggers grout_reconnect, which flushes all FRR state (VRFs, ifindex mappings), resets the namespace, and schedules grout_sync.

grout_sync is the single entry point for connecting the main thread API client (sync_client) to grout. It retries every second until grout becomes available, then schedules notification channels and the interface sync chain. This same function is used at startup via the frr_config_post hook, allowing FRR to start before grout is ready. Connection failures are logged so operators can tell why FRR has not synced yet.

Notification connect functions (dplane_grout_connect, zebra_grout_connect) no longer self-retry but schedule grout_reconnect on failure so the full sync is retried. grout_notif_subscribe disconnects any stale client before reconnecting, since the ordering between EOF processing and reconnect scheduling is not deterministic.

grout_client_send_recv no longer attempts inline reconnection. If the dplane client connection is lost, it fails the request, disconnects the client, and schedules grout_reconnect so recovery does not depend solely on the notification EOF path.

Signed-off-by: Maxime Leroy <maxime@leroys.fr>
Reviewed-by: Robin Jarry <rjarry@redhat.com>
When grout restarts, grout_ns_reset() tears down all VRFs and their interfaces via vrf_terminate(). Routes and nexthops are properly cleaned up through zebra_vrf_disable(), but interfaces are destroyed via if_delete(), which never calls if_down().

In normal FRR operation with the kernel backend, if_down() has already been called in response to a netlink RTM_DELLINK notification before the interface is destroyed. With grout, there is no kernel notification, so global state referencing the interface (such as the RA timer wheel) is not cleaned up before the interface is freed. In particular, BGP unnumbered enables RA on peering interfaces, which adds them to zrouter.ra_wheel. After the reset, the wheel timer thread calls process_rtadv() on a stale pointer, causing a SIGSEGV.

Call if_down() on all interfaces before tearing down namespaces and VRFs.

Signed-off-by: Maxime Leroy <maxime@leroys.fr>
Reviewed-by: Robin Jarry <rjarry@redhat.com>