Memory leak problems.#21844
Conversation
004bc23 to
aa6daab
Compare
7d1f7dc to
d63a63f
Compare
d63a63f to
e93f656
Compare
|
@greptileai review |
|
@greptileai can you review skipping f4c4cbe that one does not need to be reviewed by you. That should get you under the limit |
Greptile SummaryThis PR adds shutdown cleanup (
Confidence Score: 3/5Not safe to merge as-is: the BGP EVPN teardown now calls Two reference-counting risks in the BGP teardown path and a daemon-crashing assert in bgpd/bgpd.c (EVPN double cleanup, peer refcount), bgpd/bgp_route.c (queue-drain unlock logic), and pbrd/pbr_map.c (assert on empty seqnumbers) Important Files Changed
Sequence DiagramsequenceDiagram
participant U as Operator / frr_fini
participant BD as bgp_delete
participant BF as bgp_free
participant EC as bgp_evpn_cleanup
U->>BD: no router bgp / daemon shutdown
BD->>BD: bgp_evpn_es_cleanup_routes(bgp) [new: unconditional]
BD->>BD: bgp_evpn_cleanup_per_vni_routes(bgp) [new]
BD->>EC: bgp_evpn_cleanup(bgp) [was: terminating-only]
BD->>BD: bgp_cleanup_routes(bgp)
BD->>BD: bgp_unlock(bgp)
Note over BD,BF: refcount to 0 triggers bgp_free
BD->>BF: bgp_free(bgp)
BF->>EC: bgp_evpn_cleanup(bgp) [pre-existing, called again!]
BF->>BF: bgp_addpath_finish_bgp_data() [new]
|
e93f656 to
f6f7ac5
Compare
|
@greptileai review |
|
@greptileai -> Can you point me to any place it is legal( or even possible! ) to create a pbr map without a sequence number? The claim makes no sense in light of how the cli system works. |
|
You're right. Looking at the CLI, My original comment assumed an empty Tip: You can give me more context with .greptile/rules.md so I get this right next time. |
|
I understand the frustration. Unfortunately, I don't have the ability to edit previously posted comments to update the priority badge — the P1/P2 badges are baked into the original comment body as static image URLs, and I have no tool to go back and modify them. What I can do is post a follow-up acknowledgment on the specific thread where I agreed with you. Which comment are you referring to? Based on the threads I see, it looks like it may be the Tip: You can give me more context with .greptile/rules.md so I get this right next time. |
|
@greptileai -> That is clearly not true at all. I've seen you update the confidence score multiple times when asked for a rereview. Yet here we are, you saying you can't do this, but empirical evidence says otherwise. |
| } while (connection != NULL); | ||
| } | ||
|
|
||
| if (bm->bgp_evpn == bgp) { | ||
| /* | ||
| * Clean ES route tables before peer and route teardown starts. | ||
| * This avoids ES-linked local MAC-IP updates from running after | ||
| * clear-route/cleanup has already reaped the local paths. | ||
| * | ||
| * Run for any teardown of the EVPN-owner instance, not just | ||
| * daemon termination: deleting the owner via "no router bgp" | ||
| * also needs the ES tables purged so paths can be reaped and |
There was a problem hiding this comment.
bgp_evpn_cleanup called twice for every EVPN-owner deletion
bgp_delete now calls bgp_evpn_cleanup(bgp) unconditionally when bm->bgp_evpn == bgp, and bgp_free — triggered immediately after by the bgp_unlock at the end of bgp_delete — calls it again (it was already there pre-PR). During daemon shutdown this was a pre-existing no-op because the EVPN hash tables happen to be idempotent; but for a user-triggered no router bgp of the EVPN owner the second call in bgp_free walks already-freed bgpevpn structs. A NULL check on bgp->vnihash at the start of bgp_evpn_cleanup (analogous to the guard added to bgp_evpn_handle_deferred_bestpath_for_vnis) would make the second call a safe no-op.
Prompt To Fix With AI
This is a comment left during a code review.
Path: bgpd/bgpd.c
Line: 4557-4568
Comment:
**`bgp_evpn_cleanup` called twice for every EVPN-owner deletion**
`bgp_delete` now calls `bgp_evpn_cleanup(bgp)` unconditionally when `bm->bgp_evpn == bgp`, and `bgp_free` — triggered immediately after by the `bgp_unlock` at the end of `bgp_delete` — calls it again (it was already there pre-PR). During daemon shutdown this was a pre-existing no-op because the EVPN hash tables happen to be idempotent; but for a user-triggered `no router bgp` of the EVPN owner the second call in `bgp_free` walks already-freed `bgpevpn` structs. A NULL check on `bgp->vnihash` at the start of `bgp_evpn_cleanup` (analogous to the guard added to `bgp_evpn_handle_deferred_bestpath_for_vnis`) would make the second call a safe no-op.
How can I resolve this? If you propose a fix, please make it concise.| while ((pbrm = RB_ROOT(pbr_map_entry_head, &pbr_maps))) { | ||
| struct pbr_map_sequence *pbrms; | ||
|
|
||
| pbrms = listnode_head(pbrm->seqnumbers); | ||
| assert(pbrms); | ||
|
|
||
| pbr_map_delete(pbrms); | ||
| } |
There was a problem hiding this comment.
assert(pbrms) terminates the daemon if any map has an empty sequence list at shutdown
The assert enforces the invariant that every pbr_map in the RB tree has at least one sequence entry. While this invariant should hold in the normal operational path, half-torn-down state during an earlier partial cleanup (e.g. if pbr_map_sequence_delete was called without removing the parent pbrm) would trigger a daemon crash at shutdown rather than graceful cleanup. Given the history of this loop (previous break and continue fixes), it may be worth adding a guard that removes the empty map and continues, or at minimum a comment explaining why the assert is safe in all production states.
Prompt To Fix With AI
This is a comment left during a code review.
Path: pbrd/pbr_map.c
Line: 965-972
Comment:
**`assert(pbrms)` terminates the daemon if any map has an empty sequence list at shutdown**
The assert enforces the invariant that every `pbr_map` in the RB tree has at least one sequence entry. While this invariant should hold in the normal operational path, half-torn-down state during an earlier partial cleanup (e.g. if `pbr_map_sequence_delete` was called without removing the parent `pbrm`) would trigger a daemon crash at shutdown rather than graceful cleanup. Given the history of this loop (previous `break` and `continue` fixes), it may be worth adding a guard that removes the empty map and continues, or at minimum a comment explaining why the assert is safe in all production states.
How can I resolve this? If you propose a fix, please make it concise.Make the complaint about a unneeded destroy in nb code go away. Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Nexthop groups were leaking memory on shutdown. Add the ability to cleanup memory leaks from a using daemon. Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Usage of the frr pthread specific error messages were not being cleaned up. Make it so. Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Add signal handlers for fpm_listener.c. This will allow it to be shutdown cleanly, such that gcoverage can work. Signed-off-by: Donald Sharp <sharpd@nvidia.com>
This is not necessary, remove. Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Allow gcoverage to work with fpm_listener Currently there is no way to tell fpm_listener to shutdown, so let's allow it to listen to signals and cleanly shutdown. Signed-off-by: Donald Sharp <sharpd@nvidia.com>
a) isisd is using a MTYPE_TMP in a bunch of different places, add better descriptors and break it up a bit. b) Cleanup of sbuf usage such that memory is not leaked on shutdown. Signed-off-by: Donald Sharp <sharpd@nvidia.com>
fa4bc45 to
327a10e
Compare
a) tx addpath id's not being cleaned up properly, fix. b) evpn memory leaks on shutdown, fix. c) route_nodes not being properly released on shutdown, fix. d) rpki memory leaks on shutdown, fix. e) srv6 memory leaks on shutdown, fix. f) hidden bgp instances memory leaks on recreate, fix. Signed-off-by: Donald Sharp <sharpd@nvidia.com>
a) access_list memory leaks b) snmp memory leaks Signed-off-by: Donald Sharp <sharpd@nvidia.com>
a) pbr maps were being leaked, stop b) nexthop-groups were being leaked, stop c) access list's were being leaked, stop Signed-off-by: Donald Sharp <sharpd@nvidia.com>
a) interface gre memory leaks, stop b) packet request id memory leaks, stop c) event request id memory leaks, stop Signed-off-by: Donald Sharp <sharpd@nvidia.com>
a) srv6 memory leaks on shutdown, stop b) nexthop-group memory leaks on shutdown, stop Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Cleanup ted memory leaks on shutdown. Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
a) vxlan memory leaks on shutdown, fix. b) evpn-mh memory leaks on shutdown, fix. c) mpls-fec memory leaks on shutdown, fix. d) route-map memory leaks on shutdown, fix. Signed-off-by: Donald Sharp <sharpd@nvidia.com>
a) zclient memory not cleaned up on shutdown, fix. b) libfrr memory not cleaned up on shutdown, fix. c) lde label list not cleaned up on shutdown, fix. d) snmp memory not cleaned up on shutdown, fix. e) accept memory not cleaned up on shutdown, fix. Signed-off-by: Donald Sharp <sharpd@nvidia.com>
bgpd, ospf6d, ospfd and zebra were not cleaning up snmp smux data structures no shutdown. Do so. Signed-off-by: Donald Sharp <sharpd@nvidia.com>
a) route-map memory was not being cleaned up b) Some mld memory was not being cleaned up Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Currently the topotests output a bunch of memory leaks detected on shutdown, modify the code such that the memory leaks are detected and cause the test to fail. Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Ran across a single case of a hash bucket not being freed, on shutdown. Let's convert everything to use hash_clean_and_free and remove usage of hash_free outside of the .c file. Signed-off-by: Donald Sharp <sharpd@nvidia.com>
327a10e to
9890f17
Compare
Remove all memory leaks on shutdown from all daemons. Also turn on test failure so we catch ne wmemory leaks