[flexible-ipam] Fix IP leak issues #3314
Conversation
Codecov Report
@@ Coverage Diff @@
## main #3314 +/- ##
==========================================
- Coverage 62.00% 52.21% -9.79%
==========================================
Files 266 239 -27
Lines 26546 34224 +7678
==========================================
+ Hits 16460 17871 +1411
- Misses 8281 14645 +6364
+ Partials 1805 1708 -97
Flags with carried forward coverage won't be shown.
Force-pushed from c514e66 to 02bfa3b
Could you explain more about which leak is fixed? We probably need some comments in the code to explain the added logic too.
I feel that the delete operation has become a bit complex. One cause of the complexity is that it is not a trivial task to determine whether the event is owned by Antrea IPAM. To determine that, we inspect Pod annotations (possibly retrieving pools by podIndex), and then Namespace annotations. Then, if Antrea IPAM owns the Pod, the search for the corresponding container is run once again within the pool in order to free the address. So we have some degree of duplicated logic that complicates reading and debugging the code.
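The lookup order described above (Pod annotations first, then Namespace annotations) can be sketched as below. The annotation key and function names here are illustrative assumptions, not Antrea's actual API:

```go
package main

import "fmt"

// Hypothetical annotation key; the real Antrea key may differ.
const ipPoolAnnotationKey = "ipam.antrea.io/ippools"

// poolForPod sketches the ownership check the comment describes: the Pod's own
// annotations are consulted first, with the Namespace's annotations as fallback.
// A missing key in both means the Pod is not owned by Antrea IPAM.
func poolForPod(podAnnotations, nsAnnotations map[string]string) (string, bool) {
	if pool, ok := podAnnotations[ipPoolAnnotationKey]; ok {
		return pool, true
	}
	if pool, ok := nsAnnotations[ipPoolAnnotationKey]; ok {
		return pool, true
	}
	return "", false // not owned by Antrea IPAM
}

func main() {
	pod := map[string]string{ipPoolAnnotationKey: "pool-a"}
	ns := map[string]string{ipPoolAnnotationKey: "pool-b"}
	fmt.Println(poolForPod(pod, ns)) // pool-a true: the Pod annotation wins
	fmt.Println(poolForPod(nil, ns)) // pool-b true: falls back to the Namespace
	fmt.Println(poolForPod(nil, nil))
}
```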
@annakhm
Force-pushed from 02bfa3b to c48f814
Added more code comments.
Force-pushed from c48f814 to ffb4025
/test-all
-	mine, allocator, _, _, err := d.owns(k8sArgs)
+	// When AntreaIPAM.owns() is called for a NodeIPAM Pod, the IPPool cannot be found for the Pod. We pass
+	// swallowNotFoundError to the call to swallow the "NotFound" error.
+	mine, allocator, _, _, err := d.owns(k8sArgs, swallowNotFoundError)
I'm not sure why it should be fixed in this way. I have two questions:
- When a Pod is deleted, isn't CNI DEL normally called before the Pod is removed from the apiserver? I can imagine that in some cases the Pod could already be gone when CNI DEL is handled, e.g. when the Pod's delete grace period is set to 0, but is that your case?
- Should IPPool NotFound and Pod NotFound be handled differently? If the Pod is not found, I think we are not sure whether it is owned by AntreaIPAM, and we should return "mine" as false. If the Pod is found but the IPPool is not, it should at least log an error (since we don't allow deleting an IPPool that is in use, this is not expected). But in any case that is for an AntreaIPAM Pod, not as the comment explains?
- From the e2e test, we can confirm that the Pod may not exist when CNI DEL is called. Thus we need to handle this case.
- Nice catch. I think it is harmless if we process `IPPoolNotFound` as in the current implementation, which returns mine=false. Do you think we should check whether `ErrStatus.Details.Kind` is `Pod`?
For 2, it would invoke an unnecessary external call and hide a mistake that might exist in the code base.
I think the problem is that the current return value from `owns` is not clear, so its callers have to inject a lot of code for their own business logic. For example:
When it's called by `Add`, it's expected to return `mine` as `true` when the Pod is not found, so `Add` can fail immediately; but the actual logic is: when the Pod is not found, we don't know which IPAM owns it, and its creation should be deferred until the Pod is received, or, if the Pod has been deleted from the API, we won't receive more requests anyway.
When it's called by `Del`, it's expected to return `mine` as `false` when the Pod is not found, so `Del` can call the other IPAM plugins to ensure it's cleaned up completely.
Injecting this logic into the `owns` method makes it obscure to understand why `owns` returns different values in different contexts for the same situation.
To put the business logic back in the caller, I think `owns` should be clearer about what it returns: it should only return `mine` as `true` or `false` when it really knows, and otherwise return `unknown`. `Add` should call the next plugin only for `false`; `Del` should call the next plugin for both `unknown` and `false`.
In this way, we don't mix PoolNotFound and PodNotFound:
- For `Add`, if the Pod is found but the Pool is not, `owns` should return `mine` as `true` and an error indicating PoolNotFound; the request fails immediately with a clear message about PoolNotFound.
- For `Add`, if the Pod is not found, `owns` should return `unknown` and an error indicating PodNotFound; the request fails immediately with a clear message about PodNotFound.
- For `Del`, if the Pod is found but the Pool is not, `owns` should return `mine` as `true` and an error indicating PoolNotFound; the request terminates and an error should be logged for this unexpected situation.
- For `Del`, if the Pod is not found, `owns` should return `unknown` and an error indicating PodNotFound; `Del` should call the next plugin until one returns `true`.
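The tri-state scheme proposed in this comment can be sketched as follows. The type and function names are illustrative, not Antrea's actual code; the policy encoded is exactly the one described: `Add` delegates only on a definite "not mine" and fails fast on "unknown", while `Del` delegates on both so cleanup is exhaustive:

```go
package main

import "fmt"

// ownership is a hypothetical tri-state return for owns(): decide only when
// we really know, otherwise report unknown.
type ownership int

const (
	notMine ownership = iota // Pod found, managed by another IPAM
	mine                     // Pod found, managed by AntreaIPAM
	unknown                  // Pod not found: ownership cannot be decided
)

// shouldCallNextPlugin encodes the caller-side policy:
// Add delegates only on notMine (it fails fast on unknown);
// Del delegates on both notMine and unknown to guarantee cleanup.
func shouldCallNextPlugin(isDel bool, o ownership) bool {
	switch o {
	case notMine:
		return true
	case unknown:
		return isDel
	default: // mine: this plugin handles the request itself
		return false
	}
}

func main() {
	fmt.Println(shouldCallNextPlugin(false, unknown)) // false: Add fails fast
	fmt.Println(shouldCallNextPlugin(true, unknown))  // true: Del keeps cleaning up
	fmt.Println(shouldCallNextPlugin(true, mine))     // false: Del handles it here
}
```

Keeping this decision in the callers, as the comment suggests, means `owns` no longer needs to return different answers for the same situation depending on who asked.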
Thanks. Refactored this part and added comments.
Force-pushed from 723b99c to 248b98d
Force-pushed from 77fd690 to 3f23722
Force-pushed from 50d5cdc to 4d37741
[flexible-ipam] This commit fixed 2 IP leak issues: 1. AntreaIPAM enabled and the NodeIPAM Pods were deleted. 2. Incoming CniDel request when the agent restarts. Signed-off-by: gran <gran@vmware.com>
/test-all
This commit fixed 2 IP leak issues in the situations below, with AntreaIPAM enabled:
1. NodeIPAM Pods were deleted.
2. AntreaIPAM Pods were deleted while the agent was restarting.
This PR closes #3333 and closes #3384
Signed-off-by: gran <gran@vmware.com>