
[flexible-ipam] Fix IP leak issues #3314

Merged
merged 1 commit into antrea-io:main from ipam-delfix on Mar 21, 2022

Conversation

gran-vmv
Contributor

@gran-vmv gran-vmv commented Feb 14, 2022

This commit fixes two IP leak issues in the following situations with AntreaIPAM enabled:
1. NodeIPAM Pods are deleted.
2. AntreaIPAM Pods are deleted while the agent restarts.

This PR closes #3333 and closes #3384

Signed-off-by: gran <gran@vmware.com>

@gran-vmv gran-vmv marked this pull request as draft February 14, 2022 09:54
@gran-vmv gran-vmv marked this pull request as ready for review February 14, 2022 09:58
@codecov-commenter

codecov-commenter commented Feb 14, 2022

Codecov Report

Merging #3314 (9bff07a) into main (78e3583) will decrease coverage by 9.78%.
The diff coverage is 44.53%.

❗ Current head 9bff07a differs from pull request most recent head d9470f5. Consider uploading reports for the commit d9470f5 to get more accurate results

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #3314      +/-   ##
==========================================
- Coverage   62.00%   52.21%   -9.79%     
==========================================
  Files         266      239      -27     
  Lines       26546    34224    +7678     
==========================================
+ Hits        16460    17871    +1411     
- Misses       8281    14645    +6364     
+ Partials     1805     1708      -97     
Flag Coverage Δ
e2e-tests 52.21% <44.53%> (?)
kind-e2e-tests ?
unit-tests ?

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
pkg/agent/agent.go 50.12% <0.00%> (-1.22%) ⬇️
pkg/agent/apiserver/handlers/ovsflows/handler.go 13.48% <ø> (-61.52%) ⬇️
pkg/agent/cniserver/ipam/antrea_ipam.go 3.40% <0.00%> (-75.77%) ⬇️
pkg/agent/cniserver/ipam/antrea_ipam_controller.go 0.00% <0.00%> (-79.77%) ⬇️
pkg/agent/controller/networkpolicy/packetin.go 68.34% <ø> (+1.19%) ⬆️
pkg/agent/ipassigner/responder/ndp_responder.go 0.00% <0.00%> (ø)
pkg/agent/openflow/network_policy.go 64.02% <0.00%> (-19.22%) ⬇️
pkg/agent/openflow/pipeline_other.go 3.44% <0.00%> (+1.27%) ⬆️
pkg/agent/proxy/metrics/metrics.go 100.00% <ø> (ø)
pkg/agent/types/networkpolicy.go 81.08% <ø> (-2.26%) ⬇️
... and 299 more

Contributor

@jianjuns jianjuns left a comment

Could you explain more about what leak is fixed? We probably need some comments in the code to explain the added code too.

@annakhm
Contributor

annakhm commented Feb 16, 2022

I feel that the delete operation has become a bit complex. One cause of the complexity is that it is not a trivial task to determine whether the event is owned by Antrea IPAM. To determine that, we inspect Pod annotations (possibly retrieving pools by podIndex), and then Namespace annotations. Then, if Antrea IPAM owns the Pod, the search for the corresponding container is run once again within the pool in order to free the address. So we have some degree of duplicated logic that complicates reading the code and debugging.
I would like to run an alternative approach by you: is it a viable idea to maintain a store that maps container IDs to pool names? Such a store would fast-track the check/delete operations. It could either be populated on agent startup for the current Node, based on pool statuses (we might need to add a Node indication in the pool status, which could be useful info regardless), or be a disk store like in the host-local plugin.
For potential duplicate delete requests we can have two solutions: either let the request propagate to the host-local plugin (no harm done, but not clean), or keep the Antrea-IPAM-owned containers in the store for a period of time and then garbage collect them.
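
A minimal sketch of the proposed store, assuming only a containerID-to-pool-name mapping plus time-based garbage collection is needed; the type and method names here (ContainerPoolStore, MarkDeleted, GC) are hypothetical and not part of the Antrea code base:

```go
// Package ipamstore sketches an in-memory containerID -> IPPool name store.
package ipamstore

import (
	"sync"
	"time"
)

type entry struct {
	poolName  string
	deletedAt time.Time // zero value means the container is still active
}

// ContainerPoolStore maps container IDs to the IPPool that owns them.
type ContainerPoolStore struct {
	mu      sync.RWMutex
	entries map[string]entry
}

func NewContainerPoolStore() *ContainerPoolStore {
	return &ContainerPoolStore{entries: map[string]entry{}}
}

// Add records which IPPool owns the given container.
func (s *ContainerPoolStore) Add(containerID, poolName string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.entries[containerID] = entry{poolName: poolName}
}

// Get returns the pool name for a container and whether it is known.
func (s *ContainerPoolStore) Get(containerID string) (string, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	e, ok := s.entries[containerID]
	return e.poolName, ok
}

// MarkDeleted keeps the entry around so a duplicate CNI DEL can still be
// recognized as Antrea-IPAM-owned instead of falling through to host-local.
func (s *ContainerPoolStore) MarkDeleted(containerID string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if e, ok := s.entries[containerID]; ok {
		e.deletedAt = time.Now()
		s.entries[containerID] = e
	}
}

// GC removes entries that were marked deleted more than ttl ago.
func (s *ContainerPoolStore) GC(ttl time.Duration) {
	s.mu.Lock()
	defer s.mu.Unlock()
	for id, e := range s.entries {
		if !e.deletedAt.IsZero() && time.Since(e.deletedAt) > ttl {
			delete(s.entries, id)
		}
	}
}
```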

@gran-vmv
Contributor Author

@annakhm
We are already using a PodIndexer on the IPPool cache, so I don't think we need another cache for this.
https://github.com/antrea-io/antrea/blob/main/pkg/agent/cniserver/ipam/antrea_ipam_controller.go#L136
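
For reference, a minimal sketch of how a Pod index over an IPPool cache can answer "which pool owns this Pod" directly, using client-go's cache.Indexer; the simplified ipPool struct and its fields below are hypothetical stand-ins for the real IPPool objects indexed in antrea_ipam_controller.go:

```go
package main

import (
	"fmt"

	"k8s.io/client-go/tools/cache"
)

// podOwner and ipPool are simplified, hypothetical shapes of the cached data.
type podOwner struct {
	Namespace string
	Name      string
}

type ipPool struct {
	Name   string
	Owners []podOwner // Pods that currently hold addresses from this pool
}

const podIndex = "pod"

func main() {
	indexer := cache.NewIndexer(
		// Key each cached object by the pool name.
		func(obj interface{}) (string, error) {
			return obj.(*ipPool).Name, nil
		},
		cache.Indexers{
			// Index each pool under every "namespace/name" Pod key it owns.
			podIndex: func(obj interface{}) ([]string, error) {
				pool := obj.(*ipPool)
				keys := make([]string, 0, len(pool.Owners))
				for _, o := range pool.Owners {
					keys = append(keys, o.Namespace+"/"+o.Name)
				}
				return keys, nil
			},
		},
	)

	_ = indexer.Add(&ipPool{
		Name:   "pool-a",
		Owners: []podOwner{{Namespace: "default", Name: "web-0"}},
	})

	// CNI DEL path: look up the owning pool directly from the Pod key.
	pools, _ := indexer.ByIndex(podIndex, "default/web-0")
	for _, p := range pools {
		fmt.Println("owning pool:", p.(*ipPool).Name)
	}
}
```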

@gran-vmv
Contributor Author

Could you explain more about what leak is fixed? We probably need some comments in the code to explain the added code too.

Added more code comments.

@gran-vmv
Contributor Author

/test-all
/test-flexible-ipam-e2e
/test-ipv6-all
/test-ipv6-only-all
/test-windows-all
/test-multicluster-e2e

jianjuns previously approved these changes Feb 18, 2022
annakhm previously approved these changes Feb 18, 2022
mine, allocator, _, _, err := d.owns(k8sArgs)
// When AntreaIPAM.owns() is called for a NodeIPAM Pod, the IPPool cannot be found for the Pod. We pass
// swallowNotFoundError to the call to swallow the "NotFound" error.
mine, allocator, _, _, err := d.owns(k8sArgs, swallowNotFoundError)
Member

I'm not sure why it should be fixed in this way. I have two questions:

  1. When a Pod is deleted, isn't CNI DEL normally called before the Pod is removed from the apiserver? I can imagine that in some cases the Pod could already be gone when CNI DEL is handled, e.g. when the Pod's deletion grace period is set to 0, but is that your case?
  2. Should IPPool NotFound and Pod NotFound be handled differently? If the Pod is not found, I think we cannot be sure whether it is owned by AntreaIPAM and should return "mine" as false. If the Pod is found but the IPPool is not, we should at least log an error (we don't allow deleting an IPPool in use, so this is not expected). But in any case that would be an AntreaIPAM Pod, not the situation the comment explains?

Contributor Author

  1. From the e2e tests, we can confirm that the Pod may not exist when CNI DEL is called, so we need to handle this case.
  2. Nice catch. I think it is harmless to handle IPPoolNotFound as the current implementation does, which returns mine=false. Do you think we should check whether ErrStatus.Details.Kind is Pod?
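
A minimal sketch of the check mentioned in point 2, assuming the NotFound error is a client-go *apierrors.StatusError; the helper name isPodNotFound is hypothetical:

```go
package ipam

import (
	"errors"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
)

// isPodNotFound reports whether err is a NotFound error for a Pod (as opposed
// to, e.g., an IPPool), by inspecting ErrStatus.Details.Kind.
func isPodNotFound(err error) bool {
	if !apierrors.IsNotFound(err) {
		return false
	}
	var statusErr *apierrors.StatusError
	if errors.As(err, &statusErr) && statusErr.ErrStatus.Details != nil {
		// The exact string depends on how the NotFound error was built;
		// listers typically use the lowercase resource name ("pods").
		kind := statusErr.ErrStatus.Details.Kind
		return kind == "Pod" || kind == "pods"
	}
	return false
}
```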

Member

For 2, it would invoke an unnecessary external call and would hide a mistake that might exist in the code base.

I think the problem is that the current return value of owns is not clear; its callers have to inject a lot of code for their own business logic. For example:
When it's called by Add, it's expected to return mine as true when the Pod is not found so that Add can fail immediately, but the actual logic is: when the Pod is not found, we don't know which IPAM owns it, and its creation should be deferred until the Pod is received; or, if the Pod has already been deleted from the API, we won't receive more requests anyway.

When it's called by Del, it's expected to return mine as false when the Pod is not found, so that Del can call other IPAM plugins to ensure everything is cleaned up completely.

Injecting these cases into the owns method makes the logic obscure: it is hard to see why owns returns different values in different contexts for the same situation.

To put the business logic back in the caller, I think owns should be clearer about what it returns: it should only return mine as true or false when it really knows, and return unknown otherwise. Add should call the next plugin only for false; Del should call the next plugin for unknown and false. In this way, we don't mix PoolNotFound and PodNotFound:
For Add, if the Pod is found but the IPPool is not, owns should return mine as true and a PoolNotFound error; the request fails immediately with a clear message about PoolNotFound.
For Add, if the Pod is not found, owns should return unknown and a PodNotFound error; the request fails immediately with a clear message about PodNotFound.
For Del, if the Pod is found but the IPPool is not, owns should return mine as true and a PoolNotFound error; the request is terminated and an error is logged for this unexpected situation.
For Del, if the Pod is not found, owns should return unknown and a PodNotFound error; Del should call the next plugin until one returns true.
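
A minimal sketch of this tri-state ownership result; all names below (Ownership, Mine, NotMine, Unknown, cmdAdd, cmdDel) are hypothetical and simplified, not the actual Antrea implementation:

```go
package main

import (
	"errors"
	"fmt"
)

// Ownership is the tri-state result proposed above.
type Ownership int

const (
	NotMine Ownership = iota // definitely not an AntreaIPAM Pod
	Mine                     // definitely an AntreaIPAM Pod
	Unknown                  // cannot tell, e.g. the Pod no longer exists
)

var (
	errPodNotFound  = errors.New("pod not found")
	errPoolNotFound = errors.New("ippool not found")
)

// owns is a stand-in for the real lookup: it reports ownership from the Pod
// and IPPool state instead of hiding that decision behind per-caller flags.
func owns(podExists, poolExists, annotated bool) (Ownership, error) {
	switch {
	case !podExists:
		return Unknown, errPodNotFound
	case !annotated:
		return NotMine, nil
	case !poolExists:
		return Mine, errPoolNotFound
	default:
		return Mine, nil
	}
}

func cmdAdd(podExists, poolExists, annotated bool) error {
	result, err := owns(podExists, poolExists, annotated)
	if result != Mine {
		// Add falls through to the next IPAM plugin only for NotMine;
		// Unknown (PodNotFound) fails immediately with a clear message.
		if result == Unknown {
			return fmt.Errorf("cannot allocate: %w", err)
		}
		return nil // handled by the next plugin
	}
	if err != nil {
		return fmt.Errorf("cannot allocate: %w", err) // PoolNotFound
	}
	return nil // allocate from the AntreaIPAM pool
}

func cmdDel(podExists, poolExists, annotated bool) {
	result, err := owns(podExists, poolExists, annotated)
	switch result {
	case Mine:
		if err != nil {
			fmt.Println("unexpected:", err) // PoolNotFound while in use
			return
		}
		// release the AntreaIPAM address
	case NotMine, Unknown:
		// Let the next IPAM plugin try to clean up, so a late or
		// duplicate CNI DEL never leaks an address.
	}
}

func main() {
	_ = cmdAdd(true, true, true)
	cmdDel(false, false, false)
}
```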

Contributor Author

Thanks. Refactored this part and added comments.

@gran-vmv gran-vmv requested a review from tnqn February 23, 2022 04:02
@gran-vmv gran-vmv added this to the Antrea v1.6 release milestone Mar 2, 2022
@gran-vmv gran-vmv dismissed stale reviews from annakhm and jianjuns via 9fb62d2 March 2, 2022 08:16
@gran-vmv gran-vmv force-pushed the ipam-delfix branch 4 times, most recently from 723b99c to 248b98d Compare March 8, 2022 02:14
@gran-vmv gran-vmv force-pushed the ipam-delfix branch 4 times, most recently from 77fd690 to 3f23722 Compare March 15, 2022 08:30
@gran-vmv gran-vmv force-pushed the ipam-delfix branch 2 times, most recently from 50d5cdc to 4d37741 Compare March 21, 2022 02:03
@gran-vmv gran-vmv changed the title from "[flexible-ipam] Fix IP leak on CNI CmdDel" to "[flexible-ipam] Fix IP leak issues" Mar 21, 2022
This commit fixes 2 IP leak issues:
1. AntreaIPAM is enabled and the NodeIPAM Pods are deleted.
2. An incoming CNI DEL request arrives while the agent restarts.

Signed-off-by: gran <gran@vmware.com>

[flexible-ipam]

Signed-off-by: gran <gran@vmware.com>
@gran-vmv
Contributor Author

/test-all
/test-flexible-ipam-e2e

@tnqn tnqn added the action/release-note label (Indicates a PR that should be included in release notes.) Mar 21, 2022
@tnqn tnqn merged commit f5d2cdf into antrea-io:main Mar 21, 2022
gran-vmv added a commit to gran-vmv/antrea that referenced this pull request Mar 22, 2022