
[flexible-ipam] Fix IP leak issues #3314

Merged
merged 1 commit into antrea-io:main from ipam-delfix on Mar 21, 2022

Conversation

gran-vmv
Contributor

@gran-vmv gran-vmv commented Feb 14, 2022

This commit fixes two IP leak issues in the following situations with AntreaIPAM enabled:
1. NodeIPAM Pods are deleted.
2. AntreaIPAM Pods are deleted while the agent restarts.

This PR closes #3333 and closes #3384

Signed-off-by: gran <gran@vmware.com>

@gran-vmv gran-vmv marked this pull request as draft February 14, 2022 09:54
@gran-vmv gran-vmv marked this pull request as ready for review February 14, 2022 09:58
@codecov-commenter

codecov-commenter commented Feb 14, 2022

Codecov Report

Merging #3314 (9bff07a) into main (78e3583) will decrease coverage by 9.78%.
The diff coverage is 44.53%.

❗ Current head 9bff07a differs from pull request most recent head d9470f5. Consider uploading reports for the commit d9470f5 to get more accurate results

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #3314      +/-   ##
==========================================
- Coverage   62.00%   52.21%   -9.79%     
==========================================
  Files         266      239      -27     
  Lines       26546    34224    +7678     
==========================================
+ Hits        16460    17871    +1411     
- Misses       8281    14645    +6364     
+ Partials     1805     1708      -97     
Flag Coverage Δ
e2e-tests 52.21% <44.53%> (?)
kind-e2e-tests ?
unit-tests ?

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
pkg/agent/agent.go 50.12% <0.00%> (-1.22%) ⬇️
pkg/agent/apiserver/handlers/ovsflows/handler.go 13.48% <ø> (-61.52%) ⬇️
pkg/agent/cniserver/ipam/antrea_ipam.go 3.40% <0.00%> (-75.77%) ⬇️
pkg/agent/cniserver/ipam/antrea_ipam_controller.go 0.00% <0.00%> (-79.77%) ⬇️
pkg/agent/controller/networkpolicy/packetin.go 68.34% <ø> (+1.19%) ⬆️
pkg/agent/ipassigner/responder/ndp_responder.go 0.00% <0.00%> (ø)
pkg/agent/openflow/network_policy.go 64.02% <0.00%> (-19.22%) ⬇️
pkg/agent/openflow/pipeline_other.go 3.44% <0.00%> (+1.27%) ⬆️
pkg/agent/proxy/metrics/metrics.go 100.00% <ø> (ø)
pkg/agent/types/networkpolicy.go 81.08% <ø> (-2.26%) ⬇️
... and 299 more

Contributor

@jianjuns jianjuns left a comment

Could you explain more about what leak is fixed? We probably need some comments in the code to explain the added code too.

@annakhm
Contributor

annakhm commented Feb 16, 2022

I feel that the delete operation has become a bit complex. One cause of the complexity is that it is not a trivial task to determine whether the event is owned by Antrea IPAM. To determine that, we inspect Pod annotations (possibly retrieving pools by podIndex), and then Namespace annotations. Then, if Antrea IPAM owns the Pod, the search for the corresponding container is run once again within the pool in order to free the address. So we have some degree of duplicated logic that complicates reading the code and debugging.
I would like to run an alternative approach by you: is it a viable idea to maintain a store that maps container IDs to pool names? Such a store would fast-track the check/delete operations. It could either be populated on agent startup for the current Node, based on pool statuses (we might need to add a Node indication in the pool status, which could be useful info regardless), or be a disk store like in the host-local plugin.
For potential duplicate delete requests we can have two solutions: either let the request propagate to the host-local plugin (no harm done, but not clean), or keep the Antrea-IPAM-owned containers in the store for a period of time and then garbage collect them.
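
A minimal sketch of the proposed store, assuming only a containerID-to-pool-name mapping plus time-based garbage collection is needed; the type and method names here (ContainerPoolStore, MarkDeleted, GC) are hypothetical and not part of the Antrea code base:

```go
// Package ipamstore sketches an in-memory containerID -> IPPool name store.
package ipamstore

import (
	"sync"
	"time"
)

type entry struct {
	poolName  string
	deletedAt time.Time // zero value means the container is still active
}

// ContainerPoolStore maps container IDs to the IPPool that owns them.
type ContainerPoolStore struct {
	mu      sync.RWMutex
	entries map[string]entry
}

func NewContainerPoolStore() *ContainerPoolStore {
	return &ContainerPoolStore{entries: map[string]entry{}}
}

// Add records which IPPool owns the given container.
func (s *ContainerPoolStore) Add(containerID, poolName string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.entries[containerID] = entry{poolName: poolName}
}

// Get returns the pool name for a container and whether it is known.
func (s *ContainerPoolStore) Get(containerID string) (string, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	e, ok := s.entries[containerID]
	return e.poolName, ok
}

// MarkDeleted keeps the entry around so a duplicate CNI DEL can still be
// recognized as Antrea-IPAM-owned instead of falling through to host-local.
func (s *ContainerPoolStore) MarkDeleted(containerID string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if e, ok := s.entries[containerID]; ok {
		e.deletedAt = time.Now()
		s.entries[containerID] = e
	}
}

// GC removes entries that were marked deleted more than ttl ago.
func (s *ContainerPoolStore) GC(ttl time.Duration) {
	s.mu.Lock()
	defer s.mu.Unlock()
	for id, e := range s.entries {
		if !e.deletedAt.IsZero() && time.Since(e.deletedAt) > ttl {
			delete(s.entries, id)
		}
	}
}
```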

@gran-vmv
Contributor Author

@annakhm
We are already using a PodIndexer on the IPPool cache, so I don't think we need another cache for this.
https://github.com/antrea-io/antrea/blob/main/pkg/agent/cniserver/ipam/antrea_ipam_controller.go#L136
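
For reference, a minimal sketch of how a Pod index over an IPPool cache can answer "which pool owns this Pod" directly, using client-go's cache.Indexer; the simplified ipPool struct and its fields below are hypothetical stand-ins for the real IPPool objects indexed in antrea_ipam_controller.go:

```go
package main

import (
	"fmt"

	"k8s.io/client-go/tools/cache"
)

// podOwner and ipPool are simplified, hypothetical shapes of the cached data.
type podOwner struct {
	Namespace string
	Name      string
}

type ipPool struct {
	Name   string
	Owners []podOwner // Pods that currently hold addresses from this pool
}

const podIndex = "pod"

func main() {
	indexer := cache.NewIndexer(
		// Key each cached object by the pool name.
		func(obj interface{}) (string, error) {
			return obj.(*ipPool).Name, nil
		},
		cache.Indexers{
			// Index each pool under every "namespace/name" Pod key it owns.
			podIndex: func(obj interface{}) ([]string, error) {
				pool := obj.(*ipPool)
				keys := make([]string, 0, len(pool.Owners))
				for _, o := range pool.Owners {
					keys = append(keys, o.Namespace+"/"+o.Name)
				}
				return keys, nil
			},
		},
	)

	_ = indexer.Add(&ipPool{
		Name:   "pool-a",
		Owners: []podOwner{{Namespace: "default", Name: "web-0"}},
	})

	// CNI DEL path: look up the owning pool directly from the Pod key.
	pools, _ := indexer.ByIndex(podIndex, "default/web-0")
	for _, p := range pools {
		fmt.Println("owning pool:", p.(*ipPool).Name)
	}
}
```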

@gran-vmv
Contributor Author

Could you explain more about what leak is fixed? We probably need some comments in the code to explain the added code too.

Added more code comments.

@gran-vmv
Contributor Author

/test-all
/test-flexible-ipam-e2e
/test-ipv6-all
/test-ipv6-only-all
/test-windows-all
/test-multicluster-e2e

jianjuns previously approved these changes Feb 18, 2022
annakhm previously approved these changes Feb 18, 2022
mine, allocator, _, _, err := d.owns(k8sArgs)
// When AntreaIPAM.owns() is called for a NodeIPAM Pod, the IPPool cannot be found for the Pod. We pass
// swallowNotFoundError to the call to swallow the "NotFound" error.
mine, allocator, _, _, err := d.owns(k8sArgs, swallowNotFoundError)
Member

I'm not sure why it should be fixed in this way. I have two questions:

  1. When a Pod is deleted, isn't CNI DEL normally called before the Pod is removed from the apiserver? I can imagine that in some cases the Pod could already be gone when CNI DEL is handled, e.g. when the Pod's deletion grace period is set to 0, but is that your case?
  2. Should IPPool NotFound and Pod NotFound be handled differently? If the Pod is not found, I think we cannot be sure whether it is owned by AntreaIPAM and should return "mine" as false. If the Pod is found but the IPPool is not, we should at least log an error (we don't allow deleting an IPPool in use, so this is not expected). But in any case that would be an AntreaIPAM Pod, not the situation the comment explains?

Contributor Author

  1. From the e2e tests, we can confirm that the Pod may not exist when CNI DEL is called, so we need to handle this case.
  2. Nice catch. I think it is harmless to handle IPPoolNotFound as the current implementation does, which returns mine=false. Do you think we should check whether ErrStatus.Details.Kind is Pod?
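
A minimal sketch of the check mentioned in point 2, assuming the NotFound error is a client-go *apierrors.StatusError; the helper name isPodNotFound is hypothetical:

```go
package ipam

import (
	"errors"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
)

// isPodNotFound reports whether err is a NotFound error for a Pod (as opposed
// to, e.g., an IPPool), by inspecting ErrStatus.Details.Kind.
func isPodNotFound(err error) bool {
	if !apierrors.IsNotFound(err) {
		return false
	}
	var statusErr *apierrors.StatusError
	if errors.As(err, &statusErr) && statusErr.ErrStatus.Details != nil {
		// The exact string depends on how the NotFound error was built;
		// listers typically use the lowercase resource name ("pods").
		kind := statusErr.ErrStatus.Details.Kind
		return kind == "Pod" || kind == "pods"
	}
	return false
}
```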

Member

For 2, it would invoke an unnecessary external call and would hide a mistake that might exist in the code base.

I think the problem is that the current return value of owns is not clear; its callers have to inject a lot of code for their own business logic. For example:
When it's called by Add, it's expected to return mine as true when the Pod is not found so that Add can fail immediately, but the actual logic is: when the Pod is not found, we don't know which IPAM owns it, and its creation should be deferred until the Pod is received; or, if the Pod has already been deleted from the API, we won't receive more requests anyway.

When it's called by Del, it's expected to return mine as false when the Pod is not found, so that Del can call other IPAM plugins to ensure everything is cleaned up completely.

Injecting these cases into the owns method makes the logic obscure: it is hard to see why owns returns different values in different contexts for the same situation.

To put the business logic back in the caller, I think owns should be clearer about what it returns: it should only return mine as true or false when it really knows, and return unknown otherwise. Add should call the next plugin only for false; Del should call the next plugin for unknown and false. In this way, we don't mix PoolNotFound and PodNotFound:
For Add, if the Pod is found but the IPPool is not, owns should return mine as true and a PoolNotFound error; the request fails immediately with a clear message about PoolNotFound.
For Add, if the Pod is not found, owns should return unknown and a PodNotFound error; the request fails immediately with a clear message about PodNotFound.
For Del, if the Pod is found but the IPPool is not, owns should return mine as true and a PoolNotFound error; the request is terminated and an error is logged for this unexpected situation.
For Del, if the Pod is not found, owns should return unknown and a PodNotFound error; Del should call the next plugin until one returns true.
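
A minimal sketch of this tri-state ownership result; all names below (Ownership, Mine, NotMine, Unknown, cmdAdd, cmdDel) are hypothetical and simplified, not the actual Antrea implementation:

```go
package main

import (
	"errors"
	"fmt"
)

// Ownership is the tri-state result proposed above.
type Ownership int

const (
	NotMine Ownership = iota // definitely not an AntreaIPAM Pod
	Mine                     // definitely an AntreaIPAM Pod
	Unknown                  // cannot tell, e.g. the Pod no longer exists
)

var (
	errPodNotFound  = errors.New("pod not found")
	errPoolNotFound = errors.New("ippool not found")
)

// owns is a stand-in for the real lookup: it reports ownership from the Pod
// and IPPool state instead of hiding that decision behind per-caller flags.
func owns(podExists, poolExists, annotated bool) (Ownership, error) {
	switch {
	case !podExists:
		return Unknown, errPodNotFound
	case !annotated:
		return NotMine, nil
	case !poolExists:
		return Mine, errPoolNotFound
	default:
		return Mine, nil
	}
}

func cmdAdd(podExists, poolExists, annotated bool) error {
	result, err := owns(podExists, poolExists, annotated)
	if result != Mine {
		// Add falls through to the next IPAM plugin only for NotMine;
		// Unknown (PodNotFound) fails immediately with a clear message.
		if result == Unknown {
			return fmt.Errorf("cannot allocate: %w", err)
		}
		return nil // handled by the next plugin
	}
	if err != nil {
		return fmt.Errorf("cannot allocate: %w", err) // PoolNotFound
	}
	return nil // allocate from the AntreaIPAM pool
}

func cmdDel(podExists, poolExists, annotated bool) {
	result, err := owns(podExists, poolExists, annotated)
	switch result {
	case Mine:
		if err != nil {
			fmt.Println("unexpected:", err) // PoolNotFound while in use
			return
		}
		// release the AntreaIPAM address
	case NotMine, Unknown:
		// Let the next IPAM plugin try to clean up, so a late or
		// duplicate CNI DEL never leaks an address.
	}
}

func main() {
	_ = cmdAdd(true, true, true)
	cmdDel(false, false, false)
}
```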

Contributor Author

Thanks. Refactored this part and added comments.

@gran-vmv gran-vmv requested a review from tnqn February 23, 2022 04:02
@gran-vmv gran-vmv added this to the Antrea v1.6 release milestone Mar 2, 2022
@gran-vmv gran-vmv dismissed stale reviews from annakhm and jianjuns via 9fb62d2 March 2, 2022 08:16
@gran-vmv gran-vmv force-pushed the ipam-delfix branch 4 times, most recently from 723b99c to 248b98d Compare March 8, 2022 02:14
@gran-vmv gran-vmv force-pushed the ipam-delfix branch 4 times, most recently from 77fd690 to 3f23722 Compare March 15, 2022 08:30
@gran-vmv gran-vmv force-pushed the ipam-delfix branch 2 times, most recently from 50d5cdc to 4d37741 Compare March 21, 2022 02:03
@gran-vmv gran-vmv changed the title from "[flexible-ipam] Fix IP leak on CNI CmdDel" to "[flexible-ipam] Fix IP leak issues" Mar 21, 2022
This commit fixes 2 IP leak issues:
1. AntreaIPAM is enabled and the NodeIPAM Pods are deleted.
2. An incoming CNI DEL request arrives while the agent restarts.

Signed-off-by: gran <gran@vmware.com>

[flexible-ipam]

Signed-off-by: gran <gran@vmware.com>
@gran-vmv
Contributor Author

/test-all
/test-flexible-ipam-e2e

@tnqn tnqn added the action/release-note label (Indicates a PR that should be included in release notes.) Mar 21, 2022
@tnqn tnqn merged commit f5d2cdf into antrea-io:main Mar 21, 2022
gran-vmv added a commit to gran-vmv/antrea that referenced this pull request Mar 22, 2022