feat: [NPM] Clean up iptables chains in Linux v2 #1090

huntergregory · 2021-11-06T00:43:21Z

Features:

Cleans up dead policy chains in the background.
Removes and adds all chains when the number of policies becomes 0.

JungukCho

I like dropping the get prefix and will follow this practice, though it would be nice to drop the get prefix in a separate PR to help reviewers.

I may have misunderstood the code, but if I understand the PR correctly, please revise these if they make sense.

JungukCho · 2021-11-12T17:30:53Z

npm/pkg/dataplane/policies/chain-management_linux.go

+	}
+}
+
+func (pMgr *PolicyManager) oldPolicyChains() []string {


If I understand the code correctly, these are stale policy chains that should be deleted?
If so, would stalePolicyChains() be a better name?

Is this copy made so that deleting from the pMgr.chainToCleanup map is safe?

It seems Go allows deleting map elements inside a loop.
https://golang.org/doc/effective_go#for

I could delete here instead of deleting down below in cleanupChains(), and add the chain back there if there's a failure instead.

I may have misunderstood, but I meant that if it is safe to delete Go map elements inside a loop, we do not need to copy the keys and add them again.
When the operation succeeds, delete the element from the cache.
For example,

// have to use slice argument for deterministic behavior for UTs
func (pMgr *PolicyManager) cleanupChains(chains []string) error {
	var aggregateError error
	for staleChain := range pMgr.staleChains {
		errCode, err := pMgr.runIPTablesCommand(util.IptablesDestroyFlag, staleChain) // TODO run the one that ignores doesNotExistErrorCode
		if err != nil && errCode != doesNotExistErrorCode {
			currentErrString := fmt.Sprintf("failed to clean up policy chain %s with err [%v]", staleChain, err)
			if aggregateError == nil {
				aggregateError = npmerrors.SimpleError(currentErrString)
			} else {
				aggregateError = npmerrors.SimpleErrorWrapper(fmt.Sprintf("%s and had previous error", currentErrString), aggregateError)
			}
		} else {
			// the chain is gone (or never existed), so drop it from the cache
			delete(pMgr.staleChains, staleChain)
		}
	}
	if aggregateError != nil {
		return npmerrors.SimpleErrorWrapper("failed to clean up some policy chains with errors", aggregateError)
	}
	return nil
}

we have to loop over a slice though for UT purposes
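One way to reconcile the two positions above (delete from the cache only on success, but still iterate in a fixed order for unit tests) might look like the following. This is a minimal sketch, not the PR's code: `staleChains` is simplified to a bare map, and `fakeDestroyer` is a hypothetical test double standing in for the real iptables call.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// staleChains is a simplified, hypothetical stand-in for the cache in the PR.
type staleChains map[string]struct{}

// cleanup tries to destroy every cached chain in sorted order. A chain is
// removed from the cache only when destroy succeeds; failures stay cached
// so that a later pass can retry them.
func (s staleChains) cleanup(destroy func(chain string) error) error {
	names := make([]string, 0, len(s))
	for name := range s {
		names = append(names, name)
	}
	sort.Strings(names) // fixed order, so tests see the same call sequence

	var failed []string
	for _, name := range names {
		if err := destroy(name); err != nil {
			failed = append(failed, fmt.Sprintf("%s: %v", name, err))
			continue
		}
		delete(s, name)
	}
	if len(failed) > 0 {
		return fmt.Errorf("failed to clean up %d chain(s): %s",
			len(failed), strings.Join(failed, "; "))
	}
	return nil
}

// fakeDestroyer records every call and fails for the chains in failFor.
type fakeDestroyer struct {
	failFor map[string]bool
	calls   []string
}

func (f *fakeDestroyer) destroy(chain string) error {
	f.calls = append(f.calls, chain)
	if f.failFor[chain] {
		return fmt.Errorf("simulated iptables failure")
	}
	return nil
}
```

A test can pass `f.destroy` to `cleanup` and assert on `f.calls`. The sorted snapshot costs one extra slice per pass, but it avoids depending on Go's randomized map iteration order.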

In addition, it may be cleaner to make cleanup a receiver method on the staleChains struct.

I thought about this, but then we'd have to pass pMgr as an arg, which is messy:
pMgr.staleChains.cleanup(pMgr)

JungukCho · 2021-11-12T17:44:35Z

npm/pkg/dataplane/policies/policymanager_linux.go

 	if restoreErr != nil {
 		return npmerrors.SimpleErrorWrapper("failed to flush policies", restoreErr)
 	}
+	for _, chain := range allChainNames {


I think the chains belonging to the networkPolicy are deleted by the restore function.
I am curious: if the restore succeeded, why do you add the chains?
If you want to add only the failed chains, should this for loop be inside the if condition?

If so, I am still not sure this is needed, since the failed ones are reconciled again by the upper layer.
Is it to clean up the failed ones in the regular reconcile function?
If so, I am not sure, but something could get messed up when multiple events (e.g., add policy, delete policy, and reconcile) happen at the same time. Is this guarded by the lock in the policy manager?

In iptables-restore, you can't destroy a chain; you can only create or flush one. So this code says "I want to delete these policy chains later since we successfully removed the old policy's rules"

We lock before AddPolicy, DeletePolicy, and Reconcile, so I don't think there could be a problem.
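The two-phase flow described here (flush through restore now, destroy the emptied chains later in reconcile) could be sketched roughly as below. Everything in it is an assumption for illustration: `chainTool`, `policyManager`, and `fakeTool` are hypothetical names, and locking is omitted because the caller is assumed to hold the pMgr lock.

```go
package main

import (
	"errors"
	"fmt"
)

// chainTool is a hypothetical abstraction over the two iptables entry points.
type chainTool interface {
	flushWithRestore(chains []string) error // restore path: can create or flush, not delete
	destroy(chain string) error             // separate command that deletes an empty chain
}

type policyManager struct {
	tool        chainTool
	staleChains map[string]struct{}
}

// removePolicy flushes the policy's chains and, only if that succeeded,
// marks them stale so that reconcile can destroy them later.
func (p *policyManager) removePolicy(chains []string) error {
	if err := p.tool.flushWithRestore(chains); err != nil {
		// Nothing was flushed, so nothing is marked stale.
		return fmt.Errorf("failed to flush policies: %w", err)
	}
	for _, c := range chains {
		p.staleChains[c] = struct{}{}
	}
	return nil
}

// reconcile destroys stale chains; chains that fail stay cached for a retry.
func (p *policyManager) reconcile() {
	for c := range p.staleChains {
		if err := p.tool.destroy(c); err != nil {
			continue
		}
		delete(p.staleChains, c)
	}
}

// fakeTool is a test double that can simulate a restore failure.
type fakeTool struct {
	failRestore bool
	destroyed   []string
}

func (f *fakeTool) flushWithRestore(chains []string) error {
	if f.failRestore {
		return errors.New("simulated restore failure")
	}
	return nil
}

func (f *fakeTool) destroy(chain string) error {
	f.destroyed = append(f.destroyed, chain)
	return nil
}
```

With this shape, the policy removal path runs a single restore call, and the per-chain destroy commands are paid for in the background pass.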

Oh, I see. Basically, reconcile will run the regular clean-up process for chains.

Question
But why not try to clean up the chain here as well?
If the clean-up fails, add the failed chain to staleChains.
Are you postponing it because it is time-consuming?

I guess there are pros and cons to both approaches.
If there are many chains, Reconcile holds the lock for a long time and delays normal operations for too long, but deferring the clean-up speeds up normal operations (e.g., add and delete).

Before discussing further, I think we need to benchmark the time to delete one chain to have a productive discussion.

I'll send a link to a doc in Teams

JungukCho · 2021-11-12T17:53:11Z

npm/pkg/dataplane/policies/chain-management_linux.go

+}
+
+func (pMgr *PolicyManager) reboot() error {
+	// TODO for the sake of UTs, need to have a pMgr config specifying whether or not this reboot happens


Do we need the reboot function?
It seems it is called when there are no more network policies.
So is it OK to just call reset?

Or does it clean up everything and then reinstall the default NPM chains?

ya we have reset which cleans up and deletes, and initialize which reinstalls. This reboot will do nothing in Windows

Just curious.
I guess you decided on a proactive approach to avoid the time needed to install the default chains when the first networkpolicy arrives again, whereas logically reset alone would be called when there are no more network policies and initialize would be called when the first networkpolicy arrives.

oh I see your point. In v1 it looks like we just reset, and then when the first policy comes in, we initialize.

Given that the current DP design initializes the pMgr on creation, it might be simpler to always have pMgr initialized. Need to consider if there are any security or perf concerns for this approach.
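The eager option discussed in this thread (reset, then immediately initialize, once the last policy is removed) could be sketched as follows. This is only an illustration under assumptions: the `policyManager` here is hypothetical, and `reset` and `initialize` just record that they ran instead of touching iptables.

```go
package main

import "fmt"

// policyManager is a hypothetical, simplified manager for illustration.
type policyManager struct {
	policies map[string]struct{}
	steps    []string // records which chain operations ran, in order
}

// reset stands in for flushing and deleting all NPM chains.
func (p *policyManager) reset() error {
	p.steps = append(p.steps, "reset")
	return nil
}

// initialize stands in for reinstalling the default NPM chains.
func (p *policyManager) initialize() error {
	p.steps = append(p.steps, "initialize")
	return nil
}

// reboot leaves the manager initialized again after a full clean-up.
func (p *policyManager) reboot() error {
	if err := p.reset(); err != nil {
		return fmt.Errorf("failed to reset: %w", err)
	}
	if err := p.initialize(); err != nil {
		return fmt.Errorf("failed to initialize: %w", err)
	}
	return nil
}

// removePolicy reboots the chains only when the last policy goes away.
func (p *policyManager) removePolicy(name string) error {
	if _, ok := p.policies[name]; !ok {
		return nil // unknown policy: nothing to do
	}
	delete(p.policies, name)
	if len(p.policies) == 0 {
		return p.reboot()
	}
	return nil
}
```

The lazy alternative from v1 would call only `reset` here and defer `initialize` until the next policy is added.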

JungukCho · 2021-11-12T17:58:02Z

npm/pkg/dataplane/policies/chain-management_linux.go

 	ingressOrEgressPolicyChainPattern = fmt.Sprintf("'Chain %s-\\|Chain %s-'", util.IptablesAzureIngressPolicyChainPrefix, util.IptablesAzureEgressPolicyChainPrefix)
 )

+type osTools struct {


I am not sure about the naming; would stableChains be better?
Also, this is used in multiple places (e.g., initialization, map key copies, etc.), so it would be nice to use receiver methods to make it easier to understand.
I guess it also needs a more fine-grained lock.

I like name change (staleChains right?), and will use methods where possible.

The pMgr is locked before the use of this struct, so I think we're ok right? See the reconcile() logic. Within that function we could lock the pMgr only for repositioning the jump to azure chain, then unlock, then lock the staleChains only while we delete stale chains, but is this too complicated?

EDIT: we should lock the whole pMgr so that we don't conflict with iptables-restore calls

vakalapa · 2021-11-15T22:55:09Z

npm/pkg/dataplane/policies/chain-management_linux.go

 )

+type staleChains struct {
+	chainsToCleanup map[string]struct{}


We will need locks for this staleChains struct because two different threads are going to read and write it: the reconcile thread and the normal pMgr thread.

there's some discussion about this with me and Junguk above. We lock the whole pMgr before Add/Remove Policy, and I lock pMgr before reconcile now too. Besides the staleChains field that is used in all three of these methods, ioshim is also used to make syscalls in all three
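The coarse-grained locking described in this reply can be illustrated with a small sketch in which one mutex guards both the policy path and the reconcile path. The type and helper names below are hypothetical, and the chain bookkeeping is reduced to a counter.

```go
package main

import "sync"

// policyManager is a hypothetical manager guarded by a single mutex.
type policyManager struct {
	mu          sync.Mutex
	staleChains map[string]struct{}
	destroyed   int // number of chains reconcile has cleaned up so far
}

// removePolicy marks a chain stale while holding the manager lock.
func (p *policyManager) removePolicy(chain string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.staleChains[chain] = struct{}{}
}

// reconcile takes the same lock, so it never interleaves with removePolicy.
func (p *policyManager) reconcile() {
	p.mu.Lock()
	defer p.mu.Unlock()
	for c := range p.staleChains {
		delete(p.staleChains, c)
		p.destroyed++
	}
}

// runConcurrently starts n goroutines and waits for all of them to finish.
func runConcurrently(n int, fn func(i int)) {
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			fn(i)
		}(i)
	}
	wg.Wait()
}
```

Because every entry point takes `mu`, the map needs no lock of its own. The trade-off raised earlier still applies: a long reconcile pass blocks policy updates for its whole duration.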

vakalapa · 2021-11-15T22:57:49Z

npm/pkg/dataplane/policies/chain-management_linux.go

 	if err := pMgr.removeNPMChains(); err != nil {
 		return npmerrors.SimpleErrorWrapper("failed to remove NPM chains", err)
 	}
+	pMgr.staleChains.empty()


It does not look like we are deleting staleChains on reset? If NPM comes up after a crash, we have lost the staleChains cache, and we would still have to remove those chains, right? I think we will need to read all existing chains and delete the ones with the "Azure-NPM" prefix, wdyt?

removeNPMChains actually does grep for any policy chains and deletes them. It actually creates chains that might not exist though, and greps for ingress/egress policy chains only. I will change it to grep for anything with that prefix, and then we will only be flushing and deleting chains that already exist

Doing this in another PR
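The follow-up idea (find every existing chain with the NPM prefix, so that only chains that actually exist get flushed and deleted) might be sketched like this. It assumes the chain listing uses header lines of the form `Chain NAME (...)`, as in `iptables -L` output; the function name is hypothetical and the prefix is passed in rather than taken from the real constant.

```go
package main

import (
	"bufio"
	"strings"
)

// chainsWithPrefix scans a chain listing and returns the names of all chains
// whose name starts with prefix. Only header lines of the form
// "Chain NAME (...)" are considered; rule lines are ignored.
func chainsWithPrefix(listing, prefix string) []string {
	var chains []string
	scanner := bufio.NewScanner(strings.NewReader(listing))
	for scanner.Scan() {
		fields := strings.Fields(scanner.Text())
		if len(fields) >= 2 && fields[0] == "Chain" && strings.HasPrefix(fields[1], prefix) {
			chains = append(chains, fields[1])
		}
	}
	return chains
}
```

Matching on the parsed chain name rather than on a raw substring avoids picking up rule lines that merely jump to an NPM chain.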

JungukCho

LGTM! My comments are minor. You can resolve them in your follow-up PRs if they make sense.

npm/pkg/dataplane/policies/chain-management_linux.go

huntergregory · 2021-11-16T19:07:47Z

/azp run

azure-pipelines · 2021-11-16T19:08:03Z

Azure Pipelines successfully started running 2 pipeline(s).

huntergregory · 2021-11-16T23:39:43Z

/azp run

azure-pipelines · 2021-11-16T23:39:59Z

Azure Pipelines successfully started running 2 pipeline(s).

huntergregory · 2021-11-17T01:20:36Z

/azp run

azure-pipelines · 2021-11-17T01:20:52Z

Azure Pipelines successfully started running 2 pipeline(s).

huntergregory · 2021-11-17T03:17:53Z

/azp run

azure-pipelines · 2021-11-17T03:18:10Z

Azure Pipelines successfully started running 2 pipeline(s).

huntergregory · 2021-11-17T17:12:40Z

/azp run

azure-pipelines · 2021-11-17T17:12:56Z

Azure Pipelines successfully started running 2 pipeline(s).

huntergregory requested review from JungukCho and vakalapa November 6, 2021 00:43

huntergregory changed the title ~~[NPM] Update for Linux Chain Management~~ feat: [NPM] Update for Linux Chain Management Nov 9, 2021

huntergregory changed the title ~~feat: [NPM] Update for Linux Chain Management~~ feat: [NPM] Clean up iptables chains in Linux v2 Nov 11, 2021

JungukCho reviewed Nov 12, 2021

vakalapa reviewed Nov 15, 2021

huntergregory added 6 commits November 15, 2021 18:08

cleanup old policy chains and reboot iptables chains when there are n…

521b9cc

…o more policies

remove get prefix for all functions per junguks feedback

beb9b04

clean up code for port specs and fix a lint

95b6e40

address comments

1504b70

remove stop channel in OS-specific reconcile

77bc75f

move policy methods to policy_linux.go

1112cca

huntergregory force-pushed the npm-chain-management branch from b4f813c to 1112cca Compare November 16, 2021 02:11

huntergregory marked this pull request as ready for review November 16, 2021 02:12

JungukCho previously approved these changes Nov 16, 2021

huntergregory added the npm Related to NPM. label Nov 16, 2021

huntergregory dismissed JungukCho’s stale review via 1112cca November 16, 2021 23:30

add comments based on suggestions

ae909ff

JungukCho previously approved these changes Nov 16, 2021

huntergregory dismissed JungukCho’s stale review via fae1c4c November 17, 2021 17:33

fix build issue: move a constant from linux file to generic file

fae1c4c

JungukCho approved these changes Nov 17, 2021

huntergregory merged commit 618a654 into master Nov 17, 2021

rbtr deleted the npm-chain-management branch November 30, 2021 20:15
