Skip to content

Conversation

@huntergregory
Copy link
Contributor

Features:

  • Cleans up dead policy chains in the background.
  • Removes and adds all chains when the number of policies becomes 0.

@huntergregory huntergregory changed the title [NPM] Update for Linux Chain Management feat: [NPM] Update for Linux Chain Management Nov 9, 2021
@huntergregory huntergregory changed the title feat: [NPM] Update for Linux Chain Management feat: [NPM] Clean up iptables chains in Linux v2 Nov 11, 2021
Copy link
Contributor

@JungukCho JungukCho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like dropping get prefix. I will follow this practice while it would be nice to separate dropping get prefix in another PR to help reviewer.

I may misunderstand the code, but if I correctly understand the PR, please revise them if they make sense.

}
}

func (pMgr *PolicyManager) oldPolicyChains() []string {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I correctly understand the code, this is stale policy which should be deleted?
If so, is stalePolicyChains() better?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it for deletion safe in pMgr.chainToCleanup map?

It seems go is safe to delete element in loop.
https://golang.org/doc/effective_go#for

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could delete here instead of deleting down below in cleanupChains() and add in there if there's a failure instead

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I may misunderstand it, but I meant if go map is safe to delete element in loop, we do not copying and adding it again.
When the operation is successful, deleting the element from the cache.
For example,

// have to use slice argument for deterministic behavior for UTs
func (pMgr *PolicyManager) cleanupChains(chains []string) error {
	var aggregateError error
        for staleChain := pMgr.staleChains {
		errCode, err := pMgr.runIPTablesCommand(util.IptablesDestroyFlag, staleChain ) // TODO run the one that ignores doesNotExistErrorCode
		if err != nil && errCode != doesNotExistErrorCode {
			currentErrString := fmt.Sprintf("failed to clean up policy chain %s with err [%v]", chain, err)
			if aggregateError == nil {
				aggregateError = npmerrors.SimpleError(currentErrString)
			} else {
				aggregateError = npmerrors.SimpleErrorWrapper(fmt.Sprintf("%s and had previous error", currentErrString), aggregateError)
			}
		}else {
                     delete(pMgr.staleChains, staleChain)
                }
	}
	if aggregateError != nil {
		return npmerrors.SimpleErrorWrapper("failed to clean up some policy chains with errors", aggregateError)
	}
	return nil
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we have to loop over a slice though for UT purposes

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In addition, it may look better to have cleanup receiver function in staleChains struct.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about this, but then we'd have to pass pMgr as an arg, which is messy:
pMgr.staleChains.cleanup(pMgr)

if restoreErr != nil {
return npmerrors.SimpleErrorWrapper("failed to flush policies", restoreErr)
}
for _, chain := range allChainNames {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the related chain from networkPolicy is deleted with restore function.
I am curious, even restore succeeded, why do you add the chain?
If you want to add these failed chains, does for loop locate in if condition?

If this is correct, even I am not sure this is needed since it the failed ones are reconciled again from upper layer.
Is it to clean up these failed ones in regular reconcile function?
If so, I am not sure, but there is time to mess up something when multiple events (e.g., add, delete policy and reconcile) happens at the same time. Is it guaranteed with lock in policymanager?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in iptables-restore, you can't destroy a chain, you can only create/flush a chain. So this code says "I want to delete these policy chains later since we successfully removed the old policy's rules"

We lock before AddPolicy, DeletePolicy, and Reconcile, so I don't think there could be a problem.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh. I see. Basically, the will do regular clean-up process for chains in reconcile.

Question
But why not trying cleanup the chain here as well?
If cleaning-up process is failure, add the failed one into staleChains.
You postpone it due to time consuming?

I guess there are pros and cons between two approach.
If there are many chains, Reconcile hold a lock for longtime and delay normal operator too long while it speeds-up normal operations (e.g., add, and deletion).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before discussion, I guess benchmarking time for deleting one chain is necessary to have productive discussion.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll send a link to a doc in Teams

}

func (pMgr *PolicyManager) reboot() error {
// TODO for the sake of UTs, need to have a pMgr config specifying whether or not this reboot happens
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need reboot function?
It seems it is called when there is no more networkpolicy.
So, it is ok to call just reset.

Or is it clean-up all and then re-install default NPM chains?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ya we have reset which cleans up and deletes, and initialize which reinstalls. This reboot will do nothing in Windows

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious.
I guess you decide to do a pro-active approach to avoid time to install default chains when the first networkpolicy comes again while logically only reset is called when there is no more networkpolicy and initialize is called if the first networkpolicy comes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh I see your point. In v1 it looks we just reset, and then when the first policy comes in, we initialize.

Given that the current DP design initializes the pMgr on creation, it might be simpler to always have pMgr initialized. Need to consider if there are any security or perf concerns for this approach.

ingressOrEgressPolicyChainPattern = fmt.Sprintf("'Chain %s-\\|Chain %s-'", util.IptablesAzureIngressPolicyChainPrefix, util.IptablesAzureEgressPolicyChainPrefix)
)

type osTools struct {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure naming, is it better to use stableChains?
Also this use multiple places (e.g., initialization, map key copies, etc), so it would be nice to leverage receiver methods and help better understanding.
It guess it also need a more fine-grained lock.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like name change (staleChains right?), and will use methods where possible.

The pMgr is locked before the use of this struct, so I think we're ok right? See the reconcile() logic. Within that function we could lock the pMgr only for repositioning the jump to azure chain, then unlock, then lock the staleChains only while we delete stale chains, but is this too complicated?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

EDIT: we should lock the whole pMgr so that we don't conflict with iptables-restore calls

)

type staleChains struct {
chainsToCleanup map[string]struct{}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We will need locks for this stalechain because two different threads are going to read/write into this, one reconcile thread and the normal pMgr thread.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there's some discussion about this with me and Junguk above. We lock the whole pMgr before Add/Remove Policy, and I lock pMgr before reconcile now too. Besides the staleChains field that is used in all three of these methods, ioshim is also used to make syscalls in all three

if err := pMgr.removeNPMChains(); err != nil {
return npmerrors.SimpleErrorWrapper("failed to remove NPM chains", err)
}
pMgr.staleChains.empty()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does not look like we are deleting staleChains on reset ? if NPM comes up after a crash, we lost the stalechain cache and we would have to remove them right ? I think we will need to read all existing chains and delete chains with "Azure-NPM" prefix, wdyt ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removeNPMChains actually does grep for any policy chains and deletes them. It actually creates chains that might not exist though, and greps for ingress/egress policy chains only. I will change it to grep for anything with that prefix, and then we will only be flushing and deleting chains that already exist

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doing this in another PR

@huntergregory huntergregory marked this pull request as ready for review November 16, 2021 02:12
JungukCho
JungukCho previously approved these changes Nov 16, 2021
Copy link
Contributor

@JungukCho JungukCho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! My comments are minor. You can resolved them in your next follow-up PRs if they make sense.

@huntergregory
Copy link
Contributor Author

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 2 pipeline(s).

@huntergregory
Copy link
Contributor Author

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 2 pipeline(s).

JungukCho
JungukCho previously approved these changes Nov 16, 2021
@huntergregory
Copy link
Contributor Author

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 2 pipeline(s).

@huntergregory
Copy link
Contributor Author

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 2 pipeline(s).

@huntergregory
Copy link
Contributor Author

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 2 pipeline(s).

@huntergregory huntergregory merged commit 618a654 into master Nov 17, 2021
@rbtr rbtr deleted the npm-chain-management branch November 30, 2021 20:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

npm Related to NPM.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants