feat: [NPM] Clean up iptables chains in Linux v2 #1090
Conversation
JungukCho
left a comment
I like dropping the get prefix. I will follow this practice, though it would be nice to separate dropping the get prefix into another PR to help reviewers.
I may misunderstand the code, but if I have understood the PR correctly, please consider the comments below if they make sense.
}
}

func (pMgr *PolicyManager) oldPolicyChains() []string {
If I understand the code correctly, these are stale policy chains that should be deleted?
If so, is stalePolicyChains() a better name?
Is it safe to delete elements from the pMgr.chainToCleanup map here?
It seems Go is safe for deleting map elements inside a range loop.
https://golang.org/doc/effective_go#for
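The claim above about deleting map entries inside a range loop can be checked with a minimal sketch (the map contents and helper name are illustrative, not the PR's actual code):

```go
package main

import "fmt"

// drainChains deletes every entry while ranging over the map; per the Go
// spec this is safe -- a deleted entry is simply not revisited.
func drainChains(chains map[string]struct{}) {
	for chain := range chains {
		delete(chains, chain)
	}
}

func main() {
	// Hypothetical stand-in for the stale-chain set discussed in this PR.
	chains := map[string]struct{}{
		"AZURE-NPM-INGRESS-1": {},
		"AZURE-NPM-EGRESS-2":  {},
	}
	drainChains(chains)
	fmt.Println(len(chains)) // 0
}
```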
I could delete here instead of deleting below in cleanupChains(), and re-add the chain there if there's a failure instead.
I may misunderstand it, but I meant that if a Go map is safe to delete elements from in a loop, we do not need to copy the keys and add them back again.
When the operation is successful, delete the element from the cache.
For example:
func (pMgr *PolicyManager) cleanupChains() error {
	var aggregateError error
	for staleChain := range pMgr.staleChains {
		errCode, err := pMgr.runIPTablesCommand(util.IptablesDestroyFlag, staleChain) // TODO run the one that ignores doesNotExistErrorCode
		if err != nil && errCode != doesNotExistErrorCode {
			currentErrString := fmt.Sprintf("failed to clean up policy chain %s with err [%v]", staleChain, err)
			if aggregateError == nil {
				aggregateError = npmerrors.SimpleError(currentErrString)
			} else {
				aggregateError = npmerrors.SimpleErrorWrapper(fmt.Sprintf("%s and had previous error", currentErrString), aggregateError)
			}
		} else {
			// safe to delete the entry during the range loop
			delete(pMgr.staleChains, staleChain)
		}
	}
	if aggregateError != nil {
		return npmerrors.SimpleErrorWrapper("failed to clean up some policy chains with errors", aggregateError)
	}
	return nil
}
we have to loop over a slice though for UT purposes
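A minimal sketch of the deterministic-UT point above, assuming a hypothetical helper that copies the map keys into a sorted slice before looping (Go map iteration order is randomized, so UTs that assert on command order need a stable ordering):

```go
package main

import (
	"fmt"
	"sort"
)

// sortedChains is an illustrative helper, not the PR's actual code:
// it snapshots the stale-chain set into a sorted slice so that any
// per-chain commands run in a deterministic, testable order.
func sortedChains(stale map[string]struct{}) []string {
	chains := make([]string, 0, len(stale))
	for chain := range stale {
		chains = append(chains, chain)
	}
	sort.Strings(chains)
	return chains
}

func main() {
	stale := map[string]struct{}{"b-chain": {}, "a-chain": {}}
	fmt.Println(sortedChains(stale)) // [a-chain b-chain]
}
```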
In addition, it may look better to have a cleanup receiver method on the staleChains struct.
I thought about this, but then we'd have to pass pMgr as an arg, which is messy:
pMgr.staleChains.cleanup(pMgr)
if restoreErr != nil {
	return npmerrors.SimpleErrorWrapper("failed to flush policies", restoreErr)
}
for _, chain := range allChainNames {
I think the related chains from the network policy are deleted by the restore function.
I am curious: even when restore succeeds, why do you add the chain?
If you only want to add the failed chains, should the for loop be inside the if condition?
Even then, I am not sure this is needed, since the failed ones are reconciled again from the upper layer.
Is it to clean up these failed ones in the regular reconcile function?
If so, could things get messed up when multiple events (e.g., add policy, delete policy, and reconcile) happen at the same time? Is that guaranteed safe by the lock in policyManager?
In iptables-restore, you can't destroy a chain; you can only create/flush a chain. So this code says "I want to delete these policy chains later, since we successfully removed the old policy's rules."
We lock before AddPolicy, DeletePolicy, and Reconcile, so I don't think there could be a problem.
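As a sketch of the restore-format limitation mentioned above: an iptables-restore payload can declare a chain (creating it if missing and flushing its rules), but the format has no directive that deletes a chain. The chain name below is made up for illustration:

```
*filter
# Declaring a chain creates it if absent and flushes its rules.
# No line type in this format deletes a chain.
:AZURE-NPM-INGRESS-EXAMPLE - [0:0]
COMMIT
```

Deleting the (now empty) chain must happen in a separate call outside the restore, e.g. iptables -X on that chain, which is why this PR records the chain for later cleanup.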
Oh, I see. Basically, reconcile will do the regular clean-up process for these chains.
Question: why not try to clean up the chain here as well?
If the clean-up fails, add the failed one to staleChains.
Do you postpone it because it is time-consuming?
I guess there are pros and cons between the two approaches: if there are many chains, Reconcile holds the lock for a long time and delays normal operations too long, while deferring speeds up the normal operations (e.g., add and delete).
Before further discussion, I think benchmarking the time to delete one chain is necessary to have a productive discussion.
I'll send a link to a doc in Teams
}

func (pMgr *PolicyManager) reboot() error {
	// TODO for the sake of UTs, need to have a pMgr config specifying whether or not this reboot happens
Do we need the reboot function?
It seems it is called when there are no more network policies, so it should be OK to just call reset.
Or does it clean up everything and then re-install the default NPM chains?
Ya, we have reset, which cleans up and deletes, and initialize, which reinstalls. This reboot will do nothing on Windows.
Just curious: I guess you decided on a proactive approach to avoid the time to install the default chains when the first network policy comes again, whereas logically only reset would be called when there are no more network policies, and initialize would be called when the first network policy comes.
Oh, I see your point. In v1 it looks like we just reset, and then when the first policy comes in, we initialize.
Given that the current DP design initializes the pMgr on creation, it might be simpler to always have the pMgr initialized. We need to consider whether there are any security or perf concerns with this approach.
ingressOrEgressPolicyChainPattern = fmt.Sprintf("'Chain %s-\\|Chain %s-'", util.IptablesAzureIngressPolicyChainPrefix, util.IptablesAzureEgressPolicyChainPrefix)
)

type osTools struct {
I am not sure about the naming; is it better to use stableChains?
Also, this is used in multiple places (e.g., initialization, map key copies, etc.), so it would be nice to leverage receiver methods to aid understanding.
I guess it also needs a more fine-grained lock.
I like the name change (staleChains, right?) and will use methods where possible.
The pMgr is locked before the use of this struct, so I think we're OK, right? See the reconcile() logic. Within that function we could lock the pMgr only while repositioning the jump to the azure chain, then unlock, then lock the staleChains only while we delete stale chains, but is this too complicated?
EDIT: we should lock the whole pMgr so that we don't conflict with iptables-restore calls
)

type staleChains struct {
	chainsToCleanup map[string]struct{}
We will need locks for this staleChains struct, because two different threads are going to read/write it: the reconcile thread and the normal pMgr thread.
There's some discussion about this between me and Junguk above. We lock the whole pMgr before Add/Remove Policy, and I lock the pMgr before reconcile now too. Besides the staleChains field, which is used in all three of these methods, ioshim is also used to make syscalls in all three.
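A minimal sketch of this locking scheme, with illustrative names rather than the PR's actual types: one mutex on the manager serializes the add/remove path and reconcile, so the stale-chain set (and the underlying syscalls) are never touched concurrently.

```go
package main

import (
	"fmt"
	"sync"
)

// policyManager is a toy stand-in: a single embedded mutex guards both
// the stale-chain set and (in the real code) the iptables calls.
type policyManager struct {
	sync.Mutex
	staleChains map[string]struct{}
}

// markStale records a chain for deferred deletion (the AddPolicy/RemovePolicy path).
func (p *policyManager) markStale(chain string) {
	p.Lock()
	defer p.Unlock()
	p.staleChains[chain] = struct{}{}
}

// reconcile drains the stale set under the same lock and returns how many
// chains it handled; the real code would run iptables -X per chain here.
func (p *policyManager) reconcile() int {
	p.Lock()
	defer p.Unlock()
	n := len(p.staleChains)
	for chain := range p.staleChains {
		delete(p.staleChains, chain)
	}
	return n
}

func main() {
	p := &policyManager{staleChains: map[string]struct{}{}}
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			p.markStale(fmt.Sprintf("chain-%d", i))
		}(i)
	}
	wg.Wait()
	fmt.Println(p.reconcile()) // 4
}
```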
if err := pMgr.removeNPMChains(); err != nil {
	return npmerrors.SimpleErrorWrapper("failed to remove NPM chains", err)
}
pMgr.staleChains.empty()
It does not look like we are deleting the stale chains themselves on reset? If NPM comes up after a crash, we lose the staleChains cache, but we would still have to remove those chains, right? I think we will need to read all existing chains and delete the ones with the "Azure-NPM" prefix. Wdyt?
removeNPMChains actually does grep for any policy chains and deletes them. It actually creates chains that might not exist though, and greps for ingress/egress policy chains only. I will change it to grep for anything with that prefix, and then we will only be flushing and deleting chains that already exist
Doing this in another PR
Force-pushed from b4f813c to 1112cca
JungukCho
left a comment
LGTM! My comments are minor. You can resolve them in your next follow-up PRs if they make sense.
/azp run
Azure Pipelines successfully started running 2 pipeline(s).