-
Notifications
You must be signed in to change notification settings - Fork 260
[NPM] better error handling and cache building #848
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report
@@ Coverage Diff @@
## master #848 +/- ##
==========================================
- Coverage 42.22% 42.05% -0.17%
==========================================
Files 158 158
Lines 15025 15136 +111
==========================================
+ Hits 6344 6366 +22
- Misses 7904 7995 +91
+ Partials 777 775 -2 |
| metrics.SendErrorLogAndMetric(util.NSID, "[UpdateNamespace] Error: failed to add namespace %s to ipset list %s with err: %v", newNsName, labelKey, err) | ||
| return err | ||
| } | ||
| // {IMPORTANT} Same as above order is assumed to be key and then key+val. NPM should only append to existing labels |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same. Can we resolve this dependency?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As we discussed offline, keeping this as it is for now.
| log.Logf("Adding pod %s to ipset %s", newPodObj.Status.PodIP, addIPSetName) | ||
| if err = ipsMgr.AddToSet(addIPSetName, newPodObj.Status.PodIP, util.IpsetNetHashFlag, podKey); err != nil { | ||
| return fmt.Errorf("[syncAddAndUpdatePod] Error: failed to add pod to label ipset with err: %v", err) | ||
| // {IMPORTANT} The order of compared list will be key and then key+val. NPM should only append after both key |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Same as namespace comment) Can we resolve this issue in this PR? As I understood this issue and we discussed this before, but if possible, it is better to resolve this issue to avoid future possible issue.
Current GetLabelKVFromSet returns key and value. So better to returns (string, string) in the GetLabelKVFromSet
function and use "removedKey" and "removedValue" in this function
| if err = ipsMgr.AddToSet(addIPSetName, newPodObj.Status.PodIP, util.IpsetNetHashFlag, podKey); err != nil { | ||
| return fmt.Errorf("[syncAddAndUpdatePod] Error: failed to add pod to label ipset with err: %v", err) | ||
| } | ||
| // {IMPORTANT} Same as above order is assumed to be key and then key+val. NPM should only append to existing labels |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we resolve this issue in this PR?
| c.npMgr.PodMap[podKey].appendLabels(map[string]string{addedLabel[0]: addedLabel[1]}, AppendToExitingLabels) | ||
| } | ||
| } | ||
| // This will ensure after all labels are worked on to overwrite. This way will reduce any bugs introduced above |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What are possible bugs? It would be nice to update bugs case as comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a comment. Due to the ordering issues if we not delete and add a label, this appedlabels will help override it.
| // the IP addresses of the cached npmPod and newPodObj is the same | ||
| // If no change in labels, then GetIPSetListCompareLabels will return empty list. | ||
| // Otherwise it returns list of deleted PodIP from cached pod's labels and list of added PodIp from new pod's labels | ||
| addToIPSets, deleteFromIPSets := util.GetIPSetListCompareLabels(cachedNpmPodObj.Labels, newPodObj.Labels) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From here, do we deal with the first case? If so, can we keep using cachedNpmPodObj and update the cachedNpmPodObj if needed? Are there specific reasons to create newNpmPod() and update cache, etc?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the reason we are creating new obj is because we have fields like Phase, PodIPs, Rv etc which needs to be updated to reflect the latest newPodobj ... we can add them individually too if it makes sense
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good improvement, adding this change
npm/podController.go
Outdated
| portList := getContainerPortList(podObj) | ||
| // To remove any references to objects sent to events by sharedinformer, | ||
| // NPM will need to deepcopy data from the provided objects | ||
| copiedPortList := make([]corev1.ContainerPort, len(portList)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
getContainerPortList returns new slices with copy. Why do we need the same operation again?
| // Delete the pod from its namespace's ipset. | ||
| if err = ipsMgr.DeleteFromSet(cachedNpmPodObj.Namespace, cachedNpmPodObj.PodIP, podKey); err != nil { | ||
| return fmt.Errorf("[syncAddAndUpdatePod] Error: failed to delete pod from namespace ipset with err: %v", err) | ||
| metrics.SendErrorLogAndMetric(util.PodID, "[syncAddAndUpdatePod] Info: Unexpected state. Pod (Namespace:%s, Name:%s, newUid:%s) , has cachedPodIp:%s which is different from PodIp:%s", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess we do not need this message since it can happen.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this line was very useful in Genesys debugging, we would want to see why the delete and add happened for any new folks.
| podObj, ok := obj.(*corev1.Pod) | ||
| if !ok { | ||
| metrics.SendErrorLogAndMetric(util.NpmID, "ADD Pod: Received unexpected object type: %v", obj) | ||
| return key, needSync |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
About metrics.SEndErrorLogAndMetric().
I think we do not need send this information since it can be ignorable. If we need to keep it, may need to change "util.NpmID" -> util.PodID". Same opinion for metrics.SEndErrorLogAndMetric() in this function.
It would be nice to check this in the rest of code (e.g., deletePod(), it uses "util.NpmID" and is better to use ""util.PodID" ).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good catch changed this in 3 locations.
JungukCho
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for applying comments! Looks good to me.
Today, during a resource event handling, NPM will return immediately if it encounters an error and local NPM cache will not be updated. This results in leaked ipset/iptables state in kernel which will never be cleaned up since the local NPM cache is not aware of it.
Solution:
Build cache after each successful operation. Return on first error.
This PR, incorporates the above logic, NPM now keeps updating the local cache after every successful operation. NPM also ignores the notexists errors in Delete set operations.
Logging is updated to use klog instead of using acn/log.
if the Cached Pod IP changes, now NPM will delete the existing cached pod and adds a new pod.
CleanUpDeletedPod will now only require the key of the pod which needs to be deleted.