Description
Rarity and Auto-Mitigation
When the exact condition in the title occurs, the chance of hitting the race condition is ~0.001% (hitting a ~0.2 second interval within a 5-minute window). The issue is also likely to auto-mitigate, since it is resolved when:
- the pod causing the issue is deleted
- the label causing the issue is added to any pod
Symptoms
Lines similar to the following will appear in the NPM Pod logs (from kubectl logs -n kube-system <npm-pod>). In this example, the label with key pod2 was removed from Pod b in namespace x.
I0913 20:37:47.855816 38024 dataplane_windows.go:108] [DataPlane] updatePod called for Pod Key x/b
2022/09/13 20:37:47 [38024] failed to update pod while applying the dataplane. key: [x/b], err: [Operation [GetSelectorReference] failed with error code [999], full cmd [], full error [ipset manager] selector ipset podlabel-pod2 does not exist]
After this point, Network Policy events on the node of the given NPM Pod will be requeued until one of the auto-mitigation steps above occurs.
Example of a Network Policy event requeuing:
I0913 20:38:51.402549 38024 dataplane_windows.go:108] [DataPlane] updatePod called for Pod Key x/b
E0913 20:38:51.402549 38024 networkPolicyController.go:195] error syncing 'x/allow-client-a-via-pod-selector': [syncNetPol] error: [cleanUpNetworkPolicy] Error: failed to remove policy due to [DataPlane] error while applying dataplane: [DataPlane] error while updating pods: %!w(<nil>) when network policy is not found, requeuing
2022/09/13 20:38:51 [38024] syncNetPol error due to error syncing 'x/allow-client-a-via-pod-selector': [syncNetPol] error: [cleanUpNetworkPolicy] Error: failed to remove policy due to [DataPlane] error while applying dataplane: [DataPlane] error while updating pods: %!!(MISSING)w(<nil>) when network policy is not found, requeuing
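To check whether a node is hitting this issue, you can grep the NPM Pod logs for the characteristic failure signature. The snippet below demonstrates the pattern against a sample log line; in practice, pipe the actual pod logs into the same grep (the pod name is a placeholder):

```shell
# In practice, run against the real logs:
#   kubectl logs -n kube-system <npm-pod> | grep -E 'selector ipset .* does not exist'
#
# Demonstration of the same pattern against a sample log line:
sample='failed to update pod while applying the dataplane. key: [x/b], err: [... selector ipset podlabel-pod2 does not exist]'
echo "$sample" | grep -oE 'selector ipset [^ ]+ does not exist'
```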
To Mitigate
Option 1: Restart Windows NPM on the impacted node (kubectl delete pod -n kube-system <npm-pod>).
Option 2: Delete the Pod where the label issue occurred.
Option 3: Add the label which caused the issue to any Pod.
RCA
For any Pod, Namespace, or Network Policy event, dp.ApplyDataplane() is called along the code path. Within dp.ApplyDataplane(), ipsetMgr.ApplyIPSets() is called before dp.updatePod(). The latter requires that the currently modified IPSets exist in the ipsetMgr's cache. In between these two function calls, the background ipsetMgr reconcile thread can delete a required IPSet from the cache (and mark it for deletion), so the dp.updatePod() call then fails. Currently, the Pod and Namespace controllers ignore dp errors, but the Network Policy controller requeues when it encounters this error. As a result, dp.AddPolicy() fails at dp.ApplyDataplane() before it can apply the policy to all relevant endpoints, whereas dp.RemovePolicy() successfully removes the policy from endpoints before failing at dp.ApplyDataplane().
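The interleaving described above can be sketched as follows. This is a minimal, hypothetical model of the ipset manager cache (the type and method names are illustrative, not the real azure-npm types); it forces the three steps to run in the problematic order to show why updatePod fails:

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ipsetManager is a simplified stand-in for NPM's ipset manager cache.
type ipsetManager struct {
	mu    sync.Mutex
	cache map[string]bool
}

// applyIPSets models ipsetMgr.ApplyIPSets(): the modified sets land in the cache.
func (m *ipsetManager) applyIPSets(names ...string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	for _, n := range names {
		m.cache[n] = true
	}
}

// reconcile models the background reconcile thread garbage-collecting
// an unreferenced IPSet from the cache.
func (m *ipsetManager) reconcile(unreferenced string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	delete(m.cache, unreferenced)
}

// updatePod models dp.updatePod(): it requires every modified set to
// still be present in the cache.
func (m *ipsetManager) updatePod(sets ...string) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	for _, n := range sets {
		if !m.cache[n] {
			return errors.New("selector ipset " + n + " does not exist")
		}
	}
	return nil
}

func main() {
	m := &ipsetManager{cache: map[string]bool{}}

	// Inside dp.ApplyDataplane(): ApplyIPSets runs first...
	m.applyIPSets("podlabel-pod2")

	// ...the reconcile thread fires in the gap and deletes the set...
	m.reconcile("podlabel-pod2")

	// ...so updatePod fails, and the Network Policy controller requeues.
	if err := m.updatePod("podlabel-pod2"); err != nil {
		fmt.Println("updatePod error:", err)
	}
}
```

In the real code the reconcile thread runs concurrently, which is why the window is so narrow; the sketch serializes the steps only to make the bad ordering reproducible.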
Related Fix
The dataplane's pod update cache should not consider an IPSet modified for a Pod if the Pod's IP is removed from the set and then added back (or vice versa) before dp.updatePod() succeeds. This can happen especially when a Pod is unlabeled and then relabeled while HNS transiently fails to refresh pod endpoints.
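One way to realize this fix is to track the net membership change per (pod, ipset) pair, so that a remove followed by an add-back cancels out. The sketch below is a hypothetical illustration under that assumption, not the actual azure-npm implementation:

```go
package main

import "fmt"

// podUpdateCache tracks pending IPSet membership changes for one Pod.
// +1 = pending add, -1 = pending remove; a zero net change is dropped.
type podUpdateCache struct {
	pending map[string]int
}

func newPodUpdateCache() *podUpdateCache {
	return &podUpdateCache{pending: map[string]int{}}
}

func (c *podUpdateCache) addToSet(set string)      { c.record(set, +1) }
func (c *podUpdateCache) removeFromSet(set string) { c.record(set, -1) }

func (c *podUpdateCache) record(set string, delta int) {
	c.pending[set] += delta
	if c.pending[set] == 0 {
		// A remove+add (or add+remove) pair cancels out: the set is no
		// longer "modified" for this Pod, so a transient HNS failure in
		// between cannot leave updatePod referencing a deleted IPSet.
		delete(c.pending, set)
	}
}

// modifiedSets returns the sets with a net pending change, i.e. the only
// sets updatePod would still need to find in the ipset manager's cache.
func (c *podUpdateCache) modifiedSets() []string {
	sets := make([]string, 0, len(c.pending))
	for s := range c.pending {
		sets = append(sets, s)
	}
	return sets
}

func main() {
	c := newPodUpdateCache()
	// Pod b is unlabeled, then relabeled before dp.updatePod() succeeds:
	c.removeFromSet("podlabel-pod2")
	c.addToSet("podlabel-pod2")
	fmt.Println("modified sets:", c.modifiedSets()) // no net change remains
}
```

With this bookkeeping, the relabel scenario described above leaves no pending change, so a retried dp.updatePod() no longer depends on an IPSet that the reconcile thread may have deleted.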