Azure-NPM Windows: rare race condition where removing a label from a Pod when the Pod is the last one using the label may cause Network Policy Create/Update to fail on the node of the given NPM Pod #1613

@huntergregory

Description

Rarity and Auto-Mitigation

When the exact condition in the title occurs, the chance of hitting the race condition is ~0.001% (the event must land in a ~0.2 second interval within a 5 minute window). Auto-mitigation is also likely, since the issue resolves when either:

  • the pod causing the issue is deleted
  • the label causing the issue is added to any pod

Symptoms

Lines similar to the following will appear in the NPM Pod logs (from kubectl logs -n kube-system <npm-pod>). In this example, the label with key pod2 was removed from Pod b in namespace x, and b was the last Pod using that label.

I0913 20:37:47.855816   38024 dataplane_windows.go:108] [DataPlane] updatePod called for Pod Key x/b
2022/09/13 20:37:47 [38024] failed to update pod while applying the dataplane. key: [x/b], err: [Operation [GetSelectorReference] failed with error code [999], full cmd [], full error [ipset manager] selector ipset podlabel-pod2 does not exist]

After this point, Network Policy events on the node of the given NPM Pod will be requeued until one of the auto-mitigation conditions above occurs.

For example, a requeued Network Policy event looks like:

I0913 20:38:51.402549   38024 dataplane_windows.go:108] [DataPlane] updatePod called for Pod Key x/b
E0913 20:38:51.402549   38024 networkPolicyController.go:195] error syncing 'x/allow-client-a-via-pod-selector': [syncNetPol] error: [cleanUpNetworkPolicy] Error: failed to remove policy due to [DataPlane] error while applying dataplane: [DataPlane] error while updating pods: %!w(<nil>) when network policy is not found, requeuing
2022/09/13 20:38:51 [38024] syncNetPol error due to error syncing 'x/allow-client-a-via-pod-selector': [syncNetPol] error: [cleanUpNetworkPolicy] Error: failed to remove policy due to [DataPlane] error while applying dataplane: [DataPlane] error while updating pods: %!!(MISSING)w(<nil>) when network policy is not found, requeuing

To Mitigate

Option 1: Restart Windows NPM on the impacted node (kubectl delete pod -n kube-system <npm-pod>).

Option 2: Delete the Pod where the label issue occurred.

Option 3: Add the label which caused the issue to any Pod.

RCA

For any Pod, Namespace, or Network Policy event, dp.ApplyDataplane() is called along the code path. Within dp.ApplyDataplane(), ipsetMgr.ApplyIPSets() is called before dp.updatePod(). The latter requires that the currently modified IPSets exist in the ipsetMgr's cache. In between these two calls, the background ipsetMgr reconcile thread can delete a required IPSet from the cache (and mark it for deletion), causing the subsequent dp.updatePod() call to fail. Currently, the Pod and Namespace controllers ignore dataplane errors, but the Network Policy controller requeues when it encounters this error. As a result, dp.AddPolicy() fails at dp.ApplyDataplane() before it can apply the policy to all relevant endpoints, while dp.RemovePolicy() successfully removes the policy from endpoints before failing at dp.ApplyDataplane().

Related Fix

The dataplane's pod update cache should not consider an IPSet modified for a Pod if the Pod's IP is removed from the set and then added back before dp.updatePod() succeeds (and vice versa for add-then-remove). This is especially likely when a Pod is unlabeled and relabeled while HNS transiently fails to refresh pod endpoints.
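A minimal sketch of this idea, assuming a hypothetical pending-update cache where opposite deltas on the same (set, IP) pair cancel out (podUpdateCache and its methods are illustrative, not the actual dataplane code):

```go
package main

import "fmt"

// podUpdateCache is a toy pending-update cache: per set, per IP, it tracks
// the net delta of pending operations (-1 remove, +1 add).
type podUpdateCache struct {
	pending map[string]map[string]int
}

func newPodUpdateCache() *podUpdateCache {
	return &podUpdateCache{pending: map[string]map[string]int{}}
}

func (c *podUpdateCache) record(set, ip string, delta int) {
	if c.pending[set] == nil {
		c.pending[set] = map[string]int{}
	}
	c.pending[set][ip] += delta
	// A remove followed by an add (or vice versa) nets to zero, so the
	// set is no longer considered modified for this Pod.
	if c.pending[set][ip] == 0 {
		delete(c.pending[set], ip)
		if len(c.pending[set]) == 0 {
			delete(c.pending, set)
		}
	}
}

// modifiedSets returns the sets that still have pending changes.
func (c *podUpdateCache) modifiedSets() []string {
	out := []string{}
	for set := range c.pending {
		out = append(out, set)
	}
	return out
}

func main() {
	c := newPodUpdateCache()
	c.record("podlabel-pod2", "10.0.0.5", -1) // label removed from Pod
	c.record("podlabel-pod2", "10.0.0.5", +1) // label added back before updatePod succeeds
	fmt.Println(len(c.modifiedSets())) // 0: nothing left to update, no stale set reference
}
```

With the deltas cancelled, a retried dp.updatePod() no longer references a set that the reconcile thread may have deleted in the meantime.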
