Azure-NPM Windows: rare race condition where removing a label from a Pod when the Pod is the last one using the label may cause Network Policy Create/Update to fail on the node of the given NPM Pod #1613

@huntergregory

Description

Rarity and Auto-Mitigation

When the exact condition in the title occurs, the chance of hitting the race condition is ~0.001% (the event must land in a ~0.2 second interval within a 5 minute window). Auto-mitigation is also likely, since the issue resolves when either:

  • the pod causing the issue is deleted
  • the label causing the issue is added to any pod

Symptoms

Lines similar to the following will appear in the NPM Pod logs (from kubectl logs -n kube-system <npm-pod>). In this example, the label with key pod2 was removed from Pod b in namespace x, and b was the last Pod using that label.

I0913 20:37:47.855816   38024 dataplane_windows.go:108] [DataPlane] updatePod called for Pod Key x/b
2022/09/13 20:37:47 [38024] failed to update pod while applying the dataplane. key: [x/b], err: [Operation [GetSelectorReference] failed with error code [999], full cmd [], full error [ipset manager] selector ipset podlabel-pod2 does not exist]

After this point, Network Policy events on the node of the given NPM Pod will be requeued until one of the auto-mitigation conditions above occurs.

For example, a requeued Network Policy event looks like:

I0913 20:38:51.402549   38024 dataplane_windows.go:108] [DataPlane] updatePod called for Pod Key x/b
E0913 20:38:51.402549   38024 networkPolicyController.go:195] error syncing 'x/allow-client-a-via-pod-selector': [syncNetPol] error: [cleanUpNetworkPolicy] Error: failed to remove policy due to [DataPlane] error while applying dataplane: [DataPlane] error while updating pods: %!w(<nil>) when network policy is not found, requeuing
2022/09/13 20:38:51 [38024] syncNetPol error due to error syncing 'x/allow-client-a-via-pod-selector': [syncNetPol] error: [cleanUpNetworkPolicy] Error: failed to remove policy due to [DataPlane] error while applying dataplane: [DataPlane] error while updating pods: %!!(MISSING)w(<nil>) when network policy is not found, requeuing

To Mitigate

Option 1: Restart Windows NPM on the impacted node (kubectl delete pod -n kube-system <npm-pod>).

Option 2: Delete the Pod where the label issue occurred.

Option 3: Add the label which caused the issue to any Pod.

RCA

For any Pod, Namespace, or Network Policy event, dp.ApplyDataplane() is called along the code path. Within dp.ApplyDataplane(), ipsetMgr.ApplyIPSets() is called before dp.updatePod(). The latter requires that the currently modified IPSets exist in the ipsetMgr's cache. In between these two calls, the background ipsetMgr reconcile thread can delete a required IPSet from the cache (and mark it for deletion), causing the subsequent dp.updatePod() call to fail. Currently, the Pod and Namespace controllers ignore dataplane errors, but the Network Policy controller requeues when it encounters this error. As a result, dp.AddPolicy() fails at dp.ApplyDataplane() before it can apply the policy to all relevant endpoints, while dp.RemovePolicy() successfully removes the policy from endpoints before failing at dp.ApplyDataplane().

Related Fix

The dataplane's pod update cache should not consider an IPSet modified for a Pod if the Pod's IP is removed from the set and then added back before dp.updatePod() succeeds (and vice versa for add-then-remove). This is especially likely when a Pod is unlabeled and relabeled while HNS transiently fails to refresh pod endpoints.
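A minimal sketch of this idea, assuming a hypothetical pending-update cache where opposite deltas on the same (set, IP) pair cancel out (podUpdateCache and its methods are illustrative, not the actual dataplane code):

```go
package main

import "fmt"

// podUpdateCache is a toy pending-update cache: per set, per IP, it tracks
// the net delta of pending operations (-1 remove, +1 add).
type podUpdateCache struct {
	pending map[string]map[string]int
}

func newPodUpdateCache() *podUpdateCache {
	return &podUpdateCache{pending: map[string]map[string]int{}}
}

func (c *podUpdateCache) record(set, ip string, delta int) {
	if c.pending[set] == nil {
		c.pending[set] = map[string]int{}
	}
	c.pending[set][ip] += delta
	// A remove followed by an add (or vice versa) nets to zero, so the
	// set is no longer considered modified for this Pod.
	if c.pending[set][ip] == 0 {
		delete(c.pending[set], ip)
		if len(c.pending[set]) == 0 {
			delete(c.pending, set)
		}
	}
}

// modifiedSets returns the sets that still have pending changes.
func (c *podUpdateCache) modifiedSets() []string {
	out := []string{}
	for set := range c.pending {
		out = append(out, set)
	}
	return out
}

func main() {
	c := newPodUpdateCache()
	c.record("podlabel-pod2", "10.0.0.5", -1) // label removed from Pod
	c.record("podlabel-pod2", "10.0.0.5", +1) // label added back before updatePod succeeds
	fmt.Println(len(c.modifiedSets())) // 0: nothing left to update, no stale set reference
}
```

With the deltas cancelled, a retried dp.updatePod() no longer references a set that the reconcile thread may have deleted in the meantime.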
