@huntergregory
Contributor

Reason for Change:
Prevent a race condition during a narrow window at bootup where we could miss adding a NetworkPolicy to Pods (missed in both updatePod() and AddPolicy()).

Issue Fixed:

Requirements:

Notes:
The race can occur if the NetworkPolicy controller applies the dataplane while the bootup phase is finishing.

The race can produce logs similar to the following:

I0621 23:07:54.371058    4184 dataplane.go:471] [DataPlane] Update Policy called for x/base
I0621 23:07:54.371058    4184 dataplane.go:474] [DataPlane] Policy x/base is not found.
I0621 23:07:54.371058    4184 dataplane.go:367] [DataPlane] Add Policy called for x/base
I0621 23:07:54.371058    4184 dataplane.go:292] [DataPlane] [ADD-NETPOL] new batch count: 3
I0621 23:07:54.371058    4184 dataplane.go:295] [DataPlane] [ADD-NETPOL] applying now since reached maximum batch count: 3
I0621 23:07:54.371058    4184 dataplane.go:303] [DataPlane] [ApplyDataPlane] [ADD-NETPOL] starting to apply ipsets
I0621 23:07:54.371058    4184 ipsetmanager.go:455] [IPSetManager] dirty caches. toAddUpdateCache: to create: [nslabel-k2: &{},podlabel-k1: &{},ns-x: &{},nslabel-all-namespaces: &{},empty-emptyhashset: &{},ns-y: &{},nslabel-k1: &{},nslabel-k1:v1: &{},nslabel-k2:v2: &{},podlabel-k1:v1: &{}], to update: [], toDeleteCache: map[]
I0621 23:07:54.371058    4184 dataplane.go:137] [DataPlane] finished bootup phase
I0621 23:07:54.435247    4184 dataplane.go:303] [DataPlane] [ApplyDataPlane] [BACKGROUND] starting to apply ipsets
I0621 23:07:54.435247    4184 ipsetmanager_windows.go:331] [IPSetManager Windows] Add operation on set policies is called
I0621 23:07:54.435247    4184 ipsetmanager_windows.go:431] [Dataplane Windows] marshalling IPSET(s)
I0621 23:07:54.435247    4184 ipsetmanager_windows.go:363] [IPSetManager Windows] modifying network settings. operation: Add, policyType: IPSET
I0621 23:07:54.489675    4184 ipsetmanager_windows.go:431] [Dataplane Windows] marshalling NESTEDIPSET(s)
I0621 23:07:54.489675    4184 ipsetmanager_windows.go:363] [IPSetManager Windows] modifying network settings. operation: Add, policyType: NESTEDIPSET
I0621 23:07:54.546808    4184 ipsetmanager_windows.go:235] [IPSetManager Windows] Done applying IPSets.
I0621 23:07:54.546808    4184 dataplane.go:308] [DataPlane] [ApplyDataPlane] [ADD-NETPOL] finished applying ipsets
I0621 23:07:54.546808    4184 dataplane.go:326] [DataPlane] [ApplyDataPlane] [ADD-NETPOL] refreshing endpoints before updating pods
I0621 23:07:54.546808    4184 dataplane_windows.go:344] getting local endpoints
I0621 23:07:54.546808    4184 ipsetmanager.go:451] [IPSetManager] No IPSets to apply
I0621 23:07:54.546808    4184 dataplane.go:308] [DataPlane] [ApplyDataPlane] [BACKGROUND] finished applying ipsets
I0621 23:07:54.546808    4184 dataplane.go:326] [DataPlane] [ApplyDataPlane] [BACKGROUND] refreshing endpoints before updating pods
I0621 23:07:54.546808    4184 dataplane_windows.go:344] getting local endpoints
I0621 23:07:54.642138    4184 dataplane_windows.go:407] updating endpoint cache to include 10.0.0.1: &{name:test1 id:test1 ip:10.0.0.1 podKey: previousIncorrectPodKey: netPolReference:map[]}
I0621 23:07:54.642138    4184 dataplane.go:335] [DataPlane] [ApplyDataPlane] [ADD-NETPOL] refreshed endpoints
I0621 23:07:54.642138    4184 dataplane.go:342] [DataPlane] [ApplyDataPlane] [ADD-NETPOL] starting to update pods
I0621 23:07:54.642138    4184 dataplane_windows.go:134] [DataPlane] updatePod called. podKey: x/a
I0621 23:07:54.642138    4184 dataplane_windows.go:155] [DataPlane] associating pod with endpoint. podKey: x/a. endpoint: &{name:test1 id:test1 ip:10.0.0.1 podKey: previousIncorrectPodKey: netPolReference:map[]}
I0621 23:07:54.642138    4184 dataplane_windows.go:252] [DataPlane] while updating pod, policy is referenced but does not exist. pod: [x/a], policy: [x/base], set [ns-x]
I0621 23:07:54.642138    4184 dataplane_windows.go:252] [DataPlane] while updating pod, policy is referenced but does not exist. pod: [x/a], policy: [x/base], set [podlabel-k1:v1]
I0621 23:07:54.642138    4184 dataplane.go:360] [DataPlane] [ApplyDataPlane] [ADD-NETPOL] finished updating pods
I0621 23:07:54.642138    4184 policymanager_windows.go:259] [PolicyManagerWindows] No Endpoints to apply policy x/base on
I0621 23:07:54.696492    4184 dataplane.go:335] [DataPlane] [ApplyDataPlane] [BACKGROUND] refreshed endpoints
I0621 23:07:54.696492    4184 dataplane.go:342] [DataPlane] [ApplyDataPlane] [BACKGROUND] starting to update pods
I0621 23:07:54.696492    4184 dataplane.go:360] [DataPlane] [ApplyDataPlane] [BACKGROUND] finished updating pods

UT Failure:

--- FAIL: TestMultiJobApplyInBackground (1.43s)
    --- FAIL: TestMultiJobApplyInBackground/create_namespaces,_pods,_and_a_policy_which_applies_to_a_pod (1.43s)
        dataplane_windows_test.go:134: beginning test #0. Description: [create namespaces, pods, and a policy which applies to a pod]. Tags: [namespace-crud pod-crud netpol-crud apply-in-background]
        dataplane_windows_test.go:187: 
            	Error Trace:	D:/a/_work/1/s/npm/pkg/dataplane/testutils/utils_windows.go:123
            	            				D:/a/_work/1/s/npm/pkg/dataplane/testutils/utils_windows.go:76
            	            				D:/a/_work/1/s/npm/pkg/dataplane/dataplane_windows_test.go:187
            	Error:      	Not equal: 
            	            	expected: 3
            	            	actual  : 0
            	Test:       	TestMultiJobApplyInBackground/create_namespaces,_pods,_and_a_policy_which_applies_to_a_pod
            	Messages:   	unexpected number of ACLs for Endpoint with ID: test1
        dataplane_windows_test.go:187: 
            	Error Trace:	D:/a/_work/1/s/npm/pkg/dataplane/testutils/utils_windows.go:79
            	            				D:/a/_work/1/s/npm/pkg/dataplane/dataplane_windows_test.go:187
            	Error:      	hns cache had unexpected state. printing hns cache...
            	            	networks: [ID: 1234, Name: azure, SetPolicies: [[&{Id:azure-npm-2083743908 Name:nslabel-k2:v2 PolicyType:NESTEDIPSET Values:azure-npm-3530799710,azure-npm-2837910840}],[&{Id:azure-npm-3790070506 Name:nslabel-k1:v1 PolicyType:NESTEDIPSET Values:azure-npm-3530799710,azure-npm-2854688459}],[&{Id:azure-npm-2867360572 Name:podlabel-k1:v1 PolicyType:IPSET Values:10.0.0.2,10.0.0.1}],[&{Id:azure-npm-2854688459 Name:ns-x PolicyType:IPSET Values:10.0.0.1}],[&{Id:azure-npm-3530799710 Name:empty-emptyhashset PolicyType:IPSET Values:}],[&{Id:azure-npm-2837910840 Name:ns-y PolicyType:IPSET Values:10.0.0.2}],[&{Id:azure-npm-1639206293 Name:nslabel-all-namespaces PolicyType:NESTEDIPSET Values:azure-npm-2837910840,azure-npm-3530799710,azure-npm-2854688459}],[&{Id:azure-npm-3821728622 Name:nslabel-k2 PolicyType:NESTEDIPSET Values:azure-npm-3530799710,azure-npm-2837910840}],[&{Id:azure-npm-3804951003 Name:nslabel-k1 PolicyType:NESTEDIPSET Values:azure-npm-3530799710,azure-npm-2854688459}],[&{Id:azure-npm-3291508017 Name:podlabel-k1 PolicyType:IPSET Values:10.0.0.1,10.0.0.2}]]]
            	            	endpoints: [ID: test1, Name: test1, IP: 10.0.0.1, ACLs: []],[ID: test2, Name: test2, IP: 10.0.0.2, ACLs: []]
            	Test:       	TestMultiJobApplyInBackground/create_namespaces,_pods,_and_a_policy_which_applies_to_a_pod

@huntergregory huntergregory added npm Related to NPM. windows labels Jun 22, 2023
@huntergregory huntergregory requested a review from a team as a code owner June 22, 2023 19:20
@huntergregory huntergregory requested a review from matmerr June 22, 2023 19:20
if newCount >= dp.ApplyMaxBatches {
	klog.Infof("[DataPlane] [%s] applying now since reached maximum batch count: %d", contextAddNetPolBootup, newCount)
	klog.Infof("[DataPlane] [%s] starting to apply ipsets", contextAddNetPolBootup)
	err := dp.ipsetMgr.ApplyIPSets()
Contributor Author

During the bootup phase, we only need to apply IPSets. There is no need to perform the other operations in applyDataPlaneNow():

  1. Refresh Endpoints
  2. Check updatePodCache for dirty Pods to add NetPols to
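
As a rough illustration of the comment above, the sketch below models a batch threshold that triggers only an IPSet apply during bootup and a full apply afterwards. Everything here is an assumption for illustration: `fakeIPSets`, the counter fields, `onPolicyBatched`, and resetting the batch count after an apply are hypothetical and not taken from the NPM code. Only the names `ApplyIPSets`, `applyDataPlaneNow`, and the idea of a maximum batch count come from the snippet under review.

```go
package main

import "fmt"

// ipsetApplier is a hypothetical stand-in for the IPSet manager.
type ipsetApplier interface {
	ApplyIPSets() error
}

// fakeIPSets counts how many times ApplyIPSets is called.
type fakeIPSets struct{ applied int }

func (f *fakeIPSets) ApplyIPSets() error {
	f.applied++
	return nil
}

// dataPlane is a simplified, hypothetical model, not the real NPM DataPlane.
type dataPlane struct {
	ipsetMgr          ipsetApplier
	applyMaxBatches   int
	batchCount        int
	inBootupPhase     bool
	endpointRefreshes int // counts simulated endpoint refreshes
	podUpdates        int // counts simulated passes over dirty pods
}

// applyDataPlaneNow models the full apply: IPSets, endpoint refresh, pod updates.
func (dp *dataPlane) applyDataPlaneNow() error {
	if err := dp.ipsetMgr.ApplyIPSets(); err != nil {
		return err
	}
	dp.endpointRefreshes++
	dp.podUpdates++
	return nil
}

// onPolicyBatched is called once per batched policy. When the batch limit is
// reached it applies only IPSets during bootup, and everything otherwise.
func (dp *dataPlane) onPolicyBatched() error {
	dp.batchCount++
	if dp.batchCount < dp.applyMaxBatches {
		return nil
	}
	dp.batchCount = 0 // assumption: the count resets after an apply
	if dp.inBootupPhase {
		return dp.ipsetMgr.ApplyIPSets()
	}
	return dp.applyDataPlaneNow()
}

func main() {
	sets := &fakeIPSets{}
	dp := &dataPlane{ipsetMgr: sets, applyMaxBatches: 3, inBootupPhase: true}
	for i := 0; i < 3; i++ {
		if err := dp.onPolicyBatched(); err != nil {
			fmt.Println("apply failed:", err)
			return
		}
	}
	fmt.Println(sets.applied, dp.endpointRefreshes, dp.podUpdates) // 1 0 0
}
```

In this model, skipping the endpoint refresh and pod update during bootup means the policy path never touches pod state before the bootup phase has ended.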

@huntergregory huntergregory changed the title fix: [WIN-NPM] prevent AddPolicy race at bootup fix: [WIN-NPM] race during bootup where we may not add one NetPol to a Pod Jun 22, 2023
@huntergregory huntergregory enabled auto-merge (squash) June 23, 2023 16:58
@vakalapa vakalapa disabled auto-merge June 23, 2023 17:03
@vakalapa vakalapa merged commit 36b67b4 into master Jun 23, 2023
@vakalapa vakalapa deleted the hgregory/06-22-bootup-lock branch June 23, 2023 17:03