fix: [NPM-LINUX] resiliency for several non-retriable errors #1566

huntergregory · 2022-08-31T20:21:04Z

Currently, if a cluster has two or more persistent errors in dirty IPSets, new changes may not get applied.

This PR also enhances logging:

~~print each line in the restore file~~ (commented out until perf is understood)
sending AppInsights error logs

Fixes:

Fix bug in error handling after a second error
Increase the max retry limit to 5.


huntergregory · 2022-11-02T01:07:48Z

npm/util/ioutil/restore.go

-		if currentLineIndex == lineNum-1 {
-			lineIndex = currentLineIndex
+		if currentLineNum == lineNum {
+			lineIndex = i


Bug fix here that affects the second and later retries.

Previously, an error reported on line 5 could be mapped to the original line 5 instead of the new line 5 (the fifth line remaining after lines were skipped on retry). This could lead to the wrong line-error handling being used on the second retry.
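To illustrate the off-by-skipped-lines problem described above, here is a minimal sketch, not the actual `restore.go` code. The `line` type, its `omit` flag, and the `originalIndex` helper are hypothetical names chosen for this example; the only idea taken from the diff is that the reported line number must be counted against the lines still present in the current file.

```go
package main

import "fmt"

// line is a hypothetical stand-in for one entry in a restore file.
type line struct {
	content string
	omit    bool // true once an earlier try decided to skip this line
}

// originalIndex returns the index into lines of the lineNum-th (1-based)
// line that is still active, or -1 if no such line exists.
func originalIndex(lines []line, lineNum int) int {
	currentLineNum := 0
	for i, l := range lines {
		if l.omit {
			continue // skipped lines do not count toward the current numbering
		}
		currentLineNum++
		if currentLineNum == lineNum {
			return i
		}
	}
	return -1
}

func main() {
	lines := []line{
		{"-F set-a", true},
		{"-F set-b", false},
		{"-X set-a", true},
		{"-X set-b", false},
		{"-F set-c", false},
	}
	// An error reported for "line 2" of the current file refers to
	// the entry at index 3 of the original slice, not index 1.
	fmt.Println(originalIndex(lines, 2))
}
```

Comparing against a running count of active lines, rather than the raw slice index, keeps the mapping correct no matter how many lines earlier tries removed.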

huntergregory · 2022-11-02T17:22:14Z

Example Logs:

I1102 00:11:20.866743       1 podController.go:417] [syncAddAndUpdatePod] updating Pod with key x/a
I1102 00:11:20.866769       1 podController.go:456] Deleting pod x/a (ip : 10.224.0.15) from ipset debug:true
I1102 00:11:20.866793       1 podController.go:478] Creating ipset debug:false if it doesn't already exist
I1102 00:11:20.866800       1 podController.go:487] Adding pod x/a (ip : 10.224.0.15) to ipset debug:false
I1102 00:11:20.866846       1 ipsetmanager.go:461] [IPSetManager] dirty caches. toAddUpdateCache: to create: [podlabel-debug:false: &{membersToAdd:map[10.224.0.15:{}] membersToDelete:map[]},podlabel-debug: &{membersToAdd:map[10.224.0.15:{}] membersToDelete:map[]},podlabel-debug:true: &{membersToAdd:map[] membersToDelete:map[]}], to update: [], toDeleteCache: map[podlabel-k1:0xc000952450 podlabel-k1:v1:0xc000952430 podlabel-k2:0xc000952490 podlabel-k2:v2:0xc000952460 podlabel-k3:0xc000952420 podlabel-k3:v3:0xc000952470]
I1102 00:11:20.866918       1 restore.go:183] running this restore command: [ipset restore]
I1102 00:11:20.866924       1 restore.go:319] line 1 of restore command [ipset restore] with section ID [add/update-podlabel-debug:false]: [-N azure-npm-3081258205 --exist nethash]
I1102 00:11:20.866930       1 restore.go:319] line 2 of restore command [ipset restore] with section ID [add/update-podlabel-debug]: [-N azure-npm-2955716136 --exist nethash]
I1102 00:11:20.866934       1 restore.go:319] line 3 of restore command [ipset restore] with section ID [add/update-podlabel-debug:true]: [-N azure-npm-3853671530 --exist nethash]
I1102 00:11:20.866938       1 restore.go:319] line 4 of restore command [ipset restore] with section ID [add/update-podlabel-debug:false]: [-A azure-npm-3081258205 10.224.0.15]
I1102 00:11:20.866942       1 restore.go:319] line 5 of restore command [ipset restore] with section ID [add/update-podlabel-debug]: [-A azure-npm-2955716136 10.224.0.15]
I1102 00:11:20.866946       1 restore.go:319] line 6 of restore command [ipset restore] with section ID [delete-podlabel-k2]: [-F azure-npm-3241175160]
I1102 00:11:20.866950       1 restore.go:319] line 7 of restore command [ipset restore] with section ID [delete-podlabel-k3]: [-F azure-npm-3257952779]
I1102 00:11:20.866954       1 restore.go:319] line 8 of restore command [ipset restore] with section ID [delete-podlabel-k1:v1]: [-F azure-npm-2867360572]
I1102 00:11:20.866958       1 restore.go:319] line 9 of restore command [ipset restore] with section ID [delete-podlabel-k1]: [-F azure-npm-3291508017]
I1102 00:11:20.866962       1 restore.go:319] line 10 of restore command [ipset restore] with section ID [delete-podlabel-k2:v2]: [-F azure-npm-646662566]
I1102 00:11:20.866966       1 restore.go:319] line 11 of restore command [ipset restore] with section ID [delete-podlabel-k3:v3]: [-F azure-npm-2649853620]
I1102 00:11:20.866970       1 restore.go:319] line 12 of restore command [ipset restore] with section ID [delete-podlabel-k1]: [-X azure-npm-3291508017]
I1102 00:11:20.866974       1 restore.go:319] line 13 of restore command [ipset restore] with section ID [delete-podlabel-k2:v2]: [-X azure-npm-646662566]
I1102 00:11:20.866978       1 restore.go:319] line 14 of restore command [ipset restore] with section ID [delete-podlabel-k3:v3]: [-X azure-npm-2649853620]
I1102 00:11:20.866982       1 restore.go:319] line 15 of restore command [ipset restore] with section ID [delete-podlabel-k2]: [-X azure-npm-3241175160]
I1102 00:11:20.866986       1 restore.go:319] line 16 of restore command [ipset restore] with section ID [delete-podlabel-k3]: [-X azure-npm-3257952779]
I1102 00:11:20.866990       1 restore.go:319] line 17 of restore command [ipset restore] with section ID [delete-podlabel-k1:v1]: [-X azure-npm-2867360572]
I1102 00:11:20.869419       1 restore.go:295] continuing after line 8 and aborting section [delete-podlabel-k1:v1] for command [ipset restore]
I1102 00:11:20.869432       1 ipsetmanager_linux.go:689] skipping flush and upcoming destroy for set podlabel-k1:v1 since the set doesn't exist
I1102 00:11:20.869520       1 restore.go:183] running this restore command: [ipset restore]
I1102 00:11:20.869531       1 restore.go:343] on try 1 of restore command [ipset restore]. mapping of current line numbers to original line numbers: [1->9 2->10 3->11 4->12 5->13 6->14 7->15 8->16]
2022/11/02 00:11:20 [1] error: on try number 2, failed to run command [ipset restore]. Rerunning with updated file. err: [line-number error for line [-F azure-npm-646662566]: error running command [ipset restore] with err [exit status 1] and stdErr [ipset v7.5: Error in line 2: The set with the given name does not exist
]]
I1102 00:11:20.872200       1 restore.go:295] continuing after line 2 and aborting section [delete-podlabel-k2:v2] for command [ipset restore]
I1102 00:11:20.872220       1 ipsetmanager_linux.go:689] skipping flush and upcoming destroy for set podlabel-k2:v2 since the set doesn't exist
I1102 00:11:20.872302       1 restore.go:183] running this restore command: [ipset restore]
I1102 00:11:20.872351       1 restore.go:343] on try 2 of restore command [ipset restore]. mapping of current line numbers to original line numbers: [1->11 2->12 3->14 4->15 5->16]
2022/11/02 00:11:20 [1] error: on try number 3, failed to run command [ipset restore]. Rerunning with updated file. err: [line-number error for line [-F azure-npm-2649853620]: error running command [ipset restore] with err [exit status 1] and stdErr [ipset v7.5: Error in line 1: The set with the given name does not exist
]]
I1102 00:11:20.874102       1 restore.go:295] continuing after line 1 and aborting section [delete-podlabel-k3:v3] for command [ipset restore]
I1102 00:11:20.874131       1 ipsetmanager_linux.go:689] skipping flush and upcoming destroy for set podlabel-k3:v3 since the set doesn't exist
I1102 00:11:20.874233       1 restore.go:183] running this restore command: [ipset restore]
I1102 00:11:20.874245       1 restore.go:343] on try 3 of restore command [ipset restore]. mapping of current line numbers to original line numbers: [1->12 2->15 3->16]
I1102 00:11:20.902260       1 restore.go:153] successfully ran command [ipset restore] on try number 4
I1102 00:11:20.902293       1 podController.go:234] Successfully synced 'x/a'
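The logs above show `ipset` reporting failures as `Error in line N:` in stderr. As a rough sketch of how such a line number could be extracted (this is not the code in the PR; `parseErrorLine` and `lineErrorPattern` are hypothetical names), assuming the stderr format seen in these logs:

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// lineErrorPattern matches the "Error in line N:" fragment seen in the
// ipset stderr output in the example logs.
var lineErrorPattern = regexp.MustCompile(`Error in line (\d+):`)

// parseErrorLine returns the 1-based line number reported in stdErr.
// The boolean is false when no line number is present.
func parseErrorLine(stdErr string) (int, bool) {
	m := lineErrorPattern.FindStringSubmatch(stdErr)
	if m == nil {
		return 0, false
	}
	n, err := strconv.Atoi(m[1])
	if err != nil {
		return 0, false
	}
	return n, true
}

func main() {
	stdErr := "ipset v7.5: Error in line 2: The set with the given name does not exist\n"
	n, ok := parseErrorLine(stdErr)
	fmt.Println(n, ok)
}
```

The number returned refers to the file used for the current try, so it still has to be mapped back to the original line numbers, as the "mapping of current line numbers to original line numbers" log entries show.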

vakalapa · 2022-11-03T20:43:20Z

npm/pkg/dataplane/ipsets/ipsetmanager.go

 	if err != nil {
-		metrics.SendErrorLogAndMetric(util.IpsmID, "error: failed to apply ipsets: %s", err.Error())
+		// exponentially increase maxRestoreTryCount
+		iMgr.maxRestoreTryCount *= 2


Why do we want to exponentially increase here?

This just results in 10 tries; we can remove this and change the largest retry count to 10.

discussed offline. setting a constant limit of 5

let's reconsider the perf impacts of dumping all ipset lines

* adaptively modify linux max restore try count to prevent perpetual errors
* remove debug print
* log restore file and send ipsetmanager_linux errors
* send other appropriate errors
* fix handleLineError function
* fix printing restore lines and enhance a log
* fix lints and wrap chainLineNumber errors
* fix one off error for logging the try count
* revert exponential increase to try limit
* update try count to 5 and update UTs
* do not log lines for every restore call until perf is understood

adaptively modify linux max restore try count to prevent perpetual er…

78462b8

…rors

huntergregory added the npm Related to NPM. label Aug 31, 2022

huntergregory requested a review from a team as a code owner August 31, 2022 20:21

huntergregory requested review from ck319 and removed request for a team August 31, 2022 20:21

huntergregory changed the title ~~fix: [NPM] modify linux max restore try count to prevent perpetual errors~~ fix: [NPM] modify linux max restore try count in case of perpetual errors Sep 8, 2022

huntergregory changed the title ~~fix: [NPM] modify linux max restore try count in case of perpetual errors~~ fix: [NPM] modify linux max restore try count in case of several non-retriable errors Sep 8, 2022

Merge branch 'master' into npm-exponential-try-count

9fbbd04

huntergregory changed the title ~~fix: [NPM] modify linux max restore try count in case of several non-retriable errors~~ fix: [NPM] increase linux retry limit exponentially in case of several non-retriable errors Oct 27, 2022

huntergregory added the linux label Oct 27, 2022

huntergregory changed the title ~~fix: [NPM] increase linux retry limit exponentially in case of several non-retriable errors~~ fix: [NPM-LINUX] exponentially increase retry limit exponentially in case of several non-retriable errors Oct 27, 2022

huntergregory added 6 commits October 27, 2022 21:18

remove debug print

8b28fa9

log restore file and send ipsetmanager_linux errors

e5c8f38

Merge branch 'npm-exponential-try-count' of https://github.com/Azure/…

f9e801c

…azure-container-networking into npm-exponential-try-count

send other appropriate errors

cb73cfe

fix handleLineError function

f4eb47e

fix printing restore lines and enhance a log

e4d873d

huntergregory changed the title ~~fix: [NPM-LINUX] exponentially increase retry limit exponentially in case of several non-retriable errors~~ fix: [NPM-LINUX] resiliency for several non-retriable errors Nov 1, 2022

huntergregory added 3 commits November 1, 2022 11:15

fix lints and wrap chainLineNumber errors

435b238

Merge branch 'master' into npm-exponential-try-count

f507212

fix one off error for logging the try count

44c7bd1

huntergregory commented Nov 2, 2022

vakalapa reviewed Nov 3, 2022

huntergregory added 2 commits November 3, 2022 14:29

revert exponential increase to try limit

8757b3b

update try count to 5 and update UTs

35d2d51

matmerr previously approved these changes Nov 8, 2022

do not log lines for every restore call until perf is understood

477e53a

matmerr approved these changes Nov 22, 2022

vakalapa merged commit 8cc8e7f into master Nov 23, 2022

vakalapa deleted the npm-exponential-try-count branch November 23, 2022 18:38
