Conversation

Contributor

@huntergregory huntergregory commented Aug 31, 2022

Currently, if a cluster has two or more persistent errors in dirty IPSets, new changes may not get applied.

This PR also enhances logging:

  • print each line in the restore file (commented out until perf is understood)
  • send error logs to AppInsights

Fixes:

  1. Fix bug in error handling after a second error
  2. Increase the max retry limit to 5.

@huntergregory huntergregory added the npm label (Related to NPM.) Aug 31, 2022
@huntergregory huntergregory requested a review from a team as a code owner August 31, 2022 20:21
@huntergregory huntergregory requested review from ck319 and removed request for a team August 31, 2022 20:21
@huntergregory huntergregory changed the title from "fix: [NPM] modify linux max restore try count to prevent perpetual errors" to "fix: [NPM] modify linux max restore try count in case of perpetual errors" Sep 8, 2022
@huntergregory huntergregory changed the title from "fix: [NPM] modify linux max restore try count in case of perpetual errors" to "fix: [NPM] modify linux max restore try count in case of several non-retriable errors" Sep 8, 2022
@huntergregory huntergregory changed the title from "fix: [NPM] modify linux max restore try count in case of several non-retriable errors" to "fix: [NPM] increase linux retry limit exponentially in case of several non-retriable errors" Oct 27, 2022
@huntergregory huntergregory changed the title from "fix: [NPM] increase linux retry limit exponentially in case of several non-retriable errors" to "fix: [NPM-LINUX] exponentially increase retry limit exponentially in case of several non-retriable errors" Oct 27, 2022
@huntergregory huntergregory changed the title from "fix: [NPM-LINUX] exponentially increase retry limit exponentially in case of several non-retriable errors" to "fix: [NPM-LINUX] resiliency for several non-retriable errors" Nov 1, 2022
if currentLineIndex == lineNum-1 {
	lineIndex = currentLineIndex
if currentLineNum == lineNum {
	lineIndex = i
Contributor Author

@huntergregory huntergregory Nov 2, 2022

bug fix here that impacts the 2nd+ retry

Contributor Author

@huntergregory huntergregory Nov 3, 2022

Before this fix, a line error on line 5 could be mapped to the original line 5 rather than to the new line 5 that results from skipping lines on retry. This could lead to using the wrong line error handler on the second retry.
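
To illustrate the mapping problem, here is a minimal Go sketch. It is not the code from this PR: the `restoreLine` type, the `skipped` flag, and the `originalIndex` function are hypothetical, and only the names `currentLineNum`, `lineNum`, and `i` echo the diff fragment above. It assumes that each retry omits some lines, so a line number reported against the current file has to be counted over the lines that are still present.

```go
package main

import "fmt"

// restoreLine is a hypothetical stand-in for one line of the restore file.
type restoreLine struct {
	content string
	skipped bool // set once an earlier try decided to omit this line
}

// originalIndex returns the index into lines of the lineNum-th (1-based) line
// that is still present in the current restore file, or -1 if there is none.
func originalIndex(lines []restoreLine, lineNum int) int {
	currentLineNum := 0
	for i, l := range lines {
		if l.skipped {
			continue
		}
		currentLineNum++
		if currentLineNum == lineNum {
			return i
		}
	}
	return -1
}

func main() {
	// The first try failed on "-F set-b", so the retry omits everything up to
	// that line plus the rest of set-b's section.
	lines := []restoreLine{
		{content: "-F set-a", skipped: true},
		{content: "-F set-b", skipped: true},
		{content: "-F set-c"},
		{content: "-X set-a"},
		{content: "-X set-b", skipped: true},
		{content: "-X set-c"},
	}

	// An error such as "Error in line 2" refers to the file used for the retry.
	lineNum := 2
	fmt.Println("naive index:", lineNum-1)                      // naive index: 1
	fmt.Println("mapped index:", originalIndex(lines, lineNum)) // mapped index: 3
}
```

In this sketch the naive index points at `-F set-b`, a line that is no longer in the file, while the mapped index points at `-X set-a`, the line that actually failed on the retry.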

@huntergregory
Contributor Author

Example Logs:

I1102 00:11:20.866743       1 podController.go:417] [syncAddAndUpdatePod] updating Pod with key x/a
I1102 00:11:20.866769       1 podController.go:456] Deleting pod x/a (ip : 10.224.0.15) from ipset debug:true
I1102 00:11:20.866793       1 podController.go:478] Creating ipset debug:false if it doesn't already exist
I1102 00:11:20.866800       1 podController.go:487] Adding pod x/a (ip : 10.224.0.15) to ipset debug:false
I1102 00:11:20.866846       1 ipsetmanager.go:461] [IPSetManager] dirty caches. toAddUpdateCache: to create: [podlabel-debug:false: &{membersToAdd:map[10.224.0.15:{}] membersToDelete:map[]},podlabel-debug: &{membersToAdd:map[10.224.0.15:{}] membersToDelete:map[]},podlabel-debug:true: &{membersToAdd:map[] membersToDelete:map[]}], to update: [], toDeleteCache: map[podlabel-k1:0xc000952450 podlabel-k1:v1:0xc000952430 podlabel-k2:0xc000952490 podlabel-k2:v2:0xc000952460 podlabel-k3:0xc000952420 podlabel-k3:v3:0xc000952470]
I1102 00:11:20.866918       1 restore.go:183] running this restore command: [ipset restore]
I1102 00:11:20.866924       1 restore.go:319] line 1 of restore command [ipset restore] with section ID [add/update-podlabel-debug:false]: [-N azure-npm-3081258205 --exist nethash]
I1102 00:11:20.866930       1 restore.go:319] line 2 of restore command [ipset restore] with section ID [add/update-podlabel-debug]: [-N azure-npm-2955716136 --exist nethash]
I1102 00:11:20.866934       1 restore.go:319] line 3 of restore command [ipset restore] with section ID [add/update-podlabel-debug:true]: [-N azure-npm-3853671530 --exist nethash]
I1102 00:11:20.866938       1 restore.go:319] line 4 of restore command [ipset restore] with section ID [add/update-podlabel-debug:false]: [-A azure-npm-3081258205 10.224.0.15]
I1102 00:11:20.866942       1 restore.go:319] line 5 of restore command [ipset restore] with section ID [add/update-podlabel-debug]: [-A azure-npm-2955716136 10.224.0.15]
I1102 00:11:20.866946       1 restore.go:319] line 6 of restore command [ipset restore] with section ID [delete-podlabel-k2]: [-F azure-npm-3241175160]
I1102 00:11:20.866950       1 restore.go:319] line 7 of restore command [ipset restore] with section ID [delete-podlabel-k3]: [-F azure-npm-3257952779]
I1102 00:11:20.866954       1 restore.go:319] line 8 of restore command [ipset restore] with section ID [delete-podlabel-k1:v1]: [-F azure-npm-2867360572]
I1102 00:11:20.866958       1 restore.go:319] line 9 of restore command [ipset restore] with section ID [delete-podlabel-k1]: [-F azure-npm-3291508017]
I1102 00:11:20.866962       1 restore.go:319] line 10 of restore command [ipset restore] with section ID [delete-podlabel-k2:v2]: [-F azure-npm-646662566]
I1102 00:11:20.866966       1 restore.go:319] line 11 of restore command [ipset restore] with section ID [delete-podlabel-k3:v3]: [-F azure-npm-2649853620]
I1102 00:11:20.866970       1 restore.go:319] line 12 of restore command [ipset restore] with section ID [delete-podlabel-k1]: [-X azure-npm-3291508017]
I1102 00:11:20.866974       1 restore.go:319] line 13 of restore command [ipset restore] with section ID [delete-podlabel-k2:v2]: [-X azure-npm-646662566]
I1102 00:11:20.866978       1 restore.go:319] line 14 of restore command [ipset restore] with section ID [delete-podlabel-k3:v3]: [-X azure-npm-2649853620]
I1102 00:11:20.866982       1 restore.go:319] line 15 of restore command [ipset restore] with section ID [delete-podlabel-k2]: [-X azure-npm-3241175160]
I1102 00:11:20.866986       1 restore.go:319] line 16 of restore command [ipset restore] with section ID [delete-podlabel-k3]: [-X azure-npm-3257952779]
I1102 00:11:20.866990       1 restore.go:319] line 17 of restore command [ipset restore] with section ID [delete-podlabel-k1:v1]: [-X azure-npm-2867360572]
I1102 00:11:20.869419       1 restore.go:295] continuing after line 8 and aborting section [delete-podlabel-k1:v1] for command [ipset restore]
I1102 00:11:20.869432       1 ipsetmanager_linux.go:689] skipping flush and upcoming destroy for set podlabel-k1:v1 since the set doesn't exist
I1102 00:11:20.869520       1 restore.go:183] running this restore command: [ipset restore]
I1102 00:11:20.869531       1 restore.go:343] on try 1 of restore command [ipset restore]. mapping of current line numbers to original line numbers: [1->9 2->10 3->11 4->12 5->13 6->14 7->15 8->16]
2022/11/02 00:11:20 [1] error: on try number 2, failed to run command [ipset restore]. Rerunning with updated file. err: [line-number error for line [-F azure-npm-646662566]: error running command [ipset restore] with err [exit status 1] and stdErr [ipset v7.5: Error in line 2: The set with the given name does not exist
]]
I1102 00:11:20.872200       1 restore.go:295] continuing after line 2 and aborting section [delete-podlabel-k2:v2] for command [ipset restore]
I1102 00:11:20.872220       1 ipsetmanager_linux.go:689] skipping flush and upcoming destroy for set podlabel-k2:v2 since the set doesn't exist
I1102 00:11:20.872302       1 restore.go:183] running this restore command: [ipset restore]
I1102 00:11:20.872351       1 restore.go:343] on try 2 of restore command [ipset restore]. mapping of current line numbers to original line numbers: [1->11 2->12 3->14 4->15 5->16]
2022/11/02 00:11:20 [1] error: on try number 3, failed to run command [ipset restore]. Rerunning with updated file. err: [line-number error for line [-F azure-npm-2649853620]: error running command [ipset restore] with err [exit status 1] and stdErr [ipset v7.5: Error in line 1: The set with the given name does not exist
]]
I1102 00:11:20.874102       1 restore.go:295] continuing after line 1 and aborting section [delete-podlabel-k3:v3] for command [ipset restore]
I1102 00:11:20.874131       1 ipsetmanager_linux.go:689] skipping flush and upcoming destroy for set podlabel-k3:v3 since the set doesn't exist
I1102 00:11:20.874233       1 restore.go:183] running this restore command: [ipset restore]
I1102 00:11:20.874245       1 restore.go:343] on try 3 of restore command [ipset restore]. mapping of current line numbers to original line numbers: [1->12 2->15 3->16]
I1102 00:11:20.902260       1 restore.go:153] successfully ran command [ipset restore] on try number 4
I1102 00:11:20.902293       1 podController.go:234] Successfully synced 'x/a'
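
The line-number mappings in the logs above can be reproduced with a small Go sketch. It rests on one assumption inferred from the "continuing after line N and aborting section" messages: each retry drops every line up to and including the failing line, plus any later line in the same section. The `restoreLine` type and the `exampleFile`, `nextFile`, and `mapping` functions are hypothetical and are not taken from `restore.go`.

```go
package main

import (
	"fmt"
	"strings"
)

// restoreLine pairs a line's number in the first restore file with its section ID.
type restoreLine struct {
	original  int
	sectionID string
}

// exampleFile lists the section ID of each of the 17 lines shown in the logs.
func exampleFile() []restoreLine {
	sections := []string{
		"add/update-podlabel-debug:false", "add/update-podlabel-debug", "add/update-podlabel-debug:true",
		"add/update-podlabel-debug:false", "add/update-podlabel-debug",
		"delete-podlabel-k2", "delete-podlabel-k3", "delete-podlabel-k1:v1",
		"delete-podlabel-k1", "delete-podlabel-k2:v2", "delete-podlabel-k3:v3",
		"delete-podlabel-k1", "delete-podlabel-k2:v2", "delete-podlabel-k3:v3",
		"delete-podlabel-k2", "delete-podlabel-k3", "delete-podlabel-k1:v1",
	}
	file := make([]restoreLine, len(sections))
	for i, s := range sections {
		file[i] = restoreLine{original: i + 1, sectionID: s}
	}
	return file
}

// nextFile builds the file for the next try: it continues after failedLine
// (1-based, relative to current) and drops the rest of the failed section.
func nextFile(current []restoreLine, failedLine int) []restoreLine {
	failedSection := current[failedLine-1].sectionID
	var next []restoreLine
	for _, l := range current[failedLine:] {
		if l.sectionID != failedSection {
			next = append(next, l)
		}
	}
	return next
}

// mapping formats current line numbers against original line numbers.
func mapping(current []restoreLine) string {
	pairs := make([]string, 0, len(current))
	for i, l := range current {
		pairs = append(pairs, fmt.Sprintf("%d->%d", i+1, l.original))
	}
	return "[" + strings.Join(pairs, " ") + "]"
}

func main() {
	file := exampleFile()
	// The logs report failures on line 8, then line 2, then line 1.
	for _, failedLine := range []int{8, 2, 1} {
		file = nextFile(file, failedLine)
		fmt.Println(mapping(file))
	}
}
```

Under that assumption, the three printed mappings match the ones logged on tries 1, 2, and 3.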

if err != nil {
	metrics.SendErrorLogAndMetric(util.IpsmID, "error: failed to apply ipsets: %s", err.Error())
	// exponentially increase maxRestoreTryCount
	iMgr.maxRestoreTryCount *= 2
Contributor

Why do we want to exponentially increase here?

Contributor

this just results in 10 tries, we can just remove this and change the largestretry to 10

Contributor Author

discussed offline. setting a constant limit of 5
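
For illustration, a constant try limit can be sketched in Go as follows. This is a simplified sketch rather than the merged code: `runWithRetries` and `failFirst` are hypothetical helpers, the constant reuses the `maxRestoreTryCount` name from the diff above, and treating the limit as the total number of tries (rather than the number of retries after the first try) is an assumption. The restore logic in this PR also rewrites the restore file between tries, which is omitted here.

```go
package main

import (
	"errors"
	"fmt"
)

// maxRestoreTryCount mirrors the constant limit of 5 agreed on in the review.
// Counting it as the total number of tries is an assumption of this sketch.
const maxRestoreTryCount = 5

// runWithRetries calls try until it succeeds or the limit is reached.
// It returns the number of tries made and the last error, if any.
func runWithRetries(try func(tryNum int) error) (int, error) {
	var err error
	for tryNum := 1; tryNum <= maxRestoreTryCount; tryNum++ {
		if err = try(tryNum); err == nil {
			return tryNum, nil
		}
	}
	return maxRestoreTryCount, fmt.Errorf("giving up after %d tries: %w", maxRestoreTryCount, err)
}

// failFirst returns a hypothetical try function that fails on its first n calls.
func failFirst(n int) func(int) error {
	return func(tryNum int) error {
		if tryNum <= n {
			return errors.New("the set with the given name does not exist")
		}
		return nil
	}
}

func main() {
	// Three failures followed by a success, as in the example logs.
	tries, err := runWithRetries(failFirst(3))
	fmt.Println(tries, err) // 4 <nil>

	// More failures than the limit allows: the loop stops after 5 tries.
	tries, err = runWithRetries(failFirst(10))
	fmt.Println(tries, err)
}
```

With a fixed limit, the worst-case number of `ipset restore` invocations per apply call stays bounded and predictable, whereas doubling the limit after each failed apply lets it keep growing.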

matmerr previously approved these changes Nov 8, 2022
@matmerr matmerr dismissed their stale review November 9, 2022 19:24

let's reconsider the perf impacts of dumping all ipset lines

@vakalapa vakalapa merged commit 8cc8e7f into master Nov 23, 2022
@vakalapa vakalapa deleted the npm-exponential-try-count branch November 23, 2022 18:38
rjdenney pushed a commit to rjdenney/azure-container-networking that referenced this pull request Jan 19, 2023
)

* adaptively modify linux max restore try count to prevent perpetual errors

* remove debug print

* log restore file and send ipsetmanager_linux errors

* send other appropriate errors

* fix handleLineError function

* fix printing restore lines and enhance a log

* fix lints and wrap chainLineNumber errors

* fix one off error for logging the try count

* revert exponential increase to try limit

* update try count to 5 and update UTs

* do not log lines for every restore call until perf is understood
smittal22 pushed a commit to smittal22/azure-container-networking that referenced this pull request Jan 26, 2023
smittal22 pushed a commit to smittal22/azure-container-networking that referenced this pull request Jan 30, 2023
smittal22 pushed a commit to smittal22/azure-container-networking that referenced this pull request Feb 3, 2023
Labels

linux npm Related to NPM.

4 participants