-
Notifications
You must be signed in to change notification settings - Fork 260
mNAT + LB support for swift containers #680
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report
@@ Coverage Diff @@
## master #680 +/- ##
==========================================
- Coverage 39.12% 39.06% -0.06%
==========================================
Files 83 83
Lines 10697 10743 +46
==========================================
+ Hits 4185 4197 +12
- Misses 6010 6039 +29
- Partials 502 507 +5 |
ramiro-gamarra
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some comments/questions
cns/NetworkContainerContract.go
Outdated
| type NodeInfoResponse struct { | ||
| NetworkContainers []CreateNetworkContainerRequest | ||
| GetNCVersionURLFmt string | ||
| NmAgentApisMissing bool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's probably more extensible to return the list of available actions than just a flag to say that some are missing. This way, the client can validate if the specific ones they are looking for are missing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right now i do not think there are any other requests or actions apart from the one in question. SO kept it simple with a single flag.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I must have been confused before. This is what DNC will be returning?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct. CNS will ask for regular syncNodeStatus, as a response, DNC sends back list of NCs, version and the flag for NMAgentApis list. And DNC will set this flag only if the NodeConfig has GreCapableNotSet property. If the Node is not capable or if the Node has been previously updated with this list then DNC will skip asking for this information.
cns/restserver/internalapi.go
Outdated
| service.saveState() | ||
| service.Unlock() | ||
|
|
||
| if nodeInfoResponse.NmAgentApisMissing { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't seem like it is needed. CNS would not get past start up if register node did not succeed in the main, so this information would be there if the node was registered with DNC earlier or not?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will work as kind of a re-try mechanism. If DNC fails to set its properties due to some error, in the next NodeSyncStatus it can solicit for these details again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only thing I don't quite understand is if DNC fails to set this properties, then the register node call should fail, so we would never reach here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Today, register Node happens only once, at the time of initial CNS bringup, main.go L#385. And yes, we do not want to register if DNC sets this flag as false (same as todays behavior)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't follow either. If I'm reading main correctly, if node registration failed during start up, that terminates CNS. Registration must succeed before the sync loop is even started. Once it's registered, why do we need to try to register it again here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After thinking through and testing, i removed the re-register logic. Now if the Node needs to get the ability to add GreKeys, Cx will need to drain NCs, unregister the Node and register again.
ramiro-gamarra
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Few more comments/questions
cns/NetworkContainerContract.go
Outdated
| type NodeInfoResponse struct { | ||
| NetworkContainers []CreateNetworkContainerRequest | ||
| GetNCVersionURLFmt string | ||
| NmAgentApisMissing bool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I must have been confused before. This is what DNC will be returning?
cns/restserver/internalapi.go
Outdated
| service.saveState() | ||
| service.Unlock() | ||
|
|
||
| if nodeInfoResponse.NmAgentApisMissing { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't follow either. If I'm reading main correctly, if node registration failed during start up, that terminates CNS. Registration must succeed before the sync loop is even started. Once it's registered, why do we need to try to register it again here?
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
ramiro-gamarra
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Few comments
| case <-nodeRegisterTicker.C: | ||
| go sendRegisterNodeRequest(httpc, httpRestService, nodeRegisterRequest, url, responseChan) | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would wait five seconds before it attempts to send the first request. Not sure if that's the intended behavior. It also seems you may not need goroutines if these operations are sequential. Do we want to retry the operation forever or crash at some point? I would recommend looking at the retry package that got added a couple weeks ago, might come in handy here: https://github.com/Azure/azure-container-networking/blob/master/test/integration/retry/retry.go
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will correct the first 5 sec delay. Existing behavior is to retry forever. You are right, we need a better mechanism here. I can add a work item to improve this logic for exponential back off as a separate effort ?
cns/service/main.go
Outdated
| } | ||
| response.Body.Close() | ||
| } else { | ||
| logger.Errorf("[Azure CNS] Failed to register node with err: %+v", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason why this path doesn't send the error? Also, i know this is just adding support of channels, but probably a good opportunity to clean up the function since its kinda hard to follow the code flow. Suggest returning early like:
res, err := post()
if err != nil { /* return */ }
defer res.body.close()
if res.StatusCode != { /* return */ }
if err := decode(req); err != nil { /* return */ }
setNodeOrch(req)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay will refactor. Yes, I am assuming this error is retriable error, like timeout or some other transport path error. (which most probably can be due to delays in setting up of azure services). The other errors are specific from DNC side, which i am attributing as not retriable.
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
ramiro-gamarra
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm!
| export REGIONS=$(AKS_ENGINE_REGION) | ||
| export IS_JENKINS=false | ||
| export DEBUG_CRASHING_PODS=true | ||
| export |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why we need this one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To print out all the env variables. For better debuggability. This will mask any secrets like below:
declare -x CLIENT_ID="55990953-d723-433b-a204-01af59561ed8"
declare -x CLIENT_SECRET="***"
declare -x CLUSTER_DEFINITION="./cniLinux1604.json"
Reason for Change:
mNAT + LB support changes in CNS
Notes: