@vakalapa
Contributor

@vakalapa vakalapa commented Sep 29, 2020

Reason for Change:
mNAT + LB support changes in CNS

Notes:

@codecov

codecov bot commented Sep 29, 2020

Codecov Report

Merging #680 (4d3de91) into master (0309922) will decrease coverage by 0.05%.
The diff coverage is 66.66%.

@@            Coverage Diff             @@
##           master     #680      +/-   ##
==========================================
- Coverage   39.12%   39.06%   -0.06%     
==========================================
  Files          83       83              
  Lines       10697    10743      +46     
==========================================
+ Hits         4185     4197      +12     
- Misses       6010     6039      +29     
- Partials      502      507       +5     

Contributor

@ramiro-gamarra ramiro-gamarra left a comment

Some comments/questions

type NodeInfoResponse struct {
NetworkContainers []CreateNetworkContainerRequest
GetNCVersionURLFmt string
NmAgentApisMissing bool
Contributor

It's probably more extensible to return the list of available actions than just a flag to say that some are missing. This way, the client can validate if the specific ones they are looking for are missing.
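
A minimal sketch of the alternative suggested here, assuming a hypothetical `SupportedNmAgentApis` field in place of the boolean. The struct name, the field, the `missingApis` helper, and the placeholder API names are illustrative only and are not part of this PR; later in the thread the flag turns out to be a request from DNC rather than a capability report.

```go
package main

import "fmt"

// NodeInfoResponseSketch is a hypothetical variant of NodeInfoResponse that
// carries a list of available APIs instead of a single "missing" flag.
type NodeInfoResponseSketch struct {
	GetNCVersionURLFmt   string
	SupportedNmAgentApis []string // hypothetical field, not in the PR
}

// missingApis returns the entries of required that do not appear in
// supported, in the order they are listed in required.
func missingApis(supported, required []string) []string {
	have := make(map[string]struct{}, len(supported))
	for _, api := range supported {
		have[api] = struct{}{}
	}
	var missing []string
	for _, api := range required {
		if _, ok := have[api]; !ok {
			missing = append(missing, api)
		}
	}
	return missing
}

func main() {
	// Placeholder API names, used only for illustration.
	resp := NodeInfoResponseSketch{SupportedNmAgentApis: []string{"ExampleApiA"}}
	fmt.Println(missingApis(resp.SupportedNmAgentApis, []string{"ExampleApiA", "ExampleApiB"}))
}
```

With a list, each client checks only the APIs it cares about, at the cost of a larger response than the single flag.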

Contributor Author

Right now I do not think there are any other requests or actions apart from the one in question, so I kept it simple with a single flag.

Contributor

I must have been confused before. This is what DNC will be returning?

Contributor Author

@vakalapa vakalapa Oct 5, 2020

Correct. CNS sends the regular syncNodeStatus request and, in response, DNC sends back the list of NCs, the version, and the flag for the NMAgent APIs list. DNC sets this flag only if the NodeConfig has the GreCapableNotSet property. If the Node is not capable, or if the Node has previously been updated with this list, DNC skips asking for this information.

@vakalapa vakalapa marked this pull request as ready for review September 30, 2020 19:31
service.saveState()
service.Unlock()

if nodeInfoResponse.NmAgentApisMissing {
Contributor

This doesn't seem to be needed. CNS would not get past startup if node registration did not succeed in main, so wouldn't this information already be there, whether or not the node was registered with DNC earlier?

Contributor Author

This works as a kind of retry mechanism. If DNC fails to set its properties due to some error, it can solicit these details again in the next NodeSyncStatus.

Contributor

The only thing I don't quite understand is this: if DNC fails to set these properties, then the register node call should fail, so we would never reach here.

Contributor Author

Today, node registration happens only once, at initial CNS bring-up (main.go L#385). And yes, we do not want to register if DNC sets this flag to false (same as today's behavior).

Contributor

I don't follow either. If I'm reading main correctly, if node registration failed during start up, that terminates CNS. Registration must succeed before the sync loop is even started. Once it's registered, why do we need to try to register it again here?

Contributor Author

After thinking it through and testing, I removed the re-register logic. Now, if the Node needs the ability to add GreKeys, Cx will need to drain the NCs, unregister the Node, and register it again.

Contributor

@ramiro-gamarra ramiro-gamarra left a comment

Few more comments/questions

@matmerr
Member

matmerr commented Oct 14, 2020

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Contributor

@ramiro-gamarra ramiro-gamarra left a comment

Few comments

case <-nodeRegisterTicker.C:
go sendRegisterNodeRequest(httpc, httpRestService, nodeRegisterRequest, url, responseChan)
}
}
Contributor

This would wait five seconds before it attempts to send the first request. Not sure if that's the intended behavior. It also seems you may not need goroutines if these operations are sequential. Do we want to retry the operation forever or crash at some point? I would recommend looking at the retry package that got added a couple weeks ago, might come in handy here: https://github.com/Azure/azure-container-networking/blob/master/test/integration/retry/retry.go

Contributor Author

Will correct the initial 5-second delay. The existing behavior is to retry forever. You are right, we need a better mechanism here. Can I add a work item to improve this logic with exponential backoff as a separate effort?

}
response.Body.Close()
} else {
logger.Errorf("[Azure CNS] Failed to register node with err: %+v", err)
Contributor

Is there a reason why this path doesn't send the error? Also, I know this is just adding support for channels, but it's probably a good opportunity to clean up the function, since the code flow is kind of hard to follow. I suggest returning early, like:

res, err := post()
if err != nil { /* return */ }

defer res.Body.Close()

if res.StatusCode != http.StatusOK { /* return */ }

if err := decode(req); err != nil { /* return */ }

setNodeOrch(req)

Contributor Author

Okay, will refactor. Yes, I am assuming this error is a retriable error, such as a timeout or some other transport-path error (most probably due to delays in setting up Azure services). The other errors come specifically from the DNC side, and I am treating them as not retriable.

@vakalapa
Contributor Author

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Contributor

@ramiro-gamarra ramiro-gamarra left a comment

lgtm!

@vakalapa vakalapa merged commit a22a852 into master Nov 19, 2020
@vakalapa vakalapa deleted the vakr/cns_lb_mnat branch November 19, 2020 22:15
export REGIONS=$(AKS_ENGINE_REGION)
export IS_JENKINS=false
export DEBUG_CRASHING_PODS=true
export
Contributor

Why do we need this one?

Contributor Author

To print out all the env variables, for better debuggability. Secrets are masked in the output, as shown below:

declare -x CLIENT_ID="55990953-d723-433b-a204-01af59561ed8" 
declare -x CLIENT_SECRET="***" 
declare -x CLUSTER_DEFINITION="./cniLinux1604.json"
