-
Notifications
You must be signed in to change notification settings - Fork 260
feat: Creating NNC with HomeAz info in AKS-Swift Workflows when CNS starts up behind a configuration flag #3157
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@nairashu If this is still WIP, can you mark this as draft? |
ramiro-gamarra
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few questions/thoughts
cns/configuration/configuration.go
Outdated
| EnableSubnetScarcity bool | ||
| EnableSwiftV2 bool | ||
| InitializeFromCNI bool | ||
| EnableHomeAz bool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe this should be EnableNNCCreation. It will be clearer when AKS reviews toggles to CNS config.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We typically control the component configurations closer to our service feature naming so keeping it related to HomeAz makes more sense as the flag is to use the HomeAzMonitor overall to retrieve the Az and then create the NNC from that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know that "Az" is everywhere, so it's probably impractical to change at this point, but "AZ" is an abbreviation, so this should be EnableHomeAZ by typical Go naming conventions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed this to EnableHomeAZ
cns/service/main.go
Outdated
| var nnc *v1alpha.NodeNetworkConfig | ||
| if nnc, err = directnnccli.Get(ctx, types.NamespacedName{Namespace: "kube-system", Name: nodeName}); err != nil { | ||
| logger.Errorf("[Azure CNS] failed to get existing NNC: %v", err) | ||
| } | ||
|
|
||
| newNNC := createBaseNNC(node) | ||
| if nnc == nil { | ||
| logger.Printf("[Azure CNS] Creating new base NNC") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Error conditions are not clear here. If there is a transient network error, the get should be retried no? Otherwise, is there a way to detect that we reached the apiserver and it returned "not found"? That should be the only time when we attempt to create the nnc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding a retrier in here for transient errors.
cns/restserver/internalapi.go
Outdated
| func (service *HTTPRestService) GetHomeAz(ctx context.Context) (homeAzResponse cns.GetHomeAzResponse) { | ||
| service.RLock() | ||
| homeAzResponse = service.homeAzMonitor.GetHomeAz(ctx) | ||
| service.RUnlock() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be defer-ed to insulate it from panics
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added the defer but also noticed this file is missing defer in multiple other places
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Leave the campsite cleaner than you found it" applies -- but also I only care about this instance for sake of this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Totally on the same page with that philosophy so did add in the defer above.
cns/configuration/configuration.go
Outdated
| EnableSubnetScarcity bool | ||
| EnableSwiftV2 bool | ||
| InitializeFromCNI bool | ||
| EnableHomeAz bool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know that "Az" is everywhere, so it's probably impractical to change at this point, but "AZ" is an abbreviation, so this should be EnableHomeAZ by typical Go naming conventions.
| EnableSubnetScarcity bool | ||
| EnableSwiftV2 bool | ||
| InitializeFromCNI bool | ||
| EnableHomeAZ bool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the previous comment got lost but I think we need further discussion on the name of this field. I don't think EnableHomeAZ is the correct name because:
- Home AZ is already enabled when mode is direct, which makes this flag confusing.
- The major feature is the NNC creation, which should be independent of home az being available or not.
I still thinkEnableNNCCreationis a better name. Would like to get feedback from @thatmattlong and @rbtr on this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agree, flag should tightly describe what function it does, not the abstract name of the scenario.
CreateNNC would be even better
| directscopedcli := nncctrl.NewScopedClient(directnnccli, types.NamespacedName{Namespace: "kube-system", Name: nodeName}) | ||
|
|
||
| // Create the base NNC CRD if HomeAz is enabled | ||
| if cnsconfig.EnableHomeAZ { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
better to extract the scope opened by this flag to a different function. it will help clean up some code paths.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i would like to see NNC creation decoupled completely from HomeAz
| - apiGroups: ["acn.azure.com"] | ||
| resources: ["nodenetworkconfigs"] | ||
| verbs: ["get", "list", "watch", "patch", "update"] | ||
| verbs: ["create", "delete", "get", "list", "watch", "patch", "update"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i don't think we'll need delete
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The design is to move the node reconcile logic over to CNS.
It makes more sense to create and cleanup the NNC based on when the node is created and deleted. As a result, I added the operational capability to delete the NNC as well to CNS as eventually we can clean up the nodes faster in this case. DNC, DNC-RC and DNCCleanup can work async as it does today to cleanup the NC and other associations.
Why do we want to avoid CNS from doing this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In what scenario could CNS know it should delete an NNC?
I think it's impossible. The NNC should only be deleted once the Node no longer exists. If the Node doesn't exist, CNS cannot be running and can't delete the NNC.
Even DNC-RC does not delete NNCs - when the NNC is created it gets its OwnerRef set to the Node object. Then when the Node object is deleted from Kubernetes, the NNC is automatically garbage collected. These refs use object UUIDs; this is the only safe way to delete an NNC.
| // GetHomeAz - Get the Home Az for the Node where CNS is running. | ||
| func (service *HTTPRestService) GetHomeAz(ctx context.Context) (cns.GetHomeAzResponse, error) { | ||
| service.RLock() | ||
| defer service.RUnlock() | ||
| homeAzResponse := service.homeAzMonitor.GetHomeAz(ctx) | ||
| if homeAzResponse.Response.ReturnCode == types.NotFound { | ||
| return homeAzResponse, errors.New(homeAzResponse.Response.Message) | ||
| } | ||
|
|
||
| return homeAzResponse, nil | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what value is this adding over calling service.homeAzMonitor.GetHomeAz directly?
|
This pull request is stale because it has been open for 2 weeks with no activity. Remove stale label or comment or this will be closed in 7 days |
|
Pull request closed due to inactivity. |
Reason for Change:
feat: Creating NNC with HomeAz info in AKS-Swift Workflows when CNS starts up behind a configuration flag
Issue Fixed:
N/A
Requirements:
Notes: