-
Notifications
You must be signed in to change notification settings - Fork 260
fix: wait for vnet ns to create and ensure veths are inside namespace instead of assuming #2341
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
/azp run |
|
Azure Pipelines successfully started running 2 pipeline(s). |
e37e09d to
7288514
Compare
| _, err := client.netioshim.GetNetworkInterfaceByName(client.vlanIfName) | ||
| return errors.Wrap(err, "failed to get vlan interface in namespace") | ||
| }) | ||
| if vlanIfNotFoundErr != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
error will not be not found always: https://cs.opensource.google/go/go/+/refs/tags/go1.21.3:src/net/interface.go;l=156-172
Should we compare with errNoSuchInterface Or we want to delete the interface every time there is an error in this case renaming the variable should help ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, the possible errors that I see based on the documentation appear to be:
- No such interface, in which case we should delete the namespace on line 143
- Some failure with the iptable or the interface name isn't valid, in which case it could be a one-off error and the vlan interface could actually exist (we just can't get it right now), in which case containers may exist that use this namespace, which means the namespace should not be deleted.
- If there are other error scenarios feel free to mention
I'll update the code to match on the no such interface error.
90465ae to
eb2fe5a
Compare
12184d3 to
473e7e8
Compare
| COPY --from=azure-ipam /azure-ipam/bin/* pkg/embed/fs | ||
| COPY --from=azure-vnet /azure-container-networking/cni/azure-$OS.conflist pkg/embed/fs/azure.conflist | ||
| COPY --from=azure-vnet /azure-container-networking/cni/azure-$OS-swift.conflist pkg/embed/fs/azure-swift.conflist | ||
| COPY --from=azure-vnet /azure-container-networking/cni/azure-$OS-multitenancy-transparent-vlan.conflist pkg/embed/fs/azure-multitenancy-transparent-vlan.conflist |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this related ?
473e7e8 to
997c89f
Compare
tamilmani1989
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i haven't reviewed fully. will look later once i find some time
|
|
||
| // If the vlan veth for sure doesn't exist, we delete the vnet ns, but if we can't query, we can't delete the ns (the vlan veth might exist!) | ||
| if vlanIfErr != nil { | ||
| if strings.Contains(vlanIfErr.Error(), "no such network interface") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
instead of checking for this error message, can cni just remove network namespace if it gets error in finding vlan device in veth namespace
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These changes (checking the error message on 144, logging the error on 151, logging the error on 166, highlighted by your comments) stem from my previous conversation with Vipul (#2341 (comment)) regarding the possibility that an error returned from client.netioshim.GetNetworkInterfaceByName does not always mean the interface was not found. It seems possible based on the source that the process to get the network interface could itself fail, meaning that a failure could occur even when the vlan interface exists (it's just that the method errored out before trying to get the interface). However, if you've never seen it fail with anything other than "interface not found", we can remove checking for the particular error and logging the error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
my concern is what if they changed the error message in future? Im fine with this if we have UT for catching this error message with exact string
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To clarify, do you mean catching/matching when the error is not "no such network interface" (ex: SyscallError) and any other error will trigger deletion of the namespace?
|
|
||
| func (client *TransparentVlanEndpointClient) createNetworkNamespace(vmNS, numRetries int) error { | ||
| var isNamespaceUnique bool | ||
| // Called from PopulateVM, Namespace: VM and Vnet |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
its callled from AddEndpoints. update comment
| return errors.Wrap(vlanIfErr, "could not determine if vlan veth exists in vnet namespace") | ||
| } | ||
| } | ||
| } else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
log message is confusing.. why do you expect vnet namespace not to exist. instead it should just log "vnet namespace doesn't exist with error message"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i dont think log is required.. anyway we log in populateVM right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes we do log in populateVM, but here (see above comment) we want to know if GetNetworkInterfaceByName returned an error other than "interface not found", in case the method itself fails. Again, if you've only ever seen it fail if the interface is not found, we can probably remove the log.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ya lets remove the log.. anyway we log in populateVM
| return errors.Wrap(delErr, "failed to clean up vlan interface") | ||
| } | ||
| } else { | ||
| logger.Info("Expecting vlan interface not to exist", zap.String("result", vlanIfInVMErr.Error())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as above comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(see above comment)
|
|
||
| // If the vlan veth for sure doesn't exist, we delete the vnet ns, but if we can't query, we can't delete the ns (the vlan veth might exist!) | ||
| if vlanIfErr != nil { | ||
| if strings.Contains(vlanIfErr.Error(), "no such network interface") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
my concern is what if they changed the error message in future? Im fine with this if we have UT for catching this error message with exact string
| return errors.Wrap(vlanIfErr, "could not determine if vlan veth exists in vnet namespace") | ||
| } | ||
| } | ||
| } else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ya lets remove the log.. anyway we log in populateVM
| logger.Info("No existing NS detected. Creating the vnet namespace and switching to it") | ||
|
|
||
| if err = client.createNetworkNamespace(vmNS, numRetries); err != nil { | ||
| if err = client.createNetworkNamespace(vmNS); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why dont we call this also with RunwithRetries instead of calling within the method?
| // This will also (we assume) mean the vlan veth does not exist | ||
| if existingErr != nil { | ||
| // We assume the only possible error is that the namespace doesn't exist | ||
| logger.Info("No existing NS detected. Creating the vnet namespace and switching to it") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we log error here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes I will add the existingErr message to this logger.Info statement.
| // If the vlan veth for sure doesn't exist, we delete the vnet ns, but if we can't query, we can't delete the ns (the vlan veth might exist!) | ||
| if vlanIfErr != nil { | ||
| if strings.Contains(vlanIfErr.Error(), "no such network interface") { | ||
| logger.Info("Vlan interface doesn't exist even though network namespace exists, deleting network namespace...") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
VLAN or vlan, but never Vlan. it's an acronym. and logs don't need to be sentences
|
@tamilmani1989 The network namespace
|
2a7debf to
b5a7b4d
Compare
Reuse existing code to enable ip forwarding Add forwarding message Add transparent vlan config to dropgz Add enable ip forward to each network namespace for redundancy
…e ip forwarding fix as it is in other branch
b5a7b4d to
104935e
Compare
… instead of assuming (#2341) * Make initial changes to enable forwarding Reuse existing code to enable ip forwarding Add forwarding message Add transparent vlan config to dropgz Add enable ip forward to each network namespace for redundancy * Add general run with retries function * Migrate existing code to use retry function * Add retries for checking of vnet and container veth creation * Add preliminary repair item fixes * Clarify log message * Fix unit tests, move ensure clean populate vm to add endpoints, remove ip forwarding fix as it is in other branch * Move setting link netns and confirmation to function * Add tests for setLinkNetNSAndConfirm * Address linter issues * Add vm ns to vnet namespace log statement * Fix after rebasing in swift refactoring changes from master * Address feedback and check error message * Address feedback * Address linter issue * Address log feedback * Address feedback by asserting any error is an interface not found error
Reason for Change:
In the transparent vlan endpoint client add endpoint path, detecting the vnet network namespace would previously assume the vlan interface was already inside the namespace. However, this was not always the case (the vlan interface may not exist or may be in the vm namespace). The fix removes any stray vlan interface pertaining to the container in the vm if it exists and also deletes the network namespace if it does not have a valid vlan interface. Then, the code can proceed as though the vnet namespace never existed. Deleting the vnet namespace will not cause existing endpoints to become inaccessible because they must have errored (and been cleaned up) previously to enter this state.
Additionally, we also explicitly confirm the vlan interface as well as the vnet interface are successfully created and moved into the vnet namespace. If the vlan interface fails to move, we delete the vnet namespace and the vlan interface. If the vnet interface fails to move into the vnet namespace, we only delete the vnet interface.
Code changes have been made to create unit tests.
Issue Fixed:
See above.
Requirements:
Notes:
The following were manually tested (against commit "Address feedback"):