Skip to content

Conversation

@QxBytes
Copy link
Contributor

@QxBytes QxBytes commented Oct 30, 2023

Reason for Change:

In the transparent vlan endpoint client add endpoint path, detecting the vnet network namespace would previously assume the vlan interface was already inside the namespace. However, this was not always the case (the vlan interface may not exist or may be in the vm namespace). The fix removes any stray vlan interface pertaining to the container in the vm if it exists and also deletes the network namespace if it does not have a valid vlan interface. Then, the code can proceed as though the vnet namespace never existed. Deleting the vnet namespace will not cause existing endpoints to become inaccessible because they must have errored (and been cleaned up) previously to enter this state.

Additionally, we also explicitly confirm the vlan interface as well as the vnet interface are successfully created and moved into the vnet namespace. If the vlan interface fails to move, we delete the vnet namespace and the vlan interface. If the vnet interface fails to move into the vnet namespace, we only delete the vnet interface.

Code changes have been made to create unit tests.

Issue Fixed:

See above.

Requirements:

Notes:
The following were manually tested (against commit "Address feedback"):

  • create pod when its vnet namespace does not exist
  • create pod when its vnet namespace and vlan interface does exist
  • create pod when its vnet namespace exists but the vlan interface does not exist
  • create pod when its vlan interface is in the vm namespace (as if it failed to move into the vnet namespace)
  • in all scenarios, we test pod-pod same vnet same vm connectivity, pod-pod same vnet different vm connectivity, pod-internet connectivity, pod-pod same vm different vnet with no connectivity

@QxBytes
Copy link
Contributor Author

QxBytes commented Oct 31, 2023

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 2 pipeline(s).

@QxBytes QxBytes force-pushed the alew/transparent-vlan-repair branch from e37e09d to 7288514 Compare November 1, 2023 22:57
@QxBytes QxBytes marked this pull request as ready for review November 2, 2023 19:00
@QxBytes QxBytes requested a review from a team as a code owner November 2, 2023 19:00
@QxBytes QxBytes requested a review from matmerr November 2, 2023 19:00
_, err := client.netioshim.GetNetworkInterfaceByName(client.vlanIfName)
return errors.Wrap(err, "failed to get vlan interface in namespace")
})
if vlanIfNotFoundErr != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

error will not be not found always: https://cs.opensource.google/go/go/+/refs/tags/go1.21.3:src/net/interface.go;l=156-172
Should we compare with errNoSuchInterface Or we want to delete the interface every time there is an error in this case renaming the variable should help ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, the possible errors that I see based on the documentation appear to be:

  • No such interface, in which case we should delete the namespace on line 143
  • Some failure with the iptable or the interface name isn't valid, in which case it could be a one-off error and the vlan interface could actually exist (we just can't get it right now), in which case containers may exist that use this namespace, which means the namespace should not be deleted.
  • If there are other error scenarios feel free to mention
    I'll update the code to match on the no such interface error.

@QxBytes QxBytes force-pushed the alew/transparent-vlan-repair branch from 90465ae to eb2fe5a Compare November 3, 2023 23:24
@QxBytes QxBytes force-pushed the alew/transparent-vlan-repair branch 2 times, most recently from 12184d3 to 473e7e8 Compare November 7, 2023 23:44
COPY --from=azure-ipam /azure-ipam/bin/* pkg/embed/fs
COPY --from=azure-vnet /azure-container-networking/cni/azure-$OS.conflist pkg/embed/fs/azure.conflist
COPY --from=azure-vnet /azure-container-networking/cni/azure-$OS-swift.conflist pkg/embed/fs/azure-swift.conflist
COPY --from=azure-vnet /azure-container-networking/cni/azure-$OS-multitenancy-transparent-vlan.conflist pkg/embed/fs/azure-multitenancy-transparent-vlan.conflist
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this related ?

@QxBytes QxBytes force-pushed the alew/transparent-vlan-repair branch from 473e7e8 to 997c89f Compare November 7, 2023 23:57
vipul-21
vipul-21 previously approved these changes Nov 8, 2023
Copy link
Member

@tamilmani1989 tamilmani1989 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i haven't reviewed fully. will look later once i find some time


// If the vlan veth for sure doesn't exist, we delete the vnet ns, but if we can't query, we can't delete the ns (the vlan veth might exist!)
if vlanIfErr != nil {
if strings.Contains(vlanIfErr.Error(), "no such network interface") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of checking for this error message, can cni just remove network namespace if it gets error in finding vlan device in veth namespace

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes (checking the error message on 144, logging the error on 151, logging the error on 166, highlighted by your comments) stem from my previous conversation with Vipul (#2341 (comment)) regarding the possibility that an error returned from client.netioshim.GetNetworkInterfaceByName does not always mean the interface was not found. It seems possible based on the source that the process to get the network interface could itself fail, meaning that a failure could occur even when the vlan interface exists (it's just that the method errored out before trying to get the interface). However, if you've never seen it fail with anything other than "interface not found", we can remove checking for the particular error and logging the error.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my concern is what if they changed the error message in future? Im fine with this if we have UT for catching this error message with exact string

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To clarify, do you mean catching/matching when the error is not "no such network interface" (ex: SyscallError) and any other error will trigger deletion of the namespace?


func (client *TransparentVlanEndpointClient) createNetworkNamespace(vmNS, numRetries int) error {
var isNamespaceUnique bool
// Called from PopulateVM, Namespace: VM and Vnet
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

its callled from AddEndpoints. update comment

return errors.Wrap(vlanIfErr, "could not determine if vlan veth exists in vnet namespace")
}
}
} else {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

log message is confusing.. why do you expect vnet namespace not to exist. instead it should just log "vnet namespace doesn't exist with error message"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i dont think log is required.. anyway we log in populateVM right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes we do log in populateVM, but here (see above comment) we want to know if GetNetworkInterfaceByName returned an error other than "interface not found", in case the method itself fails. Again, if you've only ever seen it fail if the interface is not found, we can probably remove the log.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ya lets remove the log.. anyway we log in populateVM

return errors.Wrap(delErr, "failed to clean up vlan interface")
}
} else {
logger.Info("Expecting vlan interface not to exist", zap.String("result", vlanIfInVMErr.Error()))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above comment

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(see above comment)


// If the vlan veth for sure doesn't exist, we delete the vnet ns, but if we can't query, we can't delete the ns (the vlan veth might exist!)
if vlanIfErr != nil {
if strings.Contains(vlanIfErr.Error(), "no such network interface") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my concern is what if they changed the error message in future? Im fine with this if we have UT for catching this error message with exact string

return errors.Wrap(vlanIfErr, "could not determine if vlan veth exists in vnet namespace")
}
}
} else {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ya lets remove the log.. anyway we log in populateVM

logger.Info("No existing NS detected. Creating the vnet namespace and switching to it")

if err = client.createNetworkNamespace(vmNS, numRetries); err != nil {
if err = client.createNetworkNamespace(vmNS); err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why dont we call this also with RunwithRetries instead of calling within the method?

// This will also (we assume) mean the vlan veth does not exist
if existingErr != nil {
// We assume the only possible error is that the namespace doesn't exist
logger.Info("No existing NS detected. Creating the vnet namespace and switching to it")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we log error here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes I will add the existingErr message to this logger.Info statement.

// If the vlan veth for sure doesn't exist, we delete the vnet ns, but if we can't query, we can't delete the ns (the vlan veth might exist!)
if vlanIfErr != nil {
if strings.Contains(vlanIfErr.Error(), "no such network interface") {
logger.Info("Vlan interface doesn't exist even though network namespace exists, deleting network namespace...")
Copy link
Collaborator

@rbtr rbtr Nov 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

VLAN or vlan, but never Vlan. it's an acronym. and logs don't need to be sentences

@QxBytes
Copy link
Contributor Author

QxBytes commented Nov 10, 2023

@tamilmani1989 The network namespace GetNamed(...) function doesn't have a clearly named error-- it seems like it will return some kind of syscall error (any error related to opening the file).

  • Checked CNI statefile before creating pods, after creating pods, and after deleting pods, and the file looks the same as when it started.
  • Routes appear to be properly deleted and there don't appear to be any straggling interfaces (other than what we expect) after deletion.

@tamilmani1989 tamilmani1989 force-pushed the alew/transparent-vlan-repair branch from 2a7debf to b5a7b4d Compare November 10, 2023 17:54
Reuse existing code to enable ip forwarding

Add forwarding message

Add transparent vlan config to dropgz

Add enable ip forward to each network namespace for redundancy
@QxBytes QxBytes force-pushed the alew/transparent-vlan-repair branch from b5a7b4d to 104935e Compare November 10, 2023 21:45
@tamilmani1989 tamilmani1989 merged commit 4e6cc2f into master Nov 11, 2023
@tamilmani1989 tamilmani1989 deleted the alew/transparent-vlan-repair branch November 11, 2023 00:07
matmerr pushed a commit that referenced this pull request Jan 17, 2024
… instead of assuming (#2341)

* Make initial changes to enable forwarding

Reuse existing code to enable ip forwarding

Add forwarding message

Add transparent vlan config to dropgz

Add enable ip forward to each network namespace for redundancy

* Add general run with retries function

* Migrate existing code to use retry function

* Add retries for checking of vnet and container veth creation

* Add preliminary repair item fixes

* Clarify log message

* Fix unit tests, move ensure clean populate vm to add endpoints, remove ip forwarding fix as it is in other branch

* Move setting link netns and confirmation to function

* Add tests for setLinkNetNSAndConfirm

* Address linter issues

* Add vm ns to vnet namespace log statement

* Fix after rebasing in swift refactoring changes from master

* Address feedback and check error message

* Address feedback

* Address linter issue

* Address log feedback

* Address feedback by asserting any error is an interface not found error
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants