fix: fix azure cni case where route table doesnt exist#8363
Conversation
There was a problem hiding this comment.
Pull request overview
Updates the e2e cluster preparation flow to handle Azure CNI clusters where the AKS subnet may not have an associated route table during firewall rule setup, while preserving kubenet’s requirement to use the AKS-managed route table.
Changes:
- Remove the pre-created
RouteTablesClientfrom the sharedAzureClientand instead instantiate it on-demand when needed. - Add an Azure CNI fallback in
addFirewallRules: if the AKS subnet has no route table, create/associate a dedicated firewall route table before adding routes. - Update
prepareClusterorchestration comments to reflect the new dual-path behavior.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| e2e/config/azure.go | Removes RouteTablesClient from the long-lived client bundle (route table operations now created on-demand). |
| e2e/cluster.go | Updates orchestration comment to document kubenet vs Azure CNI firewall route table behavior. |
| e2e/aks_model.go | Implements Azure CNI fallback route table creation/association via new ensureFirewallRouteTable helper. |
| aksSubnetResp, err := config.Azure.Subnet.Get(ctx, rg, vnet.name, "aks-subnet", nil) | ||
| if err != nil { | ||
| return fmt.Errorf("failed to get AKS subnet: %w", err) | ||
| } | ||
| if aksSubnetResp.Properties == nil || aksSubnetResp.Properties.RouteTable == nil || aksSubnetResp.Properties.RouteTable.ID == nil { | ||
| return fmt.Errorf("AKS subnet has no route table associated") | ||
| } | ||
|
|
||
| aksRTID := *aksSubnetResp.Properties.RouteTable.ID | ||
| parsedRT, err := arm.ParseResourceID(aksRTID) | ||
| aksRTName, err := ensureFirewallRouteTable(ctx, clusterModel, vnet.name, aksSubnetResp.Subnet) | ||
| if err != nil { | ||
| return fmt.Errorf("failed to parse AKS route table resource ID %q: %w", aksRTID, err) | ||
| } | ||
| aksRTName := parsedRT.Name | ||
| if aksRTName == "" { | ||
| return fmt.Errorf("parsed empty route table name from resource ID %q", aksRTID) | ||
| return err |
There was a problem hiding this comment.
addFirewallRules reads subnet "aks-subnet" by a hard-coded name, but the new Azure CNI fallback path associates the route table via updateSubnet, which uses config.Config.DefaultSubnetName. If DEFAULT_SUBNET_NAME is overridden, this can update a different subnet than the one you fetched, leaving the AKS subnet without the route table and causing route creation to fail. Suggest using a single source of truth for the subnet name (e.g., config.Config.DefaultSubnetName) and/or threading the actual subnet name through to the update call.
| if err := updateSubnet(ctx, clusterModel, aksSubnet, vnetName); err != nil { | ||
| return "", fmt.Errorf("failed to associate firewall route table %q with AKS subnet: %w", routeTableName, err) | ||
| } | ||
|
|
||
| return routeTableName, nil | ||
| } | ||
|
|
There was a problem hiding this comment.
ensureFirewallRouteTable mutates the subnet returned by Subnet.Get, but then calls updateSubnet() which always targets config.Config.DefaultSubnetName rather than the subnet you just retrieved. This makes the association step brittle and (when the default name differs) incorrect. Consider changing updateSubnet to accept an explicit subnet name (or derive it from aksSubnet.ID) and pass the name of the subnet being updated.
| if err := updateSubnet(ctx, clusterModel, aksSubnet, vnetName); err != nil { | |
| return "", fmt.Errorf("failed to associate firewall route table %q with AKS subnet: %w", routeTableName, err) | |
| } | |
| return routeTableName, nil | |
| } | |
| subnetName, err := subnetNameFromSubnet(aksSubnet) | |
| if err != nil { | |
| return "", fmt.Errorf("failed to determine AKS subnet name for firewall route table association: %w", err) | |
| } | |
| if err := updateSubnetByName(ctx, clusterModel, aksSubnet, vnetName, subnetName); err != nil { | |
| return "", fmt.Errorf("failed to associate firewall route table %q with AKS subnet %q: %w", routeTableName, subnetName, err) | |
| } | |
| return routeTableName, nil | |
| } | |
| func subnetNameFromSubnet(subnet armnetwork.Subnet) (string, error) { | |
| if subnet.Name != nil && *subnet.Name != "" { | |
| return *subnet.Name, nil | |
| } | |
| if subnet.ID == nil || *subnet.ID == "" { | |
| return "", fmt.Errorf("subnet has neither name nor ID") | |
| } | |
| parsedSubnet, err := arm.ParseResourceID(*subnet.ID) | |
| if err != nil { | |
| return "", fmt.Errorf("failed to parse subnet resource ID %q: %w", *subnet.ID, err) | |
| } | |
| if parsedSubnet.Name == "" { | |
| return "", fmt.Errorf("parsed empty subnet name from resource ID %q", *subnet.ID) | |
| } | |
| return parsedSubnet.Name, nil | |
| } | |
| func updateSubnetByName(ctx context.Context, clusterModel *armcontainerservice.ManagedCluster, subnet armnetwork.Subnet, vnetName, subnetName string) error { | |
| if clusterModel == nil || clusterModel.Properties == nil || clusterModel.Properties.NodeResourceGroup == nil { | |
| return fmt.Errorf("cluster model has no node resource group") | |
| } | |
| subnetsClient, err := armnetwork.NewSubnetsClient(config.Config.SubscriptionID, config.Azure.Credential, config.Azure.ArmOptions) | |
| if err != nil { | |
| return fmt.Errorf("failed to create subnets client: %w", err) | |
| } | |
| poller, err := subnetsClient.BeginCreateOrUpdate(ctx, *clusterModel.Properties.NodeResourceGroup, vnetName, subnetName, subnet, nil) | |
| if err != nil { | |
| return fmt.Errorf("failed to start updating subnet %q: %w", subnetName, err) | |
| } | |
| if _, err := poller.PollUntilDone(ctx, config.DefaultPollUntilDoneOptions); err != nil { | |
| return fmt.Errorf("failed to update subnet %q: %w", subnetName, err) | |
| } | |
| return nil | |
| } |
e4cac6d to
9e4bc1a
Compare
9e4bc1a to
d0afaf3
Compare
| aksSubnetResp, err := config.Azure.Subnet.Get(ctx, rg, vnet.name, "aks-subnet", nil) | ||
| if err != nil { | ||
| return fmt.Errorf("failed to get AKS subnet: %w", err) |
There was a problem hiding this comment.
addFirewallRules fetches the subnet using the hard-coded name "aks-subnet", but the association update path (updateSubnet) always writes to config.Config.DefaultSubnetName. If DEFAULT_SUBNET_NAME is overridden, this can end up updating a different subnet than the one you just read (and the route table association won’t be applied where expected). Consider using config.Config.DefaultSubnetName consistently for the Subnet.Get call, or pass the subnet name through to updateSubnet and use the name from the fetched subnet.
| aksSubnetResp, err := config.Azure.Subnet.Get(ctx, rg, vnet.name, "aks-subnet", nil) | |
| if err != nil { | |
| return fmt.Errorf("failed to get AKS subnet: %w", err) | |
| aksSubnetName := config.Config.DefaultSubnetName | |
| aksSubnetResp, err := config.Azure.Subnet.Get(ctx, rg, vnet.name, aksSubnetName, nil) | |
| if err != nil { | |
| return fmt.Errorf("failed to get AKS subnet %q: %w", aksSubnetName, err) |
| @@ -73,7 +74,7 @@ func ValidateCommonLinux(ctx context.Context, s *Scenario) { | |||
| } | |||
|
|
|||
| // localdns is not supported on scriptless, privatekube and VHDUbuntu2204Gen2ContainerdNetworkIsolatedK8sNotCached. | |||
There was a problem hiding this comment.
The comment above this condition is now stale: the code also skips LocalDNS validation when running in TEST_PRE_PROVISION mode or when VHDCaching is enabled. Update the comment to reflect the full set of reasons LocalDNS validation is skipped so it stays accurate for future debugging.
| // localdns is not supported on scriptless, privatekube and VHDUbuntu2204Gen2ContainerdNetworkIsolatedK8sNotCached. | |
| // Skip LocalDNS validation when it is unsupported on the VHD (for example scriptless, privatekube, | |
| // and VHDUbuntu2204Gen2ContainerdNetworkIsolatedK8sNotCached), when running in TEST_PRE_PROVISION | |
| // mode, or when VHDCaching is enabled. |
Keep @timmy-wright happy
Summary
Handle AKS clusters whose subnet no longer has an associated route table during
addFirewallRules.This keeps the existing kubenet behavior intact while allowing Azure CNI clusters to proceed when AKS does not attach a route table to
aks-subnet.Problem
addFirewallRulesassumed the AKS subnet always had an associated route table and failed with:That assumption is valid for kubenet, where AKS manages pod routes through a subnet route table, but it is no longer reliable for Azure CNI clusters.
Changes
aks-subnethas no route tableprepareClusterto document both pathsWhy this is safe