Skip to content

fix: fix azure cni case where route table doesnt exist#8363

Merged
awesomenix merged 1 commit intomainfrom
nishp/fix/e2e/azurecni
Apr 21, 2026
Merged

fix: fix azure cni case where route table doesnt exist#8363
awesomenix merged 1 commit intomainfrom
nishp/fix/e2e/azurecni

Conversation

@awesomenix
Copy link
Copy Markdown
Contributor

Keep @timmy-wright happy

Summary

Handle AKS clusters whose subnet no longer has an associated route table during addFirewallRules.

This keeps the existing kubenet behavior intact while allowing Azure CNI clusters to proceed when AKS does not attach a route table to aks-subnet.

Problem

addFirewallRules assumed the AKS subnet always had an associated route table and failed with:

AKS subnet has no route table associated

That assumption is valid for kubenet, where AKS manages pod routes through a subnet route table, but it is no longer reliable for Azure CNI clusters.

Changes

  • Keep the existing kubenet behavior:
    • require the AKS-managed route table to already exist
    • add firewall routes to that table in place
  • Add an Azure CNI fallback path:
    • detect when aks-subnet has no route table
    • create a dedicated firewall route table
    • associate it to the subnet
    • add the firewall routes to the new table
  • Update the orchestration comment in prepareCluster to document both paths

Why this is safe

  • Kubenet still uses the existing AKS-managed route table, which preserves pod CIDR routes managed by cloud-provider-azure.
  • Azure CNI can now work even when AKS does not create or attach a route table to the subnet.
  • The change only affects the firewall setup path and does not change cluster creation behavior outside that flow.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Updates the e2e cluster preparation flow to handle Azure CNI clusters where the AKS subnet may not have an associated route table during firewall rule setup, while preserving kubenet’s requirement to use the AKS-managed route table.

Changes:

  • Remove the pre-created RouteTablesClient from the shared AzureClient and instead instantiate it on-demand when needed.
  • Add an Azure CNI fallback in addFirewallRules: if the AKS subnet has no route table, create/associate a dedicated firewall route table before adding routes.
  • Update prepareCluster orchestration comments to reflect the new dual-path behavior.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File Description
e2e/config/azure.go Removes RouteTablesClient from the long-lived client bundle (route table operations now created on-demand).
e2e/cluster.go Updates orchestration comment to document kubenet vs Azure CNI firewall route table behavior.
e2e/aks_model.go Implements Azure CNI fallback route table creation/association via new ensureFirewallRouteTable helper.

Comment thread e2e/aks_model.go
Comment on lines 317 to +323
aksSubnetResp, err := config.Azure.Subnet.Get(ctx, rg, vnet.name, "aks-subnet", nil)
if err != nil {
return fmt.Errorf("failed to get AKS subnet: %w", err)
}
if aksSubnetResp.Properties == nil || aksSubnetResp.Properties.RouteTable == nil || aksSubnetResp.Properties.RouteTable.ID == nil {
return fmt.Errorf("AKS subnet has no route table associated")
}

aksRTID := *aksSubnetResp.Properties.RouteTable.ID
parsedRT, err := arm.ParseResourceID(aksRTID)
aksRTName, err := ensureFirewallRouteTable(ctx, clusterModel, vnet.name, aksSubnetResp.Subnet)
if err != nil {
return fmt.Errorf("failed to parse AKS route table resource ID %q: %w", aksRTID, err)
}
aksRTName := parsedRT.Name
if aksRTName == "" {
return fmt.Errorf("parsed empty route table name from resource ID %q", aksRTID)
return err
Copy link

Copilot AI Apr 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addFirewallRules reads subnet "aks-subnet" by a hard-coded name, but the new Azure CNI fallback path associates the route table via updateSubnet, which uses config.Config.DefaultSubnetName. If DEFAULT_SUBNET_NAME is overridden, this can update a different subnet than the one you fetched, leaving the AKS subnet without the route table and causing route creation to fail. Suggest using a single source of truth for the subnet name (e.g., config.Config.DefaultSubnetName) and/or threading the actual subnet name through to the update call.

Copilot uses AI. Check for mistakes.
Comment thread e2e/aks_model.go
Comment on lines +499 to +505
if err := updateSubnet(ctx, clusterModel, aksSubnet, vnetName); err != nil {
return "", fmt.Errorf("failed to associate firewall route table %q with AKS subnet: %w", routeTableName, err)
}

return routeTableName, nil
}

Copy link

Copilot AI Apr 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ensureFirewallRouteTable mutates the subnet returned by Subnet.Get, but then calls updateSubnet() which always targets config.Config.DefaultSubnetName rather than the subnet you just retrieved. This makes the association step brittle and (when the default name differs) incorrect. Consider changing updateSubnet to accept an explicit subnet name (or derive it from aksSubnet.ID) and pass the name of the subnet being updated.

Suggested change
if err := updateSubnet(ctx, clusterModel, aksSubnet, vnetName); err != nil {
return "", fmt.Errorf("failed to associate firewall route table %q with AKS subnet: %w", routeTableName, err)
}
return routeTableName, nil
}
subnetName, err := subnetNameFromSubnet(aksSubnet)
if err != nil {
return "", fmt.Errorf("failed to determine AKS subnet name for firewall route table association: %w", err)
}
if err := updateSubnetByName(ctx, clusterModel, aksSubnet, vnetName, subnetName); err != nil {
return "", fmt.Errorf("failed to associate firewall route table %q with AKS subnet %q: %w", routeTableName, subnetName, err)
}
return routeTableName, nil
}
func subnetNameFromSubnet(subnet armnetwork.Subnet) (string, error) {
if subnet.Name != nil && *subnet.Name != "" {
return *subnet.Name, nil
}
if subnet.ID == nil || *subnet.ID == "" {
return "", fmt.Errorf("subnet has neither name nor ID")
}
parsedSubnet, err := arm.ParseResourceID(*subnet.ID)
if err != nil {
return "", fmt.Errorf("failed to parse subnet resource ID %q: %w", *subnet.ID, err)
}
if parsedSubnet.Name == "" {
return "", fmt.Errorf("parsed empty subnet name from resource ID %q", *subnet.ID)
}
return parsedSubnet.Name, nil
}
func updateSubnetByName(ctx context.Context, clusterModel *armcontainerservice.ManagedCluster, subnet armnetwork.Subnet, vnetName, subnetName string) error {
if clusterModel == nil || clusterModel.Properties == nil || clusterModel.Properties.NodeResourceGroup == nil {
return fmt.Errorf("cluster model has no node resource group")
}
subnetsClient, err := armnetwork.NewSubnetsClient(config.Config.SubscriptionID, config.Azure.Credential, config.Azure.ArmOptions)
if err != nil {
return fmt.Errorf("failed to create subnets client: %w", err)
}
poller, err := subnetsClient.BeginCreateOrUpdate(ctx, *clusterModel.Properties.NodeResourceGroup, vnetName, subnetName, subnet, nil)
if err != nil {
return fmt.Errorf("failed to start updating subnet %q: %w", subnetName, err)
}
if _, err := poller.PollUntilDone(ctx, config.DefaultPollUntilDoneOptions); err != nil {
return fmt.Errorf("failed to update subnet %q: %w", subnetName, err)
}
return nil
}

Copilot uses AI. Check for mistakes.
Copilot AI review requested due to automatic review settings April 21, 2026 06:04
@awesomenix awesomenix force-pushed the nishp/fix/e2e/azurecni branch from 9e4bc1a to d0afaf3 Compare April 21, 2026 06:04
@awesomenix awesomenix enabled auto-merge (squash) April 21, 2026 06:05
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

Comment thread e2e/aks_model.go
Comment on lines 317 to 319
aksSubnetResp, err := config.Azure.Subnet.Get(ctx, rg, vnet.name, "aks-subnet", nil)
if err != nil {
return fmt.Errorf("failed to get AKS subnet: %w", err)
Copy link

Copilot AI Apr 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addFirewallRules fetches the subnet using the hard-coded name "aks-subnet", but the association update path (updateSubnet) always writes to config.Config.DefaultSubnetName. If DEFAULT_SUBNET_NAME is overridden, this can end up updating a different subnet than the one you just read (and the route table association won’t be applied where expected). Consider using config.Config.DefaultSubnetName consistently for the Subnet.Get call, or pass the subnet name through to updateSubnet and use the name from the fetched subnet.

Suggested change
aksSubnetResp, err := config.Azure.Subnet.Get(ctx, rg, vnet.name, "aks-subnet", nil)
if err != nil {
return fmt.Errorf("failed to get AKS subnet: %w", err)
aksSubnetName := config.Config.DefaultSubnetName
aksSubnetResp, err := config.Azure.Subnet.Get(ctx, rg, vnet.name, aksSubnetName, nil)
if err != nil {
return fmt.Errorf("failed to get AKS subnet %q: %w", aksSubnetName, err)

Copilot uses AI. Check for mistakes.
Comment thread e2e/validation.go
@@ -73,7 +74,7 @@ func ValidateCommonLinux(ctx context.Context, s *Scenario) {
}

// localdns is not supported on scriptless, privatekube and VHDUbuntu2204Gen2ContainerdNetworkIsolatedK8sNotCached.
Copy link

Copilot AI Apr 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment above this condition is now stale: the code also skips LocalDNS validation when running in TEST_PRE_PROVISION mode or when VHDCaching is enabled. Update the comment to reflect the full set of reasons LocalDNS validation is skipped so it stays accurate for future debugging.

Suggested change
// localdns is not supported on scriptless, privatekube and VHDUbuntu2204Gen2ContainerdNetworkIsolatedK8sNotCached.
// Skip LocalDNS validation when it is unsupported on the VHD (for example scriptless, privatekube,
// and VHDUbuntu2204Gen2ContainerdNetworkIsolatedK8sNotCached), when running in TEST_PRE_PROVISION
// mode, or when VHDCaching is enabled.

Copilot uses AI. Check for mistakes.
@awesomenix awesomenix disabled auto-merge April 21, 2026 06:31
@awesomenix awesomenix merged commit d494a41 into main Apr 21, 2026
25 of 32 checks passed
@awesomenix awesomenix deleted the nishp/fix/e2e/azurecni branch April 21, 2026 06:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants