fix(e2e): prevent kubenet route table detachment and stale node accumulation #8279
Conversation
Two fixes for the kubenet E2E test failures where ALL 40 kubenet tests
fail with 'cni plugin not initialized' (nodes never become Ready):
1. Add stale K8s Node cleanup to collectGarbageVMSS:
- collectGarbageVMSS deleted Azure VMSS instances but left Kubernetes
Node objects behind. Over multiple E2E runs, 99+ stale nodes
accumulated in the cluster.
- These stale nodes cause cloud-provider-azure route controller to
loop with 'FailedToCreateRoute: instance not found' errors,
saturating the controller and delaying route creation for real nodes.
- Now also deletes K8s Nodes whose backing VMSS no longer exists.
2. Add firewall routes to the AKS-managed route table instead of
creating a separate one:
- Previously, addFirewallRules created 'abe2e-fw-rt' and swapped
the subnet association, disconnecting the AKS-managed route table
(aks-agentpool-*-routetable) that holds pod CIDR routes.
- cloud-provider-azure kept writing pod routes to the detached table,
making them ineffective.
- Now adds firewall routes (vnet-local + default-to-firewall)
directly to the AKS route table using individual route operations,
preserving both pod routes and firewall routes on the same table.
Root cause: 41 VMSS nodes joining simultaneously overwhelmed the route
controller, which was already saturated processing 99 stale nodes.
Only 1 route table write occurred during the entire 20-minute test
window (confirmed via Azure Activity Logs).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pull request overview
Fixes kubenet E2E infrastructure failures by preventing AKS pod route table detachment during firewall setup and by cleaning up stale Kubernetes Node objects left behind after VMSS garbage collection.
Changes:
- Add `armnetwork.RoutesClient` to the shared Azure client so firewall routes can be upserted individually.
- Update cluster preparation to run VMSS GC with kube access and delete stale K8s Node objects whose backing VMSS no longer exists.
- Rewrite `addFirewallRules` to add firewall routes directly into the AKS-managed route table instead of swapping subnet route table associations.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| e2e/config/azure.go | Adds RoutesClient initialization to support per-route operations. |
| e2e/cluster.go | Extends VMSS garbage collection to also delete stale Kubernetes Node objects. |
| e2e/aks_model.go | Updates firewall routing to modify the existing AKS route table (no subnet route table reassociation). |
e2e/aks_model.go
Outdated
```go
// Check if firewall routes already exist in the AKS route table
aksRT, err := config.Azure.RouteTables.Get(ctx, rg, aksRTName, nil)
if err != nil {
	return fmt.Errorf("failed to get AKS route table %q: %w", aksRTName, err)
}

if aksRT.Properties != nil {
	for _, route := range aksRT.Properties.Routes {
		if route.Name != nil && *route.Name == "default-route-to-firewall" {
			toolkit.Logf(ctx, "firewall routes already present in AKS route table %q", aksRTName)
			return nil
		}
```
The idempotency check returns early if it finds a route named "default-route-to-firewall", which can leave the route table in a partially-configured or stale state (e.g., missing the required "vnet-local" route, or pointing the default route at an old firewallPrivateIP after a firewall recreation). Instead of returning early, ensure both routes exist and match the desired properties (or just always upsert both routes since CreateOrUpdate is already idempotent).
e2e/aks_model.go
Outdated
```go
aksRTID := *aksSubnetResp.Properties.RouteTable.ID
idParts := strings.Split(aksRTID, "/")
aksRTName := idParts[len(idParts)-1]
```
Deriving the route table name via strings.Split(aksRTID, "/") and taking the last element can yield an empty/incorrect name if the ID is malformed or ends with a trailing slash, which would later cause confusing failures (or panic if additional indexing is added). Consider parsing the Azure resource ID with azcore/arm helpers (e.g., arm.ParseResourceID) and validate that the extracted route table name is non-empty before proceeding.
e2e/cluster.go
Outdated
```go
func collectGarbageNodes(ctx context.Context, kube *Kubeclient, existingVMSS map[string]struct{}) {
	defer toolkit.LogStepCtx(ctx, "collecting garbage K8s nodes")()

	nodes, err := kube.Typed.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
	if err != nil {
		toolkit.Logf(ctx, "failed to list K8s nodes for garbage collection: %s", err)
		return
	}
```
collectGarbageNodes logs and returns on list/delete failures, which can silently leave stale Nodes behind and cause the next E2E run to fail again (route controller saturation). Consider returning an error from collectGarbageNodes and propagating it via collectGarbageVMSS so the prepareCluster DAG fails fast when garbage collection can’t complete.
…gate GC errors:

- Use arm.ParseResourceID instead of strings.Split for route table name
- Remove early-return idempotency check; rely on CreateOrUpdate being idempotent to always ensure both firewall routes are correct
- Return errors from collectGarbageNodes so DAG fails fast on GC issues

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
### Problem

ALL 40 kubenet E2E tests have been failing consistently with `cni plugin not initialized` / nodes stuck in `Waiting for cloud routes`. Root cause investigation revealed a compound failure involving two independent bugs in the E2E infrastructure.

### Root Cause Analysis

**1. Stale K8s Node Accumulation**

`collectGarbageVMSS` deletes Azure VMSS instances but does NOT clean up the corresponding Kubernetes Node objects. Over multiple E2E runs, 99+ stale nodes accumulated in the kubenet cluster. The cloud-provider-azure route controller perpetually retried `CreateRoute` for these ghost nodes, receiving `instance not found` errors (36+ observed in events). This saturated the route controller, starving new legitimate nodes of routes.

**2. Disconnected Route Table**

`addFirewallRules` created a separate `abe2e-fw-rt` route table (containing only firewall routes) and swapped it onto the subnet, disconnecting the AKS-managed `aks-agentpool-*-routetable` that holds pod CIDR routes. Cloud-provider-azure continued writing pod routes to the disconnected table, with no effect on actual traffic.

### Evidence

- `FailedToCreateRoute` events with "instance not found"

### Fix
**Fix 1: Stale Node Cleanup (`e2e/cluster.go`)**

Added `collectGarbageNodes()`, which deletes K8s Nodes (name prefix `aks-`) whose backing VMSS no longer exists. Wired into the DAG via `dag.Run1` with a kube dependency so it runs after VMSS GC.

**Fix 2: Preserve AKS Route Table (`e2e/aks_model.go`)**

Rewrote `addFirewallRules()` to:

- Add the firewall routes (`vnet-local` + `default-route-to-firewall`) directly to the existing AKS-managed route table
- Use individual `Routes` client operations to avoid TOCTOU races

**Supporting Change (`e2e/config/azure.go`)**

Added `Routes *armnetwork.RoutesClient` for individual route operations.

**Live Cluster Remediation**

- Re-associated `aks-agentpool-13663576-routetable` with `aks-subnet`
- Added `vnet-local` and `default-route-to-firewall` routes to the AKS table

**Testing**

- `go build ./...` passes
- No `parts/` or `pkg/` changes, so no `make generate` needed