
fix(e2e): prevent kubenet route table detachment and stale node accumulation#8279

Open
djsly wants to merge 2 commits into main from djsly/fix-kubenet-route-table

Conversation


djsly (Collaborator) commented Apr 11, 2026

Problem

ALL 40 kubenet E2E tests have been failing consistently with "cni plugin not initialized" errors and nodes stuck in "Waiting for cloud routes". Root-cause investigation revealed a compound failure involving two independent bugs in the E2E infrastructure.

Root Cause Analysis

1. Stale K8s Node Accumulation

collectGarbageVMSS deletes Azure VMSS instances but does NOT clean up the corresponding Kubernetes Node objects. Over multiple E2E runs, 99+ stale nodes accumulated in the kubenet cluster. The cloud-provider-azure route controller perpetually retried CreateRoute for these ghost nodes, receiving instance not found errors (36+ observed in events). This saturated the route controller, starving new legitimate nodes of routes.

2. Disconnected Route Table

addFirewallRules created a separate abe2e-fw-rt route table (containing only firewall routes) and swapped it onto the subnet, disconnecting the AKS-managed aks-agentpool-*-routetable that holds pod CIDR routes. Cloud-provider-azure continued writing pod routes to the disconnected table — with no effect on actual traffic.

Evidence

  • Azure Activity Logs: ZERO subnet modifications during the entire test window. Only 1 route table write at 20:38 (after tests were already failing).
  • kubectl: 99 Node objects in cluster, 36+ FailedToCreateRoute events with "instance not found"
  • Pass/fail correlation: ALL 40 kubenet tests failed; "passing" tests were Azure CNI / network-isolated (different cluster)
  • Timeline: 41 VMSS hit the kubenet cluster in a 22-second window from both Regular and Scriptless E2E pipelines

Fix

Fix 1: Stale Node Cleanup (e2e/cluster.go)

Added collectGarbageNodes() that:

  • Lists all K8s nodes in the cluster
  • Builds a set of active VMSS names
  • Identifies nodes whose backing VMSS no longer exists (by deriving VMSS name from node name)
  • Preserves system pool nodes (prefixed with aks-)
  • Deletes stale nodes

Wired into the DAG via dag.Run1 with kube dependency so it runs after VMSS GC.

Fix 2: Preserve AKS Route Table (e2e/aks_model.go)

Rewrote addFirewallRules() to:

  • Find the AKS-managed route table already associated with the subnet
  • Add firewall routes (vnet-local + default-route-to-firewall) directly to that table
  • Use individual Routes client operations to avoid TOCTOU races
  • Never create a separate route table or change subnet association

Supporting Change (e2e/config/azure.go)

Added Routes *armnetwork.RoutesClient for individual route operations.

Live Cluster Remediation

  • ✅ Re-associated aks-agentpool-13663576-routetable with aks-subnet
  • ✅ Added vnet-local and default-route-to-firewall routes to AKS table
  • ✅ Both pod CIDR routes and firewall routes now coexist

Testing

  • go build ./... passes
  • Changes are E2E infrastructure only (no parts/ or pkg/ changes, no make generate needed)
  • Full E2E validation will occur on next pipeline run

…ulation

Two fixes for the kubenet E2E test failures where ALL 40 kubenet tests
fail with 'cni plugin not initialized' (nodes never become Ready):

1. Add stale K8s Node cleanup to collectGarbageVMSS:
   - collectGarbageVMSS deleted Azure VMSS instances but left Kubernetes
     Node objects behind. Over multiple E2E runs, 99+ stale nodes
     accumulated in the cluster.
   - These stale nodes cause cloud-provider-azure route controller to
     loop with 'FailedToCreateRoute: instance not found' errors,
     saturating the controller and delaying route creation for real nodes.
   - Now also deletes K8s Nodes whose backing VMSS no longer exists.

2. Add firewall routes to the AKS-managed route table instead of
   creating a separate one:
   - Previously, addFirewallRules created 'abe2e-fw-rt' and swapped
     the subnet association, disconnecting the AKS-managed route table
     (aks-agentpool-*-routetable) that holds pod CIDR routes.
   - cloud-provider-azure kept writing pod routes to the detached table,
     making them ineffective.
   - Now adds firewall routes (vnet-local + default-to-firewall)
     directly to the AKS route table using individual route operations,
     preserving both pod routes and firewall routes on the same table.

Root cause: 41 VMSS nodes joining simultaneously overwhelmed the route
controller, which was already saturated processing 99 stale nodes.
Only 1 route table write occurred during the entire 20-minute test
window (confirmed via Azure Activity Logs).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI left a comment


Pull request overview

Fixes kubenet E2E infrastructure failures by preventing AKS pod route table detachment during firewall setup and by cleaning up stale Kubernetes Node objects left behind after VMSS garbage collection.

Changes:

  • Add armnetwork.RoutesClient to the shared Azure client so firewall routes can be upserted individually.
  • Update cluster preparation to run VMSS GC with kube access and delete stale K8s Node objects whose backing VMSS no longer exists.
  • Rewrite addFirewallRules to add firewall routes directly into the AKS-managed route table instead of swapping subnet route table associations.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

Files:

  • e2e/config/azure.go: Adds RoutesClient initialization to support per-route operations.
  • e2e/cluster.go: Extends VMSS garbage collection to also delete stale Kubernetes Node objects.
  • e2e/aks_model.go: Updates firewall routing to modify the existing AKS route table (no subnet route table reassociation).

e2e/aks_model.go Outdated
Comment on lines +329 to +340
```go
// Check if firewall routes already exist in the AKS route table
aksRT, err := config.Azure.RouteTables.Get(ctx, rg, aksRTName, nil)
if err != nil {
	return fmt.Errorf("failed to get AKS route table %q: %w", aksRTName, err)
}

if aksRT.Properties != nil {
	for _, route := range aksRT.Properties.Routes {
		if route.Name != nil && *route.Name == "default-route-to-firewall" {
			toolkit.Logf(ctx, "firewall routes already present in AKS route table %q", aksRTName)
			return nil
		}
```

Copilot AI Apr 11, 2026


The idempotency check returns early if it finds a route named "default-route-to-firewall", which can leave the route table in a partially-configured or stale state (e.g., missing the required "vnet-local" route, or pointing the default route at an old firewallPrivateIP after a firewall recreation). Instead of returning early, ensure both routes exist and match the desired properties (or just always upsert both routes since CreateOrUpdate is already idempotent).

e2e/aks_model.go Outdated
Comment on lines +325 to +328
```go
aksRTID := *aksSubnetResp.Properties.RouteTable.ID
idParts := strings.Split(aksRTID, "/")
aksRTName := idParts[len(idParts)-1]
```

Copilot AI Apr 11, 2026


Deriving the route table name via strings.Split(aksRTID, "/") and taking the last element can yield an empty/incorrect name if the ID is malformed or ends with a trailing slash, which would later cause confusing failures (or panic if additional indexing is added). Consider parsing the Azure resource ID with azcore/arm helpers (e.g., arm.ParseResourceID) and validate that the extracted route table name is non-empty before proceeding.

e2e/cluster.go Outdated
Comment on lines +751 to +758
```go
func collectGarbageNodes(ctx context.Context, kube *Kubeclient, existingVMSS map[string]struct{}) {
	defer toolkit.LogStepCtx(ctx, "collecting garbage K8s nodes")()

	nodes, err := kube.Typed.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
	if err != nil {
		toolkit.Logf(ctx, "failed to list K8s nodes for garbage collection: %s", err)
		return
	}
```

Copilot AI Apr 11, 2026


collectGarbageNodes logs and returns on list/delete failures, which can silently leave stale Nodes behind and cause the next E2E run to fail again (route controller saturation). Consider returning an error from collectGarbageNodes and propagating it via collectGarbageVMSS so the prepareCluster DAG fails fast when garbage collection can’t complete.

…gate GC errors

- Use arm.ParseResourceID instead of strings.Split for route table name
- Remove early-return idempotency check; rely on CreateOrUpdate being
  idempotent to always ensure both firewall routes are correct
- Return errors from collectGarbageNodes so DAG fails fast on GC issues

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>