
fix(e2e): prevent kubenet route table detachment and stale node accumulation#8279

Open
djsly wants to merge 2 commits into main from djsly/fix-kubenet-route-table

Conversation


djsly (Collaborator) commented Apr 11, 2026

Problem

ALL 40 kubenet E2E tests have been failing consistently with "cni plugin not initialized" errors and nodes stuck in "Waiting for cloud routes". Root-cause investigation revealed a compound failure involving two independent bugs in the E2E infrastructure.

Root Cause Analysis

1. Stale K8s Node Accumulation

collectGarbageVMSS deletes Azure VMSS instances but does NOT clean up the corresponding Kubernetes Node objects. Over multiple E2E runs, 99+ stale nodes accumulated in the kubenet cluster. The cloud-provider-azure route controller perpetually retried CreateRoute for these ghost nodes, receiving instance not found errors (36+ observed in events). This saturated the route controller, starving new legitimate nodes of routes.

2. Disconnected Route Table

addFirewallRules created a separate abe2e-fw-rt route table (containing only firewall routes) and swapped it onto the subnet, disconnecting the AKS-managed aks-agentpool-*-routetable that holds pod CIDR routes. Cloud-provider-azure continued writing pod routes to the disconnected table — with no effect on actual traffic.

Evidence

  • Azure Activity Logs: ZERO subnet modifications during the entire test window. Only 1 route table write at 20:38 (after tests were already failing).
  • kubectl: 99 Node objects in cluster, 36+ FailedToCreateRoute events with "instance not found"
  • Pass/fail correlation: ALL 40 kubenet tests failed; "passing" tests were Azure CNI / network-isolated (different cluster)
  • Timeline: 41 VMSS hit the kubenet cluster in a 22-second window from both Regular and Scriptless E2E pipelines

Fix

Fix 1: Stale Node Cleanup (e2e/cluster.go)

Added collectGarbageNodes() that:

  • Lists all K8s nodes in the cluster
  • Builds a set of active VMSS names
  • Identifies nodes whose backing VMSS no longer exists (by deriving VMSS name from node name)
  • Preserves system pool nodes (prefixed with aks-)
  • Deletes stale nodes

Wired into the DAG via dag.Run1 with kube dependency so it runs after VMSS GC.

Fix 2: Preserve AKS Route Table (e2e/aks_model.go)

Rewrote addFirewallRules() to:

  • Find the AKS-managed route table already associated with the subnet
  • Add firewall routes (vnet-local + default-route-to-firewall) directly to that table
  • Use individual Routes client operations to avoid TOCTOU races
  • Never create a separate route table or change subnet association

Supporting Change (e2e/config/azure.go)

Added Routes *armnetwork.RoutesClient for individual route operations.

Live Cluster Remediation

  • ✅ Re-associated aks-agentpool-13663576-routetable with aks-subnet
  • ✅ Added vnet-local and default-route-to-firewall routes to AKS table
  • ✅ Both pod CIDR routes and firewall routes now coexist

Testing

  • go build ./... passes
  • Changes are E2E infrastructure only (no parts/ or pkg/ changes, no make generate needed)
  • Full E2E validation will occur on next pipeline run

…ulation

Two fixes for the kubenet E2E test failures where ALL 40 kubenet tests
fail with 'cni plugin not initialized' (nodes never become Ready):

1. Add stale K8s Node cleanup to collectGarbageVMSS:
   - collectGarbageVMSS deleted Azure VMSS instances but left Kubernetes
     Node objects behind. Over multiple E2E runs, 99+ stale nodes
     accumulated in the cluster.
   - These stale nodes cause cloud-provider-azure route controller to
     loop with 'FailedToCreateRoute: instance not found' errors,
     saturating the controller and delaying route creation for real nodes.
   - Now also deletes K8s Nodes whose backing VMSS no longer exists.

2. Add firewall routes to the AKS-managed route table instead of
   creating a separate one:
   - Previously, addFirewallRules created 'abe2e-fw-rt' and swapped
     the subnet association, disconnecting the AKS-managed route table
     (aks-agentpool-*-routetable) that holds pod CIDR routes.
   - cloud-provider-azure kept writing pod routes to the detached table,
     making them ineffective.
   - Now adds firewall routes (vnet-local + default-to-firewall)
     directly to the AKS route table using individual route operations,
     preserving both pod routes and firewall routes on the same table.

Root cause: 41 VMSS nodes joining simultaneously overwhelmed the route
controller, which was already saturated processing 99 stale nodes.
Only 1 route table write occurred during the entire 20-minute test
window (confirmed via Azure Activity Logs).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI left a comment


Pull request overview

Fixes kubenet E2E infrastructure failures by preventing AKS pod route table detachment during firewall setup and by cleaning up stale Kubernetes Node objects left behind after VMSS garbage collection.

Changes:

  • Add armnetwork.RoutesClient to the shared Azure client so firewall routes can be upserted individually.
  • Update cluster preparation to run VMSS GC with kube access and delete stale K8s Node objects whose backing VMSS no longer exists.
  • Rewrite addFirewallRules to add firewall routes directly into the AKS-managed route table instead of swapping subnet route table associations.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

Files:

  • e2e/config/azure.go: Adds RoutesClient initialization to support per-route operations.
  • e2e/cluster.go: Extends VMSS garbage collection to also delete stale Kubernetes Node objects.
  • e2e/aks_model.go: Updates firewall routing to modify the existing AKS route table (no subnet route table reassociation).

e2e/aks_model.go Outdated
Comment on lines +329 to +340
```go
// Check if firewall routes already exist in the AKS route table
aksRT, err := config.Azure.RouteTables.Get(ctx, rg, aksRTName, nil)
if err != nil {
	return fmt.Errorf("failed to get AKS route table %q: %w", aksRTName, err)
}

if aksRT.Properties != nil {
	for _, route := range aksRT.Properties.Routes {
		if route.Name != nil && *route.Name == "default-route-to-firewall" {
			toolkit.Logf(ctx, "firewall routes already present in AKS route table %q", aksRTName)
			return nil
		}
```

Copilot AI Apr 11, 2026


The idempotency check returns early if it finds a route named "default-route-to-firewall", which can leave the route table in a partially-configured or stale state (e.g., missing the required "vnet-local" route, or pointing the default route at an old firewallPrivateIP after a firewall recreation). Instead of returning early, ensure both routes exist and match the desired properties (or just always upsert both routes since CreateOrUpdate is already idempotent).

e2e/aks_model.go Outdated
Comment on lines +325 to +328
```go
aksRTID := *aksSubnetResp.Properties.RouteTable.ID
idParts := strings.Split(aksRTID, "/")
aksRTName := idParts[len(idParts)-1]
```

Copilot AI Apr 11, 2026


Deriving the route table name via strings.Split(aksRTID, "/") and taking the last element can yield an empty/incorrect name if the ID is malformed or ends with a trailing slash, which would later cause confusing failures (or panic if additional indexing is added). Consider parsing the Azure resource ID with azcore/arm helpers (e.g., arm.ParseResourceID) and validate that the extracted route table name is non-empty before proceeding.

e2e/cluster.go Outdated
Comment on lines +751 to +758
```go
func collectGarbageNodes(ctx context.Context, kube *Kubeclient, existingVMSS map[string]struct{}) {
	defer toolkit.LogStepCtx(ctx, "collecting garbage K8s nodes")()

	nodes, err := kube.Typed.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
	if err != nil {
		toolkit.Logf(ctx, "failed to list K8s nodes for garbage collection: %s", err)
		return
	}
```

Copilot AI Apr 11, 2026


collectGarbageNodes logs and returns on list/delete failures, which can silently leave stale Nodes behind and cause the next E2E run to fail again (route controller saturation). Consider returning an error from collectGarbageNodes and propagating it via collectGarbageVMSS so the prepareCluster DAG fails fast when garbage collection can’t complete.

…gate GC errors

- Use arm.ParseResourceID instead of strings.Split for route table name
- Remove early-return idempotency check; rely on CreateOrUpdate being
  idempotent to always ensure both firewall routes are correct
- Return errors from collectGarbageNodes so DAG fails fast on GC issues

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>