Summary
Add a Karpenter cloudprovider that provisions external nodes in other Azure regions, modeled on the existing karpenter/pkg/cloudproviders/nebius/ pattern. Goal: let an AKS cluster (e.g. westeurope) auto-provision capacity for SKUs only available in another region (e.g. Standard_ND96isr_H200_v5 in eastus2) via a standard Karpenter NodePool.
Motivation
The merged PR #61 + open PR #62 unblocked manual cross-region node join. We have 2 H200 nodes in eastus2 joined to an AKS control plane in westeurope — proven working end-to-end with aks-flex-node v0.0.18.
What's missing is lifecycle automation: today operators must run gen_userdata.py + az vm create for every node and clean up by hand. A Karpenter cloudprovider closes the loop so a researcher's pending Pod with nodeSelector: gpu=h200 triggers VM creation in eastus2 automatically.
The existing upstream karpenter-provider-azure provisions only into the AKS cluster's own region (VMSS-based). Cross-region requires a separate provider.
Proposed Phase 1 Scope
One region per NodeClass, BYO network, on-demand only, static SKU allowlist.
Mirror nebius:
karpenter/pkg/apis/v1alpha1/azureflex.go # AzureFlexNodeClass CRD
karpenter/pkg/cloudproviders/azure/ # CloudProvider impl
karpenter/pkg/controllers/azure/ # NodeClass status + termination
karpenter/examples/azure/ # NodePool + NodeClass YAML
AzureFlexNodeClassSpec:
- subscriptionID (required)
- location (required, e.g. "eastus2")
- resourceGroup (required)
- subnetID (required, full ARM resource ID; assumes operator pre-provisioned VNet/peering/NSG)
- imageReference (publisher/offer/sku/version) or imageID (SIG/community gallery)
- securityType (default Standard)
- osDiskSizeGB (default 128)
- sshPublicKeys
- allocateNodePublicIP (default false)
- maxPodsPerNode (default 110)
- tags (map)
- (deferred: zones, identity/UAMI, PPG, capacity reservation, spot)
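To make the field list concrete, here is a rough sketch of how the spec could look as a Go type with defaulting and validation. The type and method names, the JSON tags, and the "exactly one of imageReference or imageID" rule are my assumptions for illustration, not settled API.

```go
package main

import (
	"errors"
	"fmt"
)

// ImageReference names a marketplace image. Field names are assumed from the
// "publisher/offer/sku/version" tuple in the proposal.
type ImageReference struct {
	Publisher string `json:"publisher"`
	Offer     string `json:"offer"`
	SKU       string `json:"sku"`
	Version   string `json:"version"`
}

// AzureFlexNodeClassSpec is a hypothetical shape for the proposed CRD spec.
type AzureFlexNodeClassSpec struct {
	SubscriptionID       string            `json:"subscriptionID"`
	Location             string            `json:"location"`
	ResourceGroup        string            `json:"resourceGroup"`
	SubnetID             string            `json:"subnetID"`
	ImageReference       *ImageReference   `json:"imageReference,omitempty"`
	ImageID              string            `json:"imageID,omitempty"`
	SecurityType         string            `json:"securityType,omitempty"`
	OSDiskSizeGB         int32             `json:"osDiskSizeGB,omitempty"`
	SSHPublicKeys        []string          `json:"sshPublicKeys,omitempty"`
	AllocateNodePublicIP bool              `json:"allocateNodePublicIP,omitempty"`
	MaxPodsPerNode       int32             `json:"maxPodsPerNode,omitempty"`
	Tags                 map[string]string `json:"tags,omitempty"`
}

// SetDefaults fills in the defaults listed in the proposal.
func (s *AzureFlexNodeClassSpec) SetDefaults() {
	if s.SecurityType == "" {
		s.SecurityType = "Standard"
	}
	if s.OSDiskSizeGB == 0 {
		s.OSDiskSizeGB = 128
	}
	if s.MaxPodsPerNode == 0 {
		s.MaxPodsPerNode = 110
	}
}

// Validate checks the required fields and returns the first problem found.
func (s *AzureFlexNodeClassSpec) Validate() error {
	required := []struct{ name, value string }{
		{"subscriptionID", s.SubscriptionID},
		{"location", s.Location},
		{"resourceGroup", s.ResourceGroup},
		{"subnetID", s.SubnetID},
	}
	for _, f := range required {
		if f.value == "" {
			return fmt.Errorf("%s is required", f.name)
		}
	}
	// Assumption: the two image fields are mutually exclusive.
	if (s.ImageReference == nil) == (s.ImageID == "") {
		return errors.New("exactly one of imageReference or imageID must be set")
	}
	return nil
}

func main() {
	spec := AzureFlexNodeClassSpec{
		SubscriptionID: "<sub>",
		Location:       "eastus2",
		ResourceGroup:  "<rg>",
		SubnetID:       "<subnet ARM ID>",
		ImageID:        "<gallery image ID>",
	}
	spec.SetDefaults()
	fmt.Println(spec.SecurityType, spec.OSDiskSizeGB, spec.MaxPodsPerNode, spec.Validate())
}
```

In the real CRD the defaults and required markers would more likely be kubebuilder annotations than hand-written methods; the methods here just keep the sketch runnable on its own.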
ProviderID: azure-flex:///subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.Compute/virtualMachines/<name> (full canonical ARM ID).
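A minimal sketch of the providerID round-trip, assuming a small VMRef helper type (hypothetical, not existing code). Segment names are matched case-insensitively because ARM resource IDs are case-insensitive.

```go
package main

import (
	"fmt"
	"strings"
)

const providerIDPrefix = "azure-flex://"

// VMRef is a hypothetical helper holding the parts of the ARM VM ID.
type VMRef struct {
	SubscriptionID string
	ResourceGroup  string
	Name           string
}

// ProviderID renders the ref as azure-flex:// followed by the canonical ARM ID.
func (r VMRef) ProviderID() string {
	return fmt.Sprintf("%s/subscriptions/%s/resourceGroups/%s/providers/Microsoft.Compute/virtualMachines/%s",
		providerIDPrefix, r.SubscriptionID, r.ResourceGroup, r.Name)
}

// ParseProviderID is the inverse of VMRef.ProviderID.
func ParseProviderID(id string) (VMRef, error) {
	if !strings.HasPrefix(id, providerIDPrefix) {
		return VMRef{}, fmt.Errorf("providerID %q does not start with %q", id, providerIDPrefix)
	}
	// After the prefix the ID starts with "/", so parts[0] is empty.
	parts := strings.Split(strings.TrimPrefix(id, providerIDPrefix), "/")
	wantSegments := map[int]string{
		1: "subscriptions",
		3: "resourceGroups",
		5: "providers",
		6: "Microsoft.Compute",
		7: "virtualMachines",
	}
	if len(parts) != 9 || parts[0] != "" {
		return VMRef{}, fmt.Errorf("providerID %q is not a VM ARM ID", id)
	}
	for i, want := range wantSegments {
		if !strings.EqualFold(parts[i], want) {
			return VMRef{}, fmt.Errorf("providerID %q: expected segment %q, got %q", id, want, parts[i])
		}
	}
	if parts[2] == "" || parts[4] == "" || parts[8] == "" {
		return VMRef{}, fmt.Errorf("providerID %q has an empty subscription, resource group, or VM name", id)
	}
	return VMRef{SubscriptionID: parts[2], ResourceGroup: parts[4], Name: parts[8]}, nil
}

func main() {
	ref := VMRef{SubscriptionID: "<sub>", ResourceGroup: "<rg>", Name: "<name>"}
	id := ref.ProviderID()
	parsed, err := ParseProviderID(id)
	fmt.Println(id, parsed == ref, err)
}
```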
VM lifecycle:
- Deterministic VM name from NodeClaim.Name
- DeleteOption=Delete on NIC + OS disk so VM delete cascades
- Idempotent on retries (handle 404 as success)
Userdata: reuse existing plugin/pkg/util/kubeadm/azure.go:FromAKS() + plugin/pkg/services/agentpools/userdata/flex/. Cache the FromAKS result per controller process to avoid hammering the bootstrap secret on every Create.
Drift: hash-based, mirroring nebius pattern. Triggers: image, subnet, location, RG/sub change.
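A sketch of what the drift hash could cover. This is not the nebius provider's actual hashing code, just an illustration of the idea: hash only the fields listed as triggers, so changes to anything else (tags, for example) do not mark nodes as drifted. DriftInputs and DriftHash are hypothetical names.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// DriftInputs holds only the NodeClass fields that should trigger drift.
type DriftInputs struct {
	SubscriptionID string
	ResourceGroup  string
	Location       string
	SubnetID       string
	Image          string // imageID, or a rendered imageReference
}

// DriftHash returns a short, deterministic hash of the drift-relevant fields.
// Each field is length-prefixed so adjacent values cannot be confused
// ("ab"+"c" hashes differently from "a"+"bc").
func DriftHash(in DriftInputs) string {
	h := sha256.New()
	for _, f := range []string{in.SubscriptionID, in.ResourceGroup, in.Location, in.SubnetID, in.Image} {
		fmt.Fprintf(h, "%d:%s|", len(f), f)
	}
	return hex.EncodeToString(h.Sum(nil)[:8])
}

func main() {
	current := DriftInputs{
		SubscriptionID: "<sub>",
		ResourceGroup:  "<rg>",
		Location:       "eastus2",
		SubnetID:       "<subnet ARM ID>",
		Image:          "<image>",
	}
	stored := DriftHash(current) // would be recorded on the NodeClaim at create time

	current.SubnetID = "<another subnet ARM ID>"
	fmt.Println("drifted:", DriftHash(current) != stored)
}
```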
SKU catalog: hardcoded allowlist of 3-5 GPU SKUs we currently care about (ND96isr_H200_v5, ND96amsr_A100_v4, NCadsH100v5) with explicit availableLocations. Keep the catalog behind a small Go interface so the Azure SKUs API can be plugged in later.
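Roughly what I have in mind for the catalog boundary. SKUCatalog, staticCatalog, and phase1Catalog are hypothetical names. Only the H200 entry is filled in, since eastus2 is the one SKU/region pairing confirmed in this issue; the other allowlisted SKUs would be added with their real locations.

```go
package main

import (
	"fmt"
	"strings"
)

// SKUCatalog is the boundary the provider depends on. A later implementation
// backed by the Azure SKUs API could satisfy the same interface.
type SKUCatalog interface {
	// Check returns nil if sku may be provisioned in location.
	Check(sku, location string) error
}

// staticCatalog maps an allowlisted SKU name to its availableLocations.
type staticCatalog map[string][]string

func (c staticCatalog) Check(sku, location string) error {
	locations, ok := c[sku]
	if !ok {
		return fmt.Errorf("SKU %q is not in the allowlist", sku)
	}
	for _, l := range locations {
		if strings.EqualFold(l, location) {
			return nil
		}
	}
	return fmt.Errorf("SKU %q is not available in %q (allowed: %s)",
		sku, location, strings.Join(locations, ", "))
}

// phase1Catalog is the hardcoded Phase 1 allowlist (illustrative, incomplete).
var phase1Catalog SKUCatalog = staticCatalog{
	"Standard_ND96isr_H200_v5": {"eastus2"},
}

func main() {
	fmt.Println(phase1Catalog.Check("Standard_ND96isr_H200_v5", "eastus2"))
	fmt.Println(phase1Catalog.Check("Standard_ND96isr_H200_v5", "westeurope"))
	fmt.Println(phase1Catalog.Check("Standard_D2s_v5", "eastus2"))
}
```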
Explicit Non-Goals (Phase 1)
- Network bring-up (VNet, peering, NSG) — operator-managed, NodeClass references existing subnet
- Multi-region per NodeClass (use multiple NodePools instead)
- Spot pricing / capacity-type selection
- Quota preflight (let ARM fail, classify as InsufficientCapacity)
- Full Azure SKU catalog (allowlist only)
- Cross-subscription identity management (assume controller MI has rights in target sub)
Open Questions for Maintainers
- API group: karpenter.flex.aks.azure.com to match the project? Or stay under flex.aks.azure.com like nebius does (nebiusnodeclasses.flex.aks.azure.com)?
- Reuse vs reimplement: should the provider take a Go-module dependency on Azure/karpenter-provider-azure for VM lifecycle helpers, or stay self-contained like nebius does for Nebius?
- Identity model: Phase 1 assumes the controller's MI/SP has Contributor on the target subscription/RG/subnet. Is it worth surfacing this as a NodeClass field now (escape hatch), or should it stay a deployment-level concern?
- SKU catalog: hardcoded allowlist OK for v1, or block on Azure SKUs API integration?
- aks-flex-node version: plugin/pkg/services/agentpools/userdata/flex/flex.go pins v0.0.17. Should this PR also bump to v0.0.18 (the version proven to work for our H200 case), or should that be a separate PR?
Validation Plan
Real-world e2e target: voice-agent-flex (AKS westeurope) provisioning into voice-agent-flex-h200-rg (eastus2). 2 working H200 nodes already there to compare against.
Unit tests will cover: providerID round-trip, idempotent delete, partial-failure cleanup, NodeClass deletion blocked by live NodeClaims, bad-subnet validation, unsupported SKU/region rejection, drift hash determinism.
Happy to scope this down further if Phase 1 is too big a chunk. Filing as a draft for direction-setting before opening a PR.