Add Azure cross-region cloudprovider (modeled on nebius)

## Summary

Add a Karpenter cloudprovider that provisions external nodes in **other Azure regions**, modeled on the existing `karpenter/pkg/cloudproviders/nebius/` pattern. Goal: let an AKS cluster (e.g. westeurope) auto-provision capacity for SKUs only available in another region (e.g. `Standard_ND96isr_H200_v5` in eastus2) via a standard Karpenter NodePool.

## Motivation

The merged PR #61 and the open PR #62 unblock manual cross-region node join. We have 2 H200 nodes in eastus2 joined to an AKS control plane in westeurope, proven working end-to-end with `aks-flex-node v0.0.18`.

What's missing is **lifecycle automation**: today operators must run `gen_userdata.py` + `az vm create` for every node and clean up by hand. A Karpenter cloudprovider closes the loop so a researcher's `pending` Pod with `nodeSelector: gpu=h200` triggers VM creation in eastus2 automatically.

The existing upstream `karpenter-provider-azure` provisions only into the AKS cluster's own region (VMSS-based). Cross-region requires a separate provider.

## Proposed Phase 1 Scope

**One region per NodeClass, BYO network, on-demand only, static SKU allowlist.**

Mirror nebius:

```
karpenter/pkg/apis/v1alpha1/azureflex.go         # AzureFlexNodeClass CRD
karpenter/pkg/cloudproviders/azure/              # CloudProvider impl
karpenter/pkg/controllers/azure/                 # NodeClass status + termination
karpenter/examples/azure/                        # NodePool + NodeClass YAML
```

`AzureFlexNodeClassSpec`:
- `subscriptionID` (required)
- `location` (required, e.g. "eastus2")
- `resourceGroup` (required)
- `subnetID` (required, full ARM resource ID — assumes operator pre-provisioned VNet/peering/NSG)
- `imageReference` (publisher/offer/sku/version) **or** `imageID` (SIG/community gallery)
- `securityType` (default `Standard`)
- `osDiskSizeGB` (default 128)
- `sshPublicKeys`
- `allocateNodePublicIP` (default false)
- `maxPodsPerNode` (default 110)
- `tags` (map)
- *(deferred: zones, identity/UAMI, PPG, capacity reservation, spot)*
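
To make the field list concrete, here is a rough sketch of how the spec could look as a Go type with defaulting and basic validation. The Go field names, JSON tags, and the `SetDefaults`/`Validate` helpers are illustrative assumptions, not an existing API in this repo; a real CRD would likely express defaults and required fields through kubebuilder markers instead.

```go
package main

import (
	"errors"
	"fmt"
)

// ImageReference identifies a marketplace image (hypothetical type name).
type ImageReference struct {
	Publisher string `json:"publisher"`
	Offer     string `json:"offer"`
	SKU       string `json:"sku"`
	Version   string `json:"version"`
}

// AzureFlexNodeClassSpec mirrors the field list above. Names and tags are assumptions.
type AzureFlexNodeClassSpec struct {
	SubscriptionID       string            `json:"subscriptionID"`
	Location             string            `json:"location"`
	ResourceGroup        string            `json:"resourceGroup"`
	SubnetID             string            `json:"subnetID"`
	ImageReference       *ImageReference   `json:"imageReference,omitempty"`
	ImageID              string            `json:"imageID,omitempty"`
	SecurityType         string            `json:"securityType,omitempty"`
	OSDiskSizeGB         int32             `json:"osDiskSizeGB,omitempty"`
	SSHPublicKeys        []string          `json:"sshPublicKeys,omitempty"`
	AllocateNodePublicIP bool              `json:"allocateNodePublicIP,omitempty"`
	MaxPodsPerNode       int32             `json:"maxPodsPerNode,omitempty"`
	Tags                 map[string]string `json:"tags,omitempty"`
}

// SetDefaults fills in the defaults listed in the proposal for unset fields.
func (s *AzureFlexNodeClassSpec) SetDefaults() {
	if s.SecurityType == "" {
		s.SecurityType = "Standard"
	}
	if s.OSDiskSizeGB == 0 {
		s.OSDiskSizeGB = 128
	}
	if s.MaxPodsPerNode == 0 {
		s.MaxPodsPerNode = 110
	}
}

// Validate checks the required fields and the imageReference/imageID exclusivity.
func (s *AzureFlexNodeClassSpec) Validate() error {
	switch {
	case s.SubscriptionID == "":
		return errors.New("subscriptionID is required")
	case s.Location == "":
		return errors.New("location is required")
	case s.ResourceGroup == "":
		return errors.New("resourceGroup is required")
	case s.SubnetID == "":
		return errors.New("subnetID is required")
	case (s.ImageReference == nil) == (s.ImageID == ""):
		return errors.New("exactly one of imageReference or imageID must be set")
	}
	return nil
}

func main() {
	// All values below are placeholders, not real resources.
	spec := AzureFlexNodeClassSpec{
		SubscriptionID: "00000000-0000-0000-0000-000000000000",
		Location:       "eastus2",
		ResourceGroup:  "example-rg",
		SubnetID:       "/placeholder/subnet/id",
		ImageID:        "/placeholder/image/id",
	}
	spec.SetDefaults()
	fmt.Println(spec.Validate(), spec.SecurityType, spec.OSDiskSizeGB, spec.MaxPodsPerNode)
}
```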

ProviderID: `azure-flex:///subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.Compute/virtualMachines/<name>` (full canonical ARM ID).
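
A minimal sketch of the providerID round-trip, assuming the format above. `VMRef`, `ProviderID`, and `ParseProviderID` are hypothetical names for illustration; segment names are compared case-insensitively on the assumption that ARM resource IDs are not case-sensitive.

```go
package main

import (
	"fmt"
	"strings"
)

// providerIDScheme is the scheme from the proposal; the ARM ID follows it directly.
const providerIDScheme = "azure-flex://"

// VMRef holds the parts of the ARM ID the provider needs (hypothetical type).
type VMRef struct {
	SubscriptionID string
	ResourceGroup  string
	Name           string
}

// ProviderID renders the canonical providerID for a VM.
func (r VMRef) ProviderID() string {
	return fmt.Sprintf(
		"%s/subscriptions/%s/resourceGroups/%s/providers/Microsoft.Compute/virtualMachines/%s",
		providerIDScheme, r.SubscriptionID, r.ResourceGroup, r.Name)
}

// ParseProviderID is the inverse of ProviderID and rejects anything malformed.
func ParseProviderID(id string) (VMRef, error) {
	prefix := providerIDScheme + "/"
	if !strings.HasPrefix(id, prefix) {
		return VMRef{}, fmt.Errorf("providerID %q does not start with %q", id, prefix)
	}
	parts := strings.Split(strings.TrimPrefix(id, prefix), "/")
	// Empty entries mark the variable segments (subscription, resource group, VM name).
	want := []string{"subscriptions", "", "resourceGroups", "", "providers", "Microsoft.Compute", "virtualMachines", ""}
	if len(parts) != len(want) {
		return VMRef{}, fmt.Errorf("providerID %q has %d segments, want %d", id, len(parts), len(want))
	}
	for i, w := range want {
		if w == "" {
			if parts[i] == "" {
				return VMRef{}, fmt.Errorf("providerID %q has an empty segment at position %d", id, i)
			}
			continue
		}
		if !strings.EqualFold(parts[i], w) {
			return VMRef{}, fmt.Errorf("providerID %q: unexpected segment %q, want %q", id, parts[i], w)
		}
	}
	return VMRef{SubscriptionID: parts[1], ResourceGroup: parts[3], Name: parts[7]}, nil
}

func main() {
	ref := VMRef{SubscriptionID: "sub", ResourceGroup: "rg", Name: "nodeclaim-abc"}
	parsed, err := ParseProviderID(ref.ProviderID())
	fmt.Println(ref.ProviderID(), parsed == ref, err)
}
```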

VM lifecycle:
- Deterministic VM name from `NodeClaim.Name`
- `DeleteOption=Delete` on NIC + OS disk so VM delete cascades
- Idempotent on retries (handle 404 as success)
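
The naming and delete behaviour above could look roughly like this. Everything here is a sketch: `VMDeleter`, `StatusError`, and `fakeVMs` are stand-ins for the real Azure client and its error type, and the 64-character name limit is an assumption to verify against Azure's naming rules.

```go
package main

import (
	"context"
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
	"net/http"
)

// StatusError stands in for the SDK's HTTP error type (hypothetical).
type StatusError struct{ StatusCode int }

func (e *StatusError) Error() string { return fmt.Sprintf("ARM request failed: HTTP %d", e.StatusCode) }

// VMDeleter is the slice of the VM client this sketch needs (hypothetical interface).
type VMDeleter interface {
	DeleteVM(ctx context.Context, resourceGroup, name string) error
}

// VMNameFor derives the VM name from NodeClaim.Name. Same input, same output, so a
// retried Create targets the same VM. Over-long names are truncated and given a
// short hash suffix to keep them unique.
func VMNameFor(nodeClaimName string) string {
	const maxLen = 64 // assumed Azure limit for Linux VM names
	if len(nodeClaimName) <= maxLen {
		return nodeClaimName
	}
	sum := sha256.Sum256([]byte(nodeClaimName))
	suffix := hex.EncodeToString(sum[:4]) // 8 hex characters
	return nodeClaimName[:maxLen-len(suffix)-1] + "-" + suffix
}

// DeleteVMIdempotent treats "already gone" (404) as success so retries converge.
func DeleteVMIdempotent(ctx context.Context, c VMDeleter, resourceGroup, name string) error {
	err := c.DeleteVM(ctx, resourceGroup, name)
	var se *StatusError
	if errors.As(err, &se) && se.StatusCode == http.StatusNotFound {
		return nil
	}
	return err
}

// fakeVMs is an in-memory VMDeleter used only for the demo in main.
type fakeVMs struct{ existing map[string]bool }

func (f *fakeVMs) DeleteVM(_ context.Context, _, name string) error {
	if !f.existing[name] {
		return &StatusError{StatusCode: http.StatusNotFound}
	}
	delete(f.existing, name)
	return nil
}

func main() {
	vms := &fakeVMs{existing: map[string]bool{"nodeclaim-abc": true}}
	name := VMNameFor("nodeclaim-abc")
	// The second call hits the 404 path and still reports success.
	fmt.Println(DeleteVMIdempotent(context.Background(), vms, "example-rg", name))
	fmt.Println(DeleteVMIdempotent(context.Background(), vms, "example-rg", name))
}
```

With `DeleteOption=Delete` set on the NIC and OS disk at create time, this single VM delete is the only call the termination path needs to make.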

Userdata: reuse existing `plugin/pkg/util/kubeadm/azure.go:FromAKS()` + `plugin/pkg/services/agentpools/userdata/flex/`. Cache the FromAKS result per controller process to avoid hammering the bootstrap secret on every Create.
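
One possible shape for the per-process cache is sketched below. `BootstrapCache` is a hypothetical helper and the loader closure stands in for the call to `FromAKS()`, whose real signature is not assumed here. The TTL is an addition on my side, on the assumption that the bootstrap data should be refreshed occasionally; a real loader would probably also take a `context.Context`, omitted to keep the sketch small.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// BootstrapCache memoizes one loader result for a limited time (hypothetical helper).
type BootstrapCache[T any] struct {
	mu      sync.Mutex
	load    func() (T, error)
	ttl     time.Duration
	value   T
	expires time.Time
	valid   bool
}

// NewBootstrapCache wraps load so that repeated Get calls within ttl reuse the result.
func NewBootstrapCache[T any](load func() (T, error), ttl time.Duration) *BootstrapCache[T] {
	return &BootstrapCache[T]{load: load, ttl: ttl}
}

// Get returns the cached value, calling the loader only when the cache is empty or
// expired. Failed loads are not cached, so the next Create retries the lookup.
func (c *BootstrapCache[T]) Get() (T, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.valid && time.Now().Before(c.expires) {
		return c.value, nil
	}
	v, err := c.load()
	if err != nil {
		var zero T
		return zero, err
	}
	c.value, c.valid, c.expires = v, true, time.Now().Add(c.ttl)
	return v, nil
}

func main() {
	calls := 0
	// The closure is a placeholder for the real FromAKS lookup.
	cache := NewBootstrapCache(func() (string, error) {
		calls++
		return "bootstrap-config", nil
	}, 10*time.Minute)
	for i := 0; i < 3; i++ {
		if _, err := cache.Get(); err != nil {
			fmt.Println("load failed:", err)
			return
		}
	}
	fmt.Println("loader calls:", calls)
}
```

Holding the mutex across the load also serializes concurrent Creates, so a burst of NodeClaims results in a single bootstrap lookup.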

Drift: hash-based, mirroring the nebius pattern. Triggers: a change to the image, subnet, location, resource group, or subscription.
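
As an illustration of the drift hash (not the nebius implementation, which I have not copied here), the sketch below hashes only the fields listed as triggers. `DriftInputs` and `DriftHash` are hypothetical names; the idea is that the hash is stored on the NodeClaim at create time and compared against the current NodeClass later.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// DriftInputs holds only the fields whose change should mark a node as drifted.
// Fields such as tags are deliberately left out.
type DriftInputs struct {
	SubscriptionID string `json:"subscriptionID"`
	ResourceGroup  string `json:"resourceGroup"`
	Location       string `json:"location"`
	SubnetID       string `json:"subnetID"`
	Image          string `json:"image"`
}

// DriftHash returns a stable hex digest of the inputs. encoding/json emits struct
// fields in declaration order, so equal inputs always produce the same hash.
func DriftHash(in DriftInputs) (string, error) {
	b, err := json.Marshal(in)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:]), nil
}

func main() {
	// Placeholder values only.
	before := DriftInputs{SubscriptionID: "sub", ResourceGroup: "rg", Location: "eastus2", SubnetID: "subnet-a", Image: "image-v1"}
	after := before
	after.Image = "image-v2"

	h1, _ := DriftHash(before)
	h2, _ := DriftHash(after)
	fmt.Println("drifted:", h1 != h2)
}
```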

SKU catalog: hardcoded allowlist of 3-5 GPU SKUs we currently care about (ND96isr_H200_v5, ND96amsr_A100_v4, NCadsH100v5) with explicit `availableLocations`. The catalog sits behind a small Go interface so the Azure SKUs API can be plugged in later.
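
A rough sketch of that catalog boundary, with a static allowlist as the Phase 1 implementation. `SKUCatalog`, `StaticCatalog`, and `ErrUnsupportedSKU` are hypothetical names. Only the H200/eastus2 pairing mentioned in the summary is filled in; the other entries and their regions would need to be added from real data.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// SKU is one allowlisted VM size and the regions it may be provisioned in.
type SKU struct {
	Name               string
	AvailableLocations []string
}

// ErrUnsupportedSKU lets callers reject a NodeClaim before calling ARM.
var ErrUnsupportedSKU = errors.New("unsupported SKU or location")

// SKUCatalog is the seam where an Azure SKUs API client could later replace the allowlist.
type SKUCatalog interface {
	Lookup(name, location string) (SKU, error)
}

// StaticCatalog is the hardcoded allowlist, keyed by SKU name.
type StaticCatalog map[string]SKU

// Compile-time check that StaticCatalog satisfies the interface.
var _ SKUCatalog = StaticCatalog{}

// Lookup returns the SKU only if it is allowlisted for the requested location.
func (c StaticCatalog) Lookup(name, location string) (SKU, error) {
	sku, ok := c[name]
	if !ok {
		return SKU{}, fmt.Errorf("%w: %s is not in the allowlist", ErrUnsupportedSKU, name)
	}
	for _, l := range sku.AvailableLocations {
		if strings.EqualFold(l, location) {
			return sku, nil
		}
	}
	return SKU{}, fmt.Errorf("%w: %s is not allowlisted for %s", ErrUnsupportedSKU, name, location)
}

func main() {
	catalog := StaticCatalog{
		"Standard_ND96isr_H200_v5": {
			Name:               "Standard_ND96isr_H200_v5",
			AvailableLocations: []string{"eastus2"},
		},
	}
	_, err := catalog.Lookup("Standard_ND96isr_H200_v5", "eastus2")
	fmt.Println("eastus2:", err)
	_, err = catalog.Lookup("Standard_ND96isr_H200_v5", "westeurope")
	fmt.Println("westeurope:", err)
}
```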

## Explicit Non-Goals (Phase 1)

- Network bring-up (VNet, peering, NSG) — operator-managed, NodeClass references existing subnet
- Multi-region per NodeClass (use multiple NodePools instead)
- Spot pricing / capacity-type selection
- Quota preflight (let ARM fail, classify as `InsufficientCapacity`)
- Full Azure SKU catalog (allowlist only)
- Cross-subscription identity management (assume controller MI has rights in target sub)

## Open Questions for Maintainers

1. **API group**: `karpenter.flex.aks.azure.com` to match the project? Or stay under `flex.aks.azure.com` like nebius does (`nebiusnodeclasses.flex.aks.azure.com`)?
2. **Reuse vs reimplement**: should the provider take a Go-module dependency on `Azure/karpenter-provider-azure` for VM lifecycle helpers, or stay self-contained like nebius does for Nebius?
3. **Identity model**: Phase 1 assumes the controller's MI/SP has `Contributor` on the target subscription/RG/subnet. Is it worth surfacing this as a NodeClass field now (escape hatch), or should it stay a deployment-level concern?
4. **SKU catalog**: hardcoded allowlist OK for v1, or block on Azure SKUs API integration?
5. **`aks-flex-node` version**: `plugin/pkg/services/agentpools/userdata/flex/flex.go` pins `v0.0.17`. Should the implementing PR also bump to `v0.0.18` (the version proven to work for our H200 case), or should that be a separate PR?

## Validation Plan

Real-world e2e target: `voice-agent-flex` (AKS westeurope) provisioning into `voice-agent-flex-h200-rg` (eastus2). 2 working H200 nodes already there to compare against.

Unit tests will cover: providerID round-trip, idempotent delete, partial-failure cleanup, NodeClass deletion blocked by live NodeClaims, bad-subnet validation, unsupported SKU/region rejection, drift hash determinism.

Happy to scope this down further if Phase 1 is too big a chunk. Filing as a draft for direction-setting before opening a PR.

