You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
The B200 accelerator has only the wildcard stub b200-any-training.yaml (added by #436) and no concrete service-bound overlays. A user running aicr recipe --accelerator b200 --service <any> resolves only the wildcard NCCL threshold, with no service-specific GPU operator config, no OS overlay, no platform variant.
Motivation / Context
Surfaced as an explicit "out of scope (track separately)" item in #969 — the validation phase coverage audit listed b200-any-training as a stub missing D + C, but the per-leaf fix pattern doesn't apply to wildcard overlays. The fix is concrete service-bound overlays at non-wildcard level.
B200 (Blackwell datacenter SKU, not the Grace-Blackwell GB200 superchip) is rolling out across hyperscalers; the closer that AICR has overlay coverage to GA, the lower the barrier for early adopters.
B200 cloud SKU availability
Cloud
SKU
AWS EKS
p6-b200.48xlarge (8x B200), p6e-gb200.48xlarge (GB200 — different criteria)
GCP GKE
a4 machine series (B200, rolling out 2026)
Azure AKS
Standard_ND96isr_B200_v6 (NDv6 B200 series, rolling out)
OCI OKE
B200 bare-metal shapes (BM.GPU.B200.8, rolling out)
CoreWeave / Lambda Labs / etc.
B200 nodes available
Land in order of testbed availability — likely EKS first given the existing p6-b200 GA status.
Suggested scope
Mirror the H100 / GB200 pattern. First PR (minimum):
b200-eks-training.yaml and b200-eks-inference.yaml
One platform variant each (b200-eks-ubuntu-training-kubeflow.yaml, b200-eks-ubuntu-inference-dynamo.yaml)
Per-accelerator constraint: Deployment.gpu-operator.version floor — Blackwell B200 support stabilized in gpu-operator v25.10 (same baseline as GB200); recommend >= v25.10.0
NCCL bandwidth threshold for training — already in b200-any-training.yaml (>= 350 GB/s); validate that threshold against the testbed and tighten
PR 2+: Same patterns for GKE, AKS, OKE, in order of testbed availability.
Each PR should:
Add the overlays
Reuse GB200's kernel-module-params.yaml preManifestFile if EFA dma-buf attach is required (B200 PCI topology likely needs the same NVreg_GrdmaPciTopoCheckOverride=1 workaround as GB200; verify on testbed)
Regenerate the BOM (make bom-docs) if any chart pin differs
Decide whether b200-any-training.yaml should be retired once concrete overlays exist, or kept as a cross-cutting threshold contributor (mirror GB200's choice — gb200-any-training.yaml is still present)
Out of scope (file separately)
GB200 (Grace-Blackwell superchip) — distinct gb200 enum value, distinct overlays already exist; not in scope here.
MIG profiles on B200 — defer until MIG support on Blackwell is GA in gpu-operator.
Summary
The B200 accelerator has only the wildcard stub
b200-any-training.yaml(added by #436) and no concrete service-bound overlays. A user runningaicr recipe --accelerator b200 --service <any>resolves only the wildcard NCCL threshold, with no service-specific GPU operator config, no OS overlay, no platform variant.Motivation / Context
Surfaced as an explicit "out of scope (track separately)" item in #969 — the validation phase coverage audit listed
b200-any-trainingas a stub missing D + C, but the per-leaf fix pattern doesn't apply to wildcard overlays. The fix is concrete service-bound overlays at non-wildcard level.B200 (Blackwell datacenter SKU, not the Grace-Blackwell GB200 superchip) is rolling out across hyperscalers; the closer that AICR has overlay coverage to GA, the lower the barrier for early adopters.
B200 cloud SKU availability
p6-b200.48xlarge(8x B200),p6e-gb200.48xlarge(GB200 — different criteria)a4machine series (B200, rolling out 2026)Standard_ND96isr_B200_v6(NDv6 B200 series, rolling out)BM.GPU.B200.8, rolling out)Land in order of testbed availability — likely EKS first given the existing
p6-b200GA status.Suggested scope
Mirror the H100 / GB200 pattern. First PR (minimum):
b200-eks-training.yamlandb200-eks-inference.yamlb200-eks-ubuntu-training.yaml,b200-eks-ubuntu-inference.yamlb200-eks-ubuntu-training-kubeflow.yaml,b200-eks-ubuntu-inference-dynamo.yaml)Deployment.gpu-operator.versionfloor — Blackwell B200 support stabilized in gpu-operator v25.10 (same baseline as GB200); recommend>= v25.10.0b200-any-training.yaml(>= 350 GB/s); validate that threshold against the testbed and tightenPR 2+: Same patterns for GKE, AKS, OKE, in order of testbed availability.
Each PR should:
kernel-module-params.yamlpreManifestFile if EFA dma-buf attach is required (B200 PCI topology likely needs the sameNVreg_GrdmaPciTopoCheckOverride=1workaround as GB200; verify on testbed)TestOverlayValidationPhaseFloor(deployment + conformance inherited from service-root via PR feat(recipe): deliver deployment-phase floor at per-accelerator wildcards #1001)make bom-docs) if any chart pin differsb200-any-training.yamlshould be retired once concrete overlays exist, or kept as a cross-cutting threshold contributor (mirror GB200's choice —gb200-any-training.yamlis still present)Out of scope (file separately)
gb200enum value, distinct overlays already exist; not in scope here.Related
recipes/overlays/gb200-*.yaml(closest reference pattern)recipes/overlays/h100-*.yaml(breadth target)