Skip to content

feat(recipes): add concrete B200 service-bound overlays #1004

@yuanchen8911

Description

@yuanchen8911

Summary

The B200 accelerator has only the wildcard stub b200-any-training.yaml (added by #436) and no concrete service-bound overlays. A user running aicr recipe --accelerator b200 --service <any> resolves only the wildcard NCCL threshold, with no service-specific GPU operator config, no OS overlay, no platform variant.

Motivation / Context

Surfaced as an explicit "out of scope (track separately)" item in #969 — the validation phase coverage audit listed b200-any-training as a stub missing D + C, but the per-leaf fix pattern doesn't apply to wildcard overlays. The fix is concrete service-bound overlays at non-wildcard level.

B200 (Blackwell datacenter SKU, not the Grace-Blackwell GB200 superchip) is rolling out across hyperscalers; the closer that AICR has overlay coverage to GA, the lower the barrier for early adopters.

B200 cloud SKU availability

Cloud SKU
AWS EKS p6-b200.48xlarge (8x B200), p6e-gb200.48xlarge (GB200 — different criteria)
GCP GKE a4 machine series (B200, rolling out 2026)
Azure AKS Standard_ND96isr_B200_v6 (NDv6 B200 series, rolling out)
OCI OKE B200 bare-metal shapes (BM.GPU.B200.8, rolling out)
CoreWeave / Lambda Labs / etc. B200 nodes available

Land in order of testbed availability — likely EKS first given the existing p6-b200 GA status.

Suggested scope

Mirror the H100 / GB200 pattern. First PR (minimum):

  • b200-eks-training.yaml and b200-eks-inference.yaml
  • b200-eks-ubuntu-training.yaml, b200-eks-ubuntu-inference.yaml
  • One platform variant each (b200-eks-ubuntu-training-kubeflow.yaml, b200-eks-ubuntu-inference-dynamo.yaml)
  • Per-accelerator constraint: Deployment.gpu-operator.version floor — Blackwell B200 support stabilized in gpu-operator v25.10 (same baseline as GB200); recommend >= v25.10.0
  • NCCL bandwidth threshold for training — already in b200-any-training.yaml (>= 350 GB/s); validate that threshold against the testbed and tighten

PR 2+: Same patterns for GKE, AKS, OKE, in order of testbed availability.

Each PR should:

  • Add the overlays
  • Reuse GB200's kernel-module-params.yaml preManifestFile if EFA dma-buf attach is required (B200 PCI topology likely needs the same NVreg_GrdmaPciTopoCheckOverride=1 workaround as GB200; verify on testbed)
  • Pass TestOverlayValidationPhaseFloor (deployment + conformance inherited from service-root via PR feat(recipe): deliver deployment-phase floor at per-accelerator wildcards #1001)
  • Regenerate the BOM (make bom-docs) if any chart pin differs
  • Decide whether b200-any-training.yaml should be retired once concrete overlays exist, or kept as a cross-cutting threshold contributor (mirror GB200's choice — gb200-any-training.yaml is still present)

Out of scope (file separately)

  • GB200 (Grace-Blackwell superchip) — distinct gb200 enum value, distinct overlays already exist; not in scope here.
  • MIG profiles on B200 — defer until MIG support on Blackwell is GA in gpu-operator.

Related

Metadata

Metadata

Assignees

No one assigned
    No fields configured for Enhancement.

    Projects

    No projects

    Milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions