feat(recipes): add concrete B200 service-bound overlays #1004

## Summary

The B200 accelerator has only the wildcard stub `b200-any-training.yaml` (added by #436) and **no concrete service-bound overlays**. A user running `aicr recipe --accelerator b200 --service <any>` resolves only the wildcard NCCL threshold, with no service-specific GPU operator config, no OS overlay, and no platform variant.

## Motivation / Context

Surfaced as an explicit "out of scope (track separately)" item in #969: the validation phase coverage audit listed `b200-any-training` as a stub missing the deployment (D) and conformance (C) phases, but the per-leaf fix pattern doesn't apply to wildcard overlays. The fix is concrete service-bound overlays at the non-wildcard level.

B200 (Blackwell datacenter SKU, not the Grace-Blackwell GB200 superchip) is rolling out across hyperscalers; the closer to GA that AICR has overlay coverage, the lower the barrier for early adopters.

## B200 cloud SKU availability

| Cloud | SKU |
|---|---|
| AWS EKS | `p6-b200.48xlarge` (8x B200), `p6e-gb200.36xlarge` (GB200, different criteria) |
| GCP GKE | `a4` machine series (B200, rolling out 2026) |
| Azure AKS | `Standard_ND96isr_B200_v6` (NDv6 B200 series, rolling out) |
| OCI OKE | B200 bare-metal shapes (`BM.GPU.B200.8`, rolling out) |
| CoreWeave / Lambda Labs / etc. | B200 nodes available |

Land overlays in order of testbed availability, likely EKS first given the existing `p6-b200` GA status.

## Suggested scope

Mirror the H100 / GB200 pattern. **First PR (minimum):**

- `b200-eks-training.yaml` and `b200-eks-inference.yaml`
- `b200-eks-ubuntu-training.yaml`, `b200-eks-ubuntu-inference.yaml`
- One platform variant each (`b200-eks-ubuntu-training-kubeflow.yaml`, `b200-eks-ubuntu-inference-dynamo.yaml`)
- Per-accelerator constraint: `Deployment.gpu-operator.version` floor — Blackwell B200 support stabilized in gpu-operator v25.10 (same baseline as GB200); recommend `>= v25.10.0`
- NCCL bandwidth threshold for training: already in `b200-any-training.yaml` (`>= 350 GB/s`); validate that threshold against the testbed and tighten it if the measured bandwidth allows
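
The `>= v25.10.0` floor is easy to get wrong if versions are compared as strings (`"v25.3.4"` sorts after `"v25.10.0"` lexically). Below is a minimal Python sketch of a numeric comparison that could be used when sanity-checking chart pins by hand. The helper names `parse_version` and `meets_floor` are hypothetical and not part of AICR; how AICR itself evaluates the `Deployment.gpu-operator.version` constraint is not shown in this issue.

```python
import re

# Floor recommended in this issue for Blackwell B200 support.
GPU_OPERATOR_FLOOR = "v25.10.0"


def parse_version(version):
    """Parse 'v25.10.0' or '25.10.0' into a tuple of ints, e.g. (25, 10, 0).

    Hypothetical helper: only plain MAJOR.MINOR.PATCH strings are accepted;
    pre-release suffixes are rejected rather than guessed at.
    """
    match = re.fullmatch(r"v?(\d+)\.(\d+)\.(\d+)", version.strip())
    if match is None:
        raise ValueError(f"unsupported version string: {version!r}")
    return tuple(int(part) for part in match.groups())


def meets_floor(version, floor=GPU_OPERATOR_FLOOR):
    """Return True when `version` is at or above `floor`, compared numerically."""
    return parse_version(version) >= parse_version(floor)
```

Comparing integer tuples rather than raw strings keeps `25.3.x` correctly below `25.10.x`.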

**PR 2+:** Same patterns for GKE, AKS, OKE, in order of testbed availability.

Each PR should:
- Add the overlays
- Reuse GB200's `kernel-module-params.yaml` preManifestFile if EFA dma-buf attach is required (B200 PCI topology likely needs the same `NVreg_GrdmaPciTopoCheckOverride=1` workaround as GB200; verify on testbed)
- Pass `TestOverlayValidationPhaseFloor` (deployment + conformance inherited from service-root via PR #1001)
- Regenerate the BOM (`make bom-docs`) if any chart pin differs
- Decide whether `b200-any-training.yaml` should be retired once concrete overlays exist, or kept as a cross-cutting threshold contributor (mirror GB200's choice — `gb200-any-training.yaml` is still present)
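
As a checklist aid for each per-service PR, the naming pattern from the first-PR scope can be sketched in Python as follows. This is an illustration only: `expected_overlays` and `missing_overlays` are hypothetical helpers, the `ubuntu` OS and the `kubeflow` / `dynamo` variants are taken from the EKS list above, and whether other services use the same OS and variants is an assumption to confirm per testbed.

```python
from pathlib import Path


def expected_overlays(service, os_name="ubuntu", variants=None, accelerator="b200"):
    """List the overlay filenames the first-PR pattern implies for one service.

    Defaults mirror the EKS scope in this issue; other values are assumptions.
    """
    if variants is None:
        variants = {"training": ["kubeflow"], "inference": ["dynamo"]}
    names = []
    for intent in ("training", "inference"):
        # Service-level overlay, e.g. b200-eks-training.yaml
        names.append(f"{accelerator}-{service}-{intent}.yaml")
        # OS-level overlay, e.g. b200-eks-ubuntu-training.yaml
        names.append(f"{accelerator}-{service}-{os_name}-{intent}.yaml")
        # Platform variants, e.g. b200-eks-ubuntu-training-kubeflow.yaml
        for variant in variants.get(intent, []):
            names.append(f"{accelerator}-{service}-{os_name}-{intent}-{variant}.yaml")
    return names


def missing_overlays(overlay_dir, service):
    """Return expected overlay filenames not yet present in `overlay_dir`."""
    root = Path(overlay_dir)
    return [name for name in expected_overlays(service) if not (root / name).is_file()]
```

Running `missing_overlays("recipes/overlays", "eks")` from the repository root would list whichever of the six EKS files have not landed yet.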

## Out of scope (file separately)

- **GB200 (Grace-Blackwell superchip)**: distinct `gb200` enum value, distinct overlays already exist; not in scope here.
- **MIG profiles on B200**: defer until MIG support on Blackwell is GA in gpu-operator.

## Related

- #436 — Add B200 accelerator type support (CLOSED, added the wildcard stub)
- #969 — validation phase coverage audit (B200 wildcard called out as "stub" missing D + C)
- GB200 overlay set: `recipes/overlays/gb200-*.yaml` (closest reference pattern)
- H100 overlay set: `recipes/overlays/h100-*.yaml` (breadth target)
