Summary
Add performance-phase constraints to the OCI / OKE GB200 overlays once an OCI testbed is available to produce empirically-grounded thresholds. This is the largest testbed-blocked cohort.
Affected overlays
| Overlay | Required performance constraint kind |
|---|---|
| gb200-oke-training | NCCL all-reduce-bw (nccl-all-reduce-bw-net + nccl-all-reduce-bw-nvls, mirroring gb200-eks-training) |
| gb200-oke-ubuntu-training | same |
| gb200-oke-ubuntu-training-kubeflow | same |
| gb200-oke-ubuntu-inference-dynamo | inference-perf (throughput + TTFT p99, mirroring h100-eks-ubuntu-inference-dynamo) |
These are all the OKE entries flagged by the strict-mode floor today (AICR_VALIDATION_FLOOR_STRICT=1). The deployment + conformance phases are inherited from oke.yaml via PR #1001; only the performance phase remains gapped.
Blocker
No OCI testbed available today for:
- NCCL bandwidth measurements on GB200 / OKE bare-metal shapes (BM.GPU.B200.8 or equivalent NVL72 IMEX domains)
- inference-perf runs on the same hardware
The GB200 EKS reference (recipes/overlays/gb200-eks-training.yaml:90-126) splits NCCL into -net (EFA) and -nvls (MNNVL across the NVL72 IMEX domain) channels. OCI's network stack is different, so the EKS thresholds (>= 40 GB/s NET, >= 500 GB/s NVLS) are not portable; OCI needs its own empirically grounded numbers.
Design notes
- NCCL training overlays should follow the GB200 / EKS multi-channel pattern (one constraint per transport) rather than a single nccl-all-reduce-bw constraint. Both transports exercise the actual interconnect, and a silent fallback to NET should not masquerade as a pass.
- Dynamo inference uses the same inference-perf check as h100-eks-ubuntu-inference-dynamo. Thresholds can start at the H100 placeholder floors (inference-throughput >= 5000, inference-ttft-p99 <= 200), provided GB200 is at least as fast, and then tighten once measured numbers exist.
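The fallback concern in the first note can be illustrated with a minimal Python sketch. It is not the project's validator: the `evaluate` helper, the `"<metric> <op> <number>"` constraint syntax, and the simulated measurements are all assumptions made for illustration, and the EKS thresholds quoted above are reused only as example numbers.

```python
import operator
import re

# Hypothetical constraint syntax "<metric> <op> <number>"; the real overlay
# schema may differ.
_OPS = {">=": operator.ge, "<=": operator.le}
_PATTERN = re.compile(r"^\s*([\w-]+)\s*(>=|<=)\s*([\d.]+)\s*$")


def evaluate(constraints, measured):
    """Return {metric: passed}. A metric with no measurement counts as a failure."""
    results = {}
    for text in constraints:
        match = _PATTERN.match(text)
        if match is None:
            raise ValueError(f"unparseable constraint: {text!r}")
        metric, op = match.group(1), match.group(2)
        threshold = float(match.group(3))
        value = measured.get(metric)
        results[metric] = value is not None and _OPS[op](value, threshold)
    return results


# Illustrative thresholds only (the EKS values); OCI numbers are not yet known.
single = ["nccl-all-reduce-bw >= 40"]
per_transport = [
    "nccl-all-reduce-bw-net >= 40",
    "nccl-all-reduce-bw-nvls >= 500",
]

# Simulated run (made-up values) in which the NVLS path silently fell back
# to NET, so every channel reports NET-level bandwidth in GB/s.
fallback_run = {
    "nccl-all-reduce-bw": 45.0,
    "nccl-all-reduce-bw-net": 45.0,
    "nccl-all-reduce-bw-nvls": 45.0,
}

single_ok = all(evaluate(single, fallback_run).values())
split_ok = all(evaluate(per_transport, fallback_run).values())
print("single constraint passes:", single_ok)
print("per-transport constraints pass:", split_ok)
```

In this simulated run the single constraint passes while the per-transport set fails on the NVLS channel, which is the behavior the multi-channel pattern is meant to guarantee.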
Done when
- OCI GB200 testbed produces baseline NCCL bandwidth numbers (NET + NVLS) and inference-perf numbers (throughput tok/s, TTFT p99 ms).
- All 4 overlays gain a performance.checks block with the appropriate constraint set.
- The overlays disappear from the AICR_VALIDATION_FLOOR_STRICT=1 floor test output.
Related