feat(recipe): register h200 as first-class accelerator type#1091
Conversation
|
🌿 Preview your docs: https://nvidia-preview-feat-h200-accelerator-1086.docs.buildwithfern.com/aicr |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughThis PR adds h200 as a supported GPU accelerator type across the AICR codebase. Changes include: updating the OpenAPI schema (api/aicr/v1/server.yaml) to include h200 in query parameter and schema enums; implementing h200 support in the Go criteria type system (pkg/recipe/criteria.go) with a new constant, parsing logic, and getter updates; extending unit tests to cover h200 parsing and adjust registry-based tests; updating fingerprint GPU SKU parsing and related tests; updating documentation, CLI help text, issue templates, and adding an h200 RecipeMetadata overlay. Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related issues
Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/recipe/doc.go`:
- Around line 29-30: The package docs are missing CriteriaAcceleratorH200 in the
"Accelerator types for GPU selection" list; add CriteriaAcceleratorH200 to that
bullet list in pkg/recipe/doc.go so the documented accelerator list (the
comments showing values like h100, h200, gb200, ...) matches the updated
field/query docs and includes h200 alongside the other CriteriaAccelerator*
entries.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 0ee64d10-17bd-4920-9e70-2facff9a017d
📒 Files selected for processing (20)
.github/ISSUE_TEMPLATE/bug_report.ymlapi/aicr/v1/server.yamldocs/README.mddocs/contributor/api-server-extending.mddocs/contributor/api-server.mddocs/contributor/cli.mddocs/contributor/data.mddocs/contributor/validations.mddocs/user/api-reference.mddocs/user/cli-reference.mdpkg/api/doc.gopkg/cli/recipe.gopkg/fingerprint/doc.gopkg/fingerprint/types.gopkg/recipe/criteria.gopkg/recipe/criteria_registry_parse_test.gopkg/recipe/criteria_registry_test.gopkg/recipe/criteria_test.gopkg/recipe/doc.gopkg/server/doc.go
73564c1 to
5777f12
Compare
|
Actionable comments posted: 0 |
5777f12 to
971fb99
Compare
971fb99 to
2e4d05e
Compare
|
Actionable comments posted: 0 |
2e4d05e to
d09d8ba
Compare
|
Actionable comments posted: 0 |
74ff416 to
10a43cf
Compare
10a43cf to
2b06bd0
Compare
|
Actionable comments posted: 0 |
mchmarny
left a comment
There was a problem hiding this comment.
Model PR for adding a new enum value. The audit follows the CLAUDE.md "new enum" rule across every surface (Go const + parse + Get list, OpenAPI spec ×5 blocks, every doc page that enumerates accelerators, issue template dropdown, godoc on six packages), h200-any.yaml is byte-parallel with h100-any.yaml so the Hopper deployment floor carries over with zero divergence risk, and the GH200/H200 substring disambiguation in gpu_sku.go is exactly the right defense with negative tests to lock it in. Swapping the OriginExternal extensibility test value from h200 → mi300x (clearly non-NVIDIA) preserves the original test intent.
Note: PR is currently blocked only on in-progress CI (tests/E2E, Tier 1 deploy matrix still running at review time) — substance is approved.
LGTM 🚢
| // first-class accelerator enum, so match it explicitly before the "H200" | ||
| // rule and leave it unresolved ("") rather than mislabeling it as h200 — | ||
| // same reason "GB200" precedes "B200" above. | ||
| {"GH200", ""}, |
There was a problem hiding this comment.
Nice catch on the GH200 substring trap — Grace Hopper would otherwise silently mislabel as h200. The comment explicitly anchoring it to the existing GB200→B200 precedent makes the ordering invariant clear to future contributors, and the negative test cases (NVIDIA GH200, NVIDIA GH200 480GB) lock it in.
Adds CriteriaAcceleratorH200 to the criteria registry so users can
pass `--accelerator h200` and have the recipe metadata reflect the
hardware. H200 is the same Hopper generation as H100 (same R570/R580
driver line, same gpu-operator support floor), so its deployment-phase
floor mirrors H100's.
Updates every surface that enumerates accelerator types per the
"Adding a new enum value" audit rule in .claude/CLAUDE.md:
- pkg/recipe/criteria.go: new const, parse case, AllCriteriaAcceleratorTypes
- pkg/recipe/criteria_test.go: parse + Get tests cover h200
- pkg/recipe/criteria_registry_{,parse_}test.go: swap h200 (now built-in)
for mi300x as the extensibility-test example value
- api/aicr/v1/server.yaml: all 5 enum blocks
- .github/ISSUE_TEMPLATE/bug_report.yml: GPU type dropdown entries
- docs/{README.md,user/cli-reference.md,user/api-reference.md,
contributor/api-server.md,contributor/cli.md,contributor/data.md,
contributor/validations.md,contributor/api-server-extending.md}
- pkg/{api,cli/recipe.go,fingerprint/{doc.go,types.go},recipe/doc.go,
server/doc.go}: godoc enumerations
Wires up the two consumer paths the enum alone left incomplete:
- pkg/fingerprint/gpu_sku.go: add the H200 ProductName pattern so the
snapshot -> fingerprint -> recipe path resolves real H200 hardware
("NVIDIA H200 NVL", "...141GB HBM3e") to h200 instead of unknown-sku
- recipes/overlays/h200-any.yaml: new criteria-wildcard overlay mirroring
h100-any.yaml (4 standard deployment checks + gpu-operator >= v24.6.0)
so an `--accelerator h200` recipe inherits the same deployment-phase
floor as H100 rather than landing on bare base
- .claude/skills/analyzing-snapshots/SKILL.md: add h200 to the
model->accelerator mapping and valid-values tables
Validated against a real H200 NVL cluster: GFD / DRA correctly identify
the device as "NVIDIA H200 NVL", 141GB HBM3e, Hopper, compute 9.0.
End-to-end: `aicr recipe --accelerator h200 --service bcm --os ubuntu
--intent training` resolves the identical deployment floor as the h100
equivalent and produces a recipe with criteria.accelerator: h200.
Addresses checkbox 4 of NVIDIA#1086 (the H200 registration item carved out
from PR NVIDIA#1089).
2b06bd0 to
6dff4c4
Compare
Summary
Add
h200as a built-in accelerator type so users with H200 hardware can pass--accelerator h200and have the recipe metadata reflect what's actually deployed. H200 is the same Hopper generation as H100 (same R570/R580 driver line, same gpu-operator support floor, NVML auto-detects everything), so it shares H100's deployment-phase floor via a parallelh200-any.yamlcriteria-wildcard overlay.Motivation / Context
Fixes: checkbox 4 of #1086
Related: #1089 (the BCM overlay-gaps PR that landed checkboxes 1–3 of #1086)
Validated on a real H200 NVL cluster during the BCM service overlay work:
NVIDIA H200 NVLBefore this PR:
accelerator: h100for an H200 cluster — wrong on inspection, breaks reproducibilitygds.enabledgated on H200's larger HBM, or NVL-form-factor topology), there's no enum key to hang it onTest infrastructure was already partially in place:
pkg/recipe/criteria_registry_parse_test.goregisteredh200viaOriginExternalas the extensibility test value. This PR promotes h200 to first-class and swaps that test value tomi300x(a clearly non-built-in accelerator that retains the extensibility-path semantics).Type of Change
Component(s) Affected
cmd/aicr,pkg/cli)cmd/aicrd,pkg/api,pkg/server)pkg/recipe)pkg/fingerprint)docs/)Implementation Notes
Surfaces touched per the
.claude/CLAUDE.md"Adding a new enum value" audit rule:Core Go:
pkg/recipe/criteria.go— new constCriteriaAcceleratorH200, parse case,GetCriteriaAcceleratorTypes()list, godoc enumerationSnapshot detection (the auto-detection half of the enum):
pkg/fingerprint/gpu_sku.go— add theH200ProductName pattern togpuSKURegistryso the snapshot → fingerprint → recipe path resolves real H200 hardware (NVIDIA H200 NVL,…141GB HBM3e) toh200instead ofunknown-sku. Without this, only the manual--accelerator h200path worked; snapshot-driven recipe generation (the path the wrong-metadata problem originates from) never producedh200.pkg/fingerprint/gpu_sku_test.go— H200 NVL + 141GB HBM3e casesDeployment-phase floor (so h200 truly mirrors h100):
recipes/overlays/h200-any.yaml— new criteria-wildcard overlay mirroringh100-any.yaml(4 standard checks: operator-health, expected-resources, gpu-operator-version, check-nvidia-smi; plusgpu-operator >= v24.6.0, the Hopper floor). Without it, an--accelerator h200recipe matched noh100-*overlay and landed on barebase, silently dropping the deployment validation floor every h100 recipe gets. Mirrors the per-accelerator wildcard pattern from feat(recipe): deliver deployment-phase floor at per-accelerator wildcards #1001 (h100-any/b200-any/gb200-any/rtx-pro-6000-any).Tests:
pkg/recipe/criteria_test.go— parse table coversh200andH200(uppercase);TestGetCriteriaAcceleratorTypesexpected list updatedpkg/recipe/criteria_registry_parse_test.goandcriteria_registry_test.go—h200was the OriginExternal extensibility example; swapped tomi300x(AMD, clearly non-built-in) so the tests retain their original intentAPI surface:
api/aicr/v1/server.yaml— all 5 enum blocks (accelerator + gpu alias, on snapshot/recipe endpoints)Docs:
docs/README.md(glossary)docs/user/cli-reference.md,docs/user/api-reference.mddocs/contributor/api-server.md,docs/contributor/cli.md,docs/contributor/data.md,docs/contributor/validations.md,docs/contributor/api-server-extending.md.claude/skills/analyzing-snapshots/SKILL.md— model→accelerator mapping table + valid-values listGodoc:
pkg/recipe/doc.go,pkg/cli/recipe.go,pkg/api/doc.go,pkg/server/doc.go,pkg/fingerprint/doc.go,pkg/fingerprint/types.goIssue templates:
.github/ISSUE_TEMPLATE/bug_report.yml(GPU type dropdown — 2 occurrences)Intentionally not changed:
types.gotop-level "e.g., h100, b200" example comment —e.g.is illustrative, not enumerativedocs/integrator/components/nodewright.md— that page lists NodeWright's current component support (h100, gb200), which is component-specific scope and would be a false claim to extende.g., h100, gb200example references indocs/user/cli-reference.mdand elsewhere — illustrative, not enumerativeh200-*leaf overlays (e.g.h200-bcm-training) — deferred until there's a real per-service tuning delta; theh200-any.yamlfloor plus the accelerator-agnostic service overlays cover today's needsTesting
Risk Assessment
h100recipes are unaffected;h200-any.yamlonly matchesaccelerator: h200queries. H200 hardware now auto-detects toh200on the snapshot path and inherits the same deployment floor as H100.Rollout notes: None. Existing recipes that use
--accelerator h100for H200 hardware continue to work; users can opt into the more-accurate--accelerator h200at their convenience.Checklist
./pkg/recipe/...,./pkg/fingerprint/..., andmake qualifyfor the touched packages)yamllint+golangci-lint)criteria_test.goparse + Get cases,gpu_sku_test.goH200 cases, extensibility tests swapped tomi300x).claude/CLAUDE.md)git commit -S)