docs(guides): add holodeck + AICR integration preview (provisioning + snapshot/recipe)#819
ArangoGutierrez wants to merge 14 commits into
Coverage Report for CI Build 26209697389
Warning: Build has drifted. This PR's base is out of sync with its target branch, so coverage data may include unrelated changes.
Coverage remained the same at 47.822%.
Uncovered changes: none found. Coverage regressions: none found.
Coveralls
b8edbff to
1b3132a
Compare
mchmarny
left a comment
Preview-scope rewrite reads well and CI is fully green. One nit on a copy-pasteable `yq` snippet in §2.1: readers will hit a parse error as written. Not a merge blocker.
…ample The unquoted .data.gpu.product-architecture path is parsed by jq (and therefore kislyuk/yq) as a function call: .product minus architecture/0, which is undefined. Readers using the Python yq get a compile error on copy-paste; mikefarah/yq accepts both forms. Quote the hyphenated key so the snippet works under either yq implementation. Addresses inline review comment on PR NVIDIA#819. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
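The quoting rule behind this fix can be sketched in Go. This is a minimal illustration, not code from the PR: `jqPath` and `identKey` are hypothetical names, and the sketch assumes only the documented jq rule that the bare `.foo` form works for identifier-like keys (letters, digits, and underscores, not starting with a digit).

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// identKey matches keys that jq accepts in the bare `.foo` form:
// letters, digits, and underscores, not starting with a digit.
var identKey = regexp.MustCompile(`^[A-Za-z_][A-Za-z0-9_]*$`)

// jqPath (hypothetical helper) renders path segments as a jq-style path,
// quoting every segment that is not identifier-like, such as one with "-".
// strconv.Quote is enough for plain ASCII keys; exotic keys may need
// JSON-specific escaping instead.
func jqPath(segments ...string) string {
	var b strings.Builder
	for _, s := range segments {
		b.WriteByte('.')
		if identKey.MatchString(s) {
			b.WriteString(s)
		} else {
			b.WriteString(strconv.Quote(s))
		}
	}
	return b.String()
}

func main() {
	// The hyphenated key from the commit message gets quoted.
	fmt.Println(jqPath("data", "gpu", "product-architecture"))
	// prints: .data.gpu."product-architecture"
}
```

Quoting only when needed keeps simple paths readable while producing an expression that both yq implementations named in the commit message should accept.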
3c3f640 to
9358013
Compare
mchmarny
left a comment
Docs-only PR adding the Holodeck → AICR preview walkthrough. Scope is well-bounded, "What's coming" honestly names the upstream gates, and the troubleshooting section captures real friction from the manual test run. CI is green across 18 checks (PR is "behind" `main` and needs a rebase before merge).
One medium concern: the `0.0.0.0/0` security-group troubleshooting entry doesn't match the single-node code path (`create.go` errors out rather than falling back), so either the wording needs adjustment or the actual code path observed needs to be pinned down. Plus a handful of nits: unused `jq` prereq, `PROVISIONED: true` phrasing, and one clarification around `--accelerator` inference in the recipe step. Nothing that blocks merge.
| [AICR installation](https://github.com/NVIDIA/aicr/blob/main/docs/user/installation.md))
| - AWS account with credentials in your environment and `g6e` quota in
| `us-west-2` (request via the EC2 service quotas console)
| - `kubectl`, `yq`, and `jq` on your path
`jq` is listed as a prerequisite but never used in the guide; only `yq` appears in the snippets (lines 162-163, 187). Drop `jq` from the list, or add the `jq` command that justifies it.
Dropped `jq` from prerequisites in 106cfc4; only `yq` is used in the snippets.
| **Security group is `0.0.0.0/0` instead of your public IP.** Holodeck
| falls back to `0.0.0.0/0` when it cannot detect your public IP. Either
This troubleshooting entry contradicts the single-node code path the guide covers. In `pkg/provider/aws/create.go:368-369`, when `utils.GetIPAddress()` fails, `holodeck create` returns an error (`could not detect public IP for security group (set ingressCidr explicitly)`) rather than falling back to `0.0.0.0/0`. The `0.0.0.0/0` rules in `pkg/provider/aws/cluster.go` apply to NodePort and Calico VXLAN ranges in multinode cluster mode, not single-node.
If you actually observed a `0.0.0.0/0` SSH/apiserver rule on `g6e.xlarge` during testing, it's worth pinning down the code path; otherwise this section may steer readers wrong. Suggest rewording around the actual failure mode: "If `holodeck create` fails with `could not detect public IP`, set `spec.ingressIpRanges` explicitly or run from a network with a routable public IP."
Verified against the PR's view of `pkg/provider/aws/create.go:366-368`: a `utils.GetIPAddress()` failure returns `could not detect public IP for security group` rather than falling back to `0.0.0.0/0`, so the previous wording was wrong. Rewrote the troubleshooting entry in 106cfc4 to describe the actual failure mode and point at `spec.ingressIpRanges` for an explicit override.
For completeness on the `cluster.go` rules: on the multi-node path, only NodePort TCP/UDP (`cluster.go:421-434`) open `0.0.0.0/0`; the Calico VXLAN/BGP/Typha rules use `UserIdGroupPairs` between the CP and Worker SGs, not `0.0.0.0/0`. None of that applies to the single-node walkthrough in this guide.
| holodeck dryrun -f ./my-env.yaml
| ```
|
| On success, holodeck records the instance with `PROVISIONED: true`.
nit: `holodeck list` prints a table with a `PROVISIONED` column showing `true`/`false` (see `cmd/cli/list/list.go:46`), not a `KEY: value` line. Reword as "...the instance shows `true` under the `PROVISIONED` column of `holodeck list`."
Reworded in 106cfc4 to: "the instance shows `true` under the `PROVISIONED` column of `holodeck list`."
| ```bash
| aicr recipe --snapshot snapshot.yaml \
| --intent inference --platform dynamo \
| --output recipe.yaml
nit: the "What's coming" section at line 213 talks about `--accelerator l40`, but this command doesn't pass `--accelerator` at all. Worth one sentence here clarifying that the accelerator is inferred from the snapshot; otherwise readers may think they need to add `--accelerator l40` after seeing the L40 note below.
Added a one-sentence clarification in 106cfc4 before the AICR-matching explanation: the accelerator is inferred from the snapshot, with no `--accelerator` flag in this command. Keeps the `--accelerator l40` reference in "What's coming" scoped to the underlying overlay-catalog gate.
| The snapshot → recipe step demonstrates the matching mechanic that
| makes AICR reproducible: given the same snapshot, you get the same
| recipe, and that recipe is what `aicr bundle` would turn into a
| deployable Helm chart sequence in the full demo.
|
| ## What's coming
|
| The full end-to-end demo (Slurm batch job + Dynamo chat-completions
| finale) is gated on three upstream changes:
|
| - **AICR PR #866 (Slinky/Slurm)**: adds `--platform slurm`. Currently
| on the feature branch `feat/slinky-slurm-operator`, not yet merged
| to `main`. Enables the Phase 2 Slurm track.
| - **AICR overlay catalog, L40 coverage**: `--accelerator l40` today
| matches only the `base` + `monitoring-hpa` overlays, so the inference
| recipe does not include `dynamo-platform`. Tracked upstream.
| - **GPU Operator on kubeadm + holodeck driver**: the `gdrcopy-validation`
| init container of `nvidia-operator-validator` requires the `gdrdrv`
The "What's coming" section is a strong addition: naming the three upstream gates (Slurm PR #866, L40 overlay coverage, `gdrdrv`) makes the preview scope honest and gives readers a clear sense of when v2 grows back. Nice.
- drop unused jq from prerequisites (only yq is used in snippets)
- correct PROVISIONED phrasing: 'holodeck list' prints a table column, not a KEY: value line
- clarify the accelerator is inferred from the snapshot in §2.2; no --accelerator flag is passed in the example, despite what the L40 note in 'What's coming' references
- rewrite the security-group troubleshooting entry: the code path in pkg/provider/aws/create.go errors out with 'could not detect public IP' rather than falling back to 0.0.0.0/0; point readers at spec.ingressIpRanges instead
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Single-node Holodeck Environment for the AICR integration demo (docs/guides/aicr-integration.md). Uses g6e.xlarge (1x NVIDIA L40S), kubeadm K8s v1.35.0 (AICR floor 1.34+), Ubuntu 22.04, containerd, NVIDIA driver + toolkit installed. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Controller + 1-replica NodeSet (1x nvidia.com/gpu) sized for the AICR integration demo's single L40S worker. Operator: slinky-slurm-operator chart v1.1.0 (pinned by AICR recipes/components/). Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Title, TL;DR, 'What you'll build' (with Mermaid flowchart), and Prerequisites. Phase 1 + Phase 2 + closing sections will be filled in subsequent commits. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Configure the Environment YAML, holodeck create + monitor, verify GPU access via a runtimeClass=nvidia pod. Phase 1 ends with the explicit hand-off note about nvidia.com/gpu being Day 1's job. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
aicr snapshot capture + brief explanation of how the snapshot feeds both recipe matching and later validation. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
aicr recipe/bundle/deploy for --platform slurm, then kubectl apply the SlurmCluster manifest and submit an sbatch GPU job via the controller pod. Open question: --service flag value for kubeadm clusters (resolves in end-to-end validation). Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Second aicr recipe/bundle/deploy on the same cluster: dynamo platform installed (GPU Operator + dynamo-platform), sample vllm-agg workload, port-forward + curl against the chat completions API. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
aicr validate --phase all against both recipes, summarized with jq. Both reports should show failed=0. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Why this matters (superpower wrap-up), Troubleshooting (6 common failure modes), Next steps + cleanup (holodeck delete + multi-node + AICR cloud variants). Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Cross-references between docs/guides/README.md, docs/examples/README.md, and the new docs/guides/aicr-integration.md + examples/aicr-demo/ folder. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
The Slinky SlurmCluster manifest is removed from the AICR demo example folder. The guide's v1 scope (preview) covers Holodeck provisioning plus an AICR snapshot/recipe walkthrough — the Slurm track of the original design is deferred pending upstream AICR support landing on main. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
…view
Rewrites the guide to match what works end-to-end on g6e.xlarge L40S against current holodeck v0.2.18 + aicr main:
- Phase 1 (provisioning): corrected to use 'holodeck create --provision' and 'holodeck get kubeconfig -o PATH INSTANCE_ID'. Drops the 'runtimeClassName: nvidia' Day-0 GPU test (RuntimeClass not installed by holodeck).
- Phase 2 (AICR): reduced to snapshot + recipe demonstration. Drops the Slurm track (aicr PR #866 not yet merged), the Dynamo deploy finale (L40 overlay catalog gap), and aicr validate.
- Adds a 'What's coming' section linking the three upstream gaps: aicr PR #866, L40 overlay coverage, gdrcopy validator on holodeck.
- Adds troubleshooting notes for the holodeck CLI flag-must-precede-arg parser quirk and the 0.0.0.0/0 security-group fallback.
- Drops Slurm/Dynamo references from docs/guides/README.md and docs/examples/README.md to match.
Reframes the guide as a 'preview' so readers understand the scope is intentional and tracks the upstream changes.
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
106cfc4 to
636dfbb
Compare
Summary
Adds a preview walkthrough at `docs/guides/aicr-integration.md` covering the Day-0 + early Day-1 surface of the Holodeck → AICR flow: provision a single-node `g6e.xlarge` L40S cluster with Holodeck, then capture an AICR snapshot and generate a matching recipe. The companion example at `examples/aicr-demo/environment.yaml` is the single declarative file the guide walks through.
Scope
The guide is intentionally narrowed to what runs end-to-end on `g6e.xlarge` L40S against current Holodeck + AICR `main` today:
- `holodeck create --provision` → `holodeck get kubeconfig` → `kubectl get nodes` Ready.
- `aicr snapshot` against the live cluster.
- `aicr recipe` from that snapshot.
The full end-to-end (Slurm `sbatch` job + Dynamo chat-completions response) is deferred. A "What's coming" section in the guide lists the prerequisites:
- … `main`.
- `--accelerator l40` today applies only the `base + monitoring-hpa` overlays, so `dynamo-platform` does not appear in the bundle.
- `gdrcopy-validation` init step: requires the `gdrdrv` kernel module, which Holodeck's Ansible plays do not install.
When these land, the guide grows back the bundle/deploy/validate sections.
CLI corrections vs. an earlier draft
Driving Holodeck end-to-end surfaced a handful of CLI-shape drifts that the guide now reflects:
- `holodeck create -f env.yaml` only spins up the EC2 instance; the `--provision` flag is required to run the Ansible plays.
- `holodeck update INSTANCE_ID --reprovision` returns "instance ID is required". Flag must precede positional arg: `holodeck update --reprovision INSTANCE_ID`. Same ordering applies to `holodeck get kubeconfig -o PATH INSTANCE_ID`. The guide's troubleshooting calls this out.
- … `holodeck get kubeconfig`, not written automatically to CWD by `create`.
- … `0.0.0.0/0` when public-IP detection fails. Documented in the guide's troubleshooting.
- `runtimeClassName: nvidia` does not work as a Day-0 GPU smoke test: the `nvidia` RuntimeClass is not installed by Holodeck. Kubernetes-level GPU registration is Phase 2's job (GPU Operator).
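The flag-before-positional ordering noted above is a common trait of Go command-line parsers. The sketch below uses only Go's standard `flag` package, which stops parsing at the first non-flag argument. Whether Holodeck's parser works this way internally is an assumption, and the sketch does not reproduce the "instance ID is required" message; `parseUpdate` is a hypothetical stand-in for the `update` subcommand.

```go
package main

import (
	"flag"
	"fmt"
	"io"
)

// parseUpdate (hypothetical) parses one boolean flag plus positionals using
// the standard flag package, which stops at the first non-flag argument.
func parseUpdate(args []string) (reprovision bool, positionals []string, err error) {
	fs := flag.NewFlagSet("update", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // keep usage text out of the demo output
	fs.BoolVar(&reprovision, "reprovision", false, "re-run provisioning")
	err = fs.Parse(args)
	return reprovision, fs.Args(), err
}

func main() {
	for _, args := range [][]string{
		{"--reprovision", "INSTANCE_ID"}, // flag first: parsed as a flag
		{"INSTANCE_ID", "--reprovision"}, // flag last: left as a positional
	} {
		re, pos, _ := parseUpdate(args)
		fmt.Printf("%v -> reprovision=%t positionals=%v\n", args, re, pos)
	}
}
```

In the second call the flag is never seen as a flag, so a command that validates its flags and positionals afterwards would behave as if `--reprovision` had not been passed.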
Testing
Manual run on AWS `us-west-2` `g6e.xlarge` (1× L40S) on 2026-05-18: provisioning succeeded, snapshot captured with `NVIDIA L40S` / Ada Lovelace / driver 575.57.08 / CUDA 13.2 detected, recipe generated with 10 components. Cluster torn down at end of session.
`make mdlint` not run locally in this environment; CI will verify.
Files
- `docs/guides/aicr-integration.md`: preview-scope rewrite.
- `docs/guides/README.md`, `docs/examples/README.md`: index entries reflect the reduced scope.
- `examples/aicr-demo/environment.yaml`: header comment updated.
- `examples/aicr-demo/slurm-cluster.yaml`: removed from v1 (no longer referenced; the Slurm track is deferred).
Out of scope (future work)
- … upstream prerequisites above land.
- `make demo-test` / GitHub Action for CI-driven end-to-end runs.
- `scripts/aicr-demo.sh` helper.