docs(guides): add holodeck + AICR integration preview (provisioning + snapshot/recipe) #819

Open
ArangoGutierrez wants to merge 14 commits into NVIDIA:main from ArangoGutierrez:docs-aicr-integration-demo

@ArangoGutierrez ArangoGutierrez commented May 18, 2026

Summary

Adds a preview walkthrough at docs/guides/aicr-integration.md
covering the Day-0 + early Day-1 surface of the Holodeck → AICR flow:
provision a single-node g6e.xlarge L40S cluster with Holodeck, then
capture an AICR snapshot and generate a matching recipe. The companion
example at examples/aicr-demo/environment.yaml is the single
declarative file the guide walks through.

Scope

The guide is intentionally narrowed to what runs end-to-end on
g6e.xlarge L40S against current Holodeck + AICR main today:

  • Phase 1: holodeck create --provision → holodeck get kubeconfig →
    kubectl get nodes reports Ready.
  • Phase 2.1: aicr snapshot against the live cluster.
  • Phase 2.2: aicr recipe from that snapshot.
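
As a rough sketch of how the phases above line up, the function below only prints the command sequence instead of running it, so it works without holodeck, kubectl, or aicr installed. The `INSTANCE_ID` placeholder and the file names `environment.yaml` and `kubeconfig` are assumptions for illustration. `aicr snapshot` is shown without flags because this description does not spell them out, and how `snapshot.yaml` gets written is likewise not shown here. The recipe flags are copied from the guide snippet quoted later in the review thread.

```bash
#!/usr/bin/env bash
# Print (do not execute) the preview-scope command sequence.
# The instance ID and file names are placeholders, not real values.
print_preview_plan() {
  local instance_id="$1"
  local env_file="${2:-environment.yaml}"
  cat <<EOF
holodeck create --provision -f ${env_file}
holodeck get kubeconfig -o kubeconfig ${instance_id}
kubectl --kubeconfig kubeconfig get nodes
aicr snapshot
aicr recipe --snapshot snapshot.yaml --intent inference --platform dynamo --output recipe.yaml
EOF
}

print_preview_plan INSTANCE_ID
```

Printing the plan keeps the example side-effect free; piping the output to a shell would be the reader's own decision once the placeholders are filled in.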

The full end-to-end (Slurm sbatch job + Dynamo chat-completions
response) is deferred. A "What's coming" section in the guide lists
the prerequisites:

  • AICR Slinky/Slurm support — not yet on main.
  • AICR overlay catalog coverage for L40 accelerators — --accelerator l40 today applies only the base + monitoring-hpa overlays, so
    dynamo-platform does not appear in the bundle.
  • GPU Operator's gdrcopy-validation init step — requires the
    gdrdrv kernel module, which Holodeck's Ansible plays do not
    install.

When these land, the guide grows back the bundle/deploy/validate
sections.

CLI corrections vs. an earlier draft

Driving Holodeck end-to-end surfaced a handful of CLI-shape drifts
that the guide now reflects:

  • holodeck create -f env.yaml only spins up the EC2 instance; the
    --provision flag is required to run the Ansible plays.
  • holodeck update INSTANCE_ID --reprovision returns "instance ID is
    required". Flag must precede positional arg:
    holodeck update --reprovision INSTANCE_ID. Same ordering applies
    to holodeck get kubeconfig -o PATH INSTANCE_ID. The guide's
    troubleshooting calls this out.
  • The kubeconfig is fetched explicitly via holodeck get kubeconfig,
    not written automatically to CWD by create.
  • Security group falls back to 0.0.0.0/0 when public-IP detection
    fails. Documented in the guide's troubleshooting.
  • runtimeClassName: nvidia does not work as a Day-0 GPU smoke test
    — the nvidia RuntimeClass is not installed by Holodeck.
    Kubernetes-level GPU registration is Phase 2's job (GPU Operator).
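
One way to sidestep the flag-ordering quirk described above is a pair of small wrappers. This is a minimal sketch, not part of Holodeck: the function names are hypothetical, and the functions only print the command in the working order (flags first, instance ID last) rather than invoking the CLI.

```bash
#!/usr/bin/env bash
# Hypothetical wrappers: each takes the instance ID first and emits the
# holodeck command with flags ahead of the positional argument.
holodeck_reprovision_cmd() {
  local instance_id="$1"
  printf 'holodeck update --reprovision %s\n' "$instance_id"
}

holodeck_kubeconfig_cmd() {
  local instance_id="$1"
  local out_path="$2"
  printf 'holodeck get kubeconfig -o %s %s\n' "$out_path" "$instance_id"
}

holodeck_reprovision_cmd INSTANCE_ID
holodeck_kubeconfig_cmd INSTANCE_ID ./kubeconfig
```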

Testing

Manual run on AWS us-west-2 g6e.xlarge (1× L40S) on 2026-05-18:
provisioning succeeded, snapshot captured with NVIDIA L40S /
Ada Lovelace / driver 575.57.08 / CUDA 13.2 detected, recipe generated
with 10 components. Cluster torn down at end of session.

make mdlint was not run locally in this environment; CI will verify.

Files

  • docs/guides/aicr-integration.md — preview-scope rewrite.
  • docs/guides/README.md, docs/examples/README.md — index entries
    reflect the reduced scope.
  • examples/aicr-demo/environment.yaml — header comment updated.
  • examples/aicr-demo/slurm-cluster.yaml — removed from v1 (no
    longer referenced; the Slurm track is deferred).

Out of scope (future work)

  • Full Slurm + Dynamo end-to-end demo — re-added once the three
    upstream prerequisites above land.
  • make demo-test / GitHub Action for CI-driven end-to-end runs.
  • scripts/aicr-demo.sh helper.

coveralls commented May 18, 2026

Coverage Report for CI Build 26209697389

Warning

Build has drifted: This PR's base is out of sync with its target branch, so coverage data may include unrelated changes.
Quick fix: rebase this PR.

Coverage remained the same at 47.822%

Details

  • Coverage remained the same as the base build.
  • Patch coverage: No coverable lines changed in this PR.
  • No coverage regressions found.

Uncovered Changes

No uncovered changes found.

Coverage Regressions

No coverage regressions found.


Coverage Stats

Coverage Status
Relevant Lines: 11066
Covered Lines: 5292
Line Coverage: 47.82%
Coverage Strength: 0.53 hits per line
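
As a quick consistency check, the reported percentage can be recomputed from the two line counts above. This sketch uses scaled integer arithmetic and truncates rather than rounds; the function name is made up for illustration.

```bash
#!/usr/bin/env bash
# Recompute line coverage as a percentage with three decimals (truncated).
line_coverage() {
  local covered="$1"
  local relevant="$2"
  # Thousandths of a percent, integer division.
  local scaled=$(( covered * 100000 / relevant ))
  printf '%d.%03d\n' $(( scaled / 1000 )) $(( scaled % 1000 ))
}

line_coverage 5292 11066
# prints: 47.822
```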


@ArangoGutierrez ArangoGutierrez marked this pull request as ready for review May 18, 2026 17:21
@ArangoGutierrez ArangoGutierrez force-pushed the docs-aicr-integration-demo branch from b8edbff to 1b3132a on May 18, 2026 17:51

@mchmarny mchmarny left a comment

Preview-scope rewrite reads well and CI is fully green. One nit on a copy-pasteable yq snippet in §2.1 — readers will hit a parse error as written. Not a merge blocker.

Comment thread docs/guides/aicr-integration.md Outdated
ArangoGutierrez added a commit to ArangoGutierrez/holodeck that referenced this pull request May 19, 2026
…ample

The unquoted .data.gpu.product-architecture path is parsed by jq (and
therefore kislyuk/yq) as a function call: .product minus
architecture/0, which is undefined. Readers using the Python yq get a
compile error on copy-paste; mikefarah/yq accepts both forms. Quote the
hyphenated key so the snippet works under either yq implementation.

Addresses inline review comment on PR NVIDIA#819.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
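
To illustrate the quoting rule in the commit message above, here is a small helper that builds a path expression and quotes any segment that is not a plain identifier. It is a sketch only: `jq_path` is a hypothetical name, and the helper builds the string without calling jq or yq.

```bash
#!/usr/bin/env bash
# Build a jq-style path, quoting segments that are not plain identifiers
# (for example, keys containing a hyphen).
jq_path() {
  local path=""
  local seg
  for seg in "$@"; do
    case "$seg" in
      ''|[0-9]*|*[!A-Za-z0-9_]*) path="${path}.\"${seg}\"" ;;
      *) path="${path}.${seg}" ;;
    esac
  done
  printf '%s\n' "$path"
}

jq_path data gpu product-architecture
# prints: .data.gpu."product-architecture"
```

The resulting string could then be passed as the filter argument to either yq implementation, for example `yq "$(jq_path data gpu product-architecture)" snapshot.yaml`, assuming the snapshot file is named `snapshot.yaml`.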
@ArangoGutierrez ArangoGutierrez requested a review from mchmarny May 19, 2026 06:43
@ArangoGutierrez ArangoGutierrez force-pushed the docs-aicr-integration-demo branch from 3c3f640 to 9358013 on May 19, 2026 08:55

@mchmarny mchmarny left a comment

Docs-only PR adding the Holodeck → AICR preview walkthrough. Scope is well-bounded, "What's coming" honestly names the upstream gates, and the troubleshooting section captures real friction from the manual test run. CI is green across 18 checks (PR is "behind" main — needs a rebase before merge).

One medium concern: the 0.0.0.0/0 security-group troubleshooting entry doesn't match the single-node code path (create.go errors out rather than falling back), so either the wording needs adjustment or the actual code path observed needs to be pinned down. Plus a handful of nits — unused jq prereq, PROVISIONED: true phrasing, and one clarification around --accelerator inference in the recipe step. Nothing that blocks merge.

Comment thread docs/guides/aicr-integration.md Outdated
[AICR installation](https://github.com/NVIDIA/aicr/blob/main/docs/user/installation.md))
- AWS account with credentials in your environment and `g6e` quota in
`us-west-2` (request via the EC2 service quotas console)
- `kubectl`, `yq`, and `jq` on your path

mchmarny (Member):

jq is listed as a prerequisite but never used in the guide — only yq appears in the snippets (lines 162-163, 187). Drop jq from the list, or add the jq command that justifies it.

ArangoGutierrez (Collaborator, Author):

Dropped jq from prerequisites in 106cfc4 — only yq is used in the snippets.

Comment thread docs/guides/aicr-integration.md Outdated
Comment on lines +243 to +244
**Security group is `0.0.0.0/0` instead of your public IP.** Holodeck
falls back to `0.0.0.0/0` when it cannot detect your public IP. Either

mchmarny (Member):

This troubleshooting entry contradicts the single-node code path the guide covers. In pkg/provider/aws/create.go:368-369, when utils.GetIPAddress() fails, holodeck create returns an error (could not detect public IP for security group (set ingressCidr explicitly)) rather than falling back to 0.0.0.0/0. The 0.0.0.0/0 rules in pkg/provider/aws/cluster.go apply to NodePort and Calico VXLAN ranges in multinode cluster mode, not single-node.

If you actually observed a 0.0.0.0/0 SSH/apiserver rule on g6e.xlarge during testing, it's worth pinning down the code path — otherwise this section may steer readers wrong. Suggest rewording around the actual failure mode: "If holodeck create fails with could not detect public IP, set spec.ingressIpRanges explicitly or run from a network with a routable public IP."

ArangoGutierrez (Collaborator, Author):

Verified against the PR's view of pkg/provider/aws/create.go:366-368: a utils.GetIPAddress() failure returns could not detect public IP for security group rather than falling back to 0.0.0.0/0, so the previous wording was wrong. Rewrote the troubleshooting entry in 106cfc4 to describe the actual failure mode and point at spec.ingressIpRanges for an explicit override.

For completeness on the cluster.go rules: on the multi-node path, only NodePort TCP/UDP (cluster.go:421-434) open 0.0.0.0/0; the Calico VXLAN/BGP/Typha rules use UserIdGroupPairs between the CP and Worker SGs, not 0.0.0.0/0. None of that applies to the single-node walkthrough in this guide.
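
The single-node behavior agreed on in this thread (explicit range wins, otherwise use the detected IP, otherwise fail) can be sketched as below. This is an illustrative shell model, not Holodeck's Go code: the function name is hypothetical, narrowing the detected IP to a /32 is an assumption, and the addresses are documentation-range examples.

```bash
#!/usr/bin/env bash
# Hypothetical model of the ingress CIDR decision described in the thread.
# It never falls back to 0.0.0.0/0.
resolve_ingress_cidr() {
  local explicit="$1"      # value the user set explicitly, may be empty
  local detected_ip="$2"   # detected public IP, empty if detection failed
  if [ -n "$explicit" ]; then
    printf '%s\n' "$explicit"
  elif [ -n "$detected_ip" ]; then
    # Assumption: a single detected address is restricted to a /32.
    printf '%s/32\n' "$detected_ip"
  else
    echo "could not detect public IP for security group" >&2
    return 1
  fi
}

resolve_ingress_cidr "" "203.0.113.7"
```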

Comment thread docs/guides/aicr-integration.md Outdated
```bash
holodeck dryrun -f ./my-env.yaml
```

On success, holodeck records the instance with `PROVISIONED: true`.

mchmarny (Member):

nit: holodeck list prints a table with a PROVISIONED column showing true/false (see cmd/cli/list/list.go:46), not a KEY: value line. Reword as "...the instance shows true under the PROVISIONED column of holodeck list."

ArangoGutierrez (Collaborator, Author):

Reworded in 106cfc4 to: "the instance shows true under the PROVISIONED column of holodeck list."

Comment on lines +175 to +178
```bash
aicr recipe --snapshot snapshot.yaml \
  --intent inference --platform dynamo \
  --output recipe.yaml
```

mchmarny (Member):

nit: the "What's coming" section at line 213 talks about --accelerator l40, but this command doesn't pass --accelerator at all. Worth one sentence here clarifying that the accelerator is inferred from the snapshot — otherwise readers may think they need to add --accelerator l40 after seeing the L40 note below.

ArangoGutierrez (Collaborator, Author):

Added a one-sentence clarification in 106cfc4 before the AICR-matching explanation — accelerator is inferred from the snapshot, no --accelerator flag in this command. Keeps the --accelerator l40 reference in "What's coming" scoped to the underlying overlay-catalog gate.

Comment on lines +200 to +217
The snapshot → recipe step demonstrates the matching mechanic that
makes AICR reproducible: given the same snapshot, you get the same
recipe, and that recipe is what `aicr bundle` would turn into a
deployable Helm chart sequence in the full demo.

## What's coming

The full end-to-end demo (Slurm batch job + Dynamo chat-completions
finale) is gated on three upstream changes:

- **AICR PR #866 (Slinky/Slurm)** — adds `--platform slurm`. Currently
on the feature branch `feat/slinky-slurm-operator`, not yet merged
to `main`. Enables the Phase 2 Slurm track.
- **AICR overlay catalog — L40 coverage** — `--accelerator l40` today
matches only the `base` + `monitoring-hpa` overlays, so the inference
recipe does not include `dynamo-platform`. Tracked upstream.
- **GPU Operator on kubeadm + holodeck driver** — the `gdrcopy-validation`
init container of `nvidia-operator-validator` requires the `gdrdrv`

mchmarny (Member):

The "What's coming" section is a strong addition — naming the three upstream gates (Slurm PR #866, L40 overlay coverage, gdrdrv) makes the preview scope honest and gives readers a clear sense of when v2 grows back. Nice.

ArangoGutierrez added a commit to ArangoGutierrez/holodeck that referenced this pull request May 21, 2026
- drop unused jq from prerequisites (only yq is used in snippets)
- correct PROVISIONED phrasing: 'holodeck list' prints a table column,
  not a KEY: value line
- clarify the accelerator is inferred from the snapshot in §2.2 — no
  --accelerator flag is passed in the example, despite what the L40
  note in 'What's coming' references
- rewrite the security-group troubleshooting entry: the code path in
  pkg/provider/aws/create.go errors out with 'could not detect public
  IP' rather than falling back to 0.0.0.0/0; point readers at
  spec.ingressIpRanges instead

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Single-node Holodeck Environment for the AICR integration demo
(docs/guides/aicr-integration.md). Uses g6e.xlarge (1x NVIDIA L40S),
kubeadm K8s v1.35.0 (AICR floor 1.34+), Ubuntu 22.04, containerd,
NVIDIA driver + toolkit installed.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
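
The "AICR floor 1.34+" note in the commit message above amounts to a major.minor comparison. A minimal sketch of that check follows; the function name is hypothetical and the patch level is ignored.

```bash
#!/usr/bin/env bash
# Succeed when a Kubernetes version such as v1.35.0 is at or above a
# major.minor floor such as 1.34. Patch level is ignored.
meets_k8s_floor() {
  local version="${1#v}"
  local floor="$2"
  local major="${version%%.*}"
  local rest="${version#*.}"
  local minor="${rest%%.*}"
  local floor_major="${floor%%.*}"
  local floor_minor="${floor#*.}"
  [ "$major" -gt "$floor_major" ] ||
    { [ "$major" -eq "$floor_major" ] && [ "$minor" -ge "$floor_minor" ]; }
}

meets_k8s_floor v1.35.0 1.34 && echo "v1.35.0 meets the 1.34+ floor"
```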
Controller + 1-replica NodeSet (1x nvidia.com/gpu) sized for the
AICR integration demo's single L40S worker. Operator:
slinky-slurm-operator chart v1.1.0 (pinned by AICR recipes/components/).

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Title, TL;DR, 'What you'll build' (with Mermaid flowchart), and
Prerequisites. Phase 1 + Phase 2 + closing sections will be filled
in subsequent commits.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Configure the Environment YAML, holodeck create + monitor, verify
GPU access via a runtimeClass=nvidia pod. Phase 1 ends with the
explicit hand-off note about nvidia.com/gpu being Day 1's job.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
aicr snapshot capture + brief explanation of how the snapshot feeds
both recipe matching and later validation.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
aicr recipe/bundle/deploy for --platform slurm, then kubectl apply
the SlurmCluster manifest and submit an sbatch GPU job via the
controller pod. Open question: --service flag value for kubeadm
clusters (resolves in end-to-end validation).

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Second aicr recipe/bundle/deploy on the same cluster: dynamo
platform installed (GPU Operator + dynamo-platform), sample
vllm-agg workload, port-forward + curl against the chat
completions API.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
aicr validate --phase all against both recipes, summarized with jq.
Both reports should show failed=0.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Why this matters (superpower wrap-up), Troubleshooting (6 common
failure modes), Next steps + cleanup (holodeck delete + multi-node
+ AICR cloud variants).

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Cross-references between docs/guides/README.md,
docs/examples/README.md, and the new docs/guides/aicr-integration.md +
examples/aicr-demo/ folder.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
The Slinky SlurmCluster manifest is removed from the AICR demo
example folder. The guide's v1 scope (preview) covers Holodeck
provisioning plus an AICR snapshot/recipe walkthrough — the Slurm
track of the original design is deferred pending upstream AICR
support landing on main.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
…view

Rewrites the guide to match what works end-to-end on g6e.xlarge L40S
against current holodeck v0.2.18 + aicr main:

- Phase 1 (provisioning): corrected to use 'holodeck create
  --provision' and 'holodeck get kubeconfig -o PATH INSTANCE_ID'.
  Drops the 'runtimeClassName: nvidia' Day-0 GPU test (RuntimeClass
  not installed by holodeck).
- Phase 2 (AICR): reduced to snapshot + recipe demonstration. Drops
  the Slurm track (aicr PR #866 not yet merged), the Dynamo deploy
  finale (L40 overlay catalog gap), and aicr validate.
- Adds a 'What's coming' section linking the three upstream gaps:
  aicr PR #866, L40 overlay coverage, gdrcopy validator on holodeck.
- Adds troubleshooting notes for the holodeck CLI flag-must-precede-
  arg parser quirk and the 0.0.0.0/0 security-group fallback.
- Drops Slurm/Dynamo references from docs/guides/README.md and
  docs/examples/README.md to match.

Reframes the guide as a 'preview' so readers understand the scope is
intentional and tracks the upstream changes.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
…ample

The unquoted .data.gpu.product-architecture path is parsed by jq (and
therefore kislyuk/yq) as a function call: .product minus
architecture/0, which is undefined. Readers using the Python yq get a
compile error on copy-paste; mikefarah/yq accepts both forms. Quote the
hyphenated key so the snippet works under either yq implementation.

Addresses inline review comment on PR NVIDIA#819.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
- drop unused jq from prerequisites (only yq is used in snippets)
- correct PROVISIONED phrasing: 'holodeck list' prints a table column,
  not a KEY: value line
- clarify the accelerator is inferred from the snapshot in §2.2 — no
  --accelerator flag is passed in the example, despite what the L40
  note in 'What's coming' references
- rewrite the security-group troubleshooting entry: the code path in
  pkg/provider/aws/create.go errors out with 'could not detect public
  IP' rather than falling back to 0.0.0.0/0; point readers at
  spec.ingressIpRanges instead

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
@ArangoGutierrez ArangoGutierrez force-pushed the docs-aicr-integration-demo branch from 106cfc4 to 636dfbb on May 21, 2026 09:14

ArangoGutierrez (Collaborator, Author):

@mchmarny — pushed 636dfbb addressing the 5 nits (per-thread replies inline) and rebased on current main. Ready for another look when you have a sec.
