feat(tools): add local Talos cluster + snapshot chainsaw test by ayuskauskas · Pull Request #762 · NVIDIA/aicr

ayuskauskas · 2026-05-05T18:36:10Z

Summary

Adds an opt-in, kind-equivalent harness for spinning up a local Talos
Linux cluster on a developer laptop (tools/talos-test/) and a chainsaw
test that exercises the AICR snapshot agent against it
(tests/chainsaw/snapshot/deploy-agent-talos/). Also adds --requests
and --limits flags to aicr snapshot so the agent can be right-sized
on resource-constrained dev clusters.

Motivation / Context

PR #714 added Talos OS support (pkg/collector/talos/ + the OS=talos
pod-shape branch in pkg/k8s/agent/job.go), but the only coverage was
unit tests. There was no integration test running the snapshot agent
against an actual Talos node — and no easy way for a developer on macOS
to exercise that path locally without standing up a remote cluster.

The agent container's resource requests/limits were also hardcoded
(Privileged: 4Gi req / 8Gi lim memory). That sizing fits production
GPU nodes but blocks scheduling on the talosctl Docker provisioner's
default 2Gi worker, with no way to override at the node level.

Fixes: N/A
Related: #714, #565

Type of Change

Bug fix (non-breaking change that fixes an issue)
New feature (non-breaking change that adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)
Documentation update
Refactoring (no functional changes)
Build/CI/tooling

Component(s) Affected

CLI (cmd/aicr, pkg/cli)
API server (cmd/aicrd, pkg/api, pkg/server)
Recipe engine / data (pkg/recipe)
Bundlers (pkg/bundler, pkg/component/*)
Collectors / snapshotter (pkg/collector, pkg/snapshotter)
Validator (pkg/validator)
Core libraries (pkg/errors, pkg/k8s)
Docs/examples (docs/, examples/)
Other: tests/chainsaw, tools/, Makefile

Implementation Notes

`tools/talos-test/{up.sh,down.sh,README.md}` — cluster lifecycle

Wraps talosctl cluster create docker (newer talosctl moved
provisioners to subcommands; the old --provisioner flag is gone).
Threads a containerd registry-mirror config patch via
--config-patch so localhost:5001/aicr:local resolves transparently
from inside Talos node containers (the host registry started by
make dev-env).
br_netfilter preflight: the macOS host VM (Docker Desktop's
LinuxKit or Podman Machine's Fedora CoreOS) ships the module but
does not auto-load it, so flannel CNI fails until we modprobe it via
a one-shot privileged sidecar with /lib/modules bind-mounted from
the VM. No-op on Linux hosts.
talosctl health --wait-timeout 8m streams readiness phases
(etcd, apid, kubelet, control plane, all k8s nodes ready, kube-proxy,
CoreDNS). On failure the script prints diagnostic commands the
developer can run to drill into stuck services.
The Talos in-network IP (10.5.0.0/24) is not host-routable on
macOS, so the script derives the docker-NAT'd apid host port (50000)
and K8s API host port (6443) from docker port, passes them as
--endpoints, and rewrites the kubeconfig server URL to
https://127.0.0.1:<host-port> with insecure-skip-tls-verify=true.
Relabels the default namespace with
pod-security.kubernetes.io/enforce=privileged so the snapshot
agent's privileged pod can be scheduled (Talos enforces restricted
cluster-wide by default).

`tests/chainsaw/snapshot/deploy-agent-talos/` — chainsaw test

Invokes aicr snapshot --os talos -o cm://default/aicr-e2e-snapshot
from a script: step. The CLI exercises pkg/k8s/agent.Deployer
end-to-end (RBAC creation, OS=talos pod shape from
pkg/k8s/agent/job.go, log streaming, Job completion, ConfigMap
write, cleanup) — no hand-applied YAML fixtures to drift against
job.go (the existing kind chainsaw test's drift risk is captured
for follow-up in issues/e2e-update.md, kept local).
Asserts only the surviving ConfigMap content. Uses chainsaw
JMESPath filters ((sort([*].type)),
(@[?type == 'X'] | [0].subtypes | sort([*].subtype))) so the
assertion is order-independent at both the measurement-type and
per-type subtype levels — collector goroutine completion order
is non-deterministic.
Pinned: 5 measurements; the set of types
{GPU, K8s, NodeTopology, OS, SystemD}; per-type subtype names.
Per-subtype data values are intentionally not pinned to avoid
Talos / kernel / kubernetes version churn.

`aicr snapshot` — `--requests` / `--limits`

Adds two CLI flags accepting kubectl-style comma-separated
name=quantity lists:

--requests cpu=500m,memory=1Gi,ephemeral-storage=1Gi
--limits   cpu=1,memory=2Gi,ephemeral-storage=2Gi

Plumbed through pkg/snapshotter.AgentConfig and pkg/k8s/agent.Config
as corev1.ResourceList. applyPrivilegedSettings and
applyRestrictedSettings merge overrides on top of the existing
hardcoded defaults via a new mergeResourceList helper, so unspecified
keys keep the production defaults. The nvidia.com/gpu limit added
by --require-gpu is preserved on top of any override.

Make targets

talos-dev-env, talos-dev-env-clean, talos-snapshot-test. None
are wired into make qualify; Talos remains opt-in for now.

Testing

unset GITLAB_TOKEN
make build
make test     # touched packages
golangci-lint run -c .golangci.yaml ./pkg/k8s/agent/... ./pkg/cli/... ./pkg/snapshotter/...
make lint-yaml

End-to-end against a live local cluster (macOS Podman Machine, Talos
v1.9.0, talosctl 1.13.0):

make image IMAGE_REGISTRY=localhost:5001/aicr IMAGE_TAG=local
make build
make talos-dev-env
make talos-snapshot-test          # PASS
make talos-dev-env-clean

Coverage delta on touched packages:

pkg/k8s/agent: 77.8% (baseline) → ~78% (added tests for resource-
override paths)
pkg/cli: ~52.9% (baseline) → comparable (added flag plumbing
- parseResourceList; covered indirectly)
pkg/snapshotter: ~52.2% (baseline) → comparable (1-line struct
field + plumbing only)

Risk Assessment

Low — Isolated change, well-tested, easy to revert
Medium — Touches multiple components or has broader impact
High — Breaking change, affects critical paths, or complex rollout

Rollout notes: Production callers of aicr snapshot are unaffected;
both new flags default to empty, in which case the hardcoded
privileged/restricted defaults remain in force exactly as before. The
Talos test harness is a new opt-in surface and is not wired into
make qualify or any CI workflow.

Checklist

Tests pass locally (make test with -race)
Linter passes (make lint) — only go: no such tool covdata on
cmd/aicr/cmd/aicrd (a pre-existing local toolchain issue,
unrelated to this branch — does not affect CI)
I did not skip/disable tests to make CI green
I added/updated tests for new functionality
I updated docs if user-facing behavior changed
Changes follow existing patterns in the codebase
Commits are cryptographically signed (git commit -S)

github-actions · 2026-05-05T18:37:09Z

🌿 Preview your docs: https://nvidia-preview-feat-talos-local-test.docs.buildwithfern.com/aicr

coderabbitai

Actionable comments posted: 6

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@Makefile`:
- Around line 660-665: The Makefile target talos-dev-env passes TALOS_KUBECONFIG
into the script as KUBECONFIG_OUT which is correct but mismatches the README
variable name and can confuse users; update either the Makefile or the
documentation for consistency: either (A) change the environment var passed to
./tools/talos-test/up.sh to use TALOS_KUBECONFIG as the name consumed by the
script, or (B) add a one-line comment above the talos-dev-env target explaining
that TALOS_KUBECONFIG is forwarded into up.sh as KUBECONFIG_OUT, and optionally
update README.md to mention KUBECONFIG_OUT as the internal name; locate the
talos-dev-env target in the Makefile and the up.sh script to align names or add
the clarifying comment referencing KUBECONFIG_OUT, TALOS_KUBECONFIG, and
./tools/talos-test/up.sh.

In `@tests/chainsaw/snapshot/deploy-agent-talos/chainsaw-test.yaml`:
- Around line 78-91: The test uses a hardcoded
WORK="/tmp/chainsaw-snapshot-talos" which can collide in concurrent runs; change
the script that defines WORK to create a unique temp dir (e.g., assign
WORK="$(mktemp -d)" or similar) and update all uses of WORK in the "script"
block where chainsaw assert runs and in the "cleanup" block so the cleanup
removes that same unique directory (ensure mktemp failure is handled or the
script exits if WORK is empty).

In `@tools/talos-test/README.md`:
- Around line 66-84: Update the Troubleshooting section to use consistent
formatting by wrapping each problem statement/error condition in backticks
(e.g., `talosctl: command not found`, `localhost:5001 registry not reachable`,
`Image pull fails inside the Talos node with a TLS error`,
`localhost:5001/aicr:local not found inside the cluster`) and correct the
wording that currently reads "kind path" to "kind cluster" to avoid confusion;
ensure these edits are applied under the "Troubleshooting" heading and preserve
the explanatory text that follows each backticked problem.
- Around line 86-93: Replace the vague phrase "the kind path uses" in the "Why a
separate cluster?" paragraph with a clearer reference to the kind-based
development environment (e.g., "the path used by the kind-based dev environment"
or "the path used by KinD clusters"), so the sentence comparing Talos to kind
explicitly mentions KinD/kind-based dev environments; update the sentence
containing the string "the kind path uses" to use the clearer wording while
keeping the rest of the paragraph intact.

In `@tools/talos-test/up.sh`:
- Around line 47-52: The docker invocation in the up.sh script uses the floating
image tag alpine:3 which can change; update the command that contains "docker
run ... alpine:3 sh -c 'modprobe br_netfilter ...'" to reference a pinned Alpine
tag or digest (e.g., a specific version tag or `@sha256` digest) so the modprobe
check is reproducible in CI; replace the literal "alpine:3" string with the
chosen pinned image reference throughout the script and keep the rest of the
command unchanged.
- Around line 70-75: The talosctl invocation mistakenly includes the deprecated
"docker" subcommand; update the command in the script where talosctl cluster
create is called (referenced by variables CLUSTER_NAME, TALOS_VERSION,
PATCH_FILE, MIRROR_HOST) to remove the "docker" token so it uses the form
talosctl cluster create with flags --name "${CLUSTER_NAME}" --workers 1 --image
"ghcr.io/siderolabs/talos:${TALOS_VERSION}" --config-patch "@${PATCH_FILE}"
(keep the existing flags and variables intact, only remove the literal
"docker").

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 5f48d4ab-cea7-425a-a581-a82c58392934

📥 Commits

Reviewing files that changed from the base of the PR and between 1e644aa and 2100ca1.

📒 Files selected for processing (12)

Makefile
docs/user/cli-reference.md
pkg/cli/snapshot.go
pkg/k8s/agent/job.go
pkg/k8s/agent/job_test.go
pkg/k8s/agent/types.go
pkg/snapshotter/agent.go
tests/chainsaw/snapshot/deploy-agent-talos/assert-snapshot-content.yaml
tests/chainsaw/snapshot/deploy-agent-talos/chainsaw-test.yaml
tools/talos-test/README.md
tools/talos-test/down.sh
tools/talos-test/up.sh

mchmarny

Solid PR — the Talos OS=talos pod-shape branch from #714 finally has end-to-end coverage, and the --requests/--limits plumbing is clean (separate types, merge-not-replace semantics, defaults preserved, non-mutating helper, unit tests exercising all three branches).

One medium concern on the --require-gpu vs --limits nvidia.com/gpu=N precedence — current behavior silently overrides the explicit limit and the CLI help text reads ambiguously. Worth deciding whether --require-gpu is "set if unset" or "always force to 1" and asserting it in tests. Plus a few nits on duplicate-key handling, image pinning in dev tooling, multi-arch find logic, and surfacing the namespace PSP relabel in the README. Nothing blocking; CI is still green where complete.

github-actions · 2026-05-05T18:46:54Z

Coverage Report ✅

Metric	Value
Coverage	76.8%
Threshold	75%
Status	Pass

Coverage Badge

![Coverage](https://img.shields.io/badge/coverage-76.8%25-green)

Merging this branch will increase overall coverage

Impacted Packages	Coverage Δ	🤖
github.com/NVIDIA/aicr/pkg/cli	62.92% (+0.34%)	👍
github.com/NVIDIA/aicr/pkg/k8s/agent	78.59% (+0.63%)	👍
github.com/NVIDIA/aicr/pkg/snapshotter	51.05% (ø)

Coverage by file

Changed files (no unit tests)

Changed File	Coverage Δ	Total	Covered	Missed	🤖
github.com/NVIDIA/aicr/pkg/cli/snapshot.go	31.71% (+27.71%)	82 (+32)	26 (+24)	56 (+8)	🌟
github.com/NVIDIA/aicr/pkg/k8s/agent/job.go	89.36% (+1.13%)	94 (+9)	84 (+9)	10	👍
github.com/NVIDIA/aicr/pkg/k8s/agent/types.go	100.00% (ø)	1	1	0
github.com/NVIDIA/aicr/pkg/snapshotter/agent.go	37.97% (ø)	158	60	98

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

…re-gpu PR #762 review (mchmarny): RequireGPU=true silently overwrote any nvidia.com/gpu key the caller put in --limits, because the GPU default was unconditionally written after mergeResourceList. A user invoking --require-gpu --limits nvidia.com/gpu=4 ended up with gpu=1. Now RequireGPU only inserts the default of 1 when the merged limits don't already contain nvidia.com/gpu — caller override wins. Updated the agent.Config.Limits and snapshotter.AgentConfig.Limits doc comments, the --limits CLI flag help, and docs/user/cli-reference.md to match. Added a test case in TestApplyPrivilegedSettings_ResourceOverrides covering the override-wins path; renamed the existing case to make the default-when-absent path explicit.

PR #762 review (mchmarny): parseResourceList silently last-write-wins on duplicate keys (e.g. --requests cpu=1,cpu=2 produced cpu=2 with no warning). Easy footgun with shell-templated invocations and AICR_REQUESTS / AICR_LIMITS env-var composition. Now returns an explicit 'duplicate key %q' error. Added TestParseResourceList in pkg/cli/snapshot_test.go covering happy paths, all existing error paths, and both the bare and whitespace-normalized duplicate-key cases.

PR #762 review (mchmarny): the previous 'find dist -maxdepth 1 -type d -name aicr_*' returned alphabetical-first, so on a multi-platform goreleaser build it could pick the wrong arch (e.g. aicr_darwin_amd64_v3 before aicr_darwin_arm64_v8.0). The chosen binary would then silently fail to exec on the host. Now the find pattern is anchored to 'go env GOOS' / 'go env GOARCH', so the lookup is deterministic regardless of which platforms a prior 'make build' produced. Error message also updated to surface the host-specific path it expected.

PR #762 review (mchmarny): up.sh permanently labels the 'default' namespace with pod-security.kubernetes.io/enforce=privileged (plus audit/warn). That posture change persists past the snapshot test — anything else a developer schedules in 'default' afterward also runs at the privileged baseline. New 'Side effects' section in tools/talos-test/README.md surfaces both posture changes (the namespace relabel and the br_netfilter modprobe on the host VM), explains why each is required, and notes that 'talos-dev-env-clean' does not revert them — restart the host runtime VM to drop br_netfilter; recreate the namespace to drop the PSS labels.

github-actions · 2026-05-07T16:48:11Z

@ayuskauskas this PR now has merge conflicts with main. Please rebase to resolve them.

…/AICR_LIMITS PR #762 review (CodeRabbit, after fixup commits): * parseResourceList previously emitted plain fmt.Errorf errors, which violates the project rule that CLI-path validation errors carry pkg/errors codes. Switch all four return sites to errors.New(ErrCodeInvalidRequest, ...) (or errors.Wrap for the resource.ParseQuantity case where preserving the inner error is useful), and switch the two call sites in snapshotCmd's Action from errors.Wrap to errors.PropagateOrWrap so the inner code propagates unmodified instead of being re-wrapped. * docs/user/cli-reference.md's centralized 'Environment Variables' table omitted AICR_REQUESTS and AICR_LIMITS even though both are documented inline on the per-flag table. Add rows for both, pointing at the corresponding --requests / --limits flag and reproducing the --require-gpu precedence note. Tests: go test -run TestParseResourceList ./pkg/cli/... -v # all pass golangci-lint run -c .golangci.yaml ./pkg/cli/... # 0 issues make lint-yaml # pass

Four trivial-but-real cleanups, one per CodeRabbit thread: * tools/talos-test/up.sh: pin alpine:3 -> alpine:3.21 for the br_netfilter modprobe sidecar (matches existing pinning convention in docs/integrator/validator-extension.md). Also updates the troubleshooting hint string to reflect the pinned tag. Resolves: 3190760718 * Makefile: add a one-line comment above the talos-dev-env recipe documenting the TALOS_KUBECONFIG (user-facing) -> KUBECONFIG_OUT (script-internal) variable indirection so a reader of the Makefile alone does not have to dive into up.sh to understand the rename. Resolves: 3190760664 * tests/chainsaw/snapshot/deploy-agent-talos/chainsaw-test.yaml: switch the assert step from a fixed /tmp/chainsaw-snapshot-talos path to a per-run mktemp -d directory with a trap-based cleanup so parallel CI shards on a shared runner cannot collide. Drops the old fixed-path rm -rf from the cleanup block. Resolves: 3190760666 * tools/talos-test/README.md: bold-format the four troubleshooting problem statements for consistency, fix two 'kind path' references to read 'kind-based dev cluster', and replace 'Bring it up' with 'Start it' in the registry section (the LanguageTool flag CodeRabbit surfaced). Resolves: 3190760688, 3190760707

Adds an opt-in, kind-equivalent harness for spinning up a local Talos Linux cluster on a developer laptop and exercising the AICR snapshot agent against it. Closes the integration-test gap left by #714 (Talos collector) — the OS=talos pod-shape branch in pkg/k8s/agent/job.go was unit-tested only. The harness has two independent pieces stitched together by make: * tools/talos-test/{up.sh,down.sh,README.md}: cluster lifecycle. Wraps 'talosctl cluster create docker' (newer talosctl moved provisioners to subcommands), threads a containerd registry-mirror config patch so 'localhost:5001/aicr:local' resolves from inside Talos nodes, and on macOS Docker Desktop / Podman Machine loads br_netfilter via a one-shot privileged sidecar so the default Flannel CNI works. Streams 'talosctl health' progress, derives the apid endpoint and K8s API server URL from the docker-NAT'd host ports, rewrites the kubeconfig to the host-routable form, and relaxes Pod Security Standards on the default namespace so the privileged agent pod can be scheduled. Documented prerequisites and install commands for talosctl in the README. * tests/chainsaw/snapshot/deploy-agent-talos/: chainsaw test. Invokes 'aicr snapshot --os talos -o cm://default/aicr-e2e-snapshot' from a script step; the CLI exercises pkg/k8s/agent.Deployer end-to-end (RBAC creation, OS=talos pod shape, log streaming, Job completion, ConfigMap write, cleanup). Asserts only the surviving ConfigMap content. Uses chainsaw JMESPath filters ((sort([*].type)), (@[?type == 'X'] | [0].subtypes | sort(...))) so the assertion is order-independent at both the measurement-type and per-type subtype levels. Make targets (Makefile): talos-dev-env spin up cluster talos-dev-env-clean destroy cluster talos-snapshot-test build + run chainsaw against the live cluster None are wired into 'make qualify'; Talos is opt-in for now. CLI: The agent container's resource requests and limits were hardcoded (Privileged: 4Gi req / 8Gi lim memory; Restricted: 256Mi / 512Mi). That sizing fits production GPU nodes but blocks scheduling on small dev clusters such as the talosctl Docker provisioner default (2Gi worker), where there is no node-level option to bypass. Adds: --requests 'cpu=500m,memory=1Gi,ephemeral-storage=1Gi' --limits 'cpu=1,memory=2Gi,ephemeral-storage=2Gi' Each accepts a comma-separated 'name=quantity' list. Unspecified keys fall back to the existing per-mode defaults so the production path is unchanged. The nvidia.com/gpu limit injected by --require-gpu is preserved on top of any override. Plumbed through pkg/snapshotter.AgentConfig and pkg/k8s/agent.Config as corev1.ResourceList; applyPrivilegedSettings and applyRestrictedSettings merge overrides over the defaults via a new mergeResourceList helper. Tests: pkg/k8s/agent: TestApplyPrivilegedSettings_ResourceOverrides (no override, partial override, RequireGPU + override), TestApplyRestrictedSettings_ResourceOverrides, TestMergeResourceList (input-immutability invariant). Docs: docs/user/cli-reference.md --requests / --limits flag table entries tools/talos-test/README.md prerequisites, quickstart, customization, troubleshooting Components Affected: CLI (pkg/cli) Agent (pkg/k8s/agent, pkg/snapshotter) Tests/CI (tests/chainsaw/snapshot/deploy-agent-talos, tools/talos-test) Build (Makefile) Docs (docs/user/cli-reference.md) Tested locally end-to-end on macOS Podman Machine (Talos v1.9.0 + talosctl 1.13.0) — bring-up, agent deployment, snapshot assertion, teardown all pass.

…re-gpu PR #762 review (mchmarny): RequireGPU=true silently overwrote any nvidia.com/gpu key the caller put in --limits, because the GPU default was unconditionally written after mergeResourceList. A user invoking --require-gpu --limits nvidia.com/gpu=4 ended up with gpu=1. Now RequireGPU only inserts the default of 1 when the merged limits don't already contain nvidia.com/gpu — caller override wins. Updated the agent.Config.Limits and snapshotter.AgentConfig.Limits doc comments, the --limits CLI flag help, and docs/user/cli-reference.md to match. Added a test case in TestApplyPrivilegedSettings_ResourceOverrides covering the override-wins path; renamed the existing case to make the default-when-absent path explicit.

PR #762 review (mchmarny): parseResourceList silently last-write-wins on duplicate keys (e.g. --requests cpu=1,cpu=2 produced cpu=2 with no warning). Easy footgun with shell-templated invocations and AICR_REQUESTS / AICR_LIMITS env-var composition. Now returns an explicit 'duplicate key %q' error. Added TestParseResourceList in pkg/cli/snapshot_test.go covering happy paths, all existing error paths, and both the bare and whitespace-normalized duplicate-key cases.

PR #762 review (mchmarny): the previous 'find dist -maxdepth 1 -type d -name aicr_*' returned alphabetical-first, so on a multi-platform goreleaser build it could pick the wrong arch (e.g. aicr_darwin_amd64_v3 before aicr_darwin_arm64_v8.0). The chosen binary would then silently fail to exec on the host. Now the find pattern is anchored to 'go env GOOS' / 'go env GOARCH', so the lookup is deterministic regardless of which platforms a prior 'make build' produced. Error message also updated to surface the host-specific path it expected.

PR #762 review (mchmarny): up.sh permanently labels the 'default' namespace with pod-security.kubernetes.io/enforce=privileged (plus audit/warn). That posture change persists past the snapshot test — anything else a developer schedules in 'default' afterward also runs at the privileged baseline. New 'Side effects' section in tools/talos-test/README.md surfaces both posture changes (the namespace relabel and the br_netfilter modprobe on the host VM), explains why each is required, and notes that 'talos-dev-env-clean' does not revert them — restart the host runtime VM to drop br_netfilter; recreate the namespace to drop the PSS labels.

…/AICR_LIMITS PR #762 review (CodeRabbit, after fixup commits): * parseResourceList previously emitted plain fmt.Errorf errors, which violates the project rule that CLI-path validation errors carry pkg/errors codes. Switch all four return sites to errors.New(ErrCodeInvalidRequest, ...) (or errors.Wrap for the resource.ParseQuantity case where preserving the inner error is useful), and switch the two call sites in snapshotCmd's Action from errors.Wrap to errors.PropagateOrWrap so the inner code propagates unmodified instead of being re-wrapped. * docs/user/cli-reference.md's centralized 'Environment Variables' table omitted AICR_REQUESTS and AICR_LIMITS even though both are documented inline on the per-flag table. Add rows for both, pointing at the corresponding --requests / --limits flag and reproducing the --require-gpu precedence note. Tests: go test -run TestParseResourceList ./pkg/cli/... -v # all pass golangci-lint run -c .golangci.yaml ./pkg/cli/... # 0 issues make lint-yaml # pass

Four trivial-but-real cleanups, one per CodeRabbit thread: * tools/talos-test/up.sh: pin alpine:3 -> alpine:3.21 for the br_netfilter modprobe sidecar (matches existing pinning convention in docs/integrator/validator-extension.md). Also updates the troubleshooting hint string to reflect the pinned tag. Resolves: 3190760718 * Makefile: add a one-line comment above the talos-dev-env recipe documenting the TALOS_KUBECONFIG (user-facing) -> KUBECONFIG_OUT (script-internal) variable indirection so a reader of the Makefile alone does not have to dive into up.sh to understand the rename. Resolves: 3190760664 * tests/chainsaw/snapshot/deploy-agent-talos/chainsaw-test.yaml: switch the assert step from a fixed /tmp/chainsaw-snapshot-talos path to a per-run mktemp -d directory with a trap-based cleanup so parallel CI shards on a shared runner cannot collide. Drops the old fixed-path rm -rf from the cleanup block. Resolves: 3190760666 * tools/talos-test/README.md: bold-format the four troubleshooting problem statements for consistency, fix two 'kind path' references to read 'kind-based dev cluster', and replace 'Bring it up' with 'Start it' in the registry section (the LanguageTool flag CodeRabbit surfaced). Resolves: 3190760688, 3190760707

…CR_LIMITS env doc PR #762 review (CodeRabbit, post-rebase batch): * parseResourceList accepted negative quantities like '--requests cpu=-1' or '--limits memory=-1Gi'; the failure surfaced only later when the K8s API rejected the Job. Now rejected at parse time with a structured ErrCodeInvalidRequest carrying the offending entry. Zero is still accepted (legitimate for some resources). Test cases added to TestParseResourceList: negative cpu, negative memory with suffix, negative-in-second-entry, and a positive-zero case for the boundary. Resolves: 3203439877 * Environment Variables table's AICR_LIMITS row in docs/user/cli-reference.md was missing the 'Unspecified resources keep the built-in defaults' clause that the --limits flag's inline doc had. Added it; AICR_REQUESTS already had the analogous wording. Resolves: 3203439871 Tested: go test -run TestParseResourceList -v ./pkg/cli/... # all pass

…ence PR #762 review (CodeRabbit, post-rebase batch): * up.sh's preflight 'for tool in talosctl docker kubectl chainsaw' loop did not include curl, but the next step calls 'curl -sf http://localhost:5001/v2/' for the registry check. On a host without curl that check would have failed and the script would have printed the misleading 'registry not reachable' remediation. Add curl to the preflight list and to the README prerequisites. Resolves: 3203439882 * up.sh writes the cluster's apid endpoint and node IP to the user's default talosctl config (~/.talos/config) via 'talosctl config endpoint/node'. This persists past 'make talos-dev-env-clean', leaving subsequent interactive talosctl commands pointed at a destroyed cluster. Promoted the Side effects section from two to three items, documented the TALOSCONFIG=$(mktemp) workaround for users who want to keep their default config untouched, and pointed at 'talosctl config remove-context aicr-talos' for cleanup after teardown. Resolves: 3203439886

…#762)

ayuskauskas requested review from a team as code owners May 5, 2026 18:36

github-actions Bot added area/tests area/docs area/cli size/XL labels May 5, 2026

mchmarny assigned ayuskauskas May 5, 2026

This comment was marked as resolved.

Sign in to view

coderabbitai Bot reviewed May 5, 2026

View reviewed changes

Comment thread Makefile

Comment thread tests/chainsaw/snapshot/deploy-agent-talos/chainsaw-test.yaml Outdated

Comment thread tools/talos-test/README.md

Comment thread tools/talos-test/README.md

Comment thread tools/talos-test/up.sh

Comment thread tools/talos-test/up.sh

mchmarny reviewed May 5, 2026

View reviewed changes

Comment thread pkg/k8s/agent/job.go Outdated

Comment thread pkg/cli/snapshot.go

Comment thread tools/talos-test/up.sh

Comment thread Makefile

Comment thread tools/talos-test/README.md

This comment was marked as resolved.

Sign in to view

github-actions Bot added the needs-rebase label May 7, 2026

This comment was marked as resolved.

Sign in to view

ayuskauskas added 8 commits May 7, 2026 16:13

ayuskauskas force-pushed the feat/talos_local_test branch from 858d3bc to c73dad1 Compare May 8, 2026 17:10

Merge branch 'main' into feat/talos_local_test

c0a9808

mchmarny approved these changes May 9, 2026

View reviewed changes

mchmarny merged commit b9488a3 into main May 9, 2026
36 checks passed

mchmarny deleted the feat/talos_local_test branch May 9, 2026 11:06

xdu31 pushed a commit to xdu31/aicr that referenced this pull request May 11, 2026

feat(tools): add local Talos cluster + snapshot chainsaw test (NVIDIA…

a336a2d

…#762)

This was referenced May 12, 2026

feat: os-talos mixin + bundler preManifestFiles support #856

Merged

[Feature]: Talos Linux support #565

Closed

Conversation

ayuskauskas commented May 5, 2026

Summary

Motivation / Context

Type of Change

Component(s) Affected

Implementation Notes

tools/talos-test/{up.sh,down.sh,README.md} — cluster lifecycle

tests/chainsaw/snapshot/deploy-agent-talos/ — chainsaw test

aicr snapshot — --requests / --limits

Make targets

Testing

Risk Assessment

Checklist

Uh oh!

github-actions Bot commented May 5, 2026

Uh oh!

This comment was marked as resolved.

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

mchmarny left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

github-actions Bot commented May 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Coverage Report ✅

Merging this branch will increase overall coverage

Changed files (no unit tests)

Uh oh!

This comment was marked as resolved.

Uh oh!

github-actions Bot commented May 7, 2026

Uh oh!

This comment was marked as resolved.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

`tools/talos-test/{up.sh,down.sh,README.md}` — cluster lifecycle

`tests/chainsaw/snapshot/deploy-agent-talos/` — chainsaw test

`aicr snapshot` — `--requests` / `--limits`

github-actions Bot commented May 5, 2026 •

edited

Loading