Skip to content

feat(tools): add local Talos cluster + snapshot chainsaw test#762

Merged
mchmarny merged 10 commits into
mainfrom
feat/talos_local_test
May 9, 2026
Merged

feat(tools): add local Talos cluster + snapshot chainsaw test#762
mchmarny merged 10 commits into
mainfrom
feat/talos_local_test

Conversation

@ayuskauskas
Copy link
Copy Markdown
Contributor

Summary

Adds an opt-in, kind-equivalent harness for spinning up a local Talos
Linux cluster on a developer laptop (tools/talos-test/) and a chainsaw
test that exercises the AICR snapshot agent against it
(tests/chainsaw/snapshot/deploy-agent-talos/). Also adds --requests
and --limits flags to aicr snapshot so the agent can be right-sized
on resource-constrained dev clusters.

Motivation / Context

PR #714 added Talos OS support (pkg/collector/talos/ + the OS=talos
pod-shape branch in pkg/k8s/agent/job.go), but the only coverage was
unit tests. There was no integration test running the snapshot agent
against an actual Talos node — and no easy way for a developer on macOS
to exercise that path locally without standing up a remote cluster.

The agent container's resource requests/limits were also hardcoded
(Privileged: 4Gi req / 8Gi lim memory). That sizing fits production
GPU nodes but blocks scheduling on the talosctl Docker provisioner's
default 2Gi worker, with no way to override at the node level.

Fixes: N/A
Related: #714, #565

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Refactoring (no functional changes)
  • Build/CI/tooling

Component(s) Affected

  • CLI (cmd/aicr, pkg/cli)
  • API server (cmd/aicrd, pkg/api, pkg/server)
  • Recipe engine / data (pkg/recipe)
  • Bundlers (pkg/bundler, pkg/component/*)
  • Collectors / snapshotter (pkg/collector, pkg/snapshotter)
  • Validator (pkg/validator)
  • Core libraries (pkg/errors, pkg/k8s)
  • Docs/examples (docs/, examples/)
  • Other: tests/chainsaw, tools/, Makefile

Implementation Notes

tools/talos-test/{up.sh,down.sh,README.md} — cluster lifecycle

  • Wraps talosctl cluster create docker (newer talosctl moved
    provisioners to subcommands; the old --provisioner flag is gone).
  • Threads a containerd registry-mirror config patch via
    --config-patch so localhost:5001/aicr:local resolves transparently
    from inside Talos node containers (the host registry started by
    make dev-env).
  • br_netfilter preflight: the macOS host VM (Docker Desktop's
    LinuxKit or Podman Machine's Fedora CoreOS) ships the module but
    does not auto-load it, so flannel CNI fails until we modprobe it via
    a one-shot privileged sidecar with /lib/modules bind-mounted from
    the VM. No-op on Linux hosts.
  • talosctl health --wait-timeout 8m streams readiness phases
    (etcd, apid, kubelet, control plane, all k8s nodes ready, kube-proxy,
    CoreDNS). On failure the script prints diagnostic commands the
    developer can run to drill into stuck services.
  • The Talos in-network IP (10.5.0.0/24) is not host-routable on
    macOS, so the script derives the docker-NAT'd apid host port (50000)
    and K8s API host port (6443) from docker port, passes them as
    --endpoints, and rewrites the kubeconfig server URL to
    https://127.0.0.1:<host-port> with insecure-skip-tls-verify=true.
  • Relabels the default namespace with
    pod-security.kubernetes.io/enforce=privileged so the snapshot
    agent's privileged pod can be scheduled (Talos enforces restricted
    cluster-wide by default).

tests/chainsaw/snapshot/deploy-agent-talos/ — chainsaw test

  • Invokes aicr snapshot --os talos -o cm://default/aicr-e2e-snapshot
    from a script: step. The CLI exercises pkg/k8s/agent.Deployer
    end-to-end (RBAC creation, OS=talos pod shape from
    pkg/k8s/agent/job.go, log streaming, Job completion, ConfigMap
    write, cleanup) — no hand-applied YAML fixtures to drift against
    job.go (the existing kind chainsaw test's drift risk is captured
    for follow-up in issues/e2e-update.md, kept local).
  • Asserts only the surviving ConfigMap content. Uses chainsaw
    JMESPath filters ((sort([*].type)),
    (@[?type == 'X'] | [0].subtypes | sort([*].subtype))) so the
    assertion is order-independent at both the measurement-type and
    per-type subtype levels — collector goroutine completion order
    is non-deterministic.
  • Pinned: 5 measurements; the set of types
    {GPU, K8s, NodeTopology, OS, SystemD}; per-type subtype names.
  • Per-subtype data values are intentionally not pinned to avoid
    Talos / kernel / kubernetes version churn.

aicr snapshot--requests / --limits

Adds two CLI flags accepting kubectl-style comma-separated
name=quantity lists:

--requests cpu=500m,memory=1Gi,ephemeral-storage=1Gi
--limits   cpu=1,memory=2Gi,ephemeral-storage=2Gi

Plumbed through pkg/snapshotter.AgentConfig and pkg/k8s/agent.Config
as corev1.ResourceList. applyPrivilegedSettings and
applyRestrictedSettings merge overrides on top of the existing
hardcoded defaults via a new mergeResourceList helper, so unspecified
keys keep the production defaults. The nvidia.com/gpu limit added
by --require-gpu is preserved on top of any override.

Make targets

talos-dev-env, talos-dev-env-clean, talos-snapshot-test. None
are wired into make qualify; Talos remains opt-in for now.

Testing

unset GITLAB_TOKEN
make build
make test     # touched packages
golangci-lint run -c .golangci.yaml ./pkg/k8s/agent/... ./pkg/cli/... ./pkg/snapshotter/...
make lint-yaml

End-to-end against a live local cluster (macOS Podman Machine, Talos
v1.9.0, talosctl 1.13.0):

make image IMAGE_REGISTRY=localhost:5001/aicr IMAGE_TAG=local
make build
make talos-dev-env
make talos-snapshot-test          # PASS
make talos-dev-env-clean

Coverage delta on touched packages:

  • pkg/k8s/agent: 77.8% (baseline) → ~78% (added tests for resource-
    override paths)
  • pkg/cli: ~52.9% (baseline) → comparable (added flag plumbing
    • parseResourceList; covered indirectly)
  • pkg/snapshotter: ~52.2% (baseline) → comparable (1-line struct
    field + plumbing only)

Risk Assessment

  • Low — Isolated change, well-tested, easy to revert
  • Medium — Touches multiple components or has broader impact
  • High — Breaking change, affects critical paths, or complex rollout

Rollout notes: Production callers of aicr snapshot are unaffected;
both new flags default to empty, in which case the hardcoded
privileged/restricted defaults remain in force exactly as before. The
Talos test harness is a new opt-in surface and is not wired into
make qualify or any CI workflow.

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint) — only go: no such tool covdata on
    cmd/aicr/cmd/aicrd (a pre-existing local toolchain issue,
    unrelated to this branch — does not affect CI)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 5, 2026

@coderabbitai

This comment was marked as resolved.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@Makefile`:
- Around line 660-665: The Makefile target talos-dev-env passes TALOS_KUBECONFIG
into the script as KUBECONFIG_OUT which is correct but mismatches the README
variable name and can confuse users; update either the Makefile or the
documentation for consistency: either (A) change the environment var passed to
./tools/talos-test/up.sh to use TALOS_KUBECONFIG as the name consumed by the
script, or (B) add a one-line comment above the talos-dev-env target explaining
that TALOS_KUBECONFIG is forwarded into up.sh as KUBECONFIG_OUT, and optionally
update README.md to mention KUBECONFIG_OUT as the internal name; locate the
talos-dev-env target in the Makefile and the up.sh script to align names or add
the clarifying comment referencing KUBECONFIG_OUT, TALOS_KUBECONFIG, and
./tools/talos-test/up.sh.

In `@tests/chainsaw/snapshot/deploy-agent-talos/chainsaw-test.yaml`:
- Around line 78-91: The test uses a hardcoded
WORK="/tmp/chainsaw-snapshot-talos" which can collide in concurrent runs; change
the script that defines WORK to create a unique temp dir (e.g., assign
WORK="$(mktemp -d)" or similar) and update all uses of WORK in the "script"
block where chainsaw assert runs and in the "cleanup" block so the cleanup
removes that same unique directory (ensure mktemp failure is handled or the
script exits if WORK is empty).

In `@tools/talos-test/README.md`:
- Around line 66-84: Update the Troubleshooting section to use consistent
formatting by wrapping each problem statement/error condition in backticks
(e.g., `talosctl: command not found`, `localhost:5001 registry not reachable`,
`Image pull fails inside the Talos node with a TLS error`,
`localhost:5001/aicr:local not found inside the cluster`) and correct the
wording that currently reads "kind path" to "kind cluster" to avoid confusion;
ensure these edits are applied under the "Troubleshooting" heading and preserve
the explanatory text that follows each backticked problem.
- Around line 86-93: Replace the vague phrase "the kind path uses" in the "Why a
separate cluster?" paragraph with a clearer reference to the kind-based
development environment (e.g., "the path used by the kind-based dev environment"
or "the path used by KinD clusters"), so the sentence comparing Talos to kind
explicitly mentions KinD/kind-based dev environments; update the sentence
containing the string "the kind path uses" to use the clearer wording while
keeping the rest of the paragraph intact.

In `@tools/talos-test/up.sh`:
- Around line 47-52: The docker invocation in the up.sh script uses the floating
image tag alpine:3 which can change; update the command that contains "docker
run ... alpine:3 sh -c 'modprobe br_netfilter ...'" to reference a pinned Alpine
tag or digest (e.g., a specific version tag or `@sha256` digest) so the modprobe
check is reproducible in CI; replace the literal "alpine:3" string with the
chosen pinned image reference throughout the script and keep the rest of the
command unchanged.
- Around line 70-75: The talosctl invocation mistakenly includes the deprecated
"docker" subcommand; update the command in the script where talosctl cluster
create is called (referenced by variables CLUSTER_NAME, TALOS_VERSION,
PATCH_FILE, MIRROR_HOST) to remove the "docker" token so it uses the form
talosctl cluster create with flags --name "${CLUSTER_NAME}" --workers 1 --image
"ghcr.io/siderolabs/talos:${TALOS_VERSION}" --config-patch "@${PATCH_FILE}"
(keep the existing flags and variables intact, only remove the literal
"docker").
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 5f48d4ab-cea7-425a-a581-a82c58392934

📥 Commits

Reviewing files that changed from the base of the PR and between 1e644aa and 2100ca1.

📒 Files selected for processing (12)
  • Makefile
  • docs/user/cli-reference.md
  • pkg/cli/snapshot.go
  • pkg/k8s/agent/job.go
  • pkg/k8s/agent/job_test.go
  • pkg/k8s/agent/types.go
  • pkg/snapshotter/agent.go
  • tests/chainsaw/snapshot/deploy-agent-talos/assert-snapshot-content.yaml
  • tests/chainsaw/snapshot/deploy-agent-talos/chainsaw-test.yaml
  • tools/talos-test/README.md
  • tools/talos-test/down.sh
  • tools/talos-test/up.sh

Comment thread Makefile
Comment thread tests/chainsaw/snapshot/deploy-agent-talos/chainsaw-test.yaml Outdated
Comment thread tools/talos-test/README.md
Comment thread tools/talos-test/README.md
Comment thread tools/talos-test/up.sh
Comment thread tools/talos-test/up.sh
Copy link
Copy Markdown
Member

@mchmarny mchmarny left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Solid PR — the Talos OS=talos pod-shape branch from #714 finally has end-to-end coverage, and the --requests/--limits plumbing is clean (separate types, merge-not-replace semantics, defaults preserved, non-mutating helper, unit tests exercising all three branches).

One medium concern on the --require-gpu vs --limits nvidia.com/gpu=N precedence — current behavior silently overrides the explicit limit and the CLI help text reads ambiguously. Worth deciding whether --require-gpu is "set if unset" or "always force to 1" and asserting it in tests. Plus a few nits on duplicate-key handling, image pinning in dev tooling, multi-arch find logic, and surfacing the namespace PSP relabel in the README. Nothing blocking; CI is still green where complete.

Comment thread pkg/k8s/agent/job.go Outdated
Comment thread pkg/cli/snapshot.go
Comment thread tools/talos-test/up.sh
Comment thread Makefile
Comment thread tools/talos-test/README.md
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 5, 2026

Coverage Report ✅

Metric Value
Coverage 76.8%
Threshold 75%
Status Pass
Coverage Badge
![Coverage](https://img.shields.io/badge/coverage-76.8%25-green)

Merging this branch will increase overall coverage

Impacted Packages Coverage Δ 🤖
github.com/NVIDIA/aicr/pkg/cli 62.92% (+0.34%) 👍
github.com/NVIDIA/aicr/pkg/k8s/agent 78.59% (+0.63%) 👍
github.com/NVIDIA/aicr/pkg/snapshotter 51.05% (ø)

Coverage by file

Changed files (no unit tests)

Changed File Coverage Δ Total Covered Missed 🤖
github.com/NVIDIA/aicr/pkg/cli/snapshot.go 31.71% (+27.71%) 82 (+32) 26 (+24) 56 (+8) 🌟
github.com/NVIDIA/aicr/pkg/k8s/agent/job.go 89.36% (+1.13%) 94 (+9) 84 (+9) 10 👍
github.com/NVIDIA/aicr/pkg/k8s/agent/types.go 100.00% (ø) 1 1 0
github.com/NVIDIA/aicr/pkg/snapshotter/agent.go 37.97% (ø) 158 60 98

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

ayuskauskas added a commit that referenced this pull request May 6, 2026
…re-gpu

PR #762 review (mchmarny): RequireGPU=true silently overwrote any
nvidia.com/gpu key the caller put in --limits, because the GPU default
was unconditionally written after mergeResourceList. A user invoking
--require-gpu --limits nvidia.com/gpu=4 ended up with gpu=1.

Now RequireGPU only inserts the default of 1 when the merged limits
don't already contain nvidia.com/gpu — caller override wins. Updated
the agent.Config.Limits and snapshotter.AgentConfig.Limits doc
comments, the --limits CLI flag help, and docs/user/cli-reference.md
to match. Added a test case in TestApplyPrivilegedSettings_ResourceOverrides
covering the override-wins path; renamed the existing case to make
the default-when-absent path explicit.
ayuskauskas added a commit that referenced this pull request May 6, 2026
PR #762 review (mchmarny): parseResourceList silently last-write-wins
on duplicate keys (e.g. --requests cpu=1,cpu=2 produced cpu=2 with no
warning). Easy footgun with shell-templated invocations and
AICR_REQUESTS / AICR_LIMITS env-var composition.

Now returns an explicit 'duplicate key %q' error. Added
TestParseResourceList in pkg/cli/snapshot_test.go covering happy
paths, all existing error paths, and both the bare and
whitespace-normalized duplicate-key cases.
ayuskauskas added a commit that referenced this pull request May 6, 2026
PR #762 review (mchmarny): the previous 'find dist -maxdepth 1 -type d
-name aicr_*' returned alphabetical-first, so on a multi-platform
goreleaser build it could pick the wrong arch (e.g.
aicr_darwin_amd64_v3 before aicr_darwin_arm64_v8.0). The chosen
binary would then silently fail to exec on the host.

Now the find pattern is anchored to 'go env GOOS' / 'go env GOARCH',
so the lookup is deterministic regardless of which platforms a prior
'make build' produced. Error message also updated to surface the
host-specific path it expected.
ayuskauskas added a commit that referenced this pull request May 6, 2026
PR #762 review (mchmarny): up.sh permanently labels the 'default'
namespace with pod-security.kubernetes.io/enforce=privileged (plus
audit/warn). That posture change persists past the snapshot test —
anything else a developer schedules in 'default' afterward also runs
at the privileged baseline.

New 'Side effects' section in tools/talos-test/README.md surfaces
both posture changes (the namespace relabel and the br_netfilter
modprobe on the host VM), explains why each is required, and notes
that 'talos-dev-env-clean' does not revert them — restart the host
runtime VM to drop br_netfilter; recreate the namespace to drop the
PSS labels.
coderabbitai[bot]

This comment was marked as resolved.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 7, 2026

@ayuskauskas this PR now has merge conflicts with main. Please rebase to resolve them.

ayuskauskas added a commit that referenced this pull request May 7, 2026
…/AICR_LIMITS

PR #762 review (CodeRabbit, after fixup commits):

* parseResourceList previously emitted plain fmt.Errorf errors, which
  violates the project rule that CLI-path validation errors carry
  pkg/errors codes. Switch all four return sites to
  errors.New(ErrCodeInvalidRequest, ...) (or errors.Wrap for the
  resource.ParseQuantity case where preserving the inner error is
  useful), and switch the two call sites in snapshotCmd's Action from
  errors.Wrap to errors.PropagateOrWrap so the inner code propagates
  unmodified instead of being re-wrapped.

* docs/user/cli-reference.md's centralized 'Environment Variables'
  table omitted AICR_REQUESTS and AICR_LIMITS even though both are
  documented inline on the per-flag table. Add rows for both,
  pointing at the corresponding --requests / --limits flag and
  reproducing the --require-gpu precedence note.

Tests:
  go test -run TestParseResourceList ./pkg/cli/... -v   # all pass
  golangci-lint run -c .golangci.yaml ./pkg/cli/...     # 0 issues
  make lint-yaml                                         # pass
ayuskauskas added a commit that referenced this pull request May 7, 2026
Four trivial-but-real cleanups, one per CodeRabbit thread:

* tools/talos-test/up.sh: pin alpine:3 -> alpine:3.21 for the
  br_netfilter modprobe sidecar (matches existing pinning convention
  in docs/integrator/validator-extension.md). Also updates the
  troubleshooting hint string to reflect the pinned tag.
  Resolves: 3190760718

* Makefile: add a one-line comment above the talos-dev-env recipe
  documenting the TALOS_KUBECONFIG (user-facing) -> KUBECONFIG_OUT
  (script-internal) variable indirection so a reader of the Makefile
  alone does not have to dive into up.sh to understand the rename.
  Resolves: 3190760664

* tests/chainsaw/snapshot/deploy-agent-talos/chainsaw-test.yaml:
  switch the assert step from a fixed /tmp/chainsaw-snapshot-talos
  path to a per-run mktemp -d directory with a trap-based cleanup so
  parallel CI shards on a shared runner cannot collide. Drops the
  old fixed-path rm -rf from the cleanup block.
  Resolves: 3190760666

* tools/talos-test/README.md: bold-format the four troubleshooting
  problem statements for consistency, fix two 'kind path' references
  to read 'kind-based dev cluster', and replace 'Bring it up' with
  'Start it' in the registry section (the LanguageTool flag CodeRabbit
  surfaced).
  Resolves: 3190760688, 3190760707
coderabbitai[bot]

This comment was marked as resolved.

Adds an opt-in, kind-equivalent harness for spinning up a local Talos
Linux cluster on a developer laptop and exercising the AICR snapshot
agent against it. Closes the integration-test gap left by #714 (Talos
collector) — the OS=talos pod-shape branch in pkg/k8s/agent/job.go was
unit-tested only.

The harness has two independent pieces stitched together by make:

* tools/talos-test/{up.sh,down.sh,README.md}: cluster lifecycle.
  Wraps 'talosctl cluster create docker' (newer talosctl moved
  provisioners to subcommands), threads a containerd registry-mirror
  config patch so 'localhost:5001/aicr:local' resolves from inside
  Talos nodes, and on macOS Docker Desktop / Podman Machine loads
  br_netfilter via a one-shot privileged sidecar so the default
  Flannel CNI works. Streams 'talosctl health' progress, derives the
  apid endpoint and K8s API server URL from the docker-NAT'd host
  ports, rewrites the kubeconfig to the host-routable form, and
  relaxes Pod Security Standards on the default namespace so the
  privileged agent pod can be scheduled. Documented prerequisites and
  install commands for talosctl in the README.

* tests/chainsaw/snapshot/deploy-agent-talos/: chainsaw test.
  Invokes 'aicr snapshot --os talos -o cm://default/aicr-e2e-snapshot'
  from a script step; the CLI exercises pkg/k8s/agent.Deployer
  end-to-end (RBAC creation, OS=talos pod shape, log streaming, Job
  completion, ConfigMap write, cleanup). Asserts only the surviving
  ConfigMap content. Uses chainsaw JMESPath filters
  ((sort([*].type)), (@[?type == 'X'] | [0].subtypes | sort(...)))
  so the assertion is order-independent at both the measurement-type
  and per-type subtype levels.

Make targets (Makefile):
  talos-dev-env        spin up cluster
  talos-dev-env-clean  destroy cluster
  talos-snapshot-test  build + run chainsaw against the live cluster
None are wired into 'make qualify'; Talos is opt-in for now.

CLI:
The agent container's resource requests and limits were hardcoded
(Privileged: 4Gi req / 8Gi lim memory; Restricted: 256Mi / 512Mi).
That sizing fits production GPU nodes but blocks scheduling on small
dev clusters such as the talosctl Docker provisioner default (2Gi
worker), where there is no node-level option to bypass. Adds:

  --requests 'cpu=500m,memory=1Gi,ephemeral-storage=1Gi'
  --limits   'cpu=1,memory=2Gi,ephemeral-storage=2Gi'

Each accepts a comma-separated 'name=quantity' list. Unspecified keys
fall back to the existing per-mode defaults so the production path is
unchanged. The nvidia.com/gpu limit injected by --require-gpu is
preserved on top of any override. Plumbed through
pkg/snapshotter.AgentConfig and pkg/k8s/agent.Config as
corev1.ResourceList; applyPrivilegedSettings and
applyRestrictedSettings merge overrides over the defaults via a new
mergeResourceList helper.

Tests:
  pkg/k8s/agent: TestApplyPrivilegedSettings_ResourceOverrides
                 (no override, partial override, RequireGPU + override),
                 TestApplyRestrictedSettings_ResourceOverrides,
                 TestMergeResourceList (input-immutability invariant).

Docs:
  docs/user/cli-reference.md  --requests / --limits flag table entries
  tools/talos-test/README.md  prerequisites, quickstart, customization,
                              troubleshooting

Components Affected:
  CLI (pkg/cli)
  Agent (pkg/k8s/agent, pkg/snapshotter)
  Tests/CI (tests/chainsaw/snapshot/deploy-agent-talos, tools/talos-test)
  Build (Makefile)
  Docs (docs/user/cli-reference.md)

Tested locally end-to-end on macOS Podman Machine (Talos v1.9.0 +
talosctl 1.13.0) — bring-up, agent deployment, snapshot assertion,
teardown all pass.
…re-gpu

PR #762 review (mchmarny): RequireGPU=true silently overwrote any
nvidia.com/gpu key the caller put in --limits, because the GPU default
was unconditionally written after mergeResourceList. A user invoking
--require-gpu --limits nvidia.com/gpu=4 ended up with gpu=1.

Now RequireGPU only inserts the default of 1 when the merged limits
don't already contain nvidia.com/gpu — caller override wins. Updated
the agent.Config.Limits and snapshotter.AgentConfig.Limits doc
comments, the --limits CLI flag help, and docs/user/cli-reference.md
to match. Added a test case in TestApplyPrivilegedSettings_ResourceOverrides
covering the override-wins path; renamed the existing case to make
the default-when-absent path explicit.
PR #762 review (mchmarny): parseResourceList silently last-write-wins
on duplicate keys (e.g. --requests cpu=1,cpu=2 produced cpu=2 with no
warning). Easy footgun with shell-templated invocations and
AICR_REQUESTS / AICR_LIMITS env-var composition.

Now returns an explicit 'duplicate key %q' error. Added
TestParseResourceList in pkg/cli/snapshot_test.go covering happy
paths, all existing error paths, and both the bare and
whitespace-normalized duplicate-key cases.
PR #762 review (mchmarny): the previous 'find dist -maxdepth 1 -type d
-name aicr_*' returned alphabetical-first, so on a multi-platform
goreleaser build it could pick the wrong arch (e.g.
aicr_darwin_amd64_v3 before aicr_darwin_arm64_v8.0). The chosen
binary would then silently fail to exec on the host.

Now the find pattern is anchored to 'go env GOOS' / 'go env GOARCH',
so the lookup is deterministic regardless of which platforms a prior
'make build' produced. Error message also updated to surface the
host-specific path it expected.
PR #762 review (mchmarny): up.sh permanently labels the 'default'
namespace with pod-security.kubernetes.io/enforce=privileged (plus
audit/warn). That posture change persists past the snapshot test —
anything else a developer schedules in 'default' afterward also runs
at the privileged baseline.

New 'Side effects' section in tools/talos-test/README.md surfaces
both posture changes (the namespace relabel and the br_netfilter
modprobe on the host VM), explains why each is required, and notes
that 'talos-dev-env-clean' does not revert them — restart the host
runtime VM to drop br_netfilter; recreate the namespace to drop the
PSS labels.
…/AICR_LIMITS

PR #762 review (CodeRabbit, after fixup commits):

* parseResourceList previously emitted plain fmt.Errorf errors, which
  violates the project rule that CLI-path validation errors carry
  pkg/errors codes. Switch all four return sites to
  errors.New(ErrCodeInvalidRequest, ...) (or errors.Wrap for the
  resource.ParseQuantity case where preserving the inner error is
  useful), and switch the two call sites in snapshotCmd's Action from
  errors.Wrap to errors.PropagateOrWrap so the inner code propagates
  unmodified instead of being re-wrapped.

* docs/user/cli-reference.md's centralized 'Environment Variables'
  table omitted AICR_REQUESTS and AICR_LIMITS even though both are
  documented inline on the per-flag table. Add rows for both,
  pointing at the corresponding --requests / --limits flag and
  reproducing the --require-gpu precedence note.

Tests:
  go test -run TestParseResourceList ./pkg/cli/... -v   # all pass
  golangci-lint run -c .golangci.yaml ./pkg/cli/...     # 0 issues
  make lint-yaml                                         # pass
Four trivial-but-real cleanups, one per CodeRabbit thread:

* tools/talos-test/up.sh: pin alpine:3 -> alpine:3.21 for the
  br_netfilter modprobe sidecar (matches existing pinning convention
  in docs/integrator/validator-extension.md). Also updates the
  troubleshooting hint string to reflect the pinned tag.
  Resolves: 3190760718

* Makefile: add a one-line comment above the talos-dev-env recipe
  documenting the TALOS_KUBECONFIG (user-facing) -> KUBECONFIG_OUT
  (script-internal) variable indirection so a reader of the Makefile
  alone does not have to dive into up.sh to understand the rename.
  Resolves: 3190760664

* tests/chainsaw/snapshot/deploy-agent-talos/chainsaw-test.yaml:
  switch the assert step from a fixed /tmp/chainsaw-snapshot-talos
  path to a per-run mktemp -d directory with a trap-based cleanup so
  parallel CI shards on a shared runner cannot collide. Drops the
  old fixed-path rm -rf from the cleanup block.
  Resolves: 3190760666

* tools/talos-test/README.md: bold-format the four troubleshooting
  problem statements for consistency, fix two 'kind path' references
  to read 'kind-based dev cluster', and replace 'Bring it up' with
  'Start it' in the registry section (the LanguageTool flag CodeRabbit
  surfaced).
  Resolves: 3190760688, 3190760707
…CR_LIMITS env doc

PR #762 review (CodeRabbit, post-rebase batch):

* parseResourceList accepted negative quantities like
  '--requests cpu=-1' or '--limits memory=-1Gi'; the failure surfaced
  only later when the K8s API rejected the Job. Now rejected at parse
  time with a structured ErrCodeInvalidRequest carrying the offending
  entry. Zero is still accepted (legitimate for some resources).
  Test cases added to TestParseResourceList: negative cpu, negative
  memory with suffix, negative-in-second-entry, and a positive-zero
  case for the boundary. Resolves: 3203439877

* Environment Variables table's AICR_LIMITS row in
  docs/user/cli-reference.md was missing the
  'Unspecified resources keep the built-in defaults' clause that the
  --limits flag's inline doc had. Added it; AICR_REQUESTS already had
  the analogous wording. Resolves: 3203439871

Tested:
  go test -run TestParseResourceList -v ./pkg/cli/...   # all pass
…ence

PR #762 review (CodeRabbit, post-rebase batch):

* up.sh's preflight 'for tool in talosctl docker kubectl chainsaw'
  loop did not include curl, but the next step calls 'curl -sf
  http://localhost:5001/v2/' for the registry check. On a host without
  curl that check would have failed and the script would have printed
  the misleading 'registry not reachable' remediation. Add curl to
  the preflight list and to the README prerequisites.
  Resolves: 3203439882

* up.sh writes the cluster's apid endpoint and node IP to the user's
  default talosctl config (~/.talos/config) via 'talosctl config
  endpoint/node'. This persists past 'make talos-dev-env-clean',
  leaving subsequent interactive talosctl commands pointed at a
  destroyed cluster. Promoted the Side effects section from two to
  three items, documented the TALOSCONFIG=$(mktemp) workaround for
  users who want to keep their default config untouched, and pointed
  at 'talosctl config remove-context aicr-talos' for cleanup after
  teardown. Resolves: 3203439886
@ayuskauskas ayuskauskas force-pushed the feat/talos_local_test branch from 858d3bc to c73dad1 Compare May 8, 2026 17:10
@mchmarny mchmarny merged commit b9488a3 into main May 9, 2026
36 checks passed
@mchmarny mchmarny deleted the feat/talos_local_test branch May 9, 2026 11:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants