Skip to content

ci: run helm tests with ctlptl registry#211

Merged
lockwobr merged 9 commits intoNVIDIA:mainfrom
AnouarMohamed:feat/ctlptl-local-registry
May 7, 2026
Merged

ci: run helm tests with ctlptl registry#211
lockwobr merged 9 commits intoNVIDIA:mainfrom
AnouarMohamed:feat/ctlptl-local-registry

Conversation

@AnouarMohamed
Copy link
Copy Markdown
Contributor

@AnouarMohamed AnouarMohamed commented May 2, 2026

Summary

  • Add a ctlptl local-dev config that creates a kind cluster with a localhost:5005 registry.
  • Update make create-kind-cluster, make helm-tests, and local rollout targets to use the local registry path.
  • Add helm-tests to the operator CI matrix using the ctlptl-backed cluster.
  • Document the local cluster, webhook iteration, and teardown workflow.

Fixes #209

Notes

  • operator/config/local-dev/kind-config.yaml is still used by non-Helm CI paths; the local-dev path and the helm-tests CI entry use operator/config/local-dev/ctlptl-config.yaml.
  • ctlptl is pinned and installed through operator/deps.mk.
  • setup-kind-cluster is idempotent and skips node-init-secret creation when the local Docker auth file is missing.
  • rollout-local builds, tags, pushes, restarts the Helm-managed operator deployment, and waits for rollout completion.

Validation

  • bash -n k8s-tests/chainsaw/helm/install-helm-chart.sh
  • git diff --check
  • make unit-tests from operator/
  • make create-kind-cluster DOCKER_CMD=docker from operator/
  • make setup-kind-cluster DOCKER_CMD=docker
  • make setup-kind-cluster DOCKER_CMD=docker DOCKER_AUTH_FILE=/tmp/nodewright-missing-docker-auth.json
  • kubectl -n skyhook rollout status deployment -l app.kubernetes.io/name=skyhook-operator,control-plane=controller-manager --timeout=60s
  • make helm-tests DOCKER_CMD=docker from operator/
  • Webhook iteration check: temporarily added a log marker in operator/internal/controller/webhook_controller.go, ran make rollout-local DOCKER_CMD=docker, saw the operator deployment restart and complete, and found the marker in the restarted manager pod logs. The temporary marker was removed, then make rollout-local DOCKER_CMD=docker was run again.

@AnouarMohamed AnouarMohamed force-pushed the feat/ctlptl-local-registry branch from 2b46fa1 to 2599915 Compare May 2, 2026 12:58
@AnouarMohamed AnouarMohamed marked this pull request as ready for review May 2, 2026 13:00
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 2, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds a helm-tests CI matrix entry to run Helm-based Chainsaw tests against a ctlptl-provisioned Kind cluster (k8s v1.35.0) with a local registry. Introduces operator/config/local-dev/ctlptl-config.yaml. Adds ctlptl install wiring to operator/deps.mk. Updates operator/Makefile with LOCAL_REGISTRY, LOCAL_OPERATOR_IMG, targets to tag/push the local image, ctlptl-driven create/delete/setup cluster targets, and a rollout-local flow. Modifies Helm install script to accept LOCAL_OPERATOR_IMG overrides and relaxes Chainsaw image assertions to accept the local registry image. Adds docs/development.md describing the local dev workflow.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title 'ci: run helm tests with ctlptl registry' clearly and concisely summarizes the primary change: integrating ctlptl for running helm tests with a local registry.
Linked Issues check ✅ Passed The PR addresses all primary objectives from #209: ctlptl config for cluster+registry, make targets updated (create-kind-cluster, helm-tests), LOCAL_OPERATOR_IMG support in helm install script, helm-tests CI entry added, local dev docs created, and all acceptance criteria met.
Out of Scope Changes check ✅ Passed All changes are scoped to the ctlptl integration objectives: operator Makefile, helm tests, CI workflow, local dev docs, config files, and chainsaw assertions—no unrelated modifications detected.
Description check ✅ Passed The pull request description clearly relates to the changeset, detailing the addition of ctlptl local-dev config, updates to make targets, Helm tests, and comprehensive validation performed.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

  • Generate code and open pull requests
  • Plan features and break down work
  • Investigate incidents and troubleshoot customer tickets together
  • Automate recurring tasks and respond to alerts with triggers
  • Summarize progress and report instantly

Built for teams:

  • Shared memory across your entire org—no repeating context
  • Per-thread sandboxes to safely plan and execute work
  • Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started


Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/operator-ci.yaml:
- Around line 121-126: Add SHA256 verification to the "Install kind" step:
download the kind binary (currently saved to /tmp/kind), also download the
corresponding kind-linux-amd64.sha256sum for v0.31.0, run sha256sum -c against
the downloaded checksum file and only then run sudo install /tmp/kind
/usr/local/bin/kind; update the workflow's "Install kind" run block to perform
these three actions in sequence. For "Install ctlptl" note that upstream does
not provide checksums so either document that limitation or replace with an
out-of-band pinned verification mechanism if you want parity with kind.

In `@docs/development.md`:
- Around line 11-15: Update the documented ctlptl install command to pin the
same version CI uses instead of `@latest`; replace the current example line `go
install github.com/tilt-dev/ctlptl/cmd/ctlptl@latest` with the explicit version
used in CI (v0.9.3) so the docs read `go install
github.com/tilt-dev/ctlptl/cmd/ctlptl@v0.9.3`, and add a short note to keep this
value in sync with the CI configuration when that CI version is bumped.

In `@k8s-tests/chainsaw/helm/helm-chart-test/assert-scheduled.yaml`:
- Line 52: The assertion currently hard-codes the allowed image values (the
matcher comparing image to 'ghcr.io/nvidia/skyhook/operator:latest' or
'localhost:5005/skyhook-operator:testing'), which rejects valid overrides;
update the matcher to accept the actual override by comparing the pod/container
image against the LOCAL_OPERATOR_IMG value (or allow any tag under the localhost
repo) instead of a single hard-coded testing tag — e.g., change the condition to
check image == LOCAL_OPERATOR_IMG or to match a prefix like image startsWith
'localhost:5005/skyhook-operator:' so overrides such as
'localhost:5005/skyhook-operator:dev' pass; adjust references to the image
variable used in assert-scheduled.yaml accordingly.

In `@k8s-tests/chainsaw/helm/install-helm-chart.sh`:
- Around line 33-50: The script extracts LOCAL_OPERATOR_IMG_DIGEST but then
discards it when building LOCAL_OPERATOR_IMG_ARGS; change the third --set entry
in LOCAL_OPERATOR_IMG_ARGS to pass the parsed digest instead of an empty string
(use --set
"controllerManager.manager.image.digest=${LOCAL_OPERATOR_IMG_DIGEST}"), and
ensure the parsing logic around LOCAL_OPERATOR_IMG_WITHOUT_DIGEST /
LOCAL_OPERATOR_IMG_TAG /LOCAL_OPERATOR_IMG_LAST_PART remains intact so
repo:tag@digest and repo@digest forms resolve correctly into
LOCAL_OPERATOR_IMG_DIGEST and LOCAL_OPERATOR_IMG_REPOSITORY/TAG before building
LOCAL_OPERATOR_IMG_ARGS.

In `@operator/config/local-dev/kind-config.yaml`:
- Around line 17-19: The top-of-file comment claiming "# Retained for review
only." is incorrect because this KinD config is still used as the default for
non-helm test suites (see operator-ci.yaml); update the comment in
local-dev/kind-config.yaml to accurately reflect current usage (e.g., "Default
KinD config used by non-helm test suites / CI") or remove the misleading phrase
so future contributors won't assume it is unused; ensure the updated comment
clearly states it is the three-node (two-worker) KinD config used by the CI
non-helm suites.

In `@operator/Makefile`:
- Around line 292-294: The create-kind-cluster target is not idempotent because
setup-kind-cluster uses `kubectl create secret` for `node-init-secret` which
fails on re-runs; update setup-kind-cluster to make secret creation idempotent
(e.g., replace `kubectl create secret` with an apply pattern such as `kubectl
create secret ... --dry-run=client -o yaml | kubectl apply -f -` or check for
the secret and patch/update it), ensuring any references to `node-init-secret`
are handled safely so create-kind-cluster can be re-run without triggering
AlreadyExists errors.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 7755fb93-131f-48a4-a31c-de9430d251da

📥 Commits

Reviewing files that changed from the base of the PR and between 71ca22b and 2599915.

📒 Files selected for processing (8)
  • .github/workflows/operator-ci.yaml
  • docs/development.md
  • k8s-tests/chainsaw/helm/helm-chart-test/assert-no-schedule.yaml
  • k8s-tests/chainsaw/helm/helm-chart-test/assert-scheduled.yaml
  • k8s-tests/chainsaw/helm/install-helm-chart.sh
  • operator/Makefile
  • operator/config/local-dev/ctlptl-config.yaml
  • operator/config/local-dev/kind-config.yaml

Comment thread .github/workflows/operator-ci.yaml Outdated
Comment thread docs/development.md Outdated
Comment thread k8s-tests/chainsaw/helm/helm-chart-test/assert-scheduled.yaml Outdated
Comment thread k8s-tests/chainsaw/helm/install-helm-chart.sh Outdated
Comment thread operator/config/local-dev/kind-config.yaml Outdated
Comment thread operator/Makefile Outdated
Signed-off-by: AnouarMohamed <m.anouar@mundiapolis.ma>
@AnouarMohamed AnouarMohamed force-pushed the feat/ctlptl-local-registry branch from 2599915 to 9fa25ef Compare May 2, 2026 13:15
Copy link
Copy Markdown
Collaborator

@lockwobr lockwobr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review

Thanks for picking this up — overall direction is right and the script-level handling of LOCAL_OPERATOR_IMG (digest/tag parsing) is clean. A few changes I'd like to see before merge.

Acceptance criteria check (against #209)

# Criterion Status
1 make create-kind-cluster brings up kind + registry at localhost:5005
2 make helm-tests installs locally-built image with no chart edits / no manual --set
3 helm-tests matrix entry runs in operator-ci.yaml ✅ (first time helm-tests is in GH CI — nice)
4 Webhook iteration loop (build → push → rollout restart) ⚠️ documented but not validated; instructions are fragile (see #1)
5 Teardown leaves no kind cluster or registry ⚠️ no make target — only raw ctlptl delete is documented
6 Local-dev docs updated
"Single source of truth for the local cluster shape" ❌ — fine to defer; just need an accurate comment (see #3)

Required changes

1. Hide docker behind $(DOCKER_CMD) and wrap the iteration loop in make targets. DOCKER_CMD ?= podman (Makefile:54) is the project default, but docs/development.md documents raw docker tag / docker push, which silently breaks for podman users. The fix is to wrap the tag/push in a make target — that also removes the duplication with the helm-tests recipe and stops hardcoding ghcr.io/nvidia/skyhook/operator:testing (which becomes wrong if anyone overrides IMG_REPO):

.PHONY: push-local-image
push-local-image: docker-build ## tag and push the operator image to the local ctlptl registry
	$(DOCKER_CMD) tag $(IMG):testing $(LOCAL_OPERATOR_IMG)
	$(DOCKER_CMD) push $(LOCAL_OPERATOR_IMG)

.PHONY: rollout-local
rollout-local: push-local-image ## push locally-built operator and restart the deployment
	$(KUBECTL) -n $(SKYHOOK_NAMESPACE) rollout restart deploy/skyhook-operator

.PHONY: delete-kind-cluster
delete-kind-cluster: ctlptl ## delete the local ctlptl kind cluster and registry
	$(CTLPTL) delete -f config/local-dev/ctlptl-config.yaml

helm-tests then collapses to:

helm-tests: helm chainsaw ctlptl ensure-test-symlinks push-local-image
	$(MAKE) run
	$(MAKE) uninstall ignore-not-found=true
	$(MAKE) kill
	LOCAL_OPERATOR_IMG=$(LOCAL_OPERATOR_IMG) $(CHAINSAW) test --test-dir ../k8s-tests/chainsaw/helm $(CHAINSAW_ARGS)

docs/development.md becomes:

cd operator
make rollout-local         # build + tag + push + restart
make delete-kind-cluster   # teardown

2. Move ctlptl install into deps.mk. Every other tool we depend on is pinned and installed there into $(LOCALBIN); ctlptl should follow the same pattern instead of an inline curl in the workflow. This also drops the "v0.9.3 does not publish checksum assets" comment as a non-issue, since the pin lives next to every other tool's pin.

## deps.mk
CTLPTL_VERSION ?= v0.9.3

install-deps: ... ctlptl   # add to fan-out

CTLPTL ?= $(LOCALBIN)/ctlptl
# ctlptl release naming: darwin -> "mac", amd64 -> "x86_64"
CTLPTL_OS = $(if $(filter darwin,$(OS)),mac,$(OS))
CTLPTL_ARCH = $(if $(filter amd64,$(ARCH)),x86_64,$(ARCH))
CTLPTL_VERSION_NO_V = $(patsubst v%,%,$(CTLPTL_VERSION))

.PHONY: ctlptl
ctlptl: $(LOCALBIN) ## Download ctlptl binary if necessary.
	test -s $(LOCALBIN)/ctlptl || curl -sSfL \
	    https://github.com/tilt-dev/ctlptl/releases/download/$(CTLPTL_VERSION)/ctlptl.$(CTLPTL_VERSION_NO_V).$(CTLPTL_OS).$(CTLPTL_ARCH).tar.gz | \
	    tar --no-same-owner -zxv -C $(LOCALBIN) ctlptl

create-kind-cluster and helm-tests add ctlptl as a prerequisite and call $(CTLPTL). The CI workflow drops the Install ctlptl step entirely (kind's separate ad-hoc install can stay — kind isn't a new tool here).

3. Fix the framing on kind-config.yaml. The PR description says it's "retained for review only" and the new comment in the file says "retained for comparing the ctlptl local development cluster shape." Both are wrong — the file is still actively consumed by:

  • .github/workflows/operator-ci.yaml:116 — default for e2e and cli-e2e (config: ${{ matrix.kind-config || '...kind-config.yaml' }})
  • .github/workflows/agent-ci.yaml:299

Issue #209's "single source of truth" goal isn't met as a result, but converging the rest of the matrix onto ctlptl is a bigger surface (touches agent-ci.yaml, e2e, cli-e2e, plus templating for the multi-k8s-version coverage) — fine to leave for a follow-up. Just update the comment to say so honestly. Suggested replacement for the YAML comment:

# kind cluster config used by the e2e, cli-e2e (operator-ci.yaml) and agent-ci.yaml
# matrix entries via helm/kind-action. The local-dev path and the helm-tests CI
# entry use ctlptl-config.yaml in this directory instead. Converging the
# helm/kind-action paths onto ctlptl is tracked as a follow-up to #209.

And drop the "retained for review only" line from the PR description.

Should-do

4. Webhook iteration validation. Issue criterion: "a code change in webhook_controller.go is observable in the running operator within a single rebuild cycle." Validation section only mentions make helm-tests. With make rollout-local in place, a one-line "tested webhook iteration: edited webhook_controller.go, ran make rollout-local, observed restart in N seconds" closes this.

Nice-to-have / follow-up (don't need to land here)

5. Unrelated change in setup-kind-cluster. The PR makes the node-init-secret creation idempotent (--dry-run=client | apply). Good change but outside #209's scope and unmentioned in the PR description. Either call it out in the body or split into its own PR.

6. LOCAL_REGISTRY parameterization. LOCAL_REGISTRY ?= localhost:5005 is overridable in the Makefile but the chainsaw assertions hardcode localhost:5005/skyhook-operator — overriding breaks the assert. Probably easiest to drop the ?= and make it a constant.

Spot checks that pass

  • Chart key paths (controllerManager.manager.image.repository/tag/digest) match chart/values.yaml:85-88. ✓
  • Bash ${VAR##*/} / ${VAR%:*} repository/tag parsing in install-helm-chart.sh correctly handles digest, tag, and digest-only forms. ✓
  • CI condition matrix.test-suite != 'unit-tests' && matrix.test-suite != 'helm-tests' correctly skips helm/kind-action for the new entry. ✓
  • Other helm tests (helm-node-affinity-test, helm-scale-test) use the same install-helm-chart.sh, so they inherit the override automatically; their assertions don't pin the image, so no other assert files needed updating. ✓

Signed-off-by: AnouarMohamed <m.anouar@mundiapolis.ma>
Signed-off-by: AnouarMohamed <m.anouar@mundiapolis.ma>
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/development.md`:
- Line 30: Update the documentation text so the config path is shown relative to
the documented working directory (after cd operator): replace the path string
"operator/config/local-dev/ctlptl-config.yaml" with
"config/local-dev/ctlptl-config.yaml" in the sentence that describes `make
create-kind-cluster` so readers know the correct relative path to apply.

In `@operator/Makefile`:
- Around line 245-246: The rollout-local Makefile target restarts the operator
but doesn't wait for the rollout to complete, causing races; update the
rollout-local target (the rule labeled "rollout-local") to invoke $(KUBECTL)
rollout status on the same deployment selector
(app.kubernetes.io/name=skyhook-operator,control-plane=controller-manager) in
namespace $(SKYHOOK_NAMESPACE) after the existing rollout restart command, using
--timeout (e.g., 60s) so the target blocks until the deployment is ready.
- Around line 306-307: The Makefile currently unconditionally runs the kubectl
secret creation which fails if $(DOCKER_AUTH_FILE) is missing; update the
setup-kind-cluster (or the Makefile recipe containing the shown command) to
first check for the existence of $(DOCKER_AUTH_FILE) and only run the $(KUBECTL)
create secret generic node-init-secret ... pipeline when the file exists,
otherwise skip the step (optionally echo a message indicating the docker auth
file was not found) so fresh local setups and CI without registry auth do not
hard-fail.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 771513ea-ccc9-4f50-81ae-bd61f810c934

📥 Commits

Reviewing files that changed from the base of the PR and between 2278247 and 5867c4a.

📒 Files selected for processing (2)
  • docs/development.md
  • operator/Makefile

Comment thread docs/development.md Outdated
Comment thread operator/Makefile
Comment thread operator/Makefile Outdated
Signed-off-by: AnouarMohamed <m.anouar@mundiapolis.ma>
@AnouarMohamed
Copy link
Copy Markdown
Contributor Author

Updated the PR description and pushed a small follow-up for the latest review notes. CodeRabbit is green on the current head; please take another look when you have time.

@AnouarMohamed AnouarMohamed requested a review from lockwobr May 5, 2026 21:58
Signed-off-by: AnouarMohamed <m.anouar@mundiapolis.ma>
@AnouarMohamed
Copy link
Copy Markdown
Contributor Author

Pushed a follow-up for the webhook CI failure and updated the PR notes. Local make helm-tests DOCKER_CMD=docker is passing now.

Comment thread .github/workflows/operator-ci.yaml Outdated
Comment thread chart/templates/skyhook-crd.yaml
Comment thread operator/config/local-dev/ctlptl-config.yaml
Signed-off-by: AnouarMohamed <m.anouar@mundiapolis.ma>
Signed-off-by: AnouarMohamed <m.anouar@mundiapolis.ma>
Signed-off-by: AnouarMohamed <m.anouar@mundiapolis.ma>
Comment thread operator/Makefile
Signed-off-by: AnouarMohamed <m.anouar@mundiapolis.ma>
@lockwobr lockwobr merged commit b7cde87 into NVIDIA:main May 7, 2026
17 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Use ctlptl for the local kind cluster + registry; run helm e2e tests against the locally-built operator image

3 participants