ci: parallelize e2e tests by pool and add merge gates#229
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughThis pull request implements a test-pool partitioning system for the Chainsaw e2e test suite. All Chainsaw test manifests receive metadata.labels.pool assignments across four pools (core, interrupt, uninstall, lifecycle). The operator-ci workflow gains a pool matrix and exports POOL to test steps; Makefile e2e-tests uses POOL/POOL_NON_DEFAULT to build chainsaw --selector arguments (core excludes non-core pools, named POOL selects that pool, unset runs all). Multiple GitHub Actions workflows opt JS actions into Node 24 and add ci-gate jobs that verify all required job results via jq. Supporting changes include a new lint CI workflow and CI config files, GitLab chart publishing switched to Helm OCI, removal of the Helm cm-push plugin install, and documentation updates (docs/ci-test-pools.md, docs/README.md, .claude/CLAUDE.md). Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes 🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.github/workflows/agent-ci.yaml:
- Around line 337-348: The agent-gate's jq check currently fails when
conditional jobs are legitimately skipped; update the check in the agent-gate
step (job name "agent-gate") to accept ".value.result == \"skipped\"" for the
conditional jobs "create-manifest" and "operator-agent-tests" (or alternatively
remove those conditional jobs from the gate's needs and only depend on
always-run jobs "test" and "build-agent"); specifically change the jq filter
used in the Verify step to allow success OR skipped for those two keys so fork
PRs don't fail the gate.
In @.github/workflows/operator-ci.yaml:
- Around line 397-407: The operator-gate job's verification step uses jq
'all(.value.result == "success")' which fails when conditional dependencies are
skipped; update the verification logic in the operator-gate step (the block that
sets results='${{ toJSON(needs) }}' and pipes to jq) to treat "skipped" as
acceptable (e.g., change the jq predicate to allow .value.result == "success" OR
.value.result == "skipped") or, if you prefer stricter semantics, split the
needs into required vs optional lists and validate only required jobs for
success; adjust the jq expression or the needs split accordingly so skipped jobs
no longer break the gate.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: ade090eb-bbf5-473b-b6f9-9fca6172827f
📒 Files selected for processing (35)
.claude/CLAUDE.md.github/workflows/agent-ci.yaml.github/workflows/operator-ci.yamldocs/README.mddocs/ci-test-pools.mdk8s-tests/chainsaw/skyhook/auto-taint-new-nodes/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/cleanup-pods/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/config-skyhook/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/delete-blocked-when-disabled/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/delete-blocked-when-paused/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/delete-skyhook/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/depends-on/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/downgrade-after-uninstall/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/downgrade-enabled-false-preserves-state/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/explicit-uninstall/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/failure-skyhook/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/hello-world/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/interrupt-grouping/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/interrupt/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/malformed-node-state/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/package-upgrade/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/pod-finalizer/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/runtime-required/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/simple-skyhook/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/simple-update-skyhook/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/skyhook-upgrade/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/strict-order/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/taint-scheduling/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/uninstall-cancel-during-interrupt/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/uninstall-cancel/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/uninstall-fix-config/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/uninstall-mixed-packages/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/uninstall-on-delete/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/validate-packages/chainsaw-test.yamloperator/Makefile
88d1351 to
14c8e48
Compare
Coverage Report for CI Build 25580653479Warning No base build found for commit Coverage: 81.937%Details
Uncovered ChangesNo uncovered changes found. Coverage RegressionsRequires a base build to compare against. How to fix this → Coverage Stats
💛 - Coveralls |
14c8e48 to
d9a425e
Compare
48743e7 to
9ce0d7b
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
.github/workflows/operator-ci.yaml (1)
23-38:⚠️ Potential issue | 🟠 Major | ⚡ Quick winFix
pathsalias usage to unblock YAML meta-lint.Line 38 currently uses
paths: *operator-paths, and your meta-lint reports this as invalid for the schema (“must be sequence node”). This is currently merge-blocking.🔧 Proposed fix
on: workflow_dispatch: {} pull_request: - paths: &operator-paths + paths: - operator/**/*.go - operator/go.mod - operator/go.sum - operator/deps.mk - operator/config/** - containers/operator.Dockerfile - .github/workflows/operator-ci.yaml - k8s-tests/** - chart/** push: branches: - main tags: - operator/* - paths: *operator-paths + paths: + - operator/**/*.go + - operator/go.mod + - operator/go.sum + - operator/deps.mk + - operator/config/** + - containers/operator.Dockerfile + - .github/workflows/operator-ci.yaml + - k8s-tests/** + - chart/**🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/operator-ci.yaml around lines 23 - 38, The push job's paths uses the YAML alias "*operator-paths" which the meta-linter flags as invalid for the schema; update the push.paths entry to be an explicit sequence instead of the alias by copying the sequence defined under the "paths: &operator-paths" anchor (the list of operator/**/*.go, operator/go.mod, operator/go.sum, operator/deps.mk, operator/config/**, containers/operator.Dockerfile, .github/workflows/operator-ci.yaml, k8s-tests/**, chart/**) into push.paths so push.paths is a concrete sequence node rather than "paths: *operator-paths".
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.gitlab-ci.yml:
- Line 291: The helm push step currently globs using $(ls
skyhook-operator-*.tgz) which can break if multiple .tgz files exist or names
change; update the pipeline to capture the exact output of the helm package
command (e.g., store the path returned by `helm package` in a variable) and use
that variable in the helm push invocation instead of the glob; update the job
that runs helm package and the `helm push` invocation (reference: helm push and
skyhook-operator-*.tgz, OCI_PATH) so the push uses the captured package path.
In `@ci/lycheeignore`:
- Around line 16-18: The exclusion rule ^https?://.+\.nvidia\.com is too broad
and should be removed or tightened; replace it with explicit, known
internal-only host patterns (e.g., specific subdomains) or a narrower regex (for
example target-only internal prefixes) instead of the wildcard, leaving the
other explicit rules (e.g., ^https?://.+\.nvda\.ai and
^https?://.+\.ngc\.nvidia\.com) as-is unless they also need narrowing; update
the ci/lycheeignore entry by deleting or substituting the
^https?://.+\.nvidia\.com pattern with explicit hostnames or a stricter regex
that only matches internal NVIDIA hosts.
---
Outside diff comments:
In @.github/workflows/operator-ci.yaml:
- Around line 23-38: The push job's paths uses the YAML alias "*operator-paths"
which the meta-linter flags as invalid for the schema; update the push.paths
entry to be an explicit sequence instead of the alias by copying the sequence
defined under the "paths: &operator-paths" anchor (the list of operator/**/*.go,
operator/go.mod, operator/go.sum, operator/deps.mk, operator/config/**,
containers/operator.Dockerfile, .github/workflows/operator-ci.yaml,
k8s-tests/**, chart/**) into push.paths so push.paths is a concrete sequence
node rather than "paths: *operator-paths".
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 610904ac-e82e-4cf8-8b9b-ad1965c4817a
📒 Files selected for processing (44)
.claude/CLAUDE.md.github/workflows/agent-ci.yaml.github/workflows/agentless-container.yaml.github/workflows/cli-release.yaml.github/workflows/lint-ci.yaml.github/workflows/operator-ci.yaml.github/workflows/security-checkov.yaml.gitlab-ci.ymlci/lycheeignoreci/markdownlint.yamlci/yamllint.yamldocs/README.mddocs/ci-test-pools.mdk8s-tests/chainsaw/skyhook/auto-taint-new-nodes/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/cleanup-pods/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/config-skyhook/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/delete-blocked-when-disabled/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/delete-blocked-when-paused/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/delete-skyhook/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/depends-on/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/downgrade-after-uninstall/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/downgrade-enabled-false-preserves-state/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/explicit-uninstall/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/failure-skyhook/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/hello-world/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/interrupt-grouping/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/interrupt/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/malformed-node-state/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/package-upgrade/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/pod-finalizer/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/runtime-required/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/simple-skyhook/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/simple-update-skyhook/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/skyhook-upgrade/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/strict-order/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/taint-scheduling/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/uninstall-cancel-during-interrupt/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/uninstall-cancel/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/uninstall-fix-config/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/uninstall-mixed-packages/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/uninstall-on-delete/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/validate-packages/chainsaw-test.yamloperator/Makefileoperator/deps.mk
💤 Files with no reviewable changes (1)
- operator/deps.mk
9ce0d7b to
896394a
Compare
There was a problem hiding this comment.
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.claude/CLAUDE.md:
- Line 61: The markdown line for the "make e2e-tests" entry exceeds the
configured line-length (MD013); break the long comment into shorter lines or
rephrase so each line stays under the limit — locate the line containing the
literal "make e2e-tests # chainsaw e2e against current cluster (set
POOL=<name> to run one pool — see docs/ci-test-pools.md)" and split the trailing
comment into two sentences/lines (for example, keep "make e2e-tests #
chainsaw e2e against current cluster" on one line and move the "(set POOL=... —
see docs/ci-test-pools.md)" part to the next line) so the file conforms to the
markdown line-length rule.
In @.github/workflows/lint-ci.yaml:
- Around line 53-54: The CI currently streams release tarballs directly into tar
for actionlint (using ACTIONLINT_VERSION and the
actionlint_${ACTIONLINT_VERSION}_linux_amd64.tar.gz URL) which is unsafe;
instead download the tarball to a temporary file, fetch and verify its
integrity/signature (e.g., compare a published SHA256 checksum or verify a GPG
signature for actionlint_${ACTIONLINT_VERSION}_linux_amd64.tar.gz), fail the job
on mismatch, and only then extract to /tmp; apply the same change to the second
download block (the other occurrence that fetches a release tarball at lines
81-82) so both downloads are verified before extraction/execution.
In @.gitlab-ci.yml:
- Around line 269-276: The publish-rule conditioned on if: $CI_COMMIT_REF_NAME
== $CI_DEFAULT_BRANCH is never reached because top-level workflow.rules only
allows pipelines when CI_COMMIT_TAG is set; update workflow.rules to also create
pipelines for the default branch by adding a rule that returns true when
CI_COMMIT_REF_NAME == CI_DEFAULT_BRANCH (or otherwise permit branch pipelines),
so the existing publish rule (the rules block with OCI_PATH:
nvidian/swgpu-baseos and if: $CI_COMMIT_REF_NAME == $CI_DEFAULT_BRANCH) can
execute.
In `@chart/README.md`:
- Line 18: Update the README entry for the runtimeRequired link text
"[**runtimeRequired**]" to use concise phrasing: change "will do work prior to
removing custom taint" to "will do work before removing custom taint" while
keeping the corrected relative link ../docs/runtime_required.md intact.
In `@docs/metrics/README.md`:
- Line 79: Replace the grammatically incorrect phrase in the README sentence
that references metrics_test.py: change "will let you test of exists or absence
of metrics based on name and labels" to a clearer wording such as "lets you test
for the existence or absence of metrics by name and labels" so the sentence
reads: "See the script metrics_test.py that lets you test for the existence or
absence of metrics by name and labels."
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: a26aea9b-98ff-4fff-b689-919580fe0d0d
📒 Files selected for processing (47)
.claude/CLAUDE.md.github/workflows/agent-ci.yaml.github/workflows/agentless-container.yaml.github/workflows/cli-release.yaml.github/workflows/lint-ci.yaml.github/workflows/operator-ci.yaml.github/workflows/security-checkov.yaml.gitlab-ci.ymlchart/README.mdci/.markdownlint-cli2.yamlci/lycheeignoreci/yamllint.yamldocs/README.mddocs/ci-test-pools.mddocs/designs/uninstall-enhancement.mddocs/metrics/README.mdk8s-tests/chainsaw/skyhook/auto-taint-new-nodes/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/cleanup-pods/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/config-skyhook/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/delete-blocked-when-disabled/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/delete-blocked-when-paused/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/delete-skyhook/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/depends-on/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/downgrade-after-uninstall/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/downgrade-enabled-false-preserves-state/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/explicit-uninstall/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/failure-skyhook/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/hello-world/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/interrupt-grouping/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/interrupt/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/malformed-node-state/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/package-upgrade/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/pod-finalizer/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/runtime-required/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/simple-skyhook/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/simple-update-skyhook/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/skyhook-upgrade/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/strict-order/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/taint-scheduling/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/uninstall-cancel-during-interrupt/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/uninstall-cancel/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/uninstall-fix-config/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/uninstall-mixed-packages/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/uninstall-on-delete/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/validate-packages/chainsaw-test.yamloperator/Makefileoperator/deps.mk
💤 Files with no reviewable changes (1)
- operator/deps.mk
896394a to
7d854c1
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
.claude/CLAUDE.md (1)
141-143:⚠️ Potential issue | 🟠 Major | ⚡ Quick winFix markdownlint MD032 around the Agent list.
CI reports MD032 at Line 142; add a blank line before the list that starts with
- Reads ...so the list is surrounded by blank lines.Suggested fix
The agent is a container entrypoint the operator injects alongside every package. It: + - Reads `/skyhook-package/config.json` (validated against `schemas/`) - Dispatches to the requested stage/step - Uses `chroot_exec.py` to run step scripts inside the host root mount🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.claude/CLAUDE.md around lines 141 - 143, Add a blank line before the bulleted list that begins with "- Reads `/skyhook-package/config.json`" under the paragraph that starts "The agent is a container entrypoint..." so the list is surrounded by blank lines and satisfies markdownlint MD032; locate the paragraph text and insert one empty line between it and the list items.ci/.markdownlint-cli2.yaml (1)
1-25:⚠️ Potential issue | 🟠 Major | ⚡ Quick winAdd the required Apache-2.0 license header to this YAML file.
This new YAML file is missing the repository-standard license header.
Suggested fix
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + # markdownlint-cli2 config — see https://github.com/DavidAnson/markdownlint-cli2 # Permissive starting baseline. Tighten over time as docs are normalized.As per coding guidelines, "Every source file must have an Apache-2.0 license header; use scripts/format_license.py or 'make license-fmt' to add/replace headers".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@ci/.markdownlint-cli2.yaml` around lines 1 - 25, This YAML file is missing the repository-standard Apache-2.0 license header; add the standard header block at the very top of ci/.markdownlint-cli2.yaml (above the existing "config:" entry) so it matches other source files, or run the provided script to do it automatically (scripts/format_license.py) or invoke the Make target (make license-fmt) to insert/replace the header; confirm the header text exactly matches the repository's canonical Apache-2.0 header.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@ci/yamllint.yaml`:
- Around line 1-3: This YAML file is missing the required Apache-2.0 license
header; add the standard Apache-2.0 header as the top comment block above the
existing content (preserving the "extends: relaxed" line), or run the repository
tool to apply it automatically (scripts/format_license.py or make license-fmt)
so the file conforms to the /**/*.{go,py,yaml,yml,sh,mk} licensing policy.
In `@k8s-tests/chainsaw/skyhook/package-upgrade/chainsaw-test.yaml`:
- Around line 22-23: The test's pool label currently sets "pool: uninstall"
which is semantically misleading for an upgrade-focused test; change the label
value to a more appropriate pool such as "lifecycle" or "upgrade" by updating
the labels.pool entry (look for the labels: block containing pool: uninstall) so
the test grouping matches its purpose, and confirm any test-runner pooling
config or test distribution expectations after the change.
---
Outside diff comments:
In @.claude/CLAUDE.md:
- Around line 141-143: Add a blank line before the bulleted list that begins
with "- Reads `/skyhook-package/config.json`" under the paragraph that starts
"The agent is a container entrypoint..." so the list is surrounded by blank
lines and satisfies markdownlint MD032; locate the paragraph text and insert one
empty line between it and the list items.
In `@ci/.markdownlint-cli2.yaml`:
- Around line 1-25: This YAML file is missing the repository-standard Apache-2.0
license header; add the standard header block at the very top of
ci/.markdownlint-cli2.yaml (above the existing "config:" entry) so it matches
other source files, or run the provided script to do it automatically
(scripts/format_license.py) or invoke the Make target (make license-fmt) to
insert/replace the header; confirm the header text exactly matches the
repository's canonical Apache-2.0 header.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: c57ad5d7-782f-4e44-a5a9-3dc3888db476
📒 Files selected for processing (47)
.claude/CLAUDE.md.github/workflows/agent-ci.yaml.github/workflows/agentless-container.yaml.github/workflows/cli-release.yaml.github/workflows/lint-ci.yaml.github/workflows/operator-ci.yaml.github/workflows/security-checkov.yaml.gitlab-ci.ymlchart/README.mdci/.markdownlint-cli2.yamlci/lycheeignoreci/yamllint.yamldocs/README.mddocs/ci-test-pools.mddocs/designs/uninstall-enhancement.mddocs/metrics/README.mdk8s-tests/chainsaw/skyhook/auto-taint-new-nodes/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/cleanup-pods/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/config-skyhook/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/delete-blocked-when-disabled/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/delete-blocked-when-paused/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/delete-skyhook/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/depends-on/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/downgrade-after-uninstall/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/downgrade-enabled-false-preserves-state/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/explicit-uninstall/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/failure-skyhook/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/hello-world/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/interrupt-grouping/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/interrupt/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/malformed-node-state/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/package-upgrade/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/pod-finalizer/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/runtime-required/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/simple-skyhook/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/simple-update-skyhook/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/skyhook-upgrade/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/strict-order/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/taint-scheduling/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/uninstall-cancel-during-interrupt/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/uninstall-cancel/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/uninstall-fix-config/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/uninstall-mixed-packages/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/uninstall-on-delete/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/validate-packages/chainsaw-test.yamloperator/Makefileoperator/deps.mk
💤 Files with no reviewable changes (1)
- operator/deps.mk
7d854c1 to
003fce0
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
.github/workflows/operator-ci.yaml (1)
23-47:⚠️ Potential issue | 🟠 Major | ⚡ Quick winAdd
operator/Makefileto both workflow path filters.Line 23 and Line 38 omit
operator/Makefile, even though the new pool-selection entrypoint now lives there. A Makefile-only change toe2e-testswould skip Operator CI entirely on PRs and pushes, so this required workflow would never exercise the code that dispatches pooled tests.🐛 Proposed fix
pull_request: paths: + - operator/Makefile - operator/**/*.go - operator/go.mod - operator/go.sum - operator/deps.mk - operator/config/** - containers/operator.Dockerfile - .github/workflows/operator-ci.yaml - k8s-tests/** - chart/** push: branches: - main tags: - operator/* paths: + - operator/Makefile - operator/**/*.go - operator/go.mod - operator/go.sum - operator/deps.mk - operator/config/** - containers/operator.Dockerfile - .github/workflows/operator-ci.yaml - k8s-tests/** - chart/**🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/operator-ci.yaml around lines 23 - 47, Update the path filters in .github/workflows/operator-ci.yaml so changes to operator/Makefile trigger the workflow: add "operator/Makefile" to both "paths:" lists (the one alongside operator/**/*.go etc. used for PRs/pull_request and the equivalent under push/tags) so Makefile-only changes (e.g., the new pool-selection entrypoint) will run Operator CI.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.github/workflows/lint-ci.yaml:
- Around line 62-63: Add a pinned YAMLLINT_VERSION env like the existing
ACTIONLINT_VERSION/LYCHEE_VERSION and change the installer invocation so it
installs yamllint using that pinned version instead of unpinned; specifically
add an env var YAMLLINT_VERSION and replace the current "pip install --user
--quiet yamllint" call with an install that references the pinned version (e.g.,
pip install --user --quiet "yamllint==${YAMLLINT_VERSION}") and keep the
subsequent "yamllint -c ci/yamllint.yaml ." invocation as-is.
In `@docs/ci-test-pools.md`:
- Around line 45-49: The doc references non-existent gate check names
(`operator-gate` / `agent-gate`) but the workflow actually publishes `ci-gate`;
update the text so branch protection and examples use the real published check
name `ci-gate` (or explicitly state mapping if multiple gates are published) and
remove or correct references to `operator-gate` and `agent-gate` to avoid
pointing branch-protection rules at checks that don't exist.
---
Outside diff comments:
In @.github/workflows/operator-ci.yaml:
- Around line 23-47: Update the path filters in
.github/workflows/operator-ci.yaml so changes to operator/Makefile trigger the
workflow: add "operator/Makefile" to both "paths:" lists (the one alongside
operator/**/*.go etc. used for PRs/pull_request and the equivalent under
push/tags) so Makefile-only changes (e.g., the new pool-selection entrypoint)
will run Operator CI.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 7e4948b8-d33b-457a-bc14-925304c675bb
📒 Files selected for processing (48)
.claude/CLAUDE.md.github/workflows/agent-ci.yaml.github/workflows/agentless-container.yaml.github/workflows/cli-release.yaml.github/workflows/commit-linting.yaml.github/workflows/lint-ci.yaml.github/workflows/operator-ci.yaml.github/workflows/security-checkov.yaml.gitlab-ci.ymlchart/README.mdci/.markdownlint-cli2.yamlci/lycheeignoreci/yamllint.yamldocs/README.mddocs/ci-test-pools.mddocs/designs/uninstall-enhancement.mddocs/metrics/README.mdk8s-tests/chainsaw/skyhook/auto-taint-new-nodes/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/cleanup-pods/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/config-skyhook/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/delete-blocked-when-disabled/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/delete-blocked-when-paused/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/delete-skyhook/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/depends-on/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/downgrade-after-uninstall/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/downgrade-enabled-false-preserves-state/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/explicit-uninstall/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/failure-skyhook/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/hello-world/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/interrupt-grouping/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/interrupt/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/malformed-node-state/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/package-upgrade/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/pod-finalizer/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/runtime-required/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/simple-skyhook/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/simple-update-skyhook/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/skyhook-upgrade/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/strict-order/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/taint-scheduling/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/uninstall-cancel-during-interrupt/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/uninstall-cancel/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/uninstall-fix-config/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/uninstall-mixed-packages/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/uninstall-on-delete/chainsaw-test.yamlk8s-tests/chainsaw/skyhook/validate-packages/chainsaw-test.yamloperator/Makefileoperator/deps.mk
💤 Files with no reviewable changes (1)
- operator/deps.mk
003fce0 to
21bac1f
Compare
a1f06c0 to
835cfea
Compare
835cfea to
77f735b
Compare
Summary
Started as "parallelize e2e tests by pool" but grew to encompass several adjacent CI improvements that fell out of the work. Net effect: ~12–18 min wall-clock per K8s row (down from 40–45), a unified
ci-gaterequired check, repo-wide markdown/yaml/workflow/link linting, helm-chart publishing via OCI, and assorted hardening.1. CI test pools (the headline)
Splits the 29-test skyhook chainsaw suite into 4 labeled pools running as parallel matrix jobs, one kind cluster per job.
operator/Makefile—e2e-testshonors aPOOLvariable.POOL=<name>→chainsaw --selector pool=<name>.POOL=core→chainsaw --selector 'pool notin (interrupt,uninstall,lifecycle)'so unlabeled tests fall through to core (default-pool fallback).POOL_NON_DEFAULT..github/workflows/operator-ci.yaml—testsmatrix gains apoolaxis: 4 pools × 4 K8s versions = 16 e2e jobs (was 4).cli-e2e/deployment-policy/unit-tests/helm-testsinclude-rows unchanged.coverage-<suite>[-<pool>]-k8s-<v>) so sibling pool rows don't collide. The existingupload-coveragemerge step still globscoverage-*and produces a single combined profile equivalent to today's —make e2e-testsandmerge-coverageare unchanged.k8s-tests/chainsaw/skyhook/get ametadata.labels.pool: <name>label.Wall-clock was the priority; the runner-minute increase is acceptable.
2. Unified
ci-gatefor branch protectionEvery workflow that gates merge publishes a check named
ci-gate. GitHub composes same-named required checks across workflows, so branch protection onmainonly ever needs the single name.Workflows publishing
ci-gate:ci-gate.needs:lint-ci.yaml(NEW)[meta-lint]operator-ci.yaml[tests, build-operator]agent-ci.yaml[test, build-agent]commit-linting.yaml[commit-linting]security-checkov.yaml[build]agentless-container.yaml[build-and-publish-agentless]Each gate uses
if: always()+jq -e 'all(.value.result == "success")'— the standard fix for GitHub's "skipped == green" pitfall.needs:only includes jobs that always run. Conditionally-skipped jobs (create-manifest/upload-coverage/operator-agent-testson fork PRs and tag builds) are deliberately excluded so legitimate skips don't fail the gate.lint-ci.yamlhas no path filter, guaranteeing aci-gateposts on every PR — including doc-only changes that wouldn't otherwise trigger any other workflow.3. Lint CI workflow (new)
.github/workflows/lint-ci.yamlruns on every PR with four steps:v1.7.7) with SHA256 verification. Shellcheck integration disabled (existing inline scripts have ~30 pre-existing SC2086/SC2155 warnings; tracked separately).1.38.0via pip, configured byci/yamllint.yaml.ci/.markdownlint-cli2.yaml.v0.21.0) with SHA256 verification, configured byci/lycheeignore. UsesGITHUB_TOKENfor github.com rate-limiting.Configs live at
ci/(visible directory, not dotfile) and each carries an Apache 2.0 SPDX header. The markdownlint config doubles as documentation — every rule we evaluated is listed explicitly with a one-line rationale (17 enforced bug-catchers, 9 disabled with reasons).4. Helm chart publishing via OCI native (drops
cm-pushplugin)Helm 4.1.x added default plugin-source verification, breaking installs of the unsigned
chartmuseum/helm-pushplugin (Error: plugin source does not support verification)..gitlab-ci.ymlpublish-chartjob switched fromhelm cm-pushtohelm push oci://nvcr.io/<path>(Helm 4 native), withhelm registry login nvcr.io --password-stdin.operator/deps.mkdrops thecm-pushplugin install entirely (no other consumer).⚠ Pull-side consumers using the chartmuseum URL must switch to
helm pull oci://nvcr.io/nvidian/swgpu-baseos/skyhook-operatoretc. The OCI path mapping (helm.ngc.nvidia.com/<org>/<team>→oci://nvcr.io/<org>/<team>) should be confirmed against NGC before merging, ideally with a one-shot test push.5. Markdownlint auto-fix sweep + manual fixes
Running
markdownlint-cli2 --fixover the repo with the new config touched ~37 markdown files, normalizing:Manual follow-ups (12 files, structural fixes the auto-fixer can't do):
operator/README.md).# H1s demoted to## H2indocs/metrics/,docs/operator_resources_at_scale.md,docs/releases.md,docs/runtime_required.md.chart/README.md,KnownIssues.md.CONTRIBUTING.md→ fenced.docs/metrics/.Plus three real broken doc links lychee surfaced (rename-era leftovers):
chart/README.md,docs/metrics/README.md,docs/designs/uninstall-enhancement.md.Disabled markdownlint rules and reasons live in
ci/.markdownlint-cli2.yaml— the config is self-documenting.6. Misc CI hardening
FORCE_JAVASCRIPT_ACTIONS_TO_NODE24: trueadded to every workflow's top-levelenv:to silence the Node 20 deprecation warning ahead of GitHub's June 2026 phase-out.ACTIONLINT_VERSION,LYCHEE_VERSION,YAMLLINT_VERSIONenv vars; matches the existing pinning pattern.operator/deps.mk(no longer needed thanks to OCI switch — see §4).7. Documentation
docs/ci-test-pools.md(new) — pool taxonomy, localPOOL=<name>usage, three-file process for adding a new pool, coverage-merge invariant,ci-gatepattern. Linked fromdocs/README.md..claude/CLAUDE.md— briefPOOL=<name>hint in the common commands section.Checklist
git commit -s) per the DCO.