reliably use flox in CI #2
Pull request overview
This PR migrates CI workflows from custom GitHub Actions with manually managed tool versions to Flox-based builds, simplifying dependency management and improving reliability.
Changes:
- Replaced custom GitHub Actions (`go-ci`, `go-build-release`, `load-versions`, etc.) with Flox-based tooling
- Added new Flox actions (`setup-flox`, `flox-run`) to manage the development environment
- Consolidated tool versions in `.flox/env/manifest.toml` instead of scattered YAML files
Reviewed changes
Copilot reviewed 12 out of 13 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `.github/workflows/on-tag.yaml` | Replaced version loading and custom actions with Flox-based build/test/release steps |
| `.github/workflows/on-push.yaml` | Split monolithic test job into parallel test/lint/security-scan jobs using Flox |
| `.github/workflows/e2e-test.yaml` | Migrated E2E tests to run all commands via Flox instead of manual tool installation |
| `.github/actions/setup-flox/action.yml` | New action to install and configure Flox with Nix cache |
| `.github/actions/flox-run/action.yml` | New action to run commands inside Flox environment |
| `.github/actions/attest-image-from-tag/action.yml` | Updated to use Flox for crane instead of manual installation |
| `.flox/env/manifest.toml` | Added E2E testing tools (crane, ctlptl, helm, kind, kubectl, syft, tilt) to Flox environment |
| `.github/actions/setup-build-tools/action.yml` | Removed (replaced by Flox) |
| `.github/actions/load-versions/action.yml` | Removed (replaced by Flox) |
| `.github/actions/install-e2e-tools/action.yml` | Removed (replaced by Flox) |
| `.github/actions/go-ci/action.yml` | Removed (replaced by Flox) |
| `.github/actions/go-build-release/action.yml` | Removed (replaced by Flox) |
19abdc4 to 938cbea
…dos into cullen/add-flox-in-ci
I have some concerns about the divergence between local development and CI here. By introducing new tooling for PR qualification that differs from our standard make commands, we’re creating a 'works on my machine' gap for anyone not using Flox. To keep the barrier to entry low for new Go contributors, I think we should ensure our primary qualification path remains idiomatic and accessible without requiring additional toolchains.
We can certainly provide the Flox way as an alternative without making it a must-have, right?
I think it would be beneficial for us to make Flox the happy path. It provides much more consistency and reliability than any homegrown tool we will make. We can keep it as a side option if that's the consensus from the group, but I have not had a team have a bad experience using Flox to provide all of its tools. It's a single tool that can encapsulate everything else, which will really simplify onboarding and ensure everyone has the right pieces installed at the right versions. I would like us to at least give it a try, and if after some time it's not working well for us, we can back it out.
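One way the two positions above could coexist is a thin wrapper that runs a command inside the Flox environment when Flox is available and otherwise runs it directly. The sketch below is only an illustration: the function name `run_in_flox`, the `FLOX_BIN` override, and the fallback behavior are hypothetical and are not taken from this PR's `flox-run` action. It relies only on `flox activate -- <command>`, which runs a command inside the environment defined under `.flox/`.

```shell
# Hypothetical helper, not the flox-run action from this PR.
# FLOX_BIN is an assumed override so the helper can be exercised
# on a machine where Flox is not installed.
run_in_flox() {
  flox_bin=${FLOX_BIN:-flox}
  if command -v "$flox_bin" >/dev/null 2>&1; then
    # Run the command inside the Flox environment of the current repo.
    "$flox_bin" activate -- "$@"
  else
    # Assumed fallback: run the command with whatever tools are on PATH.
    echo "flox not found; running directly: $*" >&2
    "$@"
  fi
}
```

With this in place, `run_in_flox make qualify` would use the pinned tool versions when Flox is installed and plain `make qualify` otherwise. Whether such a fallback is desirable is exactly the question under discussion, since it reintroduces the possibility of version drift between local runs and CI.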
values.yaml sets gc.enable: true and registry.yaml configures GC nodeScheduling paths, but the health check only validated Master and Worker. Adds an assertion on the node-feature-discovery-gc Deployment so a broken GC is caught by the conformance run. Addresses @mchmarny review comment NVIDIA#2 on NVIDIA#518. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
The `nvsentinel` registry entry declared:

    defaultRepository: https://helm.ngc.nvidia.com/nvidia
    defaultChart: nvidia/nvsentinel

But the chart isn't published to the HTTPS NGC index, only to the
OCI registry at `oci://ghcr.io/nvidia/nvsentinel`. The defaults are
silently ignored today: every nvsentinel-using overlay sets its own
`source: oci://ghcr.io/nvidia` + chart `nvsentinel`, so the broken
HTTPS default never resolves. But anyone relying on the registry
defaults (e.g. via `aicr bundle` without explicit overlay overrides
on this entry) would hit the dead path.

Update the defaults to match what every overlay already uses:

    defaultRepository: oci://ghcr.io/nvidia
    defaultChart: nvsentinel

Same shape as the kai-scheduler entry post-NVIDIA#720 (OCI registry path
in `defaultRepository`, bare chart name in `defaultChart`). Verified
locally:

    $ helm pull oci://ghcr.io/nvidia/nvsentinel --version v1.3.0
    Pulled.
    $ aicr bundle -r recipe.yaml -o /tmp/bundle
    ... generates upstream.env with
    CHART='oci://ghcr.io/nvidia/nvsentinel'
    REPO=''
    VERSION='v1.3.0'

Note: other NGC HTTPS entries in the registry (gpu-operator,
network-operator, nodewright-operator, nvidia-dra-driver-gpu) are
unchanged; those charts are genuinely served by the HTTPS NGC
index. nvsentinel is special because it ships only via OCI.

Refs: NVIDIA#698 (Phase 1 follow-up NVIDIA#2)
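To make the difference between the two default shapes concrete, here is a minimal sketch of how a chart reference might be derived from `defaultRepository` and `defaultChart`. The function `resolve_chart` is hypothetical and is not the `aicr` implementation. The OCI branch reproduces the `CHART` and `REPO` values from the `upstream.env` shown above; the non-OCI branch is an assumption about how an HTTPS index entry would be emitted.

```shell
# Hypothetical helper, not taken from aicr.
# Usage: resolve_chart <defaultRepository> <defaultChart>
resolve_chart() {
  repo=$1
  chart=$2
  case "$repo" in
    oci://*)
      # OCI: registry path and chart name join into one reference; REPO stays empty.
      printf "CHART='%s/%s'\nREPO=''\n" "${repo%/}" "$chart"
      ;;
    *)
      # Assumed shape for HTTPS index entries: chart name plus a separate repo URL.
      printf "CHART='%s'\nREPO='%s'\n" "$chart" "$repo"
      ;;
  esac
}
```

Calling `resolve_chart oci://ghcr.io/nvidia nvsentinel` prints `CHART='oci://ghcr.io/nvidia/nvsentinel'` followed by `REPO=''`, matching the verified output in the commit message.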
Summary
Standardizes all CI workflows on using Flox to source tool versions. Ensures we use the same versions across CI and local dev.
Motivation / Context
This is about standardization and ease of use for contributors.
Type of Change
Component(s) Affected
- `cmd/eidos`, `pkg/cli`
- `cmd/eidosd`, `pkg/api`, `pkg/server`
- `pkg/recipe`
- `pkg/bundler`, `pkg/component/*`
- `pkg/collector`, `pkg/snapshotter`
- `pkg/validator`
- `pkg/errors`, `pkg/k8s`
- `docs/`, `examples/`
Implementation Notes
Testing
    # Commands run (prefer `make qualify` for non-trivial changes)
    make qualify
Risk Assessment
Rollout notes:
Checklist
- `make test` with `-race`
- `make lint`
- `git commit -s` (DCO info)