Skip to content

reliably use flox in CI#2

Closed
cullenmcdermott wants to merge 8 commits into
NVIDIA:mainfrom
cullenmcdermott:cullen/add-flox-in-ci
Closed

reliably use flox in CI#2
cullenmcdermott wants to merge 8 commits into
NVIDIA:mainfrom
cullenmcdermott:cullen/add-flox-in-ci

Conversation

@cullenmcdermott
Copy link
Copy Markdown
Contributor

@cullenmcdermott cullenmcdermott commented Jan 31, 2026

Summary

Standardizes all CI workflows on using Flox the source tool versions. Ensures we are using the same versions across CI and local dev

Motivation / Context

This is about standardization and ease of use for contributors.

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Refactoring (no functional changes)
  • Build/CI/tooling

Component(s) Affected

  • CLI (cmd/eidos, pkg/cli)
  • API server (cmd/eidosd, pkg/api, pkg/server)
  • Recipe engine / data (pkg/recipe)
  • Bundlers (pkg/bundler, pkg/component/*)
  • Collectors / snapshotter (pkg/collector, pkg/snapshotter)
  • Validator (pkg/validator)
  • Core libraries (pkg/errors, pkg/k8s)
  • Docs/examples (docs/, examples/)
  • Other: ____________

Implementation Notes

Testing

# Commands run (prefer `make qualify` for non-trivial changes)
make qualify

Risk Assessment

  • Low — Isolated change, well-tested, easy to revert
  • Medium — Touches multiple components or has broader impact
  • High — Breaking change, affects critical paths, or complex rollout

Rollout notes:

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed
  • Changes follow existing patterns in the codebase
  • Commits are signed off (git commit -s) — DCO info

Copilot AI review requested due to automatic review settings January 31, 2026 00:04
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR migrates CI workflows from custom GitHub Actions with manually managed tool versions to Flox-based builds, simplifying dependency management and improving reliability.

Changes:

  • Replaced custom GitHub Actions (go-ci, go-build-release, load-versions, etc.) with Flox-based tooling
  • Added new Flox actions (setup-flox, flox-run) to manage development environment
  • Consolidated tool versions in .flox/env/manifest.toml instead of scattered YAML files

Reviewed changes

Copilot reviewed 12 out of 13 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
.github/workflows/on-tag.yaml Replaced version loading and custom actions with Flox-based build/test/release steps
.github/workflows/on-push.yaml Split monolithic test job into parallel test/lint/security-scan jobs using Flox
.github/workflows/e2e-test.yaml Migrated E2E tests to run all commands via Flox instead of manual tool installation
.github/actions/setup-flox/action.yml New action to install and configure Flox with Nix cache
.github/actions/flox-run/action.yml New action to run commands inside Flox environment
.github/actions/attest-image-from-tag/action.yml Updated to use Flox for crane instead of manual installation
.flox/env/manifest.toml Added E2E testing tools (crane, ctlptl, helm, kind, kubectl, syft, tilt) to Flox environment
.github/actions/setup-build-tools/action.yml Removed (replaced by Flox)
.github/actions/load-versions/action.yml Removed (replaced by Flox)
.github/actions/install-e2e-tools/action.yml Removed (replaced by Flox)
.github/actions/go-ci/action.yml Removed (replaced by Flox)
.github/actions/go-build-release/action.yml Removed (replaced by Flox)

Comment thread .github/workflows/on-tag.yaml
Comment thread .github/workflows/on-push.yaml Outdated
Comment thread .github/workflows/on-push.yaml
Copilot AI review requested due to automatic review settings January 31, 2026 00:13
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 13 out of 14 changed files in this pull request and generated 1 comment.

Comment thread .github/actions/flox-run/action.yml
Copilot AI review requested due to automatic review settings January 31, 2026 00:43
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 14 out of 15 changed files in this pull request and generated 1 comment.

Comment thread .github/workflows/on-tag.yaml
Copilot AI review requested due to automatic review settings January 31, 2026 00:48
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 13 out of 14 changed files in this pull request and generated no new comments.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 14 out of 15 changed files in this pull request and generated no new comments.

@mchmarny mchmarny self-requested a review January 31, 2026 11:50
Copy link
Copy Markdown
Member

@mchmarny mchmarny left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have some concerns about the divergence between local development and CI here. By introducing new tooling for PR qualification that differs from our standard make commands, we’re creating a 'works on my machine' gap for anyone not using Flox. To keep the barrier to entry low for new Go contributors, I think we should ensure our primary qualification path remains idiomatic and accessible without requiring additional toolchains.

We certainly can provide the Flow way as an alternative without making it must-have, right?

@cullenmcdermott
Copy link
Copy Markdown
Contributor Author

I have some concerns about the divergence between local development and CI here. By introducing new tooling for PR qualification that differs from our standard make commands, we’re creating a 'works on my machine' gap for anyone not using Flox. To keep the barrier to entry low for new Go contributors, I think we should ensure our primary qualification path remains idiomatic and accessible without requiring additional toolchains.

We certainly can provide the Flow way as an alternative without making it must-have, right?

I think it would be beneficial for us to make flox the happy path. It provides much more consistency and reliability than any home grown tool we will make. Can keep it as a side thing if that's the consensus from the group but I have not had a team have a bad experience leveraging Flox to provide all of our tools. Its a single tool that can encapsulate everything else, this will really simplify onboarding and ensuring everyone has the right pieces installed at the right versions. I would like for us to at least give it a try and if after some time its not working well for us we can back it out.

ArangoGutierrez added a commit to ArangoGutierrez/aicr that referenced this pull request Apr 21, 2026
values.yaml sets gc.enable: true and registry.yaml configures GC nodeScheduling paths, but the health check only validated Master and Worker. Adds an assertion on the node-feature-discovery-gc Deployment so a broken GC is caught by the conformance run. Addresses @mchmarny review comment NVIDIA#2 on NVIDIA#518.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
ArangoGutierrez added a commit to ArangoGutierrez/aicr that referenced this pull request Apr 22, 2026
values.yaml sets gc.enable: true and registry.yaml configures GC nodeScheduling paths, but the health check only validated Master and Worker. Adds an assertion on the node-feature-discovery-gc Deployment so a broken GC is caught by the conformance run. Addresses @mchmarny review comment NVIDIA#2 on NVIDIA#518.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
ArangoGutierrez added a commit to ArangoGutierrez/aicr that referenced this pull request Apr 22, 2026
values.yaml sets gc.enable: true and registry.yaml configures GC nodeScheduling paths, but the health check only validated Master and Worker. Adds an assertion on the node-feature-discovery-gc Deployment so a broken GC is caught by the conformance run. Addresses @mchmarny review comment NVIDIA#2 on NVIDIA#518.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
ArangoGutierrez added a commit to ArangoGutierrez/aicr that referenced this pull request Apr 22, 2026
values.yaml sets gc.enable: true and registry.yaml configures GC nodeScheduling paths, but the health check only validated Master and Worker. Adds an assertion on the node-feature-discovery-gc Deployment so a broken GC is caught by the conformance run. Addresses @mchmarny review comment NVIDIA#2 on NVIDIA#518.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
ArangoGutierrez added a commit to ArangoGutierrez/aicr that referenced this pull request Apr 23, 2026
values.yaml sets gc.enable: true and registry.yaml configures GC nodeScheduling paths, but the health check only validated Master and Worker. Adds an assertion on the node-feature-discovery-gc Deployment so a broken GC is caught by the conformance run. Addresses @mchmarny review comment NVIDIA#2 on NVIDIA#518.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Apr 30, 2026
The `nvsentinel` registry entry declared:

    defaultRepository: https://helm.ngc.nvidia.com/nvidia
    defaultChart: nvidia/nvsentinel

But the chart isn't published to the HTTPS NGC index — only to the
OCI registry at `oci://ghcr.io/nvidia/nvsentinel`. The defaults are
silently ignored today: every nvsentinel-using overlay sets its own
`source: oci://ghcr.io/nvidia` + chart `nvsentinel`, so the broken
HTTPS default never resolves. But anyone relying on the registry
defaults (e.g. via `aicr bundle` without explicit overlay overrides
on this entry) would hit the dead path.

Update the defaults to match what every overlay already uses:

    defaultRepository: oci://ghcr.io/nvidia
    defaultChart: nvsentinel

Same shape as the kai-scheduler entry post-NVIDIA#720 (OCI registry path
in `defaultRepository`, bare chart name in `defaultChart`). Verified
locally:

  $ helm pull oci://ghcr.io/nvidia/nvsentinel --version v1.3.0
  Pulled.
  $ aicr bundle -r recipe.yaml -o /tmp/bundle
  ... generates upstream.env with
      CHART='oci://ghcr.io/nvidia/nvsentinel'
      REPO=''
      VERSION='v1.3.0'

Note: other NGC HTTPS entries in the registry (gpu-operator,
network-operator, nodewright-operator, nvidia-dra-driver-gpu) are
unchanged — those charts are genuinely served by the HTTPS NGC
index. nvsentinel is special because it ships only via OCI.

Refs: NVIDIA#698 (Phase 1 follow-up NVIDIA#2)
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Apr 30, 2026
The `nvsentinel` registry entry declared:

    defaultRepository: https://helm.ngc.nvidia.com/nvidia
    defaultChart: nvidia/nvsentinel

But the chart isn't published to the HTTPS NGC index — only to the
OCI registry at `oci://ghcr.io/nvidia/nvsentinel`. The defaults are
silently ignored today: every nvsentinel-using overlay sets its own
`source: oci://ghcr.io/nvidia` + chart `nvsentinel`, so the broken
HTTPS default never resolves. But anyone relying on the registry
defaults (e.g. via `aicr bundle` without explicit overlay overrides
on this entry) would hit the dead path.

Update the defaults to match what every overlay already uses:

    defaultRepository: oci://ghcr.io/nvidia
    defaultChart: nvsentinel

Same shape as the kai-scheduler entry post-NVIDIA#720 (OCI registry path
in `defaultRepository`, bare chart name in `defaultChart`). Verified
locally:

  $ helm pull oci://ghcr.io/nvidia/nvsentinel --version v1.3.0
  Pulled.
  $ aicr bundle -r recipe.yaml -o /tmp/bundle
  ... generates upstream.env with
      CHART='oci://ghcr.io/nvidia/nvsentinel'
      REPO=''
      VERSION='v1.3.0'

Note: other NGC HTTPS entries in the registry (gpu-operator,
network-operator, nodewright-operator, nvidia-dra-driver-gpu) are
unchanged — those charts are genuinely served by the HTTPS NGC
index. nvsentinel is special because it ships only via OCI.

Refs: NVIDIA#698 (Phase 1 follow-up NVIDIA#2)
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 4, 2026

This pull request has been automatically locked since it has been closed for 90 days with no further activity. Please open a new pull request for related changes.

@github-actions github-actions Bot locked as resolved and limited conversation to collaborators May 4, 2026
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants