Skip to content

feat: custom resources must explicity set their helm hooks OR opt out#66

Merged
ayuskauskas merged 2 commits intomainfrom
fix_crd_detection
Feb 5, 2026
Merged

feat: custom resources must explicity set their helm hooks OR opt out#66
ayuskauskas merged 2 commits intomainfrom
fix_crd_detection

Conversation

@ayuskauskas
Copy link
Contributor

Summary

Tests will check if helm.sh/hook is on custom resources. If not it will error with suggested fix. The fix is to either add them or add a skip annotation if the user knows it isn't needed.

Motivation / Context

The logic behind this is that the developer of the manifest knows best if the resource needs additions hooks or not. Attempting to detect and add automatically can have unintended behaviour. By being declarative and checking we get the best of both worlds: known state and avoiding un-intentional issues.

Fixes: Issue with Skyhook-customizations needing to be after operator installs
Related:

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Refactoring (no functional changes)
  • Build/CI/tooling

Component(s) Affected

  • CLI (cmd/eidos, pkg/cli)
  • API server (cmd/eidosd, pkg/api, pkg/server)
  • Recipe engine / data (pkg/recipe)
  • Bundlers (pkg/bundler, pkg/component/*)
  • Collectors / snapshotter (pkg/collector, pkg/snapshotter)
  • Validator (pkg/validator)
  • Core libraries (pkg/errors, pkg/k8s)
  • Docs/examples (docs/, examples/)
  • Other: ____________

Implementation Notes

Problem

The automatic detection of CRD-dependent resources fails because manifest files contain embedded Helm templates (e.g., {{ .Release.Service }}) that break YAML parsing. Instead of trying to reliably strip Helm syntax, we shift responsibility to manifest authors.

Solution

  1. Require explicit annotations on CRD-dependent manifests (Custom Resources)
  2. Add a validation test that runs during make test to catch violations early
  3. Remove the unreliable auto-detection code from the bundler

Implementation

1. Update the Skyhook manifest to include Helm hooks

File: pkg/recipe/data/components/skyhook-customizations/manifests/h100-eks-ubuntu-training.yaml

Add annotations after metadata::

metadata:
  annotations:
    "helm.sh/hook": post-install,post-upgrade
    "helm.sh/hook-weight": "10"
    "helm.sh/hook-delete-policy": before-hook-creation
  labels:
    ...

2. Add validation test

File: pkg/recipe/yaml_test.go (new test function)

// TestManifestHelmHooksRequired validates that CRD-dependent manifests
// have the required Helm hook annotations for proper deployment ordering.
func TestManifestHelmHooksRequired(t *testing.T) {
    // 1. Walk all files in data/components/*/manifests/*.yaml
    // 2. For each file, extract apiVersion using regex (avoiding YAML parse issues)
    // 3. If apiVersion is NOT a standard K8s API, check for:
    //    - helm.sh/hook annotation (required), OR
    //    - eidos/skip-hook-validation: "true" annotation (opt-out)
    // 4. Fail with helpful message suggesting the fix
}

Standard K8s APIs (no hooks needed): v1, apps/v1, batch/v1, networking.k8s.io/*, rbac.authorization.k8s.io/*, etc.

3. Simplify the bundler code

File: pkg/bundler/deployer/helm/helm.go

  • Remove isCRDDependentResource() function (unreliable)
  • Remove addHelmHookAnnotation() function (no longer needed)
  • Simplify generateTemplates() to just copy manifest content as-is

4. Error message format

When a manifest fails validation:

manifest "h100-eks-ubuntu-training.yaml" has custom apiVersion "skyhook.nvidia.com/v1alpha1"
but is missing required Helm hook annotations.

CRD-dependent resources must include these annotations to ensure proper deployment ordering:

  metadata:
    annotations:
      "helm.sh/hook": post-install,post-upgrade
      "helm.sh/hook-weight": "10"
      "helm.sh/hook-delete-policy": before-hook-creation

To skip this validation (not recommended), add:
  metadata:
    annotations:
      eidos/skip-hook-validation: "true"

Testing

# Commands run (prefer `make qualify` for non-trivial changes)
make qualify

Risk Assessment

  • Low — Isolated change, well-tested, easy to revert
  • Medium — Touches multiple components or has broader impact
  • High — Breaking change, affects critical paths, or complex rollout

Rollout notes:

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed
  • Changes follow existing patterns in the codebase
  • Commits are signed off (git commit -s) — DCO info

Tests will check if helm.sh/hook is on custom resources. If not it will error with suggested fix.
The fix is to either add them or add a skip annotation if the user knows it isn't needed.

The logic behind this is that the developer of the manifest knows best if the resource needs additions hooks or not. Attempting to detect and add automatically can have unintended behaviour. By being declarative and checking we get the best of both worlds: known state and avoiding un-intentional issues.
@ayuskauskas ayuskauskas requested review from a team as code owners February 5, 2026 20:04
@yuanchen8911 yuanchen8911 self-requested a review February 5, 2026 20:10
yuanchen8911
yuanchen8911 previously approved these changes Feb 5, 2026
Copy link
Contributor

@yuanchen8911 yuanchen8911 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@github-actions
Copy link

github-actions bot commented Feb 5, 2026

Coverage Report ✅

Metric Value
Coverage 74.4%
Threshold 70%
Status Pass
Coverage Badge
![Coverage](https://img.shields.io/badge/coverage-74.4%25-green)

Merging this branch will increase overall coverage

Impacted Packages Coverage Δ 🤖
github.com/NVIDIA/eidos/pkg/bundler/deployer/helm 76.29% (+19.80%) 🎉

Coverage by file

Changed files (no unit tests)

Changed File Coverage Δ Total Covered Missed 🤖
github.com/NVIDIA/eidos/pkg/bundler/deployer/helm/helm.go 76.29% (+19.80%) 194 (-68) 148 46 (-68) 🎉

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

…validation

Add a --wait --timeout 10m to the example helm install command.
fix(skyhook-customizations/h100-eks-ubuntu-training): replace _ with - to conform to name validation
Copy link
Contributor

@yuanchen8911 yuanchen8911 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@ayuskauskas ayuskauskas requested a review from mchmarny February 5, 2026 21:10
@ayuskauskas ayuskauskas self-assigned this Feb 5, 2026
Copy link
Member

@mchmarny mchmarny left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@ayuskauskas ayuskauskas merged commit 386b39a into main Feb 5, 2026
16 checks passed
@ayuskauskas ayuskauskas deleted the fix_crd_detection branch February 5, 2026 21:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants