Skip to content

fix(bundler): synthesize GKE ResourceQuota for critical-priority pods#921

Merged
mchmarny merged 2 commits into
mainfrom
fix/915-gke-critical-priority-quota
May 15, 2026
Merged

fix(bundler): synthesize GKE ResourceQuota for critical-priority pods#921
mchmarny merged 2 commits into
mainfrom
fix/915-gke-critical-priority-quota

Conversation

@mchmarny
Copy link
Copy Markdown
Member

Summary

GKE Standard ships a kube-system ResourceQuota scoped to system-*-critical PriorityClasses. Per the Kubernetes spec, once any cluster-wide quota scopes by PriorityClass for those values, pods that request a matching priority class can only be created in namespaces that have a matching quota. AICR bundles emit gpu-operator (and any chart that defaults priorityClassName to system-node-critical) into per-component namespaces, so admission blocks every pod and helmfile apply --wait times out after ~10 minutes with an opaque Replicas: 0/1 error.

Fixes: #915
Related: N/A

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • Documentation update (demo workaround removed)

Component(s) Affected

  • Bundlers (pkg/bundler, pkg/component/*)
  • Recipe engine / data (pkg/recipe)
  • Docs/examples (docs/, examples/)

Implementation Notes

  • Schema: new gkeCriticalPriority field on registry component entries (pkg/recipe/components.go). gpu-operator is marked today; future components that emit system-*-critical pods opt in by setting the same field.
  • Synthesis seam: pkg/bundler/bundler.go::injectGKECriticalPriorityQuotas runs inside collectComponentPreManifests, so every deployer (helm, helmfile, argocd, argocd-helm, flux) benefits without per-deployer branching. The synthesized YAML is added under filename aicr/synthesized/gke-critical-pods-quota.yaml in the per-component pre-manifest map; the directory prefix prevents collision with any real PreManifestFiles path declared by an overlay.
  • Pods cap formula: max(criteria.Nodes × 32, 32). The 32-per-node multiplier covers gpu-operator's ~8–10 critical-priority DaemonSets per GPU node (driver, toolkit, device-plugin, GFD, DCGM, DCGM exporter, MIG manager, validator) plus the controller Deployment, with headroom for rolling-update churn. The 32-pod floor handles recipes that did not declare --nodes (e.g., demo configs). For a 2,000-node cluster the cap is 64,000 — still an admission allowlist, not a real capacity cap.
  • Phase choice: PreManifestFiles, not ManifestFiles. The chart's pods are rejected by admission until the quota exists, so the quota must land first. (The prior recipes/components/gpu-operator/manifests/gke-resource-quota.yaml was in ManifestFiles, applied AFTER the chart — that's why the demo workaround had users kubectl apply the quota manually before helmfile apply.)
  • Non-GKE recipes are unchanged: injectGKECriticalPriorityQuotas short-circuits when criteria.Service \!= gke, so EKS / AKS / OKE / bare-metal / kind bundles never see the synthesized manifest.
  • Removed superseded artifacts: recipes/components/gpu-operator/manifests/gke-resource-quota.yaml (file) and the corresponding manifestFiles entry in recipes/overlays/gke-cos.yaml. The bundler now handles this for every GKE overlay (including future gke-ubuntu etc.), not just gke-cos.

Testing

make qualify

New tests in pkg/bundler/bundler_gke_quota_test.go:

  • TestComputeGKECriticalPriorityQuotaPods (6 sub-cases): floor for 0/negative, identity at 1 node, computed for 8/100/2000 nodes.
  • TestRenderGKECriticalPriorityQuota: shape pin on apiVersion/kind/metadata/spec/scopeSelector.
  • TestInjectGKECriticalPriorityQuotas (6 sub-cases): gke+marked+nodes → cap; gke+marked+zero → floor; non-gke+marked → no-op; gke+unmarked → no-op; gke+marked+empty-namespace → skipped with warning; gke+mixed → only marked component synthesized.
  • TestInjectGKECriticalPriorityQuotas_CoexistsWithExistingPreManifests: additive merge with existing PreManifestFiles.
  • TestInjectGKECriticalPriorityQuotas_NilInputs: nil-tolerant contract.

Coverage on pkg/bundler: stable, full project coverage 76.4% (over 75% threshold).

Risk Assessment

  • Low — Isolated change, well-tested, easy to revert.

Rollout notes: Additive on non-GKE recipes (short-circuits). For GKE recipes, the synthesized quota is applied before the chart, fixing the previously-broken default flow. No new CLI flags, no API surface changes. Existing GKE-COS overlays no longer need the manual manifestFiles entry — removed in the same PR.

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

GKE Standard ships a kube-system ResourceQuota scoped to
system-*-critical PriorityClasses. Per the Kubernetes spec, once any
cluster-wide quota scopes by PriorityClass for those values, pods
that request a matching priority class can only be created in
namespaces that have a matching quota. AICR bundles emit gpu-operator
(and any other chart that defaults priorityClassName to
system-node-critical) into per-component namespaces, so admission
blocks every pod and `helmfile apply --wait` times out after ~10
minutes with an opaque "Replicas: 0/1" error.

Fix: add a `gkeCriticalPriority` field to the registry schema. When
the recipe's criteria.service is "gke" and a referenced component
declares it, the bundler synthesizes a permissive ResourceQuota
(pods = max(criteria.Nodes × 32, 32)) into the component's namespace
via PreManifestFiles so the quota exists before the chart's pods
attempt admission. Marks gpu-operator today; supersedes the
overlay-driven manifestFiles entry in gke-cos.yaml (deleted).

Demo cuj1-gke-config.md: drops the manual kubectl apply workaround.

Fixes #915
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 15, 2026

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 86eb8345-716a-4a1e-9093-725286a7faaa

📥 Commits

Reviewing files that changed from the base of the PR and between fccadce and abfb291.

📒 Files selected for processing (2)
  • pkg/bundler/bundler.go
  • pkg/bundler/bundler_gke_quota_test.go

📝 Walkthrough

Walkthrough

This PR automates GKE ResourceQuota injection for components that declare system-*-critical priority classes. The bundler now synthesizes per-namespace ResourceQuota manifests when a recipe targets GKE and components are marked gkeCriticalPriority: true; pod limits are computed from node count with a floor of 32. YAML rendering is deterministic. The static gke-resource-quota manifest and the demo workaround instructions were removed. Tests cover computation, rendering, injection gating, coexistence, and nil-handling.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Possibly related PRs

  • NVIDIA/aicr#917: Main PR’s bundler now synthesizes/injects the GKE “critical priority” ResourceQuota automatically and the updated demos/cuj1-gke-config.md removes the manual ResourceQuota workaround that retrieved PR #917 documented as workaround #915.

Suggested labels

area/tests, size/L

Suggested reviewers

  • lalitadithya
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically summarizes the main change: synthesizing GKE ResourceQuota for critical-priority pods, which is the primary objective of this PR.
Description check ✅ Passed The description is detailed and directly related to the changeset, covering the problem statement, implementation approach, testing, and risks in a way that clearly ties to the code changes.
Linked Issues check ✅ Passed The PR fully addresses issue #915 by implementing option A (bundler-side synthesis). It adds the gkeCriticalPriority field, synthesizes ResourceQuota via injectGKECriticalPriorityQuotas, uses the specified pods formula, and ensures GKE-only application with comprehensive test coverage.
Out of Scope Changes check ✅ Passed All changes are directly scoped to the linked issue #915. The removal of gke-resource-quota.yaml, updates to registry and overlays, new bundler logic, schema additions, and comprehensive tests are all necessary to implement the synthesized quota solution.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/915-gke-critical-priority-quota

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/bundler/bundler.go`:
- Around line 1305-1330: The current renderGKECriticalPriorityQuota builds the
quota map and calls yaml.Marshal(quota) which yields non-deterministic YAML;
replace that call with serializer.MarshalYAMLDeterministic(quota) and update
imports to include the serializer package used elsewhere in the repo (add the
serializer import to the file's import block). Keep the same quota variable and
return signature so only the call site and imports change.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: b43d9ccd-bb65-4a69-bb8d-a48cfb386836

📥 Commits

Reviewing files that changed from the base of the PR and between 9776d9f and fccadce.

📒 Files selected for processing (7)
  • demos/cuj1-gke-config.md
  • pkg/bundler/bundler.go
  • pkg/bundler/bundler_gke_quota_test.go
  • pkg/recipe/components.go
  • recipes/components/gpu-operator/manifests/gke-resource-quota.yaml
  • recipes/overlays/gke-cos.yaml
  • recipes/registry.yaml
💤 Files with no reviewable changes (2)
  • recipes/components/gpu-operator/manifests/gke-resource-quota.yaml
  • demos/cuj1-gke-config.md

Comment thread pkg/bundler/bundler.go Outdated
@mchmarny mchmarny self-assigned this May 15, 2026
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 15, 2026

Coverage Report ✅

Metric Value
Coverage 76.3%
Threshold 75%
Status Pass
Coverage Badge
![Coverage](https://img.shields.io/badge/coverage-76.3%25-green)

Merging this branch will increase overall coverage

Impacted Packages Coverage Δ 🤖
github.com/NVIDIA/aicr/pkg/bundler 63.21% (+1.26%) 👍
github.com/NVIDIA/aicr/pkg/recipe 90.23% (ø)

Coverage by file

Changed files (no unit tests)

Changed File Coverage Δ Total Covered Missed 🤖
github.com/NVIDIA/aicr/pkg/bundler/bundler.go 59.87% (+1.93%) 461 (+33) 276 (+28) 185 (+5) 👍
github.com/NVIDIA/aicr/pkg/recipe/components.go 86.54% (ø) 104 90 14

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

lalitadithya
lalitadithya previously approved these changes May 15, 2026
Per CodeRabbit review on #921. Switches the synthesized ResourceQuota
from yaml.Marshal to serializer.MarshalYAMLDeterministic so the bytes
are stable across runs — the manifest lands in the bundle artifact
(checksummed and optionally attested), and yaml.v3 walks randomized
Go map order. Adds TestRenderGKECriticalPriorityQuota_Deterministic
to guard against accidental regression.
@mchmarny mchmarny enabled auto-merge (squash) May 15, 2026 14:52
@mchmarny mchmarny merged commit 6ffbd88 into main May 15, 2026
89 checks passed
@mchmarny mchmarny deleted the fix/915-gke-critical-priority-quota branch May 15, 2026 14:57
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request May 15, 2026
…al-aware dependsOn

The flux deployer never consumed `ComponentPreManifests`, so the
synthesized GKE critical-priority `ResourceQuota` from PR NVIDIA#921 was
silently dropped on `--deployer flux` bundles and the original symptom
from NVIDIA#915 reproduced. The same gap also blocked the os-talos mixin
from working on flux.

Add `ComponentPreManifests` to `flux.Generator`, wire it through
`bundler.buildDeployer`, and emit a `<name>-pre` HelmRelease ahead of
the primary when pre-manifests exist. Rewire the primary's `dependsOn`
to point at `<name>-pre` so the chain is
`previous → <name>-pre → <name> → <name>-post → next`.

Also fix a pre-existing ordering bug: the next component used to
depend on the previous component's primary name, not its terminal
(`<prev>-post` for mixed components). New `terminalReleaseNameFor`
helper resolves the correct tail; `buildPrimaryDependsOn` and the
README renderer both use it. Without this fix, Flux could reconcile
the next component in parallel with the previous component's post
manifests.

Add a `<name>-pre` collision guard mirroring the rule in
`pkg/bundler/deployer/localformat/writer.go`.

Tests: pre-only, pre+post with terminal-aware next-component dependsOn
(regression guard), collision rejection, and a refreshed
`TestBuildPrimaryDependsOn` covering mixed, manifest-only, and
chart-only previous components.

Docs: `pkg/bundler/deployer/flux/doc.go` describes the chain and the
example tree now includes `<name>-pre/`; `docs/user/cli-reference.md`
updates the Flux Deployment Order bullet.

Fixes: NVIDIA#923
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request May 15, 2026
…al-aware dependsOn

The flux deployer never consumed `ComponentPreManifests`, so the
synthesized GKE critical-priority `ResourceQuota` from PR NVIDIA#921 was
silently dropped on `--deployer flux` bundles and the original symptom
from NVIDIA#915 reproduced. The same gap also blocked the os-talos mixin
from working on flux.

Add `ComponentPreManifests` to `flux.Generator`, wire it through
`bundler.buildDeployer`, and emit a `<name>-pre` HelmRelease ahead of
the primary when pre-manifests exist. Rewire the primary's `dependsOn`
to point at `<name>-pre` so the chain is
`previous → <name>-pre → <name> → <name>-post → next`.

Also fix a pre-existing ordering bug: the next component used to
depend on the previous component's primary name, not its terminal
(`<prev>-post` for mixed components). New `terminalReleaseNameFor`
helper resolves the correct tail; `buildPrimaryDependsOn` and the
README renderer both use it. Without this fix, Flux could reconcile
the next component in parallel with the previous component's post
manifests.

Add a `<name>-pre` collision guard mirroring the rule in
`pkg/bundler/deployer/localformat/writer.go`.

Tests: pre-only, pre+post with terminal-aware next-component dependsOn
(regression guard), collision rejection, and a refreshed
`TestBuildPrimaryDependsOn` covering mixed, manifest-only, and
chart-only previous components.

Docs: `pkg/bundler/deployer/flux/doc.go` describes the chain and the
example tree now includes `<name>-pre/`; `docs/user/cli-reference.md`
updates the Flux Deployment Order bullet.

Fixes: NVIDIA#923
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request May 15, 2026
…al-aware dependsOn

The flux deployer never consumed `ComponentPreManifests`, so the
synthesized GKE critical-priority `ResourceQuota` from PR NVIDIA#921 was
silently dropped on `--deployer flux` bundles and the original symptom
from NVIDIA#915 reproduced. The same gap also blocked the os-talos mixin
from working on flux.

Add `ComponentPreManifests` to `flux.Generator`, wire it through
`bundler.buildDeployer`, and emit a `<name>-pre` HelmRelease ahead of
the primary when pre-manifests exist. Rewire the primary's `dependsOn`
to point at `<name>-pre` so the chain is
`previous → <name>-pre → <name> → <name>-post → next`.

Also fix a pre-existing ordering bug: the next component used to
depend on the previous component's primary name, not its terminal
(`<prev>-post` for mixed components). New `terminalReleaseNameFor`
helper resolves the correct tail; `buildPrimaryDependsOn` and the
README renderer both use it. Without this fix, Flux could reconcile
the next component in parallel with the previous component's post
manifests.

Add a `<name>-pre` collision guard mirroring the rule in
`pkg/bundler/deployer/localformat/writer.go`.

Tests: pre-only, pre+post with terminal-aware next-component dependsOn
(regression guard), collision rejection, and a refreshed
`TestBuildPrimaryDependsOn` covering mixed, manifest-only, and
chart-only previous components.

Docs: `pkg/bundler/deployer/flux/doc.go` describes the chain and the
example tree now includes `<name>-pre/`; `docs/user/cli-reference.md`
updates the Flux Deployment Order bullet.

Fixes: NVIDIA#923
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request May 15, 2026
…al-aware dependsOn

The flux deployer never consumed `ComponentPreManifests`, so the
synthesized GKE critical-priority `ResourceQuota` from PR NVIDIA#921 was
silently dropped on `--deployer flux` bundles and the original symptom
from NVIDIA#915 reproduced. The same gap also blocked the os-talos mixin
from working on flux.

Add `ComponentPreManifests` to `flux.Generator`, wire it through
`bundler.buildDeployer`, and emit a `<name>-pre` HelmRelease ahead of
the primary when pre-manifests exist. Rewire the primary's `dependsOn`
to point at `<name>-pre` so the chain is
`previous → <name>-pre → <name> → <name>-post → next`.

Also fix a pre-existing ordering bug: the next component used to
depend on the previous component's primary name, not its terminal
(`<prev>-post` for mixed components). New `terminalReleaseNameFor`
helper resolves the correct tail; `buildPrimaryDependsOn` and the
README renderer both use it. Without this fix, Flux could reconcile
the next component in parallel with the previous component's post
manifests.

Add a `<name>-pre` collision guard mirroring the rule in
`pkg/bundler/deployer/localformat/writer.go`.

Tests: pre-only, pre+post with terminal-aware next-component dependsOn
(regression guard), collision rejection, and a refreshed
`TestBuildPrimaryDependsOn` covering mixed, manifest-only, and
chart-only previous components.

Docs: `pkg/bundler/deployer/flux/doc.go` describes the chain and the
example tree now includes `<name>-pre/`; `docs/user/cli-reference.md`
updates the Flux Deployment Order bullet.

Fixes: NVIDIA#923
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request May 15, 2026
…al-aware dependsOn

The flux deployer never consumed `ComponentPreManifests`, so the
synthesized GKE critical-priority `ResourceQuota` from PR NVIDIA#921 was
silently dropped on `--deployer flux` bundles and the original symptom
from NVIDIA#915 reproduced. The same gap also blocked the os-talos mixin
from working on flux.

Add `ComponentPreManifests` to `flux.Generator`, wire it through
`bundler.buildDeployer`, and emit a `<name>-pre` HelmRelease ahead of
the primary when pre-manifests exist. Rewire the primary's `dependsOn`
to point at `<name>-pre` so the chain is
`previous → <name>-pre → <name> → <name>-post → next`.

Also fix a pre-existing ordering bug: the next component used to
depend on the previous component's primary name, not its terminal
(`<prev>-post` for mixed components). New `terminalReleaseNameFor`
helper resolves the correct tail; `buildPrimaryDependsOn` and the
README renderer both use it. Without this fix, Flux could reconcile
the next component in parallel with the previous component's post
manifests.

Add a `<name>-pre` collision guard mirroring the rule in
`pkg/bundler/deployer/localformat/writer.go`.

Tests: pre-only, pre+post with terminal-aware next-component dependsOn
(regression guard), collision rejection, and a refreshed
`TestBuildPrimaryDependsOn` covering mixed, manifest-only, and
chart-only previous components.

Docs: `pkg/bundler/deployer/flux/doc.go` describes the chain and the
example tree now includes `<name>-pre/`; `docs/user/cli-reference.md`
updates the Flux Deployment Order bullet.

Fixes: NVIDIA#923
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: helmfile/helm bundles fail on GKE — system-*-critical priorityClass pods rejected by ResourceQuota admission

2 participants