CP-37207: Per-Check Type Configuration for Validator Diagnostics by evan-cz · Pull Request #652 · Cloudzero/cloudzero-agent

evan-cz · 2026-02-04T14:18:34Z

The Problem

The CloudZero Agent validator runs diagnostic checks during pod startup to verify the deployment is configured correctly. Until now, these checks existed but offered no user-facing configuration. All checks ran, but there was no supported way to:

- Control whether a failing check blocked pod startup
- Disable checks that don't apply to a particular environment, or are broken
- Distinguish between critical misconfigurations and minor issues

This made it difficult to be aggressive about validation. We couldn't enforce checks that detect serious problems (like Istio cross-cluster misconfigurations) without also risking blocked deployments for less critical issues.

The Solution

This PR introduces the first user-facing API for controlling validator diagnostic behavior. Each check can now be configured independently with one of four types:

| Type | On Failure | Use Case |
| --- | --- | --- |
| `required` | Blocks pod startup (non-zero exit) | Critical misconfigurations that will cause problems |
| `optional` | Warning logged, pod continues | Important validation but may have transient failures |
| `informative` | Always passes | Pure telemetry gathering |
| `disabled` | Check not run | Doesn't apply to this environment |

Key behavior: All checks run before any exit decision—we collect all diagnostics first, then determine the exit code based on whether any required checks failed.
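
To make the "collect everything, then decide" behavior concrete, here is a minimal Go sketch. It is not the actual runner in app/domain/diagnostic/runner/runner.go; the `Check` struct and the `runAll`, `exitCode`, `passing`, and `failing` names are hypothetical and exist only for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// CheckType mirrors the four types in the table above. These Go names are
// illustrative assumptions, not the real definitions.
type CheckType string

const (
	Required    CheckType = "required"
	Optional    CheckType = "optional"
	Informative CheckType = "informative"
	Disabled    CheckType = "disabled"
)

// Check is a hypothetical pairing of a named diagnostic with its type.
type Check struct {
	Name string
	Type CheckType
	Run  func() error
}

// passing and failing are stand-in diagnostics for the example.
func passing() error { return nil }
func failing() error { return errors.New("simulated failure") }

// runAll executes every non-disabled check without short-circuiting and
// returns the names of the required checks that failed.
func runAll(checks []Check) []string {
	var requiredFailures []string
	for _, c := range checks {
		if c.Type == Disabled {
			continue // disabled checks are never run
		}
		err := c.Run()
		switch {
		case err == nil:
			fmt.Printf("%s: pass\n", c.Name)
		case c.Type == Required:
			fmt.Printf("%s: FAIL (%v)\n", c.Name, err)
			requiredFailures = append(requiredFailures, c.Name)
		case c.Type == Optional:
			fmt.Printf("%s: warning (%v)\n", c.Name, err)
		default:
			// Informative checks always report passing.
			fmt.Printf("%s: pass (informative)\n", c.Name)
		}
	}
	return requiredFailures
}

// exitCode is decided only after all checks have run.
func exitCode(requiredFailures []string) int {
	if len(requiredFailures) > 0 {
		return 1
	}
	return 0
}

func main() {
	checks := []Check{
		{"api_key_valid", Optional, failing},
		{"istio_xcluster_lb", Required, passing},
	}
	fmt.Println("exit code:", exitCode(runAll(checks)))
}
```

In this sketch a failing `optional` check produces a warning but the exit code stays 0; only a failing `required` check changes it.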

The API

The definitive reference is helm/values.yaml. Here's the current configuration:

components:
  validator:
    checks:
      pre-start:
        api_key_valid: optional
        istio_xcluster_lb: required
      post-start:
        k8s_version: informative
        k8s_namespace: informative
        k8s_provider: informative
        kube_state_metrics_reachable: optional
        prometheus_version: informative
        scrape_cfg: optional
        webhook_server_reachable: optional
      pre-stop: {}
      config-load:
        api_key_valid: optional
        k8s_version: informative
        k8s_namespace: informative
        k8s_provider: informative
        kube_state_metrics_reachable: optional

This interface was chosen because it gives Helm users a straightforward way to override the settings. Modifying a list in Helm is difficult because you generally have to replace the whole list. With a map keyed by check name, you can disable any check you want (e.g., components.validator.checks.pre-start.istio_xcluster_lb=disabled), add a check to another stage, and so on. The JSON Schema rejects invalid validator checks.
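
The override semantics can be sketched in Go as a map merge followed by a filter. This is only an approximation of what the `cloudzero-agent.validator.stageCheck` template helper is described as doing; `resolveStage` is a hypothetical name, and sorting the output by check name is an assumption made here for deterministic output.

```go
package main

import (
	"fmt"
	"sort"
)

// CheckConfig is the {name, type} pair the rendered ConfigMap is described
// as containing. Field names are assumptions.
type CheckConfig struct {
	Name string
	Type string
}

// resolveStage overlays user overrides on one stage's defaults, drops
// disabled checks, and returns the result sorted by name.
func resolveStage(defaults, overrides map[string]string) []CheckConfig {
	merged := make(map[string]string, len(defaults)+len(overrides))
	for name, typ := range defaults {
		merged[name] = typ
	}
	for name, typ := range overrides {
		merged[name] = typ // a user value wins over the default
	}

	out := make([]CheckConfig, 0, len(merged))
	for name, typ := range merged {
		if typ == "disabled" {
			continue
		}
		out = append(out, CheckConfig{Name: name, Type: typ})
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Name < out[j].Name })
	return out
}

func main() {
	// Defaults for pre-start as listed above.
	defaults := map[string]string{
		"api_key_valid":     "optional",
		"istio_xcluster_lb": "required",
	}
	// Disable one check and add another to this stage.
	overrides := map[string]string{
		"istio_xcluster_lb": "disabled",
		"k8s_version":       "informative",
	}
	for _, c := range resolveStage(defaults, overrides) {
		fmt.Printf("%s: %s\n", c.Name, c.Type)
	}
}
```

Because each check is a map key, a single override touches only that check; with a list, the same change would require restating every entry.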

Design Decisions

`istio_xcluster_lb` as `required`

This is the exemplar of what required is for. The Istio cross-cluster load balancing check:

- Passes silently for non-Istio clusters (no false positives)
- Only fails when it detects a genuine Istio misconfiguration
- Prevents serious problems: without this check, cross-cluster metrics get misattributed

This is exactly the kind of aggressive validation we want. If this check fails, there's a real configuration problem that will cause incorrect cost allocation.

`api_key_valid` as `optional`

We want to start collecting data as soon as possible. If the API key is invalid:

- The customer can fix it without restarting the collector
- Data collected during that window may be recoverable
- Blocking startup only delays problem discovery

Informative Checks

Pure telemetry: k8s_version, k8s_namespace, k8s_provider, prometheus_version

These gather environment information sent to CloudZero for diagnostics. They never block startup and always report passing—their job is information gathering, not validation.

Optional Checks

Important but not critical: kube_state_metrics_reachable, scrape_cfg, webhook_server_reachable

These validate connectivity and configuration but may have transient failures during cluster startup. We log warnings without blocking.

What This Enables

- More aggressive validation: We can add required checks for serious misconfigurations without fear of blocking all deployments
- Customer-specific configuration: Users can disable checks that don't apply (webhook check when webhooks are disabled, Istio check when not using Istio)
- Graceful degradation: Optional checks warn about issues without preventing data collection

Discussion

The check type assignments represent our best judgment, but we're open to feedback:

- Should any checks move between categories?
- Are there checks that should be required by default but aren't?
- Are there checks where optional is too aggressive?

The API structure itself (components.validator.checks.<stage>.<check>: <type>) is intended to be stable across releases.

Validation

- Go unit tests: All existing tests updated and passing, plus new tests for CheckType validation, CheckConfig validation, and requiredFailures tracking in the runner
- Helm schema validation: All schema tests passing with new CheckConfig type definitions
- Helm unit tests: All tests passing, including new tests for per-check configuration (helm/tests/validator_checks_test.yaml)
- Manual deployment: Deployed to GKE cluster in many different configurations to verify that all options were being respected and handled correctly.

Backwards Compatibility

No concerns — the previous enforce flag was never exposed in values.yaml. This is a new user-facing API, not a migration from an existing one.

Technical Details

For reviewers interested in implementation:

- Go config: app/config/validator/diagnostics.go (CheckType enum, CheckConfig struct)
- Runner logic: app/domain/diagnostic/runner/runner.go (type-aware exit code determination)
- Helm template: helm/templates/_helpers.tpl (cloudzero-agent.validator.stageCheck helper)
- Schema: helm/values.schema.yaml (full documentation for each check)

Check types are tracked internally in the runner for exit code determination. They're not stored in the status protobuf or exposed in the API.

The validator's pre-start diagnostic stage had a hardcoded `enforce: true` setting that wasn't actually wired up to affect behavior. This made it impossible for users to control whether diagnostic failures should block pod startup. Additionally, when the diagnostic runner encountered errors (distinct from check failures), it would cause the validator to fail even when enforcement was disabled, effectively making `enforce: false` meaningless in error scenarios.

Functional Change:

Before: The `enforce` setting in validator config was vestigial - diagnostic failures always logged warnings but never blocked pod startup. Runner errors would crash the validator regardless of enforce setting.

After: When `enforce: true`, failing pre-start checks cause the validator to exit with error code 1, blocking pod startup via the lifecycle hook. When `enforce: false` (now the default), failures are logged and reported via telemetry but the pod starts normally. Runner errors are handled gracefully when enforcement is disabled.

Solution:

1. Added `components.validator.enforce` to values.yaml with default `false`
   - Includes documentation explaining behavior for both settings
   - Updated JSON schema with enforce boolean property
2. Updated helm/templates/validator-cm.yaml to use the configurable value
   - Pre-start stage now uses `{{ .Values.components.validator.enforce }}`
   - Other stages remain hardcoded to `enforce: false`
3. Enhanced diagnostic runner (app/domain/diagnostic/runner/runner.go):
   - Added `enforce` and `hasFailures` fields to runner struct
   - `NewRunner()` captures enforce setting from stage config
   - Added `ShouldFail()` method: returns true only when enforce=true AND checks failed
   - Added `IsEnforced()` method: exposes enforce state for error handling
4. Modified command.go to implement enforcement behavior:
   - After `Run()`, checks `engine.ShouldFail()` to determine exit behavior
   - When enforce=true and checks fail: logs failures and returns error
   - When enforce=false and runner errors: warns and continues

Validation:

- Added 5 new test functions to runner_test.go (coverage: 69.5% -> 86.7%):
  - TestRunner_ShouldFail: verifies enforce+hasFailures logic
  - TestRunner_EnforceSetFromStageConfig: verifies config parsing
  - TestRunner_HasFailuresTracking: verifies failure detection
  - TestRunner_ShouldFailIntegration: end-to-end config->behavior test
  - TestRunner_IsEnforced: verifies IsEnforced() method
- Added helm/tests/validator_enforce_test.yaml with 6 test cases:
  - Default value (false), explicit true/false, other stages unaffected
- Added 3 schema validation test files:
  - components.validator.enforce.true.pass.yaml
  - components.validator.enforce.false.pass.yaml
  - components.validator.enforce.invalid.fail.yaml
- Deployed to Brahms cluster and verified:
  - enforce=true + invalid API key: pod enters Init:CrashLoopBackOff (expected)
  - enforce=false + invalid API key: pod starts normally with warnings (expected)

The recently added `enforce` flag for validator diagnostics controls whether check failures are fatal. However, there's a need to control which checks run at all, independent of whether failures block pod startup. Use cases include:

- Testing deployments with intentionally invalid API keys (disable api_key_valid)
- Debugging specific check failures without running other checks
- Enabling checks in stages where they don't normally run

Functional Change:

Before: The validator ran a fixed set of checks per stage. Users could only control whether failures were fatal (via `enforce`), not which checks executed.

After: Users can enable or disable individual checks on a per-stage basis via `components.validator.checks.<stage>.<check>: true|false`. All checks default to enabled. The `enforce` flag remains orthogonal - it controls fatality, not execution.

Solution:

1. Added `checks: {}` configuration to helm/values.yaml with documentation listing all available stages (pre-start, post-start, pre-stop, config-load) and checks
2. Added flexible schema to helm/values.schema.yaml using `additionalProperties: type: boolean` pattern, avoiding the need to update schema when new checks are added
3. Created `cloudzero-agent.validator.enabledChecks` helper in _helpers.tpl that:
   - Filters default checks based on per-stage config
   - Uses `hasKey` instead of `| default true` to correctly handle `false` values
   - Supports adding checks to stages where they don't normally run
4. Updated helm/templates/validator-cm.yaml to use dynamic check filtering for all four diagnostic stages

Validation:

- Added 15 Helm unit tests (helm/tests/validator_checks_test.yaml) covering:
  - Default behavior (all checks enabled per stage)
  - Disabling individual checks in a single stage
  - Disabling the same check in multiple stages
  - Disabling multiple checks in the same stage
  - Disabling all checks in a stage (empty list)
  - Adding checks to stages where they don't normally run
  - Explicitly setting default checks to true (no-op)
  - Empty checks config preserves defaults
  - Interaction with enforce flag
- Added 4 schema validation tests in tests/helm/schema/:
  - components.validator.checks.pre-start.api_key_valid.false.pass.yaml
  - components.validator.checks.post-start.k8s_version.true.pass.yaml
  - components.validator.checks.invalid-stage.fail.yaml
  - components.validator.checks.pre-start.invalid-value.fail.yaml
- All new Helm unit tests pass
- All schema validation tests pass
- Deployed to GKE cluster (bach) with api_key_valid disabled in pre-start and config-load stages:
  - ConfigMap correctly shows empty checks list for pre-start
  - ConfigMap correctly omits api_key_valid from config-load
  - Validator init container completes successfully (exit code 0)
  - No check table output for pre-start (no checks to run)
  - Pod starts without api_key_valid "forbidden error" that appeared with defaults
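
The `hasKey` detail in this commit matters because Helm's `default` function treats `false` as an empty value, so `false | default true` yields `true` and an explicit opt-out would be ignored. The Go sketch below shows the same distinction using a presence check on a map. It is an analogy rather than the template code; `isEnabled` and `enabledChecks` are hypothetical names, the default check list is illustrative, and sorting the extra checks is an assumption made for deterministic output.

```go
package main

import (
	"fmt"
	"sort"
)

// isEnabled reports whether a check should run. A missing key means
// "use the default" (enabled); an explicit false must be honored.
func isEnabled(cfg map[string]bool, name string) bool {
	enabled, present := cfg[name]
	if !present {
		return true
	}
	return enabled
}

// enabledChecks filters a stage's default checks, then appends any
// non-default checks the user explicitly set to true.
func enabledChecks(defaults []string, cfg map[string]bool) []string {
	out := []string{}
	isDefault := make(map[string]bool, len(defaults))
	for _, name := range defaults {
		isDefault[name] = true
		if isEnabled(cfg, name) {
			out = append(out, name)
		}
	}

	extras := []string{}
	for name, enabled := range cfg {
		if enabled && !isDefault[name] {
			extras = append(extras, name)
		}
	}
	sort.Strings(extras)
	return append(out, extras...)
}

func main() {
	defaults := []string{"api_key_valid", "istio_xcluster_lb"} // illustrative
	cfg := map[string]bool{"api_key_valid": false, "k8s_version": true}
	fmt.Println(enabledChecks(defaults, cfg)) // [istio_xcluster_lb k8s_version]
}
```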

josephbarnett · 2026-02-04T15:02:44Z

What about a default - for example if type: is not defined on the yaml?

josephbarnett

I might recommend adding support for when type is not defined in a yaml file at all - to optional .... but otherwise looks good.

The stage-level `enforce` flag provided only coarse-grained control over validator behavior - either all checks in a stage affected the exit code, or none did. This made it difficult to configure the validator for different environments (e.g., disabling API key validation in test clusters while keeping other critical checks enforced).

This change introduces a per-check type system that provides granular control over how each diagnostic check affects validator behavior.

Functional Change:

Before: Validator used a boolean `enforce` flag per stage. When enabled, ANY check failure in that stage caused a non-zero exit code. Users could not selectively disable individual checks or control their severity.

After: Each check has a `type` field (required, optional, informative, disabled) that controls its behavior:

- `required`: Failures cause non-zero exit code (all checks still run first)
- `optional`: Failures emit warnings but don't affect exit code
- `informative`: Information gathering only - always reports passing
- `disabled`: Check is not run at all

Solution:

1. Go config changes (`app/config/validator/diagnostics.go`):
   - Added `CheckType` enum with required/optional/informative values
   - Added `CheckConfig` struct with name and type fields
   - Updated `Stage` struct to use `[]CheckConfig` instead of `[]string`
   - Removed `Enforce` field from `Stage`
2. Go runner changes (`app/domain/diagnostic/runner/runner.go`):
   - Replaced `enforce` and `hasFailures` with `checkTypes` map and `requiredFailures`
   - `ShouldFail()` now returns true only if required checks failed
   - Removed `IsEnforced()` method (no longer needed)
3. Helm values (`helm/values.yaml`, `app/functions/helmless/default-values.yaml`):
   - New `components.validator.checks` structure with per-stage check configuration
   - Default types: api_key_valid and istio_xcluster_lb as required; k8s_version, k8s_namespace, k8s_provider, prometheus_version as informative; others as optional
4. Helm schema (`helm/values.schema.yaml`):
   - Added `CheckConfig` type as string enum (required/optional/informative/disabled)
   - Added `StageChecks` type with explicit properties for each valid check
   - Added comprehensive descriptions for each diagnostic check documenting what it validates and when it might fail
5. Helm template helper (`helm/templates/_helpers.tpl`):
   - Updated `cloudzero-agent.validator.stageCheck` to merge user overrides with defaults, filter out disabled checks, and output `[{name, type}]` format
6. CLI output (`app/functions/agent-validator/diagnose/command.go`):
   - Updated table header to show "Type" column instead of removed fields
   - Fixed condition in `printClusterStatusRow` that incorrectly filtered output

Validation:

- All existing Go unit tests updated and passing
- Added new tests for CheckType validation, CheckConfig validation, and requiredFailures tracking in runner
- All Helm schema validation tests updated and passing (removed enforce tests, added type tests)
- All Helm unit tests updated and passing with new default values
- Deployed to bach cluster with api_key_valid disabled - validator runs correctly:
  - istio_xcluster_lb shows as "required" with Passing: true
  - Disabled checks are correctly omitted from output
  - Pod starts successfully (exit code 0)
  - ConfigMap generates correct `checks: [{name, type}]` format

evan-cz · 2026-02-04T16:34:44Z

> I might recommend adding support for when type is not defined in a yaml file at all - to optional .... but otherwise looks good.

I had the default set to "required"; I changed it to "optional".

But the thing is that due to how the code is structured it can't really come up... because the user doesn't have direct access (unless they are engaging in shenanigans) to the data generated in the validator CM; they specify stuff in their overrides, then the template helpers generate the actual name/type properties.

If they do something like api_key_valid: (aka api_key_valid: null) it just won't get merged in by Helm and the default will be used.

evan-cz added 2 commits February 4, 2026 00:04

evan-cz force-pushed the CP-37207-plus branch from 246a095 to efcab20 Compare February 4, 2026 14:25

josephbarnett approved these changes Feb 4, 2026


evan-cz force-pushed the CP-37207-plus branch from efcab20 to 086db10 Compare February 4, 2026 16:30

evan-cz marked this pull request as ready for review February 4, 2026 16:34

evan-cz requested a review from a team as a code owner February 4, 2026 16:34

evan-cz mentioned this pull request Feb 4, 2026

CP-37207: Implement configurable enforce flag for validator diagnostics #646

Closed

evan-cz added this pull request to the merge queue Feb 5, 2026

github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Feb 5, 2026

evan-cz added this pull request to the merge queue Feb 5, 2026

Merged via the queue into develop with commit 90802a9 Feb 5, 2026
44 checks passed

evan-cz deleted the CP-37207-plus branch February 5, 2026 16:43

evan-cz merged 3 commits into develop from CP-37207-plus
