CP-37207: Per-Check Type Configuration for Validator Diagnostics #652

Merged: evan-cz merged 3 commits into develop from CP-37207-plus on Feb 5, 2026
@evan-cz evan-cz commented Feb 4, 2026

The Problem

The CloudZero Agent validator runs diagnostic checks during pod startup to verify the deployment is configured correctly. Until now, these checks existed but offered no user-facing configuration. All checks ran, but there was no supported way to:

  • Control whether a failing check blocked pod startup
  • Disable checks that don't apply to a particular environment, or are broken
  • Distinguish between critical misconfigurations and minor issues

This made it difficult to be aggressive about validation. We couldn't enforce checks that detect serious problems (like Istio cross-cluster misconfigurations) without also risking blocked deployments for less critical issues.

The Solution

This PR introduces the first user-facing API for controlling validator diagnostic behavior. Each check can now be configured independently with one of four types:

| Type | On Failure | Use Case |
| --- | --- | --- |
| required | Blocks pod startup (non-zero exit) | Critical misconfigurations that will cause problems |
| optional | Warning logged, pod continues | Important validation but may have transient failures |
| informative | Always passes | Pure telemetry gathering |
| disabled | Check not run | Doesn't apply to this environment |

Key behavior: All checks run before any exit decision—we collect all diagnostics first, then determine the exit code based on whether any required checks failed.
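
As a rough illustration of the "run everything, then decide" behavior, here is a minimal Go sketch. It is not the code in runner.go: the `Check` struct, `RunAll`, `ExitCode`, and `errDemo` are hypothetical names, and treating an unrecognized type like `optional` is an assumption made only to keep the example short.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// CheckType mirrors the four types in the table above. The Go identifiers
// are illustrative and are not taken from the agent's source.
type CheckType string

const (
	Required    CheckType = "required"
	Optional    CheckType = "optional"
	Informative CheckType = "informative"
	Disabled    CheckType = "disabled"
)

// errDemo is a placeholder failure used by the example checks.
var errDemo = errors.New("simulated check failure")

// Check pairs a hypothetical diagnostic function with its configured type.
type Check struct {
	Name string
	Type CheckType
	Run  func() error
}

// RunAll executes every enabled check and returns the names of the required
// checks that failed. It never stops early, so all diagnostics are collected
// before the caller decides on an exit code.
func RunAll(checks []Check) (requiredFailures []string) {
	for _, c := range checks {
		if c.Type == Disabled {
			continue // disabled checks are not run at all
		}
		err := c.Run()
		switch {
		case err == nil || c.Type == Informative:
			// Informative checks always report passing.
			fmt.Printf("PASS %s (%s)\n", c.Name, c.Type)
		case c.Type == Required:
			fmt.Printf("FAIL %s (%s): %v\n", c.Name, c.Type, err)
			requiredFailures = append(requiredFailures, c.Name)
		default:
			// Optional (and, by assumption, any unknown type): warn only.
			fmt.Printf("WARN %s (%s): %v\n", c.Name, c.Type, err)
		}
	}
	return requiredFailures
}

// ExitCode is non-zero only when at least one required check failed.
func ExitCode(requiredFailures []string) int {
	if len(requiredFailures) > 0 {
		return 1
	}
	return 0
}

func main() {
	checks := []Check{
		{Name: "istio_xcluster_lb", Type: Required, Run: func() error { return nil }},
		{Name: "api_key_valid", Type: Optional, Run: func() error { return errDemo }},
		{Name: "k8s_version", Type: Informative, Run: func() error { return nil }},
	}
	os.Exit(ExitCode(RunAll(checks)))
}
```

In this example the optional check fails and only produces a warning, so the process still exits with code 0.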

The API

The definitive reference is helm/values.yaml. Here's the current configuration:

components:
  validator:
    checks:
      pre-start:
        api_key_valid: optional
        istio_xcluster_lb: required
      post-start:
        k8s_version: informative
        k8s_namespace: informative
        k8s_provider: informative
        kube_state_metrics_reachable: optional
        prometheus_version: informative
        scrape_cfg: optional
        webhook_server_reachable: optional
      pre-stop: {}
      config-load:
        api_key_valid: optional
        k8s_version: informative
        k8s_namespace: informative
        k8s_provider: informative
        kube_state_metrics_reachable: optional

This interface was chosen because it gives Helm users a straightforward way to override settings. Modifying a list in Helm is difficult because you generally have to replace the whole list, whereas map keys can be overridden individually. With this design, you can disable any check you want (e.g., components.validator.checks.pre-start.istio_xcluster_lb=disabled), add a check to another stage, and so on. The JSON Schema rejects invalid validator checks.
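
To make the map-versus-list argument concrete, the sketch below shows in Go roughly what overlaying per-check overrides on a stage's defaults looks like. The real logic lives in a Helm template helper, not in Go; `MergeStage` and `CheckConfig` are hypothetical names, and both the name-sorted output and the handling of empty override values are assumptions of this example.

```go
package main

import (
	"fmt"
	"sort"
)

// CheckConfig is an illustrative {name, type} pair for one check.
type CheckConfig struct {
	Name string
	Type string
}

// MergeStage overlays user overrides on a stage's default check types,
// drops checks set to "disabled", and returns the rest sorted by name.
// An override with an empty value is ignored so the default is kept
// (an assumption standing in for a null value in a values file).
func MergeStage(defaults, overrides map[string]string) []CheckConfig {
	merged := make(map[string]string, len(defaults)+len(overrides))
	for name, typ := range defaults {
		merged[name] = typ
	}
	for name, typ := range overrides {
		if typ == "" {
			continue // keep the default
		}
		merged[name] = typ // replaces a default or adds a check to the stage
	}

	out := make([]CheckConfig, 0, len(merged))
	for name, typ := range merged {
		if typ == "disabled" {
			continue // disabled checks are omitted from the result
		}
		out = append(out, CheckConfig{Name: name, Type: typ})
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Name < out[j].Name })
	return out
}

func main() {
	// Defaults for the pre-start stage, as listed in the configuration above.
	defaults := map[string]string{
		"api_key_valid":     "optional",
		"istio_xcluster_lb": "required",
	}
	// One key disables a default check, another adds a check to this stage.
	overrides := map[string]string{
		"istio_xcluster_lb": "disabled",
		"k8s_version":       "informative",
	}
	for _, c := range MergeStage(defaults, overrides) {
		fmt.Printf("%s: %s\n", c.Name, c.Type)
	}
}
```

Each override touches exactly one key, which is the property a list-based interface would not offer.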

Design Decisions

istio_xcluster_lb as required

This is the exemplar of what required is for. The Istio cross-cluster load balancing check:

  • Passes silently for non-Istio clusters (no false positives)
  • Only fails when it detects a genuine Istio misconfiguration
  • Prevents serious problems — without this check, cross-cluster metrics get misattributed

This is exactly the kind of aggressive validation we want. If this check fails, there's a real configuration problem that will cause incorrect cost allocation.

api_key_valid as optional

We want to start collecting data as soon as possible. If the API key is invalid:

  • The customer can fix it without restarting the collector
  • Data collected during that window may be recoverable
  • Blocking startup only delays problem discovery

Informative Checks

Pure telemetry: k8s_version, k8s_namespace, k8s_provider, prometheus_version

These gather environment information sent to CloudZero for diagnostics. They never block startup and always report passing—their job is information gathering, not validation.

Optional Checks

Important but not critical: kube_state_metrics_reachable, scrape_cfg, webhook_server_reachable

These validate connectivity and configuration but may have transient failures during cluster startup. We log warnings without blocking.

What This Enables

  1. More aggressive validation — We can add required checks for serious misconfigurations without fear of blocking all deployments

  2. Customer-specific configuration — Users can disable checks that don't apply (webhook check when webhooks are disabled, Istio check when not using Istio)

  3. Graceful degradation — Optional checks warn about issues without preventing data collection

Discussion

The check type assignments represent our best judgment, but we're open to feedback:

  • Should any checks move between categories?
  • Are there checks that should be required by default but aren't?
  • Are there checks where optional is too aggressive?

The API structure itself (components.validator.checks.<stage>.<check>: <type>) is intended to be stable across releases.

Validation

  • Go unit tests: All existing tests updated and passing, plus new tests for CheckType validation, CheckConfig validation, and requiredFailures tracking in the runner
  • Helm schema validation: All schema tests passing with new CheckConfig type definitions
  • Helm unit tests: all tests passing, including new tests for per-check configuration (helm/tests/validator_checks_test.yaml)
  • Manual deployment: Deployed to a GKE cluster in several configurations to verify that all options were respected and handled correctly.

Backwards Compatibility

No concerns — the previous enforce flag was never exposed in values.yaml. This is a new user-facing API, not a migration from an existing one.

Technical Details

For reviewers interested in implementation:

  • Go config: app/config/validator/diagnostics.go (CheckType enum, CheckConfig struct)
  • Runner logic: app/domain/diagnostic/runner/runner.go (type-aware exit code determination)
  • Helm template: helm/templates/_helpers.tpl (cloudzero-agent.validator.stageCheck helper)
  • Schema: helm/values.schema.yaml (full documentation for each check)

Check types are tracked internally in the runner for exit code determination. They're not stored in the status protobuf or exposed in the API.
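
For readers who want a mental model of the config types, a CheckType enum with a CheckConfig struct and validation could look something like the sketch below. The field names, method names, and error messages are hypothetical and not copied from diagnostics.go. Rejecting `disabled` here rests on the assumption that disabled checks never reach the Go config.

```go
package main

import (
	"errors"
	"fmt"
)

// CheckType is a string-backed enum. Only the three types that still need
// to run are accepted in this sketch; "disabled" is assumed to be handled
// before the Go config is loaded.
type CheckType string

const (
	CheckTypeRequired    CheckType = "required"
	CheckTypeOptional    CheckType = "optional"
	CheckTypeInformative CheckType = "informative"
)

// Validate reports whether t is one of the known check types.
func (t CheckType) Validate() error {
	switch t {
	case CheckTypeRequired, CheckTypeOptional, CheckTypeInformative:
		return nil
	}
	return fmt.Errorf("invalid check type %q", string(t))
}

// CheckConfig names a check and says how its result should be treated.
type CheckConfig struct {
	Name string
	Type CheckType
}

// Validate rejects entries with no name or an unknown type.
func (c CheckConfig) Validate() error {
	if c.Name == "" {
		return errors.New("check name must not be empty")
	}
	return c.Type.Validate()
}

func main() {
	configs := []CheckConfig{
		{Name: "istio_xcluster_lb", Type: CheckTypeRequired},
		{Name: "api_key_valid", Type: "fatal"}, // not a known type
	}
	for _, c := range configs {
		if err := c.Validate(); err != nil {
			fmt.Printf("%s: %v\n", c.Name, err)
			continue
		}
		fmt.Printf("%s: ok (%s)\n", c.Name, c.Type)
	}
}
```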

The validator's pre-start diagnostic stage had a hardcoded `enforce: true` setting
that wasn't actually wired up to affect behavior. This made it impossible for users
to control whether diagnostic failures should block pod startup.

Additionally, when the diagnostic runner encountered errors (distinct from check
failures), it would cause the validator to fail even when enforcement was disabled,
effectively making `enforce: false` meaningless in error scenarios.

Functional Change:

Before: The `enforce` setting in validator config was vestigial - diagnostic failures
always logged warnings but never blocked pod startup. Runner errors would crash the
validator regardless of enforce setting.

After: When `enforce: true`, failing pre-start checks cause the validator to exit
with error code 1, blocking pod startup via the lifecycle hook. When `enforce: false`
(now the default), failures are logged and reported via telemetry but the pod starts
normally. Runner errors are handled gracefully when enforcement is disabled.

Solution:

1. Added `components.validator.enforce` to values.yaml with default `false`
   - Includes documentation explaining behavior for both settings
   - Updated JSON schema with enforce boolean property

2. Updated helm/templates/validator-cm.yaml to use the configurable value
   - Pre-start stage now uses `{{ .Values.components.validator.enforce }}`
   - Other stages remain hardcoded to `enforce: false`

3. Enhanced diagnostic runner (app/domain/diagnostic/runner/runner.go):
   - Added `enforce` and `hasFailures` fields to runner struct
   - `NewRunner()` captures enforce setting from stage config
   - Added `ShouldFail()` method: returns true only when enforce=true AND checks failed
   - Added `IsEnforced()` method: exposes enforce state for error handling

4. Modified command.go to implement enforcement behavior:
   - After `Run()`, checks `engine.ShouldFail()` to determine exit behavior
   - When enforce=true and checks fail: logs failures and returns error
   - When enforce=false and runner errors: warns and continues

Validation:

- Added 5 new test functions to runner_test.go (coverage: 69.5% -> 86.7%):
  - TestRunner_ShouldFail: verifies enforce+hasFailures logic
  - TestRunner_EnforceSetFromStageConfig: verifies config parsing
  - TestRunner_HasFailuresTracking: verifies failure detection
  - TestRunner_ShouldFailIntegration: end-to-end config->behavior test
  - TestRunner_IsEnforced: verifies IsEnforced() method

- Added helm/tests/validator_enforce_test.yaml with 6 test cases:
  - Default value (false), explicit true/false, other stages unaffected

- Added 3 schema validation test files:
  - components.validator.enforce.true.pass.yaml
  - components.validator.enforce.false.pass.yaml
  - components.validator.enforce.invalid.fail.yaml

- Deployed to Brahms cluster and verified:
  - enforce=true + invalid API key: pod enters Init:CrashLoopBackOff (expected)
  - enforce=false + invalid API key: pod starts normally with warnings (expected)

The recently added `enforce` flag for validator diagnostics controls whether check
failures are fatal. However, there's a need to control which checks run at all,
independent of whether failures block pod startup. Use cases include:

- Testing deployments with intentionally invalid API keys (disable api_key_valid)
- Debugging specific check failures without running other checks
- Enabling checks in stages where they don't normally run

Functional Change:

Before: The validator ran a fixed set of checks per stage. Users could only control
whether failures were fatal (via `enforce`), not which checks executed.

After: Users can enable or disable individual checks on a per-stage basis via
`components.validator.checks.<stage>.<check>: true|false`. All checks default to
enabled. The `enforce` flag remains orthogonal - it controls fatality, not execution.

Solution:

1. Added `checks: {}` configuration to helm/values.yaml with documentation listing
   all available stages (pre-start, post-start, pre-stop, config-load) and checks

2. Added flexible schema to helm/values.schema.yaml using `additionalProperties:
   type: boolean` pattern, avoiding the need to update schema when new checks are
   added

3. Created `cloudzero-agent.validator.enabledChecks` helper in _helpers.tpl that:
   - Filters default checks based on per-stage config
   - Uses `hasKey` instead of `| default true` to correctly handle `false` values
   - Supports adding checks to stages where they don't normally run

4. Updated helm/templates/validator-cm.yaml to use dynamic check filtering for all
   four diagnostic stages

Validation:

- Added 15 Helm unit tests (helm/tests/validator_checks_test.yaml) covering:
  - Default behavior (all checks enabled per stage)
  - Disabling individual checks in a single stage
  - Disabling the same check in multiple stages
  - Disabling multiple checks in the same stage
  - Disabling all checks in a stage (empty list)
  - Adding checks to stages where they don't normally run
  - Explicitly setting default checks to true (no-op)
  - Empty checks config preserves defaults
  - Interaction with enforce flag

- Added 4 schema validation tests in tests/helm/schema/:
  - components.validator.checks.pre-start.api_key_valid.false.pass.yaml
  - components.validator.checks.post-start.k8s_version.true.pass.yaml
  - components.validator.checks.invalid-stage.fail.yaml
  - components.validator.checks.pre-start.invalid-value.fail.yaml

- All new Helm unit tests pass
- All schema validation tests pass
- Deployed to GKE cluster (bach) with api_key_valid disabled in pre-start and
  config-load stages:
  - ConfigMap correctly shows empty checks list for pre-start
  - ConfigMap correctly omits api_key_valid from config-load
  - Validator init container completes successfully (exit code 0)
  - No check table output for pre-start (no checks to run)
  - Pod starts without api_key_valid "forbidden error" that appeared with defaults

@josephbarnett (Collaborator) commented:

What about a default - for example if type: is not defined on the yaml?

@josephbarnett josephbarnett left a comment

I might recommend adding support for when type is not defined in a yaml file at all - to optional .... but otherwise looks good.

The stage-level `enforce` flag provided only coarse-grained control over validator
behavior - either all checks in a stage affected the exit code, or none did. This
made it difficult to configure the validator for different environments (e.g.,
disabling API key validation in test clusters while keeping other critical checks
enforced).

This change introduces a per-check type system that provides granular control over
how each diagnostic check affects validator behavior.

Functional Change:

Before: Validator used a boolean `enforce` flag per stage. When enabled, ANY check
failure in that stage caused a non-zero exit code. Users could not selectively
disable individual checks or control their severity.

After: Each check has a `type` field (required, optional, informative, disabled)
that controls its behavior:
- `required`: Failures cause non-zero exit code (all checks still run first)
- `optional`: Failures emit warnings but don't affect exit code
- `informative`: Information gathering only - always reports passing
- `disabled`: Check is not run at all

Solution:

1. Go config changes (`app/config/validator/diagnostics.go`):
   - Added `CheckType` enum with required/optional/informative values
   - Added `CheckConfig` struct with name and type fields
   - Updated `Stage` struct to use `[]CheckConfig` instead of `[]string`
   - Removed `Enforce` field from `Stage`

2. Go runner changes (`app/domain/diagnostic/runner/runner.go`):
   - Replaced `enforce` and `hasFailures` with `checkTypes` map and `requiredFailures`
   - `ShouldFail()` now returns true only if required checks failed
   - Removed `IsEnforced()` method (no longer needed)

3. Helm values (`helm/values.yaml`, `app/functions/helmless/default-values.yaml`):
   - New `components.validator.checks` structure with per-stage check configuration
   - Default types: api_key_valid and istio_xcluster_lb as required; k8s_version,
     k8s_namespace, k8s_provider, prometheus_version as informative; others as optional

4. Helm schema (`helm/values.schema.yaml`):
   - Added `CheckConfig` type as string enum (required/optional/informative/disabled)
   - Added `StageChecks` type with explicit properties for each valid check
   - Added comprehensive descriptions for each diagnostic check documenting what
     it validates and when it might fail

5. Helm template helper (`helm/templates/_helpers.tpl`):
   - Updated `cloudzero-agent.validator.stageCheck` to merge user overrides with
     defaults, filter out disabled checks, and output `[{name, type}]` format

6. CLI output (`app/functions/agent-validator/diagnose/command.go`):
   - Updated table header to show "Type" column instead of removed fields
   - Fixed condition in `printClusterStatusRow` that incorrectly filtered output

Validation:

- All existing Go unit tests updated and passing
- Added new tests for CheckType validation, CheckConfig validation, and
  requiredFailures tracking in runner
- All Helm schema validation tests updated and passing (removed enforce tests,
  added type tests)
- All Helm unit tests updated and passing with new default values
- Deployed to bach cluster with api_key_valid disabled - validator runs correctly:
  - istio_xcluster_lb shows as "required" with Passing: true
  - Disabled checks are correctly omitted from output
  - Pod starts successfully (exit code 0)
  - ConfigMap generates correct `checks: [{name, type}]` format

evan-cz commented Feb 4, 2026

> I might recommend adding support for when type is not defined in a yaml file at all - to optional .... but otherwise looks good.

I had the default set to "required"; I changed it to "optional".

But the thing is that due to how the code is structured it can't really come up... because the user doesn't have direct access (unless they are engaging in shenanigans) to the data generated in the validator CM; they specify stuff in their overrides, then the template helpers generate the actual name/type properties.

If they do something like api_key_valid: (aka api_key_valid: null) it just won't get merged in by Helm and the default will be used.
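
For illustration only, falling back to "optional" when an entry has no type might look like the Go sketch below. JSON from the standard library stands in for the YAML in the validator ConfigMap; `checkEntry`, `parseChecks`, and `defaultCheckType` are hypothetical names and not the actual implementation.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// defaultCheckType is applied when an entry omits its type.
const defaultCheckType = "optional"

// checkEntry is an illustrative {name, type} pair as it might appear in the
// rendered validator configuration.
type checkEntry struct {
	Name string `json:"name"`
	Type string `json:"type,omitempty"`
}

// parseChecks decodes a list of entries and fills in the default type for
// any entry whose type is missing or empty.
func parseChecks(data []byte) ([]checkEntry, error) {
	var entries []checkEntry
	if err := json.Unmarshal(data, &entries); err != nil {
		return nil, err
	}
	for i := range entries {
		if entries[i].Type == "" {
			entries[i].Type = defaultCheckType
		}
	}
	return entries, nil
}

func main() {
	raw := []byte(`[{"name":"api_key_valid"},{"name":"istio_xcluster_lb","type":"required"}]`)
	entries, err := parseChecks(raw)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for _, e := range entries {
		fmt.Printf("%s: %s\n", e.Name, e.Type)
	}
}
```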

@evan-cz evan-cz marked this pull request as ready for review February 4, 2026 16:34
@evan-cz evan-cz requested a review from a team as a code owner February 4, 2026 16:34
@evan-cz evan-cz added this pull request to the merge queue Feb 5, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Feb 5, 2026
@evan-cz evan-cz added this pull request to the merge queue Feb 5, 2026
Merged via the queue into develop with commit 90802a9 Feb 5, 2026
44 checks passed
@evan-cz evan-cz deleted the CP-37207-plus branch February 5, 2026 16:43