CP-37207: Per-Check Type Configuration for Validator Diagnostics#652
CP-37207: Per-Check Type Configuration for Validator Diagnostics#652
Conversation
The validator's pre-start diagnostic stage had a hardcoded `enforce: true` setting
that wasn't actually wired up to affect behavior. This made it impossible for users
to control whether diagnostic failures should block pod startup.
Additionally, when the diagnostic runner encountered errors (distinct from check
failures), it would cause the validator to fail even when enforcement was disabled,
effectively making `enforce: false` meaningless in error scenarios.
Functional Change:
Before: The `enforce` setting in validator config was vestigial - diagnostic failures
always logged warnings but never blocked pod startup. Runner errors would crash the
validator regardless of enforce setting.
After: When `enforce: true`, failing pre-start checks cause the validator to exit
with error code 1, blocking pod startup via the lifecycle hook. When `enforce: false`
(now the default), failures are logged and reported via telemetry but the pod starts
normally. Runner errors are handled gracefully when enforcement is disabled.
Solution:
1. Added `components.validator.enforce` to values.yaml with default `false`
- Includes documentation explaining behavior for both settings
- Updated JSON schema with enforce boolean property
2. Updated helm/templates/validator-cm.yaml to use the configurable value
- Pre-start stage now uses `{{ .Values.components.validator.enforce }}`
- Other stages remain hardcoded to `enforce: false`
3. Enhanced diagnostic runner (app/domain/diagnostic/runner/runner.go):
- Added `enforce` and `hasFailures` fields to runner struct
- `NewRunner()` captures enforce setting from stage config
- Added `ShouldFail()` method: returns true only when enforce=true AND checks failed
- Added `IsEnforced()` method: exposes enforce state for error handling
4. Modified command.go to implement enforcement behavior:
- After `Run()`, checks `engine.ShouldFail()` to determine exit behavior
- When enforce=true and checks fail: logs failures and returns error
- When enforce=false and runner errors: warns and continues
Validation:
- Added 5 new test functions to runner_test.go (coverage: 69.5% -> 86.7%):
- TestRunner_ShouldFail: verifies enforce+hasFailures logic
- TestRunner_EnforceSetFromStageConfig: verifies config parsing
- TestRunner_HasFailuresTracking: verifies failure detection
- TestRunner_ShouldFailIntegration: end-to-end config->behavior test
- TestRunner_IsEnforced: verifies IsEnforced() method
- Added helm/tests/validator_enforce_test.yaml with 6 test cases:
- Default value (false), explicit true/false, other stages unaffected
- Added 3 schema validation test files:
- components.validator.enforce.true.pass.yaml
- components.validator.enforce.false.pass.yaml
- components.validator.enforce.invalid.fail.yaml
- Deployed to Brahms cluster and verified:
- enforce=true + invalid API key: pod enters Init:CrashLoopBackOff (expected)
- enforce=false + invalid API key: pod starts normally with warnings (expected)
The recently added `enforce` flag for validator diagnostics controls whether check
failures are fatal. However, there's a need to control which checks run at all,
independent of whether failures block pod startup. Use cases include:
- Testing deployments with intentionally invalid API keys (disable api_key_valid)
- Debugging specific check failures without running other checks
- Enabling checks in stages where they don't normally run
Functional Change:
Before: The validator ran a fixed set of checks per stage. Users could only control
whether failures were fatal (via `enforce`), not which checks executed.
After: Users can enable or disable individual checks on a per-stage basis via
`components.validator.checks.<stage>.<check>: true|false`. All checks default to
enabled. The `enforce` flag remains orthogonal - it controls fatality, not execution.
Solution:
1. Added `checks: {}` configuration to helm/values.yaml with documentation listing
all available stages (pre-start, post-start, pre-stop, config-load) and checks
2. Added flexible schema to helm/values.schema.yaml using `additionalProperties:
type: boolean` pattern, avoiding the need to update schema when new checks are
added
3. Created `cloudzero-agent.validator.enabledChecks` helper in _helpers.tpl that:
- Filters default checks based on per-stage config
- Uses `hasKey` instead of `| default true` to correctly handle `false` values
- Supports adding checks to stages where they don't normally run
4. Updated helm/templates/validator-cm.yaml to use dynamic check filtering for all
four diagnostic stages
Validation:
- Added 15 Helm unit tests (helm/tests/validator_checks_test.yaml) covering:
- Default behavior (all checks enabled per stage)
- Disabling individual checks in a single stage
- Disabling the same check in multiple stages
- Disabling multiple checks in the same stage
- Disabling all checks in a stage (empty list)
- Adding checks to stages where they don't normally run
- Explicitly setting default checks to true (no-op)
- Empty checks config preserves defaults
- Interaction with enforce flag
- Added 4 schema validation tests in tests/helm/schema/:
- components.validator.checks.pre-start.api_key_valid.false.pass.yaml
- components.validator.checks.post-start.k8s_version.true.pass.yaml
- components.validator.checks.invalid-stage.fail.yaml
- components.validator.checks.pre-start.invalid-value.fail.yaml
- All new Helm unit tests pass
- All schema validation tests pass
- Deployed to GKE cluster (bach) with api_key_valid disabled in pre-start and
config-load stages:
- ConfigMap correctly shows empty checks list for pre-start
- ConfigMap correctly omits api_key_valid from config-load
- Validator init container completes successfully (exit code 0)
- No check table output for pre-start (no checks to run)
- Pod starts without api_key_valid "forbidden error" that appeared with defaults
246a095 to
efcab20
Compare
|
What about a default - for example if |
josephbarnett
left a comment
There was a problem hiding this comment.
I might recommend adding support for when type is not defined in a yaml file at all - to optional .... but otherwise looks good.
The stage-level `enforce` flag provided only coarse-grained control over validator
behavior - either all checks in a stage affected the exit code, or none did. This
made it difficult to configure the validator for different environments (e.g.,
disabling API key validation in test clusters while keeping other critical checks
enforced).
This change introduces a per-check type system that provides granular control over
how each diagnostic check affects validator behavior.
Functional Change:
Before: Validator used a boolean `enforce` flag per stage. When enabled, ANY check
failure in that stage caused a non-zero exit code. Users could not selectively
disable individual checks or control their severity.
After: Each check has a `type` field (required, optional, informative, disabled)
that controls its behavior:
- `required`: Failures cause non-zero exit code (all checks still run first)
- `optional`: Failures emit warnings but don't affect exit code
- `informative`: Information gathering only - always reports passing
- `disabled`: Check is not run at all
Solution:
1. Go config changes (`app/config/validator/diagnostics.go`):
- Added `CheckType` enum with required/optional/informative values
- Added `CheckConfig` struct with name and type fields
- Updated `Stage` struct to use `[]CheckConfig` instead of `[]string`
- Removed `Enforce` field from `Stage`
2. Go runner changes (`app/domain/diagnostic/runner/runner.go`):
- Replaced `enforce` and `hasFailures` with `checkTypes` map and `requiredFailures`
- `ShouldFail()` now returns true only if required checks failed
- Removed `IsEnforced()` method (no longer needed)
3. Helm values (`helm/values.yaml`, `app/functions/helmless/default-values.yaml`):
- New `components.validator.checks` structure with per-stage check configuration
- Default types: api_key_valid and istio_xcluster_lb as required; k8s_version,
k8s_namespace, k8s_provider, prometheus_version as informative; others as optional
4. Helm schema (`helm/values.schema.yaml`):
- Added `CheckConfig` type as string enum (required/optional/informative/disabled)
- Added `StageChecks` type with explicit properties for each valid check
- Added comprehensive descriptions for each diagnostic check documenting what
it validates and when it might fail
5. Helm template helper (`helm/templates/_helpers.tpl`):
- Updated `cloudzero-agent.validator.stageCheck` to merge user overrides with
defaults, filter out disabled checks, and output `[{name, type}]` format
6. CLI output (`app/functions/agent-validator/diagnose/command.go`):
- Updated table header to show "Type" column instead of removed fields
- Fixed condition in `printClusterStatusRow` that incorrectly filtered output
Validation:
- All existing Go unit tests updated and passing
- Added new tests for CheckType validation, CheckConfig validation, and
requiredFailures tracking in runner
- All Helm schema validation tests updated and passing (removed enforce tests,
added type tests)
- All Helm unit tests updated and passing with new default values
- Deployed to bach cluster with api_key_valid disabled - validator runs correctly:
- istio_xcluster_lb shows as "required" with Passing: true
- Disabled checks are correctly omitted from output
- Pod starts successfully (exit code 0)
- ConfigMap generates correct `checks: [{name, type}]` format
efcab20 to
086db10
Compare
had the default set to "required"; I changed it to "optional" But the thing is that due to how the code is structured it can't really come up... because the user doesn't have direct access (unless they are engaging in shenanigans) to the data generated in the validator CM; they specify stuff in their overrides, then the template helpers generate the actual name/type properties. If they do something like |
The Problem
The CloudZero Agent validator runs diagnostic checks during pod startup to verify the deployment is configured correctly. Until now, these checks existed but offered no user-facing configuration. All checks ran, but there was no supported way to:
This made it difficult to be aggressive about validation. We couldn't enforce checks that detect serious problems (like Istio cross-cluster misconfigurations) without also risking blocked deployments for less critical issues.
The Solution
This PR introduces the first user-facing API for controlling validator diagnostic behavior. Each check can now be configured independently with one of four types:
requiredoptionalinformativedisabledKey behavior: All checks run before any exit decision—we collect all diagnostics first, then determine the exit code based on whether any
requiredchecks failed.The API
The definitive reference is
helm/values.yaml. Here's the current configuration:This interface was chosen because it provides a Helm-friendly way to override the settings; trying to modify a list in Helm is difficult, you generally have to replace the whole list. With this design, you can disable any check you want (e.g.,
components.validator.checks.pre-start.istio_xcluster_lb=disabled), add it to another location, etc. The JSON Schema prevents specification of invalid validator checks.Design Decisions
istio_xcluster_lbasrequiredThis is the exemplar of what
requiredis for. The Istio cross-cluster load balancing check:This is exactly the kind of aggressive validation we want. If this check fails, there's a real configuration problem that will cause incorrect cost allocation.
api_key_validasoptionalWe want to start collecting data as soon as possible. If the API key is invalid:
Informative Checks
Pure telemetry:
k8s_version,k8s_namespace,k8s_provider,prometheus_versionThese gather environment information sent to CloudZero for diagnostics. They never block startup and always report passing—their job is information gathering, not validation.
Optional Checks
Important but not critical:
kube_state_metrics_reachable,scrape_cfg,webhook_server_reachableThese validate connectivity and configuration but may have transient failures during cluster startup. We log warnings without blocking.
What This Enables
More aggressive validation — We can add
requiredchecks for serious misconfigurations without fear of blocking all deploymentsCustomer-specific configuration — Users can disable checks that don't apply (webhook check when webhooks are disabled, Istio check when not using Istio)
Graceful degradation — Optional checks warn about issues without preventing data collection
Discussion
The check type assignments represent our best judgment, but we're open to feedback:
requiredby default but aren't?optionalis too aggressive?The API structure itself (
components.validator.checks.<stage>.<check>: <type>) is intended to be stable across releases.Validation
CheckTypevalidation,CheckConfigvalidation, andrequiredFailurestracking in the runnerCheckConfigtype definitionsoptions were being respected and handled correctly.
Backwards Compatibility
No concerns — the previous
enforceflag was never exposed invalues.yaml. This is a new user-facing API, not a migration from an existing one.Technical Details
For reviewers interested in implementation:
app/config/validator/diagnostics.go—CheckTypeenum,CheckConfigstructapp/domain/diagnostic/runner/runner.go— Type-aware exit code determinationhelm/templates/_helpers.tpl—cloudzero-agent.validator.stageCheckhelperhelm/values.schema.yaml— Full documentation for each checkCheck types are tracked internally in the runner for exit code determination. They're not stored in the status protobuf or exposed in the API.