Expand projection drift diagnostics #93

Merged
chrisbliss18 merged 4 commits into v2 from feature/projection-drift-tooling
Apr 30, 2026

@chrisbliss18
Contributor

Why

This improves the rollout projection-drift tooling so Systems operators get actionable diagnostics when the legacy jetpack_monitor_sites.site_status projection disagrees with the authoritative v2 event state. Projection drift is a rollout blocker, and a count-only failure does not give enough context to decide whether the issue is isolated or range-wide, or whether it stems from stale legacy state, a missing event-to-projection write, or an unexpected event shape.

What changed

  • Updates projection-drift SQL to compare each active site row against a per-site rollup of open HTTP events.
  • Keeps the query scoped to the requested rollout bucket range so per-host checks do not aggregate unrelated open events.
  • Adds bucket/status summary output before individual drift rows.
  • Adds likely-cause labels and operator guidance for common mismatch classes.
  • Adds an explicit manual-review warning and next-step guidance before the detailed tables.
  • Uses all summary groups for cause totals while displaying only the top summary rows, so hidden groups do not make the guidance incomplete.
  • Reports hidden summary groups/rows and warns if drift changes while the report is running.
  • Sanitizes database-backed table strings before printing them to the terminal.
  • Documents that the command is read-only and that any future repair planner should be based on observed rehearsal or production drift classes.
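The sanitization item in the list above could look roughly like the following. This is a minimal sketch, not the code in this PR: the helper name `sanitizeCell` and the choice of `?` as the replacement character are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// sanitizeCell is a hypothetical helper (not taken from the PR). It replaces
// every Unicode control character, including ESC, tab, and newline, with '?'
// so that a database-backed value cannot move the cursor, clear the screen,
// or break a table row when printed to a terminal.
func sanitizeCell(s string) string {
	return strings.Map(func(r rune) rune {
		if unicode.IsControl(r) {
			return '?'
		}
		return r
	}, s)
}

func main() {
	// An event state carrying an ANSI "clear screen" sequence.
	fmt.Println(sanitizeCell("Down\x1b[2J")) // prints "Down?[2J"
}
```

Replacing rather than dropping the characters keeps a visible hint that the stored value was unusual, which may itself be useful during an investigation.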

Example output

INFO projection_drift_range=0-99
INFO legacy_projection_drift=3
WARN legacy_projection_drift_requires_manual_review=3
WARN projection_drift_next_step="review the summary first, then inspect listed event rows before making any site_status repair"

## projection drift summary
BUCKET   COUNT   SITE_STATUS            EXPECTED               OPEN_EVENTS SAMPLE_BLOG  CAUSE                               EVENT_STATE
7        2       1/SITE_RUNNING         2/SITE_CONFIRMED_DOWN  1           42           missing_confirmed_down_projection   Down
8        1       0/SITE_DOWN            1/SITE_RUNNING         0           43           stale_legacy_down_projection        -
WARN projection_drift_cause=missing_confirmed_down_projection count=2 action="an open Down event exists but the legacy site row still reports running; inspect the eventstore transaction path before continuing rollout"
WARN projection_drift_cause=stale_legacy_down_projection count=1 action="the legacy site row still reports downtime even though no open HTTP downtime event requires it; inspect recent close transitions before setting the projection back to running"

## projection drift rows
BLOG_ID      BUCKET   SITE_STATUS            EXPECTED               EVENT_ID   OPEN_EVENTS CAUSE                               EVENT_STATE
42           7        1/SITE_RUNNING         2/SITE_CONFIRMED_DOWN  123        1           missing_confirmed_down_projection   Down
INFO projection_drift_rows_truncated=2
INFO projection_drift_repair=manual_confirmation_required
INFO projection_drift_repair_guidance=confirm the authoritative event rows first; then repair the legacy site_status projection inside a reviewed DB change or by rerunning the code path that writes the event and projection together
FAIL legacy projection drift=3 in range 0-99
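The cause labels in the example output can be read as a function of the legacy site_status and the expected status. The sketch below covers only the two classes shown above; the function name `classifyDrift` and the `unclassified` fallback label are assumptions, and the real command may recognize more classes.

```go
package main

import "fmt"

// Legacy site_status values as they appear in the example output.
const (
	siteDown          = 0
	siteRunning       = 1
	siteConfirmedDown = 2
)

// classifyDrift is a hypothetical helper (not taken from the PR). It maps a
// (legacy status, expected status) pair to a likely-cause label. An empty
// string means the projection agrees with the event state.
func classifyDrift(siteStatus, expected int) string {
	switch {
	case siteStatus == expected:
		return ""
	case siteStatus == siteRunning && expected == siteConfirmedDown:
		// An open Down event exists but the legacy row still says running.
		return "missing_confirmed_down_projection"
	case siteStatus == siteDown && expected == siteRunning:
		// The legacy row says down but no open event requires it.
		return "stale_legacy_down_projection"
	default:
		return "unclassified" // assumed fallback label for this sketch
	}
}

func main() {
	fmt.Println(classifyDrift(siteRunning, siteConfirmedDown))
	fmt.Println(classifyDrift(siteDown, siteRunning))
}
```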

Validation

  • go test ./cmd/jetmon2 ./internal/db
  • make rollout-docs-verify

Chris Jean added 4 commits April 30, 2026 13:57
Improve the rollout projection-drift command so operators get a more useful report when the legacy site_status projection disagrees with the v2 event state.

The drift queries now compare each site against a per-blog rollup of open HTTP events before counting or listing mismatches. This avoids overcounting sites that have multiple open HTTP endpoint events and makes the expected legacy projection match the site-level state operators need to reason about during rollout.
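The per-blog rollup happens in SQL in this PR; the in-memory Go sketch below only illustrates the idea. The types, function names, and severity ordering are assumptions. In particular, the example output only shows that an open `Down` event expects 2/SITE_CONFIRMED_DOWN and that no open events expects 1/SITE_RUNNING; mapping every other open state to 0/SITE_DOWN is a guess made for this sketch.

```go
package main

import "fmt"

const (
	siteDown          = 0
	siteRunning       = 1
	siteConfirmedDown = 2
)

// openEvent and rollup are hypothetical types for this sketch.
type openEvent struct {
	BlogID int64
	State  string
}

type rollup struct {
	OpenEvents int // number of open HTTP events for the blog
	Expected   int // site-level status the legacy projection should hold
}

// expectedForState maps an event state to a legacy status.
// Only the "Down" mapping is visible in the PR; the rest is assumed.
func expectedForState(state string) int {
	if state == "Down" {
		return siteConfirmedDown
	}
	return siteDown
}

// severity ranks statuses so the worst open event wins the rollup.
func severity(status int) int {
	switch status {
	case siteConfirmedDown:
		return 2
	case siteDown:
		return 1
	default:
		return 0
	}
}

// rollupByBlog collapses many open events into one entry per blog, so a
// site with several open endpoint events is compared (and counted) once.
func rollupByBlog(events []openEvent) map[int64]rollup {
	out := make(map[int64]rollup)
	for _, ev := range events {
		r, seen := out[ev.BlogID]
		if !seen {
			r.Expected = siteRunning
		}
		r.OpenEvents++
		if s := expectedForState(ev.State); severity(s) > severity(r.Expected) {
			r.Expected = s
		}
		out[ev.BlogID] = r
	}
	return out
}

// expectedStatus returns SITE_RUNNING for blogs with no open events.
func expectedStatus(rollups map[int64]rollup, blogID int64) int {
	if r, ok := rollups[blogID]; ok {
		return r.Expected
	}
	return siteRunning
}

func main() {
	rollups := rollupByBlog([]openEvent{
		{BlogID: 42, State: "Down"},
		{BlogID: 42, State: "Down"},
	})
	fmt.Println(len(rollups), rollups[42].OpenEvents, expectedStatus(rollups, 42))
	fmt.Println(expectedStatus(rollups, 43))
}
```

Joining site rows directly against events would yield two mismatch rows for blog 42 here; comparing against the rollup yields one.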

The command now prints bucket/status summary rows, likely-cause labels, warning guidance, open-event counts, and explicit manual repair guidance before returning the existing non-zero drift failure. It remains read-only and intentionally does not generate or execute production repair SQL.

Update the rollout and data model docs to describe the richer drift report, and track the remaining repair-planner decision in the roadmap.

Add operator-facing notes that the future projection-drift repair planner should wait for real rehearsal or early-production drift examples.

The warning now appears near the projection documentation and fleet operations guidance, where operators investigating drift will see it. It reinforces that the current report is intentionally read-only and that repair automation should be based on observed cause labels rather than assumed failure modes.

Tighten the projection-drift implementation after a sysadmin, security, and code-quality review of the branch.

The drift SQL now starts from active sites in the requested rollout bucket range and aggregates only matching open HTTP events for those site rows. This preserves the per-site projection comparison while avoiding the broader all-open-events aggregation that was unnecessary for per-host rollout checks.

Improve operator output by warning immediately when manual review is required, ordering summary groups and cause guidance by highest drift count first, and widening cause columns so the most important diagnosis remains readable.

Sanitize database-backed table strings before printing them in rollout output so unexpected control characters in event state data cannot alter terminal display during an incident investigation.

Improve the projection-drift report after a second sysadmin, security, and code-quality review.

The command now fetches all summary groups for cause guidance while still displaying only the top summary rows. This prevents truncated summaries from making the warning counts look complete when hidden groups contain additional causes.
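One way to keep cause totals complete while printing only the top rows is to compute the totals before truncating. The sketch below assumes a hypothetical `summaryGroup` type and `summarize` function; neither name comes from the PR.

```go
package main

import (
	"fmt"
	"sort"
)

// summaryGroup is a hypothetical stand-in for one bucket/status summary row.
type summaryGroup struct {
	Bucket int
	Cause  string
	Count  int
}

// summarize orders groups by highest count first, totals every cause across
// ALL groups, and only then truncates the display list to limit rows.
// It also reports how many groups were hidden by the truncation.
func summarize(groups []summaryGroup, limit int) (shown []summaryGroup, totals map[string]int, hidden int) {
	sorted := append([]summaryGroup(nil), groups...) // leave the input untouched
	sort.SliceStable(sorted, func(i, j int) bool {
		return sorted[i].Count > sorted[j].Count
	})

	totals = make(map[string]int)
	for _, g := range sorted {
		totals[g.Cause] += g.Count
	}

	if limit < 0 {
		limit = 0
	}
	if limit > len(sorted) {
		limit = len(sorted)
	}
	return sorted[:limit], totals, len(sorted) - limit
}

func main() {
	groups := []summaryGroup{
		{Bucket: 8, Cause: "stale_legacy_down_projection", Count: 1},
		{Bucket: 7, Cause: "missing_confirmed_down_projection", Count: 2},
	}
	shown, totals, hidden := summarize(groups, 1)
	for _, g := range shown {
		fmt.Printf("bucket=%d count=%d cause=%s\n", g.Bucket, g.Count, g.Cause)
	}
	fmt.Println("hidden groups:", hidden)
	fmt.Println("stale total:", totals["stale_legacy_down_projection"])
}
```

With a display limit of 1, the bucket 8 group is not printed, yet its cause still appears in the totals that drive the guidance.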

Add an explicit next-step warning before the tables, report hidden summary groups and hidden drift rows, and warn if drift changes while the report is running so operators rerun the check before any manual repair.
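The PR does not say how a change in drift during the report is detected. One plausible approach, sketched here purely as an assumption, is to take the drift count before and after the detailed queries and compare the two. The function name `driftChanged` and its callback parameters are hypothetical.

```go
package main

import "fmt"

// driftChanged is a hypothetical helper (not taken from the PR). It counts
// drift, runs the report, counts again, and reports whether the two counts
// differ. Equal counts do not prove the same sites drifted, so this is only
// a cheap signal that the operator should rerun the check.
func driftChanged(count func() (int, error), report func() error) (bool, error) {
	before, err := count()
	if err != nil {
		return false, err
	}
	if err := report(); err != nil {
		return false, err
	}
	after, err := count()
	if err != nil {
		return false, err
	}
	return before != after, nil
}

func main() {
	// Stub counter that simulates drift growing from 3 to 4 mid-report.
	counts := []int{3, 4}
	next := 0
	count := func() (int, error) {
		c := counts[next]
		next++
		return c, nil
	}

	changed, err := driftChanged(count, func() error { return nil })
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	if changed {
		fmt.Println("drift changed during the report; rerun before any manual repair")
	}
}
```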

Add regression coverage that hidden summary groups still contribute to cause guidance without being printed in the top summary table.
@chrisbliss18 chrisbliss18 merged commit b970444 into v2 Apr 30, 2026
@chrisbliss18 chrisbliss18 deleted the feature/projection-drift-tooling branch April 30, 2026 19:18