Expand projection drift diagnostics #93
Merged
chrisbliss18 merged 4 commits into v2 on Apr 30, 2026
added 4 commits
April 30, 2026 13:57
Improve the rollout projection-drift command so operators get a more useful report when the legacy site_status projection disagrees with the v2 event state. The drift queries now compare each site against a per-blog rollup of open HTTP events before counting or listing mismatches. This avoids overcounting sites that have multiple open HTTP endpoint events and makes the expected legacy projection match the site-level state operators need to reason about during rollout. The command now prints bucket/status summary rows, likely-cause labels, warning guidance, open-event counts, and explicit manual repair guidance before returning the existing non-zero drift failure. It remains read-only and intentionally does not generate or execute production repair SQL. Update the rollout and data model docs to describe the richer drift report, and track the remaining repair-planner decision in the roadmap.
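The per-blog rollup described in this commit can be illustrated with a minimal in-memory sketch. It is not the command's actual SQL or code: the `Site`, `OpenEvent`, and `Drift` types, the `"up"`/`"down"` status values, and the rule that any open HTTP event implies an expected `"down"` projection are all assumptions made for illustration. The point is only that a site with several open endpoint events is compared, and counted, once.

```go
package main

import (
	"fmt"
	"sort"
)

// OpenEvent is a hypothetical stand-in for one open HTTP event row.
type OpenEvent struct {
	BlogID   int64
	Endpoint string
}

// Site is a hypothetical stand-in for one legacy projection row.
type Site struct {
	BlogID     int64
	SiteStatus string // assumed values: "up" or "down"
}

// Drift describes one site whose legacy status disagrees with the rollup.
type Drift struct {
	BlogID     int64
	Legacy     string
	Expected   string
	OpenEvents int
}

// rollupOpenEvents collapses event rows to a single count per blog, so a
// site with several open endpoint events still yields one rollup entry.
func rollupOpenEvents(events []OpenEvent) map[int64]int {
	counts := make(map[int64]int)
	for _, e := range events {
		counts[e.BlogID]++
	}
	return counts
}

// findDrift compares each site exactly once against its rollup.
// Assumption: any open HTTP event means the expected projection is "down".
func findDrift(sites []Site, events []OpenEvent) []Drift {
	counts := rollupOpenEvents(events)
	var out []Drift
	for _, s := range sites {
		expected := "up"
		if counts[s.BlogID] > 0 {
			expected = "down"
		}
		if s.SiteStatus != expected {
			out = append(out, Drift{
				BlogID:     s.BlogID,
				Legacy:     s.SiteStatus,
				Expected:   expected,
				OpenEvents: counts[s.BlogID],
			})
		}
	}
	sort.Slice(out, func(i, j int) bool { return out[i].BlogID < out[j].BlogID })
	return out
}

func main() {
	sites := []Site{{1, "up"}, {2, "up"}, {3, "down"}}
	events := []OpenEvent{{1, "a"}, {1, "b"}, {3, "a"}}
	for _, d := range findDrift(sites, events) {
		fmt.Printf("blog=%d legacy=%s expected=%s open_events=%d\n",
			d.BlogID, d.Legacy, d.Expected, d.OpenEvents)
	}
}
```

Joining sites directly against event rows would report blog 1 twice here. Rolling up first keeps the drift count at one while still carrying the open-event count for the report.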
Add operator-facing notes that the future projection-drift repair planner should wait for real rehearsal or early-production drift examples. The warning now appears near the projection documentation and fleet operations guidance, where operators investigating drift will see it. It reinforces that the current report is intentionally read-only and that repair automation should be based on observed cause labels rather than assumed failure modes.
Tighten the projection-drift implementation after a sysadmin, security, and code-quality review of the branch. The drift SQL now starts from active sites in the requested rollout bucket range and aggregates only matching open HTTP events for those site rows. This preserves the per-site projection comparison while avoiding the broader all-open-events aggregation that was unnecessary for per-host rollout checks. Improve operator output by warning immediately when manual review is required, ordering summary groups and cause guidance by highest drift count first, and widening cause columns so the most important diagnosis remains readable. Sanitize database-backed table strings before printing them in rollout output so unexpected control characters in event state data cannot alter terminal display during an incident investigation.
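As a rough illustration of the output sanitizing mentioned in this commit, the sketch below replaces control and format characters in a table cell before it is printed. The function name `sanitizeCell` and the choice of `?` as the replacement are assumptions; the branch may filter a different character set or escape rather than replace.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// sanitizeCell is a hypothetical helper: it replaces every control character
// (Unicode category Cc, which includes ESC, newline, and tab) and every
// format character (category Cf, which includes bidirectional overrides)
// with '?', so database-backed strings cannot move the cursor, change
// colors, or reorder text in the operator's terminal.
func sanitizeCell(s string) string {
	return strings.Map(func(r rune) rune {
		if unicode.IsControl(r) || unicode.Is(unicode.Cf, r) {
			return '?'
		}
		return r
	}, s)
}

func main() {
	raw := "seems down\x1b[2J\nextra row"
	fmt.Println(sanitizeCell(raw))
}
```

Replacing instead of dropping the characters keeps a visible hint that the stored value contained something unexpected, which can itself be useful during an investigation.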
Improve the projection-drift report after a second sysadmin, security, and code-quality review. The command now fetches all summary groups for cause guidance while still displaying only the top summary rows. This prevents truncated summaries from making the warning counts look complete when hidden groups contain additional causes. Add an explicit next-step warning before the tables, report hidden summary groups and hidden drift rows, and warn if drift changes while the report is running so operators rerun the check before any manual repair. Add regression coverage that hidden summary groups still contribute to cause guidance without being printed in the top summary table.
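The split between what is displayed and what feeds cause guidance can be sketched as follows. This is an illustration under assumptions, not the command's code: `SummaryGroup`, `summarize`, and the cause label strings are hypothetical names. Guidance totals are computed from every group, only the top groups are returned for display, and the number of hidden groups is reported.

```go
package main

import (
	"fmt"
	"sort"
)

// SummaryGroup is a hypothetical stand-in for one bucket/status summary row.
type SummaryGroup struct {
	Cause string // likely-cause label (illustrative values only)
	Count int    // drifted sites in this group
}

// summarize returns the top `limit` groups for display, the number of groups
// left out of the display, and per-cause totals computed from all groups.
func summarize(groups []SummaryGroup, limit int) ([]SummaryGroup, int, map[string]int) {
	// Sort a copy so the caller's slice is left untouched.
	sorted := append([]SummaryGroup(nil), groups...)
	sort.SliceStable(sorted, func(i, j int) bool {
		if sorted[i].Count != sorted[j].Count {
			return sorted[i].Count > sorted[j].Count // highest drift count first
		}
		return sorted[i].Cause < sorted[j].Cause
	})

	// Cause totals use every group, including ones that will not be shown.
	totals := make(map[string]int)
	for _, g := range sorted {
		totals[g.Cause] += g.Count
	}

	if limit < 0 {
		limit = 0
	}
	if limit > len(sorted) {
		limit = len(sorted)
	}
	return sorted[:limit], len(sorted) - limit, totals
}

func main() {
	groups := []SummaryGroup{
		{"missing_projection_write", 3},
		{"stale_legacy_state", 5},
		{"unexpected_event_shape", 1},
	}
	shown, hidden, totals := summarize(groups, 2)
	for _, g := range shown {
		fmt.Printf("%-28s %d\n", g.Cause, g.Count)
	}
	if hidden > 0 {
		fmt.Printf("(%d summary group(s) not shown)\n", hidden)
	}
	fmt.Printf("causes considered for guidance: %d\n", len(totals))
}
```

With a limit of 2, the third group is not printed but still appears in the totals, so guidance derived from them does not silently omit a cause.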
Why
This improves the rollout projection-drift tooling so Systems operators get actionable diagnostics when the legacy jetpack_monitor_sites.site_status projection disagrees with authoritative v2 event state. Projection drift is a rollout blocker, and a count-only failure does not give enough context to decide whether the issue is isolated, range-wide, stale legacy state, a missing event-to-projection write, or an unexpected event shape.
What changed
Example output
Validation
go test ./cmd/jetmon2 ./internal/db
make rollout-docs-verify