Harden Jetmon v2 against DNS-related post-recovery false positives #108
Merged
added 4 commits
May 12, 2026 22:15
The latest uptime-bench public-fleet report showed Jetmon v2 detecting the real outages but then opening extra false-positive incidents after recoveries or verifier false alarms. Many of those extra incidents were transport-only failures such as local resolver NXDOMAIN or connect/timeout errors after the benchmark target had already recovered. Keep a short in-memory recent-recovery marker per site and use it to suppress transport-only failures for one site cadence, capped at five minutes. Suppressed failures are counted and audited, but they do not seed retry state, open Seems Down events, or make the streaming planner mark the site non-running. Also treat verifier false-alarm closure as a recovery point so repeated local resolver blips after a false alarm do not immediately reopen the same incident. Add orchestrator, retry queue, and streaming side-effect tests for the suppression window and normal escalation after the window expires.
Post-recovery transport failures are now suppressed before they enter retry state, but the streaming scheduler could still put that site on the one-minute immediate retry path before side effects completed. That kept hammering resolver/provider transients and could reopen the same false-positive pattern as soon as the suppression window expired. Teach the streaming retry decision to recognize the same recent-recovery transport failure shape and schedule it at the normal site cadence instead. Existing retry state still keeps the immediate retry path, so real in-progress incidents continue to move toward verifier escalation without delay.
The latest uptime-bench run narrowed the remaining Jetmon v2 failures to repeated local resolver NXDOMAIN false positives after Verifliers had already disagreed with the outage. Those repeats landed just outside the normal post-recovery window for 3-minute sites, so treating false alarms as ordinary recoveries allowed the monitor to reopen Seems Down too quickly. Track verifier false alarms separately from real recoveries and give them a longer transport-only dampening window. Also cancel stale queued immediate retries when streaming side effects keep a failed result in Running state, so async side-effect resolution cannot leak another one-minute retry after suppression. Validated with go test ./internal/orchestrator and go test ./....
Treat monitor-local DNS lookup failures as low-confidence transport failures until verifier confirmation. This keeps transient resolver instability from immediately opening customer-visible HTTP Seems Down events while preserving the retry and verifier escalation path for real outages. When verifiers confirm a deferred DNS failure, open the HTTP incident directly as Down and keep the legacy projection update tied to the event mutation. Add resolver-source metadata and startup logging so reports can show whether checks used configured resolvers or the host resolver path. Refresh the false-alarm dampening marker whenever a transient post-false-alarm failure is suppressed so repeated local transport noise does not reappear just outside the original window.
Summary
This PR hardens Jetmon v2 against the post-recovery false positives seen in the TLS advisory uptime-bench scenarios when local DNS briefly returned NXDOMAIN or resolver errors after a target recovered.
The change keeps real downtime detectable while lowering confidence in local monitor-only DNS failures. When the local monitor sees a DNS-shaped failure with no HTTP response, Jetmon now defers customer-visible HTTP downtime until Veriflier confirmation. If Verifliers agree that the site is down, Jetmon opens the confirmed outage directly as Down; if they do not agree, the transient local failure remains non-customer-visible.
What changed
Seems Down events for those low-confidence DNS failures until Verifliers confirm.
Focused validation
go test ./...
20260512T220614Z-75m-jetmon-v2-post-recovery-fp-clean-dns: 4/4 scenarios with no false positives, no TLS advisory false outages, and expected TLS advisory detection preserved.