Harden Jetmon v2 against DNS-related post-recovery false positives #108

Merged
chrisbliss18 merged 4 commits into v2 from feature/jetmon-v2-post-recovery-fp-hardening on May 13, 2026

Conversation

@chrisbliss18
Contributor

Summary

This PR hardens Jetmon v2 against the post-recovery false positives seen in the TLS advisory uptime-bench scenarios when local DNS briefly returned NXDOMAIN or resolver errors after a target recovered.

The change keeps real downtime detectable while treating monitor-local DNS failures as lower-confidence evidence. When the local monitor sees a DNS-shaped failure with no HTTP response, Jetmon now defers the customer-visible HTTP downtime event until Veriflier confirmation. If the Verifliers agree that the site is down, Jetmon opens the confirmed outage directly as Down; if they do not, the transient local failure remains non-customer-visible.

What changed

  • Treat local DNS timeout/connect failures with no HTTP status as low-confidence downtime candidates.
  • Defer customer-visible HTTP Seems Down events for those low-confidence DNS failures until Verifliers confirm.
  • Preserve resolver evidence in retry and confirmed-down metadata, including whether checks used configured or system resolvers.
  • Refresh the false-alarm dampening marker while suppressing transient post-recovery blips so the dampening window rolls forward during unstable recovery periods.
  • Add metrics for low-confidence DNS failures awaiting Veriflier confirmation.
  • Add regression tests for post-recovery suppression, DNS metadata, retry behavior, and verifier-confirmed promotion paths.

Focused validation

  • go test ./...
  • Deployed the branch to the Jetmon v2 test service and ran the focused uptime-bench post-recovery false-positive suite.
  • Latest focused report: 20260512T220614Z-75m-jetmon-v2-post-recovery-fp-clean-dns
  • Result: Jetmon v2 passed 4/4 scenarios with no false positives, no TLS advisory false outages, and expected TLS advisory detection preserved.

Chris Jean added 4 commits May 12, 2026 22:15
The latest uptime-bench public-fleet report showed Jetmon v2 detecting the real outages but then opening extra false-positive incidents after recoveries or verifier false alarms. Many of those extra incidents were transport-only failures such as local resolver NXDOMAIN or connect/timeout errors after the benchmark target had already recovered.

Keep a short in-memory recent-recovery marker per site and use it to suppress transport-only failures for one site cadence, capped at five minutes. Suppressed failures are counted and audited, but they do not seed retry state, open Seems Down events, or make the streaming planner mark the site non-running.

Also treat verifier false-alarm closure as a recovery point so repeated local resolver blips after a false alarm do not immediately reopen the same incident. Add orchestrator, retry queue, and streaming side-effect tests for the suppression window and normal escalation after the window expires.
Post-recovery transport failures are now suppressed before they enter retry state, but the streaming scheduler could still put that site on the one-minute immediate retry path before side effects completed. That kept hammering resolver/provider transients and could reopen the same false-positive pattern as soon as the suppression window expired.

Teach the streaming retry decision to recognize the same recent-recovery transport failure shape and schedule it at the normal site cadence instead. Existing retry state still keeps the immediate retry path, so real in-progress incidents continue to move toward verifier escalation without delay.
The latest uptime-bench run narrowed the remaining Jetmon v2 failures to repeated local resolver NXDOMAIN false positives after Verifliers had already disagreed with the outage. Those repeats landed just outside the normal post-recovery window for 3-minute sites, so treating false alarms as ordinary recoveries allowed the monitor to reopen Seems Down too quickly.

Track verifier false alarms separately from real recoveries and give them a longer transport-only dampening window. Also cancel stale queued immediate retries when streaming side effects keep a failed result in Running state, so async side-effect resolution cannot leak another one-minute retry after suppression.

Validated with go test ./internal/orchestrator and go test ./....
Treat monitor-local DNS lookup failures as low-confidence transport failures until verifier confirmation. This keeps transient resolver instability from immediately opening customer-visible HTTP Seems Down events while preserving the retry and verifier escalation path for real outages.

When verifiers confirm a deferred DNS failure, open the HTTP incident directly as Down and keep the legacy projection update tied to the event mutation. Add resolver-source metadata and startup logging so reports can show whether checks used configured resolvers or the host resolver path.

Refresh the false-alarm dampening marker whenever a transient post-false-alarm failure is suppressed so repeated local transport noise does not reappear just outside the original window.
@chrisbliss18 chrisbliss18 merged commit ec1a26d into v2 May 13, 2026
2 checks passed
