feat(liveness): B.2a — system-liveness-loop (15/15 ACs, 100%)#421
Merged
This was referenced May 29, 2026
fbeecae to
d5000e9
Compare
remyluslosius
added a commit
that referenced
this pull request
May 29, 2026
…100%)
Closes Slice B.2 (awareness layer): B.2a liveness loop + B.2b drift
detector. Pure consumer of the B.1c transaction log; classifies
per-host compliance drift against operator-tunable thresholds.
Spec
New: app/specs/system/drift-detector.spec.yaml (status: approved).
14 ACs across 8 constraints.
Migrations 0011 + 0012 (copies from B.1a/B.1c branches, identical content)
Required for the integration tests — host_rule_state and transactions
tables are the detector's primary read source. goose treats identical
duplicate migrations as no-ops when multiple PRs merge.
internal/drift package
doc.go Architectural choices: pure consumer of transactions, no
baselines table (Python-era artifact explicitly dropped),
percentage-point math, single read transaction.
types.go DriftKind closed enum (stable / minor_worsening /
major_worsening / improvement). Thresholds struct with
defaults major=10pp, minor=5pp, improvement=5pp.
ValidateThresholds enforces (0, 100] range + major >= minor.
DriftReport with per-severity transition counts
(critical/high/medium/low × became_failing/became_passing).
classify.go Pure function:
delta := current - prior
delta >= ImprovementPP → DriftImprovement
delta <= -MajorWorseningPP → DriftMajorWorsening
delta <= -MinorWorseningPP → DriftMinorWorsening
otherwise → DriftStable
ComplianceScore(passed, failed) excludes skipped from
denominator (passed / (passed + failed)) × 100.
service.go Service.DetectForScan reads prior + current scores under
a single read transaction. Prior score is reconstructed
by inverting this scan's transactions (state_changed flips,
first_seen removed from prior). Per-severity counts come
from the transactions table filtered to (state_changed,
first_seen). Emission gates on Kind != DriftStable;
stable scans produce zero audits (no-noise principle).
ACs covered (14 of 14)
AC-01 Classify(80, 70) at default thresholds → DriftMajorWorsening
AC-02 Classify(80, 76) → DriftStable (4pp drop, below the 5pp minor threshold)
AC-03 Classify(80, 75) → DriftMinorWorsening (5pp = minor)
AC-04 Classify(70, 78) → DriftImprovement (8pp gain)
AC-05 Classify(80, 82) → DriftStable (2pp gain, below the 5pp improvement threshold)
AC-06 Different thresholds produce different kinds for same delta
AC-07 ValidateThresholds rejects (0, 100] violators and major < minor
AC-08 First-ever scan (all first_seen) → DriftStable, HasPriorBaseline=false
AC-09 Per-severity transition counts populated from transactions
AC-10 Major worsening emits one compliance.drift.detected with
drift_type="major" + negative score_delta
AC-11 Stable scans emit zero compliance.drift.detected
AC-12 Source-inspection: no scan_baselines / ScanBaseline references
AC-13 DriftKind enum has exactly 4 values; AllDriftKinds lists them
AC-14 ComplianceScore excludes skipped (denominator = pass + fail)
Local validation
go build ./internal/drift/ clean
go vet ./internal/drift/ clean
go test -race ./internal/drift/ 16 sub-tests pass against real
Postgres + migrations 0001-0012
specter coverage system-drift-detector 14/14 = 100%
Architectural choices worth flagging
- Pure-function classifier. Same (prior, current, thresholds) →
same DriftKind, deterministic and trivially testable without a
database. The 7 Classify tests run in <1ms total.
- Read-only transaction wraps prior+current score computation
so a concurrent writer cannot produce a torn view (spec C-06).
- DriftReport carries the per-severity transition counts so the
B.3 alert router can route by severity without re-querying the
transactions table.
- Audit emits only on non-stable kinds. Stable scans (delta inside
both the minor-worsening and improvement thresholds) produce zero
audit traffic. Spec C-04.
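The prior-score reconstruction mentioned for service.go (inverting this scan's transactions instead of reading a baselines table) can be illustrated with a small in-memory sketch. The Transaction shape, the Counts type, and the restriction to pass/fail statuses are assumptions made for illustration; the real code reads these rows from Postgres inside a single read transaction.

```go
package main

import "fmt"

// Hypothetical in-memory shapes; the actual schema is not shown in the PR.
type TxKind string

const (
	TxStateChanged TxKind = "state_changed"
	TxFirstSeen    TxKind = "first_seen"
)

type Status string

const (
	StatusPass Status = "pass"
	StatusFail Status = "fail"
)

// Transaction records what this scan changed for one rule.
type Transaction struct {
	Kind      TxKind
	NewStatus Status
}

// Counts is the pass/fail tally that feeds the compliance score.
type Counts struct{ Passed, Failed int }

// PriorCounts undoes this scan's transactions to recover the tally
// that held before the scan ran.
func PriorCounts(current Counts, txs []Transaction) Counts {
	prior := current
	for _, tx := range txs {
		switch tx.Kind {
		case TxFirstSeen:
			// The rule had no prior state: remove it from the prior tally.
			if tx.NewStatus == StatusPass {
				prior.Passed--
			} else if tx.NewStatus == StatusFail {
				prior.Failed--
			}
		case TxStateChanged:
			// The rule flipped: move it back to the opposite bucket.
			if tx.NewStatus == StatusPass {
				prior.Passed--
				prior.Failed++
			} else if tx.NewStatus == StatusFail {
				prior.Failed--
				prior.Passed++
			}
		}
	}
	return prior
}

// HasPriorBaseline is false when every rule was first seen in this scan.
func HasPriorBaseline(prior Counts) bool { return prior.Passed+prior.Failed > 0 }

func main() {
	current := Counts{Passed: 7, Failed: 3}
	txs := []Transaction{
		{Kind: TxStateChanged, NewStatus: StatusFail},
		{Kind: TxFirstSeen, NewStatus: StatusPass},
	}
	prior := PriorCounts(current, txs)
	fmt.Println(prior, HasPriorBaseline(prior))
}
```

A first-ever scan consists only of first_seen rows, so the reconstructed prior tally is empty and no baseline exists, which matches the AC-08 behaviour.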
Slice B.2 status
B.2a liveness loop PR #421 — 15/15 ACs, ready for review
B.2b drift detector this PR — 14/14 ACs
B.2 trunk done: 29 ACs across 2 specs. Ready for B.3 (event bus +
alert router) once these merge.
d5000e9 to
a125c11
Compare
a125c11 to
9aa5e12
Compare
remyluslosius
added a commit
that referenced
this pull request
May 29, 2026
* feat(drift): B.2b — system-drift-detector implementation (14/14 ACs, 100%)
* fix(drift): rename Drift-prefixed exports — Kind/Report/TypeForAudit/AllKinds
Revive flagged DriftKind/DriftReport/DriftTypeForAudit/AllDriftKinds
as stutter in drift.X form. Renamed to remove the prefix; enum value
constants (DriftStable, DriftMinorWorsening, etc.) retain their
prefix since those are already package-scoped names per Go style.
No callers outside internal/drift to update — package is consumed
only by its own tests today.
f9d5bf8 to
9d0f551
Compare
…, 100%)
Opens Slice B.2 with the host-reachability probe loop. Credential-free
TCP-banner probe with hysteresis on state transition, deterministic
per-host jitter, concurrency guard, and host_liveness persistence.
Spec
New: app/specs/system/liveness-loop.spec.yaml (status: approved).
15 ACs across 9 constraints.
Migration 0013
host_liveness table: PRIMARY KEY (host_id) FK → hosts(id) ON DELETE
CASCADE; reachability_status with CHECK constraint
('reachable'|'unreachable'|'unknown'); consecutive_failures,
last_response_ms, last_state_change_at, last_error_type, etc.
Partial index on rows where status='unreachable' so the scheduler's
dispatch path can quickly skip unreachable hosts.
internal/liveness package
doc.go Architectural choices: credential-free, TCP-banner over
ICMP, hysteresis, jitter, audit on transitions only.
types.go Status enum + safety limits (5min default cadence, 60s
floor, 60min ceiling, 5s probe timeout, 2-failure
threshold, ±20% jitter, 256-byte max banner).
ProbeResult struct + LastErrorType() classifier.
ErrProbeInFlight sentinel.
jitter.go ApplyJitter (deterministic FNV-1a hash of hostID,
produces value in [(1-jitter)×interval, (1+jitter)×interval]).
ClampInterval (enforces [60s, 60min] safety range,
defaults zero to 5min).
probe.go Probe(ctx, addr, timeout) - net.DialTimeout + 256-byte
banner read with conn.SetReadDeadline. Reachable=true
only when banner begins with "SSH-". Non-SSH banner
(e.g. HTTP server on port 22) → Reachable=false with
BannerSeen=true. No SSH handshake, no credentials.
service.go Service struct with ProbeFunc seam (production uses
Probe; tests inject controllable behavior; future
Kensa.Reachable() swap is mechanical).
ProbeHost(ctx, hostID, addr) → applies concurrency guard,
calls probeFunc, persists via host_liveness UPSERT,
emits host.connectivity.checked ONLY on state transitions
(hysteresis: 2 consecutive failures before reachable→
unreachable; immediate flip on unreachable→reachable).
Metrics: probe_count, success/failure counts,
state_transition_count, last_probe_at.
ACs covered (15 of 15)
AC-01 Probe completes within timeout, credential-free
AC-02 TCP refused/timeout produce classified failures
AC-03 SSH-2.0-banner → Reachable=true with BannerSeen
AC-04 Non-SSH banner → Reachable=false (BannerSeen=true)
AC-05 Per-host concurrency guard → ErrProbeInFlight without
invoking the probe function
AC-06 100 parallel probes against distinct hosts race-clean
AC-07 ApplyJitter deterministic + within ±20% + distinct hosts
produce distinct values (no collisions across 100 hosts)
AC-08 ClampInterval clamps [60s, 60min]; zero → 5min default
AC-09 First success: unknown → reachable, audit emitted
AC-10 First failure from reachable: counter=1, status stays
reachable (hysteresis), no audit
AC-11 N=2 consecutive failures: status flips to unreachable +
audit
AC-12 Success after unreachable: flips to reachable, counter=0,
audit emitted
AC-13 Migration 0013 creates host_liveness; ON DELETE CASCADE on
hosts removes liveness row
AC-14 Source-inspection: no internal/credential imports, no
crypto/ssh imports, no ParsePrivateKey calls anywhere
AC-15 Metrics struct round-trips through Snapshot under concurrent
increments
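The transition rules in AC-09 through AC-12 amount to a small state machine. The sketch below is one way to express it; the State struct, the Next function name, and leaving an unknown host in unknown after a single failure are assumptions, since the commit message only specifies the cases listed above.

```go
package main

import "fmt"

type Status string

const (
	StatusUnknown     Status = "unknown"
	StatusReachable   Status = "reachable"
	StatusUnreachable Status = "unreachable"
)

// State is the part of a host_liveness row that drives transitions.
type State struct {
	Status              Status
	ConsecutiveFailures int
}

// Next applies one probe outcome and reports whether the status changed,
// which is the only time an audit event would be emitted.
func Next(s State, success bool, failureThreshold int) (State, bool) {
	if success {
		// Recovery is immediate and resets the failure counter.
		changed := s.Status != StatusReachable
		return State{Status: StatusReachable}, changed
	}
	s.ConsecutiveFailures++
	if s.Status != StatusUnreachable && s.ConsecutiveFailures >= failureThreshold {
		s.Status = StatusUnreachable
		return s, true
	}
	return s, false // still inside the hysteresis window, or already unreachable
}

func main() {
	s := State{Status: StatusUnknown}
	for _, ok := range []bool{true, false, false, true} {
		var changed bool
		s, changed = Next(s, ok, 2)
		fmt.Printf("success=%v status=%s failures=%d audit=%v\n",
			ok, s.Status, s.ConsecutiveFailures, changed)
	}
}
```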
Local validation
go build ./internal/liveness/ clean
go vet ./internal/liveness/ clean
go test -race ./internal/liveness/ 25 sub-tests pass against
real Postgres + migrations 0001-0013
specter coverage system-liveness-loop 15/15 = 100%
Architectural choices worth flagging
- Probe via raw net.DialTimeout + banner read (not crypto/ssh) to
keep the credential-free guarantee. Source-inspection AC-14 catches
any future reach for the SSH library.
- Jitter is deterministic from FNV-1a(hostID) so the same host always
lands at the same offset within the cadence — important for
diagnosability.
- Hysteresis (default 2 consecutive failures before flipping
reachable→unreachable) avoids alert noise from one-off network
blips. Configurable via Service struct field.
- Audit emission on state transitions ONLY. Steady-state probes
don't audit. Keeps audit volume bounded for stable fleets.
Slice B.2 status
B.2a liveness loop this PR — 15/15 ACs
B.2b drift detector pending (next chunk)
9d0f551 to
0030fd0
Compare
Summary
Slice B.2a: system-liveness-loop implementation. Opens Slice B.2 (awareness layer). Credential-free TCP-banner probe loop with hysteresis on state transitions, deterministic per-host jitter, and concurrency guard.
-race (pure logic + probe + integration + source-inspection)
What landed
app/specs/system/liveness-loop.spec.yaml
host_liveness table, FK ON DELETE CASCADE, partial index on unreachable rows
internal/liveness/
Architectural choices locked
net.DialTimeout + 256-byte banner read; never imports internal/credential or golang.org/x/crypto/ssh. AC-14 source-inspection enforces project-wide
hostID produces stable ±20% offset; same host always lands at the same place in the cadence, important for diagnosability
sync.Map of in-flight host IDs; second concurrent probe of the same host returns ErrProbeInFlight without invoking probe function
host.connectivity.checked fires on first-seen + state flips; steady-state probes don't audit. Keeps audit volume bounded for stable fleets
Probe; tests inject fakes; future Kensa.Reachable() swap is mechanical
ACs satisfied
LastErrorType()
SSH-2.0-banner → Reachable=true
host_liveness; ON DELETE CASCADE verified
internal/credential import, no crypto/ssh, no ParsePrivateKey
Local validation
go build ./internal/liveness/: clean
go vet ./internal/liveness/: clean
go test -race ./internal/liveness/: 25 sub-tests pass
specter coverage: system-liveness-loop 15/15 = 100%
Slice B.2 status
Relationship to other PRs
host_liveness.reachability_status alongside host_backoff_state.suppress_until once both land, but neither PR depends on the other for compilation
transactions