spec(slice-b): B.1 trunk drafts — scheduler, kensa-executor, transaction-log-writer#415

Draft
remyluslosius wants to merge 2 commits into
mainfrom
feat/slice-b-b1-specs

@remyluslosius
Contributor

Summary

Three draft Specter specs for Slice B's trunk wave — the end-to-end path that takes a scheduled scan through Kensa execution and into the write-on-change transaction log.

All three: status: draft, tier: 1 (100% coverage target once implementation lands).

Specs

| Spec | ACs | Highlights |
| --- | --- | --- |
| system-scheduler | 11 | 60s cron tick, SKIP LOCKED dispatch, tier intervals from signed schedules policy, policy_version snapshotted at enqueue, 48h max interval cap, maintenance mode honored, manual scans bypass schedule |
| system-kensa-executor | 12 | Bridge from OW credentials to Kensa Go module, in-memory SSH key only (no /tmp keyfile, fixes a Python concern), per-host concurrency guard, ctx cancellation honored, credential zero-after-use, source-inspection ACs verify no engine-abstraction interface |
| system-transaction-log-writer | 12 | host_rule_state UPSERT every scan, transactions INSERT only on state change or first-seen, single DB tx per Apply, idempotent on scan_id, FK constraints w/ ON DELETE RESTRICT, evidence JSONB validated against KensaEvidence OpenAPI schema. Drops Python's scan_baselines table: the prior transactions row IS the baseline. |

OpenAPI delta

Scan.policy_version field added (snapshotted policy version at enqueue). The existing Scan.initiator object already carries the scheduler-vs-manual distinction — no new endpoint needed.

Coverage gate — expected red

All 33 specs parse cleanly and pass structural checks (specter check). Coverage report shows the 3 new drafts at 0% — by design, no tests exist yet. Implementation lands in per-component follow-up PRs:

  • B.1a — system-scheduler impl + tests (brings to 100%)
  • B.1b — system-kensa-executor impl + tests (brings to 100%)
  • B.1c — system-transaction-log-writer impl + tests (brings to 100%)

This PR is opened as draft. Merging strategy is the open question for review (see below).

Open for review

  1. Spec shapes — constraint/AC granularity, scope includes/excludes, priority assignments.
  2. Merging strategy: three options —
    a. Merge this spec PR first, accept temporary coverage red on main, then land impl PRs one at a time.
    b. Hold this PR as draft; merge each spec+impl together as a single PR per component.
    c. Roll everything into one mega-PR (spec + all three impls + tests).
    I lean (b) — keeps main green at all times, cost is spec review happens alongside impl review.
  3. OpenAPI policy_version location — currently a top-level Scan field. Could nest under Scan.initiator.policy_version if you prefer.

Test plan

  • specter parse — all 33 specs valid
  • specter check — all constraints referenced, no orphans
  • CI coverage — will fail as documented above
  • Spec content review by user

…nsaction-log-writer

Three draft Specter specs for Slice B's "trunk" wave: the end-to-end
path that takes a scheduled scan through Kensa execution and into the
write-on-change transaction log. All three are status=draft, tier=1
(100% coverage target once implementation lands).

system-scheduler (11 ACs)
  Adaptive scan scheduling. 60s cron tick, SKIP LOCKED dispatch,
  tier intervals from Ed25519-signed `schedules` policy, snapshot
  policy_version at enqueue so reloads don't affect in-flight scans.
  48h max interval cap. Maintenance mode honors policy. Manual POST
  /scans bypasses the schedule entirely.

system-kensa-executor (12 ACs)
  Bridge from OpenWatch credentials to Kensa Go module. In-memory
  SSH key parsing only — never writes to /tmp (the major fix vs the
  Python implementation). Per-host concurrency guard, policy-tunable
  timeout, ctx cancellation honored, credential zero-after-use,
  emits scan.started/completed/failed. Source-inspection ACs verify
  no engine-abstraction interface (Kensa is the only engine in B).

system-transaction-log-writer (12 ACs)
  Write-on-change persistence: host_rule_state UPSERT every scan,
  transactions INSERT only on state change or first-seen. Single DB
  transaction per Apply call, idempotent on scan_id, FK constraints
  with ON DELETE RESTRICT. Evidence JSONB validated against the
  KensaEvidence OpenAPI schema. Explicitly drops Python's
  scan_baselines table — the prior transactions row IS the baseline.

OpenAPI delta
  Adds Scan.policy_version field (snapshotted policy version at
  enqueue). The existing Scan.initiator object already carries the
  scheduler-vs-manual distinction.

Coverage status
  All 33 specs parse and pass structural checks. Coverage shows the
  3 new drafts at 0% — expected, no tests yet. Implementation work
  follows in per-component PRs:

    B.1a  system-scheduler impl + tests
    B.1b  system-kensa-executor impl + tests
    B.1c  system-transaction-log-writer impl + tests

  Each will bring its spec from draft -> approved and lift coverage
  to 100% before merge. Until then this PR's coverage gate will fail
  by design.

Pass the three drafts through a security-first lens. Net adds: 9
constraints, 11 acceptance criteria. Coverage of new ACs stays at
the expected 0% until impl lands.

system-scheduler  (+4 constraints, +4 ACs)
  C-08 minimum 5-minute interval floor (anti scan-storm DoS)
  C-09 audit emission on every host_compliance_schedule UPDATE
  C-10 signing-key revocation list checked alongside Ed25519 sig
  C-11 HMAC over job payload, verified at dequeue
  AC-10 amended: policy verified at boot AND every reload
  ACs 12-15 cover the new constraints

system-kensa-executor  (+3 constraints, +4 ACs)
  C-09 SSH host-key verification via internal/ssh/known_hosts;
       first-connect policy from policy.Schedules.HostKeyPolicy
  C-10 per-rule evidence cap at 10 MB (target-OOM defense)
  C-11 per-host backoff after 3 consecutive failures
  ACs 13-16 cover host-key, oversize, decryption-failure audit,
            and backoff state visible to scheduler

system-transaction-log-writer  (+2 constraints, +3 ACs)
  C-04 amended: scan_id MUST be server-generated UUIDv4 (anti-replay)
  C-09 sqlc-only DB access; no string-concat SQL in this package
  C-10 256 KB per-rule evidence cap at writer (defense in depth on
       top of executor's 10 MB cap)
  ACs 13-15 cover sqlc-only source inspection, oversize rejection,
            and writer.apply.failed audit emission

Deferred (per discussion): evidence integrity hash defense-in-depth
  In-process trust boundary makes this belt-and-suspenders. Revisit
  if threat model warrants persisting a SHA-256 alongside evidence.

Local validation
  specter parse: 33/33 PASS
  specter check: all constraints referenced, no orphans
  specter coverage: 3 new drafts still at 0% (by design, no impl yet)
remyluslosius added a commit that referenced this pull request May 29, 2026
…00%) (#418)

* feat(scheduler): B.1a foundation — spec, migration, audit events, ladder logic

First reviewable chunk of system-scheduler implementation. The trunk is
laid; the dispatcher + cron tick + HMAC + post-scan update work follows
in subsequent commits before this PR merges.

Spec
  Promoted system-scheduler from draft → approved. Identical AC set as
  the draft in PR #415 (15 ACs, 11 constraints) — this PR is where the
  spec actually lands on main alongside its implementation.

Migration 0011
  - host_compliance_schedule (host_id PK, compliance_state, score,
    has_critical, current_interval_minutes, next_scheduled_scan,
    last_scan_completed_at, maintenance_mode, maintenance_until,
    policy_version_at_last_scan, timestamps)
  - host_backoff_state (host_id, probe_type {scan|intel},
    consecutive_failures, suppress_until, last_error_code, ...)
  Per the open-question lean accepted earlier: backoff lives in a
  separate table so executor-domain writes don't touch scheduler-owned
  schedule columns. Index supports the dispatcher's
  WHERE next_scheduled_scan <= now() AND maintenance_mode = FALSE pattern.

Audit events (codegen)
  Added category "scheduler" + 7 codes:
    scheduler.startup.failed
    scheduler.schedule.updated
    scheduler.policy.reload.rejected
    scheduler.policy.clamped
    scheduler.policy.revoked_key.rejected
    scheduler.job.hmac_rejected
    scheduler.tick.dispatched
  Each carries a typed detail_schema. events.gen.go regenerated; total
  registry now 103 events.

internal/scheduler package
  - types.go: ComplianceState enum (5 tiers), TierLadder type, hard
    safety floors (MinIntervalFloor=5m, MaxIntervalCap=48h), LoadResult
    + ClampRecord types. Detailed package comment documents architectural
    choices (scheduler-owned schedule writes, separate backoff table,
    manual scans bypass scheduler entirely).
  - ladder.go: LoadIntervals (pure function consuming PolicyTiers,
    returns clamped TierLadder + ClampRecords for audit emission),
    NextScanFor (lastFinishedAt + ladder[state], clamps to ceiling,
    zero-time signals immediate-schedule).

Tests (4 of 15 ACs satisfied, 26.7% coverage on the spec)
  AC-01  TestLoadIntervals_TierLookup_Default48hForMissingTier
  AC-02  TestNextScanFor_AddsLadderInterval
         TestNextScanFor_ClampsToMaxIntervalCap
         TestNextScanFor_ZeroLastFinishedAtMeansImmediate
  AC-09  TestLoadIntervals_PolicyVersionSnapshotted
  AC-12  TestLoadIntervals_ClampsBelow5MinToFloor
         TestLoadIntervals_ClampsAbove48hToCeiling
         TestLoadIntervals_NoClampForInBudgetValues

Spec promotion + missing 11 ACs means CI coverage will fail (T1
threshold = 100%). PR stays draft until the remaining ACs land:

  AC-03  Cron tick at 60s, no double-dispatch on restart
  AC-04  Dispatcher uses FOR UPDATE SKIP LOCKED
  AC-05  Maintenance mode skips dispatch, advances next_scan after expiry
  AC-06  Job payload includes policy_version, host_id, framework_id
  AC-07  Manual POST /scans bypasses schedule (no row writes)
  AC-08  UpdateAfterScan recomputes state + next_scheduled_scan
  AC-10  Bad policy at boot refuses startup + audit
  AC-11  Metrics counters
  AC-13  Every host_compliance_schedule UPDATE emits audit
  AC-14  Revoked-key policy rejected even with valid sig
  AC-15  Tampered job payload fails HMAC verification at dequeue

These break down into:
  - 2 more pure-logic ACs (AC-11 metrics, AC-08 state derivation)
  - 5 DB-integration ACs (AC-03/04/05/06/07 — need pgxpool + real schema)
  - 2 audit/policy ACs (AC-10, AC-13: emission verification)
  - 2 HMAC ACs (AC-14, AC-15: need internal/secretkey + HKDF)

* feat(scheduler): B.1a — pure-logic ACs (state derivation, metrics, HMAC, startup)

Adds 5 more ACs of system-scheduler coverage. All pure functions; no
DB integration in this chunk. After this commit B.1a sits at 9/15 = 60%
coverage. The remaining 6 ACs (AC-03/04/05/07/13/14) all require DB
integration or external infrastructure; they land in the next chunk.

update.go — AC-08
  StateFromScore(score, hasCritical) → ComplianceState (5-bucket mapping
  with hasCritical override). UpdateAfterScan combines it with NextScanFor
  to produce a ScanResult ready for the (later) Service.PersistAfterScan
  UPSERT.

metrics.go — AC-11
  Metrics struct with atomic counters: DueCount, DispatchedCount,
  SkippedMaintenanceCount, SkippedBackoffCount, RefuseCount,
  PolicyClampedCount, HMACRejectCount, plus SetLastTick/LastTick. Snapshot
  produces a typed MetricsSnapshot ready for JSON serialization in the
  admin metrics handler.

hmac.go — AC-06, AC-15
  JobPayload (HostID, FrameworkID, PolicyVersion, EnqueuedAt) with a
  canonical Encode for stable HMAC. Sign / Verify use HMAC-SHA256 +
  constant-time compare. DeriveQueueKey uses HKDF-SHA256 from the DEK
  with info "openwatch-queue-v1" — per the locked open-question decision
  (option C: HKDF from credential DEK).
  Tests verify: round-trip; tampering each of the 4 fields produces a
  different HMAC and is rejected; wrong key rejected; key derivation is
  deterministic for the same DEK and distinct for different DEKs.

startup.go — AC-10
  PolicyLoadError enum (policy_missing / signature_invalid / revoked_key
  / parse_error). Startup(ctx, emit, path, reason) emits
  scheduler.startup.failed via the injected EmitFunc and returns
  ErrStartupRefused on any non-OK reason. EmitFunc matches audit.Emit's
  signature so production wiring is direct; tests use a fake recorder.

Coverage after this commit (9 of 15 ACs):
  AC-01, AC-02, AC-06, AC-08, AC-09, AC-10, AC-11, AC-12, AC-15

Uncovered (6 ACs — all require DB or integration scaffolding):
  AC-03  cron tick @60s + no double-dispatch on restart
  AC-04  dispatcher SELECT ... FOR UPDATE SKIP LOCKED
  AC-05  maintenance_mode skips dispatch; next_scan still advances
  AC-07  manual POST /scans bypasses schedule
  AC-13  every host_compliance_schedule UPDATE emits audit
  AC-14  revocation list mechanism + scheduler.policy.revoked_key.rejected

* feat(scheduler): B.1a — Service + Dispatch (DB integration; AC-04/05/13)

Live scheduler with the SKIP LOCKED dispatcher. Brings system-scheduler
coverage from 60% → 80% (12 of 15 ACs).

service.go
  Service struct holding pool, ladder, policyVersion, hmacKey, emit,
  metrics, Now (clock injection for tests), DefaultFramework.

  Dispatch(ctx) — one pass:
    BEGIN tx
    SELECT host_id, compliance_state, next_scheduled_scan
      FROM host_compliance_schedule
     WHERE next_scheduled_scan <= now() AND maintenance_mode = false
     ORDER BY next_scheduled_scan
     FOR UPDATE SKIP LOCKED
     LIMIT 100
    For each row:
      build JobPayload (host_id, framework_id, policy_version, enqueued_at)
      HMAC-sign with the derived queue key
      queue.Enqueue under job_type "scan"
      UPDATE host_compliance_schedule.next_scheduled_scan forward
      emit scheduler.schedule.updated
    COMMIT
    emit scheduler.tick.dispatched

  emitScheduleUpdated helper produces the typed audit event with
  prior + new state in detail. Metrics counters incremented inline.

service_test.go (integration; requires OPENWATCH_TEST_DSN)
  freshPool helper applies migrations through 0011 and truncates the
  scheduler-touched tables. seedUser/seedHost/seedSchedule build the
  FK chain. newTestService constructs a Service with a deterministic
  clock and a fake EmitFunc that records calls.

  TestDispatch_SkipLocked_DisjointClaim (AC-04)
    Seeds 12 due hosts. Runs two concurrent Dispatch goroutines.
    Asserts that countA + countB == 12 (no double-dispatch, no misses)
    AND that the job_queue has exactly 12 scan rows.

  TestDispatch_FuturesNotClaimed (AC-04 negative)
    All hosts have next_scheduled_scan in the future → dispatched == 0.

  TestDispatch_MaintenanceMode_RowSkipped (AC-05)
    One maintenance host + one normal host both due now → only normal
    is claimed. Maintenance row is NOT mutated.

  TestDispatch_EmitsScheduleUpdated (AC-13)
    Verifies exactly one scheduler.schedule.updated per dispatched host
    AND one scheduler.tick.dispatched per tick. Validates detail keys
    (host_id, change_kind=next_scan_advanced) via JSON decode.

ACs covered after this commit (12 of 15):
  AC-01, AC-02, AC-04, AC-05, AC-06, AC-08, AC-09,
  AC-10, AC-11, AC-12, AC-13, AC-15

Uncovered (3):
  AC-03  cron tick @60s + no double-dispatch on restart
  AC-07  manual POST /scans bypasses schedule
  AC-14  revocation list mechanism + scheduler.policy.revoked_key.rejected

* feat(scheduler): B.1a — final 3 ACs (AC-03/07/14) — 100% coverage

system-scheduler now at 100% spec coverage. specter sync passes end-to-end.

run.go — AC-03
  Run(ctx, interval) wires Service.Dispatch behind internal/cron at
  DefaultTickInterval = 60 * time.Second. interval = 0 means use the
  default; tests pass a sub-second cadence so they don't block.
  TickFunc inside Run logs and returns Dispatch errors; the cron loop
  keeps running on transient failures.

revocation.go — AC-14
  RevocationList: set of revoked Ed25519 signing-key fingerprints.
  Loaded at boot from a separate revocation file (path from config).
  NewRevocationList / Has / Size; nil-safe.
  Service.ValidateReload(ctx, fp, version, list) returns PolicyLoadOK
  or PolicyLoadRevokedKey; on rejection, emits
  scheduler.policy.revoked_key.rejected with detail.key_fingerprint
  and detail.policy_version. Previous valid policy stays active.
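
A small sketch of a nil-safe revocation set and the reload check. NewRevocationList, Has, Size, the event code, and the two detail keys come from the text above. The EmitFunc signature, the result type, and the free-function form of ValidateReload (the commit describes a Service method that also takes a ctx) are simplifications assumed for this example.

```go
package main

import "fmt"

// RevocationList is a set of revoked signing-key fingerprints.
type RevocationList struct {
	revoked map[string]struct{}
}

// NewRevocationList builds the set from the given fingerprints.
func NewRevocationList(fingerprints ...string) *RevocationList {
	r := &RevocationList{revoked: make(map[string]struct{}, len(fingerprints))}
	for _, fp := range fingerprints {
		r.revoked[fp] = struct{}{}
	}
	return r
}

// Has reports whether fp is revoked. A nil list revokes nothing.
func (r *RevocationList) Has(fp string) bool {
	if r == nil {
		return false
	}
	_, ok := r.revoked[fp]
	return ok
}

// Size returns the number of revoked fingerprints; 0 for a nil list.
func (r *RevocationList) Size() int {
	if r == nil {
		return 0
	}
	return len(r.revoked)
}

// PolicyLoadResult and EmitFunc are hypothetical shapes for this sketch.
type PolicyLoadResult string

const (
	PolicyLoadOK         PolicyLoadResult = "ok"
	PolicyLoadRevokedKey PolicyLoadResult = "revoked_key"
)

type EmitFunc func(code string, detail map[string]string)

// ValidateReload rejects a policy signed by a revoked key and emits the
// audit event. Accepted reloads emit nothing.
func ValidateReload(emit EmitFunc, fp, version string, list *RevocationList) PolicyLoadResult {
	if list.Has(fp) {
		emit("scheduler.policy.revoked_key.rejected", map[string]string{
			"key_fingerprint": fp,
			"policy_version":  version,
		})
		return PolicyLoadRevokedKey
	}
	return PolicyLoadOK
}

func main() {
	list := NewRevocationList("fp-old")
	emit := func(code string, detail map[string]string) { fmt.Println(code, detail) }
	fmt.Println(ValidateReload(emit, "fp-old", "v7", list))
	fmt.Println(ValidateReload(emit, "fp-new", "v7", list))
}
```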

run_test.go covers AC-03 and AC-07:
  TestDefaultTickInterval_Is60Seconds — runtime constant check
  TestRun_SourceMentions60SecondInterval — source-inspection of run.go
  TestDispatch_NoDoubleDispatch_OnRepeatedTick — DB test confirming a
    second immediate Dispatch claims 0 rows after the first advanced
    next_scheduled_scan
  TestServer_NoSchedulerTableInScanHandlers — source-inspection of
    internal/server/*.go (non-test files). Asserts no .go file in the
    HTTP layer references "host_compliance_schedule". The scheduler is
    the only writer of that table; manual POST /scans cannot bypass-then-
    silently-mutate the schedule.

revocation_test.go covers AC-14:
  RevocationList Has matches added fingerprints; empty/nil safe.
  ValidateReload accepts un-revoked keys without emitting audit.
  ValidateReload rejects revoked keys, emits the typed event with the
  expected detail keys (key_fingerprint + policy_version).

Coverage after this commit: 15 of 15 ACs = 100%.

  AC-01  ladder default + missing-tier fallback (existing)
  AC-02  NextScanFor arithmetic + ceiling clamp (existing)
  AC-03  60s tick + no double-dispatch on restart    (this commit)
  AC-04  FOR UPDATE SKIP LOCKED dispatch (prior commit)
  AC-05  maintenance_mode skips dispatch (prior commit)
  AC-06  payload host_id/framework_id/policy_version (existing)
  AC-07  manual POST /scans bypasses schedule        (this commit)
  AC-08  state derivation + UpdateAfterScan (existing)
  AC-09  policy version snapshotted (existing)
  AC-10  boot refusal + audit on bad policy (existing)
  AC-11  metrics counters (existing)
  AC-12  policy clamped to safety floor + ceiling (existing)
  AC-13  every schedule UPDATE emits audit (prior commit)
  AC-14  revocation list rejects revoked-key policy  (this commit)
  AC-15  HMAC tamper-rejection across all 4 fields (existing)

specter sync: 31 specs / all pass / coverage thresholds met.

* fix(scheduler): make lint clean — gofmt + remove dead code + gosec annotations

* fix(scheduler): guard emitCall append with mutex — race-clean concurrent Dispatch test
remyluslosius added a commit that referenced this pull request May 29, 2026
…, 100%)

Closes Slice B.1 trunk. Compliance write-on-change persistence for the
Kensa-executor pipeline, complete with all 15 acceptance criteria.

Spec
  Promoted system-transaction-log-writer from draft → approved.
  15 ACs identical to PR #415's draft.

Migration 0011_host_compliance_schedule.sql
  Copy of B.1a's migration. Identical content; goose treats duplicate
  identical migrations as no-ops when both B.1a (#418) and this PR merge.

Migration 0012_transaction_log.sql
  - host_rule_state: ONE row per (host, rule). Current state, UPSERTed
    every Apply. Status CHECK constraint enforces the closed enum
    (pass/fail/skipped/error).
  - transactions: append-only state-change log. UNIQUE(scan_id, rule_id)
    enforces idempotency at the schema level (spec C-04).
  - Both tables FK to hosts(id) ON DELETE RESTRICT — historical
    findings outlive their host references (spec C-06).
  - Indexes: by (host_id, status) for current-fleet queries; by
    (host_id, rule_id, occurred_at DESC) for point-in-time temporal
    queries; by scan_id for idempotency check.

Audit events
  Added two new codes to events.yaml:
    finding.persisted    one per transactions row (spec AC-09)
    writer.apply.failed  per Apply-rollback (spec AC-15)
  Codegen produces audit.FindingPersisted and audit.WriterApplyFailed
  constants; events.gen.go grew from 96 → 98 events total.

internal/transactionlog package
  types.go   ApplyBatch, Result, Status / ChangeKind / FailureReason
             enums, sentinel errors, MaxEvidenceBytes (256 KiB cap).
  writer.go  Writer.Apply: single-tx-per-call. Steps:
               1. Validate every result (status, evidence size + shape).
                  Spec AC-08 / AC-14 reject BEFORE any INSERT — atomic.
               2. Idempotency: if any transactions row exists for the
                  scan_id, no-op (spec AC-05).
               3. BEGIN tx.
               4. Per result: read prior host_rule_state, decide
                  change_kind (first_seen / state_changed /
                  severity_changed / none), INSERT transactions only on
                  change, UPSERT host_rule_state with COALESCE-style
                  last_changed_at preservation.
               5. COMMIT.
               6. emit finding.persisted per state-change AFTER commit
                  (audit reflects what persisted, not what attempted).
             On any error: tx.Rollback + emit writer.apply.failed
             with classified reason (FK / deadlock / oversize / sqlc).

source_test.go (AC-12, AC-13)
  AC-12: walks every .go and .sql file under app/ asserting no
         scan_baselines / ScanBaseline references — the Python-era
         baselines table is explicitly dropped.
  AC-13: AST-parses every internal/transactionlog .go file asserting
         no database/sql import and no .Exec/.Query/.QueryRow whose
         SQL arg uses fmt.Sprintf or string concatenation.
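
A source-inspection check like AC-13 can be built on go/parser and go/ast. The sketch below is an approximation under stated assumptions: it works on a source string rather than walking a directory, and it flags any `+` expression passed to Exec/Query/QueryRow without type information, so it can over-report (for example on integer addition). The function names are invented for this example.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
	"strconv"
)

// violations parses one Go source file and reports database/sql imports
// plus Exec/Query/QueryRow calls with a dynamically built argument.
func violations(filename, src string) ([]string, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, filename, src, 0)
	if err != nil {
		return nil, err
	}
	var out []string
	for _, imp := range file.Imports {
		if path, _ := strconv.Unquote(imp.Path.Value); path == "database/sql" {
			out = append(out, filename+": imports database/sql")
		}
	}
	ast.Inspect(file, func(n ast.Node) bool {
		call, ok := n.(*ast.CallExpr)
		if !ok {
			return true
		}
		sel, ok := call.Fun.(*ast.SelectorExpr)
		if !ok {
			return true
		}
		switch sel.Sel.Name {
		case "Exec", "Query", "QueryRow":
		default:
			return true
		}
		for _, arg := range call.Args {
			if isDynamicSQL(arg) {
				out = append(out, fmt.Sprintf("%s: dynamic SQL passed to %s",
					fset.Position(call.Pos()), sel.Sel.Name))
			}
		}
		return true
	})
	return out, nil
}

// isDynamicSQL matches `a + b` and `fmt.Sprintf(...)` expressions.
func isDynamicSQL(e ast.Expr) bool {
	switch v := e.(type) {
	case *ast.BinaryExpr:
		return v.Op == token.ADD
	case *ast.CallExpr:
		if sel, ok := v.Fun.(*ast.SelectorExpr); ok {
			if pkg, ok := sel.X.(*ast.Ident); ok {
				return pkg.Name == "fmt" && sel.Sel.Name == "Sprintf"
			}
		}
	}
	return false
}

func main() {
	src := `package w
import "fmt"
func bad(db interface{ Exec(string) }, id string) {
	db.Exec(fmt.Sprintf("DELETE FROM t WHERE id = '%s'", id))
}`
	found, err := violations("bad.go", src)
	fmt.Println(found, err)
}
```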

writer_test.go (AC-01 through AC-11, AC-14, AC-15)
  16 sub-tests covering the writer behavior end-to-end against real
  Postgres:
    AC-01  pg_stat_database.xact_commit delta < 10 after 50-rule Apply
    AC-02  N first_seen rows on first scan
    AC-03  identical rescan = 0 new transactions, check_count++
    AC-04  one flip pass→fail = exactly 1 state_changed row
    AC-05  same scan_id replay = no-op
    AC-06  FK violation rolls back the whole batch (zero rows persist)
    AC-07  DELETE hosts with extant transactions fails (ON DELETE RESTRICT)
    AC-08  non-JSON-object evidence rejected (table-driven over 4 cases)
    AC-09  finding.persisted emission count = transactions row count
    AC-10  1000-rule Apply ≤ 2 seconds wall-clock
    AC-11  50 concurrent Apply calls against distinct hosts complete
    AC-14  oversize evidence rejected BEFORE INSERT; writer.apply.failed
           audit emitted with reason=evidence_oversize
    AC-15  FK violation emits writer.apply.failed with reason=fk_violation
           and detail.rule_count_attempted populated

Local validation
  go build ./internal/transactionlog/: clean
  go vet ./internal/transactionlog/: clean
  go test -race ./internal/transactionlog/ (unit + integration with
    real Postgres + migrations 0001-0012): 16 sub-tests pass
  specter coverage: system-transaction-log-writer 15/15 = 100%

Architectural choices worth flagging
  - Atomicity: validation phase runs BEFORE BEGIN, so oversize-evidence
    rejection is genuinely zero-INSERT (no rollback needed).
  - Post-commit audit emission: scan.completed / finding.persisted are
    held pending during the transaction and fire only AFTER tx.Commit
    succeeds, so the audit log reflects persisted state.
  - Evidence schema check: minimal "must be JSON object" gate today;
    full KensaEvidence-schema validation slots into validateResult when
    the OpenAPI components.schemas.KensaEvidence shape lands.

Slice B.1 trunk status
  B.1a scheduler             PR #418 — 15/15 ACs, ready for review
  B.1b kensa-executor        PR #419 — 16/16 ACs, ready for review
  B.1c transaction-log-writer this PR — 15/15 ACs

Total Slice B.1: 46 ACs covered across 3 specs. Ready to move on to
B.2 (liveness loop + drift detector) once these merge.
remyluslosius added a commit that referenced this pull request May 29, 2026
…, 100%)

Closes Slice B.1 trunk. Compliance write-on-change persistence for the
Kensa-executor pipeline, complete with all 15 acceptance criteria.

Spec
  Promoted system-transaction-log-writer from draft → approved.
  15 ACs identical to PR #415's draft.

Migration 0011_host_compliance_schedule.sql
  Copy of B.1a's migration. Identical content; goose treats duplicate
  identical migrations as no-ops when both B.1a (#418) and this PR merge.

Migration 0012_transaction_log.sql
  - host_rule_state: ONE row per (host, rule). Current state, UPSERTed
    every Apply. Status CHECK constraint enforces the closed enum
    (pass/fail/skipped/error).
  - transactions: append-only state-change log. UNIQUE(scan_id, rule_id)
    enforces idempotency at the schema level (spec C-04).
  - Both tables FK to hosts(id) ON DELETE RESTRICT — historical
    findings outlive their host references (spec C-06).
  - Indexes: by (host_id, status) for current-fleet queries; by
    (host_id, rule_id, occurred_at DESC) for point-in-time temporal
    queries; by scan_id for idempotency check.

Audit events
  Added two new codes to events.yaml:
    finding.persisted    one per transactions row (spec AC-09)
    writer.apply.failed  per Apply-rollback (spec AC-15)
  Codegen produces audit.FindingPersisted and audit.WriterApplyFailed
  constants; events.gen.go grew from 96 → 98 events total.

internal/transactionlog package
  types.go   ApplyBatch, Result, Status / ChangeKind / FailureReason
             enums, sentinel errors, MaxEvidenceBytes (256 KiB cap).
  writer.go  Writer.Apply: single-tx-per-call. Steps:
               1. Validate every result (status, evidence size + shape).
                  Spec AC-08 / AC-14 reject BEFORE any INSERT — atomic.
               2. Idempotency: if any transactions row exists for the
                  scan_id, no-op (spec AC-05).
               3. BEGIN tx.
               4. Per result: read prior host_rule_state, decide
                  change_kind (first_seen / state_changed /
                  severity_changed / none), INSERT transactions only on
                  change, UPSERT host_rule_state with COALESCE-style
                  last_changed_at preservation.
               5. COMMIT.
               6. emit finding.persisted per state-change AFTER commit
                  (audit reflects what persisted, not what attempted).
             On any error: tx.Rollback + emit writer.apply.failed
             with classified reason (FK / deadlock / oversize / sqlc).

source_test.go (AC-12, AC-13)
  AC-12: walks every .go and .sql file under app/ asserting no
         scan_baselines / ScanBaseline references — the Python-era
         baselines table is explicitly dropped.
  AC-13: AST-parses every internal/transactionlog .go file asserting
         no database/sql import and no .Exec/.Query/.QueryRow whose
         SQL arg uses fmt.Sprintf or string concatenation.

writer_test.go (AC-01 through AC-11, AC-14, AC-15)
  16 sub-tests covering the writer behavior end-to-end against real
  Postgres:
    AC-01  pg_stat_database.xact_commit delta < 10 after 50-rule Apply
    AC-02  N first_seen rows on first scan
    AC-03  identical rescan = 0 new transactions, check_count++
    AC-04  one flip pass→fail = exactly 1 state_changed row
    AC-05  same scan_id replay = no-op
    AC-06  FK violation rolls back the whole batch (zero rows persist)
    AC-07  DELETE hosts with extant transactions fails (ON DELETE RESTRICT)
    AC-08  non-JSON-object evidence rejected (table-driven over 4 cases)
    AC-09  finding.persisted emission count = transactions row count
    AC-10  1000-rule Apply ≤ 2 seconds wall-clock
    AC-11  50 concurrent Applys against distinct hosts complete
    AC-14  oversize evidence rejected BEFORE INSERT; writer.apply.failed
           audit emitted with reason=evidence_oversize
    AC-15  FK violation emits writer.apply.failed with reason=fk_violation
           and detail.rule_count_attempted populated

Local validation
  go build ./internal/transactionlog/: clean
  go vet ./internal/transactionlog/: clean
  go test -race ./internal/transactionlog/ (unit + integration with
    real Postgres + migrations 0001-0012): 16 sub-tests pass
  specter coverage: system-transaction-log-writer 15/15 = 100%

Architectural choices worth flagging
  - Atomicity: validation phase runs BEFORE BEGIN, so oversize-evidence
    rejection is genuinely zero-INSERT (no rollback needed).
  - Pre-commit pending audit emissions: scan.completed / finding.persisted
    fire only AFTER tx.Commit succeeds, so the audit log truly reflects
    persisted state.
  - Evidence schema check: minimal "must be JSON object" gate today;
    full KensaEvidence-schema validation slots into validateResult when
    the OpenAPI components.schemas.KensaEvidence shape lands.

Slice B.1 trunk status
  B.1a scheduler             PR #418 — 15/15 ACs, ready for review
  B.1b kensa-executor        PR #419 — 16/16 ACs, ready for review
  B.1c transaction-log-writer this PR — 15/15 ACs

Total Slice B.1: 46 ACs covered across 3 specs. Ready to move on to
B.2 (liveness loop + drift detector) once these merge.
remyluslosius added a commit that referenced this pull request May 29, 2026
…, 100%)

Closes Slice B.1 trunk. Compliance write-on-change persistence for the
Kensa-executor pipeline, complete with all 15 acceptance criteria.

Spec
  Promoted system-transaction-log-writer from draft → approved.
  15 ACs identical to PR #415's draft.

Migration 0011_host_compliance_schedule.sql
  Copy of B.1a's migration. Identical content; goose treats duplicate
  identical migrations as no-ops when both B.1a (#418) and this PR merge.

Migration 0012_transaction_log.sql
  - host_rule_state: ONE row per (host, rule). Current state, UPSERTed
    every Apply. Status CHECK constraint enforces the closed enum
    (pass/fail/skipped/error).
  - transactions: append-only state-change log. UNIQUE(scan_id, rule_id)
    enforces idempotency at the schema level (spec C-04).
  - Both tables FK to hosts(id) ON DELETE RESTRICT — historical
    findings outlive their host references (spec C-06).
  - Indexes: by (host_id, status) for current-fleet queries; by
    (host_id, rule_id, occurred_at DESC) for point-in-time temporal
    queries; by scan_id for idempotency check.

Audit events
  Added two new codes to events.yaml:
    finding.persisted    one per transactions row (spec AC-09)
    writer.apply.failed  per Apply-rollback (spec AC-15)
  Codegen produces audit.FindingPersisted and audit.WriterApplyFailed
  constants; events.gen.go grew from 96 → 98 events total.

internal/transactionlog package
  types.go   ApplyBatch, Result, Status / ChangeKind / FailureReason
             enums, sentinel errors, MaxEvidenceBytes (256 KiB cap).
  writer.go  Writer.Apply: single-tx-per-call. Steps:
               1. Validate every result (status, evidence size + shape).
                  Spec AC-08 / AC-14 reject BEFORE any INSERT — atomic.
               2. Idempotency: if any transactions row exists for the
                  scan_id, no-op (spec AC-05).
               3. BEGIN tx.
               4. Per result: read prior host_rule_state, decide
                  change_kind (first_seen / state_changed /
                  severity_changed / none), INSERT transactions only on
                  change, UPSERT host_rule_state with COALESCE-style
                  last_changed_at preservation.
               5. COMMIT.
               6. emit finding.persisted per state-change AFTER commit
                  (audit reflects what persisted, not what attempted).
             On any error: tx.Rollback + emit writer.apply.failed
             with classified reason (FK / deadlock / oversize / sqlc).

source_test.go (AC-12, AC-13)
  AC-12: walks every .go and .sql file under app/ asserting no
         scan_baselines / ScanBaseline references — the Python-era
         baselines table is explicitly dropped.
  AC-13: AST-parses every internal/transactionlog .go file asserting
         no database/sql import and no .Exec/.Query/.QueryRow whose
         SQL arg uses fmt.Sprintf or string concatenation.
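
The AC-13 check could be sketched roughly as below, using only go/parser
and go/ast. findViolations, isDynamicSQL and the sample source are
hypothetical; the real source_test.go may differ. As a simplification this
version inspects every argument of the call rather than only the SQL
argument, so it over-reports.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
	"strconv"
)

// sample is parsed only, never compiled or type-checked.
const sample = `package demo

import (
	"context"
	"fmt"
)

type execer interface {
	Exec(ctx context.Context, sql string, args ...any) error
}

func bad(ctx context.Context, db execer, table string) error {
	return db.Exec(ctx, fmt.Sprintf("DELETE FROM %s", table))
}

func good(ctx context.Context, db execer, id string) error {
	return db.Exec(ctx, "DELETE FROM hosts WHERE id = $1", id)
}
`

// isDynamicSQL reports whether e is a "+" expression or a fmt.Sprintf call.
func isDynamicSQL(e ast.Expr) bool {
	switch v := e.(type) {
	case *ast.BinaryExpr:
		return v.Op == token.ADD
	case *ast.CallExpr:
		sel, ok := v.Fun.(*ast.SelectorExpr)
		if !ok {
			return false
		}
		pkg, ok := sel.X.(*ast.Ident)
		return ok && pkg.Name == "fmt" && sel.Sel.Name == "Sprintf"
	}
	return false
}

// findViolations flags a database/sql import and any Exec/Query/QueryRow
// call that receives a dynamically built argument.
func findViolations(filename, src string) ([]string, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, filename, src, 0)
	if err != nil {
		return nil, err
	}
	var out []string
	for _, imp := range file.Imports {
		if path, err := strconv.Unquote(imp.Path.Value); err == nil && path == "database/sql" {
			out = append(out, "imports database/sql")
		}
	}
	ast.Inspect(file, func(n ast.Node) bool {
		call, ok := n.(*ast.CallExpr)
		if !ok {
			return true
		}
		sel, ok := call.Fun.(*ast.SelectorExpr)
		if !ok {
			return true
		}
		switch sel.Sel.Name {
		case "Exec", "Query", "QueryRow":
			for _, arg := range call.Args {
				if isDynamicSQL(arg) {
					out = append(out, sel.Sel.Name+" with dynamically built SQL")
				}
			}
		}
		return true
	})
	return out, nil
}

func main() {
	violations, err := findViolations("demo.go", sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(violations) // [Exec with dynamically built SQL]
}
```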

writer_test.go (AC-01 through AC-11, AC-14, AC-15)
  16 sub-tests covering the writer behavior end-to-end against real
  Postgres:
    AC-01  pg_stat_database.xact_commit delta < 10 after 50-rule Apply
    AC-02  N first_seen rows on first scan
    AC-03  identical rescan = 0 new transactions, check_count++
    AC-04  one flip pass→fail = exactly 1 state_changed row
    AC-05  same scan_id replay = no-op
    AC-06  FK violation rolls back the whole batch (zero rows persist)
    AC-07  DELETE hosts with extant transactions fails (ON DELETE RESTRICT)
    AC-08  non-JSON-object evidence rejected (table-driven over 4 cases)
    AC-09  finding.persisted emission count = transactions row count
    AC-10  1000-rule Apply ≤ 2 seconds wall-clock
    AC-11  50 concurrent Apply calls against distinct hosts all complete
    AC-14  oversize evidence rejected BEFORE INSERT; writer.apply.failed
           audit emitted with reason=evidence_oversize
    AC-15  FK violation emits writer.apply.failed with reason=fk_violation
           and detail.rule_count_attempted populated
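
For AC-08 and AC-14, a table-driven validation sketch might look like the
following. validateEvidence and the two error values are hypothetical
names, and the rejected inputs are illustrative: the four cases used by the
real test are not listed here. The 256 KiB cap comes from MaxEvidenceBytes;
treating exactly 256 KiB as acceptable is an assumption.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// MaxEvidenceBytes is the 256 KiB cap described for types.go.
const MaxEvidenceBytes = 256 * 1024

var (
	ErrEvidenceOversize  = errors.New("evidence exceeds size cap")
	ErrEvidenceNotObject = errors.New("evidence is not a JSON object")
)

// validateEvidence runs before any transaction is opened. It checks the
// size first, then requires the payload to decode as a JSON object.
func validateEvidence(raw []byte) error {
	if len(raw) > MaxEvidenceBytes {
		return ErrEvidenceOversize
	}
	var obj map[string]json.RawMessage
	// A JSON null decodes without error but leaves the map nil.
	if err := json.Unmarshal(raw, &obj); err != nil || obj == nil {
		return ErrEvidenceNotObject
	}
	return nil
}

func main() {
	cases := []struct {
		name string
		raw  []byte
		want error
	}{
		{"object", []byte(`{"cmd":"sshd -T"}`), nil},
		{"array", []byte(`[1,2]`), ErrEvidenceNotObject},
		{"string", []byte(`"text"`), ErrEvidenceNotObject},
		{"number", []byte(`42`), ErrEvidenceNotObject},
		{"null", []byte(`null`), ErrEvidenceNotObject},
		{"oversize", make([]byte, MaxEvidenceBytes+1), ErrEvidenceOversize},
	}
	for _, tc := range cases {
		got := validateEvidence(tc.raw)
		fmt.Printf("%-8s match=%v\n", tc.name, got == tc.want)
	}
}
```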

Local validation
  go build ./internal/transactionlog/: clean
  go vet ./internal/transactionlog/: clean
  go test -race ./internal/transactionlog/ (unit + integration with
    real Postgres + migrations 0001-0012): 16 sub-tests pass
  specter coverage: system-transaction-log-writer 15/15 = 100%

Architectural choices worth flagging
  - Atomicity: validation phase runs BEFORE BEGIN, so oversize-evidence
    rejection is genuinely zero-INSERT (no rollback needed).
  - Post-commit audit emissions: scan.completed / finding.persisted are
    held as pending during the tx and fire only AFTER tx.Commit succeeds,
    so the audit log reflects persisted state, not attempted writes.
  - Evidence schema check: minimal "must be JSON object" gate today;
    full KensaEvidence-schema validation slots into validateResult when
    the OpenAPI components.schemas.KensaEvidence shape lands.

Slice B.1 trunk status
  B.1a scheduler               PR #418 — 15/15 ACs, ready for review
  B.1b kensa-executor          PR #419 — 16/16 ACs, ready for review
  B.1c transaction-log-writer  this PR — 15/15 ACs

Total Slice B.1: 46 ACs covered across 3 specs. Ready to move on to
B.2 (liveness loop + drift detector) once these merge.
remyluslosius added a commit that referenced this pull request May 29, 2026
…, 100%)

remyluslosius added a commit that referenced this pull request May 29, 2026
…, 100%) (#420)

* feat(transactionlog): B.1c — system-transaction-log-writer (15/15 ACs, 100%)

* fix(transactionlog): make lint clean — remove dead helper + empty branch + gofmt