docs(slice-c): plan host intelligence as the slice after Slice B #417

Draft: remyluslosius wants to merge 1 commit into main from docs/slice-c-plan

@remyluslosius
Contributor

Summary

Planning doc for Slice C — Host Intelligence, the slice after Slice B ships. Closes the visibility gap identified during Slice B B.1c planning: today we can answer "what's the compliance score" but not "which hosts have package X installed", which is required for asset management (ISO 27001 A.8, FedRAMP CM-8) and vulnerability correlation.

Companion to the boundary doc and stage_2_slice_a.md. Project-committed (not gitignored).

Locked design decisions

| Decision | Rationale |
| --- | --- |
| OpenWatch-owned, not a Kensa extension | Boundary doc § 5.2 keeps Kensa pure-compliance |
| Separate scheduled probe, not piggybacking on Kensa scans | Decoupled cadence; failure-independent |
| Storage: write-on-change for state facts, snapshot-based for metrics | Write-on-change is the wrong model for time-series; articulated during B.1c planning |
| Reuses Slice B trunk unchanged | Same scheduler / queue / credential / SSH discipline |
| No on-host agent | SSH + stock OS utilities (`rpm`, `dpkg`, `systemctl`, `ss`, `ip`, `nft`, `getent`, `/proc`, `/sys`) |
| Privacy-first collector design | Explicit per-collector allowlist; no shell history, no `/proc/<pid>/cmdline`, no log forwarding |
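
To make the storage decision concrete, here is a minimal Python sketch of the two write paths. Everything in it is an assumption for illustration: the class names, the in-memory lists standing in for the `host_facts` and `host_metrics` tables, and the hash-based change detection are hypothetical and not taken from the planning doc.

```python
import hashlib
import json
from dataclasses import dataclass, field


def fingerprint(value) -> str:
    """Stable hash of a JSON-serializable fact value (hypothetical change detector)."""
    canonical = json.dumps(value, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


@dataclass
class FactStore:
    """Write-on-change: a row is appended only when the observed value differs."""
    rows: list = field(default_factory=list)    # stand-in for host_facts
    latest: dict = field(default_factory=dict)  # last fingerprint per (host, fact type)

    def record(self, host: str, fact_type: str, value, observed_at: int) -> bool:
        key = (host, fact_type)
        digest = fingerprint(value)
        if self.latest.get(key) == digest:
            return False  # unchanged since the last probe: nothing is written
        self.latest[key] = digest
        self.rows.append((host, fact_type, value, observed_at))
        return True


@dataclass
class MetricStore:
    """Snapshot-based: every sample is appended, whether or not it changed."""
    rows: list = field(default_factory=list)    # stand-in for host_metrics

    def record(self, host: str, name: str, value: float, sampled_at: int) -> None:
        self.rows.append((host, name, value, sampled_at))
```

Recording the same package set twice through `FactStore.record` writes one row, while two identical load samples through `MetricStore.record` write two. That is the reason for the split: a flat stretch in a time series is still data, whereas an unchanged package list is not.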

Sub-slices

| Wave | Components |
| --- | --- |
| C.1: probe trunk | runner + host_facts writer + host_metrics writer |
| C.2: core state collectors | packages, services, users, hardware |
| C.3: network + metrics | network interfaces / routes / firewall + metrics sampler |
| C.4: visibility surface | read API, fleet rollups, vulnerability-correlation queries |
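
The C.4 vulnerability-correlation queries come down to questions like "which hosts have package X installed". A minimal sketch of that lookup, assuming a hypothetical in-memory shape for the latest package facts (host name mapped to a package-to-version dict); the real read API and schema are not defined in this PR.

```python
from typing import Dict, List, Optional, Set


def hosts_with_package(
    latest_packages: Dict[str, Dict[str, str]],
    name: str,
    versions: Optional[Set[str]] = None,
) -> List[str]:
    """Return sorted host names that have `name` installed.

    If `versions` is given, only hosts whose installed version string is in
    that set match. This is an exact string match; real version-range
    comparison is out of scope for the sketch.
    """
    matches = []
    for host, packages in latest_packages.items():
        installed = packages.get(name)
        if installed is None:
            continue
        if versions is None or installed in versions:
            matches.append(host)
    return sorted(matches)


# Illustrative fleet data, not real inventory.
fleet = {
    "web-01": {"openssl": "3.0.7", "nginx": "1.24.0"},
    "web-02": {"openssl": "3.0.13"},
    "db-01": {"postgresql": "15.4"},
}
print(hosts_with_package(fleet, "openssl"))             # ['web-01', 'web-02']
print(hosts_with_package(fleet, "openssl", {"3.0.7"}))  # ['web-01']
```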

14 specs planned, ~150 ACs total. Comparable to Slice A's footprint.

Sequencing

Slice C cannot start until Slice B (B.1 through B.4) ships. Estimated 6-8 weeks once B is complete.

What's in the doc (~380 lines)

  1. Why this slice exists (with framework citations: ISO 27001 A.8, FedRAMP CM-8, CMMC CA.L2-3.12.4, NIST SP 800-53 CM-8)
  2. Six locked design decisions with rationale
  3. Data model: host_facts, host_fact_state, host_metrics, intel_probes tables
  4. Sub-slices (waves) and their order
  5. Keep/change/drop audit against the Python system_info/ package
  6. Spec inventory (the 14 planned specs)
  7. OpenAPI surface preview (16 new endpoints)
  8. Privacy and security (collector charter, what we do NOT collect, audit emission)
  9. Performance budget (≤ 16s wall-clock per probe; ~1 RPC/sec sustained for 1000 hosts)
  10. Out of scope (auditd forwarding, process monitoring, network traffic analysis, configuration management)
  11. Six open questions for resolution before C.1 specs land
  12. Slice B entry criteria
  13. What "Slice C done" means concretely
  14. Trade-off note: why two storage shapes (state facts vs metrics)
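
The collector charter in item 8 rests on an explicit per-collector allowlist. One way such a check could look, as a sketch only: the collector names, the command assignments, and the deny patterns below are illustrative assumptions built from the utilities and exclusions this PR mentions, not the doc's actual charter.

```python
import re
import shlex

# Hypothetical allowlist: collector name -> commands (argv[0]) it may run.
ALLOWED_COMMANDS = {
    "packages": {"rpm", "dpkg"},
    "services": {"systemctl"},
    "network": {"ss", "ip", "nft"},
    "users": {"getent"},
}

# Arguments refused for every collector, mirroring the PR's stated exclusions.
DENIED_ARG_PATTERNS = [
    re.compile(r"^/proc/[^/]+/cmdline$"),     # per-process command lines
    re.compile(r"(^|/)\.[a-z]*sh_history$"),  # shell history files
]


def is_allowed(collector: str, command_line: str) -> bool:
    """Return True only if the command is allowlisted and names no denied path."""
    argv = shlex.split(command_line)
    if not argv or argv[0] not in ALLOWED_COMMANDS.get(collector, set()):
        return False
    return not any(
        pattern.search(arg) for pattern in DENIED_ARG_PATTERNS for arg in argv[1:]
    )
```

Under these assumptions, `is_allowed("packages", "rpm -qa")` passes, while a `packages` collector asking for `systemctl` or any collector naming a shell history file is refused. The default is deny: an unknown collector or command never runs.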

Open questions surfaced at the bottom

  1. Probe runner concurrency — does it share Slice B's per-host guard?
  2. Probe cadence policy shape — global / per-fact-type / per-host?
  3. Retention defaults — 90d / 1y for facts / metrics?
  4. Backoff after collection failure — share or separate from scan backoff?
  5. Idempotency for manual refresh — debounce window?
  6. Migration of Python-era data — backfill or start fresh?

(Plus one item not yet a question: how intel_schedule policy structurally relates to schedules policy. Worth a § 4 followup once you've reviewed.)

What this PR is NOT

  • Not specs yet (those land per-wave once Slice B is in flight or done)
  • Not implementation (waits for Slice B to ship)
  • Not a roadmap update — happy to follow up if you want openwatch_roadmap.md to reference this

Test plan

  • Doc renders cleanly in GitHub markdown
  • Cross-references resolve (docs/KENSA_OPENWATCH_BOUNDARY.md, stage_2_slice_a.md)
  • User review for content + open questions

Slice C scope: collect package, service, user, network, hardware, and
metrics state from hosts via SSH so OpenWatch can answer asset-management
and vulnerability-correlation queries. Closes the visibility gap
identified during Slice B planning: today (after Slice A+B) we can answer
"what's the compliance score" and "which rules failed", but not "which
hosts have package X installed".

Architectural decisions locked in this doc:

  - OpenWatch-owned, NOT a Kensa extension (boundary doc § 5.2 keeps Kensa
    pure-compliance)
  - Separate scheduled probe, NOT piggybacking on Kensa scans (decoupled
    cadence; failure-independent)
  - Storage: write-on-change for state facts (host_facts), snapshot-based
    for continuous metrics (host_metrics). Write-on-change is the wrong
    model for time-series; this split was articulated during B.1c planning
  - Reuses Slice B trunk unchanged: scheduler dispatches a new
    intel_probe job type alongside scan jobs; same SKIP LOCKED, same
    HMAC payload, same credential resolver, same SSH known_hosts policy
  - No on-host agent (SSH + stock OS utilities only)
  - Privacy-first collector design: explicit allowlist of files / commands
    per collector; no shell history, no /proc/*/cmdline, no log forwarding

Sub-slices (waves):

  C.1  probe trunk: runner + host_facts writer + host_metrics writer
  C.2  core state collectors: packages, services, users, hardware
  C.3  network + metrics sampler
  C.4  read API and fleet rollup queries

14 specs planned across all four waves, ~150 ACs (comparable to Slice A's
spec footprint).

Sequencing: Slice C cannot start until Slice B (B.1 through B.4) ships.
Estimated effort: 6-8 weeks once B is complete.

Six open questions surface at the bottom of the doc for resolution before
C.1 trunk specs land.