Refactor brittle Watcher↔Archiver "mirror discipline" for shared content-acquisition code #159

@gregoryfoster

Context

Watcher and Archiver share copies of several content-acquisition modules:

  • src/core/fetchers/{base,http}.py
  • src/core/extractors/{base,html,csv_excel,pdf}.py
  • src/core/simhash.py
  • src/core/extraction_defaults.py
  • src/core/logging.py

Per the discipline documented in AGENTS.md ("Mirrored content-acquisition code"), any change in one repo must be manually mirrored to the other:

Watcher and Archiver share copies of [the above files]. When changing any of these here, mirror the change to /home/exedev/archiver/src/core/. Notifier-style discipline; revisit when fingerprint parity becomes load-bearing (i.e., when Replicator joins the consumer set).

The original problem

Both services need:

  • Fetchers — to retrieve content from arbitrary URLs (HTTP, file://, eventually cloud storage).
  • Extractors — to convert raw HTML/PDF/CSV into normalized text chunks.
  • Fingerprinting — SHA-256 + simhash on extracted content for change detection (Watcher) and dedup/canonical addressing (Archiver).
  • Default extraction config — translate an InfoSource source_spec into runtime args.
  • Logging — JSON-structured logs with consistent field names.
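The fingerprinting requirement above pairs an exact digest with a similarity hash. A minimal sketch of that pairing follows, assuming word-level tokens and a 64-bit simhash; the actual tokenization and bit width in src/core/simhash.py are not stated in this issue, and the function names here are illustrative only.

```python
import hashlib
import re


def sha256_hex(text: str) -> str:
    """Exact-match fingerprint: any byte-level change produces a new digest."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def simhash64(text: str) -> int:
    """64-bit simhash over lowercase word tokens (tokenization is an assumption)."""
    weights = [0] * 64
    for token in re.findall(r"\w+", text.lower()):
        # Use a stable per-token hash; Python's built-in hash() is salted per process.
        h = int.from_bytes(hashlib.sha256(token.encode("utf-8")).digest()[:8], "big")
        for bit in range(64):
            weights[bit] += 1 if (h >> bit) & 1 else -1
    # A bit is set in the result when more tokens voted 1 than 0 at that position.
    return sum(1 << bit for bit in range(64) if weights[bit] > 0)


def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits between two simhash values."""
    return bin(a ^ b).count("1")
```

The SHA-256 digest answers "is this byte-identical?", while a small Hamming distance between two simhash values suggests near-duplicate content. Both are only comparable across services if every step before hashing (extraction, normalization, tokenization) is identical, which is why the copies have to stay in lockstep.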

When Archiver was extracted from Watcher (#149), it wasn't obvious where these should live. Both repos' content-acquisition paths need to produce bit-identical fingerprints, so the code must stay in lockstep. The chosen pragmatic answer: copy + AGENTS.md note + reviewer vigilance.

Why this approach is brittle

  • Drift is silent. Phase 5 CR round-3 finding 17 surfaced one example: src/core/extraction_defaults.py docstrings were updated in Watcher (#156, Phase 5: refactor Watcher to produce SourceRevisions in Archiver (v2 cutover), commit 23b8ade) without being mirrored to Archiver. No tooling caught it; only a careful CR pass did.
  • Reviewer load doesn't scale. Every PR touching these files imposes a manual sync step. As Replicator (Phase 6+) joins the consumer set, this becomes three repos × O(n) PRs.
  • Fingerprint parity is the load-bearing concern. AGENTS.md acknowledges this: "revisit when fingerprint parity becomes load-bearing." Replicator's hash-verify-on-re-fetch path (#156, Phase 5 design, Section 2, step 5) MUST produce the same fingerprint that Watcher recorded. Silent drift here means data corruption.
  • No type-check across copies. Each repo's CI checks its own copy. If Watcher renames a function, Archiver's copy doesn't get a red CI signal.
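To make the fingerprint-parity risk concrete, here is a small sketch of how an unmirrored edit to a normalization step would break a hash-verify check. The two normalizers and the `verify_refetch` helper are hypothetical stand-ins, not code from either repo.

```python
import hashlib
from typing import Callable


def normalize_original(raw: str) -> str:
    """Hypothetical extractor step: collapse runs of whitespace."""
    return " ".join(raw.split())


def normalize_drifted(raw: str) -> str:
    """The same step after an unmirrored edit that also strips trailing periods."""
    return " ".join(raw.split()).rstrip(".")


def fingerprint(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def verify_refetch(recorded: str, raw: str, normalize: Callable[[str], str]) -> bool:
    """Hash-verify: recompute the fingerprint and compare it with the recorded one."""
    return fingerprint(normalize(raw)) == recorded


raw = "Example  page \n body."
recorded = fingerprint(normalize_original(raw))  # what the first service stored

same_code_ok = verify_refetch(recorded, raw, normalize_original)
drifted_code_ok = verify_refetch(recorded, raw, normalize_drifted)
```

With identical code the check passes; with the drifted copy it fails even though the fetched bytes are unchanged, so the verifier would wrongly conclude the content changed or was corrupted.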

Recommended alternatives

Option A (preferred): Extract a shared Python package

Lift the mirrored modules into a new sibling repo cannobserv-content-acquisition (or similar):

/home/exedev/cannobserv-content-acquisition/
├── pyproject.toml
└── src/cannobserv_content/
    ├── fetchers/
    ├── extractors/
    ├── simhash.py
    ├── extraction_defaults.py
    └── logging.py

Both Watcher and Archiver pin it as a path-editable dependency (same pattern as archiver-client):

cannobserv-content = { path = "/home/exedev/cannobserv-content-acquisition", editable = true, version = ">=1.0.0,<2" }

Imports become from cannobserv_content.extractors import HtmlExtractor in both repos.

Pros: real Python module boundary; one canonical copy; CI signals drift via import errors; one PR per change; version-pinned for upgrades; same pattern operators already understand from archiver-client.

Cons: new repo to set up; needs its own test infra; bump cycle for breaking changes (but that's a feature, not a bug — it forces both consumers to deliberately upgrade).
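As one possible complement to the pin, a consumer could assert at startup that the installed shared package falls inside the pinned range. The sketch below assumes the distribution is named cannobserv-content (as in the pin above) and that versions are plain X.Y.Z strings; the helper names are hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Tuple


def parse_release(v: str) -> Tuple[int, ...]:
    """Parse a plain 'X.Y.Z' release string; pre-release suffixes are not handled."""
    return tuple(int(part) for part in v.split("."))


def in_supported_range(v: str) -> bool:
    """True when v satisfies >=1.0.0,<2 (the range used in the pin)."""
    return (1, 0, 0) <= parse_release(v) < (2,)


def check_shared_package(dist_name: str = "cannobserv-content") -> str:
    """Return a short status string for the installed shared package."""
    try:
        installed = version(dist_name)
    except PackageNotFoundError:
        return "missing"
    return "ok" if in_supported_range(installed) else f"unsupported: {installed}"
```

A service entry point might log the result of `check_shared_package()` and refuse to start on anything other than "ok", so that a consumer running against an unexpected major version fails loudly rather than producing mismatched fingerprints.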

Option B: Git submodule

Make src/core/content_acquisition/ a submodule pointing to the shared repo. Both Watcher and Archiver consume the same checkout.

Pros: no PyPI / wheel build; reuses existing skills-submodule pattern.
Cons: submodule UX is famously thorny (every clone needs --recurse-submodules; CI must run git submodule update --init); the pin is a raw commit SHA rather than a semantic version range, so deliberate upgrades are harder to express and review.

Option C: Pre-commit + CI drift check (keep current pattern, add automation)

Add a scripts/check_mirror_parity.sh to both repos that diffs the mirrored files against the sibling repo and fails the commit/PR if they differ.

Pros: minimal disruption; no new infra.
Cons: still error-prone in spirit — relies on the operator/developer running both repos locally; CI needs cross-repo checkout to compare; doesn't solve the fingerprint-parity blast radius problem.
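The drift check in Option C is proposed as a shell script; the same idea is sketched here in Python for illustration. It assumes both repos are checked out locally and uses the mirrored paths listed at the top of this issue; the function names are hypothetical.

```python
import hashlib
from pathlib import Path
from typing import List, Optional, Sequence

# Mirrored paths taken from the list in this issue, relative to each repo root.
MIRRORED = [
    "src/core/fetchers/base.py",
    "src/core/fetchers/http.py",
    "src/core/extractors/base.py",
    "src/core/extractors/html.py",
    "src/core/extractors/csv_excel.py",
    "src/core/extractors/pdf.py",
    "src/core/simhash.py",
    "src/core/extraction_defaults.py",
    "src/core/logging.py",
]


def file_digest(path: Path) -> Optional[str]:
    """SHA-256 of a file's bytes, or None when the file is absent."""
    return hashlib.sha256(path.read_bytes()).hexdigest() if path.is_file() else None


def find_drift(repo_a: Path, repo_b: Path,
               rel_paths: Sequence[str] = MIRRORED) -> List[str]:
    """Return the relative paths whose contents differ between the two repos.

    A file present on only one side counts as drift; a file missing from
    both sides is not reported.
    """
    return [
        rel for rel in rel_paths
        if file_digest(repo_a / rel) != file_digest(repo_b / rel)
    ]
```

A pre-commit hook or CI step could call `find_drift(Path("."), Path("/home/exedev/archiver"))` and exit non-zero when the list is non-empty. This still depends on the sibling checkout being present, which is the cross-repo limitation noted in the cons above.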

Recommendation

Option A. It mirrors the existing archiver-client pattern that operators and agents already understand, leaves a single canonical copy so there is nothing to drift, and prepares for Replicator's hash-verify path. Defer the work until Replicator is being stood up (Phase 6): the urgency comes from fingerprint parity, and drift between Watcher and Archiver alone is tolerable today.

Scope of this issue

  • Research + design doc only. Do not implement here.
  • Read the trajectory doc at docs/research/2026-05-06-archiver-information-model.md and the Phase 4/5 plans in /home/exedev/archiver/docs/plans/ to understand what fingerprint parity guarantees Replicator will need.
  • Land the design as docs/plans/<date>-shared-content-acquisition-package.md.
  • File follow-up implementation issues when the design is approved.

Related
