Context
Watcher and Archiver share copies of several content-acquisition modules:
src/core/fetchers/{base,http}.py
src/core/extractors/{base,html,csv_excel,pdf}.py
src/core/simhash.py
src/core/extraction_defaults.py
src/core/logging.py
Per the discipline documented in AGENTS.md ("Mirrored content-acquisition code"), any change in one repo must be manually mirrored to the other:
Watcher and Archiver share copies of [the above files]. When changing any of these here, mirror the change to /home/exedev/archiver/src/core/. Notifier-style discipline; revisit when fingerprint parity becomes load-bearing (i.e., when Replicator joins the consumer set).
The original problem
Both services need:
Fetchers — to retrieve content from arbitrary URLs (HTTP, file://, eventually cloud storage).
Extractors — to convert raw HTML/PDF/CSV into normalized text chunks.
Fingerprinting — SHA-256 + simhash on extracted content for change detection (Watcher) and dedup/canonical addressing (Archiver).
Default extraction config — translate an InfoSource source_spec into runtime args.
Logging — JSON-structured logs with consistent field names.
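To make the fingerprinting requirement concrete, here is a minimal sketch of a SHA-256 plus simhash fingerprint over extracted text. It is not the code in src/core/simhash.py: the function names, the lowercased whitespace tokenization, the 64-bit width, and the use of BLAKE2b for per-token hashes are all assumptions made for illustration.

```python
import hashlib


def sha256_hex(text: str) -> str:
    """Exact-match fingerprint: any byte-level change yields a new digest."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def simhash64(text: str) -> int:
    """Near-duplicate fingerprint: similar texts tend to differ in few bits.

    Hypothetical tokenization (lowercased whitespace split); the real
    module may shingle or weight tokens differently.
    """
    weights = [0] * 64
    for token in text.lower().split():
        digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        # Each token votes +1 or -1 on every bit position.
        for bit in range(64):
            weights[bit] += 1 if (h >> bit) & 1 else -1
    # A bit is set in the result when the positive votes win.
    return sum(1 << bit for bit in range(64) if weights[bit] > 0)


def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits between two simhash values."""
    return bin(a ^ b).count("1")
```

Under these assumptions, any difference in tokenization, normalization, or hash choice between two copies changes the simhash for the same input, which is why the copies have to stay in lockstep.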
When Archiver was extracted from Watcher (#149), it wasn't obvious where these should live. Both repos' content-acquisition paths need to produce bit-identical fingerprints, so the code must stay in lockstep. The chosen pragmatic answer: copy + AGENTS.md note + reviewer vigilance.
Why this approach is brittle
Drift has already happened. src/core/extraction_defaults.py docstrings were updated in Watcher (#156, "Phase 5: refactor Watcher to produce SourceRevisions in Archiver (v2 cutover)", commit 23b8ade) without being mirrored to Archiver. No tooling caught it; only a careful CR pass did.
Reviewer load doesn't scale. Every PR touching these files imposes a manual sync step. Once Replicator (Phase 6+) joins the consumer set, that becomes three repos × O(n) PRs.
Fingerprint parity is the load-bearing concern. AGENTS.md acknowledges this: "revisit when fingerprint parity becomes load-bearing." Replicator's hash-verify-on-re-fetch path (#156, Phase 5 design, Section 2, step 5) MUST produce the same fingerprint Watcher recorded. Silent drift here = data corruption.
No type-check across copies. Each repo's CI checks only its own copy. If Watcher renames a function, Archiver's copy doesn't get a red CI signal.
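One way to surface silent fingerprint drift is a golden-vector test that every consumer repo runs against its own copy. The sketch below is an assumption about how that could look, not an existing test: fingerprint() stands in for whatever function the real pipeline exposes, find_drift is a hypothetical helper, and the two vectors are the standard SHA-256 digests of the empty string and of "abc".

```python
import hashlib

# Golden vectors: (input text, expected SHA-256 hex digest).
# In practice this table would be one file shared verbatim by every repo.
GOLDEN_VECTORS = [
    ("", "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"),
    ("abc", "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad"),
]


def fingerprint(text: str) -> str:
    """Hypothetical stand-in for the repo's real fingerprint function."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def find_drift(fp=fingerprint, vectors=GOLDEN_VECTORS):
    """Return the inputs whose fingerprint no longer matches the recorded value."""
    return [text for text, expected in vectors if fp(text) != expected]
```

A CI job in each repo could fail whenever find_drift() returns a non-empty list, so a change to normalization or hashing in one copy shows up as a red build rather than as mismatched fingerprints later.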
Recommended alternatives
Option A (preferred): Extract a shared Python package
Lift the mirrored modules into a new sibling repo, cannobserv-content-acquisition (or similar). Both Watcher and Archiver pin it as a path-editable dependency (same pattern as archiver-client).
Imports become from cannobserv_content.extractors import HtmlExtractor in both repos.
Pros: real Python module boundary; one canonical copy; CI signals drift via import errors; one PR per change; version-pinned for upgrades; same pattern operators already understand from archiver-client.
Cons: new repo to set up; needs its own test infra; bump cycle for breaking changes (but that's a feature, not a bug — it forces both consumers to deliberately upgrade).
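The "CI signals drift via import errors" point can be made explicit with a small smoke test in each consumer. This is a sketch under assumptions: missing_symbols is a hypothetical helper, and the package and symbol names mentioned in the note after the code come from the import line above, not from an existing package.

```python
import importlib


def missing_symbols(module_name: str, names: list[str]) -> list[str]:
    """Return the names that the module fails to export.

    Raises ImportError if the module itself cannot be imported, which is
    the red CI signal the copy-based approach lacks.
    """
    module = importlib.import_module(module_name)
    return [name for name in names if not hasattr(module, name)]
```

A consumer test might then assert that missing_symbols("cannobserv_content.extractors", ["HtmlExtractor"]) returns an empty list, so a rename in the shared package breaks the consumer's build as soon as the pin is bumped.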
Option B: Git submodule
Make src/core/content_acquisition/ a submodule pointing to the shared repo. Both Watcher and Archiver consume the same checkout.
Pros: no PyPI / wheel build; reuses existing skills-submodule pattern.
Cons: submodule UX is famously thorny (every clone needs --recursive; CI must run git submodule update); the pin is a bare commit SHA rather than a named version, so deliberate upgrades are harder to express and review.
Option C: Pre-commit + CI drift check (keep current pattern, add automation)
Add a scripts/check_mirror_parity.sh to both repos that diffs the mirrored files against the sibling repo and fails the commit/PR if they differ.
Pros: minimal disruption; no new infra.
Cons: still error-prone in spirit, since it relies on the operator/developer having both repos checked out locally; CI needs a cross-repo checkout to compare; doesn't solve the fingerprint-parity blast-radius problem.
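The drift check in Option C could look roughly like the following. The issue proposes a shell script; this sketch uses Python instead to stay in the codebase's language. The function names and the idea of passing both repo roots as arguments are assumptions; the file list is the one given at the top of this issue.

```python
import hashlib
from pathlib import Path

# Mirrored paths, relative to each repo root (from the list in this issue).
MIRRORED_FILES = (
    "src/core/fetchers/base.py",
    "src/core/fetchers/http.py",
    "src/core/extractors/base.py",
    "src/core/extractors/html.py",
    "src/core/extractors/csv_excel.py",
    "src/core/extractors/pdf.py",
    "src/core/simhash.py",
    "src/core/extraction_defaults.py",
    "src/core/logging.py",
)


def file_digest(path: Path):
    """SHA-256 of the file's bytes, or None if the file is missing."""
    if not path.is_file():
        return None
    return hashlib.sha256(path.read_bytes()).hexdigest()


def drifted_files(repo_a: Path, repo_b: Path, files=MIRRORED_FILES):
    """Return the relative paths whose contents differ between the two repos.

    A file missing from one repo counts as drift; a file missing from
    both compares equal and is not reported.
    """
    return [
        rel for rel in files
        if file_digest(repo_a / rel) != file_digest(repo_b / rel)
    ]
```

A pre-commit hook or CI step could call drifted_files with the two checkout paths and exit non-zero when the result is non-empty. As the cons above note, this still requires both checkouts to be present wherever the check runs.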
Recommendation
Option A. It mirrors the existing archiver-client pattern operators + agents already understand, makes drift impossible at import time, and prepares for Replicator's hash-verify path. Defer until Replicator is being stood up (Phase 6) — the urgency is fingerprint parity, and Watcher↔Archiver alone has tolerable drift today.
Scope of this issue
Research + design doc only. Do not implement here.
Read the trajectory doc at docs/research/2026-05-06-archiver-information-model.md and the Phase 4/5 plans in /home/exedev/archiver/docs/plans/ to understand what fingerprint parity guarantees Replicator will need.
Land the design as docs/plans/<date>-shared-content-acquisition-package.md.
File follow-up implementation issues when the design is approved.
Related
Mirror discipline lives in AGENTS.md:90 (Watcher) and the equivalent section in Archiver's AGENTS.md.