Refactor brittle Watcher↔Archiver "mirror discipline" for shared content-acquisition code #159

## Context

Watcher and Archiver share copies of several content-acquisition modules:

- `src/core/fetchers/{base,http}.py`
- `src/core/extractors/{base,html,csv_excel,pdf}.py`
- `src/core/simhash.py`
- `src/core/extraction_defaults.py`
- `src/core/logging.py`

Per the discipline documented in `AGENTS.md` ("Mirrored content-acquisition code"), any change in one repo must be manually mirrored to the other:

> Watcher and Archiver share copies of [the above files]. When changing any of these here, mirror the change to `/home/exedev/archiver/src/core/`. Notifier-style discipline; revisit when fingerprint parity becomes load-bearing (i.e., when Replicator joins the consumer set).

## The original problem

Both services need:
- **Fetchers** — to retrieve content from arbitrary URLs (HTTP, file://, eventually cloud storage).
- **Extractors** — to convert raw HTML/PDF/CSV into normalized text chunks.
- **Fingerprinting** — SHA-256 + simhash on extracted content for change detection (Watcher) and dedup/canonical addressing (Archiver).
- **Default extraction config** — translate an InfoSource `source_spec` into runtime args.
- **Logging** — JSON-structured logs with consistent field names.

When Archiver was extracted from Watcher (#149), it wasn't obvious where these modules should live. Both repos' content-acquisition paths need to produce **bit-identical** fingerprints, so the code must stay in lockstep. The pragmatic answer chosen at the time: copy the files, add a note to AGENTS.md, and rely on reviewer vigilance.
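To make the fingerprinting requirement concrete, here is a minimal sketch of a SHA-256 plus simhash pair. It is not the code in `src/core/simhash.py`: the tokenizer, the 64-bit width, the use of BLAKE2b as the per-token hash, and all function names are assumptions chosen for illustration.

```python
import hashlib
import re


def _tokens(text: str) -> list[str]:
    # Assumed tokenizer: lowercased word tokens. The real module may differ.
    return re.findall(r"\w+", text.lower())


def sha256_hex(text: str) -> str:
    """Exact fingerprint: changes if any byte of the UTF-8 encoding changes."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def simhash64(text: str) -> int:
    """Near-duplicate fingerprint: each token votes on each of 64 bits."""
    weights = [0] * 64
    for token in _tokens(text):
        digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        for bit in range(64):
            weights[bit] += 1 if (h >> bit) & 1 else -1
    # A bit is set in the result when more tokens voted 1 than 0.
    return sum(1 << bit for bit in range(64) if weights[bit] > 0)


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two simhash values."""
    return bin(a ^ b).count("1")
```

Both functions are pure, so two services produce the same values only as long as every step before them (fetching, extraction, normalization, tokenization) is also identical. That is the lockstep requirement in practice.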

## Why this approach is brittle

- **Drift is silent.** Phase 5 CR round-3 finding 17 surfaced one example: `src/core/extraction_defaults.py` docstrings were updated in Watcher (#156 commit 23b8ade) without mirroring to Archiver. No tooling caught it — only a careful CR pass.
- **Reviewer load doesn't scale.** Every PR touching these files imposes a manual sync step. As Replicator (Phase 6+) joins the consumer set, this becomes three repos × O(n) PRs.
- **Fingerprint parity is the load-bearing concern.** AGENTS.md acknowledges this: "revisit when fingerprint parity becomes load-bearing." Replicator's hash-verify-on-re-fetch path (#156 Phase 5 design Section 2 step 5) MUST produce the same fingerprint Watcher recorded. Silent drift here amounts to data corruption.
- **No type-check across copies.** Each repo's CI checks its own copy. If Watcher renames a function, Archiver's copy doesn't get a red CI signal.
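The failure mode in the points above can be sketched in a few lines. Both normalization functions are hypothetical stand-ins for two copies of the same module after an unmirrored change; neither is taken from the actual repos.

```python
import hashlib


def normalize_watcher(text: str) -> str:
    # Hypothetical Watcher copy: collapses every whitespace run to one space.
    return " ".join(text.split())


def normalize_archiver(text: str) -> str:
    # Hypothetical drifted Archiver copy: only trims leading/trailing whitespace.
    return text.strip()


def fingerprint(normalized: str) -> str:
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


raw = "Quarterly  report\n2026 "
watcher_fp = fingerprint(normalize_watcher(raw))
archiver_fp = fingerprint(normalize_archiver(raw))

# Same input, different fingerprints, and neither repo's CI would notice.
drifted = watcher_fp != archiver_fp
```

Each copy passes its own tests in isolation; the mismatch only appears when one service tries to verify a fingerprint the other recorded.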

## Recommended alternatives

### Option A (preferred): Extract a shared Python package

Lift the mirrored modules into a new sibling repo `cannobserv-content-acquisition` (or similar):

```
/home/exedev/cannobserv-content-acquisition/
├── pyproject.toml
└── src/cannobserv_content/
    ├── fetchers/
    ├── extractors/
    ├── simhash.py
    ├── extraction_defaults.py
    └── logging.py
```

Both Watcher and Archiver pin it as a path-editable dependency (same pattern as `archiver-client`):

```toml
cannobserv-content = { path = "/home/exedev/cannobserv-content-acquisition", editable = true, version = ">=1.0.0,<2" }
```

Imports become `from cannobserv_content.extractors import HtmlExtractor` in both repos.

**Pros:** real Python module boundary; one canonical copy; CI signals drift via import errors; one PR per change; version-pinned for upgrades; same pattern operators already understand from `archiver-client`.

**Cons:** new repo to set up; needs its own test infra; bump cycle for breaking changes (but that's a feature, not a bug — it forces both consumers to deliberately upgrade).
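One shape the shared package's own test infra could take is a set of golden vectors: fixed inputs paired with the fingerprints recorded when a version was released. The sketch below is an assumption about how such a check might look, not an existing test; `GOLDEN`, `fingerprint`, and `check_golden` are hypothetical names, and the inputs are plain strings rather than real extractor fixtures.

```python
import hashlib
from typing import Callable

# Hypothetical golden vectors: input text -> SHA-256 hex digest recorded at release.
GOLDEN = {
    "": "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
    "abc": "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad",
}


def fingerprint(text: str) -> str:
    # Stand-in for the package's real extract-then-hash pipeline.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def check_golden(fn: Callable[[str], str], golden: dict[str, str]) -> list[str]:
    """Return the inputs whose fingerprint no longer matches the recorded value."""
    return [text for text, expected in golden.items() if fn(text) != expected]
```

A non-empty result would mean the change alters fingerprints and therefore belongs in a breaking version bump, which ties the test back to the upgrade cycle described in the cons.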

### Option B: Git submodule

Make `src/core/content_acquisition/` a submodule pointing to the shared repo. Both Watcher and Archiver consume the same checkout.

**Pros:** no PyPI / wheel build; reuses existing skills-submodule pattern.

**Cons:** submodule UX is famously thorny (every checkout needs `--recursive`; CI must `git submodule update`); pins are bare commit SHAs rather than version ranges, which makes deliberate, reviewed upgrades harder.

### Option C: Pre-commit + CI drift check (keep current pattern, add automation)

Add a `scripts/check_mirror_parity.sh` to both repos that diffs the mirrored files against the sibling repo and fails the commit/PR if they differ.

**Pros:** minimal disruption; no new infra.

**Cons:** still error-prone in spirit: it relies on developers having both repos checked out locally, CI needs a cross-repo checkout to compare, and it does not shrink the blast radius of a fingerprint-parity failure.
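For a sense of how small the Option C check is, here is a Python sketch of the comparison such a script would perform. The issue proposes a shell script; this is an illustrative equivalent, and `MIRRORED`, `file_digest`, and `find_drift` are hypothetical names. The path list is the Context section's list with the brace patterns expanded.

```python
import hashlib
from pathlib import Path
from typing import Optional

# Mirrored files from the Context section, brace patterns expanded.
MIRRORED = [
    "src/core/fetchers/base.py",
    "src/core/fetchers/http.py",
    "src/core/extractors/base.py",
    "src/core/extractors/html.py",
    "src/core/extractors/csv_excel.py",
    "src/core/extractors/pdf.py",
    "src/core/simhash.py",
    "src/core/extraction_defaults.py",
    "src/core/logging.py",
]


def file_digest(path: Path) -> Optional[str]:
    """SHA-256 of the file's bytes, or None if the file does not exist."""
    if not path.is_file():
        return None
    return hashlib.sha256(path.read_bytes()).hexdigest()


def find_drift(repo_a: Path, repo_b: Path, rel_paths=MIRRORED) -> list[str]:
    """Relative paths whose contents differ between two checkouts.

    A file missing from exactly one side counts as drift; a file missing
    from both sides does not.
    """
    return [
        rel
        for rel in rel_paths
        if file_digest(repo_a / rel) != file_digest(repo_b / rel)
    ]
```

A pre-commit hook or CI job would call `find_drift` with the two repo roots and fail when the list is non-empty. Note that a byte-level comparison also flags docstring-only changes like the one in finding 17, and it still requires both checkouts to be present.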

### Recommendation

**Option A.** It mirrors the existing `archiver-client` pattern that operators and agents already understand, replaces silent drift with a single canonical copy whose breakage surfaces as import errors in CI, and prepares for Replicator's hash-verify path. Defer the work until Replicator is being stood up (Phase 6): the urgency comes from fingerprint parity, and drift between Watcher and Archiver alone is tolerable today.

## Scope of this issue

- Research + design doc only. Do not implement here.
- Read the trajectory doc at `docs/research/2026-05-06-archiver-information-model.md` and the Phase 4/5 plans in `/home/exedev/archiver/docs/plans/` to understand what fingerprint parity guarantees Replicator will need.
- Land the design as `docs/plans/<date>-shared-content-acquisition-package.md`.
- File follow-up implementation issues when the design is approved.

## Related

- Mirror discipline lives in [AGENTS.md:90](https://github.com/CannObserv/watcher/blob/main/AGENTS.md#L90) (Watcher) and the equivalent section in Archiver's AGENTS.md.
- Surfaced during #156 Phase 5 cutover CR round 3 finding 17.
- Phase 6 (Replicator stand-up) is when fingerprint parity becomes load-bearing.
