M3 — Glyph extraction MVP: end-to-end generate from writer profile

Tracks all implementation work for M3 per [`docs/roadmap.md`](docs/roadmap.md).  
Design reference: [`docs/design/letter_extraction.md`](docs/design/letter_extraction.md) (status: DRAFT — this issue supersedes the open questions there).

**Depends on:** M1 (upstream loader + eligibility gate), M2 (writer attribution config).

---

## Goal

First end-to-end extraction: given a writer profile (M2) pointing at a clean upstream checkout (M1), emit one `letter_set.v1` JSON document per writer plus the referenced glyph image assets. Output must be deterministic — same upstream revision + same writer profile → bit-identical output tree.
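One common source of non-determinism is the JSON serialization itself (dict ordering, platform newlines, default encodings). The following is a minimal sketch of a writer that pins those down. The helper name `write_json_deterministic` is hypothetical and not part of the existing codebase, and the real `letter_set.v1` field layout is not assumed here.

```python
import json
from pathlib import Path
from typing import Any


def write_json_deterministic(path: Path, document: dict[str, Any]) -> None:
    """Serialize `document` so that equal input always yields equal bytes.

    Hypothetical helper: the name and signature are illustrative only.
    """
    text = json.dumps(
        document,
        sort_keys=True,            # key order no longer depends on insertion order
        ensure_ascii=False,        # keep Hebrew characters as-is; file is UTF-8
        indent=2,
        separators=(",", ": "),  # pin separators explicitly
    )
    # Write bytes rather than text so encoding and newline style are fixed
    # on every platform.
    path.write_bytes((text + "\n").encode("utf-8"))
```

Two dicts with the same content but different insertion order should then produce identical files, which is the property the determinism requirement relies on.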

---

## Design decisions

Three questions were left open in the draft. They are closed here.

### D1 — Writer identity when upstream metadata is silent → manual-review-only for M3

M2 established that attribution is explicitly config-driven (`attribution_method ∈ {collection_metadata, manual_review}`). The generator consumes a writer profile; it never infers identity. When upstream metadata is silent, the curator annotates with `"attribution_method": "manual_review"`. No clustering, no inference — consistent with the explicit-config philosophy already in the codebase. Auto-clustering stays out of scope through at least M4.
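As an illustration of "explicit config, never inferred", the check below accepts only the two documented `attribution_method` values and raises on anything else instead of guessing. It is a sketch: `AttributionMethod` and `parse_attribution_method` are hypothetical names, and the actual M2 `WriterProfile` loader may already perform an equivalent check.

```python
from enum import Enum


class AttributionMethod(str, Enum):
    """The two attribution methods named in this issue."""

    COLLECTION_METADATA = "collection_metadata"
    MANUAL_REVIEW = "manual_review"


def parse_attribution_method(raw_profile: dict) -> AttributionMethod:
    """Return the declared method; fail loudly if it is missing or unknown.

    Hypothetical helper: `raw_profile` stands in for the parsed profile JSON.
    """
    value = raw_profile.get("attribution_method")
    try:
        return AttributionMethod(value)
    except ValueError:
        allowed = [m.value for m in AttributionMethod]
        raise ValueError(
            f"attribution_method must be one of {allowed}, got {value!r}"
        ) from None
```

A profile with no `attribution_method` key is rejected the same way as one with an unrecognized value, so there is no code path that falls back to inference.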

### D2 — Per-variant quality floor → minimum bounding-box size only for M3

The M1 eligibility gate already requires `quality.usable_for_htr == True` at the page-scan level. For per-glyph quality at extraction time, M3 enforces exactly one floor: **extracted bounding box ≥ 16 × 16 px**. Glyphs smaller than that are almost certainly noise or segmentation artifacts; they are dropped without raising an error, and each drop is logged at `DEBUG` level. M3 adds no contrast or sharpness metrics; those belong in M4 ("Variant quality and dedupe"). The floor value is a named constant so M4 can raise or replace it without touching extraction logic.
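A minimal sketch of the floor as a named constant follows. It assumes "≥ 16 × 16 px" means both width and height must be at least 16 px (not an area threshold). `MIN_GLYPH_BBOX_PX`, `BBox`, and the two functions are hypothetical names chosen for illustration.

```python
import logging
from typing import NamedTuple

logger = logging.getLogger(__name__)

# Assumed name for the M3 floor; M4 could raise or replace this value.
MIN_GLYPH_BBOX_PX = 16


class BBox(NamedTuple):
    """Hypothetical bounding box in pixel coordinates."""

    x: int
    y: int
    width: int
    height: int


def passes_size_floor(box: BBox, floor: int = MIN_GLYPH_BBOX_PX) -> bool:
    """True when both dimensions meet the floor (assumed interpretation)."""
    return box.width >= floor and box.height >= floor


def apply_size_floor(boxes: list[BBox]) -> list[BBox]:
    """Keep boxes that pass the floor; log each dropped box at DEBUG level."""
    kept = []
    for box in boxes:
        if passes_size_floor(box):
            kept.append(box)
        else:
            logger.debug("dropping glyph below %dpx floor: %r", MIN_GLYPH_BBOX_PX, box)
    return kept
```

Keeping the predicate separate from the filtering loop means a later milestone can swap in a different quality check without changing how drops are logged.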

### D3 — Near-duplicate variants → emit all, no deduplication in M3

Deduplication is explicitly the M4 milestone. M3 emits every extracted variant, including near-duplicates. No `near_duplicate_of` field is added to the schema in M3 (that additive change is for M4 to propose). Downstream consumers receive the full set; M4 narrows it. This keeps M3 scope small and avoids coupling a schema change to the extraction engine in the same milestone.

---

## Scope

### In scope

- **`src/hletterscriptgen/extractor.py`** — core extraction engine
  - Primary entry point: `extract(profile: WriterProfile, *, output_dir: Path) -> list[Path]`
  - Loads `WriterProfile` (M2); resolves upstream checkout from `profile.upstream_path`; runs the M1 eligibility gate per entry; segments glyphs; applies the 16 × 16 px floor; emits `letter_set.v1` JSON + image assets per writer
- **Wire `generate` in `cli.py`** — replace the `EXIT_NOT_IMPLEMENTED` stub with `generate <profile_path> <output_dir>`
- **`tests/test_extractor.py`** — unit + integration tests against a small synthetic fixture set
- **Determinism test** — run extraction twice on the same inputs, assert output trees are byte-identical
- Update `AGENTS.md` stable CLI surface list once `generate` is wired
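The `generate <profile_path> <output_dir>` wiring could look roughly like the sketch below. It assumes `cli.py` is built on `argparse`, which this issue does not confirm, and it stubs out the hand-off to `extract(...)` so the block stays self-contained.

```python
import argparse
from pathlib import Path
from typing import Optional


def build_parser() -> argparse.ArgumentParser:
    """Build a parser exposing only the `generate` subcommand (illustrative)."""
    parser = argparse.ArgumentParser(prog="hletterscriptgen")
    subparsers = parser.add_subparsers(dest="command", required=True)
    generate = subparsers.add_parser(
        "generate", help="extract glyphs for the writer described by a profile"
    )
    generate.add_argument("profile_path", type=Path)
    generate.add_argument("output_dir", type=Path)
    return parser


def main(argv: Optional[list[str]] = None) -> int:
    args = build_parser().parse_args(argv)
    if args.command == "generate":
        # Placeholder for the real hand-off: load the WriterProfile (M2) from
        # args.profile_path and call extract(profile, output_dir=args.output_dir).
        print(f"would generate from {args.profile_path} into {args.output_dir}")
        return 0
    return 2
```

Taking `argv` as a parameter keeps `main` callable from tests without touching `sys.argv`.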

### Out of scope (M3)

- Near-duplicate detection or deduplication (M4)
- Per-variant quality metrics beyond the 16 × 16 px bounding-box floor (M4)
- Publishing / handoff into `hletterscript` (M5)
- Schema evolution (M6)
- Auto-clustering of writer identity (out of scope through at least M4, per D1)

---

## Sub-decision needed first: segmentation approach

The draft does not specify how individual Hebrew letter glyphs are segmented from a page scan. This is the core technical question that gates everything else in M3. **The first sub-PR should answer it** before the extraction engine is written:

- **Option A — Rule-based connected-component analysis** (e.g. OpenCV): fast, dependency-light, brittle on touching/overlapping letters; may be sufficient for a constrained MVP corpus.
- **Option B — Pre-annotated bounding boxes from the upstream entry files**: if the upstream scan corpus already ships annotation files (e.g. ALTO XML, PAGE XML, or a JSON sidecar), the extractor just reads coordinates rather than computing them. Zero CV dependency, maximally deterministic.
- **Option C — Pre-trained segmentation model**: most accurate, heaviest dependency, hardest to make deterministic.

Option B should be checked first — if upstream entries carry annotations, it is strictly preferable for an MVP. Option A is the fallback if they do not.
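To make Option B concrete, here is a sketch that reads glyph coordinates from a JSON sidecar and returns them in a stable order. The sidecar layout (`{"glyphs": [{"char", "x", "y", "w", "h"}, ...]}`) is entirely hypothetical. The actual upstream annotation format, if one exists, is exactly what PR 1 needs to establish.

```python
import json
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True, order=True)
class GlyphBox:
    """Hypothetical annotated glyph; field order defines the sort order."""

    y: int
    x: int
    width: int
    height: int
    char: str


def read_sidecar_boxes(sidecar_path: Path) -> list[GlyphBox]:
    """Read boxes from an assumed JSON sidecar and sort them deterministically."""
    data = json.loads(sidecar_path.read_text(encoding="utf-8"))
    boxes = [
        GlyphBox(y=g["y"], x=g["x"], width=g["w"], height=g["h"], char=g["char"])
        for g in data["glyphs"]
    ]
    # Sorting removes any dependence on the order entries appear in the file.
    return sorted(boxes)
```

Because no pixels are analyzed, the result depends only on the sidecar contents, which is why this option is the easiest to keep deterministic.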

---

## Suggested PR breakdown

| # | Branch | Description |
|---|---|---|
| 1 | `feat/m3-segmentation-approach` | Spike: examine upstream fixture entries for annotation files; document chosen approach; add any new dependencies |
| 2 | `feat/m3-extractor` | `extractor.py`, eligibility integration, glyph quality-floor constant, unit tests |
| 3 | `feat/m3-generate-cli` | Wire `generate` subcommand; update AGENTS.md |
| 4 | `feat/m3-determinism-test` | End-to-end determinism test over synthetic fixture |

PRs 2–4 can proceed in sequence once PR 1 settles the segmentation approach.

---

## Acceptance criteria

- [ ] `hletterscriptgen generate <profile.json> <output_dir>` runs end-to-end on a real or synthetic writer profile without error
- [ ] Output directory contains one valid `letter_set.v1` JSON per writer plus image assets
- [ ] `hletterscriptgen validate <output_dir>/<writer>.json` exits 0 for each emitted document
- [ ] Running `generate` twice with identical inputs produces bit-identical output trees
- [ ] `hletterscriptgen check-eligible <upstream_path>/entries.jsonl` passes for all entries referenced by the profile
- [ ] `python -m pytest` stays green, coverage ≥ 90 %
- [ ] `ruff check .` and `mypy` clean
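The bit-identical criterion can be checked by hashing each output tree and comparing the digests. The sketch below is one way to do that. `tree_digest` is a hypothetical helper, and it deliberately ignores file metadata such as timestamps and permissions, covering only relative paths and file contents.

```python
import hashlib
from pathlib import Path


def tree_digest(root: Path) -> str:
    """Return a SHA-256 digest over relative paths and contents under `root`."""
    digest = hashlib.sha256()
    relative_paths = sorted(
        p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()
    )
    for rel in relative_paths:
        data = (root / rel).read_bytes()
        digest.update(rel.encode("utf-8"))
        digest.update(b"\0")                        # separate path from length
        digest.update(len(data).to_bytes(8, "big"))  # length-prefix the content
        digest.update(data)
    return digest.hexdigest()
```

A determinism test would run `generate` into two separate directories and assert that `tree_digest` returns the same value for both.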

