Tracks all implementation work for M3 per docs/roadmap.md.
Design reference: docs/design/letter_extraction.md (status: DRAFT — this issue supersedes the open questions there).
Depends on: M1 (upstream loader + eligibility gate), M2 (writer attribution config).
Goal
First end-to-end extraction: given a writer profile (M2) pointing at a clean upstream checkout (M1), emit one letter_set.v1 JSON document per writer plus the referenced glyph image assets. Output must be deterministic — same upstream revision + same writer profile → bit-identical output tree.
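A bit-identical output tree depends, among other things, on the JSON documents being serialized the same way on every run. A minimal sketch of one way to pin that down is shown here. The helper name write_canonical_json and the example payload are hypothetical; the actual letter_set.v1 fields are defined by the schema, not by this issue.

```python
import json
from pathlib import Path


def write_canonical_json(document: dict, path: Path) -> None:
    """Serialize `document` so that equal inputs always yield identical bytes."""
    text = json.dumps(
        document,
        sort_keys=True,          # key order no longer depends on insertion order
        ensure_ascii=False,      # keep Hebrew letters readable in the file
        indent=2,
        separators=(",", ": "),  # pin separators explicitly
    )
    # Write bytes directly so encoding and line endings are the same on every platform.
    path.write_bytes((text + "\n").encode("utf-8"))
```

Writing bytes rather than text avoids platform-dependent newline translation, which would otherwise break byte-level comparison between runs on different operating systems.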
Design decisions
Three questions were left open in the draft. They are closed here.
D1 — Writer identity when upstream metadata is silent → manual-review-only for M3
M2 established that attribution is explicitly config-driven (attribution_method ∈ {collection_metadata, manual_review}). The generator consumes a writer profile; it never infers identity. When upstream metadata is silent, the curator annotates with "attribution_method": "manual_review". No clustering, no inference — consistent with the explicit-config philosophy already in the codebase. Auto-clustering stays out of scope through at least M4.
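The "never infers identity" rule can be enforced with a guard that rejects any profile whose attribution method is missing or unknown. The sketch below assumes the profile is available as a plain dict; the real WriterProfile type from M2 is not shown in this issue, and the function name is hypothetical.

```python
# The two values named in D1; anything else is rejected rather than guessed.
ALLOWED_ATTRIBUTION_METHODS = frozenset({"collection_metadata", "manual_review"})


def require_explicit_attribution(profile: dict) -> str:
    """Return the profile's attribution_method, or raise if it is missing or unknown."""
    method = profile.get("attribution_method")
    if method not in ALLOWED_ATTRIBUTION_METHODS:
        allowed = sorted(ALLOWED_ATTRIBUTION_METHODS)
        raise ValueError(f"attribution_method must be one of {allowed}, got {method!r}")
    return method
```

Failing loudly here keeps the generator from silently falling back to any form of inference when a curator has not annotated a writer.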
D2 — Per-variant quality floor → minimum bounding-box size only for M3
The M1 eligibility gate already requires quality.usable_for_htr == True at the page-scan level. For per-glyph quality at extraction time, M3 enforces exactly one floor: extracted bounding box ≥ 16 × 16 px. Smaller glyphs are almost certainly noise or segmentation artifacts, so they are dropped without raising an error and logged at DEBUG level. M3 adds no contrast or sharpness metrics; those belong in M4 ("Variant quality and dedupe"). The floor value is a named constant so M4 can raise or replace it without touching extraction logic.
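The floor described above could look roughly like the following. This is a sketch under two assumptions: "≥ 16 × 16 px" is read as "both width and height are at least 16 px", and boxes are (x, y, width, height) tuples. The names MIN_GLYPH_SIDE_PX, passes_size_floor, and apply_size_floor are hypothetical.

```python
import logging

logger = logging.getLogger(__name__)

# Named constant so a later milestone can change the floor without touching extraction logic.
MIN_GLYPH_SIDE_PX = 16

Box = tuple[int, int, int, int]  # (x, y, width, height), an assumed layout


def passes_size_floor(width: int, height: int) -> bool:
    """True if the box is at least MIN_GLYPH_SIDE_PX on both sides (assumed reading)."""
    return width >= MIN_GLYPH_SIDE_PX and height >= MIN_GLYPH_SIDE_PX


def apply_size_floor(boxes: list[Box]) -> list[Box]:
    """Keep boxes that meet the floor, in input order; log dropped ones at DEBUG."""
    kept = []
    for box in boxes:
        _, _, width, height = box
        if passes_size_floor(width, height):
            kept.append(box)
        else:
            logger.debug("dropping glyph %r: below %d px floor", box, MIN_GLYPH_SIDE_PX)
    return kept
```

Keeping the predicate separate from the loop means M4 could swap in a richer quality check while the filtering code stays unchanged.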
D3 — Near-duplicate variants → emit all, no deduplication in M3
Deduplication is explicitly the M4 milestone. M3 emits every extracted variant, including near-duplicates. No near_duplicate_of field is added to the schema in M3 (that additive change is for M4 to propose). Downstream consumers receive the full set; M4 narrows it. This keeps M3 scope small and avoids coupling a schema change to the extraction engine in the same milestone.
Scope
In scope
- src/hletterscriptgen/extractor.py: core extraction engine
  - Primary entry point: extract(profile: WriterProfile, *, output_dir: Path) -> list[Path]
  - Loads WriterProfile (M2); resolves upstream checkout from profile.upstream_path; runs the M1 eligibility gate per entry; segments glyphs; applies the 16 × 16 px floor; emits letter_set.v1 JSON + image assets per writer
- Wire generate in cli.py: replace the EXIT_NOT_IMPLEMENTED stub with generate <profile_path> <output_dir>
- tests/test_extractor.py: unit + integration tests against a small synthetic fixture set
- Determinism test: run extraction twice on the same inputs, assert output trees are byte-identical
- Update AGENTS.md stable CLI surface list once generate is wired
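For the CLI item in the list above, a minimal sketch of a generate subcommand taking the two positional arguments follows. It assumes cli.py is built on argparse, which this issue does not confirm, and the handler body is only a placeholder for the call into extract().

```python
import argparse
from pathlib import Path
from typing import Optional, Sequence


def build_parser() -> argparse.ArgumentParser:
    """Build a parser exposing: generate <profile_path> <output_dir>."""
    parser = argparse.ArgumentParser(prog="hletterscriptgen")
    subcommands = parser.add_subparsers(dest="command", required=True)
    generate = subcommands.add_parser(
        "generate", help="extract letter sets for one writer profile"
    )
    generate.add_argument("profile_path", type=Path)
    generate.add_argument("output_dir", type=Path)
    return parser


def main(argv: Optional[Sequence[str]] = None) -> int:
    args = build_parser().parse_args(argv)
    # Placeholder for the real handler, which would load the profile and call extract().
    print(f"generate: {args.profile_path} -> {args.output_dir}")
    return 0
```

Returning an integer from main keeps the exit code explicit, which matters once the EXIT_NOT_IMPLEMENTED stub is replaced.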
Out of scope (M3)
- Near-duplicate detection or deduplication (M4)
- Per-variant quality metrics beyond the 16 × 16 px bounding-box floor (M4)
- Publishing / handoff into hletterscript (M5)
- Schema evolution (M6)
- Auto-clustering of writer identity (indefinitely out of scope)
Sub-decision needed first: segmentation approach
The draft does not specify how individual Hebrew letter glyphs are segmented from a page scan. This is the core technical question that gates everything else in M3. The first sub-PR should answer it before the extraction engine is written:
- Option A — Rule-based connected-component analysis (e.g. OpenCV): fast, dependency-light, brittle on touching/overlapping letters; may be sufficient for a constrained MVP corpus.
- Option B — Pre-annotated bounding boxes from the upstream entry files: if the upstream scan corpus already ships annotation files (e.g. ALTO XML, PAGE XML, or a JSON sidecar), the extractor just reads coordinates rather than computing them. Zero CV dependency, maximally deterministic.
- Option C — Pre-trained segmentation model: most accurate, heaviest dependency, hardest to make deterministic.
Option B should be checked first — if upstream entries carry annotations, it is strictly preferable for an MVP. Option A is the fallback if they do not.
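To make Option B concrete, here is a sketch of reading glyph boxes from an ALTO XML file. It assumes the upstream corpus ships ALTO with Glyph elements carrying CONTENT, HPOS, VPOS, WIDTH, and HEIGHT attributes, which is exactly what the spike in PR 1 would need to verify. The function names are hypothetical, and tag matching ignores the namespace so that different ALTO versions are handled alike.

```python
import xml.etree.ElementTree as ET

GlyphBox = tuple[str, float, float, float, float]  # (content, hpos, vpos, width, height)


def _local_name(tag: str) -> str:
    """Strip an XML namespace prefix: '{http://...}Glyph' becomes 'Glyph'."""
    return tag.rsplit("}", 1)[-1]


def read_alto_glyph_boxes(xml_text: str) -> list[GlyphBox]:
    """Return one box per Glyph element, in document order."""
    root = ET.fromstring(xml_text)
    boxes = []
    for element in root.iter():
        if _local_name(element.tag) != "Glyph":
            continue
        coords = [element.get(name) for name in ("HPOS", "VPOS", "WIDTH", "HEIGHT")]
        if any(value is None for value in coords):
            continue  # skip glyphs without a complete bounding box
        hpos, vpos, width, height = (float(value) for value in coords)
        boxes.append((element.get("CONTENT", ""), hpos, vpos, width, height))
    return boxes
```

Because the coordinates are read rather than computed, the result depends only on the annotation file, which is what makes this option attractive for determinism.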
Suggested PR breakdown
| # | Branch | Description |
|---|--------|-------------|
| 1 | feat/m3-segmentation-approach | Spike: examine upstream fixture entries for annotation files; document chosen approach; add any new dependencies |
| 2 | feat/m3-extractor | extractor.py, eligibility integration, glyph quality-floor constant, unit tests |
| 3 | feat/m3-generate-cli | Wire generate subcommand; update AGENTS.md |
| 4 | feat/m3-determinism-test | End-to-end determinism test over synthetic fixture |
PRs 2–4 can proceed in sequence once PR 1 settles the segmentation approach.
Acceptance criteria
- hletterscriptgen generate <profile.json> <output_dir> runs end-to-end on a real or synthetic writer profile without error
- Emits letter_set.v1 JSON per writer plus image assets
- hletterscriptgen validate <output_dir>/<writer>.json exits 0 for each emitted document
- Running generate twice with identical inputs produces bit-identical output trees
- hletterscriptgen check-eligible <upstream_path>/entries.jsonl passes for all entries referenced by the profile
- python -m pytest stays green, coverage ≥ 90 %
- ruff check . and mypy clean
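The bit-identical criterion can be checked by hashing each output tree and comparing the digests. The helper below is a sketch, not the project's test; tree_digest is a hypothetical name, and empty directories are ignored, which may or may not match the intended definition of "output tree".

```python
import hashlib
from pathlib import Path


def tree_digest(root: Path) -> str:
    """Hash every file's relative path and contents, visiting files in sorted order."""
    files = [path for path in root.rglob("*") if path.is_file()]
    # Sort by POSIX-style relative path so traversal order never affects the result.
    files.sort(key=lambda path: path.relative_to(root).as_posix())
    digest = hashlib.sha256()
    for path in files:
        data = path.read_bytes()
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")                         # separate the name from the content
        digest.update(len(data).to_bytes(8, "big"))  # length prefix keeps file boundaries unambiguous
        digest.update(data)
    return digest.hexdigest()
```

A determinism test would then run generate into two fresh directories and assert that tree_digest returns the same value for both.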