M3 — Glyph extraction MVP: end-to-end generate from writer profile #16

@shaypal5

Tracks all implementation work for M3 per docs/roadmap.md.
Design reference: docs/design/letter_extraction.md (status: DRAFT — this issue supersedes the open questions there).

Depends on: M1 (upstream loader + eligibility gate), M2 (writer attribution config).


Goal

First end-to-end extraction: given a writer profile (M2) pointing at a clean upstream checkout (M1), emit one letter_set.v1 JSON document per writer plus the referenced glyph image assets. Output must be deterministic — same upstream revision + same writer profile → bit-identical output tree.
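One way the determinism requirement could be met at the serialization layer is sketched below. This is an illustration under assumptions, not the project's writer: the function name `write_json_deterministically` and the example keys are hypothetical, and the real letter_set.v1 fields are defined by the schema, not here.

```python
import json
from pathlib import Path


def write_json_deterministically(document: dict, path: Path) -> None:
    """Serialize `document` so that equal inputs always yield equal bytes.

    Hypothetical helper: the name and signature are illustrative only.
    """
    text = json.dumps(
        document,
        sort_keys=True,       # key order no longer depends on insertion order
        ensure_ascii=False,   # keep Hebrew letters as-is instead of \u escapes
        indent=2,
    )
    # Fixed encoding and a fixed "\n" terminator avoid platform-dependent
    # newline translation that text-mode writes can introduce.
    path.write_bytes((text + "\n").encode("utf-8"))
```

Two documents that differ only in key insertion order produce identical files under this scheme. Image assets would need the same care (for example, no embedded timestamps), which this sketch does not cover.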


Design decisions

Three questions were left open in the draft. They are closed here.

D1 — Writer identity when upstream metadata is silent → manual-review-only for M3

M2 established that attribution is explicitly config-driven (attribution_method ∈ {collection_metadata, manual_review}). The generator consumes a writer profile; it never infers identity. When upstream metadata is silent, the curator annotates with "attribution_method": "manual_review". No clustering, no inference — consistent with the explicit-config philosophy already in the codebase. Auto-clustering stays out of scope through at least M4.
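A minimal sketch of how the explicit-config rule in D1 might be enforced when a profile is loaded. The class name `WriterProfileSketch` and the `writer_id` field are hypothetical; only the two `attribution_method` values come from this issue, and the actual M2 `WriterProfile` may look different.

```python
from dataclasses import dataclass

# The two values named in D1; anything else is rejected, never inferred.
ALLOWED_ATTRIBUTION_METHODS = frozenset({"collection_metadata", "manual_review"})


@dataclass(frozen=True)
class WriterProfileSketch:
    """Illustrative stand-in for the M2 writer profile (fields are assumed)."""

    writer_id: str            # hypothetical field
    attribution_method: str

    def __post_init__(self) -> None:
        if self.attribution_method not in ALLOWED_ATTRIBUTION_METHODS:
            allowed = ", ".join(sorted(ALLOWED_ATTRIBUTION_METHODS))
            raise ValueError(
                f"attribution_method must be one of: {allowed}; "
                f"got {self.attribution_method!r}"
            )
```

Rejecting unknown methods at construction time keeps the generator from ever reaching a code path where identity would have to be guessed.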

D2 — Per-variant quality floor → minimum bounding-box size only for M3

The M1 eligibility gate already requires quality.usable_for_htr == True at the page-scan level. For per-glyph quality at extraction time, M3 enforces exactly one floor: extracted bounding box ≥ 16 × 16 px. Glyphs smaller than that are almost certainly noise or segmentation artifacts; they are dropped without raising an error, and each drop is logged at DEBUG level. M3 computes no contrast or sharpness metrics; those belong in M4 ("Variant quality and dedupe"). The floor value is a named constant so M4 can raise or replace it without touching extraction logic.
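The floor in D2 could look roughly like the following. The constant name `MIN_GLYPH_SIDE_PX` and the function `passes_size_floor` are hypothetical, and the sketch assumes "≥ 16 × 16 px" means that both width and height must be at least 16 px.

```python
import logging

logger = logging.getLogger(__name__)

# Hypothetical constant name. Keeping the value in one place lets M4 raise
# or replace the floor without touching the extraction logic.
MIN_GLYPH_SIDE_PX = 16


def passes_size_floor(width: int, height: int) -> bool:
    """Return True if a glyph bounding box meets the M3 size floor.

    Assumption: the floor applies to each side independently.
    """
    if width >= MIN_GLYPH_SIDE_PX and height >= MIN_GLYPH_SIDE_PX:
        return True
    # Dropped glyphs do not raise; they only leave a DEBUG-level trace.
    logger.debug(
        "Dropping glyph candidate %dx%d px (floor: %d px per side)",
        width,
        height,
        MIN_GLYPH_SIDE_PX,
    )
    return False
```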

D3 — Near-duplicate variants → emit all, no deduplication in M3

Deduplication is explicitly the M4 milestone. M3 emits every extracted variant, including near-duplicates. No near_duplicate_of field is added to the schema in M3 (that additive change is for M4 to propose). Downstream consumers receive the full set; M4 narrows it. This keeps M3 scope small and avoids coupling a schema change to the extraction engine in the same milestone.


Scope

In scope

  • src/hletterscriptgen/extractor.py — core extraction engine
    • Primary entry point: extract(profile: WriterProfile, *, output_dir: Path) -> list[Path]
    • Loads WriterProfile (M2); resolves upstream checkout from profile.upstream_path; runs the M1 eligibility gate per entry; segments glyphs; applies the 16 × 16 px floor; emits letter_set.v1 JSON + image assets per writer
  • Wire generate in cli.py — replace the EXIT_NOT_IMPLEMENTED stub with generate <profile_path> <output_dir>
  • tests/test_extractor.py — unit + integration tests against a small synthetic fixture set
  • Determinism test — run extraction twice on the same inputs, assert output trees are byte-identical
  • Update AGENTS.md stable CLI surface list once generate is wired
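For the CLI item above, a sketch of what the generate subcommand's argument parsing might look like. It assumes cli.py is built on argparse, which this issue does not confirm; `build_parser` is a hypothetical name, and only the `generate <profile_path> <output_dir>` shape comes from the scope list.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Build a parser exposing `generate <profile_path> <output_dir>`.

    Assumption: the real cli.py may be structured differently.
    """
    parser = argparse.ArgumentParser(prog="hletterscriptgen")
    subcommands = parser.add_subparsers(dest="command", required=True)

    generate = subcommands.add_parser(
        "generate",
        help="Extract glyphs for the writer described by a profile.",
    )
    # Both positionals are converted to Path so the extractor's
    # entry point can consume them directly.
    generate.add_argument("profile_path", type=Path)
    generate.add_argument("output_dir", type=Path)
    return parser
```

A handler would then load the profile from `args.profile_path` and pass it, together with `args.output_dir`, to `extract(...)`.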

Out of scope (M3)

  • Near-duplicate detection or deduplication (M4)
  • Per-variant quality metrics beyond the 16 × 16 px bounding-box floor (M4)
  • Publishing / handoff into hletterscript (M5)
  • Schema evolution (M6)
  • Auto-clustering of writer identity (indefinitely out of scope)

Sub-decision needed first: segmentation approach

The draft does not specify how individual Hebrew letter glyphs are segmented from a page scan. This is the core technical question that gates everything else in M3. The first sub-PR should answer it before the extraction engine is written:

  • Option A — Rule-based connected-component analysis (e.g. OpenCV): fast, dependency-light, brittle on touching/overlapping letters; may be sufficient for a constrained MVP corpus.
  • Option B — Pre-annotated bounding boxes from the upstream entry files: if the upstream scan corpus already ships annotation files (e.g. ALTO XML, PAGE XML, or a JSON sidecar), the extractor just reads coordinates rather than computing them. Zero CV dependency, maximally deterministic.
  • Option C — Pre-trained segmentation model: most accurate, heaviest dependency, hardest to make deterministic.

Option B should be checked first — if upstream entries carry annotations, it is strictly preferable for an MVP. Option A is the fallback if they do not.
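If the spike finds annotations, Option B could reduce to reading coordinates and ordering them stably. The sketch below assumes a JSON sidecar with a made-up layout (`glyphs` list with `x`, `y`, `w`, `h`, `letter` keys); the actual upstream format, if any exists, is unknown at this point, and `GlyphBox` and `load_boxes` are hypothetical names.

```python
import json
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True, order=True)
class GlyphBox:
    """One annotated glyph. Field order defines the sort order (y, then x)."""

    y: int
    x: int
    width: int
    height: int
    letter: str


def load_boxes(sidecar_path: Path) -> list[GlyphBox]:
    """Read glyph boxes from a hypothetical JSON sidecar file."""
    raw = json.loads(sidecar_path.read_text(encoding="utf-8"))
    boxes = [
        GlyphBox(y=g["y"], x=g["x"], width=g["w"], height=g["h"], letter=g["letter"])
        for g in raw["glyphs"]
    ]
    # Sorting removes any dependence on the order in which the annotation
    # tool happened to write entries, which helps the determinism goal.
    return sorted(boxes)
```

No image processing library is needed for this path, which is the main argument for checking it before Option A.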


Suggested PR breakdown

| # | Branch | Description |
|---|--------|-------------|
| 1 | feat/m3-segmentation-approach | Spike: examine upstream fixture entries for annotation files; document chosen approach; add any new dependencies |
| 2 | feat/m3-extractor | extractor.py, eligibility integration, glyph quality-floor constant, unit tests |
| 3 | feat/m3-generate-cli | Wire generate subcommand; update AGENTS.md |
| 4 | feat/m3-determinism-test | End-to-end determinism test over synthetic fixture |

PRs 2–4 can proceed in sequence once PR 1 settles the segmentation approach.


Acceptance criteria

  • hletterscriptgen generate <profile.json> <output_dir> runs end-to-end on a real or synthetic writer profile without error
  • Output directory contains one valid letter_set.v1 JSON per writer plus image assets
  • hletterscriptgen validate <output_dir>/<writer>.json exits 0 for each emitted document
  • Running generate twice with identical inputs produces bit-identical output trees
  • hletterscriptgen check-eligible <upstream_path>/entries.jsonl passes for all entries referenced by the profile
  • python -m pytest stays green, coverage ≥ 90 %
  • ruff check . and mypy both pass cleanly
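The bit-identical criterion could be checked by hashing each output tree and comparing the digests. The helper below is a sketch; `tree_digest` is a hypothetical name and is not part of the existing codebase.

```python
import hashlib
from pathlib import Path


def tree_digest(root: Path) -> str:
    """Return a SHA-256 digest over every file's relative path and bytes."""
    digest = hashlib.sha256()
    # Sort by POSIX-style relative path so traversal order is the same
    # on every platform and filesystem.
    entries = sorted(
        (p.relative_to(root).as_posix(), p)
        for p in root.rglob("*")
        if p.is_file()
    )
    for relative, path in entries:
        data = path.read_bytes()
        digest.update(relative.encode("utf-8"))
        digest.update(b"\0")
        # Length prefix keeps file boundaries unambiguous in the hash input.
        digest.update(len(data).to_bytes(8, "big"))
        digest.update(data)
    return digest.hexdigest()
```

A determinism test would run generate into two fresh directories and assert that `tree_digest(first) == tree_digest(second)`. Note that this compares file names and contents only, not permissions or timestamps.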

Labels

  • area:cli (hletterscriptgen CLI surface)
  • area:extraction (Glyph segmentation and per-letter extraction (M3+))
  • type:feature (New feature or capability)
