Merged
4 changes: 4 additions & 0 deletions .github/workflows/ci.yml
@@ -30,6 +30,10 @@ jobs:
uses: actions/checkout@v4
with:
repository: HeOCR/hash
# Pin to the latest upstream commit referenced by the current
# hletterscript indexes so cross-validation stays reproducible
# even if HASH later removes or renames source entries.
ref: 0e162a78d609a47858c460431be3d689b2e9e31e
path: .upstream
lfs: false
- name: Install dev dependencies
198 changes: 119 additions & 79 deletions README.md
@@ -1,23 +1,61 @@
# hletterscript

A dataset of **sets of per-letter images of handwritten Hebrew letters**.
Each set groups crops produced from documents written by the *same
writer*; each set typically contains several variants of the same letter
cut from different scans by that writer.

This repository is the downstream of:
[![CI](https://github.com/HeOCR/hletterscript/actions/workflows/ci.yml/badge.svg)](https://github.com/HeOCR/hletterscript/actions/workflows/ci.yml)
![Metadata license](https://img.shields.io/badge/metadata-CC0--1.0-brightgreen)
![Letter crops](https://img.shields.io/badge/letter%20crops-48-blue)
![Writer sets](https://img.shields.io/badge/writer%20sets-2-blue)

Created by [Shay Palachy Affek](http://www.shaypalachy.com/).

Per-writer sets of cropped handwritten Hebrew letter images, extracted
from rights-clean page scans for synthetic handwriting generation and
OCR/HTR research.

![Sample grid of handwritten Hebrew letter crops](docs/assets/hletterscript-sample-grid.png)

## At a Glance

| Field | Value |
| --- | --- |
| Corpus shape | Writer-grouped Hebrew letter crops |
| Current seed corpus | 48 crops from 2 verified writers |
| Covered letter forms | 25 Hebrew letter-form slugs |
| Current writers | Chaim Nachman Bialik, Rachel Bluwstein |
| Canonical crop index | [`data/index/entries.jsonl`](data/index/entries.jsonl) |
| Canonical writer index | [`data/index/writers.jsonl`](data/index/writers.jsonl) |
| Image storage | [`data/letters/<writer_id>/<letter_name>/`](data/letters/) via Git LFS |
| Metadata license | CC0 1.0 |
| Per-image rights | Inherited per crop from the upstream scan |
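
To put the "25 of 27 forms" coverage figure in context, the sketch below enumerates the 27 letter forms from the Unicode Hebrew block. It assumes the forms listed in `docs/letters.md` correspond to codepoints U+05D0 through U+05EA; the repository's own letter-form slugs are not reproduced here, so the Unicode character names stand in for them.

```python
import unicodedata

# Assumption: the 27 forms map onto the contiguous Unicode range
# ALEF (U+05D0) .. TAV (U+05EA), which includes the 5 final forms.
FIRST, LAST = 0x05D0, 0x05EA


def hebrew_letter_forms() -> list[tuple[str, str, bool]]:
    """Return (character, Unicode name, is_final) for each letter form."""
    forms = []
    for codepoint in range(FIRST, LAST + 1):
        char = chr(codepoint)
        name = unicodedata.name(char)  # e.g. "HEBREW LETTER FINAL KAF"
        forms.append((char, name, " FINAL " in name))
    return forms


forms = hebrew_letter_forms()
finals = [char for char, _, is_final in forms if is_final]
print(len(forms), "forms, of which", len(finals), "are finals")
```

Comparing this list against the letter forms that actually appear in `data/index/entries.jsonl` would show which two forms the seed corpus still lacks.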

## What This Repository Contains

`hletterscript` is a data repository for individual Hebrew letter crops,
grouped by the person who wrote them. Each entry links the crop back to
the source scan, bounding box, extraction method, file checksum, quality
flags, and inherited rights statement.

The corpus is intentionally small and strict at this stage: it is a
validated seed dataset for the surrounding HeOCR tooling, not yet a
complete Hebrew handwriting corpus.

## Pipeline Position

```mermaid
flowchart LR
HASH["HeOCR/hash<br/>rights-clean page scans"] --> GEN["HeOCR/hletterscriptgen<br/>crop extraction"]
GEN --> THIS["HeOCR/hletterscript<br/>per-writer letter sets"]
THIS --> SYN["HeOCR/hocrsyngen<br/>synthetic documents"]
SYN --> OCR["HeOCR / HeOCRsynth<br/>OCR and HTR corpora"]
```

- [HeOCR/hash][upstream] (HASH — Hebrew Archive of Scanned Handwriting) — the
canonical, permissively-licensed source of page-level scans. Every
entry here cites its upstream scan.
- [HeOCR/hletterscriptgen][gen] — the framework that turns page scans
into per-letter crops. Each entry records which version of that
framework produced it.
Related projects:

The intended downstream consumers are synthetic-document generators
([HeOCR/hocrsyngen][syngen]) and the synthetic / real Hebrew handwriting
corpora they feed into ([HeOCR/HeOCRsynth][heocrsynth],
[HeOCR/HeOCR][heocr]).
- [HeOCR/hash][upstream] is the page-scan source of truth.
- [HeOCR/hletterscriptgen][gen] produces the letter crops.
- [HeOCR/hocrsyngen][syngen] consumes this repository for synthetic
document generation.
- [HeOCR/HeOCRsynth][heocrsynth] and [HeOCR/HeOCR][heocr] are intended
downstream OCR/HTR targets.

[upstream]: https://github.com/HeOCR/hash
[gen]: https://github.com/HeOCR/hletterscriptgen
@@ -27,84 +65,86 @@ corpora they feed into ([HeOCR/HeOCRsynth][heocrsynth],

## Dataset Layout

- `docs/dataset_structure.md` defines the repository layout and
ingestion model.
- `docs/letters.md` is the canonical Hebrew-letter enumeration
(27 forms — 22 base letters plus the 5 finals).
- `data/index/writers.jsonl` is the set-level catalog: one JSON object
per writer/scribe.
- `data/index/entries.jsonl` is the image-level catalog: one JSON
object per cropped letter image, with upstream provenance,
extraction provenance, file checksums, and inherited rights.
- `data/letters/<writer_id>/<letter_name>/` stores the image bytes.
- `schemas/writer.schema.json` and `schemas/entry.schema.json` define
the record contracts.
- `scripts/validate_indexes.py` validates JSONL records against the
schemas, enforces referential integrity, checks Hebrew-letter
codepoint/name/form consistency, pins the upstream repo URL, and
re-verifies image file checksums and sizes on disk.
- `scripts/generate_release_artifacts.py` regenerates `NOTICE.md`,
`CITATION.cff`, and `datapackage.json` deterministically from the
- [`docs/dataset_structure.md`](docs/dataset_structure.md) defines the
repository layout, index model, rights inheritance, and ingestion flow.
- [`docs/letters.md`](docs/letters.md) is the canonical Hebrew-letter
enumeration: 27 forms, covering the 22 base letters plus 5 finals.
- [`data/index/writers.jsonl`](data/index/writers.jsonl) is the set-level
catalog: one JSON object per writer or scribe.
- [`data/index/entries.jsonl`](data/index/entries.jsonl) is the crop-level
catalog: one JSON object per image, with upstream provenance,
extraction provenance, file checksums, quality flags, and inherited
rights.
- [`data/letters/`](data/letters/) stores the crop image bytes.
- [`schemas/writer.schema.json`](schemas/writer.schema.json) and
[`schemas/entry.schema.json`](schemas/entry.schema.json) define the
record contracts.
- [`scripts/validate_indexes.py`](scripts/validate_indexes.py) validates
JSONL records, referential integrity, Hebrew-letter consistency,
upstream URLs, image checksums, and file sizes.
- [`scripts/generate_release_artifacts.py`](scripts/generate_release_artifacts.py)
regenerates [`NOTICE.md`](NOTICE.md), [`CITATION.cff`](CITATION.cff),
and [`datapackage.json`](datapackage.json) deterministically from the
indexes.
- `LICENSE.md` documents the compound licensing policy for
metadata and per-image inherited rights.
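
The checksum and file-size checks described above can be sketched roughly as follows. This is not the code in `scripts/validate_indexes.py`: the field names `path`, `sha256`, and `size_bytes` are hypothetical placeholders, and the real record contract is whatever `schemas/entry.schema.json` defines.

```python
import hashlib
import json
from pathlib import Path
from typing import Iterator


def iter_jsonl(path: Path) -> Iterator[dict]:
    """Yield one record per non-empty line of a JSONL file."""
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)


def sha256_of(path: Path) -> str:
    """Hex SHA-256 of a file, using hashlib.file_digest when available."""
    with path.open("rb") as fh:
        if hasattr(hashlib, "file_digest"):  # Python 3.11+
            return hashlib.file_digest(fh, "sha256").hexdigest()
        digest = hashlib.sha256()
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
        return digest.hexdigest()


def check_entry(entry: dict, repo_root: Path) -> list[str]:
    """Return the problems found for one record (empty list means it passed)."""
    # Hypothetical field names; consult the entry schema for the real ones.
    rel_path = entry["path"]
    image = repo_root / rel_path
    if not image.is_file():
        return [f"missing file: {rel_path}"]
    problems = []
    if image.stat().st_size != entry["size_bytes"]:
        problems.append(f"size mismatch: {rel_path}")
    if sha256_of(image) != entry["sha256"]:
        problems.append(f"checksum mismatch: {rel_path}")
    return problems
```

Under those assumptions, a full pass would call `check_entry` for every record from `iter_jsonl(Path("data/index/entries.jsonl"))`. If the image files are still Git LFS pointers, the size and checksum comparisons would fail, which is why `git lfs pull` has to run first.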

## Serialization Decision
## Licensing Model

The canonical editable indexes are newline-delimited JSON (`.jsonl`),
matching the upstream scans repo's convention.
This repository uses a compound licensing model:

JSONL is deliberately used instead of CSV because these records need
nested upstream references, bounding boxes, rights inheritance,
extraction provenance, and quality measurements. CSV/Parquet/SQLite
exports can be generated later as derived artefacts; the source of
truth stays line-oriented, diffable, streamable JSON.
- Repository-authored metadata, schemas, scripts, docs, and generated
metadata exports are dedicated to the public domain under CC0 1.0.
- Each crop carries its own inherited rights block from the upstream scan.
Current seed crops are public-domain compatible, but consumers should
read the per-entry rights metadata rather than assume a uniform image
license.

See [`LICENSE`](LICENSE) for the repository metadata license and
[`LICENSE.md`](LICENSE.md) for the full per-image rights policy.
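
Because rights are recorded per crop, a consumer may want to tally the licenses present before using the images. The following is a minimal sketch, assuming each entry carries a `rights` object with a `license` key; both names are hypothetical and should be checked against `schemas/entry.schema.json`.

```python
import json
from collections import Counter
from typing import Iterable


def license_counts(lines: Iterable[str]) -> Counter:
    """Count entries per license identifier across JSONL lines."""
    counts: Counter = Counter()
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines
        entry = json.loads(line)
        # "rights" and "license" are assumed field names, not the schema's.
        rights = entry.get("rights") or {}
        counts[rights.get("license", "UNKNOWN")] += 1
    return counts
```

Passing an open handle on `data/index/entries.jsonl` to `license_counts` would give one count per license identifier. Anything reported as `UNKNOWN` means the assumed field names do not match that record.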

## Requirements

- **Python ≥ 3.11** (the validator uses `hashlib.file_digest`).
CI pins 3.12.
- **Git LFS** — image bytes under `data/letters/**` are tracked via
LFS (see `.gitattributes`). After cloning, run `git lfs install`
once, then `git lfs pull` to fetch the actual image bytes.
- Python 3.11 or newer. CI currently validates 3.11, 3.12, and 3.13.
- Git LFS for the image bytes under `data/letters/**`.

Run the current validation check with:
After cloning:

```bash
git lfs install && git lfs pull
git lfs install
git lfs pull
python3 -m pip install -r requirements-dev.txt
```

## Validate Locally

```bash
python3 scripts/validate_indexes.py
python3 scripts/generate_release_artifacts.py --check
python3 -m pytest
```

For the full CI-style upstream cross-check, place a checkout of
[`HeOCR/hash`](https://github.com/HeOCR/hash) at `.upstream` and run:

```bash
python3 scripts/validate_indexes.py --upstream-path .upstream
```

## Current Status

`v0.0.0-rc` — **initial setup**. The repository ships with the
schemas, validation tooling, release-artifact generator, CI workflow,
and licensing policy in place. The per-letter image indexes
(`writers.jsonl`, `entries.jsonl`) are empty: actual letter-image
ingestion happens in subsequent PRs, produced by
[HeOCR/hletterscriptgen][gen] from scans in the upstream repo.

The repository uses a compound licensing model: repository-authored
metadata is dedicated to the public domain under CC0 1.0 (see
[`LICENSE`](LICENSE)), while per-image rights are recorded individually
and inherited from each crop's upstream scan. See
[`LICENSE.md`](LICENSE.md) for the full policy, including the CC BY-SA ShareAlike
caveat and the rules for remix-friendly release bundles.

## How to use this repo

- [`data/index/entries.jsonl`](data/index/entries.jsonl) is the source
of truth for the per-letter image corpus — one JSON object per crop,
with upstream citation, file checksums, and inherited rights.
- [`data/index/writers.jsonl`](data/index/writers.jsonl) catalogs the
writers, including candidate leads and rejected records.
- [`schemas/entry.schema.json`](schemas/entry.schema.json) and
[`schemas/writer.schema.json`](schemas/writer.schema.json) define the
record contracts;
[`scripts/validate_indexes.py`](scripts/validate_indexes.py) enforces them in CI.
- Contributors adding new entries should start with
[`AGENTS.md`](AGENTS.md) for ingest rules, naming, and the pre-PR
checklist.
`v0.0.0-rc` is a validated seed corpus with 48 indexed letter crops from
2 verified writers. The repository has schema validation, deterministic
release-artifact generation, CI, and licensing policy in place.

Future ingestion work expands writer coverage and fills missing Hebrew
letter forms through crops produced by [HeOCR/hletterscriptgen][gen] from
the upstream HASH scans.

## Contributing

Contributors adding or reviewing crop entries should start with
[`AGENTS.md`](AGENTS.md). It captures ingest rules, naming conventions,
rights constraints, and the pre-PR checklist for this data repository.

## Credits

Created by [Shay Palachy Affek](http://www.shaypalachy.com/) [[GitHub](https://github.com/shaypal5)]
Binary file added docs/assets/hletterscript-sample-grid.png