# Italian Manuscripts OCR with Kraken

This notebook documents the end-to-end process used to create and fine-tune a Kraken OCR model for Italian handwritten manuscripts. It follows the same logical structure as the LaTeX report:

- Preparation of line images and Markdown transcriptions
- Generation and normalization of the ground truth file
- Creation of train/validation/test splits and charset
- Image preprocessing pipeline
- Creation of sidecar `.gt.txt` files
- Final training command for Kraken via `ketos`

The notebook is mainly intended for **reproducibility and documentation**: some paths and commands are examples taken from the original environment and may need to be adapted to your own directory layout and platform.

## Environment and prerequisites

The work was carried out on a Linux machine (Pop!_OS) with an NVIDIA GPU and CUDA support. The following assumptions are made:

- You have a working Python 3 environment (e.g. via `conda`).
- `kraken` and `ketos` are installed in this environment (e.g. `pip install kraken`).
- You have the following directory structure (or equivalent):
  - `00_images/` – line images extracted from scanned pages
  - `01_texts/` – Markdown files with manual transcriptions, one per line image
  - `gt_new/` – folder where ground truth files will be written
  - `splits_new/` – folder for train/val/test splits
  - `processed/lines/` – folder for preprocessed line images used for training

Below we reconstruct all the scripts and commands used in the LaTeX report.

## 1. Generating ground truth from Markdown files

We start from:
- `00_images/` containing line images (e.g. `100_page1_line001.png`)
- `01_texts/` containing Markdown transcriptions with matching stems (e.g. `100_page1_line001.md`)

The script `make_gt_from_line_md.py` generates a tab-separated `ground_truth.txt` where each line contains:

```text
<image_path>\t<normalized_plain_text>
```

Markdown syntax is stripped and, by default, all non-empty lines of a `.md` file are joined with single spaces into one transcription. Passing `--no-join` keeps only the first non-empty line.
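
To make the target format concrete, here is a minimal sketch of how one Markdown transcription becomes one ground truth line. The helper `to_gt_line` and the sample text are illustrative assumptions, not part of the original scripts, and the stripping is deliberately simpler than in `make_gt_from_line_md.py`.

```python
import re
import unicodedata

def to_gt_line(image_path: str, md_text: str) -> str:
    """Build one ground truth line: <image_path><TAB><plain text>."""
    text = re.sub(r'[`*_]+', '', md_text)      # drop emphasis and code markers
    text = re.sub(r'\s+', ' ', text).strip()   # collapse whitespace and newlines
    text = unicodedata.normalize('NFC', text)  # compose accented characters
    return f"{image_path}\t{text}"

# Hypothetical sample: two Markdown lines for a single line image
line = to_gt_line("00_images/100_page1_line001.png", "**Al nome** di Dio\n  amen")
print(repr(line))
```

The result contains exactly one tab, which is what the later validation step relies on.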

In [None]:
%%bash
# Generate ground truth from Markdown transcriptions
# Adjust paths as needed before running.
python3 make_gt_from_line_md.py \
    --images_dir "./00_images" \
    --md_dir "./01_texts" \
    --out "gt_new/ground_truth_new.txt"

In [None]:
# make_gt_from_line_md.py
#
# Generates gt/ground_truth.txt associating each image in images_dir
# with the corresponding markdown/text file in md_dir using the same
# stem (e.g. 100_page1_line001.png -> 100_page1_line001.md).
#
# Usage example:
# python3 make_gt_from_line_md.py --images_dir "./00_images" --md_dir "./01_texts" --out "./gt/ground_truth.txt"
#
# Options:
# --images_glob : pattern for images (default '*.*')
# --join_multiline : if true, concatenates all non-empty lines in .md (default True)
# --relpath : if true, writes relative paths for images in the GT file

import argparse
from pathlib import Path
import unicodedata
import re
import sys

def strip_markdown(s: str) -> str:
    """Lightweight Markdown stripping: images, links, formatting, quotes, extra spaces."""
    s = re.sub(r'!\[.*?\]\(.*?\)', '', s)    # images ![alt](url)
    s = re.sub(r'\[([^\]]+)\]\([^\)]+\)', r'\1', s)  # links [text](url) -> text
    s = re.sub(r'[`*_]{1,}', '', s)             # backticks, asterisks, underscores
    s = re.sub(r'^\s*>\s*', '', s, flags=re.M) # blockquotes
    s = re.sub(r'\s+', ' ', s)                 # collapse whitespace
    return s.strip()

def normalize(s: str) -> str:
    return unicodedata.normalize("NFC", s).strip()

def main():
    p = argparse.ArgumentParser()
    p.add_argument('--images_dir', required=True)
    p.add_argument('--md_dir', required=True)
    p.add_argument('--out', default='gt/ground_truth.txt')
    p.add_argument('--images_glob', default='*.*', help="glob pattern for images, default '*.*'")
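    p.add_argument('--join_multiline', action='store_true', default=True,
                   help='Concatenate all non-empty lines of the .md file into one (default)')
    p.add_argument('--no-join', dest='join_multiline', action='store_false',
                   help='Do not concatenate; use only the first non-empty line')
    p.add_argument('--relpath', action='store_true', default=True,
                   help='Write image paths as <images_dir>/<file name> in the GT (default)')
    # Without a switch to turn it off, --relpath could never be False.
    p.add_argument('--abspath', dest='relpath', action='store_false',
                   help='Write absolute image paths in the GT instead')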
    args = p.parse_args()

    images_dir = Path(args.images_dir)
    md_dir = Path(args.md_dir)
    outp = Path(args.out)
    outp.parent.mkdir(parents=True, exist_ok=True)

    imgs = sorted([p for p in images_dir.glob(args.images_glob) if p.is_file()])
    if not imgs:
        print("No images found in", images_dir, file=sys.stderr)
        sys.exit(1)

    missing_md = []
    written = 0
    with outp.open('w', encoding='utf-8') as fh:
        for img in imgs:
            stem = img.stem
            md_path = md_dir / (stem + '.md')
            if not md_path.exists():
                # Try other common text extensions
                found = None
                for ext in ['.txt', '.mdown', '.markdown']:
                    cand = md_dir / (stem + ext)
                    if cand.exists():
                        found = cand
                        break
                if found:
                    md_path = found
                else:
                    missing_md.append((img.name, str(md_path)))
                    continue

            text = md_path.read_text(encoding='utf-8')
            lines = [normalize(strip_markdown(l)) for l in text.splitlines() if l.strip()]
            if not lines:
                # Empty transcription file
                missing_md.append((img.name, f"{md_path} (empty)"))
                continue

            if args.join_multiline:
                txt = ' '.join(lines)
            else:
                txt = lines[0]

            txt = normalize(strip_markdown(txt))

            # Image path: relative or absolute
            if args.relpath:
                img_path_str = str(Path(args.images_dir).joinpath(img.name).as_posix())
            else:
                img_path_str = str(img.resolve())

            txt = txt.replace('\t', ' ').replace('\r', ' ').replace('\n', ' ')
            fh.write(f"{img_path_str}\t{txt}\n")
            written += 1

    print(f"Wrote {written} lines to {outp}")
    if missing_md:
        print(f"Missing or empty md for {len(missing_md)} images (showing up to 20):")
        for m in missing_md[:20]:
            print("  -", m[0], "expected", m[1])

if __name__ == '__main__':
    main()

## 2. Validation and normalization of ground truth

The next step is to validate and normalize the generated `ground_truth_new.txt`:
- Ensure each line has an image path and a transcription separated by a tab.
- Normalize whitespace and Unicode (NFC).
- Optionally apply substitutions (mapping file).
- Check for missing image files and duplicates.

The script `validate_and_normalize_gt.py` performs these checks and writes a normalized file.
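
NFC normalization matters for Italian because an accented letter can be stored either as one code point or as a base letter plus a combining accent. The two forms look identical but would count as different characters in the charset. The sketch below applies the same order of operations as the script (substitutions, whitespace collapse, NFC). The mapping from long s to `s` is a hypothetical example of a substitution file entry, not one taken from the original project.

```python
import unicodedata

def normalize_text(s: str, mapping=None) -> str:
    """Apply substitutions, collapse whitespace, then normalize to NFC."""
    for src, dst in (mapping or {}).items():
        s = s.replace(src, dst)
    s = " ".join(s.split())
    return unicodedata.normalize("NFC", s)

decomposed = "perche\u0301"                       # 'e' followed by a combining acute accent
composed = unicodedata.normalize("NFC", decomposed)
print(len(decomposed), len(composed))             # 7 6
print(composed == "perch\u00e9")                  # True

# Hypothetical substitution: long s (U+017F) mapped to a plain 's'
print(normalize_text("  perche\u0301   la  ca\u017fa ", {"\u017f": "s"}))
```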

In [None]:
%%bash
# Validate and normalize ground truth
python3 validate_and_normalize_gt.py \
    --gt "gt_new/ground_truth_new.txt" \
    --root "." \
    --out "gt_new/ground_truth_new_normalized.txt"

In [None]:
# validate_and_normalize_gt.py
#
# Verifies and normalizes an existing ground_truth.txt.
# Produces a cleaned version and a console report of problems found.

import argparse
import sys
import unicodedata
from pathlib import Path
from collections import defaultdict

def load_map(path: str):
    m = {}
    if not path:
        return m
    with open(path, encoding='utf-8') as fh:  # close the mapping file deterministically
        raw_lines = fh.readlines()
    for ln in raw_lines:
        ln = ln.rstrip('\n')
        if not ln or ln.startswith('#'):
            continue
        if '\t' in ln:
            a, b = ln.split('\t', 1)
        else:
            parts = ln.split(None, 1)
            if len(parts) != 2:
                continue
            a, b = parts
        m[a] = b
    return m

def apply_map(s: str, mapping: dict) -> str:
    for a, b in mapping.items():
        s = s.replace(a, b)
    return s

def normalize_text(s: str) -> str:
    s = s.strip()
    s = ' '.join(s.split())  # collapse whitespace
    s = unicodedata.normalize('NFC', s)
    return s

def main():
    p = argparse.ArgumentParser()
    p.add_argument('--gt', required=True)
    p.add_argument('--root', default='.', help='Root dir to resolve image paths')
    p.add_argument('--out', default=None, help='Output normalized GT file (optional)')
    p.add_argument('--map', default=None, help='Optional substitutions file (from<TAB>to)')
    args = p.parse_args()

    mapping = load_map(args.map)
    root = Path(args.root)
    gt = Path(args.gt)
    if not gt.exists():
        print('gt file not found', gt, file=sys.stderr)
        sys.exit(1)

    missing = []
    empty = []
    bad_lines = []
    duplicates = defaultdict(int)
    total = 0
    normalized_lines = []

    for i, ln in enumerate(gt.read_text(encoding='utf-8').splitlines(), start=1):
        if not ln.strip():
            continue
        if '\t' not in ln:
            bad_lines.append((i, ln))
            continue
        path, txt = ln.split('\t', 1)
        path = path.strip()
        txt = txt.strip()
        txt = apply_map(txt, mapping)
        txt = normalize_text(txt)
        if txt == '':
            empty.append((i, path))
        pth = (root / path) if not Path(path).is_absolute() else Path(path)
        if not pth.exists():
            missing.append((i, path))
        duplicates[path] += 1
        normalized_lines.append((path, txt))
        total += 1

    print(f"Total lines read: {total}")
    if bad_lines:
        print(f"Lines without tab: {len(bad_lines)} (showing up to 10):")
        for i, ln in bad_lines[:10]:
            print(i, ln[:200])
    if empty:
        print(f"Empty transcriptions: {len(empty)}")
    if missing:
        print(f"Missing image files: {len(missing)} (showing up to 10):")
        for i, pth in missing[:10]:
            print(i, pth)
    dup_list = [p for p, c in duplicates.items() if c > 1]
    if dup_list:
        print(f"Duplicate image entries: {len(dup_list)} (showing up to 10):")
        for pth in dup_list[:10]:
            print(pth, duplicates[pth])

    if args.out:
        outp = Path(args.out)
        outp.parent.mkdir(parents=True, exist_ok=True)
        with outp.open('w', encoding='utf-8') as fh:
            for path, txt in normalized_lines:
                txt_clean = txt.replace('\t', ' ').replace('\r', ' ').replace('\n', ' ')
                fh.write(f"{path}\t{txt_clean}\n")
        print(f"Wrote normalized ground truth to {outp}")

if __name__ == '__main__':
    main()

## 3. Creating splits and charset

Once the normalized ground truth is available, we:
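- Group entries by the part of the file stem before the first `--page-sep`, so that lines sharing that prefix never end up in different splits. For a stem such as `100_page1_line001` and the separator `_`, the key is `100`, which is coarser than a single page.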
- Split into train/validation/test sets.
- Extract the charset (set of characters actually used).

This is done by `split_and_charset.py`.
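
The grouping key decides what "leakage" means, so it is worth checking on a real file name. The sketch below reproduces the key rule used by `group_entries` and contrasts it with a hypothetical alternative, `page_key`, which keeps the first two tokens of the stem. `page_key` is an assumption for illustration and is not part of the original script.

```python
from pathlib import Path

def group_key(img: str, page_sep: str = "_") -> str:
    """Same rule as group_entries: stem prefix before the first separator."""
    stem = Path(img).stem
    if page_sep and page_sep in stem:
        return stem.split(page_sep, 1)[0]
    return str(Path(img).parent)   # fallback when the separator is absent

def page_key(img: str, page_sep: str = "_") -> str:
    """Hypothetical alternative: keep the first two tokens of the stem."""
    parts = Path(img).stem.split(page_sep)
    return page_sep.join(parts[:2])

print(group_key("00_images/100_page1_line001.png"))  # 100
print(page_key("00_images/100_page1_line001.png"))   # 100_page1
```

A coarser key gives stronger protection against leakage but fewer groups to distribute, which can make small validation and test fractions harder to fill.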

In [None]:
%%bash
# Create train/val/test splits and charset
python3 split_and_charset.py \
    gt_new/ground_truth_new_normalized.txt \
    --out splits_new \
    --val 0.05 \
    --test 0.05 \
    --group-by-page \
    --page-sep "_"

In [None]:
# split_and_charset.py
#
# Reads a ground truth file (path<TAB>transcription per line),
# normalizes text, extracts the charset, and creates train/val/test splits.

import argparse
import unicodedata
import random
from pathlib import Path
from collections import defaultdict

def normalize_text(s: str) -> str:
    return unicodedata.normalize("NFC", s).strip()

def read_ground_truth(gt_path: str):
    entries = []
    gt_path = Path(gt_path)
    if not gt_path.exists():
        raise FileNotFoundError(f"Ground truth file not found: {gt_path}")
    with gt_path.open("r", encoding="utf-8") as fh:
        for lineno, line in enumerate(fh, 1):
            raw = line.rstrip("\n")
            if not raw.strip():
                continue
            if "\t" in raw:
                img, txt = raw.split("\t", 1)
            else:
                parts = raw.split(None, 1)
                if len(parts) != 2:
                    print(f"Warning: skipping line {lineno} (unrecognized format): {raw}")
                    continue
                img, txt = parts
            txt = normalize_text(txt)
            if txt == "":
                print(f"Warning: empty transcription at line {lineno}, skipping.")
                continue
            entries.append((img, txt))
    return entries

def group_entries(entries, group_by_page: bool, page_sep: str):
    if not group_by_page:
        return [[e] for e in entries]
    groups = defaultdict(list)
    for img, txt in entries:
        stem = Path(img).stem
        if page_sep and page_sep in stem:
            key = stem.split(page_sep, 1)[0]
        else:
            parent = str(Path(img).parent)
            key = parent
        groups[key].append((img, txt))
    return list(groups.values())

def split_groups(groups, val_frac: float, test_frac: float, seed: int):
    random.seed(seed)
    random.shuffle(groups)
    total = sum(len(g) for g in groups)
    n_val = int(total * val_frac + 0.5)
    n_test = int(total * test_frac + 0.5)
    n_train = total - n_val - n_test

    train, val, test = [], [], []
    counts = {"train": 0, "val": 0, "test": 0}

    for g in groups:
        deficits = {
            "train": n_train - counts["train"],
            "val": n_val - counts["val"],
            "test": n_test - counts["test"]
        }
        pick = max(deficits.items(), key=lambda x: (x[1], x[0]))[0]
        if pick == "train":
            train.extend(g); counts["train"] += len(g)
        elif pick == "val":
            if counts["val"] + len(g) <= n_val:
                val.extend(g); counts["val"] += len(g)
            else:
                train.extend(g); counts["train"] += len(g)
        else:
            if counts["test"] + len(g) <= n_test:
                test.extend(g); counts["test"] += len(g)
            else:
                train.extend(g); counts["train"] += len(g)
    return train, val, test

def write_split(out_dir: str, name: str, entries):
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    path = out / f"{name}.txt"
    with path.open("w", encoding="utf-8") as fh:
        for img, txt in entries:
            fh.write(f"{img}\t{txt}\n")
    print(f"Wrote {len(entries)} lines to {path}")

def extract_charset(entries, out_dir: str):
    chars = set()
    for _, txt in entries:
        chars.update(txt)
    chars.discard("\n"); chars.discard("\r")
    chars = sorted(chars)
    out = Path(out_dir); out.mkdir(parents=True, exist_ok=True)
    charset_path = out / "charset.txt"
    with charset_path.open("w", encoding="utf-8") as fh:
        for c in chars:
            fh.write(c + "\n")
    with (out / "charset_one_line.txt").open("w", encoding="utf-8") as fh:
        fh.write("".join(chars))
    print(f"Extracted {len(chars)} unique characters -> {charset_path}")

def main():
    p = argparse.ArgumentParser(description="Generate splits and charset from a ground truth file")
    p.add_argument("gt", help="Path to the ground truth file (path<TAB>transcription per line)")
    p.add_argument("--out", "-o", default="splits", help="Output directory")
    p.add_argument("--val", type=float, default=0.05, help="Validation fraction (default 0.05)")
    p.add_argument("--test", type=float, default=0.05, help="Test fraction (default 0.05)")
    p.add_argument("--seed", type=int, default=42, help="Seed for the random shuffle")
    p.add_argument("--group-by-page", action="store_true",
                   help="Avoid leakage by keeping lines of the same page in the same split")
    p.add_argument("--page-sep", default="_",
                   help="Separator used to extract the page ID from the file stem")
    args = p.parse_args()

    entries = read_ground_truth(args.gt)
    if not entries:
        print("No entries found in the ground truth. Exiting.")
        return
    print(f"Read {len(entries)} lines from the ground truth.")
    groups = group_entries(entries, args.group_by_page, args.page_sep)
    print(f"Grouped into {len(groups)} groups (group_by_page={args.group_by_page}).")
    train, val, test = split_groups(groups, args.val, args.test, args.seed)
    write_split(args.out, "train", train)
    write_split(args.out, "val", val)
    write_split(args.out, "test", test)
    write_split(args.out, "ground_truth_all", train + val + test)
    extract_charset(entries, args.out)
    print("Done.")

if __name__ == "__main__":
    main()

Quick sanity check on the generated splits:

In [None]:
%%bash
# Check number of lines in each split and show first few entries from train
wc -l splits_new/*.txt || true
echo
sed -n '1,5p' splits_new/train.txt || true
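
Line counts alone do not show whether the grouping worked. A complementary check, sketched here under the assumption that the group key is the stem prefix before the first `_`, reports any key that appears in more than one split. The helper names `find_leaks` and `load_paths` are illustrative.

```python
from collections import defaultdict
from pathlib import Path

def find_leaks(splits, sep="_"):
    """Return group keys that occur in more than one split."""
    seen = defaultdict(set)
    for name, paths in splits.items():
        for p in paths:
            seen[Path(p).stem.split(sep, 1)[0]].add(name)
    return {k: sorted(v) for k, v in seen.items() if len(v) > 1}

def load_paths(split_file):
    """Read the image column of a 'path<TAB>text' split file, if it exists."""
    f = Path(split_file)
    if not f.exists():
        return []
    lines = f.read_text(encoding="utf-8").splitlines()
    return [ln.split("\t", 1)[0] for ln in lines if ln.strip()]

splits = {n: load_paths(f"splits_new/{n}.txt") for n in ("train", "val", "test")}
print("Groups present in more than one split:", find_leaks(splits))
```

An empty dictionary means no key is shared between splits.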

## 4. Image preprocessing pipeline

Ketos expects line images in a consistent format. We use ImageMagick (`convert`) to:
- deskew lines,
- convert to grayscale,
- resize to a fixed height (64 px here),
- center the line on a white canvas of fixed height (`-extent 0x64` leaves the width unchanged).

The following shell snippet processes all line images and writes them into `processed/lines/`.

In [None]:
%%bash
# Preprocess line images for ketos (deskew, grayscale, resize, center)
mkdir -p processed/lines

for f in 00_images/*.png; do
    base=$(basename "$f")
    convert "$f" \
        -deskew 40% \
        -colorspace Gray \
        -resize x64 \
        -background white -gravity center -extent 0x64 \
        "processed/lines/$base"
done

After preprocessing, we need to update the image paths in the split files to point to `processed/lines/` instead of `00_images/`.

In [None]:
%%bash
# Update image paths in split files to use processed/lines instead of 00_images
sed 's|^00_images/|processed/lines/|' splits_new/train.txt > splits_new/train_proc.txt
sed 's|^00_images/|processed/lines/|' splits_new/val.txt   > splits_new/val_proc.txt
sed 's|^00_images/|processed/lines/|' splits_new/test.txt  > splits_new/test_proc.txt
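
The `sed` expressions only match paths that start exactly with `00_images/`. Entries written as `./00_images/...` or as absolute paths would pass through unchanged. A minimal Python sketch of a more tolerant rewrite, which keeps only the file name and re-roots it, is given below. The function `retarget` is an illustrative assumption, not part of the original pipeline.

```python
from pathlib import Path

def retarget(line: str, new_dir: str = "processed/lines") -> str:
    """Point the image column of one 'path<TAB>text' line at new_dir, keeping the file name."""
    img, sep, txt = line.partition("\t")
    return f"{Path(new_dir, Path(img).name).as_posix()}{sep}{txt}"

print(retarget("./00_images/100_page1_line001.png\tAl nome di Dio"))
# processed/lines/100_page1_line001.png<TAB>Al nome di Dio
```

This assumes file names are unique across source directories, which holds when all line images live in `00_images/`.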

## 5. Extracting image lists and creating sidecar files

Kraken/ketos training expects either:
- split files with `image\ttext`, or
- image lists plus separate `.gt.txt` files per image.

Here we generate:
- simple image lists (one image path per line) for train/val/test;
- sidecar `.gt.txt` files containing the transcription for each image.
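
As a small illustration of the naming convention, the sketch below derives the image list entry, the sidecar path, and the sidecar content from one split line without touching the disk. `plan_sidecar` is a hypothetical helper and the sample line is invented for the example.

```python
from pathlib import Path

def plan_sidecar(split_line: str):
    """Return (image path, sidecar path, sidecar content) for one 'path<TAB>text' line."""
    img, txt = split_line.rstrip("\n").split("\t", 1)
    sidecar = Path(img).with_suffix(".gt.txt")   # foo.png -> foo.gt.txt
    return img, sidecar.as_posix(), txt + "\n"

print(plan_sidecar("processed/lines/100_page1_line001.png\tAl nome di Dio\n"))
```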

In [None]:
%%bash
# Extract only image paths from the processed splits (train/val/test)
awk -F'\t' '{print $1}' splits_new/train_proc.txt > splits_new/train_images.txt
awk -F'\t' '{print $1}' splits_new/val_proc.txt   > splits_new/val_images.txt
awk -F'\t' '{print $1}' splits_new/test_proc.txt  > splits_new/test_images.txt

In [None]:
# Inline script to create sidecar .gt.txt files from split files
from pathlib import Path

def write_sidecars(split_path: str) -> int:
    p = Path(split_path)
    created = 0
    with p.open(encoding='utf-8') as f:
        for i, line in enumerate(f, 1):
            line = line.rstrip('\n')
            if not line.strip():
                continue
            try:
                img, txt = line.split('\t', 1)
            except ValueError:
                print(f"[WARN] line {i} without TAB in {split_path}: {line[:120]}...")
                continue
            img_path = Path(img)
            if not img_path.exists():
                print(f"[WARN] missing image at line {i}: {img_path}")
                continue
            sidecar = img_path.with_suffix('.gt.txt')  # e.g. foo.png -> foo.gt.txt
            text = txt.strip('\r\n')
            sidecar.write_text(text + '\n', encoding='utf-8')
            created += 1
    return created

total = 0
for sp in ['splits_new/train_proc.txt', 'splits_new/val_proc.txt', 'splits_new/test_proc.txt']:
    if Path(sp).exists():
        c = write_sidecars(sp)
        print(f"[OK] {sp}: created {c} sidecar files")
        total += c
print("Total new sidecars created:", total)

In [None]:
# Verification: check that each image in the image lists has a corresponding .gt.txt sidecar
from pathlib import Path

missing = []
for lst in ['splits_new/train_images.txt', 'splits_new/val_images.txt', 'splits_new/test_images.txt']:
    p = Path(lst)
    if not p.exists():
        continue
    for l in p.read_text(encoding='utf-8').splitlines():
        img = Path(l.strip())
        if not img.exists():
            continue
        gt = img.with_suffix('.gt.txt')
        if not gt.exists():
            missing.append(str(img))

print("Total images without sidecar:", len(missing))
if missing[:10]:
    print("First missing:", missing[:10])

## 6. Training the Kraken model with ketos

Finally, we launch training using `ketos train`, starting from the pre‑trained
Latin manuscript model `Tridis_Medieval_EarlyModern.mlmodel` and fine‑tuning it on our Italian dataset.

> **Note:** This command is intended to be run in a shell inside the conda environment where Kraken/ketos are installed. Adjust paths and hyperparameters as needed for your setup.

In [None]:
%%bash
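# Fine-tune Kraken model with ketos
# Run this in a terminal within the environment where kraken/ketos is installed.
# Note: ketos uses the -o value as a file name prefix for the checkpoints it writes,
# so the resulting model file names extend the value given below.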
ketos train \
    -f path \
    -i "models/Tridis_Medieval_EarlyModern.mlmodel" \
    --resize union \
    -q early \
    -N 40 \
    --min-epochs 5 \
    --lag 10 \
    -B 4 \
    -r 5e-5 \
    -o models/italian_finetuned.mlmodel_best.mlmodel \
    -t splits_new/train_images.txt \
    -e splits_new/val_images.txt