# Italian Manuscripts OCR with Kraken

This notebook documents the end-to-end process used to create and fine-tune a Kraken OCR model for Italian handwritten manuscripts. It follows the same logical structure as the LaTeX report:

- Preparation of line images and Markdown transcriptions
- Generation and normalization of the ground truth file
- Creation of train/validation/test splits and charset
- Image preprocessing pipeline
- Creation of sidecar `.gt.txt` files
- Final training command for Kraken via `ketos`

The notebook is mainly about **reproducibility and documentation**. Some paths and commands are taken from the original environment and may need to be adapted to your own directory layout and platform.

In this version, the code is **more heavily commented** to clarify both:
- **what** each block does (step by step), and
- **why** it is needed in the context of building a Kraken model for Italian manuscripts.

## Environment and prerequisites

The work was carried out on a Linux machine (Pop!_OS) with an NVIDIA GPU and CUDA support. The following assumptions are made:

- You have a working Python 3 environment (e.g. via `conda`).
- `kraken` and `ketos` are installed in this environment (e.g. `pip install kraken`).
- You have the following directory structure (or equivalent):
  - `00_images/` – line images extracted from scanned pages (after ScanTailor or similar tools)
  - `01_texts/` – Markdown files with manual transcriptions, one per line image
  - `gt_new/` – folder where ground truth files will be written
  - `splits_new/` – folder for train/val/test splits
  - `processed/lines/` – folder for preprocessed line images used for training

The main goal is to go from **raw scans + Markdown transcriptions** to a dataset that Kraken/ketos can use to fine‑tune an OCR model.

## 1. Generating ground truth from Markdown files

We start from:
- `00_images/` containing line images (e.g. `100_page1_line001.png`)
- `01_texts/` containing Markdown transcriptions with matching stems (e.g. `100_page1_line001.md`)

The script `make_gt_from_line_md.py` generates a tab-separated `ground_truth.txt` where each line contains:

```text
<image_path>\t<normalized_plain_text>
```

### Why this step?

Kraken/ketos expects a **ground truth file** that links each image to its textual transcription. Your transcriptions are more convenient to edit and read in Markdown, but for training we need **plain text without formatting**. This script:

- finds, for each image, the corresponding `.md` (or `.txt` etc.) file;
- strips basic Markdown formatting (images, links, emphasis and code markers, blockquote prefixes);
- optionally joins multiple lines into a single line of text;
- writes one line per image, with path and transcription separated by a tab.
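
Before running the script it can help to list the images that have no transcription yet. The following is a minimal sketch, not part of the original scripts: `find_unpaired` is a hypothetical helper that pairs files by stem and tries the same extensions as `make_gt_from_line_md.py`. The directory names in the usage part are the ones assumed in this notebook.

```python
from pathlib import Path

# Extensions tried for a transcription, in the same order as make_gt_from_line_md.py.
TEXT_EXTS = (".md", ".txt", ".mdown", ".markdown")

def find_unpaired(images_dir: str, texts_dir: str) -> list[str]:
    """Return the names of images that have no transcription with the same stem."""
    texts = Path(texts_dir)
    unpaired = []
    for img in sorted(Path(images_dir).glob("*.*")):
        if not img.is_file():
            continue
        if not any((texts / (img.stem + ext)).exists() for ext in TEXT_EXTS):
            unpaired.append(img.name)
    return unpaired

if __name__ == "__main__":
    # Directory names follow the layout described above; adjust as needed.
    for name in find_unpaired("00_images", "01_texts"):
        print("no transcription for", name)
```

An empty result means every image has a candidate transcription file; it does not check whether that file has usable content.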

In [None]:
%%bash
# Generate ground truth from Markdown transcriptions.
#
# This command scans:
#   - ./00_images  for line images
#   - ./01_texts   for corresponding .md files
# and writes a tab-separated file gt_new/ground_truth_new.txt.
#
# You should adjust the paths if your project lives in a different directory layout.

python3 make_gt_from_line_md.py \
    --images_dir "./00_images" \
    --md_dir "./01_texts" \
    --out "gt_new/ground_truth_new.txt"

In [None]:
# make_gt_from_line_md.py
#
# PURPOSE
# -------
# Build a ground truth file in the format expected by Kraken/ketos:
#
#   <image_path>\t<plain_text_transcription>
#
# starting from:
#   - a directory of images (one image per line of text)
#   - a directory of Markdown files with the same stems as the images.
#
# This allows you to keep your human-readable transcriptions in Markdown,
# while still producing the simple text format needed for OCR training.

import argparse
from pathlib import Path
import unicodedata
import re
import sys

def strip_markdown(s: str) -> str:
    """Remove basic Markdown constructs from a string.

    This is *not* a full Markdown parser, but it's enough for typical
    line-level transcriptions:
    - remove images:           ![alt](url)
    - unwrap links:            [text](url) -> text
    - remove inline markers:   `code`, *emphasis*, _emphasis_
    - remove leading > in quotes
    - collapse whitespace.
    """
    # Remove images
    s = re.sub(r'!\[.*?\]\(.*?\)', '', s)
    # Unwrap links [text](url) -> text
    s = re.sub(r'\[([^\]]+)\]\([^\)]+\)', r'\1', s)
    # Remove simple formatting markers
    s = re.sub(r'[`*_]{1,}', '', s)
    # Remove blockquote markers at line start
    s = re.sub(r'^\s*>\s*', '', s, flags=re.M)
    # Normalize spaces
    s = re.sub(r'\s+', ' ', s)
    return s.strip()

def normalize(s: str) -> str:
    """Normalize Unicode (NFC) and trim surrounding whitespace.

    Using NFC ensures that accented characters are stored in a canonical form,
    which is important when you later build the charset and train the model.
    """
    return unicodedata.normalize("NFC", s).strip()

def main():
    parser = argparse.ArgumentParser(description="Generate ground truth from Markdown line files.")
    parser.add_argument('--images_dir', required=True, help='Directory containing line images')
    parser.add_argument('--md_dir', required=True, help='Directory containing Markdown/text transcriptions')
    parser.add_argument('--out', default='gt/ground_truth.txt', help='Output GT file')
    parser.add_argument('--images_glob', default='*.*', help="Glob pattern for images, default '*.*'")
    parser.add_argument('--join_multiline', action='store_true', default=True,
                        help='If set, concatenates all non-empty lines of the .md file')
    parser.add_argument('--no-join', dest='join_multiline', action='store_false',
                        help='Do not concatenate: use only the first non-empty line')
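    parser.add_argument('--relpath', action='store_true', default=True,
                        help='Write image paths as <images_dir>/<file name> (default)')
    parser.add_argument('--abspath', dest='relpath', action='store_false',
                        help='Write absolute image paths instead')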
    args = parser.parse_args()

    images_dir = Path(args.images_dir)
    md_dir = Path(args.md_dir)
    outp = Path(args.out)
    outp.parent.mkdir(parents=True, exist_ok=True)

    # Collect all files in images_dir that match the glob pattern
    imgs = sorted([p for p in images_dir.glob(args.images_glob) if p.is_file()])
    if not imgs:
        print("No images found in", images_dir, file=sys.stderr)
        sys.exit(1)

    missing_md = []  # track images without a matching transcription
    written = 0      # how many GT lines we actually write

    with outp.open('w', encoding='utf-8') as fh:
        for img in imgs:
            stem = img.stem
            # Expected Markdown file: same stem + .md
            md_path = md_dir / (stem + '.md')
            if not md_path.exists():
                # Try some alternative text extensions if .md is not present
                found = None
                for ext in ['.txt', '.mdown', '.markdown']:
                    cand = md_dir / (stem + ext)
                    if cand.exists():
                        found = cand
                        break
                if found:
                    md_path = found
                else:
                    # No transcription found: record and skip this image
                    missing_md.append((img.name, str(md_path)))
                    continue

            text = md_path.read_text(encoding='utf-8')
            # Split into non-empty lines and strip Markdown on each line
            lines = [normalize(strip_markdown(l)) for l in text.splitlines() if l.strip()]
            if not lines:
                # Transcription file exists but has no usable content
                missing_md.append((img.name, f"{md_path} (empty)"))
                continue

            # Either concatenate all lines or take the first one
            if args.join_multiline:
                txt = ' '.join(lines)
            else:
                txt = lines[0]

            txt = normalize(strip_markdown(txt))

            # Decide how to write the image path in the GT file
            if args.relpath:
                # Relative to the images_dir, but we store the prefix (e.g. "00_images/...")
                img_path_str = str(Path(args.images_dir).joinpath(img.name).as_posix())
            else:
                img_path_str = str(img.resolve())

            # Ensure the text has no embedded tabs/newlines, because we use TAB as separator
            txt = txt.replace('\t', ' ').replace('\r', ' ').replace('\n', ' ')
            fh.write(f"{img_path_str}\t{txt}\n")
            written += 1

    print(f"Wrote {written} lines to {outp}")
    if missing_md:
        print(f"Missing or empty md for {len(missing_md)} images (showing up to 20):")
        for m in missing_md[:20]:
            print("  -", m[0], "expected", m[1])

if __name__ == '__main__':
    main()

## 2. Validation and normalization of ground truth

After generating `ground_truth_new.txt`, we want to:

- check that each line has a path and text separated by a tab;
- normalize whitespace and Unicode (NFC) again, in a consistent way;
- optionally apply substitutions (e.g. aligning different variants of characters);
- detect missing image files and duplicate entries.

This step acts as a **sanity check** and produces a cleaned `ground_truth_new_normalized.txt` file that will be used to create train/val/test splits.
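
Why NFC normalization matters can be seen on a single word. The example below is only an illustration (the sample word is invented, not taken from the dataset): the same accented letter can be stored as one precomposed code point or as a base letter plus a combining accent, and without normalization the two spellings would add different entries to the charset.

```python
import unicodedata

composed = "perch\u00e9"      # 'é' as one precomposed code point (U+00E9)
decomposed = "perche\u0301"   # 'e' followed by a combining acute accent (U+0301)

# The two strings render the same but differ as code-point sequences.
print(composed == decomposed)            # False
print(len(composed), len(decomposed))    # 6 7

# After NFC normalization the decomposed form collapses to the precomposed one.
nfc = unicodedata.normalize("NFC", decomposed)
print(nfc == composed)                   # True
```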

In [None]:
%%bash
# Validate and normalize the newly created ground truth file.
#
# --gt   : input GT file from the previous step
# --root : base directory used to resolve relative image paths
# --out  : path to the normalized GT file

python3 validate_and_normalize_gt.py \
    --gt "gt_new/ground_truth_new.txt" \
    --root "." \
    --out "gt_new/ground_truth_new_normalized.txt"

In [None]:
# validate_and_normalize_gt.py
#
# PURPOSE
# -------
# Load an existing ground_truth.txt, check its consistency, normalize
# transcriptions, and (optionally) write a cleaned version.
#
# This helps catch:
#   - lines without TAB separator
#   - empty transcriptions
#   - missing image files
#   - duplicate entries for the same image path.

import argparse
import sys
import unicodedata
from pathlib import Path
from collections import defaultdict

def load_map(path: str):
    """Load optional replacement rules from a text file.

    The mapping file contains lines of the form:
        from<TAB>to
    or (as a fallback) two whitespace-separated fields.
    This can be used to unify characters or fix common issues.
    """
    m = {}
    if not path:
        return m
    # Use a context manager so the mapping file is closed after reading
    with open(path, encoding='utf-8') as fh:
        for ln in fh:
            ln = ln.rstrip('\n')
            if not ln or ln.startswith('#'):
                continue
            if '\t' in ln:
                a, b = ln.split('\t', 1)
            else:
                parts = ln.split(None, 1)
                if len(parts) != 2:
                    continue
                a, b = parts
            m[a] = b
    return m

def apply_map(s: str, mapping: dict) -> str:
    """Apply all substitution rules in `mapping` to the string `s`."""
    for a, b in mapping.items():
        s = s.replace(a, b)
    return s

def normalize_text(s: str) -> str:
    """Trim, collapse whitespace, and normalize Unicode to NFC."""
    s = s.strip()
    s = ' '.join(s.split())  # collapse multiple spaces
    s = unicodedata.normalize('NFC', s)
    return s

def main():
    parser = argparse.ArgumentParser(description="Validate and normalize a ground_truth file.")
    parser.add_argument('--gt', required=True, help='Input ground truth file')
    parser.add_argument('--root', default='.', help='Root dir to resolve image paths')
    parser.add_argument('--out', default=None, help='Output normalized GT file (optional)')
    parser.add_argument('--map', default=None, help='Optional substitutions file (from<TAB>to)')
    args = parser.parse_args()

    mapping = load_map(args.map)
    root = Path(args.root)
    gt = Path(args.gt)
    if not gt.exists():
        print('gt file not found', gt, file=sys.stderr)
        sys.exit(1)

    missing = []      # lines whose image file does not exist
    empty = []        # lines with empty transcription after normalization
    bad_lines = []    # lines without TAB separator
    duplicates = defaultdict(int)
    total = 0
    normalized_lines = []

    for i, ln in enumerate(gt.read_text(encoding='utf-8').splitlines(), start=1):
        if not ln.strip():
            continue
        if '\t' not in ln:
            bad_lines.append((i, ln))
            continue
        path, txt = ln.split('\t', 1)
        path = path.strip()
        txt = txt.strip()
        # Apply substitutions (if any) before final normalization
        txt = apply_map(txt, mapping)
        txt = normalize_text(txt)
        if txt == '':
            empty.append((i, path))

        # Resolve path relative to root (unless already absolute)
        pth = (root / path) if not Path(path).is_absolute() else Path(path)
        if not pth.exists():
            missing.append((i, path))

        duplicates[path] += 1
        normalized_lines.append((path, txt))
        total += 1

    # Console report of what we found
    print(f"Total lines read: {total}")
    if bad_lines:
        print(f"Lines without tab: {len(bad_lines)} (showing up to 10):")
        for i, ln in bad_lines[:10]:
            print(i, ln[:200])
    if empty:
        print(f"Empty transcriptions: {len(empty)}")
    if missing:
        print(f"Missing image files: {len(missing)} (showing up to 10):")
        for i, pth in missing[:10]:
            print(i, pth)
    dup_list = [p for p, c in duplicates.items() if c > 1]
    if dup_list:
        print(f"Duplicate image entries: {len(dup_list)} (showing up to 10):")
        for pth in dup_list[:10]:
            print(pth, duplicates[pth])

    # Optionally write the normalized file, ready for splitting
    if args.out:
        outp = Path(args.out)
        outp.parent.mkdir(parents=True, exist_ok=True)
        with outp.open('w', encoding='utf-8') as fh:
            for path, txt in normalized_lines:
                txt_clean = txt.replace('\t', ' ').replace('\r', ' ').replace('\n', ' ')
                fh.write(f"{path}\t{txt_clean}\n")
        print(f"Wrote normalized ground truth to {outp}")

if __name__ == '__main__':
    main()

## 3. Creating splits and charset

Once the normalized ground truth is available, we:

- group entries by page (to avoid having lines from the same page in both train and val/test);
- split into train/validation/test sets with specified proportions;
- extract the set of characters actually used (charset), which Kraken uses to define the output layer.

This is done by `split_and_charset.py`. Grouping by page is especially important when the same physical page is split into many line images: we don't want the same handwriting style from the very same page to appear simultaneously in training and validation.
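
How the page id is derived depends on the file naming, so it is worth checking on a couple of real file names. The sketch below reproduces the key used by `group_entries` (`page_key_first`) and compares it with a hypothetical alternative, `page_key_last`, which is not part of the original script and drops only the last component of the stem.

```python
from pathlib import Path

def page_key_first(path: str, sep: str = "_") -> str:
    """Key used by split_and_charset.py: everything before the first separator."""
    stem = Path(path).stem
    return stem.split(sep, 1)[0] if sep and sep in stem else str(Path(path).parent)

def page_key_last(path: str, sep: str = "_") -> str:
    """Hypothetical alternative: drop only the last component (the line id)."""
    stem = Path(path).stem
    return stem.rsplit(sep, 1)[0] if sep and sep in stem else str(Path(path).parent)

for name in ["00_images/0001_01.png", "00_images/100_page1_line001.png"]:
    print(name, "->", page_key_first(name), "|", page_key_last(name))
# 00_images/0001_01.png -> 0001 | 0001
# 00_images/100_page1_line001.png -> 100 | 100_page1
```

With names such as `100_page1_line001.png`, the first-separator key is `100`, so every line sharing that prefix lands in one group. That is still leak-free, but it is coarser than one page if the prefix identifies a document rather than a page.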

In [None]:
%%bash
# Create train/val/test splits and extract charset from the normalized GT.
#
# --val  : fraction of examples reserved for validation
# --test : fraction for test
# --group-by-page : keep lines from the same page together in the same split
# --page-sep      : character used in filenames to separate page id from line id

python3 split_and_charset.py \
    gt_new/ground_truth_new_normalized.txt \
    --out splits_new \
    --val 0.05 \
    --test 0.05 \
    --group-by-page \
    --page-sep "_"

In [None]:
# split_and_charset.py
#
# PURPOSE
# -------
# Take a normalized ground truth file and:
#   - optionally group lines by page
#   - split them into train/val/test
#   - save the splits as separate files
#   - extract the character set used in all transcriptions.

import argparse
import unicodedata
import random
from pathlib import Path
from collections import defaultdict

def normalize_text(s: str) -> str:
    """Normalize Unicode (NFC) and trim.

    This is applied again in case the GT file still contains some
    inconsistencies or extra whitespace.
    """
    return unicodedata.normalize("NFC", s).strip()

def read_ground_truth(gt_path: str):
    """Read a ground truth file and return a list of (image_path, text) pairs.

    Each line is expected to be:
        <path>\t<transcription>
    Lines without a valid format or with empty text are skipped.
    """
    entries = []
    gt_path = Path(gt_path)
    if not gt_path.exists():
        raise FileNotFoundError(f"Ground truth file not found: {gt_path}")
    with gt_path.open("r", encoding="utf-8") as fh:
        for lineno, line in enumerate(fh, 1):
            raw = line.rstrip("\n")
            if not raw.strip():
                continue
            if "\t" in raw:
                img, txt = raw.split("\t", 1)
            else:
                # As a fallback, split on first whitespace.
                parts = raw.split(None, 1)
                if len(parts) != 2:
                    print(f"Warning: skipping line {lineno} (unrecognized format): {raw}")
                    continue
                img, txt = parts
            txt = normalize_text(txt)
            if txt == "":
                print(f"Warning: empty transcription at line {lineno}, skipping.")
                continue
            entries.append((img, txt))
    return entries

def group_entries(entries, group_by_page: bool, page_sep: str):
    """Group entries by page (or return each entry as its own group).

    When group_by_page is True, the page id is extracted from the filename
    stem using page_sep (e.g. '0001_01.png' -> page '0001'). This ensures all
    lines from the same page go to the same split.
    """
    if not group_by_page:
        return [[e] for e in entries]
    groups = defaultdict(list)
    for img, txt in entries:
        stem = Path(img).stem
        if page_sep and page_sep in stem:
            key = stem.split(page_sep, 1)[0]
        else:
            # Fallback: group by parent directory name
            parent = str(Path(img).parent)
            key = parent
        groups[key].append((img, txt))
    return list(groups.values())

def split_groups(groups, val_frac: float, test_frac: float, seed: int):
    """Split groups into train/val/test according to the given fractions.

    Groups are shuffled and then greedily assigned to the split that has the
    largest remaining deficit. This way we keep all lines from the same page
    together, while approximating the desired split ratios.
    """
    random.seed(seed)
    random.shuffle(groups)
    total = sum(len(g) for g in groups)
    n_val = int(total * val_frac + 0.5)
    n_test = int(total * test_frac + 0.5)
    n_train = total - n_val - n_test

    train, val, test = [], [], []
    counts = {"train": 0, "val": 0, "test": 0}

    for g in groups:
        deficits = {
            "train": n_train - counts["train"],
            "val": n_val - counts["val"],
            "test": n_test - counts["test"]
        }
        # Choose the split with the largest remaining deficit (ties are broken
        # by name). A val/test pick that would overshoot its target goes to train.
        pick = max(deficits.items(), key=lambda x: (x[1], x[0]))[0]
        if pick == "train":
            train.extend(g); counts["train"] += len(g)
        elif pick == "val":
            if counts["val"] + len(g) <= n_val:
                val.extend(g); counts["val"] += len(g)
            else:
                train.extend(g); counts["train"] += len(g)
        else:
            if counts["test"] + len(g) <= n_test:
                test.extend(g); counts["test"] += len(g)
            else:
                train.extend(g); counts["train"] += len(g)
    return train, val, test

def write_split(out_dir: str, name: str, entries):
    """Write a split file `<name>.txt` containing `path<TAB>text` per line."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    path = out / f"{name}.txt"
    with path.open("w", encoding="utf-8") as fh:
        for img, txt in entries:
            fh.write(f"{img}\t{txt}\n")
    print(f"Wrote {len(entries)} lines to {path}")

def extract_charset(entries, out_dir: str):
    """Extract the set of characters used in all transcriptions.

    This is useful for inspecting the data and for configuring the OCR model's
    output layer (alphabet).
    """
    chars = set()
    for _, txt in entries:
        chars.update(txt)
    chars.discard("\n"); chars.discard("\r")
    chars = sorted(chars)
    out = Path(out_dir); out.mkdir(parents=True, exist_ok=True)
    charset_path = out / "charset.txt"
    with charset_path.open("w", encoding="utf-8") as fh:
        for c in chars:
            fh.write(c + "\n")
    # Also provide a single-line version for quick inspection
    with (out / "charset_one_line.txt").open("w", encoding="utf-8") as fh:
        fh.write("".join(chars))
    print(f"Extracted {len(chars)} unique characters -> {charset_path}")

def main():
    parser = argparse.ArgumentParser(description="Generate splits and charset from a ground truth file")
    parser.add_argument("gt", help="Path to the ground truth file (path<TAB>transcription per line)")
    parser.add_argument("--out", "-o", default="splits", help="Output directory")
    parser.add_argument("--val", type=float, default=0.05, help="Fraction for validation (default 0.05)")
    parser.add_argument("--test", type=float, default=0.05, help="Fraction for test (default 0.05)")
    parser.add_argument("--seed", type=int, default=42, help="Seed for the random shuffle")
    parser.add_argument("--group-by-page", action="store_true",
                        help="Avoid leakage by keeping lines from the same page in one split")
    parser.add_argument("--page-sep", default="_",
                        help="Separator used to extract the page id from the file stem")
    args = parser.parse_args()

    entries = read_ground_truth(args.gt)
    if not entries:
        print("No entries found in the ground truth. Exiting.")
        return
    print(f"Read {len(entries)} lines from the ground truth.")
    groups = group_entries(entries, args.group_by_page, args.page_sep)
    print(f"Grouped into {len(groups)} groups (group_by_page={args.group_by_page}).")
    train, val, test = split_groups(groups, args.val, args.test, args.seed)
    write_split(args.out, "train", train)
    write_split(args.out, "val", val)
    write_split(args.out, "test", test)
    write_split(args.out, "ground_truth_all", train + val + test)
    extract_charset(entries, args.out)
    print("Done.")

if __name__ == "__main__":
    main()

Quick sanity check on the generated splits: we check how many lines each split contains and we print the first few lines of `train.txt` to visually verify the format.

In [None]:
%%bash
# Check number of lines in each split and show first few entries from train.
# This is useful to quickly see if the splits look reasonable.
wc -l splits_new/*.txt || true
echo
sed -n '1,5p' splits_new/train.txt || true
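
Line counts do not show whether a page ended up in more than one split. The following sketch is one possible check, under the assumption that pages are identified exactly as in `group_entries` (stem up to the first separator); `page_key`, `find_leaks` and `read_paths` are hypothetical helper names, not part of the original scripts.

```python
from collections import defaultdict
from pathlib import Path

def page_key(path: str, sep: str = "_") -> str:
    """Same key as group_entries: stem up to the first separator."""
    stem = Path(path).stem
    return stem.split(sep, 1)[0] if sep and sep in stem else str(Path(path).parent)

def find_leaks(splits: dict[str, list[str]], sep: str = "_") -> dict[str, set[str]]:
    """Map each page key that occurs in more than one split to those splits."""
    seen = defaultdict(set)
    for split_name, paths in splits.items():
        for p in paths:
            seen[page_key(p, sep)].add(split_name)
    return {key: names for key, names in seen.items() if len(names) > 1}

def read_paths(split_file: str) -> list[str]:
    """Read the image paths (first TAB-separated field) from a split file."""
    with open(split_file, encoding="utf-8") as fh:
        return [line.rstrip("\n").split("\t", 1)[0] for line in fh if line.strip()]

if __name__ == "__main__":
    names = ["train", "val", "test"]
    files = {n: Path("splits_new") / f"{n}.txt" for n in names}
    splits = {n: read_paths(str(f)) for n, f in files.items() if f.exists()}
    print("pages in more than one split:", find_leaks(splits) or "none")
```

When the splits were created with `--group-by-page`, the expected result is `none`.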

## 4. Image preprocessing pipeline

Line images should be in a **consistent format** (same height, good contrast, minimal skew) before they are passed to ketos. We use ImageMagick (`convert`, or `magick` on ImageMagick 7) to:

- deskew lines (`-deskew 40%`),
- convert to grayscale (`-colorspace Gray`),
- resize to a fixed height (64 px here, `-resize x64`),
- pad the canvas to exactly 64 px in height with the line centered (`-extent 0x64`; a width of 0 keeps the image's own width).

The shell snippet below processes all line images and writes them into `processed/lines/`. This step reduces unwanted variability in the input that might make training harder or slower.

In [None]:
%%bash
# Preprocess line images for ketos (deskew, grayscale, resize, center).
#
# This loop:
#   - creates processed/lines if needed
#   - iterates over each .png image in 00_images
#   - applies a sequence of ImageMagick operations

mkdir -p processed/lines

for f in 00_images/*.png; do
    base=$(basename "$f")
    convert "$f" \
        -deskew 40% \
        -colorspace Gray \
        -resize x64 \
        -background white -gravity center -extent 0x64 \
        "processed/lines/$base"
done
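
A quick way to confirm that the loop produced images of the intended height is to read the dimensions back. The sketch below is an assumption-laden helper, not part of the original pipeline: it assumes the outputs are PNG files and reads only the PNG header with the standard library, so no imaging package is needed. `png_size` and `wrong_height` are hypothetical names.

```python
import struct
from pathlib import Path

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

def png_size(path) -> tuple[int, int]:
    """Return (width, height) read from the IHDR chunk of a PNG file."""
    with open(path, "rb") as fh:
        header = fh.read(24)
    if len(header) < 24 or header[:8] != PNG_SIGNATURE or header[12:16] != b"IHDR":
        raise ValueError(f"not a PNG file: {path}")
    # Bytes 16-23 hold width and height as big-endian unsigned 32-bit integers.
    width, height = struct.unpack(">II", header[16:24])
    return width, height

def wrong_height(folder: str, expected: int = 64) -> list[str]:
    """List PNG files in `folder` whose height differs from `expected`."""
    return [p.name for p in sorted(Path(folder).glob("*.png"))
            if png_size(p)[1] != expected]

if __name__ == "__main__":
    print("images with unexpected height:", wrong_height("processed/lines"))
```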

After preprocessing, the paths in the split files still point to `00_images/...`. We need to update them so that they refer to the preprocessed images in `processed/lines/`.

This is a **pure string replacement** in the split files and does not change the texts.

In [None]:
%%bash
# Update image paths in split files to use processed/lines instead of 00_images.
# The text part after the TAB remains unchanged.

sed 's|^00_images/|processed/lines/|' splits_new/train.txt > splits_new/train_proc.txt
sed 's|^00_images/|processed/lines/|' splits_new/val.txt   > splits_new/val_proc.txt
sed 's|^00_images/|processed/lines/|' splits_new/test.txt  > splits_new/test_proc.txt

## 5. Extracting image lists and creating sidecar files

With `-f path`, ketos reads line images together with `.gt.txt` files (sidecars) that contain the transcriptions. The tab-separated split files (`image\ttext`) built in the previous steps are an intermediate format of this pipeline and must be converted.

In this workflow we create:
- `train_images.txt`, `val_images.txt`, `test_images.txt` – one image path per line;
- `*.gt.txt` sidecars in the same directory as the images, each containing the corresponding transcription.

This layout is convenient when you want to reuse the same splits with different Kraken commands or configurations without rewriting a big GT file each time.
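
The sidecar name is derived from the image name by replacing its last extension. The snippet below is a small illustration; `sidecar_for` is a hypothetical helper that mirrors the `with_suffix` call used in the cell that writes the sidecars, and the second file name is invented to show what happens when the stem itself contains a dot.

```python
from pathlib import Path

def sidecar_for(image_path: str) -> Path:
    """Return the sidecar path: the image's last extension replaced by .gt.txt."""
    return Path(image_path).with_suffix(".gt.txt")

print(sidecar_for("processed/lines/100_page1_line001.png").as_posix())
# processed/lines/100_page1_line001.gt.txt

# Only the last extension is replaced, so a dot inside the stem is kept.
print(sidecar_for("processed/lines/ms.12_line001.png").as_posix())
# processed/lines/ms.12_line001.gt.txt
```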

In [None]:
%%bash
# Extract only image paths from the processed splits (train/val/test).
# We drop the text part and keep only the first TAB-separated field.

awk -F'\t' '{print $1}' splits_new/train_proc.txt > splits_new/train_images.txt
awk -F'\t' '{print $1}' splits_new/val_proc.txt   > splits_new/val_images.txt
awk -F'\t' '{print $1}' splits_new/test_proc.txt  > splits_new/test_images.txt

In [None]:
# Inline script to create sidecar .gt.txt files from split files.
#
# PURPOSE
# -------
# For each line in train_proc/val_proc/test_proc:
#     <image_path>\t<text>
# we create a sidecar next to the image, with the image extension replaced
# by .gt.txt (e.g. foo.png -> foo.gt.txt), that contains <text>.
#
# This is the format that recent versions of ketos expect when we pass only
# lists of images for training/validation.

from pathlib import Path

def write_sidecars(split_path: str) -> int:
    """Create .gt.txt sidecar files for each image in a split.

    Parameters
    ----------
    split_path : str
        Path to a file where each line is `image_path<TAB>text`.

    Returns
    -------
    int
        Number of sidecar files created.
    """
    p = Path(split_path)
    created = 0
    with p.open(encoding='utf-8') as f:
        for i, line in enumerate(f, 1):
            line = line.rstrip('\n')
            if not line.strip():
                continue
            try:
                img, txt = line.split('\t', 1)
            except ValueError:
                # If a line does not contain a TAB, we warn and skip it.
                print(f"[WARN] line {i} without TAB in {split_path}: {line[:120]}...")
                continue

            img_path = Path(img)
            if not img_path.exists():
                # It is important that every image mentioned in the split actually exists.
                print(f"[WARN] missing image at line {i}: {img_path}")
                continue

            # Sidecar filename: image_name.gt.txt next to the image.
            sidecar = img_path.with_suffix('.gt.txt')  # e.g. foo.png -> foo.gt.txt
            text = txt.strip('\r\n')
            sidecar.write_text(text + '\n', encoding='utf-8')
            created += 1
    return created

total = 0
for sp in ['splits_new/train_proc.txt', 'splits_new/val_proc.txt', 'splits_new/test_proc.txt']:
    if Path(sp).exists():
        c = write_sidecars(sp)
        print(f"[OK] {sp}: created {c} sidecar files")
        total += c
print("Total new sidecars created:", total)

In [None]:
# Verification: check that each image in the image lists has a corresponding .gt.txt sidecar.
#
# This is a defensive check: it tells us if some images that we plan to use
# for training are missing their transcription file.

from pathlib import Path

missing = []
for lst in ['splits_new/train_images.txt', 'splits_new/val_images.txt', 'splits_new/test_images.txt']:
    p = Path(lst)
    if not p.exists():
        continue
    for l in p.read_text(encoding='utf-8').splitlines():
        img = Path(l.strip())
        if not img.exists():
            # If the image itself is missing, the dataset is inconsistent;
            # we skip here but this should be investigated.
            continue
        gt = img.with_suffix('.gt.txt')
        if not gt.exists():
            missing.append(str(img))

print("Total images without sidecar:", len(missing))
if missing[:10]:
    print("First missing:", missing[:10])

## 6. Training the Kraken model with ketos

Finally, we launch training using `ketos train`, starting from the pre‑trained Latin manuscript model `Tridis_Medieval_EarlyModern.mlmodel` and fine‑tuning it on our Italian dataset.

### Why these hyperparameters?

- `-i models/Tridis_Medieval_EarlyModern.mlmodel`: we reuse a model trained on medieval/early modern Latin manuscripts, which are graphically similar to the Italian handwriting in our data.
- `--resize union`: when the training data contain characters that are missing from the base model's alphabet, extend the model's output layer to the union of both alphabets instead of aborting.
- `-N 40`: maximum number of epochs.
- `--min-epochs 5`: ensure at least a few passes over the data before early stopping.
- `--lag 10`: patience for early stopping (number of epochs without improvement on validation before stopping).
- `-B 4`: batch size; relatively small here due to GPU memory and image size.
- `-r 5e-5`: learning rate; small step size for fine‑tuning to avoid destroying what the base model has already learned.
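- `-t`/`-e`: text files listing images for training and validation.
- `-f path`: the training data are line images with `.gt.txt` sidecar transcriptions.
- `-q early`: stop by early stopping rather than after a fixed number of epochs.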

> **Note:** This command is intended to be run in a shell inside the conda environment where Kraken/ketos are installed. Adjust paths and hyperparameters as needed for your setup and your GPU.
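
How `--lag` and `--min-epochs` interact can be illustrated with a simple patience rule. The sketch below is a toy model of early stopping, not Kraken's actual implementation: `stopping_epoch` is a hypothetical function, the scores are invented, and it assumes one validation run per epoch with higher scores being better.

```python
def stopping_epoch(val_scores, lag=10, min_epochs=5, max_epochs=40):
    """Return (epochs_run, best_epoch) for a simple patience rule.

    val_scores[i] is the validation score after epoch i + 1 (higher is better).
    """
    best_score, best_epoch = float("-inf"), 0
    epochs_run = 0
    for epoch, score in enumerate(val_scores[:max_epochs], start=1):
        epochs_run = epoch
        if score > best_score:
            best_score, best_epoch = score, epoch
        # Stop once `lag` epochs passed without improvement, but never before min_epochs.
        if epoch >= min_epochs and epoch - best_epoch >= lag:
            break
    return epochs_run, best_epoch

scores = [0.80, 0.85, 0.88, 0.90, 0.90, 0.89, 0.90]
print(stopping_epoch(scores, lag=3, min_epochs=5))
# (7, 4)
```

In this toy run the best score is reached at epoch 4 and training stops at epoch 7, after three epochs without a strict improvement. A larger lag trades extra training time for a lower risk of stopping on a temporary plateau.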

In [None]:
%%bash
# Fine-tune Kraken model with ketos.
#
# IMPORTANT:
# - Run this in a terminal within the environment where kraken/ketos is installed.
# - Make sure the base model path (-i) and output paths (-o) exist or can be created.
# - Ensure that splits_new/train_images.txt and splits_new/val_images.txt point
#   to images that have matching .gt.txt sidecars.

ketos train \
    -f path \
    -i "models/Tridis_Medieval_EarlyModern.mlmodel" \
    --resize union \
    -q early \
    -N 40 \
    --min-epochs 5 \
    --lag 10 \
    -B 4 \
    -r 5e-5 \
    -o models/italian_finetuned \
    -t splits_new/train_images.txt \
    -e splits_new/val_images.txt

## 7. Summary

This notebook mirrors and expands the code part of the LaTeX report and provides a structured, executable view of the complete pipeline used to:

- prepare line images and Markdown transcriptions for Kraken,
- validate and normalize the ground truth file,
- create robust train/validation/test splits and extract the charset,
- preprocess line images with ImageMagick,
- generate sidecar `.gt.txt` files,
- and train a fine‑tuned OCR model for Italian handwritten manuscripts based on Kraken.

The additional comments aim to clarify:

- the purpose of each script and command,
- the reasons behind design choices (e.g. grouping by page, Unicode normalization, reuse of a Latin base model),
- and potential points where you may need to adapt paths or parameters for your own environment.

You can now use this notebook both as a **didactic explanation** of the pipeline and as a **starting point** to run further experiments or extend the project.