# Processing facsimiles for HTR

Simon Gabay, University of Geneva

<img alt="Licence Creative Commons" style="border-width:0" src="https://i.creativecommons.org/l/by-sa/4.0/88x31.png" align="right"/>

### a. Objectives

This notebook helps users to:
- process documents end to end, up to a TEI output
- segment documents prior to uploading them in eScriptorium for corrections

### b. Remarks

This notebook is adapted for the [OpenOnDemand](https://ondemand.baobab.hpc.unige.ch) service of the UniGE. If you want to use OpenOnDemand, you need to [ask first for an HPC account](https://catalogue-si.unige.ch/hpc).

This notebook **should be** compatible with colab. Specific sections for colab are noted with the colab (<img width="25px" src="https://upload.wikimedia.org/wikipedia/commons/thumb/d/d0/Google_Colaboratory_SVG_Logo.svg/320px-Google_Colaboratory_SVG_Logo.svg.png"/>) logo. You can open the notebook directly on colab with the following link:

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/FoNDUE-HTR/Documentation/blob/master/notebook_pipeline.ipynb)

⚠️ GPU use is enabled, so be careful when running the notebook on other services. Colab offers limited GPU use without a subscription; other environments (local, mybinder…) might not offer GPUs at all.

### c. Credits

The following work would not exist without the help of:
- [A. Pinche](https://ciham.cnrs.fr/annuaire/membres_statutaires/ariane-pinche), CNRS (page modelisation)
- [Th. Clérice](https://almanach.inria.fr/people-fr.html), INRIA Paris (computer vision)
- [K. Christensen](https://medialab.sciencespo.fr/equipe/kelly-christensen/), Sciences Po Paris (TEI conversion)
- [M. Humeau](https://crc.mnhn.fr/fr/annuaire/maxime-humeau-9510), Université de Genève / Museum national d'histoire naturelle (notebook)
- [Fl. Goy](https://www.unige.ch/ihr/fr/linstitut/lequipe/collaborateur-trices-projets-fns/floriane-goy/) for beta testing.

## 1. Set up

Start by filling in the parameters below (document identifier, IIIF manifest, models); the GPU check follows in section 1.2.

### 1.1 Initialisation

##### 1.1.1 IIIF Manifest

In [22]:
id_book = ""

In [23]:
iiif_manifest = ""
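
Before downloading, it can help to check how many pages a manifest declares. The sketch below is only an illustration and not part of the pipeline: it assumes a IIIF Presentation API 2.x manifest (`sequences` → `canvases` → `images`), and both the helper name `list_image_urls` and the sample manifest are made up for the example. Version 3 manifests use a different structure (`items`) and would need adapting.

```python
import json


def list_image_urls(manifest: dict) -> list[str]:
    """Return one image URL per canvas of a IIIF Presentation 2.x manifest."""
    urls = []
    for sequence in manifest.get("sequences", []):
        for canvas in sequence.get("canvases", []):
            images = canvas.get("images", [])
            if images:
                # In API 2.x the image URL is stored in resource["@id"]
                urls.append(images[0]["resource"]["@id"])
    return urls


# Hand-written manifest for illustration only (not a real document)
sample_manifest = json.loads("""
{
  "@id": "https://example.org/iiif/book1/manifest",
  "sequences": [{
    "canvases": [
      {"images": [{"resource": {"@id": "https://example.org/iiif/book1/f1/full/full/0/default.jpg"}}]},
      {"images": [{"resource": {"@id": "https://example.org/iiif/book1/f2/full/full/0/default.jpg"}}]}
    ]
  }]
}
""")

image_urls = list_image_urls(sample_manifest)
print(f"{len(image_urls)} page(s) declared in the manifest")
```

With a real manifest, the dictionary could be loaded with `json.load(urllib.request.urlopen(iiif_manifest))` once `iiif_manifest` has been filled in.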

##### 1.1.2 Models

In [6]:
model_htr = "https://github.com/FoNDUE-HTR/Documentation/releases/download/v.0.9/fondue_emmental.mlmodel"

In [7]:
model_segmonto = "https://zenodo.org/records/10972956/files/CapricciosaX.pt?download=1"

### 1.2 Configuration

In [8]:
!nvidia-smi

Thu Feb 22 10:52:24 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.36                 Driver Version: 546.33       CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA GeForce RTX 4070 Ti     On  | 00000000:01:00.0  On |                  N/A |
|  0%   36C    P8              14W / 285W |   1706MiB / 12282MiB |     19%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
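
The same check can be scripted so that later cells fall back to the CPU automatically. This is a minimal sketch using only the standard library; the helper name `pick_device` is hypothetical, and it assumes that a working `nvidia-smi` on the `PATH` means a usable CUDA device, which may not hold on every system.

```python
import shutil
import subprocess


def pick_device() -> str:
    """Return "cuda:0" if nvidia-smi is found and runs successfully, else "cpu"."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return "cpu"
    try:
        result = subprocess.run([exe], capture_output=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return "cpu"
    return "cuda:0" if result.returncode == 0 else "cpu"


device = pick_device()
print(f"Selected device: {device}")
```

In a notebook, the value can then be interpolated into shell commands in the same way as `{iiif_manifest}`, for instance `--device {device}`.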
                                                                    

We will use two principal tools for information extraction:

- To segment the pages, we are going to use [YALTAi](https://github.com/PonteIneptique/YALTAi) developed by Thibault Clérice (more info: [arXiv.2207.11230](https://doi.org/10.48550/arXiv.2207.11230)).
- To extract the text we use [Kraken](https://github.com/mittagessen/kraken) developed by Benjamin Kiessling (more info: [10.34894/Z9G2EX](https://doi.org/10.34894/Z9G2EX)).

⚠️ YALTAi installs Kraken as a dependency, so there is no need to install it separately.

In [9]:
!pip install --upgrade pip
!pip install YALTAi

Collecting pip
  Downloading pip-24.0-py3-none-any.whl (2.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.1/2.1 MB 37.2 MB/s eta 0:00:00
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 22.0.2
    Uninstalling pip-22.0.2:
      Successfully uninstalled pip-22.0.2
Successfully installed pip-24.0
Collecting YALTAi
  Downloading YALTAi-1.0.2-py2.py3-none-any.whl.metadata (2.6 kB)
Collecting fast-deskew==1.0 (from YALTAi)
  Downloading fast_deskew-1.0-py3-none-any.whl (3.3 kB)
Collecting kraken==4.3.13 (from YALTAi)
  Downloading kraken-4.3.13-py3-none-any.whl.metadata (6.4 kB)
Collecting mean-average-precision==2021.4.26.0 (from YALTAi)
  Downloading mean_average_precision-2021.4.26.0-py3-none-any.whl (14 kB)
Collecting tabulate~=0.8.10 (from YALTAi)
  Downloading tabulate-0.8.10-py3-none-any.whl.metadata (25 kB)
Collecting ultralytics==8.0.209 (from YALTAi)
  Downloadi

## 2. Document preparation

### 2.1 Download IIIF images

#### 2.1.1 IIIF_collector

In [12]:
# Download the command-line tool
!git clone https://github.com/rayondemiel/iiif_collector.git

fatal: destination path 'iiif_collector' already exists and is not an empty directory.


In [13]:
!pip install -r iiif_collector/requirements.txt

Collecting aiofiles==23.1.0 (from -r iiif_collector/requirements.txt (line 1))
  Downloading aiofiles-23.1.0-py3-none-any.whl.metadata (9.0 kB)
Collecting aiohttp==3.8.4 (from -r iiif_collector/requirements.txt (line 2))
  Downloading aiohttp-3.8.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (7.4 kB)
Collecting async-timeout==4.0.2 (from -r iiif_collector/requirements.txt (line 4))
  Downloading async_timeout-4.0.2-py3-none-any.whl (5.8 kB)
Collecting attrs==23.1.0 (from -r iiif_collector/requirements.txt (line 5))
  Downloading attrs-23.1.0-py3-none-any.whl.metadata (11 kB)
Collecting certifi==2022.12.7 (from -r iiif_collector/requirements.txt (line 6))
  Downloading certifi-2022.12.7-py3-none-any.whl.metadata (2.9 kB)
Collecting charset-normalizer==3.1.0 (from -r iiif_collector/requirements.txt (line 7))
  Downloading charset_normalizer-3.1.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (30 kB)
Collecting click==8.1.3 (from -r iiif_colle

In [74]:
!rm -rf content/

In [75]:
!mkdir -p content/images/

In [34]:
!python3 iiif_collector/run.py iiif-singular {iiif_manifest} -f jpg --filename

Saving images:   5%|█▎                      | 33/633 [00:29<08:14,  1.21image/s]^C
Saving images:   5%|█▎                      | 33/633 [00:29<08:59,  1.11image/s]

Aborted!


In [49]:
!find iiif_collector/iiif_output/ -type f -name "*.jpg" -exec sh -c 'mv "$0" "content/images/"' {} \;

In [50]:
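import os

# The images were moved to content/images/, so they must be renamed there
image_dir = os.path.join("content", "images")
for filename in os.listdir(image_dir):
    # Skip files that already carry the prefix so the cell can be re-run safely
    if filename.endswith(".jpg") and not filename.startswith(f"{id_book}_"):
        full_path_old = os.path.join(image_dir, filename)
        new_filename = f"{id_book}_{filename}"
        full_path_new = os.path.join(image_dir, new_filename)
        os.rename(full_path_old, full_path_new)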

## 3. Image segmentation

Some models are already available. We are going to use one trained at the University of Geneva. This model performs layout analysis using the controlled vocabulary [SegmOnto](https://segmonto.github.io).

SegmOnto models the page in a way that aims to be as universal as possible.

<table>
  <tr>
    <th>Historical Print</th>
    <th>Medieval manuscript</th>
  </tr>
  <tr>
    <td><img src="https://github.com/gabays/CHR_2023/blob/main/images/btv1b86070385_f140_ann.jpg?raw=1" height="300px"></td>
    <td><img src="https://github.com/gabays/CHR_2023/blob/main/images/btv1b84259980_f29_ann.jpg?raw=1" height="250px"></td>
  </tr>
</table>

The data were prepared under the supervision of Ariane Pinche (CNRS) and Simon Gabay (UniGE) with [eScriptorium](https://ieeexplore.ieee.org/document/8893029), an open-source web application for preparing such data.

<img src="https://github.com/gabays/CHR_2023/blob/main/images/escriptorium.png?raw=1" height="300px">

The University of Geneva contributes via its own instance, called [FoNDUE](https://www.unige.ch/lettres/humanites-numeriques/recherche/projets-de-la-chaire/fondue). The FoNDUE project aims to interface eScriptorium with HPC clusters using Slurm (right), rather than with a single machine as other instances do (left).

<img src="https://raw.githubusercontent.com/gabays/CHR_2023/main/images/Fondue.png" height="250px">


In [55]:
# Download the model
!wget {model_segmonto} -O content/seg_model.pt
# Load the model
from ultralytics import YOLO
model = YOLO("content/seg_model.pt")
# Use the GPU if you have one: uncomment the next line (keep it commented if you don't, typically on your own machine)
#model.to('cuda')
# Get info about the model
model.info()
# Fuse PyTorch Conv2d and BatchNorm2d layers. This improves inference time and therefore execution time.
model.fuse()

--2024-02-22 11:50:33--  https://github.com/rayondemiel/Yolov8-Segmonto/releases/download/yolov8/prime_filet_4137.pt
Resolving github.com (github.com)... 140.82.121.4
Connecting to github.com (github.com)|140.82.121.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://objects.githubusercontent.com/github-production-release-asset-2e65be/707350493/34f87686-557a-4a90-bb63-4f2851ab484d?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAVCODYLSA53PQK4ZA%2F20240222%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20240222T105034Z&X-Amz-Expires=300&X-Amz-Signature=1e309820666571ef8c6d6174fffb993eba07071ce8c22d11ad1d4e2639367d42&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=707350493&response-content-disposition=attachment%3B%20filename%3Dprime_filet_4137.pt&response-content-type=application%2Foctet-stream [following]
--2024-02-22 11:50:33--  https://objects.githubusercontent.com/github-production-release-asset-2e65be/707350493/34f87686-557a-4a90-bb63-4f

Model summary: 295 layers, 25865005 parameters, 0 gradients, 79.1 GFLOPs
Model summary (fused): 218 layers, 25848445 parameters, 0 gradients, 78.7 GFLOPs


Let's use it now!

## 4. Optical character recognition

We now need a Kraken model, so we download a generic model for prints.

In [59]:
!wget {model_htr} -O content/htr_model.mlmodel

--2024-02-22 11:51:46--  https://github.com/FoNDUE-HTR/Documentation/releases/download/v.0.9/fondue_emmental.mlmodel
Resolving github.com (github.com)... 140.82.121.3
Connecting to github.com (github.com)|140.82.121.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://objects.githubusercontent.com/github-production-release-asset-2e65be/436898644/ecdfb513-61ce-4281-a17c-a1d046c26311?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAVCODYLSA53PQK4ZA%2F20240222%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20240222T105147Z&X-Amz-Expires=300&X-Amz-Signature=1b969875555f2e17eb11e29cd80bb20a7d17c4cdc1b5fdca2025d3628ca1b3d8&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=436898644&response-content-disposition=attachment%3B%20filename%3Dfondue_emmental.mlmodel&response-content-type=application%2Foctet-stream [following]
--2024-02-22 11:51:46--  https://objects.githubusercontent.com/github-production-release-asset-2e65be/436898644/ecdfb513-61ce-4281-a17

First we segment:
- the image into zones (with our YOLO model)
- the lines (with Kraken's [blla model](https://github.com/mittagessen/kraken/blob/main/kraken/blla.mlmodel)).

In [62]:
!yaltai kraken --device cuda:0 -I "content/images/*.jpg" --suffix ".xml" segment --yolo content/seg_model.pt
# If you don't have a GPU, execute this line instead
#!yaltai kraken --device cpu -I "content/images/*.jpg" --suffix ".xml" segment --yolo content/seg_model.pt
print('\033[92m Segmentation done!')

scikit-learn version 1.2.2 is not supported. Minimum required version: 0.17. Maximum required version: 1.1.2. Disabling scikit-learn conversion API.
Torch version 2.0.1+cu117 has not been tested with coremltools. You may run into unexpected errors. Torch 2.0.0 is the most recent version that has been tested.
Loading ANN /home/rayondemiel/univ_geneve/iiif2alto/Documentation/.env/lib/python3.10/site-packages/kraken/blla.mlmodel
Segmenting
image 1/1 /home/rayondemiel/univ_geneve/iiif2alto/Documentation/content/images/f13.jpg: 896x672 1 MainZone, 1 QuireMarksZone, 1 DropCapitalZone, 1 GraphicZone, 18.3ms
Speed: 6.4ms preprocess, 18.3ms inference, 2.0ms postprocess per image at shape (1, 3, 896, 672)
✓
Segmenting
image 1/1 /home/rayondemiel/univ_geneve/iiif2alto/Documentation/content/images/f20.jpg: 896x672 1 MainZone, 1 RunningTitleZone, 117.6ms
Speed: 3.1ms preprocess, 117.6ms inference, 1.3ms postprocess per image at shape (1, 3, 896, 672)
✓
Segmenting

The XML files reference each image with its folder (`content/images/`). We strip that prefix so that only the image file name remains:

In [63]:
import os
import fileinput

for file in os.listdir(os.path.join("content", "images")):
    if file.endswith(".xml"):
        with fileinput.FileInput(os.path.join("content", "images", file), inplace=True) as f:
            for line in f:
                print(line.replace('content/images/', ''), end='')
print('\033[92m All files are corrected!')

All files are corrected!


Here you can download the ALTO files with the segmentation and the images to continue in eScriptorium
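
Before moving on, the segmentation can be inspected by counting the zone types found in an ALTO file. The sketch below is illustrative only: it assumes that zone labels are declared in `<OtherTag>` elements and referenced from `<TextBlock>` through `TAGREFS`, as ALTO 4 allows, and that every zone is serialised as a `<TextBlock>`. The helper name `count_zones` and the sample document are made up, and the `{*}` namespace wildcard requires Python 3.8 or later.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def count_zones(alto_xml: str) -> Counter:
    """Count TextBlock elements per zone label in an ALTO document."""
    root = ET.fromstring(alto_xml.strip())
    # Map tag identifiers (e.g. TYPE_1) to their human-readable labels
    labels = {tag.get("ID"): tag.get("LABEL") for tag in root.findall(".//{*}OtherTag")}
    counts = Counter()
    for block in root.findall(".//{*}TextBlock"):
        # TAGREFS may hold several space-separated identifiers
        for ref in block.get("TAGREFS", "").split():
            if ref in labels:
                counts[labels[ref]] += 1
    return counts


# Hand-written ALTO fragment for illustration only
sample_alto = """
<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Tags>
    <OtherTag ID="TYPE_1" LABEL="MainZone"/>
    <OtherTag ID="TYPE_2" LABEL="RunningTitleZone"/>
  </Tags>
  <Layout>
    <Page>
      <PrintSpace>
        <TextBlock ID="b1" TAGREFS="TYPE_2"/>
        <TextBlock ID="b2" TAGREFS="TYPE_1"/>
      </PrintSpace>
    </Page>
  </Layout>
</alto>
"""

zone_counts = count_zones(sample_alto)
print(dict(zone_counts))
```

For a file on disk, the content could be read with `open(path, encoding="utf-8").read()` and passed to the same function.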

Then we run text recognition on the previously segmented images:

In [65]:
!kraken --alto --device cuda:0 --suffix ".xml" -I "content/images/*.xml" -f alto ocr -m "content/htr_model.mlmodel"
# If you don't have a GPU execute this line instead
#!kraken --alto --suffix ".xml" -I "content/images/*.xml" -f alto ocr -m "content/htr_model.mlmodel"
!mkdir -p content/data/doc_1
!mv content/images/*.xml content/data/doc_1
print('\033[92m All files are transcribed!')

scikit-learn version 1.2.2 is not supported. Minimum required version: 0.17. Maximum required version: 1.1.2. Disabling scikit-learn conversion API.
Torch version 2.0.1+cu117 has not been tested with coremltools. You may run into unexpected errors. Torch 2.0.0 is the most recent version that has been tested.
Loading ANN content/htr_model.mlmodel ✓
Processing ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 21/21 0:00:00 0:00:02
Writing recognition results for content/images/f29.xml ✓
Processing ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 19/19 0:00:00 0:00:01
Writing recognition results for content/images/f9.xml ✓
Processing ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% 0/0 -:--:-- 0:00:00
Writing recognition results for content/images/f1.xml

We need to correct the image file name in the XML files once again:

In [66]:
import os
import fileinput

for file in os.listdir(os.path.join("content", "data", "doc_1")):
    if file.endswith(".xml"):
        with fileinput.FileInput(os.path.join("content", "data", "doc_1", file), inplace=True) as f:
            for line in f:
                print(line.replace('content/images/', ''), end='')
print('\033[92m All files are corrected!')

All files are corrected!


Here you can download the ALTO files with the segmentation and the transcription, and the images to continue in eScriptorium
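
To preview a transcription without opening eScriptorium, the text can be read back from an ALTO file. This is a rough sketch, not part of the pipeline: it assumes that the recognised text is stored in the `CONTENT` attribute of `<String>` elements that are direct children of `<TextLine>`, as in ALTO 4. The helper name `alto_lines` and the sample document are invented for the example, and the `{*}` wildcard requires Python 3.8 or later.

```python
import xml.etree.ElementTree as ET


def alto_lines(alto_xml: str) -> list[str]:
    """Return the text of each TextLine of an ALTO document, in document order."""
    root = ET.fromstring(alto_xml.strip())
    lines = []
    for text_line in root.findall(".//{*}TextLine"):
        # A line may hold one String for the whole line or one String per word
        words = [s.get("CONTENT", "") for s in text_line.findall("{*}String")]
        lines.append(" ".join(w for w in words if w))
    return lines


# Hand-written ALTO fragment for illustration only
sample_alto = """
<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Layout><Page><PrintSpace>
    <TextBlock ID="b1">
      <TextLine ID="l1"><String CONTENT="Premiere ligne"/></TextLine>
      <TextLine ID="l2"><String CONTENT="seconde"/><SP/><String CONTENT="ligne"/></TextLine>
    </TextBlock>
  </PrintSpace></Page></Layout>
</alto>
"""

transcription = alto_lines(sample_alto)
print("\n".join(transcription))
```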

In [71]:
!zip -r {id_book}_altos_transcribed.zip content/data/doc_1/*xml
!zip -r {id_book}_facsimiles.zip content/images/*jpg
print('\033[92m You can now download the zip files in the root folder!')

  adding: content/data/doc_1/f1.xml (deflated 58%)
  adding: content/data/doc_1/f10.xml (deflated 80%)
  adding: content/data/doc_1/f11.xml (deflated 87%)
  adding: content/data/doc_1/f12.xml (deflated 81%)
  adding: content/data/doc_1/f13.xml (deflated 88%)
  adding: content/data/doc_1/f14.xml (deflated 89%)
  adding: content/data/doc_1/f15.xml (deflated 88%)
  adding: content/data/doc_1/f16.xml (deflated 88%)
  adding: content/data/doc_1/f17.xml (deflated 89%)
  adding: content/data/doc_1/f18.xml (deflated 88%)
  adding: content/data/doc_1/f19.xml (deflated 88%)
  adding: content/data/doc_1/f2.xml (deflated 58%)
  adding: content/data/doc_1/f20.xml (deflated 88%)
  adding: content/data/doc_1/f21.xml (deflated 88%)
  adding: content/data/doc_1/f22.xml (deflated 88%)
  adding: content/data/doc_1/f23.xml (deflated 88%)
  adding: content/data/doc_1/f24.xml (deflated 88%)
  adding: content/data/doc_1/f25.xml (deflated 88%)
  adding: content/data/doc_1/f26.xml (deflated 88%)
  adding: cont
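
Before importing the archives into eScriptorium, one may want to verify that every image has a matching ALTO file. The following is a small sketch of such a check; the helper name `unpaired_images` is hypothetical, and the temporary folders only stand in for `content/images` and `content/data/doc_1`.

```python
import tempfile
from pathlib import Path


def unpaired_images(image_dir, alto_dir) -> list[str]:
    """Return the names of .jpg files that have no .xml file with the same stem."""
    alto_stems = {p.stem for p in Path(alto_dir).glob("*.xml")}
    return sorted(p.name for p in Path(image_dir).glob("*.jpg") if p.stem not in alto_stems)


# Demonstration on throwaway folders
with tempfile.TemporaryDirectory() as tmp:
    images = Path(tmp, "images")
    altos = Path(tmp, "doc_1")
    images.mkdir()
    altos.mkdir()
    for name in ("f1.jpg", "f2.jpg"):
        (images / name).touch()
    (altos / "f1.xml").touch()
    missing = unpaired_images(images, altos)

print(f"Images without ALTO file: {missing}")
```

On the real data, the call would be `unpaired_images("content/images", "content/data/doc_1")`; an empty list means every facsimile has its transcription.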