# Processing facsimiles for HTR

Simon Gabay, University of Geneva

<img alt="Licence Creative Commons" style="border-width:0" src="https://i.creativecommons.org/l/by-sa/4.0/88x31.png" align="right"/>

### a. Objectives

This notebook help users to:
- process entirely documents until a TEI output
- segment documents prior to uploading them in eScriptorium for corrections

### b. Remarks

This notebook is adapted for the [OpenOnDemand](https://ondemand.baobab.hpc.unige.ch) service of the UniGE. If you want to use OpenOnDemand, you need to [ask first for an HPC account](https://catalogue-si.unige.ch/hpc).

This notebook **should be** compatible with colab. Specific sections for colab are noted with the colab (<img width="25px" src="https://upload.wikimedia.org/wikipedia/commons/thumb/d/d0/Google_Colaboratory_SVG_Logo.svg/320px-Google_Colaboratory_SVG_Logo.svg.png"/>) logo. You can open the notebook directly on colab with the following link:

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/FoNDUE-HTR/Documentation/blob/master/notebook_pipeline.ipynb)

⚠️ GPU use is activated, be careful when running the notebook on other services. Colab offers a limited use without subscription, other uses (local, mybinder…) might not offer GPUs.

### c. Credits

The following work would not exist without the help of:
- [A. Pinche](https://ciham.cnrs.fr/annuaire/membres_statutaires/ariane-pinche), CNRS (page modelisation)
- [Th. Clérice](https://almanach.inria.fr/people-fr.html), INRIA Paris (computer vision)
- [K. Christensen](https://medialab.sciencespo.fr/equipe/kelly-christensen/), Sciences Po Paris (TEI conversion)
- [M. Humeau](https://crc.mnhn.fr/fr/annuaire/maxime-humeau-9510), Université de Genève / Museum national d'histoire naturelle (notebook)
- [Fl. Goy](https://www.unige.ch/ihr/fr/linstitut/lequipe/collaborateur-trices-projets-fns/floriane-goy/) for beta testing.

## 1. Set up

First check that the GPU is active:

### 1.1 Initialisation

##### 1.1.1 IIIF Manifest

In [1]:
!ml


Currently Loaded Modules:
  1) GCCcore/12.3.0   12) Python/3.11.3               23) lxml/4.9.2
  2) zlib/1.2.13      13) cffi/1.15.1                 24) hatchling/1.18.0
  3) binutils/2.40    14) cryptography/41.0.1         25) BeautifulSoup/4.12.2
  4) bzip2/1.0.8      15) virtualenv/20.23.1          26) IPython/8.14.0
  5) ncurses/6.4      16) Python-bundle-PyPI/2023.06  27) libyaml/0.2.5
  6) libreadline/8.2  17) OpenPGM/5.2.122             28) PyYAML/6.0
  7) Tcl/8.6.13       18) libsodium/1.0.18            29) PyZMQ/25.1.1
  8) SQLite/3.42.0    19) util-linux/2.39             30) tornado/6.3.2
  9) XZ/5.4.2         20) ZeroMQ/4.3.4                31) jupyter-server/2.7.2
 10) libffi/3.4.4     21) libxml2/2.11.4              32) JupyterLab/4.0.5
 11) OpenSSL/1.1      22) libxslt/1.1.38

 



In [21]:
!export PATH=$PATH:.local/bin
!echo $PATH

/opt/ebsofts/JupyterLab/4.0.5-GCCcore-12.3.0/bin:/opt/ebsofts/jupyter-server/2.7.2-GCCcore-12.3.0/bin:/opt/ebsofts/IPython/8.14.0-GCCcore-12.3.0/bin:/opt/ebsofts/hatchling/1.18.0-GCCcore-12.3.0/bin:/opt/ebsofts/libxslt/1.1.38-GCCcore-12.3.0/bin:/opt/ebsofts/libxml2/2.11.4-GCCcore-12.3.0/bin:/opt/ebsofts/ZeroMQ/4.3.4-GCCcore-12.3.0/bin:/opt/ebsofts/util-linux/2.39-GCCcore-12.3.0/sbin:/opt/ebsofts/util-linux/2.39-GCCcore-12.3.0/bin:/opt/ebsofts/Python-bundle-PyPI/2023.06-GCCcore-12.3.0/bin:/opt/ebsofts/virtualenv/20.23.1-GCCcore-12.3.0/bin:/opt/ebsofts/Python/3.11.3-GCCcore-12.3.0/bin:/opt/ebsofts/OpenSSL/1.1/bin:/opt/ebsofts/XZ/5.4.2-GCCcore-12.3.0/bin:/opt/ebsofts/SQLite/3.42.0-GCCcore-12.3.0/bin:/opt/ebsofts/Tcl/8.6.13-GCCcore-12.3.0/bin:/opt/ebsofts/ncurses/6.4-GCCcore-12.3.0/bin:/opt/ebsofts/bzip2/1.0.8-GCCcore-12.3.0/bin:/opt/ebsofts/binutils/2.40-GCCcore-12.3.0/bin:/opt/ebsofts/GCCcore/12.3.0/bin:/opt/cluster/admin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin


In [3]:
id_book = "LIV0158"

In [4]:
iiif_manifest = "https://gallica.bnf.fr/iiif/ark:/12148/bpt6k87114075/manifest.json"

##### 1.1.2 Models

In [5]:
model_htr = "https://github.com/FoNDUE-HTR/Documentation/releases/download/v.0.9/fondue_emmental.mlmodel"

In [6]:
model_segmonto = "https://zenodo.org/records/10972956/files/CapricciosaM.pt?download=1"

### 1.2 Configuration

In [7]:
!nvidia-smi

Mon Aug  5 16:09:57 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA GeForce RTX 3090        On  |   00000000:41:00.0 Off |                  N/A |
|  0%   30C    P8             31W /  370W |       1MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

We will use two principal tools for information extraction:

- To segment the pages, we are going to use [YALTAi](https://github.com/PonteIneptique/YALTAi) developped by Thibault Clérice (more info: [arXiv.2207.11230](https://doi.org/10.48550/arXiv.2207.11230)).
- To extract the text we use [Kraken](https://github.com/mittagessen/kraken) developed by Benjamin Kiessling (more info: [10.34894/Z9G2EX](https://doi.org/10.34894/Z9G2EX)).

⚠️ YALTAi contains Kraken, no need to install it separately

In [8]:
!pip install --upgrade pip
!pip install YALTAi==1.0.0

Defaulting to user installation because normal site-packages is not writeable
Defaulting to user installation because normal site-packages is not writeable
Collecting tabulate~=0.8.10 (from YALTAi==1.0.0)
  Using cached tabulate-0.8.10-py3-none-any.whl.metadata (25 kB)
Collecting numpy~=1.23.0 (from kraken==4.3.13->YALTAi==1.0.0)
  Using cached numpy-1.23.5-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.3 kB)
Using cached tabulate-0.8.10-py3-none-any.whl (29 kB)
Using cached numpy-1.23.5-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.1 MB)
Installing collected packages: tabulate, numpy
[0m  Attempting uninstall: numpy
    Found existing installation: numpy 1.24.3
    Uninstalling numpy-1.24.3:
      Successfully uninstalled numpy-1.24.3
[0mSuccessfully installed numpy-1.23.5 tabulate-0.8.10


###### 2. Document preparation

### 2.1 Download IIIF images

#### 2.1.1 IIIF_collector

In [9]:
#download CLI
!git clone https://github.com/rayondemiel/iiif_collector.git

fatal: destination path 'iiif_collector' already exists and is not an empty directory.


In [10]:
!pip install -r iiif_collector/requirements.txt

Defaulting to user installation because normal site-packages is not writeable
Collecting certifi==2022.12.7 (from -r iiif_collector/requirements.txt (line 6))
  Using cached certifi-2022.12.7-py3-none-any.whl.metadata (2.9 kB)
Collecting numpy==1.24.3 (from -r iiif_collector/requirements.txt (line 12))
  Using cached numpy-1.24.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.6 kB)
Collecting urllib3==1.26.15 (from -r iiif_collector/requirements.txt (line 21))
  Using cached urllib3-1.26.15-py2.py3-none-any.whl.metadata (48 kB)
Using cached certifi-2022.12.7-py3-none-any.whl (155 kB)
Using cached numpy-1.24.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.3 MB)
Using cached urllib3-1.26.15-py2.py3-none-any.whl (140 kB)
Installing collected packages: urllib3, numpy, certifi
  Attempting uninstall: numpy
    Found existing installation: numpy 1.23.5
    Uninstalling numpy-1.23.5:
      Successfully uninstalled numpy-1.23.5
[0m[31mERROR: pip's dep

In [13]:
!rm -r content/

rm: cannot remove 'content/': No such file or directory


In [14]:
!mkdir -p content/images/

In [15]:
!python3 iiif_collector/run.py iiif-singular {iiif_manifest} -f jpg --filename

Saving images: 100%|███████████████████████| 164/164 [02:02<00:00,  1.34image/s]
! Finish !


In [16]:
!find iiif_collector/iiif_output/ -type f -name "*.jpg" -exec sh -c 'mv "$0" "content/images/"' {} \;

In [17]:
import os
for filename in os.listdir('content/'):
    if filename.endswith(".jpg"):
        full_path_old = os.path.join('content/', filename)
        new_filename = f"{id_book}_{filename}"
        full_path_new = os.path.join('content/', new_filename)
        os.rename(full_path_old, full_path_new)

## 3. Image segmentation

Some models are already available. We are going to use one trained at the University of Geneva. This model is used for layout analyzing, using the controled vocabulary [SegmOnto](https://segmonto.github.io).

SegmOnto is based on an as universal as possible modelling of a page.

<table>
  <tr>
    <th>Historical Print</th>
    <th>Medieval manuscript</th>
  </tr>
  <tr>
    <td><img src="https://github.com/gabays/CHR_2023/blob/main/images/btv1b86070385_f140_ann.jpg?raw=1" height="300px"></td>
    <td><img src="https://github.com/gabays/CHR_2023/blob/main/images/btv1b84259980_f29_ann.jpg?raw=1" height="250px"></td>
  </tr>
</table>

Data have been prepared under the supervision of Ariane Pinche (CNRS) and Simon Gabay (UniGE) with [eScriptorium](https://ieeexplore.ieee.org/document/8893029), an open source web app to prepare data.

<img src="https://github.com/gabays/CHR_2023/blob/main/images/escriptorium.png?raw=1" height="300px">

The University of Geneva is contributing via its own instance called [FoNDUE](https://www.unige.ch/lettres/humanites-numeriques/recherche/projets-de-la-chaire/fondue). The FoNDUE project aims at interfacing eScriptorium with HPC clusters using slurm (right) and not a single machine like other instances (left).

<img src="https://raw.githubusercontent.com/gabays/CHR_2023/main/images/Fondue.png" height="250px">


In [18]:
# Download the model
!wget {model_segmonto} -O content/seg_model.pt
# Load the model
from ultralytics import YOLO
model = YOLO("content/seg_model.pt")
# Use GPU if you have one (comment with # if you don't, typically on your machine)
#model.to('cuda')
# Get info about the model
model.info()
# Fuse PyTorch Conv2d and BatchNorm2d layers. This improves inference time and therefore execution time.
model.fuse()

--2024-08-05 16:14:00--  https://zenodo.org/records/10972956/files/CapricciosaM.pt?download=1
Resolving zenodo.org (zenodo.org)... 188.185.79.172, 188.184.103.159, 188.184.98.238, ...
Connecting to zenodo.org (zenodo.org)|188.185.79.172|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 52062230 (50M) [application/octet-stream]
Saving to: ‘content/seg_model.pt’


2024-08-05 16:14:01 (103 MB/s) - ‘content/seg_model.pt’ saved [52062230/52062230]



Model summary: 295 layers, 25865005 parameters, 0 gradients, 79.1 GFLOPs
Model summary (fused): 218 layers, 25848445 parameters, 0 gradients, 78.7 GFLOPs


Let's use it now!

## 4. Optical character recognition

I now need a Kraken model. I download a generic model for prints.

In [19]:
!wget {model_htr} -O content/htr_model.mlmodel

--2024-08-05 16:14:12--  https://github.com/FoNDUE-HTR/Documentation/releases/download/v.0.9/fondue_emmental.mlmodel
Resolving github.com (github.com)... 140.82.121.4
Connecting to github.com (github.com)|140.82.121.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://objects.githubusercontent.com/github-production-release-asset-2e65be/436898644/ecdfb513-61ce-4281-a17c-a1d046c26311?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=releaseassetproduction%2F20240805%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20240805T141412Z&X-Amz-Expires=300&X-Amz-Signature=9db258902e20f0332ce8948624c0a5e05cc998bf1fefc29825d4e601d2466a4a&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=436898644&response-content-disposition=attachment%3B%20filename%3Dfondue_emmental.mlmodel&response-content-type=application%2Foctet-stream [following]
--2024-08-05 16:14:12--  https://objects.githubusercontent.com/github-production-release-asset-2e65be/436898644/ecdfb513-61ce-4281-a

First we segment:
- the image into zones (with our model)
- the lines (with [blla model](https://github.com/mittagessen/kraken/blob/main/kraken/blla.mlmodel)).

In [3]:
!~/.local/bin/yaltai kraken --device cuda:0 -I "content/images/*.jpg" --suffix ".xml" segment --yolo content/seg_model.pt
# If you don't have a GPU execute this line instead
#!yaltai kraken --device cpu -I "content/images/*.png" --suffix ".xml" segment --yolo content/seg_model.pt
print('\033[92m Segmentation done!')

scikit-learn version 1.2.2 is not supported. Minimum required version: 0.17. Maximum required version: 1.1.2. Disabling scikit-learn conversion API.
Torch version 2.0.1+cu117 has not been tested with coremltools. You may run into unexpected errors. Torch 2.0.0 is the most recent version that has been tested.
Loading ANN /home/users/a/alberta/.local/lib/python3.11/site-packages/kraken/blla.mlmodel	[0mSegmenting	[0m
image 1/1 /home/users/a/alberta/jupyter/content/images/f101.jpg: 896x640 1 MainZone, 1 QuireMarksZone, 1 NumberingZone, 1 RunningTitleZone, 75.7ms
Speed: 6.8ms preprocess, 75.7ms inference, 6.2ms postprocess per image at shape (1, 3, 896, 640)
[32m✓[0m
Segmenting	[0m
image 1/1 /home/users/a/alberta/jupyter/content/images/f100.jpg: 896x640 1 MainZone, 1 NumberingZone, 1 RunningTitleZone, 9.3ms
Speed: 3.6ms preprocess, 9.3ms inference, 1.3ms postprocess per image at shape (1, 3, 896, 640)
[32m✓[0m
Segmenting	[0m
image 1/1 /home/users/a/alberta/jupyter/content/images/f10

We need to correct the name of the image file in the xml file:

In [4]:
import os
import fileinput

for file in os.listdir(os.path.join("content","images")):
    if file.endswith(".xml"):
      with fileinput.FileInput(os.path.join("content","images",file), inplace=True) as f:
        for line in f:
          print(line.replace('content/images/',''), end='')
print('\033[92m All files are corrected!')

[92m All files are corrected!


Here you can download the ALTO files with the segmentation and the images to continue in eScriptorium

Then we OCRise the previously segmented images:

In [5]:
!~/.local/bin/kraken --alto --device cuda:0 --suffix ".xml" -I "content/images/*.xml" -f alto ocr -m "content/htr_model.mlmodel"
# If you don't have a GPU execute this line instead
#!kraken --alto --suffix ".xml" -I "content/images/*.xml" -f alto ocr -m "content/htr_model.mlmodel"
!mkdir -p content/data/doc_1
!mv content/images/*.xml content/data/doc_1
print('\033[92m All files are transcribed!')

scikit-learn version 1.2.2 is not supported. Minimum required version: 0.17. Maximum required version: 1.1.2. Disabling scikit-learn conversion API.
Torch version 2.0.1+cu117 has not been tested with coremltools. You may run into unexpected errors. Torch 2.0.0 is the most recent version that has been tested.
Loading ANN content/htr_model.mlmodel	[0m[32m✓[0m
[2KProcessing [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [35m100%[0m [35m28/28[0m [36m0:00:00[0m [33m0:00:04[0mm [33m0:00:04[0m
[?25hWriting recognition results for content/images/f100.xml	[0m[32m✓[0m
[2KProcessing [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [35m100%[0m [35m30/30[0m [36m0:00:00[0m [33m0:00:04[0mm [33m0:00:04[0m
[?25hWriting recognition results for content/images/f101.xml	[0m[32m✓[0m
[2KProcessing [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [35m100%[0m [35m27/27[0m [36m0:00:00[0m [33m0:00:05[0mm [33m0:00:05[0m
[?25hWriting recognition results for content

We need to correct the file name in the xml once again:

In [6]:
import os
import fileinput

for file in os.listdir(os.path.join("content","data","doc_1")):
    if file.endswith(".xml"):
      with fileinput.FileInput(os.path.join("content","data","doc_1",file), inplace=True) as f:
        for line in f:
          print(line.replace('content/images/',''), end='')
print('\033[92m All files are corrected!')

[92m All files are corrected!


Here you can download the ALTO files with the segmentation and the transcription, and the images to continue in eScriptorium

In [7]:
!zip -r {id_book}_altos_transcribed.zip content/data/doc_1/*xml
!zip -r {id_book}_facsimiles.zip content/images/*jpg
print('\033[92m You can now download the zip files in the root folder!')

  adding: content/data/doc_1/f100.xml (deflated 89%)
  adding: content/data/doc_1/f101.xml (deflated 88%)
  adding: content/data/doc_1/f102.xml (deflated 89%)
  adding: content/data/doc_1/f103.xml (deflated 88%)
  adding: content/data/doc_1/f104.xml (deflated 89%)
  adding: content/images/f100.jpg (deflated 0%)
  adding: content/images/f101.jpg (deflated 0%)
  adding: content/images/f102.jpg (deflated 0%)
  adding: content/images/f103.jpg (deflated 0%)
  adding: content/images/f104.jpg (deflated 0%)
[92m You can now download the zip files in the root folder!
