# Processing facsimiles for HTR

Simon Gabay, University of Geneva

<img alt="Licence Creative Commons" style="border-width:0" src="https://i.creativecommons.org/l/by-sa/4.0/88x31.png" align="right"/>

### a. Objectives

This notebook helps users to:
- process documents end to end, up to a TEI output
- segment documents before uploading them to eScriptorium for correction

### b. Remarks

This notebook is adapted for the [OpenOnDemand](https://ondemand.baobab.hpc.unige.ch) service of the UniGE. If you want to use OpenOnDemand, you need to [ask first for an HPC account](https://catalogue-si.unige.ch/hpc).

This notebook **should be** compatible with Colab. Colab-specific sections are marked with the Colab (<img width="25px" src="https://upload.wikimedia.org/wikipedia/commons/thumb/d/d0/Google_Colaboratory_SVG_Logo.svg/320px-Google_Colaboratory_SVG_Logo.svg.png"/>) logo. You can open the notebook directly in Colab with the following link:

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/FoNDUE-HTR/Documentation/blob/master/notebook_pipeline.ipynb)

⚠️ GPU use is activated: be careful when running the notebook on other services. Colab offers limited use without a subscription; other environments (local, mybinder…) might not offer GPUs.

### c. Credits

The following work would not exist without the help of:
- [A. Pinche](https://ciham.cnrs.fr/annuaire/membres_statutaires/ariane-pinche), CNRS (page modelisation)
- [Th. Clérice](https://almanach.inria.fr/people-fr.html), INRIA Paris (computer vision)
- [K. Christensen](https://medialab.sciencespo.fr/equipe/kelly-christensen/), Sciences Po Paris (TEI conversion)
- [M. Humeau](https://crc.mnhn.fr/fr/annuaire/maxime-humeau-9510), Université de Genève / Museum national d'histoire naturelle (notebook)
- [Fl. Goy](https://www.unige.ch/ihr/fr/linstitut/lequipe/collaborateur-trices-projets-fns/floriane-goy/) for beta testing.

## 1. Set up

First set the variables for your document and models; the GPU check follows in section 1.2.

### 1.1 Initialisation

⚠️ Don't forget to create the `content/` directory, if it does not exist yet, and to put your PDF file in it.
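
A minimal sketch for creating the folder from Python, assuming the notebook is run from the directory that should contain `content/`. The helper name `ensure_dir` is illustrative and not part of the notebook.

```python
from pathlib import Path

def ensure_dir(path="content"):
    """Create the directory (and its parents) if missing; return it as a Path."""
    folder = Path(path)
    # exist_ok=True makes the call safe to repeat when the folder already exists
    folder.mkdir(parents=True, exist_ok=True)
    return folder

content_dir = ensure_dir("content")
print(content_dir.resolve())
```

Running the cell twice is harmless, which is convenient when the notebook is re-executed from the top.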

In [5]:
id_book = ""

In [6]:
filename_pdf = ""

##### 1.1.2 Models

In [7]:
model_htr = "https://github.com/FoNDUE-HTR/Documentation/releases/download/v.0.9/fondue_emmental.mlmodel"

In [8]:
model_segmonto = "https://zenodo.org/records/10972956/files/CapricciosaX.pt?download=1"

### 1.2 Configuration

In [9]:
!nvidia-smi

Sat Mar 16 12:18:41 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.10              Driver Version: 551.61         CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA GeForce RTX 4070 Ti     On  |   00000000:01:00.0  On |                  N/A |
|  0%   36C    P8             14W /  285W |    1298MiB /  12282MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
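
The same check can be sketched from Python, which also gives a device string to reuse in later cells. This assumes PyTorch is importable (it is normally installed together with Kraken); `pick_device` is a hypothetical helper name.

```python
import shutil

def pick_device():
    """Return "cuda:0" if PyTorch sees a GPU, otherwise "cpu"."""
    try:
        import torch  # assumption: installed as a dependency of the HTR tools
    except ImportError:
        return "cpu"
    return "cuda:0" if torch.cuda.is_available() else "cpu"

device = pick_device()
has_smi = shutil.which("nvidia-smi") is not None
print(f"Device: {device} (nvidia-smi found: {has_smi})")
```

The returned string has the same form as the value passed to `--device` in the segmentation and recognition commands.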
                                                

We will use two principal tools for information extraction:

- To segment the pages, we are going to use [YALTAi](https://github.com/PonteIneptique/YALTAi) developed by Thibault Clérice (more info: [arXiv.2207.11230](https://doi.org/10.48550/arXiv.2207.11230)).
- To extract the text we use [Kraken](https://github.com/mittagessen/kraken) developed by Benjamin Kiessling (more info: [10.34894/Z9G2EX](https://doi.org/10.34894/Z9G2EX)).

⚠️ YALTAi includes Kraken, so there is no need to install it separately.

In [10]:
!pip install --upgrade pip
!pip install YALTAi
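
To confirm what the installation actually pulled in, a small sketch using only the standard library can list the installed versions. The distribution names checked here (`YALTAi`, `kraken`, `ultralytics`) are assumptions about how the packages are published; `installed_version` is an illustrative helper.

```python
from importlib import metadata

def installed_version(package):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

for name in ("YALTAi", "kraken", "ultralytics"):
    print(name, installed_version(name) or "not installed")
```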



## 2. Manage PDF images

In [19]:
!rm -fr content/images/

⚠️ If you want to upload a PDF:

- create the `content` folder with the line below
- upload your PDF
- move the PDF into the `content` folder
- rename the PDF `doc.pdf`, or set `filename_pdf` to its actual name

In [20]:
pdf_path = "content/" + filename_pdf
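
Before converting, it can be worth checking that the path really points to a PDF. The sketch below only tests that the file exists and starts with the usual `%PDF-` signature; it is a quick sanity check, not a full validation, and `looks_like_pdf` is a hypothetical helper.

```python
from pathlib import Path

def looks_like_pdf(path):
    """True if the file exists and starts with the PDF signature bytes."""
    p = Path(path)
    if not p.is_file():
        return False
    with p.open("rb") as fh:
        return fh.read(5) == b"%PDF-"

# Hypothetical usage with the variable defined above:
# if not looks_like_pdf(pdf_path):
#     raise FileNotFoundError(f"No readable PDF at {pdf_path}")
```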

We now need to convert this pdf into images:


In [22]:
# Convert pdf into images
!pip install pypdfium2
import pypdfium2 as pdfium
# Provide the path to the pdf
pdf = pdfium.PdfDocument(pdf_path)
# Get the number of pages
n_pages = len(pdf)
# Convert every page to JPEG, one after the other:
for page_number in range(n_pages):
    page = pdf.get_page(page_number)
    # Choose the rendering options (resolution, rotation, cropping)
    pil_image = page.render(
        scale=5, # 1=72dpi, increase for a better resolution
        rotation=0, # no rotation
        crop=(0, 0, 0, 0), # no cropping
    ).to_pil()
    pil_image.save(f"content/page_{page_number+1:05d}.jpg", 'JPEG')
# Remove the PDF, which is no longer needed
!rm {pdf_path}
# Move the images into a dedicated folder
!mkdir -p content/images
!mv content/*jpg content/images/
print('\033[92m Images extracted!')

Images extracted!
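
The `scale` argument in the conversion cell multiplies the base resolution of 72 dpi. A short worked example, assuming an A4 page of about 595 x 842 PDF points (the exact pixel size produced by the renderer may differ by a pixel because of rounding):

```python
def rendered_size(width_pt, height_pt, scale):
    """Approximate pixel size and dpi of a page rendered at `scale` (1 = 72 dpi)."""
    dpi = 72 * scale
    width_px = round(width_pt * scale)
    height_px = round(height_pt * scale)
    return width_px, height_px, dpi

# An A4 page measures about 595 x 842 PDF points.
print(rendered_size(595, 842, scale=5))  # (2975, 4210, 360)
```

With `scale=5` each page is therefore rendered at 360 dpi; lowering the scale gives smaller files at the cost of detail.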


## 3. Image segmentation

Some models are already available. We are going to use one trained at the University of Geneva. This model performs layout analysis using the controlled vocabulary [SegmOnto](https://segmonto.github.io).

SegmOnto aims to model the page in a way that is as universal as possible.

<table>
  <tr>
    <th>Historical Print</th>
    <th>Medieval manuscript</th>
  </tr>
  <tr>
    <td><img src="https://github.com/gabays/CHR_2023/blob/main/images/btv1b86070385_f140_ann.jpg?raw=1" height="300px"></td>
    <td><img src="https://github.com/gabays/CHR_2023/blob/main/images/btv1b84259980_f29_ann.jpg?raw=1" height="250px"></td>
  </tr>
</table>

Data have been prepared under the supervision of Ariane Pinche (CNRS) and Simon Gabay (UniGE) with [eScriptorium](https://ieeexplore.ieee.org/document/8893029), an open source web app to prepare data.

<img src="https://github.com/gabays/CHR_2023/blob/main/images/escriptorium.png?raw=1" height="300px">

The University of Geneva is contributing via its own instance called [FoNDUE](https://www.unige.ch/lettres/humanites-numeriques/recherche/projets-de-la-chaire/fondue). The FoNDUE project aims at interfacing eScriptorium with HPC clusters using slurm (right) and not a single machine like other instances (left).

<img src="https://raw.githubusercontent.com/gabays/CHR_2023/main/images/Fondue.png" height="250px">


In [23]:
# Download the model
!wget {model_segmonto} -O content/seg_model.pt
# Load the model
from ultralytics import YOLO
model = YOLO("content/seg_model.pt")
# Use the GPU if you have one: uncomment the next line (leave it commented on a machine without GPU)
#model.to('cuda')
# Get info about the model
model.info()
# Fuse PyTorch Conv2d and BatchNorm2d layers. This improves inference time and therefore execution time.
model.fuse()

Model summary: 295 layers, 25865005 parameters, 0 gradients, 79.1 GFLOPs
Model summary (fused): 218 layers, 25848445 parameters, 0 gradients, 78.7 GFLOPs
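
The fused summary reports fewer layers and parameters because each batch normalisation is folded into the layer before it. The idea can be sketched on scalars, as below. This illustrates the general arithmetic only and is not Ultralytics' implementation; all names and values are made up for the example.

```python
import math

def fuse_scale_and_bn(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold gamma * ((w*x + b) - mean) / sqrt(var + eps) + beta into w_f*x + b_f."""
    k = gamma / math.sqrt(var + eps)
    return w * k, (b - mean) * k + beta

def unfused(x, w, b, gamma, beta, mean, var, eps=1e-5):
    """Linear step followed by batch normalisation in inference mode."""
    return gamma * ((w * x + b) - mean) / math.sqrt(var + eps) + beta

params = dict(w=0.8, b=0.1, gamma=1.5, beta=-0.2, mean=0.3, var=0.04)
w_f, b_f = fuse_scale_and_bn(**params)
x = 2.0
# The two values agree up to floating-point rounding
print(unfused(x, **params), w_f * x + b_f)
```

After fusion a single multiply-add replaces two operations, which is why inference gets faster without changing the predictions.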


Let's use it now!

## 4. Optical character recognition

I now need a Kraken model. I download a generic model for prints.

In [24]:
!wget {model_htr} -O content/htr_model.mlmodel

First we segment:
- the image into zones (with our model)
- the lines (with [blla model](https://github.com/mittagessen/kraken/blob/main/kraken/blla.mlmodel)).
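
Once the segmentation cell below has run, each page has an ALTO file listing its zones and lines. The sketch below counts them on a tiny hand-written sample. It assumes the ALTO v4 namespace and that zone types are stored as `OtherTag` labels referenced through `TAGREFS`; real files should be checked against this assumption.

```python
import xml.etree.ElementTree as ET
from collections import Counter

ALTO_NS = {"a": "http://www.loc.gov/standards/alto/ns-v4#"}

# Hand-written sample, much smaller than a real ALTO file
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Tags>
    <OtherTag ID="T1" LABEL="MainZone"/>
    <OtherTag ID="T2" LABEL="NumberingZone"/>
  </Tags>
  <Layout><Page><PrintSpace>
    <TextBlock ID="b1" TAGREFS="T1"><TextLine ID="l1"/><TextLine ID="l2"/></TextBlock>
    <TextBlock ID="b2" TAGREFS="T2"><TextLine ID="l3"/></TextBlock>
  </PrintSpace></Page></Layout>
</alto>"""

def count_zones_and_lines(alto_xml):
    """Return ({zone label: count}, number of lines) for one ALTO document."""
    root = ET.fromstring(alto_xml)
    labels = {t.get("ID"): t.get("LABEL") for t in root.iterfind(".//a:OtherTag", ALTO_NS)}
    zones = Counter()
    for block in root.iterfind(".//a:TextBlock", ALTO_NS):
        refs = (block.get("TAGREFS") or "").split()
        label = labels.get(refs[0], "unlabelled") if refs else "unlabelled"
        zones[label] += 1
    n_lines = sum(1 for _ in root.iterfind(".//a:TextLine", ALTO_NS))
    return dict(zones), n_lines

print(count_zones_and_lines(SAMPLE))
```

A page with no `MainZone` or with zero lines is usually a sign that the segmentation should be inspected before recognition.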

In [25]:
!yaltai kraken --device cuda:0 -I "content/images/*.jpg" --suffix ".xml" segment --yolo content/seg_model.pt
# If you don't have a GPU, run this line instead
#!yaltai kraken --device cpu -I "content/images/*.jpg" --suffix ".xml" segment --yolo content/seg_model.pt
print('\033[92m Segmentation done!')

scikit-learn version 1.2.2 is not supported. Minimum required version: 0.17. Maximum required version: 1.1.2. Disabling scikit-learn conversion API.
Torch version 2.0.1+cu117 has not been tested with coremltools. You may run into unexpected errors. Torch 2.0.0 is the most recent version that has been tested.
Loading ANN /home/rayondemiel/univ_geneve/iiif2alto/Documentation/.env/lib/python3.10/site-packages/kraken/blla.mlmodel
Segmenting
image 1/1 /home/rayondemiel/univ_geneve/iiif2alto/Documentation/content/images/page_00109.jpg: 896x608 1 MainZone, 1 NumberingZone, 1 RunningTitleZone, 16.3ms
Speed: 6.8ms preprocess, 16.3ms inference, 9.4ms postprocess per image at shape (1, 3, 896, 608)
✓
Segmenting
image 1/1 /home/rayondemiel/univ_geneve/iiif2alto/Documentation/content/images/page_00137.jpg: 896x608 1 MainZone, 1 NumberingZone, 1 RunningTitleZone, 15.0ms
Speed: 3.2ms preprocess, 15.0ms inference, 0.9ms postprocess per image at shape (1, 3, 896, 608)
✓


We need to correct the name of the image file in the xml file:

In [26]:
import os
import fileinput

# Strip the folder prefix from the image path recorded in each ALTO file
for file in os.listdir(os.path.join("content", "images")):
    if file.endswith(".xml"):
        with fileinput.FileInput(os.path.join("content", "images", file), inplace=True) as f:
            for line in f:
                # With inplace=True, print() writes the line back into the file
                print(line.replace('content/images/', ''), end='')
print('\033[92m All files are corrected!')

All files are corrected!


At this point you can download the ALTO files with the segmentation, together with the images, to continue in eScriptorium.
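
A possible way to bundle these files for download, sketched with the standard `zipfile` module. The helper `zip_files` and the archive name in the usage comment are assumptions, not part of the notebook.

```python
import zipfile
from pathlib import Path

def zip_files(folder, patterns, archive):
    """Write every file in `folder` matching one of `patterns` into `archive`."""
    folder = Path(folder)
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for pattern in patterns:
            for path in sorted(folder.glob(pattern)):
                # Store only the file name, so the archive has a flat layout
                zf.write(path, arcname=path.name)
    return archive

# Hypothetical usage; the archive name is an assumption:
# zip_files("content/images", ["*.xml", "*.jpg"], "segmented.zip")
```

Keeping the ALTO files and the images side by side in one flat archive matches the image file names recorded in the XML after the correction step above.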

Then we run OCR on the previously segmented images:

In [27]:
!kraken --alto --device cuda:0 --suffix ".xml" -I "content/images/*.xml" -f alto ocr -m "content/htr_model.mlmodel"
# If you don't have a GPU execute this line instead
#!kraken --alto --suffix ".xml" -I "content/images/*.xml" -f alto ocr -m "content/htr_model.mlmodel"
!mkdir -p content/data/doc_1
!mv content/images/*.xml content/data/doc_1
print('\033[92m All files are transcribed!')

scikit-learn version 1.2.2 is not supported. Minimum required version: 0.17. Maximum required version: 1.1.2. Disabling scikit-learn conversion API.
Torch version 2.0.1+cu117 has not been tested with coremltools. You may run into unexpected errors. Torch 2.0.0 is the most recent version that has been tested.
Loading ANN content/htr_model.mlmodel	✓
Processing 0% 0/0
Writing recognition results for content/images/page_00171.xml	✓
Processing 100% 27/27
Writing recognition results for content/images/page_00022.xml	✓
Processing 100% 31/31
[?25hWriting recognition results for content/images/

We need to correct the file name in the xml once again:

In [28]:
import os
import fileinput

# Strip the folder prefix from the image path recorded in each transcribed ALTO file
for file in os.listdir(os.path.join("content", "data", "doc_1")):
    if file.endswith(".xml"):
        with fileinput.FileInput(os.path.join("content", "data", "doc_1", file), inplace=True) as f:
            for line in f:
                # With inplace=True, print() writes the line back into the file
                print(line.replace('content/images/', ''), end='')
print('\033[92m All files are corrected!')

All files are corrected!


Here you can download the ALTO files (segmentation and transcription) and the images to continue in eScriptorium:
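
Before downloading, a quick look at the recognised text can help estimate how much correction will be needed. The sketch below reads the transcription out of a small hand-written ALTO sample. It assumes the ALTO v4 namespace and that the text is carried by the `CONTENT` attribute of `String` elements.

```python
import xml.etree.ElementTree as ET

ALTO_NS = {"a": "http://www.loc.gov/standards/alto/ns-v4#"}

# Hand-written sample, much smaller than a real ALTO file
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Layout><Page><PrintSpace><TextBlock ID="b1">
    <TextLine ID="l1"><String CONTENT="Premiere ligne"/></TextLine>
    <TextLine ID="l2"><String CONTENT="du"/><SP/><String CONTENT="texte"/></TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""

def alto_lines(alto_xml):
    """Return one string per TextLine, joining its String elements with spaces."""
    root = ET.fromstring(alto_xml)
    lines = []
    for line in root.iterfind(".//a:TextLine", ALTO_NS):
        words = [s.get("CONTENT", "") for s in line.iterfind("a:String", ALTO_NS)]
        lines.append(" ".join(words))
    return lines

print(alto_lines(SAMPLE))
```

For a real page, the file content can be passed as bytes, for instance `alto_lines(Path("content/data/doc_1/page_00001.xml").read_bytes())` after importing `Path` from `pathlib`.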

In [None]:
!zip -r {id_book}_altos_transcribed.zip content/data/doc_1/*xml
!zip -r {id_book}_facsimiles.zip content/images/*jpg
print('\033[92m You can now download the zip files in the root folder!')

  adding: content/data/doc_1/page_00001.xml (deflated 87%)
  adding: content/data/doc_1/page_00002.xml (deflated 91%)
  adding: content/data/doc_1/page_00003.xml (deflated 92%)
  adding: content/data/doc_1/page_00004.xml (deflated 88%)
  adding: content/data/doc_1/page_00005.xml (deflated 81%)
  adding: content/data/doc_1/page_00006.xml (deflated 86%)
  adding: content/data/doc_1/page_00007.xml (deflated 87%)
  adding: content/data/doc_1/page_00008.xml (deflated 87%)
  adding: content/data/doc_1/page_00009.xml (deflated 87%)
  adding: content/data/doc_1/page_00010.xml (deflated 58%)
  adding: content/data/doc_1/page_00011.xml (deflated 58%)
  adding: content/data/doc_1/page_00012.xml (deflated 88%)
  adding: content/data/doc_1/page_00013.xml (deflated 89%)
  adding: content/data/doc_1/page_00014.xml (deflated 89%)
  adding: content/data/doc_1/page_00015.xml (deflated 89%)
  adding: content/data/doc_1/page_00016.xml (deflated 89%)
  adding: content/data/doc_1/page_00017.xml (deflated 89