EstebanGarces/OcciGen

Automatic transcription of handwritten Old Occitan

We propose an HTR approach that leverages the transformer architecture to recognize handwritten Old Occitan, a low-resource language, and we develop elaborate data augmentation techniques for both text and image data.

Our model combines a custom-trained Swin image encoder with a BERT-based text decoder, which we pre-train using a large-scale augmented synthetic data set and fine-tune on the small human-labeled data set.

Experimental results reveal that our approach surpasses the performance of current state-of-the-art models for Old Occitan HTR, including open-source transformer-based models such as a fine-tuned TrOCR and commercial applications like Google Cloud Vision.
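
For orientation, the encoder-decoder pairing can be assembled with Hugging Face's VisionEncoderDecoderModel. The snippet below is a minimal sketch, not our exact training setup: the Swin checkpoint matches the feature extractor used in the inference example further down, while the BERT checkpoint is an illustrative placeholder.

from transformers import VisionEncoderDecoderModel

# Minimal sketch: pair a Swin vision encoder with a BERT text decoder.
# The BERT checkpoint is an illustrative placeholder, not necessarily the
# decoder used in this project.
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "microsoft/swin-base-patch4-window7-224-in22k",  # Swin encoder
    "bert-base-multilingual-cased",                  # BERT decoder (assumption)
)

# The decoder's cross-attention weights are freshly initialized and are
# learned during pre-training; before training or generation, the special
# token ids must be set from the tokenizer, e.g.:
#   model.config.decoder_start_token_id = tokenizer.cls_token_id
#   model.config.pad_token_id = tokenizer.pad_token_id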

Main steps

  • Annotation of ~41k data points (80% train / 10% validation / 10% test)
  • Pre-processing of images and text to enhance data augmentation:
    • Synthetic image generation through an extended EMNIST dataset
    • Synthetic corpus generation through merged Old Occitan corpora
  • Byte-level BPE tokenization on a synthetic Old Occitan corpus (see the tokenizer sketch after this list)
  • VisionEncoderDecoder model pre-training with synthetic images and fine-tuning with real (rotated and dilated) images (see the augmentation sketch after this list)
  • Model selection from 24 experimental setups covering four vision encoders (BEiT, DeiT, ViT and Swin) and two language decoders (GPT-2 and BERT)
  • Swin + BERT is the best-performing model with a test performance of weighted CER 0.005 and 96.5% correctly predicted labels.
  • Benchmarking against external open-source and commercial tools: EasyOCR, PaddleOCR, Tesseract OCR, Google Cloud Vision, and a fine-tuned TrOCR
    • Our approach achieves SOTA results on the Old Occitan data set
  • Further experiments on an external data set show good performance (after automatic post-processing): weighted CER 0.011 and 92.1% correctly predicted labels
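
The byte-level BPE tokenizer shipped in this repository can be reproduced with the tokenizers library. Below is a minimal sketch in which the corpus file name, vocabulary size, and special tokens are illustrative assumptions, not the released training configuration.

from tokenizers import ByteLevelBPETokenizer

# Train a byte-level BPE tokenizer on the synthetic corpus.
# File name, vocab_size, and special tokens are illustrative assumptions.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["synthetic_corpus.txt"],
    vocab_size=5000,
    min_frequency=2,
    special_tokens=["<s>", "</s>", "<pad>", "<unk>", "<mask>"],
)

# Save as a single JSON file, loadable with PreTrainedTokenizerFast
# (as in the inference example below).
tokenizer.save("byte-level-BPE.tokenizer.json")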
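
Similarly, the rotation and dilation applied to the real images before fine-tuning can be sketched with Pillow; the rotation angle and filter size below are illustrative, not the parameters used in the project.

from PIL import Image, ImageFilter

# Sketch of the rotate-and-dilate image augmentation (illustrative parameters).
image = Image.open("example.png").convert("L")

# Slight rotation; fill the exposed corners with white background.
rotated = image.rotate(2, expand=True, fillcolor=255)

# For dark ink on a light background, a minimum filter dilates the strokes.
dilated = rotated.filter(ImageFilter.MinFilter(3))

dilated.save("example_augmented.png")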

Project Organization

├── LICENSE
├── README.md          <- This file.
├── requirements.txt   <- Python package requirements.
├── .gitattributes     <- The extensions to be tracked with Git LFS.
│
├── augmentation       <- Includes the synthetic corpus and the extended EMNIST data set.
│
├── benchmarking       <- Includes benchmarking results with external OCR/HTR tools as well as benchmarking on an external Old Occitan data set.
│
├── data               <- Includes the "Pansier corpus" and merged corpora file, plus labels and original and preprocessed images for the DOM project and a different Occitan data set.
│
├── error_analysis     <- Includes file with the test set performance of our final model. Used for visualizations and benchmarking.
│    
├── model              <- Includes best-performing model (Swin + BERT) with its config.json file.
│
├── model_comparison   <- Includes inference results on validation set for 4 x 2 x 3 = 24 combinations of vision encoders, language decoders, and training setups.
│
├── src                <- Contains the source code for training (model and tokenizer), inference, data preparation (images and text), and benchmarking evaluation.
│
└── tokenizer          <- Includes the byte-level BPE tokenizer and the corpus it is trained on.

Example: Swin + BERT inference and CER calculation in Python

import os
import torch
from torch.utils.data import Dataset
from transformers import VisionEncoderDecoderModel, PreTrainedTokenizerFast, AutoFeatureExtractor
from PIL import Image
import pandas as pd
from torchmetrics import CharErrorRate
import numpy as np


# Specify the path to the OcciGen repository
repository_path = '/path/to/OcciGen/'

# Construct the paths relative to the repository
tokenizer_path = os.path.join(repository_path, 'tokenizer/')
model_path = os.path.join(repository_path, 'model/')
image_path = os.path.join(repository_path, 'data/dom_project/test_preprocessed/')
label_path = os.path.join(repository_path, 'data/dom_project/blue_cards_labels.xlsx')

image_dataframe = pd.read_excel(label_path)
image_dataframe = image_dataframe[image_dataframe["split"] == "test"]
image_names = list(image_dataframe['preproc_file_name'])
image_labels = list(image_dataframe['label'])

# Load model
tokenizer = PreTrainedTokenizerFast(tokenizer_file=os.path.join(tokenizer_path, "byte-level-BPE.tokenizer.json"))
feature_extractor = AutoFeatureExtractor.from_pretrained("microsoft/swin-base-patch4-window7-224-in22k")
model = VisionEncoderDecoderModel.from_pretrained(model_path)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)
model.to(device)

# Define metric
metric = CharErrorRate()

# Collect labels, predictions, and per-example CER values
labels_list = []
predictions_list = []
cer_list = []

# Define handling of data samples
class CustomDataset(Dataset):
    def __init__(self, image_names, image_path, feature_extractor, transform=None):
        self.image_names = image_names
        self.image_path = image_path
        self.feature_extractor = feature_extractor
        self.transform = transform

    def __len__(self):
        return len(self.image_names)

    def __getitem__(self, idx):
        image_name = self.image_names[idx]
        image = Image.open(self.image_path + image_name)
        
        # Convert image to RGB mode
        if image.mode != 'RGB':
            image = image.convert('RGB')

        if self.transform:
            image = self.transform(image)

        # Return a (1, C, H, W) pixel-value tensor; it is moved to the device in the loop below
        pixel_values = self.feature_extractor(images=image, return_tensors="pt").pixel_values
        return pixel_values
    

dataset = CustomDataset(image_names, image_path, feature_extractor, transform=None)

# Inference and CER calculation
for i in range(len(image_labels)):
    pixel_values = dataset[i].to(device)  # Move the image tensor to the device
    try:
        # Generate ids predictions
        generated_ids = model.generate(
            pixel_values,
            no_repeat_ngram_size=100,
            max_length=200,
            num_beams=1,
            num_return_sequences=1
        )

        # Decode predictions
        generated_text = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)

        # Calculate CER for each prediction in the batch
        label = image_labels[i]
        generated_text_i = generated_text[0]  # Since num_return_sequences=1, take the first generated sequence
        cer = metric(preds=generated_text_i, target=label).item()
        cer = np.round(cer, 5)

        labels_list.append(label)
        predictions_list.append(generated_text_i)
        cer_list.append(cer)

        print(f'Test example number {len(cer_list)}:\nLabel: {label}\nPrediction: {generated_text_i}\nCER: {cer}\nCurrent mean CER: {np.mean(cer_list)}\n')

    except TypeError as e:
        typekey = (pixel_values.shape, pixel_values.dtype)
        raise TypeError(f"Cannot handle this data type: {typekey[0]}, {typekey[1]}") from e

mean_cer = np.mean(cer_list)
mean_cer = np.round(mean_cer, 5)
print(f'\nMean CER over {len(cer_list)} test examples: {mean_cer}\n')

# Specify output directory
output_dir = '/your/output/directory/'

# Summarize and export inference results
output_df = pd.DataFrame({'Label': labels_list, 'Prediction': predictions_list, 'CER': cer_list})
output_df.to_excel(os.path.join(output_dir, 'inference_results.xlsx'), index=False)
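
For reference, the character error rate used above is the Levenshtein (edit) distance between prediction and label, normalized by the number of characters in the label; this is what torchmetrics' CharErrorRate computes. A standalone sketch of the same computation:

def char_error_rate(prediction: str, label: str) -> float:
    # Levenshtein distance via dynamic programming over a single row.
    m, n = len(prediction), len(label)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(
                dp[j] + 1,                                   # deletion
                dp[j - 1] + 1,                               # insertion
                prev + (prediction[i - 1] != label[j - 1]),  # substitution
            )
            prev = cur
    # Normalize by the reference length.
    return dp[n] / n if n else float(m > 0)

print(char_error_rate("chantar", "cantar"))  # 1 edit / 6 chars ≈ 0.16667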

