# Finetuning Donut for receipt parsing
With this code, you can fine-tune the powerful [Donut](https://huggingface.co/docs/transformers/model_doc/donut) to your own dataset, allowing you to create a custom model that can accurately extract the information you need from receipts. By tailoring the model to your specific requirements, you can ensure that it provides accurate and relevant output, improving efficiency and streamlining your workflow. 





## Getting Started

### Downloading Donut and installing packages
To begin, we will clone the [Donut](https://huggingface.co/docs/transformers/model_doc/donut) code repository from GitHub. Once downloaded, we will navigate to the directory containing the repository and proceed to install all the necessary packages and dependencies required for the Donut model (`!cd donut && pip install .`).

In [None]:
!git clone https://github.com/clovaai/donut.git
!cd donut && pip install .

## Your Dataset


### Upload
To use your data for finetuning [Donut](https://huggingface.co/docs/transformers/model_doc/donut), follow these steps:

1. Prepare your dataset as described in the repository [https://github.com/Inesence/REpro/tree/main/Creating_dataset](https://github.com/Inesence/REpro/tree/main/Creating_dataset).
2. Upload your dataset to your Github repository.
3. Clone and the data from Github by using the following code(replace <Inesence/receipts_LV> with the name of your repository):

In [None]:
%%bash
# clone repository
git clone https://github.com/Inesence/receipts_LV_public.git
# copy data
cp -r receipts_LV_public/data ./

### Subsetting

If you have a larger dataset that exceeds the memory capacity of Google Colab or your local resources when traing the model, you may need to cut your data into a smaller subset that can be easily handled. For me the cut-off was 200-220 image-key pairs. 

If you have enough resources to handle your data, skip this step.

In [None]:
import os

threshold = 200 # Enter your cut-off value here
main_folder = "/content/receipts_LV_public" # Enter your main folder path here

# define the path to the folder containing the files to delete
folder_path = main_folder + "/data/img/"

# loop through the files in the folder
for file_name in os.listdir(folder_path):
  
  # check if the file name contains "img" and a number, and if it has a file extension
  if "jpg" in file_name and file_name[:-4].isdigit():
    
    # extract the number from the file name
    file_num = int(file_name[:-4])
    
    # check if the file number is less than or equal to 40
    if file_num <= threshold:
      
      # build the full path to the file
      file_path = folder_path + file_name
      
      # delete the file
      os.remove(file_path)
      
# define the path to the folder containing the files to delete
folder_path = main_folder + "/data/key/"

# loop through the files in the folder
for file_name in os.listdir(folder_path):
  
  # check if the file name contains "img" and a number, and if it has a file extension
  if "json" in file_name and file_name[:-5].isdigit():
    
    # extract the number from the file name
    file_num = int(file_name[:-5])
    
    # check if the file number is less than or equal to 40
    if file_num <= threshold:
      
      # build the full path to the file
      file_path = folder_path + file_name
      
      # delete the file
      os.remove(file_path)

### Preparing your Dataset

**The following code and explanations have been directly sourced from [Philipp Schmid's guide](https://www.philschmid.de/fine-tuning-donut), and full credit goes to him.**

The next step is to prepare the dataset so that it conforms to the format that the Donut model requires. It is crucial to create a `metadata.json` file that stores important information about the images, such as the desired output.


In [4]:
import os
import json
from pathlib import Path
import shutil

# define paths
base_path = Path("data")
metadata_path = base_path.joinpath("key")
image_path = base_path.joinpath("img")
# define metadata list
metadata_list = []

# parse metadata
for file_name in metadata_path.glob("*.json"):
  with open(file_name, "r") as json_file:
    # load json file
    data = json.load(json_file)
    # create "text" column with json string
    text = json.dumps(data)
    # add to metadata list if image exists
    if image_path.joinpath(f"{file_name.stem}.jpg").is_file():
      metadata_list.append({"text":text,"file_name":f"{file_name.stem}.jpg"})
      # delete json file

# write jsonline file
with open(image_path.joinpath('metadata.jsonl'), 'w') as outfile:
    for entry in metadata_list:
        json.dump(entry, outfile)
        outfile.write('\n')

# remove old meta data
shutil.rmtree(metadata_path)


Now the data can be loaded using the `imagefolder` feature of `datasets` package.

In [None]:
import os
import json
from pathlib import Path
import shutil
from datasets import load_dataset

# define paths
base_path = Path("data")
metadata_path = base_path.joinpath("key")
image_path = base_path.joinpath("img")

# Load dataset
dataset = load_dataset("imagefolder", data_dir=image_path, split="train")

print(f"Dataset has {len(dataset)} images")
print(f"Dataset features are: {dataset.features.keys()}")


The Donut model is a type of sequence-to-sequence model that has a vision encoder and a text decoder. During the fine-tuning process, we want the model to generate text based on the images that we provide as input. To do so, we need to tokenize and preprocess the text before using it as input to the model.

In order to tokenize the text, we first need to transform the JSON string into a format that is compatible with the Donut model. To make this process easier, the ClovaAI team has developed a method called `json2token`, which we can use to create Donut-compatible documents from the JSON data.

In [6]:
new_special_tokens = [] # new tokens which will be added to the tokenizer
task_start_token = "<s>"  # start of task token
eos_token = "</s>" # eos token of tokenizer

def json2token(obj, update_special_tokens_for_json_key: bool = True, sort_json_key: bool = True):
    """
    Convert an ordered JSON object into a token sequence
    """
    if type(obj) == dict:
        if len(obj) == 1 and "text_sequence" in obj:
            return obj["text_sequence"]
        else:
            output = ""
            if sort_json_key:
                keys = sorted(obj.keys(), reverse=True)
            else:
                keys = obj.keys()
            for k in keys:
                if update_special_tokens_for_json_key:
                    new_special_tokens.append(fr"<s_{k}>") if fr"<s_{k}>" not in new_special_tokens else None
                    new_special_tokens.append(fr"</s_{k}>") if fr"</s_{k}>" not in new_special_tokens else None
                output += (
                    fr"<s_{k}>"
                    + json2token(obj[k], update_special_tokens_for_json_key, sort_json_key)
                    + fr"</s_{k}>"
                )
            return output
    elif type(obj) == list:
        return r"<sep/>".join(
            [json2token(item, update_special_tokens_for_json_key, sort_json_key) for item in obj]
        )
    else:
        # excluded special tokens for now
        obj = str(obj)
        if f"<{obj}/>" in new_special_tokens:
            obj = f"<{obj}/>"  # for categorical special tokens
        return obj


def preprocess_documents_for_donut(sample):
    # create Donut-style input
    text = json.loads(sample["text"])
    d_doc = task_start_token + json2token(text) + eos_token
    # convert all images to RGB
    image = sample["image"].convert('RGB')
    return {"image": image, "text": d_doc}

proc_dataset = dataset.map(preprocess_documents_for_donut)


Map:   0%|          | 0/221 [00:00<?, ? examples/s]

The next step involves two important tasks: tokenizing the text and encoding the images as tensors. To achieve this, we need to use the `DonutProcessor`. Additionally, we need to incorporate new special tokens into the tokenizer and adjust the image size during processing to reduce memory usage and accelerate the training process.

*Since my receipts are in Latvian language in the following code, I added the missing Latvian characters to the tokenizer by using `latvian_chars = ['ā', 'č', 'ē', 'ī', 'ķ', 'ļ', 'ņ', 'š', 'ū', 'ž']` and then applying `processor.tokenizer.add_tokens(latvian_chars)` This was necessary because not all of the Latvian characters were present in the tokenizer, and the addition of these characters to the tokenizer ensures that they will be properly recognized and processed during training.* 

*If you are training the Donut model in your language, it is important to ensure that the DonutProcessor module has all the necessary tokens to accurately generate the desired text output. In the event that some tokens are missing, you may need to add them manually to the tokenizer.*

*However, if the tokenizer does not support your language at all, you may need to consider using a different decoder altogether. It's important to note that the quality of the text output will be heavily dependent on the quality of the tokenizer and the availability of appropriate tokens. More about it in discussion here: [Finetune Donut with new tokenizer](https://discuss.huggingface.co/t/finetune-donut-with-new-tokenizer/30567/).*

*To test if the tokenizer has characters in your language you can use the code below. Just input the text in your language with characters different from english and test it. For example the tokenizer had all but three latvian characters - ļ,ņ and ķ.*

In [None]:
from transformers import DonutProcessor

# Load processor
processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base")

# Input text
text = "{'āēīõūļķčšņüäößü'}"

# Tokenize the input text
tokens = processor.tokenizer(text, add_special_tokens=True)

# Decode the tokens back to the original form
decoded_text = processor.tokenizer.decode(tokens['input_ids'])

print(f"Original text: {text}")
print(f"Decoded text: {decoded_text}")

Proceed here if your language is supported by the tokenizer. 

In [None]:
from transformers import DonutProcessor

# Load processor
processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base")

# add missing Latvian tokens to tokenizer
latvian_chars = ['ā', 'č', 'ē', 'ī', 'ķ', 'ļ', 'ņ', 'š', 'ū', 'ž']
processor.tokenizer.add_tokens(latvian_chars)

# add new special tokens to tokenizer
processor.tokenizer.add_special_tokens({"additional_special_tokens": new_special_tokens + [task_start_token] + [eos_token]})

# we update some settings which differ from pretraining; namely the size of the images + no rotation required
# resizing the image to smaller sizes from [1920, 2560] to [960,1280]
processor.feature_extractor.size = [720,960] # should be (width, height)
processor.feature_extractor.do_align_long_axis = False


The code below preprocesses data for fine-tuning the Donut model. It first converts the image into a tensor using the DonutProcessor, while also tokenizing the text in the sample. The tokenizer converts the text into a series of numerical IDs that the model can understand. The function then returns a dictionary with the image tensor, the tokenized text IDs (with padding), and the original text as a target sequence. Finally, the function is applied to the dataset using the map function, which applies the transformation to every sample in the dataset.

In [8]:
def transform_and_tokenize(sample, processor=processor, split="train", max_length=512, ignore_id=-100):
    # create tensor from image
    #sample["text"] = sample["text"].encode('utf-8').decode('utf-8')
    try:
        pixel_values = processor(
            sample["image"], random_padding=split == "train", return_tensors="pt"
        ).pixel_values.squeeze()
    except Exception as e:
        print(sample)
        print(f"Error: {e}")
        return {}

    # tokenize document
    input_ids = processor.tokenizer(
        sample["text"],
        add_special_tokens=False,
        max_length=max_length,
        padding="max_length",
        truncation=True,
        return_tensors="pt",
    )["input_ids"].squeeze(0)

    labels = input_ids.clone()
    labels[labels == processor.tokenizer.pad_token_id] = ignore_id  # model doesn't need to predict pad token
    return {"pixel_values": pixel_values, "labels": labels, "target_sequence": sample["text"]}

# need at least 32-64GB of RAM to run this
processed_dataset = proc_dataset.map(transform_and_tokenize,remove_columns=["image","text"])


Map:   0%|          | 0/221 [00:00<?, ? examples/s]

Now the dataset needs to be split into train and validation sets.

In [None]:
processed_dataset = processed_dataset.train_test_split(test_size=0.1)
print(processed_dataset)

##  Finetuning Donut model

**The following code and explanations have been directly sourced from [Philipp Schmid's guide](https://www.philschmid.de/fine-tuning-donut), and full credit goes to him.**

### Hugging Face Hub
To access your finetuned Donut model in the future, you will need to save it somewhere. One option is to use the Hugging Face Hub, a remote model versioning service. To do this, you will first need to [create an account with Hugging Face](https://huggingface.co/join). Once you have an account, you can log in to it from your notebook using the `notebook_login` utility provided by the `huggingface_hub` package. This will enable you to push your model to the Hub for versioning and sharing.

In [None]:
from huggingface_hub import notebook_login

notebook_login()

### Training
The next step is to initiate the training of our model. This is done by loading the pre-trained `naver-clova-ix/donut-base` model using the `VisionEncoderDecoderModel` class. The donut-base model, which is equipped with pre-trained weights, was originally introduced in the research paper "OCR-free Document Understanding Transformer" by Geewok et al..

Furthermore, apart from loading our model, we also need to make some adjustments before we start training. These include resizing the embedding layer to align with any new tokens that may have been added, adjusting the image size of our encoder to fit our dataset, and incorporating tokens that will be required for future inference.

In [None]:
import torch
from transformers import VisionEncoderDecoderModel, VisionEncoderDecoderConfig

# Load model from huggingface.com
model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base")

# Resize embedding layer to match vocabulary size
new_emb = model.decoder.resize_token_embeddings(len(processor.tokenizer))
print(f"New embedding size: {new_emb}")
# Adjust our image size and output sequence lengths
model.config.encoder.image_size = processor.feature_extractor.size[::-1] # (height, width)
model.config.decoder.max_length = len(max(processed_dataset["train"]["labels"], key=len))

# Add task token for decoder to start
model.config.pad_token_id = processor.tokenizer.pad_token_id
model.config.decoder_start_token_id = processor.tokenizer.convert_tokens_to_ids(['<s>'])[0]

# is done by Trainer
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

Next thing we need is to train our sequence-to-sequence model. We'll be using a Seq2SeqTrainer from the transformers library for this. We need to set up some training parameters, like how many epochs we want to train for, and then create a trainer object using our model and training data. Finally, we can save our trained model to the Hugging Face model hub. **Make sure to name your model in this variable `hf_repository_id`.**

In [None]:
from huggingface_hub import HfFolder
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer

# hyperparameters used for multiple args
hf_repository_id = "donut-base-finetuned-Latvian-receipts-v3" # Enter your desired model name here

# Arguments for training
training_args = Seq2SeqTrainingArguments(
    output_dir=hf_repository_id,
    num_train_epochs=3,
    learning_rate=2e-5,
    per_device_train_batch_size=2,
    weight_decay=0.01,
    fp16=True,
    logging_steps=50,
    save_total_limit=2,
    evaluation_strategy="no",
    save_strategy="epoch",
    predict_with_generate=True,
    # push to hub parameters
    report_to="tensorboard",
    push_to_hub=True,
    hub_strategy="every_save",
    hub_model_id=hf_repository_id,
    hub_token=HfFolder.get_token(),
)

# Create Trainer
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=processed_dataset["train"],
)

Now it is time to train our model!

In [None]:
# Start training
trainer.train()

Once the training process has been completed, push our processor onto the Hugging Face Hub.

In [None]:
# Save processor and create model card
processor.save_pretrained(hf_repository_id)
trainer.create_model_card()
trainer.push_to_hub()

## Evaluating the fine-tuned model
Now we will use our previously trained model to make predictions on a test document image. The run_prediction function is then defined to run inference on the image, by passing it through the model and generating a predicted output. The function takes in the test image, the model, and the processor as inputs. The image is processed and prepared for input into the model. Once the inference is complete, the predicted output is decoded and processed into a readable format. Finally, the predicted output and the truth sequence is printed for all test images.

In [None]:
import re
import transformers
from PIL import Image
from transformers import DonutProcessor, VisionEncoderDecoderModel
import torch
import random
import numpy as np

# hidde logs
transformers.logging.disable_default_handler()


# Load our model from Hugging Face
processor = DonutProcessor.from_pretrained("inesence/donut-base-finetuned-Latvian-receipts-v2")
model = VisionEncoderDecoderModel.from_pretrained("inesence/donut-base-finetuned-Latvian-receipts-v2")

# Move model to GPU
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)



def run_prediction(sample, model=model, processor=processor):
    # prepare inputs
    pixel_values = torch.tensor(test_sample["pixel_values"]).unsqueeze(0)
    task_prompt = "<s>"
    decoder_input_ids = processor.tokenizer(task_prompt, add_special_tokens=False, return_tensors="pt").input_ids

    # run inference
    outputs = model.generate(
        pixel_values.to(device),
        decoder_input_ids=decoder_input_ids.to(device),
        max_length=model.decoder.config.max_position_embeddings,
        early_stopping=True,
        pad_token_id=processor.tokenizer.pad_token_id,
        eos_token_id=processor.tokenizer.eos_token_id,
        use_cache=True,
        num_beams=1,
        bad_words_ids=[[processor.tokenizer.unk_token_id]],
        return_dict_in_generate=True,
    )

    # process output
    prediction = processor.batch_decode(outputs.sequences)[0]
    prediction = processor.token2json(prediction)

    # load reference target
    target = processor.token2json(test_sample["target_sequence"])
    return prediction, target

# Load data from the test set
for i in range(0,len(processed_dataset["test"])):
  test_sample = processed_dataset["test"][i]
  prediction, target = run_prediction(test_sample)
  print(f"Reference:\n {target}")
  print(f"Prediction:\n {prediction}")



Now we need to calculate the accuracy of the model by testing it on the processed dataset's test samples. The accuracy percentage is calculated by dividing the number of true matches by the total number of comparisons and multiplying it by 100. The result is printed on the console.

*Philipp Schmid reported an accuracy of 75%, but in my case, the accuracy is 33.3%. This might be due to a lower number of training examples or different output needs. For example, I require the receipt number to be outputted, which can be difficult even for humans because of inconsistent placement, identification, length, and structure of receipt numbers.*

In [None]:
from tqdm import tqdm

# define counter for samples
true_counter = 0
total_counter = 0

# iterate over dataset
for sample in tqdm(processed_dataset["test"]):
  prediction, target = run_prediction(test_sample)
  for s in zip(prediction.values(), target.values()):
    if s[0] == s[1]:
      true_counter += 1
    total_counter += 1

print(f"Accuracy: {(true_counter/total_counter)*100}%")



## Further resources

**Philipp Schmid's step-by-step guide to fine-tune Donut:** https://www.philschmid.de/fine-tuning-donut

**Neha Desaraju's step-by-step guide to fine-tune Donut:** https://towardsdatascience.com/ocr-free-document-understanding-with-donut-1acfbdf099be

**Sample receipt-key dataset:** https://github.com/zzzDavid/ICDAR-2019-SROIE ; https://github.com/zzzDavid/ICDAR-2019-SROIE/tree/master/data/key

**Niels Rogge tutorial on fine-tuning Donut**: https://github.com/NielsRogge/Transformers-Tutorials/blob/master/Donut/CORD/Fine_tune_Donut_on_a_custom_dataset_(CORD)_with_PyTorch_Lightning.ipynb


**Donut on Hugging Face:** https://huggingface.co/docs/transformers/model_doc/donut

**Training Tesseract-OCR with custom data:** https://saiashish90.medium.com/training-tesseract-ocr-with-custom-data-d3f4881575c0training-tesseract-ocr-with-custom-data-d3f4881575c0