# Set-up environment

Please only use ML runtime 15.0

In [0]:
%pip install -q pytorch-lightning wandb

In [0]:
%pip install -q transformers datasets sentencepiece

In [0]:
dbutils.library.restartPython()


# Introduction

## Load image

We load a boarding pass image from the internet. In the next step, we check how well existing models can extract information from a boarding pass.

In [0]:
import requests
import warnings
from PIL import Image
import io

# Load images from a web url 
def load_image_from_url(url):
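    try:
        # Download the image from the URL (fail fast on timeouts and HTTP errors)
        response = requests.get(url, timeout=30)
        response.raise_for_status()

        # Raw bytes of the downloaded image
        binary_image = response.content

        # Open the image using PIL and convert to RGB (required for tuning)
        image = Image.open(io.BytesIO(binary_image)).convert("RGB")

        return image

    except Exception as e:
        # Report the failure instead of silently returning None
        warnings.warn(f"Could not load image from {url}: {e}")
        return None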


In [0]:
# Example usage
image = load_image_from_url("https://as1.ftcdn.net/v2/jpg/05/08/32/28/1000_F_508322849_dvGX1itSDANr0Tsr7AX032FuiAnDUgbP.jpg")

image

## Load model and processor

Generative AI models are trained on vast amounts of data. For example, a Generative AI (GenAI) model designed to parse text from documents has been exposed to hundreds of millions, or even billions, of documents during its pre-training. This implies that these models will likely already perform decently in extracting information from a boarding pass, saving us a significant amount of time and money by not having to start from scratch.

In this notebook, we will utilize **Donut**, a free Document Understanding Transformer. Additional information can be found on the Hugging Face [website](https://huggingface.co/docs/transformers/main/en/model_doc/donut).


In [0]:
from transformers import DonutProcessor, VisionEncoderDecoderModel

pretrained_processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base-finetuned-cord-v2")
pretrained_model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base-finetuned-cord-v2")

## Understanding the GenAI process

Even though GenAI models have human-like capabilities in understanding and generating text, we cannot directly interact with these models. We have to apply a multi-step process:

### 1. **Encoding**

Encoding involves translating input, such as natural language or images, into a representation understandable by a GenAI model. Let's illustrate this with an example of natural language encoding:

```
"Say, hello world!" --> Encoding/Tokenization --> [2435, 3158, 5448, 4686, 94848, 7466, 6556, 45654, 46652]
```

Each ID in the sequence represents a combination of characters:

```
                                                    Sa    y     ,    he    l     lo    wor    ld      !
```

Tokenization describes a process where each word, word piece, or character is assigned a unique ID. A tokenizer functions essentially as a very large dictionary that maps words and characters to unique IDs. It's crucial to note that if a tokenizer encounters characters or words that are not in its vocabulary, it cannot assign a meaningful ID (typically it falls back to a generic "unknown" token) and consequently cannot represent those characters.
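
To make the dictionary idea concrete, here is a minimal sketch of a toy tokenizer. The vocabulary reuses the made-up pieces and IDs from the illustration above, and `UNK_ID` is a hypothetical fallback ID; real tokenizers such as the one shipped with Donut are far larger and handle whitespace differently.

```python
# Toy vocabulary: pieces and IDs mirror the illustration above; they are made up.
VOCAB = {"Sa": 2435, "y": 3158, ",": 5448, "he": 4686, "l": 94848,
         "lo": 7466, "wor": 6556, "ld": 45654, "!": 46652}
UNK_ID = 3  # hypothetical ID for pieces that are not in the vocabulary


def encode(text):
    """Greedy longest-match tokenization; whitespace is skipped for simplicity."""
    ids, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        # Try the longest possible piece first, then shorter ones
        for end in range(len(text), pos, -1):
            if text[pos:end] in VOCAB:
                ids.append(VOCAB[text[pos:end]])
                pos = end
                break
        else:
            # No known piece starts here: fall back to the unknown token
            ids.append(UNK_ID)
            pos += 1
    return ids


print(encode("Say, hello world!"))
print(encode("Say ?"))  # "?" is not in the vocabulary
```

The second call shows the limitation described above: the question mark is replaced by the unknown ID, so its identity is lost.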

### 2. **Generate**

In this process, we employ the GenAI model. Suppose we have an instruction-following chat model, and we want the model to say "hello world!" We provide the tokenized sequence to the GenAI model, and the model predicts the next token. This process continues until a stop token (e.g., `</s>`) is predicted.

- **Iteration 1:** 
```
[2435, 3158, 5448, 4686, 94848, 7466, 6556, 45654, 46652] --> GenAI model --> [4686]
```

- **Iteration 2:**
```
[2435, 3158, 5448, 4686, 94848, 7466, 6556, 45654, 46652, 4686] --> GenAI model --> [4686, 94848]
```

- **Iteration 3:**
```
[2435, 3158, 5448, 4686, 94848, 7466, 6556, 45654, 46652, 4686, 94848] --> GenAI model --> [4686, 94848, 7466]
```

- **Iteration 4:**
```
[2435, 3158, 5448, 4686, 94848, 7466, 6556, 45654, 46652, 4686, 94848, 7466] --> GenAI model --> [4686, 94848, 7466, 6556]
```

- **Iteration 5:**
```
[2435, 3158, 5448, 4686, 94848, 7466, 6556, 45654, 46652, 4686, 94848, 7466, 6556] --> GenAI model --> [4686, 94848, 7466, 6556, 45654]
```

- ...:

- ...:

- **Output:** 
```
[4686, 94848, 7466, 6556, 45654, 46652, 2]
```

### 3. **Decoding**

In the decoding process, we translate the token sequence back into natural language.

```
[4686, 94848, 7466, 6556, 45654, 46652, 2] --> Decoding --> hello world!
```
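
The generate and decode steps can be sketched as a simple loop. This is only an illustration under stated assumptions: `toy_model` is a hypothetical stand-in that replays a canned reply instead of predicting anything, and the ID-to-text mapping (including the leading space in `" wor"`) is made up to match the example above.

```python
EOS_ID = 2  # stop token ID, as in the illustration above

# Hypothetical stand-in for a GenAI model: a fixed reply it "wants" to produce.
CANNED_REPLY = [4686, 94848, 7466, 6556, 45654, 46652, EOS_ID]


def toy_model(input_ids, prompt_length):
    """Return the next token ID given the prompt plus everything generated so far."""
    already_generated = len(input_ids) - prompt_length
    return CANNED_REPLY[already_generated]


def generate(prompt_ids, max_new_tokens=20):
    input_ids = list(prompt_ids)
    generated = []
    for _ in range(max_new_tokens):
        next_id = toy_model(input_ids, len(prompt_ids))
        input_ids.append(next_id)  # feed the prediction back into the model
        generated.append(next_id)
        if next_id == EOS_ID:      # stop once the stop token is predicted
            break
    return generated


# Made-up mapping from IDs back to text pieces
ID_TO_PIECE = {4686: "he", 94848: "l", 7466: "lo", 6556: " wor", 45654: "ld", 46652: "!"}


def decode(ids):
    return "".join(ID_TO_PIECE[i] for i in ids if i != EOS_ID)


prompt = [2435, 3158, 5448, 4686, 94848, 7466, 6556, 45654, 46652]
output = generate(prompt)
print(output)
print(decode(output))
```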


## Encoding

In [0]:
# Let's apply some encoding on the images that we already loaded
pixel_values = pretrained_processor(image, return_tensors="pt").pixel_values
print(pixel_values.shape)


As we can see, encoding the image produces a 4-dimensional tensor with the shape [1, 3, 1280, 960]: a batch of one image, three color channels ("red", "green", and "blue"), and an image height and width of 1280 and 960 pixels. Each number is the normalized intensity of one color channel at one pixel.
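
The layout of such a tensor can be illustrated without any libraries. The sketch below converts a tiny 2x2 RGB image into a batched, channels-first nested list; the normalization constants (`mean=0.5`, `std=0.5`) are assumptions for illustration, and the real processor also resizes and pads the image.

```python
# A tiny 2x2 RGB image in height x width x channel order (values 0..255)
hwc_image = [
    [(255, 0, 0), (0, 255, 0)],
    [(0, 0, 255), (255, 255, 255)],
]


def to_channels_first(image, mean=0.5, std=0.5):
    """Rescale to 0..1, normalize, and reorder to channel x height x width."""
    height, width = len(image), len(image[0])
    return [
        [
            [(image[row][col][channel] / 255 - mean) / std for col in range(width)]
            for row in range(height)
        ]
        for channel in range(3)
    ]


chw = to_channels_first(hwc_image)
batch = [chw]  # add the batch dimension
shape = (len(batch), len(batch[0]), len(batch[0][0]), len(batch[0][0][0]))
print(shape)      # (batch, channels, height, width)
print(chw[0][0])  # first row of the red channel
```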

In [0]:
pixel_values

## Generate

**DONUT** is an instruction-following model, so we have to encode the document parsing instruction ```"<s_cord-v2>"``` and provide the ```pixel_values``` tensor.

In [0]:
import torch

task_prompt = "<s_cord-v2>"
decoder_input_ids = pretrained_processor.tokenizer(task_prompt, add_special_tokens=False, return_tensors="pt")["input_ids"]

device = "cuda" if torch.cuda.is_available() else "cpu"
pretrained_model.to(device)

outputs = pretrained_model.generate(pixel_values.to(device),
                               decoder_input_ids=decoder_input_ids.to(device),
                               max_length=pretrained_model.decoder.config.max_position_embeddings,
                               early_stopping=False,
                               pad_token_id=pretrained_processor.tokenizer.pad_token_id,
                               eos_token_id=pretrained_processor.tokenizer.eos_token_id,
                               use_cache=True,
                               num_beams=1,
                               bad_words_ids=[[pretrained_processor.tokenizer.unk_token_id]],
                               return_dict_in_generate=True,
                               output_scores=True,)

In [0]:
outputs


## Decode

We use the processor to decode the outputs and remove the special start/stop tokens. 

In [0]:
import re

sequence = pretrained_processor.batch_decode(outputs.sequences)[0]
sequence = sequence.replace(pretrained_processor.tokenizer.eos_token, "").replace(pretrained_processor.tokenizer.pad_token, "")
sequence = re.sub(r"<.*?>", "", sequence, count=1).strip()  # remove first task start token
print(sequence)


Now the output contains readable text. But we see that there are many special tokens in the output sequence. The reason is that **DONUT** was trained with a special dataset that uses special tokens to generate an output structure.

## Convert to JSON
The processor also comes with a special `token2json` function that generates the final JSON based on the special tokens.


In [0]:
pretrained_processor.token2json(sequence)


As you can see, the model is able to parse the text. But the JSON output structure isn't ideal: the model tries to put the parsed text into a structure that it learned for receipt parsing.
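
To show how special tokens can carry structure, here is a simplified sketch of the idea behind such a conversion. It is not the processor's implementation: `tokens_to_dict` is a hypothetical helper that only handles well-formed `<s_key>...</s_key>` pairs and ignores lists and keys nested inside themselves.

```python
import re

# Matches "<s_key>body</s_key>" where the closing tag repeats the opening key
TAG_PATTERN = re.compile(r"<s_(?P<key>[^>]+)>(?P<body>.*?)</s_(?P=key)>", re.DOTALL)


def tokens_to_dict(sequence):
    """Turn a tagged token sequence into a nested dictionary."""
    result = {}
    for match in TAG_PATTERN.finditer(sequence):
        key, body = match.group("key"), match.group("body")
        # Recurse when the body itself contains tagged fields
        result[key] = tokens_to_dict(body) if "<s_" in body else body.strip()
    return result


example_sequence = (
    "<s_boarding_pass><s_flight>SA8862</s_flight><s_seat>8C</s_seat></s_boarding_pass>"
)
parsed = tokens_to_dict(example_sequence)
print(parsed)
```

A model that was tuned on receipts emits receipt-specific tags, which is why the resulting keys do not fit a boarding pass.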


# Tuning


## Example of a Tuning Set

To teach the model the structure of boarding pass parsing we have to create a tuning set. We also need training samples of boarding pass images and examples of the expected output/ground truth labels.


In [0]:
# Example of an expected output for the image
ground_truth = {"passenger": "JOHN SMITH",
              "flight": "BH01122",
              "seat": "16C",
              "gate": "B27",
              "depature_date": "25MAY",
              "depature_time": "20:10",
              "from": "NYC",
              "to":"LA"}

In [0]:
import pandas as pd
import base64
import json


# Create a dictionary for the DataFrame
data = {
    'image': [image],
    'ground_truth': [ground_truth]
}

# Create the DataFrame
df = pd.DataFrame(data)

In [0]:
df["image"][0]

In [0]:
df["ground_truth"][0]

## Create a Hugging Face Dataset

Hugging Face tries to establish a common standard for large scale training and tuning sets. We will follow this approach here.

In [0]:
##--> Hint: Having a bigger training set will always help the model to generalize better. Think about adding additional examples.
##--> ToDo: Check if some keys are missing and need to be added to the ground truth labels


# Let's build some list of boarding pass images from the internet
boarding_pass_url = [
  "https://media-cdn.tripadvisor.com/media/photo-s/17/34/1f/f2/boarding-pass.jpg",
  "https://t3.ftcdn.net/jpg/04/98/91/18/360_F_498911830_POumal8kACIftLIvr6L5P6vyc2P0XNZb.jpg",
  "https://miro.medium.com/v2/resize:fit:1400/1*ipDGqPL88tMkGB_oMO5iwQ.png",
  "https://assets.lastdodo.com/image/ld_large/plain/assets/catalog/assets/2013/9/17/8/3/f/pdf_83f71878-1fb3-11e3-8973-118d3d365b34.jpg",
  "https://assets.cntraveller.in/photos/6230b23b5934c6bf974e2bff/16:9/w_1280,c_limit/boarding%20pass.jpg"
]

# And build the true labels for it using json lines
# The key gt_parse was used for pre-training the DONUT model. We make use of it to simplify our tuning process
# We also introduce the key boarding_pass as a new category of documents to DONUT. 
list_of_true_labels = [
"""{"boarding_pass": 
    {"passenger": "",
     "flight": "SA8862",
     "seat": "8C",
     "gate": "",
     "depature_date": "23MAR",
     "depature_time": "13:30",
     "from": "SKUKUZA/SZK",
     "to":"JOHANNESBURG/ JNB"
    }
  }""",
"""{"boarding_pass": 
    {"passenger": "NAME SURNAME",
     "flight": "AA204",
     "seat": "20A",
     "gate": "B25",
     "depature_date": "12MAY",
     "depature_time": "",
     "boarding_time": "17:05",
     "from": "NEW YORK",
     "to":"HAWAII"
    }
  }""",
"""{"boarding_pass": 
    {"passenger": "JOHN/DOE",
     "flight": "AZ7614",
     "seat": "15F",
     "gate": "E39",
     "depature_date": "26MAR",
     "depature_time": "",
     "boarding_time": "11:35",
     "from": "ROME-FIUMICINO",
     "to":"NYC-KENNEDY"
    }
  }""",
"""{"boarding_pass": 
    {"passenger": "DEFOUR/ROXANE",
     "flight": "EZY1738",
     "seat": "10D",
     "gate": "A",
     "depature_date": "09SEP",
     "depature_time": "21:30",
     "boarding_time": "20:45",
     "from": "Nice",
     "to":"Toulouse"
    }
  }""",
"""{"boarding_pass": 
    {"passenger": "",
     "flight": "TK1890",
     "seat": "13A",
     "gate": "D22",
     "depature_date": "09NOV",
     "depature_time": "07:00",
     "boarding_time": "06:10",
     "from": "VIENNA",
     "to":"ISTANBUL"
    }
  }"""
]

In [0]:
from datasets.dataset_dict import DatasetDict
from datasets import Dataset

# For each image url we parse the image and put it into a list
image_list = []

for url in boarding_pass_url: 
  image = load_image_from_url(url)
  image_list.append(image)

##--> Hint: Usually, validation sets contain samples that are not being used for training. Consider splitting train and validation.
# Create a huggingface dataset 
d = {
     'train':Dataset.from_dict({'image':image_list,'ground_truth':list_of_true_labels}),
     'validation':Dataset.from_dict({'image':image_list, 'ground_truth':list_of_true_labels}),
     }

dataset = DatasetDict(d)

# Show dataset structure
dataset
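
As a minimal sketch of the hint above, the following splits a list of samples into disjoint training and validation parts. The function name, the 20% validation fraction, the seed, and the placeholder samples are all assumptions for illustration; in the notebook the samples would be the (image, label) pairs built above.

```python
import random


def split_samples(samples, validation_fraction=0.2, seed=42):
    """Shuffle reproducibly and hold out a fraction of the samples for validation."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * validation_fraction))
    return shuffled[n_val:], shuffled[:n_val]


# Placeholder samples standing in for (image, ground truth) pairs
samples = [(f"image_{i}", f"label_{i}") for i in range(5)]
train_samples, val_samples = split_samples(samples)
print(len(train_samples), len(val_samples))
```

With only five samples the validation part holds a single example, so the measured score will be noisy; more data makes the split more meaningful.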

In [0]:
example = dataset['train'][0]
image = example['image']
# let's make the image a bit smaller when visualizing
width, height = image.size
display(image.resize((width // 2, height // 2)))

In [0]:
# let's load the corresponding JSON dictionary (as string representation)
ground_truth = example['ground_truth']
print(ground_truth)

In [0]:
import json

# Check if we have well-formed JSON (json.loads raises an error otherwise)
json.loads(ground_truth)


## Load model and processor


In [0]:
from transformers import VisionEncoderDecoderConfig

# Set image size
image_size = [1280, 960]

# Set max token length
max_length = 768

# update image_size of the encoder
# during pre-training, a larger image size was used
config = VisionEncoderDecoderConfig.from_pretrained("naver-clova-ix/donut-base")
config.encoder.image_size = image_size # (height, width)
# update max_length of the decoder (for generation)
config.decoder.max_length = max_length
# TODO we should actually update max_position_embeddings and interpolate the pre-trained ones:
# https://github.com/clovaai/donut/blob/0acc65a85d140852b8d9928565f0f6b2d98dc088/donut/model.py#L602

Next, we instantiate the model with our custom config, as well as the processor. Make sure that all pre-trained weights are correctly loaded (a warning would tell you if that's not the case).

In [0]:
from transformers import DonutProcessor, VisionEncoderDecoderModel

processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base")
model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base", config=config)

## Create PyTorch dataset

Here we create a regular PyTorch dataset.

The model doesn't directly take the (image, JSON) pairs as input and labels. Rather, we create pixel_values and labels. Both are PyTorch tensors. The pixel_values are the input images (resized, padded and normalized), and the labels are the input_ids of the target sequence (which is a flattened version of the JSON), with padding tokens replaced by -100 (to make sure these are ignored by the loss function). Both are created using DonutProcessor (which internally combines an image processor, for the image modality, and a tokenizer, for the text modality).
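
The padding and masking step can be sketched with plain lists. `PAD_ID` and the token IDs are hypothetical; -100 is the default `ignore_index` of `torch.nn.CrossEntropyLoss`, which is why positions with that value do not contribute to the loss.

```python
PAD_ID = 1        # hypothetical pad token ID
IGNORE_ID = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def build_labels(token_ids, max_length):
    """Truncate or pad to max_length, then mask the padding positions."""
    padded = token_ids[:max_length] + [PAD_ID] * max(0, max_length - len(token_ids))
    return [IGNORE_ID if token == PAD_ID else token for token in padded]


print(build_labels([57, 58, 59], max_length=6))
```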

Note that we're also adding tokens to the vocabulary of the decoder (and corresponding tokenizer) for all keys of the dictionaries in our dataset, like "<s_menu>". This makes sure the model learns an embedding vector for them. Without doing this, some keys might get split up into multiple subword tokens, in which case the model just learns an embedding for the subword tokens, rather than a direct embedding for these keys.
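
The flattening of a JSON object into a tagged sequence, together with the collection of new special tokens, can be sketched as follows. This is a simplified standalone illustration, not the dataset class itself; the function name is hypothetical and the `<s_key>`, `</s_key>`, and `<sep/>` conventions follow the token style mentioned above.

```python
def json_to_token_sequence(obj, special_tokens):
    """Flatten a JSON-like object and record every key tag that is encountered."""
    if isinstance(obj, dict):
        output = ""
        for key, value in obj.items():
            start, end = f"<s_{key}>", f"</s_{key}>"
            for token in (start, end):
                if token not in special_tokens:
                    special_tokens.append(token)
            output += start + json_to_token_sequence(value, special_tokens) + end
        return output
    if isinstance(obj, list):
        # List items are separated by a dedicated separator token
        return "<sep/>".join(json_to_token_sequence(item, special_tokens) for item in obj)
    return str(obj)


special_tokens = []
sequence = json_to_token_sequence(
    {"boarding_pass": {"flight": "SA8862", "seat": "8C"}}, special_tokens
)
print(sequence)
print(special_tokens)
```

Each collected tag would then be added to the tokenizer so that it maps to a single ID instead of several subword pieces.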


In [0]:
import json
import random
from typing import Any, List, Tuple

import torch
from torch.utils.data import Dataset

added_tokens = []


class DonutDataset(Dataset):
    """
    PyTorch Dataset for Donut. This class takes a Hugging Face Dataset as input.

    Each row consists of an image and its ground truth (a JSON string),
    and it will be converted into pixel_values (vectorized image) and labels (input_ids of the tokenized string).

    Args:
        dataset_name_or_path: the Hugging Face Dataset split to wrap (e.g. dataset["train"])
        max_length: the max number of tokens for the target sequences
        ignore_id: ignore_index for torch.nn.CrossEntropyLoss
        task_start_token: the special token to be fed to the decoder to conduct the target task
        prompt_end_token: the special token at the end of the sequences
        sort_json_key: whether or not to sort the JSON keys
    """

    def __init__(
        self,
        dataset_name_or_path: object,
        max_length: int,
        ignore_id: int = -100,
        task_start_token: str = "",
        prompt_end_token: str = None,
        sort_json_key: bool = True,
    ):
        super().__init__()

        self.max_length = max_length
        self.ignore_id = ignore_id
        self.task_start_token = task_start_token
        self.prompt_end_token = (
            prompt_end_token if prompt_end_token else task_start_token
        )
        self.sort_json_key = sort_json_key

        self.dataset = dataset_name_or_path
        self.dataset_length = len(self.dataset)

    def add_tokens(self, list_of_tokens: List[str]):
        """
        Add special tokens to tokenizer and resize the token embeddings of the decoder
        """
        newly_added_num = processor.tokenizer.add_tokens(list_of_tokens)
        if newly_added_num > 0:
            model.decoder.resize_token_embeddings(len(processor.tokenizer))
            added_tokens.extend(list_of_tokens)

    def __len__(self) -> int:
        return self.dataset_length

    def json2token(
        self,
        obj: Any,
        update_special_tokens_for_json_key: bool = True,
        sort_json_key: bool = True,
    ):
        """
        Convert an ordered JSON object into a token sequence
        """
        if type(obj) == dict:
            if len(obj) == 1 and "text_sequence" in obj:
                return obj["text_sequence"]
            else:
                output = ""
                if sort_json_key:
                    keys = sorted(obj.keys(), reverse=True)
                else:
                    keys = obj.keys()
                for k in keys:
                    if update_special_tokens_for_json_key:
                        self.add_tokens([rf"<s_{k}>", rf"</s_{k}>"])
                    output += (
                        rf"<s_{k}>"
                        + self.json2token(
                            obj[k], update_special_tokens_for_json_key, sort_json_key
                        )
                        + rf"</s_{k}>"
                    )
                return output

        elif type(obj) == list:
            return r"<sep/>".join(
                [
                    self.json2token(
                        item, update_special_tokens_for_json_key, sort_json_key
                    )
                    for item in obj
                ]
            )

        else:
            obj = str(obj)
            if f"<{obj}/>" in added_tokens:
                obj = f"<{obj}/>"  # for categorical special tokens
            return obj

    def __getitem__(self, idx: int) -> Tuple[torch.Tensor, torch.Tensor, str]:
        """
        Load one sample of the dataset and convert it into model inputs.
        Returns:
            pixel_values : preprocessed image
            labels : tokenized ground truth with pad tokens masked (the model doesn't need to predict pad tokens)
            target_sequence : the ground truth as a string
        """
        sample = self.dataset[idx]

        # image input
        pixel_values = processor(sample["image"], return_tensors="pt").pixel_values
        pixel_values = pixel_values.squeeze()

        # ground truth
        target_sequence = self.json2token(sample["ground_truth"])

        input_ids = processor.tokenizer(
            target_sequence,
            add_special_tokens=False,
            max_length=self.max_length,
            padding="max_length",
            truncation=True,
            return_tensors="pt",
        )["input_ids"].squeeze(0)

        labels = input_ids.clone()
        labels[
            labels == processor.tokenizer.pad_token_id
        ] = self.ignore_id  # model doesn't need to predict pad token
        # labels[: torch.nonzero(labels == self.prompt_end_token_id).sum() + 1] = self.ignore_id  # model doesn't need to predict prompt (for VQA)
        return pixel_values, labels, target_sequence

In [0]:
# we update some settings which differ from pretraining; namely the size of the images + no rotation required
# source: https://github.com/clovaai/donut/blob/master/config/train_cord.yaml
processor.image_processor.size = image_size[::-1] # should be (width, height)
processor.image_processor.do_align_long_axis = False

train_dataset = DonutDataset(dataset_name_or_path=dataset["train"], max_length=max_length, task_start_token="", prompt_end_token="", sort_json_key=False)

val_dataset = DonutDataset(dataset_name_or_path=dataset["validation"], max_length=max_length, task_start_token="", prompt_end_token="", sort_json_key=False)
     

In [0]:
target_sequence = "{'boarding_pass': {'passenger': '', 'flight': 'SA8862', 'seat': '8C', 'gate': '', 'depature_date': '23MAR', 'depature_time': '13:30', 'from': 'SKUKUZA/SZK', 'to': 'JOHANNESBURG/ JNB'}}"

In [0]:
input_ids= processor.tokenizer(
            target_sequence,
            add_special_tokens=False,
            max_length=768,
            padding="max_length",
            truncation=True,
            return_tensors="pt",
        )["input_ids"].squeeze(0)

In [0]:
input_ids

In [0]:
# By decoding the ground truth tensor, we can validate whether our ground truth label is represented in the right way.
processor.decode(input_ids)

In [0]:
# the vocab_size attribute stays constant (might be a bit unintuitive, but it doesn't include added special tokens)
print("Original number of tokens:", processor.tokenizer.vocab_size)
print("Number of tokens after adding special tokens:", len(processor.tokenizer))

As always, it's very important to verify whether our data is prepared correctly. Let's check the first training example:

In [0]:
pixel_values, labels, target_sequence = train_dataset[0]

In [0]:
print(pixel_values.shape)

In [0]:
# let's print the tokens of the label (the first 100 token IDs)
for token_id in labels.tolist()[:100]:
  if token_id != -100:
    print(processor.decode([token_id]))
  else:
    print(token_id)

In [0]:
print(len(train_dataset))
print(len(val_dataset))

In [0]:
print(target_sequence)


## Create PyTorch DataLoaders

Next, we create corresponding PyTorch DataLoaders, which allow us to loop over the dataset in batches:

In [0]:

from torch.utils.data import DataLoader
import os

os.environ["TOKENIZERS_PARALLELISM"] = "True"

# feel free to increase the batch size if you have a lot of memory
# given the large image size, a batch size > 1 may not fit into GPU memory
train_dataloader = DataLoader(train_dataset, batch_size=1, shuffle=True, num_workers=3)
val_dataloader = DataLoader(val_dataset, batch_size=1, shuffle=False, num_workers=3)

## Define LightningModule

Next, we define a LightningModule, which is the standard way to train a model in PyTorch Lightning. A LightningModule is an nn.Module with some additional functionality.

Basically, PyTorch Lightning will take care of all device placements (.to(device)) for us, as well as the backward pass, putting the model in training mode, etc.

In [0]:
from pathlib import Path
import re
from nltk import edit_distance
import numpy as np
import math
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.optim.lr_scheduler import LambdaLR

import pytorch_lightning as pl
from pytorch_lightning.utilities import rank_zero_only


class DonutModelPLModule(pl.LightningModule):
    def __init__(self, config, processor, model):
        super().__init__()
        self.config = config
        self.processor = processor
        self.model = model

    def training_step(self, batch, batch_idx):
        pixel_values, labels, _ = batch
        
        outputs = self.model(pixel_values, labels=labels)
        loss = outputs.loss
        self.log("train_loss", loss, on_step=True, on_epoch=True, sync_dist=True)
        return loss

    def validation_step(self, batch, batch_idx, dataset_idx=0):
        pixel_values, labels, answers = batch
        batch_size = pixel_values.shape[0]
        # we feed the prompt to the model
        decoder_input_ids = torch.full((batch_size, 1), self.model.config.decoder_start_token_id, device=self.device)

        outputs = self.model.generate(pixel_values,
                                   decoder_input_ids=decoder_input_ids,
                                   max_length=max_length,
                                   early_stopping=False,
                                   pad_token_id=self.processor.tokenizer.pad_token_id,
                                   eos_token_id=self.processor.tokenizer.eos_token_id,
                                   use_cache=True,
                                   num_beams=1,
                                   bad_words_ids=[[self.processor.tokenizer.unk_token_id]],
                                   return_dict_in_generate=True
                                   )


        predictions = []
        for seq in self.processor.tokenizer.batch_decode(outputs.sequences):
            seq = seq.replace(self.processor.tokenizer.eos_token, "").replace(self.processor.tokenizer.pad_token, "")
            seq = re.sub(r"<.*?>", "", seq, count=1).strip()  # remove first task start token
            predictions.append(seq)


        scores = []
        for pred, answer in zip(predictions, answers):
            # clean up the prediction itself (not the answer) before scoring
            pred = re.sub(r"(?:(?<=>) | (?=</s_))", "", pred)
            answer = answer.replace(self.processor.tokenizer.eos_token, "")
            scores.append(edit_distance(pred, answer) / max(len(pred), len(answer)))

            if self.config.get("verbose", False) and len(scores) == 1:
                print(f"--------------------------------")
                print(f"Prediction: {pred}")
                print(f"Answer: {answer}")
                print(f"Normed ED: {scores[0]}")
                print(f"--------------------------------")

        self.log("val_edit_distance", np.mean(scores))

        return scores

    def configure_optimizers(self):
        # you could also add a learning rate scheduler if you want
        optimizer = torch.optim.Adam(self.parameters(), lr=self.config.get("lr"))
    
        return optimizer
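
The score logged in `validation_step` is a normalized edit distance: the number of single-character edits needed to turn the prediction into the answer, divided by the length of the longer string. The sketch below reimplements the idea in plain Python for illustration; the function names are hypothetical, and the notebook itself relies on `nltk.edit_distance`.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalized_edit_distance(prediction, answer):
    longest = max(len(prediction), len(answer))
    return levenshtein(prediction, answer) / longest if longest else 0.0


print(normalized_edit_distance("16C", "18C"))
print(normalized_edit_distance("JOHN SMITH", "JOHN SMITH"))
```

A value of 0 means the prediction matches the answer exactly, and values close to 1 mean almost nothing matches, so lower is better.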





Another important thing is that we need to set 2 additional attributes in the configuration of the model. This is not required, but will allow us to train the model by only providing the decoder targets, without having to provide any decoder inputs.

The model will automatically create the decoder_input_ids (the decoder inputs) based on the labels, by shifting them one position to the right and prepending the decoder_start_token_id. I recommend checking this video if you want to understand how models like Donut automatically create decoder_input_ids - and more broadly how Donut works.
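
The shifting described above can be sketched with plain lists. The token IDs are hypothetical; the sketch assumes that masked label positions (-100) are replaced by the pad token in the decoder inputs, since -100 is not a valid token ID.

```python
def shift_tokens_right(labels, pad_token_id, decoder_start_token_id):
    """Build decoder inputs from labels: prepend the start token and drop the last label."""
    shifted = [decoder_start_token_id] + labels[:-1]
    # Masked positions cannot be fed to the decoder, so use the pad token instead
    return [pad_token_id if token == -100 else token for token in shifted]


labels = [57, 58, 59, -100, -100]
print(shift_tokens_right(labels, pad_token_id=1, decoder_start_token_id=0))
```

At every position the decoder sees the tokens up to that point and is trained to predict the label at the same index, which is the next token.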


In [0]:

model.config.pad_token_id = processor.tokenizer.pad_token_id
model.config.decoder_start_token_id = processor.tokenizer.convert_tokens_to_ids([''])[0]

In [0]:
# sanity check
print("Pad token ID:", processor.decode([model.config.pad_token_id]))
print("Decoder start token ID:", processor.decode([model.config.decoder_start_token_id]))


## Tune!

Next, let's tune! This happens by instantiating a PyTorch Lightning Trainer and then calling trainer.fit.

What's great is that we can automatically train on the hardware we have (in our case, a single GPU), enable mixed precision (precision="16-mixed", which makes sure we don't consume as much memory), add Weights and Biases logging, and so on.



In [0]:

config = {
          "max_epochs":10,
          "check_val_every_n_epoch":3,
          "gradient_clip_val":1.0,
          "num_training_samples_per_epoch": 5,
          "lr":3e-5,
          "train_batch_sizes": [1],
          "val_batch_sizes": [1],
          "num_nodes": 1,
          "warmup_steps": 0, # 800/8*30/10, 10%
          "result_path": "./result",
          "verbose": True,
          }


In [0]:
from pytorch_lightning.callbacks import Callback, EarlyStopping
import mlflow
import torch

# Make sure to release some memory
torch.cuda.empty_cache()

# Make sure that we have disabled MLflow autolog
mlflow.autolog(disable=True)

# Hint 1: Currently we train for a fixed number of epochs. 
# But we don't know at which epoch the model is not improving anymore and/or starts to overfit
# Adding early stopping to the training will speed up the training and potentially prevent overfitting
early_stop_callback = EarlyStopping(monitor="val_edit_distance", patience=4, verbose=False, mode="min")

model_module = DonutModelPLModule(config=config, processor=processor, model=model)

trainer = pl.Trainer(
        accelerator="gpu",
        devices=1,
        max_epochs=config.get("max_epochs"),
        val_check_interval=config.get("val_check_interval"),
        check_val_every_n_epoch=config.get("check_val_every_n_epoch"),
        gradient_clip_val=config.get("gradient_clip_val"),
        precision="16-mixed", 
        num_sanity_val_steps=0,
        log_every_n_steps=1, 
        callbacks=[],
        enable_checkpointing=False
        )

trainer.fit(model=model_module, train_dataloaders=train_dataloader, val_dataloaders=val_dataloader)
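
The patience logic behind early stopping, as described in the hint above, can be sketched in a few lines. This is an illustration of the idea rather than the callback's implementation; the function name and the scores are made up. Patience counts validation checks without improvement, not training epochs, so how often validation runs matters when choosing its value.

```python
def early_stopping_check(val_scores, patience, min_delta=0.0):
    """Return the index of the validation check that triggers the stop, or None."""
    best = float("inf")
    checks_without_improvement = 0
    for check, score in enumerate(val_scores):
        if score < best - min_delta:   # lower is better (mode="min")
            best = score
            checks_without_improvement = 0
        else:
            checks_without_improvement += 1
            if checks_without_improvement >= patience:
                return check
    return None


# Made-up validation edit distances, one per validation check
scores = [0.9, 0.5, 0.4, 0.45, 0.41, 0.42, 0.43]
print(early_stopping_check(scores, patience=2))
print(early_stopping_check(scores, patience=4))
print(early_stopping_check(scores[:3], patience=4))
```

The last call returns `None`: with only a few validation checks, a patience of 4 can never be reached, which is worth keeping in mind when validation runs only every few epochs.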

## Validate

Let's check how our tuning changed the outputs when we present a boarding pass image now.

In [0]:
def predict(image_list, model, processor):
  # `model` is the fitted Trainer: model.model is the LightningModule, whose .model is the Donut model
  donut_model = model.model.model
  donut_model.eval()
  parsed_list = []
  for image in image_list:
    # Encode image and move it to the same device as the model
    encoded_image_tensor = processor(image, return_tensors="pt").pixel_values.to(donut_model.device)
    # Generate tokens
    token_tensor = donut_model.generate(encoded_image_tensor)
    # Decode tokens and remove special tokens
    seq = processor.tokenizer.batch_decode(token_tensor)[0]
    seq = seq.replace(processor.tokenizer.eos_token, "").replace(processor.tokenizer.pad_token, "")
    seq = re.sub(r"<.*?>", "", seq, count=1).strip()
    # Append output sequence to list
    parsed_list.append(seq)

  return parsed_list


In [0]:
example = dataset['train'][0]
train_image = example['image']

print(predict([train_image], model=trainer,processor=processor))

train_image


In [0]:
image = load_image_from_url("https://as1.ftcdn.net/v2/jpg/05/08/32/28/1000_F_508322849_dvGX1itSDANr0Tsr7AX032FuiAnDUgbP.jpg")

print(predict([image], model=trainer,processor=processor))

image


The output structure appears much improved now! Additionally, the model effectively parses the text from the samples used for training. However, it struggles to generate satisfactory outputs when presented with a boarding pass it hasn't encountered during training.

This phenomenon is termed **overfitting**: the model excels at handling the samples seen during training but lacks generalization to handle new samples.

### Task:

Your objective in this hackathon is to combat model **overfitting** and **enhance the quality of outputs** on new samples. You'll find some hints and to-dos in the comments. However, consider other potential strategies (advanced) to further enhance the model and bolster its robustness.

Note: Adding the validation sample to the training set isn't a valid solution.