<h1><center>Fine-tuning LayoutLMv2ForTokenClassification on the CORD dataset</center></h1><h1><center>Template</center></h1>  

In this notebook, we are going to fine-tune LayoutLMv2ForTokenClassification on the CORD dataset. The goal is for the model to assign the correct label to each word appearing on a receipt. We treat this as a named entity recognition (NER) task. Compared to BERT, LayoutLMv2 also incorporates visual and layout information about the tokens when encoding them into vectors, which makes it well suited to document understanding tasks.

LayoutLMv2 is itself an upgrade of LayoutLM. The main novelty of LayoutLMv2 is that it also pre-trains visual embeddings, whereas the original LayoutLM only adds visual embeddings during fine-tuning.

Paper: https://arxiv.org/abs/2012.14740  
Original repo: https://github.com/microsoft/unilm/tree/master/layoutlmv2

Original dataset: https://github.com/clovaai/cord

Inspired by: Niels Rogge @ https://github.com/NielsRogge

## Install Requirements

First, we install the required libraries specified in the ***requirements.txt*** file. Check it out 🤗

In [1]:
!pip install -r requirements.txt

In [2]:
# imports
import cv2
import numpy as np
from datasets import (
    Array2D,
    Array3D,
    ClassLabel,
    Features,
    Sequence,
    Value,
    load_dataset,
    load_metric,
)
from PIL import Image
from transformers import (
    LayoutLMv2ForTokenClassification,
    LayoutLMv2Processor,
    Trainer,
    TrainingArguments,
)

import cord
from perspective_transformer import four_point_transform
from utils import normalize_bbox


## Prepare Input

Let's load the CORD dataset from the Hugging Face Hub. I have already uploaded it for you.

In [2]:
# load the dataset using the imported method load_dataset. The dataset name is 'MarkusDressel/cord'
# your code starts here
dataset = 

In [4]:
dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'bboxes', 'roi', 'ner_tags', 'image_path'],
        num_rows: 800
    })
    test: Dataset({
        features: ['id', 'tokens', 'bboxes', 'roi', 'ner_tags', 'image_path'],
        num_rows: 100
    })
    validation: Dataset({
        features: ['id', 'tokens', 'bboxes', 'roi', 'ner_tags', 'image_path'],
        num_rows: 100
    })
})

As we can see, it contains a training, a test and a validation split. Each example consists of an id, tokens, bounding boxes, a region of interest (ROI), NER tags and a path to the document image. Note: "tokens" might be a bit misleading here, because these are still words. We need to convert them to actual tokens (word pieces) using the tokenizer.
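
To make the structure concrete, here is a minimal sketch that uses a plain dictionary as a stand-in for the `DatasetDict` printed above. Only the split names and field names come from that output; the words, boxes, tag ids and file name are invented for illustration, and the box layout `[x0, y0, x1, y1]` in pixels is an assumption based on how the boxes are used later in this notebook.

```python
# Stand-in for the DatasetDict above: split name -> list of records.
# All values are invented; only the split and field names match the real dataset.
toy_dataset = {
    "train": [
        {
            "id": "0",
            "tokens": ["Nasi", "Goreng", "25,000"],  # still whole words, not word pieces
            "bboxes": [[10, 20, 60, 40], [65, 20, 130, 40], [200, 20, 260, 40]],  # assumed x0, y0, x1, y1
            "roi": [],  # empty when the image needs no perspective correction
            "ner_tags": [4, 4, 6],  # integer ids into the label list
            "image_path": "hypothetical/receipt_0.png",  # made-up path
        }
    ],
    "validation": [],
    "test": [],
}

# Splits are selected by name, the same way as on a DatasetDict.
train_split = toy_dataset["train"]
example = train_split[0]

# Words, boxes and tags are aligned position by position.
for word, box, tag in zip(example["tokens"], example["bboxes"], example["ner_tags"]):
    print(f"{word!r:10} box={box} tag={tag}")
```

The same indexing pattern (`dataset["train"]`, `dataset["validation"]`, `dataset["test"]`) is what the split cell below asks for.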

In [3]:
# split the dataset into train, validation and test
train_dataset = 
validation_dataset = 
test_dataset = 


---

## Preprocess Data

First, let's store the labels in a list, and create dictionaries that let us map from labels to integer indices and vice versa. The latter will be useful when evaluating the model.

In [6]:
# get the labels from the dataset. You can find them in the features attribute of the train_dataset
labels = 
labels = ["I-" + label for label in labels]

# create a dictionary that maps an id (0 ... len(labels) - 1) to its corresponding label name
id2label = 
# create a dictionary that maps the actual label name to its corresponding id
label2id = {k: v for v, k in enumerate(labels)}
print(labels)

['I-menu.cnt', 'I-menu.discountprice', 'I-menu.etc', 'I-menu.itemsubtotal', 'I-menu.nm', 'I-menu.num', 'I-menu.price', 'I-menu.sub_cnt', 'I-menu.sub_etc', 'I-menu.sub_nm', 'I-menu.sub_price', 'I-menu.sub_unitprice', 'I-menu.unitprice', 'I-menu.vatyn', 'I-sub_total.discount_price', 'I-sub_total.etc', 'I-sub_total.othersvc_price', 'I-sub_total.service_price', 'I-sub_total.subtotal_price', 'I-sub_total.tax_price', 'I-total.cashprice', 'I-total.changeprice', 'I-total.creditcardprice', 'I-total.emoneyprice', 'I-total.menuqty_cnt', 'I-total.menutype_cnt', 'I-total.total_etc', 'I-total.total_price', 'I-void_menu.nm', 'I-void_menu.price']
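
As a small illustration of the two mappings, the sketch below builds them for a shortened, three-entry label list. The `toy_` names are hypothetical and chosen so that they do not overwrite the notebook's own `labels`, `id2label` and `label2id`; the real list has the 30 entries printed above.

```python
# Toy label list; the real one is printed above and has 30 entries.
toy_labels = ["I-menu.nm", "I-menu.price", "I-total.total_price"]

# id -> label name, used to turn predicted ids back into readable tags.
toy_id2label = dict(enumerate(toy_labels))
# label name -> id, the inverse mapping.
toy_label2id = {label: idx for idx, label in toy_id2label.items()}

print(toy_id2label)
print(toy_label2id)
```

The `I-` prefix is presumably added because seqeval, which is used for evaluation later, expects tags in an IOB-like scheme.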


The current images are not centered on the region of interest (ROI). Luckily, the authors provide the bounding box of the ROI, so we can apply a simple perspective transformation to the image and the corresponding bounding boxes. Check out ***perspective_transformer.py*** to see what is happening under the hood.

Now let's load the image and adjust it using our ***perspective_transformer.py***. The bounding boxes need to be scaled to the range 0-1000. Unfortunately, the annotators made some mistakes and created bounding boxes that lie outside of the image. We fix this by simply deleting these bounding boxes and the corresponding words from the dataset entry.
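
The scaling and filtering steps can be sketched in plain Python. Note that `normalize_bbox_sketch` and `is_valid` are hypothetical helpers written for this illustration: the real `normalize_bbox` lives in ***utils.py***, which is not shown here, so the formula below is an assumption about what it does. The image size, words and boxes are invented.

```python
def normalize_bbox_sketch(bbox, width, height):
    """Scale a pixel box [x0, y0, x1, y1] to the 0-1000 range (assumed behavior of utils.normalize_bbox)."""
    x0, y0, x1, y1 = bbox
    return [
        int(1000 * x0 / width),
        int(1000 * y0 / height),
        int(1000 * x1 / width),
        int(1000 * y1 / height),
    ]


def is_valid(box, word):
    """Keep boxes inside 0-1000 with non-negative width and height, and non-empty words."""
    return (
        min(box) >= 0
        and max(box) <= 1000
        and box[2] - box[0] >= 0
        and box[3] - box[1] >= 0
        and word != ""
    )


# Invented example: a 400x800 image with one box that sticks out of the image.
width, height = 400, 800
words = ["Total", "12,000", "oops"]
pixel_boxes = [[40, 80, 120, 100], [200, 80, 300, 100], [380, 700, 450, 720]]

boxes = [normalize_bbox_sketch(b, width, height) for b in pixel_boxes]
# Build new lists instead of deleting entries while looping over them.
kept = [(w, b) for w, b in zip(words, boxes) if is_valid(b, w)]
print(kept)
```

The third box ends at x = 450 on a 400-pixel-wide image, so its normalized right edge exceeds 1000 and the word is dropped together with its box.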

In [7]:
def preprocess_data(example):
    
    # load image into cv2
    
    # load the image of the attribute "image_path"
    image_path = 
    
    # read the image using cv2.imread
    image = cv2.imread(image_path)

    # get the region of interest (roi) and the bounding boxes (bboxes) from the example
    roi = example["roi"]
    bboxes = example["bboxes"]
    # we only have a roi when the image is misaligned
    if roi:
        pts = np.asarray(roi,dtype="float32")
        # apply perspective transformation
        image, bboxes = four_point_transform(image, pts, bboxes=bboxes)
    # fix bgr to rgb image representation
    image =  cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    
    
    # get width and height
    h,w = 
    # resize image to 224,224 using cv2.resize()
    image = 
    image = np.array(image)

    # LayoutLMv2 requires the bounding boxes to be scaled to the range 0-1000
    # normalize bounding boxes using the imported method normalize_bbox   hint: use a list comprehension
    boxes = [normalize_bbox(b,w,h) for b in bboxes]
    labels = example["ner_tags"]
    words = example["tokens"]
    
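    # drop wrongly annotated entries: boxes outside the 0-1000 range,
    # boxes with negative width or height, and empty words.
    # We build new lists here, because deleting items while iterating would skip entries.
    keep = [
        i for i, box in enumerate(boxes)
        if not (min(box) < 0 or max(box) > 1000 or (box[3] - box[1]) < 0 or (box[2] - box[0]) < 0 or words[i] == "")
    ]
    words = [words[i] for i in keep]
    labels = [labels[i] for i in keep]
    boxes = [boxes[i] for i in keep]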
    
    return {"image":image, "boxes":boxes, "labels":labels, "words":words}

# let's define our new features for the dataset
features = Features({
    'image': Array3D(dtype="int64", shape=(224, 224, 3)),
    'boxes':Sequence(Sequence(Value(dtype='int64'))),
    'labels': Sequence(ClassLabel(names=labels)),
    'words': Sequence(Value("string")),
})
# apply the preprocessing to all our datasets
# use the dataset.map method and remove all prior existing columns
train_dataset = 
test_dataset = 
validation_dataset = 

  0%|          | 0/800 [00:00<?, ?ex/s]

  0%|          | 0/100 [00:00<?, ?ex/s]

  0%|          | 0/100 [00:00<?, ?ex/s]

In [8]:
train_dataset.features

{'image': Array3D(shape=(224, 224, 3), dtype='int64', id=None),
 'boxes': Sequence(feature=Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None), length=-1, id=None),
 'labels': Sequence(feature=ClassLabel(num_classes=30, names=['I-menu.cnt', 'I-menu.discountprice', 'I-menu.etc', 'I-menu.itemsubtotal', 'I-menu.nm', 'I-menu.num', 'I-menu.price', 'I-menu.sub_cnt', 'I-menu.sub_etc', 'I-menu.sub_nm', 'I-menu.sub_price', 'I-menu.sub_unitprice', 'I-menu.unitprice', 'I-menu.vatyn', 'I-sub_total.discount_price', 'I-sub_total.etc', 'I-sub_total.othersvc_price', 'I-sub_total.service_price', 'I-sub_total.subtotal_price', 'I-sub_total.tax_price', 'I-total.cashprice', 'I-total.changeprice', 'I-total.creditcardprice', 'I-total.emoneyprice', 'I-total.menuqty_cnt', 'I-total.menutype_cnt', 'I-total.total_etc', 'I-total.total_price', 'I-void_menu.nm', 'I-void_menu.price'], names_file=None, id=None), length=-1, id=None),
 'words': Sequence(feature=Value(dtype='string', id=None), length=-1, 

---

Now let's finally prepare the data for the model. LayoutLMv2Processor brings our image into the right shape and tokenizes our words (using WordPiece). Optionally, it can also apply OCR, which is especially useful when running inference with the model later on.
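
To illustrate what happens to boxes and labels when words are split into word pieces, here is a minimal sketch. `toy_wordpiece` is a hypothetical stand-in that simply cuts words into fixed-size chunks; it is not the real WordPiece algorithm. Labeling only the first piece of each word and marking the rest with -100 is, as far as I know, the default behavior of the LayoutLMv2 tokenizer, but treat that as an assumption. The real processor additionally adds special tokens and padding, which this sketch leaves out.

```python
def toy_wordpiece(word, max_len=4):
    """Hypothetical stand-in for a WordPiece tokenizer: cut a non-empty word into chunks of max_len characters."""
    pieces = [word[i:i + max_len] for i in range(0, len(word), max_len)]
    return [pieces[0]] + ["##" + p for p in pieces[1:]]


def align(words, boxes, word_labels, ignore_index=-100):
    """Expand word-level boxes and labels to token level."""
    tokens, token_boxes, token_labels = [], [], []
    for word, box, label in zip(words, boxes, word_labels):
        pieces = toy_wordpiece(word)
        tokens.extend(pieces)
        # Every piece inherits the box of the word it came from.
        token_boxes.extend([box] * len(pieces))
        # Only the first piece carries the label; the rest are ignored by the loss.
        token_labels.extend([label] + [ignore_index] * (len(pieces) - 1))
    return tokens, token_boxes, token_labels


# Invented input: two words with already normalized boxes and label ids.
tokens, token_boxes, token_labels = align(
    words=["Cappuccino", "4.50"],
    boxes=[[100, 100, 300, 125], [500, 100, 750, 125]],
    word_labels=[4, 6],
)
print(tokens)
print(token_labels)
```

The -100 entries are the positions that the metric code later filters out before computing scores.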

In [9]:
# load the LayoutLMv2Processor. The path is "microsoft/layoutlmv2-base-uncased" and the revision is "no_ocr"
processor = 

Downloading:   0%|          | 0.00/136 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/226k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/707 [00:00<?, ?B/s]

In [10]:
def prepare_input(examples):
    images = examples["image"]
    images = [np.array(image) for image in images]
    boxes = examples["boxes"]
    labels = examples["labels"]
    words = examples["words"]
    
    # pass all necessary inputs to the processor. Check the documentation to capture all important arguments
    encoded_input = processor()
    return encoded_input

features = Features({
    'image': Array3D(dtype="int64", shape=(3, 224, 224)),
    'input_ids': Sequence(feature=Value(dtype='int64')),
    'attention_mask': Sequence(Value(dtype='int64')),
    'token_type_ids': Sequence(Value(dtype='int64')),
    'bbox': Array2D(dtype="int64", shape=(512, 4)),
    'labels': Sequence(ClassLabel(names=labels)),
})

tokenized_train_dataset = train_dataset.map(prepare_input, batched=True, features=features,remove_columns=train_dataset.column_names)
tokenized_test_dataset = test_dataset.map(prepare_input, batched=True, features=features,remove_columns=test_dataset.column_names)
tokenized_validation_dataset = validation_dataset.map(prepare_input, batched=True, features=features,remove_columns=validation_dataset.column_names)

  0%|          | 0/1 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

Since we want to train our model with the PyTorch version of the Hugging Face Trainer, we set the dataset format to torch.

In [11]:
tokenized_train_dataset.set_format("torch")
tokenized_test_dataset.set_format("torch")
tokenized_validation_dataset.set_format("torch")

In [12]:
tokenized_train_dataset.features

{'image': Array3D(shape=(3, 224, 224), dtype='int64', id=None),
 'input_ids': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None),
 'attention_mask': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None),
 'token_type_ids': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None),
 'bbox': Array2D(shape=(512, 4), dtype='int64', id=None),
 'labels': Sequence(feature=ClassLabel(num_classes=30, names=['I-menu.cnt', 'I-menu.discountprice', 'I-menu.etc', 'I-menu.itemsubtotal', 'I-menu.nm', 'I-menu.num', 'I-menu.price', 'I-menu.sub_cnt', 'I-menu.sub_etc', 'I-menu.sub_nm', 'I-menu.sub_price', 'I-menu.sub_unitprice', 'I-menu.unitprice', 'I-menu.vatyn', 'I-sub_total.discount_price', 'I-sub_total.etc', 'I-sub_total.othersvc_price', 'I-sub_total.service_price', 'I-sub_total.subtotal_price', 'I-sub_total.tax_price', 'I-total.cashprice', 'I-total.changeprice', 'I-total.creditcardprice', 'I-total.emoneyprice', 'I-total.menuqty_cnt', 'I-total.menutype_cnt', 'I

---

# Training

Here we train the model using HuggingFace's Trainer.

First, let's download the model from the hub.

In [13]:
# load the LayoutLMv2ForTokenClassification model from Hugging Face. The path is 'microsoft/layoutlmv2-base-uncased'.
# Don't forget to provide the number of labels using the num_labels argument
model = 

Downloading:   0%|          | 0.00/765M [00:00<?, ?B/s]

Some weights of the model checkpoint at microsoft/layoutlmv2-base-uncased were not used when initializing LayoutLMv2ForTokenClassification: ['layoutlmv2.visual.backbone.bottom_up.res4.19.conv1.norm.num_batches_tracked', 'layoutlmv2.visual.backbone.bottom_up.res5.2.conv2.norm.num_batches_tracked', 'layoutlmv2.visual.backbone.bottom_up.stem.conv1.norm.num_batches_tracked', 'layoutlmv2.visual.backbone.bottom_up.res4.12.conv1.norm.num_batches_tracked', 'layoutlmv2.visual.backbone.bottom_up.res4.10.conv2.norm.num_batches_tracked', 'layoutlmv2.visual.backbone.bottom_up.res4.21.conv1.norm.num_batches_tracked', 'layoutlmv2.visual.backbone.bottom_up.res4.16.conv1.norm.num_batches_tracked', 'layoutlmv2.visual.backbone.bottom_up.res4.1.conv2.norm.num_batches_tracked', 'layoutlmv2.visual.backbone.bottom_up.res5.2.conv1.norm.num_batches_tracked', 'layoutlmv2.visual.backbone.bottom_up.res4.7.conv2.norm.num_batches_tracked', 'layoutlmv2.visual.backbone.bottom_up.res4.20.conv1.norm.num_batches_tracked

In [4]:
# the rest of the code is already prefilled. Check out your results from training

In [14]:
model.config.id2label=id2label
model.config.label2id=label2id

Second, we define our performance metric. Here we are using seqeval. Check out https://github.com/chakki-works/seqeval to learn more about the metrics.

In [15]:
# Metrics
metric = load_metric("seqeval")
return_entity_level_metrics = False

def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)

    # Remove ignored index (special tokens)
    true_predictions = [
        [id2label[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [id2label[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]

    results = metric.compute(predictions=true_predictions, references=true_labels)
    if return_entity_level_metrics:
        # Unpack nested dictionaries
        final_results = {}
        for key, value in results.items():
            if isinstance(value, dict):
                for n, v in value.items():
                    final_results[f"{key}_{n}"] = v
            else:
                final_results[key] = value
        return final_results
    else:
        return {
            "precision": results["overall_precision"],
            "recall": results["overall_recall"],
            "f1": results["overall_f1"],
            "accuracy": results["overall_accuracy"],
        }


Downloading:   0%|          | 0.00/2.48k [00:00<?, ?B/s]
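
The filtering step inside `compute_metrics` can be illustrated on a tiny hand-made batch. Everything in this sketch is invented (two classes, one sequence of five positions), and the argmax is written with plain lists instead of NumPy so that the example runs on its own; the `toy_id2label` name is hypothetical.

```python
# Toy label map and a toy batch with one sequence of five token positions.
toy_id2label = {0: "I-menu.nm", 1: "I-menu.price"}

# Per-position class scores (2 classes); the real model outputs one row per token.
logits = [[[2.0, 0.1], [0.3, 0.2], [0.1, 3.0], [1.5, 0.5], [0.0, 0.0]]]
# -100 marks positions without a label (special tokens, non-first word pieces, padding).
label_ids = [[0, -100, 1, 1, -100]]

# argmax over the class axis, written without NumPy for clarity.
pred_ids = [
    [max(range(len(scores)), key=scores.__getitem__) for scores in seq]
    for seq in logits
]

# Keep only positions whose reference label is not -100.
true_predictions = [
    [toy_id2label[p] for p, l in zip(pred, lab) if l != -100]
    for pred, lab in zip(pred_ids, label_ids)
]
true_labels = [
    [toy_id2label[l] for p, l in zip(pred, lab) if l != -100]
    for pred, lab in zip(pred_ids, label_ids)
]
print(true_predictions)
print(true_labels)
```

Only three of the five positions survive the filter, and the model's mistake at the fourth position is what seqeval would then count against precision and recall.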

In [16]:
training_args = TrainingArguments(
    output_dir="layoutlmv2-finetuned-cord", # name of directory to store the checkpoints
    save_strategy="epoch", # we save our model after each epoch
    num_train_epochs=5, # we train for a maximum of 5 epochs
    learning_rate = 5e-5, # our learning rate
    warmup_ratio=0.1, # fraction of the total training steps used for a linear warmup from 0 to learning_rate
    fp16=True, # we use mixed precision (less memory consumption)
    evaluation_strategy = "epoch", #  we want to see some results during training
    per_device_train_batch_size=2, 
    load_best_model_at_end =True, # we want to get the best model after training
    metric_for_best_model="f1", # our best model is defined by its highest f1-score
    report_to='none',
)
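
To see what these settings imply, the sketch below works out the number of optimization steps and the learning rate at a few points in training. It assumes a single device, no gradient accumulation, and the Trainer's default linear schedule (linear warmup followed by linear decay to zero); `lr_at` is a hypothetical helper, not part of the Trainer API.

```python
import math

num_examples = 800       # size of the training split
batch_size = 2           # per_device_train_batch_size, single device assumed
num_epochs = 5
base_lr = 5e-5
warmup_ratio = 0.1

steps_per_epoch = math.ceil(num_examples / batch_size)   # 400
total_steps = steps_per_epoch * num_epochs               # 2000
warmup_steps = math.ceil(total_steps * warmup_ratio)     # 200


def lr_at(step):
    """Learning rate under linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    return base_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


for step in (0, 100, 200, 1100, 2000):
    print(step, lr_at(step))
```

With these numbers, the learning rate climbs during the first 200 steps (half of the first epoch), peaks at 5e-5, and then falls linearly until step 2000.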

In [17]:
trainer = Trainer(
    model=model, 
    args=training_args, 
    train_dataset=tokenized_train_dataset, 
    eval_dataset=tokenized_validation_dataset,
    compute_metrics=compute_metrics,
)

Using amp fp16 backend


In [18]:
train_result = trainer.train()

***** Running training *****
  Num examples = 800
  Num Epochs = 5
  Instantaneous batch size per device = 2
  Total train batch size (w. parallel, distributed & accumulation) = 2
  Gradient Accumulation steps = 1
  Total optimization steps = 2000


| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|
| 1 | No log | 1.262139 | 0.866232 | 0.859223 | 0.862713 | 0.86688 |
| 2 | 1.949200 | 0.602924 | 0.92641 | 0.916667 | 0.921513 | 0.935041 |
| 3 | 0.710200 | 0.404699 | 0.929498 | 0.927994 | 0.928745 | 0.948765 |
| 4 | 0.386900 | 0.283108 | 0.966102 | 0.968447 | 0.967273 | 0.973468 |
| 5 | 0.262000 | 0.258026 | 0.974838 | 0.971683 | 0.973258 | 0.973925 |


***** Running Evaluation *****
  Num examples = 100
  Batch size = 8
  _warn_prf(average, modifier, msg_start, len(result))
Saving model checkpoint to layoutlmv2-finetuned-cord/checkpoint-400
Configuration saved in layoutlmv2-finetuned-cord/checkpoint-400/config.json
Model weights saved in layoutlmv2-finetuned-cord/checkpoint-400/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 100
  Batch size = 8
  _warn_prf(average, modifier, msg_start, len(result))
Saving model checkpoint to layoutlmv2-finetuned-cord/checkpoint-800
Configuration saved in layoutlmv2-finetuned-cord/checkpoint-800/config.json
Model weights saved in layoutlmv2-finetuned-cord/checkpoint-800/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 100
  Batch size = 8
  _warn_prf(average, modifier, msg_start, len(result))
Saving model checkpoint to layoutlmv2-finetuned-cord/checkpoint-1200
Configuration saved in layoutlmv2-finetuned-cord/checkpoint-1200/config.json
Model weights saved in layoutl

In [19]:
# let's save our model. You can also upload your model directly to the Hugging Face Hub if you want
trainer.save_model("layoutlmv2_cord_model")

Saving model checkpoint to layoutlmv2_cord_model
Configuration saved in layoutlmv2_cord_model/config.json
Model weights saved in layoutlmv2_cord_model/pytorch_model.bin


# Evaluation

In [20]:
metrics = train_result.metrics

In [21]:
trainer.log_metrics("train", metrics)

***** train metrics *****
  epoch                    =        5.0
  total_flos               =  2011243GF
  train_loss               =     0.8271
  train_runtime            = 0:10:05.03
  train_samples_per_second =      6.611
  train_steps_per_second   =      3.306


Let's evaluate our model against:
- validation dataset
- test dataset

In [22]:
metrics = trainer.evaluate()

***** Running Evaluation *****
  Num examples = 100
  Batch size = 8


  _warn_prf(average, modifier, msg_start, len(result))


In [24]:
trainer.log_metrics("eval", metrics)

***** eval metrics *****
  epoch                   =        5.0
  eval_accuracy           =     0.9739
  eval_f1                 =     0.9733
  eval_loss               =      0.258
  eval_precision          =     0.9748
  eval_recall             =     0.9717
  eval_runtime            = 0:00:02.98
  eval_samples_per_second =     33.514
  eval_steps_per_second   =      4.357


In [25]:
predictions, labels, metrics = trainer.predict(tokenized_test_dataset)

***** Running Prediction *****
  Num examples = 100
  Batch size = 8


In [26]:
predictions = np.argmax(predictions, axis=2)

In [27]:
true_predictions = [
    [id2label[p] for (p, l) in zip(prediction, label) if l != -100]
    for prediction, label in zip(predictions, labels)
]

In [28]:
trainer.log_metrics("test", metrics)
trainer.save_metrics("test", metrics)

***** test metrics *****
  test_accuracy           =     0.9664
  test_f1                 =     0.9573
  test_loss               =     0.2995
  test_precision          =     0.9555
  test_recall             =     0.9592
  test_runtime            = 0:00:02.97
  test_samples_per_second =     33.593
  test_steps_per_second   =      4.367
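
As a quick sanity check on how these numbers relate, F1 is the harmonic mean of precision and recall. The sketch below recomputes it from the rounded test precision and recall shown above, so a small rounding difference is expected.

```python
# Values copied from the test metrics above (already rounded to four digits).
precision = 0.9555
recall = 0.9592

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))
```

The result agrees with the reported `test_f1` up to the rounding of the inputs.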


Looks like we did a good job. These results are comparable to the original paper. See here:

![caption](layoutlmCordResults.PNG)

Thanks for reading until the end :-)