<a href="https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/LayoutLMv2/FUNSD/Fine_tuning_LayoutLMv2ForTokenClassification_on_FUNSD_using_HuggingFace_Trainer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this notebook, we are going to fine-tune `LayoutLMv2ForTokenClassification` on the [FUNSD](https://guillaumejaume.github.io/FUNSD/) dataset. The goal for the model is to label words appearing in scanned documents appropriately. This task is treated as a NER problem (sequence labeling). However, compared to BERT, LayoutLMv2 also incorporates visual and layout information about the tokens when encoding them into vectors. This makes the LayoutLMv2 model very powerful for document understanding tasks.

LayoutLMv2 is itself an upgrade of LayoutLM. The main novelty of LayoutLMv2 is that it also pre-trains visual embeddings, whereas the original LayoutLM only adds visual embeddings during fine-tuning.

* Paper: https://arxiv.org/abs/2012.14740
* Original repo: https://github.com/microsoft/unilm/tree/master/layoutlmv2

## Install dependencies

First, we install the required libraries:
* Transformers (for the LayoutLMv2 model)
* Datasets (for data preprocessing)
* Seqeval (for metrics)
* Detectron2 (which LayoutLMv2 requires for its visual backbone).



In [1]:
!pip install -q git+https://github.com/huggingface/transformers.git



In [2]:
!pip install -q datasets seqeval


In [3]:
!pip install -q pyyaml==5.1
# workaround: install old version of pytorch since detectron2 hasn't released packages for pytorch 1.9 (issue: https://github.com/facebookresearch/detectron2/issues/3158)
!pip install -q torch==1.8.0+cu101 torchvision==0.9.0+cu101 -f https://download.pytorch.org/whl/torch_stable.html

# install detectron2 that matches pytorch 1.8
# See https://detectron2.readthedocs.io/tutorials/install.html for instructions
#!pip install -q detectron2 -f https://dl.fbaipublicfiles.com/detectron2/wheels/cu101/torch1.8/index.html
!python -m pip install 'git+https://github.com/facebookresearch/detectron2.git'


To be able to share your model with the community on the HuggingFace hub, there are a few more steps to follow.

First, you have to store your authentication token from the Hugging Face website (sign up [here](https://huggingface.co/welcome) if you haven't already!). Then uncomment the following cell and input your username and password. This only works on Colab; in a regular notebook, you need to do this in a terminal:

In [4]:
#!huggingface-cli login

Then you need to install Git-LFS (which is used by the hub) and set up Git if you haven't already. Uncomment the following instructions and adapt them with your name and email:

In [5]:
#!apt install git-lfs
#!git config --global user.email "example@gmail.com"
#!git config --global user.name "your name"

## Prepare the data

Let's load the FUNSD dataset from the HuggingFace hub.

In [6]:
from datasets import load_dataset

datasets = load_dataset("nielsr/funsd")


Downloading and preparing dataset funsd/funsd (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to /root/.cache/huggingface/datasets/funsd/funsd/1.0.0/8b0472b536a2dcb975d59a4fb9d6fea4e6a1abe260b7fed6f75301e168cbe595...



Dataset funsd downloaded and prepared to /root/.cache/huggingface/datasets/funsd/funsd/1.0.0/8b0472b536a2dcb975d59a4fb9d6fea4e6a1abe260b7fed6f75301e168cbe595. Subsequent calls will reuse this data.


As shown below, it contains a training and a test split. Each example consists of an id, words, bounding boxes, NER tags (in IOB format) and the path to the document image. Note that the `words` are not tokens yet: we still need to convert them to actual tokens (word pieces) using the tokenizer.

In [7]:
datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'words', 'bboxes', 'ner_tags', 'image_path'],
        num_rows: 149
    })
    test: Dataset({
        features: ['id', 'words', 'bboxes', 'ner_tags', 'image_path'],
        num_rows: 50
    })
})

In [8]:
datasets['train'].features

{'bboxes': Sequence(feature=Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None), length=-1, id=None),
 'id': Value(dtype='string', id=None),
 'image_path': Value(dtype='string', id=None),
 'ner_tags': Sequence(feature=ClassLabel(num_classes=7, names=['O', 'B-HEADER', 'I-HEADER', 'B-QUESTION', 'I-QUESTION', 'B-ANSWER', 'I-ANSWER'], names_file=None, id=None), length=-1, id=None),
 'words': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None)}
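
To make the IOB format of `ner_tags` concrete, here is a minimal sketch of how a tag sequence groups into entity spans. The helper name `iob_to_spans` and the example tags are made up for illustration; they are not part of the dataset or of any library. A stray `I-` tag without a preceding `B-` is treated leniently as the start of a new span, which is a choice made for this sketch.

```python
from typing import List, Tuple


def iob_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Group IOB tags into (entity_type, start, end) spans; `end` is exclusive."""
    spans = []
    start, current = None, None
    for i, tag in enumerate(tags):
        # The open span continues only on an "I-" tag of the same entity type.
        continues = tag.startswith("I-") and current == tag[2:]
        if current is not None and not continues:
            spans.append((current, start, i))
            start, current = None, None
        if tag.startswith("B-"):
            start, current = i, tag[2:]
        elif tag.startswith("I-") and current is None:
            # Stray "I-" without a "B-": open a new span (lenient choice).
            start, current = i, tag[2:]
    if current is not None:
        spans.append((current, start, len(tags)))
    return spans


# Hypothetical tag sequence using the label names of this dataset.
example_tags = ["B-QUESTION", "I-QUESTION", "B-ANSWER", "I-ANSWER", "I-ANSWER", "O", "B-HEADER"]
print(iob_to_spans(example_tags))
# [('QUESTION', 0, 2), ('ANSWER', 2, 5), ('HEADER', 6, 7)]
```

Entity-level metrics such as the ones computed later with seqeval are based on spans like these rather than on individual tags.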

## Preprocess data

First, let's store the labels in a list, and create dictionaries that let us map from labels to integer indices and vice versa. The latter will be useful when evaluating the model.

In [9]:
labels = datasets['train'].features['ner_tags'].feature.names
print(labels)

['O', 'B-HEADER', 'I-HEADER', 'B-QUESTION', 'I-QUESTION', 'B-ANSWER', 'I-ANSWER']


In [10]:
id2label = {idx: label for idx, label in enumerate(labels)}
label2id = {label: idx for idx, label in enumerate(labels)}
label2id

{'B-ANSWER': 5,
 'B-HEADER': 1,
 'B-QUESTION': 3,
 'I-ANSWER': 6,
 'I-HEADER': 2,
 'I-QUESTION': 4,
 'O': 0}

Next, let's use `LayoutLMv2Processor` to prepare the data for the model.

In [11]:
from PIL import Image
from transformers import LayoutLMv2Processor
from datasets import Features, Sequence, ClassLabel, Value, Array2D, Array3D

processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutlmv2-base-uncased", revision="no_ocr")

# we need to define custom features
features = Features({
    'image': Array3D(dtype="int64", shape=(3, 224, 224)),
    'input_ids': Sequence(feature=Value(dtype='int64')),
    'attention_mask': Sequence(Value(dtype='int64')),
    'token_type_ids': Sequence(Value(dtype='int64')),
    'bbox': Array2D(dtype="int64", shape=(512, 4)),
    'labels': Sequence(ClassLabel(names=labels)),
})

def preprocess_data(examples):
  images = [Image.open(path).convert("RGB") for path in examples['image_path']]
  words = examples['words']
  boxes = examples['bboxes']
  word_labels = examples['ner_tags']
  
  encoded_inputs = processor(images, words, boxes=boxes, word_labels=word_labels,
                             padding="max_length", truncation=True)
  
  return encoded_inputs

train_dataset = datasets['train'].map(preprocess_data, batched=True, remove_columns=datasets['train'].column_names,
                                      features=features)
test_dataset = datasets['test'].map(preprocess_data, batched=True, remove_columns=datasets['test'].column_names,
                                      features=features)


In [12]:
train_dataset

Dataset({
    features: ['image', 'input_ids', 'attention_mask', 'token_type_ids', 'bbox', 'labels'],
    num_rows: 149
})

Let's verify the first example:

In [13]:
processor.tokenizer.decode(train_dataset['input_ids'][0])

'[CLS] r & d : suggestion : date : licensee yes no 597005708 r & d quality improvement suggestion / solution form name / phone ext. : m. hamann p. harper, p. martinez 9 / 3 / 92 r & d group : j. s. wigand supervisor / manager discontinue coal retention analyses on licensee submitted product samples ( note : coal retention testing is not performed by most licensees. other b & w physical measurements as ends stability and inspection for soft spots in ciparettes are thought to be sufficient measures to assure cigarette physical integrity. the proposed action will increase laboratory productivity. ) suggested solutions ( s ) : delete coal retention from the list of standard analyses performed on licensee submitted product samples. special requests for coal retention testing could still be submitted on an exception basis. have you contacted your manager / supervisor? manager comments : manager, please contact suggester and forward comments to the quality council. qip. wp [SEP] [PAD] [PAD] [

In [14]:
print(train_dataset['labels'][0])

[-100, 0, -100, -100, 3, 3, -100, 3, -100, 5, -100, 3, 3, 0, -100, -100, -100, -100, -100, 1, -100, -100, 2, 2, 2, -100, 2, 2, 3, 4, 4, 4, -100, -100, 4, 5, -100, 6, -100, 6, -100, 6, -100, 6, -100, 6, 5, -100, 6, -100, 6, 3, -100, -100, 4, -100, 5, -100, 6, -100, 6, -100, 3, 4, 4, 5, -100, -100, -100, 6, 6, 6, 6, 6, -100, 6, 6, 6, 6, -100, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, -100, -100, 6, 6, -100, -100, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, -100, -100, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, -100, 6, 6, 6, 6, 6, 6, 6, 6, 6, 3, 4, 4, -100, -100, 4, 5, -100, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, -100, 6, 6, 6, -100, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, -100, 1, 2, 2, 2, 2, -100, 2, -100, 3, 4, -100, 5, -100, 6, 6, 6, -100, 6, 6, 5, 6, 6, 6, 6, -100, 0, -100, 0, 0, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 
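
The `-100` entries above mark positions that the loss should ignore: special tokens, padding, and (as the output suggests) every word piece after the first one of a word. The following is a rough sketch of that alignment, not the processor's actual implementation. The splitter `toy_split` is a hypothetical stand-in for the real WordPiece tokenizer, and `align_labels` is a name made up for this example.

```python
import re
from typing import Callable, List

IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def toy_split(word: str) -> List[str]:
    """Hypothetical splitter: lowercase, and make every punctuation mark its own piece."""
    return re.findall(r"\w+|[^\w\s]", word.lower())


def align_labels(words: List[str], word_labels: List[int],
                 split_word: Callable[[str], List[str]], max_length: int) -> List[int]:
    """Spread word-level labels over word pieces, labelling only the first piece."""
    token_labels = [IGNORE_INDEX]  # [CLS]
    for word, label in zip(words, word_labels):
        pieces = split_word(word)
        token_labels.append(label)  # first piece carries the word's label
        token_labels.extend([IGNORE_INDEX] * (len(pieces) - 1))  # remaining pieces
    token_labels = token_labels[: max_length - 1]  # truncate, leaving room for [SEP]
    token_labels.append(IGNORE_INDEX)  # [SEP]
    token_labels.extend([IGNORE_INDEX] * (max_length - len(token_labels)))  # [PAD]
    return token_labels


aligned = align_labels(["R&D", ":", "Suggestion:"], [0, 3, 3], toy_split, max_length=10)
print(aligned)
# [-100, 0, -100, -100, 3, 3, -100, -100, -100, -100]
```

The first six values match the start of the real output above (`[CLS]`, then `r & d`, `:` and `suggestion :`), although the real tokenizer may split other words differently from this toy splitter.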

Finally, let's set the format to PyTorch.

In [15]:
train_dataset.set_format(type="torch")
test_dataset.set_format(type="torch")

In [16]:
train_dataset.features.keys()

dict_keys(['image', 'input_ids', 'attention_mask', 'token_type_ids', 'bbox', 'labels'])

Next, we create corresponding dataloaders.

In [17]:
from torch.utils.data import DataLoader

train_dataloader = DataLoader(train_dataset, batch_size=4, shuffle=True)
test_dataloader = DataLoader(test_dataset, batch_size=2)

Let's verify a batch:

In [18]:
batch = next(iter(train_dataloader))

for k,v in batch.items():
  print(k, v.shape)

image torch.Size([4, 3, 224, 224])
input_ids torch.Size([4, 512])
attention_mask torch.Size([4, 512])
token_type_ids torch.Size([4, 512])
bbox torch.Size([4, 512, 4])
labels torch.Size([4, 512])


## Train the model

Here we train the model using HuggingFace's Trainer. We need to override a few methods, namely those that return the PyTorch dataloaders, as we defined custom dataloaders above.

We can initialize a `Trainer` by passing our model as well as `TrainingArguments`. See the [docs](https://huggingface.co/transformers/main_classes/trainer.html) for all possible arguments.

In [20]:
from transformers import LayoutLMv2ForTokenClassification, TrainingArguments, Trainer
from datasets import load_metric
import numpy as np

model = LayoutLMv2ForTokenClassification.from_pretrained(
    'microsoft/layoutlmv2-base-uncased', num_labels=len(label2id)
)

# Set id2label and label2id 
model.config.id2label = id2label
model.config.label2id = label2id

# Metrics
metric = load_metric("seqeval")
return_entity_level_metrics = True

def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)

    # Remove ignored index (special tokens)
    true_predictions = [
        [id2label[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [id2label[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]

    results = metric.compute(predictions=true_predictions, references=true_labels)
    if return_entity_level_metrics:
        # Unpack nested dictionaries
        final_results = {}
        for key, value in results.items():
            if isinstance(value, dict):
                for n, v in value.items():
                    final_results[f"{key}_{n}"] = v
            else:
                final_results[key] = value
        return final_results
    else:
        return {
            "precision": results["overall_precision"],
            "recall": results["overall_recall"],
            "f1": results["overall_f1"],
            "accuracy": results["overall_accuracy"],
        }

class FunsdTrainer(Trainer):
    # Return the custom dataloaders defined above instead of letting the Trainer build its own
    def get_train_dataloader(self):
        return train_dataloader

    def get_test_dataloader(self, test_dataset):
        return test_dataloader

args = TrainingArguments(
    output_dir="layoutlmv2-finetuned-funsd-v2", # name of directory to store the checkpoints
    max_steps=1000, # we train for a maximum of 1,000 batches
    warmup_ratio=0.1, # we warmup a bit
    fp16=True, # we use mixed precision (less memory consumption)
    push_to_hub=True, # after training, we'd like to push our model to the hub
    push_to_hub_model_id="layoutlmv2-finetuned-funsd-test", # this is the name we'll use for our model on the hub
)

# Initialize our Trainer
trainer = FunsdTrainer(
    model=model,
    args=args,
    compute_metrics=compute_metrics,
)

loading configuration file https://huggingface.co/microsoft/layoutlmv2-base-uncased/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/882f0cab8dbb456e5b1d6e3b96e864be0cb6c2bc5d20ee88eeda10b3c0317332.a3f80b6502f00efe74a01c6a007f196803229a24224659c81c1da7c4cf5316e8
Model config LayoutLMv2Config {
  "attention_probs_dropout_prob": 0.1,
  "convert_sync_batchnorm": true,
  "coordinate_size": 128,
  "detectron2_config_args": {
    "MODEL.ANCHOR_GENERATOR.SIZES": [
      [
        32
      ],
      [
        64
      ],
      [
        128
      ],
      [
        256
      ],
      [
        512
      ]
    ],
    "MODEL.BACKBONE.NAME": "build_resnet_fpn_backbone",
    "MODEL.FPN.IN_FEATURES": [
      "res2",
      "res3",
      "res4",
      "res5"
    ],
    "MODEL.MASK_ON": true,
    "MODEL.PIXEL_STD": [
      57.375,
      57.12,
      58.395
    ],
    "MODEL.POST_NMS_TOPK_TEST": 1000,
    "MODEL.RESNETS.ASPECT_RATIOS": [
      [
        0.5,
        1.0,
    

Let's train the model! By default, the Trainer saves checkpoints every 500 steps.

In [21]:
trainer.train()

***** Running training *****
  Num examples = 8000
  Num Epochs = 9223372036854775807
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 1000


Step,Training Loss
500,0.7982
1000,0.2184


Saving model checkpoint to layoutlmv2-finetuned-funsd-v2/checkpoint-500
Configuration saved in layoutlmv2-finetuned-funsd-v2/checkpoint-500/config.json
Model weights saved in layoutlmv2-finetuned-funsd-v2/checkpoint-500/pytorch_model.bin
Saving model checkpoint to layoutlmv2-finetuned-funsd-v2/checkpoint-1000
Configuration saved in layoutlmv2-finetuned-funsd-v2/checkpoint-1000/config.json
Model weights saved in layoutlmv2-finetuned-funsd-v2/checkpoint-1000/pytorch_model.bin


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=1000, training_loss=0.5082762908935546, metrics={'train_runtime': 628.7041, 'train_samples_per_second': 12.725, 'train_steps_per_second': 1.591, 'total_flos': 2117231568021504.0, 'train_loss': 0.5082762908935546, 'epoch': 26.01})
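
To get a feeling for what `warmup_ratio=0.1` means with `max_steps=1000`, here is a small sketch of the learning rate over the course of training. It assumes the Trainer defaults of a linear schedule and a peak learning rate of 5e-5, neither of which is set explicitly in this notebook. The function `linear_warmup_lr` is made up for illustration and is not a Transformers API.

```python
import math


def linear_warmup_lr(step: int, max_steps: int = 1000, warmup_ratio: float = 0.1,
                     base_lr: float = 5e-5) -> float:
    """Learning rate at `step`: linear warmup to `base_lr`, then linear decay to 0.

    `base_lr` and the linear shape are assumed defaults, not values from the notebook.
    """
    warmup_steps = math.ceil(max_steps * warmup_ratio)  # 100 steps here
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max_steps - step
    return base_lr * max(0.0, remaining / max(1, max_steps - warmup_steps))


for step in (0, 50, 100, 550, 1000):
    print(step, linear_warmup_lr(step))
```

Under these assumptions, the learning rate rises during the first 100 steps, peaks at step 100, and is back to zero at step 1000.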

To compute metrics on the test set, we can run `trainer.predict()`. We get predictions, labels, and metrics back.

In [22]:
predictions, labels, metrics = trainer.predict(test_dataset)

***** Running Prediction *****
  Num examples = 50
  Batch size = 2


In [23]:
print(metrics)

{'test_loss': 0.6071677207946777, 'test_ANSWER_precision': 0.7882219705549264, 'test_ANSWER_recall': 0.8603213844252163, 'test_ANSWER_f1': 0.822695035460993, 'test_ANSWER_number': 809, 'test_HEADER_precision': 0.6017699115044248, 'test_HEADER_recall': 0.5714285714285714, 'test_HEADER_f1': 0.5862068965517241, 'test_HEADER_number': 119, 'test_QUESTION_precision': 0.8778846153846154, 'test_QUESTION_recall': 0.8572769953051643, 'test_QUESTION_f1': 0.867458432304038, 'test_QUESTION_number': 1065, 'test_overall_precision': 0.8236738703339882, 'test_overall_recall': 0.8414450577019569, 'test_overall_f1': 0.8324646314221892, 'test_overall_accuracy': 0.831893532570628, 'test_runtime': 4.3836, 'test_samples_per_second': 11.406, 'test_steps_per_second': 1.597}
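
Besides the metrics, `trainer.predict()` also returns the raw `predictions` (one score per label for every token) and the `labels`. The sketch below shows how these could be turned into tag names per document, mirroring what `compute_metrics` does. It uses plain Python lists and made-up toy scores in place of the real arrays, and `decode_predictions` is a hypothetical helper name.

```python
from typing import Dict, List

label_names = ["O", "B-HEADER", "I-HEADER", "B-QUESTION", "I-QUESTION", "B-ANSWER", "I-ANSWER"]
id_to_label = dict(enumerate(label_names))


def decode_predictions(logits: List[List[List[float]]], label_ids: List[List[int]],
                       id2label: Dict[int, str], ignore_index: int = -100) -> List[List[str]]:
    """Pick the highest-scoring label per token, skipping positions labelled `ignore_index`."""
    decoded = []
    for doc_logits, doc_labels in zip(logits, label_ids):
        doc_tags = []
        for token_logits, gold in zip(doc_logits, doc_labels):
            if gold == ignore_index:
                continue  # special token, padding, or non-first word piece
            best = max(range(len(token_logits)), key=token_logits.__getitem__)
            doc_tags.append(id2label[best])
        decoded.append(doc_tags)
    return decoded


# Toy data: one document with four token positions and seven label scores each.
toy_logits = [[
    [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1],  # [CLS], ignored
    [0.0, 0.2, 0.1, 2.5, 0.3, 0.0, 0.1],  # highest score at index 3
    [0.0, 0.1, 0.0, 0.4, 3.1, 0.2, 0.0],  # continuation piece, ignored
    [0.3, 0.0, 0.0, 0.0, 0.0, 1.9, 0.2],  # highest score at index 5
]]
toy_labels = [[-100, 3, -100, 5]]

print(decode_predictions(toy_logits, toy_labels, id_to_label))
# [['B-QUESTION', 'B-ANSWER']]
```

With the real outputs, the same idea applies to the `predictions` and `labels` arrays returned above, using the `id2label` dictionary defined earlier.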


The numbers I got were:

* run 1: `'test_overall_precision': 0.8190854870775348, 'test_overall_recall': 0.8268941294530858, 'test_overall_f1': 0.8229712858926342`
* run 2: `'test_overall_precision': 0.8236738703339882, 'test_overall_recall': 0.8414450577019569, 'test_overall_f1': 0.8324646314221892`
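
As a quick sanity check on these numbers, the overall F1 score is the harmonic mean of precision and recall. The small helper below (a name made up for this example) recomputes it from the precision and recall values of both runs.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 if both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall copied from the two runs listed above.
run1_f1 = f1_from_precision_recall(0.8190854870775348, 0.8268941294530858)
run2_f1 = f1_from_precision_recall(0.8236738703339882, 0.8414450577019569)
print(run1_f1, run2_f1)
```

Both values should agree with the reported `test_overall_f1` of the corresponding run up to floating-point rounding.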

## Share model on the hub

Finally, we can easily push our model to the hub as follows:

In [None]:
trainer.push_to_hub()