## Set up environment

First, we install 🤗 Transformers, as well as 🤗 Datasets and Seqeval (the latter is useful for evaluation metrics such as F1 on sequence labeling tasks).

In [1]:
!pip install -q git+https://github.com/huggingface/transformers.git



In [2]:
!pip install -q datasets seqeval

## Load dataset

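Next, we load the dataset. The tutorial this notebook is based on loads [FUNSD](https://guillaumejaume.github.io/FUNSD/), a collection of annotated forms, from the 🤗 [hub](https://huggingface.co/datasets/nielsr/funsd-layoutlmv3). Here we instead load our own dataset of annotated CV pages from JSON files stored on Google Drive. It has the same kind of columns (words, bounding boxes, tags and image), with tags such as `WE_JOB_TITLE` and `WE_DATE`.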

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
import json

import numpy as np
import pandas as pd
import torch
from datasets import Array2D, Array3D, ClassLabel, Features, Sequence, Value, load_dataset, load_metric
from PIL import Image
from torch.optim import AdamW  # the AdamW class in transformers is deprecated in favour of this one
from tqdm.notebook import tqdm

In [5]:
PATH = '/content/drive/MyDrive/we/data_IB/'

In [6]:
TRAIN_PATH = PATH + 'train.json'
VAL_PATH = PATH + 'validation.json'
TEST_PATH = PATH + 'test.json'

In [7]:
# read the raw training file
with open(TRAIN_PATH) as infile:
  data = json.load(infile)

In [8]:
features = Features({
    'id': Value(dtype='int64', id=None),
    'tokens': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None),
    'bboxes': Sequence(feature=Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None), length=-1, id=None),
    'ner_tags': Sequence(
        feature=ClassLabel(
            num_classes=11,
            names=[
                'NONE',
                'B-WE_JOB_TITLE',
                'I-WE_JOB_TITLE',
                'B-WE_DATE',
                'I-WE_DATE',
                'B-WE_LOC',
                'I-WE_LOC',
                'B-WE_ORG',
                'I-WE_ORG',
                'B-WE_DESCRIPTION',
                'I-WE_DESCRIPTION',
            ],
            id=None,
        ),
        length=-1,
        id=None,
    ),
    # path to the page image, stored as a plain string
    'image': Value(dtype='string', id=None),
})

In [9]:
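def iob_to_label(label):
    """
    Strips the IOB prefix ("B-" or "I-") from a tag.

    Args:
        label: IOB tag of a word, e.g. "B-WE_DATE"

    Returns:
        the tag without its prefix, e.g. "WE_DATE"; a tag without a prefix
        (such as "NONE") is returned unchanged, and an empty result becomes 'o'
    """

    if label[:2] in ('B-', 'I-'):
      label = label[2:]
    if not label:
      return 'o'
    return label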

In [21]:
train_val_dataset = load_dataset('json', data_files={'train': TRAIN_PATH, 'val': VAL_PATH, 'test': TEST_PATH}, field="cvs", features=features)




As we can see below, the dataset consists of 3 splits ("train", "val" and "test"). Each example contains a list of words ("tokens") with their corresponding boxes ("bboxes") and tags ("ner_tags"), as well as the path to the original image ("image").

In [22]:
train_val_dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'bboxes', 'ner_tags', 'image'],
        num_rows: 64
    })
    val: Dataset({
        features: ['id', 'tokens', 'bboxes', 'ner_tags', 'image'],
        num_rows: 14
    })
    test: Dataset({
        features: ['id', 'tokens', 'bboxes', 'ner_tags', 'image'],
        num_rows: 14
    })
})

Let's check the features:

In [23]:
train_val_dataset["train"].features

{'id': Value(dtype='int64', id=None),
 'tokens': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None),
 'bboxes': Sequence(feature=Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None), length=-1, id=None),
 'ner_tags': Sequence(feature=ClassLabel(names=['NONE', 'B-WE_JOB_TITLE', 'I-WE_JOB_TITLE', 'B-WE_DATE', 'I-WE_DATE', 'B-WE_LOC', 'I-WE_LOC', 'B-WE_ORG', 'I-WE_ORG', 'B-WE_DESCRIPTION', 'I-WE_DESCRIPTION'], id=None), length=-1, id=None),
 'image': Value(dtype='string', id=None)}

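Note that the "image" column here is a plain string holding the path to the image file, not an [Image](https://huggingface.co/docs/datasets/v2.2.1/en/package_reference/main_classes#datasets.Image) feature, so indexing it returns the path rather than displaying the picture. Open the path with `Image.open` to view the page in the notebook.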

In [24]:
example = train_val_dataset["train"][0]
example["image"]

'/content/drive/MyDrive/cv_images/Curriculo2020compressed-3.png'

In [25]:
labels = train_val_dataset['train'].features['ner_tags'].feature.names
labels

['NONE',
 'B-WE_JOB_TITLE',
 'I-WE_JOB_TITLE',
 'B-WE_DATE',
 'I-WE_DATE',
 'B-WE_LOC',
 'I-WE_LOC',
 'B-WE_ORG',
 'I-WE_ORG',
 'B-WE_DESCRIPTION',
 'I-WE_DESCRIPTION']

In [26]:
id2label = {i: label for i, label in enumerate(labels)}
label2id = {label: i for i, label in enumerate(labels)}
label2id

{'NONE': 0,
 'B-WE_JOB_TITLE': 1,
 'I-WE_JOB_TITLE': 2,
 'B-WE_DATE': 3,
 'I-WE_DATE': 4,
 'B-WE_LOC': 5,
 'I-WE_LOC': 6,
 'B-WE_ORG': 7,
 'I-WE_ORG': 8,
 'B-WE_DESCRIPTION': 9,
 'I-WE_DESCRIPTION': 10}

## Prepare dataset

Next, we prepare the dataset for the model. This can be done very easily using `LayoutLMv3Processor`, which internally wraps a `LayoutLMv3FeatureExtractor` (for the image modality) and a `LayoutLMv3Tokenizer` (for the text modality) into one.

Basically, the processor does the following internally:
* the feature extractor is used to resize + normalize each document image into `pixel_values`
* the tokenizer is used to turn the words, boxes and NER tags into token-level `input_ids`, `attention_mask` and `labels`.

The processor simply returns a dictionary that contains all these keys.
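To make the label handling concrete, here is a minimal pure-Python sketch of how word-level tags can be spread over subword tokens. It is only an illustration under stated assumptions: `toy_subword_split` and `align_labels` are hypothetical helpers, not part of 🤗 Transformers, and the real tokenizer splits words differently. The idea it shows is the one visible in the outputs further down: the first piece of each word keeps the word's label, and the remaining pieces get -100 so that the loss skips them.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss ignores by default


def toy_subword_split(word, piece_len=4):
    """Hypothetical stand-in for a subword tokenizer: cut a word into fixed-size pieces."""
    return [word[i:i + piece_len] for i in range(0, len(word), piece_len)]


def align_labels(words, word_labels):
    """Give each word's label to its first piece and IGNORE_INDEX to the other pieces."""
    tokens, labels = [], []
    for word, label in zip(words, word_labels):
        pieces = toy_subword_split(word)
        tokens.extend(pieces)
        labels.extend([label] + [IGNORE_INDEX] * (len(pieces) - 1))
    return tokens, labels


# toy input: three words with one label id each
tokens, labels = align_labels(["Marketing", "Specialist", "PHC"], [1, 2, 7])
for token, label in zip(tokens, labels):
    print(token, label)
```

The token and label lists always have the same length, which is what lets the model compute one loss term per token.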

In [27]:
from transformers import AutoProcessor

# we'll use the Auto API here - it will load LayoutLMv3Processor behind the scenes,
# based on the checkpoint we provide from the hub
processor = AutoProcessor.from_pretrained("microsoft/layoutlmv3-base", apply_ocr=False)

Next, we create the `id2label` and `label2id` mappings, which are useful for inference. Note that `LayoutLMv3ForTokenClassification` (the model we'll use later on) will simply output an integer index for a particular class (for each token), so we still need to map it to an actual class name.

In [28]:
from datasets.features import ClassLabel

features = train_val_dataset["train"].features
column_names = train_val_dataset["train"].column_names
image_column_name = "image"
text_column_name = "tokens"
boxes_column_name = "bboxes"
label_column_name = "ner_tags"

# In the event the labels are not a `Sequence[ClassLabel]`, we will need to go through the dataset to get the
# unique labels.
def get_label_list(labels):
    unique_labels = set()
    for label in labels:
        unique_labels = unique_labels | set(label)
    label_list = list(unique_labels)
    label_list.sort()
    return label_list

if isinstance(features[label_column_name].feature, ClassLabel):
    label_list = features[label_column_name].feature.names
    # No need to convert the labels since they are already ints.
    id2label = {k: v for k,v in enumerate(label_list)}
    label2id = {v: k for k,v in enumerate(label_list)}
else:
    label_list = get_label_list(train_val_dataset["train"][label_column_name])
    id2label = {k: v for k,v in enumerate(label_list)}
    label2id = {v: k for k,v in enumerate(label_list)}
num_labels = len(label_list)

In [30]:
print(label_list)

['NONE', 'B-WE_JOB_TITLE', 'I-WE_JOB_TITLE', 'B-WE_DATE', 'I-WE_DATE', 'B-WE_LOC', 'I-WE_LOC', 'B-WE_ORG', 'I-WE_ORG', 'B-WE_DESCRIPTION', 'I-WE_DESCRIPTION']


In [31]:
print(id2label)

{0: 'NONE', 1: 'B-WE_JOB_TITLE', 2: 'I-WE_JOB_TITLE', 3: 'B-WE_DATE', 4: 'I-WE_DATE', 5: 'B-WE_LOC', 6: 'I-WE_LOC', 7: 'B-WE_ORG', 8: 'I-WE_ORG', 9: 'B-WE_DESCRIPTION', 10: 'I-WE_DESCRIPTION'}


Next, we'll define a function which we can apply on the entire dataset.

In [32]:
def prepare_examples(examples):
  # the "image" column holds file paths, so we open the images ourselves
  images = [Image.open(path).convert("RGB") for path in examples[image_column_name]]
  words = examples[text_column_name]
  boxes = examples[boxes_column_name]
  word_labels = examples[label_column_name]

  # documents longer than the maximum length are split into several windows
  encoding = processor(images, words, boxes=boxes, word_labels=word_labels,
                       return_overflowing_tokens=True,
                       return_offsets_mapping=True,
                       truncation=True, padding="max_length")

  # these two keys are not model inputs, so we drop them
  encoding.pop("overflow_to_sample_mapping")
  encoding.pop("offset_mapping")

  return encoding

In [36]:
from datasets import Features, Sequence, ClassLabel, Value, Array2D, Array3D

# we need to define custom features for `set_format` (used later on) to work properly
features = Features({
    'pixel_values': Array3D(dtype="float32", shape=(3, 224, 224)),
    'input_ids': Sequence(feature=Value(dtype='int64')),
    'attention_mask': Sequence(Value(dtype='int64')),
    'bbox': Array2D(dtype="int64", shape=(512, 4)),
    'labels': Sequence(feature=Value(dtype='int64')),
})

train_dataset = train_val_dataset["train"].map(
    prepare_examples,
    batched=True,
    remove_columns=column_names,
    features=features,
)
eval_dataset = train_val_dataset["val"].map(
    prepare_examples,
    batched=True,
    remove_columns=column_names,
    features=features,
)
test_dataset = train_val_dataset["test"].map(
    prepare_examples,
    batched=True,
    remove_columns=column_names,
    features=features,
)


In [37]:
train_dataset.features

{'pixel_values': Array3D(shape=(3, 224, 224), dtype='float32', id=None),
 'input_ids': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None),
 'attention_mask': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None),
 'bbox': Array2D(shape=(512, 4), dtype='int64', id=None),
 'labels': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None)}

In [38]:
example = train_dataset[0]
processor.tokenizer.decode(example["input_ids"])

"<s> Feb 2016 - Apr 2017 Marketing Specialist PHC Software -Manage and re-organize the new marketing department; -Manage all the identity, promotion and graphic material of annual event; -Graphic and event agencies management (Laranja Mecânica and iMotion); -Email marketing manager (HTML, CSS, responsive); -Product campaign manager; -New software versions campaign manager; -Social Media manager (Twitter, Facebook, LinkedIn); -Interns Responsible (using Trello for managing tasks and assignment); -Brand differentiation project manager (marketing research about PHC brand, product, partners and market); -Department budget and logistics management; -External agencies management (clipping); -Website management (content, front-end); -New website planification (rebranding, mock-ups, wireframe, content, SEO, Analytics integration); -PHC Rebranding (planification of new content, graphic materials, software and platforms updates, etc); -Intranet and Extranet manager; -Case studies – create script

Next, we set the format to PyTorch.

In [39]:
train_dataset.set_format("torch")

Let's verify that everything was created properly:

In [40]:
import torch

example = train_dataset[0]
for k,v in example.items():
    print(k,v.shape)

pixel_values torch.Size([3, 224, 224])
input_ids torch.Size([512])
attention_mask torch.Size([512])
bbox torch.Size([512, 4])
labels torch.Size([512])


In [41]:
eval_dataset

Dataset({
    features: ['pixel_values', 'input_ids', 'attention_mask', 'bbox', 'labels'],
    num_rows: 17
})
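
The validation split holds 14 documents, yet `eval_dataset` has 17 rows. This is most likely the effect of `return_overflowing_tokens=True` in `prepare_examples`: a document that does not fit in one sequence is emitted as several rows. The sketch below illustrates the idea with plain lists. It is a simplification under stated assumptions: `split_into_windows` is a hypothetical helper, the window size of 8 and the padding id of 1 are toy values, and the real tokenizer also adds special tokens.

```python
def split_into_windows(token_ids, max_length, pad_id=1):
    """Cut a token list into windows of max_length, padding the last one."""
    windows = []
    for start in range(0, len(token_ids), max_length):
        window = token_ids[start:start + max_length]
        window = window + [pad_id] * (max_length - len(window))
        windows.append(window)
    return windows


# two toy "documents": the first fits in one window, the second needs two
documents = [list(range(100, 106)), list(range(200, 213))]

rows, sample_mapping = [], []
for doc_index, doc in enumerate(documents):
    for window in split_into_windows(doc, max_length=8):
        rows.append(window)
        sample_mapping.append(doc_index)  # which document each row came from

print(len(documents), "documents ->", len(rows), "rows")
print(sample_mapping)
```

The `sample_mapping` list plays the role of the `overflow_to_sample_mapping` key that `prepare_examples` pops from the encoding.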

In [42]:
processor.tokenizer.decode(eval_dataset[0]["input_ids"])

'<s> 2008/17 - OLIVA CONSTRUCTIONS AND REAL ESTATE DEVELOPMENTS - DIRECTOR-Restructuring,developingandimplementingstrategicplanningand thecreationofmanagementindicators-implementationofERP/redesignofthe main processes.; Commercial process and definition of strategies with the implementation of the whole process and articulation between: marketing, communication,trainingofsalesteams,managementoflargeclients,contracts andrelationshipwiththemarket. ACHIEVEMENT:Investmentininnovationandgrowth. 2005/08–BISTEKSUPERMARKET-MANAGEMENTCONSULTANT-Strategic plan; technology management / processes; management between stores and sectors;indicatorsinallsectors. ACHIEVEMENT:Improvementandreengineeringofinternalprocesses.. 2005/08 – BERIMBAU COMMUNICATION INTELLIGENCE - DIRECTOR - Relationshipwithclientsandadministrativeorfinancialareas. ACHIEVEMENT:Marketpositionandleveragethebrand. 2005 /07 – UNIVERSITY OF SOUTHERN SANTA CATARINA – UNISUL - UNIVERSITY PROFESSOR - Marketing, Strategic Planning, Proces

In [43]:
for token_id, label in zip(train_dataset[0]["input_ids"], train_dataset[0]["labels"]):
  print(processor.tokenizer.decode([token_id]), label.item())

<s> -100
 Feb 3
 2016 4
 - 4
 Apr 4
 2017 4
 Marketing 2
 Specialist 2
 PH 8
C -100
 Software 8
 - 10
Man -100
age -100
 and 10
 re 10
- -100
organ -100
ize -100
 the 10
 new 10
 marketing 10
 department 10
; -100
 - 10
Man -100
age -100
 all 10
 the 10
 identity 10
, -100
 promotion 10
 and 10
 graphic 10
 material 10
 of 10
 annual 10
 event 10
; -100
 - 10
G -100
raphic -100
 and 10
 event 10
 agencies 10
 management 10
 ( 10
L -100
aran -100
ja -100
 M 10
ec -100
â -100
n -100
ica -100
 and 10
 i 10
Motion -100
); -100
 - 10
Email -100
 marketing 10
 manager 10
 ( 10
HTML -100
, -100
 CSS 10
, -100
 responsive 10
); -100
 - 10
Product -100
 campaign 10
 manager 10
; -100
 - 10
New -100
 software 10
 versions 10
 campaign 10
 manager 10
; -100
 - 10
Social -100
 Media 10
 manager 10
 ( 10
Twitter -100
, -100
 Facebook 10
, -100
 LinkedIn 10
); -100
 - 10
Intern -100
s -100
 Respons 10
ible -100
 ( 10
using -100
 Tre 10
llo -100
 for 10
 managing 10
 tasks 10
 and 10
 assignment 10
)

## Define metrics

Next, we define a `compute_metrics` function in the form the Trainer expects. The manual training loop further below computes the same seqeval metrics inline.

This function should take a named tuple as input, and return a dictionary as output as stated in the [docs](https://huggingface.co/docs/transformers/main_classes/trainer).
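
As a rough illustration of what an entity-level score measures, here is a small pure-Python sketch. It is not seqeval's implementation, only a simplified approximation under stated assumptions: `extract_entities` and `entity_f1` are hypothetical helpers, and the label ids are toy values. Positions labelled -100 are dropped first, then predicted and true tag sequences are turned into entity spans, and an entity counts as correct only if its type and both boundaries match.

```python
def extract_entities(tags):
    """Collect (type, start, end) spans from a list of IOB tags; end is exclusive."""
    entities, start, current = [], None, None
    for i, tag in enumerate(list(tags) + ["NONE"]):  # the sentinel closes a trailing span
        prefix, _, kind = tag.partition("-")
        inside = prefix == "I" and kind == current
        if current is not None and not inside:
            entities.append((current, start, i))
            start, current = None, None
        if prefix == "B" or (prefix == "I" and current is None):
            start, current = i, kind
    return set(entities)


def entity_f1(predictions, labels, id2label, ignore_index=-100):
    """Precision, recall and F1 over exact entity matches, skipping ignored positions."""
    pred_tags = [id2label[p] for p, l in zip(predictions, labels) if l != ignore_index]
    true_tags = [id2label[l] for l in labels if l != ignore_index]
    pred_entities = extract_entities(pred_tags)
    true_entities = extract_entities(true_tags)
    correct = len(pred_entities & true_entities)
    precision = correct / len(pred_entities) if pred_entities else 0.0
    recall = correct / len(true_entities) if true_entities else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# toy example: the date entity is found, the organisation is missed
toy_id2label = {0: "NONE", 1: "B-WE_DATE", 2: "I-WE_DATE", 3: "B-WE_ORG"}
toy_labels = [-100, 1, 2, -100, 3, 0]
toy_predictions = [0, 1, 2, 2, 0, 0]
scores = entity_f1(toy_predictions, toy_labels, toy_id2label)
print(scores)
```

Because one of the two true entities is missed, recall drops while precision stays perfect, which token-level accuracy alone would not reveal.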

In [44]:
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
device

device(type='cuda')

In [45]:
train_dataset.set_format(type="torch", device=device)
eval_dataset.set_format(type="torch", device=device)
test_dataset.set_format(type="torch", device=device)

In [46]:
from torch.utils.data import DataLoader

train_dataloader = DataLoader(train_dataset, batch_size=4, shuffle=True)
# evaluation does not need shuffling
val_dataloader = DataLoader(eval_dataset, batch_size=2, shuffle=False)
test_dataloader = DataLoader(test_dataset, batch_size=2, shuffle=False)

In [47]:
from datasets import load_metric

metric = load_metric("seqeval")


In [48]:
import numpy as np

return_entity_level_metrics = False

def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)

    # Remove ignored index (special tokens)
    true_predictions = [
        [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]

    results = metric.compute(predictions=true_predictions, references=true_labels)
    if return_entity_level_metrics:
        # Unpack nested dictionaries
        final_results = {}
        for key, value in results.items():
            if isinstance(value, dict):
                for n, v in value.items():
                    final_results[f"{key}_{n}"] = v
            else:
                final_results[key] = value
        return final_results
    else:
        return {
            "precision": results["overall_precision"],
            "recall": results["overall_recall"],
            "f1": results["overall_f1"],
            "accuracy": results["overall_accuracy"],
        }

## Define the model

Next we define the model: this is a Transformer encoder with pre-trained weights, and a randomly initialized head on top for token classification.

In [49]:
from torch import nn
from transformers import Trainer


class CustomTrainer(Trainer):
    """
    Placeholder subclass: it currently behaves exactly like `Trainer`.

    To weight the classes differently, `compute_loss` could be overridden, for example:

        def compute_loss(self, model, inputs, return_outputs=False):
            labels = inputs.get("labels")
            # forward pass
            outputs = model(**inputs)
            logits = outputs.get("logits")
            # compute custom loss (suppose one has 3 labels with different weights)
            loss_fct = nn.CrossEntropyLoss(weight=torch.tensor([1.0, 2.0, 3.0]))
            loss = loss_fct(logits.view(-1, self.model.config.num_labels), labels.view(-1))
            return (loss, outputs) if return_outputs else loss
    """

In [50]:
from transformers import LayoutLMv3ForTokenClassification

model = LayoutLMv3ForTokenClassification.from_pretrained("microsoft/layoutlmv3-base",
                                                         id2label=id2label,
                                                         label2id=label2id)
model.to(device)


Some weights of LayoutLMv3ForTokenClassification were not initialized from the model checkpoint at microsoft/layoutlmv3-base and are newly initialized: ['classifier.out_proj.bias', 'classifier.dense.weight', 'classifier.out_proj.weight', 'classifier.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


LayoutLMv3ForTokenClassification(
  (layoutlmv3): LayoutLMv3Model(
    (embeddings): LayoutLMv3TextEmbeddings(
      (word_embeddings): Embedding(50265, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (x_position_embeddings): Embedding(1024, 128)
      (y_position_embeddings): Embedding(1024, 128)
      (h_position_embeddings): Embedding(1024, 128)
      (w_position_embeddings): Embedding(1024, 128)
    )
    (patch_embed): LayoutLMv3PatchEmbeddings(
      (proj): Conv2d(3, 768, kernel_size=(16, 16), stride=(16, 16))
    )
    (pos_drop): Dropout(p=0.0, inplace=False)
    (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
    (norm): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
    (encoder): LayoutLMv3Encoder

In [51]:
# Set id2label and label2id 
model.config.id2label = id2label
model.config.label2id = label2id

In [52]:
from torch.utils.tensorboard import SummaryWriter
# Writer will output to ./runs/ directory by default
writer = SummaryWriter()

In [None]:
import torch.nn as nn

model.to(device)
optimizer = AdamW(model.parameters(), lr=5e-5)
loss_fct = nn.CrossEntropyLoss()  # positions labelled -100 are ignored by default
global_step = 0
num_train_epochs = 20
t_total = len(train_dataloader) * num_train_epochs # total number of training steps

metric = load_metric("seqeval")
return_entity_level_metrics = True
score_names = ["overall_precision", "overall_recall", "overall_f1", "overall_accuracy"]

min_valid_loss = np.inf
best_accuracy = 0.0  # best validation accuracy seen so far
patience = 3  # stop after this many consecutive epochs without improvement
patience_counter = 0

def decode_batch(logits, labels):
  """Maps logits and label ids to tag names, skipping positions labelled -100."""
  predictions = logits.argmax(dim=2)
  true_predictions = [
      [id2label[p.item()] for (p, l) in zip(prediction, label) if l != -100]
      for prediction, label in zip(predictions, labels)
  ]
  true_labels = [
      [id2label[l.item()] for (p, l) in zip(prediction, label) if l != -100]
      for prediction, label in zip(predictions, labels)
  ]
  return true_predictions, true_labels

for epoch in range(num_train_epochs):
  running_loss = 0.0
  print("Epoch:", epoch)

  # put the model in training mode
  model.train()
  for batch in tqdm(train_dataloader):
    # zero the parameter gradients
    optimizer.zero_grad()

    # forward + backward + optimize
    outputs = model(**batch)
    labels = batch['labels']
    logits = outputs.logits
    loss = loss_fct(logits.view(-1, model.config.num_labels), labels.view(-1))
    loss.backward()
    optimizer.step()

    running_loss += loss.item()
    writer.add_scalar("Loss/train", loss.item(), global_step)

    # print the loss and the metrics of the current batch every 100 steps
    if global_step % 100 == 0:
      print(f"Loss after {global_step} steps: {loss.item()}")
      true_predictions, true_labels = decode_batch(logits, labels)
      final_score = metric.compute(predictions=true_predictions, references=true_labels)
      print(final_score)
      for name in score_names:
        writer.add_scalar(f"{name}/train", final_score[name], global_step)

    global_step += 1

  valid_loss = 0.0
  model.eval()
  for batch in tqdm(val_dataloader, desc="Evaluating"):
    with torch.no_grad():
      # forward pass
      outputs = model(**batch)

    labels = batch['labels']
    logits = outputs.logits
    loss = loss_fct(logits.view(-1, model.config.num_labels), labels.view(-1))
    valid_loss += loss.item()

    # accumulate every validation batch, not only the last one
    true_predictions, true_labels = decode_batch(logits, labels)
    metric.add_batch(predictions=true_predictions, references=true_labels)

  # average the losses over the number of batches
  running_loss /= len(train_dataloader)
  valid_loss /= len(val_dataloader)
  writer.add_scalar("Loss/val", valid_loss, epoch)

  print('Training loss: {} \t\t Validation Loss: {}'.format(running_loss, valid_loss))

  # metrics over the whole validation set
  final_score = metric.compute()
  print(final_score)
  for name in score_names:
    writer.add_scalar(f"{name}/val", final_score[name], epoch)

  if min_valid_loss > valid_loss:
    print(f'Validation Loss Decreased({min_valid_loss:.6f}--->{valid_loss:.6f})')
    min_valid_loss = valid_loss

  # save the model whenever the validation accuracy improves, otherwise count towards early stopping
  if final_score['overall_accuracy'] > best_accuracy:
    best_accuracy = final_score['overall_accuracy']
    patience_counter = 0
    name_model = f"/content/modelLMv3/model-{best_accuracy:.3f}"

    model.save_pretrained(name_model)
    model.save_pretrained("/content/drive/MyDrive/modelLMv3")
    torch.save(model.state_dict(), "/content/drive/MyDrive/modelLMv3.pt")
  else:
    patience_counter += 1
    if patience_counter >= patience:
      print(f'Validation accuracy did not improve for {patience_counter} epochs. Stopping training.')
      break

In [55]:
train_state = {}

In [56]:
running_loss = 0.
running_acc = 0.
metric = load_metric("seqeval")
model.eval()
for batch_index, batch in enumerate(tqdm(test_dataloader)):
  # no gradients are needed for evaluation
  with torch.no_grad():
    outputs = model(**batch)
  labels = batch['labels']
  logits = outputs.get("logits")
  loss = loss_fct(logits.view(-1, model.config.num_labels), labels.view(-1))
  loss_batch = loss.item()
  # running mean of the batch losses
  running_loss += (loss_batch - running_loss) / (batch_index + 1)
  writer.add_scalar("Loss/test", loss_batch, batch_index)

  predictions = outputs.logits.argmax(dim=2)
  # Remove ignored index (special tokens)
  true_predictions = [
      [id2label[p.item()] for (p, l) in zip(prediction, label) if l != -100]
      for prediction, label in zip(predictions, batch['labels'])
  ]
  true_labels = [
      [id2label[l.item()] for (p, l) in zip(prediction, label) if l != -100]
      for prediction, label in zip(predictions, batch['labels'])
  ]
  final_score = metric.compute(predictions=true_predictions, references=true_labels)
  acc_batch = final_score["overall_accuracy"]
  # running mean of the batch accuracies
  running_acc += (acc_batch - running_acc) / (batch_index + 1)
  writer.add_scalar("overall_precision/test", final_score["overall_precision"], batch_index)
  writer.add_scalar("overall_recall/test", final_score["overall_recall"], batch_index)
  writer.add_scalar("overall_f1/test", final_score["overall_f1"], batch_index)
  writer.add_scalar("overall_accuracy/test", final_score["overall_accuracy"], batch_index)



In [57]:
train_state['test_loss'] = running_loss
train_state['test_acc'] = running_acc
print("Test loss: {:.3f}".format(train_state['test_loss']))
print("Test Accuracy: {:.2f}".format(train_state['test_acc']))

Test loss: 0.148
Test Accuracy: 0.96


## Define TrainingArguments + Trainer

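In a Trainer-based workflow, this is where the `TrainingArguments` would define all hyperparameters related to training. There are many parameters to tweak; check the [docs](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments) for more info. This notebook trains with the manual loop above instead, so the cells below only load the TensorBoard extension and export the logged runs.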

In [None]:
%load_ext tensorboard

In [58]:
!zip -r /content/runs.zip /content/runs/

  adding: content/runs/ (stored 0%)
  adding: content/runs/Mar05_13-04-35_ba6f97062b08/ (stored 0%)
  adding: content/runs/Mar05_13-04-35_ba6f97062b08/events.out.tfevents.1678021475.ba6f97062b08.1442.0 (deflated 68%)


In [59]:
from google.colab import files
files.download("/content/runs.zip")

To train with the Trainer instead of the manual loop, instantiate it with the model and the training arguments, and pass it the datasets, a "default data collator" (which batches the examples using `torch.stack`) and the `compute_metrics` function defined above.

## Inference

You can load the model for inference as follows:

In [None]:
from transformers import AutoModelForTokenClassification

# load the model saved during training and move it to the same device as the inputs
model = AutoModelForTokenClassification.from_pretrained("/content/drive/MyDrive/modelLMv3").to(device)

Let's take an example from the validation split to show inference.

In [None]:
example = train_val_dataset["val"][0]
print(example.keys())

We first prepare it for the model using the processor.

In [None]:
print(example['image'])

In [None]:
image = Image.open(example['image']).convert("RGB")
words = example["tokens"]
boxes = example["bboxes"]
word_labels = example["ner_tags"]
print(len(words))
# truncation keeps the sequence within the model's maximum length
encoding = processor(image, words, boxes=boxes, word_labels=word_labels, truncation=True, return_tensors="pt")
for k,v in encoding.items():
  print(k,v.shape)

Next, we do a forward pass. We use `torch.no_grad()` as we don't require gradient computation.

In [None]:
next(model.parameters()).is_cuda

In [None]:
with torch.no_grad():
  outputs = model(**encoding.to(device))

The model outputs logits of shape (batch_size, seq_len, num_labels).

In [None]:
logits = outputs.logits
logits.shape

We take the highest score for each token, using argmax. This serves as the predicted label for each token.

In [None]:
predictions = logits.argmax(-1).squeeze().tolist()
print(predictions)
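
To illustrate the argmax step without a model, here is a minimal sketch on plain lists. The logits and the four-label mapping are toy values made up for the example, and `argmax` is a small hypothetical helper that mirrors what `logits.argmax(-1)` does for each token.

```python
def argmax(scores):
    """Index of the largest value in a list of scores."""
    return max(range(len(scores)), key=scores.__getitem__)


# hypothetical logits for 3 tokens and 4 labels (seq_len x num_labels)
toy_logits = [
    [2.1, 0.3, -1.0, 0.0],
    [0.2, 1.7, 0.1, -0.5],
    [-0.3, 0.4, 2.2, 0.9],
]
toy_id2label = {0: "NONE", 1: "B-WE_DATE", 2: "I-WE_DATE", 3: "B-WE_ORG"}

# one label id per token, then mapped to its tag name
predicted_ids = [argmax(row) for row in toy_logits]
predicted_tags = [toy_id2label[i] for i in predicted_ids]
print(predicted_ids)
print(predicted_tags)
```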

Let's compare this to the ground truth: note that many labels are -100, as we're only labeling the first subword token of each word.

NOTE: at "true inference" time, you don't have access to labels; see the last section of this notebook for how to use `offset_mapping` in that case.

In [None]:
labels = encoding.labels.squeeze().tolist()
print(labels)

So let's only compare predictions and labels at positions where the label isn't -100. We also want to have the bounding boxes of these (unnormalized):

In [None]:
def unnormalize_box(bbox, width, height):
    # the box is returned unchanged; width and height are accepted but not used
    return [
        bbox[0],
        bbox[1],
        bbox[2],
        bbox[3],
    ]

token_boxes = encoding.bbox.squeeze().tolist()
width, height = image.size

true_predictions = [model.config.id2label[pred] for pred, label in zip(predictions, labels) if label != -100]
true_labels = [model.config.id2label[label] for prediction, label in zip(predictions, labels) if label != -100]
true_boxes = [unnormalize_box(box, width, height) for box, label in zip(token_boxes, labels) if label != -100]
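
The `unnormalize_box` above returns each box as it is, which fits boxes that are already stored in the coordinates used for drawing. LayoutLM-style models conventionally take boxes on a 0-1000 scale, so if the dataset stored normalized boxes, a conversion in both directions would be needed. The sketch below shows that conversion under this assumption; `normalize_box_1000` and `unnormalize_box_1000` are hypothetical names, not functions used elsewhere in this notebook.

```python
def normalize_box_1000(box, width, height):
    """Scale a pixel box (x0, y0, x1, y1) to the 0-1000 range."""
    return [
        int(1000 * box[0] / width),
        int(1000 * box[1] / height),
        int(1000 * box[2] / width),
        int(1000 * box[3] / height),
    ]


def unnormalize_box_1000(box, width, height):
    """Scale a 0-1000 box back to pixel coordinates for drawing."""
    return [
        width * (box[0] / 1000),
        height * (box[1] / 1000),
        width * (box[2] / 1000),
        height * (box[3] / 1000),
    ]


# toy page of 800 x 1000 pixels
pixel_box = [80, 100, 400, 500]
normalized = normalize_box_1000(pixel_box, width=800, height=1000)
restored = unnormalize_box_1000(normalized, width=800, height=1000)
print(normalized)
print(restored)
```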

In [None]:
len(true_predictions)

In [None]:
# entity types of this notebook's tags, without the B-/I- prefix
('NONE', 'WE_JOB_TITLE', 'WE_DATE', 'WE_LOC', 'WE_ORG', 'WE_DESCRIPTION')

In [None]:
from PIL import ImageDraw, ImageFont

draw = ImageDraw.Draw(image)

font = ImageFont.load_default()

def iob_to_label(label):
    # 'NONE' has no prefix and is kept as it is
    if label == 'NONE':
      return label
    else:
      # strip the "B-" / "I-" prefix
      label = label[2:]
      if not label:
        return 'other'
      return label

# one color per entity type (keys are lowercased tags without prefix)
label2color = {'none': 'blue',
               'we_loc': 'black',
               'we_date': 'green',
               'we_job_title': 'orange',
               'we_org': 'red',
               'we_description': 'purple',
               }

for prediction, box in zip(true_predictions, true_boxes):
    predicted_label = iob_to_label(prediction).lower()
    draw.rectangle(box, outline=label2color[predicted_label])
    draw.text((box[0] + 10, box[1] - 10), text=predicted_label, fill=label2color[predicted_label], font=font)

image

Compare this to the ground truth:

In [None]:
# reopen the image to get a clean copy without the predicted boxes
image = Image.open(example['image']).convert("RGB")

draw = ImageDraw.Draw(image)

for word, box, label in zip(example['tokens'], example['bboxes'], example['ner_tags']):
  actual_label = iob_to_label(id2label[label]).lower()
  box = unnormalize_box(box, width, height)
  draw.rectangle(box, outline=label2color[actual_label], width=2)
  draw.text((box[0] + 10, box[1] - 10), actual_label, fill=label2color[actual_label], font=font)

image

## Note: inference when you don't have labels

The code above used the `labels` to determine which tokens were at the start of a particular word or not. Of course, at inference time, you don't have access to any labels. In that case, you can leverage the `offset_mapping` returned by the tokenizer. I do have a notebook for that (for LayoutLMv2, but it's equivalent for LayoutLMv3) [here](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/LayoutLMv2/FUNSD/True_inference_with_LayoutLMv2ForTokenClassification_%2B_Gradio_demo.ipynb).
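
The idea can be sketched on plain lists. This is only an illustration under stated assumptions: the tokens, offsets and predictions below are made-up values, `first_subword_mask` is a hypothetical helper, and the sketch assumes that offsets are reported relative to each word and that special tokens come with the offset (0, 0). Under those assumptions, a token starts a word when its offset begins at 0 and is not empty.

```python
def first_subword_mask(offset_mapping):
    """True where a token starts a word: its offset begins at 0 and is not empty."""
    return [start == 0 and end > 0 for start, end in offset_mapping]


# hypothetical tokens and per-word offsets for the words "PHC", "Software", "Manage"
toy_tokens = ["<s>", "PH", "C", "Software", "Man", "age", "</s>"]
toy_offsets = [(0, 0), (0, 2), (2, 3), (0, 8), (0, 3), (3, 6), (0, 0)]
toy_predictions = [0, 7, 8, 8, 9, 10, 0]  # one predicted label id per token

# keep one prediction per word: the one made on its first subword
mask = first_subword_mask(toy_offsets)
word_predictions = [p for p, keep in zip(toy_predictions, mask) if keep]
print(word_predictions)
```

The same mask can be applied to the token boxes, so that each word is drawn once with the label predicted for its first subword.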