<a href="https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/LiLT/Fine_tune_LiLT_on_a_custom_dataset%2C_in_any_language.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Set-up environment

In [1]:
!pip install -q transformers datasets


In [26]:
!pip install -q evaluate seqeval

## Load dataset

In [2]:
from datasets import load_dataset

dataset = load_dataset("nielsr/funsd-iob-original")


In [3]:
label_list = dataset["train"].features["ner_tags"].feature.names
id2label = {id:label for id, label in enumerate(label_list)}
print(id2label)

{0: 'O', 1: 'B-HEADER', 2: 'I-HEADER', 3: 'B-QUESTION', 4: 'I-QUESTION', 5: 'B-ANSWER', 6: 'I-ANSWER'}
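
The reverse mapping is often handy too, for example to pass `label2id` alongside `id2label` when loading a model so the saved config carries both. A minimal sketch, with the label list hard-coded from the output above so it runs without downloading the dataset:

```python
# Label names copied from the printed id2label above.
label_list = ["O", "B-HEADER", "I-HEADER", "B-QUESTION", "I-QUESTION", "B-ANSWER", "I-ANSWER"]

# id -> label, same as in the notebook cell
id2label = dict(enumerate(label_list))

# label -> id, the inverse mapping
label2id = {label: idx for idx, label in id2label.items()}

print(label2id)
```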


## Create PyTorch Dataset

In [4]:
dataset["train"].features

{'id': Value(dtype='string', id=None),
 'words': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None),
 'bboxes': Sequence(feature=Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None), length=-1, id=None),
 'original_bboxes': Sequence(feature=Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None), length=-1, id=None),
 'ner_tags': Sequence(feature=ClassLabel(names=['O', 'B-HEADER', 'I-HEADER', 'B-QUESTION', 'I-QUESTION', 'B-ANSWER', 'I-ANSWER'], id=None), length=-1, id=None),
 'image': Image(decode=True, id=None)}

In [6]:
from torch.utils.data import Dataset
from PIL import Image
import torch

def normalize_bbox(bbox, width, height):
    return [
        int(1000 * (bbox[0] / width)),
        int(1000 * (bbox[1] / height)),
        int(1000 * (bbox[2] / width)),
        int(1000 * (bbox[3] / height)),
    ]

class CustomDataset(Dataset):
  def __init__(self, dataset, tokenizer):
    self.dataset = dataset
    self.tokenizer = tokenizer

  def __len__(self):
    return len(self.dataset)

  def __getitem__(self, idx):
    # get item
    example = self.dataset[idx]
    image = example["image"]
    words = example["words"]
    boxes = example["original_bboxes"]
    ner_tags = example["ner_tags"]

    # prepare for the model
    width, height = image.size

    bbox = []
    labels = []
    for word, box, label in zip(words, boxes, ner_tags):
        box = normalize_bbox(box, width, height)
        n_word_tokens = len(self.tokenizer.tokenize(word))
        bbox.extend([box] * n_word_tokens)
        labels.extend([label] + ([-100] * (n_word_tokens - 1)))

    cls_box = sep_box = [0, 0, 0, 0]
    bbox = [cls_box] + bbox + [sep_box]
    labels = [-100] + labels + [-100]

    encoding = self.tokenizer(" ".join(words), truncation=True, max_length=512)
    sequence_length = len(encoding.input_ids)
    # truncate boxes and labels based on length of input ids
    labels = labels[:sequence_length]
    bbox = bbox[:sequence_length] 

    encoding["bbox"] = bbox
    encoding["labels"] = labels
    
    return encoding
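
To make the box normalization concrete, here is a small worked example of `normalize_bbox` on the first word of the first training document. The page size of 762 x 1000 pixels is an assumption (the notebook never prints `image.size`); it was chosen because it is consistent with the normalized x-coordinates printed further down.

```python
def normalize_bbox(bbox, width, height):
    """Scale pixel coordinates (x0, y0, x1, y1) to the 0-1000 range LiLT expects."""
    return [
        int(1000 * (bbox[0] / width)),
        int(1000 * (bbox[1] / height)),
        int(1000 * (bbox[2] / width)),
        int(1000 * (bbox[3] / height)),
    ]

# Assumed page size; not stated anywhere in the notebook.
width, height = 762, 1000

# Original pixel box of the first word ("R&D") in the first training example.
original_box = [292, 91, 376, 175]
normalized = normalize_bbox(original_box, width, height)

# x-coordinates are stretched from a 762-pixel-wide page onto the 0-1000 scale.
print(normalized)
```

Because every page is mapped onto the same 0-1000 grid, boxes from documents with different resolutions become comparable.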

In [7]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("nielsr/lilt-xlm-roberta-base")


In [8]:
train_dataset = CustomDataset(dataset["train"], tokenizer)
eval_dataset = CustomDataset(dataset["test"], tokenizer)

In [9]:
example = train_dataset[0]

In [10]:
tokenizer.decode(example["input_ids"])

'<s> R&D : Suggestion: Date: Licensee Yes No 597005708 R&D QUALITY IMPROVEMENT SUGGESTION/ SOLUTION FORM Name / Phone Ext. : M. Hamann P. Harper, P. Martinez 9/ 3/ 92 R&D Group: J. S. Wigand Supervisor / Manager Discontinue coal retention analyses on licensee submitted product samples (Note : Coal Retention testing is not performed by most licensees. Other B&W physical measurements as ends stability and inspection for soft spots in ciparettes are thought to be sufficient measures to assure cigarette physical integrity. The proposed action will increase laboratory productivity. ) Suggested Solutions (s) : Delete coal retention from the list of standard analyses performed on licensee submitted product samples. Special requests for coal retention testing could still be submitted on an exception basis. Have you contacted your Manager/ Supervisor? Manager Comments: Manager, please contact suggester and forward comments to the Quality Council. qip. wp</s>'

In [11]:
for k,v in example.items():
  print(k,len(v))

input_ids 244
attention_mask 244
bbox 244
labels 244


In [12]:
for word, box, label in zip(dataset["train"][0]["words"], dataset["train"][0]["original_bboxes"], dataset["train"][0]["ner_tags"]):
  print(word, box, id2label[label])

R&D [292, 91, 376, 175] O
: [219, 316, 225, 327] B-QUESTION
Suggestion: [95, 355, 169, 370] B-QUESTION
Date: [482, 268, 518, 282] B-QUESTION
Licensee [511, 309, 570, 323] B-ANSWER
Yes [461, 605, 483, 619] B-QUESTION
No [545, 603, 563, 617] B-QUESTION
597005708 [525, 904, 641, 926] O
R&D [257, 203, 279, 214] B-HEADER
QUALITY [285, 203, 334, 216] I-HEADER
IMPROVEMENT [341, 201, 418, 211] I-HEADER
SUGGESTION/ [256, 215, 324, 229] I-HEADER
SOLUTION [331, 214, 387, 228] I-HEADER
FORM [395, 215, 423, 228] I-HEADER
Name [89, 274, 118, 289] B-QUESTION
/ [117, 274, 127, 288] I-QUESTION
Phone [128, 274, 163, 289] I-QUESTION
Ext. [169, 272, 196, 287] I-QUESTION
: [196, 274, 204, 288] I-QUESTION
M. [215, 272, 230, 287] B-ANSWER
Hamann [237, 272, 287, 286] I-ANSWER
P. [293, 272, 307, 286] I-ANSWER
Harper, [314, 274, 363, 285] I-ANSWER
P. [370, 272, 384, 285] I-ANSWER
Martinez [390, 271, 451, 282] I-ANSWER
9/ [543, 265, 560, 279] B-ANSWER
3/ [560, 264, 575, 279] I-ANSWER
92 [575, 264, 590, 279] I-AN

In [13]:
len(example["input_ids"])

244

In [14]:
for id, box, label in zip(example["input_ids"], example["bbox"], example["labels"]):
  if label != -100:
    print(tokenizer.decode([id]), box, id2label[label])
  else:
    print(tokenizer.decode([id]), box, -100)

<s> [0, 0, 0, 0] -100
R [383, 91, 493, 175] O
& [383, 91, 493, 175] -100
D [383, 91, 493, 175] -100
: [287, 316, 295, 327] B-QUESTION
Sug [124, 355, 221, 370] B-QUESTION
gestion [124, 355, 221, 370] -100
: [124, 355, 221, 370] -100
Date [632, 268, 679, 282] B-QUESTION
: [632, 268, 679, 282] -100
License [670, 309, 748, 323] B-ANSWER
e [670, 309, 748, 323] -100
Yes [604, 605, 633, 619] B-QUESTION
No [715, 603, 738, 617] B-QUESTION
 [688, 904, 841, 926] O
597 [688, 904, 841, 926] -100
005 [688, 904, 841, 926] -100
708 [688, 904, 841, 926] -100
R [337, 203, 366, 214] B-HEADER
& [337, 203, 366, 214] -100
D [337, 203, 366, 214] -100
QUAL [374, 203, 438, 216] I-HEADER
ITY [374, 203, 438, 216] -100
IMP [447, 201, 548, 211] I-HEADER
ROV [447, 201, 548, 211] -100
EMENT [447, 201, 548, 211] -100
S [335, 215, 425, 229] I-HEADER
UG [335, 215, 425, 229] -100
GES [335, 215, 425, 229] -100
TION [335, 215, 425, 229] -100
/ [335, 215, 425, 229] -100
S [434, 214, 507, 228] I-HEADER
OLU [434, 214, 507, 2
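
The output above shows the alignment rule used in `__getitem__`: every sub-word token inherits the box of its word, but only the first token of a word keeps the label, while the rest get `-100` so the loss ignores them. The sketch below isolates that rule. `toy_tokenize` is a hypothetical stand-in that cuts words into fixed-size pieces; it is not how the XLM-RoBERTa tokenizer splits text.

```python
def toy_tokenize(word, piece_len=3):
    """Hypothetical sub-word tokenizer: fixed-size character chunks."""
    return [word[i:i + piece_len] for i in range(0, len(word), piece_len)]

def align(words, boxes, tags, ignore_index=-100):
    """Repeat each word's box per sub-word; label only the first sub-word."""
    token_boxes, token_labels = [], []
    for word, box, tag in zip(words, boxes, tags):
        n_tokens = len(toy_tokenize(word))
        if n_tokens == 0:
            continue  # nothing to align for an empty word
        token_boxes.extend([box] * n_tokens)
        token_labels.extend([tag] + [ignore_index] * (n_tokens - 1))
    return token_boxes, token_labels

words = ["Date:", "Licensee"]
boxes = [[632, 268, 679, 282], [670, 309, 748, 323]]
tags = [3, 5]  # B-QUESTION, B-ANSWER

token_boxes, token_labels = align(words, boxes, tags)
print(token_labels)
print(token_boxes)
```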

## Define PyTorch DataLoader

In [15]:
from torch.utils.data import DataLoader

def collate_fn(features):
  boxes = [feature["bbox"] for feature in features]
  labels = [feature["labels"] for feature in features]
  # use tokenizer to pad input_ids
  batch = tokenizer.pad(features, padding="max_length", max_length=512)

  sequence_length = torch.tensor(batch["input_ids"]).shape[1]
  batch["labels"] = [labels_example + [-100] * (sequence_length - len(labels_example)) for labels_example in labels]
  batch["bbox"] = [boxes_example + [[0, 0, 0, 0]] * (sequence_length - len(boxes_example)) for boxes_example in boxes]

  # convert to PyTorch
  # batch = {k: torch.tensor(v, dtype=torch.int64) if isinstance(v[0], list) else v for k, v in batch.items()}
  batch = {k: torch.tensor(v) for k, v in batch.items()}

  return batch

train_dataloader = DataLoader(train_dataset, batch_size=2, shuffle=True, collate_fn=collate_fn)
eval_dataloader = DataLoader(eval_dataset, batch_size=2, shuffle=False, collate_fn=collate_fn)

In [16]:
batch = next(iter(train_dataloader))

You're using a XLMRobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


In [17]:
for k,v in batch.items():
  print(k,v.shape)

input_ids torch.Size([2, 512])
attention_mask torch.Size([2, 512])
bbox torch.Size([2, 512, 4])
labels torch.Size([2, 512])
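
`tokenizer.pad` only knows about `input_ids` and `attention_mask`, which is why `collate_fn` pads `labels` and `bbox` by hand. A minimal pure-Python sketch of that manual padding step, using a short `max_length` of 6 instead of 512 so the result is easy to read (`pad_labels_and_boxes` is a name made up for this illustration):

```python
def pad_labels_and_boxes(labels, boxes, max_length, ignore_index=-100):
    """Right-pad per-example label and box lists to max_length."""
    padded_labels = [seq + [ignore_index] * (max_length - len(seq)) for seq in labels]
    padded_boxes = [seq + [[0, 0, 0, 0]] * (max_length - len(seq)) for seq in boxes]
    return padded_labels, padded_boxes

# Two toy examples of different lengths (3 and 4 tokens).
labels = [[-100, 3, -100], [-100, 5, -100, -100]]
boxes = [
    [[0, 0, 0, 0], [632, 268, 679, 282], [0, 0, 0, 0]],
    [[0, 0, 0, 0], [670, 309, 748, 323], [670, 309, 748, 323], [0, 0, 0, 0]],
]

padded_labels, padded_boxes = pad_labels_and_boxes(labels, boxes, max_length=6)
print([len(seq) for seq in padded_labels])
print(padded_labels[0])
```

Padding labels with `-100` keeps the padded positions out of the loss, and the all-zero box matches the one used for the special tokens.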


In [18]:
tokenizer.decode(batch["input_ids"][0])

"<s> Address City State Zip BRAND 1982 88057519 New SATIN 100's Direct Account Please Ship To Delivery Date No. OF CARTONS Satin Filter 100's Satin Menthol 100's Signature of Hetail Purchaser Order Taken By Sales Representative DIRECT ACCOUNT</s><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad

In [19]:
for id, box, label in zip(batch["input_ids"][0], batch["bbox"][0], batch["labels"][0]):
  if label.item() != -100:
    print(tokenizer.decode([id]), box, id2label[label.item()])
  else:
    print(tokenizer.decode([id]), box, label.item())

<s> tensor([0, 0, 0, 0]) -100
Address tensor([508, 483, 568, 496]) B-QUESTION
City tensor([510, 507, 539, 518]) B-QUESTION
State tensor([508, 529, 548, 540]) B-QUESTION
Zip tensor([846, 528, 870, 542]) B-QUESTION
BR tensor([796, 584, 877, 598]) B-QUESTION
AND tensor([796, 584, 877, 598]) -100
1982 tensor([911, 409, 942, 422]) O
8 tensor([902, 747, 925, 832]) O
805 tensor([902, 747, 925, 832]) -100
75 tensor([902, 747, 925, 832]) -100
19 tensor([902, 747, 925, 832]) -100
New tensor([764, 270, 879, 304]) B-HEADER
SA tensor([710, 306, 868, 351]) I-HEADER
TIN tensor([710, 306, 868, 351]) -100
100 tensor([871, 313, 939, 348]) I-HEADER
' tensor([871, 313, 939, 348]) -100
s tensor([871, 313, 939, 348]) -100
Direct tensor([510, 436, 552, 447]) B-QUESTION
Account tensor([552, 434, 612, 447]) I-QUESTION
Please tensor([508, 458, 561, 472]) B-QUESTION
Shi tensor([562, 455, 590, 472]) I-QUESTION
p tensor([562, 455, 590, 472]) -100
To tensor([594, 459, 613, 470]) I-QUESTION
Delivery tensor([507, 554

## Define model

In [20]:
from transformers import LiltForTokenClassification

model = LiltForTokenClassification.from_pretrained("nielsr/lilt-xlm-roberta-base", id2label=id2label)


Some weights of LiltForTokenClassification were not initialized from the model checkpoint at nielsr/lilt-xlm-roberta-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Train the model in native PyTorch

To train the model in native PyTorch instead of with the 🤗 Trainer, uncomment the code below.

In [21]:
# device = "cuda" if torch.cuda.is_available() else "cpu"
# model.to(device)

# optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# model.train()
# for epoch in range(2):
#   for batch in train_dataloader:
#       # zero the parameter gradients
#       optimizer.zero_grad()

#       inputs = {k:v.to(device) for k,v in batch.items()}

#       outputs = model(**inputs)

#       loss = outputs.loss
#       loss.backward()

#       optimizer.step()

## Train the model using 🤗 Trainer

We first define a `compute_metrics` function and the `TrainingArguments`.

In [27]:
import evaluate

metric = evaluate.load("seqeval")

In [28]:
import numpy as np
from seqeval.metrics import classification_report

return_entity_level_metrics = False

def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)

    # Remove ignored index (special tokens)
    true_predictions = [
        [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]

    results = metric.compute(predictions=true_predictions, references=true_labels)
    if return_entity_level_metrics:
        # Unpack nested dictionaries
        final_results = {}
        for key, value in results.items():
            if isinstance(value, dict):
                for n, v in value.items():
                    final_results[f"{key}_{n}"] = v
            else:
                final_results[key] = value
        return final_results
    else:
        return {
            "precision": results["overall_precision"],
            "recall": results["overall_recall"],
            "f1": results["overall_f1"],
            "accuracy": results["overall_accuracy"],
        }
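
The key step in `compute_metrics` is dropping every position whose label is `-100` (special tokens, padding, and non-first sub-words) before handing label strings to seqeval. A dependency-free sketch of just that filtering step on toy logits; `strip_ignored` and `argmax` are helper names invented for this illustration:

```python
def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=row.__getitem__)

def strip_ignored(logits, labels, label_list, ignore_index=-100):
    """Convert logits and label ids to label strings, skipping ignored positions."""
    preds, refs = [], []
    for seq_logits, seq_labels in zip(logits, labels):
        seq_preds = [argmax(token_logits) for token_logits in seq_logits]
        preds.append([label_list[p] for p, l in zip(seq_preds, seq_labels) if l != ignore_index])
        refs.append([label_list[l] for l in seq_labels if l != ignore_index])
    return preds, refs

label_list = ["O", "B-QUESTION", "I-QUESTION"]
logits = [[
    [0.1, 0.2, 0.7],  # special token, ignored
    [0.1, 0.8, 0.1],  # scored position, argmax is B-QUESTION
    [0.6, 0.3, 0.1],  # continuation sub-word, ignored
    [0.2, 0.1, 0.7],  # scored position, argmax is I-QUESTION
]]
labels = [[-100, 1, -100, 2]]

preds, refs = strip_ignored(logits, labels, label_list)
print(preds)
print(refs)
```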

In [33]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(output_dir="test",
                                  num_train_epochs=30,
                                  learning_rate=5e-5,
                                  evaluation_strategy="steps",
                                  eval_steps=100,
                                  load_best_model_at_end=True,
                                  metric_for_best_model="f1")

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


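Next we define a custom Trainer that returns the DataLoaders we created above. As a consequence, the batch size of 2 set on those DataLoaders is what training actually uses, not the per-device batch size from `TrainingArguments`.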

In [34]:
from transformers.data.data_collator import default_data_collator

class CustomTrainer(Trainer):
  def get_train_dataloader(self):
    return train_dataloader

  def get_eval_dataloader(self, eval_dataset = None):
    return eval_dataloader

# Initialize our Trainer
trainer = CustomTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

In [None]:
trainer.train()

***** Running training *****
  Num examples = 149
  Num Epochs = 30
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 2250
  Number of trainable parameters = 283567815
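
The log reports a per-device batch size of 8, yet the total of 2250 optimization steps only adds up with the batch size of 2 set on the custom DataLoader. A quick check of the arithmetic, assuming the DataLoader keeps the final incomplete batch (PyTorch's default `drop_last=False`):

```python
import math

num_examples = 149          # "Num examples" from the log
dataloader_batch_size = 2   # batch_size passed to the training DataLoader
num_epochs = 30             # num_train_epochs in TrainingArguments

# The last batch of each epoch holds a single example, hence the ceil.
steps_per_epoch = math.ceil(num_examples / dataloader_batch_size)
total_steps = steps_per_epoch * num_epochs

print(steps_per_epoch, total_steps)
```

With a batch size of 8 the same calculation would give 19 steps per epoch and 570 steps in total, which does not match the log.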


| Step | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|
| 100 | No log | 1.368461 | 0.629163 | 0.712175 | 0.6681 | 0.741463 |
| 200 | No log | 1.497292 | 0.68498 | 0.692308 | 0.688624 | 0.719545 |
| 300 | No log | 1.585323 | 0.684028 | 0.702496 | 0.693139 | 0.726447 |
| 400 | No log | 1.692035 | 0.657696 | 0.709628 | 0.682676 | 0.734682 |
| 500 | 0.149500 | 1.637528 | 0.689837 | 0.733062 | 0.710793 | 0.740131 |
| 600 | 0.149500 | 1.672109 | 0.675588 | 0.745797 | 0.708959 | 0.715549 |
| 700 | 0.149500 | 1.877164 | 0.683581 | 0.727458 | 0.704837 | 0.740857 |
| 800 | 0.149500 | 2.084933 | 0.659132 | 0.742741 | 0.698443 | 0.724994 |
| 900 | 0.149500 | 2.121748 | 0.731194 | 0.7081 | 0.719462 | 0.740373 |
| 1000 | 0.047300 | 2.14017 | 0.683736 | 0.73459 | 0.708251 | 0.734197 |
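
Since `metric_for_best_model="f1"` is set, it is worth checking which evaluation step scored best in the results above. The sketch below does this with the F1 values copied from the results. It also looks at steps that are multiples of 500 only, on the assumption (suggested by `checkpoint-500` in the log, not confirmed by the notebook) that checkpoints are saved every 500 steps and that only saved checkpoints can be reloaded.

```python
# (step, F1) pairs copied from the evaluation results above.
f1_by_step = [
    (100, 0.6681), (200, 0.688624), (300, 0.693139), (400, 0.682676),
    (500, 0.710793), (600, 0.708959), (700, 0.704837), (800, 0.698443),
    (900, 0.719462), (1000, 0.708251),
]

# Best F1 across every evaluation.
best_step, best_f1 = max(f1_by_step, key=lambda pair: pair[1])

# Best F1 among steps assumed to have a saved checkpoint.
assumed_save_interval = 500  # assumption, inferred from "checkpoint-500"
saved = [pair for pair in f1_by_step if pair[0] % assumed_save_interval == 0]
best_saved_step, best_saved_f1 = max(saved, key=lambda pair: pair[1])

print(best_step, best_f1)
print(best_saved_step, best_saved_f1)
```

Validation loss rises steadily while F1 stays roughly flat, so evaluating and saving on the same schedule would make it easier to keep the strongest checkpoint.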


***** Running Evaluation *****
  Num examples = 50
  Batch size = 8
Saving model checkpoint to test/checkpoint-500
Configuration saved in test/checkpoint-500/config.json
Model weights saved in test/checkpoint-500/pytorch_model.bin
tokenizer config file saved in test/checkpoint-500/tokenizer_config.json
Special tokens file saved in test/checkpoint-500/special_tokens_map.json

## Inference

For inference, see my LayoutLM notebooks; the procedure is equivalent.
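
As a rough idea of the post-processing involved, the sketch below maps predicted label ids back to names and scales the 0-1000 boxes back to pixel coordinates. It is an illustration under assumptions, not the code from those notebooks: `unnormalize_box` and `decode_predictions` are hypothetical helpers, the predicted ids are made up, and the page size is assumed.

```python
id2label = {0: "O", 1: "B-HEADER", 2: "I-HEADER", 3: "B-QUESTION",
            4: "I-QUESTION", 5: "B-ANSWER", 6: "I-ANSWER"}

def unnormalize_box(box, width, height):
    """Scale a 0-1000 box back to pixel coordinates."""
    return [
        width * (box[0] / 1000),
        height * (box[1] / 1000),
        width * (box[2] / 1000),
        height * (box[3] / 1000),
    ]

def decode_predictions(pred_ids, boxes, width, height):
    """Pair each predicted label with its pixel box, skipping all-zero boxes."""
    results = []
    for pred_id, box in zip(pred_ids, boxes):
        if box == [0, 0, 0, 0]:
            continue  # special or padding token
        results.append((id2label[pred_id], unnormalize_box(box, width, height)))
    return results

# Made-up predictions for three tokens: <s>, one word, </s>.
pred_ids = [0, 3, 0]
boxes = [[0, 0, 0, 0], [500, 250, 1000, 1000], [0, 0, 0, 0]]

decoded = decode_predictions(pred_ids, boxes, width=762, height=1000)
print(decoded)
```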