<a href="https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/LiLT/%5BHuggingFace_Trainer%5D_Fine_tune_LiltForTokenClassification_on_FUNSD_(nielsr_funsd_layoutlmv3).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Set-up environment

First, we install 🤗 Transformers and Datasets.

## Load dataset

Let's load the FUNSD dataset from the hub. Note that this version of the dataset stores segment-level positions (all words of a text segment share one bounding box), which boosts performance (as shown in [this paper](https://arxiv.org/abs/2105.11210)).
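
To make "segment positions" concrete: instead of giving every word its own box, all words of a segment share the box that encloses the whole segment. The sketch below uses made-up word boxes; the coordinates and the `segment_box` helper are illustrative assumptions, not part of the dataset's loading code.

```python
from typing import List

Box = List[int]  # [x0, y0, x1, y1]

def segment_box(word_boxes: List[Box]) -> Box:
    """Return the smallest box that encloses all word boxes of one segment."""
    x0 = min(b[0] for b in word_boxes)
    y0 = min(b[1] for b in word_boxes)
    x1 = max(b[2] for b in word_boxes)
    y1 = max(b[3] for b in word_boxes)
    return [x0, y0, x1, y1]

# Hypothetical word-level boxes for a three-word segment.
words = ["QUALITY", "IMPROVEMENT", "FORM"]
word_boxes = [[335, 201, 400, 229], [405, 203, 500, 228], [505, 202, 555, 229]]

# Every word in the segment is assigned the same enclosing box.
shared = segment_box(word_boxes)
segment_level_boxes = [shared for _ in words]
print(segment_level_boxes)
```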

In [2]:
from datasets import load_dataset

dataset = load_dataset("nielsr/funsd-layoutlmv3")

Found cached dataset funsd-layoutlmv3 (/home/hj36wegi/scratch/data/huggingface/datasets/nielsr___funsd-layoutlmv3/funsd/1.0.0/0e3f4efdfd59aa1c3b4952c517894f7b1fc4d75c12ef01bcc8626a69e41c1bb9)



In [3]:
dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'bboxes', 'ner_tags', 'image'],
        num_rows: 149
    })
    test: Dataset({
        features: ['id', 'tokens', 'bboxes', 'ner_tags', 'image'],
        num_rows: 50
    })
})

In [4]:
dataset["train"].features

{'id': Value(dtype='string', id=None),
 'tokens': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None),
 'bboxes': Sequence(feature=Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None), length=-1, id=None),
 'ner_tags': Sequence(feature=ClassLabel(names=['O', 'B-HEADER', 'I-HEADER', 'B-QUESTION', 'I-QUESTION', 'B-ANSWER', 'I-ANSWER'], id=None), length=-1, id=None),
 'image': Image(decode=True, id=None)}

We'll create id2label and label2id mappings here, useful for inference.

In [5]:
label_list = dataset["train"].features['ner_tags'].feature.names
id2label = {id:label for id, label in enumerate(label_list)}
label2id = {label:id for id, label in enumerate(label_list)}
print(id2label)

{0: 'O', 1: 'B-HEADER', 2: 'I-HEADER', 3: 'B-QUESTION', 4: 'I-QUESTION', 5: 'B-ANSWER', 6: 'I-ANSWER'}
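
As a rough illustration of why these mappings are useful at inference time, the sketch below turns a list of predicted class ids back into tag strings. The `predicted_ids` values are made up for the example.

```python
label_list = ["O", "B-HEADER", "I-HEADER", "B-QUESTION", "I-QUESTION", "B-ANSWER", "I-ANSWER"]
id2label = dict(enumerate(label_list))
label2id = {label: idx for idx, label in id2label.items()}

# Hypothetical model output: one class id per word.
predicted_ids = [3, 4, 5, 6, 0]
predicted_tags = [id2label[i] for i in predicted_ids]
print(predicted_tags)

# The two mappings are inverses of each other.
assert [label2id[t] for t in predicted_tags] == predicted_ids
```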


In [6]:
example = dataset["train"][0]
print(example["tokens"])
print(example["bboxes"])
print(example["ner_tags"])

['R&D', ':', 'Suggestion:', 'Date:', 'Licensee', 'Yes', 'No', '597005708', 'R&D', 'QUALITY', 'IMPROVEMENT', 'SUGGESTION/', 'SOLUTION', 'FORM', 'Name', '/', 'Phone', 'Ext.', ':', 'M.', 'Hamann', 'P.', 'Harper,', 'P.', 'Martinez', '9/', '3/', '92', 'R&D', 'Group:', 'J.', 'S.', 'Wigand', 'Supervisor', '/', 'Manager', 'Discontinue', 'coal', 'retention', 'analyses', 'on', 'licensee', 'submitted', 'product', 'samples', '(Note', ':', 'Coal', 'Retention', 'testing', 'is', 'not', 'performed', 'by', 'most', 'licensees.', 'Other', 'B&W', 'physical', 'measurements', 'as', 'ends', 'stability', 'and', 'inspection', 'for', 'soft', 'spots', 'in', 'ciparettes', 'are', 'thought', 'to', 'be', 'sufficient', 'measures', 'to', 'assure', 'cigarette', 'physical', 'integrity.', 'The', 'proposed', 'action', 'will', 'increase', 'laboratory', 'productivity', '.', ')', 'Suggested', 'Solutions', '(s)', ':', 'Delete', 'coal', 'retention', 'from', 'the', 'list', 'of', 'standard', 'analyses', 'performed', 'on', 'licen

## Transform dataset

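Next, we'll prepare the data for the model with the tokenizer that ships with the checkpoint we fine-tune, `SCUT-DLVCLab/lilt-infoxlm-base`. The commented-out lines show alternatives: LayoutLMv3's tokenizer, for instance, has the same vocabulary as roberta-base, so it can be used when LiLT is combined with RoBERTa-base.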

In [9]:
from transformers import AutoTokenizer

# tokenizer = AutoTokenizer.from_pretrained("microsoft/layoutlmv3-base")
# tokenizer = AutoTokenizer.from_pretrained("nielsr/lilt-xlm-roberta-base")
tokenizer = AutoTokenizer.from_pretrained("SCUT-DLVCLab/lilt-infoxlm-base")

We'll use [set_transform](https://huggingface.co/docs/datasets/process#format-transform) here to do the preprocessing on-the-fly.

In [10]:
def prepare_examples(batch):
  encoding = tokenizer(batch["tokens"],
                        boxes=batch["bboxes"],
                        word_labels=batch["ner_tags"],
                        padding="max_length",
                        max_length=512,
                        truncation=True,
                        return_tensors="pt")
  
  return encoding

dataset.set_transform(prepare_examples)
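
Conceptually, an on-the-fly transform is a function that runs when rows are read rather than when the dataset is built. The class below is a toy stand-in to show the idea; `LazyDataset` is a hypothetical name, it works row by row rather than on batches, and it is not how 🤗 Datasets is implemented.

```python
class LazyDataset:
    """Toy dataset that applies a transform each time an item is accessed."""

    def __init__(self, rows):
        self.rows = rows
        self.transform = None
        self.calls = 0  # counts how often the transform ran

    def set_transform(self, fn):
        # Only stores the function; no row is processed here.
        self.transform = fn

    def __getitem__(self, idx):
        row = self.rows[idx]
        if self.transform is None:
            return row
        self.calls += 1
        return self.transform(row)

def upper_tokens(row):
    return {"tokens": [t.upper() for t in row["tokens"]]}

toy = LazyDataset([{"tokens": ["date:", "9/3/92"]}])
toy.set_transform(upper_tokens)
print(toy.calls)  # transform has not run yet
print(toy[0])     # transform runs now, at access time
```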

Let's verify an example:

In [11]:
example = dataset["train"][0]
print(example.keys())

dict_keys(['input_ids', 'attention_mask', 'bbox', 'labels'])


In [12]:
tokenizer.decode(example["input_ids"])

'<s> R&D : Suggestion: Date: Licensee Yes No 597005708 R&D QUALITY IMPROVEMENT SUGGESTION/ SOLUTION FORM Name / Phone Ext. : M. Hamann P. Harper, P. Martinez 9/ 3/ 92 R&D Group: J. S. Wigand Supervisor / Manager Discontinue coal retention analyses on licensee submitted product samples (Note : Coal Retention testing is not performed by most licensees. Other B&W physical measurements as ends stability and inspection for soft spots in ciparettes are thought to be sufficient measures to assure cigarette physical integrity. The proposed action will increase laboratory productivity. ) Suggested Solutions (s) : Delete coal retention from the list of standard analyses performed on licensee submitted product samples. Special requests for coal retention testing could still be submitted on an exception basis. Have you contacted your Manager/ Supervisor? Manager Comments: Manager, please contact suggester and forward comments to the Quality Council. qip. wp</s><pad><pad><pad><pad><pad><pad><pad><p

In [11]:
for id, box, label in zip(example["input_ids"].tolist(),
                          example["bbox"].tolist(),
                          example["labels"].tolist()):
  if label != -100:
    print(tokenizer.decode([id]), box, id2label[label])
  else:
    print(tokenizer.decode([id]), box, label)

<s> [0, 0, 0, 0] -100
 R [383, 91, 493, 175] O
& [383, 91, 493, 175] -100
D [383, 91, 493, 175] -100
 : [287, 316, 295, 327] B-QUESTION
 Suggest [124, 355, 221, 370] B-QUESTION
ion [124, 355, 221, 370] -100
: [124, 355, 221, 370] -100
 Date [632, 268, 679, 282] B-QUESTION
: [632, 268, 679, 282] -100
 License [670, 309, 748, 323] B-ANSWER
e [670, 309, 748, 323] -100
 Yes [604, 605, 633, 619] B-QUESTION
 No [715, 603, 738, 617] B-QUESTION
 5 [688, 904, 841, 926] O
97 [688, 904, 841, 926] -100
005 [688, 904, 841, 926] -100
708 [688, 904, 841, 926] -100
 R [335, 201, 555, 229] B-HEADER
& [335, 201, 555, 229] -100
D [335, 201, 555, 229] -100
 QU [335, 201, 555, 229] I-HEADER
AL [335, 201, 555, 229] -100
ITY [335, 201, 555, 229] -100
 IM [335, 201, 555, 229] I-HEADER
PROV [335, 201, 555, 229] -100
EMENT [335, 201, 555, 229] -100
 S [335, 201, 555, 229] I-HEADER
UG [335, 201, 555, 229] -100
G [335, 201, 555, 229] -100
EST [335, 201, 555, 229] -100
ION [335, 201, 555, 229] -100
/ [335, 201, 55
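
The output above shows the labelling convention: the first sub-token of each word carries the word's label, while the remaining sub-tokens and the special tokens get -100 so that the loss skips them. A minimal sketch of that alignment, using a hand-written sub-token split rather than a real tokenizer (the `word_pieces` lists and the `align_labels` helper are assumptions for illustration):

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss

def align_labels(word_pieces, word_labels):
    """Label the first piece of each word; mask the rest and the special tokens."""
    labels = [IGNORE_INDEX]  # <s>
    for pieces, label in zip(word_pieces, word_labels):
        labels.append(label)
        labels.extend([IGNORE_INDEX] * (len(pieces) - 1))
    labels.append(IGNORE_INDEX)  # </s>
    return labels

# Hand-written splits mirroring the first three words printed above.
word_pieces = [["R", "&", "D"], [":"], ["Suggest", "ion", ":"]]
word_labels = [0, 3, 3]  # O, B-QUESTION, B-QUESTION

aligned = align_labels(word_pieces, word_labels)
print(aligned)
```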

## Load model

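Here we load LiLT from the hub. The active line uses the `SCUT-DLVCLab/lilt-infoxlm-base` checkpoint; the commented-out line loads LiLT combined with RoBERTa-base instead.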

In [13]:
from transformers import AutoModelForTokenClassification

# model = AutoModelForTokenClassification.from_pretrained("SCUT-DLVCLab/lilt-roberta-en-base", id2label=id2label, label2id=label2id)
model = AutoModelForTokenClassification.from_pretrained("SCUT-DLVCLab/lilt-infoxlm-base", id2label=id2label, label2id=label2id)

Some weights of LiltForTokenClassification were not initialized from the model checkpoint at SCUT-DLVCLab/lilt-infoxlm-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Train!

Next, we'll train the model using 🤗 Trainer. Check the docs [here](https://huggingface.co/docs/transformers/main_classes/trainer) for all details.

We define a compute_metrics function which will be used to compute metrics like F1 score, precision and recall (per entity type) on the evaluation set during training. We use 🤗 [Evaluate](https://huggingface.co/docs/evaluate/index) for that.
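
seqeval scores whole entities rather than single tokens: a predicted entity only counts if both its type and its span match the reference. The sketch below is a simplified re-implementation for illustration only; it assumes well-formed BIO tags and is not seqeval's actual code.

```python
def extract_entities(tags):
    """Return a set of (type, start, end) spans from a BIO tag sequence."""
    entities, start, current = set(), None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing entity
        inside = tag.startswith("I-") and current == tag[2:]
        if not inside:
            if current is not None:
                entities.add((current, start, i))
            if tag.startswith("B-"):
                start, current = i, tag[2:]
            else:
                start, current = None, None
    return entities

def entity_scores(reference, prediction):
    """Entity-level precision, recall and F1 for one sequence."""
    ref, pred = extract_entities(reference), extract_entities(prediction)
    correct = len(ref & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1

reference = ["B-QUESTION", "I-QUESTION", "B-ANSWER", "I-ANSWER", "O"]
prediction = ["B-QUESTION", "I-QUESTION", "B-ANSWER", "O", "O"]

# The answer span is cut short, so only the question entity counts as correct.
print(entity_scores(reference, prediction))
```

Note that four of the five tokens are tagged correctly here, yet only one of two entities matches, which is why entity-level F1 and token accuracy can differ noticeably.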

In [14]:
!pip install evaluate seqeval

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Collecting evaluate
  Using cached evaluate-0.4.0-py3-none-any.whl (81 kB)
Collecting seqeval
  Using cached seqeval-1.2.2-py3-none-any.whl
Collecting scikit-learn>=0.21.3
  Using cached scikit_learn-1.2.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (9.6 MB)
Collecting joblib>=1.1.1
  Using cached joblib-1.2.0-py3-none-any.whl (297 kB)
Installing collected packages: joblib, scikit-learn, seqeval, evaluate
Successfully installed evaluate-0.4.0 joblib-1.2.0 scikit-learn-1.2.2 seqeval-1.2.2


In [14]:
import evaluate
import torch
import numpy as np

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)
# Metric
metric = evaluate.load("seqeval")

return_entity_level_metrics = False

# Taken from the token-classification example
def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)

    # Remove ignored index (special tokens)
    true_predictions = [
        [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]

    results = metric.compute(predictions=true_predictions, references=true_labels)
    if return_entity_level_metrics:
        # Unpack nested dictionaries
        final_results = {}
        for key, value in results.items():
            if isinstance(value, dict):
                for n, v in value.items():
                    final_results[f"{key}_{n}"] = v
            else:
                final_results[key] = value
        return final_results
    else:
        return {
            "precision": results["overall_precision"],
            "recall": results["overall_recall"],
            "f1": results["overall_f1"],
            "accuracy": results["overall_accuracy"],
        }

cuda


We also log in to our Hugging Face account, which is required for pushing the model to the hub later on.

In [None]:
!huggingface-cli login


        To login, `huggingface_hub` now requires a token generated from https://huggingface.co/settings/tokens .
        
Token: 
Login successful
Your token has been saved to /root/.huggingface/token
Authenticated through git-credential store but this isn't the helper defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub. Run the following command in yo

In [15]:
from transformers import TrainingArguments, Trainer

args = TrainingArguments(output_dir="lilt-roberta-en-base-finetuned-funsd",
                         overwrite_output_dir=True,
                         remove_unused_columns=False,
                         warmup_ratio=0.1,  # warm up over the first 10% of steps (warmup_steps expects a step count)
                         max_steps=2000,
                         evaluation_strategy="steps",
                         eval_steps=100,
                         push_to_hub=False)

trainer = Trainer(
        model=model,
        args=args,
        train_dataset=dataset["train"],
        eval_dataset=dataset["test"],
        tokenizer=tokenizer,
        compute_metrics=compute_metrics,
    )

max_steps is given, it will override any value given in num_train_epochs
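
Because `max_steps` takes precedence, the number of epochs follows from the dataset size and the batch size. The arithmetic below assumes the default per-device batch size of 8 on a single device with no gradient accumulation, which is what the training log reports.

```python
import math

num_examples = 149  # size of the FUNSD training split
batch_size = 8      # assumed: Trainer default, single device
max_steps = 2000

# The last batch of each epoch is smaller, hence the ceiling.
steps_per_epoch = math.ceil(num_examples / batch_size)
epochs_exact = max_steps / steps_per_epoch
epochs_started = math.ceil(epochs_exact)

print(steps_per_epoch, round(epochs_exact, 2), epochs_started)
```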


Let's train!

In [16]:
trainer.train()

***** Running training *****
  Num examples = 149
  Num Epochs = 106
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 2000
  Number of trainable parameters = 283567815
You're using a LayoutXLMTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss,Validation Loss


***** Running Evaluation *****
  Num examples = 50
  Batch size = 8


KeyboardInterrupt: 

## Evaluate

Evaluation can be done as follows:

In [None]:
metrics = trainer.evaluate()
print(metrics)

***** Running Evaluation *****
  Num examples = 50
  Batch size = 8


{'eval_loss': 1.655214786529541, 'eval_precision': 0.8761670761670761, 'eval_recall': 0.8857426726279185, 'eval_f1': 0.8809288537549407, 'eval_accuracy': 0.8068465470105789, 'eval_runtime': 2.9687, 'eval_samples_per_second': 16.842, 'eval_steps_per_second': 2.358, 'epoch': 105.26}


In [None]:
metrics['eval_f1']

0.8809288537549407

## Push to hub

Let's push everything to the hub so that we can reuse the model afterwards with `from_pretrained`. The upload also includes a model card and TensorBoard metrics.

Check the resulting model [here](https://huggingface.co/nielsr/lilt-roberta-en-base-finetuned-funsd).

In [None]:
trainer.push_to_hub()

Saving model checkpoint to lilt-roberta-en-base-finetuned-funsd
Configuration saved in lilt-roberta-en-base-finetuned-funsd/config.json
Model weights saved in lilt-roberta-en-base-finetuned-funsd/pytorch_model.bin
tokenizer config file saved in lilt-roberta-en-base-finetuned-funsd/tokenizer_config.json
Special tokens file saved in lilt-roberta-en-base-finetuned-funsd/special_tokens_map.json
Several commits (2) will be pushed upstream.
The progress bars may be unreliable.


remote: Scanning LFS files for validity, may be slow...        
remote: LFS file scan complete.        
To https://huggingface.co/nielsr/lilt-roberta-en-base-finetuned-funsd
   767eda6..661834b  main -> main

To https://huggingface.co/nielsr/lilt-roberta-en-base-finetuned-funsd
   661834b..43a62d7  main -> main



'https://huggingface.co/nielsr/lilt-roberta-en-base-finetuned-funsd/commit/661834b5eeac1997307882ab0e41beccbbd6d414'