<a href="https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/LiLT/%5BHuggingFace_Trainer%5D_Fine_tune_LiltForTokenClassification_on_FUNSD_(nielsr_funsd_layoutlmv3).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Set-up environment

First, we install 🤗 Transformers and Datasets.

In [None]:
!pip install -q git+https://github.com/huggingface/transformers.git



In [None]:
!pip install -q datasets


## Load dataset

Let's load the FUNSD dataset from the hub. Note that this version of the dataset contains segment-level positions, which help boost performance (as shown in [this paper](https://arxiv.org/abs/2105.11210)).
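To make "segment positions" concrete: every word in a segment carries the bounding box of the whole segment rather than its own word-level box. The sketch below shows one way such a box could be derived, as the union of the word boxes in a segment. The word boxes and the helper name `segment_box` are hypothetical and are not taken from the dataset's loading script.

```python
def segment_box(word_boxes):
    """Return the smallest [x0, y0, x1, y1] box enclosing all word boxes."""
    x0 = min(box[0] for box in word_boxes)
    y0 = min(box[1] for box in word_boxes)
    x1 = max(box[2] for box in word_boxes)
    y1 = max(box[3] for box in word_boxes)
    return [x0, y0, x1, y1]


# Hypothetical word-level boxes for three words on one line of a form.
word_boxes = [
    [335, 201, 380, 229],
    [390, 203, 470, 227],
    [480, 202, 555, 229],
]

# Each word in the segment would then be assigned this same box.
shared_box = segment_box(word_boxes)
print(shared_box)  # [335, 201, 555, 229]
```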

In [None]:
from datasets import load_dataset

dataset = load_dataset("nielsr/funsd-layoutlmv3")




In [None]:
dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'bboxes', 'ner_tags', 'image'],
        num_rows: 149
    })
    test: Dataset({
        features: ['id', 'tokens', 'bboxes', 'ner_tags', 'image'],
        num_rows: 50
    })
})

In [None]:
dataset["train"].features

{'id': Value(dtype='string', id=None),
 'tokens': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None),
 'bboxes': Sequence(feature=Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None), length=-1, id=None),
 'ner_tags': Sequence(feature=ClassLabel(num_classes=7, names=['O', 'B-HEADER', 'I-HEADER', 'B-QUESTION', 'I-QUESTION', 'B-ANSWER', 'I-ANSWER'], id=None), length=-1, id=None),
 'image': Image(decode=True, id=None)}

We'll create `id2label` and `label2id` mappings here. They are useful at inference time for turning predicted class indices back into label names.

In [None]:
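# Label names are stored in the ClassLabel feature of the dataset
label_list = dataset["train"].features["ner_tags"].feature.names
# Use `idx` rather than `id` to avoid shadowing the Python builtin
id2label = {idx: label for idx, label in enumerate(label_list)}
label2id = {label: idx for idx, label in enumerate(label_list)}
print(id2label)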

{0: 'O', 1: 'B-HEADER', 2: 'I-HEADER', 3: 'B-QUESTION', 4: 'I-QUESTION', 5: 'B-ANSWER', 6: 'I-ANSWER'}
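As a small illustration of how these mappings are used, the sketch below decodes a list of tag ids into label names and back again. The label list is copied from the output above; the `tag_ids` values are illustrative only.

```python
label_list = ["O", "B-HEADER", "I-HEADER", "B-QUESTION",
              "I-QUESTION", "B-ANSWER", "I-ANSWER"]
id2label = dict(enumerate(label_list))
label2id = {label: idx for idx, label in id2label.items()}

# Illustrative tag ids, e.g. what argmax over the model's logits might yield.
tag_ids = [0, 3, 3, 3, 5]

# Decode ids to human-readable labels.
decoded = [id2label[i] for i in tag_ids]
print(decoded)  # ['O', 'B-QUESTION', 'B-QUESTION', 'B-QUESTION', 'B-ANSWER']

# Encoding the labels again recovers the original ids.
encoded = [label2id[label] for label in decoded]
print(encoded == tag_ids)  # True
```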


In [None]:
example = dataset["train"][0]
print(example["tokens"])
print(example["bboxes"])
print(example["ner_tags"])

['R&D', ':', 'Suggestion:', 'Date:', 'Licensee', 'Yes', 'No', '597005708', 'R&D', 'QUALITY', 'IMPROVEMENT', 'SUGGESTION/', 'SOLUTION', 'FORM', 'Name', '/', 'Phone', 'Ext.', ':', 'M.', 'Hamann', 'P.', 'Harper,', 'P.', 'Martinez', '9/', '3/', '92', 'R&D', 'Group:', 'J.', 'S.', 'Wigand', 'Supervisor', '/', 'Manager', 'Discontinue', 'coal', 'retention', 'analyses', 'on', 'licensee', 'submitted', 'product', 'samples', '(Note', ':', 'Coal', 'Retention', 'testing', 'is', 'not', 'performed', 'by', 'most', 'licensees.', 'Other', 'B&W', 'physical', 'measurements', 'as', 'ends', 'stability', 'and', 'inspection', 'for', 'soft', 'spots', 'in', 'ciparettes', 'are', 'thought', 'to', 'be', 'sufficient', 'measures', 'to', 'assure', 'cigarette', 'physical', 'integrity.', 'The', 'proposed', 'action', 'will', 'increase', 'laboratory', 'productivity', '.', ')', 'Suggested', 'Solutions', '(s)', ':', 'Delete', 'coal', 'retention', 'from', 'the', 'list', 'of', 'standard', 'analyses', 'performed', 'on', 'licen

## Transform dataset

Next, we'll use LayoutLMv3's tokenizer to prepare data for the model. We can use this tokenizer because it has the same vocabulary as roberta-base (which we'll combine with LiLT).

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/layoutlmv3-base")


We'll use [set_transform](https://huggingface.co/docs/datasets/process#format-transform) here to do the preprocessing on-the-fly.

In [None]:
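def prepare_examples(batch):
  # Tokenize the words, passing their boxes and word-level labels along
  # so that they are aligned with the resulting sub-word tokens
  encoding = tokenizer(batch["tokens"],
                        boxes=batch["bboxes"],
                        word_labels=batch["ner_tags"],
                        padding="max_length",  # pad every example to max_length
                        max_length=512,
                        truncation=True,  # cut off documents longer than max_length
                        return_tensors="pt")

  return encoding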

dataset.set_transform(prepare_examples)

Let's verify an example:

In [None]:
example = dataset["train"][0]
print(example.keys())

dict_keys(['input_ids', 'attention_mask', 'bbox', 'labels'])


In [None]:
tokenizer.decode(example["input_ids"])

'<s> R&D : Suggestion: Date: Licensee Yes No 597005708 R&D QUALITY IMPROVEMENT SUGGESTION/ SOLUTION FORM Name / Phone Ext. : M. Hamann P. Harper, P. Martinez 9/ 3/ 92 R&D Group: J. S. Wigand Supervisor / Manager Discontinue coal retention analyses on licensee submitted product samples (Note : Coal Retention testing is not performed by most licensees. Other B&W physical measurements as ends stability and inspection for soft spots in ciparettes are thought to be sufficient measures to assure cigarette physical integrity. The proposed action will increase laboratory productivity. ) Suggested Solutions (s) : Delete coal retention from the list of standard analyses performed on licensee submitted product samples. Special requests for coal retention testing could still be submitted on an exception basis. Have you contacted your Manager/ Supervisor? Manager Comments: Manager, please contact suggester and forward comments to the Quality Council. qip. wp</s><pad><pad><pad><pad><pad><pad><pad><p

In [None]:
for id, box, label in zip(example["input_ids"].tolist(),
                          example["bbox"].tolist(),
                          example["labels"].tolist()):
  if label != -100:
    print(tokenizer.decode([id]), box, id2label[label])
  else:
    print(tokenizer.decode([id]), box, label)

<s> [0, 0, 0, 0] -100
 R [383, 91, 493, 175] O
& [383, 91, 493, 175] -100
D [383, 91, 493, 175] -100
 : [287, 316, 295, 327] B-QUESTION
 Suggest [124, 355, 221, 370] B-QUESTION
ion [124, 355, 221, 370] -100
: [124, 355, 221, 370] -100
 Date [632, 268, 679, 282] B-QUESTION
: [632, 268, 679, 282] -100
 License [670, 309, 748, 323] B-ANSWER
e [670, 309, 748, 323] -100
 Yes [604, 605, 633, 619] B-QUESTION
 No [715, 603, 738, 617] B-QUESTION
 5 [688, 904, 841, 926] O
97 [688, 904, 841, 926] -100
005 [688, 904, 841, 926] -100
708 [688, 904, 841, 926] -100
 R [335, 201, 555, 229] B-HEADER
& [335, 201, 555, 229] -100
D [335, 201, 555, 229] -100
 QU [335, 201, 555, 229] I-HEADER
AL [335, 201, 555, 229] -100
ITY [335, 201, 555, 229] -100
 IM [335, 201, 555, 229] I-HEADER
PROV [335, 201, 555, 229] -100
EMENT [335, 201, 555, 229] -100
 S [335, 201, 555, 229] I-HEADER
UG [335, 201, 555, 229] -100
G [335, 201, 555, 229] -100
EST [335, 201, 555, 229] -100
ION [335, 201, 555, 229] -100
/ [335, 201, 55
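The printout above shows the labelling pattern: only the first sub-word token of each word keeps the word's label, while continuation tokens and special tokens get `-100` so that the loss ignores them. The sketch below reproduces that pattern for the first few tokens. It is a simplified illustration, not the tokenizer's actual implementation, and the helper name `align_labels` and the hand-written `word_ids` are assumptions.

```python
def align_labels(word_ids, word_labels, ignore_index=-100):
    """Label the first sub-token of each word; mask everything else."""
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(ignore_index)          # special token (<s>, </s>, <pad>)
        elif word_id != previous:
            aligned.append(word_labels[word_id])  # first sub-token of a word
        else:
            aligned.append(ignore_index)          # continuation sub-token
        previous = word_id
    return aligned


# Tokens: <s>, " R", "&", "D", " :", " Suggest", "ion", ":"
word_ids = [None, 0, 0, 0, 1, 2, 2, 2]
word_labels = [0, 3, 3]  # O, B-QUESTION, B-QUESTION

aligned = align_labels(word_ids, word_labels)
print(aligned)  # [-100, 0, -100, -100, 3, 3, -100, -100]
```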

## Load model

Here we load LiLT, combined with RoBERTa-base, from the hub.

In [None]:
from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained("SCUT-DLVCLab/lilt-roberta-en-base", id2label=id2label, label2id=label2id)


Some weights of LiltForTokenClassification were not initialized from the model checkpoint at nielsr/lilt-roberta-en-base and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Train!

Next, we'll train the model using 🤗 Trainer. Check the docs [here](https://huggingface.co/docs/transformers/main_classes/trainer) for all details.

We define a compute_metrics function which will be used to compute metrics like F1 score, precision and recall (per entity type) on the evaluation set during training. We use 🤗 [Evaluate](https://huggingface.co/docs/evaluate/index) for that.

In [None]:
!pip install -q evaluate seqeval



In [None]:
import evaluate
import torch
import numpy as np

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Metric
metric = evaluate.load("seqeval")

return_entity_level_metrics = False

# Taken from the token-classification example
def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)

    # Remove ignored index (special tokens)
    true_predictions = [
        [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]

    results = metric.compute(predictions=true_predictions, references=true_labels)
    if return_entity_level_metrics:
        # Unpack nested dictionaries
        final_results = {}
        for key, value in results.items():
            if isinstance(value, dict):
                for n, v in value.items():
                    final_results[f"{key}_{n}"] = v
            else:
                final_results[key] = value
        return final_results
    else:
        return {
            "precision": results["overall_precision"],
            "recall": results["overall_recall"],
            "f1": results["overall_f1"],
            "accuracy": results["overall_accuracy"],
        }
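The filtering step inside `compute_metrics` can be tried in isolation. The sketch below uses plain Python lists instead of NumPy arrays and a hypothetical toy batch. It keeps only the positions whose gold label is not `-100` and converts the remaining ids to label strings, which is the form of input that `metric.compute` receives above. The helper name `strip_ignored` is an assumption.

```python
label_list = ["O", "B-HEADER", "I-HEADER", "B-QUESTION",
              "I-QUESTION", "B-ANSWER", "I-ANSWER"]


def strip_ignored(pred_ids, label_ids, ignore_index=-100):
    """Drop positions whose gold label is ignore_index; map ids to names."""
    kept_preds, kept_labels = [], []
    for pred_row, label_row in zip(pred_ids, label_ids):
        pairs = [(p, l) for p, l in zip(pred_row, label_row) if l != ignore_index]
        kept_preds.append([label_list[p] for p, _ in pairs])
        kept_labels.append([label_list[l] for _, l in pairs])
    return kept_preds, kept_labels


# Hypothetical toy batch: one sequence of five tokens.
pred_ids = [[0, 3, 4, 4, 0]]
label_ids = [[-100, 3, -100, 4, -100]]

preds, golds = strip_ignored(pred_ids, label_ids)
print(preds)  # [['B-QUESTION', 'I-QUESTION']]
print(golds)  # [['B-QUESTION', 'I-QUESTION']]
```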


We also log in to our Hugging Face account, as we'll push the model to the hub during training.

In [None]:
!huggingface-cli login


        To login, `huggingface_hub` now requires a token generated from https://huggingface.co/settings/tokens .
        
Token: 
Login successful
Your token has been saved to /root/.huggingface/token
[1m[31mAuthenticated through git-credential store but this isn't the helper defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub. Run the following command in yo

In [None]:
from transformers import TrainingArguments, Trainer

args = TrainingArguments(output_dir="lilt-roberta-en-base-finetuned-funsd",
                         overwrite_output_dir=True,
                         remove_unused_columns=False,
                         warmup_ratio=0.1,
                         max_steps=2000,
                         evaluation_strategy="steps",
                         eval_steps=100,
                         push_to_hub=True)

trainer = Trainer(
        model=model,
        args=args,
        train_dataset=dataset["train"],
        eval_dataset=dataset["test"],
        tokenizer=tokenizer,
        compute_metrics=compute_metrics,
    )

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
/content/lilt-roberta-en-base-finetuned-funsd is already a clone of https://huggingface.co/nielsr/lilt-roberta-en-base-finetuned-funsd. Make sure you pull the latest changes with `repo.git_pull()`.
max_steps is given, it will override any value given in num_train_epochs


Let's train!

In [None]:
trainer.train()

***** Running training *****
  Num examples = 149
  Num Epochs = 106
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 2000


| Step | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|------|---------------|-----------------|-----------|--------|----|----------|
| 100 | No log | 1.17892 | 0.850598 | 0.848485 | 0.84954 | 0.786877 |
| 200 | No log | 1.238219 | 0.836011 | 0.878788 | 0.856866 | 0.796981 |
| 300 | No log | 1.376612 | 0.85571 | 0.889717 | 0.872382 | 0.790919 |
| 400 | No log | 1.559013 | 0.836812 | 0.876304 | 0.856103 | 0.779151 |
| 500 | 0.040000 | 1.437926 | 0.856178 | 0.881272 | 0.868543 | 0.799239 |
| 600 | 0.040000 | 1.539659 | 0.859256 | 0.894685 | 0.876612 | 0.80542 |
| 700 | 0.040000 | 1.613237 | 0.862052 | 0.87233 | 0.86716 | 0.793296 |
| 800 | 0.040000 | 1.648337 | 0.856595 | 0.887233 | 0.871645 | 0.777725 |
| 900 | 0.040000 | 1.659324 | 0.864101 | 0.881272 | 0.872602 | 0.789492 |
| 1000 | 0.004400 | 1.670427 | 0.859452 | 0.871833 | 0.865598 | 0.792464 |


***** Running Evaluation *****
  Num examples = 50
  Batch size = 8
Saving model checkpoint to lilt-roberta-en-base-finetuned-funsd/checkpoint-500
Configuration saved in lilt-roberta-en-base-finetuned-funsd/checkpoint-500/config.json
Model weights saved in lilt-roberta-en-base-finetuned-funsd/checkpoint-500/pytorch_model.bin
tokenizer config file saved in lilt-roberta-en-base-finetuned-funsd/checkpoint-500/tokenizer_config.json
Special tokens file saved in lilt-roberta-en-base-finetuned-funsd/checkpoint-500/special_tokens_map.json
tokenizer config file saved in lilt-roberta-en-base-finetuned-funsd/tokenizer_config.json
Special tokens file saved in lilt-roberta-en-base-finetuned-funsd/special_tokens_map.json
***** Runni

TrainOutput(global_step=2000, training_loss=0.011410338178277015, metrics={'train_runtime': 1688.2765, 'train_samples_per_second': 9.477, 'train_steps_per_second': 1.185, 'total_flos': 4362983351362560.0, 'train_loss': 0.011410338178277015, 'epoch': 105.26})
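The epoch figures in the log follow from `max_steps` and the dataset size. The arithmetic below is a sketch that assumes one device, no gradient accumulation (both as reported in the log), and that the last, partially filled batch of each epoch is kept.

```python
import math

num_examples = 149   # size of the training split
batch_size = 8       # per-device batch size from the log
max_steps = 2000     # total optimization steps

# One optimization step consumes one batch; the final batch may be smaller.
steps_per_epoch = math.ceil(num_examples / batch_size)

# Number of passes over the training data needed to reach max_steps.
epochs = max_steps / steps_per_epoch

print(steps_per_epoch)     # 19
print(round(epochs, 2))    # 105.26
print(math.ceil(epochs))   # 106
```

This matches `Num Epochs = 106` in the training log and `'epoch': 105.26` in the `TrainOutput`.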

## Evaluate

Evaluation can be done as follows:

In [None]:
metrics = trainer.evaluate()
print(metrics)

***** Running Evaluation *****
  Num examples = 50
  Batch size = 8


{'eval_loss': 1.655214786529541, 'eval_precision': 0.8761670761670761, 'eval_recall': 0.8857426726279185, 'eval_f1': 0.8809288537549407, 'eval_accuracy': 0.8068465470105789, 'eval_runtime': 2.9687, 'eval_samples_per_second': 16.842, 'eval_steps_per_second': 2.358, 'epoch': 105.26}


In [None]:
metrics['eval_f1']

0.8809288537549407
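As a quick sanity check, the reported F1 is the harmonic mean of the reported precision and recall. The sketch below recomputes it from the two values printed above.

```python
# Values copied from the evaluation output above.
precision = 0.8761670761670761
recall = 0.8857426726279185

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(round(f1, 4))  # 0.8809
```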

## Push to hub

Let's push everything to the hub so that we can reuse the model afterwards using `from_pretrained`. The upload also includes a model card and TensorBoard metrics.

Check the resulting model [here](https://huggingface.co/nielsr/lilt-roberta-en-base-finetuned-funsd).

In [None]:
trainer.push_to_hub()

Saving model checkpoint to lilt-roberta-en-base-finetuned-funsd
Configuration saved in lilt-roberta-en-base-finetuned-funsd/config.json
Model weights saved in lilt-roberta-en-base-finetuned-funsd/pytorch_model.bin
tokenizer config file saved in lilt-roberta-en-base-finetuned-funsd/tokenizer_config.json
Special tokens file saved in lilt-roberta-en-base-finetuned-funsd/special_tokens_map.json
Several commits (2) will be pushed upstream.
The progress bars may be unreliable.



remote: Scanning LFS files for validity, may be slow...        
remote: LFS file scan complete.        
To https://huggingface.co/nielsr/lilt-roberta-en-base-finetuned-funsd
   767eda6..661834b  main -> main


To https://huggingface.co/nielsr/lilt-roberta-en-base-finetuned-funsd
   661834b..43a62d7  main -> main




'https://huggingface.co/nielsr/lilt-roberta-en-base-finetuned-funsd/commit/661834b5eeac1997307882ab0e41beccbbd6d414'