This repository has been archived by the owner on Dec 16, 2022. It is now read-only.

Add support for transformers LayoutLMv2. #5450

Open
HOZHENWAI opened this issue Oct 27, 2021 · 1 comment

@HOZHENWAI

Is your feature request related to a problem? Please describe.
With the current allennlp 2.7.0 and transformers 4.11.3, LayoutLMv2 is not supported:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/root/allennlp/allennlp/modules/token_embedders/pretrained_transformer_mismatched_embedder.py", line 80, in __init__
    self._matched_embedder = PretrainedTransformerEmbedder(
  File "/root/allennlp/allennlp/modules/token_embedders/pretrained_transformer_embedder.py", line 123, in __init__
    tokenizer = PretrainedTransformerTokenizer(
  File "/root/allennlp/allennlp/data/tokenizers/pretrained_transformer_tokenizer.py", line 79, in __init__
    self._reverse_engineer_special_tokens("a", "b", model_name, tokenizer_kwargs)
  File "/root/allennlp/allennlp/data/tokenizers/pretrained_transformer_tokenizer.py", line 112, in _reverse_engineer_special_tokens
    dummy_output = tokenizer_with_special_tokens.encode_plus(
  File "/root/anaconda3/envs/alenlayout/lib/python3.8/site-packages/transformers/models/layoutlmv2/tokenization_layoutlmv2_fast.py", line 430, in encode_plus
    return self._encode_plus(
  File "/root/anaconda3/envs/alenlayout/lib/python3.8/site-packages/transformers/models/layoutlmv2/tokenization_layoutlmv2_fast.py", line 639, in _encode_plus
    batched_output = self._batch_encode_plus(
  File "/root/anaconda3/envs/alenlayout/lib/python3.8/site-packages/transformers/models/layoutlmv2/tokenization_layoutlmv2_fast.py", line 493, in _batch_encode_plus
    encodings = self._tokenizer.encode_batch(
TypeError: PreTokenizedInputSequence must be Union[List[str], Tuple[str]]

The error occurs because the fast LayoutLMv2 tokenizer added a boxes argument as an extra positional parameter, which breaks the reverse-engineering of special tokens in allennlp's pretrained_transformer_tokenizer.
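For illustration, the failure can be reproduced with the dummy encoding alone; this is a sketch that assumes the microsoft/layoutlmv2-base-uncased checkpoint and mirrors the "a", "b" call from the traceback:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/layoutlmv2-base-uncased")

# The fast LayoutLMv2 tokenizer expects the second positional sequence to be a
# pre-tokenized word list (paired with boxes), so the plain-string dummy call
# used to reverse-engineer the special tokens fails:
tokenizer.encode_plus("a", "b", add_special_tokens=True)
# TypeError: PreTokenizedInputSequence must be Union[List[str], Tuple[str]]
```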

Describe the solution you'd like
Ideally, passing the arguments to tokenizer_with_special_tokens.encode_plus by keyword in pretrained_transformer_tokenizer should do the trick, but I'm afraid of repercussions on other tokenizers that use different argument names (those not based on BERT, maybe?).
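A minimal sketch of that idea, assuming the dummy call in _reverse_engineer_special_tokens looks roughly like the traceback suggests (variable names here are illustrative, not the exact allennlp code):

```python
# Hypothetical change inside PretrainedTransformerTokenizer._reverse_engineer_special_tokens:
# naming text_pair keeps the second dummy sentence from landing in an extra
# positional parameter (such as LayoutLMv2's boxes) that a tokenizer may define.
dummy_output = tokenizer_with_special_tokens.encode_plus(
    sentence_a,
    text_pair=sentence_b,
    add_special_tokens=True,
)
```

Whether this alone is enough for LayoutLMv2 is not obvious, since its tokenizer also expects the paired sequence to be pre-tokenized and accompanied by boxes.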
Moreover, since LayoutLMv2 adds a few inputs to the model (images and boxes), _fold_long_sequences, _unfold_long_sequences, and forward in pretrained_transformer_embedder and pretrained_transformer_mismatched_embedder would need to be modified to account for the additional inputs.
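A rough sketch of where the extra inputs might slot into the embedder's forward; the existing parameter list is written from memory of allennlp 2.x, the bbox/image names and shapes are borrowed from the transformers LayoutLMv2 model, and the subclass is hypothetical, used only so the snippet stands on its own:

```python
from typing import Optional

import torch
from allennlp.modules.token_embedders import PretrainedTransformerEmbedder


class LayoutLMv2TransformerEmbedder(PretrainedTransformerEmbedder):  # hypothetical name
    def forward(
        self,
        token_ids: torch.LongTensor,
        mask: torch.BoolTensor,
        type_ids: Optional[torch.LongTensor] = None,
        segment_concat_mask: Optional[torch.BoolTensor] = None,
        bbox: Optional[torch.LongTensor] = None,    # (batch, seq_len, 4) word bounding boxes
        image: Optional[torch.FloatTensor] = None,  # page image for the visual backbone
    ) -> torch.Tensor:
        # bbox and image would also have to be carried through
        # _fold_long_sequences / _unfold_long_sequences alongside token_ids
        # and the masks before being handed to the underlying model.
        ...
```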

If it's okay with you, I'd like to work on it.

@epwalsh
Member

epwalsh commented Oct 29, 2021

Hi @HOZHENWAI, yes, this would be a good fix to have. Feel free to open a PR when you're ready.
