This repository has been archived by the owner on Dec 16, 2022. It is now read-only.

Add support for transformers LayoutLMv2. #5450

Open
HOZHENWAI opened this issue Oct 27, 2021 · 1 comment

@HOZHENWAI

Is your feature request related to a problem? Please describe.
With the current allennlp 2.7.0 and transformers 4.11.3, LayoutLMv2 is not supported:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/root/allennlp/allennlp/modules/token_embedders/pretrained_transformer_mismatched_embedder.py", line 80, in __init__
    self._matched_embedder = PretrainedTransformerEmbedder(
  File "/root/allennlp/allennlp/modules/token_embedders/pretrained_transformer_embedder.py", line 123, in __init__
    tokenizer = PretrainedTransformerTokenizer(
  File "/root/allennlp/allennlp/data/tokenizers/pretrained_transformer_tokenizer.py", line 79, in __init__
    self._reverse_engineer_special_tokens("a", "b", model_name, tokenizer_kwargs)
  File "/root/allennlp/allennlp/data/tokenizers/pretrained_transformer_tokenizer.py", line 112, in _reverse_engineer_special_tokens
    dummy_output = tokenizer_with_special_tokens.encode_plus(
  File "/root/anaconda3/envs/alenlayout/lib/python3.8/site-packages/transformers/models/layoutlmv2/tokenization_layoutlmv2_fast.py", line 430, in encode_plus
    return self._encode_plus(
  File "/root/anaconda3/envs/alenlayout/lib/python3.8/site-packages/transformers/models/layoutlmv2/tokenization_layoutlmv2_fast.py", line 639, in _encode_plus
    batched_output = self._batch_encode_plus(
  File "/root/anaconda3/envs/alenlayout/lib/python3.8/site-packages/transformers/models/layoutlmv2/tokenization_layoutlmv2_fast.py", line 493, in _batch_encode_plus
    encodings = self._tokenizer.encode_batch(
TypeError: PreTokenizedInputSequence must be Union[List[str], Tuple[str]]

The error occurs because the fast LayoutLMv2 tokenizer added a boxes argument as an extra positional parameter, which breaks the reverse-engineering of special tokens in allennlp's pretrained_transformer_tokenizer.
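For illustration, the failure can be reproduced with the dummy encoding alone; this is a sketch that assumes the microsoft/layoutlmv2-base-uncased checkpoint and mirrors the "a", "b" call from the traceback:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/layoutlmv2-base-uncased")

# The fast LayoutLMv2 tokenizer expects the second positional sequence to be a
# pre-tokenized word list (paired with boxes), so the plain-string dummy call
# used to reverse-engineer the special tokens fails:
tokenizer.encode_plus("a", "b", add_special_tokens=True)
# TypeError: PreTokenizedInputSequence must be Union[List[str], Tuple[str]]
```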

Describe the solution you'd like
Ideally, passing the arguments to tokenizer_with_special_tokens.encode_plus by keyword in pretrained_transformer_tokenizer should do the trick, but I'm afraid of repercussions on other tokenizers that use different argument names (those not based on BERT, maybe?).
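A minimal sketch of that idea, assuming the dummy call in _reverse_engineer_special_tokens looks roughly like the traceback suggests (variable names here are illustrative, not the exact allennlp code):

```python
# Hypothetical change inside PretrainedTransformerTokenizer._reverse_engineer_special_tokens:
# naming text_pair keeps the second dummy sentence from landing in an extra
# positional parameter (such as LayoutLMv2's boxes) that a tokenizer may define.
dummy_output = tokenizer_with_special_tokens.encode_plus(
    sentence_a,
    text_pair=sentence_b,
    add_special_tokens=True,
)
```

Whether this alone is enough for LayoutLMv2 is not obvious, since its tokenizer also expects the paired sequence to be pre-tokenized and accompanied by boxes.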
Moreover, since LayoutLMv2 adds a few inputs to the model (images and boxes), _fold_long_sequences, _unfold_long_sequences, and forward in pretrained_transformer_embedder and pretrained_transformer_mismatched_embedder would need to be modified to account for the additional inputs.
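A rough sketch of where the extra inputs might slot into the embedder's forward; the existing parameter list is written from memory of allennlp 2.x, the bbox/image names and shapes are borrowed from the transformers LayoutLMv2 model, and the subclass is hypothetical, used only so the snippet stands on its own:

```python
from typing import Optional

import torch
from allennlp.modules.token_embedders import PretrainedTransformerEmbedder


class LayoutLMv2TransformerEmbedder(PretrainedTransformerEmbedder):  # hypothetical name
    def forward(
        self,
        token_ids: torch.LongTensor,
        mask: torch.BoolTensor,
        type_ids: Optional[torch.LongTensor] = None,
        segment_concat_mask: Optional[torch.BoolTensor] = None,
        bbox: Optional[torch.LongTensor] = None,    # (batch, seq_len, 4) word bounding boxes
        image: Optional[torch.FloatTensor] = None,  # page image for the visual backbone
    ) -> torch.Tensor:
        # bbox and image would also have to be carried through
        # _fold_long_sequences / _unfold_long_sequences alongside token_ids
        # and the masks before being handed to the underlying model.
        ...
```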

If it's okay with you, I'd like to work on it.

@epwalsh
Member

epwalsh commented Oct 29, 2021

Hi @HOZHENWAI, yes, this would be a good fix to have. Feel free to open a PR when you're ready.
