
LayoutXLM for Token Classification on FUNSD #35

Closed
TahaDouaji opened this issue Sep 25, 2021 · 10 comments


TahaDouaji commented Sep 25, 2021

Hello Niels, first of all, thanks a lot for all of your awesome tutorials.
I'm trying to apply the LayoutLMv2 token classification tutorial to LayoutXLM, and I'm facing a few issues.
I'm trying to build a processor for LayoutXLM, so I'm converting these lines

from transformers import LayoutLMv2Processor
processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutlmv2-base-uncased", revision="no_ocr")

to one of the following, but neither worked:

processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutxlm-base", revision="no_ocr")

feature_extractor = LayoutLMv2FeatureExtractor(apply_ocr=False)
tokenizer = LayoutLMv2TokenizerFast.from_pretrained("microsoft/layoutxlm-base")
processor = LayoutLMv2Processor(feature_extractor, tokenizer)

Can you please help me figure out what to change to make it work?
Many thanks in advance!

@ManuelFay

You don't need to change the processor, just the model weights. Keep the LayoutLMv2 processor as is and change the model path.

https://huggingface.co/transformers/model_doc/layoutxlm.html

@NielsRogge
Owner

Yes, but LayoutXLM also requires another tokenizer (based on XLM-RoBERTa).

I've updated the docs: https://huggingface.co/transformers/master/model_doc/layoutxlm.html


ManuelFay commented Sep 27, 2021

Using AutoTokenizer to load the XLM-RoBERTa tokenizer does not work, since it raises a ValueError when initializing the processor:

def __init__(self, feature_extractor, tokenizer):
    if not isinstance(feature_extractor, LayoutLMv2FeatureExtractor):
        raise ValueError(
            f"`feature_extractor` has to be of type {LayoutLMv2FeatureExtractor.__class__}, but is {type(feature_extractor)}"
        )
    if not isinstance(tokenizer, (LayoutLMv2Tokenizer, LayoutLMv2TokenizerFast)):
        raise ValueError(
            f"`tokenizer` has to be of type {LayoutLMv2Tokenizer.__class__} or {LayoutLMv2TokenizerFast.__class__}, but is {type(tokenizer)}"
        )

    self.feature_extractor = feature_extractor
    self.tokenizer = tokenizer
...

@TahaDouaji
Author

Thank you both @ManuelFay @NielsRogge for your replies!
Yes, as mentioned above, AutoTokenizer is not working, so how can we load the tokenizer into our processor?


ReyzGS commented Oct 1, 2021

Hi, I'm having the same issue as @ManuelFay when using the processor with AutoTokenizer. Did you find any fix for this?


NielsRogge commented Oct 1, 2021

Hi,

So the processor combines a feature extractor (for the image part) and a tokenizer (for the text part). LayoutLMv2 and LayoutXLM share the same feature extractor, but require different tokenizers. The problem is that, currently, the tokenizer for LayoutXLM does not support bounding boxes. You should use the XLMRobertaTokenizer (as LayoutXLM is a multilingual model, it uses the same multilingual vocabulary as XLM-RoBERTa), like so:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/layoutxlm-base")

So for now, the processor cannot be used for LayoutXLM. Instead, one should use the feature extractor and the tokenizer separately to prepare the data for the model.

You can take a look at my existing notebooks for LayoutLM (v1) to see how to prepare data for the model when no processor is available (just replace the tokenizer used there with the one above, and add the output of the feature extractor).
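Putting the advice above together, the separate-pieces workflow might look roughly like the sketch below. The `normalize_box` helper (scaling pixel boxes to the 0-1000 range that LayoutLM-family models expect) follows the LayoutLM notebooks; `prepare_example` is a hypothetical name for the glue code, and the word-to-token box alignment is left as a comment since it depends on the dataset.

```python
from typing import List


def normalize_box(box: List[int], width: int, height: int) -> List[int]:
    """Scale an (x0, y0, x1, y1) pixel box to the 0-1000 range
    that LayoutLM-family models expect."""
    return [
        int(1000 * box[0] / width),
        int(1000 * box[1] / height),
        int(1000 * box[2] / width),
        int(1000 * box[3] / height),
    ]


def prepare_example(image, words, boxes):
    """Hypothetical sketch: run the feature extractor and the
    XLM-RoBERTa-based tokenizer separately, then merge their outputs."""
    from transformers import AutoTokenizer, LayoutLMv2FeatureExtractor

    feature_extractor = LayoutLMv2FeatureExtractor(apply_ocr=False)
    tokenizer = AutoTokenizer.from_pretrained("microsoft/layoutxlm-base")

    width, height = image.size
    norm_boxes = [normalize_box(b, width, height) for b in boxes]

    encoding = tokenizer(
        words, is_split_into_words=True, truncation=True, return_tensors="pt"
    )
    # The tokenizer does not handle boxes here, so the token-level bbox
    # tensor has to be built manually from norm_boxes (see the LayoutLM v1
    # notebooks for the word-to-token alignment).
    encoding["image"] = feature_extractor(image, return_tensors="pt")["pixel_values"]
    return encoding, norm_boxes
```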


ReyzGS commented Oct 1, 2021

Thank you so much for your quick reply! I will try that!


ReyzGS commented Oct 1, 2021

Also, sorry to bother you again, but given that bounding boxes are not supported by the LayoutXLM tokenizer, does that mean I just don't pass them to the model?


NielsRogge commented Oct 18, 2021

LayoutXLM is equivalent to LayoutLMv2, so it also requires boxes as input. But there's good news: there's a PR open that will add support for LayoutXLM in the processor: huggingface/transformers#14030

@NielsRogge

Good news: LayoutXLMProcessor is now available (the PR linked above has been merged)!

For now, you should install Transformers from source to use it: pip install git+https://github.com/huggingface/transformers.git
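A minimal sketch of how the new processor might be loaded, assuming its API mirrors the LayoutLMv2 one from earlier in this thread (wrapped in a function so nothing is downloaded at import time; the checkpoint name is the one used throughout this issue):

```python
def build_layoutxlm_processor():
    """Sketch: load the newly merged LayoutXLMProcessor, which pairs the
    LayoutLMv2 feature extractor with the XLM-RoBERTa-based tokenizer."""
    from transformers import LayoutXLMProcessor

    # apply_ocr=False means you supply words and boxes yourself, mirroring
    # the revision="no_ocr" setup used for LayoutLMv2 earlier in the thread.
    return LayoutXLMProcessor.from_pretrained(
        "microsoft/layoutxlm-base", apply_ocr=False
    )
```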

Therefore, closing this issue.
