
LayoutXLM for Token Classification on FUNSD #35

Closed
TahaDouaji opened this issue Sep 25, 2021 · 10 comments


TahaDouaji commented Sep 25, 2021

Hello Niels, first of all, thanks a lot for all of your awesome tutorials.
I'm trying to apply the LayoutLMv2 token classification tutorial to LayoutXLM, and I'm facing a few issues.
I'm trying to build a processor for LayoutXLM, so I'm converting these lines

from transformers import LayoutLMv2Processor
processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutlmv2-base-uncased", revision="no_ocr")

to one of the following, but neither worked:

processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutxlm-base", revision="no_ocr")

feature_extractor = LayoutLMv2FeatureExtractor(apply_ocr=False)
tokenizer = LayoutLMv2TokenizerFast.from_pretrained("microsoft/layoutxlm-base")
processor = LayoutLMv2Processor(feature_extractor, tokenizer)

Can you please help me figure out what to change to make it work?
Many thanks in advance!

@ManuelFay

You don't need to change the processor, just the model weights. Keep the LayoutLMv2 processor as is and change the model path.

https://huggingface.co/transformers/model_doc/layoutxlm.html

@NielsRogge
Owner

Yes, but LayoutXLM also requires another tokenizer (based on XLM-RoBERTa).

I've updated the docs: https://huggingface.co/transformers/master/model_doc/layoutxlm.html


ManuelFay commented Sep 27, 2021

Using AutoTokenizer to load the XLM-RoBERTa tokenizer does not work, since it raises a ValueError when initializing the processor:

def __init__(self, feature_extractor, tokenizer):
    if not isinstance(feature_extractor, LayoutLMv2FeatureExtractor):
        raise ValueError(
            f"`feature_extractor` has to be of type {LayoutLMv2FeatureExtractor.__class__}, but is {type(feature_extractor)}"
        )
    if not isinstance(tokenizer, (LayoutLMv2Tokenizer, LayoutLMv2TokenizerFast)):
        raise ValueError(
            f"`tokenizer` has to be of type {LayoutLMv2Tokenizer.__class__} or {LayoutLMv2TokenizerFast.__class__}, but is {type(tokenizer)}"
        )

    self.feature_extractor = feature_extractor
    self.tokenizer = tokenizer
...

@TahaDouaji
Author

Thank you both @ManuelFay @NielsRogge for your replies!
Yes, as mentioned above, AutoTokenizer is not working, so how can we load the tokenizer into our processor?


ReyzGS commented Oct 1, 2021

Hi, I'm having the same issue as @ManuelFay when using the processor with AutoTokenizer. Did you find any fix for this?


NielsRogge commented Oct 1, 2021

Hi,

So the processor combines a feature extractor (for the image part) and a tokenizer (for the text part). LayoutLMv2 and LayoutXLM share the same feature extractor, but require different tokenizers. The problem is that, currently, the tokenizer for LayoutXLM does not support bounding boxes. You should use the XLMRobertaTokenizer (as LayoutXLM is a multilingual model, it uses the same multilingual vocabulary as XLM-RoBERTa), like so:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/layoutxlm-base")

So for now, the processor cannot be used for LayoutXLM. Instead, one should use the feature extractor and the tokenizer separately to prepare the data for the model.

You can take a look at my existing notebooks for LayoutLM (v1) to see how to prepare data for the model when no processor is available (just replace the tokenizer used there with the one above, and add the output of the feature extractor).
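Putting the advice above together, the separate-pieces workflow might look roughly like the sketch below. The `normalize_box` helper (scaling pixel boxes to the 0-1000 range that LayoutLM-family models expect) follows the LayoutLM notebooks; `prepare_example` is a hypothetical name for the glue code, and the word-to-token box alignment is left as a comment since it depends on the dataset.

```python
from typing import List


def normalize_box(box: List[int], width: int, height: int) -> List[int]:
    """Scale an (x0, y0, x1, y1) pixel box to the 0-1000 range
    that LayoutLM-family models expect."""
    return [
        int(1000 * box[0] / width),
        int(1000 * box[1] / height),
        int(1000 * box[2] / width),
        int(1000 * box[3] / height),
    ]


def prepare_example(image, words, boxes):
    """Hypothetical sketch: run the feature extractor and the
    XLM-RoBERTa-based tokenizer separately, then merge their outputs."""
    from transformers import AutoTokenizer, LayoutLMv2FeatureExtractor

    feature_extractor = LayoutLMv2FeatureExtractor(apply_ocr=False)
    tokenizer = AutoTokenizer.from_pretrained("microsoft/layoutxlm-base")

    width, height = image.size
    norm_boxes = [normalize_box(b, width, height) for b in boxes]

    encoding = tokenizer(
        words, is_split_into_words=True, truncation=True, return_tensors="pt"
    )
    # The tokenizer does not handle boxes here, so the token-level bbox
    # tensor has to be built manually from norm_boxes (see the LayoutLM v1
    # notebooks for the word-to-token alignment).
    encoding["image"] = feature_extractor(image, return_tensors="pt")["pixel_values"]
    return encoding, norm_boxes
```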


ReyzGS commented Oct 1, 2021

Thank you so much for your quick reply! I will try that!


ReyzGS commented Oct 1, 2021

Also, sorry to bother you again, but given that bounding boxes are not supported by the LayoutXLM tokenizer, does that mean I just don't pass them to the model?


NielsRogge commented Oct 18, 2021

LayoutXLM is equivalent to LayoutLMv2, so it also requires boxes as input. But there's good news: there's a PR open that will add support for LayoutXLM in the processor: huggingface/transformers#14030

@NielsRogge

Good news: LayoutXLMProcessor is now available (the PR linked above has been merged)!

For now, you should install Transformers from source to use it: pip install git+https://github.com/huggingface/transformers.git
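A minimal sketch of how the new processor might be loaded, assuming its API mirrors the LayoutLMv2 one from earlier in this thread (wrapped in a function so nothing is downloaded at import time; the checkpoint name is the one used throughout this issue):

```python
def build_layoutxlm_processor():
    """Sketch: load the newly merged LayoutXLMProcessor, which pairs the
    LayoutLMv2 feature extractor with the XLM-RoBERTa-based tokenizer."""
    from transformers import LayoutXLMProcessor

    # apply_ocr=False means you supply words and boxes yourself, mirroring
    # the revision="no_ocr" setup used for LayoutLMv2 earlier in the thread.
    return LayoutXLMProcessor.from_pretrained(
        "microsoft/layoutxlm-base", apply_ocr=False
    )
```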

Therefore, closing this issue.
