LayoutXLM for Token Classification on FUNSD #35
You don't need to change the processor, just the model weights. Keep the processor as is (the LayoutLMv2 one) and change the model path: https://huggingface.co/transformers/model_doc/layoutxlm.html
Yes, but LayoutXLM also requires a different tokenizer (based on XLM-RoBERTa). I've updated the docs: https://huggingface.co/transformers/master/model_doc/layoutxlm.html
Using `AutoTokenizer` to load the XLM-RoBERTa tokenizer does not work, since it raises a `ValueError` when initializing the processor:

```python
def __init__(self, feature_extractor, tokenizer):
    if not isinstance(feature_extractor, LayoutLMv2FeatureExtractor):
        raise ValueError(
            f"`feature_extractor` has to be of type {LayoutLMv2FeatureExtractor.__class__}, but is {type(feature_extractor)}"
        )
    if not isinstance(tokenizer, (LayoutLMv2Tokenizer, LayoutLMv2TokenizerFast)):
        raise ValueError(
            f"`tokenizer` has to be of type {LayoutLMv2Tokenizer.__class__} or {LayoutLMv2TokenizerFast.__class__}, but is {type(tokenizer)}"
        )
    self.feature_extractor = feature_extractor
    self.tokenizer = tokenizer
```
Thank you both @ManuelFay @NielsRogge for your replies!
Hi, I'm having the same issue as @ManuelFay when using the processor with `AutoTokenizer`. Did you find any fix for this?
Hi, so the processor combines a feature extractor (for the image part) and a tokenizer (for the text part). LayoutLMv2 and LayoutXLM share the same feature extractor, but require different tokenizers. The problem is that currently, the tokenizer for LayoutXLM does not support bounding boxes. You should use `XLMRobertaTokenizer` instead (as LayoutXLM is a multilingual model, it uses a different multilingual vocabulary, the same one as XLM-RoBERTa).
So for now, one cannot use the processor for LayoutXLM. Instead, one should use the feature extractor and the tokenizer separately to prepare the data for the model. You can take a look at my existing notebooks for LayoutLM (v1) to see how to prepare data when no processor is available (just replace the tokenizer used there with the one above and add the output of the feature extractor).
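A minimal sketch of the manual preprocessing described above: run the feature extractor and the XLM-RoBERTa tokenizer separately, and normalize each word's bounding box to the 0–1000 scale that LayoutLM-family models expect. The `normalize_box` helper follows the one used in the LayoutLM (v1) notebooks; the commented-out tokenizer/feature-extractor lines are assumptions based on this thread, not verified against a specific release.

```python
# Manual data prep for LayoutXLM while the processor is unavailable.
# The transformers calls below are assumptions based on the thread:
#
# from transformers import XLMRobertaTokenizer, LayoutLMv2FeatureExtractor
# tokenizer = XLMRobertaTokenizer.from_pretrained("microsoft/layoutxlm-base")
# feature_extractor = LayoutLMv2FeatureExtractor(apply_ocr=False)

def normalize_box(box, width, height):
    """Scale an absolute [x0, y0, x1, y1] pixel box to the 0-1000 range
    expected by LayoutLM-family models."""
    return [
        int(1000 * box[0] / width),
        int(1000 * box[1] / height),
        int(1000 * box[2] / width),
        int(1000 * box[3] / height),
    ]
```

For example, a word box `[15, 20, 25, 30]` on a 100×200 image normalizes to `[150, 100, 250, 150]`; these normalized boxes are what you pass as `bbox` alongside the tokenized words and the feature extractor's image output.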
Thank you so much for your quick reply! I will try that!
Also, sorry to bother again, but given that bounding boxes are not supported by the LayoutXLM tokenizer, does that mean I just don't pass them to the model?
LayoutXLM is equivalent to LayoutLMv2, so it also requires boxes as input. But there's good news: there's a PR open that will add support for LayoutXLM in the processor: huggingface/transformers#14030
Good news: this is now supported. You should install Transformers from source for now to use it. Therefore, closing this issue.
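With that support in place, loading the processor should look roughly like the sketch below (not verified against a specific release; `apply_ocr=False` is an assumption carried over from the LayoutLMv2 tutorial, for when you supply your own words and boxes):

```python
# Sketch: building a LayoutXLM processor after installing transformers
# from source (pip install git+https://github.com/huggingface/transformers).
from transformers import LayoutXLMProcessor

def load_layoutxlm_processor():
    # Downloads the checkpoint on first use; apply_ocr=False (assumed)
    # disables built-in OCR so you can pass your own words and boxes.
    return LayoutXLMProcessor.from_pretrained(
        "microsoft/layoutxlm-base", apply_ocr=False
    )
```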
Hello Niels, first of all, thanks a lot for all of your awesome tutorials.
I'm trying to apply the LayoutLMv2 token classification tutorial to LayoutXLM, and I'm facing a few issues.
I'm trying to build a processor for LayoutXLM by converting this line to the following, but none of my attempts worked:

```python
processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutxlm-base", revision="no_ocr")
```
So, can you please help me figure out what to change to make it work?
Many thanks in advance!