[bug] fix llava processor to calculate unpadding size correctly #37988
Conversation
Hi 👋, thank you for opening this pull request! The pull request is converted to draft by default. The CI will be paused while the PR is in draft mode. When it is ready for review, please click the "Ready for review" button.
Sorry, I mistakenly didn't fix the _get_unpadded_features methods in the previous PR.
Right, thanks for noticing it and fixing! Left a few comments on the new tests, and we can add a couple of slow tests with real checkpoints from the Hub :)
expected_image_tokens = 1526
image_token_index = 32000
image = torch.randint(0, 2, (3, 501, 322))
expected_image_tokens = 680
Hmm, is this really correct? Interesting that the number of tokens is cut in half even though the image size increased by a few pixels 🤔
Ah, I think this is because the model was changed from "llava-hf/llava-v1.6-vicuna-7b-hf" to a Processor initialized in setUpClass. I will change it back to "llava-hf/llava-v1.6-vicuna-7b-hf".
Okay, so the test was failing before the fix we made on image sizes? We can leave it as is then, and only update this test method.
No, it passed, but the input image torch.randint(0, 2, (3, 500, 316)) couldn't cover the edge case before.
In that case, my initial question remains: if we use the same processor params with a slightly bigger image, how come the number of tokens is half? I want to make sure it didn't break anything fundamental.
Okay, maybe there is something I have to fix other than just the pad & unpad logic. Let me check thoroughly.
It turned out the default params of AutoProcessor.from_pretrained("llava-hf/llava-v1.6-vicuna-7b-hf") and self.processor are different. Let me set the processor params explicitly to make the test work independently (sketched below).
And I found an edge case in the unpadding logic. I'm going to commit the fix later today. Thanks :)
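A minimal sketch of what pinning those params could look like — the kwargs shown here are assumptions about which fields drive image-token counting, not the exact values from the PR:

from transformers import AutoProcessor

# hypothetical: pin the fields that affect token counting so the test no
# longer depends on whatever defaults ship with the checkpoint
processor = AutoProcessor.from_pretrained(
    "llava-hf/llava-v1.6-vicuna-7b-hf",
    patch_size=14,                             # ViT patch size the test assumes
    vision_feature_select_strategy="default",  # drop the CLS feature
)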
@zucchini-nlp is this the reason the TRL CI tests are failing for odd-sized images?
@kashif yes, seems like it. We'll add a test on our side as well with odd image sizes.
we had:
from PIL import Image  # imports needed for the snippet below
import numpy as np

dataset_dict = {
"prompt": [
[{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "Describe the image in great detail."}]}],
[{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "Is this bus in the USA?"}]}],
[{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "Give a thorough description of the image."}]}],
[{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "Who are the people in the image?"}]}],
[{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "What is written?"}]}],
],
"chosen": [
[{"role": "assistant", "content": [{"type": "text", "text": "The image features a modern, multi-colored train."}]}],
[{"role": "assistant", "content": [{"type": "text", "text": "Yes, it can be assumed that this bus is in the USA."}]}],
[{"role": "assistant", "content": [{"type": "text", "text": "The image features a forest path."}]}],
[{"role": "assistant", "content": [{"type": "text", "text": "There are two individuals, possibly girls or women."}]}],
[{"role": "assistant", "content": [{"type": "text", "text": '"ccpb".'}]}],
],
"rejected": [
[{"role": "assistant", "content": [{"type": "text", "text": "The image features a modern, colorful train."}]}],
[{"role": "assistant", "content": [{"type": "text", "text": "No, it's not in the USA."}]}],
[{"role": "assistant", "content": [{"type": "text", "text": "The image features a forest path surrounded by trees."}]}],
[{"role": "assistant", "content": [{"type": "text", "text": "In the image, there are two individuals."}]}],
[{"role": "assistant", "content": [{"type": "text", "text": '"ccpb".'}]}],
],
"images": [
[Image.fromarray(np.random.randint(0, 255, (92, 33, 3), dtype=np.uint8))],
[Image.fromarray(np.random.randint(0, 255, (64, 48, 3), dtype=np.uint8))],
[Image.fromarray(np.random.randint(0, 255, (80, 152, 3), dtype=np.uint8))],
[Image.fromarray(np.random.randint(0, 255, (57, 24, 3), dtype=np.uint8))],
[Image.fromarray(np.random.randint(0, 255, (102, 48, 3), dtype=np.uint8))],
],
}
Thanks for the review. I'll check and fix what is needed :)
The simplest edge case looks like the following image, which should not be unpadded. In the previous logic (https://github.com/huggingface/transformers/blob/v4.51.3/src/transformers/models/llava_next/modeling_llava_next.py#L116), new_height is computed with round, so it results in unpadding both the top and the bottom, which is wrong. In the refined logic (cc581ab#diff-24741085a5a99c475d86aa2d98b106a6d3b8685375a2bcc4803f47bd5d2b8852R143), new_height is computed with math.ceil and the padding can contain a residual, so it results in unpadding the bottom, which is also incorrect.
Another problem is that these computations use patch-level lengths. If even one pixel occupies the bottom-most (or right-most) patch, that patch should not be classified as a padding patch. Patch-level computation can't capture this because it loses precision. In conclusion, the unpadding logic should use the recovered pixel-level length, IMO.
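To make the precision argument concrete, here is a toy sketch (not the PR's code; helper names and the example sizes are illustrative) contrasting patch-level against pixel-level padding computation:

import math

PATCH = 14  # assumed ViT patch size

def pad_rows_patch_level(orig_h, orig_w, grid_h, grid_w):
    # pre-fix style: scale directly in patch units and round
    # (assumes the "wide" branch, i.e. the width fills the canvas)
    new_h = round(orig_h * grid_w / orig_w)
    return (grid_h - new_h) // 2  # patch rows stripped from EACH side

def pad_rows_pixel_level(orig_h, orig_w, grid_h, grid_w):
    # recover pixel lengths first; a patch counts as padding only when it
    # is covered entirely by padding pixels
    canvas_h, canvas_w = grid_h * PATCH, grid_w * PATCH
    scaled_h = math.ceil(orig_h * canvas_w / orig_w)  # content height in px
    pad_px = (canvas_h - scaled_h) // 2
    return pad_px // PATCH

# a 309x336 image on a 336x336 canvas (24x24 patch grid):
print(pad_rows_patch_level(309, 336, 24, 24))  # 1 -> strips a patch row per side
print(pad_rows_pixel_level(309, 336, 24, 24))  # 0 -> 13 px of padding < one patch

Rounding at patch level turns a 13-pixel padding strip into a full stripped patch row on each side, discarding patches that still overlap real image content.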
If this is valid and the commits look okay to you, I will fix them for llava_next_video and llava_onevision too.
@cyr0930 Thanks for the detailed information! It makes sense to me that we can't unpad reliably at patch-level precision. Also, let's add an integration test with official weights instead of the dummy test we have currently. Our main goal is to make sure the models can do a forward pass correctly. We can even use the same image sizes provided above, which are failing for TRL 😉
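A rough sketch of the kind of slow test being suggested — the test name, prompt format, and dtype are assumptions, not the test that was eventually merged:

import numpy as np
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaNextForConditionalGeneration
from transformers.testing_utils import slow

@slow
def test_forward_with_odd_image_sizes():
    model_id = "llava-hf/llava-v1.6-vicuna-7b-hf"
    model = LlavaNextForConditionalGeneration.from_pretrained(
        model_id, torch_dtype=torch.float16, device_map="auto"
    )
    processor = AutoProcessor.from_pretrained(model_id)
    # the odd sizes that tripped the TRL CI, quoted from the thread above
    for h, w in [(92, 33), (64, 48), (80, 152), (57, 24), (102, 48)]:
        image = Image.fromarray(np.random.randint(0, 255, (h, w, 3), dtype=np.uint8))
        inputs = processor(
            images=image,
            text="USER: <image>\nDescribe the image. ASSISTANT:",
            return_tensors="pt",
        ).to(model.device)
        model(**inputs)  # must not raise a feature/token-count mismatch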
Okay, I see :)
IMO, reverting the unpad logic to what it was is more appropriate, because the unpad logic doesn't have to mirror the pad logic. As we observed, the unpad logic works in a different unit (num_pad), and removing one pixel vs. removing one padding feature is quite different.
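For context, the model-side unpadding boils down to slicing whole feature rows or columns off both sides, roughly like this (a paraphrase of LLaVA-NeXT's unpad_image helper; the exact rounding has varied across versions):

def unpad_image(tensor, original_size):
    # tensor: (channels, grid_h, grid_w) image features on the padded canvas
    original_height, original_width = original_size
    current_height, current_width = tensor.shape[1:]
    if original_width / original_height > current_width / current_height:
        # width fills the canvas; padding was added top and bottom
        new_height = int(original_height * current_width / original_width)
        padding = (current_height - new_height) // 2
        return tensor[:, padding : current_height - padding, :]
    else:
        # height fills the canvas; padding was added left and right
        new_width = int(original_width * current_height / original_height)
        padding = (current_width - new_width) // 2
        return tensor[:, :, padding : current_width - padding]

Each unit of padding here removes an entire row of features, i.e. a whole patch of pixels, which is why the pixel-level pad logic and the feature-level unpad logic cannot share the same rounding.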
Great, indeed the tests should have raised errors if we had run the slow tests. I am wondering if you could add a non-slow test though: the same as the unpad test, but checking that model.forward runs without error when the input is not a square with even sizes.
Otherwise looks good to me, thanks a lot for handling it!
Thanks! Last comment: let's not use a processor in model tests; we usually prepare dummy inputs and use them instead.
I'll approve, and you can ping me to merge when the comment is addressed.
# prepare processor & input
processor = AutoProcessor.from_pretrained("llava-hf/llava-onevision-qwen2-0.5b-ov-hf")
image = torch.randint(0, 2, (3, 503, 316)).numpy()  # odd-sized image
prompt = "[INST] <image>\nWhat is shown in this image? [/INST]"
inputs_dict = processor(images=[image], text=[prompt], return_tensors="pt", padding=True).to(torch_device)

# prepare model configuration
config = self.model_tester.get_config()
config.text_config.vocab_size = len(processor.tokenizer)
config.vision_config.patch_size = 14
config.vision_config.image_size = processor.image_processor.size["height"]
config.image_grid_pinpoints = processor.image_processor.image_grid_pinpoints
config.image_token_index = processor.image_token_id
I think we can do without relying on the processor. Dummy pixels will be enough as long as image_sizes is [[503, 316]] 😄
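For illustration, a processor-free setup might look roughly like this — the patch count and canvas size are assumptions tied to the config above, and the text inputs would come from the ModelTester's usual dummy preparation:

# hypothetical dummy vision inputs; num_patches must match what the config's
# image_grid_pinpoints produce for a 503x316 image
num_patches = 5
canvas = config.vision_config.image_size  # e.g. 384 for the OneVision checkpoint
pixel_values = torch.randn(1, num_patches, 3, canvas, canvas)
image_sizes = torch.tensor([[503, 316]])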
Thanks for the reviews!
[bug] fix llava processor to calculate unpadding size correctly (huggingface#37988)

* fix llava processor to calculate unpad size correctly
* repo consistency
* Revert "repo consistency" & "setUp in llava family" (reverts commit 26a50af)
* add edge case test for padding & unpadding
* compute unpadding size from original size
* make test config explicit
* Revert "compute unpadding size from original size" (reverts commit 752cd27)
* Revert "add edge case test for padding & unpadding" (reverts commit ccbd094)
* revert unpad logic
* remove irrelevant tests
* model test
* remove processor from model test

Co-authored-by: jaycha <jaycha@ncsoft.com>
What does this PR do?
Additional fix for #37819.
The previous PR didn't fix the _get_unpadded_features methods in the llava-family processors.
Who can review?
@zucchini-nlp