[bug] fix llava processor to calculate unpadding size correctly #37988

Merged
merged 17 commits into huggingface:main, May 13, 2025

Conversation

cyr0930
Contributor

@cyr0930 cyr0930 commented May 7, 2025

What does this PR do?

Additional fix for #37819.
I missed fixing the _get_unpadded_features methods in the llava-family processors in that PR; this one addresses them.
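
For context, the patch-level computation in question has roughly this shape (a paraphrase for illustration, not the exact diff):

    # Rough paraphrase of the llava-next patch-level unpadding computation
    # (illustrative only, not the exact source).
    def get_unpadded_features(height, width, patches_height, patches_width, scale_height, scale_width):
        current_height = patches_height * scale_height
        current_width = patches_width * scale_width

        original_aspect_ratio = width / height
        current_aspect_ratio = current_width / current_height
        if original_aspect_ratio > current_aspect_ratio:
            # wide image: padding rows were added at the top and bottom
            new_height = (height * current_width) // width
            padding = (current_height - new_height) // 2
            current_height -= padding * 2
        else:
            # tall image: padding columns were added at the left and right
            new_width = (width * current_height) // height
            padding = (current_width - new_width) // 2
            current_width -= padding * 2

        unpadded_features = current_height * current_width
        newline_features = current_height
        return unpadded_features, newline_features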

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@zucchini-nlp

@github-actions github-actions bot marked this pull request as draft May 7, 2025 05:20
Contributor

github-actions bot commented May 7, 2025

Hi 👋, thank you for opening this pull request! The pull request is converted to draft by default. The CI will be paused while the PR is in draft mode. When it is ready for review, please click the Ready for review button (at the bottom of the PR page). This will assign reviewers and trigger CI.

@cyr0930
Contributor Author

cyr0930 commented May 7, 2025

Sorry, I mistakenly didn't fix the _get_unpadded_features methods in the previous PR.

@cyr0930 cyr0930 marked this pull request as ready for review May 7, 2025 05:53
@github-actions github-actions bot requested review from molbap and ydshieh May 7, 2025 05:53
Member

@zucchini-nlp zucchini-nlp left a comment

Right, thanks for noticing and fixing it! I left a few comments on the new tests, and we can add a couple of slow tests with real checkpoints from the Hub :)

expected_image_tokens = 1526
image_token_index = 32000
image = torch.randint(0, 2, (3, 501, 322))
expected_image_tokens = 680
Member

hmm, is this really correct? Interesting that the number of tokens is half as many even though the image size increased by a few pixels 🤔

Contributor Author

Ah, I think this is because the model changed from "llava-hf/llava-v1.6-vicuna-7b-hf" to a processor initialized in setUpClass. I'll change it back to "llava-hf/llava-v1.6-vicuna-7b-hf".

Member

OK, so the test was failing before the fix we made on image sizes? Then we can leave it as is and only update this test method.

Contributor Author

No, it passed, but the previous input image torch.randint(0, 2, (3, 500, 316)) didn't cover the edge case.

Member

In that case, my initial question remains: if we use the same processor params with a slightly bigger image, how can the number of tokens be half as many? I want to make sure nothing broke fundamentally.

Contributor Author

Okay, maybe there is something I have to fix beyond just the pad & unpad logic. Let me check thoroughly.

Contributor Author

It turned out the default params of AutoProcessor.from_pretrained("llava-hf/llava-v1.6-vicuna-7b-hf") and self.processor are different. Let me set the processor params explicitly so the test works independently.

I also found an edge case in the unpadding logic. I'll commit the fix later today. Thanks :)
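
(A minimal sketch of what setting the params explicitly could look like; the attribute names and values below are illustrative assumptions, not the final test code.)

    from transformers import AutoProcessor

    # Pin the parameters the test depends on, so it no longer varies with
    # whatever defaults ship with the hub checkpoint (values are assumptions).
    processor = AutoProcessor.from_pretrained("llava-hf/llava-v1.6-vicuna-7b-hf")
    processor.patch_size = 14
    processor.vision_feature_select_strategy = "default"
    processor.image_processor.image_grid_pinpoints = [[336, 672], [672, 336], [672, 672], [1008, 336], [336, 1008]]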

@kashif
Contributor

kashif commented May 7, 2025

@zucchini-nlp is this the reason the TRL CI tests are failing on odd-sized images for LlavaNextForConditionalGeneration? See here: https://github.com/huggingface/trl/actions/runs/14881264112/job/41789730121

@zucchini-nlp
Member

@kashif yes, seems like it. We'll add a test on our side as well with odd image sizes.

@kashif
Contributor

kashif commented May 7, 2025

we had:

    import numpy as np
    from PIL import Image

    dataset_dict = {
        "prompt": [
            [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "Describe the image in great detail."}]}],
            [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "Is this bus in the USA?"}]}],
            [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "Give a thorough description of the image."}]}],
            [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "Who are the people in the image?"}]}],
            [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "What is written?"}]}],
        ],
        "chosen": [
            [{"role": "assistant", "content": [{"type": "text", "text": "The image features a modern, multi-colored train."}]}],
            [{"role": "assistant", "content": [{"type": "text", "text": "Yes, it can be assumed that this bus is in the USA."}]}],
            [{"role": "assistant", "content": [{"type": "text", "text": "The image features a forest path."}]}],
            [{"role": "assistant", "content": [{"type": "text", "text": "There are two individuals, possibly girls or women."}]}],
            [{"role": "assistant", "content": [{"type": "text", "text": '"ccpb".'}]}],
        ],
        "rejected": [
            [{"role": "assistant", "content": [{"type": "text", "text": "The image features a modern, colorful train."}]}],
            [{"role": "assistant", "content": [{"type": "text", "text": "No, it's not in the USA."}]}],
            [{"role": "assistant", "content": [{"type": "text", "text": "The image features a forest path surrounded by trees."}]}],
            [{"role": "assistant", "content": [{"type": "text", "text": "In the image, there are two individuals."}]}],
            [{"role": "assistant", "content": [{"type": "text", "text": '"ccpb".'}]}],
        ],
        "images": [
            # odd-sized random images, which trigger the unpadding edge case
            [Image.fromarray(np.random.randint(0, 255, (92, 33, 3), dtype=np.uint8))],
            [Image.fromarray(np.random.randint(0, 255, (64, 48, 3), dtype=np.uint8))],
            [Image.fromarray(np.random.randint(0, 255, (80, 152, 3), dtype=np.uint8))],
            [Image.fromarray(np.random.randint(0, 255, (57, 24, 3), dtype=np.uint8))],
            [Image.fromarray(np.random.randint(0, 255, (102, 48, 3), dtype=np.uint8))],
        ],
    }

@cyr0930
Contributor Author

cyr0930 commented May 7, 2025

Right, thanks for noticing and fixing it! I left a few comments on the new tests, and we can add a couple of slow tests with real checkpoints from the Hub :)

Thanks for the review. I'll check and fix what is needed :)

@cyr0930
Contributor Author

cyr0930 commented May 9, 2025

The simplest edge case looks like the following image, which should not be unpadded.
[image]

In the previous logic (https://github.com/huggingface/transformers/blob/v4.51.3/src/transformers/models/llava_next/modeling_llava_next.py#L116), new_height is computed with round, so it results in unpadding both the top and the bottom, which is wrong.

And in the refined logic (cc581ab#diff-24741085a5a99c475d86aa2d98b106a6d3b8685375a2bcc4803f47bd5d2b8852R143), new_height is computed with math.ceil and the padding can contain a residual, so it results in unpadding the bottom, which is also incorrect.

Another problem is that these computations use patch-level lengths. If even one pixel occupies the bottom-most (or right-most) patch, that patch should not be classified as a padding patch, but patch-level computation can't capture this because precision is lost.

In conclusion, the unpadding logic should use the recovered (original-resolution) length, IMO.
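
A toy illustration of the precision loss (numbers chosen for illustration; this is not the PR's code):

    import math

    # If the resized content is 337 px tall and patches are 14 px, the content
    # spans ceil(337 / 14) = 25 patch rows: the single extra pixel makes the
    # 25th row non-padding, but patch-level floor arithmetic miscounts it.
    patch_size = 14
    content_height_px = 337

    rows_floor = content_height_px // patch_size           # 24 -> would unpad a real row
    rows_ceil = math.ceil(content_height_px / patch_size)  # 25 -> correct

    print(rows_floor, rows_ceil)  # 24 25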

@cyr0930
Contributor Author

cyr0930 commented May 9, 2025

If this is valid and the commits look okay to you, I will fix them for llava_next_video and llava_onevision too.

@zucchini-nlp
Member

@cyr0930 Thanks for the detailed information! It makes sense to me that we can't unpad on the recovered length, since the features are already patchified by the time unpadding happens. I'd prefer to leave everything as is and adapt the logic for expanding placeholder tokens instead, because we're currently over-engineering and diverging from the original implementation.

Also, let's add an integration test with official weights instead of the dummy test we currently have. Our main goal is to make sure the models can do a forward pass correctly. We can even use the same image sizes provided above, which are failing for TRL 😉

@cyr0930
Contributor Author

cyr0930 commented May 9, 2025

@cyr0930 Thanks for the detailed information! It makes sense to me that we can't unpad on the recovered length, since the features are already patchified by the time unpadding happens. I'd prefer to leave everything as is and adapt the logic for expanding placeholder tokens instead, because we're currently over-engineering and diverging from the original implementation.

Also, let's add an integration test with official weights instead of the dummy test we currently have. Our main goal is to make sure the models can do a forward pass correctly. We can even use the same image sizes provided above, which are failing for TRL 😉

Okay, I see :)
Since it could differ somewhat from the original implementation, I'll roll back some of the changes.
Actually, removing or adding one or two padded features doesn't make much difference, since the vision encoder has a larger receptive field and can cover edge-case images to some extent.
Thanks for the review! I will also convert the test cases to an integration test.

@cyr0930
Contributor Author

cyr0930 commented May 10, 2025

IMO reverting the unpad logic to what it was is more appropriate, because the unpad logic doesn't have to mirror the pad logic. As we observed, the unpad logic uses a different unit (num_pad), and removing one pixel vs. removing one padding feature are quite different operations.
In the same manner, the unpad test is no longer relevant; checking only the number of image tokens and the number of image features is fine, so I removed unpad_test.
And I don't think more tests are required, since this part (https://github.com/huggingface/transformers/blob/v4.51.3/src/transformers/models/llava_next/modeling_llava_next.py#L643) always checks that the number of image tokens equals the number of image features, so all the existing integration tests would fail if the unpad logic were wrong.
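
(For reference, the check being relied on has roughly this shape; a paraphrase, not the exact source.)

    import torch

    # Rough paraphrase of the modeling-side consistency check: any miscount in
    # the unpad logic makes this raise, which in turn fails the integration tests.
    def check_image_tokens(input_ids: torch.Tensor, image_features: torch.Tensor, image_token_index: int) -> None:
        n_image_tokens = (input_ids == image_token_index).sum().item()
        n_image_features = image_features.shape[0]
        if n_image_tokens != n_image_features:
            raise ValueError(
                f"Image features and image tokens do not match: tokens: {n_image_tokens}, features: {n_image_features}"
            )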

Member

@zucchini-nlp zucchini-nlp left a comment

Great, indeed the tests should have raised errors if we had run the slow tests. I'm wondering if you could add a non-slow test though: the same as the unpad test, but checking that model.forward runs without error when the input is not a square with even sizes.

Otherwise looks good to me, thanks a lot for handling it!

Member

@zucchini-nlp zucchini-nlp left a comment

Thanks! Last comment: let's not use a processor in model tests; we usually prepare dummy inputs and use them instead.

I'll approve and you can ping me to merge when the comment is addressed

Comment on lines 270 to 283
# prepare processor & input
processor = AutoProcessor.from_pretrained("llava-hf/llava-onevision-qwen2-0.5b-ov-hf")
image = torch.randint(0, 2, (3, 503, 316)).numpy() # odd-sized image
prompt = "[INST] <image>\nWhat is shown in this image? [/INST]"
inputs_dict = processor(images=[image], text=[prompt], return_tensors="pt", padding=True).to(torch_device)

# prepare model configuration
config = self.model_tester.get_config()
config.text_config.vocab_size = len(processor.tokenizer)
config.vision_config.patch_size = 14
config.vision_config.image_size = processor.image_processor.size["height"]
config.image_grid_pinpoints = processor.image_processor.image_grid_pinpoints
config.image_token_index = processor.image_token_id

Member

I think we can do without relying on the processor. Dummy pixels will be enough as long as image_sizes is [[503, 316]] 😄
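
(A sketch of what processor-free dummy inputs could look like; the shapes and sizes below are assumptions, not the final test code.)

    import torch

    # llava-style dummy inputs: pixel_values is (batch, num_patches, channels,
    # height, width), and image_sizes carries the original (height, width) so
    # the unpadding path is exercised. All concrete numbers are placeholders.
    pixel_values = torch.rand(1, 5, 3, 384, 384)
    image_sizes = torch.tensor([[503, 316]])  # the odd-sized image from this test
    # outputs = model(input_ids=input_ids, pixel_values=pixel_values, image_sizes=image_sizes)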

@cyr0930 cyr0930 requested a review from zucchini-nlp May 13, 2025 13:33
@cyr0930
Contributor Author

cyr0930 commented May 13, 2025

Thanks! Last comment: let's not use a processor in model tests; we usually prepare dummy inputs and use them instead.

I'll approve and you can ping me to merge when the comment is addressed

Thanks for the reviews!

@zucchini-nlp zucchini-nlp enabled auto-merge (squash) May 13, 2025 13:38
@zucchini-nlp zucchini-nlp merged commit a5cc7a6 into huggingface:main May 13, 2025
15 checks passed
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

zucchini-nlp pushed a commit to zucchini-nlp/transformers that referenced this pull request May 14, 2025
…ingface#37988)

* fix llava processor to calculate unpad size correctly

* repo consistency

* Revert "repo consistency" & "setUp in llava family"

This reverts commit 26a50af.

* add edge case test for padding & unpadding

* compute unpadding size from original size

* make test config explicit

* Revert "compute unpadding size from original size"

This reverts commit 752cd27.

* Revert "add edge case test for padding & unpadding"

This reverts commit ccbd094.

* revert unpad logic

* remove irrelevant tests

* model test

* remove processor from model test

---------

Co-authored-by: jaycha <jaycha@ncsoft.com>