[bug] fix llava processor to calculate unpadding size correctly #37988

Merged
merged 17 commits into huggingface:main, May 13, 2025

Conversation

cyr0930
Contributor

@cyr0930 cyr0930 commented May 7, 2025

What does this PR do?

Additional fix for #37819.
I missed fixing the _get_unpadded_features methods in the llava-family processors in that PR; this one addresses them.
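
For context, the patch-level computation in question has roughly this shape (a paraphrase for illustration, not the exact diff):

    # Rough paraphrase of the llava-next patch-level unpadding computation
    # (illustrative only, not the exact source).
    def get_unpadded_features(height, width, patches_height, patches_width, scale_height, scale_width):
        current_height = patches_height * scale_height
        current_width = patches_width * scale_width

        original_aspect_ratio = width / height
        current_aspect_ratio = current_width / current_height
        if original_aspect_ratio > current_aspect_ratio:
            # wide image: padding rows were added at the top and bottom
            new_height = (height * current_width) // width
            padding = (current_height - new_height) // 2
            current_height -= padding * 2
        else:
            # tall image: padding columns were added at the left and right
            new_width = (width * current_height) // height
            padding = (current_width - new_width) // 2
            current_width -= padding * 2

        unpadded_features = current_height * current_width
        newline_features = current_height
        return unpadded_features, newline_features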

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@zucchini-nlp

@github-actions github-actions bot marked this pull request as draft May 7, 2025 05:20
Contributor

github-actions bot commented May 7, 2025

Hi 👋, thank you for opening this pull request! The pull request is converted to draft by default. The CI will be paused while the PR is in draft mode. When it is ready for review, please click the Ready for review button (at the bottom of the PR page). This will assign reviewers and trigger CI.

@cyr0930
Contributor Author

cyr0930 commented May 7, 2025

Sorry, I mistakenly didn't fix the _get_unpadded_features methods in the previous PR.

@cyr0930 cyr0930 marked this pull request as ready for review May 7, 2025 05:53
@github-actions github-actions bot requested review from molbap and ydshieh May 7, 2025 05:53
Member

@zucchini-nlp zucchini-nlp left a comment

Right, thanks for noticing and fixing it! I left a few comments on the new tests, and we can add a couple of slow tests with real checkpoints from the Hub :)

expected_image_tokens = 1526
image_token_index = 32000
image = torch.randint(0, 2, (3, 501, 322))
expected_image_tokens = 680
Member

hmm, is this really correct? Interesting that the number of tokens is half as many even though the image size increased by a few pixels 🤔

Contributor Author

Ah, I think this is because the model changed from "llava-hf/llava-v1.6-vicuna-7b-hf" to a processor initialized in setUpClass. I'll change it back to "llava-hf/llava-v1.6-vicuna-7b-hf".

Member

OK, so the test was failing before the fix we made on image sizes? Then we can leave it as is and only update this test method.

Contributor Author

No, it passed, but the previous input image torch.randint(0, 2, (3, 500, 316)) didn't cover the edge case.

Member

In that case, my initial question remains: if we use the same processor params with a slightly bigger image, how can the number of tokens be half as many? I want to make sure nothing broke fundamentally.

Contributor Author

Okay, maybe there is something I have to fix beyond just the pad & unpad logic. Let me check thoroughly.

Contributor Author

It turned out the default params of AutoProcessor.from_pretrained("llava-hf/llava-v1.6-vicuna-7b-hf") and self.processor are different. Let me set the processor params explicitly so the test works independently.

I also found an edge case in the unpadding logic. I'll commit the fix later today. Thanks :)
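
(A minimal sketch of what setting the params explicitly could look like; the attribute names and values below are illustrative assumptions, not the final test code.)

    from transformers import AutoProcessor

    # Pin the parameters the test depends on, so it no longer varies with
    # whatever defaults ship with the hub checkpoint (values are assumptions).
    processor = AutoProcessor.from_pretrained("llava-hf/llava-v1.6-vicuna-7b-hf")
    processor.patch_size = 14
    processor.vision_feature_select_strategy = "default"
    processor.image_processor.image_grid_pinpoints = [[336, 672], [672, 336], [672, 672], [1008, 336], [336, 1008]]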

@kashif
Contributor

kashif commented May 7, 2025

@zucchini-nlp is this the reason the TRL CI tests are failing on odd-sized images for LlavaNextForConditionalGeneration? See here: https://github.com/huggingface/trl/actions/runs/14881264112/job/41789730121

@zucchini-nlp
Member

@kashif yes, seems like it. We'll add a test on our side as well with odd image sizes.

@kashif
Contributor

kashif commented May 7, 2025

we had:

    import numpy as np
    from PIL import Image

    dataset_dict = {
        "prompt": [
            [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "Describe the image in great detail."}]}],
            [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "Is this bus in the USA?"}]}],
            [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "Give a thorough description of the image."}]}],
            [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "Who are the people in the image?"}]}],
            [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "What is written?"}]}],
        ],
        "chosen": [
            [{"role": "assistant", "content": [{"type": "text", "text": "The image features a modern, multi-colored train."}]}],
            [{"role": "assistant", "content": [{"type": "text", "text": "Yes, it can be assumed that this bus is in the USA."}]}],
            [{"role": "assistant", "content": [{"type": "text", "text": "The image features a forest path."}]}],
            [{"role": "assistant", "content": [{"type": "text", "text": "There are two individuals, possibly girls or women."}]}],
            [{"role": "assistant", "content": [{"type": "text", "text": '"ccpb".'}]}],
        ],
        "rejected": [
            [{"role": "assistant", "content": [{"type": "text", "text": "The image features a modern, colorful train."}]}],
            [{"role": "assistant", "content": [{"type": "text", "text": "No, it's not in the USA."}]}],
            [{"role": "assistant", "content": [{"type": "text", "text": "The image features a forest path surrounded by trees."}]}],
            [{"role": "assistant", "content": [{"type": "text", "text": "In the image, there are two individuals."}]}],
            [{"role": "assistant", "content": [{"type": "text", "text": '"ccpb".'}]}],
        ],
        "images": [
            # odd-sized random images, which trigger the unpadding edge case
            [Image.fromarray(np.random.randint(0, 255, (92, 33, 3), dtype=np.uint8))],
            [Image.fromarray(np.random.randint(0, 255, (64, 48, 3), dtype=np.uint8))],
            [Image.fromarray(np.random.randint(0, 255, (80, 152, 3), dtype=np.uint8))],
            [Image.fromarray(np.random.randint(0, 255, (57, 24, 3), dtype=np.uint8))],
            [Image.fromarray(np.random.randint(0, 255, (102, 48, 3), dtype=np.uint8))],
        ],
    }

@cyr0930
Contributor Author

cyr0930 commented May 7, 2025

Right, thanks for noticing and fixing it! I left a few comments on the new tests, and we can add a couple of slow tests with real checkpoints from the Hub :)

Thanks for the review. I'll check and fix what is needed :)

@cyr0930
Contributor Author

cyr0930 commented May 9, 2025

The simplest edge case looks like the following image, which should not be unpadded.
[image]

In the previous logic (https://github.com/huggingface/transformers/blob/v4.51.3/src/transformers/models/llava_next/modeling_llava_next.py#L116), new_height is computed with round, so it results in unpadding both the top and the bottom, which is wrong.

And in the refined logic (cc581ab#diff-24741085a5a99c475d86aa2d98b106a6d3b8685375a2bcc4803f47bd5d2b8852R143), new_height is computed with math.ceil and the padding can contain a residual, so it results in unpadding the bottom, which is also incorrect.

Another problem is that these computations use patch-level lengths. If even one pixel occupies the bottom-most (or right-most) patch, that patch should not be classified as a padding patch, but patch-level computation can't capture this because precision is lost.

In conclusion, the unpadding logic should use the recovered (original-resolution) length, IMO.
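
A toy illustration of the precision loss (numbers chosen for illustration; this is not the PR's code):

    import math

    # If the resized content is 337 px tall and patches are 14 px, the content
    # spans ceil(337 / 14) = 25 patch rows: the single extra pixel makes the
    # 25th row non-padding, but patch-level floor arithmetic miscounts it.
    patch_size = 14
    content_height_px = 337

    rows_floor = content_height_px // patch_size           # 24 -> would unpad a real row
    rows_ceil = math.ceil(content_height_px / patch_size)  # 25 -> correct

    print(rows_floor, rows_ceil)  # 24 25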

@cyr0930
Contributor Author

cyr0930 commented May 9, 2025

If this is valid and the commits look okay to you, I will fix them for llava_next_video and llava_onevision too.

@zucchini-nlp
Member

@cyr0930 Thanks for the detailed information! It makes sense to me that we can't unpad on the recovered length, since the features are already patchified by the time unpadding happens. I'd prefer to leave everything as is and adapt the logic for expanding placeholder tokens instead, because we're currently over-engineering and diverging from the original implementation.

Also, let's add an integration test with official weights instead of the dummy test we currently have. Our main goal is to make sure the models can do a forward pass correctly. We can even use the same image sizes provided above, which are failing for TRL 😉

@cyr0930
Contributor Author

cyr0930 commented May 9, 2025

@cyr0930 Thanks for the detailed information! It makes sense to me that we can't unpad on the recovered length, since the features are already patchified by the time unpadding happens. I'd prefer to leave everything as is and adapt the logic for expanding placeholder tokens instead, because we're currently over-engineering and diverging from the original implementation.

Also, let's add an integration test with official weights instead of the dummy test we currently have. Our main goal is to make sure the models can do a forward pass correctly. We can even use the same image sizes provided above, which are failing for TRL 😉

Okay, I see :)
Since it could differ somewhat from the original implementation, I'll roll back some of the changes.
Actually, removing or adding one or two padded features doesn't make much difference, since the vision encoder has a larger receptive field and can cover edge-case images to some extent.
Thanks for the review! I will also convert the test cases to an integration test.

@cyr0930
Contributor Author

cyr0930 commented May 10, 2025

IMO reverting the unpad logic to what it was is more appropriate, because the unpad logic doesn't have to mirror the pad logic. As we observed, the unpad logic uses a different unit (num_pad), and removing one pixel vs. removing one padding feature are quite different operations.
In the same manner, the unpad test is no longer relevant; checking only the number of image tokens and the number of image features is fine, so I removed unpad_test.
And I don't think more tests are required, since this part (https://github.com/huggingface/transformers/blob/v4.51.3/src/transformers/models/llava_next/modeling_llava_next.py#L643) always checks that the number of image tokens equals the number of image features, so all the existing integration tests would fail if the unpad logic were wrong.
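
(For reference, the check being relied on has roughly this shape; a paraphrase, not the exact source.)

    import torch

    # Rough paraphrase of the modeling-side consistency check: any miscount in
    # the unpad logic makes this raise, which in turn fails the integration tests.
    def check_image_tokens(input_ids: torch.Tensor, image_features: torch.Tensor, image_token_index: int) -> None:
        n_image_tokens = (input_ids == image_token_index).sum().item()
        n_image_features = image_features.shape[0]
        if n_image_tokens != n_image_features:
            raise ValueError(
                f"Image features and image tokens do not match: tokens: {n_image_tokens}, features: {n_image_features}"
            )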

Member

@zucchini-nlp zucchini-nlp left a comment

Great, indeed the tests should have raised errors if we had run the slow tests. I'm wondering if you could add a non-slow test though: the same as the unpad test, but checking that model.forward runs without error when the input is not a square with even sizes.

Otherwise looks good to me, thanks a lot for handling it!

Member

@zucchini-nlp zucchini-nlp left a comment

Thanks! Last comment: let's not use a processor in model tests; we usually prepare dummy inputs and use them instead.

I'll approve and you can ping me to merge when the comment is addressed

Comment on lines 270 to 283
# prepare processor & input
processor = AutoProcessor.from_pretrained("llava-hf/llava-onevision-qwen2-0.5b-ov-hf")
image = torch.randint(0, 2, (3, 503, 316)).numpy() # odd-sized image
prompt = "[INST] <image>\nWhat is shown in this image? [/INST]"
inputs_dict = processor(images=[image], text=[prompt], return_tensors="pt", padding=True).to(torch_device)

# prepare model configuration
config = self.model_tester.get_config()
config.text_config.vocab_size = len(processor.tokenizer)
config.vision_config.patch_size = 14
config.vision_config.image_size = processor.image_processor.size["height"]
config.image_grid_pinpoints = processor.image_processor.image_grid_pinpoints
config.image_token_index = processor.image_token_id

Member

I think we can do without relying on the processor. Dummy pixels will be enough as long as image_sizes is [[503, 316]] 😄
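
(A sketch of what processor-free dummy inputs could look like; the shapes and sizes below are assumptions, not the final test code.)

    import torch

    # llava-style dummy inputs: pixel_values is (batch, num_patches, channels,
    # height, width), and image_sizes carries the original (height, width) so
    # the unpadding path is exercised. All concrete numbers are placeholders.
    pixel_values = torch.rand(1, 5, 3, 384, 384)
    image_sizes = torch.tensor([[503, 316]])  # the odd-sized image from this test
    # outputs = model(input_ids=input_ids, pixel_values=pixel_values, image_sizes=image_sizes)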

@cyr0930 cyr0930 requested a review from zucchini-nlp May 13, 2025 13:33
@cyr0930
Contributor Author

cyr0930 commented May 13, 2025

Thanks! Last comment: let's not use a processor in model tests; we usually prepare dummy inputs and use them instead.

I'll approve and you can ping me to merge when the comment is addressed

Thanks for the reviews!

@zucchini-nlp zucchini-nlp enabled auto-merge (squash) May 13, 2025 13:38
@zucchini-nlp zucchini-nlp merged commit a5cc7a6 into huggingface:main May 13, 2025
15 checks passed
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

zucchini-nlp pushed a commit to zucchini-nlp/transformers that referenced this pull request May 14, 2025
…ingface#37988)

* fix llava processor to calculate unpad size correctly

* repo consistency

* Revert "repo consistency" & "setUp in llava family"

This reverts commit 26a50af.

* add edge case test for padding & unpadding

* compute unpadding size from original size

* make test config explicit

* Revert "compute unpadding size from original size"

This reverts commit 752cd27.

* Revert "add edge case test for padding & unpadding"

This reverts commit ccbd094.

* revert unpad logic

* remove irrelevant tests

* model test

* remove processor from model test

---------

Co-authored-by: jaycha <jaycha@ncsoft.com>