[Bug] Mismatch between get_multimodal_embedding output and PlaceholderRange #15144

Open
2 of 7 tasks
DarkLight1337 opened this issue Mar 19, 2025 · 1 comment
Labels: bug, help wanted, multi-modality (#4194), v1

Comments

Member

DarkLight1337 commented Mar 19, 2025

In V1, we expect the output of get_multimodal_embedding to correspond to the PlaceholderRange, which is in turn constructed based on PromptUpdateDetails.features. However, the current V1 code doesn't validate this, causing the model to crash during inference when under high load (e.g. #14897, #14963).
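
For illustration, the missing check could look roughly like the sketch below. It only assumes that each placeholder range carries an offset and a length; `PlaceholderSpan` and `check_embeddings_match_placeholders` are hypothetical names made up for this example, not vLLM APIs.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class PlaceholderSpan:
    """Hypothetical stand-in for a placeholder range (not the vLLM class)."""
    offset: int  # index of the first placeholder token in the prompt
    length: int  # number of placeholder tokens reserved for this item


def check_embeddings_match_placeholders(
    embedding_lengths: Sequence[int],
    placeholders: Sequence[PlaceholderSpan],
) -> None:
    """Raise ValueError if any item's embedding row count differs from its span."""
    if len(embedding_lengths) != len(placeholders):
        raise ValueError(
            f"got {len(embedding_lengths)} embeddings "
            f"for {len(placeholders)} placeholder ranges"
        )
    for i, (rows, span) in enumerate(zip(embedding_lengths, placeholders)):
        if rows != span.length:
            raise ValueError(
                f"item {i}: embedding has {rows} rows, but the placeholder "
                f"range at offset {span.offset} expects {span.length}"
            )


# Example: the second image yields 4 rows, but its range reserves 6 tokens.
spans = [PlaceholderSpan(offset=5, length=6), PlaceholderSpan(offset=20, length=6)]
check_embeddings_match_placeholders([6, 6], spans)  # passes silently
try:
    check_embeddings_match_placeholders([6, 4], spans)
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```

Failing early with a message like this would surface the mismatch at the point where embeddings are produced, rather than as a crash later during inference.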

From a quick look at the code, these models output embedding sizes which are inconsistent with the placeholder range:

(Basically, any model that has image newline/column tokens after applying the HF processor needs a mask to map image patch features to image embeddings, as described below.)

To fix this, we can follow these steps:

  1. Update the multi-modal processor to output a mask indicating which positions in the PlaceholderRange-aligned embeddings the patch features (output by the vision encoder) should be assigned to. This mask can be called embed_is_patch.
  2. Use scatter_patch_features to scatter the patch features into the image embedding tensor.
  3. When merging multimodal embeddings, use select_patch_features to recover the patch features from the image embeddings. The number of patch features should correspond to the number of image tokens (which is a subset of the feature tokens in PromptUpdateDetails).
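
The three steps above can be sketched without any framework as follows. This is a minimal illustration under stated assumptions, not the actual implementation: `scatter_sketch` and `select_sketch` are hypothetical helpers that do not reproduce the signatures of scatter_patch_features or select_patch_features, and non-patch positions are marked with `None` purely for readability (a tensor-based version would need a different filler).

```python
from typing import List, Optional, Sequence

Vector = List[float]


def scatter_sketch(
    patch_features: Sequence[Vector],
    embed_is_patch: Sequence[bool],
) -> List[Optional[Vector]]:
    """Spread patch features over placeholder-aligned positions.

    Positions where the mask is False (e.g. image newline tokens) are
    left as None in this sketch.
    """
    expected = sum(1 for flag in embed_is_patch if flag)
    if expected != len(patch_features):
        raise ValueError(
            f"mask marks {expected} patch positions, "
            f"but got {len(patch_features)} patch features"
        )
    remaining = iter(patch_features)
    return [next(remaining) if flag else None for flag in embed_is_patch]


def select_sketch(embeddings: Sequence[Optional[Vector]]) -> List[Vector]:
    """Recover the patch features by dropping the non-patch positions."""
    return [row for row in embeddings if row is not None]


# Hypothetical 2x2 patch grid with one newline token after each row,
# so the placeholder range covers 6 positions but only 4 are patches.
embed_is_patch = [True, True, False, True, True, False]
patch_features = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]

scattered = scatter_sketch(patch_features, embed_is_patch)
recovered = select_sketch(scattered)
```

Here `scattered` has one entry per placeholder position, which is what keeps it aligned with the placeholder range, while `recovered` contains only the rows that belong to image tokens.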

Follow-up work:

  • Update model development docs for Fuyu (assigned to @DarkLight1337)
  • Add validation in V1 engine (assigned to @ywang96)
DarkLight1337 converted this from a draft issue Mar 19, 2025
DarkLight1337 moved this from Todo to In Progress in Multi-modality Core Mar 19, 2025
DarkLight1337 changed the title from "[V1] Fix mismatch between get_multimodal_embedding output and PlaceholderRange" to "[Bug] Mismatch between get_multimodal_embedding output and PlaceholderRange" Mar 19, 2025
DarkLight1337 added the bug, v1, multi-modality (#4194), and help wanted labels Mar 19, 2025
Contributor

kylehh commented Mar 25, 2025

I will work on Fuyu
