[Bug] Mismatch between `get_multimodal_embedding` output and `PlaceholderRange` #15144
Labels: bug (something isn't working), help wanted (extra attention is needed), multi-modality (related to multi-modality, #4194), v1
In V1, we expect the output of `get_multimodal_embedding` to correspond to the `PlaceholderRange`, which is in turn constructed based on `PromptUpdateDetails.features`. However, the current V1 code doesn't validate this, causing the model to crash during inference under high load (e.g. #14897, #14963).

From a quick look at the code, these models output embedding sizes which are inconsistent with the placeholder range:
(Basically, any model that has image newline/column tokens after applying HF processor needs a mask to map image patch features to image embeddings, as described below.)
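For concreteness, here is a toy sketch of that mismatch; the grid size and hidden size are made-up numbers for illustration, not taken from any particular model:

```python
import torch

# Hypothetical example: a 2x2 grid of image patches, where the HF processor
# appends one newline token after each row of patches.
num_rows, num_cols = 2, 2
num_patches = num_rows * num_cols            # features from the vision encoder: 4
placeholder_len = num_rows * (num_cols + 1)  # tokens covered by PlaceholderRange: 6

# The vision encoder only produces one feature per patch...
patch_features = torch.randn(num_patches, 8)  # (4, hidden_size)

# ...but the scheduler slices input embeddings by PlaceholderRange, so the
# model is expected to return `placeholder_len` embeddings. Returning only
# `num_patches` of them is exactly the mismatch described in this issue.
assert patch_features.shape[0] != placeholder_len  # 4 != 6
```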
To fix this, we can follow these steps:

1. Construct a boolean mask indicating which positions of the `PlaceholderRange`-aligned embeddings the patch features (outputted by the vision encoder) should be assigned to. This mask can be called `embed_is_patch`.
2. Call `scatter_patch_features` to scatter the patch features into the image embedding tensor.
3. Call `select_patch_features` to recover the patch features from the image embeddings. The number of patch features should correspond to the number of image tokens (which is a subset of the feature tokens in `PromptUpdateDetails`).

Follow-up work:
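The helpers in the steps above could be sketched roughly as follows; the signatures and the mask layout here are assumptions for illustration, not vLLM's actual implementation:

```python
import torch

def scatter_patch_features(patch_features: torch.Tensor,
                           embed_is_patch: torch.Tensor) -> torch.Tensor:
    """Place patch features into a PlaceholderRange-aligned embedding tensor.

    embed_is_patch: bool mask of shape (placeholder_len,) marking which
    positions hold patch features (as opposed to e.g. newline/column tokens,
    which are left as zeros here for simplicity).
    """
    placeholder_len = embed_is_patch.shape[0]
    embeds = patch_features.new_zeros(placeholder_len, patch_features.shape[-1])
    embeds[embed_is_patch] = patch_features
    return embeds

def select_patch_features(embeds: torch.Tensor,
                          embed_is_patch: torch.Tensor) -> torch.Tensor:
    """Recover the patch features from PlaceholderRange-aligned embeddings."""
    return embeds[embed_is_patch]

# Hypothetical 2x2 patch grid with a newline token after each row, so the
# mask over the placeholder is [patch, patch, newline, patch, patch, newline]:
embed_is_patch = torch.tensor([True, True, False, True, True, False])
patch_features = torch.randn(4, 8)

image_embeds = scatter_patch_features(patch_features, embed_is_patch)
assert image_embeds.shape == (6, 8)  # now matches the placeholder length
assert torch.equal(select_patch_features(image_embeds, embed_is_patch),
                   patch_features)  # round-trips the patch features
```

With this layout, the tensor returned by `get_multimodal_embedding` always has exactly `PlaceholderRange` length, so the V1 scheduler can slice it safely even under chunked/high-load conditions.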