[Bug] Mismatch between `get_multimodal_embedding` output and `PlaceholderRange` #15144
Labels: bug (something isn't working), help wanted (extra attention is needed), multi-modality (related to multi-modality, #4194), v1
In V1, we expect the output of `get_multimodal_embedding` to correspond to the `PlaceholderRange`, which is in turn constructed based on `PromptUpdateDetails.features`. However, the current V1 code doesn't validate this, causing the model to crash during inference under high load (e.g. #14897, #14963).

From a quick look at the code, these models output embedding sizes which are inconsistent with the placeholder range:
(Basically, any model that has image newline/column tokens after applying HF processor needs a mask to map image patch features to image embeddings, as described below.)
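For concreteness, here is a toy sketch of that mismatch; the grid size and hidden size are made-up numbers for illustration, not taken from any particular model:

```python
import torch

# Hypothetical example: a 2x2 grid of image patches, where the HF processor
# appends one newline token after each row of patches.
num_rows, num_cols = 2, 2
num_patches = num_rows * num_cols            # features from the vision encoder: 4
placeholder_len = num_rows * (num_cols + 1)  # tokens covered by PlaceholderRange: 6

# The vision encoder only produces one feature per patch...
patch_features = torch.randn(num_patches, 8)  # (4, hidden_size)

# ...but the scheduler slices input embeddings by PlaceholderRange, so the
# model is expected to return `placeholder_len` embeddings. Returning only
# `num_patches` of them is exactly the mismatch described in this issue.
assert patch_features.shape[0] != placeholder_len  # 4 != 6
```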
To fix this, we can follow these steps:

1. Construct a boolean mask indicating which positions of the `PlaceholderRange`-aligned embeddings the patch features (outputted by the vision encoder) should be assigned to. This mask can be called `embed_is_patch`.
2. Call `scatter_patch_features` to scatter the patch features into the image embedding tensor.
3. Call `select_patch_features` to recover the patch features from the image embeddings. The number of patch features should correspond to the number of image tokens (which is a subset of the feature tokens in `PromptUpdateDetails`).

Follow-up work:
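The helpers in the steps above could be sketched roughly as follows; the signatures and the mask layout here are assumptions for illustration, not vLLM's actual implementation:

```python
import torch

def scatter_patch_features(patch_features: torch.Tensor,
                           embed_is_patch: torch.Tensor) -> torch.Tensor:
    """Place patch features into a PlaceholderRange-aligned embedding tensor.

    embed_is_patch: bool mask of shape (placeholder_len,) marking which
    positions hold patch features (as opposed to e.g. newline/column tokens,
    which are left as zeros here for simplicity).
    """
    placeholder_len = embed_is_patch.shape[0]
    embeds = patch_features.new_zeros(placeholder_len, patch_features.shape[-1])
    embeds[embed_is_patch] = patch_features
    return embeds

def select_patch_features(embeds: torch.Tensor,
                          embed_is_patch: torch.Tensor) -> torch.Tensor:
    """Recover the patch features from PlaceholderRange-aligned embeddings."""
    return embeds[embed_is_patch]

# Hypothetical 2x2 patch grid with a newline token after each row, so the
# mask over the placeholder is [patch, patch, newline, patch, patch, newline]:
embed_is_patch = torch.tensor([True, True, False, True, True, False])
patch_features = torch.randn(4, 8)

image_embeds = scatter_patch_features(patch_features, embed_is_patch)
assert image_embeds.shape == (6, 8)  # now matches the placeholder length
assert torch.equal(select_patch_features(image_embeds, embed_is_patch),
                   patch_features)  # round-trips the patch features
```

With this layout, the tensor returned by `get_multimodal_embedding` always has exactly `PlaceholderRange` length, so the V1 scheduler can slice it safely even under chunked/high-load conditions.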