Error in input expansion for generate with num_return_sequences > 1 for multi-image inputs to AutoModelForImageTextToText #37900
Who can help?
@zucchini-nlp @amyeroberts @qubvel
Reproduction
I want to generate multiple responses to the same prompt with an image-text-to-text model. One straightforward way to do this is to call `generate` with `num_return_sequences` > 1 in the `GenerationConfig`. However, there appears to be an issue with this. I will use the `google/gemma-3-12b-it` model to present the issue, but I have anecdotally observed it with other models as well (`mistral-community/pixtral-12b`, `mistralai/Mistral-Small-3.1-24B-Base-2503`, etc.), though I am not sure which ones exactly, or to what extent the specific model influences the issue.

When `generate` is used with `num_return_sequences` > 1, the inputs are first expanded and then passed to the sample function:

transformers/src/transformers/generation/utils.py
Lines 2486 to 2492 in 86777b5

I suspect that this expansion does not work as expected for image inputs when multiple images are present, leading to this error. More details under reproduction/expected behavior.
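As far as I can tell, the expansion boils down to calling `repeat_interleave(expand_size, dim=0)` on every tensor in the model kwargs. A toy sketch of why that can misalign multi-image inputs (the shapes here are illustrative stand-ins, not the real processor output, which flattens all images of a batch along dim 0):

```python
import torch

# Toy stand-in for a 1-prompt, 2-image batch: dim 0 indexes images, not samples.
num_images, expand_size = 2, 4
pixel_values = torch.stack(
    [torch.full((3, 4, 4), float(i)) for i in range(num_images)]
)

# What generate's input expansion effectively does to every tensor kwarg:
expanded = pixel_values.repeat_interleave(expand_size, dim=0)

# Each *image* is now repeated expand_size times back-to-back:
print(expanded[:, 0, 0, 0])  # tensor([0., 0., 0., 0., 1., 1., 1., 1.])

# But the expanded text batch is expand_size identical copies of the prompt,
# each of which needs the full image *group*, i.e. a tiled ordering instead:
wanted = pixel_values.repeat(expand_size, 1, 1, 1)
print(wanted[:, 0, 0, 0])    # tensor([0., 1., 0., 1., 0., 1., 0., 1.])
```

In other words, interleaving per image and tiling per sample only coincide when every sample has exactly one image, which would explain why single-image prompts are unaffected.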
Here is a code snippet that reproduces the behavior in my setting:
This yields the following list for `outputs`:

Note how the first two captions are for the first image, the next two for the second image, and so on. This should not be the case: the model is capable of describing the correct image. How this can be determined is described under expected behavior.
Expected behavior
If, instead of asking for 8 completions of 1 prompt, I ask for 1 completion each from 8 copies of the prompt, the issue is fixed: the resulting outputs match the expected behavior, with each caption describing the correct image.
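In toy form, here is why the workaround would behave correctly if the misaligned expansion is indeed the cause: duplicating the prompt before processing means the image tensor already comes out in per-sample order, and `generate`'s expansion with `num_return_sequences=1` is a no-op (tensor values are stand-ins for real images):

```python
import torch

expand_size = 2  # number of distinct images per prompt
num_copies = 4   # copies of the prompt passed to the processor

# Processing num_copies copies of a 2-image prompt yields images already
# laid out per sample: [img0, img1, img0, img1, ...]
per_image = [torch.full((3, 2, 2), float(i)) for i in range(expand_size)]
pixel_values = torch.stack(per_image * num_copies)
print(pixel_values[:, 0, 0, 0])  # tensor([0., 1., 0., 1., 0., 1., 0., 1.])

# With num_return_sequences=1, generate's expansion is the identity:
assert torch.equal(pixel_values.repeat_interleave(1, dim=0), pixel_values)
```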
This suggests a bug in how the inputs are expanded for generation.
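For what it's worth, a sketch of what a per-sample-aware expansion could look like. The helper below is hypothetical, not the actual transformers code, and assumes the flat image tensor can be grouped by a per-sample image count:

```python
import torch

def expand_per_sample(pixel_values, images_per_sample, expand_size):
    """Repeat each sample's *group* of images expand_size times, so the images
    stay aligned with a text batch expanded via repeat_interleave."""
    groups = torch.split(pixel_values, images_per_sample, dim=0)
    return torch.cat([g for g in groups for _ in range(expand_size)], dim=0)

# One sample with two images, expanded for num_return_sequences=4:
pv = torch.stack([torch.zeros(3, 2, 2), torch.ones(3, 2, 2)])
out = expand_per_sample(pv, [2], 4)
print(out[:, 0, 0, 0])  # tensor([0., 1., 0., 1., 0., 1., 0., 1.])
```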