
Attention mask for multi-image input in gemma3 #38053


Open · 1 of 4 tasks
deval281shah opened this issue May 9, 2025 · 1 comment · May be fixed by #38080

deval281shah commented May 9, 2025

System Info

According to the attention mask example in the Gemma3 blog (https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/gemma3/attention-ascii.png), attention is non-causal (bidirectional) within an image but causal across images, i.e., an image does not attend to a future image. However, when running Gemma3 generation with transformers (v4.51.3), the attention appears to be non-causal across images as well.

import torch
from transformers import AutoProcessor, Gemma3ForConditionalGeneration
import torch._dynamo
torch._dynamo.config.suppress_errors = True

ckpt = "google/gemma-3-4b-it"
model = Gemma3ForConditionalGeneration.from_pretrained(
    ckpt, device_map="auto", torch_dtype=torch.float32,
)
processor = AutoProcessor.from_pretrained(ckpt)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "First image: "},
            {"type": "image", "path": "img1.jpg"},
            {"type": "text", "text": "Second image:"},
            {"type": "image", "path": "img2.jpg"},
            {"type": "text", "text": "Describe all images in single sentence."}
        ]
    }
]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
).to(model.device)

generation = model.generate(
    **inputs, max_new_tokens=1, return_dict_in_generate=True,
    output_attentions=True, do_sample=False,
)

[attached image]
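
Continuing from the script above, a minimal sketch of one way to probe the returned attention weights, assuming eager attention weights are available for the prefill step and that image positions can be located via the config's image token id (the attribute name may be image_token_index or image_token_id depending on the version):

# Hedged sketch: locate the two image blocks and check how much attention
# first-image queries put on second-image keys in the prefill step.
# Assumes generation.attentions[0] holds per-layer weights of shape
# [batch, heads, seq_len, seq_len]; attribute names may differ by version.
image_token_id = getattr(model.config, "image_token_index", None) or model.config.image_token_id
image_positions = (inputs["input_ids"][0] == image_token_id).nonzero().squeeze(-1)

# Split the positions into the two contiguous image blocks.
gap = (image_positions.diff() > 1).nonzero().squeeze(-1)[0]
first_img = image_positions[: gap + 1]
second_img = image_positions[gap + 1:]

prefill_attn = generation.attentions[0][0][0]           # layer 0, batch 0: [heads, seq, seq]
cross = prefill_attn[:, first_img][:, :, second_img]    # image-1 queries -> image-2 keys
print("max attention from image 1 to image 2:", cross.max().item())

With the masking described in the blog figure, the printed value should be (near) zero; a large value would indicate cross-image attention.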

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Code attached

Expected behavior

Should attention across images be causal or non-causal?
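
For reference, a minimal sketch (not the transformers implementation) of the masking pattern shown in the blog figure: causal attention everywhere, plus bidirectional attention inside each image block. The token_type_ids and image_block_ids markers below are hypothetical illustrations, not the library's actual inputs:

import torch

# Minimal sketch of the blog-figure pattern: causal everywhere, plus full
# (bidirectional) attention inside each image block.
# token_type_ids: 1 for image tokens, 0 for text tokens (hypothetical markers).
# image_block_ids: same id for tokens of the same image, -1 for text tokens.
def gemma3_style_mask(token_type_ids, image_block_ids):
    seq_len = token_type_ids.shape[0]
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    same_image = (
        (token_type_ids[:, None] == 1)
        & (token_type_ids[None, :] == 1)
        & (image_block_ids[:, None] == image_block_ids[None, :])
    )
    return causal | same_image

# Toy layout: text, image 1 (3 tokens), text, image 2 (3 tokens).
tt = torch.tensor([0, 1, 1, 1, 0, 1, 1, 1])
blk = torch.tensor([-1, 0, 0, 0, -1, 1, 1, 1])
print(gemma3_style_mask(tt, blk).int())

Under this mask, image-1 rows have zeros in image-2 columns (causal across images), while rows within an image block attend to the whole block.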

zucchini-nlp (Member) commented
Hmm right, I just checked the mask and an image attends to all images, backward and forward. I believe this needs a fix; I'll check against the original implementation once more and make a fix if needed. Thanks for reporting!
