
[BLIP2] BLIP2QFormerLayer is missing the self.intermediate parameter, which makes training from scratch impossible #30846

Closed
tongda opened this issue May 16, 2024 · 6 comments


@tongda

tongda commented May 16, 2024

System Info

transformers: 4.40.2
python: 3.10.14
system: Ubuntu 20.04

Who can help?

@amyeroberts
@NielsRogge

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Refer to the BLIP2QFormerLayer code here and here: the self.intermediate parameter is used to project the text instruction embeddings into the LLM latent space. However, BLIP2QFormerModel.forward only accepts query_embeds as input, which are always the embedded query_tokens, so the text instruction path in BLIP2QFormerLayer is never exercised.

According to original paper:

...a text transformer that can function as both a text encoder and a text decoder. We create a set number of learnable query embeddings as input to the image transformer. The queries interact with each other through self-attention layers, and interact with frozen image features through cross-attention layers (inserted every other transformer block). The queries can additionally interact with the text through the same self-attention layers.

Although inference does not need the query-text interaction, the model should provide a way to perform it during the training phase.
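
For illustration, this is roughly how the transformers implementation drives the Q-Former today (a rough sketch, simplified from my reading of Blip2ForConditionalGeneration.forward; the dummy input shapes are only for demonstration):

    import torch
    from transformers import Blip2ForConditionalGeneration

    model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

    pixel_values = torch.randn(1, 3, 224, 224)  # dummy image batch, for demonstration only
    image_embeds = model.vision_model(pixel_values=pixel_values)[0]
    image_attention_mask = torch.ones(image_embeds.size()[:-1], dtype=torch.long)

    # the Q-Former only ever receives the learned query tokens as query_embeds;
    # there is no input_ids argument, so text instructions never flow through its layers
    query_tokens = model.query_tokens.expand(image_embeds.shape[0], -1, -1)
    query_outputs = model.qformer(
        query_embeds=query_tokens,
        encoder_hidden_states=image_embeds,
        encoder_attention_mask=image_attention_mask,
    )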

Expected behavior

The current implementation forces the user to embed the query outside of the Q-Former. To minimize side effects, I think a simple fix is to add the self.intermediate and self.output parameters back to BLIP2QFormerLayer, and to add a query_length parameter to BLIP2QFormerModel.forward() that marks the length of query_tokens. Note that the embedding of the text instruction fed to the Q-Former should be consistent with the Q-Former backbone, rather than with the LLM embedding. A rough sketch of the idea follows.
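
Inside Blip2QFormerLayer.forward, the feed-forward step could then split the sequence at query_length, mirroring the original LAVIS layer. This is only a hypothetical sketch of the proposal (the helper and variable names are illustrative; intermediate/output for the text part are the attributes this issue asks to re-add):

    import torch

    def feed_forward_split(layer, attention_output, query_length):
        # hypothetical helper illustrating the proposed split; `layer` stands for a
        # Blip2QFormerLayer that has intermediate/output (text MLP) re-added next to
        # the existing intermediate_query/output_query (query MLP)
        query_output = attention_output[:, :query_length, :]
        layer_output = layer.output_query(layer.intermediate_query(query_output), query_output)

        if attention_output.shape[1] > query_length:
            # text instruction tokens go through the re-added text MLP
            text_output = attention_output[:, query_length:, :]
            layer_output_text = layer.output(layer.intermediate(text_output), text_output)
            layer_output = torch.cat([layer_output, layer_output_text], dim=1)
        return layer_output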

@tongda tongda changed the title BLIP2QFormerLayer is missing the self.intermediate parameter, which makes training from scratch impossible [BLIP2] BLIP2QFormerLayer is missing the self.intermediate parameter, which makes training from scratch impossible May 17, 2024
@amyeroberts
Collaborator

Hi @tongda, thanks for raising this issue! Would you like to open a PR with this suggestion?

cc @younesbelkada

@younesbelkada
Contributor

I agree that this makes sense, although if we do that it won't be backward compatible, as we're going to change the way the layers are designed. I think our implementation follows the original implementation quite closely: https://github.com/salesforce/LAVIS/blob/main/lavis/models/blip2_models/Qformer.py also implements it that way (one needs to pass already processed query_embeds).

@tongda
Author

tongda commented May 21, 2024

@younesbelkada Thanks for the confirmation.
Just FYI, the difference between the original implementation and the transformers version is in this part of the original embeddings code:

    def forward(
        self,
        input_ids=None,  # <---- input_ids is the text instruction for the Q-Former
        position_ids=None,
        query_embeds=None,  # <---- query_embeds is the embedding of the pretrained query_tokens of the Q-Former
        past_key_values_length=0,
    ):
        if input_ids is not None:
            seq_length = input_ids.size()[1]
        else:
            seq_length = 0

        if position_ids is None:
            position_ids = self.position_ids[
                :, past_key_values_length : seq_length + past_key_values_length
            ].clone()

        if input_ids is not None:
            embeddings = self.word_embeddings(input_ids)
            if self.position_embedding_type == "absolute":
                position_embeddings = self.position_embeddings(position_ids)
                embeddings = embeddings + position_embeddings

            if query_embeds is not None:
                # <---- if input_ids exists, the final embeddings are the concatenation of both;
                #       the query part and the instruction part are handled differently later on
                embeddings = torch.cat((query_embeds, embeddings), dim=1)
        else:
            embeddings = query_embeds

        embeddings = self.LayerNorm(embeddings)
        embeddings = self.dropout(embeddings)
        return embeddings
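
For context, in LAVIS the text tokens and the learned queries are passed into the same BERT backbone together during pre-training, so the concatenated embeddings above are what the layers actually see. A rough sketch based on lavis/models/blip2_models/blip2_qformer.py (simplified; variable and argument names are approximate):

    # rough sketch of how LAVIS drives the Q-Former during pre-training (simplified)
    query_tokens = self.query_tokens.expand(image_embeds.shape[0], -1, -1)

    # image-text matching: text ids and query embeddings go in together, so the
    # concatenation built in the embeddings module above reaches every layer
    output_itm = self.Qformer.bert(
        text_tokens.input_ids,
        query_embeds=query_tokens,
        attention_mask=attention_mask,
        encoder_hidden_states=image_embeds,
        encoder_attention_mask=image_atts,
        return_dict=True,
    )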

@amyeroberts I would like to make the change. However, some unit tests break; I will open a PR once I figure out how to fix them.

@younesbelkada
Contributor

Thanks for double checking, this is clearer to me now; indeed this could be made backward compatible. Let us know when you open the PR!

@NielsRogge
Contributor

Thanks for raising this issue, it is actually being addressed in #29261

@tongda
Author

tongda commented May 23, 2024

Cool! I have checked the PR; it indeed includes the change that I need, and even more. Good job.

@tongda tongda closed this as completed May 23, 2024