
[BLIP2] BLIP2QFormerLayer is missing the self.intermediate parameter, which makes training from scratch impossible #30846

Closed
tongda opened this issue May 16, 2024 · 6 comments


@tongda

tongda commented May 16, 2024

System Info

transformers: 4.40.2
python: 3.10.14
system: Ubuntu 20.04

Who can help?

@amyeroberts
@NielsRogge

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Refer to the BLIP2QFormerLayer code here and here: the self.intermediate parameter is used to project the text instruction embeddings into the LLM latent space. However, BLIP2QFormerModel.forward only accepts query_embeds as input, which are always the embedded query_tokens, so the text instruction path in BLIP2QFormerLayer is never exercised.

According to original paper:

...a text transformer that can function as both a text encoder and a text decoder. We create a set number of learnable query embeddings as input to the image transformer. The queries interact with each other through self-attention layers, and interact with frozen image features through cross-attention layers (inserted every other transformer block). The queries can additionally interact with the text through the same self-attention layers.

Although inference does not need the query-text interaction, the model should provide a way to perform it during the training phase.
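
For illustration, this is roughly how the transformers implementation drives the Q-Former today (a rough sketch, simplified from my reading of Blip2ForConditionalGeneration.forward; the dummy input shapes are only for demonstration):

    import torch
    from transformers import Blip2ForConditionalGeneration

    model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

    pixel_values = torch.randn(1, 3, 224, 224)  # dummy image batch, for demonstration only
    image_embeds = model.vision_model(pixel_values=pixel_values)[0]
    image_attention_mask = torch.ones(image_embeds.size()[:-1], dtype=torch.long)

    # the Q-Former only ever receives the learned query tokens as query_embeds;
    # there is no input_ids argument, so text instructions never flow through its layers
    query_tokens = model.query_tokens.expand(image_embeds.shape[0], -1, -1)
    query_outputs = model.qformer(
        query_embeds=query_tokens,
        encoder_hidden_states=image_embeds,
        encoder_attention_mask=image_attention_mask,
    )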

Expected behavior

The current implementation forces the user to embed the query outside of the Q-Former. To minimize side effects, I think a simple fix is to add the self.intermediate and self.output parameters back to BLIP2QFormerLayer, and to add a query_length parameter to BLIP2QFormerModel.forward() that marks the length of query_tokens. Note that the embedding of the text instruction fed to the Q-Former should be consistent with the Q-Former backbone, rather than with the LLM embedding. A rough sketch of the idea follows.
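
Inside Blip2QFormerLayer.forward, the feed-forward step could then split the sequence at query_length, mirroring the original LAVIS layer. This is only a hypothetical sketch of the proposal (the helper and variable names are illustrative; intermediate/output for the text part are the attributes this issue asks to re-add):

    import torch

    def feed_forward_split(layer, attention_output, query_length):
        # hypothetical helper illustrating the proposed split; `layer` stands for a
        # Blip2QFormerLayer that has intermediate/output (text MLP) re-added next to
        # the existing intermediate_query/output_query (query MLP)
        query_output = attention_output[:, :query_length, :]
        layer_output = layer.output_query(layer.intermediate_query(query_output), query_output)

        if attention_output.shape[1] > query_length:
            # text instruction tokens go through the re-added text MLP
            text_output = attention_output[:, query_length:, :]
            layer_output_text = layer.output(layer.intermediate(text_output), text_output)
            layer_output = torch.cat([layer_output, layer_output_text], dim=1)
        return layer_output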

@tongda tongda changed the title BLIP2QFormerLayer is missing the self.intermediate parameter, which makes training from scratch impossible [BLIP2] BLIP2QFormerLayer is missing the self.intermediate parameter, which makes training from scratch impossible May 17, 2024
@amyeroberts
Collaborator

Hi @tongda, thanks for raising this issue! Would you like to open a PR with this suggestion?

cc @younesbelkada

@younesbelkada
Contributor

I agree that this makes sense, although if we do that it won't be backward compatible, as we're going to change the way the layers are designed. I think our implementation follows the original implementation quite closely: https://github.com/salesforce/LAVIS/blob/main/lavis/models/blip2_models/Qformer.py also implements it that way (one needs to pass already processed query_embeds).

@tongda
Author

tongda commented May 21, 2024

@younesbelkada Thanks for the confirmation.
Just FYI, the difference between the original implementation and the transformers version is in this part of the original embeddings code:

    def forward(
        self,
        input_ids=None,  # <---- input_ids is the text instruction for the Q-Former
        position_ids=None,
        query_embeds=None,  # <---- query_embeds is the embedding of the pretrained query_tokens of the Q-Former
        past_key_values_length=0,
    ):
        if input_ids is not None:
            seq_length = input_ids.size()[1]
        else:
            seq_length = 0

        if position_ids is None:
            position_ids = self.position_ids[
                :, past_key_values_length : seq_length + past_key_values_length
            ].clone()

        if input_ids is not None:
            embeddings = self.word_embeddings(input_ids)
            if self.position_embedding_type == "absolute":
                position_embeddings = self.position_embeddings(position_ids)
                embeddings = embeddings + position_embeddings

            if query_embeds is not None:
                # <---- if input_ids exists, the final embeddings are the concatenation of both;
                #       the query part and the instruction part are handled differently later on
                embeddings = torch.cat((query_embeds, embeddings), dim=1)
        else:
            embeddings = query_embeds

        embeddings = self.LayerNorm(embeddings)
        embeddings = self.dropout(embeddings)
        return embeddings
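
For context, in LAVIS the text tokens and the learned queries are passed into the same BERT backbone together during pre-training, so the concatenated embeddings above are what the layers actually see. A rough sketch based on lavis/models/blip2_models/blip2_qformer.py (simplified; variable and argument names are approximate):

    # rough sketch of how LAVIS drives the Q-Former during pre-training (simplified)
    query_tokens = self.query_tokens.expand(image_embeds.shape[0], -1, -1)

    # image-text matching: text ids and query embeddings go in together, so the
    # concatenation built in the embeddings module above reaches every layer
    output_itm = self.Qformer.bert(
        text_tokens.input_ids,
        query_embeds=query_tokens,
        attention_mask=attention_mask,
        encoder_hidden_states=image_embeds,
        encoder_attention_mask=image_atts,
        return_dict=True,
    )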

@amyeroberts I would like to make the change. However, some unit tests break; I will open a PR once I figure out how to fix them.

@younesbelkada
Contributor

Thanks for double checking, this is clearer to me now; indeed this could be made backward compatible. Let us know when you open the PR!

@NielsRogge
Contributor

Thanks for raising this issue, it is actually being addressed in #29261

@tongda
Author

tongda commented May 23, 2024

Cool! I have checked the PR; it indeed includes the change that I need, and even more. Good job.

@tongda tongda closed this as completed May 23, 2024