special_image_mask handling can get hit by accidental same embedding value at certain dims #38012


Closed
lancercat opened this issue May 8, 2025 · 5 comments


@lancercat

lancercat commented May 8, 2025

special_image_mask = inputs_embeds == self.get_input_embeddings()(
    torch.tensor(self.config.image_token_index, dtype=torch.long, device=inputs_embeds.device))

As far as I can see, this should test whether the whole vector is close enough to the special-token embedding,
instead of testing equality at each dim independently.
It should be something like:

            if input_ids is None:
                # Compare the whole embedding vector against the image-token embedding
                # instead of comparing each dim independently (the tolerance is illustrative).
                image_token_embed = self.get_input_embeddings()(
                    torch.tensor(self.config.image_token_index, dtype=torch.long, device=inputs_embeds.device)
                )
                special_image_mask = (
                    (inputs_embeds - image_token_embed).abs().sum(-1) < 0.009
                ).unsqueeze(-1)
            else:
                special_image_mask = (input_ids == self.config.image_token_index).unsqueeze(-1)
            special_image_mask = special_image_mask.expand_as(inputs_embeds).to(inputs_embeds.device)
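
For illustration, here is a minimal standalone sketch of the failure mode (toy tensors, not the model's real weights): a virtual/prefix embedding that happens to share a value with the image-token embedding at a single dim gets partially flagged by the element-wise `==` check, while a whole-vector comparison does not flag it:

    import torch

    hidden = 8
    image_embed = torch.randn(hidden)    # stand-in for get_input_embeddings()(image_token_index)
    prefix_embed = torch.randn(hidden)   # stand-in for a random/tuned virtual prefix embedding
    prefix_embed[3] = image_embed[3]     # accidental collision at one dim

    inputs_embeds = torch.stack([prefix_embed, image_embed])  # (seq_len=2, hidden)

    per_dim_mask = inputs_embeds == image_embed   # the current element-wise check
    print(per_dim_mask[0].any())                  # tensor(True): prefix row partially flagged as "image"

    whole_vector_mask = (inputs_embeds - image_embed).abs().sum(-1) < 1e-6
    print(whole_vector_mask)                      # tensor([False, True]): only the real image token matches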

@lancercat lancercat changed the title special_image_mask handling can get hit by accidental same embedding at certain dims special_image_mask handling can get hit by accidental same embedding value at certain dims May 8, 2025
@zucchini-nlp
Member

@lancercat It should not be an issue if embeds are obtained using the same model with the same dtype. Is it failing for you at inference time?

@lancercat
Author

No, because inference takes input ids and thus never compares embeddings.
I'm doing some prefix finetuning and it suddenly errored because a random virtual embedding hits the image-token embedding at one or two dims :)

@zucchini-nlp
Member

Ah, okay, so IIUC the issue: when doing prefix tuning, some virtual input embeds get mistaken for the image token idx. I still don't think this needs a fix, because comparing the diff against an arbitrary threshold like <0.009 will lead to more issues if the vocabulary has embeddings nearly identical to the image token.

If you are doing prefix tuning with PEFT, we can try to fix it on the PEFT side (though I think PEFT doesn't expand the embeds, but rather the cache). If it was a custom tuning script, I suggest initializing the virtual embeddings to be non-equal to the image token embedding.
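
In case it helps, a rough sketch of that workaround for a custom tuning script (the helper name `init_virtual_embeds` and the small jitter are purely illustrative, not an existing API):

    import torch

    def init_virtual_embeds(num_virtual, embed_layer, image_token_index, device=None):
        """Initialize virtual prefix embeddings so no dim exactly equals the image-token embedding."""
        image_embed = embed_layer(
            torch.tensor(image_token_index, dtype=torch.long, device=device)
        )
        virtual = torch.randn(num_virtual, image_embed.shape[-1], device=device) * 0.02
        # Nudge any dim that exactly matches the image-token embedding, so the model's
        # element-wise `==` check can never flag a virtual token as an image token.
        collisions = virtual == image_embed
        virtual[collisions] += 1e-4
        return torch.nn.Parameter(virtual)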

@lancercat
Author

lancercat commented May 8, 2025

I am already meddling with PEFT.
The problem is bigger than I thought due to Gemma's 4D attention mask and its right-alignment magic :)
I am currently writing a dedicated PEFT class for it...

Anyway, if the embedding comparison behaviour is not intended, maybe remove the logic?
As it stands, the comparison does not do what it appears to do either...

@zucchini-nlp
Member

The feature works when users pass embeds = get_input_embeddings()(input_ids) to forward along with the pixel values. It should not be removed.
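
For reference, the supported path looks roughly like this (the checkpoint id and the <start_of_image> placeholder in the prompt are assumptions for illustration): because inputs_embeds comes from the model's own embedding table, the image-token rows match the reference embedding exactly and the `==` check is reliable.

    import torch
    from PIL import Image
    from transformers import AutoModelForImageTextToText, AutoProcessor

    model_id = "google/gemma-3-4b-it"  # placeholder checkpoint
    model = AutoModelForImageTextToText.from_pretrained(model_id)
    processor = AutoProcessor.from_pretrained(model_id)

    image = Image.new("RGB", (224, 224))  # dummy image for illustration
    inputs = processor(text="<start_of_image> Describe this image.", images=image, return_tensors="pt")

    # Embeds taken from the model's own embedding layer: image-token rows are bit-identical
    # to get_input_embeddings()(image_token_index), so the element-wise mask is exact.
    inputs_embeds = model.get_input_embeddings()(inputs["input_ids"])
    with torch.no_grad():
        out = model(inputs_embeds=inputs_embeds, pixel_values=inputs["pixel_values"])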
