
Theory question about the token weightings for symmetric search. #9

Closed
cm2435 opened this issue Aug 15, 2022 · 2 comments

Comments

@cm2435

cm2435 commented Aug 15, 2022

First things first, I loved reading your paper. It was clear and concise, and it has great implications for semantic search going forward. I cannot compliment it highly enough!

One question: I would like to use a similar method to get semantic embeddings for non-GPT autoregressive language models. In the paper I read:

Due to the causal attention mask in an auto-regressive decoder transformer, tokens do not attend to
future tokens like in an encoder transformer. Hence, only the last token has attended to all tokens in a
sequence. To account for this information mismatch, we propose to give later tokens a higher weight
using a position-weighted mean pooling method:
$v = \sum_{i=1}^{S} w_i h_i$ where $w_i = \frac{i}{\sum_{i=1}^{S} i}$ (2)

where $S$ is the sequence length, $h_i$ the $i$-th hidden state and $v$ the query or document embedding. We
compare weighted mean pooling with last token pooling, where the hidden state of the final token is
the embedding, and regular mean pooling.
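For concreteness, here is a minimal sketch of the weighted mean pooling in Eq. (2), assuming right-padded PyTorch tensors (hidden states and attention mask) from a Hugging Face causal LM. It is an illustration of the formula, not the repository's exact implementation.

```python
import torch

def weighted_mean_pooling(hidden_states: torch.Tensor,
                          attention_mask: torch.Tensor) -> torch.Tensor:
    """Position-weighted mean pooling per Eq. (2): w_i proportional to i."""
    # Position indices 1..S, broadcast to [batch, seq_len] and zeroed on padding.
    # Assumes right padding, so real tokens occupy positions 1..length.
    positions = torch.arange(1, hidden_states.size(1) + 1,
                             device=hidden_states.device,
                             dtype=hidden_states.dtype)
    weights = positions.unsqueeze(0) * attention_mask.to(hidden_states.dtype)
    # Normalize so the weights over the real tokens sum to 1 for each sequence.
    weights = weights / weights.sum(dim=1, keepdim=True)
    # Weighted sum over the sequence dimension -> [batch, hidden_dim].
    return torch.einsum("bs,bsd->bd", weights, hidden_states)
```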

This trick is really neat, but I was wondering whether it would work for autoregressive, decoder-only models that use a causal language modeling loss, for example the XGLM model set? https://huggingface.co/facebook/xglm-564M

How about autoregressive LMs that do not use a causal language modeling loss, but instead use next-token prediction language modeling, such as the CodeGen model set? https://arxiv.org/pdf/2203.13474.pdf [if you are unfamiliar, the training section is 2.2 :)]

I understand if there is not a clear answer to these questions, but I would love to hear your thoughts either way.
Thanks again!

@Muennighoff
Owner

If I'm not mistaken, next-token prediction language modeling == causal language modeling loss == the objective of pretrained SGPT models. They are all causal decoder-only models with the same loss objective, so yes, weighted mean pooling should work well for all of them.
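A hedged usage sketch of that conclusion, using the facebook/xglm-564M checkpoint linked in the question (the preprocessing choices here are assumptions, not something prescribed in this thread): extract the last hidden states with transformers and pool them with the weighted_mean_pooling sketch from the earlier comment.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# facebook/xglm-564M is the model linked above; any causal decoder-only
# checkpoint with a pad token should work the same way.
tokenizer = AutoTokenizer.from_pretrained("facebook/xglm-564M")
model = AutoModel.from_pretrained("facebook/xglm-564M")
model.eval()

texts = ["weighted mean pooling for decoder-only models"]
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    last_hidden = model(**batch).last_hidden_state  # [batch, seq_len, dim]

# weighted_mean_pooling is the illustrative function sketched earlier.
embeddings = weighted_mean_pooling(last_hidden, batch["attention_mask"])
print(embeddings.shape)  # [batch, hidden_dim]
```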

@cm2435
Author

cm2435 commented Aug 16, 2022

@Muennighoff So it is. Sorry about that, and thanks for the clarification.

@cm2435 cm2435 closed this as completed Aug 16, 2022