First things first, I loved reading your paper. It was clear, concise, and has great implications for semantic search going forward. I cannot compliment it highly enough!
One question. I would like to use a similar method to get semantic embeddings from non-GPT autoregressive language models. In the paper I read:
Due to the causal attention mask in an auto-regressive decoder transformer, tokens do not attend to future tokens like in an encoder transformer. Hence, only the last token has attended to all tokens in a sequence. To account for this information mismatch, we propose to give later tokens a higher weight using a position-weighted mean pooling method:
$$v = \sum_{i=1}^{S} w_i h_i \quad \text{where} \quad w_i = \frac{i}{\sum_{i=1}^{S} i} \tag{2}$$

where $S$ is the sequence length, $h_i$ the $i$-th hidden state, and $v$ the query or document embedding. We compare weighted mean pooling with last token pooling, where the hidden state of the final token is the embedding, and regular mean pooling.
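For reference, here is a minimal sketch of how I read Eq. (2) in code. The function name and tensor shapes are my own guesses, not taken from your codebase, and it assumes right padding so that positions line up with the actual token order:

```python
import torch

def weighted_mean_pooling(last_hidden_state: torch.Tensor,
                          attention_mask: torch.Tensor) -> torch.Tensor:
    """Position-weighted mean pooling over a padded batch.

    last_hidden_state: [batch, seq_len, hidden]
    attention_mask:    [batch, seq_len], 1 for real tokens, 0 for padding
    """
    seq_len = last_hidden_state.shape[1]
    # w_i proportional to the (1-indexed) position i, zeroed out on padding
    positions = torch.arange(1, seq_len + 1, device=last_hidden_state.device)
    weights = positions.unsqueeze(0) * attention_mask        # [batch, seq_len]
    weights = weights / weights.sum(dim=1, keepdim=True)     # normalize so sum_i w_i = 1
    # v = sum_i w_i * h_i
    return (last_hidden_state * weights.unsqueeze(-1)).sum(dim=1)  # [batch, hidden]
```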
This trick is really neat, but I was wondering whether it would also work for autoregressive decoder-only models that use a causal language modeling loss, for example the XGLM model set? https://huggingface.co/facebook/xglm-564M
How about autoregressive LMs that do not use a causal language modeling loss, but instead use next-token prediction language modeling? Such as the CodeGen model set? https://arxiv.org/pdf/2203.13474.pdf [if you are unfamiliar, the training section is 2.2 :)]
I understand if there is not a clear answer to these questions, but I would love to hear your thoughts either way.
Thanks again!
If I'm not mistaken, next-token prediction language modeling == causal language modeling loss == the objective of the pretrained SGPT models. They are all causal decoder-only models with the same loss objective, so yes, weighted mean pooling should work well for all of them.
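For example, something along these lines should work with any causal decoder-only model on the Hub. This is an untested sketch using facebook/xglm-564M from your question; the variable names are illustrative only:

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "facebook/xglm-564M"  # any causal decoder-only LM should work the same way
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()

texts = ["how to bake a chocolate cake", "recipe for a chocolate cake"]
batch = tokenizer(texts, return_tensors="pt", padding=True)

with torch.no_grad():
    hidden = model(**batch).last_hidden_state            # [batch, seq_len, hidden]

# position-weighted mean pooling, as in Eq. (2)
weights = batch["attention_mask"] * torch.arange(1, hidden.shape[1] + 1).unsqueeze(0)
weights = weights / weights.sum(dim=1, keepdim=True)
embeddings = (hidden * weights.unsqueeze(-1)).sum(dim=1)  # [batch, hidden]

# cosine similarity between the two sentence embeddings
sim = torch.nn.functional.cosine_similarity(embeddings[0], embeddings[1], dim=0)
```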