It turns out the upstream project is inserting CLS and SEP tokens around the input before passing it to llama_decode(). I've identified the key line in the server code that needs to change to make our embedding output consistent with llama.cpp in this case. With the change I'm about to push, cosine similarity with llama.cpp will be 0.9999+.
Please note we're no longer importing upstream changes on the server. The upstream implementation has diverged significantly since they removed LLaVA support. You will likely encounter other differences in behavior. If you do, feel free to file another issue and I'll pinpoint what needs to change.
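The fix can be illustrated with a minimal sketch (hypothetical helper name; the actual change lives in llamafile's server code): BERT-style embedding models expect the input token sequence to be framed with CLS and SEP tokens before being passed to llama_decode(), and omitting them is what produced the divergent embeddings.

```python
def frame_for_bert(token_ids, cls_id, sep_id):
    """Wrap raw token ids with CLS/SEP markers, as BERT-style
    embedding models expect. The model still runs without this
    framing, but it produces substantially different embeddings."""
    return [cls_id] + list(token_ids) + [sep_id]

# Toy example: 101 and 102 are the [CLS] and [SEP] ids in BERT's
# WordPiece vocabulary; the inner ids here are placeholders.
print(frame_for_bert([7632, 2088], cls_id=101, sep_id=102))
```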
What is the issue?
The embeddings produced by a model running in llamafile seem to be substantially different from those produced by llama.cpp.
llama.cpp embeddings are very close (~0.99 cosine similarity) to those produced by the same model via HuggingFace (which I'm treating as the 'reference embeddings'). On the other hand, llamafile embeddings only get ~0.6 cosine similarity to the HuggingFace embeddings. I tested this across multiple llamafile versions (see results below).
Tested with:
llamafile versions from v0.7.1 - v0.8.1
llama.cpp commit: 6ecf3189
MacBook Pro with Apple M2 Pro (32 GB)
macOS 14.2.1
Only tested with one model: all-MiniLM-L6-v2 (BERT architecture)
How to replicate the issue
I put all the scripts/information to replicate this issue in this repo: https://github.com/k8si/replicate-llamafile-embeddings-issue
The short version:
To inspect the differences between embeddings produced by different backends, I embed the text "Alice has had it with computers." with the same(-ish) model running in HF, llama.cpp, and llamafile:
I use the F32 GGUF to remove any quantization effects and stay as equivalent to the HuggingFace reference model as possible.
Then, I look at the cosine similarity between the embedding produced by HF vs llamafile and compare this to the cosine-sim between the embedding from HF vs llama.cpp. I would expect the two cosine-sim scores to be the same, but they are not, as the results below show.
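The comparison itself is just cosine similarity between the two embedding vectors. A pure-Python sketch, using toy vectors in place of the real 384-dimensional MiniLM embeddings:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Identical vectors score 1.0. In the results below, HF vs llama.cpp
# scores ~0.99 while HF vs llamafile scores only ~0.6.
hf_embedding = [1.0, 0.0, 0.0]
other_embedding = [1.0, 0.0, 0.0]
print(cosine_similarity(hf_embedding, other_embedding))  # → 1.0
```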
Results
Results across the last 6 llamafile releases (v0.7.1 to v0.8.1):
The test does not work prior to v0.7.1, as BERT was not supported before that release, and all-MiniLM-L6-v2 is a BERT architecture.