Unexpected output from server.cpp /embedding endpoint #391

Closed

k8si opened this issue May 2, 2024 · 1 comment
k8si commented May 2, 2024

What is the issue?

The embeddings produced by a model running in llamafile seem to be substantially different from those produced by llama.cpp.

llama.cpp embeddings are very close (~0.99 cosine similarity) to those produced by the same model via HuggingFace (which I'm treating as the 'reference embeddings'). On the other hand, llamafile embeddings only get ~0.6 cosine similarity to the HuggingFace embeddings. I tested this across multiple llamafile versions (see results below).

Tested with:

  • llamafile versions v0.7.1 through v0.8.1
  • llama.cpp commit: 6ecf3189
  • MacBook Pro with Apple M2 Pro (32 GB)
  • macOS 14.2.1
  • Only tested with one model: all-MiniLM-L6-v2 (BERT architecture)

How to replicate the issue

I put all the scripts/information to replicate this issue in this repo: https://github.com/k8si/replicate-llamafile-embeddings-issue

The short version:

To inspect the differences between embeddings produced by different backends, I embed the text "Alice has had it with computers." with the same(-ish) model running in HF, llama.cpp, and llamafile:

  1. HuggingFace - used sentence-transformers/all-MiniLM-L6-v2 pytorch weights directly
  2. llamafile - used full F32 GGUF version of the model from leliuga/all-MiniLM-L6-v2-GGUF
  3. llama.cpp - used full F32 GGUF version of the model from leliuga/all-MiniLM-L6-v2-GGUF

I use the F32 GGUF to remove any quantization effects and stay as close as possible to the HuggingFace reference model; a rough sketch of fetching the two embeddings follows below.
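A minimal sketch of the setup (not the exact scripts from the repo above; it assumes a llamafile or llama.cpp server started with --embedding, serving the F32 GGUF, and listening on localhost:8080):

    import requests
    from sentence_transformers import SentenceTransformer

    TEXT = "Alice has had it with computers."

    # Reference embedding computed directly from the HuggingFace weights.
    hf_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    emb_hf = hf_model.encode(TEXT)

    # Embedding from the server's /embedding endpoint (llamafile or
    # llama.cpp; the host/port and server flags are assumptions here).
    resp = requests.post("http://localhost:8080/embedding",
                         json={"content": TEXT})
    emb_server = resp.json()["embedding"]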

Then I compute the cosine similarity between the HF and llamafile embeddings, and compare it to the cosine similarity between the HF and llama.cpp embeddings. I would expect the two scores to match, but they do not, as the results below show.
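The comparison itself is plain cosine similarity; a minimal numpy version, continuing the sketch above (the repo's scripts may differ in detail):

    import numpy as np

    def cosine_sim(a, b):
        # Cosine similarity between two embedding vectors.
        a = np.asarray(a, dtype=np.float64)
        b = np.asarray(b, dtype=np.float64)
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Matching pipelines should score ~1.0 against the HF reference.
    print(f"cosine-sim(emb_hf, emb_server) = {cosine_sim(emb_hf, emb_server):.6f}")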

Results

Results across the last 6 llamafile releases (v0.7.1 to v0.8.1):

$ cat results/results-* | grep -A 2 "RESULTS"

RESULTS (llamafile v0.7.1):
cosine-sim(emb_hf, emb_llamafile) = 0.635290
cosine-sim(emb_hf, emb_llamacpp) = 0.999999
--
RESULTS (llamafile v0.7.2):
cosine-sim(emb_hf, emb_llamafile) = 0.635290
cosine-sim(emb_hf, emb_llamacpp) = 0.999999
--
RESULTS (llamafile v0.7.3):
cosine-sim(emb_hf, emb_llamafile) = 0.635290
cosine-sim(emb_hf, emb_llamacpp) = 0.999999
--
RESULTS (llamafile v0.7.4):
cosine-sim(emb_hf, emb_llamafile) = 0.635290
cosine-sim(emb_hf, emb_llamacpp) = 0.999999
--
RESULTS (llamafile v0.8):
cosine-sim(emb_hf, emb_llamafile) = 0.605049
cosine-sim(emb_hf, emb_llamacpp) = 0.999999
--
RESULTS (llamafile v0.8.1):
cosine-sim(emb_hf, emb_llamafile) = 0.605049
cosine-sim(emb_hf, emb_llamacpp) = 0.999999
--

The test does not work prior to v0.7.1 because BERT was not supported before that release, and all-MiniLM-L6-v2 is a BERT-architecture model.

k8si added the bug label May 2, 2024
jart self-assigned this May 4, 2024
jart commented May 4, 2024

It turns out the upstream project inserts CLS and SEP tokens around the input before passing it to llama_decode(). I've identified the key line in the server code that needs to change to make our embedding output consistent with llama.cpp in this case. With the change I'm about to push, cosine similarity with llama.cpp's output will be 0.9999+.
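To illustrate what that means (a quick check using the HuggingFace tokenizer, not llamafile's own tokenization): BERT-family models expect the input wrapped in [CLS] ... [SEP], so an embedding computed without those special tokens drifts away from the reference.

    from transformers import AutoTokenizer

    # BERT-family tokenizers add [CLS] at the start and [SEP] at the end.
    tok = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
    ids = tok("Alice has had it with computers.")["input_ids"]
    assert ids[0] == tok.cls_token_id and ids[-1] == tok.sep_token_id
    print(tok.convert_ids_to_tokens(ids))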

Please note we're no longer importing upstream changes on the server. The upstream implementation has diverged significantly since they removed LLaVA support. You will likely encounter other differences in behavior. If you do, feel free to file another issue and I'll pinpoint what needs to change.

jart closed this as completed in 7900294 May 4, 2024