It turns out the upstream project is inserting CLS and SEP tokens around the input before passing it to llama_decode(). I've identified the key line in the server code that needs to change to make our embedding output consistent with llama.cpp in this case. With the change I'm about to push, cosine similarity with llama.cpp will be 0.9999+.
Please note we're no longer importing upstream changes on the server. The upstream implementation has diverged significantly since they removed LLaVA support. You will likely encounter other differences in behavior. If you do, feel free to file another issue and I'll pinpoint what needs to change.
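The fix can be illustrated with a minimal sketch (hypothetical helper name; the actual change lives in llamafile's server code): BERT-style embedding models expect the input token sequence to be framed with CLS and SEP tokens before being passed to llama_decode(), and omitting them is what produced the divergent embeddings.

```python
def frame_for_bert(token_ids, cls_id, sep_id):
    """Wrap raw token ids with CLS/SEP markers, as BERT-style
    embedding models expect. The model still runs without this
    framing, but it produces substantially different embeddings."""
    return [cls_id] + list(token_ids) + [sep_id]

# Toy example: 101 and 102 are the [CLS] and [SEP] ids in BERT's
# WordPiece vocabulary; the inner ids here are placeholders.
print(frame_for_bert([7632, 2088], cls_id=101, sep_id=102))
```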
What is the issue?
The embeddings produced by a model running in llamafile seem to be substantially different from those produced by llama.cpp.
llama.cpp embeddings are very close (~0.99 cosine similarity) to those produced by the same model via HuggingFace (which I'm treating as the 'reference embeddings'). On the other hand, llamafile embeddings only get ~0.6 cosine similarity to the HuggingFace embeddings. I tested this across multiple llamafile versions (see results below).
Tested with:
llamafile versions from v0.7.1 - v0.8.1
llama.cpp commit: 6ecf3189
MacBook Pro with Apple M2 Pro (32 GB)
macOS 14.2.1
Only tested with one model: all-MiniLM-L6-v2 (BERT architecture)
How to replicate the issue
I put all the scripts/information to replicate this issue in this repo: https://github.com/k8si/replicate-llamafile-embeddings-issue
The short version:
To inspect the differences between embeddings produced by different backends, I embed the text "Alice has had it with computers." with the same(-ish) model running in HF, llama.cpp, and llamafile:
I use the F32 GGUF to remove any quantization effects and stay as equivalent to the HuggingFace reference model as possible.
Then, I look at the cosine similarity between the embedding produced by HF vs llamafile and compare this to the cosine-sim between the embedding from HF vs llama.cpp. I would expect the two cosine-sim scores to be the same, but they are not, as the results below show.
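The comparison itself is just cosine similarity between the two embedding vectors. A pure-Python sketch, using toy vectors in place of the real 384-dimensional MiniLM embeddings:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Identical vectors score 1.0. In the results below, HF vs llama.cpp
# scores ~0.99 while HF vs llamafile scores only ~0.6.
hf_embedding = [1.0, 0.0, 0.0]
other_embedding = [1.0, 0.0, 0.0]
print(cosine_similarity(hf_embedding, other_embedding))  # → 1.0
```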
Results
Results across the last 6 llamafile releases (v0.7.1 to v0.8.1):
The test does not work prior to v0.7.1, as BERT was not supported before that release, and all-MiniLM-L6-v2 is a BERT architecture.