Description
Problem (one or two sentences)
In code, there are definitions like this:
"nomic-embed-code": {
dimension: 3584,
scoreThreshold: 0.15,
queryPrefix: "Represent this query for searching relevant code: ",
},
queryPrefix is then used as the instruction prefix for the embedder, for both indexing and queries. This is incorrect. Nearly all modern embedders (including this Nomic model, and by my estimate more than 95% of the others) require DIFFERENT instructions for indexing and for queries, or AT LEAST require indexing without any prefix and querying with one.
Context (who is affected and when)
Retrieval quality with self-hosted embedders is subpar because of this.
Most public paid embedding APIs are not affected, however, because many of them are instruction-less.
Reproduction steps
Enable indexing. Observe the verbose logs of the prompts sent to the embedder (on the llama-cpp side, for example). Note the improper use of queryPrefix:
If none is defined, none is used (incorrect),
If one is defined, it is used for BOTH indexing and search (also incorrect for most embedders).
Expected result
We should have different templates for indexing and for querying.
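A minimal sketch of what separate templates could look like, not the actual implementation. The `documentPrefix` field, the `EmbedPurpose` type, and the `applyPrefix` helper are hypothetical names introduced only for illustration; the sketch also assumes that this model indexes code chunks without any prefix, as described above.

```typescript
// Hypothetical profile shape. `documentPrefix` does not exist in the
// current code; it is an assumption made for this sketch.
interface EmbeddingModelProfile {
  dimension: number;
  scoreThreshold?: number;
  queryPrefix?: string; // prepended to search queries only
  documentPrefix?: string; // prepended to indexed chunks only (hypothetical)
}

type EmbedPurpose = "index" | "query";

const profiles: Record<string, EmbeddingModelProfile> = {
  "nomic-embed-code": {
    dimension: 3584,
    scoreThreshold: 0.15,
    queryPrefix: "Represent this query for searching relevant code: ",
    // No documentPrefix: chunks are assumed to be indexed as-is.
  },
};

// Pick the prefix that matches the purpose, and avoid applying it twice.
function applyPrefix(
  text: string,
  profile: EmbeddingModelProfile,
  purpose: EmbedPurpose,
): string {
  const prefix =
    purpose === "query" ? profile.queryPrefix : profile.documentPrefix;
  if (!prefix || text.startsWith(prefix)) {
    return text;
  }
  return prefix + text;
}
```

With this shape, the indexing path would call `applyPrefix(chunk, profile, "index")` and the search path `applyPrefix(query, profile, "query")`, so a model that needs an instruction on only one side can leave the other field undefined.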
Actual result
There is currently a single template shared by indexing and querying, which is incorrect for most modern self-hosted embedders.
Variations tried (optional)
No response
App Version
3.50.3 (79d11ff)
API Provider (optional)
OpenAI Compatible
Model Used (optional)
No response
Roo Code Task Links (optional)
No response