Incorrect total value with k-NN search #97807

devatadecco · 2023-07-19T11:47:20Z

Elasticsearch Version

Version: 8.8.0, Build: docker/c01029875a091076ed42cdb3a41c10b1a9a5a20f/2023-05-23T17:16:07.179039820Z, JVM: 20.0.1

Installed Plugins

None

Java Version

bundled

OS Version

Linux 5179425a12d1 5.10.104-linuxkit #1 SMP PREEMPT Thu Mar 17 17:05:54 UTC 2022 aarch64 aarch64 aarch64 GNU/Linux

Problem Description

Incorrect total value with k-NN search

Running knn queries total value always match the k param in the query while there are more than k results, if k exceed the right total value it will be shown correct.

Steps to Reproduce

1 - Create index with proper mapping for knn search
2- Populate more than one document
3- Make knn search with k=1 and a vector, check total
4- Increase k value with same vector, check total

PUT /test
{
"mappings": {
"properties": {
"title-vector": {
"type": "dense_vector",
"dims": 5,
"index": true,
"similarity": "dot_product"
}
}
}
}

POST /test/_bulk?refresh=true&pretty
{ "index": { "_id": "1" } }
{ "title-vector": [1, 0, 0, 0, 0] }
{ "index": { "_id": "2" } }
{ "title-vector": [1, 0, 0, 0, 0] }
{ "index": { "_id": "3" } }
{ "title-vector": [1, 0, 0, 0, 0] }

POST test/_search
{ "knn": {
"field": "title-vector",
"query_vector": [1, 0, 0, 0, 0],
"k": 2,
"num_candidates": 10000}}

Logs (if relevant)

No response

The text was updated successfully, but these errors were encountered:

elasticsearchmachine · 2023-07-20T10:58:19Z

Pinging @elastic/es-search (Team:Search)

benwtrent · 2023-07-20T12:49:17Z

This is interesting. I tried the same thing on the deprecated _knn_search API and it worked ok.

Something weird with parsing or serialization. Digging in :)

benwtrent · 2023-07-20T14:18:30Z

OK, @devatadecco I have an answer for you that may incur further questions :)

kNN is an approximate search. This means the structure does it's best to NOT have to search all the available vectors.

In your simple example, the number of vectors is small, so it might be surprising that the total hit count for global k matches k.

But, let's consider the situation where we have 1M+ vectors. Per-shard, we will only gather up to numCandidates. While we keep track of the number of vectors visited while exploring the graph, these vectors used for exploration will likely NOT be in the top-k. This leads us to the following options:

1. Should documents not in the top-k (only used when exploring the graph) be added to our hit count?
1. Should we logically restrict the hit count to the number of documents returned in the global top-k?
1. Should the hit count be a summation of all the numCandidates gathered from every shard?
1. Should the hit count be a summation of all the topK gathered from every shard?
1. Should the hit count just be the number of documents that have the kNN field?

Currently, the kNN section is a global topK over every shard. You ask for the top 100 docs, you get the top 100 docs from a global view over all your data. In this case, option 2 gives us the most consistency (the number doesn't fluctuate with number of shards) and it is the one currently implemented. Given the goals of kNN (always restricting to finding the nearest vectors), option 5, while consistent, seems to break the idea of hits referring to the nearest k docs.

We can contrast kNN with a traditional term query. The hit count for a term query is the number of documents matching the term. What does this mean in the context of kNN? The number of docs that have the vector field? Or the number of docs matched (globally) for the knn search?

devatadecco · 2023-07-21T08:44:21Z

Thank you for responding.

Provided example focuses on showing the behavior. My intention is to maintain consistent pagination while we set min_score.

And if you're asking me! I prefer not to have k in the query at all, and instead, I would favor using size and min_score so that the answer is straightforward. Where shards hit a record above the min_sore count it for total, and finally, orchestrator returns top scored records of given size of page.

Let me know if there is a proper way for pagination that I miss?

benwtrent · 2023-07-21T14:57:11Z

And if you're asking me! I prefer not to have k in the query at all, and instead, I would favor using size and min_score so that the answer is straightforward.

Interesting, I can see that being a potential solution. We will have to think some more. There might be a better solution here that neither of us see.

I do agree, that k should default to size() and if using pagination, we can be even nicer and set it to size + from and cut off the from when merging. The tricky part is handling num_candidates correctly and deep pagination (we would have a hard limit at 10,000 K). We would have to provide a good default there as it relates to k.

Let me know if there is a proper way for pagination that I miss?

The tricky part with pagination is that kNN has a static upper limit of 10k. So, at some point you would hit that limit depending on how deep you are paging.

Really, as it is right now, there is no nice way to pagination global top k. I think some future planned work could make this better. I will try to keep this issue updated!

boorboor · 2023-08-10T10:28:09Z

Any update?

benwtrent · 2023-08-10T12:48:17Z

@devatadecco & @boorboor

I would argue that the "total hit" is still correct for kNN since kNN is technically always a "match_all" unless you provide some filter. It just limits to the k nearest.

As for improvements around kNN experience at query time check this issues out:

If y'all have requirements for these issues, let us know!

boorboor · 2023-08-16T09:27:09Z

We reverted back to using cosine similarity and script I wouldn't contradict your argument. However, I see an inconsistency between the value of K and the limit. I see that you are addressing this in #97533.

devatadecco added >bug needs:triage Requires assignment of a team area label labels Jul 19, 2023

pxsalehi added :Search/Search Search-related issues that do not fall into other categories and removed needs:triage Requires assignment of a team area label labels Jul 20, 2023

elasticsearchmachine added the Team:Search Meta label for search team label Jul 20, 2023

benwtrent self-assigned this Jul 20, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Incorrect total value with k-NN search #97807

Incorrect total value with k-NN search #97807

devatadecco commented Jul 19, 2023

elasticsearchmachine commented Jul 20, 2023

benwtrent commented Jul 20, 2023

benwtrent commented Jul 20, 2023

devatadecco commented Jul 21, 2023 •

edited

benwtrent commented Jul 21, 2023

boorboor commented Aug 10, 2023

benwtrent commented Aug 10, 2023

boorboor commented Aug 16, 2023

Incorrect total value with k-NN search #97807

Incorrect total value with k-NN search #97807

Comments

devatadecco commented Jul 19, 2023

Elasticsearch Version

Installed Plugins

Java Version

OS Version

Problem Description

Incorrect total value with k-NN search

Steps to Reproduce

Logs (if relevant)

elasticsearchmachine commented Jul 20, 2023

benwtrent commented Jul 20, 2023

benwtrent commented Jul 20, 2023

devatadecco commented Jul 21, 2023 • edited

benwtrent commented Jul 21, 2023

boorboor commented Aug 10, 2023

benwtrent commented Aug 10, 2023

boorboor commented Aug 16, 2023

devatadecco commented Jul 21, 2023 •

edited