Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorrect total value with k-NN search #97807

Open
devatadecco opened this issue Jul 19, 2023 · 8 comments
Open

Incorrect total value with k-NN search #97807

devatadecco opened this issue Jul 19, 2023 · 8 comments
Assignees
Labels
>bug :Search/Search Search-related issues that do not fall into other categories Team:Search Meta label for search team

Comments

@devatadecco
Copy link

Elasticsearch Version

Version: 8.8.0, Build: docker/c01029875a091076ed42cdb3a41c10b1a9a5a20f/2023-05-23T17:16:07.179039820Z, JVM: 20.0.1

Installed Plugins

None

Java Version

bundled

OS Version

Linux 5179425a12d1 5.10.104-linuxkit #1 SMP PREEMPT Thu Mar 17 17:05:54 UTC 2022 aarch64 aarch64 aarch64 GNU/Linux

Problem Description

Incorrect total value with k-NN search

Running knn queries total value always match the k param in the query while there are more than k results, if k exceed the right total value it will be shown correct.

Steps to Reproduce

1 - Create index with proper mapping for knn search
2- Populate more than one document
3- Make knn search with k=1 and a vector, check total
4- Increase k value with same vector, check total

PUT /test
{
"mappings": {
"properties": {
"title-vector": {
"type": "dense_vector",
"dims": 5,
"index": true,
"similarity": "dot_product"
}
}
}
}

POST /test/_bulk?refresh=true&pretty
{ "index": { "_id": "1" } }
{ "title-vector": [1, 0, 0, 0, 0] }
{ "index": { "_id": "2" } }
{ "title-vector": [1, 0, 0, 0, 0] }
{ "index": { "_id": "3" } }
{ "title-vector": [1, 0, 0, 0, 0] }

POST test/_search
{ "knn": {
"field": "title-vector",
"query_vector": [1, 0, 0, 0, 0],
"k": 2,
"num_candidates": 10000}}

Logs (if relevant)

No response

@devatadecco devatadecco added >bug needs:triage Requires assignment of a team area label labels Jul 19, 2023
@pxsalehi pxsalehi added :Search/Search Search-related issues that do not fall into other categories and removed needs:triage Requires assignment of a team area label labels Jul 20, 2023
@elasticsearchmachine elasticsearchmachine added the Team:Search Meta label for search team label Jul 20, 2023
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-search (Team:Search)

@benwtrent benwtrent self-assigned this Jul 20, 2023
@benwtrent
Copy link
Member

This is interesting. I tried the same thing on the deprecated _knn_search API and it worked ok.

Something weird with parsing or serialization. Digging in :)

@benwtrent
Copy link
Member

OK, @devatadecco I have an answer for you that may incur further questions :)

kNN is an approximate search. This means the structure does it's best to NOT have to search all the available vectors.

In your simple example, the number of vectors is small, so it might be surprising that the total hit count for global k matches k.

But, let's consider the situation where we have 1M+ vectors. Per-shard, we will only gather up to numCandidates. While we keep track of the number of vectors visited while exploring the graph, these vectors used for exploration will likely NOT be in the top-k. This leads us to the following options:

    1. Should documents not in the top-k (only used when exploring the graph) be added to our hit count?
    1. Should we logically restrict the hit count to the number of documents returned in the global top-k?
    1. Should the hit count be a summation of all the numCandidates gathered from every shard?
    1. Should the hit count be a summation of all the topK gathered from every shard?
    1. Should the hit count just be the number of documents that have the kNN field?

Currently, the kNN section is a global topK over every shard. You ask for the top 100 docs, you get the top 100 docs from a global view over all your data. In this case, option 2 gives us the most consistency (the number doesn't fluctuate with number of shards) and it is the one currently implemented. Given the goals of kNN (always restricting to finding the nearest vectors), option 5, while consistent, seems to break the idea of hits referring to the nearest k docs.

We can contrast kNN with a traditional term query. The hit count for a term query is the number of documents matching the term. What does this mean in the context of kNN? The number of docs that have the vector field? Or the number of docs matched (globally) for the knn search?

@devatadecco
Copy link
Author

devatadecco commented Jul 21, 2023

Thank you for responding.

Provided example focuses on showing the behavior. My intention is to maintain consistent pagination while we set min_score.

And if you're asking me! I prefer not to have k in the query at all, and instead, I would favor using size and min_score so that the answer is straightforward. Where shards hit a record above the min_sore count it for total, and finally, orchestrator returns top scored records of given size of page.

Let me know if there is a proper way for pagination that I miss?

@benwtrent
Copy link
Member

And if you're asking me! I prefer not to have k in the query at all, and instead, I would favor using size and min_score so that the answer is straightforward.

Interesting, I can see that being a potential solution. We will have to think some more. There might be a better solution here that neither of us see.

I do agree, that k should default to size() and if using pagination, we can be even nicer and set it to size + from and cut off the from when merging. The tricky part is handling num_candidates correctly and deep pagination (we would have a hard limit at 10,000 K). We would have to provide a good default there as it relates to k.

Let me know if there is a proper way for pagination that I miss?

The tricky part with pagination is that kNN has a static upper limit of 10k. So, at some point you would hit that limit depending on how deep you are paging.

Really, as it is right now, there is no nice way to pagination global top k. I think some future planned work could make this better. I will try to keep this issue updated!

@boorboor
Copy link

Any update?

@benwtrent
Copy link
Member

@devatadecco & @boorboor

I would argue that the "total hit" is still correct for kNN since kNN is technically always a "match_all" unless you provide some filter. It just limits to the k nearest.

As for improvements around kNN experience at query time check this issues out:

If y'all have requirements for these issues, let us know!

@boorboor
Copy link

We reverted back to using cosine similarity and script I wouldn't contradict your argument. However, I see an inconsistency between the value of K and the limit. I see that you are addressing this in #97533.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>bug :Search/Search Search-related issues that do not fall into other categories Team:Search Meta label for search team
Projects
None yet
Development

No branches or pull requests

5 participants