Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Indexed Vector Similarity and kNN Search #35101

Closed
alexey-milovidov opened this issue Mar 7, 2022 · 13 comments
Closed

Indexed Vector Similarity and kNN Search #35101

alexey-milovidov opened this issue Mar 7, 2022 · 13 comments
Labels

Comments

@alexey-milovidov
Copy link
Member

Use case

Search and analytics on datasets with vector embeddings.

Describe the solution you'd like

The solution is being implemented by the team led by @FArthur-cmd

@alexey-milovidov alexey-milovidov added feature st-community-taken External developer is working on that labels Mar 7, 2022
@qieqieplus
Copy link
Contributor

Good to see a soluntion for KNN(maybe ANN sometime later?)
I actually wrote one arrayDistance myself as my toy project to learn Clickhouse.

@alexey-milovidov
Copy link
Member Author

@qieqieplus Now we also have Lp norms and distance functions:

https://presentations.clickhouse.com/release_21.11/#20
#27933

Good for non-indexed, brute-force search.

@FArthur-cmd
Copy link
Contributor

In fact, we are now working with algorithms for ANN

@qieqieplus
Copy link
Contributor

qieqieplus commented Mar 11, 2022

@qieqieplus Now we also have Lp norms and distance functions:

https://presentations.clickhouse.com/release_21.11/#20 #27933

Good for non-indexed, brute-force search.

Yes I'm aware of that, as I commented in #27933, those functions only work for data type tuple.
Correct me if i'm wrong, but data type array could be much more suitable for high dimensional dense vectors (like
hundreds if not thousands dimensions, in real world scenario), especially for engines other than memory .

I made a quick benchmark of BF search on L2Distance and a naive array based approach using higher-order functions:
SELECT id, arraySum(arrayMap((x, y)->(x - y) * (x - y), feature, [?])) as l2sqr from ? order by l2sqr limit 10;
Also with my native arrayL2Distance implementation backed by BLAS library Eigen.

With 1Million random generate 256D Uint8 vectors, each query time and memory increased:
L2Distance: 500ms ~ 700ms, 982MB
array functions: 400ms ~ 500ms, 410MB
native arrayL2Distance: 240ms ~ 320ms, 298MB

Test Platform: CPU: Xeon(R) CPU E5-2630 v4

@alexey-milovidov
Copy link
Member Author

@qieqieplus Yes, it makes sense to implement functions on arrays with BLAS, the performance numbers are very impressive.
I will be happy if you'll send a PR.

@qieqieplus
Copy link
Contributor

@alexey-milovidov I'm glad to. I'll send a PR when ready.

@zhanglistar
Copy link
Contributor

Wow, clickhouse now is a set of many tools.

@FArthur-cmd
Copy link
Contributor

Faiss index #37011

HNSW index #37007

@FArthur-cmd
Copy link
Contributor

Annoy #37215
DiskANN #37392

@Sivannnnnn
Copy link

Hi, when is Clickhouse going to launch vector retrieval for KNN or Ann? Any plans?
@alexey-milovidov @FArthur-cmd

@alexey-milovidov alexey-milovidov removed the st-community-taken External developer is working on that label Jun 2, 2023
@alexey-milovidov
Copy link
Member Author

ClickHouse For AI presentation: https://presentations.clickhouse.com/meetup74/ai/
A practical overview of vector search in ClickHouse: https://clickhouse.com/blog/vector-search-clickhouse-p2

@alexey-milovidov
Copy link
Member Author

@alexey-milovidov
Copy link
Member Author

Attempt to make Annoy index ready to use: #50312

But it is not required, as in the video above, I showed how to construct the same index manually, and it allows tuning between speed and precision dynamically.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

5 participants