Should we consider supporting sparse vectors?

pgvector, nmslib, and other projects support sparse vectors.

I took a quick look at how they implement this: neither project modifies its HNSW implementation. They support sparse vectors only by changing the metric/scoring function.

- pgvector: https://github.com/pgvector/pgvector/commit/abac7a3f776d4edbb423a000ba5234d3e8eab465
- nmslib: https://github.com/nmslib/nmslib/blob/master/similarity_search/include/space/space_sparse_vector.h#L138
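To illustrate what "only changing the scoring function" can mean, here is a minimal sketch of an inner product over two sparse vectors stored as sorted index/value arrays. This is not the pgvector or nmslib code; the function name `sparse_dot` and the storage layout are my own assumptions for illustration.

```python
from typing import Sequence


def sparse_dot(
    a_idx: Sequence[int],
    a_val: Sequence[float],
    b_idx: Sequence[int],
    b_val: Sequence[float],
) -> float:
    """Inner product of two sparse vectors.

    Assumes each vector is a pair of parallel arrays (dimension indices and
    values) with indices sorted in ascending order and no duplicates.
    """
    i = j = 0
    total = 0.0
    # Two-pointer merge: only dimensions present in both vectors contribute.
    while i < len(a_idx) and j < len(b_idx):
        if a_idx[i] == b_idx[j]:
            total += a_val[i] * b_val[j]
            i += 1
            j += 1
        elif a_idx[i] < b_idx[j]:
            i += 1
        else:
            j += 1
    return total
```

The graph traversal would stay the same under this approach; only the function that compares two stored vectors differs, and its cost depends on the number of non-zero entries rather than on the full dimensionality.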

This raises two questions:

- Q1: It looks like we could support sparse vectors with a similar change to the ScoreFunction. Is that right, and is this a direction we could consider later?
- Q2: Is a graph index (HNSW/DiskANN) the most appropriate way to index sparse vectors? I ask because products such as Elasticsearch (ES), Milvus, and Qdrant mostly use inverted indexes for sparse vectors.

  My thinking: when the query contains many tokens and the search is pure sparse vector retrieval (no filter), a graph index may be relatively fast, because it only needs to visit on the order of topK documents. An inverted index has to look up every query token and then compute the similarity of each matching document in memory. Without pruning, this takes a long time when the number of matching documents is very large.

  ES still uses an inverted index, so I speculate that they chose it because it costs less CPU at indexing time and their pruning optimizations are better. In addition, most users apply filters, and there are many optimizations for merging posting lists, so the final amount of computation is not that large. That would make the inverted index a good fit for ES's user scenarios.
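To make the cost argument in Q2 concrete, here is a minimal sketch of unpruned term-at-a-time scoring over an inverted index. It is only an illustration under my own assumptions: sparse vectors are plain `dict`s from token id to weight, and `build_inverted_index` and `search` are hypothetical names, not the API of ES, Milvus, or Qdrant.

```python
import heapq
from collections import defaultdict
from typing import Dict, List, Tuple

SparseVector = Dict[int, float]  # token id -> weight (assumed layout)


def build_inverted_index(
    docs: Dict[int, SparseVector],
) -> Dict[int, List[Tuple[int, float]]]:
    """Map each token id to its posting list of (doc_id, weight) pairs."""
    index: Dict[int, List[Tuple[int, float]]] = defaultdict(list)
    for doc_id, vec in docs.items():
        for token, weight in vec.items():
            index[token].append((doc_id, weight))
    return index


def search(
    index: Dict[int, List[Tuple[int, float]]],
    query: SparseVector,
    k: int,
) -> List[Tuple[int, float]]:
    """Return the k highest inner-product scores as (doc_id, score) pairs."""
    scores: Dict[int, float] = defaultdict(float)
    # Without pruning, every posting of every query token is visited, so the
    # work grows with the total length of the matching posting lists.
    for token, q_weight in query.items():
        for doc_id, d_weight in index.get(token, ()):
            scores[doc_id] += q_weight * d_weight
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])
```

The inner loop is where the "many matching documents" cost shows up: a long query over frequent tokens touches a large share of the collection unless pruning or a filter shrinks the posting lists first.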

Should we consider supporting sparse vector? #401
