Milvus: A Purpose-Built Vector Data Management System #32

Closed
Sunt-ing opened this issue Sep 12, 2021 · 2 comments

@Sunt-ing
Owner

Sunt-ing commented Sep 12, 2021

https://raw.githubusercontent.com/Sunt-ing/database-system-readings/main/papers/Milvus.pdf

@Sunt-ing
Owner Author

Sunt-ing commented Oct 17, 2021

What problem does the paper solve? Is it important?

Problem:

  • No purpose-built database for large-scale vector processing

Importance:

  • By 2025, 80% of data is expected to be unstructured. Such data is often converted to feature vectors using embedding models such as item2vec, word2vec, doc2vec, and graph2vec.
  • Vector processing is central to machine learning workloads, including recommender systems, NLP, and CV.

How does it solve the problem?

  • Storage: in-memory data backed by persistent storage (local storage, S3, or HDFS).

  • CPU: supports SIMD

  • GPU: supports heterogeneous (CPU + GPU) computing

    • since GPU memory is limited, queries should be processed batch by batch rather than query by query.
    • if the number of queries in a batch exceeds a threshold, use the GPU; otherwise use the CPU.
  • Scalability: multi-node support with eventual consistency

  • Functionalities:

    • Fast data update during query processing
    • Attribute filtering
    • Multi-vector query processing
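
The CPU/GPU rule in the GPU bullet above can be sketched as follows. This is only an illustration of the idea as summarized in these notes: the threshold value, the batch size limit, and the function names are all hypothetical and are not taken from the paper or from Milvus.

```python
from typing import List, Sequence

# Hypothetical threshold; the notes do not state the value Milvus uses.
GPU_BATCH_THRESHOLD = 64


def choose_device(batch_size: int, threshold: int = GPU_BATCH_THRESHOLD) -> str:
    """Return "gpu" when the batch holds more queries than the threshold, else "cpu"."""
    return "gpu" if batch_size > threshold else "cpu"


def split_into_batches(queries: Sequence, max_batch: int) -> List[Sequence]:
    """Chunk the queries so that each batch stays within a fixed size budget."""
    return [queries[i:i + max_batch] for i in range(0, len(queries), max_batch)]


def plan(queries: Sequence, max_batch: int) -> List[str]:
    """Decide, batch by batch, which device would serve each batch."""
    return [choose_device(len(batch)) for batch in split_into_batches(queries, max_batch)]
```

For example, `plan(list(range(150)), max_batch=100)` splits the work into batches of 100 and 50 queries, so the first batch would go to the GPU and the second to the CPU under the assumed threshold of 64.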

How does this work relate to other research?

The whole database is built on top of the Facebook Faiss library with many improvements.

Vector search libraries: Facebook Faiss & Microsoft SPTAG

  • cannot leverage disk or run in distributed mode
  • cannot handle dynamic data (upserts)
  • do not support advanced query processing
  • do not fully leverage SIMD and GPU

Systems that support vector search: Alibaba AnalyticDB-V & Alibaba PASE

  • legacy components (such as the optimizer and storage engine) do not make the best use of CPU and GPU for vector processing
  • do not support advanced query processing

System designed for vector search: Jingdong Vearch

  • 6.4x ~ 47.0x slower than Milvus
  • does not support multi-vector query processing

System Design

Query processing:

  • supported entity: multiple vectors plus numerical attributes, rather than a single vector
  • supported similarity metrics: Euclidean distance, inner product, cosine similarity, Hamming distance, and Jaccard distance
  • supported interfaces: RESTful API, Python, Java, Go, and C++
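
For reference, the five metrics listed above can be written in plain Python as below. This is a minimal sketch of the textbook definitions, not Milvus code; the function names are my own, and Jaccard distance is shown on Python sets as an assumed representation.

```python
import math
from typing import Sequence, Set


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """L2 distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def inner_product(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product; larger means more similar."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of the two vectors after normalizing each to unit length."""
    norm_a = math.sqrt(inner_product(a, a))
    norm_b = math.sqrt(inner_product(b, b))
    return inner_product(a, b) / (norm_a * norm_b)


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of positions at which two equal-length binary vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def jaccard_distance(a: Set, b: Set) -> float:
    """1 - |intersection| / |union|; defined as 0.0 for two empty sets."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)
```

Note that Euclidean, Hamming, and Jaccard are distances (smaller is closer), while inner product and cosine are similarities (larger is closer), so a top-k search has to sort in the right direction for the chosen metric.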

Pluggable index:

  • no single index wins on all dimensions (performance, accuracy, and space overhead)
  • many new indexes are proposed every year, so new ones should be easy to plug in
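
One way to picture a pluggable index is a small common interface plus a registry of implementations. The sketch below is an assumption for illustration only: `VectorIndex`, `FlatIndex`, `INDEX_REGISTRY`, and `create_index` are hypothetical names, and the brute-force index stands in for real index types.

```python
import heapq
import math
from abc import ABC, abstractmethod
from typing import Dict, List, Sequence, Type


class VectorIndex(ABC):
    """Hypothetical interface that every index type would implement."""

    @abstractmethod
    def add(self, entity_id: int, vector: Sequence[float]) -> None: ...

    @abstractmethod
    def search(self, query: Sequence[float], k: int) -> List[int]: ...


class FlatIndex(VectorIndex):
    """Brute-force baseline: exact top-k by Euclidean distance."""

    def __init__(self) -> None:
        self._vectors: Dict[int, List[float]] = {}

    def add(self, entity_id: int, vector: Sequence[float]) -> None:
        self._vectors[entity_id] = list(vector)

    def search(self, query: Sequence[float], k: int) -> List[int]:
        scored = ((math.dist(query, v), eid) for eid, v in self._vectors.items())
        return [eid for _, eid in heapq.nsmallest(k, scored)]


# New index types register here without changing the calling code.
INDEX_REGISTRY: Dict[str, Type[VectorIndex]] = {"flat": FlatIndex}


def create_index(name: str) -> VectorIndex:
    return INDEX_REGISTRY[name]()
```

Callers depend only on `add` and `search`, so swapping in a different index is a matter of registering another class under a new name.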

Dynamic data management:

  • LSM-based
  • snapshot isolation
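
To make the combination of an LSM-style layout and snapshots concrete, here is a toy sketch. It is not Milvus's implementation: `TinyLSM`, the flush limit, and the rule that a snapshot sees only flushed segments are all simplifying assumptions of mine.

```python
from typing import Dict, List, Optional, Tuple

Vector = Tuple[float, ...]
Segment = Dict[int, Optional[Vector]]  # a value of None is a tombstone (deleted entity)


class TinyLSM:
    """Toy LSM store: writes go to a memtable, which is flushed to immutable segments."""

    def __init__(self, memtable_limit: int = 2) -> None:
        self.memtable: Segment = {}
        self.segments: List[Segment] = []  # oldest first; never modified after flush
        self.memtable_limit = memtable_limit

    def upsert(self, entity_id: int, vector: Vector) -> None:
        self.memtable[entity_id] = vector
        self._maybe_flush()

    def delete(self, entity_id: int) -> None:
        self.memtable[entity_id] = None  # out-of-place delete via tombstone
        self._maybe_flush()

    def _maybe_flush(self) -> None:
        if len(self.memtable) >= self.memtable_limit:
            # Build a new list so snapshots taken earlier are unaffected.
            self.segments = self.segments + [self.memtable]
            self.memtable = {}

    def snapshot(self) -> Tuple[Segment, ...]:
        """Freeze the set of segments visible at this instant."""
        return tuple(self.segments)

    @staticmethod
    def get(snapshot: Tuple[Segment, ...], entity_id: int) -> Optional[Vector]:
        """Newest segment wins; returns None if the entity is absent or deleted."""
        for segment in reversed(snapshot):
            if entity_id in segment:
                return segment[entity_id]
        return None
```

A reader that holds an old snapshot keeps seeing the old values even after later upserts and deletes are flushed, which is the property that lets updates proceed while queries are running.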

Storage Management:

  • vector storage: column-oriented where an entity has multiple columns
  • attribute storage: use skip pointers (min/max values) inspired by Snowflake
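
The min/max skip pointers can be illustrated with a small range filter over attribute blocks. This is a sketch under my own assumptions: `Block` and `range_filter` are hypothetical names, and each block is assumed to be non-empty.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Block:
    """A chunk of attribute values with precomputed min/max metadata."""
    values: List[float]

    def __post_init__(self) -> None:
        # Assumes a non-empty block.
        self.min_value = min(self.values)
        self.max_value = max(self.values)


def range_filter(blocks: List[Block], low: float, high: float) -> Tuple[List[float], int]:
    """Return (values in [low, high], number of blocks actually scanned)."""
    matches: List[float] = []
    scanned = 0
    for block in blocks:
        # Skip the whole block when its value range cannot overlap the query range.
        if block.max_value < low or block.min_value > high:
            continue
        scanned += 1
        matches.extend(v for v in block.values if low <= v <= high)
    return matches, scanned
```

With blocks `[1, 2, 3]`, `[10, 12, 15]`, and `[20, 25, 30]`, a filter on `[11, 21]` never touches the first block, because its maximum is below the lower bound.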

Buffer pool: LRU-based
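
A minimal sketch of an LRU buffer pool, assuming a hypothetical `loader` callback that fetches a segment from slower storage on a miss. The class name and the unit of caching are my assumptions, not details from the paper.

```python
from collections import OrderedDict
from typing import Any, Callable, Hashable, List


class LRUBufferPool:
    """Keeps at most `capacity` items in memory, evicting the least recently used."""

    def __init__(self, capacity: int, loader: Callable[[Hashable], Any]) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.loader = loader  # hypothetical: reads an item from disk or object storage
        self._cache: "OrderedDict[Hashable, Any]" = OrderedDict()

    def get(self, key: Hashable) -> Any:
        if key in self._cache:
            self._cache.move_to_end(key)  # mark as most recently used
            return self._cache[key]
        value = self.loader(key)  # cache miss: load from slower storage
        self._cache[key] = value
        if len(self._cache) > self.capacity:
            self._cache.popitem(last=False)  # evict the least recently used item
        return value

    def cached_keys(self) -> List[Hashable]:
        """Keys ordered from least to most recently used."""
        return list(self._cache)
```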

Distributed System: inspired by Snowflake and Amazon Aurora

  • shared storage using S3
  • compute nodes are stateless (unless local cache and SSD are used) and managed by k8s
    • writer: a single instance, since Milvus workloads are read-heavy
    • reader: data is sharded across reader instances with consistent hashing
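
The reader-side placement can be sketched with a basic consistent hash ring. This is a generic textbook version, not the scheme Milvus actually ships: the virtual-node count, the SHA-256 hash, and names such as `ConsistentHashRing` and `reader-0` are assumptions for illustration.

```python
import bisect
import hashlib
from typing import Iterable, List, Tuple


class ConsistentHashRing:
    """Maps keys (e.g. segment names) to nodes; each node owns several ring positions."""

    def __init__(self, nodes: Iterable[str], vnodes: int = 64) -> None:
        self.vnodes = vnodes
        self._ring: List[Tuple[int, str]] = []  # sorted (position, node) pairs
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16)

    def add_node(self, node: str) -> None:
        for i in range(self.vnodes):
            bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    def remove_node(self, node: str) -> None:
        self._ring = [(pos, n) for pos, n in self._ring if n != node]

    def node_for(self, key: str) -> str:
        """Return the first node clockwise from the key's position on the ring."""
        if not self._ring:
            raise LookupError("ring is empty")
        idx = bisect.bisect_right(self._ring, (self._hash(key), "")) % len(self._ring)
        return self._ring[idx][1]
```

The useful property is that removing a reader only reassigns the keys that reader owned; keys on the other readers stay where they were, which limits how much cached data becomes useless when the cluster is resized.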

What could be improved?

  • leverage FPGAs
  • become cloud-native

Others

  • I think Milvus, though popular in industry, is of low novelty, much like Spark (RDD, NSDI 2012).

@Sunt-ing
Owner Author

Sunt-ing commented Oct 25, 2021

Here are a few things I learned from a private conversation with some of Milvus's main authors:

  • Is Milvus profitable? Not yet. Milvus is fully open source and has made no profit as of October 25, 2021, but there are clear paths to profitability, such as selling a distribution or providing a cloud service.
  • Why did Milvus join the Linux Foundation rather than the ASF or CNCF? Both the ASF and CNCF were seriously considered but were not chosen, for two reasons. First, the ASF and CNCF host a large number of projects, so Milvus might not get enough attention. Second, and most importantly, the leader of Milvus plays an important role in the Linux Foundation.
  • Milvus 1.0 (the version described in this paper) implements only eventual consistency, while Milvus 2.0 implements snapshot consistency. Can eventual consistency satisfy most customers? Yes.
