Milvus: A Purpose-Built Vector Data Management System #32

Sunt-ing · 2021-09-12T07:10:57Z

https://raw.githubusercontent.com/Sunt-ing/database-system-readings/main/papers/Milvus.pdf

Sunt-ing · 2021-10-17T13:59:39Z

What problem does the paper solve? Is it important?

Problem:

No purpose-built database for large-scale vector processing

Importance:

By 2025, an estimated 80% of data will be unstructured; such data is often converted to feature vectors by embedding techniques such as item2vec, word2vec, doc2vec, and graph2vec.
Vector processing is central to machine learning applications, including recommender systems, NLP, and CV.

How does it solve the problem?

Storage: memory plus persistent storage (local disk, S3, or HDFS).
CPU: SIMD support
GPU: heterogeneous computing support
- Since GPU memory is limited, computation proceeds batch by batch rather than query by query.
- If the number of queries in a batch exceeds a threshold, use the GPU; otherwise use the CPU.
Scalability: multi-node support with eventual consistency
Functionalities:
- Fast data update during query processing
- Attribute filtering
- Multi-vector query processing
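
The CPU/GPU dispatch rule in the list above can be sketched as follows. This is a minimal illustration under stated assumptions, not Milvus's implementation: `GPU_BATCH_THRESHOLD`, `choose_device`, and `split_into_batches` are hypothetical names, the threshold value is arbitrary, and a "batch" is assumed to be a group of queries.

```python
from typing import List, Sequence

# Hypothetical threshold; the notes state only the rule, not a concrete value.
GPU_BATCH_THRESHOLD = 1000


def choose_device(num_queries: int, threshold: int = GPU_BATCH_THRESHOLD) -> str:
    """Return "gpu" when the batch holds more queries than the threshold, else "cpu"."""
    return "gpu" if num_queries > threshold else "cpu"


def split_into_batches(queries: Sequence, batch_size: int) -> List[Sequence]:
    """Split queries into fixed-size batches so that work is done batch by batch."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [queries[i:i + batch_size] for i in range(0, len(queries), batch_size)]


# Usage: route each batch to a device according to its size.
plan = [(len(batch), choose_device(len(batch)))
        for batch in split_into_batches(list(range(2500)), 1200)]
```

Large batches amortize the cost of moving data to the GPU, which is why small batches are assumed to stay on the CPU here.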

How does this work relate to other research?

The whole database is built on top of the Facebook Faiss library with many improvements.

Vector search libraries: Facebook Faiss & Microsoft SPTAG

cannot leverage disk or support distributed mode
cannot handle dynamic data upsert
cannot support advanced query processing
are not fully optimized for modern CPUs (SIMD) and GPUs

Systems that support vector search: Alibaba AnalyticDB-V & Alibaba PASE

legacy components (such as optimizer and storage engine) do not best leverage CPU and GPU for vector processing
do not support advanced query processing

System designed for vector search: Jingdong Vearch

6.4x ~ 47.0x slower than Milvus
does not support multi-vector query processing

System Design

Query processing:

supported entity: multi-vector with numerical attributes, rather than a single vector
supported similarity metrics: Euclidean distance, inner product, cosine similarity, Hamming distance, and Jaccard distance
supported interface: RESTful API, Python, Java, Go, and C++
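
For reference, the five similarity metrics listed above can be written out in plain Python. This is a minimal sketch using textbook definitions, not Milvus code; the function names are illustrative, Hamming distance is assumed to operate on equal-length binary vectors, and Jaccard distance on sets.

```python
import math
from typing import Sequence, Set


def _check_same_length(a: Sequence, b: Sequence) -> None:
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean (L2) distance between two vectors."""
    _check_same_length(a, b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def inner_product(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of element-wise products."""
    _check_same_length(a, b)
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product divided by the product of the two vector norms."""
    norm_a = math.sqrt(inner_product(a, a))
    norm_b = math.sqrt(inner_product(b, b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return inner_product(a, b) / (norm_a * norm_b)


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of positions at which two binary vectors differ."""
    _check_same_length(a, b)
    return sum(x != y for x, y in zip(a, b))


def jaccard_distance(a: Set, b: Set) -> float:
    """One minus the ratio of intersection size to union size."""
    union = a | b
    if not union:
        return 0.0  # two empty sets are treated as identical
    return 1.0 - len(a & b) / len(union)
```

Euclidean, Hamming, and Jaccard are distances (smaller means more similar), while inner product and cosine are similarities (larger means more similar), so a top-k search has to sort in the direction that matches the metric.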

Pluggable index:

no single index wins on every dimension (performance, accuracy, and space overhead)
many new indexes are published every year

Dynamic data management:

LSM-based
snapshot isolation
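
A toy sketch of how an LSM-style write path and snapshot reads fit together is shown below. It is an illustration under simplifying assumptions, not the Milvus design: `TinyLSMStore` is a hypothetical class, readers see only flushed segments, deletions are recorded as `None` tombstones, and segment merging is omitted.

```python
from typing import Dict, List, Optional, Tuple

Vector = Tuple[float, ...]
Segment = Dict[int, Optional[Vector]]


class TinyLSMStore:
    """Writes go to a memtable, which is flushed into immutable segments."""

    def __init__(self, memtable_limit: int = 2) -> None:
        self.memtable_limit = memtable_limit
        self.memtable: Segment = {}
        # Flushed segments, oldest first; they are never modified after flush.
        self.segments: List[Segment] = []

    def insert(self, key: int, vector: Vector) -> None:
        self.memtable[key] = vector
        self._maybe_flush()

    def delete(self, key: int) -> None:
        self.memtable[key] = None  # tombstone: out-of-place deletion
        self._maybe_flush()

    def _maybe_flush(self) -> None:
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self) -> None:
        if self.memtable:
            self.segments.append(dict(self.memtable))
            self.memtable = {}

    def snapshot(self) -> Tuple[Segment, ...]:
        """Capture the segments visible now; later flushes do not change it."""
        return tuple(self.segments)

    @staticmethod
    def get(snapshot: Tuple[Segment, ...], key: int) -> Optional[Vector]:
        """Look up a key in a snapshot, newest segment first."""
        for segment in reversed(snapshot):
            if key in segment:
                return segment[key]
        return None
```

Because segments are immutable once flushed, a query can keep reading its snapshot while concurrent writes create new segments, which is the property snapshot isolation relies on.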

Storage Management:

vector storage: column-oriented where an entity has multiple columns
attribute storage: use skip pointers (min/max values) inspired by Snowflake
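
The min/max skip pointers for attribute storage can be illustrated with a small range scan that skips pages whose value range cannot match. This is a sketch under assumptions not taken from the notes: `Page`, `build_pages`, and `range_scan` are hypothetical names, and values are assumed to be sorted before being split into pages.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Page:
    values: List[float]  # attribute values stored in this page
    min_value: float     # skip pointer: smallest value in the page
    max_value: float     # skip pointer: largest value in the page


def build_pages(values: Sequence[float], page_size: int) -> List[Page]:
    """Sort values and cut them into pages, recording min/max per page."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    ordered = sorted(values)
    pages = []
    for i in range(0, len(ordered), page_size):
        chunk = ordered[i:i + page_size]
        pages.append(Page(chunk, chunk[0], chunk[-1]))
    return pages


def range_scan(pages: List[Page], low: float, high: float) -> Tuple[List[float], int]:
    """Return values in [low, high] and the number of pages actually read."""
    matches: List[float] = []
    pages_read = 0
    for page in pages:
        # Skip the page if its [min, max] range cannot overlap the query range.
        if page.max_value < low or page.min_value > high:
            continue
        pages_read += 1
        matches.extend(v for v in page.values if low <= v <= high)
    return matches, pages_read
```

With ten values in pages of three, a query for the range [4, 6] reads one page out of four in this sketch; the benefit grows with the selectivity of the attribute filter.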

LRU based buffer pool
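
A minimal sketch of an LRU buffer pool follows. It is not the Milvus implementation: `LRUBufferPool` and its `loader` callback are hypothetical, and the cached unit is assumed to be a whole segment identified by an ID.

```python
from collections import OrderedDict
from typing import Any, Callable, Hashable, List


class LRUBufferPool:
    """Keeps at most `capacity` segments in memory, evicting the least recently used."""

    def __init__(self, capacity: int, loader: Callable[[Hashable], Any]) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.loader = loader  # called on a miss to fetch the segment from storage
        self._cache: "OrderedDict[Hashable, Any]" = OrderedDict()

    def get(self, segment_id: Hashable) -> Any:
        if segment_id in self._cache:
            # Hit: mark the segment as most recently used.
            self._cache.move_to_end(segment_id)
            return self._cache[segment_id]
        # Miss: load the segment, then evict the oldest entry if over capacity.
        data = self.loader(segment_id)
        self._cache[segment_id] = data
        if len(self._cache) > self.capacity:
            self._cache.popitem(last=False)
        return data

    def cached_ids(self) -> List[Hashable]:
        """Segment IDs from least to most recently used."""
        return list(self._cache)
```

`OrderedDict` gives constant-time reordering and eviction, which keeps the bookkeeping cost negligible next to loading a segment from disk or shared storage.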

Distributed System: inspired by Snowflake and Amazon Aurora

shared storage using S3
compute nodes are stateless (unless local cache and SSD are used) and managed by k8s
- writer: a single instance, since Milvus workloads are read-heavy
- reader: data is sharded among reader instances with consistent hashing
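
The consistent-hashing scheme used to spread data across readers can be sketched with a small hash ring. This is an illustration, not Milvus code: `HashRing`, the virtual-node count, the SHA-256 hash, and the `reader-N` / `segment-N` names are all assumptions made for the example.

```python
import bisect
import hashlib
from typing import Iterable, List, Tuple


def _hash(key: str) -> int:
    """Map a string to a 64-bit position on the ring."""
    return int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Consistent-hash ring with virtual nodes for smoother balance."""

    def __init__(self, nodes: Iterable[str], vnodes: int = 64) -> None:
        self.vnodes = vnodes
        self._ring: List[Tuple[int, str]] = []  # sorted (position, node) pairs
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        for i in range(self.vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def remove_node(self, node: str) -> None:
        self._ring = [(pos, n) for pos, n in self._ring if n != node]

    def lookup(self, key: str) -> str:
        """Return the node owning the first ring position at or after the key."""
        if not self._ring:
            raise LookupError("ring has no nodes")
        idx = bisect.bisect_left(self._ring, (_hash(key), ""))
        return self._ring[idx % len(self._ring)][1]  # wrap around the ring


# Usage: assign segments to reader instances.
ring = HashRing(["reader-0", "reader-1", "reader-2"])
owner = ring.lookup("segment-42")
```

When a reader is added or removed, only the keys that mapped to that reader move; all other assignments stay put, which limits how much cached data has to be reloaded.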

What could be improved?

leverage FPGA
become cloud-native

Others

I think Milvus, though popular in industry, is of low novelty, much like Spark (RDD, NSDI 2012).

Sunt-ing · 2021-10-25T08:06:58Z

I can share a few things I learned from a private talk with some of the main Milvus authors:

Is Milvus profitable? Not yet. Milvus is fully open source and makes no profit as of October 25, 2021, but there are clear paths to profitability, such as selling a distribution or providing a cloud service.
Why did Milvus join the Linux Foundation rather than ASF or CNCF? Both ASF and CNCF were seriously considered but not chosen, for two reasons. First, ASF and CNCF host a large number of projects, so Milvus might not get enough attention. Second, and most importantly, the leader of Milvus plays an important role in the Linux Foundation.
Milvus 1.0 (the version described in this paper) implements only eventual consistency, while Milvus 2.0 implements snapshot consistency. Can eventual consistency satisfy most customers? Yes.

Sunt-ing added c:SIGMOD t:ml industrial t:db labels Sep 12, 2021

Sunt-ing closed this as completed Oct 17, 2021

Sunt-ing mentioned this issue Oct 13, 2022

Manu: A Cloud Native Vector Database Management System #126

Closed
