1. Install dependencies
   pymilvus[model] v2.5.10
   openai v1.82.0
   requests v2.32.3
   tqdm v4.67.1
   torch v2.7.0

In [27]:
!pip install "pymilvus[model]==2.5.10" openai==1.82.0 requests==2.32.3 tqdm==4.67.1 torch==2.7.0

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Looking in indexes: http://mirrors.cloud.aliyuncs.com/pypi/simple/

2. Prepare a DeepSeek API key: store it in an environment variable and read it from there

In [28]:
import os

api_key = os.getenv("DEEPSEEK_API_KEY")
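`os.getenv` returns `None` when the variable is unset, and the failure then only shows up later, at the first API call. A minimal sketch of failing early instead; the helper name `require_env` and the demo variable `EXAMPLE_API_KEY` are assumptions for illustration, not part of the notebook:

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`; raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Demo with a placeholder variable so the snippet runs on its own.
os.environ["EXAMPLE_API_KEY"] = "sk-demo"
print(require_env("EXAMPLE_API_KEY"))
```

In the notebook this would read `api_key = require_env("DEEPSEEK_API_KEY")`.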

3. Prepare the data (already downloaded, so this step is skipped)

In [None]:
#!wget https://github.com/milvus-io/milvus-docs/releases/download/v2.4.6-preview/milvus_docs_2.4.x_en.zip
#!unzip -q milvus_docs_2.4.x_en.zip -d milvus_docs

4. Load the data

In [54]:
from glob import glob

text_lines = []

for file_path in glob("milvus_docs/en/faq/*.md", recursive=True):
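    # Open the file read-only and load its contents
    with open(file_path, "r", encoding="utf-8") as file:
        file_text = file.read()

    # Split on "# " as a simple way to break the text into heading-level chunks
    text_lines += file_text.split("# ")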

In [55]:
len(text_lines)

72
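Splitting on `"# "` leaves an empty first fragment when a file starts with a heading, and a trailing `#` on fragments that precede a `##` heading. A sketch of a slightly stricter splitter, assuming a hypothetical helper name `split_by_heading`; using it would change the chunk count reported above:

```python
def split_by_heading(markdown_text: str) -> list[str]:
    """Split on "# ", trim leftover '#' markers and whitespace, and drop empty fragments."""
    chunks = []
    for fragment in markdown_text.split("# "):
        # Remove surrounding whitespace, then any '#' left over from a '##' heading marker.
        fragment = fragment.strip().rstrip("#").rstrip()
        if fragment:
            chunks.append(fragment)
    return chunks


sample = (
    "# Title\n\nIntro text.\n\n"
    "## First question\n\nAnswer one.\n\n"
    "## Second question\n\nAnswer two.\n"
)
for chunk in split_by_heading(sample):
    print(repr(chunk))
```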

5. Create a DeepSeek client through the OpenAI-style API

In [56]:
from openai import OpenAI

deepseek_client = OpenAI(
    api_key=api_key,
    base_url="https://api.deepseek.com/v1",
)

6. Define an embedding model using Milvus's DefaultEmbeddingFunction

In [57]:
from pymilvus import model as milvus_model

embedding_model = milvus_model.DefaultEmbeddingFunction()

6.1 Test: convert a piece of text into a vector

In [58]:
test_embedding = embedding_model.encode_queries(["That is a test"])[0]
embedding_dim = len(test_embedding)
print(embedding_dim)
print(test_embedding[:10])

768
[-0.02752976  0.0608853   0.00388525 -0.00215193 -0.02774976 -0.0118618
 -0.04020916 -0.06023417 -0.03813156  0.0100272 ]


7. Create the Milvus collection

In [59]:
from pymilvus import MilvusClient

milvus_client = MilvusClient(uri="./my_milvus.db")

collection_name = "my_rag_collection"

In [60]:
# Drop the collection if it already exists
if milvus_client.has_collection(collection_name):
    milvus_client.drop_collection(collection_name)

In [61]:
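milvus_client.create_collection(
    collection_name=collection_name,
    dimension=embedding_dim,
    # metric_type: how similarity between vectors is computed; pick the one that suits your embedding model.
    #   IP (inner product): larger values usually mean more similar.
    #   L2 (Euclidean distance): smaller values mean more similar.
    #   COSINE (cosine similarity): larger values mean more similar.
    metric_type="IP",
    # consistency_level: how soon reads see newly written data; a trade-off between freshness and performance.
    #   Strong: reads always see the latest data; may be slightly slower.
    #   Bounded (bounded staleness): reads may return data a few seconds old; better performance (the default).
    #   Session: a client immediately sees its own writes.
    #   Eventually: new data becomes visible eventually, with no timing guarantee; best performance.
    consistency_level="Strong",
)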
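To make the three metric types concrete, here is a small pure-Python sketch using the textbook definitions on toy 2-D vectors. The function names and vectors are illustrative assumptions, and this is not how Milvus computes the metrics internally:

```python
import math


def inner_product(a, b):
    """Sum of element-wise products; larger means more similar."""
    return sum(x * y for x, y in zip(a, b))


def l2_distance(a, b):
    """Euclidean distance; smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Inner product of the two vectors divided by the product of their lengths."""
    norm_a = math.sqrt(inner_product(a, a))
    norm_b = math.sqrt(inner_product(b, b))
    return inner_product(a, b) / (norm_a * norm_b)


query = [1.0, 0.0]
near = [0.8, 0.6]  # points in roughly the same direction as the query
far = [0.0, 1.0]   # orthogonal to the query

for name, vec in [("near", near), ("far", far)]:
    print(name, inner_product(query, vec), l2_distance(query, vec), cosine_similarity(query, vec))
```

For unit-length vectors such as these, inner product and cosine similarity coincide, which is one reason IP is a common choice with normalized embeddings.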

8. Insert the data: write the text chunks loaded earlier into the collection

In [62]:
from tqdm import tqdm

data = []

doc_embeddings = embedding_model.encode_documents(text_lines)

for i, line in enumerate(tqdm(text_lines, desc = "Creating embeddings")):
    data.append({ "id": i, "vector": doc_embeddings[i], "text": line })

milvus_client.insert(collection_name=collection_name, data=data)


Creating embeddings: 100%|██████████| 72/72 [00:00<00:00, 95114.93it/s]


{'insert_count': 72, 'ids': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71], 'cost': 0}
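A single insert call is fine for 72 rows. For a larger corpus, one option is to send the rows in fixed-size batches. A minimal sketch, assuming a hypothetical helper `iter_batches` and an arbitrary batch size:

```python
def iter_batches(rows, batch_size):
    """Yield consecutive slices of `rows`, each with at most `batch_size` items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


# Toy rows standing in for the {"id", "vector", "text"} dicts built above.
rows = [{"id": i} for i in range(10)]
for batch in iter_batches(rows, 4):
    # In the notebook, each batch could be passed to
    # milvus_client.insert(collection_name=collection_name, data=batch)
    print(len(batch))
```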

9. Define the question

In [63]:
question = "How is data stored in milvus?"

10. Search the collection for the question

In [64]:
search_res = milvus_client.search(
    collection_name=collection_name,
    data=embedding_model.encode_queries([question]),  # convert the question into a vector
    limit=3,  # return the top 3 results
    search_params={"metric_type": "IP", "params": {}},  # inner-product similarity
    output_fields=["text"],  # return the text field
)
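Conceptually, a top-k search with the IP metric scores every stored vector against the query and keeps the highest scores. The brute-force sketch below shows that idea on toy data; it is an illustration under that assumption, not Milvus's actual search implementation, and the function name is hypothetical:

```python
def top_k_by_inner_product(query, vectors, k):
    """Return up to k (index, score) pairs with the highest inner product against `query`."""
    scored = []
    for idx, vec in enumerate(vectors):
        score = sum(q * v for q, v in zip(query, vec))
        scored.append((score, idx))
    # Highest score first, because a larger inner product means more similar.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(idx, score) for score, idx in scored[:k]]


vectors = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0], [-1.0, 0.0]]
print(top_k_by_inner_product([1.0, 0.0], vectors, k=3))
```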

In [65]:
# Inspect the search results
import json

retrieved_lines_with_distances = [
    (res["entity"]["text"], res["distance"]) for res in search_res[0]
]

print(json.dumps(retrieved_lines_with_distances, indent=4))

[
    [
        " Where does Milvus store data?\n\nMilvus deals with two types of data, inserted data and metadata. \n\nInserted data, including vector data, scalar data, and collection-specific schema, are stored in persistent storage as incremental log. Milvus supports multiple object storage backends, including [MinIO](https://min.io/), [AWS S3](https://aws.amazon.com/s3/?nc1=h_ls), [Google Cloud Storage](https://cloud.google.com/storage?hl=en#object-storage-for-companies-of-all-sizes) (GCS), [Azure Blob Storage](https://azure.microsoft.com/en-us/products/storage/blobs), [Alibaba Cloud OSS](https://www.alibabacloud.com/product/object-storage-service), and [Tencent Cloud Object Storage](https://www.tencentcloud.com/products/cos) (COS).\n\nMetadata are generated within Milvus. Each Milvus module has its own metadata that are stored in etcd.\n\n###",
        0.65726637840271
    ],
    [
        "How does Milvus flush data?\n\nMilvus returns success when inserted data are loaded to the

In [66]:
# Join the retrieved passages into a single context string
context = "\n".join(
    [line_with_distance[0] for line_with_distance in retrieved_lines_with_distances]
)
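The join above keeps every retrieved passage regardless of its score. One possible variant drops weak matches first. A sketch, assuming `(text, score)` pairs shaped like `retrieved_lines_with_distances`, a hypothetical helper name, and an arbitrary threshold that would need tuning:

```python
def build_context(hits, min_score=0.0, separator="\n"):
    """Join the texts of hits whose score is at least `min_score`, preserving order."""
    return separator.join(text for text, score in hits if score >= min_score)


# Toy hits with made-up scores; with the IP metric a higher score means a better match.
hits = [("passage A", 0.66), ("passage B", 0.50), ("passage C", 0.20)]
print(build_context(hits, min_score=0.4))
```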

In [67]:
context

" Where does Milvus store data?\n\nMilvus deals with two types of data, inserted data and metadata. \n\nInserted data, including vector data, scalar data, and collection-specific schema, are stored in persistent storage as incremental log. Milvus supports multiple object storage backends, including [MinIO](https://min.io/), [AWS S3](https://aws.amazon.com/s3/?nc1=h_ls), [Google Cloud Storage](https://cloud.google.com/storage?hl=en#object-storage-for-companies-of-all-sizes) (GCS), [Azure Blob Storage](https://azure.microsoft.com/en-us/products/storage/blobs), [Alibaba Cloud OSS](https://www.alibabacloud.com/product/object-storage-service), and [Tencent Cloud Object Storage](https://www.tencentcloud.com/products/cos) (COS).\n\nMetadata are generated within Milvus. Each Milvus module has its own metadata that are stored in etcd.\n\n###\nHow does Milvus flush data?\n\nMilvus returns success when inserted data are loaded to the message queue. However, the data are not yet flushed to the dis

In [68]:
question

'How is data stored in milvus?'

11. Define the prompt

In [69]:
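# System prompt (in Chinese): "You are an AI assistant. You can find the answer to a question in the provided context passages."
SYSTEM_PROMPT = """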
Human: 你是一个 AI 助手。你能够从提供的上下文段落片段中找到问题的答案。
"""
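# User prompt (in Chinese): "Use the information enclosed in <context> tags to answer the question enclosed in <question> tags.
# Finally, append a Chinese translation of the original answer, wrapped in <translated> and </translated> tags."
USER_PROMPT = f"""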
请使用以下用 <context> 标签括起来的信息片段来回答用 <question> 标签括起来的问题。最后追加原始回答的中文翻译，并用 <translated>和</translated> 标签标注。
<context>
{context}
</context>
<question>
{question}
</question>
<translated>
</translated>
"""

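Because `USER_PROMPT` is an f-string, `context` and `question` are substituted once, when the cell runs; changing `question` afterwards does not update the prompt. A minimal sketch of a helper that rebuilds the prompt for each question, assuming a hypothetical function name and an English paraphrase of the instruction text:

```python
def build_user_prompt(context: str, question: str) -> str:
    """Build a fresh user prompt from the given context and question."""
    return (
        "Use the information enclosed in <context> tags to answer "
        "the question enclosed in <question> tags.\n"
        f"<context>\n{context}\n</context>\n"
        f"<question>\n{question}\n</question>\n"
    )


# Each call reflects the arguments passed in, unlike a module-level f-string.
print(build_user_prompt("Milvus stores metadata in etcd.", "Where is metadata stored?"))
```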
In [70]:
USER_PROMPT

"\n请使用以下用 <context> 标签括起来的信息片段来回答用 <question> 标签括起来的问题。最后追加原始回答的中文翻译，并用 <translated>和</translated> 标签标注。\n<context>\n Where does Milvus store data?\n\nMilvus deals with two types of data, inserted data and metadata. \n\nInserted data, including vector data, scalar data, and collection-specific schema, are stored in persistent storage as incremental log. Milvus supports multiple object storage backends, including [MinIO](https://min.io/), [AWS S3](https://aws.amazon.com/s3/?nc1=h_ls), [Google Cloud Storage](https://cloud.google.com/storage?hl=en#object-storage-for-companies-of-all-sizes) (GCS), [Azure Blob Storage](https://azure.microsoft.com/en-us/products/storage/blobs), [Alibaba Cloud OSS](https://www.alibabacloud.com/product/object-storage-service), and [Tencent Cloud Object Storage](https://www.tencentcloud.com/products/cos) (COS).\n\nMetadata are generated within Milvus. Each Milvus module has its own metadata that are stored in etcd.\n\n###\nHow does Milvus flush data?\n\nMilvus 

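Since the prompt asks the model to wrap the Chinese translation in `<translated>` tags, the reply can be split into the answer and its translation afterwards. A sketch using a regular expression; the helper name is an assumption, and a reply without a complete tag pair simply yields no translation:

```python
import re


def split_translation(reply: str):
    """Return (answer, translation); translation is None if no complete tag pair is found."""
    match = re.search(r"<translated>(.*?)</translated>", reply, flags=re.DOTALL)
    if match is None:
        return reply.strip(), None
    # Everything outside the tagged block is treated as the answer.
    answer = (reply[:match.start()] + reply[match.end():]).strip()
    return answer, match.group(1).strip()


# Toy reply standing in for response.choices[0].message.content.
demo_reply = "Metadata is stored in etcd.\n\n<translated>\n元数据存储在 etcd 中。\n</translated>"
print(split_translation(demo_reply))
```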
In [71]:
response = deepseek_client.chat.completions.create(
    model="deepseek-chat",
    messages=[
        { "role": "system", "content": SYSTEM_PROMPT },
        { "role": "user", "content": USER_PROMPT }
    ]
)

print(response.choices[0].message.content)

Milvus stores data in two main categories: inserted data and metadata.

1. **Inserted Data** (including vector data, scalar data, and collection-specific schema):
   - Stored as incremental logs in persistent storage.
   - Supports multiple object storage backends: MinIO, AWS S3, Google Cloud Storage (GCS), Azure Blob Storage, Alibaba Cloud OSS, and Tencent Cloud Object Storage (COS).

2. **Metadata**:
   - Generated internally by Milvus.
   - Each Milvus module has its own metadata stored in etcd.

Additionally, data flushing occurs asynchronously by default. When data is inserted, it first goes to a message queue, and the data node later writes it to persistent storage. Calling `flush()` forces immediate writing of all queued data to disk.

<translated>
Milvus以两种主要类型存储数据：插入数据和元数据。

1. **插入数据**（包括向量数据、标量数据和集合特定模式）：
   - 以增量日志形式存储在持久化存储中。
   - 支持多种对象存储后端：MinIO、AWS S3、Google云存储(GCS)、Azure Blob存储、阿里云OSS和腾讯云对象存储(COS)。

2. **元数据**：
   - 由Milvus内部生成。
   - 每个Milvus模块都有自己存储在etcd中的元数据。

默认情况下，