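A vector store is a specialized data store that indexes and retrieves information based on vector representations (embeddings). It is commonly used to search unstructured data such as text, images, and audio, retrieving relevant information by semantic similarity rather than by exact keyword matching.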

![Similarity Search](https://python.langchain.com/assets/images/vectorstores-2540b4bc355b966c99b0f02cfdddb273.png)
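
To make "semantic similarity" concrete, here is a minimal sketch of cosine similarity, one common way to compare two embedding vectors. The three-dimensional vectors are made-up values for illustration only; real embedding models return vectors with hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


# Hypothetical "embeddings" (not produced by any real model)
query = [1.0, 0.0, 1.0]
exam = [0.9, 0.1, 0.8]      # points in roughly the same direction as the query
tutorial = [0.0, 1.0, 0.1]  # points in a very different direction

print(cosine_similarity(query, exam))
print(cosine_similarity(query, tutorial))
```

The vector whose direction is closer to the query gets the higher score, regardless of whether the underlying texts share any keywords.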

The official [table](https://python.langchain.com/docs/integrations/vectorstores/) lists the available vector store integrations.

Next, we use OpenAI's embeddings interface to create an in-memory [vector store](https://python.langchain.com/api_reference/core/vectorstores/langchain_core.vectorstores.in_memory.InMemoryVectorStore.html#inmemoryvectorstore).

In [26]:
# As before, put all of our configuration up front
from dotenv import load_dotenv, find_dotenv

_ = load_dotenv(find_dotenv())

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="gpt-4o",
    temperature=0.0,
    max_tokens=2048,
    timeout=None,
    max_retries=2,
)

In [27]:
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_openai import OpenAIEmbeddings

# Initialize the vector store
vector_store = InMemoryVectorStore(OpenAIEmbeddings())

In [None]:
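# There are several ways to read files; the first is to write our own function
from langchain_core.documents import Document
import chardet

def read_from_file(path: str) -> Document:
    # Detect the file's encoding from its raw bytes
    with open(path, "rb") as f:
        raw = f.read()
    result = chardet.detect(raw)
    # chardet can report None when it cannot decide (e.g. an empty file); fall back to UTF-8
    encoding = result["encoding"] or "utf-8"
    # Read the text using the detected encoding
    with open(path, "r", encoding=encoding) as f:
        content = f.read()
    return Document(page_content=content, metadata={"source": path})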

# Alternatively, use one of LangChain's built-in loaders
# This one needs an extra dependency:
# !pip install pypdf
from langchain_community.document_loaders import PyPDFLoader
doc_1 = PyPDFLoader("./documents/ECE273_syllabus.pdf").load()
doc_2 = PyPDFLoader("./documents/ECE208_homework_tutorial.pdf").load()
doc_3 = PyPDFLoader("./documents/ECE269_Final-A-Solution.pdf").load()

LangChain provides many ways to read files through its [Document loaders](https://python.langchain.com/docs/how_to/#document-loaders), covering PDF, web pages, CSV, JSON, and more. It also offers data connectors for a number of platforms; see the [official documentation](https://python.langchain.com/docs/integrations/document_loaders/).

In [29]:
# To keep things simple, we keep only the first page of each document, i.e. doc[0]
doc_1 = Document(id="ece_273_page_1", page_content=doc_1[0].page_content, metadata=doc_1[0].metadata)
doc_2 = Document(id="ece_208_page_1", page_content=doc_2[0].page_content, metadata=doc_2[0].metadata)
doc_3 = Document(id="ece_269_page_1", page_content=doc_3[0].page_content, metadata=doc_3[0].metadata)

In [30]:
documents = [doc_1, doc_2, doc_3]
vector_store.add_documents(documents=documents)

['ece_273_page_1', 'ece_208_page_1', 'ece_269_page_1']

In [41]:
# Iterate over the first 3 entries in the vector store and inspect their contents
top_n = 3
for index, (doc_id, entry) in enumerate(vector_store.store.items()):
    if index < top_n:
        print("==============================")
        # each entry is a dict with keys 'id', 'vector', 'text', 'metadata'
        print(f"Document {index + 1}: {doc_id}\nContent: {entry['text'][:200]}...")
    else:
        break

Document 1: ece_273_page_1
Content: ECE273: Convex Optimization and Applications — Spring 2025
Course Description:This course covers the theoretical and algorithmic foundations of optimiza-
tion. We will cover some convex analysis and d...
Document 2: ece_208_page_1
Content: ECE 208 Homework tutorial
WI-22
This is a short tutorial on how to use GitHub Classroom. It will cover some usage of GitHub
commands, but we recommend you to refer to other materials for learning how ...
Document 3: ece_269_page_1
Content: Exam-A Solution
ECE 269- Linear Algebra and Applications, Final (A)
1. (20 points) Prove or disprove each statement:
(a) Every orthogonal matrixQ → Rn→n is diagonalizable overR.
Solution: False. Count...


In [47]:
# Run a vector search with the query "document that contains some questions"
# k=1 returns only the single most similar document

results = vector_store.similarity_search(query="document that contains some questions", k=1)
for doc in results:
    print(f"* {doc.page_content[:200]}\nmetadata:[{doc.metadata}]")

* Exam-A Solution
ECE 269- Linear Algebra and Applications, Final (A)
1. (20 points) Prove or disprove each statement:
(a) Every orthogonal matrixQ → Rn→n is diagonalizable overR.
Solution: False. Count
metadata:[{'producer': 'macOS Version 15.3.2 (Build 24D81) Quartz PDFContext', 'creator': 'LaTeX with hyperref', 'creationdate': "D:20250407183239Z00'00'", 'moddate': "D:20250407183239Z00'00'", 'source': './documents/ECE269_Final-A-Solution.pdf', 'total_pages': 6, 'page': 0, 'page_label': '1'}]


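As shown above, the store was created with `vector_store = InMemoryVectorStore(OpenAIEmbeddings())`, so every document added to `vector_store` is embedded. When we search with `query="document that contains some questions"`, none of our documents contains the word "question", yet the embedding model captures what the documents are about and returns the one that best matches the query (last term's final exam PDF).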
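
The embed-on-add, embed-the-query, rank-by-similarity flow can be sketched with a toy store. This is not how `InMemoryVectorStore` is implemented; `ToyVectorStore`, `toy_embed`, and `VOCAB` are hypothetical names, and the word-count "embedding" does not capture meaning the way a real embedding model does. It only illustrates the mechanics.

```python
import math
from collections import Counter

# Hypothetical vocabulary for the toy embedding (an assumption, not part of LangChain)
VOCAB = ["exam", "solution", "homework", "github", "course", "grading"]


def toy_embed(text: str) -> list[float]:
    """Count how often each vocabulary word occurs; a stand-in for a real model."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in VOCAB]


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class ToyVectorStore:
    def __init__(self, embed):
        self.embed = embed
        self.store: dict[str, dict] = {}

    def add_texts(self, texts: dict[str, str]) -> list[str]:
        # Every text is embedded once, at insertion time
        for doc_id, text in texts.items():
            self.store[doc_id] = {"id": doc_id, "vector": self.embed(text), "text": text}
        return list(texts)

    def similarity_search_with_score(self, query: str, k: int = 1):
        # The query is embedded with the same function, then compared to every stored vector
        q = self.embed(query)
        scored = [(e["id"], cosine(q, e["vector"])) for e in self.store.values()]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:k]


store = ToyVectorStore(toy_embed)
store.add_texts({
    "syllabus": "course description and grading policy",
    "tutorial": "homework tutorial for github classroom",
    "final": "exam solution for the final exam",
})
top = store.similarity_search_with_score("how is the course grading decided", k=2)
print(top)
```

Swapping `toy_embed` for a real embedding model is what turns this keyword-style matching into semantic matching.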

The [official documentation](https://python.langchain.com/api_reference/core/vectorstores/langchain_core.vectorstores.in_memory.InMemoryVectorStore.html#langchain_core.vectorstores.in_memory.InMemoryVectorStore) also covers filtering, which is straightforward:
```python
def _filter_function(doc: Document) -> bool:
    return doc.metadata.get("bar") == "baz"

results = vector_store.similarity_search(
    query="thud", k=1, filter=_filter_function
)
for doc in results:
    print(f"* {doc.page_content} [{doc.metadata}]")
```
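
The documents loaded in this notebook have no `bar` key, but their metadata does include `source`, so a filter for them could look like the sketch below. `make_source_filter` is a hypothetical helper, and `SimpleNamespace` objects stand in for `Document` so the snippet runs without the vector store; the only thing assumed about a document is that it has a `.metadata` dict.

```python
from types import SimpleNamespace
from typing import Callable


def make_source_filter(substring: str) -> Callable[[object], bool]:
    """Build a predicate that keeps documents whose 'source' contains `substring`."""
    def _filter(doc) -> bool:
        return substring in str(doc.metadata.get("source", ""))
    return _filter


only_syllabus = make_source_filter("syllabus")

# Stand-ins for Document objects, using the source paths from this notebook
fake_docs = [
    SimpleNamespace(metadata={"source": "./documents/ECE273_syllabus.pdf"}),
    SimpleNamespace(metadata={"source": "./documents/ECE208_homework_tutorial.pdf"}),
    SimpleNamespace(metadata={}),  # a document without a 'source' key is rejected
]
kept = [d.metadata.get("source") for d in fake_docs if only_syllabus(d)]
print(kept)

# With the real store, the predicate is passed through the `filter` argument:
# results = vector_store.similarity_search(query="grading policy", k=1, filter=only_syllabus)
```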

In [50]:
# Returning a score with each result makes the quality of each match easy to see
results = vector_store.similarity_search_with_score(
    query="documents that contain the course grading and topics in this quarter", k=3
)
for doc, score in results:
    print(f"* [SIM={score:.6f}]\nContent: {doc.page_content[:100]}\nMetadata:[{doc.metadata}]")

* [SIM=0.791037]
Content: ECE273: Convex Optimization and Applications — Spring 2025
Course Description:This course covers the
Metadata:[{'producer': 'macOS Version 15.3.2 (Build 24D81) Quartz PDFContext', 'creator': 'Preview', 'creationdate': "D:20250407182812Z00'00'", 'title': 'syllabus', 'author': 'Nethan Hu', 'moddate': "D:20250407182812Z00'00'", 'source': './documents/ECE273_syllabus.pdf', 'total_pages': 6, 'page': 0, 'page_label': '1'}]
* [SIM=0.761357]
Content: ECE 208 Homework tutorial
WI-22
This is a short tutorial on how to use GitHub Classroom. It will cov
Metadata:[{'producer': 'macOS Version 15.3.2 (Build 24D81) Quartz PDFContext', 'creator': 'LaTeX with hyperref', 'creationdate': "D:20250407183313Z00'00'", 'moddate': "D:20250407183313Z00'00'", 'source': './documents/ECE208_homework_tutorial.pdf', 'total_pages': 2, 'page': 0, 'page_label': '1'}]
* [SIM=0.726692]
Content: Exam-A Solution
ECE 269- Linear Algebra and Applications, Final (A)
1. (20 points) Prove or disprove
Me

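As we can see, although all three documents mention grading, the word "topics" in the query leads the embedding model to rank the first document, the syllabus, as the best match. Next, we wrap the store with `.as_retriever()` to get a simple building block for RAG. The code below creates a retriever that uses MMR (maximal marginal relevance): it returns documents from the vector store that are both relevant to the query and diverse, for the LLM to use when generating an answer (the "Retrieve" step of a RAG pipeline).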

In [53]:
retriever = vector_store.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 1, "fetch_k": 2, "lambda_mult": 0.5},
)
ans = retriever.invoke("I want to find some documents that teach me how to upload my homework.")
print(f"Doc id: {ans[0].id}\nmetadata: {ans[0].metadata}\npage_content: {ans[0].page_content[:200]}")

Doc id: ece_208_page_1
metadata: {'producer': 'macOS Version 15.3.2 (Build 24D81) Quartz PDFContext', 'creator': 'LaTeX with hyperref', 'creationdate': "D:20250407183313Z00'00'", 'moddate': "D:20250407183313Z00'00'", 'source': './documents/ECE208_homework_tutorial.pdf', 'total_pages': 2, 'page': 0, 'page_label': '1'}
page_content: ECE 208 Homework tutorial
WI-22
This is a short tutorial on how to use GitHub Classroom. It will cover some usage of GitHub
commands, but we recommend you to refer to other materials for learning how 
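
To show what `lambda_mult` trades off, here is a minimal sketch of greedy MMR selection over a list of candidate vectors (the role played by the `fetch_k` candidates above). It is an illustration under stated assumptions, not LangChain's implementation: the `mmr` function and the two-dimensional vectors are hypothetical, and similarity is taken to be cosine similarity.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr(query_vec, candidates, k, lambda_mult=0.5):
    """Greedily pick up to k candidate indices, balancing relevance and diversity."""
    selected: list[int] = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        def score(i: int) -> float:
            relevance = cosine(query_vec, candidates[i])
            # Redundancy: similarity to the closest already-selected candidate
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


query = [1.0, 1.0]
candidates = [
    [1.0, 0.3],   # 0: most relevant
    [1.0, 0.25],  # 1: near-duplicate of 0
    [0.2, 1.0],   # 2: slightly less relevant, but very different from 0
]
print(mmr(query, candidates, k=2, lambda_mult=0.5))  # [0, 2]: skips the near-duplicate
print(mmr(query, candidates, k=2, lambda_mult=1.0))  # [0, 1]: pure relevance ranking
```

In this sketch the first pick is always the most relevant candidate, so with `k=1` the diversity term never comes into play; it only matters once two or more documents are returned.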
