# LangChain Core Modules: Data Connection - Vector Stores

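One of the most common ways to store and search unstructured data is to embed it and store the resulting embedding vectors, then at query time embed the unstructured query and retrieve the embedding vectors that are "most similar" to the embedded query.

A vector store takes care of storing the embedded data and performing the vector search for you.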
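
The embed, store, and search loop described above can be sketched in plain Python. This is a toy illustration only, not how Chroma or a real embedding model works internally: `toy_embed` is a hypothetical stand-in that counts words from a small fixed vocabulary, and `ToyVectorStore` is a made-up class name.

```python
import math

# Hypothetical fixed vocabulary; a real embedding model learns dense vectors instead
VOCAB = ["cat", "mat", "stock", "markets", "vector", "vectors", "embedding", "stores"]


def toy_embed(text):
    """Stand-in for an embedding model: one count per vocabulary word."""
    words = [w.strip('.,!?"\'') for w in text.lower().split()]
    return [float(words.count(term)) for term in VOCAB]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class ToyVectorStore:
    """Stores (vector, text) pairs and returns the texts closest to a query."""

    def __init__(self, embed):
        self.embed = embed
        self.items = []  # list of (vector, text)

    def add_texts(self, texts):
        for text in texts:
            self.items.append((self.embed(text), text))

    def similarity_search(self, query, k=1):
        # Embed the query with the same function used for the stored texts
        query_vec = self.embed(query)
        ranked = sorted(
            self.items,
            key=lambda item: cosine_similarity(query_vec, item[0]),
            reverse=True,
        )
        return [text for _, text in ranked[:k]]


store = ToyVectorStore(toy_embed)
store.add_texts([
    "The cat sat on the mat",
    "Stock markets fell sharply today",
    "Vector stores hold embedding vectors",
])
results = store.similarity_search("embedding vectors", k=1)
print(results[0])
```

The key design point is that documents and queries go through the same embedding function, so "most similar" reduces to comparing vectors.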


![](https://python.langchain.com/assets/images/vector_stores-125d1675d58cfb46ce9054c9019fea72.jpg)


The following uses `Chroma` as an example to show the features and usage.

In [2]:
# Install the required dependency
!pip install chromadb

Looking in indexes: https://mirrors.aliyun.com/pypi/simple/


## Using Chroma as a vector database for semantic search


In [8]:
from langchain.document_loaders import TextLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma
import os
# Read the OpenAI API key from the OPENAI_API_KEY environment variable; never hardcode a real key in the notebook
if not os.environ.get('OPENAI_API_KEY'):
    raise RuntimeError('Set the OPENAI_API_KEY environment variable before running this cell')
# Load the long text
raw_documents = TextLoader('state_of_the_union.txt').load()

In [9]:
# Instantiate the text splitter
text_splitter = CharacterTextSplitter(chunk_size=500, chunk_overlap=0)

In [10]:
# Split the text
documents = text_splitter.split_documents(raw_documents)
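
To make the `chunk_size` and `chunk_overlap=0` settings more concrete, here is a simplified sketch of separator-based splitting: break the text on a separator, then greedily merge pieces until the next one would push the chunk past `chunk_size`. It assumes a blank-line separator (`"\n\n"`) and no overlap, and it is not LangChain's actual implementation; `split_by_separator` is a hypothetical helper.

```python
def split_by_separator(text, chunk_size, separator="\n\n"):
    """Greedily pack separator-delimited pieces into chunks of at most chunk_size characters.

    A single piece longer than chunk_size is kept whole rather than cut.
    """
    pieces = [p for p in text.split(separator) if p.strip()]
    chunks = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            # Still fits (or the chunk is empty), so keep growing the current chunk
            current = candidate
        else:
            # Adding this piece would exceed chunk_size: close the chunk and start a new one
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph is a little longer."
chunks = split_by_separator(sample, chunk_size=40)
for chunk in chunks:
    print(repr(chunk))
```

Because chunks are only cut at separators, their lengths vary and a long paragraph can exceed the target size in this sketch.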

In [11]:
documents

[Document(metadata={'source': 'state_of_the_union.txt'}, page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  \n\nLast year COVID-19 kept us apart. This year we are finally together again. \n\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \n\nWith a duty to one another to the American people to the Constitution. \n\nAnd with an unwavering resolve that freedom will always triumph over tyranny.'),
 Document(metadata={'source': 'state_of_the_union.txt'}, page_content='Six days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. \n\nHe thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. \n\nHe met the Ukrainian people. \n\nFrom President Zelenskyy

In [12]:
embeddings_model = OpenAIEmbeddings()

### Note: if importing Pandas-related packages fails the first time, run the cell again and the import succeeds

In [23]:
pip install numpy==1.26.4

Looking in indexes: https://mirrors.aliyun.com/pypi/simple/
Collecting numpy==1.26.4
  Downloading https://mirrors.aliyun.com/pypi/packages/ae/8c/ab03a7c25741f9ebc92684a20125fbc9fc1b8e1e700beb9197d750fdff88/numpy-1.26.4-cp39-cp39-macosx_11_0_arm64.whl (14.0 MB)
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 2.0.2
    Uninstalling numpy-2.0.2:
      Successfully uninstalled numpy-2.0.2
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
gradio 4.22.0 requires pillow<11.0,>=8.0, but you have pillow 11.2.1 which is incompatible.
langchain-community 0.3.10 requires langsmith<0.2.0,>=0.1.125, but you have langsmith 0.3.20 which is incompatible.
nlopt 2.9.1 requires

In [13]:
# Embed the split text with the OpenAI embedding model and store the resulting vectors in Chroma
db = Chroma.from_documents(documents, embeddings_model)

### Semantic similarity search with a text query

In [14]:
query = "What did the president say about Ketanji Brown Jackson"
docs = db.similarity_search(query)
print(docs[0].page_content)

Number of requested results 4 is greater than number of elements in index 3, updating n_results = 3


Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  

Last year COVID-19 kept us apart. This year we are finally together again. 

Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. 

With a duty to one another to the American people to the Constitution. 

And with an unwavering resolve that freedom will always triumph over tyranny.


### Semantic similarity search with an embedding vector

In [15]:
embedding_vector = embeddings_model.embed_query(query)
docs = db.similarity_search_by_vector(embedding_vector)
print(docs[0].page_content)

Number of requested results 4 is greater than number of elements in index 3, updating n_results = 3


Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  

Last year COVID-19 kept us apart. This year we are finally together again. 

Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. 

With a duty to one another to the American people to the Constitution. 

And with an unwavering resolve that freedom will always triumph over tyranny.
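
The message `Number of requested results 4 is greater than number of elements in index 3` in the outputs above indicates that the search asked for four results while only three chunks were in the index, so the result count was reduced to three. The sketch below mimics that behavior for a search by vector in plain Python. It is an illustration only: `search_by_vector`, the three-dimensional vectors, and the chunk names are made up, and a real store may rank by a different distance metric.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search_by_vector(index, query_vector, k=4):
    """Return the k texts most similar to query_vector, best first.

    index is a list of (vector, text) pairs. If k exceeds the index size,
    k is reduced to the number of stored entries.
    """
    if k > len(index):
        print(f"Requested {k} results but index holds {len(index)}; using k = {len(index)}")
        k = len(index)
    ranked = sorted(
        index,
        key=lambda entry: cosine_similarity(query_vector, entry[0]),
        reverse=True,
    )
    return [text for _, text in ranked[:k]]


# Made-up 3-dimensional vectors standing in for real embeddings
index = [
    ([1.0, 0.0, 0.0], "chunk A"),
    ([0.0, 1.0, 0.0], "chunk B"),
    ([0.7, 0.7, 0.0], "chunk C"),
]
hits = search_by_vector(index, [0.9, 0.1, 0.0])  # k=4 requested, only 3 entries stored
print(hits)
```

Searching with a text query follows the same path once the query has been embedded, which is consistent with both cells above printing the same chunk.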
