# LangChain Core Modules: Data Connection - Vector Stores

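One of the most common ways to store and search unstructured data is to embed it and store the resulting embedding vectors. At query time, the unstructured query is embedded as well, and the stored embedding vectors that are "most similar" to the embedded query are retrieved.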

A vector store takes care of storing the embedded data and performing the vector search for you.
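To make "most similar" concrete, here is a minimal sketch of the retrieval step using cosine similarity over a tiny in-memory list. The helper names (`cosine_similarity`, `most_similar`) and the 3-dimensional toy vectors are assumptions invented for illustration; real embeddings have hundreds or thousands of dimensions, and a real vector store may use a different metric or an approximate index.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar(query_vector, store, k=1):
    """Return the k (text, score) pairs with the highest cosine similarity."""
    scored = [(text, cosine_similarity(query_vector, vec)) for text, vec in store]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy "embeddings", made up for this example (not produced by any model)
toy_store = [
    ("cats and dogs", [0.9, 0.1, 0.0]),
    ("stock markets", [0.0, 0.2, 0.9]),
    ("pet food", [0.7, 0.3, 0.1]),
]
toy_query = [0.8, 0.2, 0.0]

top = most_similar(toy_query, toy_store, k=2)
for text, score in top:
    print(f"{score:.3f}  {text}")
```

This brute-force scan is only practical for small collections; it is shown here purely to illustrate what the vector store is asked to do.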


![](https://python.langchain.com/assets/images/vector_stores-125d1675d58cfb46ce9054c9019fea72.jpg)


The following uses `Chroma` as an example to demonstrate the features and usage.

In [1]:
# Install the required dependency
!pip install chromadb

## Semantic search with Chroma as the vector database


In [2]:
from langchain.document_loaders import TextLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma

# Load the long text
raw_documents = TextLoader('../tests/state_of_the_union.txt').load()

In [3]:
# Instantiate the text splitter
text_splitter = CharacterTextSplitter(chunk_size=200, chunk_overlap=0)

In [4]:
# Split the text
documents = text_splitter.split_documents(raw_documents)

Created a chunk of size 215, which is longer than the specified 200
Created a chunk of size 232, which is longer than the specified 200
Created a chunk of size 242, which is longer than the specified 200
Created a chunk of size 219, which is longer than the specified 200
Created a chunk of size 304, which is longer than the specified 200
Created a chunk of size 205, which is longer than the specified 200
Created a chunk of size 332, which is longer than the specified 200
Created a chunk of size 215, which is longer than the specified 200
Created a chunk of size 203, which is longer than the specified 200
Created a chunk of size 281, which is longer than the specified 200
Created a chunk of size 201, which is longer than the specified 200
Created a chunk of size 250, which is longer than the specified 200
Created a chunk of size 325, which is longer than the specified 200
Created a chunk of size 242, which is longer than the specified 200
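
The warnings above are expected: `CharacterTextSplitter` splits on a single separator (by default a blank line, `"\n\n"`) and then merges neighbouring pieces up to `chunk_size`, so a paragraph that is already longer than `chunk_size` is kept whole. The sketch below imitates that behaviour in plain Python. It is a simplified illustration and not LangChain's actual implementation; `split_on_separator` is a hypothetical helper, and chunk overlap is ignored.

```python
def split_on_separator(text, chunk_size, separator="\n\n"):
    """Greedily merge separator-delimited pieces into chunks of at most chunk_size."""
    chunks, current = [], ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # A single piece may itself exceed chunk_size; it is kept whole.
            current = piece
    if current:
        chunks.append(current)
    return chunks

# Three "paragraphs" of 30, 250 and 40 characters
sample = "a" * 30 + "\n\n" + "b" * 250 + "\n\n" + "c" * 40
chunks = split_on_separator(sample, chunk_size=200)
oversized = [len(c) for c in chunks if len(c) > 200]
print([len(c) for c in chunks], oversized)
```

If a hard upper bound on chunk length matters, a splitter that falls back to finer separators is usually a better fit than a single-separator one.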


In [5]:
documents

[Document(page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.', metadata={'source': '../tests/state_of_the_union.txt'}),
 Document(page_content='Last year COVID-19 kept us apart. This year we are finally together again. \n\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans.', metadata={'source': '../tests/state_of_the_union.txt'}),
 Document(page_content='With a duty to one another to the American people to the Constitution. \n\nAnd with an unwavering resolve that freedom will always triumph over tyranny.', metadata={'source': '../tests/state_of_the_union.txt'}),
 Document(page_content='Six days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated.', metadata={'source': '../tests/state_of_the_union.txt'}),
 Document

In [6]:
embeddings_model = OpenAIEmbeddings()

### Note: if importing the Pandas-related packages fails the first time, re-run the cell and the import succeeds

In [8]:
# Embed the split text with the OpenAI embedding model and store the vectors in Chroma
db = Chroma.from_documents(documents, embeddings_model)

No embedding_function provided, using default embedding function: DefaultEmbeddingFunction https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2


### Semantic similarity search using text

In [9]:
query = "What did the president say about Ketanji Brown Jackson"
docs = db.similarity_search(query)
print(docs[0].page_content)

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.


### Semantic similarity search using an embedding vector

In [10]:
embedding_vector = embeddings_model.embed_query(query)
docs = db.similarity_search_by_vector(embedding_vector)
print(docs[0].page_content)

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.
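
Both searches return the same passage because a text query is conceptually just "embed the query, then search by vector". The toy class below sketches that relationship. `ToyVectorStore` and `toy_embed` are hypothetical names invented for this example, the letter-frequency "embedding" is a deliberately crude stand-in for a real model, and none of this reflects Chroma's internals.

```python
import math

def toy_embed(text):
    """Hypothetical stand-in for an embedding model: a letter-frequency vector."""
    counts = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            counts[ord(ch) - ord("a")] += 1.0
    return counts

class ToyVectorStore:
    """In-memory store that ranks texts by cosine similarity."""

    def __init__(self, texts, embed):
        self.embed = embed
        self.entries = [(text, embed(text)) for text in texts]

    def similarity_search_by_vector(self, vector, k=4):
        def score(entry):
            _, vec = entry
            dot = sum(x * y for x, y in zip(vector, vec))
            norms = math.sqrt(sum(x * x for x in vector)) * math.sqrt(sum(y * y for y in vec))
            return dot / norms if norms else 0.0

        ranked = sorted(self.entries, key=score, reverse=True)
        return [text for text, _ in ranked[:k]]

    def similarity_search(self, query, k=4):
        # Text search is the vector search applied to the embedded query.
        return self.similarity_search_by_vector(self.embed(query), k=k)

store = ToyVectorStore(["judge nomination", "tax policy", "road repairs"], toy_embed)
by_text = store.similarity_search("nominated judge", k=1)
by_vector = store.similarity_search_by_vector(toy_embed("nominated judge"), k=1)
print(by_text == by_vector)
```

Searching by vector is useful when the query embedding has already been computed elsewhere, for example when it is cached or produced in a batch.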
