# LangChain Core Modules: Data Connection - Vector Stores

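One of the most common ways to store and search unstructured data is to embed it and store the resulting embedding vectors. At query time, the unstructured query is embedded as well, and the stored vectors that are "most similar" to the query embedding are retrieved.

A vector store takes care of storing the embedded data and performing the vector search for you.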


![](https://python.langchain.com/assets/images/vector_stores-9dc1ecb68c4cb446df110764c9cc07e0.jpg)
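
The retrieval step described above can be sketched in plain Python. This is only an illustration of the idea, not how any particular vector store works internally: the three-dimensional vectors are made-up stand-ins for real embeddings, and `cosine_similarity` and `top_k` are hypothetical helper names.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vector, stored, k=2):
    """Return the k (text, score) pairs most similar to query_vector."""
    scored = [(text, cosine_similarity(query_vector, vec)) for text, vec in stored]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Made-up "embeddings"; a real embedding model returns far more dimensions.
stored = [
    ("cats purr", [0.9, 0.1, 0.0]),
    ("dogs bark", [0.7, 0.3, 0.1]),
    ("stocks fell", [0.0, 0.2, 0.9]),
]
query_vector = [1.0, 0.0, 0.0]  # pretend this is the embedded query

results = top_k(query_vector, stored, k=2)
for text, score in results:
    print(f"{score:.3f}  {text}")
```

A real vector store adds persistence and an index so that the search does not have to scan every stored vector, but the ranking idea is the same.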


The following uses `Chroma` as an example to demonstrate the features and usage of a vector store.

In [1]:
# Install the required dependency
!pip install chromadb


## Using Chroma as the Vector Database for Semantic Search


In [2]:
from langchain.document_loaders import TextLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma

# Load the long text document
raw_documents = TextLoader('../../tests/state_of_the_union.txt').load()

In [6]:
# Instantiate the text splitter
text_splitter = CharacterTextSplitter(chunk_size=100, chunk_overlap=0)

In [7]:
# Split the text into chunks
documents = text_splitter.split_documents(raw_documents)

Created a chunk of size 164, which is longer than the specified 100
Created a chunk of size 169, which is longer than the specified 100
Created a chunk of size 122, which is longer than the specified 100
Created a chunk of size 121, which is longer than the specified 100
Created a chunk of size 139, which is longer than the specified 100
Created a chunk of size 181, which is longer than the specified 100
Created a chunk of size 101, which is longer than the specified 100
Created a chunk of size 113, which is longer than the specified 100
Created a chunk of size 129, which is longer than the specified 100
Created a chunk of size 146, which is longer than the specified 100
Created a chunk of size 136, which is longer than the specified 100
Created a chunk of size 189, which is longer than the specified 100
Created a chunk of size 215, which is longer than the specified 100
Created a chunk of size 124, which is longer than the specified 100
Created a chunk of size 118, which is longer tha
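
The warnings above appear because `CharacterTextSplitter` splits only on its separator (by default a blank line, `"\n\n"`) and does not cut inside a piece, so a paragraph longer than `chunk_size` is kept whole. The sketch below imitates that behaviour in plain Python. `split_on_separator` is a hypothetical simplification, not LangChain's implementation, and it ignores `chunk_overlap` and whitespace stripping.

```python
def split_on_separator(text, chunk_size, separator="\n\n"):
    """Split on separator, merging neighbouring pieces while they fit in chunk_size."""
    chunks, current = [], ""
    for piece in text.split(separator):
        if not piece:
            continue
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            # The next piece does not fit: close the current chunk and start a new one.
            if current:
                chunks.append(current)
            current = piece  # kept whole even if it is longer than chunk_size
    if current:
        chunks.append(current)
    return chunks

text = "Short one.\n\nShort two.\n\n" + "A long paragraph without blank lines. " * 4
chunks = split_on_separator(text, chunk_size=30)
for chunk in chunks:
    flag = "  <- exceeds chunk_size" if len(chunk) > 30 else ""
    print(len(chunk), flag)
```

If hard size limits matter, a splitter that falls back to finer separators, such as LangChain's `RecursiveCharacterTextSplitter`, is the usual alternative.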

In [8]:
documents

[Document(page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.', metadata={'source': '../../tests/state_of_the_union.txt'}),
 Document(page_content='Last year COVID-19 kept us apart. This year we are finally together again.', metadata={'source': '../../tests/state_of_the_union.txt'}),
 Document(page_content='Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans.', metadata={'source': '../../tests/state_of_the_union.txt'}),
 Document(page_content='With a duty to one another to the American people to the Constitution.', metadata={'source': '../../tests/state_of_the_union.txt'}),
 Document(page_content='And with an unwavering resolve that freedom will always triumph over tyranny.', metadata={'source': '../../tests/state_of_the_union.txt'}),
 Document(page_content='Six days ago, Russia’s Vladimir Putin sought to shake the founda

In [9]:
# Embed the split documents with the OpenAI embedding model and store the vectors in Chroma
db = Chroma.from_documents(documents, OpenAIEmbeddings())

### Semantic similarity search with a text query

In [10]:
query = "What did the president say about Ketanji Brown Jackson"
docs = db.similarity_search(query)
print(docs[0].page_content)

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.


### Semantic similarity search with an embedding vector

In [11]:
# embeddings_model = OpenAIEmbeddings()
# embedding_vector = embeddings_model.embed_query(query)
embedding_vector = OpenAIEmbeddings().embed_query(query)
docs = db.similarity_search_by_vector(embedding_vector)
print(docs[0].page_content)

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.
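
Both searches return the same passage because searching by text amounts to embedding the query and then searching by vector. The following self-contained sketch shows that relationship under loud assumptions: `ToyVectorStore` and `toy_embed` are hypothetical names, the embedding is just normalised word counts over a tiny hand-picked vocabulary, and none of this reflects Chroma's actual implementation or distance metric.

```python
import math
from collections import Counter

def toy_embed(text, vocabulary):
    """Hypothetical embedding: L2-normalised word counts over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    vector = [float(counts[word]) for word in vocabulary]
    norm = math.sqrt(sum(v * v for v in vector)) or 1.0  # avoid dividing by zero
    return [v / norm for v in vector]

class ToyVectorStore:
    """In-memory stand-in for a vector store (not Chroma's implementation)."""

    def __init__(self, texts, vocabulary):
        self.vocabulary = vocabulary
        self.texts = list(texts)
        self.vectors = [toy_embed(t, vocabulary) for t in self.texts]

    def similarity_search_by_vector(self, vector, k=1):
        # Vectors are unit length, so the dot product is the cosine similarity.
        scores = [sum(a * b for a, b in zip(vector, v)) for v in self.vectors]
        order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        return [self.texts[i] for i in order[:k]]

    def similarity_search(self, query, k=1):
        # Text search is "embed the query, then search by vector".
        return self.similarity_search_by_vector(toy_embed(query, self.vocabulary), k)

vocabulary = ["judge", "nominated", "covid", "together", "freedom"]
store = ToyVectorStore(
    ["i nominated a judge", "covid kept us apart", "freedom will triumph"],
    vocabulary,
)
query = "who was nominated as judge"
by_text = store.similarity_search(query)
by_vector = store.similarity_search_by_vector(toy_embed(query, vocabulary))
print(by_text == by_vector, by_text)
```

Searching by vector is mainly useful when the query embedding already exists, for example when it was computed once and reused across several stores.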
