In [None]:

# Official LangChain documentation tutorial
# https://python.langchain.com/docs/tutorials/retrievers/



In [None]:
# Build a semantic search engine
# This tutorial will familiarize you with LangChain's document loader, embedding, and vector store abstractions. These abstractions are designed to support retrieval of data, from (vector) databases and other sources, for integration with LLM workflows. They are important for applications that fetch data to be reasoned over as part of model inference, as in retrieval-augmented generation, or RAG (see the RAG tutorial).

# Here we will build a search engine over a PDF document. This will allow us to retrieve passages in the PDF that are similar to an input query.


In [None]:
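# Concepts
# This guide focuses on retrieval of text data. We will cover the following concepts:

# Documents and document loaders:
# A Document is LangChain's abstraction for a piece of text and its associated metadata. It has page_content (the text), metadata (a dict of arbitrary metadata), and an optional id (a string identifier).
# Document loaders load data from sources such as PDF, HTML, and JSON and convert it into Document objects.

# Text splitters:
# Split large Document objects into smaller, manageable chunks. This matters for information retrieval and downstream question answering, because it keeps relevant information from being "diluted" by unrelated text.

# Embeddings:
# Models that convert text into numeric vectors (embedding vectors). These vectors capture the semantics of the text, so semantically similar texts lie closer together in the vector space.

# Vector stores and retrievers:
# Vector store: a special data structure that stores the embedding vectors of texts and supports efficient similarity search.
# Retriever: a Runnable interface in LangChain that, given a query, retrieves relevant documents from various data sources (including vector stores).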


In [None]:
# Setup
# Jupyter Notebook
# This and other tutorials are perhaps most conveniently run in a Jupyter notebook. See the Jupyter documentation for installation instructions.

In [None]:
# Installation
# This tutorial requires the langchain-community and pypdf packages:


In [None]:
# pip install langchain-community pypdf
# or
# conda install langchain-community pypdf -c conda-forge

In [None]:
# (Gemini) PS A:\> pip install langchain-community pypdf
# Collecting langchain-community
#   Downloading langchain_community-0.3.27-py3-none-any.whl.metadata (2.9 kB)
# Collecting pypdf
#   Downloading pypdf-5.8.0-py3-none-any.whl.metadata (7.1 kB)
# ...
# ...
# ...
# Successfully installed dataclasses-json-0.6.7 langchain-0.3.26 langchain-community-0.3.27 marshmallow-3.26.1 mypy-extensions-1.1.0 numpy-2.3.1 pydantic-settings-2.10.1 pypdf-5.8.0 typing-inspect-0.9.0


In [None]:
# LangSmith
# Many of the applications you build with LangChain will contain multiple steps with multiple LLM calls. As these applications get more complex, it becomes crucial to be able to inspect what exactly is going on inside your chain or agent. The best way to do this is with LangSmith.

# After you sign up for LangSmith, make sure to set your environment variables to start logging traces:


In [None]:
# export LANGSMITH_TRACING="true"
# export LANGSMITH_API_KEY="..."

In [None]:
# Or, if in a notebook, you can set them with:


In [1]:
import getpass
import os

os.environ["LANGSMITH_TRACING"] = "true"

if not os.environ.get("LANGSMITH_API_KEY"):
  os.environ["LANGSMITH_API_KEY"] = getpass.getpass("Enter API key for LANGSMITH: ")





In [2]:
# Documents and Document Loaders
# LangChain implements a Document abstraction, which is intended to represent a unit of text and associated metadata. It has three attributes:

# page_content: a string representing the content;
# metadata: a dict containing arbitrary metadata (such as the source or page number);
# id: (optional) a string identifier for the document.

# The metadata attribute can capture information about the source of the document, its relationship to other documents, and other information. Note that an individual Document object often represents a chunk of a larger document.

# We can generate sample documents when desired:


In [3]:
from langchain_core.documents import Document

documents = [
    Document(
        page_content="Dogs are great companions, known for their loyalty and friendliness.",
        metadata={"source": "mammal-pets-doc"},
    ),
    Document(
        page_content="Cats are independent pets that often enjoy their own space.",
        metadata={"source": "mammal-pets-doc"},
    ),
]

print("Manually created sample documents:")
for doc in documents:
    print(f"Content: {doc.page_content}")
    print(f"Metadata: {doc.metadata}\n")


Manually created sample documents:
Content: Dogs are great companions, known for their loyalty and friendliness.
Metadata: {'source': 'mammal-pets-doc'}

Content: Cats are independent pets that often enjoy their own space.
Metadata: {'source': 'mammal-pets-doc'}



In [4]:
# However, the LangChain ecosystem implements document loaders that integrate with hundreds of common sources. This makes it easy to incorporate data from these sources into your AI application.


In [5]:
# Loading documents
# Let's load a PDF into a sequence of Document objects. There is a sample PDF in the LangChain repo, a 10-K filing for Nike from 2023. We can consult the LangChain documentation for available PDF document loaders. Let's select PyPDFLoader, which is fairly lightweight.
# Link to nke-10k-2023.pdf: https://github.com/langchain-ai/langchain/tree/master/docs/docs/example_data
# Create a Document folder in your project directory and put the downloaded nke-10k-2023.pdf file in it.

# PyPDFLoader loads one Document object per PDF page. For each, we can easily access:

# The string content of the page;
# Metadata containing the file name and page number.


In [6]:
from langchain_community.document_loaders import PyPDFLoader

# Path to the PDF file
file_path = "./Document/nke-10k-2023.pdf"

# Initialize PyPDFLoader
loader = PyPDFLoader(file_path)

# Load the file; PyPDFLoader creates one Document object per page
docs = loader.load()

# Print the number of loaded documents
print(f"Number of loaded documents (PDF pages): {len(docs)}")

# Print the first 200 characters of the first document's content, then its metadata
print("\nContent of the first document (first 200 characters):")
print(f"{docs[0].page_content[:200]}\n")
print("Metadata of the first document:")
print(f"{docs[0].metadata}")


Number of loaded documents (PDF pages): 107

Content of the first document (first 200 characters):
Table of Contents
UNITED STATES
SECURITIES AND EXCHANGE COMMISSION
Washington, D.C. 20549
FORM 10-K
(Mark One)
☑  ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(D) OF THE SECURITIES EXCHANGE ACT OF 1934
F

Metadata of the first document:
{'producer': 'EDGRpdf Service w/ EO.Pdf 22.0.40.0', 'creator': 'EDGAR Filing HTML Converter', 'creationdate': '2023-07-20T16:22:00-04:00', 'title': '0000320187-23-000039', 'author': 'EDGAR Online, a division of Donnelley Financial Solutions', 'subject': 'Form 10-K filed on 2023-07-20 for the period ending 2023-05-31', 'keywords': '0000320187-23-000039; ; 10-K', 'moddate': '2023-07-20T16:22:08-04:00', 'source': './Document/nke-10k-2023.pdf', 'total_pages': 107, 'page': 0, 'page_label': '1'}


In [7]:
# Splitting
# For both information retrieval and downstream question-answering purposes, a page may be too coarse a representation. Our goal in the end will be to retrieve Document objects that answer an input query, and further splitting our PDF will help ensure that the meanings of relevant portions of the document are not "washed out" by surrounding text.

# We can use text splitters for this purpose. Here we will use a simple text splitter that partitions based on characters. We will split our documents into chunks of 1000 characters with 200 characters of overlap between chunks. The overlap helps mitigate the possibility of separating a statement from important context related to it. We use the RecursiveCharacterTextSplitter, which will recursively split the document using common separators like new lines until each chunk is the appropriate size. This is the recommended text splitter for generic text use cases.

# We set add_start_index=True so that the character index where each split Document starts within the initial Document is preserved as the metadata attribute "start_index".

# See the guide on working with PDFs for more detail, including how to extract text from specific sections and images.
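
# To make chunk_size, chunk_overlap and start_index concrete, the following is a minimal pure-Python sketch of fixed-size character chunking with overlap. It is an illustration under simplifying assumptions, not how RecursiveCharacterTextSplitter works internally (that splitter first tries separators such as new lines); the helper name split_with_overlap is hypothetical.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[dict]:
    """Split text into fixed-size character chunks that overlap.

    Hypothetical helper for illustration only; it ignores separators entirely.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # distance between consecutive chunk starts
    chunks = []
    start = 0
    while start < len(text):
        chunks.append({"start_index": start, "text": text[start:start + chunk_size]})
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
        start += step
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
for chunk in split_with_overlap(sample, chunk_size=10, chunk_overlap=4):
    print(chunk["start_index"], chunk["text"])
```

# With chunk_size=10 and chunk_overlap=4, each chunk starts 6 characters after the previous one, so the last 4 characters of one chunk reappear at the start of the next.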


In [8]:

from langchain_text_splitters import RecursiveCharacterTextSplitter


# Initialize RecursiveCharacterTextSplitter
# chunk_size: maximum number of characters per chunk
# chunk_overlap: number of characters shared by consecutive chunks, which helps preserve context
# add_start_index: record each chunk's starting character index within its source Document in the metadata
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    add_start_index=True,
)

# Split the documents
all_splits = text_splitter.split_documents(docs)

# Print the number of chunks after splitting
print(f"Number of chunks after splitting: {len(all_splits)}")

# Print the content and metadata of the first chunk
print("\nContent of the first chunk (first 200 characters):")
print(f"{all_splits[0].page_content[:200]}\n")
print("Metadata of the first chunk:")
print(f"{all_splits[0].metadata}")


Number of chunks after splitting: 516

Content of the first chunk (first 200 characters):
Table of Contents
UNITED STATES
SECURITIES AND EXCHANGE COMMISSION
Washington, D.C. 20549
FORM 10-K
(Mark One)
☑  ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(D) OF THE SECURITIES EXCHANGE ACT OF 1934
F

Metadata of the first chunk:
{'producer': 'EDGRpdf Service w/ EO.Pdf 22.0.40.0', 'creator': 'EDGAR Filing HTML Converter', 'creationdate': '2023-07-20T16:22:00-04:00', 'title': '0000320187-23-000039', 'author': 'EDGAR Online, a division of Donnelley Financial Solutions', 'subject': 'Form 10-K filed on 2023-07-20 for the period ending 2023-05-31', 'keywords': '0000320187-23-000039; ; 10-K', 'moddate': '2023-07-20T16:22:08-04:00', 'source': './Document/nke-10k-2023.pdf', 'total_pages': 107, 'page': 0, 'page_label': '1', 'start_index': 0}


In [9]:
# Embeddings
# Vector search is a common way to store and search over unstructured data (such as unstructured text). The idea is to store numeric vectors that are associated with the text. Semantically similar texts lie closer together in the vector space, so given a query, we can embed it as a vector of the same dimension and use vector similarity metrics (such as cosine similarity) to identify related text.
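
# As a small illustration of the similarity metric mentioned above, here is a pure-Python sketch of cosine similarity. The three-dimensional vectors are made-up numbers rather than real model output, and the function name cosine_similarity is chosen for this example only.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy 3-dimensional "embeddings" (made-up numbers, not model output).
query = [1.0, 0.0, 1.0]
close_text = [2.0, 0.0, 2.0]      # same direction as the query
unrelated_text = [0.0, 1.0, 0.0]  # orthogonal to the query

print(cosine_similarity(query, close_text))      # close to 1: same direction
print(cosine_similarity(query, unrelated_text))  # 0: nothing in common
```

# Cosine similarity depends only on direction, not on vector length, which is why close_text scores as essentially identical to the query even though it is twice as long.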

# LangChain supports embeddings from dozens of providers. These models specify how text should be converted into a numeric vector. Let's select a model:

# Selected embeddings model: Google Gemini

In [10]:
# pip install -qU langchain-google-genai

In [11]:
import getpass
import os

if not os.environ.get("GOOGLE_API_KEY"):
  os.environ["GOOGLE_API_KEY"] = getpass.getpass("Enter API key for Google Gemini: ")

from langchain_google_genai import GoogleGenerativeAIEmbeddings

# Initialize the Google embedding model
embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001")

In [12]:
# Troubleshooting: ImportError when initializing GoogleGenerativeAIEmbeddings

# In the original notebook (`execution_count: 6`), running the embeddings initialization cell above failed with:

# ```
# ImportError: cannot import name 'cygrpc' from 'grpc._cython' (A:\Anaconda\envs\Gemini\Lib\site-packages\grpc\_cython\__init__.py)
# ```

# ### Error analysis

# This `ImportError` means Python could not import the name `cygrpc` from the `grpc._cython` module.

# `grpc` is a high-performance remote procedure call (RPC) framework, and `cygrpc` is a Cython extension of the `grpc` library that bridges Python and C for performance.

# **Common causes:**

# 1.  **An incompatible or corrupted `grpcio` package:** `grpcio` is the Python implementation of `grpc`. The error usually appears when the `grpcio` installation is broken, or when the installed `grpcio` version is incompatible with other dependencies (for example `langchain-google-genai` or `google-api-core`).
# 2.  **Environment problems:** less commonly, a problem with the virtual environment or the Python installation itself can prevent the C extension module from loading.

# ### Fix

# The most common fix is to force a reinstall of `grpcio` so that it is installed cleanly:

# 1.  **Activate the Conda environment:**
#     Make sure the commands run in the right Conda environment. The path `A:\Anaconda\envs\Gemini` suggests the environment is named `Gemini`.

#     ```powershell
#     conda activate Gemini
#     ```

# 2.  **Uninstall the existing `grpcio` package:**
#     This clears out a possibly corrupted installation. The `-y` flag confirms the uninstall automatically.

#     ```powershell
#     pip uninstall grpcio -y
#     ```

# 3.  **Reinstall `grpcio`:**
#     First try without pinning a version, so that `pip` installs the latest compatible release.

#     ```powershell
#     pip install grpcio
#     ```
#     If the problem persists, try a specific version, for example `pip install grpcio==1.49.1` (a relatively stable release, although the most compatible version depends on all of your dependencies).

# 4.  **Reinstall `langchain-google-genai`:**
#     Not always required, but it helps ensure `langchain-google-genai` is compatible with the newly installed `grpcio`. The `-U` flag upgrades to the latest version.

#     ```powershell
#     pip install -U langchain-google-genai
#     ```

# 5.  **Restart the Jupyter Notebook kernel:**
#     Even after reinstalling packages in a terminal, the notebook kernel may still hold the old, broken modules in memory, so **restart the kernel** (usually `Kernel -> Restart` in the menu bar).

# After these steps, rerun the notebook cells (in particular the one that failed); this should resolve the `ImportError: cannot import name 'cygrpc'` error.

In [13]:
# Generate embedding vectors for the first two chunks
vector_1 = embeddings.embed_query(all_splits[0].page_content)
vector_2 = embeddings.embed_query(all_splits[1].page_content)

# Check that both vectors have the same length
assert len(vector_1) == len(vector_2)
# Length of the generated vectors
print(f"Generated vectors of length {len(vector_1)}\n")
# First 10 elements of the first vector
print(vector_1[:10])

Generated vectors of length 768

[0.003303560661152005, -0.01885664090514183, -0.023528870195150375, 0.013265608809888363, 0.04694835841655731, 0.04489297419786453, 0.030707117170095444, 0.017642803490161896, 0.0011852466268464923, 0.028473228216171265]


In [None]:
# Armed with a model for generating text embeddings, we can next store them in a special data structure that supports efficient similarity search.


In [None]:
# Vector stores
# LangChain VectorStore objects contain methods for adding text and Document objects to the store, and querying them using various similarity metrics. They are often initialized with embedding models, which determine how text data is translated to numeric vectors.
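
# The idea of a store that is initialized with an embedding function, indexes texts, and answers similarity queries can be sketched in a few lines of plain Python. Everything here is hypothetical and for illustration only: ToyVectorStore is not a LangChain class, and toy_embed (vowel counts) is a made-up stand-in for a real embedding model.

```python
import math
from typing import Callable


def toy_embed(text: str) -> list[float]:
    """Made-up embedding: counts of the letters a, e, i, o, u (not a real model)."""
    lowered = text.lower()
    return [float(lowered.count(vowel)) for vowel in "aeiou"]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0


class ToyVectorStore:
    """Hypothetical stand-in for a vector store; brute-force search over a list."""

    def __init__(self, embed: Callable[[str], list[float]]) -> None:
        self.embed = embed
        self.rows: list[tuple[list[float], str]] = []

    def add_texts(self, texts: list[str]) -> None:
        # Embed each text once at indexing time and keep the vector next to it.
        for text in texts:
            self.rows.append((self.embed(text), text))

    def similarity_search(self, query: str, k: int = 1) -> list[str]:
        # Embed the query with the same function, then rank stored rows by cosine.
        query_vector = self.embed(query)
        ranked = sorted(self.rows, key=lambda row: cosine(query_vector, row[0]), reverse=True)
        return [text for _, text in ranked[:k]]


store = ToyVectorStore(toy_embed)
store.add_texts(["banana", "kiwi", "zucchini"])
print(store.similarity_search("papaya", k=1))
```

# The query must be embedded with the same function used at indexing time; otherwise the vectors are not comparable. Real vector stores replace the brute-force sort with an index built for efficient search.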

# LangChain includes a suite of integrations with different vector store technologies. Some vector stores are hosted by a provider (e.g., various cloud providers) and require specific credentials to use; some (such as Postgres) run in separate infrastructure that can be run locally or via a third party; others can run in-memory for lightweight workloads. Let's select a vector store:

# Selected vector store: In-memory


In [None]:
# pip install -qU langchain-core

In [14]:
from langchain_core.vectorstores import InMemoryVectorStore

vector_store = InMemoryVectorStore(embeddings)

In [15]:
# Having instantiated our vector store, we can now index the documents.

In [16]:
ids = vector_store.add_documents(documents=all_splits)

In [17]:
# Note that most vector store implementations will allow you to connect to an existing vector store, e.g., by providing a client, index name, or other information. See the documentation for a specific integration for more detail.

# Once we've instantiated a VectorStore that contains documents, we can query it. VectorStore includes methods for querying:

# Synchronously and asynchronously;
# By string query and by vector;
# With and without returning similarity scores;
# By similarity and maximum marginal relevance (to balance similarity to the query against diversity in the retrieved results).

# The methods will generally include a list of Document objects in their outputs.
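
# Maximum marginal relevance is easier to follow with a concrete sketch. The pure-Python version below greedily picks the candidate that is most similar to the query while penalizing similarity to candidates already picked. It is an illustration, not LangChain's implementation; the function name mmr, the weight name lambda_mult, and the toy two-dimensional vectors are assumptions made for this example.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr(query_vec: list[float], candidate_vecs: list[list[float]], k: int,
        lambda_mult: float = 0.5) -> list[int]:
    """Return indices of up to k candidates chosen by maximum marginal relevance."""
    selected: list[int] = []
    remaining = list(range(len(candidate_vecs)))

    def score(i: int) -> float:
        relevance = cosine(query_vec, candidate_vecs[i])
        # Redundancy: similarity to the closest already-selected candidate.
        redundancy = max(
            (cosine(candidate_vecs[i], candidate_vecs[j]) for j in selected),
            default=0.0,
        )
        return lambda_mult * relevance - (1.0 - lambda_mult) * redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


def unit(degrees: float) -> list[float]:
    """Toy 2-D unit vector pointing at the given angle."""
    return [math.cos(math.radians(degrees)), math.sin(math.radians(degrees))]


query = unit(45)
candidates = [unit(40), unit(38), unit(60)]  # index 1 is a near-duplicate of index 0

print(mmr(query, candidates, k=2, lambda_mult=1.0))  # relevance only
print(mmr(query, candidates, k=2, lambda_mult=0.5))  # relevance balanced with diversity
```

# With lambda_mult=1.0 the two near-duplicates (indices 0 and 1) are returned; with lambda_mult=0.5 the second pick becomes index 2, which is slightly less similar to the query but adds diversity.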


In [18]:
# Usage
# Embeddings typically represent text as a "dense" vector such that texts with similar meanings are geometrically close. This lets us retrieve relevant information just by passing in a question, without knowledge of any specific key terms used in the document.

# Return documents based on similarity to a string query:


In [21]:
results = vector_store.similarity_search(
    "How many distribution centers does Nike have in the US?"
)

print(results[0])

page_content='operations. We also lease an office complex in Shanghai, China, our headquarters for our Greater China geography, occupied by employees focused on implementing our
wholesale, NIKE Direct and merchandising strategies in the region, among other functions.
In the United States, NIKE has eight significant distribution centers. Five are located in or near Memphis, Tennessee, two of which are owned and three of which are
leased. Two other distribution centers, one located in Indianapolis, Indiana and one located in Dayton, Tennessee, are leased and operated by third-party logistics
providers. One distribution center for Converse is located in Ontario, California, which is leased. NIKE has a number of distribution facilities outside the United States,
some of which are leased and operated by third-party logistics providers. The most significant distribution facilities outside the United States are located in Laakdal,' metadata={'producer': 'EDGRpdf Service w/ EO.Pdf 22.0.40.0', 

In [22]:
# Async query:

In [23]:
results = await vector_store.asimilarity_search("When was Nike incorporated?")

print(results[0])

page_content='Table of Contents
PART I
ITEM 1. BUSINESS
GENERAL
NIKE, Inc. was incorporated in 1967 under the laws of the State of Oregon. As used in this Annual Report on Form 10-K (this "Annual Report"), the terms "we," "us," "our,"
"NIKE" and the "Company" refer to NIKE, Inc. and its predecessors, subsidiaries and affiliates, collectively, unless the context indicates otherwise.
Our principal business activity is the design, development and worldwide marketing and selling of athletic footwear, apparel, equipment, accessories and services. NIKE is
the largest seller of athletic footwear and apparel in the world. We sell our products through NIKE Direct operations, which are comprised of both NIKE-owned retail stores
and sales through our digital platforms (also referred to as "NIKE Brand Digital"), to retail accounts and to a mix of independent distributors, licensees and sales' metadata={'producer': 'EDGRpdf Service w/ EO.Pdf 22.0.40.0', 'creator': 'EDGAR Filing HTML Converter', 'cr

In [24]:
# Return scores:

In [25]:
# Note that providers implement different scores. For InMemoryVectorStore the score
# appears to be a cosine similarity (higher means more similar); other stores may
# return a distance that varies inversely with similarity.

results = vector_store.similarity_search_with_score("What was Nike's revenue in 2023?")
doc, score = results[0]
print(f"Score: {score}\n")
print(doc)

Score: 0.7798893082658928

page_content='Table of Contents
FISCAL 2023 NIKE BRAND REVENUE HIGHLIGHTSThe following tables present NIKE Brand revenues disaggregated by reportable operating segment, distribution channel and major product line:
FISCAL 2023 COMPARED TO FISCAL 2022
• NIKE, Inc. Revenues were $51.2 billion in fiscal 2023, which increased 10% and 16% compared to fiscal 2022 on a reported and currency-neutral basis, respectively.
The increase was due to higher revenues in North America, Europe, Middle East & Africa ("EMEA"), APLA and Greater China, which contributed approximately 7, 6,
2 and 1 percentage points to NIKE, Inc. Revenues, respectively.
• NIKE Brand revenues, which represented over 90% of NIKE, Inc. Revenues, increased 10% and 16% on a reported and currency-neutral basis, respectively. This
increase was primarily due to higher revenues in Men's, the Jordan Brand, Women's and Kids' which grew 17%, 35%,11% and 10%, respectively, on a wholesale
equivalent basis.' metad

In [26]:
# Return documents based on similarity to an embedded query:

In [27]:
embedding = embeddings.embed_query("How were Nike's margins impacted in 2023?")

results = vector_store.similarity_search_by_vector(embedding)
print(results[0])

page_content='Table of Contents
GROSS MARGIN
FISCAL 2023 COMPARED TO FISCAL 2022
For fiscal 2023, our consolidated gross profit increased 4% to $22,292 million compared to $21,479 million for fiscal 2022. Gross margin decreased 250 basis points to
43.5% for fiscal 2023 compared to 46.0% for fiscal 2022 due to the following:
*Wholesale equivalent
The decrease in gross margin for fiscal 2023 was primarily due to:
• Higher NIKE Brand product costs, on a wholesale equivalent basis, primarily due to higher input costs and elevated inbound freight and logistics costs as well as
product mix;
• Lower margin in our NIKE Direct business, driven by higher promotional activity to liquidate inventory in the current period compared to lower promotional activity in
the prior period resulting from lower available inventory supply;
• Unfavorable changes in net foreign currency exchange rates, including hedges; and
• Lower off-price margin, on a wholesale equivalent basis.
This was partially offset by:'

In [None]:
# Retrievers
# LangChain VectorStore objects do not subclass Runnable. LangChain Retrievers are Runnables, so they implement a standard set of methods (e.g., synchronous and asynchronous invoke and batch operations). Although we can construct retrievers from vector stores, retrievers can also interface with non-vector-store sources of data (such as external APIs).

# We can create a simple version of this ourselves, without subclassing Retriever. If we choose what method we wish to use to retrieve documents, we can create a runnable easily. Below we will build one around the similarity_search method:


# Wrapping vector_store.similarity_search with the @chain decorator turns it into a Runnable. In the cell below, k=1 returns only the single most similar chunk.

In [28]:
from typing import List

from langchain_core.documents import Document
from langchain_core.runnables import chain


@chain
def retriever(query: str) -> List[Document]:
    return vector_store.similarity_search(query, k=1)


retriever.batch(
    [
        "How many distribution centers does Nike have in the US?",
        "When was Nike incorporated?",
    ],
)

[[Document(id='bf3c8c08-2c15-4207-a668-e745ece2bc6c', metadata={'producer': 'EDGRpdf Service w/ EO.Pdf 22.0.40.0', 'creator': 'EDGAR Filing HTML Converter', 'creationdate': '2023-07-20T16:22:00-04:00', 'title': '0000320187-23-000039', 'author': 'EDGAR Online, a division of Donnelley Financial Solutions', 'subject': 'Form 10-K filed on 2023-07-20 for the period ending 2023-05-31', 'keywords': '0000320187-23-000039; ; 10-K', 'moddate': '2023-07-20T16:22:08-04:00', 'source': './Document/nke-10k-2023.pdf', 'total_pages': 107, 'page': 26, 'page_label': '27', 'start_index': 804}, page_content='operations. We also lease an office complex in Shanghai, China, our headquarters for our Greater China geography, occupied by employees focused on implementing our\nwholesale, NIKE Direct and merchandising strategies in the region, among other functions.\nIn the United States, NIKE has eight significant distribution centers. Five are located in or near Memphis, Tennessee, two of which are owned and thr

In [None]:
# Vector stores implement an as_retriever method that generates a Retriever, specifically a VectorStoreRetriever. These retrievers include search_type and search_kwargs attributes that identify which methods of the underlying vector store to call and how to parameterize them. This approach is more common than wrapping a method by hand because it offers more configuration options. For instance, we can replicate the above with the following, where search_type="similarity" selects similarity search and search_kwargs={"k": 1} returns a single result:


In [29]:
retriever = vector_store.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 1},
)

retriever.batch(
    [
        "How many distribution centers does Nike have in the US?",
        "When was Nike incorporated?",
    ],
)

[[Document(id='bf3c8c08-2c15-4207-a668-e745ece2bc6c', metadata={'producer': 'EDGRpdf Service w/ EO.Pdf 22.0.40.0', 'creator': 'EDGAR Filing HTML Converter', 'creationdate': '2023-07-20T16:22:00-04:00', 'title': '0000320187-23-000039', 'author': 'EDGAR Online, a division of Donnelley Financial Solutions', 'subject': 'Form 10-K filed on 2023-07-20 for the period ending 2023-05-31', 'keywords': '0000320187-23-000039; ; 10-K', 'moddate': '2023-07-20T16:22:08-04:00', 'source': './Document/nke-10k-2023.pdf', 'total_pages': 107, 'page': 26, 'page_label': '27', 'start_index': 804}, page_content='operations. We also lease an office complex in Shanghai, China, our headquarters for our Greater China geography, occupied by employees focused on implementing our\nwholesale, NIKE Direct and merchandising strategies in the region, among other functions.\nIn the United States, NIKE has eight significant distribution centers. Five are located in or near Memphis, Tennessee, two of which are owned and thr

In [None]:
# The output matches that of the custom retriever, which shows that as_retriever reproduces the same behavior.

# This completes the core steps of the "Build a semantic search engine" tutorial: document loading, splitting, embedding, vector storage, and retrievers.

In [None]:

# VectorStoreRetriever supports search types of "similarity" (default), "mmr" (maximum marginal relevance, described above), and "similarity_score_threshold". We can use the latter to threshold documents output by the retriever by similarity score.
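
# The thresholding idea can be sketched without any library: keep only results whose score clears a cutoff. The helper filter_by_score and the (document, score) pairs below are made up for illustration, and the sketch assumes scores where higher means more similar. In LangChain the cutoff is, to my understanding, passed as search_kwargs={"score_threshold": ...} together with search_type="similarity_score_threshold".

```python
def filter_by_score(scored_docs: list[tuple[str, float]], score_threshold: float) -> list[str]:
    """Keep documents whose similarity score is at least score_threshold."""
    return [doc for doc, score in scored_docs if score >= score_threshold]


# Made-up (document, similarity) pairs; higher score means more similar.
scored = [
    ("revenue highlights", 0.78),
    ("gross margin", 0.61),
    ("table of contents", 0.32),
]

print(filter_by_score(scored, score_threshold=0.5))
```

# Unlike a fixed k, a threshold can return any number of documents, including none when nothing is similar enough.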

# Retrievers can easily be incorporated into more complex applications, such as retrieval-augmented generation (RAG) applications that combine a given question with retrieved context into a prompt for an LLM. To learn more about building such an application, check out the RAG tutorial.


In [None]:
# Learn more:
# Retrieval strategies can be rich and complex. For example:

# We can infer hard rules and filters from a query (e.g., "using documents published after 2020");
# We can return documents that are linked to the retrieved context in some way (e.g., via some document taxonomy);
# We can generate multiple embeddings for each unit of context;
# We can ensemble results from multiple retrievers;
# We can assign weights to documents, e.g., to weigh recent documents higher.

# The retrievers section of the how-to guides covers these and other built-in retrieval strategies.
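
# As one possible way to ensemble results from multiple retrievers, the sketch below merges ranked lists with reciprocal rank fusion. This is a generic fusion technique shown for illustration and is not claimed to be what any particular LangChain retriever does; the function name, the constant c=60, and the document ids are assumptions for this example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], c: int = 60) -> list[str]:
    """Fuse several ranked lists of document ids into a single ranking."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes more for documents it ranks near the top.
            scores[doc_id] += 1.0 / (c + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical results from two different retrievers over the same corpus.
keyword_hits = ["doc_a", "doc_b", "doc_c"]
vector_hits = ["doc_b", "doc_d", "doc_a"]

print(reciprocal_rank_fusion([keyword_hits, vector_hits]))
```

# Documents that appear near the top of several lists (here doc_b and doc_a) rise above documents that only one retriever found.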

# It is also straightforward to extend the BaseRetriever class in order to implement custom retrievers. See the how-to guide on custom retrievers.


In [None]:
# Next steps
# You've now seen how to build a semantic search engine over a PDF document.

# For more on document loaders:
# Conceptual guide
# How-to guides
# Available integrations

# For more on embeddings:
# Conceptual guide
# How-to guides
# Available integrations

# For more on vector stores:
# Conceptual guide
# How-to guides
# Available integrations

# For more on RAG, see:
# Build a Retrieval Augmented Generation (RAG) App
# Related how-to guides
