## 1. Retrieval over a PDF File

### 1.1 Loading the PDF
Use `PyPDFLoader` to load the PDF file.

In [26]:
# Import PyPDFLoader and load the PDF file
from langchain_community.document_loaders import PyPDFLoader

file_path = "./ntk.pdf"
loader = PyPDFLoader(file_path)

docs = loader.load()

print(len(docs))
print("___________________")
print("Contents of the first page")
print(f"{docs[0].page_content[:200]}\n")
print(docs[0].metadata)

107
___________________
Contents of the first page
Table of Contents
UNITED STATES
SECURITIES AND EXCHANGE COMMISSION
Washington, D.C. 20549
FORM 10-K
(Mark One)
☑  ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(D) OF THE SECURITIES EXCHANGE ACT OF 1934
F

{'producer': 'EDGRpdf Service w/ EO.Pdf 22.0.40.0', 'creator': 'EDGAR Filing HTML Converter', 'creationdate': '2023-07-20T16:22:00-04:00', 'title': '0000320187-23-000039', 'author': 'EDGAR Online, a division of Donnelley Financial Solutions', 'subject': 'Form 10-K filed on 2023-07-20 for the period ending 2023-05-31', 'keywords': '0000320187-23-000039; ; 10-K', 'moddate': '2023-07-20T16:22:08-04:00', 'source': './ntk.pdf', 'total_pages': 107, 'page': 0, 'page_label': '1'}


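### 1.2 Splitting
For information retrieval and downstream question answering, a whole page can be too coarse a unit. Our end goal is to retrieve `Document` objects that answer an input query, and splitting the PDF further helps ensure that the meaning of a relevant passage is not "washed out" by the surrounding text.

We can use a text splitter for this. Here we use a simple character-based splitter that divides the documents into chunks of 1000 characters with 200 characters of overlap between consecutive chunks. The overlap reduces the chance of separating a statement from the important context around it. We use `RecursiveCharacterTextSplitter`, which recursively splits the document on common separators (such as newlines) until each chunk is an appropriate size. It is the recommended text splitter for generic text use cases. Setting `add_start_index=True` records the character offset at which each chunk begins in the chunk's metadata under `start_index`.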

In [None]:
# Split the PDF documents into chunks
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=200, add_start_index=True
)
all_splits = text_splitter.split_documents(docs)

len(all_splits)
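
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch that cuts a string into fixed-size windows. It is a simplification and not the algorithm used by `RecursiveCharacterTextSplitter`, which additionally prefers to break on separators such as newlines. The function name `split_with_overlap` and the sample string are hypothetical.

```python
from typing import Dict, List, Union


def split_with_overlap(
    text: str, chunk_size: int = 1000, chunk_overlap: int = 200
) -> List[Dict[str, Union[int, str]]]:
    """Cut `text` into windows of `chunk_size` characters that share
    `chunk_overlap` characters with the previous window."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append({"start_index": start, "text": text[start:start + chunk_size]})
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# Hypothetical sample: 2500 characters of repeating digits
sample = "".join(str(i % 10) for i in range(2500))
chunks = split_with_overlap(sample)
for chunk in chunks:
    print(chunk["start_index"], len(chunk["text"]))
```

With a 2500-character input, the windows start at offsets 0, 800, and 1600, and the last 200 characters of each window reappear at the start of the next one.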

### 1.3 Vector stores
- Use an embedding model to convert text into vectors
- Store the vectors in a vector store
- Query the vector store for similar vectors

In [None]:
from langchain_core.vectorstores import InMemoryVectorStore
# from langchain_openai import OpenAIEmbeddings
from langchain_community.embeddings import DashScopeEmbeddings

# DashScopeEmbeddings reads the API key from the DASHSCOPE_API_KEY environment variable
embeddings = DashScopeEmbeddings(
    model="text-embedding-v3"
)

vector_store = InMemoryVectorStore(embeddings)

# Index the documents
ids = vector_store.add_documents(documents=all_splits)

In [None]:
# Return documents ranked by similarity to a string query
results = vector_store.similarity_search(
    "How many distribution centers does Nike have in the US?"
)

print(results[0])
print(results[0].metadata)

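- Returning scores

    - How to read the score depends on the vector store: some stores return a distance, which is inversely related to similarity (lower score = more similar).
    - `InMemoryVectorStore`, used here, returns cosine similarity, so a higher score means a closer match.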

In [None]:

results = vector_store.similarity_search_with_score("What was Nike's revenue in 2023?")
doc, score = results[0]
print(f"Score: {score}\n")
print(doc)
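
Whether a higher or a lower score is better depends on whether the store reports a similarity or a distance. The sketch below, which uses made-up two-dimensional vectors and takes cosine distance to be `1 - similarity`, shows that the two conventions produce the same ranking read in opposite directions.

```python
import math
from typing import List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical query and candidate vectors
query = [1.0, 0.0]
candidates = {
    "same direction": [2.0, 0.0],
    "diagonal": [1.0, 1.0],
    "orthogonal": [0.0, 3.0],
}

for name, vec in candidates.items():
    sim = cosine_similarity(query, vec)
    print(f"{name:15s} similarity={sim:.3f} distance={1.0 - sim:.3f}")

# Best match first: highest similarity, or equivalently lowest distance
by_similarity = sorted(
    candidates, key=lambda n: cosine_similarity(query, candidates[n]), reverse=True
)
by_distance = sorted(
    candidates, key=lambda n: 1.0 - cosine_similarity(query, candidates[n])
)
```

Before comparing scores across vector stores, check which of the two conventions each store uses.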

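- Returning documents by similarity to an embedded query
    - First embed the query into a vector,

    - then search the vector store with that vector. Unlike the calls above, which take the query string directly, this method requires you to compute the embedding yourself.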

In [None]:
embedding = embeddings.embed_query("How were Nike's margins impacted in 2023?")

results = vector_store.similarity_search_by_vector(embedding)
print(results[0].metadata)
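
The three steps of this section (embed, store, query) can be sketched without any external service. This is a toy illustration under stated assumptions, not how `InMemoryVectorStore` or DashScope work internally: `embed` is a hypothetical word-count embedding, `ToyVectorStore` is a hypothetical class, and the sample sentences are made up.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def embed(text: str) -> Dict[str, float]:
    """Toy embedding (hypothetical stand-in for a real model): word counts."""
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    return {word: float(n) for word, n in counts.items()}


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(value * b.get(word, 0.0) for word, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class ToyVectorStore:
    """Keeps (text, vector) pairs in a list and ranks them by cosine similarity."""

    def __init__(self) -> None:
        self._items: List[Tuple[str, Dict[str, float]]] = []

    def add_texts(self, texts: List[str]) -> None:
        for text in texts:
            self._items.append((text, embed(text)))

    def search_by_vector(
        self, vector: Dict[str, float], k: int = 1
    ) -> List[Tuple[float, str]]:
        scored = [(cosine(vector, vec), text) for text, vec in self._items]
        scored.sort(key=lambda pair: pair[0], reverse=True)  # best match first
        return scored[:k]

    def search(self, query: str, k: int = 1) -> List[Tuple[float, str]]:
        # Searching by string is just "embed, then search by vector".
        return self.search_by_vector(embed(query), k=k)


store = ToyVectorStore()
store.add_texts(
    [
        "The warehouse ships orders every morning.",
        "The company was founded by two engineers.",
        "Margins fell because shipping costs rose.",
    ]
)

query = "Why did margins fall?"
by_text = store.search(query, k=1)
by_vector = store.search_by_vector(embed(query), k=1)
print(by_text[0][1])
```

Both calls return the same result, which mirrors the relationship between `similarity_search` and `similarity_search_by_vector`: the former embeds the query for you.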

### 1.4 Retrievers
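- We can build a simple retriever ourselves without subclassing `Retriever`. Once we have chosen the method to use for retrieving documents, it is easy to wrap it as a runnable. Below we build one around the `similarity_search` method: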

In [None]:
from typing import List

from langchain_core.documents import Document
from langchain_core.runnables import chain


@chain
def retriever(query: str) -> List[Document]:
    return vector_store.similarity_search(query, k=1)


retriever.batch(
    [
        "How many distribution centers does Nike have in the US?",
        "When was Nike incorporated?",
    ],
)

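- Vector stores implement an `as_retriever` method that generates a retriever, specifically a `VectorStoreRetriever`. These retrievers have `search_type` and `search_kwargs` attributes that identify which method of the underlying vector store to call and how to parameterize it. For example, the following replicates the retriever above: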

In [None]:
retriever = vector_store.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 1},
)

retriever.batch(
    [
        "How many distribution centers does Nike have in the US?",
        "When was Nike incorporated?",
    ],
)

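Retrieval strategies can be rich and complex. For example:
- we can infer hard rules and filters from the query (e.g., "only use documents published in 2020 or later");
- we can return documents that are linked to the retrieved context in some way (e.g., via a document taxonomy);
- we can generate multiple embeddings for each unit of context;
- we can ensemble the results from multiple retrievers;
- we can assign weights to documents, e.g., to weight recent documents higher.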
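
As a rough illustration of two of these strategies, the sketch below combines the ranked results of two retrievers and then applies a hard metadata filter. Reciprocal rank fusion is one common way to ensemble rankings and is chosen here as an assumption; the source does not prescribe a method. The document IDs, the years, and the constant `k=60` are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Merge several ranked ID lists: each list contributes 1 / (k + rank)
    to a document's score, and documents are returned best first."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical outputs of two different retrievers for the same query
keyword_hits = ["doc_a", "doc_b", "doc_c"]
vector_hits = ["doc_b", "doc_d", "doc_a"]
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)

# Hypothetical metadata, used for a hard filter: 2020 or later only
years = {"doc_a": 2019, "doc_b": 2021, "doc_c": 2020, "doc_d": 2023}
recent = [doc_id for doc_id in fused if years[doc_id] >= 2020]
print(recent)
```

Documents that appear near the top of both lists end up first after fusion, and the filter then drops anything that violates the hard rule regardless of its score.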



## 2. Build an Extraction Chain
We will use the tool-calling feature of a chat model to extract structured information from unstructured text.

### 2.1 The Schema
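- First, we need to describe what information we want to extract from the text.
- We use Pydantic to define an example schema for extracting personal information.

    - Do not let the LLM make up information! The schema below declares every attribute as `Optional`, which allows the LLM to output `None` when it does not know the answer.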

In [44]:
from typing import Optional

from pydantic import BaseModel, Field


class Person(BaseModel):
    """Information about a person."""

    # ^ Doc-string for the entity Person.
    # This doc-string is sent to the LLM as the description of the schema Person,
    # and it can help to improve extraction results.

    # Note that:
    # 1. Each field is an `optional` -- this allows the model to decline to extract it!
    # 2. Each field has a `description` -- this description is used by the LLM.
    # Having a good description can help improve extraction results.
    name: Optional[str] = Field(default=None, description="The name of the person")
    hair_color: Optional[str] = Field(
        default=None, description="The color of the person's hair if known"
    )
    height_in_meters: Optional[str] = Field(
        default=None, description="Height measured in meters"
    )
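
The effect of declaring every field `Optional` with `default=None` can be checked without calling an LLM. The sketch below repeats the `Person` schema so that it runs on its own and builds an instance from a hypothetical partial result in which only the name was found.

```python
from typing import Optional

from pydantic import BaseModel, Field


class Person(BaseModel):
    """Information about a person."""

    name: Optional[str] = Field(default=None, description="The name of the person")
    hair_color: Optional[str] = Field(
        default=None, description="The color of the person's hair if known"
    )
    height_in_meters: Optional[str] = Field(
        default=None, description="Height measured in meters"
    )


# Hypothetical extraction result: the text mentioned neither hair color nor height
partial = Person(name="Alan Smith")

print(partial.name)
print(partial.hair_color)        # missing attribute falls back to None
print(partial.height_in_meters)  # missing attribute falls back to None
```

Because missing attributes validate to `None` instead of raising an error, the model is never forced to invent a value just to satisfy the schema.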

### 2.2 The Extractor
- Let's create an information extractor using the schema defined above.

In [45]:
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder

# Define a custom prompt to provide instructions and any additional context.
# 1) You can add examples into the prompt template to improve extraction quality
# 2) Introduce additional parameters to take context into account (e.g., include metadata
#    about the document from which the text was extracted.)
prompt_template = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are an expert extraction algorithm. "
            "Only extract relevant information from the text. "
            "If you do not know the value of an attribute asked to extract, "
            "return null for the attribute's value.",
        ),
        # Please see the how-to about improving performance with
        # reference examples.
        # MessagesPlaceholder('examples'),
        ("human", "{text}"),
    ]
)

In [46]:
# Choose the LLM
from langchain_openai import ChatOpenAI
import os


# Initialize the chat model through DashScope's OpenAI-compatible endpoint
model = ChatOpenAI(
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    model="qwen-plus",  # qwen-plus is used as an example; swap in another model name as needed
)



In [None]:
structured_llm = model.with_structured_output(schema=Person)

In [48]:
text = "Alan Smith is 6 feet tall and has blond hair."
prompt = prompt_template.invoke({"text": text})


In [49]:
prompt

ChatPromptValue(messages=[SystemMessage(content="You are an expert extraction algorithm. Only extract relevant information from the text. If you do not know the value of an attribute asked to extract, return null for the attribute's value.", additional_kwargs={}, response_metadata={}), HumanMessage(content='Alan Smith is 6 feet tall and has blond hair.', additional_kwargs={}, response_metadata={})])