# How to handle long text when doing extraction
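When working with files such as PDFs, you are likely to encounter text that exceeds your language model's context window. To process this text, consider these strategies:
1. **Change LLM**: Choose a different LLM that supports a larger context window.
2. **Brute force**: Chunk the document and extract content from each chunk.
3. **RAG**: Chunk the document, index the chunks, and extract content only from the subset of chunks that look "relevant".

Keep in mind that these strategies have different trade-offs, and the best strategy likely depends on the application you're designing!
This guide demonstrates how to implement strategies 2 and 3.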

## Setup
First, we'll install the dependencies needed for this guide:

In [1]:
%pip install -qU langchain-community lxml faiss-cpu langchain-openai

Note: you may need to restart the kernel to use updated packages.


Now we need some example data! Let's download an article about [cars from Wikipedia](https://en.wikipedia.org/wiki/Car) and load it as a LangChain [Document](https://python.langchain.com/api_reference/core/documents/langchain_core.documents.base.Document.html).

In [1]:
import re

import requests
from langchain_community.document_loaders import BSHTMLLoader

# Download the content
response = requests.get("https://en.wikipedia.org/wiki/Car")
# Write it to a file
with open("car.html", "w", encoding="utf-8") as f:
    f.write(response.text)
# Load it with an HTML parser
loader = BSHTMLLoader("car.html")
document = loader.load()[0]
# Clean up code
# Replace consecutive new lines with a single new line
document.page_content = re.sub("\n\n+", "\n", document.page_content)

In [2]:
print(len(document.page_content))

78865
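
The length above is in characters, while context windows are measured in tokens. As a rough sanity check, the sketch below estimates a token count with an assumed ratio of about four characters per token, which varies by tokenizer and language. The function names, the 8,000-token window, and the 1,000-token reserve are hypothetical values chosen for illustration.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Roughly estimate a token count from a character count.

    The 4-characters-per-token ratio is an assumed rule of thumb for
    English text, not an exact tokenizer measurement.
    """
    return int(len(text) / chars_per_token)


def fits_in_context(text: str, context_window: int, reserved: int = 1000) -> bool:
    """Check whether the text likely fits, leaving `reserved` tokens for
    the prompt and the model's output (a hypothetical safety margin)."""
    return estimate_tokens(text) <= context_window - reserved


sample = "x" * 78865  # same character length as the article loaded above
print(estimate_tokens(sample))
print(fits_in_context(sample, context_window=8000))
```

For an exact count, use the tokenizer that matches your model rather than this heuristic.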


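## Define the schema
Following the [extraction tutorial](/docs/tutorials/extraction), we will use Pydantic to define the schema of the information we wish to extract. In this case, we will extract a list of "key developments" (e.g., important historical events), each with a year and a description.
Note that we also include an `evidence` key and instruct the model to provide, verbatim, the relevant sentences from the article. This allows us to compare the extraction results against the model's reconstruction of the original text.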

In [3]:
from typing import List, Optional

from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from pydantic import BaseModel, Field


class KeyDevelopment(BaseModel):
    """Information about a development in the history of cars."""

    year: int = Field(
        ..., description="The year when there was an important historic development."
    )
    description: str = Field(
        ..., description="What happened in this year? What was the development?"
    )
    evidence: str = Field(
        ...,
        description="Repeat verbatim the sentence(s) from which the year and description information were extracted",
    )


class ExtractionData(BaseModel):
    """Extracted information about key developments in the history of cars."""

    key_developments: List[KeyDevelopment]


# Define a custom prompt to provide instructions and any additional context.
# 1) You can add examples into the prompt template to improve extraction quality
# 2) Introduce additional parameters to take context into account (e.g., include metadata
#    about the document from which the text was extracted.)
prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are an expert at identifying key historic developments in text. "
            "Only extract important historic developments. Extract nothing if no important information can be found in the text.",
        ),
        ("human", "{text}"),
    ]
)
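
Because the model is asked to repeat its evidence verbatim, one way to use that field is to check whether each evidence string actually occurs in the source text. The sketch below is one possible check, not part of the original guide. The helper names are hypothetical, and it assumes that differences in whitespace and letter case should be ignored.

```python
import re


def normalize(text: str) -> str:
    """Collapse runs of whitespace and lowercase the text for comparison."""
    return re.sub(r"\s+", " ", text).strip().lower()


def evidence_in_source(evidence: str, source: str) -> bool:
    """Return True if the evidence appears in the source,
    ignoring differences in whitespace and letter case."""
    return normalize(evidence) in normalize(source)


source_text = "Cars came into\nglobal use during the 20th century."
print(evidence_in_source("Cars came into global use", source_text))
print(evidence_in_source("Cars were invented in 1900", source_text))
```

An evidence string that fails this check is not necessarily wrong, but it is a useful signal that the extraction deserves a closer look.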

## Create an extractor
Let's select an LLM. Because we are using tool calling, we need a model that supports tool calling. See [this table](/docs/integrations/chat) for available LLMs.

In [4]:
# | output: false
# | echo: false

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o", temperature=0)

In [5]:
extractor = prompt | llm.with_structured_output(
    schema=ExtractionData,
    include_raw=False,
)

## Brute force approach
Split the document into chunks such that each chunk fits into the context window of the LLM.

In [6]:
from langchain_text_splitters import TokenTextSplitter

text_splitter = TokenTextSplitter(
    # Controls the size of each chunk
    chunk_size=2000,
    # Controls overlap between chunks
    chunk_overlap=20,
)

texts = text_splitter.split_text(document.page_content)
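
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sliding-window sketch. It is a simplification, not the splitter's actual implementation: whitespace-separated words stand in for model tokens, and the function name is hypothetical.

```python
from typing import List


def split_with_overlap(
    tokens: List[str], chunk_size: int, chunk_overlap: int
) -> List[List[str]]:
    """Split a token list into windows of `chunk_size` tokens, where each
    window repeats the last `chunk_overlap` tokens of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start : start + chunk_size])
        # Stop once the current window has reached the end of the input
        if start + chunk_size >= len(tokens):
            break
    return chunks


words = "a b c d e f g h i j".split()  # words stand in for tokens
for chunk in split_with_overlap(words, chunk_size=4, chunk_overlap=1):
    print(chunk)
```

A small overlap reduces the chance that a sentence is cut in half at a chunk boundary, at the cost of processing some text twice.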

Use [batch](https://python.langchain.com/api_reference/core/runnables/langchain_core.runnables.base.Runnable.html) functionality to run the extraction in **parallel** across each chunk!

:::tip
You can often use `.batch()` to parallelize the extractions! `.batch` uses a threadpool under the hood to help you parallelize workloads.
If your model is exposed via an API, this will likely speed up your extraction flow!
:::

In [7]:
# Limit just to the first 3 chunks
# so the code can be re-run quickly
first_few = texts[:3]

extractions = extractor.batch(
    [{"text": text} for text in first_few],
    {"max_concurrency": 5},  # limit the concurrency by passing max concurrency!
)
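
For intuition about what a bounded thread pool does with I/O-bound calls, the sketch below uses Python's standard `concurrent.futures` module. It is an illustration of the general technique rather than LangChain's internal code. `fake_extract` is a hypothetical stand-in for a network-bound model call.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import List


def fake_extract(text: str) -> str:
    """Hypothetical stand-in for a network-bound model call."""
    time.sleep(0.1)  # simulate request latency
    return text.upper()


def run_in_parallel(texts: List[str], max_concurrency: int) -> List[str]:
    """Run fake_extract over all texts with at most `max_concurrency` threads."""
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        # map() returns results in the same order as the inputs
        return list(pool.map(fake_extract, texts))


chunks = ["chunk one", "chunk two", "chunk three"]
print(run_in_parallel(chunks, max_concurrency=3))
```

Capping concurrency, as `max_concurrency` does in the cell above, is a common way to stay within an API provider's rate limits.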

### Merge results
After extracting data from the individual chunks, we'll want to merge the extractions together.

In [8]:
key_developments = []

for extraction in extractions:
    key_developments.extend(extraction.key_developments)

key_developments[:10]

[KeyDevelopment(year=1769, description='Nicolas-Joseph Cugnot built the first steam-powered road vehicle.', evidence='The French inventor Nicolas-Joseph Cugnot built the first steam-powered road vehicle in 1769, while the Swiss inventor François Isaac de Rivaz designed and constructed the first internal combustion-powered automobile in 1808.'),
 KeyDevelopment(year=1808, description='François Isaac de Rivaz designed and constructed the first internal combustion-powered automobile.', evidence='The French inventor Nicolas-Joseph Cugnot built the first steam-powered road vehicle in 1769, while the Swiss inventor François Isaac de Rivaz designed and constructed the first internal combustion-powered automobile in 1808.'),
 KeyDevelopment(year=1886, description='Carl Benz invented the modern car, a practical, marketable automobile for everyday use, and patented his Benz Patent-Motorwagen.', evidence='The modern car—a practical, marketable automobile for everyday use—was invented in 1886, whe

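## RAG-based approach
Another simple idea is to chunk up the text, but instead of extracting information from every chunk, focus only on the most relevant chunks.

:::caution
It can be difficult to identify which chunks are relevant.
For example, in the `car` article we're using here, most of the article contains key development information. So by using **RAG**, we'll likely be throwing out a lot of relevant information.
We suggest experimenting with your use case to determine whether this approach works.
:::

To implement the RAG-based approach:
1. Chunk up your document(s) and index them (e.g., in a vectorstore).
2. Prepend the `extractor` chain with a retrieval step using the vectorstore.

Here's a simple example that relies on the `FAISS` vectorstore: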

In [9]:
from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document
from langchain_core.runnables import RunnableLambda
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter

texts = text_splitter.split_text(document.page_content)
vectorstore = FAISS.from_texts(texts, embedding=OpenAIEmbeddings())

retriever = vectorstore.as_retriever(
    search_kwargs={"k": 1}
)  # Only extract from first document

In this case, the RAG extractor looks only at the top document.

In [10]:
rag_extractor = {
    "text": retriever | (lambda docs: docs[0].page_content)  # fetch content of top doc
} | extractor

In [11]:
results = rag_extractor.invoke("Key developments associated with cars")

In [13]:
for key_development in results.key_developments:
    print(key_development)

year=2006 description='Car-sharing services in the US experienced double-digit growth in revenue and membership.' evidence='in the US, some car-sharing services have experienced double-digit growth in revenue and membership growth between 2006 and 2007.'
year=2020 description='56 million cars were manufactured worldwide, with China producing the most.' evidence='In 2020, there were 56 million cars manufactured worldwide, down from 67 million the previous year. The automotive industry in China produces by far the most (20 million in 2020).'
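
To see why retrieval can miss relevant chunks, it helps to look at how ranking by similarity works. The sketch below is a toy version of the retrieval step that needs no vector store or API key. It assumes a bag-of-words count vector in place of a learned embedding, so its rankings will differ from those of `OpenAIEmbeddings`. All function names and the sample chunks are hypothetical.

```python
import math
import re
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    """Toy embedding: a bag-of-words count vector (not a learned embedding)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: str, chunks: List[str], k: int = 1) -> List[str]:
    """Return the k chunks most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, embed(c)), reverse=True)
    return ranked[:k]


chunks = [
    "Steam-powered road vehicles appeared in 1769.",
    "Car-sharing services grew quickly between 2006 and 2007.",
    "Road traffic collisions are a leading cause of injury.",
]
print(top_k("growth of car-sharing services", chunks, k=1))
```

With `k=1`, every chunk other than the top-ranked one is ignored, which is consistent with the output above containing only developments from a single part of the article.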


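## Common issues
Different methods have their own pros and cons related to cost, speed, and accuracy.
Watch out for these issues:
* Chunking content means that the LLM can fail to extract information if the information is spread across multiple chunks.
* Large chunk overlap may cause the same information to be extracted twice, so be prepared to de-duplicate!
* LLMs can make up data. If you are looking for a single fact across a large text and using a brute force approach, you may end up getting more made-up data.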
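
As a starting point for the de-duplication mentioned above, the sketch below drops exact repeats of a year and description pair after normalizing case and whitespace. The `Development` dataclass is a hypothetical stand-in for the `KeyDevelopment` model so that the example runs on its own. Paraphrased duplicates would need a fuzzier comparison, which this sketch does not attempt.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Development:
    """Hypothetical stand-in for the KeyDevelopment model used in this guide."""

    year: int
    description: str


def deduplicate(items: Iterable[Development]) -> List[Development]:
    """Keep the first occurrence of each (year, normalized description) pair."""
    seen = set()
    unique = []
    for item in items:
        # Normalize case and whitespace so trivial variations compare equal
        key = (item.year, " ".join(item.description.lower().split()))
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique


extracted = [
    Development(1886, "Carl Benz patented the Benz Patent-Motorwagen."),
    Development(1886, "carl benz  patented the Benz Patent-Motorwagen."),
    Development(1908, "The Ford Model T was introduced."),
]
for development in deduplicate(extracted):
    print(development.year, development.description)
```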