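# Part 12: Multi-Representation Indexing
Main idea: summarize each document and index the summary instead of the raw text. A query is matched against the summaries, and the matching summary leads back to the full document. The same logic extends to other ways of indexing the original document.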
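
Before the LangChain version, the idea can be sketched with the standard library alone. This is a minimal illustration, not the notebook's implementation: the bag-of-words `embed` function stands in for a real embedding model, and `full_docs`, `toy_summaries`, and `retrieve` are hypothetical names and data invented for the example.

```python
import math
import uuid
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase bag-of-words counts (stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Hypothetical full documents and hand-written summaries.
full_docs = [
    "Long article about agents: planning, memory, tool use ...",
    "Long article about data quality: annotators, label noise ...",
]
toy_summaries = [
    "agents use planning memory and tools",
    "human annotation and data quality",
]

# One shared ID links each summary to its source document.
ids = [str(uuid.uuid4()) for _ in full_docs]
summary_index = [(doc_id, embed(s)) for doc_id, s in zip(ids, toy_summaries)]
docstore = dict(zip(ids, full_docs))


def retrieve(query: str) -> str:
    """Match the query against the summaries, then return the linked full document."""
    q = embed(query)
    best_id, _ = max(summary_index, key=lambda item: cosine(q, item[1]))
    return docstore[best_id]


result = retrieve("memory in agents")
```

Only the summaries are searched, while the full documents live in a separate key-value store; the cells below use the same split, with a vector store for the summaries and a byte store for the documents.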

In [1]:
# Load the web pages
from langchain_community.document_loaders import WebBaseLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

loader = WebBaseLoader("https://lilianweng.github.io/posts/2023-06-23-agent/")
docs = loader.load()

loader = WebBaseLoader("https://lilianweng.github.io/posts/2024-02-05-human-data-quality/")
docs.extend(loader.load())

USER_AGENT environment variable not set, consider setting it to identify your requests.
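
The warning above is harmless. If it is unwanted, setting the variable before the loader import should silence it. A minimal sketch, assuming the warning is raised when `langchain_community` is imported; the identifier string is a hypothetical placeholder:

```python
import os

# Hypothetical identifier; any descriptive string works.
# setdefault keeps a value that is already present in the environment.
os.environ.setdefault("USER_AGENT", "rag-from-scratch-notebook")
```

Run this in a cell that executes before `from langchain_community.document_loaders import WebBaseLoader`.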


In [6]:
# Summarize each document with the LLM
import uuid
import os

from langchain_core.documents import Document
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain.chat_models import init_chat_model
from dotenv import load_dotenv

load_dotenv()


api_url = os.getenv('API_URL')
api_key = os.getenv('API_KEY')
model_name = os.getenv('MODEL')
llm = init_chat_model(
    model_provider="openai",  # set the provider explicitly so LangChain does not infer it from the model name
    model=model_name,
    # temperature=0.0,
    api_key=api_key,
    base_url=api_url,
)

In [7]:
chain = (
    {"doc": lambda x: x.page_content}
    | ChatPromptTemplate.from_template("Summarize the following document:\n\n{doc}")
    | llm
    | StrOutputParser()
)

summaries = chain.batch(docs, {"max_concurrency": 5})

In [9]:
# Index the documents by their summaries
from langchain.storage import InMemoryByteStore
from langchain_community.vectorstores import Chroma
from langchain.retrievers.multi_vector import MultiVectorRetriever
from ark_embedding import ArkEmbeddings


embd = ArkEmbeddings(
    model=os.getenv("ALIYUN_EMBEDDING_MODEL"),
    api_key=os.getenv("ALIYUN_API_KEY"),
    api_url=os.getenv("ALIYUN_API_URL"),
    batch_size=10
)
# Vector store that embeds and holds the summaries
vectorstore = Chroma(collection_name="summaries",
                     embedding_function=embd)
store = InMemoryByteStore()
id_key = "doc_id"

# Build the retriever; id_key links each summary vector to its source document
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    byte_store=store,
    id_key=id_key,
)
doc_ids = [str(uuid.uuid4()) for _ in docs]

# Summary documents, each tagged with the ID of its source document
summary_docs = [
    Document(page_content=s, metadata={id_key: doc_ids[i]})
    for i, s in enumerate(summaries)
]

# Add the summaries to the vector store and the full documents to the docstore
retriever.vectorstore.add_documents(summary_docs)
retriever.docstore.mset(list(zip(doc_ids, docs)))



In [10]:
# Find the most similar summary
query = "Memory in agents"
sub_docs = vectorstore.similarity_search(query, k=1)
sub_docs[0]

Document(metadata={'doc_id': '625d221f-ab8e-48eb-866f-7dd58760a6d0'}, page_content='Of course. Here is a summary of the document "LLM Powered Autonomous Agents" by Lilian Weng.\n\n### Document Summary\n\nThis comprehensive blog post explores the architecture, components, and real-world applications of autonomous agents powered by Large Language Models (LLMs). It frames the LLM as the core "brain" of an agent system, which is augmented by three key components to overcome its inherent limitations.\n\n#### Core Components of an LLM Agent:\n\n1.  **Planning:** The agent breaks down complex tasks into smaller, manageable subgoals and can self-reflect to learn from mistakes.\n    *   **Task Decomposition:** Techniques like Chain-of-Thought (CoT) and Tree of Thoughts are used to break problems into steps.\n    *   **Self-Reflection:** Frameworks like **ReAct** (Reason + Act) and **Reflexion** allow the agent to critique its past actions, learn from failures, and refine its future strategy.\n\

In [11]:
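# Retrieve the full document through its matching summary.
# invoke() replaces the deprecated get_relevant_documents(). n_results is not a
# MultiVectorRetriever setting; the number of hits is controlled by the
# retriever's search_kwargs, e.g. {"k": 1}.
retrieved_docs = retriever.invoke(query)
retrieved_docs[0].page_content[0:500]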


"\n\n\n\n\n\nLLM Powered Autonomous Agents | Lil'Log\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nLil'Log\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n|\n\n\n\n\n\n\nPosts\n\n\n\n\nArchive\n\n\n\n\nSearch\n\n\n\n\nTags\n\n\n\n\nFAQ\n\n\n\n\n\n\n\n\n\n      LLM Powered Autonomous Agents\n    \nDate: June 23, 2023  |  Estimated Reading Time: 31 min  |  Author: Lilian Weng\n\n\n \n\n\nTable of Contents\n\n\n\nAgent System Overview\n\nComponent One: Planning\n\nTask Decomposition\n\nSelf-Reflection\n\n\nComponent Two: Memory\n\nTypes of Memory\n\nMaximum Inner Product Search (MIPS)\n\n\nComponent Three:"

# Part 13: RAPTOR
Recursive Abstractive Processing for Tree-Organized Retrieval  
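Reference code: https://github.com/parthsarthi03/raptor#  
Main idea: cluster the text chunks, summarize and embed each cluster, then repeat the process recursively. This builds, from the bottom up, a tree of summaries and embeddings at different levels. At inference time, retrieval runs over this tree and combines information from a long document at different levels of abstraction.  
The overall idea resembles the hierarchical clustering in GraphRAG, which also gathers information at several levels. The difference is that GraphRAG operates on a knowledge graph, while RAPTOR operates directly on chunks or documents.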
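
The recursive build can be sketched in a few lines. This is a toy illustration under loud assumptions, not the reference implementation: grouping adjacent nodes stands in for embedding-based clustering, `toy_summarize` stands in for an LLM summarizer, and `Node`, `build_tree`, and the sample chunks are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    text: str   # chunk text for a leaf, summary text for an inner node
    level: int  # 0 for leaves, increasing toward the root
    children: List["Node"] = field(default_factory=list)


def toy_summarize(texts: List[str]) -> str:
    """Placeholder summarizer: keeps the first three words of each text."""
    return " / ".join(" ".join(t.split()[:3]) for t in texts)


def build_tree(
    chunks: List[str],
    group_size: int = 2,
    summarize: Callable[[List[str]], str] = toy_summarize,
) -> List[List[Node]]:
    """Return the tree as a list of levels, leaves first, root level last."""
    if group_size < 2:
        raise ValueError("group_size must be at least 2 so each level shrinks")
    levels = [[Node(text=c, level=0) for c in chunks]]
    while len(levels[-1]) > 1:
        current = levels[-1]
        parents = []
        # Stand-in for clustering: group adjacent nodes.
        for i in range(0, len(current), group_size):
            group = current[i:i + group_size]
            summary = summarize([n.text for n in group])
            parents.append(Node(text=summary, level=len(levels), children=group))
        levels.append(parents)
    return levels


# Hypothetical sample chunks.
chunks = [
    "cats sleep most of the day",
    "cats hunt at night",
    "dogs guard the house",
    "dogs enjoy long walks",
]
levels = build_tree(chunks)

# Flatten every level into one candidate pool for retrieval.
all_nodes = [node for level in levels for node in level]
```

Searching `all_nodes` rather than only the leaves is one way to let a query match either a detailed chunk or a higher-level summary, which is the point of keeping every level of the tree.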

In [13]:
# Load the sample text (explicit encoding avoids platform-dependent defaults)
with open('data/sample.txt', 'r', encoding='utf-8') as file:
    text = file.read()

print(text[:100])

The wife of a rich man fell sick, and as she felt that her end
was drawing near, she called her only


In [5]:
from pathlib import Path
import sys
import os

project_root = (Path(os.getcwd()).parent / "raptor").resolve().as_posix()
sys.path.append(project_root)
print(project_root)

F:/project/rag-from-scratch/raptor


In [6]:
# Build the summary tree (raptor imports `umap`, which the umap-learn package provides)
from raptor import RetrievalAugmentation 


RA = RetrievalAugmentation()

# construct the tree
RA.add_documents(text)

ModuleNotFoundError: No module named 'umap'

# Part 14: ColBERT