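# Part 12: Multi-Representation Indexing
Main idea: summarize each document and index it by its summary. The same logic extends to other ways of indexing the original documents.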
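
The idea can be sketched without any framework. The example below is a minimal illustration, assuming a toy bag-of-words `embed` function in place of a real embedding model and hand-written summaries in place of LLM output; `embed`, `cosine`, `retrieve`, and the sample strings are illustrative names and data, not part of the notebook.

```python
import math
import uuid
from collections import Counter


def embed(text):
    """Toy bag-of-words embedding (a stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Hypothetical full documents and hand-written summaries (an LLM would write these).
docs = [
    "<long article about agent planning, memory and tool use>",
    "<long article about human data quality and annotation>",
]
summaries = [
    "agents use planning memory and tools",
    "data quality depends on human annotation",
]

doc_ids = [str(uuid.uuid4()) for _ in docs]
docstore = dict(zip(doc_ids, docs))  # doc_id -> full document
index = [(embed(s), doc_id) for s, doc_id in zip(summaries, doc_ids)]  # summary vector -> doc_id


def retrieve(query):
    """Match the query against the summaries, then return the linked full document."""
    q = embed(query)
    _, best_id = max(index, key=lambda item: cosine(q, item[0]))
    return docstore[best_id]


print(retrieve("memory in agents"))
```

Only the short summaries are compared with the query, while the shared ID lets the lookup hand back the full document for generation.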

In [1]:
# Load the web pages
from langchain_community.document_loaders import WebBaseLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

loader = WebBaseLoader("https://lilianweng.github.io/posts/2023-06-23-agent/")
docs = loader.load()

loader = WebBaseLoader("https://lilianweng.github.io/posts/2024-02-05-human-data-quality/")
docs.extend(loader.load())

USER_AGENT environment variable not set, consider setting it to identify your requests.


In [6]:
# Summarize each document with the LLM
import uuid
import os

from langchain_core.documents import Document
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain.chat_models import init_chat_model
from dotenv import load_dotenv

load_dotenv()


api_url = os.getenv('API_URL')
api_key = os.getenv('API_KEY')
model_name = os.getenv('MODEL')
llm = init_chat_model(
    model_provider="openai",  # set explicitly so LangChain does not infer the provider from the model name
    model=model_name,
    # temperature=0.0,
    api_key=api_key,
    base_url=api_url,
)

In [7]:
chain = (
    {"doc": lambda x: x.page_content}
    | ChatPromptTemplate.from_template("Summarize the following document:\n\n{doc}")
    | llm
    | StrOutputParser()
)

summaries = chain.batch(docs, {"max_concurrency": 5})

In [9]:
# Index the documents by their summaries
from langchain.storage import InMemoryByteStore
from langchain_community.vectorstores import Chroma
from langchain.retrievers.multi_vector import MultiVectorRetriever
from ark_embedding import ArkEmbeddings


embd = ArkEmbeddings(
    model=os.getenv("ALIYUN_EMBEDDING_MODEL"),
    api_key=os.getenv("ALIYUN_API_KEY"),
    api_url=os.getenv("ALIYUN_API_URL"),
    batch_size=10
)
# Embed and store the summaries
vectorstore = Chroma(collection_name="summaries",
                     embedding_function=embd)
store = InMemoryByteStore()
id_key = "doc_id"

# Build the retriever; id_key links each summary vector to its source document
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    byte_store=store,
    id_key=id_key,
)
doc_ids = [str(uuid.uuid4()) for _ in docs]

# Summary documents, each tagged with the ID of its source document
summary_docs = [
    Document(page_content=s, metadata={id_key: doc_ids[i]})
    for i, s in enumerate(summaries)
]

# Add the summary vectors and the full documents separately
retriever.vectorstore.add_documents(summary_docs)
retriever.docstore.mset(list(zip(doc_ids, docs)))



In [10]:
# Find the most similar summary
query = "Memory in agents"
sub_docs = vectorstore.similarity_search(query, k=1)
sub_docs[0]

Document(metadata={'doc_id': '625d221f-ab8e-48eb-866f-7dd58760a6d0'}, page_content='Of course. Here is a summary of the document "LLM Powered Autonomous Agents" by Lilian Weng.\n\n### Document Summary\n\nThis comprehensive blog post explores the architecture, components, and real-world applications of autonomous agents powered by Large Language Models (LLMs). It frames the LLM as the core "brain" of an agent system, which is augmented by three key components to overcome its inherent limitations.\n\n#### Core Components of an LLM Agent:\n\n1.  **Planning:** The agent breaks down complex tasks into smaller, manageable subgoals and can self-reflect to learn from mistakes.\n    *   **Task Decomposition:** Techniques like Chain-of-Thought (CoT) and Tree of Thoughts are used to break problems into steps.\n    *   **Self-Reflection:** Frameworks like **ReAct** (Reason + Act) and **Reflexion** allow the agent to critique its past actions, learn from failures, and refine its future strategy.\n\

In [11]:
# Retrieve the full document linked to the best-matching summary
# (get_relevant_documents is deprecated; the number of hits is governed by retriever.search_kwargs)
retrieved_docs = retriever.invoke(query)
retrieved_docs[0].page_content[0:500]


"\n\n\n\n\n\nLLM Powered Autonomous Agents | Lil'Log\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nLil'Log\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n|\n\n\n\n\n\n\nPosts\n\n\n\n\nArchive\n\n\n\n\nSearch\n\n\n\n\nTags\n\n\n\n\nFAQ\n\n\n\n\n\n\n\n\n\n      LLM Powered Autonomous Agents\n    \nDate: June 23, 2023  |  Estimated Reading Time: 31 min  |  Author: Lilian Weng\n\n\n \n\n\nTable of Contents\n\n\n\nAgent System Overview\n\nComponent One: Planning\n\nTask Decomposition\n\nSelf-Reflection\n\n\nComponent Two: Memory\n\nTypes of Memory\n\nMaximum Inner Product Search (MIPS)\n\n\nComponent Three:"

# Part 13: RAPTOR
Recursive Abstractive Processing for Tree-Organized Retrieval  
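Reference code: https://github.com/parthsarthi03/raptor#  
Main idea: cluster the text chunks, summarize and embed each cluster, then repeat the process recursively to build, bottom-up, a tree of summaries and embeddings at different levels. At inference time, retrieval runs over this tree and combines information from a long document at different levels of abstraction.  
The overall idea resembles the hierarchical clustering in GraphRAG, which also gathers information at several levels. The difference is that GraphRAG operates on a knowledge graph, while RAPTOR operates directly on chunks or documents.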
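
The bottom-up construction can be sketched in a few lines. This is a simplified illustration, not the RAPTOR implementation: `toy_cluster` groups adjacent nodes in fixed-size batches instead of clustering embeddings, and `toy_summarize` truncates text instead of calling an LLM. `Node`, `build_tree`, and the sample chunks are hypothetical names and data chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    text: str
    layer: int
    children: list = field(default_factory=list)


def toy_summarize(texts, max_words=8):
    """Stand-in for the LLM summarizer: keep the first words of the joined text."""
    return " ".join(" ".join(texts).split()[:max_words])


def toy_cluster(nodes, size=2):
    """Stand-in for embedding-based clustering: group adjacent nodes in fixed-size batches."""
    return [nodes[i:i + size] for i in range(0, len(nodes), size)]


def build_tree(chunks, max_layers=5):
    """Return all layers, from the leaf chunks (layer 0) up to the top summaries."""
    layers = [[Node(text=c, layer=0) for c in chunks]]
    # Stop when a single root remains or max_layers summary layers have been built.
    while len(layers[-1]) > 1 and len(layers) <= max_layers:
        parents = [
            Node(
                text=toy_summarize([n.text for n in group]),
                layer=len(layers),
                children=group,
            )
            for group in toy_cluster(layers[-1])
        ]
        layers.append(parents)
    return layers


chunks = [
    "Cinderella lost her mother.",
    "Her stepsisters mistreated her.",
    "A bird granted her wishes.",
    "The prince found her with the slipper.",
]
layers = build_tree(chunks)
for depth, layer in enumerate(layers):
    print(depth, [n.text for n in layer])
```

Each pass summarizes the clusters of the layer below, so higher layers hold fewer, more abstract nodes, and every node remains available for retrieval.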

In [1]:
from pathlib import Path
import sys
import os

project_root = (Path(os.getcwd()).parent / "raptor").resolve().as_posix()
sys.path.append(project_root)
print(project_root)

F:/project/rag-from-scratch/raptor


In [2]:
from raptor import (
    BaseSummarizationModel, 
    BaseQAModel, 
    BaseEmbeddingModel, 
    RetrievalAugmentation,
    RetrievalAugmentationConfig
)

2025-11-21 21:09:28,759 - Loading faiss with AVX2 support.
2025-11-21 21:09:28,778 - Successfully loaded faiss with AVX2 support.


In [3]:
from tenacity import retry, stop_after_attempt, wait_random_exponential
from openai import OpenAI
from dotenv import load_dotenv

load_dotenv()

True

In [4]:
class MySummarizationModel(BaseSummarizationModel):
    def __init__(self, model=os.environ["MODEL"]):

        self.model = model

    # @retry(wait=wait_random_exponential(min=1, max=20), stop=stop_after_attempt(6))
    def summarize(self, context, max_tokens=500, stop_sequence=None):

        try:
            client = OpenAI(
                base_url=os.environ["API_URL"],
                api_key=os.environ["API_KEY"],
            )

            response = client.chat.completions.create(
                model=self.model,
                messages=[
                    {"role": "system", "content": "You are a helpful assistant."},
                    {
                        "role": "user",
                        "content": f"Write a summary of the following, including as many key details as possible: {context}:",
                    },
                ],
                max_tokens=max_tokens,
            )

            return response.choices[0].message.content

        except Exception as e:
            # Re-raise instead of returning the exception object, which would be stored as a non-string summary
            print(e)
            raise

In [5]:
class MyQAModel(BaseQAModel):
    def __init__(self, model=os.environ["MODEL"]):
        """
        Initializes the QA model with the specified model name.

        Args:
            model (str, optional): The model to use for answering questions. Defaults to the MODEL environment variable.
        """
        self.model = model
        self.client = OpenAI(
            api_key=os.environ["API_KEY"],
            base_url=os.environ["API_URL"],
        )

    # @retry(wait=wait_random_exponential(min=1, max=20), stop=stop_after_attempt(6))
    def answer_question(self, context, question, max_tokens=150, stop_sequence=None):
        """
        Answers a question using the given context.

        Args:
            context (str): The retrieved text to answer from.
            question (str): The question to answer.
            max_tokens (int, optional): The maximum number of tokens in the generated answer. Defaults to 150.
            stop_sequence (str, optional): The sequence at which to stop generation. Defaults to None.

        Returns:
            str: The generated answer, or an empty string if the request fails.
        """
        try:
            # Note: this calls the legacy Completions endpoint (/completions). If the configured
            # provider does not serve it, the request fails and an empty answer is returned.
            response = self.client.completions.create(
                prompt=f"using the following information {context}. Answer the following question in less than 5-7 words, if possible: {question}",
                temperature=0,
                max_tokens=max_tokens,
                top_p=1,
                frequency_penalty=0,
                presence_penalty=0,
                stop=stop_sequence,
                model=self.model,
            )
            return response.choices[0].text.strip()

        except Exception as e:
            print(e)
            return ""
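
The run recorded later in this notebook logs a `404 Not Found` for the `/completions` request and prints an empty answer, which suggests the configured provider does not serve the legacy Completions endpoint. A possible variant that uses the Chat Completions endpoint instead is sketched below. `ChatQAModel` is a hypothetical name; the client is injected so the class can be exercised without network access, and in the notebook it would subclass `BaseQAModel` and receive an `openai.OpenAI` instance.

```python
class ChatQAModel:  # in the notebook, subclass raptor's BaseQAModel instead
    """Sketch of a QA model backed by the Chat Completions endpoint."""

    def __init__(self, client, model):
        # `client` is assumed to expose the openai.OpenAI interface (client.chat.completions.create).
        self.client = client
        self.model = model

    def answer_question(self, context, question, max_tokens=150, stop_sequence=None):
        """Return a short answer to `question` based on `context`, or "" on failure."""
        try:
            response = self.client.chat.completions.create(
                model=self.model,
                messages=[
                    {"role": "system", "content": "You are a question answering assistant."},
                    {
                        "role": "user",
                        "content": (
                            f"Using the following information: {context}\n\n"
                            f"Answer the following question in fewer than 5-7 words, if possible: {question}"
                        ),
                    },
                ],
                temperature=0,
                max_tokens=max_tokens,
                stop=stop_sequence,
            )
            # message.content can be None, so fall back to an empty string before stripping.
            return (response.choices[0].message.content or "").strip()
        except Exception as e:
            print(e)
            return ""
```

With the notebook's environment variables, an instance could be created as `ChatQAModel(OpenAI(api_key=os.environ["API_KEY"], base_url=os.environ["API_URL"]), os.environ["MODEL"])` and passed as `qa_model`.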

In [6]:
from ark_embedding import ArkEmbeddings


class MyEmbeddingModel(BaseEmbeddingModel):
    def __init__(self, model="text-embedding-ada-002"):
        self.client = ArkEmbeddings(
            model=os.getenv("ALIYUN_EMBEDDING_MODEL"),
            api_key=os.getenv("ALIYUN_API_KEY"),
            api_url=os.getenv("ALIYUN_API_URL"),
            batch_size=10
        )

    # @retry(wait=wait_random_exponential(min=1, max=20), stop=stop_after_attempt(6))
    def create_embedding(self, text):
        text = text.replace("\n", " ")
        embd = self.client.embed_documents([text])
        return embd[0]

In [7]:
RAC = RetrievalAugmentationConfig(
    summarization_model=MySummarizationModel(), 
    qa_model=MyQAModel(), 
    embedding_model=MyEmbeddingModel()
)

In [8]:
# Load the sample text
with open('data/sample.txt', 'r') as file:
    text = file.read()

print(text[:100])

The wife of a rich man fell sick, and as she felt that her end
was drawing near, she called her only


In [9]:
# Build the retrieval tree
RA = RetrievalAugmentation(config=RAC)

RA.add_documents(text)

2025-11-21 21:09:29,834 - Successfully initialized TreeBuilder with Config 
        TreeBuilderConfig:
            Tokenizer: <Encoding 'cl100k_base'>
            Max Tokens: 100
            Num Layers: 5
            Threshold: 0.5
            Top K: 5
            Selection Mode: top_k
            Summarization Length: 100
            Summarization Model: <__main__.MySummarizationModel object at 0x0000023FF6AFD820>
            Embedding Models: {'EMB': <__main__.MyEmbeddingModel object at 0x0000023FF6AFE000>}
            Cluster Embedding Model: EMB
        
        Reduction Dimension: 10
        Clustering Algorithm: RAPTOR_Clustering
        Clustering Parameters: {}
        
2025-11-21 21:09:29,835 - Successfully initialized ClusterTreeBuilder with Config 
        TreeBuilderConfig:
            Tokenizer: <Encoding 'cl100k_base'>
            Max Tokens: 100
            Num Layers: 5
            Threshold: 0.5
            Top K: 5
            Selection Mode: top_k
            Summar

In [12]:
# Query the tree
question = "How did Cinderella reach her happy ending?"

answer = RA.answer_question(question=question)

print("Answer: ", answer)

2025-11-21 21:10:14,259 - Using collapsed_tree
2025-11-21 21:10:14,358 - HTTP Request: POST https://dashscope.aliyuncs.com/compatible-mode/v1/embeddings "HTTP/1.1 200 OK"
2025-11-21 21:10:14,473 - HTTP Request: POST https://ark.cn-beijing.volces.com/api/v3/completions "HTTP/1.1 404 Not Found"


Error code: 404
Answer:  
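
The log above reports `Using collapsed_tree`. As described in the RAPTOR paper, collapsed-tree retrieval flattens the nodes of every layer into one pool, ranks them by cosine similarity to the query, and adds nodes until a token budget is reached. The sketch below illustrates that procedure under simplifying assumptions: `collapsed_tree_retrieve` is a hypothetical function, the 2-d embeddings are made up, and token counts are approximated by whitespace splitting rather than a real tokenizer.

```python
import numpy as np


def collapsed_tree_retrieve(query_vec, nodes, top_k=5, max_tokens=50):
    """Pick nodes from a flattened tree.

    nodes: list of (text, embedding) pairs drawn from every layer of the tree.
    """
    embeddings = np.array([emb for _, emb in nodes], dtype=float)
    q = np.asarray(query_vec, dtype=float)
    # Cosine similarity between the query and every node, regardless of layer.
    sims = embeddings @ q / (np.linalg.norm(embeddings, axis=1) * np.linalg.norm(q))

    selected, used = [], 0
    for idx in np.argsort(-sims)[:top_k]:
        text = nodes[idx][0]
        cost = len(text.split())  # whitespace token count, a stand-in for a real tokenizer
        if used + cost > max_tokens:
            break
        selected.append(text)
        used += cost
    return selected


# Leaves and a summary node share one pool.
nodes = [
    ("the prince found her with the slipper", [1.0, 0.0]),
    ("her stepsisters mistreated her", [0.0, 1.0]),
    ("Cinderella marries the prince", [0.9, 0.1]),
]
print(collapsed_tree_retrieve([1.0, 0.0], nodes, top_k=2))
```

Because leaves and summaries compete in the same ranking, a single query can pull in both fine-grained details and higher-level summaries.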


In [13]:
# Save the tree
SAVE_PATH = "data/cinderella"
RA.save(SAVE_PATH)

2025-11-21 21:13:22,049 - Tree successfully saved to data/cinderella


In [15]:
# Restore the tree from disk and query it again
RA = RetrievalAugmentation(config=RAC, tree=SAVE_PATH)

answer = RA.answer_question(question=question)
print("Answer: ", answer)

2025-11-21 21:13:51,952 - Successfully initialized TreeBuilder with Config 
        TreeBuilderConfig:
            Tokenizer: <Encoding 'cl100k_base'>
            Max Tokens: 100
            Num Layers: 5
            Threshold: 0.5
            Top K: 5
            Selection Mode: top_k
            Summarization Length: 100
            Summarization Model: <__main__.MySummarizationModel object at 0x0000023FF6AFD820>
            Embedding Models: {'EMB': <__main__.MyEmbeddingModel object at 0x0000023FF6AFE000>}
            Cluster Embedding Model: EMB
        
        Reduction Dimension: 10
        Clustering Algorithm: RAPTOR_Clustering
        Clustering Parameters: {}
        
2025-11-21 21:13:51,953 - Successfully initialized ClusterTreeBuilder with Config 
        TreeBuilderConfig:
            Tokenizer: <Encoding 'cl100k_base'>
            Max Tokens: 100
            Num Layers: 5
            Threshold: 0.5
            Top K: 5
            Selection Mode: top_k
            Summar

Error code: 404
Answer:  


# Part 14: ColBERT
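Introduction: https://hackernoon.com/how-colbert-helps-developers-overcome-the-limits-of-rag  
Paper: https://arxiv.org/abs/2004.12832?ref=hackernoon.com  
Core principles:
- Text is tokenized and encoded with BERT, a bidirectional transformer encoder, which yields one context-aware vector per token.
- Documents and queries are processed in the same way.
- The total score of a document is computed per query token: for each query token vector, take its maximum similarity over all document token vectors, then sum these maxima.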
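
The scoring rule in the last bullet can be written directly with NumPy. This is a minimal sketch using made-up 2-d token vectors and cosine similarity; `maxsim_score` is an illustrative name, and a real system would take the token embeddings from the ColBERT encoder.

```python
import numpy as np


def maxsim_score(query_vecs, doc_vecs):
    """Sum over query tokens of the maximum cosine similarity to any document token."""
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sim = q @ d.T  # shape: (num_query_tokens, num_doc_tokens)
    return float(sim.max(axis=1).sum())


# Toy token embeddings: one row per token.
query = np.array([[1.0, 0.0], [0.0, 1.0]])
doc_a = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # has a match for both query tokens
doc_b = np.array([[1.0, 0.0], [1.0, 0.0]])              # matches only the first query token

print(maxsim_score(query, doc_a))  # 2.0
print(maxsim_score(query, doc_b))  # 1.0
```

Because every query token contributes its own best match, a document scores highly only when it covers all parts of the query, which is what separates this late-interaction scheme from comparing a single vector per text.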

In [None]:
# Load a pretrained ColBERT model
from ragatouille import RAGPretrainedModel

RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")

In [None]:
# Download a Wikipedia page
import requests

def get_wikipedia_page(title: str):
    """
    Retrieve the full text content of a Wikipedia page.

    :param title: str - Title of the Wikipedia page.
    :return: str - Full text content of the page as raw string.
    """
    # Wikipedia API endpoint
    URL = "https://en.wikipedia.org/w/api.php"

    # Parameters for the API request
    params = {
        "action": "query",
        "format": "json",
        "titles": title,
        "prop": "extracts",
        "explaintext": True,
    }

    # Custom User-Agent header to comply with Wikipedia's best practices
    headers = {"User-Agent": "RAGatouille_tutorial/0.0.1 (ben@clavie.eu)"}

    response = requests.get(URL, params=params, headers=headers)
    data = response.json()

    # Extracting page content
    page = next(iter(data["query"]["pages"].values()))
    return page["extract"] if "extract" in page else None

full_document = get_wikipedia_page("Hayao_Miyazaki")

In [None]:
# Build the index
RAG.index(
    collection=[full_document],
    index_name="Miyazaki-123",
    max_document_length=180,
    split_documents=True,
)

In [None]:
# Search
results = RAG.search(query="What animation studio did Miyazaki found?", k=3)
results

In [None]:
# Convert to a LangChain retriever
retriever = RAG.as_langchain_retriever(k=3)
retriever.invoke("What animation studio did Miyazaki found?")