# Multi-Representation Indexing Demo

Multi-representation indexing stores more than one representation of each document. Instead of embedding the full document, an LLM generates a summary of it, the summary is embedded for semantic retrieval, and a shared ID links the summary back to the original. Queries are matched against the short, focused summaries, while the retriever returns the complete document, so the downstream model still sees the full context.
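The idea can be sketched without any external services. The example below is a minimal illustration, not the notebook's implementation: `embed` is a toy bag-of-words stand-in for a real embedding model, the summaries are hand-written rather than LLM-generated, and the full documents are placeholders. All names are hypothetical.

```python
import re
import uuid
from collections import Counter
from math import sqrt


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Placeholder full documents and hand-written summaries (assumed, not LLM output).
full_docs = [
    "<full text of a long post about LLM agents>",
    "<full text of a long post about data quality>",
]
summaries = [
    "Autonomous agents built on LLMs: planning, memory, and tool use.",
    "High-quality human data for training models: raters and annotation.",
]

# Only the summaries are embedded; each entry carries the ID of its full document.
doc_ids = [str(uuid.uuid4()) for _ in full_docs]
summary_index = [(embed(s), doc_id) for s, doc_id in zip(summaries, doc_ids)]
docstore = dict(zip(doc_ids, full_docs))


def retrieve(query: str) -> str:
    """Match the query against the summaries, then return the linked full document."""
    q = embed(query)
    _, best_id = max(summary_index, key=lambda pair: cosine(q, pair[0]))
    return docstore[best_id]


result = retrieve("memory in agents")
```

The search only ever touches the summaries; the full text is reached through the shared ID, which is the same role `doc_id` plays in the cells that follow.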

## What this notebook contains
- Loading and processing multiple documents from web sources.
- Generating LLM-based summaries of documents using batch processing.
- Setting up a `MultiVectorRetriever` that links summaries to full documents.
- Storing summaries in a vector database (Chroma) for efficient semantic search.
- Storing original full documents in a byte store for retrieval.
- Querying the system to retrieve semantically similar summaries and return the corresponding full documents.

In [1]:
# Standard library
import os
import uuid

# Third-party: configuration, loaders, vector store, prompts, models, and retriever
import yaml
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.storage import InMemoryByteStore
from langchain_community.document_loaders import WebBaseLoader
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI, OpenAIEmbeddings




In [2]:
# Get the current working directory
cwd = os.getcwd()

# Build the path to config.yaml
config_path = os.path.join(cwd, '..', 'configs', 'config.yaml')

# Normalize the path
config_path = os.path.abspath(config_path)

# Load credentials from the config file
with open(config_path, 'r') as file:
    config = yaml.safe_load(file)

# Set environment variables
os.environ['LANGCHAIN_API_KEY'] = config['API']['LANGCHAIN']
os.environ['OPENAI_API_KEY'] = config['API']['OPENAI']

# Configure the chat LLM; temperature=0 keeps outputs as repeatable as possible
llm = ChatOpenAI(temperature=0)

In [3]:
# Fetch and parse the first source page
loader = WebBaseLoader("https://lilianweng.github.io/posts/2023-06-23-agent/")
docs = loader.load()

# Fetch the second page and append it to the same list
loader = WebBaseLoader("https://lilianweng.github.io/posts/2024-02-05-human-data-quality/")
docs.extend(loader.load())

In [4]:
# Define a chain to summarize documents
chain = (
    {"doc": lambda x: x.page_content}
    | ChatPromptTemplate.from_template("Summarize the following document:\n\n{doc}")
    | llm
    | StrOutputParser()
)

# Summarize all documents in one batch call, with at most 5 requests running concurrently
summaries = chain.batch(docs, {"max_concurrency": 5})
summaries

['The document discusses the concept of building autonomous agents powered by Large Language Models (LLMs). It covers key components of such agents, including planning, memory, and tool use. Various proof-of-concept examples, such as AutoGPT and GPT-Engineer, are provided to demonstrate the potential of LLM-powered agents. Challenges related to finite context length, planning, and reliability of natural language interfaces are also highlighted. The document includes references to relevant research papers and provides a comprehensive overview of the topic.',
 'The document discusses the importance of high-quality human data for training deep learning models. It covers various methods and techniques to ensure data quality, including human raters, the wisdom of the crowd, rater agreement, and disagreement, as well as two paradigms for data annotation. It also explores how data quality impacts model training, with a focus on influence functions, prediction changes during training, and nois
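Conceptually, the batch call behaves like mapping one function over the inputs with a bounded number of workers, and the results come back in input order. That ordering matters later, because each summary is paired with its document by position. The sketch below illustrates the pattern with the standard library only; `summarize` is a hypothetical stand-in for a single LLM call and `batch_summarize` is not a LangChain function.

```python
from concurrent.futures import ThreadPoolExecutor


def summarize(text: str) -> str:
    """Hypothetical stand-in for one LLM call: keep only the first sentence."""
    return text.split(".")[0].strip() + "."


def batch_summarize(texts: list[str], max_concurrency: int = 5) -> list[str]:
    """Run summarize over texts with a bounded thread pool, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        # pool.map yields results in the order of the inputs, not completion order.
        return list(pool.map(summarize, texts))


texts = [
    "Agents plan and use tools. They also have memory.",
    "Data quality matters. Raters can disagree.",
]
results = batch_summarize(texts)
```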

In [5]:
# Vector store that indexes the summary embeddings
vectorstore = Chroma(collection_name="summaries", embedding_function=OpenAIEmbeddings())

# Create a storage layer for the parent documents
store = InMemoryByteStore()
id_key = "doc_id"

# Create the multi-vector retriever to link summaries to full documents
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    byte_store=store,
    id_key=id_key,
)

# Generate unique IDs for each document
doc_ids = [str(uuid.uuid4()) for _ in docs]

# Link summaries to their corresponding document IDs
summary_docs = [
    Document(page_content=s, metadata={id_key: doc_ids[i]})
    for i, s in enumerate(summaries)
]

# Add the summaries to the vector store
retriever.vectorstore.add_documents(summary_docs)

# Store the full documents in the byte store
retriever.docstore.mset(list(zip(doc_ids, docs)))



In [6]:
# Example query
query = "Memory in agents"

# Search the summaries directly to see which one matches best
sub_docs = vectorstore.similarity_search(query, k=1)
sub_docs[0]

Document(metadata={'doc_id': 'e03f70f3-7938-4a98-9d33-912dadaf5f07'}, page_content='The document discusses the concept of building autonomous agents powered by Large Language Models (LLMs). It covers key components of such agents, including planning, memory, and tool use. Various proof-of-concept examples, such as AutoGPT and GPT-Engineer, are provided to demonstrate the potential of LLM-powered agents. Challenges related to finite context length, planning, and reliability of natural language interfaces are also highlighted. The document includes references to relevant research papers and provides a comprehensive overview of the topic.')

In [7]:
# Limit the summary search to the top match, then retrieve the linked full document
retriever.search_kwargs = {"k": 1}
retrieved_docs = retriever.invoke(query)
retrieved_docs[0].page_content[0:500]



"\n\n\n\n\n\nLLM Powered Autonomous Agents | Lil'Log\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nLil'Log\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n|\n\n\n\n\n\n\nPosts\n\n\n\n\nArchive\n\n\n\n\nSearch\n\n\n\n\nTags\n\n\n\n\nFAQ\n\n\n\n\n\n\n\n\n\n      LLM Powered Autonomous Agents\n    \nDate: June 23, 2023  |  Estimated Reading Time: 31 min  |  Author: Lilian Weng\n\n\n \n\n\nTable of Contents\n\n\n\nAgent System Overview\n\nComponent One: Planning\n\nTask Decomposition\n\nSelf-Reflection\n\n\nComponent Two: Memory\n\nTypes of Memory\n\nMaximum Inner Product Search (MIPS)\n\n\nComponent Three:"
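The step from a matching summary to the full document can be pictured as an ID lookup. The following is a simplified sketch, assuming the retriever collects the `doc_id` of each summary hit, drops duplicates while keeping rank order, and fetches the matching entries from the document store. `resolve_parents` and the sample data are hypothetical and use plain dictionaries rather than LangChain objects.

```python
def resolve_parents(hits, docstore, id_key="doc_id"):
    """Map ranked summary hits to full documents via their shared ID."""
    ids = []
    for metadata in hits:
        doc_id = metadata.get(id_key)
        # Keep the first occurrence of each ID so rank order is preserved.
        if doc_id is not None and doc_id not in ids:
            ids.append(doc_id)
    # Skip IDs that have no stored parent document.
    return [docstore[i] for i in ids if i in docstore]


# Hypothetical data: four ranked hits, with one duplicate parent and one dangling ID.
docstore = {"a": "full document A", "b": "full document B"}
hits = [{"doc_id": "a"}, {"doc_id": "b"}, {"doc_id": "a"}, {"doc_id": "missing"}]
parents = resolve_parents(hits, docstore)
```

Because several summaries or chunks may point at the same parent, deduplicating before the lookup avoids returning the same full document more than once.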