## LangChain - Indexing

### Multi-representation

In [1]:
from langchain_community.document_loaders import WebBaseLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

loader = WebBaseLoader("https://lilianweng.github.io/posts/2023-06-23-agent/")
docs = loader.load()

loader = WebBaseLoader("https://lilianweng.github.io/posts/2024-02-05-human-data-quality/")
docs.extend(loader.load())

USER_AGENT environment variable not set, consider setting it to identify your requests.


In [2]:
import uuid

from langchain_core.documents import Document
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import AzureChatOpenAI

chain = (
    {"doc": lambda x: x.page_content}
    | ChatPromptTemplate.from_template("Summarize the following document:\n\n{doc}")
    | AzureChatOpenAI(model="gpt-4o-mini", api_version="2024-12-01-preview", max_retries=0)
    | StrOutputParser()
)

summaries = chain.batch(docs, {"max_concurrency": 5})

In [3]:
summaries

['The document titled "LLM Powered Autonomous Agents" by Lilian Weng discusses the design and capabilities of autonomous agents powered by large language models (LLMs). It outlines the key components necessary for these agents, including:\n\n1. **Planning**: \n   - Task decomposition, breaking complex tasks into manageable subgoals, and using techniques like Chain of Thoughts (CoT) for better reasoning.\n   - Self-reflection to iteratively improve decision-making based on previous actions.\n\n2. **Memory**:\n   - Incorporating types of memory analogous to human cognition.\n   - Use of short-term memory (in-context learning) and long-term memory via external vector storage, with Maximum Inner Product Search (MIPS) techniques to aid in efficient information retrieval.\n\n3. **Tool Use**:\n   - Enabling LLMs to utilize external tools for tasks beyond their pretrained capabilities, enhancing their functionality and enabling complex operations.\n   - Examples of projects like MRKL and Huggi

In [None]:
from langchain.storage import InMemoryByteStore
from langchain_openai import AzureOpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.retrievers.multi_vector import MultiVectorRetriever

# The vector store used to index the child chunks (here, the summaries)
vector_store = Chroma(
    collection_name="summaries",
    embedding_function=AzureOpenAIEmbeddings(model="text-embedding-3-large")
)

# The storage layer for the parent documents
store = InMemoryByteStore()
id_key = "doc_id"

# The retriever
retriever = MultiVectorRetriever(
    vectorstore=vector_store,
    byte_store=store,
    id_key=id_key,  # metadata key linking each summary to its full parent document
)
doc_ids = [str(uuid.uuid4()) for _ in docs]

# Docs linked to summaries
summary_docs = [
    Document(page_content=s, metadata={id_key: doc_ids[i]})
    for i, s in enumerate(summaries)
]

# Add
retriever.vectorstore.add_documents(summary_docs)
retriever.docstore.mset(list(zip(doc_ids, docs)))
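
The cell above indexes the summaries but stores the full documents separately, linked by `doc_id`. As a rough illustration of that idea only, here is a minimal, dependency-free sketch. It is not how `MultiVectorRetriever` is implemented: the bag-of-words `embed` function, the `toy_*` data, and the `retrieve` helper are all hypothetical stand-ins for a real embedding model, vector store, and byte store.

```python
import math
import uuid
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' (assumption: stands in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[term] * b[term] for term in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical parent documents and short summaries of them.
toy_parents = [
    "<long text about agents, planning, memory and tool use>",
    "<long text about human data quality>",
]
toy_summaries = [
    "agents planning memory tool use",
    "human data quality annotation",
]

# One shared id per (summary, parent) pair, playing the role of `doc_id`.
toy_ids = [str(uuid.uuid4()) for _ in toy_parents]

# "Vector store": only the summaries are embedded and searched.
summary_index = [(embed(s), doc_id) for s, doc_id in zip(toy_summaries, toy_ids)]

# "Doc store": the full parents, looked up by id after the search.
parent_store = dict(zip(toy_ids, toy_parents))


def retrieve(query: str, k: int = 1) -> list[str]:
    """Rank summaries by similarity to the query, then return the linked parents."""
    q = embed(query)
    ranked = sorted(summary_index, key=lambda item: cosine(q, item[0]), reverse=True)
    return [parent_store[doc_id] for _, doc_id in ranked[:k]]


toy_result = retrieve("memory in agents")
```

The search runs over the compact summaries, while the caller receives the full parent text, which is the same split the notebook shows when it compares `vector_store.similarity_search` with the retriever.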



In [5]:
query = "Memory in agents"

sub_docs = vector_store.similarity_search(query=query, k=1)
sub_docs[0]

Document(metadata={'doc_id': '782b5b8d-3dc1-41c1-8ee1-bbec92dbc70b'}, page_content='The document titled "LLM Powered Autonomous Agents" by Lilian Weng discusses the design and capabilities of autonomous agents powered by large language models (LLMs). It outlines the key components necessary for these agents, including:\n\n1. **Planning**: \n   - Task decomposition, breaking complex tasks into manageable subgoals, and using techniques like Chain of Thoughts (CoT) for better reasoning.\n   - Self-reflection to iteratively improve decision-making based on previous actions.\n\n2. **Memory**:\n   - Incorporating types of memory analogous to human cognition.\n   - Use of short-term memory (in-context learning) and long-term memory via external vector storage, with Maximum Inner Product Search (MIPS) techniques to aid in efficient information retrieval.\n\n3. **Tool Use**:\n   - Enabling LLMs to utilize external tools for tasks beyond their pretrained capabilities, enhancing their functionali

In [7]:
# `get_relevant_documents` is deprecated in favor of `invoke`.
# The number of results is governed by the retriever's `search_kwargs`, not by an `n_results` argument.
retrieved_docs = retriever.invoke(query)
retrieved_docs[0].page_content[:500]


"\n\n\n\n\n\nLLM Powered Autonomous Agents | Lil'Log\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nLil'Log\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n|\n\n\n\n\n\n\nPosts\n\n\n\n\nArchive\n\n\n\n\nSearch\n\n\n\n\nTags\n\n\n\n\nFAQ\n\n\n\n\n\n\n\n\n\n      LLM Powered Autonomous Agents\n    \nDate: June 23, 2023  |  Estimated Reading Time: 31 min  |  Author: Lilian Weng\n\n\n \n\n\nTable of Contents\n\n\n\nAgent System Overview\n\nComponent One: Planning\n\nTask Decomposition\n\nSelf-Reflection\n\n\nComponent Two: Memory\n\nTypes of Memory\n\nMaximum Inner Product Search (MIPS)\n\n\nComponent Three:"