# RAG (Retrieval-Augmented Generation) Agent with LangChain

## Preview
In this guide we’ll build an app that answers questions about a website’s content. The website we use is the LLM Powered Autonomous Agents blog post by Lilian Weng, so the finished app lets us ask questions about the contents of that post.

In [42]:
# %pip install -U langchain langchain-openai langchain-ollama langchain-text-splitters langchain-community bs4

## Components

### Chat model

In [43]:
from langchain_openai import ChatOpenAI

model = ChatOpenAI(
    # Other locally served models to swap in:
    # model="functiongemma:270m",
    # model="qwen3:0.6b-fp16",
    # model="qwen3:1.7b",
    # model="qwen3:1.7b-q8_0",
    # model="qwen3:1.7b-fp16",
    # model="phi4-mini",
    model="qwen3:4b",
    # Ollama's OpenAI-compatible endpoint; the API key is only a placeholder.
    base_url="http://localhost:11434/v1",
    api_key="ollama",
)


### Embedding model

In [44]:
from langchain_ollama import OllamaEmbeddings

embeddings = OllamaEmbeddings(
    model="qwen3-embedding:0.6b"
)

### Vector store

In [45]:
from langchain_core.vectorstores import InMemoryVectorStore

vector_store = InMemoryVectorStore(embeddings)
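
To make the role of the vector store concrete, here is a minimal pure-Python sketch of similarity search over stored vectors. It assumes cosine similarity as the ranking metric; the names `cosine_similarity`, `top_k`, and `toy_store` and the three-dimensional toy vectors are hypothetical and stand in for real embeddings. It is not the library's implementation.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], stored: list[tuple[str, list[float]]], k: int = 2) -> list[str]:
    """Return the k stored texts whose vectors are most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in stored]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Hypothetical toy "embeddings"; real ones have hundreds of dimensions.
toy_store = [
    ("planning", [1.0, 0.0, 0.0]),
    ("memory", [0.0, 1.0, 0.0]),
    ("tool use", [0.7, 0.7, 0.0]),
]

print(top_k([0.9, 0.1, 0.0], toy_store, k=2))  # ['planning', 'tool use']
```

The query vector points mostly along the first axis, so "planning" ranks first and "tool use" second.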

## 1. Indexing

> load -> split -> store

### Loading documents

In [46]:
import bs4  # BeautifulSoup: parses the page's HTML into text
from langchain_community.document_loaders import WebBaseLoader

# Only keep post title, headers, and content from the full HTML
bs4_strainer = bs4.SoupStrainer(class_=("post-title", "post-header", "post-content"))

loader = WebBaseLoader(
    web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",),
    bs_kwargs={"parse_only": bs4_strainer},
)

docs = loader.load()

assert len(docs) == 1
print(f"Total characters: {len(docs[0].page_content)}")

Total characters: 43047


In [47]:
print(docs[0].page_content[:500])



      LLM Powered Autonomous Agents
    
Date: June 23, 2023  |  Estimated Reading Time: 31 min  |  Author: Lilian Weng


Building agents with LLM (large language model) as its core controller is a cool concept. Several proof-of-concepts demos, such as AutoGPT, GPT-Engineer and BabyAGI, serve as inspiring examples. The potentiality of LLM extends beyond generating well-written copies, stories, essays and programs; it can be framed as a powerful general problem solver.
Agent System Overview#
In


### Splitting documents

In [48]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,  # maximum characters per chunk
    chunk_overlap=200,  # characters shared between consecutive chunks
    add_start_index=True,  # track each chunk's offset in the original document
)

all_splits = text_splitter.split_documents(docs)

print(f"Split blog post into {len(all_splits)} sub-documents.")

Split blog post into 63 sub-documents.
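
To illustrate what `chunk_size`, `chunk_overlap`, and the start index mean, here is a simplified sliding-window splitter in plain Python. This is only a sketch under stated assumptions: `split_with_overlap` is a hypothetical helper, and it cuts at fixed character offsets, whereas `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraphs and newlines, so its real chunks will differ.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[dict]:
    """Cut text into fixed-size windows where neighbours share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append({"start_index": start, "text": text[start:start + chunk_size]})
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
for chunk in split_with_overlap(sample, chunk_size=10, chunk_overlap=3):
    print(chunk)
# {'start_index': 0, 'text': 'abcdefghij'}
# {'start_index': 7, 'text': 'hijklmnopq'}
# {'start_index': 14, 'text': 'opqrstuvwx'}
# {'start_index': 21, 'text': 'vwxyz'}
```

Each chunk begins with the last three characters of the previous one, which is the overlap that helps keep a sentence cut at a boundary retrievable from either side.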


### Storing documents

In [49]:
document_ids = vector_store.add_documents(documents=all_splits)
print(document_ids[:3])

['4fc7d8ae-8bf7-41ad-a439-279e52d3ed79', '18676a05-8766-4dcc-a1db-da59b5bed4c3', '152f2120-486c-4cb6-b9eb-f52315720bdb']


## 2. Retrieval and generation

### RAG agent

In [50]:
# retrieved_docs = vector_store.similarity_search("llm" , k=2)
# serialized = "\n\n".join(
#         (f"Source: {doc.metadata}\nContent: {doc.page_content}")
#         for doc in retrieved_docs
#     )

# retrieved_docs , serialized

In [51]:
from langchain.tools import tool

@tool(response_format="content_and_artifact")
def retrieve_context(query: str):
    """Retrieve information to help answer a query."""
    retrieved_docs = vector_store.similarity_search(query, k=2)
    serialized = "\n\n".join(
        (f"Source: {doc.metadata}\nContent: {doc.page_content}")
        for doc in retrieved_docs
    )
    return serialized, retrieved_docs
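
The tool returns two values: a string that the model reads and the raw documents kept as an artifact. The sketch below reproduces only the serialization step so the string format is easy to inspect. `FakeDocument` is a hypothetical stand-in for the real `Document` class, and the placeholder contents and the second offset are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDocument:
    """Hypothetical stand-in exposing the two attributes the tool reads."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def serialize(docs: list[FakeDocument]) -> str:
    """Join documents into the 'Source / Content' text block passed to the model."""
    return "\n\n".join(
        f"Source: {doc.metadata}\nContent: {doc.page_content}" for doc in docs
    )


# Placeholder documents, not real retrieval results.
docs = [
    FakeDocument("first chunk text", {"start_index": 1638}),
    FakeDocument("second chunk text", {"start_index": 2500}),
]

content, artifact = serialize(docs), docs
print(content)
```

Keeping the documents as a separate artifact means downstream code can still read their metadata without parsing the text the model saw.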

In [52]:
from langchain.agents import create_agent


tools = [retrieve_context]
# If desired, specify custom instructions
prompt = (
    "You have access to a tool that retrieves context from a blog post. "
    "Use the tool to help answer user queries."
)
agent = create_agent(model, tools, system_prompt=prompt)

In [53]:
query = (
    "What is the standard method for Task Decomposition?\n\n"
    "Once you get the answer, look up common extensions of that method."
)

for event in agent.stream(
    {"messages": [{"role": "user", "content": query}]},
    stream_mode="values",
):
    event["messages"][-1].pretty_print()


What is the standard method for Task Decomposition?

Once you get the answer, look up common extensions of that method.
Tool Calls:
  retrieve_context (call_riiza4us)
 Call ID: call_riiza4us
  Args:
    query: What is the standard method for Task Decomposition?
Name: retrieve_context

Source: {'source': 'https://lilianweng.github.io/posts/2023-06-23-agent/', 'start_index': 1638}
Content: Component One: Planning#
A complicated task usually involves many steps. An agent needs to know what they are and plan ahead.
Task Decomposition#
Chain of thought (CoT; Wei et al. 2022) has become a standard prompting technique for enhancing model performance on complex tasks. The model is instructed to “think step by step” to utilize more test-time computation to decompose hard tasks into smaller and simpler steps. CoT transforms big tasks into multiple manageable tasks and shed lights into an interpretation of the model’s thinking process.
Tree of Thoughts (Yao et al. 2023) extends CoT by exploring 

### RAG Chain

In [55]:
from langchain.agents.middleware import dynamic_prompt, ModelRequest

@dynamic_prompt
def prompt_with_context(request: ModelRequest) -> str:
    """Build a system prompt that includes context retrieved for the latest message."""
    last_query = request.state["messages"][-1].text
    retrieved_docs = vector_store.similarity_search(last_query)

    docs_content = "\n\n".join(doc.page_content for doc in retrieved_docs)

    system_message = (
        "You are a helpful assistant. Use the following context in your response:"
        f"\n\n{docs_content}"
    )

    return system_message

agent = create_agent(model, tools=[], middleware=[prompt_with_context])

In [56]:
query = "What is task decomposition?"

for step in agent.stream(
    {"messages": [{"role": "user", "content": query}]}, stream_mode="values"
):
    step["messages"][-1].pretty_print()


What is task decomposition?

Based on the context provided, task decomposition refers to the process of breaking down complex tasks into smaller, more manageable steps. This concept is particularly important in AI workflows where complicated tasks require structured planning.

The context specifically describes two approaches within this framework:
1. **Chain of Thought (CoT)** - A technique where models are instructed to "think step by step" to decompose hard tasks into smaller, simpler steps. This transforms big tasks into multiple manageable components and provides visibility into the model's reasoning process.

2. **Tree of Thoughts (ToT)** - An extension of CoT that explores multiple reasoning paths at each step. It first decomposes problems into multiple thought steps, generating multiple possibilities per step to create a tree structure. The search can proceed via BFS (breadth-first search) or DFS (depth-first search), with states evaluated through classification or majority vo

---

### Returning source documents
The RAG chain above incorporates retrieved context into a single system message for that run. As in the agentic RAG formulation, we sometimes want to keep the raw source documents in the application state so that their metadata stays accessible. For the two-step chain, we can do this by:

1. Adding a key to the state to store the retrieved documents.
2. Adding a `before_model` middleware hook that populates that key (and also injects the context).


In [57]:
from typing import Any
from langchain_core.documents import Document
from langchain.agents.middleware import AgentMiddleware, AgentState


class State(AgentState):
    context: list[Document]


class RetrieveDocumentsMiddleware(AgentMiddleware[State]):
    state_schema = State

    def before_model(self, state: AgentState) -> dict[str, Any] | None:
        last_message = state["messages"][-1]
        retrieved_docs = vector_store.similarity_search(last_message.text)

        docs_content = "\n\n".join(doc.page_content for doc in retrieved_docs)

        augmented_message_content = (
            f"{last_message.text}\n\n"
            "Use the following context to answer the query:\n"
            f"{docs_content}"
        )
        return {
            "messages": [last_message.model_copy(update={"content": augmented_message_content})],
            "context": retrieved_docs,
        }


agent = create_agent(
    model,
    tools=[],
    middleware=[RetrieveDocumentsMiddleware()],
)

In [58]:
query = "What is task decomposition?"

for step in agent.stream(
    {"messages": [{"role": "user", "content": query}]}, stream_mode="values"
):
    step["messages"][-1].pretty_print()


What is task decomposition?

What is task decomposition?

Use the following context to answer the query:
The AI assistant can parse user input to several tasks: [{"task": task, "id", task_id, "dep": dependency_task_ids, "args": {"text": text, "image": URL, "audio": URL, "video": URL}}]. The "dep" field denotes the id of the previous task which generates a new resource that the current task relies on. A special tag "-task_id" refers to the generated text image, audio and video in the dependency task with id as task_id. The task MUST be selected from the following options: {{ Available Task List }}. There is a logical relationship between tasks, please note their order. If the user input can't be parsed, you need to reply empty JSON. Here are several cases for your reference: {{ Demonstrations }}. The chat history is recorded as {{ Chat History }}. From this chat history, you can find the path of the user-mentioned resources for your task planning.

Component One: Planning#
A complicate