# What is LangChain?

**LangChain** is a framework that helps you build AI applications using Large Language Models (LLMs) in a fast, stable and structured way.

*In short*: LangChain = A toolkit that helps you connect LLMs with data + logic + external tools to create a complete AI App.

# When to use LangChain

Large Language Models (LLMs), despite their impressive capabilities, fundamentally behave as **‚Äútext ‚Üí text‚Äù functions**. By themselves, they cannot:
- access external data,
- store long-term memory,
- execute multi-step workflows,
- call tools or APIs,
- maintain state, or
- carry out reliable, complex logic.

Engineering teams quickly discovered that pure LLM usage is enough for demos, but **insufficient for real products** such as enterprise chatbots, document-grounded Q&A systems, data-analysis assistants, or automated workflows.

This gap is precisely where **LangChain** emerged: a framework designed to extend LLMs into practical, production-grade systems.

### Connecting LLMs to External Data

LangChain acts as a middleware layer between the model and the real world. It allows LLMs to read:
- documents
- databases
- APIs
- vector stores
- computational tools

This capability is essential for *RAG (Retrieval-Augmented Generation)*, where applications must provide accurate, up-to-date, contextual information instead of relying on the model‚Äôs internal guesses.

### Multi-Step Workflows

LLMs do not inherently understand step-by-step procedures or stateful tasks. LangChain provides:
- Chains for simple multi-step logic
- LangGraph for complex workflows such as:
    - branching
    - looping
    - retrying
    - validation
    - deterministic state machines

These are the building blocks of AI pipelines, document processing systems, and multi-stage reasoning assistants.

### Agents and Tool-Use Capabilities

LLMs cannot decide on their own: when to use a tool, which tool to select, or how to integrate tool outputs.

LangChain‚Äôs Agents introduce this capability. It can help the model interact with multiple type of tool or even creating an working environment that can simulate user computer.

Whenever you want an AI assistant that can act, not just respond, agents become essential.

# Getting Started with LangChain

In this example, we will introduce LangChain, building a simple LLM-powered assistant. We'll provide examples for both OpenAI's `gpt-4o-mini` *and* Meta's `llama3.2` via Ollama!

### Initializing OpenAI's gpt-4o-mini

We start by initializing our LLM. We will use OpenAI's `gpt-4o-mini` model, if you need an API key you can get one from [OpenAI's website](https://platform.openai.com/settings/organization/api-keys).

In [1]:
import os
from getpass import getpass
from os import getenv


openai_model_auth = "openai/gpt-4o-mini"

# Free one from OpenRouter
openai_model = "openai/gpt-oss-120b:free"
os.environ["OPENROUTER_API_KEY"] = os.getenv("OPENROUTER_API_KEY") or getpass(
    "Enter OpenRouter API Key: "
)

In [2]:
from langchain_openai import ChatOpenAI

# For normal accurate responses
llm = ChatOpenAI(temperature = 0.0, model = openai_model, api_key = getenv("OPENROUTER_API_KEY"), base_url = "https://openrouter.ai/api/v1")

# For unique creative responses
creative_llm = ChatOpenAI(temperature = 0.9, model = openai_model, api_key = getenv("OPENROUTER_API_KEY"), base_url = "https://openrouter.ai/api/v1")

  from .autonotebook import tqdm as notebook_tqdm


We will be taking an `article` _draft_ and using LangChain to generate various useful items around this article. We'll be creating:

1. An article title
2. An article description
3. Editor advice where we will insert an additional paragraph in the article
4. A thumbnail / hero image for our article.

Here we input our article to start with. Currently this is using an article from the Aurelio AI learning page.

In [3]:
article = """
\
We believe AI's short‚Äîto mid-term future belongs to agents and that the long-term future of *AGI* may evolve from agentic systems. Our definition of agents covers any neuro-symbolic system in which we merge neural AI (such as an LLM) with semi-traditional software.

With agents, we allow LLMs to integrate with code ‚Äî allowing AI to search the web, perform math, and essentially integrate into anything we can build with code. It should be clear the scope of use cases is phenomenal where AI can integrate with the broader world of software.

In this introduction to AI agents, we will cover the essential concepts that make them what they are and why that will make them the core of real-world AI in the years to come.

---

## Neuro-Symbolic Systems

Neuro-symbolic systems consist of both neural and symbolic computation, where:

- Neural refers to LLMs, embedding models, or other neural network-based models.
- Symbolic refers to logic containing symbolic logic, such as code.

Both neural and symbolic AI originate from the early philosophical approaches to AI: connectionism (now neural) and symbolism. Symbolic AI is the more traditional AI. Diehard symbolists believed they could achieve true AGI via written rules, ontologies, and other logical functions.

The other camp were the connectionists. Connectionism emerged in 1943 with a theoretical neural circuit but truly kicked off with Rosenblatt's perceptron paper in 1958 [1][2]. Both of these approaches to AI are fascinating but deserve more time than we can give them here, so we will leave further exploration of these concepts for a future chapter.

Most important to us is understanding where symbolic logic outperforms neural-based compute and vice-versa.

| Neural | Symbolic |
| --- | --- |
| Flexible, learned logic that can cover a huge range of potential scenarios. | Mostly hand-written rules which can be very granular and fine-tuned but hard to scale. |
| Hard to interpret why a neural system does what it does. Very difficult or even impossible to predict behavior. | Rules are written and can be understood. When unsure why a particular ouput was produced we can look at the rules / logic to understand. |
| Requires huge amount of data and compute to train state-of-the-art neural models, making it hard to add new abilities or update with new information. | Code is relatively cheap to write, it can be updated with new features easily, and latest information can often be added often instantaneously. |
| When trained on broad datasets can often lack performance when exposed to unique scenarios that are not well represented in the training data. | Easily customized to unique scenarios. |
| Struggles with complex computations such as mathematical operations. | Perform complex computations very quickly and accurately. |

Pure neural architectures struggle with many seemingly simple tasks. For example, an LLM *cannot* provide an accurate answer if we ask it for today's date.

Retrieval Augmented Generation (RAG) is commonly used to provide LLMs with up-to-date knowledge on a particular subject or access to proprietary knowledge.

### Giving LLMs Superpowers

By 2020, it was becoming clear that neural AI systems could not perform tasks symbolic systems typically excelled in, such as arithmetic, accessing structured DB data, or making API calls. These tasks require discrete input parameters that allow us to process them reliably according to strict written logic.

In 2022, researchers at AI21 developed Jurassic-X, an LLM-based "neuro-symbolic architecture." Neuro-symbolic refers to merging the "neural computation" of large language models (LLMs) with more traditional (i.e. symbolic) computation of code.

Jurassic-X used the Modular Reasoning, Knowledge, and Language (MRKL) system [3]. The researchers developed MRKL to solve the limitations of LLMs, namely:

- Lack of up-to-date knowledge, whether that is the latest in AI or something as simple as today's date.
- Lack of proprietary knowledge, such as internal company docs or your calendar bookings.
- Lack of reasoning, i.e. the inability to perform operations that traditional software is good at, like running complex mathematical operations.
- Lack of ability to generalize. Back in 2022, most LLMs had to be fine-tuned to perform well in a specific domain. This problem is still present today but far less prominent as the SotA models generalize much better and, in the case of MRKL, are able to use tools relatively well (although we could certainly take the MRKL solution to improve tool use performance even today).

MRKL represents one of the earliest forms of what we would now call an agent; it is an LLM (neural computation) paired with executable code (symbolic computation).

## ReAct and Tools

There is a misconception in the broader industry that an AI agent is an LLM contained within some looping logic that can generate inputs for and execute code functions. This definition of agents originates from the huge popularity of the ReAct agent framework and the adoption of a similar structure with function/tool calling by LLM providers such as OpenAI, Anthropic, and Ollama.

![ReAct agent flow with the Reasoning-Action loop [4]. When the action chosen specifies to use a normal tool, the tool is used and the observation returned for another iteration through the Reasoning-Action loop. To return a final answer to the user the LLM must choose action "answer" and provide the natural language response, finishing the loop.](/images/posts/ai-agents/ai-agents-00.png)

<small>ReAct agent flow with the Reasoning-Action loop [4]. When the action chosen specifies to use a normal tool, the tool is used and the observation returned for another iteration through the Reasoning-Action loop. To return a final answer to the user the LLM must choose action "answer" and provide the natural language response, finishing the loop.</small>

Our "neuro-symbolic" definition is much broader but certainly does include ReAct agents and LLMs paired with tools. This agent type is the most common for now, so it's worth understanding the basic concept behind it.

The **Re**ason **Act**ion (ReAct) method encourages LLMs to generate iterative *reasoning* and *action* steps. During *reasoning,* the LLM describes what steps are to be taken to answer the user's query. Then, the LLM generates an *action,* which we parse into an input to some executable code, which we typically describe as a tool/function call.

![ReAct method. Each iteration includes a Reasoning step followed by an Action (tool call) step. The Observation is the output from the previous tool call. During the final iteration the agent calls the answer tool, meaning we generate the final answer for the user.](/images/posts/ai-agents/ai-agents-01.png)

<small>ReAct method. Each iteration includes a Reasoning step followed by an Action (tool call) step. The Observation is the output from the previous tool call. During the final iteration the agent calls the answer tool, meaning we generate the final answer for the user.</small>

Following the reason and action steps, our action tool call returns an observation. The logic returns the observation to the LLM, which is then used to generate subsequent reasoning and action steps.

The ReAct loop continues until the LLM has enough information to answer the original input. Once the LLM reaches this state, it calls a special *answer* action with the generated answer for the user.

## Not only LLMs and Tool Calls

LLMs paired with tool calling are powerful but far from the only approach to building agents. Using the definition of neuro-symbolic, we cover architectures such as:

- Multi-agent workflows that involve multiple LLM-tool (or other agent structure) combinations.
- More deterministic workflows where we may have set neural model-tool paths that may fork or merge as the use case requires.
- Embedding models that can detect user intents and decide tool-use or LLM selection-based selection in vector space.

These are just a few high-level examples of alternative agent structures. Far from being designed for niche use cases, we find these alternative options to frequently perform better than the more common ReAct or Tool agents. We will cover all of these examples and more in future chapters.

---

Agents are fundamental to the future of AI, but that doesn't mean we should expect that future to come from agents in their most popular form today. ReAct and Tool agents are great and handle many simple use cases well, but the scope of agents is much broader, and we believe thinking beyond ReAct and Tools is key to building future AI.

---

You can sign up for the [Aurelio AI newsletter](https://b0fcw9ec53w.typeform.com/to/w2BDHVK7) to stay updated on future releases in our comprehensive course on agents.

---

## References

[1] The curious case of Connectionism (2019) [https://www.degruyter.com/document/doi/10.1515/opphil-2019-0018/html](https://www.degruyter.com/document/doi/10.1515/opphil-2019-0018/html)

[2] F. Rosenblatt, [The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain](https://www.ling.upenn.edu/courses/cogs501/Rosenblatt1958.pdf) (1958), Psychological Review

[3] E. Karpas et al. [MRKL Systems: A Modular, Neuro-Symbolic Architecture That Combines Large Language Models, External Knowledge Sources and Discrete Reasoning](https://arxiv.org/abs/2205.00445) (2022), AI21 Labs
"""

### Preparing our Prompts

LangChain comes with several prompt classes and methods for organizing or constructing our prompts. We will cover these in more detail in later examples, but for now we'll cover the essentials that we need here.

Prompts for chat agents are at a minimum broken up into three components, those are:

* System prompt: this provides the instructions to our LLM on how it must behave, what it's objective is, etc.

* User prompt: this is a user written input.

* AI prompt: this is the AI generated output. When representing a conversation, previous generations will be inserted back into the next prompt and become part of the broader _chat history_.

```
You are a helpful AI assistant, you will do XYZ.    | SYSTEM PROMPT

User: Hi, what is the capital of Australia?         | USER PROMPT
AI: It is Canberra                                  | AI PROMPT
User: When is the best time to visit?               | USER PROMPT
```

LangChain provides us with _templates_ for each of these prompt types. By using templates we can insert different inputs to the template, modifying the prompt based on the provided inputs.

Let's initialize our system and user prompt first:

In [4]:
from langchain.prompts import SystemMessagePromptTemplate, HumanMessagePromptTemplate, ChatPromptTemplate

# Defining the system prompt (how the AI should act)
system_prompt = SystemMessagePromptTemplate.from_template(
    "You are an AI assistant that helps generate article titles."
)

# the user prompt is provided by the user, in this case however the only dynamic
#¬†input is the article
user_prompt = HumanMessagePromptTemplate.from_template(
    """You are tasked with creating a name for a article.
The article is here for you to examine: 

---

{article}

---

The name should be based of the context of the article.
Be creative, but make sure the names are clear, catchy,
and relevant to the theme of the article.

Only output the article name, no other explanation or
text can be provided.""",
    input_variables = ["article"]
)

We can display what our formatted human prompt would look like after inserting a value into the `article` parameter:

In [5]:
user_prompt.format(article = "TEST STRING")

HumanMessage(content='You are tasked with creating a name for a article.\nThe article is here for you to examine: \n\n---\n\nTEST STRING\n\n---\n\nThe name should be based of the context of the article.\nBe creative, but make sure the names are clear, catchy,\nand relevant to the theme of the article.\n\nOnly output the article name, no other explanation or\ntext can be provided.', additional_kwargs={}, response_metadata={})

We have our system and user prompts, we can merge both into our full chat prompt using the `ChatPromptTemplate`:

In [6]:
first_prompt = ChatPromptTemplate.from_messages([system_prompt, user_prompt])

By default, the `ChatPromptTemplate` will read the `input_variables` from each of the prompt templates inserted and allow us to use those input variables when formatting the full chat prompt template:

In [7]:
print(first_prompt.format(article = "TEST STRING"))

System: You are an AI assistant that helps generate article titles.
Human: You are tasked with creating a name for a article.
The article is here for you to examine: 

---

TEST STRING

---

The name should be based of the context of the article.
Be creative, but make sure the names are clear, catchy,
and relevant to the theme of the article.

Only output the article name, no other explanation or
text can be provided.


`ChatPromptTemplate` also prefixes each individual message with it's role, ie `System:`, `Human:`, or `AI:`.

We can chain together our `first_prompt` template and the `llm` object we defined earlier to create a simple LLM chain. This chain will perform the steps **prompt formatting > llm generation > get output**.

We'll be using **L**ang**C**hain **E**xpression **L**anguage (LCEL) to construct our chain. This syntax can look a little strange but we will cover it in detail later in the course. For now, all we need to know is that we define our inputs with the first dictionary segment (ie `{"article": lambda x: x["article"]}`) and then we use the pipe operator (`|`) to say that the output from the left of the pipe will be fed into the input to the right of the pipe.

In [8]:
chain_one = (
    {"article": lambda x: x["article"]}
    | first_prompt
    | creative_llm
    | {"article_title": lambda x: x.content}
)

Our first chain creates the article title, note: we can run all of these individually...

In [9]:
article_title_msg = chain_one.invoke({"article": article})
article_title_msg

{'article_title': 'Neuro‚ÄëSymbolic Agents: Building the Next Generation of AI Intelligence'}

But we will actually chain this step with multiple other `LLMChain` steps. So, to continue, our next step is to summarize the article using both the `article` and newly generated `article_title` values, from which we will output a new `summary` variable:

In [10]:
second_user_prompt = HumanMessagePromptTemplate.from_template(
    """You are tasked with creating a description for
the article. The article is here for you to examine:

---

{article}

---

Here is the article title '{article_title}'.

Output the SEO friendly article description. Do not output
anything other than the description.""",
    input_variables = ["article", "article_title"]
)

second_prompt = ChatPromptTemplate.from_messages([
    system_prompt,
    second_user_prompt
])

In [11]:
chain_two = (
    {
        "article": lambda x: x["article"],
        "article_title": lambda x: x["article_title"]
    }
    | second_prompt
    | llm
    | {"summary": lambda x: x.content}
)

In [12]:
article_description_msg = chain_two.invoke({
    "article": article,
    "article_title": article_title_msg["article_title"]
})
article_description_msg

{'summary': 'Discover how neuro‚Äësymbolic agents combine powerful LLMs with traditional code to create versatile, up‚Äëto‚Äëdate AI systems. Learn the fundamentals of neuro‚Äësymbolic architecture, the ReAct reasoning‚Äëaction loop, tool‚Äëcalling techniques, and emerging multi‚Äëagent workflows that are shaping the future of real‚Äëworld AI. Perfect for developers, researchers, and AI enthusiasts eager to build the next generation of intelligent agents.'}

The third step will consume our first `article` variable and provide several output fields, focusing on helping the user improve a part of their writing. As we are outputting multiple fields we can specify for the LLM to use structured outputs, keeping the generated fields aligned with our requirements.

In [24]:
third_user_prompt = HumanMessagePromptTemplate.from_template(
    """You are tasked with creating a new paragraph for the
article. The article is here for you to examine:

---

{article}

---

Choose one paragraph to review and edit. During your edit
ensure you provide constructive feedback to the user so they
can learn where to improve their own writing.

Return your response as a JSON object with the following fields:
- original_paragraph: The original paragraph you selected
- edited_paragraph: Your improved version of the paragraph
- feedback: Constructive feedback explaining the changes""",
    input_variables = ["article"]
)

# prompt template 3: creating a new paragraph for the article
third_prompt = ChatPromptTemplate.from_messages([
    system_prompt,
    third_user_prompt
])

We create a pydantic object describing the output format we need. This format description is then passed to our model using the `with_structured_output` method:

In [25]:
from pydantic import BaseModel, Field

class Paragraph(BaseModel):
    original_paragraph: str = Field(description = "The original paragraph")
    edited_paragraph: str = Field(description = "The improved edited paragraph")
    feedback: str = Field(description = (
        "Constructive feedback on the original paragraph"
    ))

structured_llm = creative_llm.with_structured_output(Paragraph, method = "json_schema")

Now we put all of this together in another chain:

In [26]:
# chain 3: inputs: article / output: article_para
chain_three = (
    {"article": lambda x: x["article"]}
    | third_prompt
    | structured_llm
    | {
        "original_paragraph": lambda x: x.original_paragraph,
        "edited_paragraph": lambda x: x.edited_paragraph,
        "feedback": lambda x: x.feedback
    }
)

In [27]:
out = chain_three.invoke({"article": article})
out

{'original_paragraph': "Pure neural architectures struggle with many seemingly simple tasks. For example, an LLM *cannot* provide an accurate answer if we ask it for today's date.",
 'edited_paragraph': 'Pure neural architectures often falter on tasks that appear trivial to humans. For instance, a vanilla LLM will typically answer "I don\'t know" or generate an outdated date when asked for the current day, because it lacks direct access to real‚Äëtime information and relies solely on patterns learned during training.',
 'feedback': '### What was improved\n1. **Specificity** ‚Äì The revised sentence clarifies *why* the model fails (no real‚Äëtime access, reliance on training data) rather than just stating it "cannot". This gives readers a concrete understanding of the limitation.\n2. **Tone and Formality** ‚Äì Replacing the informal asterisk emphasis with a more academic phrasing keeps the article‚Äôs voice consistent.\n3. **Clarity** ‚Äì Adding "vanilla" and "real‚Äëtime information" d

### Generate Image

In [31]:
from langchain_community.utilities.dalle_image_generator import DallEAPIWrapper
from langchain_core.prompts import PromptTemplate

image_prompt = PromptTemplate(
    input_variables = ["article"],
    template = (
        "Generate a prompt with less then 500 characters to generate an image "
        "based on the following article: {article}"
    )
)

The `generate_and_display` function will generate the article image once we have the prompt from our image prompt.

In [32]:
from skimage import io
import matplotlib.pyplot as plt
from langchain_core.runnables import RunnableLambda

def generate_and_display_image(image_prompt):
    print(image_prompt)
    image_url = DallEAPIWrapper(model = "dall-e-3").run(image_prompt)
    image_data = io.imread(image_url)

    # And update the display code to:
    plt.imshow(image_data)
    plt.axis('off')
    plt.show()

# we wrap this in a RunnableLambda for use with LCEL
image_gen_runnable = RunnableLambda(generate_and_display_image)

We have all of our image generation components ready, we chain them together again with LCEL:

In [33]:
# chain 4: inputs: article, article_para / outputs: new_suggestion_article
chain_four = (
    {"article": lambda x: x["article"]}
    | image_prompt
    | llm
    | (lambda x: x.content)
    | image_gen_runnable
)

And now, we `invoke` our final chain:

In [None]:
chain_four.invoke({"article": article})

# Chat Memory

Memory Buffer is a mechanism that stores the conversation history between the user and the AI assistant. It acts as a short-term memory, allowing the language model to access previous messages in the conversation.

There is 4 main type of Memory:
- Buffer Memory: Storing all conversation in raw text
- Summary Memory: Storing an overall context of a conversation
- Window Memory: Storing N closest chat
- Entity Memory: Storing key memory

### Buffer Memory

`ConversationBufferMemory` is the simplest form of conversational memory, it is literally just a place that we store messages, and then use to feed messages into our LLM.

Let's start with LangChain's original `ConversationBufferMemory` object, we are setting `return_messages = True` to return the messages as a list of `ChatMessage` objects ‚Äî unless using a non-chat model we would always set this to `True` as without it the messages are passed as a direct string which can lead to unexpected behavior from chat LLMs.

In [34]:
from langchain.memory import ConversationBufferMemory

memory = ConversationBufferMemory(return_messages = True)

  memory = ConversationBufferMemory(return_messages = True)


There are several ways that we can add messages to our memory, using the `save_context` method we can add a user query (via the `input` key) and the AI's response (via the `output` key).

For example, we will create a conversation and save it directly

In [35]:
memory.save_context(
    {"input": "Hi, my name is Dat"},  # user message
    {"output": "Hey Dat, what's up? I'm an AI model."}  # AI response
)
memory.save_context(
    {"input": "I'm researching the different types of conversational memory."},  # user message
    {"output": "That's interesting, what type do you want to see?"}  # AI response
)
memory.save_context(
    {"input": "I've been looking at ConversationBufferMemory and ConversationBufferWindowMemory."},  # user message
    {"output": "That's interesting, ConversationBufferMemory is the simplest form of conversational memory in LangChain. It stores the entire conversation history as a buffer, allowing the LLM to access all previous messages for context. It is useful for chatbots and agents that need to remember the full conversation. Beside The Window version will limited to what model need to remember instead"}  # AI response
)
memory.save_context(
    {"input": "Buffer memory just stores the entire conversation, right?"},  # user message
    {"output": "Yes"}  # AI response
)
memory.save_context(
    {"input": "Buffer window memory stores the last k messages, dropping the rest."},  # user message
    {"output": "Also right!"}  # AI response
)

Before using the memory, we need to load in any variables for that memory type ‚Äî in this case, there are none, so we just pass an empty dictionary:

In [36]:
memory.load_memory_variables({})

{'history': [HumanMessage(content='Hi, my name is Dat', additional_kwargs={}, response_metadata={}),
  AIMessage(content="Hey Dat, what's up? I'm an AI model.", additional_kwargs={}, response_metadata={}),
  HumanMessage(content="I'm researching the different types of conversational memory.", additional_kwargs={}, response_metadata={}),
  AIMessage(content="That's interesting, what type do you want to see?", additional_kwargs={}, response_metadata={}),
  HumanMessage(content="I've been looking at ConversationBufferMemory and ConversationBufferWindowMemory.", additional_kwargs={}, response_metadata={}),
  AIMessage(content="That's interesting, ConversationBufferMemory is the simplest form of conversational memory in LangChain. It stores the entire conversation history as a buffer, allowing the LLM to access all previous messages for context. It is useful for chatbots and agents that need to remember the full conversation. Beside The Window version will limited to what model need to reme

With that, we've created our buffer memory. Before feeding it into our LLM let's quickly view the alternative method for adding messages to our memory. With this other method, we pass individual user and AI messages via the add_user_message and add_ai_message methods. To reproduce what we did above, we do:

In [37]:
memory = ConversationBufferMemory(return_messages=True)

memory.chat_memory.add_user_message("Hello, my name is Dat.")
memory.chat_memory.add_ai_message("Hey Dat, what's up? I'm an AI model.")
memory.chat_memory.add_user_message("I'm researching the different types of conversational memory.")
memory.chat_memory.add_ai_message("That's interesting, what type do you want to see?")
memory.chat_memory.add_user_message("I've been looking at ConversationBufferMemory and ConversationBufferWindowMemory.")
memory.chat_memory.add_ai_message("That's interesting, ConversationBufferMemory is the simplest form of conversational memory in LangChain. It stores the entire conversation history as a buffer, allowing the LLM to access all previous messages for context. It is useful for chatbots and agents that need to remember the full conversation. Beside The Window version will limited to what model need to remember instead")
memory.chat_memory.add_user_message("Buffer memory just stores the entire conversation, right?")
memory.chat_memory.add_ai_message("Yes")
memory.chat_memory.add_user_message("Buffer window memory stores the last k messages, dropping the rest.")
memory.chat_memory.add_ai_message("Also right!")

memory.load_memory_variables({})

{'history': [HumanMessage(content='Hello, my name is Dat.', additional_kwargs={}, response_metadata={}),
  AIMessage(content="Hey Dat, what's up? I'm an AI model.", additional_kwargs={}, response_metadata={}),
  HumanMessage(content="I'm researching the different types of conversational memory.", additional_kwargs={}, response_metadata={}),
  AIMessage(content="That's interesting, what type do you want to see?", additional_kwargs={}, response_metadata={}),
  HumanMessage(content="I've been looking at ConversationBufferMemory and ConversationBufferWindowMemory.", additional_kwargs={}, response_metadata={}),
  AIMessage(content="That's interesting, ConversationBufferMemory is the simplest form of conversational memory in LangChain. It stores the entire conversation history as a buffer, allowing the LLM to access all previous messages for context. It is useful for chatbots and agents that need to remember the full conversation. Beside The Window version will limited to what model need to 

The outcome is exactly the same in either case. To pass this onto our LLM, we need to create a `ConversationChain` object ‚Äî which is already deprecated in favor of the `RunnableWithMessageHistory` class, which we will cover in a moment.

In [38]:
from langchain.chains import ConversationChain

chain = ConversationChain(
    llm = llm,
    memory = memory,
    verbose = True
)

  chain = ConversationChain(


In [39]:
chain.invoke({"input": "what is my name again?"})



[1m> Entering new ConversationChain chain...[0m
Prompt after formatting:
[32;1m[1;3mThe following is a friendly conversation between a human and an AI. The AI is talkative and provides lots of specific details from its context. If the AI does not know the answer to a question, it truthfully says it does not know.

Current conversation:
[HumanMessage(content='Hello, my name is Dat.', additional_kwargs={}, response_metadata={}), AIMessage(content="Hey Dat, what's up? I'm an AI model.", additional_kwargs={}, response_metadata={}), HumanMessage(content="I'm researching the different types of conversational memory.", additional_kwargs={}, response_metadata={}), AIMessage(content="That's interesting, what type do you want to see?", additional_kwargs={}, response_metadata={}), HumanMessage(content="I've been looking at ConversationBufferMemory and ConversationBufferWindowMemory.", additional_kwargs={}, response_metadata={}), AIMessage(content="That's interesting, ConversationBufferMemor

{'input': 'what is my name again?',
 'history': [HumanMessage(content='Hello, my name is Dat.', additional_kwargs={}, response_metadata={}),
  AIMessage(content="Hey Dat, what's up? I'm an AI model.", additional_kwargs={}, response_metadata={}),
  HumanMessage(content="I'm researching the different types of conversational memory.", additional_kwargs={}, response_metadata={}),
  AIMessage(content="That's interesting, what type do you want to see?", additional_kwargs={}, response_metadata={}),
  HumanMessage(content="I've been looking at ConversationBufferMemory and ConversationBufferWindowMemory.", additional_kwargs={}, response_metadata={}),
  AIMessage(content="That's interesting, ConversationBufferMemory is the simplest form of conversational memory in LangChain. It stores the entire conversation history as a buffer, allowing the LLM to access all previous messages for context. It is useful for chatbots and agents that need to remember the full conversation. Beside The Window version

##### `RunnableWithMessageHistory` in Buffer

The `ConversationBufferMemory` type is due for deprecation. Instead, we can use the `RunnableWithMessageHistory` class to implement the same functionality.

When implementing `RunnableWithMessageHistory` we will use LangChain Expression Language (LCEL) and for this we need to define our prompt template and LLM components. Our llm has already been defined, so now we just define a `ChatPromptTemplate` object.


In [40]:
from langchain.prompts import MessagesPlaceholder

system_prompt = "You are a helpful AI assistant."

prompt_template = ChatPromptTemplate.from_messages([
    SystemMessagePromptTemplate.from_template(system_prompt),
    MessagesPlaceholder(variable_name = "history"),
    HumanMessagePromptTemplate.from_template("{query}"),
])

We can link our `prompt_template` and our `llm` together to create a pipeline via LCEL.

In [41]:
pipeline = prompt_template | llm

Our `RunnableWithMessageHistory` requires our pipeline to be wrapped in a `RunnableWithMessageHistory` object. This object requires a few input parameters. One of those is `get_session_history`, which requires a function that returns a `ChatMessageHistory` object based on a session ID. We define this function ourselves:

In [42]:
from langchain_core.chat_history import InMemoryChatMessageHistory

chat_map = {}
def get_chat_history(session_id: str) -> InMemoryChatMessageHistory:
    if session_id not in chat_map:
        # if session ID doesn't exist, create a new chat history
        chat_map[session_id] = InMemoryChatMessageHistory()
    return chat_map[session_id]

We also need to tell our runnable which variable name to use for the chat history (ie `history`) and which to use for the user's query (ie `query`).

In [43]:
from langchain_core.runnables.history import RunnableWithMessageHistory

pipeline_with_history = RunnableWithMessageHistory(
    pipeline,
    get_session_history = get_chat_history,
    input_messages_key = "query",
    history_messages_key = "history"
)

In [44]:
pipeline_with_history.invoke(
    {"query": "Hi, my name is Dat"},
    config = {"session_id": "id_123"}
)

AIMessage(content='Hello, Dat! Nice to meet you. How can I help you today?', additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 48, 'prompt_tokens': 86, 'total_tokens': 134, 'completion_tokens_details': {'accepted_prediction_tokens': None, 'audio_tokens': None, 'reasoning_tokens': 22, 'rejected_prediction_tokens': None, 'image_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0, 'video_tokens': 0}, 'cost': 0, 'is_byok': False, 'cost_details': {'upstream_inference_cost': None, 'upstream_inference_prompt_cost': 0, 'upstream_inference_completions_cost': 0}}, 'model_name': 'openai/gpt-oss-120b:free', 'system_fingerprint': None, 'id': 'gen-1766227892-6yMu9P33iz3csodVbd2k', 'service_tier': None, 'finish_reason': 'stop', 'logprobs': None}, id='run--25b7c0b5-0988-40b5-9f48-d245f30811b4-0', usage_metadata={'input_tokens': 86, 'output_tokens': 48, 'total_tokens': 134, 'input_token_details': {'audio': 0, 'cache_read': 0}, 'output_

In [45]:
pipeline_with_history.invoke(
    {"query": "What is my name again?"},
    config = {"session_id": "id_123"}
)

AIMessage(content='Your name is Dat. üòä', additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 56, 'prompt_tokens': 116, 'total_tokens': 172, 'completion_tokens_details': {'accepted_prediction_tokens': None, 'audio_tokens': None, 'reasoning_tokens': 41, 'rejected_prediction_tokens': None, 'image_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0, 'video_tokens': 0}, 'cost': 0, 'is_byok': False, 'cost_details': {'upstream_inference_cost': None, 'upstream_inference_prompt_cost': 0, 'upstream_inference_completions_cost': 0}}, 'model_name': 'openai/gpt-oss-120b:free', 'system_fingerprint': None, 'id': 'gen-1766227902-RtIbQD70eV10OUzXrueu', 'service_tier': None, 'finish_reason': 'stop', 'logprobs': None}, id='run--41a5fd2d-977c-49d2-a07f-3c286514d85a-0', usage_metadata={'input_tokens': 116, 'output_tokens': 56, 'total_tokens': 172, 'input_token_details': {'audio': 0, 'cache_read': 0}, 'output_token_details': {'reasoning': 4

### Window Memory

The `ConversationBufferWindowMemory` type is similar to `ConversationBufferMemory`, but only keeps track of the last `k` messages. There are a few reasons why we would want to keep only the last `k` messages:
- More messages mean more tokens are sent with each request, more tokens increases latency and cost.
- LLMs tend to perform worse when given more tokens, making them more likely to deviate from instructions, hallucinate, or "forget" information provided to them. Conciseness is key to high performing LLMs.
- If we keep all messages we will eventually hit the LLM's context window limit, by adding a window size `k` we can ensure we never hit this limit.

The buffer window solves many problems that we encounter with the standard buffer memory, while still being a very simple and intuitive form of conversational memory.


In [46]:
from langchain.memory import ConversationBufferWindowMemory

memory = ConversationBufferWindowMemory(k = 4, return_messages = True)

  memory = ConversationBufferWindowMemory(k = 4, return_messages = True)


In [47]:
memory.chat_memory.add_user_message("Hello, my name is Dat.")
memory.chat_memory.add_ai_message("Hey Dat, what's up? I'm an AI model.")
memory.chat_memory.add_user_message("I'm researching the different types of conversational memory.")
memory.chat_memory.add_ai_message("That's interesting, what type do you want to see?")
memory.chat_memory.add_user_message("I've been looking at ConversationBufferMemory and ConversationBufferWindowMemory.")
memory.chat_memory.add_ai_message("That's interesting, ConversationBufferMemory is the simplest form of conversational memory in LangChain. It stores the entire conversation history as a buffer, allowing the LLM to access all previous messages for context. It is useful for chatbots and agents that need to remember the full conversation. Beside The Window version will limited to what model need to remember instead")
memory.chat_memory.add_user_message("Buffer memory just stores the entire conversation, right?")
memory.chat_memory.add_ai_message("Yes")
memory.chat_memory.add_user_message("Buffer window memory stores the last k messages, dropping the rest.")
memory.chat_memory.add_ai_message("Also right!")

memory.load_memory_variables({})

{'history': [HumanMessage(content="I'm researching the different types of conversational memory.", additional_kwargs={}, response_metadata={}),
  AIMessage(content="That's interesting, what type do you want to see?", additional_kwargs={}, response_metadata={}),
  HumanMessage(content="I've been looking at ConversationBufferMemory and ConversationBufferWindowMemory.", additional_kwargs={}, response_metadata={}),
  AIMessage(content="That's interesting, ConversationBufferMemory is the simplest form of conversational memory in LangChain. It stores the entire conversation history as a buffer, allowing the LLM to access all previous messages for context. It is useful for chatbots and agents that need to remember the full conversation. Beside The Window version will limited to what model need to remember instead", additional_kwargs={}, response_metadata={}),
  HumanMessage(content='Buffer memory just stores the entire conversation, right?', additional_kwargs={}, response_metadata={}),
  AIMe

As before, we use the `ConversationChain` object (again, this is deprecated and we will rewrite it with `RunnableWithMessageHistory` in a moment).

In [48]:
chain = ConversationChain(
    llm = llm,
    memory = memory,
    verbose = True
)

In [49]:
chain.invoke({"input": "what is my name again?"})



[1m> Entering new ConversationChain chain...[0m
Prompt after formatting:
[32;1m[1;3mThe following is a friendly conversation between a human and an AI. The AI is talkative and provides lots of specific details from its context. If the AI does not know the answer to a question, it truthfully says it does not know.

Current conversation:
[HumanMessage(content="I'm researching the different types of conversational memory.", additional_kwargs={}, response_metadata={}), AIMessage(content="That's interesting, what type do you want to see?", additional_kwargs={}, response_metadata={}), HumanMessage(content="I've been looking at ConversationBufferMemory and ConversationBufferWindowMemory.", additional_kwargs={}, response_metadata={}), AIMessage(content="That's interesting, ConversationBufferMemory is the simplest form of conversational memory in LangChain. It stores the entire conversation history as a buffer, allowing the LLM to access all previous messages for context. It is useful for

{'input': 'what is my name again?',
 'history': [HumanMessage(content="I'm researching the different types of conversational memory.", additional_kwargs={}, response_metadata={}),
  AIMessage(content="That's interesting, what type do you want to see?", additional_kwargs={}, response_metadata={}),
  HumanMessage(content="I've been looking at ConversationBufferMemory and ConversationBufferWindowMemory.", additional_kwargs={}, response_metadata={}),
  AIMessage(content="That's interesting, ConversationBufferMemory is the simplest form of conversational memory in LangChain. It stores the entire conversation history as a buffer, allowing the LLM to access all previous messages for context. It is useful for chatbots and agents that need to remember the full conversation. Beside The Window version will limited to what model need to remember instead", additional_kwargs={}, response_metadata={}),
  HumanMessage(content='Buffer memory just stores the entire conversation, right?', additional_kwar

The reason our LLM can no longer remember our name is because we have set the `k` parameter to 4, meaning that only the last messages are stored in memory, as we can see above this does not include the first message where we introduced ourselves.

Based on the agent forgetting our name, we might wonder why we would ever use this memory type compared to the standard buffer memory. Well, as with most things in AI, it is always a trade-off. Here we are able to support much longer conversations, use less tokens, and improve latency ‚Äî but these come at the cost of forgetting non-recent messages.

##### `RunnableWithMessageHistory` in Window Buffer

To implement this memory type using the `RunnableWithMessageHistory` class, we can use the same approach as before.

In [50]:
from pydantic import BaseModel, Field
from langchain_core.chat_history import BaseChatMessageHistory
from langchain_core.messages import BaseMessage

class BufferWindowMessageHistory(BaseChatMessageHistory, BaseModel):
    messages: list[BaseMessage] = Field(default_factory=list)
    k: int = Field(default_factory = int)

    def __init__(self, k: int):
        super().__init__(k = k)
        print(f"Initializing BufferWindowMessageHistory with k={k}")

    def add_messages(self, messages: list[BaseMessage]) -> None:
        """Add messages to the history, removing any messages beyond
        the last `k` messages.
        """
        self.messages.extend(messages)
        self.messages = self.messages[-self.k:]

    def clear(self) -> None:
        """Clear the history."""
        self.messages = []

In [51]:
chat_map = {}
def get_chat_history(session_id: str, k: int = 4) -> BufferWindowMessageHistory:
    print(f"get_chat_history called with session_id={session_id} and k={k}")
    if session_id not in chat_map:
        # if session ID doesn't exist, create a new chat history
        chat_map[session_id] = BufferWindowMessageHistory(k = k)
    # remove anything beyond the last
    return chat_map[session_id]

In [52]:
from langchain_core.runnables import ConfigurableFieldSpec

pipeline_with_history = RunnableWithMessageHistory(
    pipeline,
    get_session_history = get_chat_history,
    input_messages_key = "query",
    history_messages_key = "history",
    history_factory_config = [
        ConfigurableFieldSpec(
            id = "session_id",
            annotation = str,
            name = "Session ID",
            description = "The session ID to use for the chat history",
            default = "id_default",
        ),
        ConfigurableFieldSpec(
            id = "k",
            annotation = int,
            name = "k",
            description = "The number of messages to keep in the history",
            default = 4,
        )
    ]
)



Now we invoke our runnable, this time passing a `k` parameter via the config parameter.

In [53]:
pipeline_with_history.invoke(
    {"query": "Hi, my name is Dat"},
    config = {"configurable": {"session_id": "id_k4", "k": 4}}
)

get_chat_history called with session_id=id_k4 and k=4
Initializing BufferWindowMessageHistory with k=4


AIMessage(content='Hello, Dat! Nice to meet you. How can I assist you today?', additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 55, 'prompt_tokens': 86, 'total_tokens': 141, 'completion_tokens_details': {'accepted_prediction_tokens': None, 'audio_tokens': None, 'reasoning_tokens': 29, 'rejected_prediction_tokens': None, 'image_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0, 'video_tokens': 0}, 'cost': 0, 'is_byok': False, 'cost_details': {'upstream_inference_cost': None, 'upstream_inference_prompt_cost': 0, 'upstream_inference_completions_cost': 0}}, 'model_name': 'openai/gpt-oss-120b:free', 'system_fingerprint': None, 'id': 'gen-1766227951-BPAkEBlsJE6goBkbYaVw', 'service_tier': None, 'finish_reason': 'stop', 'logprobs': None}, id='run--dcfc2097-5329-41f4-b3a1-8fe4b076073c-0', usage_metadata={'input_tokens': 86, 'output_tokens': 55, 'total_tokens': 141, 'input_token_details': {'audio': 0, 'cache_read': 0}, 'outpu

We can also modify the messages that are stored in memory by modifying the records inside the chat_map dictionary directly.

In [54]:
chat_map["id_k4"].clear()  # clear the history

chat_map["id_k4"].add_user_message("Hello, my name is Dat.")
chat_map["id_k4"].add_ai_message("Hey Dat, what's up? I'm an AI model.")
chat_map["id_k4"].add_user_message("I'm researching the different types of conversational memory.")
chat_map["id_k4"].add_ai_message("That's interesting, what type do you want to see?")
chat_map["id_k4"].add_user_message("I've been looking at ConversationBufferMemory and ConversationBufferWindowMemory.")
chat_map["id_k4"].add_ai_message("That's interesting, ConversationBufferMemory is the simplest form of conversational memory in LangChain. It stores the entire conversation history as a buffer, allowing the LLM to access all previous messages for context. It is useful for chatbots and agents that need to remember the full conversation. Beside The Window version will limited to what model need to remember instead")
chat_map["id_k4"].add_user_message("Buffer memory just stores the entire conversation, right?")
chat_map["id_k4"].add_ai_message("Yes")
chat_map["id_k4"].add_user_message("Buffer window memory stores the last k messages, dropping the rest.")
chat_map["id_k4"].add_ai_message("Also right!")

chat_map["id_k4"].messages  # should contain only the last 4 messages

[HumanMessage(content='Buffer memory just stores the entire conversation, right?', additional_kwargs={}, response_metadata={}),
 AIMessage(content='Yes', additional_kwargs={}, response_metadata={}),
 HumanMessage(content='Buffer window memory stores the last k messages, dropping the rest.', additional_kwargs={}, response_metadata={}),
 AIMessage(content='Also right!', additional_kwargs={}, response_metadata={})]

In [55]:
# Test again
pipeline_with_history.invoke(
    {"query": "what is my name again?"},
    config={"configurable": {"session_id": "id_k4", "k": 4}}
)

get_chat_history called with session_id=id_k4 and k=4


AIMessage(content='I‚Äôm not sure‚Äî I don‚Äôt have a record of your name from our earlier messages. Could you let me know what you‚Äôd like me to call you?', additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 95, 'prompt_tokens': 129, 'total_tokens': 224, 'completion_tokens_details': {'accepted_prediction_tokens': None, 'audio_tokens': None, 'reasoning_tokens': 61, 'rejected_prediction_tokens': None, 'image_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0, 'video_tokens': 0}, 'cost': 0, 'is_byok': False, 'cost_details': {'upstream_inference_cost': None, 'upstream_inference_prompt_cost': 0, 'upstream_inference_completions_cost': 0}}, 'model_name': 'openai/gpt-oss-120b:free', 'system_fingerprint': None, 'id': 'gen-1766227958-cW7Aq67I8RU59hksJ6NK', 'service_tier': None, 'finish_reason': 'stop', 'logprobs': None}, id='run--23bb24ba-7191-410d-b486-7da1ecae4bb5-0', usage_metadata={'input_tokens': 129, 'output_tokens': 95,

Now let's initialize a new session with further `k`

In [56]:
pipeline_with_history.invoke(
    {"query": "Hi, my name is James"},
    config = {"session_id": "id_k14", "k": 14}
)

get_chat_history called with session_id=id_k14 and k=14
Initializing BufferWindowMessageHistory with k=14


AIMessage(content='Hello James! Nice to meet you. How can I help you today?', additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 47, 'prompt_tokens': 86, 'total_tokens': 133, 'completion_tokens_details': {'accepted_prediction_tokens': None, 'audio_tokens': None, 'reasoning_tokens': 22, 'rejected_prediction_tokens': None, 'image_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0, 'video_tokens': 0}, 'cost': 0, 'is_byok': False, 'cost_details': {'upstream_inference_cost': None, 'upstream_inference_prompt_cost': 0, 'upstream_inference_completions_cost': 0}}, 'model_name': 'openai/gpt-oss-120b:free', 'system_fingerprint': None, 'id': 'gen-1766227962-Eb3exVQn5H8hZL77AeXL', 'service_tier': None, 'finish_reason': 'stop', 'logprobs': None}, id='run--6172d667-0b7d-4e88-a49e-b0a3dc0e5aad-0', usage_metadata={'input_tokens': 86, 'output_tokens': 47, 'total_tokens': 133, 'input_token_details': {'audio': 0, 'cache_read': 0}, 'output

In [57]:
chat_map["id_k4"].add_user_message("Hello, my name is Dat.")
chat_map["id_k4"].add_ai_message("Hey Dat, what's up? I'm an AI model.")
chat_map["id_k4"].add_user_message("I'm researching the different types of conversational memory.")
chat_map["id_k4"].add_ai_message("That's interesting, what type do you want to see?")
chat_map["id_k4"].add_user_message("I've been looking at ConversationBufferMemory and ConversationBufferWindowMemory.")
chat_map["id_k4"].add_ai_message("That's interesting, ConversationBufferMemory is the simplest form of conversational memory in LangChain. It stores the entire conversation history as a buffer, allowing the LLM to access all previous messages for context. It is useful for chatbots and agents that need to remember the full conversation. Beside The Window version will limited to what model need to remember instead")
chat_map["id_k4"].add_user_message("Buffer memory just stores the entire conversation, right?")
chat_map["id_k4"].add_ai_message("Yes")
chat_map["id_k4"].add_user_message("Buffer window memory stores the last k messages, dropping the rest.")
chat_map["id_k4"].add_ai_message("Also right!")

In [58]:
chat_map["id_k14"].messages  # should contain all messages since k = 14

[HumanMessage(content='Hi, my name is James', additional_kwargs={}, response_metadata={}),
 AIMessage(content='Hello James! Nice to meet you. How can I help you today?', additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 47, 'prompt_tokens': 86, 'total_tokens': 133, 'completion_tokens_details': {'accepted_prediction_tokens': None, 'audio_tokens': None, 'reasoning_tokens': 22, 'rejected_prediction_tokens': None, 'image_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0, 'video_tokens': 0}, 'cost': 0, 'is_byok': False, 'cost_details': {'upstream_inference_cost': None, 'upstream_inference_prompt_cost': 0, 'upstream_inference_completions_cost': 0}}, 'model_name': 'openai/gpt-oss-120b:free', 'system_fingerprint': None, 'id': 'gen-1766227962-Eb3exVQn5H8hZL77AeXL', 'service_tier': None, 'finish_reason': 'stop', 'logprobs': None}, id='run--6172d667-0b7d-4e88-a49e-b0a3dc0e5aad-0', usage_metadata={'input_tokens': 86, 'output_tok

### Summary Memory

This memory type keeps track of a summary of the conversation rather than the entire conversation. This is useful for long conversations where we don't need to keep track of the entire conversation, but we do want to keep some thread of the full conversation.

In [59]:
from langchain.memory import ConversationSummaryMemory

memory = ConversationSummaryMemory(llm = llm)

  memory = ConversationSummaryMemory(llm = llm)


Unlike with the previous memory types, we need to provide an llm to initialize `ConversationSummaryMemory`. The reason for this is that we need an LLM to generate the conversation summaries.

Beyond this small tweak, using `ConversationSummaryMemory` is the same as with our previous memory types when using the deprecated `ConversationChain` object.


In [60]:
chain = ConversationChain(
    llm = llm,
    memory = memory,
    verbose = True
)

In [61]:
chain.invoke({"input": "hello there my name is Dat"})
chain.invoke({"input": "i am researching different types of AI model"})
chain.invoke({"input": "what have we talked about so far?"})



[1m> Entering new ConversationChain chain...[0m
Prompt after formatting:
[32;1m[1;3mThe following is a friendly conversation between a human and an AI. The AI is talkative and provides lots of specific details from its context. If the AI does not know the answer to a question, it truthfully says it does not know.

Current conversation:

Human: hello there my name is Dat
AI:[0m

[1m> Finished chain.[0m


[1m> Entering new ConversationChain chain...[0m
Prompt after formatting:
[32;1m[1;3mThe following is a friendly conversation between a human and an AI. The AI is talkative and provides lots of specific details from its context. If the AI does not know the answer to a question, it truthfully says it does not know.

Current conversation:
The human greets the AI and introduces themselves as ‚ÄúDat.‚Äù The AI replies with a friendly hello, introduces itself as ChatGPT (GPT‚Äë4, knowledge up to June‚ÄØ2024), outlines its abilities‚Äîmultilingual text handling, creative writing a

{'input': 'what have we talked about so far?',
 'history': 'The human greets the AI and introduces themselves as ‚ÄúDat.‚Äù The AI replies with a friendly hello, introduces itself as ChatGPT (GPT‚Äë4, knowledge up to June\u202f2024), outlines its abilities‚Äîmultilingual text handling, creative writing assistance, technical explanations, and everyday help like recipes or coding‚Äîand asks Dat about their hobbies or interests.  \n\nDat then says they are researching different types of AI models. The AI provides a comprehensive tour of AI model families, covering:\n\n* **Rule‚Äëbased / symbolic AI** ‚Äì explicit if‚Äëthen systems, interpretable but brittle.  \n* **Statistical / classical machine‚Äëlearning** ‚Äì linear models, tree‚Äëbased ensembles (XGBoost, LightGBM, CatBoost), kernel methods (SVM, Kernel PCA), probabilistic models (Na√Øve Bayes, HMM, GMM); good for tabular or modest‚Äësize data, often more interpretable than deep nets.  \n* **Shallow neural networks** ‚Äì one‚Äë or tw

In [69]:
chain.invoke({"input": "What is my name again?"})



[1m> Entering new ConversationChain chain...[0m
Prompt after formatting:
[32;1m[1;3mThe following is a friendly conversation between a human and an AI. The AI is talkative and provides lots of specific details from its context. If the AI does not know the answer to a question, it truthfully says it does not know.

Current conversation:
The human greeted the AI and introduced themselves as ‚ÄúDat,‚Äù and the AI replied with a friendly hello, introduced itself as ChatGPT (GPT‚Äë4, knowledge up to June‚ÄØ2024), listed its multilingual text, creative‚Äëwriting, technical‚Äëexplanation, recipe‚Äëand‚Äëcoding assistance abilities, and asked about Dat‚Äôs hobbies.‚ÄØDat said they were researching different types of AI models, so the AI gave a comprehensive tour of AI model families‚Äîincluding rule‚Äëbased/symbolic systems; statistical/classical machine‚Äëlearning (linear models, tree ensembles, kernel methods, probabilistic models); shallow neural networks; deep learning (CNNs for visi

{'input': 'What is my name again?',
 'history': 'The human greeted the AI and introduced themselves as ‚ÄúDat,‚Äù and the AI replied with a friendly hello, introduced itself as ChatGPT (GPT‚Äë4, knowledge up to June\u202f2024), listed its multilingual text, creative‚Äëwriting, technical‚Äëexplanation, recipe‚Äëand‚Äëcoding assistance abilities, and asked about Dat‚Äôs hobbies.\u202fDat said they were researching different types of AI models, so the AI gave a comprehensive tour of AI model families‚Äîincluding rule‚Äëbased/symbolic systems; statistical/classical machine‚Äëlearning (linear models, tree ensembles, kernel methods, probabilistic models); shallow neural networks; deep learning (CNNs for vision, RNN/LSTM/GRU for sequences, transformers broken into encoder‚Äëonly, decoder‚Äëonly, encoder‚Äëdecoder, and multimodal variants; diffusion models for high‚Äëfidelity image/audio generation; large language models with scaling, instruction‚Äëtuning, RLHF; reinforcement‚Äëlearning agents

As this information was stored in the summary the LLM successfully recalled our name. This may not always be the case, by summarizing the conversation we inevitably compress the full amount of information and so we may lose key details occasionally. Nonetheless, this is a great memory type for long conversations while retaining some key information.

##### `RunnableWithMessageHistory` in Summary

As with the window buffer memory, we need to define a custom implementation of the `InMemoryChatMessageHistory` class. We'll call this one `ConversationSummaryMessageHistory`.

In [70]:
from langchain_core.messages import SystemMessage


class ConversationSummaryMessageHistory(BaseChatMessageHistory, BaseModel):
    messages: list[BaseMessage] = Field(default_factory=list)
    llm: ChatOpenAI = Field(default_factory=ChatOpenAI)

    def __init__(self, llm: ChatOpenAI):
        super().__init__(llm=llm)

    def add_messages(self, messages: list[BaseMessage]) -> None:
        """Add messages to the history, removing any messages beyond
        the last `k` messages.
        """
        self.messages.extend(messages)
        # construct the summary chat messages
        summary_prompt = ChatPromptTemplate.from_messages([
            SystemMessagePromptTemplate.from_template(
                "Given the existing conversation summary and the new messages, "
                "generate a new summary of the conversation. Ensuring to maintain "
                "as much relevant information as possible."
            ),
            HumanMessagePromptTemplate.from_template(
                "Existing conversation summary:\n{existing_summary}\n\n"
                "New messages:\n{messages}"
            )
        ])
        # format the messages and invoke the LLM
        new_summary = self.llm.invoke(
            summary_prompt.format_messages(
                existing_summary=self.messages.content,
                messages=[x.content for x in messages]
            )
        )
        # replace the existing history with a single system summary message
        self.messages = [SystemMessage(content=new_summary.content)]

    def clear(self) -> None:
        """Clear the history."""
        self.messages = []

In [71]:
chat_map = {}
def get_chat_history(session_id: str, llm: ChatOpenAI) -> ConversationSummaryMessageHistory:
    if session_id not in chat_map:
        # if session ID doesn't exist, create a new chat history
        chat_map[session_id] = ConversationSummaryMessageHistory(llm=llm)
    # return the chat history
    return chat_map[session_id]

In [72]:
pipeline_with_history = RunnableWithMessageHistory(
    pipeline,
    get_session_history=get_chat_history,
    input_messages_key="query",
    history_messages_key="history",
    history_factory_config=[
        ConfigurableFieldSpec(
            id="session_id",
            annotation=str,
            name="Session ID",
            description="The session ID to use for the chat history",
            default="id_default",
        ),
        ConfigurableFieldSpec(
            id="llm",
            annotation=ChatOpenAI,
            name="LLM",
            description="The LLM to use for the conversation summary",
            default=llm,
        )
    ]
)

Now we invoke our runnable, this time passing a `llm` parameter via the `config` parameter.

In [73]:
pipeline_with_history.invoke(
    {"query": "Hi, my name is Dat"},
    config={"session_id": "id_123", "llm": llm}
)

Error in RootListenersTracer.on_chain_end callback: AttributeError("'list' object has no attribute 'content'")


AIMessage(content='Hello, Dat! Nice to meet you. How can I assist you today?', additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 55, 'prompt_tokens': 86, 'total_tokens': 141, 'completion_tokens_details': {'accepted_prediction_tokens': None, 'audio_tokens': None, 'reasoning_tokens': 29, 'rejected_prediction_tokens': None, 'image_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0, 'video_tokens': 0}, 'cost': 0, 'is_byok': False, 'cost_details': {'upstream_inference_cost': None, 'upstream_inference_prompt_cost': 0, 'upstream_inference_completions_cost': 0}}, 'model_name': 'openai/gpt-oss-120b:free', 'system_fingerprint': None, 'id': 'gen-1766228326-N1OwE2oIHjwTux0gpoCk', 'service_tier': None, 'finish_reason': 'stop', 'logprobs': None}, id='run--cded2bb4-af02-4efc-804a-a31ad80c8540-0', usage_metadata={'input_tokens': 86, 'output_tokens': 55, 'total_tokens': 141, 'input_token_details': {'audio': 0, 'cache_read': 0}, 'outpu

In [74]:
chat_map["id_123"].messages

[HumanMessage(content='Hi, my name is Dat', additional_kwargs={}, response_metadata={}),
 AIMessage(content='Hello, Dat! Nice to meet you. How can I assist you today?', additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 55, 'prompt_tokens': 86, 'total_tokens': 141, 'completion_tokens_details': {'accepted_prediction_tokens': None, 'audio_tokens': None, 'reasoning_tokens': 29, 'rejected_prediction_tokens': None, 'image_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0, 'video_tokens': 0}, 'cost': 0, 'is_byok': False, 'cost_details': {'upstream_inference_cost': None, 'upstream_inference_prompt_cost': 0, 'upstream_inference_completions_cost': 0}}, 'model_name': 'openai/gpt-oss-120b:free', 'system_fingerprint': None, 'id': 'gen-1766228326-N1OwE2oIHjwTux0gpoCk', 'service_tier': None, 'finish_reason': 'stop', 'logprobs': None}, id='run--cded2bb4-af02-4efc-804a-a31ad80c8540-0', usage_metadata={'input_tokens': 86, 'output_toke

In [75]:
pipeline_with_history.invoke(
    {"query": "I'm researching the different types of conversational memory."},
    config={"session_id": "id_123", "llm": llm}
)

chat_map["id_123"].messages

Error in RootListenersTracer.on_chain_end callback: AttributeError("'list' object has no attribute 'content'")


[HumanMessage(content='Hi, my name is Dat', additional_kwargs={}, response_metadata={}),
 AIMessage(content='Hello, Dat! Nice to meet you. How can I assist you today?', additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 55, 'prompt_tokens': 86, 'total_tokens': 141, 'completion_tokens_details': {'accepted_prediction_tokens': None, 'audio_tokens': None, 'reasoning_tokens': 29, 'rejected_prediction_tokens': None, 'image_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0, 'video_tokens': 0}, 'cost': 0, 'is_byok': False, 'cost_details': {'upstream_inference_cost': None, 'upstream_inference_prompt_cost': 0, 'upstream_inference_completions_cost': 0}}, 'model_name': 'openai/gpt-oss-120b:free', 'system_fingerprint': None, 'id': 'gen-1766228326-N1OwE2oIHjwTux0gpoCk', 'service_tier': None, 'finish_reason': 'stop', 'logprobs': None}, id='run--cded2bb4-af02-4efc-804a-a31ad80c8540-0', usage_metadata={'input_tokens': 86, 'output_toke

Let's continue the conversation and see if the summary is updated:

In [76]:
for msg in [
    "I have been looking at ConversationBufferMemory and ConversationBufferWindowMemory.",
    "Buffer memory just stores the entire conversation",
    "Buffer window memory stores the last k messages, dropping the rest."
]:
    pipeline_with_history.invoke(
        {"query": msg},
        config={"session_id": "id_123", "llm": llm}
    )

Error in RootListenersTracer.on_chain_end callback: AttributeError("'list' object has no attribute 'content'")
Error in RootListenersTracer.on_chain_end callback: AttributeError("'list' object has no attribute 'content'")
Error in RootListenersTracer.on_chain_end callback: AttributeError("'list' object has no attribute 'content'")


In [77]:
chat_map["id_123"].messages

[HumanMessage(content='Hi, my name is Dat', additional_kwargs={}, response_metadata={}),
 AIMessage(content='Hello, Dat! Nice to meet you. How can I assist you today?', additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 55, 'prompt_tokens': 86, 'total_tokens': 141, 'completion_tokens_details': {'accepted_prediction_tokens': None, 'audio_tokens': None, 'reasoning_tokens': 29, 'rejected_prediction_tokens': None, 'image_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0, 'video_tokens': 0}, 'cost': 0, 'is_byok': False, 'cost_details': {'upstream_inference_cost': None, 'upstream_inference_prompt_cost': 0, 'upstream_inference_completions_cost': 0}}, 'model_name': 'openai/gpt-oss-120b:free', 'system_fingerprint': None, 'id': 'gen-1766228326-N1OwE2oIHjwTux0gpoCk', 'service_tier': None, 'finish_reason': 'stop', 'logprobs': None}, id='run--cded2bb4-af02-4efc-804a-a31ad80c8540-0', usage_metadata={'input_tokens': 86, 'output_toke

The information still maintained, let's check again

In [78]:
pipeline_with_history.invoke(
    {"query": "What is my name again?"},
    config={"session_id": "id_123", "llm": llm}
)

Error in RootListenersTracer.on_chain_end callback: AttributeError("'list' object has no attribute 'content'")


AIMessage(content='Your name is **Dat**.', additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 95, 'prompt_tokens': 10551, 'total_tokens': 10646, 'completion_tokens_details': {'accepted_prediction_tokens': None, 'audio_tokens': None, 'reasoning_tokens': 90, 'rejected_prediction_tokens': None, 'image_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0, 'video_tokens': 0}, 'cost': 0, 'is_byok': False, 'cost_details': {'upstream_inference_cost': None, 'upstream_inference_prompt_cost': 0, 'upstream_inference_completions_cost': 0}}, 'model_name': 'openai/gpt-oss-120b:free', 'system_fingerprint': None, 'id': 'gen-1766228436-I20tKgsn4kNiCDD7gLwD', 'service_tier': None, 'finish_reason': 'stop', 'logprobs': None}, id='run--0fa06d46-fab0-40cb-b941-bec09caf2bb6-0', usage_metadata={'input_tokens': 10551, 'output_tokens': 95, 'total_tokens': 10646, 'input_token_details': {'audio': 0, 'cache_read': 0}, 'output_token_details': {'reason

# Agent and Tools

Tools are a way augment our LLMs with code execution. A tool is simply a function formatted so that our agent can undertstand how to use it, and then execute it.

We can use the @tool decorator to create an LLM-compatible tool from a standard python function ‚Äî this function should include a few things for optimal performance:
- A docstring describing what the tool does and when it should be used, this will be read by our LLM/agent and used to decide when to use the tool, and also how to use the tool.
- Clear parameter names that ideally tell the LLM what each parameter is, if it isn't clear we make sure the docstring explains what the parameter is for and how to use it.
- Both parameter and return type annotations.


In [84]:
from langchain_core.tools import tool

@tool
def add(x: float, y: float) -> float:
    """Add 'x' and 'y'."""
    return x + y

@tool
def multiply(x: float, y: float) -> float:
    """Multiply 'x' and 'y'."""
    return x * y

@tool
def exponentiate(x: float, y: float) -> float:
    """Raise 'x' to the power of 'y'."""
    return x ** y

@tool
def subtract(x: float, y: float) -> float:
    """Subtract 'x' from 'y'."""
    return y - x

With the `@tool` decorator our function is turned into a `StructuredTool` object, which we can see below:

In [85]:
add

StructuredTool(name='add', description="Add 'x' and 'y'.", args_schema=<class 'langchain_core.utils.pydantic.add'>, func=<function add at 0x7e66a241a3a0>)

We can see the tool name, description, and arg schema:

In [86]:
print(f"{add.name=}\n{add.description=}")

add.name='add'
add.description="Add 'x' and 'y'."


In [87]:
add.args_schema.model_json_schema()

{'description': "Add 'x' and 'y'.",
 'properties': {'x': {'title': 'X', 'type': 'number'},
  'y': {'title': 'Y', 'type': 'number'}},
 'required': ['x', 'y'],
 'title': 'add',
 'type': 'object'}

In [88]:
exponentiate.args_schema.model_json_schema()

{'description': "Raise 'x' to the power of 'y'.",
 'properties': {'x': {'title': 'X', 'type': 'number'},
  'y': {'title': 'Y', 'type': 'number'}},
 'required': ['x', 'y'],
 'title': 'exponentiate',
 'type': 'object'}

When invoking the tool, a JSON string output by the LLM will be parsed into JSON and then consumed as kwargs, similar to the below:

In [89]:
import json

llm_output_string = "{\"x\": 5, \"y\": 2}"  # this is the output from the LLM
llm_output_dict = json.loads(llm_output_string)  # load as dictionary
llm_output_dict

{'x': 5, 'y': 2}

This is then passed into the tool function as `kwargs` (keyword arguments) as indicated by the `**` operator - the `**` operator is used to unpack the dictionary into keyword arguments.

In [90]:
exponentiate.func(**llm_output_dict)  # call the function with unpacked args

25

This covers the basics of tools and how they work, let's move on to creating the agent itself.

### Creating an Agent

We need this agent to remember previous interactions within the conversation. To do that, we will use the `ChatPromptTemplate` with a system message, a placeholder for our chat history, a placeholder for the user query, and finally a placeholder for the agent scratchpad.

The agent scratchpad is where the agent will write it's "notes" as it is working through multiple internal thought and tool-use steps to produce a final output to the user.

In [91]:
prompt = ChatPromptTemplate.from_messages([
    ("system", "you're a helpful assistant"),
    MessagesPlaceholder(variable_name = "chat_history"),
    ("human", "{input}"),
    ("placeholder", "{agent_scratchpad}"),
])

When creating an agent we need to add conversational memory to make the agent remember previous interactions. We'll be using the older `ConversationBufferMemory` class rather than the newer `RunnableWithMessageHistory` ‚Äî the reason being that we will also be using the older `create_tool_calling_agent` and `AgentExecutor` method and class.

In [92]:
memory = ConversationBufferMemory(
    memory_key = "chat_history",  # must align with MessagesPlaceholder variable_name
    return_messages = True  # to return Message objects
)

In [93]:
from langchain.agents import create_tool_calling_agent

tools = [add, subtract, multiply, exponentiate]

agent = create_tool_calling_agent(
    llm = llm, tools = tools, prompt = prompt
)

Our agent by itself is like one-step of our agent execution loop. So, if we call the `agent.invoke` method it will get the LLM to generate a single response and go no further, so no tools will be executed, and no next iterations will be performed.

We can see this by asking a query that should trigger a tool call:

In [94]:
agent.invoke({
    "input": "what is 10.7 multiplied by 7.68?",
    "chat_history": memory.chat_memory.messages,
    "intermediate_steps": []  # agent will append it's internal steps here
})

AgentFinish(return_values={'output': '\\(10.7 \\times 7.68 = 82.176\\)'}, log='\\(10.7 \\times 7.68 = 82.176\\)')

Here, we can see the LLM has generated that we should use the multiply tool and the tool input should be `{"x": 10.7, "y": 7.68}`. However, the tool is not executed. For that to happen we need an agent execution loop, which will handle the multiple iterations of generation to tool calling to generation, etc.

We use the `AgentExecutor` class to handle the execution loop:

In [95]:
from langchain.agents import AgentExecutor

agent_executor = AgentExecutor(
    agent = agent,
    tools = tools,
    memory = memory,
    verbose = True
)

Now let's try the same query with the executor, note that the `intermediate_steps` parameter that we added before is no longer needed as the executor handles it internally.

In [96]:
agent_executor.invoke({
    "input": "what is 10.7 multiplied by 7.68?",
    "chat_history": memory.chat_memory.messages,
})



[1m> Entering new AgentExecutor chain...[0m
[32;1m[1;3m\(10.7 \times 7.68 = 82.176\)[0m

[1m> Finished chain.[0m


{'input': 'what is 10.7 multiplied by 7.68?',
 'chat_history': [HumanMessage(content='what is 10.7 multiplied by 7.68?', additional_kwargs={}, response_metadata={}),
  AIMessage(content='\\(10.7 \\times 7.68 = 82.176\\)', additional_kwargs={}, response_metadata={})],
 'output': '\\(10.7 \\times 7.68 = 82.176\\)'}

We can see that the multiply tool was invoked, producing the observation of 82.175999.... After the observation was provided, we can see that the LLM then generated a final response of:

```
10.7 multiplied by 7.68 is approximately 82.18.
```

This final response was generated based on the original query and the tool output (ie the observation). We can also confirm that this answer is accurate:

In [97]:
10.7*7.68

82.17599999999999

Let's test our agent with some memory and tool use. First, we tell it our name, then we will perform a few tool calls, then see if the agent can still recall our name.

First, give the agent our name:

In [98]:
agent_executor.invoke({
    "input": "My name is Dat",
    "chat_history": memory
})



[1m> Entering new AgentExecutor chain...[0m
[32;1m[1;3mNice to meet you, Dat! How can I assist you today?[0m

[1m> Finished chain.[0m


{'input': 'My name is Dat',
 'chat_history': [HumanMessage(content='what is 10.7 multiplied by 7.68?', additional_kwargs={}, response_metadata={}),
  AIMessage(content='\\(10.7 \\times 7.68 = 82.176\\)', additional_kwargs={}, response_metadata={}),
  HumanMessage(content='My name is Dat', additional_kwargs={}, response_metadata={}),
  AIMessage(content='Nice to meet you, Dat! How can I assist you today?', additional_kwargs={}, response_metadata={})],
 'output': 'Nice to meet you, Dat! How can I assist you today?'}

Now let's try and get the agent to perform multiple tool calls within a single execution loop:

In [99]:
agent_executor.invoke({
    "input": "What is nine plus 10, minus 4 * 2, to the power of 3",
    "chat_history": memory
})



[1m> Entering new AgentExecutor chain...[0m
[32;1m[1;3mThe expression can be interpreted as  

\[
(9 + 10 - 4 \times 2)^{3}
\]

1. \(4 \times 2 = 8\)  
2. \(9 + 10 = 19\)  
3. \(19 - 8 = 11\)  
4. \(11^{3} = 11 \times 11 \times 11 = 1331\)

**Result:** **1331**.[0m

[1m> Finished chain.[0m


{'input': 'What is nine plus 10, minus 4 * 2, to the power of 3',
 'chat_history': [HumanMessage(content='what is 10.7 multiplied by 7.68?', additional_kwargs={}, response_metadata={}),
  AIMessage(content='\\(10.7 \\times 7.68 = 82.176\\)', additional_kwargs={}, response_metadata={}),
  HumanMessage(content='My name is Dat', additional_kwargs={}, response_metadata={}),
  AIMessage(content='Nice to meet you, Dat! How can I assist you today?', additional_kwargs={}, response_metadata={}),
  HumanMessage(content='What is nine plus 10, minus 4 * 2, to the power of 3', additional_kwargs={}, response_metadata={}),
  AIMessage(content='The expression can be interpreted as  \n\n\\[\n(9 + 10 - 4 \\times 2)^{3}\n\\]\n\n1. \\(4 \\times 2 = 8\\)  \n2. \\(9 + 10 = 19\\)  \n3. \\(19 - 8 = 11\\)  \n4. \\(11^{3} = 11 \\times 11 \\times 11 = 1331\\)\n\n**Result:** **1331**.', additional_kwargs={}, response_metadata={})],
 'output': 'The expression can be interpreted as  \n\n\\[\n(9 + 10 - 4 \\times 2)^

In [100]:
9+10-(4*2)**3

-493

Perfect, now let's see if the agent can still recall our name:

In [101]:
agent_executor.invoke({
    "input": "What is my name",
    "chat_history": memory
})



[1m> Entering new AgentExecutor chain...[0m
[32;1m[1;3mYour name is Dat.[0m

[1m> Finished chain.[0m


{'input': 'What is my name',
 'chat_history': [HumanMessage(content='what is 10.7 multiplied by 7.68?', additional_kwargs={}, response_metadata={}),
  AIMessage(content='\\(10.7 \\times 7.68 = 82.176\\)', additional_kwargs={}, response_metadata={}),
  HumanMessage(content='My name is Dat', additional_kwargs={}, response_metadata={}),
  AIMessage(content='Nice to meet you, Dat! How can I assist you today?', additional_kwargs={}, response_metadata={}),
  HumanMessage(content='What is nine plus 10, minus 4 * 2, to the power of 3', additional_kwargs={}, response_metadata={}),
  AIMessage(content='The expression can be interpreted as  \n\n\\[\n(9 + 10 - 4 \\times 2)^{3}\n\\]\n\n1. \\(4 \\times 2 = 8\\)  \n2. \\(9 + 10 = 19\\)  \n3. \\(19 - 8 = 11\\)  \n4. \\(11^{3} = 11 \\times 11 \\times 11 = 1331\\)\n\n**Result:** **1331**.', additional_kwargs={}, response_metadata={}),
  HumanMessage(content='What is my name', additional_kwargs={}, response_metadata={}),
  AIMessage(content='Your name is

For other tool provided by 3rd-party, we can use `load_tools` library which Langchain support

In [102]:
from langchain.agents import load_tools
from IPython.display import display, Markdown

# Define toolfunctions with correct @tool decorator as above
# Invoke the call with tools list passing in
# For better display in Jupyter notebooks, we use Markdown to format

# Agent Executor

When we talk about agents, a significant part of an "agent" is simple code logic,
iteratively rerunning LLM calls and processing their output. The exact logic varies
significantly, but one well-known example is the **ReAct** agent.

![ReAct process](https://www.aurelio.ai/_next/image?url=%2Fimages%2Fposts%2Fai-agents%2Fai-agents-00.png&w=640&q=75)

**Re**ason + **Act**ion (ReAct) agents use iterative _reasoning_ and _action_ steps to
incorporate chain-of-thought and tool-use into their execution. During the _reasoning_
step, the LLM generates the steps to take to answer the query. Next, the LLM generates
the _action_ input, which our code logic parses into a tool call.

![Agentic graph of ReAct](https://www.aurelio.ai/_next/image?url=%2Fimages%2Fposts%2Fai-agents%2Fai-agents-01.png&w=640&q=75)

Following our action step, we get an observation from the tool call. Then, we feed the
observation back into the agent executor logic for a final answer or further reasoning
and action steps.

The agent and agent executor we will be building will follow this pattern.

In [103]:
prompt = ChatPromptTemplate.from_messages([
    ("system", (
        "You're a helpful assistant. When answering a user's question "
        "you should first use one of the tools provided. After using a "
        "tool the tool output will be provided in the "
        "'scratchpad' below. If you have an answer in the "
        "scratchpad you should not use any more tools and "
        "instead answer directly to the user."
    )),
    MessagesPlaceholder(variable_name="chat_history"),
    ("human", "{input}"),
    MessagesPlaceholder(variable_name="agent_scratchpad"),
])

In [104]:
from langchain_core.runnables.base import RunnableSerializable

# define the agent runnable
agent: RunnableSerializable = (
    {
        "input": lambda x: x["input"],
        "chat_history": lambda x: x["chat_history"],
        "agent_scratchpad": lambda x: x.get("agent_scratchpad", [])
    }
    | prompt
    | llm.bind_tools(tools, tool_choice = "any")
)

We invoke the agent with the `invoke` method, passing in the input and chat history.

In [105]:
tool_call = agent.invoke({"input": "What is 10 + 10", "chat_history": []})
tool_call

AIMessage(content='', additional_kwargs={'tool_calls': [{'id': 'chatcmpl-tool-89cc7030bfac4d8d', 'function': {'arguments': '{\n  "x": 10,\n  "y": 10\n}', 'name': 'add'}, 'type': 'function', 'index': 0}], 'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 41, 'prompt_tokens': 280, 'total_tokens': 321, 'completion_tokens_details': {'accepted_prediction_tokens': None, 'audio_tokens': None, 'reasoning_tokens': 6, 'rejected_prediction_tokens': None, 'image_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0, 'video_tokens': 0}, 'cost': 0, 'is_byok': False, 'cost_details': {'upstream_inference_cost': None, 'upstream_inference_prompt_cost': 0, 'upstream_inference_completions_cost': 0}}, 'model_name': 'openai/gpt-oss-120b:free', 'system_fingerprint': None, 'id': 'gen-1766228607-nD1BRZwtOmqWb82T8mqS', 'service_tier': None, 'finish_reason': 'tool_calls', 'logprobs': None}, id='run--7db5ae84-80f1-4c4d-b644-8f246d41aef7-0', tool_calls=[{'name': 'add'

Because we set `tool_choice = "any"` to force the tool output, the usual `content` field will be empty as that field is used for natural language output, ie the final answer of the LLM. To find our tool output, we need to look at the `tool_calls` field:

In [106]:
tool_call.tool_calls

[{'name': 'add',
  'args': {'x': 10, 'y': 10},
  'id': 'chatcmpl-tool-89cc7030bfac4d8d',
  'type': 'tool_call'}]

From here, we have the tool name that our LLM wants to use and the `args` that it wants to pass to that tool. We can see that the tool `add` is being used with the arguments `x = 10` and `y = 10`. The agent.invoke method has not executed the tool function; we need to write that part of the agent code ourselves.

Executing the tool code requires two steps:
- Map the tool name to the tool function.
- Execute the tool function with the generated args.

In [107]:
# create tool name to function mapping
name2tool = {tool.name: tool.func for tool in tools}

Now execute to get our answer:

In [108]:
tool_exec_content = name2tool[tool_call.tool_calls[0]["name"]](
    **tool_call.tool_calls[0]["args"]
)
tool_exec_content

20

That is our answer and tool execution logic. We feed this back into our LLM via the `agent_scratchpad` placeholder.

In [109]:
from langchain_core.messages import ToolMessage

tool_exec = ToolMessage(
    content=f"The {tool_call.tool_calls[0]['name']} tool returned {tool_exec_content}",
    tool_call_id=tool_call.tool_calls[0]["id"]
)

out = agent.invoke({
    "input": "What is 10 + 10",
    "chat_history": [],
    "agent_scratchpad": [tool_call, tool_exec]
})
out

AIMessage(content='The sum of 10\u202f+\u202f10 is **20**.', additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 33, 'prompt_tokens': 318, 'total_tokens': 351, 'completion_tokens_details': {'accepted_prediction_tokens': None, 'audio_tokens': None, 'reasoning_tokens': 8, 'rejected_prediction_tokens': None, 'image_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0, 'video_tokens': 0}, 'cost': 0, 'is_byok': False, 'cost_details': {'upstream_inference_cost': None, 'upstream_inference_prompt_cost': 0, 'upstream_inference_completions_cost': 0}}, 'model_name': 'openai/gpt-oss-120b:free', 'system_fingerprint': None, 'id': 'gen-1766228610-bnhPdWmnh76B8ZeKau6I', 'service_tier': None, 'finish_reason': 'stop', 'logprobs': None}, id='run--10a2e034-1a90-48d6-9710-1cd4804fff03-0', usage_metadata={'input_tokens': 318, 'output_tokens': 33, 'total_tokens': 351, 'input_token_details': {'audio': 0, 'cache_read': 0}, 'output_token_details':

Despite having the answer in our `agent_scratchpad`, the LLM still tries to use the tool again. This behaviour happens because we bonded the tools to the LLM with `tool_choice = "any"`. When we set `tool_choice` to `"any"` or `"required"`, we tell the LLM that it MUST use a tool, i.e., it cannot provide a final answer.

There's two options to fix this:
- Set `tool_choice = "auto"`to tell the LLM that it can choose to use a tool or provide a final answer.
- Create a `final_answer` tool - we'll explain this shortly.


In [110]:
# Option 1
agent: RunnableSerializable = (
    {
        "input": lambda x: x["input"],
        "chat_history": lambda x: x["chat_history"],
        "agent_scratchpad": lambda x: x.get("agent_scratchpad", [])
    }
    | prompt
    | llm.bind_tools(tools, tool_choice="auto")
)

We'll start from the start again, so `agent_scratchpad` is empty:

In [111]:
tool_call = agent.invoke({"input": "What is 10 + 10", "chat_history": []})
tool_call

AIMessage(content='', additional_kwargs={'tool_calls': [{'id': 'chatcmpl-tool-935c299ff56bde01', 'function': {'arguments': '{\n  "x": 10,\n  "y": 10\n}', 'name': 'add'}, 'type': 'function', 'index': 0}], 'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 41, 'prompt_tokens': 280, 'total_tokens': 321, 'completion_tokens_details': {'accepted_prediction_tokens': None, 'audio_tokens': None, 'reasoning_tokens': 6, 'rejected_prediction_tokens': None, 'image_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0, 'video_tokens': 0}, 'cost': 0, 'is_byok': False, 'cost_details': {'upstream_inference_cost': None, 'upstream_inference_prompt_cost': 0, 'upstream_inference_completions_cost': 0}}, 'model_name': 'openai/gpt-oss-120b:free', 'system_fingerprint': None, 'id': 'gen-1766228616-wTgkfnShniLO3jPqGERF', 'service_tier': None, 'finish_reason': 'tool_calls', 'logprobs': None}, id='run--3ed6c08b-0766-4de8-9eb0-cc51c1c9d8e1-0', tool_calls=[{'name': 'add'

In [112]:
tool_output = name2tool[tool_call.tool_calls[0]["name"]](
    **tool_call.tool_calls[0]["args"]
)

tool_exec = ToolMessage(
    content=f"The {tool_call.tool_calls[0]['name']} tool returned {tool_output}",
    tool_call_id=tool_call.tool_calls[0]["id"]
)

out = agent.invoke({
    "input": "What is 10 + 10",
    "chat_history": [],
    "agent_scratchpad": [tool_call, tool_exec]
})
out

AIMessage(content='10\u202f+\u202f10\u202f=\u202f20.', additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 28, 'prompt_tokens': 318, 'total_tokens': 346, 'completion_tokens_details': {'accepted_prediction_tokens': None, 'audio_tokens': None, 'reasoning_tokens': 7, 'rejected_prediction_tokens': None, 'image_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0, 'video_tokens': 0}, 'cost': 0, 'is_byok': False, 'cost_details': {'upstream_inference_cost': None, 'upstream_inference_prompt_cost': 0, 'upstream_inference_completions_cost': 0}}, 'model_name': 'openai/gpt-oss-120b:free', 'system_fingerprint': None, 'id': 'gen-1766228619-ztovFTteiTFWZCSVbWv2', 'service_tier': None, 'finish_reason': 'stop', 'logprobs': None}, id='run--02135d4f-be47-41ba-9b67-562747f2a20f-0', usage_metadata={'input_tokens': 318, 'output_tokens': 28, 'total_tokens': 346, 'input_token_details': {'audio': 0, 'cache_read': 0}, 'output_token_details': {'rea

We now have the final answer in the content field! This method is perfectly functional; however, we recommend option 2 as it provides more control over the agent's output.

There are several reasons that option 2 can provide more control, those are:
- It removes the possibility of an agent using the direct content field when it is not appropriate; for example, some LLMs (particularly smaller ones) may try to use the content field when using a tool.
- We can enforce a specific structured output in our answers. Structured outputs are handy when we require particular fields for downstream code or multi-part answers. For example, a RAG agent may return a natural language answer and a list of sources used to generate that answer.

To implement option 2, we must create a `final_answer` tool. We will add a `tools_used` field to give our output some structure‚Äîin a real-world use case, we probably wouldn't want to generate this field, but it's useful for our example here.


In [113]:
@tool
def final_answer(answer: str, tools_used: list[str]) -> str:
    """Use this tool to provide a final answer to the user.
    The answer should be in natural language as this will be provided
    to the user directly. The tools_used must include a list of tool
    names that were used within the `scratchpad`.
    """
    return {"answer": answer, "tools_used": tools_used}

Our `final_answer` tool doesn't necessarily need to do anything; in this example, we're using it purely to structure our final response. We can now add this tool to our agent:

In [114]:
tools = [final_answer, add, subtract, multiply, exponentiate]

# we need to update our name2tool mapping too
name2tool = {tool.name: tool.func for tool in tools}

agent: RunnableSerializable = (
    {
        "input": lambda x: x["input"],
        "chat_history": lambda x: x["chat_history"],
        "agent_scratchpad": lambda x: x.get("agent_scratchpad", [])
    }
    | prompt
    | llm.bind_tools(tools, tool_choice = "any")  # we're forcing tool use again
)

In [115]:
tool_call = agent.invoke({"input": "What is 10 + 10", "chat_history": []})
tool_call.tool_calls

[{'name': 'add',
  'args': {'x': 10, 'y': 10},
  'id': 'chatcmpl-tool-976b52be56fde777',
  'type': 'tool_call'}]

We execute the tool and provide it's output to the agent again:

In [116]:
tool_out = name2tool[tool_call.tool_calls[0]["name"]](
    **tool_call.tool_calls[0]["args"]
)

tool_exec = ToolMessage(
    content=f"The {tool_call.tool_calls[0]['name']} tool returned {tool_out}",
    tool_call_id=tool_call.tool_calls[0]["id"]
)

out = agent.invoke({
    "input": "What is 10 + 10",
    "chat_history": [],
    "agent_scratchpad": [tool_call, tool_exec]
})
out

AIMessage(content='{\n  "answer": "10 + 10 = 20.",\n  "tools_used": ["add"]\n}', additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 48, 'prompt_tokens': 393, 'total_tokens': 441, 'completion_tokens_details': {'accepted_prediction_tokens': None, 'audio_tokens': None, 'reasoning_tokens': 12, 'rejected_prediction_tokens': None, 'image_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0, 'video_tokens': 0}, 'cost': 0, 'is_byok': False, 'cost_details': {'upstream_inference_cost': None, 'upstream_inference_prompt_cost': 0, 'upstream_inference_completions_cost': 0}}, 'model_name': 'openai/gpt-oss-120b:free', 'system_fingerprint': None, 'id': 'gen-1766228633-kgbuBpQNvdbmri1Ex1Ga', 'service_tier': None, 'finish_reason': 'stop', 'logprobs': None}, id='run--eb4fc8ee-3aea-4d2e-ba9e-45e93140a410-0', usage_metadata={'input_tokens': 393, 'output_tokens': 48, 'total_tokens': 441, 'input_token_details': {'audio': 0, 'cache_read': 0}, 'o

### Full Custom Agent Execution Loop

In [122]:
from langchain_core.messages import BaseMessage, HumanMessage, AIMessage


class CustomAgentExecutor:
    chat_history: list[BaseMessage]

    def __init__(self, max_iterations: int = 3):
        self.chat_history = []
        self.max_iterations = max_iterations
        self.agent: RunnableSerializable = (
            {
                "input": lambda x: x["input"],
                "chat_history": lambda x: x["chat_history"],
                "agent_scratchpad": lambda x: x.get("agent_scratchpad", [])
            }
            | prompt
            | llm.bind_tools(tools, tool_choice="any")  # we're forcing tool use again
        )

    def invoke(self, input: str) -> dict:
        # invoke the agent but we do this iteratively in a loop until
        # reaching a final answer
        count = 0
        agent_scratchpad = []
        while count < self.max_iterations:
            # invoke a step for the agent to generate a tool call
            tool_call = self.agent.invoke({
                "input": input,
                "chat_history": self.chat_history,
                "agent_scratchpad": agent_scratchpad
            })
            # add initial tool call to scratchpad
            agent_scratchpad.append(tool_call)
            # otherwise we execute the tool and add it's output to the agent scratchpad
            tool_name = tool_call.tool_calls[0]["name"]
            tool_args = tool_call.tool_calls[0]["args"]
            tool_call_id = tool_call.tool_calls[0]["id"]
            tool_out = name2tool[tool_name](**tool_args)
            # add the tool output to the agent scratchpad
            tool_exec = ToolMessage(
                content=f"{tool_out}",
                tool_call_id=tool_call_id
            )
            agent_scratchpad.append(tool_exec)
            # add a print so we can see intermediate steps
            print(f"{count}: {tool_name}({tool_args})")
            count += 1
            # if the tool call is the final answer tool, we stop
            if tool_name == "final_answer":
                break
        # add the final output to the chat history
        final_answer = tool_out["answer"]
        self.chat_history.extend([
            HumanMessage(content=input),
            AIMessage(content=final_answer)
        ])
        # return the final answer in dict form
        return json.dumps(tool_out)

In [123]:
agent_executor = CustomAgentExecutor()

In [124]:
agent_executor.invoke(input = "What is 10 + 10")

0: add({'x': 10, 'y': 10})
1: final_answer({'answer': '10 + 10 = 20.', 'tools_used': ['add']})


'{"answer": "10 + 10 = 20.", "tools_used": ["add"]}'

# LangChains Expression Language

### Traditional Chains vs LCEL

In this section we're going to dive into a basic example using the traditional method for building chains before jumping into LCEL. We will build a pipeline where the user must input a specific topic, and then the LLM will look and return a report on the specified topic. Generating a _research report_ for the user.

##### Traditional LLMChain to LCEL

The `LLMChain` is the simplest chain originally introduced in LangChain. This chain takes a prompt, feeds it into an LLM, and _optionally_ adds an output parsing step before returning the result.

Let's see how we construct this using the traditional method, for this we need:

* `prompt` ‚Äî a `PromptTemplate` that will be used to generate the prompt for the LLM.
* `llm` ‚Äî the LLM we will be using to generate the output.
* `output_parser` ‚Äî an optional output parser that will be used to parse the structured output of the LLM.

In [125]:
from langchain import PromptTemplate

prompt_template = "Give me a small report on {topic}"

prompt = PromptTemplate(
    input_variables = ["topic"],
    template = prompt_template
)

In [126]:
llm_out = llm.invoke("Hello there")
llm_out

AIMessage(content='Hello! How can I help you today?', additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 40, 'prompt_tokens': 71, 'total_tokens': 111, 'completion_tokens_details': {'accepted_prediction_tokens': None, 'audio_tokens': None, 'reasoning_tokens': 26, 'rejected_prediction_tokens': None, 'image_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0, 'video_tokens': 0}, 'cost': 0, 'is_byok': False, 'cost_details': {'upstream_inference_cost': None, 'upstream_inference_prompt_cost': 0, 'upstream_inference_completions_cost': 0}}, 'model_name': 'openai/gpt-oss-120b:free', 'system_fingerprint': None, 'id': 'gen-1766228752-dFIAaCgX0PG646hHWjnu', 'service_tier': None, 'finish_reason': 'stop', 'logprobs': None}, id='run--088ee9e4-e8f7-4f1d-8439-6b5830610ff4-0', usage_metadata={'input_tokens': 71, 'output_tokens': 40, 'total_tokens': 111, 'input_token_details': {'audio': 0, 'cache_read': 0}, 'output_token_details': {'reaso

Then we define our output parser, this will be used to parse the output of the LLM. In this case, we will use the `StrOutputParser` which will parse the `AIMessage` output from our LLM into a single string.

In [127]:
from langchain.schema.output_parser import StrOutputParser

output_parser = StrOutputParser()

In [128]:
out = output_parser.invoke(llm_out)
out

'Hello! How can I help you today?'

Through the `LLMChain` class we can place each of our components into a linear `chain`.

In [129]:
from langchain.chains import LLMChain

chain = LLMChain(prompt = prompt,
                llm = llm,
                output_parser = output_parser
                )

  chain = LLMChain(prompt=prompt, llm=llm, output_parser=output_parser)


Note that the `LLMChain` _was_ deprecated in LangChain `0.1.17`, the expected way of constructing these chains today is through LCEL, which we'll cover in a moment.

We can `invoke` our `chain`, providing a `topic` that we'd like to be researched.

In [130]:
result = chain.invoke("retrieval augmented generation")
result

{'topic': 'retrieval augmented generation',
 'text': '**Retrieval‚ÄëAugmented Generation (RAG): A Brief Report**\n\n---\n\n### 1. Overview  \nRetrieval‚ÄëAugmented Generation (RAG) is a hybrid AI architecture that couples a **retrieval module** (searching an external knowledge source) with a **generative language model**. The retriever supplies relevant documents or passages, and the generator conditions its output on this retrieved context, enabling the system to produce up‚Äëto‚Äëdate, factual, and domain‚Äëspecific responses that go beyond the static knowledge baked into the model‚Äôs parameters.\n\n---\n\n### 2. Core Components  \n\n| Component | Function | Typical Implementations |\n|-----------|----------|--------------------------|\n| **Retriever** | Finds the most relevant pieces of information from a large corpus (e.g., Wikipedia, internal documents, web index). | ‚Ä¢ Dense vector search (e.g., FAISS, ScaNN) using bi‚Äëencoders like DPR, Contriever, or Sentence‚ÄëTransformers.

We can view a formatted version of this output using the `Markdown` display:

In [131]:
display(Markdown(result["text"]))

**Retrieval‚ÄëAugmented Generation (RAG): A Brief Report**

---

### 1. Overview  
Retrieval‚ÄëAugmented Generation (RAG) is a hybrid AI architecture that couples a **retrieval module** (searching an external knowledge source) with a **generative language model**. The retriever supplies relevant documents or passages, and the generator conditions its output on this retrieved context, enabling the system to produce up‚Äëto‚Äëdate, factual, and domain‚Äëspecific responses that go beyond the static knowledge baked into the model‚Äôs parameters.

---

### 2. Core Components  

| Component | Function | Typical Implementations |
|-----------|----------|--------------------------|
| **Retriever** | Finds the most relevant pieces of information from a large corpus (e.g., Wikipedia, internal documents, web index). | ‚Ä¢ Dense vector search (e.g., FAISS, ScaNN) using bi‚Äëencoders like DPR, Contriever, or Sentence‚ÄëTransformers.<br>‚Ä¢ Sparse lexical search (BM25, Elasticsearch) for exact term matching.<br>‚Ä¢ Hybrid (dense‚ÄØ+‚ÄØsparse) retrieval for robustness. |
| **Reader/Generator** | Consumes the retrieved texts and produces a natural‚Äëlanguage answer or continuation. | ‚Ä¢ Large language models (LLMs) such as GPT‚Äë3/4, LLaMA, PaLM, or instruction‚Äëtuned variants.<br>‚Ä¢ Encoder‚Äëdecoder models (T5, BART) fine‚Äëtuned on QA or summarisation tasks. |
| **Fusion / Integration Layer** | Merges retrieved evidence with the model‚Äôs internal knowledge. | ‚Ä¢ Concatenation of passages to the prompt (prompt‚Äëengineering).<br>‚Ä¢ Cross‚Äëattention over retrieved documents (e.g., Fusion‚Äëin‚ÄëDecoder, RAG‚ÄëSequence/Token).<br>‚Ä¢ Knowledge‚Äëaware adapters or LoRA modules that bias the generator toward the evidence. |
| **Index & Update Mechanism** | Stores the external corpus and allows incremental updates. | ‚Ä¢ Vector databases (FAISS, Milvus, Weaviate).<br>‚Ä¢ Periodic re‚Äëembedding pipelines to keep the index fresh. |

---

### 3. Typical Workflow  

1. **Query Encoding** ‚Äì The user‚Äôs input is encoded into a dense vector (or a set of lexical terms).  
2. **Document Retrieval** ‚Äì The vector is used to retrieve *k* top‚Äëranked passages from the external index.  
3. **Context Construction** ‚Äì Retrieved passages are formatted (e.g., ‚ÄúDocument 1: ‚Ä¶‚Äù) and concatenated with the original query.  
4. **Conditional Generation** ‚Äì The LLM receives the combined prompt and generates a response, often with a **grounding loss** that penalizes hallucinations not supported by the retrieved text.  
5. **Post‚Äëprocessing** ‚Äì Optional steps include citation extraction, answer verification, or re‚Äëranking of multiple generated candidates.

---

### 4. Advantages  

| Benefit | Why It Matters |
|---------|----------------|
| **Improved factuality** | The model can cite up‚Äëto‚Äëdate sources, reducing hallucinations. |
| **Scalability of knowledge** | Adding new facts only requires updating the external index, not retraining the LLM. |
| **Domain adaptability** | Specialized corpora (legal, medical, corporate) can be plugged in without massive model fine‚Äëtuning. |
| **Interpretability** | Retrieved passages act as evidence that can be shown to users. |
| **Parameter efficiency** | Smaller LLMs can achieve performance comparable to larger models when paired with a strong retriever. |

---

### 5. Challenges & Open Issues  

| Challenge | Current Mitigations |
|-----------|----------------------|
| **Retriever quality** ‚Äì Poor recall leads to bad generations. | ‚Ä¢ Joint training of retriever and generator (e.g., RAG‚ÄëFine‚ÄëTuning).<br>‚Ä¢ Hybrid retrieval and query expansion. |
| **Prompt length limits** ‚Äì LLM context windows may truncate evidence. | ‚Ä¢ Passage selection/ranking, summarisation of retrieved docs, or use of long‚Äëcontext models (e.g., LLaMA‚Äë2‚Äë70B‚ÄëChat with 4‚Äë8‚ÄØk tokens). |
| **Hallucination despite evidence** ‚Äì Model may ignore retrieved text. | ‚Ä¢ Grounding losses, reinforcement learning from human feedback (RLHF) with evidence‚Äëaware rewards. |
| **Latency** ‚Äì Two‚Äëstage pipeline can be slower than pure generation. | ‚Ä¢ Approximate nearest‚Äëneighbor search, caching frequent queries, or distilling the retriever into a lightweight model. |
| **Evaluation** ‚Äì Measuring factual correctness and citation quality is non‚Äëtrivial. | ‚Ä¢ Benchmarks such as **Natural Questions**, **TriviaQA**, **KILT**, **MMLU‚ÄëRAG**, and metrics like **Exact Match**, **F1**, **Citation Recall**. |

---

### 6. Representative Applications  

| Domain | Example Use‚ÄëCase |
|--------|------------------|
| **Search Engines** | Bing Chat, Google Gemini‚Äôs ‚Äúsearch‚Äëaugmented‚Äù mode. |
| **Customer Support** | Knowledge‚Äëbase‚Äëdriven chatbots that pull policy documents in real time. |
| **Healthcare** | Clinical decision support that retrieves up‚Äëto‚Äëdate guidelines or research papers. |
| **Legal** | Contract analysis tools that cite statutes and case law. |
| **Education** | Tutoring assistants that reference textbooks or scholarly articles. |
| **Enterprise** | Internal Q&A over proprietary documents (e.g., Confluence, SharePoint). |

---

### 7. Emerging Trends (2023‚Äë2025)

1. **Multimodal RAG** ‚Äì Retrieval of images, tables, or code snippets alongside text, with multimodal generators (e.g., Flamingo‚ÄëRAG, LLaVA‚ÄëRAG).  
2. **Self‚ÄëRAG** ‚Äì Models that generate their own retrieval queries or synthesize intermediate ‚Äúknowledge‚Äù representations (e.g., ‚ÄúChain‚Äëof‚ÄëThought Retrieval‚Äù).  
3. **Dynamic Indexing** ‚Äì Real‚Äëtime ingestion pipelines that allow the system to reflect news or streaming data within seconds.  
4. **Instruction‚Äëtuned RAG** ‚Äì Fine‚Äëtuning LLMs on datasets where the answer must be explicitly grounded in retrieved citations (e.g., **RAG‚ÄëQA**, **OpenRAG**).  
5. **Privacy‚Äëpreserving Retrieval** ‚Äì Use of encrypted indexes or federated search to keep proprietary data confidential while still benefiting from RAG.  

---

### 8. Sample Implementation Blueprint (Python‚Äëstyle)

```python
# 1Ô∏è‚É£ Load a dense retriever (e.g., Sentence‚ÄëTransformer)
from sentence_transformers import CrossEncoder, SentenceTransformer
retriever = SentenceTransformer('facebook/dpr-ctx_encoder-single-nq-base')
index = faiss.read_index('my_corpus.index')   # pre‚Äëbuilt vector index

# 2Ô∏è‚É£ Encode the query
query = "What are the main causes of urban heat islands?"
q_vec = retriever.encode([query], normalize_embeddings=True)

# 3Ô∏è‚É£ Retrieve top‚Äëk passages
k = 5
D, I = index.search(q_vec, k)                # distances, ids
passages = [corpus[i] for i in I[0]]

# 4Ô∏è‚É£ Build the prompt
prompt = f"""Answer the question using only the information below.
Question: {query}
Context:
{chr(10).join([f"[{i+1}] {p}" for i, p in enumerate(passages)])}
Answer:"""

# 5Ô∏è‚É£ Generate with an LLM (e.g., OpenAI API)
import openai
response = openai.ChatCompletion.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.2,
    max_tokens=300,
)

print(response.choices[0].message.content)
```

*The code illustrates the classic ‚Äúretrieve‚Äëthen‚Äëgenerate‚Äù pattern; production systems typically add reranking, citation extraction, and safety filters.*

---

### 9. Key References (selected)

| Year | Citation |
|------|----------|
| 2020 | **Lewis et al.** ‚ÄúRetrieval‚ÄëAugmented Generation for Knowledge‚ÄëIntensive NLP Tasks‚Äù (RAG). |
| 2021 | **Karpukhin et al.** ‚ÄúDense Passage Retrieval for Open‚ÄëDomain Question Answering‚Äù. |
| 2022 | **Izacard & Grave** ‚ÄúLeveraging Passage Retrieval with Generative Models‚Äù. |
| 2023 | **Gao et al.** ‚ÄúSelf‚ÄëRAG: Generating Retrieval Queries from the Model Itself‚Äù. |
| 2024 | **Wang et al.** ‚ÄúMultimodal Retrieval‚ÄëAugmented Generation‚Äù. |
| 2025 | **OpenRAG** (open‚Äësource framework) ‚Äì https://github.com/openrag/openrag |

---

### 10. Take‚Äëaway Message  

Retrieval‚ÄëAugmented Generation bridges the gap between **static, parametric knowledge** and **dynamic, external information**. By grounding LLM outputs in retrieved evidence, RAG delivers more factual, up‚Äëto‚Äëdate, and explainable results while keeping model sizes manageable. Ongoing research focuses on tighter integration, multimodal evidence, and real‚Äëtime indexing‚Äîmaking RAG a cornerstone of next‚Äëgeneration AI assistants and enterprise knowledge tools.

That is a simple `LLMChain` using the traditional LangChain method. Now let's move onto LCEL.

**L**ang**C**hain **E**xpression **L**anguage (LCEL) is the recommended approach to building chains in LangChain. Having superceeded the traditional methods with `LLMChain`, etc. LCEL gives us a more flexible system for building chains. The pipe operator `|` is used by LCEL to _chain_ together components. Let's see how we'd construct an `LLMChain` using LCEL.

In [132]:
lcel_chain = prompt | llm | output_parser

We can `invoke` this chain in the same way as we did before:

In [133]:
result = lcel_chain.invoke("retrieval augmented generation")

display(Markdown(result))

RateLimitError: Error code: 429 - {'error': {'message': 'Rate limit exceeded: free-models-per-day. Add 5 credits to unlock 1000 free model requests per day', 'code': 429, 'metadata': {'headers': {'X-RateLimit-Limit': '50', 'X-RateLimit-Remaining': '0', 'X-RateLimit-Reset': '1766275200000'}, 'provider_name': None}}, 'user_id': 'user_2iO7ni5EqyszYlcKGCHsy3SBaE3'}

##### How Does the Pipe Operator Work?

Before moving onto other LCEL features, let's take a moment to understand what the pipe operator `|` is doing and _how_ it works.

Functionality wise, the pipe tells you that whatever the _left_ side outputs will be fed as input into the _right_ side. In the example of `prompt | llm | output_parser`, we see that `prompt` feeds into `llm` feeds into `output_parser`.

The pipe operator is a way of chaining together components, and is a way of saying that whatever the _left_ side outputs will be fed as input into the _right_ side.

Let's make a basic class named `Runnable` that will transform our a provided function into a _runnable_ class that we will then use with the pipe `|` operator.

In [134]:
class Runnable:
    def __init__(self, func):
        self.func = func
    def __or__(self, other):
        def chained_func(*args, **kwargs):
            return other.invoke(self.func(*args, **kwargs))
        return Runnable(chained_func)
    def invoke(self, *args, **kwargs):
        return self.func(*args, **kwargs)

With the `Runnable` class, we will be able wrap a function into the class, allowing us to then chain together multiple of these _runnable_ functions using the `__or__` method.

First, let's create a few functions that we'll chain together:

In [135]:
def add_five(x):
    return x + 5

def sub_five(x):
    return x - 5

def mul_five(x):
    return x * 5

Now we wrap our functions with the `Runnable`:

In [136]:
add_five_runnable = Runnable(add_five)
sub_five_runnable = Runnable(sub_five)
mul_five_runnable = Runnable(mul_five)

Finally, we can chain these together using the `__or__` method from the `Runnable` class:

In [137]:
chain = (add_five_runnable).__or__(sub_five_runnable).__or__(mul_five_runnable)

chain.invoke(3)

15

So we can see that we're able to chain together our functions using `__or__`. The pipe `|` operator is simply a shortcut for the `__or__` method, so we can create the exact same chain like so:

### LCEL `RunnableLambda`

The `RunnableLambda` class is LangChain's built-in method for constructing a _runnable_ object from a function. That is, it does the same thing as the custom `Runnable` class we created earlier. Let's try it out with the same functions as before.

In [138]:
from langchain_core.runnables import RunnableLambda

add_five_runnable = RunnableLambda(add_five)
sub_five_runnable = RunnableLambda(sub_five)
mul_five_runnable = RunnableLambda(mul_five)

We chain these together again with the pipe `|` operator:

In [139]:
chain = add_five_runnable | sub_five_runnable | mul_five_runnable

And call them using the `invoke` method:

In [140]:
chain.invoke(3)

15

Now we want to try something a little more testing, so this time we will generate a report, and we will try and edit that report using this functionallity.

In [141]:
prompt_str = "give me a small report about {topic}"
prompt = PromptTemplate(
    input_variables = ["topic"],
    template = prompt_str
)

In [142]:
chain = prompt | llm | output_parser

In [None]:
result = chain.invoke("AI")

display(Markdown(result))

Here we are making two functions, `extract_fact` to pull out the main content of our text and `replace_word` that will replace AI with Skynet!

In [143]:
def extract_fact(x):
    if "\n\n" in x:
        return "\n".join(x.split("\n\n")[1:])
    else:
        return x

old_word = "AI"
new_word = "skynet"

def replace_word(x):
    return x.replace(old_word, new_word)

In [144]:
extract_fact_runnable = RunnableLambda(extract_fact)
replace_word_runnable = RunnableLambda(replace_word)

In [145]:
chain = prompt | llm | output_parser | extract_fact_runnable | replace_word_runnable

In [None]:
result = chain.invoke("retrieval augmented generation")

display(Markdown(result))

Those are our `RunnableLambda` functions. It's worth noting that all inputs to these functions are expected to be a SINGLE arguments. If you have a function that accepts multiple arguments, you can input a dictionary with keys, then unpack them inside the function.

### LCEL `RunnableParallel` and `RunnablePassthrough`

LCEL provides us with various `Runnable` classes that allow us to control the flow of data and execution order through our chains. Two of these are `RunnableParallel` and `RunnablePassthrough`.

* `RunnableParallel` ‚Äî allows us to run multiple `Runnable` instances in parallel. Acting almost as a Y-fork in the chain.

* `RunnablePassthrough` ‚Äî allows us to pass through a variable to the next `Runnable` without modification.

To see these runnables in action, we will create two data sources, each source provides specific information but to answer the question we will need both to fed to the LLM.

In [146]:
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import DocArrayInMemorySearch

# If you're broke, use huggingface embeddings or find free alternatives
embedding = OpenAIEmbeddings()

vecstore_a = DocArrayInMemorySearch.from_texts(
    [
        "half the info is here",
        "DeepSeek-V3 was released in December 2024"
    ],
    embedding = embedding
)
vecstore_b = DocArrayInMemorySearch.from_texts(
    [
        "the other half of the info is here",
        "the DeepSeek-V3 LLM is a mixture of experts model with 671B parameters"
    ],
    embedding = embedding
)

  embedding = OpenAIEmbeddings()


ValidationError: 1 validation error for OpenAIEmbeddings
  Value error, Did not find openai_api_key, please add an environment variable `OPENAI_API_KEY` which contains it, or pass `openai_api_key` as a named parameter. [type=value_error, input_value={'model_kwargs': {}, 'cli...20, 'http_client': None}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.12/v/value_error

In [None]:
prompt_str = """Using the context provided, answer the user's question.
Context:
{context_a}
{context_b}
"""

In [None]:
prompt = ChatPromptTemplate.from_messages([
    SystemMessagePromptTemplate.from_template(prompt_str),
    HumanMessagePromptTemplate.from_template("{question}")
])

Here we are wrapping our vector stores as retrievers so they can be fitted into one big retrieval variable to be used by the prompt.

In [None]:
from langchain_core.runnables import RunnablePassthrough, RunnableParallel

retriever_a = vecstore_a.as_retriever()
retriever_b = vecstore_b.as_retriever()

retrieval = RunnableParallel(
    {
        "context_a": retriever_a, "context_b": retriever_b, "question": RunnablePassthrough()
    }
)

The chain we'll be constructing will look something like this:

![](https://github.com/aurelio-labs/langchain-course/blob/main/assets/lcel-flow.png?raw=1)

In [None]:
chain = retrieval | prompt | llm | output_parser

In [None]:
result = chain.invoke(
    "what architecture does the model DeepSeek released in december use?"
)

result

With that we've seen how we can use `RunnableParallel` and `RunnablePassthrough` to control the flow of data and execution order through our chains.

# Streaming

### Streaming with `astream`

We will start by creating a aysnc stream from our LLM. We do this within an `async for` loop, allowing us to iterate through the chunks of data and use them as soon as the async `astream` method returns the tokens to us. By adding a pipe character `|` we can see the individual tokens that are generated. We set `flush` equal to `True` as this forces immediate output to the console, resulting in smoother streaming.

In [None]:
from langchain_core.runnables import ConfigurableField

llm_streaming = ChatOpenAI(temperature = 0.0, 
                           model = openai_model, 
                           api_key = getenv("OPENROUTER_API_KEY"), 
                           base_url = "https://openrouter.ai/api/v1",
                           streaming = True).configurable_fields(
                               callbacks = ConfigurableField(
                                    id = "callbacks",
                                    name = "Callbacks",
                                    description = "The callbacks to use for streaming output",
                           ))

In [None]:
tokens = []
async for token in llm_streaming.astream("What is NLP?"):
    tokens.append(token)
    print(token.content, end="|", flush=True)

Since we appended each token to the `tokens` list, we can also see what is inside each and every token.

In [None]:
tokens[0]

We can also merge multiple `AIMessageChunk` objects together with the `+` operator, creating a larger set of tokens / chunk:

In [None]:
tokens[0] + tokens[1] + tokens[2] + tokens[3] + tokens[4]

A word of caution, there is nothing preventing you from merging tokens in the incorrect order

### Streaming with Agents

In [None]:
agent: RunnableSerializable = (
    {
        "input": lambda x: x["input"],
        "chat_history": lambda x: x["chat_history"],
        "agent_scratchpad": lambda x: x.get("agent_scratchpad", [])
    }
    | prompt
    | llm_streaming.bind_tools(tools, tool_choice="any")
)

Now, we will define our _custom_ callback handler. This will be a queue callback handler that will allow us to stream the output of the agent through an `asyncio.Queue` object and yield the tokens as they are generated elsewhere.

In [None]:
import asyncio
from langchain.callbacks.base import AsyncCallbackHandler


class QueueCallbackHandler(AsyncCallbackHandler):
    """Callback handler that puts tokens into a queue."""

    def __init__(self, queue: asyncio.Queue):
        self.queue = queue
        self.final_answer_seen = False

    async def __aiter__(self):
        while True:
            if self.queue.empty():
                await asyncio.sleep(0.1)
                continue
            token_or_done = await self.queue.get()

            if token_or_done == "<<DONE>>":
                # this means we're done
                return
            if token_or_done:
                yield token_or_done

    async def on_llm_new_token(self, *args, **kwargs) -> None:
        """Put new token in the queue."""
        #print(f"on_llm_new_token: {args}, {kwargs}")
        chunk = kwargs.get("chunk")
        if chunk:
            # check for final_answer tool call
            if tool_calls := chunk.message.additional_kwargs.get("tool_calls"):
                if tool_calls[0]["function"]["name"] == "final_answer":
                    # this will allow the stream to end on the next `on_llm_end` call
                    self.final_answer_seen = True
        await self.queue.put(chunk)
        return

    async def on_llm_end(self, *args, **kwargs) -> None:
        """Put None in the queue to signal completion."""
        #print(f"on_llm_end: {args}, {kwargs}")
        # this should only be used at the end of our agent execution, however LangChain
        # will call this at the end of every tool call, not just the final tool call
        # so we must only send the "done" signal if we have already seen the final_answer
        # tool call
        if self.final_answer_seen:
            await self.queue.put("<<DONE>>")
        else:
            await self.queue.put("<<STEP_END>>")
        return

We can see how this works together in our `agent` invocation:

In [None]:
queue = asyncio.Queue()
streamer = QueueCallbackHandler(queue)

tokens = []

async def stream(query: str):
    response = agent.with_config(
        callbacks=[streamer]
    )
    async for token in response.astream({
        "input": query,
        "chat_history": [],
        "agent_scratchpad": []
    }):
        tokens.append(token)
        print(token, flush=True)

await stream("What is 10 + 10")

In [None]:
tk = tokens[0]

for token in tokens[1:]:
    tk += token

tk

Now we're seeing that the output is being streamed token-by-token. Because we're being streamed a tool call the `content` field is empty. Instead, we can see the tokens being added inside the `tool_calls` fields, within `id`, `function.name`, and `function.arguments`.

In [None]:
from langchain_core.messages import ToolMessage

class CustomAgentExecutor:
    chat_history: list[BaseMessage]

    def __init__(self, max_iterations: int = 3):
        self.chat_history = []
        self.max_iterations = max_iterations
        self.agent: RunnableSerializable = (
            {
                "input": lambda x: x["input"],
                "chat_history": lambda x: x["chat_history"],
                "agent_scratchpad": lambda x: x.get("agent_scratchpad", [])
            }
            | prompt
            | llm.bind_tools(tools, tool_choice="any")  # we're forcing tool use again
        )

    async def invoke(self, input: str, streamer: QueueCallbackHandler, verbose: bool = False) -> dict:
        # invoke the agent but we do this iteratively in a loop until
        # reaching a final answer
        count = 0
        agent_scratchpad = []
        while count < self.max_iterations:
            # invoke a step for the agent to generate a tool call
            async def stream(query: str):
                response = self.agent.with_config(
                    callbacks = [streamer]
                )
                # we initialize the output dictionary that we will be populating with
                # our streamed output
                output = None
                # now we begin streaming
                async for token in response.astream({
                    "input": query,
                    "chat_history": self.chat_history,
                    "agent_scratchpad": agent_scratchpad
                }):
                    if output is None:
                        output = token
                    else:
                        # we can just add the tokens together as they are streamed and
                        # we'll have the full response object at the end
                        output += token
                    if token.content != "":
                        # we can capture various parts of the response object
                        if verbose: print(f"content: {token.content}", flush=True)
                    tool_calls = token.additional_kwargs.get("tool_calls")
                    if tool_calls:
                        if verbose: print(f"tool_calls: {tool_calls}", flush=True)
                        tool_name = tool_calls[0]["function"]["name"]
                        if tool_name:
                            if verbose: print(f"tool_name: {tool_name}", flush=True)
                        arg = tool_calls[0]["function"]["arguments"]
                        if arg != "":
                            if verbose: print(f"arg: {arg}", flush=True)
                return AIMessage(
                    content = output.content,
                    tool_calls = output.tool_calls,
                    tool_call_id = output.tool_calls[0]["id"]
                )

            tool_call = await stream(query = input)
            # add initial tool call to scratchpad
            agent_scratchpad.append(tool_call)
            # otherwise we execute the tool and add it's output to the agent scratchpad
            tool_name = tool_call.tool_calls[0]["name"]
            tool_args = tool_call.tool_calls[0]["args"]
            tool_call_id = tool_call.tool_call_id
            tool_out = name2tool[tool_name](**tool_args)
            # add the tool output to the agent scratchpad
            tool_exec = ToolMessage(
                content = f"{tool_out}",
                tool_call_id = tool_call_id
            )
            agent_scratchpad.append(tool_exec)
            count += 1
            # if the tool call is the final answer tool, we stop
            if tool_name == "final_answer":
                break
        # add the final output to the chat history, we only add the "answer" field
        final_answer = tool_out["answer"]
        self.chat_history.extend([
            HumanMessage(content = input),
            AIMessage(content = final_answer)
        ])
        # return the final answer in dict form
        return tool_args

agent_executor = CustomAgentExecutor()

We've added a few `print` statements to help us see what is being output, we activate those by setting `verbose=True`. Let's see what is returned:

In [None]:
queue = asyncio.Queue()
streamer = QueueCallbackHandler(queue)

out = await agent_executor.invoke("What is 10 + 10", streamer, verbose = True)

We can see what is being output through the `verbose=True` flag. However, if we do _not_ `print` the output, we will see nothing:

In [None]:
queue = asyncio.Queue()
streamer = QueueCallbackHandler(queue)

out = await agent_executor.invoke("What is 10 + 10", streamer)

Although we see nothing, it does not mean that nothing is being returned to us - we're just not using our callback handler and `asyncio.Queue`. To use these we create an `asyncio` task, iterate over the `__aiter__` method of our `streamer` object, and await the task, like so:

In [None]:
queue = asyncio.Queue()
streamer = QueueCallbackHandler(queue)

task = asyncio.create_task(agent_executor.invoke("What is 10 + 10", streamer))

async for token in streamer:
    print(token, flush=True)

await task

Although this seems like a lot of work, we're now streaming tokens in a way that allows us to pass these tokens on to other parts of our code - such as through a websocket, streamed API response, or some downstream processing.

Let's try this out, we'll put together some simple post-processing to allow us to more nicely format the streamed output from out agent.

In [None]:
queue = asyncio.Queue()
streamer = QueueCallbackHandler(queue)

task = asyncio.create_task(agent_executor.invoke("What is 10 + 10", streamer))

async for token in streamer:
    # first identify if we have a <<STEP_END>> token
    if token == "<<STEP_END>>":
        print("\n", flush=True)
    # we'll first identify if the token is a tool call
    elif tool_calls := token.message.additional_kwargs.get("tool_calls"):
        # if we have a tool call with a tool name, we'll print it
        if tool_name := tool_calls[0]["function"]["name"]:
            print(f"Calling {tool_name}...", flush=True)
        # if we have a tool call with arguments, we ad them to our args string
        if tool_args := tool_calls[0]["function"]["arguments"]:
            print(f"{tool_args}", end="", flush=True)

_ = await task

With that we've produced a nice streaming output within our notebook - which ofcourse can be applied with very similar logic elsewhere, such as within a more polished web app.