# LangChain Memory

The most common use case for building an AI application with LangChain is a chatbot built on a knowledge base. However, a chatbot built with only the `RetrievalQA` chain has no memory, so it cannot answer in a conversational manner.

For example, consider the following exchange:

**Human** : How to add a video funnel

**AI**    : If you want to set up a video funnel, these are the steps you need to follow:

Step 1- First of all, go to the Marketing tab at the video level.

Step 2- Click on the “Add new funnel” button.

Step 3- Set the title to grab the users’ attention.

Step 4- Set the text for the action that you want users to take.

Step 5- Select the video for the Funnel.

Step 6- Set the start time at which the pop-up would appear to the user.

Step 7- Once you are satisfied with all the things, click on the “Save Changes” button.

**Human** : Now tell only 5 steps from above

**AI**    : Sorry, I am unable to answer. Can you provide more context?

As you can see, the chatbot has no memory and is therefore unable to answer follow-up questions. To solve this, we will use memory.


Memory allows a Large Language Model (LLM) to remember previous interactions with the user. By default, LLMs are stateless, meaning each incoming query is processed independently of other interactions. The only thing that exists for a stateless agent is the current input, nothing else.

There are many applications where remembering previous interactions is very important, such as chatbots. Conversational memory allows us to do that.

There are several ways that we can implement conversational memory. In the context of LangChain, they are all built on top of the `ConversationChain`.

### Types of Memory

Let's discuss some of the most popular memory types available in LangChain.

1. ConversationBufferMemory
2. ConversationSummaryMemory
3. ConversationBufferWindowMemory
4. ConversationSummaryBufferMemory

We will discuss each of these with an example.

### ConversationBufferMemory

This is a simple method that stores every chat interaction directly in the buffer. Although it gives good results, it has a few drawbacks:

1. Because every message is stored, the amount of data sent to the API grows with each turn, which results in higher costs and slower responses.
2. ChatGPT models have an input context limit, which a conversation can exceed after enough messages, resulting in an error.

Let's see how it works with an example. We will add memory to a chain.

In [143]:
from langchain.llms import OpenAI
from langchain_openai import ChatOpenAI
from langchain.chains import ConversationChain
from langchain.memory import ConversationBufferMemory, ConversationSummaryMemory, ConversationBufferWindowMemory, ConversationSummaryBufferMemory

from langchain.callbacks import get_openai_callback

import os

In [10]:
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())

OPENAI_API_KEY  = os.getenv('OPENAI_API_KEY')
HUGGINGFACEHUB_API_TOKEN = os.getenv('HUGGINGFACEHUB_API_TOKEN')

In [16]:
llm = ChatOpenAI(api_key=OPENAI_API_KEY,temperature=0)

conversation = ConversationChain(
    llm=llm,
    verbose=True,
    memory=ConversationBufferMemory()
)

We can see the prompt template used by the `ConversationChain` like so:

In [20]:
print(conversation.prompt.template)

The following is a friendly conversation between a human and an AI. The AI is talkative and provides lots of specific details from its context. If the AI does not know the answer to a question, it truthfully says it does not know.

Current conversation:
{history}
Human: {input}
AI:


Here, the prompt primes the model by telling it that the following is a conversation between a human (us) and an AI (`gpt-3.5-turbo`). The prompt attempts to reduce hallucinations (where a model makes things up) by stating:

**"If the AI does not know the answer to a question, it truthfully says it does not know."**

This can help but does not solve the problem of hallucinations; we will save that topic for a future chapter.

Following the initial prompt, we see two parameters: `{history}` and `{input}`. The `{input}` is where we’d place the latest human query.

The `{history}` is where conversational memory is used. Here, we feed in information about the conversation history between the human and AI.

These two parameters — `{history}` and `{input}` — are passed to the LLM within the prompt template we just saw, and the output that we (hopefully) return is simply the predicted continuation of the conversation.
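To make the role of the two slots concrete, here is a minimal sketch in plain Python that fills the template printed above. It does not call LangChain or any LLM; `build_prompt` is a hypothetical helper introduced only for illustration, and the history string is a made-up example.

```python
# Template text copied from the ConversationChain prompt printed above.
TEMPLATE = (
    "The following is a friendly conversation between a human and an AI. "
    "The AI is talkative and provides lots of specific details from its context. "
    "If the AI does not know the answer to a question, it truthfully says it "
    "does not know.\n\n"
    "Current conversation:\n"
    "{history}\n"
    "Human: {input}\n"
    "AI:"
)


def build_prompt(history: str, user_input: str) -> str:
    """Hypothetical helper: fill the two slots to get the text sent to the LLM."""
    return TEMPLATE.format(history=history, input=user_input)


# Made-up history for illustration only.
history = "Human: Good morning AI!\nAI: Good morning! How can I assist you today?"
prompt = build_prompt(history, "What is my aim again?")
print(prompt)
```

The prompt ends with `AI:`, so the model's continuation is read as the AI's next reply.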



In [29]:
conversation_buf = ConversationChain(
    llm=llm,
    memory=ConversationBufferMemory()
)

In [33]:
conversation_buf.invoke("Good morning AI!")

{'input': 'Good morning AI!',
 'history': 'Human: Good morning AI!\nAI: Good morning! How are you today?',
 'response': 'Good morning! How can I assist you today?'}

We return the first response from the conversational agent. Let’s continue the conversation, writing prompts that the LLM can only answer if it considers the conversation history. We also add a `count_tokens` function so we can see how many tokens are being used by each interaction.

In [38]:
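def count_tokens(chain, query):
    """Run the chain on a query and print the OpenAI tokens used by the call."""
    with get_openai_callback() as cb:
        # cb accumulates token usage for OpenAI calls made inside this block
        result = chain.run(query)
        print(f'Spent a total of {cb.total_tokens} tokens')

    return result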

In [44]:
count_tokens(
    conversation_buf, 
    "My interest here is to explore the potential of integrating Large Language Models with external knowledge"
)

Spent a total of 328 tokens


"As I mentioned earlier, integrating Large Language Models with external knowledge sources can greatly enhance their capabilities. One approach could be to use knowledge graphs or structured data to provide context and background information for the text generated by the model. This could help improve the accuracy and relevance of the responses. Another approach could be to use APIs to access real-time data from sources like news websites or social media platforms, allowing the model to generate up-to-date and relevant content. Overall, there are many possibilities for integrating Large Language Models with external knowledge, and it's an exciting area to explore!"

In [46]:
count_tokens(
    conversation_buf,
    "I just want to analyze the different possibilities. What can you think of?"
)

Spent a total of 470 tokens


"There are several possibilities for integrating Large Language Models with external knowledge sources. One option is to use pre-existing knowledge graphs or databases to provide additional context for the text generated by the model. This could help improve the accuracy and relevance of the responses. Another option is to use APIs to access real-time data from sources like news websites or social media platforms, allowing the model to generate up-to-date and relevant content. Additionally, you could explore using domain-specific knowledge bases or ontologies to enhance the model's understanding of specific topics. These are just a few ideas, and there are many other possibilities to consider as well."

In [48]:
count_tokens(
    conversation_buf, 
    "What is my aim again?"
)

Spent a total of 510 tokens


'Your aim is to explore the potential of integrating Large Language Models with external knowledge sources to enhance their capabilities and generate more accurate and contextually relevant responses.'

The LLM can clearly remember the history of the conversation. Let’s take a look at how this conversation history is stored by the ConversationBufferMemory:

In [51]:
print(conversation_buf.memory.buffer)

Human: Good morning AI!
AI: Good morning! How are you today?
Human: Good morning AI!
AI: Good morning! How can I assist you today?
Human: My interest here is to explore the potential of integrating Large Language Models with external knowledge
AI: That's a fascinating topic! Large Language Models like GPT-3 have shown great potential in generating human-like text, but integrating them with external knowledge sources could enhance their capabilities even further. By connecting these models to databases, websites, or other sources of information, they could provide more accurate and contextually relevant responses. Do you have any specific ideas or questions about how to approach this integration?
Human: My interest here is to explore the potential of integrating Large Language Models with external knowledge
AI: As I mentioned earlier, integrating Large Language Models with external knowledge sources can greatly enhance their capabilities. One approach could be to use knowledge graphs or

We can see that the buffer saves every interaction in the chat history directly. There are a few pros and cons to this approach. In short, they are:

<table>
    <thead>
        <tr>
            <th>Pros</th>
            <th>Cons</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>Storing everything gives the LLM the maximum amount of information</td>
            <td>More tokens mean slowing response times and higher costs</td>
        </tr>
        <tr>
            <td>Storing everything is simple and intuitive</td>
            <td>Long conversations cannot be remembered as we hit the LLM token limit (4096 tokens for text-davinci-003 and gpt-3.5-turbo)</td>
        </tr>
    </tbody>
</table>


The `ConversationBufferMemory` is an excellent option to get started with but is limited by the storage of every interaction. Let’s take a look at other options that help remedy this.
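The idea behind buffer memory can be sketched in a few lines of plain Python. This is not LangChain's implementation: `ToyBufferMemory` is a hypothetical class for illustration, and the whitespace word count is only a rough stand-in for real token counting.

```python
class ToyBufferMemory:
    """Hypothetical stand-in for buffer memory: keeps every turn verbatim."""

    def __init__(self):
        self.turns = []  # list of (human, ai) pairs, oldest first

    def save(self, human, ai):
        """Append one completed exchange; nothing is ever dropped."""
        self.turns.append((human, ai))

    @property
    def buffer(self):
        """Render the full history in the Human:/AI: format used by the prompt."""
        return "\n".join(f"Human: {h}\nAI: {a}" for h, a in self.turns)


memory = ToyBufferMemory()
memory.save("Good morning AI!", "Good morning! How can I assist you today?")
size_after_one = len(memory.buffer.split())
memory.save("What is my aim again?", "You have not told me your aim yet.")
size_after_two = len(memory.buffer.split())

print(memory.buffer)
print(size_after_one, size_after_two)  # the history only ever grows
```

Because the whole buffer is inserted into `{history}` on every call, the prompt grows with each exchange, which is the cost described in the table above.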

### ConversationSummaryMemory
Using `ConversationBufferMemory`, we quickly use a lot of tokens and can exceed the context window limit of even the most advanced LLMs available today.

To avoid excessive token usage, we can use `ConversationSummaryMemory`. As the name suggests, this form of memory summarizes the conversation history before it is passed to the `{history}` parameter.

We initialize the `ConversationChain` with the summary memory like so:

In [75]:
conversation_sum = ConversationChain(
    llm=llm,
    memory=ConversationSummaryMemory(llm=llm)
)

When using `ConversationSummaryMemory`, we need to pass an LLM to the object because the summarization is powered by an LLM. We can see the prompt used to do this here:

In [78]:
print(conversation_sum.memory.prompt.template)

Progressively summarize the lines of conversation provided, adding onto the previous summary returning a new summary.

EXAMPLE
Current summary:
The human asks what the AI thinks of artificial intelligence. The AI thinks artificial intelligence is a force for good.

New lines of conversation:
Human: Why do you think artificial intelligence is a force for good?
AI: Because artificial intelligence will help humans reach their full potential.

New summary:
The human asks what the AI thinks of artificial intelligence. The AI thinks artificial intelligence is a force for good because it will help humans reach their full potential.
END OF EXAMPLE

Current summary:
{summary}

New lines of conversation:
{new_lines}

New summary:


Using this, we can summarize every new interaction and append it to a “running summary” of all past interactions. Let’s have another conversation utilizing this approach.
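The running-summary loop can be sketched without an LLM. In the sketch below, `toy_summarizer` is a hypothetical stand-in for the summarization call (it simply clips each line to its first eight words), and `ToySummaryMemory` is an illustrative class, not LangChain's implementation. The AI reply in the second exchange is made up.

```python
def toy_summarizer(summary, new_lines):
    """Stand-in for the LLM call: clip each new line to its first 8 words."""
    clipped = [" ".join(line.split()[:8]) for line in new_lines.splitlines()]
    return " | ".join(([summary] if summary else []) + clipped)


class ToySummaryMemory:
    """Hypothetical sketch: keep one running summary instead of raw turns."""

    def __init__(self, summarize):
        self.summarize = summarize  # callable(summary, new_lines) -> new summary
        self.summary = ""

    def save(self, human, ai):
        """Fold the latest exchange into the running summary."""
        new_lines = f"Human: {human}\nAI: {ai}"
        self.summary = self.summarize(self.summary, new_lines)


memory = ToySummaryMemory(toy_summarizer)
memory.save("Good morning AI!", "Good morning! How are you today?")
memory.save(
    "My interest here is to explore the potential of integrating "
    "Large Language Models with external knowledge",
    "That's a fascinating topic, and there are many ways to approach it.",
)
print(memory.summary)
```

Passing the summarizer in as a callable mirrors the fact that the real memory needs an LLM: the quality of what is remembered depends entirely on that function.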

In [81]:
# without count_tokens we'd call `conversation_sum("Good morning AI!")`
# but let's keep track of our tokens:
count_tokens(
    conversation_sum, 
    "Good morning AI!"
)

Spent a total of 251 tokens


'Good morning! How are you today?'

In [83]:
count_tokens(
    conversation_sum, 
    "My interest here is to explore the potential of integrating Large Language Models with external knowledge"
)

Spent a total of 565 tokens


"That's a fascinating topic! Large Language Models like GPT-3 have shown great potential in generating human-like text, but integrating external knowledge could take their capabilities to the next level. By combining the vast amount of information stored in external knowledge bases with the language model's ability to generate text, we could create AI systems that are even more intelligent and capable of understanding complex concepts. Have you come across any specific approaches or research in this area that you find particularly interesting?"

In [85]:
count_tokens(
    conversation_sum, 
    "I just want to analyze the different possibilities. What can you think of?"
)

Spent a total of 754 tokens


'There are several approaches to integrating external knowledge with language models like GPT-3. One common method is to use knowledge graphs, which organize information in a structured way that can be easily integrated into the model. Another approach is to use pre-trained embeddings that capture the relationships between words and concepts, allowing the model to access external knowledge more effectively. Additionally, some researchers are exploring the use of reinforcement learning to help language models learn how to incorporate external knowledge during the training process. These are just a few examples of the exciting possibilities in this area!'

In [87]:
count_tokens(
    conversation_sum, 
    "Which data source types could be used to give context to the model?"
)

Spent a total of 822 tokens


"There are several data source types that could be used to give context to the model. Some common ones include structured data from databases, unstructured data from text documents or websites, knowledge graphs that represent relationships between entities, and even real-time data streams for up-to-date information. Each type of data source can provide valuable context to help enhance the language model's understanding and generate more accurate responses."

In [89]:
count_tokens(
    conversation_sum, 
    "What is my aim again?"
)

Spent a total of 847 tokens


'Your aim is to explore the potential of integrating Large Language Models with external knowledge to enhance their capabilities.'

In this case, the summary contains enough information for the LLM to “remember” our original aim. We can see this summary in its raw form like so:

In [92]:
print(conversation_sum.memory.buffer)

The human greets the AI with a "Good morning." The AI responds in kind and asks how the human is feeling today. The human expresses interest in exploring the potential of integrating Large Language Models with external knowledge. The AI finds this topic fascinating and discusses how combining external knowledge with language models like GPT-3 could enhance their capabilities. The AI asks the human if they have encountered any specific approaches or research in this area that they find intriguing. The human wants to analyze the different possibilities and the AI mentions several approaches, including using knowledge graphs, pre-trained embeddings, and reinforcement learning to enhance language models with external knowledge. The human asks which data source types could be used to give context to the model, and the AI explains that structured data from databases, unstructured data from text documents or websites, knowledge graphs, and real-time data streams can all provide valuable conte

The number of tokens being used for this conversation is greater than when using the `ConversationBufferMemory`, so is there any advantage to using `ConversationSummaryMemory` over the buffer memory?

<div>
<img src="https://education-team-2020.s3.eu-west-1.amazonaws.com/ai-eng/images-langchain-memory-rag/token_interaction.webp" alt='auto' width="1000"/>
</div>

Token count (y-axis) for the buffer memory vs. summary memory as the number of interactions (x-axis) increases.

For longer conversations, yes. As shown above, the summary memory initially uses far more tokens. However, as the conversation progresses, the token count of the summarization approach grows more slowly. In contrast, the buffer memory continues to grow linearly with the number of tokens in the chat.
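The crossover can be illustrated with a toy cost model. All three constants below are assumptions chosen for illustration, not measurements from the chart, and a real summary does not necessarily grow as a fixed fraction of the history.

```python
# All three constants are illustrative assumptions, not measurements.
TOKENS_PER_TURN = 100    # assumed size of one human + AI exchange
SUMMARY_OVERHEAD = 300   # assumed fixed cost of the summarization prompt
SUMMARY_RATIO = 0.25     # assumed fraction of the raw history kept in the summary


def buffer_tokens(turns):
    """Buffer memory resends every earlier exchange verbatim."""
    return turns * TOKENS_PER_TURN


def summary_tokens(turns):
    """Summary memory pays a fixed overhead plus a compressed history."""
    return SUMMARY_OVERHEAD + int(turns * TOKENS_PER_TURN * SUMMARY_RATIO)


for turns in (1, 4, 10, 27):
    print(turns, buffer_tokens(turns), summary_tokens(turns))
```

Under these assumptions, the summary memory is more expensive for the first few exchanges and cheaper afterwards, which matches the shape of the curves described above.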

We can summarize the pros and cons of `ConversationSummaryMemory` as follows:

<table>
    <tr>
        <th>Pros</th>
        <th>Cons</th>
    </tr>
    <tr>
        <td>Shortens the number of tokens for long conversations.</td>
        <td>Can result in higher token usage for smaller conversations.</td>
    </tr>
    <tr>
        <td>Enables much longer conversations.</td>
        <td>Memorization of the conversation history is wholly reliant on the summarization ability of the intermediate summarization LLM.</td>
    </tr>
    <tr>
        <td>Relatively straightforward implementation, intuitively simple to understand.</td>
        <td>Also requires token usage for the summarization LLM; this increases costs (but does not limit conversation length).</td>
    </tr>
</table>


Conversation summarization is a good approach for cases where long conversations are expected. Yet, it is still fundamentally limited by token limits: after enough interactions, the summary itself can exceed the context window. Whether this happens in practice depends on the context size of the LLM being used.

### ConversationBufferWindowMemory
The `ConversationBufferWindowMemory` acts in the same way as our earlier “buffer memory” but adds a window to the memory, meaning that we only keep a given number of past interactions before “forgetting” them. We use it like so:

In [114]:
conversation_bufw = ConversationChain(
    llm=llm,
    memory=ConversationBufferWindowMemory(k=1)
)

In this instance, we set `k=1` — this means the window will remember the single latest interaction between the human and AI. That is the latest human response and the latest AI response. We can see the effect of this below:

In [117]:
count_tokens(
    conversation_bufw, 
    "Good morning AI!"
)

Spent a total of 75 tokens


'Good morning! How are you today?'

In [119]:
count_tokens(
    conversation_bufw, 
    "My interest here is to explore the potential of integrating Large Language Models with external knowledge"
)

Spent a total of 185 tokens


"That's a fascinating topic! Large Language Models like GPT-3 have shown great potential in generating human-like text, but integrating them with external knowledge sources could enhance their capabilities even further. By combining the vast amount of information stored in external knowledge bases with the language model's natural language processing abilities, we could potentially create AI systems that are more knowledgeable and contextually aware. Do you have any specific ideas or goals in mind for this integration?"

In [121]:
count_tokens(
    conversation_bufw, 
    "I just want to analyze the different possibilities. What can you think of?"
)

Spent a total of 296 tokens


"There are several ways in which Large Language Models can be integrated with external knowledge sources. One approach is to use knowledge graphs, which are structured representations of knowledge that can be used to enhance the model's understanding of concepts and relationships. Another approach is to use pre-trained embeddings from external knowledge bases to improve the model's ability to generate relevant and accurate text. Additionally, integrating the model with domain-specific knowledge sources can help tailor its responses to specific topics or industries. Overall, the possibilities are vast and exciting for exploring the potential of this integration."

In [123]:
count_tokens(
    conversation_bufw, 
    "Which data source types could be used to give context to the model?"
)

Spent a total of 283 tokens


'Some common data source types that could be used to give context to the model include text corpora, knowledge graphs, pre-trained embeddings from knowledge bases, domain-specific databases, structured data sources like tables or spreadsheets, and even multimedia sources like images or videos. By incorporating these diverse data sources, the model can gain a richer understanding of the world and generate more accurate and contextually relevant responses.'

In [125]:
count_tokens(
    conversation_bufw, 
    "What is my aim again?"
)

Spent a total of 221 tokens


"Your aim is to understand how different data source types can be used to provide context to a model in order to improve its performance and generate more accurate responses. By leveraging various data sources, you can enhance the model's understanding of the world and make it more contextually aware."

By the end of the conversation, when we ask **"What is my aim again?"**, the answer to this was contained in the human response three interactions ago. As we only kept the most recent interaction (`k=1`), the model had forgotten and could not give the correct answer.
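The forgetting behaviour of a window can be reproduced with a bounded queue. The sketch below uses `collections.deque` with `maxlen=k`; `ToyWindowMemory` is a hypothetical class for illustration rather than LangChain's implementation, and the AI replies are made up.

```python
from collections import deque


class ToyWindowMemory:
    """Hypothetical sketch: remember only the k most recent exchanges."""

    def __init__(self, k):
        # A deque with maxlen silently discards the oldest item when full.
        self.turns = deque(maxlen=k)

    def save(self, human, ai):
        self.turns.append((human, ai))

    @property
    def history(self):
        return "\n".join(f"Human: {h}\nAI: {a}" for h, a in self.turns)


window = ToyWindowMemory(k=1)
window.save("My aim is to explore LLMs with external knowledge",
            "That's a fascinating topic!")
window.save("Which data source types could be used?",
            "Databases, documents and knowledge graphs.")
window.save("What is my aim again?", "I am not sure.")
print(window.history)  # only the latest exchange survives with k=1
```

With `k=1`, the exchange that stated the aim has already been evicted by the time the final question is asked, so nothing in the history can answer it.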

We can see the effective “memory” of the model like so:

In [128]:
bufw_history = conversation_bufw.memory.load_memory_variables(
    inputs={}
)['history']

In [130]:
print(bufw_history)

Human: What is my aim again?
AI: Your aim is to understand how different data source types can be used to provide context to a model in order to improve its performance and generate more accurate responses. By leveraging various data sources, you can enhance the model's understanding of the world and make it more contextually aware.


Although this method isn’t suitable for remembering distant interactions, it is good at limiting the number of tokens being used — a number that we can increase/decrease depending on our needs. For the longer conversation used in our earlier comparison, we can set `k=6` and reach ~1.5K tokens per interaction after 27 total interactions:

<div>
<img src="https://education-team-2020.s3.eu-west-1.amazonaws.com/ai-eng/images-langchain-memory-rag/conversation_bw.webp" alt='auto' width="1000"/>
</div>

Token count including the ConversationBufferWindowMemory at k=6 and k=12.

If we only need memory of recent interactions, this is a great option. However, for a mix of both distant and recent interactions, there are other options.

### ConversationSummaryBufferMemory
The `ConversationSummaryBufferMemory` is a mix of the `ConversationSummaryMemory` and the `ConversationBufferWindowMemory`. It summarizes the earliest interactions in a conversation while keeping the most recent interactions, up to `max_token_limit` tokens, in their raw form. It is initialized like so:

In [145]:
conversation_sum_bufw = ConversationChain(
    llm=llm, memory=ConversationSummaryBufferMemory(
        llm=llm,
        max_token_limit=650
))

When applying this to our earlier conversation, we can set `max_token_limit` to a small number and yet the LLM can remember our earlier “aim”.

This is because that information is captured by the “summarization” component of the memory, despite being missed by the “buffer window” component.
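The interplay of the two components can be sketched as follows. This is an illustration under stated assumptions, not LangChain's implementation: `clip_summarizer` is a hypothetical stand-in for the summarization LLM, `rough_token_count` approximates tokens by whitespace-separated words, and the conversation is made up.

```python
def rough_token_count(text):
    """Assumption: approximate tokens by whitespace-separated words."""
    return len(text.split())


def clip_summarizer(summary, new_lines):
    """Stand-in for the LLM call: clip each line to its first 6 words."""
    clipped = [" ".join(line.split()[:6]) for line in new_lines.splitlines()]
    return " | ".join(([summary] if summary else []) + clipped)


class ToySummaryBufferMemory:
    """Hypothetical sketch: raw recent turns plus a summary of older ones."""

    def __init__(self, summarize, max_token_limit):
        self.summarize = summarize
        self.max_token_limit = max_token_limit
        self.summary = ""
        self.recent = []  # raw "Human: ...\nAI: ..." strings, oldest first

    def save(self, human, ai):
        self.recent.append(f"Human: {human}\nAI: {ai}")
        # Fold the oldest raw turns into the summary until the buffer fits.
        while self.recent and sum(map(rough_token_count, self.recent)) > self.max_token_limit:
            oldest = self.recent.pop(0)
            self.summary = self.summarize(self.summary, oldest)

    @property
    def history(self):
        head = [f"Summary: {self.summary}"] if self.summary else []
        return "\n".join(head + self.recent)


memory = ToySummaryBufferMemory(clip_summarizer, max_token_limit=20)
memory.save("My aim is to explore LLMs with external knowledge",
            "That is a fascinating topic!")
memory.save("Which data sources could give context?",
            "Databases, documents and knowledge graphs.")
print(memory.history)
```

Here the first exchange no longer fits in the raw buffer, yet the word "aim" survives in the summary, which is the behaviour described above.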

Naturally, the pros and cons of this component are a mix of the earlier components on which this is based.

<table>
    <tr>
        <th>Pros</th>
        <th>Cons</th>
    </tr>
    <tr>
        <td>Summarizer means we can remember distant interactions</td>
        <td>Summarizer increases token count for shorter conversations</td>
    </tr>
    <tr>
        <td>Buffer prevents us from missing information from the most recent interactions</td>
        <td>Storing the raw interactions, even if just the most recent interactions, increases token count</td>
    </tr>
</table>


Although requiring more tweaking on what to summarize and what to maintain within the buffer window, the `ConversationSummaryBufferMemory` does give us plenty of flexibility and is the only one of our memory types (so far) that allows us to remember distant interactions and store the most recent interactions in their raw — and most information-rich — form.

<div>
<img src="https://education-team-2020.s3.eu-west-1.amazonaws.com/ai-eng/images-langchain-memory-rag/memory_bws.webp" alt='auto' width="1000"/>
</div>

Token count comparisons including the ConversationSummaryBufferMemory type with max_token_limit values of 650 and 1300.

We can also see that, despite including both a summary of past interactions and the raw form of recent interactions, the increase in token count of `ConversationSummaryBufferMemory` is competitive with other methods.

### Other Memory Types
The memory types we have covered here are great for getting started and give a good balance between remembering as much as possible and minimizing tokens.

However, we have other options — particularly the `ConversationKnowledgeGraphMemory` and `ConversationEntityMemory`. We’ll give these different forms of memory the attention they deserve in upcoming chapters.

That’s it for this introduction to conversational memory for LLMs using LangChain. As we’ve seen, there are plenty of options for helping stateless LLMs interact as if they were in a stateful environment — able to consider and refer back to past interactions.

As mentioned, there are other forms of memory we can cover. We can also implement our own memory modules, use multiple types of memory within the same chain, combine them with agents, and much more. All of which we will cover in the future.