# Build a Retrieval Augmented Generation (RAG) App

One of the most powerful applications enabled by LLMs is sophisticated question-answering (Q&A) chatbots. These are applications that can answer questions about specific source information. These applications use a technique known as Retrieval Augmented Generation, or RAG.

This notebook will show how to build a simple Q&A application
over a text data source. Along the way we’ll go over a typical Q&A architecture.

> The code in this notebook is adapted from the LangChain tutorial [Build a RAG App](https://python.langchain.com/v0.2/docs/tutorials/rag/)

## What is RAG?

RAG is a technique for augmenting LLM knowledge with additional data.

LLMs can reason about wide-ranging topics, but their knowledge is limited to the public data they were trained on, which ends at a specific cutoff date. If you want to build AI applications that can reason about private data or data introduced after a model's cutoff date, you need to augment the knowledge of the model with the specific information it needs. The process of retrieving the appropriate information and inserting it into the model prompt is known as Retrieval Augmented Generation (RAG).

LangChain has a number of components designed to help build Q&A applications, and RAG applications more generally. 


## Concepts

A typical RAG application has two main components:

**Indexing**: a pipeline for ingesting data from a source and indexing it. *This usually happens offline.*

**Retrieval and generation**: the actual RAG chain, which takes the user query at run time and retrieves the relevant data from the index, then passes that to the model.

The most common full sequence from raw data to answer looks like:

### Indexing
1. **Load**: First we need to load our data. This is done with [DocumentLoaders](https://python.langchain.com/v0.2/docs/concepts/#document-loaders).
2. **Split**: [Text splitters](https://python.langchain.com/v0.2/docs/concepts/#text-splitters) break large `Documents` into smaller chunks. This is useful both for indexing data and for passing it in to a model, since large chunks are harder to search over and won't fit in a model's finite context window.
3. **Store**: We need somewhere to store and index our splits, so that they can later be searched over. This is often done using a [VectorStore](https://python.langchain.com/v0.2/docs/concepts/#vectorstores) and [Embeddings](https://python.langchain.com/v0.2/docs/concepts/#embedding-models) model.

![index_diagram](img/rag_indexing.png)

> Image from [LangChain Docs: Build a Retrieval Augmented Generation (RAG) App](https://python.langchain.com/v0.2/docs/tutorials/rag/#indexing)

### Retrieval and generation

4. **Retrieve**: Given a user input, relevant chunks are retrieved from storage using a [Retriever](https://python.langchain.com/v0.2/docs/concepts/#retrievers).
5. **Generate**: A [ChatModel](https://python.langchain.com/v0.2/docs/concepts/#chat-models) (or LLM) produces an answer using a prompt that includes the question and the retrieved data.

![retrieval_diagram](img/rag_retrieval_generation.png)

> Image from [LangChain Docs: Build a Retrieval Augmented Generation (RAG) App](https://python.langchain.com/v0.2/docs/tutorials/rag/#retrieval-and-generation)
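
The five steps above can be sketched end to end in plain Python. This is only a toy illustration under loud assumptions: every function name here (`load`, `split`, `retrieve`, `build_prompt`) is hypothetical, retrieval is a naive word-overlap count rather than an embedding search, and no model is called. It only shows how data flows from raw text to an augmented prompt.

```python
# Toy RAG pipeline. All names are hypothetical stand-ins, not LangChain APIs.

def load() -> str:
    """Load: return raw text from a source (hard-coded here)."""
    return (
        "Chroma is a vector database. "
        "Text splitters break documents into chunks. "
        "Retrievers return relevant chunks for a query."
    )


def split(text: str) -> list[str]:
    """Split: one chunk per sentence (real splitters are smarter)."""
    return [s.strip() + "." for s in text.split(".") if s.strip()]


def words(text: str) -> set[str]:
    """Lower-cased words with surrounding punctuation removed."""
    return {w.strip(".,?!").lower() for w in text.split()}


def retrieve(chunks: list[str], question: str, k: int = 1) -> list[str]:
    """Retrieve: rank chunks by how many words they share with the question."""
    ranked = sorted(chunks, key=lambda c: len(words(c) & words(question)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, context: list[str]) -> str:
    """Generate (first half): place the retrieved chunks in the model prompt."""
    joined = "\n".join(context)
    return f"Context:\n{joined}\n\nQuestion: {question}\nAnswer:"


chunks = split(load())  # indexing, usually done offline
question = "What do text splitters do?"
prompt = build_prompt(question, retrieve(chunks, question))  # done at run time
print(prompt)
```

In a real application the final string would be sent to a chat model; the rest of this notebook replaces each toy function with the corresponding LangChain component.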


## Preview

In this guide we’ll build a Q&A app over a website. The specific website we will use is the [LLM Powered Autonomous Agents](https://lilianweng.github.io/posts/2023-06-23-agent/) blog post
by Lilian Weng, so we will be able to ask questions about the contents of
the post.

First, we need to choose an LLM to use. We will be using Google's **Gemini 1.5 Flash model** in this notebook because it is fast and offers a free tier for us to experiment with.


In [1]:
from langchain_google_genai import ChatGoogleGenerativeAI

llm = ChatGoogleGenerativeAI(model="gemini-1.5-flash")

## 1. Indexing: Load

We need to first load the blog post contents. We can use [DocumentLoaders](https://python.langchain.com/v0.2/docs/concepts/#document-loaders) for this, which are objects that load in data from a source and return a list of [Documents](https://api.python.langchain.com/en/latest/documents/langchain_core.documents.base.Document.html).
A `Document` is an object with some `page_content` (str) and `metadata` (dict).
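
As a rough mental model, a `Document` can be pictured as a small record type. The sketch below is an assumption-laden stand-in (the class name `SimpleDocument` is hypothetical and this is not LangChain's implementation); it only mirrors the two fields described above.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a Document: text plus free-form metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


doc = SimpleDocument(
    page_content="LLM Powered Autonomous Agents ...",
    metadata={"source": "https://lilianweng.github.io/posts/2023-06-23-agent/"},
)
print(len(doc.page_content), doc.metadata["source"])
```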

In this case we’ll use the
[WebBaseLoader](https://python.langchain.com/v0.2/docs/integrations/document_loaders/web_base/), which uses `urllib` to load HTML from web URLs and `BeautifulSoup` to parse it to text. We can customize the HTML $\longrightarrow$ text parsing by passing
in parameters to the `BeautifulSoup` parser via `bs_kwargs` (see
[BeautifulSoup
docs](https://beautiful-soup-4.readthedocs.io/en/latest/#beautifulsoup)).
In this case only HTML tags with class `post-content`, `post-title`, or `post-header` are relevant, so we’ll remove all others.
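
To make the filtering idea concrete, here is a minimal sketch that keeps only the text inside elements carrying one of a given set of classes. It uses the standard library's `html.parser` instead of `BeautifulSoup`, so it approximates the effect of a strainer rather than reproducing it; the names `ClassFilter`, `extract_text`, and `sample_html` are hypothetical, and the void-tag list is deliberately incomplete.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag; skipped so depth tracking stays balanced.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class ClassFilter(HTMLParser):
    """Collect text that appears inside elements with a wanted class."""

    def __init__(self, keep_classes):
        super().__init__()
        self.keep = set(keep_classes)
        self.depth = 0   # > 0 while we are inside a kept element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = set((dict(attrs).get("class") or "").split())
        if self.depth or classes & self.keep:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.parts.append(data.strip())


def extract_text(html, keep_classes):
    parser = ClassFilter(keep_classes)
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


sample_html = (
    '<html><body>'
    '<nav class="menu">Home</nav>'
    '<h1 class="post-title">Agents</h1>'
    '<div class="post-content"><p>Planning matters.</p></div>'
    '<footer>Copyright</footer>'
    '</body></html>'
)
print(extract_text(sample_html, ["post-title", "post-header", "post-content"]))
```

The navigation and footer text are dropped because they sit outside any kept element, which is the same outcome the `SoupStrainer` configuration aims for on the real page.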

In [2]:
import bs4
from langchain_community.document_loaders import WebBaseLoader

# Only keep post title, headers, and content from the full HTML.
bs4_strainer = bs4.SoupStrainer(
    class_=("post-title", 
            "post-header", 
            "post-content")
)
loader = WebBaseLoader(
    web_paths=(
        "https://lilianweng.github.io/posts/2023-06-23-agent/",
    ),
    bs_kwargs={"parse_only": bs4_strainer},
)
docs = loader.load()

len(docs[0].page_content)

USER_AGENT environment variable not set, consider setting it to identify your requests.


43131

In [3]:
print(docs[0].page_content[:500])



      LLM Powered Autonomous Agents
    
Date: June 23, 2023  |  Estimated Reading Time: 31 min  |  Author: Lilian Weng


Building agents with LLM (large language model) as its core controller is a cool concept. Several proof-of-concepts demos, such as AutoGPT, GPT-Engineer and BabyAGI, serve as inspiring examples. The potentiality of LLM extends beyond generating well-written copies, stories, essays and programs; it can be framed as a powerful general problem solver.
Agent System Overview#
In


### Go deeper

`DocumentLoader`: Object that loads data from a source as a list of `Documents`.

- [Docs](https://python.langchain.com/v0.2/docs/how_to/#document-loaders):
  Detailed documentation on how to use `DocumentLoaders`.

## 2. Indexing: Split

Our loaded document is over 43k characters long. 
This is too long to fit in the context window of many models. Even models that could fit the full post in their context window can struggle to find information in very long inputs.

To handle this we’ll split the `Document` into chunks for embedding and vector storage. This should help us retrieve only the most relevant bits of the blog post at run time.

In this case we’ll split our documents into chunks of 1000 characters with 200 characters of overlap between chunks. The overlap helps mitigate the possibility of separating a statement from important context related to it. We use the
[RecursiveCharacterTextSplitter](https://python.langchain.com/v0.2/docs/how_to/recursive_text_splitter/), which will recursively split the document using common separators like new lines until each chunk is the appropriate size. This is the recommended text splitter for generic text use cases.

We set `add_start_index=True` so that the character index at which each split Document starts within the initial Document is preserved as metadata attribute “start_index”.

In [4]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, 
    chunk_overlap=200, 
    add_start_index=True
)
all_splits = text_splitter.split_documents(docs)

len(all_splits)

66

In [5]:
len(all_splits[0].page_content)

969

In [6]:
all_splits[10].metadata

{'source': 'https://lilianweng.github.io/posts/2023-06-23-agent/',
 'start_index': 7056}
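
To see what chunk size, overlap, and `start_index` mean mechanically, here is a simplified fixed-size splitter. It is not the recursive, separator-aware algorithm that `RecursiveCharacterTextSplitter` uses; the function name `split_with_overlap` and the dict layout are assumptions made for illustration only.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[dict]:
    """Cut text into fixed-size windows that share `chunk_overlap` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(
            {"page_content": text[start:start + chunk_size], "start_index": start}
        )
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


text = "abcdefghijklmnopqrstuvwxyz"
pieces = split_with_overlap(text, chunk_size=10, chunk_overlap=4)
for piece in pieces:
    print(piece)
```

Each window starts 6 characters after the previous one, so the last 4 characters of one chunk reappear at the start of the next, and `start_index` records where the chunk began in the original text.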

### Go deeper

`TextSplitter`: Object that splits a list of `Document`s into smaller chunks. Subclass of `DocumentTransformer`s.

- Learn more about splitting text using different methods by reading the [how-to docs](https://python.langchain.com/v0.2/docs/how_to/#text-splitters).


## 3. Indexing: Store

Now we need to index our text chunks so that we can search over them
later. The most common way to do this is to embed the contents of
each document split and insert these embeddings into a vector database
such as Chroma. When we want to search over our splits, we take a
text search query, embed it, and perform some sort of “similarity”
search to identify the stored splits with the most similar embeddings to
our query embedding. The simplest similarity measure is cosine
similarity: we measure the cosine of the angle between each pair of
embeddings (which are very high-dimensional vectors).
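
For two vectors $a$ and $b$, cosine similarity is $\cos\theta = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}$. A minimal pure-Python sketch follows; the function name and the tiny 2-dimensional vectors are illustrative assumptions, whereas real embeddings have hundreds of dimensions and vector databases use optimized routines.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


query = [1.0, 0.0]
print(cosine_similarity(query, [2.0, 0.0]))  # same direction
print(cosine_similarity(query, [0.0, 3.0]))  # orthogonal
print(cosine_similarity(query, [1.0, 1.0]))  # 45 degrees apart
```

Because the score depends only on direction, a vector and a scaled copy of it are treated as identical, which is why the first comparison scores highest.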

We can embed and store all of our document splits in a single command
using the 
[Chroma](https://python.langchain.com/v0.2/docs/integrations/vectorstores/chroma/)
vector store and 
[GoogleGenerativeAIEmbeddings](https://python.langchain.com/v0.2/docs/integrations/text_embedding/google_generative_ai/) 
model.

In [7]:
from langchain_chroma import Chroma
from langchain_google_genai import GoogleGenerativeAIEmbeddings

embeddings = GoogleGenerativeAIEmbeddings(
    model="models/text-embedding-004"
)
vectorstore = Chroma.from_documents(documents=all_splits, embedding=embeddings)

### Go deeper

`Embeddings`: Wrapper around a text embedding model, used for converting
text to embeddings.

- [Docs](https://python.langchain.com/v0.2/docs/how_to/embed_text/): Detailed documentation on how to use embeddings.

`VectorStore`: Wrapper around a vector database, used for storing and
querying embeddings.

- [Docs](https://python.langchain.com/v0.2/docs/how_to/vectorstores/): Detailed documentation on how to use vector stores.


This completes the **Indexing** portion of the pipeline. At this point
we have a query-able vector store containing the chunked contents of our
blog post. Given a user question, we should ideally be able to return
the snippets of the blog post that answer the question.
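
What a vector store does at query time can be sketched with a toy in-memory version. Everything below is an assumption for illustration: `ToyVectorStore` is a hypothetical class, and the "embedding" is just a word-count vector, which is far cruder than a learned embedding model. The point is only the shape of the operation: embed the stored texts once, embed the query, rank by similarity, return the top `k`.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: a sparse vector of word counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[word] * b[word] for word in a)  # missing words count as 0
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorStore:
    """Hypothetical in-memory store: a list of (vector, text) rows."""

    def __init__(self):
        self._rows = []

    def add_texts(self, texts):
        for text in texts:
            self._rows.append((embed(text), text))

    def similarity_search(self, query: str, k: int = 2) -> list[str]:
        query_vec = embed(query)
        ranked = sorted(self._rows, key=lambda row: cosine(query_vec, row[0]), reverse=True)
        return [text for _, text in ranked[:k]]


toy_store = ToyVectorStore()
toy_store.add_texts([
    "task decomposition breaks a task into steps",
    "memory lets an agent recall past events",
    "tools let an agent call external apis",
])
print(toy_store.similarity_search("how does task decomposition work", k=1))
```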


## 4. Retrieval and Generation: Retrieve

Now let’s write the actual application logic. We want to create a simple
application that takes a user question, searches for documents relevant
to that question, passes the retrieved documents and initial question to
the LLM, and returns an answer.

First we need to define our logic for searching over documents.
LangChain defines a
[Retriever](https://python.langchain.com/v0.2/docs/concepts/#retrievers) 
interface, which wraps an index that can return relevant `Documents` 
given a string query.

The most common type of `Retriever` is the
[VectorStoreRetriever](https://python.langchain.com/v0.2/docs/how_to/vectorstore_retriever/),
which uses the similarity search capabilities of a vector store to
facilitate retrieval. Any `VectorStore` can easily be turned into a
`Retriever` with `VectorStore.as_retriever()`:

In [8]:
retriever = vectorstore.as_retriever(
    search_type="similarity", 
    search_kwargs={"k": 6} # return top 6 most relevant documents
)

retrieved_docs = retriever.invoke(
    "What are the approaches to Task Decomposition?"
)

len(retrieved_docs)

6

In [9]:
print(retrieved_docs[0].page_content)

Tree of Thoughts (Yao et al. 2023) extends CoT by exploring multiple reasoning possibilities at each step. It first decomposes the problem into multiple thought steps and generates multiple thoughts per step, creating a tree structure. The search process can be BFS (breadth-first search) or DFS (depth-first search) with each state evaluated by a classifier (via a prompt) or majority vote.
Task decomposition can be done (1) by LLM with simple prompting like "Steps for XYZ.\n1.", "What are the subgoals for achieving XYZ?", (2) by using task-specific instructions; e.g. "Write a story outline." for writing a novel, or (3) with human inputs.


### Go deeper

Vector stores are commonly used for retrieval, but there are other ways
to do retrieval, too.

`Retriever`: An object that returns `Document`s given a text query.

- [Docs](https://python.langchain.com/v0.2/docs/how_to/#retrievers): Further
  documentation on the interface and built-in retrieval techniques.


## 5. Retrieval and Generation: Generate

Let’s put it all together into a chain that takes a question, retrieves
relevant documents, constructs a prompt, passes that to an LLM, and
parses the output.

We’ll use a prompt for RAG that is checked into the LangChain prompt hub
([here](https://smith.langchain.com/hub/rlm/rag-prompt)).

In [10]:
from langchain import hub

prompt = hub.pull("rlm/rag-prompt")

example_messages = prompt.invoke(
    {"context": "filler context", "question": "filler question"}
).to_messages()

example_messages

[HumanMessage(content="You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.\nQuestion: filler question \nContext: filler context \nAnswer:")]

In [11]:
print(example_messages[0].content)

You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.
Question: filler question 
Context: filler context 
Answer:


We’ll use the 
[LCEL Runnable](https://python.langchain.com/v0.2/docs/concepts/#langchain-expression-language-lcel)
protocol to define the chain, allowing us to 

- pipe together components and functions in a transparent way 
- automatically trace our chain in LangSmith 
- get streaming, async, and batched calling out of the box.

Here is the implementation:

In [12]:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough


def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)


rag_chain = (
    {
        "context": retriever | format_docs, 
        "question": RunnablePassthrough()
    }
    | prompt
    | llm
    | StrOutputParser()
)

for chunk in rag_chain.stream("What is Task Decomposition?"):
    print(chunk, end="", flush=True)

Task decomposition is the process of breaking down a complex task into smaller, more manageable steps. This can be done through various methods, including prompting an LLM with instructions, using task-specific instructions, or with human input. By breaking down tasks, it becomes easier for models to understand and solve them. 


Let's dissect the LCEL to understand what's going on.

First: each of these components (`retriever`, `prompt`, `llm`, etc.) is an instance of [Runnable](https://python.langchain.com/v0.2/docs/concepts/#langchain-expression-language-lcel). This means that they implement the same methods, such as sync and async `.invoke`, `.stream`, or `.batch`, which makes them easier to connect together. They can be connected into a [RunnableSequence](https://api.python.langchain.com/en/latest/runnables/langchain_core.runnables.base.RunnableSequence.html), which is itself a Runnable, via the `|` (pipe) operator.

LangChain will automatically cast certain objects to runnables when met with the `|` operator. Here, `format_docs` is cast to a [RunnableLambda](https://api.python.langchain.com/en/latest/runnables/langchain_core.runnables.base.RunnableLambda.html), and the dict with `"context"` and `"question"` is cast to a [RunnableParallel](https://api.python.langchain.com/en/latest/runnables/langchain_core.runnables.base.RunnableParallel.html). The details are less important than the bigger point, which is that each object is a Runnable.
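
The casting behavior can be imitated in a few lines of plain Python. The sketch below is not how LangChain implements LCEL; `Step` and `coerce` are hypothetical names, and streaming, async, and batching are left out. It only shows how overloading `|` lets functions and dicts be composed into a single object with an `invoke` method.

```python
class Step:
    """Hypothetical mini-Runnable: wraps a function and supports `|`."""

    def __init__(self, func):
        self.func = func

    def invoke(self, value):
        return self.func(value)

    def __or__(self, other):
        nxt = coerce(other)
        # Sequence: feed this step's output into the next step.
        return Step(lambda value: nxt.invoke(self.invoke(value)))

    def __ror__(self, other):
        # Reached for `dict | Step` or `function | Step`.
        return coerce(other) | self


def coerce(obj):
    """Turn plain functions and dicts into Step objects."""
    if isinstance(obj, Step):
        return obj
    if isinstance(obj, dict):
        branches = {key: coerce(val) for key, val in obj.items()}
        # Parallel: run every branch on the same input, collect a dict.
        return Step(lambda value: {k: b.invoke(value) for k, b in branches.items()})
    if callable(obj):
        return Step(obj)
    raise TypeError(f"cannot coerce {type(obj).__name__} to Step")


def format_chunks(chunks):
    return "\n\n".join(chunks)


fake_retriever = Step(lambda question: ["chunk one", "chunk two"])
passthrough = Step(lambda question: question)
fake_prompt = Step(lambda d: f"Context: {d['context']}\nQuestion: {d['question']}")

toy_chain = (
    {"context": fake_retriever | format_chunks, "question": passthrough}
    | fake_prompt
)
print(toy_chain.invoke("What is Task Decomposition?"))
```

The dict on the left has no knowledge of `Step`, so Python falls back to `Step.__ror__`, which is the same mechanism that lets a plain dict start a chain.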

Let's trace how the input question flows through the above runnables.

As we've seen above, the input to `prompt` is expected to be a dict with keys `"context"` and `"question"`. So the first element of this chain builds runnables that will calculate both of these from the input question:
- `retriever | format_docs` passes the question through the retriever, generating
[Document](https://api.python.langchain.com/en/latest/documents/langchain_core.documents.base.Document.html)
 objects, and then to `format_docs` to generate strings;
- `RunnablePassthrough()` passes through the input question unchanged.

That is, if you constructed:

In [13]:
chain = (
    {
        "context": retriever | format_docs, 
        "question": RunnablePassthrough()
    }
    | prompt
)

formatted_prompt = chain.invoke("What is Task Decomposition?")

In [None]:
print(formatted_prompt.messages[0].content)

Then `chain.invoke(question)` would build a formatted prompt, ready for inference. (Note: when developing with LCEL, it can be practical to test with sub-chains like this.)

The last steps of the chain are `llm`, which runs the inference, and `StrOutputParser()`, which just plucks the string content out of the LLM's output message.


### Built-in chains

If preferred, LangChain includes convenience functions that implement the above LCEL. We compose two functions:

- [create_stuff_documents_chain](https://api.python.langchain.com/en/latest/chains/langchain.chains.combine_documents.stuff.create_stuff_documents_chain.html)
specifies how retrieved context is fed into a prompt and LLM. In this case, we will "stuff" the contents into the prompt, i.e., we will include all retrieved context without any summarization or other processing. It largely implements our above `rag_chain`, with input keys `context` and `input`: it generates an answer using the retrieved context and query.

- [create_retrieval_chain](https://api.python.langchain.com/en/latest/chains/langchain.chains.retrieval.create_retrieval_chain.html)
adds the retrieval step and propagates the retrieved context through the chain, providing it alongside the final answer. It has input key `input`, and includes `input`, `context`, and `answer` in its output.
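
The division of labor between the two functions can be sketched with plain closures. This is a simplified approximation, not the library code: `make_stuff_chain`, `make_retrieval_chain`, `fake_retriever`, and `echo_llm` are hypothetical names, documents are plain strings, and the "LLM" simply echoes its prompt so the stuffed context is visible.

```python
def make_stuff_chain(llm, template):
    """Return a function that stuffs every document into one prompt."""
    def run(inputs):
        stuffed = "\n\n".join(inputs["context"])  # no summarizing or filtering
        return llm(template.format(context=stuffed, input=inputs["input"]))
    return run


def make_retrieval_chain(retriever, qa_chain):
    """Return a function that retrieves, answers, and keeps the context."""
    def run(inputs):
        docs = retriever(inputs["input"])
        answer = qa_chain({**inputs, "context": docs})
        return {**inputs, "context": docs, "answer": answer}
    return run


def fake_retriever(query):
    return ["Chunk A.", "Chunk B."]


def echo_llm(prompt):
    return prompt  # stand-in for a model call


template = "Answer from this context:\n{context}\n\nQuestion: {input}"
toy_rag = make_retrieval_chain(fake_retriever, make_stuff_chain(echo_llm, template))
result = toy_rag({"input": "What is Task Decomposition?"})
print(sorted(result))
print(result["answer"])
```

The output dict carries `input`, `context`, and `answer` together, which is what makes it straightforward to show sources alongside the answer.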

In [15]:
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate

system_prompt = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer "
    "the question. If you don't know the answer, say that you "
    "don't know. Use three sentences maximum and keep the "
    "answer concise."
    "\n\n"
    "{context}"
)

prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "{input}"),
    ]
)


question_answer_chain = create_stuff_documents_chain(llm, prompt)
rag_chain = create_retrieval_chain(retriever, question_answer_chain)

response = rag_chain.invoke({"input": "What is Task Decomposition?"})
print(response["answer"])

Task Decomposition is a technique used to break down complex tasks into smaller, simpler steps. This is often done by using chain of thought (CoT) prompting, where the model is instructed to "think step by step." This helps the model utilize more computation at test time and allows for better interpretation of its thinking process. 



#### Returning sources
Often in Q&A applications it's important to show users the sources that were used to generate the answer. LangChain's built-in `create_retrieval_chain` will propagate retrieved source documents through to the output in the `"context"` key:

In [16]:
for document in response["context"]:
    print(document)
    print()

page_content='Fig. 1. Overview of a LLM-powered autonomous agent system.\nComponent One: Planning#\nA complicated task usually involves many steps. An agent needs to know what they are and plan ahead.\nTask Decomposition#\nChain of thought (CoT; Wei et al. 2022) has become a standard prompting technique for enhancing model performance on complex tasks. The model is instructed to “think step by step” to utilize more test-time computation to decompose hard tasks into smaller and simpler steps. CoT transforms big tasks into multiple manageable tasks and shed lights into an interpretation of the model’s thinking process.' metadata={'source': 'https://lilianweng.github.io/posts/2023-06-23-agent/', 'start_index': 1585}

page_content='Tree of Thoughts (Yao et al. 2023) extends CoT by exploring multiple reasoning possibilities at each step. It first decomposes the problem into multiple thought steps and generates multiple thoughts per step, creating a tree structure. The search process can be 

#### Customizing the prompt

As shown above, we can load prompts (e.g., [this RAG prompt](https://smith.langchain.com/hub/rlm/rag-prompt)) from the prompt hub. The prompt can also be easily customized:

In [17]:
from langchain_core.prompts import PromptTemplate

template = """Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
Use three sentences maximum and keep the answer as concise as possible.
Always say "thanks for asking!" at the end of the answer.

{context}

Question: {question}

Helpful Answer:"""

custom_rag_prompt = PromptTemplate.from_template(template)

custom_rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | custom_rag_prompt
    | llm
    | StrOutputParser()
)

custom_rag_chain.invoke("What is Task Decomposition?")

'Task decomposition is the process of breaking down a complex task into smaller, more manageable steps. This can be done by using prompting techniques like Chain of Thought (CoT) or Tree of Thoughts (ToT), which guide the model to think step by step and explore multiple reasoning possibilities. Task decomposition can also be achieved through task-specific instructions or human input. \nThanks for asking! \n'

# Conversational RAG

In many Q&A applications we want to allow the user to have a back-and-forth conversation, meaning the application needs some sort of "memory" of past questions and answers, and some logic for incorporating those into its current thinking.

In this guide we focus on **adding logic for incorporating historical messages.** Further details on chat history management are 
[covered here](https://python.langchain.com/v0.2/docs/how_to/message_history/).

> The following section of this notebook is adapted from the LangChain tutorial 
[Conversational RAG](https://python.langchain.com/v0.2/docs/tutorials/qa_chat_history/).


### Adding chat history

The chain we have built uses the input query directly to retrieve relevant context. But in a conversational setting, the user query might require conversational context to be understood. For example, consider this exchange:

> Human: "What is Task Decomposition?"
>
> AI: "Task decomposition involves breaking down complex tasks into smaller and simpler steps to make them more manageable for an agent or model."
>
> Human: "What are common ways of doing it?"

In order to answer the second question, our system needs to understand that "it" refers to "Task Decomposition."

We'll need to update two things about our existing app:

1. **Prompt**: Update our prompt to support historical messages as an input.
2. **Contextualizing questions**: Add a sub-chain that takes the latest user question and reformulates it in the context of the chat history. This can be thought of simply as building a new *"history aware"* retriever. Whereas before we had:
   - `query` $\longrightarrow$ `retriever`  
     Now we will have:
   - `(query, conversation history)` $\longrightarrow$ `LLM` $\longrightarrow$ `rephrased query` $\longrightarrow$ `retriever`

#### Contextualizing the question

First we'll need to define a sub-chain that takes historical messages and the latest user question, and reformulates the question if it refers to anything in the chat history.

We'll use a prompt that includes a `MessagesPlaceholder` variable under the name "chat_history". This allows us to pass in a list of Messages to the prompt using the "chat_history" input key, and these messages will be inserted after the system message and before the human message containing the latest question.

Note that we leverage a helper function [create_history_aware_retriever](https://api.python.langchain.com/en/latest/chains/langchain.chains.history_aware_retriever.create_history_aware_retriever.html) for this step, which manages the case where `chat_history` is empty, and otherwise applies `prompt | llm | StrOutputParser() | retriever` in sequence.

`create_history_aware_retriever` constructs a chain that accepts keys `input` and `chat_history` as input, and has the same output schema as a retriever.
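
The branching described above (skip rephrasing when the history is empty, otherwise rephrase and then retrieve) can be sketched as follows. This is an illustrative approximation only: `make_history_aware_retriever`, `fake_rephrase`, and `fake_retriever` are hypothetical names, and the rephrasing step is a hard-coded string substitution standing in for an LLM call.

```python
def make_history_aware_retriever(rephrase, retriever):
    """Return a function taking {"input", "chat_history"} and returning documents."""
    def run(inputs):
        history = inputs.get("chat_history", [])
        if not history:
            query = inputs["input"]  # nothing to contextualize
        else:
            query = rephrase(history, inputs["input"])  # the LLM call in the real chain
        return retriever(query)
    return run


def fake_rephrase(history, question):
    # Stand-in for the model: swaps the pronoun for a hard-coded topic
    # and ignores the history it receives.
    return question.replace(" it?", " Task Decomposition?")


def fake_retriever(query):
    return [f"<chunks matching: {query}>"]


toy_retriever = make_history_aware_retriever(fake_rephrase, fake_retriever)

first = toy_retriever({"input": "What is Task Decomposition?", "chat_history": []})
history = [("human", "What is Task Decomposition?"), ("ai", "Breaking tasks into steps.")]
second = toy_retriever({"input": "What are common ways of doing it?", "chat_history": history})
print(first)
print(second)
```

Note that the output has the same shape in both branches (a list of documents), which is why the history-aware version can be dropped in wherever a plain retriever was used.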

In [18]:
from langchain.chains import create_history_aware_retriever
from langchain_core.prompts import MessagesPlaceholder

contextualize_q_system_prompt = (
    "Given a chat history and the latest user question "
    "which might reference context in the chat history, "
    "formulate a standalone question which can be understood "
    "without the chat history. Do NOT answer the question, "
    "just reformulate it if needed and otherwise return it as is."
)

contextualize_q_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", contextualize_q_system_prompt),
        MessagesPlaceholder("chat_history"),
        ("human", "{input}"),
    ]
)
history_aware_retriever = create_history_aware_retriever(
    llm, retriever, contextualize_q_prompt
)

This chain adds a rephrasing step in front of our retriever, so that retrieval incorporates the context of the conversation.

Now we can build our full QA chain. This is as simple as updating the retriever to be our new `history_aware_retriever`.

Again, we will use [create_stuff_documents_chain](https://api.python.langchain.com/en/latest/chains/langchain.chains.combine_documents.stuff.create_stuff_documents_chain.html) to generate a `question_answer_chain`, with input keys `context`, `chat_history`, and `input`: it accepts the retrieved context alongside the conversation history and query to generate an answer.

We build our final `rag_chain` with [create_retrieval_chain](https://api.python.langchain.com/en/latest/chains/langchain.chains.retrieval.create_retrieval_chain.html). This chain applies the `history_aware_retriever` and `question_answer_chain` in sequence, retaining intermediate outputs such as the retrieved context for convenience. It has input keys `input` and `chat_history`, and includes `input`, `chat_history`, `context`, and `answer` in its output.

In [19]:
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain

qa_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        MessagesPlaceholder("chat_history"),
        ("human", "{input}"),
    ]
)


question_answer_chain = create_stuff_documents_chain(llm, qa_prompt)

rag_chain = create_retrieval_chain(history_aware_retriever, question_answer_chain)

Let's try this. Below we ask a question and a follow-up question that requires contextualization to return a sensible response. Because our chain includes a `"chat_history"` input, the caller needs to manage the chat history. We can achieve this by appending input and output messages to a list:

In [20]:
from langchain_core.messages import AIMessage, HumanMessage

chat_history = []

question = "What is Task Decomposition?"
ai_msg_1 = rag_chain.invoke({"input": question, "chat_history": chat_history})
chat_history.extend(
    [
        HumanMessage(content=question),
        AIMessage(content=ai_msg_1["answer"]),
    ]
)

second_question = "What are common ways of doing it?"
ai_msg_2 = rag_chain.invoke({"input": second_question, "chat_history": chat_history})

print(ai_msg_2["answer"])

Task decomposition can be achieved in several ways:

* **LLM Prompting:**  Simple prompts like "Steps for XYZ" or "What are the subgoals for achieving XYZ?" can guide an LLM to break down a task.
* **Task-Specific Instructions:** Using instructions tailored to the task, such as "Write a story outline" for writing a novel, can help decompose the task into manageable steps.
* **Human Input:**  Humans can directly provide a breakdown of the task into sub-tasks, especially for tasks that require domain expertise. 



#### Stateful management of chat history

Here we've gone over how to add application logic for incorporating historical outputs, but we're still manually updating the chat history and inserting it into each input. In a real Q&A application we'll want some way of persisting chat history and some way of automatically inserting and updating it.

For this we can use:

- [BaseChatMessageHistory](https://api.python.langchain.com/en/latest/langchain_api_reference.html#module-langchain.memory): Store chat history.
- [RunnableWithMessageHistory](https://python.langchain.com/v0.2/docs/how_to/message_history/): Wrapper for an LCEL chain and a `BaseChatMessageHistory` that handles injecting chat history into inputs and updating it after each invocation.

For a detailed walkthrough of how to use these classes together to create a stateful conversational chain, head to the [How to add message history (memory)](https://python.langchain.com/v0.2/docs/how_to/message_history) LCEL page.
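
Conceptually, the wrapper does three things on every call: look up the history for a session, pass it to the chain, and record the new exchange. The sketch below imitates that loop in plain Python under stated assumptions: `with_message_history`, `get_history`, `sessions`, and `fake_chain` are hypothetical names, messages are simple tuples, and the real class has a different signature and more features.

```python
def with_message_history(chain, get_history):
    """Wrap `chain` so chat history is injected and updated automatically."""
    def run(inputs, session_id):
        history = get_history(session_id)
        # Pass a copy so the chain sees only the messages from earlier turns.
        result = chain({**inputs, "chat_history": list(history)})
        history.append(("human", inputs["input"]))
        history.append(("ai", result["answer"]))
        return result
    return run


sessions = {}  # session_id -> list of (role, text) tuples


def get_history(session_id):
    return sessions.setdefault(session_id, [])


def fake_chain(inputs):
    # Stand-in for the RAG chain: reports how much history it received.
    return {"answer": f"seen {len(inputs['chat_history'])} earlier messages"}


chat = with_message_history(fake_chain, get_history)
print(chat({"input": "What is Task Decomposition?"}, "abc123")["answer"])
print(chat({"input": "What are common ways of doing it?"}, "abc123")["answer"])
print(chat({"input": "Hello"}, "other")["answer"])
```

The second call in session `abc123` sees the two messages from the first turn, while the separate session starts empty; the caller never touches the history list directly.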

Below, we implement a simple example using `RunnableWithMessageHistory`, in which chat histories are stored in a simple dict. LangChain also offers memory integrations with [Redis](https://python.langchain.com/v0.2/docs/integrations/memory/redis_chat_message_history/) and other technologies for more robust persistence.

Instances of `RunnableWithMessageHistory` manage the chat history for you. They accept a config with a key (`"session_id"` by default) that specifies which conversation history to fetch and prepend to the input, and they append the output to the same conversation history. Below is an example:

In [21]:
from langchain_community.chat_message_histories import ChatMessageHistory
from langchain_core.chat_history import BaseChatMessageHistory
from langchain_core.runnables.history import RunnableWithMessageHistory


store = {}


def get_session_history(session_id: str) -> BaseChatMessageHistory:
    if session_id not in store:
        store[session_id] = ChatMessageHistory()
    return store[session_id]


conversational_rag_chain = RunnableWithMessageHistory(
    rag_chain,
    get_session_history,
    input_messages_key="input",
    history_messages_key="chat_history",
    output_messages_key="answer",
)

In [22]:
conversational_rag_chain.invoke(
    {"input": "What is Task Decomposition?"},
    config={
        "configurable": {"session_id": "abc123"}
    },  # constructs a key "abc123" in `store`.
)["answer"]


'Task decomposition is a technique used to break down complex tasks into smaller, more manageable steps. This process is often used in conjunction with chain-of-thought (CoT) prompting, which encourages the model to "think step by step" and utilize more computation to solve complex problems. By decomposing tasks, it becomes easier for a model to understand and solve them. \n'

In [23]:
conversational_rag_chain.invoke(
    {"input": "What are common ways of doing it?"},
    config={"configurable": {"session_id": "abc123"}},
)["answer"]


'Task decomposition can be achieved in a few common ways:\n\n1. **LLM prompting:** Simple prompts like "Steps for XYZ.\\n1." or "What are the subgoals for achieving XYZ?" can guide the LLM to break down the task.\n2. **Task-specific instructions:** Providing instructions tailored to the task, such as "Write a story outline" for writing a novel, can help the LLM understand the necessary steps.\n3. **Human input:** Humans can directly provide the task decomposition, either by outlining the steps themselves or by providing examples of how the task can be broken down. \n'

The conversation history can be inspected in the `store` dict:

In [24]:
for message in store["abc123"].messages:
    if isinstance(message, AIMessage):
        prefix = "AI"
    else:
        prefix = "User"

    print(f"{prefix}: {message.content}\n")

User: What is Task Decomposition?

AI: Task decomposition is a technique used to break down complex tasks into smaller, more manageable steps. This process is often used in conjunction with chain-of-thought (CoT) prompting, which encourages the model to "think step by step" and utilize more computation to solve complex problems. By decomposing tasks, it becomes easier for a model to understand and solve them. 


User: What are common ways of doing it?

AI: Task decomposition can be achieved in a few common ways:

1. **LLM prompting:** Simple prompts like "Steps for XYZ.\n1." or "What are the subgoals for achieving XYZ?" can guide the LLM to break down the task.
2. **Task-specific instructions:** Providing instructions tailored to the task, such as "Write a story outline" for writing a novel, can help the LLM understand the necessary steps.
3. **Human input:** Humans can directly provide the task decomposition, either by outlining the steps themselves or by providing examples of how the

### Tying it together

![retrieval chain with chat history](img/conversational_retrieval_chain.png)

> Image from [LangChain Docs: Conversational RAG](https://python.langchain.com/v0.2/docs/tutorials/qa_chat_history/#tying-it-together)

## Summary

We've covered the steps to build a basic Q&A app over data:

- Loading data with a
[Document Loader](https://python.langchain.com/v0.2/docs/concepts/#document-loaders)
- Chunking the loaded data with a
[Text Splitter](https://python.langchain.com/v0.2/docs/concepts/#text-splitters)
to make it more easily usable by a model
- [Embedding the data](https://python.langchain.com/v0.2/docs/concepts/#embedding-models)
and storing the data in a
[vectorstore](https://python.langchain.com/v0.2/docs/how_to/vectorstores/)
- [Retrieving](https://python.langchain.com/v0.2/docs/concepts/#retrievers)
the previously stored chunks in response to incoming questions
- Generating an answer using the retrieved chunks as context
- Adding chat history to support follow-up questions
