# Build a LangChain agentic RAG system using the OpenAI model (gpt-4o-mini)


**Author**: Mahrukh Ali Khan

In this tutorial, you will create a LangChain agentic RAG system that uses OpenAI's gpt-4o-mini model to answer complex queries about the 2024 US Open with the help of external information.


# Overview of agentic RAG

## What is RAG?

Retrieval-augmented generation (RAG) is a technique in natural language processing (NLP) that combines information retrieval with generative models to produce more accurate, relevant and contextually aware responses. In traditional language generation tasks, large language models (LLMs) such as Meta's [Llama Models](https://llama.meta.com/) or IBM’s [Granite Models](https://www.ibm.com/granite) construct responses from the input prompt alone. Chatbots are a common real-world use case of these LLMs. RAG is most useful when a model's training data lacks relevant or up-to-date information: the system retrieves that information from an external source and adds it to the prompt before the model generates its answer.
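To make the retrieve-then-generate idea concrete, here is a minimal sketch in plain Python. It is only an illustration: the `DOCUMENTS` list, the word-overlap scoring and the helper names are assumptions made for this example, whereas a real RAG system (including the one built later in this tutorial) uses embeddings and a vector store for retrieval.

```python
import re

# Toy knowledge base (hypothetical); a real system would index many documents.
DOCUMENTS = [
    "The US Open is a tennis tournament held in Flushing, New York.",
    "Chroma is an open source vector store.",
    "LangChain is a framework for building LLM applications.",
]


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, documents: list[str], k: int = 1) -> list[str]:
    """Return the k documents that share the most words with the query."""
    query_words = tokenize(query)
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & tokenize(doc)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query: str, documents: list[str]) -> str:
    """Augment the user query with retrieved context before generation."""
    context = "\n".join(retrieve(query, documents))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


prompt_text = build_prompt("Where is the US Open held?", DOCUMENTS)
print(prompt_text)
```

The string returned by `build_prompt` is what would be sent to the LLM, so the model can ground its answer in the retrieved context instead of relying only on its training data.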

## What are AI agents?

At the core of agentic RAG systems are artificial intelligence (AI) agents. An AI agent is a system or program that can autonomously perform tasks on behalf of a user or another system by designing its own workflow and using available tools. Agentic technology implements tool use on the backend to obtain up-to-date information from various data sources, optimize workflows and create subtasks autonomously to solve complex tasks. These external tools can include external data sets, search engines, APIs and even other agents. Step by step, the agent reassesses its plan of action in real time and self-corrects.

## Agentic RAG vs Traditional RAG

Agentic RAG frameworks are powerful as they can encompass more than just one tool. In traditional RAG applications, the LLM is provided with a vector database to reference when forming its responses. In contrast, agentic RAG implementations are not restricted to document agents that only perform data retrieval. RAG agents can also have tools for tasks such as solving mathematical calculations, writing emails, performing data analysis and more. These tools can be supplemental to the agent's decision-making process. AI agents are context-aware in their multistep reasoning and can determine when to use appropriate tools.

AI agents, or intelligent agents, can also work collaboratively in multiagent systems, which tend to outperform single agents. This scalability and adaptability are what set agentic RAG systems apart from traditional RAG pipelines.


# Prerequisites

You need an OpenAI account to create an OpenAI API key for this project.

# Steps

## Step 1. Set up your environment

1. Log in to OpenAI using your OpenAI account.

2. Generate an API key and save it.


## Step 2. Install and import relevant libraries and set up your credentials

We'll need a few libraries and modules for this tutorial. Make sure to import the following ones; if they're not installed, you can resolve this with a quick pip installation. 

Common Python frameworks for building agentic RAG systems include LangChain and LlamaIndex. In this tutorial, we will be using LangChain.  


In [1]:
# NECESSARY INSTALLATION
%pip install langchain | tail -n 1
%pip install langchain-community | tail -n 1
%pip install langchain-openai | tail -n 1  # provides the langchain_openai imports used below
%pip install openai | tail -n 1
%pip install chromadb | tail -n 1
%pip install tiktoken | tail -n 1
%pip install beautifulsoup4 | tail -n 1


Note: you may need to restart the kernel to use updated packages.


In [2]:
# IMPORTS
import getpass  # securely prompt the user for sensitive information such as API keys

# OpenAI and LangChain imports
from langchain_openai import OpenAIEmbeddings, ChatOpenAI  # vector embeddings and access to OpenAI's chat models
from langchain_community.vectorstores import Chroma  # vector store
from langchain_community.document_loaders import WebBaseLoader  # loads web content for processing and indexing in the RAG pipeline
from langchain.text_splitter import RecursiveCharacterTextSplitter  # splits large texts into smaller chunks
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder  # helps structure prompts for LLM interactions
from langchain.prompts import PromptTemplate  # prompt customization
from langchain.tools import tool  # decorator that turns a function into an agent tool
from langchain.tools.render import render_text_description_and_args  # renders tool names, descriptions and arguments as text
from langchain.agents.output_parsers import JSONAgentOutputParser  # parses the agent's JSON output
from langchain.agents.format_scratchpad import format_log_to_str  # formats intermediate steps for the agent scratchpad
from langchain.agents import AgentExecutor  # executes agent actions based on provided tools and LLMs
from langchain.memory import ConversationBufferMemory  # memory management
from langchain_core.runnables import RunnablePassthrough  # passes inputs through the pipeline, optionally adding new keys


USER_AGENT environment variable not set, consider setting it to identify your requests.


In [4]:
import os

# Set your OpenAI API key
os.environ["OPENAI_API_KEY"] = getpass.getpass("Please enter your OpenAI API key (hit enter): ")

print("OpenAI API-key has been set successfully!!")


OpenAI API-key has been set successfully!!
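If you rerun the notebook often, you may prefer to prompt for the key only when it is not already set. The following is a small sketch of that idea; the `ensure_api_key` helper and its parameters are hypothetical names introduced for this example and are not part of LangChain or the OpenAI SDK.

```python
from typing import Callable, MutableMapping


def ensure_api_key(
    env: MutableMapping[str, str],
    ask: Callable[[str], str],
    name: str = "OPENAI_API_KEY",
) -> str:
    """Return the key stored under `name`, prompting only when it is missing."""
    key = env.get(name, "").strip()
    if not key:
        key = ask(f"Please enter your {name}: ").strip()
        if not key:
            raise ValueError(f"{name} must not be empty")
        env[name] = key
    return key


# In the notebook you would call:
#   import os, getpass
#   ensure_api_key(os.environ, getpass.getpass)
# Here a plain dict and a stub prompt keep the example non-interactive.
demo_env: dict[str, str] = {}
ensure_api_key(demo_env, lambda _prompt: "sk-demo-not-a-real-key")
print("Key stored:", "OPENAI_API_KEY" in demo_env)
```

Passing the environment mapping and the prompt function as arguments keeps the helper easy to test without touching the real environment.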


## Step 4. Initialize a basic agent with no tools

This step is important as it will produce a clear example of an agent's behavior with and without external data sources. Let's start by setting our parameters. You can change the parameters as needed.



In [12]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="gpt-4o-mini",  # or another OpenAI model like "gpt-3.5-turbo", "gpt-4o"
    temperature=0,
    max_tokens=250,
    stop=["Human:", "Observation"],
)


We'll set up a prompt template in case you want to ask multiple questions. 

In [13]:
#Prompt Template
template = "Answer the {query} accurately. If you do not know the answer, simply say you do not know."
prompt = PromptTemplate.from_template(template)
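To see what the template does with the query, the sketch below fills in the placeholder with Python's built-in `str.format`. This is an approximation for illustration: `PromptTemplate` uses f-string-style placeholders by default, but the `render` helper here is a hypothetical stand-in, not LangChain's implementation.

```python
# Same template text as in the cell above.
template = (
    "Answer the {query} accurately. "
    "If you do not know the answer, simply say you do not know."
)


def render(template_text: str, **variables: str) -> str:
    """Substitute named placeholders such as {query} with their values."""
    return template_text.format(**variables)


rendered = render(template, query="What sport is played at the US Open?")
print(rendered)
```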

And now we can set up a chain with our prompt and our LLM. This allows the generative model to produce a response.

In [14]:
agent = prompt | llm

Let's test to see how our agent responds to a basic query. 

In [15]:

agent.invoke({"query": "What sport is played at the US Open?"})

RateLimitError: Error code: 429 - {'error': {'message': 'You exceeded your current quota, please check your plan and billing details. For more information on this error, read the docs: https://platform.openai.com/docs/guides/error-codes/api-errors.', 'type': 'insufficient_quota', 'param': None, 'code': 'insufficient_quota'}}

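With a working API key and available quota, the agent should answer this basic query correctly. The run captured above instead stopped with a 429 `insufficient_quota` error, which indicates that the OpenAI account had exhausted its quota (see the plan and billing details) rather than a problem in the code. In the next step of this tutorial, we will be creating a RAG tool for the agent to access relevant information about IBM's involvement in the 2024 US Open.
As we have covered, traditional LLMs cannot obtain current information on their own. Let's verify this.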

In [16]:
agent.invoke({"query": "Where was the 2024 US Open Tennis Championship?"})

RateLimitError: Error code: 429 - {'error': {'message': 'You exceeded your current quota, please check your plan and billing details. For more information on this error, read the docs: https://platform.openai.com/docs/guides/error-codes/api-errors.', 'type': 'insufficient_quota', 'param': None, 'code': 'insufficient_quota'}}

With quota available, the LLM is expected to be unable to provide the relevant information (this run, too, ended in the `insufficient_quota` error shown above). The training data used for this model predates the 2024 US Open, and without the appropriate tools the agent does not have access to this information.

## Step 5. Establish the knowledge base and retriever

The first step in creating the knowledge base is listing the URLs we will be extracting content from. In this case, our data source is IBM's online content summarizing IBM’s involvement in the 2024 US Open. The relevant URLs are listed in the `urls` list.

In [17]:
#Creating Knowledge Base
urls = [
    "https://www.ibm.com/case-studies/us-open",
    "https://www.ibm.com/sports/usopen",
    "https://newsroom.ibm.com/US-Open-AI-Tennis-Fan-Engagement",
    "https://newsroom.ibm.com/2024-08-15-ibm-and-the-usta-serve-up-new-and-enhanced-generative-ai-features-for-2024-us-open-digital-platforms",
]

Next, load the documents using LangChain `WebBaseLoader` for the URLs we listed. We'll also print a sample document to see how it loaded.

In [18]:
docs = [WebBaseLoader(url).load() for url in urls]
docs_list = [item for sublist in docs for item in sublist]
docs_list[0]

Document(metadata={'source': 'https://www.ibm.com/case-studies/us-open', 'title': 'U.S. Open | IBM', 'description': 'To help the US Open stay on the cutting edge of customer experience, IBM Consulting built powerful generative AI models with watsonx.', 'language': 'en'}, page_content='\n\n\n\n\n\n\n\n\nU.S. Open | IBM\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nHome\n\n\n\n\nCase Studies\n\n\n\nUS Open \n\n\n\n\n                \n\n\n\n  \n    Acing the US Open digital experience\n\n\n\n\n\n\n    \n\n\n            \n\n                    \n\n\n  \n  \n      AI models built with watsonx transform data into insight\n  \n\n\n\n\n    \n\n\n                \n\n\nGet the latest AI and tech insights\n\n\n\nLearn More\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nFor two weeks at the end of summer, nearly one million people make the journey to Flushing, New York, 

To split the data in these documents into chunks that can be processed by the LLM, we can use a text splitter such as `RecursiveCharacterTextSplitter`. This text splitter tries the following separators in order: ["\n\n", "\n", " ", ""]. The intention is to keep related text together for as long as possible: paragraphs first, then sentences, then words.

Once the text splitter is initialized, we can apply it to our `docs_list`.

In [19]:
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=250, chunk_overlap=0
)
doc_splits = text_splitter.split_documents(docs_list)
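To make the splitting strategy concrete, here is a simplified pure-Python sketch of recursive splitting. It is an assumption-laden illustration rather than LangChain's implementation: it measures `chunk_size` in characters instead of tiktoken tokens, ignores overlap, and the `recursive_split` function is a hypothetical helper.

```python
def recursive_split(text: str, chunk_size: int, separators: list[str]) -> list[str]:
    """Split text into pieces of at most chunk_size characters.

    The coarsest separator is tried first; finer separators are used only
    for pieces that are still too long.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    separator, finer = separators[0], separators[1:]
    chunks: list[str] = []
    current = ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces together
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


sample_text = (
    "First paragraph here.\n\n"
    "Second paragraph is a bit longer than the first.\n\n"
    "Third."
)
chunks = recursive_split(sample_text, 40, ["\n\n", "\n", " ", ""])
for chunk in chunks:
    print(repr(chunk))
```

Short paragraphs stay whole, while the long second paragraph is split at word boundaries because it contains no single newlines to split on.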

The embedding model that we are using is OpenAIEmbeddings. Let's initialize it.

In [20]:
embeddings = OpenAIEmbeddings(
    model="text-embedding-ada-002",  # OpenAI's text embedding model
    openai_api_key=os.getenv("OPENAI_API_KEY")  # Retrieve API key from environment variable
)

print("OpenAI embeddings have been initialized successfully!")

OpenAI embeddings have been initialized successfully!


In order to store our embedded documents, we will use Chroma DB, an open source vector store. 

In [22]:
# Initialize the Chroma vector store with the documents and OpenAI embeddings
vectorstore = Chroma.from_documents(
    documents=doc_splits,  # The document splits from the previous step
    collection_name="agentic-rag-chroma",  # Name of the vector store collection
    embedding=embeddings  # The OpenAI embeddings initialized earlier
)
print("Chroma vector store has been created successfully!")


RateLimitError: Error code: 429 - {'error': {'message': 'You exceeded your current quota, please check your plan and billing details. For more information on this error, read the docs: https://platform.openai.com/docs/guides/error-codes/api-errors.', 'type': 'insufficient_quota', 'param': None, 'code': 'insufficient_quota'}}

To access information in the vector store, we must set up a retriever. 

In [23]:
retriever = vectorstore.as_retriever()

NameError: name 'vectorstore' is not defined
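Conceptually, a retriever embeds the query and returns the stored chunks whose vectors are closest to it. The sketch below shows this with hand-written three-dimensional vectors and cosine similarity. The vectors, texts and function names are made up for illustration, and cosine similarity is only one common choice: the distance metric that Chroma actually uses depends on how the collection is configured.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def similarity_search(
    query_vector: list[float],
    index: list[tuple[str, list[float]]],
    k: int = 1,
) -> list[str]:
    """Return the texts of the k entries most similar to the query vector."""
    ranked = sorted(
        index,
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Hypothetical "embeddings"; real ones have hundreds or thousands of dimensions.
index = [
    ("IBM and the US Open", [0.9, 0.1, 0.0]),
    ("Capital cities of Europe", [0.0, 0.2, 0.9]),
    ("Tennis scoring rules", [0.6, 0.7, 0.1]),
]
query_vector = [1.0, 0.2, 0.0]
print(similarity_search(query_vector, index, k=2))
```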

## Step 6. Define the agent's RAG tool

Let's define the `get_IBM_US_Open_context()` tool our agent will be using. This tool's only parameter is the user query. The docstring serves as the tool description, which tells the agent what the tool is for and therefore when to call it. The agentic RAG system can use this tool to route the user query to the vector store if it pertains to IBM’s involvement in the 2024 US Open.

In [24]:
# Define the tool for retrieving IBM's US Open context
@tool
def get_IBM_US_Open_context(question: str):
    """Get context about IBM's involvement in the 2024 US Open Tennis Championship."""
    
    # Query the OpenAI vector store for relevant context
    context = vectorstore.similarity_search(question, k=1)  # Adjust `k` as needed
    return context

# Register the tool
tools = [get_IBM_US_Open_context]
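The role of the docstring becomes clearer with a stripped-down stand-in for a tool decorator. The sketch below only mirrors the general idea of turning a function into an object with a name and a description; `SimpleTool`, `simple_tool` and `render_tools` are hypothetical names for this example and do not reproduce LangChain's `@tool` implementation.

```python
import inspect
from dataclasses import dataclass
from typing import Callable


@dataclass
class SimpleTool:
    """A callable bundled with the metadata an agent needs to choose it."""
    name: str
    description: str
    func: Callable[[str], str]

    def run(self, tool_input: str) -> str:
        return self.func(tool_input)


def simple_tool(func: Callable[[str], str]) -> SimpleTool:
    """Wrap a function, using its name and docstring as tool metadata."""
    description = inspect.getdoc(func)
    if not description:
        raise ValueError("a tool needs a docstring so the agent knows when to use it")
    return SimpleTool(name=func.__name__, description=description, func=func)


@simple_tool
def get_us_open_context(question: str) -> str:
    """Get context about IBM's involvement in the 2024 US Open Tennis Championship."""
    # Placeholder body; the real tool queries the vector store.
    return f"(retrieved context for: {question})"


def render_tools(tools: list[SimpleTool]) -> str:
    """Produce the plain-text tool list that would be placed in the prompt."""
    return "\n".join(f"{t.name}: {t.description}" for t in tools)


print(render_tools([get_us_open_context]))
```

The rendered text is the only information the LLM has about the tool, which is why a precise description matters for tool selection.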


## Step 7. Establish the prompt template

Next, we will set up a new prompt template to ask multiple questions. This template is more complex. It is referred to as a [structured chat prompt](https://api.python.langchain.com/en/latest/agents/langchain.agents.structured_chat.base.create_structured_chat_agent.html#langchain-agents-structured-chat-base-create-structured-chat-agent) and can be used for creating agents that have multiple tools available. In our case, the tool we are using was defined in Step 6. The structured chat prompt will be made up of a `system_prompt`, a `human_prompt` and our RAG tool. 

First, we will set up the `system_prompt`. This prompt instructs the agent to print its "thought process," which involves the agent's subtasks, the tools that were used and the final output. This gives us insight into the agent's function calling. The prompt also instructs the agent to return its responses as a JSON blob.

In [25]:
# Breakdown:
# - Tools and actions: the system prompt lists the available tools (here, the `get_IBM_US_Open_context` tool). The valid actions are either "Final Answer" or a tool name.
# - Structured chat format: the agent thinks through the question, decides which tool to use and formats its response as JSON in the defined structure.
# - Multiple thoughts and actions: the agent can iterate over several thoughts and actions to reach the final answer.
# - Final answer: once the agent is confident, it outputs the final answer to the user.
# This system prompt gives the agent clear instructions for reasoning and tool use, so that it returns the most relevant response in a structured format.

system_prompt = """Respond to the human as helpfully and accurately as possible. You have access to the following tools: {tools}
Use a json blob to specify a tool by providing an action key (tool name) and an action_input key (tool input).
Valid "action" values: "Final Answer" or {tool_names}
Provide only ONE action per $JSON_BLOB, as shown:
```
{{
  "action": $TOOL_NAME,
  "action_input": $INPUT
}}
```
Follow this format:
Question: input question to answer
Thought: consider previous and subsequent steps
Action:
```
$JSON_BLOB
```
Observation: action result
... (repeat Thought/Action/Observation N times)
Thought: I know what to respond
Action:
```
{{
  "action": "Final Answer",
  "action_input": "Final response to human"
}}
Begin! Reminder to ALWAYS respond with a valid json blob of a single action.
Respond directly if appropriate. Format is Action:```$JSON_BLOB```then Observation"""
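To illustrate what happens to a reply written in this format, the sketch below pulls the action and its input out of a sample model response. It is a simplified assumption of how such parsing can work, not the code of `JSONAgentOutputParser`; the `parse_action` helper and the sample reply are hypothetical.

```python
import json

FENCE = "`" * 3  # the triple-backtick delimiter the prompt asks the model to use


def parse_action(model_output: str) -> tuple[str, object]:
    """Extract (action, action_input) from the single JSON blob in the output."""
    start = model_output.find("{")
    end = model_output.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON blob found in model output")
    blob = json.loads(model_output[start:end + 1])
    return blob["action"], blob["action_input"]


# Hypothetical model reply following the Thought/Action format.
sample_reply = (
    "Thought: I should look this up.\n"
    "Action:\n"
    f"{FENCE}\n"
    '{"action": "get_IBM_US_Open_context", '
    '"action_input": {"question": "Where was the 2024 US Open?"}}\n'
    f"{FENCE}"
)

action, action_input = parse_action(sample_reply)
print(action, action_input)
```

An action of "Final Answer" would end the run, while any other action names the tool to call with `action_input`.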

In the following code, we are establishing the `human_prompt`. This prompt tells the agent to display the user input followed by the intermediate steps taken by the agent as part of the `agent_scratchpad`.

In [26]:
human_prompt = """{input}
{agent_scratchpad}
(reminder to always respond in a JSON blob)"""

Next, we establish the order of our newly defined prompts in the prompt template. We create this new template to feature the `system_prompt` followed by an optional list of messages collected in the agent's memory, if any, and finally, the `human_prompt` which includes both the human input and `agent_scratchpad`.

In [27]:
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        MessagesPlaceholder("chat_history", optional=True),
        ("human", human_prompt),
    ]
)

Now, let's finalize our prompt template by adding the tool names, descriptions and arguments using a [partial prompt template](https://python.langchain.com/v0.1/docs/modules/model_io/prompts/partial/). This allows the agent to access the information pertaining to each tool including the intended use cases and also means we can add and remove tools without altering our entire prompt template.

In [28]:
prompt = prompt.partial(
    tools=render_text_description_and_args(list(tools)),
    tool_names=", ".join([t.name for t in tools]),
)
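Two details of this step can be reproduced with the standard library: filling some placeholders now and the rest later, and the doubled braces (`{{` and `}}`) that keep literal JSON braces from being treated as placeholders. The sketch below uses `functools.partial` with `str.format` as a rough analogy; the shortened template and the tool dictionary are assumptions for this example, not the tutorial's actual prompt objects.

```python
from functools import partial

# Shortened stand-in for the system and human prompts.
template = (
    "You have access to the following tools: {tools}\n"
    'Valid "action" values: "Final Answer" or {tool_names}\n'
    '{{"action": $TOOL_NAME, "action_input": $INPUT}}\n'
    "Question: {input}"
)

# Hypothetical tool registry: name -> description.
tool_descriptions = {
    "get_IBM_US_Open_context": "Get context about IBM's involvement in the 2024 US Open.",
}

# Fill in the tool-related placeholders once, up front.
fill_prompt = partial(
    template.format,
    tools="; ".join(f"{name}: {desc}" for name, desc in tool_descriptions.items()),
    tool_names=", ".join(tool_descriptions),
)

# Later, only the user input has to be supplied.
final_prompt = fill_prompt(input="Where was the 2024 US Open?")
print(final_prompt)
```

Because the tool text is generated from the registry, adding or removing a tool changes the prompt without editing the template itself.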

## Step 8. Set up the agent's memory and chain

An important feature of AI agents is their memory. Agents are able to store past conversations and past findings in their memory to improve the accuracy and relevance of their responses going forward. In our case, we will use LangChain's `ConversationBufferMemory()` as a means of memory storage. 

In [29]:
memory = ConversationBufferMemory()
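As a mental model, buffer memory is simply an append-only list of messages that is replayed into later prompts. The sketch below illustrates that idea; the `BufferMemory` class is a hypothetical simplification and not the `ConversationBufferMemory` implementation.

```python
from dataclasses import dataclass, field


@dataclass
class BufferMemory:
    """Keeps every turn of the conversation, in order."""
    messages: list[tuple[str, str]] = field(default_factory=list)

    def save_turn(self, user_input: str, ai_output: str) -> None:
        """Record one human message and the agent's reply."""
        self.messages.append(("human", user_input))
        self.messages.append(("ai", ai_output))

    def as_text(self) -> str:
        """Render the history as text that can be added to the next prompt."""
        return "\n".join(f"{role}: {content}" for role, content in self.messages)


demo_memory = BufferMemory()
demo_memory.save_turn("What sport is played at the US Open?", "Tennis.")
demo_memory.save_turn("Where is it held?", "In Flushing, New York.")
print(demo_memory.as_text())
```

Since nothing is ever dropped, the history grows with every turn, which is worth keeping in mind for long conversations and limited context windows.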



And now we can set up a chain with our agent's scratchpad, memory, prompt and the LLM. The AgentExecutor class is used to execute the agent. It takes the agent, its tools, error handling approach, verbose parameter and memory as parameters.

In [30]:
chain = (
    RunnablePassthrough.assign(
        agent_scratchpad=lambda x: format_log_to_str(x["intermediate_steps"]),
        chat_history=lambda x: memory.chat_memory.messages,
    )
    | prompt
    | llm
    | JSONAgentOutputParser()
)

agent_executor = AgentExecutor(
    agent=chain, tools=tools, handle_parsing_errors=True, verbose=True, memory=memory
)
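The loop that an executor runs can be sketched without any external services. The example below is a simplified assumption of that control flow, with a scripted function standing in for the LLM; `run_agent`, `scripted_llm` and the `lookup` tool are hypothetical, and the real `AgentExecutor` adds error handling, callbacks and memory on top.

```python
import json
from typing import Callable


def run_agent(
    question: str,
    llm: Callable[[str], str],
    tools: dict[str, Callable[[str], str]],
    max_steps: int = 5,
) -> str:
    """Alternate between model replies and tool calls until a final answer."""
    scratchpad = ""  # accumulated actions and observations
    for _ in range(max_steps):
        reply = llm(f"{question}\n{scratchpad}")
        blob = json.loads(reply)
        action, action_input = blob["action"], blob["action_input"]
        if action == "Final Answer":
            return str(action_input)
        if action in tools:
            observation = tools[action](action_input)
        else:
            observation = f"Unknown tool: {action}"
        # Feed the result back so the next reply can build on it.
        scratchpad += f"{reply}\nObservation: {observation}\nThought: "
    raise RuntimeError("agent did not finish within max_steps")


def scripted_llm(prompt: str) -> str:
    """Stand-in for a chat model: call the tool once, then answer."""
    if "Observation:" not in prompt:
        return json.dumps({"action": "lookup", "action_input": "2024 US Open location"})
    observation = prompt.split("Observation: ")[-1].split("\n")[0]
    return json.dumps({"action": "Final Answer", "action_input": observation})


demo_tools = {"lookup": lambda query: "Flushing, New York"}
answer = run_agent("Where was the 2024 US Open?", scripted_llm, demo_tools)
print(answer)
```

The `max_steps` limit is a design choice that keeps a confused model from looping forever.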

## Step 9. Generate responses with the agentic RAG system

We are now able to ask the agent questions. Recall the agent's previous inability to provide us with information pertaining to the 2024 US Open. Now that the agent has its RAG tool available to use, let's try asking the same questions again. 

In [31]:
agent_executor.invoke({"input": "Where was the 2024 US Open Tennis Championship?"})

> Entering new AgentExecutor chain...


RateLimitError: Error code: 429 - {'error': {'message': 'You exceeded your current quota, please check your plan and billing details. For more information on this error, read the docs: https://platform.openai.com/docs/guides/error-codes/api-errors.', 'type': 'insufficient_quota', 'param': None, 'code': 'insufficient_quota'}}

When the call succeeds, the agent is expected to use its RAG tool to return the location of the 2024 US Open, and the verbose output shows the exact document the agent retrieved its information from. (The run captured above stopped at the `insufficient_quota` error before reaching that point.) Now, let's try a slightly more complex query. This time, the query will be about IBM's involvement in the 2024 US Open.

In [32]:
agent_executor.invoke(
    {"input": "How did IBM use watsonx at the 2024 US Open Tennis Championship?"}
)

> Entering new AgentExecutor chain...


RateLimitError: Error code: 429 - {'error': {'message': 'You exceeded your current quota, please check your plan and billing details. For more information on this error, read the docs: https://platform.openai.com/docs/guides/error-codes/api-errors.', 'type': 'insufficient_quota', 'param': None, 'code': 'insufficient_quota'}}

Again, with quota available the agent should retrieve the information relevant to the user query. Each interaction is also saved to the agent's memory, which appears as the history field of the returned output.

Now, let's test if the agent can decipher when tool calling is not necessary to answer the user query. We can test this by asking the RAG agent a question that is not about the US Open. 

In [33]:
agent_executor.invoke({"input": "What is the capital of France?"})

> Entering new AgentExecutor chain...


RateLimitError: Error code: 429 - {'error': {'message': 'You exceeded your current quota, please check your plan and billing details. For more information on this error, read the docs: https://platform.openai.com/docs/guides/error-codes/api-errors.', 'type': 'insufficient_quota', 'param': None, 'code': 'insufficient_quota'}}

In a successful run, the AgentExecutor chain shows the agent recognizing that it can answer this question from the model's own knowledge, without using its tools.

## Summary

In this tutorial, you created a RAG agent using LangChain in Python with OpenAI. The LLM you worked with was the gpt-4o-mini model. The agent is designed to retrieve relevant information via the `get_IBM_US_Open_context` tool, update its memory with each interaction and output appropriate responses. It also decides whether tool calling is appropriate for each specific task: when it already has the information necessary to answer the input query, it responds without using any tools. The runs captured in this notebook ended in `insufficient_quota` errors, so reproducing the described outputs requires an OpenAI account with available quota.
