# Building RAG Chatbots with LangChain and Gemini

In this example, we'll build an AI chatbot from start to finish. We will use LangChain, Google's Gemini models, and the Chroma vector database to build a chatbot capable of learning from external documents using **R**etrieval **A**ugmented **G**eneration (RAG).

We will start by creating a simple conversational agent, see where it fails, and then enhance it with a knowledge base built from a PDF document. By the end of this notebook, you'll have a functioning RAG pipeline that can hold a conversation and provide informative, source-backed responses.

### Prerequisites

Before we start, we need to install the necessary Python libraries. Here's a quick overview of their roles:

-   **python-dotenv**: Loads environment variables, such as API keys, from a `.env` file.
-   **sentence-transformers**: Provides pretrained models for computing text embeddings locally.
-   **langchain**: The core framework we'll use to "chain" components together.
-   **langchain-community** & **langchain-google-genai**: Provide integrations for various LLMs and tools.
-   **langchain-chroma**: The integration for the Chroma vector database.
-   **langchain-huggingface**: Provides integrations for Hugging Face models, including our embedding model.
-   **pypdf**: A library to load and parse PDF files.

You can install these libraries using pip:

In [None]:
!pip install -qU \
    python-dotenv \
    sentence-transformers \
    langchain \
    langchain-community \
    langchain-chroma \
    langchain-google-genai \
    langchain-huggingface \
    pypdf

### Building a Chatbot (without RAG)

We'll begin by creating a simple chatbot without any retrieval augmentation. To do this, we'll initialize a `ChatGoogleGenerativeAI` object from LangChain. This requires a [Google API key](https://aistudio.google.com/apikey). If the key isn't already set as an environment variable, `getpass` prompts for it without echoing the input to the screen.

We can test that the model is working correctly by invoking it with a simple prompt.

In [1]:
import os
from langchain_google_genai import ChatGoogleGenerativeAI
from getpass import getpass

os.environ["GOOGLE_API_KEY"] = os.getenv("GOOGLE_API_KEY") or getpass(
    "Enter your Google API key: "
)

chat = ChatGoogleGenerativeAI(
    google_api_key=os.environ["GOOGLE_API_KEY"],
    model='gemini-2.5-flash'
)

chat.invoke("Write a short poem about pydata")

AIMessage(content="Where Python's serpent coils around,\nAnd data's mysteries are found,\nPyData gathers, bright and keen,\nA vibrant, knowledge-sharing scene.\n\nFrom Pandas frames to plots so grand,\nWith NumPy's might in every hand,\nWe parse, predict, and visualize,\nNew patterns rising to our eyes.\n\nA community, strong and true,\nWhere insights bloom and friendships too.\nFor open source, a common goal,\nMaking data make us whole.", additional_kwargs={}, response_metadata={'prompt_feedback': {'block_reason': 0, 'safety_ratings': []}, 'finish_reason': 'STOP', 'model_name': 'gemini-2.5-flash', 'safety_ratings': []}, id='run--a66ea015-4e00-472c-a85c-81e9fafd075b-0', usage_metadata={'input_tokens': 8, 'output_tokens': 1180, 'total_tokens': 1188, 'input_token_details': {'cache_read': 0}, 'output_token_details': {'reasoning': 1077}})

### Structuring a Conversation

Chats with generative models are structured as a sequence of messages, each with a specific role. This allows the model to understand the context of the conversation.

In LangChain, we use message objects to represent this structure:
*   `SystemMessage`: Sets the overall behavior and personality of the assistant (e.g., "You are a helpful assistant.").
*   `HumanMessage`: Represents a prompt from the user.
*   `AIMessage`: Represents a response from the model.

Let's create a short conversation history to see how this works.

In [2]:
from langchain_core.messages import (
    SystemMessage,
    HumanMessage,
    AIMessage
)

messages = [
    SystemMessage(content="You are a helpful assistant."),
    HumanMessage(content="Hi AI, how are you today?"),
    AIMessage(content="I'm great thank you. How can I help you?"),
    HumanMessage(content="I'd like to understand what an LLM is.")
]

In [3]:
res = chat.invoke(messages)
res.content

'An LLM, or **Large Language Model**, is a type of artificial intelligence (AI) program designed to understand, generate, and process human language.\n\nLet\'s break down each part of the name:\n\n1.  **Large:** This refers to two main things:\n    *   **Massive Datasets:** LLMs are trained on enormous amounts of text and code data – billions, even trillions, of words from books, articles, websites, conversations, and more. This vast exposure helps them learn the nuances of language.\n    *   **Billions of Parameters:** They have an incredibly large number of "parameters," which are like the internal knobs and dials that the model adjusts during training to learn patterns and relationships in the data. More parameters generally mean a more complex and capable model.\n\n2.  **Language:** This indicates their primary focus: human language. They are designed to work with, understand, and produce text, whether it\'s written or spoken (after being converted to text).\n\n3.  **Model:** This 

### Maintaining Conversational Context

To have a real conversation, the chatbot needs to remember what has already been said. We can achieve this by appending the latest AI response and the new user prompt to our list of messages before making the next call.

This simple loop of "prompt -> get response -> append history" forms the basis of any stateful chatbot.

In [None]:
# add latest AI response to messages
messages.append(res)

# now create a new user prompt
prompt = HumanMessage(
    content="What do you think pydata members can use LLMs for?"
)
# add to messages
messages.append(prompt)

# send to gemini
res = chat.invoke(messages)

print(res.content)

PyData members, with their strong foundation in Python, data analysis, machine learning, and scientific computing, are uniquely positioned to leverage LLMs in highly effective ways. LLMs can act as powerful assistants, knowledge bases, and even collaborators, significantly boosting productivity and enabling new kinds of analyses.

Here are numerous ways PyData members can use LLMs:

### 1. Code Assistance & Development

*   **Code Generation:**
    *   **Boilerplate Code:** Quickly generate functions, classes, or scripts for common tasks (e.g., a function to load data from a CSV, a simple Flask API endpoint, a basic `matplotlib` plot).
    *   **Specific Library Usage:** Ask for examples of how to use a particular function in Pandas, NumPy, Scikit-learn, or PyTorch for a specific task. "Show me how to perform a group-by aggregation with multiple columns in Pandas."
    *   **SQL/Query Generation:** Generate complex SQL queries based on natural language descriptions, which is invaluable
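
The "prompt -> get response -> append history" loop used above can be wrapped in a small helper. The sketch below is only an illustration: `chat_turn`, `EchoModel`, and `Reply` are hypothetical names, and the stub model simply echoes the last message so the loop can run without an API key. It stores history as `(role, content)` tuples rather than message objects to stay free of dependencies.

```python
from dataclasses import dataclass


@dataclass
class Reply:
    """Stand-in for a model response; only `.content` is used here."""
    content: str


class EchoModel:
    """Hypothetical stub so the loop runs offline; it echoes the last message."""

    def invoke(self, messages):
        _role, text = messages[-1]
        return Reply(content=f"You said: {text}")


def chat_turn(model, history, user_text):
    """Run one turn: append the prompt, call the model, append the reply."""
    history.append(("human", user_text))
    reply = model.invoke(history)
    history.append(("ai", reply.content))
    return reply.content


history = [("system", "You are a helpful assistant.")]
stub = EchoModel()
first = chat_turn(stub, history, "Hi AI, how are you today?")
second = chat_turn(stub, history, "What is an LLM?")
print(len(history))  # system message plus two prompt/reply pairs
```

LangChain chat models generally accept `(role, content)` tuples as well as message objects, so passing the `chat` object from earlier in place of the stub should work, but treat that as an assumption to verify against your installed version.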

### Dealing with Hallucinations and Knowledge Cutoffs

We have a functioning chatbot, but its knowledge is limited to what it learned during its training. We call this the model's *parametric knowledge*. This means it has no access to real-time information or very recent developments.

This limitation becomes clear when we ask about something new, like the "Kimi K2" model. The model may either admit it doesn't know or, more problematically, "hallucinate" an answer that sounds plausible but is incorrect.

In [None]:
# add latest AI response to messages
messages.append(res)

# now create a new user prompt
prompt = HumanMessage(
    content="What is so special about Kimi K2?"
)
# add to messages
messages.append(prompt)

# send to gemini
res = chat.invoke(messages)

In [6]:
print(res.content)

Kimi K2, developed by Chinese AI startup **Moonshot AI**, is primarily special for its **exceptionally long context window**.

Let's break down why this is a significant differentiator:

1.  **Massive Context Window:**
    *   Kimi K2 boasts a context window of **200,000 characters (or tokens, roughly equivalent to 200,000 Chinese characters or 200,000 English tokens)**.
    *   **Why is this special?** At the time of its release and even now, this was significantly larger than many leading models. For comparison, early versions of GPT-4 had context windows of 8k or 32k tokens, and while Claude 2.1 expanded to 200k tokens, Kimi was among the first to truly push this boundary for a widely accessible model.
    *   **What does "context window" mean?** It's the amount of information (text, code, data) that the LLM can "see" and process at any given moment to understand your query and generate a response. It's like its short-term memory and reading comprehension limit.

2.  **Implications 

### Augmenting with Source Knowledge

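The chatbot's answer is likely generic or incorrect because "Kimi K2" is too new for its training data. In the run shown above, for example, the response describes a 200,000-token context window, whereas the facts we supply below give a context length of 128K. We can solve this by providing the necessary information directly in the prompt. This is called *source knowledge*.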

Let's define a block of text containing key facts about Kimi K2.

In [7]:
# Join the facts into a single string; a bare parenthesised, comma-separated
# list would create a tuple and be rendered as a tuple repr in the prompt.
source_knowledge = "\n".join([
    "Kimi K2 is a state-of-the-art mixture-of-experts (MoE) language model with 32 billion activated parameters and 1 trillion total parameters.",
    "Trained with the Muon optimizer, Kimi K2 achieves exceptional performance across frontier knowledge, reasoning, and coding tasks while being meticulously optimized for agentic capabilities.",
    "It was pre-trained on 15.5T tokens with zero training instability.",
    "The model is specifically designed for tool use, reasoning, and autonomous problem-solving.",
    "Kimi K2 comes in two variants: Kimi-K2-Base, the foundation model for custom solutions, and Kimi-K2-Instruct, a post-trained model for general-purpose chat and agentic experiences.",
    "Its architecture is a Mixture-of-Experts (MoE) with 1 trillion total parameters and 32 billion activated parameters.",
    "The model has 61 layers, including 1 dense layer, an attention hidden dimension of 7168, and a MoE hidden dimension of 2048 per expert.",
    "It features 384 experts with 8 selected per token, 1 shared expert, and 64 attention heads.",
    "The model has a vocabulary size of 160K, a context length of 128K, and uses MLA for its attention mechanism and SwiGLU as its activation function.",
])

Now, we can create an *augmented prompt* that includes instructions for the model, our source knowledge, and the original query. This tells the LLM to base its answer on the provided context.

In [8]:
query = "What is so special about Kimi K2?"

augmented_prompt = f"""Using the contexts below, answer the query.

Contexts:
{source_knowledge}

Query: {query}"""

Now, let's pass this new, augmented prompt to the model. Notice that we are creating a new `HumanMessage` with this detailed content.

In [None]:
# add latest AI response to messages so the roles keep alternating
messages.append(res)

# create a new user prompt
prompt = HumanMessage(
    content=augmented_prompt
)
# add to messages
messages.append(prompt)

# send to gemini
res = chat.invoke(messages)

In [10]:
print(res.content)

Kimi K2 is special for several reasons, primarily due to its advanced architecture and optimization for specific capabilities:

1.  **State-of-the-Art Mixture-of-Experts (MoE) Architecture:** It's an MoE model with a massive scale, featuring 1 trillion total parameters and 32 billion activated parameters. This allows for efficiency while maintaining high capability.
2.  **Exceptional Performance:** It achieves outstanding results across frontier knowledge, reasoning, and coding tasks.
3.  **Optimized for Agentic Capabilities:** Kimi K2 is meticulously optimized and specifically designed for tool use, reasoning, and autonomous problem-solving, making it highly suitable for agent-based AI applications.
4.  **Massive and Stable Training:** It was pre-trained on an enormous 15.5 trillion tokens with zero training instability, indicating a robust and high-quality training process.
5.  **Large Context Length:** It boasts a significant context length of 128K, enabling it to process and unders

The quality of this answer is significantly better because it's grounded in the facts we provided. This demonstrates the core idea of **Retrieval-Augmented Generation**. The main challenge is: how do we find and provide this context automatically? The answer is to build a knowledge base with a vector store.

---

### Building the Knowledge Base

To automate the retrieval of source knowledge, we need to create a searchable knowledge base. This involves two key components:

1.  **An Embedding Model**: This model converts text into numerical vectors (embeddings), capturing its semantic meaning.
2.  **A Vector Store**: This is a specialized database designed to store and efficiently search these vectors.

First, we'll set up our embedding model. We're using `all-mpnet-base-v2`, a widely used general-purpose sentence-embedding model from Hugging Face that maps text to 768-dimensional vectors. The `HuggingFaceEmbeddings` class from LangChain lets us run it locally.

In [11]:
from langchain_huggingface import HuggingFaceEmbeddings

model_name = "sentence-transformers/all-mpnet-base-v2"
model_kwargs = {'device': 'cpu'}
encode_kwargs = {'normalize_embeddings': False}
embeddings = HuggingFaceEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs
)


Next, we initialize our vector store. We'll use **Chroma**, a lightweight, open-source and fast vector database that can run entirely in-memory or be persisted to disk. We'll give our collection a name and tell it which embedding function to use.

In [13]:
from langchain_chroma import Chroma

vector_store = Chroma(
    collection_name="example_collection",
    embedding_function=embeddings,
    persist_directory="./chroma_langchain_db",  # Where to save data locally, remove if not necessary
)

Now it's time to populate our knowledge base. Instead of a small text snippet, we'll use an entire [PDF document](https://arxiv.org/pdf/2501.17805). The process involves three steps:

1.  **Load**: Use `PyPDFLoader` to load the content of the PDF.
2.  **Split**: A full document is too large to embed and retrieve as a single unit, and it may not fit into a model's context window. We use `RecursiveCharacterTextSplitter` to break the document into smaller, overlapping chunks. The overlap reduces the chance that a sentence or idea is cut off at a chunk boundary.
3.  **Add**: Add the processed chunks to our vector store. Chroma will automatically use our embedding model to convert each chunk into a vector and index it.
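
To make the idea of overlapping chunks concrete, here is a deliberately simplified sketch. The hypothetical `split_with_overlap` function cuts text into fixed-size character windows that share a few characters with their neighbours. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter tries to break on separators such as paragraphs and sentences first); it only illustrates what a chunk size and a chunk overlap mean.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut `text` into fixed-size windows that share `chunk_overlap` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk)
```

Each chunk begins with the last four characters of the previous one, so text near a boundary appears in both chunks.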

In [15]:
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# 1. Load the PDF
loader = PyPDFLoader("pdfs/2501.17805v1.pdf")
documents = loader.load()

# 2. Split the PDF into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
docs = text_splitter.split_documents(documents)

# 3. Add the chunks to the vector store
vector_store.add_documents(docs)
 
print("Successfully loaded, split, and added the PDF to the vector store.")

Successfully loaded, split, and added the PDF to the vector store.
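
Because the vector store is persisted to disk, re-running the cell above may add the same chunks a second time. One possible way to guard against this is to give each chunk a deterministic ID. The sketch below is an assumption-laden illustration: `stable_chunk_id` is a hypothetical helper, and the tuples stand in for the source path, page number, and text of each split document.

```python
import hashlib


def stable_chunk_id(source: str, page: int, text: str) -> str:
    """Derive a deterministic ID from where a chunk came from and what it says."""
    digest = hashlib.sha256(f"{source}|{page}|{text}".encode("utf-8"))
    return digest.hexdigest()[:32]


# Hypothetical chunks standing in for (source, page, page_content) values.
chunks = [
    ("pdfs/report.pdf", 0, "First chunk of text."),
    ("pdfs/report.pdf", 0, "Second chunk of text."),
]
ids = [stable_chunk_id(*chunk) for chunk in chunks]
print(ids)
```

With real documents, the IDs could be built from each document's metadata and `page_content` and passed as `ids=ids` to `vector_store.add_documents`. Whether an existing ID is overwritten or rejected depends on the vector store and its version, so check that behavior before relying on it.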


### Retrieval Augmented Generation

Our knowledge base is now ready. We can perform a similarity search to find the document chunks most relevant to a given query. This search works by embedding the query and finding the vectors in our database that are closest to it in the high-dimensional space.
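
As a rough illustration of what "closest" means, the sketch below ranks a few made-up three-dimensional vectors by cosine similarity to a query vector. The vectors, texts, and the `top_k` helper are all hypothetical, and cosine similarity is used here only as an example; the distance metric Chroma actually applies depends on how the collection is configured.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, indexed, k):
    """Return the k (text, score) pairs most similar to the query vector."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in indexed]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Made-up 3-dimensional "embeddings"; real ones come from the embedding model.
indexed = [
    ("chunk about risk management", [0.9, 0.1, 0.0]),
    ("chunk about model training", [0.1, 0.9, 0.1]),
    ("chunk about governance", [0.7, 0.2, 0.3]),
]
query_vec = [1.0, 0.0, 0.1]

for text, score in top_k(query_vec, indexed, k=2):
    print(f"{score:.3f}  {text}")
```

A real vector store performs the same kind of ranking, but over hundreds of dimensions and with an index that avoids comparing the query against every stored vector.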

Let's test it out.

In [14]:
query = "What are 5 points from the report on what makes risk management for general-purpose AI particularly difficult?"

results = vector_store.similarity_search(query, k=3)
# use `doc` as the loop variable so the earlier `res` response is not overwritten
for doc in results:
    print(f"*** {doc.page_content} \n")

*** and ensure various risk activities 
(i.e. all of the above) are 
cohesively structured and 
aligned, risk roles and 
responsibilities are clearly 
defined, and checks and 
balances are in place to avoid 
silos and manage conflicts of 
interest. 
In other safety critical 
industries, the Three Lines of 
Defence framework – 
separating risk ownership, 
oversight and audit – is 
widely used and can be 
usefully applied to advanced 
AI companies (954, 955) 
 
Table 3.1: Several practices and mechanisms, organised by five stages of risk management, can help manage the broad 
range of risks posed by general-purpose AI. 
 
Documentation and institutional transparency mechanisms, together with information sharing 
practices, play an important role in managing the risks of general-purpose AI and facilitating 
external scrutiny. It has become common practice to test models before release, including via 

*** performance. 
 
This section covers six general technical challenges that can make r

The search successfully retrieved relevant passages from the PDF. Now we can automate the process we performed manually earlier. We'll create a function `augment_prompt` that:
1.  Takes a user's query.
2.  Performs a similarity search on our `vector_store`.
3.  Joins the content of the retrieved chunks into a single context block.
4.  Constructs a detailed prompt for the LLM, including instructions, the retrieved context, and the original query.

In [None]:
def augment_prompt(query: str):
    # get top 5 results from knowledge base
    results = vector_store.similarity_search(query, k=5)
    # get the text from the results
    source_knowledge = "\n".join([x.page_content for x in results])
    # feed into an augmented prompt
    augmented_prompt = f"""Using the contexts below, answer the query.
    ---
    Contexts:
    {source_knowledge}
    ---
    Query: {query}"""

    return augmented_prompt
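
One cosmetic detail: because the f-string in `augment_prompt` sits inside an indented function body, the leading spaces on each template line become part of the prompt text. This is usually harmless, but if a cleaner prompt is preferred, the text can be assembled from a list of lines instead. The sketch below is one possible variant; `build_prompt` is a hypothetical helper that takes already-retrieved chunk texts so it can run without a vector store.

```python
def build_prompt(chunks, query):
    """Assemble the augmented prompt line by line so no code indentation leaks in."""
    context = "\n".join(chunks)
    lines = [
        "Using the contexts below, answer the query.",
        "---",
        "Contexts:",
        context,
        "---",
        f"Query: {query}",
    ]
    return "\n".join(lines)


example = build_prompt(
    ["First retrieved chunk.", "Second retrieved chunk."],
    "What does the report say?",
)
print(example)
```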

Let's see what the final, augmented prompt looks like before sending it to the model.

In [16]:
print(augment_prompt(query))

Using the contexts below, answer the query.
    ---
    Contexts:
    and ensure various risk activities 
(i.e. all of the above) are 
cohesively structured and 
aligned, risk roles and 
responsibilities are clearly 
defined, and checks and 
balances are in place to avoid 
silos and manage conflicts of 
interest. 
In other safety critical 
industries, the Three Lines of 
Defence framework – 
separating risk ownership, 
oversight and audit – is 
widely used and can be 
usefully applied to advanced 
AI companies (954, 955) 
 
Table 3.1: Several practices and mechanisms, organised by five stages of risk management, can help manage the broad 
range of risks posed by general-purpose AI. 
 
Documentation and institutional transparency mechanisms, together with information sharing 
practices, play an important role in managing the risks of general-purpose AI and facilitating 
external scrutiny. It has become common practice to test models before release, including via
performance. 
 
This sec

This prompt now contains everything the model needs to generate a high-quality, fact-based answer. We'll create a new, clean message list for this RAG-powered query to avoid confusing the model with our previous, non-RAG conversation.

In [17]:
# start a fresh history containing only the system message
rag_messages = [SystemMessage(content="You are a helpful assistant.")]

# create the augmented user prompt and add it to the history
prompt = HumanMessage(content=augment_prompt(query))
rag_messages.append(prompt)

res = chat.invoke(rag_messages)

print(res.content)

Based on the provided contexts, here are 5 points that make risk management for general-purpose AI particularly difficult:

1.  **Significant Gaps in Validation, Standardization, and Implementation:** There are considerable gaps globally in validating, standardizing, and implementing risk management frameworks and practices, especially for identifying and mitigating unprecedented risks.
2.  **Rapid Evolution of Technology:** The technology's rapid evolution makes the context of general-purpose AI risk management uniquely complex.
3.  **Broad Applicability of Technology:** The broad applicability of general-purpose AI contributes to the unique complexity of its risk management.
4.  **Complex Interaction Effects:** General-purpose AI systems exhibit complex interaction effects, which traditional risk management practices must be adapted to address.
5.  **Emergence of Autonomous Agents:** Autonomous general-purpose AI agents, which can plan and act with little to no human involvement, ele