### 📖 Where We Are

**In the previous sections**, we built all the foundational components for a RAG system:
1.  **Data Ingestion & Parsing (Notebooks 1-6)**: We learned how to load and clean data from a variety of sources into `Document` objects.
2.  **Vector Embeddings (Notebooks 7-8)**: We converted those `Document` objects into numerical vectors (embeddings) that capture their semantic meaning.

**In this new section**, we'll learn how to store and search through those embeddings efficiently using **Vector Stores**. This notebook will walk you through building your **first complete, end-to-end RAG system** using ChromaDB, a popular open-source vector database.

### 1. Building a RAG System with LangChain and ChromaDB

In [1]:
import os
from dotenv import load_dotenv
# Load environment variables from a .env file for secure key management.
load_dotenv()

True

In [2]:
# Fail early with a clear message if GROQ_API_KEY was not loaded from .env
# (assigning None to os.environ would raise a TypeError).
if not os.getenv("GROQ_API_KEY"):
    raise RuntimeError("GROQ_API_KEY is not set; add it to your .env file.")

In [3]:
# --- Langchain Imports ---
# Document Loaders for loading text and directory data.
from langchain_community.document_loaders import TextLoader, DirectoryLoader
# Text Splitter for chunking documents.
from langchain_text_splitters import RecursiveCharacterTextSplitter
# HuggingFace Embeddings for creating vector representations of text.
from langchain_huggingface import HuggingFaceEmbeddings
# Chroma is our vector store for saving and retrieving embeddings.
from langchain_community.vectorstores import Chroma

### RAG (Retrieval-Augmented Generation) Architecture Overview

The process of building a RAG system involves a series of sequential steps to transform raw data into a searchable knowledge base that an LLM can use.

**Indexing Pipeline:**
1.  **Document Loading**: Load documents from various sources (e.g., text files, PDFs, databases).
2.  **Document Splitting**: Break large documents into smaller, semantically meaningful chunks.
3.  **Embedding Generation**: Convert each chunk into a numerical vector using an embedding model.
4.  **Vector Storage**: Store these embeddings in a specialized vector database like ChromaDB for efficient retrieval.

**Retrieval and Generation Pipeline:**

5.  **User Query**: A user asks a question.
6.  **Query Embedding**: The user's query is also converted into a vector.
7.  **Similarity Search**: The vector store searches for the document chunks with embeddings most similar to the query's embedding.
8.  **Context Augmentation**: The retrieved chunks (the context) are combined with the original user query into a single prompt.
9.  **Response Generation**: This augmented prompt is sent to an LLM, which generates an answer based on the provided context.

In [4]:
# --- 1. Document Loading ---
# Create sample documents and save them to a directory to simulate a real-world scenario.
sample_docs = [
    "Machine learning is a subset of artificial intelligence that enables systems to learn and improve from experience without being explicitly programmed.",
    "Deep learning is a subset of machine learning based on artificial neural networks with many layers.",
    "NLP is a field of AI that focuses on the interaction between computers and human language."
]
os.makedirs("text_files", exist_ok=True)
for i, doc in enumerate(sample_docs):
    with open(f"text_files/doc_{i}.txt", "w") as f:
        f.write(doc)

# Use DirectoryLoader to load all the text files from our directory.
loader = DirectoryLoader("text_files", glob="*.txt", loader_cls=TextLoader)
documents = loader.load()
print(f"Loaded {len(documents)} documents.")

Loaded 3 documents.


In [5]:
# --- 2. Document Splitting ---
# Initialize a text splitter to break documents into smaller, manageable chunks.
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,  # The maximum size of each chunk.
    chunk_overlap=50 # The number of characters to overlap between chunks.
)
chunks = text_splitter.split_documents(documents)
print(f"Created {len(chunks)} chunks from {len(documents)} documents.")

Created 3 chunks from 3 documents.
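
Each sample document is well under 500 characters, which is why the splitter returns one chunk per document here. To see what `chunk_size` and `chunk_overlap` do on longer text, the following is a minimal sketch of a fixed-width splitter. It is a simplification: `RecursiveCharacterTextSplitter` first tries to break on separators such as paragraphs and line breaks, whereas this hypothetical `split_with_overlap` helper cuts purely by character position.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Naive splitter (illustration only): fixed-width windows that share
    `chunk_overlap` characters with the previous window."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


text = "abcdefghijklmnopqrstuvwxyz"
pieces = split_with_overlap(text, chunk_size=10, chunk_overlap=3)
for piece in pieces:
    print(piece)
```

The last three characters of each chunk reappear at the start of the next one, so a sentence cut at a chunk boundary is still partly visible in both chunks.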


In [6]:
# --- 3. Embedding Generation ---
# Initialize the embedding model from HuggingFace. This runs locally.
embedding_model = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)



### 4. Vector Storage with ChromaDB

Now we arrive at the core of our retrieval system: the **Vector Store**. After creating embeddings for all our document chunks, we need a place to store them so we can search through them later.

**Analogy: The GPS Navigation System 🛰️**
-   **Embeddings** are the unique GPS coordinates for each document chunk.
-   A **Vector Store** is the navigation system (like Google Maps). When you provide your current location (the embedding of your query), the system doesn't need to check every coordinate on Earth. It uses an index built for nearest-neighbor search to quickly find the closest points of interest (the most similar document chunks).

**ChromaDB** is a popular open-source vector store that's easy to use and can run entirely on your local machine. The `Chroma.from_documents` function is a convenient helper that automates the process: it takes our chunks, uses the provided embedding model to create vectors for them, and stores them in the database.
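
To make the idea of "closest coordinates" concrete, here is a minimal brute-force sketch of a similarity search. The 2-D vectors and the `toy_store` dictionary are made up for illustration, and the helper names are hypothetical. A real vector store typically avoids this linear scan by using a nearest-neighbor index, but the ranking principle is the same: a smaller distance means a closer match.

```python
import math


def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(query_vec, stored, k=2):
    """Return the k (distance, text) pairs closest to query_vec, nearest first."""
    scored = [(l2_distance(query_vec, vec), text) for text, vec in stored.items()]
    scored.sort(key=lambda pair: pair[0])  # lower distance = more similar
    return scored[:k]


# Toy "embeddings": hand-picked numbers, not output of a real model.
toy_store = {
    "deep learning": [0.9, 0.1],
    "machine learning": [0.7, 0.3],
    "cooking recipes": [0.0, 1.0],
}

hits = nearest([1.0, 0.0], toy_store, k=2)
for distance, text in hits:
    print(f"{distance:.3f}  {text}")
```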

In [7]:
# --- 4. Vector Storage ---
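# Create a Chroma vector store from the document chunks.
# This single command handles embedding each chunk and storing it in the database.
# Note: if `chroma_db` already holds a collection from earlier runs, the chunks are appended to it,
# so the count below can exceed the number of chunks created in this run.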
vector_store = Chroma.from_documents(
    documents=chunks,              # The document chunks to be stored.
    embedding=embedding_model,     # The model to use for creating embeddings.
    persist_directory="chroma_db"  # The directory to save the database on disk.
)

print(f"Vector store created with {vector_store._collection.count()} vectors.")

Vector store created with 30 vectors.


### 5. Testing the Retrieval (Similarity Search)

In [8]:
# The core function of a vector store is performing a similarity search.
query = "What is deep learning?"
# The `similarity_search` method embeds the query and finds the most similar document chunks.
similar_docs = vector_store.similarity_search(query)

print(f"Query: {query}")
print("\nTop similar chunk:")
print(similar_docs[0].page_content)

Query: What is deep learning?

Top similar chunk:
Deep learning is a subset of machine learning based on artificial neural networks with many layers.


In [10]:
# For more control, `similarity_search_with_score` returns the documents and their distance scores.
# `k` determines the number of results to return.
results_with_scores = vector_store.similarity_search_with_score(query, k=3)

for doc, score in results_with_scores:
    print(f"\nContent: {doc.page_content}")
    # For ChromaDB's default L2 distance, a LOWER score means MORE similar.
    print(f"Score: {score:.4f}")


Content: Deep learning is a subset of machine learning based on artificial neural networks with many layers.
Score: 0.5072

Content: Deep learning is a subset of machine learning based on artificial neural networks. 
    These networks are inspired by the human brain and consist of layers of interconnected 
    nodes. Deep learning has revolutionized fields like computer vision, natural language 
    processing, and speech recognition. Convolutional Neural Networks (CNNs) are particularly 
    effective for image processing, while Recurrent Neural Networks (RNNs) and Transformers
Score: 0.5434

Content: Deep learning is a subset of machine learning based on artificial neural networks. 
    These networks are inspired by the human brain and consist of layers of interconnected 
    nodes. Deep learning has revolutionized fields like computer vision, natural language 
    processing, and speech recognition. Convolutional Neural Networks (CNNs) are particularly 
    effective for image pr

### 6. Building the RAG Chain

In [11]:
# --- Initialize the LLM ---
# We'll use Groq for its fast inference speed.
from langchain_groq import ChatGroq
llm = ChatGroq(
    model="openai/gpt-oss-20b", 
    temperature=0.2      # A low temperature for more factual, less creative responses.
)

#### The Retriever
A **Retriever** is a generic LangChain interface that fetches documents. A vector store can be easily converted into a retriever. This abstraction makes it easy to swap different retrieval methods or vector stores in your RAG chain.
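
The value of this abstraction can be sketched without LangChain at all. In the hypothetical example below, `SimpleRetriever`, `KeywordRetriever`, and `FirstNRetriever` are illustrative names, not LangChain classes. The point is only that code written against a shared `invoke(query)` method does not care which retrieval strategy sits behind it.

```python
from typing import Protocol


class SimpleRetriever(Protocol):
    """Anything with an invoke(query) method that returns a list of texts."""

    def invoke(self, query: str) -> list[str]: ...


class KeywordRetriever:
    """Returns texts that share at least one lowercase word with the query."""

    def __init__(self, texts: list[str]):
        self.texts = texts

    def invoke(self, query: str) -> list[str]:
        words = set(query.lower().split())
        return [t for t in self.texts if words & set(t.lower().split())]


class FirstNRetriever:
    """Ignores the query and returns the first n texts (a trivial baseline)."""

    def __init__(self, texts: list[str], n: int = 1):
        self.texts = texts
        self.n = n

    def invoke(self, query: str) -> list[str]:
        return self.texts[: self.n]


def build_context(retriever: SimpleRetriever, query: str) -> str:
    """Downstream code depends only on the shared invoke() interface."""
    return "\n\n".join(retriever.invoke(query))


texts = ["Deep learning uses neural networks.", "NLP studies human language."]
print(build_context(KeywordRetriever(texts), "deep learning"))
print(build_context(FirstNRetriever(texts, n=1), "deep learning"))
```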

In [12]:
# Convert the vector store into a retriever.
# `search_kwargs={"k": 2}` specifies that the retriever should fetch the top 2 most similar documents.
retriever = vector_store.as_retriever(search_kwargs={"k": 2})

In [13]:
# --- Define the Prompt Template ---
from langchain_core.prompts import ChatPromptTemplate

system_prompt = """You are an assistant for question-answering tasks. 
Use the following pieces of retrieved context to answer the question. 
If you don't know the answer, just say that you don't know. 
Keep the answer concise.

Context: {context}"""

# The prompt template includes placeholders for the context (from the retriever) and the user's input.
prompt = ChatPromptTemplate.from_messages([
    ("system", system_prompt),
    ("human", "{input}")
])

#### Creating the RAG Chain
Now we'll wire all the components together into a complete chain. We use two main helper functions:
1.  **`create_stuff_documents_chain`**: This takes the retrieved documents and "stuffs" them all into the `{context}` placeholder in our prompt.
2.  **`create_retrieval_chain`**: This is the final step. It combines the `retriever` (to get documents) and the `document_chain` (to process them with the LLM) into a single, runnable pipeline.
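
Conceptually, "stuffing" is plain string assembly. The sketch below shows the idea with ordinary Python strings. The `stuff_documents` helper and the template text are hypothetical stand-ins, not the LangChain implementation, which operates on `Document` objects and prompt templates.

```python
def stuff_documents(template: str, docs: list[str], question: str) -> str:
    """Join all retrieved texts and place them into the template's {context} slot."""
    context = "\n\n".join(docs)
    return template.format(context=context, input=question)


template = (
    "Answer using only this context.\n\n"
    "Context: {context}\n\n"
    "Question: {input}"
)
retrieved = [
    "Deep learning uses many-layered neural networks.",
    "Machine learning learns from experience.",
]

prompt_text = stuff_documents(template, retrieved, "What is deep learning?")
print(prompt_text)
```

Because every retrieved chunk goes into a single prompt, this approach works best when `k` is small enough for the combined text to fit in the model's context window. The dictionary returned by `create_retrieval_chain` exposes the retrieved documents under `"context"` next to `"input"` and `"answer"`, which is useful for checking what the model was actually shown.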

In [14]:
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain

# Create the chain that will process the retrieved documents.
document_chain = create_stuff_documents_chain(llm, prompt)

# Create the final retrieval chain.
rag_chain = create_retrieval_chain(retriever, document_chain)

In [15]:
# --- 7. Query the RAG system ---
# The .invoke() method runs the entire pipeline.
response = rag_chain.invoke({"input": "What is Deep Learning?"})
print(response['answer'])

Deep learning is a branch of machine learning that uses artificial neural networks with many layers (deep networks) to learn representations and patterns from data. Inspired by the human brain, these networks consist of interconnected nodes organized in layers, enabling breakthroughs in areas such as computer vision, natural language processing, and speech recognition.


### Alternative: Building a RAG Chain with LCEL

**LCEL (LangChain Expression Language)** is LangChain's declarative syntax for composing chains. It uses the pipe operator `|` to connect components, so each step of the pipeline is visible and can be swapped or extended individually.

**Analogy: LEGOs vs. a Pre-built Toy 🧱**
-   **`create_retrieval_chain`** is like a pre-built toy car. It's easy to use and works great for its intended purpose.
-   **LCEL** is like a box of LEGOs. You can build the same car, but you can also easily modify it, add new parts, or see exactly how each piece connects to the others.
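
If the pipe syntax looks like magic, it may help to see that `|` is ordinary Python operator overloading. The following toy `Step` class is a hypothetical sketch, not LangChain's `Runnable`, and shows how `a | b` can produce a new object that runs `a` and feeds its output to `b`.

```python
class Step:
    """Wraps a function and lets steps be chained with the | operator."""

    def __init__(self, func):
        self.func = func

    def __or__(self, other):
        # `self | other` builds a new Step: run self first, then pass the result to other.
        return Step(lambda value: other.invoke(self.invoke(value)))

    def invoke(self, value):
        return self.func(value)


# Stand-ins for a retriever, a formatter, and a prompt.
retrieve = Step(lambda question: [f"doc about {question}"])
format_step = Step(lambda docs: "\n\n".join(docs))
to_prompt = Step(lambda context: f"Context: {context}")

chain = retrieve | format_step | to_prompt
result = chain.invoke("deep learning")
print(result)
```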

In [17]:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# A helper function to format the retrieved documents into a single string.
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

# Create a custom prompt
custom_prompt = ChatPromptTemplate.from_template("""Use the following context to answer the question. 
If you don't know the answer based on the context, say you don't know.
Provide specific details from the context to support your answer.

Context:
{context}

Question: {question}

Answer:""")

# --- Build the chain using LCEL ---
rag_chain_lcel = (
    # This dictionary defines the inputs to the prompt.
    # The retriever is called first, its output is piped to format_docs, and the result fills `context`.
    # `question` is passed through directly from the input.
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | custom_prompt          # The formatted inputs are piped into the prompt.
    | llm             # The prompt is piped into the LLM.
    | StrOutputParser() # The LLM's output is parsed into a simple string.
)

# Invoke the LCEL chain (note the simpler input format).
response_lcel = rag_chain_lcel.invoke("What is Deep Learning?")
print(response_lcel)

Deep learning is a subset of machine learning that uses artificial neural networks with many layers. These networks are inspired by the human brain and consist of layers of interconnected nodes. Deep learning has transformed fields such as computer vision, natural language processing, and speech recognition, with specialized architectures like Convolutional Neural Networks (CNNs) for image processing and Recurrent Neural Networks (RNNs) and Transformers for sequential data.


### 8. Adding New Documents to the Vector Store

In [18]:
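# `Document` lives in langchain_core; the older `langchain.schema` import path is deprecated.
from langchain_core.documents import Document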

In [19]:
# Define a new document to add.

new_document = """
Reinforcement Learning in Detail

Reinforcement learning (RL) is a type of machine learning where an agent learns to make 
decisions by interacting with an environment. The agent receives rewards or penalties 
based on its actions and learns to maximize cumulative reward over time. Key concepts 
in RL include: states, actions, rewards, policies, and value functions. Popular RL 
algorithms include Q-learning, Deep Q-Networks (DQN), Policy Gradient methods, and 
Actor-Critic methods. RL has been successfully applied to game playing (like AlphaGo), 
robotics, and autonomous systems.
"""
# Add new documents to the existing vector store
new_doc = Document(page_content=new_document, metadata={"source": "Manual Entry"})

# Split the new document into chunks.
new_chunks = text_splitter.split_documents([new_doc])

# Add the new chunks to the existing vector store.
vector_store.add_documents(new_chunks)

print(f"Added {len(new_chunks)} new chunks. Total vectors now: {vector_store._collection.count()}")

Added 3 new chunks. Total vectors now: 33


In [20]:
# Test the RAG system with a question about the new content.
new_question = "What is reinforcement learning?"
new_response = rag_chain_lcel.invoke(new_question)
print(new_response)

Reinforcement learning (RL) is a type of machine learning in which an **agent** learns to make decisions by **interacting with an environment**. The agent receives **rewards or penalties** based on the actions it takes and learns to **maximize cumulative reward over time**. Key concepts in RL include:

- **States** – the current situation of the environment  
- **Actions** – the choices the agent can make  
- **Rewards** – feedback signals that indicate the desirability of an action  
- **Policies** – strategies that map states to actions  
- **Value functions** – estimates of future rewards for states or state‑action pairs  

Popular RL algorithms mentioned are **Q‑learning**, **Deep Q‑Networks (DQN)**, and **Policy Gradient methods**.


### 9. Advanced RAG: Conversational Memory

A standard RAG chain is stateless; it has no memory of past interactions. This is a problem for follow-up questions (e.g., User: "What is ML?", Bot: ..., User: "What are **its** main types?"). The retriever won't know that "its" refers to machine learning.

The solution is to create a **history-aware retriever**. This special retriever first looks at the chat history and the new question, and **reformulates** the new question into a standalone query. In the example above, it would transform "What are its main types?" into "What are the main types of machine learning?" before searching for documents.
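
The control flow can be sketched in a few lines of plain Python. Here `fake_reformulate` is a hard-coded stand-in for the LLM call and `fake_search` is a stand-in for the retriever; both are hypothetical. When the history is empty the question is passed to the search unchanged, which mirrors how `create_history_aware_retriever` handles an empty `chat_history`.

```python
def history_aware_search(question, chat_history, reformulate, search):
    """Rewrite the question into a standalone query only when there is history."""
    if not chat_history:
        standalone = question  # nothing to resolve, search with the raw question
    else:
        standalone = reformulate(chat_history, question)
    return standalone, search(standalone)


def fake_reformulate(chat_history, question):
    # Stand-in for the LLM: resolves the pronoun with a fixed replacement.
    return question.replace("its", "machine learning's")


def fake_search(query):
    # Stand-in for the retriever: echoes the query it was given.
    return [f"results for: {query}"]


history = [("human", "What is ML?"), ("ai", "ML is a subset of AI.")]

first, _ = history_aware_search("What is ML?", [], fake_reformulate, fake_search)
follow_up, docs = history_aware_search(
    "What are its main types?", history, fake_reformulate, fake_search
)
print(first)
print(follow_up)
```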

In [21]:
from langchain.chains import create_history_aware_retriever
from langchain_core.prompts import MessagesPlaceholder
from langchain_core.messages import HumanMessage, AIMessage

In [22]:
# --- 1. Create a prompt for query reformulation ---
contextualize_q_system_prompt = """Given a chat history and the latest user question 
which might reference context in the chat history, formulate a standalone question 
which can be understood without the chat history. Do NOT answer the question, 
just reformulate it if needed and otherwise return it as is."""

# `MessagesPlaceholder` is a special variable that will hold the list of chat history messages.
contextualize_q_prompt = ChatPromptTemplate.from_messages([
    ("system", contextualize_q_system_prompt),
    MessagesPlaceholder("chat_history"),
    ("human", "{input}"),
])

In [23]:
# --- 2. Create the history-aware retriever ---
# This chain will take the user input and chat history, and use the LLM to create a new, standalone query.
history_aware_retriever = create_history_aware_retriever(
    llm, retriever, contextualize_q_prompt
)

In [24]:
# --- 3. Create the final conversational RAG chain ---
# First, we need a new QA prompt that also includes the chat history.
qa_system_prompt = """You are an assistant for question-answering tasks. 
Use the following pieces of retrieved context to answer the question. 
If you don't know the answer, just say that you don't know. 
Keep the answer concise.

Context: {context}"""

qa_prompt = ChatPromptTemplate.from_messages([
    ("system", qa_system_prompt),
    MessagesPlaceholder("chat_history"),
    ("human", "{input}"),
])

# This document chain will be used to generate the final answer.
question_answer_chain = create_stuff_documents_chain(llm, qa_prompt)

# This is the final chain that connects the history-aware retriever and the QA chain.
conversational_rag_chain = create_retrieval_chain(
    history_aware_retriever, 
    question_answer_chain
)

In [25]:
# Initialize the chat history as an empty list.
chat_history = []

# First question
first_question = "What is machine learning?"
result1 = conversational_rag_chain.invoke({
    "chat_history": chat_history,
    "input": first_question
})
print(f"Q: {first_question}")
print(f"A: {result1['answer']}")

Q: What is machine learning?
A: Machine learning is a subset of artificial intelligence that enables systems to learn and improve from experience without being explicitly programmed.


In [26]:
# --- Update the chat history ---
# We must manually update the history with the user's question and the model's answer.
chat_history.extend([
    HumanMessage(content=first_question),
    AIMessage(content=result1['answer'])
])

In [27]:
# Follow-up question
follow_up_question = "What is a subset of it?"  # This question relies on the previous context.
result2 = conversational_rag_chain.invoke({
    "chat_history": chat_history,
    "input": follow_up_question
})
print(f"\nQ: {follow_up_question}")
print(f"A: {result2['answer']}")


Q: What is a subset of it?
A: A subset of machine learning is one of its three main types: supervised learning, unsupervised learning, or reinforcement learning.


### 🔑 Key Takeaways

* **Vector Store is the Core**: A Vector Store (like ChromaDB) is a specialized database that stores embeddings and performs fast similarity searches over them, forming the heart of the retrieval system.
* **End-to-End Pipeline**: A full RAG pipeline involves loading, splitting, embedding, and storing data (indexing), followed by retrieving relevant documents and generating an answer (retrieval & generation).
* **LCEL for Flexibility**: LangChain Expression Language (LCEL) with the pipe `|` operator lets you build custom RAG chains in which every step is visible and replaceable.
* **Vector Stores are Dynamic**: You can easily add new documents to an existing vector store to keep your RAG system's knowledge base up-to-date.
* **Memory is a Must for Conversation**: For building chatbots, a stateless RAG chain is not enough. A **history-aware retriever** is essential to reformulate follow-up questions, allowing the system to understand conversational context.