# Build a Retrieval Augmented Generation (RAG) Application

### What You'll Learn

This notebook takes a progressive approach, building your understanding step-by-step:

1. **Basic LLM Interaction** - Start with simple prompting
2. **Prompt Engineering** - Learn to structure prompts effectively
3. **Chaining Components** - Connect multiple steps together
4. **Document Loading & Splitting** - Prepare data for retrieval
5. **Vector Embeddings & Storage** - Create searchable knowledge bases
6. **Complete RAG Implementation** - Put it all together with LangGraph

### What is RAG?

**Retrieval Augmented Generation (RAG)** is one of the most powerful applications enabled by LLMs. It allows AI to answer questions using **specific source information** rather than relying solely on its training data.

Think of it like giving an AI assistant access to a specialized library - it can look up relevant information before answering your questions!

### The RAG Architecture

A typical RAG application has **two main components**:

1. **Indexing** (Offline) - Preparing your knowledge base
   - Load documents from various sources
   - Split large documents into chunks
   - Create embeddings and store in a vector database

2. **Retrieval & Generation** (Runtime) - Answering questions
   - Retrieve relevant documents based on the query
   - Generate answers using retrieved context

---


Let's get started! 🚀

## Step 0: Environment Setup

Before we begin, let's install all the required packages. We'll need:

- **langchain** - The core framework for building LLM applications
- **langchain-openai** - OpenAI and Azure OpenAI integration
- **langchain-community** - Community-contributed components (document loaders, vector stores)
- **langchain-text-splitters** - Utilities for splitting documents into chunks
- **langgraph** - Framework for building stateful, multi-step applications
- **faiss-cpu** - Facebook AI Similarity Search for vector storage
- **beautifulsoup4** - HTML parsing for web content loading
- **python-dotenv** - Loads credentials from a `.env` file

Run the cell below to install all dependencies:

In [None]:
%pip install -qU langchain langchain-openai langchain-community langchain-text-splitters langgraph
%pip install -qU faiss-cpu beautifulsoup4 python-dotenv

## Step 1: Basic LLM Interaction

Let's start with the fundamentals - connecting to an LLM and making a simple request.

### What's Happening Here?

1. **Load environment variables** - Read Azure OpenAI credentials from `.env` file
2. **Create an LLM client** - Initialize connection to Azure OpenAI
3. **Make a simple query** - Ask a question directly

This is the **simplest** way to interact with an LLM, but notice:
- ❌ No specialized knowledge
- ❌ No structured prompting
- ❌ Limited control over format
- ❌ Answers only from training data

Let's see it in action:

In [None]:
import os
import dotenv
from langchain_openai import AzureChatOpenAI

dotenv.load_dotenv()

# API key authentication: the client reads the key from the AZURE_OPENAI_API_KEY environment variable
llm = AzureChatOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    azure_deployment=os.environ["AZURE_OPENAI_DEPLOYMENT_NAME"],
    openai_api_version=os.environ["AZURE_OPENAI_API_VERSION"],
)
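
The cell above raises a `KeyError` if any of the variables it reads is missing, so it can help to check them first. The sketch below is one way to do that with the standard library only. The helper name `missing_vars` is hypothetical, and including `AZURE_OPENAI_API_KEY` is an assumption about how key-based authentication is configured in your environment.

```python
import os

# Variable names read by the client cell above. AZURE_OPENAI_API_KEY is an
# assumption: it is the variable typically used for key-based authentication.
# The embeddings cell later in this notebook also reads AZURE_OPENAI_ADA_DEPLOYMENT.
REQUIRED_VARS = [
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_DEPLOYMENT_NAME",
    "AZURE_OPENAI_API_VERSION",
    "AZURE_OPENAI_API_KEY",
]

def missing_vars(env, required=REQUIRED_VARS):
    """Return the required variable names that are absent or empty in env."""
    return [name for name in required if not env.get(name)]

missing = missing_vars(os.environ)
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All required environment variables are set.")
```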

### Test Basic LLM Call

Let's ask the LLM a simple question. Notice how it answers from its general training knowledge:

In [None]:
answer = llm.invoke("how can langsmith help with testing?")

In [None]:
print(f"answer: {answer}")
print(f"Type: {type(answer)}")

### Understanding the Response

The LLM returns a structured response object containing:
- **content** - The actual text answer
- **response_metadata** - Technical details (tokens used, model info, etc.)
- **type** - The object type (AIMessage)

The LLM gives a general answer based on its training, but it doesn't have access to:
- ✗ Up-to-date information
- ✗ Your company's internal documentation
- ✗ Specific domain knowledge from your documents

**This is where RAG becomes powerful!** But first, let's learn about better prompting...

## Step 2: Prompt Engineering with Templates

Now let's improve our LLM interactions using **prompt templates**!

### Why Use Prompt Templates?

Direct LLM calls are limited. LangChain's power comes from **chaining components** together:

- ✅ **Prompt Templates** - Structure conversations effectively
- ✅ **Output Parsers** - Format responses consistently
- ✅ **Retrieval Components** - Add external knowledge
- ✅ **Multiple Processing Steps** - Build complex workflows
- ✅ **Data Transformations** - Clean and prepare data

### What is a Prompt Template?

Think of it as a **Mad Libs for AI**:
- You create a template with placeholders (e.g., `{name}`, `{user_input}`)
- You fill in the blanks with actual values
- The completed prompt goes to the LLM

This gives you:
- **Consistency** - Same structure every time
- **Reusability** - One template, many uses
- **Maintainability** - Update once, affects all uses
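
The fill-in-the-blanks idea can be sketched in plain Python with `str.format`. This is only an analogue of what a prompt template does, not the LangChain class itself; the `fill` helper and `messages_template` list are hypothetical names used for illustration.

```python
# A plain-Python analogue of a prompt template (not the LangChain class).
messages_template = [
    ("system", "You are a helpful AI bot. Your name is {name}."),
    ("human", "{user_input}"),
]

def fill(template, **values):
    """Return (role, text) pairs with every placeholder replaced by its value."""
    return [(role, text.format(**values)) for role, text in template]

filled = fill(messages_template, name="Bob", user_input="What is your name?")
for role, text in filled:
    print(f"{role}: {text}")
```

The template is written once, and each call to `fill` produces a fresh, fully populated list of messages.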

### Example 1: Multi-Turn Conversation Template

Let's create a template that simulates a multi-turn conversation:

**What's happening:**

1. **Import ChatPromptTemplate** - LangChain's tool for structured conversations
2. **Create conversation with roles** - Each message has a role:
   - `"system"` - Sets the AI's identity and behavior (with placeholder `{name}`)
   - `"human"` - User messages
   - `"ai"` - Assistant responses
3. **Use placeholders** - `{name}` and `{user_input}` will be filled in later
4. **Invoke the template** - Fill the placeholders with actual values

In [None]:
from langchain_core.prompts import ChatPromptTemplate

template = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful AI bot. Your name is {name}."),
    ("human", "Hello, how are you doing?"),
    ("ai", "I'm doing well, thanks!"),
    ("human", "{user_input}"),
])

prompt_value = template.invoke(
    {
        "name": "Bob",
        "user_input": "What is your name?"
    }
)


In [None]:
for msg in prompt_value.messages:
  print(type(msg).__name__, ":", msg.content)

See how the template filled in the placeholders? The conversation now has:
- System message with "Bob" as the name
- Human and AI exchange establishing context
- Final human question asking for the name

### Example 2: Simple Two-Message Template

Here's a simpler pattern - just system instructions and user input:

This is a common pattern for RAG prompts:
- Set the role/behavior in the system message
- Take user input dynamically

In [None]:
from langchain_core.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a world class technical documentation writer."),  # system instructions
    ("user", "{input}"),  # filled in when the template is invoked
])

In [None]:
for msg in prompt.messages:
    print(type(msg).__name__, ":", msg)

Notice: The template has placeholders but isn't invoked yet, so `{input}` is still a variable.

## Step 3: Chaining Components Together

Now for the magic! Let's **chain** our components using the pipe (`|`) operator.

![chaining.png](../Assets/images/chaining.png)

### What is Chaining?

Think of it as a **pipeline** or **assembly line** where data flows through multiple steps:

```
Input → Step 1 → Step 2 → Step 3 → Output
```

The `|` symbol means "**then**" or "**pipe to**":

```python
chain = prompt | llm | output_parser
```

Translates to:
1. Take the `prompt` template 
2. **THEN** send it to the `llm`
3. **THEN** parse the output

### Why is This Powerful?

- ✅ **Modular** - Swap components easily
- ✅ **Readable** - Clear flow of data
- ✅ **Reusable** - Components work anywhere
- ✅ **Testable** - Test each step independently
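
Python lets a class define what `|` means through the `__or__` method, and that is the mechanism pipe-style composition relies on. The sketch below is a simplified illustration of the idea, not LangChain's implementation; the `Step` class and the three stand-in steps are hypothetical.

```python
class Step:
    """Wraps a function so that steps can be composed with the | operator."""

    def __init__(self, func):
        self.func = func

    def __or__(self, other):
        # self | other -> a new Step that runs self first, then other.
        return Step(lambda value: other.func(self.func(value)))

    def invoke(self, value):
        return self.func(value)

# Hypothetical stand-ins for prompt, llm and output_parser.
format_prompt = Step(lambda d: f"Question: {d['input']}")
fake_llm = Step(lambda prompt_text: {"content": prompt_text.upper()})
parse_output = Step(lambda message: message["content"])

toy_chain = format_prompt | fake_llm | parse_output
result = toy_chain.invoke({"input": "hi"})
print(result)
```

Because each step only depends on the output of the previous one, any step can be swapped or tested on its own.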

### Create a Simple Chain (2 Steps)

Let's start with a basic 2-step chain:

In [None]:
print(prompt)

This chain has 2 components:
1. **prompt** - Formats the input
2. **llm** - Generates the response

So `chain = prompt | llm` means: take the prompt, **then** send it to the LLM.

In [None]:
# Pipe the formatted prompt into the LLM
chain = prompt | llm

### Inspecting Chain Components

You can access different parts of the chain:

In [None]:
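# chain.first is the first component (the prompt); chain.last is the final one (the LLM).
# chain.middle lists the steps in between, which is empty for a 2-step chain.
print(chain.first)
print(chain.last)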

### Run the Chain

Now let's execute our 2-step chain:

The chain processes:
1. **prompt** fills `{input}` with our question
2. **llm** generates a response based on the formatted prompt

What actually got sent to the LLM was:

- `"system"`: "You are a world class technical documentation writer."
- `"user"`: "how can langsmith help with testing?"

In [None]:
chain_result = chain.invoke({"input": "how can langsmith help with testing?"})

In [None]:
print(chain_result.content) #This line displays the actual text response from the AI.

This extracts just the text content. But the response object contains more:

### View Response Metadata

This displays technical information about the AI's response - all the "behind the scenes" details like token counts and model info:

In [None]:
print(chain_result.response_metadata)

### Adding an Output Parser (3-Step Chain)

The LLM returns an `AIMessage` object. Often we just want simple text!

**StrOutputParser** extracts just the string content, making responses easier to work with.

In [None]:
from langchain_core.output_parsers import StrOutputParser

output_parser = StrOutputParser()

### Create a 3-Step Chain

Now we'll add the output parser to create a complete 3-step pipeline:

**The 3 Steps:**
1. **prompt** → Format the question
2. **llm** → Get AI response (returns an AIMessage object)
3. **output_parser** → Extract clean text string

**Think of it like a car wash:**
- Step 1: Prep the car (format prompt)
- Step 2: Wash the car (get AI response)
- Step 3: Dry and polish (clean up the output)

In [None]:
chain = prompt | llm | output_parser

Execute the 3-step chain:

In [None]:
chain_result = chain.invoke({"input": "how can langsmith help with testing?"})

In [None]:
print(chain_result)

Notice: Now we get a clean string directly, not an AIMessage object! Much easier to work with.

**Key Insight:** We're still missing the "R" in RAG - **Retrieval**! The LLM is answering from its training data, not from specific documents we provide.

## Step 4: Adding Retrieval - The "R" in RAG

Now we get to the heart of RAG! Let's add the ability to **retrieve information from documents**.

### The RAG Indexing Pipeline

Before we can retrieve, we need to **index our documents**:

```
Document → Load → Split → Embed → Store in Vector Database
```

This happens **offline** (once) to prepare your knowledge base.

### What We'll Build

1. **Load** a document (web page)
2. **Split** it into chunks
3. **Embed** each chunk (convert to vectors)
4. **Store** in a vector database (FAISS)
5. **Retrieve** relevant chunks when asked

Let's start!

### Step 4.1: Load Documents

First, we need source material. We'll use **WebBaseLoader** to load content from a website.

**What's happening:**
- `WebBaseLoader` fetches HTML from the URL
- Parses it into plain text
- Creates `Document` objects we can process

In [None]:
# WebBaseLoader fetches a web page and converts it into Document objects
from langchain_community.document_loaders import WebBaseLoader

loader = WebBaseLoader("https://docs.smith.langchain.com/")
docs = loader.load()

The `docs` variable now contains all that website content, ready for the next step!

### Step 4.2: Split Documents & Create Vector Store

Web pages are often too long for LLM context windows. We need to:
1. **Split** the document into smaller chunks
2. **Embed** each chunk (convert to vectors)
3. **Store** in a searchable vector database

**Think of it like creating a smart library:**

- **Text Splitter** - Takes a huge book and breaks it into individual pages/chapters
- **Embeddings** - Creates a "topic fingerprint" for each page describing its meaning
- **FAISS** - Organizes all those fingerprints so you can quickly find relevant pages

### Why Split?
- LLMs have limited context windows
- Smaller chunks = more precise retrieval
- Better matching between queries and relevant content
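
To show what chunk size and overlap mean, here is a deliberately simple fixed-window splitter. It is a sketch for intuition only: `split_text` is a hypothetical helper, and the `RecursiveCharacterTextSplitter` used below is more sophisticated because it tries to break on natural separators such as paragraphs and sentences.

```python
def split_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size character windows that overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
pieces = split_text(sample, chunk_size=10, chunk_overlap=3)
print(pieces)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk repeats the last three characters of the previous one, so text that straddles a boundary still appears whole in at least one chunk.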

In [None]:
# First, set up the embeddings model
from langchain_openai import AzureOpenAIEmbeddings

embeddings = AzureOpenAIEmbeddings(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    azure_deployment=os.environ["AZURE_OPENAI_ADA_DEPLOYMENT"],
    openai_api_version=os.environ["AZURE_OPENAI_API_VERSION"],
)

# Now split the documents and create vector store
from langchain_community.vectorstores import FAISS
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter()
documents = text_splitter.split_documents(docs)
vector = FAISS.from_documents(documents, embeddings)

print(f"✓ Created vector store with {len(documents)} document chunks")

In [None]:
print(documents)

You can see the document has been split into manageable chunks. Each chunk was embedded and stored in the FAISS index by `FAISS.from_documents`.

## Step 5: Building the RAG Chain

Now let's connect retrieval with generation! We'll create a chain that:
1. Takes a question
2. Retrieves relevant documents
3. Uses those documents to answer

### The Document Chain

This creates a specialized chain for answering questions using provided documents as context.

In [None]:
from langchain.chains.combine_documents import create_stuff_documents_chain
# create_stuff_documents_chain :
#  Create a chain for passing a list of Documents to a model.

prompt = ChatPromptTemplate.from_template("""
Answer the following question based only on the provided context:

<context>
{context}
</context>

Question: {input}""")

# The retrieved documents are formatted into {context}; the chain's default
# output parser (StrOutputParser) returns the answer as a plain string.
document_chain = create_stuff_documents_chain(llm, prompt)

**Key components:**
- **Prompt with context** - Includes `{context}` placeholder for retrieved documents
- **Instruction** - "Answer based ONLY on the provided context"
- **document_chain** - Combines prompt and LLM to process documents

This is the "generation" part of RAG!
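
"Stuffing" simply means placing the text of every retrieved document into the prompt's `{context}` slot. The sketch below imitates that step with the standard library only. `Doc`, `PROMPT`, and `stuff_prompt` are hypothetical names for this illustration, and the blank-line separator is an assumption about how the documents are joined.

```python
from dataclasses import dataclass

@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document (only the text field)."""
    page_content: str

PROMPT = (
    "Answer the following question based only on the provided context:\n\n"
    "<context>\n{context}\n</context>\n\n"
    "Question: {input}"
)

def stuff_prompt(docs, question, separator="\n\n"):
    """Join all document texts into one context string and fill the prompt."""
    context = separator.join(doc.page_content for doc in docs)
    return PROMPT.format(context=context, input=question)

docs_example = [Doc("LangSmith traces runs."), Doc("LangSmith supports datasets.")]
stuffed = stuff_prompt(docs_example, "What does LangSmith do?")
print(stuffed)
```

This approach works as long as the combined chunks fit in the model's context window, which is one reason the documents were split into chunks earlier.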

### Complete the RAG Chain

Now let's add the retrieval component:

**What's happening:**
- `retriever` - Searches the vector store for relevant documents
- `retrieval_chain` - Combines retrieval + generation

**Think of it like a research assistant:**
- **Before RAG:** AI answers from general knowledge (may hallucinate)
- **With RAG:** AI gets specific research papers and answers ONLY using those papers (grounded in facts)

In [None]:
from langchain.chains import create_retrieval_chain

retriever = vector.as_retriever()
retrieval_chain = create_retrieval_chain(retriever, document_chain)

### Test the Complete RAG System!

This is the moment we've been building toward! Let's run the complete RAG pipeline:

**The full flow:**
1. User asks: "how can langsmith help with testing?"
2. **Retrieval:** Search vector store for relevant document chunks
3. **Generation:** LLM answers using ONLY the retrieved context

In [None]:
response = retrieval_chain.invoke({"input": "how can langsmith help with testing?"})

In [None]:
print(response["answer"])

# LangSmith offers several features that can help with testing:...

🎉 **Success!** The answer comes from the specific documentation we loaded, not generic training data!

Notice how the answer is:
- ✅ Specific to LangSmith
- ✅ Based on retrieved documentation
- ✅ Grounded in factual content

### Test with Another Question

Let's try a different question to see RAG in action:

In [None]:
response = retrieval_chain.invoke({"input": "how can I use it?"})
print(response["answer"])

---

## Step 6: Advanced RAG with LangGraph

Now let's rebuild our RAG application using **LangGraph** - a framework for building **stateful, multi-step applications**.

### Why LangGraph?

The simple chain we built works, but LangGraph adds:
- ✅ **State management** - Track conversation context
- ✅ **Multiple invocation modes** - Sync, async, streaming
- ✅ **Easier debugging** - Visualize application flow
- ✅ **Streamlined deployment** - Production-ready patterns
- ✅ **Better observability** - Built-in tracing

### LangGraph Components

To build a LangGraph application, we define:
1. **State** - What data flows through the application
2. **Nodes** - Individual processing steps
3. **Control Flow** - How steps connect together
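
Before using the library, it may help to see these three ideas in plain Python. In the sketch below the state is a dictionary, each node is a function that returns a partial update, and the control flow is a fixed sequence. All names (`ToyState`, `toy_retrieve`, `toy_generate`, `run_sequence`) are hypothetical, and the nodes return placeholder values instead of calling a vector store or an LLM.

```python
from typing import Callable, TypedDict

class ToyState(TypedDict, total=False):
    question: str
    context: list[str]
    answer: str

def toy_retrieve(state: ToyState) -> dict:
    # Hypothetical retrieval: a real node would query a vector store.
    return {"context": [f"notes about {state['question']}"]}

def toy_generate(state: ToyState) -> dict:
    # Hypothetical generation: a real node would call an LLM.
    return {"answer": f"Based on {len(state['context'])} chunk(s): ..."}

def run_sequence(nodes: list[Callable[[ToyState], dict]], state: ToyState) -> ToyState:
    """Run nodes in order, merging each node's partial update into the state."""
    current = dict(state)
    for node in nodes:
        current.update(node(current))
    return current

final_state = run_sequence([toy_retrieve, toy_generate], {"question": "task decomposition"})
print(final_state)
```

Each node only returns the keys it changes, and the runner merges them into the shared state. The LangGraph nodes defined later in this notebook follow the same pattern.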

Let's rebuild our RAG app with LangGraph!

### Load and Index a New Document

For this advanced example, let's load a different document - a blog post about LLM agents:

In [None]:
import bs4
from langchain import hub
from langchain_community.document_loaders import WebBaseLoader
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langgraph.graph import START, StateGraph
from typing_extensions import List, TypedDict
import faiss
from langchain_community.docstore.in_memory import InMemoryDocstore
from langchain_community.vectorstores import FAISS

# Load and chunk contents of the blog
loader = WebBaseLoader(
    web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            class_=("post-content", "post-title", "post-header")
        )
    ),
)
docs = loader.load()

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,  # chunk size (characters)
    chunk_overlap=200,  # chunk overlap (characters)
    add_start_index=True,  # track index in original document
)
all_splits = text_splitter.split_documents(docs)

print(f"Loaded {len(docs)} document(s)")
print(f"Split into {len(all_splits)} chunks")
print(f"Total characters: {len(docs[0].page_content)}")

# Create a new vector store for this example
embedding_dim = len(embeddings.embed_query("hello world"))
index = faiss.IndexFlatL2(embedding_dim)

vector_store_graph = FAISS(
    embedding_function=embeddings,
    index=index,
    docstore=InMemoryDocstore(),
    index_to_docstore_id={},
)

# Index chunks
_ = vector_store_graph.add_documents(documents=all_splits)
print("✓ Documents indexed in vector store")
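
The `faiss.IndexFlatL2` index used above performs an exact, brute-force search by Euclidean (L2) distance. The sketch below reproduces that idea on toy two-dimensional vectors using only the standard library. The helper names `squared_l2` and `nearest` are hypothetical, and FAISS itself is far more optimized than this loop.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(query, vectors, k=2):
    """Return (index, distance) pairs for the k stored vectors closest to query."""
    scored = [(i, squared_l2(query, v)) for i, v in enumerate(vectors)]
    return sorted(scored, key=lambda pair: pair[1])[:k]

# Toy 2-D "embeddings"; real embeddings have hundreds or thousands of dimensions.
stored = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
hits = nearest([0.9, 1.2], stored, k=2)
print(hits)
```

The returned indices play the role of `index_to_docstore_id`: they identify which stored chunk each vector belongs to, so the matching text can be looked up.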

### Define Application State

State controls what data flows through our application. For RAG, we need:
- **question** - User's query
- **context** - Retrieved documents
- **answer** - Generated response

In [None]:
# Define state for application
class State(TypedDict):
    question: str
    context: List[Document]
    answer: str

print("✓ State defined")

### Define Nodes (Application Steps)

Nodes are individual functions that process the state. We need two:
1. **retrieve** - Find relevant documents
2. **generate** - Create answer from documents

In [None]:
# Get the RAG prompt from LangChain Hub
prompt = hub.pull("rlm/rag-prompt")

# Define application steps
def retrieve(state: State):
    """Retrieve relevant documents based on the question"""
    retrieved_docs = vector_store_graph.similarity_search(state["question"])
    return {"context": retrieved_docs}


def generate(state: State):
    """Generate answer using retrieved context"""
    docs_content = "\n\n".join(doc.page_content for doc in state["context"])
    messages = prompt.invoke({"question": state["question"], "context": docs_content})
    response = llm.invoke(messages)
    return {"answer": response.content}

print("✓ Nodes defined: retrieve, generate")

### Build and Compile the Graph

Now let's connect our nodes into a graph:

**What's happening:**
- `StateGraph(State)` - Create graph with our state type
- `.add_sequence([retrieve, generate])` - Connect nodes in order
- `.add_edge(START, "retrieve")` - Define entry point
- `.compile()` - Finalize the graph

In [None]:
# Compile application
graph_builder = StateGraph(State).add_sequence([retrieve, generate])
graph_builder.add_edge(START, "retrieve")
graph = graph_builder.compile()

print("✓ Graph compiled successfully")

### Visualize the Graph (Optional)

LangGraph provides visualization to understand the flow:

In [None]:
from IPython.display import Image, display

try:
    display(Image(graph.get_graph().draw_mermaid_png()))
except Exception as e:
    print(f"Note: Graph visualization is unavailable ({e})")
    print("The graph still works without visualization!")

### Test the LangGraph RAG Application

Let's test our new application with a question about the blog post:

In [None]:
response = graph.invoke({"question": "What is Task Decomposition?"})
print(f"Answer: {response['answer']}")

### Stream Results (See Steps in Real-Time)

One of LangGraph's powerful features is streaming - watch each step execute:

In [None]:
for step in graph.stream(
    {"question": "What is Task Decomposition?"}, 
    stream_mode="updates"
):
    print(f"{step}\n\n{'='*50}\n")

See how you can watch each step execute? First retrieval, then generation!
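
The shape of those streamed updates can be imitated with a Python generator that yields one dictionary per node, keyed by the node's name. This is a simplified sketch of the idea, not LangGraph's implementation; `stream_updates`, `fake_retrieve`, and `fake_generate` are hypothetical names, and the nodes return placeholder values.

```python
def stream_updates(nodes, state):
    """Yield {node_name: partial_update} after each node runs, in order."""
    current = dict(state)
    for node in nodes:
        update = node(current)
        current.update(update)
        yield {node.__name__: update}

# Hypothetical stand-in nodes (no vector store or LLM involved).
def fake_retrieve(state):
    return {"context": ["chunk A", "chunk B"]}

def fake_generate(state):
    return {"answer": f"Answer built from {len(state['context'])} chunks"}

steps = list(stream_updates([fake_retrieve, fake_generate], {"question": "What is Task Decomposition?"}))
for step in steps:
    print(step)
```

Because the generator yields as soon as each node finishes, a caller can display retrieval results while generation is still running.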

### Stream Tokens (Real-Time Generation)

You can even stream individual tokens as they're generated:

In [None]:
print("Streaming answer: ", end="")
for message, metadata in graph.stream(
    {"question": "What is Task Decomposition?"}, 
    stream_mode="messages"
):
    print(message.content, end="", flush=True)
print("\n\n✓ Complete!")

---

## Congratulations! 🎉

You've successfully built a complete RAG application from scratch!

### What You've Learned

Let's recap the journey:

1. **Basic LLM Interaction** ✓
   - Connected to Azure OpenAI
   - Made simple queries
   - Understood response objects

2. **Prompt Engineering** ✓
   - Created prompt templates
   - Structured conversations with roles
   - Used placeholders for dynamic content

3. **Chaining Components** ✓
   - Connected prompt → LLM → parser
   - Understood the pipe (`|`) operator
   - Built modular, reusable components

4. **Document Loading & Splitting** ✓
   - Loaded web content
   - Split into manageable chunks
   - Prepared data for retrieval

5. **Vector Embeddings & Storage** ✓
   - Created embeddings with Azure OpenAI
   - Stored in FAISS vector database
   - Built searchable knowledge bases

6. **Complete RAG Implementation** ✓
   - Combined retrieval + generation
   - Grounded answers in specific documents
   - Reduced hallucinations

7. **Advanced LangGraph** ✓
   - Built stateful applications
   - Defined nodes and control flow
   - Enabled streaming and observability

### Key Concepts

**RAG Benefits:**
- ✅ **Up-to-date information** - Use current documents, not just training data
- ✅ **Domain-specific knowledge** - Incorporate your specialized content
- ✅ **Reduced hallucinations** - Answers grounded in real documents
- ✅ **Source attribution** - Track where answers come from
- ✅ **Easy updates** - Change knowledge base without retraining

**RAG Architecture:**
```
Indexing (Offline):
  Document → Load → Split → Embed → Vector Store

Retrieval & Generation (Runtime):
  Query → Retrieve Relevant Docs → Generate Answer with Context
```

### Next Steps

To extend this RAG application further:

1. **Add Conversation Memory** - Track chat history for multi-turn conversations
2. **Implement Query Analysis** - Optimize search queries before retrieval
3. **Add Metadata Filtering** - Filter documents by date, section, or category
4. **Use Multiple Retrievers** - Combine different search strategies
5. **Add Re-ranking** - Improve retrieval quality with re-ranking models
6. **Deploy to Production** - Use LangGraph Platform for deployment

### Resources

- [LangChain Documentation](https://python.langchain.com/)
- [LangGraph Documentation](https://langchain-ai.github.io/langgraph/)
- [RAG Tutorial Part 2](https://python.langchain.com/docs/tutorials/rag/#next-steps) - Multi-turn conversations
- [Azure OpenAI Service](https://learn.microsoft.com/azure/ai-services/openai/)
- [LangSmith](https://smith.langchain.com/) - Tracing and debugging

### Practice Exercise

Try building your own RAG application:
1. Choose a different document source (PDF, database, API)
2. Experiment with different chunk sizes
3. Try different embedding models
4. Add custom prompts for your use case
5. Implement conversation memory

**Happy building!** 🚀