# RAG (Retrieval-Augmented Generation) with LangChain

This notebook demonstrates how to build a complete RAG pipeline using LangChain, LangGraph, and Azure OpenAI. RAG enhances language model responses by retrieving relevant context from a knowledge base before generating answers.

## What is RAG?

**Retrieval-Augmented Generation (RAG)** is a technique that combines:
- **Retrieval**: Finding relevant information from a knowledge base
- **Generation**: Using an LLM to create answers based on retrieved context

This approach helps LLMs give more accurate, grounded responses and reduces (but does not eliminate) hallucinated information.

## Workflow Overview

We'll cover the following steps:
1. **Environment Setup**: Load API credentials securely
2. **Initialize Embeddings**: Create an embedding model for converting text to vectors
3. **Initialize LLM**: Set up the language model for generating responses
4. **Create Vector Store**: Initialize an in-memory vector database
5. **Load Documents**: Fetch and parse web content
6. **Split Documents**: Break large documents into smaller chunks
7. **Store Embeddings**: Add document chunks to the vector store
8. **Create RAG Chain**: Build a retrieval-augmented generation workflow
9. **Test the System**: Query the RAG system and stream results

Let's build an intelligent question-answering system!

## Step 1: Environment Setup and API Key Management

This section handles the secure loading of API credentials.

### Your Task:
Set up environment variables for Azure OpenAI API access.

**Steps:**
1. Import required modules:
   - `load_dotenv` from `dotenv` (loads environment variables from .env file)
   - `getpass` (for secure password input)
   - `os` (for environment variable access)
2. Call `load_dotenv()` to load environment variables from a `.env` file if present
3. Check if `AZURE_OPENAI_API_KEY` exists in environment variables
4. If not found, prompt the user securely: `os.environ["AZURE_OPENAI_API_KEY"] = getpass.getpass("Enter your Azure OpenAI API key: ")`

**Security Best Practice:** 
- Never hardcode API keys in code
- Use environment variables or secure prompts
- This approach works both locally (with a `.env` file) and in production (with system environment variables)

**Expected Output:** No visible output, but the API key will be securely stored
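
The check-then-prompt pattern from the steps above can be sketched in isolation. This is a minimal illustration, not the notebook's solution: the helper `ensure_env_var`, the variable name `DEMO_API_KEY`, and the stubbed prompt are all hypothetical, chosen so the example runs without reading from the terminal.

```python
import os
from getpass import getpass


def ensure_env_var(name, prompt=getpass, env=os.environ):
    """Return env[name], prompting for a value only when it is missing or empty."""
    if not env.get(name):
        env[name] = prompt(f"Enter a value for {name}: ")
    return env[name]


# Demo with a plain dict and a stub prompt, so nothing is read from the terminal.
demo_env = {}
first = ensure_env_var("DEMO_API_KEY", prompt=lambda _msg: "typed-by-user", env=demo_env)

# A second call finds the value already set and never invokes the prompt.
second = ensure_env_var("DEMO_API_KEY", prompt=lambda _msg: "ignored", env=demo_env)
print(first == second)
```

In the notebook itself the same idea applies directly to `os.environ` and `getpass.getpass`, as described in step 4.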

In [None]:
# TODO: Import load_dotenv from dotenv, getpass, and os


# TODO: Load environment variables from .env file


# TODO: Check if AZURE_OPENAI_API_KEY exists, if not, prompt for it


## Step 2: Initialize the Embedding Model

**Embeddings** convert text into numerical vectors that capture semantic meaning.

### Your Task:
Configure Azure OpenAI embeddings for converting text to vectors.

**Steps:**
1. Import `AzureOpenAIEmbeddings` from `langchain_openai`
2. Create an embeddings instance with these parameters:
   - `azure_endpoint="https://aoi-ext-eus-aiml-profx-01.openai.azure.com/"`
   - `api_key=os.environ["AZURE_OPENAI_API_KEY"]`
   - `model="text-embedding-ada-002"`
   - `api_version="2024-12-01-preview"`

### Why embeddings matter in RAG:
Similar concepts will have similar vector representations, allowing us to find relevant documents by comparing vector similarity rather than exact keyword matches.
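
To make "vector similarity" concrete, here is a small sketch of cosine similarity on hand-made 3-dimensional vectors. The vectors are invented for illustration only; real embeddings from the model have far more dimensions and are not written by hand.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-made toy "embeddings" (assumed values, not model output).
toy_vectors = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "invoice": [0.0, 0.2, 0.9],
}

sim_close = cosine_similarity(toy_vectors["cat"], toy_vectors["kitten"])
sim_far = cosine_similarity(toy_vectors["cat"], toy_vectors["invoice"])
print(f"cat vs kitten:  {sim_close:.3f}")
print(f"cat vs invoice: {sim_far:.3f}")
```

Related concepts score close to 1, unrelated ones score close to 0, which is what lets retrieval rank documents by meaning.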

**Model Details:**
- **text-embedding-ada-002**: An OpenAI embedding model available through Azure OpenAI
- **Output**: 1536-dimensional vectors
- **Use Cases**: Semantic search, clustering, similarity comparison

**Expected Output:** No output, but the `embeddings` object will be ready to use

In [None]:
# TODO: Import AzureOpenAIEmbeddings from langchain_openai


# TODO: Create an AzureOpenAIEmbeddings instance with the required parameters


## Step 3: Initialize the Language Model (LLM)

**Large Language Models (LLMs)** generate human-like text responses based on input prompts.

### Your Task:
Set up Azure Chat OpenAI for generating answers.

**Steps:**
1. Import `AzureChatOpenAI` from `langchain_openai`
2. Create an LLM instance with these parameters:
   - `azure_endpoint="https://aoi-ext-eus-aiml-profx-01.openai.azure.com/"`
   - `api_key=os.environ["AZURE_OPENAI_API_KEY"]`
   - `model="gpt-4o"`
   - `api_version="2024-12-01-preview"`

### Role in RAG:
The LLM will generate the final answer by:
1. Receiving the user's question
2. Receiving relevant context retrieved from the vector store
3. Synthesizing a response that combines both

**Model Details:**
- **gpt-4o**: OpenAI's GPT-4o model (the "o" stands for "omni"), accessed through Azure OpenAI
- **Capabilities**: Advanced reasoning, following instructions, grounded responses

**Expected Output:** No output, but the `llm` object will be ready to generate responses

In [None]:
# TODO: Import AzureChatOpenAI from langchain_openai


# TODO: Create an AzureChatOpenAI instance with the required parameters


## Step 4: Create the Vector Store

**Vector Stores** store document embeddings and enable fast similarity searches.

### Your Task:
Initialize an in-memory vector store for storing document embeddings.

**Steps:**
1. Import `InMemoryVectorStore` from `langchain_core.vectorstores`
2. Create a vector store instance: `vector_store = InMemoryVectorStore(embeddings)`

### What is InMemoryVectorStore?
- Stores vectors in RAM (not persisted to disk)
- Fast for development and small datasets
- Uses the `embeddings` object we pass in to embed both documents and queries
- Data exists only during the session

### How it works:
1. Documents are converted to embeddings using our embedding model
2. Embeddings are stored along with the original text
3. When queried, it finds the most similar vectors using cosine similarity
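
The three steps above can be sketched with a toy store. This is not how `InMemoryVectorStore` is implemented; `ToyVectorStore` and `toy_embed` (a word-count "embedding") are hypothetical stand-ins that only illustrate the embed, store, and rank-by-cosine idea.

```python
import math
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: lowercase word counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorStore:
    def __init__(self, embed):
        self.embed = embed
        self.rows = []  # list of (vector, original text) pairs kept in RAM

    def add_texts(self, texts):
        for text in texts:
            self.rows.append((self.embed(text), text))

    def similarity_search(self, query, k=2):
        query_vector = self.embed(query)
        ranked = sorted(self.rows, key=lambda row: cosine(query_vector, row[0]), reverse=True)
        return [text for _, text in ranked[:k]]


store = ToyVectorStore(toy_embed)
store.add_texts([
    "agents break a task into smaller steps",
    "the vector store keeps embeddings in memory",
    "bananas are yellow",
])
top = store.similarity_search("where are embeddings kept in memory", k=1)
print(top)
```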

**Production Alternatives:**
For production use, consider:
- **Chroma**: Local, persistent vector database
- **Pinecone**: Managed cloud vector database
- **Azure AI Search**: Azure's vector search service
- **Weaviate**: Open-source vector database

**Expected Output:** No output, but the vector store is ready to store embeddings

In [None]:
# TODO: Import InMemoryVectorStore from langchain_core.vectorstores


# TODO: Create an InMemoryVectorStore instance with the embeddings object


## Step 5: Load Documents from the Web

**Document Loading** is the first step in building a knowledge base for RAG.

### Your Task:
Load content from a blog post about AI agents.

**Steps:**
1. Import required modules:
   - `bs4` (BeautifulSoup for HTML parsing)
   - `WebBaseLoader` from `langchain_community.document_loaders`
2. Create a BeautifulSoup strainer to filter HTML:
   - `bs4_strainer = bs4.SoupStrainer(class_=("post-title", "post-header", "post-content"))`
   - This keeps only the main content, removing navigation, ads, etc.
3. Create a WebBaseLoader with:
   - `web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",)`
   - `bs_kwargs={"parse_only": bs4_strainer}`
4. Load the documents: `docs = loader.load()`
5. Verify one document was loaded: `assert len(docs) == 1`
6. Print the total characters: `print(f"Total characters: {len(docs[0].page_content)}")`

### Why filter HTML?
- Reduces noise (ads, navigation, comments)
- Focuses on the main content
- Improves embedding quality and retrieval accuracy
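
The effect of class-based filtering can be sketched with the standard library alone. This is not BeautifulSoup's `SoupStrainer`; the `ClassFilter` parser and the sample HTML below are hypothetical, and the sketch assumes well-formed HTML without void elements such as `<br>`.

```python
from html.parser import HTMLParser


class ClassFilter(HTMLParser):
    """Collect text only from elements whose class is in `keep`, plus their children."""

    def __init__(self, keep):
        super().__init__()
        self.keep = set(keep)
        self.depth = 0  # greater than 0 while inside a kept element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.depth or self.keep.intersection(classes):
            self.depth += 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())


page = """
<nav class="menu">Home | About</nav>
<h1 class="post-title">Agents</h1>
<div class="post-content"><p>Planning and memory.</p></div>
<footer class="ads">Buy now</footer>
"""

parser = ClassFilter(keep=("post-title", "post-content"))
parser.feed(page)
text = " ".join(parser.chunks)
print(text)
```

Only the title and body text survive; the navigation and footer never reach the embedding model.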

**Expected Output:** Message showing the total number of characters loaded (should be in the tens of thousands)

In [None]:
# TODO: Import bs4 and WebBaseLoader


# TODO: Create a SoupStrainer to filter HTML (keep only post-title, post-header, post-content classes)


# TODO: Create a WebBaseLoader with the blog URL and bs_kwargs


# TODO: Load the documents


# TODO: Assert that exactly one document was loaded


# TODO: Print the total number of characters in the loaded document


### Preview the Document Content

Let's inspect the loaded content to verify it was extracted correctly.

### Your Task:
Print the first 500 characters of the loaded document.

**Steps:**
1. Print the first 500 characters: `print(docs[0].page_content[:500])`

**Expected Output:** A preview of the blog post content showing the title and beginning of the article

In [None]:
# TODO: Print the first 500 characters of the document content


## Step 6: Split Documents into Chunks

**Why split documents?**
- LLMs have context window limits (maximum input size)
- Smaller chunks create more precise embeddings
- Retrieval can target specific relevant sections instead of entire documents

### Your Task:
Split the large blog post into manageable chunks.

**Steps:**
1. Import `RecursiveCharacterTextSplitter` from `langchain_text_splitters`
2. Create a text splitter with these parameters:
   - `chunk_size=1000` (at most 1000 characters per chunk)
   - `chunk_overlap=200` (consecutive chunks share up to 200 characters)
   - `add_start_index=True` (track position in original document)
3. Split the documents: `all_splits = text_splitter.split_documents(docs)`
4. Print the number of chunks: `print(f"Split blog post into {len(all_splits)} sub-documents.")`

### The RecursiveCharacterTextSplitter:
- Tries to split on natural boundaries (paragraphs, sentences, words)
- Falls back to character-level splitting if needed
- Preserves semantic coherence within chunks
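
To see how `chunk_size`, `chunk_overlap`, and the start index interact, here is a simplified fixed-window splitter. It is deliberately not the recursive algorithm: it ignores paragraph and sentence boundaries and only shows the size and overlap arithmetic. The function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Slide a window of chunk_size over text, stepping by chunk_size - chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append({"start_index": start, "text": text[start:start + chunk_size]})
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


# 50 characters of repeating alphabet as sample input.
sample = "".join(chr(ord("a") + i % 26) for i in range(50))
pieces = split_with_overlap(sample, chunk_size=20, chunk_overlap=5)

for piece in pieces:
    print(piece["start_index"], piece["text"])
```

Each chunk begins with the last 5 characters of the previous one, so a sentence cut at a boundary still appears whole in at least one chunk when the overlap is large enough.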

**Expected Output:** Message showing the blog post was split into multiple chunks (typically 40-80)

In [None]:
# TODO: Import RecursiveCharacterTextSplitter from langchain_text_splitters


# TODO: Create a text splitter with chunk_size=1000, chunk_overlap=200, add_start_index=True


# TODO: Split the documents using split_documents()


# TODO: Print the number of chunks created


## Step 7: Add Documents to Vector Store

**Indexing** means converting the document chunks to embeddings and storing them in the vector store.

### Your Task:
Store all document chunks in the vector store with their embeddings.

**Steps:**
1. Call `vector_store.add_documents()` with the split documents: `document_ids = vector_store.add_documents(documents=all_splits)`
2. Print the first 3 IDs: `print(document_ids[:3])`

### What happens during indexing:
1. Each document chunk is sent to the embedding model
2. The embedding model returns a vector for each chunk
3. The vector store saves both the embedding and the original text
4. Returns unique IDs for each stored document
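
The four indexing steps can be sketched as a plain loop. This is an illustration under assumptions, not the vector store's actual code: `toy_embed` is a hypothetical stand-in for the embedding model, and a dict plays the role of the store.

```python
import uuid


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: two crude text features."""
    return [float(len(text)), float(text.count(" "))]


def index_chunks(chunks, embed, store):
    """Embed each chunk, save vector and text under a new UUID, return the IDs."""
    ids = []
    for chunk in chunks:
        doc_id = str(uuid.uuid4())
        store[doc_id] = {"vector": embed(chunk), "text": chunk}
        ids.append(doc_id)
    return ids


store = {}
ids = index_chunks(["first chunk of text", "second chunk"], toy_embed, store)
print(ids[:3])
```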

### The Magic:
- Documents with similar semantic meaning will have similar embedding vectors
- When we later search with a question, the vector store can find chunks with similar embeddings
- This enables **semantic search** (meaning-based) rather than keyword matching

**Expected Output:** A list showing the first 3 document IDs (UUIDs)

**⚠️ Note:** This may take 30-60 seconds as it generates embeddings for all chunks

In [None]:
# TODO: Add all document chunks to the vector store


# TODO: Print the first 3 document IDs


## Step 8: Load the RAG Prompt Template

**Prompt Engineering** is crucial for effective RAG systems.

### Your Task:
Load a pre-built RAG prompt template from LangChain Hub.

**Steps:**
1. Import `hub` from `langchain`
2. Load the RAG prompt: `prompt = hub.pull("rlm/rag-prompt")`
3. Preview the prompt by invoking it with example data:
   - `example_messages = prompt.invoke({"context": "(context goes here)", "question": "(question goes here)"}).to_messages()`
4. Assert one message exists: `assert len(example_messages) == 1`
5. Print the prompt content: `print(example_messages[0].content)`

### What is `hub.pull()`?
- Loads a pre-built prompt template from LangChain Hub
- **`"rlm/rag-prompt"`**: A well-tested RAG prompt template

### The RAG Prompt Structure:
The template instructs the LLM to:
1. Use the provided context to answer the question
2. Say "I don't know" if the context doesn't contain the answer
3. Keep answers concise and grounded in the context
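
A template with this structure can be sketched with plain string formatting. The wording below is a hypothetical template written in the same spirit; it is not the text stored under `"rlm/rag-prompt"`, which you should inspect by printing it as described in the steps.

```python
# Hypothetical template wording, written for illustration only.
TEMPLATE = (
    "Answer the question using only the context below. "
    "If the context does not contain the answer, say you don't know. "
    "Keep the answer concise.\n"
    "Question: {question}\n"
    "Context: {context}\n"
    "Answer:"
)


def render_prompt(question, context):
    """Fill the two placeholders and return the final prompt text."""
    return TEMPLATE.format(question=question, context=context)


rendered = render_prompt("(question goes here)", "(context goes here)")
print(rendered)
```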

### Why use a template?
- Consistent response quality
- Reduces hallucinations (making up information)
- Ensures the LLM leverages the retrieved context effectively

**Expected Output:** The formatted prompt template showing how context and questions will be combined

In [None]:
# TODO: Import hub from langchain


# TODO: Pull the RAG prompt template from LangChain Hub


# TODO: Create example messages by invoking the prompt with placeholder context and question


# TODO: Assert that one message was created


# TODO: Print the prompt content


## Step 9: Define the State Schema

**State Management** for our RAG workflow using TypedDict.

### Your Task:
Define a state schema that will flow through the RAG pipeline.

**Steps:**
1. Import required types:
   - `Document` from `langchain_core.documents`
   - `List` and `TypedDict` from `typing_extensions`
2. Create a `State` class that inherits from `TypedDict` with three fields:
   - `question`: str (the user's input query)
   - `context`: List[Document] (retrieved documents)
   - `answer`: str (the LLM's generated response)

### What is State?
A structured data container that flows through our RAG pipeline, containing all necessary information at each step.

### Why define State?
- **Type Safety**: Documents the expected data types so static type checkers and editors can flag mismatches (`TypedDict` is not enforced at runtime)
- **Clarity**: Makes the data flow explicit and easier to debug
- **LangGraph Integration**: LangGraph uses this schema to manage state between nodes
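
The sketch below shows how a typed state dictionary grows as it moves through a pipeline. It is a simplified illustration: `ToyState` is hypothetical, uses plain strings instead of `Document` objects, and sets `total=False` (unlike the `State` class in this step) so the dictionary can start with only a question.

```python
from typing import List, TypedDict


class ToyState(TypedDict, total=False):
    # total=False lets the dict start with only `question`; this differs from
    # the notebook's State and is an assumption made for the demo.
    question: str
    context: List[str]  # plain strings stand in for Document objects
    answer: str


state: ToyState = {"question": "What is RAG?"}

# Each step returns a partial update that is merged into the running state.
state = {**state, "context": ["RAG retrieves context before generating."]}
state = {**state, "answer": "It combines retrieval with generation."}

# TypedDict is only a type hint: at runtime the state is an ordinary dict.
print(type(state).__name__, sorted(state))
```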

**Expected Output:** No output, but the `State` type is now defined for use in the workflow

In [None]:
# TODO: Import Document from langchain_core.documents


# TODO: Import List and TypedDict from typing_extensions


# TODO: Define a State class (TypedDict) with question (str), context (List[Document]), and answer (str)


## Step 10: Define RAG Workflow Functions

**The two core functions of the RAG pipeline:**

### Your Task:
Create the retrieval and generation functions for the RAG workflow.

**Function 1: `retrieve(state: State)`**
- Takes the current state containing the question
- Uses `vector_store.similarity_search()` to find relevant documents (the top 4 matches by default)
- Returns a dictionary with `context` key containing the retrieved documents

**Function 2: `generate(state: State)`**
- Takes the state containing question and context
- Combines all document content into a single string using `"\n\n".join()`
- Formats the prompt with question and context
- Invokes the LLM to generate an answer
- Returns a dictionary with `answer` key containing the LLM's response content

**Steps:**
1. Define `retrieve(state: State)`:
   - Get documents: `retrieved_docs = vector_store.similarity_search(state["question"])`
   - Return: `{"context": retrieved_docs}`
   
2. Define `generate(state: State)`:
   - Join documents: `docs_content = "\n\n".join(doc.page_content for doc in state["context"])`
   - Format prompt: `messages = prompt.invoke({"question": state["question"], "context": docs_content})`
   - Get response: `response = llm.invoke(messages)`
   - Return: `{"answer": response.content}`

### The RAG Flow:
**Question → Retrieve Context → Generate Answer**
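
The shape of the two functions can be tried out without any API calls. In this sketch `toy_search` and `toy_llm` are hypothetical stubs standing in for `vector_store.similarity_search()` and `llm.invoke()`, and plain strings stand in for `Document` objects; only the input and output dictionaries mirror the steps above.

```python
KNOWLEDGE = [
    "Task decomposition splits a big task into smaller steps.",
    "Bananas are yellow.",
]


def toy_search(question):
    """Stub retriever: return documents sharing at least one word with the question."""
    words = set(question.lower().replace("?", "").split())
    return [doc for doc in KNOWLEDGE if words & set(doc.lower().replace(".", "").split())]


def toy_llm(prompt_text):
    """Stub LLM: report the prompt size instead of generating text."""
    return f"(stub answer based on {len(prompt_text)} prompt characters)"


def retrieve(state):
    return {"context": toy_search(state["question"])}


def generate(state):
    docs_content = "\n\n".join(state["context"])
    prompt_text = f"Question: {state['question']}\nContext: {docs_content}"
    return {"answer": toy_llm(prompt_text)}


state = {"question": "What is task decomposition?"}
state.update(retrieve(state))  # adds "context"
state.update(generate(state))  # adds "answer"
print(state["answer"])
```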

**Expected Output:** No output, but the two functions are defined

In [None]:
# TODO: Define the retrieve function that takes state and returns context


# TODO: Define the generate function that takes state and returns answer


## Step 11: Build the RAG Graph with LangGraph

**LangGraph** creates a stateful, directed workflow for our RAG pipeline.

### Your Task:
Build and compile the RAG workflow graph.

**Steps:**
1. Import required classes:
   - `START` and `StateGraph` from `langgraph.graph`
2. Create a graph builder: `graph_builder = StateGraph(State)`
3. Add the workflow sequence: `graph_builder.add_sequence([retrieve, generate])`
   - This chains the functions: retrieve → generate
4. Add the entry edge: `graph_builder.add_edge(START, "retrieve")`
   - Defines where the workflow starts
5. Compile the graph: `graph = graph_builder.compile()`

### Graph Construction:
- **StateGraph(State)**: Initialize a graph that manages our State schema
- **add_sequence([retrieve, generate])**: Chain the functions sequentially
- **add_edge(START, "retrieve")**: Define the entry point
- **compile()**: Build the executable graph
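
Conceptually, a compiled two-node sequence behaves like a loop that calls each node and merges its returned update into the state. The sketch below shows only that idea; `run_sequence` is a hypothetical helper, not LangGraph's implementation, and the two nodes are stubs.

```python
def retrieve(state):
    """Stub node: pretend one document was retrieved for the question."""
    return {"context": [f"(stub document for: {state['question']})"]}


def generate(state):
    """Stub node: pretend an answer was generated from the context."""
    return {"answer": f"(stub answer using {len(state['context'])} document(s))"}


def run_sequence(nodes, initial_state):
    """Call each node in order and merge its partial update into the state."""
    state = dict(initial_state)
    for node in nodes:
        state.update(node(state))
    return state


result = run_sequence([retrieve, generate], {"question": "What is Task Decomposition?"})
print(result["answer"])
```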

### Benefits of LangGraph:
- **State Passing**: Automatically passes state between nodes
- **Debugging**: Can visualize and trace the workflow
- **Flexibility**: Easy to add steps (e.g., query rewriting, re-ranking)
- **Streaming**: Can stream intermediate results

**Expected Output:** No output, but the compiled graph is ready to process questions

In [None]:
# TODO: Import START and StateGraph from langgraph.graph


# TODO: Create a StateGraph instance with the State schema


# TODO: Add the retrieve and generate functions as a sequence


# TODO: Add an edge from START to retrieve


# TODO: Compile the graph


## Step 12: Visualize the Workflow Graph

**Visual Representation** of our RAG pipeline using Mermaid diagrams.

### Your Task:
Generate and display a visual diagram of the workflow.

**Steps:**
1. Import display tools:
   - `Image` and `display` from `IPython.display`
2. Get the graph visualization: `graph.get_graph().draw_mermaid_png()`
3. Display it: `display(Image(graph.get_graph().draw_mermaid_png()))`

### What to Expect:
The diagram will show:
- **Nodes**: Each step in the workflow (`retrieve`, `generate`)
- **Edges**: The flow of data between steps
- **Entry Point**: Where the workflow starts (START → retrieve)
- **End Point**: Where it completes

This visualization helps understand the execution flow and is useful for:
- Debugging complex workflows
- Documentation and team communication
- Identifying optimization opportunities
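
For intuition about what a Mermaid diagram is, the sketch below builds Mermaid flowchart source for a linear pipeline by hand. The helper `to_mermaid` and the `__start__` / `__end__` node labels are assumptions for this illustration; the text LangGraph generates for your graph may be formatted differently.

```python
def to_mermaid(node_names):
    """Build Mermaid flowchart source for a linear start -> nodes -> end chain."""
    chain = ["__start__", *node_names, "__end__"]
    lines = ["graph TD;"]
    for src, dst in zip(chain, chain[1:]):
        lines.append(f"    {src} --> {dst};")
    return "\n".join(lines)


diagram = to_mermaid(["retrieve", "generate"])
print(diagram)
```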

**Expected Output:** A visual diagram showing the RAG pipeline flow

In [None]:
# TODO: Import Image and display from IPython.display


# TODO: Display the graph visualization using graph.get_graph().draw_mermaid_png()


## Step 13: Test the RAG System

**First Query** - Testing the complete pipeline with a sample question.

### Your Task:
Invoke the RAG graph with a question and examine the results.

**Steps:**
1. Invoke the graph with a question: `result = graph.invoke({"question": "What is Task Decomposition?"})`
2. Print the context: `print(f"Context: {result['context']}\n\n")`
3. Print the answer: `print(f"Answer: {result['answer']}")`

### Execution Flow:
1. **Input**: `{"question": "What is Task Decomposition?"}`
2. **Retrieve Step**: 
   - Converts question to embedding
   - Finds most similar document chunks from the blog post
   - Returns relevant context about task decomposition
3. **Generate Step**:
   - Formats prompt with question + retrieved context
   - LLM generates answer based on the context
4. **Output**: Complete result with context and answer

### Expected Result:
- **Context**: The actual document chunks retrieved from the vector store
- **Answer**: A concise explanation of Task Decomposition based on the blog content

**This is RAG in action! The LLM is answering based on retrieved context, not just its training data.**

**Expected Output:** Retrieved context documents and a generated answer about task decomposition

In [None]:
# TODO: Invoke the graph with the question "What is Task Decomposition?"


# TODO: Print the context from the result


# TODO: Print the answer from the result


## Step 14: Stream the Workflow Updates

**Streaming Mode: Updates** - Observe each step of the workflow as it executes.

### Your Task:
Stream the workflow execution to see state changes at each step.

**Steps:**
1. Use `graph.stream()` with `stream_mode="updates"`:
   ```python
   for step in graph.stream({"question": "What is Task Decomposition?"}, stream_mode="updates"):
       print(f"{step}\n\n----------------\n")
   ```

### What is `stream_mode="updates"`?
- Shows the state changes after each node completes
- Provides visibility into the pipeline execution
- Useful for debugging and understanding the workflow

### Output Format:
You'll see two updates:
1. **After `retrieve`**: Shows the retrieved context documents
2. **After `generate`**: Shows the generated answer

This is helpful for:
- **Debugging**: Identify which step is slow or failing
- **Validation**: Verify the right documents are being retrieved
- **Monitoring**: Track progress in real-time applications
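
The idea behind update streaming can be sketched with a generator that yields one dictionary per node, keyed by the node's name. This is a simplified illustration: `stream_updates` is a hypothetical helper and the nodes are stubs, so the output only mimics the general shape of the two updates described above.

```python
def retrieve(state):
    return {"context": ["(stub document)"]}


def generate(state):
    return {"answer": f"(stub answer from {len(state['context'])} document(s))"}


def stream_updates(nodes, initial_state):
    """Yield {node_name: update} as soon as each node finishes."""
    state = dict(initial_state)
    for node in nodes:
        update = node(state)
        state.update(update)
        yield {node.__name__: update}


steps = []
for step in stream_updates([retrieve, generate], {"question": "What is Task Decomposition?"}):
    print(f"{step}\n\n----------------\n")
    steps.append(step)
```

Because the generator yields after every node, a slow or failing step is visible immediately instead of only at the end of the run.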

**Expected Output:** Two dictionaries showing state updates after retrieve and generate steps

In [None]:
# TODO: Stream the workflow with stream_mode="updates" and print each step


## Step 15: Stream LLM Messages (Token-by-Token)

**Streaming Mode: Messages** - Watch the LLM generate the response in real-time.

### Your Task:
Stream the LLM's response as it's generated, token by token.

**Steps:**
1. Use `graph.stream()` with `stream_mode="messages"`:
   ```python
   for message, metadata in graph.stream({"question": "What is Task Decomposition?"}, stream_mode="messages"):
       print(message.content, end="|")
   ```

### What is `stream_mode="messages"`?
- Streams individual messages/tokens as they're generated by the LLM
- Shows the answer being constructed incrementally
- Similar to how ChatGPT displays responses word-by-word

### Use Cases:
- **User Experience**: Display progressive responses in chat interfaces
- **Real-time Feedback**: Users see that processing is happening
- **Early Termination**: Can stop generation if the answer is sufficient

### Output:
- Each token/chunk of the LLM response is printed as it's generated
- The `|` separator shows where each chunk ends
- This demonstrates the streaming capability for production applications
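
The printing pattern can be tried without an LLM by faking the token stream. In this sketch `fake_token_stream` is a hypothetical generator that splits a fixed answer into word-sized pieces; real chunks from the model will not necessarily align with word boundaries.

```python
import re


def fake_token_stream(answer):
    """Yield the answer in word-sized pieces, standing in for LLM token chunks."""
    for piece in re.findall(r"\S+\s*", answer):
        yield piece


answer = "Task decomposition splits a task into steps."
received = []
for chunk in fake_token_stream(answer):
    print(chunk, end="|")  # the separator marks where each chunk ends
    received.append(chunk)
print()
```

Joining the received chunks reproduces the full answer, which is how a chat interface assembles the final message while still showing progress.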

**Expected Output:** The answer text appearing gradually, separated by `|` characters

**This completes your RAG implementation! 🎉**

In [None]:
# TODO: Stream the workflow with stream_mode="messages" and print each message token


## Congratulations! 🎉

You've successfully built a complete RAG (Retrieval-Augmented Generation) system using LangChain, Azure OpenAI, and LangGraph!

### What You've Accomplished:
- ✅ Set up secure API key management
- ✅ Configured Azure OpenAI embeddings for semantic search
- ✅ Initialized a language model for text generation
- ✅ Created a vector store for efficient similarity search
- ✅ Loaded and parsed web content
- ✅ Split documents into manageable chunks
- ✅ Indexed documents with embeddings
- ✅ Loaded and understood RAG prompt templates
- ✅ Defined state management for workflow
- ✅ Built a complete RAG pipeline with LangGraph
- ✅ Visualized the workflow
- ✅ Tested the system with queries
- ✅ Implemented streaming for real-time responses

### Key Concepts Mastered:
1. **RAG Architecture**: Combining retrieval and generation for grounded responses
2. **Vector Embeddings**: Converting text to numerical representations
3. **Semantic Search**: Finding relevant information by meaning, not keywords
4. **State Management**: Managing data flow through complex workflows
5. **LangGraph**: Building stateful, multi-step AI applications
6. **Streaming**: Providing real-time user feedback

### Next Steps:
- Try different questions about AI agents
- Experiment with different chunk sizes and retrieval parameters
- Add more documents to the knowledge base
- Implement query rewriting or re-ranking for better retrieval
- Deploy this as a web API or chatbot

### Challenge Exercises:
1. Modify the system to load multiple web pages
2. Add a step to the workflow that validates if retrieved context is relevant
3. Implement a fallback mechanism when no good context is found
4. Create a custom prompt template for a specific domain (e.g., technical support)
5. Add conversation memory to make the system handle follow-up questions

**You now have a solid foundation for building intelligent question-answering systems!**