# Modern RAG with LangChain 1.0 & LangGraph

This notebook demonstrates Retrieval-Augmented Generation (RAG) using the latest LangChain patterns. We'll cover:

1. **Basic RAG Pipeline** - Document loading, splitting, embedding, retrieval, generation
2. **Conversational RAG** - Multi-turn conversations with chat history
3. **Agentic RAG with LangGraph** - Dynamic, stateful RAG workflows
4. **Structured Output RAG** - Extracting typed data from RAG responses

---

## What is RAG?

RAG (Retrieval-Augmented Generation) enhances LLM responses by:
1. **Retrieving** relevant documents from a knowledge base
2. **Augmenting** the LLM's context with those documents
3. **Generating** a response grounded in the retrieved information

Grounding answers in retrieved text reduces (but does not eliminate) hallucinations and lets an LLM answer questions about your own data.
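
The three steps can be sketched without any library. This is a toy illustration, not how LangChain implements them: the knowledge base is made-up example data, `retrieve` ranks by plain word overlap as a stand-in for vector search, and `generate` is a placeholder that only echoes the prompt an LLM would receive.

```python
import re

# Toy knowledge base (hypothetical example data).
KNOWLEDGE_BASE = [
    "The Transformer relies entirely on attention mechanisms.",
    "BLEU is a metric for evaluating machine translation.",
    "Chroma is a vector database.",
]

def tokenize(text: str) -> set[str]:
    """Lowercase a string and split it into a set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query: str, docs: list[str], k: int = 1) -> list[str]:
    """Step 1: rank documents by word overlap with the query."""
    query_tokens = tokenize(query)
    ranked = sorted(docs, key=lambda d: len(query_tokens & tokenize(d)), reverse=True)
    return ranked[:k]

def augment(query: str, context_docs: list[str]) -> str:
    """Step 2: place the retrieved text in the prompt next to the question."""
    context = "\n".join(context_docs)
    return f"Context:\n{context}\n\nQuestion: {query}"

def generate(prompt: str) -> str:
    """Step 3: placeholder for an LLM call; it only echoes the prompt."""
    return f"[LLM would answer from this prompt]\n{prompt}"

query = "What is BLEU?"
answer = generate(augment(query, retrieve(query, KNOWLEDGE_BASE)))
print(answer)
```

The rest of the notebook replaces each placeholder with a real component: a vector store for `retrieve`, a prompt template for `augment`, and a chat model for `generate`.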

---

## Setup

In [None]:
%pip install -qU langchain langchain-classic langchain-text-splitters langchain-openai langchain-community langchain-chroma langgraph pypdf chromadb

In [None]:
import os
import getpass

def _set_env(var: str):
    if not os.environ.get(var):
        os.environ[var] = getpass.getpass(f"{var}: ")

_set_env("OPENAI_API_KEY")

---

# Part 1: Basic RAG Pipeline

The RAG pipeline consists of:

1. **Load** - Read documents from various sources
2. **Split** - Break documents into smaller chunks
3. **Embed** - Convert text to numerical vectors
4. **Store** - Save vectors in a vector database
5. **Retrieve** - Find relevant chunks for a query
6. **Generate** - Use retrieved context to answer

![RAG Pipeline](https://python.langchain.com/v0.1/assets/images/rag_pipeline-8f9d36e0a21e6aaa36d42b48a21f5bd2.png)

## 1.1 Document Loading

LangChain provides loaders for many file types. Let's load a PDF:

In [None]:
from langchain_community.document_loaders import PyPDFLoader

# Load the "Attention Is All You Need" paper (Transformer architecture)
pdf_path = "./assets-resources/attention-paper.pdf"
loader = PyPDFLoader(pdf_path)

# load() returns one Document per PDF page
documents = loader.load()

print(f"Loaded {len(documents)} pages")
print("\nFirst page preview (first 500 chars):")
print(documents[0].page_content[:500])

In [None]:
# Check the metadata
print("Document metadata:")
print(documents[0].metadata)

## 1.2 Text Splitting

Large documents need to be split into smaller chunks so that retrieval can return focused passages that fit comfortably in the model's context window:

In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

# RecursiveCharacterTextSplitter tries to keep related text together
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,      # Target chunk size in characters
    chunk_overlap=200,    # Overlap between chunks for context continuity
    separators=["\n\n", "\n", ".", " ", ""]  # Try these in order
)

splits = text_splitter.split_documents(documents)

print(f"Split {len(documents)} pages into {len(splits)} chunks")
print(f"\nExample chunk (first 300 chars):")
print(splits[5].page_content[:300])
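
To see what `chunk_size` and `chunk_overlap` mean in isolation, here is a simplified sliding-window splitter. It is only a sketch: unlike `RecursiveCharacterTextSplitter`, it ignores separators and cuts at fixed character offsets, and the function name `split_with_overlap` is made up for this example.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size character windows that overlap their neighbors."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last four characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.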

## 1.3 Embeddings & Vector Store

Embeddings convert text to numerical vectors that capture semantic meaning. Similar texts have similar vectors.
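
Cosine similarity is one common way to measure how "similar" two vectors are. The sketch below uses hand-written three-dimensional vectors as hypothetical embeddings; real embedding models return vectors with hundreds or thousands of dimensions, and a vector store may use a different distance metric.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction, 0.0 = orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings" for three phrases.
attention = [0.9, 0.1, 0.0]
self_attention = [0.8, 0.2, 0.1]
bleu_score = [0.0, 0.2, 0.9]

print(cosine_similarity(attention, self_attention))  # close to 1: related phrases
print(cosine_similarity(attention, bleu_score))      # close to 0: unrelated phrases
```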

In [None]:
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma

# Create embeddings model
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Create vector store from documents
vectorstore = Chroma.from_documents(
    documents=splits,
    embedding=embeddings,
    # persist_directory="./chroma_db"  # Uncomment to persist to disk
)

print(f"Created vector store with {vectorstore._collection.count()} vectors")

## 1.4 Retrieval

A retriever finds the most relevant chunks for a query:

In [None]:
# Create a retriever from the vector store
retriever = vectorstore.as_retriever(
    search_kwargs={"k": 4}  # Return top 4 most relevant chunks
)

# Test retrieval
query = "What is the self-attention mechanism?"
relevant_docs = retriever.invoke(query)

print(f"Found {len(relevant_docs)} relevant chunks for: '{query}'")
print("\nTop result:")
print(relevant_docs[0].page_content[:400])
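
Conceptually, a top-`k` retriever scores every stored chunk against the query vector and keeps the best `k`. The brute-force sketch below assumes a tiny in-memory index with made-up vectors; a real vector store typically uses an approximate index rather than scanning every entry.

```python
import heapq
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Similarity score used to rank chunks against the query."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def top_k(query_vec: list[float], index: list[tuple[str, list[float]]], k: int):
    """Return the k highest-scoring (score, text) pairs, best first."""
    scored = ((cosine_similarity(query_vec, vec), text) for text, vec in index)
    return heapq.nlargest(k, scored)

# Hypothetical index: (chunk text, toy embedding).
toy_index = [
    ("chunk about self-attention", [0.9, 0.1, 0.0]),
    ("chunk about multi-head attention", [0.7, 0.3, 0.1]),
    ("chunk about training hardware", [0.1, 0.9, 0.2]),
    ("chunk about BLEU", [0.0, 0.2, 0.9]),
]

for score, text in top_k([1.0, 0.0, 0.0], toy_index, k=2):
    print(f"{score:.2f}  {text}")
```

Raising `k` returns more context at the cost of a longer prompt; lowering it keeps the prompt short but risks missing the relevant passage.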

## 1.5 Generation with RAG Chain

Now we combine retrieval with generation. The helper constructors `create_stuff_documents_chain` and `create_retrieval_chain` assemble chains built on LCEL (LangChain Expression Language):

In [None]:
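from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

# In LangChain 1.0 these chain helpers live in the langchain-classic package;
# fall back to the pre-1.0 location if it is not installed.
try:
    from langchain_classic.chains import create_retrieval_chain
    from langchain_classic.chains.combine_documents import create_stuff_documents_chain
except ImportError:
    from langchain.chains import create_retrieval_chain
    from langchain.chains.combine_documents import create_stuff_documents_chain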

# Initialize LLM
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# Create the RAG prompt
system_prompt = """You are an assistant for question-answering tasks.
Use the following pieces of retrieved context to answer the question.
If you don't know the answer, say that you don't know.
Use three sentences maximum and keep the answer concise.

Context:
{context}
"""

prompt = ChatPromptTemplate.from_messages([
    ("system", system_prompt),
    ("human", "{input}"),
])

# Create the RAG chain
question_answer_chain = create_stuff_documents_chain(llm, prompt)
rag_chain = create_retrieval_chain(retriever, question_answer_chain)
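
The "stuff" strategy simply concatenates every retrieved chunk into the prompt's `{context}` slot, and the retrieval chain returns a dict with `input`, `context`, and `answer` keys. A rough plain-Python picture of that data flow follows; `run_rag`, `fake_retrieve`, and `fake_answer` are hypothetical names for this sketch, not LangChain APIs.

```python
from typing import Callable

def stuff_documents(chunks: list[str], separator: str = "\n\n") -> str:
    """Concatenate every retrieved chunk into one context string."""
    return separator.join(chunks)

def run_rag(
    question: str,
    retrieve: Callable[[str], list[str]],
    answer_from: Callable[[str, str], str],
) -> dict:
    """Retrieve, stuff, answer, and return a dict shaped like the chain's result."""
    chunks = retrieve(question)
    context = stuff_documents(chunks)
    return {"input": question, "context": chunks, "answer": answer_from(context, question)}

# Hypothetical stand-ins for the retriever and the LLM.
def fake_retrieve(question: str) -> list[str]:
    return ["chunk about attention", "chunk about encoders"]

def fake_answer(context: str, question: str) -> str:
    return f"(answer to '{question}' using {len(context)} characters of context)"

result_sketch = run_rag("What is attention?", fake_retrieve, fake_answer)
print(result_sketch["answer"])
```

Because stuffing puts all chunks into a single prompt, it works best when `k` times the chunk size stays well within the model's context window.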

In [None]:
# Ask a question!
question = "What is the key innovation of the Transformer architecture?"

result = rag_chain.invoke({"input": question})

print(f"Question: {question}")
print(f"\nAnswer: {result['answer']}")

In [None]:
# View the retrieved context (PyPDFLoader's 'page' metadata is zero-based)
print("\nRetrieved context came from these sources:")
for i, doc in enumerate(result['context'], 1):
    print(f"  {i}. Page {doc.metadata.get('page', 'N/A')}")

In [None]:
# More questions
questions = [
    "What are positional encodings and why are they needed?",
    "How does multi-head attention work?",
    "What BLEU scores did the Transformer achieve?"
]

for q in questions:
    result = rag_chain.invoke({"input": q})
    print(f"Q: {q}")
    print(f"A: {result['answer']}")
    print()

---

# Part 2: Conversational RAG with Chat History

For multi-turn conversations, we need to:
1. Reformulate queries based on chat history
2. Maintain context across turns

`create_history_aware_retriever` handles the query reformulation:

In [None]:
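try:  # LangChain 1.0: legacy chain helpers live in langchain-classic
    from langchain_classic.chains import create_history_aware_retriever
except ImportError:  # pre-1.0 location
    from langchain.chains import create_history_aware_retriever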
from langchain_core.prompts import MessagesPlaceholder
from langchain_core.messages import HumanMessage, AIMessage

# Prompt for reformulating queries based on chat history
contextualize_q_prompt = ChatPromptTemplate.from_messages([
    ("system", 
     """Given a chat history and the latest user question 
     which might reference context in the chat history, 
     formulate a standalone question which can be understood 
     without the chat history. Do NOT answer the question, 
     just reformulate it if needed and otherwise return it as is."""),
    MessagesPlaceholder("chat_history"),
    ("human", "{input}"),
])

# Create history-aware retriever
history_aware_retriever = create_history_aware_retriever(
    llm, retriever, contextualize_q_prompt
)
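
The control flow of a history-aware retriever can be sketched in a few lines: with no history the question goes straight to the retriever, otherwise an LLM first rewrites it into a standalone question. Everything below is a stand-in for illustration; `fake_rewrite` hard-codes what an LLM would infer from the history.

```python
from typing import Callable

def history_aware_retrieve(
    question: str,
    chat_history: list,
    rewrite: Callable[[list, str], str],
    retrieve: Callable[[str], list[str]],
) -> list[str]:
    """Rewrite the question only when there is history to resolve references against."""
    if not chat_history:
        return retrieve(question)
    standalone_question = rewrite(chat_history, question)
    return retrieve(standalone_question)

# Hypothetical stand-ins for the LLM rewrite step and the retriever.
def fake_rewrite(chat_history: list, question: str) -> str:
    return question.replace(" it ", " self-attention ")

def fake_retrieve(query: str) -> list[str]:
    return [f"docs for: {query}"]

follow_up = "How does it differ from traditional attention?"
history = [("human", "What is self-attention?"), ("ai", "(earlier answer)")]

print(history_aware_retrieve(follow_up, [], fake_rewrite, fake_retrieve))
print(history_aware_retrieve(follow_up, history, fake_rewrite, fake_retrieve))
```

The second call searches for "self-attention" rather than the ambiguous "it", which is what makes follow-up questions retrievable.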

In [None]:
# QA prompt that includes chat history
qa_prompt = ChatPromptTemplate.from_messages([
    ("system", system_prompt),
    MessagesPlaceholder("chat_history"),
    ("human", "{input}"),
])

# Create the conversational RAG chain
question_answer_chain = create_stuff_documents_chain(llm, qa_prompt)
conversational_rag = create_retrieval_chain(
    history_aware_retriever, 
    question_answer_chain
)

In [None]:
# Multi-turn conversation
chat_history = []

# Turn 1
question1 = "What is self-attention?"
response1 = conversational_rag.invoke({
    "input": question1,
    "chat_history": chat_history
})
print(f"User: {question1}")
print(f"Assistant: {response1['answer']}")

# Update history
chat_history.extend([
    HumanMessage(content=question1),
    AIMessage(content=response1['answer']),
])

In [None]:
# Turn 2 - references previous context ("it")
question2 = "How does it differ from traditional attention?"
response2 = conversational_rag.invoke({
    "input": question2,
    "chat_history": chat_history
})
print(f"\nUser: {question2}")
print(f"Assistant: {response2['answer']}")

chat_history.extend([
    HumanMessage(content=question2),
    AIMessage(content=response2['answer']),
])

In [None]:
# Turn 3 - more follow-up
question3 = "What are the computational advantages?"
response3 = conversational_rag.invoke({
    "input": question3,
    "chat_history": chat_history
})
print(f"\nUser: {question3}")
print(f"Assistant: {response3['answer']}")
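
The history list grows by two messages per turn and is sent to the model on every call. One simple way to bound it is to keep only the most recent turns. The helper below is a hypothetical sketch that uses plain tuples in place of `HumanMessage` and `AIMessage` and assumes the list strictly alternates user and assistant messages.

```python
def trim_history(chat_history: list, max_turns: int) -> list:
    """Return only the messages from the most recent `max_turns` turns."""
    if max_turns <= 0:
        return []
    # Each turn contributes two messages: one from the user, one from the assistant.
    return chat_history[-2 * max_turns:]

history = [
    ("human", "What is self-attention?"),
    ("ai", "(answer 1)"),
    ("human", "How does it differ from traditional attention?"),
    ("ai", "(answer 2)"),
    ("human", "What are the computational advantages?"),
    ("ai", "(answer 3)"),
]

recent = trim_history(history, max_turns=2)
print(len(recent))
print(recent[0])
```

Trimming trades memory for cost: if the dropped turn contained the thing a later "it" refers to, query reformulation can no longer resolve it.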

---

# Part 3: Agentic RAG with LangGraph

Agentic RAG gives the LLM control over when to retrieve. The agent can:
- Decide whether retrieval is needed
- Reformulate queries if results are poor
- Handle complex multi-step reasoning

Of the patterns in this notebook, this is the most flexible, at the cost of extra LLM calls and less predictable behavior.

In [None]:
from langgraph.graph import StateGraph, START, END, MessagesState
from langgraph.prebuilt import ToolNode, tools_condition
from langchain_core.tools import tool

# Create a retrieval tool
@tool
def retrieve_transformer_docs(query: str) -> str:
    """Retrieve documents about the Transformer architecture.
    Use this when you need information from the 'Attention Is All You Need' paper.
    """
    docs = retriever.invoke(query)
    return "\n\n".join([doc.page_content for doc in docs])

tools = [retrieve_transformer_docs]

In [None]:
# Create the LLM with tools bound
llm_with_tools = ChatOpenAI(model="gpt-4o-mini", temperature=0).bind_tools(tools)

# Define the agent node
def call_model(state: MessagesState):
    """Call the model with current messages."""
    system_message = {
        "role": "system",
        "content": """You are a helpful assistant that answers questions about the Transformer architecture.
        Use the retrieve_transformer_docs tool when you need specific information from the paper.
        If you can answer from general knowledge, you may do so, but cite the paper when relevant."""
    }
    messages = [system_message] + state["messages"]
    response = llm_with_tools.invoke(messages)
    return {"messages": [response]}

In [None]:
# Build the graph
workflow = StateGraph(MessagesState)

# Add nodes
workflow.add_node("agent", call_model)
workflow.add_node("tools", ToolNode(tools))

# Add edges
workflow.add_edge(START, "agent")
workflow.add_conditional_edges(
    "agent",
    tools_condition,  # Routes to 'tools' if tool calls exist, else END
)
workflow.add_edge("tools", "agent")

# Compile
agentic_rag = workflow.compile()
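
Stripped of the framework, the agent graph is a loop: call the model, run any tools it requested, feed the results back, and stop once the model replies without tool calls. The sketch below mimics that loop with plain dicts. `fake_model`, `fake_retrieve`, and the `max_steps` guard are assumptions made for this example, not LangGraph APIs.

```python
def run_agent(question: str, model, tools: dict, max_steps: int = 5) -> list[dict]:
    """Loop between the model and its tools until the model stops requesting tools."""
    messages = [{"role": "user", "content": question}]
    for _ in range(max_steps):
        reply = model(messages)
        messages.append(reply)
        tool_calls = reply.get("tool_calls", [])
        if not tool_calls:  # no tool requested: the reply is the final answer
            return messages
        for call in tool_calls:  # run each requested tool and record its output
            output = tools[call["name"]](**call["args"])
            messages.append({"role": "tool", "name": call["name"], "content": output})
    return messages

# Hypothetical model: requests retrieval once, then answers from the tool output.
def fake_model(messages: list[dict]) -> dict:
    tool_outputs = [m for m in messages if m["role"] == "tool"]
    if not tool_outputs:
        call = {"name": "retrieve", "args": {"query": messages[0]["content"]}}
        return {"role": "assistant", "content": "", "tool_calls": [call]}
    return {"role": "assistant", "content": f"Answer based on: {tool_outputs[-1]['content']}"}

def fake_retrieve(query: str) -> str:
    return f"(retrieved text for '{query}')"

transcript = run_agent("What is attention?", fake_model, {"retrieve": fake_retrieve})
for message in transcript:
    print(message["role"], "->", message["content"])
```

In the graph above, the `agent` node plays the role of `model(messages)`, the `tools` node runs the requested calls, and the conditional edge decides between the two branches.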

In [None]:
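# Visualize the graph (PNG rendering calls a remote Mermaid service by default)
from IPython.display import Image, display

try:
    display(Image(agentic_rag.get_graph().draw_mermaid_png()))
except Exception:  # e.g. offline: show the Mermaid source instead
    print(agentic_rag.get_graph().draw_mermaid())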

In [None]:
# Test the agentic RAG
response = agentic_rag.invoke({
    "messages": [("user", "What is the scaled dot-product attention formula?")]
})

# Show the conversation
for msg in response["messages"]:
    msg.pretty_print()

In [None]:
# The agent might NOT use retrieval for general questions
response = agentic_rag.invoke({
    "messages": [("user", "What does BLEU stand for?")]
})

for msg in response["messages"]:
    msg.pretty_print()

In [None]:
# Complex question requiring retrieval
response = agentic_rag.invoke({
    "messages": [("user", 
        "Compare the Transformer's training time to previous models mentioned in the paper."
    )]
})

for msg in response["messages"]:
    msg.pretty_print()

---

# Part 4: Structured Output RAG

Combine RAG with structured output to extract specific data types from documents:

In [None]:
from pydantic import BaseModel, Field
from typing import List, Optional
from langchain_core.runnables import RunnablePassthrough

class PaperSummary(BaseModel):
    """Structured summary of a research paper."""
    title: str = Field(description="Title of the paper")
    main_contribution: str = Field(description="The key innovation or contribution")
    key_results: List[str] = Field(description="List of important results/findings")
    limitations: Optional[List[str]] = Field(description="Any limitations mentioned")
    future_work: Optional[str] = Field(description="Suggested future directions")

# Create structured output LLM
structured_llm = ChatOpenAI(model="gpt-4o-mini", temperature=0).with_structured_output(PaperSummary)
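
Conceptually, structured output means the model's reply is parsed into a typed object and rejected if required fields are missing. The sketch below illustrates that idea with the standard library only: it swaps Pydantic for a dataclass, uses a hand-written JSON string as a hypothetical model reply, and does not reflect how `with_structured_output` is implemented internally.

```python
import json
from dataclasses import dataclass, fields

@dataclass
class PaperSummarySketch:
    """Reduced, standard-library stand-in for the PaperSummary model."""
    title: str
    main_contribution: str
    key_results: list

def parse_summary(raw_json: str) -> PaperSummarySketch:
    """Parse a JSON reply and fail loudly if any expected field is absent."""
    data = json.loads(raw_json)
    expected = {f.name for f in fields(PaperSummarySketch)}
    missing = expected - set(data)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return PaperSummarySketch(**{name: data[name] for name in expected})

# Hypothetical model reply, written by hand for this example.
raw_reply = json.dumps({
    "title": "Attention Is All You Need",
    "main_contribution": "(example contribution text)",
    "key_results": ["(example result 1)", "(example result 2)"],
})

summary_sketch = parse_summary(raw_reply)
print(summary_sketch.title)
print(len(summary_sketch.key_results))
```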

In [None]:
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

# Structured RAG chain
structured_rag_prompt = ChatPromptTemplate.from_messages([
    ("system", """Extract a structured summary from the following research paper excerpts.
    Focus on the main contributions, key results, and any limitations.
    
    Context:
    {context}"""),
    ("human", "{question}")
])

structured_rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | structured_rag_prompt
    | structured_llm
)

In [None]:
# Get structured summary
summary = structured_rag_chain.invoke("Summarize this paper's contributions and results")

print(f"Title: {summary.title}")
print(f"\nMain Contribution: {summary.main_contribution}")
print(f"\nKey Results:")
for result in summary.key_results:
    print(f"  - {result}")
if summary.limitations:
    print(f"\nLimitations:")
    for limit in summary.limitations:
        print(f"  - {limit}")
if summary.future_work:
    print(f"\nFuture Work: {summary.future_work}")

## 4.1 Quiz Generation from Documents

A practical use case is generating quiz questions from the paper:

In [None]:
class QuizQuestion(BaseModel):
    question: str = Field(description="The quiz question")
    options: List[str] = Field(description="Exactly 4 multiple choice options")
    correct_answer: str = Field(description="The correct option, copied verbatim from options")
    explanation: str = Field(description="Brief explanation of the answer")

class Quiz(BaseModel):
    topic: str = Field(description="Topic of the quiz")
    questions: List[QuizQuestion] = Field(description="List of quiz questions")

quiz_llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.3).with_structured_output(Quiz)

In [None]:
quiz_prompt = ChatPromptTemplate.from_messages([
    ("system", """Generate a quiz based on the following research paper content.
    Create challenging but fair questions that test understanding of key concepts.
    
    Context:
    {context}"""),
    ("human", "Create {num_questions} quiz questions about {topic}")
])

quiz_chain = (
    {"context": (lambda x: x["topic"]) | retriever | format_docs, 
     "topic": lambda x: x["topic"],
     "num_questions": lambda x: x["num_questions"]}
    | quiz_prompt
    | quiz_llm
)

In [None]:
# Generate a quiz
quiz = quiz_chain.invoke({
    "topic": "self-attention mechanism",
    "num_questions": 3
})

print(f"Quiz: {quiz.topic}")
print("=" * 50)

for i, q in enumerate(quiz.questions, 1):
    print(f"\nQ{i}: {q.question}")
    for j, opt in enumerate(q.options):
        print(f"   {chr(65+j)}) {opt}")
    print(f"\n   Correct: {q.correct_answer}")
    print(f"   Explanation: {q.explanation}")
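
A schema guarantees field types, not content, so it can be worth checking generated quizzes before using them. The helper below is a hypothetical sketch that works on plain dicts and flags two problems: a question without exactly four options, and a `correct_answer` that is not one of the options (for example a bare letter such as "B").

```python
def find_quiz_problems(questions: list[dict]) -> list[str]:
    """Return human-readable descriptions of structural problems in a quiz."""
    problems = []
    for i, q in enumerate(questions, 1):
        if len(q["options"]) != 4:
            problems.append(f"Q{i}: expected 4 options, got {len(q['options'])}")
        if q["correct_answer"] not in q["options"]:
            problems.append(f"Q{i}: correct_answer is not one of the options")
    return problems

# Hypothetical quiz data: the first question is fine, the second has two problems.
sample_questions = [
    {"question": "(question 1)", "options": ["a", "b", "c", "d"], "correct_answer": "b"},
    {"question": "(question 2)", "options": ["a", "b", "c"], "correct_answer": "B"},
]

for problem in find_quiz_problems(sample_questions):
    print(problem)
```

With the Pydantic objects from the cell above, one way to call it would be `find_quiz_problems([q.model_dump() for q in quiz.questions])`.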

---

## Summary

In this notebook, we covered modern RAG patterns:

1. **Basic RAG Pipeline**:
   - Document loading with `PyPDFLoader`
   - Text splitting with `RecursiveCharacterTextSplitter`
   - Embeddings with `OpenAIEmbeddings`
   - Vector storage with `Chroma`
   - Retrieval chain with `create_retrieval_chain`

2. **Conversational RAG**:
   - `create_history_aware_retriever` for query reformulation
   - `MessagesPlaceholder` for chat history
   - Multi-turn conversations with context

3. **Agentic RAG with LangGraph**:
   - `StateGraph` for workflow definition
   - Tools for retrieval (`@tool` decorator)
   - Dynamic decision-making (retrieve or not)

4. **Structured Output RAG**:
   - Pydantic models for typed responses
   - Quiz generation from documents
   - Paper summarization with structured output

---

## Next Steps

- Try loading your own documents (CSV, HTML, etc.)
- Experiment with different embedding models
- Add evaluation with RAGAS
- Deploy with LangServe