# Advanced Agentic RAG

**Professional Document Processing with AI Agents**

---

Welcome to the advanced world of **Retrieval-Augmented Generation**! This notebook builds on the basics to show you how to build a RAG system that handles real-world documents: PDF, Word, plain-text, and Markdown files. By the end of this 10-minute tutorial, you'll have a working document processing pipeline backed by a persistent vector store.

### 🎯 What You'll Learn

In this advanced tutorial, you will:
- Process multiple document formats (PDF, DOCX, TXT, MD)
- Implement smart text chunking strategies
- Use ChromaDB for persistent storage
- Add metadata filtering and advanced search
- Build a document analysis agent
- Optimize performance for production use

### 📋 Prerequisites

This tutorial assumes you've completed the "Agentic RAG Basics" notebook and understand:
- Basic RAG concepts
- Vector embeddings
- Tool creation for agents

## 📦 Step 1: Installing Advanced Dependencies

### Additional Packages
We'll need extra libraries for document processing and persistent storage.

### 📚 New Dependencies
- **chromadb**: Persistent vector database with metadata filtering
- **PyPDF2**: PDF text extraction (the project now continues as `pypdf`; PyPDF2 3.x is sufficient for this tutorial)
- **python-docx**: Word document processing
- **tiktoken**: Advanced text tokenization for chunking

In [None]:
# Install required packages
%pip install chromadb PyPDF2 python-docx tiktoken -q

# Also ensure we have the basics from the previous notebook
%pip install sentence-transformers strands-agents reportlab -q

print("✅ All advanced packages installed successfully!")
print("   Ready for professional document processing! 🚀")
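Some of these packages are imported under a different name than the one passed to `%pip install` (for example, `python-docx` is imported as `docx`). The following is a minimal sketch for checking that each module can be located without importing it. The helper name `check_imports` and the `REQUIRED` mapping are hypothetical and not part of any library.

```python
from importlib.util import find_spec
from typing import Dict

# Assumed mapping of pip package name -> top-level import name.
REQUIRED: Dict[str, str] = {
    "chromadb": "chromadb",
    "PyPDF2": "PyPDF2",
    "python-docx": "docx",
    "tiktoken": "tiktoken",
    "sentence-transformers": "sentence_transformers",
    "strands-agents": "strands",
}


def check_imports(required: Dict[str, str]) -> Dict[str, bool]:
    """Return {package name: True if its import name can be located}."""
    return {pkg: find_spec(module) is not None for pkg, module in required.items()}


# Report the status of every dependency used in this notebook.
for package, available in check_imports(REQUIRED).items():
    print(f"{'✅' if available else '❌'} {package}")
```

If a package shows ❌, re-run the install cell and restart the kernel before continuing.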

## 🔧 Step 2: Setting Up AWS Bedrock & ChromaDB

### Hybrid Architecture
We'll combine:
- **AWS Bedrock**: For powerful Claude LLM
- **ChromaDB**: For local, persistent vector storage

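This keeps the embeddings and the vector index on your machine while Claude runs on Bedrock. Note that the text of your questions, and of any chunks the agent retrieves, is still sent to Bedrock at query time.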

In [None]:
import boto3
from strands import Agent, tool
from strands.models import BedrockModel
import chromadb
from chromadb.utils import embedding_functions
from pathlib import Path

# Setup AWS Bedrock
session = boto3.Session(profile_name='default')
try:
    bedrock_model = BedrockModel(
        model_id="us.anthropic.claude-3-7-sonnet-20250219-v1:0",
        boto_session=session
    )
    print("✅ AWS Bedrock configured successfully!")
except Exception as e:
    print(f"❌ Error: {e}")

# Setup ChromaDB
chroma_client = chromadb.PersistentClient(path="./chroma_db")
embedding_function = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)

# Create the collection, or reuse it if it already exists
collection_name = "strands_documents"
collection = chroma_client.get_or_create_collection(
    name=collection_name,
    embedding_function=embedding_function
)
print(f"✅ Collection '{collection_name}' ready with {collection.count()} chunks")

## 📄 Step 3: Advanced Document Processing

### Multi-Format Document Handler
Let's create a processor that handles PDFs, Word docs, and text files.

In [None]:
import PyPDF2
from docx import Document as DocxDocument
from typing import Tuple, Dict

class AdvancedDocumentProcessor:
    """Process various document types."""
    
    @staticmethod
    def extract_text_from_pdf(pdf_path: str) -> Tuple[str, Dict]:
        """Extract text from PDF."""
        text = ""
        metadata = {"source_type": "pdf", "page_count": 0}
        
        try:
            with open(pdf_path, 'rb') as file:
                pdf_reader = PyPDF2.PdfReader(file)
                metadata["page_count"] = len(pdf_reader.pages)
                
                for page_num, page in enumerate(pdf_reader.pages):
                    page_text = page.extract_text()
                    if page_text:
                        text += f"\n\n[Page {page_num + 1}]\n{page_text}"
        except Exception as e:
            print(f"Error processing PDF: {e}")
            
        return text.strip(), metadata
    
    @staticmethod
    def extract_text_from_docx(docx_path: str) -> Tuple[str, Dict]:
        """Extract text from Word document."""
        text = ""
        metadata = {"source_type": "docx", "paragraph_count": 0}
        
        try:
            doc = DocxDocument(docx_path)
            paragraphs = [para.text for para in doc.paragraphs if para.text.strip()]
            text = "\n\n".join(paragraphs)
            metadata["paragraph_count"] = len(paragraphs)
        except Exception as e:
            print(f"Error processing DOCX: {e}")
            
        return text.strip(), metadata
    
    @staticmethod
    def extract_text_from_file(file_path: str) -> Tuple[str, Dict]:
        """Extract text from any supported file."""
        path = Path(file_path)
        suffix = path.suffix.lower()
        
        if suffix == '.pdf':
            return AdvancedDocumentProcessor.extract_text_from_pdf(file_path)
        elif suffix == '.docx':  # python-docx cannot open legacy .doc files
            return AdvancedDocumentProcessor.extract_text_from_docx(file_path)
        elif suffix in ['.txt', '.md']:
            text = path.read_text(encoding='utf-8')
            return text, {"source_type": "text", "char_count": len(text)}
        else:
            return "", {"error": f"Unsupported file type: {suffix}"}

processor = AdvancedDocumentProcessor()
print("📄 Document processor ready!")
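The suffix dispatch in `extract_text_from_file` can also be written as a lookup table, so that supporting a new format means adding one entry. This is a minimal, standard-library-only sketch of that idea. The names `EXTRACTORS`, `read_plain_text`, and `extract` are hypothetical and are not part of the class above.

```python
import tempfile
from pathlib import Path
from typing import Callable, Dict, Tuple

Extraction = Tuple[str, Dict]


def read_plain_text(path: Path) -> Extraction:
    """Read a UTF-8 text or Markdown file and describe it."""
    text = path.read_text(encoding="utf-8")
    return text, {"source_type": "text", "char_count": len(text)}


# Hypothetical registry: lowercase suffix -> extractor function.
EXTRACTORS: Dict[str, Callable[[Path], Extraction]] = {
    ".txt": read_plain_text,
    ".md": read_plain_text,
}


def extract(file_path: str) -> Extraction:
    """Look up the extractor for the file's suffix and run it."""
    path = Path(file_path)
    handler = EXTRACTORS.get(path.suffix.lower())
    if handler is None:
        return "", {"error": f"Unsupported file type: {path.suffix.lower()}"}
    return handler(path)


# Demo: the suffix match is case-insensitive, so "notes.MD" is handled.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "notes.MD"
    sample.write_text("hello", encoding="utf-8")
    demo_text, demo_meta = extract(str(sample))
```

PDF and DOCX handlers with the same `(text, metadata)` return shape could be registered under `".pdf"` and `".docx"` in the same way.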

## ✂️ Step 4: Smart Text Chunking

### Intelligent Document Splitting
Large documents need to be chunked intelligently for optimal retrieval.

In [None]:
import tiktoken
import re
from typing import Any, Dict, List

class SmartTextChunker:
    """Intelligent text chunking."""
    
    def __init__(self, chunk_size: int = 400, chunk_overlap: int = 50):
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        self.encoding = tiktoken.get_encoding("cl100k_base")
    
    def split_text_into_chunks(self, text: str) -> List[Dict[str, Any]]:
        """Split text into overlapping chunks."""
        paragraphs = text.split('\n\n')
        chunks = []
        current_chunk = ""
        chunk_index = 0
        
        for para in paragraphs:
            para_tokens = len(self.encoding.encode(para))
            current_tokens = len(self.encoding.encode(current_chunk))
            
            if current_tokens + para_tokens > self.chunk_size and current_chunk:
                # Save current chunk
                chunks.append({
                    "text": current_chunk.strip(),
                    "chunk_index": chunk_index,
                    "token_count": current_tokens
                })
                chunk_index += 1
                
                # Start new chunk with overlap (measured in words, not tokens)
                words = current_chunk.split()[-self.chunk_overlap:] if self.chunk_overlap > 0 else []
                current_chunk = " ".join(words) + "\n\n" + para
            else:
                current_chunk += ("\n\n" if current_chunk else "") + para
        
        # Add last chunk
        if current_chunk:
            chunks.append({
                "text": current_chunk.strip(),
                "chunk_index": chunk_index,
                "token_count": len(self.encoding.encode(current_chunk))
            })
        
        return chunks

chunker = SmartTextChunker()
print("✂️ Smart chunker ready!")
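To see the packing behaviour on a small input, here is a minimal sketch of the same greedy strategy using only the standard library. It assumes whitespace-separated words as a stand-in for tiktoken tokens, and `pack_paragraphs` is a hypothetical name. As in the class above, a single paragraph longer than the limit is kept whole rather than split.

```python
from typing import List


def pack_paragraphs(text: str, chunk_size: int = 8, overlap: int = 2) -> List[str]:
    """Greedily pack paragraphs into chunks of roughly chunk_size words,
    carrying the last `overlap` words of each chunk into the next one."""
    chunks: List[str] = []
    current = ""
    for para in text.split("\n\n"):
        too_big = len(current.split()) + len(para.split()) > chunk_size
        if current and too_big:
            # Close the current chunk and seed the next one with its tail.
            chunks.append(current.strip())
            tail = current.split()[-overlap:] if overlap > 0 else []
            current = " ".join(tail) + "\n\n" + para
        else:
            current += ("\n\n" if current else "") + para
    if current.strip():
        chunks.append(current.strip())
    return chunks


sample = "one two three four\n\nfive six seven\n\neight nine ten eleven twelve"
demo_chunks = pack_paragraphs(sample)
for i, chunk in enumerate(demo_chunks):
    print(f"--- chunk {i} ---\n{chunk}")
```

The first two paragraphs fit together (7 words), so they form one chunk. The third paragraph would exceed the limit, so it starts a new chunk prefixed with the overlap words `six seven`.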

## 📝 Step 5: Document Generation

### Creating Test Documents
Let's generate sample documents for our RAG system.

In [None]:
import sys
sys.path.append('../src')

try:
    from rag_document_generator import create_advanced_documents
    print("📝 Creating advanced documents...")
    advanced_files = create_advanced_documents()
    print(f"✅ Created {len(advanced_files)} documents")
except ImportError:
    print("⚠️  Document generator not found. Creating simple test files...")
    
    # Create simple test documents
    import os
    os.makedirs('../rag_docs', exist_ok=True)
    
    with open('../rag_docs/test_document.txt', 'w', encoding='utf-8') as f:
        f.write("This is a test document for the RAG system.\n\n"
                "It contains information about Strands agents and RAG.")
    
    print("✅ Created test document")

## 🔄 Step 6: Document Ingestion Pipeline

### Complete Processing Pipeline
Now let's create a pipeline that processes documents and adds them to ChromaDB.

In [None]:
from datetime import datetime
import hashlib

class DocumentIngestionPipeline:
    """Pipeline for document ingestion."""
    
    def __init__(self, collection, processor, chunker):
        self.collection = collection
        self.processor = processor
        self.chunker = chunker
    
    def ingest_document(self, file_path: str, metadata: Dict = None) -> Dict:
        """Process and ingest a document."""
        print(f"\n📄 Processing: {Path(file_path).name}")
        
        # Extract text
        text, doc_metadata = self.processor.extract_text_from_file(file_path)
        
        if not text or "error" in doc_metadata:
            print(f"   ❌ Error: {doc_metadata.get('error', 'Unknown')}")
            return {"status": "error"}
        
        # Add metadata
        doc_metadata["filename"] = Path(file_path).name
        doc_metadata["ingestion_date"] = datetime.now().isoformat()
        if metadata:
            doc_metadata.update(metadata)
        
        # Chunk text
        chunks = self.chunker.split_text_into_chunks(text)
        print(f"   📊 Created {len(chunks)} chunks")
        
        # Add to ChromaDB
        chunk_ids = []
        documents = []
        metadatas = []
        
        for i, chunk in enumerate(chunks):
            chunk_id = hashlib.md5(f"{file_path}_{i}".encode()).hexdigest()
            chunk_ids.append(chunk_id)
            documents.append(chunk["text"])
            
            chunk_metadata = doc_metadata.copy()
            chunk_metadata.update({
                "chunk_index": chunk["chunk_index"],
                "chunk_total": len(chunks)
            })
            metadatas.append(chunk_metadata)
        
        try:
            # upsert (rather than add) so re-running ingestion overwrites chunks that share an ID
            self.collection.upsert(
                ids=chunk_ids,
                documents=documents,
                metadatas=metadatas
            )
            print("   ✅ Successfully ingested")
            return {"status": "success", "chunks": len(chunks)}
        except Exception as e:
            print(f"   ❌ Error: {e}")
            return {"status": "error"}

pipeline = DocumentIngestionPipeline(collection, processor, chunker)
print("🔄 Pipeline ready!")
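The pipeline derives each chunk ID from the file path and the chunk position, so ingesting the same file again produces the same IDs. The sketch below illustrates that property with two deliberate changes from the notebook: it normalizes the path first, and it uses SHA-256 instead of MD5. The function name `stable_chunk_id` is hypothetical.

```python
import hashlib
import os


def stable_chunk_id(file_path: str, chunk_index: int) -> str:
    """Derive a repeatable ID from the normalized path and the chunk position."""
    normalized = os.path.normpath(file_path)
    key = f"{normalized}_{chunk_index}".encode("utf-8")
    return hashlib.sha256(key).hexdigest()


id_a = stable_chunk_id("docs/guide.txt", 0)
id_b = stable_chunk_id("docs/../docs/guide.txt", 0)  # same file, different spelling
id_c = stable_chunk_id("docs/guide.txt", 1)          # same file, next chunk

print(id_a == id_b)  # the two spellings of the path map to one ID
print(id_a == id_c)  # different chunk positions get different IDs
```

Because the ID depends only on path and position, a document that shrinks between runs can leave stale higher-numbered chunks in the store. Deleting a file's existing chunks before re-ingesting it is one way to avoid that.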

## 📥 Step 7: Ingesting Documents

### Processing Our Document Collection
Let's ingest all documents from the rag_docs directory.

In [None]:
from pathlib import Path

print("🚀 Starting document ingestion...")
print("=" * 60)

rag_docs_path = Path('../rag_docs')
if not rag_docs_path.exists():
    print("❌ rag_docs directory not found!")
else:
    # Process all documents
    document_types = [
        ("*.pdf", {"category": "reference"}),
        ("*.docx", {"category": "patterns"}),
        ("*.txt", {"category": "guides"}),
        ("*.md", {"category": "documentation"})
    ]
    
    total_processed = 0
    
    for pattern, metadata in document_types:
        files = list(rag_docs_path.rglob(pattern))
        if files:
            print(f"\n📂 Processing {len(files)} {pattern} files...")
            for file in files:
                result = pipeline.ingest_document(str(file), metadata)
                if result["status"] == "success":
                    total_processed += 1
    
    print(f"\n✅ Total documents processed: {total_processed}")
    print(f"📊 Total chunks in database: {collection.count()}")
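The four `rglob` passes can also be collapsed into a single directory walk that looks up each file's category by suffix. This is a minimal standard-library sketch under that assumption. `CATEGORY_BY_SUFFIX` and `plan_ingestion` are hypothetical names, and the demo runs against a temporary directory rather than `../rag_docs`.

```python
import tempfile
from pathlib import Path
from typing import Dict, List, Tuple

# Same suffix -> category mapping as the ingestion cell, keyed for lookup.
CATEGORY_BY_SUFFIX: Dict[str, str] = {
    ".pdf": "reference",
    ".docx": "patterns",
    ".txt": "guides",
    ".md": "documentation",
}


def plan_ingestion(root: Path) -> List[Tuple[Path, Dict[str, str]]]:
    """Walk root once and pair each supported file with its category metadata."""
    plan = []
    for path in sorted(root.rglob("*")):
        category = CATEGORY_BY_SUFFIX.get(path.suffix.lower())
        if path.is_file() and category:
            plan.append((path, {"category": category}))
    return plan


# Demo on a throwaway directory tree; unsupported files are skipped.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    (root / "guide.txt").write_text("g", encoding="utf-8")
    (root / "sub" / "README.md").write_text("r", encoding="utf-8")
    (root / "image.png").write_bytes(b"")
    demo_plan = [(p.relative_to(root).as_posix(), meta) for p, meta in plan_ingestion(root)]

print(demo_plan)
```

Each `(path, metadata)` pair could then be passed to `pipeline.ingest_document(str(path), metadata)`.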

## 🔧 Step 8: Advanced RAG Tools

### Creating Sophisticated Search Tools
Let's build tools with metadata filtering and advanced search capabilities.

In [None]:
@tool
def search_documents(query: str, category: str = None, limit: int = 5) -> str:
    """Search documents with optional category filtering.
    
    Args:
        query: Search query
        category: Optional category filter (reference, patterns, guides, documentation)
        limit: Number of results to return
    """
    # Build where clause for filtering
    where_clause = {"category": category} if category else None
    
    # Search in ChromaDB
    results = collection.query(
        query_texts=[query],
        n_results=limit,
        where=where_clause
    )
    
    if not results['documents'][0]:
        return "No relevant documents found."
    
    # Format results
    formatted_results = []
    for i, (doc, metadata) in enumerate(zip(results['documents'][0], results['metadatas'][0])):
        formatted_results.append(
            f"[Result {i+1}]\n"
            f"Source: {metadata.get('filename', 'Unknown')}\n"
            f"Category: {metadata.get('category', 'Unknown')}\n"
            f"Chunk: {metadata['chunk_index'] + 1 if 'chunk_index' in metadata else '?'}/{metadata.get('chunk_total', '?')}\n"
            f"Content: {doc[:200]}..."
        )
    
    return "\n\n".join(formatted_results)

@tool
def list_document_categories() -> str:
    """List all document categories in the knowledge base."""
    all_metadata = collection.get(include=["metadatas"])["metadatas"]  # skip loading document text
    categories = set()
    
    for metadata in all_metadata:
        if "category" in metadata:
            categories.add(metadata["category"])
    
    if categories:
        return "Available categories:\n" + "\n".join(f"- {cat}" for cat in sorted(categories))
    else:
        return "No categories found in knowledge base."

@tool
def get_document_stats() -> str:
    """Get statistics about the document knowledge base."""
    total_chunks = collection.count()
    all_metadata = collection.get(include=["metadatas"])["metadatas"]  # skip loading document text
    
    # Count unique documents
    unique_docs = set(m.get("filename", "Unknown") for m in all_metadata)
    
    # Count by category
    category_counts = {}
    for metadata in all_metadata:
        cat = metadata.get("category", "uncategorized")
        category_counts[cat] = category_counts.get(cat, 0) + 1
    
    stats = f"📊 Knowledge Base Statistics:\n"
    stats += f"Total chunks: {total_chunks}\n"
    stats += f"Unique documents: {len(unique_docs)}\n"
    stats += f"\nChunks by category:\n"
    for cat, count in sorted(category_counts.items()):
        stats += f"  - {cat}: {count}\n"
    
    return stats

print("🔧 Advanced RAG tools created!")
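`search_documents` filters on one metadata field. ChromaDB's `where` argument also accepts several conditions combined under an `$and` key, which needs at least two of them. The sketch below builds such a filter from optional arguments. `build_where` is a hypothetical helper, and `source_type` is the metadata key set by the document processor earlier in this notebook.

```python
from typing import Any, Dict, List, Optional


def build_where(category: Optional[str] = None,
                source_type: Optional[str] = None) -> Optional[Dict[str, Any]]:
    """Build a metadata filter dict, or None when no filter was requested."""
    conditions: List[Dict[str, Any]] = []
    if category:
        conditions.append({"category": category})
    if source_type:
        conditions.append({"source_type": source_type})

    if not conditions:
        return None               # no filtering at all
    if len(conditions) == 1:
        return conditions[0]      # a single condition needs no operator
    return {"$and": conditions}   # two or more conditions are combined


print(build_where())
print(build_where(category="guides"))
print(build_where(category="reference", source_type="pdf"))
```

The result can be passed straight to the query, for example `collection.query(query_texts=[query], n_results=limit, where=build_where(category, "pdf"))`.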

## 🤖 Step 9: Creating the Document Analysis Agent

### Professional Document Assistant
Let's create an agent that can analyze documents and answer complex questions.

In [None]:
# Create the document analysis agent
document_agent = Agent(
    model=bedrock_model,
    system_prompt="""You are a professional document analysis assistant with access to a comprehensive knowledge base.
    
    YOUR CAPABILITIES:
    1. Search documents by content and category
    2. Analyze document statistics
    3. Provide detailed summaries and insights
    4. Cross-reference information from multiple sources
    
    GUIDELINES:
    - Always search the knowledge base before answering
    - Use category filtering when relevant
    - Cite your sources with document names
    - If information is not found, acknowledge this clearly
    - Provide comprehensive, well-structured responses
    
    Remember: You have access to reference documents, design patterns, guides, and documentation.""",
    tools=[search_documents, list_document_categories, get_document_stats]
)

print("🤖 Document analysis agent ready!")
print("   Model: Claude 3.7 Sonnet")
print("   Knowledge Base: ChromaDB with advanced search")
print("   Tools: search_documents, list_document_categories, get_document_stats")

## 💬 Step 10: Testing the Advanced RAG System

### Professional Document Analysis
Let's test our agent with complex queries that require document analysis.

In [None]:
# Test 1: Get overview of knowledge base
print("📊 Test 1: Knowledge Base Overview")
print("=" * 50)
response = document_agent("What types of documents do you have access to? Give me an overview.")
print(f"🤖 Agent: {response}")
print("\n" + "=" * 50 + "\n")

In [None]:
# Test 2: Category-specific search
print("🔍 Test 2: Category-Specific Search")
print("=" * 50)
response = document_agent("Search for information about agent design patterns in the patterns category.")
print(f"🤖 Agent: {response}")
print("\n" + "=" * 50 + "\n")

In [None]:
# Test 3: Cross-reference information
print("🔗 Test 3: Cross-Reference Analysis")
print("=" * 50)
response = document_agent(
    "Compare the information about Strands agents across different document types. "
    "What are the key themes that appear in multiple documents?"
)
print(f"🤖 Agent: {response}")
print("\n" + "=" * 50 + "\n")

## 🎉 Congratulations!

### 🏆 What You've Accomplished

In this advanced tutorial, you've:
- ✅ Built a multi-format document processor (PDF, DOCX, TXT, MD)
- ✅ Implemented smart text chunking with token awareness
- ✅ Set up ChromaDB for persistent vector storage
- ✅ Created a document ingestion pipeline
- ✅ Built advanced search tools with metadata filtering
- ✅ Developed a professional document analysis agent

### 🚀 What's Next?

Now that you've mastered advanced RAG, you can:
1. **Scale Your System**: Add more documents and document types
2. **Implement Caching**: Speed up frequent queries
3. **Add Re-ranking**: Improve search result quality
4. **Build Specialized Agents**: Create domain-specific assistants
5. **Deploy to Production**: Use cloud storage and scaling
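As a starting point for the caching idea, here is a minimal sketch that memoizes repeated queries with `functools.lru_cache`. It assumes search results only change when documents are ingested. `slow_search` is a hypothetical stand-in for the real vector search, and `CALLS` exists only to show how often it runs.

```python
from functools import lru_cache

CALLS = {"count": 0}


def slow_search(query: str, category: str = "", limit: int = 5) -> str:
    """Stand-in for an expensive vector search (hypothetical)."""
    CALLS["count"] += 1
    return f"results for {query!r} in {category or 'all'} (top {limit})"


@lru_cache(maxsize=256)
def cached_search(query: str, category: str = "", limit: int = 5) -> str:
    """Return a cached result when the same arguments were seen before."""
    return slow_search(query, category, limit)


first = cached_search("agent patterns", "patterns")
second = cached_search("agent patterns", "patterns")  # served from the cache

print(CALLS["count"])
print(cached_search.cache_info())
```

Call `cached_search.cache_clear()` after ingesting new documents so that stale results are not returned.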

### 💡 Key Takeaways

1. **ChromaDB**: Provides persistence and metadata filtering
2. **Smart Chunking**: Packs whole paragraphs into token-bounded chunks with overlap
3. **Metadata**: Enables powerful filtering and organization
4. **Pipeline Architecture**: Makes scaling easy

### 📚 Resources

- [ChromaDB Documentation](https://docs.trychroma.com/)
- [Tiktoken Documentation](https://github.com/openai/tiktoken)
- [PyPDF2 Documentation](https://pypdf2.readthedocs.io/)

### 🌟 Challenge Yourself

Try enhancing your RAG system by:
- Adding support for Excel files and CSVs
- Implementing semantic chunking based on headings
- Creating a web interface for document upload
- Building specialized agents for different document categories

Happy building with advanced RAG! 🚀🤖✨