# Task 14.4: Document AI Pipeline

**Module:** 14 - Multimodal AI  
**Time:** 2 hours  
**Difficulty:** ⭐⭐⭐⭐

---

## Learning Objectives

By the end of this notebook, you will:
- [ ] Understand the Document AI pipeline architecture
- [ ] Extract text from PDFs using OCR
- [ ] Process document layouts (tables, figures, text blocks)
- [ ] Use VLMs for document understanding
- [ ] Build a complete Document Q&A system

---

## Prerequisites

- Completed: Tasks 14.1-14.3
- Knowledge of: VLMs, RAG fundamentals
- Running in: NGC PyTorch container

---

## Real-World Context

Document AI transforms how businesses handle paperwork:

**Industry Applications:**
- **Legal**: Extract clauses from contracts, compare documents
- **Finance**: Parse invoices, receipts, and financial statements
- **Healthcare**: Process medical records and lab reports
- **Insurance**: Automate claims processing from forms
- **Research**: Extract data from scientific papers

**Why DGX Spark?**
- Process large PDF batches locally
- Keep sensitive documents on-premise
- Run VLMs for complex document understanding
- No per-page API costs!

---

## ELI5: How Does Document AI Work?

> **Imagine you're teaching a robot to read a messy desk full of papers:**
>
> 1. **Take a Photo** (PDF to Image): First, we convert the document to pictures
>
> 2. **Find the Words** (OCR): The robot looks for text - like playing "Where's Waldo" but for letters
>
> 3. **Understand the Layout** (Layout Analysis): Figure out what's a title, what's a table, what's a paragraph
>
> 4. **Read and Remember** (Extraction): Pull out the important information
>
> 5. **Answer Questions** (Q&A): Now you can ask "What's the total amount?" and the robot knows!
>
> **In AI terms:**
> - **OCR (Optical Character Recognition)**: Convert images of text to actual text
> - **Layout Analysis**: Detect document structure (headers, tables, figures)
> - **VLM Integration**: Use vision-language models to understand context
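
The steps above map naturally onto a chain of small functions. The sketch below is only a skeleton under stated assumptions: the `PageResult` container and the stage names (`ocr`, `analyze_layout`, `extract`, `run_pipeline`) are hypothetical, the rasterization step is omitted, and each stage is a stub standing in for a real OCR engine, layout model, or VLM.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class PageResult:
    """Hypothetical container that each stage fills in for one page."""
    page_number: int
    text: str = ""
    blocks: List[Dict] = field(default_factory=list)      # layout regions
    fields: Dict[str, str] = field(default_factory=dict)  # extracted values


# Each stage takes a PageResult and returns it with more information attached.
Stage = Callable[[PageResult], PageResult]


def ocr(page: PageResult) -> PageResult:
    # Stub: a real pipeline would run an OCR engine on the page image here.
    page.text = "INVOICE\nTotal Due: $13,128.50"
    return page


def analyze_layout(page: PageResult) -> PageResult:
    # Stub: treat every non-empty line as one text block.
    page.blocks = [{"type": "text", "content": line}
                   for line in page.text.splitlines() if line.strip()]
    return page


def extract(page: PageResult) -> PageResult:
    # Stub: pull "Key: Value" pairs out of the detected blocks.
    for block in page.blocks:
        if ":" in block["content"]:
            key, value = block["content"].split(":", 1)
            page.fields[key.strip()] = value.strip()
    return page


def run_pipeline(page: PageResult, stages: List[Stage]) -> PageResult:
    # Apply the stages in order, passing the page along the chain.
    for stage in stages:
        page = stage(page)
    return page


result = run_pipeline(PageResult(page_number=1), [ocr, analyze_layout, extract])
print(result.fields)
```

Keeping each stage behind the same signature makes it easy to swap a stub for a real component later, or to replace several stages with a single VLM call as the rest of this notebook does.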

---

## Part 1: Environment Setup

In [None]:
# Install required packages (run once)
# !pip install pymupdf pdf2image pytesseract pillow -q
# !apt-get install -y tesseract-ocr poppler-utils  # System dependencies

In [None]:
import torch
import gc
from PIL import Image, ImageDraw, ImageFont
import numpy as np
import time
from typing import List, Dict, Optional, Tuple
import warnings
warnings.filterwarnings('ignore')

# Check GPU
print("=" * 50)
print("GPU Configuration")
print("=" * 50)

if torch.cuda.is_available():
    gpu_name = torch.cuda.get_device_name(0)
    total_memory = torch.cuda.get_device_properties(0).total_memory / 1e9
    print(f"GPU: {gpu_name}")
    print(f"Total Memory: {total_memory:.1f} GB")
else:
    print("No GPU available")

print(f"\nPyTorch: {torch.__version__}")

In [None]:
def clear_gpu_memory():
    """Clear GPU memory."""
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        torch.cuda.synchronize()
    print("GPU memory cleared!")

def get_memory_usage():
    """Get GPU memory usage."""
    if torch.cuda.is_available():
        allocated = torch.cuda.memory_allocated(0) / 1e9
        return f"Allocated: {allocated:.2f}GB"
    return "No GPU"

print("Utility functions loaded!")

---

## Part 2: Creating Sample Documents

Let's create sample documents to work with: one synthetic invoice and one synthetic report, drawn with Pillow so that the correct answers are known in advance.

In [None]:
def create_sample_invoice() -> Image.Image:
    """
    Create a sample invoice image for testing.
    
    Returns:
        PIL Image of a fake invoice
    """
    # Create a white image
    img = Image.new('RGB', (800, 1000), color='white')
    draw = ImageDraw.Draw(img)
    
    # Try to use a nicer font, fall back to default
    try:
        font_large = ImageFont.truetype("/usr/share/fonts/truetype/dejavu/DejaVuSans-Bold.ttf", 28)
        font_medium = ImageFont.truetype("/usr/share/fonts/truetype/dejavu/DejaVuSans.ttf", 18)
        font_small = ImageFont.truetype("/usr/share/fonts/truetype/dejavu/DejaVuSans.ttf", 14)
    except OSError:  # Font file not available on this system
        font_large = ImageFont.load_default()
        font_medium = ImageFont.load_default()
        font_small = ImageFont.load_default()
    
    # Header
    draw.text((50, 30), "INVOICE", fill='navy', font=font_large)
    draw.text((600, 30), "#INV-2024-0042", fill='black', font=font_medium)
    
    # Company info
    draw.text((50, 80), "TechCorp Solutions Inc.", fill='black', font=font_medium)
    draw.text((50, 105), "123 Innovation Street", fill='gray', font=font_small)
    draw.text((50, 125), "San Francisco, CA 94105", fill='gray', font=font_small)
    
    # Bill To
    draw.text((50, 180), "Bill To:", fill='black', font=font_medium)
    draw.text((50, 205), "Acme Corporation", fill='black', font=font_small)
    draw.text((50, 225), "456 Business Ave", fill='gray', font=font_small)
    draw.text((50, 245), "New York, NY 10001", fill='gray', font=font_small)
    
    # Date info
    draw.text((500, 180), "Date: December 15, 2024", fill='black', font=font_small)
    draw.text((500, 200), "Due: January 15, 2025", fill='black', font=font_small)
    
    # Table header
    y = 320
    draw.rectangle([50, y, 750, y+30], fill='lightgray')
    draw.text((60, y+5), "Description", fill='black', font=font_small)
    draw.text((400, y+5), "Qty", fill='black', font=font_small)
    draw.text((500, y+5), "Unit Price", fill='black', font=font_small)
    draw.text((650, y+5), "Amount", fill='black', font=font_small)
    
    # Table rows
    items = [
        ("AI Development Services", "40", "$150.00", "$6,000.00"),
        ("Model Training (GPU hours)", "100", "$25.00", "$2,500.00"),
        ("Data Preprocessing", "20", "$75.00", "$1,500.00"),
        ("Technical Consultation", "8", "$200.00", "$1,600.00"),
        ("Cloud Infrastructure Setup", "1", "$500.00", "$500.00")
    ]
    
    y = 360
    for desc, qty, unit, amount in items:
        draw.text((60, y), desc, fill='black', font=font_small)
        draw.text((400, y), qty, fill='black', font=font_small)
        draw.text((500, y), unit, fill='black', font=font_small)
        draw.text((650, y), amount, fill='black', font=font_small)
        draw.line([(50, y+25), (750, y+25)], fill='lightgray', width=1)
        y += 35
    
    # Totals
    y = 560
    draw.text((500, y), "Subtotal:", fill='black', font=font_small)
    draw.text((650, y), "$12,100.00", fill='black', font=font_small)
    
    draw.text((500, y+25), "Tax (8.5%):", fill='black', font=font_small)
    draw.text((650, y+25), "$1,028.50", fill='black', font=font_small)
    
    draw.line([(500, y+50), (750, y+50)], fill='black', width=2)
    
    draw.text((500, y+60), "Total Due:", fill='black', font=font_medium)
    draw.text((650, y+60), "$13,128.50", fill='navy', font=font_medium)
    
    # Payment info
    y = 700
    draw.text((50, y), "Payment Information:", fill='black', font=font_medium)
    draw.text((50, y+30), "Bank: First National Bank", fill='gray', font=font_small)
    draw.text((50, y+50), "Account: 1234567890", fill='gray', font=font_small)
    draw.text((50, y+70), "Routing: 987654321", fill='gray', font=font_small)
    
    # Footer
    draw.text((50, 920), "Thank you for your business!", fill='navy', font=font_medium)
    draw.text((50, 950), "Questions? Contact: billing@techcorp.example.com", fill='gray', font=font_small)
    
    return img

# Create and display the sample invoice
invoice_image = create_sample_invoice()

# Display at reduced size
display_image = invoice_image.copy()
display_image.thumbnail((500, 625))
display(display_image)

print(f"\nInvoice image size: {invoice_image.size}")

In [None]:
def create_sample_report() -> Image.Image:
    """
    Create a sample report with a bar chart and a data table.
    
    Returns:
        PIL Image of a fake report
    """
    img = Image.new('RGB', (800, 1000), color='white')
    draw = ImageDraw.Draw(img)
    
    try:
        font_large = ImageFont.truetype("/usr/share/fonts/truetype/dejavu/DejaVuSans-Bold.ttf", 24)
        font_medium = ImageFont.truetype("/usr/share/fonts/truetype/dejavu/DejaVuSans.ttf", 16)
        font_small = ImageFont.truetype("/usr/share/fonts/truetype/dejavu/DejaVuSans.ttf", 12)
    except OSError:  # Font file not available on this system
        font_large = ImageFont.load_default()
        font_medium = ImageFont.load_default()
        font_small = ImageFont.load_default()
    
    # Title
    draw.text((50, 30), "Q4 2024 Performance Report", fill='darkblue', font=font_large)
    draw.line([(50, 65), (750, 65)], fill='darkblue', width=2)
    
    # Executive Summary
    draw.text((50, 85), "Executive Summary", fill='black', font=font_medium)
    summary = """The fourth quarter of 2024 showed strong performance across all key metrics.
Revenue increased by 23% compared to Q3, with particularly strong growth in the
AI services division. Customer satisfaction scores remained high at 94%."""
    
    y = 115
    for line in summary.split('\n'):
        draw.text((50, y), line.strip(), fill='gray', font=font_small)
        y += 18
    
    # Key Metrics section
    draw.text((50, 200), "Key Metrics", fill='black', font=font_medium)
    
    # Draw a simple bar chart
    metrics = [("Revenue", 85), ("Growth", 72), ("Satisfaction", 94), ("Efficiency", 78)]
    x_start = 80
    y_base = 350
    bar_width = 80
    
    for i, (name, value) in enumerate(metrics):
        x = x_start + i * 150
        bar_height = int(value * 1.2)
        
        # Draw bar
        color = ['steelblue', 'seagreen', 'coral', 'mediumpurple'][i]
        draw.rectangle([x, y_base - bar_height, x + bar_width, y_base], fill=color)
        
        # Draw label and value
        draw.text((x + 10, y_base + 10), name, fill='black', font=font_small)
        draw.text((x + 25, y_base - bar_height - 20), f"{value}%", fill='black', font=font_small)
    
    # Draw axes
    draw.line([(60, y_base), (700, y_base)], fill='black', width=2)
    draw.line([(60, 220), (60, y_base)], fill='black', width=2)
    
    # Data Table
    draw.text((50, 420), "Quarterly Breakdown", fill='black', font=font_medium)
    
    # Table
    y = 450
    draw.rectangle([50, y, 700, y+25], fill='lightgray')
    headers = ["Quarter", "Revenue", "Expenses", "Profit", "Growth"]
    x_positions = [60, 180, 300, 440, 580]
    
    for x, header in zip(x_positions, headers):
        draw.text((x, y+5), header, fill='black', font=font_small)
    
    rows = [
        ("Q1 2024", "$2.1M", "$1.4M", "$700K", "+12%"),
        ("Q2 2024", "$2.4M", "$1.5M", "$900K", "+15%"),
        ("Q3 2024", "$2.8M", "$1.6M", "$1.2M", "+18%"),
        ("Q4 2024", "$3.4M", "$1.8M", "$1.6M", "+23%")
    ]
    
    y = 480
    for row in rows:
        for x, cell in zip(x_positions, row):
            draw.text((x, y), cell, fill='black', font=font_small)
        draw.line([(50, y+20), (700, y+20)], fill='lightgray', width=1)
        y += 30
    
    # Conclusions
    draw.text((50, 630), "Key Findings", fill='black', font=font_medium)
    findings = [
        "1. Revenue growth accelerated each quarter, reaching 23% in Q4",
        "2. Profit margins improved from 33% to 47% over the year",
        "3. Customer satisfaction maintained above 90% threshold",
        "4. Operational efficiency gains from AI automation"
    ]
    
    y = 660
    for finding in findings:
        draw.text((50, y), finding, fill='gray', font=font_small)
        y += 25
    
    # Footer
    draw.line([(50, 850), (750, 850)], fill='lightgray', width=1)
    draw.text((50, 870), "Prepared by: Analytics Team | Date: December 2024", fill='gray', font=font_small)
    draw.text((50, 890), "Classification: Internal Use Only", fill='gray', font=font_small)
    
    return img

# Create and display the report
report_image = create_sample_report()

display_image = report_image.copy()
display_image.thumbnail((500, 625))
display(display_image)

print(f"\nReport image size: {report_image.size}")
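
The synthetic images above stand in for real scans. For actual PDF files, a minimal sketch of the first two pipeline steps (rasterize, then OCR) might look like the following. It assumes the optional `pymupdf` and `pytesseract` packages from the install cell are present, along with the `tesseract-ocr` system binary. The helper names `pdf_to_images` and `ocr_image` and the file name in the usage comment are illustrative only.

```python
from typing import List


def pdf_to_images(pdf_path: str, dpi: int = 200) -> List["Image.Image"]:
    """Render every page of a PDF to a PIL image using PyMuPDF."""
    # Imported lazily so this cell still loads if the packages are missing.
    import fitz  # PyMuPDF
    from PIL import Image

    images = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            pix = page.get_pixmap(dpi=dpi)  # rasterize at the requested DPI
            images.append(
                Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
            )
    return images


def ocr_image(image) -> str:
    """Run Tesseract OCR on a PIL image and return the recognized text."""
    import pytesseract

    return pytesseract.image_to_string(image)


# Example usage (hypothetical file name):
# pages = pdf_to_images("scanned_invoice.pdf")
# print(ocr_image(pages[0])[:200])
```

A resolution around 200 DPI is a common starting point for OCR; lower values render faster but small print may become unreadable.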

---

## Part 3: Document Understanding with VLMs

Modern VLMs such as Qwen2-VL can read text, tables, and charts directly from a page image, without a separate OCR step. Let's use one to extract information from our documents.
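
Image size drives both memory use and latency, because the model turns each image into a grid of visual tokens. The sketch below gives a rough estimate, assuming Qwen2-VL's documented scheme of about one visual token per 28×28-pixel tile. The function name is hypothetical, and the estimate ignores the processor's `min_pixels`/`max_pixels` clamping, so treat the numbers as approximate.

```python
def estimate_visual_tokens(width: int, height: int, tile: int = 28) -> int:
    """Rough visual-token count: one token per tile x tile block of pixels."""
    # Each side is rounded to the nearest whole number of tiles (at least one).
    tiles_wide = max(1, round(width / tile))
    tiles_high = max(1, round(height / tile))
    return tiles_wide * tiles_high


# Our sample documents are 800 x 1000 pixels.
print(estimate_visual_tokens(800, 1000))   # 1044
# Doubling both sides roughly quadruples the token count.
print(estimate_visual_tokens(1600, 2000))
```

If pages are much larger than the samples here, passing `min_pixels` and `max_pixels` to `AutoProcessor.from_pretrained` is the mechanism the Qwen2-VL model card describes for bounding this cost.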

In [None]:
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor

print("Loading Qwen2-VL for document understanding...")
print(f"Memory before: {get_memory_usage()}")
start_time = time.time()

doc_processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

doc_model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

load_time = time.time() - start_time
print(f"\nModel loaded in {load_time:.1f} seconds!")
print(f"Memory after: {get_memory_usage()}")

In [None]:
def analyze_document(image: Image.Image, question: str, max_tokens: int = 500) -> str:
    """
    Analyze a document image and answer a question about it.
    
    Args:
        image: Document image
        question: Question about the document
        max_tokens: Maximum response length
        
    Returns:
        Answer from the VLM
    """
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": question}
            ]
        }
    ]
    
    text = doc_processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = doc_processor(text=[text], images=[image], return_tensors="pt", padding=True)
    inputs = inputs.to(doc_model.device)
    
    start_time = time.time()
    with torch.inference_mode():
        output_ids = doc_model.generate(
            **inputs,
            max_new_tokens=max_tokens,
            do_sample=True,
            temperature=0.3  # Lower temperature for factual extraction
        )
    generation_time = time.time() - start_time
    
    generated_ids = output_ids[:, inputs.input_ids.shape[1]:]
    response = doc_processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
    
    print(f"[Generated in {generation_time:.1f}s]")
    return response

print("analyze_document() function ready!")

In [None]:
# Test with the invoice
print("=" * 50)
print("INVOICE ANALYSIS")
print("=" * 50)

question1 = "What is the total amount due on this invoice? Also tell me the invoice number and due date."

print(f"\nQuestion: {question1}")
print("-" * 50)
answer1 = analyze_document(invoice_image, question1)
print(f"\nAnswer: {answer1}")

In [None]:
# Extract structured information
question2 = """Extract all line items from this invoice as a structured list.
For each item, provide: description, quantity, unit price, and total amount.
Format as a numbered list."""

print(f"Question: {question2}")
print("-" * 50)
answer2 = analyze_document(invoice_image, question2)
print(f"\nAnswer: {answer2}")
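
A numbered list is easy to read but awkward to post-process. One alternative is to ask for JSON and parse the reply defensively, since models sometimes wrap JSON in a markdown code fence. The helper below is a sketch with a hypothetical name (`parse_json_response`), and the sample reply is made up for illustration rather than taken from a real model run.

```python
import json
import re
from typing import Any, Optional

FENCE = "`" * 3  # markdown code-fence marker, built this way to keep the cell readable


def parse_json_response(response: str) -> Optional[Any]:
    """Parse JSON from a model reply, tolerating a surrounding code fence."""
    # Prefer the contents of a fenced block if the model added one.
    pattern = re.escape(FENCE) + r"(?:json)?\s*(.*?)" + re.escape(FENCE)
    fenced = re.search(pattern, response, re.DOTALL)
    candidate = fenced.group(1) if fenced else response
    try:
        return json.loads(candidate.strip())
    except json.JSONDecodeError:
        return None  # the caller decides whether to retry or fall back


sample_reply = (
    FENCE + "json\n"
    '[{"description": "Data Preprocessing", "quantity": 20}]\n'
    + FENCE
)
print(parse_json_response(sample_reply))

# With the model loaded, usage might look like:
# raw = analyze_document(invoice_image, "Return all line items as a JSON array.")
# items = parse_json_response(raw)
```

Returning `None` instead of raising keeps the calling code simple: a failed parse can trigger a retry with a stricter prompt.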

In [None]:
# Analyze the report
print("\n" + "=" * 50)
print("REPORT ANALYSIS")
print("=" * 50)

question3 = "What was the revenue growth in Q4 2024? What are the key findings mentioned in this report?"

print(f"\nQuestion: {question3}")
print("-" * 50)
answer3 = analyze_document(report_image, question3)
print(f"\nAnswer: {answer3}")

In [None]:
# Extract table data
question4 = """Read the quarterly breakdown table and extract the data.
What was the profit in each quarter? Calculate the total annual profit."""

print(f"Question: {question4}")
print("-" * 50)
answer4 = analyze_document(report_image, question4)
print(f"\nAnswer: {answer4}")
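
The question above asks the model to add up the quarterly profits itself. Language models can slip on arithmetic, so a safer pattern is to let the model read the values and do the sum in Python. The sketch below assumes the amounts use the `$`, `K`, and `M` notation from the sample table; `parse_money` is a hypothetical helper, not a library function.

```python
import re
from decimal import Decimal

_MULTIPLIERS = {"": Decimal(1), "K": Decimal(1_000), "M": Decimal(1_000_000)}


def parse_money(text: str) -> Decimal:
    """Convert strings like '$700K', '$1.2M', or '$13,128.50' to a Decimal."""
    match = re.fullmatch(r"\$?([\d,]+(?:\.\d+)?)([KM]?)", text.strip())
    if match is None:
        raise ValueError(f"Unrecognized amount: {text!r}")
    number, suffix = match.groups()
    return Decimal(number.replace(",", "")) * _MULTIPLIERS[suffix]


# Profit column of the sample report, as a model might return it.
quarterly_profit = ["$700K", "$900K", "$1.2M", "$1.6M"]
total = sum(parse_money(p) for p in quarterly_profit)
print(f"Total annual profit: ${total:,.0f}")
```

Using `Decimal` rather than `float` avoids rounding surprises when the same helper is applied to invoice amounts with cents.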

---

## Part 4: Building a Document Q&A Pipeline

Let's create a complete pipeline that can:
1. Accept document images
2. Store them with metadata
3. Answer questions across multiple documents

In [None]:
class DocumentQA:
    """
    Document Question-Answering Pipeline.
    
    Handles multiple documents and answers questions using VLM.
    """
    
    def __init__(self, model, processor):
        """
        Initialize the Document QA system.
        
        Args:
            model: Loaded VLM model
            processor: VLM processor
        """
        self.model = model
        self.processor = processor
        self.documents = {}  # Store documents by ID
        
    def add_document(self, doc_id: str, image: Image.Image, metadata: Optional[Dict] = None):
        """
        Add a document to the collection.
        
        Args:
            doc_id: Unique identifier for the document
            image: Document image
            metadata: Optional metadata (title, date, type, etc.)
        """
        self.documents[doc_id] = {
            'image': image,
            'metadata': metadata or {},
            'added_at': time.time()
        }
        print(f"Added document: {doc_id}")
        
    def get_document_summary(self, doc_id: str) -> str:
        """
        Get a summary of a document.
        
        Args:
            doc_id: Document identifier
            
        Returns:
            Summary text
        """
        if doc_id not in self.documents:
            return f"Document '{doc_id}' not found"
            
        image = self.documents[doc_id]['image']
        
        prompt = "Provide a brief summary of this document. What type of document is it and what are the key points?"
        
        messages = [{
            "role": "user",
            "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": prompt}
            ]
        }]
        
        text = self.processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
        inputs = self.processor(text=[text], images=[image], return_tensors="pt", padding=True)
        inputs = inputs.to(self.model.device)
        
        with torch.inference_mode():
            output_ids = self.model.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=0.3)
        
        generated_ids = output_ids[:, inputs.input_ids.shape[1]:]
        return self.processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
    
    def ask(self, question: str, doc_id: Optional[str] = None) -> Dict:
        """
        Ask a question about document(s).
        
        Args:
            question: The question to ask
            doc_id: Specific document to query (None = search all)
            
        Returns:
            Dictionary with answer and source documents
        """
        if doc_id:
            # Query specific document
            if doc_id not in self.documents:
                return {'question': question, 'answer': f"Document '{doc_id}' not found", 'sources': []}
            
            docs_to_query = {doc_id: self.documents[doc_id]}
        else:
            # Query all documents
            docs_to_query = self.documents
        
        if not docs_to_query:
            return {'question': question, 'answer': "No documents in collection", 'sources': []}
        
        # For simplicity, we query each document separately and join the answers.
        # In production, you'd use retrieval to find relevant documents first
        
        answers = []
        sources = []
        
        for did, doc in docs_to_query.items():
            image = doc['image']
            
            messages = [{
                "role": "user",
                "content": [
                    {"type": "image", "image": image},
                    {"type": "text", "text": f"{question}\n\nIf this document doesn't contain relevant information, just say 'Not found in this document'."}
                ]
            }]
            
            text = self.processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
            inputs = self.processor(text=[text], images=[image], return_tensors="pt", padding=True)
            inputs = inputs.to(self.model.device)
            
            with torch.inference_mode():
                output_ids = self.model.generate(**inputs, max_new_tokens=300, do_sample=True, temperature=0.3)
            
            generated_ids = output_ids[:, inputs.input_ids.shape[1]:]
            answer = self.processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
            
            if "not found" not in answer.lower():
                answers.append(f"From {did}: {answer}")
                sources.append(did)
        
        if not answers:
            final_answer = "I couldn't find relevant information in the documents."
        else:
            final_answer = "\n\n".join(answers)
        
        return {
            'question': question,
            'answer': final_answer,
            'sources': sources
        }
    
    def extract_fields(self, doc_id: str, fields: List[str]) -> Dict:
        """
        Extract specific fields from a document.
        
        Args:
            doc_id: Document identifier
            fields: List of field names to extract
            
        Returns:
            Dictionary of extracted field values
        """
        if doc_id not in self.documents:
            return {'error': f"Document '{doc_id}' not found"}
        
        image = self.documents[doc_id]['image']
        
        fields_str = "\n".join([f"- {field}" for field in fields])
        prompt = f"""Extract the following fields from this document. 
Return the values in a structured format.

Fields to extract:
{fields_str}

Format your response as:
Field Name: Value"""
        
        messages = [{
            "role": "user",
            "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": prompt}
            ]
        }]
        
        text = self.processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
        inputs = self.processor(text=[text], images=[image], return_tensors="pt", padding=True)
        inputs = inputs.to(self.model.device)
        
        with torch.inference_mode():
            output_ids = self.model.generate(**inputs, max_new_tokens=300, do_sample=True, temperature=0.1)
        
        generated_ids = output_ids[:, inputs.input_ids.shape[1]:]
        response = self.processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
        
        # Parse the response into a dictionary
        extracted = {}
        for line in response.split('\n'):
            if ':' in line:
                key, value = line.split(':', 1)
                extracted[key.strip()] = value.strip()
        
        return extracted
    
    def list_documents(self) -> List[str]:
        """List all document IDs in the collection."""
        return list(self.documents.keys())

print("DocumentQA class defined!")

In [None]:
# Initialize the Document QA system
doc_qa = DocumentQA(doc_model, doc_processor)

# Add our documents
doc_qa.add_document("invoice_001", invoice_image, {
    'type': 'invoice',
    'title': 'TechCorp Invoice',
    'date': '2024-12-15'
})

doc_qa.add_document("report_q4_2024", report_image, {
    'type': 'report',
    'title': 'Q4 2024 Performance Report',
    'date': '2024-12'
})

print(f"\nDocuments in collection: {doc_qa.list_documents()}")

In [None]:
# Get summaries
print("\n" + "=" * 50)
print("DOCUMENT SUMMARIES")
print("=" * 50)

for doc_id in doc_qa.list_documents():
    print(f"\n{doc_id}:")
    print("-" * 40)
    summary = doc_qa.get_document_summary(doc_id)
    print(summary)

In [None]:
# Extract specific fields from the invoice
print("\n" + "=" * 50)
print("FIELD EXTRACTION")
print("=" * 50)

fields = [
    "Invoice Number",
    "Company Name",
    "Total Amount",
    "Due Date",
    "Number of Line Items"
]

extracted = doc_qa.extract_fields("invoice_001", fields)

print("\nExtracted Fields:")
for field, value in extracted.items():
    print(f"  {field}: {value}")
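
`extract_fields` keeps every reply line that contains a colon, so a preamble such as "Here are the extracted fields:" or a trailing note can end up in the result. One way to tighten this, sketched below, is to keep only lines whose key matches a requested field, ignoring case and common list or bold markup. The function name and the sample reply are hypothetical and serve only as illustration.

```python
from typing import Dict, List, Optional


def parse_requested_fields(response: str, fields: List[str]) -> Dict[str, Optional[str]]:
    """Keep only 'Field: Value' lines whose key matches a requested field."""
    wanted = {name.lower(): name for name in fields}
    # Fields the model did not return stay as None, so gaps are explicit.
    result: Dict[str, Optional[str]] = {name: None for name in fields}
    for line in response.splitlines():
        if ":" not in line:
            continue
        key, value = line.split(":", 1)
        # Strip list markers and bold markup the model may add around the key.
        canonical = wanted.get(key.strip(" -*\t").lower())
        if canonical is not None and result[canonical] is None:
            result[canonical] = value.strip(" *")
    return result


sample_response = """Here are the extracted fields:
- Invoice Number: #INV-2024-0042
**Total Amount**: $13,128.50
Note: values were read from the header."""

print(parse_requested_fields(
    sample_response, ["Invoice Number", "Total Amount", "Due Date"]
))
```

Because missing fields come back as `None`, downstream code can tell "not extracted" apart from an empty value.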

In [None]:
# Ask questions across all documents
print("\n" + "=" * 50)
print("CROSS-DOCUMENT Q&A")
print("=" * 50)

question = "What financial information is available in these documents? Summarize the key numbers."

print(f"\nQuestion: {question}")
print("-" * 50)

result = doc_qa.ask(question)
print(f"\nAnswer: {result['answer']}")
print(f"\nSources: {result['sources']}")
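
Calling `ask` without a `doc_id` runs the VLM once per stored document, so cost grows linearly with the collection. A cheap first step toward the retrieval mentioned in the class comments is to rank documents by keyword overlap between the question and their metadata, then query only the best matches. The sketch below uses a hypothetical `rank_documents` helper; a production system would more likely use embeddings.

```python
import re
from typing import Dict, List, Set


def _tokens(text: str) -> Set[str]:
    """Lowercase alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def rank_documents(question: str, metadata_by_id: Dict[str, Dict], top_k: int = 1) -> List[str]:
    """Rank document IDs by keyword overlap between question and metadata."""
    q_tokens = _tokens(question)
    scored = []
    for doc_id, meta in metadata_by_id.items():
        meta_text = " ".join(str(v) for v in meta.values()) + " " + doc_id
        scored.append((len(q_tokens & _tokens(meta_text)), doc_id))
    # Highest overlap first; documents with no overlap are dropped.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for score, doc_id in scored if score > 0][:top_k]


catalog = {
    "invoice_001": {"type": "invoice", "title": "TechCorp Invoice", "date": "2024-12-15"},
    "report_q4_2024": {"type": "report", "title": "Q4 2024 Performance Report", "date": "2024-12"},
}
print(rank_documents("What is the total on the TechCorp invoice?", catalog))

# With the DocumentQA instance from above, usage might look like:
# catalog = {k: v["metadata"] for k, v in doc_qa.documents.items()}
# for best_id in rank_documents(question, catalog):
#     print(doc_qa.ask(question, doc_id=best_id)["answer"])
```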

---

## Try It Yourself: Create Your Own Document Pipeline

1. Create a new type of document (e.g., receipt, contract, form)
2. Add it to the DocumentQA system
3. Extract specific fields from it

<details>
<summary>Hint: Creating a simple receipt</summary>

```python
def create_receipt():
    img = Image.new('RGB', (400, 600), 'white')
    draw = ImageDraw.Draw(img)
    
    draw.text((50, 30), "RECEIPT", fill='black')
    draw.text((50, 60), "Coffee Shop", fill='gray')
    draw.text((50, 100), "Latte         $4.50", fill='black')
    draw.text((50, 130), "Muffin        $3.00", fill='black')
    draw.text((50, 170), "Total:        $7.50", fill='black')
    draw.text((50, 200), "Date: 2024-12-15", fill='gray')
    
    return img
```
</details>

In [None]:
# YOUR CODE HERE
# Create your own document and add it to the system!



---

## Common Mistakes

### Mistake 1: Image Resolution Too Low

```python
# Wrong - text becomes unreadable
small_image = doc_image.resize((200, 200))
answer = analyze_document(small_image, question)  # Poor results

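# Right - maintain readable resolution
answer = analyze_document(doc_image, question)  # Full resolution
# Or shrink proportionally while keeping text readable.
# thumbnail() resizes in place and never upscales, so work on a copy.
resized = doc_image.copy()
resized.thumbnail((1200, 1600), Image.LANCZOS)
answer = analyze_document(resized, question)  # Better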
```

### Mistake 2: Not Being Specific in Questions

```python
# Wrong - vague question
answer = analyze_document(invoice, "What's in here?")  # Generic answer

# Right - specific question
answer = analyze_document(invoice, "What is the total amount due and the due date?")  # Precise answer
```

### Mistake 3: Asking for Too Much in One Question

```python
# Wrong - asking about multiple things in one question
answer = analyze_document(doc, "Extract all fields, summarize, and compare to other docs")

# Right - break down into specific tasks
summary = analyze_document(doc, "Summarize this document briefly.")
fields = analyze_document(doc, "Extract: Company Name, Amount, Date")
```

### Mistake 4: Not Handling Handwritten or Low-Quality Scans

```python
# Wrong - assuming perfect input
answer = analyze_document(blurry_scan, question)  # May fail

# Right - add robustness hints
answer = analyze_document(blurry_scan, 
    f"{question}\n\nNote: The image quality may be low. Do your best to read it.")
```

---

## Checkpoint

You've learned:
- How Document AI pipelines work
- How to use VLMs for document understanding
- How to extract structured information from documents
- How to build a Document Q&A system
- How to query across multiple documents

### Key Takeaways

1. **VLMs are powerful for documents**: They understand both text and layout
2. **Be specific in questions**: Clear questions get better answers
3. **Image quality matters**: Keep resolution high enough for text to be readable
4. **Structure helps**: Breaking tasks into steps improves accuracy

---

## Challenge (Optional)

### Build a Document Comparison System

Create a function that:
1. Takes two documents as input
2. Identifies similarities and differences
3. Highlights any discrepancies in numbers or key fields

In [None]:
# YOUR CHALLENGE CODE HERE

def compare_documents(doc_id1: str, doc_id2: str) -> Dict:
    """
    Compare two documents and identify similarities and differences.
    
    Args:
        doc_id1: First document ID
        doc_id2: Second document ID
        
    Returns:
        Comparison results
    """
    # TODO: Implement document comparison
    pass
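
One possible starting point for the challenge, offered as a hint rather than a full solution: extract the same fields from both documents (for example with `extract_fields`), then compare the two dictionaries in plain Python. The `diff_fields` helper and the sample values below are hypothetical.

```python
from typing import Dict


def diff_fields(a: Dict[str, str], b: Dict[str, str]) -> Dict[str, Dict]:
    """Split two field dictionaries into matching, differing, and one-sided keys."""
    shared = a.keys() & b.keys()
    return {
        "same": {k: a[k] for k in shared if a[k] == b[k]},
        "different": {k: (a[k], b[k]) for k in shared if a[k] != b[k]},
        "only_in_first": {k: a[k] for k in a.keys() - b.keys()},
        "only_in_second": {k: b[k] for k in b.keys() - a.keys()},
    }


first = {"Total Amount": "$13,128.50", "Due Date": "January 15, 2025"}
second = {"Total Amount": "$13,182.50", "Invoice Number": "#INV-2024-0042"}
print(diff_fields(first, second))
```

Doing the comparison in code rather than asking the model to "spot the differences" makes numeric discrepancies, like the transposed digits above, impossible to miss.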

---

## Further Reading

- [Document Understanding with LLMs](https://huggingface.co/blog/document-ai)
- [Qwen2-VL Paper](https://arxiv.org/abs/2409.12191)
- [LayoutLM for Document Understanding](https://arxiv.org/abs/1912.13318)
- [PyMuPDF Documentation](https://pymupdf.readthedocs.io/)

---

## Cleanup

In [None]:
# Clean up
del doc_model
del doc_processor
del doc_qa

clear_gpu_memory()
print(f"Final memory state: {get_memory_usage()}")
print("\nNotebook complete! Ready for the next task.")