# Lab 4.1.4: Document AI Pipeline

**Module:** 4.1 - Multimodal AI  
**Time:** 2 hours  
**Difficulty:** ⭐⭐⭐⭐

---

## 🎯 Learning Objectives

By the end of this notebook, you will:
- [ ] Extract text from PDFs using PyMuPDF
- [ ] Apply OCR to scanned documents using Tesseract
- [ ] Detect and extract tables from documents
- [ ] Use VLMs to understand complex document layouts
- [ ] Build a complete document Q&A pipeline

---

## 📚 Prerequisites

- Completed: Lab 4.1.1 (Vision-Language Models)
- Knowledge of: PDF structure, basic NLP concepts
- Running in: NGC PyTorch container

---

## 🌍 Real-World Context

Document AI is transforming how organizations handle paperwork:

- **Legal**: Extract key clauses from contracts automatically
- **Finance**: Process invoices and receipts at scale
- **Healthcare**: Digitize patient records with high accuracy
- **Insurance**: Extract information from claim forms
- **Research**: Parse scientific papers for key findings

---

## 🧒 ELI5: What is Document AI?

> **Imagine you're a librarian who needs to organize thousands of old books and papers.** Some are typed, some are handwritten, some have pictures and tables.
>
> Document AI is like having a super-powered assistant who can:
> 1. **Read** any document, even messy handwriting
> 2. **Understand** the layout - which part is the title, which is a table
> 3. **Extract** specific information you need
> 4. **Answer questions** about what's in the documents
>
> **In AI terms:** Document AI combines OCR (converting images to text), layout analysis (understanding structure), and NLP/VLMs (understanding meaning) to process any document intelligently.

---

## Part 1: Environment Setup

Let's set up the tools we need for document processing.

In [None]:
# Check GPU
import torch

print("=" * 50)
print("DGX Spark Environment Check")
print("=" * 50)

if torch.cuda.is_available():
    device = torch.cuda.get_device_properties(0)
    print(f"GPU: {device.name}")
    print(f"Memory: {device.total_memory / 1024**3:.1f} GB")
else:
    print("WARNING: No GPU detected!")

In [None]:
# Install dependencies (run once)
# Quote the version specifiers so the shell does not treat ">=" as a redirect
# !pip install "pymupdf>=1.23.0" pytesseract pdf2image "pillow>=10.0.0"
# !apt-get update && apt-get install -y tesseract-ocr poppler-utils

In [None]:
# Import libraries
import gc
import time
import re
from pathlib import Path
from typing import Optional, Union, List, Dict, Any, Tuple
from dataclasses import dataclass, field
from io import BytesIO

import torch
import numpy as np
from PIL import Image
import matplotlib.pyplot as plt

# Document processing
import fitz  # PyMuPDF

print("✅ Libraries imported!")

In [None]:
# Check for OCR availability
import shutil

print("\n🔍 Checking Document AI dependencies:")

# Check Tesseract
tesseract_path = shutil.which("tesseract")
if tesseract_path:
    print(f"  ✅ Tesseract OCR: {tesseract_path}")
else:
    print("  ⚠️  Tesseract OCR: Not found (install with: apt-get install tesseract-ocr)")

# Check pdftoppm (for pdf2image)
pdftoppm_path = shutil.which("pdftoppm")
if pdftoppm_path:
    print(f"  ✅ pdftoppm: {pdftoppm_path}")
else:
    print("  ⚠️  pdftoppm: Not found (install with: apt-get install poppler-utils)")

# Check PyMuPDF
print(f"  ✅ PyMuPDF: v{fitz.version[0]}")

---

## Part 2: Creating a Sample PDF

Let's create a sample PDF document to work with.

In [None]:
def create_sample_pdf(output_path: str = "sample_document.pdf") -> str:
    """
    Create a sample PDF with various content types for testing.
    
    Returns:
        Path to created PDF
    """
    doc = fitz.open()
    
    # Page 1: Title and introduction
    page = doc.new_page(width=612, height=792)  # Letter size
    
    # Title
    title_rect = fitz.Rect(72, 72, 540, 120)
    page.insert_textbox(
        title_rect,
        "Annual Performance Report 2024",
        fontsize=24,
        fontname="helv",
        align=fitz.TEXT_ALIGN_CENTER
    )
    
    # Subtitle
    subtitle_rect = fitz.Rect(72, 130, 540, 160)
    page.insert_textbox(
        subtitle_rect,
        "DGX Spark AI Division",
        fontsize=14,
        fontname="helv",
        align=fitz.TEXT_ALIGN_CENTER
    )
    
    # Introduction
    intro_rect = fitz.Rect(72, 200, 540, 400)
    intro_text = """Executive Summary

This report presents the annual performance metrics for the DGX Spark AI Division. 
Key highlights include:

• Revenue growth of 45% year-over-year
• Successful launch of 3 new AI products
• Customer satisfaction score of 4.8/5.0
• Team expansion from 50 to 85 employees

The following sections provide detailed analysis of each department's performance 
and projections for the upcoming fiscal year."""
    
    page.insert_textbox(
        intro_rect,
        intro_text,
        fontsize=11,
        fontname="helv"
    )
    
    # Page 2: Financial data with table
    page2 = doc.new_page(width=612, height=792)
    
    # Section header
    header_rect = fitz.Rect(72, 72, 540, 100)
    page2.insert_textbox(
        header_rect,
        "1. Financial Performance",
        fontsize=16,
        fontname="helv"
    )
    
    # Table data
    table_text = """Quarterly Revenue (in millions USD)

Quarter    | 2023      | 2024      | Growth
-----------|-----------|-----------|--------
Q1         | $12.5     | $18.2     | +45.6%
Q2         | $14.3     | $21.5     | +50.3%
Q3         | $15.8     | $23.1     | +46.2%
Q4         | $18.2     | $26.8     | +47.3%
-----------|-----------|-----------|--------
Total      | $60.8     | $89.6     | +47.4%

Key Financial Metrics:

• Operating Margin: 28.5% (up from 23.2%)
• EBITDA: $25.5M (up 62% YoY)
• Cash Position: $45.2M
• R&D Investment: $15.8M (17.6% of revenue)"""
    
    table_rect = fitz.Rect(72, 120, 540, 450)
    page2.insert_textbox(
        table_rect,
        table_text,
        fontsize=10,
        fontname="cour"  # Monospace for table
    )
    
    # Page 3: Product highlights
    page3 = doc.new_page(width=612, height=792)
    
    header_rect = fitz.Rect(72, 72, 540, 100)
    page3.insert_textbox(
        header_rect,
        "2. Product Highlights",
        fontsize=16,
        fontname="helv"
    )
    
    products_text = """New Product Launches:

1. SPARK Vision Pro
   Released: March 2024
   Description: Advanced computer vision system for manufacturing quality control.
   Revenue Impact: $8.2M in first 9 months
   Customer Adoption: 45 enterprise clients

2. SPARK NLP Suite
   Released: June 2024  
   Description: Natural language processing toolkit for document automation.
   Revenue Impact: $5.4M in first 6 months
   Customer Adoption: 78 enterprise clients

3. SPARK Edge Deployment
   Released: September 2024
   Description: Edge computing framework for real-time AI inference.
   Revenue Impact: $3.1M in first 3 months
   Customer Adoption: 32 enterprise clients

Product Roadmap 2025:
• Q1: SPARK Multimodal (vision + language integration)
• Q2: SPARK AutoML Platform
• Q3: SPARK Real-time Analytics
• Q4: SPARK Enterprise Security Suite"""
    
    products_rect = fitz.Rect(72, 120, 540, 650)
    page3.insert_textbox(
        products_rect,
        products_text,
        fontsize=10,
        fontname="helv"
    )
    
    # Save the PDF
    doc.save(output_path)
    doc.close()
    
    print(f"✅ Created sample PDF: {output_path}")
    return output_path

# Create the sample PDF
sample_pdf_path = create_sample_pdf()

---

## Part 3: Basic PDF Text Extraction

Let's extract text from our PDF using PyMuPDF.

In [None]:
def extract_text_from_pdf(pdf_path: str) -> Dict[int, str]:
    """
    Extract text from each page of a PDF.
    
    Args:
        pdf_path: Path to PDF file
        
    Returns:
        Dictionary mapping page numbers to text content
    """
    doc = fitz.open(pdf_path)
    
    pages = {}
    for page_num, page in enumerate(doc, 1):
        text = page.get_text()
        pages[page_num] = text
    
    doc.close()
    return pages

# Extract text
print("📄 Extracting text from PDF...")
print("=" * 60)

pages = extract_text_from_pdf(sample_pdf_path)

for page_num, text in pages.items():
    print(f"\n📃 Page {page_num}:")
    print("-" * 40)
    # Show first 500 characters
    preview = text[:500] + "..." if len(text) > 500 else text
    print(preview)

In [None]:
# Get structured text blocks with positions
def extract_structured_blocks(pdf_path: str) -> List[Dict]:
    """
    Extract text blocks with position information.
    
    Returns:
        List of blocks with text, position, and metadata
    """
    doc = fitz.open(pdf_path)
    
    all_blocks = []
    
    for page_num, page in enumerate(doc, 1):
        blocks = page.get_text("dict")["blocks"]
        
        for block in blocks:
            if "lines" in block:  # Text block
                # Combine text from all lines
                text = " ".join(
                    " ".join(span["text"] for span in line["spans"])
                    for line in block["lines"]
                ).strip()
                
                if text:
                    all_blocks.append({
                        "page": page_num,
                        "bbox": block["bbox"],  # (x0, y0, x1, y1)
                        "text": text,
                        "type": "text",
                    })
            
            elif "image" in block:  # Image block
                all_blocks.append({
                    "page": page_num,
                    "bbox": block["bbox"],
                    "type": "image",
                    "size": (block.get("width"), block.get("height")),
                })
    
    doc.close()
    return all_blocks

# Get structured blocks
blocks = extract_structured_blocks(sample_pdf_path)

print(f"\n📊 Found {len(blocks)} content blocks:")
print("=" * 60)

for i, block in enumerate(blocks[:10]):  # Show first 10
    block_type = block["type"]
    page = block["page"]
    
    if block_type == "text":
        preview = block["text"][:60] + "..." if len(block["text"]) > 60 else block["text"]
        print(f"  {i+1}. Page {page}: [TEXT] {preview}")
    else:
        print(f"  {i+1}. Page {page}: [IMAGE] Size: {block['size']}")

if len(blocks) > 10:
    print(f"  ... and {len(blocks) - 10} more blocks")
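
PyMuPDF returns blocks in the order they are stored in the PDF, which does not always match visual reading order. The sketch below assumes dictionaries shaped like the output of `extract_structured_blocks` (`page`, `bbox`, `text`) and sorts them top-to-bottom, then left-to-right. The name `sort_reading_order` and the `y_tolerance` parameter are illustrative, and the heuristic only suits simple layouts where rows can be read straight across.

```python
from typing import Dict, List


def sort_reading_order(blocks: List[Dict], y_tolerance: float = 3.0) -> List[Dict]:
    """Sort blocks by page, then top-to-bottom, then left-to-right.

    The top edge (y0) is bucketed into bands of y_tolerance points, so
    blocks that fall in the same band are treated as one row.
    """
    def key(block: Dict):
        x0, y0, _, _ = block["bbox"]
        row = int(y0 // y_tolerance)  # Coarse vertical band index
        return (block["page"], row, x0)

    return sorted(blocks, key=key)


# Hand-made blocks standing in for extract_structured_blocks() output
demo_blocks = [
    {"page": 1, "bbox": (300, 100, 500, 120), "type": "text", "text": "right column"},
    {"page": 2, "bbox": (72, 72, 540, 100), "type": "text", "text": "page two"},
    {"page": 1, "bbox": (72, 101, 280, 120), "type": "text", "text": "left column"},
    {"page": 1, "bbox": (72, 72, 540, 90), "type": "text", "text": "title"},
]

ordered = [b["text"] for b in sort_reading_order(demo_blocks)]
print(ordered)
# ['title', 'left column', 'right column', 'page two']
```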

---

## Part 4: Table Extraction

Let's extract tables from our PDF document.

In [None]:
@dataclass
class TableData:
    """Extracted table data."""
    rows: List[List[str]]
    headers: Optional[List[str]] = None
    page_number: int = 0
    
    def to_markdown(self) -> str:
        """Convert table to markdown format."""
        if not self.rows:
            return ""
        
        lines = []
        
        if self.headers:
            lines.append("| " + " | ".join(self.headers) + " |")
            lines.append("| " + " | ".join(["---"] * len(self.headers)) + " |")
        
        for row in self.rows:
            lines.append("| " + " | ".join(str(cell) for cell in row) + " |")
        
        return "\n".join(lines)
    
    def to_dict(self) -> List[Dict]:
        """Convert table to list of dictionaries."""
        if not self.headers or not self.rows:
            return []
        
        return [
            dict(zip(self.headers, row))
            for row in self.rows
        ]


def extract_tables(pdf_path: str) -> List[TableData]:
    """
    Extract tables from a PDF.
    
    Args:
        pdf_path: Path to PDF file
        
    Returns:
        List of TableData objects
    """
    doc = fitz.open(pdf_path)
    tables = []
    
    for page_num, page in enumerate(doc, 1):
        # Try to find tables
        try:
            page_tables = page.find_tables()
            
            for table in page_tables:
                if table.row_count > 1:
                    rows = []
                    for row in table.extract():
                        cleaned = [str(cell).strip() if cell else "" for cell in row]
                        rows.append(cleaned)
                    
                    if rows:
                        tables.append(TableData(
                            rows=rows[1:] if len(rows) > 1 else [],
                            headers=rows[0] if rows else None,
                            page_number=page_num,
                        ))
        except Exception as e:
            print(f"  Note: Table extraction failed for page {page_num}: {e}")
    
    doc.close()
    return tables

# Extract tables
print("📊 Extracting tables from PDF...")
print("=" * 60)

tables = extract_tables(sample_pdf_path)
print(f"\nFound {len(tables)} table(s)")

for i, table in enumerate(tables):
    print(f"\n📋 Table {i+1} (Page {table.page_number}):")
    print(table.to_markdown())
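
`page.find_tables()` may return nothing for tables laid out as plain text, such as the pipe-delimited table in the sample PDF. As a possible fallback, here is a minimal sketch that parses such text into headers and rows. The function name `parse_pipe_table` is hypothetical, and it assumes that every table row uses `|` as the column separator and that separator rows contain only dashes, pipes, and whitespace.

```python
import re
from typing import Dict, List, Optional


def parse_pipe_table(text: str) -> Dict[str, Optional[List]]:
    """Parse a pipe-delimited text table into headers and rows."""
    rows: List[List[str]] = []
    for line in text.splitlines():
        if "|" not in line:
            continue  # Not a table row (titles, blank lines, prose)
        if re.fullmatch(r"[\s\-|]+", line):
            continue  # Separator row such as -----|-----
        rows.append([cell.strip() for cell in line.split("|")])

    if not rows:
        return {"headers": None, "rows": []}
    # Treat the first table row as the header row
    return {"headers": rows[0], "rows": rows[1:]}


sample = """Quarterly Revenue (in millions USD)

Quarter    | 2023      | 2024      | Growth
-----------|-----------|-----------|--------
Q1         | $12.5     | $18.2     | +45.6%
Q2         | $14.3     | $21.5     | +50.3%
"""

table = parse_pipe_table(sample)
print(table["headers"])
print(table["rows"])
```

The resulting headers and rows could be passed to `TableData(rows=..., headers=...)` to reuse `to_markdown()` and `to_dict()`.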

---

## Part 5: OCR for Scanned Documents

For scanned PDFs (images instead of selectable text), we need OCR.

### 🧒 ELI5: What is OCR?

> **OCR is like teaching a computer to read.** When you take a photo of a page, the computer just sees colored dots (pixels). OCR looks at the shapes of those dots and figures out what letters they represent.
>
> It's like how you learned to read - first you learned that certain shapes mean certain letters, then you could read any text!

In [None]:
def pdf_page_to_image(pdf_path: str, page_num: int = 0, dpi: int = 150) -> Image.Image:
    """
    Convert a PDF page to an image.
    
    Args:
        pdf_path: Path to PDF file
        page_num: Page number (0-indexed)
        dpi: Resolution for rendering
        
    Returns:
        PIL Image of the page
    """
    doc = fitz.open(pdf_path)
    page = doc[page_num]
    
    # Render at specified DPI
    mat = fitz.Matrix(dpi / 72, dpi / 72)
    pix = page.get_pixmap(matrix=mat)
    
    # Convert to PIL Image
    img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
    
    doc.close()
    return img

# Convert first page to image
page_image = pdf_page_to_image(sample_pdf_path, 0)

plt.figure(figsize=(10, 12))
plt.imshow(page_image)
plt.axis('off')
plt.title("PDF Page 1 as Image")
plt.tight_layout()
plt.show()

print(f"Image size: {page_image.size}")

In [None]:
# OCR using Tesseract (if available)
try:
    import pytesseract
    
    def ocr_image(image: Image.Image, lang: str = "eng") -> str:
        """
        Perform OCR on an image.
        
        Args:
            image: PIL Image to OCR
            lang: Language code for Tesseract
            
        Returns:
            Extracted text
        """
        return pytesseract.image_to_string(image, lang=lang)
    
    def ocr_with_boxes(image: Image.Image) -> List[Dict]:
        """
        Perform OCR and get word bounding boxes.
        
        Returns:
            List of words with their positions
        """
        data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
        
        words = []
        for i in range(len(data["text"])):
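            # conf may be an int, a float, or a numeric string depending on the
            # pytesseract version, so parse it as a float before comparing
            if float(data["conf"][i]) > 30 and data["text"][i].strip():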
                words.append({
                    "text": data["text"][i],
                    "x": data["left"][i],
                    "y": data["top"][i],
                    "width": data["width"][i],
                    "height": data["height"][i],
                    "confidence": data["conf"][i],
                })
        
        return words
    
    # Test OCR on our page
    print("🔍 Performing OCR on page image...")
    print("=" * 60)
    
    ocr_text = ocr_image(page_image)
    print("\nOCR Result (first 500 chars):")
    print(ocr_text[:500])
    
except ImportError:
    print("⚠️ pytesseract not installed. Run: pip install pytesseract")
    print("   Also ensure tesseract-ocr is installed on the system.")
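
The word boxes returned by `ocr_with_boxes` can be regrouped into text lines, which helps when reconstructing tables or columns. The sketch below assumes word dictionaries with the `text`, `x`, `y`, and `height` keys used above. `group_words_into_lines` and its `tolerance` parameter are illustrative names, and the demo uses hand-written words so it runs without Tesseract.

```python
from typing import Dict, List


def group_words_into_lines(words: List[Dict], tolerance: float = 0.5) -> List[str]:
    """Group OCR words into lines by comparing vertical centers.

    A word joins the current line when its vertical center is within
    tolerance * height of that line's first word.
    """
    lines: List[List[Dict]] = []
    # Process words top-to-bottom so each line is built in one pass
    for word in sorted(words, key=lambda w: w["y"] + w["height"] / 2):
        center = word["y"] + word["height"] / 2
        if lines:
            anchor = lines[-1][0]
            anchor_center = anchor["y"] + anchor["height"] / 2
            if abs(center - anchor_center) <= tolerance * anchor["height"]:
                lines[-1].append(word)
                continue
        lines.append([word])

    # Within each line, order words left-to-right
    return [
        " ".join(w["text"] for w in sorted(line, key=lambda w: w["x"]))
        for line in lines
    ]


demo_words = [
    {"text": "Report", "x": 200, "y": 52, "width": 80, "height": 20},
    {"text": "Annual", "x": 100, "y": 50, "width": 90, "height": 20},
    {"text": "2024", "x": 100, "y": 90, "width": 50, "height": 20},
]

print(group_words_into_lines(demo_words))
# ['Annual Report', '2024']
```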

---

## Part 6: Using VLMs for Document Understanding

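Traditional OCR returns characters and their positions. Vision-Language Models (VLMs) can also interpret layout and answer questions about a page directly, which helps with complex documents.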

### 🧒 ELI5: VLMs for Documents

> **Regular OCR just reads the letters.** A VLM is like a person who can look at the whole page and understand:
> - "This is a title because it's big and at the top"
> - "These numbers are a table because they're in rows and columns"
> - "This part is the footer because it's at the bottom of every page"

In [None]:
# Load LLaVA for document understanding
from transformers import AutoProcessor, LlavaForConditionalGeneration

print("Loading LLaVA for document understanding...")
start_time = time.time()

model_name = "llava-hf/llava-1.5-7b-hf"

processor = AutoProcessor.from_pretrained(model_name)
model = LlavaForConditionalGeneration.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    low_cpu_mem_usage=True,
)

print(f"\n✅ Loaded in {time.time() - start_time:.1f}s")
print(f"GPU Memory: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")

In [None]:
def analyze_document_image(image: Image.Image, question: str, max_new_tokens: int = 512) -> str:
    """
    Analyze a document image using VLM.
    
    Args:
        image: Document page image
        question: Question about the document
        max_new_tokens: Maximum response length
        
    Returns:
        VLM's response
    """
    prompt = f"USER: <image>\n{question}\nASSISTANT:"
    
    inputs = processor(text=prompt, images=image, return_tensors="pt")
    inputs = {k: v.to(model.device) for k, v in inputs.items()}
    
    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,
        )
    
    response = processor.decode(output_ids[0], skip_special_tokens=True)
    
    if "ASSISTANT:" in response:
        response = response.split("ASSISTANT:")[-1].strip()
    
    return response

print("✅ Document analysis function ready!")

In [None]:
# Test VLM document understanding
page2_image = pdf_page_to_image(sample_pdf_path, 1)  # Financial page

# Display the page
plt.figure(figsize=(10, 12))
plt.imshow(page2_image)
plt.axis('off')
plt.title("Document Page for Analysis")
plt.tight_layout()
plt.show()

# Ask questions about the document
questions = [
    "What type of document is this?",
    "Summarize the key financial metrics shown in this document.",
    "What was the total revenue growth percentage?",
]

print("\n📊 VLM Document Analysis")
print("=" * 60)

for q in questions:
    print(f"\n❓ {q}")
    response = analyze_document_image(page2_image, q)
    print(f"💬 {response}")

---

## Part 7: Building a Complete Document Q&A Pipeline

Now let's combine everything into a single end-to-end pipeline!

In [None]:
@dataclass
class ProcessedDocument:
    """A fully processed document."""
    source_path: str
    num_pages: int
    text_content: Dict[int, str]  # page -> text
    tables: List[TableData]
    metadata: Dict[str, Any]
    
    @property
    def full_text(self) -> str:
        """Get all text concatenated."""
        return "\n\n".join(
            f"--- Page {page} ---\n{text}"
            for page, text in sorted(self.text_content.items())
        )


class DocumentProcessor:
    """
    Complete document processing pipeline.
    """
    
    def __init__(self, vlm_model=None, vlm_processor=None):
        """Initialize with optional VLM for advanced understanding."""
        self.vlm_model = vlm_model
        self.vlm_processor = vlm_processor
    
    def process(self, pdf_path: str, use_vlm: bool = True) -> ProcessedDocument:
        """
        Process a PDF document.
        
        Args:
            pdf_path: Path to PDF file
            use_vlm: Whether to use VLM for enhanced understanding
            
        Returns:
            ProcessedDocument with all extracted content
        """
        print(f"📄 Processing: {pdf_path}")
        
        doc = fitz.open(pdf_path)
        
        # Extract metadata
        metadata = {
            "title": doc.metadata.get("title", ""),
            "author": doc.metadata.get("author", ""),
            "pages": doc.page_count,
        }
        
        # Extract text from each page
        text_content = {}
        for page_num, page in enumerate(doc, 1):
            text = page.get_text()
            text_content[page_num] = text
            print(f"  Page {page_num}: {len(text)} characters extracted")
        
        doc.close()
        
        # Extract tables
        tables = extract_tables(pdf_path)
        print(f"  Found {len(tables)} table(s)")
        
        return ProcessedDocument(
            source_path=pdf_path,
            num_pages=metadata["pages"],
            text_content=text_content,
            tables=tables,
            metadata=metadata,
        )
    
    def ask_question(
        self,
        document: ProcessedDocument,
        question: str,
        use_vlm_for_page: Optional[int] = None,
    ) -> str:
        """
        Answer a question about the document.
        
        Args:
            document: Processed document
            question: Question to answer
            use_vlm_for_page: If specified, use VLM on this page
            
        Returns:
            Answer based on document content
        """
        if use_vlm_for_page and self.vlm_model:
            # Use VLM for visual understanding
            page_image = pdf_page_to_image(document.source_path, use_vlm_for_page - 1)
            return analyze_document_image(page_image, question)
        
        # Build context from document
        context = document.full_text
        
        # Add tables if present
        if document.tables:
            context += "\n\n--- TABLES ---\n"
            for i, table in enumerate(document.tables):
                context += f"\nTable {i+1}:\n{table.to_markdown()}\n"
        
        # Truncate if too long
        max_context = 3000
        if len(context) > max_context:
            context = context[:max_context] + "...\n[Truncated]"
        
        # `context` above is what a text LLM would receive in production;
        # this demo does not use it and instead asks the VLM about page 1
        if self.vlm_model:
            page_image = pdf_page_to_image(document.source_path, 0)
            full_question = f"Based on this document, {question}"
            return analyze_document_image(page_image, full_question)
        
        return f"Document contains {document.num_pages} pages. Use VLM for detailed Q&A."

print("✅ DocumentProcessor class ready!")
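
When no page is specified, `ask_question` always falls back to page 1. One lightweight alternative is to pick the page whose extracted text shares the most words with the question. The sketch below assumes a `text_content` mapping like the one in `ProcessedDocument`. `select_page`, `tokenize`, and the small stop-word set are illustrative, and word overlap is only a crude stand-in for proper retrieval.

```python
import re
from typing import Dict

# Tiny illustrative stop-word list; a real system would use a fuller one
STOP_WORDS = {"the", "a", "an", "is", "was", "were", "what", "of", "in", "for", "and"}


def tokenize(text: str) -> set:
    """Lowercase a string and return its set of non-stop-word tokens."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOP_WORDS}


def select_page(text_content: Dict[int, str], question: str) -> int:
    """Return the page number whose text overlaps most with the question.

    Ties are resolved in favor of the lowest page number. The mapping
    must contain at least one page.
    """
    question_tokens = tokenize(question)
    best_page, best_score = min(text_content), -1
    for page_num in sorted(text_content):
        score = len(question_tokens & tokenize(text_content[page_num]))
        if score > best_score:
            best_page, best_score = page_num, score
    return best_page


demo_pages = {
    1: "Annual Performance Report 2024. Executive Summary.",
    2: "Quarterly Revenue in millions USD. Operating Margin: 28.5%",
    3: "New Product Launches: SPARK Vision Pro released March 2024",
}

print(select_page(demo_pages, "What was the operating margin?"))        # 2
print(select_page(demo_pages, "Which product was released in March?"))  # 3
```

The returned page number could then be passed to `ask_question` as `use_vlm_for_page`.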

In [None]:
# Create and use the document processor.
# Use a new name so the Hugging Face `processor` that analyze_document_image
# relies on is not overwritten.
doc_processor = DocumentProcessor(vlm_model=model, vlm_processor=processor)

# Process our sample document
print("\n🔄 Processing document...")
print("=" * 60)

doc = doc_processor.process(sample_pdf_path)

print(f"\n📊 Document Summary:")
print(f"  Pages: {doc.num_pages}")
print(f"  Tables: {len(doc.tables)}")
print(f"  Total characters: {sum(len(t) for t in doc.text_content.values())}")

In [None]:
# Ask questions about the document
questions = [
    "What is the title of this report?",
    "What were the key highlights mentioned?",
    "What is the revenue growth percentage?",
]

print("\n📝 Document Q&A")
print("=" * 60)

for q in questions:
    print(f"\n❓ {q}")
    # Use VLM on specific pages for visual content
    answer = doc_processor.ask_question(doc, q, use_vlm_for_page=1)
    print(f"💬 {answer}")

---

## ⚠️ Common Mistakes

### Mistake 1: Not Handling Scanned PDFs
```python
# ❌ Wrong: Assumes all PDFs have selectable text
text = page.get_text()
if not text:  # Empty! Document is scanned
    pass  # Now what?

# ✅ Right: Fallback to OCR for scanned pages
text = page.get_text()
if not text.strip():
    # Page is scanned - use OCR (page_num is 0-indexed here)
    image = pdf_page_to_image(pdf_path, page_num)
    text = ocr_image(image)
```
**Why:** Many documents (especially older ones) are scanned images.
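
The fallback can be wrapped in a helper that decides page by page. This sketch takes the OCR step as a callable so it can be tried without Tesseract. `text_or_ocr`, `min_chars`, and the fake OCR function are hypothetical, and the character threshold is an assumption to tune for your documents.

```python
from typing import Callable, Dict


def text_or_ocr(
    page_texts: Dict[int, str],
    ocr_page: Callable[[int], str],
    min_chars: int = 20,
) -> Dict[int, str]:
    """Keep extracted text when present; otherwise call ocr_page(page_num).

    A page counts as scanned when its stripped text is shorter than
    min_chars, which also catches pages holding only stray characters.
    """
    result = {}
    for page_num, text in page_texts.items():
        if len(text.strip()) >= min_chars:
            result[page_num] = text
        else:
            result[page_num] = ocr_page(page_num)
    return result


# Stand-in for: lambda n: ocr_image(pdf_page_to_image(pdf_path, n - 1))
def fake_ocr(page_num: int) -> str:
    return f"[OCR text for page {page_num}]"


extracted = {1: "Annual Performance Report 2024", 2: "  \n"}
print(text_or_ocr(extracted, fake_ocr))
# {1: 'Annual Performance Report 2024', 2: '[OCR text for page 2]'}
```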

---

### Mistake 2: Ignoring Document Structure
```python
# ❌ Wrong: Just dump all text together
all_text = " ".join(page.get_text() for page in doc)

# ✅ Right: Preserve structure (extract_title, extract_sections, and
# extract_figures are placeholders for helpers you would write)
content = {
    "title": extract_title(doc),
    "sections": extract_sections(doc),
    "tables": extract_tables(pdf_path),
    "figures": extract_figures(doc),
}
```
**Why:** Document structure carries important information (headings, tables, etc.).

---

### Mistake 3: Low Resolution for OCR
```python
# ❌ Wrong: Low DPI loses detail
image = pdf_page_to_image(pdf_path, page_num, dpi=72)
text = ocr_image(image)  # Poor results!

# ✅ Right: Use sufficient resolution
image = pdf_page_to_image(pdf_path, page_num, dpi=150)  # Or 300 for fine print
text = ocr_image(image)
```
**Why:** OCR accuracy depends on image resolution. 150-300 DPI is recommended.
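
To see what a DPI setting means in pixels: PDF coordinates are in points (72 per inch), so `pdf_page_to_image` scales the page by `dpi / 72`. Here is a quick calculation for the Letter-size pages used in this lab (612 x 792 points). The helper name `rendered_size` is illustrative.

```python
def rendered_size(width_pt: float, height_pt: float, dpi: int) -> tuple:
    """Return the approximate pixel size of a page rendered at the given DPI."""
    scale = dpi / 72  # PDF points are 1/72 inch
    return (round(width_pt * scale), round(height_pt * scale))


for dpi in (72, 150, 300):
    print(dpi, rendered_size(612, 792, dpi))
# 72 (612, 792)
# 150 (1275, 1650)
# 300 (2550, 3300)
```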

---

## 🎉 Checkpoint

You've learned:
- ✅ Extracting text from PDFs using PyMuPDF
- ✅ Getting structured content blocks with positions
- ✅ Extracting tables from documents
- ✅ Using OCR for scanned documents
- ✅ Leveraging VLMs for visual document understanding
- ✅ Building a complete document Q&A pipeline

---

## 🚀 Challenge (Optional)

Build an **Invoice Processing System** that:
1. Takes invoice images/PDFs as input
2. Extracts key fields: invoice number, date, vendor, line items, total
3. Validates extracted data (e.g., line items sum to total)
4. Outputs structured JSON
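
As a hint for step 3, here is one possible shape for the validation check. The field names (`line_items`, `amount`, `total`) and the one-cent tolerance are assumptions for illustration, not a required schema. `Decimal` is used to avoid floating-point rounding surprises with currency.

```python
from decimal import Decimal
from typing import Any, Dict


def validate_invoice_total(invoice: Dict[str, Any], tolerance: str = "0.01") -> bool:
    """Check that line-item amounts add up to the stated total."""
    items_sum = sum(
        (Decimal(str(item["amount"])) for item in invoice["line_items"]),
        Decimal("0"),
    )
    difference = abs(items_sum - Decimal(str(invoice["total"])))
    return difference <= Decimal(tolerance)


demo_invoice = {
    "invoice_number": "INV-001",  # Made-up example data
    "line_items": [
        {"description": "Widget", "amount": "19.99"},
        {"description": "Shipping", "amount": "5.01"},
    ],
    "total": "25.00",
}

print(validate_invoice_total(demo_invoice))  # True
```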

In [None]:
# Challenge: Your code here!

def process_invoice(invoice_path: str) -> Dict[str, Any]:
    """
    Process an invoice and extract structured data.
    
    Args:
        invoice_path: Path to invoice PDF or image
        
    Returns:
        Dictionary with extracted invoice data
    """
    # Your implementation here!
    pass

---

## 📖 Further Reading

- [PyMuPDF Documentation](https://pymupdf.readthedocs.io/)
- [Tesseract OCR](https://github.com/tesseract-ocr/tesseract)
- [Donut: OCR-free Document Understanding Transformer](https://arxiv.org/abs/2111.15664)
- [LayoutLM Paper](https://arxiv.org/abs/1912.13318)

---

## 🧹 Cleanup

In [None]:
# Clean up: drop every name that may still reference the model
for name in ("doc_processor", "processor", "model"):
    globals().pop(name, None)

# Remove sample PDF
import os
if os.path.exists(sample_pdf_path):
    os.remove(sample_pdf_path)

# Collect garbage first so freed tensors can be released from the CUDA cache
gc.collect()
torch.cuda.empty_cache()

print("✅ Cleanup complete!")
print(f"GPU Memory: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")

---

## Next Steps

In the next lab, we'll explore **Audio Transcription** using Whisper for speech-to-text conversion!

➡️ Continue to [Lab 05: Audio Transcription](./05-audio-transcription.ipynb)