# Document Parsing with LlamaParse for RAG Systems

## Core Features & Parsing Fundamentals

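This notebook demonstrates **LlamaParse**, a GenAI-native document parser that converts complex documents into LLM-ready data for Retrieval-Augmented Generation (RAG) systems. The install cell pins v0.6.77, but the recorded outputs come from an environment that reports 0.6.54 (see section 1.3), so results on the pinned version may differ slightly.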

### What is LlamaParse?

LlamaParse is a document parsing service by LlamaIndex that excels at:
- **Complex Document Handling**: Financial reports, research papers, scanned PDFs
- **Precise Extraction**: Tables, charts, images, and diagrams
- **70+ File Formats**: PDF, DOCX, XLSX, PPTX, HTML, images, audio, and more
- **LLM-Ready Output**: Clean markdown, text, or structured JSON

### Pricing
- **Free Tier**: 1,000 pages daily
- **Paid Tier**: 7,000 pages/week + $0.003/additional page

### Part 1 Contents
1. Introduction & Setup
2. Core Concepts
3. Basic Document Parsing
4. Presets & Built-in Configurations
5. Parse Modes (Fast, Premium, Auto)
6. Supported File Formats
7. Multimodal Parsing Features
8. Layout Extraction
9. Structured Output

**Part 2** covers: Custom Prompts, Advanced Configuration, Async Operations, LlamaIndex Integration, RAG Examples, CLI Usage, and Best Practices.

---
## 1. Introduction & Setup

### 1.1 Installation

Install the required packages for LlamaParse and RAG functionality.

In [None]:
# Install LlamaParse and related packages
# llama-parse: The core document parsing library (v0.6.77)
# llama-index: Framework for building RAG applications
# llama-cloud-services: Cloud service integrations

# Uncomment the lines below to install
# !pip install llama-parse==0.6.77 llama-index llama-cloud-services -q

# Additional dependencies for RAG and embeddings
# !pip install chromadb python-dotenv openai -q

# Optional: Install llama-index components for vector stores
# !pip install llama-index-vector-stores-chroma llama-index-embeddings-openai -q

# print("Installation complete!")

### 1.2 API Key Setup

LlamaParse requires an API key from [LlamaCloud](https://cloud.llamaindex.ai/api-key).

**Steps to get your API key:**
1. Go to https://cloud.llamaindex.ai/
2. Sign up or log in
3. Navigate to API Keys section
4. Create a new API key (starts with `llx-`)

In [1]:
import os
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

# Set up API keys
# Option 1: Set directly (not recommended for production)
# os.environ["LLAMA_CLOUD_API_KEY"] = "llx-your-api-key-here"

# Option 2: Load from environment variable (recommended)
# Make sure to add LLAMA_CLOUD_API_KEY to your .env file

# OpenAI API key for embeddings (optional, for RAG examples)
# os.environ["OPENAI_API_KEY"] = "your-openai-api-key"

# Verify API key is set
llama_api_key = os.getenv("LLAMA_CLOUD_API_KEY")
if llama_api_key:
    print(f"LlamaCloud API key loaded: {llama_api_key[:10]}...")
else:
    print("Warning: LLAMA_CLOUD_API_KEY not found in environment variables")
    print("Please set it before running parsing examples")

LlamaCloud API key loaded: llx-nYSY32...
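
The verification cell above echoes the first ten characters of the key, which is more than is needed to confirm it loaded. A minimal sketch of a stricter check follows; `mask_secret` and `require_env` are hypothetical helper names introduced here for illustration and are not part of LlamaParse.

```python
import os


def mask_secret(value: str, visible_prefix: int = 4) -> str:
    """Return a display-safe form of a secret, keeping only a short prefix."""
    if not value:
        return "<not set>"
    return f"{value[:visible_prefix]}{'*' * 8}"


def require_env(name: str) -> str:
    """Fetch a required environment variable or raise a clear error."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your .env file or shell environment")
    return value


# Demo with the placeholder from the cell above, not a real key
print(mask_secret("llx-your-api-key-here"))
```

Calling `require_env("LLAMA_CLOUD_API_KEY")` at the top of a script fails early with a readable message instead of surfacing later as an authentication error.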


### 1.3 Import Required Libraries

In [2]:
# Core LlamaParse import (llama_cloud_services exposes the same class)
# from llama_parse import LlamaParse
from llama_cloud_services import LlamaParse

# Standard library imports
import asyncio
import json
import os
from importlib.metadata import version
from pathlib import Path
from typing import Any, Dict, List

# Allow nested event loops so async calls work inside Jupyter
import nest_asyncio
nest_asyncio.apply()

print("Imports successful!")
# Report the installed llama-parse version (the install cell pins 0.6.77)
print(version("llama-parse"))

Imports successful!
0.6.54


### 1.4 Setup Sample Documents Directory

In [3]:
# Define paths for sample documents
SAMPLE_DIR = Path("./sample_documents")
OUTPUT_DIR = Path("./llamaparse_output")

# Create output directory if it doesn't exist
OUTPUT_DIR.mkdir(exist_ok=True)

# List available sample documents
if SAMPLE_DIR.exists():
    print("Available sample documents:")
    for file in SAMPLE_DIR.iterdir():
        if file.is_file():
            print(f"  - {file.name} ({file.stat().st_size / 1024:.1f} KB)")
else:
    print(f"Sample directory not found: {SAMPLE_DIR}")
    print("Creating sample directory...")
    SAMPLE_DIR.mkdir(exist_ok=True)

Available sample documents:
  - attention_paper.pdf (2163.3 KB)


### 1.5 Download a Sample PDF for Testing

In [4]:
import urllib.request

# Download a sample PDF (attention paper from arXiv)
sample_pdf_url = "https://arxiv.org/pdf/1706.03762.pdf"  # "Attention Is All You Need" paper
sample_pdf_path = SAMPLE_DIR / "attention_paper.pdf"

if not sample_pdf_path.exists():
    print("Downloading sample PDF from arXiv...")
    try:
        urllib.request.urlretrieve(sample_pdf_url, sample_pdf_path)
        print(f"Downloaded: {sample_pdf_path}")
        print(f"File size: {sample_pdf_path.stat().st_size / 1024:.1f} KB")
    except Exception as e:
        print(f"Download failed: {e}")
        print("You can manually download a PDF for testing")
else:
    print(f"Sample PDF already exists: {sample_pdf_path}")

Sample PDF already exists: sample_documents/attention_paper.pdf
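
A failed or redirected download can leave an HTML error page saved under a `.pdf` name, and the existence check above would not notice. One way to guard against that, sketched here with a hypothetical helper `looks_like_pdf`, is to test for the `%PDF-` header that PDF files start with.

```python
from pathlib import Path


def looks_like_pdf(path: Path) -> bool:
    """Return True if the file starts with the PDF magic bytes '%PDF-'."""
    try:
        with path.open("rb") as fh:
            return fh.read(5) == b"%PDF-"
    except OSError:
        # Missing or unreadable files are treated as "not a PDF"
        return False
```

In this notebook the check would be `looks_like_pdf(sample_pdf_path)`; a `False` result is a reason to delete the file and download it again before spending parsing credits on it.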


---
## 2. Core Concepts

### 2.1 Understanding the LlamaParse Architecture

LlamaParse uses a **job-based architecture**:

```
Document → Submit Job → Poll Status → Get Results
```

**Key Components:**
- **LlamaParse Client**: Main interface for document parsing
- **Job**: Represents a parsing task (can be async)
- **JobResult**: Contains parsed content, pages, images, charts, layout
- **Document**: LlamaIndex Document object with text and metadata
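
The submit, poll, and fetch flow can be sketched without touching the real service. Everything below is an assumption made for illustration: `FakeJobService`, its status strings, and `parse_with_polling` are hypothetical stand-ins, not the LlamaParse API, which handles polling internally when you call `load_data()` or `parse()`.

```python
import time
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class FakeJobService:
    """Hypothetical in-memory stand-in for a parsing backend."""
    polls_until_done: int = 3
    _polls: Dict[str, int] = field(default_factory=dict)

    def submit(self, document: str) -> str:
        job_id = f"job-{len(self._polls) + 1}"
        self._polls[job_id] = 0
        return job_id

    def status(self, job_id: str) -> str:
        # Each status check counts as one poll; the job "finishes" after N polls
        self._polls[job_id] += 1
        return "SUCCESS" if self._polls[job_id] >= self.polls_until_done else "PENDING"

    def result(self, job_id: str) -> str:
        return f"parsed content for {job_id}"


def parse_with_polling(service, document: str, interval: float = 0.01, timeout: float = 1.0) -> str:
    """Submit a job, poll until it succeeds, and return the result or time out."""
    job_id = service.submit(document)
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if service.status(job_id) == "SUCCESS":
            return service.result(job_id)
        time.sleep(interval)
    raise TimeoutError(f"{job_id} did not finish within {timeout}s")
```

The timeout plays the same role as the `job_timeout_in_seconds` parameter listed in the configuration categories below: it bounds how long a caller waits for a job that never completes.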

In [5]:
# Understanding the LlamaParse client structure

# The LlamaParse class is the main entry point
# It can be instantiated with various configuration options

# Basic instantiation
parser_basic = LlamaParse(
    result_type="markdown"  # Output format: "markdown" or "text"
)

# View available configuration parameters
print("LlamaParse Configuration Parameters:")
print("="*50)

# Key parameters categorized
config_categories = {
    "Authentication": ["api_key", "base_url"],
    "Output Control": ["result_type", "language", "split_by_page"],
    "Mode Selection": ["fast_mode", "premium_mode", "auto_mode", "preset"],
    "OCR & Images": ["disable_ocr", "high_res_ocr", "extract_charts", "take_screenshot"],
    "Layout": ["extract_layout", "do_not_unroll_columns"],
    "Structured Output": ["structured_output", "structured_output_json_schema"],
    "Prompts": ["parsing_instruction", "system_prompt", "formatting_instruction"],
    "Performance": ["num_workers", "job_timeout_in_seconds", "verbose"],
}

for category, params in config_categories.items():
    print(f"\n{category}:")
    for param in params:
        print(f"  - {param}")

LlamaParse Configuration Parameters:

Authentication:
  - api_key
  - base_url

Output Control:
  - result_type
  - language
  - split_by_page

Mode Selection:
  - fast_mode
  - premium_mode
  - auto_mode
  - preset

OCR & Images:
  - disable_ocr
  - high_res_ocr
  - extract_charts
  - take_screenshot

Layout:
  - extract_layout
  - do_not_unroll_columns

Structured Output:
  - structured_output
  - structured_output_json_schema

Prompts:
  - parsing_instruction
  - system_prompt
  - formatting_instruction

Performance:
  - num_workers
  - job_timeout_in_seconds
  - verbose


### 2.2 Result Types: Text vs Markdown

LlamaParse supports two primary output formats:

| Format | Best For | Features |
|--------|----------|----------|
| `markdown` | RAG, LLM input | Preserves structure, tables, formatting |
| `text` | Simple extraction | Plain text, no formatting |

In [6]:
# Comparing result types

# Markdown parser - preserves structure and formatting
parser_markdown = LlamaParse(
    result_type="markdown",  # Returns formatted markdown
    verbose=True  # Show progress
)

# Text parser - plain text extraction
parser_text = LlamaParse(
    result_type="text",  # Returns plain text
    verbose=True
)

print("Markdown Parser Configuration:")
print(f"  Result Type: {parser_markdown.result_type}")

print("\nText Parser Configuration:")
print(f"  Result Type: {parser_text.result_type}")

Markdown Parser Configuration:
  Result Type: ResultType.MD

Text Parser Configuration:
  Result Type: ResultType.TXT


### 2.3 JobResult Structure

When you call `parse()`, LlamaParse returns a `JobResult` object with the following attributes (`load_data()` returns LlamaIndex `Document` objects instead):

```
JobResult:
├── pages: List[Page]         # Individual page data
│   ├── text: str             # Plain text content
│   ├── md: str               # Markdown content
│   ├── images: List          # Extracted images
│   ├── charts: List          # Extracted charts
│   ├── layout: Dict          # Layout information
│   └── structuredData: Dict  # Structured data (if enabled)
├── text: str                 # Full document text
├── md: str                   # Full document markdown
└── images: List              # All extracted images
```
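
To make the tree concrete, here is a small sketch that mirrors part of it with plain dataclasses. `PageSketch` and `JobResultSketch` are hypothetical stand-ins with only a few of the fields; the real classes come from the SDK and may differ. The two helpers show the access patterns used later in this notebook: prefer markdown and fall back to text, and find pages that carry images.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PageSketch:
    """Hypothetical stand-in for one page of a parse result."""
    text: Optional[str] = None
    md: Optional[str] = None
    images: List[dict] = field(default_factory=list)


@dataclass
class JobResultSketch:
    """Hypothetical stand-in for the top-level result object."""
    pages: List[PageSketch] = field(default_factory=list)


def page_content(page) -> str:
    """Prefer markdown, fall back to plain text, then to an empty string."""
    return page.md or page.text or ""


def pages_with_images(result) -> List[int]:
    """Return 1-based page numbers of pages that have at least one image."""
    return [number for number, page in enumerate(result.pages, start=1) if page.images]


demo = JobResultSketch(pages=[
    PageSketch(text="plain", md="# heading"),
    PageSketch(text="only text"),
    PageSketch(images=[{"name": "figure.png"}]),
])
print([page_content(p) for p in demo.pages])
print(pages_with_images(demo))
```

Because the helpers only read `md`, `text`, `images`, and `pages`, they should also work on real result objects as long as those attributes exist as shown in the tree.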

In [7]:
# Understanding JobResult structure (conceptual)

# The JobResult provides multiple ways to access parsed content:

job_result_methods = {
    "get_markdown_documents(split_by_page)": "Get LlamaIndex Documents with markdown content",
    "get_text_documents(split_by_page)": "Get LlamaIndex Documents with text content",
    "get_image_documents()": "Get extracted images as documents",
    "aget_image_documents()": "Async version with download capabilities",
    "get_json()": "Get structured JSON output",
    ".pages": "Direct access to page-by-page data",
    ".text": "Full document as plain text",
    ".md": "Full document as markdown",
}

print("JobResult Access Methods:")
print("="*60)
for method, description in job_result_methods.items():
    print(f"\n{method}")
    print(f"  → {description}")

JobResult Access Methods:

get_markdown_documents(split_by_page)
  → Get LlamaIndex Documents with markdown content

get_text_documents(split_by_page)
  → Get LlamaIndex Documents with text content

get_image_documents()
  → Get extracted images as documents

aget_image_documents()
  → Async version with download capabilities

get_json()
  → Get structured JSON output

.pages
  → Direct access to page-by-page data

.text
  → Full document as plain text

.md
  → Full document as markdown


### 2.4 Sync vs Async Methods

LlamaParse provides both synchronous and asynchronous methods:

| Sync Method | Async Method | Description |
|------------|--------------|-------------|
| `load_data()` | `aload_data()` | Parse and return Documents |
| `parse()` | `aparse()` | Parse and return JobResult |
| `get_images()` | `aget_images()` | Get extracted images |

In [8]:
# Sync vs Async usage patterns

# Synchronous usage (blocking)
def sync_parse_example(file_path: str):
    """
    Synchronous parsing - blocks until complete.
    Best for: Simple scripts, single document parsing
    """
    parser = LlamaParse(result_type="markdown")
    documents = parser.load_data(file_path)
    return documents

# Asynchronous usage (non-blocking)
async def async_parse_example(file_path: str):
    """
    Asynchronous parsing - allows concurrent operations.
    Best for: Multiple documents, web applications, batch processing
    """
    parser = LlamaParse(result_type="markdown")
    documents = await parser.aload_data(file_path)
    return documents

print("Sync and Async patterns defined.")
print("\nSync: Use load_data(), parse() for simple scripts")
print("Async: Use aload_data(), aparse() for concurrent processing")

Sync and Async patterns defined.

Sync: Use load_data(), parse() for simple scripts
Async: Use aload_data(), aparse() for concurrent processing
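
The async variant pays off when several files are parsed at once. The sketch below shows one common pattern, `asyncio.gather` with a semaphore to cap concurrency. `fake_aload_data` is a hypothetical stand-in that sleeps instead of calling the API; with a real parser you would pass `parser.aload_data` as the `load` argument. The cap of two concurrent jobs is an arbitrary illustrative value.

```python
import asyncio
from typing import Awaitable, Callable, List


async def fake_aload_data(path: str) -> List[str]:
    """Hypothetical stand-in for parser.aload_data; no network calls."""
    await asyncio.sleep(0.01)
    return [f"parsed:{path}"]


async def parse_many(
    paths: List[str],
    load: Callable[[str], Awaitable[List[str]]],
    max_concurrent: int = 2,
) -> List[List[str]]:
    """Parse all paths concurrently, at most max_concurrent at a time."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(path: str) -> List[str]:
        async with semaphore:
            return await load(path)

    # gather returns results in the same order as the input paths
    return await asyncio.gather(*(bounded(p) for p in paths))


results = asyncio.run(parse_many(["a.pdf", "b.pdf", "c.pdf"], fake_aload_data))
print(results)
```

Inside a notebook cell you can also write `await parse_many(...)` directly instead of `asyncio.run(...)`.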


---
## 3. Basic Document Parsing

### 3.1 Simple PDF Parsing with load_data()

In [9]:
# Basic PDF parsing example
# This is the simplest way to parse a document

# Initialize parser with markdown output
parser = LlamaParse(
    result_type="markdown",  # Output as markdown (best for RAG)
    verbose=True,            # Show parsing progress
)

# Parse the sample PDF
pdf_path = str(SAMPLE_DIR / "attention_paper.pdf")

if Path(pdf_path).exists():
    print(f"Parsing: {pdf_path}")
    print("-" * 50)
    
    # load_data() returns a list of LlamaIndex Document objects
    documents = parser.load_data(pdf_path)
    
    print(f"\nParsing complete!")
    print(f"Number of documents returned: {len(documents)}")
    
    # Preview the first document
    if documents:
        doc = documents[0]
        print(f"\nDocument type: {type(doc).__name__}")
        print(f"Content length: {len(doc.text)} characters")
        print(f"\nFirst 1000 characters:")
        print("=" * 50)
        print(doc.text[:1000])
else:
    print(f"File not found: {pdf_path}")
    print("Please run the download cell first or provide your own PDF.")

Parsing: sample_documents/attention_paper.pdf
--------------------------------------------------
Started parsing the file under job_id 76159519-4b81-4479-8e84-2c618b7bbca5

Parsing complete!
Number of documents returned: 15

Document type: Document
Content length: 2767 characters

First 1000 characters:
arXiv:1706.03762v7 [cs.CL] 2 Aug 2023

# Attention Is All You Need

Ashish Vaswani∗  Noam Shazeer∗        Niki Parmar∗  Jakob Uszkoreit∗

Google Brain         Google Brain     Google Research    Google Research

avaswani@google.com    noam@google.com    nikip@google.com    usz@google.com

Llion Jones∗     Aidan N. Gomez∗ †         Łukasz Kaiser∗

Google Research    University of Toronto          Google Brain

llion@google.com    aidan@cs.toronto.edu    lukaszkaiser@google.com

Illia Polosukhin∗ ‡

illia.polosukhin@gmail.com

# Abstract

The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The bes

In [10]:
documents

[Document(id_='888d6efd-015e-4923-8c86-731b7506976b', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, metadata_template='{key}: {value}', metadata_separator='\n', text_resource=MediaResource(embeddings=None, data=None, text='arXiv:1706.03762v7 [cs.CL] 2 Aug 2023\n\n# Attention Is All You Need\n\nAshish Vaswani∗  Noam Shazeer∗        Niki Parmar∗  Jakob Uszkoreit∗\n\nGoogle Brain         Google Brain     Google Research    Google Research\n\navaswani@google.com    noam@google.com    nikip@google.com    usz@google.com\n\nLlion Jones∗     Aidan N. Gomez∗ †         Łukasz Kaiser∗\n\nGoogle Research    University of Toronto          Google Brain\n\nllion@google.com    aidan@cs.toronto.edu    lukaszkaiser@google.com\n\nIllia Polosukhin∗ ‡\n\nillia.polosukhin@gmail.com\n\n# Abstract\n\nThe dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. Th

### 3.2 Getting Text Output

In [11]:
# Parsing with plain text output

parser_text = LlamaParse(
    result_type="text",  # Plain text output (no formatting)
    verbose=True,
)

pdf_path = str(SAMPLE_DIR / "attention_paper.pdf")

if Path(pdf_path).exists():
    print("Parsing with TEXT output...")
    documents_text = parser_text.load_data(pdf_path)
    
    if documents_text:
        print(f"\nText output preview (first 800 chars):")
        print("=" * 50)
        print(documents_text[0].text[:800])
else:
    print(f"File not found: {pdf_path}")

Parsing with TEXT output...
Started parsing the file under job_id 786735a2-9698-4755-a547-bcba831b18ef

Text output preview (first 800 chars):
    arXiv:1706.03762v7 [cs.CL] 2 Aug 2023

  Provided proper attribution is provided, Google hereby grants permission to
reproduce the tables and figures in this paper solely for use in journalistic or
    scholarly works.

    Attention Is All You Need

    Ashish Vaswani∗  Noam Shazeer∗        Niki Parmar∗  Jakob Uszkoreit∗
    Google Brain         Google Brain     Google Research    Google Research
avaswani@google.com    noam@google.com    nikip@google.com    usz@google.com

    Llion Jones∗     Aidan N. Gomez∗ †         Łukasz Kaiser∗
Google Research    University of Toronto          Google Brain
llion@google.com    aidan@cs.toronto.edu    lukaszkaiser@google.com

                     Illia Polosukhin∗ ‡
                     illia.polosukhin@gmail.com

                                 


### 3.3 Using parse() for More Control

The `parse()` method returns a `JobResult` object which provides more granular access to parsed content.

In [12]:
# Using parse() for more detailed results

parser = LlamaParse(
    result_type="markdown",
    verbose=True,
)

pdf_path = str(SAMPLE_DIR / "attention_paper.pdf")

if Path(pdf_path).exists():
    print("Using parse() method for detailed results...")
    
    # Passing a list of paths returns a list of JobResult objects (one per file)
    job_results = parser.parse([pdf_path])
    
    if job_results:
        result = job_results[0]
        print(f"\nJobResult type: {type(result).__name__}")
        print(f"Number of pages: {len(result.pages)}")
        
        # Access page-by-page content
        print(f"\nPage 1 content (first 500 chars):")
        print("=" * 50)
        if result.pages:
            page1 = result.pages[0]
            print(page1.md[:500] if page1.md else page1.text[:500])
else:
    print(f"File not found: {pdf_path}")

Using parse() method for detailed results...


Getting job results:   0%|          | 0/1 [00:00<?, ?it/s]

Started parsing the file under job_id 332fb20f-2aff-4bcc-b435-542a4251783b


Getting job results: 100%|██████████| 1/1 [00:09<00:00,  9.26s/it]


JobResult type: JobResult
Number of pages: 15

Page 1 content (first 500 chars):
arXiv:1706.03762v7 [cs.CL] 2 Aug 2023

# Attention Is All You Need

Ashish Vaswani∗  Noam Shazeer∗        Niki Parmar∗  Jakob Uszkoreit∗

Google Brain         Google Brain     Google Research    Google Research

avaswani@google.com    noam@google.com    nikip@google.com    usz@google.com

Llion Jones∗     Aidan N. Gomez∗ †         Łukasz Kaiser∗

Google Research    University of Toronto          Google Brain

llion@google.com    aidan@cs.toronto.edu    lukaszkaiser@google.com

Illia Polosukhin∗ 





### 3.4 Accessing Individual Pages

In [13]:
# Accessing page-by-page content from JobResult

parser = LlamaParse(
    result_type="markdown",
    verbose=False,  # Suppress progress output
)

pdf_path = str(SAMPLE_DIR / "attention_paper.pdf")

if Path(pdf_path).exists():
    job_results = parser.parse([pdf_path])
    
    if job_results:
        result = job_results[0]
        
        print(f"Document has {len(result.pages)} pages")
        print("=" * 50)
        
        # Iterate through pages
        for i, page in enumerate(result.pages[:3]):  # First 3 pages
            print(f"\n--- Page {i + 1} ---")
            print(f"Text length: {len(page.text) if page.text else 0} chars")
            print(f"Markdown length: {len(page.md) if page.md else 0} chars")
            print(f"Images: {len(page.images) if page.images else 0}")
            print(f"Charts: {len(page.charts) if hasattr(page, 'charts') and page.charts else 0}")
            
            # Preview content
            content = page.md if page.md else page.text
            if content:
                preview = content[:200].replace('\n', ' ')
                print(f"Preview: {preview}...")
else:
    print(f"File not found: {pdf_path}")

Getting job results: 100%|██████████| 1/1 [00:08<00:00,  8.91s/it]

Document has 15 pages

--- Page 1 ---
Text length: 3364 chars
Markdown length: 2766 chars
Images: 0
Charts: 0
Preview: arXiv:1706.03762v7 [cs.CL] 2 Aug 2023  # Attention Is All You Need  Ashish Vaswani∗  Noam Shazeer∗        Niki Parmar∗  Jakob Uszkoreit∗  Google Brain         Google Brain     Google Research    Googl...

--- Page 2 ---
Text length: 4310 chars
Markdown length: 4270 chars
Images: 0
Charts: 0
Preview:   # 1 Introduction  Recurrent neural networks, long short-term memory [13] and gated recurrent [7] neural networks in particular, have been firmly established as state of the art approaches in sequenc...

--- Page 3 ---
Text length: 1956 chars
Markdown length: 1854 chars
Images: 1
Charts: 0
Preview:  # 2020  Figure 1: The Transformer - model architecture.  The Transformer follows this overall architecture using stacked self-attention and point-wise, fully connected layers for both the encoder and...





### 3.5 Split by Page Option

In [14]:
# Using split_by_page to get separate documents per page
# (section 3.1 returned 15 documents without setting this flag, so it appears to be the default)

parser = LlamaParse(
    result_type="markdown",
    split_by_page=True,  # Each page becomes a separate document
    verbose=False,
)

pdf_path = str(SAMPLE_DIR / "attention_paper.pdf")

if Path(pdf_path).exists():
    # With split_by_page=True, load_data returns one Document per page
    documents = parser.load_data(pdf_path)
    
    print(f"Total documents (pages): {len(documents)}")
    print("=" * 50)
    
    # Preview each page document
    for i, doc in enumerate(documents[:5]):  # First 5 pages
        print(f"\nDocument {i + 1} (Page {i + 1}):")
        print(f"  Length: {len(doc.text)} chars")
        print(f"  Metadata: {doc.metadata}")
        preview = doc.text[:150].replace('\n', ' ')
        print(f"  Preview: {preview}...")
else:
    print(f"File not found: {pdf_path}")

Total documents (pages): 15

Document 1 (Page 1):
  Length: 2767 chars
  Metadata: {}
  Preview: arXiv:1706.03762v7 [cs.CL] 2 Aug 2023  # Attention Is All You Need  Ashish Vaswani∗  Noam Shazeer∗        Niki Parmar∗  Jakob Uszkoreit∗  Google Brain...

Document 2 (Page 2):
  Length: 4272 chars
  Metadata: {}
  Preview:    # 1 Introduction  Recurrent neural networks, long short-term memory [13] and gated recurrent [7] neural networks in particular, have been firmly es...

Document 3 (Page 3):
  Length: 1856 chars
  Metadata: {}
  Preview:   # 2020  Figure 1: The Transformer - model architecture.  The Transformer follows this overall architecture using stacked self-attention and point-wi...

Document 4 (Page 4):
  Length: 2921 chars
  Metadata: {}
  Preview:   # Scaled Dot-Product Attention  # Multi-Head Attention  | Linear      | MatMul             | SoftMax | Concat | | ----------- | ------------------ |...

Document 5 (Page 5):
  Length: 3196 chars
  Metadata: {}
  Preview:   output va
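
The per-page documents above all show `Metadata: {}`, so a retrieved chunk carries no record of which page it came from. A minimal sketch of one way to add that, assuming the list order matches the page order (as the loop above does): `with_page_numbers` is a hypothetical helper that works on plain strings so it can run without the SDK.

```python
from typing import Dict, List


def with_page_numbers(page_texts: List[str], source: str) -> List[Dict]:
    """Pair each page's text with 1-based page-number metadata."""
    return [
        {"text": text, "metadata": {"source": source, "page_number": number}}
        for number, text in enumerate(page_texts, start=1)
    ]


records = with_page_numbers(["first page", "second page"], source="attention_paper.pdf")
print(records[1]["metadata"])
```

With LlamaIndex `Document` objects the same idea can be applied in place, since `doc.metadata` is a regular dictionary, for example by setting `doc.metadata["page_number"]` inside an `enumerate` loop.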

### 3.6 Page Formatting Options

In [15]:
# Customizing page separators and formatting

parser = LlamaParse(
    result_type="markdown",
    split_by_page=False,      # Single document output
    page_separator="\n\n---PAGE BREAK---\n\n",  # Custom separator between pages
    page_prefix="[Page Start]",   # Prefix for each page
    page_suffix="[Page End]",     # Suffix for each page
    verbose=False,
)

pdf_path = str(SAMPLE_DIR / "attention_paper.pdf")

if Path(pdf_path).exists():
    documents = parser.load_data(pdf_path)
    
    if documents:
        # Find page separators in output
        content = documents[0].text
        
        print("Custom page formatting applied:")
        print("=" * 50)
        
        # Show a section with page break
        if "---PAGE BREAK---" in content:
            # Find first page break
            break_pos = content.find("---PAGE BREAK---")
            start = max(0, break_pos - 200)
            end = min(len(content), break_pos + 250)
            print(content[start:end])
        else:
            print("Page separators may not be visible in output")
            print(f"\nFirst 500 chars:")
            print(content[:500])
else:
    print(f"File not found: {pdf_path}")

Custom page formatting applied:
r research.

†Work performed while at Google Brain.

‡Work performed while at Google Research.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

[Page End]

---PAGE BREAK---

[Page Start]

# 1 Introduction

Recurrent neural networks, long short-term memory [13] and gated recurrent [7] neural networks in particular, have been firmly established as state of the art approaches in sequence modeling and trans
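
Once pages are wrapped in known markers, the single combined document can be split back into pages downstream. The sketch below assumes the separator, prefix, and suffix values configured in the cell above, and assumes the surrounding whitespace looks like the recorded output; `split_pages` is a hypothetical helper, not an SDK function.

```python
from typing import List

# Same values as passed to LlamaParse in the cell above
PAGE_SEPARATOR = "\n\n---PAGE BREAK---\n\n"
PAGE_PREFIX = "[Page Start]"
PAGE_SUFFIX = "[Page End]"


def split_pages(content: str) -> List[str]:
    """Split combined output on the separator and strip the page markers."""
    pages = []
    for chunk in content.split(PAGE_SEPARATOR):
        chunk = chunk.strip()
        if chunk.startswith(PAGE_PREFIX):
            chunk = chunk[len(PAGE_PREFIX):]
        if chunk.endswith(PAGE_SUFFIX):
            chunk = chunk[:-len(PAGE_SUFFIX)]
        pages.append(chunk.strip())
    return pages


sample = (
    "[Page Start]\n\nfirst page text\n\n[Page End]"
    + PAGE_SEPARATOR
    + "[Page Start]\n\n# 1 Introduction\n\n[Page End]"
)
print(split_pages(sample))
```

Choosing a separator that cannot occur in the document text keeps this split unambiguous; a bare `---` would collide with markdown horizontal rules.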


---
## 4. Presets & Built-in Configurations

LlamaParse provides pre-optimized configurations (presets) for different document types.

### Available Presets

| Preset | Best For | Description |
|--------|----------|-------------|
| `fast` | Quick extraction | No OCR, fastest processing |
| `balanced` | General documents | Balance of speed and accuracy |
| `premium` | Complex documents | Best quality, uses advanced models |
| `structured` | Forms, tables | Optimized for structured data |
| `auto` | Mixed content | Automatic mode selection |
| `scientific` | Research papers | LaTeX, equations, citations |
| `invoice` | Invoices, receipts | Financial document extraction |
| `slides` | Presentations | PowerPoint, slide content |
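
Since the preset is passed as a plain string, a typo may only surface after a request has been sent. A small sketch of a local check against the names in the table above follows; `KNOWN_PRESETS` and `check_preset` are hypothetical names introduced here, and the set reflects this table only, so it may lag behind what the service accepts.

```python
# Preset names copied from the table above; the service may accept others
KNOWN_PRESETS = {
    "fast", "balanced", "premium", "structured",
    "auto", "scientific", "invoice", "slides",
}


def check_preset(name: str) -> str:
    """Normalize a preset name and reject values not listed in the table."""
    normalized = name.strip().lower()
    if normalized not in KNOWN_PRESETS:
        options = ", ".join(sorted(KNOWN_PRESETS))
        raise ValueError(f"Unknown preset {name!r}; expected one of: {options}")
    return normalized


print(check_preset(" Premium "))
```

The validated value can then be passed on, for example `LlamaParse(preset=check_preset(user_choice))`, where `user_choice` is whatever string your application received.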

### 4.1 Fast Preset

In [16]:
# Fast preset - quickest parsing, skips OCR

parser_fast = LlamaParse(
    preset="fast",  # Quick extraction without OCR
    result_type="text",
    verbose=True,
)

pdf_path = str(SAMPLE_DIR / "attention_paper.pdf")

if Path(pdf_path).exists():
    print("Parsing with FAST preset...")
    print("(Best for: digital PDFs with selectable text)")
    print("-" * 50)
    
    import time
    start = time.time()
    documents = parser_fast.load_data(pdf_path)
    elapsed = time.time() - start
    
    print(f"\nParsing time: {elapsed:.2f} seconds")
    print(f"Content length: {len(documents[0].text)} chars")
    print(f"\nPreview:")
    print(documents[0].text[:500])
else:
    print(f"File not found: {pdf_path}")

Parsing with FAST preset...
(Best for: digital PDFs with selectable text)
--------------------------------------------------
Started parsing the file under job_id 7418cc36-88ce-48e5-a791-b45f7e9a2c4b

Parsing time: 10.79 seconds
Content length: 3364 chars

Preview:
    arXiv:1706.03762v7 [cs.CL] 2 Aug 2023

  Provided proper attribution is provided, Google hereby grants permission to
reproduce the tables and figures in this paper solely for use in journalistic or
    scholarly works.

    Attention Is All You Need

    Ashish Vaswani∗  Noam Shazeer∗        Niki Parmar∗  Jakob Uszkoreit∗
    Google Brain         Google Brain     Google Research    Google Research
avaswani@google.com    noam@google.com    nikip@google.com    usz@google.com

    Llion Jones∗ 


### 4.2 Premium Preset

In [17]:
# Premium preset - best quality parsing

parser_premium = LlamaParse(
    preset="premium",  # Best quality, uses advanced models
    result_type="markdown",
    verbose=True,
)

pdf_path = str(SAMPLE_DIR / "attention_paper.pdf")

if Path(pdf_path).exists():
    print("Parsing with PREMIUM preset...")
    print("(Best for: complex layouts, scanned documents)")
    print("-" * 50)
    
    import time
    start = time.time()
    documents = parser_premium.load_data(pdf_path)
    elapsed = time.time() - start
    
    print(f"\nParsing time: {elapsed:.2f} seconds")
    print(f"Content length: {len(documents[0].text)} chars")
    print(f"\nPreview:")
    print(documents[0].text[:500])
else:
    print(f"File not found: {pdf_path}")

Parsing with PREMIUM preset...
(Best for: complex layouts, scanned documents)
--------------------------------------------------
Started parsing the file under job_id a9ea7db6-fb80-4246-8ba7-5e876970a2dc

Parsing time: 14.61 seconds
Content length: 3024 chars

Preview:


Provided proper attribution is provided, Google hereby grants permission to reproduce the tables and figures in this paper solely for use in journalistic or scholarly works.

# Attention Is All You Need

**Ashish Vaswani**<sup>∗</sup>  
Google Brain  
avaswani@google.com

**Noam Shazeer**<sup>∗</sup>  
Google Brain  
noam@google.com

**Niki Parmar**<sup>∗</sup>  
Google Research  
nikip@google.com

**Jakob Uszkoreit**<sup>∗</sup>  
Google Research  
usz@google.com

**Llion Jones**<sup>∗</sup> 


### 4.3 Scientific Preset

In [18]:
# Scientific preset - optimized for research papers

parser_scientific = LlamaParse(
    preset="scientific",  # Handles LaTeX, equations, citations
    result_type="markdown",
    verbose=True,
)

pdf_path = str(SAMPLE_DIR / "attention_paper.pdf")

if Path(pdf_path).exists():
    print("Parsing with SCIENTIFIC preset...")
    print("(Best for: academic papers, equations, citations)")
    print("-" * 50)
    
    import time
    start = time.time()
    documents = parser_scientific.load_data(pdf_path)
    elapsed = time.time() - start
    
    print(f"\nParsing time: {elapsed:.2f} seconds")
    print(f"Content length: {len(documents[0].text)} chars")
    print(f"\nPreview (look for equations and formatting):")
    print(documents[0].text[:800])
else:
    print(f"File not found: {pdf_path}")

Parsing with SCIENTIFIC preset...
(Best for: academic papers, equations, citations)
--------------------------------------------------
Started parsing the file under job_id 52729523-917b-4cbe-a985-c2d4f5caa87f

Parsing time: 12.11 seconds
Content length: 3234 chars

Preview (look for equations and formatting):


arXiv:1706.03762v7 [cs.CL] 2 Aug 2023

Provided proper attribution is provided, Google hereby grants permission to reproduce the tables and figures in this paper solely for use in journalistic or scholarly works.

# Attention Is All You Need

<table>
<thead>
<tr>
<th>Author Name</th>
<th>Affiliation</th>
</tr>
</thead>
<tbody>
<tr>
<td>Ashish Vaswani*<br>avaswani@google.com</td>
<td>Google Brain</td>
</tr>
<tr>
<td>Noam Shazeer*<br>noam@google.com</td>
<td>Google Brain</td>
</tr>
<tr>
<td>Niki Parmar*<br>nikip@google.com</td>
<td>Google Research</td>
</tr>
<tr>
<td>Jakob Uszkoreit*<br>usz@google.com</td>
<td>Google Research</td>
</tr>
<tr>
<td>Llion Jones*<br>llion@google.com</

### 4.4 Invoice Preset

In [None]:
# Invoice preset - optimized for financial documents

parser_invoice = LlamaParse(
    preset="invoice",  # Handles invoices, receipts
    result_type="markdown",
    verbose=True,
)


# Note: For best results, use this with actual invoice documents
# This example shows the configuration

print("Invoice Preset Configuration:")
print("=" * 50)
print("Best for: Invoices, receipts, financial documents")
print("Features:")
print("  - Optimized for tabular data (line items)")
print("  - Extracts dates, amounts, totals")
print("  - Handles various invoice formats")
print("\nUsage:")
print('  parser = LlamaParse(preset="invoice")')
print('  docs = parser.load_data("invoice.pdf")')

### 4.5 Slides Preset

In [None]:
# Slides preset - optimized for presentations

parser_slides = LlamaParse(
    preset="slides",  # Optimized for PowerPoint/presentations
    result_type="markdown",
    verbose=True,
)

# Note: Use with PPTX files for best results

print("Slides Preset Configuration:")
print("=" * 50)
print("Best for: PowerPoint, Google Slides, Keynote")
print("Features:")
print("  - Preserves slide structure")
print("  - Extracts speaker notes")
print("  - Handles diagrams and charts")
print("  - Maintains bullet point hierarchy")
print("\nUsage:")
print('  parser = LlamaParse(preset="slides")')
print('  docs = parser.load_data("presentation.pptx")')

### 4.6 Comparing Presets

In [19]:
# Compare different presets on the same document

import time

presets_to_compare = ["fast", "balanced", "premium"]
results = {}

pdf_path = str(SAMPLE_DIR / "attention_paper.pdf")

if Path(pdf_path).exists():
    print("Comparing presets on the same document...")
    print("=" * 60)
    
    for preset in presets_to_compare:
        print(f"\nParsing with '{preset}' preset...")
        
        parser = LlamaParse(
            preset=preset,
            result_type="text",
            verbose=False,
        )
        
        start = time.time()
        docs = parser.load_data(pdf_path)
        elapsed = time.time() - start
        
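        # docs[0] is only the first page here (3364 chars matches page 1 in section 3.4),
        # so these numbers compare first-page output, not the whole document
        results[preset] = {
            "time": elapsed,
            "length": len(docs[0].text),
            "preview": docs[0].text[:200]
        }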
        
        print(f"  Time: {elapsed:.2f}s, Content: {len(docs[0].text)} chars")
    
    # Summary table
    print("\n" + "=" * 60)
    print("COMPARISON SUMMARY")
    print("=" * 60)
    print(f"{'Preset':<12} {'Time (s)':<12} {'Content Length':<15}")
    print("-" * 40)
    for preset, data in results.items():
        print(f"{preset:<12} {data['time']:<12.2f} {data['length']:<15}")
else:
    print(f"File not found: {pdf_path}")

Comparing presets on the same document...

Parsing with 'fast' preset...
  Time: 9.05s, Content: 3364 chars

Parsing with 'balanced' preset...
  Time: 9.05s, Content: 3364 chars

Parsing with 'premium' preset...
  Time: 12.01s, Content: 3381 chars

COMPARISON SUMMARY
Preset       Time (s)     Content Length 
----------------------------------------
fast         9.05         3364           
balanced     9.05         3364           
premium      12.01        3381           
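
The summary table is easier to read when each run is expressed relative to a baseline. The following is a minimal sketch, not part of LlamaParse: `summarize_runs` is a hypothetical helper, and the figures are copied from the recorded run above.

```python
def summarize_runs(results, baseline):
    """Return each run's time ratio and character delta relative to `baseline`."""
    base = results[baseline]
    return {
        name: {
            "time_ratio": data["time"] / base["time"],
            "char_delta": data["length"] - base["length"],
        }
        for name, data in results.items()
    }


# Figures copied from the recorded run above
recorded = {
    "fast": {"time": 9.05, "length": 3364},
    "balanced": {"time": 9.05, "length": 3364},
    "premium": {"time": 12.01, "length": 3381},
}

summary = summarize_runs(recorded, baseline="fast")
for name, row in summary.items():
    print(f"{name:<10} {row['time_ratio']:.2f}x  {row['char_delta']:+d} chars")
```

In that run the premium preset took roughly a third longer than the fast preset and returned 17 more characters. A single timing per preset is weak evidence, so repeat the measurement before drawing conclusions.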


---
## 5. Parse Modes

LlamaParse offers granular control over parsing behavior through different modes.

### Mode Categories

**1. High-Level Modes (Boolean flags):**
- `fast_mode`: Skip OCR, fastest processing
- `premium_mode`: Best available parser
- `auto_mode`: Automatic mode selection

**2. Granular Parse Modes (page-level, selected with `parse_mode`):**
- `parse_page_without_llm`: Fast extraction without AI
- `parse_page_with_llm`: Uses LLM for each page
- `parse_page_with_lvm`: Uses vision model for pages
- `parse_page_with_agent`: Agentic reasoning per page
- `parse_page_with_layout_agent`: Layout-aware agent

### 5.1 Fast Mode

In [20]:
# Fast mode - skips OCR for faster processing

parser_fast = LlamaParse(
    fast_mode=True,  # Skip OCR, extract text only
    result_type="text",
    verbose=True,
)

pdf_path = str(SAMPLE_DIR / "attention_paper.pdf")

if Path(pdf_path).exists():
    print("FAST MODE Parsing")
    print("=" * 50)
    print("Characteristics:")
    print("  - No OCR processing")
    print("  - Extracts only embedded/selectable text")
    print("  - Fastest processing time")
    print("  - Best for: Digital PDFs with selectable text")
    print("-" * 50)
    
    import time
    start = time.time()
    documents = parser_fast.load_data(pdf_path)
    elapsed = time.time() - start
    
    print(f"\nTime: {elapsed:.2f}s")
    print(f"Content: {len(documents[0].text)} chars")
else:
    print(f"File not found: {pdf_path}")

FAST MODE Parsing
Characteristics:
  - No OCR processing
  - Extracts only embedded/selectable text
  - Fastest processing time
  - Best for: Digital PDFs with selectable text
--------------------------------------------------
Started parsing the file under job_id 7e354f95-4768-43f2-8ad7-6f7aed37a676

Time: 9.32s
Content: 3365 chars


### 5.2 Premium Mode

In [21]:
# Premium mode - best quality parsing

parser_premium = LlamaParse(
    premium_mode=True,  # Use best available parser
    result_type="text",
    verbose=True,
)

pdf_path = str(SAMPLE_DIR / "attention_paper.pdf")

if Path(pdf_path).exists():
    print("PREMIUM MODE Parsing")
    print("=" * 50)
    print("Characteristics:")
    print("  - Uses advanced AI models")
    print("  - Better table extraction")
    print("  - Improved layout understanding")
    print("  - Best for: Complex documents, financial reports")
    print("-" * 50)
    
    import time
    start = time.time()
    documents = parser_premium.load_data(pdf_path)
    elapsed = time.time() - start
    
    print(f"\nTime: {elapsed:.2f}s")
    print(f"Content: {len(documents[0].text)} chars")
else:
    print(f"File not found: {pdf_path}")

PREMIUM MODE Parsing
Characteristics:
  - Uses advanced AI models
  - Better table extraction
  - Improved layout understanding
  - Best for: Complex documents, financial reports
--------------------------------------------------
Started parsing the file under job_id 1e0398fd-2d72-432e-8a4b-0477a9e763e6

Time: 12.14s
Content: 3382 chars


### 5.3 Auto Mode with Triggers

Auto mode parses simple pages with the fast parser and upgrades only the pages that meet a trigger condition, such as containing an image or a table, to premium parsing.

In [22]:
# Auto mode - automatically selects parsing strategy per page

parser_auto = LlamaParse(
    auto_mode=True,  # Enable automatic mode selection
    
    # Trigger conditions for upgrading to premium parsing
    auto_mode_trigger_on_image_in_page=True,   # Upgrade pages with images
    auto_mode_trigger_on_table_in_page=True,   # Upgrade pages with tables
    
    result_type="text",
    verbose=True,
)

pdf_path = str(SAMPLE_DIR / "attention_paper.pdf")

if Path(pdf_path).exists():
    print("AUTO MODE with Triggers")
    print("=" * 50)
    print("Configuration:")
    print("  - auto_mode: True")
    print("  - Trigger on images: True")
    print("  - Trigger on tables: True")
    print("\nBehavior:")
    print("  - Simple text pages → Fast parsing")
    print("  - Pages with images/tables → Premium parsing")
    print("-" * 50)
    
    import time
    start = time.time()
    documents = parser_auto.load_data(pdf_path)
    elapsed = time.time() - start
    
    print(f"\nTime: {elapsed:.2f}s")
    print(f"Content: {len(documents[0].text)} chars")
else:
    print(f"File not found: {pdf_path}")

AUTO MODE with Triggers
Configuration:
  - auto_mode: True
  - Trigger on images: True
  - Trigger on tables: True

Behavior:
  - Simple text pages → Fast parsing
  - Pages with images/tables → Premium parsing
--------------------------------------------------
Started parsing the file under job_id a706051f-a98b-4db6-b8d3-f02ec209f0d3

Time: 12.95s
Content: 3365 chars


### 5.4 Auto Mode with Text Trigger

In [None]:
# Auto mode with text-based trigger

parser_auto_text = LlamaParse(
    auto_mode=True,
    auto_mode_trigger_on_text_in_page="table|figure|equation",  # Regex pattern
    result_type="markdown",
    verbose=True,
)

pdf_path = str(SAMPLE_DIR / "attention_paper.pdf")
documents = parser_auto_text.load_data(pdf_path)  # use the parser defined in this cell

print("AUTO MODE with Text Trigger")
print("=" * 50)
print("Configuration:")
print('  - Trigger pattern: "table|figure|equation"')
print("\nBehavior:")
print("  - Pages containing 'table', 'figure', or 'equation' → Premium")
print("  - Other pages → Fast parsing")

# Report the parsed length instead of dumping the whole Document list
print(f"Document: {len(documents[0].text) if documents else 0} chars")

Started parsing the file under job_id d0a81471-f236-4403-81c1-87dfee681c2a
AUTO MODE with Text Trigger
Configuration:
  - Trigger pattern: "table|figure|equation"

Behavior:
  - Pages containing 'table', 'figure', or 'equation' → Premium
  - Other pages → Fast parsing

### 5.5 Auto Mode with Regex Trigger

In [27]:
# Auto mode with regex pattern trigger

parser_auto_regex = LlamaParse(
    auto_mode=True,
    auto_mode_trigger_on_regexp_in_page=r"\d+\.\d+%|\$[\d,]+",  # Percentages or dollar amounts
    result_type="markdown",
    verbose=True,
)

pdf_path = str(SAMPLE_DIR / "attention_paper.pdf")
documents = parser_auto_regex.load_data(pdf_path)
# Report the parsed length instead of dumping the whole Document list
print(f"Document: {len(documents[0].text) if documents else 0} chars")

print("AUTO MODE with Regex Trigger")
print("=" * 50)
print("Configuration:")
print(r'  - Regex pattern: "\d+\.\d+%|\$[\d,]+"')
print("  - Matches: Percentages (45.5%) or dollar amounts ($1,000)")
print("\nUse case:")
print("  - Upgrade pages with financial data to premium parsing")
print("  - Keep text-only pages in fast mode")

Started parsing the file under job_id 861e53fe-9be2-459d-847e-e180c856fb7b
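
Before spending credits on a parse, it can help to check locally which pages a trigger pattern would select. The sketch below applies the same pattern with Python's `re` module to made-up page strings. It only approximates the idea: how the service matches text internally is not shown in this notebook, and `pages_matching` is a hypothetical helper.

```python
import re

# Same pattern as the trigger above: decimal percentages or dollar amounts
TRIGGER = re.compile(r"\d+\.\d+%|\$[\d,]+")


def pages_matching(pages, pattern=TRIGGER):
    """Return the 1-based numbers of pages whose text contains a match."""
    return [
        number
        for number, text in enumerate(pages, start=1)
        if pattern.search(text)
    ]


# Made-up page texts for illustration
sample_pages = [
    "Revenue grew 45.5% year over year.",
    "The dominant sequence transduction models are based on recurrent networks.",
    "Total due: $1,000",
]

print(pages_matching(sample_pages))
```

The first alternative requires a decimal point, so a whole-number percentage such as `45%` does not match. Loosen it to `\d+(\.\d+)?%` if those pages should be upgraded too.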

### 5.6 Auto Mode with JSON Configuration

In [28]:
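# Auto mode with advanced JSON configuration
# NOTE: the dictionary below is illustrative. In the recorded run the service
# rejects it with "Invalid auto_mode_configuration_json, should be an array",
# so the serialized value must be a JSON array, not a single object.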

auto_config = {
    "trigger_conditions": {
        "on_image": True,
        "on_table": True,
        "on_text_patterns": ["figure", "table", "chart"],
    },
    "fallback_mode": "fast",
    "upgrade_mode": "premium"
}

parser_auto_json = LlamaParse(
    auto_mode=True,
    auto_mode_configuration_json=json.dumps(auto_config),
    result_type="markdown",
    verbose=True,
)

pdf_path = str(SAMPLE_DIR / "attention_paper.pdf")
documents = parser_auto_json.load_data(pdf_path)
# load_data() returned an empty list when the job failed, so guard the index
print(f"Document: {len(documents[0].text) if documents else 0} chars")

print("AUTO MODE with JSON Configuration")
print("=" * 50)
print("Configuration JSON:")
print(json.dumps(auto_config, indent=2))
print("\nThis allows complex conditional parsing logic.")

Started parsing the file under job_id 9fbf2b4f-f60e-4b96-b18c-c67ac66886f0
Error while parsing the file 'sample_documents/attention_paper.pdf': Job ID: 9fbf2b4f-f60e-4b96-b18c-c67ac66886f0 failed with status: ERROR, Error code: No error code found, Error message: UNKNOWN_ERROR: INVALID_AUTO_MODE_CONFIGURATION - Invalid auto_mode_configuration_json, should be an array
Document: 0 chars
AUTO MODE with JSON Configuration
Configuration JSON:
{
  "trigger_conditions": {
    "on_image": true,
    "on_table": true,
    "on_text_patterns": [
      "figure",
      "table",
      "chart"
    ]
  },
  "fallback_mode": "fast",
  "upgrade_mode": "premium"
}

This allows complex conditional parsing logic.
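
The error in the recorded output says the configuration "should be an array", so a cheap local check can catch a wrong shape before a job is submitted. This is a sketch under that single assumption. `check_auto_mode_config` is a hypothetical helper, and the key name in the array example is a placeholder rather than the documented schema, which should be taken from the LlamaParse documentation.

```python
import json


def check_auto_mode_config(config_json: str) -> list:
    """Parse the string and confirm the top-level JSON value is an array of objects."""
    parsed = json.loads(config_json)
    if not isinstance(parsed, list):
        raise ValueError("auto_mode_configuration_json should be an array")
    if not all(isinstance(rule, dict) for rule in parsed):
        raise ValueError("each entry in the array should be an object")
    return parsed


# Same shape as the rejected configuration: a single object
object_shaped = json.dumps({"fallback_mode": "fast", "upgrade_mode": "premium"})

# Array shape; the key name is a placeholder, not the documented schema
array_shaped = json.dumps([{"placeholder_trigger": True}])

for label, payload in (("object", object_shaped), ("array", array_shaped)):
    try:
        rules = check_auto_mode_config(payload)
        print(f"{label}: ok, {len(rules)} rule(s)")
    except ValueError as exc:
        print(f"{label}: rejected ({exc})")
```

The check only validates the outer shape. Whether the service accepts the individual rules still depends on using its real key names.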


### 5.7 Granular Parse Modes

In [None]:
# Overview of granular parse modes

granular_modes = {
    "parse_page_without_llm": {
        "description": "Fast extraction without AI reconstruction",
        "use_case": "Simple text documents, speed priority",
        "cost": "Low"
    },
    "parse_page_with_llm": {
        "description": "Uses language model for page parsing",
        "use_case": "Mixed content, text-heavy documents",
        "cost": "Medium"
    },
    "parse_page_with_lvm": {
        "description": "Uses vision-language model for each page",
        "use_case": "Image-heavy documents, diagrams",
        "cost": "Medium-High"
    },
    "parse_page_with_agent": {
        "description": "Agentic reasoning for complex pages",
        "use_case": "Complex layouts, nested structures",
        "cost": "High"
    },
    "parse_page_with_layout_agent": {
        "description": "Layout-aware agent for visual structure",
        "use_case": "Multi-column documents, forms",
        "cost": "High"
    },
}

print("GRANULAR PARSE MODES")
print("=" * 70)

for mode, info in granular_modes.items():
    print(f"\n{mode}:")
    print(f"  Description: {info['description']}")
    print(f"  Use Case: {info['use_case']}")
    print(f"  Cost: {info['cost']}")
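
The overview above names the modes but does not show how one is selected. The sketch below maps a mode name to constructor keyword arguments. It assumes the granular modes are passed as the value of a `parse_mode` argument, while the high-level modes are boolean flags as in sections 5.1 to 5.3. `mode_kwargs` is a hypothetical helper, not part of the library.

```python
HIGH_LEVEL_FLAGS = ("fast_mode", "premium_mode", "auto_mode")
GRANULAR_MODES = (
    "parse_page_without_llm",
    "parse_page_with_llm",
    "parse_page_with_lvm",
    "parse_page_with_agent",
    "parse_page_with_layout_agent",
)


def mode_kwargs(mode: str) -> dict:
    """Translate a mode name into constructor keyword arguments."""
    if mode in HIGH_LEVEL_FLAGS:
        return {mode: True}
    if mode in GRANULAR_MODES:
        # Assumption: granular modes are passed as the value of `parse_mode`
        return {"parse_mode": mode}
    raise ValueError(f"Unknown mode: {mode}")


print(mode_kwargs("fast_mode"))
print(mode_kwargs("parse_page_with_lvm"))

# Intended use (needs an API key, so it is left commented out):
# parser = LlamaParse(result_type="markdown", **mode_kwargs("parse_page_with_lvm"))
```

Keeping the mapping in one place makes it harder to combine a boolean flag with a conflicting granular mode by accident.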

### 5.8 Comparing Parse Modes

In [None]:
# Compare fast_mode, premium_mode, and auto_mode

import time

pdf_path = str(SAMPLE_DIR / "attention_paper.pdf")

if Path(pdf_path).exists():
    print("COMPARING FAST vs PREMIUM vs AUTO MODE")
    print("=" * 60)
    
    # Fast mode
    print("\n1. Fast Mode:")
    parser_fast = LlamaParse(fast_mode=True, result_type="markdown", verbose=False)
    start = time.time()
    docs_fast = parser_fast.load_data(pdf_path)
    time_fast = time.time() - start
    print(f"   Time: {time_fast:.2f}s, Length: {len(docs_fast[0].text)} chars")
    
    # Premium mode
    print("\n2. Premium Mode:")
    parser_premium = LlamaParse(premium_mode=True, result_type="markdown", verbose=False)
    start = time.time()
    docs_premium = parser_premium.load_data(pdf_path)
    time_premium = time.time() - start
    print(f"   Time: {time_premium:.2f}s, Length: {len(docs_premium[0].text)} chars")
    
    # Auto mode
    print("\n3. Auto Mode:")
    parser_auto = LlamaParse(
        auto_mode=True, 
        auto_mode_trigger_on_table_in_page=True,
        result_type="markdown",  # same output type as the other two runs, so lengths are comparable
        verbose=False
    )
    start = time.time()
    docs_auto = parser_auto.load_data(pdf_path)
    time_auto = time.time() - start
    print(f"   Time: {time_auto:.2f}s, Length: {len(docs_auto[0].text)} chars")
    
    # Summary
    print("\n" + "=" * 60)
    print("SUMMARY")
    print(f"{'Mode':<15} {'Time (s)':<12} {'Content Length':<15} {'Speed Ratio'}")
    print("-" * 55)
    print(f"{'Fast':<15} {time_fast:<12.2f} {len(docs_fast[0].text):<15} 1.0x")
    print(f"{'Premium':<15} {time_premium:<12.2f} {len(docs_premium[0].text):<15} {time_premium/time_fast:.1f}x")
    print(f"{'Auto':<15} {time_auto:<12.2f} {len(docs_auto[0].text):<15} {time_auto/time_fast:.1f}x")
else:
    print(f"File not found: {pdf_path}")
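
The comparison cell repeats the same start/stop timing code for every mode. A small wrapper removes that repetition. This is a sketch: `timed` is a hypothetical helper, and `fake_load` stands in for `parser.load_data` so the example runs without an API key.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def fake_load(path):
    """Stand-in for parser.load_data(path)."""
    return [f"parsed:{path}"]


docs, elapsed = timed(fake_load, "attention_paper.pdf")
print(f"{len(docs)} document(s) in {elapsed:.4f}s")
```

With a real parser the call becomes `docs, elapsed = timed(parser_fast.load_data, pdf_path)`. `time.perf_counter()` is used here because it is a monotonic clock intended for measuring intervals.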

---
## 6. Supported File Formats

LlamaParse supports 70+ file formats:

| Category | Formats |
|----------|----------|
| Documents | PDF, DOC, DOCX, RTF, TXT, EPUB |
| Spreadsheets | XLSX, XLS, CSV, ODS |
| Presentations | PPTX, PPT |
| Images | JPG, PNG, GIF, BMP, TIFF, WEBP, SVG |
| Web | HTML, HTM |
| Audio | MP3, MP4, WAV, WEBM, M4A (≤20MB) |
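
A pre-flight check against this table avoids uploading files that cannot be parsed. The sketch below covers only the extensions listed above, not the full set of 70+ formats, and it assumes the 20MB audio limit means 20 MiB. `classify` is a hypothetical helper.

```python
from pathlib import Path

# Extensions taken from the table above (a subset of what the service supports)
FORMATS = {
    "Documents": {"pdf", "doc", "docx", "rtf", "txt", "epub"},
    "Spreadsheets": {"xlsx", "xls", "csv", "ods"},
    "Presentations": {"pptx", "ppt"},
    "Images": {"jpg", "png", "gif", "bmp", "tiff", "webp", "svg"},
    "Web": {"html", "htm"},
    "Audio": {"mp3", "mp4", "wav", "webm", "m4a"},
}
AUDIO_LIMIT_BYTES = 20 * 1024 * 1024  # assumption: "20MB" read as 20 MiB


def classify(filename, size_bytes=0):
    """Return the table category for a file, or None if it should be skipped."""
    extension = Path(filename).suffix.lower().lstrip(".")
    for category, extensions in FORMATS.items():
        if extension in extensions:
            if category == "Audio" and size_bytes > AUDIO_LIMIT_BYTES:
                return None  # audio file over the size limit
            return category
    return None


print(classify("report.PDF"))
print(classify("talk.mp3", size_bytes=25 * 1024 * 1024))
```

In practice `size_bytes` would come from `Path(filename).stat().st_size`.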

### 6.1 Parsing PDF Documents

In [None]:
# PDF parsing - the most common use case

parser = LlamaParse(
    result_type="markdown",
    verbose=True,
)

pdf_path = str(SAMPLE_DIR / "attention_paper.pdf")

if Path(pdf_path).exists():
    print("PDF PARSING")
    print("=" * 50)
    
    documents = parser.load_data(pdf_path)
    
    print(f"\nParsed {len(documents)} document(s)")
    print(f"Content length: {len(documents[0].text)} characters")
    
    # Check metadata
    print("\nDocument Metadata:")
    for key, value in documents[0].metadata.items():
        print(f"  {key}: {value}")
else:
    print(f"File not found: {pdf_path}")

### 6.2 Parsing PDF with OCR Options

In [29]:
# PDF with different OCR configurations

# High-resolution OCR (slower but more accurate)
parser_high_ocr = LlamaParse(
    result_type="markdown",
    high_res_ocr=True,  # Better OCR quality
    verbose=True,
)

# Disable OCR (only extract embedded text)
parser_no_ocr = LlamaParse(
    result_type="markdown",
    disable_ocr=True,  # Skip OCR completely
    verbose=True,
)

pdf_path = str(SAMPLE_DIR / "attention_paper.pdf")
documents = parser_high_ocr.load_data(pdf_path)
# Report the parsed length instead of dumping the whole Document list
print(f"Document: {len(documents[0].text) if documents else 0} chars")

print("PDF OCR OPTIONS")
print("=" * 50)
print("\n1. high_res_ocr=True:")
print("   - Higher accuracy for scanned documents")
print("   - Slower processing")
print("   - Best for: Scanned PDFs, poor quality images")
print("\n2. disable_ocr=True:")
print("   - Only extracts embedded/selectable text")
print("   - Fastest processing")
print("   - Best for: Digital PDFs with selectable text")

Started parsing the file under job_id 9cd0da14-a42f-46ff-85ca-d9e6c0133951

### 6.3 Parsing HTML Documents

In [None]:
# HTML parsing

parser = LlamaParse(
    result_type="markdown",
    verbose=True,
)

html_path = str(SAMPLE_DIR / "sample.html")

if Path(html_path).exists():
    print("HTML PARSING")
    print("=" * 50)
    
    documents = parser.load_data(html_path)
    
    print(f"\nParsed {len(documents)} document(s)")
    print(f"Content length: {len(documents[0].text)} characters")
    print("\nExtracted content preview:")
    print("-" * 50)
    print(documents[0].text[:800])
else:
    print(f"HTML file not found: {html_path}")
    print("\nExample usage:")
    print('  documents = parser.load_data("webpage.html")')

### 6.4 Parsing Markdown Documents

In [None]:
# Markdown parsing

parser = LlamaParse(
    result_type="text",
    verbose=True,
)

md_path = str(SAMPLE_DIR / "sample.md")

if Path(md_path).exists():
    print("MARKDOWN PARSING")
    print("=" * 50)
    
    documents = parser.load_data(md_path)
    
    print(f"\nParsed {len(documents)} document(s)")
    print(f"Content length: {len(documents[0].text)} characters")
    print("\nExtracted content preview:")
    print("-" * 50)
    print(documents[0].text[:600])
else:
    print(f"Markdown file not found: {md_path}")

### 6.5 Parsing Microsoft Office Documents

In [None]:
# Microsoft Office document parsing

# DOCX parsing
parser_docx = LlamaParse(
    result_type="markdown",
    verbose=True,
)

print("MICROSOFT OFFICE PARSING")
print("=" * 50)

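print("\n1. DOCX (Word Documents):")
docx_path = SAMPLE_DIR / "sample.docx"  # same sample directory as the other cells
if docx_path.exists():
    result = parser_docx.load_data(str(docx_path))
    print(f"   Parsed {len(result)} document(s)")
else:
    print(f"   File not found: {docx_path}")
print("   - Preserves headings, lists, tables")
print("   - Extracts embedded images")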

# print("\n2. XLSX (Excel Spreadsheets):")
# parser = LlamaParse(
#     spreadsheet_extract_sub_tables=True,
#     guess_xlsx_sheet_names=True,
# )
# parser.load_data('spreadsheet.xlsx')

# print("\n3. PPTX (PowerPoint):")
# parser = LlamaParse(preset="slides")
# parser.load_data('presentation.pptx')
# print("   - Extracts slide content")
# print("   - Includes speaker notes")

### 6.6 Parsing Excel Spreadsheets with Options

In [None]:
# Excel-specific parsing options

parser_excel = LlamaParse(
    result_type="markdown",
    
    # Excel-specific options
    spreadsheet_extract_sub_tables=True,  # Extract sub-tables within sheets
    guess_xlsx_sheet_names=True,          # Try to infer meaningful sheet names
    compact_markdown_table=True,          # More compact table output
    output_tables_as_HTML=False,          # Keep as markdown (not HTML)
    
    verbose=True,
)

print("EXCEL PARSING OPTIONS")
print("=" * 50)
print("Configuration:")
print("  - spreadsheet_extract_sub_tables: True")
print("  - guess_xlsx_sheet_names: True")
print("  - compact_markdown_table: True")
print("  - output_tables_as_HTML: False")
print("\nUsage:")
print('  documents = parser_excel.load_data("financial_data.xlsx")')

### 6.7 Parsing Images

In [None]:
# Image parsing with OCR

parser_image = LlamaParse(
    result_type="markdown",
    high_res_ocr=True,  # High quality OCR for images
    verbose=True,
)

print("IMAGE PARSING")
print("=" * 50)
print("Supported formats: JPG, JPEG, PNG, GIF, BMP, TIFF, WEBP, SVG")
print("\nRecommended settings for images:")
print("  - high_res_ocr=True: Better text extraction")
print("  - premium_mode=True: Best quality for complex images")
print("\nUsage:")
print('  documents = parser_image.load_data("scanned_document.png")')

### 6.8 Parsing Multiple File Types Together

In [None]:
# Parsing multiple files of different types

parser = LlamaParse(
    result_type="markdown",
    num_workers=4,  # Parallel processing for multiple files
    verbose=True,
)

# Collect available sample files
files_to_parse = []
for ext in ["*.pdf", "*.html", "*.md"]:
    files_to_parse.extend(list(SAMPLE_DIR.glob(ext)))

print("MULTI-FILE PARSING")
print("=" * 50)
print(f"Found {len(files_to_parse)} files to parse:")
for f in files_to_parse:
    print(f"  - {f.name}")

if files_to_parse:
    # Parse all files
    file_paths = [str(f) for f in files_to_parse]
    
    print(f"\nParsing {len(file_paths)} files...")
    # Pass the whole list in one call so num_workers can parse files in parallel;
    # calling load_data() once per file would process them one at a time.
    all_documents = parser.load_data(file_paths)
    
    print(f"\nTotal documents: {len(all_documents)}")
else:
    print("\nNo sample files found. Add files to sample_documents/ directory.")
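
Collecting files with several glob patterns can return them in an order that depends on the filesystem, and overlapping patterns can list a file twice. The sketch below gathers matches once each and sorts them so repeated runs parse files in the same order. `collect_files` is a hypothetical helper, and the demonstration uses empty placeholder files in a temporary directory.

```python
import tempfile
from pathlib import Path


def collect_files(directory, patterns=("*.pdf", "*.html", "*.md")):
    """Return matching files once each, sorted for a stable order."""
    found = set()
    for pattern in patterns:
        found.update(Path(directory).glob(pattern))
    return sorted(found)


# Demonstration on a throwaway directory with empty placeholder files
with tempfile.TemporaryDirectory() as tmp:
    for name in ("b.pdf", "a.md", "notes.txt", "index.html"):
        (Path(tmp) / name).touch()
    names = [path.name for path in collect_files(tmp)]

print(names)
```

A stable order makes it easier to match log lines and job IDs to files when comparing two runs.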

---
## 7. Multimodal Parsing Features

LlamaParse can extract and process visual elements from documents, including charts, embedded images, and page screenshots.

### 7.1 Extract Charts

In [None]:
# Enable chart extraction from documents

parser_charts = LlamaParse(
    result_type="markdown",
    extract_charts=True,  # Extract charts and graphs
    verbose=True,
)

pdf_path = str(SAMPLE_DIR / "attention_paper.pdf")

if Path(pdf_path).exists():
    print("CHART EXTRACTION")
    print("=" * 50)
    
    # Use parse() to access charts
    job_results = parser_charts.parse([pdf_path])
    
    if job_results:
        result = job_results[0]
        
        # Check for charts in each page
        total_charts = 0
        for i, page in enumerate(result.pages):
            if hasattr(page, 'charts') and page.charts:
                print(f"Page {i+1}: {len(page.charts)} chart(s) found")
                total_charts += len(page.charts)
        
        print(f"\nTotal charts extracted: {total_charts}")
else:
    print(f"File not found: {pdf_path}")

### 7.2 Take Screenshots

In [None]:
# Enable page screenshots

parser_screenshots = LlamaParse(
    result_type="markdown",
    take_screenshot=True,  # Capture page screenshots
    verbose=True,
)

print("PAGE SCREENSHOTS")
print("=" * 50)
print("Configuration: take_screenshot=True")
print("\nBehavior:")
print("  - Captures visual rendering of each page")
print("  - Useful for preserving visual context")
print("  - Screenshots accessible via aget_image_documents()")
print("\nUsage:")
print("  result = await parser.aparse('document.pdf')")
print("  images = await result.aget_image_documents(")
print("      include_screenshot_images=True")
print("  )")

### 7.3 High-Resolution OCR

In [None]:
# High-resolution OCR for better text extraction

parser_high_ocr = LlamaParse(
    result_type="markdown",
    high_res_ocr=True,  # Enable high-resolution OCR
    verbose=True,
)

print("HIGH-RESOLUTION OCR")
print("=" * 50)
print("Configuration: high_res_ocr=True")
print("\nBenefits:")
print("  - Better accuracy for scanned documents")
print("  - Improved extraction of small text")
print("  - Better handling of poor quality images")
print("\nTrade-offs:")
print("  - Slower processing time")
print("  - Higher computational cost")
print("\nBest for:")
print("  - Scanned PDFs")
print("  - Historical documents")
print("  - Low-quality image sources")

### 7.4 Using Vendor Multimodal Models

In [None]:
# Using external multimodal models (e.g., GPT-4o)

# Note: Requires your own API key for the multimodal provider
parser_multimodal = LlamaParse(
    result_type="markdown",
    use_vendor_multimodal_model=True,           # Enable vendor model
    vendor_multimodal_model_name="gpt-4o",      # Model name
    # vendor_multimodal_api_key="your-api-key", # Your OpenAI API key
    verbose=True,
)

print("VENDOR MULTIMODAL MODEL")
print("=" * 50)
print("Configuration:")
print("  - use_vendor_multimodal_model: True")
print("  - vendor_multimodal_model_name: 'gpt-4o'")
print("  - vendor_multimodal_api_key: Your API key")
print("\nSupported models:")
print("  - OpenAI: gpt-4o, gpt-4-vision-preview")
print("  - Anthropic: claude-3-opus, claude-3-sonnet")
print("  - Other vision-capable models")
print("\nUse case:")
print("  - Complex visual understanding")
print("  - Diagram interpretation")
print("  - Visual question answering")

### 7.5 Accessing Extracted Images

In [30]:
# Accessing extracted images from parsed documents

async def extract_images_example(file_path: str):
    """
    Demonstrate image extraction from documents.
    """
    parser = LlamaParse(
        result_type="markdown",
        extract_charts=True,
        take_screenshot=True,
        verbose=False,
    )
    
    # Parse document
    result = await parser.aparse(file_path)
    
    # Get image documents
    image_docs = await result.aget_image_documents(
        include_screenshot_images=True,  # Include page screenshots
        include_object_images=True,      # Include extracted objects
    )
    
    return image_docs

print("IMAGE EXTRACTION")
print("=" * 50)
print("Using aget_image_documents() method:")
print("")
print("Parameters:")
print("  - include_screenshot_images: Include page screenshots")
print("  - include_object_images: Include extracted charts/images")
print("")
print("Example:")
print("  images = await result.aget_image_documents(")
print("      include_screenshot_images=True,")
print("      include_object_images=True")
print("  )")

IMAGE EXTRACTION
Using aget_image_documents() method:

Parameters:
  - include_screenshot_images: Include page screenshots
  - include_object_images: Include extracted charts/images

Example:
  images = await result.aget_image_documents(
      include_screenshot_images=True,
      include_object_images=True
  )
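
The cell above defines `extract_images_example()` but never runs it. The sketch below shows one way to drive such a coroutine for several files at once. `fake_extract_images` is a stand-in that returns made-up image names, so the example runs without an API key or network access.

```python
import asyncio


async def fake_extract_images(file_path: str) -> list:
    """Stand-in for extract_images_example(); returns made-up image names."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return [f"{file_path}#page1.jpg"]


async def extract_many(paths):
    """Run one extraction per file concurrently and keep the input order."""
    return await asyncio.gather(*(fake_extract_images(p) for p in paths))


# In a script. Inside Jupyter use `await extract_many([...])` instead,
# because the notebook already runs an event loop.
all_images = asyncio.run(extract_many(["a.pdf", "b.pdf"]))
print(all_images)
```

`asyncio.gather` returns results in the order the coroutines were passed, so each inner list lines up with its input path.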


### 7.6 Image and Chart Metadata

In [31]:
# Understanding image and chart metadata

parser = LlamaParse(
    result_type="markdown",
    extract_charts=True,
    verbose=False,
)

pdf_path = str(SAMPLE_DIR / "attention_paper.pdf")

if Path(pdf_path).exists():
    print("IMAGE AND CHART METADATA")
    print("=" * 50)
    
    job_results = parser.parse([pdf_path])
    
    if job_results:
        result = job_results[0]
        
        for i, page in enumerate(result.pages[:5]):  # First 5 pages
            print(f"\nPage {i+1}:")
            
            # Images
            if hasattr(page, 'images') and page.images:
                print(f"  Images: {len(page.images)}")
                for j, img in enumerate(page.images[:2]):
                    print(f"    Image {j+1}: {type(img).__name__}")
            else:
                print("  Images: 0")
            
            # Charts
            if hasattr(page, 'charts') and page.charts:
                print(f"  Charts: {len(page.charts)}")
                for j, chart in enumerate(page.charts[:2]):
                    print(f"    Chart {j+1}: {type(chart).__name__}")
            else:
                print("  Charts: 0")
else:
    print(f"File not found: {pdf_path}")

IMAGE AND CHART METADATA


Getting job results: 100%|██████████| 1/1 [00:10<00:00, 10.44s/it]


Page 1:
  Images: 0
  Charts: 0

Page 2:
  Images: 0
  Charts: 0

Page 3:
  Images: 1
    Image 1: ImageItem
  Charts: 0

Page 4:
  Images: 2
    Image 1: ImageItem
    Image 2: ImageItem
  Charts: 0

Page 5:
  Images: 0
  Charts: 0





---
## 8. Layout Extraction

LlamaParse can extract and preserve document layout information, such as column structure and the position of elements on each page.

### 8.1 Enable Layout Extraction

In [None]:
# Enable layout extraction

parser_layout = LlamaParse(
    result_type="markdown",
    extract_layout=True,  # Extract layout information
    verbose=True,
)

pdf_path = str(SAMPLE_DIR / "attention_paper.pdf")

if Path(pdf_path).exists():
    print("LAYOUT EXTRACTION")
    print("=" * 50)
    
    job_results = parser_layout.parse([pdf_path])
    
    if job_results:
        result = job_results[0]
        
        # Check for layout data in pages
        for i, page in enumerate(result.pages[:3]):
            print(f"\nPage {i+1}:")
            if hasattr(page, 'layout') and page.layout:
                print("  Layout data available: Yes")
                print(f"  Layout type: {type(page.layout).__name__}")
                if isinstance(page.layout, dict):
                    print(f"  Layout keys: {list(page.layout.keys())[:5]}")
            else:
                print("  Layout data available: No")
else:
    print(f"File not found: {pdf_path}")

### 8.2 Preserve Column Layout

In [None]:
# Preserve column layout (don't unroll columns)

parser_columns = LlamaParse(
    result_type="markdown",
    extract_layout=True,
    do_not_unroll_columns=True,  # Keep column structure
    verbose=True,
)

print("PRESERVE COLUMN LAYOUT")
print("=" * 50)
print("Configuration: do_not_unroll_columns=True")
print("\nBehavior:")
print("  - Default: Columns are merged into single-column text")
print("  - With flag: Column structure is preserved")
print("\nBest for:")
print("  - Multi-column academic papers")
print("  - Newspapers and magazines")
print("  - Documents with parallel content")

### 8.3 Preserve Layout Alignment Across Pages

In [None]:
# Preserve layout alignment across pages

parser_aligned = LlamaParse(
    result_type="markdown",
    extract_layout=True,
    do_not_unroll_columns=True,
    preserve_layout_alignment_across_pages=True,  # Grid alignment
    verbose=True,
)

print("CROSS-PAGE LAYOUT ALIGNMENT")
print("=" * 50)
print("Configuration: preserve_layout_alignment_across_pages=True")
print("\nBehavior:")
print("  - Maintains consistent grid across pages")
print("  - Aligns columns that span multiple pages")
print("  - Useful for consistent document structure")
print("\nBest for:")
print("  - Books with consistent layout")
print("  - Technical manuals")
print("  - Standardized forms")

### 8.4 Bounding Box Extraction

In [None]:
# Extract only specific regions using bounding box

# Bounding box values are fractions of the page size (0.0 to 1.0).
# Each bbox_* value is the margin to skip, measured from its own edge.
parser_bbox = LlamaParse(
    result_type="markdown",
    bbox_top=0.1,      # Skip the top 10% of the page
    bbox_bottom=0.1,   # Skip the bottom 10% of the page
    bbox_left=0.05,    # Skip the left 5% of the page
    bbox_right=0.05,   # Skip the right 5% of the page
    verbose=True,
)

print("BOUNDING BOX EXTRACTION")
print("=" * 50)
print("Configuration (margins as fractions of the page):")
print("  - bbox_top: 0.1 (skip top 10%)")
print("  - bbox_bottom: 0.1 (skip bottom 10%)")
print("  - bbox_left: 0.05 (skip left 5%)")
print("  - bbox_right: 0.05 (skip right 5%)")
print("\nUse cases:")
print("  - Skip headers and footers")
print("  - Extract specific page regions")
print("  - Ignore page margins")
print("\nExample - Extract only main content:")
print("  parser = LlamaParse(")
print("      bbox_top=0.15,    # Skip header")
print("      bbox_bottom=0.15, # Skip footer")
print("  )")
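
It can be useful to see which region of the page a set of bounding-box values keeps. The sketch below assumes each value is a margin measured inward from its own edge, as a fraction of the page size. That reading should be checked against the LlamaParse documentation. `kept_region` is a hypothetical helper, and the page size is US Letter in points.

```python
def kept_region(page_width, page_height, top=0.0, bottom=0.0, left=0.0, right=0.0):
    """Return (x0, y0, x1, y1) of the area left after removing the four margins."""
    margins = {"top": top, "bottom": bottom, "left": left, "right": right}
    for name, value in margins.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0.0 and 1.0, got {value}")
    if top + bottom >= 1.0 or left + right >= 1.0:
        raise ValueError("margins leave no area to parse")
    return (
        left * page_width,
        top * page_height,
        (1.0 - right) * page_width,
        (1.0 - bottom) * page_height,
    )


# US Letter page in points (612 x 792), origin at the top-left corner
region = kept_region(612, 792, top=0.1, bottom=0.1, left=0.05, right=0.05)
print([round(value, 1) for value in region])
```

Under this assumption, opposite margins that add up to the full page leave nothing to parse, which is why the helper rejects them.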

### 8.5 Accessing Layout Data

In [None]:
# Accessing layout information from results

parser = LlamaParse(
    result_type="markdown",
    extract_layout=True,
    verbose=False,
)

pdf_path = str(SAMPLE_DIR / "attention_paper.pdf")

if Path(pdf_path).exists():
    print("ACCESSING LAYOUT DATA")
    print("=" * 50)
    
    job_results = parser.parse([pdf_path])
    
    if job_results:
        result = job_results[0]
        
        # Layout data is available per page
        print("Layout data structure:")
        print("  result.pages[i].layout")
        print("\nTypical layout information includes:")
        print("  - Bounding boxes for text blocks")
        print("  - Column definitions")
        print("  - Reading order")
        print("  - Element classifications (heading, paragraph, table, etc.)")
        
        # Check first page layout
        if result.pages:
            page = result.pages[0]
            print(f"\nPage 1 layout available: {hasattr(page, 'layout') and page.layout is not None}")
else:
    print(f"File not found: {pdf_path}")

---
## 9. Structured Output

LlamaParse can extract structured data from documents using JSON schemas.

### 9.1 Enable Structured Output

In [None]:
# Enable structured output extraction

parser_structured = LlamaParse(
    result_type="markdown",
    structured_output=True,  # Enable structured data extraction
    verbose=True,
)

print("STRUCTURED OUTPUT")
print("=" * 50)
print("Configuration: structured_output=True")
print("\nBehavior:")
print("  - Extracts structured data from documents")
print("  - Returns data in JSON format")
print("  - Accessible via page.structuredData")
print("\nAccessing structured data:")
print("  result = parser.parse(['document.pdf'])[0]")
print("  for page in result.pages:")
print("      print(page.structuredData)")
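
When only some pages carry structured data, it is convenient to gather the non-empty values in one pass. The sketch below assumes `page.structuredData` may arrive either as an already-parsed object or as a JSON string; that dual handling is an assumption, and `collect_structured_data` is a hypothetical helper. Stand-in pages are used so the code runs offline.

```python
import json
from types import SimpleNamespace


def collect_structured_data(pages):
    """Return (page_number, data) pairs for pages that carry structured data.

    Hypothetical helper. Strings are decoded as JSON; pages with no value
    or with a string that is not valid JSON are skipped.
    """
    collected = []
    for number, page in enumerate(pages, start=1):
        data = getattr(page, "structuredData", None)
        if data is None:
            continue
        if isinstance(data, str):
            try:
                data = json.loads(data)
            except json.JSONDecodeError:
                continue
        collected.append((number, data))
    return collected


# Stand-in pages with placeholder values.
demo_pages = [
    SimpleNamespace(structuredData={"title": "Example"}),
    SimpleNamespace(structuredData=None),
    SimpleNamespace(structuredData='{"total": 12.5}'),
    SimpleNamespace(structuredData="not json"),
]
for number, data in collect_structured_data(demo_pages):
    print(number, data)
```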

### 9.2 Auto-Infer Schema (imFeelingLucky)

In [None]:
# Auto-infer schema using "imFeelingLucky"

parser_auto_schema = LlamaParse(
    result_type="markdown",
    structured_output=True,
    structured_output_json_schema_name="imFeelingLucky",  # Auto-infer schema
    verbose=True,
)

print("AUTO-INFER SCHEMA (imFeelingLucky)")
print("=" * 50)
print('Configuration: structured_output_json_schema_name="imFeelingLucky"')
print("\nBehavior:")
print("  - LlamaParse automatically detects document structure")
print("  - Creates appropriate JSON schema")
print("  - Great for exploring unknown document types")
print("\nBest for:")
print("  - Quick prototyping")
print("  - Unknown document formats")
print("  - Exploratory analysis")
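
After an exploratory run, one option is to derive a rough schema from the returned data and refine it by hand into an explicit schema. The sketch below does this locally in plain Python. It is not how LlamaParse infers schemas, and `sketch_schema` is a hypothetical helper; it simply maps Python values to JSON Schema type names.

```python
def sketch_schema(value):
    """Build a rough JSON Schema fragment describing `value`.

    Hypothetical helper. Arrays are described by their first element only,
    so mixed-type lists need manual review.
    """
    # bool must be tested before int because bool is a subclass of int.
    if isinstance(value, bool):
        return {"type": "boolean"}
    if isinstance(value, int):
        return {"type": "integer"}
    if isinstance(value, float):
        return {"type": "number"}
    if isinstance(value, str):
        return {"type": "string"}
    if value is None:
        return {"type": "null"}
    if isinstance(value, list):
        schema = {"type": "array"}
        if value:
            schema["items"] = sketch_schema(value[0])
        return schema
    if isinstance(value, dict):
        return {
            "type": "object",
            "properties": {key: sketch_schema(item) for key, item in value.items()},
        }
    raise TypeError(f"unsupported value type: {type(value).__name__}")


# Placeholder data standing in for an auto-inferred result.
sample = {"title": "Example", "authors": ["A. Author"], "pages": 3}
print(sketch_schema(sample))
```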

### 9.3 Custom JSON Schema

In [None]:
# Using custom JSON schema for structured extraction

import json

# Define a custom schema for academic papers
paper_schema = {
    "type": "object",
    "properties": {
        "title": {
            "type": "string",
            "description": "The title of the paper"
        },
        "authors": {
            "type": "array",
            "items": {"type": "string"},
            "description": "List of author names"
        },
        "abstract": {
            "type": "string",
            "description": "The paper abstract"
        },
        "keywords": {
            "type": "array",
            "items": {"type": "string"},
            "description": "Keywords or topics"
        },
        "sections": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "heading": {"type": "string"},
                    "content": {"type": "string"}
                }
            },
            "description": "Main sections of the paper"
        }
    },
    "required": ["title", "abstract"]
}

parser_custom_schema = LlamaParse(
    result_type="markdown",
    structured_output=True,
    structured_output_json_schema=json.dumps(paper_schema),
    verbose=True,
)

print("CUSTOM JSON SCHEMA")
print("=" * 50)
print("Schema defined for academic papers:")
print(json.dumps(paper_schema, indent=2))
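
Extracted data does not always satisfy the schema, so a quick check of required fields and top-level types is worth running before indexing. The following is a minimal sketch using only the standard library; `check_against_schema` is a hypothetical helper that covers just `required` and the top-level `type` of each property, not full JSON Schema validation. The demo uses a trimmed copy of the paper schema and placeholder data.

```python
# Python types accepted for each JSON Schema type name.
JSON_TYPES = {
    "string": str,
    "number": (int, float),
    "integer": int,
    "boolean": bool,
    "array": list,
    "object": dict,
}


def check_against_schema(data, schema):
    """Return a list of human-readable problems (empty when none are found).

    Hypothetical helper: checks required keys and top-level property types only.
    """
    problems = []
    for key in schema.get("required", []):
        if key not in data:
            problems.append(f"missing required field: {key}")
    for key, spec in schema.get("properties", {}).items():
        if key not in data:
            continue
        type_name = spec.get("type")
        expected = JSON_TYPES.get(type_name)
        if expected is None:
            continue  # unknown or unspecified type: nothing to check
        value = data[key]
        # bool is a subclass of int, so reject it explicitly for numeric fields.
        is_bool_as_number = isinstance(value, bool) and type_name in ("number", "integer")
        if is_bool_as_number or not isinstance(value, expected):
            problems.append(
                f"{key}: expected {type_name}, got {type(value).__name__}"
            )
    return problems


# Trimmed copy of the paper schema so this block runs on its own.
demo_schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "authors": {"type": "array", "items": {"type": "string"}},
        "abstract": {"type": "string"},
    },
    "required": ["title", "abstract"],
}
demo_data = {"title": "Example Paper", "authors": "A. Author"}
print(check_against_schema(demo_data, demo_schema))
```

For full validation, a dedicated JSON Schema validator is the better choice; this check only catches the most common gaps.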

### 9.4 Invoice Extraction Schema

In [None]:
# Custom schema for invoice extraction

invoice_schema = {
    "type": "object",
    "properties": {
        "invoice_number": {
            "type": "string",
            "description": "Invoice ID or number"
        },
        "invoice_date": {
            "type": "string",
            "description": "Date of the invoice"
        },
        "due_date": {
            "type": "string",
            "description": "Payment due date"
        },
        "vendor": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "address": {"type": "string"},
                "tax_id": {"type": "string"}
            }
        },
        "customer": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "address": {"type": "string"}
            }
        },
        "line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "description": {"type": "string"},
                    "quantity": {"type": "number"},
                    "unit_price": {"type": "number"},
                    "total": {"type": "number"}
                }
            }
        },
        "subtotal": {"type": "number"},
        "tax": {"type": "number"},
        "total": {"type": "number"}
    },
    "required": ["invoice_number", "total"]
}

parser_invoice = LlamaParse(
    result_type="markdown",
    structured_output=True,
    structured_output_json_schema=json.dumps(invoice_schema),
    preset="invoice",  # Combine with invoice preset
    verbose=True,
)

print("INVOICE EXTRACTION SCHEMA")
print("=" * 50)
print("Schema for invoice data extraction:")
print(json.dumps(invoice_schema, indent=2)[:800] + "...")
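
Because the invoice schema includes `line_items`, `subtotal`, `tax`, and `total`, the extracted numbers can be cross-checked arithmetically before they are trusted. The sketch below is a hypothetical helper (`invoice_total_issues`) operating on a plain dict shaped like the schema; the tolerance of 0.01 and the sample values are placeholders.

```python
import math


def invoice_total_issues(invoice, tolerance=0.01):
    """Return a list of arithmetic inconsistencies found in an invoice dict.

    Hypothetical helper: fields that are absent are skipped rather than
    reported, since only invoice_number and total are required.
    """
    issues = []
    items = invoice.get("line_items", [])

    # Each line: quantity x unit_price should equal the line total.
    for index, item in enumerate(items, start=1):
        if all(key in item for key in ("quantity", "unit_price", "total")):
            expected = item["quantity"] * item["unit_price"]
            if not math.isclose(expected, item["total"], abs_tol=tolerance):
                issues.append(f"line {index}: quantity x unit_price != total")

    # Line totals should add up to the subtotal.
    subtotal = invoice.get("subtotal")
    if subtotal is not None and items:
        items_sum = sum(item.get("total", 0) for item in items)
        if not math.isclose(items_sum, subtotal, abs_tol=tolerance):
            issues.append("subtotal does not match sum of line items")

    # Subtotal plus tax should equal the invoice total.
    if subtotal is not None and "tax" in invoice and "total" in invoice:
        if not math.isclose(subtotal + invoice["tax"], invoice["total"], abs_tol=tolerance):
            issues.append("total does not equal subtotal + tax")
    return issues


# Placeholder invoice with a deliberately wrong grand total.
demo_invoice = {
    "invoice_number": "DEMO-001",
    "line_items": [
        {"description": "Widget", "quantity": 2, "unit_price": 10.0, "total": 20.0},
        {"description": "Service", "quantity": 1, "unit_price": 5.5, "total": 5.5},
    ],
    "subtotal": 25.5,
    "tax": 2.55,
    "total": 30.0,
}
print(invoice_total_issues(demo_invoice))
```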

### 9.5 Table Output Options

In [None]:
# Table-specific output options

# Compact markdown tables
parser_compact = LlamaParse(
    result_type="markdown",
    compact_markdown_table=True,  # More compact table output
    verbose=True,
)

# HTML table output
parser_html_tables = LlamaParse(
    result_type="markdown",
    output_tables_as_HTML=True,  # Tables as HTML instead of markdown
    verbose=True,
)

# Adaptive long tables
parser_long_tables = LlamaParse(
    result_type="markdown",
    adaptive_long_table=True,  # Detect long tables and adapt the output
    continuous_mode=True,      # Parse pages continuously, which helps tables that span pages
    verbose=True,
)

print("TABLE OUTPUT OPTIONS")
print("=" * 50)
print("\n1. compact_markdown_table=True")
print("   - More compact table formatting")
print("   - Reduces whitespace")
print("\n2. output_tables_as_HTML=True")
print("   - Tables rendered as HTML")
print("   - Better for web display")
print("\n3. adaptive_long_table=True + continuous_mode=True")
print("   - Detects tables spanning multiple pages")
print("   - Merges into single table")
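
For downstream use in a RAG pipeline, markdown tables often need to be turned back into rows. The following is a minimal sketch of a pipe-table reader; `parse_markdown_table` is a hypothetical helper, it assumes a header row followed by a separator row, and it does not handle escaped pipes or empty leading cells. The sample table is placeholder data.

```python
def parse_markdown_table(markdown):
    """Convert a pipe-delimited markdown table into a list of row dicts.

    Hypothetical helper: expects a header line, a separator line, then data
    lines. All cell values are returned as stripped strings.
    """
    lines = [line.strip() for line in markdown.strip().splitlines() if line.strip()]
    if len(lines) < 2:
        return []

    def split_row(line):
        # Drop the outer pipes, then split on the inner ones.
        return [cell.strip() for cell in line.strip("|").split("|")]

    header = split_row(lines[0])
    # lines[1] is the separator row (e.g. |---|---|) and is skipped.
    return [dict(zip(header, split_row(line))) for line in lines[2:]]


# Works the same for padded and tightly packed tables.
demo_table = """
|Item|Qty|
|---|---|
|Widget|2|
|Gadget|5|
"""
print(parse_markdown_table(demo_table))
```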

---
## Summary

In this notebook, we covered:

1. **Setup**: Installation, API keys, imports
2. **Core Concepts**: JobResult structure, sync/async methods
3. **Basic Parsing**: load_data(), parse(), page access
4. **Presets**: fast, balanced, premium, scientific, invoice, slides
5. **Parse Modes**: fast_mode, premium_mode, auto_mode with triggers
6. **File Formats**: PDF, HTML, DOCX, XLSX, images
7. **Multimodal**: Charts, screenshots, high-res OCR
8. **Layout**: Column preservation, bounding boxes
9. **Structured Output**: JSON schemas, auto-inference

**Continue to Part 2** for:
- Custom Prompts & Instructions
- Advanced Configuration
- Async Operations & Batch Processing
- LlamaIndex Integration
- Complete RAG Example with ChromaDB
- CLI Usage
- Best Practices

In [None]:
# Quick reference - Common configurations

print("LLAMAPARSE QUICK REFERENCE")
print("=" * 60)

configs = {
    "Fast extraction": 'LlamaParse(fast_mode=True)',
    "Best quality": 'LlamaParse(premium_mode=True)',
    "Scientific papers": 'LlamaParse(preset="scientific")',
    "Auto mode": 'LlamaParse(auto_mode=True, auto_mode_trigger_on_table_in_page=True)',
    "With OCR": 'LlamaParse(high_res_ocr=True)',
    "Extract charts": 'LlamaParse(extract_charts=True)',
    "Preserve columns": 'LlamaParse(do_not_unroll_columns=True)',
    "Structured output": 'LlamaParse(structured_output=True)',
}

for use_case, config in configs.items():
    print(f"\n{use_case}:")
    print(f"  {config}")