## What is LlamaExtract?
**LlamaExtract** is a LlamaCloud service that converts **unstructured documents** (PDFs, images, plain text, etc.) into **structured JSON** that follows a schema you define.  
Instead of writing regexes or custom parsers, you declare *what fields you want*, and LlamaExtract returns validated structured output.

Typical use cases:
- Invoices / receipts
- Resumes / profiles
- SOPs / manuals
- Medical forms
- Research papers → metadata extraction

---

For a detailed explanation, see the last cell in this notebook.


In [3]:
import os
from dotenv import load_dotenv
load_dotenv()  # pulls LLAMA_CLOUD_API_KEY from .env


True

In [5]:
from pydantic import BaseModel, Field
from typing import List, Optional

class LineItem(BaseModel):
    description: str = Field(description="Short description of the item/service")
    quantity: Optional[float] = Field(description="Quantity if present")
    unit_price: Optional[float] = Field(description="Unit price if present")
    amount: Optional[float] = Field(description="Line total amount")

class Invoice(BaseModel):
    vendor_name: str = Field(description="Company issuing the invoice")
    invoice_number: str = Field(description="Invoice or bill number")
    invoice_date: str = Field(description="Invoice date")
    due_date: Optional[str] = Field(description="Payment due date if present")
    currency: Optional[str] = Field(description="Currency code like USD, EUR, INR")
    total_amount: float = Field(description="Total amount due")
    line_items: List[LineItem] = Field(description="List of billed items")


In [14]:
pip install llama-cloud-services --quiet


Note: you may need to restart the kernel to use updated packages.


In [None]:
from llama_cloud_services import LlamaExtract

extractor = LlamaExtract(
    api_key=os.getenv("LLAMA_CLOUD_API_KEY"),  # same name as in .env
)

agent = extractor.create_agent(
    name="Invoice Extractor Demo Agent",
    data_schema=Invoice,
)


Uploading files: 100%|██████████| 1/1 [00:00<00:00,  1.15it/s]
Creating extraction jobs: 100%|██████████| 1/1 [00:00<00:00,  1.16it/s]
Extracting files: 100%|██████████| 1/1 [00:02<00:00,  2.60s/it]

In [22]:
from llama_cloud_services import SourceText

sample_text = """
INVOICE
Vendor: Acme Robotics Inc.
Invoice No: AC-2025-0091
Invoice Date: 2025-11-23
Due Date: 2025-11-23
Currency: USD

Items:
1. Battery inspection service - Qty 2 - Unit $307.00 - Amount $307.00
2. Safety calibration kit - Qty 1 - Unit $420.00 - Amount $420.00

Total Due: $730.00
"""

run = agent.extract(SourceText(text_content=sample_text))
run.data


{'vendor_name': 'Acme Robotics Inc.',
 'invoice_number': 'AC-2025-0091',
 'invoice_date': '2025-11-23',
 'due_date': '2025-11-23',
 'currency': 'USD',
 'total_amount': 730.0,
 'line_items': [{'description': 'Battery inspection service',
   'quantity': 2.0,
   'unit_price': 307.0,
   'amount': 307.0},
  {'description': 'Safety calibration kit',
   'quantity': 1.0,
   'unit_price': 420.0,
   'amount': 420.0}]}

In [25]:
pdf_path = "invoice.pdf"  # local path in notebook
run_pdf = agent.extract(pdf_path)
run_pdf.data


{'vendor_name': 'DOTAIZ',
 'invoice_number': '2025-001',
 'invoice_date': '2025-11-23',
 'due_date': '2025-12-22',
 'currency': None,
 'total_amount': 14.85,
 'line_items': [{'description': 'dotaiz subscription',
   'quantity': 1.0,
   'unit_price': 30.0,
   'amount': 30.0}]}

In [30]:
from llama_cloud_services.extract import ExtractConfig, ExtractMode


agent.config = ExtractConfig(
    extraction_mode=ExtractMode.MULTIMODAL,  # needed for cite/confidence
    cite_sources=True,
    use_reasoning=True,
    confidence_scores=True,
)

rich_run = agent.extract(SourceText(text_content=sample_text))

rich_run.data


{'vendor_name': 'Acme Robotics Inc.',
 'invoice_number': 'AC-2025-0091',
 'invoice_date': '2025-11-23',
 'due_date': '2025-11-23',
 'currency': 'USD',
 'total_amount': 730.0,
 'line_items': [{'description': 'Battery inspection service',
   'quantity': 2.0,
   'unit_price': 307.0,
   'amount': 307.0},
  {'description': 'Safety calibration kit',
   'quantity': 1.0,
   'unit_price': 420.0,
   'amount': 420.0}]}

In [32]:
rich_run = agent.extract(pdf_path)

rich_run.data

{'vendor_name': 'DOTAIZ',
 'invoice_number': '2025-001',
 'invoice_date': '2025-11-23',
 'due_date': '2025-12-22',
 'currency': 'USD',
 'total_amount': 14.85,
 'line_items': [{'description': 'dotaiz subscription',
   'quantity': 1.0,
   'unit_price': 30.0,
   'amount': 30.0}]}

In [39]:
print("Reasoning:", rich_run)


Reasoning: config=ExtractConfig(chunk_mode=<DocumentChunkMode.PAGE: 'PAGE'>, cite_sources=True, confidence_scores=True, extract_model=<ExtractModels.OPENAI_GPT_41: 'openai-gpt-4-1'>, extraction_mode=<ExtractMode.MULTIMODAL: 'MULTIMODAL'>, extraction_target=<ExtractTarget.PER_DOC: 'PER_DOC'>, high_resolution_mode=False, invalidate_cache=False, multimodal_fast_mode=False, num_pages_context=None, page_range=None, parse_model=<PublicModelName.GEMINI_20_FLASH: 'gemini-2.0-flash'>, priority=None, system_prompt=None, use_reasoning=True) created_at=datetime.datetime(2025, 11, 23, 20, 8, 56, 636040, tzinfo=datetime.timezone.utc) data={'vendor_name': 'DOTAIZ', 'invoice_number': '2025-001', 'invoice_date': '2025-11-23', 'due_date': '2025-12-22', 'currency': 'USD', 'total_amount': 14.85, 'line_items': [{'description': 'dotaiz subscription', 'quantity': 1.0, 'unit_price': 30.0, 'amount': 30.0}]} data_schema={'additionalProperties': False, 'properties': {'vendor_name': {'description': 'Company issui

In [45]:
import json

full_payload = {
    "config": rich_run.config.model_dump() if hasattr(rich_run.config, "model_dump") else str(rich_run.config),
    "data": rich_run.data,  # already dict
    "field_metadata": rich_run.extraction_metadata.get("field_metadata"),
    "usage": rich_run.extraction_metadata.get("usage"),
    "status": str(rich_run.status),
    "file_name": getattr(rich_run.file, "name", None)
}

print(json.dumps(full_payload, indent=2, default=str))


{
  "config": "chunk_mode=<DocumentChunkMode.PAGE: 'PAGE'> cite_sources=True confidence_scores=True extract_model=<ExtractModels.OPENAI_GPT_41: 'openai-gpt-4-1'> extraction_mode=<ExtractMode.MULTIMODAL: 'MULTIMODAL'> extraction_target=<ExtractTarget.PER_DOC: 'PER_DOC'> high_resolution_mode=False invalidate_cache=False multimodal_fast_mode=False num_pages_context=None page_range=None parse_model=<PublicModelName.GEMINI_20_FLASH: 'gemini-2.0-flash'> priority=None system_prompt=None use_reasoning=True",
  "data": {
    "vendor_name": "DOTAIZ",
    "invoice_number": "2025-001",
    "invoice_date": "2025-11-23",
    "due_date": "2025-12-22",
    "currency": "USD",
    "total_amount": 14.85,
    "line_items": [
      {
        "description": "dotaiz subscription",
        "quantity": 1.0,
        "unit_price": 30.0,
        "amount": 30.0
      }
    ]
  },
  "field_metadata": {
    "vendor_name": {
      "reasoning": "VERBATIM EXTRACTION",
      "citation": [
        {
          "page": 1,


In [54]:
fm = rich_run.config
fm


ExtractConfig(chunk_mode=<DocumentChunkMode.PAGE: 'PAGE'>, cite_sources=True, confidence_scores=True, extract_model=<ExtractModels.OPENAI_GPT_41: 'openai-gpt-4-1'>, extraction_mode=<ExtractMode.MULTIMODAL: 'MULTIMODAL'>, extraction_target=<ExtractTarget.PER_DOC: 'PER_DOC'>, high_resolution_mode=False, invalidate_cache=False, multimodal_fast_mode=False, num_pages_context=None, page_range=None, parse_model=<PublicModelName.GEMINI_20_FLASH: 'gemini-2.0-flash'>, priority=None, system_prompt=None, use_reasoning=True)

In [56]:
from llama_cloud_services import LlamaExtract
from llama_cloud_services.extract import ExtractConfig, ExtractMode

extractor = LlamaExtract(
    api_key=os.getenv("LLAMA_CLOUD_API_KEY"),  # same name as in .env
)

config = ExtractConfig(
    extraction_mode=ExtractMode.MULTIMODAL  # or FAST if you want cheap/quick
)

stateless_run = extractor.extract(
    Invoice,        # schema class first
    config,         # config second
    pdf_path        # file path third
)

stateless_run.data


{'vendor_name': 'DOTAIZ',
 'invoice_number': '2025-001',
 'invoice_date': '2025-11-23',
 'due_date': '2025-12-22',
 'currency': 'USD',
 'total_amount': 14.85,
 'line_items': [{'description': 'dotaiz subscription',
   'quantity': 1.0,
   'unit_price': 30.0,
   'amount': 30.0}]}

# LlamaExtract Demo Notes (Updated End-to-End Guide)

## Core Concepts

### 1) Schema (Pydantic) = “What you want back”
A **schema** defines the output shape.  
In Python, LlamaExtract uses **Pydantic** classes as schemas.

Benefits:
- Consistent JSON structure
- Strong typing (string, float, list, optional)
- Nested objects support (e.g., line items)

Example:
    class Invoice(BaseModel):
        vendor_name: str
        invoice_number: str
        invoice_date: str
        total_amount: float
        line_items: List[LineItem]

---

### 2) Client (`LlamaExtract`) = connection to LlamaCloud
Create a client with your API key:

    extractor = LlamaExtract(api_key=os.getenv("LLAMA_CLOUD_API_KEY"))

This client supports:
- **Agent-based extraction** (recommended for repeated use)
- **Stateless extraction** (one-off calls without an agent)

---

### 3) Agent = reusable extractor + schema + config
An **agent** is a reusable extraction setup:
- Name
- Schema
- Optional config (citations/reasoning/confidence)

We created one like:

    agent = extractor.create_agent(
        name="invoice-demo-agent",
        data_schema=Invoice,
    )

Why agents?
- Reuse schema for many docs
- Consistent results
- Easy batching/scaling later
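
The batching point above can be sketched as a small loop. This is only an illustration: `extract_many` is a hypothetical helper, not part of the SDK, and it assumes that the callable you pass in (for example `agent.extract`) returns an object with a `.data` attribute, as the runs in this notebook do. `fake_extract` is a stand-in so the sketch runs offline.

```python
from types import SimpleNamespace


def extract_many(extract_fn, sources):
    """Run extract_fn on every source; collect data and per-source errors."""
    results, failures = {}, {}
    for source in sources:
        try:
            results[source] = extract_fn(source).data
        except Exception as exc:  # one bad file should not stop the batch
            failures[source] = repr(exc)
    return results, failures


# Stand-in for agent.extract (hypothetical behaviour, no network calls).
def fake_extract(path):
    if not path.endswith(".pdf"):
        raise ValueError(f"unsupported file: {path}")
    return SimpleNamespace(data={"vendor_name": "DOTAIZ", "source": path})


results, failures = extract_many(fake_extract, ["invoice.pdf", "notes.xyz"])
print("extracted:", sorted(results))
print("failed:", sorted(failures))
```

With a real agent the call would look like `extract_many(agent.extract, ["invoice.pdf", ...])`. Sources are used as dictionary keys here, so plain file paths fit best.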

---

### 4) Inputs = text or file
You can extract from:
1. **Raw text** via `SourceText`
2. **File paths** directly (PDF/images/etc.)

Text:

    run = agent.extract(SourceText(text_content=sample_text))

PDF:

    run_pdf = agent.extract("invoice.pdf")

Same `.extract()` works for both.

---

## Our Demo Workflow

### Step A — Install deps

    %pip install -U llama-cloud-services pydantic python-dotenv

---

### Step B — Set API key
In `.env`:

    LLAMA_CLOUD_API_KEY=llx-xxxx

In notebook:

    from dotenv import load_dotenv
    load_dotenv()
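
A mismatch between the variable name in `.env` and the name passed to `os.getenv()` makes the key silently come back as `None`. A minimal fail-fast check, assuming you want an explicit error instead, could look like this (`require_env` is a hypothetical helper, not part of any library used here):

```python
import os


def require_env(name, environ=os.environ):
    """Return the value of an environment variable or raise a clear error."""
    value = environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to .env or export it before creating the client"
        )
    return value


# Demo with an explicit mapping so the sketch does not depend on your shell.
demo_env = {"LLAMA_CLOUD_API_KEY": "llx-xxxx"}
print(require_env("LLAMA_CLOUD_API_KEY", demo_env))

try:
    require_env("LLAMA_CLOUD", demo_env)  # wrong name: raises instead of returning None
except RuntimeError as exc:
    print("error:", exc)
```

In the notebook you would call `require_env("LLAMA_CLOUD_API_KEY")` after `load_dotenv()` and pass the result as `api_key`.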

---

### Step C — Define schema
We defined an invoice schema with nested line items.

Schema tips:
- Keep names clear (`invoice_number`, not `inv_no`)
- Use `Optional[...]` where missing fields are possible
- Add `Field(description="...")` to boost accuracy
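
To see why the descriptions matter, it helps to look at the JSON Schema that Pydantic generates, since the descriptions travel with it (the `data_schema` printed on the run object above carries them too). The sketch below assumes Pydantic v2; `InvoiceHeader` is a hypothetical trimmed-down model used only for illustration.

```python
from typing import Optional

from pydantic import BaseModel, Field


class InvoiceHeader(BaseModel):
    invoice_number: str = Field(description="Invoice or bill number")
    # default=None makes the field optional in the generated schema as well
    due_date: Optional[str] = Field(default=None, description="Payment due date if present")


schema = InvoiceHeader.model_json_schema()

# Every property keeps its description next to its type information.
for name, spec in schema["properties"].items():
    print(f"{name}: {spec['description']}")

# Only fields without a default are listed as required.
print("required:", schema["required"])
```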

---

### Step D — Create agent

    agent = extractor.create_agent(
        name="invoice-demo-agent",
        data_schema=Invoice,
    )

---

### Step E — Extract

    run = agent.extract(SourceText(text_content=sample_text))
    run.data

Output is structured JSON matching the schema.
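
Matching the schema does not mean the numbers are consistent: extraction reports what the document says. A minimal post-extraction sanity check, using the `run.data` dict from the text example in this notebook, might look like the following. `check_invoice_arithmetic` is a hypothetical helper, and a mismatch is only a hint to review, since real invoices can include tax or discounts.

```python
from math import isclose


def check_invoice_arithmetic(data, tol=0.01):
    """Return a list of human-readable arithmetic inconsistencies."""
    issues = []
    line_total = 0.0
    for i, item in enumerate(data.get("line_items", [])):
        qty = item.get("quantity")
        unit_price = item.get("unit_price")
        amount = item.get("amount")
        if amount is not None:
            line_total += amount
        if qty is not None and unit_price is not None and amount is not None:
            if not isclose(qty * unit_price, amount, abs_tol=tol):
                issues.append(
                    f"line_items[{i}]: {qty} x {unit_price} = {qty * unit_price}, "
                    f"but amount is {amount}"
                )
    total = data.get("total_amount")
    if total is not None and not isclose(line_total, total, abs_tol=tol):
        issues.append(f"total_amount is {total}, but line amounts sum to {line_total}")
    return issues


# Copied from the run.data output of the text example above.
data = {
    "total_amount": 730.0,
    "line_items": [
        {"description": "Battery inspection service",
         "quantity": 2.0, "unit_price": 307.0, "amount": 307.0},
        {"description": "Safety calibration kit",
         "quantity": 1.0, "unit_price": 420.0, "amount": 420.0},
    ],
}

issues = check_invoice_arithmetic(data)
for issue in issues:
    print(issue)
```

On the sample invoice this flags both the first line item and the total, which shows that the sample text itself is internally inconsistent and the extractor faithfully reproduced it.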

---

## Advanced “Wow” Features

### Citations + Reasoning + Confidence
LlamaExtract can include:
- **citations**: where values came from in the doc
- **reasoning**: why a value was chosen
- **confidence scores**: reliability per field

**Important:** in your SDK version, `ExtractConfig` and `ExtractMode` live in the `extract` submodule, so import them with:

    from llama_cloud_services.extract import ExtractConfig, ExtractMode

Enable:

    agent.config = ExtractConfig(
        extraction_mode=ExtractMode.MULTIMODAL,
        cite_sources=True,
        use_reasoning=True,
        confidence_scores=True,
    )

Then:

    rich_run = agent.extract(SourceText(text_content=sample_text))
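
Once confidence scores are enabled, a common next step is to route uncertain fields to a human. The sketch below is an assumption-heavy illustration: the key name `confidence` and the sample values are hypothetical, so check the actual keys in `rich_run.extraction_metadata["field_metadata"]` for your SDK version and pass the right one via `key`.

```python
def low_confidence_fields(field_metadata, threshold=0.8, key="confidence"):
    """Return {field_name: score} for top-level fields scoring below threshold."""
    flagged = {}
    for name, meta in field_metadata.items():
        # Skip nested lists and entries without a numeric score.
        if isinstance(meta, dict) and isinstance(meta.get(key), (int, float)):
            if meta[key] < threshold:
                flagged[name] = meta[key]
    return flagged


# Illustrative metadata only; not real SDK output.
sample_metadata = {
    "vendor_name": {"reasoning": "VERBATIM EXTRACTION", "confidence": 0.97},
    "currency": {"reasoning": "example reasoning text", "confidence": 0.55},
    "line_items": [],
}

flagged = low_confidence_fields(sample_metadata)
print(flagged)
```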

---

## How to Read Reasoning in *Your* Output
You saw a big object printout because you printed the whole run, not just reasoning.

In your SDK output, **reasoning is per-field** inside:

    rich_run.extraction_metadata["field_metadata"][FIELD]["reasoning"]

Example:

    fm = rich_run.extraction_metadata["field_metadata"]
    print("vendor_name reasoning:", fm["vendor_name"]["reasoning"])

Line items reasoning is nested:

    for i, item_meta in enumerate(fm["line_items"]):
        print("line item", i, "reasoning:", item_meta.get("reasoning"))
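
Because the exact nesting under `line_items` may differ between SDK versions, a generic walker can be more robust than hard-coded indexing. The sketch below treats any dict that has a `reasoning` key as a leaf and recurses into everything else. `iter_reasoning` is a hypothetical helper, and `sample_meta` is illustrative, modelled on the `vendor_name` entry printed above.

```python
def iter_reasoning(meta, path=""):
    """Yield (field_path, reasoning) pairs from nested field metadata."""
    if isinstance(meta, dict):
        if "reasoning" in meta:
            yield path, meta["reasoning"]
            return
        for key, value in meta.items():
            yield from iter_reasoning(value, f"{path}.{key}" if path else key)
    elif isinstance(meta, list):
        for i, value in enumerate(meta):
            yield from iter_reasoning(value, f"{path}[{i}]")


# Illustrative structure; the nesting for line items is an assumption.
sample_meta = {
    "vendor_name": {"reasoning": "VERBATIM EXTRACTION", "citation": [{"page": 1}]},
    "line_items": [
        {"description": {"reasoning": "VERBATIM EXTRACTION"}},
    ],
}

pairs = list(iter_reasoning(sample_meta))
for field_path, reasoning in pairs:
    print(f"{field_path}: {reasoning}")
```

One caveat of this design: a schema field that is itself named `reasoning` would be mistaken for a leaf, so rename the marker check if your schema uses that name.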

---

## Data Dumping (Important Update)
In your SDK version:

- `rich_run.data` is already a **plain dict**  
- so calling **`.model_dump()`** on it raises `AttributeError` (a dict has no such method)

Correct dump:

    import json
    print(json.dumps(rich_run.data, indent=2))

If you want a Pydantic object anyway:

    invoice_obj = Invoice(**rich_run.data)
    invoice_obj.model_dump()   # now works (pydantic v2)

(Pydantic v1 users can use `.dict()` instead of `.model_dump()`.)
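
Converting the dict back into a Pydantic object also gives you a second validation pass. The sketch below assumes Pydantic v2 and uses trimmed-down versions of the notebook's models. `to_invoice` is a hypothetical helper that reports which fields failed instead of raising.

```python
from typing import List, Optional

from pydantic import BaseModel, ValidationError


class LineItem(BaseModel):
    description: str
    amount: Optional[float] = None


class Invoice(BaseModel):
    vendor_name: str
    total_amount: float
    line_items: List[LineItem]


def to_invoice(data):
    """Return (invoice, errors); exactly one of the two is None."""
    try:
        return Invoice.model_validate(data), None
    except ValidationError as exc:
        # Each error carries the path ("loc") of the offending field.
        return None, [".".join(str(part) for part in err["loc"]) for err in exc.errors()]


good = {
    "vendor_name": "DOTAIZ",
    "total_amount": 14.85,
    "line_items": [{"description": "dotaiz subscription", "amount": 30.0}],
}
bad = {"vendor_name": "DOTAIZ", "line_items": [{"amount": 30.0}]}

invoice, errors = to_invoice(good)
print(invoice.model_dump())

_, bad_errors = to_invoice(bad)
print("invalid fields:", bad_errors)
```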

---

## Stateless Extraction (One-Off) — Correct Signature
You hit a `TypeError` because stateless `.extract()` has a different signature than agent `.extract()`.

Correct stateless call order:

    stateless_run = extractor.extract(
        Invoice,        # schema class FIRST
        config,         # config SECOND
        pdf_path        # file path or SourceText THIRD
    )

Example for PDF:

    config = ExtractConfig(extraction_mode=ExtractMode.MULTIMODAL)

    stateless_run = extractor.extract(
        Invoice,
        config,
        "invoice.pdf"
    )

Example for text:

    stateless_run = extractor.extract(
        Invoice,
        ExtractConfig(extraction_mode=ExtractMode.FAST),
        SourceText(text_content=sample_text)
    )

When to use:
- **Agent-based:** repeated extraction with same schema
- **Stateless:** quick experiments / single doc

---

## Common Issues We Hit

### 1) `ModuleNotFoundError: llama_cloud_services`
Cause: SDK not installed in the notebook kernel.  
Fix:

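    %pip install -U llama-cloud-services

Also make sure the Jupyter kernel runs in the same environment the package was installed into; `%pip` (unlike `!pip`) installs into the kernel's own environment.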
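
A quick way to confirm which interpreter the kernel uses, and whether the package is visible to it, is the standard-library check below. `is_installed` is a hypothetical helper name; the calls it uses are standard Python.

```python
import importlib.util
import sys


def is_installed(module_name):
    """True if the top-level module can be found by the current interpreter."""
    return importlib.util.find_spec(module_name) is not None


print("Kernel interpreter:", sys.executable)
print("llama_cloud_services importable:", is_installed("llama_cloud_services"))
```

If the interpreter path is not the environment you installed into, switch the notebook kernel or reinstall from inside the notebook.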

---

### 2) `ImportError: cannot import ExtractConfig`
Cause: older SDK doesn’t export config from top-level.  
Fix:

    from llama_cloud_services.extract import ExtractConfig, ExtractMode

---

## When to Prefer LlamaExtract
Use LlamaExtract when you want:
- Schema-first, reliable structured JSON
- Strong type validation
- Repeatable extraction at scale
- Optional auditability via citations/confidence

Use normal LLM JSON mode when:
- Schema changes constantly
- You only need a one-off quick parse

---

## Final Takeaway
**LlamaExtract = schema-first extraction.**  
Define a Pydantic schema → pass text/PDF → get validated JSON → optionally get per-field citations, reasoning, and confidence for explainability.
