# Complete Parse → Classify → Extract Workflow with LlamaCloud Services

This notebook demonstrates the complete workflow for processing documents using LlamaCloud services:
1. **Parse** - Extract and convert documents to markdown
2. **Classify** - Categorize documents based on their content
3. **Extract** - Extract structured data using the markdown as input via SourceText

## Overview of the Workflow

### 1. Parse Phase
- Use `LlamaParse` to convert documents (PDFs, Word docs, etc.) into structured formats
- Extract markdown content that preserves document structure
- Get both raw text and markdown representations

### 2. Classify Phase
- Use `ClassifyClient` to categorize documents based on content
- Apply classification rules to route documents appropriately
- Handle different document types with specific processing logic

### 3. Extract Phase
- Use `LlamaExtract` with `SourceText` to extract structured data
- Pass the markdown content as input for more accurate extraction
- Define custom schemas for structured data extraction

Let's walk through each step with practical examples.

## Setup and Installation

In [None]:
# Install required packages
!pip install llama-cloud-services
!pip install python-dotenv

In [None]:
import os
import nest_asyncio
from getpass import getpass
from dotenv import load_dotenv

# Load environment variables
load_dotenv()
nest_asyncio.apply()

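# Set up API key: keep a value loaded from .env, otherwise prompt for it
if not os.environ.get("LLAMA_CLOUD_API_KEY"):
    os.environ["LLAMA_CLOUD_API_KEY"] = getpass("LlamaCloud API key: ")

# Set up base URL
# os.environ["LLAMA_CLOUD_BASE_URL"] = "https://api.cloud.eu.llamaindex.ai/"  # update if necessary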

print("✅ API key configured")

✅ API key configured


## Download Sample Documents

Let's download some sample documents to work with:

In [None]:
import requests
import os

# Create directory for sample documents
os.makedirs("sample_docs", exist_ok=True)

# Download sample documents
docs_to_download = {
    "financial_report.pdf": "https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/10k/uber_2021.pdf",
    "technical_spec.pdf": "https://www.ti.com/lit/ds/symlink/lm317.pdf",
}

for filename, url in docs_to_download.items():
    filepath = f"sample_docs/{filename}"
    if not os.path.exists(filepath):
        print(f"Downloading {filename}...")
        # A timeout keeps the cell from hanging if a host stops responding
        response = requests.get(url, timeout=60)
        if response.status_code == 200:
            with open(filepath, "wb") as f:
                f.write(response.content)
            print(f"✅ Downloaded {filename}")
        else:
            print(f"❌ Failed to download {filename}")
    else:
        print(f"📁 {filename} already exists")

print("\n📂 Sample documents ready!")

📁 financial_report.pdf already exists
📁 technical_spec.pdf already exists

📂 Sample documents ready!
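
A `200` status does not guarantee that the response body is a PDF, since some hosts answer with an HTML page instead. The sketch below checks the `%PDF-` header that PDF files start with before writing anything to disk. The helper names `looks_like_pdf` and `save_if_pdf` are hypothetical and are not part of the notebook or of any LlamaCloud package.

```python
from pathlib import Path


def looks_like_pdf(data: bytes) -> bool:
    """Return True if the payload starts with the PDF magic header."""
    return data.startswith(b"%PDF-")


def save_if_pdf(data: bytes, destination: Path) -> bool:
    """Write `data` to `destination` only when it looks like a PDF.

    Returns True when the file was written, False when it was skipped.
    """
    if not looks_like_pdf(data):
        return False
    destination.parent.mkdir(parents=True, exist_ok=True)
    destination.write_bytes(data)
    return True
```

In the download loop, `save_if_pdf(response.content, Path(filepath))` could stand in for the plain `open(...)`/`write(...)` pair, with a failure message printed when it returns `False`.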


## Phase 1: Document Parsing

First, let's parse our documents using LlamaParse to extract clean markdown content.

In [None]:
from llama_cloud_services.parse.base import LlamaParse
from llama_cloud_services.parse.utils import ResultType
import asyncio

# Initialize the parser
parser = LlamaParse(
    result_type=ResultType.MD,  # Get markdown output
    verbose=True,
    language="en",
    # Premium mode for better accuracy
    premium_mode=True,
    # Extract tables as HTML for better structure
    output_tables_as_HTML=True,
    # No page limit is set here, so every page of each document is parsed
)

print("🔄 Parsing documents...")

# Parse the financial report
financial_result = await parser.aparse("sample_docs/financial_report.pdf")
print(f"✅ Parsed financial report (Job ID: {financial_result.job_id})")

# Parse the technical specification
technical_result = await parser.aparse("sample_docs/technical_spec.pdf")
print(f"✅ Parsed technical spec (Job ID: {technical_result.job_id})")

print("\n📄 Parsing complete!")

🔄 Parsing documents...
Started parsing the file under job_id 8a8c76f9-354d-4275-91d8-312ff1adc762
...✅ Parsed financial report (Job ID: 8a8c76f9-354d-4275-91d8-312ff1adc762)
Started parsing the file under job_id 7e603448-ed80-4d18-948b-6801ed51c41b
✅ Parsed technical spec (Job ID: 7e603448-ed80-4d18-948b-6801ed51c41b)

📄 Parsing complete!
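
For a quick demo it can help to parse only the first few pages. The sketch below builds a page selection string. It assumes that the installed `LlamaParse` accepts a `target_pages` argument holding comma-separated, zero-based page indices. Check this against the version you have installed before relying on it. The helper name `first_pages` is hypothetical.

```python
def first_pages(n: int) -> str:
    """Build a comma-separated, zero-based page list such as "0,1,2"."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return ",".join(str(i) for i in range(n))


# Assumed usage, not executed here because it needs an API key:
# parser = LlamaParse(result_type=ResultType.MD, target_pages=first_pages(3))
print(first_pages(3))  # prints 0,1,2
```

Limiting pages shortens parse time for large files such as the 10-K, at the cost of extraction only seeing the selected pages.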


### Extract Markdown Content

Now let's get the markdown content from our parsed documents:

In [None]:
# Get markdown content from parsed documents
financial_markdown = await financial_result.aget_markdown()
technical_markdown = await technical_result.aget_markdown()

print("📋 Financial Report Markdown (first 500 chars):")
print(financial_markdown[:500])
print("...\n")

print("📋 Technical Spec Markdown (first 500 chars):")
print(technical_markdown[:500])
print("...\n")

print(f"📏 Financial report markdown length: {len(financial_markdown)} characters")
print(f"📏 Technical spec markdown length: {len(technical_markdown)} characters")

document_texts = [financial_markdown, technical_markdown]

📋 Financial Report Markdown (first 500 chars):


# UNITED STATES
# SECURITIES AND EXCHANGE COMMISSION
Washington, D.C. 20549

## FORM 10-K

(Mark One)

☒ ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934
For the fiscal year ended December 31, 2021
OR
☐ TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934
For the transition period from_____ to _____
Commission File Number: 001-38902

# UBER TECHNOLOGIES, INC.
(Exact name of registrant as specified in its charter)

Delaware
...

📋 Technical Spec Markdown (first 500 chars):


LM317
SLVS044Z – SEPTEMBER 1997 – REVISED APRIL 2025

# LM317 3-Pin Adjustable Regulator

## 1 Features

• Output voltage range:
  – Adjustable: 1.25V to 37V
• Output current: 1.5A
• Line regulation: 0.01%/V (typ)
• Load regulation: 0.1% (typ)
• Internal short-circuit current limiting
• Thermal overload protection
• Output safe-area compensation (new chip)
• PSRR: 80dB at 120Hz for CADJ = 10μF (ne

## Phase 2: Document Classification

Next, let's set up classification: create a `ClassifyClient` and define the rules. The classification call itself runs later, inside the workflow function.

In [None]:
from llama_cloud_services.beta.classifier.client import ClassifyClient
from llama_cloud.types import ClassifierRule
from llama_cloud_services.files.client import FileClient
from llama_cloud.client import AsyncLlamaCloud

# Initialize the classify client
api_key = os.environ["LLAMA_CLOUD_API_KEY"]
classify_client = ClassifyClient.from_api_key(api_key)

print("🏷️  Setting up document classification...")

# Define classification rules
classification_rules = [
    ClassifierRule(
        type="financial_document",
        description="Documents containing financial data, revenue, expenses, SEC filings, or financial statements",
    ),
    ClassifierRule(
        type="technical_specification",
        description="Technical datasheets, component specifications, engineering documents, or technical manuals",
    ),
    ClassifierRule(
        type="general_document",
        description="General business documents, contracts, or other unspecified document types",
    ),
]

print(f"📝 Created {len(classification_rules)} classification rules")

🏷️  Setting up document classification...
📝 Created 3 classification rules


## Phase 3: Structured Data Extraction using SourceText

Now comes the key part - using the markdown content as input for structured data extraction via SourceText.

In [None]:
from llama_cloud_services.extract.extract import LlamaExtract, SourceText
from llama_cloud.types import ExtractConfig, ExtractMode
from pydantic import BaseModel, Field
from typing import List, Optional

# Initialize LlamaExtract
llama_extract = LlamaExtract(api_key=api_key, verbose=True)

print("⚙️  LlamaExtract initialized")

⚙️  LlamaExtract initialized


### Define Extraction Schemas

Let's define different schemas for different document types:

In [None]:
# Schema for financial documents
class FinancialMetrics(BaseModel):
    company_name: str = Field(description="Name of the company")
    document_type: str = Field(
        description="Type of financial document (10-K, 10-Q, annual report, etc.)"
    )
    fiscal_year: int = Field(description="Fiscal year of the report")
    revenue_2021: str = Field(description="Total revenue in 2021")
    net_income_2021: str = Field(description="Net income in 2021")
    key_business_segments: List[str] = Field(
        default=[], description="Main business segments or divisions"
    )
    risk_factors: List[str] = Field(
        default=[], description="Key risk factors mentioned"
    )


# Schema for technical specifications
class VoltageRange(BaseModel):
    min_voltage: Optional[float] = Field(description="Minimum voltage")
    max_voltage: Optional[float] = Field(description="Maximum voltage")
    unit: str = Field(default="V", description="Voltage unit")


class TechnicalSpec(BaseModel):
    component_name: str = Field(description="Name of the technical component")
    manufacturer: Optional[str] = Field(description="Manufacturer name")
    part_number: Optional[str] = Field(description="Part or model number")
    description: str = Field(description="Brief description of the component")
    operating_voltage: Optional[VoltageRange] = Field(
        description="Operating voltage range"
    )
    maximum_current: Optional[float] = Field(
        description="Maximum current rating in amperes"
    )
    key_features: List[str] = Field(
        default=[], description="Key features and capabilities"
    )
    applications: List[str] = Field(default=[], description="Typical applications")


print("📋 Extraction schemas defined")

📋 Extraction schemas defined


## Complete Workflow Summary

Let's create a function that demonstrates the complete workflow:

In [None]:
import tempfile
from pathlib import Path
from llama_cloud import ExtractConfig


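async def complete_document_workflow(markdown_content: str):
    """
    Classify a parsed document, then extract structured data from it.

    Expects the markdown produced by the Parse phase (Step 1) and runs
    Step 2 (Classify) and Step 3 (Extract).
    """
    print("🚀 Starting complete workflow")
    print("=" * 60)

    # Step 2: Classify (Step 1, parsing, already produced markdown_content)
    print("🏷️  Step 2: Classifying document...")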

    with tempfile.NamedTemporaryFile(
        mode="w", suffix=".md", delete=False, encoding="utf-8"
    ) as tmp:
        tmp.write(markdown_content)
        temp_path = Path(tmp.name)

    print(temp_path)

    classification = await classify_client.aclassify_file_path(
        rules=classification_rules, file_input_path=str(temp_path)
    )
    doc_type = classification.items[0].result.type
    confidence = classification.items[0].result.confidence
    print(f"   ✅ Classified as: {doc_type} (confidence: {confidence:.2f})")

    # Step 3: Extract based on classification
    print("🔍 Step 3: Extracting structured data using SourceText...")
    source_text = SourceText(
        text_content=markdown_content,
        # Use the stem so the name does not end in ".md_markdown.md"
        filename=f"{temp_path.stem}_markdown.md",
    )

    # Choose schema based on classification
    if "financial" in doc_type.lower():
        schema = FinancialMetrics
        print("   📊 Using FinancialMetrics schema")
    elif "technical" in doc_type.lower():
        schema = TechnicalSpec
        print("   🔧 Using TechnicalSpec schema")
    else:
        schema = FinancialMetrics  # Default fallback
        print("   📊 Using default FinancialMetrics schema")

    extract_config = ExtractConfig(
        extraction_mode="BALANCED",
    )

    extraction_result = llama_extract.extract(
        data_schema=schema, config=extract_config, files=source_text
    )

    print("   ✅ Extraction complete!")

    return {
        "file_path": temp_path,
        "markdown_length": len(markdown_content),
        "classification": doc_type,
        "confidence": confidence,
        "extracted_data": extraction_result.data,
        "markdown_sample": markdown_content[:200] + "..."
        if len(markdown_content) > 200
        else markdown_content,
    }


print("🔧 Workflow function defined!")

🔧 Workflow function defined!
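
The `if`/`elif` routing in the workflow function can also be written as a lookup table, which makes it easier to add document types. The sketch below is one possible alternative, not the notebook's method. `pick_schema`, the registry and the `0.8` confidence threshold are all hypothetical. Plain strings stand in for the `FinancialMetrics` and `TechnicalSpec` classes so that the block runs on its own.

```python
from typing import Dict, Optional, TypeVar

T = TypeVar("T")


def pick_schema(
    doc_type: str,
    confidence: float,
    registry: Dict[str, T],
    min_confidence: float = 0.8,  # hypothetical threshold, tune per use case
) -> Optional[T]:
    """Return the schema registered for `doc_type`, or None when unsure."""
    if confidence < min_confidence:
        return None
    return registry.get(doc_type)


# Stand-in values; in the notebook these would be the Pydantic classes.
SCHEMAS = {
    "financial_document": "FinancialMetrics",
    "technical_specification": "TechnicalSpec",
}

print(pick_schema("financial_document", 1.0, SCHEMAS))
print(pick_schema("general_document", 1.0, SCHEMAS))
print(pick_schema("technical_specification", 0.4, SCHEMAS))
```

Unlike the fallback to `FinancialMetrics`, this variant returns `None` for unregistered or low-confidence types, so the caller can skip extraction rather than apply a schema that may not fit.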


## Run Complete Workflow on Both Documents

In [None]:
# Process both documents through the complete workflow
results = []

for i, doc_text in enumerate(document_texts, 1):
    try:
        result = await complete_document_workflow(doc_text)
        results.append(result)
        print("\n" + "=" * 60 + "\n")
    except Exception as e:
        # Report the document by position; no file path is available here
        print(f"❌ Error processing document {i}: {e}")
        print("\n" + "=" * 60 + "\n")

print(f"📋 Processed {len(results)} documents successfully!")

🚀 Starting complete workflow
🏷️  Step 2: Classifying document...
/var/folders/g6/4b5lpp5974gcpr890ybhbw4r0000gn/T/tmpos3b62tm.md
   ✅ Classified as: financial_document (confidence: 1.00)
🔍 Step 3: Extracting structured data using SourceText...
   📊 Using FinancialMetrics schema
..   ✅ Extraction complete!


🚀 Starting complete workflow
🏷️  Step 2: Classifying document...
/var/folders/g6/4b5lpp5974gcpr890ybhbw4r0000gn/T/tmpppz9ub_m.md
   ✅ Classified as: technical_specification (confidence: 1.00)
🔍 Step 3: Extracting structured data using SourceText...
   🔧 Using TechnicalSpec schema
   ✅ Extraction complete!


📋 Processed 2 documents successfully!
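
The temporary `.md` files printed above are created with `delete=False` and are never removed, so they accumulate across runs. A minimal sketch of a context manager that writes the markdown and cleans up afterwards is shown below. The name `markdown_tempfile` is hypothetical, and the sketch assumes the file is only needed while the classification call is running.

```python
import tempfile
from contextlib import contextmanager
from pathlib import Path
from typing import Iterator


@contextmanager
def markdown_tempfile(markdown_content: str) -> Iterator[Path]:
    """Write markdown to a temporary .md file and delete it on exit."""
    with tempfile.NamedTemporaryFile(
        mode="w", suffix=".md", delete=False, encoding="utf-8"
    ) as tmp:
        tmp.write(markdown_content)
        temp_path = Path(tmp.name)
    try:
        # The file is closed here, so other code can open it by path.
        yield temp_path
    finally:
        temp_path.unlink(missing_ok=True)


with markdown_tempfile("# Title\n") as path:
    print(path.suffix, path.exists())
```

Inside the workflow function, the classification call would sit inside the `with` block. Because the file is gone afterwards, the returned result would need to carry something other than the temporary path as its document identifier.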


## Final Results Summary

In [None]:
print("📈 COMPLETE WORKFLOW RESULTS SUMMARY")
print("=" * 70)

for i, result in enumerate(results, 1):
    print(f"\n📄 Document {i}: {os.path.basename(result['file_path'])}")
    print(
        f"   📊 Classification: {result['classification']} (confidence: {result['confidence']:.2f})"
    )
    print(f"   📝 Markdown length: {result['markdown_length']:,} characters")
    print(f"   📋 Markdown sample: {result['markdown_sample'][:100]}...")
    print(f"   🎯 Extracted fields: {len(result['extracted_data'])} fields")

    # Print all key–value pairs
    extracted = result["extracted_data"]
    for key, value in extracted.items():
        print(f"   • {key}: {value}")

print("\n✨ Workflow completed successfully!")
print("\n📚 Key Learnings:")
print("   • Parse: Converted documents to clean markdown format")
print("   • Classify: Automatically categorized document types")
print("   • Extract: Used SourceText with markdown for structured data extraction")
print(
    "   • Parsed markdown gives extraction cleaner, more structured input than raw PDFs"
)

📈 COMPLETE WORKFLOW RESULTS SUMMARY

📄 Document 1: tmpos3b62tm.md
   📊 Classification: financial_document (confidence: 1.00)
   📝 Markdown length: 1,348,671 characters
   📋 Markdown sample: 

# UNITED STATES
# SECURITIES AND EXCHANGE COMMISSION
Washington, D.C. 20549

## FORM 10-K

(Mark O...
   🎯 Extracted fields: 7 fields
   • company_name: Uber Technologies, Inc.
   • document_type: Annual Report on Form 10-K
   • fiscal_year: 2021
   • revenue_2021: $21,764
   • net_income_2021: $(496)
   • key_business_segments: ['Mobility', 'Delivery', 'Freight', 'All Other (including former New Mobility, e-bikes, e-scooters, Advanced Technologies Group and other technology programs)']
   • risk_factors: ["The company faces numerous risk factors across its business operations and environment. The COVID-19 pandemic and related mitigation measures have adversely affected parts of the business, including reduced demand for Mobility offerings and creating ongoing uncertainties. The company's operatio
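
Because `revenue_2021` and `net_income_2021` are declared as `str`, the values come back as formatted text such as `$21,764` and `$(496)`. The sketch below shows one way to turn such strings into numbers, reading a parenthesized figure as negative (accounting notation). The helper `parse_amount` is hypothetical, and it does not infer units such as thousands or millions. Those have to come from the source document.

```python
import re
from typing import Optional

_NUMBER = re.compile(r"\d[\d,]*(?:\.\d+)?")
# A number wrapped in parentheses, optionally with a dollar sign inside.
_PARENTHESIZED = re.compile(r"\(\s*\$?\s*\d[\d,]*(?:\.\d+)?\s*\)")


def parse_amount(text: str) -> Optional[float]:
    """Convert strings such as "$21,764" or "$(496)" to a float.

    Returns None when the text contains no number.
    """
    match = _NUMBER.search(text)
    if match is None:
        return None
    value = float(match.group().replace(",", ""))
    negative = bool(_PARENTHESIZED.search(text)) or text.strip().startswith("-")
    return -value if negative else value


print(parse_amount("$21,764"))  # prints 21764.0
print(parse_amount("$(496)"))   # prints -496.0
```

An alternative is to declare the schema fields as numeric types and describe the expected unit in the field description, so that the conversion happens during extraction.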

## Conclusion

This notebook demonstrated the complete **Parse → Classify → Extract** workflow using LlamaCloud services:

### Key Components:

1. **LlamaParse** (`llama_cloud_services.parse.base.LlamaParse`):
   - Converts documents to clean, structured markdown
   - Preserves document structure and formatting
   - Handles various file types (PDF, DOCX, etc.)

2. **ClassifyClient** (`llama_cloud_services.beta.classifier.client.ClassifyClient`):
   - Automatically categorizes documents based on content
   - Uses customizable rules for classification
   - Provides confidence scores for classifications

3. **LlamaExtract with SourceText** (`llama_cloud_services.extract.extract.LlamaExtract`, `SourceText`):
   - Extracts structured data using custom Pydantic schemas
   - **SourceText** allows using markdown content as input instead of raw files
   - Lets extraction work from processed markdown, which is typically cleaner input than a raw file

### Workflow Benefits:

- **Better Accuracy**: Using markdown from parsing provides cleaner, more structured input for extraction
- **Automatic Routing**: Classification allows different processing logic for different document types
- **Structured Output**: Custom schemas ensure consistent, structured data extraction
- **Flexible Input**: SourceText supports text content, file paths, and bytes

### Key Insights:

1. **SourceText is the bridge**: It allows you to pass the clean markdown content from parsing directly to extraction
2. **Markdown can improve extraction**: Pre-processed markdown gives the extractor cleaner, more structured context than a raw PDF
3. **Classification enables smart routing**: Different document types can use different extraction schemas
4. **End-to-end automation**: The entire workflow can be automated for production use

This approach is ideal for production document processing pipelines where you need to:
- Process various document types automatically
- Extract structured data consistently
- Maintain high accuracy and reliability
- Handle documents at scale

Together, these three services form a flexible pipeline for complex, real-world document processing.
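
For processing at scale, documents can be run concurrently with a cap on the number of requests in flight. The sketch below is a generic pattern and not part of LlamaCloud: `run_bounded` and `fake_workflow` are hypothetical names, and `fake_workflow` stands in for `complete_document_workflow`. It assumes the real workflow awaits its network calls. Any synchronous call inside it would block the other tasks.

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


async def run_bounded(
    items: Sequence[T],
    worker: Callable[[T], Awaitable[R]],
    max_concurrency: int = 4,
) -> List[object]:
    """Run `worker` over `items` with at most `max_concurrency` in flight.

    Results keep the input order. A failed item yields its exception
    object instead of aborting the whole batch.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(item: T) -> R:
        async with semaphore:
            return await worker(item)

    return await asyncio.gather(
        *(guarded(item) for item in items), return_exceptions=True
    )


async def fake_workflow(text: str) -> int:
    """Stand-in for the real workflow; returns the text length."""
    await asyncio.sleep(0)
    if not text:
        raise ValueError("empty document")
    return len(text)


results = asyncio.run(run_bounded(["# a", "", "# bcd"], fake_workflow, 2))
print(results)
```

In a notebook cell, `await run_bounded(document_texts, complete_document_workflow)` would replace the `asyncio.run(...)` call, and each entry in the result list should be checked with `isinstance(entry, Exception)` before use.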