# Hybrid PDF Parser Demo

This notebook demonstrates the high-level API of the Hybrid PDF Parser.

## Setup

First, create an extractor and configure the provider and credentials:


In [None]:
# Import the simple API
from hybrid_pdf_parser import PDFExtractor

# Create extractor
extractor = PDFExtractor()

# Configure with OpenAI (credentials auto-loaded from .env or env vars)
extractor.config(
    provider="openai",
    # api_key="sk-your-key-here",  # Or omit to use OPENAI_API_KEY from the environment
    vision_model="gpt-4o-mini"
)

print("✓ Extractor configured")
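
Because the key is read from the environment when `api_key` is omitted, it can help to check that it is set before running an extraction. The following is a minimal sketch: `has_openai_key` is a hypothetical helper, not part of `hybrid_pdf_parser`, and it only inspects the process environment, so a key that lives only in a `.env` file will not be detected here.

```python
import os


def has_openai_key(env=os.environ):
    """Return True if OPENAI_API_KEY is set to a non-empty value.

    Hypothetical helper for illustration; it does not read .env files.
    """
    return bool(env.get("OPENAI_API_KEY", "").strip())


if not has_openai_key():
    print("OPENAI_API_KEY is not set in this process; "
          "export it or pass api_key=... to extractor.config().")
```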


## Extract a PDF

Simple extraction with auto-save:


In [None]:
# Extract PDF
result = extractor.extract(
    "example.pdf",
    output_md="output.md",  # Optional: auto-saves if provided
    report_jsonl="report.jsonl",  # Optional: provenance report
)

print(f"✓ Processed {result.pages} pages")
print(f"✓ Segment stats: {result.stats}")
print(f"✓ Output saved to: {result.markdown_path}")
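
The provenance report is written as JSON Lines, so it can be inspected with the standard library alone. The sketch below assumes only that each non-blank line of `report.jsonl` is one JSON value; the fields inside each record are not documented here, so it prints the first record rather than assuming a schema. `read_jsonl` is a hypothetical helper name.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Parse a JSON Lines file into a list of Python objects, skipping blank lines."""
    records = []
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


report_path = Path("report.jsonl")
if report_path.exists():
    records = read_jsonl(report_path)
    print(f"{len(records)} report records")
    if records:
        print(records[0])  # inspect one record to learn the actual schema
```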


## View Results in Jupyter

The extracted markdown is returned as `result.markdown` and can be printed directly:


In [None]:
# Display markdown (first 1000 chars as preview)
print(result.markdown[:1000])
if len(result.markdown) > 1000:
    print("...")  # Mark the preview as truncated
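
Printing shows the raw markdown source. In a notebook the same text can also be rendered as formatted output. This is a minimal sketch, assuming IPython is importable (as it is inside Jupyter): `preview` is a hypothetical helper, and `sample` stands in for `result.markdown` so the cell runs on its own.

```python
def preview(text, limit=1000):
    """Return at most `limit` characters, plus an ellipsis line if truncated."""
    return text if len(text) <= limit else text[:limit] + "\n..."


try:
    from IPython.display import Markdown, display
except ImportError:  # Not running under IPython/Jupyter
    Markdown = display = None

sample = "# Title\n\nSome *extracted* text."  # stand-in for result.markdown

if display is not None:
    display(Markdown(preview(sample)))  # rendered output in a notebook
else:
    print(preview(sample))  # plain-text fallback
```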


## Extract Without Saving

Just get the markdown string without writing to disk:


In [None]:
# Extract and just get the markdown string
result = extractor.extract("example.pdf")

# Use the markdown directly
markdown_text = result.markdown
print(f"Got {len(markdown_text)} characters of markdown")


## Using Ollama (Local Models)

Use local open-source models:


In [None]:
extractor = PDFExtractor()
extractor.config(
    provider="ollama",
    vision_model="qwen2.5-vl",
    adjudicator_model="llama3.1",
)

result = extractor.extract("example.pdf")
print(f"Ollama extraction: {result.pages} pages")
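
A local extraction depends on a running Ollama server. The sketch below checks reachability before calling `extract`, assuming Ollama's default address `http://localhost:11434` and its `/api/tags` endpoint, which lists locally available models. `ollama_is_up` is a hypothetical helper, not part of `hybrid_pdf_parser`, and it does not verify that the specific models configured above have been pulled.

```python
import urllib.error
import urllib.request


def ollama_is_up(base_url="http://localhost:11434", timeout=2.0):
    """Return True if an Ollama server answers GET /api/tags at base_url."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or HTTP error status
        return False


if not ollama_is_up():
    print("Ollama does not appear to be running on localhost:11434.")
```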
