Merged
38 changes: 38 additions & 0 deletions .github/workflows/update-docs.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,38 @@
name: Update reference docs

on:
  pull_request:
    paths:
      - "src/parxy_cli/commands/**"
      - "src/parxy_cli/cli.py"
      - "src/parxy_core/models/config.py"
      - "scripts/generate_docs.py"

jobs:
  update-docs:
    name: Regenerate reference docs
    runs-on: ubuntu-latest
    permissions:
      contents: write

    steps:
      - uses: actions/checkout@v6
        with:
          fetch-depth: 1

      - name: Install uv
        uses: astral-sh/setup-uv@v7.3.1
        with:
          enable-cache: true

      - name: Install dependencies
        run: uv sync

      - name: Generate reference docs
        run: uv run python scripts/generate_docs.py

      - name: Commit if changed
        uses: stefanzweifel/git-auto-commit-action@v7.1.0
        with:
          commit_message: "docs: sync CLI and configuration reference"
          file_pattern: "docs/reference/*.md"
5 changes: 5 additions & 0 deletions docs/howto/add_new_parser.md
@@ -1,3 +1,8 @@
---
title: Add a new parser
description: How to implement a custom driver, register it with Parxy at runtime, and make it available alongside the built-in parsers.
---

# How to Add a New Parser to Parxy

Parxy is designed to be **extensible** — you can integrate new parsing backends (drivers) or create custom variants of existing ones directly from your Python code, without modifying the core library.
5 changes: 5 additions & 0 deletions docs/howto/batch_processing.md
@@ -1,3 +1,8 @@
---
title: Process multiple documents in parallel
description: How to use Parxy's batch API to parse many documents concurrently, control worker count, handle per-file errors, and collect structured results.
---

# How to Process Multiple Documents in Parallel

Parxy provides a `batch` method for processing multiple documents in parallel, with support for per-file configuration. This is useful when you need to parse many documents efficiently or when different documents require different parsing strategies.
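
The general pattern behind this kind of batch processing (a worker pool, per-file options, and errors collected per file instead of aborting the run) can be sketched with the standard library. This is not Parxy's `batch` API, whose signature is not shown here: `parse_stub`, `parse_many`, and the `level` option are hypothetical names used only for illustration.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def parse_stub(path: str, level: str = "page") -> dict:
    """Hypothetical stand-in for a real parser call; not part of Parxy."""
    if not path.endswith(".pdf"):
        raise ValueError(f"unsupported file type: {path}")
    return {"path": path, "level": level}


def parse_many(tasks, workers: int = 4):
    """Run (path, options) tasks concurrently; return (results, errors) keyed by path."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each task carries its own options, so files can be parsed differently.
        futures = {pool.submit(parse_stub, path, **opts): path for path, opts in tasks}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:  # one failing file must not stop the others
                errors[path] = str(exc)
    return results, errors


results, errors = parse_many([
    ("a.pdf", {}),
    ("b.pdf", {"level": "line"}),
    ("notes.txt", {}),
])
print(sorted(results), errors)
```

Keying both dictionaries by path keeps successes and failures easy to correlate, regardless of the order in which workers finish.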
5 changes: 5 additions & 0 deletions docs/howto/configure_landingai.md
@@ -1,3 +1,8 @@
---
title: Configure LandingAI ADE
description: How to set up the LandingAI Agentic Document Extraction driver, configure the API key and environment, and override parsing options per document.
---

# How to Configure LandingAI ADE

This guide shows you how to configure the LandingAI ADE (Agentic Document Extraction) driver for document processing, including setting default options and overriding them on a per-document basis.
5 changes: 5 additions & 0 deletions docs/howto/configure_llamaparse.md
@@ -1,3 +1,8 @@
---
title: Configure LlamaParse
description: How to set up the LlamaParse driver, configure the API key and parsing mode, and override options on a per-document basis for better extraction results.
---

# How to Configure LlamaParse

This guide shows you how to configure the LlamaParse driver for document processing, including setting default options and overriding them on a per-document basis.
5 changes: 5 additions & 0 deletions docs/howto/configure_llmwhisperer.md
@@ -1,3 +1,8 @@
---
title: Configure LLMWhisperer
description: How to set up the LLMWhisperer driver, configure the API key and parsing mode, and override options on a per-document basis for better extraction results.
---

# How to Configure LLMWhisperer

This guide shows you how to configure the LLMWhisperer driver for document processing, including setting default options and overriding them on a per-document basis.
5 changes: 5 additions & 0 deletions docs/howto/configure_observability.md
@@ -1,3 +1,8 @@
---
title: Configure observability
description: How to enable OpenTelemetry tracing and metrics in Parxy, connect to an OTLP collector, and monitor document processing operations in your observability stack.
---

# How to Configure Observability

This guide shows you how to enable and configure OpenTelemetry-based observability in Parxy to monitor document processing operations.
5 changes: 5 additions & 0 deletions docs/howto/configure_pdfact.md
@@ -1,3 +1,8 @@
---
title: Configure PdfAct
description: How to set up the PdfAct driver against a self-hosted or remote service instance, configure the base URL and API key, and run PdfAct locally with Docker.
---

# How to Configure PdfAct

This guide shows you how to configure the PdfAct driver for document processing using a self-hosted or remote PdfAct service.
5 changes: 5 additions & 0 deletions docs/howto/configure_pymupdf.md
@@ -1,3 +1,8 @@
---
title: Configure PyMuPDF
description: How to use Parxy's default PyMuPDF driver, choose the right extraction level for your use case, and adjust the output when working with local PDF files.
---

# How to Configure PyMuPDF

This guide shows you how to use the PyMuPDF driver for document processing. PyMuPDF is the default driver in Parxy and requires no external services or API keys.
5 changes: 5 additions & 0 deletions docs/howto/configure_unstructured_local.md
@@ -1,3 +1,8 @@
---
title: Configure Unstructured library
description: How to install and configure the Unstructured local driver for offline PDF parsing without external APIs, including extraction levels and output options.
---

# How to Configure Unstructured Local

This guide shows you how to configure the Unstructured Local driver for document processing. This driver uses the open-source `unstructured` library for local document parsing without requiring external services.
Original file line number Diff line number Diff line change
@@ -1,3 +1,8 @@
---
title: Merge and split PDFs
description: How to merge multiple PDFs and split a single PDF into pages or ranges from the command line using parxy pdf:merge and parxy pdf:split.
---

# How to Manipulate PDFs with Parxy

Parxy provides **PDF manipulation commands** that merge multiple PDF files into one or split a single PDF into multiple files, all from the command line.
7 changes: 6 additions & 1 deletion docs/howto/pdf_attachments.md
@@ -1,3 +1,8 @@
---
title: Work with PDF attachments
description: How to add, list, extract, and remove file attachments embedded in a PDF using Parxy's CLI commands, with examples for common attachment workflows.
---

# How to Work with PDF Attachments

Parxy provides **PDF attachment commands** that add, list, extract, and remove file attachments in PDF documents, all from the command line.
@@ -473,6 +478,6 @@ parxy attach:remove --help

## Related Documentation

- [PDF Manipulation](./pdf_manipulation.md) - Learn about merging and splitting PDFs
- [Merge and split PDFs](./merge_and_split_pdfs.md)
- [Getting Started Tutorial](../tutorials/getting_started.md) - General introduction to Parxy CLI
- [Using the CLI](../tutorials/using_cli.md) - Basic CLI usage patterns
94 changes: 94 additions & 0 deletions docs/installation_and_setup.md
@@ -0,0 +1,94 @@
---
title: Installation and setup
description: How to install Parxy via pip, uv, or uvx and configure it through environment variables.
weight: 3
---

# Installation and Setup

## Requirements

- Python **3.12** or **3.13**

## Installation

Parxy can be installed via pip or uv, or run without installation using uvx.

### Via pip

```bash
pip install parxy # Basic installation (PyMuPDF and PdfAct drivers)
pip install 'parxy[all]' # All drivers included (quoted so shells such as zsh do not expand the brackets)
```

### Via uv

```bash
uv add parxy # Basic installation
uv add parxy --extra all # All drivers included
```

### Without installation (uvx)

[`uvx`](https://docs.astral.sh/uv/guides/tools/) runs Parxy in an isolated environment without a permanent install:

```bash
# Basic drivers only
uvx parxy --help
```

```bash
# All drivers included
uvx --from 'parxy[all]' parxy --help
```

### Installing specific drivers

If you only need a particular driver, install its extra instead of `all`:

```bash
pip install 'parxy[llama]' # LlamaParse
pip install 'parxy[llmwhisperer]' # LLMWhisperer
pip install 'parxy[landingai]' # Landing AI
pip install 'parxy[unstructured_local]' # Unstructured library
```

See [Supported Services](./supported_services.md) for the full list of drivers and their extras.

## Environment variables and API keys

Some drivers require an API key. Parxy reads these from environment variables, which can be set in a `.env` file in your project root.

To generate a template `.env` file:

```bash
parxy env
```

Then fill in the keys for the services you use:

```bash
# LlamaParse
PARXY_LLAMAPARSE_API_KEY=

# Unstract LLMWhisperer
PARXY_LLMWHISPERER_API_KEY=
```
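Before running a remote driver, it can help to check which keys in the file are still blank. The sketch below is a standalone helper, not Parxy's own configuration loader; the function name `missing_keys` is hypothetical, and it only handles simple `KEY=value` lines.

```python
def missing_keys(env_text: str, prefix: str = "PARXY_") -> list[str]:
    """Return variable names with the given prefix whose value is empty."""
    missing = []
    for raw in env_text.splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and anything that is not KEY=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        # Treat empty strings and empty quoted strings as "not set".
        if key.startswith(prefix) and not value.strip().strip("'\""):
            missing.append(key)
    return missing


sample = """
# LlamaParse
PARXY_LLAMAPARSE_API_KEY=

# Unstract LLMWhisperer
PARXY_LLMWHISPERER_API_KEY=abc123
"""
print(missing_keys(sample))
```

In a project, the same function could be called on the contents of the real `.env` file, for example `missing_keys(open(".env").read())`.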

### Core environment variables

| Variable | Default | Description |
|----------|---------|-------------|
| `PARXY_DEFAULT_DRIVER` | `pymupdf` | Driver used when none is specified |
| `PARXY_LOGGING_LEVEL` | `INFO` | Logging verbosity |
| `PARXY_LOGGING_FILE` | *(none)* | Path to write log output |

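The fallback behavior in the table can be illustrated with a small lookup. This is a sketch of the idea only, assuming each variable falls back to its documented default when unset; it is not how Parxy itself loads configuration, and `effective_settings` is a hypothetical name.

```python
import os

# Defaults copied from the table above; None stands for "(none)".
DEFAULTS = {
    "PARXY_DEFAULT_DRIVER": "pymupdf",
    "PARXY_LOGGING_LEVEL": "INFO",
    "PARXY_LOGGING_FILE": None,
}


def effective_settings(environ=None) -> dict:
    """Return each core variable's value, using the default when it is unset."""
    env = os.environ if environ is None else environ
    return {name: env.get(name, default) for name, default in DEFAULTS.items()}


# Pass a plain dict to try overrides without touching the real environment.
print(effective_settings({"PARXY_LOGGING_LEVEL": "DEBUG"}))
```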
### Self-hosted services

Some drivers (such as PdfAct) can be run locally via Docker. To generate a Docker Compose configuration:

```bash
parxy docker
```

This produces a `compose.yaml` you can start with `docker compose up`.
86 changes: 86 additions & 0 deletions docs/introduction.md
@@ -0,0 +1,86 @@
---
title: Introduction
description: What Parxy is, how it works, and a quick look at the CLI commands and Python library API before you dive in.
weight: 1
---

# Introduction

Parxy is a document processing gateway with a unified interface for multiple document parsing services. Because every provider maps to a common document model, you can swap providers without rewriting your application.

- Single API across different providers (local libraries and remote APIs)
- Supports PyMuPDF, Unstructured, LlamaParse, LLMWhisperer, PdfAct, and more
- Custom drivers can be registered directly in your application code
- Execution tracing to help debug parsing issues

## Available as CLI and library

Parxy works as a command line tool or as a Python library.

The quickest way to try it out is via [`uvx`](https://docs.astral.sh/uv/concepts/tools/#execution-vs-installation):

```bash
uvx parxy --help
```

To include all supported drivers:

```bash
uvx --from 'parxy[all]' parxy --help
```

See [Installation and Setup](./installation_and_setup.md) for the full installation options.

## CLI overview

Once installed, `parxy` provides commands including the following:

| Command | Description |
|---------|-------------|
| `parxy parse` | Extract text content from documents with customizable granularity levels and output formats |
| `parxy markdown` | Convert documents into Markdown format, with optional combining of multiple documents |
| `parxy drivers` | List available document processing drivers |
| `parxy env` | Create a configuration file with default settings |
| `parxy docker` | Generate a Docker Compose configuration for self-hosted services |
| `parxy pdf:merge` | Merge multiple PDF files into one, with support for selecting specific page ranges |
| `parxy pdf:split` | Split a PDF file into individual pages |

```bash
# Parse a PDF to markdown
parxy parse --mode markdown document.pdf

# Launch interactive TUI for parser comparison
parxy tui ./documents

# Merge multiple PDFs with page ranges
parxy pdf:merge cover.pdf doc1.pdf[1:10] doc2.pdf -o merged.pdf
```

Run `parxy --help` for the full list of commands and options.

## Library overview

Parxy can also be used directly in Python. After installation, import the `Parxy` facade:

```python
from parxy_core.facade import Parxy

# Parse a document using the default driver
doc = Parxy.parse('path/to/document.pdf')

print(f"Pages: {len(doc.pages)}")
print(f"Title: {doc.metadata.title}")

# Use a specific driver
doc = Parxy.driver(Parxy.LLAMAPARSE).parse('path/to/document.pdf')
```

Every driver returns the same `Document` structure, so you can switch providers without changing how you process the output.

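To make the point concrete, downstream code can be written once against the shared structure. The sketch below uses only the two attributes shown above (`pages` and `metadata.title`) and runs without Parxy installed by substituting `SimpleNamespace` stand-ins for real documents. The `summarize` helper is hypothetical, and treating a missing title as `None` is an assumption.

```python
from types import SimpleNamespace


def summarize(doc) -> str:
    """Build a one-line summary from the shared fields: pages and metadata.title."""
    title = doc.metadata.title or "(untitled)"  # assumes title may be None
    return f"{title}: {len(doc.pages)} page(s)"


# Stand-ins for documents returned by two different drivers; real code would
# obtain these from Parxy.parse(...) or Parxy.driver(...).parse(...).
from_local = SimpleNamespace(
    pages=[object(), object()], metadata=SimpleNamespace(title="Report")
)
from_remote = SimpleNamespace(pages=[object()], metadata=SimpleNamespace(title=None))

for doc in (from_local, from_remote):
    print(summarize(doc))
```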
For a step-by-step walkthrough, see the [Getting Started tutorial](./tutorials/getting_started.md).

## Next steps

- [Installation and first run](./installation_and_setup.md)
- [Available drivers](./supported_services.md) and their installation
- [Parse your first document](./tutorials/getting_started.md)