Kreuzberg

Kreuzberg is a high-performance Python library for text extraction from documents. Benchmarked as one of the fastest text extraction libraries available, it provides a unified interface for extracting text from PDFs, images, office documents, and more, with both async and sync APIs optimized for speed and efficiency.

Why Kreuzberg?

🚀 Substantially Faster: Extraction speeds that significantly outperform other text extraction libraries
⚡ Unique Dual API: The only framework supporting both sync and async APIs for maximum flexibility
💾 Memory Efficient: Lower memory footprint compared to competing libraries
📊 Proven Performance: Comprehensive benchmarks demonstrate superior performance across formats
Simple and Hassle-Free: Clean API that just works, without complex configuration
Local Processing: No external API calls or cloud dependencies required
Resource Efficient: Lightweight processing without GPU requirements
Format Support: Comprehensive support for documents, images, and text formats
Multiple OCR Engines: Support for Tesseract, EasyOCR, and PaddleOCR
Command Line Interface: Powerful CLI for batch processing and automation
Metadata Extraction: Get document metadata alongside text content
Table Extraction: Extract tables from documents using the excellent GMFT library
Modern Python: Built with async/await, type hints, and a functional-first approach
Permissive OSS: MIT licensed with permissively licensed dependencies

Quick Start

pip install kreuzberg

# Or install with CLI support
pip install "kreuzberg[cli]"

Install the system dependencies, Tesseract and Pandoc:

# Ubuntu/Debian
sudo apt-get install tesseract-ocr pandoc

# macOS
brew install tesseract pandoc

# Windows
choco install -y tesseract pandoc

Tesseract is the default OCR engine. You can opt out of it and either use one of the two alternative OCR engines instead or disable OCR entirely.

Alternative OCR engines

# Install with EasyOCR support
pip install "kreuzberg[easyocr]"

# Install with PaddleOCR support
pip install "kreuzberg[paddleocr]"

Quick Example

import asyncio
from kreuzberg import extract_file

async def main():
    # Extract text from a PDF
    result = await extract_file("document.pdf")
    print(result.content)

    # Extract text from an image
    result = await extract_file("scan.jpg")
    print(result.content)

    # Extract text from a Word document
    result = await extract_file("report.docx")
    print(result.content)

asyncio.run(main())
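
The example above uses the async API. For scripts that do not run an event loop, a synchronous call may be more convenient. The sketch below assumes the library exports a blocking `extract_file_sync` that mirrors `extract_file`; that name does not appear in this README, so verify it against the API Reference. `extract_text_sync` is a hypothetical helper introduced only for illustration.

```python
from __future__ import annotations

from pathlib import Path


def extract_text_sync(path: str | Path) -> str:
    """Return the extracted text of one file without using an event loop."""
    # Assumption: kreuzberg exposes a blocking `extract_file_sync` whose
    # result has the same `.content` attribute as the async example.
    # The import is lazy so this module still loads where kreuzberg
    # is not installed.
    from kreuzberg import extract_file_sync

    result = extract_file_sync(path)
    return result.content
```

Calling `extract_text_sync("document.pdf")` would then return the same text that `result.content` holds in the async example.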

Command Line Interface

Kreuzberg includes a powerful CLI for processing documents from the command line:

# Extract text from a file
kreuzberg extract document.pdf

# Extract with JSON output and metadata
kreuzberg extract document.pdf --output-format json --show-metadata

# Extract from stdin
cat document.html | kreuzberg extract

# Use specific OCR backend
kreuzberg extract image.png --ocr-backend easyocr --easyocr-languages en,de

# Extract with configuration file
kreuzberg extract document.pdf --config config.toml
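
To drive these commands from a Python script, one option is a thin wrapper around `subprocess`. This is a minimal sketch that uses only the flags shown above; `build_extract_command` and `run_extract` are hypothetical helper names, and the sketch assumes the `kreuzberg` executable is on `PATH`.

```python
import subprocess


def build_extract_command(path, *, output_format=None, show_metadata=False,
                          ocr_backend=None, config=None):
    """Build the argument list for `kreuzberg extract`."""
    cmd = ["kreuzberg", "extract", path]
    if output_format:
        cmd += ["--output-format", output_format]
    if show_metadata:
        cmd.append("--show-metadata")
    if ocr_backend:
        cmd += ["--ocr-backend", ocr_backend]
    if config:
        cmd += ["--config", config]
    return cmd


def run_extract(path, **options):
    """Run the CLI and return its stdout; raises CalledProcessError on failure."""
    completed = subprocess.run(
        build_extract_command(path, **options),
        capture_output=True,
        text=True,
        check=True,
    )
    return completed.stdout
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.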

CLI Configuration

Configure via pyproject.toml:

[tool.kreuzberg]
force_ocr = true
chunk_content = false
extract_tables = true
max_chars = 4000
ocr_backend = "tesseract"

[tool.kreuzberg.tesseract]
language = "eng+deu"
psm = 3

For full CLI documentation, see the CLI Guide.

Documentation

For comprehensive documentation, visit our GitHub Pages:

Getting Started - Installation and basic usage
User Guide - In-depth usage information
CLI Guide - Command-line interface documentation
API Reference - Detailed API documentation
Examples - Code examples for common use cases
OCR Configuration - Configure OCR engines
OCR Backends - Choose the right OCR engine

Supported Formats

Kreuzberg supports a wide range of document formats:

Documents: PDF, DOCX, RTF, TXT, EPUB, etc.
Images: JPG, PNG, TIFF, BMP, GIF, etc.
Spreadsheets: XLSX, XLS, CSV, etc.
Presentations: PPTX, PPT, etc.
Web Content: HTML, XML, etc.

OCR Engines

Kreuzberg supports multiple OCR engines:

Tesseract (Default): Lightweight, fast startup, requires system installation
EasyOCR: Good for many languages, pure Python, but downloads models on first use
PaddleOCR: Excellent for Asian languages, pure Python, but downloads models on first use

For comparison and selection guidance, see the OCR Backends documentation.

Performance

Kreuzberg delivers exceptional performance compared to other text extraction libraries:

🏆 Competitive Benchmarks

Comprehensive benchmarks comparing Kreuzberg against other popular Python text extraction libraries show:

Fastest Extraction: Consistently fastest processing times across file formats
Lowest Memory Usage: Most memory-efficient text extraction solution
100% Success Rate: Reliable extraction across all tested document types
Optimal for High-Throughput: Designed for real-time, production applications

💾 Installation Size Efficiency

Kreuzberg delivers maximum performance with minimal overhead:

Kreuzberg: 71.0 MB (20 deps) - Most lightweight
Unstructured: 145.8 MB (54 deps) - Moderate footprint
MarkItDown: 250.7 MB (25 deps) - ML inference overhead
Docling: 1,031.9 MB (88 deps) - Full ML stack included

Kreuzberg is up to 14x smaller than competing solutions while delivering superior performance.
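
The "up to 14x" figure follows from the sizes listed above. A short calculation, using only those numbers, gives the ratio for each library relative to Kreuzberg:

```python
# Installation sizes in MB, copied from the list above.
sizes_mb = {
    "Kreuzberg": 71.0,
    "Unstructured": 145.8,
    "MarkItDown": 250.7,
    "Docling": 1031.9,
}

baseline = sizes_mb["Kreuzberg"]
ratios = {name: round(size / baseline, 1) for name, size in sizes_mb.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio}x the size of Kreuzberg")
```

The largest ratio is Docling's at roughly 14.5, which is where the headline figure comes from.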

⚡ Sync vs Async Performance

Kreuzberg is the only library offering both sync and async APIs. Choose based on your use case:

Operation	Sync Time	Async Time	Async Advantage
Simple text (Markdown)	0.4ms	17.5ms	❌ 41x slower
HTML documents	1.6ms	1.1ms	✅ 1.5x faster
Complex PDFs	39.0s	8.5s	✅ 4.6x faster
OCR processing	0.4s	0.7s	✅ 1.7x faster
Batch operations	38.6s	8.5s	✅ 4.5x faster

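Rule of thumb: Use sync for small, simple inputs such as plain text or Markdown, where the table shows async overhead outweighing any gain. Use async for complex documents, OCR, batch processing, and backend APIs.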
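
The batch advantage comes from overlapping waits rather than from any single call running faster. The sketch below illustrates the idea with a simulated workload: `fake_extract` is a hypothetical stand-in that only sleeps, not a Kreuzberg function, so the timings say nothing about real extraction speed.

```python
import asyncio
import time


async def fake_extract(path: str, delay: float = 0.05) -> str:
    """Hypothetical stand-in for an awaitable extraction call."""
    await asyncio.sleep(delay)  # simulates waiting on I/O or a worker
    return f"text of {path}"


async def extract_one_by_one(paths):
    # Each call finishes before the next one starts.
    return [await fake_extract(p) for p in paths]


async def extract_concurrently(paths):
    # All calls are started together and awaited as a group;
    # gather returns results in the order of its arguments.
    return await asyncio.gather(*(fake_extract(p) for p in paths))


def timed(coro_fn, paths):
    start = time.perf_counter()
    results = asyncio.run(coro_fn(paths))
    return results, time.perf_counter() - start


paths = [f"doc_{i}.pdf" for i in range(10)]
sequential_results, sequential_s = timed(extract_one_by_one, paths)
concurrent_results, concurrent_s = timed(extract_concurrently, paths)
print(f"sequential: {sequential_s:.2f}s, concurrent: {concurrent_s:.2f}s")
```

With ten simulated documents, the sequential run waits for each delay in turn while the concurrent run waits for roughly one delay in total. For a single tiny input there is nothing to overlap, which is consistent with the Markdown row in the table.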

For detailed benchmarks and methodology, see our Performance Documentation.

Contributing

We welcome contributions! Please see our Contributing Guide for details on setting up your development environment and submitting pull requests.

License

This library is released under the MIT license.
