Semantic document graph extraction. Transform PDFs into structured, queryable graphs for RAG, search, and document understanding.
- Semantic graph output — Preserves document structure (sections, paragraphs, lists, tables)
- Bounding boxes — Every node maps to exact PDF coordinates
- Hierarchical structure — Parent-child relationships preserved
- Fast — Native Rust with embedded Tika for PDF parsing
- Local-first — No API key required, runs entirely on your machine
git clone https://github.com/AmplifyTechnology/blazegraph-io.git
cd blazegraph-io
cargo build --release -p blazegraph-cli

# Parse a PDF to JSON graph
./target/release/blazegraph-cli -i document.pdf -o graph.json
# With custom config
./target/release/blazegraph-cli -i document.pdf -c config.yaml -o graph.json
# See all options
./target/release/blazegraph-cli --help

Note: On first run, the CLI will automatically download a Java Runtime (~60 MB) for PDF processing. It's cached for future use:
- macOS/Linux: ~/.local/share/blazegraph/jre
- Windows: %LOCALAPPDATA%\blazegraph\jre
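
As a rough illustration, the cache locations listed above can be resolved from a script, for example to check whether the runtime has already been downloaded. This is a minimal sketch: the function name `jre_cache_dir` is hypothetical, and it assumes `~` expands to `HOME` and that `LOCALAPPDATA` is set on Windows.

```python
import os
import sys
from pathlib import PureWindowsPath


def jre_cache_dir(platform: str = sys.platform, env=os.environ) -> str:
    """Return the JRE cache directory for the given platform (hypothetical helper)."""
    if platform.startswith("win"):
        # Windows: %LOCALAPPDATA%\blazegraph\jre
        base = env.get("LOCALAPPDATA", "")
        return str(PureWindowsPath(base) / "blazegraph" / "jre")
    # macOS/Linux: ~/.local/share/blazegraph/jre
    home = env.get("HOME") or os.path.expanduser("~")
    return f"{home}/.local/share/blazegraph/jre"


if __name__ == "__main__":
    path = jre_cache_dir()
    print(path, "(present)" if os.path.isdir(path) else "(not downloaded yet)")
```

Passing `platform` and `env` explicitly keeps the helper easy to exercise without touching the real environment.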
| Format | Description |
|---|---|
| graph | Full graph structure with nodes and edges (default) |
| sequential | Ordered segments with hierarchy info (good for RAG) |
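
To give a feel for working with a nodes-and-edges structure like the graph format, the sketch below rebuilds an indented outline from parent-child edges. The field names (`nodes`, `edges`, `id`, `type`, `parent`, `child`) and the sample data are assumptions made for illustration only; they are not taken from the actual output schema, so check a real graph.json before adapting this.

```python
from collections import defaultdict

# Hypothetical sample data. The keys used here are illustrative assumptions,
# not the real blazegraph-io schema.
SAMPLE = {
    "nodes": [
        {"id": "n0", "type": "document"},
        {"id": "n1", "type": "section"},
        {"id": "n2", "type": "paragraph"},
        {"id": "n3", "type": "table"},
    ],
    "edges": [
        {"parent": "n0", "child": "n1"},
        {"parent": "n1", "child": "n2"},
        {"parent": "n1", "child": "n3"},
    ],
}


def outline(graph: dict) -> list[str]:
    """Return one indented line per node, walking parent-child edges depth-first."""
    nodes = {n["id"]: n for n in graph["nodes"]}
    children = defaultdict(list)
    has_parent = set()
    for edge in graph["edges"]:
        children[edge["parent"]].append(edge["child"])
        has_parent.add(edge["child"])

    lines = []

    def visit(node_id: str, depth: int) -> None:
        lines.append("  " * depth + nodes[node_id]["type"])
        for child_id in children[node_id]:
            visit(child_id, depth + 1)

    # Roots are the nodes that never appear as a child.
    for node_id in nodes:
        if node_id not in has_parent:
            visit(node_id, 0)
    return lines


if __name__ == "__main__":
    print("\n".join(outline(SAMPLE)))
```

For real output, `SAMPLE` would be replaced by `json.load(open("graph.json"))` with the key names adjusted to match the file.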
# Sequential format for RAG pipelines
./target/release/blazegraph-cli -i document.pdf -f sequential -o chunks.json

See blazegraph-cli/configs/processing/ for example configuration files. These control:
- Section detection thresholds
- List detection patterns
- Spatial clustering parameters
- Size enforcement (max chunk size)
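
The command lines shown earlier can also be driven from a script, for instance to process many PDFs in one pass. The sketch below uses only the flags that appear in this README (`-i`, `-o`, `-c`, `-f`). The helper names `build_command` and `run` are hypothetical, and passing `-c` and `-f` together is an assumption that has not been verified against the CLI.

```python
import subprocess


def build_command(pdf, output, fmt=None, config=None,
                  binary="./target/release/blazegraph-cli"):
    """Assemble the argument list for one blazegraph-cli invocation."""
    cmd = [binary, "-i", str(pdf)]
    if config is not None:
        cmd += ["-c", str(config)]  # custom config file
    if fmt is not None:
        cmd += ["-f", fmt]          # e.g. "sequential"; the CLI defaults to "graph"
    cmd += ["-o", str(output)]
    return cmd


def run(pdf, output, **kwargs):
    """Run the CLI and raise CalledProcessError on a non-zero exit code."""
    subprocess.run(build_command(pdf, output, **kwargs), check=True)


if __name__ == "__main__":
    print(" ".join(build_command("document.pdf", "chunks.json", fmt="sequential")))
```

Building the argument list separately from running it makes the command easy to log or inspect before anything is executed.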
blazegraph-io/
├── blazegraph-core/ # Core parsing library
├── blazegraph-cli/ # Command-line interface
└── Cargo.toml # Workspace definition
Licensed under either of:
- Apache License, Version 2.0 (LICENSE-APACHE)
- MIT license (LICENSE-MIT)
at your option.
This project is actively developed. Here's what's planned:
- Publish CLI to crates.io (blazegraph-io)
- Publish core library to crates.io (blazegraph-io-core)
- Publish Python wrapper to PyPI (blazegraph-io)
- Markdown (.md)
- Word documents (.docx)
- Stable v1 schema specification
- Schema documentation with examples
- Migration guide for schema changes
- Getting started guide
- Configuration reference
- Integration examples (LangChain, LlamaIndex, etc.)
- Output schema reference
Contributions and feedback welcome! Open an issue to discuss.