Docling-Graph converts documents into validated Pydantic objects and then into a directed knowledge graph, with exports to CSV or Cypher and both static and interactive visualizations.
It turns unstructured documents into validated knowledge graphs with precise semantic relationships. This is essential in complex domains such as chemistry, finance, and physics, where AI systems must understand exact entity connections (e.g., chemical compounds and their reactions, financial instruments and their dependencies, physical properties and their measurements) rather than approximate text vectors, enabling explainable reasoning over technical document collections.
The toolkit supports two extraction families: local VLM via Docling and LLM-based extraction via local (vLLM, Ollama) or API providers (Mistral, OpenAI, Gemini, IBM WatsonX), all orchestrated by a flexible, config-driven pipeline.
- 🧠 Extraction:
  - Local VLM: Docling's information extraction pipeline, ideal for small documents with a key-value focus
  - LLM: local via vLLM/Ollama, or remote via the Mistral, OpenAI, Gemini, and IBM WatsonX APIs
  - Hybrid Chunking: leverages Docling's segmentation with semantic LLM chunking for more context-aware extraction
  - Page-wise or whole-document conversion strategies for flexible processing
- 🔨 Graph Construction:
  - Markdown to Graph: convert validated Pydantic instances to a NetworkX `DiGraph` with rich edge metadata and stable node IDs
  - Smart Merge: combine multi-page documents into a single Pydantic instance for unified processing
  - Modular graph module with enhanced type safety and configuration
- 📦 Export:
  - Docling Document exports (JSON format with full document structure)
  - Markdown exports (full-document and per-page options)
  - CSV compatible with Neo4j admin import
  - Cypher script generation for bulk ingestion
  - JSON export for general-purpose graph data
- 📊 Visualization:
  - Interactive HTML visualization in a full-page browser view with enhanced node/edge exploration
  - Detailed Markdown report with graph node contents and edges
- 🪜 Multi-Stage Extraction: Define `extraction_stage` in templates to control multi-pass extraction.
- 🧩 Interactive Template Builder: Guided workflows for building Pydantic templates.
- 🧬 Ontology-Based Templates: Match content to the best Pydantic template using semantic similarity.
- ✍🏻 Flexible Inputs: Accepts `text`, `markdown`, and `DoclingDocument` directly.
- ⚡ Batch Optimization: Faster GPU inference with better memory handling.
- 💾 Graph Database Integration: Export data straight into Neo4j, ArangoDB, and similar databases.
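To make the graph-construction step concrete, here is a minimal sketch of the kind of NetworkX `DiGraph` the pipeline produces: nodes keyed by natural-key IDs, with edge metadata carried as edge attributes. The specific ID scheme (`Label:field1:field2`) and attribute names below are illustrative assumptions, not Docling-Graph's actual output format.

```python
import networkx as nx

# Illustrative sketch only: natural-key node IDs plus edge metadata.
# The ID scheme and attribute names are assumptions for demonstration.
g = nx.DiGraph()
g.add_node("Person:Lovelace:1815-12-10", label="Person", first_name="Ada")
g.add_node("Paper:Notes-on-the-Analytical-Engine", label="Paper")
g.add_edge(
    "Person:Lovelace:1815-12-10",
    "Paper:Notes-on-the-Analytical-Engine",
    relation="AUTHORED",          # edge type
    source_field="publications",  # hypothetical: template field that produced the edge
)

print(g.number_of_nodes(), g.number_of_edges())  # → 2 1
```

Stable natural-key IDs mean the same entity extracted from different pages collapses to one node when graphs are merged.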
- Python 3.10 or higher
- UV package manager
```bash
git clone https://github.com/IBM/docling-graph
cd docling-graph
```

Choose the installation option that matches your use case:
| Option | Command | Description |
|---|---|---|
| Minimal | `uv sync` | Includes core VLM features (Docling), no LLM inference |
| Full | `uv sync --extra all` | Includes all features, VLM, and all local/remote LLM providers |
| Local LLM | `uv sync --extra local` | Adds support for vLLM and Ollama (requires GPU for vLLM) |
| Remote API | `uv sync --extra remote` | Adds support for Mistral, OpenAI, Gemini, and IBM WatsonX APIs |
| WatsonX | `uv sync --extra watsonx` | Adds support for IBM WatsonX foundation models (Granite, Llama, Mixtral) |
Follow the steps in this guide to install PyTorch with NVIDIA GPU (CUDA) support.
If you're using remote/cloud inference, set your API keys for the providers you plan to use:
```bash
export OPENAI_API_KEY="..."        # OpenAI
export MISTRAL_API_KEY="..."       # Mistral
export GEMINI_API_KEY="..."        # Google Gemini
export WATSONX_API_KEY="..."       # IBM WatsonX
export WATSONX_PROJECT_ID="..."    # IBM WatsonX project ID
export WATSONX_URL="..."           # IBM WatsonX URL (optional, defaults to US South)
```

On Windows, replace `export` with `set` in Command Prompt or `$env:` in PowerShell.
Alternatively, add them to your .env file.
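For example, a minimal `.env` might look like this (values are placeholders; include only the providers you use):

```
OPENAI_API_KEY=your-openai-key
MISTRAL_API_KEY=your-mistral-key
GEMINI_API_KEY=your-gemini-key
WATSONX_API_KEY=your-watsonx-key
WATSONX_PROJECT_ID=your-watsonx-project-id
# Optional; if unset, the US South endpoint is used
WATSONX_URL=your-watsonx-url
```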
Note: For IBM WatsonX setup and available models, see the WatsonX Integration Guide.
Docling Graph is primarily driven by its CLI, but you can easily integrate the core pipeline into Python scripts.
To run a conversion programmatically, you define a `PipelineConfig` and pass it to the `run_pipeline` function. This example uses a remote LLM API in many-to-one mode for a single multi-page document:
```python
from docling_graph import run_pipeline, PipelineConfig

# Pydantic model to use as an extraction template
from docs.examples.templates.rheology_research import Research

# Create typed config
config = PipelineConfig(
    source="docs/examples/data/research_paper/rheology.pdf",
    template=Research,
    backend="llm",
    inference="remote",
    processing_mode="many-to-one",
    provider_override="mistral",             # Specify your preferred provider and ensure its API key is set
    model_override="mistral-medium-latest",  # Specify your preferred LLM model
    use_chunking=True,                       # Enable Docling's hybrid chunker
    llm_consolidation=False,                 # If False, programmatically merge batch-extracted dictionaries
    output_dir="outputs/battery_research",
)

try:
    run_pipeline(config)
    print(f"\nExtraction complete! Graph data saved to: {config.output_dir}")
except Exception as e:
    print(f"An error occurred: {e}")
```

Use the command-line interface for quick conversions and inspections. The following commands run the conversion and output a graph ready for Neo4j import:
A wizard will walk you through setting up the right config for your use case.

```bash
uv run docling-graph init
```

Note: This command may take a little longer to start on the first run, as it checks for installed dependencies.
You can run `docling-graph convert --help` to see the full list of available options and usage details.
```bash
# uv run docling-graph convert <SOURCE_FILE_PATH> --template "<TEMPLATE_DOTTED_PATH>" [OPTIONS]
uv run docling-graph convert "docs/examples/data/research_paper/rheology.pdf" \
  --template "docs.examples.templates.rheology_research.Research" \
  --output-dir "outputs/battery_research" \
  --processing-mode "many-to-one" \
  --use-chunking \
  --no-llm-consolidation

# uv run docling-graph inspect <CONVERT_OUTPUT_PATH> [OPTIONS]
uv run docling-graph inspect outputs/battery_research
```

Templates are the foundation of Docling Graph, defining both the extraction schema and the resulting graph structure.
- Use `is_entity=True` in `model_config` to explicitly mark a class as a graph node.
- Leverage `model_config.graph_id_fields` to create stable, readable node IDs (natural keys).
- Use the `Edge()` helper to define explicit relationships between entities.
Example:

```python
from pydantic import BaseModel, Field

class Person(BaseModel):
    """Person entity with stable ID based on name and DOB."""

    model_config = {
        'is_entity': True,
        'graph_id_fields': ['last_name', 'date_of_birth']
    }

    first_name: str = Field(description="Person's first name")
    last_name: str = Field(description="Person's last name")
    date_of_birth: str = Field(description="Date of birth (YYYY-MM-DD)")
```

Reference Pydantic templates are available to help you get started quickly.
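To make the natural-key idea concrete, the sketch below (repeating the `Person` model) shows how `graph_id_fields` could be turned into a stable node ID. The `node_id` helper is hypothetical and shown only to illustrate the concept; Docling-Graph's actual ID scheme may differ.

```python
from pydantic import BaseModel, Field

class Person(BaseModel):
    """Person entity with stable ID based on name and DOB."""

    model_config = {
        'is_entity': True,
        'graph_id_fields': ['last_name', 'date_of_birth']
    }

    first_name: str = Field(description="Person's first name")
    last_name: str = Field(description="Person's last name")
    date_of_birth: str = Field(description="Date of birth (YYYY-MM-DD)")

# Hypothetical helper: join the graph_id_fields values into a natural key,
# illustrating how stable node IDs can be derived. Not Docling-Graph's API.
def node_id(instance: BaseModel) -> str:
    fields = instance.model_config.get('graph_id_fields', [])
    key = ":".join(str(getattr(instance, f)) for f in fields)
    return f"{type(instance).__name__}:{key}"

p = Person(first_name="Ada", last_name="Lovelace", date_of_birth="1815-12-10")
print(node_id(p))  # → Person:Lovelace:1815-12-10
```

Because the ID is derived from field values rather than object identity, the same person mentioned on different pages maps to the same node.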
For complete guidance, see: Pydantic Templates for Knowledge Graph Extraction
- Work In Progress...
Get hands-on with Docling Graph examples to convert documents into knowledge graphs through VLM or LLM-based processing.
MIT License - see LICENSE for details.
- Powered by Docling for advanced document processing.
- Uses Pydantic for data validation.
- Graph generation powered by NetworkX.
- Visualizations powered by Cytoscape.js.
- CLI powered by Typer and Rich.
Docling Graph has been brought to you by IBM.