Agentic Document Extraction CLI -- layout-aware, context-threaded extraction for PDF, PPTX, DOCX, and HTML.
Extracta is a CLI tool that sends your documents to the Extracta server for intelligent, agentic extraction. It does not just pull text -- it:
- Detects layout per page (single column, multi-column, mixed, table-heavy)
- Determines the correct reading order (H-Major or V-Major) using projection profile analysis and Recursive XY-Cut
- Threads context between segments using an LLM agent (ADE -- Agentic Document Extraction)
- Outputs structured JSON with full context metadata per region
pip install extracta

extracta-extract path/to/your/document.pdf

Output is saved as document_extracted.json in the same directory as the input file.
| Format | Extension |
|---|---|
| PDF | .pdf |
| PowerPoint | .pptx |
| Word | .docx |
| HTML | .html / .htm |
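
The supported-format table and the output naming rule (document.pdf becomes document_extracted.json next to the input) can be sketched in a few lines of Python. This is only an illustration, not the CLI's actual code: the names `SUPPORTED_EXTENSIONS`, `is_supported`, and `output_path_for` are hypothetical, and the case-insensitive extension match is an assumption.

```python
from pathlib import Path

# Extensions taken from the format table above.
SUPPORTED_EXTENSIONS = {".pdf", ".pptx", ".docx", ".html", ".htm"}


def is_supported(path: str) -> bool:
    """Return True if the file extension appears in the format table.

    Matching is case-insensitive here; that is an assumption of this sketch.
    """
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS


def output_path_for(path: str) -> Path:
    """Build '<stem>_extracted.json' in the same directory as the input."""
    p = Path(path)
    return p.with_name(f"{p.stem}_extracted.json")


if __name__ == "__main__":
    print(is_supported("path/to/your/document.pdf"))
    print(output_path_for("path/to/your/document.pdf"))
```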
{
"file": "report.pdf",
"format": "pdf",
"total_pages": 3,
"pages": [
{
"page_number": 1,
"layout_type": "multi_col",
"strategy": "v_major",
"regions": [
{
"region_id": "p1_r1",
"type": "title",
"text": "Efficacy in Treatment-Naive Patients",
"bbox": { "x0": 50.0, "y0": 40.0, "x1": 540.0, "y1": 65.0 },
"sequence": 1,
"context_thread_id": "thread_001",
"context_role": "heading",
"continues_on_page": null,
"references_region": null
}
],
"full_text": "Efficacy in Treatment-Naive Patients\n\nIn clinical trials..."
}
]
}

╭──────────────────────────────────────────╮
│ Extracta -- Agentic Document Extraction │
╰──────────────────────────────────────────╯
File : report.pdf
Server : http://localhost:8000
Analysing layout...
Page Layout Strategy Regions
1 multi_col V-Major 12
2 single_col H-Major 8
3 mixed V-Major 15
Running ADE context threading...
Done -- 3 pages | 35 regions
Output : report_extracted.json
File Uploaded
↓
[DETECT] -- scan all pages, determine H-Major or V-Major per page
↓
[EXTRACT] -- extract blocks in natural reading order using Recursive XY-Cut
↓
[ADE] -- LLM agent threads context, links segments, assigns roles
↓
JSON Output
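
To make the EXTRACT step more concrete, here is a minimal sketch of a Recursive XY-Cut over bounding boxes. It is a textbook-style version, not the server's implementation: the `Box` class, the `min_gap` threshold, and the choice to try horizontal cuts before vertical ones are all assumptions made for illustration. A "cut" is a blank band in the projection of the boxes onto one axis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical block with the same bbox fields as the JSON output."""
    name: str
    x0: float
    y0: float
    x1: float
    y1: float


def split_on_gaps(boxes, axis, min_gap):
    """Split boxes wherever their projection onto `axis` has a blank band.

    axis is "x" or "y"; a band counts as blank if it is at least min_gap wide.
    """
    ordered = sorted(boxes, key=lambda b: getattr(b, axis + "0"))
    groups = [[ordered[0]]]
    reach = getattr(ordered[0], axis + "1")  # furthest extent covered so far
    for box in ordered[1:]:
        if getattr(box, axis + "0") - reach >= min_gap:
            groups.append([box])  # blank band found: start a new group
        else:
            groups[-1].append(box)
        reach = max(reach, getattr(box, axis + "1"))
    return groups


def xy_cut(boxes, min_gap=5.0):
    """Return boxes in reading order by cutting recursively along y, then x."""
    if len(boxes) <= 1:
        return list(boxes)
    for axis in ("y", "x"):
        groups = split_on_gaps(boxes, axis, min_gap)
        if len(groups) > 1:
            ordered = []
            for group in groups:
                ordered.extend(xy_cut(group, min_gap))
            return ordered
    # No clean cut on either axis: fall back to top-left ordering.
    return sorted(boxes, key=lambda b: (b.y0, b.x0))


if __name__ == "__main__":
    page = [
        Box("right_top", 310, 80, 540, 150),
        Box("title", 50, 40, 540, 65),
        Box("left_bottom", 50, 210, 280, 400),
        Box("right_bottom", 310, 160, 540, 400),
        Box("left_top", 50, 80, 280, 200),
    ]
    print([b.name for b in xy_cut(page)])
```

On this made-up page the full-width title is separated first by a horizontal cut, then the vertical gutter splits the two columns, and each column is read top to bottom.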
| Layout Type | Description |
|---|---|
| single_col | Simple single column document |
| multi_col | Two or more columns (e.g. academic papers) |
| mixed | Complex irregular layout (e.g. pharma slides) |
| table_heavy | Majority of content is tabular |
| image_heavy | Majority of content is images |
| Strategy | Description |
|---|---|
| V-Major | Vertical-first -- top to bottom within each column |
| H-Major | Horizontal-first -- left to right across each row |
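
The difference between the two strategies can be illustrated with a small ordering function over region bounding boxes. This is a simplified sketch, not how the server orders blocks: the function name `order_regions`, the `tol` alignment tolerance, and the strategy string `"h_major"` (by analogy with the `"v_major"` value in the sample output) are assumptions.

```python
def order_regions(regions, strategy, tol=10.0):
    """Order regions V-Major (column by column) or H-Major (row by row).

    Regions are dicts with a "bbox" of x0/y0/x1/y1, as in the JSON output.
    Regions whose leading coordinate lies within `tol` of the first region
    in a band are treated as the same column (V-Major) or row (H-Major).
    """
    if strategy == "v_major":
        primary, secondary = "x0", "y0"  # group into columns, read downward
    elif strategy == "h_major":
        primary, secondary = "y0", "x0"  # group into rows, read rightward
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")

    bands = []
    for region in sorted(regions, key=lambda r: r["bbox"][primary]):
        start = bands[-1][0]["bbox"][primary] if bands else None
        if start is not None and region["bbox"][primary] - start <= tol:
            bands[-1].append(region)
        else:
            bands.append([region])

    return [
        region
        for band in bands
        for region in sorted(band, key=lambda r: r["bbox"][secondary])
    ]


if __name__ == "__main__":
    # A made-up 2x2 grid of regions.
    grid = [
        {"region_id": "A", "bbox": {"x0": 50, "y0": 40, "x1": 280, "y1": 180}},
        {"region_id": "B", "bbox": {"x0": 300, "y0": 40, "x1": 540, "y1": 180}},
        {"region_id": "C", "bbox": {"x0": 50, "y0": 200, "x1": 280, "y1": 400}},
        {"region_id": "D", "bbox": {"x0": 300, "y0": 200, "x1": 540, "y1": 400}},
    ]
    print([r["region_id"] for r in order_regions(grid, "v_major")])
    print([r["region_id"] for r in order_regions(grid, "h_major")])
```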
| Role | Description |
|---|---|
| heading | Section title or heading |
| body | Main body paragraph |
| callout | Sidebar, highlighted box, callout |
| caption | Image or table caption |
| footnote | Footer or footnote text |
| continuation | Continues directly from a previous block |
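
Once an output file exists, the context metadata can be consumed with plain Python. The sketch below groups regions by `context_thread_id` and orders them by page and `sequence`, using the field names from the sample output earlier in this README. The helper name `group_by_thread` and the second region in the sample data are hypothetical, and whether `sequence` restarts on each page is not stated here, so the sort uses both keys.

```python
from collections import defaultdict


def group_by_thread(doc):
    """Map each context_thread_id to its regions in reading order."""
    threads = defaultdict(list)
    for page in doc["pages"]:
        for region in page["regions"]:
            key = (page["page_number"], region["sequence"])
            threads[region["context_thread_id"]].append((key, region))
    return {
        thread_id: [region for _, region in sorted(items, key=lambda t: t[0])]
        for thread_id, items in threads.items()
    }


if __name__ == "__main__":
    # In practice: doc = json.load(open("report_extracted.json"))
    # Trimmed sample; region p1_r2 is invented for illustration.
    doc = {
        "file": "report.pdf",
        "pages": [
            {
                "page_number": 1,
                "regions": [
                    {
                        "region_id": "p1_r2",
                        "text": "In clinical trials...",
                        "sequence": 2,
                        "context_thread_id": "thread_001",
                        "context_role": "body",
                    },
                    {
                        "region_id": "p1_r1",
                        "text": "Efficacy in Treatment-Naive Patients",
                        "sequence": 1,
                        "context_thread_id": "thread_001",
                        "context_role": "heading",
                    },
                ],
            }
        ],
    }
    for thread_id, regions in group_by_thread(doc).items():
        print(thread_id)
        for region in regions:
            print(f"  [{region['context_role']}] {region['text']}")
```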
extracta-client/
├── extracta/
│ ├── __init__.py
│ ├── cli.py -- entry point
│ ├── client.py -- HTTP calls to extracta-server
│ └── display.py -- rich terminal output
├── pyproject.toml
└── README.md
This CLI requires a running instance of extracta-server. By default it connects to http://localhost:8000.
To use a deployed server, update SERVER_URL in extracta/client.py.
pip install build twine
python -m build
twine upload dist/*

MIT -- see LICENSE
Swapnil Bhattacharya -- NorthCommits