extracta

Agentic Document Extraction CLI -- layout-aware, context-threaded extraction for PDF, PPTX, DOCX, and HTML.

What is Extracta?

Extracta is a CLI tool that sends your documents to extracta-server for agentic extraction. Beyond pulling raw text, it:

Detects layout per page (single column, multi-column, mixed, table-heavy, image-heavy)
Determines the correct reading order (H-Major or V-Major) using projection profile analysis and Recursive XY-Cut
Threads context between segments using an LLM agent (ADE -- Agentic Document Extraction)
Outputs structured JSON with full context metadata per region

Installation

pip install extracta

Usage

extracta-extract path/to/your/document.pdf

Output is saved in the same directory as the input file, named after the input with an _extracted.json suffix (document_extracted.json for the command above).
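If you want to pick up the result from a script, the output location can be derived from the input path. The following is a minimal sketch, assuming the `<input name>_extracted.json` pattern seen in this README holds for every format; `extracted_path` and `load_extraction` are hypothetical helper names, not part of the extracta package.

```python
import json
from pathlib import Path


def extracted_path(input_file: str) -> Path:
    """Return the assumed output path: <stem>_extracted.json next to the input."""
    src = Path(input_file)
    return src.with_name(f"{src.stem}_extracted.json")


def load_extraction(input_file: str) -> dict:
    """Load the JSON written by an earlier extracta-extract run on input_file."""
    with extracted_path(input_file).open(encoding="utf-8") as fh:
        return json.load(fh)


print(extracted_path("path/to/your/document.pdf"))
```

`load_extraction` raises `FileNotFoundError` if the CLI has not been run on that file yet.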

Supported Formats

Format	Extension
PDF	`.pdf`
PowerPoint	`.pptx`
Word	`.docx`
HTML	`.html` / `.htm`

Example Output

{
  "file": "report.pdf",
  "format": "pdf",
  "total_pages": 3,
  "pages": [
    {
      "page_number": 1,
      "layout_type": "multi_col",
      "strategy": "v_major",
      "regions": [
        {
          "region_id": "p1_r1",
          "type": "title",
          "text": "Efficacy in Treatment-Naive Patients",
          "bbox": { "x0": 50.0, "y0": 40.0, "x1": 540.0, "y1": 65.0 },
          "sequence": 1,
          "context_thread_id": "thread_001",
          "context_role": "heading",
          "continues_on_page": null,
          "references_region": null
        }
      ],
      "full_text": "Efficacy in Treatment-Naive Patients\n\nIn clinical trials..."
    }
  ]
}
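One way to consume this structure is to collect regions by their `context_thread_id`, so that each thread can be read as a unit. This is a sketch based only on the fields shown in the example above; `regions_by_thread` is a hypothetical helper, and it assumes pages appear in order and that `sequence` orders regions within a page.

```python
from collections import defaultdict


def regions_by_thread(doc: dict) -> dict:
    """Map each context_thread_id to its regions, in page and sequence order."""
    threads = defaultdict(list)
    for page in doc.get("pages", []):
        for region in sorted(page.get("regions", []), key=lambda r: r["sequence"]):
            threads[region["context_thread_id"]].append(region)
    return dict(threads)


# Trimmed copy of the example output above.
doc = {
    "file": "report.pdf",
    "pages": [
        {
            "page_number": 1,
            "regions": [
                {
                    "region_id": "p1_r1",
                    "type": "title",
                    "text": "Efficacy in Treatment-Naive Patients",
                    "sequence": 1,
                    "context_thread_id": "thread_001",
                    "context_role": "heading",
                }
            ],
        }
    ],
}

for thread_id, regions in regions_by_thread(doc).items():
    print(thread_id, [(r["region_id"], r["context_role"]) for r in regions])
```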

Terminal Output

╭──────────────────────────────────────────╮
│  Extracta -- Agentic Document Extraction  │
╰──────────────────────────────────────────╯
  File   : report.pdf
  Server : http://localhost:8000

  Analysing layout...

  Page   Layout        Strategy    Regions
  1      multi_col     V-Major     12
  2      single_col    H-Major     8
  3      mixed         V-Major     15

  Running ADE context threading...

  Done -- 3 pages | 35 regions

  Output : report_extracted.json

How It Works

File Uploaded
     ↓
[DETECT]   -- scan all pages, determine H-Major or V-Major per page
     ↓
[EXTRACT]  -- extract blocks in natural reading order using Recursive XY-Cut
     ↓
[ADE]      -- LLM agent threads context, links segments, assigns roles
     ↓
JSON Output
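To give a feel for the EXTRACT step, here is a simplified, textbook-style Recursive XY-Cut over bounding boxes. It is a sketch under stated assumptions, not the server's implementation: the function names, the `min_gap` threshold, and the choice to try horizontal cuts before vertical ones are all illustrative, and a production version would likely work on projection profiles of the page rather than on box edges.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), y grows downward


def _split(boxes: List[Box], lo: int, hi: int, min_gap: float) -> List[List[Box]]:
    """Group boxes along one axis, starting a new group at each empty gap."""
    ordered = sorted(boxes, key=lambda b: b[lo])
    groups, current, reach = [], [ordered[0]], ordered[0][hi]
    for box in ordered[1:]:
        if box[lo] - reach >= min_gap:  # whitespace gap wide enough to cut
            groups.append(current)
            current = []
        current.append(box)
        reach = max(reach, box[hi])
    groups.append(current)
    return groups


def xy_cut(boxes: List[Box], min_gap: float = 5.0) -> List[Box]:
    """Return boxes in reading order by recursively cutting at whitespace gaps."""
    if len(boxes) <= 1:
        return list(boxes)
    # Try horizontal cuts (gaps along y) first, then vertical cuts (gaps along x).
    for lo, hi in ((1, 3), (0, 2)):
        groups = _split(boxes, lo, hi, min_gap)
        if len(groups) > 1:
            return [b for group in groups for b in xy_cut(group, min_gap)]
    # No clean cut exists: fall back to top-to-bottom, then left-to-right.
    return sorted(boxes, key=lambda b: (b[1], b[0]))


# A full-width title above two columns of two paragraphs each (made-up coordinates).
title = (50, 40, 540, 65)
left1, left2 = (50, 80, 280, 200), (50, 210, 280, 400)
right1, right2 = (310, 80, 540, 250), (310, 260, 540, 400)

order = xy_cut([right2, left1, title, right1, left2])
print(order == [title, left1, left2, right1, right2])
```

The title is separated by a horizontal cut, the remaining blocks by a vertical cut between the columns, and each column is then read top to bottom.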

Layout Types

Layout Type	Description
single_col	Simple single column document
multi_col	Two or more columns (e.g. academic papers)
mixed	Complex irregular layout (e.g. pharma slides)
table_heavy	Majority of content is tabular
image_heavy	Majority of content is images

Reading Strategies

Strategy	Description
V-Major	Vertical-first -- top to bottom within each column
H-Major	Horizontal-first -- left to right across each row
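The difference between the two strategies is easiest to see on a small grid. The sketch below assumes regions that align exactly on a 2x2 grid and uses plain sorting; the coordinates and the `reading_order` helper are illustrative, and the string "h_major" is assumed by analogy with the "v_major" value shown in the example output.

```python
def reading_order(boxes: dict, strategy: str) -> list:
    """Order labelled (x0, y0, x1, y1) boxes that sit on a clean grid."""
    if strategy == "v_major":
        # Column first (x0), then top to bottom within the column (y0).
        key = lambda item: (item[1][0], item[1][1])
    elif strategy == "h_major":
        # Row first (y0), then left to right within the row (x0).
        key = lambda item: (item[1][1], item[1][0])
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [label for label, _ in sorted(boxes.items(), key=key)]


# A = top-left, B = top-right, C = bottom-left, D = bottom-right.
grid = {
    "A": (50, 80, 280, 200),
    "B": (310, 80, 540, 200),
    "C": (50, 220, 280, 400),
    "D": (310, 220, 540, 400),
}

print(reading_order(grid, "v_major"))  # ['A', 'C', 'B', 'D']
print(reading_order(grid, "h_major"))  # ['A', 'B', 'C', 'D']
```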

Context Roles

Role	Description
heading	Section title or heading
body	Main body paragraph
callout	Sidebar, highlighted box, callout
caption	Image or table caption
footnote	Footer or footnote text
continuation	Continues directly from a previous block
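As an example of putting the roles to use, regions marked `continuation` can be folded back into the block they continue. This sketch assumes a continuation always follows its parent block in `sequence` order, which this README does not guarantee; `merge_continuations` is a hypothetical helper and the sample texts are placeholders.

```python
def merge_continuations(regions: list) -> list:
    """Append each 'continuation' region's text to the block before it."""
    merged = []
    for region in sorted(regions, key=lambda r: r["sequence"]):
        if region["context_role"] == "continuation" and merged:
            merged[-1]["text"] += " " + region["text"]
        else:
            merged.append(dict(region))  # copy so the input is left untouched
    return merged


sample = [
    {"sequence": 2, "context_role": "body", "text": "First half of a paragraph"},
    {"sequence": 3, "context_role": "continuation", "text": "and its second half."},
    {"sequence": 1, "context_role": "heading", "text": "Section title"},
]

for block in merge_continuations(sample):
    print(block["context_role"], "|", block["text"])
```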

Project Structure

extracta-client/
├── extracta/
│   ├── __init__.py
│   ├── cli.py          -- entry point
│   ├── client.py       -- HTTP calls to extracta-server
│   └── display.py      -- rich terminal output
├── pyproject.toml
└── README.md

Server

This CLI requires a running instance of extracta-server. By default it connects to http://localhost:8000.

To use a deployed server, update SERVER_URL in extracta/client.py.
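Before running an extraction, it can help to confirm that something is listening at the configured address. The sketch below only checks that a TCP connection can be opened; it does not verify that the listener is actually extracta-server, and `server_reachable` is a hypothetical helper rather than part of the CLI.

```python
import socket
from urllib.parse import urlsplit

SERVER_URL = "http://localhost:8000"  # default address stated in this README


def server_reachable(url: str = SERVER_URL, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the URL's host and port succeeds."""
    parts = urlsplit(url)
    # Fall back to the scheme's standard port when the URL does not name one.
    port = parts.port or (443 if parts.scheme == "https" else 80)
    try:
        with socket.create_connection((parts.hostname, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    print("reachable" if server_reachable() else "not reachable")
```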

Publishing to PyPI

pip install build twine
python -m build
twine upload dist/*

License

MIT -- see LICENSE

Author

Swapnil Bhattacharya -- NorthCommits
