extracta

Agentic Document Extraction CLI -- layout-aware, context-threaded extraction for PDF, PPTX, DOCX, and HTML.



What is Extracta?

Extracta is a CLI tool that sends your documents to the Extracta server for intelligent, agentic extraction. It does not just pull text -- it:

  • Detects layout per page (single column, multi-column, mixed, table-heavy, image-heavy)
  • Determines the correct reading order (H-Major or V-Major) using projection profile analysis and Recursive XY-Cut
  • Threads context between segments using an LLM agent (ADE -- Agentic Document Extraction)
  • Outputs structured JSON with full context metadata per region

Installation

pip install extracta

Usage

extracta-extract path/to/your/document.pdf

Output is saved as <input name>_extracted.json in the same directory as the input file. For example, document.pdf produces document_extracted.json.

Supported Formats

Format        Extension
PDF           .pdf
PowerPoint    .pptx
Word          .docx
HTML          .html / .htm
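To process a whole folder, the CLI can be driven from a short script. The following is a minimal sketch, not part of the package: the helper names (is_supported, expected_output_path, extract_all) are hypothetical, the output naming is inferred from the examples in this README, and it assumes extracta-extract is on your PATH and exits with a non-zero status on failure.

```python
import subprocess
from pathlib import Path

# Extensions taken from the Supported Formats table.
SUPPORTED_EXTENSIONS = {".pdf", ".pptx", ".docx", ".html", ".htm"}


def is_supported(path: Path) -> bool:
    """Return True if the file extension is one Extracta lists as supported.

    Matching is case-insensitive here; that is this helper's choice,
    not documented CLI behaviour.
    """
    return path.suffix.lower() in SUPPORTED_EXTENSIONS


def expected_output_path(path: Path) -> Path:
    """Guess where the CLI writes its result.

    Assumption inferred from the README examples:
    report.pdf -> report_extracted.json, next to the input file.
    """
    return path.with_name(f"{path.stem}_extracted.json")


def extract_all(folder: Path) -> list:
    """Run extracta-extract on every supported file directly inside folder."""
    outputs = []
    for path in sorted(folder.iterdir()):
        if path.is_file() and is_supported(path):
            # check=True raises CalledProcessError on a non-zero exit code
            # (assumes the CLI signals failure that way).
            subprocess.run(["extracta-extract", str(path)], check=True)
            outputs.append(expected_output_path(path))
    return outputs
```

Calling extract_all(Path("docs")) would invoke the CLI once per supported file and return the paths where the results are expected to appear.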

Example Output

{
  "file": "report.pdf",
  "format": "pdf",
  "total_pages": 3,
  "pages": [
    {
      "page_number": 1,
      "layout_type": "multi_col",
      "strategy": "v_major",
      "regions": [
        {
          "region_id": "p1_r1",
          "type": "title",
          "text": "Efficacy in Treatment-Naive Patients",
          "bbox": { "x0": 50.0, "y0": 40.0, "x1": 540.0, "y1": 65.0 },
          "sequence": 1,
          "context_thread_id": "thread_001",
          "context_role": "heading",
          "continues_on_page": null,
          "references_region": null
        }
      ],
      "full_text": "Efficacy in Treatment-Naive Patients\n\nIn clinical trials..."
    }
  ]
}
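The output can be consumed with any JSON library. As a minimal sketch, assuming only the field names shown in the example above, the following groups regions by context_thread_id so that each thread can be read as one unit. The function name group_regions_by_thread is hypothetical, and the sample is a trimmed copy of the example output.

```python
import json
from collections import defaultdict


def group_regions_by_thread(result: dict) -> dict:
    """Map each context_thread_id to (page, role, text) tuples in reading order."""
    threads = defaultdict(list)
    for page in result["pages"]:
        # "sequence" gives the reading order of regions within a page.
        for region in sorted(page["regions"], key=lambda r: r["sequence"]):
            threads[region["context_thread_id"]].append(
                (page["page_number"], region["context_role"], region["text"])
            )
    return dict(threads)


# Trimmed copy of the example output (bbox and full_text omitted for brevity).
sample = json.loads("""
{
  "file": "report.pdf",
  "format": "pdf",
  "total_pages": 3,
  "pages": [
    {
      "page_number": 1,
      "layout_type": "multi_col",
      "strategy": "v_major",
      "regions": [
        {
          "region_id": "p1_r1",
          "type": "title",
          "text": "Efficacy in Treatment-Naive Patients",
          "sequence": 1,
          "context_thread_id": "thread_001",
          "context_role": "heading",
          "continues_on_page": null,
          "references_region": null
        }
      ]
    }
  ]
}
""")

threads = group_regions_by_thread(sample)
for thread_id, items in threads.items():
    for page_number, role, text in items:
        print(f"{thread_id} | page {page_number} | {role}: {text}")
```

For a real run, replace the inline sample with json.load on the generated file, for example report_extracted.json.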

Terminal Output

╭──────────────────────────────────────────╮
│  Extracta -- Agentic Document Extraction  │
╰──────────────────────────────────────────╯
  File   : report.pdf
  Server : http://localhost:8000

  Analysing layout...

  Page   Layout        Strategy    Regions
  1      multi_col     V-Major     12
  2      single_col    H-Major     8
  3      mixed         V-Major     15

  Running ADE context threading...

  Done -- 3 pages | 35 regions

  Output : report_extracted.json

How It Works

File Uploaded
     ↓
[DETECT]   -- scan all pages, determine H-Major or V-Major per page
     ↓
[EXTRACT]  -- extract blocks in natural reading order using Recursive XY-Cut
     ↓
[ADE]      -- LLM agent threads context, links segments, assigns roles
     ↓
JSON Output
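The [EXTRACT] step runs on the server, and its implementation is not shown in this repository. To illustrate the general idea behind Recursive XY-Cut, here is a simplified sketch that works on bounding boxes only: it looks for whitespace gaps along one axis, splits the page at those gaps, and recurses into each part. The names (Box, xy_cut, min_gap) and the gap threshold are assumptions made for this example.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), origin at top-left


def _split_at_gaps(boxes: List[Box], axis: int, min_gap: float) -> List[List[Box]]:
    """Group boxes separated by empty gaps along one axis (0 = x, 1 = y)."""
    ordered = sorted(boxes, key=lambda b: b[axis])
    groups = [[ordered[0]]]
    reach = ordered[0][axis + 2]  # furthest extent covered so far on this axis
    for box in ordered[1:]:
        if box[axis] - reach >= min_gap:
            groups.append([box])  # whitespace gap found: start a new group
        else:
            groups[-1].append(box)
        reach = max(reach, box[axis + 2])
    return groups


def xy_cut(boxes: List[Box], min_gap: float = 10.0) -> List[Box]:
    """Return boxes in reading order by recursively cutting at whitespace gaps."""
    if len(boxes) <= 1:
        return list(boxes)
    # Try splitting into rows first (gaps along y), then into columns (gaps along x).
    for axis in (1, 0):
        groups = _split_at_gaps(boxes, axis, min_gap)
        if len(groups) > 1:
            return [b for group in groups for b in xy_cut(group, min_gap)]
    # No clean cut exists: fall back to top-to-bottom, then left-to-right.
    return sorted(boxes, key=lambda b: (b[1], b[0]))


# A full-width title above two columns of two blocks each.
title = (50.0, 40.0, 540.0, 65.0)
left1 = (50.0, 80.0, 280.0, 200.0)
left2 = (50.0, 220.0, 280.0, 400.0)
right1 = (310.0, 80.0, 540.0, 250.0)
right2 = (310.0, 270.0, 540.0, 400.0)

ordered = xy_cut([right2, left1, title, right1, left2])
```

In this example the title is cut off first, the remainder is split into two columns, and each column is then read top to bottom, which matches the V-Major behaviour described for multi-column pages.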

Layout Types

Layout Type    Description
single_col     Simple single column document
multi_col      Two or more columns (e.g. academic papers)
mixed          Complex irregular layout (e.g. pharma slides)
table_heavy    Majority of content is tabular
image_heavy    Majority of content is images

Reading Strategies

Strategy    Description
V-Major     Vertical-first -- top to bottom within each column
H-Major     Horizontal-first -- left to right across each row
Context Roles

Role            Description
heading         Section title or heading
body            Main body paragraph
callout         Sidebar, highlighted box, callout
caption         Image or table caption
footnote        Footer or footnote text
continuation    Continues directly from a previous block

Project Structure

extracta-client/
├── extracta/
│   ├── __init__.py
│   ├── cli.py          -- entry point
│   ├── client.py       -- HTTP calls to extracta-server
│   └── display.py      -- rich terminal output
├── pyproject.toml
└── README.md

Server

This CLI requires a running instance of extracta-server. By default it connects to http://localhost:8000.

To use a deployed server, update SERVER_URL in extracta/client.py.


Publishing to PyPI

pip install build twine
python -m build
twine upload dist/*

License

MIT -- see LICENSE


Author

Swapnil Bhattacharya -- NorthCommits
