ArtifexSoftware/gigatext

GigaText

PDF intelligence for AI coding agents.

Why

GigaText grew out of looking at how Claude Code reads PDFs internally. The current pattern sends the PDF binary to the model API as a base64 document block on every read. That spends tokens on binary encoding, limits each read to 20 pages, and gives you no OCR or table detection.

GigaText replaces that flow with local extraction first: read the PDF on your machine, convert it to structured markdown, and send only the text to the model.

Install

pip install gigatext

Requires Python 3.12+.

Quick start

--pages uses zero-based page numbers.

gigatext read file.pdf
gigatext info file.pdf
gigatext read file.pdf --pages 0,1,2,3,4
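
The commands above can also be driven from a script. The following is a minimal Python sketch, assuming only the `gigatext read` command and the `--pages` flag shown above; the helper names `build_read_command` and `read_markdown` are hypothetical and are not part of the package.

```python
import subprocess
from typing import Optional, Sequence


def build_read_command(pdf_path: str, pages: Optional[Sequence[int]] = None) -> list:
    """Build the argv for `gigatext read`, optionally with zero-based pages.

    Hypothetical helper for illustration; it only assembles the CLI call.
    """
    cmd = ["gigatext", "read", pdf_path]
    if pages:
        if any(p < 0 for p in pages):
            raise ValueError("--pages expects zero-based, non-negative page numbers")
        cmd += ["--pages", ",".join(str(p) for p in pages)]
    return cmd


def read_markdown(pdf_path: str, pages: Optional[Sequence[int]] = None) -> str:
    """Run gigatext and return whatever it writes to stdout."""
    result = subprocess.run(
        build_read_command(pdf_path, pages),
        capture_output=True,  # stdout and stderr are captured separately
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout


print(build_read_command("file.pdf", [0, 1, 2, 3, 4]))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.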

Token comparison

| Document | Mode | Pages read | Input tokens | Notes |
| --- | --- | --- | --- | --- |
| Purchase order | Anthropic native PDF estimate | 8 | ~12,544 | Uses Anthropic's published ~1,568 tokens per PDF page |
| Purchase order | gigatext read | 8 | 6,595 | About 47% fewer tokens than the estimate above |
| 10-K annual report | Anthropic native PDF estimate | 185 | ~290,120 | Exceeds a 200K context window |
| 10-K annual report | gigatext read | 185 | 166,670 | Full document read stays below the native PDF estimate |
| 10-K annual report | gigatext read --pages 0,1,2,3,4 | 5 | 3,258 | Targeted read for the first 5 pages |

For the 185-page 10-K, a targeted read was about 60x faster than a full read in a local comparison run: 1.48s with --pages 0,1,2,3,4 versus 92.5s for the full document.
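
The arithmetic behind these figures can be checked with a short Python sketch. It assumes the native estimate is simply pages multiplied by the ~1,568 tokens per page cited in the comparison; the function names are illustrative. Under that assumption the 185-page estimate comes out at 290,080, within rounding of the ~290,120 reported above.

```python
TOKENS_PER_NATIVE_PAGE = 1568  # approximate per-page figure cited in the comparison


def native_estimate(pages: int) -> int:
    """Estimated input tokens when a PDF is sent natively, page by page."""
    return pages * TOKENS_PER_NATIVE_PAGE


def percent_fewer(baseline: int, actual: int) -> float:
    """Percentage reduction of `actual` relative to `baseline`."""
    return 100 * (baseline - actual) / baseline


purchase_order = native_estimate(8)
print(purchase_order)                              # 12544
print(round(percent_fewer(purchase_order, 6595)))  # 47

annual_report = native_estimate(185)
print(annual_report)            # 290080
print(annual_report > 200_000)  # True: larger than a 200K context window

# Speedup of the targeted read over the full read, from the reported timings.
print(round(92.5 / 1.48, 1))    # 62.5
```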

Three ways to use

CLI

Use gigatext when you want clean markdown on stdout or fast metadata before reading.

gigatext read report.pdf
gigatext info report.pdf
gigatext read report.pdf --pages 0,1,2

Claude Code skill

The package includes gigatext/skill/SKILL.md, which tells Claude Code to use gigatext instead of sending PDFs as base64 to the API.

Typical flow:

gigatext info file.pdf
gigatext read file.pdf --pages 0,1,2
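
For long documents, the info-then-read flow can be extended to walk the file a few pages at a time. The sketch below is one possible approach, not something the skill file is known to prescribe: it takes a page count (which you would obtain from `gigatext info`; its output format is not described here, so the count is passed in by hand) and yields zero-based `--pages` values. The name `page_windows` is hypothetical.

```python
from typing import Iterator


def page_windows(page_count: int, window: int) -> Iterator[str]:
    """Yield zero-based --pages values that cover a document window by window."""
    if page_count < 0 or window < 1:
        raise ValueError("page_count must be >= 0 and window must be >= 1")
    for start in range(0, page_count, window):
        stop = min(start + window, page_count)  # the last window may be shorter
        yield ",".join(str(p) for p in range(start, stop))


# Example: a 7-page document read 3 pages at a time.
for spec in page_windows(7, 3):
    print("gigatext read file.pdf --pages", spec)
# gigatext read file.pdf --pages 0,1,2
# gigatext read file.pdf --pages 3,4,5
# gigatext read file.pdf --pages 6
```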

MCP server

Run GigaText as an MCP server over stdio. It exposes read_pdf and pdf_info.

gigatext serve
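
Many MCP clients register stdio servers through a JSON file with an `mcpServers` map of command and arguments. The Python sketch below generates such an entry for `gigatext serve`. The file layout is an assumption about your client rather than something GigaText documents, and the server label is arbitrary, so check your client's own configuration reference before using it.

```python
import json


def mcp_config(server_name: str = "gigatext") -> dict:
    """Return an `mcpServers`-style entry that launches GigaText over stdio.

    The layout follows a convention used by common MCP clients; it is an
    assumption here, not a format defined by GigaText.
    """
    return {
        "mcpServers": {
            server_name: {
                "command": "gigatext",
                "args": ["serve"],
            }
        }
    }


print(json.dumps(mcp_config(), indent=2))
```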

Features

  • Hybrid OCR - detects bad text regions automatically, OCRs only where needed, and was about 48% faster than full-page OCR in local tests
  • Page targeting - --pages keeps large documents practical for agent workflows
  • Document info - gigatext info returns page count and per-page flags before extraction
  • Clean stdout - OCR progress output is redirected from stdout (fd 1) to stderr (fd 2) so pipes stay clean
  • Layout-aware chunking - coming soon
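
The clean-stdout behavior can be illustrated with file-descriptor duplication. The following Python sketch shows the general technique only; it is not taken from GigaText's source, and `stdout_to_stderr` is a hypothetical name.

```python
import contextlib
import os
import sys


@contextlib.contextmanager
def stdout_to_stderr():
    """Temporarily point fd 1 at whatever fd 2 refers to."""
    sys.stdout.flush()          # push out anything buffered for the real stdout
    saved_stdout = os.dup(1)    # keep a handle on the original stdout
    try:
        os.dup2(2, 1)           # fd 1 now writes to the same place as fd 2
        yield
    finally:
        sys.stdout.flush()      # flush redirected output before restoring
        os.dup2(saved_stdout, 1)
        os.close(saved_stdout)


with stdout_to_stderr():
    # Anything written to fd 1 here, including by native libraries, goes to stderr.
    os.write(1, b"OCR progress: page 1\n")

print("# Extracted markdown")   # written to the real stdout again
```

Redirecting at the descriptor level, rather than reassigning `sys.stdout`, also catches output from native code that writes to fd 1 directly.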

Powered by

Built on PyMuPDF4LLM.

License

AGPL-3.0. Copyright Artifex Software Inc.

About

An experiment in packaging PyMuPDF4LLM as an AI agent toolkit for PDF extraction
