PDF intelligence for AI coding agents.
GigaText started from how Claude Code reads PDFs internally. The current pattern sends PDF binaries to the model API as base64 document blocks on every read, which spends tokens on binary encoding, is limited to 20 pages per read, and gives you no OCR or table detection.
GigaText replaces that flow with local extraction first. Read the PDF on your machine, convert it to structured markdown, and only send text to the model.

    pip install gigatext

Requires Python 3.12+.
--pages uses zero-based page numbers.
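
Because page numbers are zero-based, a page range read off a PDF viewer (which usually counts from 1) has to be shifted down by one before it is passed to `--pages`. The sketch below shows one way to do that. `to_pages_arg` is a hypothetical helper written for illustration and is not part of GigaText.

```python
def to_pages_arg(first_page: int, last_page: int) -> str:
    """Convert an inclusive 1-based page range into a zero-based --pages value."""
    if first_page < 1 or last_page < first_page:
        raise ValueError("expected 1 <= first_page <= last_page")
    # Viewer page N corresponds to zero-based index N - 1.
    return ",".join(str(page - 1) for page in range(first_page, last_page + 1))


print(to_pages_arg(1, 5))  # prints 0,1,2,3,4
```

The resulting string matches the form used on the command line, for example `gigatext read file.pdf --pages 0,1,2,3,4`.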

    gigatext read file.pdf
    gigatext info file.pdf
    gigatext read file.pdf --pages 0,1,2,3,4

| Document | Mode | Pages read | Input tokens | Notes |
|---|---|---|---|---|
| Purchase order | Anthropic native PDF estimate | 8 | ~12,544 | Uses Anthropic's published ~1,568 tokens per PDF page |
| Purchase order | `gigatext read` | 8 | 6,595 | About 47% fewer tokens than the estimate above |
| 10-K annual report | Anthropic native PDF estimate | 185 | ~290,120 | Exceeds a 200K context window |
| 10-K annual report | `gigatext read` | 185 | 166,670 | Full document read stays below the native PDF estimate |
| 10-K annual report | `gigatext read --pages 0,1,2,3,4` | 5 | 3,258 | Targeted read for the first 5 pages |
For the 185-page 10-K, targeted local reading was about 60x faster than a full read in the local comparison run: 1.48s for --pages versus 92.5s for the full document.
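
The arithmetic behind the purchase order rows and the timing comparison can be reproduced in a few lines. This is only a sketch: the function names are illustrative, and the per-page constant is the approximate figure cited in the table, not an exact tokenizer count.

```python
TOKENS_PER_PDF_PAGE = 1568  # approximate per-page figure cited in the table


def native_pdf_estimate(pages: int) -> int:
    """Estimate input tokens for sending `pages` PDF pages natively."""
    return pages * TOKENS_PER_PDF_PAGE


def percent_fewer(measured: int, baseline: int) -> float:
    """Percentage reduction of `measured` relative to `baseline`."""
    return (1 - measured / baseline) * 100


# Purchase order: 8 pages, 6,595 tokens measured with `gigatext read`.
estimate = native_pdf_estimate(8)
print(estimate)                              # prints 12544
print(round(percent_fewer(6595, estimate)))  # prints 47

# 10-K timing: 92.5 s for the full read versus 1.48 s with --pages.
print(round(92.5 / 1.48, 1))                 # prints 62.5
```

The timing ratio of 62.5 is what the text rounds to "about 60x".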
Use gigatext when you want clean markdown on stdout or fast metadata before reading.

    gigatext read report.pdf
    gigatext info report.pdf
    gigatext read report.pdf --pages 0,1,2

The package includes `gigatext/skill/SKILL.md`, which tells Claude Code to use `gigatext` instead of sending PDFs as base64 to the API.
Typical flow:

    gigatext info file.pdf
    gigatext read file.pdf --pages 0,1,2

Run GigaText as an MCP server over stdio. It exposes `read_pdf` and `pdf_info`.

    gigatext serve

- Hybrid OCR - detects bad text regions automatically, OCRs only where needed, and was about 48% faster than full-page OCR in local tests
- Page targeting - `--pages` keeps large documents practical for agent workflows
- Document info - `gigatext info` returns page count and per-page flags before extraction
- Clean stdout - OCR progress is redirected from fd 1 to fd 2 so pipes stay clean
- Layout-aware chunking - coming soon
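
The clean-stdout behavior in the list above can be illustrated with a descriptor-level redirect. The sketch below shows one common way to do this in Python. It is not taken from GigaText's source, and `redirect_stdout_fd` is a hypothetical name.

```python
import os
import sys
from contextlib import contextmanager


@contextmanager
def redirect_stdout_fd(target_fd: int = 2):
    """Temporarily point file descriptor 1 at `target_fd` (stderr by default)."""
    sys.stdout.flush()            # flush Python-level buffers before swapping
    saved_stdout = os.dup(1)      # keep a handle on the real stdout
    try:
        os.dup2(target_fd, 1)     # fd 1 now writes wherever target_fd points
        yield
    finally:
        sys.stdout.flush()
        os.dup2(saved_stdout, 1)  # restore the original stdout
        os.close(saved_stdout)


if __name__ == "__main__":
    with redirect_stdout_fd():
        # Stands in for a noisy native library writing directly to fd 1.
        os.write(1, b"OCR progress: page 1\n")
    print("# Extracted markdown")  # only this line reaches the real stdout
```

Redirecting at the descriptor level, as opposed to reassigning `sys.stdout`, also catches output written by native code that bypasses Python's stream objects.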
Built on PyMuPDF4LLM.
AGPL-3.0. Copyright Artifex Software Inc.