GigaText

PDF intelligence for AI coding agents.

Why

GigaText started from how Claude Code reads PDFs internally. The current pattern sends PDF binaries to the model API as base64 document blocks on every read, which spends tokens on binary encoding, is limited to 20 pages per read, and gives you no OCR or table detection.

GigaText replaces that flow with local extraction first. Read the PDF on your machine, convert it to structured markdown, and only send text to the model.
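
For a rough sense of the encoding overhead described above: base64 maps every 3 bytes of input to 4 ASCII characters, so a binary payload grows by about a third before it is sent. The sketch below measures that on synthetic bytes. `base64_overhead` is an illustrative helper, not part of GigaText, and it shows payload size only, not how any API counts tokens.

```python
import base64


def base64_overhead(raw: bytes) -> float:
    """Return encoded size divided by raw size for a binary payload."""
    if not raw:
        raise ValueError("payload must not be empty")
    encoded = base64.b64encode(raw)
    return len(encoded) / len(raw)


# Synthetic stand-in for PDF bytes (hypothetical data, not a real file).
sample = bytes(range(256)) * 12  # 3,072 bytes
ratio = base64_overhead(sample)
print(f"{len(sample)} raw bytes -> {len(base64.b64encode(sample))} base64 chars "
      f"(ratio {ratio:.3f})")
```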

Install

pip install gigatext

Requires Python 3.12+.

Quick start

--pages uses zero-based page numbers.

gigatext read file.pdf

gigatext info file.pdf

gigatext read file.pdf --pages 0,1,2,3,4
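
Because --pages is zero-based, "pages 1 to 5" as shown in a PDF viewer becomes 0,1,2,3,4. A small sketch of that conversion, assuming you are scripting the CLI from Python; `pages_arg` is a hypothetical helper, not something GigaText ships:

```python
def pages_arg(first_page: int, last_page: int) -> str:
    """Convert a 1-based inclusive page range to a zero-based --pages value."""
    if first_page < 1 or last_page < first_page:
        raise ValueError("expected 1 <= first_page <= last_page")
    # Shift each page down by one and join with commas, as in the examples above.
    return ",".join(str(p - 1) for p in range(first_page, last_page + 1))


print(pages_arg(1, 5))  # 0,1,2,3,4
```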

Token comparison

Document	Mode	Pages read	Input tokens	Notes
Purchase order	Anthropic native PDF estimate	8	~12,544	Uses Anthropic's published ~1,568 tokens per PDF page
Purchase order	`gigatext read`	8	6,595	About 47% fewer tokens than the estimate above
10-K annual report	Anthropic native PDF estimate	185	~290,080	Exceeds a 200K context window
10-K annual report	`gigatext read`	185	166,670	Full document read stays below the native PDF estimate
10-K annual report	`gigatext read --pages 0,1,2,3,4`	5	3,258	Targeted read for the first 5 pages

For the 185-page 10-K, targeted local reading was about 60x faster than a full read in the local comparison run: 1.48s for --pages versus 92.5s for the full document.
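
The derived figures above can be re-checked with a few lines of arithmetic. This sketch uses only numbers quoted in this section; the function names are illustrative.

```python
TOKENS_PER_PAGE = 1_568  # per-page estimate quoted in the table


def native_estimate(pages: int) -> int:
    """Estimated input tokens when each PDF page costs TOKENS_PER_PAGE."""
    return pages * TOKENS_PER_PAGE


def percent_saved(baseline: int, actual: int) -> float:
    """Percentage reduction of `actual` relative to `baseline`."""
    return 100 * (1 - actual / baseline)


po_native = native_estimate(8)
print(po_native)                               # 12544
print(round(percent_saved(po_native, 6_595)))  # 47

# Does the 185-page estimate fit in a 200K context window?
print(native_estimate(185) > 200_000)          # True

# Speedup of the targeted read over the full read (seconds from the text).
print(round(92.5 / 1.48, 1))                   # 62.5
```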

Three ways to use

CLI

Use gigatext when you want clean markdown on stdout or fast metadata before reading.

gigatext read report.pdf
gigatext info report.pdf
gigatext read report.pdf --pages 0,1,2

Claude Code skill

The package includes gigatext/skill/SKILL.md, which tells Claude Code to use gigatext instead of sending PDFs as base64 to the API.

Typical flow:

gigatext info file.pdf
gigatext read file.pdf --pages 0,1,2

MCP server

Run GigaText as an MCP server over stdio. It exposes read_pdf and pdf_info.

gigatext serve
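
Many MCP clients register stdio servers through a JSON entry under an `mcpServers` key that names a command and its arguments. The sketch below generates such an entry for `gigatext serve`. The server name and the config file your client reads are assumptions, so check your client's documentation before using it.

```python
import json


def mcp_server_entry(name: str = "gigatext") -> dict:
    """Build a stdio server entry in the common `mcpServers` layout (assumed)."""
    return {
        "mcpServers": {
            name: {
                "command": "gigatext",  # the CLI installed by pip
                "args": ["serve"],      # starts the stdio MCP server
            }
        }
    }


print(json.dumps(mcp_server_entry(), indent=2))
```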

Features

Hybrid OCR - detects bad text regions automatically, OCRs only where needed, and was about 48% faster than full-page OCR in local tests
Page targeting - --pages keeps large documents practical for agent workflows
Document info - gigatext info returns page count and per-page flags before extraction
Clean stdout - OCR progress is redirected from fd 1 to fd 2 so pipes stay clean
Layout-aware chunking - coming soon
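
The clean-stdout behaviour above relies on a general OS-level technique: temporarily pointing one file descriptor at another so that anything written to it, including output from native libraries, lands elsewhere. The following is a sketch of that technique under stated assumptions, not GigaText's actual implementation; `redirect_fd` is a hypothetical name.

```python
import os
import sys
from contextlib import contextmanager


@contextmanager
def redirect_fd(src: int = 1, dst: int = 2):
    """Temporarily make writes to fd `src` land on fd `dst`."""
    sys.stdout.flush()          # push out anything buffered before switching
    saved = os.dup(src)         # keep a handle to the original target
    try:
        os.dup2(dst, src)       # src now refers to the same file as dst
        yield
    finally:
        sys.stdout.flush()      # buffered writes go out while still redirected
        os.dup2(saved, src)     # restore the original target
        os.close(saved)


with redirect_fd():
    print("OCR progress 1/3")   # ends up on stderr
print("# Extracted markdown")   # ends up on stdout
```

With this pattern, piping the script's stdout into another tool carries only the second line, while the progress line still appears in the terminal.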

Powered by

Built on PyMuPDF4LLM.

License

AGPL-3.0. Copyright Artifex Software Inc.
