Production-ready PDF processing server for AI agents
PDF inspection • PDF search • Agent document map • Accessibility report • Visual evidence • Region crops • Configured OCR
PDF Reader MCP is a production-ready Model Context Protocol server that empowers AI agents with structured, local-first PDF processing capabilities. Inspect PDFs before extraction, search text evidence with page and bbox provenance, render page-level visual evidence, crop bbox-grounded page regions, run configured OCR for scanned-page text layers, then extract a full agent document map, accessibility report, text, Markdown, semantic citation chunks, images, tables, annotations, outlines, structure trees, form fields, attachment metadata, and agent-ready document elements with strong performance and reliability.
The Problem:
// Traditional PDF processing
- Sequential page processing (slow)
- No natural content ordering
- Complex path handling
- Poor error isolation
The Solution:
// PDF Reader MCP
- Preflight PDF inspection for agent extraction planning
- MCP-native PDF search with snippets and bbox evidence
- Bounded page rendering for visual evidence and OCR routing
- Bbox-grounded region crops for source evidence
- Configured local OCR provider for scanned-page text layers
- 5-10x faster parallel processing
- Full agent document map linking pages, elements, chunks, layout, safety, and geometry
- Semantic document AST for page/section/paragraph/list/table/image traversal
- PDF trust report for content safety, layout, table, and link-risk routing
- Accessibility report for tagged-PDF coverage, headings, images, forms, links, and permissions
- Structured element output for agent workflows
- Table quality diagnostics with inferred cell spans and continuation candidates
- Markdown rendering for RAG and summarization
- Citation-ready semantic/table/page chunks
- Layout diagnostics with reading-order confidence
- Outlines, annotations, structure trees, forms, attachments, labels, and permission signals
- Column-aware reading order
- Flexible path support (absolute/relative)
- Per-page error resilience
- CI-backed quality
Result: Production-ready PDF processing that scales.
- 5-10x faster than sequential with automatic parallelization
- 12,933 ops/sec error handling, 5,575 ops/sec text extraction
- Process 50-page PDFs in seconds with multi-core utilization
- TypeScript-first with performance-bounded local execution
- Path Flexibility - Absolute & relative paths, Windows/Unix support (v1.3.0)
- PDF Inspection - Profile PDFs before extraction and get recommended read_pdf arguments for agent workflows
- PDF Search Evidence - Search selected PDF pages with snippets, match offsets, text-item bounding boxes, and provenance
- Visual Page Evidence - Render selected pages as bounded PNG image parts with JSON provenance and pixel budgets
- Region Crop Evidence - Crop PDF-coordinate regions as bounded PNG image parts for table, figure, chart, and citation verification
- Visual Region Analysis - Send focused crops to a configured local provider and normalize table, chart, formula, figure, and image-description results
- Configured OCR Text Layer - Route rendered pages through an env-configured local OCR command and return normalized text, confidence, words, and provenance
- PDF Text Layer - Optional line and word records with page-level character ranges, best-effort bounding boxes, and provenance
- Agent Document Map - Optional page map that links elements, chunks, layout confidence, safety findings, routing signals, and page geometry
- Document AST - Optional semantic tree with page, section, paragraph, list item, table, and image nodes linked back to evidence IDs
- Trust Report - Optional consolidated report for prompt-injection text, hidden/off-page signals, layout uncertainty, sparse pages, table warnings, and external links
- Accessibility Report - Optional deterministic report for tagged-PDF coverage, structure tree availability, heading roles, image alt-text verifiability, form labels, link labels, and accessibility permissions
- Structured Elements - Optional page-level elements with stable IDs, provenance, and best-effort bounding boxes
- Table Intelligence - Optional table quality metrics, inferred header/span hints, sparse-cell warnings, and repeated-header continuation candidates
- Layout Diagnostics - Optional page profiles, column signals, and reading-order confidence for agent routing
- Markdown Rendering - Optional page-aware Markdown for RAG, summarization, and agent context
- Citation Chunks - Optional page, semantic, size, and table chunks with element IDs and best-effort bounding boxes
- Document Signals - Optional outlines, page labels, annotations, structure trees, forms, attachments, permissions, and mark info
- Smart Ordering - Column-aware content ordering improves natural reading flow
- Type Safe - Full TypeScript with strict mode enabled
- Battle-tested - Automated tests, strict TypeScript, and CI validation
- Simple API - inspect_pdf plans extraction, search_pdf finds text evidence, render_page returns visual evidence, extract_regions crops source evidence, analyze_regions enriches visual regions, ocr_pages runs configured OCR, read_pdf performs extraction
Real-world performance from production testing:
| Operation | Ops/sec | Performance | Use Case |
|---|---|---|---|
| Error handling | 12,933 | ⚡⚡⚡⚡⚡ | Validation & safety |
| Extract full text | 5,575 | ⚡⚡⚡⚡ | Document analysis |
| Extract page | 5,329 | ⚡⚡⚡⚡ | Single page ops |
| Multiple pages | 5,242 | ⚡⚡⚡⚡ | Batch processing |
| Metadata only | 4,912 | ⚡⚡⚡ | Quick inspection |
| Document | Sequential | Parallel | Speedup |
|---|---|---|---|
| 10-page PDF | ~2s | ~0.3s | 5-8x faster |
| 50-page PDF | ~10s | ~1s | 10x faster |
| 100+ pages | ~20s | ~2s | Linear scaling with CPU cores |
Benchmarks vary based on PDF complexity and system resources.
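The gap between the sequential and parallel columns comes from processing pages concurrently while keeping each page's failure contained. The following is a minimal sketch of that pattern, not the server's actual internals: `extractPage` is a hypothetical callback and `PageResult` is an assumed result shape.

```typescript
// Hypothetical result shape; the server's real types are not shown in this README.
type PageResult =
  | { page: number; ok: true; text: string }
  | { page: number; ok: false; error: string };

// Start every page extraction at once and wait for all of them.
// Each page catches its own error, so one bad page cannot reject the batch.
async function extractPagesInParallel(
  pages: number[],
  extractPage: (page: number) => Promise<string>, // hypothetical per-page extractor
): Promise<PageResult[]> {
  return Promise.all(
    pages.map(async (page): Promise<PageResult> => {
      try {
        return { page, ok: true, text: await extractPage(page) };
      } catch (error) {
        const message = error instanceof Error ? error.message : String(error);
        return { page, ok: false, error: message };
      }
    }),
  );
}
```

Results come back in the same order as the requested pages, so a caller can report failed pages alongside the text that was extracted successfully.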
claude mcp add pdf-reader -- npx @sylphx/pdf-reader-mcp
Add to claude_desktop_config.json:
{
"mcpServers": {
"pdf-reader": {
"command": "npx",
"args": ["@sylphx/pdf-reader-mcp"]
}
}
}
Config file locations:
- macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
- Windows: %APPDATA%\Claude\claude_desktop_config.json
- Linux: ~/.config/Claude/claude_desktop_config.json
code --add-mcp '{"name":"pdf-reader","command":"npx","args":["@sylphx/pdf-reader-mcp"]}'
- Open Settings → MCP → Add new MCP Server
- Select Command type
- Enter: npx @sylphx/pdf-reader-mcp
Add to your Windsurf MCP config:
{
"mcpServers": {
"pdf-reader": {
"command": "npx",
"args": ["@sylphx/pdf-reader-mcp"]
}
}
}
Add to Cline's MCP settings:
{
"mcpServers": {
"pdf-reader": {
"command": "npx",
"args": ["@sylphx/pdf-reader-mcp"]
}
}
}
- Go to Settings → AI → Manage MCP Servers → Add
- Command: npx, Args: @sylphx/pdf-reader-mcp
Add the server in Settings β MCP Servers β Add Server with command npx and args @sylphx/pdf-reader-mcp. See Ontheia's compatible MCP servers for the full list.
npx -y @smithery/cli install @sylphx/pdf-reader-mcp --client claude
# Quick start - zero installation
npx @sylphx/pdf-reader-mcp
# Or install globally
npm install -g @sylphx/pdf-reader-mcp
Use inspect_pdf when an agent needs to decide how to process an unfamiliar
PDF. It samples a bounded number of pages, detects selectable-text versus
image-like pages, surfaces document signals, and recommends useful read_pdf
arguments without extracting image bytes.
{
"sources": [{
"path": "documents/report.pdf"
}],
"sample_pages": 5,
"include_metadata": true
}
Result:
- PDF profile such as digital_text, scanned_or_image_only, or mixed_text_and_scan
- Page-level text density, token estimates, and image paint-operation counts
- Signals for outlines, page labels, forms, attachments, permissions, and structure trees
- Recommended read_pdf arguments for citation chunks, safety findings, tables, or OCR triage
Use search_pdf when an agent needs to locate text evidence before deciding
whether to read a whole page, crop a region, or cite a result.
{
"sources": [{
"path": "documents/report.pdf",
"pages": "1-20"
}],
"query": "risk controls",
"whole_word": true,
"max_matches_per_source": 10
}
Response includes:
- A JSON summary with profile: "pdf_search_results" and effective search options
- Page numbers, snippets, match offsets, and text-item indexes
- Best-effort text-item bounding boxes when coordinates are available
- Per-match provenance so agents can route hits into render_page or extract_regions
- Bounded defaults: max_pages default 100 and max_matches_per_source default 50
{
"sources": [{
"path": "documents/report.pdf"
}],
"include_full_text": true,
"include_metadata": true,
"include_page_count": true
}
Result:
- Full text content extracted
- PDF metadata (author, title, dates)
- Total page count
- Structured JSON summary for agent workflows
{
"sources": [{
"path": "documents/manual.pdf",
"pages": "1-5,10,15-20"
}],
"include_full_text": true
}
{
"sources": [{
"path": "documents/report.pdf",
"pages": "1-3"
}],
"include_elements": true,
"include_metadata": true,
"include_page_count": true
}
Response includes:
- Stable element IDs such as p1-text-1
- Page numbers and provenance for each element
- Best-effort bounding boxes when coordinates are available
- Text, image metadata, and table elements without embedding image bytes in the JSON summary
- Table elements include best-effort table and cell bounding boxes, quality metrics, header/span hints, and continuation candidates when coordinates are available
Use include_document_map when an agent needs one navigable PDF structure
instead of separate page, element, chunk, layout, and safety outputs.
{
"sources": [{
"path": "documents/report.pdf",
"pages": "1-5"
}],
"include_document_map": true,
"include_full_text": false
}
Response includes:
- Page records with element IDs, chunk IDs, safety finding indexes, text density, image count, table count, and page geometry
- Semantic elements and citation chunks derived from the same stable IDs
- Layout diagnostics and routing signals for low-confidence, sparse, and OCR-needed pages
- Safety findings linked back to page and element evidence
- No embedded image bytes inside the JSON document map
Use include_document_ast when an agent needs a navigable semantic tree rather
than reconstructing document structure from flat text items.
{
"sources": [{
"path": "documents/report.pdf",
"pages": "1-5"
}],
"include_document_ast": true,
"include_full_text": false
}
Response includes:
- A document_ast root with page, section, paragraph, list item, table, and image nodes
- Node-level element_ids, chunk_ids, bounding boxes, confidence, and semantic roles where available
- Table nodes with rows, quality diagnostics, and continuation candidates when tables are detected
- No forced top-level elements, chunks, or tables output unless those options are requested
Use include_text_layer when an agent needs deterministic line and word
references instead of only full text. It exposes page text, line records, word
records, page-level character ranges, best-effort bounding boxes, and
provenance from the same extracted text-content pass.
{
"sources": [{
"path": "documents/report.pdf",
"pages": "1-5"
}],
"include_text_layer": true,
"include_full_text": false
}
Response includes:
- A text_layer object with one page record per selected page
- Line IDs, line text, page-level char_start/char_end, and line bounding boxes when available
- Word text, page-level character ranges, and estimated word boxes when the line has geometry
- Summary counts for pages, lines, words, characters, and bbox coverage
- No forced full_text or raw page_contents output
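To illustrate what page-level character ranges mean for lines and words, here is a small sketch that derives them from a list of line strings. The record shapes, the `p1-line-1` ID format, the exclusive `char_end`, and the single newline joining lines are all assumptions made for illustration; the server's actual output may differ.

```typescript
// Assumed record shapes, loosely modeled on the text_layer description above.
interface WordRecord { text: string; char_start: number; char_end: number; }
interface LineRecord {
  id: string;
  text: string;
  char_start: number; // offset into the page text, inclusive
  char_end: number;   // offset into the page text, exclusive (assumption)
  words: WordRecord[];
}

// Build line and word records whose offsets index into lines.join("\n").
function buildTextLayer(pageNumber: number, lines: string[]): LineRecord[] {
  const records: LineRecord[] = [];
  let offset = 0;
  lines.forEach((text, index) => {
    const words: WordRecord[] = [];
    const wordPattern = /\S+/g;
    let match: RegExpExecArray | null;
    while ((match = wordPattern.exec(text)) !== null) {
      const start = offset + match.index;
      words.push({ text: match[0], char_start: start, char_end: start + match[0].length });
    }
    records.push({
      id: "p" + pageNumber + "-line-" + (index + 1), // hypothetical ID format
      text,
      char_start: offset,
      char_end: offset + text.length,
      words,
    });
    offset += text.length + 1; // +1 for the newline that joins lines
  });
  return records;
}
```

With ranges like these, an agent can cite a word by slicing the page text between `char_start` and `char_end` instead of re-searching the string.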
Use include_trust_report when an agent needs one local risk summary before
using extracted PDF content as instructions, evidence, or retrieval context.
{
"sources": [{
"path": "documents/report.pdf",
"pages": "1-5"
}],
"include_trust_report": true,
"include_full_text": false
}
Response includes:
- Document and page-level risk scores
- Content safety, layout uncertainty, sparse/scanned-page, table quality, and external-link signals
- Guidance for when to verify with OCR, page rendering, or region crops
- No forced top-level safety, layout, annotation, or table outputs unless those options are requested
Use include_accessibility_report when an agent needs a deterministic view of
tagged-PDF and accessibility-relevant structure before relying on the document
for navigation, form filling, summarization, or assisted reading workflows.
{
"sources": [{
"path": "documents/report.pdf",
"pages": "1-5"
}],
"include_accessibility_report": true,
"include_full_text": false
}
Response includes:
- Document and page-level accessibility scores and grades
- Tagged-page coverage, structure role counts, heading counts, image counts, link counts, and form field counts
- Issues for missing mark info, untagged pages, suspect tags, image alt-text verifiability, weak form labels, weak link labels, and missing copy_for_accessibility
- Guidance for when agents should verify semantics with source files, rendering, or region crops
- No forced top-level permissions, mark info, annotations, form fields, or structure trees unless those options are requested
Use render_page when an agent needs to inspect the original page image,
prepare OCR routing, or verify visual layout without stuffing base64 into JSON.
{
"sources": [{
"path": "documents/report.pdf",
"pages": "1-2"
}],
"scale": 2,
"max_pages": 2
}
Response includes:
- A JSON summary with page number, render scale, pixel count, byte length, evidence ID, and provenance
- PNG pages as MCP image content parts when include_image is true
- Bounded defaults: first page by default, max_pages default 5, and max_pixels_per_page default 16MP
- No rendered page base64 duplicated inside the first JSON content part
Use extract_regions when an agent has a table, figure, chart, formula, or
citation bounding box and needs a focused crop from the original page.
{
"sources": [{
"path": "documents/report.pdf",
"regions": [{
"id": "table-1",
"page": 1,
"bounding_box": { "left": 72, "bottom": 420, "right": 540, "top": 620 },
"padding": 8
}]
}],
"scale": 2,
"max_regions": 20
}
Response includes:
- A JSON summary with region ID, source bounding box, crop pixel bounds, evidence ID, and provenance
- PNG region crops as MCP image content parts when include_image is true
- Bounded defaults: max_regions default 20 and max_pixels_per_page default 16MP
- No cropped image base64 duplicated inside the first JSON content part
Use analyze_regions when an agent has a crop target for a table, chart,
formula, figure, or image and wants a normalized local-provider result linked
back to source pixels. The provider is configured by environment variables, not
by request arguments.
{
"sources": [{
"path": "documents/report.pdf",
"regions": [{
"id": "chart-1",
"page": 2,
"bounding_box": { "left": 72, "bottom": 240, "right": 540, "top": 520 },
"padding": 8
}]
}],
"scale": 2,
"max_regions": 10,
"languages": ["eng"]
}
Response includes:
- A JSON summary with profile: "region_analysis" and the effective analysis options
- Region-level kind, description, text, Markdown, confidence, normalized table rows, formula fields, chart data points, warnings, and provenance when supplied by the provider
- source_crop_evidence_id, source bounding box, crop pixel bounds, and scale for every analyzed region
- Bounded defaults: max_regions default 20, max_pixels_per_page default 16MP, and timeout_ms default 60 seconds per region
- No cropped image base64 duplicated inside the JSON response
Use ocr_pages after inspect_pdf flags scanned or sparse pages, or when an
agent needs a text layer from pages that have little selectable text. The
server renders bounded page images and passes each temporary PNG to the
configured local OCR command.
{
"sources": [{
"path": "documents/scanned-report.pdf",
"pages": "1-3"
}],
"scale": 2,
"max_pages": 3,
"languages": ["eng"]
}
Response includes:
- A JSON summary with profile: "ocr_text_layer" and the effective OCR options
- Page-level OCR text, confidence, optional word bounding boxes, language, and provenance
- source_render_evidence_id linking each OCR page back to the page render used as OCR input
- Bounded defaults: max_pages default 5, max_pixels_per_page default 16MP, and timeout_ms default 60 seconds per page
- No rendered image base64 duplicated inside the JSON response
{
"sources": [{
"path": "documents/report.pdf",
"pages": "1-5"
}],
"include_markdown": true,
"include_full_text": false
}
Response includes:
- Page-aware Markdown sections
- Text blocks in extraction order
- Image placeholders with dimensions when images are requested
- Extracted tables appended as Markdown when include_tables is enabled
{
"sources": [{
"path": "documents/report.pdf",
"pages": "1-5"
}],
"include_chunks": true,
"include_semantic_hints": true,
"include_tables": true,
"include_full_text": false
}
Response includes:
- Stable chunk IDs such as p1-chunk-1
- Page ranges for each chunk
- Chunk strategies such as page, semantic, size, and table
- Semantic headings when heading boundaries are available
- Element IDs that map back to structured elements
- Best-effort bounding boxes for source highlighting
{
"sources": [{
"path": "documents/spec.pdf",
"pages": "1-5"
}],
"include_outline": true,
"include_annotations": true,
"include_page_labels": true,
"include_permissions": true,
"include_structure_tree": true,
"include_form_fields": true,
"include_attachments": true
}
Response includes, when available:
- Bookmark/outline trees
- Page labels such as roman numerals or section labels
- Link and note annotation summaries with bounding boxes
- Tagged PDF structure trees for selected pages when available
- Form field summaries with values, field types, and bounding boxes when available
- Embedded attachment metadata without returning attachment bytes
- Permission labels and marking signals
// Windows - Both formats work!
{
"sources": [{
"path": "C:\\Users\\John\\Documents\\report.pdf"
}],
"include_full_text": true
}
// Unix/Mac
{
"sources": [{
"path": "/home/user/documents/contract.pdf"
}],
"include_full_text": true
}
No more "Absolute paths are not allowed" errors!
{
"sources": [{
"path": "presentation.pdf",
"pages": [1, 2, 3]
}],
"include_images": true,
"include_full_text": true
}
Response includes:
- Text and images in Y-coordinate reading order
- Base64-encoded images with metadata (width, height, format)
- Natural reading flow preserved for AI comprehension
{
"sources": [
{ "path": "C:\\Reports\\Q1.pdf", "pages": "1-10" },
{ "path": "/home/user/Q2.pdf", "pages": "1-10" },
{ "url": "https://example.com/Q3.pdf" }
],
"include_full_text": true
}
All PDFs are processed in parallel automatically!
- PDF Inspection - Profile PDFs before extraction, detect low-text/scanned pages, and recommend read_pdf options
- Text Extraction - Full document or specific pages with intelligent parsing
- PDF Search Evidence - Literal search with page numbers, snippets, match offsets, text-item bounding boxes, and provenance
- Image Extraction - Base64-encoded with complete metadata (width, height, format)
- Agent Document Map - Pages, elements, chunks, layout diagnostics, safety findings, routing signals, and geometry in one contract
- Document AST - Semantic tree for page, section, paragraph, list item, table, and image traversal
- Trust Report - Local risk routing for content safety, layout uncertainty, table quality, sparse pages, and external links
- Accessibility Report - Tagged-PDF coverage, structure tree, heading, image, form, link, and permission signals
- PDF Text Layer - Line records, word records, character ranges, best-effort bounding boxes, and provenance
- Configured OCR Text Layer - Optional command-provider OCR over rendered pages, with normalized text, confidence, words, language, and provenance
- Structured Elements - Agent-ready elements with stable IDs, provenance, and best-effort bounding boxes
- Markdown Output - Page-aware Markdown for RAG, summaries, and context preparation
- Citation Chunks - Page, semantic, size, and table chunks with source references for downstream retrieval
- Document Signals - Outlines, annotations, structure trees, forms, attachments, page labels, permissions, and mark info when exposed by the PDF
- Content Ordering - Column-aware layout preservation for natural reading flow
- Metadata Extraction - Author, title, creation date, and custom properties
- Page Counting - Fast enumeration without loading full content
- Dual Sources - Local files (absolute or relative paths) and HTTP/HTTPS URLs
- Batch Processing - Multiple PDFs processed concurrently
- 5-10x Performance - Parallel page processing with Promise.all
- Smart Pagination - Extract ranges like "1-5,10-15,20"
- Multi-Format Images - RGB, RGBA, Grayscale with automatic detection
- Path Flexibility - Windows, Unix, and relative paths all supported (v1.3.0)
- Error Resilience - Per-page error isolation with detailed messages
- Large File Support - Efficient streaming and memory management
- Type Safe - Full TypeScript with strict mode enabled
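Page range strings such as "1-5,10-15,20" can be expanded into explicit page numbers before extraction. The sketch below shows one way to do that; `parsePageRanges` is a hypothetical helper, and clamping ranges to the document length (rather than rejecting them) is an assumption, not documented server behavior.

```typescript
// Expand a spec such as "1-5,10,15-20" into sorted, unique, 1-based page numbers.
function parsePageRanges(spec: string, pageCount: number): number[] {
  const pages = new Set<number>();
  for (const part of spec.split(",")) {
    const match = /^\s*(\d+)\s*(?:-\s*(\d+)\s*)?$/.exec(part);
    if (match === null) {
      throw new Error('Invalid page range: "' + part + '"');
    }
    const start = Number(match[1]);
    const end = match[2] === undefined ? start : Number(match[2]);
    if (start < 1 || end < start) {
      throw new Error('Invalid page range: "' + part + '"');
    }
    // Assumption: ranges that run past the last page are clamped, not rejected.
    for (let page = start; page <= Math.min(end, pageCount); page++) {
      pages.add(page);
    }
  }
  return Array.from(pages).sort((a, b) => a - b);
}
```

For a 17-page document, `parsePageRanges("1-5,10,15-20", 17)` yields pages 1 to 5, 10, and 15 to 17.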
include_document_map returns a single agent-ready map that links pages,
structured elements, citation chunks, layout diagnostics, content safety
findings, routing signals, and page geometry. It is designed for agents that
need to navigate the original PDF evidence without manually stitching together
separate response fields.
The map is performance-bounded: it reuses the same extraction path, keeps image bytes out of JSON, and provides page-level routing signals such as low-confidence pages and pages that likely need OCR.
include_accessibility_report returns a deterministic report for tagged-PDF
coverage, page structure trees, heading roles, image alt-text verifiability,
form field labels, link labels, mark info, and copy_for_accessibility
permissions. It gives agents routing guidance without claiming PDF/UA
certification or forcing raw structure outputs into top-level JSON.
ocr_pages renders selected PDF pages and sends those temporary PNGs to a
local OCR command configured by environment variables. This keeps the default
TypeScript package private and dependency-bounded while giving teams a real
scanned PDF path when they already run Tesseract, PaddleOCR, a local HTTP shim,
or an internal OCR binary. MCP_PDF_OCR_PRESET=tesseract provides a built-in
Tesseract command template without bundling an OCR model.
The OCR provider is env-only, not request-controlled. Tool responses normalize provider output into page text, confidence, optional word boxes, language, render evidence IDs, and provenance. Image bytes are not embedded in the JSON response.
inspect_pdf adds a bounded planning tool for agent workflows. It samples
up to 20 pages per source, counts selectable text and image paint operations,
surfaces document-level signals, and returns a recommendation with the next
best read_pdf arguments.
Inspection is intentionally low overhead: it does not decode image bytes and it
does not perform OCR. When sampled pages look scanned or image-only, the tool
marks needs_ocr: true so agents do not mistake an image-based PDF for a text
extraction failure. It also reports safe optional-provider readiness for
ocr_pages and analyze_regions without exposing local command paths.
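An agent can turn the inspection result into a tool plan. The sketch below is one possible routing rule, assuming a simplified result shape: `InspectionSummary` and its `ocr_provider_ready` field are hypothetical names, while the profile values and `needs_ocr` come from this README.

```typescript
type PdfProfile =
  | "digital_text"
  | "scanned_or_image_only"
  | "mixed_text_and_scan"
  | "low_text_or_form"
  | "unknown";

// Hypothetical, simplified view of an inspect_pdf result.
interface InspectionSummary {
  profile: PdfProfile;
  needs_ocr: boolean;
  ocr_provider_ready: boolean; // assumed field name for provider readiness
}

type ToolName = "read_pdf" | "ocr_pages" | "render_page";

// Pick the next tool calls: text extraction when possible, OCR when configured,
// and page rendering as the visual fallback when no OCR provider is ready.
function planNextSteps(inspection: InspectionSummary): ToolName[] {
  if (!inspection.needs_ocr) {
    return ["read_pdf"];
  }
  const visualStep: ToolName = inspection.ocr_provider_ready ? "ocr_pages" : "render_page";
  return inspection.profile === "mixed_text_and_scan"
    ? ["read_pdf", visualStep]
    : [visualStep];
}
```

Mixed documents get both steps because some pages still carry selectable text that is cheaper to extract directly.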
include_layout_diagnostics adds deterministic page-level signals for layout
profile, reading-order model, confidence, column count, positioned item ratio,
and warnings. This helps agents decide when local extraction is safe for RAG and
when a page should be routed to a heavier parser, OCR/vision workflow, or human
review.
include_elements adds structured document elements to the JSON response while keeping the existing text, metadata, image, and table outputs backward compatible.
{
"sources": [{ "path": "report.pdf" }],
"include_elements": true,
"include_semantic_hints": true
}
Elements include stable IDs, page numbers, provenance, and best-effort bounding boxes where available. Image bytes stay out of the JSON summary so MCP clients can keep context payloads manageable.
include_semantic_hints adds deterministic heading/list/paragraph hints to text elements, with confidence and signals, without claiming a full semantic parser.
include_markdown adds page-aware Markdown for workflows that need clean text context without manually rebuilding sections from raw page text.
include_html adds an escaped HTML rendering for previews, export workflows, and downstream conversion.
The extraction pipeline also separates distant same-line text into independent segments before ordering, which improves multi-column PDFs without requiring any extra configuration.
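Splitting distant same-line text can be pictured as a gap test along the x axis. This is a minimal sketch of that idea, not the pipeline's actual code: the `TextItem` shape and the `maxGap` threshold are assumptions.

```typescript
// Assumed shape for a positioned text item on one visual line.
interface TextItem { text: string; x: number; width: number; }

// Split one line's items into segments wherever the horizontal gap between
// the previous item's right edge and the next item's left edge exceeds maxGap.
function splitLineIntoSegments(items: TextItem[], maxGap: number): TextItem[][] {
  const sorted = items.slice().sort((a, b) => a.x - b.x);
  const segments: TextItem[][] = [];
  let current: TextItem[] = [];
  let previousRight = Number.NEGATIVE_INFINITY;
  for (const item of sorted) {
    if (current.length > 0 && item.x - previousRight > maxGap) {
      segments.push(current);
      current = [];
    }
    current.push(item);
    previousRight = Math.max(previousRight, item.x + item.width);
  }
  if (current.length > 0) {
    segments.push(current);
  }
  return segments;
}
```

Once a line is split this way, each segment can be ordered with the column it belongs to instead of being read straight across the page.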
include_chunks adds citation-ready chunks with stable IDs, strategy labels, element references, and best-effort bounding boxes for downstream retrieval and citation workflows. When include_semantic_hints is also enabled, chunks split on deterministic heading boundaries; table chunks are emitted when table extraction is requested.
include_outline, include_annotations, include_page_labels, include_page_geometry, include_permissions, include_structure_tree, include_form_fields, and include_attachments expose additional document signals without changing the default response shape.
include_safety_findings adds deterministic findings for common prompt-injection patterns, tiny text, and off-page text so agents can inspect risky document content before using it as instructions.
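The three finding families named above can be expressed as simple deterministic checks. The sketch below is illustrative only: the patterns, the font-size threshold, the finding type names, and the `PositionedText` shape are assumptions, not the server's rule set.

```typescript
// Assumed shape for a positioned text item.
interface PositionedText { text: string; x: number; y: number; fontSize: number; }

type FindingType = "prompt_injection_pattern" | "tiny_text" | "off_page_text";
interface SafetyFinding { type: FindingType; text: string; }

// Illustrative patterns only; a real rule set would be broader.
const INJECTION_PATTERNS: RegExp[] = [
  /ignore (all )?(previous|prior) instructions/i,
  /disregard (the )?(system|above) (prompt|instructions)/i,
];

function findSafetySignals(
  items: PositionedText[],
  pageWidth: number,
  pageHeight: number,
  minFontSize = 2, // assumed threshold in PDF points
): SafetyFinding[] {
  const findings: SafetyFinding[] = [];
  for (const item of items) {
    if (INJECTION_PATTERNS.some((pattern) => pattern.test(item.text))) {
      findings.push({ type: "prompt_injection_pattern", text: item.text });
    }
    if (item.fontSize < minFontSize) {
      findings.push({ type: "tiny_text", text: item.text });
    }
    if (item.x < 0 || item.y < 0 || item.x > pageWidth || item.y > pageHeight) {
      findings.push({ type: "off_page_text", text: item.text });
    }
  }
  return findings;
}
```

Because every check is a fixed rule over text and geometry, the same PDF always produces the same findings.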
// Windows
{ "path": "C:\\Users\\John\\Documents\\report.pdf" }
{ "path": "C:/Users/John/Documents/report.pdf" }
// Unix/Mac
{ "path": "/home/john/documents/report.pdf" }
{ "path": "/Users/john/Documents/report.pdf" }
// Relative (still works)
{ "path": "documents/report.pdf" }
Other Improvements:
- Filesystem and HTTP access restrictions for safer deployments
- Table extraction with Markdown output
- Updated parser resources for CMaps, fonts, WASM decoders, and color profiles
View Full Changelog
v1.2.0 - Content Ordering
- Y-coordinate based text and image ordering
- Natural reading flow for AI models
- Intelligent line grouping
v1.1.0 - Image Extraction & Performance
- Base64-encoded image extraction
- 10x speedup with parallel processing
- Comprehensive test coverage
Plan PDF extraction before running a heavier read. This is useful for agents that need to choose between metadata review, citation-ready extraction, mixed PDF handling, or OCR-capable workflows.
| Parameter | Type | Description | Default |
|---|---|---|---|
| sources | Array | List of PDF sources to inspect | Required |
| sample_pages | number | Maximum pages to sample per source, capped at 20 | 5 |
| include_metadata | boolean | Include PDF metadata and info objects | true |
| Field | Description |
|---|---|
| profile | digital_text, scanned_or_image_only, mixed_text_and_scan, low_text_or_form, or unknown |
| sampled_pages | Pages used for the bounded inspection sample |
| page_signals | Text chars, text items, token estimate, image paint operations, and scan/low-text flags |
| document_signals | Outline, labels, permissions, forms, attachments, and structure-tree availability |
| recommendation | Suggested workflow, OCR need, reason, and ready-to-use read_pdf arguments |
| provider_status | Safe readiness metadata for optional ocr_pages and analyze_regions providers without command paths |
Render selected pages as PNG visual evidence. This gives agents a page image they can inspect or route to OCR/vision workflows while keeping binary content out of the JSON summary.
| Parameter | Type | Description | Default |
|---|---|---|---|
| sources | Array | List of PDF sources to render | Required |
| scale | number | Render scale relative to PDF points, from 0.25 to 4 | 2 |
| max_pages | number | Maximum pages to render per source, capped at 20 | 5 |
| max_pixels_per_page | number | Maximum rendered pixels per page, capped at 64MP | 16000000 |
| include_image | boolean | Return PNG pages as MCP image parts | true |
{
"sources": [{ "path": "report.pdf", "pages": "1-2" }],
"scale": 2,
"max_pages": 2
}
The first content part is JSON metadata with profile: "page_render_evidence".
Rendered PNG data is returned as subsequent MCP image parts and referenced by
image_content_index.
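The scale and max_pixels_per_page parameters interact: scale converts PDF points into pixels, and the pixel budget bounds the result. The sketch below shows the arithmetic under one assumed policy, reducing the scale until the page fits; whether the server reduces the scale or rejects the request is not stated here, and `planRender` is a hypothetical helper.

```typescript
interface RenderPlan { scale: number; width: number; height: number; pixels: number; }

// Page size is in PDF points; scale multiplies points into pixels.
function planRender(
  widthPt: number,
  heightPt: number,
  requestedScale = 2,
  maxPixels = 16000000,
): RenderPlan {
  // Keep the requested scale inside the documented 0.25 to 4 range.
  const clamped = Math.min(4, Math.max(0.25, requestedScale));
  // Largest scale whose rendered area stays within the pixel budget.
  const budgetScale = Math.sqrt(maxPixels / (widthPt * heightPt));
  // Assumed policy: shrink the scale to fit rather than fail.
  const scale = Math.min(clamped, budgetScale);
  const width = Math.floor(widthPt * scale);
  const height = Math.floor(heightPt * scale);
  return { scale, width, height, pixels: width * height };
}
```

A US Letter page (612 by 792 points) at scale 2 renders to 1224 by 1584 pixels, about 1.9 megapixels, well under the 16MP default.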
Search extracted PDF text using bounded literal matching and return evidence that agents can cite or route into visual tools.
| Parameter | Type | Description | Default |
|---|---|---|---|
| sources | Array | List of PDF sources to search | Required |
| query | string | Literal text query to search for | Required |
| case_sensitive | boolean | Use case-sensitive matching | false |
| whole_word | boolean | Match only whole words using ASCII word boundaries | false |
| max_pages | number | Maximum pages to search per source, capped at 1000 | 100 |
| max_matches_per_source | number | Maximum matches returned per source, capped at 500 | 50 |
| context_chars | number | Context characters around each match, capped at 1000 | 120 |
{
"sources": [{ "path": "report.pdf", "pages": "1-20" }],
"query": "risk controls",
"whole_word": true,
"max_matches_per_source": 10
}
The first content part is JSON metadata with profile: "pdf_search_results".
Matches include page number, matched text, snippet, match offsets, text-item
index, optional text-item bounding box, and provenance. Search uses literal
matching only; request payloads do not accept arbitrary regular expressions.
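Bounded literal matching with ASCII word boundaries can be sketched with `indexOf` and a character test, with no regular expression built from the query. The function below is an illustration under assumptions: `searchLiteral`, its option names, and the match shape are hypothetical, and the defaults simply mirror the table above.

```typescript
interface SearchOptions {
  caseSensitive?: boolean;
  wholeWord?: boolean;
  maxMatches?: number;
  contextChars?: number;
}
interface SearchMatch { start: number; end: number; snippet: string; }

const isAsciiWordChar = (ch: string): boolean => /^[A-Za-z0-9_]$/.test(ch);

function searchLiteral(text: string, query: string, options: SearchOptions = {}): SearchMatch[] {
  const { caseSensitive = false, wholeWord = false, maxMatches = 50, contextChars = 120 } = options;
  if (query.length === 0) return [];
  // Lowercasing is assumed to preserve string length, which holds for ASCII text.
  const haystack = caseSensitive ? text : text.toLowerCase();
  const needle = caseSensitive ? query : query.toLowerCase();
  const matches: SearchMatch[] = [];
  let from = 0;
  while (matches.length < maxMatches) {
    const start = haystack.indexOf(needle, from);
    if (start === -1) break;
    const end = start + needle.length;
    from = start + 1;
    const leftOk = start === 0 || !isAsciiWordChar(haystack.charAt(start - 1));
    const rightOk = end === haystack.length || !isAsciiWordChar(haystack.charAt(end));
    if (wholeWord && !(leftOk && rightOk)) continue;
    const snippet = text.slice(Math.max(0, start - contextChars), end + contextChars);
    matches.push({ start, end, snippet });
  }
  return matches;
}
```

Because the query is never compiled into a pattern, characters such as `.` or `(` in the query match only themselves.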
Crop selected PDF-coordinate page regions as PNG visual evidence. This is useful when an agent has bounding boxes from the document map, table detector, or downstream layout workflow and needs focused source evidence.
| Parameter | Type | Description | Default |
|---|---|---|---|
| sources | Array | List of PDF sources with regions to crop | Required |
| scale | number | Render scale used before cropping, from 0.25 to 4 | 2 |
| max_regions | number | Maximum regions to crop per source, capped at 100 | 20 |
| max_pixels_per_page | number | Maximum rendered pixels per page before cropping, capped at 64MP | 16000000 |
| include_image | boolean | Return cropped regions as MCP image parts | true |
Each region uses PDF coordinates:
{
"id": "figure-1",
"page": 1,
"bounding_box": { "left": 72, "bottom": 420, "right": 540, "top": 620 },
"padding": 8
}
The first content part is JSON metadata with profile: "region_crop_evidence". Cropped PNG data is returned as subsequent MCP image
parts and referenced by image_content_index.
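Mapping a PDF-coordinate region onto a rendered page means flipping the y axis, since PDF coordinates conventionally grow upward from the bottom-left corner while image rows grow downward from the top. The sketch below shows that conversion under stated assumptions: an unrotated page whose origin is at (0, 0), padding measured in PDF points, and a hypothetical `toPixelRect` helper.

```typescript
interface PdfBox { left: number; bottom: number; right: number; top: number; }
interface PixelRect { x: number; y: number; width: number; height: number; }

function toPixelRect(
  box: PdfBox,
  pageWidthPt: number,
  pageHeightPt: number,
  scale: number,
  padding = 0, // assumed to be in PDF points
): PixelRect {
  // Grow the box by the padding, then clamp it to the page.
  const left = Math.max(0, box.left - padding);
  const right = Math.min(pageWidthPt, box.right + padding);
  const bottom = Math.max(0, box.bottom - padding);
  const top = Math.min(pageHeightPt, box.top + padding);
  // Flip y: the PDF "top" edge becomes the smaller pixel row.
  const x = Math.floor(left * scale);
  const y = Math.floor((pageHeightPt - top) * scale);
  const xEnd = Math.ceil(right * scale);
  const yEnd = Math.ceil((pageHeightPt - bottom) * scale);
  return { x, y, width: xEnd - x, height: yEnd - y };
}
```

For the example region above on a 612 by 792 point page at scale 2 with padding 8, this gives a crop starting at pixel (128, 328) that is 968 pixels wide and 432 pixels tall.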
Analyze selected PDF-coordinate page regions with a configured local provider. This is useful for visual table recognition, chart-to-data enrichment, formula recognition, figure descriptions, and image captions while keeping every result linked to a crop evidence ID.
| Parameter | Type | Description | Default |
|---|---|---|---|
| sources | Array | List of PDF sources with regions to analyze | Required |
| scale | number | Render scale used before cropping and analysis, from 0.25 to 4 | 2 |
| max_regions | number | Maximum regions to analyze per source, capped at 100 | 20 |
| max_pixels_per_page | number | Maximum rendered pixels per page before cropping, capped at 64MP | 16000000 |
| timeout_ms | number | Timeout per analyzed region in milliseconds, capped at 300000 | 60000 |
| max_output_chars | number | Maximum provider output characters returned per region | 200000 |
| languages | string[] | Optional language tags passed to the configured provider | - |
| Variable | Description |
|---|---|
| MCP_PDF_REGION_ANALYSIS_COMMAND | Absolute or PATH-resolved command used for visual region analysis. Required to enable analyze_regions. |
| MCP_PDF_REGION_ANALYSIS_ARGS_JSON | Optional JSON string array of command arguments. Must include {input} and may also use {page}, {source}, {region_id}, {evidence_id}, {left}, {bottom}, {right}, {top}, {language}, and {languages} placeholders. Defaults to ["{input}"]. |
Provider stdout may be plain text or JSON:
{
"kind": "table",
"description": "Quarterly revenue table",
"text": "Q1 revenue...",
"markdown": "| Quarter | Revenue |",
"confidence": 0.91,
"table": {
"rows": [["Quarter", "Revenue"], ["Q1", "$1.2M"]],
"confidence": 0.9
},
"formula": {
"latex": "E = mc^2",
"confidence": 0.82
},
"chart": {
"title": "Revenue by quarter",
"summary": "Revenue rises across the period.",
"data_points": [{ "label": "Q1", "value": 1.2 }],
"confidence": 0.78
},
"warnings": ["Low contrast axis labels"]
}

The first content part is JSON metadata with profile: "region_analysis". Each analysis includes source_crop_evidence_id, source bounding box, crop pixel bounds, scale, provider, provenance, and normalized fields supplied by the local provider. The request cannot select an executable.
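
To make the argument placeholders more concrete, here is a minimal sketch of how a template such as the value of MCP_PDF_REGION_ANALYSIS_ARGS_JSON might be expanded into an argument list. The function `expandArgs`, the `--image`, `--page`, and `--id` flags, and the choice to leave unknown placeholders untouched are all assumptions made for illustration; this is not the server's actual substitution code.

```typescript
// Hypothetical helper; the server's real substitution logic may differ.
function expandArgs(template: string[], values: Record<string, string>): string[] {
  // The documentation states that the template must include {input}.
  if (!template.some((arg) => arg.includes("{input}"))) {
    throw new Error("argument template must include {input}");
  }
  return template.map((arg) =>
    arg.replace(/\{([a-z_]+)\}/g, (match: string, name: string) => {
      const value = values[name];
      // Assumption: placeholders without a value are left as-is.
      return value === undefined ? match : value;
    }),
  );
}

// Example template with made-up provider flags.
const args = expandArgs(
  ["--image", "{input}", "--page={page}", "--id", "{region_id}"],
  { input: "/tmp/crop-1.png", page: "1", region_id: "figure-1" },
);
console.log(args);
```

Keeping the arguments as an array, rather than joining them into one shell string, means a substituted value stays a single argument even if it contains spaces.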
Run selected rendered pages through a configured local OCR provider and return a normalized OCR text layer. The provider is configured through environment variables so an MCP request cannot choose arbitrary commands.
| Parameter | Type | Description | Default |
|---|---|---|---|
| `sources` | Array | List of PDF sources to OCR | Required |
| `scale` | number | Render scale used before OCR, from 0.25 to 4 | 2 |
| `max_pages` | number | Maximum pages to OCR per source, capped at 20 | 5 |
| `max_pixels_per_page` | number | Maximum rendered pixels per page before OCR, capped at 64MP | 16000000 |
| `timeout_ms` | number | Timeout per OCR page in milliseconds, capped at 300000 | 60000 |
| `max_output_chars` | number | Maximum OCR text characters returned per page | 200000 |
| `languages` | string[] | Optional OCR language tags passed to the configured provider | - |

| Variable | Description |
|---|---|
| `MCP_PDF_OCR_PRESET` | Optional built-in command template. Supported value: tesseract. |
| `MCP_PDF_OCR_COMMAND` | Absolute or PATH-resolved command used for OCR. Required unless MCP_PDF_OCR_PRESET is set. Overrides the preset command when both are set. |
| `MCP_PDF_OCR_ARGS_JSON` | Optional JSON string array of command arguments. Must include {input} and may also use {page}, {source}, {language}, {languages}, and {languages_tesseract} placeholders. Defaults to the preset template or ["{input}"]. |

Provider stdout may be plain text or JSON:
{
"text": "Recognized text",
"confidence": 0.93,
"language": "eng",
"words": [{
"text": "Recognized",
"confidence": 0.95,
"bounding_box": { "left": 10, "bottom": 20, "right": 90, "top": 40 }
}]
}

The first content part is JSON metadata with profile: "ocr_text_layer". OCR results reference the render evidence ID used to create each temporary page image. The default package does not bundle an OCR model or call a cloud OCR service.
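
Since provider stdout may be plain text or JSON, a consumer has to decide which one it received. The sketch below shows one way that decision could be made: treat the output as JSON only when it parses to an object with a string `text` field, and fall back to plain text otherwise. The `parseProviderStdout` function and the `OcrResult` type are hypothetical, and the truncation to `maxOutputChars` only loosely mirrors the max_output_chars parameter; the server's real normalization is not shown here.

```typescript
// Hypothetical result shape covering a subset of the JSON fields shown above.
interface OcrResult {
  text: string;
  confidence?: number;
  language?: string;
}

// Sketch: accept JSON when it parses to an object with a string `text`
// field, otherwise treat the whole output as plain text.
function parseProviderStdout(stdout: string, maxOutputChars = 200000): OcrResult {
  const trimmed = stdout.trim();
  if (trimmed.startsWith("{")) {
    try {
      const parsed: unknown = JSON.parse(trimmed);
      if (typeof parsed === "object" && parsed !== null) {
        const record = parsed as Record<string, unknown>;
        const text = record["text"];
        if (typeof text === "string") {
          const result: OcrResult = { text: text.slice(0, maxOutputChars) };
          const confidence = record["confidence"];
          const language = record["language"];
          if (typeof confidence === "number") result.confidence = confidence;
          if (typeof language === "string") result.language = language;
          return result;
        }
      }
    } catch {
      // Not valid JSON; fall through to the plain-text path.
    }
  }
  return { text: trimmed.slice(0, maxOutputChars) };
}

console.log(parseProviderStdout('{"text":"Recognized text","confidence":0.93,"language":"eng"}'));
console.log(parseProviderStdout("Recognized text from a plain-text provider\n"));
```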
The extraction tool that handles PDF content, structure, citations, images, tables, and document signals.
| Parameter | Type | Description | Default |
|---|---|---|---|
| `sources` | Array | List of PDF sources to process | Required |
| `include_full_text` | boolean | Extract full text content | false |
| `include_metadata` | boolean | Extract PDF metadata | true |
| `include_page_count` | boolean | Include total page count | true |
| `include_images` | boolean | Extract embedded images | false |
| `include_tables` | boolean | Detect tables with rows, cell metadata, confidence, quality diagnostics, inferred spans, continuation candidates, and best-effort geometry | false |
| `include_document_map` | boolean | Include an agent document map that links pages, elements, chunks, layout diagnostics, safety findings, routing signals, and page geometry | false |
| `include_document_ast` | boolean | Include a semantic document AST with page, section, paragraph, list item, table, and image nodes linked to element/chunk evidence | false |
| `include_trust_report` | boolean | Include a consolidated trust report for content safety, layout uncertainty, sparse/scanned pages, table quality, and external links | false |
| `include_accessibility_report` | boolean | Include a deterministic accessibility report for tagged-PDF coverage, structure trees, headings, images, forms, links, and accessibility permissions | false |
| `include_elements` | boolean | Include structured document elements for agent workflows | false |
| `include_semantic_hints` | boolean | Include deterministic heading/list/paragraph hints on text elements | false |
| `include_markdown` | boolean | Include page-aware Markdown for RAG and summarization | false |
| `include_html` | boolean | Include escaped page-aware HTML for preview/export workflows | false |
| `include_chunks` | boolean | Include page, semantic, size, and table chunks with source references | false |
| `include_text_layer` | boolean | Include line and word records with page-level character ranges, best-effort bounding boxes, and provenance | false |
| `include_layout_diagnostics` | boolean | Include page layout profiles, reading-order confidence, column signals, and warnings | false |
| `include_outline` | boolean | Include PDF outline/bookmarks when available | false |
| `include_annotations` | boolean | Include safe annotation summaries for selected pages | false |
| `include_page_labels` | boolean | Include PDF page labels when available | false |
| `include_page_geometry` | boolean | Include page viewport geometry and PDF view boxes | false |
| `include_permissions` | boolean | Include permission labels and mark info when available | false |
| `include_structure_tree` | boolean | Include tagged PDF structure trees for selected pages when available | false |
| `include_form_fields` | boolean | Include PDF form field summaries when available | false |
| `include_attachments` | boolean | Include embedded attachment metadata without attachment bytes | false |
| `include_safety_findings` | boolean | Include deterministic content safety findings for agent workflows | false |

{
path?: string; // Local file path (absolute or relative)
url?: string; // HTTP/HTTPS URL to PDF
pages?: string | number[]; // Pages to extract: "1-5,10" or [1,2,3]
}

Metadata only (fast):
{
"sources": [{ "path": "large.pdf" }],
"include_metadata": true,
"include_page_count": true,
"include_full_text": false
}

From URL:
{
"sources": [{
"url": "https://arxiv.org/pdf/2301.00001.pdf"
}],
"include_full_text": true
}

Page ranges:
{
"sources": [{
"path": "manual.pdf",
"pages": "1-5,10-15,20" // Pages 1,2,3,4,5,10,11,12,13,14,15,20
}]
}

Structured elements:
{
"sources": [{ "path": "report.pdf", "pages": "1-3" }],
"include_elements": true,
"include_metadata": true
}

Elements are designed for agent workflows that need stable page references, provenance, and best-effort coordinates for citation-ready downstream processing.

Agent document map:
{
"sources": [{ "path": "report.pdf", "pages": "1-5" }],
"include_document_map": true,
"include_full_text": false
}

The document map is designed for agents that need one navigable structure for pages, elements, chunks, layout confidence, safety findings, routing signals, and page geometry without embedding image bytes in JSON.
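
The `pages` strings used in the examples above, such as "1-5,10-15,20", expand to explicit page numbers. The following sketch shows one way to perform that expansion on the client side, for instance to validate a range before sending a request. `parsePages` is a hypothetical helper, it assumes 1-based page numbers as in the comment in the page-range example, and it does not claim to match the server's own parser (it does not remove duplicates, for example).

```typescript
// Hypothetical parser for the `pages` field; not the server's implementation.
function parsePages(pages: string | number[]): number[] {
  if (Array.isArray(pages)) return [...pages];
  const result: number[] = [];
  for (const part of pages.split(",")) {
    // Accept "N" or "N-M", with optional surrounding whitespace.
    const match = /^\s*(\d+)(?:\s*-\s*(\d+))?\s*$/.exec(part);
    if (!match) throw new Error(`invalid page range: "${part}"`);
    const start = Number(match[1]);
    const end = match[2] === undefined ? start : Number(match[2]);
    // Assumption: pages are 1-based and ranges must be ascending.
    if (start < 1 || end < start) throw new Error(`invalid page range: "${part}"`);
    for (let page = start; page <= end; page++) result.push(page);
  }
  return result;
}

console.log(parsePages("1-5,10-15,20"));
```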
Column-Aware Content Ordering
Content is returned in natural reading order using Y-coordinates plus deterministic column segmentation:
Document Layout:
┌─────────────────────┐
│ [Title]     Y:100   │
│ [Image]     Y:150   │
│ [Text]      Y:400   │
│ [Photo A]   Y:500   │
│ [Photo B]   Y:550   │
└─────────────────────┘
Response Order:
[
{ type: "text", text: "Title..." },
{ type: "image", data: "..." },
{ type: "text", text: "..." },
{ type: "image", data: "..." },
{ type: "image", data: "..." }
]
Benefits:
- AI understands spatial relationships
- Natural document comprehension
- Perfect for vision-enabled models
- Automatic multi-line text grouping
- Better ordering for common two-column PDFs
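
For intuition only, the ordering idea can be sketched as a sort over positioned items. The example below combines Y-coordinate ordering with a deliberately naive two-column split at the page midline. It assumes that Y grows down the page, as the diagram above suggests, and that each item has a single `x` position. The `Item` type and `orderItems` function are hypothetical; the server's deterministic column segmentation is presumably more involved (a full-width title, for instance, is not handled here).

```typescript
// Hypothetical positioned item; `x` and `y` are assumed page coordinates.
interface Item {
  type: "text" | "image";
  label: string;
  x: number;
  y: number;
}

// Naive sketch: assign each item to the left or right column at the page
// midline, then read the left column top to bottom before the right one.
function orderItems(items: Item[], pageWidth: number): Item[] {
  const column = (item: Item): number => (item.x < pageWidth / 2 ? 0 : 1);
  return [...items].sort(
    (a, b) => column(a) - column(b) || a.y - b.y || a.x - b.x,
  );
}

const ordered = orderItems(
  [
    { type: "text", label: "Right column text", x: 350, y: 120 },
    { type: "image", label: "Figure", x: 50, y: 150 },
    { type: "text", label: "Title", x: 50, y: 100 },
  ],
  600,
);
console.log(ordered.map((item) => item.label));
```

With a single column, this reduces to a plain sort by Y, which matches the response order shown above.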
Image Extraction
Enable extraction:
{
"sources": [{ "path": "manual.pdf" }],
"include_images": true
}

Response format:
{
"images": [{
"page": 1,
"index": 0,
"width": 1920,
"height": 1080,
"format": "rgb",
"data": "base64-encoded-png..."
}]
}

Supported formats: RGB, RGBA, Grayscale
Auto-detected: JPEG, PNG, and other embedded formats
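
A client that receives the response above will usually want to turn the base64 `data` field back into bytes. The sketch below does that with Node's `Buffer` and adds a simple PNG signature check, on the assumption (suggested by the "base64-encoded-png" placeholder) that the payload is PNG. `ExtractedImage`, `decodeImage`, `looksLikePng`, and `imageFileName` are hypothetical names used only for this example.

```typescript
import { Buffer } from "node:buffer";

// Hypothetical type mirroring the response example above.
interface ExtractedImage {
  page: number;
  index: number;
  width: number;
  height: number;
  format: string;
  data: string; // base64-encoded image bytes
}

// The eight-byte signature that starts every PNG file.
const PNG_SIGNATURE = [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a];

function decodeImage(image: ExtractedImage): Buffer {
  return Buffer.from(image.data, "base64");
}

function looksLikePng(bytes: Buffer): boolean {
  return bytes.length >= 8 && PNG_SIGNATURE.every((byte, i) => bytes[i] === byte);
}

// Assumed naming scheme for saving images to disk.
function imageFileName(image: ExtractedImage): string {
  return `page-${image.page}-image-${image.index}.png`;
}

const example: ExtractedImage = {
  page: 1,
  index: 0,
  width: 1,
  height: 1,
  format: "rgb",
  data: Buffer.from(PNG_SIGNATURE).toString("base64"), // signature only, not a full image
};
console.log(imageFileName(example), looksLikePng(decodeImage(example)));
```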
Path Configuration
Absolute paths (v1.3.0+) - Direct file access:
{ "path": "C:\\Users\\John\\file.pdf" }
{ "path": "/home/user/file.pdf" }

Relative paths - Workspace files:
{ "path": "docs/report.pdf" }
{ "path": "./2024/Q1.pdf" }

Configure working directory:
{
"mcpServers": {
"pdf-reader-mcp": {
"command": "npx",
"args": ["@sylphx/pdf-reader-mcp"],
"cwd": "/path/to/documents"
}
}
}

Large PDF Strategies
Strategy 1: Page ranges
{ "sources": [{ "path": "big.pdf", "pages": "1-20" }] }

Strategy 2: Progressive loading
// Step 1: Get page count
{ "sources": [{ "path": "big.pdf" }], "include_full_text": false }
// Step 2: Extract sections
{ "sources": [{ "path": "big.pdf", "pages": "50-75" }] }

Strategy 3: Parallel batching
{
"sources": [
{ "path": "big.pdf", "pages": "1-50" },
{ "path": "big.pdf", "pages": "51-100" }
]
}

By default the server can read any local file the host process can access and fetch any HTTP(S) URL. When running outside a sandbox, restrict it to a specific working set.

Use --allow-dir (repeatable) or the MCP_PDF_ALLOWED_DIRS environment variable (entries separated by : or ,). Once set, all path sources must resolve inside one of the allowed directories: relative paths, absolute paths, and .. traversal are all checked after resolution.
# CLI flags
npx @sylphx/pdf-reader-mcp --allow-dir=/srv/pdfs --allow-dir=/data/reports
# Environment
MCP_PDF_ALLOWED_DIRS="/srv/pdfs:/data/reports" npx @sylphx/pdf-reader-mcp

{
"mcpServers": {
"pdf-reader": {
"command": "npx",
"args": ["@sylphx/pdf-reader-mcp", "--allow-dir=/srv/pdfs"]
}
}
}

# Block all URL sources
npx @sylphx/pdf-reader-mcp --no-http
MCP_PDF_ALLOW_HTTP=false npx @sylphx/pdf-reader-mcp
# Allowlist hosts (everything else rejected)
npx @sylphx/pdf-reader-mcp --allow-host=cdn.example.com --allow-host=files.internal
MCP_PDF_ALLOWED_HOSTS="cdn.example.com,files.internal" npx @sylphx/pdf-reader-mcp

| Setting | CLI flag | Environment variable | Default |
|---|---|---|---|
| Filesystem allowlist | `--allow-dir=<path>` (repeatable) | `MCP_PDF_ALLOWED_DIRS` (`:` or `,` separated) | unrestricted |
| Disable HTTP | `--no-http` | `MCP_PDF_ALLOW_HTTP=false` | enabled |
| HTTP host allowlist | `--allow-host=<host>` (repeatable) | `MCP_PDF_ALLOWED_HOSTS` (`,` separated) | any host |

Denied requests fail fast with an Access denied error before any disk read or network call.
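
The "checked after resolution" rule is easiest to see in code. The sketch below resolves a requested path first and only then tests whether it lies inside an allowed directory, so that `..` segments and sibling directories with a shared prefix cannot slip through. A matching host check is included for URL sources. Both `isInsideAllowedDirs` and `isHostAllowed` are hypothetical helpers written for this illustration; they do not resolve symlinks, they assume exact host matching, and they are not the server's actual implementation.

```typescript
import * as path from "node:path";

// Sketch: resolve the candidate against the working directory, then check
// that the resolved path is the allowed directory itself or lies beneath it.
function isInsideAllowedDirs(candidate: string, allowedDirs: string[], cwd: string): boolean {
  const resolved = path.resolve(cwd, candidate);
  return allowedDirs.some((dir) => {
    const relative = path.relative(path.resolve(dir), resolved);
    const escapes =
      relative === ".." ||
      relative.startsWith(".." + path.sep) ||
      path.isAbsolute(relative);
    return !escapes;
  });
}

// Sketch: an empty allowlist means any host, mirroring the documented default.
// Assumption: hosts are compared by exact, case-insensitive match.
function isHostAllowed(url: string, allowedHosts: string[]): boolean {
  const host = new URL(url).hostname.toLowerCase();
  return allowedHosts.length === 0 || allowedHosts.some((h) => h.toLowerCase() === host);
}

console.log(isInsideAllowedDirs("reports/q1.pdf", ["/srv/pdfs"], "/srv/pdfs"));
console.log(isInsideAllowedDirs("../secrets/key.pdf", ["/srv/pdfs"], "/srv/pdfs"));
console.log(isHostAllowed("https://cdn.example.com/a.pdf", ["cdn.example.com"]));
```

Comparing with `path.relative` rather than a string prefix avoids treating `/srv/pdfs-other` as if it were inside `/srv/pdfs`.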
Solution: Upgrade to v1.3.0+
npm update @sylphx/pdf-reader-mcp

Restart your MCP client completely.
Causes:
- File doesn't exist at path
- Wrong working directory
- Permission issues
Solutions:
Use absolute path:
{ "path": "C:\\Full\\Path\\file.pdf" }

Or configure cwd:
{
"pdf-reader-mcp": {
"command": "npx",
"args": ["@sylphx/pdf-reader-mcp"],
"cwd": "/path/to/docs"
}
}

Solution:
npm cache clean --force
rm -rf node_modules package-lock.json
npm install @sylphx/pdf-reader-mcp@latest

Restart your MCP client completely.
By default, PDF Reader MCP uses stdio transport for local use. You can also run it as an HTTP server for remote access from multiple machines.
# Run as HTTP server on port 8080
MCP_TRANSPORT=http npx @sylphx/pdf-reader-mcp

| Variable | Default | Description |
|---|---|---|
| `MCP_TRANSPORT` | stdio | Transport type: stdio or http |
| `MCP_HTTP_PORT` | 8080 | HTTP server port |
| `MCP_HTTP_HOST` | 0.0.0.0 | HTTP server hostname |
| `MCP_API_KEY` | - | Optional API key for authentication |
| `MCP_PDF_OCR_PRESET` | - | Optional OCR preset. Supported value: tesseract |
| `MCP_PDF_OCR_COMMAND` | - | Optional local OCR command used by ocr_pages |
| `MCP_PDF_OCR_ARGS_JSON` | ["{input}"] | Optional JSON string array of OCR command arguments. Must include {input}. |
| `MCP_PDF_REGION_ANALYSIS_COMMAND` | - | Optional local visual-region analysis command used by analyze_regions |
| `MCP_PDF_REGION_ANALYSIS_ARGS_JSON` | ["{input}"] | Optional JSON string array of region analysis command arguments. Must include {input}. |

FROM oven/bun:1
WORKDIR /app
RUN bun add @sylphx/pdf-reader-mcp
ENV MCP_TRANSPORT=http
ENV MCP_HTTP_PORT=8080
EXPOSE 8080
CMD ["bun", "node_modules/@sylphx/pdf-reader-mcp/dist/index.js"]

{
"servers": {
"pdf-reader": {
"type": "http",
"url": "https://your-server.com/mcp",
"headers": {
"X-API-Key": "your-api-key"
}
}
}
}

| Endpoint | Method | Description |
|---|---|---|
| `/mcp` | POST | JSON-RPC endpoint |
| `/mcp/health` | GET | Health check |

| Component | Technology |
|---|---|
| Runtime | Node.js 22+ ESM |
| PDF Engine | PDF.js (Mozilla) |
| Validation | Vex + JSON Schema |
| Protocol | MCP SDK |
| Language | TypeScript (strict) |
| Testing | Bun test suite |
| Quality | Biome (50x faster) |
| CI/CD | GitHub Actions |

- Security First - Flexible paths with secure defaults
- Simple Interface - One tool, all operations
- Performance - Parallel processing, efficient memory
- Reliability - Per-page isolation, detailed errors
- Quality - Automated tests, strict TypeScript, and CI validation
- Type Safety - No `any` types, strict mode
- Backward Compatible - Smooth upgrades always
Setup & Scripts
Prerequisites:
- Node.js >= 22.13.0 (required by pdfjs-dist v6)
- Bun (this repo uses `bun@1.3.1`)
Setup:
git clone https://github.com/SylphxAI/pdf-reader-mcp.git
cd pdf-reader-mcp
bun install && bun run build

Scripts:
bun run build # Build with bunup
bun test # Run the test suite
bun run test:cov # Run coverage
bun run check # Lint + format
bun run check:fix # Auto-fix
bun run benchmark # Reproducible local performance benchmark

Quality:
- Automated tests
- Coverage reporting
- Strict TypeScript
- Zero lint errors
Contributing
Quick Start:
- Fork repository
- Create branch: `git checkout -b feature/awesome`
- Make changes and run the tests: `bun test`
- Format: `bun run check:fix`
- Commit: Use Conventional Commits
- Open PR
Commit Format:
feat(images): add WebP support
fix(paths): handle UNC paths
docs(readme): update examples
See CONTRIBUTING.md
- Full Docs - Complete guides
- Getting Started - Quick start
- API Reference - Detailed API
- Design - Architecture
- Performance - Benchmarks
- Comparison - vs. alternatives
Completed
- Image extraction (v1.1.0)
- 5-10x parallel speedup (v1.1.0)
- Y-coordinate ordering (v1.2.0)
- Absolute paths (v1.3.0)
- Table extraction
- Structured element output
- Semantic document AST
- PDF trust report
- PDF accessibility report
- Table quality diagnostics, inferred cell spans, and continuation candidates
- Markdown rendering
- Citation-ready page, semantic, size, and table chunks
- MCP-native PDF search with snippets and bbox provenance
- Outlines, annotations, structure trees, form fields, attachment metadata, page labels, and permission signals
- Column-aware ordering for common multi-column PDFs
- Layout diagnostics with reading-order confidence
- Configured local OCR provider for scanned-page text layers
- Tesseract OCR provider preset without bundling OCR model assets
- Configured local visual region analysis provider for table, chart, formula, figure, and image-description enrichment
- Quality evals for semantic chunks, table ordering, renderers, and safety findings
- Filesystem and HTTP access restrictions
Next
- Richer semantic layout detection
- Fixture-backed OCR and visual-region accuracy benchmarks
- Engine-specific visual region provider presets
- Optional advanced parser engines
- 100+ MB streaming
- Advanced caching
Vote at Discussions
Local-first • Agent-ready • Battle-tested
- Bug Reports
- Discussions
- Documentation
- Email
Show Your Support: Star • Watch • Report bugs • Suggest features • Contribute
CI-backed quality • Structured extraction • Production ready
MIT © Sylphx
Special thanks to the open source community ❤️
This project uses the following @sylphx packages:
- @sylphx/mcp-server-sdk - MCP server framework
- @sylphx/vex - Schema validation
- @sylphx/biome-config - Biome configuration
- @sylphx/tsconfig - TypeScript configuration