Universal document parsing library for PDF, Word, and Excel files
ParseFlow is a comprehensive document parsing solution that supports PDF, Word (docx), and Excel (xlsx/xls) files. It provides both a standalone library and an MCP (Model Context Protocol) server for AI assistants.
Chinese documentation | Examples | GitHub
PDF:
- Text extraction with multiple strategies (raw, formatted, clean)
- Page-specific and range-based extraction
- Metadata retrieval (title, author, dates, page count)
- Full-text search with context
- Image extraction (placeholder)
- Table of contents (TOC) extraction (placeholder)
Word:
- Text extraction
- HTML conversion
- Metadata retrieval
- Text search with context
Excel:
- Multi-sheet data extraction
- Multiple output formats (JSON, CSV, Text)
- Sheet-specific extraction
- Cell-based search
- Range extraction
- Workbook metadata
MCP server:
- 9 tools for AI assistants (5 PDF + 2 Word + 2 Excel)
- Works with Claude Desktop and other MCP clients
- Path security with allowlist support
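The feature lists above mention search "with context" for PDF and Word documents. As a rough illustration of what that can mean, the sketch below finds every occurrence of a keyword in already-extracted text and returns it with a few surrounding characters. `searchWithContext`, `ContextMatch`, and the `radius` parameter are hypothetical names invented for this example; they are not part of parseflow-core, and the library's own result shape may differ.

```typescript
// Illustrative only: a hypothetical helper, not part of parseflow-core.
interface ContextMatch {
  index: number;   // character offset of the match in the text
  context: string; // the match plus up to `radius` characters on each side
}

function searchWithContext(text: string, keyword: string, radius = 20): ContextMatch[] {
  const matches: ContextMatch[] = [];
  if (keyword.length === 0) return matches;

  // Case-insensitive matching by lowercasing both sides.
  // Assumes lowercasing does not change string length (true for typical text).
  const haystack = text.toLowerCase();
  const needle = keyword.toLowerCase();

  let from = 0;
  while (true) {
    const index = haystack.indexOf(needle, from);
    if (index === -1) break;
    const start = Math.max(0, index - radius);
    const end = Math.min(text.length, index + needle.length + radius);
    matches.push({ index, context: text.slice(start, end) });
    from = index + needle.length; // continue after this match
  }
  return matches;
}

const sample = 'The Q4 budget was approved. Next year the budget grows.';
console.log(searchWithContext(sample, 'budget', 10));
```

Working on plain extracted text keeps the idea format-independent: the same helper would apply to text pulled from a PDF page or a Word document.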
Install the core library:
npm install parseflow-core
Install the MCP server globally:
npm install -g parseflow-mcp-server
Or use with npx:
npx parseflow-mcp-server
PDF parsing:
import { PDFParser } from 'parseflow-core';
const parser = new PDFParser();
// Extract all text
const text = await parser.extractText('document.pdf');
// Extract specific page
const page5 = await parser.extractPage('document.pdf', 5);
// Search
const results = await parser.search('document.pdf', 'keyword');
// Get metadata
const metadata = await parser.getMetadata('document.pdf');
Word parsing:
import { WordParser } from 'parseflow-core';
const parser = new WordParser();
// Extract text
const result = await parser.extractText('report.docx');
console.log(result.text);
// Convert to HTML
const html = await parser.extractHTML('report.docx');
// Search
const matches = await parser.searchText('report.docx', 'budget');
Excel parsing:
import { ExcelParser } from 'parseflow-core';
const parser = new ExcelParser();
// Extract all sheets (JSON format)
const data = await parser.extractData('spreadsheet.xlsx');
// Extract specific sheet
const sales = await parser.extractData('data.xlsx', {
  sheetName: 'Q4 Sales',
  format: 'json'
});
// Search in cells
const results = await parser.searchText('data.xlsx', 'revenue');
Add to claude_desktop_config.json:
{
  "mcpServers": {
    "parseflow": {
      "command": "npx",
      "args": ["-y", "parseflow-mcp-server"],
      "env": {
        "PARSEFLOW_ALLOWED_PATHS": "C:\\Documents;D:\\Projects"
      }
    }
  }
}
PDF tools:
- extract_text - Extract text from PDF files
- search_pdf - Search for keywords in PDF
- get_metadata - Get PDF metadata
- extract_images - Extract images from PDF
- get_toc - Get table of contents
Word tools:
- extract_word - Extract text/HTML from Word documents
- search_word - Search in Word documents
Excel tools:
- extract_excel - Extract data from Excel spreadsheets
- search_excel - Search in Excel cells
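The `PARSEFLOW_ALLOWED_PATHS` value in the configuration above is a semicolon-separated list of directories. The sketch below shows one way such an allowlist check could work; it is an assumption about the general technique, not the server's actual code. `parseAllowedPaths` and `isPathAllowed` are hypothetical helpers, and treating an empty list as "deny everything" is also an assumption.

```typescript
import * as path from 'node:path';

// path.win32 keeps the Windows-style example deterministic on any OS;
// a real server would more likely use the platform default `path`.
const win = path.win32;

// Hypothetical helper: split the semicolon-separated value into absolute roots.
function parseAllowedPaths(raw: string | undefined): string[] {
  return (raw ?? '')
    .split(';')
    .map((entry) => entry.trim())
    .filter((entry) => entry.length > 0)
    .map((entry) => win.resolve(entry));
}

// Hypothetical helper: true when filePath resolves to a location inside an allowed root.
function isPathAllowed(filePath: string, allowedRoots: string[]): boolean {
  const target = win.resolve(filePath); // collapses ".." segments
  return allowedRoots.some((root) => {
    const rel = win.relative(root, target);
    // A relative path that climbs out of the root, or that is absolute
    // (different drive), means the target lies outside the root.
    const escapes = rel === '..' || rel.startsWith('..' + win.sep) || win.isAbsolute(rel);
    return !escapes;
  });
}

const roots = parseAllowedPaths('C:\\Documents;D:\\Projects');
console.log(isPathAllowed('C:\\Documents\\reports\\q4.pdf', roots));
console.log(isPathAllowed('C:\\Documents\\..\\Windows\\system.ini', roots));
```

Comparing resolved paths with `relative` rather than a plain string prefix avoids accepting sibling directories such as `C:\Documents2`. The sketch does not follow symbolic links; a stricter check would resolve them first.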
Example prompts:
"Please read the contents of the report.docx file"
→ Uses extract_word tool
"Find 'Product A' in sales.xlsx"
→ Uses search_excel tool
"Extract the metadata of document.pdf"
→ Uses get_metadata tool
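Behind prompts like these, an MCP client sends the server a JSON-RPC 2.0 `tools/call` request naming the chosen tool. The sketch below builds such a request to show its general shape. `buildToolCall` is a hypothetical helper, and the argument names `path` and `query` are assumptions made for illustration; the real parameter names come from each tool's input schema, which a client can read via `tools/list`.

```typescript
// Shape of a JSON-RPC 2.0 "tools/call" request as used by MCP.
interface ToolCallRequest {
  jsonrpc: '2.0';
  id: number;
  method: 'tools/call';
  params: { name: string; arguments: Record<string, unknown> };
}

let nextId = 1;

// Hypothetical helper: wrap a tool name and its arguments in a request envelope.
function buildToolCall(name: string, args: Record<string, unknown>): ToolCallRequest {
  return {
    jsonrpc: '2.0',
    id: nextId++, // each request gets its own id so responses can be matched
    method: 'tools/call',
    params: { name, arguments: args },
  };
}

// Argument names ("path", "query") are assumed, not taken from ParseFlow's schemas.
const readWord = buildToolCall('extract_word', { path: 'report.docx' });
const findProduct = buildToolCall('search_excel', { path: 'sales.xlsx', query: 'Product A' });

console.log(JSON.stringify(readWord, null, 2));
console.log(JSON.stringify(findProduct, null, 2));
```

In practice the client (for example Claude Desktop) constructs these requests itself; the sketch is only meant to make the prompt-to-tool mapping concrete.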
- Office Examples - Word and Excel usage examples
- Release Guide - How to publish new versions
- Contributing - Contribution guidelines
- Security Policy - Security vulnerability reporting
- Code of Conduct - Community guidelines
ParseFlow/
├── packages/
│   ├── pdf-parser-core/        # Core library (parseflow-core)
│   │   ├── src/
│   │   │   ├── parser.ts       # PDF parser
│   │   │   ├── WordParser.ts   # Word parser
│   │   │   └── ExcelParser.ts  # Excel parser
│   │   └── package.json
│   └── mcp-server/             # MCP server (parseflow-mcp-server)
│       ├── src/
│       │   ├── index.ts        # Server entry
│       │   └── tools/          # MCP tools
│       └── package.json
├── docs/                       # Documentation
├── examples/                   # Usage examples
├── tests/                      # Test files
└── scripts/                    # Build scripts
# Run all tests
pnpm test
# Test coverage
pnpm test:coverage
# Run specific test
pnpm test parser.test.ts- Wordๆต่ฏๆไปถ.docx - Word test document
- Excelๆต่ฏๆไปถ.xlsx - Excel test workbook (3 sheets)
- PDFๆต่ฏๆๆกฃ.pdf - PDF test document
# Install dependencies
pnpm install
# Build all packages
pnpm build
# Watch mode
pnpm dev
# Lint
pnpm lint
# Type check
pnpm type-check
Completed:
- Word (docx) support
- Excel (xlsx/xls) support
- 9 MCP tools
Planned:
- Encrypted PDF support
- OCR text recognition
- PowerPoint (pptx) support
- Batch processing optimization
- Plugin system
- More document formats (CSV, TXT, RTF)
- Advanced table extraction
- Document conversion
We welcome contributions! Please see CONTRIBUTING.md for details.
- Report bugs
- Suggest features
- Improve documentation
- Submit pull requests
| Package | Version | Description |
|---|---|---|
| parseflow-core | 1.0.1 | Core parsing library |
| parseflow-mcp-server | 1.0.2 | MCP server for AI |
- npm Core: https://www.npmjs.com/package/parseflow-core
- npm MCP: https://www.npmjs.com/package/parseflow-mcp-server
- GitHub: https://github.com/Libres-coder/ParseFlow
- Issues: https://github.com/Libres-coder/ParseFlow/issues
- MCP Registry: https://registry.modelcontextprotocol.io/
MIT License - see LICENSE file for details.
- pdf-parse - PDF parsing
- pdf-lib - PDF manipulation
- mammoth - Word document parsing
- xlsx - Excel spreadsheet parsing
- MCP SDK - Model Context Protocol
- Test Coverage: 83%+
- Supported Formats: 3 (PDF, Word, Excel)
- MCP Tools: 9
- Dependencies: Minimal and well-maintained
- Issues: GitHub Issues
- Discussions: GitHub Discussions
Made with ❤️ by Libres-coder
Status: Production Ready (v1.1.0)