Releases: Libres-coder/ParseFlow
ParseFlow v1.1.0 - Office Documents Support
Release Date: 2025-12-03
We're excited to announce ParseFlow v1.1.0, a major feature release that adds Word (docx) and Excel (xlsx/xls) document parsing support! 🎉
🌟 What's New
📝 Word Document Support
ParseFlow now supports Word (.docx) documents with comprehensive parsing capabilities:
- ✅ Text Extraction - Extract plain text from Word documents
- ✅ HTML Conversion - Convert Word documents to HTML
- ✅ Metadata Retrieval - Get document properties and file information
- ✅ Text Search - Search for keywords with context snippets
- ✅ MCP Tools - 2 new tools for AI assistants (extract_word, search_word)
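To make "keywords with context snippets" concrete, here is a minimal sketch of what such a search could look like on already-extracted text. It is not ParseFlow's implementation: the `SnippetMatch` shape, the `searchWithContext` name, and the `contextChars` parameter are all hypothetical.

```typescript
// Hypothetical result shape; not ParseFlow's actual return type.
interface SnippetMatch {
  index: number;   // character offset of the match in the text
  snippet: string; // the match plus surrounding context
}

// Case-insensitive keyword search that returns each hit with
// `contextChars` characters of context on either side.
// Assumes lowercasing does not change string length (true for ASCII text).
function searchWithContext(
  text: string,
  keyword: string,
  contextChars: number = 20
): SnippetMatch[] {
  const matches: SnippetMatch[] = [];
  if (keyword.length === 0) return matches;

  const haystack = text.toLowerCase();
  const needle = keyword.toLowerCase();
  let from = 0;

  while (true) {
    const index = haystack.indexOf(needle, from);
    if (index === -1) break;
    const start = Math.max(0, index - contextChars);
    const end = Math.min(text.length, index + needle.length + contextChars);
    matches.push({ index, snippet: text.slice(start, end) });
    from = index + needle.length; // continue after this match
  }
  return matches;
}
```

A call such as `searchWithContext(text, 'budget', 40)` would return one entry per occurrence, each carrying enough surrounding text for an assistant to quote.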
Example:
import { WordParser } from 'parseflow-core';
const parser = new WordParser();
const result = await parser.extractText('report.docx');
console.log(result.text);
📊 Excel Spreadsheet Support
Full support for Excel (.xlsx/.xls) spreadsheets:
- ✅ Multi-Sheet Extraction - Extract data from all sheets or specific ones
- ✅ Multiple Formats - JSON, CSV, or plain text output
- ✅ Cell Search - Find values across all sheets with cell coordinates
- ✅ Range Extraction - Extract specific cell ranges (e.g., A1:C10)
- ✅ Workbook Metadata - Sheet names, counts, and properties
- ✅ MCP Tools - 2 new tools for AI assistants (extract_excel, search_excel)
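For readers unfamiliar with A1 notation, the following sketch shows how a range such as A1:C10 maps to zero-based row and column indices, and how it could be applied to a sheet held as an array of rows. The helper names (`columnToIndex`, `parseRange`, `sliceRange`) and the `CellRange` shape are illustrative assumptions, not part of ParseFlow's API.

```typescript
// Zero-based, inclusive bounds of a rectangular cell range (hypothetical shape).
interface CellRange {
  startRow: number;
  startCol: number;
  endRow: number;
  endCol: number;
}

// Convert column letters to a zero-based index: A -> 0, Z -> 25, AA -> 26.
function columnToIndex(letters: string): number {
  const upper = letters.toUpperCase();
  let n = 0;
  for (let i = 0; i < upper.length; i++) {
    n = n * 26 + (upper.charCodeAt(i) - 64); // 'A' has char code 65
  }
  return n - 1;
}

// Parse a range like "A1:C10". Only the simple "cell:cell" form is handled.
function parseRange(range: string): CellRange {
  const m = /^([A-Za-z]+)(\d+):([A-Za-z]+)(\d+)$/.exec(range.trim());
  if (!m) throw new Error(`Unsupported range: ${range}`);
  const [c1, r1, c2, r2] = m.slice(1) as [string, string, string, string];
  return {
    startRow: Number(r1) - 1,
    startCol: columnToIndex(c1),
    endRow: Number(r2) - 1,
    endCol: columnToIndex(c2),
  };
}

// Cut the range out of a sheet represented as an array of rows.
function sliceRange<T>(rows: T[][], range: string): T[][] {
  const { startRow, endRow, startCol, endCol } = parseRange(range);
  return rows
    .slice(startRow, endRow + 1)
    .map((row) => row.slice(startCol, endCol + 1));
}
```

SheetJS, the xlsx dependency listed in this release, ships a comparable helper (`XLSX.utils.decode_range`), so a real implementation would likely lean on that rather than hand-rolled parsing.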
Example:
import { ExcelParser } from 'parseflow-core';
const parser = new ExcelParser();
const data = await parser.extractData('spreadsheet.xlsx', {
  sheetName: 'Sales',
  format: 'json'
});
console.log(data);
🛠️ MCP Server Updates
The MCP server now includes 9 tools (up from 5):
PDF Tools (Existing - 5 tools)
- extract_text - Extract text from PDF
- search_pdf - Search in PDF
- get_metadata - Get PDF metadata
- extract_images - Extract images
- get_toc - Get table of contents
Word Tools (New - 2 tools)
- extract_word - Extract text/HTML from Word documents
- search_word - Search in Word documents
Excel Tools (New - 2 tools)
- extract_excel - Extract data from Excel spreadsheets
- search_excel - Search in Excel cells
Usage in Claude Desktop:
"Please read the contents of report.docx"
→ Uses extract_word tool
"Find 'Product A' in sales.xlsx"
→ Uses search_excel tool
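The relationship between file type and tool name can be summarized as a small lookup. In practice the AI assistant chooses a tool from the descriptions the server advertises, so the sketch below only illustrates the mapping; the `toolForFile` helper and the `Intent` type are hypothetical, while the tool names come from the lists above.

```typescript
type Intent = 'extract' | 'search';

// Tool names as listed in these release notes, keyed by file extension.
const TOOLS_BY_EXTENSION: Record<string, Record<Intent, string>> = {
  '.pdf': { extract: 'extract_text', search: 'search_pdf' },
  '.docx': { extract: 'extract_word', search: 'search_word' },
  '.xlsx': { extract: 'extract_excel', search: 'search_excel' },
  '.xls': { extract: 'extract_excel', search: 'search_excel' },
};

// Pick the tool for a file and intent, or undefined for unsupported types.
function toolForFile(fileName: string, intent: Intent): string | undefined {
  const dot = fileName.lastIndexOf('.');
  if (dot === -1) return undefined;
  const ext = fileName.slice(dot).toLowerCase();
  return TOOLS_BY_EXTENSION[ext]?.[intent];
}
```

Under this mapping, the two prompts above resolve to `toolForFile('report.docx', 'extract')` and `toolForFile('sales.xlsx', 'search')`.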
📦 Package Updates
parseflow-core v1.1.0
- New: WordParser class for Word document parsing
- New: ExcelParser class for Excel spreadsheet parsing
- Dependencies: Added mammoth@^1.11.0 and xlsx@^0.18.5
- Updated: Package description now mentions all supported formats
parseflow-mcp-server v1.1.0
- New: 4 additional MCP tools (2 Word + 2 Excel)
- Updated: Server description updated to mention Office documents
- Total: 9 tools serving AI assistants
📚 Documentation
New Documentation
- OFFICE_EXAMPLES.md - Comprehensive guide with examples
  - Word parsing methods (4 approaches)
  - Excel parsing methods (8 approaches)
  - 5 real-world use cases
  - Performance tips and troubleshooting
Updated Documentation
- README.md - Completely rewritten
  - Feature overview for all formats
  - Quick start guides
  - MCP server configuration
  - Project structure
- CHANGELOG.md - v1.1.0 entry added
  - Detailed feature list
  - Breaking changes (none!)
  - Upgrade guide
🧪 Testing
All new features are thoroughly tested:
- ✅ Word Parser: 4/4 tests passing
  - Text extraction
  - Metadata retrieval
  - Text search
  - HTML conversion
- ✅ Excel Parser: 8/8 tests passing
  - Multi-sheet extraction
  - Format conversion (JSON/CSV/Text)
  - Cell search
  - Metadata retrieval
Test Files Included:
- Word测试文件.docx (6 MB)
- Excel测试文件.xlsx (19 KB)
🚀 Installation
npm
# Core library
npm install parseflow-core@1.1.0
# MCP Server (global)
npm install -g parseflow-mcp-server@1.1.0
pnpm
pnpm add parseflow-core@1.1.0
pnpm add -g parseflow-mcp-server@1.1.0
📊 Supported Formats
| Format | Extension | Read | Search | Metadata | Tools |
|---|---|---|---|---|---|
| PDF | .pdf | ✅ | ✅ | ✅ | 5 |
| Word | .docx | ✅ | ✅ | ✅ | 2 |
| Excel | .xlsx/.xls | ✅ | ✅ | ✅ | 2 |
🔧 Dependencies
New Dependencies
- mammoth@^1.11.0 - Word document parsing
- xlsx@^0.18.5 - Excel spreadsheet parsing
Existing Dependencies
- pdf-parse@^1.1.1 - PDF parsing
- pdf-lib@^1.17.1 - PDF manipulation
- @modelcontextprotocol/sdk@^1.0.4 - MCP SDK
🐛 Bug Fixes
- Fixed Excel metadata extraction reliability
- Added null checks for sheet names in Excel parser
- Improved error handling for malformed Office files
- Better error messages for unsupported file types
🧹 Cleanup
- Removed 8 redundant documentation files (~35 KB)
- Simplified PROJECT_STATUS.md
- Improved project organization
- Updated .gitignore for test files
🔄 Upgrade Guide
From v1.0.x
No breaking changes! Simply update:
npm install parseflow-core@latest
npm install -g parseflow-mcp-server@latest
New Features
Import the new parsers:
import { WordParser, ExcelParser } from 'parseflow-core';
For MCP users, the new tools are automatically available after updating.
📖 Examples
Extract Text from Word Document
import { WordParser } from 'parseflow-core';
const parser = new WordParser();
const result = await parser.extractText('document.docx');
console.log(result.text);
Extract Data from Excel
import { ExcelParser } from 'parseflow-core';
const parser = new ExcelParser();
const sheets = await parser.extractData('data.xlsx');
sheets.forEach(sheet => {
  console.log(`${sheet.sheetName}: ${sheet.rowCount} rows`);
});
Search Across Documents
import { WordParser, ExcelParser } from 'parseflow-core';
const wordParser = new WordParser();
const excelParser = new ExcelParser();
// Search in Word
const wordMatches = await wordParser.searchText('report.docx', 'budget');
// Search in Excel
const excelMatches = await excelParser.searchText('data.xlsx', 'revenue');
More examples in OFFICE_EXAMPLES.md!
🌐 Links
- npm Core: https://www.npmjs.com/package/parseflow-core
- npm MCP: https://www.npmjs.com/package/parseflow-mcp-server
- GitHub: https://github.com/Libres-coder/ParseFlow
- MCP Registry: https://registry.modelcontextprotocol.io/
- Issues: https://github.com/Libres-coder/ParseFlow/issues
- Documentation: https://github.com/Libres-coder/ParseFlow#readme
🙏 Acknowledgments
Special thanks to:
- mammoth - For excellent Word document parsing
- xlsx (SheetJS) - For comprehensive Excel support
- MCP Community - For feedback and support
📝 Full Changelog
See CHANGELOG.md for complete details.
🎯 What's Next?
Looking ahead to v1.2.0:
- PowerPoint (pptx) support
- Encrypted document support
- OCR text recognition
- Performance optimizations
💬 Feedback
We'd love to hear from you!
- Report bugs: GitHub Issues
- Request features: GitHub Discussions
Made with ❤️ by Libres-coder
Enjoy ParseFlow v1.1.0! 🎉
ParseFlow v1.0.2 - Documentation & Cleanup
✨ What's New
- Added complete README for mcp-server package
- Major project cleanup (removed 60+ temporary files, 19 MB)
- Improved project structure and organization
📦 Packages
- parseflow-core@1.0.1 - Core PDF parsing library
- parseflow-mcp-server@1.0.2 - MCP server (now with README!)
📖 Documentation Improvements
- Complete installation guide
- Usage examples for Claude Desktop, Windsurf, Cursor
- API reference for all 5 tools
ParseFlow v1.0.1 - MCP Registry Launch
chore: clean up project structure
- Moved temporary documents to docs/archive/
- Moved test scripts to scripts/manual-tests/
- Organized project root directory for better maintainability
ParseFlow v1.0.0 - Production Ready
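Release Date: 2025-11-28
Version: v1.0.0
Status: Production ready
🎉 Major Release - All Core Features Complete
ParseFlow v1.0.0 is a feature-complete, production-ready PDF parsing MCP server that provides powerful PDF processing capabilities for AI coding assistants (Windsurf and Cursor).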
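✨ New Features
🖼️ Image Extraction
- Extracts images from PDFs using poppler-utils (pdfimages)
- Supports PNG and JPG formats
- Customizable output directory and format
- Size filtering options
- Cross-platform support (Windows/Linux/macOS)
📑 Table of Contents Extraction
- Extracts PDF bookmarks and outline structure
- Supports pdftk (full functionality) and pdfinfo (basic functionality)
- Hierarchical table of contents structure
- Automatic page number parsing
- External tool integration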
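As an illustration of how a hierarchical table of contents can be derived from pdftk, the sketch below parses the bookmark records printed by `pdftk <file> dump_data` (`BookmarkBegin`, `BookmarkTitle`, `BookmarkLevel`, `BookmarkPageNumber`) and nests them by level. The `TocEntry` shape and the function names are assumptions for this example; ParseFlow's own `TOCExtractorExternal` may work differently.

```typescript
// Hypothetical node shape for a nested table of contents.
interface TocEntry {
  title: string;
  level: number; // 1 = top level
  page: number;
  children: TocEntry[];
}

// Turn a flat, document-ordered bookmark list into a tree using a stack
// that holds the current chain of ancestors.
function nestEntries(flat: TocEntry[]): TocEntry[] {
  const roots: TocEntry[] = [];
  const stack: TocEntry[] = [];
  for (const entry of flat) {
    // Drop ancestors that are at the same or a deeper level.
    while (stack.length > 0 && stack[stack.length - 1].level >= entry.level) {
      stack.pop();
    }
    if (stack.length > 0) {
      stack[stack.length - 1].children.push(entry);
    } else {
      roots.push(entry);
    }
    stack.push(entry);
  }
  return roots;
}

// Parse the bookmark section of `pdftk dump_data` output.
function parseBookmarks(dump: string): TocEntry[] {
  const flat: TocEntry[] = [];
  let current: TocEntry | undefined;
  for (const rawLine of dump.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (line === 'BookmarkBegin') {
      current = { title: '', level: 1, page: 0, children: [] };
      flat.push(current);
      continue;
    }
    const sep = line.indexOf(': ');
    if (!current || sep === -1) continue;
    const key = line.slice(0, sep);
    const value = line.slice(sep + 2);
    if (key === 'BookmarkTitle') current.title = value;
    else if (key === 'BookmarkLevel') current.level = Number(value);
    else if (key === 'BookmarkPageNumber') current.page = Number(value);
  }
  return nestEntries(flat);
}
```

The stack-based nesting keeps the pass linear in the number of bookmarks and tolerates level jumps (for example, a level-3 entry directly under a level-1 entry).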
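🔧 External Tool Integration
- ImageExtractorExternal - Extracts images via pdfimages
- TOCExtractorExternal - Extracts the table of contents via pdftk/pdfinfo
- Automatic tool detection
- Cross-platform support (Windows PowerShell, Linux, macOS)
- Custom tool path configuration
📊 Complete Feature List
Core Features (100%)
| Feature | Status | Implementation |
|---|---|---|
| 📄 Text extraction | ✅ | pdf-parse |
| 📊 Metadata extraction | ✅ | pdf-parse |
| 🔍 Keyword search | ✅ | In-house search engine |
| 🖼️ Image extraction | ✅ | poppler-utils |
| 📑 Table of contents extraction | ✅ | pdftk/pdfinfo |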
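🔧 Improvements
Windows Compatibility
- ✅ PowerShell command execution support
- ✅ Fixed environment variable inheritance
- ✅ Custom tool path configuration
Testing
- ✅ Verified against real PDF files
- ✅ 52 unit tests (100% passing)
- ✅ External tool integration tests
- ✅ 83.6% code coverage
Documentation
- ✅ External tool installation guide
- ✅ Installation instructions for Windows/Linux/macOS
- ✅ Complete API documentation
- ✅ Usage examples and best practices
- ✅ Chinese and English documentation kept in sync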
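📦 Technical Details
Dependency Updates
- Added: pdf-lib@1.17.1 - PDF manipulation library
- Removed: pdfjs-dist - due to Node.js compatibility issues
Architecture
- Monorepo: pdf-parser-core + mcp-server
- Code quality: ESLint 0 errors, TypeScript strict mode
- Platform support: full Windows/Linux/macOS support
Quality Metrics
✅ Build: successful
✅ Tests: 52/52 passing (100%)
✅ Coverage: 83.6%
✅ Lint: 0 errors
✅ TypeScript: strict mode
🐛 Bug Fixes
- Fixed environment variable inheritance for Node.js processes on Windows
- Fixed null bytes in .gitignore
- Resolved the Jest/pdfjs-dist ESM compatibility issue
- Removed accidentally committed build output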
🚀 Quick Start
Installation
# Clone the repository
git clone https://github.com/Libres-coder/ParseFlow.git
cd ParseFlow
# Install dependencies
pnpm install
# Build the project
pnpm build
Configure Windsurf
Edit C:\Users\<username>\.codeium\windsurf\mcp_config.json:
{
  "mcpServers": {
    "parseflow": {
      "command": "node",
      "args": ["<project-root>\\packages\\mcp-server\\dist\\index.js"],
      "env": {
        "PARSEFLOW_CACHE_DIR": "<project-root>\\.cache",
        "PARSEFLOW_MAX_FILE_SIZE": "52428800",
        "PARSEFLOW_ALLOWED_PATHS": "D:\\;C:\\Users"
      }
    }
  }
}
Usage
In Windsurf, just say:
Analyze D:\report.pdf
How many pages does this PDF have?
Search the contract for "liability for breach of contract"
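The environment variables in the Windsurf configuration above are plain strings: PARSEFLOW_MAX_FILE_SIZE is a byte count (52428800 bytes is 50 MiB) and PARSEFLOW_ALLOWED_PATHS looks like a semicolon-separated list of directories. The sketch below shows one plausible way a server could interpret them. The function names, the fallback size, and the case-insensitive Windows-style prefix check are assumptions made for illustration, not ParseFlow's documented behavior.

```typescript
// Hypothetical parsed form of the two settings.
interface ServerLimits {
  maxFileSize: number;    // bytes
  allowedPaths: string[]; // directory prefixes
}

// Read limits from an environment-like object (e.g. process.env in Node.js).
function readLimits(env: Record<string, string | undefined>): ServerLimits {
  const parsedSize = Number(env['PARSEFLOW_MAX_FILE_SIZE']);
  const fallbackSize = 50 * 1024 * 1024; // assumed default, 50 MiB
  return {
    maxFileSize: isFinite(parsedSize) && parsedSize > 0 ? parsedSize : fallbackSize,
    allowedPaths: (env['PARSEFLOW_ALLOWED_PATHS'] ?? '')
      .split(';')
      .map((p) => p.trim())
      .filter((p) => p.length > 0),
  };
}

// Windows-style check: is filePath inside one of the allowed directories?
// A real implementation should resolve ".." segments before comparing.
function isPathAllowed(filePath: string, allowedPaths: string[]): boolean {
  const normalize = (p: string): string => p.replace(/\//g, '\\').toLowerCase();
  const target = normalize(filePath);
  return allowedPaths.some((root) => {
    const base = normalize(root).replace(/\\+$/, ''); // drop trailing backslashes
    return target === base || target.indexOf(base + '\\') === 0;
  });
}
```

With the values from the configuration above, a request for D:\report.pdf would pass the path check, while a file under C:\Windows would be rejected.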
📚 Documentation
User Guide
Developer Documentation
Configuration Guide
💻 System Requirements
- Node.js: >= 18.0.0
- pnpm: >= 8.0.0
- Operating system: Windows 10/11, Ubuntu 20.04+, macOS 11+
Optional Tools (for image and table of contents extraction)
Windows:
Download Poppler: https://github.com/oschwartz10612/poppler-windows/releases
Add it to the system PATH
Ubuntu/Debian:
sudo apt-get install poppler-utils pdftk
macOS:
brew install poppler pdftk-java
🤝 Contributing
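Contributions are welcome! See CONTRIBUTING.md
Contribution Workflow
- Fork the repository
- Create a feature branch (git checkout -b feature/AmazingFeature)
- Commit your changes (git commit -m 'Add AmazingFeature')
- Push to the branch (git push origin feature/AmazingFeature)
- Open a Pull Request
📝 Breaking Changes
None - all changes are backward compatible
🔄 Upgrading from Earlier Versions
If you are using an earlier version:
- Update the code
  git pull origin main
  pnpm install
  pnpm build
- Restart your IDE (Windsurf/Cursor)
- Optional: install the external tools to use the new features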
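🙏 Acknowledgments
- Model Context Protocol - MCP protocol standard
- pdf-parse - PDF text extraction library
- pdf-lib - PDF manipulation library
- Poppler - PDF rendering library
- Windsurf community - Testing and feedback
📄 License
This project is licensed under the MIT License - see the LICENSE file for details
📮 Links
- GitHub: https://github.com/Libres-coder/ParseFlow
- Issues: https://github.com/Libres-coder/ParseFlow/issues
- Discussions: https://github.com/Libres-coder/ParseFlow/discussions
- Documentation: docs/
Made with ❤️ by ParseFlow Team
If this project helps you, please give it a ⭐ Star!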