03 Dec 14:23

67022d6

ParseFlow v1.1.0 - Office Documents Support 📄📊

Release Date: 2025-12-03

We're excited to announce ParseFlow v1.1.0, a major feature release that adds Word (docx) and Excel (xlsx/xls) document parsing support! 🎉

🌟 What's New

📝 Word Document Support

ParseFlow now supports Word (.docx) documents with comprehensive parsing capabilities:

✅ Text Extraction - Extract plain text from Word documents
✅ HTML Conversion - Convert Word documents to HTML
✅ Metadata Retrieval - Get document properties and file information
✅ Text Search - Search for keywords with context snippets
✅ MCP Tools - 2 new tools for AI assistants (extract_word, search_word)

Example:

import { WordParser } from 'parseflow-core';

const parser = new WordParser();
const result = await parser.extractText('report.docx');
console.log(result.text);
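
The "Text Search" feature above returns keyword hits with context snippets. The release notes do not show how those snippets are built, so the following is only a minimal sketch of the general idea on a plain string. `searchWithContext`, `SnippetMatch`, and the `radius` parameter are hypothetical names for illustration and are not part of parseflow-core.

```typescript
// Hypothetical helper (not the parseflow-core API): find each occurrence of a
// keyword and return it together with a window of surrounding text.
interface SnippetMatch {
  index: number;   // character offset of the match in the text
  snippet: string; // the match plus up to `radius` characters on each side
}

function searchWithContext(text: string, keyword: string, radius = 20): SnippetMatch[] {
  const matches: SnippetMatch[] = [];
  if (keyword.length === 0) return matches;

  // Case-insensitive comparison. Assumes lowercasing does not change string
  // length, which holds for ASCII but not for every Unicode character.
  const haystack = text.toLowerCase();
  const needle = keyword.toLowerCase();

  let from = 0;
  while (true) {
    const index = haystack.indexOf(needle, from);
    if (index === -1) break;
    const start = Math.max(0, index - radius);
    const end = Math.min(text.length, index + keyword.length + radius);
    matches.push({ index, snippet: text.slice(start, end) });
    from = index + needle.length; // continue after this match
  }
  return matches;
}

const sampleText = "The budget was approved. Next year's budget is under review.";
console.log(searchWithContext(sampleText, "budget", 10));
```

A fixed character radius keeps the sketch short; a real implementation might prefer to cut snippets at word or sentence boundaries.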

📊 Excel Spreadsheet Support

Full support for Excel (.xlsx/.xls) spreadsheets:

✅ Multi-Sheet Extraction - Extract data from all sheets or specific ones
✅ Multiple Formats - JSON, CSV, or plain text output
✅ Cell Search - Find values across all sheets with cell coordinates
✅ Range Extraction - Extract specific cell ranges (e.g., A1:C10)
✅ Workbook Metadata - Sheet names, counts, and properties
✅ MCP Tools - 2 new tools for AI assistants (extract_excel, search_excel)

Example:

import { ExcelParser } from 'parseflow-core';

const parser = new ExcelParser();
const data = await parser.extractData('spreadsheet.xlsx', {
  sheetName: 'Sales',
  format: 'json'
});
console.log(data);
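
Range extraction is described above with the example `A1:C10`. As a rough illustration of what an A1-style range means in terms of row and column indices, here is a small self-contained sketch. `columnToIndex`, `parseRange`, and `sliceRange` are hypothetical helpers written for this note; the actual ExcelParser may handle ranges differently.

```typescript
// Hypothetical helpers (not the parseflow-core API).

// Convert column letters to a zero-based index: A -> 0, Z -> 25, AA -> 26.
function columnToIndex(letters: string): number {
  const upper = letters.toUpperCase();
  let n = 0;
  for (let i = 0; i < upper.length; i++) {
    n = n * 26 + (upper.charCodeAt(i) - 64); // 'A' has char code 65
  }
  return n - 1;
}

interface CellRange {
  startRow: number; // zero-based, inclusive
  endRow: number;   // zero-based, inclusive
  startCol: number;
  endCol: number;
}

// Parse a range such as "A1:C10" into zero-based, inclusive bounds.
function parseRange(range: string): CellRange {
  const m = /^([A-Za-z]+)(\d+):([A-Za-z]+)(\d+)$/.exec(range.trim());
  if (!m) throw new Error("Invalid range: " + range);
  const r1 = Number(m[2]) - 1;
  const r2 = Number(m[4]) - 1;
  if (r1 < 0 || r2 < 0) throw new Error("Row numbers start at 1: " + range);
  const c1 = columnToIndex(m[1]);
  const c2 = columnToIndex(m[3]);
  return {
    startRow: Math.min(r1, r2),
    endRow: Math.max(r1, r2),
    startCol: Math.min(c1, c2),
    endCol: Math.max(c1, c2),
  };
}

// Cut the given range out of a sheet represented as an array of rows.
function sliceRange<T>(rows: T[][], range: string): T[][] {
  const { startRow, endRow, startCol, endCol } = parseRange(range);
  return rows
    .slice(startRow, endRow + 1)
    .map((row) => row.slice(startCol, endCol + 1));
}

const grid = [
  [1, 2, 3],
  [4, 5, 6],
  [7, 8, 9],
];
console.log(sliceRange(grid, "B2:C3"));
```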

🛠️ MCP Server Updates

The MCP server now includes 9 tools (up from 5):

PDF Tools (Existing - 5 tools)

extract_text - Extract text from PDF
search_pdf - Search in PDF
get_metadata - Get PDF metadata
extract_images - Extract images
get_toc - Get table of contents

Word Tools (New - 2 tools)

extract_word - Extract text/HTML from Word documents
search_word - Search in Word documents

Excel Tools (New - 2 tools)

extract_excel - Extract data from Excel spreadsheets
search_excel - Search in Excel cells

Usage in Claude Desktop:

"Please read the contents of report.docx"
→ Uses extract_word tool

"Find 'Product A' in sales.xlsx"
→ Uses search_excel tool
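
In practice the assistant chooses a tool from the descriptions the MCP server exposes, so no client-side routing is required. Purely to make the format-to-tool relationship listed above concrete, the sketch below maps a file extension to the matching extraction tool name. The tool names come from the lists above; `pickExtractTool` and the lookup table are hypothetical and are not part of parseflow-mcp-server.

```typescript
// Hypothetical lookup (illustration only): which extraction tool from the
// lists above corresponds to which file extension.
const EXTRACT_TOOL_BY_EXTENSION = new Map<string, string>([
  [".pdf", "extract_text"],
  [".docx", "extract_word"],
  [".xlsx", "extract_excel"],
  [".xls", "extract_excel"],
]);

function pickExtractTool(filePath: string): string | undefined {
  // Look only at the file name so dots in directory names are ignored.
  const parts = filePath.split(/[\\/]/);
  const name = parts[parts.length - 1];
  const dot = name.lastIndexOf(".");
  if (dot === -1) return undefined;
  return EXTRACT_TOOL_BY_EXTENSION.get(name.slice(dot).toLowerCase());
}

console.log(pickExtractTool("report.docx"));
console.log(pickExtractTool("sales.xlsx"));
```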

📦 Package Updates

parseflow-core v1.1.0

New: WordParser class for Word document parsing
New: ExcelParser class for Excel spreadsheet parsing
Dependencies: Added mammoth@^1.11.0 and xlsx@^0.18.5
Updated: Package description now mentions all supported formats

parseflow-mcp-server v1.1.0

New: 4 additional MCP tools (2 Word + 2 Excel)
Updated: Server description updated to mention Office documents
Total: 9 tools serving AI assistants

📚 Documentation

New Documentation

OFFICE_EXAMPLES.md - Comprehensive guide with examples
- Word parsing methods (4 approaches)
- Excel parsing methods (8 approaches)
- 5 real-world use cases
- Performance tips and troubleshooting

Updated Documentation

README.md - Completely rewritten
- Feature overview for all formats
- Quick start guides
- MCP server configuration
- Project structure
CHANGELOG.md - v1.1.0 entry added
- Detailed feature list
- Breaking changes (none!)
- Upgrade guide

🧪 Testing

All new features are thoroughly tested:

✅ Word Parser: 4/4 tests passing
- Text extraction
- Metadata retrieval
- Text search
- HTML conversion
✅ Excel Parser: 8/8 tests passing
- Multi-sheet extraction
- Format conversion (JSON/CSV/Text)
- Cell search
- Metadata retrieval

Test Files Included:

Word测试文件.docx (6 MB)
Excel测试文件.xlsx (19 KB)

🚀 Installation

npm

# Core library
npm install parseflow-core@1.1.0

# MCP Server (global)
npm install -g parseflow-mcp-server@1.1.0

pnpm

pnpm add parseflow-core@1.1.0
pnpm add -g parseflow-mcp-server@1.1.0

📊 Supported Formats

Format	Extension	Read	Search	Metadata	Tools
PDF	.pdf	✅	✅	✅	5
Word	.docx	✅	✅	✅	2
Excel	.xlsx/.xls	✅	✅	✅	2

🔧 Dependencies

New Dependencies

mammoth@^1.11.0 - Word document parsing
xlsx@^0.18.5 - Excel spreadsheet parsing

Existing Dependencies

pdf-parse@^1.1.1 - PDF parsing
pdf-lib@^1.17.1 - PDF manipulation
@modelcontextprotocol/sdk@^1.0.4 - MCP SDK
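
The caret ranges above are not equally permissive: under npm's semver rules, `^1.11.0` accepts any 1.x release from 1.11.0 upward, while `^0.18.5` only accepts 0.18.x releases from 0.18.5 upward, because a leading zero narrows the range. The sketch below illustrates that rule for plain `x.y.z` versions only. It is a simplified illustration with hypothetical function names, not a substitute for the `semver` package, and it ignores pre-release tags.

```typescript
// Simplified illustration of npm caret ranges for plain x.y.z versions.
type Version = [number, number, number];

function parseVersion(v: string): Version {
  const m = /^(\d+)\.(\d+)\.(\d+)$/.exec(v);
  if (!m) throw new Error("Unsupported version: " + v);
  return [Number(m[1]), Number(m[2]), Number(m[3])];
}

function compareVersions(a: Version, b: Version): number {
  for (let i = 0; i < 3; i++) {
    if (a[i] !== b[i]) return a[i] < b[i] ? -1 : 1;
  }
  return 0;
}

// Exclusive upper bound: bump the left-most non-zero component.
function caretUpperBound(base: Version): Version {
  const [major, minor, patch] = base;
  if (major > 0) return [major + 1, 0, 0];
  if (minor > 0) return [0, minor + 1, 0];
  return [0, 0, patch + 1];
}

function satisfiesCaret(version: string, range: string): boolean {
  if (range.charAt(0) !== "^") throw new Error("Not a caret range: " + range);
  const base = parseVersion(range.slice(1));
  const candidate = parseVersion(version);
  return (
    compareVersions(candidate, base) >= 0 &&
    compareVersions(candidate, caretUpperBound(base)) < 0
  );
}

console.log(satisfiesCaret("1.12.3", "^1.11.0"));
console.log(satisfiesCaret("0.19.0", "^0.18.5"));
```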

🐛 Bug Fixes

Fixed Excel metadata extraction reliability
Added null checks for sheet names in Excel parser
Improved error handling for malformed Office files
Better error messages for unsupported file types

🧹 Cleanup

Removed 8 redundant documentation files (~35 KB)
Simplified PROJECT_STATUS.md
Improved project organization
Updated .gitignore for test files

🔄 Upgrade Guide

From v1.0.x

No breaking changes! Simply update:

npm install parseflow-core@latest
npm install -g parseflow-mcp-server@latest

New Features

Import the new parsers:

import { WordParser, ExcelParser } from 'parseflow-core';

For MCP users, the new tools are automatically available after updating.

📖 Examples

Extract Text from Word Document

import { WordParser } from 'parseflow-core';

const parser = new WordParser();
const result = await parser.extractText('document.docx');
console.log(result.text);

Extract Data from Excel

import { ExcelParser } from 'parseflow-core';

const parser = new ExcelParser();
const sheets = await parser.extractData('data.xlsx');

sheets.forEach(sheet => {
  console.log(`${sheet.sheetName}: ${sheet.rowCount} rows`);
});
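
CSV is one of the output formats listed for Excel extraction. How ParseFlow serializes CSV is not shown in these notes, so the following is only a sketch of the usual quoting rules (fields containing commas, quotes, or line breaks are wrapped in double quotes, and embedded quotes are doubled). `rowsToCsv` and `escapeCsvField` are hypothetical names.

```typescript
// Hypothetical helpers (not the parseflow-core API): turn rows into CSV text.
function escapeCsvField(value: unknown): string {
  // Empty cells become empty fields.
  const s = value === null || value === undefined ? "" : String(value);
  // Quote only when needed, doubling any embedded double quotes.
  return /[",\r\n]/.test(s) ? '"' + s.replace(/"/g, '""') + '"' : s;
}

function rowsToCsv(rows: unknown[][]): string {
  return rows
    .map((row) => row.map(escapeCsvField).join(","))
    .join("\r\n");
}

console.log(
  rowsToCsv([
    ["name", "note"],
    ["Widget, large", 'said "ok"'],
    [null, 3],
  ])
);
```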

Search Across Documents

import { WordParser, ExcelParser } from 'parseflow-core';

const wordParser = new WordParser();
const excelParser = new ExcelParser();

// Search in Word
const wordMatches = await wordParser.searchText('report.docx', 'budget');

// Search in Excel
const excelMatches = await excelParser.searchText('data.xlsx', 'revenue');
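
The shape of the search results is not documented in these notes. Since the feature list says Excel search reports cell coordinates, the sketch below shows one way such coordinates could be derived when scanning sheets held in memory as arrays of rows. `searchCells`, `CellMatch`, and `indexToColumn` are hypothetical and only illustrate the idea.

```typescript
// Hypothetical helpers (not the parseflow-core API).
interface CellMatch {
  sheet: string;
  cell: string; // A1-style address, e.g. "B2"
  value: string;
}

// Convert a zero-based column index to letters: 0 -> A, 25 -> Z, 26 -> AA.
function indexToColumn(index: number): string {
  let n = index + 1;
  let letters = "";
  while (n > 0) {
    const rem = (n - 1) % 26;
    letters = String.fromCharCode(65 + rem) + letters;
    n = Math.floor((n - 1) / 26);
  }
  return letters;
}

// Case-insensitive substring search over every cell of every sheet.
function searchCells(sheets: Record<string, unknown[][]>, keyword: string): CellMatch[] {
  const needle = keyword.toLowerCase();
  const matches: CellMatch[] = [];
  for (const sheet of Object.keys(sheets)) {
    const rows = sheets[sheet];
    for (let r = 0; r < rows.length; r++) {
      for (let c = 0; c < rows[r].length; c++) {
        const raw = rows[r][c];
        if (raw === null || raw === undefined) continue; // skip empty cells
        const value = String(raw);
        if (value.toLowerCase().indexOf(needle) !== -1) {
          matches.push({ sheet, cell: indexToColumn(c) + String(r + 1), value });
        }
      }
    }
  }
  return matches;
}

const workbook = {
  Sales: [["Product", "Revenue"], ["Product A", 1200]],
  Costs: [["Item"], ["revenue share"]],
};
console.log(searchCells(workbook, "revenue"));
```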

More examples in OFFICE_EXAMPLES.md!

🌐 Links

npm Core: https://www.npmjs.com/package/parseflow-core
npm MCP: https://www.npmjs.com/package/parseflow-mcp-server
GitHub: https://github.com/Libres-coder/ParseFlow
MCP Registry: https://registry.modelcontextprotocol.io/
Issues: https://github.com/Libres-coder/ParseFlow/issues
Documentation: https://github.com/Libres-coder/ParseFlow#readme

🙏 Acknowledgments

Special thanks to:

mammoth - For excellent Word document parsing
xlsx (SheetJS) - For comprehensive Excel support
MCP Community - For feedback and support

📝 Full Changelog

See CHANGELOG.md for complete details.

🎯 What's Next?

Looking ahead to v1.2.0:

PowerPoint (pptx) support
Encrypted document support
OCR text recognition
Performance optimizations

💬 Feedback

We'd love to hear from you!

Report bugs: GitHub Issues
Request features: GitHub Discussions

Made with ❤️ by Libres-coder

Enjoy ParseFlow v1.1.0! 🎉

02 Dec 17:58

Libres-coder

v1.0.2

b26b120

🎉 ParseFlow v1.0.2 - Documentation & Cleanup

✨ What's New

Added complete README for mcp-server package
Major project cleanup (removed 60+ temporary files, 19 MB)
Improved project structure and organization

📦 Packages

parseflow-core@1.0.1 - Core PDF parsing library
parseflow-mcp-server@1.0.2 - MCP server (now with README!)

📖 Documentation Improvements

Complete installation guide
Usage examples for Claude Desktop, Windsurf, Cursor
API reference for all 5 tools

🔗 Links

npm: https://www.npmjs.com/package/parseflow-mcp-server
MCP Registry: https://registry.modelcontextprotocol.io/

02 Dec 17:20

Libres-coder

v1.0.1

b26b120

ParseFlow v1.0.1 - MCP Registry Launch

chore: clean up project structure

- Moved temporary documents to docs/archive/
- Moved test scripts to scripts/manual-tests/
- Organized project root directory for better maintainability

28 Nov 08:10

Libres-coder

v1.0.0

115a26f

ParseFlow v1.0.0 - Production Ready 🎉

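Release Date: 2025-11-28
Version: v1.0.0
Status: Production ready

🎉 Major Release - All Core Features Complete

ParseFlow v1.0.0 is a feature-complete, production-ready PDF parsing MCP server that gives AI coding assistants (Windsurf and Cursor) robust PDF processing capabilities.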

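✨ New Features

🖼️ Image Extraction

Extracts images from PDFs using poppler-utils (pdfimages)
Supports PNG and JPG formats
Configurable output directory and format
Optional size filtering
Cross-platform support (Windows/Linux/macOS)

📑 Table of Contents Extraction

Extracts PDF bookmarks and outline structure
Supports pdftk (full functionality) and pdfinfo (basic functionality)
Hierarchical table of contents structure
Automatic page number resolution
External tool integration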
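
To illustrate how a flat bookmark listing can become a hierarchical table of contents, here is a minimal sketch. It assumes the record layout that `pdftk <file> dump_data` prints for bookmarks (`BookmarkBegin`, `BookmarkTitle`, `BookmarkLevel`, `BookmarkPageNumber`); the function and type names are hypothetical, and ParseFlow's own TOC extractor may work differently.

```typescript
// Hypothetical sketch (not ParseFlow's implementation): parse pdftk-style
// bookmark records and nest them by level.
interface FlatEntry {
  title: string;
  level: number;
  page: number;
}

interface TocNode extends FlatEntry {
  children: TocNode[];
}

function parseBookmarks(dump: string): FlatEntry[] {
  const entries: FlatEntry[] = [];
  const lines = dump.split(/\r?\n/);
  for (let i = 0; i < lines.length; i++) {
    const line = lines[i].trim();
    if (line === "BookmarkBegin") {
      entries.push({ title: "", level: 1, page: 0 });
      continue;
    }
    const last = entries[entries.length - 1];
    if (!last) continue; // ignore anything before the first bookmark
    const sep = line.indexOf(": ");
    if (sep === -1) continue;
    const key = line.slice(0, sep);
    const value = line.slice(sep + 2);
    if (key === "BookmarkTitle") last.title = value;
    else if (key === "BookmarkLevel") last.level = Number(value);
    else if (key === "BookmarkPageNumber") last.page = Number(value);
  }
  return entries;
}

function buildTocTree(entries: FlatEntry[]): TocNode[] {
  const roots: TocNode[] = [];
  const stack: TocNode[] = []; // chain of currently open ancestors
  for (const entry of entries) {
    const node: TocNode = { ...entry, children: [] };
    // Close every open node that is not shallower than the new one.
    while (stack.length > 0 && stack[stack.length - 1].level >= node.level) {
      stack.pop();
    }
    if (stack.length === 0) roots.push(node);
    else stack[stack.length - 1].children.push(node);
    stack.push(node);
  }
  return roots;
}

const sampleDump = [
  "BookmarkBegin",
  "BookmarkTitle: Chapter 1",
  "BookmarkLevel: 1",
  "BookmarkPageNumber: 1",
  "BookmarkBegin",
  "BookmarkTitle: Section 1.1",
  "BookmarkLevel: 2",
  "BookmarkPageNumber: 2",
  "BookmarkBegin",
  "BookmarkTitle: Chapter 2",
  "BookmarkLevel: 1",
  "BookmarkPageNumber: 5",
].join("\n");

console.log(JSON.stringify(buildTocTree(parseBookmarks(sampleDump)), null, 2));
```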

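🔧 External Tool Integration

ImageExtractorExternal - Extracts images via pdfimages
TOCExtractorExternal - Extracts the table of contents via pdftk/pdfinfo
Automatic tool detection
Cross-platform support (Windows PowerShell, Linux, macOS)
Configurable custom tool paths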

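📊 Full Feature List

Core Features (100%)

Feature	Status	Implementation
📄 Text extraction	✅	pdf-parse
📊 Metadata extraction	✅	pdf-parse
🔍 Keyword search	✅	In-house search engine
🖼️ Image extraction	✅	poppler-utils
📑 Table of contents extraction	✅	pdftk/pdfinfo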

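🔧 Improvements

Windows Compatibility

✅ PowerShell command execution support
✅ Fixed environment variable inheritance
✅ Custom tool path configuration

Testing

✅ Verified against real PDFs
✅ 52 unit tests (100% passing)
✅ External tool integration tests
✅ 83.6% code coverage

Documentation

✅ External tool installation guide
✅ Installation instructions for Windows/Linux/macOS
✅ Complete API documentation
✅ Usage examples and best practices
✅ Chinese and English documentation kept in sync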

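📦 Technical Details

Dependency Updates

Added: pdf-lib@1.17.1 - PDF manipulation library
Removed: pdfjs-dist - due to Node.js compatibility issues

Architecture

Monorepo: pdf-parser-core + mcp-server
Code quality: 0 ESLint errors, TypeScript strict mode
Platform support: full Windows/Linux/macOS support

Quality Metrics

✅ Build: passing
✅ Tests: 52/52 passing (100%)
✅ Coverage: 83.6%
✅ Lint: 0 errors
✅ TypeScript: strict mode

🐛 Bug Fixes

Fixed environment variable inheritance for Node.js processes on Windows
Fixed null bytes in .gitignore
Resolved Jest/pdfjs-dist ESM compatibility issues
Removed accidentally committed build artifacts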

🚀 Quick Start

Installation

# Clone the repository
git clone https://github.com/Libres-coder/ParseFlow.git
cd ParseFlow

# Install dependencies
pnpm install

# Build the project
pnpm build

Configure Windsurf

Edit C:\Users\<username>\.codeium\windsurf\mcp_config.json:

{
  "mcpServers": {
    "parseflow": {
      "command": "node",
      "args": ["<project root>\\packages\\mcp-server\\dist\\index.js"],
      "env": {
        "PARSEFLOW_CACHE_DIR": "<project root>\\.cache",
        "PARSEFLOW_MAX_FILE_SIZE": "52428800",
        "PARSEFLOW_ALLOWED_PATHS": "D:\\;C:\\Users"
      }
    }
  }
}
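
The release notes do not describe how the server interprets these environment variables. Reading the values at face value, `52428800` is 50 MiB in bytes and `PARSEFLOW_ALLOWED_PATHS` looks like a semicolon-separated list of root directories. The sketch below shows one plausible way to apply such settings; `readPolicy`, `isAllowed`, and the "no limit when unset" fallback are assumptions for illustration, not the server's actual behavior.

```typescript
// Hypothetical interpretation of the env block above; the real server may differ.
interface AccessPolicy {
  maxFileSizeBytes: number;
  allowedPaths: string[];
}

function readPolicy(env: Record<string, string | undefined>): AccessPolicy {
  const rawSize = env["PARSEFLOW_MAX_FILE_SIZE"];
  // Assumption: an unset or malformed size means "no limit".
  const maxFileSizeBytes =
    rawSize !== undefined && /^\d+$/.test(rawSize)
      ? Number(rawSize)
      : Number.POSITIVE_INFINITY;
  const rawPaths = env["PARSEFLOW_ALLOWED_PATHS"] || "";
  const allowedPaths = rawPaths
    .split(";")
    .map((p) => p.trim())
    .filter((p) => p.length > 0);
  return { maxFileSizeBytes, allowedPaths };
}

// Windows-style comparison: treat "/" as "\" and ignore case.
function normalizeWindowsPath(p: string): string {
  return p.replace(/\//g, "\\").toLowerCase();
}

// Note: this does not resolve ".." segments or symlinks; a real check should.
function isAllowed(filePath: string, sizeBytes: number, policy: AccessPolicy): boolean {
  if (sizeBytes > policy.maxFileSizeBytes) return false;
  const target = normalizeWindowsPath(filePath);
  return policy.allowedPaths.some((root) => {
    const base = normalizeWindowsPath(root);
    // Require a separator after the root so "C:\Users" does not match "C:\UsersOther".
    const withSep = base.charAt(base.length - 1) === "\\" ? base : base + "\\";
    return target === base || target.indexOf(withSep) === 0;
  });
}

const policy = readPolicy({
  PARSEFLOW_MAX_FILE_SIZE: "52428800",
  PARSEFLOW_ALLOWED_PATHS: "D:\\;C:\\Users",
});
console.log(policy.maxFileSizeBytes === 50 * 1024 * 1024);
console.log(isAllowed("D:\\report.pdf", 1024, policy));
```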

Usage

In Windsurf, just ask:

Analyze D:\report.pdf
How many pages does this PDF have?
Search the contract for "liability for breach"

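📚 Documentation

User Guide

Developer Documentation

Configuration Guide

💻 System Requirements

Node.js: >= 18.0.0
pnpm: >= 8.0.0
Operating system: Windows 10/11, Ubuntu 20.04+, macOS 11+

Optional Tools (for image and table of contents extraction)

Windows:

Download Poppler: https://github.com/oschwartz10612/poppler-windows/releases
Add it to the system PATH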

Ubuntu/Debian:

sudo apt-get install poppler-utils pdftk

macOS:

brew install poppler pdftk-java

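🤝 Contributing

Contributions are welcome! See CONTRIBUTING.md

Contribution Workflow

Fork the repository
Create a feature branch (git checkout -b feature/AmazingFeature)
Commit your changes (git commit -m 'Add AmazingFeature')
Push to the branch (git push origin feature/AmazingFeature)
Open a Pull Request

📝 Breaking Changes

None - all changes are backward compatible

🔄 Upgrading from Earlier Versions

If you are on an earlier version:

Update the code

git pull origin main
pnpm install
pnpm build

Restart your IDE (Windsurf/Cursor)
Optional: install the external tools to use the new features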

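🙏 Acknowledgments

Model Context Protocol - The MCP protocol standard
pdf-parse - PDF text extraction library
pdf-lib - PDF manipulation library
Poppler - PDF rendering library
Windsurf community - Testing and feedback

📄 License

This project is licensed under the MIT License - see the LICENSE file for details

📮 Links

GitHub: https://github.com/Libres-coder/ParseFlow
Issues: https://github.com/Libres-coder/ParseFlow/issues
Discussions: https://github.com/Libres-coder/ParseFlow/discussions
Documentation: docs/

Made with ❤️ by ParseFlow Team

If this project helps you, please give it a ⭐ Star!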
