Releases: Libres-coder/ParseFlow

ParseFlow v1.1.0 - Office Documents Support

03 Dec 14:23

ParseFlow v1.1.0 - Office Documents Support 📄📊

Release Date: 2025-12-03

We're excited to announce ParseFlow v1.1.0, a major feature release that adds Word (docx) and Excel (xlsx/xls) document parsing support! 🎉


🌟 What's New

📝 Word Document Support

ParseFlow now supports Word (.docx) documents with comprehensive parsing capabilities:

  • Text Extraction - Extract plain text from Word documents
  • HTML Conversion - Convert Word documents to HTML
  • Metadata Retrieval - Get document properties and file information
  • Text Search - Search for keywords with context snippets
  • MCP Tools - 2 new tools for AI assistants (extract_word, search_word)

Example:

import { WordParser } from 'parseflow-core';

const parser = new WordParser();
const result = await parser.extractText('report.docx');
console.log(result.text);
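
The "Text Search" feature above returns keyword hits with context snippets. The following is a minimal sketch of that idea on a plain string, not ParseFlow's actual implementation: the `SnippetMatch` shape, the `searchWithContext` name, and the `context` parameter are all assumptions made for illustration.

```typescript
// Hypothetical result shape; ParseFlow's real search result type may differ.
interface SnippetMatch {
  index: number;   // character offset of the match in the text
  snippet: string; // the match plus surrounding context
}

// Case-insensitive keyword search that returns every non-overlapping hit
// together with up to `context` characters of text on either side.
// Note: lowercasing can change string length for a few Unicode characters,
// so this sketch is only reliable for text where it does not.
function searchWithContext(text: string, keyword: string, context = 20): SnippetMatch[] {
  const matches: SnippetMatch[] = [];
  if (keyword.length === 0) return matches;

  const haystack = text.toLowerCase();
  const needle = keyword.toLowerCase();

  let from = 0;
  while (true) {
    const index = haystack.indexOf(needle, from);
    if (index === -1) break;
    const start = Math.max(0, index - context);
    const end = Math.min(text.length, index + keyword.length + context);
    matches.push({ index, snippet: text.slice(start, end) });
    from = index + needle.length; // continue searching after this match
  }
  return matches;
}

const sampleText = "The budget was approved. Next year's Budget is still pending.";
console.log(searchWithContext(sampleText, "budget", 10));
```

In practice the input string would be the text extracted from the .docx file, for example `result.text` from the example above.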

📊 Excel Spreadsheet Support

Full support for Excel (.xlsx/.xls) spreadsheets:

  • Multi-Sheet Extraction - Extract data from all sheets or specific ones
  • Multiple Formats - JSON, CSV, or plain text output
  • Cell Search - Find values across all sheets with cell coordinates
  • Range Extraction - Extract specific cell ranges (e.g., A1:C10)
  • Workbook Metadata - Sheet names, counts, and properties
  • MCP Tools - 2 new tools for AI assistants (extract_excel, search_excel)

Example:

import { ExcelParser } from 'parseflow-core';

const parser = new ExcelParser();
const data = await parser.extractData('spreadsheet.xlsx', {
  sheetName: 'Sales',
  format: 'json'
});
console.log(data);
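
Range extraction takes A1-style ranges such as A1:C10. As a rough sketch of what resolving such a range involves (this is not the code ParseFlow or SheetJS uses, and `CellRange`, `columnToIndex`, `parseRange`, and `sliceRange` are hypothetical names), the range can be turned into zero-based row and column bounds and applied to a 2D array of cell values:

```typescript
// Zero-based, inclusive bounds of a rectangular cell range.
interface CellRange {
  startRow: number;
  startCol: number;
  endRow: number;
  endCol: number;
}

// Convert column letters to a zero-based index: A -> 0, Z -> 25, AA -> 26.
function columnToIndex(letters: string): number {
  const upper = letters.toUpperCase();
  let n = 0;
  for (let i = 0; i < upper.length; i++) {
    n = n * 26 + (upper.charCodeAt(i) - 64); // 'A' has char code 65
  }
  return n - 1;
}

// Parse an A1-style range such as "A1:C10" into numeric bounds.
function parseRange(range: string): CellRange {
  const m = /^([A-Za-z]+)([1-9]\d*):([A-Za-z]+)([1-9]\d*)$/.exec(range.trim());
  if (!m) throw new Error(`Invalid range: ${range}`);
  const colA = columnToIndex(m[1]);
  const rowA = Number(m[2]) - 1;
  const colB = columnToIndex(m[3]);
  const rowB = Number(m[4]) - 1;
  // Normalize so the start corner is always top-left.
  return {
    startRow: Math.min(rowA, rowB),
    startCol: Math.min(colA, colB),
    endRow: Math.max(rowA, rowB),
    endCol: Math.max(colA, colB),
  };
}

// Cut the requested rectangle out of a row-major grid of cell values.
function sliceRange<T>(rows: T[][], range: string): T[][] {
  const { startRow, startCol, endRow, endCol } = parseRange(range);
  return rows
    .slice(startRow, endRow + 1)
    .map((row) => row.slice(startCol, endCol + 1));
}

const grid = [
  [1, 2, 3],
  [4, 5, 6],
  [7, 8, 9],
];
console.log(sliceRange(grid, "B2:C3")); // rows 2-3, columns B-C
```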

🛠️ MCP Server Updates

The MCP server now includes 9 tools (up from 5):

PDF Tools (Existing - 5 tools)

  • extract_text - Extract text from PDF
  • search_pdf - Search in PDF
  • get_metadata - Get PDF metadata
  • extract_images - Extract images
  • get_toc - Get table of contents

Word Tools (New - 2 tools)

  • extract_word - Extract text/HTML from Word documents
  • search_word - Search in Word documents

Excel Tools (New - 2 tools)

  • extract_excel - Extract data from Excel spreadsheets
  • search_excel - Search in Excel cells

Usage in Claude Desktop:

"Please read the contents of report.docx"
→ Uses extract_word tool

"Find 'Product A' in sales.xlsx"
→ Uses search_excel tool
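
To make the relationship between file types and tools concrete, here is a small illustrative lookup from file extension to the tool names listed above. In real use the AI assistant chooses the tool from the tool descriptions the MCP server advertises; `pickTool` and `TOOLS_BY_EXTENSION` are hypothetical helpers and not part of ParseFlow.

```typescript
type Action = "extract" | "search";

// Tool names are taken from the lists above; the table itself is illustrative.
const TOOLS_BY_EXTENSION: Record<string, Record<Action, string>> = {
  ".pdf": { extract: "extract_text", search: "search_pdf" },
  ".docx": { extract: "extract_word", search: "search_word" },
  ".xlsx": { extract: "extract_excel", search: "search_excel" },
  ".xls": { extract: "extract_excel", search: "search_excel" },
};

// Return the tool that would handle `action` for this file,
// or undefined when the extension is not a supported format.
function pickTool(fileName: string, action: Action): string | undefined {
  const dot = fileName.lastIndexOf(".");
  if (dot === -1) return undefined;
  const ext = fileName.slice(dot).toLowerCase();
  const tools = TOOLS_BY_EXTENSION[ext];
  return tools ? tools[action] : undefined;
}

console.log(pickTool("report.docx", "extract"));
console.log(pickTool("sales.xlsx", "search"));
```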

📦 Package Updates

parseflow-core v1.1.0

  • New: WordParser class for Word document parsing
  • New: ExcelParser class for Excel spreadsheet parsing
  • Dependencies: Added mammoth@^1.11.0 and xlsx@^0.18.5
  • Updated: Package description now mentions all supported formats

parseflow-mcp-server v1.1.0

  • New: 4 additional MCP tools (2 Word + 2 Excel)
  • Updated: Server description updated to mention Office documents
  • Total: 9 tools serving AI assistants

📚 Documentation

New Documentation

  • OFFICE_EXAMPLES.md - Comprehensive guide with examples
    • Word parsing methods (4 approaches)
    • Excel parsing methods (8 approaches)
    • 5 real-world use cases
    • Performance tips and troubleshooting

Updated Documentation

  • README.md - Completely rewritten

    • Feature overview for all formats
    • Quick start guides
    • MCP server configuration
    • Project structure
  • CHANGELOG.md - v1.1.0 entry added

    • Detailed feature list
    • Breaking changes (none!)
    • Upgrade guide

🧪 Testing

All new features are thoroughly tested:

  • Word Parser: 4/4 tests passing

    • Text extraction
    • Metadata retrieval
    • Text search
    • HTML conversion
  • Excel Parser: 8/8 tests passing

    • Multi-sheet extraction
    • Format conversion (JSON/CSV/Text)
    • Cell search
    • Metadata retrieval

Test Files Included:

  • Word测试文件.docx (6 MB)
  • Excel测试文件.xlsx (19 KB)

🚀 Installation

npm

# Core library
npm install parseflow-core@1.1.0

# MCP Server (global)
npm install -g parseflow-mcp-server@1.1.0

pnpm

pnpm add parseflow-core@1.1.0
pnpm add -g parseflow-mcp-server@1.1.0

📊 Supported Formats

Format | Extension | Read | Search | Metadata | Tools
PDF | .pdf | ✅ | ✅ | ✅ | 5
Word | .docx | ✅ | ✅ | ✅ | 2
Excel | .xlsx/.xls | ✅ | ✅ | ✅ | 2

🔧 Dependencies

New Dependencies

  • mammoth@^1.11.0 - Word document parsing
  • xlsx@^0.18.5 - Excel spreadsheet parsing

Existing Dependencies

  • pdf-parse@^1.1.1 - PDF parsing
  • pdf-lib@^1.17.1 - PDF manipulation
  • @modelcontextprotocol/sdk@^1.0.4 - MCP SDK

🐛 Bug Fixes

  • Fixed Excel metadata extraction reliability
  • Added null checks for sheet names in Excel parser
  • Improved error handling for malformed Office files
  • Better error messages for unsupported file types

🧹 Cleanup

  • Removed 8 redundant documentation files (~35 KB)
  • Simplified PROJECT_STATUS.md
  • Improved project organization
  • Updated .gitignore for test files

🔄 Upgrade Guide

From v1.0.x

No breaking changes! Simply update:

npm install parseflow-core@latest
npm install -g parseflow-mcp-server@latest

New Features

Import the new parsers:

import { WordParser, ExcelParser } from 'parseflow-core';

For MCP users, the new tools are automatically available after updating.


📖 Examples

Extract Text from Word Document

import { WordParser } from 'parseflow-core';

const parser = new WordParser();
const result = await parser.extractText('document.docx');
console.log(result.text);

Extract Data from Excel

import { ExcelParser } from 'parseflow-core';

const parser = new ExcelParser();
const sheets = await parser.extractData('data.xlsx');

sheets.forEach(sheet => {
  console.log(`${sheet.sheetName}: ${sheet.rowCount} rows`);
});
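
ParseFlow can already emit CSV directly through the `format` option. For readers who post-process extracted rows themselves, the sketch below shows what CSV serialization has to handle (quoting fields that contain commas, quotes, or line breaks, in the style of RFC 4180). The `Cell` type and both function names are assumptions for illustration, not ParseFlow's API.

```typescript
// Hypothetical cell type for illustration; real extracted values may differ.
type Cell = string | number | boolean | null;

// Quote a field only when it contains a comma, a double quote, or a line
// break; embedded double quotes are doubled.
function escapeCsvField(value: Cell): string {
  if (value === null) return "";
  const s = String(value);
  return /[",\r\n]/.test(s) ? `"${s.replace(/"/g, '""')}"` : s;
}

// Join fields with commas and rows with CRLF.
function rowsToCsv(rows: Cell[][]): string {
  return rows
    .map((row) => row.map((cell) => escapeCsvField(cell)).join(","))
    .join("\r\n");
}

const rows: Cell[][] = [
  ["Product", "Note"],
  ["A, large", 'said "hi"'],
  [1, null],
];
console.log(rowsToCsv(rows));
```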

Search Across Documents

import { WordParser, ExcelParser } from 'parseflow-core';

const wordParser = new WordParser();
const excelParser = new ExcelParser();

// Search in Word
const wordMatches = await wordParser.searchText('report.docx', 'budget');

// Search in Excel
const excelMatches = await excelParser.searchText('data.xlsx', 'revenue');

More examples in OFFICE_EXAMPLES.md!



🙏 Acknowledgments

Special thanks to:

  • mammoth - For excellent Word document parsing
  • xlsx (SheetJS) - For comprehensive Excel support
  • MCP Community - For feedback and support

📝 Full Changelog

See CHANGELOG.md for complete details.


🎯 What's Next?

Looking ahead to v1.2.0:

  • PowerPoint (pptx) support
  • Encrypted document support
  • OCR text recognition
  • Performance optimizations

💬 Feedback

We'd love to hear from you!


Made with ❤️ by Libres-coder

Enjoy ParseFlow v1.1.0! 🎉

ParseFlow v1.0.2 - Documentation & Cleanup

02 Dec 17:58

🎉 ParseFlow v1.0.2 - Documentation & Cleanup

✨ What's New

  • Added complete README for mcp-server package
  • Major project cleanup (removed 60+ temporary files, 19 MB)
  • Improved project structure and organization

📦 Packages

  • parseflow-core@1.0.1 - Core PDF parsing library
  • parseflow-mcp-server@1.0.2 - MCP server (now with README!)

📖 Documentation Improvements

  • Complete installation guide
  • Usage examples for Claude Desktop, Windsurf, Cursor
  • API reference for all 5 tools

ParseFlow v1.0.1 - MCP Registry Launch

02 Dec 17:20

chore: clean up project structure

- Moved temporary documents to docs/archive/
- Moved test scripts to scripts/manual-tests/
- Organized project root directory for better maintainability

ParseFlow v1.0.0 - Production Ready

28 Nov 08:10

ParseFlow v1.0.0 - Production Ready 🎉

Release Date: 2025-11-28
Version: v1.0.0
Status: Production ready


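🎉 Major Release - All Core Features Complete

ParseFlow v1.0.0 is a feature-complete, production-ready PDF parsing MCP server that gives AI coding assistants (Windsurf and Cursor) powerful PDF processing capabilities.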


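✨ New Features

🖼️ Image Extraction

  • Extracts images from PDFs using poppler-utils (pdfimages)
  • Supports PNG and JPG formats
  • Configurable output directory and format
  • Supports size filtering options
  • Cross-platform support (Windows/Linux/macOS)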
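
As a rough sketch of how a wrapper might drive pdfimages, the helper below builds an argument list from a few options. The `-png`, `-j`, `-f`, and `-l` flags are standard pdfimages options; the `ImageExtractOptions` type and `buildPdfimagesArgs` function are hypothetical and do not describe how ImageExtractorExternal is actually written.

```typescript
// Hypothetical options type for illustration only.
interface ImageExtractOptions {
  format: "png" | "jpg";
  firstPage?: number; // first page to scan (1-based)
  lastPage?: number;  // last page to scan (1-based)
}

// Build the argument list for: pdfimages [options] <PDF-file> <image-root>
function buildPdfimagesArgs(
  pdfPath: string,
  outputPrefix: string,
  options: ImageExtractOptions
): string[] {
  const args: string[] = [];
  // -png writes images as PNG files; -j writes JPEG-encoded (DCT) images
  // as JPEG files, while other images keep pdfimages' default format.
  args.push(options.format === "png" ? "-png" : "-j");
  if (options.firstPage !== undefined) args.push("-f", String(options.firstPage));
  if (options.lastPage !== undefined) args.push("-l", String(options.lastPage));
  args.push(pdfPath, outputPrefix);
  return args;
}

console.log(buildPdfimagesArgs("report.pdf", "out/img", { format: "png", firstPage: 1, lastPage: 3 }));
```

The resulting array could then be passed to Node's `child_process.execFile("pdfimages", args, callback)`, which avoids shell quoting issues on both Windows and Unix-like systems.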

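📑 Table of Contents Extraction

  • Extracts PDF bookmarks and outline structure
  • Supports pdftk (full functionality) and pdfinfo (basic functionality)
  • Hierarchical table of contents structure
  • Automatic page number parsing
  • External tool integration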
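
To illustrate how a flat bookmark listing becomes a hierarchical table of contents, the sketch below parses text in the format that recent pdftk versions print for `pdftk file.pdf dump_data` (BookmarkBegin, BookmarkTitle, BookmarkLevel, BookmarkPageNumber) and nests entries by level. It is an assumption-laden illustration, not the TOCExtractorExternal implementation; `TocEntry` and `parseBookmarks` are hypothetical names.

```typescript
// Hypothetical TOC node shape for illustration.
interface TocEntry {
  title: string;
  level: number; // 1 = top level
  page: number;
  children: TocEntry[];
}

function parseBookmarks(dump: string): TocEntry[] {
  // Pass 1: collect bookmarks in document order.
  const flat: TocEntry[] = [];
  for (const line of dump.split(/\r?\n/)) {
    if (line.trim() === "BookmarkBegin") {
      flat.push({ title: "", level: 1, page: 0, children: [] });
      continue;
    }
    const current = flat[flat.length - 1];
    const sep = line.indexOf(": ");
    if (!current || sep === -1) continue; // not inside a bookmark record
    const key = line.slice(0, sep);
    const value = line.slice(sep + 2);
    if (key === "BookmarkTitle") current.title = value;
    else if (key === "BookmarkLevel") current.level = Number(value);
    else if (key === "BookmarkPageNumber") current.page = Number(value);
  }

  // Pass 2: nest entries using a stack of currently open ancestors.
  const roots: TocEntry[] = [];
  const stack: TocEntry[] = [];
  for (const entry of flat) {
    // Close every open entry that is at the same or a deeper level.
    while (stack.length > 0 && stack[stack.length - 1].level >= entry.level) {
      stack.pop();
    }
    const parent = stack[stack.length - 1];
    if (parent) parent.children.push(entry);
    else roots.push(entry);
    stack.push(entry);
  }
  return roots;
}

const sampleDump = [
  "BookmarkBegin",
  "BookmarkTitle: Chapter 1",
  "BookmarkLevel: 1",
  "BookmarkPageNumber: 1",
  "BookmarkBegin",
  "BookmarkTitle: Section 1.1",
  "BookmarkLevel: 2",
  "BookmarkPageNumber: 3",
  "BookmarkBegin",
  "BookmarkTitle: Chapter 2",
  "BookmarkLevel: 1",
  "BookmarkPageNumber: 10",
].join("\n");

console.log(JSON.stringify(parseBookmarks(sampleDump), null, 2));
```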

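🔧 External Tool Integration

  • ImageExtractorExternal - Extracts images via pdfimages
  • TOCExtractorExternal - Extracts the table of contents via pdftk/pdfinfo
  • Automatic tool detection
  • Cross-platform support (Windows PowerShell, Linux, macOS)
  • Supports custom tool path configuration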

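📊 Complete Feature List

Core Features (100%)

Feature | Status | Implementation
📄 Text extraction | ✅ | pdf-parse
📊 Metadata extraction | ✅ | pdf-parse
🔍 Keyword search | ✅ | In-house search engine
🖼️ Image extraction | ✅ | poppler-utils
📑 Table of contents extraction | ✅ | pdftk/pdfinfo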

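🔧 Improvements

Windows Compatibility

  • ✅ PowerShell command execution support
  • ✅ Fixed environment variable inheritance issue
  • ✅ Custom tool path configuration

Testing

  • ✅ Validated against real PDF files
  • ✅ 52 unit tests (100% passing)
  • ✅ External tool integration tests
  • ✅ 83.6% code coverage

Documentation

  • ✅ External tool installation guide
  • ✅ Installation instructions for Windows/Linux/macOS
  • ✅ Complete API documentation
  • ✅ Usage examples and best practices
  • ✅ Chinese and English documentation kept in sync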

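📦 Technical Details

Dependency Updates

  • Added: pdf-lib@1.17.1 - PDF manipulation library
  • Removed: pdfjs-dist - due to Node.js compatibility issues

Architecture

  • Monorepo: pdf-parser-core + mcp-server
  • Code quality: 0 ESLint errors, TypeScript strict mode
  • Platform support: Full Windows/Linux/macOS support

Quality Metrics

✅ Build: Successful
✅ Tests: 52/52 passing (100%)
✅ Coverage: 83.6%
✅ Lint: 0 errors
✅ TypeScript: Strict mode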

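🐛 Bug Fixes

  • Fixed environment variable inheritance for Node.js processes on Windows
  • Fixed a null-byte issue in .gitignore
  • Resolved Jest/pdfjs-dist ESM compatibility issues
  • Removed build artifacts that were committed by mistake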

🚀 Quick Start

Installation

# Clone the repository
git clone https://github.com/Libres-coder/ParseFlow.git
cd ParseFlow

# Install dependencies
pnpm install

# Build the project
pnpm build

Configure Windsurf

Edit C:\Users\<username>\.codeium\windsurf\mcp_config.json:

{
  "mcpServers": {
    "parseflow": {
      "command": "node",
      "args": ["<project root>\\packages\\mcp-server\\dist\\index.js"],
      "env": {
        "PARSEFLOW_CACHE_DIR": "<project root>\\.cache",
        "PARSEFLOW_MAX_FILE_SIZE": "52428800",
        "PARSEFLOW_ALLOWED_PATHS": "D:\\;C:\\Users"
      }
    }
  }
}
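
The configuration above sets PARSEFLOW_MAX_FILE_SIZE to 52428800 bytes (50 MiB) and PARSEFLOW_ALLOWED_PATHS to a semicolon-separated list of directories. The sketch below shows one plausible way a server could read and apply such values. How ParseFlow actually interprets them is not documented in these notes, so `readLimits`, `isPathAllowed`, and the matching rules are assumptions.

```typescript
// Hypothetical parsed form of the two environment variables.
interface ServerLimits {
  maxFileSizeBytes: number;
  allowedPaths: string[];
}

// Read and validate the settings from an environment-like object.
function readLimits(env: Record<string, string | undefined>): ServerLimits {
  const maxFileSizeBytes = Number(env["PARSEFLOW_MAX_FILE_SIZE"]);
  if (!Number.isFinite(maxFileSizeBytes) || maxFileSizeBytes <= 0) {
    throw new Error("PARSEFLOW_MAX_FILE_SIZE must be a positive number of bytes");
  }
  const allowedPaths = (env["PARSEFLOW_ALLOWED_PATHS"] ?? "")
    .split(";")
    .map((p) => p.trim())
    .filter((p) => p.length > 0);
  return { maxFileSizeBytes, allowedPaths };
}

// Normalize a Windows-style path for comparison: backslash separators,
// no trailing separator, lowercase. This does not resolve ".." segments.
function normalizeWindowsPath(p: string): string {
  return p.replace(/\//g, "\\").replace(/\\+$/, "").toLowerCase();
}

// A file is allowed when it equals, or lies underneath, an allowed directory.
function isPathAllowed(filePath: string, allowedPaths: string[]): boolean {
  const target = normalizeWindowsPath(filePath);
  return allowedPaths.some((root) => {
    const base = normalizeWindowsPath(root);
    return target === base || target.slice(0, base.length + 1) === base + "\\";
  });
}

const limits = readLimits({
  PARSEFLOW_MAX_FILE_SIZE: "52428800",
  PARSEFLOW_ALLOWED_PATHS: "D:\\;C:\\Users",
});
console.log(limits.maxFileSizeBytes === 50 * 1024 * 1024);
console.log(isPathAllowed("D:\\report.pdf", limits.allowedPaths));
```

Comparing against the directory plus a trailing separator, rather than a bare string prefix, keeps a path such as C:\UsersOther from matching the allowed C:\Users directory.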

Usage

In Windsurf, simply say:

Analyze D:\report.pdf
How many pages does this PDF have?
Search the contract for "liability for breach of contract"

📚 Documentation

User Guide

Developer Documentation

Configuration Guide


💻 System Requirements

  • Node.js: >= 18.0.0
  • pnpm: >= 8.0.0
  • Operating system: Windows 10/11, Ubuntu 20.04+, macOS 11+

Optional Tools (for image and table of contents extraction)

Windows:

Download Poppler: https://github.com/oschwartz10612/poppler-windows/releases
Add it to the system PATH

Ubuntu/Debian:

sudo apt-get install poppler-utils pdftk

macOS:

brew install poppler pdftk-java

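🤝 Contributing

Contributions are welcome! Please see CONTRIBUTING.md

Contribution Workflow

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request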

📝 Breaking Changes

None - all changes are backward compatible


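🔄 Upgrading from Earlier Versions

If you are using an earlier version:

  1. Update the code

    git pull origin main
    pnpm install
    pnpm build
  2. Restart your IDE (Windsurf/Cursor)

  3. Optional: install the external tools to use the new features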


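🙏 Acknowledgments


📄 License

This project is licensed under the MIT License - see the LICENSE file for details


📮 Links


Made with ❤️ by ParseFlow Team

If this project helps you, please give it a ⭐ Star!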