
Libres-coder/ParseFlow

📄 ParseFlow

Universal document parsing library for PDF, Word, and Excel files


ParseFlow is a comprehensive document parsing solution that supports PDF, Word (docx), and Excel (xlsx/xls) files. It provides both a standalone library and an MCP (Model Context Protocol) server for AI assistants.

Chinese documentation | Examples | GitHub


✨ Features

📄 PDF Support

  • ✅ Text extraction with multiple strategies (raw, formatted, clean)
  • ✅ Page-specific and range-based extraction
  • ✅ Metadata retrieval (title, author, dates, page count)
  • ✅ Full-text search with context
  • ✅ Image extraction (placeholder)
  • ✅ Table of contents (TOC) extraction (placeholder)

๐Ÿ“ Word (docx) Support

  • โœ… Text extraction
  • โœ… HTML conversion
  • โœ… Metadata retrieval
  • โœ… Text search with context

📊 Excel (xlsx/xls) Support

  • ✅ Multi-sheet data extraction
  • ✅ Multiple output formats (JSON, CSV, Text)
  • ✅ Sheet-specific extraction
  • ✅ Cell-based search
  • ✅ Range extraction
  • ✅ Workbook metadata

🤖 MCP Server

  • ✅ 9 tools for AI assistants (5 PDF + 2 Word + 2 Excel)
  • ✅ Works with Claude Desktop and other MCP clients
  • ✅ Path security with allowlist support

📦 Installation

Core Library

npm install parseflow-core

MCP Server (Global)

npm install -g parseflow-mcp-server

Or use with npx:

npx parseflow-mcp-server

🚀 Quick Start

PDF Parsing

import { PDFParser } from 'parseflow-core';

const parser = new PDFParser();

// Extract all text
const text = await parser.extractText('document.pdf');

// Extract specific page
const page5 = await parser.extractPage('document.pdf', 5);

// Search
const results = await parser.search('document.pdf', 'keyword');

// Get metadata
const metadata = await parser.getMetadata('document.pdf');
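
To illustrate what "search with context" means in practice, here is a minimal, self-contained sketch that runs over text that has already been extracted. It does not call ParseFlow; the `SearchHit` shape, the `searchWithContext` name, and the `radius` parameter are assumptions made for illustration, and the library's actual result type may differ.

```typescript
// Hypothetical shape for a search hit; ParseFlow's real result type may differ.
interface SearchHit {
  index: number;   // character offset of the match in the text
  context: string; // the match plus surrounding characters
}

// Case-insensitive keyword search that returns each match together with
// up to `radius` characters of context on either side.
// Assumes lowercasing preserves string length (true for ASCII text).
function searchWithContext(text: string, keyword: string, radius = 20): SearchHit[] {
  const hits: SearchHit[] = [];
  if (keyword.length === 0) return hits;

  const haystack = text.toLowerCase();
  const needle = keyword.toLowerCase();

  let from = 0;
  while (true) {
    const index = haystack.indexOf(needle, from);
    if (index === -1) break;
    const start = Math.max(0, index - radius);
    const end = Math.min(text.length, index + needle.length + radius);
    hits.push({ index, context: text.slice(start, end) });
    from = index + needle.length; // continue after this match (non-overlapping)
  }
  return hits;
}

const sample = "The budget grew. Budget cuts followed.";
for (const hit of searchWithContext(sample, "budget", 4)) {
  console.log(hit.index, JSON.stringify(hit.context));
}
```

A larger `radius` gives an AI assistant more surrounding text per hit at the cost of longer responses.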

Word Parsing

import { WordParser } from 'parseflow-core';

const parser = new WordParser();

// Extract text
const result = await parser.extractText('report.docx');
console.log(result.text);

// Convert to HTML
const html = await parser.extractHTML('report.docx');

// Search
const matches = await parser.searchText('report.docx', 'budget');

Excel Parsing

import { ExcelParser } from 'parseflow-core';

const parser = new ExcelParser();

// Extract all sheets (JSON format)
const data = await parser.extractData('spreadsheet.xlsx');

// Extract specific sheet
const sales = await parser.extractData('data.xlsx', {
  sheetName: 'Q4 Sales',
  format: 'json'
});

// Search in cells
const results = await parser.searchText('data.xlsx', 'revenue');
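
The feature list mentions JSON and CSV output formats. As a rough sketch of how JSON-style rows relate to CSV text, the snippet below converts an array of row objects into CSV with RFC 4180-style quoting. It is independent of ParseFlow: the `Cell` type and the `rowsToCsv` and `escapeCsvField` helpers are hypothetical names, and the shape of the rows returned by `extractData` is assumed rather than confirmed.

```typescript
// Assumed cell value types; the library's real row type may be richer.
type Cell = string | number | boolean | null;

// Quote a field only when it contains a comma, a double quote, or a line break.
function escapeCsvField(value: Cell): string {
  const s = value === null ? "" : String(value);
  return /[",\r\n]/.test(s) ? `"${s.replace(/"/g, '""')}"` : s;
}

// Convert row objects to CSV text, using the keys of the first row as the header.
function rowsToCsv(rows: Record<string, Cell>[]): string {
  const first = rows[0];
  if (first === undefined) return "";

  const headers = Object.keys(first);
  const lines = [headers.map(escapeCsvField).join(",")];
  for (const row of rows) {
    lines.push(headers.map((h) => escapeCsvField(row[h] ?? null)).join(","));
  }
  return lines.join("\n");
}

console.log(
  rowsToCsv([
    { product: "Widget, large", revenue: 1200 },
    { product: '12" pipe', revenue: null },
  ]),
);
```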

๐Ÿ› ๏ธ MCP Server Usage

Configuration for Claude Desktop

Add to claude_desktop_config.json:

{
  "mcpServers": {
    "parseflow": {
      "command": "npx",
      "args": ["-y", "parseflow-mcp-server"],
      "env": {
        "PARSEFLOW_ALLOWED_PATHS": "C:\\Documents;D:\\Projects"
      }
    }
  }
}
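
The `PARSEFLOW_ALLOWED_PATHS` value above is a semicolon-separated list of directories. How the server enforces it is not documented here, so the following is only a sketch of one way such an allowlist check could work, assuming absolute Windows-style paths and case-insensitive comparison. The function names are hypothetical, and a production check would also need to resolve symlinks through the filesystem.

```typescript
// Normalize a path for comparison: unify separators, resolve "." and ".."
// segments, and lowercase (assumes case-insensitive, Windows-style paths).
function normalizePath(p: string): string {
  const parts: string[] = [];
  for (const seg of p.replace(/\\/g, "/").split("/")) {
    if (seg === "" || seg === ".") continue;
    if (seg === "..") parts.pop();
    else parts.push(seg.toLowerCase());
  }
  return parts.join("/");
}

// Split the environment value on ";" and normalize each non-empty entry.
function parseAllowedPaths(envValue: string | undefined): string[] {
  return (envValue ?? "")
    .split(";")
    .map((entry) => entry.trim())
    .filter((entry) => entry.length > 0)
    .map(normalizePath);
}

// A target is allowed if it equals an allowed root or lies beneath one.
// Appending "/" before the prefix test stops "C:\DocumentsEvil" from
// matching the root "C:\Documents".
function isPathAllowed(target: string, allowed: string[]): boolean {
  const normalized = normalizePath(target);
  return allowed.some((root) => normalized === root || normalized.startsWith(root + "/"));
}

const allowed = parseAllowedPaths("C:\\Documents;D:\\Projects");
console.log(isPathAllowed("C:\\Documents\\reports\\q4.pdf", allowed));
console.log(isPathAllowed("C:\\Documents\\..\\Windows\\system.ini", allowed));
```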

Available Tools

PDF Tools

  • extract_text - Extract text from PDF files
  • search_pdf - Search for keywords in PDF
  • get_metadata - Get PDF metadata
  • extract_images - Extract images from PDF
  • get_toc - Get table of contents

Word Tools

  • extract_word - Extract text/HTML from Word documents
  • search_word - Search in Word documents

Excel Tools

  • extract_excel - Extract data from Excel spreadsheets
  • search_excel - Search in Excel cells
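
MCP clients invoke tools such as these with a JSON-RPC 2.0 `tools/call` request. The sketch below builds such a request for `search_excel`. The envelope follows the MCP specification, but the argument names `path` and `query` are assumptions: the actual input schema of each ParseFlow tool is not shown in this README and should be read from the server's tool listing.

```typescript
// JSON-RPC 2.0 envelope for an MCP tool invocation.
interface ToolCallRequest {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: string; arguments: Record<string, unknown> };
}

let nextId = 1;

// Build a tools/call request; each request gets a fresh numeric id.
function buildToolCall(name: string, args: Record<string, unknown>): ToolCallRequest {
  return {
    jsonrpc: "2.0",
    id: nextId++,
    method: "tools/call",
    params: { name, arguments: args },
  };
}

// `path` and `query` are hypothetical argument names used for illustration.
const request = buildToolCall("search_excel", { path: "sales.xlsx", query: "revenue" });
console.log(JSON.stringify(request, null, 2));
```

In normal use, Claude Desktop or another MCP client constructs this request itself; the sketch only shows what travels over the wire.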

Example Usage in Claude

"Please read the contents of the report.docx file"
→ Uses extract_word tool

"Find 'Product A' in sales.xlsx"
→ Uses search_excel tool

"Extract the metadata of document.pdf"
→ Uses get_metadata tool

📚 Documentation


๐Ÿ—๏ธ Project Structure

ParseFlow/
├── packages/
│   ├── pdf-parser-core/      # Core library (parseflow-core)
│   │   ├── src/
│   │   │   ├── parser.ts     # PDF parser
│   │   │   ├── WordParser.ts # Word parser
│   │   │   └── ExcelParser.ts # Excel parser
│   │   └── package.json
│   └── mcp-server/           # MCP server (parseflow-mcp-server)
│       ├── src/
│       │   ├── index.ts      # Server entry
│       │   └── tools/        # MCP tools
│       └── package.json
├── docs/                     # Documentation
├── examples/                 # Usage examples
├── tests/                    # Test files
└── scripts/                  # Build scripts

🧪 Testing

# Run all tests
pnpm test

# Test coverage
pnpm test:coverage

# Run specific test
pnpm test parser.test.ts

Test Files

  • Wordๆต‹่ฏ•ๆ–‡ไปถ.docx - Word test document
  • Excelๆต‹่ฏ•ๆ–‡ไปถ.xlsx - Excel test workbook (3 sheets)
  • PDFๆต‹่ฏ•ๆ–‡ๆกฃ.pdf - PDF test document

🔧 Development

# Install dependencies
pnpm install

# Build all packages
pnpm build

# Watch mode
pnpm dev

# Lint
pnpm lint

# Type check
pnpm type-check

📈 Roadmap

v1.1.0 (Current)

  • ✅ Word (docx) support
  • ✅ Excel (xlsx/xls) support
  • ✅ 9 MCP tools

v1.2.0 (Planned)

  • Encrypted PDF support
  • OCR text recognition
  • PowerPoint (pptx) support
  • Batch processing optimization

v2.0.0 (Future)

  • Plugin system
  • More document formats (CSV, TXT, RTF)
  • Advanced table extraction
  • Document conversion

๐Ÿค Contributing

We welcome contributions! Please see CONTRIBUTING.md for details.

Ways to Contribute

  • 🐛 Report bugs
  • 💡 Suggest features
  • 📝 Improve documentation
  • 🔧 Submit pull requests

📦 Packages

Package               Version  Description
parseflow-core        1.0.1    Core parsing library
parseflow-mcp-server  1.0.2    MCP server for AI

🔗 Links


📄 License

MIT License - see LICENSE file for details.


๐Ÿ™ Acknowledgments

  • pdf-parse - PDF parsing
  • pdf-lib - PDF manipulation
  • mammoth - Word document parsing
  • xlsx - Excel spreadsheet parsing
  • MCP SDK - Model Context Protocol

📊 Stats

  • Test Coverage: 83%+
  • Supported Formats: 3 (PDF, Word, Excel)
  • MCP Tools: 9
  • Dependencies: Minimal and well-maintained

💬 Community


Made with ❤️ by Libres-coder

Status: 🎉 Production Ready (v1.1.0)
