A modular, extensible system for detecting, parsing, and standardizing various file types into a unified JSON structure.
- Multi-format Support: PDF, Markdown, Word, Spreadsheets, Text, Images, HTML, ZIP archives
- Advanced Processing: Google Document AI, AWS Textract, OCR capabilities
- Standardized Output: Consistent JSON structure across all file types
- Recursive Processing: Handle ZIP archives and directory structures
- Configurable: Flexible backend selection and processing options
| File Type | Extensions | Primary Tool | Fallback Tool |
|---|---|---|---|
| PDF | .pdf | Google Document AI | pdfplumber |
| Markdown | .md, .markdown | Google Document AI | mistune |
| Word | .docx, .doc | Google Document AI | python-docx |
| Spreadsheets | .csv, .xlsx, .xls | pandas | openpyxl |
| Text | .txt | Native | - |
| Images | .png, .jpg, .tiff | Google Vision API | pytesseract |
| HTML | .html, .htm | BeautifulSoup | - |
| ZIP | .zip, .tar.gz | zipfile | - |
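The primary/fallback split follows a simple rule: each parser uses its primary tool when that backend is configured and reachable, and otherwise drops to the local fallback. The snippet below is only a hypothetical illustration of that rule; the real selection logic lives in `detector.py` and the individual parsers.

```python
# Hypothetical illustration of the fallback rule in the table above;
# the pipeline's actual logic lives in detector.py and the parser classes.
TOOL_MAP = {
    ".pdf":  ("Google Document AI", "pdfplumber"),
    ".docx": ("Google Document AI", "python-docx"),
    ".csv":  ("pandas", "openpyxl"),
    ".png":  ("Google Vision API", "pytesseract"),
    ".txt":  ("Native", None),
}

def pick_tool(extension, primary_available):
    """Return the primary tool when its backend is usable, else the fallback."""
    primary, fallback = TOOL_MAP.get(extension.lower(), (None, None))
    return primary if primary_available else fallback
```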
# Install dependencies
pip install -r requirements.txt
# For Google Cloud services
export GOOGLE_APPLICATION_CREDENTIALS="path/to/credentials.json"
# For AWS services
aws configure
# Process a single file
python -m pipeline.main --input document.pdf --output result.json
# Process a directory
python -m pipeline.main --input ./documents --output ./results --batch
# Use Google Document AI
python -m pipeline.main --input document.pdf --output result.json \
--gcp-project-id my-project --gcp-location us --gcp-processor-id my-processor
# Use AWS Textract
python -m pipeline.main --input document.pdf --output result.json \
--aws-region us-east-1 --aws-s3-bucket my-bucket
from pipeline import FileProcessor
# Initialize processor
config = {
    'gcp_project_id': 'my-project',
    'gcp_location': 'us',
    'gcp_processor_id': 'my-processor'
}
processor = FileProcessor(config)
# Process a single file
result = processor.process_file('document.pdf')
print(result.to_json())
# Process a directory
results = processor.process_directory('./documents')
for result in results:
    print(f"Processed {result.file_type}: {result.success}")
All processed files return a standardized JSON structure:
{
  "file_type": "PDF",
  "processing_tool": "Google Document AI",
  "parsed_content": {
    "text": "Extracted text content...",
    "tables": [
      {
        "page": 1,
        "data": [["Header1", "Header2"], ["Value1", "Value2"]]
      }
    ],
    "entities": [
      {
        "type": "PERSON",
        "mention_text": "John Doe",
        "confidence": 0.95
      }
    ]
  },
  "metadata": {
    "page_count": 2,
    "confidence": 0.98
  },
  "success": true,
  "error_message": null
}
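A downstream script can consume this structure directly. The sketch below is illustrative only: it assumes the CLI wrote its output to `result.json` and touches only the fields shown above.

```python
import json

# Read a result written by the pipeline (field names as shown above).
with open("result.json", encoding="utf-8") as f:
    result = json.load(f)

if result["success"]:
    print(f"{result['file_type']} parsed with {result['processing_tool']}")
    print(result["parsed_content"]["text"][:200])  # first 200 characters

    # Each table entry carries a page number and a list of rows.
    for table in result["parsed_content"].get("tables", []):
        print(f"page {table['page']}: {len(table['data'])} rows")

    # Entities carry a type, the matched text, and a confidence score.
    for entity in result["parsed_content"].get("entities", []):
        print(f"{entity['type']}: {entity['mention_text']} ({entity['confidence']:.2f})")
else:
    print(f"processing failed: {result['error_message']}")
```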
- Create a Google Cloud project
- Enable Document AI and Vision APIs
- Create a processor in Document AI
- Download service account credentials
- Set the environment variable:
export GOOGLE_APPLICATION_CREDENTIALS="path/to/credentials.json"
- Configure AWS credentials:
aws configure
- Create an S3 bucket for Textract processing
- Set appropriate IAM permissions for Textract access
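The Python API accepts the same settings as the CLI flags. The key names below (`aws_region`, `aws_s3_bucket`) are an assumption that mirrors `--aws-region` and `--aws-s3-bucket`; check the configuration handling in `core.py` for the exact names your version expects.

```python
from pipeline import FileProcessor

# Assumed config keys mirroring the --aws-region / --aws-s3-bucket CLI flags.
config = {
    'aws_region': 'us-east-1',
    'aws_s3_bucket': 'my-bucket'
}

processor = FileProcessor(config)
result = processor.process_file('document.pdf')  # uses Textract when available
print(result.to_json())
```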
pipeline/
├── __init__.py # Package initialization
├── core.py # Core classes and data structures
├── detector.py # File type detection
├── main.py # CLI entry point
├── parsers/ # File type parsers
│ ├── __init__.py
│ ├── base.py # Base parser class
│ ├── pdf.py # PDF parser
│ ├── markdown.py # Markdown parser
│ ├── word.py # Word parser
│ ├── spreadsheet.py # Spreadsheet parser
│ ├── text.py # Text parser
│ ├── image.py # Image parser
│ ├── html.py # HTML parser
│ └── zip.py # ZIP parser
├── requirements.txt # Dependencies
└── README.md # Documentation
To add a new parser (a sketch is shown after these lists):
- Create a new parser in `parsers/`
- Inherit from `BaseParser`
- Implement the `parse()` method
- Add file type detection in `detector.py`
- Register the parser in `core.py`

To add a new processing backend:
- Add the tool to the `ProcessingTool` enum
- Implement backend detection in the parser's `_setup_backends()`
- Add configuration options
- Implement the parsing logic
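A minimal parser might look like the sketch below. The exact `BaseParser` interface (constructor, hooks, and the shape of the returned content) is assumed here, so mirror one of the existing modules in `parsers/`, such as `text.py`, for the real signatures.

```python
# parsers/rtf.py -- hypothetical new parser; method names follow the steps above,
# but the real BaseParser signatures should be copied from an existing parser.
from .base import BaseParser


class RTFParser(BaseParser):
    """Parse .rtf files into the standard result structure."""

    def _setup_backends(self):
        # Detect optional libraries and choose a backend, falling back if absent.
        try:
            import striprtf  # noqa: F401
            self.backend = 'striprtf'
        except ImportError:
            self.backend = 'native'

    def parse(self, file_path):
        # Return content shaped like the parsed_content block of the output JSON.
        with open(file_path, 'r', encoding='utf-8', errors='ignore') as f:
            text = f.read()
        return {'text': text, 'tables': [], 'entities': []}
```

The new file type would still need an entry in `detector.py` and registration in `core.py`, as listed above.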
The pipeline includes comprehensive error handling (see the example after this list):
- File not found: Returns error result with message
- Unsupported format: Detected and reported
- Backend unavailable: Falls back to alternative tools
- Processing errors: Captured and included in results
- Network issues: Graceful degradation
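Because errors are captured on the result rather than raised, a batch run keeps going past bad files. A small sketch using the `FileProcessor` API from above; the `error_message` and `processing_tool` attributes are assumed to mirror the JSON fields.

```python
from pipeline import FileProcessor

processor = FileProcessor({})  # no cloud config: fallback tools are used

failed = []
for result in processor.process_directory('./documents'):
    if result.success:
        print(f"OK: {result.file_type} via {result.processing_tool}")
    else:
        failed.append(result)  # the error is captured, not raised
        print(f"FAILED: {result.error_message}")

print(f"{len(failed)} file(s) could not be processed")
```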
- Batch processing: Process multiple files efficiently
- Memory management: Large files handled with streaming
- Caching: Results can be cached for repeated processing
- Parallel processing: Multiple files can be processed concurrently
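One way to fan files out concurrently from the calling side is a small thread pool, as in the sketch below; this assumes `process_file` is safe to call from worker threads, which is worth verifying for the cloud backends you enable.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

from pipeline import FileProcessor

processor = FileProcessor({'gcp_project_id': 'my-project',
                           'gcp_location': 'us',
                           'gcp_processor_id': 'my-processor'})

files = [str(p) for p in Path('./documents').iterdir() if p.is_file()]

# Process the files on a few worker threads; each call returns a result object.
with ThreadPoolExecutor(max_workers=4) as pool:
    for result in pool.map(processor.process_file, files):
        print(f"{result.file_type}: {result.success}")
```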
MIT License - see LICENSE file for details.