Intelligent File Processing Pipeline

A modular, extensible system for detecting, parsing, and standardizing various file types into a unified JSON structure.

Features

  • Multi-format Support: PDF, Markdown, Word, Spreadsheets, Text, Images, HTML, ZIP archives
  • Advanced Processing: Google Document AI, AWS Textract, and OCR capabilities
  • Standardized Output: Consistent JSON structure across all file types
  • Recursive Processing: Handle ZIP archives and directory structures
  • Configurable: Flexible backend selection and processing options

Supported File Types

File Type    | Extensions        | Primary Tool       | Fallback Tool
------------ | ----------------- | ------------------ | -------------
PDF          | .pdf              | Google Document AI | pdfplumber
Markdown     | .md, .markdown    | Google Document AI | mistune
Word         | .docx, .doc       | Google Document AI | python-docx
Spreadsheets | .csv, .xlsx, .xls | pandas             | openpyxl
Text         | .txt              | Native             | -
Images       | .png, .jpg, .tiff | Google Vision API  | pytesseract
HTML         | .html, .htm       | BeautifulSoup      | -
ZIP          | .zip, .tar.gz     | zipfile            | -

Installation

# Install dependencies
pip install -r requirements.txt

# For Google Cloud services
export GOOGLE_APPLICATION_CREDENTIALS="path/to/credentials.json"

# For AWS services
aws configure

Usage

Command Line Interface

# Process a single file
python -m pipeline.main --input document.pdf --output result.json

# Process a directory
python -m pipeline.main --input ./documents --output ./results --batch

# Use Google Document AI
python -m pipeline.main --input document.pdf --output result.json \
  --gcp-project-id my-project --gcp-location us --gcp-processor-id my-processor

# Use AWS Textract
python -m pipeline.main --input document.pdf --output result.json \
  --aws-region us-east-1 --aws-s3-bucket my-bucket

Python API

from pipeline import FileProcessor

# Initialize processor
config = {
    'gcp_project_id': 'my-project',
    'gcp_location': 'us',
    'gcp_processor_id': 'my-processor'
}
processor = FileProcessor(config)

# Process a single file
result = processor.process_file('document.pdf')
print(result.to_json())

# Process a directory
results = processor.process_directory('./documents')
for result in results:
    print(f"Processed {result.file_type}: {result.success}")

Output Format

All processed files return a standardized JSON structure:

{
  "file_type": "PDF",
  "processing_tool": "Google Document AI",
  "parsed_content": {
    "text": "Extracted text content...",
    "tables": [
      {
        "page": 1,
        "data": [["Header1", "Header2"], ["Value1", "Value2"]]
      }
    ],
    "entities": [
      {
        "type": "PERSON",
        "mention_text": "John Doe",
        "confidence": 0.95
      }
    ]
  },
  "metadata": {
    "page_count": 2,
    "confidence": 0.98
  },
  "success": true,
  "error_message": null
}
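
For downstream use, the result is plain JSON. The snippet below is a minimal sketch that reads the result.json produced by the CLI example above and walks the standardized fields shown here; it is not part of the pipeline itself.

import json

# Load a result written by the pipeline (see the CLI example above)
with open("result.json") as f:
    result = json.load(f)

if result["success"]:
    print(result["parsed_content"]["text"][:200])
    for table in result["parsed_content"]["tables"]:
        print(f"Table on page {table['page']}: {len(table['data'])} rows")
else:
    print(f"Processing failed: {result['error_message']}")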

Configuration

Google Cloud Setup

  1. Create a Google Cloud project
  2. Enable Document AI and Vision APIs
  3. Create a processor in Document AI
  4. Download service account credentials
  5. Set environment variable: export GOOGLE_APPLICATION_CREDENTIALS="path/to/credentials.json"
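
A quick way to confirm the setup is to query the processor directly with the google-cloud-documentai client. This is a sanity-check sketch using the placeholder project, location, and processor IDs from the CLI example above, not part of the pipeline.

from google.cloud import documentai

# Picks up GOOGLE_APPLICATION_CREDENTIALS from the environment
client = documentai.DocumentProcessorServiceClient()
name = client.processor_path("my-project", "us", "my-processor")
print(client.get_processor(name=name).display_name)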

AWS Setup

  1. Configure AWS credentials: aws configure
  2. Create an S3 bucket for Textract processing
  3. Set appropriate IAM permissions for Textract access
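
To sanity-check the AWS side, a short boto3 sketch can verify that credentials work and the Textract staging bucket is reachable. The region and bucket names below are the placeholders from the CLI example, not real resources.

import boto3

# Assumes credentials were set up via `aws configure`
s3 = boto3.client("s3", region_name="us-east-1")
s3.head_bucket(Bucket="my-bucket")  # raises if the bucket is missing or inaccessible

textract = boto3.client("textract", region_name="us-east-1")
print("Textract client ready in region:", textract.meta.region_name)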

Architecture

pipeline/
├── __init__.py          # Package initialization
├── core.py             # Core classes and data structures
├── detector.py         # File type detection
├── main.py            # CLI entry point
├── parsers/           # File type parsers
│   ├── __init__.py
│   ├── base.py        # Base parser class
│   ├── pdf.py         # PDF parser
│   ├── markdown.py    # Markdown parser
│   ├── word.py        # Word parser
│   ├── spreadsheet.py # Spreadsheet parser
│   ├── text.py        # Text parser
│   ├── image.py       # Image parser
│   ├── html.py        # HTML parser
│   └── zip.py         # ZIP parser
├── requirements.txt   # Dependencies
└── README.md         # Documentation

Extending the Pipeline

Adding New File Types

  1. Create a new parser in parsers/
  2. Inherit from BaseParser
  3. Implement the parse() method
  4. Add file type detection in detector.py
  5. Register the parser in core.py
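
The sketch below illustrates steps 1–3 with a hypothetical EPUB parser. The module path, the dict return value, and the _extract_text helper are assumptions based on the layout above and the standardized output format; only BaseParser and parse() come from the pipeline itself.

# parsers/epub.py (hypothetical)
from pipeline.parsers.base import BaseParser

class EpubParser(BaseParser):
    """Illustrative parser for .epub files."""

    def parse(self, file_path):
        text = self._extract_text(file_path)  # hypothetical helper
        return {
            "file_type": "EPUB",
            "processing_tool": "native",
            "parsed_content": {"text": text, "tables": [], "entities": []},
            "metadata": {},
            "success": True,
            "error_message": None,
        }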

Adding New Processing Tools

  1. Add the tool to ProcessingTool enum
  2. Implement backend detection in the parser's _setup_backends() method
  3. Add configuration options
  4. Implement parsing logic
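
As a rough illustration of steps 1 and 2: add an enum member, then probe for the backend at setup time. Member names, the backend attribute, and the my_new_tool package are placeholders; only ProcessingTool and _setup_backends() come from the pipeline.

from enum import Enum

# core.py: add the new member (existing members shown are illustrative)
class ProcessingTool(Enum):
    GOOGLE_DOCUMENT_AI = "Google Document AI"
    PDFPLUMBER = "pdfplumber"
    MY_NEW_TOOL = "my-new-tool"

# In the relevant parser: detect the backend and fall back if unavailable
def _setup_backends(self):
    try:
        import my_new_tool  # hypothetical package
        self.backend = ProcessingTool.MY_NEW_TOOL
    except ImportError:
        self.backend = ProcessingTool.PDFPLUMBER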

Error Handling

The pipeline includes comprehensive error handling:

  • File not found: Returns error result with message
  • Unsupported format: Detected and reported
  • Backend unavailable: Falls back to alternative tools
  • Processing errors: Captured and included in results
  • Network issues: Graceful degradation
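
In practice, errors surface on the result object rather than as exceptions, as the Python API example above suggests. A minimal sketch, assuming error_message is exposed as an attribute alongside success and file_type:

result = processor.process_file("missing.pdf")
if not result.success:
    print(f"{result.file_type} failed: {result.error_message}")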

Performance Considerations

  • Batch processing: Process multiple files efficiently
  • Memory management: Large files handled with streaming
  • Caching: Results can be cached for repeated processing
  • Parallel processing: Multiple files can be processed concurrently
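
For concurrent processing of independent files, one straightforward approach is a thread pool around process_file(). This is a sketch only; it assumes FileProcessor is safe to share across threads, and if it is not, create one processor per worker instead.

from concurrent.futures import ThreadPoolExecutor
from pipeline import FileProcessor

processor = FileProcessor(config)
files = ["a.pdf", "b.docx", "c.csv"]

# Worker count is illustrative; tune it to the workload
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(processor.process_file, files))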

License

MIT License - see LICENSE file for details.
