A modular, extensible system for detecting, parsing, and standardizing various file types into a unified JSON structure.
- Multi-format Support: PDF, Markdown, Word, Spreadsheets, Text, Images, HTML, ZIP archives
- Advanced Processing: Google Document AI, AWS Textract, OCR capabilities
- Standardized Output: Consistent JSON structure across all file types
- Recursive Processing: Handle ZIP archives and directory structures
- Configurable: Flexible backend selection and processing options
| File Type | Extensions | Primary Tool | Fallback Tool |
|---|---|---|---|
| PDF | .pdf | Google Document AI | pdfplumber |
| Markdown | .md, .markdown | Google Document AI | mistune |
| Word | .docx, .doc | Google Document AI | python-docx |
| Spreadsheets | .csv, .xlsx, .xls | pandas | openpyxl |
| Text | .txt | Native | - |
| Images | .png, .jpg, .tiff | Google Vision API | pytesseract |
| HTML | .html, .htm | BeautifulSoup | - |
| ZIP | .zip, .tar.gz | zipfile | - |
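The primary/fallback split follows a simple rule: each parser uses its primary tool when that backend is configured and reachable, and otherwise drops to the local fallback. The snippet below is only a hypothetical illustration of that rule; the real selection logic lives in `detector.py` and the individual parsers.

```python
# Hypothetical illustration of the fallback rule in the table above;
# the pipeline's actual logic lives in detector.py and the parser classes.
TOOL_MAP = {
    ".pdf":  ("Google Document AI", "pdfplumber"),
    ".docx": ("Google Document AI", "python-docx"),
    ".csv":  ("pandas", "openpyxl"),
    ".png":  ("Google Vision API", "pytesseract"),
    ".txt":  ("Native", None),
}

def pick_tool(extension, primary_available):
    """Return the primary tool when its backend is usable, else the fallback."""
    primary, fallback = TOOL_MAP.get(extension.lower(), (None, None))
    return primary if primary_available else fallback
```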
# Install dependencies
pip install -r requirements.txt
# For Google Cloud services
export GOOGLE_APPLICATION_CREDENTIALS="path/to/credentials.json"
# For AWS services
aws configure
# Process a single file
python -m pipeline.main --input document.pdf --output result.json
# Process a directory
python -m pipeline.main --input ./documents --output ./results --batch
# Use Google Document AI
python -m pipeline.main --input document.pdf --output result.json \
--gcp-project-id my-project --gcp-location us --gcp-processor-id my-processor
# Use AWS Textract
python -m pipeline.main --input document.pdf --output result.json \
--aws-region us-east-1 --aws-s3-bucket my-bucket
from pipeline import FileProcessor
# Initialize processor
config = {
    'gcp_project_id': 'my-project',
    'gcp_location': 'us',
    'gcp_processor_id': 'my-processor'
}
processor = FileProcessor(config)
# Process a single file
result = processor.process_file('document.pdf')
print(result.to_json())
# Process a directory
results = processor.process_directory('./documents')
for result in results:
    print(f"Processed {result.file_type}: {result.success}")
All processed files return a standardized JSON structure:
{
  "file_type": "PDF",
  "processing_tool": "Google Document AI",
  "parsed_content": {
    "text": "Extracted text content...",
    "tables": [
      {
        "page": 1,
        "data": [["Header1", "Header2"], ["Value1", "Value2"]]
      }
    ],
    "entities": [
      {
        "type": "PERSON",
        "mention_text": "John Doe",
        "confidence": 0.95
      }
    ]
  },
  "metadata": {
    "page_count": 2,
    "confidence": 0.98
  },
  "success": true,
  "error_message": null
}
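A downstream script can consume this structure directly. The sketch below is illustrative only: it assumes the CLI wrote its output to `result.json` and touches only the fields shown above.

```python
import json

# Read a result written by the pipeline (field names as shown above).
with open("result.json", encoding="utf-8") as f:
    result = json.load(f)

if result["success"]:
    print(f"{result['file_type']} parsed with {result['processing_tool']}")
    print(result["parsed_content"]["text"][:200])  # first 200 characters

    # Each table entry carries a page number and a list of rows.
    for table in result["parsed_content"].get("tables", []):
        print(f"page {table['page']}: {len(table['data'])} rows")

    # Entities carry a type, the matched text, and a confidence score.
    for entity in result["parsed_content"].get("entities", []):
        print(f"{entity['type']}: {entity['mention_text']} ({entity['confidence']:.2f})")
else:
    print(f"processing failed: {result['error_message']}")
```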
- Create a Google Cloud project
- Enable Document AI and Vision APIs
- Create a processor in Document AI
- Download service account credentials
- Set the environment variable:
export GOOGLE_APPLICATION_CREDENTIALS="path/to/credentials.json"
- Configure AWS credentials:
aws configure
- Create an S3 bucket for Textract processing
- Set appropriate IAM permissions for Textract access
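The Python API accepts the same settings as the CLI flags. The key names below (`aws_region`, `aws_s3_bucket`) are an assumption that mirrors `--aws-region` and `--aws-s3-bucket`; check the configuration handling in `core.py` for the exact names your version expects.

```python
from pipeline import FileProcessor

# Assumed config keys mirroring the --aws-region / --aws-s3-bucket CLI flags.
config = {
    'aws_region': 'us-east-1',
    'aws_s3_bucket': 'my-bucket'
}

processor = FileProcessor(config)
result = processor.process_file('document.pdf')  # uses Textract when available
print(result.to_json())
```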
pipeline/
├── __init__.py # Package initialization
├── core.py # Core classes and data structures
├── detector.py # File type detection
├── main.py # CLI entry point
├── parsers/ # File type parsers
│ ├── __init__.py
│ ├── base.py # Base parser class
│ ├── pdf.py # PDF parser
│ ├── markdown.py # Markdown parser
│ ├── word.py # Word parser
│ ├── spreadsheet.py # Spreadsheet parser
│ ├── text.py # Text parser
│ ├── image.py # Image parser
│ ├── html.py # HTML parser
│ └── zip.py # ZIP parser
├── requirements.txt # Dependencies
└── README.md # Documentation
To add a new parser (a sketch is shown after these lists):
- Create a new parser in `parsers/`
- Inherit from `BaseParser`
- Implement the `parse()` method
- Add file type detection in `detector.py`
- Register the parser in `core.py`

To add a new processing backend:
- Add the tool to the `ProcessingTool` enum
- Implement backend detection in the parser's `_setup_backends()`
- Add configuration options
- Implement the parsing logic
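A minimal parser might look like the sketch below. The exact `BaseParser` interface (constructor, hooks, and the shape of the returned content) is assumed here, so mirror one of the existing modules in `parsers/`, such as `text.py`, for the real signatures.

```python
# parsers/rtf.py -- hypothetical new parser; method names follow the steps above,
# but the real BaseParser signatures should be copied from an existing parser.
from .base import BaseParser


class RTFParser(BaseParser):
    """Parse .rtf files into the standard result structure."""

    def _setup_backends(self):
        # Detect optional libraries and choose a backend, falling back if absent.
        try:
            import striprtf  # noqa: F401
            self.backend = 'striprtf'
        except ImportError:
            self.backend = 'native'

    def parse(self, file_path):
        # Return content shaped like the parsed_content block of the output JSON.
        with open(file_path, 'r', encoding='utf-8', errors='ignore') as f:
            text = f.read()
        return {'text': text, 'tables': [], 'entities': []}
```

The new file type would still need an entry in `detector.py` and registration in `core.py`, as listed above.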
The pipeline includes comprehensive error handling (see the example after this list):
- File not found: Returns error result with message
- Unsupported format: Detected and reported
- Backend unavailable: Falls back to alternative tools
- Processing errors: Captured and included in results
- Network issues: Graceful degradation
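Because errors are captured on the result rather than raised, a batch run keeps going past bad files. A small sketch using the `FileProcessor` API from above; the `error_message` and `processing_tool` attributes are assumed to mirror the JSON fields.

```python
from pipeline import FileProcessor

processor = FileProcessor({})  # no cloud config: fallback tools are used

failed = []
for result in processor.process_directory('./documents'):
    if result.success:
        print(f"OK: {result.file_type} via {result.processing_tool}")
    else:
        failed.append(result)  # the error is captured, not raised
        print(f"FAILED: {result.error_message}")

print(f"{len(failed)} file(s) could not be processed")
```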
- Batch processing: Process multiple files efficiently
- Memory management: Large files handled with streaming
- Caching: Results can be cached for repeated processing
- Parallel processing: Multiple files can be processed concurrently
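One way to fan files out concurrently from the calling side is a small thread pool, as in the sketch below; this assumes `process_file` is safe to call from worker threads, which is worth verifying for the cloud backends you enable.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

from pipeline import FileProcessor

processor = FileProcessor({'gcp_project_id': 'my-project',
                           'gcp_location': 'us',
                           'gcp_processor_id': 'my-processor'})

files = [str(p) for p in Path('./documents').iterdir() if p.is_file()]

# Process the files on a few worker threads; each call returns a result object.
with ThreadPoolExecutor(max_workers=4) as pool:
    for result in pool.map(processor.process_file, files):
        print(f"{result.file_type}: {result.success}")
```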
MIT License - see LICENSE file for details.