A powerful, AI-driven document processing pipeline that converts various document formats (PDF, DOCX, PPTX, images) into structured outputs using computer vision and large language models.
Document Processor is a comprehensive solution that combines multiple AI technologies to extract, analyze, and structure content from documents. The system uses YOLO for object detection, advanced vision models (OpenAI GPT-4 Vision or Google Gemini) for content extraction, and sophisticated processing pipelines to deliver high-quality structured outputs.
- Multi-format Support: Process PDFs, DOCX, PPTX, and image files
- AI-Powered Extraction: Uses state-of-the-art computer vision and LLM models
- Flexible Output Formats: Generate JSON, Markdown, HTML, or plain text
- Scalable Architecture: Microservices-based design using AWS Lambda functions
- Cloud-Native: Built for AWS with S3 storage integration
- Docker Support: Fully containerized for easy deployment and testing
The Document Processor follows a modular microservices architecture with three main components:
┌──────────────┐     ┌───────────────────┐     ┌──────────────┐
│   Splitter   │ ->  │  Page Processor   │ ->  │   Combiner   │
│              │     │    (per page)     │     │              │
└──────────────┘     └───────────────────┘     └──────────────┘
        │                      │                       │
        ▼                      ▼                       ▼
   S3 Storage             S3 Storage              S3 Storage
- Splitter: Converts documents into individual page images and extracts raw text
- Page Processor: Analyzes each page using YOLO + LLM for structured content extraction
- Combiner: Aggregates page results into final structured documents
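The three services are invoked in sequence by run_pipeline.py. Below is a minimal orchestration sketch against the local Docker endpoints; the page-processor payload fields mirror the invocation examples later in this README, while the splitter response keys and the combiner payload are assumptions for illustration:

import requests

SPLITTER_URL = "http://localhost:9000/2015-03-31/functions/function/invocations"
PAGE_PROCESSOR_URL = "http://localhost:9001/2015-03-31/functions/function/invocations"
COMBINER_URL = "http://localhost:9002/2015-03-31/functions/function/invocations"

def process_document(s3_key: str, output_format: str = "markdown") -> dict:
    # 1. Split the document into page images and raw text on S3.
    split_result = requests.post(
        SPLITTER_URL,
        json={"s3_input_uri": s3_key, "output_format": output_format},
    ).json()

    # 2. Process each page with YOLO + LLM (response keys such as "pages",
    #    "image_uri", and "text_uri" are assumed here for illustration).
    page_results = []
    for page in split_result.get("pages", []):
        page_results.append(
            requests.post(PAGE_PROCESSOR_URL, json={
                "run_uuid": split_result.get("run_uuid"),
                "s3_page_image_uri": page.get("image_uri"),
                "s3_page_text_uri": page.get("text_uri"),
                "output_format": output_format,
                "page_number": page.get("page_number"),
                "original_base_filename": split_result.get("base_filename"),
            }).json()
        )

    # 3. Combine per-page results into the final output (payload assumed).
    return requests.post(COMBINER_URL, json={
        "run_uuid": split_result.get("run_uuid"),
        "output_format": output_format,
    }).json()
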
Document-Processor-Python/
├── run_pipeline.py            # Main orchestration script
├── splitter/                  # Document splitting service
│   ├── lambda_function.py     # PDF/DOCX/PPTX to pages
│   ├── Dockerfile
│   ├── docker-compose.yml
│   └── requirements.txt
├── page_processor/            # AI-powered page analysis
│   ├── lambda_function.py     # Main processing logic
│   ├── config.py              # Configuration settings
│   ├── llm_apis.py            # OpenAI/Gemini integration
│   ├── yolo_inference.py      # Computer vision processing
│   ├── prompts.py             # LLM prompts for extraction
│   ├── s3_utils.py            # AWS S3 utilities
│   ├── utils.py               # Helper functions
│   ├── yolov10x_best.onnx     # YOLO model weights
│   ├── Dockerfile
│   ├── docker-compose.yml
│   └── requirements.txt
└── combiner/                  # Result aggregation service
    ├── lambda_function.py     # Combines page results
    ├── Dockerfile
    ├── docker-compose.yml
    └── requirements.txt
- Python 3.10+
- Docker & Docker Compose
- AWS Account with S3 access
- OpenAI API key or Google Gemini API key
- Clone the repository:

git clone <repository-url>
cd Document-Processor-Python

- Configure environment variables: create a .env file in the page_processor/ directory:

# AWS Configuration
AWS_ACCESS_KEY_ID=your_access_key
AWS_SECRET_ACCESS_KEY=your_secret_key
AWS_REGION=us-east-1
S3_BUCKET_NAME=your-bucket-name

# Vision Provider (choose 'openai' or 'gemini')
VISION_PROVIDER=gemini

# OpenAI Configuration
OPENAI_API_KEY=your_openai_key
OPENAI_VISION_MODEL=gpt-4o

# Gemini Configuration
GEMINI_API_KEY=your_gemini_key
GEMINI_VISION_MODEL=gemini-2.0-flash

# Processing Configuration
MAX_IMAGE_DIMENSION=1024

- Install Python dependencies (if running locally):

pip install requests

- Build and run all services:

# Start splitter service (port 9000)
cd splitter
docker-compose up --build -d

# Start page processor service (port 9001)
cd ../page_processor
docker-compose up --build -d

# Start combiner service (port 9002)
cd ../combiner
docker-compose up --build -d

- Verify services are running:

curl http://localhost:9000/   # Splitter
curl http://localhost:9001/   # Page Processor
curl http://localhost:9002/   # Combiner
Process a document using the main pipeline script:
python run_pipeline.py "s3_object_key_or_path" --output_format markdown

Note: The first argument should be the S3 object key (e.g., input/document.pdf) rather than a local file path. The document should already be uploaded to your configured S3 bucket.
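If the document is not yet in S3, it can be uploaded with boto3 before running the pipeline; a minimal sketch (the bucket name, local path, and object key are placeholders, and the bucket must match S3_BUCKET_NAME in your .env):

import boto3

s3 = boto3.client("s3")  # uses AWS credentials from the environment

# Upload a local file so the pipeline can reference it by its object key.
s3.upload_file("local/report.pdf", "your-bucket-name", "input/report.pdf")

# Then: python run_pipeline.py "input/report.pdf" --output_format markdown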
- Input: PDF, DOCX, PPTX, PNG, JPG, JPEG
- Output: JSON, Markdown, HTML, Plain Text
# Process a PDF to Markdown (S3 object key)
python run_pipeline.py "input/report.pdf" --output_format markdown
# Process a PowerPoint to JSON
python run_pipeline.py "presentations/slides.pptx" --output_format json
# Process a Word document to HTML
python run_pipeline.py "documents/manual.docx" --output_format html

- Document Upload: Upload your document to the configured S3 bucket
- Splitter Service: Converts document to individual page images and extracts raw text
- Page Processor: Each page is processed using:
  - YOLO inference for object detection (images, tables)
  - LLM analysis for content extraction and structuring
  - Text grounding against raw OCR text for accuracy
- Combiner Service: Aggregates all page results into final structured output
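As an illustration of the final step, a combiner can list the per-page results stored under PAGE_RESULTS_PREFIX and concatenate them in page order. The sketch below assumes a key layout of prefix/run_uuid/page_N and Markdown output; it is not the service's exact implementation:

import re
import boto3

def combine_markdown_pages(bucket: str, run_uuid: str) -> str:
    """Fetch per-page Markdown results from S3 and join them in page order."""
    s3 = boto3.client("s3")
    prefix = f"intermediate-page-results/{run_uuid}/"   # assumed key layout
    listing = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)

    def page_number(key: str) -> int:
        match = re.search(r"page_(\d+)", key)
        return int(match.group(1)) if match else 0

    keys = sorted((obj["Key"] for obj in listing.get("Contents", [])), key=page_number)
    bodies = [s3.get_object(Bucket=bucket, Key=k)["Body"].read().decode("utf-8") for k in keys]
    return "\n\n".join(bodies)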
When running locally with Docker:
- Splitter: http://localhost:9000
- Page Processor: http://localhost:9001
- Combiner: http://localhost:9002
You can also invoke services directly using REST API:
# Invoke splitter
curl -X POST "http://localhost:9000/2015-03-31/functions/function/invocations" \
-d '{"s3_input_uri": "documents/sample.pdf", "output_format": "markdown"}'
# Invoke page processor
curl -X POST "http://localhost:9001/2015-03-31/functions/function/invocations" \
-d '{
"run_uuid": "example-uuid",
"s3_page_image_uri": "s3://bucket/images/page_1.png",
"s3_page_text_uri": "s3://bucket/text/page_1.txt",
"output_format": "markdown",
"page_number": 1,
"original_base_filename": "document"
}'

The system supports two vision providers:
VISION_PROVIDER=openai
OPENAI_API_KEY=your_key_here
OPENAI_VISION_MODEL=gpt-4o

VISION_PROVIDER=gemini
GEMINI_API_KEY=your_key_here
GEMINI_VISION_MODEL=gemini-2.0-flash

AWS_ACCESS_KEY_ID=your_access_key
AWS_SECRET_ACCESS_KEY=your_secret_key
AWS_REGION=us-east-1
S3_BUCKET_NAME=your-processing-bucket

# Image processing
MAX_IMAGE_DIMENSION=1024
PDF_DPI=200
# S3 prefixes
INTERMEDIATE_IMAGES_PREFIX=intermediate-images
INTERMEDIATE_RAW_TEXT_PREFIX=intermediate-raw-text
CROPPED_IMAGES_PREFIX=cropped-images
PAGE_RESULTS_PREFIX=intermediate-page-results
FINAL_OUTPUT_PREFIX=final-outputs
# YOLO Model Configuration
YOLO_MODEL_LOCAL_PATH=/var/task/yolov10x_best.onnx
YOLO_MODEL_S3_KEY=models/yolov10x_best.onnx  # Alternative: store in S3

# Document processing
PDF_DPI=200 # DPI for PDF to image conversion
INTERMEDIATE_IMAGES_PREFIX=intermediate-images
INTERMEDIATE_RAW_TEXT_PREFIX=intermediate-raw-text

# Vision processing
VISION_PROVIDER=gemini
MAX_IMAGE_DIMENSION=1024
CROPPED_IMAGES_PREFIX=cropped-images
PAGE_RESULTS_PREFIX=intermediate-page-results
# Model configuration
YOLO_MODEL_LOCAL_PATH=/var/task/yolov10x_best.onnx

# Output configuration
FINAL_OUTPUT_PREFIX=final-outputs

Your AWS credentials need these S3 permissions:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:PutObject",
"s3:DeleteObject",
"s3:ListBucket"
],
"Resource": [
"arn:aws:s3:::your-bucket-name",
"arn:aws:s3:::your-bucket-name/*"
]
}
]
}

- Model: YOLOv10x (ONNX format)
- Purpose: Detect and locate images, tables, and other objects within document pages
- Features:
- Bounding box detection with confidence scoring
- Object indexing for referencing in text
- Cropped image extraction for detailed analysis
- Configurable confidence thresholds (default: 0.2)
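A minimal sketch of loading the bundled ONNX weights and filtering detections by confidence. The input size, output layout, and pre/post-processing are assumptions for illustration; the project's actual logic lives in yolo_inference.py:

import cv2
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("yolov10x_best.onnx", providers=["CPUExecutionProvider"])
input_name = session.get_inputs()[0].name

def detect_objects(image_path: str, conf_threshold: float = 0.2):
    """Run the YOLO model on a page image and return detections above the threshold."""
    image = cv2.imread(image_path)
    resized = cv2.resize(image, (640, 640))                       # assumed input size
    blob = resized.transpose(2, 0, 1)[np.newaxis].astype(np.float32) / 255.0

    outputs = session.run(None, {input_name: blob})[0]
    # Assumed YOLOv10-style output: one row per detection (x1, y1, x2, y2, score, class).
    detections = outputs[0] if outputs.ndim == 3 else outputs
    return [det for det in detections if det[4] >= conf_threshold]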
- OpenAI: GPT-4 Vision, GPT-4o
- Google: Gemini 2.0 Flash, Gemini Pro Vision
- Text Extraction: OCR-like text extraction from document images
- Structure Recognition: Identifies headings (H1-H6), paragraphs, lists, tables
- Content Grounding: Cross-references extracted content with raw OCR text for accuracy
- Multi-format Output: Generates JSON, Markdown, HTML, or plain text
- Image Description: Provides detailed descriptions of charts, graphs, and diagrams
- Smart Prompting: Uses specialized prompts for each output format
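For reference, the sketch below shows the kind of vision call the page processor makes when the OpenAI provider is selected. The prompt text is illustrative only; the real prompts live in page_processor/prompts.py:

import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def extract_page_markdown(image_path: str) -> str:
    """Send a page image to a vision model and get structured Markdown back."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Extract this page as Markdown, preserving headings, lists, and tables."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content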
- Compares AI-extracted text with raw OCR text
- Corrects factual inaccuracies and misinterpretations
- Preserves document structure while ensuring accuracy
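The exact grounding logic is internal to the page processor. As a rough illustration of the idea only, an extracted block can be compared against the raw OCR text and flagged when nothing in the OCR text closely matches it:

from difflib import SequenceMatcher

def is_grounded(extracted_text: str, raw_ocr_text: str, threshold: float = 0.8) -> bool:
    """Flag extracted text that does not closely match any line of the raw OCR text."""
    best = max(
        (SequenceMatcher(None, extracted_text.lower(), line.lower()).ratio()
         for line in raw_ocr_text.splitlines() if line.strip()),
        default=0.0,
    )
    return best >= threshold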
- Detects and numbers images within documents
- Generates contextual descriptions
- Integrates image references into document flow
- Handles charts, graphs, diagrams, and photos
{
"page_content": [
{
"type": "heading",
"level": 1,
"text": "Document Title"
},
{
"type": "paragraph",
"text": "Document content..."
},
{
"type": "image_description",
"image_id": 1,
"description": "Chart showing sales data"
}
]
}

# Document Title
Document content...
Image #1: Chart showing sales data

<h1>Document Title</h1>
<p>Document content...</p>
<p class="image-placeholder" data-image-id="1">Image #1: Chart showing sales data</p>

# Test individual services
cd page_processor
python -m pytest tests/
# Test full pipeline
python run_pipeline.py test_documents/sample.pdf --output_format json

- Add prompt template in page_processor/prompts.py
- Update format handling in page_processor/lambda_function.py
- Add generation logic in combiner/lambda_function.py (see the sketch below)
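For example, supporting a new output format could start with an additional prompt template in page_processor/prompts.py. The names below are hypothetical, not the module's actual API:

# page_processor/prompts.py (illustrative sketch; names are hypothetical)

LATEX_EXTRACTION_PROMPT = """
Extract the content of this page as LaTeX. Use \\section and \\subsection
for headings, itemize and enumerate for lists, and tabular environments for
tables. Describe each detected image as \\textit{Image #N: description}.
"""

# A format registry the processing code could consult when dispatching prompts.
PROMPTS_BY_FORMAT = {
    "latex": LATEX_EXTRACTION_PROMPT,
    # "markdown", "json", "html", and "text" templates would already be registered here.
}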
All services include comprehensive logging:
- Request/response logging
- Error tracking
- Performance metrics
- S3 operation logging
- aioboto3: Async AWS SDK
- opencv-python: Computer vision
- onnxruntime: ML model inference
- PyMuPDF: PDF processing
- pdf2image: PDF to image conversion
- python-pptx: PowerPoint processing
- docx2txt: Word document processing
- Pillow: Image processing
- openai: OpenAI API client
- google-genai: Google Gemini API client
- numpy: Numerical computations
- Create Lambda functions for each service
- Configure environment variables
- Set up S3 buckets for storage
- Configure IAM roles with appropriate permissions
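A minimal sketch of creating one of the functions from its container image with boto3; the ECR image URI, role ARN, function name, and resource settings are placeholders:

import boto3

lambda_client = boto3.client("lambda", region_name="us-east-1")

# Create the page-processor function from its container image (placeholders shown).
lambda_client.create_function(
    FunctionName="document-processor-page-processor",
    PackageType="Image",
    Code={"ImageUri": "123456789012.dkr.ecr.us-east-1.amazonaws.com/page-processor:latest"},
    Role="arn:aws:iam::123456789012:role/document-processor-lambda-role",
    Timeout=900,                      # page processing can be slow on large pages
    MemorySize=2048,
    Environment={"Variables": {
        "VISION_PROVIDER": "gemini",
        "S3_BUCKET_NAME": "your-processing-bucket",
    }},
)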
Services are containerized and can be deployed using:
- Docker Compose (development)
- Kubernetes (production)
- AWS ECS/Fargate (cloud)
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests for new functionality
- Submit a pull request
[Add your license information here]
# Check if Docker containers are running
docker ps
# Check service logs
docker logs lambda_splitter_service
docker logs lambda_page_processor_service
docker logs lambda_combiner_service
# Restart services
docker-compose down && docker-compose up --build

- Ensure AWS credentials are correctly set in .env files
- Verify S3 bucket exists and is accessible
- Check IAM permissions for S3 read/write access
- Verify API keys are valid and have sufficient credits
- Check rate limiting and quotas
- Monitor API response times and error rates
- Increase Docker memory limits in docker-compose.yml
- Monitor /tmp directory usage (Lambda has 512MB-10GB ephemeral storage)
- Consider using smaller images or reducing batch sizes

- Increase MAX_IMAGE_DIMENSION for better quality (impacts processing time)
- Adjust PDF_DPI for optimal image resolution vs. processing speed
- Use VISION_PROVIDER=gemini for faster processing (generally)
- Scale horizontally with multiple container instances
- Configure AWS Lambda concurrent execution limits
- Use S3 transfer acceleration for large files
Enable verbose logging:
# Set environment variable
DEBUG=1
# Or check individual service logs
docker logs -f lambda_page_processor_service

For issues and questions:
- Check the troubleshooting section above
- Review existing issues in the repository
- Check service logs for error details
- Create a new issue with:
- Document type and size
- Error messages and logs
- Environment configuration
- Steps to reproduce
- Enhanced table extraction and formatting
- Batch processing capabilities for multiple documents
- Performance optimizations for large documents
- Better error handling and retry mechanisms
- Web interface for easy document upload and processing
- API documentation with OpenAPI/Swagger
- Support for additional document formats (RTF, ODT, etc.)
- Multi-language document support
- Real-time processing pipeline
- Advanced document understanding (relationships, semantics)
- Custom model training capabilities
- Enterprise integrations (SharePoint, Google Drive, etc.)