A powerful, AI-driven document processing pipeline that converts various document formats (PDF, DOCX, PPTX, images) into structured outputs using computer vision and large language models.
Document Processor is a comprehensive solution that combines multiple AI technologies to extract, analyze, and structure content from documents. The system uses YOLO for object detection, advanced vision models (OpenAI GPT-4 Vision or Google Gemini) for content extraction, and sophisticated processing pipelines to deliver high-quality structured outputs.
- Multi-format Support: Process PDFs, DOCX, PPTX, and image files
- AI-Powered Extraction: Uses state-of-the-art computer vision and LLM models
- Flexible Output Formats: Generate JSON, Markdown, HTML, or plain text
- Scalable Architecture: Microservices-based design using AWS Lambda functions
- Cloud-Native: Built for AWS with S3 storage integration
- Docker Support: Fully containerized for easy deployment and testing
The Document Processor follows a modular microservices architecture with three main components:
┌──────────────┐     ┌───────────────────┐     ┌──────────────┐
│   Splitter   │ ->  │  Page Processor   │ ->  │   Combiner   │
│              │     │    (per page)     │     │              │
└──────────────┘     └───────────────────┘     └──────────────┘
        │                      │                       │
        ▼                      ▼                       ▼
   S3 Storage             S3 Storage              S3 Storage
- Splitter: Converts documents into individual page images and extracts raw text
- Page Processor: Analyzes each page using YOLO + LLM for structured content extraction
- Combiner: Aggregates page results into final structured documents
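The three services are invoked in sequence by run_pipeline.py. Below is a minimal orchestration sketch against the local Docker endpoints; the page-processor payload fields mirror the invocation examples later in this README, while the splitter response keys and the combiner payload are assumptions for illustration:

import requests

SPLITTER_URL = "http://localhost:9000/2015-03-31/functions/function/invocations"
PAGE_PROCESSOR_URL = "http://localhost:9001/2015-03-31/functions/function/invocations"
COMBINER_URL = "http://localhost:9002/2015-03-31/functions/function/invocations"

def process_document(s3_key: str, output_format: str = "markdown") -> dict:
    # 1. Split the document into page images and raw text on S3.
    split_result = requests.post(
        SPLITTER_URL,
        json={"s3_input_uri": s3_key, "output_format": output_format},
    ).json()

    # 2. Process each page with YOLO + LLM (response keys such as "pages",
    #    "image_uri", and "text_uri" are assumed here for illustration).
    page_results = []
    for page in split_result.get("pages", []):
        page_results.append(
            requests.post(PAGE_PROCESSOR_URL, json={
                "run_uuid": split_result.get("run_uuid"),
                "s3_page_image_uri": page.get("image_uri"),
                "s3_page_text_uri": page.get("text_uri"),
                "output_format": output_format,
                "page_number": page.get("page_number"),
                "original_base_filename": split_result.get("base_filename"),
            }).json()
        )

    # 3. Combine per-page results into the final output (payload assumed).
    return requests.post(COMBINER_URL, json={
        "run_uuid": split_result.get("run_uuid"),
        "output_format": output_format,
    }).json()
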
Document-Processor-Python/
├── run_pipeline.py            # Main orchestration script
├── splitter/                  # Document splitting service
│   ├── lambda_function.py     # PDF/DOCX/PPTX to pages
│   ├── Dockerfile
│   ├── docker-compose.yml
│   └── requirements.txt
├── page_processor/            # AI-powered page analysis
│   ├── lambda_function.py     # Main processing logic
│   ├── config.py              # Configuration settings
│   ├── llm_apis.py            # OpenAI/Gemini integration
│   ├── yolo_inference.py      # Computer vision processing
│   ├── prompts.py             # LLM prompts for extraction
│   ├── s3_utils.py            # AWS S3 utilities
│   ├── utils.py               # Helper functions
│   ├── yolov10x_best.onnx     # YOLO model weights
│   ├── Dockerfile
│   ├── docker-compose.yml
│   └── requirements.txt
└── combiner/                  # Result aggregation service
    ├── lambda_function.py     # Combines page results
    ├── Dockerfile
    ├── docker-compose.yml
    └── requirements.txt
- Python 3.10+
- Docker & Docker Compose
- AWS Account with S3 access
- OpenAI API key or Google Gemini API key
- Clone the repository:

git clone <repository-url>
cd Document-Processor-Python

- Configure environment variables: create a .env file in the page_processor/ directory:

# AWS Configuration
AWS_ACCESS_KEY_ID=your_access_key
AWS_SECRET_ACCESS_KEY=your_secret_key
AWS_REGION=us-east-1
S3_BUCKET_NAME=your-bucket-name

# Vision Provider (choose 'openai' or 'gemini')
VISION_PROVIDER=gemini

# OpenAI Configuration
OPENAI_API_KEY=your_openai_key
OPENAI_VISION_MODEL=gpt-4o

# Gemini Configuration
GEMINI_API_KEY=your_gemini_key
GEMINI_VISION_MODEL=gemini-2.0-flash

# Processing Configuration
MAX_IMAGE_DIMENSION=1024

- Install Python dependencies (if running locally):

pip install requests

- Build and run all services:

# Start splitter service (port 9000)
cd splitter
docker-compose up --build -d

# Start page processor service (port 9001)
cd ../page_processor
docker-compose up --build -d

# Start combiner service (port 9002)
cd ../combiner
docker-compose up --build -d

- Verify services are running:

curl http://localhost:9000/   # Splitter
curl http://localhost:9001/   # Page Processor
curl http://localhost:9002/   # Combiner
Process a document using the main pipeline script:
python run_pipeline.py "s3_object_key_or_path" --output_format markdown

Note: The first argument should be the S3 object key (e.g., input/document.pdf) rather than a local file path. The document should already be uploaded to your configured S3 bucket.
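If the document is not yet in S3, it can be uploaded with boto3 before running the pipeline; a minimal sketch (the bucket name, local path, and object key are placeholders, and the bucket must match S3_BUCKET_NAME in your .env):

import boto3

s3 = boto3.client("s3")  # uses AWS credentials from the environment

# Upload a local file so the pipeline can reference it by its object key.
s3.upload_file("local/report.pdf", "your-bucket-name", "input/report.pdf")

# Then: python run_pipeline.py "input/report.pdf" --output_format markdown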
- Input: PDF, DOCX, PPTX, PNG, JPG, JPEG
- Output: JSON, Markdown, HTML, Plain Text
# Process a PDF to Markdown (S3 object key)
python run_pipeline.py "input/report.pdf" --output_format markdown
# Process a PowerPoint to JSON
python run_pipeline.py "presentations/slides.pptx" --output_format json
# Process a Word document to HTML
python run_pipeline.py "documents/manual.docx" --output_format html

- Document Upload: Upload your document to the configured S3 bucket
- Splitter Service: Converts document to individual page images and extracts raw text
- Page Processor: Each page is processed using:
  - YOLO inference for object detection (images, tables)
  - LLM analysis for content extraction and structuring
  - Text grounding against raw OCR text for accuracy
- Combiner Service: Aggregates all page results into final structured output
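As an illustration of the final step, a combiner can list the per-page results stored under PAGE_RESULTS_PREFIX and concatenate them in page order. The sketch below assumes a key layout of prefix/run_uuid/page_N and Markdown output; it is not the service's exact implementation:

import re
import boto3

def combine_markdown_pages(bucket: str, run_uuid: str) -> str:
    """Fetch per-page Markdown results from S3 and join them in page order."""
    s3 = boto3.client("s3")
    prefix = f"intermediate-page-results/{run_uuid}/"   # assumed key layout
    listing = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)

    def page_number(key: str) -> int:
        match = re.search(r"page_(\d+)", key)
        return int(match.group(1)) if match else 0

    keys = sorted((obj["Key"] for obj in listing.get("Contents", [])), key=page_number)
    bodies = [s3.get_object(Bucket=bucket, Key=k)["Body"].read().decode("utf-8") for k in keys]
    return "\n\n".join(bodies)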
When running locally with Docker:
- Splitter: http://localhost:9000
- Page Processor: http://localhost:9001
- Combiner: http://localhost:9002
You can also invoke services directly using REST API:
# Invoke splitter
curl -X POST "http://localhost:9000/2015-03-31/functions/function/invocations" \
-d '{"s3_input_uri": "documents/sample.pdf", "output_format": "markdown"}'
# Invoke page processor
curl -X POST "http://localhost:9001/2015-03-31/functions/function/invocations" \
-d '{
"run_uuid": "example-uuid",
"s3_page_image_uri": "s3://bucket/images/page_1.png",
"s3_page_text_uri": "s3://bucket/text/page_1.txt",
"output_format": "markdown",
"page_number": 1,
"original_base_filename": "document"
}'

The system supports two vision providers:
VISION_PROVIDER=openai
OPENAI_API_KEY=your_key_here
OPENAI_VISION_MODEL=gpt-4o

VISION_PROVIDER=gemini
GEMINI_API_KEY=your_key_here
GEMINI_VISION_MODEL=gemini-2.0-flash

AWS_ACCESS_KEY_ID=your_access_key
AWS_SECRET_ACCESS_KEY=your_secret_key
AWS_REGION=us-east-1
S3_BUCKET_NAME=your-processing-bucket

# Image processing
MAX_IMAGE_DIMENSION=1024
PDF_DPI=200
# S3 prefixes
INTERMEDIATE_IMAGES_PREFIX=intermediate-images
INTERMEDIATE_RAW_TEXT_PREFIX=intermediate-raw-text
CROPPED_IMAGES_PREFIX=cropped-images
PAGE_RESULTS_PREFIX=intermediate-page-results
FINAL_OUTPUT_PREFIX=final-outputs
# YOLO Model Configuration
YOLO_MODEL_LOCAL_PATH=/var/task/yolov10x_best.onnx
YOLO_MODEL_S3_KEY=models/yolov10x_best.onnx  # Alternative: store in S3

# Document processing
PDF_DPI=200 # DPI for PDF to image conversion
INTERMEDIATE_IMAGES_PREFIX=intermediate-images
INTERMEDIATE_RAW_TEXT_PREFIX=intermediate-raw-text

# Vision processing
VISION_PROVIDER=gemini
MAX_IMAGE_DIMENSION=1024
CROPPED_IMAGES_PREFIX=cropped-images
PAGE_RESULTS_PREFIX=intermediate-page-results
# Model configuration
YOLO_MODEL_LOCAL_PATH=/var/task/yolov10x_best.onnx

# Output configuration
FINAL_OUTPUT_PREFIX=final-outputs

Your AWS credentials need these S3 permissions:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:PutObject",
"s3:DeleteObject",
"s3:ListBucket"
],
"Resource": [
"arn:aws:s3:::your-bucket-name",
"arn:aws:s3:::your-bucket-name/*"
]
}
]
}

- Model: YOLOv10x (ONNX format)
- Purpose: Detect and locate images, tables, and other objects within document pages
- Features:
- Bounding box detection with confidence scoring
- Object indexing for referencing in text
- Cropped image extraction for detailed analysis
- Configurable confidence thresholds (default: 0.2)
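A minimal sketch of loading the bundled ONNX weights and filtering detections by confidence. The input size, output layout, and pre/post-processing are assumptions for illustration; the project's actual logic lives in yolo_inference.py:

import cv2
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("yolov10x_best.onnx", providers=["CPUExecutionProvider"])
input_name = session.get_inputs()[0].name

def detect_objects(image_path: str, conf_threshold: float = 0.2):
    """Run the YOLO model on a page image and return detections above the threshold."""
    image = cv2.imread(image_path)
    resized = cv2.resize(image, (640, 640))                       # assumed input size
    blob = resized.transpose(2, 0, 1)[np.newaxis].astype(np.float32) / 255.0

    outputs = session.run(None, {input_name: blob})[0]
    # Assumed YOLOv10-style output: one row per detection (x1, y1, x2, y2, score, class).
    detections = outputs[0] if outputs.ndim == 3 else outputs
    return [det for det in detections if det[4] >= conf_threshold]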
- OpenAI: GPT-4 Vision, GPT-4o
- Google: Gemini 2.0 Flash, Gemini Pro Vision
- Text Extraction: OCR-like text extraction from document images
- Structure Recognition: Identifies headings (H1-H6), paragraphs, lists, tables
- Content Grounding: Cross-references extracted content with raw OCR text for accuracy
- Multi-format Output: Generates JSON, Markdown, HTML, or plain text
- Image Description: Provides detailed descriptions of charts, graphs, and diagrams
- Smart Prompting: Uses specialized prompts for each output format
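For reference, the sketch below shows the kind of vision call the page processor makes when the OpenAI provider is selected. The prompt text is illustrative only; the real prompts live in page_processor/prompts.py:

import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def extract_page_markdown(image_path: str) -> str:
    """Send a page image to a vision model and get structured Markdown back."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Extract this page as Markdown, preserving headings, lists, and tables."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content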
- Compares AI-extracted text with raw OCR text
- Corrects factual inaccuracies and misinterpretations
- Preserves document structure while ensuring accuracy
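The exact grounding logic is internal to the page processor. As a rough illustration of the idea only, an extracted block can be compared against the raw OCR text and flagged when nothing in the OCR text closely matches it:

from difflib import SequenceMatcher

def is_grounded(extracted_text: str, raw_ocr_text: str, threshold: float = 0.8) -> bool:
    """Flag extracted text that does not closely match any line of the raw OCR text."""
    best = max(
        (SequenceMatcher(None, extracted_text.lower(), line.lower()).ratio()
         for line in raw_ocr_text.splitlines() if line.strip()),
        default=0.0,
    )
    return best >= threshold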
- Detects and numbers images within documents
- Generates contextual descriptions
- Integrates image references into document flow
- Handles charts, graphs, diagrams, and photos
{
"page_content": [
{
"type": "heading",
"level": 1,
"text": "Document Title"
},
{
"type": "paragraph",
"text": "Document content..."
},
{
"type": "image_description",
"image_id": 1,
"description": "Chart showing sales data"
}
]
}

# Document Title
Document content...
Image #1: Chart showing sales data

<h1>Document Title</h1>
<p>Document content...</p>
<p class="image-placeholder" data-image-id="1">Image #1: Chart showing sales data</p>

# Test individual services
cd page_processor
python -m pytest tests/
# Test full pipeline
python run_pipeline.py test_documents/sample.pdf --output_format json

- Add prompt template in page_processor/prompts.py
- Update format handling in page_processor/lambda_function.py
- Add generation logic in combiner/lambda_function.py (see the sketch below)
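For example, supporting a new output format could start with an additional prompt template in page_processor/prompts.py. The names below are hypothetical, not the module's actual API:

# page_processor/prompts.py (illustrative sketch; names are hypothetical)

LATEX_EXTRACTION_PROMPT = """
Extract the content of this page as LaTeX. Use \\section and \\subsection
for headings, itemize and enumerate for lists, and tabular environments for
tables. Describe each detected image as \\textit{Image #N: description}.
"""

# A format registry the processing code could consult when dispatching prompts.
PROMPTS_BY_FORMAT = {
    "latex": LATEX_EXTRACTION_PROMPT,
    # "markdown", "json", "html", and "text" templates would already be registered here.
}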
All services include comprehensive logging:
- Request/response logging
- Error tracking
- Performance metrics
- S3 operation logging
- aioboto3: Async AWS SDK
- opencv-python: Computer vision
- onnxruntime: ML model inference
- PyMuPDF: PDF processing
- pdf2image: PDF to image conversion
- python-pptx: PowerPoint processing
- docx2txt: Word document processing
- Pillow: Image processing
- openai: OpenAI API client
- google-genai: Google Gemini API client
- numpy: Numerical computations
- Create Lambda functions for each service
- Configure environment variables
- Set up S3 buckets for storage
- Configure IAM roles with appropriate permissions
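A minimal sketch of creating one of the functions from its container image with boto3; the ECR image URI, role ARN, function name, and resource settings are placeholders:

import boto3

lambda_client = boto3.client("lambda", region_name="us-east-1")

# Create the page-processor function from its container image (placeholders shown).
lambda_client.create_function(
    FunctionName="document-processor-page-processor",
    PackageType="Image",
    Code={"ImageUri": "123456789012.dkr.ecr.us-east-1.amazonaws.com/page-processor:latest"},
    Role="arn:aws:iam::123456789012:role/document-processor-lambda-role",
    Timeout=900,                      # page processing can be slow on large pages
    MemorySize=2048,
    Environment={"Variables": {
        "VISION_PROVIDER": "gemini",
        "S3_BUCKET_NAME": "your-processing-bucket",
    }},
)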
Services are containerized and can be deployed using:
- Docker Compose (development)
- Kubernetes (production)
- AWS ECS/Fargate (cloud)
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests for new functionality
- Submit a pull request
[Add your license information here]
# Check if Docker containers are running
docker ps
# Check service logs
docker logs lambda_splitter_service
docker logs lambda_page_processor_service
docker logs lambda_combiner_service
# Restart services
docker-compose down && docker-compose up --build

- Ensure AWS credentials are correctly set in .env files
- Verify S3 bucket exists and is accessible
- Check IAM permissions for S3 read/write access
- Verify API keys are valid and have sufficient credits
- Check rate limiting and quotas
- Monitor API response times and error rates
- Increase Docker memory limits in docker-compose.yml
- Monitor /tmp directory usage (Lambda has 512MB-10GB ephemeral storage)
- Consider using smaller images or reducing batch sizes

- Increase MAX_IMAGE_DIMENSION for better quality (impacts processing time)
- Adjust PDF_DPI for optimal image resolution vs. processing speed
- Use VISION_PROVIDER=gemini for faster processing (generally)
- Scale horizontally with multiple container instances
- Configure AWS Lambda concurrent execution limits
- Use S3 transfer acceleration for large files
Enable verbose logging:
# Set environment variable
DEBUG=1
# Or check individual service logs
docker logs -f lambda_page_processor_service

For issues and questions:
- Check the troubleshooting section above
- Review existing issues in the repository
- Check service logs for error details
- Create a new issue with:
- Document type and size
- Error messages and logs
- Environment configuration
- Steps to reproduce
- Enhanced table extraction and formatting
- Batch processing capabilities for multiple documents
- Performance optimizations for large documents
- Better error handling and retry mechanisms
- Web interface for easy document upload and processing
- API documentation with OpenAPI/Swagger
- Support for additional document formats (RTF, ODT, etc.)
- Multi-language document support
- Real-time processing pipeline
- Advanced document understanding (relationships, semantics)
- Custom model training capabilities
- Enterprise integrations (SharePoint, Google Drive, etc.)