Document Processor

A powerful, AI-driven document processing pipeline that converts various document formats (PDF, DOCX, PPTX, images) into structured outputs using computer vision and large language models.

🚀 Overview

Document Processor is a comprehensive solution that combines multiple AI technologies to extract, analyze, and structure content from documents. The system uses YOLO for object detection, advanced vision models (OpenAI GPT-4 Vision or Google Gemini) for content extraction, and sophisticated processing pipelines to deliver high-quality structured outputs.

Key Features

  • Multi-format Support: Process PDFs, DOCX, PPTX, and image files
  • AI-Powered Extraction: Uses state-of-the-art computer vision and LLM models
  • Flexible Output Formats: Generate JSON, Markdown, HTML, or plain text
  • Scalable Architecture: Microservices-based design using AWS Lambda functions
  • Cloud-Native: Built for AWS with S3 storage integration
  • Docker Support: Fully containerized for easy deployment and testing

๐Ÿ—๏ธ Architecture

The Document Processor follows a modular microservices architecture with three main components:

┌─────────────┐    ┌─────────────────┐    ┌─────────────┐
│   Splitter  │ -> │ Page Processor  │ -> │  Combiner   │
│             │    │  (per page)     │    │             │
└─────────────┘    └─────────────────┘    └─────────────┘
       │                    │                     │
       ▼                    ▼                     ▼
   S3 Storage          S3 Storage             S3 Storage

Component Overview

  1. Splitter: Converts documents into individual page images and extracts raw text
  2. Page Processor: Analyzes each page using YOLO + LLM for structured content extraction
  3. Combiner: Aggregates page results into final structured documents (see the chaining sketch below)
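
The three stages can be chained through the same local Lambda invocation endpoints shown under API Usage below. A minimal sketch of that orchestration, assuming illustrative payload and response keys (pages, run_uuid); run_pipeline.py is the authoritative implementation:

import requests

INVOKE_PATH = "/2015-03-31/functions/function/invocations"

def invoke(port: int, payload: dict) -> dict:
    # POST a Lambda-style event to a local service and return its JSON result
    resp = requests.post(f"http://localhost:{port}{INVOKE_PATH}", json=payload, timeout=300)
    resp.raise_for_status()
    return resp.json()

# 1. Split the document into page images and raw text
split = invoke(9000, {"s3_input_uri": "input/report.pdf", "output_format": "markdown"})

# 2. Process each page independently (could run in parallel)
pages = [invoke(9001, page_event) for page_event in split["pages"]]

# 3. Aggregate per-page results into the final document
final = invoke(9002, {"run_uuid": split["run_uuid"], "output_format": "markdown"})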

๐Ÿ“ Project Structure

Document-Processor-Python/
├── run_pipeline.py          # Main orchestration script
├── splitter/                # Document splitting service
│   ├── lambda_function.py   # PDF/DOCX/PPTX to pages
│   ├── Dockerfile
│   ├── docker-compose.yml
│   └── requirements.txt
├── page_processor/          # AI-powered page analysis
│   ├── lambda_function.py   # Main processing logic
│   ├── config.py            # Configuration settings
│   ├── llm_apis.py          # OpenAI/Gemini integration
│   ├── yolo_inference.py    # Computer vision processing
│   ├── prompts.py           # LLM prompts for extraction
│   ├── s3_utils.py          # AWS S3 utilities
│   ├── utils.py             # Helper functions
│   ├── yolov10x_best.onnx   # YOLO model weights
│   ├── Dockerfile
│   ├── docker-compose.yml
│   └── requirements.txt
└── combiner/                # Result aggregation service
    ├── lambda_function.py   # Combines page results
    ├── Dockerfile
    ├── docker-compose.yml
    └── requirements.txt

🛠️ Installation

Prerequisites

  • Python 3.10+
  • Docker & Docker Compose
  • AWS Account with S3 access
  • OpenAI API key or Google Gemini API key

Environment Setup

  1. Clone the repository:

    git clone <repository-url>
    cd Document-Processor-Python
  2. Configure environment variables: Create a .env file in the page_processor/ directory (see the loading sketch after these steps):

    # AWS Configuration
    AWS_ACCESS_KEY_ID=your_access_key
    AWS_SECRET_ACCESS_KEY=your_secret_key
    AWS_REGION=us-east-1
    S3_BUCKET_NAME=your-bucket-name
    
    # Vision Provider (choose 'openai' or 'gemini')
    VISION_PROVIDER=gemini
    
    # OpenAI Configuration
    OPENAI_API_KEY=your_openai_key
    OPENAI_VISION_MODEL=gpt-4o
    
    # Gemini Configuration
    GEMINI_API_KEY=your_gemini_key
    GEMINI_VISION_MODEL=gemini-2.0-flash
    
    # Processing Configuration
    MAX_IMAGE_DIMENSION=1024
  3. Install Python dependencies (if running locally):

    pip install requests
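
With the .env in place, the page processor reads its settings from the environment. A minimal loading sketch, assuming python-dotenv is available (it is not in the dependency list, and the actual config.py may read the variables differently):

import os
from dotenv import load_dotenv  # assumption: python-dotenv installed

load_dotenv()  # copies .env entries into the process environment

VISION_PROVIDER = os.environ.get("VISION_PROVIDER", "gemini")
S3_BUCKET_NAME = os.environ["S3_BUCKET_NAME"]  # required: fail fast if missing
MAX_IMAGE_DIMENSION = int(os.environ.get("MAX_IMAGE_DIMENSION", "1024"))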

Docker Deployment

  1. Build and run all services:

    # Start splitter service (port 9000)
    cd splitter
    docker-compose up --build -d
    
    # Start page processor service (port 9001)
    cd ../page_processor
    docker-compose up --build -d
    
    # Start combiner service (port 9002)
    cd ../combiner
    docker-compose up --build -d
  2. Verify services are running:

    curl http://localhost:9000/  # Splitter
    curl http://localhost:9001/  # Page Processor
    curl http://localhost:9002/  # Combiner

🚀 Usage

Basic Usage

Process a document using the main pipeline script:

python run_pipeline.py "s3_object_key_or_path" --output_format markdown

Note: The first argument should be the S3 object key (e.g., input/document.pdf) rather than a local file path. The document should already be uploaded to your configured S3 bucket.
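
A minimal boto3 sketch for that staging step; the bucket name and key below are examples that match the .env configuration:

import boto3

s3 = boto3.client("s3")  # picks up the AWS credentials from your environment
s3.upload_file("report.pdf", "your-bucket-name", "input/report.pdf")
# Then: python run_pipeline.py "input/report.pdf" --output_format markdown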

Supported Formats

  • Input: PDF, DOCX, PPTX, PNG, JPG, JPEG
  • Output: JSON, Markdown, HTML, Plain Text

Example Commands

# Process a PDF to Markdown (S3 object key)
python run_pipeline.py "input/report.pdf" --output_format markdown

# Process a PowerPoint to JSON
python run_pipeline.py "presentations/slides.pptx" --output_format json

# Process a Word document to HTML
python run_pipeline.py "documents/manual.docx" --output_format html

Processing Pipeline Flow

  1. Document Upload: Upload your document to the configured S3 bucket
  2. Splitter Service: Converts document to individual page images and extracts raw text
  3. Page Processor: Each page is processed using:
    • YOLO inference for object detection (images, tables)
    • LLM analysis for content extraction and structuring
    • Text grounding against raw OCR text for accuracy
  4. Combiner Service: Aggregates all page results into the final structured output (see the retrieval sketch below)
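
A hedged sketch of retrieving that final output from S3; the exact key naming is decided by the combiner, so the key below is illustrative:

import boto3

s3 = boto3.client("s3")

# List whatever the combiner wrote under FINAL_OUTPUT_PREFIX
listing = s3.list_objects_v2(Bucket="your-bucket-name", Prefix="final-outputs/")
for obj in listing.get("Contents", []):
    print(obj["Key"])

# Fetch one result (illustrative key)
body = s3.get_object(Bucket="your-bucket-name", Key="final-outputs/report.md")["Body"].read()
print(body.decode("utf-8"))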

Service Ports

When running locally with Docker:

  • Splitter: http://localhost:9000
  • Page Processor: http://localhost:9001
  • Combiner: http://localhost:9002

API Usage

You can also invoke the services directly via their Lambda-style REST endpoints:

# Invoke splitter
curl -X POST "http://localhost:9000/2015-03-31/functions/function/invocations" \
  -d '{"s3_input_uri": "documents/sample.pdf", "output_format": "markdown"}'

# Invoke page processor
curl -X POST "http://localhost:9001/2015-03-31/functions/function/invocations" \
  -d '{
    "run_uuid": "example-uuid",
    "s3_page_image_uri": "s3://bucket/images/page_1.png",
    "s3_page_text_uri": "s3://bucket/text/page_1.txt",
    "output_format": "markdown",
    "page_number": 1,
    "original_base_filename": "document"
  }'

🔧 Configuration

Vision Provider Configuration

The system supports two vision providers:

OpenAI GPT-4 Vision

VISION_PROVIDER=openai
OPENAI_API_KEY=your_key_here
OPENAI_VISION_MODEL=gpt-4o

Google Gemini Vision

VISION_PROVIDER=gemini
GEMINI_API_KEY=your_key_here
GEMINI_VISION_MODEL=gemini-2.0-flash
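
A hedged sketch of how the VISION_PROVIDER switch might dispatch between the two SDKs. The describe_page function is illustrative, not the exact llm_apis.py interface; the client calls follow the public openai and google-genai SDKs:

import base64
import os

def describe_page(image_bytes: bytes, prompt: str) -> str:
    provider = os.environ.get("VISION_PROVIDER", "gemini")
    if provider == "openai":
        from openai import OpenAI
        client = OpenAI()  # reads OPENAI_API_KEY from the environment
        b64 = base64.b64encode(image_bytes).decode()
        resp = client.chat.completions.create(
            model=os.environ.get("OPENAI_VISION_MODEL", "gpt-4o"),
            messages=[{"role": "user", "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ]}],
        )
        return resp.choices[0].message.content
    # default: gemini
    from google import genai
    from google.genai import types
    client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])
    resp = client.models.generate_content(
        model=os.environ.get("GEMINI_VISION_MODEL", "gemini-2.0-flash"),
        contents=[types.Part.from_bytes(data=image_bytes, mime_type="image/png"), prompt],
    )
    return resp.text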

AWS Configuration

AWS_ACCESS_KEY_ID=your_access_key
AWS_SECRET_ACCESS_KEY=your_secret_key
AWS_REGION=us-east-1
S3_BUCKET_NAME=your-processing-bucket

Processing Configuration

# Image processing
MAX_IMAGE_DIMENSION=1024
PDF_DPI=200

# S3 prefixes
INTERMEDIATE_IMAGES_PREFIX=intermediate-images
INTERMEDIATE_RAW_TEXT_PREFIX=intermediate-raw-text
CROPPED_IMAGES_PREFIX=cropped-images
PAGE_RESULTS_PREFIX=intermediate-page-results
FINAL_OUTPUT_PREFIX=final-outputs

# YOLO Model Configuration
YOLO_MODEL_LOCAL_PATH=/var/task/yolov10x_best.onnx
YOLO_MODEL_S3_KEY=models/yolov10x_best.onnx  # Alternative: store in S3
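
The two YOLO_MODEL_* settings suggest a local-path-first pattern with an S3 fallback. A minimal sketch of that idea (the real loading logic lives in yolo_inference.py and may differ):

import os
import boto3

def ensure_yolo_model() -> str:
    # Prefer the weights baked into the image at /var/task
    local_path = os.environ.get("YOLO_MODEL_LOCAL_PATH", "/var/task/yolov10x_best.onnx")
    if not os.path.exists(local_path):
        # Fall back to downloading from S3, e.g. when the weights are not bundled
        local_path = "/tmp/yolov10x_best.onnx"  # Lambda's writable ephemeral storage
        boto3.client("s3").download_file(
            os.environ["S3_BUCKET_NAME"], os.environ["YOLO_MODEL_S3_KEY"], local_path)
    return local_path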

Service-Specific Configuration

Splitter Service

# Document processing
PDF_DPI=200  # DPI for PDF to image conversion
INTERMEDIATE_IMAGES_PREFIX=intermediate-images
INTERMEDIATE_RAW_TEXT_PREFIX=intermediate-raw-text

Page Processor Service

# Vision processing
VISION_PROVIDER=gemini
MAX_IMAGE_DIMENSION=1024
CROPPED_IMAGES_PREFIX=cropped-images
PAGE_RESULTS_PREFIX=intermediate-page-results

# Model configuration
YOLO_MODEL_LOCAL_PATH=/var/task/yolov10x_best.onnx

Combiner Service

# Output configuration  
FINAL_OUTPUT_PREFIX=final-outputs

Required IAM Permissions

Your AWS credentials need these S3 permissions:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject", 
        "s3:DeleteObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::your-bucket-name",
        "arn:aws:s3:::your-bucket-name/*"
      ]
    }
  ]
}

🧠 AI Components

YOLO Object Detection

  • Model: YOLOv10x (ONNX format)
  • Purpose: Detect and locate images, tables, and other objects within document pages
  • Features:
    • Bounding box detection with confidence scoring
    • Object indexing for referencing in text
    • Cropped image extraction for detailed analysis
    • Configurable confidence thresholds (default: 0.2; see the inference sketch below)
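
A minimal onnxruntime sketch of that detection step, keeping boxes above the 0.2 default threshold. The output layout is an assumption (YOLOv10 exports commonly emit rows of [x1, y1, x2, y2, score, class]); yolo_inference.py is the authoritative implementation:

import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("yolov10x_best.onnx", providers=["CPUExecutionProvider"])
input_name = session.get_inputs()[0].name

def detect(image: np.ndarray, conf_threshold: float = 0.2) -> list:
    # image: preprocessed float32 tensor of shape (1, 3, H, W)
    (output,) = session.run(None, {input_name: image})
    rows = output[0]  # assumed layout: (num_detections, 6)
    return [row for row in rows if row[4] >= conf_threshold]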

Language Models

Supported Models:

  • OpenAI: GPT-4 Vision, GPT-4o
  • Google: Gemini 2.0 Flash, Gemini Pro Vision

Capabilities:

  • Text Extraction: OCR-like text extraction from document images
  • Structure Recognition: Identifies headings (H1-H6), paragraphs, lists, tables
  • Content Grounding: Cross-references extracted content with raw OCR text for accuracy
  • Multi-format Output: Generates JSON, Markdown, HTML, or plain text
  • Image Description: Provides detailed descriptions of charts, graphs, and diagrams
  • Smart Prompting: Uses specialized prompts for each output format

Processing Features

Text Grounding

  • Compares AI-extracted text with raw OCR text
  • Corrects factual inaccuracies and misinterpretations
  • Preserves document structure while ensuring accuracy (a comparison sketch follows)
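
One way to picture the comparison, using difflib as a stand-in for the LLM-side grounding that prompts.py actually drives; the threshold and snippet are illustrative:

import difflib

def similarity(extracted: str, raw_ocr: str) -> float:
    # Word-level similarity ratio in [0.0, 1.0]
    return difflib.SequenceMatcher(None, extracted.split(), raw_ocr.split()).ratio()

extracted = "Revenue grew 15% in Q3"
raw_ocr = "Revenue grew 1.5% in Q3"
if similarity(extracted, raw_ocr) < 0.95:
    print("Low overlap with raw OCR text: flag for correction")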

Image Analysis

  • Detects and numbers images within documents
  • Generates contextual descriptions
  • Integrates image references into document flow
  • Handles charts, graphs, diagrams, and photos

📊 Output Formats

JSON Structure

{
  "page_content": [
    {
      "type": "heading",
      "level": 1,
      "text": "Document Title"
    },
    {
      "type": "paragraph",
      "text": "Document content..."
    },
    {
      "type": "image_description",
      "image_id": 1,
      "description": "Chart showing sales data"
    }
  ]
}

Markdown Output

# Document Title

Document content...

Image #1: Chart showing sales data

HTML Output

<h1>Document Title</h1>
<p>Document content...</p>
<p class="image-placeholder" data-image-id="1">Image #1: Chart showing sales data</p>
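
A minimal sketch of the rendering step that turns the JSON structure above into the Markdown shown; element type names follow the JSON example, while the function itself is illustrative rather than the combiner's actual code:

def to_markdown(page_content: list) -> str:
    lines = []
    for el in page_content:
        if el["type"] == "heading":
            lines.append("#" * el["level"] + " " + el["text"])
        elif el["type"] == "paragraph":
            lines.append(el["text"])
        elif el["type"] == "image_description":
            lines.append(f"Image #{el['image_id']}: {el['description']}")
    return "\n\n".join(lines)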

🔧 Development

Running Tests

# Test individual services
cd page_processor
python -m pytest tests/

# Test full pipeline
python run_pipeline.py test_documents/sample.pdf --output_format json

Adding New Output Formats

  1. Add a prompt template in page_processor/prompts.py (see the sketch below)
  2. Update format handling in page_processor/lambda_function.py
  3. Add generation logic in combiner/lambda_function.py
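
A hedged sketch of step 1, registering a template for a hypothetical csv format; the dict name and the existing entries are assumptions about how prompts.py is organized:

PROMPTS = {
    "markdown": "Extract the page content as clean Markdown...",  # existing (assumed)
    "json": "Extract the page content as structured JSON...",     # existing (assumed)
    # New format:
    "csv": (
        "Extract all tabular data on this page as CSV. "
        "Preserve column headers and emit one table per block."
    ),
}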

Monitoring and Logging

All services include comprehensive logging:

  • Request/response logging
  • Error tracking
  • Performance metrics
  • S3 operation logging

📚 Dependencies

Core Dependencies

  • aioboto3: Async AWS SDK
  • opencv-python: Computer vision
  • onnxruntime: ML model inference
  • PyMuPDF: PDF processing
  • pdf2image: PDF to image conversion
  • python-pptx: PowerPoint processing
  • docx2txt: Word document processing
  • Pillow: Image processing

AI/ML Dependencies

  • openai: OpenAI API client
  • google-genai: Google Gemini API client
  • numpy: Numerical computations

🚀 Deployment

AWS Lambda Deployment

  1. Create Lambda functions for each service
  2. Configure environment variables
  3. Set up S3 buckets for storage
  4. Configure IAM roles with appropriate permissions

Docker Deployment

Services are containerized and can be deployed using:

  • Docker Compose (development)
  • Kubernetes (production)
  • AWS ECS/Fargate (cloud)

๐Ÿค Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests for new functionality
  5. Submit a pull request

📄 License

[Add your license information here]

๐Ÿ” Troubleshooting

Common Issues

Service Not Responding

# Check if Docker containers are running
docker ps

# Check service logs
docker logs lambda_splitter_service
docker logs lambda_page_processor_service
docker logs lambda_combiner_service

# Restart services
docker-compose down && docker-compose up --build

AWS Credentials Issues

  • Ensure AWS credentials are correctly set in .env files
  • Verify S3 bucket exists and is accessible
  • Check IAM permissions for S3 read/write access

Vision API Errors

  • Verify API keys are valid and have sufficient credits
  • Check rate limiting and quotas
  • Monitor API response times and error rates

Memory Issues

  • Increase Docker memory limits in docker-compose.yml
  • Monitor /tmp directory usage (Lambda has 512MB-10GB ephemeral storage)
  • Consider using smaller images or reducing batch sizes

Performance Optimization

For Large Documents

  • Increase MAX_IMAGE_DIMENSION for better quality (impacts processing time)
  • Adjust PDF_DPI for optimal image resolution vs. processing speed
  • Use VISION_PROVIDER=gemini, which is generally faster

For High-Volume Processing

  • Scale horizontally with multiple container instances
  • Configure AWS Lambda concurrent execution limits
  • Use S3 transfer acceleration for large files

Debug Mode

Enable verbose logging:

# Set environment variable
DEBUG=1

# Or check individual service logs
docker logs -f lambda_page_processor_service

🆘 Support

For issues and questions:

  1. Check the troubleshooting section above
  2. Review existing issues in the repository
  3. Check service logs for error details
  4. Create a new issue with:
    • Document type and size
    • Error messages and logs
    • Environment configuration
    • Steps to reproduce

🔮 Roadmap

Near Term

  • Enhanced table extraction and formatting
  • Batch processing capabilities for multiple documents
  • Performance optimizations for large documents
  • Better error handling and retry mechanisms

Medium Term

  • Web interface for easy document upload and processing
  • API documentation with OpenAPI/Swagger
  • Support for additional document formats (RTF, ODT, etc.)
  • Multi-language document support

Long Term

  • Real-time processing pipeline
  • Advanced document understanding (relationships, semantics)
  • Custom model training capabilities
  • Enterprise integrations (SharePoint, Google Drive, etc.)
