A comprehensive, enterprise-ready document processing solution built on Azure that enables organizations to rapidly deploy and scale document processing workflows. This accelerator combines the power of Azure AI services, cloud-native architecture, and modern development practices to provide a complete platform for document ingestion, processing, and analysis.
This solution accelerator provides a production-ready foundation for building document processing applications on Azure. It includes:
- Modular Processing Pipeline: A flexible Python library (doc-proc-lib) for creating custom document processing workflows
 - Web-based Management UI: React + TypeScript frontend (doc-proc-web) for managing pipelines, services, and monitoring executions
 - Scalable Worker Architecture: High-performance background worker service (doc-proc-worker) with Azure Storage Queue integration
 - Data Source Crawler: Scalable background worker service for crawling and retrieving content from sources (doc-proc-crawler)
 - RESTful API Backend: FastAPI-based service (doc-proc-api) with Azure Cosmos DB for data persistence
 - Infrastructure as Code: Bicep templates for automated Azure deployment (doc-proc-deploy)
- 🏃‍♂️ Rapid Development: Get started with document processing in minutes, not months
 - 🔧 Highly Configurable: Modular architecture allows customization without core changes
 - ☁️ Cloud-Native: Built specifically for Azure with best practices and security in mind
 - 📊 Production-Ready: Includes monitoring, logging, error handling, and scalability features
 - 🔌 Extensible: Easy to integrate with existing systems and add custom processing steps
 - 💰 Cost-Effective: Optimized resource usage with serverless and managed services
 
The solution consists of the following building blocks:
The Processing Engine - A flexible Python library that serves as the heart of the document processing pipeline.
- Modular Architecture: Catalog-based configuration for services, steps, and pipelines
 - Azure Integration: Built-in connectors for Blob Storage, AI Document Intelligence, OpenAI, and more
 - Async Processing: High-performance asynchronous processing capabilities
 - Custom Components: Easy framework for building custom processing steps and service integrations
 - Environment Management: Comprehensive configuration management with environment-specific settings
 
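To make the catalog-based model concrete, here is a sketch of what a custom processing step could look like. The class shape and method names are illustrative assumptions, not doc-proc-lib's actual interfaces; a real step would be registered in step_catalog.yaml and wired into a pipeline via pipeline_config.yaml (file names taken from the repository layout below).

```python
# Illustrative sketch of a custom processing step; the class shape and
# method names are assumptions, not doc-proc-lib's actual interfaces.
from dataclasses import dataclass, field


@dataclass
class DocumentContext:
    """Assumed carrier object for a document moving through a pipeline."""
    blob_url: str
    text: str | None = None
    metadata: dict = field(default_factory=dict)


class NormalizeTitleStep:
    """Toy custom step: trims and title-cases a 'title' metadata field."""

    async def run(self, ctx: DocumentContext) -> DocumentContext:
        title = ctx.metadata.get("title")
        if title:
            ctx.metadata["title"] = title.strip().title()
        return ctx
```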
📖 View detailed documentation →
The Management Interface - A modern React-based web application for managing and monitoring document processing workflows.
- Technology Stack: React 18 + TypeScript + Vite
 - UI Framework: Radix UI components with Tailwind CSS styling
 - Features:
- Pipeline configuration and management
 - Service and step catalog administration
 - Real-time execution monitoring
 - Interactive workflow designer
 - Responsive design for desktop and mobile
 
 
📖 View detailed documentation →
The Backend API - FastAPI-based service providing RESTful APIs for document processing operations.
- Technology Stack: FastAPI + Python with Azure Cosmos DB
 - Features:
- RESTful API for all CRUD operations
 - Service catalog management
 - Step catalog management
 - Pipeline configuration and execution
 - Health monitoring and diagnostics
 - CORS-enabled for web client integration
 - Azure App Configuration integration
 - Comprehensive error handling and validation
 
 
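For a feel of the API surface, a minimal client sketch follows. The /health and /pipelines routes are assumptions for illustration (the actual routes live under app/routers); the base URL is the local development default described further below.

```python
# Minimal API client sketch. The /health and /pipelines routes are
# assumed for illustration; check the doc-proc-api docs for the real paths.
import httpx

BASE_URL = "http://localhost:8090"  # local development default

with httpx.Client(base_url=BASE_URL, timeout=10.0) as client:
    # Probe the health endpoint mentioned under health monitoring.
    print("API status:", client.get("/health").status_code)

    # List configured pipelines (hypothetical CRUD route).
    response = client.get("/pipelines")
    if response.status_code == 200:
        for pipeline in response.json():
            print("pipeline:", pipeline.get("name"))
```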
📖 View detailed documentation →
The Processing Worker - High-performance background processing service for executing document processing jobs at scale.
- Technology Stack: Python with Azure Storage Queue integration
 - Key Features:
- Asynchronous queue processing with Azure Storage Queues
 - Pipeline execution with step-by-step orchestration
 - Batch processing with progress tracking and error recovery
 - Multiprocessing support for CPU-intensive workloads
 - Fault tolerance with retry mechanisms and graceful degradation
 - Health monitoring with automatic restart capabilities
 - Cloud-native Azure integration
 
 
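The core consumption loop might look roughly like the following, using the azure-storage-queue SDK directly. The queue name and message schema here are assumptions; the actual worker layers pipeline orchestration, retries, and health monitoring on top of this pattern.

```python
# Minimal queue-polling sketch with azure-storage-queue. Queue name and
# message schema are assumptions; the actual worker adds orchestration,
# retry handling, and health monitoring around this loop.
import json
import time

from azure.identity import DefaultAzureCredential
from azure.storage.queue import QueueClient

queue = QueueClient(
    account_url="https://<storage-account>.queue.core.windows.net",
    queue_name="doc-processing-jobs",  # assumed queue name
    credential=DefaultAzureCredential(),
)

while True:
    for msg in queue.receive_messages(max_messages=16, visibility_timeout=300):
        job = json.loads(msg.content)
        print("processing document:", job.get("document_id"))
        # ... execute the configured pipeline against the document ...
        queue.delete_message(msg)  # acknowledge only after success
    time.sleep(5)  # back off when the queue is drained
```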
📖 View detailed documentation →
The Document Discovery Engine - Intelligent distributed crawler service for automated document discovery and ingestion from various sources.
- Technology Stack: Python with Azure Cosmos DB and distributed coordination
 - Key Features:
- Distributed coordination with lease-based conflict prevention
 - Automatic source discovery from Cosmos DB configuration
 - Multi-source support (file systems, cloud storage, databases, APIs)
 - Intelligent load balancing across multiple deployment instances
 - Self-healing operations with automatic restart and error recovery
 - Configurable crawling schedules and polling intervals
 - Metadata extraction and content indexing
 - Integration with document processing pipelines
 
 
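The lease-based coordination can be pictured as optimistic concurrency over a lease document in Cosmos DB: an instance only crawls a source while it holds an unexpired lease. The sketch below is a simplified illustration (container name and lease schema are assumptions, not the crawler's actual data model); ETag matching ensures that two instances can never both win the same lease.

```python
# Simplified lease-acquisition sketch over Cosmos DB; container name and
# lease schema are assumptions. ETag matching guarantees that concurrent
# instances cannot both acquire the same lease.
from datetime import datetime, timedelta, timezone

from azure.core import MatchConditions
from azure.cosmos import CosmosClient
from azure.cosmos.exceptions import CosmosHttpResponseError

client = CosmosClient(url="https://<cosmos-account>.documents.azure.com", credential="<key>")
leases = client.get_database_client("docproc").get_container_client("leases")


def try_acquire_lease(source_id: str, instance_id: str, ttl_seconds: int = 120) -> bool:
    """Return True if this instance now owns the lease for source_id."""
    lease = leases.read_item(item=source_id, partition_key=source_id)
    now = datetime.now(timezone.utc)
    if lease["owner"] and datetime.fromisoformat(lease["expires_at"]) > now:
        return False  # another instance holds a live lease
    lease["owner"] = instance_id
    lease["expires_at"] = (now + timedelta(seconds=ttl_seconds)).isoformat()
    try:
        # Replace succeeds only if nobody changed the document since we read it.
        leases.replace_item(
            item=source_id,
            body=lease,
            etag=lease["_etag"],
            match_condition=MatchConditions.IfNotModified,
        )
        return True
    except CosmosHttpResponseError:
        return False  # lost the race to another instance
```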
📖 View detailed documentation →
Infrastructure as Code - Automated deployment templates and scripts for Azure resources.
- Bicep Templates: Complete infrastructure provisioning with Azure Bicep
 - Automated Scripts: Shell scripts for streamlined deployment process
 - Resource Provisioning: Container Apps, Container Registry, Cosmos DB, Storage Accounts, App Configuration
 - Environment Configuration: Support for multiple environments (dev, staging, prod)
 - CI/CD Ready: Scripts designed for integration with automated pipelines
 
📖 View deployment documentation →
This solution accelerator can be applied across various industries and document processing workflows. Below are common use cases with domain-specific examples and configuration patterns.
Process diverse document types and formats in a unified workflow with intelligent format detection and specialized extraction:
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│  Mixed Content  │───▶│  Format         │───▶│  Specialized    │───▶│  Unified Data   │
│  Repository     │    │  Detection      │    │  Extraction     │    │  Structure      │
└─────────────────┘    └─────────────────┘    └─────────────────┘    └─────────────────┘
Supported Document Types:
- Structured PDFs: Forms, invoices, contracts with precise field extraction
 - Scanned Images: Historical documents, handwritten forms with OCR processing
 - Office Documents: Word, Excel, PowerPoint with native content extraction
 - Email Archives: Outlook PST/MSG files with attachment processing
 - Web Content: HTML pages, online forms with content scraping
 - Audio/Video: Meeting recordings, training materials with transcription
 
Intelligent Processing Pipeline:
- Format Detection: Automatic MIME type detection and content analysis
 - Content Routing: Route documents to specialized extraction services based on format
 - Cross-Reference Linking: Connect related documents across different formats
 - Metadata Harmonization: Standardize metadata across diverse document types
 - Quality Validation: Ensure extraction accuracy with confidence scoring
 
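A condensed sketch of the detection-and-routing idea described above; the route names are placeholders for the specialized extraction services (OCR, Office parsing, transcription):

```python
# Condensed format-routing sketch; the route names are placeholders
# for specialized extraction services (OCR, Office parsing, transcription).
import mimetypes


def route_document(path: str) -> str:
    """Pick an extraction route from the detected MIME type."""
    mime, _ = mimetypes.guess_type(path)
    mime = mime or "application/octet-stream"
    if mime == "application/pdf":
        return "pdf-field-extraction"
    if mime.startswith("image/"):
        return "ocr-processing"
    if mime.startswith(("audio/", "video/")):
        return "transcription"
    if mime in (
        "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
        "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    ):
        return "office-native-extraction"
    return "generic-text-extraction"


print(route_document("invoice.pdf"))  # -> pdf-field-extraction
print(route_document("chart.tiff"))   # -> ocr-processing
```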
Key Benefits:
- Format Agnostic: Single pipeline handles any document type
 - Intelligent Routing: Automatic selection of optimal processing methods
 - Preservation of Context: Maintain relationships between multi-format document sets
 - Scalable Processing: Parallel processing of different formats simultaneously
 
Streamline accounts payable workflows with automated invoice processing:
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   PDF/Email     │───▶│  Document AI    │───▶│   Validation    │───▶│   ERP System    │
│   Invoice       │    │   Extraction    │    │   & Approval    │    │   Integration   │
└─────────────────┘    └─────────────────┘    └─────────────────┘    └─────────────────┘
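The extraction stage of this flow maps naturally onto Azure AI Document Intelligence's prebuilt invoice model. Below is a minimal sketch using the azure-ai-formrecognizer SDK; the endpoint and key are placeholders, and in the accelerator this would typically be a configured service in the catalog rather than a direct SDK call.

```python
# Minimal invoice-extraction sketch with the prebuilt invoice model.
# Endpoint/key are placeholders; in the accelerator this would usually be
# a configured catalog service rather than a direct SDK call.
from azure.ai.formrecognizer import DocumentAnalysisClient
from azure.core.credentials import AzureKeyCredential

client = DocumentAnalysisClient(
    endpoint="https://<doc-intelligence>.cognitiveservices.azure.com",
    credential=AzureKeyCredential("<key>"),
)

with open("invoice.pdf", "rb") as f:
    poller = client.begin_analyze_document("prebuilt-invoice", document=f)
result = poller.result()

for invoice in result.documents:
    vendor = invoice.fields.get("VendorName")
    total = invoice.fields.get("InvoiceTotal")
    if vendor:
        print("vendor:", vendor.value, "confidence:", vendor.confidence)
    if total:
        print("total:", total.value)
```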
Accelerate loan application processing with document verification:
- Bank Statements: Extract transaction history, balance verification, income calculation
 - Tax Returns: Parse tax forms, verify income sources, calculate debt-to-income ratios
 - Employment Letters: Extract salary information, employment status, tenure
 
Enterprise-scale document discovery and ingestion with intelligent coordination:
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   File Shares   │───▶│   Distributed   │───▶│   Document      │───▶│   Processing    │
│   Cloud Storage │    │    Crawler      │    │   Queue         │    │   Pipeline      │
│   API Endpoints │    │   Discovery     │    │   Management    │    │   Execution     │
└─────────────────┘    └─────────────────┘    └─────────────────┘    └─────────────────┘
Key Benefits:
- Multi-Source Support: Automatically discover and ingest from file systems, SharePoint, cloud storage, databases, and REST APIs
 - Distributed Coordination: Multiple crawler instances work together with lease-based coordination to prevent duplicates
 - Smart Scheduling: Configurable polling intervals and change detection for efficient resource usage
 - Self-Healing: Automatic restart of failed processes and graceful handling of source availability
 
Transform paper-based medical records into structured digital formats:
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│  Scanned Chart  │───▶│      OCR +      │───▶│   FHIR Data     │───▶│      EHR        │
│   Documents     │    │   Medical AI    │    │   Mapping       │    │   Integration   │
└─────────────────┘    └─────────────────┘    └─────────────────┘    └─────────────────┘
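As a sketch of the FHIR mapping stage, the fragment below turns a few OCR-extracted fields into a minimal FHIR R4 Patient resource. The input field names are assumptions about what the extraction step produces:

```python
# Toy FHIR-mapping sketch: OCR output (assumed field names) to a minimal
# FHIR R4 Patient resource, ready for submission to an EHR's FHIR API.
import json


def to_fhir_patient(extracted: dict) -> dict:
    """Map extracted chart fields onto a minimal FHIR Patient resource."""
    return {
        "resourceType": "Patient",
        "name": [{"family": extracted["last_name"], "given": [extracted["first_name"]]}],
        "birthDate": extracted["date_of_birth"],  # expected as YYYY-MM-DD
        "identifier": [
            {"system": "urn:example:mrn", "value": extracted["medical_record_number"]}
        ],
    }


ocr_fields = {
    "first_name": "Jane",
    "last_name": "Doe",
    "date_of_birth": "1980-04-12",
    "medical_record_number": "MRN-0042",
}
print(json.dumps(to_fhir_patient(ocr_fields), indent=2))
```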
Process research documents and patient data for clinical trials:
- Consent Forms: Extract patient consent status, trial parameters
 - Case Report Forms: Structure clinical observations and measurements
 - Adverse Event Reports: Parse safety data for regulatory compliance
 
Automate contract review processes with AI-powered analysis:
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Contract      │───▶│  Clause         │───▶│  Risk Analysis  │───▶│  Review         │
│   Document      │    │  Extraction     │    │  & Compliance   │    │  Dashboard      │
└─────────────────┘    └─────────────────┘    └─────────────────┘    └─────────────────┘
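The clause-extraction stage could be built on the OpenAI connector mentioned in the processing library's Azure integrations. A hedged sketch against an Azure OpenAI deployment follows; the deployment name, API version, and prompt are illustrative only:

```python
# Hedged clause-extraction sketch against an Azure OpenAI deployment;
# deployment name, API version, and prompt are illustrative assumptions.
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<aoai-resource>.openai.azure.com",
    api_key="<key>",
    api_version="2024-06-01",
)

with open("contract.txt") as f:
    contract_text = f.read()

response = client.chat.completions.create(
    model="gpt-4o",  # name of your model deployment
    messages=[
        {
            "role": "system",
            "content": "Extract termination, liability, and indemnification "
            "clauses. Reply as a JSON object keyed by clause type.",
        },
        {"role": "user", "content": contract_text},
    ],
    response_format={"type": "json_object"},
)
print(response.choices[0].message.content)
```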
Process large volumes of documents for litigation support:
- Email Processing: Extract metadata, identify privileged communications
 - Document Classification: Categorize documents by relevance and privilege
 - Redaction: Automatically redact sensitive information
 
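As an illustration of the redaction step, here is a toy pattern-based pass; a production pipeline would pair patterns like these with an AI-based PII detection service such as Azure AI Language:

```python
# Toy pattern-based redaction sketch; production redaction would combine
# patterns like these with an AI-based PII detection service.
import re

PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}


def redact(text: str) -> str:
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[REDACTED {label}]", text)
    return text


print(redact("Contact john.doe@example.com, SSN 123-45-6789."))
# -> Contact [REDACTED EMAIL], SSN [REDACTED SSN].
```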
Process inspection reports, certificates, and compliance documents:
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│  Inspection     │───▶│  Data           │───▶│  Compliance     │───▶│  Quality        │
│  Reports        │    │  Extraction     │    │  Verification   │    │  Dashboard      │
└─────────────────┘    └─────────────────┘    └─────────────────┘    └─────────────────┘
- Certificates of Compliance: Verify supplier certifications and standards
 - Material Safety Data Sheets: Extract safety information for regulatory compliance
 - Purchase Orders: Process and validate supplier documentation
 
Before getting started, ensure you have the following:
Deployment/Development Environment:
- Python 3.11+ with pip
 - Node.js 18+ with npm (required for building front-end)
 - Docker Desktop (optional, for containerized development)
 - Git for version control
 - Azure CLI (required for Azure deployment)
 - PowerShell 7.0 or higher (if using PowerShell scripts for deployment)
 - bash (if using bash scripts for deployment)
 
Azure Resources:
- Azure subscription with appropriate permissions
 - Resource group with required permissions (Contributor and User Access Administrator)
 - Azure AI Model Deployment Quota for AI services
 
Choose the deployment method that best fits your needs:
Production-ready deployment on Azure with full scalability:
Using PowerShell for deployment
# 0. Clone the repository
git clone https://github.com/Azure/doc-proc-solution-accelerator.git
cd doc-proc-solution-accelerator
# 1. Deploy Azure infrastructure (AI Foundry, Container Apps, Cosmos DB, Storage Account, etc.)
pwsh .\doc-proc-deploy\DeployAzureInfra.ps1 -ResourceGroup myResourceGroup -Location westus -p docproc
# 2. Build and push Docker images to Azure Container Registry
pwsh .\doc-proc-deploy\BuildAndPushImages.ps1 -Registry myregistry.azurecr.io -Tag latest
# 3. Deploy applications to Azure Container Apps
pwsh .\doc-proc-deploy\DeployApps.ps1 -ResourceGroup myResourceGroup
Using shell scripts for deployment
# 0. Clone the repository
git clone https://github.com/Azure/doc-proc-solution-accelerator.git
cd doc-proc-solution-accelerator
# 1. Deploy Azure infrastructure (AI Foundry, Container Apps, Cosmos DB, Storage Account, etc.)
./doc-proc-deploy/deploy-azure-infra.sh -g myResourceGroup -l westus -p docproc
# 2. Build and push Docker images to Azure Container Registry
./doc-proc-deploy/build-and-push-images.sh -r myregistry.azurecr.io -t latest
# 3. Deploy applications to Azure Container Apps
./doc-proc-deploy/deploy-apps.sh -g myResourceGroup
This creates:
- Azure Container Registry for storing Docker images
 - Azure Container Apps for hosting API, Web, worker, and crawler services
 - Azure Cosmos DB for configuration data persistence
 - Azure Storage Account for queue management and blob storage
 - Azure App Configuration for centralized configuration management
 - Azure AI Foundry for AI Services
 
Once the Azure resources are deployed, you can run the solution services locally for development:
cd doc-proc-solution-accelerator
# Configure environment variables for each service
# Copy .env.example to .env and update with your Azure resource endpoints
cp doc-proc-api\.env.example doc-proc-api\.env
cp doc-proc-worker\.env.example doc-proc-worker\.env
cp doc-proc-crawler\.env.example doc-proc-crawler\.env
cp doc-proc-web\.env.example doc-proc-web\.env
# Edit the .env files with your Azure resource information:
# doc-proc-api\.env - Add Azure App Configuration endpoint
# doc-proc-worker\.env - Add Azure App Configuration endpoint
# doc-proc-crawler\.env - Add Azure App Configuration endpoint
# doc-proc-web\.env - Update API base URL if different from http://localhost:8090
# Start all services locally with auto-reload
pwsh .\doc-proc-deploy\StartServicesLocally.ps1
#./doc-proc-deploy/start-services-locally.sh  # if using shell
Required Configuration Values:
- AZURE_APP_CONFIG_ENDPOINT: Endpoint URL (format: https://<app-config-name>.azconfig.io)
 - VITE_API_BASE_URL: API endpoint for the web application (default: http://localhost:8090)
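Filled in, the relevant .env entries look like this (placeholder values shown):

```
# doc-proc-api/.env, doc-proc-worker/.env, doc-proc-crawler/.env
AZURE_APP_CONFIG_ENDPOINT=https://<app-config-name>.azconfig.io

# doc-proc-web/.env (only needed if the API is not at the default address)
VITE_API_BASE_URL=http://localhost:8090
```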
This will start:
- API Server: http://localhost:8090 (FastAPI backend)
 - Web Application: http://localhost:8080 (React frontend)
 - Worker Service: Background processing service
 - Crawler Service: Document discovery and ingestion service
 
💡 For detailed instructions and additional options, see the comprehensive deployment guide →
Container Apps are configured with intelligent auto-scaling:
- HTTP-based scaling: Scales based on concurrent requests
 - Queue-based scaling: Worker scales based on queue depth
 - CPU/Memory scaling: Scales based on resource utilization
 
All services include comprehensive health monitoring:
- Health check endpoints for Container Apps
 - Application Insights integration for telemetry
 - Log Analytics workspace for centralized logging
 
- Managed Identity: Services use managed identities for Azure resource access
 - Network isolation: Container Apps Environment with virtual network integration
 - HTTPS enforcement: All endpoints secured with SSL/TLS
 
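In code, the managed-identity pattern means services never handle connection strings or keys. A minimal sketch with azure-identity, where DefaultAzureCredential resolves to the Container App's managed identity in Azure and to your Azure CLI login during local development (the storage account name is a placeholder):

```python
# Managed-identity access sketch: DefaultAzureCredential resolves to the
# Container App's managed identity in Azure, and to your local Azure CLI
# login during development; no keys or connection strings in code.
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

credential = DefaultAzureCredential()
blobs = BlobServiceClient(
    account_url="https://<storage-account>.blob.core.windows.net",
    credential=credential,
)
for container in blobs.list_containers():
    print(container.name)
```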
doc-proc-solution-accelerator/
├── doc-proc-lib/              # 🔧 Core processing library and pipeline engine
│   ├── doc/                   # Processing modules and components
│   ├── examples/              # Example pipelines and usage patterns
│   ├── tests/                 # Unit and integration tests
│   ├── pipeline_config.yaml   # Pipeline configuration examples
│   ├── service_catalog.yaml   # Service definitions and configurations
│   └── step_catalog.yaml      # Step definitions and configurations
├── doc-proc-api/              # 🚀 FastAPI backend service
│   ├── app/                   # Application code
│   │   ├── db/               # Database models and operations
│   │   ├── models/           # Pydantic models and schemas
│   │   ├── routers/          # API route handlers
│   │   └── services/         # Business logic services
│   ├── infra/                # Infrastructure configuration for API
│   ├── Dockerfile            # Container configuration
│   └── requirements.txt      # Python dependencies
├── doc-proc-web/              # 🎨 React + TypeScript frontend
│   ├── src/                  # Source code
│   │   ├── components/       # Reusable UI components
│   │   ├── pages/           # Application pages
│   │   ├── services/        # API integration services
│   │   └── types/           # TypeScript type definitions
│   ├── infra/               # Infrastructure configuration for web
│   ├── Dockerfile           # Container configuration
│   └── package.json         # Node.js dependencies
├── doc-proc-worker/           # ⚡ Background processing worker
│   ├── app/                  # Worker application code
│   ├── demo/                 # Demo scripts and examples
│   ├── infra/               # Infrastructure configuration for worker
│   ├── tmp/                 # Temporary processing files
│   ├── Dockerfile           # Container configuration
│   └── requirements.txt     # Python dependencies
├── doc-proc-crawler/          # 🔍 Document discovery and crawling service
│   ├── app/                  # Crawler application code
│   │   ├── discovery/        # Distributed coordination and source discovery
│   │   ├── sources/          # Source connectors (filesystem, cloud, API, etc.)
│   │   ├── models/           # Data models for crawling and coordination
│   │   └── proxy/            # Azure service integration proxies
│   ├── infra/               # Infrastructure configuration for crawler
│   ├── DISTRIBUTED_ARCHITECTURE.md  # Distributed coordination documentation
│   ├── Dockerfile           # Container configuration
│   ├── run_crawler.py       # Main crawler entry point
│   └── requirements.txt     # Python dependencies
├── doc-proc-deploy/           # 🏗️ Infrastructure as Code and deployment
│   ├── infra/
│   │   └── bicep/           # Azure Bicep templates
│   │       ├── main.bicep   # Main infrastructure template
│   │       └── modules/     # Reusable Bicep modules
│   ├── deploy-azure-infra.sh     # Deploy infrastructure script
│   ├── build-and-push-images.sh  # Build and push Docker images
│   ├── deploy-apps.sh            # Deploy applications script
│   ├── start-services-locally.sh # Local development setup
│   └── DEPLOYMENT.md             # Detailed deployment guide
├── bicepconfig.json          # Bicep configuration
├── logo.svg                  # Solution logo
├── LICENSE                   # MIT license
└── README.md                 # This documentation
Some of the great features planned for the next release:
- Deployment aligned with Zero Trust Architecture best practices, including virtual network (VNet) integration.
 - Pipelines for managing document deletion.
 - Additional sources, steps, and services for more use cases.
 
We welcome contributions! Please see our Contributing Guidelines for details on how to:
- Submit bug reports and feature requests
 - Set up your development environment
 - Submit pull requests
 - Follow our coding standards
 
This project is licensed under the MIT License - see the LICENSE file for details.
- Documentation: Detailed guides in each component's README
 - Issues: Report bugs and request features via GitHub Issues
 - Discussions: Community discussions and Q&A in GitHub Discussions
 
⚡ Ready to get started? Follow the Getting Started guide above or dive deep into the doc-proc-lib documentation.