Policy Document Agent

An intelligent Python agent that processes policy documents (PDFs) and automatically generates hierarchical decision trees with eligibility questions. The agent is completely policy-agnostic and can handle insurance, legal, regulatory, corporate, healthcare, and financial policy documents.

Features

Core Capabilities

  • PDF Processing: Extracts text from PDFs using multiple methods (pdfplumber, PyPDF2, OCR)
  • Document Intelligence: Automatically understands document type, structure, and complexity
  • Smart Chunking: Intelligently breaks large documents into processable chunks while respecting semantic boundaries (a simplified sketch follows this list)
  • Policy Extraction: Extracts policies, sub-policies, conditions, and definitions with source traceability
  • Decision Tree Generation: Converts policy conditions into interactive eligibility questions
  • Comprehensive Validation: Validates generated trees for completeness, consistency, and coverage
  • Real-time Streaming: Provides real-time progress updates via Server-Sent Events
  • Redis Caching: Efficient state management and result caching
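
The chunking step can be pictured as packing paragraphs into a token budget with a small overlap between chunks. The sketch below is a simplified illustration, not the repository's implementation: it approximates tokens by word count, splits only on paragraph breaks, and uses defaults that mirror MAX_CHUNK_SIZE and CHUNK_OVERLAP from the Configuration section.

# Simplified sketch of boundary-aware chunking: pack paragraphs into chunks
# under a token budget and carry a small overlap into the next chunk.
def chunk_text(text: str, max_tokens: int = 3000, overlap_tokens: int = 200) -> list[str]:
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    current_tokens = 0

    for para in paragraphs:
        para_tokens = len(para.split())  # crude stand-in for a real tokenizer
        if current and current_tokens + para_tokens > max_tokens:
            chunks.append("\n\n".join(current))
            # Carry the tail of the previous chunk forward as overlap.
            tail: list[str] = []
            tail_tokens = 0
            for prev in reversed(current):
                tail.insert(0, prev)
                tail_tokens += len(prev.split())
                if tail_tokens >= overlap_tokens:
                    break
            current, current_tokens = tail, tail_tokens
        current.append(para)
        current_tokens += para_tokens

    if current:
        chunks.append("\n\n".join(current))
    return chunks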

Key Features

✅ Policy-Agnostic: Works with any type of policy document
✅ Source Traceability: Every extraction includes exact source references
✅ Confidence Scoring: All extractions include confidence scores
✅ Hierarchical Processing: Handles complex nested policy structures
✅ Multiple Question Types: Supports yes/no, multiple choice, numeric ranges, text input, dates, and currency
✅ Validation System: Ensures accuracy and completeness
✅ Streaming API: Real-time progress updates
✅ Error Handling: Robust retry logic and error recovery

Architecture

src/
├── core/
│   ├── models.py           # Pydantic data models
│   └── orchestrator.py     # Main processing orchestrator
├── processors/
│   ├── pdf_processor.py    # PDF text extraction
│   ├── document_analyzer.py # Document intelligence
│   └── chunking_strategy.py # Smart chunking
├── extractors/
│   └── policy_extractor.py # Policy extraction engine
├── generators/
│   └── decision_tree_generator.py # Decision tree generation
├── validators/
│   └── tree_validator.py   # Tree validation
├── api/
│   └── main.py            # FastAPI application
└── utils/
    ├── redis_manager.py    # Redis integration
    └── logging_config.py   # Structured logging

Installation

Prerequisites

  • Python 3.11+
  • Redis server
  • Tesseract OCR (for scanned PDFs)
  • OpenAI API key

Setup

  1. Clone the repository:
git clone <repository-url>
cd PriorityProcessingAgent
  2. Install dependencies:
pip install -r requirements.txt
  3. Install Tesseract OCR (for scanned PDFs):

Ubuntu/Debian:

sudo apt-get install tesseract-ocr

macOS:

brew install tesseract

Windows: Download from: https://github.com/UB-Mannheim/tesseract/wiki

  4. Start Redis:
# Using Docker
docker run -d -p 6379:6379 redis:latest

# Or install locally
# Ubuntu: sudo apt-get install redis-server
# macOS: brew install redis
  5. Configure environment:
cp .env.example .env
# Edit .env and add your OpenAI API key and other settings

Required environment variables:

OPENAI_API_KEY=your_api_key_here
REDIS_HOST=localhost
REDIS_PORT=6379
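
Before starting the server, it can help to confirm that the required variables are set and that Redis is reachable. This optional check is not part of the project; it assumes the redis client library is installed and that the variables above are available in the environment (export them or load .env first).

# Optional sanity check: fail fast if the API key is missing, then ping Redis
# using the documented host/port settings.
import os

import redis  # Python Redis client; assumed to be installed

assert os.getenv("OPENAI_API_KEY"), "OPENAI_API_KEY is not set"

client = redis.Redis(
    host=os.getenv("REDIS_HOST", "localhost"),
    port=int(os.getenv("REDIS_PORT", "6379")),
)
print("Redis ping:", client.ping())  # should print True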

Usage

Starting the API Server

python -m src.api.main

The API will be available at http://localhost:8000

API Documentation: http://localhost:8000/docs

API Endpoints

1. Process Document (Async with Streaming)

POST /api/process

Submit a document for processing:

curl -X POST "http://localhost:8000/api/process" \
  -H "Content-Type: application/json" \
  -d '{
    "document_base64": "<base64_encoded_pdf>",
    "document_id": "optional-custom-id"
  }'

Response:

{
  "document_id": "abc-123",
  "status": "processing",
  "message": "Document processing started",
  "stream_url": "/api/process/abc-123/stream"
}

2. Stream Progress (SSE)

GET /api/process/{document_id}/stream

Stream real-time progress updates:

curl -N "http://localhost:8000/api/process/abc-123/stream"

Server-Sent Events stream example:

event: progress
data: {"status": "analyzing", "progress_percentage": 15.0, "message": "Analyzing document..."}

event: progress
data: {"status": "extracting", "progress_percentage": 45.0, "message": "Extracting policies..."}

event: complete
data: {"message": "Stream ended"}

3. Get Result

GET /api/process/{document_id}/result

Retrieve the complete processing result:

curl "http://localhost:8000/api/process/abc-123/result"

4. Get Status

GET /api/process/{document_id}/status

Check current processing status:

curl "http://localhost:8000/api/process/abc-123/status"

5. Synchronous Processing

POST /api/process/sync

Process and wait for completion (not recommended for large documents):

curl -X POST "http://localhost:8000/api/process/sync" \
  -H "Content-Type: application/json" \
  -d '{
    "document_base64": "<base64_encoded_pdf>"
  }'
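
The same call from Python, for small documents only. This sketch assumes the synchronous response mirrors the result structure described under Output Structure below.

import base64
import requests

# Encode a small PDF and process it synchronously (blocks until done).
with open("policy.pdf", "rb") as f:
    pdf_base64 = base64.b64encode(f.read()).decode()

response = requests.post(
    "http://localhost:8000/api/process/sync",
    json={"document_base64": pdf_base64},
    timeout=300,  # processing can take a while; see the Performance section
)
response.raise_for_status()
result = response.json()
print(result.get("overall_confidence"))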

Python Client Example

import base64
import requests
import json
import sseclient  # provided by the sseclient-py package

# Read and encode PDF
with open("policy.pdf", "rb") as f:
    pdf_base64 = base64.b64encode(f.read()).decode()

# Submit for processing
response = requests.post(
    "http://localhost:8000/api/process",
    json={"document_base64": pdf_base64}
)
document_id = response.json()["document_id"]
print(f"Processing document: {document_id}")

# Stream progress
stream_url = f"http://localhost:8000/api/process/{document_id}/stream"
response = requests.get(stream_url, stream=True)
client = sseclient.SSEClient(response)

for event in client.events():
    if event.event == "progress":
        data = json.loads(event.data)
        print(f"Progress: {data['progress_percentage']}% - {data['message']}")
    elif event.event == "complete":
        break

# Get final result
result = requests.get(f"http://localhost:8000/api/process/{document_id}/result")
print(json.dumps(result.json(), indent=2))

Output Structure

The processing result includes:

Document Metadata

  • Document type (insurance, legal, regulatory, etc.)
  • Title, version, effective date
  • Page count and complexity score

Policy Hierarchy

  • All extracted policies and sub-policies
  • Policy conditions with types (eligibility, exclusion, requirement, etc.)
  • Definitions and related terms
  • Source references for every policy

Decision Trees

  • Interactive decision trees for each policy
  • Questions with appropriate types (yes/no, multiple choice, numeric, etc.)
  • Logical branching based on answers
  • All possible outcomes
  • Source references for every question

Validation Results

  • Completeness scores (all conditions covered)
  • Consistency scores (logical integrity)
  • Coverage scores (all paths lead to outcomes)
  • List of issues and recommendations
  • Reprocessing flags for low-quality extractions

Example Output Structure:

{
  "document_id": "abc-123",
  "metadata": {
    "document_type": "insurance",
    "title": "Health Insurance Policy",
    "total_pages": 25,
    "complexity_score": 0.75
  },
  "policy_hierarchy": {
    "total_policies": 12,
    "root_policies": ["policy-1", "policy-2"],
    "policies": {
      "policy-1": {
        "title": "Coverage Eligibility",
        "description": "Requirements for policy coverage",
        "conditions": [...],
        "source": {...}
      }
    }
  },
  "decision_trees": [
    {
      "tree_id": "tree-1",
      "policy_id": "policy-1",
      "title": "Decision Tree: Coverage Eligibility",
      "root_node_id": "node-1",
      "nodes": {
        "node-1": {
          "node_type": "question",
          "question": {
            "text": "Are you between 18 and 65 years old?",
            "question_type": "yes_no",
            "source": {...}
          },
          "next_node_map": {
            "yes": "node-2",
            "no": "outcome-1"
          }
        }
      }
    }
  ],
  "validation_results": [...],
  "overall_confidence": {
    "score": 0.87,
    "level": "high"
  }
}
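
Given this structure, a client can walk a decision tree by presenting each question and following next_node_map until it reaches an outcome node. The sketch below relies only on the fields shown in the example above (root_node_id, nodes, node_type, question, next_node_map); anything beyond that is an assumption.

# Walk one decision tree from the result payload, asking each question on the
# console and following next_node_map until no further node is defined.
def walk_tree(tree: dict) -> None:
    nodes = tree["nodes"]
    node_id = tree["root_node_id"]
    while True:
        node = nodes.get(node_id)
        if node is None or node.get("node_type") != "question":
            # Anything that is not a question node is treated as an outcome here.
            print("Reached outcome:", node_id)
            return
        question = node["question"]
        answer = input(f"{question['text']} ").strip().lower()
        next_id = node.get("next_node_map", {}).get(answer)
        if next_id is None:
            print(f"No branch defined for answer {answer!r}; stopping.")
            return
        node_id = next_id

For example, walk_tree(result["decision_trees"][0]) on the sample above would start at the age question and follow the yes/no branches.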

Configuration

All configuration is managed through environment variables in .env:

OpenAI Settings

  • OPENAI_API_KEY: Your OpenAI API key (required)
  • OPENAI_MODEL_PRIMARY: Primary model (default: gpt-4o-mini)
  • OPENAI_MODEL_COMPLEX: Model for complex sections (default: gpt-4o)
  • OPENAI_MAX_RETRIES: Retry attempts (default: 3)

Redis Settings

  • REDIS_HOST: Redis host (default: localhost)
  • REDIS_PORT: Redis port (default: 6379)
  • REDIS_PASSWORD: Redis password (optional)
  • REDIS_DEFAULT_TTL: Default TTL in seconds (default: 3600)

Processing Settings

  • MAX_CHUNK_SIZE: Maximum tokens per chunk (default: 3000)
  • CHUNK_OVERLAP: Overlap between chunks (default: 200)
  • CONFIDENCE_THRESHOLD: Minimum confidence score (default: 0.7)

API Settings

  • API_HOST: API host (default: 0.0.0.0)
  • API_PORT: API port (default: 8000)
  • LOG_LEVEL: Logging level (default: INFO)
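
For reference, the variables and defaults above can be read in plain Python as follows. This is illustrative only; the project may use its own settings loader.

import os

# Read the documented environment variables, falling back to their defaults.
settings = {
    "openai_api_key": os.environ["OPENAI_API_KEY"],            # required, no default
    "openai_model_primary": os.getenv("OPENAI_MODEL_PRIMARY", "gpt-4o-mini"),
    "openai_model_complex": os.getenv("OPENAI_MODEL_COMPLEX", "gpt-4o"),
    "openai_max_retries": int(os.getenv("OPENAI_MAX_RETRIES", "3")),
    "redis_host": os.getenv("REDIS_HOST", "localhost"),
    "redis_port": int(os.getenv("REDIS_PORT", "6379")),
    "redis_password": os.getenv("REDIS_PASSWORD"),              # optional
    "redis_default_ttl": int(os.getenv("REDIS_DEFAULT_TTL", "3600")),
    "max_chunk_size": int(os.getenv("MAX_CHUNK_SIZE", "3000")),
    "chunk_overlap": int(os.getenv("CHUNK_OVERLAP", "200")),
    "confidence_threshold": float(os.getenv("CONFIDENCE_THRESHOLD", "0.7")),
    "api_host": os.getenv("API_HOST", "0.0.0.0"),
    "api_port": int(os.getenv("API_PORT", "8000")),
    "log_level": os.getenv("LOG_LEVEL", "INFO"),
}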

Performance

Expected processing times:

  • 5-10 pages: ~10-20 seconds
  • 10-20 pages: ~20-40 seconds
  • 20-30 pages: ~40-90 seconds
  • 30+ pages: ~90+ seconds

Performance factors:

  • Document complexity
  • Number of policies and conditions
  • PDF extraction method (OCR is slower)
  • OpenAI API response times

Error Handling

The system includes comprehensive error handling:

  • Retry Logic: Automatic retries with exponential backoff for API calls (a simplified sketch follows this list)
  • Graceful Degradation: Falls back to alternative PDF extraction methods
  • Validation: Low-confidence extractions are flagged for review
  • Detailed Logging: Structured logs for debugging
  • Partial Results: Stores intermediate results for recovery
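
The retry behaviour works along the lines of the following sketch, which is illustrative only and not the project's actual implementation: each failed call is retried with an exponentially growing delay plus jitter, up to a configured number of attempts (cf. OPENAI_MAX_RETRIES).

import random
import time

def call_with_retries(func, max_retries: int = 3, base_delay: float = 1.0):
    """Retry func with exponential backoff and jitter (illustrative only)."""
    for attempt in range(max_retries + 1):
        try:
            return func()
        except Exception as exc:
            if attempt == max_retries:
                raise
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
            print(f"Attempt {attempt + 1} failed ({exc}); retrying in {delay:.1f}s")
            time.sleep(delay)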

Development

Running Tests

pytest tests/

Code Quality

# Format code
black src/

# Sort imports
isort src/

# Type checking
mypy src/

# Linting
flake8 src/

Project Structure

PriorityProcessingAgent/
├── src/              # Source code
├── tests/            # Test suites
├── config/           # Configuration
├── examples/         # Example usage
├── requirements.txt  # Dependencies
└── README.md         # Documentation

Limitations

  • Maximum recommended document size: 50 pages
  • Requires clear, well-structured policy documents
  • OCR quality depends on scan quality
  • Processing time increases with document complexity
  • Requires internet connection for OpenAI API

Troubleshooting

PDF Extraction Issues

  • Ensure Tesseract is installed for scanned PDFs
  • Check that the PDF file is not corrupted
  • Try converting the PDF to a text-based format if possible

Redis Connection Issues

  • Verify Redis is running: redis-cli ping
  • Check Redis host and port in .env
  • Ensure firewall allows Redis connections

API Errors

  • Check that your OpenAI API key is valid
  • Verify OpenAI API quota/limits
  • Review logs for detailed error messages

Low Confidence Scores

  • Document may be ambiguous or poorly structured
  • Consider using gpt-4o for complex documents
  • Review validation results for specific issues

License

[Your License Here]

Contributing

[Contributing Guidelines]

Support

For issues and questions:

  • GitHub Issues: [Repository Issues]
  • Documentation: [Full Documentation URL]
  • Email: [Support Email]
