An intelligent Python agent that processes policy documents (PDFs) and automatically generates hierarchical decision trees with eligibility questions. The agent is completely policy-agnostic and can handle insurance, legal, regulatory, corporate, healthcare, and financial policy documents.
- PDF Processing: Extracts text from PDFs using multiple methods (pdfplumber, PyPDF2, OCR)
- Document Intelligence: Automatically understands document type, structure, and complexity
- Smart Chunking: Intelligently breaks large documents into processable chunks while respecting semantic boundaries
- Policy Extraction: Extracts policies, sub-policies, conditions, and definitions with source traceability
- Decision Tree Generation: Converts policy conditions into interactive eligibility questions
- Comprehensive Validation: Validates generated trees for completeness, consistency, and coverage
- Real-time Streaming: Provides real-time progress updates via Server-Sent Events
- Redis Caching: Efficient state management and result caching
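The multi-method PDF extraction with graceful fallback can be sketched as a chain that tries each method in order and moves on when one fails or returns no text. The helper names and ordering below are illustrative assumptions, not the project's actual code; real extractors would wrap pdfplumber, PyPDF2, and pytesseract.

```python
from typing import Callable, Sequence

def extract_text(pdf_path: str,
                 extractors: Sequence[tuple[str, Callable[[str], str]]]) -> tuple[str, str]:
    """Try extractors in order; return (method_name, text) from the first that yields non-empty text."""
    for name, extractor in extractors:
        try:
            text = extractor(pdf_path)
        except Exception:
            continue  # this method failed; fall through to the next
        if text and text.strip():
            return name, text
    raise ValueError(f"all extraction methods failed for {pdf_path}")

# Demo with stand-in extractors (hypothetical; real ones would call the libraries above):
def broken(path):  # simulates pdfplumber failing on a scanned file
    raise RuntimeError("no text layer")

def empty(path):   # simulates PyPDF2 returning nothing useful
    return ""

def ocr(path):     # simulates OCR succeeding (slowest; last resort)
    return "Policy text recovered by OCR"

method, text = extract_text("scan.pdf", [("pdfplumber", broken), ("PyPDF2", empty), ("OCR", ocr)])
print(method, "->", text)  # OCR -> Policy text recovered by OCR
```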
✅ Policy-Agnostic: Works with any type of policy document
✅ Source Traceability: Every extraction includes exact source references
✅ Confidence Scoring: All extractions include confidence scores
✅ Hierarchical Processing: Handles complex nested policy structures
✅ Multiple Question Types: Supports yes/no, multiple choice, numeric ranges, text input, dates, and currency
✅ Validation System: Ensures accuracy and completeness
✅ Streaming API: Real-time progress updates
✅ Error Handling: Robust retry logic and error recovery
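Given that questions carry a type, a source reference, and a confidence score, the underlying data model plausibly resembles the sketch below. Field names are illustrative assumptions, shown here with stdlib dataclasses rather than the project's actual Pydantic models.

```python
from dataclasses import dataclass, field
from enum import Enum

class QuestionType(str, Enum):
    YES_NO = "yes_no"
    MULTIPLE_CHOICE = "multiple_choice"
    NUMERIC_RANGE = "numeric_range"
    TEXT = "text"
    DATE = "date"
    CURRENCY = "currency"

@dataclass
class Question:
    text: str
    question_type: QuestionType
    source: str                     # exact source reference, e.g. "page 4, Coverage Eligibility"
    confidence: float = 1.0         # extraction confidence score in [0, 1]
    options: list[str] = field(default_factory=list)  # populated for multiple choice

q = Question(
    text="Are you between 18 and 65 years old?",
    question_type=QuestionType.YES_NO,
    source="page 4, Coverage Eligibility",
    confidence=0.92,
)
print(q.question_type.value)  # yes_no
```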
src/
├── core/
│ ├── models.py # Pydantic data models
│ └── orchestrator.py # Main processing orchestrator
├── processors/
│ ├── pdf_processor.py # PDF text extraction
│ ├── document_analyzer.py # Document intelligence
│ └── chunking_strategy.py # Smart chunking
├── extractors/
│ └── policy_extractor.py # Policy extraction engine
├── generators/
│ └── decision_tree_generator.py # Decision tree generation
├── validators/
│ └── tree_validator.py # Tree validation
├── api/
│ └── main.py # FastAPI application
└── utils/
├── redis_manager.py # Redis integration
└── logging_config.py # Structured logging
- Python 3.11+
- Redis server
- Tesseract OCR (for scanned PDFs)
- OpenAI API key
- Clone the repository:

```bash
git clone <repository-url>
cd PriorityProcessingAgent
```

- Install dependencies:

```bash
pip install -r requirements.txt
```

- Install Tesseract OCR (for scanned PDFs):

Ubuntu/Debian:

```bash
sudo apt-get install tesseract-ocr
```

macOS:

```bash
brew install tesseract
```

Windows: Download from: https://github.com/UB-Mannheim/tesseract/wiki
- Start Redis:

```bash
# Using Docker
docker run -d -p 6379:6379 redis:latest

# Or install locally
# Ubuntu: sudo apt-get install redis-server
# macOS: brew install redis
```

- Configure environment:

```bash
cp .env.example .env
# Edit .env and add your OpenAI API key and other settings
```

Required environment variables:

```
OPENAI_API_KEY=your_api_key_here
REDIS_HOST=localhost
REDIS_PORT=6379
```

- Start the server:

```bash
python -m src.api.main
```

The API will be available at http://localhost:8000
API Documentation: http://localhost:8000/docs
POST /api/process

Submit a document for processing:

```bash
curl -X POST "http://localhost:8000/api/process" \
  -H "Content-Type: application/json" \
  -d '{
    "document_base64": "<base64_encoded_pdf>",
    "document_id": "optional-custom-id"
  }'
```

Response:

```json
{
  "document_id": "abc-123",
  "status": "processing",
  "message": "Document processing started",
  "stream_url": "/api/process/abc-123/stream"
}
```

GET /api/process/{document_id}/stream
Stream real-time progress updates:

```bash
curl -N "http://localhost:8000/api/process/abc-123/stream"
```

Server-Sent Events stream example:

```
event: progress
data: {"status": "analyzing", "progress_percentage": 15.0, "message": "Analyzing document..."}

event: progress
data: {"status": "extracting", "progress_percentage": 45.0, "message": "Extracting policies..."}

event: complete
data: {"message": "Stream ended"}
```
GET /api/process/{document_id}/result

Retrieve the complete processing result:

```bash
curl "http://localhost:8000/api/process/abc-123/result"
```

GET /api/process/{document_id}/status

Check current processing status:

```bash
curl "http://localhost:8000/api/process/abc-123/status"
```

POST /api/process/sync

Process and wait for completion (not recommended for large documents):

```bash
curl -X POST "http://localhost:8000/api/process/sync" \
  -H "Content-Type: application/json" \
  -d '{
    "document_base64": "<base64_encoded_pdf>"
  }'
```

Python client example:

```python
import base64
import requests
import json
import sseclient

# Read and encode PDF
with open("policy.pdf", "rb") as f:
    pdf_base64 = base64.b64encode(f.read()).decode()

# Submit for processing
response = requests.post(
    "http://localhost:8000/api/process",
    json={"document_base64": pdf_base64}
)
document_id = response.json()["document_id"]
print(f"Processing document: {document_id}")

# Stream progress
stream_url = f"http://localhost:8000/api/process/{document_id}/stream"
response = requests.get(stream_url, stream=True)
client = sseclient.SSEClient(response)
for event in client.events():
    if event.event == "progress":
        data = json.loads(event.data)
        print(f"Progress: {data['progress_percentage']}% - {data['message']}")
    elif event.event == "complete":
        break

# Get final result
result = requests.get(f"http://localhost:8000/api/process/{document_id}/result")
print(json.dumps(result.json(), indent=2))
```

The processing result includes:
- Document type (insurance, legal, regulatory, etc.)
- Title, version, effective date
- Page count and complexity score
- All extracted policies and sub-policies
- Policy conditions with types (eligibility, exclusion, requirement, etc.)
- Definitions and related terms
- Source references for every policy
- Interactive decision trees for each policy
- Questions with appropriate types (yes/no, multiple choice, numeric, etc.)
- Logical branching based on answers
- All possible outcomes
- Source references for every question
- Completeness scores (all conditions covered)
- Consistency scores (logical integrity)
- Coverage scores (all paths lead to outcomes)
- List of issues and recommendations
- Reprocessing flags for low-quality extractions
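A coverage check of the kind described (every path leads to an outcome) can be sketched as a walk over the nodes map. The node shape follows this document's result format; the function itself is an illustrative assumption, not the project's validator, and it assumes generated trees are acyclic.

```python
def all_paths_reach_outcomes(nodes: dict, root_id: str) -> bool:
    """Return True if every path from the root ends at an outcome node.

    Each node is a dict with "node_type" ("question" or "outcome") and,
    for questions, a "next_node_map" of answer -> next node id.
    """
    stack, seen = [root_id], set()
    while stack:
        node_id = stack.pop()
        if node_id in seen:
            continue  # already checked (trees should be acyclic)
        seen.add(node_id)
        node = nodes.get(node_id)
        if node is None:
            return False          # dangling reference: path falls off the tree
        if node["node_type"] == "outcome":
            continue              # this path terminates correctly
        targets = node.get("next_node_map", {})
        if not targets:
            return False          # question with no branches: dead end
        stack.extend(targets.values())
    return True

nodes = {
    "node-1": {"node_type": "question",
               "next_node_map": {"yes": "node-2", "no": "outcome-1"}},
    "node-2": {"node_type": "question",
               "next_node_map": {"yes": "outcome-2", "no": "outcome-1"}},
    "outcome-1": {"node_type": "outcome"},
    "outcome-2": {"node_type": "outcome"},
}
print(all_paths_reach_outcomes(nodes, "node-1"))  # True
```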
```json
{
  "document_id": "abc-123",
  "metadata": {
    "document_type": "insurance",
    "title": "Health Insurance Policy",
    "total_pages": 25,
    "complexity_score": 0.75
  },
  "policy_hierarchy": {
    "total_policies": 12,
    "root_policies": ["policy-1", "policy-2"],
    "policies": {
      "policy-1": {
        "title": "Coverage Eligibility",
        "description": "Requirements for policy coverage",
        "conditions": [...],
        "source": {...}
      }
    }
  },
  "decision_trees": [
    {
      "tree_id": "tree-1",
      "policy_id": "policy-1",
      "title": "Decision Tree: Coverage Eligibility",
      "root_node_id": "node-1",
      "nodes": {
        "node-1": {
          "node_type": "question",
          "question": {
            "text": "Are you between 18 and 65 years old?",
            "question_type": "yes_no",
            "source": {...}
          },
          "next_node_map": {
            "yes": "node-2",
            "no": "outcome-1"
          }
        }
      }
    }
  ],
  "validation_results": [...],
  "overall_confidence": {
    "score": 0.87,
    "level": "high"
  }
}
```

All configuration is managed through environment variables in .env:
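A client can evaluate such a tree by following next_node_map from the root until it hits an outcome node. This is a minimal illustrative sketch, with answers supplied programmatically rather than interactively; the second question ("Are you a resident?") is invented for the demo.

```python
def walk_tree(nodes: dict, root_id: str, answers: dict) -> str:
    """Follow answers through the tree until an outcome node is reached."""
    node_id = root_id
    while nodes[node_id]["node_type"] == "question":
        question = nodes[node_id]["question"]["text"]
        answer = answers[question]                      # in practice, from user input
        node_id = nodes[node_id]["next_node_map"][answer]
    return node_id                                      # id of the outcome node

nodes = {
    "node-1": {
        "node_type": "question",
        "question": {"text": "Are you between 18 and 65 years old?"},
        "next_node_map": {"yes": "node-2", "no": "outcome-1"},
    },
    "node-2": {
        "node_type": "question",
        "question": {"text": "Are you a resident?"},  # hypothetical second question
        "next_node_map": {"yes": "outcome-2", "no": "outcome-1"},
    },
    "outcome-1": {"node_type": "outcome"},
    "outcome-2": {"node_type": "outcome"},
}
answers = {"Are you between 18 and 65 years old?": "yes", "Are you a resident?": "yes"}
print(walk_tree(nodes, "node-1", answers))  # outcome-2
```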
- OPENAI_API_KEY: Your OpenAI API key (required)
- OPENAI_MODEL_PRIMARY: Primary model (default: gpt-4o-mini)
- OPENAI_MODEL_COMPLEX: Model for complex sections (default: gpt-4o)
- OPENAI_MAX_RETRIES: Retry attempts (default: 3)
- REDIS_HOST: Redis host (default: localhost)
- REDIS_PORT: Redis port (default: 6379)
- REDIS_PASSWORD: Redis password (optional)
- REDIS_DEFAULT_TTL: Default TTL in seconds (default: 3600)
- MAX_CHUNK_SIZE: Maximum tokens per chunk (default: 3000)
- CHUNK_OVERLAP: Overlap between chunks (default: 200)
- CONFIDENCE_THRESHOLD: Minimum confidence score (default: 0.7)
- API_HOST: API host (default: 0.0.0.0)
- API_PORT: API port (default: 8000)
- LOG_LEVEL: Logging level (default: INFO)
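One way these variables are typically consumed is via os.environ with the documented defaults. This sketch is illustrative, not the project's actual settings module, and shows only a subset of the variables.

```python
import os

def load_settings() -> dict:
    """Read configuration from the environment, applying the documented defaults."""
    return {
        "openai_api_key": os.environ["OPENAI_API_KEY"],  # required: raises KeyError if unset
        "openai_model_primary": os.getenv("OPENAI_MODEL_PRIMARY", "gpt-4o-mini"),
        "redis_host": os.getenv("REDIS_HOST", "localhost"),
        "redis_port": int(os.getenv("REDIS_PORT", "6379")),
        "max_chunk_size": int(os.getenv("MAX_CHUNK_SIZE", "3000")),
        "chunk_overlap": int(os.getenv("CHUNK_OVERLAP", "200")),
        "confidence_threshold": float(os.getenv("CONFIDENCE_THRESHOLD", "0.7")),
    }

os.environ.setdefault("OPENAI_API_KEY", "sk-example")  # for demonstration only
settings = load_settings()
print(settings["redis_host"], settings["max_chunk_size"])
```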
Expected processing times:
- 5-10 pages: ~10-20 seconds
- 10-20 pages: ~20-40 seconds
- 20-30 pages: ~40-90 seconds
- 30+ pages: ~90+ seconds
Performance factors:
- Document complexity
- Number of policies and conditions
- PDF extraction method (OCR is slower)
- OpenAI API response times
The system includes comprehensive error handling:
- Retry Logic: Automatic retries with exponential backoff for API calls
- Graceful Degradation: Falls back to alternative PDF extraction methods
- Validation: Low-confidence extractions are flagged for review
- Detailed Logging: Structured logs for debugging
- Partial Results: Stores intermediate results for recovery
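Retry with exponential backoff of the kind described is typically shaped like the sketch below; the delays, exception handling, and injectable `sleep` parameter are illustrative assumptions, not the project's actual implementation.

```python
import time

def with_retries(call, max_retries: int = 3, base_delay: float = 1.0,
                 sleep=time.sleep):
    """Invoke call(), retrying on failure with exponentially increasing delays."""
    for attempt in range(max_retries + 1):
        try:
            return call()
        except Exception:
            if attempt == max_retries:
                raise                         # out of retries: surface the error
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ... between attempts

# Demo: fails twice, then succeeds (sleep stubbed out so the demo runs instantly).
attempts = []
def flaky():
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionError("transient API error")
    return "ok"

print(with_retries(flaky, sleep=lambda s: None))  # ok
```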
Run tests:

```bash
pytest tests/
```

Code quality tools:

```bash
# Format code
black src/

# Sort imports
isort src/

# Type checking
mypy src/

# Linting
flake8 src/
```

PriorityProcessingAgent/
├── src/ # Source code
├── tests/ # Test suites
├── config/ # Configuration
├── examples/ # Example usage
├── requirements.txt # Dependencies
└── README.md # Documentation
- Maximum recommended document size: 50 pages
- Requires clear, well-structured policy documents
- OCR quality depends on scan quality
- Processing time increases with document complexity
- Requires internet connection for OpenAI API
- Ensure Tesseract is installed for scanned PDFs
- Check PDF file is not corrupted
- Try converting PDF to text-based format if possible
- Verify Redis is running: `redis-cli ping`
- Check Redis host and port in `.env`
- Ensure firewall allows Redis connections
- Check OpenAI API key is valid
- Verify OpenAI API quota/limits
- Review logs for detailed error messages
- Document may be ambiguous or poorly structured
- Consider using gpt-4o for complex documents
- Review validation results for specific issues
[Your License Here]
[Contributing Guidelines]
For issues and questions:
- GitHub Issues: [Repository Issues]
- Documentation: [Full Documentation URL]
- Email: [Support Email]