An intelligent Python agent that processes policy documents (PDFs) and automatically generates hierarchical decision trees with eligibility questions. The agent is completely policy-agnostic and can handle insurance, legal, regulatory, corporate, healthcare, and financial policy documents.
- PDF Processing: Extracts text from PDFs using multiple methods (pdfplumber, PyPDF2, OCR)
- Document Intelligence: Automatically understands document type, structure, and complexity
- Smart Chunking: Intelligently breaks large documents into processable chunks while respecting semantic boundaries
- Policy Extraction: Extracts policies, sub-policies, conditions, and definitions with source traceability
- Decision Tree Generation: Converts policy conditions into interactive eligibility questions
- Comprehensive Validation: Validates generated trees for completeness, consistency, and coverage
- Real-time Streaming: Provides real-time progress updates via Server-Sent Events
- Redis Caching: Efficient state management and result caching
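The multi-method PDF extraction with graceful fallback can be sketched as a chain that tries each method in order and moves on when one fails or returns no text. The helper names and ordering below are illustrative assumptions, not the project's actual code; real extractors would wrap pdfplumber, PyPDF2, and pytesseract.

```python
from typing import Callable, Sequence

def extract_text(pdf_path: str,
                 extractors: Sequence[tuple[str, Callable[[str], str]]]) -> tuple[str, str]:
    """Try extractors in order; return (method_name, text) from the first that yields non-empty text."""
    for name, extractor in extractors:
        try:
            text = extractor(pdf_path)
        except Exception:
            continue  # this method failed; fall through to the next
        if text and text.strip():
            return name, text
    raise ValueError(f"all extraction methods failed for {pdf_path}")

# Demo with stand-in extractors (hypothetical; real ones would call the libraries above):
def broken(path):  # simulates pdfplumber failing on a scanned file
    raise RuntimeError("no text layer")

def empty(path):   # simulates PyPDF2 returning nothing useful
    return ""

def ocr(path):     # simulates OCR succeeding (slowest; last resort)
    return "Policy text recovered by OCR"

method, text = extract_text("scan.pdf", [("pdfplumber", broken), ("PyPDF2", empty), ("OCR", ocr)])
print(method, "->", text)  # OCR -> Policy text recovered by OCR
```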
✅ Policy-Agnostic: Works with any type of policy document
✅ Source Traceability: Every extraction includes exact source references
✅ Confidence Scoring: All extractions include confidence scores
✅ Hierarchical Processing: Handles complex nested policy structures
✅ Multiple Question Types: Supports yes/no, multiple choice, numeric ranges, text input, dates, and currency
✅ Validation System: Ensures accuracy and completeness
✅ Streaming API: Real-time progress updates
✅ Error Handling: Robust retry logic and error recovery
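Given that questions carry a type, a source reference, and a confidence score, the underlying data model plausibly resembles the sketch below. Field names are illustrative assumptions, shown here with stdlib dataclasses rather than the project's actual Pydantic models.

```python
from dataclasses import dataclass, field
from enum import Enum

class QuestionType(str, Enum):
    YES_NO = "yes_no"
    MULTIPLE_CHOICE = "multiple_choice"
    NUMERIC_RANGE = "numeric_range"
    TEXT = "text"
    DATE = "date"
    CURRENCY = "currency"

@dataclass
class Question:
    text: str
    question_type: QuestionType
    source: str                     # exact source reference, e.g. "page 4, Coverage Eligibility"
    confidence: float = 1.0         # extraction confidence score in [0, 1]
    options: list[str] = field(default_factory=list)  # populated for multiple choice

q = Question(
    text="Are you between 18 and 65 years old?",
    question_type=QuestionType.YES_NO,
    source="page 4, Coverage Eligibility",
    confidence=0.92,
)
print(q.question_type.value)  # yes_no
```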
src/
├── core/
│ ├── models.py # Pydantic data models
│ └── orchestrator.py # Main processing orchestrator
├── processors/
│ ├── pdf_processor.py # PDF text extraction
│ ├── document_analyzer.py # Document intelligence
│ └── chunking_strategy.py # Smart chunking
├── extractors/
│ └── policy_extractor.py # Policy extraction engine
├── generators/
│ └── decision_tree_generator.py # Decision tree generation
├── validators/
│ └── tree_validator.py # Tree validation
├── api/
│ └── main.py # FastAPI application
└── utils/
├── redis_manager.py # Redis integration
└── logging_config.py # Structured logging
- Python 3.11+
- Redis server
- Tesseract OCR (for scanned PDFs)
- OpenAI API key
- Clone the repository:

```bash
git clone <repository-url>
cd PriorityProcessingAgent
```

- Install dependencies:

```bash
pip install -r requirements.txt
```

- Install Tesseract OCR (for scanned PDFs):

Ubuntu/Debian:

```bash
sudo apt-get install tesseract-ocr
```

macOS:

```bash
brew install tesseract
```

Windows: Download from: https://github.com/UB-Mannheim/tesseract/wiki
- Start Redis:

```bash
# Using Docker
docker run -d -p 6379:6379 redis:latest

# Or install locally
# Ubuntu: sudo apt-get install redis-server
# macOS: brew install redis
```

- Configure environment:

```bash
cp .env.example .env
# Edit .env and add your OpenAI API key and other settings
```

Required environment variables:

```
OPENAI_API_KEY=your_api_key_here
REDIS_HOST=localhost
REDIS_PORT=6379
```

- Start the server:

```bash
python -m src.api.main
```

The API will be available at http://localhost:8000
API Documentation: http://localhost:8000/docs
POST /api/process

Submit a document for processing:

```bash
curl -X POST "http://localhost:8000/api/process" \
  -H "Content-Type: application/json" \
  -d '{
    "document_base64": "<base64_encoded_pdf>",
    "document_id": "optional-custom-id"
  }'
```

Response:

```json
{
  "document_id": "abc-123",
  "status": "processing",
  "message": "Document processing started",
  "stream_url": "/api/process/abc-123/stream"
}
```

GET /api/process/{document_id}/stream
Stream real-time progress updates:

```bash
curl -N "http://localhost:8000/api/process/abc-123/stream"
```

Server-Sent Events stream example:

```
event: progress
data: {"status": "analyzing", "progress_percentage": 15.0, "message": "Analyzing document..."}

event: progress
data: {"status": "extracting", "progress_percentage": 45.0, "message": "Extracting policies..."}

event: complete
data: {"message": "Stream ended"}
```
GET /api/process/{document_id}/result

Retrieve the complete processing result:

```bash
curl "http://localhost:8000/api/process/abc-123/result"
```

GET /api/process/{document_id}/status

Check current processing status:

```bash
curl "http://localhost:8000/api/process/abc-123/status"
```

POST /api/process/sync

Process and wait for completion (not recommended for large documents):

```bash
curl -X POST "http://localhost:8000/api/process/sync" \
  -H "Content-Type: application/json" \
  -d '{
    "document_base64": "<base64_encoded_pdf>"
  }'
```

Python client example:

```python
import base64
import requests
import json
import sseclient

# Read and encode PDF
with open("policy.pdf", "rb") as f:
    pdf_base64 = base64.b64encode(f.read()).decode()

# Submit for processing
response = requests.post(
    "http://localhost:8000/api/process",
    json={"document_base64": pdf_base64}
)
document_id = response.json()["document_id"]
print(f"Processing document: {document_id}")

# Stream progress
stream_url = f"http://localhost:8000/api/process/{document_id}/stream"
response = requests.get(stream_url, stream=True)
client = sseclient.SSEClient(response)
for event in client.events():
    if event.event == "progress":
        data = json.loads(event.data)
        print(f"Progress: {data['progress_percentage']}% - {data['message']}")
    elif event.event == "complete":
        break

# Get final result
result = requests.get(f"http://localhost:8000/api/process/{document_id}/result")
print(json.dumps(result.json(), indent=2))
```

The processing result includes:
- Document type (insurance, legal, regulatory, etc.)
- Title, version, effective date
- Page count and complexity score
- All extracted policies and sub-policies
- Policy conditions with types (eligibility, exclusion, requirement, etc.)
- Definitions and related terms
- Source references for every policy
- Interactive decision trees for each policy
- Questions with appropriate types (yes/no, multiple choice, numeric, etc.)
- Logical branching based on answers
- All possible outcomes
- Source references for every question
- Completeness scores (all conditions covered)
- Consistency scores (logical integrity)
- Coverage scores (all paths lead to outcomes)
- List of issues and recommendations
- Reprocessing flags for low-quality extractions
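A coverage check of the kind described (every path leads to an outcome) can be sketched as a walk over the nodes map. The node shape follows this document's result format; the function itself is an illustrative assumption, not the project's validator, and it assumes generated trees are acyclic.

```python
def all_paths_reach_outcomes(nodes: dict, root_id: str) -> bool:
    """Return True if every path from the root ends at an outcome node.

    Each node is a dict with "node_type" ("question" or "outcome") and,
    for questions, a "next_node_map" of answer -> next node id.
    """
    stack, seen = [root_id], set()
    while stack:
        node_id = stack.pop()
        if node_id in seen:
            continue  # already checked (trees should be acyclic)
        seen.add(node_id)
        node = nodes.get(node_id)
        if node is None:
            return False          # dangling reference: path falls off the tree
        if node["node_type"] == "outcome":
            continue              # this path terminates correctly
        targets = node.get("next_node_map", {})
        if not targets:
            return False          # question with no branches: dead end
        stack.extend(targets.values())
    return True

nodes = {
    "node-1": {"node_type": "question",
               "next_node_map": {"yes": "node-2", "no": "outcome-1"}},
    "node-2": {"node_type": "question",
               "next_node_map": {"yes": "outcome-2", "no": "outcome-1"}},
    "outcome-1": {"node_type": "outcome"},
    "outcome-2": {"node_type": "outcome"},
}
print(all_paths_reach_outcomes(nodes, "node-1"))  # True
```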
```json
{
  "document_id": "abc-123",
  "metadata": {
    "document_type": "insurance",
    "title": "Health Insurance Policy",
    "total_pages": 25,
    "complexity_score": 0.75
  },
  "policy_hierarchy": {
    "total_policies": 12,
    "root_policies": ["policy-1", "policy-2"],
    "policies": {
      "policy-1": {
        "title": "Coverage Eligibility",
        "description": "Requirements for policy coverage",
        "conditions": [...],
        "source": {...}
      }
    }
  },
  "decision_trees": [
    {
      "tree_id": "tree-1",
      "policy_id": "policy-1",
      "title": "Decision Tree: Coverage Eligibility",
      "root_node_id": "node-1",
      "nodes": {
        "node-1": {
          "node_type": "question",
          "question": {
            "text": "Are you between 18 and 65 years old?",
            "question_type": "yes_no",
            "source": {...}
          },
          "next_node_map": {
            "yes": "node-2",
            "no": "outcome-1"
          }
        }
      }
    }
  ],
  "validation_results": [...],
  "overall_confidence": {
    "score": 0.87,
    "level": "high"
  }
}
```

All configuration is managed through environment variables in .env:
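A client can evaluate such a tree by following next_node_map from the root until it hits an outcome node. This is a minimal illustrative sketch, with answers supplied programmatically rather than interactively; the second question ("Are you a resident?") is invented for the demo.

```python
def walk_tree(nodes: dict, root_id: str, answers: dict) -> str:
    """Follow answers through the tree until an outcome node is reached."""
    node_id = root_id
    while nodes[node_id]["node_type"] == "question":
        question = nodes[node_id]["question"]["text"]
        answer = answers[question]                      # in practice, from user input
        node_id = nodes[node_id]["next_node_map"][answer]
    return node_id                                      # id of the outcome node

nodes = {
    "node-1": {
        "node_type": "question",
        "question": {"text": "Are you between 18 and 65 years old?"},
        "next_node_map": {"yes": "node-2", "no": "outcome-1"},
    },
    "node-2": {
        "node_type": "question",
        "question": {"text": "Are you a resident?"},  # hypothetical second question
        "next_node_map": {"yes": "outcome-2", "no": "outcome-1"},
    },
    "outcome-1": {"node_type": "outcome"},
    "outcome-2": {"node_type": "outcome"},
}
answers = {"Are you between 18 and 65 years old?": "yes", "Are you a resident?": "yes"}
print(walk_tree(nodes, "node-1", answers))  # outcome-2
```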
- OPENAI_API_KEY: Your OpenAI API key (required)
- OPENAI_MODEL_PRIMARY: Primary model (default: gpt-4o-mini)
- OPENAI_MODEL_COMPLEX: Model for complex sections (default: gpt-4o)
- OPENAI_MAX_RETRIES: Retry attempts (default: 3)
- REDIS_HOST: Redis host (default: localhost)
- REDIS_PORT: Redis port (default: 6379)
- REDIS_PASSWORD: Redis password (optional)
- REDIS_DEFAULT_TTL: Default TTL in seconds (default: 3600)
- MAX_CHUNK_SIZE: Maximum tokens per chunk (default: 3000)
- CHUNK_OVERLAP: Overlap between chunks (default: 200)
- CONFIDENCE_THRESHOLD: Minimum confidence score (default: 0.7)
- API_HOST: API host (default: 0.0.0.0)
- API_PORT: API port (default: 8000)
- LOG_LEVEL: Logging level (default: INFO)
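One way these variables are typically consumed is via os.environ with the documented defaults. This sketch is illustrative, not the project's actual settings module, and shows only a subset of the variables.

```python
import os

def load_settings() -> dict:
    """Read configuration from the environment, applying the documented defaults."""
    return {
        "openai_api_key": os.environ["OPENAI_API_KEY"],  # required: raises KeyError if unset
        "openai_model_primary": os.getenv("OPENAI_MODEL_PRIMARY", "gpt-4o-mini"),
        "redis_host": os.getenv("REDIS_HOST", "localhost"),
        "redis_port": int(os.getenv("REDIS_PORT", "6379")),
        "max_chunk_size": int(os.getenv("MAX_CHUNK_SIZE", "3000")),
        "chunk_overlap": int(os.getenv("CHUNK_OVERLAP", "200")),
        "confidence_threshold": float(os.getenv("CONFIDENCE_THRESHOLD", "0.7")),
    }

os.environ.setdefault("OPENAI_API_KEY", "sk-example")  # for demonstration only
settings = load_settings()
print(settings["redis_host"], settings["max_chunk_size"])
```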
Expected processing times:
- 5-10 pages: ~10-20 seconds
- 10-20 pages: ~20-40 seconds
- 20-30 pages: ~40-90 seconds
- 30+ pages: ~90+ seconds
Performance factors:
- Document complexity
- Number of policies and conditions
- PDF extraction method (OCR is slower)
- OpenAI API response times
The system includes comprehensive error handling:
- Retry Logic: Automatic retries with exponential backoff for API calls
- Graceful Degradation: Falls back to alternative PDF extraction methods
- Validation: Low-confidence extractions are flagged for review
- Detailed Logging: Structured logs for debugging
- Partial Results: Stores intermediate results for recovery
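Retry with exponential backoff of the kind described is typically shaped like the sketch below; the delays, exception handling, and injectable `sleep` parameter are illustrative assumptions, not the project's actual implementation.

```python
import time

def with_retries(call, max_retries: int = 3, base_delay: float = 1.0,
                 sleep=time.sleep):
    """Invoke call(), retrying on failure with exponentially increasing delays."""
    for attempt in range(max_retries + 1):
        try:
            return call()
        except Exception:
            if attempt == max_retries:
                raise                         # out of retries: surface the error
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ... between attempts

# Demo: fails twice, then succeeds (sleep stubbed out so the demo runs instantly).
attempts = []
def flaky():
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionError("transient API error")
    return "ok"

print(with_retries(flaky, sleep=lambda s: None))  # ok
```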
Run tests:

```bash
pytest tests/
```

Code quality tools:

```bash
# Format code
black src/

# Sort imports
isort src/

# Type checking
mypy src/

# Linting
flake8 src/
```

PriorityProcessingAgent/
├── src/ # Source code
├── tests/ # Test suites
├── config/ # Configuration
├── examples/ # Example usage
├── requirements.txt # Dependencies
└── README.md # Documentation
- Maximum recommended document size: 50 pages
- Requires clear, well-structured policy documents
- OCR quality depends on scan quality
- Processing time increases with document complexity
- Requires internet connection for OpenAI API
- Ensure Tesseract is installed for scanned PDFs
- Check PDF file is not corrupted
- Try converting PDF to text-based format if possible
- Verify Redis is running: `redis-cli ping`
- Check Redis host and port in `.env`
- Ensure firewall allows Redis connections
- Check OpenAI API key is valid
- Verify OpenAI API quota/limits
- Review logs for detailed error messages
- Document may be ambiguous or poorly structured
- Consider using gpt-4o for complex documents
- Review validation results for specific issues
[Your License Here]
[Contributing Guidelines]
For issues and questions:
- GitHub Issues: [Repository Issues]
- Documentation: [Full Documentation URL]
- Email: [Support Email]