An intelligent document processing service that extracts text from PDF and DOCX files, analyzes content using Large Language Models (LLMs) via OpenRouter, and provides structured metadata extraction.
- Document Upload: Support for PDF and DOCX files.
- Secure Storage: Files are stored securely in Minio/S3.
- Text Extraction: Automatic extraction of text content from uploaded documents.
- AI Analysis:
- Concise Summaries.
- Document Type Detection (Invoice, CV, Report, etc.).
- Metadata Extraction (Date, Sender, Amount, etc.).
- REST API: Built with FastAPI for high performance and easy integration.
- Backend: FastAPI
- Database: PostgreSQL
- Storage: Minio (S3 Compatible)
- AI/LLM: OpenRouter (GPT-4o-mini or compatible)
- ORM: SQLAlchemy
- Dependencies:
boto3,pdfplumber,python-docx,python-multipart
- Python 3.9+
- PostgreSQL
- Minio Server (or S3 access)
- OpenRouter API Key
-
Clone the repository
git clone https://github.com/danielzfega/document-analyzer cd document-analyzer -
Create a virtual environment
python -m venv venv source venv/bin/activate # On Windows: venv\Scripts\activate
-
Install dependencies
pip install -r requirements.txt
-
Configuration Create a
.envfile in the root directory (ensure it is UTF-8 encoded):DATABASE_URL=postgresql://user:password@localhost:5432/document_analyzer MINIO_ENDPOINT=http://localhost:9000 MINIO_ACCESS_KEY=minioadmin MINIO_SECRET_KEY=minioadmin123 MINIO_BUCKET_NAME=document-analyzer OPENROUTER_API_KEY=your_openrouter_api_key_here
-
Run the Application
uvicorn main:app --reload
POST /documents/upload
- Body:
multipart/form-datawithfilefield. - Response:
{ "id": "1", "file_name": "resume.pdf" }
POST /documents/{id}/analyze
- Response:
{ "message": "Analysis complete" }
GET /documents/{id}
- Response:
{ "id": "1", "file_name": "resume.pdf", "text": "Extracted text content...", "summary": "This is a resume for...", "detected_type": "CV", "attributes": { "name": "John Doe", "email": "john@example.com" } }
Run the test suite using pytest:
pytestNote: Ensure your .env is configured for testing (e.g., using a test database or SQLite).