Abisyc/doc-analyzer-api

AI-Powered Document Analysis API

Description

A production-ready REST API that accepts PDF, DOCX, and image files (as base64) and returns AI-generated summaries, named entities, and a sentiment classification. The analysis model is LLaMA 3.3 70B served through the Groq API; Claude Vision is used only as an OCR fallback for images.


Tech Stack

| Layer | Technology |
| --- | --- |
| Language / Framework | Python 3.12 + FastAPI |
| AI / LLM | Groq API (LLaMA 3.3 70B) |
| PDF extraction | pdfplumber |
| DOCX extraction | python-docx |
| OCR (images) | pytesseract (Tesseract) + Claude Vision fallback |
| Deployment | Docker → Railway / Render |

Setup Instructions

1. Clone the repository

git clone https://github.com/YOUR_USERNAME/doc-analyzer-api.git
cd doc-analyzer-api

2. Install system dependencies (Tesseract)

Ubuntu/Debian:

sudo apt-get install tesseract-ocr tesseract-ocr-eng

macOS:

brew install tesseract

3. Install Python dependencies

pip install -r requirements.txt

4. Set environment variables

cp .env.example .env
# Edit .env and fill in your keys
ANTHROPIC_API_KEY=sk-ant-...
API_KEY=sk_track2_your_secret_here
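
Since the service depends on both keys being present, a startup check can fail fast with a readable message. The following is a minimal sketch, not the repository's actual code: `load_settings` and `REQUIRED_KEYS` are hypothetical names, and only the two variable names come from this README.

```python
import os
from typing import Dict, Mapping

# Variable names taken from the .env example above.
REQUIRED_KEYS = ("ANTHROPIC_API_KEY", "API_KEY")


def load_settings(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Return the required settings, or raise if any is missing or empty."""
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {key: env[key] for key in REQUIRED_KEYS}
```

Passing the environment as a mapping keeps the check easy to exercise in tests without touching the real process environment.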

5. Run the application

uvicorn src.main:app --host 0.0.0.0 --port 8000 --reload

The API will be live at http://localhost:8000.


API Reference

POST /api/document-analyze

Authentication: send the header x-api-key: YOUR_API_KEY with every request. A missing or invalid key returns 401.

Request Body:

{
  "fileName": "sample.pdf",
  "fileType": "pdf",
  "fileBase64": "<base64-encoded file content>"
}

| Field | Type | Values |
| --- | --- | --- |
| fileName | string | Any filename |
| fileType | string | pdf, docx, image |
| fileBase64 | string | Base64-encoded file bytes |
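
As an illustration of the request body, the sketch below builds the three fields from raw file bytes. `build_payload` and the extension-to-type mapping are assumptions made for this example; the API itself only requires that `fileType` be one of pdf, docx, or image.

```python
import base64
from pathlib import Path
from typing import Dict

# Hypothetical mapping from file extension to the API's fileType values.
EXTENSION_TO_TYPE = {
    ".pdf": "pdf",
    ".docx": "docx",
    ".png": "image",
    ".jpg": "image",
    ".jpeg": "image",
}


def build_payload(file_name: str, data: bytes) -> Dict[str, str]:
    """Build the JSON body for POST /api/document-analyze from raw bytes."""
    suffix = Path(file_name).suffix.lower()
    if suffix not in EXTENSION_TO_TYPE:
        raise ValueError(f"Unsupported file extension: {suffix!r}")
    return {
        "fileName": file_name,
        "fileType": EXTENSION_TO_TYPE[suffix],
        # b64encode returns bytes; the JSON field needs a plain string.
        "fileBase64": base64.b64encode(data).decode("ascii"),
    }
```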

Success Response (200):

{
  "status": "success",
  "fileName": "sample.pdf",
  "summary": "This document is an invoice issued by ABC Pvt Ltd to Ravi Kumar on 10 March 2026 for an amount of ₹10,000.",
  "entities": {
    "names": ["Ravi Kumar"],
    "dates": ["10 March 2026"],
    "organizations": ["ABC Pvt Ltd"],
    "amounts": ["₹10,000"]
  },
  "sentiment": "Neutral"
}

Error Responses:

  • 401 – Missing or invalid API key
  • 400 – Invalid base64 data
  • 422 – Unsupported file type or unreadable file
  • 500 – Internal extraction or AI error
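
One way the first three status codes could be decided before any extraction runs is sketched below. This is not the repository's handler: `precheck` is a hypothetical helper, and the order of the checks (key, then base64, then file type) is an assumption.

```python
import base64
import binascii
import hmac
from typing import Optional, Tuple

SUPPORTED_TYPES = {"pdf", "docx", "image"}


def precheck(
    api_key: Optional[str], expected_key: str, file_type: str, file_b64: str
) -> Tuple[int, Optional[bytes]]:
    """Return (status_code, decoded_bytes); bytes are None unless status is 200."""
    # 401: missing or wrong key. compare_digest avoids timing side channels.
    if not api_key or not hmac.compare_digest(api_key.encode(), expected_key.encode()):
        return 401, None
    # 400: payload is not valid base64 (validate=True rejects stray characters).
    try:
        data = base64.b64decode(file_b64, validate=True)
    except (binascii.Error, ValueError):
        return 400, None
    # 422: unsupported type or empty file.
    if file_type not in SUPPORTED_TYPES or not data:
        return 422, None
    return 200, data
```

A 500 would come from failures later in the pipeline (extraction or the model call), which this pre-check does not cover.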

GET /health

Returns {"status": "ok"} — used for deployment health checks.


Example cURL

# Encode your file (-w 0 is GNU coreutils; on macOS use: base64 -i sample.pdf)
B64=$(base64 -w 0 sample.pdf)

curl -X POST https://your-domain.com/api/document-analyze \
  -H "Content-Type: application/json" \
  -H "x-api-key: sk_track2_your_secret_here" \
  -d "{\"fileName\":\"sample.pdf\",\"fileType\":\"pdf\",\"fileBase64\":\"$B64\"}"
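
The same request can be prepared from Python with only the standard library. This is a sketch under the assumption that the service is reachable at the placeholder domain used above; `build_request` is a hypothetical helper name.

```python
import json
import urllib.request
from typing import Dict


def build_request(base_url: str, api_key: str, payload: Dict[str, str]) -> urllib.request.Request:
    """Prepare (but do not send) the POST request for /api/document-analyze."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/document-analyze",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "x-api-key": api_key},
        method="POST",
    )
```

To send it, pass the result to `urllib.request.urlopen(...)` and decode the JSON response body. Error statuses such as 401 or 422 surface as `urllib.error.HTTPError`.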

Approach

Text Extraction Strategy

| Format | Method |
| --- | --- |
| PDF | pdfplumber: parses the PDF page by page and stitches text blocks in reading order. Handles multi-column layouts. |
| DOCX | python-docx: extracts all paragraphs and table cells from the document XML. |
| Image | pytesseract (Tesseract 5) for primary OCR. If Tesseract returns empty output, the image is sent directly to Claude Vision for transcription. |
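
The per-format dispatch and the empty-output fallback for images can be sketched as follows. The extractors are injected as plain callables so the example stays self-contained; in the real service they would wrap pdfplumber, python-docx, pytesseract, and the vision model. `extract_text` and its parameters are hypothetical names.

```python
from typing import Callable, Dict

Extractor = Callable[[bytes], str]


def extract_text(
    file_type: str,
    data: bytes,
    extractors: Dict[str, Extractor],
    ocr_fallback: Extractor,
) -> str:
    """Run the extractor for file_type; for images, retry with the fallback if empty."""
    if file_type not in extractors:
        raise ValueError(f"Unsupported file type: {file_type!r}")
    text = extractors[file_type](data).strip()
    # Only images fall back: an empty primary OCR result triggers the second pass.
    if file_type == "image" and not text:
        text = ocr_fallback(data).strip()
    return text
```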

AI Tools Used

  • Claude (Anthropic) — Used for assistance in code generation, debugging, and architecture decisions during development
  • Groq API (LLaMA 3.3 70B) — Used as the AI model for document summarisation, entity extraction, and sentiment analysis at runtime

AI Analysis Pipeline

A single LLM API call (system-prompted for strict JSON output) performs all three tasks in one request:

  1. Summary: a 2-4 sentence factual summary generated from the document text (truncated to 12,000 characters to fit the context window efficiently).
  2. Entity Extraction: the model identifies and categorises named entities into four types: names, dates, organizations, amounts.
  3. Sentiment: document-level sentiment classified as exactly one of Positive, Neutral, or Negative.

The system prompt enforces JSON-only output, and the response parser strips any accidental markdown fences before parsing.
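
The truncation and fence-stripping steps described above might look roughly like this. It is a sketch, not the project's parser: the function names are hypothetical, and the validation rules (unknown sentiment raises, missing entity lists default to empty) are assumptions.

```python
import json
from typing import Any, Dict

MAX_CHARS = 12_000
SENTIMENTS = {"Positive", "Neutral", "Negative"}
ENTITY_KEYS = ("names", "dates", "organizations", "amounts")
FENCE = "`" * 3  # markdown code fence, built at runtime


def truncate(text: str) -> str:
    """Cap the document text sent to the model at MAX_CHARS characters."""
    return text[:MAX_CHARS]


def strip_fences(raw: str) -> str:
    """Remove a leading and trailing markdown fence, if the model added them."""
    text = raw.strip()
    if text.startswith(FENCE):
        # Drop the whole opening line, which may carry a tag such as "json".
        text = text.split("\n", 1)[1] if "\n" in text else text[len(FENCE):]
    if text.endswith(FENCE):
        text = text[: -len(FENCE)]
    return text.strip()


def parse_reply(raw: str) -> Dict[str, Any]:
    """Parse the model reply into summary, entities, and sentiment."""
    data = json.loads(strip_fences(raw))
    sentiment = data.get("sentiment")
    if sentiment not in SENTIMENTS:
        raise ValueError(f"Unexpected sentiment: {sentiment!r}")
    entities = data.get("entities") or {}
    return {
        "summary": str(data.get("summary", "")),
        "entities": {key: list(entities.get(key, [])) for key in ENTITY_KEYS},
        "sentiment": sentiment,
    }
```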


Project Structure

doc-analyzer-api/
├── README.md
├── Dockerfile
├── railway.json
├── render.yaml
├── requirements.txt
├── .env.example
└── src/
    └── main.py

Deployment (Railway)

  1. Push code to GitHub
  2. Go to railway.app → New Project → Deploy from GitHub
  3. Set environment variables: ANTHROPIC_API_KEY and API_KEY
  4. Railway auto-builds the Dockerfile and provides a public URL
