AI-Powered Document Analysis API

Description

A production-ready REST API that accepts PDF, DOCX, and image files (as base64) and returns AI-generated summaries, named entity extraction, and sentiment classification — powered by Claude (Anthropic).

Tech Stack

Layer	Technology
Language / Framework	Python 3.12 + FastAPI
AI / LLM	Groq API (LLaMA 3.3 70B)
PDF extraction	`pdfplumber`
DOCX extraction	`python-docx`
OCR (images)	`pytesseract` (Tesseract) + Claude Vision fallback
Deployment	Docker → Railway / Render

Setup Instructions

1. Clone the repository

git clone https://github.com/YOUR_USERNAME/doc-analyzer-api.git
cd doc-analyzer-api

2. Install system dependencies (Tesseract)

Ubuntu/Debian:

sudo apt-get install tesseract-ocr tesseract-ocr-eng

macOS:

brew install tesseract

3. Install Python dependencies

pip install -r requirements.txt

4. Set environment variables

cp .env.example .env
# Edit .env and fill in your keys

ANTHROPIC_API_KEY=sk-ant-...
API_KEY=sk_track2_your_secret_here

5. Run the application

uvicorn src.main:app --host 0.0.0.0 --port 8000 --reload

The API will be live at http://localhost:8000.

API Reference

`POST /api/document-analyze`

Authentication: x-api-key: YOUR_API_KEY header required (401 if missing/invalid).

Request Body:

{
  "fileName": "sample.pdf",
  "fileType": "pdf",
  "fileBase64": "<base64-encoded file content>"
}

Field	Type	Values
`fileName`	string	Any filename
`fileType`	string	`pdf`, `docx`, `image`
`fileBase64`	string	Base64-encoded file bytes

Success Response (200):

{
  "status": "success",
  "fileName": "sample.pdf",
  "summary": "This document is an invoice issued by ABC Pvt Ltd to Ravi Kumar on 10 March 2026 for an amount of ₹10,000.",
  "entities": {
    "names": ["Ravi Kumar"],
    "dates": ["10 March 2026"],
    "organizations": ["ABC Pvt Ltd"],
    "amounts": ["₹10,000"]
  },
  "sentiment": "Neutral"
}

Error Responses:

401 – Missing or invalid API key
400 – Invalid base64 data
422 – Unsupported file type or unreadable file
500 – Internal extraction or AI error

`GET /health`

Returns {"status": "ok"} — used for deployment health checks.

Example cURL

# Encode your file
B64=$(base64 -w 0 sample.pdf)

curl -X POST https://your-domain.com/api/document-analyze \
  -H "Content-Type: application/json" \
  -H "x-api-key: sk_track2_your_secret_here" \
  -d "{\"fileName\":\"sample.pdf\",\"fileType\":\"pdf\",\"fileBase64\":\"$B64\"}"

Approach

Text Extraction Strategy

Format	Method
PDF	`pdfplumber` — parses the PDF page-by-page and stitches text blocks in reading order. Handles multi-column layouts.
DOCX	`python-docx` — extracts all paragraphs and table cells from the document XML.
Image	`pytesseract` (Tesseract 5) for primary OCR. If Tesseract returns empty output, falls back to sending the image directly to Claude Vision for transcription.

AI Tools Used

Claude (Anthropic) — Used for assistance in code generation, debugging, and architecture decisions during development
Groq API (LLaMA 3.3 70B) — Used as the AI model for document summarisation, entity extraction, and sentiment analysis at runtime

AI Analysis Pipeline

A single Claude API call (system-prompted for strict JSON output) performs all three tasks simultaneously:

Summary — 2-4 sentence factual summary generated from the full document text (truncated to 12,000 chars to fit the context window efficiently).
Entity Extraction — Claude identifies and categorises named entities into four types: names, dates, organizations, amounts.
Sentiment — Document-level sentiment classified as exactly Positive, Neutral, or Negative.

The system prompt enforces JSON-only output, and the response parser strips any accidental markdown fences before parsing.

Project Structure

your-repo/
├── README.md
├── Dockerfile
├── railway.json
├── render.yaml
├── requirements.txt
├── .env.example
└── src/
    └── main.py

Deployment (Railway)

Push code to GitHub
Go to railway.app → New Project → Deploy from GitHub
Set environment variables: ANTHROPIC_API_KEY and API_KEY
Railway auto-builds the Dockerfile and provides a public URL

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

AI-Powered Document Analysis API

Description

Tech Stack

Setup Instructions

1. Clone the repository

2. Install system dependencies (Tesseract)

3. Install Python dependencies

4. Set environment variables

5. Run the application

API Reference

`POST /api/document-analyze`

`GET /health`

Example cURL

Approach

Text Extraction Strategy

AI Tools Used

AI Analysis Pipeline

Project Structure

Deployment (Railway)

About

Uh oh!

Releases

Packages

Uh oh!

Contributors 1

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 14 Commits
src		src
.gitignore		.gitignore
Dockerfile		Dockerfile
Procfile		Procfile
README.md		README.md
railway.json		railway.json
render.yaml		render.yaml
requirements.txt		requirements.txt
test_local.py		test_local.py

Folders and files

Latest commit

History

Repository files navigation

AI-Powered Document Analysis API

Description

Tech Stack

Setup Instructions

1. Clone the repository

2. Install system dependencies (Tesseract)

3. Install Python dependencies

4. Set environment variables

5. Run the application

API Reference

POST /api/document-analyze

GET /health

Example cURL

Approach

Text Extraction Strategy

AI Tools Used

AI Analysis Pipeline

Project Structure

Deployment (Railway)

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors 1

Languages

`POST /api/document-analyze`

`GET /health`

Packages