A production-ready AI pipeline that extracts structured JSON from invoices, receipts, and forms — combining OCR, document layout analysis, and NLP-based field extraction, served via a FastAPI REST API.
Document (PDF / PNG / JPG)
         │
         ▼
┌─────────────────────┐
│ Image Preprocessing │  ← grayscale · denoise · threshold · deskew
└────────┬────────────┘
         │
         ▼
┌─────────────────────┐
│     OCR Engine      │  ← EasyOCR (text + bounding boxes)
└────────┬────────────┘
         │
         ▼
┌─────────────────────┐
│  Layout Detection   │  ← LayoutParser (title/table/paragraph/…)
└────────┬────────────┘
         │
         ▼
┌─────────────────────┐
│ Information Extract │  ← Regex + spaCy NER
└────────┬────────────┘
         │
         ▼
    JSON Output
         │
         ▼
  FastAPI Response
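The stage ordering in the diagram can be sketched as a chain of functions that pass one shared context dict along. This is a minimal illustration, not the project's actual `pipeline.py`: the `Pipeline` class and the four stage functions are hypothetical stand-ins that return canned data instead of calling EasyOCR, LayoutParser, or spaCy.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Tuple

# A stage receives the shared context dict and returns it (possibly updated).
Stage = Callable[[Dict[str, Any]], Dict[str, Any]]


@dataclass
class Pipeline:
    """Runs named stages in order, threading one context dict through them."""
    stages: List[Tuple[str, Stage]] = field(default_factory=list)

    def add(self, name: str, stage: Stage) -> "Pipeline":
        self.stages.append((name, stage))
        return self

    def run(self, context: Dict[str, Any]) -> Dict[str, Any]:
        for name, stage in self.stages:
            context = stage(context)
            # Record which stages have finished, in order.
            context.setdefault("completed", []).append(name)
        return context


# Hypothetical stand-ins for the real stages; each returns canned data.
def preprocess(ctx):
    ctx["image"] = f"cleaned({ctx['document']})"
    return ctx


def run_ocr(ctx):
    ctx["ocr_blocks"] = [{"text": "INVOICE 2026-001", "bbox": [0, 0, 100, 20]}]
    return ctx


def detect_layout(ctx):
    ctx["layout_blocks"] = [{"type": "title", "bbox": [0, 0, 100, 20]}]
    return ctx


def extract_fields(ctx):
    ctx["raw_text"] = " ".join(block["text"] for block in ctx["ocr_blocks"])
    ctx["fields"] = {"invoice_number": ctx["raw_text"].split()[-1]}
    return ctx


pipeline = (
    Pipeline()
    .add("preprocess", preprocess)
    .add("ocr", run_ocr)
    .add("layout", detect_layout)
    .add("extract", extract_fields)
)
result = pipeline.run({"document": "sample_invoice.png"})
print(result["completed"], result["fields"])
```

Keeping every stage behind the same signature makes it straightforward to swap one out, for example replacing the layout stage with a rule-based fallback.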
document_ai_system/
├── api/
│   ├── main.py                     # FastAPI app entry point
│   ├── routes.py                   # /extract /batch /visualize /health
│   └── schemas.py                  # Pydantic v2 response models
├── src/
│   ├── preprocess.py               # Image preprocessing pipeline
│   ├── ocr_engine.py               # EasyOCR wrapper (singleton)
│   ├── layout_detection.py         # LayoutParser + rule-based fallback
│   ├── information_extraction.py   # Regex + spaCy NER
│   └── pipeline.py                 # Master orchestration
├── utils/
│   ├── file_loader.py              # PDF/image loader + JSON saver
│   └── visualizer.py               # Bounding-box visualisation
├── tests/
│   ├── test_preprocess.py
│   ├── test_ocr.py
│   └── test_pipeline.py
├── data/
│   ├── invoices/                   # Sample invoice images
│   └── receipts/
├── outputs/json_results/           # Auto-saved extraction results
├── generate_sample_invoice.py
├── conftest.py
├── requirements.txt
└── .env.example
git clone <your-repo-url>
cd document_ai_system
python -m venv venv
# Windows:
venv\Scripts\activate
# macOS / Linux:
source venv/bin/activate
pip install -r requirements.txt
python -m spacy download en_core_web_sm
copy .env.example .env # Windows
# cp .env.example .env # macOS/Linux
# Edit .env as needed
python generate_sample_invoice.py
uvicorn api.main:app --reload --host 0.0.0.0 --port 8000
- Swagger UI: http://localhost:8000/docs
- ReDoc: http://localhost:8000/redoc
Liveness check (/health).
Response:
{ "status": "ok", "version": "1.0.0" }
Extract structured fields from a single document (POST /extract).
Request: multipart/form-data with field file
cURL example:
curl -X POST http://localhost:8000/extract \
-F "file=@data/invoices/sample_invoice.png"
Response:
{
"filename": "sample_invoice.png",
"status": "success",
"processing_time": 1.234,
"fields": {
"invoice_number": "2026-001",
"date": "2026-03-10",
"vendor": "ABC Electronics Ltd.",
"total_amount": "648.00",
"address": "Tech City, TX",
"email": "billing@abcelectronics.com",
"phone": "(555) 987-6543",
"tax_id": null,
"po_number": null
},
"raw_text": "ABC ELECTRONICS LTD. ...",
"ocr_blocks": [...],
"layout_blocks": [...],
"visualization_path": "outputs/visualizations/vis_sample_invoice.png"
}
Tip: You can view the annotated image directly in your browser at http://localhost:8000/outputs/visualizations/vis_sample_invoice.png
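Fields such as `invoice_number`, `date`, and `total_amount` in the response above are the kind of values a regex pass can pull out of `raw_text`. The sketch below shows the general idea only. The patterns, the `PATTERNS` table, and the `extract_fields` helper are illustrative assumptions, not the project's actual `information_extraction.py`, and the sample text is made up to resemble the example invoice.

```python
import re

# Illustrative patterns only; the project's real patterns are not shown here.
PATTERNS = {
    "invoice_number": re.compile(
        r"invoice\s*(?:no\.?|number|#)?\s*[:#]?\s*([A-Z0-9][A-Z0-9-]*\d)", re.I
    ),
    "date": re.compile(r"\b(\d{4}-\d{2}-\d{2})\b"),
    # \btotal\b skips "Subtotal" because there is no word boundary inside it.
    "total_amount": re.compile(r"\btotal\b[^\d\n]*([\d,]+\.\d{2})", re.I),
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "tax_id": re.compile(r"\btax\s*id\s*[:#]?\s*([A-Z0-9-]+)", re.I),
}


def extract_fields(text):
    """Return one value per field, or None when a pattern finds nothing."""
    fields = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        if match is None:
            fields[name] = None
        else:
            # Use the capture group when the pattern defines one.
            fields[name] = match.group(1) if pattern.groups else match.group(0)
    return fields


# Made-up OCR text resembling the sample invoice.
sample = """ABC ELECTRONICS LTD.
Invoice No: 2026-001
Date: 2026-03-10
Subtotal: 600.00
Total: 648.00
billing@abcelectronics.com"""

fields = extract_fields(sample)
print(fields)
```

Fields with no match come back as `None`, which serialises to `null` in JSON, matching how `tax_id` and `po_number` appear in the example response.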
Process up to 20 documents in one request (POST /batch).
curl -X POST http://localhost:8000/batch \
-F "files=@invoice1.png" \
-F "files=@invoice2.pdf"
Returns the document image annotated with colour-coded bounding boxes (POST /visualize).
curl -X POST http://localhost:8000/visualize \
-F "file=@data/invoices/sample_invoice.png" \
--output annotated.png
# From the document_ai_system/ directory:
python -m pytest tests/ -v
| Type | Extension | Notes |
|---|---|---|
| Images | PNG, JPG, JPEG, TIFF, BMP, WEBP | Direct processing |
| PDF | PDF | First page extracted via PyMuPDF |
| Dataset | Description | Link |
|---|---|---|
| SROIE | Scanned receipt OCR & IE | SROIE |
| FUNSD | Form understanding in noisy scanned docs | FUNSD |
| DocVQA | Document visual QA | DocVQA |
| Variable | Default | Description |
|---|---|---|
| OCR_LANG | en | EasyOCR language code |
| OCR_USE_GPU | False | Enable GPU for OCR |
| LAYOUT_SCORE_THRESHOLD | 0.5 | LayoutParser confidence threshold |
| OUTPUT_DIR | outputs/json_results | Auto-saved JSON directory |
| OPENAI_API_KEY | (blank) | Optional LLM extraction |
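As a rough illustration of how these variables and their defaults could be read at startup, here is a minimal sketch using only the standard library. The `Settings` dataclass, `load_settings`, and `_as_bool` are hypothetical names, and the sketch assumes the values are already present in the process environment (for example, exported by the shell or loaded from `.env` by a separate tool). The project's actual loading code may differ.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Typed view of the configuration table, with the documented defaults."""
    ocr_lang: str = "en"
    ocr_use_gpu: bool = False
    layout_score_threshold: float = 0.5
    output_dir: str = "outputs/json_results"
    openai_api_key: Optional[str] = None


def _as_bool(value: str) -> bool:
    # Treat common truthy spellings as True; everything else is False.
    return value.strip().lower() in {"1", "true", "yes", "on"}


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from a mapping of environment variables."""
    return Settings(
        ocr_lang=env.get("OCR_LANG", "en"),
        ocr_use_gpu=_as_bool(env.get("OCR_USE_GPU", "False")),
        layout_score_threshold=float(env.get("LAYOUT_SCORE_THRESHOLD", "0.5")),
        output_dir=env.get("OUTPUT_DIR", "outputs/json_results"),
        # A blank key means the optional LLM extraction stays disabled.
        openai_api_key=env.get("OPENAI_API_KEY") or None,
    )


defaults = load_settings({})
custom = load_settings({"OCR_USE_GPU": "true", "LAYOUT_SCORE_THRESHOLD": "0.7"})
print(defaults)
print(custom)
```

Passing the environment as a mapping keeps the function easy to test without touching the real process environment.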
MIT © 2026