Intelligent Document Understanding System

A production-ready AI pipeline that extracts structured JSON from invoices, receipts, and forms — combining OCR, document layout analysis, and NLP-based field extraction, served via a FastAPI REST API.

Architecture

Document (PDF / PNG / JPG)
        │
        ▼
┌─────────────────────┐
│  Image Preprocessing │  ← grayscale · denoise · threshold · deskew
└────────┬────────────┘
         │
         ▼
┌─────────────────────┐
│    OCR Engine        │  ← EasyOCR  (text + bounding boxes)
└────────┬────────────┘
         │
         ▼
┌─────────────────────┐
│  Layout Detection    │  ← LayoutParser (title/table/paragraph/…)
└────────┬────────────┘
         │
         ▼
┌─────────────────────┐
│ Information Extract  │  ← Regex + spaCy NER
└────────┬────────────┘
         │
         ▼
     JSON Output
         │
         ▼
   FastAPI Response
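
The regex half of the information-extraction stage can be illustrated with a minimal sketch. The patterns and the `extract_fields` helper below are hypothetical and written only for this example; the project's actual rules live in `src/information_extraction.py` and may differ. The spaCy NER step is not shown.

```python
import re
from typing import Dict, Optional

# Hypothetical patterns for illustration; not the project's actual rules.
# Each pattern exposes the value of interest as capture group 1.
PATTERNS = {
    "invoice_number": re.compile(
        r"invoice\s*(?:number|no\.?|#)\s*[:#]?\s*([A-Z0-9-]+)", re.IGNORECASE
    ),
    "date": re.compile(r"\b(\d{4}-\d{2}-\d{2})\b"),
    "total_amount": re.compile(
        r"total\s*(?:amount|due)?\s*[:\-]?\s*\$?\s*([\d,]+\.\d{2})", re.IGNORECASE
    ),
    "email": re.compile(r"([\w.+-]+@[\w-]+(?:\.[\w-]+)+)"),
}


def extract_fields(raw_text: str) -> Dict[str, Optional[str]]:
    """Return the first match for each pattern, or None when a field is absent."""
    fields: Dict[str, Optional[str]] = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(raw_text)
        fields[name] = match.group(1) if match else None
    return fields


sample_text = (
    "ABC ELECTRONICS LTD.\n"
    "Invoice No: 2026-001\n"
    "Date: 2026-03-10\n"
    "billing@abcelectronics.com\n"
    "Total: $648.00"
)
print(extract_fields(sample_text))
```

Fields that no pattern matches come back as `None`, which mirrors the `null` values in the `/extract` response.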

Project Structure

document_ai_system/
├── api/
│   ├── main.py               # FastAPI app entry point
│   ├── routes.py             # /extract  /batch  /visualize  /health
│   └── schemas.py            # Pydantic v2 response models
├── src/
│   ├── preprocess.py         # Image preprocessing pipeline
│   ├── ocr_engine.py         # EasyOCR wrapper (singleton)
│   ├── layout_detection.py   # LayoutParser + rule-based fallback
│   ├── information_extraction.py  # Regex + spaCy NER
│   └── pipeline.py           # Master orchestration
├── utils/
│   ├── file_loader.py        # PDF/image loader + JSON saver
│   └── visualizer.py         # Bounding-box visualisation
├── tests/
│   ├── test_preprocess.py
│   ├── test_ocr.py
│   └── test_pipeline.py
├── data/
│   ├── invoices/             # Sample invoice images
│   └── receipts/
├── outputs/json_results/     # Auto-saved extraction results
├── generate_sample_invoice.py
├── conftest.py
├── requirements.txt
└── .env.example

Installation

1 — Clone & create virtual environment

git clone <your-repo-url>
cd document_ai_system

python -m venv venv
# Windows:
venv\Scripts\activate
# macOS / Linux:
source venv/bin/activate

2 — Install dependencies

pip install -r requirements.txt

3 — Download spaCy language model

python -m spacy download en_core_web_sm

4 — Configure environment

copy .env.example .env   # Windows
# cp .env.example .env   # macOS/Linux
# Edit .env as needed

5 — Generate sample invoice (optional)

python generate_sample_invoice.py

Running the API

uvicorn api.main:app --reload --host 0.0.0.0 --port 8000

Swagger UI: http://localhost:8000/docs
ReDoc: http://localhost:8000/redoc

API Reference

`GET /health`

Liveness check.

Response:

{ "status": "ok", "version": "1.0.0" }

`POST /extract`

Extract structured fields from a single document.

Request: `multipart/form-data` with a single form field named `file` that carries the document.

cURL example:

curl -X POST http://localhost:8000/extract \
     -F "file=@data/invoices/sample_invoice.png"

Response:

{
  "filename": "sample_invoice.png",
  "status": "success",
  "processing_time": 1.234,
  "fields": {
    "invoice_number": "2026-001",
    "date": "2026-03-10",
    "vendor": "ABC Electronics Ltd.",
    "total_amount": "648.00",
    "address": "Tech City, TX",
    "email": "billing@abcelectronics.com",
    "phone": "(555) 987-6543",
    "tax_id": null,
    "po_number": null
  },
  "raw_text": "ABC ELECTRONICS LTD. ...",
  "ocr_blocks": [...],
  "layout_blocks": [...],
  "visualization_path": "outputs/visualizations/vis_sample_invoice.png"
}
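
The same request can also be sent from Python. The sketch below uses only the standard library and assumes the server is running at `http://localhost:8000`; the helper names `build_multipart` and `extract` are made up for this example and are not part of the project.

```python
import json
import mimetypes
import uuid
from pathlib import Path
from typing import Tuple
from urllib import request


def build_multipart(field: str, filename: str, payload: bytes) -> Tuple[bytes, str]:
    """Encode one file as a multipart/form-data body.

    Returns the request body and the matching Content-Type header value.
    """
    boundary = uuid.uuid4().hex
    mime = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {mime}\r\n\r\n"
    ).encode()
    tail = f"\r\n--{boundary}--\r\n".encode()
    return head + payload + tail, f"multipart/form-data; boundary={boundary}"


def extract(path: str, base_url: str = "http://localhost:8000") -> dict:
    """POST a document to /extract and return the decoded JSON response."""
    document = Path(path)
    body, content_type = build_multipart("file", document.name, document.read_bytes())
    req = request.Request(
        f"{base_url}/extract",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with request.urlopen(req) as resp:
        return json.load(resp)
```

With the API running, `extract("data/invoices/sample_invoice.png")["fields"]` would give the dictionary of extracted fields shown in the response above.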

Tip

You can view the annotated image directly in your browser at: http://localhost:8000/outputs/visualizations/vis_sample_invoice.png

`POST /batch`

Process up to 20 documents in one request. Send each document as a separate `files` form field.

curl -X POST http://localhost:8000/batch \
     -F "files=@invoice1.png" \
     -F "files=@invoice2.pdf"
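
When a client has more than 20 documents, it needs to split them across several `/batch` requests. A minimal sketch of that splitting step follows; `chunk_for_batch` and `MAX_BATCH_SIZE` are hypothetical names, and only the limit of 20 comes from the description above.

```python
from typing import Iterator, List, Sequence

MAX_BATCH_SIZE = 20  # per-request limit stated for POST /batch


def chunk_for_batch(
    paths: Sequence[str], size: int = MAX_BATCH_SIZE
) -> Iterator[List[str]]:
    """Yield consecutive groups of at most `size` paths, preserving order."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(paths), size):
        yield list(paths[start:start + size])


# 45 documents are split into three requests of 20, 20, and 5 files.
documents = [f"invoice{i}.png" for i in range(45)]
print([len(group) for group in chunk_for_batch(documents)])
```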

`POST /visualize`

Returns the document image annotated with colour-coded bounding boxes.

curl -X POST http://localhost:8000/visualize \
     -F "file=@data/invoices/sample_invoice.png" \
     --output annotated.png

Running Tests

# From the document_ai_system/ directory:
python -m pytest tests/ -v

Supported Document Types

Type	Extension	Notes
Images	PNG, JPG, JPEG, TIFF, BMP, WEBP	Direct processing
PDF	.pdf	First page extracted via PyMuPDF
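
A client can check a file against this table before uploading it. The sketch below is an illustration only: `classify_document` is a hypothetical helper, and the server's own validation (for example in `utils/file_loader.py`) may accept or reject files differently.

```python
from pathlib import Path

# Extensions taken from the table above.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".tiff", ".bmp", ".webp"}
PDF_EXTENSIONS = {".pdf"}


def classify_document(filename: str) -> str:
    """Return "image" or "pdf" for a supported file, else raise ValueError."""
    suffix = Path(filename).suffix.lower()
    if suffix in IMAGE_EXTENSIONS:
        return "image"
    if suffix in PDF_EXTENSIONS:
        return "pdf"
    raise ValueError(f"Unsupported document type: {suffix or '(no extension)'}")


print(classify_document("data/invoices/sample_invoice.png"))
print(classify_document("invoice2.PDF"))
```

The comparison is case-insensitive, so `invoice2.PDF` is treated the same as `invoice2.pdf`.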

Datasets for Testing

Dataset	Description	Link
SROIE	Scanned receipt OCR & IE	SROIE
FUNSD	Form understanding in noisy scanned docs	FUNSD
DocVQA	Document visual QA	DocVQA

Configuration (`.env`)

Variable	Default	Description
`OCR_LANG`	`en`	EasyOCR language code
`OCR_USE_GPU`	`False`	Enable GPU for OCR
`LAYOUT_SCORE_THRESHOLD`	`0.5`	LayoutParser confidence threshold
`OUTPUT_DIR`	`outputs/json_results`	Auto-saved JSON directory
`OPENAI_API_KEY`	(blank)	Optional LLM extraction
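
One way to read these variables with their documented defaults is sketched below. The `Settings` class and `load_settings` function are assumptions made for this example, not the project's actual configuration code. The sketch reads the process environment only and does not parse the `.env` file itself.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    ocr_lang: str
    ocr_use_gpu: bool
    layout_score_threshold: float
    output_dir: str
    openai_api_key: Optional[str]


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from environment variables, using the table's defaults."""
    use_gpu = env.get("OCR_USE_GPU", "False").strip().lower() in {"1", "true", "yes"}
    return Settings(
        ocr_lang=env.get("OCR_LANG", "en"),
        ocr_use_gpu=use_gpu,
        layout_score_threshold=float(env.get("LAYOUT_SCORE_THRESHOLD", "0.5")),
        output_dir=env.get("OUTPUT_DIR", "outputs/json_results"),
        # A blank key is treated the same as an unset one.
        openai_api_key=env.get("OPENAI_API_KEY") or None,
    )


# Passing an explicit mapping keeps the example independent of the real environment.
print(load_settings({"OCR_USE_GPU": "true", "LAYOUT_SCORE_THRESHOLD": "0.7"}))
```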
