Farhan3376/Intelligent-Document-Understanding-System

Intelligent Document Understanding System

A production-ready AI pipeline that extracts structured JSON from invoices, receipts, and forms — combining OCR, document layout analysis, and NLP-based field extraction, served via a FastAPI REST API.


Architecture

Document (PDF / PNG / JPG)
        │
        ▼
┌─────────────────────┐
│  Image Preprocessing │  ← grayscale · denoise · threshold · deskew
└────────┬────────────┘
         │
         ▼
┌─────────────────────┐
│    OCR Engine        │  ← EasyOCR  (text + bounding boxes)
└────────┬────────────┘
         │
         ▼
┌─────────────────────┐
│  Layout Detection    │  ← LayoutParser (title/table/paragraph/…)
└────────┬────────────┘
         │
         ▼
┌─────────────────────┐
│ Information Extract  │  ← Regex + spaCy NER
└────────┬────────────┘
         │
         ▼
     JSON Output
         │
         ▼
   FastAPI Response
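
The regex half of the information-extraction stage can be sketched as follows. This is a minimal illustration, not the project's code: the `PATTERNS` table, the `extract_fields` helper, and the sample text are all hypothetical, and the spaCy NER pass (used for fields such as vendor or address) is left out.

```python
import re
from typing import Optional

# Hypothetical patterns for illustration only; the project's real patterns may differ.
PATTERNS = {
    "invoice_number": re.compile(
        r"invoice\s*(?:no\.?|number|#)?\s*[:#]?\s*([A-Z0-9][A-Z0-9-]*\d)",
        re.IGNORECASE,
    ),
    "date": re.compile(r"\b(\d{4}-\d{2}-\d{2})\b"),
    # \btotal\b skips "Subtotal" because there is no word boundary inside it.
    "total_amount": re.compile(r"\btotal\b[^\d\n]*(\d[\d,]*\.\d{2})", re.IGNORECASE),
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
}


def extract_fields(raw_text: str) -> dict[str, Optional[str]]:
    """Return the first match per field, or None when a field is not found."""
    fields: dict[str, Optional[str]] = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(raw_text)
        if match is None:
            fields[name] = None
            continue
        # Use the capture group when the pattern defines one, else the whole match.
        value = match.group(1) if pattern.groups else match.group(0)
        if name == "total_amount":
            value = value.replace(",", "")  # drop thousands separators
        fields[name] = value
    return fields


# Made-up OCR text resembling the sample invoice.
sample = """ABC ELECTRONICS LTD.
Invoice No: 2026-001
Date: 2026-03-10
Subtotal: 600.00
Total: 648.00
billing@abcelectronics.com"""

print(extract_fields(sample))
```

Missing fields come back as `None`, which mirrors the `null` values shown in the `/extract` response.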

Project Structure

document_ai_system/
├── api/
│   ├── main.py               # FastAPI app entry point
│   ├── routes.py             # /extract  /batch  /visualize  /health
│   └── schemas.py            # Pydantic v2 response models
├── src/
│   ├── preprocess.py         # Image preprocessing pipeline
│   ├── ocr_engine.py         # EasyOCR wrapper (singleton)
│   ├── layout_detection.py   # LayoutParser + rule-based fallback
│   ├── information_extraction.py  # Regex + spaCy NER
│   └── pipeline.py           # Master orchestration
├── utils/
│   ├── file_loader.py        # PDF/image loader + JSON saver
│   └── visualizer.py         # Bounding-box visualisation
├── tests/
│   ├── test_preprocess.py
│   ├── test_ocr.py
│   └── test_pipeline.py
├── data/
│   ├── invoices/             # Sample invoice images
│   └── receipts/
├── outputs/json_results/     # Auto-saved extraction results
├── generate_sample_invoice.py
├── conftest.py
├── requirements.txt
└── .env.example

Installation

1 — Clone & create virtual environment

git clone <your-repo-url>
cd document_ai_system

python -m venv venv
# Windows:
venv\Scripts\activate
# macOS / Linux:
source venv/bin/activate

2 — Install dependencies

pip install -r requirements.txt

3 — Download spaCy language model

python -m spacy download en_core_web_sm

4 — Configure environment

copy .env.example .env   # Windows
# cp .env.example .env   # macOS/Linux
# Edit .env as needed

5 — Generate sample invoice (optional)

python generate_sample_invoice.py

Running the API

uvicorn api.main:app --reload --host 0.0.0.0 --port 8000

API Reference

GET /health

Liveness check.

Response:

{ "status": "ok", "version": "1.0.0" }

POST /extract

Extract structured fields from a single document.

Request: multipart/form-data with a single form field named file

cURL example:

curl -X POST http://localhost:8000/extract \
     -F "file=@data/invoices/sample_invoice.png"

Response:

{
  "filename": "sample_invoice.png",
  "status": "success",
  "processing_time": 1.234,
  "fields": {
    "invoice_number": "2026-001",
    "date": "2026-03-10",
    "vendor": "ABC Electronics Ltd.",
    "total_amount": "648.00",
    "address": "Tech City, TX",
    "email": "billing@abcelectronics.com",
    "phone": "(555) 987-6543",
    "tax_id": null,
    "po_number": null
  },
  "raw_text": "ABC ELECTRONICS LTD. ...",
  "ocr_blocks": [...],
  "layout_blocks": [...],
  "visualization_path": "outputs/visualizations/vis_sample_invoice.png"
}
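
The same request can be made from Python. The sketch below uses only the standard library and assumes the server is running at http://localhost:8000 as in the cURL example. The `build_multipart` and `extract_document` helpers are hypothetical names introduced here, not part of the project.

```python
import json
import os
import urllib.request
import uuid


def build_multipart(field: str, filename: str, payload: bytes,
                    content_type: str = "application/octet-stream") -> tuple[bytes, str]:
    """Encode one file as multipart/form-data.

    Returns the request body and the matching Content-Type header value.
    """
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + payload + tail, f"multipart/form-data; boundary={boundary}"


def extract_document(path: str, base_url: str = "http://localhost:8000") -> dict:
    """POST a document to /extract and return the decoded JSON response."""
    with open(path, "rb") as handle:
        body, header = build_multipart("file", os.path.basename(path), handle.read())
    request = urllib.request.Request(
        f"{base_url}/extract",
        data=body,
        headers={"Content-Type": header},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.load(response)
```

With the API running, `extract_document("data/invoices/sample_invoice.png")["fields"]` would give the extracted field dictionary shown above.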

Tip

You can view the annotated image directly in your browser at: http://localhost:8000/outputs/visualizations/vis_sample_invoice.png


POST /batch

Process up to 20 documents in one request by repeating the files form field once per document.

curl -X POST http://localhost:8000/batch \
     -F "files=@invoice1.png" \
     -F "files=@invoice2.pdf"
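
A client with more than 20 documents has to split them across several requests. The helper below is a minimal sketch of that client-side chunking. `chunk_files` and `MAX_BATCH_SIZE` are hypothetical names, and how the server reacts to an oversized batch is not specified here.

```python
from typing import Iterator, Sequence

MAX_BATCH_SIZE = 20  # per-request limit stated for /batch


def chunk_files(paths: Sequence[str], size: int = MAX_BATCH_SIZE) -> Iterator[list[str]]:
    """Yield consecutive groups of at most `size` paths, preserving order."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(paths), size):
        yield list(paths[start:start + size])


# Example: 45 made-up file names become three batches.
paths = [f"invoice{i}.png" for i in range(45)]
for batch in chunk_files(paths):
    print(len(batch))
```

Each yielded list can then be sent as one `/batch` request.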

POST /visualize

Returns the document image annotated with colour-coded bounding boxes.

curl -X POST http://localhost:8000/visualize \
     -F "file=@data/invoices/sample_invoice.png" \
     --output annotated.png

Running Tests

# From the document_ai_system/ directory:
python -m pytest tests/ -v

Supported Document Types

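| Type   | Extensions                      | Notes                            |
|--------|---------------------------------|----------------------------------|
| Images | PNG, JPG, JPEG, TIFF, BMP, WEBP | Direct processing                |
| PDF    | PDF                             | First page extracted via PyMuPDF |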
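
A caller can check a file against this table before uploading it. The sketch below shows one way to do that by extension. `classify_document` and the two extension sets are hypothetical names, and the project's own loader (`utils/file_loader.py`) may validate files differently.

```python
from pathlib import Path

# Extensions taken from the table above, lower-cased with a leading dot.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".tiff", ".bmp", ".webp"}
PDF_EXTENSIONS = {".pdf"}


def classify_document(path: str) -> str:
    """Return "image" or "pdf" for a supported file; raise ValueError otherwise."""
    suffix = Path(path).suffix.lower()
    if suffix in IMAGE_EXTENSIONS:
        return "image"
    if suffix in PDF_EXTENSIONS:
        return "pdf"
    raise ValueError(f"Unsupported document type: {suffix or '(no extension)'}")
```

The check is case-insensitive, so `sample_invoice.PNG` is treated the same as `sample_invoice.png`.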

Datasets for Testing

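| Dataset | Description                                | Link   |
|---------|--------------------------------------------|--------|
| SROIE   | Scanned receipt OCR and information extraction | SROIE  |
| FUNSD   | Form understanding in noisy scanned documents  | FUNSD  |
| DocVQA  | Document visual question answering             | DocVQA |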

Configuration (.env)

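| Variable               | Default              | Description                       |
|------------------------|----------------------|-----------------------------------|
| OCR_LANG               | en                   | EasyOCR language code             |
| OCR_USE_GPU            | False                | Enable GPU for OCR                |
| LAYOUT_SCORE_THRESHOLD | 0.5                  | LayoutParser confidence threshold |
| OUTPUT_DIR             | outputs/json_results | Auto-saved JSON directory         |
| OPENAI_API_KEY         | (blank)              | Optional LLM extraction           |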
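
These variables could be read into a typed settings object along the following lines. This is a sketch under assumptions: the `Settings` dataclass and `load_settings` function are hypothetical, the project may load its configuration differently, and the `.env` values are assumed to be in the process environment already (for example via a loader such as python-dotenv).

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    ocr_lang: str
    ocr_use_gpu: bool
    layout_score_threshold: float
    output_dir: str
    openai_api_key: Optional[str]


def _as_bool(value: str) -> bool:
    """Interpret common truthy spellings; anything else is False."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from an environment mapping, using the documented defaults."""
    return Settings(
        ocr_lang=env.get("OCR_LANG", "en"),
        ocr_use_gpu=_as_bool(env.get("OCR_USE_GPU", "False")),
        layout_score_threshold=float(env.get("LAYOUT_SCORE_THRESHOLD", "0.5")),
        output_dir=env.get("OUTPUT_DIR", "outputs/json_results"),
        # A blank key is treated the same as an unset one.
        openai_api_key=env.get("OPENAI_API_KEY") or None,
    )
```

Passing a plain dictionary instead of `os.environ` makes the loader easy to exercise in tests.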

License

MIT © 2026
