DocIntel — Document Intelligence API

Upload any document → get structured, machine-readable data back.

DocIntel is an end-to-end document intelligence pipeline that takes PDFs and images through layout detection, OCR, and named entity extraction — returning clean, structured JSON. Built for real-world use cases like digitizing land deeds, processing research papers, and extracting data from scanned forms.

Python 3.11+ | FastAPI | React | License: MIT


Screenshots

Upload Interface

Drag-and-drop document upload with supported format indicators and processing capabilities overview.

[Screenshot: DocIntel Upload Interface]

Text Blocks — Layout Detection Results

Extracted text blocks with semantic type classification (Title, Paragraph, List, Table, Footer), confidence scores, and bounding box coordinates.

[Screenshot: DocIntel Text Blocks View]

Entity Extraction

Named entities detected across pages — people, locations, dates, monetary values, and organizations — with type-coded badges and character offsets.

[Screenshot: DocIntel Entities View]

Structured JSON Output

Full API response with syntax highlighting — ready for downstream integration.

[Screenshot: DocIntel JSON Output]


The Problem

Organizations worldwide need to convert unstructured documents into machine-readable data: smallholder farmers registering land deeds, NGOs digitizing health records, researchers processing paper archives. Existing tools tend to be either expensive cloud APIs or fragmented open-source libraries that require significant glue code.

DocIntel provides a single API endpoint that handles the entire pipeline: upload a document, get structured JSON back with text blocks, bounding boxes, and extracted entities.

Features

  • Multi-format support — PDF, PNG, JPG, JPEG, TIFF, BMP
  • Layout detection — Identifies titles, paragraphs, tables, lists, and figures using heuristic-based connected component analysis
  • OCR extraction — Tesseract-powered text extraction with per-block confidence scores
  • Named entity recognition — Extracts dates, monetary amounts, percentages, emails, phone numbers, and addresses via regex patterns; optional spaCy integration for PERSON, ORG, GPE entities
  • Structured output — Clean JSON with bounding boxes, block types, and page-level organization
  • Async processing — Submit large documents for background processing with status polling
  • React dashboard — Upload documents and explore results with an interactive UI
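
The regex-based entity extraction listed above can be illustrated with a small sketch. This is not the code in entities.py: the patterns, the `Entity` tuple, and the `PERCENT` and `EMAIL` label names are assumptions made for illustration (only `DATE` and `MONEY` appear in the sample API response), and real patterns would need to cover many more formats.

```python
import re
from typing import NamedTuple


class Entity(NamedTuple):
    text: str
    label: str
    start: int  # offset of the first character
    end: int    # offset one past the last character


# Illustrative patterns only; the project's real patterns may differ.
PATTERNS = {
    "DATE": re.compile(
        r"\b(?:January|February|March|April|May|June|July|August|"
        r"September|October|November|December)\s+\d{1,2},\s+\d{4}\b"
    ),
    "MONEY": re.compile(r"\$\d+(?:,\d{3})*(?:\.\d{2})?"),
    "PERCENT": re.compile(r"\b\d+(?:\.\d+)?%"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}


def extract_entities(text: str) -> list[Entity]:
    """Return all regex matches, sorted by their position in the text."""
    found = [
        Entity(m.group(), label, m.start(), m.end())
        for label, pattern in PATTERNS.items()
        for m in pattern.finditer(text)
    ]
    return sorted(found, key=lambda e: e.start)


sample = "Signed on January 15, 2024 for $150,000 (5% deposit). Contact: clerk@example.org"
for entity in extract_entities(sample):
    print(entity)
```

Keeping `start` and `end` as half-open character offsets means `text[start:end]` always reproduces the matched string, which is the same convention the sample response appears to use.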

Architecture

[Diagram: DocIntel Pipeline Architecture]
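
The Features list describes the layout stage as heuristic connected component analysis. As a rough illustration of that general technique, and not of what layout.py actually does, the sketch below labels 4-connected regions in a tiny binary grid and returns one bounding box per region, using the same `x`, `y`, `width`, `height` keys as the API response. The function name and the toy grid are hypothetical; a real pipeline would presumably work on binarized page images and merge nearby components into blocks.

```python
from collections import deque


def connected_component_boxes(grid: list[list[int]]) -> list[dict[str, int]]:
    """Find 4-connected regions of 1s and return one bounding box per region."""
    rows = len(grid)
    cols = len(grid[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill starting from this unvisited ink pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            top, bottom, left, right = r, r, c, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and grid[ny][nx] == 1 and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append({
                "x": left,
                "y": top,
                "width": right - left + 1,
                "height": bottom - top + 1,
            })
    return boxes


# Toy "page": one wide region on top, two square regions below.
page = [
    [1, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 1],
    [0, 1, 1, 0, 1, 1],
]
print(connected_component_boxes(page))
```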

Tech Stack

| Layer | Technology |
| --- | --- |
| API | FastAPI, Uvicorn, Pydantic v2 |
| OCR | Tesseract (via pytesseract) |
| PDF | pdf2image (Poppler), pypdf |
| NLP | Regex patterns + spaCy (optional) |
| Image | Pillow, NumPy |
| Frontend | React 18, Vite, Tailwind CSS |
| Container | Docker, Docker Compose |

Quick Start

Prerequisites

  • Python 3.11+
  • Node.js 18+
  • Tesseract OCR (brew install tesseract / apt install tesseract-ocr)
  • Poppler (brew install poppler / apt install poppler-utils)

Option 1: Docker (Recommended)

git clone https://github.com/Jonathan-321/docintel.git
cd docintel
docker-compose up --build

The API will be available at http://localhost:8000 and the frontend at http://localhost:3000.

Option 2: Manual Setup

Backend:

cd backend
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate
pip install -r requirements.txt

# Optional: install spaCy model for enhanced NER
python -m spacy download en_core_web_sm

uvicorn app.main:app --reload --port 8000

Frontend:

cd frontend
npm install
npm run dev

API Reference

Process Document (Sync)

POST /api/v1/process
Content-Type: multipart/form-data

| Parameter | Type | Description |
| --- | --- | --- |
| file | File | Document to process |
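
A minimal client sketch for this endpoint, using only the Python standard library. The URL assumes the Quick Start defaults, the form field name `file` comes from the parameter table, and the helper names `encode_multipart` and `process_document` are hypothetical. Most clients would use an HTTP library instead of building the multipart body by hand; it is spelled out here to show what the request contains.

```python
import json
import os
import urllib.request
import uuid

API_URL = "http://localhost:8000/api/v1/process"  # assumes the Quick Start host and port


def encode_multipart(field: str, filename: str, payload: bytes, mime: str) -> tuple[bytes, str]:
    """Build a multipart/form-data body holding a single file field."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {mime}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + payload + tail, f"multipart/form-data; boundary={boundary}"


def process_document(path: str, mime: str = "application/pdf", url: str = API_URL) -> dict:
    """POST a file to the sync endpoint and return the decoded JSON response."""
    with open(path, "rb") as fh:
        body, content_type = encode_multipart("file", os.path.basename(path), fh.read(), mime)
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With the backend running, `process_document("land_deed.pdf")` should return a dictionary shaped like the response documented for this endpoint.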

Response:

{
  "filename": "land_deed.pdf",
  "num_pages": 2,
  "pages": [
    {
      "page_number": 1,
      "width": 2550,
      "height": 3300,
      "blocks": [
        {
          "text": "CERTIFICATE OF TITLE",
          "confidence": 0.95,
          "bbox": { "x": 120, "y": 50, "width": 800, "height": 60 },
          "block_type": "title"
        }
      ],
      "entities": [
        {
          "text": "January 15, 2024",
          "label": "DATE",
          "start": 45,
          "end": 61
        },
        {
          "text": "$150,000",
          "label": "MONEY",
          "start": 120,
          "end": 128
        }
      ]
    }
  ],
  "metadata": {
    "ocr_engine": "tesseract",
    "language": "eng",
    "spacy_available": true
  },
  "processing_time_ms": 1234.56
}

Process Document (Async)

POST /api/v1/process/async
Content-Type: multipart/form-data

Returns a job_id for polling:

{
  "job_id": "abc123",
  "status": "processing",
  "progress": 0.0
}

Check Job Status

GET /api/v1/status/{job_id}
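
A polling loop for async jobs might look like the sketch below. The source only shows the `processing` status, so the terminal states `completed` and `failed` are assumptions, as are the function name `poll_job` and the intermediate `progress` values in the canned replies. The HTTP call is injected as `fetch_status` so the loop can be exercised without a running server.

```python
import time
from typing import Callable

# Assumed terminal states: the documented response only shows "processing".
TERMINAL_STATES = {"completed", "failed"}


def poll_job(
    fetch_status: Callable[[str], dict],
    job_id: str,
    interval: float = 1.0,
    timeout: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch_status(job_id) until the job is terminal or the timeout passes."""
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status(job_id)
        if status.get("status") in TERMINAL_STATES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} still {status.get('status')!r} after {timeout}s")
        sleep(interval)


# Stand-in for GET /api/v1/status/{job_id}: returns canned replies in order.
replies = iter([
    {"job_id": "abc123", "status": "processing", "progress": 0.0},
    {"job_id": "abc123", "status": "processing", "progress": 0.5},
    {"job_id": "abc123", "status": "completed", "progress": 1.0},
])
final = poll_job(lambda job_id: next(replies), "abc123", interval=0.0)
print(final["status"])
```

In real use, `fetch_status` would issue the GET request and decode the JSON body; a fixed interval keeps the sketch short, though a backoff may be kinder to the server for long jobs.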

List Supported Formats

GET /api/v1/formats
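
A client can also pre-check a file before uploading it. The sketch below hard-codes the extensions from the Features list and the documented default size limit; both are assumptions about what the server enforces, and the response of this endpoint remains the authoritative list. `check_upload` is a hypothetical helper, not part of the project.

```python
from pathlib import Path

# Extensions taken from the Features list; the live endpoint is the source of truth.
SUPPORTED_EXTENSIONS = {".pdf", ".png", ".jpg", ".jpeg", ".tiff", ".bmp"}
MAX_FILE_SIZE = 20 * 1024 * 1024  # 20971520 bytes, the documented default


def check_upload(filename: str, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []
    suffix = Path(filename).suffix.lower()
    if suffix not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported extension: {suffix or '(none)'}")
    if size_bytes > MAX_FILE_SIZE:
        problems.append(f"file is {size_bytes} bytes, limit is {MAX_FILE_SIZE}")
    return problems


print(check_upload("deed.PDF", 1024))
print(check_upload("notes.docx", 10))
```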

Interactive Docs

FastAPI auto-generates interactive API documentation:

  • Swagger UI: http://localhost:8000/docs
  • ReDoc: http://localhost:8000/redoc

Project Structure

docintel/
├── backend/
│   ├── app/
│   │   ├── api/
│   │   │   └── routes.py          # API endpoints
│   │   ├── core/
│   │   │   └── config.py          # Settings & configuration
│   │   ├── models/
│   │   │   └── schemas.py         # Pydantic data models
│   │   ├── pipeline/
│   │   │   ├── processor.py       # Main document processor
│   │   │   ├── layout.py          # Layout detection
│   │   │   ├── ocr.py             # OCR engine
│   │   │   └── entities.py        # Entity extraction
│   │   ├── utils/
│   │   │   └── file_utils.py      # File handling utilities
│   │   └── main.py                # FastAPI app entry point
│   ├── tests/
│   │   ├── test_health.py
│   │   ├── test_process.py
│   │   └── test_entities.py
│   ├── Dockerfile
│   └── requirements.txt
├── frontend/
│   ├── src/
│   │   ├── components/
│   │   │   ├── FileUpload.jsx
│   │   │   ├── ResultsView.jsx
│   │   │   ├── ProcessingStatus.jsx
│   │   │   ├── Header.jsx
│   │   │   └── Sidebar.jsx
│   │   ├── api/
│   │   │   └── client.js
│   │   ├── App.jsx
│   │   └── main.jsx
│   ├── package.json
│   └── vite.config.js
├── docker-compose.yml
├── LICENSE
└── README.md

Development

Running Tests

cd backend
pip install pytest pytest-asyncio httpx
pytest -v

Environment Variables

Copy .env.example to .env in the backend directory:

| Variable | Default | Description |
| --- | --- | --- |
| DEBUG | false | Enable debug mode |
| OCR_LANGUAGE | eng | Tesseract language pack |
| TESSERACT_CMD | tesseract | Path to tesseract binary |
| MAX_FILE_SIZE | 20971520 | Max upload size in bytes (20 MB) |
| UPLOAD_DIR | /tmp/docintel | Temporary upload directory |
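
As a sketch of how these variables and defaults could be read, the example below uses a plain dataclass and `os.environ`. It is not the project's config.py, which may well use Pydantic settings instead; the `Settings` class, the `load_settings` helper, and the accepted truthy strings for `DEBUG` are assumptions.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    debug: bool = False
    ocr_language: str = "eng"
    tesseract_cmd: str = "tesseract"
    max_file_size: int = 20971520  # 20 * 1024 * 1024 bytes
    upload_dir: str = "/tmp/docintel"


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from environment variables, falling back to the documented defaults."""
    defaults = Settings()
    return Settings(
        # Assumed truthy spellings; the project may parse DEBUG differently.
        debug=env.get("DEBUG", "false").strip().lower() in {"1", "true", "yes"},
        ocr_language=env.get("OCR_LANGUAGE", defaults.ocr_language),
        tesseract_cmd=env.get("TESSERACT_CMD", defaults.tesseract_cmd),
        max_file_size=int(env.get("MAX_FILE_SIZE", defaults.max_file_size)),
        upload_dir=env.get("UPLOAD_DIR", defaults.upload_dir),
    )


print(load_settings({"DEBUG": "true", "OCR_LANGUAGE": "fra"}))
```

Passing the environment as a mapping keeps the loader easy to test; calling `load_settings()` with no argument reads the real process environment.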

Use Cases

  • Land administration — Digitize property deeds and extract parcel numbers, dates, and monetary values
  • Healthcare — Process scanned medical records and extract patient information, dates, and diagnoses
  • Research — Bulk-process academic papers and extract titles, authors, citations, and key findings
  • Finance — Extract transaction data from scanned invoices, receipts, and bank statements
  • Government — Digitize census forms, birth certificates, and other civil documents

Roadmap

  • Table structure recognition (row/column detection)
  • Handwriting recognition support
  • Multi-language OCR (Arabic, Kinyarwanda, French)
  • Document classification (invoice vs. letter vs. form)
  • Batch processing endpoint
  • Webhook notifications for async jobs
  • Fine-tuned layout detection model (YOLO-based)

License

MIT License — see LICENSE for details.
