DocIntel — Document Intelligence API

Upload any document → get structured, machine-readable data back.

DocIntel is an end-to-end document intelligence pipeline that takes PDFs and images through layout detection, OCR, and named entity extraction — returning clean, structured JSON. Built for real-world use cases like digitizing land deeds, processing research papers, and extracting data from scanned forms.

Screenshots

Upload Interface

Drag-and-drop document upload with supported format indicators and processing capabilities overview.

Text Blocks — Layout Detection Results

Extracted text blocks with semantic type classification (Title, Paragraph, List, Table, Footer), confidence scores, and bounding box coordinates.

Entity Extraction

Named entities detected across pages — people, locations, dates, monetary values, and organizations — with type-coded badges and character offsets.

Structured JSON Output

Full API response with syntax highlighting — ready for downstream integration.

The Problem

Organizations worldwide, from smallholder farmers registering land deeds to NGOs digitizing health records to researchers processing paper archives, need to convert unstructured documents into machine-readable data. Existing tools are either expensive cloud APIs or fragmented open-source libraries that require significant glue code.

DocIntel provides a single API endpoint that handles the entire pipeline: upload a document, get structured JSON back with text blocks, bounding boxes, and extracted entities.

Features

Multi-format support — PDF, PNG, JPG, JPEG, TIFF, BMP
Layout detection — Identifies titles, paragraphs, tables, lists, and figures using heuristic-based connected component analysis
OCR extraction — Tesseract-powered text extraction with per-block confidence scores
Named entity recognition — Extracts dates, monetary amounts, percentages, emails, phone numbers, and addresses via regex patterns; optional spaCy integration for PERSON, ORG, GPE entities
Structured output — Clean JSON with bounding boxes, block types, and page-level organization
Async processing — Submit large documents for background processing with status polling
React dashboard — Upload documents and explore results with an interactive UI
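
As a rough illustration of the regex-based entity pass listed above, the sketch below finds a few entity types and reports character offsets in the text/label/start/end shape used by the API response. The patterns and the `extract_entities` name are hypothetical and deliberately simplified; they are not DocIntel's actual patterns, and the `PERCENT` and `EMAIL` label names are assumptions.

```python
import re

# Hypothetical, simplified patterns for illustration only; DocIntel's real
# patterns may differ. DATE and MONEY match labels in the sample response;
# PERCENT and EMAIL are assumed label names.
PATTERNS = {
    "DATE": re.compile(
        r"\b(?:January|February|March|April|May|June|July|August|"
        r"September|October|November|December) \d{1,2}, \d{4}\b"
    ),
    "MONEY": re.compile(r"\$\d+(?:,\d{3})*(?:\.\d{2})?"),
    "PERCENT": re.compile(r"\b\d+(?:\.\d+)?%"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}


def extract_entities(text: str) -> list[dict]:
    """Return entities as dicts with text, label, and character offsets."""
    entities = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            entities.append(
                {
                    "text": match.group(),
                    "label": label,
                    "start": match.start(),
                    "end": match.end(),
                }
            )
    # Sort by position so the output follows reading order.
    return sorted(entities, key=lambda e: e["start"])


if __name__ == "__main__":
    sample = "Sold on January 15, 2024 for $150,000 (5% deposit)."
    for entity in extract_entities(sample):
        print(entity)
```

Because `end` is exclusive, `text[start:end]` always reproduces the matched string, which makes the offsets easy to verify downstream.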

Architecture

Tech Stack

Layer	Technology
API	FastAPI, Uvicorn, Pydantic v2
OCR	Tesseract (via pytesseract)
PDF	pdf2image (Poppler), pypdf
NLP	Regex patterns + spaCy (optional)
Image	Pillow, NumPy
Frontend	React 18, Vite, Tailwind CSS
Container	Docker, Docker Compose

Quick Start

Prerequisites

Python 3.11+
Node.js 18+
Tesseract OCR (brew install tesseract / apt install tesseract-ocr)
Poppler (brew install poppler / apt install poppler-utils)

Option 1: Docker (Recommended)

git clone https://github.com/Jonathan-321/docintel.git
cd docintel
docker-compose up --build

The API will be available at http://localhost:8000 and the frontend at http://localhost:3000.

Option 2: Manual Setup

Backend:

cd backend
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate
pip install -r requirements.txt

# Optional: install spaCy model for enhanced NER
python -m spacy download en_core_web_sm

uvicorn app.main:app --reload --port 8000

Frontend:

cd frontend
npm install
npm run dev

API Reference

Process Document (Sync)

POST /api/v1/process
Content-Type: multipart/form-data

Parameter	Type	Description
`file`	File	Document to process
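
Before posting a file, a client may want to reject obviously unsuitable uploads locally. The sketch below is one way to do that, assuming the format list under Features and the documented `MAX_FILE_SIZE` default; `check_upload` is a hypothetical helper, and the server's own validation may differ (for example, it may inspect file contents rather than extensions).

```python
from pathlib import Path

# Formats listed under Features; the server's own check may differ.
SUPPORTED_EXTENSIONS = {".pdf", ".png", ".jpg", ".jpeg", ".tiff", ".bmp"}
MAX_FILE_SIZE = 20971520  # documented default for MAX_FILE_SIZE, in bytes


def check_upload(filename: str, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []
    suffix = Path(filename).suffix.lower()
    if suffix not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported extension: {suffix or '(none)'}")
    if size_bytes > MAX_FILE_SIZE:
        problems.append(f"file is {size_bytes} bytes; limit is {MAX_FILE_SIZE}")
    return problems
```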

Response:

{
  "filename": "land_deed.pdf",
  "num_pages": 2,
  "pages": [
    {
      "page_number": 1,
      "width": 2550,
      "height": 3300,
      "blocks": [
        {
          "text": "CERTIFICATE OF TITLE",
          "confidence": 0.95,
          "bbox": { "x": 120, "y": 50, "width": 800, "height": 60 },
          "block_type": "title"
        }
      ],
      "entities": [
        {
          "text": "January 15, 2024",
          "label": "DATE",
          "start": 45,
          "end": 61
        },
        {
          "text": "$150,000",
          "label": "MONEY",
          "start": 120,
          "end": 128
        }
      ]
    }
  ],
  "metadata": {
    "ocr_engine": "tesseract",
    "language": "eng",
    "spacy_available": true
  },
  "processing_time_ms": 1234.56
}
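
One way to consume this response downstream is to flatten it into one row per text block. The sketch below relies only on the fields shown in the sample above. It assumes that `bbox` `x` and `y` refer to the top-left corner of the block, which the sample does not state; `bbox_to_corners` and `summarize` are hypothetical helpers, not part of DocIntel.

```python
from typing import Any


def bbox_to_corners(bbox: dict[str, float]) -> tuple[float, float, float, float]:
    """Convert an x/y/width/height box to (left, top, right, bottom).

    Assumption: x and y are the top-left corner of the box.
    """
    return (
        bbox["x"],
        bbox["y"],
        bbox["x"] + bbox["width"],
        bbox["y"] + bbox["height"],
    )


def summarize(response: dict[str, Any], min_confidence: float = 0.0) -> list[dict[str, Any]]:
    """Flatten all pages into block rows, skipping blocks below min_confidence."""
    rows = []
    for page in response["pages"]:
        for block in page["blocks"]:
            if block["confidence"] < min_confidence:
                continue
            rows.append(
                {
                    "page": page["page_number"],
                    "type": block["block_type"],
                    "text": block["text"],
                    "corners": bbox_to_corners(block["bbox"]),
                }
            )
    return rows
```

Filtering on `confidence` at this stage keeps low-quality OCR output out of later steps without discarding it from the stored response.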

Process Document (Async)

POST /api/v1/process/async
Content-Type: multipart/form-data

Returns a job_id for polling:

{
  "job_id": "abc123",
  "status": "processing",
  "progress": 0.0
}

Check Job Status

GET /api/v1/status/{job_id}
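
A client-side polling loop for this endpoint could look like the sketch below. It assumes the status endpoint returns the same `job_id`/`status`/`progress` shape as the async submission response, and that any `status` other than `"processing"` is terminal; neither is confirmed here. `wait_for_job` and `fetch_status` are hypothetical names, and the HTTP call is injected so the loop can run without a server.

```python
import time
from typing import Any, Callable


def wait_for_job(
    job_id: str,
    fetch_status: Callable[[str], dict[str, Any]],
    interval_s: float = 1.0,
    timeout_s: float = 60.0,
) -> dict[str, Any]:
    """Poll until the job leaves the "processing" state or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = fetch_status(job_id)
        # Assumption: any status other than "processing" is terminal.
        if status.get("status") != "processing":
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} still processing after {timeout_s}s")
        time.sleep(interval_s)
```

With `httpx` installed, `fetch_status` could be as small as `lambda job_id: httpx.get(f"http://localhost:8000/api/v1/status/{job_id}").json()`.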

List Supported Formats

GET /api/v1/formats

Interactive Docs

FastAPI auto-generates interactive API documentation:

Swagger UI: http://localhost:8000/docs
ReDoc: http://localhost:8000/redoc

Project Structure

docintel/
├── backend/
│   ├── app/
│   │   ├── api/
│   │   │   └── routes.py          # API endpoints
│   │   ├── core/
│   │   │   └── config.py          # Settings & configuration
│   │   ├── models/
│   │   │   └── schemas.py         # Pydantic data models
│   │   ├── pipeline/
│   │   │   ├── processor.py       # Main document processor
│   │   │   ├── layout.py          # Layout detection
│   │   │   ├── ocr.py             # OCR engine
│   │   │   └── entities.py        # Entity extraction
│   │   ├── utils/
│   │   │   └── file_utils.py      # File handling utilities
│   │   └── main.py                # FastAPI app entry point
│   ├── tests/
│   │   ├── test_health.py
│   │   ├── test_process.py
│   │   └── test_entities.py
│   ├── Dockerfile
│   └── requirements.txt
├── frontend/
│   ├── src/
│   │   ├── components/
│   │   │   ├── FileUpload.jsx
│   │   │   ├── ResultsView.jsx
│   │   │   ├── ProcessingStatus.jsx
│   │   │   ├── Header.jsx
│   │   │   └── Sidebar.jsx
│   │   ├── api/
│   │   │   └── client.js
│   │   ├── App.jsx
│   │   └── main.jsx
│   ├── package.json
│   └── vite.config.js
├── docker-compose.yml
├── LICENSE
└── README.md

Development

Running Tests

cd backend
pip install pytest pytest-asyncio httpx
pytest -v

Environment Variables

Copy .env.example to .env in the backend directory:

Variable	Default	Description
`DEBUG`	`false`	Enable debug mode
`OCR_LANGUAGE`	`eng`	Tesseract language pack
`TESSERACT_CMD`	`tesseract`	Path to tesseract binary
`MAX_FILE_SIZE`	`20971520`	Max upload size in bytes (20 MiB)
`UPLOAD_DIR`	`/tmp/docintel`	Temporary upload directory
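
To show how the variables and defaults in this table fit together, here is a minimal standard-library sketch that reads them from the environment. `Settings` and `load_settings` are hypothetical names; the project's own `config.py` may be structured differently (for instance, with Pydantic), and the set of strings accepted as true for `DEBUG` is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    debug: bool = False
    ocr_language: str = "eng"
    tesseract_cmd: str = "tesseract"
    max_file_size: int = 20 * 1024 * 1024  # 20971520 bytes
    upload_dir: str = "/tmp/docintel"


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from environment variables, using the documented defaults."""
    defaults = Settings()
    return Settings(
        # Assumption: "1", "true", and "yes" (any case) enable debug mode.
        debug=env.get("DEBUG", "false").strip().lower() in {"1", "true", "yes"},
        ocr_language=env.get("OCR_LANGUAGE", defaults.ocr_language),
        tesseract_cmd=env.get("TESSERACT_CMD", defaults.tesseract_cmd),
        max_file_size=int(env.get("MAX_FILE_SIZE", defaults.max_file_size)),
        upload_dir=env.get("UPLOAD_DIR", defaults.upload_dir),
    )
```

Passing a plain dict instead of `os.environ` makes it easy to check overrides in tests, for example `load_settings({"OCR_LANGUAGE": "fra"})`.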

Use Cases

Land administration — Digitize property deeds and extract parcel numbers, dates, and monetary values
Healthcare — Process scanned medical records and extract patient information, dates, and diagnoses
Research — Bulk-process academic papers and extract titles, authors, citations, and key findings
Finance — Extract transaction data from scanned invoices, receipts, and bank statements
Government — Digitize census forms, birth certificates, and other civil documents

Roadmap

Table structure recognition (row/column detection)
Handwriting recognition support
Multi-language OCR (Arabic, Kinyarwanda, French)
Document classification (invoice vs. letter vs. form)
Batch processing endpoint
Webhook notifications for async jobs
Fine-tuned layout detection model (YOLO-based)

License

MIT License — see LICENSE for details.
