Extract structured text and data from documents such as invoices, book pages, and tables using OpenCV and Tesseract OCR.

Kamaruddheen/document-scanner

Document Scanner

Extracts text and data from documents such as invoices, book pages, and tables using OpenCV and Tesseract OCR.

Overview

The document scanner implements:

  • Preprocessing images to improve OCR accuracy
  • Contour detection and perspective transforms to isolate regions of interest (ROIs)
  • Text extraction using PyTesseract
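
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the code in dataextractor.py: the function names (order_corners, target_size, scan_document), the blur and Canny parameters, and the choice of Otsu thresholding are all assumptions made for the example. It also assumes the document is the largest four-sided contour in the image and falls back to the full image otherwise.

```python
import math


def order_corners(points):
    """Order four (x, y) points as top-left, top-right, bottom-right, bottom-left."""
    pts = [tuple(map(float, p)) for p in points]
    by_sum = sorted(pts, key=lambda p: p[0] + p[1])    # x + y: smallest is top-left
    by_diff = sorted(pts, key=lambda p: p[1] - p[0])   # y - x: smallest is top-right
    return [by_sum[0], by_diff[0], by_sum[-1], by_diff[-1]]


def target_size(corners):
    """Width and height of the flattened page, taken from the longer opposite edges."""
    tl, tr, br, bl = corners
    width = max(math.dist(tl, tr), math.dist(bl, br))
    height = max(math.dist(tl, bl), math.dist(tr, br))
    return int(round(width)), int(round(height))


def scan_document(image_path):
    """Hypothetical end-to-end sketch: preprocess, isolate the page, run OCR."""
    # Imported here so the pure-Python helpers above work without these packages.
    import cv2
    import numpy as np
    import pytesseract

    image = cv2.imread(image_path)
    if image is None:
        raise FileNotFoundError(image_path)

    # 1. Preprocessing: grayscale, light blur, edge map (parameters are guesses).
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    blurred = cv2.GaussianBlur(gray, (5, 5), 0)
    edges = cv2.Canny(blurred, 75, 200)

    # 2. Contour detection: take the largest contour that simplifies to 4 corners.
    #    [-2] picks the contour list in both OpenCV 3 and OpenCV 4.
    contours = cv2.findContours(edges, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)[-2]
    page = gray  # fallback when no quadrilateral is found
    for contour in sorted(contours, key=cv2.contourArea, reverse=True)[:5]:
        perimeter = cv2.arcLength(contour, True)
        approx = cv2.approxPolyDP(contour, 0.02 * perimeter, True)
        if len(approx) != 4:
            continue
        corners = order_corners(approx.reshape(4, 2))
        width, height = target_size(corners)
        if width < 2 or height < 2:
            continue
        # Perspective transform: map the four corners onto an upright rectangle.
        dst = [(0, 0), (width - 1, 0), (width - 1, height - 1), (0, height - 1)]
        matrix = cv2.getPerspectiveTransform(np.float32(corners), np.float32(dst))
        page = cv2.warpPerspective(gray, matrix, (width, height))
        break

    # 3. Text extraction: binarize with Otsu's threshold, then hand off to Tesseract.
    _, binary = cv2.threshold(page, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return pytesseract.image_to_string(binary)
```

Calling scan_document with the path to a photographed page would return the recognized text as a string. Ordering the corners before the transform matters because cv2.getPerspectiveTransform pairs source and destination points by position, so a shuffled order produces a mirrored or twisted result.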

It can handle multiple document types:

  • Invoices
  • Book pages
  • Tables

The dataextractor.py module contains the core implementation.

Requirements

The scanner requires:

  • OpenCV
  • PyTesseract (a Python wrapper; the Tesseract OCR engine itself must be installed separately)
  • NumPy

Install requirements using:

pip install -r requirements.txt

Examples

The static/samples/ folder contains example images of different documents.
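
One way to try the samples is to run OCR over every image in that folder. The sketch below is an assumption about usage rather than the project's documented entry point: the helper names (list_sample_images, ocr_samples) and the set of file extensions are hypothetical, and it calls PyTesseract directly on the unprocessed images instead of going through dataextractor.py.

```python
from pathlib import Path

# Assumed set of image extensions; adjust to match the files actually present.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp"}


def list_sample_images(folder="static/samples"):
    """Return image files in the folder, sorted by path; empty if the folder is missing."""
    root = Path(folder)
    if not root.is_dir():
        return []
    return sorted(
        p for p in root.iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )


def ocr_samples(folder="static/samples"):
    """Run Tesseract on each sample image and map file name to extracted text."""
    # Imported here so listing files works even without the OCR stack installed.
    import pytesseract
    from PIL import Image

    results = {}
    for path in list_sample_images(folder):
        with Image.open(path) as img:
            results[path.name] = pytesseract.image_to_string(img)
    return results


if __name__ == "__main__":
    for name, text in ocr_samples().items():
        print(f"--- {name} ---")
        print(text.strip())
```

Running raw OCR like this gives a baseline to compare against the preprocessed output, which makes it easier to see how much the preprocessing and perspective correction help on each document type.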
