An Optical Character Reader for extracting text from images and images containing scanned handwritten text.

Lakshay-812/OCR-NLP

Optical Character Recognition with NLP

An Optical Character Reader for extracting text from images and scanned handwritten text.

❖ Text from Images Using Tesseract
❖ Text from Handwritten Images Using TensorFlow

Natural Language Processing (NLP) techniques are used to improve OCR accuracy.

❖ Using BERT (Bidirectional Encoder Representations from Transformers)
❖ Using NLTK
❖ Using Python Spellchecker

OCR Using Tesseract

❑ Tesseract is called directly through its API to extract printed text from images.

❑ Tesseract 4 and later include a neural network subsystem based on LSTM.

❑ It does not work well on handwritten text.
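A minimal sketch of the Tesseract call described above. It assumes the `pytesseract` wrapper and Pillow are installed and that the Tesseract binary is on the path; the README does not say which binding the project uses, so the function names here are illustrative only.

```python
import re


def tidy_ocr_output(raw: str) -> str:
    """Strip form feeds, trailing spaces and runs of blank lines from OCR text."""
    text = raw.replace("\x0c", "")  # Tesseract output may end with a form feed
    lines = [line.rstrip() for line in text.splitlines()]
    # Collapse three or more newlines into a single blank line.
    collapsed = re.sub(r"\n{3,}", "\n\n", "\n".join(lines))
    return collapsed.strip()


def extract_printed_text(image_path: str, lang: str = "eng") -> str:
    """Run Tesseract on one image file and return the tidied text."""
    # Imported here so tidy_ocr_output stays usable without Tesseract installed.
    import pytesseract
    from PIL import Image

    with Image.open(image_path) as image:
        raw = pytesseract.image_to_string(image, lang=lang)
    return tidy_ocr_output(raw)
```

Calling `extract_printed_text("page.png")` on a printed page would return its text as one string; `page.png` is a placeholder file name.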

OCR Using TensorFlow

➢ OCR for extracting text from images containing handwritten text.

➢ The model is a Neural Network (NN) trained on images of handwritten text from the IAM dataset.

➢ The image is split into lines before text extraction, because the model is trained to read one line of text at a time.
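The README does not say how the line-wise split is done. One common approach is a horizontal projection profile: rows that contain ink belong to a text line, and empty rows separate lines. The sketch below assumes the page has already been binarised into rows of 0 (background) and 1 (ink); `find_text_lines` and `min_ink` are hypothetical names.

```python
from typing import List, Tuple


def find_text_lines(binary_image: List[List[int]], min_ink: int = 1) -> List[Tuple[int, int]]:
    """Return (start_row, end_row) bands, end exclusive, for each run of rows with ink."""
    bands: List[Tuple[int, int]] = []
    start = None
    for row_index, row in enumerate(binary_image):
        has_ink = sum(row) >= min_ink
        if has_ink and start is None:
            start = row_index  # a text line begins on this row
        elif not has_ink and start is not None:
            bands.append((start, row_index))  # the line ended on the previous row
            start = None
    if start is not None:
        # The last text line runs to the bottom edge of the image.
        bands.append((start, len(binary_image)))
    return bands
```

Each returned band can then be cropped out and passed to the handwriting model as a single line image.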

Model Overview

The model consists of:

--> Convolutional NN (CNN) layers

--> Recurrent NN (RNN) layers

--> Connectionist Temporal Classification (CTC).
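In a CNN + RNN + CTC model, the RNN emits one score vector per time step, and a CTC decoder turns that sequence into text. The sketch below shows greedy (best-path) decoding in plain Python: take the best class per step, merge repeats, and drop blanks. It assumes the blank is the last class index; the repository's actual decoder and character set are not given in the README.

```python
from typing import List, Sequence


def ctc_greedy_decode(frame_scores: Sequence[Sequence[float]], alphabet: str) -> str:
    """Best-path CTC decoding.

    frame_scores[t][k] is the score of class k at time step t. Classes
    0..len(alphabet)-1 map to characters; class len(alphabet) is the blank.
    """
    blank = len(alphabet)
    decoded: List[str] = []
    previous = blank
    for frame in frame_scores:
        # Index of the highest-scoring class at this time step.
        label = max(range(len(frame)), key=lambda k: frame[k])
        # Emit a character only when it differs from the previous step and is not blank.
        if label != previous and label != blank:
            decoded.append(alphabet[label])
        previous = label
    return "".join(decoded)
```

A blank between two identical characters is what lets the decoder output a doubled letter such as the "ll" in "hello".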

Post-OCR Error Detection and Correction

I. Process scanned image using OCR

✓ Scanned text is cleaned by removing special and unwanted characters using NLTK library functions.
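A small sketch of the cleaning step. It uses the standard library `re` module as a stand-in rather than the NLTK functions the project uses, and the set of characters to keep is an assumption.

```python
import re


def clean_ocr_text(text: str) -> str:
    """Replace characters outside a small allowed set with spaces, then normalise whitespace."""
    # Keep letters, digits, whitespace and basic punctuation.
    kept = re.sub(r"[^A-Za-z0-9\s.,;:!?'()-]", " ", text)
    return re.sub(r"\s+", " ", kept).strip()
```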

II. Process document and identify unreadable words

✓ Incorrect words are identified with the SpellChecker class from Python's enchant library.
✓ NLTK's part-of-speech (POS) tagging is used to exclude person names from the list of incorrect words.
✓ Each incorrect word is replaced with a [MASK] token, and replacement word suggestions from SpellChecker are stored.
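The masking step above can be sketched as follows. The spell checker is passed in as two callables so the example runs without a dictionary installed; with pyenchant these could be the `check` and `suggest` methods of `enchant.Dict("en_US")`. The function name and the `person_names` argument (standing in for the result of POS tagging) are hypothetical.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def mask_unknown_words(
    words: List[str],
    is_known: Callable[[str], bool],
    suggest: Callable[[str], List[str]],
    person_names: Iterable[str] = (),
) -> Tuple[List[str], Dict[int, List[str]]]:
    """Replace unrecognised words with [MASK] and keep the spell-checker suggestions.

    Returns the masked word list and a mapping from word position to suggestions.
    """
    names = set(person_names)
    masked = list(words)
    suggestions: Dict[int, List[str]] = {}
    for position, word in enumerate(words):
        if word in names or is_known(word):
            continue  # leave person names and valid words untouched
        suggestions[position] = suggest(word)
        masked[position] = "[MASK]"
    return masked, suggestions
```

Joining the masked list with spaces gives the input for the BERT step, and the stored suggestions are used again when the predictions are refined.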

III. Load BERT model and predict replacement words

✓ The BERT model looks for the [MASK] tokens and predicts the original word behind each one, based on the context provided by the other words in the sequence.
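A hedged sketch of the prediction step using the Hugging Face `transformers` fill-mask pipeline. The README does not name the BERT checkpoint or library, so `bert-base-uncased` and the function below are assumptions. The predictor can be injected through `fill_mask`, which keeps the function testable without downloading a model.

```python
from typing import Callable, List, Optional


def predict_masked_words(
    text: str, top_k: int = 5, fill_mask: Optional[Callable] = None
) -> List[List[str]]:
    """Return, for each [MASK] in text, the top_k candidate words, best first."""
    if fill_mask is None:
        # Imported lazily: loading transformers and the model weights is slow.
        from transformers import pipeline

        fill_mask = pipeline("fill-mask", model="bert-base-uncased")
    results = fill_mask(text, top_k=top_k)
    # One mask yields a flat list of dicts; several masks yield one list per mask.
    if results and isinstance(results[0], dict):
        results = [results]
    return [[candidate["token_str"].strip() for candidate in per_mask] for per_mask in results]
```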

IV. Refine BERT predictions by using suggestions from Python SpellChecker

✓ The suggested word list from SpellChecker incorporates characters from the garbled OCR output. It is combined with BERT's context-based suggestions to yield better predictions, and the best prediction replaces the [MASK] token.
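The README does not spell out how the two candidate lists are combined. One plausible rule, sketched below under that assumption, is to take the highest-ranked BERT candidate that the spell checker also suggested, and otherwise fall back to the candidate most similar to the garbled word. `choose_replacement` is a hypothetical name.

```python
from difflib import SequenceMatcher
from typing import List


def choose_replacement(
    garbled: str, bert_candidates: List[str], spell_suggestions: List[str]
) -> str:
    """Pick a replacement for one [MASK] by combining both candidate lists."""
    lowered = {s.lower() for s in spell_suggestions}
    # Prefer the best-ranked BERT candidate that the spell checker also proposed.
    for candidate in bert_candidates:
        if candidate.lower() in lowered:
            return candidate
    # Otherwise fall back to the candidate that looks most like the garbled word.
    pool = bert_candidates + spell_suggestions
    if not pool:
        return garbled  # nothing to choose from, so keep the OCR output
    return max(pool, key=lambda c: SequenceMatcher(None, garbled.lower(), c.lower()).ratio())
```

The overlap check favours words that fit both the context (BERT) and the character shape of the OCR output (spell checker), which is the intent described in the step above.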

Sample Output

Sample 1
