Optical Character Recognition with NLP

An Optical Character Recognition (OCR) tool for extracting text from images, including scanned handwritten text.

❖ Text from images using Tesseract
❖ Text from handwritten images using TensorFlow

Natural Language Processing (NLP) techniques are used to improve OCR accuracy:

❖ Using BERT (Bidirectional Encoder Representations from Transformers)
❖ Using NLTK
❖ Using Python SpellChecker

OCR Using Tesseract

❑ Tesseract is called directly through its API to extract printed text from images.

❑ Tesseract (version 4 onward) includes a neural network subsystem based on LSTM (Long Short-Term Memory).

❑ It does not work well on handwritten text.
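As a rough illustration of calling Tesseract from Python, the sketch below assumes the pytesseract wrapper and Pillow are installed. The README does not name the binding it uses, so both are assumptions, and the function names are hypothetical. The `--oem 1` option selects the LSTM engine and `--psm 6` treats the image as a single uniform block of text.

```python
def build_tesseract_config(oem: int = 1, psm: int = 6) -> str:
    """Build a Tesseract option string.

    --oem 1 selects the LSTM engine.
    --psm 6 assumes a single uniform block of text.
    """
    return f"--oem {oem} --psm {psm}"


def extract_printed_text(image_path: str, lang: str = "eng") -> str:
    """Run Tesseract on one image file and return the recognised text."""
    # Imported here so the module still loads where the OCR stack
    # (pytesseract, Pillow, the tesseract binary) is not installed.
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as image:
        return pytesseract.image_to_string(
            image, lang=lang, config=build_tesseract_config()
        )
```

Calling `extract_printed_text("page.png")` with a path of your own would return the text as a single string. The page segmentation mode may need adjusting for single-line or sparse images.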

OCR Using TensorFlow

➢ Extracts text from images containing handwritten text.

➢ Uses a Neural Network (NN) trained on images of handwritten text from the IAM dataset.

➢ The image is split into lines before text extraction, because the model is trained to read one line of text at a time.
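One common way to split a page into lines is a horizontal projection profile: count the ink pixels in each row and cut wherever a run of inked rows ends. The README does not say which method the project uses, so the sketch below is an assumption, with a hypothetical `split_into_lines` helper that works on a binarised image given as rows of 0/1 values.

```python
from typing import List, Tuple


def split_into_lines(image: List[List[int]], min_ink: int = 1) -> List[Tuple[int, int]]:
    """Return (start_row, end_row) pairs, end exclusive, one per text line.

    `image` is a list of pixel rows where 1 marks ink and 0 marks background.
    A row belongs to a text line when it holds at least `min_ink` ink pixels.
    """
    lines: List[Tuple[int, int]] = []
    start = None  # first row of the line currently being scanned
    for row_index, row in enumerate(image):
        has_ink = sum(row) >= min_ink
        if has_ink and start is None:
            start = row_index                 # a new line begins
        elif not has_ink and start is not None:
            lines.append((start, row_index))  # the line ended on the previous row
            start = None
    if start is not None:                     # the last line touches the bottom edge
        lines.append((start, len(image)))
    return lines


page = [
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 0, 0, 1],
]
line_ranges = split_into_lines(page)  # [(1, 3), (5, 6)]
```

Each row range can then be cropped out and passed to the recogniser separately. Real scans usually need binarisation and a higher `min_ink` threshold to ignore noise.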

Model Overview

The model consists of:

--> Convolutional NN (CNN) layers

--> Recurrent NN (RNN) layers

--> Connectionist Temporal Classification (CTC)
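To show what the CTC stage contributes at inference time, here is a minimal sketch of best-path (greedy) CTC decoding in plain Python. It assumes the network emits one score vector per time step and that one index is reserved for the CTC blank. The README does not state which decoder the project uses (beam search is another option), and the `ctc_greedy_decode` name is hypothetical.

```python
from typing import List, Sequence


def ctc_greedy_decode(frame_scores: Sequence[Sequence[float]], alphabet: str, blank: int) -> str:
    """Best-path CTC decoding.

    1. Pick the highest-scoring label at every time step.
    2. Collapse consecutive repeats of the same label.
    3. Drop the blank label.
    """
    best_path: List[int] = [
        max(range(len(scores)), key=lambda i: scores[i]) for scores in frame_scores
    ]
    decoded = []
    previous = None
    for label in best_path:
        if label != previous and label != blank:
            decoded.append(alphabet[label])
        previous = label
    return "".join(decoded)


# Labels: 0 -> "a", 1 -> "b", 2 -> blank.
scores = [
    [0.8, 0.1, 0.1],  # a
    [0.7, 0.2, 0.1],  # a (repeat, collapsed)
    [0.1, 0.1, 0.8],  # blank separates the two a's
    [0.6, 0.3, 0.1],  # a
    [0.2, 0.7, 0.1],  # b
    [0.1, 0.8, 0.1],  # b (repeat, collapsed)
]
text = ctc_greedy_decode(scores, alphabet="ab", blank=2)  # "aab"
```

The blank label is what lets the decoder keep genuine double letters: without the blank frame between them, the two runs of "a" would collapse into one.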

Post-OCR Error Detection and Correction

I. Process scanned image using OCR

✓ Scanned text is cleaned by removing special and unwanted characters using NLTK library functions.
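The cleaning step can be sketched as below. Note that this sketch swaps in Python's standard `re` module rather than the NLTK functions the project uses, and the set of characters to keep is an assumption chosen for illustration.

```python
import re


def clean_ocr_text(text: str) -> str:
    """Remove characters that are unlikely to be real text, then tidy whitespace."""
    # Keep letters, digits, whitespace, and basic punctuation; replace the rest with a space.
    text = re.sub(r"[^A-Za-z0-9\s.,;:!?'\-]", " ", text)
    # Collapse runs of whitespace (including newlines) into single spaces.
    return re.sub(r"\s+", " ", text).strip()


cleaned = clean_ocr_text("The qu1ck ## brown |fox|  jumps.\n")
# cleaned == "The qu1ck brown fox jumps."
```

Misrecognised characters inside words, such as the "1" in "qu1ck", are deliberately left alone here; they are handled by the spell-checking step that follows.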

II. Process document and identify unreadable words

✓ Incorrect words are identified with the SpellChecker from the Python enchant library.
✓ NLTK’s part-of-speech (POS) tagging is used to exclude person names from the list of incorrect words.
✓ Each incorrect word is replaced with a [MASK] token, and replacement word suggestions from SpellChecker are stored.
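The masking step above can be sketched with the standard library alone. In this sketch a plain `vocabulary` set stands in for the enchant dictionary, a `names` set stands in for the result of NLTK part-of-speech tagging, and `difflib.get_close_matches` stands in for SpellChecker suggestions. All three are assumptions for illustration, and `mask_unknown_words` is a hypothetical helper.

```python
import difflib
from typing import FrozenSet, List, Set, Tuple


def mask_unknown_words(
    words: List[str],
    vocabulary: Set[str],
    names: FrozenSet[str] = frozenset(),
    max_suggestions: int = 3,
) -> Tuple[str, List[List[str]]]:
    """Replace unknown words with [MASK] and collect suggestions for each one.

    Returns the masked sentence and one suggestion list per [MASK], in order.
    """
    masked: List[str] = []
    suggestions: List[List[str]] = []
    candidates = sorted(vocabulary)  # sorted for deterministic suggestion order
    for word in words:
        if word.lower() in vocabulary or word in names:
            masked.append(word)  # known word or person name: keep as-is
        else:
            masked.append("[MASK]")
            suggestions.append(
                difflib.get_close_matches(word.lower(), candidates, n=max_suggestions, cutoff=0.6)
            )
    return " ".join(masked), suggestions


vocabulary = {"the", "cat", "sat", "on", "mat", "hat", "man"}
sentence, suggestions = mask_unknown_words(
    ["Alice", "sat", "on", "the", "mqt"], vocabulary, names=frozenset({"Alice"})
)
# sentence == "Alice sat on the [MASK]"
# suggestions == [["mat"]]
```

Keeping the suggestions in the same order as the [MASK] tokens makes it straightforward to pair each mask with its candidates in the later refinement step.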

III. Load BERT model and predict replacement words

✓ The BERT model locates the [MASK] tokens and predicts the original value of each masked word from the context provided by the other words in the sequence.
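A minimal sketch of masked-word prediction follows, assuming the Hugging Face `transformers` package and the `bert-base-uncased` checkpoint. The README does not name the BERT implementation or checkpoint, so both are assumptions, as are the function names. The predictor is passed in as a callable so the logic can be tried without downloading a model.

```python
from typing import Callable, List, Tuple


def load_fill_mask(model_name: str = "bert-base-uncased"):
    """Create a fill-mask pipeline (downloads model weights on first use)."""
    # Imported lazily so this module loads even when transformers is not installed.
    from transformers import pipeline

    return pipeline("fill-mask", model=model_name)


def predict_masked_word(sentence: str, fill_mask: Callable, top_k: int = 5) -> List[Tuple[str, float]]:
    """Return (candidate_word, score) pairs for the single [MASK] in `sentence`."""
    if sentence.count("[MASK]") != 1:
        # With several masks the pipeline returns a nested list; this sketch
        # keeps to the single-mask case.
        raise ValueError("this sketch handles exactly one [MASK] token")
    predictions = fill_mask(sentence, top_k=top_k)
    return [(p["token_str"], p["score"]) for p in predictions]
```

With the real pipeline, usage would look like `predict_masked_word("The cat sat on the [MASK].", load_fill_mask())`. A sentence with several masked words can be handled by predicting one mask at a time.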

IV. Refine BERT predictions by using suggestions from Python SpellChecker

✓ The suggested word list from SpellChecker, which incorporates characters from the garbled OCR output, is combined with BERT’s context-based suggestions to yield better predictions. The best prediction then replaces the [MASK] token.
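The README does not spell out how the two suggestion lists are combined, so the following is only one possible rule, written as a hypothetical `choose_replacement` helper. It prefers the highest-scoring BERT candidate that SpellChecker also suggested, and otherwise falls back to the BERT candidate that looks most like the garbled word.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def choose_replacement(
    garbled: str,
    bert_candidates: List[Tuple[str, float]],
    spell_suggestions: List[str],
) -> str:
    """Pick a replacement for one [MASK] token.

    `bert_candidates` holds (word, score) pairs from the language model;
    `spell_suggestions` holds words that resemble the garbled OCR output.
    """
    suggested = {s.lower() for s in spell_suggestions}
    ranked = sorted(bert_candidates, key=lambda c: c[1], reverse=True)

    # 1. Best case: context (BERT) and spelling (SpellChecker) agree.
    for word, _score in ranked:
        if word.lower() in suggested:
            return word

    # 2. No agreement: take the BERT candidate closest in spelling to the garbled word.
    if ranked:
        return max(
            ranked,
            key=lambda c: SequenceMatcher(None, garbled.lower(), c[0].lower()).ratio(),
        )[0]

    # 3. No BERT candidates at all: fall back to spelling only, or leave the word unchanged.
    return spell_suggestions[0] if spell_suggestions else garbled


best = choose_replacement(
    "mqt",
    bert_candidates=[("floor", 0.5), ("mat", 0.3), ("bed", 0.1)],
    spell_suggestions=["mat", "met"],
)
# best == "mat"
```

Here BERT alone would have chosen "floor", which fits the context but shares no letters with "mqt". Requiring agreement with the spelling suggestions steers the choice toward "mat".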
