Optical Character Recognition with NLP

An Optical Character Reader for extracting text from images and scanned handwritten text.

❖ Text from Images Using Tesseract
❖ Text from handwritten Images Using TensorFlow

Natural Language Processing (NLP) techniques are used to improve OCR accuracy.

❖ Using BERT (Bidirectional Encoder Representations from Transformers)
❖ Using NLTK
❖ Using Python SpellChecker

OCR Using Tesseract

❑ Tesseract is called through its API to extract printed text from images.

❑ Tesseract (version 4 and later) includes a neural network subsystem built on LSTM.

❑ It does not work well on handwritten text.
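A minimal sketch of calling Tesseract from Python, assuming the `pytesseract` wrapper and Pillow are installed alongside the Tesseract binary. The repository does not state which wrapper or options it uses, so the function names and the `--oem 1` (LSTM engine) and `--psm 6` (single uniform block of text) settings are illustrative assumptions.

```python
def build_config(oem=1, psm=6):
    """Build a Tesseract option string.

    --oem 1 selects the LSTM engine and --psm 6 treats the image as a
    single uniform block of text. Both defaults are illustrative choices,
    not settings taken from the repository.
    """
    return f"--oem {oem} --psm {psm}"


def extract_printed_text(image_path, lang="eng"):
    """Return the text Tesseract recognizes in the image at image_path."""
    # Imported lazily so this module still loads where Tesseract is absent.
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as image:
        return pytesseract.image_to_string(
            image, lang=lang, config=build_config()
        )
```

Calling `extract_printed_text("page.png")` (a placeholder path) returns the recognized text as a single string.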

OCR Using TensorFlow

➢ OCR for extracting text from images containing handwritten text.

➢ Consists of a Neural Network (NN) which is trained using images containing handwritten text from the IAM dataset.

➢ The image is split line-wise for text extraction, because the model is trained to extract text from a single line.
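The line-wise split can be sketched with a horizontal projection profile: rows that contain ink belong to a text line, and empty rows separate lines. The repository does not say how it splits the image, so this approach, the function name, and the 0/1 list-of-rows image format are assumptions made for illustration.

```python
def split_into_lines(binary_image, min_ink=1):
    """Return (top, bottom) row ranges, one per detected text line.

    binary_image is a list of rows, each a list of 0 (background) and
    1 (ink). A row counts as text when it holds at least min_ink ink
    pixels. The bottom index is exclusive.
    """
    lines = []
    start = None
    for y, row in enumerate(binary_image):
        has_ink = sum(row) >= min_ink
        if has_ink and start is None:
            start = y                  # first inked row of a new line
        elif not has_ink and start is not None:
            lines.append((start, y))   # blank row closes the current line
            start = None
    if start is not None:              # line that runs to the bottom edge
        lines.append((start, len(binary_image)))
    return lines


page = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 0],
]
print(split_into_lines(page))  # [(1, 3), (4, 5)]
```

Each returned range can then be cropped from the page and passed to the line-level model.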

Model Overview

The model consists of:

--> Convolutional NN (CNN) layers

--> Recurrent NN (RNN) layers

--> Connectionist Temporal Classification (CTC)
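To show what the CTC stage contributes at inference time, here is a minimal sketch of best-path (greedy) CTC decoding: take the most likely class per time step, collapse consecutive repeats, then drop blanks. The repository may use a different decoder, such as beam search; the function name, the blank index, and the toy scores are assumptions.

```python
def ctc_greedy_decode(frame_scores, alphabet, blank=0):
    """Decode per-frame class scores into text with best-path CTC.

    frame_scores: one list of class scores per time step.
    alphabet: string whose index is the class id; the character at
              index `blank` is only a placeholder for the CTC blank.
    """
    # Most likely class at each time step.
    best_path = [
        max(range(len(scores)), key=scores.__getitem__)
        for scores in frame_scores
    ]
    decoded = []
    previous = None
    for label in best_path:
        # Keep a label only when it differs from the previous frame
        # and is not the blank symbol.
        if label != previous and label != blank:
            decoded.append(alphabet[label])
        previous = label
    return "".join(decoded)


frames = [
    [0.10, 0.80, 0.05, 0.05],  # c
    [0.10, 0.70, 0.10, 0.10],  # c again, collapsed with the previous frame
    [0.90, 0.05, 0.03, 0.02],  # blank
    [0.10, 0.10, 0.70, 0.10],  # a
    [0.60, 0.10, 0.20, 0.10],  # blank
    [0.10, 0.10, 0.10, 0.70],  # t
]
print(ctc_greedy_decode(frames, "-cat"))  # cat
```

The blank between two identical labels is what lets CTC emit double letters: the path `l, blank, l` decodes to "ll", while `l, l` decodes to "l".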

Post-OCR Error Detection and Correction

I. Process scanned image using OCR

✓ Scanned text is cleaned by removing special and unwanted characters using NLTK library functions.
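A minimal sketch of the cleaning step. The project uses NLTK library functions for this; the version below swaps in the standard library's `re` module so it runs without extra dependencies, and the set of characters it keeps is an assumption.

```python
import re


def clean_ocr_text(text):
    """Remove special characters and normalize whitespace in OCR output."""
    # Drop everything except letters, digits, whitespace, and basic punctuation.
    text = re.sub(r"[^A-Za-z0-9\s.,;:!?'-]", "", text)
    # Collapse runs of spaces and newlines into single spaces.
    return re.sub(r"\s+", " ", text).strip()


print(clean_ocr_text("Th|s  is ~ a sc@nned\n\npage."))  # Ths is a scnned page.
```

Words damaged by the removal ("Ths", "scnned") are left for the later spell-check and BERT steps to repair.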

II. Process document and identify unreadable words

✓ Incorrect words are identified with the SpellChecker class from Python's enchant library (PyEnchant).
✓ NLTK's part-of-speech tagging is used to recognize person names so that they are not flagged as incorrect words.
✓ Each incorrect word is replaced with a [MASK] token, and the replacement suggestions from SpellChecker are stored.
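The masking step can be sketched as follows. To keep the example dependency-free, a toy vocabulary set stands in for the enchant dictionary, `difflib.get_close_matches` stands in for SpellChecker's suggestions, and a hand-supplied `names` set stands in for NLTK's part-of-speech tagging. All three substitutions, and the function name, are assumptions.

```python
import difflib


def mask_unknown_words(text, vocabulary, names=frozenset()):
    """Replace words missing from vocabulary with [MASK].

    Returns (masked_text, suggestions). suggestions holds one
    (original_word, candidate_list) pair per mask, in reading order.
    Input is assumed to be cleaned, whitespace-separated text.
    """
    known = sorted(vocabulary)  # stable order for the similarity search
    masked = []
    suggestions = []
    for word in text.split():
        if word.lower() in vocabulary or word in names:
            masked.append(word)
            continue
        masked.append("[MASK]")
        candidates = difflib.get_close_matches(
            word.lower(), known, n=3, cutoff=0.6
        )
        suggestions.append((word, candidates))
    return " ".join(masked), suggestions


vocab = {"the", "quick", "brown", "fox", "saw"}
masked_text, stored = mask_unknown_words(
    "The qu1ck brown f0x saw Alice", vocab, names={"Alice"}
)
print(masked_text)
print(stored)
```

Storing the original garbled word next to its suggestions keeps the character-level evidence available for the refinement step.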

III. Load BERT model and predict replacement words

✓ The BERT model locates the [MASK] tokens and predicts the original word for each one from the context provided by the other words in the sequence.
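A minimal sketch of masked-word prediction, assuming the Hugging Face `transformers` library and the `bert-base-uncased` checkpoint. The repository does not name the BERT implementation or checkpoint it loads, so both are assumptions, as is the function name. The sketch handles one [MASK] per call to keep the return shape simple.

```python
def predict_masked_word(text, top_k=5, model_name="bert-base-uncased"):
    """Return up to top_k (word, probability) guesses for the single [MASK]."""
    if text.count("[MASK]") != 1:
        raise ValueError("expected exactly one [MASK] token in the text")

    # Imported lazily so this module loads even without transformers installed.
    from transformers import pipeline

    # Building the pipeline loads the model; in real use, build it once
    # and reuse it across calls.
    fill_mask = pipeline("fill-mask", model=model_name)
    return [
        (result["token_str"], result["score"])
        for result in fill_mask(text, top_k=top_k)
    ]
```

For text with several masks, one option is to fill them one at a time, left to right, substituting each chosen word before predicting the next.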

IV. Refine BERT predictions by using suggestions from Python SpellChecker

✓ The suggested word list from SpellChecker, which reflects the characters of the garbled OCR output, is combined with BERT's context-based suggestions to yield better predictions. The best prediction replaces the [MASK] token.
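One way to combine the two sources is to score every candidate by its character similarity to the garbled word plus its BERT probability. The repository does not describe its exact combination rule, so the weighted sum, the `weight` parameter, and the function name below are assumptions, with `difflib.SequenceMatcher` used as the similarity measure.

```python
import difflib


def pick_replacement(garbled, spell_suggestions, bert_predictions, weight=0.5):
    """Pick the best replacement for a masked word.

    spell_suggestions: candidate words from the spell checker.
    bert_predictions: (word, probability) pairs from the language model.
    weight: share of the score given to character similarity.
    """
    bert_scores = {word.lower(): prob for word, prob in bert_predictions}
    candidates = set(bert_scores) | {word.lower() for word in spell_suggestions}
    if not candidates:
        return garbled  # nothing to choose from, keep the OCR output

    def score(candidate):
        similarity = difflib.SequenceMatcher(
            None, garbled.lower(), candidate
        ).ratio()
        context = bert_scores.get(candidate, 0.0)
        return weight * similarity + (1 - weight) * context

    # sorted() makes tie-breaking deterministic.
    return max(sorted(candidates), key=score)


best = pick_replacement(
    "qu1ck",
    spell_suggestions=["quick", "quack"],
    bert_predictions=[("big", 0.4), ("quick", 0.3), ("old", 0.1)],
)
print(best)  # quick
```

Here "big" is BERT's top guess, but it shares no characters with "qu1ck", so the candidate supported by both sources wins.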
