Nan.ai Optical Character Recognition (OCR)

Nan.ai is an open source machine learning service that extracts data from form images with handwritten inputs. This ML service was created to support Nan.ai, a grassroots agent platform for microinsurance. The Nan.ai OCR ML service can also be used for other use cases such as (insert examples here) and can be trained using the nan.ai OCR Open Data. This ML service is tailored to optimize the workflow of Saphron.asia; however, as a public good, it can be reused in other scenarios that need an OCR service.

The nan.ai OCR ML service has three components:

  1. Validation - a pre-processing component that adjusts image quality and identifies regions of interest
  2. Extraction - a data extraction workflow that segments each region of interest into input fields, then recognizes and records the handwritten form inputs
  3. Encoding - an annotation workflow that aids in labelling datasets to improve the OCR model. Encoding inputs are used to improve autocorrect suggestions.

You can participate by (1) reporting bugs or (2) suggesting improvements to the implementation. To explore our ML service, you can use the existing notebooks available here, or export the model by (insert instructions here).

Alongside our open source initiative, we are also open sourcing related datasets, nan.ai OCR Open Data, to help you explore and train this model.

Description of the model

  • Data

We started with an initial, more generic model trained on the IAM dataset; this is discussed further here. To tailor the model to our use case, we annotated our own data and used it to retrain the model. The input data are image-text pairs, where the text is manually transcribed from the scanned image. Each image is grayscale, 128 x 32 pixels, and contains a single word. If a cropped image exceeds these dimensions, it is resized (without distortion) until it has a width of 128 px or a height of 32 px. All word images are then placed onto an empty white canvas [1].
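
To make the resizing step concrete, here is a minimal sketch of the preprocessing described above, assuming opencv-python and numpy are installed; the file names and the helper function are illustrative, not taken from this repository.

```python
import cv2
import numpy as np

TARGET_W, TARGET_H = 128, 32  # target word-image size described above

def to_canvas(word_img: np.ndarray) -> np.ndarray:
    """Fit a grayscale word crop into a 128 x 32 white canvas without distortion."""
    h, w = word_img.shape
    scale = min(TARGET_W / w, TARGET_H / h, 1.0)  # shrink only if the crop is too large
    new_w, new_h = max(1, int(w * scale)), max(1, int(h * scale))
    resized = cv2.resize(word_img, (new_w, new_h))
    canvas = np.full((TARGET_H, TARGET_W), 255, dtype=np.uint8)  # empty white canvas
    canvas[:new_h, :new_w] = resized  # paste the word onto the canvas
    return canvas

img = cv2.imread("word.png", cv2.IMREAD_GRAYSCALE)  # illustrative input path
if img is not None:
    cv2.imwrite("word_canvas.png", to_canvas(img))
```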

Moving forward, the model should be trainable with any kind of data. A detailed guide for annotating your data can be found here: 1.2 Create IAM-compatible dataset and train model; a sample annotation line is shown below.
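
For reference, an IAM-compatible dataset annotates one word per line. The entry below follows the words.txt convention documented by the IAM database (the specific values are illustrative):

```
a01-000u-00-00 ok 154 1 408 768 27 51 AT A
```

  • a01-000u-00-00 - word id (word 00 on line 00 of form a01-000u)
  • ok - result of the word segmentation
  • 154 - graylevel used to binarize the line containing the word
  • 1 - number of components making up the word
  • 408 768 27 51 - bounding box around the word in x, y, w, h format
  • AT - grammatical tag of the word
  • A - transcription of the word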

  • Evaluation

The Connectionist Temporal Classification (CTC) loss function is used to evaluate the output of the model, both during training and during inference. During training, the CTC receives the RNN output matrix and the ground-truth text, from which the loss value is computed [1]. During inference, the CTC receives only the character-probability matrix, from which the final text is decoded. The loss value is the negative log-likelihood of seeing the given text, i.e. L = -log(P). If we feed the character-probability matrix and the recognized text to the loss function and then undo the log and the minus, we get the probability P of the recognized text: P = exp(-L) [2].
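
As a minimal illustration of the L = -log(P), P = exp(-L) relation, the sketch below computes a CTC loss on random toy data with TensorFlow and converts it back into a probability; the shapes and label values are made up for the example.

```python
import numpy as np
import tensorflow as tf

# Toy RNN output matrix: 10 time steps, batch of 1, 5 characters + 1 CTC blank.
logits = tf.random.normal([10, 1, 6])
# Ground-truth text encoded as character indices (illustrative).
labels = tf.sparse.from_dense(tf.constant([[1, 2, 3]], dtype=tf.int32))

loss = tf.nn.ctc_loss(
    labels=labels,
    logits=logits,                   # time-major: [max_time, batch, classes]
    label_length=None,               # inferred from the sparse labels
    logit_length=tf.constant([10]),
    blank_index=-1,                  # last class index is the blank symbol
)

L = float(loss[0])                   # negative log-likelihood of the text
print(f"L = {L:.4f}  ->  P = exp(-L) = {np.exp(-L):.6f}")
```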

For the pre-trained model, evaluation is done on the IAM and Bentham HTR datasets. We also evaluated the model on our own data.

  • Setup
  1. Clone this repository.
  2. Install Python 3.8+ and the dependencies from requirements.txt:
    1. tensorflow 2.4.0 or 2.6.0
    2. opencv-python 4.4.0.46
    3. opencv-contrib-python 4.5.1.48
    4. pytesseract 0.3.8
    5. tesseract 5.x
      1. Install tesseract on your environment:
        1. https://tesseract-ocr.github.io/tessdoc/Compiling.html
      2. Install the custom Python library (Word Beam Search) for the Handwritten Text Recognition model.
        1. https://github.com/Saphron-Asia/nan.ai-ml-ocr/tree/main/CTC%20WordBeamSearch
  3. Download pre-trained models as instructed:
    1. Word Detector NN
    2. HTR
  4. (Optional) For extracting regions of interest (RoI), you will need to provide a template (e.g. a form template). Provide three directories to the script: the data, template, and output directories. For a detailed description of these parameters, please refer here.
  5. For the Word Neural Net (WNN):
    1. If you used the RoI, use the RoI’s output directory as input for the WNN.
    2. Otherwise, put the image(s) containing words/text inside a directory. Create a separate directory for the output.
  6. For the Handwritten Text Recognition (HTR) model, pass the output from step (5). After execution, each subdirectory will contain a dataDump.json file, and all results will be summarized in a single log file in the parent directory; a sketch for gathering these files follows this list.
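
Once step 6 has run, something like the following sketch can gather the per-subdirectory dataDump.json files; the output directory name is hypothetical, and the JSON schema is whatever the HTR step produced, so the contents are loaded as-is.

```python
import json
from pathlib import Path

def collect_results(parent_dir: str) -> dict:
    """Collect every dataDump.json found under parent_dir, keyed by subdirectory."""
    results = {}
    for dump in Path(parent_dir).rglob("dataDump.json"):
        with open(dump, encoding="utf-8") as f:
            results[str(dump.parent)] = json.load(f)
    return results

summary = collect_results("./htr_output")  # hypothetical HTR output directory
print(f"Found {len(summary)} dataDump.json file(s)")
```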

Navigate this project

Resources

References

[1] Build a Handwritten Text Recognition System using TensorFlow
[2] FAQ: Build a Handwritten Text Recognition System using TensorFlow
[3] Word Beam Search: A Connectionist Temporal Classification Decoding Algorithm
[4] Handwritten Word Detector

License

nan.ai-ml-ocr is licensed under the Apache License 2.0.
