Nan.ai Optical Character Recognition (OCR)

Nan.ai is an open source machine learning service that extracts data from form images with handwritten inputs. This ML service was created to support Nan.ai, a grassroots agent platform for microinsurance. The Nan.ai OCR ML service can also be used for other use cases such as (insert examples here) and can be trained using the nan.ai OCR Open Data. This ML service is tailored to optimize the workflow of Saphron.asia; however, as a public good, it can be reused in other scenarios that need an OCR service.

The nan.ai OCR ML service has three components:

  1. Validation - a pre-processing component that adjusts image quality and identifies regions of interest
  2. Extraction - a data extraction workflow that segments each region of interest into input fields, then recognizes and records the handwritten form inputs
  3. Encoding - an annotation workflow that aids in labelling datasets to improve the OCR model. Encoding inputs are used to improve autocorrect suggestions.

You can participate by (1) reporting bugs or (2) suggesting improvements to the implementation. To explore our ML service, you can use the existing notebooks available here, or export the model by (insert instructions here).

Alongside our open source initiative, we are also open sourcing related datasets, nan.ai OCR Open Data, to help you explore and train this model.

Description of the model

  • Data

We started with an initial, more generic model trained on the IAM dataset; this is discussed further here. To tailor the model to our use case, we annotated our own data and used it to retrain the model. The input data are image-text pairs, where the text is manually transcribed from the scanned image. Each image is grayscale, 128 x 32 pixels, and contains a single word. If a cropped image exceeds these dimensions, it is resized (without distortion) until it has a width of 128 px or a height of 32 px. All word images are then placed onto an empty white canvas [1].
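
To make the resizing step concrete, here is a minimal sketch of the preprocessing described above, assuming opencv-python and numpy are installed; the file names and the helper function are illustrative, not taken from this repository.

```python
import cv2
import numpy as np

TARGET_W, TARGET_H = 128, 32  # target word-image size described above

def to_canvas(word_img: np.ndarray) -> np.ndarray:
    """Fit a grayscale word crop into a 128 x 32 white canvas without distortion."""
    h, w = word_img.shape
    scale = min(TARGET_W / w, TARGET_H / h, 1.0)  # shrink only if the crop is too large
    new_w, new_h = max(1, int(w * scale)), max(1, int(h * scale))
    resized = cv2.resize(word_img, (new_w, new_h))
    canvas = np.full((TARGET_H, TARGET_W), 255, dtype=np.uint8)  # empty white canvas
    canvas[:new_h, :new_w] = resized  # paste the word onto the canvas
    return canvas

img = cv2.imread("word.png", cv2.IMREAD_GRAYSCALE)  # illustrative input path
if img is not None:
    cv2.imwrite("word_canvas.png", to_canvas(img))
```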

Moving forward, the model should be trainable with any kind of data. A detailed guide for annotating your data can be found here: 1.2 Create IAM-compatible dataset and train model; a sample annotation line is shown below.
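
For reference, an IAM-compatible dataset annotates one word per line. The entry below follows the words.txt convention documented by the IAM database (the specific values are illustrative):

```
a01-000u-00-00 ok 154 1 408 768 27 51 AT A
```

  • a01-000u-00-00 - word id (word 00 on line 00 of form a01-000u)
  • ok - result of the word segmentation
  • 154 - graylevel used to binarize the line containing the word
  • 1 - number of components making up the word
  • 408 768 27 51 - bounding box around the word in x, y, w, h format
  • AT - grammatical tag of the word
  • A - transcription of the word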

  • Evaluation

The Connectionist Temporal Classification (CTC) loss function is used to evaluate the output of the model, both during training and during inference. During training, the CTC receives the RNN output matrix and the ground-truth text, from which the loss value is computed [1]. During inference, the CTC receives only the character-probability matrix, from which the final text is decoded. The loss value is the negative log-likelihood of seeing the given text, i.e. L = -log(P). If we feed the character-probability matrix and the recognized text to the loss function and then undo the log and the minus, we get the probability P of the recognized text: P = exp(-L) [2].
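
As a minimal illustration of the L = -log(P), P = exp(-L) relation, the sketch below computes a CTC loss on random toy data with TensorFlow and converts it back into a probability; the shapes and label values are made up for the example.

```python
import numpy as np
import tensorflow as tf

# Toy RNN output matrix: 10 time steps, batch of 1, 5 characters + 1 CTC blank.
logits = tf.random.normal([10, 1, 6])
# Ground-truth text encoded as character indices (illustrative).
labels = tf.sparse.from_dense(tf.constant([[1, 2, 3]], dtype=tf.int32))

loss = tf.nn.ctc_loss(
    labels=labels,
    logits=logits,                   # time-major: [max_time, batch, classes]
    label_length=None,               # inferred from the sparse labels
    logit_length=tf.constant([10]),
    blank_index=-1,                  # last class index is the blank symbol
)

L = float(loss[0])                   # negative log-likelihood of the text
print(f"L = {L:.4f}  ->  P = exp(-L) = {np.exp(-L):.6f}")
```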

For the pre-trained model, evaluation is done on the IAM and Bentham HTR datasets. We also evaluated the model on our own data.

  • Setup
  1. Clone this repository.
  2. Install Python 3.8+ and the dependencies from requirements.txt:
    1. tensorflow 2.4.0 or 2.6.0
    2. opencv-python 4.4.0.46
    3. opencv-contrib-python 4.5.1.48
    4. pytesseract 0.3.8
    5. tesseract 5.x
      1. Install tesseract on your environment:
        1. https://tesseract-ocr.github.io/tessdoc/Compiling.html
      2. Install the custom Python library (Word Beam Search) for the Handwritten Text Recognition model.
        1. https://github.com/Saphron-Asia/nan.ai-ml-ocr/tree/main/CTC%20WordBeamSearch
  3. Download pre-trained models as instructed:
    1. Word Detector NN
    2. HTR
  4. (Optional) For extracting regions of interest (RoI), you will need to provide a template (e.g. a form template). Provide three directories to the script: the data, template, and output directories. For a detailed description of these parameters, please refer here.
  5. For the Word Neural Net (WNN):
    1. If you used the RoI, use the RoI’s output directory as input for the WNN.
    2. Otherwise, put the image(s) containing words/text inside a directory. Create a separate directory for the output.
  6. For the Handwritten Text Recognition (HTR) model, pass the output from step (5). After execution, each subdirectory will contain a dataDump.json file, and all results will be summarized in a single log file in the parent directory; a sketch for gathering these files follows this list.
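
Once step 6 has run, something like the following sketch can gather the per-subdirectory dataDump.json files; the output directory name is hypothetical, and the JSON schema is whatever the HTR step produced, so the contents are loaded as-is.

```python
import json
from pathlib import Path

def collect_results(parent_dir: str) -> dict:
    """Collect every dataDump.json found under parent_dir, keyed by subdirectory."""
    results = {}
    for dump in Path(parent_dir).rglob("dataDump.json"):
        with open(dump, encoding="utf-8") as f:
            results[str(dump.parent)] = json.load(f)
    return results

summary = collect_results("./htr_output")  # hypothetical HTR output directory
print(f"Found {len(summary)} dataDump.json file(s)")
```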

Navigate this project

Resources

References

[1] Build a Handwritten Text Recognition System using TensorFlow
[2] FAQ: Build a Handwritten Text Recognition System using TensorFlow
[3] Word Beam Search: A Connectionist Temporal Classification Decoding Algorithm
[4] Handwritten Word Detector

License

nan.ai-ml-ocr is licensed under the Apache License 2.0.
