Introduction

xiaomaoaichiyu edited this page Jun 5, 2020 · 6 revisions

Project

Form processing tool based on OCR recognition

This project is based on Microsoft's FoTT tool, a form processing tool, and provides two main functions: form recognition and form generation.

microsoft/OCR-Form-Tools

The main functions of the original project are:

  • label forms
  • train a recognition model
  • process new forms with the trained model

There are several reasons for this:

  • General-purpose recognition tools usually recognize the form as a whole, so using the recognized data often requires manually cleaning out the noise.
  • As a recognition target, a form's data is distinctive: users tend to care about only some of its fields rather than the entire form.
  • Form recognition and OCR are not fully equivalent; there is a gap between them.

The training process is complex to operate, which makes the tool inconvenient to use. Our development is mainly aimed at changing this.

  1. Training requires data, which in practice is often private and hard to obtain, and the data largely determines the training outcome. One of our main functions is therefore form data generation: producing large numbers of forms from user-supplied templates and annotation information.

    This function is mainly aimed at the needs of researchers.

  2. As mentioned above, users must label every training form to obtain a good model. We consider this unfriendly to users, so we made substantial modifications and additions here, implementing two brand-new training modes that greatly simplify the user's work.

    • Blank template mode:
      • upload one template PDF file with no data
      • label boxes in the PDF file with a selected attribute (only once)
      • generate data
      • train
      • predict
    • Entity recognition mode:
      • upload five PDF files with data
      • train
      • predict

    The first mode can also generate form data. The second mode only trains and does not generate data.

    At this stage of the project, the user base can be extended to students and professionals. If you have a large number of forms to process, a simple training process is all that is needed before handling them.

  3. Turn model prediction into batch prediction, and provide collation and visualization of the prediction results.

    This function also serves users with a large number of forms. Batch prediction reduces the user's workload, and the collated results are convenient to export and use.
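As an illustration of batch prediction with collated, exportable results, here is a minimal Python sketch. It is not the project's implementation: `predict_one` is a hypothetical stand-in for whatever single-form prediction call the back end exposes, and the CSV layout (one row per file, one column per field) is an assumption.

```python
import csv
import io
from typing import Callable, Dict, Iterable, List


def collate_predictions(
    paths: Iterable[str],
    predict_one: Callable[[str], Dict[str, str]],
) -> List[Dict[str, str]]:
    """Run a (hypothetical) single-form predictor over many files.

    Returns one row per file: the file name plus its predicted fields.
    """
    rows = []
    for path in paths:
        fields = predict_one(path)
        rows.append({"file": path, **fields})
    return rows


def rows_to_csv(rows: List[Dict[str, str]]) -> str:
    """Serialize rows to CSV; columns are the union of all field names."""
    columns = ["file"] + sorted({k for row in rows for k in row if k != "file"})
    buffer = io.StringIO()
    # restval="" leaves a cell empty when a form lacks that field.
    writer = csv.DictWriter(buffer, fieldnames=columns, restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Taking the union of field names keeps forms with missing fields in the same table, which makes the export easier to consume in a spreadsheet.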

Features

Form Recognizer

1. Users upload blank templates

Users upload a PDF template and label the locations and attributes of the data they want to train on. The tool automatically generates training data and trains in the background; after that, the user only needs to upload the documents to be predicted.

2. Users upload five forms with data

Users upload at least 5 PDF documents filled with data, without any annotation. The back end automatically compares and parses the data, generates training data, and trains the model. After that, the user only needs to upload the documents to be predicted.

Form generation

The user simply uploads a PDF template for the forms to be generated and marks the locations and attributes of the data to fill in.

We apply a small random perturbation to the position of the generated data, which makes the generated forms look more realistic.
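The position perturbation could look roughly like the Python sketch below. The box layout `(x, y, width, height)`, the uniform distribution, the `max_shift` parameter, and the clamping to the page are all assumptions made for illustration; the source does not specify how the project perturbs positions.

```python
import random
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x, y, width, height) in page units


def jitter_box(
    box: Box,
    max_shift: float,
    page: Tuple[float, float],
    rng: random.Random,
) -> Box:
    """Shift a field box by a small random offset, keeping it on the page.

    The size is left unchanged; only the position moves by at most
    `max_shift` in each direction.
    """
    x, y, w, h = box
    page_w, page_h = page
    dx = rng.uniform(-max_shift, max_shift)
    dy = rng.uniform(-max_shift, max_shift)
    # Clamp so the shifted box never leaves the page area.
    new_x = min(max(x + dx, 0.0), page_w - w)
    new_y = min(max(y + dy, 0.0), page_h - h)
    return (new_x, new_y, w, h)
```

Passing an explicit `random.Random` instance makes generation reproducible, which helps when regenerating the same training set.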

Currently, the back-end database mainly covers the United States, but by swapping the database you can generate form information for other countries.

Back-End

Key issues

This project needs to meet the training requirements of Microsoft's form recognition tool by mass-producing PDFs. In the training stage, the Microsoft API must first be called to identify the content users filled into the form and output it as JSON before training can begin. However, the form recognition API has limitations: it recognizes all the words in a form, and it may split text that belongs to one field into several fields (because of large gaps or line breaks). These errors in the recognized JSON files must therefore be corrected before training Microsoft's form recognition tool.

The main algorithm

Since the training data usually comes in groups of five, the input to this algorithm is a set of JSON files for the same form filled in with different data. The printed content of the form itself must be recognized as identical fields in every file, so comparing the fields across the JSON files lets us remove this redundant content directly. Second, the JSON data includes a bounding box for each text field, from which its font size and position in the PDF are easy to obtain. We can therefore first compare the relative positions of two texts and then their font sizes to splice related content together (assuming that the content of different fields is either far apart or differs noticeably in font size, which holds in practice). However, the bounding boxes returned by the Microsoft API deviate considerably across fonts, so this algorithm may fail to combine some fields.
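The two steps described above can be sketched in Python as follows. This is a simplified illustration, not the project's code: the `Fragment` type, the thresholds `max_gap` and `max_size_ratio`, the use of box height as a proxy for font size, and the greedy comparison with only the previous fragment are all assumptions.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass
class Fragment:
    text: str
    x: float       # left edge of the bounding box
    y: float       # top edge of the bounding box
    width: float
    height: float  # assumed proxy for font size


def drop_template_text(forms: List[List[Fragment]]) -> List[List[Fragment]]:
    """Remove fragments whose text occurs in every form.

    Such text is assumed to be printed form content. Note that a value the
    user typed identically into every form would also be removed.
    """
    counts: Counter = Counter()
    for form in forms:
        counts.update({frag.text for frag in form})
    shared = {text for text, n in counts.items() if n == len(forms)}
    return [[f for f in form if f.text not in shared] for form in forms]


def related(a: Fragment, b: Fragment, max_gap: float, max_size_ratio: float) -> bool:
    """Decide whether b continues a: similar size, and close on the same or next line."""
    smaller = max(min(a.height, b.height), 1e-9)  # guard against zero height
    if max(a.height, b.height) / smaller > max_size_ratio:
        return False
    same_line = abs(a.y - b.y) <= 0.5 * a.height and 0 <= b.x - (a.x + a.width) <= max_gap
    next_line = 0 <= b.y - (a.y + a.height) <= max_gap and abs(a.x - b.x) <= max_gap
    return same_line or next_line


def merge_fragments(
    frags: List[Fragment], max_gap: float = 15.0, max_size_ratio: float = 1.2
) -> List[str]:
    """Greedily splice fragments in reading order into field texts."""
    groups: List[List[Fragment]] = []
    for frag in sorted(frags, key=lambda f: (f.y, f.x)):
        # Compare only with the last fragment of the previous group.
        if groups and related(groups[-1][-1], frag, max_gap, max_size_ratio):
            groups[-1].append(frag)
        else:
            groups.append([frag])
    return [" ".join(f.text for f in group) for group in groups]
```

For example, an address recognized as two stacked fragments of the same height a few units apart would be joined into one field, while a name fragment further up the page would stay separate.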

Example

pdf1

[image: pdf1]

json1

[images: json1]

pdf2

[image: pdf2]

json2

[images: json2]

As you can see from the JSON and the corresponding PDFs, what the user fills in may be split into different sections, while what is already printed in the form must be recognized identically. Based on this property, applying the above algorithm gives the following result.

[image: result]

As you can see, the address content in the first PDF was successfully spliced together, and inspecting the JSON shows that the original form content has been removed. Because this is inconvenient to display, it is only described here.

Microsoft Cognitive Services

The back end uses Microsoft's entity recognition service to identify the entities in each field and construct {key: value} pairs for training. We mainly use the Named Entity Recognition (NER) feature of Azure.CognitiveServices.TextAnalytics.
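A minimal Python sketch of turning recognized entities into {key: value} pairs is shown below. It does not call the Azure service: the input is assumed to be a list of `(text, category)` pairs already returned by NER, and using the entity category as the key (with a numeric suffix for repeats) is an assumption, not the project's confirmed scheme.

```python
from typing import Dict, List, Tuple

Entity = Tuple[str, str]  # (entity text, entity category), e.g. ("Alice", "Person")


def entities_to_pairs(entities: List[Entity]) -> Dict[str, str]:
    """Use each entity's category as the key and its text as the value.

    A repeated category gets a numeric suffix (Person, Person_2, ...) so
    that later entities do not overwrite earlier ones.
    """
    pairs: Dict[str, str] = {}
    seen: Dict[str, int] = {}
    for text, category in entities:
        seen[category] = seen.get(category, 0) + 1
        key = category if seen[category] == 1 else f"{category}_{seen[category]}"
        pairs[key] = text
    return pairs
```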

referenced document:

Usage

Because an Azure account can only be registered with a Visa card, we manage the account in the back end for users' convenience, so the front end requires no registration. It works out of the box, easy and simple!

Three modes are supported:

  • Original project mode

official documentation

  • Blank template mode

statement of Alpha

  • Entity recognition mode

statement of ER

videos