Transformer-Based OCR

An implementation of a Transformer model for optical character recognition of Russian words.

Optical Character Recognition (OCR) is the conversion of images of typed, handwritten, or printed text into machine-encoded text, whether from a scanned document, a photo of a document, a scene photo, or subtitle text superimposed on an image. OCR is a long-standing research problem in document digitization. Most existing approaches combine a CNN for image understanding with an RNN for character-level text generation. This implementation instead uses the Transformer architecture for both image understanding and wordpiece-level text generation.

Usage

First, download the dataset linked below, or create your own, and place it in the root of the project. The dataset is a folder containing the training and test images plus two annotation files, train.csv and test.csv.
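As a rough illustration of how such annotation files could be read, here is a minimal sketch using only the standard library. The column names `file_name` and `text`, the sample rows, and the `read_annotations` helper are all hypothetical; check the header of the actual train.csv and test.csv files and adjust accordingly.

```python
import csv
import io


def read_annotations(csv_file, image_col="file_name", text_col="text"):
    """Return a list of (image path, transcription) pairs.

    The default column names are assumptions made for this sketch,
    not the layout guaranteed by the project.
    """
    reader = csv.DictReader(csv_file)
    return [(row[image_col], row[text_col]) for row in reader]


# In-memory stand-in for an annotation file (made-up rows).
sample = io.StringIO(
    "file_name,text\n"
    "train/0001.png,привет\n"
    "train/0002.jpg,мир\n"
)
pairs = read_annotations(sample)
```

For a real file, open it with `open("train.csv", newline="", encoding="utf-8")` and pass the handle to `read_annotations`.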

train.csv should look as follows:

test.csv should look as follows:

Next, choose which tokenizer to use. To create your own tokenizer, use train_tokenizer.py. To use a tokenizer from Hugging Face instead, change this line of code in train.py and test.py:

tokenizer = AutoTokenizer.from_pretrained("own-tokenizer")
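The argument of `AutoTokenizer.from_pretrained` can be either a local directory or a model identifier on the Hugging Face Hub. The sketch below wraps that choice in a small helper; the `load_tokenizer` name is hypothetical, and it assumes that train_tokenizer.py saves its output to a directory called `own-tokenizer`, which matches the line above but is not confirmed here.

```python
def load_tokenizer(name_or_path: str = "own-tokenizer"):
    """Load a tokenizer from a local directory or from the Hugging Face Hub.

    `name_or_path` is passed straight to AutoTokenizer.from_pretrained.
    The default mirrors the line in train.py and test.py; any Hub
    identifier supplied instead is the caller's choice, not the project's.
    """
    # Imported lazily so that merely defining the helper does not
    # require transformers to be installed.
    from transformers import AutoTokenizer

    return AutoTokenizer.from_pretrained(name_or_path)
```

For example, `load_tokenizer("DeepPavlov/rubert-base-cased")` would fetch a Russian BERT tokenizer from the Hub. That checkpoint is only an illustration and is not one the project recommends.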

To train your model, set the training parameters in train.py and run the script.
To evaluate your model, set the test parameters in test.py and run the script.
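OCR output is commonly scored with the character error rate (CER): the edit distance between predictions and references divided by the number of reference characters. The sketch below is a plain-Python version for sanity-checking predictions. It is an assumption that this metric is relevant here, since test.py may report something different, and the function names are illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # deleting the first i characters of a
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer(predictions, references):
    """Corpus-level character error rate over paired strings."""
    total_edits = sum(levenshtein(p, r) for p, r in zip(predictions, references))
    total_chars = sum(len(r) for r in references)
    return total_edits / total_chars if total_chars else 0.0
```

Calling `cer(["кот"], ["кит"])` counts one substitution against three reference characters.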

Useful links

Li M. et al. TrOCR: Transformer-based optical character recognition with pre-trained models // arXiv preprint arXiv:2109.10282. – 2021.
Atienza R. Vision transformer for fast and efficient scene text recognition // Document Analysis and Recognition – ICDAR 2021: 16th International Conference, Lausanne, Switzerland, September 5–10, 2021, Proceedings, Part I. – Springer International Publishing, 2021. – pp. 319-334.
Kim G. et al. OCR-free document understanding transformer // Computer Vision – ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXVIII. – Cham: Springer Nature Switzerland, 2022. – pp. 498-517.

Python packages

pytorch==1.13.1+cu117
torchvision==0.14.1+cu117
datasets==2.10.1
transformers==4.27.3

Dataset

The training and test datasets consist of 122297 RGB images of Russian text, covering both handwritten and printed examples. The images are distributed as PNG and JPEG files. You can download images here.

Chebart/Transformer-Based-OCR
