Implementation of a Transformer for optical character recognition of Russian words.
Optical Character Recognition (OCR) is the conversion of images of typed, handwritten, or printed text into machine-encoded text, whether from a scanned document, a photo of a document, a scene photo, or subtitle text superimposed on an image. OCR is a long-standing research problem in document digitization. Existing approaches typically combine a CNN for image understanding with an RNN for character-level text generation. This implementation instead uses the Transformer architecture for both image understanding and wordpiece-level text generation.
- First, download the dataset linked below, or create your own dataset, and place it in the root of the project. The dataset is a folder containing the training and test images and two annotation files, train.csv and test.csv.
train.csv should look as follows:
test.csv should look as follows:
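Before training, it can help to confirm that both annotation files are in place and to look at their first rows. The following is a minimal sketch using only the standard library. The folder name `dataset` and the helper names `check_dataset` and `preview_annotations` are hypothetical, and the sketch assumes the CSV files are comma-separated with a header row.

```python
import csv
from pathlib import Path


def check_dataset(root):
    """Return the annotation files that are missing from the dataset folder."""
    root = Path(root)
    return [name for name in ("train.csv", "test.csv") if not (root / name).is_file()]


def preview_annotations(csv_path, n_rows=3):
    """Return the header row and the first n_rows data rows of an annotation file."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader, None)  # assumption: the first row is a header
        rows = [row for _, row in zip(range(n_rows), reader)]
    return header, rows


if __name__ == "__main__":
    dataset_root = Path("dataset")  # hypothetical folder name, adjust to your layout
    missing = check_dataset(dataset_root)
    if missing:
        print("Missing annotation files:", ", ".join(missing))
    else:
        for name in ("train.csv", "test.csv"):
            header, rows = preview_annotations(dataset_root / name)
            print(name, "header:", header)
            for row in rows:
                print("  ", row)
```

The script only reads the files, so it is safe to run on a dataset you have just downloaded or assembled yourself.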
- Choose which tokenizer to use. To create your own tokenizer, use train_tokenizer.py. To use a tokenizer from Hugging Face, change this line of code in train.py and test.py:
tokenizer = AutoTokenizer.from_pretrained("own-tokenizer")
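One way to switch between the two options without editing the line each time is to prefer a local tokenizer folder when it exists and otherwise fall back to a Hub model id. This is only a sketch: the helper names are hypothetical, and `bert-base-multilingual-cased` is an assumed example of a wordpiece tokenizer that covers Russian, not a choice made by this project.

```python
from pathlib import Path


def resolve_tokenizer_source(local_dir="own-tokenizer",
                             hub_name="bert-base-multilingual-cased"):
    """Prefer a locally trained tokenizer folder; otherwise use a Hub model id."""
    return local_dir if Path(local_dir).is_dir() else hub_name


def load_tokenizer(source):
    """Load a tokenizer from a local folder or from the Hugging Face Hub."""
    # Imported lazily so resolve_tokenizer_source works without transformers installed.
    from transformers import AutoTokenizer
    return AutoTokenizer.from_pretrained(source)


if __name__ == "__main__":
    tokenizer = load_tokenizer(resolve_tokenizer_source())
    # Show how a short Russian phrase is split into tokens.
    print(tokenizer.tokenize("привет мир"))
```

Whichever tokenizer is chosen, train.py and test.py should load the same one, since the model's output vocabulary depends on it.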
- To train your model, set the training parameters in train.py and run the script.
- To evaluate your model, set the test parameters in test.py and run the script.
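A metric commonly reported for OCR models is the character error rate (CER): the total edit distance between predictions and ground-truth strings, divided by the total number of ground-truth characters. Which metrics test.py reports is not stated here, so the sketch below is an independent, pure-Python illustration with hypothetical function names.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete a character of a
                            curr[j - 1] + 1,      # insert a character of b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_error_rate(predictions, references):
    """Total edit distance divided by the total number of reference characters."""
    total_edits = sum(edit_distance(p, r) for p, r in zip(predictions, references))
    total_chars = sum(len(r) for r in references)
    return total_edits / total_chars if total_chars else 0.0


if __name__ == "__main__":
    # One substituted character out of six reference characters.
    print(character_error_rate(["превет"], ["привет"]))
```

Lower is better; a CER of 0.0 means every prediction matches its reference exactly.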
- Li M. et al. TrOCR: Transformer-based optical character recognition with pre-trained models // arXiv preprint arXiv:2109.10282. – 2021.
- Atienza R. Vision transformer for fast and efficient scene text recognition // Document Analysis and Recognition – ICDAR 2021: 16th International Conference, Lausanne, Switzerland, September 5–10, 2021, Proceedings, Part I 16. – Springer International Publishing, 2021. – pp. 319-334.
- Kim G. et al. OCR-free document understanding transformer // Computer Vision – ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXVIII. – Cham: Springer Nature Switzerland, 2022. – pp. 498-517.
pytorch==1.13.1+cu117
torchvision==0.14.1+cu117
datasets==2.10.1
transformers==4.27.3
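To check whether the current environment matches these pins, the installed versions can be compared with the standard library's `importlib.metadata`. This is a sketch with hypothetical helper names; it assumes the `pytorch` pin refers to the distribution published on PyPI as `torch`.

```python
from importlib import metadata


def parse_pins(lines):
    """Turn 'name==version' lines into a {name: version} dict."""
    pins = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, version = line.partition("==")
        pins[name.strip()] = version.strip()
    return pins


def check_pins(pins):
    """Return {name: (expected, installed_or_None)} for every mismatch."""
    problems = {}
    for name, expected in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # the package is not installed at all
        if installed != expected:
            problems[name] = (expected, installed)
    return problems


if __name__ == "__main__":
    pins = parse_pins([
        "torch==1.13.1+cu117",  # assumption: 'pytorch' above means the 'torch' distribution
        "torchvision==0.14.1+cu117",
        "datasets==2.10.1",
        "transformers==4.27.3",
    ])
    for name, (expected, installed) in check_pins(pins).items():
        print(f"{name}: expected {expected}, found {installed}")
```

An empty result means every pinned package is installed at exactly the listed version.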
The training and test datasets consist of 122297 RGB images of Russian text, including both handwritten and printed examples. The images are distributed as .PNG and .JPEG files. You can download images here.