Multi-language model

Recent Update

2021.4.9: Added detection and recognition support for 80 languages.
2021.4.9: Added a lightweight, high-precision English model for detection and recognition.

PaddleOCR aims to create a rich, leading, and practical OCR tool library. It provides Chinese and English models for general scenarios, models trained specifically for English scenarios, and multilingual models covering 80 languages.

Among them, the English model supports the detection and recognition of uppercase and lowercase letters and common punctuation, and its recognition of space characters has been optimized.

The multilingual models cover Latin, Arabic, Traditional Chinese, Korean, Japanese, and other scripts.

This document will briefly introduce how to use the multilingual model.

1 Installation
- 1.1 paddle installation
- 1.2 paddleocr package installation
2 Quick Use
- 2.1 Command Line Operation
- 2.2 Python Script Running
3 Custom Training
4 Inference and Deployment
5 Supported languages and abbreviations

1 Installation

PaddleOCR provides two installation methods

1.1 paddle installation

# cpu
pip install paddlepaddle

# gpu
pip install paddlepaddle-gpu

1.2 paddleocr package installation

pip install

pip install "paddleocr>=2.0.6" # 2.0.6 version is recommended

Build and install locally

python3 setup.py bdist_wheel
pip3 install dist/paddleocr-x.x.x-py3-none-any.whl # x.x.x is the version number of paddleocr
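
After either installation method, it can be useful to confirm that the installed package meets the recommended minimum version. The following is a minimal sketch using only the Python standard library (Python 3.8 or later); the helper names `parse_version`, `installed_version`, and `meets_minimum` are illustrative and not part of PaddleOCR, and the comparison ignores pre-release suffixes.

```python
import re
from importlib.metadata import PackageNotFoundError, version


def parse_version(text):
    """Return up to three leading numeric components of a version string."""
    return tuple(int(part) for part in re.findall(r"\d+", text)[:3])


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def meets_minimum(installed, minimum="2.0.6"):
    """True if an installed version string is at least the minimum."""
    if installed is None:
        return False
    return parse_version(installed) >= parse_version(minimum)


found = installed_version("paddleocr")
if meets_minimum(found):
    print(f"paddleocr {found} is installed")
else:
    print("paddleocr>=2.0.6 was not found in this environment")
```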

2 Quick Use

2.1 Command Line Operation

View help information

paddleocr -h

Whole image prediction (detection + recognition)

PaddleOCR currently supports 80 languages, which can be switched by changing the --lang parameter. The supported languages and their abbreviations are listed in the table in section 5.

paddleocr --image_dir doc/imgs_en/254.jpg --lang=en

The result is a list; each item contains a text box, the recognized text, and the recognition confidence.

[('PHO CAPITAL', 0.95723116), [[66.0, 50.0], [327.0, 44.0], [327.0, 76.0], [67.0, 82.0]]]
[('107 State Street', 0.96311164), [[72.0, 90.0], [451.0, 84.0], [452.0, 116.0], [73.0, 121.0]]]
[('Montpelier Vermont', 0.97389287), [[69.0, 132.0], [501.0, 126.0], [501.0, 158.0], [70.0, 164.0]]]
[('8022256183', 0.99810505), [[71.0, 175.0], [363.0, 170.0], [364.0, 202.0], [72.0, 207.0]]]
[('REG 07-24-201706:59 PM', 0.93537045), [[73.0, 299.0], [653.0, 281.0], [654.0, 318.0], [74.0, 336.0]]]
[('045555', 0.99346405), [[509.0, 331.0], [651.0, 325.0], [652.0, 356.0], [511.0, 362.0]]]
[('CT1', 0.9988654), [[535.0, 367.0], [654.0, 367.0], [654.0, 406.0], [535.0, 406.0]]]
......

Recognition

paddleocr --image_dir doc/imgs_words_en/word_308.png --det false --lang=en

The result is a tuple containing the recognized text and the recognition confidence.

(0.99879867, 'LITTLE')

Detection

paddleocr --image_dir PaddleOCR/doc/imgs/11.jpg --rec false

The result is a list; each item contains only a text box.

[[26.0, 457.0], [137.0, 457.0], [137.0, 477.0], [26.0, 477.0]]
[[25.0, 425.0], [372.0, 425.0], [372.0, 448.0], [25.0, 448.0]]
[[128.0, 397.0], [273.0, 397.0], [273.0, 414.0], [128.0, 414.0]]
......
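
The three command-line modes above differ only in their flags. As a rough sketch, the same commands can be assembled from Python before being handed to a process runner; the function name `build_paddleocr_command` is hypothetical, and only the flags shown in this section (`--image_dir`, `--lang`, `--det`, `--rec`) are used.

```python
import shlex


def build_paddleocr_command(image_path, lang=None, det=True, rec=True):
    """Build the argument list for the paddleocr command-line tool."""
    cmd = ["paddleocr", "--image_dir", image_path]
    if lang is not None:
        cmd.append(f"--lang={lang}")
    if not det:
        cmd += ["--det", "false"]  # skip detection: recognition only
    if not rec:
        cmd += ["--rec", "false"]  # skip recognition: detection only
    return cmd


full = build_paddleocr_command("doc/imgs_en/254.jpg", lang="en")
rec_only = build_paddleocr_command(
    "doc/imgs_words_en/word_308.png", lang="en", det=False
)
print(shlex.join(full))
print(shlex.join(rec_only))
```

The resulting list can be passed to `subprocess.run` once the paddleocr package is installed.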

2.2 Python Script Running

PaddleOCR can also be run from Python scripts, which makes it easy to embed in your own code:

Whole image prediction (detection + recognition)

from paddleocr import PaddleOCR, draw_ocr

# Switch the language by changing the lang parameter
ocr = PaddleOCR(lang="korean")  # The model files are downloaded automatically on first run
img_path = 'doc/imgs/korean_1.jpg'
result = ocr.ocr(img_path)
# Detection and recognition can also be run separately:
# result = ocr.ocr(img_path, det=False)  # recognition only
# result = ocr.ocr(img_path, rec=False)  # detection only
# Print each detected box with its recognition result
for line in result:
    print(line)

# Visualization
from PIL import Image
image = Image.open(img_path).convert('RGB')
boxes = [line[0] for line in result]
txts = [line[1][0] for line in result]
scores = [line[1][1] for line in result]
im_show = draw_ocr(image, boxes, txts, scores, font_path='doc/fonts/korean.ttf')
im_show = Image.fromarray(im_show)
im_show.save('result.jpg')

Visualization of results:

ppocr also supports text direction classification. For more usage options, please refer to the whl package instructions.
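
A common follow-up step is to discard low-confidence lines before using the text. The sketch below assumes each entry of `result` has the `[box, (text, score)]` layout that the visualization code above indexes into; the helper `filter_by_confidence` and the 0.9 threshold are illustrative, and the sample data is hand-written rather than real model output.

```python
def filter_by_confidence(result, min_score=0.9):
    """Keep entries whose recognition confidence is at least min_score."""
    return [line for line in result if line[1][1] >= min_score]


# Hand-written sample in the [box, (text, score)] layout; not real model output.
sample = [
    [[[0.0, 0.0], [10.0, 0.0], [10.0, 5.0], [0.0, 5.0]], ("hello", 0.98)],
    [[[0.0, 6.0], [10.0, 6.0], [10.0, 11.0], [0.0, 11.0]], ("w0rld", 0.42)],
]

kept = filter_by_confidence(sample)
texts = [line[1][0] for line in kept]
print(texts)
```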

3 Custom Training

PaddleOCR supports custom training or fine-tuning on your own data. You can use the French model's configuration file as a reference and modify the training data path, dictionary, and other parameters.

For the specific data preparation and training process, please refer to Text Detection and Text Recognition. For more functions, such as prediction deployment and data annotation, you can read the complete Document Tutorial.
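
The configuration changes mentioned above (training data path and dictionary) could also be supplied as command-line overrides instead of being edited into the file. The sketch below only builds the argument list. It assumes that training is started with `tools/train.py`, that this script accepts `-c` for the config file and `-o` for `key=value` overrides, and that the relevant keys are named `Global.character_dict_path` and `Train.dataset.data_dir`. None of this is stated in this document, so check it against the Text Recognition tutorial and the config file itself. The data directory is a placeholder.

```python
def build_train_command(config_path, dict_path, data_dir):
    """Assemble a training command with dictionary and data-path overrides.

    The script name, option names, and config keys are assumptions; verify
    them against your copy of the repository before running the command.
    """
    overrides = [
        f"Global.character_dict_path={dict_path}",
        f"Train.dataset.data_dir={data_dir}",
    ]
    return ["python3", "tools/train.py", "-c", config_path, "-o", *overrides]


cmd = build_train_command(
    config_path="rec_french_lite_train.yml",       # config named in section 4
    dict_path="ppocr/utils/dict/french_dict.txt",  # dictionary named in section 4
    data_dir="./train_data/",                      # placeholder path
)
print(" ".join(cmd))
```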

4 Inference and Deployment

In addition to installing the whl package for quick prediction, ppocr also provides a variety of deployment methods for prediction. If necessary, you can read the related documents:

Python Inference
C++ Inference
Serving
Mobile

The deployment tutorials use the Chinese model by default. If you need to use a model for another language, please replace the model files and dictionaries yourself.

The detection models are as follows:

There are two detection models: the Chinese detection model and the English detection model. The Chinese detection model performs better on images containing Chinese and English text. The English detection model is fine-tuned from the Chinese model using English scene data and data in other languages, and performs better in English scenes.

model name	description	config	download
en_mobile_v2.0_det	The multi-language detection model	en_det_mv3_db.yml	inference model / trained model
ch_ppocr_mobile_v2.0_det	The Chinese detection model	ch_det_mv3_db.yml	inference model / trained model

The recognition models are as follows:

model name	dict file	description	config	model size	download
french_mobile_v2.0_rec	ppocr/utils/dict/french_dict.txt	Lightweight model for French recognition	rec_french_lite_train.yml	2.65M	inference model / trained model
german_mobile_v2.0_rec	ppocr/utils/dict/german_dict.txt	Lightweight model for German recognition	rec_german_lite_train.yml	2.65M	inference model / trained model
korean_mobile_v2.0_rec	ppocr/utils/dict/korean_dict.txt	Lightweight model for Korean recognition	rec_korean_lite_train.yml	3.9M	inference model / trained model
japan_mobile_v2.0_rec	ppocr/utils/dict/japan_dict.txt	Lightweight model for Japanese recognition	rec_japan_lite_train.yml	4.23M	inference model / trained model
chinese_cht_mobile_v2.0_rec	ppocr/utils/dict/chinese_cht_dict.txt	Lightweight model for Traditional Chinese recognition	rec_chinese_cht_lite_train.yml	5.63M	inference model / trained model
te_mobile_v2.0_rec	ppocr/utils/dict/te_dict.txt	Lightweight model for Telugu recognition	rec_te_lite_train.yml	2.63M	inference model / trained model
ka_mobile_v2.0_rec	ppocr/utils/dict/ka_dict.txt	Lightweight model for Kannada recognition	rec_ka_lite_train.yml	2.63M	inference model / trained model
ta_mobile_v2.0_rec	ppocr/utils/dict/ta_dict.txt	Lightweight model for Tamil recognition	rec_ta_lite_train.yml	2.63M	inference model / trained model
latin_mobile_v2.0_rec	ppocr/utils/dict/latin_dict.txt	Lightweight model for Latin recognition	rec_latin_lite_train.yml	2.6M	inference model / trained model
arabic_mobile_v2.0_rec	ppocr/utils/dict/arabic_dict.txt	Lightweight model for Arabic recognition	rec_arabic_lite_train.yml	2.6M	inference model / trained model
cyrillic_mobile_v2.0_rec	ppocr/utils/dict/cyrillic_dict.txt	Lightweight model for Cyrillic recognition	rec_cyrillic_lite_train.yml	2.6M	inference model / trained model
devanagari_mobile_v2.0_rec	ppocr/utils/dict/devanagari_dict.txt	Lightweight model for Devanagari recognition	rec_devanagari_lite_train.yml	2.6M	inference model / trained model
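
Every row in the recognition table follows the same naming pattern, so the dictionary and config that belong to a model can be derived from its prefix when swapping models in a deployment. This is a small sketch of that lookup; the function name `rec_model_files` is hypothetical, and the pattern is inferred from the table rather than from any PaddleOCR API.

```python
# Prefixes taken from the "model name" column of the recognition table.
REC_MODEL_PREFIXES = (
    "french", "german", "korean", "japan", "chinese_cht", "te",
    "ka", "ta", "latin", "arabic", "cyrillic", "devanagari",
)


def rec_model_files(prefix):
    """Return the model name, dictionary path, and config for a table row."""
    if prefix not in REC_MODEL_PREFIXES:
        raise ValueError(f"no recognition model listed for prefix {prefix!r}")
    return {
        "model": f"{prefix}_mobile_v2.0_rec",
        "dict": f"ppocr/utils/dict/{prefix}_dict.txt",
        "config": f"rec_{prefix}_lite_train.yml",
    }


files = rec_model_files("korean")
print(files["dict"])
```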

5 Supported languages and abbreviations

Language	Abbreviation	Language	Abbreviation
Chinese and English	ch	Arabic	ar
English	en	Hindi	hi
French	fr	Uyghur	ug
German	german	Persian	fa
Japanese	japan	Urdu	ur
Korean	korean	Serbian(latin)	rs_latin
Traditional Chinese	ch_tra	Occitan	oc
Italian	it	Marathi	mr
Spanish	es	Nepali	ne
Portuguese	pt	Serbian(cyrillic)	rs_cyrillic
Russian	ru	Bulgarian	bg
Ukrainian	uk	Estonian	et
Belarusian	be	Irish	ga
Telugu	te	Croatian	hr
Sanskrit	sa	Hungarian	hu
Tamil	ta	Indonesian	id
Afrikaans	af	Icelandic	is
Azerbaijani	az	Kurdish	ku
Bosnian	bs	Lithuanian	lt
Czech	cs	Latvian	lv
Welsh	cy	Maori	mi
Danish	da	Malay	ms
Maltese	mt	Adyghe	ady
Dutch	nl	Kabardian	kbd
Norwegian	no	Avar	ava
Polish	pl	Dargwa	dar
Romanian	ro	Ingush	inh
Slovak	sk	Lak	lbe
Slovenian	sl	Lezghian	lez
Albanian	sq	Tabassaran	tab
Swedish	sv	Bihari	bh
Swahili	sw	Maithili	mai
Tagalog	tl	Angika	ang
Turkish	tr	Bhojpuri	bho
Uzbek	uz	Magahi	mah
Vietnamese	vi	Nagpur	sck
Mongolian	mn	Newari	new
Abaza	abq	Goan Konkani	gom
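
Several abbreviations in this table are not the two-letter codes one might guess (German is `german`, Japanese is `japan`, Korean is `korean`), so it may help to validate a value before passing it to `--lang` or `lang=`. The sketch below covers only a subset of the table, and the helper `resolve_lang` is illustrative rather than part of PaddleOCR.

```python
# A subset of the table above, keyed by the abbreviation passed as lang.
LANG_NAMES = {
    "ch": "Chinese and English",
    "en": "English",
    "fr": "French",
    "german": "German",
    "japan": "Japanese",
    "korean": "Korean",
    "ch_tra": "Traditional Chinese",
    "ar": "Arabic",
    "hi": "Hindi",
    "ru": "Russian",
}


def resolve_lang(code):
    """Normalize an abbreviation and reject values missing from the table."""
    key = code.strip().lower()
    if key not in LANG_NAMES:
        raise ValueError(f"unsupported lang abbreviation: {code!r}")
    return key


lang = resolve_lang(" Korean ")
print(lang, "->", LANG_NAMES[lang])
```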
