
What dataset was the recognition model in the ultra-lightweight Chinese-English PP-OCRv3 trained on? #10244

Closed
luoyq6 opened this issue Jun 26, 2023 · 2 comments

luoyq6 commented Jun 26, 2023

Please provide the following information so the problem can be located quickly:

  • System Environment:
  • Version: Paddle: PaddleOCR: Related components:
  • Command Code:
  • Complete Error Message:
Gmgge (Contributor) commented Jun 29, 2023

According to the paper "PP-OCRv3: More Attempts for the Improvement of Ultra Lightweight OCR System", the training data is mainly a mix of open-source datasets, Baidu's in-house data, images scraped from the web, and synthetically generated data. Below is the paper's description of the text detection and text recognition datasets; since no new direction classification model was released in v3, that component is not described.

For text detection, there are 127k training images and 200 validation images. The training images consist of 68K real scene images and 59K synthetic images. The real scene images are collected from Baidu image search and public datasets, including LSVT (Sun et al. 2019), RCTW-17 (Shi et al. 2017), MTWI 2018 (He and Yang 2018), CASIA-10K (He et al. 2018), SROIE (Huang et al. 2019), MLT 2019 (Nayef et al. 2019), BDI (Karatzas et al. 2011), MSRA TD500 (Yao et al. 2012) and CCPD 2019 (Xu et al. 2018). The synthetic images mainly focus on the scenarios for long texts, multi-direction texts and texts in table. The validation images are all from real scenes.

For text recognition, there are 18.5M training images and 18.7K validation images. Among the training images, 7M images are real scene images, which come from some public datasets and Baidu image search. The public datasets include LSVT, RCTW-17, MTWI 2018, CCPD 2019, openimages (https://github.com/openimages/dataset) and InvoiceDatasets (https://github.com/FuxiJia/InvoiceDatasets). Besides, we scraped 750k financial report images from the web. We get 810k images from LSVT unlabeled data by using UIM strategy. We also obtain about 3M cropped images from Pubtabnet (https://github.com/ibm-aur-nlp/PubTabNet). The remaining 11.5M synthetic images mainly focus on scenarios for different backgrounds, rotation, perspective transformation, noising, vertical text, etc. The corpus of synthetic images comes from the real scene images. All the validation images also come from the real scenes.
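As a quick sanity check, the real/synthetic splits quoted above add up to the stated training totals for both tasks. A minimal sketch, using only the numbers stated explicitly in the paper excerpt (the finer breakdown of the 7M real recognition images is only partially given, so it is not checked here):

```python
# Sanity check of the dataset sizes quoted from the PP-OCRv3 paper excerpt.

# Text detection: 68K real + 59K synthetic = 127K training images
det_real, det_synth = 68_000, 59_000
assert det_real + det_synth == 127_000

# Text recognition: 7M real + 11.5M synthetic = 18.5M training images
rec_real, rec_synth = 7_000_000, 11_500_000
assert rec_real + rec_synth == 18_500_000

print("detection total:", det_real + det_synth)
print("recognition total:", rec_real + rec_synth)
```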

For downloading these datasets, refer to the relevant documentation.

shiyutang (Collaborator) commented:

The answer above fully addresses the question. If you have new questions, feel free to open a new issue, or continue replying under this one.
We have launched an issue-tackling campaign for the PaddlePaddle suites; interested developers are welcome to join: PaddlePaddle/PaddleOCR#10223
