
What dataset was the recognition model in the ultra-lightweight Chinese-English PP-OCRv3 trained on? #10244

Closed
luoyq6 opened this issue Jun 26, 2023 · 2 comments

luoyq6 commented Jun 26, 2023

Please provide the following information so the problem can be located quickly:

  • System Environment:
  • Version: Paddle: PaddleOCR: Related components:
  • Command Code:
  • Complete Error Message:
Gmgge (Contributor) commented Jun 29, 2023

According to the paper "PP-OCRv3: More Attempts for the Improvement of Ultra Lightweight OCR System", the training data is mainly a mix of open-source datasets, Baidu's in-house data, images scraped from the web, and synthetically generated data. Below is the paper's description of the text detection and text recognition datasets; since no new direction classification model was released in v3, that component is not described.

For text detection, there are 127k training images and 200 validation images. The training images consist of 68K real scene images and 59K synthetic images. The real scene images are collected from Baidu image search and public datasets, including LSVT (Sun et al. 2019), RCTW-17 (Shi et al. 2017), MTWI 2018 (He and Yang 2018), CASIA-10K (He et al. 2018), SROIE (Huang et al. 2019), MLT 2019 (Nayef et al. 2019), BDI (Karatzas et al. 2011), MSRA TD500 (Yao et al. 2012) and CCPD 2019 (Xu et al. 2018). The synthetic images mainly focus on the scenarios for long texts, multi-direction texts and texts in table. The validation images are all from real scenes.

For text recognition, there are 18.5M training images and 18.7K validation images. Among the training images, 7M images are real scene images, which come from some public datasets and Baidu image search. The public datasets include LSVT, RCTW-17, MTWI 2018, CCPD 2019, openimages (https://github.com/openimages/dataset) and InvoiceDatasets (https://github.com/FuxiJia/InvoiceDatasets). Besides, we scraped 750k financial report images from the web. We get 810k images from LSVT unlabeled data by using UIM strategy. We also obtain about 3M cropped images from Pubtabnet (https://github.com/ibm-aur-nlp/PubTabNet). The remaining 11.5M synthetic images mainly focus on scenarios for different backgrounds, rotation, perspective transformation, noising, vertical text, etc. The corpus of synthetic images comes from the real scene images. All the validation images also come from the real scenes.
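As a quick sanity check, the real/synthetic splits quoted above add up to the stated training totals for both tasks. A minimal sketch, using only the numbers stated explicitly in the paper excerpt (the finer breakdown of the 7M real recognition images is only partially given, so it is not checked here):

```python
# Sanity check of the dataset sizes quoted from the PP-OCRv3 paper excerpt.

# Text detection: 68K real + 59K synthetic = 127K training images
det_real, det_synth = 68_000, 59_000
assert det_real + det_synth == 127_000

# Text recognition: 7M real + 11.5M synthetic = 18.5M training images
rec_real, rec_synth = 7_000_000, 11_500_000
assert rec_real + rec_synth == 18_500_000

print("detection total:", det_real + det_synth)
print("recognition total:", rec_real + rec_synth)
```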

For downloading these datasets, refer to the relevant documentation.

shiyutang (Collaborator) commented:

The answer above fully addresses the question. If you have new questions, feel free to open a new issue, or continue replying under this one.
We have launched an issue-tackling campaign for the PaddlePaddle suites; interested developers are welcome to join: PaddlePaddle/PaddleOCR#10223
