Dataset Repo

A collection of datasets for deep learning.

This file is generated by scripts. Do NOT modify it by hand.

Usage

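When adding a new dataset, do three things by hand:

Put the dataset archive in the raw-data directory.
In the metas directory, create a .yaml file that holds the metadata.
Write the unzip code and the dataset loading API.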
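
As an illustration of the third step, here is a minimal sketch of an unzip helper. It is not the code used in this repo: the function name `unzip_dataset` and the output directory layout are assumptions made for the example, and only the standard library is used.

```python
import shutil
from pathlib import Path

# Archive suffixes this sketch knows how to strip when naming the output folder.
ARCHIVE_SUFFIXES = (".tar.gz", ".tgz", ".tar", ".zip")


def unzip_dataset(archive, out_root):
    """Extract one dataset archive into out_root/<archive name without suffix>.

    Hypothetical helper: the repo's real unzip code may be organised differently.
    """
    archive = Path(archive)
    name = archive.name
    for suffix in ARCHIVE_SUFFIXES:
        if name.endswith(suffix):
            name = name[: -len(suffix)]
            break

    target = Path(out_root) / name
    if target.exists():
        # Treat an existing folder as "already extracted" and skip the work.
        # A half-finished earlier run would need the folder deleted first.
        return target

    target.mkdir(parents=True)
    # shutil picks the archive format from the file extension.
    shutil.unpack_archive(str(archive), str(target))
    return target
```

A call such as `unzip_dataset("raw-data/cifar-10-python.tar.gz", "unzipped")` would extract into `unzipped/cifar-10-python`; the `unzipped` directory name is only an example.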

# Put the new dataset under dvc management
# This also updates the dataset list in the README
make dvc-add

# Review the changes to README.md and git commit, then run:

# push to cloud
# Pushes to Aliyun OSS to back up the original PDFs
# Includes flake8 check, dvc push, git push
make push-all

# Generate the notes files. Output directory:
# https://github.com/JackonYang/paper-reading/tree/master/01-zettelkasten/07-dataset-notes
make gen-notes-md

# Unzip the datasets so they are ready to use
make unzip-all

# Update the README
# Run after manually editing the metadata or the README template
make gen-readme

# Remove the local .dvc files and the corresponding dataset files
# https://dvc.org/doc/command-reference/remove
dvc remove *.dvc --outs
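# Unverified. Deletes data in the remote (cloud) storage that is no longer used
# Do not run this command unless the remote storage is running out of space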
# https://dvc.org/doc/command-reference/gc
dvc gc --workspace -c

Datasets

total count: 5

ICDAR 2003

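Rating: ⭐️⭐️

Review: Not recommended. Overall worse than ICDAR 2013, and the ground-truth annotation format is unusual.

Download: http://www.imglab.org/db/files/ICDAR2003-SceneTrialTrain-GT4.tar.gz

Contains 251 full scene images with horizontal text and 860 cropped word images.

In 2011, images containing non-alphanumeric characters or fewer than three characters were removed, and a 50-word lexicon was defined for each image. There is also a 50k-word lexicon made up of all the words in the Hunspell spell-checking dictionary.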

ICDAR 2013

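Rating: ⭐️⭐️⭐️⭐️

Review: A good baseline dataset. Real-world scenes, but on the easy side.

Download: http://rrc.cvc.uab.es/?ch=2&com=downloads

229 training images and 233 test images of focused scene text. Inherits most of its samples from the ICDAR 2003 dataset.

All images are real-world photos showing text on signs, books, posters, or other objects.
The text is all in English and horizontally aligned.
Annotations are axis-aligned bounding boxes.
In total, 1015 cropped word images are extracted.

Widely used to benchmark text detectors.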
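
To make the axis-aligned box annotation concrete, here is a small sketch that parses one ground-truth line into a box and a transcription. The line layout `left top right bottom "text"` (space or comma separated) is an assumption for this example and should be checked against the files in the downloaded archive; `WordBox` and `parse_gt_line` are hypothetical names.

```python
import re
from typing import NamedTuple


class WordBox(NamedTuple):
    """Axis-aligned bounding box of one word, plus its transcription."""
    left: int
    top: int
    right: int
    bottom: int
    text: str


# Assumed layout: four integers separated by spaces and/or commas,
# followed by the transcription in double quotes.
_GT_LINE = re.compile(
    r'^\s*(\d+)[,\s]+(\d+)[,\s]+(\d+)[,\s]+(\d+)[,\s]+"(.*)"\s*$'
)


def parse_gt_line(line):
    """Parse one annotation line; raise ValueError if it does not match."""
    match = _GT_LINE.match(line)
    if match is None:
        raise ValueError(f"unrecognised annotation line: {line!r}")
    left, top, right, bottom = (int(v) for v in match.groups()[:4])
    return WordBox(left, top, right, bottom, match.group(5))


def box_size(box):
    """Return (width, height) of the box in pixels."""
    return box.right - box.left, box.bottom - box.top
```

With a box in hand, cropping a word image is a matter of slicing the source image with `(left, top, right, bottom)` in whichever image library is used.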

ankush-me/SynthText

Rating: ⭐️⭐️⭐️⭐️

Review: Fairly hard. Good quality, and the synthesis approach is worth studying.

Download: https://github.com/ankush-me/SynthText

paper: Synthetic Data for Text Localisation in Natural Images

Word instances are placed in natural scene images, taking the scene layout into account.

Each text instance is annotated with its text-string, word-level and character-level bounding-boxes.

Size: about 8 million word instances in 800 thousand images.

cifar-10

Rating: ⭐️⭐️⭐️⭐️

Review: A fairly simple early dataset.

Download: https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz

no desc
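
Since no description is given, a short loading sketch may help. The Python version of CIFAR-10 ships as pickled batch files, each a dict with `b"data"` (one 3072-value row per image: 1024 red, then 1024 green, then 1024 blue values) and `b"labels"`. The helper names below are hypothetical and this is not this repo's loading API; unpickling the real batches requires numpy to be installed.

```python
import pickle


def load_cifar_batch(path):
    """Load one pickled CIFAR-10 batch file and return (data, labels).

    Keys are bytes because the batches were pickled under Python 2.
    """
    with open(path, "rb") as f:
        batch = pickle.load(f, encoding="bytes")
    return batch[b"data"], batch[b"labels"]


def split_channels(row):
    """Split one 3072-value image row into its red, green and blue planes.

    Each plane holds 1024 values, a 32x32 image stored row by row.
    """
    return row[:1024], row[1024:2048], row[2048:3072]
```

For example, `load_cifar_batch("cifar-10-batches-py/data_batch_1")` would read the first training batch, assuming the archive was extracted in the current directory.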

smaller subset of Imagenet

Rating: ⭐️⭐️⭐️⭐️

Review: 10 easy classes; 160px version.

Download: https://github.com/fastai/imagenette

A smaller subset of 10 easily classified classes from Imagenet, and a little more French

References

Datasets you may need for training a text recognizer: https://cloud.tencent.com/developer/article/1453325
