Chinese datasets for Paddle demos #981

Closed
reyoung opened this issue Dec 21, 2016 · 9 comments

Comments

@reyoung
Collaborator

reyoung commented Dec 21, 2016

Related #176

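To build better Paddle demos and tutorials, we need Chinese datasets. We can obtain them either by annotating data ourselves or by finding publicly available datasets.

Possible Chinese datasets include:

  • Chinese sentiment classification data
    • Determine the sentiment polarity of a sentence. For example: positive / 这个显示器的显示效果真好。 ("This monitor's display quality is really good.")
  • Chinese question answering data
    • For example, Q: 刘德华的妻子是谁? ("Who is Andy Lau's wife?") A: 朱丽倩 (Zhu Liqian)
  • Chinese dialogue data
  • Chinese image captioning data
    • Given an image, produce a description.
    • I believe we have already released one such dataset publicly, which could be used directly.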
@llxxxll
Member

llxxxll commented Dec 21, 2016

UCI datasets: http://archive.ics.uci.edu/ml/index.html
Kaggle datasets: https://www.kaggle.com/datasets

Two dataset references contributed by @beckett1124

@llxxxll
Member

llxxxll commented Dec 22, 2016

Image classification
ImageNet: http://image-net.org/challenges/LSVRC/2016/
CIFAR-10 and CIFAR-100: https://www.cs.toronto.edu/~kriz/cifar.html
MNIST: http://yann.lecun.com/exdb/mnist/
SVHN: http://ufldl.stanford.edu/housenumbers/
CUB-200: http://www.vision.caltech.edu/visipedia/CUB-200.html

Object detection:
ImageNet: http://image-net.org/challenges/LSVRC/2016/
PASCAL VOC: http://host.robots.ox.ac.uk/pascal/VOC/
KITTI: http://www.cvlibs.net/datasets/kitti/
MS-COCO: http://mscoco.org/dataset/

Segmentation:
ImageNet: http://image-net.org/challenges/LSVRC/2016/
PASCAL VOC: http://host.robots.ox.ac.uk/pascal/VOC/
KITTI: http://www.cvlibs.net/datasets/kitti/
MS-COCO: http://mscoco.org/dataset/
Cityscapes: https://www.cityscapes-dataset.com/
PASCAL-Part: http://www.stat.ucla.edu/~xianjie.chen/pascal_part_dataset/pascal_part.html
PASCAL-Context: http://www.cs.stanford.edu/~roozbeh/pascal-context/
CamVid: http://mi.eng.cam.ac.uk/research/projects/VideoRec/CamVid/

Image caption
MS-COCO: http://mscoco.org/dataset/
Flickr 8K: http://nlp.cs.illinois.edu/HockenmaierGroup/Framing_Image_Description/KCCA.html
Flickr 30k: http://shannon.cs.illinois.edu/DenotationGraph/
IAPR TC-12: http://imageclef.org/photodata

Question answering
DAQUAR: http://www.cs.toronto.edu/~mren/imageqa/results/
COCO-QA: http://www.cs.toronto.edu/~mren/imageqa/data/cocoqa/
Visual Genome: https://visualgenome.org/

Saliency:
MIT300: http://saliency.mit.edu/results_mit300.html
CAT2000: http://saliency.mit.edu/results_cat2000.html
MSRA10K: http://mmcheng.net/msra10k/
ECSSD: http://www.cse.cuhk.edu.hk/leojia/projects/hsaliency/dataset.html

Video summarization
SumMe: https://people.ee.ethz.ch/~gyglim/vsum/#benchmark
TVSum: https://github.com/yalesong/tvsum

Dataset references contributed by @hohdiy

@pengli09
Contributor

Chinese cloze-style reading comprehension dataset: https://github.com/ymcui/Chinese-RC-Dataset

@Zrachel
Contributor

Zrachel commented Dec 22, 2016

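Regarding the Chinese image captioning data that @reyoung mentioned above: no Chinese data exists for that task. There is Chinese data for image question answering, though; see http://idl.baidu.com/FM-IQA.html

We also need:
A Chinese speech recognition corpus (THCHS-30: A Free Chinese Speech Corpus looks usable, pending investigation)
A Chinese text corpus (similar to the 1 Billion Word Language Model Benchmark)
Chinese-English translation data (similar to WMT)
Chinese sequence labeling data (similar to CoNLL-2005 & 2012)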

@luotao1
Contributor

luotao1 commented Dec 24, 2016

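@llxxxll @Zrachel's reply already mentions the need for a Chinese-English translation dataset. The WMT French-English translation dataset consists mainly of news text, and its training set contains more than 12 million parallel sentence pairs. Also, in @lcy-seso's experience, it is hard to train a reasonably good Chinese-English translation model with fewer than 1 million parallel sentence pairs.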

@livc
Member

livc commented Jan 11, 2017

THUOCL (Tsinghua University Open Chinese Lexicon) was open-sourced recently; sharing it here for reference.

@pengli09
Contributor

If resources like THUOCL are usable, there are several more at http://thunlp.org/site2/index.php/en/resources

@livc
Member

livc commented Feb 15, 2017

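Found a dataset of classical Chinese poetry.

The most comprehensive database of classical Chinese poetry: nearly 14,000 poets from the Tang and Song dynasties, with close to 55,000 Tang poems plus 260,000 Song poems.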
https://github.com/jackeyGao/chinese-poetry

@JiayiFeng
Collaborator

Closing this inactive issue; please feel free to reopen it.
