
Pretraining data

zhezhaoa edited this page Aug 25, 2023 · 2 revisions

CLUECorpusSmall

CLUECorpusSmall consists of news, web, wiki, and comment corpora. The original data and a detailed description can be found here.

Corpus Link
CLUECorpusSmall https://share.weiyun.com/sC6PMhxx
CLUECorpusSmall (BERT format) https://share.weiyun.com/9SPPGUOK
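
The "BERT format" variant is not described on this page. The sketch below shows one way such a file could be read, assuming the common BERT pre-training layout of one sentence per line with a blank line between documents. That layout, the function name `iter_documents`, and the sample lines are all assumptions for illustration, so check them against the downloaded file before relying on them.

```python
from typing import Iterable, Iterator, List


def iter_documents(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield documents as lists of sentences.

    Assumed layout (not confirmed by this page): one sentence per line,
    with a blank line separating consecutive documents.
    """
    document: List[str] = []
    for raw in lines:
        sentence = raw.strip()
        if sentence:
            # Non-empty line: another sentence of the current document.
            document.append(sentence)
        elif document:
            # Blank line: the current document is complete.
            yield document
            document = []
    if document:
        # The last document may not be followed by a blank line.
        yield document


# Tiny in-memory sample standing in for the downloaded corpus file.
sample = [
    "doc1 sentence1\n",
    "doc1 sentence2\n",
    "\n",
    "doc2 sentence1\n",
]
documents = list(iter_documents(sample))
print(len(documents))
```

Because the function accepts any iterable of lines, a file handle opened with `open(path, encoding="utf-8")` can be passed in directly, so a large corpus is streamed rather than loaded into memory at once. Here `path` is a placeholder for wherever the corpus was saved.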

News Commentary v13 (ZH-EN)

News Commentary v13 consists of Chinese-English parallel data and can be downloaded from here.

Corpus Link
news-Commentary-v13-en-zh https://share.weiyun.com/PLMxw6ae
news-Commentary-v13-zh-en https://share.weiyun.com/5rMwRhDi
news-Commentary-v13-en-zh_sampled https://share.weiyun.com/1KTxq3Dc

CIFAR100_nolabel

CIFAR100_nolabel consists of 50,000 images that can be used for unsupervised pre-training. It can be downloaded from here.

Corpus Link
CIFAR100_nolabel https://share.weiyun.com/M2tA9P8p