
a curated list of speech datasets (110+ datasets, 75+ easy to download)


RevoSpeechTech/speech-datasets-collection


Speech Datasets Collection


Contributions of more speech datasets are welcome! You can open an issue here to suggest new speech datasets, and the list of datasets in the main branch will be updated seasonally.

This is a curated list of open speech datasets for speech-related research (mainly for Automatic Speech Recognition).

Over 110 speech datasets are collected in this repository, and 75 of them (section 2a) can be downloaded directly, without application or registration.

Notice:

  1. This repository does not list the license of each dataset. Generally, these datasets can be used for research purposes; make sure the license permits it before any commercial use.
  2. Some small-scale speech corpora are omitted for brevity.

1. Data Overview

| Dataset Acquisition | Sup/Unsup | All Languages (Hours) | Mandarin (Hours) | English (Hours) |
| --- | --- | --- | --- | --- |
| download directly | supervised | 199k + | 2110 + | 34k + |
| download directly | unsupervised | 530k + | 1360 + | 68k + |
| download directly | total | 729k + | 3470 + | 102k + |
| need application | supervised | 53k + | 16740 + | 50k + |
| need application | unsupervised | 60k + | 12400 + | 57k + |
| need application | total | 113k + | 29140 + | 107k + |
| total | supervised | 252k + | 18850 + | 84k + |
| total | unsupervised | 590k + | 13760 + | 125k + |
| total | total | 842k + | 32610 + | 209k + |

  • "Mandarin" here includes Mandarin-English code-switching (CS) corpora.
  • "Sup" means a supervised speech corpus with high-quality transcription.
  • "Unsup" means an unsupervised or weakly supervised speech corpus.
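
The "total" rows in the overview table are sums of the four base rows (direct or application, supervised or unsupervised). As a small sanity check, the sketch below re-derives them in Python. The dictionary layout and the helper name `total` are illustrative assumptions, not part of this repository; "k" is read as thousands and the trailing "+" (lower bound) is dropped.

```python
# Hours copied from the overview table ("k" = thousand; "+" lower bounds dropped).
OVERVIEW = {
    ("download directly", "supervised"): {"all": 199_000, "mandarin": 2_110, "english": 34_000},
    ("download directly", "unsupervised"): {"all": 530_000, "mandarin": 1_360, "english": 68_000},
    ("need application", "supervised"): {"all": 53_000, "mandarin": 16_740, "english": 50_000},
    ("need application", "unsupervised"): {"all": 60_000, "mandarin": 12_400, "english": 57_000},
}


def total(acquisition=None, supervision=None):
    """Sum hours over the base rows matching the filters (None matches every row)."""
    result = {"all": 0, "mandarin": 0, "english": 0}
    for (acq, sup), hours in OVERVIEW.items():
        if acquisition in (None, acq) and supervision in (None, sup):
            for column, value in hours.items():
                result[column] += value
    return result


# Hypothetical usage: reproduce the "download directly / total" and "total / total" rows.
print(total(acquisition="download directly"))
print(total())
```

Because every figure is a lower bound, the sums are lower bounds as well.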

2. List of ASR corpora

a. Datasets that can be downloaded directly

| id | Name | Language | Type/Domain | Paper Link | Data Link | Size (Hours) |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | Librispeech | English | Reading | [paper] | [dataset] | 960 |
| 2 | TED_LIUM v1 | English | Talks | [paper] | [dataset] | 118 |
| 3 | TED_LIUM v2 | English | Talks | [paper] | [dataset] | 207 |
| 4 | TED_LIUM v3 | English | Talks | [paper] | [dataset] | 452 |
| 5 | MLS | Multilingual | Reading | [paper] | [dataset] | 50k + |
| 6 | thchs30 | Mandarin | Reading | [paper] | [dataset] | 35 |
| 7 | ST-CMDS | Mandarin | Commands | - | [dataset] | 100 |
| 8 | aishell | Mandarin | Recording | [paper] | [dataset] | 178 |
| 9 | aishell-3 | Mandarin | Recording | [paper] | [dataset] | 85 |
| 10 | aishell-4 | Mandarin | Meeting | [paper] | [dataset] | 120 |
| 11 | aishell-eval | Mandarin | Misc | - | [dataset] | 80 + |
| 12 | Primewords | Mandarin | Recording | - | [dataset] | 100 |
| 13 | aidatatang_200zh | Mandarin | Recording | - | [dataset] | 200 |
| 14 | MagicData | Mandarin | Recording | - | [dataset] | 755 |
| 15 | MagicData-RAMC | Mandarin | Conversational | [paper] | [dataset] | 180 |
| 16 | Heavy Accent Corpus | Mandarin | Conversational | - | [dataset] | 58 + |
| 17 | AliMeeting | Mandarin | Meeting | [paper] | [dataset] | 120 |
| 18 | CN-Celeb | Mandarin | Misc | [paper] | [dataset] | unsup(274) |
| 19 | CN-Celeb2 | Mandarin | Misc | [paper] | [dataset] | unsup(1090) |
| 20 | The People's Speech | English | Misc | [paper] | [dataset] | 30k + |
| 21 | Multilingual TEDx | Multilingual | Talks | [paper] | [dataset] | 760 + |
| 22 | VoxPopuli | Multilingual | Misc | [paper] | [dataset] | sup(1.8k), unsup(400k) |
| 23 | Libri-Light | English | Reading | [paper] | [dataset] | unsup(60k) |
| 24 | Common Voice (Multilingual) | Multilingual | Recording | [paper] | [dataset] | sup(15k), unsup(5k) |
| 25 | Common Voice (English) | English | Recording | [paper] | [dataset] | sup(2200), unsup(700) |
| 26 | JTubeSpeech | Japanese | Misc | [paper] | [dataset] | 1300 |
| 27 | ai4bharat NPTEL2020 | English (Indian) | Lectures | - | [dataset] | weaksup(15.7k) |
| 28 | open_stt | Russian | Misc | - | [dataset] | 20k + |
| 29 | ASCEND | Mandarin-English CS | Conversational | [paper] | [dataset] | 10 + |
| 30 | Crowd-Sourced Speech | Multilingual | Recording | [paper] | [dataset] | 1200 + |
| 31 | Spoken Wikipedia | Multilingual | Recording | [paper] | [dataset] | 1000 + |
| 32 | MuST-C | Multilingual | Talks | [paper] | [dataset] | 6000 + |
| 33 | M-AILABS | Multilingual | Reading | - | [dataset] | 1000 |
| 34 | CMU Wilderness | Multilingual | Misc | [paper] | [dataset] | unsup(14k) |
| 35 | Gram_Vaani | Hindi | Recording | [paper] [code] | [dataset] | sup(100), unsup(1k) |
| 36 | VoxLingua107 | Multilingual | Misc | [paper] | [dataset] | unsup(6600 +) |
| 37 | Kazakh Corpus | Kazakh | Recording | [paper] [code] | [dataset] | 335 |
| 38 | Voxforge | English | Recording | - | [dataset] | 130 |
| 39 | Tatoeba | English | Recording | - | [dataset] | 200 |
| 40 | IndicWav2Vec | Multilingual | Misc | [paper] | [dataset] | unsup(17k +) |
| 41 | VoxCeleb | English | Misc | [paper] | [dataset] | unsup(352) |
| 42 | VoxCeleb2 | English | Misc | [paper] | [dataset] | unsup(2442) |
| 43 | RuLibrispeech | Russian | Reading | - | [dataset] | 98 |
| 44 | MediaSpeech | Multilingual | Misc | [paper] | [dataset] | 40 |
| 45 | MUCS 2021 task1 | Multilingual | Misc | - | [dataset] | 300 |
| 46 | MUCS 2021 task2 | Multilingual | Misc | - | [dataset] | 150 |
| 47 | nicolingua-west-african | Multilingual | Misc | [paper] | [dataset] | 140 + |
| 48 | Samromur 21.05 | Icelandic | Misc | [code] | [dataset] [dataset] [dataset] | 145 |
| 49 | Puebla-Nahuatl | Puebla-Nahuatl | Misc | [paper] | [dataset] | 150 + |
| 50 | Golos | Russian | Misc | [paper] | [dataset] | 1240 |
| 51 | ParlaSpeech-HR | Croatian | Parliament | [paper] | [dataset] | 1816 |
| 52 | Lyon Corpus | French | Recording | [paper] | [dataset] | 185 |
| 53 | Providence Corpus | English | Recording | [paper] | [dataset] | 364 |
| 54 | CLARIN Spoken Corpora | Czech | Recording | - | [dataset] | 1120 + |
| 55 | Czech Parliament Plenary | Czech | Recording | - | [dataset] | 444 |
| 56 | (Youtube) Regional American Corpus | English (Accented) | Misc | [paper] | [dataset] | 29k + |
| 57 | NISP Dataset | Multilingual | Recording | [paper] | [dataset] | 56 + |
| 58 | Regional African American | English (Accented) | Recording | [paper] | [dataset] | 130 + |
| 59 | Indonesian Unsup | Indonesian | Misc | - | [dataset] | unsup(3000 +) |
| 60 | Librivox-Spanish | Spanish | Recording | - | [dataset] | 120 |
| 61 | AVSpeech | English | Audio-Visual | [paper] | [dataset] | unsup(4700) |
| 62 | CMLR | Mandarin | Audio-Visual | [paper] | [dataset] | 100 + |
| 63 | Speech Accent Archive | English | Accented | [paper] | [dataset] | TBC |
| 64 | BibleTTS | Multilingual | TTS | [paper] | [dataset] | 86 |
| 65 | NST-Norwegian | Norwegian | Recording | - | [dataset] | 540 |
| 66 | NST-Danish | Danish | Recording | - | [dataset] | 500 + |
| 67 | NST-Swedish | Swedish | Recording | - | [dataset] | 300 + |
| 68 | NPSC | Norwegian | Parliament | [paper] | [dataset] | 140 |
| 69 | CI-AVSR | Cantonese | Audio-Visual | [paper] | [dataset] | 8 + |
| 70 | Aalto Finnish Parliament | Finnish | Parliament | [paper] | [dataset] | 3100 + |
| 71 | UserLibri | English | Reading | [paper] | [dataset] | - |
| 72 | Ukrainian Speech | Ukrainian | Misc | - | [dataset] | 1300 + |
| 73 | UCLA-ASR-corpus | Multilingual | Misc | - | [dataset] | unsup(15k), sup(9k) |
| 74 | ReazonSpeech | Japanese | Misc | [paper] [code] | [dataset] | 15k |
| 75 | Bundestag | German | Debate | [paper] | [dataset] | sup(610), unsup(1038) |
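
The Size (Hours) column mixes several notations: plain numbers, a "k" suffix for thousands, a trailing "+" for lower bounds, and `sup(...)`, `unsup(...)` or `weaksup(...)` labels. Anyone aggregating the list may need to normalize these cells first. The following is a minimal sketch, assuming a hypothetical helper `parse_size` (not part of this repository) that files values without a label under the key `"unlabeled"` and ignores the "+" marker.

```python
import re

# An optional label, an optional "(", a number, and an optional "k" suffix.
_TOKEN = re.compile(r"(weaksup|unsup|sup)?\s*\(?\s*(\d+(?:\.\d+)?)\s*(k?)")


def parse_size(cell):
    """Parse a Size (Hours) cell into {label: hours}; cells such as "TBC" or "-" give {}."""
    hours = {}
    for label, number, unit in _TOKEN.findall(cell):
        scale = 1000 if unit == "k" else 1
        key = label or "unlabeled"
        # round() turns values such as 1.8 * 1000 into exact integers.
        hours[key] = hours.get(key, 0) + round(float(number) * scale)
    return hours


# Hypothetical usage on cells copied from the table above.
print(parse_size("sup(1.8k) unsup(400k)"))  # {'sup': 1800, 'unsup': 400000}
print(parse_size("50k +"))                  # {'unlabeled': 50000}
```

Since the "+" is dropped, any totals built on these values remain lower bounds.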

b. Datasets that can be downloaded after application

| id | Name | Language | Type/Domain | Paper Link | Data Link | Size (Hours) |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | Fisher | English | Conversational | [paper] | [dataset] | 2000 |
| 2 | WenetSpeech | Mandarin | Misc | [paper] | [dataset] | sup(10k), weaksup(2.4k), unsup(10k) |
| 3 | aishell-2 | Mandarin | Recording | [paper] | [dataset] | 1000 |
| 4 | aidatatang_1505zh | Mandarin | Recording | - | [dataset] | 1505 |
| 5 | SLT 2021 CSRC | Mandarin | Misc | [paper] | [dataset] | 400 |
| 6 | GigaSpeech | English | Misc | [paper] | [dataset] | sup(10k), unsup(23k) |
| 7 | SPGISpeech | English | Misc | [paper] | [dataset] | 5000 |
| 8 | AESRC 2020 | English (Accented) | Misc | [paper] | [dataset] | 160 |
| 9 | LaboroTVSpeech | Japanese | Misc | [paper] | [dataset] | 2000 + |
| 10 | TAL_CSASR | Mandarin-English CS | Lectures | - | [dataset] | 587 |
| 11 | ASRU 2019 ASR | Mandarin-English CS | Reading | - | [dataset] | 700 + |
| 12 | SEAME | Mandarin-English CS | Recording | [paper] | [dataset] | 196 |
| 13 | Fearless Steps | English | Misc | - | [dataset] | unsup(19k) |
| 14 | FTSpeech | Danish | Meeting | [paper] | [dataset] | 1800 + |
| 15 | KeSpeech | Mandarin | Recording | [paper] | [dataset] | 1542 |
| 16 | KsponSpeech | Korean | Conversational | [paper] | [dataset] | 969 |
| 17 | RTVE database | Spanish | TV | [paper] | [dataset] | 800 + |
| 18 | DiDiSpeech | Mandarin | Recording | [paper] | [dataset] | 800 |
| 19 | Babel | Multilingual | Telephone | [paper] | [dataset] | 1000 + |
| 20 | National Speech Corpus | English (Singapore) | Misc | [paper] | [dataset] | 3000 + |
| 21 | MyST Children's Speech | English | Recording | - | [dataset] | 393 |
| 22 | L2-ARCTIC | L2 English | Recording | [paper] | [dataset] | 20 + |
| 23 | JSpeech | Multilingual | Recording | [paper] | [dataset] | 1332 + |
| 24 | LRS2-BBC | English | Audio-Visual | [paper] | [dataset] | 220 + |
| 25 | LRS3-TED | English | Audio-Visual | [paper] | [dataset] | 470 + |
| 26 | LRS3-Lang | Multilingual | Audio-Visual | - | [dataset] | 1300 + |
| 27 | QASR | Arabic | Dialects | [paper] | [dataset] | 2000 + |
| 28 | ADI (MGB-5) | Arabic | Dialects | [paper] | [dataset] | unsup(3000 +) |
| 29 | MGB-2 | Arabic | TV | [paper] | [dataset] | 1200 + |
| 30 | 3MASSIV | Multilingual | Audio-Visual | [paper] | [dataset] | sup(310), unsup(600) |
| 31 | MDCC | Cantonese | Misc | [paper] | [dataset] | 73 + |
| 32 | Lahjoita Puhetta | Finnish | Misc | [paper] | [dataset] | sup(1600), unsup(2000) |
| 33 | SDS-200 | Swiss German | Dialects | [paper] | [dataset] | 200 |
| 34 | Modality Corpus | Multilingual | Audio-Visual | [paper] | [dataset] | 30 + |
| 35 | Hindi-Tamil-English | Multilingual | Misc | - | [dataset] | 690 |
| 36 | English-Vietnamese Corpus | English, Vietnamese | Misc | [paper] | [dataset] | 500 + |
| 37 | OLKAVS | Korean | Audio-Visual | [paper] [code] | [dataset] | 1150 |

