Contributions of more speech datasets are welcome! Open an issue here to suggest new speech datasets; the dataset list in the main branch is updated seasonally.
This is a curated list of open speech datasets for speech-related research (mainly for Automatic Speech Recognition).
Over 110 speech datasets are collected in this repository, and more than 70 datasets can be downloaded directly without further application or registration.
Notice:
- This repository does not list the license of each dataset. In general, it is fine to use these datasets for research purposes. Please make sure the license is suitable before using a dataset for commercial purposes.
- Some small-scale speech corpora are omitted for brevity.
Dataset Acquisition | Sup/Unsup | All Languages (Hours) | Mandarin (Hours) | English (Hours) |
---|---|---|---|---|
download directly | supervised | 199k + | 2110 + | 34k + |
download directly | unsupervised | 530k + | 1360 + | 68k + |
download directly | total | 729k + | 3470 + | 102k + |
need application | supervised | 53k + | 16740 + | 50k + |
need application | unsupervised | 60k + | 12400 + | 57k + |
need application | total | 113k + | 29140 + | 107k + |
total | supervised | 252k + | 18850 + | 84k + |
total | unsupervised | 590k + | 13760 + | 125k + |
total | total | 842k + | 32610 + | 209k + |
- Mandarin here includes Mandarin-English code-switching (CS) corpora.
- Sup means a supervised speech corpus with high-quality transcription.
- Unsup means an unsupervised or weakly supervised speech corpus.
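
The totals in the overview table can be checked with a few lines of arithmetic. The sketch below is only an illustration: the `overview` dictionary and the `total` helper are hypothetical names, the figures are copied from the four non-total rows above, "k" is expanded to thousands, and the trailing "+" (a lower bound) is dropped.

```python
# Hours per row, in the order (all languages, Mandarin, English).
# Values are copied from the overview table; "k" means thousands.
overview = {
    ("download directly", "supervised"): (199_000, 2_110, 34_000),
    ("download directly", "unsupervised"): (530_000, 1_360, 68_000),
    ("need application", "supervised"): (53_000, 16_740, 50_000),
    ("need application", "unsupervised"): (60_000, 12_400, 57_000),
}


def total(acquisition=None, kind=None):
    """Sum the rows matching the given acquisition and/or supervision kind.

    Passing None for a filter means "do not filter on this column".
    """
    result = (0, 0, 0)
    for (acq, k), hours in overview.items():
        if acquisition in (None, acq) and kind in (None, k):
            result = tuple(x + y for x, y in zip(result, hours))
    return result


print(total(acquisition="download directly"))  # (729000, 3470, 102000)
print(total(kind="supervised"))                # (252000, 18850, 84000)
print(total())                                 # (842000, 32610, 209000)
```

Because every figure is a lower bound, these sums are lower bounds as well.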

Datasets that can be downloaded directly:

id | Name | Language | Type/Domain | Paper Link | Data Link | Size (Hours) |
---|---|---|---|---|---|---|
1 | Librispeech | English | Reading | [paper] | [dataset] | 960 |
2 | TED_LIUM v1 | English | Talks | [paper] | [dataset] | 118 |
3 | TED_LIUM v2 | English | Talks | [paper] | [dataset] | 207 |
4 | TED_LIUM v3 | English | Talks | [paper] | [dataset] | 452 |
5 | MLS | Multilingual | Reading | [paper] | [dataset] | 50k + |
6 | thchs30 | Mandarin | Reading | [paper] | [dataset] | 35 |
7 | ST-CMDS | Mandarin | Commands | - | [dataset] | 100 |
8 | aishell | Mandarin | Recording | [paper] | [dataset] | 178 |
9 | aishell-3 | Mandarin | Recording | [paper] | [dataset] | 85 |
10 | aishell-4 | Mandarin | Meeting | [paper] | [dataset] | 120 |
11 | aishell-eval | Mandarin | Misc | - | [dataset] | 80 + |
12 | Primewords | Mandarin | Recording | - | [dataset] | 100 |
13 | aidatatang_200zh | Mandarin | Recording | - | [dataset] | 200 |
14 | MagicData | Mandarin | Recording | - | [dataset] | 755 |
15 | MagicData-RAMC | Mandarin | Conversational | [paper] | [dataset] | 180 |
16 | Heavy Accent Corpus | Mandarin | Conversational | - | [dataset] | 58 + |
17 | AliMeeting | Mandarin | Meeting | [paper] | [dataset] | 120 |
18 | CN-Celeb | Mandarin | Misc | [paper] | [dataset] | unsup(274) |
19 | CN-Celeb2 | Mandarin | Misc | [paper] | [dataset] | unsup(1090) |
20 | The People's Speech | English | Misc | [paper] | [dataset] | 30k + |
21 | Multilingual TEDx | Multilingual | Talks | [paper] | [dataset] | 760 + |
22 | VoxPopuli | Multilingual | Misc | [paper] | [dataset] | sup(1.8k) unsup(400k) |
23 | Libri-Light | English | Reading | [paper] | [dataset] | unsup(60k) |
24 | Common Voice (Multilingual) | Multilingual | Recording | [paper] | [dataset] | sup(15k) unsup(5k) |
25 | Common Voice (English) | English | Recording | [paper] | [dataset] | sup(2200) unsup(700) |
26 | JTubeSpeech | Japanese | Misc | [paper] | [dataset] | 1300 |
27 | ai4bharat NPTEL2020 | English(Indian) | Lectures | - | [dataset] | weaksup(15.7k) |
28 | open_stt | Russian | Misc | - | [dataset] | 20k + |
29 | ASCEND | Mandarin-English CS | Conversational | [paper] | [dataset] | 10 + |
30 | Crowd-Sourced Speech | Multilingual | Recording | [paper] | [dataset] | 1200 + |
31 | Spoken Wikipedia | Multilingual | Recording | [paper] | [dataset] | 1000 + |
32 | MuST-C | Multilingual | Talks | [paper] | [dataset] | 6000 + |
33 | M-AILABS | Multilingual | Reading | - | [dataset] | 1000 |
34 | CMU Wilderness | Multilingual | Misc | [paper] | [dataset] | unsup(14k) |
35 | Gram_Vaani | Hindi | Recording | [paper] [code] | [dataset] | sup(100) unsup(1k) |
36 | VoxLingua107 | Multilingual | Misc | [paper] | [dataset] | unsup(6600 +) |
37 | Kazakh Corpus | Kazakh | Recording | [paper] [code] | [dataset] | 335 |
38 | Voxforge | English | Recording | - | [dataset] | 130 |
39 | Tatoeba | English | Recording | - | [dataset] | 200 |
40 | IndicWav2Vec | Multilingual | Misc | [paper] | [dataset] | unsup(17k +) |
41 | VoxCeleb | English | Misc | [paper] | [dataset] | unsup(352) |
42 | VoxCeleb2 | English | Misc | [paper] | [dataset] | unsup(2442) |
43 | RuLibrispeech | Russian | Reading | - | [dataset] | 98 |
44 | MediaSpeech | Multilingual | Misc | [paper] | [dataset] | 40 |
45 | MUCS 2021 task1 | Multilingual | Misc | - | [dataset] | 300 |
46 | MUCS 2021 task2 | Multilingual | Misc | - | [dataset] | 150 |
47 | nicolingua-west-african | Multilingual | Misc | [paper] | [dataset] | 140 + |
48 | Samromur 21.05 | Icelandic | Misc | [code] | [dataset] [dataset] [dataset] | 145 |
49 | Puebla-Nahuatl | Puebla-Nahuatl | Misc | [paper] | [dataset] | 150 + |
50 | Golos | Russian | Misc | [paper] | [dataset] | 1240 |
51 | ParlaSpeech-HR | Croatian | Parliament | [paper] | [dataset] | 1816 |
52 | Lyon Corpus | French | Recording | [paper] | [dataset] | 185 |
53 | Providence Corpus | English | Recording | [paper] | [dataset] | 364 |
54 | CLARIN Spoken Corpora | Czech | Recording | - | [dataset] | 1120 + |
55 | Czech Parliament Plenary | Czech | Recording | - | [dataset] | 444 |
56 | (YouTube) Regional American Corpus | English (Accented) | Misc | [paper] | [dataset] | 29k + |
57 | NISP Dataset | Multilingual | Recording | [paper] | [dataset] | 56 + |
58 | Regional African American | English (Accented) | Recording | [paper] | [dataset] | 130 + |
59 | Indonesian Unsup | Indonesian | Misc | - | [dataset] | unsup(3000 +) |
60 | Librivox-Spanish | Spanish | Recording | - | [dataset] | 120 |
61 | AVSpeech | English | Audio-Visual | [paper] | [dataset] | unsup(4700) |
62 | CMLR | Mandarin | Audio-Visual | [paper] | [dataset] | 100 + |
63 | Speech Accent Archive | English | Accented | [paper] | [dataset] | TBC |
64 | BibleTTS | Multilingual | TTS | [paper] | [dataset] | 86 |
65 | NST-Norwegian | Norwegian | Recording | - | [dataset] | 540 |
66 | NST-Danish | Danish | Recording | - | [dataset] | 500 + |
67 | NST-Swedish | Swedish | Recording | - | [dataset] | 300 + |
68 | NPSC | Norwegian | Parliament | [paper] | [dataset] | 140 |
69 | CI-AVSR | Cantonese | Audio-Visual | [paper] | [dataset] | 8 + |
70 | Aalto Finnish Parliament | Finnish | Parliament | [paper] | [dataset] | 3100 + |
71 | UserLibri | English | Reading | [paper] | [dataset] | - |
72 | Ukrainian Speech | Ukrainian | Misc | - | [dataset] | 1300 + |
73 | UCLA-ASR-corpus | Multilingual | Misc | - | [dataset] | unsup(15k) sup(9k) |
74 | ReazonSpeech | Japanese | Misc | [paper] [code] | [dataset] | 15k |
75 | Bundestag | German | Debate | [paper] | [dataset] | sup(610) unsup(1038) |

Datasets that require an application before download:

id | Name | Language | Type/Domain | Paper Link | Data Link | Size (Hours) |
---|---|---|---|---|---|---|
1 | Fisher | English | Conversational | [paper] | [dataset] | 2000 |
2 | WenetSpeech | Mandarin | Misc | [paper] | [dataset] | sup(10k) weaksup(2.4k) unsup(10k) |
3 | aishell-2 | Mandarin | Recording | [paper] | [dataset] | 1000 |
4 | aidatatang_1505zh | Mandarin | Recording | - | [dataset] | 1505 |
5 | SLT 2021 CSRC | Mandarin | Misc | [paper] | [dataset] | 400 |
6 | GigaSpeech | English | Misc | [paper] | [dataset] | sup(10k) unsup(23k) |
7 | SPGISpeech | English | Misc | [paper] | [dataset] | 5000 |
8 | AESRC 2020 | English (accented) | Misc | [paper] | [dataset] | 160 |
9 | LaboroTVSpeech | Japanese | Misc | [paper] | [dataset] | 2000 + |
10 | TAL_CSASR | Mandarin-English CS | Lectures | - | [dataset] | 587 |
11 | ASRU 2019 ASR | Mandarin-English CS | Reading | - | [dataset] | 700 + |
12 | SEAME | Mandarin-English CS | Recording | [paper] | [dataset] | 196 |
13 | Fearless Steps | English | Misc | - | [dataset] | unsup(19k) |
14 | FTSpeech | Danish | Meeting | [paper] | [dataset] | 1800 + |
15 | KeSpeech | Mandarin | Recording | [paper] | [dataset] | 1542 |
16 | KsponSpeech | Korean | Conversational | [paper] | [dataset] | 969 |
17 | RTVE database | Spanish | TV | [paper] | [dataset] | 800 + |
18 | DiDiSpeech | Mandarin | Recording | [paper] | [dataset] | 800 |
19 | Babel | Multilingual | Telephone | [paper] | [dataset] | 1000 + |
20 | National Speech Corpus | English (Singapore) | Misc | [paper] | [dataset] | 3000 + |
21 | MyST Children's Speech | English | Recording | - | [dataset] | 393 |
22 | L2-ARCTIC | L2 English | Recording | [paper] | [dataset] | 20 + |
23 | JSpeech | Multilingual | Recording | [paper] | [dataset] | 1332 + |
24 | LRS2-BBC | English | Audio-Visual | [paper] | [dataset] | 220 + |
25 | LRS3-TED | English | Audio-Visual | [paper] | [dataset] | 470 + |
26 | LRS3-Lang | Multilingual | Audio-Visual | - | [dataset] | 1300 + |
27 | QASR | Arabic | Dialects | [paper] | [dataset] | 2000 + |
28 | ADI (MGB-5) | Arabic | Dialects | [paper] | [dataset] | unsup(3000 +) |
29 | MGB-2 | Arabic | TV | [paper] | [dataset] | 1200 + |
30 | 3MASSIV | Multilingual | Audio-Visual | [paper] | [dataset] | sup(310) unsup(600) |
31 | MDCC | Cantonese | Misc | [paper] | [dataset] | 73 + |
32 | Lahjoita Puhetta | Finnish | Misc | [paper] | [dataset] | sup(1600) unsup(2000) |
33 | SDS-200 | Swiss German | Dialects | [paper] | [dataset] | 200 |
34 | Modality Corpus | Multilingual | Audio-Visual | [paper] | [dataset] | 30 + |
35 | Hindi-Tamil-English | Multilingual | Misc | - | [dataset] | 690 |
36 | English-Vietnamese Corpus | English, Vietnamese | Misc | [paper] | [dataset] | 500 + |
37 | OLKAVS | Korean | Audio-Visual | [paper] [code] | [dataset] | 1150 |