India is the second largest English-speaking country in the world with a speaker base of roughly 130 million. Unfortunately, Indian speakers find a very poor representation in existing English ASR benchmarks such as LibriSpeech, Switchboard, Speech Accent Archive, etc. We address this gap by creating Svarah, a benchmark that contains 9.6 hours of transcribed English audio from 117 speakers across 65 districts across 19 states in India, resulting in a diverse range of accents. The collective set of native languages spoken by the speakers covers 19 of the 22 constitutionally recognized languages of India, belonging to 4 different language families. Svarah includes both read speech and spontaneous conversational data, covering a variety of domains such as history, culture, tourism, government, sports, etc. It also contains data corresponding to popular use cases such as ordering groceries, making digital payments, and using government services (e.g., checking pension claims, checking passport status, etc.). The resulting diversity in vocabulary as well as use cases allows a more robust evaluation of ASR systems for real-world applications.
We evaluate 6 open source ASR models and 2 commercial ASR systems on Svarah and show that there is clear scope for improvement on Indian accents. The results obtained are as shown in Table 1.
Datasets | Benchmark |
---|---|
Svarah | link |
- Sample structure of manifest file
Applicable to svarah_manifest.json
& saa_l1_manifest.json
{"audio_filepath": <path to audio file 1>, "duration": <seconds>, "text": <transcript 1>}
{"audio_filepath": <path to audio file 2>, "duration": <seconds>, "text": <transcript 2>}
- Running evaluation scripts
For azure and google cloud evaluations, you will be required to add your key associated with the services offered by each. For others, you can run the following :
python eval_<hf_model>.py --manifest <manifest path>
For processing audio filepaths, kindly change them as per your directory structure in the scripts.
- Meta statistics of speakers
The meta_speaker_stats.csv
file consists of 11 columns which describes some meta statistics of speakers involved in Svarah:
speaker_id
-- unique speaker identifierduration
-- duration of audio recorded (seconds)text
-- transcript of audiogender
-- "Male" / "Female"age-group
-- speaker's age group (18-30, 30-45, 45-60 & 60+ )primary_language
-- speaker's primary languagenative_place_state
-- speaker's native statenative_place_district
-- speaker's native districthighest_qualification
-- speaker's highest education qualificationjob_category
-- speakers's job category (Part Time, Full Time, Other)occupation_domain
-- speaker's domain of occupation (Education and Research, Healthcare [Medical & Pharma], Government, Technology and Services, Information and Media, Financial Services [Banking and Insurance], Transportation and Logistics, Entertainment, Social service, Manufacturing & Retail )
-
Svarah folder tree
Svarah ├── audio │ ├── <filename>.wav │ └── <filename>.txt │ . │ . │ . ├── svarah_manifest.json ├── saa_l1_manifest.json └── meta_speaker_stats.csv
Table 1 depicts WER's of different models on (i) Svarah that contains data from Indian speakers and (ii) SAA_L1, LibriSpeech Clean (Libri) which contain data from native English speakers.
# Params. | Svarah | SAA_L1 | LibriSpeech | |
---|---|---|---|---|
Whisperbase | 74M | 13.6 | 2.9 | 4.2 |
Whispermedium | 769M | 8.3 | 1.7 | 3.1 |
Whisperlarge | 1550M | 7.2 | 1.6 | 2.7 |
Wav2Vec2large | 317M | 24.9 | 3.1 | 1.8 |
HuBERTlarge | 316M | 25.6 | 3.2 | 2.0 |
WavLMlarge | 300M | 33.7 | 9.2 | 3.4 |
Data2Veclarge | 313M | 24.5 | 2.5 | 1.8 |
Conformerlarge | 120M | 14.6 | 1.1 | 2.1 |
AzureUS | - | 20.9 | 24.2 | - |
AzureIN | - | 21.3 | 30.1 | - |
GoogleUS | - | 30.0 | 16.8 | - |
GoogleIN | - | 20.7 | 63.7 | - |
Table 2: Number of hours and Number of tokens in each accent
Accent | # Hours | # Tokens |
---|---|---|
Assamese | 0.26 | 869 |
Bengali | 0.33 | 1024 |
Bodo | 0.63 | 1520 |
Dogri | 0.44 | 1262 |
Gujarati | 0.37 | 1051 |
Hindi | 0.40 | 1068 |
Kannada | 0.71 | 1892 |
Kashmiri | 0.40 | 1310 |
Konkani | 0.54 | 1325 |
Maithili | 0.76 | 1662 |
Malayalam | 0.68 | 1711 |
Marathi | 0.30 | 948 |
Nepali | 1.16 | 2236 |
Odia | 0.61 | 1548 |
Punjabi | 0.27 | 820 |
Sindhi | 0.18 | 536 |
Tamil | 0.44 | 1352 |
Telugu | 0.50 | 1311 |
Urdu | 0.64 | 1814 |
If you benefit from this dataset, kindly cite as follows:
@misc{javed2023svarah,
title={Svarah: Evaluating English ASR Systems on Indian Accents},
author={Tahir Javed and Sakshi Joshi and Vignesh Nagarajan and Sai Sundaresan and Janki Nawale and Abhigyan Raman and Kaushal Bhogale and Pratyush Kumar and Mitesh M. Khapra},
year={2023},
eprint={2305.15760},
archivePrefix={arXiv},
primaryClass={cs.CL}
}