KazEmoTTS
⌨️ 😐 😠 🙂 😞 😱 😮 🗣


This repository provides the dataset and a text-to-speech (TTS) model for the paper [KazEmoTTS: A Dataset for Kazakh Emotional Text-to-Speech Synthesis](https://arxiv.org/abs/2404.01033).

Dataset Statistics 📊

Per-emotion statistics, broken down by narrator (F1, M1, M2):

| Emotion | # Recordings | F1 Total (h) | F1 Mean (s) | F1 Min (s) | F1 Max (s) | M1 Total (h) | M1 Mean (s) | M1 Min (s) | M1 Max (s) | M2 Total (h) | M2 Mean (s) | M2 Min (s) | M2 Max (s) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| neutral | 9,385 | 5.85 | 5.03 | 1.03 | 15.51 | 4.54 | 4.77 | 0.84 | 16.18 | 2.30 | 4.69 | 1.02 | 15.81 |
| angry | 9,059 | 5.44 | 4.78 | 1.11 | 14.09 | 4.27 | 4.75 | 0.93 | 17.03 | 2.31 | 4.81 | 1.02 | 15.67 |
| happy | 9,059 | 5.77 | 5.09 | 1.07 | 15.33 | 4.43 | 4.85 | 0.98 | 15.56 | 2.23 | 4.74 | 1.09 | 15.25 |
| sad | 8,980 | 5.60 | 5.04 | 1.11 | 15.21 | 4.62 | 5.13 | 0.72 | 18.00 | 2.65 | 5.52 | 1.16 | 18.16 |
| scared | 9,098 | 5.66 | 4.96 | 1.00 | 15.67 | 4.13 | 4.51 | 0.65 | 16.11 | 2.34 | 4.96 | 1.07 | 14.49 |
| surprised | 9,179 | 5.91 | 5.09 | 1.09 | 14.56 | 4.52 | 4.92 | 0.81 | 17.67 | 2.28 | 4.87 | 1.04 | 15.81 |

Per-narrator totals:

| Narrator | # Recordings | Duration (h) |
|---|---|---|
| F1 | 24,656 | 34.23 |
| M1 | 19,802 | 26.51 |
| M2 | 10,302 | 14.11 |
| Total | 54,760 | 74.85 |
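These figures can be recomputed directly from the released audio. Below is a minimal sketch in Python; the directory layout (WAV files grouped per narrator under the dataset root) is an assumption and may not match the actual KazEmoTTS release:

```python
# Hedged sketch: recompute per-narrator duration statistics.
# ASSUMPTION: WAVs live under <dataset_root>/<narrator>/; adjust the
# glob pattern to the actual KazEmoTTS layout.
import glob
import os
import sys

import soundfile as sf  # pip install soundfile

root = sys.argv[1]  # path to the KazEmoTTS dataset
for narrator in sorted(os.listdir(root)):
    wavs = glob.glob(os.path.join(root, narrator, "**", "*.wav"), recursive=True)
    if not wavs:
        continue
    durs = [sf.info(w).duration for w in wavs]  # durations in seconds
    print(f"{narrator}: {len(durs)} recordings, "
          f"{sum(durs) / 3600:.2f} h total, "
          f"{sum(durs) / len(durs):.2f} s mean, "
          f"{min(durs):.2f} s min, {max(durs):.2f} s max")
```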

Installation 🛠️

First, you need to build the monotonic_align code:

```bash
cd model/monotonic_align; python setup.py build_ext --inplace; cd ../..
```
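To verify the build, you can try importing the compiled module. This is a hypothetical check that assumes the Grad-TTS-style layout of model/monotonic_align exposing maximum_path; adjust the import path if the repository differs:

```bash
# Hypothetical sanity check; run from the repository root.
python -c "from model.monotonic_align import maximum_path; print('monotonic_align OK')"
```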

Note: the code was tested with Python 3.9.13.

Pre-Processing Data for Training 🧹

You need to download the KazEmoTTS dataset and convert it to the format used in filelists/all_spk by running data_preparation.py:

```bash
python data_preparation.py -d <path_to_KazEmoTTS_dataset>
```
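After the script finishes, it can help to sanity-check the generated filelists. A minimal sketch, assuming a pipe-separated filelist at filelists/all_spk/train.txt (both the path and the field layout are assumptions):

```python
# Hypothetical filelist check; the path and pipe-separated layout
# are assumptions based on filelists/all_spk.
from collections import Counter

with open("filelists/all_spk/train.txt", encoding="utf-8") as f:
    rows = [line.rstrip("\n").split("|") for line in f if line.strip()]

print(f"{len(rows)} utterances")
print("fields per line:", Counter(len(r) for r in rows))
print("example row:", rows[0])
```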

Training Stage 🏋️‍♂️

To start training, specify the path to the model configuration (configs/train_grad.json), a directory for checkpoints (typically under logs/train_logs), and the GPU to use:

```bash
export CUDA_VISIBLE_DEVICES=YOUR_GPU_ID
python train_EMA.py -c configs/train_grad.json -m <checkpoint-directory>
```
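For example, a run on GPU 0 might look like this (the run name kazemo_run is a placeholder; whether -m expects a full path or a run name under logs/train_logs depends on the training script):

```bash
export CUDA_VISIBLE_DEVICES=0
python train_EMA.py -c configs/train_grad.json -m logs/train_logs/kazemo_run  # run name is a placeholder
```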

Inference 🧠

Using Pre-Trained Models 🏃

If you intend to use a pre-trained model, download the necessary checkpoints for both the TTS model (based on Grad-TTS) and the HiFi-GAN vocoder.

To conduct inference, follow these steps:

- Create a text file containing the sentences you wish to synthesize, such as filelists/inference_generated.txt.
- Format each line as text|emotion id|speaker id.
- Adjust the path to the HiFi-GAN checkpoint in inference_EMA.py.
- Set the classifier guidance level to 100 using the -g parameter.

```bash
python inference_EMA.py -c <config> -m <checkpoint> -t <number-of-timesteps> -g <guidance-level> -f <path-to-text-file> -r <path-to-save-audios>
```
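For example, assuming a trained checkpoint and the repository's default config (the emotion/speaker ids and the timestep count of 50 below are placeholders; the actual id mapping comes from the data preparation step):

```bash
# Hypothetical input file (format: text|emotion id|speaker id):
#   Сәлеметсіз бе!|0|1
#   Бүгін ауа райы тамаша.|2|1
python inference_EMA.py -c configs/train_grad.json -m <path-to-TTS-checkpoint> \
  -t 50 -g 100 -f filelists/inference_generated.txt -r generated_audio
```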

Synthesized samples 🔈

You can listen to some synthesized samples here.

Citation 🎓

If you incorporate our dataset and/or model into your work, please cite the following paper:

```bibtex
@misc{abilbekov2024kazemotts,
      title={KazEmoTTS: A Dataset for Kazakh Emotional Text-to-Speech Synthesis},
      author={Adal Abilbekov and Saida Mussakhojayeva and Rustem Yeshpanov and Huseyin Atakan Varol},
      year={2024},
      eprint={2404.01033},
      archivePrefix={arXiv},
      primaryClass={eess.AS}
}
```
