This repository contains a Persian-language adaptation of Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis (SV2TTS). The core implementation is based on this repository, modified to work with Persian text and phoneme data.
Organize your data as follows:
dataset/persian_data/
    train_data/
        speaker1/book-1/
            sample1.txt
            sample1.wav
            ...
        ...
    test_data/
        ...
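
Before preprocessing, it can help to confirm that every audio file has a matching transcript. The following is a minimal sketch, not part of this repository: the helper name `find_unpaired_samples` is hypothetical, and it assumes only what the layout above shows, namely that each sample is a `.wav` and a `.txt` sharing the same base name in the same folder.

```python
from pathlib import Path


def find_unpaired_samples(split_dir):
    """Return (wavs_without_txt, txts_without_wav) found anywhere under split_dir.

    Hypothetical helper: assumes a sample is a .wav/.txt pair with the same
    base name in the same folder, as in the layout shown above.
    """
    split_dir = Path(split_dir)
    # Strip the extension so that sample1.wav and sample1.txt map to the same key.
    wavs = {p.with_suffix("") for p in split_dir.rglob("*.wav")}
    txts = {p.with_suffix("") for p in split_dir.rglob("*.txt")}
    missing_txt = sorted(str(p) + ".wav" for p in wavs - txts)
    missing_wav = sorted(str(p) + ".txt" for p in txts - wavs)
    return missing_txt, missing_wav
```

Calling `find_unpaired_samples("dataset/persian_data/train_data")` returns two lists: audio files with no transcript and transcripts with no audio. Both should be empty before you run the preprocessing steps.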
- Audio Preprocessing
python synthesizer_preprocess_audio.py dataset --datasets_name persian_data --subfolders train_data --no_alignments
- Embedding Preprocessing
python synthesizer_preprocess_embeds.py dataset/SV2TTS/synthesizer
To begin training the synthesizer model:
python synthesizer_train.py my_run dataset/SV2TTS/synthesizer
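
The two preprocessing steps and the training step above can also be chained from a single script. This is a rough sketch rather than something shipped with the repository: `build_pipeline_commands` and `run_pipeline` are hypothetical names, and the script names, arguments, and flags are copied from the commands above with no additional options assumed.

```python
import subprocess
import sys


def build_pipeline_commands(dataset_root="dataset", datasets_name="persian_data",
                            subfolders="train_data", run_id="my_run"):
    """Build the three commands from this README as argument lists."""
    # Path convention taken from the commands above.
    synthesizer_dir = f"{dataset_root}/SV2TTS/synthesizer"
    return [
        [sys.executable, "synthesizer_preprocess_audio.py", dataset_root,
         "--datasets_name", datasets_name,
         "--subfolders", subfolders,
         "--no_alignments"],
        [sys.executable, "synthesizer_preprocess_embeds.py", synthesizer_dir],
        [sys.executable, "synthesizer_train.py", run_id, synthesizer_dir],
    ]


def run_pipeline(commands):
    """Run each command in order and stop at the first failure."""
    for cmd in commands:
        print("Running:", " ".join(cmd))
        subprocess.run(cmd, check=True)  # raises CalledProcessError on a non-zero exit
```

Run `run_pipeline(build_pipeline_commands())` from the repository root. Because `check=True` aborts on the first failing step, training never starts on partially preprocessed data.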
To generate a wav file, place all trained models in the saved_models/final_models directory. If you haven’t trained the speaker encoder or vocoder models, you can use pretrained models from saved_models/default.
python inference.py --vocoder "WavRNN" --text "یک نمونه از خروجی" --ref_wav_path "/path/to/sample/reference.wav" --test_name "test1"
WavRNN is an older vocoder. To use HiFiGAN instead, you must first download a pretrained model, which was trained on English speech.
- Install Parallel WaveGAN
pip install parallel_wavegan
- Download Pretrained HiFiGAN Model
from parallel_wavegan.utils import download_pretrained_model
download_pretrained_model("vctk_hifigan.v1", "saved_models/final_models/vocoder_HiFiGAN")
- Run Inference with HiFiGAN
python inference.py --vocoder "HiFiGAN" --text "یک نمونه از خروجی" --ref_wav_path "/path/to/sample/reference.wav" --test_name "test1"
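
To synthesize several sentences in one go, the inference command can be assembled in Python and passed to `subprocess` as an argument list, which avoids shell-quoting problems with Persian text. This is a sketch under stated assumptions: `build_inference_command` is a hypothetical helper, it uses only the flags shown in the commands above, and it accepts only the two vocoder names that appear in this README, since other values are not documented here.

```python
import sys

# Only the vocoder names that appear in this README; others are not assumed to work.
SUPPORTED_VOCODERS = ("WavRNN", "HiFiGAN")


def build_inference_command(text, ref_wav_path, test_name, vocoder="HiFiGAN"):
    """Build the inference.py call as an argument list (no shell quoting needed)."""
    if vocoder not in SUPPORTED_VOCODERS:
        raise ValueError(f"Unsupported vocoder: {vocoder!r}")
    return [
        sys.executable, "inference.py",
        "--vocoder", vocoder,
        "--text", text,
        "--ref_wav_path", ref_wav_path,
        "--test_name", test_name,
    ]
```

For example, looping over a list of sentences and calling `subprocess.run(build_inference_command(sentence, "/path/to/sample/reference.wav", f"test{i}"), check=True)` for each one produces one output per `--test_name`.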
Check out some audio samples from the trained model in this directory.