Skip to content
This repository has been archived by the owner on May 3, 2023. It is now read-only.

paguseva/tts-homework

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

60 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

TTS project

This is an implementation of Fast Speech paper.

Installation

The script downloads the necessary resources.

./setup.sh

Model checkpoint can be downloaded with the following command:

gdown --id 1wk7amOThMnfZtoMEKELT399rANOy1ZnN -O best_checkpoint.pth

Results

IMPORTANT: I USE PRECOMPUTED ALIGNMENTS

Here are samples of the synthesized speech:

  • A defibrillator is a device that gives a high energy electric shock to the heart of someone who is in cardiac arrest

Predicted audio

  • Massachusetts Institute of Technology may be best known for its math, science and engineering education

Predicted audio

  • Wasserstein distance or Kantorovich Rubinstein metric is a distance function defined between probability distributions on a given metric space

Predicted audio

Training

The logs can be found in W&B.

The final model trained for 40 epochs with this config. Training took 11 hours and 51 minutes.

Experiments

Aligner

The original aligner used durations that were extracted from wav2vec. Since this aligner computed durations in waveform there were problems with rounding during duration modeling.

I trained model with this aligner for 17 hours in total. The results were not great, but the sentences are mostly legible.

Example:

The sentence is The Chronicles of Newgate, Volume two. By Arthur Griffiths. Section four: Newgate down to eighteen eighteen.

Original audio Predicted audio

The spectrograms are oversmoothed, but you can see some faint patterns in low frequencies which is not bad (upper - original, lower - predicted).

Original spectrogram Predicted spectrogram

Another set of alignments wes taken from an open-source implementation. These alignments have the exact same shape as our encoded texts and they were extracted from Tacotron. There were no rounding problems because the alignments were computed in melspectrogram.

Using precomputed alignments greatly boosted the quality. The convergence improved, too. My model was able to reach the same quality as the previous one in under 1 hour.

The examples of audio are provided in the section Results.

Abbreviations

I was surprised to learn that torchaudio's pipelines did not do the expansion. The dataset contains a lot of abbreviations hence the lack of expansion can really affect the loss.

Another challenge were non-ASCII symbols in the dataset. Dataset's webpage mentions non-ASCII symbols, however there is no information about which symbols they are. Turns out there is a variety of diacritics and umlauts because the dataset contains quotes from languages other than English. I replace such symbols with their respective ASCII version.

There are also some other symbols that tokenizer does not support. I simply delete such characters.

Hyperparameters

I took model hyperparameters from FastSpeech paper and I did not change them at all. I mainly experimented with learning rate and various schedulers.

Model achieved good results with OneCycleLR and with WarmupScheduler from the original Transformer paper. With WarmupScheduler model trained faster, so I picked it as a final model.

The spectrograms look significantly better than the previous example (upper - original, lower - predicted). Original spectrogram Predicted spectrogram

About

code for training a text-to-speech model

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published