# **TTS using Transformer-based Architecture**

## **Abstract**
This project explores the development and optimization of a sequence-to-sequence Transformer model for speech synthesis, focusing on generating mel spectrograms from phoneme sequences. The model is trained on recorded speech so that it learns the variability and complexity of spoken language. Its architecture pairs an encoder, which interprets the phonemes, with a decoder that predicts the corresponding mel spectrogram frames; positional encodings and custom pre-nets on both sides are intended to improve the naturalness and intelligibility of the synthesized speech. A pre-trained HiFi-GAN vocoder then converts the spectrograms into audible speech. Compared with Tacotron2, the model replaces recurrence with attention, which allows parallel processing during training and more direct modeling of long-range dependencies in speech, leading to faster training and potentially more expressive output. The results show promise for building better speech generation tools, which could support more effective and user-friendly applications that rely on synthetic speech, such as virtual assistants and spoken language translation systems.

## **Introduction**

In the modern world of text-to-speech (TTS) systems, traditional methodologies like concatenative and parametric speech synthesis have increasingly given way to neural network-based models, which simplify the complex linguistic and acoustic pipelines of older systems. This shift has been driven by end-to-end generative models such as Tacotron2, which directly generate mel spectrograms from text inputs using a sequence-to-sequence architecture.

Our study introduces a new approach by incorporating the Transformer architecture, renowned for its efficiency in neural machine translation due to its parallel processing capabilities and self-attention mechanisms. By replacing recurrent components with multi-head attention, the Transformer model facilitates faster training and improves the handling of long-range dependencies, thus enhancing the naturalness of the speech output. This integration aims to address the limitations of Tacotron2, particularly in modeling long sequences and maintaining computational efficiency during training.

For this study, we employ the [LJSpeech dataset](https://arxiv.org/pdf/1809.08895.pdf), a widely-used corpus in speech synthesis research that consists of 13,100 short audio clips of a single speaker reading passages from various non-fiction books. This dataset provides a high-quality benchmark for evaluating the performance of TTS systems. By comparing our Transformer-based model to Tacotron2 using the LJSpeech dataset, we aim to highlight the improvements in processing speed and output quality, thereby contributing to the ongoing development of more advanced and natural-sounding speech synthesis technologies.

## **Methodology**

***Data Preparation:***

1. Expansion of Abbreviations: Common abbreviations in the transcriptions (like "Mr." to "Mister") are expanded to their full forms using a predefined mapping. This standardization helps in reducing variability during phoneme conversion.
2. Phoneme Conversion: The normalized transcriptions are converted into phonemes using SpeechBrain's GraphemeToPhoneme model. This step transforms textual data into a phonetic representation, which is more beneficial for training speech synthesis models as it provides a direct mapping to spoken sounds.
3. Dataset Splitting and JSON Conversion:
The dataset is divided into training, validation, and testing subsets. Each subset's data is processed to append silences at the beginning and end of audio files, accommodating natural speech patterns. The processed data—including paths to audio files, durations, transcriptions, and phoneme sequences—are saved into JSON files, facilitating efficient data handling and on-the-fly processing during model training.
4. Dynamic Item Dataset: This approach allows for on-the-fly computation of batches, optimizing memory usage and accommodating varying data sizes by dynamically applying transformations like phoneme encoding and mel spectrogram computation during training.
5. Bucketing: Before batching, sequences are grouped by similar length. This minimizes the padding needed per batch and therefore the number of operations the model spends on padded positions, which do not contribute to the learning process.

6. Collate Function: The custom collate function pads the sequences within a batch to a uniform length. Sequences are first sorted by length in descending order, and each sequence, whether phonemes or mel spectrograms, is then padded to the length of the longest sequence in the batch. Combined with padding masks, this lets the attention mechanisms ignore the padded values during training.
7. Feature Extraction:
Mel spectrograms are computed from the waveform data, providing a time-frequency representation of the audio. This feature is pivotal for the training of the speech synthesis model as it captures essential characteristics of the sound, which are used to generate the synthetic speech output.
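
The bucketing and collate steps above can be sketched in a few lines. This is a minimal illustration in plain Python, with lists standing in for tensors; the function names `bucket_by_length` and `pad_batch` and the toy phoneme IDs are hypothetical and are not taken from the project code.

```python
from typing import List, Sequence, Tuple


def bucket_by_length(lengths: Sequence[int], batch_size: int) -> List[List[int]]:
    """Group example indices so that each batch holds sequences of similar length."""
    # Sorting by length (longest first) is a simple stand-in for bucketing.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]


def pad_batch(seqs: List[List[int]], pad_value: int = 0) -> Tuple[List[List[int]], List[int]]:
    """Sort a batch by length (descending) and right-pad to the longest sequence."""
    seqs = sorted(seqs, key=len, reverse=True)
    max_len = len(seqs[0])
    padded = [s + [pad_value] * (max_len - len(s)) for s in seqs]
    lengths = [len(s) for s in seqs]  # true lengths, needed later for padding masks
    return padded, lengths


# Toy phoneme-ID sequences (hypothetical values).
data = [[5, 2], [7, 1, 4, 9, 3], [8], [6, 6, 2, 1]]
for batch in bucket_by_length([len(s) for s in data], batch_size=2):
    padded, lengths = pad_batch([data[i] for i in batch])
    print(padded, lengths)
```

With these four sequences, bucketing pairs the lengths (5, 4) and (2, 1), so only two padding values are added in total; batching them in their original order would add six.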


***Model Architecture:***
 We introduced a custom Transformer model, leveraging its attention mechanisms to process sequences effectively without the recurrent layers typically found in other sequence-to-sequence models like Tacotron2.

1. Prenets: The encoder prenet processes phoneme sequences, converting them into a higher-dimensional, enriched representation. This is achieved through a series of convolutional layers followed by batch normalization and dropout. These layers help in capturing local dependencies within the phoneme sequences.

  The decoder prenet is crucial for handling the mel spectrogram inputs. It initially transforms these spectrograms into a more manageable representation for the decoder through a sequence of linear transformations interspersed with ReLU activations and dropout. This transformation is essential for smoothing and conditioning the input, making it more suitable for the decoder to generate the subsequent mel spectrogram frames.

2. Scaled positional Encoding: Scaled positional encoding in the model addresses a critical aspect of handling differing scales between the encoder and decoder inputs—namely, the phonemes and mel spectrograms. The encoder processes phonemes, which are on a different temporal scale compared to the mel spectrograms processed by the decoder. The scaled positional encoding introduces a trainable parameter that dynamically adjusts the scale of positional embeddings. This adjustment is crucial because it helps the model better align and integrate these different representations, ensuring that temporal dynamics are accurately captured and reflected in the synthesized speech. This method enhances the model's ability to deal with the inherent discrepancies in the data representation scales, leading to more effective and natural-sounding speech synthesis.


3. Encoder: It consists of multiple layers, each containing two sub-layers: a multi-head self-attention mechanism and a position-wise fully connected feed-forward network. Each of these layers is equipped with residual connections followed by layer normalization. The self-attention mechanism in the encoder allows the model to weigh the importance of different phonemes within the sequence, irrespective of their position, enhancing the contextual understanding of the input sequence.

<center><img src="https://drive.google.com/uc?export=view&id=1sGml2bi1J05SpDia35nScfMJO5ModeKZ"/></center>

4. Decoder: The decoder autoregressively generates mel spectrogram frames from the encoded embeddings. It takes the mel spectrogram frames pre-processed by the decoder prenet and predicts the next frame in the sequence. Like the encoder, it contains multiple layers, but each consists of three sub-layers: a masked multi-head self-attention mechanism, a multi-head attention mechanism over the encoder's output, and a position-wise fully connected feed-forward network. The attention over the encoder's output lets each frame weigh the importance of different phonemes in the input, irrespective of their position, leading to more nuanced and contextually aware speech synthesis. The first attention layer is masked so that the prediction for a given timestep can depend only on known outputs at previous timesteps, which maintains the autoregressive property.

5. Lookahead and Zero-padding masking:
The lookahead mask is implemented in the self-attention mechanism of the decoder block of the Transformer architecture. It works by setting the strictly upper triangular part of the attention score matrix (the entries above the diagonal) to negative infinity (or an extremely low value) before the softmax operation. These positions therefore receive zero weight in the softmax output, ensuring that the attention mechanism can only focus on past and current positions in the sequence when calculating the context vector for each decoder output step.

  To prevent the model from treating padded zeros as valid data, a padding mask is applied in the attention mechanisms of both the encoder and decoder blocks. The mask tells the attention layers to ignore the zeros added during padding: the attention weights are zero wherever the input sequence is padded, so the attention output reflects only the actual data points and not the padded zeros.

6. Mel-linear, Postnet, and Stop-linear: The Mel-linear module is directly connected to the output of the decoder block in our model. It plays a crucial role in shaping the final output of the network into a format that is suitable for audio synthesis, ensuring that the dimensions of the output match the expected number of mel frequency bins.

   The Postnet takes the initial mel spectrogram predictions from the Mel-Linear layer, processes them, and outputs a refined version. This refined spectrogram, when added back to the initial prediction, helps in mitigating any potential prediction errors and enhances detail, leading to more accurate and natural sounding speech output.

   The Stop-Linear module processes the decoder output to predict stop tokens for each time step. These predictions are made using a sigmoid activation function to produce a probability between 0 and 1, where values close to 1 indicate a higher confidence in stopping the generation. This mechanism ensures that the model can flexibly adapt to sentences of different lengths and stops the generation process at the appropriate time.
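
To make the two masks described in item 5 concrete, here is a minimal sketch in plain Python (lists instead of tensors). The helper names are hypothetical and the project's actual implementation may differ. Note that the diagonal is left unmasked, so each position can still attend to itself; masking it would leave the first row with no valid position at all.

```python
import math
from typing import List

NEG_INF = float("-inf")


def lookahead_mask(size: int) -> List[List[bool]]:
    """True marks key positions a query must NOT attend to (strictly future steps)."""
    return [[key > query for key in range(size)] for query in range(size)]


def padding_mask(lengths: List[int], max_len: int) -> List[List[bool]]:
    """True marks padded key positions for each sequence in the batch."""
    return [[pos >= n for pos in range(max_len)] for n in lengths]


def masked_softmax(scores: List[float], mask: List[bool]) -> List[float]:
    """Softmax over one row of scores; masked entries are set to -inf first."""
    masked = [NEG_INF if m else s for s, m in zip(scores, mask)]
    peak = max(masked)  # finite as long as at least one position is unmasked
    exps = [math.exp(s - peak) for s in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


# Query at step 1 of a length-3 sequence that was padded to length 4.
causal_row = lookahead_mask(4)[1]           # [False, False, True, True]
pad_row = padding_mask([3], max_len=4)[0]   # [False, False, False, True]
combined = [c or p for c, p in zip(causal_row, pad_row)]
weights = masked_softmax([0.2, 1.0, 0.5, 0.3], combined)
print(weights)  # positions 2 and 3 receive weight 0.0
```

Combining the masks with a logical OR means a position is ignored if it is either in the future or padding, which is how the decoder self-attention would typically use both at once.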

***Training:*** During training we use teacher forcing, which lets the decoder compute all output frames in parallel and speeds up training. The ground-truth mel spectrograms, shifted by one position, are given as input to the decoder after passing through the decoder prenet and scaled positional encoding. The stop-token loss trains the model to learn when to emit the stop token, and the model uses that information at inference time.
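
The one-position shift used for teacher forcing can be illustrated as follows. This is a sketch under the assumption that the decoder input starts with an all-zero "go" frame, mirroring the zero input used at inference; `shift_right` is a hypothetical helper name, and plain lists stand in for mel spectrogram tensors.

```python
from typing import List

Frame = List[float]  # one mel spectrogram frame (n_mels values)


def shift_right(target: List[Frame]) -> List[Frame]:
    """Build decoder inputs: an all-zero go frame followed by target[:-1]."""
    n_mels = len(target[0])
    go_frame = [0.0] * n_mels
    return [go_frame] + [list(frame) for frame in target[:-1]]


# Toy target with 3 frames and 2 mel bins (hypothetical values).
target = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
decoder_input = shift_right(target)
# decoder_input[t] is what the decoder sees when it has to predict target[t].
print(decoder_input)
```

Because every decoder input is known in advance, all time steps can be processed in a single forward pass, with the lookahead mask preventing each step from seeing later frames.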

***Inference:*** At test time, a frame of zeros is given as input to the first decoding step; the resulting output is fed as input to the next decoding step, and so on. The model is therefore executed sequentially to predict the spectrogram. Stop tokens signal the end of the sentence: when the model predicts one, generation stops. All predicted mel spectrogram frames are then concatenated into a single tensor, which is converted into an audio waveform.
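
The sequential decoding loop can be sketched as below. The `DecoderStep` callable, the 0.5 stop threshold, and the `max_steps` safety limit are assumptions made for illustration; in the real model the step would run the decoder stack on all frames generated so far.

```python
import math
from typing import Callable, List, Tuple

Frame = List[float]
# Hypothetical signature: given all frames so far, return (next_frame, stop_logit).
DecoderStep = Callable[[List[Frame]], Tuple[Frame, float]]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def greedy_decode(step: DecoderStep, n_mels: int,
                  max_steps: int = 1000, stop_threshold: float = 0.5) -> List[Frame]:
    """Generate frames one at a time until the stop probability exceeds the threshold."""
    frames: List[Frame] = [[0.0] * n_mels]  # all-zero input for the first step
    for _ in range(max_steps):  # hard limit in case no stop token is predicted
        next_frame, stop_logit = step(frames)
        frames.append(next_frame)
        if sigmoid(stop_logit) > stop_threshold:
            break
    return frames[1:]  # drop the initial zero frame


def toy_step(frames: List[Frame]) -> Tuple[Frame, float]:
    """Stand-in for the model: predicts a stop once 3 frames have been seen."""
    t = len(frames)
    return [float(t)] * 2, (5.0 if t >= 3 else -5.0)


print(greedy_decode(toy_step, n_mels=2))
```

The `max_steps` limit is a common safeguard: if the stop probability never crosses the threshold, generation still terminates.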




## **Experimental Setup**
For this study, the LJSpeech dataset was chosen due to its extensive collection of English speech data. LJSpeech is a publicly available audio dataset consisting of 13,100 short audio clips of a single female speaker reading passages from 7 non-fiction books. The audio totals about 24 hours and is sampled at 22,050 Hz. Each clip includes a transcript, which makes this dataset highly suitable for training speech synthesis models.

The model was implemented using PyTorch and SpeechBrain, leveraging their robust frameworks for building and training deep learning models. Training was conducted on GPU-accelerated hardware, specifically PyTorch's MPS (Metal Performance Shaders) backend, to manage the computational demands efficiently.

***Vocoder***: A pre-trained HiFi-GAN model was integrated into the setup to convert the mel spectrograms back into waveform audio. The vocoder's role is crucial in synthesizing natural-sounding speech from the mel spectrogram output of the Transformer model.

***Loss Functions:*** Two loss functions were used during training to optimize different aspects of the speech synthesis task. The Mean Squared Error (MSE) loss was computed between the predicted and target mel spectrograms. MSE measures the average squared difference between the predicted and target values, providing a clear quantitative measure of how accurately the model reproduces the audio's spectral features. In parallel, the Binary Cross-Entropy (BCE) loss was used for the stop tokens, which indicate the end of the speech sequence. BCE is suited to binary classification tasks and helps the model learn when to end the generated speech, improving the naturalness and timing of the synthesized audio.

***Weighted Loss:***
The final loss is the sum of the mel loss (MSE) and the stop loss (BCE). To balance the two terms, each can be assigned a weight, which prioritizes one aspect of learning over the other depending on which needs more emphasis. In our experiments, this weighted loss strategy did not noticeably improve the model's performance.
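
As an illustration of how the two terms combine, here is a minimal sketch in plain Python. The weight names `mel_weight` and `stop_weight` and the toy values are hypothetical; the weights actually tried in the experiments are not stated here.

```python
import math
from typing import List


def mse_loss(pred: List[float], target: List[float]) -> float:
    """Mean squared error between predicted and target mel values."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def bce_loss(prob: List[float], label: List[float], eps: float = 1e-7) -> float:
    """Binary cross-entropy between stop probabilities and 0/1 stop labels."""
    total = 0.0
    for p, y in zip(prob, label):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
    return total / len(prob)


def weighted_loss(mel_pred: List[float], mel_target: List[float],
                  stop_prob: List[float], stop_label: List[float],
                  mel_weight: float = 1.0, stop_weight: float = 1.0) -> float:
    """Weighted sum of the mel (MSE) and stop-token (BCE) losses."""
    return (mel_weight * mse_loss(mel_pred, mel_target)
            + stop_weight * bce_loss(stop_prob, stop_label))


# Toy values: two mel entries and a stop probability for each of two frames.
loss = weighted_loss([1.0, 2.0], [1.0, 4.0], [0.1, 0.9], [0.0, 1.0],
                     mel_weight=1.0, stop_weight=0.5)
print(loss)
```

Setting both weights to 1.0 recovers the plain sum of the two losses.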

***Weight Decay and Beta Values for Optimizer:*** Initially, with a weight decay of 0.000001 and beta values of [0.9, 0.98], the model did not converge properly, so we changed the values to 0.00001 and [0.9, 0.999]. A second-moment coefficient of 0.98 averages over a relatively short history, which makes the optimizer rely heavily on recent gradients; raising it to the common default of 0.999 lets the optimizer consider a longer history of gradients and can provide smoother and more stable weight updates. Weight decay regularizes the model by adding a penalty on the size of the weights; a value that is too low might not provide enough regularization, leading to overfitting.
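
The effect of the second beta value can be made concrete with a little arithmetic. Assuming an Adam-style optimizer, where the second-moment estimate is an exponential moving average with decay `beta2`, the average spans roughly `1 / (1 - beta2)` steps. The helper names below are hypothetical.

```python
def effective_window(beta: float) -> float:
    """Approximate number of steps an exponential moving average spans."""
    return 1.0 / (1.0 - beta)


def relative_weight(beta: float, steps_ago: int) -> float:
    """Weight of a gradient from `steps_ago` steps back, relative to the newest one."""
    return beta ** steps_ago


for beta2 in (0.98, 0.999):
    print(beta2, round(effective_window(beta2)), round(relative_weight(beta2, 100), 3))
```

With 0.98 the average covers about 50 steps, and a gradient from 100 steps back has already decayed to about 13% of the newest one's weight; with 0.999 the average covers about 1000 steps and the same gradient still carries about 90%.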

***Batch Size:*** To improve training and potentially accelerate the learning process, we tried increasing the batch size from 8 to 16. While a larger batch size might in theory have provided smoother and more stable gradient updates, it introduced a critical issue: with a batch size of 16, training ran into an "Out of Memory" error on the GPU.



## **Experimental Results**

<center><img src="https://drive.google.com/uc?export=view&id=1x6OvQcYB3WWksplaUIDYmxCu8GfXzMwH"/></center>
<br>
Despite the theoretical benefits of applying weighted loss, the results did not show a significant improvement in the model's performance. The graphs depicting the training and validation losses indicate minimal changes in convergence patterns and final loss values compared to other experimental results. This suggests that simply adjusting the loss weights did not effectively address the model's learning challenges.

***Interpretation:***

The lack of improvement might be due to several factors such as the initial choice of weights, the inherent complexity of the model, or the nature of the training data. It is possible that the model's architecture and existing hyperparameters are already near optimal for the given data, leaving little room for improvement through loss weighting alone. Alternatively, the weights might not have been tuned to the optimal values needed to make a discernible difference.
<br>
<br>
<center><img src="https://drive.google.com/uc?export=view&id=1f3DViNZKVLkGf3DeuqeBHBzy0vS3vXtG"/></center>
<br>
The second experiment focused on adjusting the weight decay parameters, the beta values of the optimizer, and the batch size. Adjustments to the weight decay and beta values, along with a smaller batch size of 8, resulted in more promising outcomes. The corresponding graph showed improved convergence rates and a slight decrease in validation loss over epochs. These changes suggest a more stable and effective learning process, potentially leading to enhanced model performance.

***Interpretation:***

The positive impact of these adjustments indicates that the model's training process is sensitive to the optimizer's configuration and the batch size. Smaller batches may have provided more frequent updates, helping the model to escape local minima or converge to better solutions. Similarly, tuning the weight decay and beta values likely helped in managing the trade-offs between speed of convergence and stability, preventing the accumulation of errors that larger batch sizes or less optimal optimizer settings might incur.
<br>

<center><img src="https://drive.google.com/uc?export=view&id=1HPjAzsz7_7BB-3hYHLRoIuVq2hsOfpHW"/></center>
<br>
The graph illustrates the progression of the learning rate over epochs using the Noam scheduler. The scheduler is characterized by a warm-up phase in which the learning rate rises linearly to a peak that facilitates rapid initial learning. This is particularly effective for Transformers, which benefit from the ability to make significant strides early in training when the model parameters are farthest from their optimal values. Following the warm-up, the learning rate declines in proportion to the inverse square root of the training step number. This gradual decrease aligns with the need for finer adjustments as the model begins to converge, preventing the overshooting of the minimum loss value.
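
The shape of this schedule can be reproduced with the formula from the original Transformer paper, expressed in optimizer steps. The sketch below assumes `d_model = 512` and a scaling `factor` of 1.0, both of which are illustrative; only the 4000 warm-up steps come from the experiments reported here, and the project's scheduler may be parameterized differently.

```python
def noam_lr(step: int, d_model: int = 512, warmup_steps: int = 4000,
            factor: float = 1.0) -> float:
    """Noam schedule: linear warm-up, then inverse-square-root decay."""
    step = max(step, 1)  # avoid division by zero at step 0
    return factor * d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


for step in (1000, 2000, 4000, 8000, 16000):
    print(step, noam_lr(step))
```

During warm-up, doubling the step doubles the learning rate; after the peak at `warmup_steps`, quadrupling the step halves it.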

Experiments with 40 epochs and 4000 warm-up steps
<table>
  <tr>
    <th>Experiment Name</th>
    <th>Mel Error</th>
    <th>Stop Error</th>
    <th>Train Loss</th>
    <th>Valid Loss</th>
    <th> Training Time</th>
  </tr>
  <tr>
    <td>Batch size 8 with nhead 4</td>
    <td>2.78e-02</td>
    <td>2.89e-02</td>
    <td>2.42e-03</td>
    <td>3.02e-02</td>
    <td>~1.5 days</td>
  </tr>
  <tr>
    <td>Batch size 8 with nhead 8</td>
    <td>2.34e-03</td>
    <td>1.21e-04</td>
    <td>3.84e-03</td>
    <td>3.92e-03</td>
    <td>~2.5 days</td>
  </tr>
  <tr>
    <td>Batch size 8 with EOS and BOS and nhead 8</td>
    <td>3.78e-01</td>
    <td>4.01e-01</td>
    <td>5.96e-01</td>
    <td>6.72e-01</td>
    <td>~2.5 days</td>
  </tr>
</table>
<br>

***Interpretation:***

Doubling the number of attention heads from 4 to 8 substantially improved the model's results: mel error fell roughly tenfold, stop error by more than two orders of magnitude, and validation loss nearly eightfold. The train loss was slightly higher with 8 heads (3.84e-03 versus 2.42e-03), but the gap between train and validation loss almost disappeared, which indicates better generalization, likely because the additional heads allow finer attention mechanisms within the network. The increase in training time (about 2.5 days versus 1.5 days) is justified by the resulting performance.

The inclusion of EOS and BOS tokens, even with 8 attention heads, led to significantly higher errors and losses. This outcome could be due to several factors. Introducing EOS and BOS tokens might have added complexity or confusion in learning sequences, particularly if the model was not optimized to handle these tokens effectively. In addition, the way these tokens were implemented in the sequence might not align well with how the model processes sequence information, leading to poor performance.


## **Conclusions**

In conclusion, the model demonstrates a capacity for rapid learning and effective long-range dependency modeling, and its attention-based design allows more parallel processing during training than the recurrent Tacotron2 architecture. During training, we noted that the Noam learning rate scheduler provided an adaptive approach, balancing rapid initial learning with more measured adjustments in later epochs. The improved performance with 8 attention heads is evident in the reduced mel error and stop token error and in the lower validation loss. Adjustments to the weight decay and beta values resulted in improved model convergence. This suggests a sensitive balance within the training process, where precise tuning of hyperparameters can lead to better generalization and performance.

One unexpected finding was that the addition of EOS and BOS tokens did not improve and, in fact, seemed to reduce the model's accuracy. This is an open area for further investigation into the optimal implementation of sequence delimiters within a Transformer-based TTS system.





## **Inference**


```python
!pip install --upgrade --no-cache-dir gdown
!gdown 1Sg3eQyOwHNeBJLEaHgwTO0OAi2pqEpNe
!unzip project_TTS.zip
```

```
Archive:  project_TTS.zip
   creating: project_TTS/
  inflating: project_TTS/custom_model.py
  inflating: project_TTS/TextToSpeech.py
  inflating: project_TTS/model.ckpt
  inflating: project_TTS/hyperparams.yaml
  inflating: project_TTS/input_encoder.txt
```


```python
%cd project_TTS
```


```python
%%capture
%pip install speechbrain
```

```python
from speechbrain.inference.vocoders import HIFIGAN
from TextToSpeech import TextToSpeech  # custom interface shipped in project_TTS
import torchaudio

texts = ["This is an example for synthesis."]

# Load the trained Transformer TTS model and the pre-trained HiFi-GAN vocoder
my_tts_model = TextToSpeech.from_hparams(source="/content/project_TTS")
hifi_gan = HIFIGAN.from_hparams(source="speechbrain/tts-hifigan-ljspeech", savedir="tmpdir_vocoder")

# Run the TTS model (text-to-spectrogram)
mel_output = my_tts_model.encode_text(texts)

# Run the vocoder (spectrogram-to-waveform)
waveforms = hifi_gan.decode_batch(mel_output)

# Save the waveform at the LJSpeech sampling rate (22,050 Hz)
torchaudio.save("example_TTS.wav", waveforms.squeeze(1), 22050)
```




The training code can be run from my [project.ipynb](https://drive.google.com/file/d/15cUf5zbA1Waat4NphKc9orRnKAwkl9jo/view?usp=sharing/) notebook.