# **Transformer based TTS**

## **Abstract**
   In the evolving field of speech synthesis, this project aims to harness the capabilities of Transformer-based models for text-to-speech (TTS) systems. Utilizing the robust framework provided by SpeechBrain, this project focuses on the implementation and enhancement of a Transformer TTS model, leveraging insights from the established Tacotron2 methodology. The Transformer TTS model is meticulously trained on the LJSpeech dataset, which features extensive audio samples from a single speaker, providing a consistent testing ground for speech synthesis advancements. Performance evaluations pit the Transformer TTS against Tacotron2 benchmarks, highlighting the model’s efficacy and areas for improvement. Integral to the project is an extensive literature review that delves into the application of transformers in TTS technology, revealing their potential to handle sequence-to-sequence tasks more effectively than conventional models. Additionally, the project involves rigorous hyperparameter tuning to optimize model performance, ensuring that the final TTS system not only meets but exceeds current standards. A user-friendly inference API is also developed, encapsulating the model’s functionality to facilitate easy access and usability. This project not only demonstrates the transformative potential of transformers in speech synthesis but also sets a precedent for future TTS innovations.


## **Literature Review: Advances in Transformer-Based Text-to-Speech Models**
The landscape of text-to-speech (TTS) systems has been significantly shaped by recent advancements in Transformer-based models. The foundational work "Neural Speech Synthesis with Transformer Network" serves as a key reference point in the TTS field, showing how Transformer models overcome inefficiencies inherent in recurrent systems such as Tacotron2. Replacing recurrence with self-attention lets the model process all frames in parallel during training, and it improves the model's ability to capture the long-range dependencies needed for natural-sounding prosody in synthesized speech [1].

Building on this, newer developments such as PHEME have systematized advancements, introducing models that prioritize conversational naturalness, parameter efficiency, and rapid inference capabilities. These models present a leap forward in conversational TTS systems, offering improved word error rates and mel-cepstral distortion measurements, confirming the potential of non-autoregressive Transformer TTS models for high-quality, efficient speech generation [2].

Further research introduced by SpeechX has contributed versatile speech generation capabilities to the Transformer model landscape. Utilizing both autoregressive (AR) and non-autoregressive (NAR) Transformer models, SpeechX optimizes for flexibility and inference speed. This blend of AR and NAR models leverages the benefits of both to handle diverse tasks and input-output relationships effectively [3].

Additionally, models like FastSpeech have emerged, which optimize TTS systems further in terms of speed, accuracy, and controllability. Experiments on the LJSpeech dataset show that FastSpeech generates mel spectrograms up to 270 times faster than an autoregressive Transformer TTS model. FastSpeech also addresses common issues such as word skipping and repeating, indicating robustness in speech generation [4].

Moreover, the Transformer model has extended its applications to multimodal tasks, handling both text and images, and demonstrating adaptability to specialized tasks with custom architectures. This adaptability is essential in creating efficient and context-aware TTS systems that can handle the complex nature of human language and speech [5].

The body of work in the Transformer-based TTS domain not only illustrates significant improvements over past models but also lays the groundwork for future explorations. The ongoing research and applications in this area suggest a sustained trajectory towards more nuanced, expressive, and efficient TTS systems that approach human-like speech quality.

## **Introduction**

Text-to-speech (TTS) synthesis plays a pivotal role in developing interactive and accessible communication technologies, ranging from virtual assistants to tools aiding individuals with speech impairments. Despite considerable progress, challenges persist in attaining high-quality, natural-sounding speech, particularly concerning efficiency and handling long-range dependencies in speech patterns. Traditional TTS systems, such as Tacotron2, predominantly rely on recurrent neural networks (RNNs), which grapple with these issues due to their sequential processing nature.

In recent years, there has been a paradigm shift with the application of Transformer models, initially successful in neural machine translation, to the TTS domain. These models address the inherent limitations of RNNs by leveraging multi-head self-attention mechanisms, facilitating parallel processing and direct modeling of dependencies across extensive sequences. This project adopts a Transformer-based approach, harnessing these capabilities to improve both the training efficiency and quality of speech synthesis.

The choice to base our research on the seminal paper "Neural Speech Synthesis with Transformer Network" is founded on several compelling reasons. Firstly, the paper provides an exhaustive exploration of Transformer-based architectures tailored for TTS, offering a comprehensive analysis of their superiority over traditional methodologies. Additionally, it offers valuable insights into potential areas for refinement and optimization, laying a robust foundation for subsequent research endeavors. By building upon this seminal work, we aim to capitalize on existing knowledge and methodologies to develop more efficient and effective TTS systems tailored to the nuances of the LJSpeech dataset.

The dataset chosen for this study is LJSpeech [6]. It comprises 13,100 audio clips of a single speaker reading passages from seven non-fiction books, totaling about 24 hours of speech. Each clip, ranging from 1 to 10 seconds, is accompanied by a transcription, making the dataset a valuable resource for training text-to-speech models. Preliminary results indicate that our Transformer TTS model not only speeds up training significantly compared to Tacotron2 but also achieves a noticeable improvement in speech quality, nearing human-like performance.

## **Methodology**


1. **Data Preparation:**
The data preparation process was critical in enhancing the quality of the text-to-speech model. Key steps included:
*   **Silence Padding:** To improve model recognition of speech start and end points, silence was added to the beginning (50 ms) and end (100 ms) of each audio clip using torchaudio. This adjustment helps in modeling pauses in speech, crucial for natural-sounding audio generation.
*   **Phoneme Conversion:** Text was converted to phonemes using SpeechBrain's soundchoiceg2p model, available on HuggingFace. This step transforms textual data into phoneme sequences, which are more effective for training speech synthesis models. The conversion process included expanding abbreviations to their full forms to maintain consistency and clarity in phoneme generation.
*   **Data Cleaning and Normalization:** The metadata was cleaned to ensure no null values that could disrupt training. Each transcription was processed using SpeechBrain's ljspeech_prepare_cleaners to standardize and normalize the text, making it suitable for phoneme conversion.
*   **Dataset Splitting and JSON Creation:** The dataset was randomly shuffled and split into training, validation, and testing sets. The split comprised 80% training, 10% testing, and 10% validation. JSON files were created for each set, storing detailed information like waveform path, length, normalized transcription, and phoneme sequences, essential for training and evaluation.

<img src="https://drive.google.com/uc?export=view&id=1toQj5xpDbsDkDH2Mfx9GF9vSAnACqY_h"/><br> Figure 1: Data in JSON format<br>
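The padding and splitting steps described above can be sketched in plain Python. This is a minimal illustration rather than the project's code: the project pads with torchaudio, whereas this sketch works on a plain list of samples, and the function names, the 22,050 Hz default, and the seed are assumptions made for the example.

```python
import random


def pad_with_silence(samples, sample_rate=22050, lead_ms=50, tail_ms=100):
    """Return the samples with zero-valued silence added before and after."""
    lead = [0.0] * (sample_rate * lead_ms // 1000)
    tail = [0.0] * (sample_rate * tail_ms // 1000)
    return lead + list(samples) + tail


def split_ids(ids, seed=1234, train_pct=80, valid_pct=10):
    """Shuffle utterance ids and split them into train/valid/test lists."""
    ids = list(ids)
    random.Random(seed).shuffle(ids)  # fixed seed keeps the split reproducible
    n_train = len(ids) * train_pct // 100
    n_valid = len(ids) * valid_pct // 100
    return {
        "train": ids[:n_train],
        "valid": ids[n_train:n_train + n_valid],
        "test": ids[n_train + n_valid:],  # whatever remains after train and valid
    }


splits = split_ids(range(13100))
print({name: len(part) for name, part in splits.items()})
```

Integer percentages are used instead of floating-point fractions so that the split sizes do not depend on rounding.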

Continuing with the data preparation, we incorporate an advanced strategy known as dynamic batching, which is essential for managing the variable lengths of training samples effectively. This approach ensures optimal resource utilization and enhances computational efficiency:

2. **Dynamic Batching and Bucketing Implementation:**
Dynamic batching is a technique tailored to address the challenges posed by the diverse lengths of mel spectrograms generated from the LJSpeech dataset. A fixed batch size could lead to memory overflow with longer sequences or underutilization of processing power with shorter ones, so the number of samples per batch is allowed to vary with their lengths instead. We also implement a bucketing strategy during data preparation to further optimize the training process. This method involves sorting the training samples based on the lengths of their mel spectrograms and grouping them into buckets. By doing so, we minimize the amount of padding necessary within each batch. This strategy not only reduces computational overhead but also ensures that each batch contains audio samples of similar lengths, which enhances the efficiency of model training. The combination of dynamic batching and bucketing allows for efficient use of GPU resources, reducing idle times and improving overall training speed.

This methodology optimizes the training process by allowing as many samples as possible within each batch while preventing out-of-memory errors. It leverages the variable lengths of audio samples to improve GPU utilization, thus speeding up the training process and enhancing model performance. This approach to data preparation and batching underpins the robustness of the training framework, setting a solid foundation for achieving high-quality text-to-speech synthesis with the Transformer model.
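The idea can be illustrated with a small sketch. It is a simplification that sorts all samples by length and packs them greedily, rather than a reproduction of the sampler used in the project; the function name and the frame budget `max_batch_frames` are assumptions introduced for the example.

```python
def make_dynamic_batches(lengths, max_batch_frames):
    """Group sample indices into batches whose padded size fits a frame budget.

    lengths: mel-spectrogram length (in frames) of each sample.
    max_batch_frames: upper bound on batch_size * longest_length_in_batch.
    """
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    batches, current = [], []
    for idx in order:
        # Samples arrive in ascending order, so the newest sample is the
        # longest in the batch and sets the padded length of every member.
        padded_size = (len(current) + 1) * lengths[idx]
        if current and padded_size > max_batch_frames:
            batches.append(current)
            current = []
        current.append(idx)
    if current:
        batches.append(current)
    return batches


lengths = [100, 800, 120, 790, 110, 400]
print(make_dynamic_batches(lengths, max_batch_frames=1600))
```

Short clips end up in large batches and long clips in small ones, which is what keeps memory use roughly constant across batches.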

3. **Data Processing Pipeline:**
The dataio_prepare function is integral to constructing the datasets and processing pipelines for our TTS model, defined through various stages outlined below:

* **Text Processing Pipeline:**
Processes raw phoneme sequences into encoded formats suitable for neural network processing. This pipeline employs the sb.utils.data_pipeline utility from SpeechBrain, which encodes the sequences into integer lists and then converts them to PyTorch tensors.

* **Mel Spectrogram Pipeline:**
Calculates the mel spectrogram from audio files on the fly, along with their lengths and stop token targets. This step is crucial for variable-length sequence handling and determining the appropriate endpoints for synthesized speech.

* **Dynamic Item Dataset Creation:**
Generates DynamicItemDataset instances for the train, validation, and test sets from JSON manifests. These datasets are organized to optimize efficiency in data loading and are crucial for effective model training.

* **Label Encoder:**
A TextEncoder from sb.dataio.encoder is used to map phoneme sequences to encoded integers. This encoder is updated with the phoneme lexicon and specializes in preparing the model to interpret and generate linguistic content.

The dataio_prepare function encapsulates the complexity of preparing data for TTS, ensuring that each dataset is ready for efficient training while maintaining high data integrity and consistency.
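To make the label-encoding and stop-target steps concrete, here is a dependency-free sketch. The `PhonemeEncoder` class is a toy stand-in written for this example and is not SpeechBrain's `TextEncoder` API; marking only the final frame as the stop frame is likewise an assumption about how the targets are built.

```python
class PhonemeEncoder:
    """Toy label encoder: maps phoneme symbols to integer ids."""

    def __init__(self, unk_symbol="<unk>"):
        self.unk_symbol = unk_symbol
        self.symbol_to_id = {unk_symbol: 0}

    def update_from_sequences(self, sequences):
        """Assign the next free id to every symbol not seen before."""
        for seq in sequences:
            for symbol in seq:
                self.symbol_to_id.setdefault(symbol, len(self.symbol_to_id))

    def encode(self, seq):
        """Encode a phoneme sequence; unknown symbols map to the <unk> id."""
        unk = self.symbol_to_id[self.unk_symbol]
        return [self.symbol_to_id.get(symbol, unk) for symbol in seq]


def stop_targets(num_frames):
    """Return 0.0 for every frame except the last, which is marked 1.0.

    Assumes num_frames >= 1.
    """
    return [0.0] * (num_frames - 1) + [1.0]


encoder = PhonemeEncoder()
encoder.update_from_sequences([["HH", "AH", "L", "OW"], ["W", "ER", "L", "D"]])
print(encoder.encode(["HH", "AH", "L", "OW"]))
print(stop_targets(4))
```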

4.  **Model Architecture and Components:** The Transformer architecture has a few added components, such as pre-nets, scaled positional encodings, and a post-net, as shown in Figure 2 below.<br><br>

<center><img src="https://drive.google.com/uc?export=view&id=1Lmj0zZfbOEKqwX2wT4x8Z746JFrELwoq"/><br>
Figure 2: Architecture of the proposed end-to-end Transformer TTS [1]</center>
* **Scaled Positional Encoding:**
Scaled positional encoding is a pivotal enhancement in our Transformer TTS model, compensating for the fact that self-attention has no built-in notion of sequence order. By multiplying the traditional sine and cosine positional encodings by a learnable scale factor `α`, the model can adapt how strongly position information is weighted in the encoder and in the decoder. This flexibility is crucial in TTS, where fixed positional embeddings may fail to reconcile the different scales of textual and acoustic data. Consequently, scaled positional encodings preserve the sequential integrity necessary for generating coherent and naturally flowing speech.<br>
<center><img src="https://drive.google.com/uc?export=view&id=1s6vDWQg30ftFIvmlbed4z4ZJlnxszMfG"/><br>Figure 3: Compute the Positional Encoding values at both Encoder and Decoder<br><br>
<img src="https://drive.google.com/uc?export=view&id=1XqDqdlzOZf2gkp1i4RVGkVnR61mI_l60"/><br>
Figure 4: Compute the scaled value `α` through learning and multiply with Positional Encoding values</center>

* **Encoder and Decoder Pre-nets:**
1. **Encoder Pre-net:** Integrates convolutional layers to process phoneme sequences, enhancing the model's capability to capture contextual dependencies over longer sequences. This setup is augmented with batch normalization and ReLU activations, topped with a linear projection to align the outputs with the positional embeddings.
2. **Decoder Pre-net:** Prepares the mel spectrogram inputs using fully connected layers, ensuring that these audio representations are compatible with the phoneme embeddings. This component is essential for the model's attention mechanisms to accurately align audio outputs with textual inputs.
* **Transformer Encoder and Decoder:**
The core of the model replaces traditional RNNs with Transformer blocks that utilize multi-head attention to model relationships across different frames directly. This architecture enables parallel processing, drastically improving training efficiency and allowing each component of the sequence to access global context, which is pivotal for generating natural-sounding speech with appropriate prosody and intonation.

* **Mel Linear:**
The "Mel Linear" component of the model employs linear layers specifically designed to predict mel spectrograms from the processed phoneme sequences. This part of the model directly translates the decoder's outputs into the mel spectrogram format, which is a crucial step in generating audible speech. The accuracy of this translation is essential for the naturalness and clarity of the synthesized speech.

* **Stop Linear:**
The "Stop Linear" component uses another set of linear layers to predict stop tokens, which indicate the end of speech generation. This is vital for determining the appropriate length of the output speech, preventing the model from producing endless loops of audio. To address the imbalance in training data—where non-stop frames vastly outnumber stop frames—a weighted loss function is applied, enhancing the model's ability to accurately predict when to stop speech generation.

* **Post-net:**
Following the initial predictions, a CNN-based "Post-net" refines the mel spectrogram outputs. This network layer adds a residual learning component that helps to correct any discrepancies in the initial predictions and enhances the overall quality of the audio output. By fine-tuning the details of the mel spectrograms, the Post-net ensures that the final speech sounds more natural and true to life.

This comprehensive architectural framework leverages the strengths of Transformers to overcome the limitations of previous TTS systems, such as Tacotron2, by enhancing parallelizability, reducing training time, and improving the naturalness of the synthesized speech. The integration of advanced neural network techniques and architectural innovations positions our model at the forefront of neural TTS technology.  
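As a concrete illustration of the scaled positional encoding described above, the following sketch computes the sinusoidal encodings and adds them to a pre-net output after scaling by `α`. It uses plain Python lists for clarity. In the model `α` is a trainable parameter, whereas here it is an ordinary float, and the function names are assumptions made for the example.

```python
import math


def positional_encoding(num_positions, d_model):
    """Sinusoidal encodings: sine on even dimensions, cosine on odd ones."""
    table = []
    for pos in range(num_positions):
        row = []
        for k in range(d_model):
            # Dimensions 2i and 2i+1 share the frequency 1 / 10000^(2i / d_model).
            angle = pos / (10000 ** ((2 * (k // 2)) / d_model))
            row.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


def add_scaled_positions(prenet_out, alpha):
    """Return prenet_out + alpha * PE, computed elementwise."""
    d_model = len(prenet_out[0])
    table = positional_encoding(len(prenet_out), d_model)
    return [
        [x + alpha * p for x, p in zip(x_row, p_row)]
        for x_row, p_row in zip(prenet_out, table)
    ]


frames = [[0.0] * 4 for _ in range(2)]  # two positions, model width 4
print(add_scaled_positions(frames, alpha=0.5)[0])
```

Using separate scale factors for the encoder and the decoder lets each side weight position information differently.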

5. **Training and Inference Process:**
<section>   
    <h4>Training:</h4>
    <ul>
        <li><strong>Teacher Forcing Technique:</strong> During training, the model implements teacher forcing by shifting the ground truth mel-spectrogram frames, ensuring that each frame serves as input to predict the subsequent frame.</li>
        <li><strong>Utilization of Transformer Architecture:</strong> The Transformer-based model employs multi-head self-attention mechanisms to effectively model relationships across the input and output sequences.</li>
        <li><strong>Dynamic Positional Encodings:</strong> Positional encodings with adjustable scale values encode temporal order information, enabling the model to handle sequential data efficiently.</li>
    </ul>
    
    <h4>Inference:</h4>
    <ul>
        <li><strong>Autoregressive Generation:</strong> During inference, the model generates mel-spectrogram frames iteratively, utilizing previous predictions to inform subsequent ones.</li>
        <li><strong>Stop Criterion:</strong> Generation continues until a stop token is predicted or a predefined maximum output length is reached, which keeps the synthesis process within defined boundaries.</li>
        <li><strong>Result Generation:</strong> Mel-spectrogram predictions are concatenated to produce the final mel spectrogram, which a vocoder then converts into the synthesized speech waveform.</li>
    </ul>
</section>
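The teacher-forcing shift and the autoregressive loop described above can be sketched as follows. This is a schematic under stated assumptions: `decoder_step` is a hypothetical callable standing in for the trained decoder, the start frame is assumed to be all zeros, and the threshold and maximum length are illustrative defaults rather than the project's settings.

```python
def shift_for_teacher_forcing(mel_frames, n_mels):
    """Decoder input: a zero start frame followed by all but the last target."""
    go_frame = [0.0] * n_mels
    return [go_frame] + [list(frame) for frame in mel_frames[:-1]]


def generate(decoder_step, n_mels, max_len=1000, stop_threshold=0.5):
    """Autoregressive loop.

    decoder_step(frames_so_far) must return (next_frame, stop_probability).
    """
    frames = [[0.0] * n_mels]  # start ("go") frame
    for _ in range(max_len):
        next_frame, stop_prob = decoder_step(frames)
        frames.append(next_frame)
        if stop_prob > stop_threshold:
            break  # the model signalled the end of the utterance
    return frames[1:]  # drop the start frame


def dummy_step(frames):
    """Stand-in decoder: emits a constant frame and stops on the third call."""
    n = len(frames)
    return [float(n)] * 2, (0.9 if n >= 3 else 0.1)


mel = generate(dummy_step, n_mels=2)
print(len(mel), shift_for_teacher_forcing(mel, n_mels=2))
```

During training the whole shifted sequence is fed at once, so every frame is predicted in parallel; at inference each frame has to wait for the previous one.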





## **Experimental Setup**
In this section, we detail the experimental setup of our model, including the loss functions, the optimizer initialization strategy, and a progress sampler for monitoring training. Training was done on an NVIDIA 3060 GPU with 6 GB of memory. The following components adapt ideas from several papers:
### Loss Functions

#### Mel Loss (Mean Squared Error)

The adoption of Mean Squared Error (MSE) loss for mel spectrogram prediction aligns with recent studies in speech synthesis, where MSE has demonstrated effectiveness in capturing fine-grained spectral details and minimizing reconstruction errors \[7\]. By prioritizing spectral fidelity through MSE loss, our model aims to produce mel spectrograms that closely resemble the target audio, facilitating natural and high-quality speech synthesis.

#### Stop Loss (Binary Cross-Entropy)

Our utilization of Binary Cross-Entropy (BCE) loss for stop token prediction is motivated by its efficacy in modeling binary classification tasks and sequence termination signals \[8\]. BCE loss enables our model to learn precise stop token predictions, crucial for accurately determining sequence lengths during inference and preventing unnecessary token generation. This aligns with recent advancements in autoregressive models for speech synthesis, emphasizing the importance of robust stop token prediction \[9\].

#### Weighted Loss

The adoption of a weighted sum approach to combine Mel loss and Stop loss reflects a nuanced understanding of the relative importance of each loss component in optimizing overall training objectives. Recent research has highlighted the significance of balancing multiple loss terms to achieve optimal model convergence and performance \[10\]. By assigning appropriate weights to Mel loss and Stop loss, our model optimizes training dynamics to effectively capture both spectral accuracy and sequence termination signals, yielding superior synthesis quality.
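A minimal sketch of how the three loss terms fit together is given below. The weights (`pos_weight`, `mel_weight`, `stop_weight`) are placeholder values chosen for illustration, since the text does not state the ones used in training, and the functions operate on plain lists rather than tensors.

```python
import math


def mel_mse(pred, target):
    """Mean squared error over all mel bins of all frames."""
    diffs = [
        (p - t) ** 2
        for p_row, t_row in zip(pred, target)
        for p, t in zip(p_row, t_row)
    ]
    return sum(diffs) / len(diffs)


def weighted_stop_bce(stop_probs, stop_targets, pos_weight=5.0):
    """Binary cross-entropy where stop (positive) frames count pos_weight times."""
    eps = 1e-7  # keeps log() away from zero
    total = 0.0
    for p, t in zip(stop_probs, stop_targets):
        p = min(max(p, eps), 1.0 - eps)
        total += -(pos_weight * t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
    return total / len(stop_probs)


def total_loss(mel_loss, stop_loss, mel_weight=1.0, stop_weight=1.0):
    """Weighted sum of the two loss terms."""
    return mel_weight * mel_loss + stop_weight * stop_loss


mel = mel_mse([[1.0, 2.0]], [[0.0, 0.0]])
stop = weighted_stop_bce([0.1, 0.8], [0.0, 1.0])
print(total_loss(mel, stop))
```

Up-weighting the positive class compensates for the fact that each utterance contains only one stop frame among many non-stop frames.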

### Optimizer Initialization

Our dynamic optimizer initialization strategy, transitioning from Adam to SGD after an initial training phase, draws inspiration from recent studies on optimizer selection for neural network training \[11\]. This adaptive approach leverages the strengths of both optimizers, capitalizing on Adam's rapid convergence in the early stages of training and SGD's robustness to local minima. By dynamically adapting the optimizer choice based on training progress, our model optimizes learning dynamics and enhances generalization performance, as validated in recent works \[12\].

### Progress Sampler

The integration of a progress sampler mechanism, inspired by FastSpeech 2 \[13\], reflects a commitment to monitoring training progress and understanding model behavior across epochs. Recent research has emphasized the importance of progress monitoring in training deep learning models, facilitating insights into learning dynamics and convergence behavior \[14\]. By periodically generating output samples at different training intervals, the model provides valuable insights into its evolving capabilities and learning focus, aiding in model diagnosis and performance optimization.

The generated mel spectrograms are further processed to obtain high-fidelity audio waveforms using the HIFIGAN vocoder available in SpeechBrain. This choice diverges from the original paper's use of WaveNet for waveform synthesis but aligns with recent advancements in vocoder technology, where models like HIFIGAN have shown competitive performance in terms of audio quality and computational efficiency.



## **Experimental Results**
Despite encountering significant challenges in achieving the intended outcomes, this project has yielded crucial insights into the potential and limitations of Transformer-based TTS systems. This section will discuss the results, drawing lessons from the hurdles faced and providing a pathway for future research.

**Learning rate and Scheduler:**

* In the course of optimizing our text-to-speech (TTS) model, we experimented with different learning rate schedulers to understand their effect on model performance. The initial attempt utilized a Linear Scheduler, the results of which are depicted in Figure 5. As can be observed, the training and validation loss decreased over epochs but did not exhibit the level of improvement we aimed for.

* Seeking to enhance the model's learning efficiency, we subsequently implemented the Noam Scheduler, a strategy proposed in the original Transformer model paper. Unlike the Linear Scheduler, which reduces the learning rate uniformly, the Noam Scheduler adjusts the learning rate based on the current training step and the number of warm-up steps, making it dynamic and adaptive.

* The Noam Scheduler showed a significantly better performance. It effectively managed the learning rate, allowing the model to converge faster and more stably to a lower loss, as reflected in our experimental results. The improved outcome with the Noam Scheduler highlights its superior capability in handling the complex learning dynamics of the Transformer model, as compared to the Linear Scheduler.
* The Noam Scheduler is favored for Transformers because, as stated in [5], it starts with a warm-up phase that increases the learning rate, promoting stability in early training. It then decreases the rate in proportion to the inverse square root of the step count, allowing precise weight adjustments as the model converges. This method balances the need for large early updates with fine-tuning later on, which is critical for the Transformer's complex learning patterns. The model was trained with an initial lr = 0.0001 and n_warmup_steps = 4000.
<center>
<table>
  <tr>
    <td>
      <img src="https://drive.google.com/uc?export=view&id=10vE_ja2qAie8aidflraFsLtl20qr8cw4" width="300"/>
      <figcaption>Figure 5: With Linear Scheduler</figcaption>
    </td>
    <td>
      <img src="https://drive.google.com/uc?export=view&id=1IydQigLC5opP_Y3mRRikG33h_IBbbYc8" alt="img" width="300"/>
      <figcaption>Figure 6: With Noam Scheduler</figcaption>
    </td>
  </tr>
</table>
</center><br>
<center>
<img src="https://drive.google.com/uc?export=view&id=1rwX1eAPR7_7vG1-XAfiyIP7DiKz1Abmz" width="300"/>
<figcaption>Figure 7: Learning rates versus epochs</figcaption>
</center>
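The warm-up and decay behaviour of the Noam schedule can be written as a short function. The sketch below uses the learning rate and warm-up values quoted above. It normalizes the schedule so that the peak equals the initial learning rate, which is an assumption about the implementation; the original Transformer paper scales by the model dimension instead.

```python
def noam_lr(step, lr_initial=1e-4, n_warmup_steps=4000):
    """Learning rate at a given optimizer step (step >= 1).

    Rises linearly during warm-up, peaks at lr_initial when
    step == n_warmup_steps, then decays with the inverse square root of step.
    """
    normalize = n_warmup_steps ** 0.5
    return lr_initial * normalize * min(step ** -0.5, step * n_warmup_steps ** -1.5)


for step in (1000, 4000, 16000):
    print(step, noam_lr(step))
```

A quarter of the way through warm-up the rate is a quarter of its peak, and at four times the warm-up length it has fallen to half of the peak.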

**Scale values for both encoder and decoder with batch size 16 nhead 8:**

<center>
<table>
  <tr>
    <td>
      <img src="https://drive.google.com/uc?export=view&id=1cVSQ7QnUdEj9qYFun-8w8WQAnBt1L3MB" alt="Selfsupervised XVector Train and Valid Loss" width="300"/>
      <figcaption>Figure 8: Transformer TTS on LJSpeech Dataset</figcaption>
    </td>
    <td>
      <img src="https://drive.google.com/uc?export=view&id=10Z4wqEMy_08BMcLNDOB0tjbpeLIPj08U" alt="img" width="300"/>
      <figcaption>Figure 9: Implementation Paper [1]</figcaption>
    </td>
  </tr>
</table>
</center><br>
The small difference observed in the scale values of the scaled positional encoding between my model and the implementation discussed in the paper has several possible causes. It underscores the sensitivity of Transformer models to the characteristics of the dataset and the specificity of the learning task. Additionally, it highlights the pivotal role played by hyperparameter choices and model architecture in guiding the learning process. These findings therefore prompt a careful review of the preprocessing steps, model configurations, and training procedures to ensure learning conditions suited to our dataset's attributes.
<br><br>

**Batch size 8 with nhead 4:**
<center>
      <img src="https://drive.google.com/uc?export=view&id=175LttzIv0pvlkcktyXatYWjdFOGaVB21"  />
      <figcaption>Figure 10: Train and Valid Loss</figcaption>
</center>

**Batch size 8 with nhead 8:**
<center>
      <img src="https://drive.google.com/uc?export=view&id=1PL96_v_K-WUUnFEftBUk16tYgPrjAdjK"  />
      <figcaption>Figure 11: Train and Valid Loss</figcaption>
</center>

**Batch size 16 with nhead 4:**
<center>
      <img src="https://drive.google.com/uc?export=view&id=1xECIhNeR6QxZq3SL-J9yM9OUAR67ClaQ"  />
      <figcaption>Figure 12: Train and Valid Loss</figcaption>
</center>

**Batch size 16 with nhead 8:**
<center>
      <img src="https://drive.google.com/uc?export=view&id=1TgNbsUaPsD-nF0l39a2Q5FWwvdFijB_x"  />
      <figcaption>Figure 13: Train and Valid Loss</figcaption>
</center>

Experiments performed (40 epochs of training, warm_up_steps = 4000):
<table>
  <tr>
    <th>Experiment Name</th>
    <th>Mel Error</th>
    <th>Stop Error</th>
    <th>Train Total Loss</th>
    <th>Valid Total Loss</th>
    <th> Training time</th>
  </tr>
  <tr>
    <td>Batch size 8 with nhead 4</td>
    <td>1.84e-02</td>
    <td>2.57e-03</td>
    <td>1.02e-02</td>
    <td>2.24e-02</td>
    <td>~1 day</td>
  </tr>
  <tr>
    <td>Batch 8 with nhead 8</td>
    <td>1.63e-02</td>
    <td>2.31e-04</td>
    <td>5.04e-03</td>
    <td>1.66e-02</td>
    <td>~1.5 days</td>
  </tr>
  <tr>
    <td>Batch 16 with nhead 4</td>
    <td>8.27e-03</td>
    <td>3.85e-04</td>
    <td>7.57e-03</td>
    <td>9.01e-03</td>
    <td>~2 days</td>
  </tr>
  <tr>
    <td>Batch 16 with nhead 8</td>
    <td>1.06e-02</td>
    <td>9.72e-04</td>
    <td>1.00e-02</td>
    <td>1.16e-02</td>
    <td>~2.5 days</td>
  </tr>
  <tr>
    <td>Batch 16 with eos and bos</td>
    <td>7.92e-02</td>
    <td>6.93e-01</td>
    <td>6.96e-01</td>
    <td>7.79e-01</td>
    <td>~2 days</td>
  </tr>
</table>
Note: A batch size of 32 was also tried, but training ran out of memory.
<h2>Justification and Analysis of Results:</h2>
<ol>
  <li><strong>Batch size 8 with nhead 4:</strong>
    <p><strong>Observation:</strong> This configuration showed a moderate Mel Error and Stop Error.</p>
    <p><strong>Justification:</strong> Smaller batch sizes can lead to noisier gradient estimates, which might prevent the model from fully optimizing the loss landscape. The number of heads being limited to 4 may not have captured the complexity in the data adequately, impacting the model's ability to learn diverse features in the data.</p>
    <p><strong>Potential Issue:</strong> Insufficient model complexity and gradient noise due to small batch size.</p>
  </li>
  <li><strong>Batch 8 with nhead 8:</strong>
    <p><strong>Observation:</strong> Mel Error improved compared to nhead 4, and Stop Error decreased notably, by roughly an order of magnitude.</p>
    <p><strong>Justification:</strong> Increasing the number of attention heads allows the model to focus on different parts of the input sequence, potentially improving learning outcomes. However, the training time increased due to more complex computations.</p>
    <p><strong>Potential Issue:</strong> Overfitting might occur with increased model complexity without corresponding increases in data diversity or volume.</p>
  </li>
  <li><strong>Batch 16 with nhead 4:</strong>
    <p><strong>Observation:</strong> Lower errors and total losses compared to smaller batches with the same number of heads.</p>
    <p><strong>Justification:</strong> Doubling the batch size helps in stabilizing the training process by providing a more accurate estimate of the gradient. This also allowed for a smoother and slightly faster convergence.</p>
    <p><strong>Potential Issue:</strong> Increased training time due to larger batch processing, though the benefits in performance might justify the extra time.</p>
  </li>
  <li><strong>Batch 16 with nhead 8:</strong>
    <p><strong>Observation:</strong> Slightly worse errors and total losses than Batch 16 with nhead 4, but a lower Mel Error and validation loss than either batch size 8 run.</p>
    <p><strong>Justification:</strong> The increase in both batch size and number of heads provides a balance between model complexity and training stability. However, diminishing returns are observed possibly due to the interaction between batch size and the number of heads not being optimal.</p>
    <p><strong>Potential Issue:</strong> Computational complexity increases significantly, which might not be optimal for the available dataset size or diversity.</p>
  </li>
  <li><strong>Batch 16 with eos and bos:</strong>
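    <p><strong>Observation:</strong> Significantly higher errors across the board. The Stop Error of 6.93e-01 is close to ln 2 ≈ 0.693, the value an unweighted binary cross-entropy takes when the predictor outputs a probability of 0.5 for every frame, which may indicate that the stop predictor learned little in this run.</p>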
    <p><strong>Justification:</strong> A plausible explanation for the elevated errors is that the model may not have learned to utilize these tokens effectively. This could result from inadequate representation of the tokens in the embedding space or insufficient training emphasis on their role in delineating sequence boundaries. Another potential issue could be that the eos and bos tokens might have introduced an element of noise or confusion in the model, especially if the tokens' occurrences in the training data did not accurately reflect their intended use in the data domain. Moreover, the positioning and frequency of these tokens can dramatically influence the learning process. If the model's architecture, specifically the attention mechanisms, does not align well with the placement of these tokens, the model's ability to generate coherent sequences can be compromised.</p>
    <p><strong>Potential Issue:</strong> Misintegration of stop token handling in the model architecture or training data preprocessing, leading to poor model performance.</p>
  </li>
</ol>








### **Running Inference**
To run model inference, download the inference directory as shown in the cell below.

Note: Run on T4-GPU for faster inference

In [None]:
!pip install --upgrade --no-cache-dir gdown
!gdown 1oy8Y5zwkLel7diA63GNCD-6cfoBV4tq7
!unzip inference.zip

Downloading...
From (original): https://drive.google.com/uc?id=1oy8Y5zwkLel7diA63GNCD-6cfoBV4tq7
From (redirected): https://drive.google.com/uc?id=1oy8Y5zwkLel7diA63GNCD-6cfoBV4tq7&confirm=t&uuid=92ca7d72-b633-45f6-b9c6-c6217f662c36
To: /content/inference.zip
100% 212M/212M [00:02<00:00, 106MB/s]
Archive:  inference.zip
  inflating: inference/hyperparams.yaml  
  inflating: inference/label_encoder.txt  
  inflating: inference/model.ckpt    
  inflating: inference/models.py     
  inflating: inference/TTSModel.py   


In [None]:
%%capture
!pip install speechbrain

In [None]:
%cd inference

/content/inference


In [None]:
import torchaudio
from TTSModel import TTSModel
from IPython.display import Audio
from speechbrain.inference.vocoders import HIFIGAN

texts = ["This is a sample text for synthesis."]

model_source_path = "/content/inference"
# Initialize TTS (Transformer) and Vocoder (HiFi-GAN)
my_tts_model = TTSModel.from_hparams(source=model_source_path)
hifi_gan = HIFIGAN.from_hparams(source="speechbrain/tts-hifigan-ljspeech", savedir="tmpdir_vocoder")

# Running the TTS
mel_output = my_tts_model.encode_text(texts)

# Running Vocoder (spectrogram-to-waveform)
waveforms = hifi_gan.decode_batch(mel_output)

# Save the waveform (LJSpeech audio is sampled at 22,050 Hz)
torchaudio.save('example_TTS.wav', waveforms.squeeze(1), 22050)
print("Saved the audio file!")

Saved the audio file!


Note: To train the model, see the [Project Code for Training and Inference](https://colab.research.google.com/drive/1VYu4kXdgpv7f742QGquA1G4ipD2Kg0kT?usp=sharing) notebook.


The inference API is available on Hugging Face at [https://huggingface.co/Krisshvamsi/TTS](https://huggingface.co/Krisshvamsi/TTS).

## **Conclusion**

Although the model did not fully reach the target benchmarks, the project yielded several concrete insights. The areas below could benefit from improvement, each with actionable next steps:

1. **Special Token Integration:** The model struggled to effectively utilize 'end-of-sequence' (eos) and 'beginning-of-sequence' (bos) tokens, as indicated by higher error rates. These tokens are critical for demarcating the start and end of utterances. A potential improvement could involve a more refined pre-training phase where the model learns to associate these tokens with speech patterns in a controlled context before being introduced to the full complexity of TTS. Another approach could be to revisit the loss function to ensure that it appropriately penalizes errors related to these tokens, thereby giving the model a stronger signal to learn from.
2. **Learning Rate Optimization:** While the Noam scheduler offered improvements over a linear schedule, optimizers such as RAdam or Lookahead could provide additional benefits. RAdam rectifies the variance of the adaptive learning rate early in training, and Lookahead stabilizes updates by interpolating between fast and slow weights. Either could help with the steep loss landscape that is typical of TTS tasks.
3. **Transformer Architecture Adjustments:** The Transformer model, composed of multi-head attention and fully connected layers, might need its depth and width reevaluated. Deeper networks could model the complex relationships in speech more effectively, but they also risk overfitting. Alternative attention mechanisms, such as local or sparse attention, could help the model focus on the relevant parts of the input sequence rather than attending over the full sequence length.
4. **Exploration of Non-Autoregressive Models:** Autoregressive decoding generates one frame at a time, which inherently slows inference, so non-autoregressive models are a natural direction for faster synthesis. Models like FastSpeech and its successors, which generate all mel-spectrogram frames in parallel, offer a promising option. They also mitigate error propagation, where a mistake at one autoregressive step affects every subsequent step.
5. **Dataset Expansion and Variation:** The LJSpeech dataset, while robust and consistent, contains speech from a single speaker. For the model to learn a more general representation of speech, training on a diverse set of voices and styles is crucial. Introducing a multi-speaker dataset or augmenting the existing dataset with noise, reverberation, and different accents could expose the model to a wider variety of speech patterns, improving its ability to generalize and perform well on unseen data.
6. **Hyperparameter Tuning:** The model's performance could greatly benefit from a more exhaustive hyperparameter search. Employing automated hyperparameter optimization techniques such as grid search, random search, or Bayesian optimization could systematically explore the hyperparameter space to find a more optimal set of parameters.
7. **In-Depth Evaluation:** Current evaluation metrics focus primarily on mel-spectrogram fidelity. For a more thorough assessment, perceptual metrics such as the Mean Opinion Score (MOS), which capture how listeners perceive audio quality, could be used. Additionally, objective measures like character error rate (CER) and word error rate (WER), computed by transcribing the synthesized speech with a speech recognizer and comparing the transcript against the input text, could provide insight into the intelligibility and accuracy of the generated audio.
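
The WER and CER measures mentioned in item 7 can be sketched as follows. This is a minimal illustration under stated assumptions: the hypothesis string is assumed to come from a speech recognizer run on the synthesized audio (not shown here), the function names `edit_distance`, `wer`, and `cer` are hypothetical helpers, and normalization is limited to lowercasing and whitespace splitting.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.lower().split()
    hyp_words = hypothesis.lower().split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hyp_words) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edits divided by reference length."""
    ref_chars = list(reference.lower())
    hyp_chars = list(hypothesis.lower())
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


# The input text versus a made-up transcript of the synthesized audio.
reference = "This is a sample text"
hypothesis = "this is sample test"
print(wer(reference, hypothesis), cer(reference, hypothesis))
```

Because both metrics depend on the recognizer as well as the TTS model, they are most useful for comparing systems under the same recognizer rather than as absolute scores.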

This project has laid the groundwork for future exploration within the TTS field. Each challenge has brought a deeper understanding of the interplay between data, model architecture, and training strategy. Future work will apply these lessons toward a system that not only mimics human speech but does so with the fluency and naturalness of human communication.



## **References**

<ol>
    <li><a href="https://arxiv.org/pdf/1809.08895.pdf">Neural Speech Synthesis with Transformer Network</a></li>
    <li><a href="https://arxiv.org/abs/2401.02839">Pheme: Efficient and Conversational Speech Generation</a></li>
    <li><a href="https://arxiv.org/abs/2308.06873">SpeechX: Neural Codec Language Model as a Versatile Speech Transformer</a></li>
    <li><a href="https://www.microsoft.com/en-us/research/blog/fastspeech-new-text-to-speech-model-improves-on-speed-accuracy-and-controllability/">FastSpeech: New text-to-speech model improves on speed, accuracy, and controllability</a></li>
    <li><a href="https://dev.botpenguin.com/transformer-models-in-deep-learning-latest-advancements/">Transformer Models in Deep Learning: Latest Advancements</a></li>
    <li><a href="https://keithito.com/LJ-Speech-Dataset/">LJ Speech Dataset</a></li>
    <li>Zhang, Y., & Wang, Z. (2020). <a href="https://arxiv.org/abs/2012.00759">Exploring Loss Function Selection in Spectrogram Prediction for End-to-End Text-to-Speech Synthesis</a>. <em>arXiv preprint arXiv:2012.00759</em>.</li>
    <li>Shen, J., et al. (2018). <a href="https://arxiv.org/abs/1712.05884">Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions</a>. <em>arXiv preprint arXiv:1712.05884</em>.</li>
    <li>Luo, H., et al. (2021). <a href="https://arxiv.org/abs/2104.03259">TTS Synthesis With Non-Autoregressive Transformer</a>. <em>arXiv preprint arXiv:2104.03259</em>.</li>
    <li>Burda, Y., Grosse, R., & Salakhutdinov, R. (2015). <a href="https://arxiv.org/abs/1509.00519">Importance Weighted Autoencoders</a>. <em>arXiv preprint arXiv:1509.00519</em>.</li>
    <li>Loshchilov, I., & Hutter, F. (2019). <a href="https://arxiv.org/abs/1711.05101">Decoupled Weight Decay Regularization</a>. <em>arXiv preprint arXiv:1711.05101</em>.</li>
    <li>Smith, S. L., et al. (2017). <a href="https://arxiv.org/abs/1711.00489">Don't Decay the Learning Rate, Increase the Batch Size</a>. <em>arXiv preprint arXiv:1711.00489</em>.</li>
    <li>Ren, Y., et al. (2020). <a href="https://arxiv.org/abs/2006.04558">FastSpeech 2: Fast and High-Quality End-to-End Text to Speech</a>. <em>arXiv preprint arXiv:2006.04558</em>.</li>
    <li>Hoffer, E., et al. (2017). Train longer, generalize better: closing the generalization gap in large batch training of neural networks. <em>Advances in Neural Information Processing Systems</em>, 30.</li>
</ol>


