# Text-To-Speech project

>**HomeWork №3**
>
>Elective course: Deep Learning in Audio processing
>
>Accomplished by Borevsky Andrey

-----

## I. Experiments

### A. First steps

**Idea**

The core concept here was to evolve gradually from the FastSpeech (FS) 1 implementation given at the seminar to the required FS2. After copying the given notebook and running the bash code, we arrived at the first part of the project. As proposed, we progressively modified **1)** the multi-head attention mechanism in terms of shape transformations, drawing on an auxiliary [source](https://github.com/karpathy/minGPT/blob/7218bcfa527c65f164de791099de715b81a95106/mingpt/model.py); **2)** the positions of the layer norms (LN), shifting the entire algorithm from classical Post-LN to Pre-LN; **3)** the LN itself: from the usual `nn.LayerNorm` to the recently proposed `ScaleNorm`, written as a custom module whose implementation is identical to the [authors' one](https://arxiv.org/pdf/1910.05895.pdf); **4)** the PostNet structure, borrowed from the FS1 implementation downloaded via the bash code; **5)** FixNorm, from the same paper in which we found ScaleNorm.

Next, we planned to combine all the suggested approaches step by step on a small portion of the given data, running only the first several thousand iterations. The comparison was to be made on iterations per second and mel_loss as the main indicators of the model's quality. Based on the results, we wished further to implement the most useful concepts as an optimization of FS2's transformer part.

**Results**

<center><img src="images/Exp1_All.png" width="1500" height="400">

*Graph A.1 Result for duration & mel losses during the first experiment on FS1 with the low-scale data*

<img src="images/Exp1_Compare.png" width="500" height="400">

*Graph A.2 Comparison of the results in the long-run*</center>

> - `fs1` for FastSpeech model
> - `sc` for ScaleNorm
> - `pre` for PreLN
> - `fix` for FixNorm
> - `w_norm` for weight normalization (from the mentioned [article](https://arxiv.org/pdf/1910.05895.pdf))

**Inference**

After carefully investigating each available idea for speeding up the future model, we formulated a shortlist. First of all, pre-LN showed marvellous results in terms of convergence. At this stage, however, we had not chosen the most efficient embodiment of this idea. The pre-LN structure after the experiment was as follows:
```python
# self.FFTBlock: input -> nn.Linear(x) = q,k,v -> self.ScaleNorm(d_model ** 0.5) -> out=(self.ScaledDotProductAttention + q) -> self.ScaleNorm(d_in ** 0.5) -> self.FeedForward + out
```

Later we will return to this point and show its final appearance. Another point to underscore is the positive role of ScaleNorm: although it did not significantly improve performance compared to the standard LN, we decided to include it in the future model. In contrast, introducing FixNorm and/or the adjusted weight normalization (a faithful implementation of the [paper](https://arxiv.org/pdf/1910.05895.pdf)) had zero or even negative effect. Hence, they were discarded for good. Nevertheless, it is important to understand that the entire experiment was based on ~15% of all available data, so the convergence of all plots to identical values after several iterations might not carry over to serious runs.
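For reference, the two ideas kept from this experiment fit in a few lines. The sketch below uses plain Python lists instead of the project's PyTorch modules, so it only illustrates the arithmetic. `scale_norm` follows the definition g · x / ‖x‖ from the cited ScaleNorm paper (the project initialises g with `d_model ** 0.5`, as the scheme above shows). `pre_ln_residual` and `post_ln_residual` are hypothetical helper names that show only where the normalisation sits relative to the skip connection.

```python
import math


def scale_norm(x, g, eps=1e-5):
    """ScaleNorm: rescale vector x so that its L2 norm equals g."""
    norm = max(math.sqrt(sum(v * v for v in x)), eps)  # eps guards against division by zero
    return [g * v / norm for v in x]


def pre_ln_residual(x, sublayer, g):
    """Pre-LN: normalise the input, apply the sublayer, then add the skip connection."""
    y = sublayer(scale_norm(x, g))
    return [xi + yi for xi, yi in zip(x, y)]


def post_ln_residual(x, sublayer, g):
    """Post-LN: apply the sublayer, add the skip connection, then normalise the sum."""
    y = sublayer(x)
    return scale_norm([xi + yi for xi, yi in zip(x, y)], g)


def double(v):
    """Stand-in for an attention or feed-forward sublayer."""
    return [2.0 * t for t in v]


print(pre_ln_residual([3.0, 4.0], double, g=1.0))
print(post_ln_residual([3.0, 4.0], double, g=1.0))
```

In the pre-LN variant the skip path carries the unnormalised input straight through, which is the property usually credited for the faster convergence observed above.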

A few words on PostNet. While it seems to be quite common in TTS, its direct benefit remained unclear to us. Since it also increases the computation time, the decision to remove it was categorical. Nevertheless, in the long run it could bring improvements compared to the no-PostNet case.

### B. First blood

**Idea**

We finally approached the FS2 scheme itself. All the challenges are discussed in the respective section; here we mention the key milestones. Since we found out that the CWT implementation for pitch extraction, required in the second version of the article, is not obligatory, we decided to make a symbiosis of the two texts. Pitch is extracted from the corresponding wave with the help of the `pyworld` library. We then interpolate and normalize the contour, receiving almost the same appearance as proposed in the second version. By the way, zero points of the contour were both extrapolated, based on the surrounding values, and filled by boundary values, which makes the entire experiment more precise.

What happens with energy is far more interesting. As a reminder, the authors suggest extracting it as the L2 norm of the spectrogram. However, neither spectrograms nor methods for creating them were provided: all we have are the waves themselves and the resulting melspecs. Therefore, we had two paths. The first is to build our own scheme for creating a spectrogram from the wave, searching for suitable hyperparameters (so that no deviations from `mel_target` take place). The second, made possible by the functions provided, is to apply a slightly modified `tools.inv_mel_spec`, exiting it right after the spectrogram is created. We selected the latter as the more appropriate and easy-going path. Sure, some might say that the quality of this operation is lower than that of the first algorithm. Such a notion would be bolstered by comparing the true audio with the one created by this auxiliary function (the latter will be full of noise). Hence, we decided to verify the quality of the energy acquired via the second path.

We accomplished this by plotting the following graph, consisting of the correct `mel_target` and our `energy` (the function was taken from [here](https://github.com/ming024/FastSpeech2/blob/d4e79eb52e8b01d24703b2dfc0385544092958f3/utils/tools.py#L213)):

<center><img src="images/spectro.png" width="500" height="400"></center>

So, we can see that the energy accurately follows the overall shape of the true melspec.
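The energy definition itself (the L2 norm of every spectrogram frame) is short enough to state directly. The following is a minimal sketch on plain Python lists. The function name `frame_energy` and the frames-first layout are assumptions made for the illustration, and the spectrogram that the project obtains from the modified `tools.inv_mel_spec` is replaced by toy numbers.

```python
import math


def frame_energy(spectrogram):
    """Return the L2 norm of every frame of a magnitude spectrogram.

    `spectrogram` is a list of frames; each frame is a list of
    magnitudes, one per frequency bin.
    """
    return [math.sqrt(sum(m * m for m in frame)) for frame in spectrogram]


# Three toy frames with two frequency bins each.
toy_spectrogram = [[3.0, 4.0], [0.0, 0.0], [1.0, 0.0]]
print(frame_energy(toy_spectrogram))  # [5.0, 0.0, 1.0]
```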

With extraction discussed, we can move on. After the procedures above, the objects are compressed, and at the end of the entire processing algorithm they are normalized. Compression requires a short introduction. Initially we planned to work in the manner suggested by the authors:
```python
nn.DurationPredictor -> nn.LengthRegulator -> nn.PitchPredictor -> nn.EnergyPredictor
```
However, we soon found out that, since we use some additional operations after the LR and before the decoder, specific masks would probably be needed. Having no clear understanding of what they should look like, we decided to swap the first two and the last two stages. Should the faithfulness to the authors' article be questioned, we can assure that experiments with the original order above took place on the low scale and were quite successful. Nevertheless, for the sake of the cleanest algorithm construction, we introduced this minor change. At some moment, though, we realised that the pitch and energy targets no longer match the encoder output in shape. This is logical, since it is the LR that expands the input tensor to the required shape of the targets. Thankfully, the entire expansion is driven by the duration tensor, which states how long each sound lasts. Accordingly, we use a special `compress` function, which converts energy and pitch to the input shape by taking the mean of the values on each segment.
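The idea behind `compress` can be sketched as follows. This is an illustrative plain-Python version, not the project's actual function. It assumes `durations[i]` holds the number of frames of the i-th input symbol and, as a further assumption, assigns 0.0 to symbols with zero duration.

```python
def compress(values, durations):
    """Average frame-level values over each segment given by `durations`.

    The result has one value per input symbol, so it matches the
    encoder output length instead of the mel length.
    """
    compressed = []
    start = 0
    for d in durations:
        segment = values[start:start + d]
        # Zero-length segments have nothing to average (assumption: use 0.0).
        compressed.append(sum(segment) / len(segment) if segment else 0.0)
        start += d
    return compressed


frame_pitch = [1.0, 3.0, 5.0, 5.0, 5.0, 2.0]  # six mel frames
durations = [2, 3, 0, 1]                      # four input symbols
print(compress(frame_pitch, durations))       # [2.0, 5.0, 0.0, 2.0]
```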

One way or another, at this point we have extracted, compressed, saved and normalized the energy and pitch targets. Next we add the `VarianceAdaptor` class to our project universe. It consecutively applies distinct `Predictor` instances (formerly the `DurationPredictor` class) to acquire energy and pitch predictions. Then it takes either the true target, if provided, or the scaled outputs, and runs them through `torch.quantize_per_tensor` with `quint8`, which has 256 available values. To be honest, this stage was a real challenge at some point, and this particular experiment was started with a wrong implementation that made almost half of the values equal to each other. We will show its final form later. Of course, after being quantized, the tensors were turned into embeddings, and their sum with the input was projected and returned.

Speaking of FS2 structure as a whole, it looks quite straightforward and is really close to the article's text. 

**Results**

<center><img src="images/Exp2_LowScale.png" width="1300" height="400">

*Graph B.1 Results for losses during the second experiment on initial FS2 models with the low-scale data*

<img src="images/Exp2_Full.png" width="1300" height="400">

*Graph B.2 Best model performance in the long-run on full dataset*</center>

**Inference**

Unfortunately, at this stage of the project we faced numerous tiny mistakes, which together worsened the model's performance. The pre-LN was implemented incorrectly, the quantization formula gave a pretty bad distribution of values, the inference mode in the LR was implemented with several errors, etc. Up to this moment, however, we had been working with a comparatively small number of iterations. Hence, finding flaws in the code and improving our implementation were the key goals of this phase, and they were successfully accomplished. The last step was running a complete model on the full training loader to get intermediate results for further comparison. While the results seem quite nice, there were several problematic areas, such as convergence of the losses to a high asymptote, rapid speech during inference, and a high rate of pronunciation errors.

### C. Big Man Facilities (night_1)

**Idea**

After the long path described above, we finally started the true experiments with the entire available dataset. After careful exploration, quantisation was rewritten in the following manner:
```python
torch.quantize_per_tensor(energy_target + abs(data_config.energy_min), scale=self.energy_scale, zero_point=self.energy_mean, dtype=torch.quint8).int_repr().long()
```
where `energy_scale` is the sum of the absolute maximum and minimum values divided by 256 (the number of available values), and `energy_mean` equals the mean value over the entire dataset. This formula proved the most effective at distributing energy/pitch values across `quint8` (according to the respective plots). At the same time, the corrected pre-LN changed its appearance and now looks closer to the original one from the article:

<center><img src="images/post_preLN.png" width="500" height="400"></center>

The issue with the LR turned out to be mainly tied to taking `log(duration_predictor_output)`. While we correctly passed the logarithm of the duration target to the loss function and saw positive dynamics of the duration loss, we had made a small mistake in the LR inference implementation. After this error was removed (and special hyperparameters for controlling pitch/energy values were added), we found ourselves ready for training. Of course, there were numerous other alterations throughout the project; however, the key points of the training configuration and of the FS2 structure as a whole stayed the same.
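To make the log-duration point concrete, here is a hedged sketch of LR inference in plain Python. It is not the project's code: the helper names, the `alpha` speed factor and especially the `offset` of 1 (i.e. a predictor trained on log(duration + 1)) are assumptions. Set `offset=0.0` if the target is the plain log(duration). The essential step is that the prediction has to be mapped back with `exp` before it can be used as a repeat count.

```python
import math


def durations_from_log(log_durations, alpha=1.0, offset=1.0):
    """Turn predicted log-durations into non-negative integer frame counts.

    Assumes training on log(duration + offset); `alpha` scales the
    resulting durations and therefore the overall speech length.
    """
    return [max(round((math.exp(ld) - offset) * alpha), 0) for ld in log_durations]


def length_regulate(hidden, durations):
    """Repeat every encoder vector `durations[i]` times along the time axis."""
    expanded = []
    for vector, d in zip(hidden, durations):
        expanded.extend([vector] * d)
    return expanded


predicted = [math.log(3.0), math.log(1.0), math.log(2.0)]
durations = durations_from_log(predicted)
print(durations)                                    # [2, 0, 1]
print(length_regulate(["a", "b", "c"], durations))  # ['a', 'a', 'c']
```

Feeding the raw log values into the repeat step would give far too few frames per symbol. That is one plausible way to arrive at the rapid speech heard in the earlier runs, although the text does not say what the actual mistake was.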

**Results**

<center><img src="images/Exp3.png" width="1300" height="400">

*Graph C.1 Results for losses during the third experiment on ameliorated FS2 model*</center>

**Inference**

We finally achieved results sufficient to be published as the project's benchmark. The speaker's speed was far more pleasant (owing to the fixed LR inference), energy and pitch kept improving, showing better values of the corresponding losses, and the primary mel_loss converged faster to lower values, mostly because of the altered pre-LN. Slight modifications of the synthesis function allowed us to check the test sentences. Another positive point to underscore is that worthy results were achieved without a strong drop in the `steps_per_sec` parameter. It is also vital to mention that during training we used `tools.inv_mel_spec` to extract audio from the model's output (because of implementation peculiarities). Hence, the quality of the audio logged in the W&B project and of the audio produced with the help of WaveGlow differ significantly.

### D. Finita la Comedia (night_2)

**Idea**

The last jump of this work is quite contradictory. While we already had sufficient quality to pass the project, it seemed quite interesting to play a bit with the settings. At this point we increased the number of decoder layers, moved to log-scale pitch quantisation, used the boundary fill value instead of extrapolation for the pitch contour, and added the PostNet, similar to the FS [implementation](https://github.com/xcmyz/FastSpeech/).

What is of paramount importance for reproducing this model is the backpropagation schedule. At some points (clearly seen on the graphs) we turned off some losses (~76 for postnet_loss, ~91k for energy & pitch). This was done for two reasons: to optimise the learning process and to stop conflicting gradients. The latter needs some more explanation. During the previous experiment we faced a phenomenon where the energy/pitch losses outweighed the mel loss, which made training static (nothing really changes after a certain number of iterations). Hence, we turn off backpropagation first for the PostNet, as a mechanism that requires considerable time to backprop, and later for energy/pitch. This method allowed us to quickly reach a plateau for energy/pitch and then freeze them, so that only mel_loss and duration_loss keep decreasing. While we lose some quality in these two new auxiliary dimensions, we hope it will pay off during inference.
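The schedule can be summarised as a function of the training step. The sketch below is only an illustration under stated assumptions: the thresholds are the approximate figures from the text (reading "~76" as 76k steps is a guess), `pitch_loss` and `energy_loss` are hypothetical names, and the real training loop may switch the terms off differently.

```python
def active_losses(step, postnet_off_step=76_000, variance_off_step=91_000):
    """Return the names of the losses that still contribute to backprop at `step`."""
    names = ["mel_loss", "duration_loss"]      # optimised for the whole run
    if step < postnet_off_step:
        names.append("postnet_loss")           # switched off first
    if step < variance_off_step:
        names += ["pitch_loss", "energy_loss"]  # frozen later
    return names


def total_loss(losses, step):
    """Sum only the active terms; inactive ones can still be logged separately."""
    return sum(losses[name] for name in active_losses(step))


example = {"mel_loss": 1.0, "duration_loss": 0.5, "postnet_loss": 2.0,
           "pitch_loss": 4.0, "energy_loss": 8.0}
for step in (0, 80_000, 100_000):
    print(step, active_losses(step), total_loss(example, step))
```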

**Results**

<center><img src="images/Exp4.png" width="1300" height="400">

*Graph D.1 Results for losses during the fourth experiment on ameliorated FS2+Postnet*</center>

**Inference**

Well, it is quite controversial which experiment did better in terms of inferred audio quality. Therefore, as a real man, a breadwinner in some sense, we decided not to make any decisions and to provide both models and audio recordings, so you might choose the better one (for the sake of MOS, of course!). Actually, there is a slight difference in favor of one of them, but we'd prefer to leave it as it is. Please tell us if only one model is required to pass: we will select one.

Speaking of the presented graphs, our technique gave the expected result: the drop in mel_loss surpassed that of the previous experiment, while the plateau of the energy/pitch losses turned out to be higher.

    from IPython import display
    print('Third experiment')
    display.Audio("results/big_man/s=1.0_2_waveglow.wav")

    display.Audio("results/big_man/s=1.0_1_waveglow.wav")

    print('Last experiment')
    display.Audio("results/comedia/s=1.0_2_waveglow.wav")

    display.Audio("results/comedia/s=1.0_1_waveglow.wav")

## II. Running scheme

**Model_1**: 

    1. pitch (pw extract, interpolate+extrapolate, compress, normalize, linear quantise)
    2. decoder_layers = 4
    3. postnet = False
    4. path: "model_new/checkpoints/experiment3.pth.tar"  - don't forget to change the config!
    
**Model_2** (recommended): 

    1. pitch (pw extract, interpolate+boundary, compress, normalize, log quantise)
    2. decoder_layers = 5
    3. postnet = True
    4. Combinative training (turning losses on/off at some points)
    5. path: "model_new/checkpoints/experiment4.pth.tar"
    
[**Logs**](https://wandb.ai/aborevsky/fastspeech_example)

**Install run**

    %%bash
    install.sh

**Train run**

    !python3 train.py

**Inference run**

    !python3 inference.py --m_path "model_new/checkpoints/experiment4.pth.tar"

Output:

    ---Model Restored ---

    torch.Size([80, 567])


## III. Evaluation

### Features' workability assessment

1. Extrapolation Vs boundaries

The first item concerns a phase of pitch extraction. While interpolation proved to be a useful tool, luckily recommended by the FS2 authors, the way to set it up seemed quite ambiguous. We started with an `extrapolation` hyperparameter, wishing simply to fill the empty positions. However, `PitchPredictor` training showed no effect, with the loss jumping around constant values. Tracing the issue back to its root, we found out that quantisation worked very poorly, with most values being set to the right boundary (255) because of the long range between the normalized minimum (-70) and maximum (90). Since this was not at all what we expected, we dived into the pitch target creation algorithm. After plotting the interpolated arrays, it became clear that the issue lay exactly in the extrapolation approach, which occasionally extends undefined boundaries to colossal values (>1000). After spending some time on the scipy documentation, we changed the formula, now using the leftmost and rightmost defined values of each array to fill the positions outside of the defined sequence.
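The boundary-fill variant can be illustrated without scipy. The sketch below is a plain-Python approximation of the behaviour described here, not the project's code. It treats zeros as unvoiced frames, interpolates linearly between voiced neighbours and holds the first and last voiced values at the edges. The name `fill_unvoiced` is hypothetical. With `scipy.interpolate.interp1d` the same choice corresponds roughly to passing `bounds_error=False` with a `(left, right)` tuple as `fill_value` instead of `fill_value="extrapolate"`.

```python
def fill_unvoiced(f0):
    """Fill zero (unvoiced) frames of a pitch contour.

    Inner gaps are linearly interpolated; frames before the first and
    after the last voiced frame take the nearest voiced value, so the
    contour can never leave the range of the observed pitch.
    """
    voiced = [i for i, v in enumerate(f0) if v != 0]
    if not voiced:
        return list(f0)  # nothing to interpolate from
    filled = list(f0)
    first, last = voiced[0], voiced[-1]
    for i in range(first):                 # left boundary fill
        filled[i] = f0[first]
    for i in range(last + 1, len(f0)):     # right boundary fill
        filled[i] = f0[last]
    for a, b in zip(voiced, voiced[1:]):   # gaps between voiced neighbours
        for i in range(a + 1, b):
            t = (i - a) / (b - a)
            filled[i] = f0[a] + t * (f0[b] - f0[a])
    return filled


contour = [0.0, 0.0, 100.0, 0.0, 0.0, 0.0, 200.0, 0.0]
print(fill_unvoiced(contour))
# [100.0, 100.0, 100.0, 125.0, 150.0, 175.0, 200.0, 200.0]
```

Linear extrapolation, by contrast, continues the slope of the nearest voiced segment, so a steep slope next to a long unvoiced edge can overshoot the observed range by a wide margin.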

2. VA position before/after LR

Of course, we wanted to stay as close to the article's text as possible (except for the CWT, truly the last thing we ever wanted to deal with). Unfortunately, we faced a sudden potential issue: how to mask the pitch and energy tensors during prediction? Since little time at the seminar was spent on clarifying the use of masks, it seemed hopeless to implement two predictors with no understanding of what tensors they should operate on. While we had a run during experiment 2 with the VA used after the LR (as in the paper), its quality seemed unsatisfying. Hence, we decided to escape this contradictory situation by adding pitch and energy before the length regulator. To that end we modified the FS2 structure, added a compressed dataset of values, and then observed that this technique was far better in terms of quality. Sure, with more information on how to run the VA after the LR, we could implement it and potentially get even higher quality (though we do not need all that much!).

3. Straightforward/adjusted quantise per tensor

Oh, this pitch/energy quantisation, the true headache of this project. Viewing it as an easy obstacle on the path to a functioning model, we ran into a many-hours-long issue. Its core lay in the fact that the straightforward `torch.quantize_per_tensor(dtype=torch.quint8)` is very hard to tune to a particular segment of possible values (for instance, between -1 and 6 with most of them lying in [0, 1]). We tried to solve the respective equations to identify suitable scale and zero point values. However, each time we tied the pitch/energy min/max to the quantisation boundaries ([-128, 127] for qint8 and [0, 255] for quint8), the distribution of intermediate values was terrifying, with most positions equal to either the left or the right boundary. Nevertheless, the solution described above was found and performed well. The paper's advice on normalization was a great help.
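The saturation effect is easy to reproduce with a plain-Python restatement of the affine mapping that the PyTorch documentation gives for `torch.quantize_per_tensor`, namely round(x / scale + zero_point) clamped to the range of the integer type. The sketch below is an illustration, not the project's code. The input values are made up, and `quantize_uint8` and `saturated_fraction` are hypothetical helper names.

```python
def quantize_uint8(values, scale, zero_point=0):
    """Affine quantisation to unsigned 8 bits: clamp(round(x / scale) + zero_point, 0, 255)."""
    return [min(max(round(x / scale) + zero_point, 0), 255) for x in values]


def saturated_fraction(codes):
    """Share of codes stuck at either end of the range (0 or 255)."""
    return sum(c in (0, 255) for c in codes) / len(codes)


values = [-1.0, 0.0, 0.5, 1.0, 6.0]  # illustrative normalised values
lo, hi = min(values), max(values)

# Scale picked without regard to the value range: negatives collapse to 0,
# large values collapse to 255.
naive = quantize_uint8(values, scale=0.01)

# Shift the minimum to zero and spread the whole range over the 256 codes.
shifted = [v + abs(lo) for v in values]
adjusted = quantize_uint8(shifted, scale=(abs(hi) + abs(lo)) / 256)

print(naive, saturated_fraction(naive))        # [0, 0, 50, 100, 255] 0.6
print(adjusted, saturated_fraction(adjusted))  # [0, 37, 55, 73, 255] 0.4
```

In the adjusted variant only the true minimum and maximum land on the boundary codes, while everything in between keeps a distinct bin.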

Another interesting point to mention is the log scale for pitch. While it is rational to use such a space for pitch (owing to the specific distribution of its values), its direct implementation turned out to be a severe challenge. We hoped a solution could be found with the torch quantisation technique, however all the attempts failed. Luckily, we were told of the magic `torch.bucketize` function, whose documentation we quickly explored. Its outstanding facilities allowed us to easily convert the pitch contour to log space by simply specifying boundaries with `torch.logspace`, starting from log(1) and going up to log(min + max + 1) with e as the base. Additionally, we shifted the entire pitch/energy array by pitch_min/energy_min each time to achieve a good distribution of values.
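A dependency-free sketch of this log-space bucketing is given below. It mimics the combination of `torch.logspace` and `torch.bucketize` with the standard library: `bisect.bisect_left` follows the same convention as `torch.bucketize` with `right=False` (boundaries[i-1] < v <= boundaries[i]). The helper names, the number of boundaries and the reading of the upper limit as `max - min + 1` (the text's min + max + 1 with the minimum taken by absolute value) are assumptions.

```python
import bisect
import math


def log_boundaries(min_value, max_value, n_bins=256):
    """n_bins - 1 boundaries, evenly spaced in log space from 1 to max - min + 1.

    Requires n_bins >= 3 so that at least two boundaries exist.
    """
    lo = math.log(1.0)
    hi = math.log(max_value - min_value + 1.0)
    steps = n_bins - 1
    return [math.exp(lo + (hi - lo) * i / (steps - 1)) for i in range(steps)]


def log_quantize(values, min_value, max_value, n_bins=256):
    """Shift values so the minimum maps to 1, then bucketize in log space."""
    boundaries = log_boundaries(min_value, max_value, n_bins)
    shifted = [v - min_value + 1.0 for v in values]
    return [bisect.bisect_left(boundaries, v) for v in shifted]


# Tiny example with 5 bins: boundaries are 1, 16^(1/3), 16^(2/3), 16.
print(log_quantize([0.0, 1.0, 2.0, 5.0, 10.0], min_value=0.0, max_value=15.0, n_bins=5))
# [0, 1, 2, 2, 3]
```

Because the boundaries are dense near the lower end, values that crowd together in a narrow low region are spread over more distinct codes than with a linear grid.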

4. FixNorm & altering weight normalization

Now we approach one of the most useless features of the project. The frequently mentioned article on ScaleNorm also presents these two techniques as very successful in most cases. However, several experiments with both learnable and static parameters (for FixNorm) showed zero or sometimes even negative dynamics compared to previous attempts without these tools.

5. ScaleNorm & PreLN

Well, at first they were slightly scolded, now we will praise them. The two articles provided to us, each discussing methods to improve a transformer's performance and convergence, were highly useful in some parts. Above all, the ScaleNorm class and the Pre-LN approach were the most beneficial for enhancing FS2.

6. Normalization & interpolation

Yes, the overall scheme of pitch contour extraction (except the (i)CWT part) from the second version of the article was more than essential for FS2 construction. These two techniques made `pitch_target` suitable for predictor training, so that it enhances the original FS.

7. PostNet, log(pitch), decoder layers number

Well, the last experiment was devoted to identifying the true role of these parameters. While none of them separately gave any sufficient improvement at first, combining them all brought us to better loss scores. The log space for pitch is especially interesting, since a careful analysis of the distribution of its values showed that they all lie close together in a specific region. Hence, log space seems to be a superior solution in such a case.

---

### Challenges

1. Quantisation

Well, many words have already been said about quantisation and how desperate it felt at some moments to continue implementing it. In particular, creating the log quantizer was a serious question, since normalized values mostly lie between -1 and 1. Nevertheless, its positive effect is indisputable. Moreover, preserving the original mean/std of energy/pitch was of great assistance. Owing to their addition to DataConfig (together with min/max), we could easily trace all tensor transformations backwards, searching for the potential issue.

2. Energy/Pitch datasets creation 

Well, waiting around 40 min. for dataset creation and then around 10 min. for its loading each time still sounds quite challenging. Especially when remembering all the numerous mistakes made while adding saving, compression and statistics calculation to the script, plus the collator modifications. Nevertheless, it was a vital step on the path of our project.

3. LR: inference mode

The last challenge to be mentioned is establishing the inference mode in the length regulator. Besides numerous runs with wrong shapes during training/synthesis, a considerable problem was to map the log(duration) prediction to satisfying values at inference. By the way, the audio logs of some runs in W&B sound horrific because of the inaccurate inference mode (in the last experiments this problem disappears).