# Neural Modelling Synthesis
***

### Adversarial Audio Synthesis
***
[Link to Audio Examples](https://chrisdonahue.com/wavegan_examples/)

**WaveGAN** - A first attempt to synthesize audio from raw time-domain waveforms  
- Unlike previous naive attempts that simply bootstrapped image algos by treating spectrograms as images, WaveGAN works on the waveform itself, which offers better performance
- Qualitative(Human Judgement) + Quantitative(Inception Score) evaluation

Practical application in generating sounds - An artist can explore the **latent space** of audio and fine-tune variables/parameters to generate a desired sound, as opposed to searching for it in a dataset. However, audio has high temporal resolution, so the latent space must encode this high-dimensional signal correctly.  
Other work - 
1. Autoregressive models(Wavenet) -> *Very* slow generation of audio
2. GAN's used naively on image spectrograms -> Lossy estimates due to non-invertibility of spectrogram, thus learn an inversion model as well  
The authors want to investigate if unsupervised strategies can learn **semantic nodes** implicitly in the high dimensional space rather than being conditioned on them<Discuss more?!>

This paper - 
1. Waveform(WaveGAN) + Spectrogram(SpecGAN) strategies for GAN's
2. Human Evaluation for sounds + Quantitative evaluation on Speech dataset

<u>WaveGAN</u>  
Images vs Audio - 
- Audio more likely to exhibit periodic structure
- Correlations across large time instants in audio(-> Filter with larger RF!)

Architecture - 
- Modification of Deep Conv GAN
    1. Flattened(1D instead of 2D) convolutions
    2. Increase stride
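    3. Rather than original GAN cost(which is unstable, since the divergence it minimises need not be continuous/differentiable in the generator parameters), use WGAN-GP(Wasserstein cost with gradient penalty, a more stable cost function)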
- Phase Shuffling(artifact prevention, phase invariance)
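
A minimal sketch of phase shuffling on a single 1-D activation sequence, assuming plain Python lists in place of real feature maps. The names `phase_shuffle` and `random_phase_shuffle` are illustrative and not taken from the paper's code. Each call shifts the sequence by a random offset in [-n, n] and fills the gap by reflection, so the discriminator cannot key on the absolute phase of an artifact:

```python
import random

def phase_shuffle(x, shift):
    """Shift a 1-D list by `shift` samples, filling the gap by reflection.

    Positive `shift` delays the signal, negative `shift` advances it.
    The output has the same length as the input.
    """
    n = len(x)
    if abs(shift) >= n:
        raise ValueError("shift must be smaller than the sequence length")
    if shift > 0:
        # Mirror the samples just after x[0] and place them in front.
        return x[shift:0:-1] + x[:n - shift]
    if shift < 0:
        k = -shift
        # Mirror the samples just before x[-1] and place them at the end.
        return x[k:] + x[-2:-k - 2:-1]
    return list(x)

def random_phase_shuffle(x, n, rng=random):
    """Apply a shift drawn uniformly from {-n, ..., n} (inclusive)."""
    return phase_shuffle(x, rng.randint(-n, n))
```

In a real model the same shift logic would run along the time axis of each discriminator feature map; the list version above only shows the indexing.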

<u>SpecGAN</u>  
Phase information is often discarded which prevents inversion of spectrogram

Architecture - 
- Audio ---- [STFT] -> TF Representation ---- [Train DCGAN] -> Obtain samples ---- [Griffin Lim for phase reconstruction] -> Obtain audio from samples

Poor performance due to noisiness introduced in Griffin Lim inversion process

<u>Dataset</u>  
1. Speech(SC09)
2. Sounds(Drum, Bird, Piano, Large Vocab Speech)

**Took ~4 days to train!**

<u>Evaluation</u>  
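1. Inception Score - $\exp\left(E_{x}\left[KL(P(y|x)||P(y))\right]\right)$
    - $P(y|x)$ should be low entropy(deterministic){Classifier confidently assigns one label to each generated sample}
    - $P(y)$ should be high entropy(uniform){Generated samples spread across all labels}
2. Nearest-neighbour(NN) comparison - Catches failure cases the inception score misses(e.g. low diversity within a class, memorising the training data)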
3. Qualitative Human Judgement - **Amazon Mechanical Turk** to collect/label audio. Digit perceived(SC09) + Sound quality(0-5)
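
The inception score formula can be checked on toy numbers. The sketch below assumes the classifier outputs are already available as rows of $P(y|x)$, one row per generated sample; the function name `inception_score` is just for illustration:

```python
import math

def inception_score(cond_probs):
    """exp of the mean KL(P(y|x) || P(y)) over generated samples.

    cond_probs: list of rows, each row a label distribution P(y|x).
    """
    n = len(cond_probs)
    num_classes = len(cond_probs[0])
    # Marginal P(y): average of the per-sample label distributions.
    marginal = [sum(row[c] for row in cond_probs) / n for c in range(num_classes)]
    mean_kl = 0.0
    for row in cond_probs:
        # Terms with p == 0 contribute nothing to the KL divergence.
        mean_kl += sum(p * math.log(p / q) for p, q in zip(row, marginal) if p > 0)
    return math.exp(mean_kl / n)

# Confident and diverse: every sample gets one label, all labels are used.
diverse = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
# Mode collapse: confident, but always the same label.
collapsed = [[1, 0, 0, 0]] * 4
# Unsure classifier: uniform P(y|x) for every sample.
unsure = [[0.25] * 4] * 4
```

With 4 classes the score ranges from 1 (collapsed or unsure) up to 4 (confident and diverse), which is why both entropy conditions are needed.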

Future Work - 
- Variable length audio
- label conditioning strategies

***
### Adversarial Generation of Time-Frequency Features with application in audio synthesis
***

STFT as Time Frequency(TF) Representation of audio i.e. GAN trained on STFT features. This outperforms models trained on waveform directly!

- Phase of STFT hard to understand and model => Use partial derivatives of phase(local instantaneous frequency).
- Phase estimation from magnitude spectrogram(Griffin Lim) unreliable
- Inspired by Phaseless Reconstruction + Current State of Art i.e. GANSynth -> TiFGAN

<u>Math</u>  
- STFT is redundant: it maps time-domain signals into a lower-dimensional subspace of all possible TF coefficient arrays, so an arbitrary generated array need not be the STFT of any signal
- Look at STFT as an operator, and study the conditions for the operator to be perfectly invertible.
- For a consistent transformation, assuming that STFT is an analytic function, using the Cauchy Riemann equations, you get a coupled pair of equations which can be solved to obtain the phase simply from the log-Magnitude Spectrogram itself.
- Use of **Phase Gradient Heap Integration(PGHI)** to bypass phase instabilities by providing better estimates.
- The above condition is discretized, and using a new metric, **consistency**, the authors judge the goodness of a TF representation. Consistency is evaluated by looking at the projection error $\hat{e} = |S^{gen} - S^{proj}|^{2}$ where $S^{proj} = STFT(ISTFT(S^{gen}))$
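
A toy sketch of the projection error, taking the projection as STFT(ISTFT(·)) applied to the candidate array. The periodic Hann window, frame length 8, hop 4 and the naive DFT are all illustrative assumptions (chosen so the block needs only the standard library), not the paper's settings. The STFT of a real signal is consistent, so projecting it changes nothing, while a random complex array is not:

```python
import cmath
import math
import random

N, HOP = 8, 4  # toy frame length and hop size
WINDOW = [0.5 - 0.5 * math.cos(2 * math.pi * n / N) for n in range(N)]  # periodic Hann

def dft(frame, sign=-1):
    """Naive DFT (sign=-1) or unnormalised inverse DFT (sign=+1)."""
    size = len(frame)
    return [sum(frame[n] * cmath.exp(sign * 2j * math.pi * k * n / size)
                for n in range(size)) for k in range(size)]

def stft(signal):
    starts = range(0, len(signal) - N + 1, HOP)
    return [dft([WINDOW[n] * signal[s + n] for n in range(N)]) for s in starts]

def istft(spec):
    """Windowed overlap-add inverse, normalised by the summed squared window."""
    length = HOP * (len(spec) - 1) + N
    out, norm = [0.0] * length, [0.0] * length
    for m, frame_spec in enumerate(spec):
        frame = [v.real / N for v in dft(frame_spec, sign=+1)]
        for n in range(N):
            out[m * HOP + n] += WINDOW[n] * frame[n]
            norm[m * HOP + n] += WINDOW[n] ** 2
    return [o / w if w > 1e-12 else 0.0 for o, w in zip(out, norm)]

def projection_error(spec):
    """Squared distance between spec and STFT(ISTFT(spec))."""
    projected = stft(istft(spec))
    return sum(abs(a - b) ** 2
               for row_a, row_b in zip(spec, projected)
               for a, b in zip(row_a, row_b))

signal = [math.sin(0.9 * n) + 0.3 * math.cos(2.1 * n) for n in range(32)]
consistent = stft(signal)
rng = random.Random(0)
inconsistent = [[complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(N)]
                for _ in range(len(consistent))]
```

Here `projection_error(consistent)` is zero up to rounding, and `projection_error(inconsistent)` is clearly positive. A practical pipeline would use an FFT library instead of the naive DFT.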

Architecture
- Modified DCGAN + preprocessing on signal to enable its input to a GAN

Evaluation
- Dataset - 
    1. Speech Commands dataset
    2. Music, 25 min of Bach piano recordings
- Evaluation using Inception Score and Fréchet Inception Distance

Conclusion, Future work  -  
1. Consistency measure - Computationally cheap measure to assess quality of TF representation  
2. Extension - Use logarithmic and perceptual frequency scales

***
### GANSYNTH: ADVERSARIAL NEURAL AUDIO SYNTHESIS(State of the Art!!!)
***
[Audio Examples](https://storage.googleapis.com/magentadata/papers/gansynth/index.html)

Human perception of audio sensitive to both **global structure** and **fine scale waveform coherence**.
1. Autoregressive models(Wavenet, Autoencoder Wavenet) capture fine scale waveform, but lack global latent structure
2. GAN's have **global latent conditioning** and **efficient parallel sampling**, but lack local coherence.

This paper demonstrates that GAN's can indeed generate **high-fidelity** and **locally coherent** waveforms by modelling **log magnitudes** and **Instantaneous Frequencies** in the spectral domain

Autoregressive models - 
- Focus on the 'finest scale' i.e. a single sample
- Rely on **external conditioning** for global structure
- Sampling very slow(**ancestral sampling**) as the waveform is generated one sample at a time
- Due to **fine timescale**, autoencoder variants model only local latent structure.

GAN's - 
- Stack of transposed convolutions on latent vector
- Lack perceptual fidelity of image counterparts

Motivation for using phase - 
- Problem of phase precession when frame size not equal to period(also for overlapping filterbanks)
![phase_precession](fig_01.PNG)
- Human perception highly sensitive to discontinuities and irregularities in periodic waveforms
- Challenge for synthesis network - It must learn all the correct frequency+phase combinations to output a coherent waveform.

This paper - 
1. Generate log-magnitude spectrograms and phases directly with GAN
2. **Estimate Instantaneous Frequency Spectra(Rather than Phase), more coherent audio**
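
A small sketch of why instantaneous frequency is an easier target than raw phase, assuming a single frequency bin whose phase advances by a fixed 2.0 rad per frame (a made-up value). The wrapped phase jumps around, while the unwrapped finite difference is constant. The helper names are illustrative:

```python
import math

def unwrap(phases):
    """Remove 2*pi jumps so the phase varies smoothly from frame to frame."""
    out = [phases[0]]
    for p in phases[1:]:
        delta = p - out[-1]
        # Bring the step back into [-pi, pi] by removing whole turns.
        delta -= 2 * math.pi * round(delta / (2 * math.pi))
        out.append(out[-1] + delta)
    return out

def instantaneous_frequency(phases):
    """Finite difference of the unwrapped phase (radians per frame)."""
    unwrapped = unwrap(phases)
    return [b - a for a, b in zip(unwrapped, unwrapped[1:])]

# Wrapped phase of a steady tone: precesses through [-pi, pi).
wrapped = [((2.0 * m + math.pi) % (2 * math.pi)) - math.pi for m in range(20)]
inst_freq = instantaneous_frequency(wrapped)
```

Every entry of `inst_freq` is 2.0 up to rounding, so the network only has to output a constant for a steady tone instead of a precessing phase pattern.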

<u>Dataset</u> - 
- Nsynth(available online)
- restricted to training on subsets of acoustic instruments, and limited pitch range(MIDI 24-84, ~32-1000Hz)
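
The pitch range can be converted to Hz with the usual equal-temperament formula, assuming the standard tuning of MIDI note 69 = A4 = 440 Hz:

```python
def midi_to_hz(note, a4_hz=440.0):
    """Equal-temperament frequency of a MIDI note number."""
    return a4_hz * 2.0 ** ((note - 69) / 12)

low_hz = midi_to_hz(24)   # C1, roughly 32.7 Hz
high_hz = midi_to_hz(84)  # C6, roughly 1046.5 Hz
```

So the range spans five octaves, with the upper limit at about 1 kHz.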

<u>Architecture</u> - 
- Progressive training of GAN's(Karras 2018a paper)
- Condition on additional source of information(pitch) to achieve independent control of pitch and timbre

<u>Evaluation</u> - 
1. Human Evaluation
2. Number of Statistically Different Bins(NDB) - measure diversity of generated samples
3. Inception Score
4. Pitch accuracy, pitch entropy
5. Frechet Inception Distance

<u>Discussion</u> - 
- Phase Coherence of generated waveforms(Rainbowgrams!)
![coherence_gen_wf](fig_02.PNG)
- Interpolation(**spherical interpolation** vs linear interpolation in previous work) observed smooth perceptual changes, no major artifacts
- Consistent Timbre across pitch, fix latent vector and condition on pitch. Timbral identity is constant for given point in latent space.
- Extremely fast generation
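
A sketch of spherical vs linear interpolation between two latent vectors, using plain lists and illustrative function names. With a Gaussian prior, latent vectors tend to have similar norm, and the linear midpoint falls to a smaller norm than either endpoint, whereas the spherical path keeps the norm:

```python
import math

def lerp(z0, z1, t):
    """Linear interpolation between two vectors."""
    return [(1 - t) * a + t * b for a, b in zip(z0, z1)]

def slerp(z0, z1, t):
    """Spherical interpolation between two vectors, t in [0, 1]."""
    dot = sum(a * b for a, b in zip(z0, z1))
    norm0 = math.sqrt(sum(a * a for a in z0))
    norm1 = math.sqrt(sum(b * b for b in z1))
    cos_omega = max(-1.0, min(1.0, dot / (norm0 * norm1)))
    omega = math.acos(cos_omega)  # angle between the two vectors
    if omega < 1e-8:
        return lerp(z0, z1, t)  # nearly parallel: linear is fine
    w0 = math.sin((1 - t) * omega) / math.sin(omega)
    w1 = math.sin(t * omega) / math.sin(omega)
    return [w0 * a + w1 * b for a, b in zip(z0, z1)]

def norm(z):
    return math.sqrt(sum(a * a for a in z))

z_a, z_b = [1.0, 0.0], [0.0, 1.0]
mid_linear = lerp(z_a, z_b, 0.5)
mid_spherical = slerp(z_a, z_b, 0.5)
```

For these unit vectors `norm(mid_spherical)` stays at 1 while `norm(mid_linear)` drops to about 0.71. The sketch does not handle exactly opposite vectors, where the path is not unique.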

Future Work - 
- Combining adversarial losses with encoders(use VAE to model G and D)
- More straightforward **regression losses** to capture full data distribution

***
### TIMBRETRON: A WAVENET(CYCLEGAN(CQT(AUDIO))) PIPELINE FOR MUSICAL TIMBRE TRANSFER  
[Video](http://www.cs.toronto.edu/~huang/TimbreTron/index.html)
***

- Aim to solve the problem of **Musical Timbre Transfer** i.e. manipulate timbre from one instrument to match another while preserving other musical content(pitch, rhythm, loudness)
- Inspired by **Image Domain** style transfer techniques

1. Use the Constant-Q Transform(CQT) to obtain the TF representation of the audio(log magnitude, no phase)
2. CycleGAN for timbre transfer
3. Conditional Wavenet Synthesizer

![process](fig_03.PNG)

Why CQT?
1. Well suited for timbre due to pitch equivariance
2. Simultaneously achieves better frequency resolution at low frequencies and high temporal resolution at higher frequencies
3. Empirical experiments yielded better performance compared to STFT 
4. Outperformed traditional representations like MFCC's in environmental sound classifications using CNN's(Huzaifah 2017)
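
Pitch equivariance can be seen from where harmonics land on each frequency axis. The sketch assumes a CQT with 12 bins per octave starting at 32.70 Hz and a linear (STFT-like) axis with 10 Hz bins; both are illustrative values, not the paper's settings. On the log axis the harmonic pattern keeps its shape when the pitch changes and is only translated, while on the linear axis it stretches:

```python
import math

def cqt_bin(freq_hz, f_min=32.70, bins_per_octave=12):
    """Fractional CQT bin index: bins are geometrically spaced."""
    return bins_per_octave * math.log2(freq_hz / f_min)

def linear_bin(freq_hz, bin_width_hz=10.0):
    """Fractional bin index on a linearly spaced (STFT-like) axis."""
    return freq_hz / bin_width_hz

def harmonic_offsets(f0, to_bin, num_harmonics=4):
    """Bin distance of each harmonic from the fundamental."""
    base = to_bin(f0)
    return [to_bin(h * f0) - base for h in range(1, num_harmonics + 1)]

cqt_a3 = harmonic_offsets(220.0, cqt_bin)
cqt_a4 = harmonic_offsets(440.0, cqt_bin)
lin_a3 = harmonic_offsets(220.0, linear_bin)
lin_a4 = harmonic_offsets(440.0, linear_bin)
```

`cqt_a3` and `cqt_a4` are identical (raising the note by an octave just moves everything up 12 bins), which is the kind of translation a convolution handles well. `lin_a3` and `lin_a4` differ by a factor of two.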


Issues I felt
1. No phase
2. Not perfectly invertible(though they claim it is)

<u>Dataset</u> - Unrelated collections of different musical instruments(from YouTube, links available in appendix)

<u>Architecture</u> - 
1. TF representation using log magnitude of CQT.
2. **CycleGAN** -
    - Unsupervised domain transfer - learn a mapping between two domains **without any paired data**
3. Reconstruction from TF representation generated - 
    - Avoid Griffin Lim as no optimality guarantees
    - Hence, use conditional wavenet  to generate waveform
4. Major Contributions - 
    1. **Beam Search** - Wavenet can sometimes produce low-probability outputs. To avoid this, search over Wavenet's generations for the output that best matches the target CQT
    2. **Reverse Generation** - Percussive attacks not modelled correctly at onsets(difficult to determine onset from CQT). Solve by generating samples backwards i.e. in reverse order(Really Cool Hack!!)
5. Also proposed architectural modifications for the following - 
    1. Removing checkerboard artifacts
    2. Full Spectrogram Discriminator
    3. Gradient Penalty
    4. Identity Loss

Discussion - 
- Evaluation primarily through human evaluation(Amazon Mechanical Turk + feedback form)
- Performed an **ablation** study to analyze importance of different architectural changes + importance of certain features
- CQT well suited to convolutional architectures mainly due to **pitch equivariance**([What this means in CNN's](https://www.quora.com/What-is-the-difference-between-equivariance-and-invariance-in-Convolution-neural-networks))


***
### WAVENET: A GENERATIVE MODEL FOR RAW AUDIO
[Audio Examples](https://deepmind.com/blog/wavenet-generative-model-raw-audio/)
***

- Deep neural network to generate **raw audio waveform**(most of the methods above generate TF representations, which are then inverted to obtain waveforms)
- Fully **probabilistic + Autoregressive**
- Inspired Heavily from previous work by same author i.e. PixelCNN to generate images pixel by pixel.

- Models joint probability in the following way - 
$p(\bar{x}) = \prod_{t=1}^{T} p(x_{t}|x_{1},\dots,x_{t-1})$
i.e. each sample depends on all the samples before it.
- Use **Dilated Causal Convolutions** to model the above - 
![causal_convolutions](fig_04.PNG)
- **Softmax** over quantized sample values(rather than a mixture model such as a **GMM**) to model the conditional distributions
- **Gated Activation Units** work as a better non-linearity for audio signals vs ReLU
- **Context Stacks** to increase Receptive Field size
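
A minimal sketch of a dilated causal convolution on a 1-D list, plus the receptive field of a stack of such layers. Kernel size 2 and dilations 1, 2, 4, 8 are used as a small example, and the function names are illustrative:

```python
def causal_dilated_conv(x, weights, dilation):
    """y[t] = sum_k weights[k] * x[t - k * dilation], zeros before t = 0.

    The output at time t never depends on inputs later than t.
    """
    out = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(weights):
            idx = t - k * dilation
            if idx >= 0:
                acc += w * x[idx]
        out.append(acc)
    return out

def receptive_field(kernel_size, dilations):
    """Number of input samples that can influence one output sample."""
    return 1 + (kernel_size - 1) * sum(dilations)

# Push an impulse at t = 5 through a stack with dilations 1, 2, 4, 8.
dilations = [1, 2, 4, 8]
signal = [0.0] * 32
signal[5] = 1.0
response = signal
for d in dilations:
    response = causal_dilated_conv(response, [1.0, 1.0], d)
touched = [t for t, v in enumerate(response) if v != 0.0]
```

The impulse only affects outputs from t = 5 onwards (causality) and reaches exactly `receptive_field(2, dilations)` = 16 time steps. Doubling the dilation per layer grows the receptive field exponentially with depth: ten layers with dilations 1 to 512 cover 1024 samples.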

**Conditional Wavenets** - 
- Model conditional distribution given an input h i.e. $p(\bar{x}|h) = \prod_{t=1}^{T} p(x_{t}|x_{1},\dots,x_{t-1},h)$. This is done to produce audio with required characteristics(timbre, pitch)
- Done in two ways -
1. Global conditioning - Influence output at all time steps
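2. Local conditioning - A time series $h_{t}$(e.g. linguistic features for TTS) influences the output at the corresponding time steps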
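
A toy sketch of the autoregressive factorisation with a global condition h. `toy_conditional` is a hypothetical stand-in for the network (it is not Wavenet): it returns a softmax over a few quantized levels given the history and h. The log-probability of a sequence is the sum of per-sample conditional log-probabilities, and sampling draws one sample at a time and feeds it back:

```python
import math
import random

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def toy_conditional(history, h, num_levels):
    """Hypothetical p(x_t | x_1..x_{t-1}, h) over quantized levels.

    Prefers levels near the previous sample; the global condition h
    tilts the distribution towards higher (h > 0) or lower (h < 0) levels.
    """
    prev = history[-1] if history else num_levels // 2
    logits = [-abs(level - prev) + h * level for level in range(num_levels)]
    return softmax(logits)

def log_prob(x, h, num_levels):
    """log p(x | h) = sum over t of log p(x_t | x_1..x_{t-1}, h)."""
    return sum(math.log(toy_conditional(x[:t], h, num_levels)[x[t]])
               for t in range(len(x)))

def sample(length, h, num_levels, rng):
    """Ancestral sampling: each new sample is conditioned on all earlier ones."""
    x = []
    for _ in range(length):
        probs = toy_conditional(x, h, num_levels)
        x.append(rng.choices(range(num_levels), weights=probs)[0])
    return x
```

For example, `sample(50, 2.0, 8, random.Random(0))` drifts towards high levels while `h = -2.0` drifts towards low ones, which mirrors how a global condition steers every time step. The sequential loop is also why generation is slow.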

Discussion - 
- Lack of long range coherence due to limited size of Receptive Field
- Picks up certain other characteristics in audio such as mimicking acoustics and recording quality(and breathing and mouth movement for speakers in case of speech generation)
- Large RF necessary for Music!
- Evaluation using **Mean Opinion Scores(MOS)** tests on human evaluated samples.

***
### Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders
[Website](https://magenta.tensorflow.org/nsynth)
***

- Inspired by Wavenet, they design a Wavenet-style autoencoder that conditions the decoder on **temporal codes** learnt from the raw audio waveform. These capture long term structure without external conditioning.
- The model learns a **manifold of embeddings** that allows morphing, interpolations etc.

<u>Architecture</u>
![Nsynth](fig_05.PNG)

Thus, instead of external conditioning, the input embeddings work as the conditioning variable, which encodes information about the waveform: $p(\bar{x}) = \prod_{t=1}^{T} p(x_{t}|x_{1},\dots,x_{t-1},f(\bar{x}))$

Major Limitation - 
- Unable to fully capture global context due to memory constraint(open problem)