# Neural Modelling Synthesis
***

### Adversarial Audio Synthesis
***
[Link to Audio Examples](https://chrisdonahue.com/wavegan_examples/)

**WaveGAN** - A first attempt to synthesize audio with GANs directly from raw time-domain waveforms  
- Unlike the naive approach of bootstrapping image-generation algorithms by treating spectrograms as images, it works on the waveform itself, which offers better performance
- Qualitative (human judgement) + quantitative (Inception Score) evaluation

Practical application in generating sounds - An artist can explore the **latent space** of audio and fine-tune variables/parameters to generate a desired sound, instead of searching a dataset for it. However, audio has high temporal resolution, so the latent space must encode this high-dimensional signal correctly.  
Other work - 
1. Autoregressive models (WaveNet) -> *Very* slow generation of audio
2. GANs used naively on image-like spectrograms -> Lossy estimates because the spectrogram is not invertible, so an inversion model has to be learned as well

The authors want to investigate whether unsupervised strategies can learn **semantic modes** implicitly in the high-dimensional space rather than being conditioned on them <Discuss more?!>

This paper - 
1. Waveform (WaveGAN) + spectrogram (SpecGAN) strategies for GANs
2. Human evaluation for sounds + quantitative evaluation on a speech dataset

<u>WaveGAN</u>  
Images vs Audio - 
- Audio is more likely to exhibit periodic structure
- Correlations span long time windows in audio (-> filters with a larger receptive field!)

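Architecture - 
- Modification of Deep Convolutional GAN (DCGAN)
    1. Flattened (1D instead of 2D) convolutions
    2. Increased stride
    3. Rather than the original GAN cost (unstable to train, since the divergence it minimizes may be discontinuous and may not provide usable gradients), use the WGAN cost (a modified, more stable objective)
- Phase shuffle (randomly shifts activations in time so that the discriminator cannot key on the phase of upsampling artifacts, i.e. phase invariance)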
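
The phase shuffle step can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names, the `(channels, time)` layout, and the choice of one offset per call are assumptions of this sketch. Gaps left by the shift are filled by reflection.

```python
import numpy as np

def shift_with_reflection(x, shift):
    """Shift a (channels, time) array along time, filling the gap by reflection."""
    length = x.shape[-1]
    # Pad on the left for a positive shift (delay), on the right for a negative one.
    padded = np.pad(x, ((0, 0), (max(shift, 0), max(-shift, 0))), mode="reflect")
    start = 0 if shift > 0 else -shift
    return padded[:, start:start + length]

def phase_shuffle(x, max_shift, rng):
    """Apply a random time shift drawn uniformly from [-max_shift, max_shift]."""
    shift = int(rng.integers(-max_shift, max_shift + 1))  # upper bound is exclusive
    return shift_with_reflection(x, shift)

# Hypothetical activations: 1 channel, 8 time steps.
activations = np.arange(8.0).reshape(1, 8)
rng = np.random.default_rng(0)
shuffled = phase_shuffle(activations, max_shift=2, rng=rng)
```

The output always has the same shape as the input, so the layer can be dropped between existing convolutions without changing anything else.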

<u>SpecGAN</u>  
Phase information is usually discarded when forming the spectrogram, which prevents exact inversion back to audio

Architecture - 
- Audio ---- [STFT] -> TF representation ---- [Train DCGAN] -> Obtain samples ---- [Griffin-Lim for phase reconstruction] -> Obtain audio from samples

Poor performance due to the noise introduced by the Griffin-Lim inversion process
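
To make the inversion step concrete, here is a rough NumPy sketch of Griffin-Lim phase reconstruction from a magnitude spectrogram. The frame size, hop size, helper names, and the sine test signal are all assumptions chosen for illustration, not SpecGAN's actual settings.

```python
import numpy as np

N_FFT, HOP = 256, 64  # illustrative frame and hop sizes (assumed)

def stft(x):
    """Hann-windowed STFT, shape (frames, N_FFT // 2 + 1)."""
    win = np.hanning(N_FFT)
    n_frames = 1 + (len(x) - N_FFT) // HOP
    frames = np.stack([x[i * HOP:i * HOP + N_FFT] * win for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def istft(spec):
    """Least-squares inverse STFT via windowed overlap-add."""
    win = np.hanning(N_FFT)
    frames = np.fft.irfft(spec, n=N_FFT, axis=1)
    length = N_FFT + HOP * (len(frames) - 1)
    out, norm = np.zeros(length), np.zeros(length)
    for i, frame in enumerate(frames):
        out[i * HOP:i * HOP + N_FFT] += frame * win
        norm[i * HOP:i * HOP + N_FFT] += win ** 2
    return out / np.maximum(norm, 1e-8)

def griffin_lim(mag, n_iter=50, seed=0):
    """Start from random phase, then alternate ISTFT and STFT, keeping `mag` fixed."""
    rng = np.random.default_rng(seed)
    phase = rng.uniform(-np.pi, np.pi, size=mag.shape)
    for _ in range(n_iter):
        audio = istft(mag * np.exp(1j * phase))
        phase = np.angle(stft(audio))  # keep only the re-estimated phase
    return istft(mag * np.exp(1j * phase))

def magnitude_error(audio, mag):
    """Distance between the magnitude of the audio's STFT and the target."""
    return float(np.linalg.norm(np.abs(stft(audio)) - mag))

# Toy target: magnitude spectrogram of a 440 Hz sine at an assumed 16 kHz rate.
t = np.arange(8192) / 16000.0
target = np.abs(stft(np.sin(2 * np.pi * 440 * t)))
rough = griffin_lim(target, n_iter=0)     # random phase only
refined = griffin_lim(target, n_iter=50)  # after 50 iterations
```

Comparing `magnitude_error(rough, target)` with `magnitude_error(refined, target)` shows how far the iterations reduce the mismatch; whatever error remains is the kind of inversion noise the note above refers to.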

<u>Dataset</u>  
1. Speech(SC09)
2. Sounds(Drum, Bird, Piano, Large Vocab Speech)

**Took ~4 days to train!**

<u>Evaluation</u>  
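1. Inception Score - $\exp\left(\mathbb{E}_{x}\left[\mathrm{KL}\left(P(y \mid x) \Vert P(y)\right)\right]\right)$
    - $P(y \mid x)$ should be low entropy (near deterministic) {classifier's label distribution for a single generated sample $x$}
    - $P(y)$ should be high entropy (near uniform) {label distribution averaged over all generated samples}
2. Nearest-neighbour (NN) comparison - Catches failure cases the Inception Score misses (e.g. low within-class diversity or overfitting to the training data)
3. Qualitative human judgement - **Amazon Mechanical Turk** workers label the generated audio. Digit perceived (SC09) + sound quality (0-5)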
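
As a small worked example of the Inception Score formula, the sketch below computes it from a matrix of classifier outputs. The function name and the toy probability matrices are assumptions for illustration; in practice the rows would come from a pretrained classifier applied to generated samples.

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """probs[i, k] = P(y = k | x_i) for generated sample x_i."""
    probs = np.asarray(probs, dtype=float)
    marginal = probs.mean(axis=0, keepdims=True)  # P(y), averaged over samples
    # KL(P(y|x) || P(y)) per sample, then exponentiate the mean.
    kl = np.sum(probs * (np.log(probs + eps) - np.log(marginal + eps)), axis=1)
    return float(np.exp(kl.mean()))

# Best case for 10 classes: each sample is confidently a different digit.
confident_and_diverse = np.eye(10)
# Failure case: every sample is confidently the same digit.
collapsed = np.tile(np.eye(10)[0], (10, 1))
# Failure case: the classifier cannot tell what any sample is.
ambiguous = np.full((10, 10), 0.1)

best = inception_score(confident_and_diverse)  # close to 10, the number of classes
worst_a = inception_score(collapsed)           # close to 1
worst_b = inception_score(ambiguous)           # close to 1
```

The score ranges from 1 up to the number of classes, which is why both low-entropy $P(y \mid x)$ and high-entropy $P(y)$ are needed to score well.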

Future Work - 
- Variable length audio
- Label conditioning strategies

***
### Adversarial Generation of Time-Frequency Features with application in audio synthesis
***

Uses the STFT as the time-frequency (TF) representation of audio, i.e. a GAN is trained on STFT features. This outperforms models trained on the waveform directly!

- Phase of the STFT is hard to understand and model => Use partial derivatives of the phase (e.g. local instantaneous frequency).
- Phase estimation from the magnitude spectrogram (Griffin-Lim) is unreliable
- Inspired by phaseless reconstruction + current state of the art, i.e. GANSynth -> TiFGAN
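
To make the phase-derivative idea above concrete, here is a hedged worked form. It assumes a Gaussian window $g(\tau) = e^{-\pi \tau^{2} / \lambda}$ and the STFT convention $S(\omega, t) = \int f(\tau) g(\tau - t) e^{-2 \pi i \omega (\tau - t)} d\tau$; both are assumptions of this sketch, and the signs and the additive $2 \pi \omega$ term change with the phase convention. Writing $S = M e^{i \phi}$, the Cauchy-Riemann equations tie the phase derivatives to the log-magnitude:

$$
\frac{\partial \phi}{\partial \omega}(\omega, t) = -\lambda \frac{\partial}{\partial t} \log M(\omega, t), \qquad
\frac{\partial \phi}{\partial t}(\omega, t) = \frac{1}{\lambda} \frac{\partial}{\partial \omega} \log M(\omega, t) + 2 \pi \omega
$$

So, under these assumptions, both partial derivatives of the phase can be read off the log-magnitude alone, and the phase itself follows by integrating them.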

<u>Math</u>  
- The STFT is redundant: it maps time-domain signals onto a lower-dimensional subspace of the space of all possible TF coefficient arrays, so an arbitrary array is generally not the STFT of any signal.
- Look at the STFT as an operator, and study the conditions for the operator to be perfectly invertible.
- For a consistent representation, treating the STFT (with a Gaussian window) as an analytic function and applying the Cauchy-Riemann equations gives a coupled pair of equations that tie the partial derivatives of the phase to those of the log-magnitude, so the phase can be estimated from the log-magnitude spectrogram alone.
- Use of **Phase Gradient Heap Integration (PGHI)** to bypass phase instabilities by providing better phase estimates.
- The above condition is discretized, and using a new metric, **consistency**, the authors judge the goodness of a TF representation. Consistency is evaluated by looking at the projection error $\hat{e} = \Vert S^{gen} - S^{proj} \Vert^{2}$, where $S^{proj} = STFT(ISTFT(S^{gen}))$
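
The projection error can be sketched in NumPy as follows. This only illustrates the $STFT(ISTFT(\cdot))$ projection idea; the frame and hop sizes, helper names, and toy signal are assumptions, and it is not the authors' discretized consistency measure.

```python
import numpy as np

N_FFT, HOP = 256, 64  # illustrative frame and hop sizes (assumed)

def stft(x):
    """Hann-windowed STFT, shape (frames, N_FFT // 2 + 1)."""
    win = np.hanning(N_FFT)
    n_frames = 1 + (len(x) - N_FFT) // HOP
    frames = np.stack([x[i * HOP:i * HOP + N_FFT] * win for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def istft(spec):
    """Least-squares inverse STFT via windowed overlap-add."""
    win = np.hanning(N_FFT)
    frames = np.fft.irfft(spec, n=N_FFT, axis=1)
    length = N_FFT + HOP * (len(frames) - 1)
    out, norm = np.zeros(length), np.zeros(length)
    for i, frame in enumerate(frames):
        out[i * HOP:i * HOP + N_FFT] += frame * win
        norm[i * HOP:i * HOP + N_FFT] += win ** 2
    return out / np.maximum(norm, 1e-8)

def projection_error(spec):
    """Squared distance between a TF array and its projection STFT(ISTFT(spec))."""
    projected = stft(istft(spec))
    return float(np.sum(np.abs(spec - projected) ** 2))

# A consistent array: the STFT of an actual signal (440 Hz sine, assumed 16 kHz rate).
t = np.arange(8192) / 16000.0
consistent = stft(np.sin(2 * np.pi * 440 * t))

# An inconsistent array: same magnitudes, phases replaced by random values.
rng = np.random.default_rng(0)
random_phase = rng.uniform(-np.pi, np.pi, size=consistent.shape)
scrambled = np.abs(consistent) * np.exp(1j * random_phase)

energy = float(np.sum(np.abs(consistent) ** 2))
err_consistent = projection_error(consistent)  # numerically zero
err_scrambled = projection_error(scrambled)    # a sizeable fraction of the energy
```

A generated TF array with a low projection error is close to the STFT of some real signal, which is the property a generator of TF features needs for clean inversion to audio.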

Architecture
- Modified DCGAN + preprocessing of the signal so that it can be fed to a GAN

Evaluation
- Dataset - 
    1. Speech Commands dataset
    2. Music, 25 min of Bach piano recordings
- Evaluation using Inception Score and Fréchet Inception Distance (FID)

Conclusion, Future work  -  
1. Consistency measure - Computationally cheap measure to assess quality of TF representation  
2. Extension - Use logarithmic and perceptual frequency scales