# WaveNet: A Generative Model for Raw Audio

Link: https://arxiv.org/abs/1609.03499

Authors: Aäron van den Oord, Karen Simonyan, Nal Kalchbrenner, Sander Dieleman, Oriol Vinyals, Andrew Senior, Heiga Zen, Alex Graves, Koray Kavukcuoglu

Institution: Google DeepMind

Publication: arXiv

Date: 2016



## Background Materials

- https://deepmind.com/blog/wavenet-generative-model-raw-audio/
- Pixel Recurrent Neural Networks https://arxiv.org/abs/1601.06759
- Conditional Image Generation with PixelCNN Decoders https://arxiv.org/abs/1606.05328
- http://musyoku.github.io/2016/09/18/wavenet-a-generative-model-for-raw-audio/


## What is this paper about?

WaveNet, a deep neural network that generates high-quality raw audio waveforms one sample at a time.


## What is the motivation of this research?

Recent neural autoregressive generative models such as PixelCNN are able to model distributions over thousands of random variables. This suggests that similar approaches can succeed in generating wideband raw audio waveforms, which are signals with very high temporal resolution, at least 16000 samples per second.


## What makes this paper different from previous research?

- can generate raw audio directly
- generated speech signals have a naturalness never before reported in TTS
- new architectures based on dilated causal convolutions to deal with long-range temporal dependencies
- a single model can generate different voices when conditioned on speaker identity


## How does this paper achieve it?

The joint probability of a waveform $\boldsymbol{x} = \{x_1,...,x_T\}$ is

$p(\boldsymbol{x}) = \prod_{t=1}^T p(x_t \lvert x_1,..., x_{t-1})$
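The factorization above can be sketched in a few lines. This is only an illustration: `toy_next_sample_probs` is a hypothetical stand-in for the network (it just favors repeating the previous value), and the 4-class alphabet is an assumption chosen to keep the example small.

```python
import numpy as np

def toy_next_sample_probs(history, num_classes=4):
    """Hypothetical stand-in for the network: p(x_t | x_1, ..., x_{t-1})."""
    probs = np.ones(num_classes)
    if len(history) > 0:
        # Toy rule: the previous value is three times as likely as any other.
        probs[history[-1]] += 2.0
    return probs / probs.sum()

def log_likelihood(x, num_classes=4):
    """log p(x) = sum over t of log p(x_t | x_1, ..., x_{t-1})."""
    return sum(
        np.log(toy_next_sample_probs(x[:t], num_classes)[x[t]])
        for t in range(len(x))
    )

def sample(length, num_classes=4, seed=0):
    """Ancestral sampling: draw x_t, append it, then condition on it."""
    rng = np.random.default_rng(seed)
    x = []
    for _ in range(length):
        probs = toy_next_sample_probs(x, num_classes)
        x.append(int(rng.choice(num_classes, p=probs)))
    return x
```

Generation is sequential because each draw depends on all previous draws, while the likelihood of a known waveform needs no sampling at all.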


### DILATED CAUSAL CONVOLUTIONS

Causal convolutions ensure that the prediction $p(x_{t+1}\lvert x_1,..., x_t)$ cannot depend on any of the future timesteps $x_{t+1}, x_{t+2},..., x_T$. For images, the equivalent of a causal convolution is a masked convolution as seen in PixelRNN.

One of the problems of causal convolutions is that they require many layers, or large filters to increase the receptive field.

A dilated convolution is a convolution where the filter is applied over an area larger than its length by skipping input values with a certain step. It is equivalent to a convolution with a larger filter derived from the original filter by dilating it with zeros. This is similar to pooling or strided convolutions, but the output has the same size as the input.

Stacked dilated convolutions enable networks to have very large receptive fields with just a few layers. In the paper, the dilation is doubled for every layer up to a limit and then repeated, e.g. 1, 2, 4, ..., 512, 1, 2, 4, ..., 512.
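A minimal single-channel NumPy sketch of a dilated causal convolution, plus the receptive field of a stack. The function names and the zero left-padding are assumptions of this sketch, not the paper's implementation.

```python
import numpy as np

def causal_dilated_conv1d(x, weights, dilation):
    """y[t] = sum_k weights[k] * x[t - (K-1-k)*dilation], with zeros before t=0."""
    K = len(weights)
    pad = (K - 1) * dilation
    # Padding only on the left means no output can see a future input.
    x_padded = np.concatenate([np.zeros(pad), x])
    y = np.zeros(len(x))
    for k in range(K):
        # Tap k looks (K-1-k)*dilation steps into the past.
        start = k * dilation
        y += weights[k] * x_padded[start:start + len(x)]
    return y

def receptive_field(filter_size, dilations):
    """Number of input samples that can influence one output of the stack."""
    return 1 + sum((filter_size - 1) * d for d in dilations)
```

With filter size 2, `receptive_field(2, [1, 2, 4, 8])` gives 16, and one block of dilations 1, 2, 4, ..., 512 gives 1024, so the receptive field grows exponentially with depth while the parameter count grows only linearly.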

<img src="img/WaveNet_A_Generative_Model_for_Raw_Audio_Figure3.png">


### SOFTMAX DISTRIBUTIONS

van den Oord et al. (2016a) showed that a softmax distribution tends to work better than continuous models such as a GMM for modeling the conditional distribution $p(x_t \lvert x_1,..., x_{t-1})$.

Raw audio is typically stored as a sequence of 16-bit integer values, so a softmax layer would need to output 65536 probabilities per timestep to cover all possible values. To make this tractable, a $\mu$-law companding transformation is applied to the data, which is then quantized to 256 possible values.

$f(x_t) = \mathrm{sign}(x_t)\frac{\ln(1 + \mu |x_t|)}{\ln(1 + \mu)}$

where $-1 \lt x_t \lt 1$ and $\mu = 255$.
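A sketch of the companding step in NumPy. The formula for $f(x_t)$ is the one above; the mapping of the compressed value to integer classes 0..255 (round to nearest) and the decoder are assumptions of this sketch.

```python
import numpy as np

def mu_law_encode(x, mu=255):
    """Compress x in [-1, 1] with the mu-law and quantize to classes 0..mu."""
    x = np.clip(np.asarray(x, dtype=np.float64), -1.0, 1.0)
    compressed = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    # Assumed quantizer: map [-1, 1] linearly onto 0..mu and round to nearest.
    return np.floor((compressed + 1.0) / 2.0 * mu + 0.5).astype(np.int64)

def mu_law_decode(q, mu=255):
    """Approximate inverse: class index back to a value in [-1, 1]."""
    compressed = 2.0 * np.asarray(q, dtype=np.float64) / mu - 1.0
    return np.sign(compressed) * np.expm1(np.abs(compressed) * np.log1p(mu)) / mu
```

Because the compression is logarithmic, the 256 levels are spaced finely near zero and coarsely near ±1, so quiet samples are reconstructed more precisely than loud ones.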


### GATED ACTIVATION UNITS

As in PixelCNN, a gated activation unit is used.

$\boldsymbol{z} = \tanh(W_{f,k} \ast \boldsymbol{x}) \odot \sigma(W_{g,k} \ast \boldsymbol{x})$

where $\ast$ denotes a convolution operator, $\odot$ is an element-wise multiplication operator, $\sigma(\cdot)$ is a sigmoid function, $k$ is the layer index, and $f$ and $g$ denote filter and gate, respectively.
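As a rough NumPy sketch, the gated unit multiplies a tanh "filter" branch by a sigmoid "gate" branch. Here `filter_pre` and `gate_pre` are assumed to be the outputs of the two convolutions on the same input, already computed elsewhere.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gated_activation(filter_pre, gate_pre):
    """Element-wise tanh(filter branch) * sigmoid(gate branch).

    filter_pre and gate_pre stand for the two convolution outputs on the
    same input; both arrays must have the same shape.
    """
    return np.tanh(filter_pre) * sigmoid(gate_pre)
```

The gate lies in (0, 1), so it scales how much of the tanh signal passes through: a strongly negative gate pre-activation shuts the unit off, a strongly positive one leaves the tanh output almost unchanged.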


### RESIDUAL AND SKIP CONNECTIONS

Both residual and parameterized skip connections are used to enable training of much deeper models.
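A hedged sketch of how the two kinds of connection could be wired. Using separate 1x1 weights for the residual and skip paths, and passing the gated unit as a placeholder function `gated_fn`, are assumptions of this sketch; all names are hypothetical.

```python
import numpy as np

def conv1x1(w, x):
    """1x1 convolution as a per-timestep matrix product.

    w: (out_channels, in_channels), x: (in_channels, time).
    """
    return w @ x

def residual_layer(x, z, w_res, w_skip):
    """x: layer input, z: gated activation computed from x."""
    residual_out = x + conv1x1(w_res, z)   # identity path plus learned update
    skip_out = conv1x1(w_skip, z)          # contribution sent to the output stage
    return residual_out, skip_out

def run_stack(x, gated_fn, layer_weights):
    """Chain layers through the residual path and sum all skip outputs."""
    skip_total = 0.0
    for w_res, w_skip in layer_weights:
        z = gated_fn(x)  # placeholder for dilated conv + gated activation
        x, skip = residual_layer(x, z, w_res, w_skip)
        skip_total = skip_total + skip
    return x, skip_total
```

The residual path gives every layer a direct identity route to its input, and the summed skip outputs let every layer contribute directly to the final prediction.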

<img src="img/WaveNet_A_Generative_Model_for_Raw_Audio_Figure4.png">


### CONDITIONAL WAVENETS

Given an additional input $\boldsymbol{h}$, WaveNet can model the conditional distribution $p(\boldsymbol{x} \lvert \boldsymbol{h})$.

$p(\boldsymbol{x} \lvert \boldsymbol{h}) = \prod_{t=1}^T p(x_t \lvert x_1,...,x_{t-1},\boldsymbol{h})$

For example, by feeding in a speaker identity, the speaker can be chosen in a multi-speaker setting.
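For a conditioning vector that is constant over time (such as a one-hot speaker identity), one way to inject $\boldsymbol{h}$ is to add a learned linear projection of it to both branches of the gated unit at every timestep. The sketch below assumes that form; the names `v_f` and `v_g` and the array shapes are illustrative.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def globally_conditioned_gate(filter_pre, gate_pre, h, v_f, v_g):
    """Gated unit with a time-invariant conditioning vector.

    filter_pre, gate_pre: (channels, time) convolution outputs.
    h: (h_dim,) conditioning vector, e.g. a one-hot speaker identity.
    v_f, v_g: (h_dim, channels) projection matrices (hypothetical names).
    """
    bias_f = (v_f.T @ h)[:, None]  # same per-channel offset at every timestep
    bias_g = (v_g.T @ h)[:, None]
    return np.tanh(filter_pre + bias_f) * sigmoid(gate_pre + bias_g)
```

With a one-hot `h`, each speaker simply selects one row of `v_f` and `v_g` as a per-channel bias, so the same convolution weights produce different activations for different speakers.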


### CONTEXT STACKS

Besides increasing the number of dilation stages, using more layers, larger filters, or greater dilation factors, a complementary way to enlarge the receptive field is to use a separate, smaller context stack that processes a long part of the audio signal and locally conditions a larger WaveNet that processes only a smaller part of the audio signal.

I don't understand this part. Does this mean a SampleRNN-like stacked structure?


### Results

The generated audio can be found [here](https://deepmind.com/blog/wavenet-generative-model-raw-audio/).



## Dataset used in this study

- English multi-speaker corpus from CSTR voice cloning toolkit (VCTK)
- single-speaker speech databases from which Google’s North American English and Mandarin Chinese TTS systems are built
- TIMIT dataset for speech recognition

## Implementations

- https://github.com/ibab/tensorflow-wavenet
- https://github.com/tomlepaine/fast-wavenet
- https://github.com/musyoku/wavenet


## Further Readings

a lot