This notebook gives a quick overview of this WaveNet implementation: creating the model and the data set, training the model, and generating samples from it.

In [1]:
import torch
from wavenet_model import *
from audio_data import WavenetDataset
from wavenet_training import *
from model_logging import *

Import of 'jit' requested from: 'numba.decorators', please update to use 'numba.core.decorators' or pin to Numba version 0.48.0. This alias will not be present in Numba version 0.50.0.
  from numba.decorators import jit as optional_jit


## Model
This is an implementation of WaveNet as described in the original paper (https://arxiv.org/abs/1609.03499). Each layer looks like this:

```
            |----------------------------------------|      *residual*
            |                                        |
            |    |-- conv -- tanh --|                |
 -> dilate -|----|                  * ----|-- 1x1 -- + -->  *input*
                 |-- conv -- sigm --|     |
                                         1x1
                                          |
 ---------------------------------------> + ------------->  *skip*
```
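
The diagram can be read as the following minimal sketch. It uses a single channel and scalar weights in plain Python instead of the ``Conv1d`` modules of the actual model. The function names and weight values are hypothetical and chosen only for illustration.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def dilated_conv(x, w_prev, w_curr, dilation):
    """Kernel-size-2 causal convolution: combines x[t - dilation] and x[t]."""
    return [w_prev * x[t - dilation] + w_curr * x[t]
            for t in range(dilation, len(x))]


def gated_layer(x, dilation, filt=(0.5, 0.5), gate=(0.3, -0.2),
                w_res=1.0, w_skip=1.0):
    """One layer of the diagram, reduced to one channel (illustrative only)."""
    f = dilated_conv(x, filt[0], filt[1], dilation)   # conv -> tanh branch
    g = dilated_conv(x, gate[0], gate[1], dilation)   # conv -> sigm branch
    z = [math.tanh(a) * sigmoid(b) for a, b in zip(f, g)]
    # Skip path: 1x1 convolution (here a scalar weight); the caller sums
    # these over all layers.
    skip = [w_skip * v for v in z]
    # Residual path: 1x1 convolution, then add the time-aligned layer input.
    residual = [w_res * v + x_t for v, x_t in zip(z, x[dilation:])]
    return residual, skip


residual, skip = gated_layer([0.1, 0.4, -0.2, 0.3, 0.0, 0.5], dilation=2)
print(len(residual), len(skip))
```

Without padding, each layer shortens its output by ``dilation`` samples, which is why the residual is added to ``x[dilation:]`` rather than to the full input.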

Each layer dilates its input by a factor of two, so the dilation doubles from layer to layer. After each block the dilation is reset and starts again from one. You can define the number of layers in each block (``layers``) and the number of blocks (``blocks``). The blocks are followed by two 1x1 convolutions and a softmax output function.
Because of the dilation operation, independent outputs for multiple successive samples can be calculated efficiently. With ``output_length``, you can define the number of these outputs. Empirically, it seems that a large number of skip channels is required.
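
As a rough illustration of how ``layers`` and ``blocks`` determine the receptive field, the sketch below applies the standard formula for stacked dilated convolutions with kernel size 2. The helper names are hypothetical. The result should agree with ``model.receptive_field`` only if the implementation follows this standard construction, which is an assumption here.

```python
def dilations(layers, blocks):
    """Dilation of every layer: doubles within a block, resets per block."""
    return [2 ** i for _ in range(blocks) for i in range(layers)]


def receptive_field(layers, blocks, kernel_size=2):
    """Each layer widens the receptive field by (kernel_size - 1) * dilation."""
    return 1 + sum((kernel_size - 1) * d for d in dilations(layers, blocks))


print(dilations(2, 2))         # [1, 2, 1, 2]
print(receptive_field(2, 2))   # 7
print(receptive_field(10, 3))  # 3070
```

Under this formula, the small model defined below (``layers=2``, ``blocks=2``) sees 7 samples, while a configuration with 10 layers in each of 3 blocks sees 3070.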

In [2]:
model = WaveNetModel(layers=2,
                     blocks=2,
                     dilation_channels=32,
                     residual_channels=32,
                     skip_channels=1024,
                     end_channels=512, 
                     output_length=16,
                     bias=True)
# model = load_latest_model_from('snapshots')

print('model: ', model)
print('receptive field: ', model.receptive_field)
print('parameter count: ', model.parameter_count())

model:  WaveNetModel(
  (filter_convs): ModuleList(
    (0): Conv1d(32, 32, kernel_size=(2,), stride=(1,))
    (1): Conv1d(32, 32, kernel_size=(2,), stride=(1,))
    (2): Conv1d(32, 32, kernel_size=(2,), stride=(1,))
    (3): Conv1d(32, 32, kernel_size=(2,), stride=(1,))
  )
  (gate_convs): ModuleList(
    (0): Conv1d(32, 32, kernel_size=(2,), stride=(1,))
    (1): Conv1d(32, 32, kernel_size=(2,), stride=(1,))
    (2): Conv1d(32, 32, kernel_size=(2,), stride=(1,))
    (3): Conv1d(32, 32, kernel_size=(2,), stride=(1,))
  )
  (residual_convs): ModuleList(
    (0): Conv1d(32, 32, kernel_size=(1,), stride=(1,))
    (1): Conv1d(32, 32, kernel_size=(1,), stride=(1,))
    (2): Conv1d(32, 32, kernel_size=(1,), stride=(1,))
    (3): Conv1d(32, 32, kernel_size=(1,), stride=(1,))
  )
  (skip_convs): ModuleList(
    (0): Conv1d(32, 1024, kernel_size=(1,), stride=(1,))
    (1): Conv1d(32, 1024, kernel_size=(1,), stride=(1,))
    (2): Conv1d(32, 1024, kernel_size=(1,), stride=(1,))
    (3): Conv1d(3

## Data Set
To create the data set, you have to specify a path to a data set file. If this file already exists, it will be used; otherwise it will be generated. To generate the data set file (a ``.npz`` file), you have to specify the directory (``file_location``) that contains all the audio files you want to use. The attribute ``target_length`` specifies the number of successive samples that are used as a target and corresponds to the output length of the model. The ``item_length`` defines the number of samples in each item of the data set and should always be ``model.receptive_field + model.output_length - 1``.

```
          |----receptive_field----|
                                |--output_length--|
example:  | | | | | | | | | | | | | | | | | | | | |
target:                           | | | | | | | | | |  
```
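
The relation between an example and its target shown in the diagram can be sketched with plain list slicing. The function ``make_item`` is hypothetical and only mirrors the diagram; how ``WavenetDataset`` slices its data internally is an assumption.

```python
def make_item(samples, position, receptive_field, output_length):
    """Cut one (example, target) pair out of a sequence of audio samples."""
    item_length = receptive_field + output_length - 1
    # Read one extra sample so that the last input window also has a target.
    window = samples[position:position + item_length + 1]
    example = window[:item_length]
    # Each target is the sample that directly follows one input window.
    target = window[-output_length:]
    return example, target


samples = list(range(100))  # stand-in for quantized audio
example, target = make_item(samples, 0, receptive_field=7, output_length=16)
print(len(example), len(target))
```

With a receptive field of 7 and an output length of 16, the example holds 22 samples and the first target is the sample that follows the first 7 inputs.
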
To create a test set, you should define a ``test_stride``. Every ``test_stride``-th item is then assigned to the test set.
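
A stride-based split of this kind can be sketched as follows. The function ``split_indices`` is hypothetical, and which offset within each stride ends up in the test set is an assumption, not necessarily what ``WavenetDataset`` does.

```python
def split_indices(total_items, test_stride):
    """Assign every test_stride-th item to the test set, the rest to training."""
    train, test = [], []
    for i in range(total_items):
        # Assumption: the last item of each group of test_stride items
        # is held out for testing.
        if (i + 1) % test_stride == 0:
            test.append(i)
        else:
            train.append(i)
    return train, test


train, test = split_indices(total_items=2000, test_stride=500)
print(len(train), len(test))
```

With ``test_stride=500``, roughly 0.2% of the items are held out, so the test set stays small relative to the training set.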

In [3]:
data = WavenetDataset(dataset_file='train_samples/test_dataset.npz',
                      item_length=model.receptive_field + model.output_length - 1,
                      target_length=model.output_length,
                      #file_location='bowie_wav',
                      test_stride=500)
print('the dataset has ' + str(len(data)) + ' items')

one hot input
the dataset has 215994 items


## Training and Logging
This implementation supports logging with TensorBoard (you need to have TensorFlow installed). You can even generate audio samples from the current snapshot of the model during training. This happens in a background thread on the CPU, so it does not interfere with the actual training, but it is rather slow. If you don't have TensorFlow, you can use the standard logger, which prints to the console.
The trainer uses Adam as its default optimizer.
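
The logger used below takes three intervals (``log_interval``, ``validation_interval``, ``generate_interval``). A plausible reading is that each action fires whenever the training step is a multiple of its interval. The sketch below illustrates that reading only; the function ``events_at`` is hypothetical and the actual trigger logic of the logger is an assumption.

```python
def events_at(step, log_interval=200, validation_interval=400,
              generate_interval=1000):
    """Return the actions that would fire at a given training step."""
    events = []
    if step % log_interval == 0:
        events.append("log")
    if step % validation_interval == 0:
        events.append("validate")
    if step % generate_interval == 0:
        events.append("generate")
    return events


for step in (200, 400, 1000, 2000):
    print(step, events_at(step))
```

Under this assumption, audio generation would start at steps 1000, 2000, and so on, each time loading the latest snapshot written by the trainer.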

In [4]:
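def generate_and_log_samples(step):
    # Passed to the logger as `generate_function`; relies on the global
    # `logger` that is defined right after this function.
    sample_length = 320
    # Generate from the most recent snapshot, not from the model in training.
    gen_model = load_latest_model_from('snapshots')
    print("start generating...")

    # One clip at a low temperature (more conservative sampling) ...
    samples = generate_audio(gen_model,
                             length=sample_length,
                             temperatures=[0.5])
    logger.audio_summary('temperature_0.5', samples, step, sr=16000)

    # ... and one at temperature 1.0 (unmodified output distribution).
    samples = generate_audio(gen_model,
                             length=sample_length,
                             temperatures=[1.])
    logger.audio_summary('temperature_1.0', samples, step, sr=16000)
    print("audio clips generated")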


logger = TensorboardLogger(log_interval=200,
                           validation_interval=400,
                           generate_interval=1000,
                           generate_function=generate_and_log_samples,
                           log_dir="logs/test_model")


In [5]:
trainer = WavenetTrainer(model=model,
                         dataset=data,
                         lr=0.001,
                         snapshot_path='snapshots',
                         snapshot_name='test_model',
                         snapshot_interval=2000,
                         logger=logger)

print('start training...')
trainer.train(batch_size=16,
              epochs=1)

start training...
epoch 0
one training step does take approximately 0.06699719905853271 seconds)
load model snapshots/chaconne_model_2017-12-28_16-44-12
start generating...




one generating step does take approximately 0.0718166732788086 seconds)
one generating step does take approximately 0.062008402347564696 seconds)
audio clips generated
load model snapshots/test_model_2020-05-15_07-26-53
start generating...
one generating step does take approximately 0.012911627292633057 seconds)
one generating step does take approximately 0.016660680770874025 seconds)
audio clips generated
load model snapshots/test_model_2020-05-15_07-26-53
start generating...
one generating step does take approximately 0.012651748657226562 seconds)
one generating step does take approximately 0.011625595092773437 seconds)
audio clips generated
load model snapshots/test_model_2020-05-15_07-29-34
start generating...
one generating step does take approximately 0.009312989711761475 seconds)
one generating step does take approximately 0.014244542121887208 seconds)
audio clips generated
load model snapshots/test_model_2020-05-15_07-29-34
start generating...
one generating step does take appr

## Generating
This model implements the Fast Wavenet Generation Algorithm (https://arxiv.org/abs/1611.09482). Generation with it might run faster on the CPU. You can give some starting data (at least as long as the receptive field) or let the model generate from zero. In my experience, a temperature between 0.5 and 1.0 yields the best results, but this may depend on the data set.
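
The core idea of fast generation is to cache, for every layer, the inputs that its dilated convolution will need again, so that each new sample costs one small update per layer instead of a full pass over the receptive field. The sketch below illustrates this with one channel, scalar weights and no gating. All names are hypothetical and this is not the code behind ``generate_fast``.

```python
from collections import deque


def naive_stack(x, dilations, w_prev=0.5, w_curr=0.25):
    """Recompute every layer over the whole sequence (zero-padded, causal)."""
    for d in dilations:
        x = [w_prev * (x[t - d] if t >= d else 0.0) + w_curr * x[t]
             for t in range(len(x))]
    return x


class CachedLayer:
    """Kernel-size-2 dilated convolution that remembers its last inputs."""

    def __init__(self, dilation, w_prev=0.5, w_curr=0.25):
        # The queue holds exactly `dilation` past inputs, oldest first.
        self.queue = deque([0.0] * dilation)
        self.w_prev, self.w_curr = w_prev, w_curr

    def step(self, x_t):
        x_old = self.queue.popleft()  # the input from `dilation` steps ago
        self.queue.append(x_t)
        return self.w_prev * x_old + self.w_curr * x_t


def cached_stack(x, dilations):
    """Process one sample at a time, touching each layer once per sample."""
    layers = [CachedLayer(d) for d in dilations]
    out = []
    for x_t in x:
        for layer in layers:
            x_t = layer.step(x_t)
        out.append(x_t)
    return out


signal = [1.0, 2.0, -1.0, 0.5, 3.0, 0.0, 1.5, -2.0]
print(cached_stack(signal, [1, 2, 1, 2]))
```

Both functions produce the same values, but the cached version does a constant amount of work per layer and sample. In real autoregressive generation, the input at each step would be the sample drawn at the previous step rather than a given signal.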

In [6]:
start_data = data[25000][0] # use start data from the data set
start_data = torch.max(start_data, 0)[1] # convert one hot vectors to integers

def prog_callback(step, total_steps):
    print(str(100 * step // total_steps) + "% generated")

generated = model.generate_fast(num_samples=1600,
                                 first_samples=start_data,
                                 progress_callback=prog_callback,
                                 progress_interval=1000,
                                 temperature=1.0,
                                 regularize=0.)

0% generated
one generating step does take approximately 0.01148219347000122 seconds)
61% generated
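
The effect of the ``temperature`` argument can be illustrated with a small sketch. It assumes the common definition in which the network's pre-softmax outputs are divided by the temperature before the softmax; the function names are hypothetical and the exact handling inside ``generate_fast`` is an assumption.

```python
import math
import random


def temperature_softmax(logits, temperature=1.0):
    """Softmax over logits / temperature (numerically stabilized)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [v / temperature for v in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_index(logits, temperature=1.0, rng=random):
    """Draw one class index from the temperature-adjusted distribution."""
    probs = temperature_softmax(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [1.0, 2.0, 3.0]
print(temperature_softmax(logits, 1.0))
print(temperature_softmax(logits, 0.5))
print(sample_index(logits, 0.5, random.Random(0)))
```

A lower temperature concentrates probability on the most likely class, which makes the generated audio more conservative; a temperature of 1.0 samples from the unmodified distribution.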


In [7]:
import IPython.display as ipd

ipd.Audio(generated, rate=16000)