<a href="https://colab.research.google.com/github/NolanRink/CS4540/blob/main/HW13/WaveNet_demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook gives a quick overview of this WaveNet implementation: creating the model and the data set, training the model, and generating samples from it.

In [13]:
!git clone https://github.com/Braedennnnn/pytorch-wavenet.git

fatal: destination path 'pytorch-wavenet' already exists and is not an empty directory.


In [21]:
import sys
sys.path.append("pytorch-wavenet")

import torch
from wavenet_model import *
from audio_data import WavenetDataset
from wavenet_training import *
from model_logging import *

## Model
This is an implementation of WaveNet as it was described in the original paper (https://arxiv.org/abs/1609.03499). Each layer looks like this:

```
            |----------------------------------------|      *residual*
            |                                        |
            |    |-- conv -- tanh --|                |
 -> dilate -|----|                  * ----|-- 1x1 -- + -->  *input*
                 |-- conv -- sigm --|     |
                                         1x1
                                          |
 ---------------------------------------> + ------------->  *skip*
```
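
To make the diagram concrete, here is a minimal single-channel sketch of one layer in plain Python. It is not the code from ``wavenet_model``: the function names and the weights are hypothetical, the two 1x1 convolutions are treated as identity, and the dilation is written as an explicit index offset rather than the reshaping a real implementation may use.

```python
import math

def dilated_causal_conv(x, w_past, w_now, dilation):
    """Kernel-size-2 convolution: y[t] = w_past * x[t - dilation] + w_now * x[t]."""
    return [w_past * x[t - dilation] + w_now * x[t]
            for t in range(dilation, len(x))]

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def gated_layer(x, dilation, filter_w=(0.5, 0.5), gate_w=(1.0, -1.0)):
    """One layer of the diagram for a single channel (weights are made up)."""
    f = dilated_causal_conv(x, filter_w[0], filter_w[1], dilation)  # conv -- tanh branch
    g = dilated_causal_conv(x, gate_w[0], gate_w[1], dilation)      # conv -- sigm branch
    gated = [math.tanh(a) * sigmoid(b) for a, b in zip(f, g)]       # the "*" node
    skip = gated                                                    # 1x1 conv assumed identity
    # Residual connection: add the (trimmed) layer input to the gated output.
    residual = [xi + zi for xi, zi in zip(x[dilation:], gated)]
    return residual, skip

residual, skip = gated_layer([0.1, 0.4, -0.2, 0.3, 0.0, 0.5], dilation=2)
print(len(residual), len(skip))  # each output is `dilation` samples shorter than the input
```

Note that the output is shorter than the input by the dilation, which is why a stack of such layers needs a full receptive field of input before it can produce its first sample.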

Each layer dilates its input by a factor of two relative to the previous layer. After each block the dilation is reset and starts again from one. You can define the number of layers in each block (``layers``) and the number of blocks (``blocks``). The blocks are followed by two 1x1 convolutions and a softmax output function.
Because of the dilation operation, the independent outputs for multiple successive samples can be calculated efficiently. With ``output_length`` you define the number of these outputs. Empirically, it seems that a large number of skip channels is required.
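
The relation between ``layers``, ``blocks`` and the receptive field can be sketched as follows. This is a stand-alone helper (the name ``receptive_field`` as a function is hypothetical), assuming kernel size 2 and dilations 1, 2, 4, ... that restart in every block, as described above. For the demo configuration used in this notebook (``layers=4``, ``blocks=1``) it gives 16, which agrees with the value the model itself reports.

```python
def receptive_field(layers, blocks, kernel_size=2):
    """Number of input samples that influence a single output sample."""
    field = 1
    for _ in range(blocks):
        dilation = 1                      # dilation is reset at the start of each block
        for _ in range(layers):
            field += (kernel_size - 1) * dilation
            dilation *= 2                 # each layer doubles the dilation
    return field

print(receptive_field(layers=4, blocks=1))    # 16
print(receptive_field(layers=10, blocks=3))   # 3070
```

At a sample rate of 16 kHz, 16 samples cover only 1 ms of audio, while 3070 samples cover roughly 0.19 s, so the receptive field grows quickly with depth.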

In [22]:
# initialize cuda option
dtype = torch.FloatTensor # data type
ltype = torch.LongTensor # label type

use_cuda = torch.cuda.is_available()
if use_cuda:

    print('use gpu')
    dtype = torch.cuda.FloatTensor
    ltype = torch.cuda.LongTensor

use gpu


In [23]:
model = WaveNetModel(layers=4,
                     blocks=1,
                     dilation_channels=12,
                     residual_channels=12,
                     skip_channels=32,
                     end_channels=24,
                     output_length=12,
                     dtype=dtype,
                     bias=True)
# model = load_latest_model_from('snapshots', use_cuda=use_cuda)

print('model: ', model)
print('receptive field: ', model.receptive_field)
print('parameter count: ', model.parameter_count())

model:  WaveNetModel(
  (filter_convs): ModuleList(
    (0-3): 4 x Conv1d(12, 12, kernel_size=(2,), stride=(1,))
  )
  (gate_convs): ModuleList(
    (0-3): 4 x Conv1d(12, 12, kernel_size=(2,), stride=(1,))
  )
  (residual_convs): ModuleList(
    (0-3): 4 x Conv1d(12, 12, kernel_size=(1,), stride=(1,))
  )
  (skip_convs): ModuleList(
    (0-3): 4 x Conv1d(12, 32, kernel_size=(1,), stride=(1,))
  )
  (start_conv): Conv1d(256, 12, kernel_size=(1,), stride=(1,))
  (end_conv_1): Conv1d(32, 24, kernel_size=(1,), stride=(1,))
  (end_conv_2): Conv1d(24, 256, kernel_size=(1,), stride=(1,))
)
receptive field:  16
parameter count:  14964
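
As a sanity check, the parameter count can be reproduced by hand from the layer shapes printed above. The helper names below are hypothetical; the sketch only assumes that every ``Conv1d`` has ``in * out * kernel_size`` weights plus ``out`` biases.

```python
def conv1d_params(in_ch, out_ch, kernel_size, bias=True):
    """Weights plus optional biases of a single Conv1d layer."""
    return in_ch * out_ch * kernel_size + (out_ch if bias else 0)

def wavenet_param_count(layers, blocks, classes, dilation_ch, residual_ch,
                        skip_ch, end_ch, kernel_size=2, bias=True):
    n = layers * blocks
    total = conv1d_params(classes, residual_ch, 1, bias)                       # start_conv
    total += n * conv1d_params(residual_ch, dilation_ch, kernel_size, bias)    # filter_convs
    total += n * conv1d_params(residual_ch, dilation_ch, kernel_size, bias)    # gate_convs
    total += n * conv1d_params(dilation_ch, residual_ch, 1, bias)              # residual_convs
    total += n * conv1d_params(dilation_ch, skip_ch, 1, bias)                  # skip_convs
    total += conv1d_params(skip_ch, end_ch, 1, bias)                           # end_conv_1
    total += conv1d_params(end_ch, classes, 1, bias)                           # end_conv_2
    return total

print(wavenet_param_count(layers=4, blocks=1, classes=256, dilation_ch=12,
                          residual_ch=12, skip_ch=32, end_ch=24))  # 14964
```

Most of the parameters sit in ``start_conv`` (3084) and ``end_conv_2`` (6400), which map to and from the 256 output classes.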


## Data Set
To create the data set, you have to specify a path to a data set file. If this file already exists, it will be used; if not, it will be generated. If you want to generate the data set file (a ``.npz`` file), you have to specify the directory (``file_location``) in which all the audio files you want to use are located. The attribute ``target_length`` specifies the number of successive samples that are used as a target and corresponds to the output length of the model. The ``item_length`` defines the number of samples in each item of the data set and should always be ``model.receptive_field + model.output_length - 1``.

```
          |----receptive_field----|
                                |--output_length--|
example:  | | | | | | | | | | | | | | | | | | | | |
target:                           | | | | | | | | | |  
```
To create a test set, you should define a ``test_stride``. Then each ``test_stride``th item will be assigned to the test set.
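
The arithmetic behind ``item_length`` and the test split can be sketched like this. The first output needs ``receptive_field`` input samples, and every further output needs one more. The helper names are hypothetical, and the exact rule for which item lands in the test set is an assumption (here: every ``test_stride``-th item, counting from one); the actual ``WavenetDataset`` may index differently.

```python
def item_length(receptive_field, output_length):
    """Input samples needed to produce `output_length` successive outputs."""
    return receptive_field + output_length - 1

def split_indices(num_items, test_stride):
    """Assumed rule: items number test_stride, 2*test_stride, ... go to the test set."""
    test = [i for i in range(num_items) if (i + 1) % test_stride == 0]
    train = [i for i in range(num_items) if (i + 1) % test_stride != 0]
    return train, test

print(item_length(receptive_field=16, output_length=12))  # 27

train, test = split_indices(num_items=10, test_stride=5)
print(test)        # [4, 9]
print(len(train))  # 8
```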

In [24]:
data = WavenetDataset(dataset_file='pytorch-wavenet/train_samples/bach_chaconne/dataset.npz',
                      item_length=model.receptive_field + model.output_length - 1,
                      target_length=model.output_length,
                      file_location='pytorch-wavenet/train_samples/bach_chaconne',
                      test_stride=500)
print('the dataset has ' + str(len(data)) + ' items')

one hot input
the dataset has 797955 items


## Training and Logging
This implementation supports logging with TensorBoard (you need to have TensorFlow installed). You can even generate audio samples from the current snapshot of the model during training. This happens in a background thread on the CPU, so it does not interfere with the actual training, but it is rather slow. If you don't have TensorFlow, you can use the standard logger, which prints to the console.
The trainer uses Adam as its default optimizer.

In [25]:
def generate_and_log_samples(step):
    sample_length=32000
    gen_model = load_latest_model_from('snapshots', use_cuda=False)
    print("start generating...")
    samples = generate_audio(gen_model,
                             length=sample_length,
                             temperatures=[0.5])
    tf_samples = tf.convert_to_tensor(samples, dtype=tf.float32)
    logger.audio_summary('temperature_0.5', tf_samples, step, sr=16000)

    samples = generate_audio(gen_model,
                             length=sample_length,
                             temperatures=[1.])
    tf_samples = tf.convert_to_tensor(samples, dtype=tf.float32)
    logger.audio_summary('temperature_1.0', tf_samples, step, sr=16000)
    print("audio clips generated")


# logger = TensorboardLogger(log_interval=200,
#                            validation_interval=400,
#                            generate_interval=1000,
#                            generate_function=generate_and_log_samples,
#                            log_dir="logs/chaconne_model")

logger = Logger(log_interval=200,
                validation_interval=400,
                generate_interval=1000)

In [26]:
!mkdir -p /content/pytorch-wavenet/train_samples/bach_chaconne

!wget -q \
    https://raw.githubusercontent.com/vincentherrmann/pytorch-wavenet/master/train_samples/bach_chaconne/dataset.npz \
    -O /content/pytorch-wavenet/train_samples/bach_chaconne/dataset.npz

In [27]:
# move the model to the GPU only if one is available
if use_cuda:
    model.cuda()
trainer = WavenetTrainer(model=model,
                         dataset=data,
                         lr=0.001,
                         snapshot_path='/content/pytorch-wavenet/snapshots',
                         snapshot_name='chaconne_model',
                         snapshot_interval=1000,
                         logger=logger,
                         dtype=dtype,
                         ltype=ltype)

import os
print("CWD is now", os.getcwd())
print('start training...')
trainer.train(batch_size=32,epochs=2)

CWD is now /content
start training...
epoch 0
one training step does take approximately 0.298240966796875 seconds)
loss at step 200: 5.251699204444885
loss at step 400: 4.81243955373764
validation loss: 4.808670635223389
validation accuracy: 1.4853033145716072%
loss at step 600: 4.588126406669617
loss at step 800: 4.460306897163391
validation loss: 4.534218521118164
validation accuracy: 4.539295392953929%
loss at step 1000: 4.377351536750793
loss at step 1200: 4.332708818912506
validation loss: 4.4350396299362185
validation accuracy: 4.997915363769023%
loss at step 1400: 4.308866822719574
loss at step 1600: 4.276740998029709
validation loss: 4.390761938095093
validation accuracy: 5.393996247654784%
loss at step 1800: 4.2575393748283386
loss at step 2000: 4.218745995759964
validation loss: 4.300880098342896
validation accuracy: 5.670210548259329%
loss at step 2200: 4.1682913053035735
loss at step 2400: 4.153820338249207
validation loss: 4.222279348373413
validation accuracy: 5.972482801

KeyboardInterrupt: 
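
To put the logged numbers in context, it helps to know what a model that guesses uniformly among the 256 output classes would score. The sketch below assumes the trainer reports the usual mean cross-entropy in nats (the notebook does not state this explicitly); the function name is hypothetical.

```python
import math

def uniform_baseline(num_classes):
    """Cross-entropy (in nats) and accuracy (in percent) of a uniform guess."""
    loss = math.log(num_classes)
    accuracy_percent = 100.0 / num_classes
    return loss, accuracy_percent

loss, acc = uniform_baseline(256)
print(round(loss, 3), acc)  # 5.545 0.390625
```

Under that assumption, the loss of about 5.25 at step 200 is only slightly better than chance, while the later validation loss of about 4.22 and accuracy of nearly 6% are well above the uniform baseline, even though the run was interrupted early.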

## Generating
This model implements the Fast Wavenet Generation Algorithm (https://arxiv.org/abs/1611.09482), which might run faster on the CPU than on the GPU. You can give some starting data (at least as long as the receptive field) or let the model generate from zero. In my experience, a temperature between 0.5 and 1.0 yields the best results, but this may depend on the data set.
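
The effect of the temperature can be illustrated with a small stand-alone sketch. It shows the common way of applying a temperature (dividing the logits before the softmax) and is not taken from ``generate_fast``, whose internals may differ; the function names are hypothetical.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Softmax over logits / temperature; lower temperatures sharpen the distribution."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                              # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_class(logits, temperature=1.0, rng=random):
    """Draw one class index from the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [1.0, 2.0, 3.0]
print(softmax_with_temperature(logits, 1.0))  # softer distribution
print(softmax_with_temperature(logits, 0.5))  # more mass on the most likely class
print(sample_class(logits, temperature=0.5, rng=random.Random(0)))
```

A temperature below 1.0 makes the sampling more conservative, and a temperature of 1.0 samples from the unmodified model distribution.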

In [28]:
start_data = data[250000][0] # use start data from the data set
start_data = torch.max(start_data, 0)[1] # convert one hot vectors to integers

def prog_callback(step, total_steps):
    print(str(100 * step // total_steps) + "% generated")

model.cpu()
generated = model.generate_fast(num_samples=160000,
                                 first_samples=start_data,
                                 progress_callback=prog_callback,
                                 progress_interval=1000,
                                 temperature=1.0,
                                 regularize=0.)

0% generated
one generating step does take approximately 0.002591671943664551 seconds)
0% generated
1% generated
1% generated
2% generated
3% generated
3% generated
4% generated
4% generated
5% generated
6% generated
6% generated
7% generated
8% generated
8% generated
9% generated
9% generated
10% generated
11% generated
11% generated
12% generated
13% generated
13% generated
14% generated
14% generated
15% generated
16% generated
16% generated
17% generated
18% generated
18% generated
19% generated
19% generated
20% generated
21% generated
21% generated
22% generated
23% generated
23% generated
24% generated
24% generated
25% generated
26% generated
26% generated
27% generated
28% generated
28% generated
29% generated
29% generated
30% generated
31% generated
31% generated
32% generated
33% generated
33% generated
34% generated
34% generated
35% generated
36% generated
36% generated
37% generated
38% generated
38% generated
39% generated
39% generated
40% generated
41% generated
41% g

In [29]:
import IPython.display as ipd

ipd.Audio(generated, rate=16000)

I hit Colab's usage cap, so the tiny WaveNet could not finish its two quick training passes, and the generated audio came out as hissy, chopped loops. The model could improve with more runtime: training for more epochs on a larger data set, and deepening the stack with wider dilations so that it hears several seconds at once. Weight normalization, 8-bit μ-law, and a light vocoder like LPCNet can further clean up the sound.