# Audio Machine Learning - Workshop Week 5 - Convolutional Neural Networks

## 0 - Import Libraries 

In [127]:
import numpy as np
import torch
import torchaudio
import IPython.display  # needed for IPython.display.Audio in section 7
from torch.utils.data import Dataset, DataLoader

## 1 - Fun with Tensors!

Manipulating PyTorch tensors is very useful when creating machine learning models.\
This section has some exercises based on manipulating tensors.\
You can complete all the exercises using the functions/methods listed below.

#### PyTorch tensor operations:

 - torch.permute(): Returns the original tensor with its dimensions permuted
 - torch.transpose(): Swap dimensions
 - torch.unsqueeze(): Add a dimension of size 1
 - torch.squeeze(): Remove dimensions of size 1
 - torch.stack(): Combine tensors along new dimension
 - torch.cat(): Concatenate tensors along existing dimension

These can also be called as a method directly on a tensor, as in:\
x = torch.tensor(#arguments)\
x_permuted = x.permute(#arguments)
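
As a rough illustration of the operations listed above, the sketch below applies each one to a small tensor and prints the resulting shape. The tensor `x` and its shape (2, 3, 5) are arbitrary choices made for this example and are not part of the exercises.

```python
import torch

x = torch.zeros(2, 3, 5)  # arbitrary example shape: (2, 3, 5)

permuted = torch.permute(x, (2, 0, 1))          # reorder all dimensions
swapped = torch.transpose(x, 0, 2)              # swap two dimensions
unsqueezed = torch.unsqueeze(x, 1)              # insert a size-1 dimension at position 1
squeezed = torch.squeeze(torch.zeros(2, 1, 5))  # drop every size-1 dimension
stacked = torch.stack([x, x], dim=0)            # join along a NEW dimension
concatenated = torch.cat([x, x], dim=0)         # join along an EXISTING dimension

print(permuted.shape)      # torch.Size([5, 2, 3])
print(swapped.shape)       # torch.Size([5, 3, 2])
print(unsqueezed.shape)    # torch.Size([2, 1, 3, 5])
print(squeezed.shape)      # torch.Size([2, 5])
print(stacked.shape)       # torch.Size([2, 2, 3, 5])
print(concatenated.shape)  # torch.Size([4, 3, 5])

# The method form gives the same shape as the function form
assert x.permute(2, 0, 1).shape == permuted.shape
```

Note the difference between the last two: `stack` adds a dimension, while `cat` makes an existing dimension longer.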

If you are unsure how to use a function, or it doesn't work how you expect, look at the documentation!

#### Exercise 1 - Reshaping audio batches
'audio_1' contains a batch of 4 mono audio signals, each 1 second long at a 44.1 kHz sampling rate. Change the shape of the tensor to prepare it for input to a 1D convolutional layer.

If you're unsure what shape this should be, look at the expected input shape in the PyTorch documentation:

https://pytorch.org/docs/stable/generated/torch.nn.Conv1d.html

In [66]:
# In these examples we use torch.randn(dim1, dim2, etc) to generate tensors of a specific shape holding random values
audio_1 = torch.randn(4, 44100) 

## TODO - reshape the audio batch here!
reshaped_audio = None

In [67]:
# Test your answer
assert reshaped_audio.shape == (4, 1, 44100), f"Expected shape (4, 1, 44100), got {reshaped_audio.shape}"
assert torch.max(torch.abs(reshaped_audio.squeeze() - audio_1)) == 0, f"The values in audio_1 have changed!"

#### Exercise 2 - This batch of audio has its dimensions all back to front...
'audio_2' has the following shape: (time, batch, channel)

Change the dimensions of the tensor for input to a 1D convolutional layer!

In [88]:
audio_2 = torch.randn(44100, 10, 2)

## TODO - change the dimensions of the audio batch here!
reshaped_audio = None

In [90]:
# Test your answer
assert reshaped_audio.shape == (10, 2, 44100), f"Expected shape (10, 2, 44100), got {reshaped_audio.shape}"
assert max([torch.max(torch.abs(audio_2[:, n, 0] - reshaped_audio[n, 0, :])) for n in range(10)]) == 0, f"The values in the audio have changed!"
assert max([torch.max(torch.abs(audio_2[:, n, 1] - reshaped_audio[n, 1, :])) for n in range(10)]) == 0, f"The values in the audio have changed!"

#### Exercise 3 - Assembling a Tensor
Below we have a list, 'audio_list', that holds four 1-second long clips of audio.

Create a batch out of these tensors. You should be able to use one function to create a (4, 44100) tensor, and another function to add the channel dimension. Don't change the order the tensors appear in.

In [96]:
audio_list = [torch.randn(44100), torch.randn(44100),torch.randn(44100),torch.randn(44100)]

## TODO - create the audio batch!
reshaped_audio = None

In [97]:
# Test your answer
assert reshaped_audio.shape == (4, 1, 44100), f"Expected shape (4, 1, 44100), got {reshaped_audio.shape}"
assert max([torch.max(torch.abs(audio_list[n][:] - reshaped_audio[n, 0, :])) for n in range(4)]) == 0, f"The values in the audio have changed!"

#### Exercise 4 - Assembling another Tensor!
Now, we have a clip of stereo audio, that has been chopped up into three segments!

We want to reassemble it, so we have one long clip of stereo audio. The segments should appear in numerical order.

Don't add the batch dimension this time.

In [None]:
seg_1, seg_2, seg_3 = torch.randn(2, 44100), torch.randn(2, 44100), torch.randn(2, 44100)

## TODO - Put the audio back together!
reshaped_audio = None

In [None]:
# Test your answer
assert reshaped_audio.shape == (2, 44100*3), f"Expected shape (2, 132300), got {reshaped_audio.shape}"
assert torch.max(torch.abs(seg_1 - reshaped_audio[:, 0:44100])) == 0, f"The values in the audio have changed!"
assert torch.max(torch.abs(seg_2 - reshaped_audio[:, 44100:2*44100])) == 0, f"The values in the audio have changed!"
assert torch.max(torch.abs(seg_3 - reshaped_audio[:, 44100*2:44100*3])) == 0, f"The values in the audio have changed!"

#### Exercise 5 - Retrieving a single training example and channel
Now we have a batch of training data. Can you get the 2nd channel of the 4th data item in the batch?

In [139]:
batch = torch.randn(10, 5, 44100)
## TODO - Get the data requested above
item = None

In [140]:
# Test your answer
assert item.shape == (44100,), f"Expected shape (44100), got {item.shape}"
assert torch.allclose(batch.flatten()[16*44100:17*44100], item)

#### Exercise 6 - Retrieving a single training example whilst retaining the batch dimension
Sometimes we want to get a single training example, but we want to keep the 'batch dimension' (which will be of size 1 for a single training example). \
Compare what happens to the shape when:
- Indexing a tensor dimension with a single value

In [143]:
x = torch.randn(10,20)
x[0, :].shape

torch.Size([20])

- and 'slicing' with a range of 1

In [144]:
x[0:1, :].shape

torch.Size([1, 20])

Both return the same values; however, the second one keeps the first dimension.

Use this to get the same data as in the previous exercise, but retaining the batch dimension and the channel dimension.

In [145]:
## TODO - Get the 2nd channel of the 4th data item in 'batch', but retain the channel and batch dimensions
item = None

In [146]:
# Test your answer
assert item.shape == (1, 1, 44100), f"Expected shape (1, 1, 44100), got {item.shape}"
assert torch.allclose(batch.flatten()[16*44100:17*44100], item)

## 2 -  Using a 1D-Convolutional layer

Define a 1D convolutional layer in PyTorch. It should be able to receive stereo audio as input. It should output 16 channels of audio. It should have a kernel size of 5.

Look up how to make a 1D convolutional layer with those specifications in the PyTorch docs!

In [None]:
## TODO
layer = torch.nn.Conv1d(# Put the appropriate class constructor arguments here.)

Now we will create some audio to process with the layer. You can process the audio using the layer.forward() method.

In [148]:
input_audio = torch.randn(5, 2, 44100)

## TODO - process the 'input_audio' using the layer.forward() function, and save the output to 'output_audio'
output_audio = None

---
**NOTE**

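You can also call the 'forward' method of a PyTorch module by calling the module itself:

output = layer(input_audio)

This is actually how the forward method is usually called. It is the recommended way because it also runs any hooks registered on the module; for this layer both give the same result!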
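
To make the note above concrete, here is a minimal sketch comparing the two ways of calling a module. The layer `demo_layer` and its settings (1 input channel, 4 output channels, kernel size 3) are hypothetical values chosen for illustration; they are deliberately different from the layer requested in the exercise.

```python
import torch

torch.manual_seed(0)

# Hypothetical layer and input, chosen only for this illustration
demo_layer = torch.nn.Conv1d(in_channels=1, out_channels=4, kernel_size=3)
demo_input = torch.randn(2, 1, 100)  # (batch, channels, time)

out_call = demo_layer(demo_input)             # usual way: call the module
out_forward = demo_layer.forward(demo_input)  # explicit call to forward()

print(out_call.shape)                         # torch.Size([2, 4, 98])
print(torch.allclose(out_call, out_forward))  # True
```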

---

Now examine the shape of the output audio. Which dimensions have changed, and which have stayed the same? 

In [None]:
## TODO - Look at the shape of 'output_audio'

## 3 - Adding a Nonlinearity, and zero-padding

Now create a convolutional layer, and a nonlinear activation function layer. Here is an example nonlinear block you can use:

https://pytorch.org/docs/stable/generated/torch.nn.Sigmoid.html

In [128]:
## TODO
# Create a PyTorch 1d convolutional layer - which takes a single-channel of audio as input
# Create a nonlinear activation function layer

Process the following input batch through these layers, and determine how much zero-padding is required for the output to have the same sequence length as the input.

In [None]:
input_batch = torch.randn(10, 1, 44100)

## TODO
# Process 'input_batch' through the two layers and save the output to a new variable called 'output'
output = None

## TODO
# Determine how much shorter the output signal is in comparison to the input signal
# Save this to a variable called 'pad'


Now apply zero padding to the input before processing it with the convolutional layer.

The zero padding should be applied before the start of the signal only, to ensure the resulting system is causal.

You can do this by creating an appropriately sized tensor of zeros, using torch.zeros(), then concatenating this to the input signal.
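
The mechanics of this kind of left-side padding can be sketched on a tiny tensor. The values below, including `n_pad = 3`, are hypothetical and chosen only so the result is easy to read; working out the right amount of padding for your layer is still part of the exercise. The `torch.nn.functional.pad` call is shown as one possible alternative, not as the required method.

```python
import torch

# Tiny example signal: (batch=1, channel=1, time=6)
x = torch.arange(1., 7.).reshape(1, 1, 6)

n_pad = 3  # hypothetical amount of padding, for illustration only

# Zeros with the same batch and channel sizes as x, and n_pad time steps
zeros = torch.zeros(x.shape[0], x.shape[1], n_pad)

# Concatenate along the time dimension: zeros first, then the signal
x_padded = torch.cat([zeros, x], dim=2)

print(x_padded.shape)  # torch.Size([1, 1, 9])
print(x_padded)        # tensor([[[0., 0., 0., 1., 2., 3., 4., 5., 6.]]])

# Alternative: pad the last dimension with (left, right) = (n_pad, 0)
x_padded_alt = torch.nn.functional.pad(x, (n_pad, 0))
assert torch.equal(x_padded, x_padded_alt)
```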

In [None]:
## TODO
# Create some zeros using torch.zeros(). 
# Choose the shape according to the dimensions of 'input_batch' and the calculated 'pad' from earlier
# Join the zeros to input_batch, to create one tensor of shape (10, 1, 44100+pad)

# Process the input with the convolutional layer and nonlinearity
output = None

In [None]:
## This will test if the correct amount of zero-padding was applied
assert output.shape[0] == input_batch.shape[0]
assert output.shape[2] == input_batch.shape[2]

## 4 - Create 'ConvBlock' Class

We usually build 'deep' neural networks out of repeated smaller neural network blocks. Let's build a simple 1D convolutional network block!

Use the code from the previous section, and the following template:

In [None]:
class ConvBlock(torch.nn.Module):
    def __init__(self, #Put input arguments here determining channel size, kernel size etc..):
        super(ConvBlock, self).__init__()
        ## TODO
    
        #Create the layers of the convolutional block and save as class attributes using self.
        self.conv = #... conv layer goes here
        self.act = #... activation layer goes here. I recommend sigmoid.

        # Determine how much zero-padding is required for the output to have the same sequence length as in the input
        self.pad = #...

    def forward(self, x): #Write the forward function here, that applies the convolutional block layers to the input 'x'
        ## TODO
        # Create the zeros to pad x with
        # Pad x with zeros, (zeros first, then x!)
        # Apply convolutional layer
        # Apply Nonlinearity
        return # Return layer output here

Create an instance of your ConvBlock() class below

In [None]:
## TODO - create an instance of your ConvBlock
block = ConvBlock(# Add any required arguments for your class definition)

Create an input signal with the appropriate number of input channels for your ConvBlock.

In [None]:
## TODO
input_signal = torch.randn(#Create appropriately shaped tensor here)

Now we will process the signal with your block - if this throws an error, don't worry! We will learn how to debug your convolutional block in the next section 'USING THE DEBUGGER'!

In [None]:
output_signal = block(input_signal)

If it didn't throw an error, well done! \
But you should still learn how to debug your convolutional block in the next section 'USING THE DEBUGGER'!

## 5 - USING THE DEBUGGER!

You can activate 'debug mode' by clicking on the little bug in the top right hand corner of this window!

This will enable you to create breakpoints. Instructions and demonstration of this are available at the following page:

https://jupyterlab.readthedocs.io/en/stable/user/debugger.html

-------

Add breakpoints at every line of code in your 'ConvBlock' class, then try processing the input_signal with your convolutional block:

In [102]:
output_signal = block(input_signal)

If this worked correctly, you should have been taken 'inside' your class definition, where you can then step through each line of code.

In debug mode you can see the variables that your class can currently see, and even interact with them!

You can try evaluating a line of code by clicking on the:

< >

button located on the 'callstack' dropdown menu.

This is a good way to debug your tensor operations. For example, if the zero-padding threw an error, or produced the wrong result, you can try and find the correct way to concatenate the zeros with the signal in debug mode, and then add that to the class method. 

This is helpful because in the class you will now be able to see the exact shapes of all the tensors involved at the time the padding operation is being applied.

-----

Now that you've debugged your code, run the cell below to check if the correct shape is produced.

In [103]:
## Test if the output is the same length as the input
assert input_signal.shape[0] == block(input_signal).shape[0], f"Expected batch dimension {input_signal.shape[0]}, got {block(input_signal).shape[0]}"
assert input_signal.shape[2] == block(input_signal).shape[2], f"Expected time dimension {input_signal.shape[2]}, got {block(input_signal).shape[2]}"

## 6 - Create ConvNet

Now you can create a 'deep' neural network by stacking together many convolutional blocks. (Neural network architectures must have cool sounding names like 'AlexNet', 'VGGNet', 'ResNet' or 'ConvNet'.)

This convolutional neural network will take a MONO AUDIO SIGNAL of any length as input, and will return a MONO AUDIO SIGNAL of the SAME LENGTH as output.

In this case the input will be clean 'DI' guitar, and the target output is that same guitar signal with an audio effect applied, but you don't have to worry about that part just yet.

You will create this DeepConvNet in .... another class!

In [73]:
class DeepConvNet(torch.nn.Module):
    def __init__(self):
        super(DeepConvNet, self).__init__()
        # TODO
        # Create the blocks of your convolutional neural network
        # self.block1 = ConvBlock(# Add appropriate arguments here)
        # self.block2 = ConvBlock(# Add appropriate arguments here)
        # self.block3 = ConvBlock(# Add appropriate arguments here)
        # self.block4 = ConvBlock(# Add appropriate arguments here)

        # When choosing the channel sizes of each block, you should consider the number of channels
        # expected to be in the input and target signals.

        # The blocks in-between the first and last block can have any number of channels
        # Importantly, the number of input channels of any block MUST be the same as the number of output
        # channels of the previous block.

    def forward(self, x): #Write the forward function here, that applies each of your blocks to produce the output
        ## TODO
        return # Return the network output here



In [104]:
class DeepConvNet(torch.nn.Module):
    def __init__(self):
        super(DeepConvNet, self).__init__()
        # TODO
        # Create the blocks of your convolutional neural network
        self.block1 =  ConvBlock(1, 16, 10)
        self.block2 =  ConvBlock(16, 16, 10)
        self.block3 =  ConvBlock(16, 16, 10)
        self.block4 =  ConvBlock(16, 1, 10)

        # When choosing the channel sizes of each block, you should consider the number of channels
        # expected to be in the input and target signals.

        # The blocks in-between the first and last block can have any number of channels
        # Importantly, the number of input channels of any block MUST be the same as the number of output
        # channels of the previous block.

    def forward(self, x):
        # Apply each block in turn to produce the output
        x = self.block1(x)
        x = self.block2(x)
        x = self.block3(x)
        x = self.block4(x)
        return x
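
The channel-matching rule described in the comments above can be sketched without relying on your own ConvBlock. The example below uses plain `torch.nn.Conv1d` layers as stand-ins for the blocks, chained with `torch.nn.Sequential`; the channel sizes in `channels` and the kernel size of 3 are hypothetical values chosen for illustration.

```python
import torch

# Hypothetical channel sizes: mono in, two hidden sizes, mono out
channels = [1, 8, 8, 1]

# Each layer's input channels equal the previous layer's output channels
layers = [torch.nn.Conv1d(c_in, c_out, kernel_size=3)
          for c_in, c_out in zip(channels[:-1], channels[1:])]
net = torch.nn.Sequential(*layers)

x = torch.randn(2, 1, 100)  # (batch, channels, time)
y = net(x)

print(y.shape)  # torch.Size([2, 1, 94])
```

Because these stand-in layers use no zero-padding, each one shortens the signal slightly, which is why the output here is shorter than the input. Your ConvBlock pads its input, so a network built from it should keep the sequence length unchanged.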

Create and process some data with your model. Use the debugger to step through the model as it processes the input.

In [None]:
## TODO
model = ##Create Model

In [None]:
## TODO
input_data = ##Create some input

In [None]:
## TODO
output = # process the input data

In [None]:
## TODO
# Now check the output data, does it have the same shape as your input data?

## 7 - Dataset

This cell will load some unprocessed 'DI' guitar signal, as well as that same guitar signal after it has been processed by an audio effect.

In this case, the 'input' to the model will be unprocessed 'DI' guitar, and the 'target' we want our model to produce will be that guitar with some audio effects processing applied.

In [76]:
# This cell loads some data, and splits it into train, validation and test sets.
inp, fs = torchaudio.load('guitar-input.wav')
tgt, fs = torchaudio.load('guitar-target.wav')
inp_train = inp[:, 2*fs:242*fs] # 4 minutes of train data
tgt_train = tgt[:, 2*fs:242*fs]

inp_val = inp[:, 242*fs:272*fs] # 30 seconds of validation data
tgt_val = tgt[:, 242*fs:272*fs]

inp_test = inp[:, 272*fs:302*fs] # 30 seconds of test data
tgt_test = tgt[:, 272*fs:302*fs]


Use the cells below to play some of the audio we just loaded.

You will have to slice some of the audio, example:\
inp_train[0, 0:2*fs] \
will get the first 2 seconds of the input data.

You will also have to call .numpy() on the audio to make it into a numpy array.

In [21]:
# Play a segment of the input audio here
IPython.display.Audio(data=##audio to play here###, rate=fs)

In [22]:
# Play the same segment from the target audio here
IPython.display.Audio(data=##audio to play here###, rate=fs)

The above two clips should be the same guitar playing, but one should have audio effects processing applied!

----

Let's create a dataloader for our audio effects modelling neural network.

In this case, our training data consists of 4 minutes of clean guitar audio, and our target data consists of the same 4 minutes of guitar audio after effects processing.

We can choose to divide this into a larger number of datapoints. For example, we can decide that we want each training example to consist of 0.5 seconds of clean guitar audio, paired with the corresponding 0.5 seconds of distorted guitar audio.

This gives a total of 480 datapoints in our dataset. As the sample rate in this case is 44100 Hz, the dataset will consist of 480 input/target pairs of tensors, each with 1 channel and each 22050 samples long.

The first datapoint will be the first 22050 samples of the input signal, paired with the first 22050 samples of the target signal.
The second datapoint will be samples 22050 to 2*22050=44100 of the input signal, paired with the corresponding samples in the target signal.
And so on.
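
The arithmetic above can be checked with a few lines of plain Python. The variable names here (`segment_len`, `train_samples`, `n_points`, `boundaries`) are made up for this sketch; the numbers are the ones stated in the text.

```python
fs = 44100                   # sample rate in Hz
segment_len = fs // 2        # 0.5 seconds -> 22050 samples
train_samples = 240 * fs     # 4 minutes of training audio

n_points = train_samples // segment_len
print(n_points)              # 480

# Start and end sample of the first three datapoints
boundaries = [(i * segment_len, (i + 1) * segment_len) for i in range(3)]
print(boundaries)            # [(0, 22050), (22050, 44100), (44100, 66150)]
```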

Below is a Dataset implementation that takes a single input audio tensor and a single target audio tensor as input.

In [129]:
class AudioDataSet(Dataset): 
    def __init__(self, input_audio, target_audio):
        # Our class constructor takes the input and target audio streams as arguments, and saves them
        self.input_audio = input_audio
        self.target_audio = target_audio

    def __getitem__(self, i):
        # The getitem method returns a segment of audio, that is 22050 samples long
        start_samp = i*22050
        end_samp = (i+1)*22050
        x = self.input_audio[:, start_samp:end_samp]
        y = self.target_audio[:, start_samp:end_samp]
        return x, y

    def __len__(self):
        # The len method returns the total number of datapoints in the dataset
        return self.input_audio.shape[1]//22050

In [131]:
# Create our training dataset.
train_dataset = AudioDataSet(inp_train, tgt_train)

In [132]:
# Get the 20th datapoint from our dataset
x,y = train_dataset[19]

In [133]:
# Play the input
IPython.display.Audio(data=x.numpy(), rate=fs)

In [134]:
# Play the target
IPython.display.Audio(data=y.numpy(), rate=fs)

In [None]:
## TODO  Create the validation and test datasets, using the inp_val/tgt_val and inp_test/tgt_test

# Listen to some examples from each dataset



## 8 - Putting it all together

Now we can train our convolutional neural network model to emulate the audio effect.

In [135]:
# This is the loss function
loss_fcn = torch.nn.MSELoss()

In [136]:
# This is the stochastic gradient descent optimiser
optimiser = torch.optim.SGD(model.parameters(), lr=1e-3)  # lr must be given explicitly in older PyTorch versions; tune as needed

Below is a basic training loop.

In [None]:
loss_list = []
for n in range(100): # Run 100 training epochs
    for x,y in DataLoader(train_dataset, 10): # This wraps our train_dataset in a Dataloader, batch_size=10
        optimiser.zero_grad()         # Zero the gradients
        y_hat = model(x)              # Get the model predictions
        loss = loss_fcn(y_hat, y)     # Calculate the loss
        loss.backward()               # Backpropagate
        optimiser.step()              # Update the model parameters
        loss_list.append(loss.item()) # save the loss for monitoring training

If you look at the values in loss_list, they should be decreasing. You can try plotting them:

In [None]:
import matplotlib.pyplot as plt

# plot your loss

Now here are some things for you to try:

### Validation Loss:
- After each training epoch, use the model to process the validation dataset
- Find the average loss on the validation dataset, and save it to a list called 'validation_losses'
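
One possible way to approach the validation loss is sketched below. To keep the example self-contained, `model` and `val_dataset` are small stand-ins (a single padded Conv1d layer and random tensors), and `average_loss` is a hypothetical helper name; in the workshop you would use your own DeepConvNet and validation dataset instead.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Stand-ins for illustration only: replace with your own model and dataset
model = torch.nn.Conv1d(1, 1, kernel_size=3, padding=1)
val_dataset = TensorDataset(torch.randn(8, 1, 100), torch.randn(8, 1, 100))
loss_fcn = torch.nn.MSELoss()

def average_loss(model, dataset, batch_size=4):
    """Mean of the per-batch losses over a dataset, without tracking gradients."""
    model.eval()                      # switch off training-only behaviour
    total, n_batches = 0.0, 0
    with torch.no_grad():             # no gradients needed for evaluation
        for x, y in DataLoader(dataset, batch_size):
            total += loss_fcn(model(x), y).item()
            n_batches += 1
    model.train()                     # switch back before training continues
    return total / n_batches

validation_losses = []
# In a training loop, this line would run once at the end of every epoch
validation_losses.append(average_loss(model, val_dataset))
print(validation_losses)
```

Averaging the per-batch losses equals the average over the whole dataset only when every batch has the same size, which is the case in this sketch.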

### Validation/Test dataset:
- Our Dataset class currently splits up the data into 0.5-second samples
- This doesn't make it very easy to assess how accurate our model sounds
- Modify the earlier defined 'AudioDataSet'
  - Add a new argument to the constructor called 'segment_length_samples'
  - Update the __getitem__ method so each segment is 'segment_length_samples' samples long
  - The total number of segments in the dataset will now be different, so update the __len__ method accordingly
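
If you get stuck, the sketch below shows one way the modification might look. The class name `SegmentedAudioDataset` is hypothetical, the audio is random noise standing in for the guitar recordings, and the bounds check in `__getitem__` is an extra that the original 'AudioDataSet' does not have.

```python
import torch
from torch.utils.data import Dataset

class SegmentedAudioDataset(Dataset):  # hypothetical name for this sketch
    def __init__(self, input_audio, target_audio, segment_length_samples):
        self.input_audio = input_audio
        self.target_audio = target_audio
        self.segment_length_samples = segment_length_samples

    def __getitem__(self, i):
        # Extra safety check: stop plain iteration at the end of the dataset
        if not 0 <= i < len(self):
            raise IndexError(i)
        start = i * self.segment_length_samples
        end = start + self.segment_length_samples
        return self.input_audio[:, start:end], self.target_audio[:, start:end]

    def __len__(self):
        # Only whole segments are counted; any leftover samples are dropped
        return self.input_audio.shape[1] // self.segment_length_samples

fs = 44100
inp = torch.randn(1, 30 * fs)  # stand-in for 30 seconds of input audio
tgt = torch.randn(1, 30 * fs)  # stand-in for 30 seconds of target audio

ds = SegmentedAudioDataset(inp, tgt, segment_length_samples=5 * fs)
x, y = ds[0]
print(len(ds))   # 6
print(x.shape)   # torch.Size([1, 220500])
```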

### Evaluate model performance:
- Now use your modified dataset class to create the test and validation sets, so each segment is 5 seconds long
- Listen to some examples from the validation/test set
- Run this test dataset through your trained model
- Listen to its outputs and compare them to the target data
- How is the model doing?

### ConvBlock modifications
- Can you add dilations to your ConvBlock?
- Try adding a dilation argument, and then creating a ConvNet where dilation increases with each layer
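
As a starting point, the sketch below shows how dilation changes the output length of a plain `torch.nn.Conv1d` layer and how much left-side zero-padding restores it. It assumes a stride of 1, for which the PyTorch Conv1d documentation gives an unpadded output length of `L_in - dilation * (kernel_size - 1)`. The kernel size, dilation values and dictionary names are hypothetical choices for this example.

```python
import torch

kernel_size = 3
x = torch.randn(1, 1, 100)  # (batch, channels, time)

unpadded_lengths = {}
padded_lengths = {}
for dilation in (1, 2, 4):
    conv = torch.nn.Conv1d(1, 1, kernel_size, dilation=dilation)

    # Samples lost by the convolution (stride 1 assumed)
    pad = dilation * (kernel_size - 1)

    # Zero-pad at the start of the signal only, to keep the system causal
    x_padded = torch.nn.functional.pad(x, (pad, 0))

    unpadded_lengths[dilation] = conv(x).shape[-1]
    padded_lengths[dilation] = conv(x_padded).shape[-1]

print(unpadded_lengths)  # {1: 98, 2: 96, 4: 92}
print(padded_lengths)    # {1: 100, 2: 100, 4: 100}
```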

### You might be able to improve the model performance by:
- Changing the learning rate in the optimiser
- Changing the number of model parameters
- Using shorter training segments, which will also make training faster

You can monitor training performance by looking at the validation loss.

In [None]:
### TODO