Data visualization and transformation are an important part of building any model. Now that we have downloaded our dataset, let's learn more about visualizing audio data and transforming this dataset.

TorchAudio has many transformation functions for audio manipulation and feature extraction. However, we will focus on the following concepts and transforms:
* **Spectrogram** Create a spectrogram from a waveform.
* **MelSpectrogram** Create a Mel spectrogram from a waveform using the `STFT` function in PyTorch.
* **Waveform** Plot the amplitude of the audio signal over time.
* **MFCC** Create the Mel-frequency cepstral coefficients from a waveform.

In [None]:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

In [None]:
from __future__ import annotations
import torch
import torchaudio
import matplotlib.pyplot as plt
from torch.utils.data import Dataset, DataLoader
from pathlib import Path


In [None]:
# check the platform: Kaggle (CUDA), Apple Silicon (MPS), or plain CPU
import os, platform

torch_device="cpu"

if 'kaggle' in os.environ.get('KAGGLE_URL_BASE','localhost'):
    torch_device = 'cuda'
else:
    torch_device = 'mps' if platform.system() == 'Darwin' else 'cpu'

torch_device
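
The check above infers the device from the environment. As an alternative sketch, you can ask PyTorch directly which backends are usable. The helper name `pick_device` is hypothetical and not part of the original notebook.

```python
import torch

def pick_device() -> str:
    """Return 'cuda', 'mps', or 'cpu' depending on what PyTorch reports as usable."""
    if torch.cuda.is_available():
        return "cuda"
    # torch.backends.mps is only present in newer PyTorch releases, so look it up defensively
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"

print(pick_device())
```

This avoids assuming that every non-Kaggle macOS machine has a working MPS backend.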

First, we'll go through the audio files that we downloaded to the local directory, keeping only the `yes` and `no` commands, whose file names contain `_nohash_`. Then we'll load each file with `torchaudio.load`. This makes it easy to extract attributes of the audio (for example, the waveform and sample rate).

In [None]:
def load_audio_file(path:str, label:str):
    dataset =[]
    walker = sorted(str(p) for p in Path(path).glob('*.wav'))

    for file_path in walker:
        _, filename = os.path.split(file_path)  # avoid shadowing the `path` argument
        speaker, _ = os.path.splitext(filename)
        speaker_id, utterance_number = speaker.split('_nohash_')
        utterance_number = int(utterance_number)  # store the utterance index as an int

        # Load audio
        waveform, sample_rate = torchaudio.load(file_path)
        dataset.append([waveform, sample_rate, label, speaker_id, utterance_number])

    return dataset
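
To make the file name parsing concrete, here is a small sketch that applies the same steps to one path. The file name below is only illustrative; it is assumed to follow the `<speaker_id>_nohash_<utterance_number>.wav` pattern that the function expects.

```python
import os

# Illustrative path in the <speaker_id>_nohash_<utterance_number>.wav style (an assumption)
file_path = "./data/SpeechCommands/speech_commands_v0.02/yes/0a7c2a8d_nohash_0.wav"

_, filename = os.path.split(file_path)      # drop the directory part
stem, _ = os.path.splitext(filename)        # drop the .wav extension
speaker_id, utterance_number = stem.split("_nohash_")

print(speaker_id, int(utterance_number))
```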

Call the `load_audio_file` function to load the contents from each of the audio class files, as well as their metadata.

In [None]:
trainset_speechcommands_yes = load_audio_file('./data/SpeechCommands/speech_commands_v0.02/yes', 'yes')
trainset_speechcommands_no = load_audio_file('./data/SpeechCommands/speech_commands_v0.02/no', 'no')

print(f'Length of yes dataset: {len(trainset_speechcommands_yes)}')
print(f'Length of no dataset: {len(trainset_speechcommands_no)}')

Now load the dataset into a data loader for both the `yes` and `no` training sample sets. `DataLoader` sets the batch size, which is the number of samples passed through your network in each iteration when training the model. We'll set the batch size to 1, so each iteration yields a single audio sample.

In [None]:
traubkiader_yes = torch.utils.data.DataLoader(trainset_speechcommands_yes, batch_size=1, shuffle=True, num_workers=0)
traubkiader_no = torch.utils.data.DataLoader(trainset_speechcommands_no, batch_size=1, shuffle=True, num_workers=0)
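
As a rough illustration of what a loader with `batch_size=1` returns, the sketch below feeds two synthetic samples through a `DataLoader`. The samples use the same layout as the output of `load_audio_file`, but the waveforms and speaker IDs are made up.

```python
import torch
from torch.utils.data import DataLoader

# Two synthetic samples: [waveform, sample_rate, label, speaker_id, utterance_number]
samples = [
    [torch.zeros(1, 16000), 16000, "yes", "speaker_a", 0],
    [torch.zeros(1, 16000), 16000, "yes", "speaker_b", 1],
]

loader = DataLoader(samples, batch_size=1, shuffle=False, num_workers=0)
waveform, sample_rate, label, speaker_id, utterance_number = next(iter(loader))

print(waveform.shape)   # a batch dimension is added in front of [channels, samples]
print(sample_rate)      # the integer is collated into a 1-element tensor
print(label[0])         # strings are grouped into a sequence of length 1
```

Because every field gains a batch dimension, code that consumes the loader has to index into it to get back the single value, for example `sample_rate[0]`.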

To see how the data looks, we'll grab the waveform and sample rate from each class, and print out a sample of the dataset.
* The **waveform** value is a Tensor with a float datatype.
* The **sample_rate** value is 16000, the rate at which the audio signal was captured.
* The **label** value is the command classification of the word uttered in the audio, `yes` or `no`.
* The **ID** is a unique identifier of the audio file.

In [None]:
yes_waveform =  trainset_speechcommands_yes[0][0]
yes_sample_rate = trainset_speechcommands_yes[0][1]
print(f'Yes waveform: {yes_waveform}')
print(f'Yes sample rate: {yes_sample_rate}')
print(f'Yes Label: {trainset_speechcommands_yes[0][2]}')
print(f'Yes ID: {trainset_speechcommands_yes[0][3]}\n')

no_waveform =  trainset_speechcommands_no[0][0]
no_sample_rate = trainset_speechcommands_no[0][1]
print(f'No waveform: {no_waveform}')
print(f'No sample rate: {no_sample_rate}')
print(f'No Label: {trainset_speechcommands_no[0][2]}')
print(f'No ID: {trainset_speechcommands_no[0][3]}\n')

## Transform and visualize

Let's break down some of the audio transforms and the visualization to better understand what they are, and what they tell us about the data.

## Waveform

A `waveform` is a graphical representation of the audio `signal`: it plots the amplitude of the signal over time, and the sample rate determines how many amplitude values are captured per second. The audio can be recorded in different `channels`.

Here's how to use the `resample` transform to reduce the size of the waveform, and then graph the data to visualize the new waveform shape.

In [None]:
def show_waveform(waveform, sample_rate, label):
    print("Waveform: {}\nSample rate: {}\nLabel: {}".format(waveform, sample_rate, label))
    new_sample_rate = sample_rate // 10  # keep the rate an integer, e.g. 16000 -> 1600

    # Resample applies to a single channel, we resample first channel here.
    channel = 0
    waveform_transformed = torchaudio.transforms.Resample(sample_rate, new_sample_rate)(waveform[channel,:].view(1,-1))

    print("Shape of transformed waveform: {}\nSample rate: {}".format(waveform_transformed.size(), new_sample_rate))

    plt.figure()
    plt.plot(waveform_transformed[0,:].numpy())

The displayed results show how the sample rate is transformed from 16000 to 1600.

In [None]:
show_waveform(yes_waveform, yes_sample_rate, 'yes')

In [None]:
show_waveform(no_waveform, no_sample_rate, 'no')
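
The relationship between sample rate, duration, and tensor size can be sketched without any audio file. The example below builds a synthetic one-second tone (a stand-in for a recorded command, not data from the dataset) and works out how many samples remain after reducing the sample rate by a factor of 10.

```python
import math
import torch

sample_rate = 16000                  # samples per second
duration_s = 1.0                     # assumed clip length in seconds
num_samples = int(sample_rate * duration_s)

# Synthetic 440 Hz tone with shape [channels, samples]
t = torch.arange(num_samples) / sample_rate
waveform = torch.sin(2 * math.pi * 440 * t).unsqueeze(0)

# Lowering the sample rate tenfold leaves a tenth of the samples for the same duration
new_sample_rate = sample_rate // 10
expected_samples = num_samples * new_sample_rate // sample_rate

print(waveform.shape, new_sample_rate, expected_samples)
```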

## Spectrogram

A spectrogram maps the frequency content of an audio file against time, and it allows you to visualize audio data by frequency. It is an image format. This image is what we'll use for our computer vision classification on the audio files. You can view the spectrogram image in grayscale, or in Red Green Blue (RGB) color format.

Every spectrogram image helps show the different features the sound signal produces in a color pattern. The convolutional neural network (CNN) treats the color patterns in the image as features for training the model to classify the audio.

Let's use the `Spectrogram` transform from `torchaudio.transforms` to transform the waveform to a spectrogram image format.

In [None]:
def show_spectrogram(waveform_classA, waveform_classB):
    yes_spectrogram = torchaudio.transforms.Spectrogram()(waveform_classA)
    print("\nShape of yes spectrogram: {}".format(yes_spectrogram.size()))

    no_spectrogram = torchaudio.transforms.Spectrogram()(waveform_classB)
    print("\nShape of no spectrogram: {}".format(no_spectrogram.size()))

    plt.figure()
    plt.subplot(1,2,1)
    plt.title('Features of {}'.format('no'))
    plt.imshow(no_spectrogram.log2()[0,:,:].numpy(), cmap='viridis')

    plt.subplot(1,2,2)
    plt.title('Features of {}'.format('yes'))
    plt.imshow(yes_spectrogram.log2()[0,:,:].numpy(), cmap='viridis')

We'll use the waveforms for the `yes` and `no` commands to display the dimensions and color pattern of the spectrogram images, and to compare the feature differences between the two commands.
* The **y-axis** is the frequency of the audio.
* The **x-axis** is the time of the audio.
* The **intensity** of the image shows the amplitude of the audio. In the following spectrogram images, a high concentration of yellow indicates a high amplitude.

In [None]:
show_spectrogram(yes_waveform, no_waveform)
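
To see where the spectrogram dimensions come from, here is a minimal sketch that computes a power spectrogram with `torch.stft` on a synthetic one-second clip. It assumes an FFT size of 400 and a hop length of 200, which are understood to be the defaults of `torchaudio.transforms.Spectrogram`; treat these values as assumptions and check them against your installed version.

```python
import torch

n_fft, hop_length = 400, 200          # assumed to match the Spectrogram defaults
waveform = torch.randn(1, 16000)      # synthetic one-second clip at 16 kHz

stft = torch.stft(
    waveform,
    n_fft=n_fft,
    hop_length=hop_length,
    window=torch.hann_window(n_fft),
    return_complex=True,
)
power_spectrogram = stft.abs().pow(2)  # squared magnitude per frequency bin and frame

# Frequency bins: n_fft // 2 + 1, time frames: 1 + num_samples // hop_length
print(power_spectrogram.shape)
```

Under these assumptions a one-second clip produces 201 frequency bins and 81 time frames.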

## Mel spectrogram

A Mel spectrogram also maps frequency against time, but the frequency is converted to the Mel scale. The Mel scale (named after "melody") rescales frequency based on how listeners perceive pitch, so that equal steps on the scale sound equally far apart. The transform converts the frequencies to the Mel scale, and then creates the spectrogram image.

In [None]:
def show_melspectrogram(waveform, sample_rate):
    mel_spectrogram = torchaudio.transforms.MelSpectrogram(sample_rate)(waveform)
    print("Shape of spectrogram: {}".format(mel_spectrogram.size()))

    plt.figure()
    plt.imshow(mel_spectrogram.log2()[0,:,:].numpy(), cmap='viridis')

show_melspectrogram(yes_waveform, yes_sample_rate)
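
One widely used formula for the Mel scale is m = 2595 * log10(1 + f / 700). The sketch below applies it to a few frequencies; the helper name `hz_to_mel` is hypothetical, and whether your transform uses exactly this variant of the scale is an assumption.

```python
import math

def hz_to_mel(freq_hz: float) -> float:
    """Convert a frequency in Hz to Mels using m = 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)

for freq in (0, 500, 1000, 4000, 8000):
    print(freq, round(hz_to_mel(freq), 1))
```

The scale is compressive: the step from 4000 Hz to 8000 Hz covers far fewer Mels than the step from 0 Hz to 4000 Hz, which mirrors how pitch differences are harder to hear at high frequencies.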

## Mel-frequency cepstral coefficients (MFCC)

A simplified explanation of what the MFCC does is that it takes the Mel spectrogram, takes the logarithm of its values, and applies a further transform (a discrete cosine transform). The resulting coefficients are the amplitudes of that transformed spectrum. Let's take a look at what this looks like.

In [None]:
def show_mfcc(waveform, sample_rate):
    mfcc_spectrogram = torchaudio.transforms.MFCC(sample_rate)(waveform)
    print("Shape of spectrogram: {}".format(mfcc_spectrogram.size()))

    plt.figure()
    fig1 = plt.gcf() # Get current figure
    plt.imshow(mfcc_spectrogram.log2()[0,:,:].numpy(), cmap='viridis')


    plt.figure()
    plt.plot(mfcc_spectrogram.log2()[0,:,:].numpy())
    plt.draw()

show_mfcc(no_waveform, no_sample_rate)
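
The last step of the MFCC computation can be sketched in plain Python. The example applies an unnormalized DCT-II to a made-up frame of log-Mel energies; the helper `dct_ii` and the input values are hypothetical, and library implementations typically scale the result differently.

```python
import math

def dct_ii(values):
    """Unnormalized DCT-II of a sequence of numbers."""
    n = len(values)
    return [
        sum(v * math.cos(math.pi / n * (i + 0.5) * k) for i, v in enumerate(values))
        for k in range(n)
    ]

# Hypothetical log-Mel energies for a single time frame
log_mel_frame = [2.0, 1.5, 1.0, 0.5, 0.25, 0.1]
cepstral_coefficients = dct_ii(log_mel_frame)

print([round(c, 3) for c in cepstral_coefficients])
```

The first coefficient is the sum of the inputs, so it tracks overall energy, while the higher coefficients describe the shape of the spectrum.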

## Create an image from a spectrogram

At this point, you have a better understanding of your audio data, and the different transformations you can use on it. Now, let's create the images we will use for classification.

The following are two different functions that create either the spectrogram images or the MFCC images for classification. We will use the spectrogram images to train our model.

In [None]:
def create_spectrogram_images(trainloader, label_dir):
    directory = f'./data/spectrograms/{label_dir}/'
    if (os.path.isdir(directory)):
        print(f'Data exists for {label_dir}')
    else:
        os.makedirs(directory, mode=0o777, exist_ok=True)
        for i, data in enumerate(trainloader):
            waveform =data[0]
            sample_rate = data[1][0]
            label = data[2]
            ID =data[3]

            # create transformed waveforms
            spectrogram_tensor = torchaudio.transforms.Spectrogram()(waveform)
            # plt.imsave writes the array straight to disk, so no figure is needed
            plt.imsave(f'./data/spectrograms/{label_dir}/spec_img{i}.png', spectrogram_tensor[0].log2()[0,:,:].numpy(), cmap='viridis')

Here's the function that creates the `MFCC` images.

In [None]:
def create_mfcc_images(trainloader, label_dir):
    os.makedirs(f'./data/mfcc_spectrograms/{label_dir}/', mode=0o777, exist_ok=True)

    for i, data in enumerate(trainloader):
        waveform =data[0]
        sample_rate = data[1][0]
        label = data[2]
        ID =data[3]

        # create transformed waveforms
        mfcc_spectrogram = torchaudio.transforms.MFCC(sample_rate)(waveform)
        fig1 = plt.figure()
        plt.imshow(mfcc_spectrogram[0].log2()[0,:,:].numpy(), cmap='viridis')
        plt.draw()
        fig1.savefig(f'./data/mfcc_spectrograms/{label_dir}/spec_img{i}.png', dpi=100)
        plt.close(fig1)  # release the figure so open figures do not pile up in memory

Create the spectrogram images that you'll use for the audio classification.

In [None]:
create_spectrogram_images(traubkiader_yes, 'yes')
create_spectrogram_images(traubkiader_no, 'no')

# Build the speech model

We will use the `torchvision` package to load the spectrogram images and `torch.nn` to build the model. The convolutional layers (`conv2d`) of the convolutional neural network (CNN) will be used to extract the unique features from the spectrogram image for each speech command.

In [None]:

import torch
import torchaudio
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import matplotlib.pyplot as plt
import numpy as np

from torch.utils.data import DataLoader,Dataset
from torchvision import datasets, models, transforms
from torchinfo import summary
import pandas as pd
import os

## Load spectrogram images into a data loader for training

Here, we provide the path to our image data and use PyTorch's `ImageFolder` dataset helper class to load the images into tensors. We'll also give every image the same size by resizing it to 201x81.

In [None]:
data_path = './data/spectrograms'  # root folder; each subfolder (yes, no) is one class
yes_no_dataset = datasets.ImageFolder(
    root=data_path,
    transform=transforms.Compose([transforms.Resize((201,81)),transforms.ToTensor()])
)
print(yes_no_dataset)

`ImageFolder` automatically creates the image class labels and indices based on the folders for each audio class. We'll use `class_to_idx` to view the class mapping for the image dataset.

In [None]:
class_map = yes_no_dataset.class_to_idx

print('\nClass category and index of the images: {}\n'.format(class_map))

## Split the data for training and testing

We will need to split the data to use 80% to train the model, and 20% to test.

In [None]:
#split data to test and train by using 80% to train
train_size = int(0.8 * len(yes_no_dataset))
test_size = len(yes_no_dataset) - train_size
yes_no_train_dataset, yes_no_test_dataset = torch.utils.data.random_split(yes_no_dataset, [train_size, test_size])

print('Training size:', len(yes_no_train_dataset))
print('Test size:', len(yes_no_test_dataset))
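
Because `random_split` shuffles the indices, the split changes on every run. If you want a reproducible split, one option is to pass a seeded generator, as in this sketch. The `range(100)` stand-in dataset and the seed value 42 are arbitrary choices for illustration.

```python
import torch
from torch.utils.data import random_split

items = range(100)                        # stand-in for the image dataset
train_size = int(0.8 * len(items))
test_size = len(items) - train_size

# A seeded generator makes the shuffle, and therefore the split, repeatable
generator = torch.Generator().manual_seed(42)
train_subset, test_subset = random_split(items, [train_size, test_size], generator=generator)

print(len(train_subset), len(test_subset))
```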

Because the dataset was randomly split, let's count the training data to verify that the data has a fairly even distribution between the images in the `yes` and `no` categories.

In [None]:
from collections import Counter

# labels in training set
train_classes = [label for _, label in yes_no_train_dataset]
Counter(train_classes)

Load the data into the `DataLoader` and specify the batch size, which controls how the data is divided and loaded in the training iterations. We'll also set the number of workers to specify the number of subprocesses used to load the data.

In [None]:
train_dataloader = torch.utils.data.DataLoader(yes_no_train_dataset, batch_size=15, num_workers=2, shuffle=True)
test_dataloader = torch.utils.data.DataLoader(yes_no_test_dataset, batch_size=15, num_workers=2, shuffle=True)

Let's take a look at what our training tensor looks like:

In [None]:
td = train_dataloader.dataset[0][0][0][0]
print(td)

### Create the convolutional neural network (CNN) model

We will define our layers and parameters:

* **conv2d**: Takes an input of 3 `channels`, which represent the RGB colors because our input images are in color. The 32 represents the number of feature map images produced by the convolutional layer. The images are produced by applying a filter to each image in a channel, with a 5x5 kernel size and a stride of 1. `Max pooling` is set with a 2x2 kernel size to reduce the dimensions of the filtered images. We apply the `ReLU` activation to replace the negative pixel values in the feature map with zero.
* **conv2d**: Takes the 32 output images from the previous convolutional layer as input. Then, we increase the output number to 64 feature map images, after a filter is applied on the 32 input images, with a 5x5 kernel size and a stride of 1. `Max pooling` is set with a 2x2 kernel size to reduce the dimensions of the filtered images. We apply the `ReLU` activation to replace the negative pixel values with 0.
* **dropout**: Randomly zeroes some of the features extracted by the `conv2d` layer, with a ratio of 0.50, to prevent overfitting.
* **flatten**: Converts the features from the `conv2d` output images into a flat vector for the linear input layer.
* **Linear**: Takes 51136 features as input, and produces 50 outputs. The next layer takes those 50 inputs and produces 2 logits in the output layer. The `ReLU` activation function is applied to the neurons across the linear network to replace the negative values with 0. The 2 output values will be used to predict the classification `yes` or `no`.
* **log_softmax**: An activation function applied to the 2 output values to produce the log-probability of each audio classification.
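
The figure of 51136 input features follows from the layer settings described above. The sketch below traces a 201x81 image through two rounds of 5x5 convolution (stride 1, no padding) and 2x2 max pooling; the helper names `conv_out` and `pool_out` are hypothetical.

```python
def conv_out(size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Output length of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1

def pool_out(size: int, kernel: int) -> int:
    """Output length of max pooling whose stride equals its kernel size."""
    return size // kernel

height, width = 201, 81          # resized spectrogram image
for _ in range(2):               # two conv + pool stages
    height = pool_out(conv_out(height, kernel=5), kernel=2)
    width = pool_out(conv_out(width, kernel=5), kernel=2)

out_channels = 64                # feature maps produced by the second conv layer
flattened_features = out_channels * height * width
print(height, width, flattened_features)
```

The spatial size shrinks from 201x81 to 98x38 after the first stage and to 47x17 after the second, and 64 x 47 x 17 = 51136.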


In [None]:
class CNNet(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.conv1 = nn.Conv2d(3, 32, kernel_size=5)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=5)
        self.conv2_drop = nn.Dropout2d()
        self.flatten = nn.Flatten()
        self.fc1= nn.Linear(51136, 50)
        self.fc2 = nn.Linear(50, 2)
    
    def forward(self, x):
        x = F.relu(F.max_pool2d(self.conv1(x), 2))
        x = F.relu(F.max_pool2d(self.conv2_drop(self.conv2(x)),2))
        #x = x.view(x.size(0), -1)
        x = self.flatten(x)
        x = F.relu(self.fc1(x))
        x = F.dropout(x, training=self.training)
        x = F.relu(self.fc2(x))
        return F.log_softmax(x, dim=1)

model = CNNet().to(torch_device)

## Create train and test functions

Now we set the cost function, learning rate and optimizer. Then we define the train and test functions that we will use to train and test the model by using the CNN.

In [None]:
# cost function used to determine best parameters
cost = torch.nn.CrossEntropyLoss()

# used to create optimal parameters
learning_rate = 0.0001
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

# create the training function

def train(dataloader, model, cost, optimizer):
    model.train()
    size = len(dataloader.dataset)
    for batch, (X,Y) in enumerate(dataloader):
        X, Y = X.to(torch_device), Y.to(torch_device)
        optimizer.zero_grad()
        pred = model(X)
        loss = cost(pred, Y)
        loss.backward()
        optimizer.step()

        if batch %100 ==0:
            loss, current = loss.item(), batch*len(X)
            print(f'loss: {loss:>7f} [{current:>5d}/{size:>5d}]')

# Create the validation/test function
def test(dataloader, model):
    size = len(dataloader.dataset)
    model.eval()
    test_loss, correct = 0,0
    with torch.no_grad():
        for batch, (X,Y) in enumerate(dataloader):
            X, Y = X.to(torch_device), Y.to(torch_device)
            pred = model(X)
            test_loss += cost(pred, Y).item()
            correct += (pred.argmax(1)==Y).type(torch.float).sum().item()
    test_loss /= len(dataloader)  # cost() returns a per-batch mean, so average over the batches
    correct /= size
    print(f'Test Error: \n Accuracy: {(100*correct):>0.1f}%, Avg loss: {test_loss:>8f} \n')
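
A note on how the cost function interacts with the model output: `CrossEntropyLoss` applies a log-softmax internally, while the model already returns log-softmax values. The sketch below, using random logits and made-up targets, suggests that this does not change the loss value, because applying log-softmax twice gives the same result as applying it once.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 2)                 # made-up raw scores for 4 samples, 2 classes
targets = torch.tensor([0, 1, 1, 0])       # made-up class indices

log_probs = F.log_softmax(logits, dim=1)

ce_on_logits = F.cross_entropy(logits, targets)        # log-softmax applied internally
nll_on_log_probs = F.nll_loss(log_probs, targets)      # expects log-probabilities
ce_on_log_probs = F.cross_entropy(log_probs, targets)  # log-softmax applied a second time

print(ce_on_logits.item(), nll_on_log_probs.item(), ce_on_log_probs.item())
```

All three values agree, so pairing a log-softmax output with `CrossEntropyLoss` is redundant rather than incorrect. Pairing it with `NLLLoss` would be an equivalent alternative.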

## Train the model

Now let's set the number of epochs, and call our `train` and `test` functions in each iteration. We'll iterate through the training data once per epoch. As we train the model, we'll calculate and print the loss, which should decrease during training. In addition, we'll display the test accuracy, which should increase as the optimization progresses.

In [None]:
epochs = 15

for t in range(epochs):
    print(f'Epoch {t+1}\n-------------------------------')
    train(train_dataloader, model, cost, optimizer)
    test(test_dataloader, model)
print('Done!')

Let's look at the summary breakdown of the model architecture. It shows the number of filters used for the feature extraction and image reduction from pooling for each convolutional layer. Next, it shows 51136 input features and the 2 outputs used for classification in the linear layers.

In [None]:
summary(model, input_size=(15, 3, 201, 81))

## Test the model

We can reach somewhere between 93 and 95 percent accuracy by the 15th epoch. Here we grab a batch from our test data, and compare the model's predicted result with the actual result.

In [None]:
model.eval()
test_loss, correct = 0,0
class_map = ['no', 'yes']


with torch.no_grad():
    for batch, (X,Y) in enumerate(test_dataloader):
        X,Y = X.to(torch_device), Y.to(torch_device)
        pred = model(X)
        print('Predicted:\nvalue={}, class_name={}\n'.format(pred[0].argmax(0), class_map[pred[0].argmax(0)]))
        print('Actual:\nvalue={}, class_name={}\n'.format(Y[0], class_map[Y[0]]))
        break
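
The cell above inspects only the first sample of a batch. As a minimal sketch, the same idea extends to a whole batch: take the `argmax` of each row, map it to a class name, and compare with the targets. The log-probabilities and targets below are made up for illustration.

```python
import torch

class_names = ["no", "yes"]

# Made-up model outputs (log-probabilities) and true labels for 3 samples
log_probs = torch.log(torch.tensor([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]))
targets = torch.tensor([0, 1, 1])

predicted = log_probs.argmax(dim=1)                        # most likely class per sample
predicted_names = [class_names[i] for i in predicted.tolist()]
accuracy = (predicted == targets).float().mean().item()    # fraction of correct predictions

print(predicted_names, accuracy)
```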