Data visualization and transformation are important parts of building any model. Now that we have our dataset downloaded, let's learn more about visualizing audio data and transforming this dataset.

TorchAudio has many transformation functions for audio manipulation and feature extraction. However, we will focus on the following concepts and transforms:
* **Spectrogram** Create a spectrogram from a waveform.
* **MelSpectrogram** Create a Mel spectrogram from a waveform using the `STFT` function in PyTorch.
* **Waveform** The audio signal represented as amplitude over time.
* **MFCC** Create the Mel-frequency cepstrum coefficients from a waveform.

In [None]:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

In [None]:
from __future__ import annotations
import os
import torch
import torchaudio
import matplotlib.pyplot as plt
from torch.utils.data import Dataset, DataLoader
from pathlib import Path


First, we'll go through the audio files that we downloaded to the local directory, picking out the ones for the `yes` and `no` commands, which have `nohash` in their file names. Then we'll load the files with `torchaudio`. This will make it easy to extract attributes of the audio (for example, the waveform and sample rate).

In [None]:
def load_audio_file(path:str, label:str):
    dataset = []
    # Collect every .wav file in the class directory, in a stable order
    walker = sorted(str(p) for p in Path(path).glob('*.wav'))

    for file_path in walker:
        # File names look like <speaker_id>_nohash_<utterance_number>.wav
        _, filename = os.path.split(file_path)
        speaker, _ = os.path.splitext(filename)
        speaker_id, utterance_number = speaker.split('_nohash_')
        utterance_number = int(utterance_number)

        # Load audio
        waveform, sample_rate = torchaudio.load(file_path)
        dataset.append([waveform, sample_rate, label, speaker_id, utterance_number])

    return dataset
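
The file-name parsing inside the loader can be tried on its own, without any audio files. The following is a minimal sketch: the helper name `parse_speech_command_filename` and the example file name are made up for illustration, and it assumes every file follows the `<speaker_id>_nohash_<utterance_number>.wav` pattern.

```python
from pathlib import Path
from typing import Tuple


def parse_speech_command_filename(file_path: str) -> Tuple[str, int]:
    """Split '<speaker_id>_nohash_<utterance_number>.wav' into its two parts."""
    stem = Path(file_path).stem  # file name without directory or extension
    speaker_id, utterance_number = stem.split("_nohash_")
    return speaker_id, int(utterance_number)


# The path below is a made-up example, not a file guaranteed to exist.
print(parse_speech_command_filename("data/yes/0a7c2a8d_nohash_0.wav"))
```

A file name that does not contain `_nohash_` exactly once would raise a `ValueError` here, which is one way to notice unexpected files early.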

Call the `load_audio_file` function to load the contents from each of the audio class files, as well as their metadata.

In [None]:
trainset_speechcommands_yes = load_audio_file('./data/SpeechCommands/speech_commands_v0.02/yes', 'yes')
trainset_speechcommands_no = load_audio_file('./data/SpeechCommands/speech_commands_v0.02/no', 'no')

print(f'Length of yes dataset: {len(trainset_speechcommands_yes)}')
print(f'Length of no dataset: {len(trainset_speechcommands_no)}')

Now load the dataset into a data loader for both the `yes` and `no` training sample sets. `DataLoader` controls how many samples are grouped into each batch as you iterate over the dataset to feed it through your network and train the model. We'll set the batch size to 1, so that each iteration yields a single audio file.
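
The relationship between batch size and the number of iterations is simple arithmetic. The sketch below assumes the loader keeps the final, possibly smaller, batch (the `DataLoader` default of `drop_last=False`). The helper `num_batches` and the dataset size of 4000 are hypothetical.

```python
import math


def num_batches(num_samples: int, batch_size: int) -> int:
    """Iterations needed to see every sample once (the last batch may be smaller)."""
    return math.ceil(num_samples / batch_size)


# Hypothetical dataset size, for illustration only.
print(num_batches(4000, 1))   # one iteration per audio file
print(num_batches(4000, 32))  # far fewer iterations, each with up to 32 files
```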

In [None]:
traubkiader_yes = torch.utils.data.DataLoader(trainset_speechcommands_yes, batch_size=1, shuffle=True, num_workers=0)
traubkiader_no = torch.utils.data.DataLoader(trainset_speechcommands_no, batch_size=1, shuffle=True, num_workers=0)

To see how the data looks, we'll grab the waveform and sample rate from each class, and print out a sample of the dataset.
* The **waveform** value is a Tensor with a float datatype.
* The **sample_rate** value is 16000, the rate in Hz at which the audio signal was captured.
* The **label** value is the command classification of the word uttered in the audio, `yes` or `no`.
* The **ID** is the speaker identifier taken from the audio file name.
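
The sample rate links the length of the waveform to the duration of the clip. This is a small sketch with a hypothetical helper `duration_seconds`. With a loaded tensor of shape `(channels, samples)`, the sample count would be `waveform.shape[-1]`.

```python
def duration_seconds(num_samples: int, sample_rate: int) -> float:
    """Clip duration implied by a sample count and a sample rate in Hz."""
    return num_samples / sample_rate


# 16000 samples captured at 16000 Hz correspond to one second of audio.
print(duration_seconds(16000, 16000))
```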

In [None]:
yes_waveform =  trainset_speechcommands_yes[0][0]
yes_sample_rate = trainset_speechcommands_yes[0][1]
print(f'Yes waveform: {yes_waveform}')
print(f'Yes sample rate: {yes_sample_rate}')
print(f'Yes Label: {trainset_speechcommands_yes[0][2]}')
print(f'Yes ID: {trainset_speechcommands_yes[0][3]}\n')

no_waveform =  trainset_speechcommands_no[0][0]
no_sample_rate = trainset_speechcommands_no[0][1]
print(f'No waveform: {no_waveform}')
print(f'No sample rate: {no_sample_rate}')
print(f'No Label: {trainset_speechcommands_no[0][2]}')
print(f'No ID: {trainset_speechcommands_no[0][3]}\n')

## Transform and visualize

Let's break down some of the audio transforms and the visualization to better understand what they are, and what they tell us about the data.

## Waveform

The waveform is the representation of the audio signal over time: the amplitude of each sample, taken at the sample rate, plotted in a graphical format. The audio can be recorded in different `channels`.

Here's how to use the `Resample` transform to reduce the size of the waveform, and then graph the data to visualize the new waveform shape.

In [None]:
def show_waveform(waveform, sample_rate, label):
    print("Waveform: {}\nSample rate: {}\nLabel: {}".format(waveform, sample_rate, label))
    # Use integer division so the new sample rate stays an integer
    new_sample_rate = sample_rate // 10

    # Resample only the first channel, reshaped to (1, num_samples)
    channel = 0
    waveform_transformed = torchaudio.transforms.Resample(sample_rate, new_sample_rate)(waveform[channel,:].view(1,-1))

    print("Shape of transformed waveform: {}\nSample rate: {}".format(waveform_transformed.size(), new_sample_rate))

    plt.figure()
    plt.plot(waveform_transformed[0,:].numpy())

The displayed results show how the sample rate is transformed from 16000 to 1600.

In [None]:
show_waveform(yes_waveform, yes_sample_rate, 'yes')

In [None]:
show_waveform(no_waveform, no_sample_rate, 'no')
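
The effect of resampling on the number of samples can be estimated without running the transform, because the clip duration stays the same while the rate changes. The helper `resampled_length` below is hypothetical and only approximates what a resampler returns. The exact rounding is an assumption.

```python
import math


def resampled_length(num_samples: int, orig_rate: int, new_rate: int) -> int:
    """Approximate sample count after resampling; the duration is unchanged."""
    return math.ceil(num_samples * new_rate / orig_rate)


# A one-second clip at 16000 Hz shrinks to about a tenth of its samples at 1600 Hz.
print(resampled_length(16000, 16000, 1600))
```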

## Spectrogram

A spectrogram maps the frequency content of an audio signal over time, which allows you to visualize audio data by frequency. Because it can be rendered as an image, it is what we'll use for our computer vision classification of the audio files. You can view the spectrogram image in grayscale, or in Red Green Blue (RGB) color format.

Each spectrogram image shows the features of the sound signal as a color pattern. The convolutional neural network (CNN) treats the color patterns in the image as features for training the model to classify the audio.
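
To make the idea concrete, here is a deliberately naive spectrogram written with only the standard library. It is a sketch of the concept, not how torchaudio computes it: it uses no window function, returns magnitudes rather than power, and stores the result as `[time][frequency]`. All function names are hypothetical.

```python
import cmath
import math
from typing import List


def dft_magnitudes(frame: List[float]) -> List[float]:
    """Magnitudes of the non-negative frequency bins of a naive DFT."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]


def naive_spectrogram(signal: List[float], frame_size: int, hop: int) -> List[List[float]]:
    """Slide a frame over the signal and take the DFT magnitudes of each frame."""
    return [
        dft_magnitudes(signal[start:start + frame_size])
        for start in range(0, len(signal) - frame_size + 1, hop)
    ]


# Synthetic tone with exactly 2 cycles per 16-sample frame,
# so the energy should concentrate in frequency bin 2 of every frame.
signal = [math.sin(2 * math.pi * 2 * t / 16) for t in range(64)]
spec = naive_spectrogram(signal, frame_size=16, hop=8)
peak_bins = [max(range(len(frame)), key=frame.__getitem__) for frame in spec]
print(len(spec), len(spec[0]), peak_bins)
```

Each inner list corresponds to one column of a spectrogram image, and the bin with the largest magnitude is where the brightest pixel of that column would be.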

Let's use the `Spectrogram` transform from `torchaudio.transforms` to convert the waveform to a spectrogram image format.

In [None]:
def show_spectrogram(waveform_classA, waveform_classB):
    yes_spectrogram = torchaudio.transforms.Spectrogram()(waveform_classA)
    print("\nShape of yes spectrogram: {}".format(yes_spectrogram.size()))

    no_spectrogram = torchaudio.transforms.Spectrogram()(waveform_classB)
    print("\nShape of no spectrogram: {}".format(no_spectrogram.size()))

    plt.figure()
    plt.subplot(1,2,1)
    plt.title('Features of {}'.format('no'))
    plt.imshow(no_spectrogram.log2()[0,:,:].numpy(), cmap='viridis')

    plt.subplot(1,2,2)
    plt.title('Features of {}'.format('yes'))
    plt.imshow(yes_spectrogram.log2()[0,:,:].numpy(), cmap='viridis')

We'll use the waveforms for the `yes` and `no` commands to display the spectrogram image dimensions and color patterns. This also lets us compare the feature difference between the `yes` and `no` audio commands.
* The **y-axis** is the frequency of the audio.
* The **x-axis** is the time of the audio.
* The **intensity** of the image shows the amplitude of the audio. In the following spectrogram images, regions with a high concentration of yellow indicate high amplitude.
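
The printed spectrogram dimensions can be predicted from the transform's parameters. The sketch below assumes the `Spectrogram` defaults of `n_fft=400`, a hop length of half the window (200), and centered frames. Check these values against the installed torchaudio version. The helper `spectrogram_shape` is hypothetical.

```python
from typing import Tuple


def spectrogram_shape(num_samples: int, n_fft: int = 400, hop_length: int = 200) -> Tuple[int, int]:
    """(frequency bins, time frames) for a centered STFT of a single channel."""
    freq_bins = n_fft // 2 + 1                   # y-axis: non-negative frequencies
    time_frames = num_samples // hop_length + 1  # x-axis: one column per hop
    return freq_bins, time_frames


# A one-second clip at 16000 Hz under the assumed defaults.
print(spectrogram_shape(16000))
```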

In [None]:
show_spectrogram(yes_waveform, no_waveform)

## Mel spectrogram

A Mel spectrogram also maps frequency over time, but the frequency is converted to the Mel scale. The Mel scale is a perceptual scale of pitch: it rescales frequency so that equal steps on the scale sound roughly equally far apart to a listener. The transform converts the frequencies to the Mel scale, and then creates the spectrogram image.
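
One common definition of the Mel scale is the HTK formula, `mel = 2595 * log10(1 + hz / 700)`. The sketch below uses that formula as an assumption. Other variants exist, and the helper names are hypothetical.

```python
import math


def hz_to_mel(freq_hz: float) -> float:
    """Convert a frequency in Hz to mels using the HTK formula."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


# Equal steps in Hz become smaller steps in mels as the frequency rises.
for freq in (100, 1000, 4000, 8000):
    print(freq, round(hz_to_mel(freq), 1))
```

The compression at high frequencies is why a Mel spectrogram devotes more of its vertical axis to the lower frequencies, where speech carries most of its detail.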

In [None]:
def show_melspectrogram(waveform, sample_rate):
    mel_spectrogram = torchaudio.transforms.MelSpectrogram(sample_rate)(waveform)
    print("Shape of spectrogram: {}".format(mel_spectrogram.size()))

    plt.figure()
    plt.imshow(mel_spectrogram.log2()[0,:,:].numpy(), cmap='viridis')

show_melspectrogram(yes_waveform, yes_sample_rate)

## Mel-frequency cepstral coefficients (MFCC)

A simplified explanation of MFCC is that it takes the Mel spectrogram, applies further transforms (a logarithm followed by a discrete cosine transform), and returns the amplitudes of a spectrum computed from that log Mel spectrum. Let's take a look at what this looks like.
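
The last step of that pipeline, the discrete cosine transform (DCT), can be sketched in a few lines. This is an unnormalized DCT-II applied to made-up log Mel energies for a single time frame. torchaudio applies its own scaling and normalization, so the numbers here are illustrative only, and `dct_ii` is a hypothetical helper.

```python
import math
from typing import List


def dct_ii(values: List[float]) -> List[float]:
    """Unnormalized DCT-II: coefficient k measures variation at 'rate' k."""
    n = len(values)
    return [
        sum(v * math.cos(math.pi * k * (i + 0.5) / n) for i, v in enumerate(values))
        for k in range(n)
    ]


# Hypothetical log Mel energies for one time frame.
flat = [2.0, 2.0, 2.0, 2.0]    # no variation across Mel bands
sloped = [1.0, 2.0, 3.0, 4.0]  # energy rising with frequency

print([round(c, 3) for c in dct_ii(flat)])
print([round(c, 3) for c in dct_ii(sloped)])
```

A flat spectrum puts everything into the first coefficient, while a sloped one spreads into the higher coefficients. This is the sense in which a handful of coefficients summarize the shape of the spectrum.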

In [None]:
def show_mfcc(waveform, sample_rate):
    mfcc_spectrogram = torchaudio.transforms.MFCC(sample_rate)(waveform)
    print("Shape of spectrogram: {}".format(mfcc_spectrogram.size()))

    # MFCC values are already log-scaled and can be negative, so plot them
    # directly; taking log2 again would produce NaN for negative coefficients.
    plt.figure()
    plt.imshow(mfcc_spectrogram[0,:,:].numpy(), cmap='viridis')

    plt.figure()
    plt.plot(mfcc_spectrogram[0,:,:].numpy())
    plt.draw()

show_mfcc(no_waveform, no_sample_rate)

## Create an image from a spectrogram

At this point, you have a better understanding of your audio data, and the different transformations you can use on it. Now, let's create the images we will use for classification.

The following are two different functions that create the spectrogram images or the MFCC images for classification. We will use the spectrogram images to train our model.

In [None]:
def create_spectrogram_images(trainloader, label_dir):
    directory = f'./data/spectrograms/{label_dir}/'
    if os.path.isdir(directory):
        print(f'Data exists for {label_dir}')
    else:
        os.makedirs(directory, mode=0o777, exist_ok=True)
        for i, data in enumerate(trainloader):
            # Each batch holds one sample: waveform, sample rate, label, speaker ID
            waveform = data[0]
            sample_rate = data[1][0]
            label = data[2]
            ID = data[3]

            # Create the spectrogram and save it directly as an image file
            spectrogram_tensor = torchaudio.transforms.Spectrogram()(waveform)
            plt.imsave(f'{directory}spec_img{i}.png', spectrogram_tensor[0].log2()[0,:,:].numpy(), cmap='viridis')

Here's the function that creates the `MFCC` images.

In [None]:
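def create_mfcc_images(trainloader, label_dir):
    directory = f'./data/mfcc_spectrograms/{label_dir}/'
    os.makedirs(directory, mode=0o777, exist_ok=True)

    for i, data in enumerate(trainloader):
        # Each batch holds one sample: waveform, sample rate, label, speaker ID
        waveform = data[0]
        sample_rate = int(data[1][0])  # the loader returns a tensor; MFCC expects an int
        label = data[2]
        ID = data[3]

        # Create the MFCC and save the plotted figure as an image file.
        # MFCC values can be negative, so they are plotted without log2.
        mfcc_spectrogram = torchaudio.transforms.MFCC(sample_rate)(waveform)
        fig1 = plt.figure()
        plt.imshow(mfcc_spectrogram[0][0,:,:].numpy(), cmap='viridis')
        plt.draw()
        fig1.savefig(f'{directory}spec_img{i}.png', dpi=100)
        plt.close(fig1)  # free the figure so memory does not grow with each file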

Create the spectrogram images that you'll use for the audio classification.

In [None]:
create_spectrogram_images(traubkiader_yes, 'yes')
create_spectrogram_images(traubkiader_no, 'no')