## PQ Prediction with HuBERT using a Deep Neural Net 

In the last notebook, we used HuBERT layers to predict the PQ Representation of the audio clips in the Perceptual Voice Qualities Database (PVQD). We used a naive aggregation technique to do this (simply averaging over the time steps), but what if instead we decided to try and take advantage of the temporal information to perform prediction? 

We may *want* to use deep neural networks because they are super duper cool and powerful, but we have to ask whether they are the best tool for the job. The PQ Representation is a vector that represents speaker identity, which can be thought of as the (mostly) time-invariant characteristics of speech. Given that, will the temporal information actually increase predictive performance?
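
To make the question concrete, here is a minimal sketch of the two options on a stand-in tensor. The random values and the length of 200 frames are assumptions for illustration only; this is not PVQD data.

```python
import torch

# Stand-in for one clip's HuBERT features: (time_steps, feature_dim).
# Random values and a short length are used purely for illustration.
torch.manual_seed(0)
features = torch.randn(200, 768)

# Option 1: naive aggregation. Averaging over time collapses the clip
# into a single fixed-size vector.
pooled = features.mean(dim=0)

# Option 2: keep the whole sequence so a temporal model can use frame order.
sequence = features

# Averaging is order-invariant: shuffling the frames leaves the mean
# unchanged (up to floating-point error), so ordering information is lost.
shuffled = features[torch.randperm(features.shape[0])]
pooled_shuffled = shuffled.mean(dim=0)
```

If the target really is time-invariant, the pooled vector may already carry most of what matters, which is the hypothesis this notebook tests.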

In [27]:
import os
import sys
os.environ["CUDA_VISIBLE_DEVICES"] = "6"
sys.path.append("../")

import torch
import torchaudio
from torch.utils.data import Dataset, DataLoader
import torch.nn as nn
import torch.utils.model_zoo as model_zoo
import torch.nn.functional as F


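import IPython
import matplotlib.pyplot as plt
import pandas as pd  # needed for pd.read_csv below; imported explicitly rather than relying on src.utils
import seaborn as sns

from src.utils import *

import time
from collections import OrderedDict  # names the layers inside nn.Sequential in the model below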

In [5]:
# Load in the HuBERT Model
torch.random.manual_seed(0)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

bundle = torchaudio.pipelines.HUBERT_BASE
hubert_model = bundle.get_model().to(device)

print("Sample Rate:", bundle.sample_rate)
print(device)

Sample Rate: 16000
cuda


In [6]:
# Load in the Data 
## Load in the DataFrame
y_train = pd.read_csv('../data/pvqd/train_test_split/y_train.csv', index_col=0)
y_val = pd.read_csv('../data/pvqd/train_test_split/y_val.csv', index_col=0)

# Same inefficient code as before!

data_path = "../data/pvqd/audio_clips/"
audio_files = os.listdir(data_path)
speaker_ids = [extract_speaker(audio_file) for audio_file in audio_files]

# Check that every speaker in y_train['File'] has a matching audio file,
# and print any that are missing
missing = [spk_id for spk_id in y_train["File"] if spk_id not in speaker_ids]
for spk_id in missing:
    print(spk_id)

# Dictionary to Link Speaker ID to Audio File for O(1) access
speaker_file_dict = {
    spk_id: os.path.join(data_path, audio_file)
    for spk_id, audio_file in zip(speaker_ids, audio_files)
}

### First things first, we gotta make the data compatible with PyTorch 

PyTorch is a very powerful package that allows us to process data in fast and remarkable ways. One way to take advantage of it is through PyTorch's [Dataset](https://pytorch.org/tutorials/beginner/data_loading_tutorial.html) class. Using this class together with a DataLoader, you can load data in parallel and easily modify the batch size (the number of input samples your network considers per training update).

Below, we're going to implement the class HubertDataset, which takes in a DataFrame of labels, the speaker-to-wavfile dictionary, and the HuBERT model loaded above, and returns the sixth layer of the HuBERT features along with the labels.
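
Before the real thing, a minimal sketch of the Dataset protocol may help. `ToyDataset` is a hypothetical stand-in with random features and 5-dimensional labels (the label size is an assumption chosen to match the model's output later on). The only requirements are `__len__` and `__getitem__`.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class ToyDataset(Dataset):
    """Hypothetical stand-in dataset: random features and 5-dim labels."""

    def __init__(self, n_samples=10):
        generator = torch.Generator().manual_seed(0)
        self.features = torch.randn(n_samples, 768, generator=generator)
        self.labels = torch.randn(n_samples, 5, generator=generator)

    def __len__(self):
        # Total number of samples
        return len(self.features)

    def __getitem__(self, index):
        # One (features, labels) pair
        return self.features[index], self.labels[index]


toy_dataset = ToyDataset()
# The DataLoader handles batching and shuffling on top of the Dataset
toy_loader = DataLoader(toy_dataset, batch_size=4, shuffle=True)

# 10 samples with batch_size=4 gives batches of 4, 4, and 2
batch_shapes = [(x.shape, y.shape) for x, y in toy_loader]
```

Changing `batch_size` is the only thing needed to change how many samples the network sees per update.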

In [133]:
## Dataset that returns HuBERT features and labels for one speaker

class HubertDataset(Dataset):
    def __init__(self, dataframe, spk_wav_arr, hubert_model, device):
        self.dataframe = dataframe
        self.spk_wav_arr = spk_wav_arr # dict that links speaker id to audio file path
        self.hubert_model = hubert_model # the hubert model
        self.hubert_model.eval() # Ensure it's in eval mode
        self.device = device # should we be doing this?
        self.pad_len = 3548 # Max seq length (in HuBERT frames)

    def __len__(self):
        # Return the total number of data samples
        return len(self.dataframe)

    # Given an index, return the features and the labels
    def __getitem__(self, index):
        # Retrieve a single row from the DataFrame: [speaker id, label values...]
        single_row = self.dataframe.values[index, :]
        speaker_id = single_row[0]
        audio_file = self.spk_wav_arr[speaker_id]

        waveform, sample_rate = torchaudio.load(audio_file)
        waveform = waveform.to(self.device) # Question: Should we be processing the waveform here?

        if sample_rate != bundle.sample_rate:
            waveform = torchaudio.functional.resample(waveform, sample_rate, bundle.sample_rate)

        # Extract the HuBERT layers like before. Gradients are disabled on the
        # assumption that HuBERT is only used as a fixed feature extractor.
        with torch.no_grad():
            features, _ = self.hubert_model.extract_features(waveform)

        # Index 5 is the sixth transformer layer: (batch, time, 768)
        features = features[5]

        # Have to swap the axes, since we want the hubert dimension to be the channels
        features = torch.swapaxes(features, 1, 2)

        # Zero-pad along the time axis to a fixed length of pad_len frames
        to_pad = self.pad_len - features.shape[-1]
        features = F.pad(features, (0, to_pad))

        # Load in the Labels
        labels = torch.from_numpy(single_row[1:].astype(float)).to(self.device)

        return features, labels


### Note: PyTorch's Dataset class is *incredibly* flexible

There's no need to make it load a pandas DataFrame at all. You could store everything on disk, and have the index be associated with a file location.
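
As a rough sketch of that idea, the hypothetical `FeatureFileDataset` below maps each index to a file on disk. The file names, the dictionary layout inside each file, and the tiny tensor sizes are all assumptions made for illustration.

```python
import os
import tempfile

import torch
from torch.utils.data import Dataset


class FeatureFileDataset(Dataset):
    """Hypothetical dataset where each index points at a saved tensor file."""

    def __init__(self, file_paths):
        self.file_paths = list(file_paths)

    def __len__(self):
        return len(self.file_paths)

    def __getitem__(self, index):
        # Nothing is held in memory until an item is requested
        item = torch.load(self.file_paths[index])
        return item["features"], item["labels"]


# Write a few fake items to a temporary directory to stand in for
# features that were computed once and cached.
tmp_dir = tempfile.mkdtemp()
paths = []
for i in range(3):
    path = os.path.join(tmp_dir, f"item_{i}.pt")
    fake_item = {"features": torch.full((768, 4), float(i)), "labels": torch.zeros(5)}
    torch.save(fake_item, path)
    paths.append(path)

disk_dataset = FeatureFileDataset(paths)
features, labels = disk_dataset[2]
```

Caching features this way would also avoid running HuBERT inside `__getitem__` on every epoch, at the cost of disk space.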

Let's use our created HubertDataset class to create a DataLoader

In [134]:
train_dataset = HubertDataset(y_train, speaker_file_dict, hubert_model, device)

batch_size = 8 # Set the batch size
data_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle = True)
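
One thing to watch for when batching: if each item already carries a leading dimension of 1 (items shaped like `[1, 768, T]`), the default collation stacks them into a 4D tensor, which `nn.Conv1d` does not accept. The sketch below uses a hypothetical `FakeItems` dataset with a short sequence length to show the shapes and one possible fix; whether the real training loop needs it depends on how the batches are consumed.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class FakeItems(Dataset):
    """Hypothetical stand-in that mimics a [1, channels, time] item shape."""

    def __len__(self):
        return 6

    def __getitem__(self, index):
        return torch.zeros(1, 768, 20), torch.zeros(5)


loader = DataLoader(FakeItems(), batch_size=3)
features, labels = next(iter(loader))

# Default collation stacks items along a new first axis
stacked_shape = features.shape

# Dropping the singleton axis gives the (batch, channels, time) layout
# that nn.Conv1d expects
squeezed = features.squeeze(1)
```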

## The Moment You've All Been Waiting for: Let's Build a Neural Network

For the purposes of this workshop, we're going to just build a convolutional neural network with an LSTM on top. There are other ways to model speech, or sequential data more broadly, such as using a Transformer (or a Conformer if you want to combine convolutions with a Transformer). We're going to keep it simple. Feel free to modify the network though; Torchaudio makes it pretty easy (see https://pytorch.org/audio/main/generated/torchaudio.models.Conformer.html for example).

In [166]:
class ClassifierModel(nn.Module):
    def __init__(self):
      super().__init__()

      # Two 1D conv blocks over time; kernel 5 with padding 2 keeps the sequence length
      self.conv_block1 = nn.Sequential(OrderedDict([
        ('conv', nn.Conv1d(768,256,5, padding=2)),
        ('norm', nn.InstanceNorm1d(256)),
        ('relu', nn.ReLU()),
      ]))

      self.conv_block2 = nn.Sequential(OrderedDict([
        ('conv', nn.Conv1d(256,256,5, padding=2)),
        ('norm', nn.InstanceNorm1d(256)),
        ('relu', nn.ReLU()),
      ]))

      # batch_first=True so the LSTM reads its input as (batch, time, features);
      # without it, the batch axis would be treated as the sequence axis
      self.lstm = nn.LSTM(256, 512, num_layers=2, bidirectional=True, batch_first=True)
      # Pools the last dimension, which after the LSTM is the 1024 features (-> 512)
      self.pool = nn.MaxPool1d(5, 2, 2)
      
      # 3548 time steps x 512 pooled features -> 5 outputs
      self.final_layer = nn.Linear(3548 * 512, 5)

    def forward(self, x):
      # x: (batch, 768, time)
      x = self.conv_block1(x)
      x = self.conv_block2(x)

      # (batch, 256, time) -> (batch, time, 256) for the LSTM
      x = torch.swapaxes(x, 1, 2)

      x, _ = self.lstm(x)
      x = self.pool(x)

      x = torch.flatten(x, start_dim=1)

      x = self.final_layer(x)

      return x

In [167]:
model = ClassifierModel()
model.to(device)
model

ClassifierModel(
  (conv_block1): Sequential(
    (conv): Conv1d(768, 256, kernel_size=(5,), stride=(1,), padding=(2,))
    (norm): InstanceNorm1d(256, eps=1e-05, momentum=0.1, affine=False, track_running_stats=False)
    (relu): ReLU()
  )
  (conv_block2): Sequential(
    (conv): Conv1d(256, 256, kernel_size=(5,), stride=(1,), padding=(2,))
    (norm): InstanceNorm1d(256, eps=1e-05, momentum=0.1, affine=False, track_running_stats=False)
    (relu): ReLU()
  )
  (lstm): LSTM(256, 512, num_layers=2, batch_first=True, bidirectional=True)
  (pool): MaxPool1d(kernel_size=5, stride=2, padding=2, dilation=1, ceil_mode=False)
  (final_layer): Linear(in_features=1816576, out_features=5, bias=True)
)
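
The `in_features=1816576` of the final layer can be checked by hand. The sketch below redoes the arithmetic: the bidirectional LSTM outputs 2 x 512 = 1024 features per time step, the pooling layer acts on that last dimension, and the result is flattened across all 3548 time steps. `pooled_length` is a helper introduced here for illustration.

```python
import torch
import torch.nn as nn


def pooled_length(length, kernel_size=5, stride=2, padding=2):
    """Output length of MaxPool1d with dilation 1 and ceil_mode False."""
    return (length + 2 * padding - kernel_size) // stride + 1


seq_len = 3548            # padded number of HuBERT frames
lstm_features = 2 * 512   # a bidirectional LSTM doubles the hidden size

# Flattening (time, pooled features) gives the input size of the final layer
in_features = seq_len * pooled_length(lstm_features)

# Sanity check of the formula against the actual layer on a tiny tensor
pool = nn.MaxPool1d(5, 2, 2)
tiny = torch.zeros(1, 3, lstm_features)
pooled_shape = pool(tiny).shape
```

Because the pooling halves the feature dimension rather than the time dimension, the flattened size still scales with the full sequence length.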

In [168]:
sample, label = train_dataset[1]
out = model(sample)

In [169]:
sample.shape

torch.Size([1, 768, 3548])

In [170]:
out.shape

torch.Size([1, 5])

In [132]:
max_seq_len

3548