## PQ Prediction with HuBERT using a Deep Neural Net 

In the last notebook, we used HuBERT layers to predict the PQ Representation of the audio clips in the Perceptual Voice Qualities Database (PVQD). We used a naive aggregation technique to do this (simply averaging over the time steps), but what if we instead took advantage of the temporal information to perform prediction?

We may *want* to use deep neural networks, because they are super duper cool and powerful, but we have to ask whether they are the best tool for the job. Given that the PQ Representation is a vector that represents speaker identity, which can (mostly) be thought of as the time-invariant characteristics of speech, will the temporal information increase predictive performance?
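
To make the question concrete, here is a minimal sketch of what averaging over time throws away. The features are random stand-ins rather than real HuBERT outputs (only the 768-dimensional width matches HuBERT Base), and the variable names are illustrative:

```python
import torch

torch.manual_seed(0)

# Stand-in for one clip's features: (time_steps, feature_dim).
# Random values are used purely for illustration.
features = torch.randn(50, 768)

# Naive aggregation: average over the time axis, one vector per clip.
pooled = features.mean(dim=0)

# Shuffle the frames in time, then pool again.
shuffled = features[torch.randperm(features.shape[0])]
pooled_shuffled = shuffled.mean(dim=0)

# The two pooled vectors agree up to floating point error, so mean
# pooling cannot see the order in which the frames occurred.
same_after_shuffle = torch.allclose(pooled, pooled_shuffled, atol=1e-5)
print(pooled.shape, same_after_shuffle)
```

Any gain from a temporal model therefore has to come from frame order or dynamics, which is exactly the information this average ignores.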

In [2]:
import os
import sys
os.environ["CUDA_VISIBLE_DEVICES"] = "6"
sys.path.append("../")

import torch
import torchaudio
from torch.utils.data import Dataset, DataLoader

import IPython
import matplotlib.pyplot as plt
import pandas as pd  # used below for pd.read_csv
import seaborn as sns

from src.utils import *

import time


In [5]:
# Load in the HuBERT Model
torch.random.manual_seed(0)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

bundle = torchaudio.pipelines.HUBERT_BASE
hubert_model = bundle.get_model().to(device)

print("Sample Rate:", bundle.sample_rate)
print(device)

Sample Rate: 16000
cuda


In [6]:
# Load in the Data 
## Load in the DataFrame
y_train = pd.read_csv('../data/pvqd/train_test_split/y_train.csv', index_col=0)
y_val = pd.read_csv('../data/pvqd/train_test_split/y_val.csv', index_col=0)

# Same inefficient code as before!

data_path = "../data/pvqd/audio_clips/"
audio_files = os.listdir(data_path)
speaker_ids = [extract_speaker(audio_file) for audio_file in audio_files]

# Check that every speaker id in y_train['File'] has a matching audio file,
# and print any that are missing
for spk_id in y_train["File"]:
    if spk_id not in speaker_ids:
        print(spk_id)

# Dictionary to link speaker ID to audio file for O(1) access
speaker_file_dict = {
    spk_id: os.path.join(data_path, audio_file)
    for spk_id, audio_file in zip(speaker_ids, audio_files)
}

### First things first, we gotta make the data compatible with PyTorch 

PyTorch is a very powerful package that lets us process data quickly. The way to take advantage of it here is PyTorch's [Dataset](https://pytorch.org/tutorials/beginner/data_loading_tutorial.html) class. Wrapping the data in a `Dataset` and handing it to a `DataLoader` lets you load samples in parallel and easily change the batch size (the number of input samples your network considers per training update).
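
As a rough illustration of the pattern, here is a toy dataset wrapped in a `DataLoader`. `ToyDataset` and its contents are made up for this sketch and have nothing to do with PVQD:

```python
import torch
from torch.utils.data import Dataset, DataLoader


class ToyDataset(Dataset):
    """Hypothetical dataset: 10 samples with 4 features and 1 label each."""

    def __init__(self):
        self.x = torch.arange(40, dtype=torch.float32).reshape(10, 4)
        self.y = torch.arange(10, dtype=torch.float32)

    def __len__(self):
        # Total number of samples
        return len(self.x)

    def __getitem__(self, index):
        # One (features, label) pair
        return self.x[index], self.y[index]


# num_workers=0 loads in the main process; a larger value loads
# batches in parallel worker processes.
loader = DataLoader(ToyDataset(), batch_size=4, shuffle=False, num_workers=0)

batch_shapes = [(xb.shape, yb.shape) for xb, yb in loader]
print(batch_shapes)
```

With 10 samples and `batch_size=4`, the loader yields two full batches and one smaller final batch. Changing the batch size is a one-argument change and does not touch the dataset class.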

Below, we're going to use the HuBERT model loaded above and implement the class `HubertDataset`, which takes in a dataframe of labels, the speaker-to-wavfile mapping, and a HuBERT model, and returns the sixth layer of the HuBERT features along with the labels.

In [None]:
## TODO: Implement the Dataset function __getitem__ 

class HubertDataset(Dataset):
    def __init__(self, dataframe, spk_wav_arr, hubert_model):
        self.dataframe = dataframe
        self.spk_wav_arr = spk_wav_arr # links speaker id to its wav file
        self.hubert_model = hubert_model # the hubert model

    def __len__(self):
        # Return the total number of data samples
        return len(self.dataframe)

    # TODO: Implement the __getitem__ function
    # Given an index, return the features and the labels
    def __getitem__(self, index):
        # Retrieve a single row from the DataFrame
        single_row = self.dataframe.iloc[index, :]

        # Process the row (assuming you have a function that does this)
        left, right, label = self.read_row(single_row)

        return left, right, label
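
One possible way to fill in the TODO is sketched below. It is not the notebook's reference solution and rests on several assumptions: every column other than `File` is a numeric label, each clip is mono audio at the model's sample rate, and the audio loader is passed in as a `load_audio` argument (in this notebook that could be `torchaudio.load`). `HubertDatasetSketch`, `FakeHubert`, and `fake_load` are hypothetical names. The stand-ins only mimic the call shape of torchaudio's `extract_features(waveforms, num_layers=...)`, so the sketch runs without audio files or model weights:

```python
import pandas as pd
import torch
from torch.utils.data import Dataset


class HubertDatasetSketch(Dataset):
    def __init__(self, dataframe, spk_wav_map, hubert_model, load_audio, layer=6):
        self.dataframe = dataframe
        self.spk_wav_map = spk_wav_map    # speaker id -> audio file path
        self.hubert_model = hubert_model  # must offer extract_features(waveform, num_layers=...)
        self.load_audio = load_audio      # returns (waveform, sample_rate)
        self.layer = layer
        # Assumption: every column except "File" holds a numeric label.
        self.label_cols = [c for c in dataframe.columns if c != "File"]

    def __len__(self):
        return len(self.dataframe)

    def __getitem__(self, index):
        row = self.dataframe.iloc[index]
        waveform, _sample_rate = self.load_audio(self.spk_wav_map[row["File"]])
        with torch.no_grad():
            # One tensor per transformer layer, up to and including self.layer
            layer_outputs, _lengths = self.hubert_model.extract_features(
                waveform, num_layers=self.layer
            )
        features = layer_outputs[-1].squeeze(0)  # (time_steps, feature_dim)
        label = torch.tensor(row[self.label_cols].astype("float32").to_numpy())
        return features, label


class FakeHubert:
    """Stand-in model: returns zeros with HuBERT-like shapes."""

    def extract_features(self, waveform, num_layers=None):
        frames = waveform.shape[-1] // 320  # roughly one frame per 20 ms at 16 kHz
        return [torch.zeros(waveform.shape[0], frames, 768) for _ in range(num_layers)], None


def fake_load(path):
    """Stand-in loader: one second of silent mono audio at 16 kHz."""
    return torch.zeros(1, 16000), 16000


labels = pd.DataFrame({"File": ["spk1", "spk2"], "score_a": [0.1, 0.2], "score_b": [1.0, 2.0]})
dataset = HubertDatasetSketch(
    labels, {"spk1": "spk1.wav", "spk2": "spk2.wav"}, FakeHubert(), fake_load
)
features, label = dataset[0]
print(features.shape, label)
```

With the real model on a GPU, the waveform would also need to be moved to the model's device before the call to `extract_features`. Because clips differ in length, the returned features have different numbers of time steps, so batching them in a `DataLoader` would need padding or a batch size of 1.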



