In [1]:
%matplotlib inline

Amigos Audio Classification Tutorial
=========================
**Author**: Brandon Thai Tran <github.com/BrandonThaiTran>

In this notebook, we study the effect of movie audio on affective response in participants using the AMIGOS dataset. For more information on the AMIGOS dataset, see the links below.

Paper: <http://www.eecs.qmul.ac.uk/mmv/datasets/amigos/doc/Paper_TAC.pdf>

Webpage (includes how to download the dataset): http://www.eecs.qmul.ac.uk/mmv/datasets/amigos/index.html

The dataset records 40 individuals watching 16 different movie clips. Several modalities are recorded during each trial, and each participant's affective response is recorded both before and after the trial. The modalities include face video, body video, depth video, audio, GSR, EEG, and ECG.

Before running this notebook, we extracted the movie audio from the face videos.

In [2]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import Dataset
from torch.utils.data.sampler import SubsetRandomSampler
import torchaudio
import pandas as pd
import numpy as np

In [3]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

cuda


Inputs
------

Here are all of the parameters to change for the run.

We will write our own custom dataset. Download the data
and set the ``root_dir`` input to the root directory of the dataset.

The other inputs are as follows: ``num_classes`` is the number of
classes in the dataset, ``bs`` is the batch size used for
training and may be adjusted according to the capability of your
machine, ``num_epochs`` is the number of training epochs we want to run,
``feature_extract`` is a boolean that defines whether we are finetuning
or feature extracting, ``reduce`` is the flag for reducing the training
set size, and ``preload`` is the flag to continue training from a
preloaded model. If ``feature_extract = False``, the model is
finetuned and all model parameters are updated. If
``feature_extract = True``, only the last layer parameters are updated,
the others remain fixed.

In [4]:
# Top level data directory
root_dir = '/home/jupyter/datasets/amigos'

# directory with audio
data_dir = root_dir + '/audio'

# directory with csv
csv_dir = root_dir + '/annotations/SelfAsessment'

# csv file
csv_file = csv_dir + "/SelfAsessment.csv"

# Models to choose from [resnet, alexnet, vgg, squeezenet, densenet, inception]
model_name = 'ResNeXt-101-32x8d'

# Number of classes in the dataset
num_classes = 4

# Batch size for training (change depending on how much memory you have)
bs = 16

# Percentage of data to be used for validation
validation_split = .1

# Shuffle dataset
shuffle_dataset = True

# Number of epochs to train for 
num_epochs = 10

# Flag for feature extracting. When False, we finetune the whole model, 
#   when True we only update the reshaped layer params
feature_extract = True

# Flag for reducing size of the training matrix. When False, we do not reduce the size, 
#   when True we reduce the size.
reduce = False

# Flag to load from a model
preload = True

# The frequency to be resampled to
resample_freq = 8000

# number of data-loading workers
# set this to the number of CPU cores you have available
num_workers = 8

Reading the Dataset
---------------------

We will use the AMIGOS dataset to train our network. 
First, we will look at the csv file that provides information about the
individual sound files. ``pandas`` allows us to open the csv file and
use ``.iloc[]`` to access the data within it.

In [5]:
# Loading csv
csv = pd.read_csv(csv_file)


We will train two networks concurrently, one network for arousal and another for valence. Both will classify arousal and valence as either high or low. We will also look at the emotion classes that correspond with the arousal and valence labels: low valence and low arousal (LVLA), low valence and high arousal (LVHA), high valence and low arousal (HVLA), and high valence and high arousal (HVHA).
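The mapping from the two binary labels to the four emotion classes can be sketched as follows. This is only an illustration: the helper name ``quadrant_label`` is hypothetical, and the threshold of 5 is an assumption borrowed from the dataset class in this notebook, which treats ratings of 5 or higher as high.

```python
def quadrant_label(arousal, valence, threshold=5.0):
    """Map self-assessed arousal/valence ratings to LVLA, LVHA, HVLA, or HVHA."""
    # threshold is an assumption: ratings at or above it count as "high"
    valence_part = "HV" if valence >= threshold else "LV"
    arousal_part = "HA" if arousal >= threshold else "LA"
    return valence_part + arousal_part


print(quadrant_label(arousal=2.0, valence=3.5))  # LVLA
print(quadrant_label(arousal=7.0, valence=3.0))  # LVHA
print(quadrant_label(arousal=2.0, valence=8.0))  # HVLA
print(quadrant_label(arousal=5.0, valence=5.0))  # HVHA
```

Keeping the quadrant as a derived label means the arousal and valence networks can still be trained separately and combined afterwards.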

In [6]:
import IPython.display as ipd

In [7]:
# ipd.Audio('/home/jupyter/datasets/amigos/audio/Exp1_P01_audio/P1_10_audio.wav')

In [8]:
# ipd.Audio('/home/jupyter/datasets/amigos/audio/Exp1_P01_audio/P1_138_audio.wav')

Formatting the Data
-------------------

Now that we know the format of the csv file entries, we can construct
our dataset. We will create a wrapper class for our dataset using
``torch.utils.data.Dataset`` that will handle loading the files and
performing some formatting steps. The names of the audio files will be read from the CSV. We will hold out 10% of the samples for validation (``validation_split``) and train on the rest. The wrapper
class will store the file paths and the arousal and valence labels of
the audio files when initialized. The actual loading
and formatting steps will happen in the access function ``__getitem__``.

In ``__getitem__``, we use ``torchaudio.load()`` to convert the wav
files to tensors. ``torchaudio.load()`` returns a tuple containing the
newly created tensor along with the sampling frequency of the audio file
(44.1kHz for AMIGOS). The dataset uses two channels for audio, so
we average the two channels to convert the audio
data to one channel. Next, we need to format the audio data. The clips
are long, so we downsample each one to ``resample_freq`` (8kHz) with
``torchaudio.transforms.Resample()`` to keep the inputs manageable.
The clips also differ in length, while the network expects inputs of
a fixed size. Every tensor shorter than the longest downsampled clip
is therefore padded with zeros up to that maximum length, which is
computed when the dataset is initialized.
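The length bookkeeping behind the resampling step can be sketched with plain arithmetic. The formula mirrors the one the dataset class uses for ``max_signal_len``; the helper name ``resampled_length`` is made up for this illustration, and the 44.1kHz and 8kHz rates are the values quoted in this notebook.

```python
import math


def resampled_length(num_samples, sample_rate, resample_freq):
    """Number of samples after resampling, rounded up."""
    return int(math.ceil(num_samples / sample_rate * resample_freq))


# One second of 44.1 kHz audio becomes 8000 samples at 8 kHz.
one_second = resampled_length(44100, 44100, 8000)
print(one_second)

# Duration in seconds of a downsampled clip with 1,266,640 samples,
# the maximum length reported when the dataset is built.
duration_seconds = 1266640 / 8000
print(round(duration_seconds, 2))
```

Under these assumptions the longest clip lasts a little under 160 seconds, which is why the padded inputs are far longer than a few seconds of audio.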

In [9]:
# this function pads signals to the target length
def pad_signal(signal, target_len):
    # inputs: 
    #    signal: signal to be padded
    #    target_len: length to be padded to 
    # output: 
    #    padded_signal: size signal.shape[0] x target_len
    
    len_signal = signal.shape[1]
    num_zeros_needed = target_len - signal.shape[1]
    padded_signal = torch.zeros(1, target_len)
    
    if num_zeros_needed > 0:
        start_idx = np.random.randint(num_zeros_needed)
        padded_signal[:,start_idx:start_idx+len_signal] = signal
        return padded_signal
    else:
        return signal

In [10]:
class AmigosAudioDataset(Dataset):
    """Amigos audio dataset."""

    def __init__(self, csv_file, data_dir, resample_freq, transform=None, reduce_size=False):
        """
        Args:
            csv_file (string): Path to the csv file with annotations.
            data_dir (string): Directory with all the audio.
            transform (callable, optional): Optional transform to be applied
                on a sample.
            reduce_size (bool, optional): Optional flag to reduce the dataset for testing
            resample_freq: the frequency for the sound to be resampled at
        """
        # read csv and make a data frame
        df = pd.read_csv(csv_file)
        if reduce_size:
            self.data_frame = df[:33]
        else:
            self.data_frame = df
        self.resample_freq = resample_freq
        # store the data directory and transforms as members 
        self.data_dir = data_dir 
        self.transform = transform
        #initialize lists to hold the path and labels
        self.path = []
        self.arousal_labels = []
        self.valence_labels = []
        # find the maximum signal length
        self.max_signal_len = 0
        # number of faulty files
        self.num_faulty_files = 0
        # loop through csv entries to build paths
        for i in range(0,40):
            # get the participant id
            participant_id = i+1
            if participant_id < 10:
                participant_id = "0{}".format(participant_id)
            else:
                participant_id = str(participant_id)
            # gather file names
            for j in range(0,16):
                row_num = (i*16) + (j+1)
                audio_id = self.data_frame.iloc[row_num,2]
                audio_path = "{}/Exp1_P{}_audio/P{}_{}_audio.wav".format(self.data_dir,participant_id,i+1,audio_id[1:-1])
                # try to open it
                # if it opens find the waveform size
                # after that, add its info to the members
                try:
                    waveform, sample_rate = torchaudio.load(audio_path)
                except Exception:
                    print("Audio file {} is faulty".format(audio_path))
                    self.num_faulty_files += 1
                    continue
                # record the sample_rate
                self.sample_rate = sample_rate
                # append the path
                self.path.append(audio_path)
                # check if the waveform size is larger than the max_signal_len
                waveform_size = waveform.shape[1]
                if waveform_size > self.max_signal_len:
                    self.max_signal_len = waveform_size
                # set the labels 
                arousal = float(self.data_frame.iloc[row_num,4])
                valence = float(self.data_frame.iloc[row_num,5])
                arousal_label = 0
                valence_label = 0
                if arousal >= 5:
                    arousal_label = 1
                if valence >= 5:
                    valence_label = 1
                self.arousal_labels.append(arousal_label)
                self.valence_labels.append(valence_label)
        # fix max_signal_len to account for the resampling rate
        self.max_signal_len = int(np.ceil(self.max_signal_len/self.sample_rate*self.resample_freq))
        print("Maximum signal length after downsampling:",self.max_signal_len)

    def __len__(self):
        # one entry per audio file that loaded successfully
        return len(self.path)

    def __getitem__(self, idx):
        if torch.is_tensor(idx):
            idx = idx.tolist() 
        # load audio file
        sound, _ = torchaudio.load(self.path[idx], out = None, normalization = True)
        # Convert 2 channel audio to 1 channel
        sound_mono = torch.mean(sound, dim=0, keepdim=True)
        # downsample the mono audio to the new sample rate
        sound_downsamp = torchaudio.transforms.Resample(self.sample_rate, self.resample_freq)(sound_mono)
        # pad the signal
        sound_formatted = pad_signal(sound_downsamp, self.max_signal_len)
#         return sound_formatted, self.arousal_labels[index], self.valence_labels[index]
#         print(self.arousal_labels[idx])
        return sound_formatted, self.arousal_labels[idx]


Creating the Dataset and Dataloader Objects
-------------------
We now define our dataset and dataloaders. We do not have a predefined train/validation split, so we will use ``SubsetRandomSampler`` to create one.

In [11]:
dataset =   AmigosAudioDataset(csv_file, data_dir, resample_freq, transform=None, reduce_size=False)
dataset_size = len(dataset)
print("Total number of samples: " + str(len(dataset)))

# Creating data indices for training and validation splits
indices = list(range(dataset_size))
split = int(np.floor(validation_split * dataset_size))
random_seed = 42 # Random seed so we create the same train/val sets
if shuffle_dataset:
    np.random.seed(random_seed)
    np.random.shuffle(indices)
train_indices, val_indices = indices[split:], indices[:split]

# Creating data samplers 
train_sampler = SubsetRandomSampler(train_indices)
valid_sampler = SubsetRandomSampler(val_indices)

# Creating dataloaders
# device is a torch.device, so compare its type rather than the object itself
kwargs = {'num_workers': num_workers, 'pin_memory': True} if device.type == 'cuda' else {} #needed for using datasets on gpu
train_loader = torch.utils.data.DataLoader(dataset, batch_size=bs, 
                                           sampler=train_sampler, **kwargs)
validation_loader = torch.utils.data.DataLoader(dataset, batch_size=bs,
                                                sampler=valid_sampler, **kwargs)
    
# kwargs = {'num_workers': num_workers, 'pin_memory': True} if device == 'cuda' else {} #needed for using datasets on gpu

# data_loader = torch.utils.data.DataLoader(dataset, batch_size = bs, shuffle = True, **kwargs)


Audio file /home/jupyter/datasets/amigos/audio/Exp1_P39_audio/P39_10_audio.wav is faulty
Maximum signal length after downsampling: 1266640
Total number of samples: 639
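The split sizes implied by these numbers can be checked with a few lines of arithmetic. This is a standalone sketch: it hard-codes the 639 samples printed above, the ``validation_split`` of 0.1, and the batch size of 16 from the Inputs section, and it uses ``math.floor`` in place of ``np.floor``.

```python
import math

dataset_size = 639      # "Total number of samples" printed above
validation_split = 0.1  # same value as in the Inputs section
batch_size = 16         # same value as bs in the Inputs section

# Same rule as the notebook: floor(validation_split * dataset_size)
num_val = int(math.floor(validation_split * dataset_size))
num_train = dataset_size - num_val

# Batches per epoch for each loader (the last batch may be smaller)
train_batches = math.ceil(num_train / batch_size)
val_batches = math.ceil(num_val / batch_size)

print(num_val, num_train)          # 63 576
print(train_batches, val_batches)  # 36 4
```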


Build the model (M5)
-------------------
Since we are using raw audio data, we will use M5, the network architecture described in https://arxiv.org/pdf/1610.00087.pdf

In [12]:
class M5(nn.Module):
    def __init__(self):
        super(M5, self).__init__()
        self.conv1 = nn.Conv1d(1, 128, 80, 4)
        self.bn1 = nn.BatchNorm1d(128)
        self.pool1 = nn.MaxPool1d(4)
        self.conv2 = nn.Conv1d(128, 128, 3)
        self.bn2 = nn.BatchNorm1d(128)
        self.pool2 = nn.MaxPool1d(4)
        self.conv3 = nn.Conv1d(128, 256, 3)
        self.bn3 = nn.BatchNorm1d(256)
        self.pool3 = nn.MaxPool1d(4)
        self.conv4 = nn.Conv1d(256, 512, 3)
        self.bn4 = nn.BatchNorm1d(512)
        self.pool4 = nn.MaxPool1d(4)
        self.avgPool = nn.AvgPool1d(1219) #input should be 512x1219 so this outputs a 512x1
        self.fc1 = nn.Linear(512, 1)
        
    def forward(self, x):
        x = self.conv1(x)
        x = F.relu(self.bn1(x))
        x = self.pool1(x)
        x = self.conv2(x)
        x = F.relu(self.bn2(x))
        x = self.pool2(x)
        x = self.conv3(x)
        x = F.relu(self.bn3(x))
        x = self.pool3(x)
        x = self.conv4(x)
        x = F.relu(self.bn4(x))
        x = self.pool4(x)
        x = self.avgPool(x)
        x = x.permute(0, 2, 1) #change the 512x1 to 1x512
        x = self.fc1(x)
        x = F.log_softmax(x, dim = 2)
        return x
    
#     def forward(self, x):
#         print("pre forward x shape:", x.shape)
#         x = self.conv1(x)
#         print("post nn.Conv1d(1, 128, 80, 4) x shape:", x.shape)
#         x = F.relu(self.bn1(x))
#         print("post nn.BatchNorm1d(128) and relu:", x.shape)
#         x = self.pool1(x)
#         print("post maxpool1(4):", x.shape)
#         x = self.conv2(x)
#         print("post nn.Conv1d(128, 128, 3):", x.shape)
#         x = F.relu(self.bn2(x))
#         print("post nn.BatchNorm1d(128) and relu:", x.shape)
#         x = self.pool2(x)
#         print("post maxpool2(4):", x.shape)
#         x = self.conv3(x)
#         print("post nn.Conv1d(128, 256, 3):", x.shape)
#         x = F.relu(self.bn3(x))
#         print("post nn.BatchNorm1d(256) and relu:", x.shape)
#         x = self.pool3(x)
#         print("post maxpool3(4):", x.shape)
#         x = self.conv4(x)
#         print("post nn.Conv1d(256, 512, 3):", x.shape)
#         x = F.relu(self.bn4(x))
#         print("post nn.BatchNorm1d(512) and relu:", x.shape)
#         x = self.pool4(x)
#         print("post maxpool4(4):", x.shape)
#         x = self.avgPool(x)
#         print("post avgPool(30) x shape:", x.shape)
#         x = x.permute(0, 2, 1) #change the 512x1 to 1x512
#         print("post permute(0, 2, 1) x shape:", x.shape)
#         x = self.fc1(x)
#         print("post fc (512,1) x shape:",x.shape)
#         x = F.log_softmax(x, dim = 2)
#         print("post F.log_softmax(x, dim = 2):",x.shape)
#         return x

arousal_model = M5()
arousal_model.to(device)
valence_model = M5()
valence_model.to(device)
print(arousal_model)
print(valence_model)

M5(
  (conv1): Conv1d(1, 128, kernel_size=(80,), stride=(4,))
  (bn1): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (pool1): MaxPool1d(kernel_size=4, stride=4, padding=0, dilation=1, ceil_mode=False)
  (conv2): Conv1d(128, 128, kernel_size=(3,), stride=(1,))
  (bn2): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (pool2): MaxPool1d(kernel_size=4, stride=4, padding=0, dilation=1, ceil_mode=False)
  (conv3): Conv1d(128, 256, kernel_size=(3,), stride=(1,))
  (bn3): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (pool3): MaxPool1d(kernel_size=4, stride=4, padding=0, dilation=1, ceil_mode=False)
  (conv4): Conv1d(256, 512, kernel_size=(3,), stride=(1,))
  (bn4): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (pool4): MaxPool1d(kernel_size=4, stride=4, padding=0, dilation=1, ceil_mode=False)
  (avgPool): AvgPool1d(kernel_size=(1219,), stride=(1219,
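The kernel size of the average-pooling layer has to match the number of frames that reach it, which depends on the input length. The sketch below traces that length through the layers of ``M5``, assuming the standard output-length formula for unpadded, undilated 1-D convolution and pooling layers; the helper names are hypothetical.

```python
def conv_out_len(length, kernel_size, stride=1):
    """Output length of an unpadded, undilated 1-D convolution or pooling layer."""
    return (length - kernel_size) // stride + 1


def m5_feature_len(input_len):
    """Temporal length that reaches avgPool in M5 for a given input length."""
    length = conv_out_len(input_len, kernel_size=80, stride=4)  # conv1
    length = conv_out_len(length, kernel_size=4, stride=4)      # pool1
    for _ in range(3):                                          # conv2-4 and pool2-4
        length = conv_out_len(length, kernel_size=3)
        length = conv_out_len(length, kernel_size=4, stride=4)
    return length


print(m5_feature_len(32000))    # 30
print(m5_feature_len(1266640))  # 1236
```

If this arithmetic is right, a 32,000-sample input yields 30 frames, which matches the ``avgPool(30)`` label in the commented-out debug code, while the 1,266,640-sample inputs used here yield 1236 frames rather than 1219. ``AvgPool1d(1219)`` would then still produce a single output but ignore the last few frames; ``nn.AdaptiveAvgPool1d(1)`` is one way to avoid hard-coding the length.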

Define the optimizer and scheduler
-------------------

We will use the same optimization technique used in the paper, an Adam
optimizer with weight decay set to 0.0001. At first, we will train with
a learning rate of 0.01, but we will use a ``scheduler`` to decrease it
to 0.001 during training.

In [13]:
# for arousal
arousal_optimizer = optim.Adam(arousal_model.parameters(), lr = 0.01, weight_decay = 0.0001)
arousal_scheduler = optim.lr_scheduler.StepLR(arousal_optimizer, step_size = 20, gamma = 0.1)
# for valence
valence_optimizer = optim.Adam(valence_model.parameters(), lr = 0.01, weight_decay = 0.0001)
valence_scheduler = optim.lr_scheduler.StepLR(valence_optimizer, step_size = 20, gamma = 0.1)
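With ``step_size = 20`` and ``gamma = 0.1``, the learning rate stays at 0.01 for the first 20 epochs and drops to 0.001 afterwards. The function below is a plain-Python sketch of that step decay rule, not a call into PyTorch; ``step_lr`` is a hypothetical helper name.

```python
def step_lr(initial_lr, completed_epochs, step_size=20, gamma=0.1):
    """Learning rate after a number of scheduler steps under step decay."""
    return initial_lr * gamma ** (completed_epochs // step_size)


for completed_epochs in (0, 19, 20, 40):
    print(completed_epochs, step_lr(0.01, completed_epochs))
```

Because the scheduler is stepped once per epoch, a run shorter than 20 epochs never leaves the initial learning rate.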

Training and Testing the Network
--------------------------------
Now let’s define a training function that will feed our training data
into the model and perform the backward pass and optimization steps.


In [14]:
def train(model, epoch, optimizer=arousal_optimizer):
    # optimizer defaults to the arousal optimizer; pass valence_optimizer for valence_model
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        optimizer.zero_grad()
        data = data.to(device)
        print("data shape:", data.shape)
        target = target.to(device)
        output = model(data)
        output = output.permute(1, 0, 2) # model output is batchSize x 1 x numOutputs
        print("Output shape permute:", output.shape, "\nTarget shape:",target.shape)
        loss = F.nll_loss(output[0], target) # nll_loss expects batchSize x numClasses log-probabilities
        loss.backward()
        optimizer.step()
        if batch_idx % log_interval == 0: #print training stats
            # count only the samples drawn by the training sampler
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                epoch, batch_idx * len(data), len(train_loader.sampler),
                100. * batch_idx / len(train_loader), loss.item()))

Now that we have a training function, we need to make one for measuring
the network's accuracy. We will set the model to ``eval()`` mode and then
run inference on the validation set. Calling ``eval()`` sets the training
variable in all modules in the network to false. Certain layers like
batch normalization and dropout layers behave differently during
training so this step is crucial for getting correct results.


In [15]:
def validate(model, epoch):
    model.eval()
    correct = 0
    # gradients are not needed for evaluation
    with torch.no_grad():
        for data, target in validation_loader:
            data = data.to(device)
            target = target.to(device)
            output = model(data)
            output = output.permute(1, 0, 2)
            pred = output.max(2)[1] # get the index of the max log-probability
            correct += pred.eq(target).cpu().sum().item()
    # count only the samples drawn by the validation sampler, not the whole dataset
    total = len(validation_loader.sampler)
    print('\nValidation set: Accuracy: {}/{} ({:.0f}%)\n'.format(
        correct, total, 100. * correct / total))

Finally, we can train and validate the network. The cell below runs a
single epoch; widen the ``range`` to train for longer. The scheduler
lowers the learning rate by a factor of ten every 20 epochs, and the
network is validated after each epoch to see how the accuracy
varies during the training.


In [16]:
log_interval = 20
for epoch in range(1, 2):
    if epoch == 21: # StepLR with step_size=20 lowers the rate after 20 epochs
        print("First round of training complete. Setting learn rate to 0.001.")
    train(arousal_model, epoch)
    arousal_scheduler.step()
    validate(arousal_model, epoch)

data shape: torch.Size([16, 1, 1266640])
Output shape permute: torch.Size([1, 16, 1]) 
Target shape: torch.Size([16])


RuntimeError: cuda runtime error (59) : device-side assert triggered at /pytorch/aten/src/THC/generic/THCTensorMath.cu:29
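The error above is a generic CUDA assertion, so its cause cannot be read off the message alone. One plausible explanation, offered as an assumption rather than a confirmed diagnosis, is that ``fc1`` produces a single output per sample while the labels take the values 0 and 1: ``F.nll_loss`` needs one log-probability per class, so label 1 has no column to index. The CPU sketch below uses random tensors in place of the model output to show the difference between one and two outputs.

```python
import torch
import torch.nn.functional as F

batch_size = 4
targets = torch.tensor([0, 1, 0, 1])  # binary low/high labels

# Two outputs per sample: both label values index a valid column.
logits_two = torch.randn(batch_size, 2)
loss_two = F.nll_loss(F.log_softmax(logits_two, dim=1), targets)
print("loss with two outputs:", loss_two.item())

# One output per sample: log_softmax over a single class is always 0,
# and label 1 points past the only available column.
logits_one = torch.randn(batch_size, 1)
log_probs_one = F.log_softmax(logits_one, dim=1)
try:
    F.nll_loss(log_probs_one, targets)
    single_output_failed = False
except (IndexError, RuntimeError) as err:
    single_output_failed = True
    print("loss with one output failed:", err)
```

If this is indeed the cause, changing the last layer to ``nn.Linear(512, 2)`` would give each binary label its own log-probability. Running the failing step on the CPU usually gives a more readable message than the device-side assert.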