### RAVDESS Emotional Speech

In this notebook we are going to create a simple emotion classifier using the [ravdess-emotional-speech-audio](https://www.kaggle.com/uwrfkaggler/ravdess-emotional-speech-audio) dataset from Kaggle.


### Data
I've downloaded the dataset from [Kaggle](https://www.kaggle.com/uwrfkaggler/ravdess-emotional-speech-audio) and uploaded it to my Google Drive so that we can load it in this notebook easily.

### Folder structures

```
📁 ravdess-emotion
  📁 Actor_01
    🎵 03-01-01-01-01-01-01.wav
    ....
  ...
  📁 Actor_24
```

### File naming
All the files in this dataset follow the naming convention below, and each part of the name encodes something, according to the [ravdess-emotional dataset](https://www.kaggle.com/uwrfkaggler/ravdess-emotional-speech-audio):

```
03-01-01-01-01-01-01.wav
```
Filename example: `03-01-06-01-02-01-12.wav`

1. Audio-only (03)
2. Speech (01)
3. Fearful (06)
4. Normal intensity (01)
5. Statement "dogs" (02)
6. 1st Repetition (01)
7. 12th Actor (12) (Female, as the actor ID number is even).

Filename identifiers

* Modality (01 = full-AV, 02 = video-only, 03 = audio-only).
* Vocal channel (01 = speech, 02 = song).
* Emotion (01 = neutral, 02 = calm, 03 = happy, 04 = sad, 05 = angry, 06 = fearful, 07 = disgust, 08 = surprised).
* Emotional intensity (01 = normal, 02 = strong). NOTE: There is no strong intensity for the 'neutral' emotion.
* Statement (01 = "Kids are talking by the door", 02 = "Dogs are sitting by the door").
* Repetition (01 = 1st repetition, 02 = 2nd repetition).
* Actor (01 to 24. Odd numbered actors are male, even numbered actors are female).
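
The identifier scheme above can be decoded mechanically. The following is a minimal sketch that is not part of the original notebook: the helper name `decode_filename` and the lookup-table names are assumptions made for illustration, and the tables are transcribed from the list above.

```python
# Lookup tables transcribed from the identifier list above.
MODALITIES = {1: "full-AV", 2: "video-only", 3: "audio-only"}
VOCAL_CHANNELS = {1: "speech", 2: "song"}
EMOTIONS = {1: "neutral", 2: "calm", 3: "happy", 4: "sad",
            5: "angry", 6: "fearful", 7: "disgust", 8: "surprised"}
INTENSITIES = {1: "normal", 2: "strong"}
STATEMENTS = {1: "Kids are talking by the door",
              2: "Dogs are sitting by the door"}


def decode_filename(file_name):
    """Decode a RAVDESS file name into readable fields (hypothetical helper)."""
    stem = file_name.rsplit(".", 1)[0]           # drop the ".wav" extension
    parts = [int(p) for p in stem.split("-")]    # seven two-digit identifiers
    if len(parts) != 7:
        raise ValueError(f"expected 7 identifiers, got {len(parts)}: {file_name}")
    modality, channel, emotion, intensity, statement, repetition, actor = parts
    return {
        "modality": MODALITIES[modality],
        "vocal_channel": VOCAL_CHANNELS[channel],
        "emotion": EMOTIONS[emotion],
        "intensity": INTENSITIES[intensity],
        "statement": STATEMENTS[statement],
        "repetition": repetition,
        "actor": actor,
        # Odd-numbered actors are male, even-numbered actors are female.
        "gender": "male" if actor % 2 == 1 else "female",
    }


print(decode_filename("03-01-06-01-02-01-12.wav"))
```

For the example file name this yields an audio-only, speech, fearful, normal-intensity recording of the "dogs" statement by actor 12 (female), which agrees with the breakdown given earlier.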

### What are we going to build?

We are going to build a deep learning model that will be able to classify:

1. the emotion in an audio clip
2. the gender of the speaker
3. the emotion intensity

The following will be the labels that we will have in the dataset:

```py
emotions = ["neutral", "calm", "happy", "sad", "angry", "fearful", "disgust", "surprised"]
emotion_intensity = ["normal", "strong"]
gender = ["male", "female"]

```

### Mounting the drive

In [1]:
from google.colab import drive
drive.mount("/content/drive")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### Installation of `torchaudio`, `librosa` and `boto3`

In [2]:
!pip install -q torchaudio librosa boto3

### Imports

The following code cell will contain the basic imports that we will need in this notebook.

In [3]:
import os, time, math, random, json

import torch
import torchaudio

from torch import nn
from torch.nn import functional as Functional
from torch.utils.data import DataLoader, Dataset
from sklearn.model_selection import train_test_split

import torchaudio.functional as F
import torchaudio.transforms as T

import numpy as np
import matplotlib.pyplot as plt
from IPython.display import Audio, display
from prettytable import PrettyTable

torch.__version__

'1.10.0+cu111'

### Setting Seed

We set a fixed `SEED` so that the results are reproducible.

In [4]:
SEED = 42

torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True
np.random.seed(SEED)
random.seed(SEED)

### Device
Get GPU acceleration if possible:

In [5]:
device = torch.device("cuda" if torch.cuda.is_available() else 'cpu')
device

device(type='cuda')

### Base path 

The following is the base path to where our folders containing audio files are located:

In [6]:
base_path = "/content/drive/My Drive/Computer Audio/ravdess-emotions"

In [7]:
folders = os.listdir(base_path)

### Helper function `play_audio`
This helper function takes in the `waveform` and the `sample_rate` as arguments and displays an audio player in the notebook.

In [8]:
def play_audio(waveform, sample_rate):
  waveform = waveform.numpy()
  num_channels, num_frames = waveform.shape
  if num_channels == 1:
    display(Audio(waveform[0], rate=sample_rate))
  elif num_channels == 2:
    display(Audio((waveform[0], waveform[1]), rate=sample_rate))
  else:
    raise ValueError("Waveforms with more than 2 channels are not supported.")
    

### Classes


In [9]:
emotions = ["neutral", "calm", "happy", "sad", "angry", "fearful", "disgust", "surprised"]
emotion_intensities = ["normal", "strong"]
genders = ["male", "female"]

### Helper functions
1. `load_audio`:

This function takes in the path to a single `.wav` file and loads it using the `torchaudio.load()` function, which returns the `waveform` as a torch tensor together with the `sample_rate` of that audio.

2. `get_labels`

This function takes in a file name of the format `03-01-01-01-01-01-01.wav` and returns three labels `(emotion, emotion_intensity, gender)` as zero-based indices into the class lists above.


In [10]:
def load_audio(path:str):
  waveform, sample_rate = torchaudio.load(path)
  return waveform, sample_rate

def get_labels(file_name: str):
  file_name = file_name.replace(".", "-")
  _, _, emotion, emotion_int, _, _, gender, _ = file_name.split("-")
  emotion = int(emotion) - 1
  emotion_int = int(emotion_int) - 1
  gender = 0 if int(gender)%2 == 1 else 1
  return emotion, emotion_int, gender

get_labels("03-01-06-01-02-01-12.wav")

(5, 0, 1)

### Extracting the features and labels

In the following code cell we are going to load every audio file in the dataset folders and generate its labels. Only single-channel (mono) files are kept.

In [11]:
# features
features = list()

# labels
y_emotions = list()
y_emotions_intensities = list()
y_genders = list()

for folder in folders:
  for filename in os.listdir(os.path.join(base_path, folder)):
    fname = os.path.join(base_path, folder, filename)

    base_name = os.path.basename(fname)
    wave_form, sample_rate = load_audio(fname)
    num_channels, num_frames = wave_form.shape
    if num_channels == 1:
      e, ei, g = get_labels(base_name)
      features.append([wave_form, sample_rate])
      y_emotions.append(e)
      y_emotions_intensities.append(ei)
      y_genders.append(g)
    else:
      continue
    
print("Done")

Done


### Playing a single audio from the data

In the following code cell we are going to play a single audio and display the labels for a given audio.

In [12]:

print("emotion: ", emotions[y_emotions[0]])
print("gender: ", genders[y_genders[0]])
print("emotion intensity: ", emotion_intensities[y_emotions_intensities[0]])

waveform, sample_rate = features[0]
play_audio(waveform, sample_rate)


emotion:  calm
gender:  male
emotion intensity:  strong


### Checking the total examples

In [13]:
len(y_emotions_intensities), len(y_emotions), len(y_genders), len(features)

(1435, 1435, 1435, 1435)

### Splitting the data into train and test

We are going to use `train_test_split` from `sklearn.model_selection` to split the data into train and test sets with a `9:1` ratio.

In [14]:
X_train, X_test, y_train_emotions, y_test_emotions, y_train_genders, y_test_genders, y_train_emotions_intensities, y_test_emotions_intensities= train_test_split(
    features, y_emotions, y_genders, y_emotions_intensities, random_state=SEED, test_size = .1
) 
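
As a quick sanity check on the split sizes, the arithmetic can be reproduced by hand. This is a sketch under one assumption: that the test count is rounded up and the remainder goes to training, which is how I understand `train_test_split` to treat a fractional `test_size`.

```python
import math

n_samples = 1435   # total number of examples reported by the notebook
test_size = 0.1    # the value passed to train_test_split

# Assumption: the test count is rounded up, the rest is used for training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)  # 1291 144
```

These numbers agree with the subset sizes that the notebook reports after building the datasets.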

### Checking if the features and labels for each set have the same length

In [15]:
assert len(X_train) == len(y_train_emotions) == len(y_train_genders) == len(y_train_emotions_intensities), "Features and labels do not have the same size."
assert len(X_test) == len(y_test_emotions) == len(y_test_genders) == len(y_test_emotions_intensities), "Features and labels do not have the same size."

### Creating the dataset

Next we are going to create a dataset class called `RAVDESS` in the code cell that follows:

In [16]:
class RAVDESS(Dataset):
  def __init__(self,
               features,
               y_emotions,
               y_emotions_intensities,
               y_genders,
               transform = None
               ):
    self.transform = transform
    self.features = [i[0].numpy() for i in features]
    self.y_emotions = y_emotions 
    self.y_emotions_intensities = y_emotions_intensities
    self.y_genders = y_genders
    self.len = len(features)

  def __getitem__(self, index):
    sample = (
        self.features[index],
        self.y_emotions[index],
        self.y_emotions_intensities[index],
        self.y_genders[index]
    )
    if self.transform is not None:
      sample = self.transform(sample)
    return sample
    
  def __len__(self):
    return self.len

  def __repr__(self):
    return "RAVDESS Dataset"

### Transform
We are going to create a custom `ToTensor` transform that will convert the features and labels to tensors.

In [17]:
class ToTensor:
  def __call__(self, samples):
    X, y_e, y_e_i, y_g = samples

    X = torch.from_numpy(X.astype('float32'))
    y_e = torch.from_numpy(np.array(y_e)).long()
    y_e_i = torch.from_numpy(np.array(y_e_i, dtype="float32"))
    y_g = torch.from_numpy(np.array(y_g, dtype="float32"))
    return X, y_e, y_e_i, y_g 

### Train and Test datasets

Now we can create the train and test datasets.

In [18]:
train = RAVDESS(
    X_train,
    y_train_emotions,
    y_train_emotions_intensities,
    y_train_genders,
    transform=ToTensor()
)

test = RAVDESS(
    X_test,
    y_test_emotions,
    y_test_emotions_intensities,
    y_test_genders,
    transform=ToTensor()
)

### Counting Examples for each set.

In [19]:
def tabulate(column_names, data):
  table = PrettyTable(column_names)
  for row in data:
    table.add_row(row)
  print(table)


column_names = ["SUBSET", "EXAMPLE(s)"]
data = [
        ["training", len(train)],
        ['validation/test', len(test)],
]
tabulate(column_names, data)

+-----------------+------------+
|      SUBSET     | EXAMPLE(s) |
+-----------------+------------+
|     training    |    1291    |
| validation/test |    144     |
+-----------------+------------+


In [20]:
train[0]

(tensor([[-1.2207e-04, -9.1553e-05,  0.0000e+00,  ..., -3.0518e-05,
           0.0000e+00,  9.1553e-05]]), tensor(4), tensor(1.), tensor(1.))

### Creating Loaders

We are then going to create loaders for the train and test sets. Each loader will have a `collate_fn` which preprocesses each batch inside the `DataLoader` class from `torch.utils.data`. We are going to create 2 functions, which are:

1. pad_sequence

* The audio clips that we are working with have different lengths, so we use `pad_sequence` from `torch.nn.utils.rnn` to pad the shorter clips with `0`s up to the length of the longest clip in the batch. We set `batch_first` to `True` because we are going to use `Conv1d` layers, which expect input of shape `(batch, channels, length)`, so the batch dimension must come first.

2. collate_fn

In this function we are going to transform our audio `waveforms` and then apply the `pad_sequence` function. By transforming the waveforms I mean downsampling each waveform to a new, lower sample rate.
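
The padding step can be illustrated without PyTorch. The following plain-Python sketch only mirrors the idea of right-padding with zeros; `pad_to_longest` is a hypothetical helper, not the `torch.nn.utils.rnn.pad_sequence` function that the notebook actually uses.

```python
def pad_to_longest(sequences, padding_value=0.0):
    """Right-pad every sequence with `padding_value` up to the longest length."""
    longest = max(len(seq) for seq in sequences)
    return [list(seq) + [padding_value] * (longest - len(seq))
            for seq in sequences]


# Three made-up "waveforms" of different lengths.
batch = [[0.1, 0.2, 0.3], [0.4], [0.5, 0.6]]
padded = pad_to_longest(batch)

print(padded)
# [[0.1, 0.2, 0.3], [0.4, 0.0, 0.0], [0.5, 0.6, 0.0]]
```

After padding, every row has the same length, so the batch can be stacked into a single rectangular tensor.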


### Downsampling the audio

Most of our audio files have a sample rate of `48000` Hz, which is high and makes training on the raw waveform inefficient. We can downsample using `torchaudio.transforms.Resample`, setting the new sample rate to `.25` of the original sample rate.

In [21]:
new_sample_rate = int(sample_rate*.25)
sample_rate, new_sample_rate

(48000, 12000)
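
To get a feel for what this saves, here is a small worked example. The 3-second clip length is a hypothetical value chosen for illustration, not a measurement from the dataset.

```python
orig_rate = 48000                  # sample rate reported in the cell above
new_rate = int(orig_rate * 0.25)   # 12000, as computed in the cell above


def num_samples(duration_s, rate):
    """Number of samples in a clip of `duration_s` seconds at `rate` Hz."""
    return int(duration_s * rate)


duration_s = 3.0                   # hypothetical clip length in seconds
before = num_samples(duration_s, orig_rate)
after = num_samples(duration_s, new_rate)

# The highest frequency a sampled signal can represent is half its sample rate.
nyquist_hz = new_rate / 2

print(before, after, nyquist_hz)   # 144000 36000 6000.0
```

The model therefore sees a quarter as many samples per clip, at the cost of discarding frequency content above 6 kHz.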

In [22]:
transform = torchaudio.transforms.Resample(orig_freq=sample_rate,
                                           new_freq=new_sample_rate)

In [23]:
def pad_sequence(batch):
  batch = [item.t() for item in batch]
  batch = torch.nn.utils.rnn.pad_sequence(batch, batch_first=True, padding_value=0., )
  return batch.permute(0, 2, 1)

def collate_fn(batch):
  tensors, emotion_targets = [], []
  emotion_intensity_targets, gender_targets = [], []

  for waveform, e, ei, g  in batch:
    # apply the transformation: downsample the waveform from 48000 Hz to 12000 Hz
    tensors += [transform(waveform)]
    emotion_targets += [e]
    emotion_intensity_targets += [ei]
    gender_targets += [g]
  tensors = pad_sequence(tensors)
  emotion_targets = torch.stack(emotion_targets)
  emotion_intensity_targets = torch.stack(emotion_intensity_targets)
  gender_targets = torch.stack(gender_targets)
  return tensors, emotion_targets, emotion_intensity_targets, gender_targets


### Loaders 

We are going to use a smaller batch_size of `16` since our dataset is small.

In [24]:
BATCH_SIZE = 16
train_loader = torch.utils.data.DataLoader(
    train,
    batch_size=BATCH_SIZE,
    shuffle=True,
    collate_fn=collate_fn
)

test_loader = torch.utils.data.DataLoader(
    test,
    batch_size=BATCH_SIZE,
    collate_fn=collate_fn,
)

### Checking a single example in a batch

### Creating a Model

We are going to use the modified version of the [`M5`](https://arxiv.org/pdf/1610.00087.pdf) model to perform our audio classification. 

> _Note that our model will be outputting 3 different predictions, one per label._

In [25]:
class AudioClassifier(nn.Module):
  def __init__(self,
               n_input=1,
               stride=16, 
               n_channel=32,
               emotion_n_output=8,
               emotion_intensity_output=1,
               gender_output=1
               ):
    super(AudioClassifier, self).__init__()
    self.features = nn.Sequential(
        nn.Conv1d(n_input, n_channel, kernel_size=80, stride=stride),
        nn.BatchNorm1d(n_channel),
        nn.ReLU(),
        nn.MaxPool1d(4),
        nn.Conv1d(n_channel, n_channel, kernel_size=3),
        nn.BatchNorm1d(n_channel),
        nn.ReLU(),
        nn.MaxPool1d(4),
        nn.Conv1d(n_channel, 2 * n_channel, kernel_size=3),
        nn.BatchNorm1d(2 * n_channel),
        nn.ReLU(),
        nn.MaxPool1d(4),
        nn.Conv1d(2 * n_channel, 2 * n_channel, kernel_size=3),
        nn.BatchNorm1d(2 * n_channel),
        nn.ReLU(),
        nn.MaxPool1d(4)
    )
    self.emotion_classifier = nn.Sequential(
        nn.Linear(2 * n_channel, 64),
        nn.Linear(64, emotion_n_output)
    )
    self.emotion_intensity_classification = nn.Sequential(
        nn.Linear(2 * n_channel, 64),
        nn.Linear(64, emotion_intensity_output)
    )
    self.gender = nn.Sequential(
        nn.Linear(2 * n_channel, 64),
        nn.Linear(64, gender_output)
    )

  def forward(self, x):
    x = self.features(x)
    x = Functional.avg_pool1d(x, x.shape[-1])
    x = x.permute(0, 2, 1)
    return (self.emotion_classifier(x),
            self.emotion_intensity_classification(x),
            self.gender(x)
            )
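
To see how the `features` stack shrinks the time axis, the output length of each layer can be computed by hand. This is a sketch that assumes no padding and no dilation, which matches the layer definitions above; the input length of 36000 frames (3 seconds at 12000 Hz) is a hypothetical example.

```python
def out_length(length, kernel_size, stride=1):
    """Output length of an unpadded, undilated 1-D convolution or pooling layer."""
    return (length - kernel_size) // stride + 1


def feature_length(num_frames, first_stride=16):
    """Length of the time axis after the four conv/pool stages."""
    length = out_length(num_frames, kernel_size=80, stride=first_stride)
    length = out_length(length, kernel_size=4, stride=4)      # MaxPool1d(4)
    for _ in range(3):
        length = out_length(length, kernel_size=3)            # Conv1d(k=3)
        length = out_length(length, kernel_size=4, stride=4)  # MaxPool1d(4)
    return length


print(feature_length(36000))  # 8
```

Whatever length is left, the `avg_pool1d` call in `forward` averages over the whole remaining time axis, so each classifier head always receives a vector of `2 * n_channel` features per example.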



### Model instance
 We can now create the model instance for our audio classifier model.

In [26]:
model = AudioClassifier().to(device)
model

AudioClassifier(
  (features): Sequential(
    (0): Conv1d(1, 32, kernel_size=(80,), stride=(16,))
    (1): BatchNorm1d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): ReLU()
    (3): MaxPool1d(kernel_size=4, stride=4, padding=0, dilation=1, ceil_mode=False)
    (4): Conv1d(32, 32, kernel_size=(3,), stride=(1,))
    (5): BatchNorm1d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (6): ReLU()
    (7): MaxPool1d(kernel_size=4, stride=4, padding=0, dilation=1, ceil_mode=False)
    (8): Conv1d(32, 64, kernel_size=(3,), stride=(1,))
    (9): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (10): ReLU()
    (11): MaxPool1d(kernel_size=4, stride=4, padding=0, dilation=1, ceil_mode=False)
    (12): Conv1d(64, 64, kernel_size=(3,), stride=(1,))
    (13): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (14): ReLU()
    (15): MaxPool1d(kernel_size=4, stride=4, padding=0, dil

### Counting model parameters

In [27]:
def count_trainable_params(model):
  return sum(p.numel() for p in model.parameters()), sum(p.numel() for p in model.parameters() if p.requires_grad)

n_params, trainable_params = count_trainable_params(model)
print(f"Total number of parameters: {n_params:,}\nTotal trainable parameters: {trainable_params:,}")

Total number of parameters: 37,770
Total trainable parameters: 37,770
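
The reported total can be cross-checked by hand from the layer definitions. This sketch uses the default arguments of `AudioClassifier` and the standard parameter counts for each layer type; the helper names are made up for illustration.

```python
def conv1d_params(c_in, c_out, kernel_size):
    return c_in * c_out * kernel_size + c_out   # weights + biases


def batchnorm_params(channels):
    return 2 * channels                         # learnable scale and shift


def linear_params(n_in, n_out):
    return n_in * n_out + n_out                 # weights + biases


n = 32  # n_channel
features = (
    conv1d_params(1, n, 80) + batchnorm_params(n)
    + conv1d_params(n, n, 3) + batchnorm_params(n)
    + conv1d_params(n, 2 * n, 3) + batchnorm_params(2 * n)
    + conv1d_params(2 * n, 2 * n, 3) + batchnorm_params(2 * n)
)


def head_params(n_out):
    """One classifier head: Linear(2n, 64) followed by Linear(64, n_out)."""
    return linear_params(2 * n, 64) + linear_params(64, n_out)


total = features + head_params(8) + head_params(1) + head_params(1)
print(f"{total:,}")  # 37,770
```

The convolutional stack accounts for 24,640 parameters and the three heads for the remaining 13,130.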


### Optimizer

For the optimizer we are going to use the `Adam` optimizer with default parameters.

### Criterions

We are going to have `3` criterions which will be:

1. `emotion_classifier_criterion`
* This will be a `CrossEntropyLoss()` function since we have 8 class labels.

2. `emotion_intensity_classifier_criterion`
* This will be a `BCEWithLogitsLoss()` since this is a binary classification and the model outputs raw logits (there is no sigmoid in the last layer).

3. `gender_classifier_criterion`
* This will be a `BCEWithLogitsLoss()` since this is a binary classification and the model outputs raw logits (there is no sigmoid in the last layer).


In [28]:
optimizer = torch.optim.Adam(model.parameters())

emotion_classifier_criterion = nn.CrossEntropyLoss().to(device)
gender_classifier_criterion = nn.BCEWithLogitsLoss().to(device)
emotion_intensity_classifier_criterion = nn.BCEWithLogitsLoss().to(device)
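
`BCEWithLogitsLoss` combines a sigmoid with binary cross-entropy, which is why the model's binary heads can output raw logits. The plain-Python sketch below illustrates that equivalence for single values; it shows a commonly used numerically stable formulation and is not meant as a description of PyTorch's internals.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bce(p, y):
    """Binary cross-entropy for a probability p in (0, 1) and a target y."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def bce_with_logits(z, y):
    """The same loss computed directly from the logit z in a stable form."""
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))


# Made-up (logit, target) pairs.
for z, y in [(2.0, 1.0), (-1.5, 0.0), (0.3, 1.0)]:
    print(round(bce(sigmoid(z), y), 6), round(bce_with_logits(z, y), 6))
```

Both columns agree, so applying a sigmoid followed by plain binary cross-entropy gives the same value as working on the logits directly.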

### Accuracy functions

We are going to have two accuracy functions.

1. binary_accuracy

This will calculate the binary accuracy of the predicted labels against the actual labels for gender and emotion intensity.

2. categorical_accuracy

This will calculate the categorical accuracy of the predicted labels against the actual labels for emotion.


In [29]:
def binary_accuracy(y_preds, y_true):
  #round predictions to the closest integer
  rounded_preds = torch.round(torch.sigmoid(y_preds))
  correct = (rounded_preds == y_true).float() #convert into float for division 
  acc = correct.sum() / len(correct)
  return acc

def categorical_accuracy(preds, y):
    top_pred = preds.argmax(1, keepdim = True)
    correct = top_pred.eq(y.view_as(top_pred)).sum()
    acc = correct.float() / y.shape[0]
    return acc
    

### Train and evaluation Functions

Our train and evaluation functions will be defined as follows:

1. train

* This is a function that takes in the `model`, `iterator`, `optimizer`, `criterion1`, `criterion2` and `criterion3`, and returns the training loss and the training accuracy averaged over the three labels.

* First we put the model in train mode by calling `model.train()`.

* We then iterate over the iterator and move the features and labels to the device.

* We reset the gradients by calling the `optimizer.zero_grad()` function.

* We make the predictions and calculate the loss and accuracy for each batch, then sum the three losses, backpropagate and update the weights with `optimizer.step()`.

* After the loop we return the mean loss for the epoch and the accuracy averaged over the three labels.

2. evaluate

* This is a function that takes in the model, the iterator and the three criterions, and returns the validation loss and the validation accuracy averaged over the three labels.
* We call `model.eval()` so that the model is in evaluation mode.
* We don't need to compute gradients during evaluation, so we wrap our iteration in a `torch.no_grad()` context.
* We then iterate over the iterator and move the features and labels to the device.
* We make the predictions and calculate the loss and accuracy for each batch.
* After the loop we return the mean loss for the epoch and the accuracy averaged over the three labels (emotion, emotion_intensity, gender).


In [30]:
def train(model, iterator, optimizer, criterion1, criterion2, criterion3):
  emotion_epoch_acc = gender_epoch_acc = emotion_intensity_epoch_acc = 0
  epoch_loss = 0

  model.train()
  for X, e_y, ei_y, g_y in iterator:
    X = X.to(device)
    e_y = e_y.long().to(device)
    ei_y = ei_y.type(torch.FloatTensor).to(device)
    g_y = g_y.type(torch.FloatTensor).to(device)
    optimizer.zero_grad()
    emotion_preds, emotion_intensity_preds, gender_preds = model(X)

    emotion_loss = criterion1(emotion_preds.squeeze(), e_y)
    emotion_acc = categorical_accuracy(emotion_preds.squeeze(), e_y)

    emotion_intensity_loss = criterion2(emotion_intensity_preds.squeeze(), ei_y)
    emotion_intensity_acc = binary_accuracy(emotion_intensity_preds.squeeze().squeeze(), ei_y)

    gender_loss = criterion3(gender_preds.squeeze(), g_y)
    gender_acc = binary_accuracy(gender_preds.squeeze(), g_y)
    loss = gender_loss +  emotion_loss + emotion_intensity_loss
    # back propagate
    loss.backward()

    optimizer.step()
    epoch_loss += loss.item()
    emotion_epoch_acc += emotion_acc.item()
    gender_epoch_acc += gender_acc.item()
    emotion_intensity_epoch_acc += emotion_intensity_acc.item()

  acc = (emotion_intensity_epoch_acc/len(iterator) + gender_epoch_acc/len(iterator) + emotion_epoch_acc/len(iterator) )/3
  return epoch_loss / len(iterator), acc


def evaluate(model, iterator, criterion1, criterion2, criterion3):
  emotion_epoch_acc = gender_epoch_acc = emotion_intensity_epoch_acc = 0
  epoch_loss = 0
  model.eval()
  with torch.no_grad():
    for X, e_y, ei_y, g_y in iterator:
      X = X.to(device)
      e_y = e_y.long().to(device)
      ei_y = ei_y.type(torch.FloatTensor).to(device)
      g_y = g_y.type(torch.FloatTensor).to(device)
      emotion_preds, emotion_intensity_preds, gender_preds = model(X)

      emotion_loss = criterion1(emotion_preds.squeeze(), e_y)
      emotion_acc = categorical_accuracy(emotion_preds.squeeze(), e_y)

      emotion_intensity_loss = criterion2(emotion_intensity_preds.squeeze(), ei_y)
      emotion_intensity_acc = binary_accuracy(emotion_intensity_preds.squeeze().squeeze(), ei_y)

      gender_loss = criterion3(gender_preds.squeeze(), g_y)
      gender_acc = binary_accuracy(gender_preds.squeeze(), g_y)
      loss = gender_loss +  emotion_loss + emotion_intensity_loss

      epoch_loss += loss.item()
      emotion_epoch_acc += emotion_acc.item()
      gender_epoch_acc += gender_acc.item()
      emotion_intensity_epoch_acc += emotion_intensity_acc.item()

    acc = (emotion_intensity_epoch_acc/len(iterator) + gender_epoch_acc/len(iterator) + emotion_epoch_acc/len(iterator) )/3
  return epoch_loss / len(iterator), acc
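
One detail of these functions is worth keeping in mind: they average per-batch accuracies, so a short final batch carries the same weight as a full one. The sketch below uses the training-set size and batch size from this notebook, but the per-batch results are made up purely to show the effect.

```python
import math

n_examples, batch_size = 1291, 16
n_batches = math.ceil(n_examples / batch_size)             # 81
last_batch = n_examples - (n_batches - 1) * batch_size     # 11

# Hypothetical results: every full batch gets 12 of 16 correct,
# the final short batch gets all 11 correct.
correct = [12] * (n_batches - 1) + [last_batch]
sizes = [batch_size] * (n_batches - 1) + [last_batch]

mean_of_batch_means = sum(c / s for c, s in zip(correct, sizes)) / n_batches
per_example = sum(correct) / sum(sizes)

print(n_batches, last_batch)
print(round(mean_of_batch_means, 4), round(per_example, 4))
```

The two numbers differ only slightly here, which suggests the per-batch average is a reasonable approximation for this dataset. Weighting each batch by its size would give the exact per-example accuracy.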

### Training

We will create two helper functions that will help us to visualize our training.

1. hms_string


In [31]:
def hms_string(sec_elapsed):
  h = int(sec_elapsed / (60 * 60))
  m = int((sec_elapsed % (60 * 60)) / 60)
  s = sec_elapsed % 60
  return "{}:{:>02}:{:>05.2f}".format(h, m, s)

2. visualize_training

In [32]:
def visualize_training(start, end, train_loss, train_accuracy, val_loss, val_accuracy, title):
  data = [
       ["Training", f'{train_loss:.3f}', f'{train_accuracy:.3f}', f"{hms_string(end - start)}" ],
       ["Validation", f'{val_loss:.3f}', f'{val_accuracy:.3f}', "" ],       
  ]
  table = PrettyTable(["CATEGORY", "LOSS", "ACCURACY", "ETA"])
  table.align["CATEGORY"] = 'l'
  table.align["LOSS"] = 'r'
  table.align["ACCURACY"] = 'r'
  table.align["ETA"] = 'r'
  table.title = title
  for row in data:
    table.add_row(row)
  print(table)

In [None]:
N_EPOCHS = 15
MODEL_NAME = "audio-classifier.pt"
best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):
  start = time.time()
  train_loss, train_acc = train(model, 
                                train_loader, optimizer, 
                                emotion_classifier_criterion,
                                emotion_intensity_classifier_criterion,
                                gender_classifier_criterion
                                )
  
  val_loss, val_acc = evaluate(model, 
                                    test_loader, emotion_classifier_criterion,
                                emotion_intensity_classifier_criterion,
                                gender_classifier_criterion)
  title = f"EPOCH: {epoch+1:02}/{N_EPOCHS:02} {'saving best model...' if val_loss < best_valid_loss else 'not saving...'}"
  if val_loss < best_valid_loss:
      best_valid_loss = val_loss
      torch.save(model.state_dict(), MODEL_NAME)
  end = time.time()
  visualize_training(start, end, train_loss, train_acc, val_loss, val_acc, title)

### Downloading the model
Now we can download the saved model weights from Google Colab to our computer in the following code cell:

In [34]:
from google.colab import files
files.download(MODEL_NAME)


### Model Inference

Now we can make predictions using our model. To make predictions we need to preprocess the data in the same way as we did during data preparation for training. We are going to do the following:

1. add a batch dimension to the single waveform (with only one sequence, no actual padding is needed)
2. downsample the waveform with the same `Resample` transform used during training

In [35]:
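def add_batch_dim(waveform):
  # A single waveform has shape (channels, frames). Wrapping it in a list and
  # padding yields shape (1, channels, frames), i.e. a batch of one.
  # This helper has its own name so that it does not overwrite the
  # `pad_sequence` function that the training `collate_fn` relies on.
  return torch.nn.utils.rnn.pad_sequence([waveform], batch_first=True, padding_value=0.)

def preprocess(waveform):
  waveform = add_batch_dim(waveform)
  return transform(waveform)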


### Predictions

Our `predict_label` function takes in the model and a waveform as arguments and returns a prediction dictionary that looks as follows:

```json
{
  "emotion": {
      "class": "angry",
      "label": 4,
      "probability": 0.67
    },
 "emotion_intensity": {
    "class": "strong", 
    "label": 1, 
    "probability": 1.0
   },
 "gender": {
    "class": "female", 
    "label": 1, 
    "probability": 1.0
   }
 }
```
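
The `probability` and `label` fields are derived from the raw model outputs: a softmax over the eight emotion logits, and a sigmoid over each binary logit. Here is a plain-Python sketch of that post-processing; the logit values are invented for illustration.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    m = max(logits)                               # subtract max for stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


emotions = ["neutral", "calm", "happy", "sad",
            "angry", "fearful", "disgust", "surprised"]
genders = ["male", "female"]

# Hypothetical raw outputs of the model for one clip.
emotion_logits = [0.1, -0.3, 0.2, 0.0, 2.5, 0.4, 1.1, -0.8]
gender_logit = 3.0

probs = softmax(emotion_logits)
emotion_label = probs.index(max(probs))           # position of the largest probability

p_female = sigmoid(gender_logit)                  # probability of class 1
gender_label = 1 if p_female >= 0.5 else 0
# Report the probability of whichever class was predicted.
gender_prob = p_female if gender_label == 1 else 1 - p_female

print(emotions[emotion_label], round(probs[emotion_label], 2))
print(genders[gender_label], round(gender_prob, 2))
```

For a binary head the reported probability is always at least 0.5, because it is the probability of the predicted class rather than of class 1.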

In [53]:
def predict_label(model, waveform):
  processed = preprocess(waveform).to(device)
  model.eval()
  with torch.no_grad():
    emotion_preds, emotion_intensity_preds, gender_preds = model(processed)

    emotion_pred = torch.softmax(emotion_preds.squeeze(), dim=0)
    emotion_intensity_pred = torch.sigmoid(emotion_intensity_preds.squeeze())
    gender_pred = torch.sigmoid(gender_preds.squeeze())
    
    emotion_intensity_prob = float(emotion_intensity_pred.item()) if emotion_intensity_pred.item() > .5 else float(1 - emotion_intensity_pred.item())
    gender_pred_prob = float(gender_pred.item()) if gender_pred.item() > .5 else float(1 - gender_pred.item())
    emotion_prob = torch.max(emotion_pred).item()
    
    gender_label = 1 if gender_pred.item() >= 0.5 else 0
    emotion_intensity_label = 1 if emotion_intensity_pred.item() >= 0.5 else 0
    emotion_label = torch.argmax(emotion_pred, dim=0).item()
    

    pred =  {
       "emotion": {
         'label': emotion_label,
         'class': emotions[emotion_label],
         'probability':round(emotion_prob, 2),
      },
       "emotion_intensity": {
         'label': emotion_intensity_label,
         'class': emotion_intensities[emotion_intensity_label],
         'probability':round(emotion_intensity_prob, 2),
      },
      "gender": {
         'label': gender_label,
         'class': genders[gender_label],
         'probability':round(gender_pred_prob, 2),
      },
        
    }
    return pred
predict_label(model, X_train[0][0])


{'emotion': {'class': 'angry', 'label': 4, 'probability': 0.67},
 'emotion_intensity': {'class': 'strong', 'label': 1, 'probability': 1.0},
 'gender': {'class': 'female', 'label': 1, 'probability': 1.0}}

### Making predictions

In [57]:
emotions

['neutral', 'calm', 'happy', 'sad', 'angry', 'fearful', 'disgust', 'surprised']

In [60]:
print("emotion: ", y_train_emotions[0])
print("emotion intensity: ", y_train_emotions_intensities[0])
print("gender: ", y_train_genders[0])

play_audio(X_train[0][0], sample_rate)

emotion:  4
emotion intensity:  1
gender:  1


In [61]:
print(predict_label(model, X_train[1][0]))

print("emotion: ", y_train_emotions[1])
print("emotion intensity: ", y_train_emotions_intensities[1])
print("gender: ", y_train_genders[1])

play_audio(X_train[1][0], sample_rate)

{'emotion': {'label': 6, 'class': 'disgust', 'probability': 0.77}, 'emotion_intensity': {'label': 1, 'class': 'strong', 'probability': 0.98}, 'gender': {'label': 0, 'class': 'male', 'probability': 1.0}}
emotion:  3
emotion intensity:  0
gender:  0


In [63]:
print(predict_label(model, X_train[-1][0]))

print("emotion: ", y_train_emotions[-1])
print("emotion intensity: ", y_train_emotions_intensities[-1])
print("gender: ", y_train_genders[-1])

play_audio(X_train[-1][0], sample_rate)

{'emotion': {'label': 5, 'class': 'fearful', 'probability': 1.0}, 'emotion_intensity': {'label': 1, 'class': 'strong', 'probability': 1.0}, 'gender': {'label': 1, 'class': 'female', 'probability': 1.0}}
emotion:  5
emotion intensity:  1
gender:  1


In [64]:
print(predict_label(model, X_test[-1][0]))

print("emotion: ", y_test_emotions[-1])
print("emotion intensity: ", y_test_emotions_intensities[-1])
print("gender: ", y_test_genders[-1])

play_audio(X_test[-1][0], sample_rate)

{'emotion': {'label': 6, 'class': 'disgust', 'probability': 0.57}, 'emotion_intensity': {'label': 1, 'class': 'strong', 'probability': 1.0}, 'gender': {'label': 1, 'class': 'female', 'probability': 0.99}}
emotion:  4
emotion intensity:  0
gender:  1
