## Task 4 (Bonus Task). Emotion Speech Recognition using Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM)

### Objective

**This exercise asks you to perform deep-learning-based speech emotion recognition using the given model architecture on the given data**.
In general terms, you are asked to predict the emotion of the speaker from speech waveforms of arbitrary length.
In this task, you need to use the Convolutional Neural Network (CNN) that we used in the bonus task of Exercise 1, but with 1-dimensional filters.
In addition, you need to use a deep learning module for sequential data, the [Long Short-Term Memory (LSTM)](https://colah.github.io/posts/2015-08-Understanding-LSTMs/).
The 1-d CNN is used here to extract local, short-time features from the speech frames,
and the LSTM is used to combine these short-time features into a global, long-term feature.
The network architecture is end-to-end: it receives the sequence of waveform frames as input and directly outputs the emotion prediction. The philosophy of 'short-time analysis' therefore still applies here, and you need to split each speech waveform into a sequence of equal-length frames and feed them to the network.
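
To make the framing idea concrete, here is a minimal sketch of how one waveform could be cut into equal-length, non-overlapping frames. The helper name `frame_waveform`, the 16 kHz sampling rate and the random stand-in signal are assumptions for illustration only; they are not part of the provided code or the TESS data.

```python
import numpy as np

def frame_waveform(y, sr, win_ms=20.0):
    """Cut a 1-D waveform into non-overlapping frames of win_ms milliseconds."""
    frame_len = int(round(sr * win_ms / 1000.0))   # samples per frame (D)
    n_frames = int(np.ceil(len(y) / frame_len))    # frames needed to cover y (T)
    padded = np.zeros(n_frames * frame_len, dtype=np.float64)
    padded[:len(y)] = y                            # zero-pad the tail of the last frame
    return padded.reshape(n_frames, frame_len)     # shape: T x D

# Synthetic stand-in for a loaded utterance: 16480 samples at an assumed 16 kHz.
sr = 16000
y = np.random.default_rng(0).standard_normal(16480)
frames = frame_waveform(y, sr)
print(frames.shape)  # (52, 320)
```

Each row is one frame of `D` samples, so an utterance becomes a `T x D` array whose number of rows depends on its duration.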

In this part, the dataset to be used is a subset of the [*Toronto emotional speech set (TESS)*](https://tspace.library.utoronto.ca/handle/1807/24487). The TESS dataset contains utterances spoken under 7 categories of emotion by two actresses (a younger and an older actress). In this exercise, we select two categories of data (happy and sad) spoken by the younger actress. In total there are 400 samples (200 samples for each class). The **training and evaluation protocol is that 70% of the data is selected as the training data, and the rest is the test data**.


Similar to the bonus task of Exercise 1, we provide the code that defines the network architecture in PyTorch, and you will need to invoke it in your training and evaluation code. You are also encouraged to implement your own network, and any software framework is accepted. The network architecture is described in the figure below.






### Suggested procedures

We provide the following procedure to help you complete this exercise, but you are free to reach the exercise goal with your own implementation.

1. Load and normalize the speech data by subtracting the mean and dividing by the standard deviation, both calculated from the whole dataset. To load the speech data you can use, for example, [librosa.load()](https://librosa.github.io/librosa/generated/librosa.core.load.html#librosa-core-load); you can also use other tools if you like.

2. Segment the speech waveforms into frames in order to perform short-time analysis. For example, you can use a window size of 20-40 ms, with or without overlap.

3. Split the dataset into a training set and a testing set; the testing set will be 30% of the whole dataset.

4. Initialize the network and perform batch training.

5. Evaluate the trained model.
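
Step 1 above can be sketched as follows. This is a minimal illustration on toy arrays, assuming the waveforms are already loaded as NumPy arrays; the helper names `dataset_mean_std` and `normalize` are hypothetical. It uses the population standard deviation, which for a dataset with many samples is practically identical to the sample estimate.

```python
import numpy as np

def dataset_mean_std(waveforms):
    """Scalar mean and standard deviation over every sample of every waveform."""
    total = sum(len(w) for w in waveforms)
    mean = sum(float(np.sum(w)) for w in waveforms) / total
    var = sum(float(np.sum((np.asarray(w) - mean) ** 2)) for w in waveforms) / total
    return mean, float(np.sqrt(var))

def normalize(waveforms, mean, std):
    """Apply the same global mean and standard deviation to each waveform."""
    return [(np.asarray(w, dtype=np.float64) - mean) / std for w in waveforms]

# Toy stand-ins for two loaded utterances of different lengths.
waveforms = [np.array([1.0, 2.0, 3.0]), np.array([4.0, 5.0])]
mean, std = dataset_mean_std(waveforms)
normed = normalize(waveforms, mean, std)

all_samples = np.concatenate(normed)
print(np.isclose(all_samples.mean(), 0.0), np.isclose(all_samples.std(), 1.0))  # True True
```

Accumulating sums per waveform avoids building one very long Python list of all samples, which can be slow for several hundred recordings.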



### Code snippet of the network architecture

Below we provide the network architecture definition written in PyTorch. Please read it carefully and use it in your experiments.

#### Code snippet of the LSTM cell



In [1]:

import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np

from torch.autograd import Variable

class LstmCell(nn.Module):
    
    def __init__(self, inDim, outDim):
        
        super().__init__()
        
        self.outDim = outDim
        self.inDim = inDim
        
        
        # Define the gate operations of the LSTM; each gate is implemented by a fully connected layer.
        # The input dimension of each gate is the sum of the input and output dimensions,
        # because each gate operates on the current input and the previous output at the same time.
        
        # For example, at time step t the inputs are x_t and h_{t-1}.
        
        # For the input gate (before the sigmoid): I_t = Wx * x_t + Wh * h_{t-1} + b
        # So a single linear operation suffices: [x_t h_{t-1}] * [Wx;  Wh] + b
        
        self._i = nn.Linear(inDim + outDim, outDim)
        self._f = nn.Linear(inDim + outDim, outDim)
        self._o = nn.Linear(inDim + outDim, outDim)
        self._g = nn.Linear(inDim + outDim, outDim)
    
    
    def forward(self, x, currentStates):
        
        
        # Receive the cell state and output of the previous time step
        currentH, currentC = currentStates
        
        # Concatenate the previous output and the current input in preparation for the gate operations
        combined = torch.cat([x, currentH], dim=1)
        
        # Perform the gate operations
        cc_i = self._i(combined)
        cc_f = self._f(combined)
        cc_o = self._o(combined)
        cc_g = self._g(combined)

        i = torch.sigmoid(cc_i)
        f = torch.sigmoid(cc_f)
        o = torch.sigmoid(cc_o)
        g = torch.tanh(cc_g)
        
        
        # Update the cell state
        nextC = f * currentC + i * g
        
        # Calculate the output
        nextH = o * torch.tanh(nextC)
        
        
        return nextH, nextC
    
    
    def init_hidden(self, batchSize):
        
        # Initialize the variables, cell states and output states at t=0
        return (Variable(torch.zeros(batchSize, self.outDim)).double(),
                Variable(torch.zeros(batchSize, self.outDim)).double())
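
The comments in the cell above state that one linear layer on the concatenated vector `[x_t h_{t-1}]` is equivalent to applying separate weights `Wx` and `Wh`. The following small NumPy check illustrates that identity; the dimensions and random values are arbitrary assumptions chosen for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, in_dim, out_dim = 4, 6, 3

x = rng.standard_normal((batch, in_dim))     # x_t
h = rng.standard_normal((batch, out_dim))    # h_{t-1}
Wx = rng.standard_normal((in_dim, out_dim))
Wh = rng.standard_normal((out_dim, out_dim))
b = rng.standard_normal(out_dim)

# Two separate products, as in the textbook gate equation.
separate = x @ Wx + h @ Wh + b

# One product on the concatenated input with vertically stacked weights.
combined = np.concatenate([x, h], axis=1) @ np.vstack([Wx, Wh]) + b

print(np.allclose(separate, combined))  # True
```

Note that `nn.Linear` stores its weight with shape `(out_features, in_features)` and multiplies by the transpose, so the stacked matrix here corresponds to the transposed layer weight.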

        



    


#### Code snippet of the LSTM layer

In [2]:

class LstmLayer(nn.Module):
    
    def __init__(self, inDim, outDim):
        super().__init__()
        
        
        
        self.lstmCell = LstmCell(inDim=inDim, outDim=outDim)
    
    def forward(self, x, sequenceLengths, returnLast=False, hc=None):
        N, T, D = x.shape
        
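        # An LSTM layer handles the operations from t=0 to T,
        # so the hidden and cell states have to be initialized as zeros
        if hc is not None:
            raise NotImplementedError()
        else:
            hc = self._init_hidden(batchSize=N)
            # Keep the initial states on the same device and dtype as the input
            hc = tuple(s.to(device=x.device, dtype=x.dtype) for s in hc)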
            
        h, c = hc
        
        outH = []
        outC = []

        # Iterate for each timestep.
        for t in range(T):
            h, c = self.lstmCell(x=x[:, t, :], currentStates=[h,c])
            
            outH.append(h)
            outC.append(c)
            
        outH = torch.stack(outH, dim=1)
        outC = torch.stack(outC, dim=1)
        
        if returnLast:
            newOutH = []
            newOutC = []
            
            # Since the input sequences have different lengths and some of them are zero-padded,
            # the last output has to be picked according to the length of each input sequence
            for sampleIte in range(N):
                newOutH.append(outH[sampleIte, sequenceLengths[sampleIte] - 1, :])
                newOutC.append(outC[sampleIte, sequenceLengths[sampleIte] - 1, :])
                
            newOutH = torch.stack(newOutH, dim=0)
            newOutC = torch.stack(newOutC, dim=0)
            return newOutH, newOutC    
            
        else:
            return outH, outC
    
    def _init_hidden(self, batchSize):
        return self.lstmCell.init_hidden(batchSize=batchSize)
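
The `returnLast` branch above picks, for every sample, the output at its last valid (unpadded) time step. As an illustration only, the same selection can be written without a Python loop using integer array indexing. NumPy is used here with made-up shapes; the same indexing pattern also works on PyTorch tensors.

```python
import numpy as np

N, T, H = 3, 4, 2
# Stand-in for the stacked LSTM outputs; each value encodes (sample, time step).
outH = np.arange(N * T * H, dtype=np.float64).reshape(N, T, H)
lengths = np.array([3, 4, 2])   # number of valid frames per sample

# Loop version, mirroring the layer above.
looped = np.stack([outH[n, lengths[n] - 1, :] for n in range(N)], axis=0)

# Vectorized version: pair each sample index with its last valid time index.
vectorized = outH[np.arange(N), lengths - 1]

print(vectorized.shape)                    # (3, 2)
print(np.array_equal(looped, vectorized))  # True
```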
    

#### Code snippet of the whole network architecture
In order to use the provided network architecture, please read the comment of the forward() function.
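
Before instantiating the network, it can help to know how many features per frame reach the first LSTM layer. The constructor in the following cell derives this number with the standard output-length formula for 1-d convolution and pooling layers (no padding, stride 2). The sketch below traces the same arithmetic in plain Python; the helper names and the example frame length of 500 samples are assumptions for illustration.

```python
def conv1d_out_len(length, kernel_size, stride, padding=0, dilation=1):
    """Output length of a 1-D convolution or pooling layer (floor division)."""
    return (length + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1

def lstm_input_dim(frame_len, conv_layers=3, channels=64, kernel_size=5):
    """Trace a frame of frame_len samples through the pooling/conv stack."""
    length = conv1d_out_len(frame_len, kernel_size=2, stride=2)     # average pooling
    for _ in range(conv_layers):
        length = conv1d_out_len(length, kernel_size, stride=2)      # convolution
        length = conv1d_out_len(length, kernel_size=2, stride=2)    # max pooling
    return channels * length                                        # flattened features

# 500 -> 250 -> 123 -> 61 -> 29 -> 14 -> 5 -> 2 positions, times 64 channels
print(lstm_input_dim(500))  # 128
```

The length roughly halves at every stage, so very short frames can shrink to zero positions; a frame length should be chosen so that the result stays positive.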

In [3]:

class AudioNet(nn.Module):

    def __init__(self, numClass=2, inDim=0, convLayerNum=3, convOutNum=64, convKernelSize=5, lstmLayerNum=2):
        super().__init__()
        
        self.convLayerNum = convLayerNum
        self.convOutNum = convOutNum
        self.lstmLayerNum=lstmLayerNum
        self.convKernelSize = convKernelSize
        self.inDim = inDim
        
        
        self.currentDim = self.inDim
        self.avgPool = nn.AvgPool1d(kernel_size=2,
                                    stride=2)
        
        self.currentDim = self.currentDim / 2
        self.conv1d_1 = nn.Conv1d(in_channels=1,
                                  out_channels=64,
                                  kernel_size=self.convKernelSize,
                                  stride=2)
        
        # (self.inDim/2 + 2 x 0 - 1 x (5 - 1) - 1) / 2 + 1
        # (488 - 4)/2 + 1  
        self.currentDim = int((self.currentDim + 2 * 0 - 1 * (self.convKernelSize - 1) - 1)/2 + 1) 
        
        self.maxPool = nn.MaxPool1d(kernel_size=2, stride=2)
        
        self.currentDim = int(self.currentDim / 2)
        
        self.conv1d_2 = nn.Conv1d(in_channels=64,
                                  out_channels=64,
                                  kernel_size=self.convKernelSize,
                                  stride=2)
        for i in range(self.convLayerNum - 1):
            self.currentDim = int((self.currentDim + 2 * 0 - 1 * (self.convKernelSize - 1) - 1)/2 + 1) 
            self.currentDim = int(self.currentDim/2)
        
        self.currentDim = convOutNum * self.currentDim
        self.lstm1 = LstmLayer(self.currentDim, 64)
        self.lstm2 = LstmLayer(64, 64)
        self.fc = nn.Linear(convOutNum, numClass)
                    
    def forward(self, x, sequenceLengths):
        '''
        input size:
            x: N x T x D
            sequenceLengths: N x 1
            
            
        Example of input:
            If you have three sequences of speech frames with different lengths:
                Frame11 Frame12 Frame13    0     
                Frame21 Frame22 Frame23 Frame24
                Frame31 Frame32    0       0
            
            Then you need to pad the sequences with empty frames so that they have equal length in time, and put them in a tensor;
            as a result the tensor will have the size 3 x 4 x D
            
            Besides, you also need to provide a vector which contains their lengths; in this example it will be:
                [3
                 4
                 2]
        
        '''
        
        N, T, D = x.shape
        x = torch.reshape(x, (N * T, 1,  D))
        
        # Applying an initial downsampling
        x1 = self.avgPool(x)
        # Applying an initial convolution
        x2 = self.conv1d_1(x1)    
        x2 = F.relu(x2)
        x3 = self.maxPool(x2)
        

        # Apply several layers of CNN
        for convLayerIte in range(self.convLayerNum - 1):
            
            x3 = self.conv1d_2(x3)
            x3 = F.relu(x3)
            x3 = self.maxPool(x3)
        
        x4 = torch.reshape(x3, (N, T, -1))
        
        # Apply two LSTM layers
        x4, _ = self.lstm1(x=x4, sequenceLengths=sequenceLengths, returnLast=False, hc=None)
        x5, _ = self.lstm2(x=x4, sequenceLengths=sequenceLengths, returnLast=True, hc=None)
                
        
        x6 = F.relu(x5)
        return self.fc(x6)
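
The input layout described in the `forward()` docstring can be produced with a small padding helper. The following is a minimal sketch under the assumption that every utterance is already a `T_i x D` NumPy array of frames; the function name `pad_sequences` is hypothetical.

```python
import numpy as np

def pad_sequences(frame_seqs):
    """Stack variable-length (T_i x D) frame arrays into one N x T x D array."""
    lengths = np.array([len(s) for s in frame_seqs], dtype=np.int64)
    n, t_max, d = len(frame_seqs), int(lengths.max()), frame_seqs[0].shape[1]
    batch = np.zeros((n, t_max, d), dtype=np.float64)
    for i, seq in enumerate(frame_seqs):
        batch[i, :len(seq)] = seq          # remaining time steps stay zero
    return batch, lengths

# Three utterances with 3, 4 and 2 frames of D = 5 samples each.
D = 5
seqs = [np.ones((3, D)), np.ones((4, D)), np.ones((2, D))]
batch, lengths = pad_sequences(seqs)
print(batch.shape)   # (3, 4, 5)
print(lengths)       # [3 4 2]
```

Converting `batch` with `torch.from_numpy` yields a double-precision tensor, which matches a model that has been cast with `.double()`.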
    


### Your implementation
Please write your code below to complete the exercise


#### Load and normalize the speech data

1. Get the sampling rate of the raw data. All waveforms are recorded with the same sampling rate, so you can get it by reading only one speech sample. The function [librosa.load()](https://librosa.github.io/librosa/generated/librosa.core.load.html#librosa-core-load) returns the sampling rate; please read the manual.

2. Normalize the speech data by subtracting the *mean* and dividing by the *standard deviation* so that the data has zero mean and unit variance. You therefore need to calculate the *mean* and *standard deviation* from the whole dataset. Since speech waveforms are 1-d time series, the *mean* and *standard deviation* are scalars.

3. Segment each speech waveform into a sequence of equal-length frames. You can make frames with or without overlap according to your needs. Please note that you may need to pad zeros to the end of a speech waveform so that the last frame has the same length as the others.

4. Split the dataset into a training set and a testing set. 70% of the data will be the training data, and 30% will be the testing data. The same proportion is applied to each class.
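
Step 4 asks for the same 70/30 proportion in each class. One possible way to do this is a per-class shuffle and cut, sketched below in plain Python. The function name `stratified_split` and the fixed seed are assumptions; the label list mimics the 200 + 200 samples described earlier.

```python
import random

def stratified_split(labels, train_fraction=0.7, seed=0):
    """Return (train_idx, test_idx) with the same split ratio in every class."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(len(indices) * train_fraction)
        train_idx += indices[:cut]
        test_idx += indices[cut:]

    rng.shuffle(train_idx)   # mix the classes so that batches are not single-class
    return train_idx, test_idx

labels = [0] * 200 + [1] * 200
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))  # 280 120
```

Keeping the index lists, rather than only the split data, makes it easy to select the matching sequence lengths and labels for each subset later.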


In [4]:


# filename = 'BonusTaskData/YAF_happy/YAF_back_happy.wav'
# # filename = librosa.util.example_audio_file()

# # To preserve the native sampling rate of the file, use sr=None.
# y, sr = librosa.load(filename, sr=None)


In [5]:
import librosa
import math

#############################      Define device       ################################# 

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [6]:
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
import matplotlib.pyplot as plt
import numpy as np
import torch
import os
from torch import optim
from torchvision import datasets, transforms, models

#############################      Load data       #################################

data_dir = 'BonusTaskData'
dataset = []
original = []

Videos = os.listdir(data_dir)
maxlen = 50000

for folder in Videos:
    folderlist = os.listdir(data_dir+"/"+folder)
    for video in folderlist:
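        video_path = os.path.join(data_dir, folder, video)
        # Check the full path; a bare file name would be resolved against the working directory
        if not os.path.isdir(video_path):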
            y, sr = librosa.load(video_path, sr=None)
            y = y.tolist()
            if len(y) > maxlen:
                maxlen = len(y)
            original.append(y)
            dataset += y
            
sample_size = len(folderlist)            
mean = np.mean(dataset)
std = np.std(dataset, ddof = 1)
classes_number = len(Videos)

#############################     Normalize #################################


Normalize = []

for value in original:
        Normalize.append([ (x-mean)/std  for x in value])



In [8]:
SegmentList = []
SequenceLengths = []


#############################        Segment        #################################
# Use a window of about 20 ms; the frame size is rounded up to a multiple of 100 samples
frame_size = (int(sr * 20 /1000/100)+1)*100

Dimention = maxlen//frame_size+1

for series in Normalize:
    lent = len(series)//frame_size+1
    SequenceLengths.append(lent)
    frame = []
    for k in range(lent-1):
        frame.append(series[k*frame_size:(k+1)*frame_size])
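    # Last frame: take the final frame_size samples of the waveform, so it overlaps
    # the previous frame instead of being zero-padded (requires at least two frames)
    frame.append(series[k*frame_size+len(series)%frame_size:])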
    for i in range(Dimention - lent):
        frame.append([0.0]*frame_size)
    SegmentList.append(frame)

In [28]:
#############################     Split the dataset to training set and testing set      #################################
import random

classes1 = [0]*sample_size
classes2 = [1]*sample_size
classes = classes1 + classes2

split_size = 0.7
all_index = np.arange(sample_size*classes_number)
train_index = np.vstack((random.sample(range(0,sample_size), int(sample_size * split_size)),random.sample(range(sample_size,sample_size*classes_number), int(sample_size * split_size)))).flatten()
test_index = list(set(all_index)-set(train_index))


train_data = []
test_data = []
train_label = []
test_label = []

for index in train_index:
    train_data.append(SegmentList[index])
    train_label.append(classes[index])

for index in test_index:
    test_data.append(SegmentList[index])
    test_label.append(classes[index])


In [29]:
# transform data into Tensor(double)

inputs_train_dataset = torch.from_numpy(np.array(train_data))
labels_train_dataset = torch.LongTensor(np.array(train_label))

inputs_test_dataset = torch.from_numpy(np.array(test_data))
labels_test_dataset = torch.LongTensor(np.array(test_label))
length_dataset = torch.from_numpy(np.array(SequenceLengths))

#### Perform the network training.

Please write your code below for the network training. You are asked to perform batch-based training here. Please accumulate the loss and classification accuracy over each epoch and output them at the end of the epoch.
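
When the training samples are stored class by class, taking consecutive slices produces batches that contain only one class, which may slow down or stall learning in the first epochs. A common remedy is to draw a fresh random order in every epoch. The sketch below shows such a batch index generator; the name `iterate_minibatches` and the toy label array are assumptions for illustration.

```python
import numpy as np

def iterate_minibatches(n_samples, batch_size, rng):
    """Yield index arrays that cover 0..n_samples-1 once, in random order."""
    order = rng.permutation(n_samples)
    for start in range(0, n_samples, batch_size):
        yield order[start:start + batch_size]

rng = np.random.default_rng(0)
labels = np.array([0] * 140 + [1] * 140)   # class-ordered, like a per-class split

for epoch in range(2):
    seen = 0
    for batch_idx in iterate_minibatches(len(labels), batch_size=10, rng=rng):
        batch_labels = labels[batch_idx]   # index inputs and lengths the same way
        seen += len(batch_idx)
    print(epoch, seen)   # every sample is visited once per epoch
```

The same index array should be used to select the inputs, the labels and the sequence lengths of a batch so that the three stay aligned.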



In [34]:
#Initialize the network and optimizer
model = AudioNet(numClass=classes_number,inDim=frame_size).double()

#####Change model to cuda
if torch.cuda.is_available():
    model = model.cuda()

criterion = nn.CrossEntropyLoss()

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

trainingEpoch = 20



In [35]:
batch_size = 10
batch_loss = 0
train_losses, test_losses = [], []

for epoch in range(trainingEpoch):
    train_len = 0
    running_loss = 0
    running_corrects = 0
    max_index = int(sample_size * split_size* classes_number)
    for i in range(math.ceil(max_index/batch_size)):
        start_index = i*batch_size
        end_index = min((i+1)*batch_size,max_index)
        inputs = inputs_train_dataset[start_index:end_index]
        labels = labels_train_dataset[start_index:end_index]
        inputs, labels = inputs.to(device),labels.to(device)
        
        optimizer.zero_grad()
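        # Look up the lengths through train_index so that they belong to the samples in this batch
        batch_index = torch.as_tensor(np.asarray(train_index[start_index:end_index]), dtype=torch.long)
        outputs= model.forward(inputs,length_dataset[batch_index])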
        _, preds = torch.max(outputs, 1)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
        running_corrects += torch.sum(preds == labels)
#    
#    ps = torch.exp(outputs)
#    top_p, top_class = ps.topk(1, dim = 1)
#    print(top_p)
#    equals = 1 if (top_class == labels.view(*top_class.shape)) else 0
#    accuracy += torch.mean(equals.type(torch.FloatTensor)).item()
#    
        # Count the samples seen so far; the loss and accuracy are printed after the epoch
        train_len += labels.size(0)
    print(f"Epoch {epoch+1}/{trainingEpoch}"
      f"Train loss: {running_loss/math.ceil(max_index/batch_size):.3f}"
      f"Train accuracy: {100*running_corrects.double()/train_len:.3f}")

Epoch 1/20Train loss: 0.699Train accuracy: 50.000
Epoch 2/20Train loss: 0.697Train accuracy: 50.000
Epoch 3/20Train loss: 0.697Train accuracy: 50.000
Epoch 4/20Train loss: 0.697Train accuracy: 50.000
Epoch 5/20Train loss: 0.697Train accuracy: 50.000
Epoch 6/20Train loss: 0.697Train accuracy: 50.000
Epoch 7/20Train loss: 0.697Train accuracy: 50.000
Epoch 8/20Train loss: 0.696Train accuracy: 50.000
Epoch 9/20Train loss: 0.696Train accuracy: 50.000
Epoch 10/20Train loss: 0.695Train accuracy: 50.000
Epoch 11/20Train loss: 0.695Train accuracy: 50.000
Epoch 12/20Train loss: 0.693Train accuracy: 50.000
Epoch 13/20Train loss: 0.691Train accuracy: 50.000
Epoch 14/20Train loss: 0.687Train accuracy: 50.357
Epoch 15/20Train loss: 0.789Train accuracy: 22.500
Epoch 16/20Train loss: 0.673Train accuracy: 60.714
Epoch 17/20Train loss: 0.637Train accuracy: 59.286
Epoch 18/20Train loss: 0.554Train accuracy: 84.643
Epoch 19/20Train loss: 0.481Train accuracy: 88.214
Epoch 20/20Train loss: 0.453Train accura

#### Conduct the network evaluation

Please write your code below to evaluate the trained model using your held-out testing data. Please output the loss and accuracy on the testing data.
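
When the last batch is smaller than the others, averaging per-batch losses weights its samples more heavily. One way to avoid this is to accumulate the summed loss and the number of correct predictions, then divide by the total sample count. The sketch below shows this bookkeeping with NumPy on hand-made logits; the function names and the toy values are assumptions. In PyTorch, such an evaluation loop is typically run after `model.eval()` and inside `torch.no_grad()`.

```python
import numpy as np

def cross_entropy_sum(logits, labels):
    """Summed softmax cross-entropy of one batch (logits: B x C, labels: B)."""
    shifted = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].sum())

def evaluate(batches):
    """Aggregate loss and accuracy over (logits, labels) batches of any size."""
    loss_sum, correct, total = 0.0, 0, 0
    for logits, labels in batches:
        loss_sum += cross_entropy_sum(logits, labels)
        correct += int((logits.argmax(axis=1) == labels).sum())
        total += len(labels)
    return loss_sum / total, 100.0 * correct / total

# Two hand-made batches of different sizes (3 samples and 1 sample, 2 classes).
batches = [
    (np.array([[2.0, 0.0], [0.0, 2.0], [1.0, 0.0]]), np.array([0, 1, 1])),
    (np.array([[0.0, 3.0]]), np.array([1])),
]
loss, acc = evaluate(batches)
print(f"Test loss: {loss:.3f}  Test accuracy: {acc:.1f}%")
```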

In [39]:
# Test the network and print the testing result

test_loss = 0
test_corrects = 0
test_len =0
max_index2 = int(sample_size * (1-split_size)* classes_number)

for i in range(math.ceil(max_index2/batch_size)):
    start_index = i*batch_size
    end_index = min((i+1)*batch_size,max_index2)
    
    inputs = inputs_test_dataset[start_index:end_index]
    labels = labels_test_dataset[start_index:end_index]
    
    inputs, labels = inputs.to(device),labels.to(device)

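    # Look up the lengths through test_index so that they belong to the samples in this batch
    batch_index = torch.as_tensor(np.asarray(test_index[start_index:end_index]), dtype=torch.long)
    outputs= model.forward(inputs,length_dataset[batch_index])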
    
    _, preds = torch.max(outputs, 1)
    batch_loss = criterion(outputs, labels)
    test_loss += batch_loss.item()
    test_corrects += torch.sum(preds == labels)
    test_len += labels.size(0)
    test_losses.append(test_loss/math.ceil(max_index2/batch_size))
    
    
print(f"Test loss: {test_loss/math.ceil(max_index2/batch_size):.3f}"
    f"Test accuracy: {100*test_corrects.double()/test_len:.3f}")

Test loss: 0.411Test accuracy: 89.167
