## Task 4 (Bonus Task). Emotion Speech Recognition using Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM)

### Objective

**This exercise task asks you to conduct deep learning based emotion speech recognition using the given model architecture on the given data**.
Generally speaking, you are asked to predict the emotion of the speaker based on given speech waveforms with arbitrary lengths. 
In this task, you need to use the Convolutional Neural Network (CNN) which we have used in the bonus task of Exercise 1, but with 1-dimension filers. 
Besides, you also need to use a sequential data analysis deep learning module, the [Long Short-Term Memory (LSTM)](https://colah.github.io/posts/2015-08-Understanding-LSTMs/). 
The 1-d CNN here is used to extract locally short time feature from speech frames, 
and the LSTM is used to synthesize the globally long-term feature based on these short time features. 
The network architecture will be end-to-end, which receives the sequence of waveform frames as the inputs and directly give the prediction of the emotion. As a result, the philosophy of 'short time analysis' is still applied here and you need to make the speech waveform as a sequence of equal-length waveform frames and feed them to the network.

In this part, the dataset to be used is a subset of [*Toronto emotional speech set (TESS)*](https://tspace.library.utoronto.ca/handle/1807/24487). The TESS dataset contains the utterance which are spoken under 7 categories of emotions by two actresses (young actress and old actress). In this exercise, we select two categories of data (happy and sad) which is generated by young speakers. In total there are 400 samples (200 samples for each class). The **training and evaluation protocol will be that 70% of the data will be selected as the training data, and the rest will be the test data**. 


Similar to the bonus task of the exercise 1, we have provided the code which defines the network architecture using pytorch, and you will need to invoke it in your training and evaluation code. You are also encouraged to implement your own network, and all software framework are accepted. The network architecture is described in the figure below (the figure unrolls the network according to the number of speech frames T).
!['NetworkArchitecture.png'](Network_Architecture.png)






### Suggested procedures

We provide following procedures to support you to complete this exercise. But you are free to achieve the exercise goal by your own way of implementation.

1. Load and normalize the speech data by subtract the mean and divide the stand variance which are calculated from the whole dataset. To load the speech data, you can use the [librosa.load()](https://librosa.github.io/librosa/generated/librosa.core.load.html#librosa-core-load) for example, and you can also use other tools if you like.

2. Segment frames from the speech waveforms in order to perform short-time analysis. For example you can make your window size as 20-40 ms, with or without overlapping.

3. Split the dataset to training set and testing set, the testing set will be 30% of the whole dataset.

3. Initialize the network and perform the batch training

4. Evaluated the trained mode.



### Code snippet of the network architecture

Below we provide the network architecture definition written in pytorch. Please take read them and use them in your further experiments.

#### Code snippet of the LSTM cell



In [1]:

import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np

from torch.autograd import Variable

class LstmCell(nn.Module):
    
    def __init__(self, inDim, outDim):
        
        super().__init__()
        
        self.outDim = outDim
        self.inDim = inDim
        
        
        # Define the gate operation of LSTM, each gate is implemented by a fully connected neural network
        # The input dimension of each gate is the sum of the dimension of input and out put
        # because each gate performs the operations on input and previous output at the same time
        
        # For example at time step t, the input is x_t and h_{t-1}
        
        # For the example of input gate: I_t = Wx * x_t + Wh * h_{t-1} + b
        # So we can use a one-time linear operation to complete it: make it as [x_t h_{t-1}] * [Wx;  Wh] + b
        
        self._i = nn.Linear(inDim + outDim, outDim)
        self._f = nn.Linear(inDim + outDim, outDim)
        self._o = nn.Linear(inDim + outDim, outDim)
        self._g = nn.Linear(inDim + outDim, outDim)
    
    
    def forward(self, x, currentStates):
        
        
        # Receive the cell state and output of prevous time step
        currentH, currentC = currentStates
        
        # Concate the previous output and current input for the prepration of the gate operation
        combined = torch.cat([x, currentH], dim=1)
        
        # Perform the gate operations
        cc_i = self._i(combined)
        cc_f = self._f(combined)
        cc_o = self._o(combined)
        cc_g = self._g(combined)

        i = torch.sigmoid(cc_i)
        f = torch.sigmoid(cc_f)
        o = torch.sigmoid(cc_o)
        g = torch.tanh(cc_g)
        
        
        # Update the cell state
        nextC = f * currentC + i * g
        
        # Calculate the output
        nextH = o * torch.tanh(nextC)
        
        
        return nextH, nextC
    
    
    def init_hidden(self, batchSize):
        
        # Initialize the variables, cell states and output states at t=0
        return (Variable(torch.zeros(batchSize, self.outDim)).float(),
                Variable(torch.zeros(batchSize, self.outDim)).float())

        



    


#### Code snippet of the LSTM layer

In [2]:

class LstmLayer(nn.Module):
    
    def __init__(self, inDim, outDim):
        super().__init__()
        
        
        
        self.lstmCell = LstmCell(inDim=inDim, outDim=outDim)
    
    def forward(self, x, sequenceLengths, returnLast=False, hc=None):
        N, T, D = x.shape
        
        # A LSTM layer handles he operations from t=0 till T.
        # So the cell state has to be initialized as zerso
        if hc is not None:
            raise NotImplementedError()
        else:
            hc = self._init_hidden(batchSize=N)
            
        h, c = hc
        
        outH = []
        outC = []

        # Iterate for each timestep.
        for t in range(T):
            h, c = self.lstmCell(x=x[:, t, :], currentStates=[h,c])
            
            outH.append(h)
            outC.append(c)
            
        outH = torch.stack(outH, dim=1)
        outC = torch.stack(outC, dim=1)
        
        if returnLast:
            newOutH = []
            newOutC = []
            
            # Since the the length of input are different, and some of them are padded by zeros,
            # so the last output has to be picked according to the length of the input sequence
            for sampleIte in range(N):
                newOutH.append(outH[sampleIte, sequenceLengths[sampleIte] - 1, :])
                               
                #sampleIte, outH[sequenceLengths[sampleIte] - 1, :])
                newOutC.append(outC[sampleIte, sequenceLengths[sampleIte] - 1, :])
                
            newOutH = torch.stack(newOutH, dim=0)
            newOutC = torch.stack(newOutC, dim=0)
            return newOutH, newOutC    
            
        else:
            return outH, outC
    
    def _init_hidden(self, batchSize):
        return self.lstmCell.init_hidden(batchSize=batchSize)
    

#### Code snippet of the whole nework architecture
In oder to use the provide network architecture, please read the comment of forward() function.

In [3]:

class AudioNet(nn.Module):

    def __init__(self, numClass=2, inDim=0, convLayerNum=3, convOutNum=64, convKernelSize=5, lstmLayerNum=2):
        super().__init__()
        
        self.convLayerNum = convLayerNum
        self.convOutNum = convOutNum
        self.lstmLayerNum=lstmLayerNum
        self.convKernelSize = convKernelSize
        self.inDim = inDim
        
        
        self.currentDim = self.inDim
        self.avgPool = nn.AvgPool1d(kernel_size=2,
                                    stride=2)
        
        self.currentDim = self.currentDim / 2
        self.conv1d_1 = nn.Conv1d(in_channels=1,
                                  out_channels=64,
                                  kernel_size=self.convKernelSize,
                                  stride=2)
        
        # (self.inDim/2 + 2 x 0 - 1 x (5 - 1) - 1) / 2 + 1
        # (488 - 4)/2 + 1  
        self.currentDim = int((self.currentDim + 2 * 0 - 1 * (self.convKernelSize - 1) - 1)/2 + 1) 
        
        self.maxPool = nn.MaxPool1d(kernel_size=2, stride=2)
        
        self.currentDim = int(self.currentDim / 2)
        
        self.conv1d_2 = nn.Conv1d(in_channels=64,
                                  out_channels=64,
                                  kernel_size=self.convKernelSize,
                                  stride=2)
        for i in range(self.convLayerNum - 1):
            self.currentDim = int((self.currentDim + 2 * 0 - 1 * (self.convKernelSize - 1) - 1)/2 + 1) 
            self.currentDim = int(self.currentDim/2)
        
        self.currentDim = convOutNum * self.currentDim
        self.lstm1 = LstmLayer(self.currentDim, 64)
        self.lstm2 = LstmLayer(64, 64)
        self.fc = nn.Linear(convOutNum, numClass)
                    
    def forward(self, x, sequenceLengths):
        '''
        input size:
            x： N x T x D
            sequence: N x 1
            
            
        Example of input:
            If you have three sequence of speech frames with different lengths:
                Frame11 Frame12 Frame 13    0     
                Frame21 Frame22 Frame 23 Frame 24
                Frame31 frame32    0        0
            
            Then you need to pad the sequence with empty frames to make them have eaual length in time, and put them in a tensor,
            as a result the tensor will have the size of 3 x 4 x D
            
            Besides, you also need to provide a vector which contains their lengths, in this example it will be:
                [3
                 4
                 2]
        
        '''
        
        N, T, D = x.shape
        x = torch.reshape(x, (N * T, 1,  D))
        
        # Applying an initial downsampling
        x1 = self.avgPool(x)
        # Applying an intial convolution
        x2 = self.conv1d_1(x1)    
        x2 = F.relu(x2)
        x3 = self.maxPool(x2)
        

        # Apply several layers of CNN
        for convLayerIte in range(self.convLayerNum - 1):
            
            x3 = self.conv1d_2(x3)
            x3 = F.relu(x3)
            x3 = self.maxPool(x3)
        
        x4 = torch.reshape(x3, (N, T, -1))
        
        # Apply two LSTM layers
        x4, _ = self.lstm1(x=x4, sequenceLengths=sequenceLengths, returnLast=False, hc=None)
        x5, _ = self.lstm2(x=x4, sequenceLengths=sequenceLengths, returnLast=True, hc=None)
                
        
        x6 = F.relu(x5)
        return self.fc(x6)
    


### Your implementation
Please write your code below to complete the exercise


#### Load and normalize the speech data

1. Get the sampling rate of the raw data. All waveforms are recorded with the same sampling rate, so you can get the sampling rate by only read on speech samples. The function [librosa.load()](https://librosa.github.io/librosa/generated/librosa.core.load.html#librosa-core-load) will return the sampling rate, please read the manual.

2. Normalize the speech data by subtracting the *mean* and dividing the *stand variance* to make the data as 'zero mean and unit variance'. Thus you need to calculate the *mean* and *stand variance* from the whole dataset. Since speech waveforms are 1-d time series, so the *mean* and *stand variance* will be scalars. 

3. Segment each speech waveform as sequences of equal-length frames. You can make frames with or without overlapping according to your needs. Please not that you may need to pad zeros to the end of the speech waveforms in order to make the last frame has the same length with others.

4. Split the dataset to training set and testing set. 70% of the data will be the training data, an 30% will be the testing data. The portion will be applied equally to either class.


In [4]:
import scipy.io as sio
from scipy import signal
import matplotlib.pyplot as plt
import librosa

def load_data(pathAudio):
    files = librosa.util.find_files(pathAudio, ext=['wav']) 
    files = np.asarray(files)
    data = []
    for y in files: 
        y, sr = librosa.load(y, sr = None)
        data.append(y)
    return data, sr
    

Happy_data, SR_h = load_data('BonusTaskData/YAF_happy/')
Sad_data, SR_s = load_data('BonusTaskData/YAF_sad/')

def Normalise(data1, data2):
    data = np.append(data1, data2)
    s = 0
    c = 0
    for sublist in data:
        for item in sublist:
            s = s + item
            c += 1
    mean = s/c
    s = 0
    c = 0
    for sublist in data:
        for item in sublist:
            x = (item - mean)**2
            s = s + x
            c += 1
            
    std_v = s/(c-1)
    
    return mean, std_v
        
        
mean, std_v = Normalise(Happy_data, Sad_data)

In [5]:
Happy_data_norm = (Happy_data - mean)/std_v
Sad_data_norm = (Sad_data - mean)/std_v
data = np.append(Happy_data_norm, Sad_data_norm)

In [7]:
data = np.append(Happy_data_norm, Sad_data_norm)
seg_duration = 0.03
frame_len = int(SR_h*seg_duration)
# print(frame_len)

length = []
for i in range(len(data)):
    length.append(len(data[i]))
max_length = np.max(length)
frame = (np.ceil(max_length/frame_len))*frame_len

for i in range(len(data)):
    pad_len = abs(len(data[i]) - int(frame))
    data[i] = np.pad(data[i], (0, pad_len), mode = 'constant')


In [8]:
from sklearn.model_selection import train_test_split
classes = np.zeros(200)
classes = np.pad(classes, (0, 200), mode = 'constant', constant_values = 1)
Train_data, Test_data, Train_classes, Test_classes = train_test_split(data, classes, test_size = 0.3, random_state = 0, shuffle = True)

# Perform the network training.

Please write your code below for the network training. You are asked to perform the batch based training here. Please accumulate the loss and classification accuracy for each epoch and output them and the end of the epoch.



In [47]:
#Initialize the network and optimizer
model = AudioNet(numClass = 2,inDim = frame_len).float()

criterion = nn.CrossEntropyLoss()

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

trainingEpoch = 20

#### Conduct the network evaluation

Please write your code below to evaluated the trained model using your splited testing data. Please output the loss and accuracy of the testing data.

In [54]:
batch_size = frame_len
batch_loss = 0
train_losses, test_losses = [], []

def segmentation(data):
    segmented = []
    for i in range(int(frame/frame_len)):
        start_index = i * batch_size
        end_index = (i+1) * batch_size
        segmented.append(data[start_index:end_index])
    return segmented

# segmented = segmentation(Train_data[0])
# segmented = np.array(segmented)

for epoch in range(trainingEpoch):
    train_len = 0
    running_loss = 0
    running_corrects = 0
    for i, sample in enumerate(Train_data):
        
        inputs = segmentation(sample)
        inputs = torch.FloatTensor(inputs)
        inputs = inputs.unsqueeze(0)
        labels = np.full(1,Train_classes[i])
        labels = torch.LongTensor(labels)
        
        optimizer.zero_grad()
        
        length = np.array([int(len(Train_data[i])/frame_len)])
        
        outputs = model.forward(inputs, length)
        _, preds = torch.max(outputs, 1)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
        running_corrects += torch.sum(preds == labels)
        
        # print training lost and accuracy
        model.eval()
        # accuracy += torch.mean(equals.type(torch.FloatTensor)).item()
        train_len += labels.size(0)
    print(f"Epoch {epoch+1}/{trainingEpoch}"
      f"Train loss: {running_loss/np.ceil(frame/batch_size):.3f}"
      f"Train accuracy: {100*running_corrects.double()/train_len:.3f}")

Epoch 1/20Train loss: 2.170Train accuracy: 48.929
Epoch 2/20Train loss: 2.159Train accuracy: 46.429
Epoch 3/20Train loss: 2.014Train accuracy: 69.286
Epoch 4/20Train loss: 1.483Train accuracy: 87.500
Epoch 5/20Train loss: 1.212Train accuracy: 87.500
Epoch 6/20Train loss: 1.071Train accuracy: 90.357
Epoch 7/20Train loss: 0.985Train accuracy: 91.071
Epoch 8/20Train loss: 0.920Train accuracy: 90.357
Epoch 9/20Train loss: 0.843Train accuracy: 91.786
Epoch 10/20Train loss: 0.780Train accuracy: 91.071
Epoch 11/20Train loss: 0.727Train accuracy: 91.786
Epoch 12/20Train loss: 0.670Train accuracy: 93.214
Epoch 13/20Train loss: 0.560Train accuracy: 94.643
Epoch 14/20Train loss: 0.647Train accuracy: 92.857
Epoch 15/20Train loss: 0.421Train accuracy: 96.429
Epoch 16/20Train loss: 0.243Train accuracy: 98.571
Epoch 17/20Train loss: 0.147Train accuracy: 99.643
Epoch 18/20Train loss: 0.116Train accuracy: 100.000
Epoch 19/20Train loss: 0.101Train accuracy: 100.000
Epoch 20/20Train loss: 0.088Train accu

In [59]:
test_loss = 0
test_corrects = 0
test_len =0


for i, sample in enumerate(Test_data):
    inputs = segmentation(sample)
    inputs = torch.FloatTensor(inputs)
    inputs = inputs.unsqueeze(0)
    labels = np.full(1,Test_classes[i])
    labels = torch.LongTensor(labels)
    
    length = np.array([int(len(Test_data[i])/frame_len)])

    outputs = model.forward(inputs,length)
    
    _, preds = torch.max(outputs, 1)
    batch_loss = criterion(outputs, labels)
    test_loss += batch_loss.item()
    test_corrects += torch.sum(preds == labels)
    test_len += labels.size(0)
    test_losses.append(test_loss/len(Test_data[i])/frame_len)
    
    
print(f"Test loss: {test_loss/np.ceil(frame/batch_size):.3f}"
  f"Test accuracy: {100*test_corrects.double()/test_len:.3f}")

Test loss: 0.041Test accuracy: 100.000
