<h1 style="text-align: center;"> Emotion Recognition in Voice Recordings </h1>
<h5 style="text-align: center;"> Joseph Golubchik (209195353), Johann Thuillier (336104120), Shlomi Wenberger (203179403) </h5>

<h3 style="text-align: center;"> Project Description </h3>
The aim of our project is to use machine learning to classify a person's emotional state from a recording of their speech.

<h3 style="text-align: center;"> Dataset </h3>
The dataset we used is “The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS)”  
https://zenodo.org/record/1188976  

The database contains 24 professional actors (12 female, 12 male), vocalizing two lexically-matched statements in a neutral North American accent. Speech includes calm, happy, sad, angry, fearful, surprise, and disgust expressions, and song contains calm, happy, sad, angry, and fearful emotions. Each expression is produced at two levels of emotional intensity (normal, strong), with an additional neutral expression. All conditions are available in three modality formats: Audio-only (16bit, 48kHz .wav), Audio-Video (720p H.264, AAC 48kHz, .mp4), and Video-only (no sound). We used only the speech files, not the song files, and only the audio recordings, not the videos.

The speech portion contains 1440 files: 60 trials per actor x 24 actors = 1440. The label for each file is taken from its filename, which consists of a 7-part numerical identifier (e.g., 02-01-06-01-02-01-12.mp4). The identifiers define the stimulus characteristics, in this order:

- Modality (01 = full-AV, 02 = video-only, 03 = audio-only).
- Vocal channel (01 = speech, 02 = song).
- Emotion (01 = neutral, 02 = calm, 03 = happy, 04 = sad, 05 = angry, 06 = fearful, 07 = disgust, 08 = surprised).
- Emotional intensity (01 = normal, 02 = strong). NOTE: There is no strong intensity for the 'neutral' emotion.
- Statement (01 = "Kids are talking by the door", 02 = "Dogs are sitting by the door").
- Repetition (01 = 1st repetition, 02 = 2nd repetition).
- Actor (01 to 24. Odd numbered actors are male, even numbered actors are female).
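The naming scheme above can be decoded mechanically. The following is a minimal sketch, not part of the original notebook: `parse_ravdess_name` and `speech_trials_per_actor` are hypothetical helper names, and the code tables are copied from the description above. It also checks the stated count of 60 speech trials per actor by enumerating the emotion, intensity, statement and repetition combinations while skipping the non-existent strong-neutral case.

```python
from itertools import product

# Code tables copied from the filename description above.
MODALITY = {"01": "full-AV", "02": "video-only", "03": "audio-only"}
VOCAL_CHANNEL = {"01": "speech", "02": "song"}
EMOTION = {
    "01": "neutral", "02": "calm", "03": "happy", "04": "sad",
    "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised",
}
INTENSITY = {"01": "normal", "02": "strong"}


def parse_ravdess_name(filename):
    """Decode a 7-part RAVDESS filename into a dict (hypothetical helper)."""
    stem = filename.rsplit(".", 1)[0]  # drop the extension
    parts = stem.split("-")
    if len(parts) != 7:
        raise ValueError("expected 7 identifiers, got %d" % len(parts))
    modality, channel, emotion, intensity, statement, repetition, actor = parts
    return {
        "modality": MODALITY[modality],
        "vocal_channel": VOCAL_CHANNEL[channel],
        "emotion": EMOTION[emotion],
        "intensity": INTENSITY[intensity],
        "statement": int(statement),
        "repetition": int(repetition),
        "actor": int(actor),
        # Odd-numbered actors are male, even-numbered actors are female.
        "sex": "male" if int(actor) % 2 == 1 else "female",
    }


def speech_trials_per_actor():
    """Count emotion/intensity/statement/repetition combinations for one actor."""
    count = 0
    for emotion, intensity, _statement, _repetition in product(
        EMOTION, INTENSITY, ("01", "02"), ("01", "02")
    ):
        # Neutral has no strong-intensity recording.
        if EMOTION[emotion] == "neutral" and INTENSITY[intensity] == "strong":
            continue
        count += 1
    return count


info = parse_ravdess_name("02-01-06-01-02-01-12.mp4")
print(info["emotion"], info["sex"])
print(speech_trials_per_actor() * 24)  # total speech files for 24 actors
```

The enumeration gives 8 x 2 x 2 x 2 = 64 combinations minus the 4 strong-neutral ones, which matches the 60 trials per actor quoted above.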


![Image of MLP](https://raw.githubusercontent.com/ledell/sldm4-h2o/master/mlp_network.png)
<h2 style="text-align: center;"> Multi Layered Perceptron </h2>

In [1]:
import tensorflow as tf
import librosa.display
import matplotlib.pyplot as plt
import numpy as np
import os
import random
import timeit

In [2]:
# Function to extract only the features from data_xy
def getXvalues(data_xy):
    x_values = []
    for data in data_xy:
        x_values.append(data[0])
    return x_values

# Function to extract only the labels from data_xy
def getYvalues(data_xy):
    y_values = []
    for data in data_xy:
        y_values.append(data[1])
    return y_values

# Sigmoid function
def logistic_fun(z):
    return 1/(1.0 + np.exp(-z))

In [3]:
# Loading the filenames from the folder with the audio files.
filenames = []

for i in range(1, 25):
    # Actor folders are zero-padded: Actor_01 ... Actor_24.
    folderNum = str(i).zfill(2)
    for file in os.listdir('audio/Actor_'+folderNum):
        filenames.append('Actor_'+folderNum+'/'+file)
        
# Shuffling the filenames array.
random.shuffle(filenames)

# Splitting the dataset into train and test files,
# 70% train and 30% test.
num_train = int(len(filenames)*0.7)
num_test = len(filenames) - num_train

print("Number of files =",len(filenames),",Number of actors =",int(len(filenames)/60))
print("Number of train examples =",num_train,",Number of test examples =",num_test)

Number of files = 1440 ,Number of actors = 24
Number of train examples = 1007 ,Number of test examples = 433
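The split above yields 1007 training files rather than 1008 because `1440 * 0.7` evaluates to a float just below 1008 and `int()` truncates it. The sketch below reproduces that arithmetic and shows one possible way to make the shuffle repeatable with a fixed seed. `split_files` and the placeholder file names are illustrative assumptions, not part of the original notebook. Note also that this is a per-file split, so recordings of the same actor can land in both sets.

```python
import random


def split_files(names, train_fraction=0.7, seed=0):
    """Shuffle a copy of names with a fixed seed and split it (illustrative helper)."""
    shuffled = list(names)
    random.Random(seed).shuffle(shuffled)  # local RNG, global state untouched
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Placeholder names standing in for the 1440 audio paths.
fake_names = ["file_%04d.wav" % i for i in range(1440)]
train, test = split_files(fake_names)
print(len(train), len(test))

# Floating-point rounding explains the off-by-one:
print(1440 * 0.7)      # slightly below 1008, so int() truncates to 1007
print(1440 * 7 // 10)  # integer arithmetic gives 1008
```

Calling `split_files` twice with the same seed returns the same partition, which makes accuracy figures comparable between runs.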


In [4]:
data_x_train = []
data_x_test = []
data_y_train = []
data_y_test = []

start_time = timeit.default_timer()

# For each training file, extract its Mel-frequency cepstral coefficients (MFCCs)
# and append them to data_x_train, the array that stores the features of each training file.
# Then build a one-hot label from the filename: the emotion code is the third identifier,
# and its second digit is the 8th character of the bare filename. Because each entry in
# `filenames` starts with the 9-character prefix 'Actor_XX/', that digit sits at index 16.
# Ex: filename[16] == '3' => label: [0,0,1,0,0,0,0,0]
for filename in filenames[:num_train]:
    data, sampling_rate = librosa.load("audio/" + filename, sr=22050*2, res_type='kaiser_fast', duration=2.5, offset=0.5)
    sampling_rate = np.array(sampling_rate)
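    # The MFCC matrix has shape (13, frames); axis=0 averages over the 13 coefficients,
    # leaving one value per time frame as the feature vector.
    mfccs = np.mean(librosa.feature.mfcc(y=data, sr=sampling_rate, n_mfcc=13), axis=0)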
    data_x_train.append(mfccs)
    label = np.zeros(8)
    label[int(filename[16])-1] = 1
    data_y_train.append(label)

# Do the same for the testing examples.
for filename in filenames[num_train:]:
    data, sampling_rate = librosa.load("audio/" + filename, sr=22050*2, res_type='kaiser_fast', duration=2.5, offset=0.5)
    sampling_rate = np.array(sampling_rate)
    mfccs = np.mean(librosa.feature.mfcc(y=data, sr=sampling_rate, n_mfcc=13), axis=0)
    data_x_test.append(mfccs)
    label = np.zeros(8)
    label[int(filename[16])-1] = 1
    data_y_test.append(label)
    
stop_time = timeit.default_timer()
print('Loading time:', stop_time - start_time, "Seconds")  

Loading time: 177.9082642735782 Seconds
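The labels in the loading cell are read from a fixed character position, `filename[16]`, which only works while every path has the form `Actor_XX/<name>`. A slightly more defensive variant, sketched here as an assumption rather than the notebook's method, splits the bare filename on `-` and reads the third identifier. `emotion_one_hot` is a hypothetical helper name and the path is an illustrative example in the same layout.

```python
import os

NUM_EMOTIONS = 8


def emotion_one_hot(path):
    """Return an 8-element one-hot list for the emotion encoded in a RAVDESS path."""
    name = os.path.basename(path)           # e.g. "03-01-06-01-02-01-12.wav"
    emotion_code = int(name.split("-")[2])  # third identifier is the emotion, 1..8
    label = [0.0] * NUM_EMOTIONS
    label[emotion_code - 1] = 1.0
    return label


path = "Actor_12/03-01-06-01-02-01-12.wav"  # illustrative path
# Both approaches agree for this layout: index 16 is the second emotion digit.
print(int(path[16]), int(os.path.basename(path).split("-")[2]))
print(emotion_one_hot(path))
```

Splitting on the separator keeps working if the folder prefix changes, whereas the fixed index would silently read a different character.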


In [5]:
print(np.shape(data_x_train), np.shape(data_x_test))

(1007,) (433, 216)
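The 216 in the test shape is the number of time frames per clip. The sketch below shows where that number plausibly comes from, assuming librosa's default hop length of 512 samples and centered framing (neither is set explicitly in the loading cell). A 2.5 second clip at 44100 Hz then gives 1 + 110250 // 512 frames. `mfcc_frame_count` is a hypothetical helper written for this illustration.

```python
def mfcc_frame_count(duration_s, sample_rate=44100, hop_length=512):
    """Frames produced by a centered short-time transform over duration_s seconds.

    Assumes one frame per hop plus one, as with centered framing.
    """
    n_samples = int(duration_s * sample_rate)
    return 1 + n_samples // hop_length


print(mfcc_frame_count(2.5))  # full-length clip
print(mfcc_frame_count(2.4))  # a clip that ends early yields fewer frames
```

A recording shorter than 3 seconds (0.5 s offset plus 2.5 s duration) would produce fewer than 216 frames. This is presumably why the training shape prints as `(1007,)`: if some feature vectors have a different length, the list cannot form a rectangular array.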


In [8]:
# We create a new array of tuples, where the first element is the features of the example
# and the second element is its label.
# This is necessary so we can shuffle the order of the examples after each training epoch.
data_xy_train = []
for i in range(len(data_x_train)):
    # Our MFCC extraction returns 216 features for all but two files, so we skip those two.
    if len(data_x_train[i]) == 216:
        data_xy_train.append( (data_x_train[i], data_y_train[i]) )
    
data_xy_test = []
for i in range(len(data_x_test)):
    # Apply the same length check to the test set: keep only examples with 216 features.
    if len(data_x_test[i]) == 216:
        data_xy_test.append( (data_x_test[i], data_y_test[i]) )
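The cell above discards examples whose feature vector is not 216 values long. An alternative, shown here only as a sketch and not as what the notebook does, is to truncate or zero-pad every vector to a fixed length so that no file is lost. `fix_length` is a hypothetical helper, and zero is an arbitrary padding value chosen for illustration.

```python
def fix_length(values, target_len=216, pad_value=0.0):
    """Truncate or right-pad a sequence so it has exactly target_len entries."""
    values = list(values)
    if len(values) >= target_len:
        return values[:target_len]
    return values + [pad_value] * (target_len - len(values))


short_clip = [1.0] * 200  # stands in for a feature vector from a short file
padded = fix_length(short_clip)
print(len(padded), padded[199], padded[200])
```

Whether padding helps or hurts depends on the model. With only two affected files, dropping them as the notebook does is a reasonable choice too.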

In [9]:
print(np.shape(getXvalues(data_xy_train)), np.shape(getXvalues(data_xy_test)))

(1005, 216) (433, 216)
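The network in the following cell ends in a softmax layer and is trained with cross-entropy, computed as the negative sum of `y_ * log(y)`. Taking the log of a softmax output directly can produce `log(0)` when a probability underflows. The plain-Python sketch below is an illustration and not the notebook's TensorFlow code. It shows the same loss for a single example using the usual max-subtraction trick; `log_softmax` and `cross_entropy` are names chosen for this example.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax for one example (list of floats)."""
    m = max(logits)
    # log(sum(exp(v))) computed as m + log(sum(exp(v - m))) to avoid overflow.
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_sum for v in logits]


def cross_entropy(logits, one_hot):
    """Cross-entropy between a one-hot label and softmax(logits)."""
    return -sum(t * lp for t, lp in zip(one_hot, log_softmax(logits)))


uniform_logits = [0.0] * 8          # an untrained 8-class output
label = [0, 0, 1, 0, 0, 0, 0, 0]    # one-hot label for emotion 03
print(cross_entropy(uniform_logits, label))  # equals log(8), about 2.079
```

A loss near log(8) means the model is no better than guessing uniformly among the 8 emotions, which gives a useful reference point when reading training output.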


In [17]:
features = len(data_xy_train[0][0])
hidden_layer_nodes = 10

x = tf.placeholder(tf.float32, [None, features])
y_ = tf.placeholder(tf.float32, [None, 8])
W1 = tf.Variable(tf.truncated_normal([features,hidden_layer_nodes], stddev=0.1))
b1 = tf.Variable(tf.constant(0.1, shape=[hidden_layer_nodes]))
z1 = tf.add(tf.matmul(x,W1),b1)
a1 = tf.nn.relu(z1)
W2 = tf.Variable(tf.truncated_normal([hidden_layer_nodes,8], stddev=0.1))
b2 = tf.Variable(0.)
z2 = tf.matmul(a1,W2) + b2
y = tf.nn.softmax(z2)

cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y), reduction_indices=[1]))
# cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y), reduction_indices=[1])+0.1*tf.nn.l2_loss(W))
train_step = tf.train.AdamOptimizer(0.0001).minimize(cross_entropy)

init = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init)

# Build the accuracy ops once, outside the loop, so the graph does not grow on every evaluation.
correct_prediction = tf.equal(tf.argmax(y,1), tf.argmax(y_,1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

start_time = timeit.default_timer()

# Each iteration feeds the full training set, so one iteration is one epoch.
for i in range(10000):
    sess.run(train_step, feed_dict={x:getXvalues(data_xy_train), y_:getYvalues(data_xy_train)})
    if i % 200 == 0:
        print('Epoch '+str(i)+':', "Accuracy:", sess.run(accuracy, feed_dict={x:getXvalues(data_xy_test), y_:getYvalues(data_xy_test)}))
#     random.shuffle(data_xy_train)
    
stop_time = timeit.default_timer()

print("Accuracy:", sess.run(accuracy, feed_dict={x:getXvalues(data_xy_test), y_:getYvalues(data_xy_test)}))
print('runtime: ', stop_time - start_time)  

Epoch 0: Accuracy: 0.16859123
Epoch 200: Accuracy: 0.11316397
Epoch 400: Accuracy: 0.106235564
Epoch 600: Accuracy: 0.1039261
Epoch 800: Accuracy: 0.10161663
Epoch 1000: Accuracy: 0.10161663
Epoch 1200: Accuracy: 0.1039261
Epoch 1400: Accuracy: 0.1039261
Epoch 1600: Accuracy: 0.10161663
Epoch 1800: Accuracy: 0.1039261
Epoch 2000: Accuracy: 0.10161663
Epoch 2200: Accuracy: 0.10161663
Epoch 2400: Accuracy: 0.1039261
Epoch 2600: Accuracy: 0.1039261
Epoch 2800: Accuracy: 0.09006929
Epoch 3000: Accuracy: 0.09237875
Epoch 3200: Accuracy: 0.11778291
Epoch 3400: Accuracy: 0.12702079
Epoch 3600: Accuracy: 0.12702079
Epoch 3800: Accuracy: 0.13163972
Epoch 4000: Accuracy: 0.13394919
Epoch 4200: Accuracy: 0.13625866
Epoch 4400: Accuracy: 0.13394919
Epoch 4600: Accuracy: 0.13394919
Epoch 4800: Accuracy: 0.13625866
Epoch 5000: Accuracy: 0.13856813
Epoch 5200: Accuracy: 0.1408776
Epoch 5400: Accuracy: 0.14549653
Epoch 5600: Accuracy: 0.15704387
Epoch 5800: Accuracy: 0.15704387
Epoch 6000: Accuracy: 0