# Emotion Recognition in Voice Recordings
##### Joseph Golubchik (209195353), Johann Thuillier (336104120), Shlomi Wenberger (203179403)

The aim of our project is to use logistic regression to classify a persons emotional state from a recording of him speaking.  

## Dataset
The dataset we used is “The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS)”  
https://zenodo.org/record/1188976  

The database contains 24 professional actors (12 female, 12 male), vocalizing two lexically-matched statements in a neutral North American accent. Speech includes calm, happy, sad, angry, fearful, surprise, and disgust expressions, and song contains calm, happy, sad, angry, and fearful emotions. Each expression is produced at two levels of emotional intensity (normal, strong), with an additional neutral expression. All conditions are available in three modality formats: Audio-only (16bit, 48kHz .wav), Audio-Video (720p H.264, AAC 48kHz, .mp4), and Video-only (no sound). We used only the speach files and not the song files, and used only the audio files and not the videos.

Speech file contains 1440 files: 60 trials per actor x 24 actors = 1440. The labels for each file will be taken from the filenames: The filename consists of a 7-part numerical identifier (e.g., 02-01-06-01-02-01-12.mp4). These identifiers define the stimulus characteristics: Filename identifiers Modality (01 = full-AV, 02 = video-only, 03 = audio-only). Vocal channel (01 = speech, 02 = song). Emotion (01 = neutral, 02 = calm, 03 = happy, 04 = sad, 05 = angry, 06 = fearful, 07 = disgust, 08 = surprised). Emotional intensity (01 = normal, 02 = strong). NOTE: There is no strong intensity for the 'neutral' emotion. Statement (01 = "Kids are talking by the door", 02 = "Dogs are sitting by the door"). Repetition (01 = 1st repetition, 02 = 2nd repetition). Actor (01 to 24. Odd numbered actors are male, even numbered actors are female).


## Convolutional Neural Network

In [1]:
import tensorflow as tf
import librosa.display
import matplotlib.pyplot as plt
import numpy as np
import os
import random
import timeit

In [2]:
# Function to extract only the features from data_xy
def getXvalues(data_xy):
    x_values = []
    for data in data_xy:
        x_values.append(data[0])
    return x_values

# Function to extract only the labels from data_xy
def getYvalues(data_xy):
    y_values = []
    for data in data_xy:
        y_values.append(data[1])
    return y_values

# Sigmoid function
def logistic_fun(z):
    return 1/(1.0 + np.exp(-z))

In [3]:
# Loading the filenames from the folder with the audio files.
filenames = []

for i in range(1,25):
    if (i < 10):
        folderNum = "0"+str(i)
    else:
        folderNum = str(i)
    for file in os.listdir('audio/Actor_'+folderNum):
        filenames.append('Actor_'+folderNum+'/'+file)
        
# Shuffling the filenames array.
random.shuffle(filenames)

# Spliting the dataset into train and test files,
# 70% train and 30% test.
num_train = int(len(filenames)*0.7)
num_test = len(filenames) - num_train

print("Number of files =",len(filenames),",Number of actors =",int(len(filenames)/60))
print("Number of train examples =",num_train,",Number of test examples =",num_test)

Number of files = 1440 ,Number of actors = 24
Number of train examples = 1007 ,Number of test examples = 433


In [4]:
data_x_train = []
data_x_test = []
data_y_train = []
data_y_test = []

# max_pad_len = 11

start_time = timeit.default_timer()

# For each of the training examples,
# extract from each file its Mel-frequency cepstral coefficients (MFCCs)
# and append the mfccs to the array that stores the features of each train file - data_x_train.
# look at the filename and create a label for the example,
# Where the 8'th character determines the label.
# Ex: filename[7] == 3 => label: [0,0,1,0,0,0,0,0]
# Actor_13/03-01-05-01-01-01-13.wav
for filename in filenames[:num_train]:
    data, sampling_rate = librosa.load("audio/" + filename, sr=22050*2, res_type='kaiser_fast', duration=2.5, offset=0.5)
    sampling_rate = np.array(sampling_rate)
    mfccs = np.mean(librosa.feature.mfcc(y=data, sr=sampling_rate, n_mfcc=13), axis=0)
    data_x_train.append(mfccs)
    label = np.zeros(8)
    label[int(filename[16])-1] = 1
    data_y_train.append(label)
    
    np.save('saved/' + filename[9:-3] + str(np.argmax(label)) + '.npy', mfccs)
    

# Do the same for the testing examples.
for filename in filenames[num_train:]:
    data, sampling_rate = librosa.load("audio/" + filename, sr=22050*2, res_type='kaiser_fast', duration=2.5, offset=0.5)
    sampling_rate = np.array(sampling_rate)
    mfccs = np.mean(librosa.feature.mfcc(y=data, sr=sampling_rate, n_mfcc=13), axis=0)
    data_x_test.append(mfccs)
    label = np.zeros(8)
    label[int(filename[16])-1] = 1
    data_y_test.append(label)
    
    np.save('saved/' + filename[9:-3] + str(np.argmax(label)) + '.npy', mfccs)
    
stop_time = timeit.default_timer()
print('Loading time:', stop_time - start_time, "Seconds")  

Loading time: 198.8266464191262 Seconds


In [5]:
# We create a new array that will contain tuples where the first element is the features of the example,
# and the second element is the label of the example.
# This is neccesary so we can shuffle the order of the examples around after each training epoch.
data_xy_train = []
for i in range(len(data_x_train)):
#     # For all but two of our files, our mfccs extraction returns 216 features, so we don't use these two.
#     if len(data_x_train[i]) == 216:
    temp_arr = np.copy(data_x_train[i])
    temp_arr.resize(256)
    data_xy_train.append( (temp_arr, data_y_train[i]) )
    
data_xy_test = []
for i in range(len(data_x_test)):
#     # For all but two of our files, our mfccs extraction returns 216 features, so we don't use these two.
#     if len(data_x_test[i]) == 216:
    temp_arr = np.copy(data_x_test[i])
    temp_arr.resize(256)
    data_xy_test.append( (temp_arr, data_y_test[i]) )

In [11]:
features = len(data_xy_train[0][0])
labels = len(data_xy_train[0][1])

x = tf.placeholder(tf.float32, [None, features], name="x")
y_ = tf.placeholder(tf.float32, [None, labels], name="y_")

W_conv1 = tf.Variable(tf.truncated_normal([10, 10, 1, 32], stddev=0.1))
b_conv1 = tf.Variable(tf.constant(0.1, shape=[32]))
x_image = tf.reshape(x, [-1,16,16,1]) #if we had RGB, we would have 3 channels
h_conv1 = tf.nn.relu(tf.nn.conv2d(x_image, W_conv1, strides=[1, 1, 1, 1], padding='SAME') + b_conv1)
h_pool1 = tf.nn.max_pool(h_conv1, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')
W_conv2 = tf.Variable(tf.truncated_normal([5, 5, 32, 64], stddev=0.1))
b_conv2 = tf.Variable(tf.constant(0.1, shape=[64]))
h_conv2 = tf.nn.relu(tf.nn.conv2d(h_pool1, W_conv2, strides=[1, 1, 1, 1], padding='SAME') + b_conv2)
h_pool2 = tf.nn.max_pool(h_conv2, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')
h_pool2_flat = tf.reshape(h_pool2, [-1, 4*4*64])
W_fc1 = tf.Variable(tf.truncated_normal([4 * 4 * 64, 1024], stddev=0.1))
b_fc1 = tf.Variable(tf.constant(0.1, shape=[1024]))
h_fc1 = tf.nn.relu(tf.matmul(h_pool2_flat, W_fc1) + b_fc1)
keep_prob = tf.placeholder(tf.float32, name="keep_prob")
h_fc1_drop = tf.nn.dropout(h_fc1, keep_prob)
W_fc2 = tf.Variable(tf.truncated_normal([1024, labels], stddev=0.1))
b_fc2 = tf.Variable(tf.constant(0.1, shape=[labels]))
y_conv=tf.nn.softmax(tf.matmul(h_fc1_drop, W_fc2) + b_fc2)

cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y_conv), reduction_indices=[1]))
train_step = tf.train.AdamOptimizer(0.001).minimize(cross_entropy)
correct_prediction = tf.equal(tf.argmax(y_conv,1), tf.argmax(y_,1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
sess = tf.InteractiveSession()
sess.run(tf.global_variables_initializer())

start_time = timeit.default_timer()

for i in range(10):
    if i % 2 == 0:
        train_accuracy = accuracy.eval(feed_dict={x:getXvalues(data_xy_test), y_:getYvalues(data_xy_test), keep_prob: 1.0})
        print("step %d, training accuracy %g"%(i, train_accuracy))
                               
    train_step.run(feed_dict={x:getXvalues(data_xy_test), y_:getYvalues(data_xy_test), keep_prob: 0.5})
    random.shuffle(data_xy_train)

    
stop_time = timeit.default_timer()
                               
print("test accuracy %g"%accuracy.eval(feed_dict={x:getXvalues(data_xy_test), y_:getYvalues(data_xy_test), keep_prob: 1.0}))
print('runtime: ', stop_time - start_time)  

sess.close()

step 0, training accuracy 0.127021
step 2, training accuracy 0.073903
step 4, training accuracy 0.073903
step 6, training accuracy 0.073903
step 8, training accuracy 0.073903
test accuracy 0.073903
runtime:  15.716906967600494


In [17]:
def reset_graph(seed=42):
    tf.reset_default_graph()
    tf.set_random_seed(seed)
    np.random.seed(seed)

In [23]:
#parameters of convolutional layer
conv1_fmaps = 32
conv1_ksize = 3
conv1_stride = 1
conv1_pad = "SAME"

conv2_fmaps = 64
conv2_ksize = 3
conv2_stride = 2
conv2_pad = "SAME"

#parameters of pooling layer
pool2_fmaps = conv2_fmaps
#parameters of fully connected network and outputs
n_fc1 = 64
n_outputs = 10

features = len(data_xy_train[0][0])
labels = len(data_xy_train[0][1])

reset_graph()

x = tf.placeholder(tf.float32, [None, features], name="x")
X_reshaped = tf.reshape(x, [-1,16,16,1]) #if we had RGB, we would have 3 channels
y = tf.placeholder(tf.float32, [None, labels], name="y")



conv1 = tf.layers.conv2d(X_reshaped, filters=conv1_fmaps, kernel_size = conv1_ksize,
                         strides = conv1_stride, padding=conv1_pad,
                         activation = tf.nn.relu, name="conv1")
conv2 = tf.layers.conv2d(conv1, filters=conv2_fmaps, kernel_size=conv2_ksize,
                         strides=conv2_stride, padding=conv2_pad,
                         activation=tf.nn.relu, name="conv2")

pool2 = tf.nn.max_pool(conv2, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding="VALID")
pool2_flat = tf.reshape(pool2, shape=[-1,pool2_fmaps*7*7])

fc1 = tf.layers.dense(pool2_flat, n_fc1, activation = tf.nn.relu,
                          name = "fc1")

logits = tf.layers.dense(fc1, n_outputs, name = "output")
Y_proba = tf.nn.softmax(logits, name="Y_proba")

xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=y)
loss = tf.reduce_mean(xentropy)
optimizer = tf.train.AdamOptimizer()
training_op = optimizer.minimize(loss)

correct = tf.nn.in_top_k(logits,y,1)
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))

init = tf.global_variables_initializer()
saver = tf.train.Saver()


#I decided to set the epochs to 10, but also 2 or 3 it's enough for good result,
#this because the train and the test sets are very similar
n_epochs = 10
batch_size = 100

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for iteration in range(num_examples // batch_size):
            #this cycle is for dividing step by step the heavy work of each neuron
            X_batch = np_fashion_train[iteration*batch_size:iteration*batch_size+batch_size,1:]
            y_batch = np_fashion_train[iteration*batch_size:iteration*batch_size+batch_size,0]
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        acc_train = accuracy.eval(feed_dict={X: X_batch, y: y_batch})
        acc_test = accuracy.eval(feed_dict={X: test_images, y: test_labels})
        print("Epoch:",epoch+1, "Train accuracy:", acc_train, "test accuracy:", acc_test)
       
        save_path = saver.save(sess, "./my_fashion_model")

ValueError: Rank mismatch: Rank of labels (received 2) should equal rank of logits minus 1 (received 2).