# Emotion Recognition in Voice Recordings
##### Joseph Golubchik (209195353), Johann Thuillier (336104120), Shlomi Wenberger (203179403)

The aim of our project is to use logistic regression to classify a persons emotional state from a recording of him speaking.  

## Dataset
The dataset we used is “The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS)”  
https://zenodo.org/record/1188976  

The database contains 24 professional actors (12 female, 12 male), vocalizing two lexically-matched statements in a neutral North American accent. Speech includes calm, happy, sad, angry, fearful, surprise, and disgust expressions, and song contains calm, happy, sad, angry, and fearful emotions. Each expression is produced at two levels of emotional intensity (normal, strong), with an additional neutral expression. All conditions are available in three modality formats: Audio-only (16bit, 48kHz .wav), Audio-Video (720p H.264, AAC 48kHz, .mp4), and Video-only (no sound). We used only the speach files and not the song files, and used only the audio files and not the videos.

Speech file contains 1440 files: 60 trials per actor x 24 actors = 1440. The labels for each file will be taken from the filenames: The filename consists of a 7-part numerical identifier (e.g., 02-01-06-01-02-01-12.mp4). These identifiers define the stimulus characteristics: Filename identifiers Modality (01 = full-AV, 02 = video-only, 03 = audio-only). Vocal channel (01 = speech, 02 = song). Emotion (01 = neutral, 02 = calm, 03 = happy, 04 = sad, 05 = angry, 06 = fearful, 07 = disgust, 08 = surprised). Emotional intensity (01 = normal, 02 = strong). NOTE: There is no strong intensity for the 'neutral' emotion. Statement (01 = "Kids are talking by the door", 02 = "Dogs are sitting by the door"). Repetition (01 = 1st repetition, 02 = 2nd repetition). Actor (01 to 24. Odd numbered actors are male, even numbered actors are female).


In [1]:
import tensorflow as tf
import librosa.display
import matplotlib.pyplot as plt
import numpy as np
import os
import random
import timeit

In [2]:
# Function to extract only the features from data_xy
def getXvalues(data_xy):
    x_values = []
    for data in data_xy:
        x_values.append(data[0])
    return x_values

# Function to extract only the labels from data_xy
def getYvalues(data_xy):
    y_values = []
    for data in data_xy:
        y_values.append(data[1])
    return y_values

# Sigmoid function
def logistic_fun(z):
    return 1/(1.0 + np.exp(-z))

In [73]:
# Loading the filenames from the folder with the audio files.
filenames = []

for i in range(1,25):
    if (i < 10):
        folderNum = "0"+str(i)
    else:
        folderNum = str(i)
    for file in os.listdir('audio/Actor_'+folderNum):
        filenames.append('Actor_'+folderNum+'/'+file)
        


# Shuffling the filenames array.
random.shuffle(filenames)

# Spliting the dataset into train and test files,
# 70% train and 30% test.
num_train = int(len(filenames)*0.7)
num_test = len(filenames) - num_train

print("Number of files =",len(filenames),",Number of actors =",int(len(filenames)/60))
print("Number of train examples =",num_train,",Number of test examples =",num_test)

Number of files = 1440 ,Number of actors = 24
Number of train examples = 1007 ,Number of test examples = 433


In [112]:
data_x_train = []
data_x_test = []
data_y_train = []
data_y_test = []

start_time = timeit.default_timer()

# For each of the training examples,
# extract from each file its Mel-frequency cepstral coefficients (MFCCs)
# and append the mfccs to the array that stores the features of each train file - data_x_train.
# look at the filename and create a label for the example,
# Where the 8'th character determines the label.
# Ex: filename[7] == 3 => label: [0,0,1,0,0,0,0,0]
for filename in filenames[:num_train]:
    if filename != 'Actor_17/03-01-06-01-02-01-17.wav111':
        data, sampling_rate = librosa.load("audio/" + filename, sr=22050*2, res_type='kaiser_fast', duration=2.5, offset=0.5)
        sampling_rate = np.array(sampling_rate)
        mfccs = np.mean(librosa.feature.mfcc(y=data, sr=sampling_rate, n_mfcc=13), axis=0)
        if len(mfccs) != 216:
            print(filename)
        data_x_train.append(mfccs)
        label = np.zeros(8)
        label[int(filename[16])-1] = 1
        data_y_train.append(label)

# Do the same for the testing examples.
for filename in filenames[num_train:]:
    if filename != 'Actor_17/03-01-06-01-02-01-17.wav111':
        data, sampling_rate = librosa.load("audio/" + filename, sr=22050*2, res_type='kaiser_fast', duration=2.5, offset=0.5)
        sampling_rate = np.array(sampling_rate)
        mfccs = np.mean(librosa.feature.mfcc(y=data, sr=sampling_rate, n_mfcc=13), axis=0)
        data_x_test.append(mfccs)
        label = np.zeros(8)
        label[int(filename[16])-1] = 1
        data_y_test.append(label)
    
stop_time = timeit.default_timer()
print('Loading time:', stop_time - start_time, "Seconds")  

Loading time: 168.61646682778337 Seconds


In [129]:
print(np.shape(data_x_train), np.shape(data_x_test))

(1007, 216) (433,)


In [114]:
# Tensorflow requires the y array that it gets to be of the shape (none, 1)
# This converts our data_y arrays from the shape (none, ) to the required shape (none, 1)
# ex: [0, 1, 1, 1, 0] => [[0], [1], [1], [1], [0]]
data_y_train_correct = []
data_y_test_correct = []

for val in data_y_train:
    val_arr = []
    val_arr.append(val)
    data_y_train_correct.append(val_arr)
    
for val in data_y_test:
    val_arr = []
    val_arr.append(val)
    data_y_test_correct.append(val_arr)

In [115]:
# We create a new array that will contain tuples where the first element is the features of the example,
# and the second element is the label of the example.
# This is neccesary so we can shuffle the order of the examples around after each training epoch.
# data_xy_train = []
# for i in range(len(data_x_train)):
#     data_xy_train.append( (data_x_train[i], data_y_train_correct[i]) )
    
# data_xy_test = []
# for i in range(len(data_x_test)):
#     data_xy_test.append( (data_x_test[i], data_y_test_correct[i]) )

In [130]:
data_xy_train = []
for i in range(len(data_x_train)):
    if len(data_x_train[i]) == 216:
        data_xy_train.append( (data_x_train[i], data_y_train[i]) )
    
data_xy_test = []
for i in range(len(data_x_test)):
    if len(data_x_test[i]) == 216:
        data_xy_test.append( (data_x_test[i], data_y_test[i]) )

In [123]:
print(np.shape(getXvalues(data_xy_train)), np.shape(getXvalues(data_xy_test)))

(1007, 216) (431, 216)


In [125]:
count = 0
for data in getXvalues(data_xy_train):
    if len(data) != 216:
        print(len(data), count, filenames[count])
    count += 1

In [141]:
features = len(data_xy_train[0][0])
hidden_layer_nodes = 10

x = tf.placeholder(tf.float32, [None, features])
y_ = tf.placeholder(tf.float32, [None, 8])
W1 = tf.Variable(tf.truncated_normal([features,hidden_layer_nodes], stddev=0.1))
b1 = tf.Variable(tf.constant(0.1, shape=[hidden_layer_nodes]))
z1 = tf.add(tf.matmul(x,W1),b1)
a1 = tf.nn.relu(z1)
W2 = tf.Variable(tf.truncated_normal([hidden_layer_nodes,8], stddev=0.1))
b2 = tf.Variable(0.)
z2 = tf.matmul(a1,W2) + b2
y = tf.nn.softmax(z2)

cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y), reduction_indices=[1]))
# cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y), reduction_indices=[1])+0.1*tf.nn.l2_loss(W))
train_step = tf.train.AdamOptimizer(0.0001).minimize(cross_entropy)

init = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init)


# accuracy_arr = np.array()
start_time = timeit.default_timer()

for i in range(10000):
    sess.run(train_step, feed_dict={x:getXvalues(data_xy_train), y_:getYvalues(data_xy_train)})
    if i % 500 == 0:
        correct_prediction = tf.equal(tf.argmax(y,1), tf.argmax(y_,1))
        accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
        print('Epoch '+str(i)+':', "Accuracy:", sess.run(accuracy, feed_dict={x:getXvalues(data_xy_test), y_:getYvalues(data_xy_test)}))
#     random.shuffle(data_xy_train)
    
stop_time = timeit.default_timer()
correct_prediction = tf.equal(tf.argmax(y,1), tf.argmax(y_,1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

print("Accuracy:", sess.run(accuracy, feed_dict={x:getXvalues(data_xy_test), y_:getYvalues(data_xy_test)}))
print('runtime: ', stop_time - start_time)  

# Inconsistent
# Accuracy: 0.62931037, iter=5000, rate=0.0006, 10 nodes
# Accuracy: 0.6637931, iter=10000, rate=0.0006, 10 nodes

Epoch0: Accuracy: 0.14849187
Epoch500: Accuracy: 0.1438515
Epoch1000: Accuracy: 0.15777262
Epoch1500: Accuracy: 0.20417634
Epoch2000: Accuracy: 0.19953597
Epoch2500: Accuracy: 0.20417634
Epoch3000: Accuracy: 0.21345708
Epoch3500: Accuracy: 0.20185615
Epoch4000: Accuracy: 0.21345708
Epoch4500: Accuracy: 0.22273782
Epoch5000: Accuracy: 0.225058
Epoch5500: Accuracy: 0.23665893
Epoch6000: Accuracy: 0.24825986
Epoch6500: Accuracy: 0.2412993
Epoch7000: Accuracy: 0.24593967
Epoch7500: Accuracy: 0.24593967
Epoch8000: Accuracy: 0.25290024
Epoch8500: Accuracy: 0.25058004
Epoch9000: Accuracy: 0.26914153
Epoch9500: Accuracy: 0.2552204
Accuracy: 0.25290024
runtime:  85.97175560119285


In [132]:
sess.run(correct_prediction, feed_dict={x:getXvalues(data_xy_test), y_:data_y_test})
ans = np.array(sess.run(correct_prediction, feed_dict={x:getXvalues(data_xy_test), y_:data_y_test}))


InvalidArgumentError: Incompatible shapes: [431] vs. [433]
	 [[{{node Equal_2}} = Equal[T=DT_INT64, _device="/job:localhost/replica:0/task:0/device:CPU:0"](ArgMax_4, ArgMax_5)]]

Caused by op 'Equal_2', defined at:
  File "c:\python\python36\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "c:\python\python36\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "c:\python\python36\lib\site-packages\ipykernel_launcher.py", line 16, in <module>
    app.launch_new_instance()
  File "c:\python\python36\lib\site-packages\traitlets\config\application.py", line 658, in launch_instance
    app.start()
  File "c:\python\python36\lib\site-packages\ipykernel\kernelapp.py", line 505, in start
    self.io_loop.start()
  File "c:\python\python36\lib\site-packages\tornado\platform\asyncio.py", line 132, in start
    self.asyncio_loop.run_forever()
  File "c:\python\python36\lib\asyncio\base_events.py", line 422, in run_forever
    self._run_once()
  File "c:\python\python36\lib\asyncio\base_events.py", line 1432, in _run_once
    handle._run()
  File "c:\python\python36\lib\asyncio\events.py", line 145, in _run
    self._callback(*self._args)
  File "c:\python\python36\lib\site-packages\tornado\ioloop.py", line 758, in _run_callback
    ret = callback()
  File "c:\python\python36\lib\site-packages\tornado\stack_context.py", line 300, in null_wrapper
    return fn(*args, **kwargs)
  File "c:\python\python36\lib\site-packages\tornado\gen.py", line 1233, in inner
    self.run()
  File "c:\python\python36\lib\site-packages\tornado\gen.py", line 1147, in run
    yielded = self.gen.send(value)
  File "c:\python\python36\lib\site-packages\ipykernel\kernelbase.py", line 357, in process_one
    yield gen.maybe_future(dispatch(*args))
  File "c:\python\python36\lib\site-packages\tornado\gen.py", line 326, in wrapper
    yielded = next(result)
  File "c:\python\python36\lib\site-packages\ipykernel\kernelbase.py", line 267, in dispatch_shell
    yield gen.maybe_future(handler(stream, idents, msg))
  File "c:\python\python36\lib\site-packages\tornado\gen.py", line 326, in wrapper
    yielded = next(result)
  File "c:\python\python36\lib\site-packages\ipykernel\kernelbase.py", line 534, in execute_request
    user_expressions, allow_stdin,
  File "c:\python\python36\lib\site-packages\tornado\gen.py", line 326, in wrapper
    yielded = next(result)
  File "c:\python\python36\lib\site-packages\ipykernel\ipkernel.py", line 294, in do_execute
    res = shell.run_cell(code, store_history=store_history, silent=silent)
  File "c:\python\python36\lib\site-packages\ipykernel\zmqshell.py", line 536, in run_cell
    return super(ZMQInteractiveShell, self).run_cell(*args, **kwargs)
  File "c:\python\python36\lib\site-packages\IPython\core\interactiveshell.py", line 2819, in run_cell
    raw_cell, store_history, silent, shell_futures)
  File "c:\python\python36\lib\site-packages\IPython\core\interactiveshell.py", line 2845, in _run_cell
    return runner(coro)
  File "c:\python\python36\lib\site-packages\IPython\core\async_helpers.py", line 67, in _pseudo_sync_runner
    coro.send(None)
  File "c:\python\python36\lib\site-packages\IPython\core\interactiveshell.py", line 3020, in run_cell_async
    interactivity=interactivity, compiler=compiler, result=result)
  File "c:\python\python36\lib\site-packages\IPython\core\interactiveshell.py", line 3185, in run_ast_nodes
    if (yield from self.run_code(code, result)):
  File "c:\python\python36\lib\site-packages\IPython\core\interactiveshell.py", line 3267, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-131-ca1a932de4d6>", line 32, in <module>
    correct_prediction = tf.equal(tf.argmax(y,1), tf.argmax(y_,1))
  File "c:\python\python36\lib\site-packages\tensorflow\python\ops\gen_math_ops.py", line 2949, in equal
    "Equal", x=x, y=y, name=name)
  File "c:\python\python36\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "c:\python\python36\lib\site-packages\tensorflow\python\util\deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "c:\python\python36\lib\site-packages\tensorflow\python\framework\ops.py", line 3272, in create_op
    op_def=op_def)
  File "c:\python\python36\lib\site-packages\tensorflow\python\framework\ops.py", line 1768, in __init__
    self._traceback = tf_stack.extract_stack()

InvalidArgumentError (see above for traceback): Incompatible shapes: [431] vs. [433]
	 [[{{node Equal_2}} = Equal[T=DT_INT64, _device="/job:localhost/replica:0/task:0/device:CPU:0"](ArgMax_4, ArgMax_5)]]
