# Emotion Recognition in Voice Recordings
##### Joseph Golubchik (209195353), Johann Thuillier (336104120), Shlomi Wenberger (203179403)

The aim of our project is to use logistic regression to classify a person's emotional state from a recording of them speaking.  

## Dataset
The dataset we used is “The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS)”  
https://zenodo.org/record/1188976  

The database contains 24 professional actors (12 female, 12 male), vocalizing two lexically-matched statements in a neutral North American accent. Speech includes calm, happy, sad, angry, fearful, surprise, and disgust expressions, and song contains calm, happy, sad, angry, and fearful emotions. Each expression is produced at two levels of emotional intensity (normal, strong), with an additional neutral expression. All conditions are available in three modality formats: Audio-only (16bit, 48kHz .wav), Audio-Video (720p H.264, AAC 48kHz, .mp4), and Video-only (no sound). We used only the speech files and not the song files, and only the audio files and not the videos.

The speech portion contains 1440 files: 60 trials per actor x 24 actors = 1440. The labels for each file are taken from its filename, which consists of a 7-part numerical identifier (e.g., 02-01-06-01-02-01-12.mp4). These identifiers define the stimulus characteristics:

- Modality (01 = full-AV, 02 = video-only, 03 = audio-only).
- Vocal channel (01 = speech, 02 = song).
- Emotion (01 = neutral, 02 = calm, 03 = happy, 04 = sad, 05 = angry, 06 = fearful, 07 = disgust, 08 = surprised).
- Emotional intensity (01 = normal, 02 = strong). NOTE: There is no strong intensity for the 'neutral' emotion.
- Statement (01 = "Kids are talking by the door", 02 = "Dogs are sitting by the door").
- Repetition (01 = 1st repetition, 02 = 2nd repetition).
- Actor (01 to 24. Odd numbered actors are male, even numbered actors are female).
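The identifier scheme above can be decoded mechanically. The following is a minimal sketch; the helper name `parse_ravdess_name` and the dictionary keys are hypothetical and are not part of the notebook:

```python
from pathlib import Path

# Field names follow the identifier order described above (hypothetical keys).
FIELDS = ("modality", "vocal_channel", "emotion", "intensity",
          "statement", "repetition", "actor")

EMOTIONS = {1: "neutral", 2: "calm", 3: "happy", 4: "sad",
            5: "angry", 6: "fearful", 7: "disgust", 8: "surprised"}


def parse_ravdess_name(path):
    """Split a RAVDESS filename into its seven integer identifiers."""
    parts = Path(path).stem.split("-")
    if len(parts) != len(FIELDS):
        raise ValueError("expected 7 identifiers, got %d" % len(parts))
    info = dict(zip(FIELDS, (int(p) for p in parts)))
    info["emotion_name"] = EMOTIONS[info["emotion"]]
    # Odd-numbered actors are male, even-numbered actors are female.
    info["gender"] = "male" if info["actor"] % 2 == 1 else "female"
    return info


info = parse_ravdess_name("Actor_13/03-01-05-01-01-01-13.wav")
print(info["emotion_name"], info["intensity"], info["gender"])
```

Parsing by field rather than by character position keeps the label extraction independent of the folder prefix in the path.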


## Convolutional Neural Network

In [1]:
import tensorflow as tf
import librosa.display
import matplotlib.pyplot as plt
import numpy as np
import os
import random
import timeit

In [2]:
# Function to extract only the features from data_xy
def getXvalues(data_xy):
    x_values = []
    for data in data_xy:
        x_values.append(data[0])
    return x_values

# Function to extract only the labels from data_xy
def getYvalues(data_xy):
    y_values = []
    for data in data_xy:
        y_values.append(data[1])
    return y_values

# Sigmoid function
def logistic_fun(z):
    return 1/(1.0 + np.exp(-z))

In [136]:
# Loading the filenames from the folder with the audio files.
filenames = []

for i in range(1,25):
    if (i < 10):
        folderNum = "0"+str(i)
    else:
        folderNum = str(i)
    for file in os.listdir('audio/Actor_'+folderNum):
        filenames.append('Actor_'+folderNum+'/'+file)
        
# Shuffling the filenames array.
random.shuffle(filenames)

# Splitting the dataset into train and test files,
# 80% train and 20% test.
num_train = int(len(filenames)*0.8)
num_test = len(filenames) - num_train

print("Number of files =",len(filenames),",Number of actors =",int(len(filenames)/60))
print("Number of train examples =",num_train,",Number of test examples =",num_test)

Number of files = 1440 ,Number of actors = 24
Number of train examples = 1152 ,Number of test examples = 288


In [137]:
data_x_train = []
data_x_test = []
data_y_train = []
data_y_test = []

# max_pad_len = 11

start_time = timeit.default_timer()

# For each training file, extract its Mel-frequency cepstral coefficients (MFCCs),
# average them over the 13 coefficients (axis 0) to get one value per time frame,
# and append the result to the array that stores the train features - data_x_train.
# The label is a one-hot vector built from the emotional-intensity field of the path,
# filename[18:20], taken modulo 2: 01 (normal) => [0,1], 02 (strong) => [1,0].
# Example path: Actor_13/03-01-05-01-01-01-13.wav
for filename in filenames[:num_train]:
    data, sampling_rate = librosa.load("audio/" + filename, sr=22050*2, res_type='kaiser_fast', duration=2.5, offset=0.5)
    sampling_rate = np.array(sampling_rate)
    mfccs = np.mean(librosa.feature.mfcc(y=data, sr=sampling_rate, n_mfcc=13), axis=0)
    data_x_train.append(mfccs)
    label = np.zeros(2)
    label[int(filename[18:20])%2] = 1
    data_y_train.append(label)
    
    np.save('saved/' + filename[9:-3] + str(np.argmax(label)) + '.npy', mfccs)
    

# Do the same for the testing examples.
for filename in filenames[num_train:]:
    data, sampling_rate = librosa.load("audio/" + filename, sr=22050*2, res_type='kaiser_fast', duration=2.5, offset=0.5)
    sampling_rate = np.array(sampling_rate)
    mfccs = np.mean(librosa.feature.mfcc(y=data, sr=sampling_rate, n_mfcc=13), axis=0)
    data_x_test.append(mfccs)
    label = np.zeros(2)
    label[int(filename[18:20])%2] = 1
    data_y_test.append(label)
    
    np.save('saved/' + filename[9:-3] + str(np.argmax(label)) + '.npy', mfccs)
    
stop_time = timeit.default_timer()
print('Loading time:', stop_time - start_time, "Seconds")  

Loading time: 74.7750888758892 Seconds
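Averaging the MFCC matrix over axis 0 collapses the 13 coefficients and leaves one value per time frame, so the feature length depends only on the clip length. A rough check of that length, assuming librosa's default hop length of 512 samples and centered framing (neither is set explicitly in the notebook):

```python
# Assumption: with librosa's defaults (hop_length=512, center=True),
# a signal of n samples yields 1 + n // hop_length frames.
SAMPLE_RATE = 22050 * 2   # sr passed to librosa.load in the cell above
DURATION = 2.5            # seconds kept from each file
HOP_LENGTH = 512          # assumed default, not set in the notebook

n_samples = int(SAMPLE_RATE * DURATION)
n_frames = 1 + n_samples // HOP_LENGTH
print(n_samples, n_frames)
```

Under these assumptions a full 2.5-second clip gives 216 values, which agrees with the 216 features mentioned in the comments of the next cell. A recording shorter than 3 seconds in total (0.5 s offset plus 2.5 s duration) would give fewer.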


In [140]:
# We create a new array that will contain tuples where the first element is the features of the example,
# and the second element is the label of the example.
# This is necessary so we can shuffle the order of the examples after each training epoch.
data_xy_train = []
for i in range(len(data_x_train)):
#     # For all but two of our files, our mfccs extraction returns 216 features, so we don't use these two.
#     if len(data_x_train[i]) == 216:
    temp_arr = np.copy(data_x_train[i])
    temp_arr.resize(256)
    data_xy_train.append( (temp_arr, data_y_train[i]) )
    
data_xy_test = []
for i in range(len(data_x_test)):
#     # For all but two of our files, our mfccs extraction returns 216 features, so we don't use these two.
#     if len(data_x_test[i]) == 216:
    temp_arr = np.copy(data_x_test[i])
    temp_arr.resize(256)
    data_xy_test.append( (temp_arr, data_y_test[i]) )
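The cell above relies on `ndarray.resize` filling the extra entries with zeros, so that each 216-value vector becomes 256 values and can later be viewed as a 16 x 16 grid. The same step can be written with `np.pad`, which makes the zero fill explicit. The helper name `pad_to_square` is hypothetical; this is a sketch rather than the notebook's own code:

```python
import numpy as np


def pad_to_square(features, side=16):
    """Zero-pad (or truncate) a 1-D feature vector to side*side values
    and reshape it into a side x side grid."""
    features = np.asarray(features, dtype=np.float32)
    target = side * side
    if features.size >= target:
        padded = features[:target]
    else:
        padded = np.pad(features, (0, target - features.size), mode="constant")
    return padded.reshape(side, side)


grid = pad_to_square(np.ones(216))
print(grid.shape, int(grid.sum()))
```

Since 216 = 13 x 16 + 8, the real values occupy the first 13 rows and half of the 14th; the remaining cells are padding.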

In [143]:
features = len(data_xy_train[0][0])
labels = len(data_xy_train[0][1])

x = tf.placeholder(tf.float32, [None, features], name="x")
y_ = tf.placeholder(tf.float32, [None, labels], name="y_")
x_image = tf.reshape(x, [-1,16,16,1]) #if we had RGB, we would have 3 channels

f1=1
f2=16
f3=128

k1=5
k2=3
k3=1

fc1_nodes=512

W_conv1 = tf.Variable(tf.truncated_normal([k1, k1, 1, f1], stddev=0.1))
b_conv1 = tf.Variable(tf.constant(0.1, shape=[f1]))
h_conv1 = tf.nn.relu(tf.nn.conv2d(x_image, W_conv1, strides=[1, 1, 1, 1], padding='SAME') + b_conv1)
h_pool1 = tf.nn.max_pool(h_conv1, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')

W_conv2 = tf.Variable(tf.truncated_normal([k2, k2, f1, f2], stddev=0.1))
b_conv2 = tf.Variable(tf.constant(0.1, shape=[f2]))
h_conv2 = tf.nn.relu(tf.nn.conv2d(h_pool1, W_conv2, strides=[1, 1, 1, 1], padding='SAME') + b_conv2)
h_pool2 = tf.nn.max_pool(h_conv2, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')

W_conv3 = tf.Variable(tf.truncated_normal([k3, k3, f2, f3], stddev=0.1))
b_conv3 = tf.Variable(tf.constant(0.1, shape=[f3]))
h_conv3 = tf.nn.relu(tf.nn.conv2d(h_pool2, W_conv3, strides=[1, 1, 1, 1], padding='SAME') + b_conv3)
h_pool3 = tf.nn.max_pool(h_conv3, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')
h_pool3_flat = tf.reshape(h_pool3, [-1, 2*2*f3])

W_fc1 = tf.Variable(tf.truncated_normal([2 * 2 * f3, fc1_nodes], stddev=0.1))
b_fc1 = tf.Variable(tf.constant(0.1, shape=[fc1_nodes]))
h_fc1 = tf.nn.relu(tf.matmul(h_pool3_flat, W_fc1) + b_fc1)
keep_prob = tf.placeholder(tf.float32, name="keep_prob")
h_fc1_drop = tf.nn.dropout(h_fc1, keep_prob)
W_fc2 = tf.Variable(tf.truncated_normal([fc1_nodes, labels], stddev=0.1))
b_fc2 = tf.Variable(tf.constant(0.1, shape=[labels]))
y_conv=tf.nn.softmax(tf.matmul(h_fc1_drop, W_fc2) + b_fc2)

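# Cross-entropy computed from the softmax output. Note that tf.log(y_conv) is -inf
# when a predicted probability underflows to 0, which can turn the loss into NaN.
cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y_conv), reduction_indices=[1]))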
train_step = tf.train.AdamOptimizer(0.00008).minimize(cross_entropy)
correct_prediction = tf.equal(tf.argmax(y_conv,1), tf.argmax(y_,1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
sess = tf.InteractiveSession()
sess.run(tf.global_variables_initializer())

start_time = timeit.default_timer()

for i in range(20000):
    if i % 100 == 0:
        train_accuracy = accuracy.eval(feed_dict={x:getXvalues(data_xy_train), y_:getYvalues(data_xy_train), keep_prob: 1.0})
        test_accuracy = accuracy.eval(feed_dict={x:getXvalues(data_xy_test), y_:getYvalues(data_xy_test), keep_prob: 1.0})
        print("step %d, training accuracy %g, testing accuracy %g"%(i, train_accuracy, test_accuracy))
                               
    train_step.run(feed_dict={x:getXvalues(data_xy_train), y_:getYvalues(data_xy_train), keep_prob: 0.5})
    random.shuffle(data_xy_train)

    
stop_time = timeit.default_timer()
                               
print("test accuracy %g"%accuracy.eval(feed_dict={x:getXvalues(data_xy_test), y_:getYvalues(data_xy_test), keep_prob: 1.0}))
print('runtime: ', stop_time - start_time)  



step 0, training accuracy 0.462674, testing accuracy 0.472222
step 100, training accuracy 0.699653, testing accuracy 0.722222
step 200, training accuracy 0.722222, testing accuracy 0.739583
step 300, training accuracy 0.730035, testing accuracy 0.739583
step 400, training accuracy 0.741319, testing accuracy 0.746528
step 500, training accuracy 0.752604, testing accuracy 0.736111
step 600, training accuracy 0.75434, testing accuracy 0.729167
step 700, training accuracy 0.767361, testing accuracy 0.746528
step 800, training accuracy 0.77691, testing accuracy 0.75
step 900, training accuracy 0.787326, testing accuracy 0.746528
step 1000, training accuracy 0.796875, testing accuracy 0.75
step 1100, training accuracy 0.809896, testing accuracy 0.75
step 1200, training accuracy 0.81684, testing accuracy 0.739583
step 1300, training accuracy 0.826389, testing accuracy 0.729167
step 1400, training accuracy 0.835069, testing accuracy 0.725694
step 1500, training accuracy 0.847222, testing accur

step 13300, training accuracy 0.471354, testing accuracy 0.447917
step 13400, training accuracy 0.471354, testing accuracy 0.447917
step 13500, training accuracy 0.471354, testing accuracy 0.447917
step 13600, training accuracy 0.471354, testing accuracy 0.447917
step 13700, training accuracy 0.471354, testing accuracy 0.447917
step 13800, training accuracy 0.471354, testing accuracy 0.447917
step 13900, training accuracy 0.471354, testing accuracy 0.447917
step 14000, training accuracy 0.471354, testing accuracy 0.447917
step 14100, training accuracy 0.471354, testing accuracy 0.447917
step 14200, training accuracy 0.471354, testing accuracy 0.447917
step 14300, training accuracy 0.471354, testing accuracy 0.447917
step 14400, training accuracy 0.471354, testing accuracy 0.447917
step 14500, training accuracy 0.471354, testing accuracy 0.447917
step 14600, training accuracy 0.471354, testing accuracy 0.447917
step 14700, training accuracy 0.471354, testing accuracy 0.447917
step 14800
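The plateau in the log corresponds to 543 of 1152 training examples and 129 of 288 test examples, 672 in total, which is exactly the number of strong-intensity recordings (28 per actor x 24 actors). This is what one would see if the network predicted class 0 for every input. One possible cause, not confirmed by the log, is that taking the log of a softmax output that has underflowed to 0 made the loss non-finite. The NumPy sketch below illustrates the effect and a log-softmax formulation that avoids it; the function names are hypothetical:

```python
import numpy as np


def naive_cross_entropy(logits, onehot):
    """Softmax followed by log, mirroring the formulation in the cell above."""
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs = exp / exp.sum(axis=1, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.mean(-(onehot * np.log(probs)).sum(axis=1))


def stable_cross_entropy(logits, onehot):
    """Cross-entropy computed from log-softmax, so log(0) never appears."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return np.mean(-(onehot * log_probs).sum(axis=1))


# One very confident, wrong prediction: the true class has probability ~0.
logits = np.array([[1000.0, 0.0]])
onehot = np.array([[0.0, 1.0]])
print(naive_cross_entropy(logits, onehot))
print(stable_cross_entropy(logits, onehot))
```

In TensorFlow 1.x the same idea is available by passing the pre-softmax logits to `tf.nn.softmax_cross_entropy_with_logits` instead of applying `tf.log` to the softmax output.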

In [None]:
sess.close()

In [102]:
import IPython.display as ipd
import pyaudio
import wave
 
FORMAT = pyaudio.paInt16
CHANNELS = 2
RATE = 44100
CHUNK = 1024
RECORD_SECONDS = 5
WAVE_OUTPUT_FILENAME = "file.wav"
 
audio = pyaudio.PyAudio()
 
# start Recording
stream = audio.open(format=FORMAT, channels=CHANNELS,
                rate=RATE, input=True,
                frames_per_buffer=CHUNK)
print ("recording...")
frames = []
 
for i in range(0, int(RATE / CHUNK * RECORD_SECONDS)):
    data = stream.read(CHUNK)
    frames.append(data)
print ("finished recording")
 
 
# stop Recording
stream.stop_stream()
stream.close()
audio.terminate()
 
waveFile = wave.open(WAVE_OUTPUT_FILENAME, 'wb')
waveFile.setnchannels(CHANNELS)
waveFile.setsampwidth(audio.get_sample_size(FORMAT))
waveFile.setframerate(RATE)
waveFile.writeframes(b''.join(frames))
waveFile.close()

ipd.Audio(WAVE_OUTPUT_FILENAME)

recording...
finished recording


In [120]:
data, sampling_rate = librosa.load(WAVE_OUTPUT_FILENAME, sr=22050*2, res_type='kaiser_fast', duration=2.5, offset=0.5)
sampling_rate = np.array(sampling_rate)
mfccs = np.mean(librosa.feature.mfcc(y=data, sr=sampling_rate, n_mfcc=13), axis=0)

In [121]:
y_value = [[0,1]]
temp_arr = np.copy(mfccs)
temp_arr.resize(1,256)
test_pred = tf.argmax(y_conv,1).eval(feed_dict={x:temp_arr, y_:y_value, keep_prob: 1.0})
print(y_conv.eval(feed_dict={x:temp_arr, y_:y_value, keep_prob: 1.0}))

print("Prediction:",test_pred[0],",Real:",np.argmax(y_value))

[[0.9966683  0.00333173]]
Prediction: 0 ,Real: 1
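The class indices printed above are easier to read when mapped back to intensity names. Because the labels were built as intensity code modulo 2, class 0 is strong (02) and class 1 is normal (01). A small sketch with hypothetical names `INTENSITY_BY_CLASS` and `describe_prediction`:

```python
import numpy as np

# Class index = intensity code % 2, as in the label construction:
# 02 (strong) -> 0, 01 (normal) -> 1.
INTENSITY_BY_CLASS = {0: "strong", 1: "normal"}


def describe_prediction(probs):
    """Return (label, probability) for the most likely class of one example."""
    probs = np.asarray(probs).reshape(-1)
    cls = int(np.argmax(probs))
    return INTENSITY_BY_CLASS[cls], float(probs[cls])


# Softmax output printed for the recorded file above.
label, p = describe_prediction([[0.9966683, 0.00333173]])
print("%s intensity (p=%.3f)" % (label, p))
```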


In [122]:
test_pred = tf.argmax(y_conv,1).eval(feed_dict={x:getXvalues(data_xy_test), y_:getYvalues(data_xy_test), keep_prob: 1.0})
for i in range(len(data_xy_test)):
    print(i,"Prediction:",test_pred[i],",Real:",getYvalues(data_xy_test)[i],",Correct?",(test_pred[i] == np.argmax(getYvalues(data_xy_test)[i])))

0 Prediction: 0 ,Real: [1. 0.] ,Correct? True
1 Prediction: 1 ,Real: [0. 1.] ,Correct? True
2 Prediction: 0 ,Real: [1. 0.] ,Correct? True
3 Prediction: 0 ,Real: [0. 1.] ,Correct? False
4 Prediction: 0 ,Real: [1. 0.] ,Correct? True
5 Prediction: 0 ,Real: [1. 0.] ,Correct? True
6 Prediction: 1 ,Real: [0. 1.] ,Correct? True
7 Prediction: 0 ,Real: [0. 1.] ,Correct? False
8 Prediction: 0 ,Real: [1. 0.] ,Correct? True
9 Prediction: 1 ,Real: [0. 1.] ,Correct? True
10 Prediction: 1 ,Real: [0. 1.] ,Correct? True
11 Prediction: 0 ,Real: [1. 0.] ,Correct? True
12 Prediction: 1 ,Real: [1. 0.] ,Correct? False
13 Prediction: 0 ,Real: [1. 0.] ,Correct? True
14 Prediction: 0 ,Real: [0. 1.] ,Correct? False
15 Prediction: 0 ,Real: [1. 0.] ,Correct? True
16 Prediction: 1 ,Real: [0. 1.] ,Correct? True
17 Prediction: 1 ,Real: [0. 1.] ,Correct? True
18 Prediction: 1 ,Real: [0. 1.] ,Correct? True
19 Prediction: 0 ,Real: [1. 0.] ,Correct? True
20 Prediction: 1 ,Real: [1. 0.] ,Correct? False
21 Prediction: 1 ,

208 Prediction: 1 ,Real: [0. 1.] ,Correct? True
209 Prediction: 0 ,Real: [1. 0.] ,Correct? True
210 Prediction: 1 ,Real: [1. 0.] ,Correct? False
211 Prediction: 1 ,Real: [0. 1.] ,Correct? True
212 Prediction: 0 ,Real: [1. 0.] ,Correct? True
213 Prediction: 1 ,Real: [0. 1.] ,Correct? True
214 Prediction: 1 ,Real: [0. 1.] ,Correct? True
215 Prediction: 1 ,Real: [0. 1.] ,Correct? True
216 Prediction: 1 ,Real: [0. 1.] ,Correct? True
217 Prediction: 1 ,Real: [0. 1.] ,Correct? True
218 Prediction: 1 ,Real: [0. 1.] ,Correct? True
219 Prediction: 0 ,Real: [1. 0.] ,Correct? True
220 Prediction: 0 ,Real: [1. 0.] ,Correct? True
221 Prediction: 1 ,Real: [0. 1.] ,Correct? True
222 Prediction: 1 ,Real: [1. 0.] ,Correct? False
223 Prediction: 1 ,Real: [0. 1.] ,Correct? True
224 Prediction: 1 ,Real: [0. 1.] ,Correct? True
225 Prediction: 0 ,Real: [1. 0.] ,Correct? True
226 Prediction: 1 ,Real: [0. 1.] ,Correct? True
227 Prediction: 0 ,Real: [0. 1.] ,Correct? False
228 Prediction: 0 ,Real: [0. 1.] ,Cor

In [None]:
count = np.zeros(2)
for y in getYvalues(data_xy_test):
    count[np.argmax(y)] += 1
count
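Class counts alone do not show which class the errors fall on. A confusion matrix built from the same predictions and one-hot labels gives that breakdown. The sketch below uses only the first eight rows of the prediction listing as sample data, and the function name is hypothetical:

```python
import numpy as np


def confusion_matrix(predicted, onehot_labels, n_classes=2):
    """Rows index the true class, columns the predicted class."""
    true = np.argmax(np.asarray(onehot_labels), axis=1)
    matrix = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(true, predicted):
        matrix[t, p] += 1
    return matrix


# First eight rows of the prediction listing (indices 0 to 7).
predicted = [0, 1, 0, 0, 0, 0, 1, 0]
labels = [[1, 0], [0, 1], [1, 0], [0, 1], [1, 0], [1, 0], [0, 1], [0, 1]]
cm = confusion_matrix(predicted, labels)
print(cm)
print("accuracy:", np.trace(cm) / cm.sum())
```

With the notebook's variables, the same function could be called as `confusion_matrix(test_pred, getYvalues(data_xy_test))`.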