# Detecting Emotions From Audio With CNN

In this notebook I propose a very simple CNN architecture to classify audio clips by the emotion they contain.

The training pipeline works as follows:
- Audio clips are first converted into spectrograms, which are essentially 2D arrays, and downsampled to save training time
- The downsampled spectrograms are then fed to a two-layer vanilla convolutional neural network. The output is a probability for each emotion label, computed by a softmax operation.
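
The softmax step at the end of the pipeline can be sketched in plain NumPy. This is only an illustration of the operation, not the code the network uses; the logit values and the names `softmax`, `example_logits`, `example_probs` and `predicted_class` are made up for the example.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    # Subtracting the maximum does not change the result but avoids overflow.
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

# Hypothetical logits for one clip over the 8 RAVDESS emotions.
example_logits = np.array([0.5, 0.1, 1.2, -0.3, 2.0, 0.0, -1.0, 0.4])
example_probs = softmax(example_logits)

# The predicted label is the index with the highest probability.
predicted_class = int(np.argmax(example_probs))
print(example_probs.round(3), predicted_class)
```

The probabilities are all positive and sum to 1, and the largest logit always gets the largest probability.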

A convolutional neural network is chosen because it can learn both low-level and high-level image features. When an audio sample is represented as a spectrogram, it is essentially an image, and features such as prosody and intonation become visible as patterns in it. These features are useful for classifying emotions, and CNNs are well suited to learning them. Because convolution and pooling tolerate small shifts along the frequency axis, a CNN should also be reasonably robust to variation between speakers, such as differences in pitch.

The dataset here is the speech section of the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS). The audio clips cover short utterances of 2 sentences, each spoken with 8 different emotions by 24 actors (12 female, 12 male). The dataset can be found at https://zenodo.org/record/1188976#.W2R6RtVKick.

In [None]:
import os 
import glob
from pathlib import Path
import re 

audio_root_dir = Path(r'./Audio_Speech_Actors_01-24')
audio_file_pattern = Path(r'**/*.wav')

def get_emotion_label(filename):
    """
    Each of the 7356 RAVDESS files has a unique filename. The filename consists of a 7-part numerical identifier 
    (e.g., 02-01-06-01-02-01-12.mp4). These identifiers define the stimulus characteristics: 

    Filename identifiers 

    Modality (01 = full-AV, 02 = video-only, 03 = audio-only).
    Vocal channel (01 = speech, 02 = song).
    Emotion (01 = neutral, 02 = calm, 03 = happy, 04 = sad, 05 = angry, 06 = fearful, 07 = disgust, 08 = surprised).
    Emotional intensity (01 = normal, 02 = strong). NOTE: There is no strong intensity for the 'neutral' emotion.
    Statement (01 = "Kids are talking by the door", 02 = "Dogs are sitting by the door").
    Repetition (01 = 1st repetition, 02 = 2nd repetition).
    Actor (01 to 24. Odd numbered actors are male, even numbered actors are female).
    
    Here we will only use 'Emotion' as the label for our training
    
    INPUT
        filename
        
    OUTPUT
        emotion label, STARTING FROM 0 AS OPPOSED TO 1
    """
    EMOTION_LABEL_POS = 2 
    return int(re.findall(r"\d+", os.path.basename(filename))[EMOTION_LABEL_POS]) - 1 
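
The docstring above lists seven identifiers, of which only the emotion is used. As a sketch of how the remaining fields could be read from the same filename, the hypothetical helper below (`parse_ravdess_filename`, `FIELD_NAMES` and `EMOTIONS` are names invented here, not part of the notebook) returns all of them as a dictionary. It strips the file extension first so that the digit in `.mp4` is not mistaken for an eighth field.

```python
import os
import re

# Field names follow the identifier order given in the docstring.
FIELD_NAMES = ("modality", "vocal_channel", "emotion", "intensity",
               "statement", "repetition", "actor")
EMOTIONS = ("neutral", "calm", "happy", "sad",
            "angry", "fearful", "disgust", "surprised")

def parse_ravdess_filename(filename):
    """Hypothetical helper: split a RAVDESS filename into its seven named fields."""
    stem = os.path.splitext(os.path.basename(filename))[0]
    parts = re.findall(r"\d+", stem)
    if len(parts) != len(FIELD_NAMES):
        raise ValueError("expected 7 numeric fields, got {}".format(len(parts)))
    fields = dict(zip(FIELD_NAMES, (int(p) for p in parts)))
    # Emotion codes in the filename start at 1, so shift by one for the lookup.
    fields["emotion_name"] = EMOTIONS[fields["emotion"] - 1]
    # Odd-numbered actors are male, even-numbered actors are female.
    fields["actor_sex"] = "male" if fields["actor"] % 2 == 1 else "female"
    return fields

info = parse_ravdess_filename("Actor_07/03-01-01-01-01-02-07.wav")
video_info = parse_ravdess_filename("02-01-06-01-02-01-12.mp4")
print(info["emotion_name"], info["actor_sex"])
```

The extra fields would make it possible, for example, to split the data by actor instead of at random.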

Define a few util functions to compute the spectrogram from a WAV file and display the result.
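
One detail of the helpers below is easy to miss: the spectrogram is passed through `librosa.util.normalize` with its default arguments, which, as far as I understand them (`norm=np.inf`, `axis=0`), scale every column, i.e. every time frame, by its own maximum. The NumPy sketch below imitates that behaviour on a toy matrix. The function name `max_normalize_columns` and the `eps` guard are assumptions made for this illustration and are not librosa's implementation.

```python
import numpy as np

def max_normalize_columns(matrix, eps=1e-12):
    """Divide every column by its largest absolute value (L-infinity norm).

    Columns whose peak is below `eps` are left unchanged to avoid
    dividing by zero.
    """
    peaks = np.max(np.abs(matrix), axis=0, keepdims=True)
    safe_peaks = np.where(peaks < eps, 1.0, peaks)
    return matrix / safe_peaks

# Toy "spectrogram": 3 frequency bands x 3 time frames.
# Frame 0 is loud, frame 1 is quiet, frame 2 is silent.
toy = np.array([[1.0, 0.02, 0.0],
                [4.0, 0.01, 0.0],
                [2.0, 0.04, 0.0]])
normalized = max_normalize_columns(toy)
print(normalized)
```

After normalization the loud and the quiet frame both peak at 1, so loudness differences between frames are removed. That may or may not be desirable for emotion recognition.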

In [None]:
import matplotlib.pyplot as plt
from scipy import signal
from scipy.io import wavfile
import librosa
import librosa.display 
import numpy as np

# Define a function which will apply a Butterworth bandpass filter
from scipy.signal import butter, lfilter


def butter_bandpass_filter(samples, lowcut, highcut, sample_rate, order=5):
    """
    Butterworth's filter
    """
    def butter_bandpass(lowcut, highcut, sample_rate, order=5):
        nyq = 0.5 * sample_rate
        low = lowcut / nyq
        high = highcut / nyq
        b, a = butter(order, [low, high], btype='band')
        return b, a
    
    b, a = butter_bandpass(lowcut, highcut, sample_rate, order=order)
    y = lfilter(b, a, samples)
    return y

def clean_audio(samples, sample_rate, lowcut=30, highcut=3000):
    """
    return a preprocessed waveform, trimmed of leading and trailing silence and
    bandpass filtered to only contain the freq range of human speech
    
    INPUT
        samples       1D array containing amplitudes at different times
        sample_rate   sampling rate of `samples` in Hz
        lowcut        lower bound for the bandpass filter, default to 30Hz
        highcut       higher bound for the bandpass filter, default to 3000Hz
    
    OUTPUT
        filtered      1D array containing preprocessed audio information
    """
    # remove silence at the start and end of the clip
    trimmed, _ = librosa.effects.trim(samples)
    # only keep frequencies common in human speech (filter the trimmed signal)
    filtered = butter_bandpass_filter(trimmed, lowcut, highcut, sample_rate, order=5)
    return filtered

def get_melspectrogram(audio_path):
    """
    return a denoised mel spectrogram of the audio clip at the given path
    
    INPUT
        audio_path    string
    OUTPUT
        spectrogram   2D array, where axis 0 is the mel frequency band and
                      axis 1 is the time frame
    """
    # load the file passed in, not the global `audio_file_path`
    samples, sample_rate = librosa.load(audio_path)
    samples = clean_audio(samples, sample_rate)
    
    # keyword arguments work across old and new librosa versions
    melspectrogram = librosa.feature.melspectrogram(y=samples, sr=sample_rate)
    
    # max (L-infinity) normalize the energy
    return librosa.util.normalize(melspectrogram)
     
def display_spectrogram(melspectrogram):
    plt.figure(figsize=(10, 4))
    librosa.display.specshow(melspectrogram,
                             y_axis='mel', 
                             fmax=8000,
                             x_axis='time')
    
    # values are max-normalized power, not decibels
    plt.colorbar(format='%.2f')
    plt.title('melspectrogram')
    plt.show()
    
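def align_and_downsample(spectrogram, max_freq_bins=128, max_frames=150, freq_strides=1, frame_strides=1):
    # crop to a fixed size, zero-pad clips that are too short, then subsample
    cropped = spectrogram[:max_freq_bins, :max_frames]
    pad_widths = ((0, max_freq_bins - cropped.shape[0]), (0, max_frames - cropped.shape[1]))
    return np.pad(cropped, pad_widths, mode='constant')[::freq_strides, ::frame_strides]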

def duplicate_and_stack(layer, dups=3):
    # np.stack needs a sequence; passing a generator fails on recent NumPy
    return np.stack([layer for _ in range(dups)], axis=2)

Show an example of a spectrogram. This is the input to the CNN.
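
Strictly speaking, the CNN receives a cropped, three-channel copy of the spectrogram and not the raw 2D array. The sketch below reproduces the cropping step of `align_and_downsample` (with its default arguments) and the stacking done by `duplicate_and_stack` on a random stand-in array, only to show the resulting shapes. The 173-frame length is an arbitrary assumption.

```python
import numpy as np

# Stand-in for a mel spectrogram: 128 mel bands x 173 frames (random values).
rng = np.random.default_rng(0)
fake_spectrogram = rng.random((128, 173))

# Same slicing as the cropping in align_and_downsample with default arguments.
cropped = fake_spectrogram[:128:1, :150:1]

# Same idea as duplicate_and_stack: repeat the 2D array along a new last axis.
stacked = np.stack([cropped] * 3, axis=2)

print(cropped.shape, stacked.shape)
```

The three channels are identical copies. They add no information and exist only so that the input has the height x width x 3 layout that image models such as VGG16 expect.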

In [None]:
audio_file_path = 'Audio_Speech_Actors_01-24/Actor_07/03-01-01-01-01-02-07.wav'

spectrogram = get_melspectrogram(audio_file_path)
display_spectrogram(spectrogram)

spectrogram

Obtain all audio files, convert them into spectrograms, and extract the emotion labels from the file names. All eight emotion labels are kept as separate classes here; the code below does not merge them into an angry-versus-rest label.
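
For an angry-versus-rest task, the zero-based labels returned by `get_emotion_label` could be collapsed into two classes. A minimal sketch, assuming that mapping is wanted; `ANGRY` and `to_angry_vs_rest` are hypothetical names and are not used anywhere else in the notebook.

```python
import numpy as np

ANGRY = 4  # zero-based index of 'angry' (05 in the filename, minus 1)

def to_angry_vs_rest(emotion_labels):
    """Map zero-based emotion labels to 1 for 'angry' and 0 for everything else."""
    emotion_labels = np.asarray(emotion_labels)
    return (emotion_labels == ANGRY).astype(np.int64)

example_labels = [0, 4, 7, 4, 2]
binary_labels = to_angry_vs_rest(example_labels)
print(binary_labels)  # -> [0 1 0 1 0]
```

With a binary target the classes become unbalanced (one emotion against seven), which is worth keeping in mind when reading accuracy numbers.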

In [None]:
spectrograms = []
labels = []

# takes about 6-8 min on my machine
audio_files = glob.iglob(str(audio_root_dir / audio_file_pattern), recursive=True)
for counter, audio_file in enumerate(audio_files):
    labels.append(get_emotion_label(audio_file))
    
    spectrogram = get_melspectrogram(audio_file)
    spectrograms.append(duplicate_and_stack(align_and_downsample(spectrogram)))
    
    if counter % 100 == 0:
        print('Processing the {}th file: {}'.format(counter, audio_file))

In [None]:
import pandas as pd
labels_dict = dict(zip(range(8), 
                       ['neutral', 'calm', 'happy', 'sad', 'angry', 'fearful', 'disgust', 'surprised']))
df = pd.DataFrame(labels, columns=['label'])
df.replace({"label": labels_dict}, inplace=True)
df['label'].value_counts().plot(kind='bar')

In [None]:
spectrograms = np.array(spectrograms)
labels = np.array(labels)

In [None]:
spectrograms.shape

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(spectrograms, labels, test_size=0.4, random_state=0)
print('X_train.shape = {}'.format(X_train.shape))
print('y_train.shape = {}'.format(y_train.shape))

In [None]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import numpy as np
import tensorflow as tf

tf.logging.set_verbosity(tf.logging.INFO)

Define CNN architecture here. We have two conv/max pooling layers followed by a dense layer with dropout. 
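
The size of the flattened tensor that feeds the dense layer follows from the pooling arithmetic. The sketch below works it through for a 128 x 150 input, assuming the pooling layers use the default `'valid'` padding; `pooled_size` is a helper written only for this example.

```python
def pooled_size(size, pool, stride):
    """Output length of a 'valid' max pool along one axis."""
    return (size - pool) // stride + 1

height, width = 128, 150   # spectrogram size after cropping
pool, stride = 5, 5        # pool_size and strides used by both pooling layers
conv2_filters = 64

# 'same' convolutions keep height and width; only the pooling layers shrink them.
h1, w1 = pooled_size(height, pool, stride), pooled_size(width, pool, stride)
h2, w2 = pooled_size(h1, pool, stride), pooled_size(w1, pool, stride)
flat_units = h2 * w2 * conv2_filters

print((h1, w1), (h2, w2), flat_units)
```

For these sizes the result agrees with the expression `(height // strides ** 2) * (width // strides ** 2) * conv2_filters` used in the reshape, because the pool size equals the stride.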

In [None]:
def cnn_model_fn(features, labels, mode):
    """Model function for CNN."""
    _, height, width, _ = features.shape
    height, width = int(height), int(width)
    
    kernel_size=[5, 5]
    
    strides = 5
    pool_size = [5, 5]
    
    # Convolutional Layer #1
    conv1 = tf.layers.conv2d(
      inputs=features,
      filters=32,
      kernel_size=kernel_size,
      padding="same",
      activation=tf.nn.relu)

    # Pooling Layer #1
    pool1 = tf.layers.max_pooling2d(inputs=conv1, pool_size=pool_size, strides=strides)

    # Convolutional Layer #2 and Pooling Layer #2
    conv2_filters = 64
    conv2 = tf.layers.conv2d(
      inputs=pool1,
      filters=conv2_filters,
      kernel_size=kernel_size,
      padding="same",
      activation=tf.nn.relu)
    pool2 = tf.layers.max_pooling2d(inputs=conv2, pool_size=pool_size, strides=strides)

    # Dense Layer
    pool2_flat = tf.reshape(pool2, 
                            [-1, (height // (strides ** 2)) * (width // (strides ** 2)) * conv2_filters])
    dense = tf.layers.dense(inputs=pool2_flat, units=1024, activation=tf.nn.relu)
    dropout = tf.layers.dropout(
      inputs=dense, rate=0.4, training=mode == tf.estimator.ModeKeys.TRAIN)

    # Logits Layer
    # one logit per emotion class (RAVDESS has 8 emotions)
    logits = tf.layers.dense(inputs=dropout, units=8)

    predictions = {
      # Generate predictions (for PREDICT and EVAL mode)
      "classes": tf.argmax(input=logits, axis=1),
      # Add `softmax_tensor` to the graph. It is used for PREDICT and by the
      # `logging_hook`.
      "probabilities": tf.nn.softmax(logits, name="softmax_tensor")
    }

    if mode == tf.estimator.ModeKeys.PREDICT:
        return tf.estimator.EstimatorSpec(mode=mode, predictions=predictions)

    # Calculate Loss (for both TRAIN and EVAL modes)
    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)

    # Configure the Training Op (for TRAIN mode)
    if mode == tf.estimator.ModeKeys.TRAIN:
        optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001)
        train_op = optimizer.minimize(
            loss=loss,
            global_step=tf.train.get_global_step())
        return tf.estimator.EstimatorSpec(mode=mode, loss=loss, train_op=train_op)

    # Add evaluation metrics (for EVAL mode)
    eval_metric_ops = {
      "accuracy": tf.metrics.accuracy(
          labels=labels, predictions=predictions["classes"])}
    return tf.estimator.EstimatorSpec(
      mode=mode, loss=loss, eval_metric_ops=eval_metric_ops)

In [None]:
# Create the Estimator
emotion_classifier = tf.estimator.Estimator(
    model_fn=cnn_model_fn, model_dir="/tmp/audio_cnn_model")

# Set up logging for predictions
tensors_to_log = {"probabilities": "softmax_tensor"}
logging_hook = tf.train.LoggingTensorHook(
    tensors=tensors_to_log, every_n_iter=100)

Train the model.
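
It helps to translate the training settings into passes over the data. The arithmetic below assumes the speech set contains 1440 clips (24 actors x 60 clips each); that count is an assumption of this sketch and should be checked against `spectrograms.shape`. The other numbers are the ones used in the cells of this notebook.

```python
n_clips = 1440        # assumed size of the RAVDESS speech set
test_size = 0.4       # fraction held out by train_test_split
batch_size = 100
steps = 5000

n_test = round(n_clips * test_size)
n_train = n_clips - n_test

# With num_epochs=None the input function repeats the data indefinitely,
# so the number of steps alone decides how long training runs.
samples_seen = steps * batch_size
approx_epochs = samples_seen / n_train

print(n_train, n_test, round(approx_epochs, 1))
```

Under these assumptions the model sees every training clip several hundred times, so the evaluation accuracy is a better guide than the training loss.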

In [None]:
train_input_fn = tf.estimator.inputs.numpy_input_fn(
    x=X_train,
    y=y_train,
    batch_size=100,
    num_epochs=None,
    shuffle=True)
emotion_classifier.train(
    input_fn=train_input_fn,
    steps=5000,
    hooks=[logging_hook])

Evaluate the model and print results.
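
A single accuracy number hides which emotions are confused with each other. A small confusion matrix can show that, as sketched below. The labels are made up purely for illustration and are not results from the model, and `confusion_matrix` is written here by hand so the example needs nothing beyond NumPy.

```python
import numpy as np

def confusion_matrix(true_labels, predicted_labels, num_classes):
    """Count (true, predicted) pairs; rows are true classes, columns predictions."""
    matrix = np.zeros((num_classes, num_classes), dtype=np.int64)
    for t, p in zip(true_labels, predicted_labels):
        matrix[t, p] += 1
    return matrix

# Made-up labels for six clips; NOT output of the trained classifier.
y_true = np.array([0, 4, 4, 2, 7, 4])
y_pred = np.array([0, 4, 3, 2, 7, 5])

cm = confusion_matrix(y_true, y_pred, num_classes=8)

# Correct predictions sit on the diagonal.
accuracy = np.trace(cm) / cm.sum()
print(cm[4], accuracy)
```

Row 4 (angry) of the toy matrix shows one correct prediction and two clips mistaken for other emotions, which the overall accuracy alone would not reveal.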

In [None]:
eval_input_fn = tf.estimator.inputs.numpy_input_fn(
    x=X_test,
    y=y_test,
    num_epochs=1,
    shuffle=False)
eval_results = emotion_classifier.evaluate(input_fn=eval_input_fn)
print(eval_results)

In [None]:
from keras import applications

# If you are only interested in convolution filters. Note that by not
# specifying the shape of top layers, the input tensor shape is (None, None, 3),
# so you can use them for any size of images.
vgg_model = applications.VGG16(weights='imagenet', include_top=False)

# If you want to specify input tensor
from keras.layers import Input
input_tensor = Input(shape=(128, 150, 3))
vgg_model = applications.VGG16(weights='imagenet',
                               include_top=False,
                               input_tensor=input_tensor)

# To see the models' architecture and layer names, run the following
vgg_model.summary()

In [None]:
import keras
from keras.layers import Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D

num_classes = 8

# Creating dictionary that maps layer names to the layers
layer_dict = dict([(layer.name, layer) for layer in vgg_model.layers])

# Getting output tensor of the last VGG layer that we want to include
x = layer_dict['block1_pool'].output

# Stacking a new two layer neural network on top of it 
x = Flatten()(x)
x = Dense(128, activation='relu')(x)
x = Dropout(0.5)(x)
x = Dense(num_classes, activation='softmax')(x)

# Creating new model. Please note that this is NOT a Sequential() model.
from keras.models import Model
custom_model = Model(inputs=vgg_model.input, outputs=x)

# Make sure that the pre-trained bottom layers are not trainable
for layer in custom_model.layers[:3]:
    layer.trainable = False

# Do not forget to compile it
custom_model.compile(loss='categorical_crossentropy',
                     optimizer=keras.optimizers.Adadelta(),
                     metrics=['accuracy'])

In [None]:
custom_model.summary()

In [None]:
from keras import callbacks
from pathlib import Path

log_dir = str(Path('./Graph'))
tbCallBack = callbacks.TensorBoard(log_dir=log_dir, histogram_freq=0, write_graph=True, write_images=True)

In [None]:
y_train = keras.utils.to_categorical(y_train, num_classes)
print(y_train.shape)

y_test = keras.utils.to_categorical(y_test, num_classes)
print(y_test.shape)

custom_model.fit(x=X_train, 
                 y=y_train, 
                 epochs=20, 
                 verbose=1, 
                 callbacks=[tbCallBack],
                 validation_data=(X_test, y_test))
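
The cell above converts the integer labels to one-hot rows because `categorical_crossentropy` expects one column per class. A rough NumPy equivalent is sketched here only to show the resulting shape; `one_hot` is a hypothetical helper, not the Keras implementation.

```python
import numpy as np

def one_hot(int_labels, num_classes):
    """Turn integer class labels into rows with a single 1 per row."""
    int_labels = np.asarray(int_labels)
    encoded = np.zeros((int_labels.shape[0], num_classes), dtype=np.float32)
    # Set the column given by each label to 1 in the matching row.
    encoded[np.arange(int_labels.shape[0]), int_labels] = 1.0
    return encoded

example = one_hot([0, 4, 7], num_classes=8)
print(example.shape)
print(example)
```

Since the cell overwrites `y_train` and `y_test` in place, running it a second time would try to encode labels that are already one-hot. Re-running the train/test split cell first avoids that.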


In [None]:
import subprocess
tensorboard_cmd = 'tensorboard --logdir {}'.format(log_dir)
subprocess.run(tensorboard_cmd.split())

In [None]:
score = custom_model.evaluate(X_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])