In [46]:
import numpy as np 
import pandas as pd
import os 
import sys 
import torch 
import torchaudio 
import matplotlib.pyplot as plt 
%matplotlib inline

import IPython.display as ipd


# Dataset Exploratory Analysis

The RAVDESS (Ryerson Audio-Visual Database of Emotional Speech and Song) is a widely used dataset for emotion classification from recorded speech because of its high and consistent audio quality. The dataset can be found at https://smartlaboratory.org/ravdess/, and more information is available in the official citation from its creators:

Livingstone SR, Russo FA (2018) The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English. PLoS ONE 13(5): e0196391. https://doi.org/10.1371/journal.pone.0196391.

RAVDESS contains both audio and video, but for this project I will use and discuss only the audio-only speech portion of the dataset. It contains audio from 24 actors (12 male, 12 female), each speaking 2 similar sentences in a neutral North American accent. Each statement is spoken with 8 different expressions (neutral, calm, happy, sad, angry, fearful, disgust, and surprised). Every expression except neutral is performed at 2 levels of emotional intensity (normal, strong); neutral is recorded at normal intensity only. All audio recordings have a sample rate of 48 kHz and a bit depth of 16 bits. There are 1440 audio files in total (24 actors × 60 trials per actor).
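The figure of 60 trials per actor can be checked from the design itself. The short sketch below is only arithmetic; it assumes the two repetitions per statement listed in the filename scheme, and the variable names are illustrative.

```python
# Design of the audio-only speech portion, as described in the text.
n_actors = 24
n_statements = 2
n_repetitions = 2          # from the "Repetition" field of the filename
n_intensities = 2          # normal, strong
n_non_neutral_emotions = 7 # calm, happy, sad, angry, fearful, disgust, surprised

# Non-neutral emotions are recorded at both intensities.
trials_non_neutral = (n_non_neutral_emotions * n_intensities
                      * n_statements * n_repetitions)

# Neutral is recorded at normal intensity only.
trials_neutral = 1 * n_statements * n_repetitions

trials_per_actor = trials_non_neutral + trials_neutral
total_files = n_actors * trials_per_actor

# Files per emotion across all actors.
files_per_non_neutral_emotion = (n_actors * n_intensities
                                 * n_statements * n_repetitions)
files_neutral = n_actors * trials_neutral

print(trials_per_actor, total_files)                  # 60 1440
print(files_per_non_neutral_emotion, files_neutral)   # 192 96
```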


RAVDESS does not come with a metadata table describing the recordings; instead, the filenames themselves carry all the information. Each filename is a 7-part numerical identifier (e.g. 03-01-04-01-01-02-12.wav). The parts represent the following:

    1.) Modality (01 = full-AV, 02 = video-only, 03 = audio-only).
    2.) Vocal channel (01 = speech, 02 = song).
    3.) Emotion (01 = neutral, 02 = calm, 03 = happy, 04 = sad, 05 = angry, 06 = fearful, 07 = disgust, 08 = surprised).
    4.) Emotional intensity (01 = normal, 02 = strong). NOTE: There is no strong intensity for the ‘neutral’ emotion.
    5.) Statement (01 = “Kids are talking by the door”, 02 = “Dogs are sitting by the door”).
    6.) Repetition (01 = 1st repetition, 02 = 2nd repetition).
    7.) Actor (01 to 24. Odd numbered actors are male, even numbered actors are female).

So, for example, the file 03-01-04-01-01-02-12.wav contains the following metadata:

    1.) Audio-only (03)
    2.) Speech (01)
    3.) Sad (04)
    4.) Normal Intensity (01)
    5.) "Kids are talking by the door" (01)
    6.) Second Repetition (02)
    7.) Actor 12, female (12)
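The decoding above can be sketched as a small function. This is a minimal illustration rather than part of the dataset's tooling: the name `parse_ravdess_filename` and the lookup tables are hypothetical, built only from the field descriptions listed earlier.

```python
# Hypothetical lookup tables transcribed from the field descriptions above.
MODALITY = {1: "full-AV", 2: "video-only", 3: "audio-only"}
VOCAL_CHANNEL = {1: "speech", 2: "song"}
EMOTION = {1: "neutral", 2: "calm", 3: "happy", 4: "sad",
           5: "angry", 6: "fearful", 7: "disgust", 8: "surprised"}
INTENSITY = {1: "normal", 2: "strong"}
STATEMENT = {1: "Kids are talking by the door",
             2: "Dogs are sitting by the door"}


def parse_ravdess_filename(filename):
    """Decode a RAVDESS filename into a dict of readable metadata."""
    stem = filename.rsplit(".", 1)[0]            # drop the extension
    fields = [int(x) for x in stem.split("-")]   # seven numeric fields
    if len(fields) != 7:
        raise ValueError(f"expected 7 fields, got {len(fields)}: {filename!r}")
    modality, channel, emotion, intensity, statement, repetition, actor = fields
    return {
        "modality": MODALITY[modality],
        "vocal_channel": VOCAL_CHANNEL[channel],
        "emotion": EMOTION[emotion],
        "intensity": INTENSITY[intensity],
        "statement": STATEMENT[statement],
        "repetition": repetition,
        "actor": actor,
        # Odd-numbered actors are male, even-numbered actors are female.
        "gender": "male" if actor % 2 == 1 else "female",
    }


example = parse_ravdess_filename("03-01-04-01-01-02-12.wav")
print(example)
```

Keeping the mappings in dictionaries means an unexpected code raises a `KeyError` instead of silently falling into an `else` branch.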

# Creating Metadata table

To make the audio data easier to work with, I will create a pandas DataFrame that contains the file path of each audio file linked to its emotion, which will be our target variable. The audio files are separated into one folder per actor.

In [60]:
audio_path = '/Users/stephen/Emotion_Dectection/data/RAVDESS/Audio_Speech_Actors_01-24/'

dir_list = os.listdir(audio_path)
dir_list.sort()

emotion = []
path = []
intensity = []
statement = []
actor = []
gender = []
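for actor_dir in dir_list:
    # Each actor folder holds that actor's audio files
    for f in os.listdir(os.path.join(audio_path, actor_dir)):
        # Fields: modality-channel-emotion-intensity-statement-repetition-actor
        part = f.split('.')[0].split('-')
        emotion.append(int(part[2]))
        path.append(os.path.join(audio_path, actor_dir, f))

        # Odd-numbered actors are male, even-numbered actors are female
        # (gender is collected here but not added to the table below)
        act = int(part[6])
        actor.append(act)
        gender.append("female" if act % 2 == 0 else "male")

        # 01 = normal, 02 = strong (labelled "intense" in this table)
        intensity.append("normal" if int(part[3]) == 1 else "intense")

        # 01 = "Kids are talking by the door", 02 = "Dogs are sitting by the door"
        if int(part[4]) == 1:
            statement.append("Kids are talking by the door")
        else:
            statement.append("Dogs are sitting by the door")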


# dataframe for emotion of files
emotion_df = pd.DataFrame(emotion, columns=['Emotions'])

# dataframe for emotional intensity 
intensity_df = pd.DataFrame(intensity, columns=['Intensity'])

# dataframe for statements
statement_df = pd.DataFrame(statement, columns=['Statement'])

# dataframe for actor number 
actor_df = pd.DataFrame(actor, columns=['actor'])

# dataframe for path of files.
path_df = pd.DataFrame(path, columns=['Path'])
Ravdess_df = pd.concat([emotion_df, path_df, intensity_df, statement_df, actor_df], axis=1)

# changing integers to actual emotions (assign back rather than using inplace on a column).
Ravdess_df['Emotions'] = Ravdess_df['Emotions'].replace({1:'neutral', 2:'calm', 3:'happy', 4:'sad', 5:'angry', 6:'fear', 7:'disgust', 8:'surprise'})
Ravdess_df.head()

| | Emotions | Path | Intensity | Statement | actor |
|---|---|---|---|---|---|
| 0 | surprise | /Users/stephen/Emotion_Dectection/data/RAVDESS... | intense | Dogs are sitting by the door | 1 |
| 1 | surprise | /Users/stephen/Emotion_Dectection/data/RAVDESS... | normal | Kids are talking by the door | 1 |
| 2 | angry | /Users/stephen/Emotion_Dectection/data/RAVDESS... | normal | Dogs are sitting by the door | 1 |
| 3 | fear | /Users/stephen/Emotion_Dectection/data/RAVDESS... | normal | Dogs are sitting by the door | 1 |
| 4 | fear | /Users/stephen/Emotion_Dectection/data/RAVDESS... | intense | Kids are talking by the door | 1 |
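For comparison, the same table could be built in one pass from a list of records. This is a sketch of an alternative, not the method used in this notebook: `build_metadata_table` is a hypothetical helper, and it assumes the actor folders are named `Actor_*` and that only `.wav` files should be read. It also keeps the gender column, which the table above leaves out.

```python
from pathlib import Path

import pandas as pd

# Same emotion labels as the table above.
EMOTIONS = {1: "neutral", 2: "calm", 3: "happy", 4: "sad",
            5: "angry", 6: "fear", 7: "disgust", 8: "surprise"}

COLUMNS = ["Emotions", "Path", "Intensity", "Statement", "actor", "gender"]


def build_metadata_table(root):
    """Return one row per .wav file under root/Actor_*/ (hypothetical helper)."""
    records = []
    # Globbing for *.wav skips stray files such as .DS_Store.
    for wav in sorted(Path(root).glob("Actor_*/*.wav")):
        # Fields: modality-channel-emotion-intensity-statement-repetition-actor
        parts = [int(p) for p in wav.stem.split("-")]
        actor = parts[6]
        records.append({
            "Emotions": EMOTIONS[parts[2]],
            "Path": str(wav),
            "Intensity": "normal" if parts[3] == 1 else "intense",
            "Statement": ("Kids are talking by the door" if parts[4] == 1
                          else "Dogs are sitting by the door"),
            "actor": actor,
            "gender": "female" if actor % 2 == 0 else "male",
        })
    return pd.DataFrame(records, columns=COLUMNS)
```

With the data in place, `build_metadata_table(audio_path)` would return a frame with the same first five columns as `Ravdess_df`, sorted by path.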


In [67]:
# Let's take a look at our target variable
Ravdess_df['Emotions'].value_counts()

surprise    192
angry       192
fear        192
disgust     192
sad         192
happy       192
calm        192
neutral      96
Name: Emotions, dtype: int64

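The dataset is balanced except for the "neutral" emotion, which has half as many samples as the others (96 vs. 192) because it is not recorded at strong intensity. This doesn't seem like it will be a problem, so we'll leave it as is for now.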
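Should the imbalance turn out to matter later, one common option is to weight each class inversely to its frequency. The sketch below is an assumption about a possible follow-up, not something this notebook does; it uses the counts printed above and the usual `total / (n_classes * count)` heuristic.

```python
# Class counts copied from the value_counts() output above.
counts = {
    "surprise": 192, "angry": 192, "fear": 192, "disgust": 192,
    "sad": 192, "happy": 192, "calm": 192, "neutral": 96,
}

total = sum(counts.values())   # 1440 files
n_classes = len(counts)        # 8 emotions

# "Balanced" heuristic: a class with the average count gets weight 1.0,
# rarer classes get proportionally larger weights.
class_weights = {label: total / (n_classes * n) for label, n in counts.items()}

for label, weight in class_weights.items():
    print(f"{label:<9}{weight:.4f}")
```

Here "neutral" gets a weight of 1.875 and every other class 0.9375. Weights like these could be passed to a loss that accepts per-class weights, for example the `weight` argument of `torch.nn.CrossEntropyLoss`, provided they are ordered to match the class indices.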

# Visualizing the audio data