# The purpose of this Jupyter notebook is to create labels for our dataset.
I have created binary (0/1) labels for 10 categories: five emotions (angry, sad, fearful, calm, happy), each for male and female speakers.

In [1]:
import pandas as pd
import glob

Loading the file paths of the RAVDESS dataset. The directory used below holds all the .wav files from the dataset in a single folder, not split into per-actor subfolders.

In [2]:
files = glob.glob(r"D:\Kaggle\datasets\Audio_Speech_Actors_01-24\RAVDESS/*.wav")
print(len(files))   #get the total number of files in the dataset

1440


Initializing the label arrays as lists of zeros, one entry per file in the dataset. The `fname` list will hold the bare file names.

In [3]:
male_angry = [0]*len(files)
male_sad = [0]*len(files)
male_fearful = [0]*len(files)
male_calm = [0]*len(files)
male_happy = [0]*len(files)

female_angry = [0]*len(files)
female_sad = [0]*len(files)
female_fearful = [0]*len(files)
female_calm = [0]*len(files)
female_happy = [0]*len(files)

fname = []

Assigning the truth value 1 to the matching list element for each .wav file, following the naming convention documented by the RAVDESS dataset creators. I split each filename on hyphens to get 'tags' and used these tags to determine the labels.
* The filename consists of a 7-part numerical identifier (e.g., 03-01-06-01-02-01-12.wav)
* Emotion (01 = neutral, 02 = calm, 03 = happy, 04 = sad, 05 = angry, 06 = fearful, 07 = disgust, 08 = surprised).  [3rd tag]
* Actor (01 to 24. Odd numbered actors are male, even numbered actors are female). [7th tag]
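
The naming convention above can be sketched as a small helper. This is only an illustration, not the code used in this notebook: the function name `parse_ravdess_name` and the `EMOTIONS` dictionary are hypothetical, and the code-to-emotion mapping is copied from the list above.

```python
# Emotion codes as listed above (3rd tag of the filename).
EMOTIONS = {
    1: "neutral", 2: "calm", 3: "happy", 4: "sad",
    5: "angry", 6: "fearful", 7: "disgust", 8: "surprised",
}

def parse_ravdess_name(path):
    """Return (emotion, gender) for a RAVDESS file path or bare filename."""
    name = path.replace("\\", "/").rsplit("/", 1)[-1]  # drop any directory part
    stem = name.rsplit(".", 1)[0]                      # drop the ".wav" extension
    tags = stem.split("-")
    if len(tags) != 7:
        raise ValueError(f"expected 7 tags, got {len(tags)}: {name!r}")
    emotion = EMOTIONS[int(tags[2])]                   # 3rd tag
    gender = "female" if int(tags[6]) % 2 == 0 else "male"  # 7th tag
    return emotion, gender

print(parse_ravdess_name("03-01-06-01-02-01-12.wav"))  # ('fearful', 'female')
```

Normalising the separators by hand keeps the helper working on both Windows-style and POSIX-style paths.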

In [4]:
import os

for i, file in enumerate(files):
    name = os.path.basename(file)                 # e.g. "03-01-06-01-02-01-12.wav"
    fname.append(name)
    tags = os.path.splitext(name)[0].split("-")   # the 7 numeric tags
    emotion = int(tags[2])                        # 3rd tag: emotion code
    is_female = int(tags[6]) % 2 == 0             # 7th tag: even actor number means female

    # Emotions 1 (neutral), 7 (disgust) and 8 (surprised) get no label.
    if emotion == 2 and is_female:
        female_calm[i] = 1
    elif emotion == 3 and is_female:
        female_happy[i] = 1
    elif emotion == 4 and is_female:
        female_sad[i] = 1
    elif emotion == 5 and is_female:
        female_angry[i] = 1
    elif emotion == 6 and is_female:
        female_fearful[i] = 1

    elif emotion == 2 and not is_female:
        male_calm[i] = 1
    elif emotion == 3 and not is_female:
        male_happy[i] = 1
    elif emotion == 4 and not is_female:
        male_sad[i] = 1
    elif emotion == 5 and not is_female:
        male_angry[i] = 1
    elif emotion == 6 and not is_female:
        male_fearful[i] = 1

Creating a dataframe with the file names and their labels. Since I have not used every emotion in the labeling (neutral, disgust and surprised are left out), some files have no 1 in any of the label columns. This is a problem for the model: an all-zero row says the file belongs to none of the categories, which is ambiguous and can lead the model to learn incorrect features.

In [5]:
df = pd.DataFrame({
    "fname": fname,
    "male_angry": male_angry,
    "male_happy": male_happy,
    "male_calm": male_calm,
    "male_fearful": male_fearful,
    "male_sad": male_sad,
    "female_angry": female_angry,
    "female_happy": female_happy,
    "female_calm": female_calm,
    "female_fearful": female_fearful,
    "female_sad": female_sad,
})

In [6]:
df.head()

,fname,male_angry,male_happy,male_calm,male_fearful,male_sad,female_angry,female_happy,female_calm,female_fearful,female_sad
0,03-01-01-01-01-01-01.wav,0,0,0,0,0,0,0,0,0,0
1,03-01-01-01-01-01-02.wav,0,0,0,0,0,0,0,0,0,0
2,03-01-01-01-01-01-03.wav,0,0,0,0,0,0,0,0,0,0
3,03-01-01-01-01-01-04.wav,0,0,0,0,0,0,0,0,0,0
4,03-01-01-01-01-01-05.wav,0,0,0,0,0,0,0,0,0,0
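
Rows like the ones shown above, where every label is 0, can be counted before deciding what to do with them. A minimal sketch on a toy frame: the two-row `toy` DataFrame and the name `unlabeled` are made up for illustration, and with the real data one would use `df` instead.

```python
import pandas as pd

# Toy stand-in for df: the first file has no label, the second has one.
toy = pd.DataFrame({
    "fname": ["03-01-01-01-01-01-01.wav", "03-01-02-01-01-01-01.wav"],
    "male_calm": [0, 1],
    "female_calm": [0, 0],
})

label_cols = toy.columns.drop("fname")          # every column except the file name
unlabeled = toy[label_cols].sum(axis=1) == 0    # True where all labels are 0

print(int(unlabeled.sum()))                     # 1
print(toy.loc[unlabeled, "fname"].tolist())     # ['03-01-01-01-01-01-01.wav']
```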


To resolve the ambiguity of having all labels as zeros for a file, I removed all such files.

In [7]:
# Keep only the rows that have at least one label set to 1.
label_cols = df.columns[1:]              # every column except "fname"
cd = df[df[label_cols].any(axis=1)]
print(len(cd))
cd = cd.reset_index(drop = True)

960
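
The count of 960 agrees with the structure of the speech portion of RAVDESS, assuming the usual breakdown of 60 recordings per actor: 4 neutral ones and 8 for each of the other seven emotions. That breakdown is not stated in this notebook, so treat it as an assumption; the variable names below are illustrative.

```python
actors = 24
neutral_per_actor = 4    # assumed: neutral is recorded at a single intensity
other_per_actor = 8      # assumed: each of the 7 remaining emotions

total = actors * (neutral_per_actor + 7 * other_per_actor)
kept = actors * 5 * other_per_actor    # calm, happy, sad, angry, fearful
dropped = total - kept                 # neutral, disgust, surprised

print(total, kept, dropped)            # 1440 960 480
```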


Final dataframe (only the first 10 of the 960 rows are shown).

In [8]:
cd

,fname,male_angry,male_happy,male_calm,male_fearful,male_sad,female_angry,female_happy,female_calm,female_fearful,female_sad
0,03-01-02-01-01-01-01.wav,0,0,1,0,0,0,0,0,0,0
1,03-01-02-01-01-01-02.wav,0,0,0,0,0,0,0,1,0,0
2,03-01-02-01-01-01-03.wav,0,0,1,0,0,0,0,0,0,0
3,03-01-02-01-01-01-04.wav,0,0,0,0,0,0,0,1,0,0
4,03-01-02-01-01-01-05.wav,0,0,1,0,0,0,0,0,0,0
5,03-01-02-01-01-01-06.wav,0,0,0,0,0,0,0,1,0,0
6,03-01-02-01-01-01-07.wav,0,0,1,0,0,0,0,0,0,0
7,03-01-02-01-01-01-08.wav,0,0,0,0,0,0,0,1,0,0
8,03-01-02-01-01-01-09.wav,0,0,1,0,0,0,0,0,0,0
9,03-01-02-01-01-01-10.wav,0,0,0,0,0,0,0,1,0,0


Exporting the dataframe as a CSV file. The call is commented out here; uncomment it to write the file.

In [9]:
#cd.to_csv(r"D:\Kaggle\datasets\Audio_Speech_Actors_01-24\RAVDESS\dataset.csv",index = False,header = True)
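
The export can be sanity-checked with a quick round trip. The sketch below writes a one-row toy frame to a temporary directory and reads it back; the `toy` frame and `out_path` are hypothetical stand-ins for `cd` and the real output path. Passing `index=False`, as in the call above, keeps the row index out of the file, so no extra unnamed column appears when the CSV is loaded again.

```python
import os
import tempfile

import pandas as pd

# One-row stand-in for the final dataframe.
toy = pd.DataFrame({"fname": ["03-01-02-01-01-01-01.wav"], "male_calm": [1]})

with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "dataset.csv")   # hypothetical output location
    toy.to_csv(out_path, index=False, header=True)
    back = pd.read_csv(out_path)

print(list(back.columns))   # ['fname', 'male_calm']
```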