<h1>Importing Libraries</h1>

First, let's import all the necessary Python libraries.

In [5]:
import pandas as pd
import numpy as np
import librosa as rosa
import os
import tensorflow.keras as keras
import statistics

<h1>Counting Audio Frames</h1>

Next, we will count the number of audio frames in each audio file in our dataset and calculate the median number of frames. The median serves as a common length, so that every file yields a feature vector of the same size. Don't worry, it will make sense later! To count frames, we extract the RMS energy, which produces exactly one value per frame, so the length of its output is the number of frames in that file.
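As a quick illustration of how the median acts as the cap, here is a minimal sketch using made-up frame counts (the numbers in `example_num_frames` are hypothetical and do not come from RAVDESS):

```python
import statistics

# Hypothetical frame counts for five audio files (illustrative values only).
example_num_frames = [198, 215, 230, 241, 260]

# The median is the middle value of the sorted list, so a few unusually
# long or short recordings do not shift it much.
cap = int(statistics.median(example_num_frames))
print(cap)  # 230

# With an even number of files, the median is the mean of the two middle
# values and can be a float, which is why it is converted with int().
even_median = statistics.median([198, 215, 230, 241])
print(even_median)  # 222.5
```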

RAVDESS files were recorded at a sampling frequency of 48 kHz. A higher sampling rate gives better audio resolution, but it also means that we need to store and process more data. That's why we will resample the audio files to 16 kHz, which preserves frequencies up to 8 kHz and is good enough for most speech tasks. We will also use a frame length of 512 samples (i.e. 32 ms at 16 kHz) and a hop length of 256 samples (i.e. 16 ms), so consecutive frames overlap by 50%.
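The frame arithmetic above can be checked with a few lines of plain Python. This is only a sketch: the helper `expected_num_frames` is a hypothetical name, and its formula assumes librosa's default centered framing (`center=True`), where the signal is padded by half a frame on each side.

```python
SAMPLE_RATE = 16000   # Hz, after resampling
FRAME_LENGTH = 512    # samples per frame
HOP_LENGTH = 256      # samples between the starts of consecutive frames

# Convert the frame and hop lengths from samples to milliseconds.
frame_ms = 1000 * FRAME_LENGTH / SAMPLE_RATE
hop_ms = 1000 * HOP_LENGTH / SAMPLE_RATE
print(frame_ms, hop_ms)  # 32.0 16.0

def expected_num_frames(num_samples, hop_length=HOP_LENGTH):
    """Frame count under the assumption of centered framing (center=True)."""
    return 1 + num_samples // hop_length

# A hypothetical 3-second clip at 16 kHz contains 48000 samples.
print(expected_num_frames(3 * SAMPLE_RATE))  # 188
```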

In [6]:
# Save directory path of RAVDESS in 'folder_path'.
folder_path = 'C:/Users/rezwa/Downloads/RAVDESS'

# Create a list of directories inside the RAVDESS directory.
folder_list = os.listdir(folder_path)

# Initialize an empty list for storing the number of audio frames in each file.
num_frames = []

# Loop over every audio file in the dataset and record its number of frames.
for foldername in folder_list:
    file_path = folder_path + '/' + foldername
    file_list = os.listdir(file_path)
    for filename in file_list:
        # Read WAV file and resample it to 16 kHz. 'rosa.core.load' returns the audio signal in 'sig' and the sampling frequency in 'fs'.
        sig, fs = rosa.core.load(file_path + '/' + filename, sr=16000)
        
        # 'rosa.feature.rms' extracts rms energies from audio frames (one per frame) and stores them into 'rms_feat'.
        rms_feat = rosa.feature.rms(y=sig, frame_length=512, hop_length=256)
        num_frames.append(rms_feat.shape[1])
    
# Calculate the median of the number of frames for all audio files. This will then be used to cap the number of frames per audio file, which in turn will be used as the number of RNN units.
median_num_frames = statistics.median(num_frames)

# Convert float to integer.
median_num_frames = int(median_num_frames)
print(median_num_frames)

<h1>Extracting Features</h1>

Now that we know how many audio frames we will process for each audio file, we can extract the features and save them to a Pandas dataframe. We will extract 26 MFCCs per frame, 7 spectral contrasts per frame, 2 polynomial coefficients per frame, and 1 RMS energy per frame.
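To make the feature count concrete, the following sketch stacks four placeholder arrays with the same shapes as the extracted features. The arrays are filled with zeros and `num_frames = 5` is an arbitrary value chosen for illustration; only the shapes matter here.

```python
import numpy as np

num_frames = 5  # hypothetical number of frames, for illustration only

# Placeholder arrays shaped (n_features, n_frames), like the extracted features.
mfcc = np.zeros((26, num_frames))      # 26 MFCCs per frame
contrast = np.zeros((7, num_frames))   # 7 spectral contrast values per frame
poly = np.zeros((2, num_frames))       # 2 polynomial coefficients per frame
rms = np.zeros((1, num_frames))        # 1 RMS energy per frame

# Stack along the feature axis: 26 + 7 + 2 + 1 = 36 features per frame.
features = np.concatenate([mfcc, contrast, poly, rms], axis=0)
print(features.shape)  # (36, 5)

# Each file becomes one flat row: 36 values per frame plus one label.
row_length = features.size + 1
print(row_length)  # 181
```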

If an audio file has fewer audio frames than our median number of frames (the cap), we will pad its feature array with zeros to match the median length. On the other hand, if an audio file has more audio frames than the median, we will remove the excess frames so that the lengths match.
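The padding and truncation rule can be sketched as a small helper. The function name `fit_to_cap` and the toy arrays are hypothetical and only illustrate the idea on a `(num_frames, num_features)` array:

```python
import numpy as np

def fit_to_cap(frames, cap):
    """Zero-pad or truncate a (num_frames, num_features) array to exactly `cap` rows."""
    num_frames = frames.shape[0]
    if num_frames < cap:
        # Pad only at the end of the frame axis; leave the feature axis untouched.
        return np.pad(frames, ((0, cap - num_frames), (0, 0)), constant_values=0)
    # Keep the first `cap` frames (a no-op when the length already matches).
    return frames[:cap]

short_clip = np.ones((3, 2))  # 3 frames, 2 features
long_clip = np.ones((6, 2))   # 6 frames, 2 features

print(fit_to_cap(short_clip, 4).shape)  # (4, 2)
print(fit_to_cap(long_clip, 4).shape)   # (4, 2)
```

Padding at the end keeps the real frames aligned at the start of every sequence, which is also what the extraction loop does.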

In [7]:
# Declare a dummy Numpy array (row vector).
result_array = np.empty([1, (36*median_num_frames)+1])

# Declare a variable to be later used in reshaping the feature array.
i = 0

# Loop for feature extraction.
for foldername in folder_list:
    file_path = folder_path + '/' + foldername
    file_list = os.listdir(file_path)
    for filename in file_list:
        # Read WAV file and resample it to 16 kHz. 'rosa.core.load' returns the audio signal in 'sig' and the sampling frequency in 'fs'.
        sig, fs = rosa.core.load(file_path + '/' + filename, sr=16000)
        
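        # Extract the four feature sets. Each call returns a 2D array of shape (n_features, n_frames): 26 MFCCs, 7 spectral contrast values, 2 polynomial coefficients, and 1 RMS energy per frame.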
        mfcc_feat = rosa.feature.mfcc(y=sig, sr=fs, n_mfcc=26, n_fft=512, hop_length=256, htk=True)
        spec_feat = rosa.feature.spectral_contrast(y=sig, sr=fs, n_fft=512, hop_length=256)
        poly_feat = rosa.feature.poly_features(y=sig, sr=fs, n_fft=512, hop_length=256)
        rms_feat = rosa.feature.rms(y=sig, frame_length=512, hop_length=256)

        # Stack the four 2D arrays along the feature axis (axis 0) into a single (36, n_frames) array called 'feat2'.
        feat0 = np.append(mfcc_feat, spec_feat, axis=0)
        feat1 = np.append(feat0, poly_feat, axis=0)
        feat2 = np.append(feat1, rms_feat, axis=0)

        # Transpose the array to flip the rows and columns. This is done so that the features become column parameters, making each row an audio frame.
        transp_feat = feat2.T

        # Note: The 'cap frame number' is basically the limit we set for the number of frames for each audio file, so that all audio files have equal lengths when processing.
        
        if transp_feat.shape[0] < median_num_frames:
            # Fewer frames than the cap: zero-pad at the end of the frame axis to reach the desired dimensions.
            # The pad widths are given per axis as (before, after) pairs, so rows (frames) are padded only at the end and columns (features) are left untouched.
            transp_feat = np.pad(transp_feat, ((0, median_num_frames-transp_feat.shape[0]), (0,0)), constant_values=0)

        elif transp_feat.shape[0] > median_num_frames:
            # More frames than the cap: keep only the first 'median_num_frames' rows (frames) and drop the rest.
            transp_feat = transp_feat[:median_num_frames]

        # If the number of frames already matches the cap, nothing needs to be done.

        # Transpose again to flip the rows and columns. This is done so that the features become row parameters, making each column an audio frame.
        transp2_feat = transp_feat.T

        # Flatten the entire 2D Numpy array into a 1D Numpy array. 'C' means row-major ordered flattening.
        # Because each row of 'transp2_feat' is one feature, the first 'median_num_frames' values of the 1D array hold the first feature across all frames, the next 'median_num_frames' values hold the second feature, and so on for all 36 features.
        feat_flatten = transp2_feat.flatten('C')

        # Save emotion label from file name.
        label = os.path.splitext(os.path.basename(file_path + '/' + filename))[0].split('-')[2]

        # Create a new Numpy array 'sample' to store features along with label.
        sample = np.insert(feat_flatten, obj=36*median_num_frames, values=label)

        result_array = np.append(result_array, sample)

        i += 1
    

# Convert 1D Numpy array to 2D array. Argument must be a Tuple. i+1 because we have i audio files plus a dummy row.
result_array = np.reshape(result_array, (i+1,-1))

# Delete first dummy row from 2D array.
result_array = np.delete(result_array, 0, 0)

# Save the feature array into a Pandas dataframe.
df = pd.DataFrame(data=result_array)
print(df.shape)

<h1>Saving Features to CSV</h1>

We can now save the Pandas dataframe containing all the features as a CSV file. For this project, we will only use data for seven emotions: neutral, happy, sad, angry, fearful, disgust, and surprised. The eighth RAVDESS emotion, calm, is dropped. Also, we will replace the integer labels that we got from the file names with string labels for better readability.
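For reference, the emotion code is the third hyphen-separated field of a RAVDESS file name. The sketch below shows how a code is read and mapped to a string label. The helper `emotion_code` and the two example file names are hypothetical; the mapping mirrors the one used in this notebook, with calm (code 2) left out.

```python
import os

# Integer codes used in RAVDESS file names, without calm (code 2).
EMOTION_NAMES = {1: "Neutral", 3: "Happy", 4: "Sad", 5: "Angry",
                 6: "Fearful", 7: "Disgust", 8: "Surprised"}

def emotion_code(filename):
    """Return the integer emotion code from a RAVDESS-style file name."""
    stem = os.path.splitext(os.path.basename(filename))[0]
    return int(stem.split('-')[2])

# Hypothetical file names following the RAVDESS naming pattern.
code = emotion_code("03-01-06-01-02-01-12.wav")
print(code, EMOTION_NAMES.get(code))  # 6 Fearful

# Calm recordings have no entry in the mapping, so .get() returns None.
calm_label = EMOTION_NAMES.get(emotion_code("03-01-02-01-01-01-01.wav"))
print(calm_label)  # None
```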

In [8]:
# Label only the last (target) column.
df = df.rename({36*median_num_frames: "Emotion"}, axis='columns')
# Delete calm emotion data.
df.drop(df[df['Emotion'] == 2.0].index, inplace = True)
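# Rename integer labels with string labels. Assign the result back instead of using inplace=True on the column, which newer pandas versions may apply to a temporary copy.
df['Emotion'] = df['Emotion'].replace({1.0: "Neutral", 3.0: "Happy", 4.0: "Sad", 5.0: "Angry", 6.0: "Fearful", 7.0: "Disgust", 8.0: "Surprised"})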
# Reset row (audio files) indexing.
df = df.reset_index(drop=True)

# Save as CSV file.
df.to_csv("C:/Users/rezwa/Documents/RAVDESS_Librosa_RNN.csv")
print("Done!")