# RAVDESS Data Preparation Model 2

## Choice of Feature Representation

For this model, I have decided to stop using mel-spectrograms as the feature representation. Instead, I will explore MFCCs (Mel-Frequency Cepstral Coefficients), as they might be more applicable to machine learning and classification tasks.

MFCCs are derived from mel-spectrograms rather than being a variant of them: a discrete cosine transform is applied to the log-mel energies of each frame, and only the first few coefficients are kept. The result is a compact representation that mainly captures the spectral envelope of the signal (its timbre), while fine pitch detail is largely smoothed out.

By leveraging MFCCs, I aim to improve the performance of my machine learning models, since my goal is to reach 80% (and beyond!).
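
To make the link between the two representations concrete, here is a minimal NumPy-only sketch of the final step of MFCC extraction: a type-II DCT along the mel axis of a log-mel spectrogram, truncated to the first coefficients. The random `log_mel` matrix and the helper `dct_ii_ortho` are stand-ins invented for this illustration; in this notebook the whole pipeline is handled by `librosa.feature.mfcc`.

```python
import numpy as np

def dct_ii_ortho(x):
    """Orthonormal type-II DCT along axis 0 (hypothetical helper, for illustration only)."""
    n = x.shape[0]
    k = np.arange(n)[:, None]   # output coefficient index
    m = np.arange(n)[None, :]   # input (mel band) index
    basis = np.sqrt(2.0 / n) * np.cos(np.pi * (m + 0.5) * k / n)
    basis[0] /= np.sqrt(2.0)    # rescale the first row so the basis is orthonormal
    return basis @ x

rng = np.random.default_rng(0)
n_mels, n_frames, n_mfcc = 128, 50, 40

# Stand-in for a log-mel spectrogram of shape (n_mels, n_frames).
log_mel = rng.normal(size=(n_mels, n_frames))

# Keep only the first n_mfcc coefficients of each frame.
mfcc_like = dct_ii_ortho(log_mel)[:n_mfcc]
print(mfcc_like.shape)  # (40, 50)
```

Truncating to the first coefficients is what makes the representation compact: 128 mel bands per frame become 40 numbers that describe the coarse shape of the spectrum.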

## Processing MFCCs

In contrast to our previous approach of feeding the 2D mel-spectrogram image directly to the model, we will process the MFCCs differently in this model.

Specifically, we will compute the mean across the time axis of the MFCCs. This approach serves two purposes: dimension reduction and noise reduction. By taking the mean, we transform the 2D MFCC representation into a 1D feature vector, simplifying the input for the model. Additionally, averaging the MFCCs helps mitigate the impact of short-term variations and noise.
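
The averaging step can be sketched with NumPy alone. The two random matrices below are stand-ins for MFCC matrices of shape `(n_mfcc, n_frames)`; their frame counts are arbitrary values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_mfcc = 40

# Two stand-in MFCC matrices of shape (n_mfcc, n_frames) with different durations.
short_clip = rng.normal(size=(n_mfcc, 120))
long_clip = rng.normal(size=(n_mfcc, 200))

# Average over the time (frame) axis: one value per coefficient.
features = np.stack([m.mean(axis=1) for m in (short_clip, long_clip)])
print(features.shape)  # (2, 40)

# Transposing first and averaging over axis 0 gives the same result.
assert np.allclose(short_clip.T.mean(axis=0), short_clip.mean(axis=1))
```

A useful side effect is that clips of different durations all end up as vectors of the same length, so no padding is needed before stacking them into one array.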

While this aggregation technique discards some audio information (the order of events within a clip is lost), it can be seen as an opportunity for the model to generalize better. To account for this, we may consider increasing the number of epochs during training. Fortunately, with this new approach the dataset is lightweight compared to the one used for the previous models, so this adjustment should not place a significant burden on computational resources.


In [None]:
import matplotlib.pyplot as plt  # import matplotlib BEFORE librosa, otherwise matplotlib may raise errors
import os
import librosa
import librosa.display
import re
from sklearn.preprocessing import OneHotEncoder
import numpy as np
import pandas as pd

audio_dir = "../../RAVDESS_dataset/"

""" 
Modality            (01 = full-AV, 02 = video-only, 03 = audio-only).
Vocal channel       (01 = speech, 02 = song).
Emotion             (01 = neutral, 02 = calm, 03 = happy, 04 = sad, 05 = angry, 06 = fearful, 07 = disgust, 08 = surprised).
Emotional intensity (01 = normal, 02 = strong).     
Statement           (01 = "Kids are talking by the door", 02 = "Dogs are sitting by the door").
Repetition          (01 = 1st repetition, 02 = 2nd repetition).
Actor               (01 to 24. Odd numbered actors are male, even numbered actors are female).
"""

paths=[]
emotion=[]
soundwaves=[]
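# walk the dataset directory and load every matching file
for root, dirs, files in os.walk(audio_dir):
    for file in files:
        fields = file.split("-")

        # skip anything that is not a RAVDESS-style .wav file (e.g. hidden system files)
        if not file.endswith(".wav") or len(fields) != 7:
            continue

        # we only want the files with strong emotions
        if fields[3] != "02":
            continue

        paths.append(file)
        # NOTE: 'surprised' (08) is relabelled as '07', which merges it with 'disgust'
        if fields[2] == '08':
            emotion.append('07')
        else:
            emotion.append(fields[2])
        y, sr = librosa.load(os.path.join(root, file), sr=22050, mono=True)
        soundwaves.append(y)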
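print("data extracted")
librosa.display.waveshow(soundwaves[2], sr=22050)
paths = np.array(paths)
emotion = np.array(emotion)
# soundwaves stays a Python list: the clips can differ in length, so np.array()
# would build a ragged array (an error in recent NumPy versions)
print("finished !")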
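
The filename fields listed in the cell above can be parsed into named values, which also makes it easy to skip stray files that do not follow the naming scheme. This is a minimal sketch: `FIELDS` and `parse_ravdess_name` are hypothetical names introduced here, and the filename is an illustrative example of the seven-field pattern.

```python
FIELDS = ("modality", "vocal_channel", "emotion", "intensity",
          "statement", "repetition", "actor")

def parse_ravdess_name(filename):
    """Return the seven fields of a RAVDESS-style name, or None if it does not match."""
    stem, _, ext = filename.rpartition(".")
    parts = stem.split("-")
    if ext.lower() != "wav" or len(parts) != len(FIELDS):
        return None
    return dict(zip(FIELDS, parts))

info = parse_ravdess_name("03-01-06-02-01-01-12.wav")
print(info["emotion"], info["intensity"])  # 06 02
print(parse_ravdess_name(".DS_Store"))     # None
```

Returning `None` for non-matching names avoids the `IndexError` that indexing `file.split("-")` directly would raise on a file without enough dash-separated fields.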

In [None]:
# generate a time-averaged MFCC vector for each element of soundwaves
n_mfcc = 40
sr = 22050
mfcc_array = []

for soundwave in soundwaves:
    # librosa returns shape (n_mfcc, n_frames); average over the frame (time) axis
    mfccs = np.mean(librosa.feature.mfcc(y=soundwave, sr=sr, n_mfcc=n_mfcc), axis=1)
    mfcc_array.append(mfccs)


In [None]:
# save the labels to a .npy file as unicode strings (useful if we later use non-Latin characters)
os.makedirs('processing_dataset', exist_ok=True)  # np.save does not create missing folders
np.save('processing_dataset/labels.npy', emotion)
print("labels saved!")


In [None]:
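# save the MFCC features in final_dataset/
import os
import numpy as np

mfccs = np.array(mfcc_array)  # shape: (n_clips, n_mfcc)
os.makedirs("final_dataset", exist_ok=True)  # np.save does not create missing folders
np.save("final_dataset/mfccs.npy", mfccs)
print("mfccs saved!")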
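
Before moving on to training, it can be worth reloading the saved arrays and checking that features and labels still line up. The sketch below uses a temporary directory and made-up stand-in arrays (`mfccs_demo`, `labels_demo`) rather than the real files, so the shapes are assumptions chosen to mirror what this notebook produces.

```python
import os
import tempfile
import numpy as np

# Stand-in data with the shapes this notebook would produce for 10 clips.
mfccs_demo = np.zeros((10, 40), dtype=np.float32)
labels_demo = np.array(["02", "03", "04", "05", "06", "07", "07", "02", "03", "04"])

with tempfile.TemporaryDirectory() as tmp:
    np.save(os.path.join(tmp, "mfccs.npy"), mfccs_demo)
    np.save(os.path.join(tmp, "labels.npy"), labels_demo)

    X = np.load(os.path.join(tmp, "mfccs.npy"))
    y = np.load(os.path.join(tmp, "labels.npy"))

# One feature row per label; labels come back as unicode strings (dtype kind "U").
assert X.shape[0] == y.shape[0]
print(X.shape, y.dtype.kind)  # (10, 40) U
```

With the real files, the same check applies after loading `final_dataset/mfccs.npy` and `processing_dataset/labels.npy`.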
