<a href="https://colab.research.google.com/github/jackylmw/Week-2_Data_Preprocessing/blob/Assignment2_JackyLam/JackyLam_Assignment2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Import drive for subsequent applications

In [26]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Importing Required libraries

In [15]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import librosa.display
import soundfile

# Merge EmoDB and RAVDESS

The EMODB dataset includes recordings from ten professional voice artists, equally divided between males and females.  Each artist in this dataset expresses seven different emotions.  Similarly, the RAVDESS dataset contains recordings from 24 professional actors, showcasing eight distinct emotional expressions.  Although both datasets include similar types of recordings, EmoDB and RAVDESS each use a completely different system for naming their files.  The challenge I encountered was to seamlessly integrate the EMODB recordings into the RAVDESS collection, for which I adjusted the naming scheme of the EMODB files to align with RAVDESS's system.


The actor's ID in EmoDB dataset will be represented using the representation in the RAVDESS dataset

In [None]:
actor_id = {
    "03": "25",
    "08": "26",
    "10": "27",
    "09": "28",
    "11": "29",
    "13": "30",
    "12": "31",
    "14": "32",
    "15": "33",
    "16": "34",
}

Since the emotion categories in the EmoDB dataset are not exactly consistent with those in the RAVDESS dataset, I think bored in EmoDB is calm in RAVDESS. Since the EmoDB does not contain the emotion of surprised, I left it alone, which meant keeping the emotion number of surprised in the original RAVDESS dataset.

In [None]:
emotion = {"W": "05",
           "E": "07",
           "A": "06",
           "F": "03",
           "T": "04",
           "N": "01",
           "L": "02"}

In the EmoDB dataset, different versions are represented by letters, while in RAVDESS, different versions are represented by numbers. To make the EmoDB consistent with RAVDESS naming, I will convert letters to numbers.

In [None]:
versions = {"a": "01", "b": "02", "c": "03", "d": "04", "e": "05", "f": "06"}

Renaming every file in the EmoDB dataset is a tedious step. Completing this task manually would be a daunting and tedious endeavor. Therefore, I developed a Python script to speed up the process. This script will rename each file in the EmoDB dataset. If no RAVDESS folder exists in the same location as the EmoDB folder, a new RAVDESS directory will be generated. If there is already a RAVDESS folder, there is no need to regenerate and overwrite the original folder. Executing this script directly renames the files in the EmoDB folder and transfers them to the RAVDESS folder.

In [None]:
import os

os.chdir(os.path.dirname("/content/drive/MyDrive/IAT481-Assignment2/wav"))

if not os.path.exists("./RAVDESS"):
    os.mkdir("./RAVDESS")

files = os.listdir("./wav")

for f in files:
    new_f = (
        "03-01-"
        + emotion[f[5]]
        + "-01-"
        + f[2:5]
        + "-"
        + versions[f[6]]
        + "-"
        + actor_id[f[0:2]]
        + ".wav"
    )

    if not os.path.exists("./RAVDESS/Actor_" + actor_id[f[0:2]]):
        os.mkdir("./RAVDESS/Actor_" + actor_id[f[0:2]])


    os.rename("./wav/" + f, "./RAVDESS/Actor_" + actor_id[f[0:2]] + "/" + new_f)

# Data Analysis

In this section, we need to visualize our combined dataset.



**Load the Dataset and Compute Features**

We have to understand the labelling of the RAVDESS dataset to find the ground truth emotion for each sample. Each file is labelled with 7 numbers delimited by a "-". Most of the numbers describe metadata about the audio samples such as their format (video and/or audio), whether the audio is a song or statement, which of two statements is being read and by which actor.


We're going to define a dictionary based on the third number (emotion) and assign an emotion to each number as specified by the RAVDESS dataset:

In [3]:
#Emotions in the RAVDESS dataset
emotions ={
  '01':'neutral',
  '02':'calm',
  '03':'happy',
  '04':'sad',
  '05':'angry',
  '06':'fearful',
  '07':'disgust',
  '08':'surprised'
}

In [11]:
import librosa

def feature_chromagram(waveform, sample_rate):
    # STFT computed here explicitly; mel spectrogram and MFCC functions do this under the hood
    stft_spectrogram=np.abs(librosa.stft(waveform))
    #print(stft_spectrogram.shape)
    # Produce the chromagram for all STFT frames and get the mean of each column of the resulting matrix to create a feature array
    chromagram=np.mean(librosa.feature.chroma_stft(S=stft_spectrogram, sr=sample_rate).T,axis=0)
    #print(chromagram.shape)
    return chromagram

def feature_melspectrogram(waveform, sample_rate):
    # Produce the mel spectrogram for all STFT frames and get the mean of each column of the resulting matrix to create a feature array
    # Using 8khz as upper frequency bound should be enough for most speech classification tasks
    melspectrogram=np.mean(librosa.feature.melspectrogram(y=waveform, sr=sample_rate, n_mels=128, fmax=8000).T,axis=0)
    return melspectrogram

def feature_mfcc(waveform, sample_rate):
    # Compute the MFCCs for all STFT frames and get the mean of each column of the resulting matrix to create a feature array
    # 40 filterbanks = 40 coefficients
    mfc_coefficients=np.mean(librosa.feature.mfcc(y=waveform, sr=sample_rate, n_mfcc=40).T, axis=0)
    return mfc_coefficients

In [12]:
def get_features(file):
    # load an individual soundfile
     with soundfile.SoundFile(file) as audio:
        waveform = audio.read(dtype="float32")
        sample_rate = audio.samplerate
        # compute features of soundfile
        chromagram = feature_chromagram(waveform, sample_rate)
        melspectrogram = feature_melspectrogram(waveform, sample_rate)
        mfc_coefficients = feature_mfcc(waveform, sample_rate)
        feature_matrix=np.array([])

        # use np.hstack to stack our feature arrays horizontally to create a feature matrix
        feature_matrix = np.hstack((chromagram, melspectrogram, mfc_coefficients))

        return feature_matrix

In [32]:
import os, glob

def load_data():
    X,y=[],[]
    count = 0
    for file in glob.glob("/content/drive/MyDrive/IAT481-Assignment2/audio_speech_actors_01-24/Actor_*/*.wav"):
        file_name=os.path.basename(file)
        #print(file_name)
        #print(file_name.split("-")[2])
        emotion=emotions[(file_name.split("-")[2])]
        #print(emotion)
        features = get_features(file)
        X.append(features)
        y.append(emotion)
        count += 1
        # '\r' + end='' results in printing over same line
        print('\r' + f' Processed {count}/{1435} audio samples',end=' ')
      # Return arrays to plug into sklearn's cross-validation algorithms
    return np.array(X), np.array(y)

In [33]:
features, emotions = load_data()

 Processed 517/1435 audio samples 

ValueError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 0, the array at index 0 has size 12 and the array at index 1 has size 128