# Speech Emotion Recognition with MLP Classifier



# Dataset
The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) 

---
Audio-only files

Audio-only files of all actors (01-24) are available as two separate zip files (~200 MB each):

The RAVDESS speech file contains 1008 files: 42 trials per actor x 24 actors = 1008.

---

---
Toronto emotional speech set (TESS)

---


A set of 200 target words was spoken in the carrier phrase "Say the word _" by two actresses (aged 26 and 64 years), and recordings were made of the set portraying each of seven emotions (anger, disgust, fear, happiness, pleasant surprise, sadness, and neutral). There are 2800 data points (audio files) in total.

The dataset is organised so that each combination of actress and emotion has its own folder, and each folder contains the audio files for all 200 target words. The audio files are in WAV format.
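
As a minimal sketch of how the emotion label could be read from a TESS file, assuming the commonly distributed naming scheme `<speaker>_<word>_<emotion>.wav` (for example `OAF_back_angry.wav`, with `ps` standing for pleasant surprise). The `tess_to_label` mapping and the `tess_emotion` helper are hypothetical and not part of this notebook; the mapping onto the RAVDESS label names is likewise an assumption:

```python
import os

# Hypothetical mapping from TESS file-name suffixes to the label names
# used elsewhere in this notebook (assumed, not taken from the source).
tess_to_label = {
    "angry": "angry",
    "disgust": "disgust",
    "fear": "fearful",
    "happy": "happy",
    "ps": "surprised",  # "ps" = pleasant surprise
    "sad": "sad",
    "neutral": "neutral",
}

def tess_emotion(path):
    """Return the emotion label encoded in a TESS-style file name."""
    stem = os.path.splitext(os.path.basename(path))[0]
    suffix = stem.rsplit("_", 1)[-1].lower()
    return tess_to_label[suffix]

print(tess_emotion("OAF_back_angry.wav"))
print(tess_emotion("YAF_dog_ps.wav"))
```

If the local copy of TESS uses different file names, only the suffix extraction and the mapping need to change.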


---



# Make the necessary imports

In [11]:
import librosa 
import joblib
import soundfile
import os, glob, pickle
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier 
from sklearn.metrics import accuracy_score
import resampy  # resampling backend needed for res_type="kaiser_fast" in librosa.load

Define a function extract_feature to extract the mfcc, chroma, and mel features from a sound file. This function takes four parameters: the file name and three Boolean flags, one for each of the three features:

* mfcc: Mel Frequency Cepstral Coefficients, a compact representation of the short-term power spectrum of a sound
* chroma: the energy in each of the 12 pitch classes
* mel: Mel spectrogram, a spectrogram whose frequency axis is mapped onto the mel scale

In [2]:
def extract_feature(file_name, mfcc=True, chroma=True, mel=True):
    # Load the audio file (librosa resamples to 22050 Hz by default)
    X, sample_rate = librosa.load(file_name, res_type="kaiser_fast")
    
    # Compute Short-Time Fourier Transform (STFT) only if chroma is needed
    stft = np.abs(librosa.stft(y=X)) if chroma else None
    
    # Initialize an empty array to store the features
    result = np.array([])

    # Extract MFCC features
    if mfcc:
        mfccs = np.mean(librosa.feature.mfcc(y=X, sr=sample_rate, n_mfcc=40).T, axis=0)
        result = np.hstack((result, mfccs))

    # Extract Chroma features
    if chroma:
        chroma_feat = np.mean(librosa.feature.chroma_stft(S=stft, sr=sample_rate).T, axis=0)
        result = np.hstack((result, chroma_feat))

    # Extract Mel Spectrogram features
    if mel:
        mel_feat = np.mean(librosa.feature.melspectrogram(y=X, sr=sample_rate).T, axis=0)
        result = np.hstack((result, mel_feat))

    return result
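
Each librosa feature call in the function returns a 2-D array with one column per time frame, so clips of different lengths produce arrays of different widths. Averaging over the time axis is what turns every clip into a fixed-length vector. A small NumPy-only sketch of that step, using made-up numbers in place of real MFCCs (the array `frames` is purely illustrative):

```python
import numpy as np

# Pretend feature matrix: 3 coefficients (rows) x 4 time frames (columns)
frames = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [0.0, 0.0, 2.0, 2.0],
    [5.0, 5.0, 5.0, 5.0],
])

# Same pattern as in extract_feature: transpose, then average over frames
pooled = np.mean(frames.T, axis=0)
print(pooled)        # one value per coefficient
print(pooled.shape)

# Stacking pooled blocks end to end builds the final feature vector
vector = np.hstack((np.array([]), pooled, pooled[:2]))
print(vector.shape)
```

The trade-off of this pooling is that all information about how the features change over time within a clip is discarded.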

Now, let’s define a dictionary that maps the numeric codes to the emotions available in the RAVDESS & TESS datasets, and a list of the 8 emotions to observe: neutral, calm, happy, sad, angry, fearful, disgust, surprised.

In [3]:
# Emotions in the RAVDESS & TESS dataset
emotions={
  '01':'neutral',
  '02':'calm',
  '03':'happy',
  '04':'sad',
  '05':'angry',
  '06':'fearful',
  '07':'disgust',
  '08':'surprised'
}
# Emotions to observe
observed_emotions=['neutral','calm','happy','sad','angry','fearful', 'disgust','surprised']
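
RAVDESS file names consist of seven dash-separated two-digit fields, and the third field is the emotion code that the dictionary above decodes. A short sketch of that lookup; the file name `03-01-06-01-02-01-12.wav` is an illustrative example and `emotion_from_name` is a hypothetical helper, not part of the notebook:

```python
import os

emotions = {
    '01': 'neutral', '02': 'calm', '03': 'happy', '04': 'sad',
    '05': 'angry', '06': 'fearful', '07': 'disgust', '08': 'surprised',
}

def emotion_from_name(path):
    """Decode the emotion from the third dash-separated field."""
    file_name = os.path.basename(path)
    code = file_name.split("-")[2]
    return emotions[code]

print(emotion_from_name("Actor_12/03-01-06-01-02-01-12.wav"))
```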

# Load the data and extract features for each sound file

In [4]:
def load_data(test_size=0.2):
    x, y = [], []

    for file in glob.glob("C:/Users/ACER/Downloads/CNDMCK/Audio/Ravdess/Actor_*/*.wav"):
        file_name = os.path.basename(file)

        # The third dash-separated field of the file name is the emotion code
        emotion = emotions[file_name.split("-")[2]]

        # Skip emotions that are not in the list to observe
        if emotion not in observed_emotions:
            continue

        feature = extract_feature(file, mfcc=True, chroma=True, mel=True)

        x.append(feature)
        y.append(emotion)

    # The training set is the complement of test_size; a fixed train_size
    # would fail as soon as test_size exceeded 0.25
    return train_test_split(np.array(x), y, test_size=test_size, random_state=9)

# Split the Dataset
Time to split the dataset into training and testing sets! Let’s hold out 25% of the samples as the test set and use the load_data function for this.

In [5]:
# Split the dataset
x_train, x_test, y_train, y_test = load_data(test_size=0.25)

Observe the shape of the training and testing datasets:

In [6]:
# Get the number of samples in the training and testing datasets
print((x_train.shape[0], x_test.shape[0]))

(2859, 953)
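
The printed sizes follow directly from the split fraction. A quick check of the arithmetic, assuming the total sample count is simply the sum of the two printed numbers; scikit-learn rounds the test count up when `test_size` is a fraction, and the training set takes the rest:

```python
import math

n_train, n_test = 2859, 953      # sizes printed above
n_samples = n_train + n_test     # total number of feature vectors

# A fractional test_size is rounded up to a whole number of samples
expected_test = math.ceil(0.25 * n_samples)
expected_train = n_samples - expected_test

print(n_samples, expected_train, expected_test)
```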


# Number of features extracted

In [7]:
# Get the number of features extracted
print(f'Features extracted: {x_train.shape[1]}')

Features extracted: 180
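
The 180 features are the three pooled blocks laid end to end: 40 MFCCs (the `n_mfcc=40` argument), 12 chroma bins and 128 mel bands. The last two counts are librosa's defaults (`n_chroma=12`, `n_mels=128`), which extract_feature does not override. A sketch of how one vector is laid out; the slice names are illustrative only:

```python
# Block sizes: n_mfcc is set in extract_feature, the other two are
# librosa defaults (assumed unchanged)
n_mfcc, n_chroma, n_mels = 40, 12, 128

total = n_mfcc + n_chroma + n_mels
print(total)

# Positions of each block inside one feature vector
mfcc_slice = slice(0, n_mfcc)
chroma_slice = slice(n_mfcc, n_mfcc + n_chroma)
mel_slice = slice(n_mfcc + n_chroma, total)
print(mfcc_slice, chroma_slice, mel_slice)
```

Slices like these make it easy to inspect or ablate one feature family, for example `x_train[:, mfcc_slice]`.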


# MLP Classifier

In [8]:
# Initialize the Multi Layer Perceptron Classifier
model=MLPClassifier(alpha=0.01, batch_size=256, epsilon=1e-08, hidden_layer_sizes=(300,), learning_rate='adaptive', max_iter=500)

Fit/train the model.

In [9]:
# Train the model
model.fit(x_train,y_train)

In [13]:
# Save the sklearn MLP model
joblib.dump(model, "mlp_model.joblib")

['mlp_model.joblib']

# Predict the accuracy of our model

Let’s predict the values for the test set. This gives us y_pred (the predicted emotions for the features in the test set).

In [14]:
# Predict for the test set
y_pred=model.predict(x_test)

To calculate the accuracy of our model, we’ll call the accuracy_score() function we imported from sklearn. Finally, we’ll print the accuracy as a percentage with 2 decimal places.

In [19]:
# Calculate the accuracy of our model
accuracy=accuracy_score(y_true=y_test, y_pred=y_pred)
# Print the accuracy
print("Accuracy: {:.2f}%".format(accuracy*100))

Accuracy: 96.01%



# Classification Report

In [16]:
from sklearn.metrics import classification_report
print(classification_report(y_test,y_pred))


              precision    recall  f1-score   support

       angry       0.99      0.93      0.96       160
        calm       0.88      0.96      0.92        48
     disgust       1.00      1.00      1.00        86
     fearful       0.92      0.93      0.92       136
       happy       0.96      0.95      0.95       168
     neutral       0.99      1.00      1.00       120
         sad       0.93      0.97      0.95       143
   surprised       0.99      0.99      0.99        92

    accuracy                           0.96       953
   macro avg       0.96      0.96      0.96       953
weighted avg       0.96      0.96      0.96       953
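
The last two rows of the report differ in how they combine the per-class scores: the macro average treats every class equally, while the weighted average weights each class by its support. A sketch that recomputes both for the f1 column from the rounded values printed above, so the results can only be expected to agree with the report to two decimals:

```python
# (f1-score, support) per class, copied from the report above
report = {
    "angry": (0.96, 160), "calm": (0.92, 48), "disgust": (1.00, 86),
    "fearful": (0.92, 136), "happy": (0.95, 168), "neutral": (1.00, 120),
    "sad": (0.95, 143), "surprised": (0.99, 92),
}

total_support = sum(n for _, n in report.values())

# Macro average: plain mean over classes
macro_f1 = sum(f1 for f1, _ in report.values()) / len(report)

# Weighted average: mean weighted by the number of test samples per class
weighted_f1 = sum(f1 * n for f1, n in report.values()) / total_support

print(total_support)
print(round(macro_f1, 2), round(weighted_f1, 2))
```

The two averages coincide here because no class scores much worse than the others; they drift apart when a small class performs poorly.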



# Confusion Matrix

In [17]:
from sklearn.metrics import confusion_matrix
matrix = confusion_matrix(y_test,y_pred)
print(matrix)

[[149   0   0   6   2   1   2   0]
 [  0  46   0   0   1   0   1   0]
 [  0   0  86   0   0   0   0   0]
 [  1   1   0 126   1   0   7   0]
 [  1   5   0   1 159   0   1   1]
 [  0   0   0   0   0 120   0   0]
 [  0   0   0   4   1   0 138   0]
 [  0   0   0   0   1   0   0  91]]
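
Rows of the matrix are the true labels and columns the predicted ones, both in sorted label order (angry, calm, disgust, fearful, happy, neutral, sad, surprised). The accuracy and the per-class precision and recall reported earlier can be recovered from it. A plain-Python sketch using the matrix printed above:

```python
labels = ["angry", "calm", "disgust", "fearful",
          "happy", "neutral", "sad", "surprised"]
matrix = [
    [149, 0, 0, 6, 2, 1, 2, 0],
    [0, 46, 0, 0, 1, 0, 1, 0],
    [0, 0, 86, 0, 0, 0, 0, 0],
    [1, 1, 0, 126, 1, 0, 7, 0],
    [1, 5, 0, 1, 159, 0, 1, 1],
    [0, 0, 0, 0, 0, 120, 0, 0],
    [0, 0, 0, 4, 1, 0, 138, 0],
    [0, 0, 0, 0, 1, 0, 0, 91],
]

# Accuracy: correctly classified samples (the diagonal) over all samples
correct = sum(matrix[i][i] for i in range(len(labels)))
total = sum(sum(row) for row in matrix)
accuracy = correct / total
print("Accuracy: {:.2f}%".format(accuracy * 100))

for i, label in enumerate(labels):
    predicted_as = sum(row[i] for row in matrix)  # column sum
    actual = sum(matrix[i])                       # row sum
    precision = matrix[i][i] / predicted_as
    recall = matrix[i][i] / actual
    print(f"{label:>10}  precision={precision:.2f}  recall={recall:.2f}")
```

Reading along a row shows where a class gets lost: most of the misclassified angry clips, for instance, are predicted as fearful.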
