# Introduction

Call centers offer one of the clearest examples of speech emotion recognition. Call center employees rarely speak to every customer in the same manner; their pitch and way of talking change from one customer to the next. This happens in everyday conversation too, so why is it relevant to call centers? The employees recognize customers' emotions from their speech so that they can improve their service and convert more people. In doing so, they are performing speech emotion recognition. So, let's discuss this project in detail.

Speech emotion recognition is a simple Python mini-project, which you are going to practice with DataFlair.

# What is Speech Emotion Recognition?

Speech Emotion Recognition, abbreviated as SER, is the act of attempting to recognize human emotion and affective states from speech. It capitalizes on the fact that the voice often reflects underlying emotion through tone and pitch. This is also the phenomenon that animals like dogs and horses rely on to understand human emotion.

SER is tough because emotions are subjective and annotating audio is challenging.

# What is Librosa?

Librosa is a Python library for analyzing audio and music. It has a flatter package layout, standardized interfaces and names, backward compatibility, modular functions, and readable code.

# What is JupyterLab?

JupyterLab is an open-source, web-based UI for Project Jupyter. It has the basic functionality of the Jupyter Notebook, such as notebooks, terminals, text editors, file browsers, rich outputs, and more. It also provides improved support for third-party extensions.

# Speech Emotion Recognition - Objective
To build a model to recognize emotion from speech using the librosa and sklearn libraries and the RAVDESS dataset.

# Speech Emotion Recognition - About the Project

In this project, we will use the libraries librosa, soundfile, and sklearn to build a model using an MLPClassifier. This model will be able to recognize emotion from sound files. We will load the data, extract features from it, then split the dataset into training and testing sets. Then, we'll initialize an MLPClassifier and train the model. Finally, we'll calculate the accuracy of our model.

# The Dataset
For this Python project, we'll use the RAVDESS dataset, the Ryerson Audio-Visual Database of Emotional Speech and Song, which is free to download. This dataset has 7356 files rated by 247 individuals 10 times on emotional validity, intensity, and genuineness. The entire dataset is 24.8 GB from 24 actors.

# Prerequisites
pip install librosa soundfile numpy scikit-learn pyaudio

In [3]:
# Necessary Imports
import librosa
import soundfile
import os, glob, pickle
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

Now, define a function extract_feature to extract the mfcc, chroma, and mel features from a sound file. This function takes 4 parameters - the file name and three Boolean parameters for the three features:

a. mfcc: Mel Frequency Cepstral Coefficient, represents the short-term power spectrum of a sound
b. chroma: Pertains to the 12 different pitch classes
c. mel: Mel Spectrogram Frequency

Open the sound file with soundfile.SoundFile using with-as so it's automatically closed once we're done. Read from it and call the result X. Also, get the sample rate. If chroma is True, get the Short-Time Fourier Transform of X.

Let result be an empty numpy array. Now, for each of the three features, if it is requested, call the corresponding function from librosa.feature (e.g., librosa.feature.mfcc for mfcc) and take the mean value over time. Call numpy's hstack() with result and the feature value, and store the output back in result. hstack() stacks arrays in sequence horizontally (column-wise). Then, return the result.

In [5]:
#Extract features (mfcc, chroma, mel) from a sound file
def extract_feature(file_name, mfcc, chroma, mel):
    with soundfile.SoundFile(file_name) as sound_file:
        X = sound_file.read(dtype="float32")
        sample_rate=sound_file.samplerate
        if chroma:
            stft=np.abs(librosa.stft(X))
        result=np.array([])
        if mfcc:
            mfccs=np.mean(librosa.feature.mfcc(y=X, sr=sample_rate, n_mfcc=40).T, axis=0)
            result=np.hstack((result, mfccs))
        if chroma:
            chroma=np.mean(librosa.feature.chroma_stft(S=stft, sr=sample_rate).T,axis=0)
            result=np.hstack((result, chroma))
        if mel:
            mel=np.mean(librosa.feature.melspectrogram(y=X, sr=sample_rate).T,axis=0)
            result=np.hstack((result, mel))
    return result

Now, let's define a dictionary to hold the numbers and the emotions available in the RAVDESS dataset, and a list to hold the ones we want: calm, happy, fearful, and disgust.

In [6]:
#Emotions in the RAVDESS dataset
emotions={
  '01':'neutral',
  '02':'calm',
  '03':'happy',
  '04':'sad',
  '05':'angry',
  '06':'fearful',
  '07':'disgust',
  '08':'surprised'
}
#DataFlair - Emotions to observe
observed_emotions=['calm', 'happy', 'fearful', 'disgust']

Now, let's load the data with a function load_data(), which takes the relative size of the test set as a parameter. x and y are empty lists; we'll use the glob() function from the glob module to get the pathnames of all the sound files in our dataset. The pattern we use for this has the form "D:\\DataFlair\\ravdessdata\\Actor_*\\*.wav", with the directory prefix adjusted to wherever the dataset is stored on your machine. For each such path, get the basename of the file, then get the emotion code by splitting the name around '-' and extracting the third value.

Using our emotions dictionary, this number is turned into an emotion, and our function checks whether this emotion is in our list of observed_emotions; if not, it continues to the next file. Otherwise, it makes a call to extract_feature and stores what is returned in 'feature'. Then, it appends the feature to x and the emotion to y. So, the list x holds the features and y holds the emotions. We call the function train_test_split with these, the test size, and a random state value, and return the result.
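To make the file-name parsing concrete, here is a minimal sketch that uses only the standard library. The helper name `emotion_from_path` and the constant `EMOTION_CODES` are hypothetical, introduced only for illustration, and the sample file name assumes the usual RAVDESS convention of seven hyphen-separated two-digit fields.

```python
import os

# RAVDESS emotion codes, mirroring the emotions dictionary defined earlier
EMOTION_CODES = {
    "01": "neutral", "02": "calm", "03": "happy", "04": "sad",
    "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised",
}

def emotion_from_path(path):
    """Return the emotion label encoded in a RAVDESS file name (hypothetical helper)."""
    file_name = os.path.basename(path)   # e.g. "03-01-06-01-02-01-12.wav"
    code = file_name.split("-")[2]       # third hyphen-separated field
    return EMOTION_CODES[code]

print(emotion_from_path("Actor_12/03-01-06-01-02-01-12.wav"))  # fearful
```

The third field is "06", which maps to fearful, so this file would be kept by load_data() because fearful is one of the observed emotions.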

In [7]:
#Load the data and extract features for each sound file
def load_data(test_size=0.2):
    x,y=[],[]
    for file in glob.glob("C:\\Users\\Ace\\Desktop\\MachineLearning\\machinelearningprojects\\Speech Recognition with librosa\\ravdessdata\\Actor_*\\*.wav"):
        file_name=os.path.basename(file)
        emotion=emotions[file_name.split("-")[2]]
        if emotion not in observed_emotions:
            continue
        feature=extract_feature(file, mfcc=True, chroma=True, mel=True)
        x.append(feature)
        y.append(emotion)
    return train_test_split(np.array(x), y, test_size=test_size, random_state=9)

Time to split the dataset into training and testing sets! Let's keep the test set at 25% of everything and use the load_data function for this.


In [8]:
#split the dataset
x_train, x_test, y_train, y_test = load_data(test_size=0.25)

Observe the shape of the training and testing datasets:


In [9]:
#Get the shape of the training and testing datasets
print((x_train.shape[0],x_test.shape[0]))

(576, 192)
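These counts can be sanity-checked with a little arithmetic. The sketch below assumes that only the speech audio files for all 24 actors were loaded, that each actor recorded every non-neutral emotion as 2 statements x 2 intensities x 2 repetitions, and that the test share is rounded up; the variable names are illustrative.

```python
import math

actors = 24
# Assumption: 2 statements x 2 intensities x 2 repetitions per actor and emotion
recordings_per_emotion = actors * 2 * 2 * 2        # 192 files per emotion
observed = ["calm", "happy", "fearful", "disgust"]
total_files = recordings_per_emotion * len(observed)  # 768 files kept

n_test = math.ceil(0.25 * total_files)  # test share, rounded up
n_train = total_files - n_test          # everything else is used for training
print((n_train, n_test))                # (576, 192)
```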


In [10]:
#Get the number of features extracted
print(f'Features extracted: {x_train.shape[1]}')

Features extracted: 180
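The figure of 180 follows from the three feature blocks: 40 MFCCs, 12 chroma bins, and 128 mel bands. The sketch below reproduces the averaging and stacking steps on zero-filled stand-in arrays rather than real audio; the array names and the frame count are illustrative, and the chroma and mel sizes assume librosa's default settings.

```python
import numpy as np

n_frames = 100  # arbitrary number of time frames, for illustration only

# Stand-ins for librosa outputs, each shaped (n_features, n_frames)
mfcc_frames = np.zeros((40, n_frames))    # n_mfcc=40, as in extract_feature
chroma_frames = np.zeros((12, n_frames))  # 12 pitch classes
mel_frames = np.zeros((128, n_frames))    # assumes the default of 128 mel bands

# Average each feature over time, then concatenate into one flat vector
stacked = np.array([])
for frames in (mfcc_frames, chroma_frames, mel_frames):
    stacked = np.hstack((stacked, np.mean(frames.T, axis=0)))

print(stacked.shape)  # (180,)
```

Because every file is reduced to a vector of the same length regardless of its duration, the recordings can be stacked into a single feature matrix for the classifier.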


Now, let's initialize an MLPClassifier. This is a Multi-layer Perceptron Classifier; it optimizes the log-loss function using LBFGS or stochastic gradient descent. Unlike SVM or Naive Bayes, the MLPClassifier has an internal neural network for the purpose of classification. This is a feedforward ANN model.
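Before working with the audio features, it may help to see the fit, predict, and score cycle on toy data. The following is a minimal sketch on a synthetic dataset from make_classification, not on RAVDESS; the variable names, the hidden layer size of 50, and the dataset dimensions are arbitrary choices made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data: 400 samples, 20 features, 4 classes
X_toy, y_toy = make_classification(n_samples=400, n_features=20,
                                   n_informative=10, n_classes=4,
                                   random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy,
                                          test_size=0.25, random_state=0)

# A small feedforward network with one hidden layer of 50 units
toy_model = MLPClassifier(hidden_layer_sizes=(50,), max_iter=500,
                          random_state=0)
toy_model.fit(X_tr, y_tr)

# Predict on the held-out samples and report the share predicted correctly
toy_pred = toy_model.predict(X_te)
toy_accuracy = accuracy_score(y_true=y_te, y_pred=toy_pred)
print(f"Accuracy: {toy_accuracy * 100:.2f}%")
```

The same three calls (fit, predict, accuracy_score) apply unchanged to the feature vectors and emotion labels produced by load_data().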

In [11]:
#DataFlair - Initialize the Multi Layer Perceptron Classifier
model = MLPClassifier(alpha=0.01, batch_size=256, epsilon=1e-08,hidden_layer_sizes=(300,),learning_rate="adaptive",max_iter=500)

In [12]:
# Train the model
model.fit(x_train,y_train)

MLPClassifier(activation='relu', alpha=0.01, batch_size=256, beta_1=0.9,
              beta_2=0.999, early_stopping=False, epsilon=1e-08,
              hidden_layer_sizes=(300,), learning_rate='adaptive',
              learning_rate_init=0.001, max_fun=15000, max_iter=500,
              momentum=0.9, n_iter_no_change=10, nesterovs_momentum=True,
              power_t=0.5, random_state=None, shuffle=True, solver='adam',
              tol=0.0001, validation_fraction=0.1, verbose=False,
              warm_start=False)

In [13]:
# Predict for the test set
y_pred = model.predict(x_test)

In [None]:
# Calculate the accuracy of our model
ac