First things first: let’s install the libraries that we will need. We can use pip, Python’s package manager, and install several libraries in one command, for example: pip install librosa soundfile numpy scikit-learn. Once they are installed, we import them as follows:

In [1]:
import librosa
import soundfile
import os, glob, pickle
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

RAVDESS stands for Ryerson Audio-Visual Database of Emotional Speech and Song. It is a large database of audio and video recordings. The original size of this data is around 24 GB, but we will use only a smaller portion of it. This helps us stay focused, train our model faster, and keep things simple. The small portion of the dataset can be found on Kaggle at https://www.kaggle.com/uwrfkaggler/ravdess-emotional-speech-audio.

This portion of the RAVDESS contains 1440 files: 60 trials per actor x 24 actors = 1440. The data contains 24 professional actors: 12 female and 12 male. Speech emotions include calm, happy, sad, angry, fearful, surprise, and disgust expressions. You can learn more on the Kaggle website.
The files are named following a particular pattern. Each name consists of 7 parts, in this order: Modality, Vocal channel, Emotion, Emotional intensity, Statement, Repetition, and Actor. Each part is a numeric code with its own set of values. All of this is documented; you can find more about it on the Kaggle website.
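To make the naming pattern concrete, here is a minimal sketch that splits a file name into the seven parts listed above. The helper name parse_ravdess_name and the sample file name are illustrative assumptions, not part of the original notebook.

```python
import os

# Field names follow the seven-part pattern described above.
FIELDS = ["modality", "vocal_channel", "emotion", "emotional_intensity",
          "statement", "repetition", "actor"]

def parse_ravdess_name(path):
    """Split a RAVDESS-style file name into its seven codes (hypothetical helper)."""
    stem = os.path.splitext(os.path.basename(path))[0]
    parts = stem.split("-")
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} parts, got {len(parts)}: {path!r}")
    return dict(zip(FIELDS, parts))

# Sample name made up for illustration; it follows the seven-part pattern.
info = parse_ravdess_name("Actor_12/03-01-06-01-02-01-12.wav")
print(info["emotion"])  # third part: the emotion code
print(info["actor"])    # seventh part: the actor number
```

The third part is the one this project cares about: it is the code that later gets mapped to an emotion label.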

In [2]:
def extract_feature(file_name, mfcc, chroma, mel):
    # Read the whole recording as float32 samples
    with soundfile.SoundFile(file_name) as sound_file:
        X = sound_file.read(dtype="float32")
        sample_rate = sound_file.samplerate
        # The chroma feature is computed from the magnitude of the STFT
        if chroma:
            stft = np.abs(librosa.stft(X))
        result = np.array([])
        # Each feature is averaged over time frames, then appended to the result
        if mfcc:
            mfccs = np.mean(librosa.feature.mfcc(y=X, sr=sample_rate, n_mfcc=40).T, axis=0)
            result = np.hstack((result, mfccs))
        if chroma:
            chroma = np.mean(librosa.feature.chroma_stft(S=stft, sr=sample_rate).T, axis=0)
            result = np.hstack((result, chroma))
        if mel:
            # Pass the signal as y=...; recent librosa versions require keyword arguments here
            mel = np.mean(librosa.feature.melspectrogram(y=X, sr=sample_rate).T, axis=0)
            result = np.hstack((result, mel))
    return result

This function extracts features from an audio recording and returns them as a single one-dimensional array, built by stacking the individual features horizontally with NumPy’s hstack function.
Audio files have many features. Here we use three of them: MFCC, Chroma, and Mel.

mfcc: Mel Frequency Cepstral Coefficients, a compact representation of the short-term power spectrum of a sound
chroma: Pertains to the 12 different pitch classes
mel: Mel spectrogram, a spectrogram whose frequencies are mapped to the mel scale
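
The sketch below illustrates how the three feature groups end up in one vector. It uses random NumPy arrays as stand-ins for the per-frame matrices librosa would return (an assumption, made so the snippet runs without an audio file). The row counts of 40, 12, and 128 correspond to n_mfcc=40, the 12 pitch classes, and librosa's default of 128 mel bands.

```python
import numpy as np

rng = np.random.default_rng(0)
n_frames = 100  # arbitrary number of time frames, for illustration only

# Stand-ins shaped (n_features, n_frames), like librosa feature matrices
fake_mfcc = rng.normal(size=(40, n_frames))
fake_chroma = rng.random(size=(12, n_frames))
fake_mel = rng.random(size=(128, n_frames))

# Average each feature over time, then concatenate into one vector
parts = [np.mean(m.T, axis=0) for m in (fake_mfcc, fake_chroma, fake_mel)]
feature_vector = np.hstack(parts)
print(feature_vector.shape)  # (180,)
```

This is where the 180 features reported later in the notebook come from: 40 + 12 + 128.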

In [2]:
emotions={
  '01':'neutral',
  '02':'calm',
  '03':'happy',
  '04':'sad',
  '05':'angry',
  '06':'fearful',
  '07':'disgust',
  '08':'surprised'
}

#DataFlair - Emotions to observe
observed_emotions=['calm', 'happy', 'fearful', 'disgust']

This dictionary maps the emotion codes found in the file names to readable labels, which we use when training the machine learning model. After the labels, we create a list of the emotions that we want to focus on in this project. It’s hard to predict across all emotions, because speech may express more than one emotion simultaneously, and that would affect our prediction scores.

In [3]:
def load_data(test_size=0.2):
    x,y=[],[]
    for file in glob.glob("D:\\ArdentML\\project\\data_set\\Actor_*\\*.wav"):
        file_name=os.path.basename(file)
        emotion=emotions[file_name.split("-")[2]]
        if emotion not in observed_emotions:
            continue
        feature=extract_feature(file, mfcc=True, chroma=True, mel=True)
        x.append(feature)
        y.append(emotion)
    return train_test_split(np.array(x), y, test_size=test_size, random_state=9)

In this step, we define a function to load our dataset. First, we find the audio files and then extract their features using the function defined in the previous step. As the features are extracted, we pair each feature vector with its emotion label. You can think of the features as our input (x) and the labeled emotion as our output (y). Learning from labeled examples like this is known as supervised learning.

Next, we split the labeled dataset using the train_test_split() function, a well-known splitting function from the scikit-learn module. It divides the dataset into four chunks: training features, test features, training labels, and test labels. We can define how much of the dataset we want to use for training and how much for testing. You can adjust these values to see how it affects the prediction. There is no one-size-fits-all rule; it usually depends on the dataset. A test size of 0.25 is a common choice (it is also scikit-learn’s default), which means 3/4 of the dataset is used for training and 1/4 for testing.
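
The split described above can be tried on toy data. This is a minimal sketch, not the notebook's actual dataset: the all-zero feature matrix and the repeated labels are placeholders, and 764 samples were chosen only because that matches the train and test counts printed later in the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in data: 764 samples with 180 features each (values are arbitrary)
x = np.zeros((764, 180))
y = ["calm", "happy", "fearful", "disgust"] * 191

# Four chunks come back, in this order
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=9
)
print(x_train.shape, x_test.shape)
print(len(y_train), len(y_test))
```

With test_size=0.25, a quarter of the samples (191) land in the test set and the remaining 573 in the training set.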

________________________________________

We are almost done. This is the final step, where we will start calling the functions we defined earlier and recognizing emotions from speech audio recordings.

In [4]:
x_train,x_test,y_train,y_test=load_data(test_size=0.25)

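If this cell raises NameError: name 'glob' is not defined, the import cell at the top has not been run in the current session; run it first and then rerun this cell.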

Let’s start by running the load_data() function. It returns four collections, so we unpack them into four variables. The order matters: training features, test features, training labels, and test labels. You should be familiar with this splitting method, especially if you work on machine learning projects.

In [13]:
print((x_train.shape[0], x_test.shape[0]))

(573, 191)


In [14]:
print(f'Features extracted: {x_train.shape[1]}')

Features extracted: 180


In [15]:
model=MLPClassifier(alpha=0.01, batch_size=256, epsilon=1e-08, hidden_layer_sizes=(300,), learning_rate='adaptive', 
                    max_iter=500)

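MLPClassifier is a multi-layer perceptron classifier: a neural network model that optimizes the log-loss function using limited-memory BFGS (L-BFGS) or stochastic gradient descent. We do not set the solver here, so scikit-learn’s default, 'adam' (a stochastic gradient-based optimizer), is used.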

In [16]:

model.fit(x_train,y_train)

MLPClassifier(alpha=0.01, batch_size=256, hidden_layer_sizes=(300,),
              learning_rate='adaptive', max_iter=500)

In [17]:
y_pred=model.predict(x_test)

In [18]:
print(y_pred)

['happy' 'happy' 'happy' 'fearful' 'fearful' 'fearful' 'calm' 'fearful'
 'calm' 'disgust' 'fearful' 'calm' 'calm' 'happy' 'disgust' 'happy'
 'happy' 'fearful' 'disgust' 'disgust' 'calm' 'fearful' 'calm' 'calm'
 'happy' 'happy' 'happy' 'disgust' 'happy' 'calm' 'happy' 'calm' 'happy'
 'fearful' 'happy' 'fearful' 'fearful' 'fearful' 'calm' 'happy' 'happy'
 'calm' 'fearful' 'calm' 'calm' 'calm' 'happy' 'calm' 'calm' 'fearful'
 'disgust' 'fearful' 'fearful' 'happy' 'happy' 'fearful' 'calm' 'happy'
 'calm' 'calm' 'calm' 'disgust' 'disgust' 'happy' 'disgust' 'happy'
 'happy' 'happy' 'happy' 'happy' 'fearful' 'fearful' 'disgust' 'fearful'
 'fearful' 'disgust' 'fearful' 'fearful' 'calm' 'happy' 'calm' 'fearful'
 'calm' 'calm' 'disgust' 'fearful' 'calm' 'fearful' 'fearful' 'fearful'
 'disgust' 'calm' 'calm' 'disgust' 'disgust' 'fearful' 'fearful' 'fearful'
 'fearful' 'happy' 'happy' 'disgust' 'disgust' 'calm' 'disgust' 'calm'
 'disgust' 'happy' 'fearful' 'happy' 'fearful' 'happy' 'fearful' 'fear

In [19]:
accuracy=accuracy_score(y_true=y_test, y_pred=y_pred)

print("Accuracy: {:.2f}%".format(accuracy*100))

Accuracy: 73.30%


Our accuracy score is 73.3%, which is pretty impressive for four emotion classes, where random guessing would land at roughly 25%. I usually get a similar score after fitting the model multiple times; the exact value varies from run to run because no random_state is set on the classifier. I think this is a satisfying score for an emotion recognition model trained on audio recordings. Thanks to the developers of machine learning and artificial intelligence models.
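
For reference, accuracy is simply the fraction of predictions that match the true labels. Here is a minimal sketch with made-up labels (not the notebook's actual test set); the function name accuracy is a hypothetical stand-in for scikit-learn's accuracy_score.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Made-up labels for illustration: three of four predictions are correct
y_true = ["calm", "happy", "fearful", "disgust"]
y_pred = ["calm", "happy", "happy", "disgust"]
print("Accuracy: {:.2f}%".format(accuracy(y_true, y_pred) * 100))  # Accuracy: 75.00%
```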

______________________________________________________

Next, we dump our model to a pickle file, so that a separate prediction script can load it and connect it to a front end built with the Streamlit framework in Python.

In [30]:
import pickle


In [32]:
with open('model_pickle','wb') as f:
    pickle.dump(model,f)

In [33]:
with open('model_pickle','rb') as f:
    mp= pickle.load(f)
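
Once reloaded, the model can be used just like the original. The sketch below demonstrates the round trip on a tiny model fitted to made-up data; the data, the small hidden layer, and the temporary file location are assumptions chosen to keep the snippet self-contained.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.neural_network import MLPClassifier

# Made-up, easily separable data (illustration only)
X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 0.8]] * 10)
y = ["calm", "calm", "happy", "happy"] * 10

model = MLPClassifier(hidden_layer_sizes=(8,), max_iter=500, random_state=0)
model.fit(X, y)

# Write the fitted model to disk, then read it back
path = os.path.join(tempfile.mkdtemp(), "model_pickle")
with open(path, "wb") as f:
    pickle.dump(model, f)
with open(path, "rb") as f:
    restored = pickle.load(f)

# The restored model should give the same predictions as the original
same = np.array_equal(model.predict(X), restored.predict(X))
print(same)
```

Only unpickle files you created yourself or otherwise trust, since loading a pickle can execute arbitrary code.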

Thank you! We have created a speech emotion recognizer using Python.