<a href="https://www.kaggle.com/code/homohl/speech-emotion-recognition-in-tinyml?scriptVersionId=123748190" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

<h1 style="text-align: center;">Speech Emotion Recognition in TinyML</h1>

## Introduction
In this notebook, we explore the development of a speech emotion recognition (SER) model using deep neural networks and its conversion into a TinyML model. We use Mel-frequency cepstral coefficients (MFCCs) as input features and a Long Short-Term Memory (LSTM) network as the classifier: LSTMs are designed to learn from sequences, and this combination has been shown to be well-suited for SER tasks [[1]]. Furthermore, we leverage TinyML to run the model on resource-constrained devices like microcontrollers, which benefits applications on mobile devices, IoT devices, and wearables.

In this application, we consider the circumplex model in affective computing, which is a two-dimensional model representing emotions in terms of valence (pleasantness) and arousal (intensity) [[2]]. Based on this model and the implementation scenarios, we drop the categories "Fear" and "Disgusted" and merge "Angry" and "Sad" emotions into the same category, "Unpleasant," due to their similar attributes of valence and arousal. Consequently, we focus on classifying four emotions: Happy, Surprised, Neutral, and Unpleasant.

The dataset employed in this notebook combines three datasets: RAVDESS [[3]], TESS [[4]], and SAVEE [[5]]. We load and preprocess the data, then explore it by visualizing wave plots and spectrograms for different emotions. We use data augmentation techniques (noise addition, time stretching, and pitch shifting) to expand the dataset and improve the model's performance.

Upon completing the standard training and validation pipeline, we convert the model into a TensorFlow Lite (TFLite) format, quantize it to reduce the size, and compare its inference time with that of the original model.

References:<br>
[[1]] Kumbhar, Harshawardhan S., and Sheetal U. Bhandari. "Speech emotion recognition using MFCC features and LSTM network." 2019 5th International Conference On Computing, Communication, Control And Automation (ICCUBEA). IEEE, 2019.<br>
[[2]] Russell, James A. "A circumplex model of affect." Journal of personality and social psychology 39.6 (1980): 1161.<br>
[[3]] Livingstone, Steven R., and Frank A. Russo. "The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English." PloS one 13.5 (2018): e0196391.<br>
[[4]] Dupuis, Kate, and M. Kathleen Pichora-Fuller. "Toronto emotional speech set (TESS)." (2010).<br>
[[5]] Jackson, Philip, and Sanaul Haq. "Surrey audio-visual expressed emotion (SAVEE) database." University of Surrey: Guildford, UK (2014).

[1]: https://doi.org/10.1109/ICCUBEA47591.2019.9129067
[2]: https://doi.org/10.1037/h0077714
[3]: https://doi.org/10.1371/journal.pone.0196391
[4]: https://doi.org/10.5683/SP2/E8H2MF
[5]: https://openresearch.surrey.ac.uk/esploro/outputs/journalArticle/Surrey-audio-visual-expressed-emotion-savee-database/99635364402346

## Importing Libraries
Include the necessary libraries for data manipulation, visualization, and model building:

In [None]:
import pandas as pd
import numpy as np

import os
import sys

import librosa
import librosa.display
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import train_test_split

from IPython.display import Audio

import tensorflow as tf
from tensorflow.keras.callbacks import ReduceLROnPlateau
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, Flatten, Dropout
from tensorflow.keras.callbacks import ModelCheckpoint

import warnings
if not sys.warnoptions:
    warnings.simplefilter("ignore")
warnings.filterwarnings("ignore", category=DeprecationWarning) 

## Dataset
We will use three different datasets for speech emotion recognition: RAVDESS (Ryerson Audio-Visual Database of Emotional Speech and Song), TESS (Toronto Emotional Speech Set), and SAVEE (Surrey Audio-Visual Expressed Emotion). These datasets offer several benefits that contribute to the development of more robust and accurate SER models:

1. Variety of emotional expressions:<br> These datasets encompass a wide range of emotions, such as happy, sad, angry, fearful, surprised, disgusted, and neutral. This variety helps train models to recognize and distinguish subtle differences between various emotional expressions, enhancing their performance.

2. Multiple speakers:<br> Including multiple speakers with different accents, genders, and speaking styles provides a more diverse and representative sample of speech data. This diversity helps models generalize to real-world scenarios, making them more effective in handling speech data from various sources.

3. High-quality recordings:<br> The audio files in these datasets are recorded with high-quality equipment, resulting in clear and consistent audio samples, allowing models to focus on the emotional content of the speech without being hindered by noise or other artifacts.

In [None]:
# Paths for data.
Ravdess = "/kaggle/input/ravdess-emotional-speech-audio/audio_speech_actors_01-24/"
Tess = "/kaggle/input/toronto-emotional-speech-set-tess/tess toronto emotional speech set data/TESS Toronto emotional speech set data/"
Savee = "/kaggle/input/surrey-audiovisual-expressed-emotion-savee/ALL/"

In [None]:
ravdess_directory_list = os.listdir(Ravdess)

file_emotion = []
file_path = []
for dir in ravdess_directory_list:
    # The RAVDESS directory holds one sub-directory per actor (24 in total), so we list the files of each actor.
    actor = os.listdir(Ravdess + dir)
    for file in actor:
        part = file.split('.')[0]
        part = part.split('-')
        # third part in each file represents the emotion associated to that file.
        file_emotion.append(int(part[2]))
        file_path.append(Ravdess + dir + '/' + file)
        
# dataframe for emotion of files
emotion_df = pd.DataFrame(file_emotion, columns=['Emotions'])

# dataframe for path of files.
path_df = pd.DataFrame(file_path, columns=['Path'])
Ravdess_df = pd.concat([emotion_df, path_df], axis=1)

# Change integers to actual emotions.
Ravdess_df.Emotions.replace({1:'neutral', 2:'neutral', 3:'happy', 4:'sad', 5:'angry', 6:'fear', 7:'disgust', 8:'surprise'}, inplace=True)

Ravdess_df.head()
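
The parsing above relies on the RAVDESS naming scheme, in which a file name consists of seven two-digit fields separated by hyphens. As a minimal sketch (the helper `parse_ravdess_name` and the example file name are illustrative and not part of the notebook), the fields can be unpacked like this:

```python
# Illustrative helper (not part of the original notebook): unpack a RAVDESS file name.
RAVDESS_FIELDS = ("modality", "vocal_channel", "emotion", "intensity",
                  "statement", "repetition", "actor")

# Same mapping as the notebook: code 2 ("calm" in RAVDESS) is folded into "neutral".
RAVDESS_EMOTIONS = {1: "neutral", 2: "neutral", 3: "happy", 4: "sad",
                    5: "angry", 6: "fear", 7: "disgust", 8: "surprise"}

def parse_ravdess_name(file_name):
    """Return a dict of integer fields parsed from a RAVDESS file name."""
    stem = file_name.split(".")[0]
    values = [int(part) for part in stem.split("-")]
    if len(values) != len(RAVDESS_FIELDS):
        raise ValueError(f"unexpected RAVDESS file name: {file_name}")
    return dict(zip(RAVDESS_FIELDS, values))

fields = parse_ravdess_name("03-01-06-01-02-01-12.wav")
print(fields["emotion"], RAVDESS_EMOTIONS[fields["emotion"]], fields["actor"])
```

Naming the fields makes it easier to filter by actor or intensity later, should that become useful.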

In [None]:
tess_directory_list = os.listdir(Tess)

file_emotion = []
file_path = []

for dir in tess_directory_list:
    directories = os.listdir(Tess + dir)
    for file in directories:
        part = file.split('.')[0]
        part = part.split('_')[2]
        if part=='ps':
            file_emotion.append('surprise')
        else:
            file_emotion.append(part)
        file_path.append(Tess + dir + '/' + file)
        
# dataframe for emotion of files
emotion_df = pd.DataFrame(file_emotion, columns=['Emotions'])

# dataframe for path of files.
path_df = pd.DataFrame(file_path, columns=['Path'])
Tess_df = pd.concat([emotion_df, path_df], axis=1)

Tess_df.head()

In [None]:
savee_directory_list = os.listdir(Savee)

file_emotion = []
file_path = []

for file in savee_directory_list:
    file_path.append(Savee + file)
    part = file.split('_')[1]
    ele = part[:-6]
    if ele=='a':
        file_emotion.append('angry')
    elif ele=='d':
        file_emotion.append('disgust')
    elif ele=='f':
        file_emotion.append('fear')
    elif ele=='h':
        file_emotion.append('happy')
    elif ele=='n':
        file_emotion.append('neutral')
    elif ele=='sa':
        file_emotion.append('sad')
    else:
        file_emotion.append('surprise')
        
# dataframe for emotion of files
emotion_df = pd.DataFrame(file_emotion, columns=['Emotions'])

# dataframe for path of files.
path_df = pd.DataFrame(file_path, columns=['Path'])
Savee_df = pd.concat([emotion_df, path_df], axis=1)

Savee_df.head()
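
The `if`/`elif` chain above can also be written as a dictionary lookup. The sketch below assumes SAVEE file names of the form `DC_a01.wav` (speaker prefix, underscore, emotion code, two-digit index), which is what the slicing in the cell implies; the helper name and the example file names are hypothetical.

```python
# Illustrative alternative to the if/elif chain; the file names below are assumed examples.
SAVEE_CODES = {"a": "angry", "d": "disgust", "f": "fear", "h": "happy",
               "n": "neutral", "sa": "sad", "su": "surprise"}

def savee_emotion(file_name):
    """Map a SAVEE file name such as 'DC_a01.wav' to its emotion label."""
    code = file_name.split("_")[1][:-6]  # drop the two-digit index and '.wav'
    return SAVEE_CODES[code]             # raises KeyError on an unknown code

emotions = [savee_emotion(n) for n in ("DC_a01.wav", "JE_sa15.wav", "KL_su03.wav")]
print(emotions)
```

Unlike the original chain, which labels every unrecognized code as `surprise`, the lookup fails loudly on unexpected file names.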

In [None]:
aggregated_data = pd.concat([Ravdess_df, Tess_df, Savee_df], axis = 0)

# Shuffle the dataframe using the sample method
aggregated_data = aggregated_data.sample(frac=1).reset_index(drop=True) 

# Drop rows where Emotions is 'fear' or 'disgust'
aggregated_data = aggregated_data[~aggregated_data['Emotions'].isin(['fear', 'disgust'])]

# Randomly drop 40% of the 'sad' and 'angry' rows, then merge the remainder into 'unpleasant'
aggregated_data = aggregated_data.drop(aggregated_data[aggregated_data['Emotions'] == 'sad'].sample(frac=0.4).index)
aggregated_data = aggregated_data.drop(aggregated_data[aggregated_data['Emotions'] == 'angry'].sample(frac=0.4).index)
aggregated_data['Emotions'] = aggregated_data['Emotions'].replace(['sad', 'angry'], 'unpleasant')

aggregated_data.to_csv("data_path.csv",index=False)
aggregated_data.head()

In [None]:
plt.title('Count of Emotions', size=16)
sns.countplot(x=aggregated_data.Emotions)
plt.ylabel('Count', size=12)
plt.xlabel('Emotions', size=12)
sns.despine(top=True, right=True, left=False, bottom=False)
plt.show()

## Data Exploration
In this section, we plot waveforms and spectrograms for the emotion data and apply audio transformations (noise addition, time stretching, and pitch shifting) to get an overview of the properties of the audio data.

In [None]:
def create_waveplot(data, sr, e):
    plt.figure(figsize=(10, 3))
    plt.title('Waveplot for {} emotion'.format(e), size=15)
    librosa.display.waveplot(data, sr=sr)
    plt.show()

def create_spectrogram(data, sr, e):
    X = librosa.stft(data)
    Xdb = librosa.amplitude_to_db(abs(X))
    plt.figure(figsize=(12, 3))
    plt.title('Spectrogram for {} emotion'.format(e), size=15)
    librosa.display.specshow(Xdb, sr=sr, x_axis='time', y_axis='hz')   
    plt.colorbar()

In [None]:
def noise(data):
    noise_amp = 0.5*np.random.uniform()*np.amax(data)
    data = data + noise_amp*np.random.normal(size=data.shape[0])
    return data

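def stretch(data, rate=0.8):
    # rate < 1 slows the audio down; newer librosa releases require keyword arguments here
    return librosa.effects.time_stretch(data, rate=rate)

def pitch(data, sampling_rate, pitch_factor=0.7):
    # pitch_factor is the shift in semitones (librosa's n_steps)
    return librosa.effects.pitch_shift(data, sr=sampling_rate, n_steps=pitch_factor)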
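
To get a feel for what the default arguments do, here is a small arithmetic sketch. It assumes that time stretching by `rate` changes the length by a factor of roughly `1 / rate`, and that the pitch shift is measured in semitones (12 steps per octave); both helper functions are illustrative and do not call librosa.

```python
def stretched_length(num_samples, rate):
    """Approximate sample count after time stretching by `rate` (rate < 1 slows down)."""
    return round(num_samples / rate)

def pitch_ratio(n_steps, bins_per_octave=12):
    """Frequency multiplier for a shift of `n_steps` steps."""
    return 2.0 ** (n_steps / bins_per_octave)

print(stretched_length(24000, 0.8))  # a 1.5 s clip at 16 kHz, stretched with rate=0.8
print(round(pitch_ratio(0.7), 4))    # frequency ratio for a shift of 0.7 semitones
```

Both defaults are mild: the clip becomes about a quarter longer, and all frequencies rise by a few percent.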

In [None]:
emotion='happy'
path = np.array(aggregated_data.Path[aggregated_data.Emotions==emotion])[1]
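data, sample_rate = librosa.load(path)
# Resample to 16 kHz and keep sample_rate in sync so the plots use the correct time axis
data = librosa.resample(data, orig_sr=sample_rate, target_sr=16000)
sample_rate = 16000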
noised_data = noise(data)

create_waveplot(data, sample_rate, emotion)
create_spectrogram(data, sample_rate, emotion)
Audio(data=noised_data, rate=16000)

# Key Speech Element Span
# Ravdess_df: 0.8-2.8s -> 2.0s
#    Tess_df: 0.3-1.8s -> 1.5s
#   Savee_df: 0.6-3.3s -> 2.7s

In [None]:
fig, axs = plt.subplots(2, 2, figsize=(20,8))
plt.subplots_adjust(hspace=0.4)

axs[0, 0].set_title('Original Signal', size=20)
librosa.display.waveplot(y=data, sr=sample_rate, ax=axs[0, 0])

axs[0, 1].set_title('Noised Signal', size=20)
noise_data = noise(data)
librosa.display.waveplot(y=noise_data, sr=sample_rate, ax=axs[0, 1])

axs[1, 0].set_title('Stretched Signal', size=20)
stretch_data = stretch(data)
librosa.display.waveplot(y=stretch_data, sr=sample_rate, ax=axs[1, 0])

axs[1, 1].set_title('Pitched Signal', size=20)
pitch_data = pitch(data, sample_rate)
librosa.display.waveplot(y=pitch_data, sr=sample_rate, ax=axs[1, 1])

plt.show()

## Data Pre-processing
This section preprocesses the audio data for speech emotion recognition. It encodes the emotion labels, extracts MFCC features, and augments the data with audio transformations. The features are stored in a CSV file and split into training, validation, and testing sets for model training.

In [None]:
labels = {'neutral':0, 'happy':1, 'surprise':2, 'unpleasant': 3}
aggregated_data.replace({'Emotions':labels},inplace=True)
aggregated_data.head()

In [None]:
NUM_MFCC = 13
N_FFT = 2048
HOP_LENGTH = 512
SAMPLE_RATE = 22050
DOWN_SAMPLE_RATE = 16000
SAMPLE_NUM = aggregated_data.shape[0]

data = {
        "labels": [],
        "features": []
    }

def extract_features(data, sample_rate):
    mfcc = librosa.feature.mfcc(y=data, sr=sample_rate, n_mfcc=NUM_MFCC, n_fft=N_FFT, hop_length=HOP_LENGTH)
    feature = mfcc.T
    return feature

for i in range(SAMPLE_NUM):
    for j in range(2):
        data['labels'].append(aggregated_data.iloc[i,0])
    signal, sample_rate = librosa.load(aggregated_data.iloc[i,1], sr=SAMPLE_RATE)
    
    # Cropping & Resampling
    start_time = 0.4  # Start time in seconds
    end_time = 1.9  # End time in seconds
    start_frame = int(start_time * sample_rate)
    end_frame = int(end_time * sample_rate)
    signal = signal[start_frame:end_frame]
    signal = librosa.resample(signal, orig_sr=sample_rate, target_sr=DOWN_SAMPLE_RATE)
    
    # Add noise
    signal = noise(signal)
    res1 = extract_features(signal, DOWN_SAMPLE_RATE)
    data["features"].append(np.array(res1))
    
    # Stretch and shift pitch
    new_data = stretch(signal)[:24000]
    data_stretch_pitch = pitch(new_data, DOWN_SAMPLE_RATE)
    res2 = extract_features(data_stretch_pitch, DOWN_SAMPLE_RATE)
    data["features"].append(np.array(res2))
    
    if i % 100 == 0:
        print(f'Processing Data: {i}/{SAMPLE_NUM}')
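
The constants above determine the shape of each feature matrix. The following sketch redoes that arithmetic without loading any audio; it assumes centered framing (librosa's default), under which a signal of `n` samples yields `1 + n // hop_length` frames.

```python
SAMPLE_RATE = 22050       # rate used by librosa.load in the notebook
DOWN_SAMPLE_RATE = 16000  # rate after resampling
HOP_LENGTH = 512
NUM_MFCC = 13
START_TIME, END_TIME = 0.4, 1.9  # crop window in seconds

# Samples kept by the crop at the original rate
cropped = int(END_TIME * SAMPLE_RATE) - int(START_TIME * SAMPLE_RATE)
# Samples left after resampling to 16 kHz
resampled = round(cropped * DOWN_SAMPLE_RATE / SAMPLE_RATE)
# Frames produced by the MFCC computation, assuming centered framing
frames = 1 + resampled // HOP_LENGTH

print(cropped, resampled, (frames, NUM_MFCC))
```

A full 1.5 s crop therefore gives a `(47, 13)` matrix, which matches the model's input shape. Recordings that end before 1.9 s produce fewer frames, which is why the features are padded before training.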

In [None]:
Features = pd.DataFrame()
Features['features'] = data["features"]
Features['labels'] = data["labels"]
Features.to_csv('Features.csv', index=False)
Features.head()

In [None]:
X = np.asarray(Features['features'])
y = np.asarray(Features["labels"])

# Pad features to equal length; keep the MFCCs as floats (pad_sequences defaults to int32)
X = tf.keras.preprocessing.sequence.pad_sequences(X, dtype='float32')
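
By default, `pad_sequences` pads at the start of each sequence (`padding='pre'`) and casts the result to `int32` unless a `dtype` is given, so float features need an explicit `dtype='float32'`. The pure-Python sketch below mimics only the pre-padding behaviour on nested lists; `pad_pre` is a hypothetical helper for illustration, not a Keras function.

```python
def pad_pre(sequences, value=0.0):
    """Left-pad each sequence of frames to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    width = len(sequences[0][0])  # number of coefficients per frame
    padded = []
    for seq in sequences:
        filler = [[value] * width for _ in range(max_len - len(seq))]
        padded.append(filler + [list(frame) for frame in seq])
    return padded

shorter = [[1.5, 2.5]]              # one frame, two coefficients
longer = [[3.0, 4.0], [5.0, 6.0]]   # two frames
padded = pad_pre([shorter, longer])
print(padded)
```

The shorter sequence gains a leading frame of zeros, so every sample ends with real audio frames.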

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)
X_train, X_validation, y_train, y_validation = train_test_split(X_train, y_train, test_size=0.2)

print(f'Training Data:{X_train.shape} with label {y_train.shape}')
print(f'Validate Data:{X_validation.shape} with label {y_validation.shape}')
print(f' Testing Data:{X_test.shape} with label {y_test.shape}')
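
Because the second split is applied to what remains after the first, the effective proportions are not 70/20/10. A quick check of the resulting fractions:

```python
TEST_SIZE = 0.1  # first split: fraction of all samples held out for testing
VAL_SIZE = 0.2   # second split: fraction of the remainder used for validation

test_frac = TEST_SIZE
val_frac = (1 - TEST_SIZE) * VAL_SIZE
train_frac = 1 - test_frac - val_frac

print(f"train={train_frac:.2f} val={val_frac:.2f} test={test_frac:.2f}")
```

The exact counts printed by the cell above can differ by a sample or so because of rounding.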

## Model Training
This sequential model consists of two LSTM layers, which capture temporal dependencies across the MFCC frames, followed by dense layers that map the extracted features to the four emotion classes.

In [None]:
def build_model(input_shape):
    model = tf.keras.Sequential()

    model.add(LSTM(128, input_shape=input_shape, return_sequences=True))
    model.add(LSTM(64))
    
    model.add(Dense(64, activation='relu'))
    model.add(Dropout(0.3))

    model.add(Dense(4, activation='softmax'))

    return model

# Create network
input_shape = (47,13)
model = build_model(input_shape)

# Compile model
optimiser = tf.keras.optimizers.Adam(learning_rate=0.001)

model.compile(optimizer=optimiser,
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])

model.summary()
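
The parameter counts reported by `model.summary()` can be cross-checked by hand, which is useful when estimating whether the model fits on a microcontroller. The sketch below assumes the standard LSTM layout (four gates, each with input weights, recurrent weights, and a bias); the helper functions are illustrative, and the size estimate covers weights only.

```python
def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights, and a bias."""
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    """Weight matrix plus one bias per unit."""
    return input_dim * units + units

layers = {
    "lstm_128": lstm_params(13, 128),
    "lstm_64": lstm_params(128, 64),
    "dense_64": dense_params(64, 64),
    "dense_4": dense_params(64, 4),
}
total = sum(layers.values())

print(layers, total)
print(f"float32: {total * 4 / 1024:.0f} KiB, int8: {total / 1024:.0f} KiB (weights only)")
```

The dropout layer has no parameters, so it does not appear in the count.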

In [None]:
# Run the training process
EPOCHS = 20
history = model.fit(X_train, y_train, validation_data=(X_validation, y_validation), batch_size=32, epochs=EPOCHS)

In [None]:
import os

# Create the 'Models' directory to store the saved model
output_dir = '/kaggle/working/Models'
os.makedirs(output_dir, exist_ok=True)

model.save('Models/Speech-Emotion-Recognition-Model.h5')
print('Saved the TensorFlow model!')

## Model Evaluation
We evaluate the speech emotion recognition model by measuring test accuracy, plotting loss and accuracy graphs, printing a classification report, and creating a confusion matrix. These tools help assess the model's performance, identify areas for improvement, and enhance its accuracy and generalization capabilities.

In [None]:
test_loss, test_acc = model.evaluate(X_test, y_test, verbose=0)
print("Test Accuracy: ", test_acc*100 , "%")


epochs = [i for i in range(EPOCHS)]
fig, ax = plt.subplots(1, 2)
train_acc = history.history['accuracy']
train_loss = history.history['loss']
val_acc = history.history['val_accuracy']
val_loss = history.history['val_loss']

fig.set_size_inches(20, 6)
ax[0].plot(epochs, train_loss, label='Training Loss')
ax[0].plot(epochs, val_loss, label='Validating Loss')
ax[0].set_title('Training & Validating Loss')
ax[0].legend()
ax[0].set_xlabel("Epochs")

ax[1].plot(epochs, train_acc, label='Training Accuracy')
ax[1].plot(epochs, val_acc, label='Validating Accuracy')
ax[1].set_title('Training & Validating Accuracy')
ax[1].legend()
ax[1].set_xlabel("Epochs")
plt.show()

In [None]:
y_pred = model.predict(X_test)
y_pred = np.argmax(y_pred, axis=1)
label_names = list(labels.keys())
print(classification_report(y_test, y_pred, target_names=label_names))

On the test set, the model reached an overall accuracy of 0.74, as shown by the classification report. The per-class F1-scores were Neutral (0.82), Happy (0.71), Surprise (0.74), and Unpleasant (0.67). The model performs best on neutral speech, with a precision of 0.76 and a recall of 0.88, and worst on the unpleasant class, with a precision of 0.73 and a recall of 0.61. Overall performance is decent, but there is room for improvement, particularly in recognizing unpleasant emotions.
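
The F1-score is the harmonic mean of precision and recall. As a quick sanity check, the sketch below recomputes the neutral-class score from the precision and recall quoted above (`f1_from_pr` is an illustrative helper, not a scikit-learn function):

```python
def f1_from_pr(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

neutral_f1 = f1_from_pr(0.76, 0.88)
print(round(neutral_f1, 2))
```

Recomputing from already rounded precision and recall can differ from the reported F1-score in the last digit.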

In [None]:
cm = confusion_matrix(y_test, y_pred)
cm = pd.DataFrame(cm)

plt.figure(figsize = (12, 10))
sns.heatmap(cm, xticklabels=label_names, yticklabels=label_names, linecolor='white', cmap='Blues', linewidth=1, annot=True, fmt='')
plt.title('Confusion Matrix', size=20)
plt.xlabel('Predicted Labels', size=14)
plt.ylabel('Actual Labels', size=14)
plt.show()

## TinyML Model Conversion & Evaluation
To deploy the speech emotion recognition model on microcontrollers, we need to adapt it to the target hardware in two steps: post-training quantization and model format conversion.

First, we plot a histogram of the model's weights. Quantization reduces the precision of the weights, which can cost accuracy, so examining the distribution and range of the weights gives a sense of how well post-training quantization is likely to work.

In [None]:
weights = model.get_weights()

# Plot a histogram of the first weight tensor (the input kernel of the first LSTM layer)
plt.hist(weights[0].flatten(), bins=50)
plt.xlabel('Weight value')
plt.ylabel('Frequency')
plt.title('Histogram of model weights')
plt.show()
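
The link between the weight range and the quantization error can be illustrated with a simplified symmetric 8-bit scheme. This is a sketch under stated assumptions, not TensorFlow Lite's exact procedure (which may, for example, use separate scales per channel); the weight values are made up.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map [-max|w|, max|w|] onto [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Map the integer codes back to approximate float weights."""
    return [v * scale for v in q]

weights = [-0.42, -0.1, 0.0, 0.03, 0.27, 0.5]  # made-up values for illustration
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(q, scale, max_error)
```

The rounding error per weight is at most half the scale, and the scale grows with the largest absolute weight. A histogram with a narrow range and no extreme outliers therefore suggests that quantization will lose little precision.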

Second, we convert the trained model to the TensorFlow Lite format, which is optimized for deployment on mobile and embedded devices. We save both a plain version and a quantized version that uses the converter's default optimizations.

In [None]:
converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()
with tf.io.gfile.GFile("Models/SER.tflite", 'wb') as f:
   f.write(tflite_model)

converter.optimizations = [tf.lite.Optimize.DEFAULT]
quant_tflite_model = converter.convert()
with tf.io.gfile.GFile("Models/SER_quant.tflite", 'wb') as f:
   f.write(quant_tflite_model)

print("Save the Tensorflow 'Lite' model!")

In [None]:
print("Model Sizes:")
!ls -lh Models | awk '{print $5 "\t" $9}'

The goal of TinyML techniques is to reduce the model size and the inference time while maintaining similar accuracy. Therefore, we compare the inference time and accuracy of the original TensorFlow model with the TensorFlow Lite model.

In [None]:
def evaluate_tflite(interpreter, test_data, test_label):
    # Get the input and output tensors.
    input_details = interpreter.get_input_details()
    output_details = interpreter.get_output_details()
    
    num_correct = 0
    num_total = 0

    # Iterate over the testing data.
    for i in range(test_data.shape[0]):
        # Get the input data for this example.
        input_data = np.array([test_data[i]], dtype=np.float32)

        # Set the input tensor.
        interpreter.set_tensor(input_details[0]['index'], input_data)

        # Run inference.
        interpreter.invoke()

        # Get the output tensor.
        output_data = interpreter.get_tensor(output_details[0]['index'])

        # Compute the predicted label.
        predicted_label = np.argmax(output_data)

        # Update the results.
        if predicted_label == test_label[i]:
            num_correct += 1
        num_total += 1

    # Reset all variables so it will not pollute other inferences.
    interpreter.reset_all_variables()
    
    # Compute the accuracy.
    accuracy = num_correct / num_total
    
    return accuracy

    
# Load tflite model.
interpreter = tf.lite.Interpreter(model_path="Models/SER_quant.tflite")
interpreter.allocate_tensors()

tflite_test_acc = evaluate_tflite(interpreter, X_test, y_test)
print(f"TF Lite Model Accuracy: {tflite_test_acc * 100:.2f}%")
print(f"Accuracy Difference from Original Model: {(tflite_test_acc - test_acc) * 100:.2f} percentage points")

In [None]:
import time
input_data = np.random.randn(1, 47, 13).astype(np.float32)

start_time = time.time()
for i in range(100):
    h5_predictions = model.predict(input_data)
h5_inference_time = time.time() - start_time

start_time = time.time()
for i in range(100):
    interpreter.set_tensor(interpreter.get_input_details()[0]['index'], input_data)
    interpreter.invoke()
    tflite_predictions = interpreter.get_tensor(interpreter.get_output_details()[0]['index'])
    interpreter.reset_all_variables()
tflite_inference_time = time.time() - start_time

print("Inference Time Comparison:")
print(f"Original Model: {h5_inference_time}s")
print(f"TF Lite Model: {tflite_inference_time}s")
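
Timing loops like the one above are sensitive to one-off setup costs and to clock resolution. A small helper with warm-up calls and `time.perf_counter` gives a steadier per-call figure. The sketch uses a stand-in workload; `mean_latency_ms` is a hypothetical helper, and in the notebook the callable would wrap `model.predict` or the interpreter calls.

```python
import time

def mean_latency_ms(fn, runs=100, warmup=5):
    """Average wall-clock time of fn() in milliseconds, after a few warm-up calls."""
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    return (time.perf_counter() - start) * 1000 / runs

# Stand-in workload for illustration only
latency = mean_latency_ms(lambda: sum(range(1000)))
print(f"{latency:.4f} ms per call")
```

Reporting milliseconds per call also makes the two models easier to compare than total seconds over 100 runs.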

Previously, we converted the original model into the TensorFlow Lite format, which is optimized for edge devices with limited computational resources. A microcontroller typically has no file system to load a `.tflite` file from, so one more step is needed: we convert the TFLite file into a C source file that embeds the model as a byte array, the form used by TensorFlow Lite for Microcontrollers. This file can be compiled into firmware for boards like the Arduino Nano 33 BLE, enabling real-time SER applications with accuracy similar to the original model.

In [None]:
MODEL_TFLITE = 'Models/SER_quant.tflite'
MODEL_TFLITE_MICRO = 'Models/SER_micro.cc'
!xxd -i {MODEL_TFLITE} > {MODEL_TFLITE_MICRO}
REPLACE_TEXT = MODEL_TFLITE.replace('/', '_').replace('.', '_')
!sed -i 's/'{REPLACE_TEXT}'/g_model/g' {MODEL_TFLITE_MICRO}
print("Save the Tensorflow Lite 'Micro' model!")
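
For environments where `xxd` and `sed` are not available, the same kind of C array can be produced in pure Python. The sketch below is an illustrative equivalent, not the notebook's method; `to_c_array` is a hypothetical helper, and the input bytes are stand-ins rather than a real model.

```python
def to_c_array(data, name="g_model", per_line=12):
    """Render bytes as a C array definition, similar in layout to `xxd -i` output."""
    lines = []
    for i in range(0, len(data), per_line):
        chunk = data[i:i + per_line]
        lines.append("  " + ", ".join(f"0x{b:02x}" for b in chunk))
    body = ",\n".join(lines)
    return (f"unsigned char {name}[] = {{\n{body}\n}};\n"
            f"unsigned int {name}_len = {len(data)};\n")

# Stand-in bytes; in practice, read the contents of the .tflite file in binary mode
source = to_c_array(bytes([0x1c, 0x00, 0x00, 0x00, 0x54, 0x46, 0x4c, 0x33]))
print(source)
```

Passing the variable name directly avoids the separate `sed` renaming step.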

**That's all about the project. Thank you for joining me on this journey!<br>
Feel free to share your feedback—I would appreciate the chance to learn and grow together. Cheers!**