<a href="https://colab.research.google.com/github/Kamal-Chandra/Speech--Emotion-Recognition/blob/main/Speech_Emotion_Recognition.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Introduction**

This project aims to build a machine learning model for **Speech Emotion Recognition (SER)**.

**Speech Emotion Recognition** involves identifying human emotions and affective states from speech. It relies on the fact that the tone and pitch of a voice often convey the speaker's underlying emotion. Animals such as dogs and horses show a similar ability, using vocal cues to interpret human emotions.

**Dataset Used:**

Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS)

**Download the RAVDESS dataset and remove unnecessary files from it.**

In [None]:
import os
import shutil
import zipfile
import requests

url = "https://www.kaggle.com/api/v1/datasets/download/uwrfkaggler/ravdess-emotional-speech-audio?datasetVersionNumber=1"

response = requests.get(url)
# Fail early on an HTTP error instead of writing an error page to disk
response.raise_for_status()
zip_path = "/content/ravdess.zip"

# Write the content to a file
with open(zip_path, "wb") as file:
    file.write(response.content)

# Extract the ZIP file
with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    zip_ref.extractall("/content/ravdess")

# Path to the extracted dataset
RAVDESS = "/content/ravdess"

os.remove(zip_path)

path_to_delete = "/content/ravdess/audio_speech_actors_01-24"

if os.path.exists(path_to_delete):
    shutil.rmtree(path_to_delete)
else:
    print(f"Path does not exist: {path_to_delete}")

**Split the data into two parts:**

**1. Training Set**

**2. Test Set**

In [None]:
# Paths to the directories
train_set_path = os.path.join(RAVDESS, "Train Set")
test_set_path = os.path.join(RAVDESS, "Test Set")

# Create the subfolders if they don't exist
os.makedirs(train_set_path, exist_ok=True)
os.makedirs(test_set_path, exist_ok=True)

# List of actor folders in the RAVDESS directory
actor_folders = sorted([f for f in os.listdir(RAVDESS) if f.startswith("Actor_")])

for actor in actor_folders[0:20]:
    shutil.move(os.path.join(RAVDESS, actor), train_set_path)

for actor in actor_folders[20:24]:
    shutil.move(os.path.join(RAVDESS, actor), test_set_path)
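
Splitting by actor folder rather than by individual file keeps every speaker entirely in one set, so the test score reflects voices the model has never heard. A minimal sketch of that idea on folder names only (no files are moved; `split_actors` is a hypothetical helper, not part of the notebook):

```python
def split_actors(actor_folders, n_train=20):
    """Return (train, test) lists with no actor appearing in both."""
    ordered = sorted(actor_folders)
    return ordered[:n_train], ordered[n_train:]

# Folder names follow the RAVDESS layout: Actor_01 ... Actor_24
actors = [f"Actor_{i:02d}" for i in range(1, 25)]
train_actors, test_actors = split_actors(actors)

print(len(train_actors), len(test_actors))
print(test_actors)
```

With 24 actors this reproduces the 20/4 split used above, with Actor_21 to Actor_24 held out for testing.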

## **Filename identifiers as per the official RAVDESS website:**

Modality (01 = full-AV, 02 = video-only, 03 = audio-only).

Vocal channel (01 = speech, 02 = song).

Emotion (01 = neutral, 02 = calm, 03 = happy, 04 = sad, 05 = angry, 06 = fearful, 07 = disgust, 08 = surprised).

Emotional intensity (01 = normal, 02 = strong). NOTE: There is no strong intensity for the 'neutral' emotion.

Statement (01 = "Kids are talking by the door", 02 = "Dogs are sitting by the door").

Repetition (01 = 1st repetition, 02 = 2nd repetition).

Actor (01 to 24. Odd numbered actors are male, even numbered actors are female).

## Example:

02-01-06-01-02-01-12.mp4

This means the metadata for the file is:

Video-only (02)

Speech (01)

Fearful (06)

Normal intensity (01)

Statement "dogs" (02)

1st Repetition (01)

12th Actor (12) - Female (as the actor ID number is even)
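
The naming scheme above can be decoded mechanically. The following is a small sketch, assuming file names always contain exactly seven hyphen-separated two-digit fields; `parse_ravdess_filename` and the field names are hypothetical and chosen only for illustration:

```python
FIELDS = ("modality", "vocal_channel", "emotion", "intensity",
          "statement", "repetition", "actor")

EMOTIONS = {"01": "neutral", "02": "calm", "03": "happy", "04": "sad",
            "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised"}

def parse_ravdess_filename(filename):
    """Split a RAVDESS file name into its seven identifier fields."""
    stem = filename.rsplit(".", 1)[0]          # drop the extension
    codes = stem.split("-")
    if len(codes) != len(FIELDS):
        raise ValueError(f"Unexpected file name: {filename}")
    info = dict(zip(FIELDS, codes))
    info["emotion_name"] = EMOTIONS.get(info["emotion"])
    # Odd actor IDs are male, even actor IDs are female
    info["actor_sex"] = "female" if int(info["actor"]) % 2 == 0 else "male"
    return info

info = parse_ravdess_filename("02-01-06-01-02-01-12.mp4")
print(info["emotion_name"], info["actor_sex"])  # fearful female
```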

In [None]:
# Map the emotion code in the file name to an emotion label

emotions = {
  '01':'neutral',
  '02':'calm',
  '03':'happy',
  '04':'sad',
  '05':'angry',
  '06':'fearful',
  '07':'disgust',
  '08':'surprised'
}

# Only these emotions are used; calm, disgust and surprised are excluded

observed_emotions = ['happy', 'sad', 'angry', 'fearful', 'neutral']

In [None]:
import numpy as np
import pandas as pd

import librosa
import librosa.display

In [None]:
# Training set data frame created from the given data

train_set_list = os.listdir(train_set_path)

data = []

for actor_dir in train_set_list:
  actor_files = os.listdir(os.path.join(train_set_path, actor_dir))
  for file in actor_files:
    # The third hyphen-separated field of the file name is the emotion code
    part = file.split('.')[0]
    part = part.split('-')
    emotion = emotions.get(part[2], None)
    if emotion is not None and emotion in observed_emotions:
      file_path = os.path.join(train_set_path, actor_dir, file)
      data.append({'Emotion': emotion, 'File_Path': file_path})

ravdess_df = pd.DataFrame(data)
print(ravdess_df)

     Emotion                                          File_Path
0        sad  /content/ravdess/Train Set/Actor_08/03-01-04-0...
1    fearful  /content/ravdess/Train Set/Actor_08/03-01-06-0...
2    fearful  /content/ravdess/Train Set/Actor_08/03-01-06-0...
3    neutral  /content/ravdess/Train Set/Actor_08/03-01-01-0...
4      angry  /content/ravdess/Train Set/Actor_08/03-01-05-0...
..       ...                                                ...
715    angry  /content/ravdess/Train Set/Actor_11/03-01-05-0...
716    happy  /content/ravdess/Train Set/Actor_11/03-01-03-0...
717  fearful  /content/ravdess/Train Set/Actor_11/03-01-06-0...
718  neutral  /content/ravdess/Train Set/Actor_11/03-01-01-0...
719    angry  /content/ravdess/Train Set/Actor_11/03-01-05-0...

[720 rows x 2 columns]


## **Data Augmentation**

Data augmentation is a technique used to enhance the size and quality of the training dataset by applying various transformations to the existing data. This process helps in making the model more robust and better suited for real-world scenarios.

## **Benefits of Data Augmentation**

Increases Training Set Size

Enhances Model Robustness

Improves Generalization
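
To make the idea concrete, here is a minimal sketch of two waveform-level augmentations, noise injection and time shifting, applied to a synthetic sine wave instead of a real recording. The 0.035 noise factor and the half-length shift limit mirror the constants used in this notebook; the sample rate, tone frequency and random seed are arbitrary assumptions. Pitch and speed changes are left out because they need an audio library.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# One second of a 220 Hz tone as a stand-in for a speech clip
sr = 8000
t = np.arange(sr) / sr
signal = 0.5 * np.sin(2 * np.pi * 220 * t)

# Noise injection: add scaled Gaussian noise to every sample
noisy = signal + 0.035 * rng.standard_normal(len(signal))

# Time shift: rotate the samples; np.roll wraps the end back to the start
shift = int(rng.integers(1, len(signal) // 2))
shifted = np.roll(signal, shift)

print(noisy.shape, shifted.shape)
```

Both outputs keep the original length, so they can be written to disk and labelled with the same emotion as the source clip.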

In [None]:
import soundfile as sf

# Constants
NOISE_FACTOR = 0.035
SHIFT_MAX = 0.5
PITCH_STEPS = 2
SPEED_FACTOR = 1.5

# Function to add noise to a signal
def add_noise(data, noise_factor=NOISE_FACTOR):
    noise = np.random.randn(len(data))
    augmented_data = data + noise_factor * noise
    return augmented_data

# Function to shift a signal in time
def shift_time(data, shift_max=SHIFT_MAX):
    shift = np.random.randint(int(len(data) * shift_max))
    augmented_data = np.roll(data, shift)
    return augmented_data

# Function to change the pitch of a signal
def change_pitch(data, sampling_rate, pitch_steps=PITCH_STEPS):
    return librosa.effects.pitch_shift(data, sr=sampling_rate, n_steps=pitch_steps)

# Function to change the speed of a signal
def change_speed(data, speed_factor=SPEED_FACTOR):
    return librosa.effects.time_stretch(data, rate=speed_factor)

# Apply data augmentation
augmented_data = []

for index, row in ravdess_df.iterrows():
    y, sr = librosa.load(row['File_Path'])

    # Apply noise injection
    noisy_data = add_noise(y)
    noisy_file_path = row['File_Path'].replace(".wav", "_noisy.wav")
    sf.write(noisy_file_path, noisy_data, sr)
    augmented_data.append({'Emotion': row['Emotion'], 'File_Path': noisy_file_path})

    # Apply time shifting
    shifted_data = shift_time(y)
    shifted_file_path = row['File_Path'].replace(".wav", "_shifted.wav")
    sf.write(shifted_file_path, shifted_data, sr)
    augmented_data.append({'Emotion': row['Emotion'], 'File_Path': shifted_file_path})

    # Apply pitch shifting
    pitched_data = change_pitch(y, sr)
    pitched_file_path = row['File_Path'].replace(".wav", "_pitched.wav")
    sf.write(pitched_file_path, pitched_data, sr)
    augmented_data.append({'Emotion': row['Emotion'], 'File_Path': pitched_file_path})

    # Apply speed change
    try:
        speed_data = change_speed(y, SPEED_FACTOR)
        speed_file_path = row['File_Path'].replace(".wav", "_speed.wav")
        sf.write(speed_file_path, speed_data, sr)
        augmented_data.append({'Emotion': row['Emotion'], 'File_Path': speed_file_path})
    except Exception as e:
        print(f"Error processing file {row['File_Path']} with speed factor {SPEED_FACTOR}: {e}")

augmented_df = pd.DataFrame(augmented_data)
final_df = pd.concat([ravdess_df, augmented_df], ignore_index=True)

print(final_df)

      Emotion                                          File_Path
0         sad  /content/ravdess/Train Set/Actor_08/03-01-04-0...
1     fearful  /content/ravdess/Train Set/Actor_08/03-01-06-0...
2     fearful  /content/ravdess/Train Set/Actor_08/03-01-06-0...
3     neutral  /content/ravdess/Train Set/Actor_08/03-01-01-0...
4       angry  /content/ravdess/Train Set/Actor_08/03-01-05-0...
...       ...                                                ...
3595  neutral  /content/ravdess/Train Set/Actor_11/03-01-01-0...
3596    angry  /content/ravdess/Train Set/Actor_11/03-01-05-0...
3597    angry  /content/ravdess/Train Set/Actor_11/03-01-05-0...
3598    angry  /content/ravdess/Train Set/Actor_11/03-01-05-0...
3599    angry  /content/ravdess/Train Set/Actor_11/03-01-05-0...

[3600 rows x 2 columns]


## **Feature Extraction**
Feature extraction is a crucial step in the process of building a machine learning model. It involves transforming raw audio signals into a set of features that can be effectively used by machine learning algorithms. These features capture important characteristics of the audio signal, enabling the model to learn and make predictions more accurately.

**Key Audio Features**

Mel-Frequency Cepstral Coefficients (MFCCs)

Chroma Features

Spectrogram

Zero-Crossing Rate (ZCR)

Spectral Contrast

Root Mean Square (RMS) Energy

**Benefits of Feature Extraction**

Dimensionality Reduction

Improved Model Performance

Interpretability
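
The dimensionality reduction comes largely from averaging frame-level features over time, which turns clips of any length into vectors of the same size. A minimal sketch with random matrices standing in for real features (`pool_features` is a hypothetical helper; the 13 and 12 row counts imitate MFCC and chroma matrices):

```python
import numpy as np

rng = np.random.default_rng(0)

def pool_features(*frame_matrices):
    """Average each (n_features, n_frames) matrix over time and concatenate."""
    return np.hstack([np.mean(m, axis=1) for m in frame_matrices])

# Stand-ins for frame-level features of two clips of different lengths
short_clip = (rng.normal(size=(13, 90)), rng.normal(size=(12, 90)))
long_clip = (rng.normal(size=(13, 250)), rng.normal(size=(12, 250)))

print(pool_features(*short_clip).shape, pool_features(*long_clip).shape)
```

Both clips yield a 25-element vector. By the same arithmetic, and assuming librosa's defaults of 12 chroma bins and 7 spectral-contrast rows, the extraction in this notebook would give 13 + 12 + 7 + 1 + 1 + 128 = 162 features per file.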

In [None]:
# Function to extract features from an audio file
def extract_features(file_path):
    y, sr = librosa.load(file_path, sr=None)

    # Extract MFCCs
    mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    mfccs_mean = np.mean(mfccs, axis=1)

    # Extract Chroma feature
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)
    chroma_mean = np.mean(chroma, axis=1)

    # Extract Spectral Contrast
    spectral_contrast = librosa.feature.spectral_contrast(y=y, sr=sr)
    spectral_contrast_mean = np.mean(spectral_contrast, axis=1)

    # Extract Zero Crossing Rate
    zero_crossing_rate = librosa.feature.zero_crossing_rate(y)
    zero_crossing_rate_mean = np.mean(zero_crossing_rate)

    # Extract Root Mean Square Energy
    S = np.abs(librosa.stft(y))
    rmse = librosa.feature.rms(S=S)
    rmse_mean = np.mean(rmse)

    # Extract Mel-spectrogram
    mel_spectrogram = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
    mel_spectrogram_mean = np.mean(mel_spectrogram, axis=1)

    # Concatenate all features
    features = np.hstack((mfccs_mean, chroma_mean, spectral_contrast_mean, zero_crossing_rate_mean, rmse_mean, mel_spectrogram_mean))

    return features

# Extract features for each file in the final_df
features_list = []
for index, row in final_df.iterrows():
    features = extract_features(row['File_Path'])
    features_list.append(features)

# Convert features_list to a DataFrame
features_df = pd.DataFrame(features_list)

# Add the Emotion column from final_df to features_df
features_df['Emotion'] = final_df['Emotion']

print(features_df)

               0          1          2          3          4          5  \
0    -761.483032  71.773376  14.633630  23.550846   8.441536  18.252174   
1    -578.749084  77.713196  -7.886116   6.420325   6.179074  11.582044   
2    -603.779297  64.604401  -6.578300   7.780734   0.403219  11.610985   
3    -712.075562  76.019432  12.366280  21.779139   8.561011  15.253534   
4    -563.111694  60.146988   0.494722  13.174312   3.693977   1.640879   
...          ...        ...        ...        ...        ...        ...   
3595 -745.760376  59.496197  24.346766  18.837852   9.742865   5.289154   
3596 -100.577202   9.175047  -0.578744  -2.271729  -5.858610  -3.676881   
3597 -349.046295  24.155973  -7.570185   1.372138 -12.815687  -5.276818   
3598 -386.030731  22.522659 -11.340427  -2.031952 -15.844463  -2.403362   
3599 -378.480286  27.758413  -9.050571  -0.054251 -14.226600  -5.936737   

             6          7          8          9  ...           153  \
0     0.082437  11.601342  -4

## **Modelling**

In [None]:
import tensorflow as tf
from tensorflow.keras import layers

emotion_labels = pd.get_dummies(features_df['Emotion']).to_numpy()

# Define the neural network model
model = tf.keras.Sequential([
    layers.Dense(128, activation='relu', input_shape=(features_df.shape[1] - 1,)),
    layers.BatchNormalization(),
    layers.Dropout(0.5),
    layers.Dense(64, activation='relu'),
    layers.BatchNormalization(),
    layers.Dropout(0.5),
    layers.Dense(32, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(5, activation='softmax')
])

# Compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Train on the augmented training set; validation_split=0.2 holds out the last 20% of rows for validation
model.fit(features_df.drop('Emotion', axis=1), emotion_labels, batch_size=10, epochs=300, validation_split=0.2)

# Save the model parameters
model.save_weights('/content/ravdess/softmax_regression_params.h5')


Epoch 1/300
Epoch 2/300
...

## **Testing**

In [None]:
test_set_list = os.listdir(test_set_path)

test_data = []

for actor_dir in test_set_list:
  actor_files = os.listdir(os.path.join(test_set_path, actor_dir))
  for file in actor_files:
    # The third hyphen-separated field of the file name is the emotion code
    part = file.split('.')[0]
    part = part.split('-')
    emotion = emotions.get(part[2], None)
    if emotion is not None and emotion in observed_emotions:
      file_path = os.path.join(test_set_path, actor_dir, file)
      test_data.append({'Emotion': emotion, 'File_Path': file_path})

test_ravdess_df = pd.DataFrame(test_data)

In [None]:
test_features_list = []
for index, row in test_ravdess_df.iterrows():
    features = extract_features(row['File_Path'])
    test_features_list.append(features)

# Convert test_features_list to a DataFrame
test_features_df = pd.DataFrame(test_features_list)

# Add the Emotion column from test_ravdess_df to test_features_df
test_features_df['Emotion'] = test_ravdess_df['Emotion']

In [None]:
test_emotion_labels = pd.get_dummies(test_features_df['Emotion']).to_numpy()
test_features_df_modified = test_features_df.drop('Emotion', axis=1)
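
The one-hot columns only line up between training and testing if both sets are encoded with the same class order. `pd.get_dummies` orders its columns by sorted label, which works here as long as all five emotions appear in both sets. A dependency-free sketch of the encoding and of mapping a predicted index back to an emotion name (`one_hot` and `CLASSES` are hypothetical names used only for illustration):

```python
def one_hot(labels, classes):
    """Encode each label as a vector with a single 1.0 at its class index."""
    index = {name: i for i, name in enumerate(classes)}
    return [[1.0 if index[label] == i else 0.0 for i in range(len(classes))]
            for label in labels]

# Fixed, sorted class order shared by training and test data
CLASSES = sorted(['happy', 'sad', 'angry', 'fearful', 'neutral'])

encoded = one_hot(['sad', 'angry'], CLASSES)

# Recover the label from the position of the largest value (argmax)
decoded = [CLASSES[row.index(max(row))] for row in encoded]

print(CLASSES)
print(decoded)
```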

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import numpy as np

def load_model_weights(model, filepath: str):
    # Load previously saved weights into an already-built model
    model.load_weights(filepath)

def predict_labels(model, features) -> np.ndarray:
    # The predicted class is the index of the largest softmax output
    predictions = model.predict(features)
    return np.argmax(predictions, axis=1)

def calculate_metrics(true_labels: np.ndarray, predicted_labels: np.ndarray):
    # Weighted averages weight each class by its number of true samples;
    # zero_division=0 scores a class as 0 when its metric is undefined
    accuracy = accuracy_score(true_labels, predicted_labels)
    precision = precision_score(true_labels, predicted_labels, average='weighted', zero_division=0)
    recall = recall_score(true_labels, predicted_labels, average='weighted', zero_division=0)
    f1 = f1_score(true_labels, predicted_labels, average='weighted', zero_division=0)

    print(f"Accuracy: {accuracy * 100:.2f}%")
    print(f"Precision: {precision * 100:.2f}%")
    print(f"Recall: {recall * 100:.2f}%")
    print(f"F1 Score: {f1 * 100:.2f}%")

    return accuracy, precision, recall, f1

def test_model(model, test_features_df_modified: np.ndarray, test_emotion_labels: np.ndarray, weights_filepath: str):
    """
    Test the model with the provided features and labels.

    Parameters:
    model: The machine learning model to be tested.
    test_features_df_modified (np.ndarray): Modified test features.
    test_emotion_labels (np.ndarray): True emotion labels for the test data.
    weights_filepath (str): Path to the saved model weights.

    Returns:
    tuple: Accuracy, precision, recall, and F1 score of the model on the test data.
    """
    # Load the saved model parameters
    load_model_weights(model, weights_filepath)

    # Make predictions on the test set
    y_pred_class = predict_labels(model, test_features_df_modified)

    # Calculate and return metrics
    return calculate_metrics(np.argmax(test_emotion_labels, axis=1), y_pred_class)

# Example call to the function
# Assuming 'model', 'test_features_df_modified', 'test_emotion_labels', and 'weights_filepath' are defined
accuracy, precision, recall, f1 = test_model(model, test_features_df_modified, test_emotion_labels, '/content/ravdess/softmax_regression_params.h5')
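
To show what `average='weighted'` computes, the sketch below works out weighted precision and recall by hand on a toy set of labels. It is an illustration of the definition, not a replacement for scikit-learn; `weighted_precision_recall` and the toy labels are made up for this example.

```python
from collections import Counter

def weighted_precision_recall(y_true, y_pred):
    """Per-class precision and recall, averaged with class support as weights."""
    support = Counter(y_true)
    total = len(y_true)
    precision = recall = 0.0
    for label, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        predicted = sum(1 for p in y_pred if p == label)
        # A class that is never predicted contributes 0, like zero_division=0
        class_precision = tp / predicted if predicted else 0.0
        class_recall = tp / n
        precision += class_precision * n / total
        recall += class_recall * n / total
    return precision, recall

y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 0]
precision, recall = weighted_precision_recall(y_true, y_pred)
print(round(precision, 3), round(recall, 3))
```

On this toy data, weighted recall equals plain accuracy (4 of 6 correct), which is why the two figures printed by `calculate_metrics` coincide.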

## **Conclusion**

The Speech Emotion Recognition (SER) model developed here achieves an accuracy of 50% to 55%. Accuracy would likely improve with a larger dataset, since RAVDESS is relatively small: more diverse training examples should improve both accuracy and robustness.