<a href="https://colab.research.google.com/github/Devashish-dixit/CVIP/blob/main/SpeechEmotionRecognition.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **SPEECH EMOTION DETECTOR**

![image](https://camo.githubusercontent.com/d7ecf631b87e28e81820007e46b77650b51e2f756ab90849312b4fb3510371d5/68747470733a2f2f692e696d6775722e636f6d2f663154717669542e6a706567)

### Importing the libraries

In [54]:
import librosa
import soundfile
import os, glob
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

## Importing the Dataset

[dataset](https://drive.google.com/drive/folders/1rADOCszDo8xkuXIRlgxpFqBmvAh1Anpy?usp=sharing)

In [16]:
Root = "/content/drive/MyDrive/Data/audiodb"
os.chdir(Root)

In [40]:
# Extract features (MFCC, chroma, mel) from a sound file.
# Each feature is averaged over time, so the output length does not depend on clip duration.
def extract_feature(file_name, mfcc, chroma, mel):
    with soundfile.SoundFile(file_name) as sound_file:
        X = sound_file.read(dtype="float32")
        sample_rate = sound_file.samplerate
    result = np.array([])
    if mfcc:
        # 40 MFCCs, averaged across frames
        mfccs = np.mean(librosa.feature.mfcc(y=X, sr=sample_rate, n_mfcc=40).T, axis=0)
        result = np.hstack((result, mfccs))
    if chroma:
        # Chroma is computed from the magnitude of the short-time Fourier transform
        stft = np.abs(librosa.stft(X))
        chroma_mean = np.mean(librosa.feature.chroma_stft(S=stft, sr=sample_rate).T, axis=0)
        result = np.hstack((result, chroma_mean))
    if mel:
        # Mel spectrogram bands, averaged across frames
        mel_mean = np.mean(librosa.feature.melspectrogram(y=X, sr=sample_rate).T, axis=0)
        result = np.hstack((result, mel_mean))
    return result

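**MFCC (Mel Frequency Cepstral Coefficients):** A compact description of the short-term spectral envelope of a sound, computed on the mel scale. MFCCs summarize the overall shape of the spectrum (the timbre) in a handful of numbers per frame; here 40 coefficients are kept and averaged over time.

**Chroma:** The energy in each of the 12 pitch classes (C, C#, D, ..., B), with octaves folded together. It captures the pitch content of the signal regardless of which octave it falls in.

**Mel (Mel Scale):** A frequency scale that spaces pitches the way listeners perceive them: fine resolution at low frequencies and coarser resolution at high frequencies. A mel spectrogram is a spectrogram whose frequency axis has been mapped onto this scale.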
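
To make the mel scale more concrete, here is a minimal sketch of one common Hz-to-mel conversion (the HTK-style formula). It is for illustration only: librosa uses a different variant (Slaney) by default, and `hz_to_mel_htk` and `mel_to_hz_htk` are hypothetical helpers written for this example, not library functions.

```python
import math

def hz_to_mel_htk(freq_hz):
    """Convert a frequency in Hz to mels using the HTK-style formula."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)

def mel_to_hz_htk(mels):
    """Inverse of hz_to_mel_htk."""
    return 700.0 * (10.0 ** (mels / 2595.0) - 1.0)

# The same 100 Hz step covers far fewer mels at high frequencies than at low ones
low_step = hz_to_mel_htk(200) - hz_to_mel_htk(100)
high_step = hz_to_mel_htk(8100) - hz_to_mel_htk(8000)
print(f"100 -> 200 Hz spans {low_step:.1f} mel")
print(f"8000 -> 8100 Hz spans {high_step:.1f} mel")
```

The shrinking step size is what gives mel-based features finer resolution at low frequencies.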

In [41]:
emotions={
  '01':'neutral',
  '02':'calm',
  '03':'happy',
  '04':'sad',
  '05':'angry',
  '06':'fearful',
  '07':'disgust',
  '08':'surprised'
}

observed_emotions=['calm', 'happy', 'fearful', 'disgust']

In [42]:
def load_data(test_size=0.2):
    x,y=[],[]
    for file in glob.glob("/content/drive/MyDrive/Data/audiodb/Actor_*/*.wav"):
        file_name=os.path.basename(file)
        emotion=emotions[file_name.split("-")[2]]
        if emotion not in observed_emotions:
            continue
        feature=extract_feature(file, mfcc=True, chroma=True, mel=True)
        x.append(feature)
        y.append(emotion)
    return train_test_split(np.array(x), y, test_size=test_size, random_state=9)
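
The label lookup in `load_data` relies on the third hyphen-separated field of each file name being the emotion code. Below is a minimal sketch of that step in isolation. The file name is a made-up example that follows the same pattern, and `emotion_from_filename` is a hypothetical helper, not part of the notebook.

```python
import os

emotions = {
    '01': 'neutral', '02': 'calm', '03': 'happy', '04': 'sad',
    '05': 'angry', '06': 'fearful', '07': 'disgust', '08': 'surprised',
}
observed_emotions = ['calm', 'happy', 'fearful', 'disgust']

def emotion_from_filename(path):
    """Return the emotion label encoded in the third hyphen-separated field."""
    code = os.path.basename(path).split("-")[2]
    return emotions[code]

example_path = "Actor_12/03-01-06-01-02-01-12.wav"  # hypothetical file name
label = emotion_from_filename(example_path)
print(label, label in observed_emotions)
```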

## Splitting the dataset

### Train Test Split

In [43]:
x_train,x_test,y_train,y_test=load_data(test_size=0.25)

In [44]:
x_train

array([[-6.05917664e+02,  3.11016769e+01,  2.28704619e+00, ...,
         1.43771482e-04,  8.61270964e-05,  4.54914443e-05],
       [-4.38681702e+02,  4.31977043e+01, -1.47737875e+01, ...,
         2.28287303e-03,  1.38554757e-03,  5.29100071e-04],
       [-6.43259521e+02,  3.54421997e+01, -1.21700478e+00, ...,
         2.69463344e-04,  1.32105561e-04,  7.10968088e-05],
       ...,
       [-7.79930420e+02,  3.14263248e+01,  3.59853745e-01, ...,
         6.41600991e-06,  8.35428182e-06,  7.29339718e-06],
       [-6.21849548e+02,  5.53697090e+01,  1.48904428e+01, ...,
         4.00855992e-04,  2.09315054e-04,  5.39605644e-05],
       [-6.62194702e+02,  4.76044350e+01, -1.14914978e+00, ...,
         3.78642726e-05,  3.23660825e-05,  1.96181645e-05]])

In [45]:
print((x_train.shape[0], x_test.shape[0]))

(576, 192)
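
These counts are the training and test set sizes: 768 clips in total, with a quarter held out. A plain random split does not guarantee that each emotion is equally represented in the test set. The sketch below shows one way to enforce that with the `stratify` argument of `train_test_split`. It assumes 192 clips per emotion (consistent with 768 clips across four classes, but not stated in the notebook) and uses a placeholder feature matrix.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Assumed class balance: 192 clips for each of the four observed emotions
labels = np.repeat(['calm', 'happy', 'fearful', 'disgust'], 192)
features = np.zeros((len(labels), 180))  # placeholder for the real feature matrix

x_tr, x_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.25, random_state=9, stratify=labels
)
print(x_tr.shape[0], x_te.shape[0])

# Count how many test clips each emotion received
test_counts = dict(zip(*np.unique(y_te, return_counts=True)))
print(test_counts)
```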


In [46]:
print(f'Features extracted: {x_train.shape[1]}')

Features extracted: 180
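
The 180 features break down as 40 MFCCs, 12 chroma bins, and 128 mel bands; the last two are librosa's defaults. The sketch below reproduces the averaging and stacking with random stand-in arrays instead of real audio, so only the shapes are meaningful.

```python
import numpy as np

rng = np.random.default_rng(0)
n_frames = 50  # arbitrary number of time frames for the illustration

# Stand-ins for per-frame feature matrices (rows = coefficients, columns = frames)
mfcc_frames = rng.normal(size=(40, n_frames))    # n_mfcc=40
chroma_frames = rng.normal(size=(12, n_frames))  # 12 pitch classes
mel_frames = rng.normal(size=(128, n_frames))    # 128 mel bands

# Average each feature over time, then concatenate into one vector per clip
feature_vector = np.hstack([
    mfcc_frames.T.mean(axis=0),
    chroma_frames.T.mean(axis=0),
    mel_frames.T.mean(axis=0),
])
print(feature_vector.shape)
```

Because every feature is averaged over the time axis, the vector length stays at 180 no matter how long the clip is.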


## Building the model

### Using Multi Layer Perceptron Classifier

In [47]:
model=MLPClassifier(alpha=0.01, batch_size=256, epsilon=1e-08, hidden_layer_sizes=(300,), learning_rate='adaptive', max_iter=500)

In [48]:
model.fit(x_train,y_train)
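
The feature columns span very different ranges: in the `x_train` output above, the first MFCC is around -600 while the last mel values are around 1e-5. Neural networks often train more reliably on standardized inputs, so one option worth trying is to wrap the classifier in a pipeline with a `StandardScaler`. This is a sketch on synthetic data, not what the notebook does, and it makes no claim about the resulting accuracy.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for 180-dimensional features with mismatched column scales
X_demo = rng.normal(size=(80, 180)) * np.logspace(2, -5, 180)
y_demo = rng.choice(['calm', 'happy', 'fearful', 'disgust'], size=80)

# The scaler is fitted on the training data only and reused at prediction time
scaled_model = make_pipeline(
    StandardScaler(),
    MLPClassifier(alpha=0.01, hidden_layer_sizes=(300,), max_iter=200, random_state=0),
)
scaled_model.fit(X_demo, y_demo)
print(scaled_model.predict(X_demo[:5]))
```

Saving the whole pipeline, rather than the classifier alone, keeps the training-time scaling statistics attached to the model.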

In [49]:
y_pred=model.predict(x_test)

In [50]:
y_pred

array(['fearful', 'happy', 'fearful', 'fearful', 'calm', 'fearful',
       'calm', 'fearful', 'calm', 'fearful', 'calm', 'calm', 'fearful',
       'fearful', 'calm', 'calm', 'calm', 'calm', 'happy', 'calm',
       'disgust', 'fearful', 'calm', 'disgust', 'fearful', 'calm',
       'fearful', 'fearful', 'fearful', 'fearful', 'calm', 'calm',
       'fearful', 'calm', 'calm', 'happy', 'disgust', 'fearful',
       'disgust', 'calm', 'fearful', 'calm', 'calm', 'fearful', 'calm',
       'disgust', 'calm', 'calm', 'fearful', 'fearful', 'calm', 'happy',
       'happy', 'calm', 'calm', 'fearful', 'calm', 'disgust', 'calm',
       'disgust', 'calm', 'fearful', 'happy', 'happy', 'fearful', 'calm',
       'fearful', 'disgust', 'calm', 'happy', 'happy', 'happy', 'calm',
       'happy', 'disgust', 'calm', 'calm', 'disgust', 'happy', 'calm',
       'happy', 'fearful', 'fearful', 'happy', 'disgust', 'happy', 'calm',
       'fearful', 'fearful', 'happy', 'calm', 'happy', 'calm', 'fearful',
       'calm'

## Model Evaluation

In [51]:
accuracy=accuracy_score(y_true=y_test, y_pred=y_pred)
print("Accuracy: {:.2f}%".format(accuracy*100))

Accuracy: 68.23%


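The model reaches about 68% accuracy on the held-out test set. With four emotion classes, that is well above the 25% expected from random guessing if the classes are balanced, though it still leaves roughly one in three clips misclassified.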

## Communicating Results

In [53]:
df=pd.DataFrame({'Actual': y_test, 'Predicted':y_pred})
df.head(20)

| | Actual | Predicted |
|---|---|---|
| 0 | disgust | fearful |
| 1 | happy | happy |
| 2 | fearful | fearful |
| 3 | fearful | fearful |
| 4 | calm | calm |
| 5 | fearful | fearful |
| 6 | calm | calm |
| 7 | fearful | fearful |
| 8 | calm | calm |
| 9 | happy | fearful |
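
A single accuracy figure hides which emotions get confused with each other. As a small illustration, the sketch below tallies the ten rows shown above by hand. Ten rows are far too few to draw conclusions from; on the full test set, `sklearn.metrics.confusion_matrix(y_test, y_pred)` would give the complete picture.

```python
from collections import Counter

# The ten (Actual, Predicted) pairs displayed above
actual = ['disgust', 'happy', 'fearful', 'fearful', 'calm',
          'fearful', 'calm', 'fearful', 'calm', 'happy']
predicted = ['fearful', 'happy', 'fearful', 'fearful', 'calm',
             'fearful', 'calm', 'fearful', 'calm', 'fearful']

correct = sum(a == p for a, p in zip(actual, predicted))
accuracy = correct / len(actual)

# Count each (actual, predicted) pair where the model was wrong
confusions = Counter((a, p) for a, p in zip(actual, predicted) if a != p)

print(f"{correct}/{len(actual)} correct ({accuracy:.0%})")
for (a, p), n in confusions.items():
    print(f"{a} misclassified as {p}: {n}")
```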


In [56]:
import pickle
# Save the trained model to disk
with open( '/content/drive/MyDrive/Data/modelForPrediction1.sav', 'wb') as f:
    pickle.dump(model,f)
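
A quick way to check a saved model is to load it back and confirm that it predicts the same labels as the original. The sketch below does this with a tiny synthetic model and an in-memory buffer instead of the Drive path, so the names `demo_model` and `restored_model` are illustrative only. Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.

```python
import io
import pickle

import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(40, 5))
y_demo = rng.choice(['calm', 'happy'], size=40)
demo_model = MLPClassifier(hidden_layer_sizes=(10,), max_iter=200, random_state=0)
demo_model.fit(X_demo, y_demo)

# Round-trip the model through pickle using an in-memory buffer
buffer = io.BytesIO()
pickle.dump(demo_model, buffer)
buffer.seek(0)
restored_model = pickle.load(buffer)

same = np.array_equal(demo_model.predict(X_demo), restored_model.predict(X_demo))
print("Predictions match after reload:", same)
```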

# **Try It Yourself !**

### Run the cell to upload your own audio. Make sure the audio is in .wav format.
Note: The features must be extracted exactly as they were for training (40 MFCCs, 12 chroma bins, and 128 mel bands, 180 values in total). A different microphone or recording software does not change the feature size, but it may reduce the accuracy of the model.

In [66]:
# Import necessary libraries
import numpy as np
import librosa
import IPython.display as ipd
from google.colab import files
import pickle

# Load the trained MLP classifier from storage
filename = '/content/drive/MyDrive/Data/modelForPrediction1.sav'
with open(filename, 'rb') as f:
    loaded_model = pickle.load(f)

# Define a function to preprocess audio and make predictions
def predict_emotion(audio_file):
    # Load the uploaded audio file at its native sample rate (mixed down to mono)
    audio, sr = librosa.load(audio_file, sr=None)

    # Extract the same features used for training: 40 MFCCs, chroma from the
    # STFT magnitude, and the mel spectrogram, each averaged over time
    mfccs = np.mean(librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=40).T, axis=0)
    stft = np.abs(librosa.stft(audio))
    chroma = np.mean(librosa.feature.chroma_stft(S=stft, sr=sr).T, axis=0)
    mel = np.mean(librosa.feature.melspectrogram(y=audio, sr=sr).T, axis=0)

    # Combine the extracted features into a single 180-dimensional vector.
    # No padding or rescaling is applied: the model was trained on unscaled
    # features, and a scaler fitted on a single sample would zero out every value.
    feature_vector = np.hstack((mfccs, chroma, mel))

    # Make a prediction using the loaded MLP model
    predicted_class = loaded_model.predict(feature_vector.reshape(1, -1))

    return predicted_class

# Create an upload button for the user to upload an audio file
uploaded = files.upload()

# Process each uploaded file: predict its emotion and display an audio player
for uploaded_name in uploaded.keys():
    print("Uploaded file:", uploaded_name)
    prediction = predict_emotion(uploaded_name)  # Pass the filename directly
    print("Predicted emotion class:", prediction)
    ipd.display(ipd.Audio(filename=uploaded_name))


Saving record (2).wav to record (2) (1).wav
Uploaded file: record (2) (1).wav
Predicted emotion class: ['happy']
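
One pitfall to keep in mind if feature scaling is used in the prediction step: a `StandardScaler` fitted on a single uploaded sample subtracts that sample from itself, so every feature becomes zero and the model receives the same input for every recording. Any scaling has to reuse statistics fitted on the training data. A minimal sketch with made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up training features (3 clips, 3 features) and one new clip
train_features = np.array([
    [-600.0, 30.0, 1e-4],
    [-440.0, 43.0, 2e-3],
    [-640.0, 35.0, 3e-4],
])
new_sample = np.array([[-620.0, 55.0, 4e-4]])

# Fitting on the single new sample: its mean equals the sample itself
refit_on_sample = StandardScaler().fit_transform(new_sample)

# Reusing a scaler fitted on the training data keeps the sample informative
train_scaler = StandardScaler().fit(train_features)
reused = train_scaler.transform(new_sample)

print(refit_on_sample)
print(reused)
```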


*This project is a speech emotion recognition system that classifies the emotional state conveyed in spoken language by analyzing the acoustic features of the speech signal (MFCC, chroma, and mel spectrogram). The model here is trained on four emotions (calm, happy, fearful, and disgust) and reaches about 68% test accuracy; the same approach can be extended to the other labeled emotions, such as sadness, anger, and surprise.*