<a href="https://colab.research.google.com/github/MeidanGR/SER_DeepLearning_LSTM/blob/main/EmotionRecognition_DL_LSTM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Speech Emotion Recognition algorithm**
*Using Deep Learning (LSTM) model*

B.Sc. final project by Meidan Greenberg and Linoy Hadad

Instructor: Dr. Dima Alberg

# UNDER DEVELOPMENT.

# **PACKAGES & GOOGLE AUTH**

In [1]:
%%capture
!pip install soundfile
!pip install noisereduce

In [2]:
import numpy as np
import librosa
import soundfile as sf
import noisereduce as nr
import IPython.display as ipd #Audio player
import os
import pandas as pd



In [3]:
from google.colab import drive
drive.mount('/content/drive', force_remount = True)

Mounted at /content/drive


# **LOADING FULL DATA**
The speech emotion audio databases used:

- RAVDESS: https://zenodo.org/record/1188976#.X4sE0tDXKUl
  - 1440 files = 24 actors x 60 trials per actor
  - 8 emotions (neutral, calm, happy, sad, angry, fearful, disgust, surprised).
- TESS: https://tspace.library.utoronto.ca/handle/1807/24487
  - 2800 files = 2 actors x 200 phrases x 7 emotions
  - 7 emotions (neutral, happiness, sadness, anger, fear, disgust, pleasant surprise)
    - ('calm' is not part of this DB).


## **RAVDESS Database**

Each RAVDESS file has a unique filename. The filename consists of a 7-part numerical identifier (e.g., 02-01-06-01-02-01-12.mp4). These identifiers define the stimulus characteristics:

### Filename identifiers 

Modality (01 = full-AV, 02 = video-only, **03 = audio-only**).

Vocal channel (**01 = speech**, 02 = song).

**Emotion (01 = neutral, 02 = calm, 03 = happy, 04 = sad, 05 = angry, 06 = fearful, 07 = disgust, 08 = surprised).**

Emotional intensity (01 = normal, 02 = strong). NOTE: There is no strong intensity for the 'neutral' emotion.

Statement (01 = "Kids are talking by the door", 02 = "Dogs are sitting by the door").

Repetition (01 = 1st repetition, 02 = 2nd repetition).

Actor (01 to 24. Odd numbered actors are male, even numbered actors are female).

Only files named in the format 03-01-XX-XX-XX-XX-XX.wav were imported
 **(speech audio only)**
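
The identifier layout above can be sketched as a small parser. This is only an illustration: the function names and dictionary keys (`parse_ravdess_filename`, `is_speech_audio`, `modality`, and so on) are hypothetical and are not part of the notebook's own code.

```python
from typing import Dict

# Field order follows the identifier list above; the key names are illustrative.
FIELDS = ("modality", "vocal_channel", "emotion", "intensity",
          "statement", "repetition", "actor")

EMOTIONS = {"01": "neutral", "02": "calm", "03": "happy", "04": "sad",
            "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised"}


def parse_ravdess_filename(filename: str) -> Dict[str, str]:
    """Split a RAVDESS filename into its seven two-digit identifiers."""
    stem = filename.rsplit(".", 1)[0]          # drop the extension
    parts = stem.split("-")
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected 7 identifiers, got {len(parts)}: {filename!r}")
    return dict(zip(FIELDS, parts))


def is_speech_audio(info: Dict[str, str]) -> bool:
    """True for audio-only (03) speech (01) recordings."""
    return info["modality"] == "03" and info["vocal_channel"] == "01"


def actor_gender(info: Dict[str, str]) -> str:
    """Odd-numbered actors are male, even-numbered actors are female."""
    return "male" if int(info["actor"]) % 2 == 1 else "female"


info = parse_ravdess_filename("03-01-06-01-02-01-12.wav")
print(EMOTIONS[info["emotion"]])   # fearful
print(is_speech_audio(info))       # True
print(actor_gender(info))          # female
```

Parsing by position in the `-` separated list, rather than by character offset, keeps the lookup valid even if a file is stored with a different extension.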
 
## **TESS Database**

TESS file names contain the emotion as text, e.g. "YAF_youth_happy.wav". A find_emotion function is therefore used to map the file name to the matching RAVDESS emotion code.
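
As a hedged alternative to substring matching, the emotion can be read from the last `_`-separated token of the file name. The sketch assumes every TESS file follows the `<speaker>_<word>_<emotion>.wav` pattern of the example above; `tess_emotion_code` and `TESS_TO_CODE` are hypothetical names introduced here for illustration.

```python
import os
from typing import Optional

# TESS text label -> RAVDESS-style emotion code ('calm' / "02" does not exist in TESS).
TESS_TO_CODE = {"neutral": "01", "happy": "03", "sad": "04", "angry": "05",
                "fear": "06", "disgust": "07", "ps": "08"}


def tess_emotion_code(filename: str) -> Optional[str]:
    """Return the emotion code for a TESS-style name, or None if it does not match."""
    stem = os.path.splitext(os.path.basename(filename))[0]
    token = stem.rsplit("_", 1)[-1].lower()    # text after the last underscore
    return TESS_TO_CODE.get(token)


print(tess_emotion_code("YAF_youth_happy.wav"))        # 03
print(tess_emotion_code("03-01-06-01-02-01-12.wav"))   # None
```

Matching only the final token avoids a false hit when a short label such as `ps` happens to occur inside the spoken word part of the file name.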



---

# **FEATURE EXTRACTION**


In [4]:
#Loading BOTH Databases

def find_emotion(name):
  #Maps the emotion text in a TESS filename to the RAVDESS emotion code ("-1" if none is found).
  if 'neutral' in name: return "01"
  elif 'happy' in name: return "03"
  elif 'sad' in name: return "04"
  elif 'angry' in name: return "05"
  elif 'fear' in name: return "06"
  elif 'disgust' in name: return "07"
  elif 'ps' in name: return "08"
  else: return "-1"

audio_data = []
emotions = []
sample_rate = []

folder_path = '/content/drive/My Drive/AudioFiles'

for subdir, dirs, files in os.walk(folder_path):
  for file in files:
    try:
      x, sr = sf.read(os.path.join(subdir,file))  #Loading the audio frames into x, samplerate into sr.
      #f1
      #f2
      #f3
      #f4
      #f5
      name = find_emotion(file)        #TESS: emotion code from the text label
      if name == "-1":
        name = file[6:8]               #RAVDESS: the third identifier is the emotion code


      audio_data.append(x.T) #Adding each audio file signals into a list.
      #feature1.append(f1)
      #feature2
      #feature3
      #feature4
      #feature5
      emotions.append(name)  #Adding each emotion into a list.
      sample_rate.append(sr) #Adding each sample rate value to a list.

    except ValueError:
      continue



# **NORMALIZING AUDIO DATA**
& Creating a DataFrame

In [44]:
#Normalizing the audio_data arrays.

#The DataFrame is used for 1) visualization and
#2) converting the emotion number into a name with the 'map' function.
df = pd.DataFrame(columns= ['x', 'y','sample_rate'])
normal_array = []

for i in audio_data:
  norm = np.linalg.norm(i)
  normal_array.append(np.array(i/norm))

df.x = normal_array
df.y = emotions
df.y = df.y.map({'01' : 'neutral', '02' : 'calm', '03' : 'happy', '04' : 'sad', '05' : 'angry',
                             '06' : 'fearful', '07' : 'disgust', '08' : 'surprised'})
df.sample_rate = sample_rate

print(df.info())
df.sample(5)


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4240 entries, 0 to 4239
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   x            4240 non-null   object
 1   y            4240 non-null   object
 2   sample_rate  4240 non-null   int64 
dtypes: int64(1), object(2)
memory usage: 99.5+ KB
None


Unnamed: 0,x,y,sample_rate
3615,"[5.09682192897903e-06, 2.548410964489515e-06, ...",angry,48000
2935,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",happy,48000
2317,"[0.0, -1.9674605125965505e-05, -1.967460512596...",surprised,24414
1028,"[0.0, -2.4982736132206125e-05, -2.141377382760...",surprised,24414
3219,"[2.2758323052621864e-06, 2.2758323052621864e-0...",surprised,48000
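
The normalization loop above divides each signal by its L2 norm. A silent (all-zero) clip would have a zero norm, and the division would then produce NaN values. The following is a minimal sketch of a guarded version, assuming such clips should simply be kept as zeros; `l2_normalize` and its `eps` threshold are hypothetical and not part of the notebook.

```python
import numpy as np


def l2_normalize(signal, eps=1e-12):
    """Scale a signal to unit L2 norm; near-silent signals are returned unscaled."""
    signal = np.asarray(signal, dtype=np.float64)
    norm = np.linalg.norm(signal)      # same norm as in the loop above
    if norm < eps:                     # avoid 0/0 -> NaN for silent clips
        return signal.copy()
    return signal / norm


print(np.linalg.norm(l2_normalize([3.0, 4.0])))   # 1.0
print(l2_normalize([0.0, 0.0, 0.0]))              # [0. 0. 0.]
```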


# **AUDIO & EMOTION CHECKS**

In [43]:
#Enter a row num between [0-4239]
row = 100

print('Emotion:', df.y[row])
ipd.display(ipd.Audio(data = df.x[row], rate=df.sample_rate[row]))


Emotion: disgust


# **TRAIN & TEST SETS SPLIT**

In [56]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(df.x, df.y, test_size = 0.3, random_state = 42)
x_train #Displaying the training features (70% of the 4240 rows)


4097    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
3165    [3.438716217617463e-05, 3.438716217617463e-05,...
2499    [0.0, -1.4948124796742152e-05, -2.491354132790...
3345    [0.0, 0.0, -1.5327209705234457e-05, 0.0, 0.0, ...
3596    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
                              ...                        
3444    [0.00033954920337345075, 0.0003316527102717426...
466     [-4.6976614303288073e-05, -2.818596858197284e-...
3092    [-2.495588644726508e-05, -3.119485805908135e-0...
3772    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
860     [3.6520257815151707e-06, 4.5650322268939635e-0...
Name: x, Length: 2968, dtype: object