<a href="https://colab.research.google.com/github/MeidanGR/SER_DeepLearning_LSTM/blob/main/EmotionRecognition_DL_LSTM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Speech Emotion Recognition algorithm**
*Using Deep Learning (LSTM) model*

B.Sc. Final project by Meidan Greenberg; Linoy Hadad;

Instructor: Dr. Dima Alberg

# UNDER DEVELOPMENT.

# **PACKAGES & GOOGLE AUTH**

In [1]:
%%capture
!pip install soundfile
!pip install noisereduce
!pip install pydub
!pip install pywt

In [2]:
import numpy as np
import librosa
import pywt
import noisereduce as nr
import IPython.display as ipd #Audio player
import os
import pandas as pd
import sklearn
from pydub import AudioSegment, effects


  from tqdm.autonotebook import tqdm


In [27]:
from google.colab import drive
drive.mount('/content/drive', force_remount = True)

Mounted at /content/drive


# **LOADING FULL DATA**
The speech emotion audio databases used:

- RAVDASS: https://zenodo.org/record/1188976#.X4sE0tDXKUl
  - 1440 files = 24 actors x 60 trails per actor
  - 8 Emotions (neutral, calm, happy, sad, angry, fearful, disgust, surprised).
-TESS: https://tspace.library.utoronto.ca/handle/1807/24487
  - 2800 files = 2 actors x 200 pharses x 7 emotions
  - 7 Emotions (neutral, happiness, sadness, anger, fear, disgust, pleasant surprise)
   - ('calm' is not a part of this DB).


## **RAVDESS Database**

All of  RAVDESS files has a unique filename. The filename consists of a 7-part numerical identifier (e.g., 02-01-06-01-02-01-12.mp4). These identifiers define the stimulus characteristics: 

### Filename identifiers 

Modality (01 = full-AV, 02 = video-only, **03 = audio-only**).

Vocal channel (**01 = speech**, 02 = song).

**Emotion (01 = neutral, 02 = calm, 03 = happy, 04 = sad, 05 = angry, 06 = fearful, 07 = disgust, 08 = surprised).**

Emotional intensity (01 = normal, 02 = strong). NOTE: There is no strong intensity for the 'neutral' emotion.

Statement (01 = "Kids are talking by the door", 02 = "Dogs are sitting by the door").

Repetition (01 = 1st repetition, 02 = 2nd repetition).

Actor (01 to 24. Odd numbered actors are male, even numbered actors are female).

Only files with the format: 03-01-XX-XX-XX-XX-XX.Wav has been imported
 **(speech audio only)**
 
## **TESS Database**

The TESS Database file name contain the emotion by text, for e.g. "YAF_youth_happy.wav". Therefore a find_emotion function has been executed.



---

# **FEATURE EXTRACTION**
...


In [20]:
#Normalization test for one sample.


print('Original: librosa.load')
x,sr = librosa.load('/content/drive/My Drive/AudioFiles/RAVDESS/Actor_16/03-01-08-02-01-01-16.wav', sr = None, duration = None)
print(np.shape(x))
ipd.display(ipd.Audio(data = x, rate=sr))

from pydub import AudioSegment, effects  
rawsound = AudioSegment.from_file('/content/drive/My Drive/AudioFiles/RAVDESS/Actor_16/03-01-08-02-01-01-16.wav', "wav",duration = None) 
normalizedsound = effects.normalize(rawsound, headroom = 0)  

print('Normalized: AudioSegment.from_file')
samples = normalizedsound.get_array_of_samples()
print(np.shape(samples))
normalizedsound

#ipd.display(ipd.Audio(data = rawsound, rate=sr))
ipd.display(ipd.Audio(data = samples, rate=sr))


Original: librosa.load
(169770,)


Normalized: AudioSegment.from_file
(169770,)


In [28]:
#Emotion kind validation function for TESS database, due to the emotions are written within the file names.
def find_emotion(name): 
        if('neutral' in name): return "01"
        elif('happy' in name): return "03"
        elif('sad' in name): return "04"
        elif('angry' in name): return "05"
        elif('fear' in name): return "06"
        elif('disgust' in name): return "07"
        elif('ps' in name): return "08"
        else: return "-1"

#Initizlizing data lists
audio_data = []
sample_rate = []
mfcc = []
zcr = []
rms = []
dwt = []
pitch = []
emotions = []

#Running over BOTH databases for frames & features extraction (x)
folder_path = '/content/drive/My Drive/AudioFiles'


for subdir, dirs, files in os.walk(folder_path):
  for file in files:
    try:  

  # Loading file frames & normalizing to -3 dBFS
      rawsound = AudioSegment.from_file(os.path.join(subdir,file), "wav",duration = None) 
      normalizedsound = effects.normalize(rawsound, headroom = -3) #-3 dBFS
      normal_x = np.array(normalizedsound.get_array_of_samples(), dtype = float)

      _, sr = librosa.load(path = os.path.join(subdir,file), sr = None) # sr (the sample rate) is used for librosa's features extraction, _ is irrelevant.

  # Features extraction
      f1 = librosa.feature.mfcc(normal_x, sr=sr, S=None, n_mfcc=13,dct_type=2, norm='ortho', lifter=0) # MFCC
      f2 = librosa.feature.zero_crossing_rate(normal_x , frame_length=2048, hop_length=512,center=True) # ZCR
      f3 = librosa.feature.rms(normal_x , S=None, frame_length=2048, hop_length=512, center=True, pad_mode='reflect') # Energy - Root Mean Square
      cA, cD = pywt.dwt(normal_x, 'db2', 'sym') # DWT
      f0 = librosa.yin(normal_x, fmin=20, fmax = 20000, sr = sr, frame_length=2048, win_length=None, hop_length=None) # Pitch

  # Emotion extraction (y)      
      if (find_emotion(file) != "-1"): #TESS database validation
        name = find_emotion(file)
      else: 
        name = file[6:8]               #RAVDESS database validation

  # Filling the data lists for each iteration (each file)
      audio_data.append(x.T) 
      sample_rate.append(sr) 
      mfcc.append(f1)
      zcr.append(f2)
      rms.append(f3)
      dwt.append(cD)
      pitch.append(f0)
      emotions.append(name)  


    except ValueError:
      continue

# **NORMALIZING AUDIO DATA**
& Creating a DataFrame

In [29]:
df = pd.DataFrame(columns= ['x','sample_rate', 'mfccs', 'zcr', 'rms', 'dwt', 'pitch', 'y']) #DataFrame is used to 1) Visualization, 
#                                                                          2) Converting emotion number into a name using 'map' function.

df.x = audio_data
df.sample_rate = sample_rate
df.mfccs = mfcc
df.zcr = zcr
df.rms = rms
df.dwt = dwt
df.pitch = pitch

df.y = emotions
df.y = df.y.map({'01' : 'neutral', '02' : 'calm', '03' : 'happy', '04' : 'sad', '05' : 'angry',
                             '06' : 'fearful', '07' : 'disgust', '08' : 'suprised'})

print(df.info())
df.sample(5)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4238 entries, 0 to 4237
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   x            4238 non-null   object
 1   sample_rate  4238 non-null   int64 
 2   mfccs        4238 non-null   object
 3   zcr          4238 non-null   object
 4   rms          4238 non-null   object
 5   dwt          4238 non-null   object
 6   pitch        4238 non-null   object
 7   y            4238 non-null   object
dtypes: int64(1), object(7)
memory usage: 265.0+ KB
None


Unnamed: 0,x,sample_rate,mfccs,zcr,rms,dwt,pitch,y
1698,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",24414,"[[678.4956000361758, 702.2634662622502, 720.15...","[[0.31689453125, 0.5126953125, 0.70947265625, ...","[[877.7609810442419, 1323.1771430738686, 1626....","[136.5590531601622, -80.40412322426191, -73.15...","[8700.37095650277, 8495.938053468537, 8701.417...",happy
626,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",24414,"[[744.8907481900668, 779.5290740098569, 790.44...","[[0.25, 0.43603515625, 0.6337890625, 0.7524414...","[[1275.0781968974002, 1358.5874248702676, 1842...","[9.185586535436919, -91.8012736516881, -61.680...","[8512.644090898375, 7863.767103303348, 8562.42...",suprised
719,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",24414,"[[683.4937029036714, 739.5293790830917, 775.00...","[[0.2431640625, 0.42138671875, 0.60595703125, ...","[[834.5935774094628, 1062.2934293678584, 1756....","[102.26619676119768, -1.4334630108741346, 45.7...","[8709.93906161981, 8197.19345716051, 8709.0923...",suprised
1201,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",24414,"[[562.9443446802483, 651.8610629387908, 732.02...","[[0.02294921875, 0.10546875, 0.212890625, 0.34...","[[537.1879958367404, 629.6434976095849, 956.80...","[8.573214099741122, 7.734208230300747, -5.8556...","[4677.147815061305, 24414.0, 4679.393543550255...",suprised
3801,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",48000,"[[521.9953740060787, 521.9953740060787, 521.99...","[[0.015625, 0.015625, 0.017578125, 0.014648437...","[[1.443601182806387, 1.4252192813739224, 1.057...","[1.8371173070873836, 3.5702090829932454, 0.284...","[47.37542585579849, 72.30601363435392, 84.3026...",fearful


In [23]:
#df.sample(frac=1)
#df.to_csv('/content/drive/My Drive/fulldataframe.csv')
df.to_string()

# AUDIO & EMOTION CHECKS

In [None]:
#Enter a row num between [0-4239]
row = 100

print('Emotion:', df.y[row])
ipd.display(ipd.Audio(data = df.x[row], rate=df.sample_rate[row]))


# **TRAIN & TEST SETS SPLIT**

In [None]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(df.x, df.y, test_size = 0.3, random_state = 42)
