#  Audio Emotion Recognition

> Audio emotion recognition is a field of artificial intelligence and signal processing that focuses on the automatic detection and analysis of human emotions from audio data, such as speech or music.

### This Project Has Been Divided Into 9 Parts
- Understanding "Audio" Data
- Creating Metadata
- Extracting Data
- Exploring Data
- Mel-frequency cepstral coefficients (MFCCs)
- Processing Data for Deep Learning
- Setting up Deep Learning Model
- Training and Testing The Model
- Results

## 1. Understanding "Audio" Data

> "Audio" refers to sound, particularly in the form of vibrations or waves that travel through a medium, such as air, water, or solid objects
#### How Is Sound Represented?
There are several ways to represent a sound wave. The two most important are:
- Time Domain
> We usually represent sound as a waveform, a plot of "Amplitude" against "Time"
<img src = "waveform_img.png" style = "width:400px;height:200px"/>
- Frequency Domain
> Here we represent sound by its frequency content. A spectrum gives the "Amplitude" and "Phase" of each "Frequency", and a spectrogram shows how that frequency content changes over time
<img src = "spectogram_img.png" style = "width:400px;height:200px"/>
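
The two representations above can be compared on a synthetic tone. The following is a minimal sketch, assuming a hypothetical 440 Hz sine wave sampled at 8000 Hz (neither value comes from the dataset); it uses NumPy's FFT to move from the time domain to the frequency domain.

```python
import numpy as np

# Hypothetical test signal: a 440 Hz sine wave, 1 second long, sampled at 8000 Hz.
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate        # time axis in seconds
signal = 0.5 * np.sin(2 * np.pi * 440 * t)      # time domain: one amplitude per sample

# Frequency domain: the real FFT gives one complex value per frequency bin.
spectrum = np.fft.rfft(signal)
freqs = np.fft.rfftfreq(len(signal), d=1 / sample_rate)
magnitude = np.abs(spectrum)                    # amplitude of each frequency
phase = np.angle(spectrum)                      # phase of each frequency

# The strongest frequency bin should sit at the tone's frequency.
peak_freq = freqs[np.argmax(magnitude)]
print(peak_freq)   # 440.0
```

In the time domain the tone is 8000 amplitude values; in the frequency domain almost all of the energy falls into a single bin.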

## Importing Libraries

In [None]:
import os
import librosa
from librosa.display import waveshow
from IPython.display import Audio
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.io import wavfile
import warnings
from sklearn.model_selection import train_test_split
import keras
from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Dense, Flatten, Dropout, Conv2D, MaxPooling2D, Conv1D, MaxPooling1D
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
import seaborn as sns
warnings.filterwarnings('ignore')

## 2. Creating Metadata

#### Creating Metadata

The following recursive function walks through each folder and appends every `.wav` file path, together with a class label for its folder, to the list passed in.

In [None]:
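def get_metadata(path, list_of_files, class_) :
    
    ## list of all content in the folder, sorted case-insensitively so the labels are reproducible
    file_folder = sorted(os.listdir(path), key = str.lower)
    temp_class = class_
    
    ## traversing each content
    for content in file_folder :
        full_path = os.path.join(path, content)
        ## if it is a .wav file, then append it with the current class label
        if content.lower().endswith('.wav') :
            list_of_files.append((full_path, class_))
        ## if folder, then give it the next class label and make a recursive call
        elif os.path.isdir(full_path) :
            temp_class += 1
            get_metadata(full_path, list_of_files, temp_class)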

#### Initializing Variables

In [None]:
path = r'TESS Toronto emotional speech set data'
metadata = []
class_initial = -1

#### Function Call

In [None]:
get_metadata(path, metadata, class_initial)

#### Metadata

In [None]:
metadata = pd.DataFrame(metadata, columns = ['File_name', 'class'])
metadata

In [None]:
## adding age factor (this assumes that folder labels 0-6 belong to the younger speaker)
age_factor = list()
for class_ in metadata['class'] :
    if class_ < 7 :
        age_factor.append('young')
    else :
        age_factor.append('old')
metadata['Age_Factor'] = age_factor

df = metadata.copy()

def change_class(num_class) :
    if (num_class >= 7) :
        num_class -= 7
    return num_class

new_class = df['class'].apply(change_class)

df['class'] = new_class

df
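
The loop and `change_class` above can also be written as vectorized pandas operations. The following is a minimal sketch on a hypothetical toy DataFrame (`toy` and its `emotion` column are illustrative names, not part of the notebook); the modulo shortcut matches `change_class` only while the labels stay in the range 0 to 13.

```python
import numpy as np
import pandas as pd

# Hypothetical folder-level labels 0..13, two files per folder.
toy = pd.DataFrame({'class': np.repeat(np.arange(14), 2)})

# Same rule as the loop above: labels below 7 are tagged 'young', the rest 'old'.
toy['Age_Factor'] = np.where(toy['class'] < 7, 'young', 'old')

# Same effect as change_class for labels 0..13: fold 7..13 back onto 0..6.
toy['emotion'] = toy['class'] % 7

# Each age group should now cover the same 7 emotion labels.
print(toy.groupby('Age_Factor')['emotion'].nunique())
```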

## 3. Extracting Data

The following function loads the audio files and returns the audio signals.

In [None]:
def return_audio(files) :
    
    ## audio signals is the array of all the loaded audio files.
    audio_signals = []
    
    for file_path in files :
        ## load() returns the audio as an array of amplitudes, plus the sample rate (librosa resamples to 22050 Hz by default)
        audio, sample_rate = librosa.load(file_path)
        audio_signals.append(audio)
    
    return audio_signals

In [None]:
audio_signals = return_audio(np.array(df['File_name']))

In [None]:
## defining constant sample rate
sample_rate = 22050

## 4. Exploring Data

In [None]:
sample_audio = audio_signals[int(np.random.random() * 100)]
list(sample_audio[:10])

In [None]:
plt.figure(figsize=(8,2))
waveshow(sample_audio, sr = sample_rate)
Audio(data = sample_audio, rate = sample_rate)

- The above signal is in the TIME DOMAIN
> The audio file above has been converted into an array of samples.
- Note: The audio signal is represented as an array of amplitudes, one for each time instant
<img src = "waveform load.png" style = "width:400px;height:200px"/>
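
The link between the sample array and time can be made explicit. The following is a minimal sketch, assuming a hypothetical 2-second clip of random noise in place of a real recording: the position of a sample in the array divided by the sample rate gives its time in seconds.

```python
import numpy as np

sample_rate = 22050   # same constant as in the notebook

# Hypothetical stand-in for one loaded clip: 2 seconds of low-level noise.
rng = np.random.default_rng(0)
toy_audio = rng.uniform(-0.1, 0.1, size=2 * sample_rate).astype(np.float32)

# Duration follows directly from the number of samples.
duration_seconds = len(toy_audio) / sample_rate

# Time (in seconds) at which each sample was taken.
time_axis = np.arange(len(toy_audio)) / sample_rate

print(duration_seconds)         # 2.0
print(time_axis[sample_rate])   # 1.0
```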

In [None]:
df.to_csv('Class_ditribution.csv')

In [None]:
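## sort by class label so that the counts line up with the emotion names in the bar chart
class_distribution = df['class'].value_counts().sort_index().to_dict()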
class_distribution

In [None]:
emotions = ['Angry', 'Disgust', 'Fear', 'Happy', 'Neutral', 'Pleasant Surprise', 'Sad']
count = list(class_distribution.values())

plt.bar(emotions, count, color = 'purple')

## 5. Mel-frequency cepstral coefficients (MFCCs) 


Mel-Frequency Cepstral Coefficients (MFCCs) are a feature extraction technique widely used in audio signal processing and speech recognition. They give a compact description of the spectral shape of a signal on a scale that approximates human hearing, which makes them well suited to speech and audio analysis tasks.

MFCCs are computed in the following steps:

- MFCCs capture patterns in audio by breaking down the audio signal into frames (typically around 20-30 milliseconds each).

- For each frame, a Fourier Transform is applied to compute the power spectrum of the signal.

- The power spectrum is then filtered through a bank of Mel filters, which approximate human auditory perception.

- After filtering, the logarithm of the filter bank outputs is taken, followed by a Discrete Cosine Transform (DCT) to obtain the MFCC coefficients.

- These coefficients represent the audio signal's spectral content for each frame.

<img src = "mfccs_img.png" style = "width:600px;height:400px"/>
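
The steps listed above can be sketched from scratch in NumPy. This is an illustration under stated assumptions, not librosa's implementation: the frame length (25 ms), hop (10 ms), number of filters (26), mel formula, and the helper names `hz_to_mel`, `mel_to_hz`, `mel_filterbank` and `simple_mfcc` are all choices made for this example, so the values will differ from those returned by `librosa.feature.mfcc`.

```python
import numpy as np

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters whose centres are evenly spaced on the mel scale."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2), n_filters + 2)
    hz_points = mel_to_hz(mel_points)
    fft_freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    bank = np.zeros((n_filters, len(fft_freqs)))
    for i in range(n_filters):
        left, center, right = hz_points[i], hz_points[i + 1], hz_points[i + 2]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        bank[i] = np.maximum(0.0, np.minimum(rising, falling))
    return bank

def simple_mfcc(signal, sample_rate, n_mfcc=13, n_filters=26, frame_ms=25, hop_ms=10):
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    # 1. Split the signal into overlapping frames and apply a Hann window.
    idx = np.arange(frame_len)[None, :] + hop_len * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hanning(frame_len)
    # 2. Power spectrum of each frame via the Fourier transform.
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # 3. Pass the power spectrum through the mel filter bank.
    energies = power @ mel_filterbank(n_filters, frame_len, sample_rate).T
    # 4. Take the logarithm (the small constant avoids log(0)).
    log_energies = np.log(energies + 1e-10)
    # 5. DCT-II along the filter axis, keeping the first n_mfcc coefficients.
    n = np.arange(n_filters)
    basis = np.cos(np.pi * np.arange(n_mfcc)[:, None] * (2 * n[None, :] + 1) / (2 * n_filters))
    return log_energies @ basis.T   # shape: (n_frames, n_mfcc)

# Hypothetical input: a 1-second, 300 Hz tone at the notebook's sample rate.
sample_rate = 22050
t = np.arange(sample_rate) / sample_rate
toy_signal = np.sin(2 * np.pi * 300 * t)

toy_mfcc = simple_mfcc(toy_signal, sample_rate)
print(toy_mfcc.shape)                # (frames, coefficients)
print(toy_mfcc.mean(axis=0).shape)   # one averaged vector per clip
```

Averaging over the frame axis, as in the last line, reduces each clip to a single vector of 13 coefficients regardless of its length.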

#### Extracting MFCCs from Audio Files

In [None]:
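def extract_MFCCs(audio_signal, sample_rate) :
    ## mfcc() returns shape (n_mfcc, n_frames); transpose to (n_frames, n_mfcc)
    mfccs = (librosa.feature.mfcc(y = audio_signal, sr = sample_rate, n_mfcc = 13)).T
    ## average over the frames, giving one vector of 13 coefficients per audio file
    mfccs = np.mean(mfccs, axis = 0)
    return mfccs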

In [None]:
sample_rate = 22050
mfccs = list()
for audio in audio_signals :
    mfccs.append(extract_MFCCs(audio , sample_rate))

In [None]:
mfccs = pd.DataFrame(mfccs)
mfccs

## 6. Processing Data for Deep Learning

#### Train Test Split

In [None]:
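feature_data = mfccs.values
## note: the target uses the original folder-level labels in metadata, not the merged emotion labels in df
target = metadata['class']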

In [None]:
x_train, x_test, y_train, y_test = train_test_split(feature_data, target, random_state = 42)

#### Checking Shapes

In [None]:
x_train.shape, x_test.shape, y_train.shape, y_test.shape

#### Reshaping Data for the Neural Network

In [None]:
x_train_reshaped = x_train.reshape(x_train.shape[0], 13, 1)
y_train_reshaped = to_categorical(y_train, num_classes=len(set(target)), dtype='int')
x_test_reshaped = x_test.reshape(x_test.shape[0], 13, 1)
y_test_reshaped = to_categorical(y_test, num_classes=len(set(target)), dtype='int')
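
The effect of the reshape and of one-hot encoding can be checked on a small example. The following is a minimal sketch in plain NumPy, assuming a hypothetical batch of 4 samples and 14 classes; `np.eye` stands in for `to_categorical` here and is not the call used in the notebook.

```python
import numpy as np

# Hypothetical batch: 4 samples with 13 averaged MFCCs each, plus integer labels.
toy_x = np.arange(4 * 13, dtype=float).reshape(4, 13)
toy_y = np.array([0, 3, 13, 7])
num_classes = 14

# Conv1D expects (samples, steps, channels); here 13 steps and 1 channel.
toy_x_reshaped = toy_x.reshape(toy_x.shape[0], 13, 1)

# One-hot encoding: row i of the identity matrix encodes class i.
toy_y_onehot = np.eye(num_classes, dtype=int)[toy_y]

print(toy_x_reshaped.shape)   # (4, 13, 1)
print(toy_y_onehot.shape)     # (4, 14)
```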

## 7. Setting up Model for Deep Learning

#### The model used is a sequential Convolutional Neural Network (CNN)
<img src = "CNN_img.jpg"/>

#### Why CNN?
Convolutional Neural Networks (CNNs) are primarily associated with image processing, but they can also be adapted for audio classification tasks such as speech recognition, music genre classification, or environmental sound classification. One common approach is to feed a spectrogram to a 2D CNN, as in image-based models. Here, the model instead applies 1D convolutions over the 13 averaged MFCCs of each audio file.

#### Model Creation

In [None]:
class Audio_Classification :
    
    def __init__(self) :
        self.model = Sequential()
        input_shape = (13, 1)
        self.model.add(Conv1D(32, kernel_size=3, activation='selu', input_shape=input_shape))
        self.model.add(MaxPooling1D(pool_size=2))
        self.model.add(Conv1D(64, kernel_size=3, activation='selu'))
        self.model.add(MaxPooling1D(pool_size=2))
        self.model.add(Flatten())
        self.model.add(Dense(128, activation='selu'))
        self.model.add(Dense(14, activation='softmax'))
        self.model.compile(loss = 'categorical_crossentropy', optimizer = 'adam', metrics = ['accuracy'])
    
    def fit(self, x, y, epochs, validation) :
        self.model.fit(x, y, epochs = epochs, validation_data = validation)

    def predict(self, x) :
        ## predicted class = index of the highest softmax probability in each row
        y_prob = self.model.predict(x)
        y_pred = np.argmax(y_prob, axis = 1)
        return y_pred
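
The sequence lengths inside the model above can be traced by hand. The following is a minimal sketch, assuming the Keras defaults of no padding and stride 1 for `Conv1D`, and a stride equal to the pool size for `MaxPooling1D`; the helper names `conv1d_out` and `pool1d_out` are hypothetical.

```python
def conv1d_out(length, kernel_size):
    """Output length of an unpadded, stride-1 convolution."""
    return length - kernel_size + 1

def pool1d_out(length, pool_size):
    """Output length of max pooling with stride equal to pool_size."""
    return length // pool_size

length = 13                       # 13 averaged MFCCs per sample
length = conv1d_out(length, 3)    # Conv1D(32, kernel_size=3)  -> 11
length = pool1d_out(length, 2)    # MaxPooling1D(pool_size=2)  -> 5
length = conv1d_out(length, 3)    # Conv1D(64, kernel_size=3)  -> 3
length = pool1d_out(length, 2)    # MaxPooling1D(pool_size=2)  -> 1
flattened = length * 64           # Flatten: 1 step x 64 filters

print(flattened)   # 64
```

With only 13 input steps, two convolution and pooling stages shrink the sequence to a single step, so the dense layers receive 64 values per sample.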

## 8. Training and Testing

#### Creating Classifier

In [None]:
clf = Audio_Classification()

#### Training

In [None]:
clf.fit(x_train_reshaped, y_train_reshaped, epochs = 120, validation = (x_test_reshaped, y_test_reshaped))

#### Testing

In [None]:
y_pred = clf.predict(x_test_reshaped)
y_pred

## 9. Results

In [None]:
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
print("Accuracy : ",accuracy_score(y_test, y_pred) * 100, '%')
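
To show what these metrics measure, the following is a minimal sketch that builds a confusion matrix and accuracy by hand in NumPy. The labels are hypothetical toy values chosen for illustration; they are not results from the model.

```python
import numpy as np

# Hypothetical labels for illustration only; these are not model results.
toy_true = np.array([0, 1, 2, 2, 1, 0])
toy_pred = np.array([0, 2, 2, 2, 1, 1])
n_classes = 3

# Confusion matrix: rows are true classes, columns are predicted classes.
toy_cm = np.zeros((n_classes, n_classes), dtype=int)
np.add.at(toy_cm, (toy_true, toy_pred), 1)

# Accuracy is the diagonal (correct predictions) divided by all predictions.
toy_accuracy = np.trace(toy_cm) / toy_cm.sum()

print(toy_cm)
print(toy_accuracy)
```

Off-diagonal entries show which classes are confused with each other, which the single accuracy figure hides.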

### Submitted By - Prateek Sarna & Ayushi