## Speech Emotion Recognition: Audio Classification

Dataset Source: https://www.kaggle.com/datasets/dmitrybabko/speech-emotion-recognition-en

#### Install Missing Libraries

In [1]:
%pip install -U numpy==1.23.5 transformers==4.26.1
%pip install tensorboard ipywidgets
%pip install pandas
%pip install datasets IPython
%pip install torch torchaudio evaluate tqdm 
%pip install soundfile librosa

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


#### Import Necessary Libraries

In [2]:
import os, sys, random, glob
os.environ['TOKENIZERS_PARALLELISM']='false'

import numpy as np
import pandas as pd

import datasets
from datasets import load_dataset, Audio, DatasetDict, Features, ClassLabel

import torch

import transformers
from transformers import AutoFeatureExtractor, AutoModelForAudioClassification
from transformers import TrainingArguments, Trainer

import evaluate

from IPython.display import display

!git lfs install

Error: Failed to call git rev-parse --git-dir: exit status 128 
Git LFS initialized.


#### Access to HuggingFace Hub

In [3]:
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|
    
    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Token: 
Add token as git credential? (Y/n) y
Token is valid.
[1m[31mCannot authenticate through git-credential as no helper is defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub.
Run the following command in your terminal in case you want to set the 'store' credential he

#### Mount Google Drive

In [4]:
from google.colab import drive

drive.mount("/content/drive")

Mounted at /content/drive


#### Display Library Versions

In [5]:
print("Python:".rjust(15), sys.version.split()[0])  # full version string, e.g. also correct for 3.10.x
print("NumPy:".rjust(15), np.__version__)
print("Pandas:".rjust(15), pd.__version__)
print("Datasets:".rjust(15), datasets.__version__)
print("Torch:".rjust(15), torch.__version__)
print("Transformers:".rjust(15), transformers.__version__)
print("Evaluate:".rjust(15), evaluate.__version__)

        Python: 3.9.16
         NumPy: 1.23.5
        Pandas: 2.0.0
      Datasets: 2.11.0
         Torch: 2.0.0+cu118
  Transformers: 4.26.1
      Evaluate: 0.4.0


#### Create Dictionaries to Convert Labels Between Strings & Integers

In [6]:
labels = ["SAD", 
          "ANGRY",
          "DISGUST",
          "FEAR",
          "HAPPY",
          "NEUTRAL"]


NUM_OF_LABELS = len(labels)

label2id, id2label = dict(), dict()

for i, label in enumerate(labels):
    label2id[label] = str(i)
    id2label[str(i)] = label

initial_label_update = {"SAD": "SAD", 
                        "ANG": "ANGRY",
                        "DIS": "DISGUST",
                        "FEA": "FEAR",
                        "HAP": "HAPPY",
                        "NEU": "NEUTRAL"}


print(labels)
print(NUM_OF_LABELS)
print(label2id)
print(id2label)

['SAD', 'ANGRY', 'DISGUST', 'FEAR', 'HAPPY', 'NEUTRAL']
6
{'SAD': '0', 'ANGRY': '1', 'DISGUST': '2', 'FEAR': '3', 'HAPPY': '4', 'NEUTRAL': '5'}
{'0': 'SAD', '1': 'ANGRY', '2': 'DISGUST', '3': 'FEAR', '4': 'HAPPY', '5': 'NEUTRAL'}


#### Prepare Metadata File

In [8]:
parent_dir = "/content/drive/MyDrive/Speech Emotion Recognition/data"

dir_path = os.path.join(parent_dir, "*.wav")

files_and_name = glob.glob(dir_path)

metadata = pd.DataFrame(files_and_name, columns=["file_path"])

metadata['file_name'] = metadata['file_path'].apply(lambda x: x.split("/")[-1])

metadata['label'] = metadata['file_path'].apply(lambda x: x.split("/")[-1].split("_")[-2])
metadata['label'] = metadata['label'].replace(initial_label_update)
metadata['label'] = metadata['label'].replace(label2id)

metadata = metadata.drop(columns=["file_path"])

metadata_file_location = os.path.join(parent_dir, "metadata.csv")
metadata.to_csv(metadata_file_location, index=False)

metadata.head()

|   | file_name | label |
|---|---|---|
| 0 | 1080_TIE_HAP_XX.wav | 4 |
| 1 | 1079_ITH_DIS_XX.wav | 2 |
| 2 | 1079_IOM_NEU_XX.wav | 5 |
| 3 | 1080_WSI_ANG_XX.wav | 1 |
| 4 | 1081_IOM_SAD_XX.wav | 0 |
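
The label extraction above relies on the file naming pattern visible in the output (for example `1080_TIE_HAP_XX.wav`), where the second-to-last underscore-separated field is the emotion code. A minimal sketch of the same logic as a standalone function follows. The helper name `parse_label` and the constant names are hypothetical, and unlike the notebook (which stores ids as strings before writing the CSV) this sketch returns integer ids.

```python
import os

# Same code-to-label mapping and label order as defined earlier in the notebook.
EMOTION_CODES = {"SAD": "SAD", "ANG": "ANGRY", "DIS": "DISGUST",
                 "FEA": "FEAR", "HAP": "HAPPY", "NEU": "NEUTRAL"}
LABELS = ["SAD", "ANGRY", "DISGUST", "FEAR", "HAPPY", "NEUTRAL"]
LABEL2ID = {label: i for i, label in enumerate(LABELS)}


def parse_label(file_path):
    """Return the integer label id encoded in a file name (hypothetical helper)."""
    file_name = os.path.basename(file_path)
    fields = file_name.split("_")
    if len(fields) < 2:
        raise ValueError(f"Unexpected file name: {file_name}")
    code = fields[-2]  # second-to-last field, as in the notebook
    if code not in EMOTION_CODES:
        raise ValueError(f"Unknown emotion code {code!r} in {file_name}")
    return LABEL2ID[EMOTION_CODES[code]]


print(parse_label("/data/1080_TIE_HAP_XX.wav"))
print(parse_label("1079_ITH_DIS_XX.wav"))
```

Raising on an unknown code is a design choice of this sketch: `pandas.Series.replace` would silently leave an unmapped code in place, which would only surface later when the dataset is loaded.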


#### Ingest & Preprocess Dataset

In [9]:
audio_data = load_dataset(parent_dir)

audio_data

Resolving data files:   0%|          | 0/7443 [00:00<?, ?it/s]

Downloading and preparing dataset audiofolder/data to /root/.cache/huggingface/datasets/audiofolder/data-4f6c853362c1b07a/0.0.0/6cbdd16f8688354c63b4e2a36e1585d05de285023ee6443ffd71c4182055c0fc...


Downloading data files:   0%|          | 0/7443 [00:00<?, ?it/s]

Downloading data files: 0it [00:00, ?it/s]

Extracting data files: 0it [00:00, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

Dataset audiofolder downloaded and prepared to /root/.cache/huggingface/datasets/audiofolder/data-4f6c853362c1b07a/0.0.0/6cbdd16f8688354c63b4e2a36e1585d05de285023ee6443ffd71c4182055c0fc. Subsequent calls will reuse this data.


  0%|          | 0/1 [00:00<?, ?it/s]

DatasetDict({
    train: Dataset({
        features: ['audio', 'label'],
        num_rows: 7442
    })
})

In [10]:
audio_data['train'][10]

{'audio': {'path': '/content/drive/MyDrive/Speech Emotion Recognition/data/1001_IEO_DIS_LO.wav',
  'array': array([-0.00228882, -0.00204468, -0.00180054, ...,  0.        ,
          0.        ,  0.        ]),
  'sampling_rate': 16000},
 'label': 2}

#### Cast Audio Feature to Data Type of Audio

In [11]:
audio_data = audio_data.cast_column("audio", Audio(sampling_rate=16_000))

#### Split Dataset into Training & Testing Datasets

In [12]:
audio_data = audio_data.shuffle(seed=42)

# Pass a seed so the train/eval split is reproducible across runs.
audio_data_split = audio_data['train'].train_test_split(test_size=0.25, seed=42)

ds = DatasetDict({
    'train' : audio_data_split['train'],
    'eval' : audio_data_split['test']
})
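
The split sizes can be checked by hand. The sketch below assumes the rounding convention used by scikit-learn style splits (round the test share up, the train share down). That convention is an assumption here, but it reproduces the row counts reported for the 7442-row dataset in the next section.

```python
import math

num_rows = 7442       # rows reported by load_dataset above
test_size = 0.25      # fraction passed to train_test_split

# Assumed convention: ceil for the test share, floor for the train share.
n_test = math.ceil(test_size * num_rows)
n_train = math.floor((1 - test_size) * num_rows)

print(f"train rows: {n_train}, eval rows: {n_test}")
```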

#### Some Information About Training & Evaluation Datasets

In [13]:
print("Training Dataset")
print("Training Dataset Info: ", ds['train'])
print("First Sample in Training Dataset", ds['train'][0])
print("Last Sample in Training Dataset", ds['train'][-1])
print("Unique Values in Label/Class: ", sorted(ds['train'].unique("label")))

print("\n\nEvaluation Dataset")
print("Evaluation Dataset Info: ", ds['eval'])
print("First Sample in Evaluation Dataset", ds['eval'][0])
print("Last Sample in Evaluation Dataset", ds['eval'][-1])
print("Unique Values in Label/Class: ", sorted(ds['eval'].unique("label")))

Training Dataset
Training Dataset Info:  Dataset({
    features: ['audio', 'label'],
    num_rows: 5581
})
First Sample in Training Dataset {'audio': {'path': '/content/drive/MyDrive/Speech Emotion Recognition/data/1004_IEO_FEA_MD.wav', 'array': array([-0.00234985, -0.00180054, -0.00143433, ...,  0.        ,
        0.        ,  0.        ]), 'sampling_rate': 16000}, 'label': 3}
Last Sample in Training Dataset {'audio': {'path': '/content/drive/MyDrive/Speech Emotion Recognition/data/1053_TIE_DIS_XX.wav', 'array': array([ 5.06591797e-03,  4.39453125e-03,  4.05883789e-03, ...,
       -6.10351562e-05,  0.00000000e+00,  9.15527344e-05]), 'sampling_rate': 16000}, 'label': 2}


Flattening the indices:   0%|          | 0/5581 [00:00<?, ? examples/s]

Unique Values in Label/Class:  [0, 1, 2, 3, 4, 5]


Evaluation Dataset
Evaluation Dataset Info:  Dataset({
    features: ['audio', 'label'],
    num_rows: 1861
})
First Sample in Evaluation Dataset {'audio': {'path': '/content/drive/MyDrive/Speech Emotion Recognition/data/1066_IWL_ANG_XX.wav', 'array': array([-3.35693359e-04, -1.43432617e-03, -6.40869141e-04, ...,
       -3.05175781e-05,  1.22070312e-04,  2.74658203e-04]), 'sampling_rate': 16000}, 'label': 1}
Last Sample in Evaluation Dataset {'audio': {'path': '/content/drive/MyDrive/Speech Emotion Recognition/data/1046_TAI_FEA_XX.wav', 'array': array([-0.012146  , -0.01245117, -0.01229858, ...,  0.        ,
        0.        ,  0.        ]), 'sampling_rate': 16000}, 'label': 3}


Flattening the indices:   0%|          | 0/1861 [00:00<?, ? examples/s]

Unique Values in Label/Class:  [0, 1, 2, 3, 4, 5]


#### Display Some Examples with Ability to Listen to Them

In [14]:
# Alias IPython's Audio so it does not shadow datasets.Audio imported earlier.
from IPython.display import Audio as IPythonAudio, display

for _ in range(5):
    rand_idx = random.randint(0, len(ds["train"])-1)
    example = ds["train"][rand_idx]
    audio = example["audio"]
    
    print(f'Label: {id2label[str(example["label"])]}')
    print(f'Shape: {audio["array"].shape}, sampling rate: {audio["sampling_rate"]}')
    display(IPythonAudio(audio["array"], rate=audio["sampling_rate"]))
    print()

Label: FEAR
Shape: (48048,), sampling rate: 16000



Label: NEUTRAL
Shape: (37904,), sampling rate: 16000



Label: NEUTRAL
Shape: (41108,), sampling rate: 16000



Label: NEUTRAL
Shape: (58725,), sampling rate: 16000



Label: HAPPY
Shape: (31498,), sampling rate: 16000





#### Basic Values/Constants

In [15]:
MODEL_CKPT = "facebook/wav2vec2-base"
MODEL_NAME = MODEL_CKPT.split("/")[-1] + "-Speech_Emotion_Recognition"

NUM_OF_EPOCHS = 10
LEARNING_RATE = 3e-5

BATCH_SIZE = 32
STRATEGY = "epoch"

#### Set Sample Rate

In [16]:
sampling_rate = ds["train"].features["audio"].sampling_rate
sampling_rate

16000

#### Instantiate Instance of Feature Extractor

In [17]:
feature_extractor = AutoFeatureExtractor.from_pretrained(MODEL_CKPT)

Downloading (…)rocessor_config.json:   0%|          | 0.00/159 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/1.84k [00:00<?, ?B/s]



#### Define Function to Preprocess Data

In [19]:
def preprocess_function(examples):
    '''
    This function prepares the dataset for the transformer
    by applying the feature extractor to it (among other 
    processes).
    '''
    max_duration = 5.0 # seconds
    audio_arrays = [x["array"] for x in examples["audio"]]
    inputs = feature_extractor(audio_arrays, 
                               sampling_rate=feature_extractor.sampling_rate, 
                               max_length=int(feature_extractor.sampling_rate * max_duration),
                               truncation=True)
    return inputs

encoded_audio = ds.map(preprocess_function, remove_columns="audio", batched=True)

Map:   0%|          | 0/5581 [00:00<?, ? examples/s]

  tensor = as_tensor(value)


Map:   0%|          | 0/1861 [00:00<?, ? examples/s]
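
To make the effect of `max_length` and `truncation=True` concrete, here is a plain-Python sketch of the arithmetic. It assumes truncation simply keeps the first `max_length` samples and that no padding is applied (none is requested in `preprocess_function`). The helper `truncate` is hypothetical and only illustrates the idea; it is not the feature extractor's implementation.

```python
SAMPLING_RATE = 16_000   # Hz, as reported for this dataset
MAX_DURATION = 5.0       # seconds, as in preprocess_function
MAX_LENGTH = int(SAMPLING_RATE * MAX_DURATION)  # maximum samples kept per clip


def truncate(samples, max_length=MAX_LENGTH):
    """Keep at most max_length samples from the start of a clip (illustrative)."""
    return samples[:max_length]


# Clip lengths taken from the examples displayed earlier in the notebook.
for num_samples in (48048, 37904, 41108, 58725, 31498):
    kept = len(truncate([0.0] * num_samples))
    print(f"{num_samples} samples = {num_samples / SAMPLING_RATE:.2f} s, kept {kept}")
```

All five displayed clips are shorter than 5 seconds, so under this assumption they would pass through unchanged; only longer recordings would be cut.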

#### Define Metrics Evaluation Function 

In [20]:
def compute_metrics(p):
    '''
    This function calculates & returns the following metrics:
    - accuracy
    - f1 score (weighted, micro, macro)
    - recall (weighted, micro, macro)
    - precision (weighted, micro, macro)
    '''
    # Convert the logits to predicted class ids once and reuse them.
    predictions = np.argmax(p.predictions, axis=1)
    references = p.label_ids
    
    accuracy_metric = evaluate.load("accuracy")
    results = {"accuracy" : accuracy_metric.compute(predictions=predictions, 
                                                    references=references)["accuracy"]}
    
    # F1, recall and precision are each reported with three averaging modes.
    # Keys come out as "Weighted F1", "Micro F1", "Macro F1", "Weighted Recall", ...
    for metric_name in ("f1", "recall", "precision"):
        metric = evaluate.load(metric_name)
        for average in ("weighted", "micro", "macro"):
            score = metric.compute(predictions=predictions, 
                                   references=references, 
                                   average=average)[metric_name]
            results[f"{average.capitalize()} {metric_name.capitalize()}"] = score
    
    return results
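
The three averaging modes differ only in how per-class scores are combined. The following pure-Python sketch (not the `evaluate` implementation; the function names and the toy labels are made up for illustration) computes macro, weighted, and micro F1 by hand. It also shows why, for single-label multiclass data, micro F1 coincides with accuracy.

```python
from collections import Counter


def f1_from_counts(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN), defined as 0.0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def averaged_f1(references, predictions):
    """Return macro, weighted, and micro F1 for single-label predictions."""
    labels = sorted(set(references) | set(predictions))
    pairs = list(zip(references, predictions))
    counts = {}
    for label in labels:
        tp = sum(1 for r, p in pairs if r == label and p == label)
        fp = sum(1 for r, p in pairs if r != label and p == label)
        fn = sum(1 for r, p in pairs if r == label and p != label)
        counts[label] = (tp, fp, fn)

    per_class = {label: f1_from_counts(*counts[label]) for label in labels}
    support = Counter(references)  # number of true examples per class

    macro = sum(per_class.values()) / len(labels)  # unweighted mean over classes
    weighted = sum(per_class[l] * support[l] for l in labels) / len(references)
    # Micro: pool TP/FP/FN over all classes, then compute one F1.
    micro = f1_from_counts(*(sum(c[i] for c in counts.values()) for i in range(3)))
    return {"macro": macro, "weighted": weighted, "micro": micro}


# Toy data, invented for this example.
references = [0, 0, 0, 0, 1, 1, 2]
predictions = [0, 0, 0, 1, 1, 2, 2]

scores = averaged_f1(references, predictions)
accuracy = sum(r == p for r, p in zip(references, predictions)) / len(references)
print(scores, accuracy)
```

Every misclassified example counts once as a false positive (for the predicted class) and once as a false negative (for the true class), so the pooled counts reduce micro F1 to the fraction of correct predictions. The same reasoning applies to micro precision, micro recall, and weighted recall, which is consistent with those columns matching accuracy in the training results.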

#### Instantiate Model

In [21]:
model = AutoModelForAudioClassification.from_pretrained(MODEL_CKPT, 
                                                        num_labels=NUM_OF_LABELS, 
                                                        label2id=label2id,
                                                        id2label= id2label)



Downloading pytorch_model.bin:   0%|          | 0.00/380M [00:00<?, ?B/s]

Some weights of the model checkpoint at facebook/wav2vec2-base were not used when initializing Wav2Vec2ForSequenceClassification: ['project_hid.bias', 'project_q.weight', 'quantizer.codevectors', 'quantizer.weight_proj.bias', 'project_hid.weight', 'project_q.bias', 'quantizer.weight_proj.weight']
- This IS expected if you are initializing Wav2Vec2ForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Wav2Vec2ForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of Wav2Vec2ForSequenceClassification were not initialized from the model checkpoint at facebook/wav2vec2-base and are newly initialized: ['projector.weight', 'classifier.bias', 'projector.

#### Define Training Arguments

In [25]:
args = TrainingArguments(
    output_dir=MODEL_NAME,
    evaluation_strategy=STRATEGY,
    num_train_epochs=NUM_OF_EPOCHS,
    save_strategy=STRATEGY,
    logging_strategy="steps",
    learning_rate=LEARNING_RATE,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    warmup_ratio=0.10,
    gradient_accumulation_steps=4,
    logging_first_step=True,
    report_to="tensorboard",
    hub_private_repo=True,
    push_to_hub=True
)

PyTorch: setting up devices


#### Define Trainer

In [26]:
trainer = Trainer(
    model = model,
    args = args,
    train_dataset = encoded_audio["train"],
    eval_dataset = encoded_audio["eval"],
    tokenizer = feature_extractor,
    compute_metrics = compute_metrics,
)

/content/wav2vec2-base-Speech_Emotion_Recognition is already a clone of https://huggingface.co/DunnBC22/wav2vec2-base-Speech_Emotion_Recognition. Make sure you pull the latest changes with `repo.git_pull()`.


#### Train Model

In [27]:
trainer.train()

***** Running training *****
  Num examples = 5581
  Num Epochs = 10
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 128
  Gradient Accumulation steps = 4
  Total optimization steps = 430
  Number of trainable parameters = 94570118


| Epoch | Training Loss | Validation Loss | Accuracy | Weighted f1 | Micro f1 | Macro f1 | Weighted recall | Micro recall | Macro recall | Weighted precision | Micro precision | Macro precision |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.5581 | 1.404623 | 0.465341 | 0.40798 | 0.465341 | 0.417374 | 0.465341 | 0.465341 | 0.479324 | 0.500842 | 0.465341 | 0.497393 |
| 1 | 1.5581 | 1.156626 | 0.599678 | 0.583554 | 0.599678 | 0.587099 | 0.599678 | 0.599678 | 0.609252 | 0.624811 | 0.599678 | 0.620936 |
| 2 | 1.5581 | 0.973286 | 0.68834 | 0.684489 | 0.68834 | 0.686 | 0.68834 | 0.68834 | 0.692289 | 0.701222 | 0.68834 | 0.700913 |
| 3 | 1.5581 | 0.831314 | 0.739925 | 0.739194 | 0.739925 | 0.74088 | 0.739925 | 0.739925 | 0.741687 | 0.741531 | 0.739925 | 0.743205 |
| 4 | 1.5581 | 0.870764 | 0.702848 | 0.696289 | 0.702848 | 0.696955 | 0.702848 | 0.702848 | 0.708062 | 0.714767 | 0.702848 | 0.711441 |
| 5 | 1.5581 | 0.796883 | 0.729715 | 0.72668 | 0.729715 | 0.727662 | 0.729715 | 0.729715 | 0.73326 | 0.739255 | 0.729715 | 0.738241 |
| 6 | 1.5581 | 0.7349 | 0.760344 | 0.761326 | 0.760344 | 0.763065 | 0.760344 | 0.760344 | 0.76354 | 0.769933 | 0.760344 | 0.770248 |
| 7 | 1.5581 | 0.771417 | 0.74691 | 0.744359 | 0.74691 | 0.745629 | 0.74691 | 0.74691 | 0.748532 | 0.755385 | 0.74691 | 0.756332 |
| 8 | 1.5581 | 0.718311 | 0.763031 | 0.761521 | 0.763031 | 0.763143 | 0.763031 | 0.763031 | 0.765242 | 0.762622 | 0.763031 | 0.763747 |
| 9 | 1.5581 | 0.72636 | 0.753896 | 0.751427 | 0.753896 | 0.752871 | 0.753896 | 0.753896 | 0.757655 | 0.756496 | 0.753896 | 0.755779 |


***** Running Evaluation *****
  Num examples = 1861
  Batch size = 32
Saving model checkpoint to wav2vec2-base-Speech_Emotion_Recognition/checkpoint-43
Configuration saved in wav2vec2-base-Speech_Emotion_Recognition/checkpoint-43/config.json
Model weights saved in wav2vec2-base-Speech_Emotion_Recognition/checkpoint-43/pytorch_model.bin
Feature extractor saved in wav2vec2-base-Speech_Emotion_Recognition/checkpoint-43/preprocessor_config.json
Feature extractor saved in wav2vec2-base-Speech_Emotion_Recognition/preprocessor_config.json
***** Running Evaluation *****
  Num examples = 1861
  Batch size = 32
Saving model checkpoint to wav2vec2-base-Speech_Emotion_Recognition/checkpoint-86
Configuration saved in wav2vec2-base-Speech_Emotion_Recognition/checkpoint-86/config.json
Model weights saved in wav2vec2-base-Speech_Emotion_Recognition/checkpoint-86/pytorch_model.bin
Feature extractor saved in wav2vec2-base-Speech_Emotion_Recognition/checkpoint-86/preprocessor_config.json
Feature extract

TrainOutput(global_step=430, training_loss=0.8491150121356166, metrics={'train_runtime': 5889.5451, 'train_samples_per_second': 9.476, 'train_steps_per_second': 0.073, 'total_flos': 1.939146103364604e+18, 'train_loss': 0.8491150121356166, 'epoch': 9.98})
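
The step counts in the training log (total train batch size 128, 430 optimization steps, checkpoints every 43 steps) follow from the batch size, gradient accumulation, and epoch settings. A minimal sketch of that arithmetic, assuming a single device and that the last partial batch of each epoch is kept:

```python
import math

num_examples = 5581               # training rows
per_device_batch_size = 32        # BATCH_SIZE
gradient_accumulation_steps = 4
num_epochs = 10                   # NUM_OF_EPOCHS

# Samples contributing to one optimizer update (single device assumed).
effective_batch_size = per_device_batch_size * gradient_accumulation_steps

# Mini-batches per epoch, keeping the final partial batch.
batches_per_epoch = math.ceil(num_examples / per_device_batch_size)

# One optimizer update per full group of accumulated mini-batches.
updates_per_epoch = max(batches_per_epoch // gradient_accumulation_steps, 1)
total_steps = updates_per_epoch * num_epochs

print(effective_batch_size, batches_per_epoch, updates_per_epoch, total_steps)
```

A likely side effect, stated as an assumption about this version's default of `logging_steps=500`: because the run has only 430 steps, the only training loss logged is the one from the first step (`logging_first_step=True`), which would explain why the Training Loss column shows 1.5581 for every epoch. Setting `logging_steps` to a smaller value should give a per-epoch training loss.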

#### Evaluate Model

In [28]:
trainer.evaluate()

***** Running Evaluation *****
  Num examples = 1861
  Batch size = 32


{'eval_loss': 0.7263597249984741,
 'eval_accuracy': 0.753895754970446,
 'eval_Weighted F1': 0.7514267219693074,
 'eval_Micro F1': 0.7538957549704459,
 'eval_Macro F1': 0.7528711386225364,
 'eval_Weighted Recall': 0.753895754970446,
 'eval_Micro Recall': 0.753895754970446,
 'eval_Macro Recall': 0.757654852200615,
 'eval_Weighted Precision': 0.7564957587826933,
 'eval_Micro Precision': 0.753895754970446,
 'eval_Macro Precision': 0.7557788556996212,
 'eval_runtime': 70.3313,
 'eval_samples_per_second': 26.46,
 'eval_steps_per_second': 0.839,
 'epoch': 9.98}

#### Push Model to Hub (My Profile!!!)

In [29]:
trainer.push_to_hub()

Saving model checkpoint to wav2vec2-base-Speech_Emotion_Recognition
Configuration saved in wav2vec2-base-Speech_Emotion_Recognition/config.json
Model weights saved in wav2vec2-base-Speech_Emotion_Recognition/pytorch_model.bin
Feature extractor saved in wav2vec2-base-Speech_Emotion_Recognition/preprocessor_config.json


Upload file runs/Apr17_21-10-22_d8373414225f/events.out.tfevents.1681765829.d8373414225f.1363.2: 100%|########…

To https://huggingface.co/DunnBC22/wav2vec2-base-Speech_Emotion_Recognition
   4407121..19e53f4  main -> main




Upload file runs/Apr17_21-10-22_d8373414225f/events.out.tfevents.1681771789.d8373414225f.1363.4: 100%|########…

Dropping the following result as it does not have all the necessary fields:
{'dataset': {'name': 'audiofolder', 'type': 'audiofolder', 'config': 'data', 'split': 'train', 'args': 'data'}, 'metrics': [{'name': 'Accuracy', 'type': 'accuracy', 'value': 0.753895754970446}]}
To https://huggingface.co/DunnBC22/wav2vec2-base-Speech_Emotion_Recognition
   19e53f4..ef11a50  main -> main




'https://huggingface.co/DunnBC22/wav2vec2-base-Speech_Emotion_Recognition/commit/19e53f4831935247d0060df121ac13882a8ddfe3'

### Notes & Other Takeaways From This Project
****
- Results:
    - Accuracy: 0.753895754970446
    - Weighted F1: 0.7514267219693074
    - Weighted Recall: 0.753895754970446
    - Weighted Precision: 0.7564957587826933
****